# Handling Missing Data with Pandas
Missing values are a common problem in data analysis, and pandas provides several methods for detecting and handling them.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [19]:
df.head()

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      float64
 1   B       1 non-null      float64
 2   C       3 non-null      int64  
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes
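The non-null counts from df.info() can also be obtained directly. The following is a minimal sketch, assuming the same example dataframe, that uses .isna() to count the missing values per column and to select the rows that contain any. The variable names (mask, missing_per_column, rows_with_missing) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same example dataframe as in the cells above
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Boolean mask: True wherever a value is missing
mask = df.isna()

# Number of missing values in each column
missing_per_column = mask.sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[mask.any(axis=1)]
print(rows_with_missing)
```

Here column A has one missing value, column B has two, and column C has none, which matches the non-null counts reported by df.info().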


The .dropna() method drops every row of our dataset that contains at least one missing value (NaN). It returns a new dataframe and leaves the original unchanged.

In [4]:
#dropna()
df.dropna()

     A    B  C
0  1.0  5.0  1


In [5]:
df

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


In [6]:
#For the change to update the dataframe itself, we must add inplace=True
df.dropna(inplace=True)

In [7]:
df

     A    B  C
0  1.0  5.0  1
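An alternative to inplace=True is to assign the result of .dropna() to a variable. The sketch below assumes the same example dataframe; the name cleaned is hypothetical. It keeps the original data available and also renumbers the remaining rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# dropna() returns a new dataframe; df itself is not modified
cleaned = df.dropna()

# Optionally renumber the surviving rows starting from 0
cleaned = cleaned.reset_index(drop=True)

print(len(df))       # the original still has all of its rows
print(cleaned)
```

Assigning the result makes it easy to compare the data before and after the rows were dropped.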


**Other .dropna() parameters**

In [8]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [13]:
df

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


The axis parameter controls whether rows or columns are dropped. The default is rows (axis=0). To drop every column that contains a missing value, pass axis=1.

In [10]:
df.dropna(axis=1)

   C
0  1
1  2
2  3
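Two more .dropna() parameters, how and subset, narrow down what gets dropped. The following is a small sketch on the same example dataframe, not an exhaustive tour: how='all' drops a row (or column) only if every value in it is missing, and subset limits the check to the listed columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Drop a row only if ALL of its values are missing.
# No row qualifies here, because column C is always filled.
all_missing_dropped = df.dropna(how='all')

# Only look at column A when deciding which rows to drop.
# The third row is dropped; the missing value in B is ignored.
a_required = df.dropna(subset=['A'])

print(all_missing_dropped)
print(a_required)
```

Using subset is handy when only a few columns are required for the analysis and missing values elsewhere can be tolerated.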


The thresh parameter sets the minimum number of non-missing values a row must have in order to be kept. Rows with fewer non-missing values are dropped.

In [12]:
df.dropna(thresh=2)

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2


In the example above, only the third row is dropped from the dataframe, since it is the only row with fewer than 2 non-missing values (its only non-missing value is in column C).
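One way to check how thresh behaves is to count the non-missing values in each row by hand and compare the result with df.dropna(thresh=2). This is a minimal sketch on the same example dataframe; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Count the non-missing values in each row
non_missing = df.notna().sum(axis=1)
print(non_missing)

# Keep the rows that have at least 2 non-missing values
kept_manually = df[non_missing >= 2]

# The same selection using thresh
kept_with_thresh = df.dropna(thresh=2)

print(kept_manually.equals(kept_with_thresh))
```

The rows have 3, 2, and 1 non-missing values respectively, so only the last one falls below the threshold of 2.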

**.fillna()**

The .fillna() method does what its name suggests: it fills missing (NA) values with a specified value.

In [14]:
df

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


In [15]:
df.fillna('Fill Value')

            A           B  C
0         1.0         5.0  1
1         2.0  Fill Value  2
2  Fill Value  Fill Value  3
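Filling numeric columns with a string has a side effect worth knowing about: the affected columns can no longer be stored as floats. The sketch below, which assumes the same example dataframe, compares the column dtypes after filling with a string and after filling with a number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Filling with a string turns the float columns into object columns
filled_text = df.fillna('Fill Value')
print(filled_text.dtypes)

# Filling with a number keeps the float columns numeric
filled_zero = df.fillna(0)
print(filled_zero.dtypes)
```

If the columns are needed for calculations later on, a numeric fill value is usually the safer choice.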


In [16]:
df['A']

0    1.0
1    2.0
2    NaN
Name: A, dtype: float64

In [17]:
df['A'].mean()

1.5

In [18]:
df['A'].fillna(df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
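The same idea can be applied to the whole dataframe at once. The following is a minimal sketch, assuming all columns are numeric as in the example: passing df.mean() to .fillna() fills each column with its own mean. Like .dropna(), .fillna() returns a new object, so the result has to be assigned to keep it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# df.mean() is a Series of column means; fillna matches it by column name
filled = df.fillna(df.mean())
print(filled)

# To update a single column, assign the filled result back to it
df['A'] = df['A'].fillna(df['A'].mean())
print(df['A'])
```

Column B has only one known value, so both of its missing entries are filled with that value, which illustrates that a mean computed from very few observations may not be a meaningful replacement.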

https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/