In Pandas, missing data arises when values were never recorded or were not collected properly. Pandas represents such values in two ways:

* None: A Python object used to represent missing values in object-type arrays.
* NaN: "Not a Number", a special floating-point value available in NumPy as np.nan and recognized by all systems that use the IEEE floating-point standard.

In this article we will see how to detect, fill and drop missing values in a DataFrame so that the data stays clean and ready for analysis.
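
To make the two markers concrete, here is a minimal sketch (the Series names `scores` and `names` are illustrative, not from a real dataset). It assumes default Pandas behavior, where None in a numeric list is converted to NaN, while an object column can hold both markers side by side.

```python
import numpy as np
import pandas as pd

# None inside a numeric list is converted to NaN, so the Series becomes float64
scores = pd.Series([1, None, 3])

# An object-dtype Series can hold both None and NaN as missing markers
names = pd.Series(["Amit", None, np.nan], dtype=object)

print(scores.dtype)            # float64
print(scores.isna().tolist())  # [False, True, False]
print(names.isna().tolist())   # [False, True, True]
```

Both markers are reported as missing by the detection functions described in the next section.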

## Checking Missing Values in Pandas
Pandas provides the functions isnull(), isna() and notnull() to check whether each value in a DataFrame or Series is missing. They make data cleaning and preprocessing easier and are described below:

### 1. Using isnull()
isnull() returns a DataFrame of Boolean values in which True marks missing data (None or NaN). The result can be used as a mask to find, filter or fill the missing values in a dataset.

Example 1: Finding Missing Values in a DataFrame

We will use the NumPy and Pandas libraries for this implementation.

In [1]:
import pandas as pd
import numpy as np

d = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)

mv = df.isnull()

print(mv)

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
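
A Boolean mask like the one above is usually summarized rather than read cell by cell. The following sketch, which rebuilds the same DataFrame so it runs on its own, counts missing values per column and selects the rows that contain at least one of them. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# True counts as 1 when summed, so this gives the missing count per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)  # 3

# Rows in which at least one column is missing
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())  # [0, 2, 3]
```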


#### Example 2: Filtering Data Based on Missing Values

Here we use a sample employee dataset (employees.csv). The isnull() function is applied to the "Gender" column to build a Boolean mask, which is then used to select the rows where the gender is missing.

In [3]:
import pandas as pd
d = pd.read_csv("employees.csv")

bool_series = pd.isnull(d["Gender"])
missing_gender_data = d[bool_series]
missing_gender_data

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
20,Lois,,4/22/1995,7:18 PM,64714,4.934,True,Legal
22,Joshua,,3/8/2012,1:58 AM,90816,18.816,True,Client Services
27,Scott,,7/11/1991,6:58 PM,122367,5.218,False,Legal
31,Joyce,,2/20/2005,2:40 PM,88657,12.752,False,Product
41,Christine,,6/28/2015,1:08 AM,66582,11.308,True,Business Development
...,...,...,...,...,...,...,...,...
961,Antonio,,6/18/1989,9:37 PM,103050,3.050,False,Legal
972,Victor,,7/28/2006,2:49 PM,76381,11.159,True,Sales
985,Stephen,,7/10/1983,8:10 PM,85668,1.909,False,Legal
989,Justin,,2/10/1991,4:58 PM,38344,3.794,False,Legal


### 2. Using isna()
isna() returns a DataFrame of Boolean values where True indicates missing data (None or NaN). isnull() is an alias of isna(), so the two functions behave identically.

Example: Finding Missing Values in a DataFrame

In [5]:
data = {'Name': ['Amit', 'Sita', np.nan, 'Raj'],
        'Age': [25, np.nan, 22, 28]}

df = pd.DataFrame(data)
print(df)
# Check for missing values using isna()
print(df.isna())

   Name   Age
0  Amit  25.0
1  Sita   NaN
2   NaN  22.0
3   Raj  28.0
    Name    Age
0  False  False
1  False   True
2   True  False
3  False  False
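
As a quick check that isna() and isnull() give the same answer, the sketch below compares their results on the same DataFrame and also calls the top-level pd.isna() on single values. It is only a small illustration, not a full test of either function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Amit', 'Sita', np.nan, 'Raj'],
                   'Age': [25, np.nan, 22, 28]})

# Compare the two Boolean masks element by element
same = df.isna().equals(df.isnull())
print(same)  # True

# pd.isna() also works on scalars
print(pd.isna(np.nan), pd.isna(None), pd.isna("text"))  # True True False
```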


### 3. Checking for Non-Missing Values Using notnull()
The notnull() function returns a DataFrame of Boolean values where True indicates non-missing (valid) data, which makes it the inverse of isnull(). It is useful when we want to focus only on the rows that have valid values.

Example 1: Identifying Non-Missing Values in a DataFrame

In [7]:
d = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df)
nmv = df.notnull()

print(nmv)

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
   First Score  Second Score  Third Score
0         True          True        False
1         True          True         True
2        False          True         True
3         True         False         True
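
The notnull() mask can be summed to count valid entries, or used to keep only the rows where a particular column is present. The sketch below rebuilds the same DataFrame and shows both uses. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# Number of valid (non-missing) entries in each column
valid_counts = df.notnull().sum()
print(valid_counts)

# DataFrame.count() reports the same per-column totals
print(valid_counts.equals(df.count()))

# Keep only the rows whose 'First Score' is present
valid_first = df[df['First Score'].notnull()]
print(valid_first.index.tolist())  # [0, 1, 3]
```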


#### Example 2: Filtering Data with Non-Missing Values

The notnull() function is applied to the "Gender" column to filter and print the rows whose gender value is present (not missing).

In [8]:
d = pd.read_csv("employees.csv")

nmg = pd.notnull(d["Gender"])

nmgd = d[nmg]

display(nmgd)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
994,George,Male,6/21/2013,5:47 PM,98874,4.479,True,Marketing
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


## Filling Missing Values in Pandas
The following functions let us replace missing values with a specified value or estimate them with interpolation methods.

### 1. Using fillna()
fillna() is used to replace missing values (NaN) with a given value. Let's look at several examples.

Example 1: Fill Missing Values with Zero

In [14]:
import pandas as pd
import numpy as np

d = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)

# Replace every NaN with 0
df.fillna(0)

In [12]:
df.ffill()  # forward fill; fillna(method='pad') is deprecated in recent pandas versions


   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         90.0          56.0         80.0
3         95.0          56.0         98.0


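Forward fill (historically called the pad method) replaces each missing value with the previous valid value in the same column. The NaN in the first row of Third Score remains because no earlier value exists.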

In [15]:
df.bfill()  # backward fill; fillna(method='bfill') is deprecated in recent pandas versions


   First Score  Second Score  Third Score
0        100.0          30.0         40.0
1         90.0          45.0         40.0
2         95.0          56.0         80.0
3         95.0           NaN         98.0


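Backward fill (bfill) replaces each missing value with the next valid value in the same column. The NaN in the last row of Second Score remains because no later value exists.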
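
Besides a constant or a neighboring value, fillna() also accepts a Series of per-column fill values. The sketch below fills each column with its own mean. This is one common choice shown as an illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# mean() skips NaN by default, giving one mean per column
means = df.mean()

# Passing a Series fills each column with the value under its own label
filled = df.fillna(means)

print(filled)
print(filled.loc[2, 'First Score'])  # 95.0, the mean of 100, 90 and 95
```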

In [16]:
import pandas as pd
import numpy as np
d = pd.read_csv("employees.csv")

d[10:25]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,8/12/1980,9:01 AM,63241,15.132,True,
11,Julie,Female,10/26/1997,3:19 PM,102508,12.637,True,Legal
12,Brandon,Male,12/1/1980,1:08 AM,112807,17.492,True,Human Resources
13,Gary,Male,1/27/2008,11:40 PM,109831,5.831,False,Sales
14,Kimberly,Female,1/14/1999,7:13 AM,41426,14.543,True,Finance
15,Lillian,Female,6/5/2016,6:09 AM,59414,1.256,False,Product
16,Jeremy,Male,9/21/2010,5:56 AM,90370,7.369,False,Human Resources
17,Shawn,Male,12/7/1986,7:45 PM,111737,6.414,False,Product
18,Diana,Female,10/23/1981,10:27 AM,132940,19.082,False,Client Services
19,Donna,Female,7/22/2010,3:48 AM,81014,1.894,False,Product


In [17]:
# Assign the result back to the column; calling fillna(inplace=True) on a
# selected column may act on a copy and leave d unchanged
d["Gender"] = d["Gender"].fillna('No Gender')
d[10:25]


Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,8/12/1980,9:01 AM,63241,15.132,True,
11,Julie,Female,10/26/1997,3:19 PM,102508,12.637,True,Legal
12,Brandon,Male,12/1/1980,1:08 AM,112807,17.492,True,Human Resources
13,Gary,Male,1/27/2008,11:40 PM,109831,5.831,False,Sales
14,Kimberly,Female,1/14/1999,7:13 AM,41426,14.543,True,Finance
15,Lillian,Female,6/5/2016,6:09 AM,59414,1.256,False,Product
16,Jeremy,Male,9/21/2010,5:56 AM,90370,7.369,False,Human Resources
17,Shawn,Male,12/7/1986,7:45 PM,111737,6.414,False,Product
18,Diana,Female,10/23/1981,10:27 AM,132940,19.082,False,Client Services
19,Donna,Female,7/22/2010,3:48 AM,81014,1.894,False,Product
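
When several columns need different fill values, fillna() can be called on the whole DataFrame with a dictionary that maps column names to values. The sketch below uses a three-row stand-in for employees.csv, and the fill value 'Unassigned' is a hypothetical label chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for employees.csv with one missing Gender and one missing Team
employees = pd.DataFrame({"First Name": ["Lois", "Julie", "Thomas"],
                          "Gender": [np.nan, "Female", "Male"],
                          "Team": ["Legal", "Legal", np.nan]})

# Each key is a column name and each value is that column's fill value
filled = employees.fillna({"Gender": "No Gender", "Team": "Unassigned"})

print(filled)
```

Because the result is a new DataFrame, the original `employees` is left unchanged and no inplace warning is raised.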


### 2. Using replace()
The replace() function can also be used to replace NaN values with a specific value. In the example below, every NaN in the DataFrame is replaced with -99.

In [18]:
import pandas as pd
import numpy as np

data = pd.read_csv("employees.csv")
data[10:25]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,8/12/1980,9:01 AM,63241,15.132,True,
11,Julie,Female,10/26/1997,3:19 PM,102508,12.637,True,Legal
12,Brandon,Male,12/1/1980,1:08 AM,112807,17.492,True,Human Resources
13,Gary,Male,1/27/2008,11:40 PM,109831,5.831,False,Sales
14,Kimberly,Female,1/14/1999,7:13 AM,41426,14.543,True,Finance
15,Lillian,Female,6/5/2016,6:09 AM,59414,1.256,False,Product
16,Jeremy,Male,9/21/2010,5:56 AM,90370,7.369,False,Human Resources
17,Shawn,Male,12/7/1986,7:45 PM,111737,6.414,False,Product
18,Diana,Female,10/23/1981,10:27 AM,132940,19.082,False,Client Services
19,Donna,Female,7/22/2010,3:48 AM,81014,1.894,False,Product


In [19]:
data.replace(to_replace=np.nan, value=-99)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,-99
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,-99,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development
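
replace() works in both directions: it can turn NaN into a sentinel such as -99, and it can turn a sentinel back into NaN so that Pandas treats it as missing again. The sketch below demonstrates the round trip on a small numeric DataFrame with made-up column values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [97308.0, np.nan, 130590.0],
                   "Bonus %": [6.945, 4.170, np.nan]})

# Replace every NaN with the sentinel value -99
marked = df.replace(to_replace=np.nan, value=-99)
print(marked)

# Convert the sentinel back to NaN
restored = marked.replace(-99, np.nan)
print(restored.equals(df))  # True
```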


### 3. Using interpolate()
The interpolate() function fills missing values using interpolation techniques such as the linear method.

In [20]:
import pandas as pd
   
df = pd.DataFrame({"A": [12, 4, 5, None, 1], 
                   "B": [None, 2, 54, 3, None], 
                   "C": [20, 16, None, 3, 8], 
                   "D": [14, 3, None, None, 6]})  
print(df)

      A     B     C     D
0  12.0   NaN  20.0  14.0
1   4.0   2.0  16.0   3.0
2   5.0  54.0   NaN   NaN
3   NaN   3.0   3.0   NaN
4   1.0   NaN   8.0   6.0


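Let’s interpolate the missing values using the linear method. This method ignores the index and treats the values as equally spaced. With limit_direction='forward', a leading NaN such as the one in column B is left unfilled because there is no earlier value to interpolate from.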

In [21]:
df.interpolate(method='linear', limit_direction='forward')

      A     B     C     D
0  12.0   NaN  20.0  14.0
1   4.0   2.0  16.0   3.0
2   5.0  54.0   9.5   4.0
3   3.0   3.0   3.0   5.0
4   1.0   3.0   8.0   6.0
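
The effect of limit_direction is easiest to see on a single column. The sketch below takes columns B and C from the example as standalone Series and compares 'forward' with 'both'. With the linear method, values outside the known range are filled with the nearest valid value rather than extrapolated.

```python
import numpy as np
import pandas as pd

b = pd.Series([np.nan, 2.0, 54.0, 3.0, np.nan])

# 'forward' fills gaps after a valid value, so the leading NaN stays
forward = b.interpolate(method='linear', limit_direction='forward')
print(forward.tolist())

# 'both' also fills the leading NaN, using the first valid value
both = b.interpolate(method='linear', limit_direction='both')
print(both.tolist())  # [2.0, 2.0, 54.0, 3.0, 3.0]

# An interior gap is the midpoint of its neighbors: (16 + 3) / 2 = 9.5
c = pd.Series([20.0, 16.0, np.nan, 3.0, 8.0])
c_filled = c.interpolate(method='linear')
print(c_filled.iloc[2])  # 9.5
```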


## Dropping Missing Values in Pandas
The dropna() function removes rows or columns that contain NaN values. Its parameters control which axis is dropped and under what conditions.

### 1. Dropping Rows with At Least One Null Value
Remove rows that contain at least one missing value.

In [22]:
# Named scores rather than dict to avoid shadowing the built-in dict type
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(scores)

df.dropna()

   First Score  Second Score  Third Score  Fourth Score
3         95.0          56.0           98          65.0
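
Dropping every row with any NaN can discard a lot of data, as the single surviving row above shows. A gentler option is the subset parameter, which only checks the listed columns. The sketch below reuses the same scores. The choice of columns to check is purely illustrative.

```python
import numpy as np
import pandas as pd

scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(scores)

# Drop a row only if 'First Score' or 'Second Score' is missing;
# NaN values in the other columns are ignored
kept = df.dropna(subset=['First Score', 'Second Score'])

print(kept.index.tolist())  # [0, 3]
```

Row 0 is kept even though its Fourth Score is NaN, because that column is not part of the subset.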


### 2. Dropping Rows with All Null Values
We can drop rows where all values are missing using dropna(how='all').

In [23]:
# Named scores rather than dict to avoid shadowing the built-in dict type
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(scores)
print(df)
df.dropna(how='all')

   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
1          NaN           NaN          NaN           NaN
2          NaN          45.0         80.0           NaN
3         95.0          56.0         98.0          65.0


   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
2          NaN          45.0         80.0           NaN
3         95.0          56.0         98.0          65.0
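
Between how='any' and how='all' there is a middle ground: the thresh parameter keeps a row only if it has at least that many non-missing values. The sketch below applies thresh=3 to the same data. The threshold itself is an arbitrary value picked for the example.

```python
import numpy as np
import pandas as pd

scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(scores)

# Non-missing values per row: 3, 0, 2 and 4
print(df.notnull().sum(axis=1).tolist())

# Keep rows that have at least 3 non-missing values
kept = df.dropna(thresh=3)
print(kept.index.tolist())  # [0, 3]
```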


### 3. Dropping Columns with At Least One Null Value
To remove columns that contain at least one missing value we use dropna(axis=1).

In [25]:
# Named scores rather than dict to avoid shadowing the built-in dict type
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}
df = pd.DataFrame(scores)
print(df)
df.dropna(axis=1)

   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0            60
1          NaN           NaN          NaN            67
2          NaN          45.0         80.0            68
3         95.0          56.0         98.0            65


   Fourth Score
0            60
1            67
2            68
3            65


### 4. Dropping Rows with Missing Values in CSV Files
When working with CSV files, we can drop rows with missing values using dropna().

In [26]:
d = pd.read_csv("employees.csv")

nd = d.dropna(axis=0, how='any')

print("Old data frame length:", len(d))
print("New data frame length:", len(nd))
print("Rows with at least one missing value:", (len(d) - len(nd)))

Old data frame length: 1000
New data frame length: 764
Rows with at least one missing value: 236
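
The same count can be cross-checked without dropping anything, by asking isna() which rows contain at least one missing value. Since employees.csv is not bundled here, the sketch below reads a four-row stand-in from an in-memory string. The rows are a small sample in the same layout, not the full file.

```python
import io
import pandas as pd

# In-memory stand-in for employees.csv; empty fields are read as NaN
csv_text = """First Name,Gender,Salary,Team
Douglas,Male,97308,Marketing
Thomas,Male,61933,
Lois,,64714,Legal
Maria,Female,130590,Finance
"""
d = pd.read_csv(io.StringIO(csv_text))

# Drop rows with any missing value, as in the example above
nd = d.dropna(axis=0, how='any')

# Count the affected rows directly from the Boolean mask
rows_with_missing = int(d.isna().any(axis=1).sum())

print(len(d), len(nd), rows_with_missing)  # 4 2 2
```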
