#### Handling Null values in pandas ####

In [1]:
# Import required libraries
import pandas as pd

In [2]:
# Load data
df = pd.read_json('../../datasets/data.json')
df.head()

   Duration  Pulse  Maxpulse  Calories
0      60.0  110.0     130.0     409.1
1      60.0  117.0     145.0     479.0
2      60.0  103.0     135.0     340.0
3      45.0  109.0     175.0     282.4
4      45.0  117.0     148.0     406.0


In [3]:
# Get the number of rows and columns in the data
df.shape

(169, 4)

In [4]:
# Show the non-null count and dtype of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  166 non-null    float64
 1   Pulse     165 non-null    float64
 2   Maxpulse  165 non-null    float64
 3   Calories  162 non-null    float64
dtypes: float64(4)
memory usage: 6.6 KB


In [5]:
# Count the null values in each column
df.isnull().sum()

Duration    3
Pulse       4
Maxpulse    4
Calories    7
dtype: int64
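
Raw counts are easier to judge as a share of the column. A minimal sketch, assuming a small hypothetical frame (`toy`) in place of the workout data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the workout data
toy = pd.DataFrame({
    "Duration": [60.0, np.nan, 45.0, 60.0],
    "Calories": [409.1, 479.0, np.nan, np.nan],
})

# isnull() yields booleans; their column-wise mean is the fraction missing
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

The same expression applied to `df` would give the percentage of missing values per column.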

In [6]:
# Summary statistics for each numeric column
df.describe()

         Duration       Pulse    Maxpulse     Calories
count  166.000000  165.000000  165.000000   162.000000
mean    64.277108  107.448485  133.945455   375.917901
std     42.484454   14.365612   16.556722   268.013841
min     15.000000   80.000000  100.000000    50.300000
25%     45.000000  100.000000  124.000000   250.775000
50%     60.000000  105.000000  131.000000   317.850000
75%     60.000000  111.000000  141.000000   386.700000
max    300.000000  159.000000  184.000000  1860.400000
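
In the summary above, the Calories mean (about 375.9) sits well above its median (317.85), which hints at a right-skewed column. One way to check this before choosing between mean and median imputation is to compare skewness and the mean-median gap. The sketch below uses hypothetical columns (`symmetric`, `right_skewed`) rather than the real data:

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a single large outlier
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 2.0, 3.0, 4.0, 50.0],
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail
skew = toy.skew()

# A large positive gap between mean and median points the same way
gap = toy.mean() - toy.median()

print(skew)
print(gap)
```

A column with skewness near zero is a reasonable candidate for mean imputation; a clearly skewed one is usually safer with the median.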


In [7]:
# Mean imputation (suited to roughly symmetric distributions without strong outliers)
# Assign the result back rather than calling fillna(..., inplace=True) on the column
# selection: that chained form raises a FutureWarning and stops working in pandas 3.0.
df["Duration"] = df["Duration"].fillna(df["Duration"].mean())


In [8]:
df.isnull().sum()

Duration    0
Pulse       4
Maxpulse    4
Calories    7
dtype: int64
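
Calling `fillna(..., inplace=True)` on a column selection is chained assignment, which pandas flags and which stops working in pandas 3.0. A minimal sketch of the two non-chained alternatives, using a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    "Duration": [60.0, np.nan, 30.0],
    "Pulse": [110.0, 117.0, np.nan],
})

# Option 1: fill the column and assign the result back
toy["Duration"] = toy["Duration"].fillna(toy["Duration"].mean())

# Option 2: call fillna on the DataFrame with a {column: value} mapping
toy = toy.fillna({"Pulse": toy["Pulse"].mean()})

print(toy)
```

Both forms operate on the DataFrame itself rather than on an intermediate object, so they behave the same with or without copy-on-write.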

In [9]:
# Median imputation (suited to skewed distributions, since the median resists outliers)
# Assign the result back rather than calling fillna(..., inplace=True) on the column
# selection: that chained form raises a FutureWarning and stops working in pandas 3.0.
df["Maxpulse"] = df["Maxpulse"].fillna(df["Maxpulse"].median())


In [10]:
df.isnull().sum()

Duration    0
Pulse       4
Maxpulse    0
Calories    7
dtype: int64
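
When several numeric columns should get the same treatment, `fillna` also accepts a Series indexed by column name, so each column is filled with its own median in one call. A sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; Calories contains one large outlier
toy = pd.DataFrame({
    "Maxpulse": [130.0, np.nan, 175.0, 148.0],
    "Calories": [409.1, 300.0, np.nan, 1860.4],
})

# One median per numeric column, indexed by column name
medians = toy.median(numeric_only=True)

# fillna matches the Series index against column names
filled = toy.fillna(medians)

print(filled)
```

The outlier in `Calories` does not pull the fill value upward, which is the reason for preferring the median on skewed columns.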

In [11]:
# Mode imputation (suited to categorical data)
# mode() returns a Series because ties are possible, so take its first entry.
# Maxpulse has no nulls left after the median fill above, so this call changes nothing here.
df["Maxpulse"] = df["Maxpulse"].fillna(df["Maxpulse"].mode()[0])
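
Mode imputation is easier to see on an actual categorical column. The sketch below assumes a hypothetical `Intensity` column that is not part of the dataset:

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"Intensity": ["low", "high", None, "low", None]})

# mode() ignores missing values and returns a Series (several values on a tie)
modes = toy["Intensity"].mode()

# Take the first mode as a scalar fill value
most_common = modes.iloc[0]

toy["Intensity"] = toy["Intensity"].fillna(most_common)
print(toy)
```

Passing the Series from `mode()` directly to `fillna` would align on the index and fill at most the rows whose labels happen to match, which is why the scalar is extracted first.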


In [12]:
# Removes any row with at least one NaN
df.dropna(inplace=True)
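
`dropna` can be less aggressive than its default. A short sketch of the `how` and `thresh` parameters on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 complete, row 1 empty, rows 2 and 3 partial
toy = pd.DataFrame({
    "Duration": [60.0, np.nan, 45.0, np.nan],
    "Pulse": [110.0, np.nan, np.nan, 100.0],
    "Calories": [409.1, np.nan, 282.4, np.nan],
})

# Default how="any": keep only rows with no missing values
complete_rows = toy.dropna()

# how="all": drop a row only when every value in it is missing
not_empty_rows = toy.dropna(how="all")

# thresh=2: keep rows that have at least two non-null values
mostly_filled_rows = toy.dropna(thresh=2)

print(len(complete_rows), len(not_empty_rows), len(mostly_filled_rows))
```

Each call returns a new DataFrame, so the variants can be compared before committing to one.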

In [13]:
# Drop an entire column
df.drop(columns=["Pulse"], inplace=True)
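
Dropping a column is usually reserved for columns that are mostly empty. A sketch that selects such columns by missing fraction, where the 0.5 cutoff and the `toy` frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Pulse is 75% missing, Duration 25%
toy = pd.DataFrame({
    "Pulse": [np.nan, np.nan, np.nan, 100.0],
    "Duration": [60.0, np.nan, 45.0, 60.0],
})

# Assumed cutoff: drop columns with more than half their values missing
threshold = 0.5

missing_fraction = toy.isnull().mean()
too_sparse = missing_fraction[missing_fraction > threshold].index

trimmed = toy.drop(columns=too_sparse)
print(trimmed.columns.tolist())
```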

In [14]:
# Drop rows with a null in any of the subset columns
# (no rows are affected here, since dropna() above already removed them)
df.dropna(subset=["Calories"], inplace=True)
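
With `subset`, only the listed columns decide whether a row is dropped; nulls elsewhere are kept. A sketch on a hypothetical `toy` frame that also reports how many rows were removed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks Pulse, row 2 lacks Calories
toy = pd.DataFrame({
    "Pulse": [110.0, np.nan, 103.0],
    "Calories": [409.1, 479.0, np.nan],
})

rows_before = len(toy)

# Only a missing Calories value causes a row to be dropped
cleaned = toy.dropna(subset=["Calories"])

rows_removed = rows_before - len(cleaned)
print(rows_removed)
print(cleaned)
```

Row 1 survives even though its `Pulse` is missing, because `Pulse` is not in the subset.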

In [15]:
df.isnull().sum()

Duration    0
Maxpulse    0
Calories    0
dtype: int64

In [16]:
df.shape

(158, 3)