# Handling Missing Data

There are several ways to handle missing data. The main techniques covered here are:

* Dropping Rows
* Dropping Columns
* Replacing Missing Values
* Interpolation

**Pandas Operations**

In [1]:
# Importing libraries
import numpy as np
import pandas as pd

In [7]:
# NaN means "not a number". It is not zero or any other value; it marks a missing numeric value.
print(np.nan)

# NaT means "not a time". It marks a missing timestamp value.
print(pd.NaT)

# np.nan == np.nan is False: under floating-point rules, NaN never compares equal to anything, including itself
print(np.nan == np.nan)

# np.nan is np.nan is True, because "is" checks object identity and both sides refer to the same object
print(np.nan is np.nan)

nan
NaT
False
True
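
Since equality cannot detect a missing value, a dedicated check is needed. The following is a minimal sketch of how this could be done with `pd.isna` and `np.isnan`; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

value = np.nan

# Equality cannot detect NaN, because NaN never compares equal to anything
equality_check = (value == np.nan)

# pd.isna treats NaN, NaT and None uniformly as missing
nan_check = pd.isna(value)
nat_check = pd.isna(pd.NaT)
none_check = pd.isna(None)

# np.isnan works on floats only; it would raise a TypeError for None
float_check = np.isnan(value)

print(equality_check, nan_check, nat_check, none_check, float_check)
```

In practice `pd.isna` is usually the safer choice in pandas code, because it also covers timestamps and Python's `None`.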


In [9]:
# Importing the dataset
df = pd.read_csv('/kaggle/input/movie-scores/movie_scores.csv')

In [11]:
# We can see that some of the values are missing
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [12]:
# Checking which entries are null
df.isnull()

# Every entry shown as True is missing/null

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False
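
On a larger table, the boolean mask is easier to read once it is summarised. The sketch below counts missing values per column and per row. It rebuilds the table by hand from the output shown above, which is an assumption made so the example runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown above (assumption: the CSV file is not read here)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# True counts as 1 when summed, so this gives the number of missing values
missing_per_column = df.isnull().sum()
missing_per_row = df.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```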


In [13]:
# Checking which entries are not null
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [14]:
# Returning only the rows that have a "pre_movie_score"

df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


**Dropping Rows**

In [16]:
# thresh=1 keeps every row that has at least one non-missing value. Only row 1 had all
# of its values missing, so it was the only row dropped. We can raise the threshold to
# any number, say 3, which then keeps only the rows that have at least 3 non-missing
# values.

df.dropna(thresh=1)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
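
To see the effect of a higher threshold, here is a small sketch using `thresh=5`, next to `how="all"`, which gives the same result as `thresh=1`. The table is rebuilt by hand from the output shown earlier, which is an assumption so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown earlier (assumption: the CSV file is not read here)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Row 2 has only 4 non-missing values, so thresh=5 drops it along with row 1
at_least_five = df.dropna(thresh=5)

# how="all" drops a row only when every value in it is missing
not_fully_empty = df.dropna(how="all")

print(at_least_five.index.tolist())    # [0, 3, 4]
print(not_fully_empty.index.tolist())  # [0, 2, 3, 4]
```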


In [18]:
# dropna works on rows by default (axis=0). The line below drops every row that has
# at least 1 missing value. axis=0 means rows, axis=1 means columns.

df.dropna(axis=0)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [19]:
# The below line of code will drop any row that has a missing value in the "last_name"
# column

df.dropna(subset=['last_name'])

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
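
`subset` also accepts several columns, and it can be combined with `how`. The sketch below compares the default `how="any"` with `how="all"` on the columns `age` and `pre_movie_score`. The table is rebuilt by hand from the output shown earlier, which is an assumption so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown earlier (assumption: the CSV file is not read here)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

columns_to_check = ["age", "pre_movie_score"]

# Default how="any": drop a row if either of the two columns is missing
any_missing = df.dropna(subset=columns_to_check)

# how="all": drop a row only if both columns are missing
all_missing = df.dropna(subset=columns_to_check, how="all")

print(any_missing.index.tolist())  # [0, 3, 4]
print(all_missing.index.tolist())  # [0, 2, 3, 4]
```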


**Dropping Columns**

In [20]:
# To drop columns instead of rows, we just need to set axis=1.

df.dropna(axis=1)

# It returned an empty table (only the index remains), because every column had at
# least one missing value, so all of them were dropped.

0
1
2
3
4
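
Dropping every column with any gap is rarely what we want. A gentler option, sketched below, is to combine `axis=1` with `thresh` or `how="all"`. The table is rebuilt by hand from the output shown earlier, which is an assumption so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown earlier (assumption: the CSV file is not read here)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Keep only the columns that have at least 4 non-missing values.
# The two score columns have just 3 each, so they are dropped.
mostly_filled = df.dropna(axis=1, thresh=4)

# Drop a column only if every value in it is missing (none are, so all 6 stay)
not_empty = df.dropna(axis=1, how="all")

print(mostly_filled.columns.tolist())
print(not_empty.columns.tolist())
```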


**Replacing Missing Values**

In [21]:
df.fillna("New Value")

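# It replaced all the missing values with the string "New Value". This is not a good idea
# because it puts strings into numeric columns and changes their data type. It is better
# to fill each column with a value suited to its type, as the next section does.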

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,New Value,New Value,New Value,New Value,New Value,New Value
2,Hugh,Jackman,51.0,m,New Value,New Value
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
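
One way to avoid mixing types is to pass `fillna` a dictionary that maps column names to fill values. The sketch below is illustrative only: the placeholder strings and the choice of the median for `age` are assumptions, not part of the original notebook, and the table is rebuilt by hand from the output shown earlier.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown earlier (assumption: the CSV file is not read here)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# One fill value per column; columns not listed here are left untouched
fill_values = {
    "first_name": "No Name",     # hypothetical placeholder
    "last_name": "No Name",      # hypothetical placeholder
    "sex": "unknown",            # hypothetical placeholder
    "age": df["age"].median(),   # median of 31, 51, 63, 66 is 57.0
}
filled = df.fillna(value=fill_values)

print(filled)
print(filled.dtypes)
```

The numeric `age` column stays `float64`, and the two score columns still contain their missing values because they are not in the dictionary.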


**Interpolation**

In [26]:
# We need to handle missing values column by column so that we do not change each
# column's data type.

# For first and last name, we fill in a placeholder string.
df[['first_name', 'last_name']].fillna("No Name")

# fillna returns a new object and leaves df unchanged. To keep the result, assign it back:
# df[['first_name', 'last_name']] = df[['first_name', 'last_name']].fillna("No Name")

Unnamed: 0,first_name,last_name
0,Tom,Hanks
1,No Name,No Name
2,Hugh,Jackman
3,Oprah,Winfrey
4,Emma,Stone


In [29]:
# Similarly, for numeric columns such as age and pre_movie_score, we can fill with a
# statistic of the column. Here the mean of the observed scores (8.0, 6.0, 7.0) is 7.0.

df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64
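
The cell above fills every gap with the same value, the column mean. pandas also provides `Series.interpolate`, which estimates each gap from its neighbouring values instead. The sketch below applies linear interpolation to the same scores. It assumes the row order is meaningful, which may not hold for this dataset, so treat it as an illustration of the method rather than a recommendation.

```python
import numpy as np
import pandas as pd

# The pre_movie_score values from the table, copied by hand
scores = pd.Series([8.0, np.nan, np.nan, 6.0, 7.0], name="pre_movie_score")

# Linear interpolation spaces the two missing values evenly between 8.0 and 6.0
interpolated = scores.interpolate(method="linear")

# For comparison: the mean fill used above puts 7.0 in both gaps
mean_filled = scores.fillna(scores.mean())

print(interpolated)
print(mean_filled)
```

With this data the interpolated gaps come out at roughly 7.33 and 6.67, while the mean fill gives 7.0 for both.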

# The End