# How to Deal with Missing Data in Python

## Learning Objectives
During data exploration, it is common to discover that the data we have is incomplete. Missing data can arise from changes in data collection methods, human error, bias, or simply the lack of reliable input. There are several ways to deal with missing data. By the end of this tutorial, you will have learned:

+ how to detect missing data
+ how to remove missing data
+ how to resolve missing data

## How to Detect Missing Values

In [32]:
import pandas as pd

In [33]:
students = pd.read_excel("./data/students.xlsx", engine='openpyxl')
students

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,,,Granger,IN,
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,,,South Bend,IN,46601.0


In [8]:
# count the missing values in each column
students.isnull().sum()

ID            0
FirstName     0
LastName      0
Major         0
Minor        12
Age           6
Gender        3
City          1
State         3
Zip           4
dtype: int64

In [13]:
# build a Boolean mask that marks the rows where State is missing
mask = students['State'].isnull()
mask

0      True
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17    False
18    False
19    False
Name: State, dtype: bool

In [14]:
# use the mask to select the rows where State is missing
students[mask]

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
13,6504,Chantelle,Woods,Business Technology,,18.0,Female,Chicago,,60608.0
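
The detection steps above can be condensed into a small, self-contained sketch. The four-row `sample` frame below is a hypothetical stand-in for `students.xlsx` (its values are borrowed from the tables above), and the names `missing_counts`, `missing_share`, and `rows_with_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample standing in for the tutorial's students.xlsx
sample = pd.DataFrame({
    "FirstName": ["Hana", "Alicia", "Becca", "Alex"],
    "City": [np.nan, "Berrien Springs", "Chicago", "Granger"],
    "State": [np.nan, np.nan, "IL", "IN"],
    "Zip": [np.nan, np.nan, 60608.0, np.nan],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()

# Fraction of missing values in each column (the mean of a Boolean mask)
missing_share = sample.isnull().mean()

# Rows that have at least one missing value in any column
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_missing)
```

Looking at the fraction rather than the raw count can make it easier to judge whether a column is worth keeping.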


## How to Remove Missing Values

In [15]:
# drop every row that has a missing value in any column; this approach is rather extreme
students.dropna()

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
12,6453,Otis,Johnson,Management,Political Science,19.0,Male,Anchorage,AK,99501.0
15,7511,Laila,Carroll,Marketing,Innovation,20.0,Female,New York,NY,10001.0
16,7965,Rocco,Decola,Finance,Innovation,21.0,Male,Oakland,CA,94603.0
17,9232,Julie,Holmes,Business Technology,Innovation,18.0,Female,Webster,NY,14580.0
18,9268,Albert,Palmer,Management,Real Estate,21.0,Male,Detroit,MI,48201.0


In [16]:
students

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,,,Granger,IN,
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,,,South Bend,IN,46601.0
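
As the output above suggests, `dropna()` returns a new DataFrame and leaves the original untouched. The following minimal sketch illustrates this on a hypothetical three-row `sample` frame; `complete_rows` is an illustrative name, not part of the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; only the row for Gus is complete
sample = pd.DataFrame({
    "FirstName": ["Hana", "Gus", "Alex"],
    "State": [np.nan, "OH", "IN"],
    "Zip": [np.nan, 44646.0, np.nan],
})

# dropna() builds a new DataFrame containing only the complete rows
complete_rows = sample.dropna()

# The original frame still holds all three rows, missing values included
print(len(sample), len(complete_rows))
```

To keep the result, assign it to a name, for example `students = students.dropna()`.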


### Dropping Rows Based on a Subset of Columns

In [17]:
# drop a row only if both State and Zip are missing (how='all')
# only rows 0 and 1 are dropped, as they have neither a State nor a Zip value

students.dropna(subset=['State', 'Zip'], how='all')

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,,,Granger,IN,
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,,,South Bend,IN,46601.0
10,5336,Homer,White,Finance,,,Male,St. Paul,MN,55101.0
11,5495,Lara,Woods,Accountancy,,,,Dallas,TX,75201.0
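
The difference between `how='all'` and `how='any'` is easy to miss, so here is a short sketch that contrasts them. The `sample` frame is a hypothetical four-row stand-in for the student data, and the result names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Hana lacks both fields, Alex lacks Zip, Chantelle lacks State
sample = pd.DataFrame({
    "FirstName": ["Hana", "Alex", "Chantelle", "Gus"],
    "State": [np.nan, "IN", np.nan, "OH"],
    "Zip": [np.nan, np.nan, 60608.0, 44646.0],
})

# how='all': drop a row only when State AND Zip are both missing
both_missing_dropped = sample.dropna(subset=["State", "Zip"], how="all")

# how='any' (the default): drop a row when State OR Zip is missing
either_missing_dropped = sample.dropna(subset=["State", "Zip"], how="any")

print(both_missing_dropped)
print(either_missing_dropped)
```

With `how='all'` only Hana's row is removed, while `how='any'` keeps only Gus's row.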


### Dropping Columns That Have Any Missing Values

In [18]:
students.dropna(axis=1)  # this is also extreme, as most of our columns are dropped

Unnamed: 0,ID,FirstName,LastName,Major
0,1869,Hana,Barton,Finance
1,2010,Alicia,Kanuri,Management
2,2075,Becca,Swanson,Marketing
3,2228,Halima,Aminu,Business Analytics
4,2442,Holly,Robinson,Finance
5,2858,Carla,Harding,Accountancy
6,3459,Alex,Swanson,Accountancy
7,3585,Aminah,Zalim,Marketing
8,5170,Gus,Cunningham,Finance
9,5317,Rafael,Solis,Business Analytics


### Dropping Columns using `thresh`

In [22]:
len(students.columns)

10

In [23]:
# We have 20 rows, so thresh=10 keeps only the columns with at least 10 (50%) non-missing values
# only the Minor column is dropped, because it has just 8 non-missing values
students.dropna(axis=1, thresh=10)

Unnamed: 0,ID,FirstName,LastName,Major,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,,Granger,IN,
7,3585,Aminah,Zalim,Marketing,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,,,South Bend,IN,46601.0
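
Because `thresh` counts non-missing values, it can be derived from the number of rows instead of being hard-coded. The sketch below assumes a hypothetical four-row `sample` frame and a 50% cutoff; `min_non_missing` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Minor has 1 non-missing value, Age has 3
sample = pd.DataFrame({
    "Major": ["Finance", "Management", "Marketing", "Accountancy"],
    "Minor": [np.nan, np.nan, np.nan, "Innovation"],
    "Age": [22.0, 21.0, np.nan, 20.0],
})

# Require at least half of the rows to be non-missing (thresh must be an integer)
min_non_missing = len(sample) // 2

# Columns with fewer than min_non_missing non-missing values are dropped
kept = sample.dropna(axis=1, thresh=min_non_missing)

print(list(kept.columns))
```

Note that the threshold is based on the row count, `len(sample)`, not on the number of columns.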


## How to Resolve Missing Values

### Filling Missing Values with a Specific Value

In [34]:
# there are 3 missing values in Gender column
students['Gender'].isnull().sum()

3

In [36]:
# fill those missing values with the value 'Female'
students = students.fillna({'Gender': 'Female'})
students

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,,Female,Granger,IN,
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,,Female,South Bend,IN,46601.0


In [37]:
students['Gender'].isnull().sum()

0
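
Passing a dictionary to `fillna()` makes it possible to fill several columns in one call, each with its own value. The sketch below is only an illustration: the `sample` frame and the placeholder strings `'Unknown'` and `'None declared'` are assumptions, not values used in the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample with gaps in Minor and Gender
sample = pd.DataFrame({
    "FirstName": ["Alex", "Rafael", "Gus"],
    "Minor": [np.nan, "ACMS", "Innovation"],
    "Gender": [np.nan, np.nan, "Male"],
})

# Each dictionary key names a column; its value replaces that column's missing entries
# (the placeholder strings are illustrative assumptions)
filled = sample.fillna({"Gender": "Unknown", "Minor": "None declared"})

print(filled)
```

A neutral placeholder such as `'Unknown'` avoids asserting a value that was never recorded, which may matter if the column is analyzed later.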

### Filling Missing Values with the Mean

In [38]:
students['Age'].isnull().sum()

6

In [39]:
students = students.fillna({'Age': students['Age'].mean()})

In [40]:
students['Age'].isnull().sum()

0
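
The cells above fill `Age` with the column mean. As a point of comparison, the sketch below fills a hypothetical `ages` Series with the mean and, alternatively, with the median; the Series values and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, 21.0, np.nan, 20.0, 18.0, np.nan], name="Age")

# mean() and median() skip missing values by default
mean_age = ages.mean()      # (22 + 21 + 20 + 18) / 4
median_age = ages.median()  # midpoint of 20 and 21

# fillna() returns a new Series; the original is left unchanged
filled_with_mean = ages.fillna(mean_age)
filled_with_median = ages.fillna(median_age)

print(mean_age, median_age)
```

The median is less sensitive to outliers than the mean, so it can be a reasonable alternative when a column contains a few extreme values.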

### Filling Missing Values Cell by Cell

We want to fill in the Zip code for the row where `City == 'Granger'` and `State == 'IN'`.

In [41]:
students

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,20.357143,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,20.357143,Female,Granger,IN,
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,20.357143,Female,South Bend,IN,46601.0


In [42]:
mask = (students['City'] == 'Granger') & (students['State']=='IN')

In [45]:
students.loc[mask, :]

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
6,3459,Alex,Swanson,Accountancy,,20.357143,Female,Granger,IN,


In [48]:
# updating with correct zip code

students.loc[mask, 'Zip'] = 46530
students

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,20.357143,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,20.357143,Female,Granger,IN,46530.0
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,20.357143,Female,South Bend,IN,46601.0


#### Now we want to update row 14 too

In [50]:
mask = (students['City'] == 'Niles') & (students['State'] == 'MI')

In [52]:
students.loc[mask, :]

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
14,6768,Caroline,Marsh,Business Analytics,ACMS,22.0,Female,Niles,MI,


In [53]:
students.loc[mask, 'Zip'] = 49120
students

Unnamed: 0,ID,FirstName,LastName,Major,Minor,Age,Gender,City,State,Zip
0,1869,Hana,Barton,Finance,,22.0,Female,,,
1,2010,Alicia,Kanuri,Management,,21.0,Female,Berrien Springs,,
2,2075,Becca,Swanson,Marketing,,22.0,Female,Chicago,IL,60608.0
3,2228,Halima,Aminu,Business Analytics,,20.0,Female,Atlanta,GA,30303.0
4,2442,Holly,Robinson,Finance,,20.0,Female,Charlotte,NC,28202.0
5,2858,Carla,Harding,Accountancy,,20.357143,Female,Youngstown,OH,44502.0
6,3459,Alex,Swanson,Accountancy,,20.357143,Female,Granger,IN,46530.0
7,3585,Aminah,Zalim,Marketing,,20.0,Female,Ashburn,VA,20147.0
8,5170,Gus,Cunningham,Finance,Innovation,21.0,Male,Massillon,OH,44646.0
9,5317,Rafael,Solis,Business Analytics,ACMS,20.357143,Female,South Bend,IN,46601.0
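
When several cells need the same kind of fix, the two updates above can be driven by a small lookup table. This is a sketch under stated assumptions: the three-row `sample` frame and the `zip_lookup` dictionary are hypothetical, with the two ZIP codes taken from the updates shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the two missing Zip values from the tutorial
sample = pd.DataFrame({
    "City": ["Granger", "Niles", "Chicago"],
    "State": ["IN", "MI", "IL"],
    "Zip": [np.nan, np.nan, 60608.0],
})

# (City, State) -> Zip; values taken from the tutorial's own updates
zip_lookup = {("Granger", "IN"): 46530, ("Niles", "MI"): 49120}

for (city, state), zip_code in zip_lookup.items():
    # Only touch rows that match the city and state and still lack a Zip
    mask = (
        (sample["City"] == city)
        & (sample["State"] == state)
        & sample["Zip"].isnull()
    )
    sample.loc[mask, "Zip"] = zip_code

print(sample)
```

Adding `sample["Zip"].isnull()` to the mask keeps existing Zip values from being overwritten.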
