In this video we will explore how to use various pandas techniques to handle missing data in our datasets. We will learn how to find out how much data is missing and from which columns. We will see how to drop rows or columns where all or many of the values are missing. We will also learn how, instead of dropping data, we can fill in the missing records with zeros or with the mean of the remaining values.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [3]:
import pandas as pd
data = pd.read_csv('data-titanic.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


It is time to look at how many records are missing. To do this, we first need to find out the total number of records in the dataset. We can do this by reading the shape attribute of the DataFrame.

In [4]:
data.shape

(891, 12)

We can see that the total number of records is 891 and that the total number of columns is 12.


We then find out the number of non-missing records in each column. We can do this by calling the count method on the DataFrame.

In [5]:
data.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
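
The number of missing values can also be computed directly rather than read off by eye. The sketch below uses a small hypothetical frame (`toy`, standing in for the Titanic data, with made-up gaps) and compares `isna().sum()` against the "total rows minus count" approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Missing values per column, counted directly.
missing_per_column = toy.isna().sum()

# The same numbers, derived as total rows minus the non-missing count.
missing_by_difference = len(toy) - toy.count()

# Share of missing values per column, as a percentage.
missing_percent = toy.isna().mean() * 100

print(missing_per_column)
print(missing_percent)
```

On the real dataset, the same calls would be made on `data` instead of `toy`.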

The difference between the total number of records and the count per column is the number of records missing from that column. Out of the 12 columns, 3 have missing values. For example, Age has only 714 values out of a total of 891 rows, Cabin has only 204 and Embarked has 889. There are different ways we could handle these missing values. One of them is to drop any row where a value is missing, even from a single column.

In [6]:
data_missing_dropped = data.dropna()
data_missing_dropped.shape

(183, 12)

When we run dropna, we assign the result to a new DataFrame. This leaves us with just 183 records out of a total of 891. Losing this much data may not be acceptable.

Another method is to drop only those rows where all values are missing.

In [7]:
data_all_missing_dropped = data.dropna(how="all")
data_all_missing_dropped.shape

(891, 12)

We do this by setting the how parameter of the dropna method to "all". The shape is still (891, 12), so no row in this dataset has every value missing.
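
Between dropping on any missing value and dropping only fully empty rows there are some middle options. The sketch below, on a hypothetical toy frame, shows `how="all"`, the `subset` parameter (look only at chosen columns) and `axis=1` with `thresh` (drop sparse columns). The threshold of 2 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row is missing every value.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, np.nan, np.nan],
})

# Drop rows only when every column is missing (the last row here).
all_missing_dropped = toy.dropna(how="all")

# Drop rows only when Age is missing, ignoring gaps in other columns.
age_known = toy.dropna(subset=["Age"])

# Drop columns instead of rows: keep columns with at least 2 known values.
dense_columns = toy.dropna(axis=1, thresh=2)

print(all_missing_dropped.shape)
print(age_known.shape)
print(list(dense_columns.columns))
```

Which variant is appropriate depends on how much data the analysis can afford to lose.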

Instead of dropping rows, another method is to fill in the missing values with some data. We can fill in the missing values with 0.

In [8]:
data_filled_zeros = data.fillna(0)
data_filled_zeros.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            891
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          891
Embarked       891
dtype: int64

Here we call pandas' fillna method on the whole DataFrame and pass it the numeric value 0, so every missing value in every column is replaced with 0. You can see that all the missing values have now been filled, and hence the count for every column has gone up to 891, the total number of records in the dataset.
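
A blanket 0 also ends up in text columns such as Cabin, which may not be what we want. As a hedged sketch, fillna also accepts a dictionary that maps column names to fill values. The toy frame and the "Unknown" placeholder below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in a numeric and two text columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", np.nan, "C"],
})

# Fill each listed column with its own value; unlisted columns are untouched.
filled = toy.fillna({"Age": 0, "Cabin": "Unknown"})

print(filled)
print(filled.isna().sum())
```

Embarked is not listed in the dictionary, so its missing value is left as it is.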


Also, instead of filling in missing values with 0, we could fill them with the mean of the remaining existing values. To do this we call the fillna method on the column we want to fill and pass the mean of that column as the argument.

In [9]:
data_filled_in_mean = data.copy()
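# Assign the filled column back; inplace=True on a selected column may not update the DataFrame in newer pandas
data_filled_in_mean['Age'] = data_filled_in_mean['Age'].fillna(data['Age'].mean())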
data_filled_in_mean.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            891
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

For example, here we filled in the missing values of Age with the mean of the existing values. Cabin and Embarked still have missing values because we filled only the Age column.
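
To make the arithmetic visible, here is a minimal sketch of mean filling on a hypothetical four-row column. The mean method skips missing values by default, so the fill value is the average of the known ages only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: two known ages, two missing.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, np.nan]})

# mean() ignores NaN, so this is (20 + 40) / 2.
age_mean = toy["Age"].mean()

# Work on a copy and assign the filled column back to it.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(age_mean)

print(age_mean)
print(filled["Age"].tolist())
```

The original `toy` frame keeps its missing values because only the copy is modified.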

**Extra work**

Try applying the methods above to fill in the remaining columns so that every count reaches 891 and no missing values are left.

**Indexing in pandas DataFrames**

In [10]:
data = pd.read_csv('data-titanic.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The table above shows what our default index looks like right now: a numeric index starting from 0.


Let's set it to a column of our choice. Here we will use the set_index method to set the index to the name of the passenger from our data.

In [11]:
data.set_index('Name')

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C


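As you can see, the index has changed from the simple numeric values starting at 0 to the names of the passengers in our dataset. Note that set_index returns a new DataFrame; data itself keeps its numeric index unless we assign the result back.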
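
A short sketch of this behaviour on a hypothetical two-row frame: set_index hands back a new DataFrame with Name as the index, while the frame it was called on is left as it was.

```python
import pandas as pd

# Hypothetical toy frame with the default numeric index.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
})

# set_index returns a new DataFrame; keep it by assigning it to a name.
indexed = toy.set_index("Name")

print(indexed.index.name)   # the index is now labelled "Name"
print(list(toy.columns))    # the original still has Name as a column
```

To keep the change under the same name, the result can be assigned back, for example `toy = toy.set_index("Name")`.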


Next we will see how to set an index while reading in the data. We can do this by passing an extra parameter, index_col, to the read function.

In [12]:
data = pd.read_table('data-titanic.csv', sep=',', index_col=3)

In [13]:
data.head()

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.05,,S


The index_col parameter takes a single column position or label, or a sequence of them. Here we are passing 3, the zero-based position of the Name column.
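
As a self-contained sketch, the example below reads a tiny hypothetical CSV from an in-memory string (so no file is needed) and sets the index once by position and once by column label. The column layout is made up and is shorter than the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text; quoted names contain commas, as in the Titanic data.
csv_text = '''PassengerId,Survived,Name,Age
1,0,"Braund, Mr. Owen Harris",22
3,1,"Heikkinen, Miss. Laina",26
'''

# Name is the third column here, so its zero-based position is 2.
by_position = pd.read_csv(io.StringIO(csv_text), index_col=2)

# The same result, selecting the index column by its label.
by_label = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(by_position)
```

Using the label tends to be more robust than the position if the column order of the file ever changes.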

Next let's see how to select data using the index. We will use the loc indexer on the DataFrame and pass in the index label of the record that we want to select.

In [14]:
data.loc['Braund, Mr. Owen Harris',:]

PassengerId            1
Survived               0
Pclass                 3
Sex                 male
Age                   22
SibSp                  1
Parch                  0
Ticket         A/5 21171
Fare                7.25
Cabin                NaN
Embarked               S
Name: Braund, Mr. Owen Harris, dtype: object

In this case, it is the name of one of the passengers from the dataset. We can do this because we set up the name as the index for the dataset previously.
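
The loc indexer also accepts a column label, or lists of row and column labels. A minimal sketch on a hypothetical three-row frame indexed by name:

```python
import pandas as pd

# Hypothetical toy frame indexed by passenger name.
toy = pd.DataFrame(
    {"Age": [22.0, 26.0, 35.0], "Fare": [7.25, 7.925, 8.05]},
    index=pd.Index(
        [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Allen, Mr. William Henry",
        ],
        name="Name",
    ),
)

# One row by label: the result is a Series of that passenger's values.
one_row = toy.loc["Braund, Mr. Owen Harris"]

# One cell: row label and column label together.
one_value = toy.loc["Heikkinen, Miss. Laina", "Fare"]

# Several rows and a subset of columns: the result is a DataFrame.
several = toy.loc[["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"], ["Age"]]

print(one_row)
print(one_value)
print(several)
```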

And finally we can reset the index back to what it was before we changed it. We will use the reset_index method for this.

In [15]:
data.reset_index(inplace=True)

In [16]:
data.head()

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.05,,S


We passed inplace=True because we want to reset the index on the original DataFrame itself. Note that Name comes back as the first column rather than in its original position.
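
As an alternative to inplace=True, reset_index can return a new DataFrame, and its drop parameter controls whether the old index is kept as a column. A small sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical toy frame indexed by passenger name.
toy = pd.DataFrame(
    {"Age": [22.0, 26.0]},
    index=pd.Index(
        ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], name="Name"
    ),
)

# Default: the old index moves back into the frame as the first column.
restored = toy.reset_index()

# drop=True: the old index is discarded instead of becoming a column.
discarded = toy.reset_index(drop=True)

print(list(restored.columns))
print(list(discarded.columns))
```

Both calls leave `toy` unchanged, since inplace is not set.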