- Dropping Missing Values

In [19]:
# Import pandas under its conventional alias, pd
import pandas as pd

In [20]:
# DataFrame with None or Null Values
userDataFrame3 = pd.DataFrame({'name': ['Alice', 'Janvier', None, 'Kate', 'Bob'], 'age': [25, 30, 35, 25, 40], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None]})
userDataFrame3

Unnamed: 0,name,age,city
0,Alice,25,Rockville
1,Janvier,30,Germantown
2,,35,Denver
3,Kate,25,Rome
4,Bob,40,


In [21]:
# Return True or False for each cell depending on whether it is null (useful for inspecting a small dataset)
userDataFrame3.isnull()

Unnamed: 0,name,age,city
0,False,False,False
1,False,False,False
2,True,False,False
3,False,False,False
4,False,False,True


In [22]:
# Report, for each column, whether it contains at least one null value
userDataFrame3.isnull().any()

name     True
age     False
city     True
dtype: bool

In [23]:
# Count the missing values in each column
userDataFrame3.isnull().sum()

name    1
age     0
city    1
dtype: int64

In [24]:
# Count the non-missing values in each column
userDataFrame3.notnull().sum()

name    4
age     5
city    4
dtype: int64
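
On a larger dataset, raw counts are harder to judge than proportions. The following is a minimal sketch, assuming the same sample data as above, that reports the share of missing values per column; the variable names `users` and `missing_share` are illustrative only.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', None, 'Kate', 'Bob'],
    'age': [25, 30, 35, 25, 40],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None],
})

# isnull() yields booleans; the mean of a boolean column is the fraction of True values
missing_share = users.isnull().mean()
print(missing_share)
```

Here one of the five values is missing in both name and city, so each of those columns reports a share of 0.2.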

1. Drop missing values using the dropna method

Calling dropna on a DataFrame with no arguments drops every row that has at least one missing value.

In [25]:
userDataFrame4 = pd.DataFrame({'name': ['Alice', 'Janvier', None, 'Kate', 'Bob'], 'age': [25, 30, 35, 25, 40], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None]})
userDataFrame4.dropna() # using dropna method to remove rows with missing values

Unnamed: 0,name,age,city
0,Alice,25,Rockville
1,Janvier,30,Germantown
3,Kate,25,Rome
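
Note that the output above keeps the original index labels (0, 1, 3) and that dropna returns a new DataFrame. The sketch below, which assumes the same sample data, shows one way to renumber the rows afterwards with reset_index; the names `users`, `cleaned`, and `renumbered` are illustrative.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', None, 'Kate', 'Bob'],
    'age': [25, 30, 35, 25, 40],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None],
})

cleaned = users.dropna()                     # surviving rows keep their index labels
renumbered = cleaned.reset_index(drop=True)  # relabel the rows 0..n-1, discarding the old index

print(list(cleaned.index))
print(list(renumbered.index))
print(len(users))  # the original DataFrame is unchanged
```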


2. Drop problematic columns, or columns with a lot of missing values, using the drop method

When cleaning up data, you should remove problematic columns or columns with a lot of missing values first, before removing rows with missing values. Otherwise, a mostly empty column causes dropna to discard rows that are complete in every other column.

In [26]:
userDataFrame5 = pd.DataFrame({'name': ['Alice', 'Janvier', None, 'Kate', 'Bob', 'Naomie'], 'age': [25, 30, 35, 25, 40, 11], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg'], 'hobby': [None, 'Soccer', None, None, None, None]})
# Drop the problematic 'hobby' column first, then use dropna to remove the rows that still have missing values
userDataFrame5.drop(['hobby'], axis=1).dropna()

Unnamed: 0,name,age,city
0,Alice,25,Rockville
1,Janvier,30,Germantown
3,Kate,25,Rome
5,Naomie,11,Harrisonburg
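
When there are too many columns to inspect by hand, the columns to drop can be selected by their share of missing values. This is only a sketch under stated assumptions: the 0.5 cutoff (`max_missing_share`) is a hypothetical choice, not a rule from this notebook, and the variable names are illustrative.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', None, 'Kate', 'Bob', 'Naomie'],
    'age': [25, 30, 35, 25, 40, 11],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg'],
    'hobby': [None, 'Soccer', None, None, None, None],
})

max_missing_share = 0.5  # hypothetical cutoff; tune it for your data

# Fraction of missing values per column, then the labels of columns above the cutoff
missing_share = users.isnull().mean()
sparse_columns = missing_share[missing_share > max_missing_share].index

# Drop the sparse columns first, then the rows that still contain missing values
cleaned = users.drop(columns=sparse_columns).dropna()

print(list(sparse_columns))
print(cleaned)
print(len(users.dropna()))  # rows left if dropna is applied without dropping 'hobby' first
```

With this data only the hobby column exceeds the cutoff, and the result matches the table above. Calling dropna directly on the full DataFrame would leave a single row, because only one row has a hobby.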


3. To apply the change to the original DataFrame, pass inplace=True as a parameter to the dropna method

In [27]:
userDataFrame6 = pd.DataFrame({'name': ['Alice', 'Janvier', None, 'Kate', 'Bob'], 'age': [25, 30, 35, 25, 40], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None]})
userDataFrame6.dropna(inplace=True) # remove rows with missing values
userDataFrame6 # display the original DataFrame, which has now been modified

Unnamed: 0,name,age,city
0,Alice,25,Rockville
1,Janvier,30,Germantown
3,Kate,25,Rome
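
An alternative to inplace=True is to assign the result of dropna back to the same variable. The sketch below, assuming the same sample data, checks that both approaches end with the same contents; the names `data`, `in_place`, and `reassigned` are illustrative.

```python
import pandas as pd

data = {
    'name': ['Alice', 'Janvier', None, 'Kate', 'Bob'],
    'age': [25, 30, 35, 25, 40],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None],
}

# Approach 1: modify the DataFrame in place
in_place = pd.DataFrame(data)
in_place.dropna(inplace=True)

# Approach 2: keep the default behaviour and rebind the variable to the result
reassigned = pd.DataFrame(data)
reassigned = reassigned.dropna()

print(in_place.equals(reassigned))
```

Reassignment also allows method chaining, such as the drop(...).dropna() call shown in step 2.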


4. To remove a row only if all of its values are missing, pass how='all' as a parameter to the dropna method

In [28]:
userDataFrame7 = pd.DataFrame({'name': ['Alice', 'Janvier', None, 'Kate', 'Bob', None], 'age': [25, 30, 35, 25, 40, None], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, None]})
userDataFrame7.dropna(how='all')

Unnamed: 0,name,age,city
0,Alice,25.0,Rockville
1,Janvier,30.0,Germantown
2,,35.0,Denver
3,Kate,25.0,Rome
4,Bob,40.0,
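
To make the difference from the default concrete, the sketch below runs both how='any' (the default) and how='all' on the same data as above. It also checks the dtype of the age column, which is displayed as 25.0, 30.0, and so on because the missing entry is stored as NaN in a floating-point column. The variable names are illustrative.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', None, 'Kate', 'Bob', None],
    'age': [25, 30, 35, 25, 40, None],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, None],
})

any_missing = users.dropna(how='any')  # default: drop rows with at least one missing value
all_missing = users.dropna(how='all')  # drop only rows where every value is missing

print(list(any_missing.index))
print(list(all_missing.index))
print(users['age'].dtype)
```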


5. To remove columns where all values are missing, pass how='all' and axis=1 as parameters to the dropna method

In [29]:
userDataFrame8 = pd.DataFrame({'name': ['Alice', 'Janvier', None, 'Kate', 'Bob', 'Naomie'], 'age': [25, 30, 35, 25, 40, 11], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg'], 'hobby': [None, None, None, None, None, None]})
userDataFrame8.dropna(how='all', axis=1)

Unnamed: 0,name,age,city
0,Alice,25,Rockville
1,Janvier,30,Germantown
2,,35,Denver
3,Kate,25,Rome
4,Bob,40,
5,Naomie,11,Harrisonburg
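
A column survives how='all' as soon as it holds a single non-missing value. The sketch below illustrates this with two sparse columns; the `notes` column is a hypothetical addition for this example and does not appear in the data above.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', None, 'Kate', 'Bob', 'Naomie'],
    'hobby': [None, 'Soccer', None, None, None, None],  # one value present
    'notes': [None, None, None, None, None, None],      # hypothetical, entirely empty column
})

# Only columns in which every value is missing are removed
kept = users.dropna(how='all', axis=1)

print(list(kept.columns))
```

To remove a mostly empty column such as hobby, use the drop method from step 2 instead.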


6. To remove rows with a missing value in at least one of the given columns, pass the column names as the subset parameter of the dropna method, e.g. subset=['col_1', 'col_2', 'col_5', 'col_9']

In [30]:
userDataFrame9 = pd.DataFrame({'name': ['Alice', 'Janvier', 'Naomie', 'Kate', 'Bob', None, 'Jack'], 'age': [25, 30, 11, 25, 40, 25, None], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg', 'Paris']})
userDataFrame9.dropna(subset=['name', 'age']) # rows 5 (missing name) and 6 (missing age) are removed; row 4 stays because city is not in the subset

Unnamed: 0,name,age,city
0,Alice,25.0,Rockville
1,Janvier,30.0,Germantown
2,Naomie,11.0,Denver
3,Kate,25.0,Rome
4,Bob,40.0,
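
Before dropping anything, it can help to see which rows a subset would remove. The following sketch, assuming the same data as above, builds an equivalent boolean mask by hand; `mask`, `to_drop`, and `kept` are illustrative names.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', 'Naomie', 'Kate', 'Bob', None, 'Jack'],
    'age': [25, 30, 11, 25, 40, 25, None],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg', 'Paris'],
})

# True for rows with a missing value in at least one of the chosen columns
mask = users[['name', 'age']].isnull().any(axis=1)

to_drop = users[mask]                          # rows that dropna would remove
kept = users.dropna(subset=['name', 'age'])    # rows that remain

print(list(to_drop.index))
print(kept.equals(users[~mask]))
```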


7. To remove rows only when the values in all of the given columns are missing, pass both subset=['col_1', 'col_2', 'col_5', 'col_9'] and how='all' as parameters to the dropna method

In [31]:
userDataFrame10 = pd.DataFrame({'name': ['Alice', 'Janvier', 'Naomie', 'Kate', 'Bob', 'Jeff', 'Jack'], 'age': [25, 30, 11, 25, 40, 25, None], 'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg', None]})
userDataFrame10.dropna(subset=['age', 'city'], how='all') # only the last row is removed, since it is missing both age and city

Unnamed: 0,name,age,city
0,Alice,25.0,Rockville
1,Janvier,30.0,Germantown
2,Naomie,11.0,Denver
3,Kate,25.0,Rome
4,Bob,40.0,
5,Jeff,25.0,Harrisonburg
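
The same check can be written as a boolean mask, replacing any with all so that a row is flagged only when every chosen column is missing. This is a sketch on the same data as above, with illustrative variable names.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['Alice', 'Janvier', 'Naomie', 'Kate', 'Bob', 'Jeff', 'Jack'],
    'age': [25, 30, 11, 25, 40, 25, None],
    'city': ['Rockville', 'Germantown', 'Denver', 'Rome', None, 'Harrisonburg', None],
})

# True only for rows where both age and city are missing
mask = users[['age', 'city']].isnull().all(axis=1)

kept = users.dropna(subset=['age', 'city'], how='all')

print(list(users[mask].index))
print(kept.equals(users[~mask]))
```

Row 4 is kept because only its city is missing, while row 6 is dropped because both age and city are missing.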
