# Data Analysis in Python - VIII: Handling Missing Data

## Introduction


In this lesson, we will learn various ways to handle missing data using pandas. 



## Missing data representation

pandas uses NumPy's `NaN` constant to signify missing values. NumPy exposes it as `np.nan`; older NumPy releases also accept the aliases `np.NaN` and `np.NAN`, which were removed in NumPy 2.0. You must import numpy to use these constants.

In [1]:
import numpy as np

`NaN` is **not** the same as a blank string, a zero numeric value or a False boolean value.

In [2]:
# compare NaN to a blank string and check the result.
np.NaN == ''

False

In [3]:
# compare NaN to zero (the number 0, not the string '0')
np.NaN == 0

False

## Checking if a value is considered missing in pandas (i.e., np.NaN)

NaN does not even compare equal to itself.

In [5]:
name = np.NaN
# compare name to NaN
name==np.NaN

False

For this reason, we must use special functions to test whether a value is missing or not. The pandas library provides two such functions: `isnull()` and `notnull()`.

In [6]:
import pandas as pd

In [7]:
age = 10 

In [8]:
# check if age is missing/ null
pd.isnull(age)

False

In [9]:
# check if age is not missing/ null
pd.notnull(age)

True

In [10]:
age = np.NaN

In [11]:
# check if age is missing/ null
pd.isnull(age)

True

In [12]:
# check if age is not missing/ null
pd.notnull(age)

False
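
The same two functions also work element-wise on a whole Series or DataFrame. The following is a minimal sketch using made-up values (the `ages` Series is illustrative and not part of the poverty data):

```python
import numpy as np
import pandas as pd

# A small Series with one missing entry (illustrative values only)
ages = pd.Series([10, np.nan, 35])

# isnull() and notnull() return one boolean per element
missing_mask = ages.isnull()
present_mask = ages.notnull()

print(missing_mask.tolist())   # [False, True, False]
print(present_mask.tolist())   # [True, False, True]
```

The resulting boolean masks can be used to filter rows or, as shown later in this lesson, to count missing values.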

## Identifying missing values/ symbols in raw data

Before you can mark missing values as np.NaN (for example, with the na_values option of read_csv, described in the next section), you must identify which symbols or values in the raw data represent missing data. Start with the data dictionary. If it does not include this information, you could examine all the data manually, but that is inefficient. Consider these strategies instead:
- Check whether the output of the describe function is missing a column that should be numeric. The likely reason is that the column was read with a string datatype because it contains a missing-value symbol that was not specified in the na_values option of the read_csv function.
- Check whether a column has a string datatype when you don't expect it to. The usual cause is the same: the column contains a missing-value symbol that was not specified in the na_values option of the read_csv function.
- Check the unique values in a column and see whether some of them might represent missing values.
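
The strategies above can be sketched on a tiny, made-up dataset. The column names `id` and `score` and the `?` symbol are assumptions chosen for illustration only; they do not come from the poverty data.

```python
import io
import pandas as pd

# Hypothetical raw data: '?' marks a missing score but is not a default NA symbol
raw = io.StringIO("id,score\n1,10\n2,?\n3,30\n")
df = pd.read_csv(raw)

# 'score' was expected to be numeric, but pandas read it as text
score_is_numeric = pd.api.types.is_numeric_dtype(df["score"])
print(score_is_numeric)

# describe() summarises only numeric columns by default, so 'score' is absent
described_columns = list(df.describe().columns)
print(described_columns)

# Listing the unique values exposes the unexpected symbol
unique_scores = df["score"].unique().tolist()
print(unique_scores)
```

Once a symbol such as `?` has been spotted this way, it can be passed to `na_values` when the file is read again.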

## Identifying missing values when reading data

Missing values may be represented in different ways in datasets. For example, some numeric values like 0 or -9999 may be used to represent missing data. Alternately, blank strings (''), hyphens (-) or text such as NA may be used. 

When reading data using functions such as read_csv, we can specify how pandas can identify missing values in the data. Once identified, these values will be stored as NaN within the DataFrame. 

By default, pandas treats the following strings as missing values: `''`, `#N/A`, `#N/A N/A`, `#NA`, `-1.#IND`, `-1.#QNAN`, `-NaN`, `-nan`, `1.#IND`, `1.#QNAN`, `<NA>`, `N/A`, `NA`, `NULL`, `NaN`, `n/a`, `nan`, `null`.

In [16]:
# read the poverty data
povdata_na_file = "../scratch/PovertyDataNA.csv"
povData = pd.read_csv(povdata_na_file, sep=',')
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600,0,Albania
1,12.5,11.9,14.4,68.3,74.7,2250,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780,1,Hungary
5,-1.0,10.2,16.0,67.2,75.7,1690,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640,1,Romania
7,14.0,9.0,20.2,68.6,74.5,*,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880,1,Byelorussian_SSR


For other values, use the `na_values` parameter of the read_csv() function to specify which values should be identified as missing.

In [17]:
# read the poverty data, treating '*' and -1 as missing in every column
povData = pd.read_csv(povdata_na_file, sep=',',na_values=['*',-1])
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,0,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


In [None]:
# specify list of missing values
povData = pd.read_csv(povdata_na_file, sep=',', na_values=['*', -1])
povData.head(10)

In [18]:
# specify missing values by column 
povData = pd.read_csv(povdata_na_file, sep=',',na_values={'LiveBirthRate':-1, 'GNI':'*','Region':0})
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1.0,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1.0,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1.0,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1.0,Hungary
5,,10.2,16.0,67.2,75.7,1690.0,1.0,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1.0,Romania
7,14.0,9.0,20.2,68.6,74.5,,1.0,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1.0,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1.0,Byelorussian_SSR


The `keep_default_na` option (True by default) controls whether the default missing-value strings listed above are still parsed as NaN when you also supply `na_values`. Read the [reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details.

In [19]:
# experiment with keep_default_na
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0},keep_default_na=True)
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1.0,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1.0,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1.0,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1.0,Hungary
5,,10.2,16.0,67.2,75.7,1690.0,1.0,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1.0,Romania
7,14.0,9.0,20.2,68.6,74.5,,1.0,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1.0,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1.0,Byelorussian_SSR


In [20]:
# experiment with keep_default_na
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0},keep_default_na=False)
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1.0,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1.0,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1.0,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1.0,Hungary
5,,10.2,16.0,67.2,75.7,1690.0,1.0,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1.0,Romania
7,14.0,9.0,20.2,68.6,74.5,,1.0,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1.0,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1.0,Byelorussian_SSR


**Note:** setting keep_default_na to False may be useful when NA is a valid value in your data (e.g., it stands for North America).  
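
The following sketch illustrates that note on a small, made-up dataset in which `NA` is a legitimate region code. The column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical data: 'NA' is a real region code and '*' marks missing sales
raw = "region,sales\nNA,100\nEU,*\nNA,250\n"

# Default behaviour: 'NA' is one of the default symbols, so it becomes NaN
default_read = pd.read_csv(io.StringIO(raw), na_values=["*"])

# keep_default_na=False: only the symbols listed in na_values become NaN
strict_read = pd.read_csv(io.StringIO(raw), na_values=["*"], keep_default_na=False)

print(default_read["region"].isnull().sum())  # 2
print(strict_read["region"].tolist())         # ['NA', 'EU', 'NA']
print(strict_read["sales"].isnull().sum())    # 1
```

With `keep_default_na=False`, every missing-value symbol must be listed explicitly in `na_values`, including any that the defaults would otherwise have caught.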

`na_filter` is another option for missing values. It is set to True by default and specifies whether or not pandas should detect missing values. Analysts often set this option to False when they are working with large datasets and they know that there are no missing values in the data. Doing so speeds up the read operation.

In [22]:
# experiment with na_filter
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0}, keep_default_na= True, na_filter=False)
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600,0,Albania
1,12.5,11.9,14.4,68.3,74.7,2250,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780,1,Hungary
5,-1.0,10.2,16.0,67.2,75.7,1690,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640,1,Romania
7,14.0,9.0,20.2,68.6,74.5,*,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880,1,Byelorussian_SSR


## Detecting or counting missing values in data that has been read into a DataFrame

There are several ways to identify if there are missing values in a data frame. When you describe a data frame, a lower count for one or more columns indicates missing values.

In [23]:
povData = pd.read_csv(povdata_na_file, sep=',', keep_default_na = True, na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})

In [25]:
# summary stats
povData.describe()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region
count,96.0,97.0,97.0,97.0,97.0,91.0,96.0
mean,29.385417,10.836082,54.901031,61.485567,66.151134,5741.252747,3.979167
std,13.53048,4.647495,45.992584,9.61597,11.005391,8093.679853,1.722783
min,9.7,2.2,4.5,38.1,41.2,80.0,1.0
25%,14.8,7.8,13.1,55.8,57.5,475.0,3.0
50%,29.1,9.5,43.0,63.7,67.8,1690.0,4.0
75%,42.275,12.5,83.0,68.6,75.4,7325.0,6.0
max,52.2,25.0,181.6,75.9,81.8,34064.0,6.0


You can also get the count of non-missing values directly, for all columns or for a specific column.

In [26]:
# counts of non-missing values in all columns of the povData data frame
povData.count()

LiveBirthRate           96
DeathRate               97
InfantDeaths            97
MaleLifeExpectancy      97
FemaleLifeExpectancy    97
GNI                     91
Region                  96
Country                 97
dtype: int64

In [28]:
# total rows in povData
print("Total rows = ", povData.shape[0])

# count of values in the GNI column
print("Count of GNI values = ", povData['GNI'].count())

Total rows =  97
Count of GNI values =  91
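
Instead of comparing counts against the number of rows, missing values can also be counted directly by combining `isnull()` with `sum()`. This is a minimal sketch on a made-up data frame; the values are illustrative and do not come from the poverty file.

```python
import numpy as np
import pandas as pd

# Illustrative data frame with a few missing entries
df = pd.DataFrame({
    "GNI": [600.0, np.nan, 2980.0, np.nan],
    "Region": [np.nan, 1.0, 1.0, 1.0],
    "Country": ["A", "B", "C", "D"],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = int(df.isnull().any(axis=1).sum())
print(rows_with_missing)  # 3
```

The per-column result is equivalent to `len(df) - df.count()`.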


## What if you don't identify all missing values/symbols?

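Occasionally you will find that a function call fails or does not perform as expected. That may be because the column of data being used in the analysis has missing values that have gone undetected (and have not been converted to NaN). In the example below, GNI is absent from the describe() output and its mean cannot be computed because the undetected `*` symbol caused the column to be read as strings.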

In [29]:
# read povData without parsing missing values
povData = pd.read_csv(povdata_na_file, sep=',')
povData.describe()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,Region
count,97.0,97.0,97.0,97.0,97.0,97.0
mean,29.072165,10.836082,54.901031,61.485567,66.151134,3.938144
std,13.80888,4.647495,45.992584,9.61597,11.005391,1.760767
min,-1.0,2.2,4.5,38.1,41.2,0.0
25%,14.5,7.8,13.1,55.8,57.5,3.0
50%,29.0,9.5,43.0,63.7,67.8,4.0
75%,42.2,12.5,83.0,68.6,75.4,6.0
max,52.2,25.0,181.6,75.9,81.8,6.0


In [30]:
# calculate mean GNI 
povData['GNI'].mean()

TypeError: can only concatenate str (not "int") to str
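
One possible way to track down the offending symbols after the fact is `pd.to_numeric` with `errors="coerce"`. This technique is not used elsewhere in this lesson; the sketch below uses a made-up Series rather than the actual GNI column.

```python
import pandas as pd

# Hypothetical column read as text because '*' was not declared as missing
gni = pd.Series(["600", "2250", "*", "2780"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
gni_numeric = pd.to_numeric(gni, errors="coerce")

# Values that were present as text but failed to convert are the suspects
suspects = gni[gni_numeric.isnull() & gni.notnull()].unique().tolist()
print(suspects)  # ['*']

# The converted column now supports numeric operations; NaN is skipped by default
print(gni_numeric.mean())
```

The symbols found this way can then be added to `na_values` so that the file is parsed correctly the next time it is read.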

## Treating missing values

An important decision in data analysis is how to treat missing values. Options include:
1. Ignoring missing values in calculations. 
2. Dropping data (rows, columns) with missing values. 
3. Replacing missing values.

### Ignoring missing values in calculations

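Some methods, such as sum() and mean(), ignore missing values by default. This behavior is controlled by the skipna parameter, which defaults to True; set skipna = False to have missing values included, in which case the result is NaN whenever any value is missing.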

In [32]:
# average GNI of countries
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData['GNI'].mean(skipna=True)

5741.252747252747

In [33]:
# set skipna to False
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData['GNI'].mean(skipna=False)

nan

In other cases, such as element-wise arithmetic, missing values are not ignored: any calculation involving NaN yields NaN.

In [38]:
# multiply GNI by 1000
povData['GNI']*1000


0      600000.0
1     2250000.0
2     2980000.0
3           NaN
4     2780000.0
        ...    
92     220000.0
93     110000.0
94     220000.0
95     420000.0
96     640000.0
Name: GNI, Length: 97, dtype: float64
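
The contrast between the two behaviors can be seen side by side in a small sketch. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

gni = pd.Series([600.0, np.nan, 2980.0])

# Element-wise arithmetic propagates NaN to the result
scaled = gni * 1000
print(scaled.tolist())

# Aggregations skip NaN by default ...
total_skip = gni.sum()
print(total_skip)  # 3580.0

# ... unless skipna=False is passed, in which case the result is NaN
total_noskip = gni.sum(skipna=False)
print(total_noskip)
```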

### Dropping missing values

You can drop a row (case) or a column (variable) from the data frame when it has missing values. Exercise caution, however, so that dropping data does not distort the results of later analyses. For example, it is generally advised not to drop missing values when there is a systematic reason why they are missing.

The `dropna()` method can be used to drop missing values. `axis = 0` drops rows (default) and `axis = 1` drops columns.

In [39]:
# read povData and parse missing values
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData.shape

(97, 8)

In [40]:
# drop all rows (cases) with missing values 
povData.dropna(axis=0,inplace=True)
povData.shape

(89, 8)

In [41]:
povData.describe()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region
count,89.0,89.0,89.0,89.0,89.0,89.0,89.0
mean,29.68427,10.796629,55.997753,61.223596,65.815281,5844.539326,4.044944
std,13.748131,4.705472,46.559275,9.777584,11.161651,8154.67952,1.691625
min,9.7,2.2,4.5,38.1,41.2,80.0,1.0
25%,14.9,7.7,13.0,55.0,56.0,470.0,3.0
50%,29.2,9.5,44.0,63.4,67.6,1880.0,4.0
75%,42.6,12.5,90.0,68.4,75.1,7600.0,6.0
max,52.2,25.0,181.6,75.9,81.8,34064.0,6.0


In [43]:
# drop columns with missing values (dropna returns a new data frame here; povData itself is unchanged)
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData.dropna(axis=1).shape

(97, 5)

In [44]:
povData.describe()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region
count,96.0,97.0,97.0,97.0,97.0,91.0,96.0
mean,29.385417,10.836082,54.901031,61.485567,66.151134,5741.252747,3.979167
std,13.53048,4.647495,45.992584,9.61597,11.005391,8093.679853,1.722783
min,9.7,2.2,4.5,38.1,41.2,80.0,1.0
25%,14.8,7.8,13.1,55.8,57.5,475.0,3.0
50%,29.1,9.5,43.0,63.7,67.8,1690.0,4.0
75%,42.275,12.5,83.0,68.6,75.4,7325.0,6.0
max,52.2,25.0,181.6,75.9,81.8,34064.0,6.0


We can further refine the behavior of the dropna() method using the `how` and `thresh` parameters.

`how='any'` is the default behavior that we have seen so far. 

`how='all'` will drop a row or column only when all of its values are missing.

`thresh=n`, where n is a numeric value, will *keep* a row or column only when it has *at least* n *non-missing* values. 

In [45]:
# keep a column only when it has at least 95 non-missing values
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData.dropna(axis=1,inplace=True,thresh=95)
povData.shape

(97, 7)

In [None]:
# keep column only when there are more than 95 non-missing values
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData.dropna(axis=1,inplace=True,thresh=96)
povData.shape

In [None]:
# keep column only when there are more than 80 non-missing values
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData.dropna(axis=1,inplace=True,thresh=81)
povData.shape
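
To see how `how` and `thresh` differ, it can help to try them on a data frame small enough to check by eye. The sketch below uses made-up columns `a` to `d` with 4, 3, 1 and 0 non-missing values respectively.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],                  # 4 non-missing values
    "b": [1.0, np.nan, 3.0, 4.0],               # 3 non-missing values
    "c": [np.nan, np.nan, np.nan, 4.0],         # 1 non-missing value
    "d": [np.nan, np.nan, np.nan, np.nan],      # 0 non-missing values
})

# how='any': drop a column if it has any missing value
any_cols = df.dropna(axis=1, how="any").columns.tolist()
print(any_cols)     # ['a']

# how='all': drop a column only if every value is missing
all_cols = df.dropna(axis=1, how="all").columns.tolist()
print(all_cols)     # ['a', 'b', 'c']

# thresh=3: keep a column only if it has at least 3 non-missing values
thresh_cols = df.dropna(axis=1, thresh=3).columns.tolist()
print(thresh_cols)  # ['a', 'b']
```

The calls are kept separate because recent pandas versions do not allow `how` and `thresh` to be passed together.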

### Replacing / recoding missing values

You can choose to replace missing values with another value. Options include a special value such as zero or the mean of the non-missing values, among others.

The `fillna()` method can be used to replace missing values.

In [46]:
# read poverty data
povData = pd.read_csv(povdata_na_file, sep=',', na_values = {'LiveBirthRate':-1, 'GNI':'*', 'Region':0})
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1.0,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1.0,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1.0,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1.0,Hungary
5,,10.2,16.0,67.2,75.7,1690.0,1.0,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1.0,Romania
7,14.0,9.0,20.2,68.6,74.5,,1.0,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1.0,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1.0,Byelorussian_SSR


In [50]:
# replace missing GNI values with zero.
povData.fillna(value={'GNI':0},inplace=True)
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1.0,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1.0,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,0.0,1.0,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1.0,Hungary
5,,10.2,16.0,67.2,75.7,1690.0,1.0,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1.0,Romania
7,14.0,9.0,20.2,68.6,74.5,0.0,1.0,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1.0,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1.0,Byelorussian_SSR


In [51]:
# replace all remaining missing values with -100
povData.fillna(value=-100,inplace=True)
povData.head(10)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,-100.0,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1.0,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1.0,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,0.0,1.0,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1.0,Hungary
5,-100.0,10.2,16.0,67.2,75.7,1690.0,1.0,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1.0,Romania
7,14.0,9.0,20.2,68.6,74.5,0.0,1.0,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1.0,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1.0,Byelorussian_SSR
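
Replacing missing values with the mean of the non-missing values, mentioned at the start of this section, can be sketched as follows. The data frame is made up for illustration, and `GNI_filled` is a hypothetical column name used so that the original column stays intact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"GNI": [600.0, np.nan, 2400.0, np.nan]})

# mean() skips NaN by default, so this is the mean of the non-missing values
gni_mean = df["GNI"].mean()
print(gni_mean)  # 1500.0

# Fill the gaps with that mean and store the result in a new column
df["GNI_filled"] = df["GNI"].fillna(gni_mean)
print(df["GNI_filled"].tolist())  # [600.0, 1500.0, 2400.0, 1500.0]
```

Whether mean replacement is appropriate depends on the analysis; it leaves the column mean unchanged but reduces its variance.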
