<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Where does Missing Data come from?

In the course of your data science career, you will rarely work with raw data that has no missing values.

Missing data can arise when loading data, when merging datasets, during data entry, or when reindexing, that is, conforming data to a new set of index labels. We will see how each of these can happen in the sections below:

## 1.0 Loading Data



When we load data with pandas, we get a dataframe with the NaN value in each cell that had missing data. This is done by the `read_csv` function, which takes three parameters that control how missing values are read: `na_values`, `keep_default_na` and `na_filter`.

We will now see how these parameters work.

In [1]:
# First, let's import the dataset that we will use for the examples

# Let's import our pandas library 
import pandas as pd

# Let's set the location for our data
url = 'http://bit.ly/UniversityDataSet'

# We then store this dataset in a dataframe
# We use the latin1 encoding because our dataset contains special characters.
# An encoding is one specific way of interpreting bytes: It's a look-up table that says, 
# for example, that a byte with the value 97 stands for 'a'.
#
df = pd.read_csv(url, encoding = "latin1")

# then view the columns of this dataframe, this will help us get familiar with the dataset before working on it
df.columns.values.tolist()

['Year',
 'Org Name',
 'Org Code',
 'Spec Ed. Grad Rate - Grad Cohort #',
 'Spec Ed. Grad Rate - Grad #',
 'Spec Ed. Grad Rate - Grad %',
 'Spec Ed. Dropout - Enr #',
 'Spec Ed. Dropout - Drop #',
 'Spec Ed. Dropout - Drop %',
 'LRE Ages 6-21 - Students #',
 'LRE Ages 6-21 - Full Incl #',
 'LRE Ages 6-21 - Full Incl %',
 'LRE Ages 3-5 - Students #',
 'LRE Ages 3-5 - Full Incl #',
 'LRE Ages 3-5 - Full Incl %',
 'Cohort Completion Year',
 'Substantial growth of knowledge & skills',
 'Survey Period',
 'Surv Meet Std #',
 'Surv Meet Std %',
 'Sch Yr Rev',
 '# of Students Engaged',
 'Dist Rate']

In [2]:
# We then load the first 5 rows of our data to understand it better, noting the missing values in the records.
# Afterwards we will learn more about the parameters
# 
df.head()

Unnamed: 0,Year,Org Name,Org Code,Spec Ed. Grad Rate - Grad Cohort #,Spec Ed. Grad Rate - Grad #,Spec Ed. Grad Rate - Grad %,Spec Ed. Dropout - Enr #,Spec Ed. Dropout - Drop #,Spec Ed. Dropout - Drop %,LRE Ages 6-21 - Students #,...,LRE Ages 3-5 - Full Incl #,LRE Ages 3-5 - Full Incl %,Cohort Completion Year,Substantial growth of knowledge & skills,Survey Period,Surv Meet Std #,Surv Meet Std %,Sch Yr Rev,# of Students Engaged,Dist Rate
0,2012,Abington,10000,27.0,17,63%,84.0,3,3.60%,295.0,...,1,5%,2012-13,-,Spring 2013,,,2010-11,8,100%
1,2013,Abington,10000,20.0,16,80%,70.0,2,2.90%,272.0,...,4,19%,2012-13,-,Spring 2013,11,73.20%,2010-11,8,100%
2,2012,Acton,20000,,,,,,,337.0,...,6,10.30%,2010-11,86.7,Spring 2011,-,-,2012-13,NR,NR
3,2013,Acton,20000,,,,0.0,-,-,335.0,...,16,26.70%,2010-11,86.7,Spring 2011,-,-,2012-13,NR,NR
4,2012,Acushnet,30000,3.0,-,-,0.0,-,-,174.0,...,7,25%,2013-14,-,Spring 2012,-,-,2011-12,,


**i) na_values**

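The `na_values` parameter allows us to specify additional values that should be read as missing (NaN). What are NaN values, you ask?

> *NaN ("not a number") is the floating-point value that pandas uses to mark missing data. It comes from the NumPy library as `numpy.nan`; older NumPy versions also accepted the spellings `NaN` and `NAN`, which were equivalent.*

We usually leave this parameter unset, because pandas already recognises a default set of missing-value markers, such as empty cells and the strings NaN, nan and NA. We only need it when a dataset uses its own placeholders, such as the `-` and `NR` entries in the output above.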
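To make the effect of `na_values` concrete, here is a minimal sketch. It does not use the course dataset; the three-row CSV text and its column names are made up for illustration, and only mimic the `-` and `NR` placeholders seen earlier.

```python
import io
import pandas as pd

# Hypothetical sample (not the course dataset) with an empty cell, '-' and 'NR'
csv_text = "Org Name,Grad #,Dist Rate\nAbington,17,100%\nActon,,NR\nAcushnet,-,\n"

# With the defaults, only the empty cells are read as NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# With na_values, '-' and 'NR' are read as NaN as well
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["-", "NR"])

# Count the missing values per column in each case
print(default_df.isna().sum())
print(custom_df.isna().sum())
```

Counting with `isna().sum()` is a quick way to check whether the placeholders in a file were actually picked up as missing.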

In [3]:
# now we load the data with the default values
print(pd.read_csv(url, encoding = 'latin1').head())

   Year  Org Name  Org Code Spec Ed. Grad Rate - Grad Cohort #  \
0  2012  Abington     10000                                 27   
1  2013  Abington     10000                                 20   
2  2012     Acton     20000                                NaN   
3  2013     Acton     20000                                NaN   
4  2012  Acushnet     30000                                  3   

  Spec Ed. Grad Rate - Grad # Spec Ed. Grad Rate - Grad %  \
0                          17                         63%   
1                          16                         80%   
2                         NaN                         NaN   
3                         NaN                         NaN   
4                           -                           -   

  Spec Ed. Dropout - Enr # Spec Ed. Dropout - Drop #  \
0                       84                         3   
1                       70                         2   
2                      NaN                       NaN   
3           

**ii) keep_default_na**

The second parameter is `keep_default_na`, which is a bool (boolean). It specifies whether pandas' default missing-value markers are used when reading the data. By default this parameter is True, meaning any values specified with the `na_values` parameter are appended to the default list of missing values.

We can also set `keep_default_na=False`, which means that only the values specified in `na_values` are treated as missing.
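The interaction between `keep_default_na` and `na_values` can be sketched on a small in-memory CSV. The names and scores below are hypothetical and chosen only so that each row uses a different marker: the string `NA`, an empty cell, and `-`.

```python
import io
import pandas as pd

# Hypothetical sample: one 'NA' string, one empty cell and one '-'
csv_text = "name,score\nAmina,NA\nBrian,\nChege,-\n"

# Defaults only: 'NA' and the empty cell are missing, '-' is kept as text
defaults_only = pd.read_csv(io.StringIO(csv_text))

# Defaults plus '-': all three scores are missing
defaults_plus = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

# Only '-': 'NA' and the empty cell are kept as ordinary strings
custom_only = pd.read_csv(io.StringIO(csv_text), na_values=["-"], keep_default_na=False)

for frame in (defaults_only, defaults_plus, custom_only):
    print(frame["score"].isna().sum())
```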

In [4]:
# Example 2
# We then load the data without default missing values 
# 
print(pd.read_csv(url, encoding = 'latin1', keep_default_na = False).head())

   Year  Org Name  Org Code Spec Ed. Grad Rate - Grad Cohort #  \
0  2012  Abington     10000                                 27   
1  2013  Abington     10000                                 20   
2  2012     Acton     20000                                      
3  2013     Acton     20000                                      
4  2012  Acushnet     30000                                  3   

  Spec Ed. Grad Rate - Grad # Spec Ed. Grad Rate - Grad %  \
0                          17                         63%   
1                          16                         80%   
2                                                           
3                                                           
4                           -                           -   

  Spec Ed. Dropout - Enr # Spec Ed. Dropout - Drop #  \
0                       84                         3   
1                       70                         2   
2                                                      
3           

In [5]:
# Challenge 1: 
# Using the keep_default_na parameter, load the data with the default missing values below
# 
print(pd.read_csv(url, encoding = 'latin1', keep_default_na = True).head())

   Year  Org Name  Org Code Spec Ed. Grad Rate - Grad Cohort #  \
0  2012  Abington     10000                                 27   
1  2013  Abington     10000                                 20   
2  2012     Acton     20000                                NaN   
3  2013     Acton     20000                                NaN   
4  2012  Acushnet     30000                                  3   

  Spec Ed. Grad Rate - Grad # Spec Ed. Grad Rate - Grad %  \
0                          17                         63%   
1                          16                         80%   
2                         NaN                         NaN   
3                         NaN                         NaN   
4                           -                           -   

  Spec Ed. Dropout - Enr # Spec Ed. Dropout - Drop #  \
0                       84                         3   
1                       70                         2   
2                      NaN                       NaN   
3           

**iii) na_filter**

Our third parameter is `na_filter`, which is a bool that specifies whether pandas looks for missing values at all. By default `na_filter=True`, meaning that missing values will be coded as NaN. If `na_filter=False`, then nothing will be recorded as missing, and the `na_values` and `keep_default_na` parameters are ignored. We normally use this setting when we want to switch off both of those parameters at once, or when we want to load a large file that we know has no missing values more quickly.
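As a small illustration of `na_filter`, the sketch below reads the same made-up CSV text twice. The data is hypothetical and only meant to show that, with the filter off, values come through exactly as written in the file.

```python
import io
import pandas as pd

# Hypothetical sample: one 'NA' string, one empty cell and one number
csv_text = "name,score\nAmina,NA\nBrian,\nChege,12\n"

# na_filter=True is the default: 'NA' and the empty cell become NaN
filtered = pd.read_csv(io.StringIO(csv_text))

# na_filter=False: no value is converted to NaN, so the column stays as text
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(filtered["score"].isna().sum())
print(unfiltered["score"].tolist())
```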



In [6]:
# Example 3
# Let's set the na_filter parameter to True
# 
print(pd.read_csv(url, na_filter = True, encoding ='latin1').head())

   Year  Org Name  Org Code Spec Ed. Grad Rate - Grad Cohort #  \
0  2012  Abington     10000                                 27   
1  2013  Abington     10000                                 20   
2  2012     Acton     20000                                NaN   
3  2013     Acton     20000                                NaN   
4  2012  Acushnet     30000                                  3   

  Spec Ed. Grad Rate - Grad # Spec Ed. Grad Rate - Grad %  \
0                          17                         63%   
1                          16                         80%   
2                         NaN                         NaN   
3                         NaN                         NaN   
4                           -                           -   

  Spec Ed. Dropout - Enr # Spec Ed. Dropout - Drop #  \
0                       84                         3   
1                       70                         2   
2                      NaN                       NaN   
3           

In [7]:
# Challenge 2
# Now let's see what happens when we set the na_filter parameter to False
# 
print(pd.read_csv(url, na_filter = False, encoding ='latin1').head())

   Year  Org Name  Org Code Spec Ed. Grad Rate - Grad Cohort #  \
0  2012  Abington     10000                                 27   
1  2013  Abington     10000                                 20   
2  2012     Acton     20000                                      
3  2013     Acton     20000                                      
4  2012  Acushnet     30000                                  3   

  Spec Ed. Grad Rate - Grad # Spec Ed. Grad Rate - Grad %  \
0                          17                         63%   
1                          16                         80%   
2                                                           
3                                                           
4                           -                           -   

  Spec Ed. Dropout - Enr # Spec Ed. Dropout - Drop #  \
0                       84                         3   
1                       70                         2   
2                                                      
3           

## 1.1 Merging Data


Missing data can also result from merging datasets. We will see this when we create a merged table from the two datasets below.

Missing values in the merged table may be carried over from either dataset. They can also appear when neither dataset has any missing values, for example when the datasets have different columns and a key in one dataset has no match in the other.



In [None]:
apple_orange_dataset = 'http://bit.ly/AppleOrangeDataSet'
stability_dataset    = 'http://bit.ly/StabilityDataset'

apple_orange_dataset = pd.read_csv(apple_orange_dataset, encoding ='latin1')
stability_dataset    = pd.read_csv(stability_dataset, encoding ='latin1')

The apple-orange dataset contains production volumes of apples and oranges, while the stability dataset contains country stability indicators.

In [None]:
# Displaying our first dataset
#
apple_orange_dataset.head(10)

In [None]:
# Displaying our second dataset
# 
stability_dataset.head(10)

In [None]:
# Now merging the two dataframes and analysing the output
merged_dataset = apple_orange_dataset.merge(stability_dataset, left_on='FAOSTAT 2013', right_on='FAOSTAT 2013')

# Displaying our merged dataset
merged_dataset.head(20)
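To see how a merge can introduce missing values by itself, consider this minimal sketch with two toy tables. The tables, the `country` key and the numbers are hypothetical and are not taken from the datasets above. Neither table has a missing value, yet the outer merge does.

```python
import pandas as pd

# Hypothetical toy tables; neither contains a missing value
production = pd.DataFrame({"country": ["Kenya", "Uganda", "Tanzania"],
                           "apples": [10, 20, 30]})
stability = pd.DataFrame({"country": ["Kenya", "Uganda", "Rwanda"],
                          "stability": [0.4, 0.5, 0.6]})

# The default merge is an inner join: only keys found in both tables are kept
inner = production.merge(stability, on="country")

# An outer join keeps unmatched keys and fills the gaps with NaN
outer = production.merge(stability, on="country", how="outer")

print(inner)
print(outer)
```

With `how="outer"`, Tanzania has no stability value and Rwanda has no apples value, so each of those rows gets one NaN. The choice of `how` therefore decides whether unmatched keys are dropped or kept with missing values.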

In [None]:
# Challenge 4:
# Merge the following two manifesto project datasets and 
# discuss with your peer the origin of the missing values in the merged dataset
#
party_names_dataset  = 'http://bit.ly/PartyNamesMarporDataSet'
party_names_dataset2 = 'http://bit.ly/PartyNamesPledges'

OUR CODE GOES HERE

## 1.2 User Input Values


Missing values within a dataset can also come from user mistakes during the data entry or collection phase. They can also arise when we create a vector of values from a calculation, or when we manually curate a vector, as shown in the examples below.
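As a sketch of the "calculation" case, adding two Series whose labels do not fully overlap produces NaN for every label that appears in only one of them. The harvest figures below are invented for illustration.

```python
import pandas as pd

# Hypothetical figures; the two Series share only the 'Kenya' label
harvest_2022 = pd.Series({"Kenya": 12.0, "Uganda": 8.0})
harvest_2023 = pd.Series({"Kenya": 15.0, "Tanzania": 9.0})

# pandas aligns on the index before adding, so unmatched labels become NaN
total = harvest_2022 + harvest_2023

print(total)
```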

In [None]:
# Importing nan from numpy so as to represent missing values
#
from numpy import nan

In [None]:
# Creating a missing value in a series
# NB: A Series is the data structure for a single column of a DataFrame. 
# 
population = pd.Series({'Kenya': 47, 'Tanzania': nan, 'Uganda': 30})

# Let's now print the series. Note that NaNs are also valid values for both Series and DataFrames
#
OUR CODE GOES HERE

In [None]:
# Challenge 5:
# Create a series with the names and ages of your peers. Note nan for the ages of peers who you don't know.
# 
OUR CODE GOES HERE

In [None]:
# Now creating a missing value in a dataframe
list_data = [['Steve', 56, 'Nairobi', 'no'],['Bill', 57, 'Machakos', 'yes'],['Richard', nan, nan, 'yes']]

# Defining our columns
age_df = pd.DataFrame(list_data, columns=['Name', 'Age', 'Hometown', 'Public Commute'])

# Printing age_df
age_df

In [None]:
# We can also assign a column of missing values to a dataframe directly
#
age_df['Public Commute'] = nan

# print age_df
#
OUR CODE GOES HERE

In [None]:
# Challenge 6
# Create a dataframe with a list of your country's presidents 
# with their no. of years in office and their last year in office. 
# NB: You do not know the last year of office for the current president. 
#
OUR CODE GOES HERE

## 1.3 Re-indexing


Missing values can also be introduced when we reindex our dataframe. We will show this using a dataset from Gapminder.

In [None]:
# Getting our dataset
gapminder_dataset    = 'http://bit.ly/GapMinderKenyaDataSet'
gapminder_dataset_df = pd.read_csv(gapminder_dataset, encoding ='latin1')

# printing the dataframe
print(gapminder_dataset_df)

In [None]:
# now performing a groupby operation so as to subset the data
life_exp = gapminder_dataset_df.groupby(['year'])['lifeExp'].mean()
print(life_exp)

In [None]:
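# We then reindex by selecting a range of years with 'loc', continuing from the code above
# NB: if any year in the range is not in the index, older versions of pandas return NaN
# for it with a warning, while pandas 1.0 and later raise a KeyError
print(life_exp.loc[range(2000, 2007), ])

In older versions of pandas, the code above runs with a warning that suggests using the reindex() method instead. We will do this below by first subsetting the data separately and then calling reindex().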

In [None]:
# subsetting the dataframe
y_2000_df = life_exp[life_exp.index > 2000] 
print(y_2000_df)

In [None]:
# then reindexing
print(y_2000_df.reindex(range(2000, 2010)))
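The same effect can be reproduced without downloading anything. The sketch below uses a small hand-made Series; the two life expectancy values are hypothetical stand-ins, not figures from the Gapminder dataset.

```python
import pandas as pd

# Hypothetical values for two survey years (not taken from the Gapminder data)
life_exp_sample = pd.Series([50.9, 54.1], index=[2002, 2007], name="lifeExp")

# reindex() conforms the Series to the new labels; labels with no data get NaN
reindexed = life_exp_sample.reindex(range(2000, 2010))

print(reindexed)
print(reindexed.isna().sum())
```

Only 2002 and 2007 exist in the original index, so every other year in the new index is filled with NaN.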

In [None]:
# Challenge 7
# We want to look at only the years from 2000 to 2010 in the gapminder dataset below. 
# Perform a grouped operation, subset the data and then reindex it. 
# Show how this process might result in missing values.

# Assigning the url for our dataset
data = 'http://bit.ly/GapMinderUgandaDataSet'

OUR CODE GOES HERE