### Importing necessary packages

In [1]:
import pandas as pd
import numpy as np 

### Introduction

Missing data occurs when no information is provided for one or more items, or for a whole unit. It is a common problem in real-world datasets. For example, in a survey some respondents may choose not to share their income and others may choose not to share their address, so the income and address values are missing for those users.

In pandas, missing data is represented by two values:

    - None: None is a Python singleton object that is often used for missing data in Python code.
    - NaN: NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE-754 floating-point representation. In Python, it is available as the constant np.nan in NumPy.
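
To make the difference between the two markers concrete, here is a small sketch of how pandas treats `None` and `np.nan`. The variable names `numeric` and `labels` are only illustrative.

```python
import numpy as np
import pandas as pd

# In a numeric Series, both None and np.nan end up stored as NaN (dtype float64)
numeric = pd.Series([1, None, np.nan])
print(numeric.dtype)
print(numeric.isna().tolist())

# In an object Series, isna() flags both None and NaN as missing
labels = pd.Series(["a", None, np.nan], dtype=object)
print(labels.isna().tolist())

# NaN never compares equal to itself, so test with isna() rather than ==
print(np.nan == np.nan)
```

Because of that last point, a comparison such as `df == np.nan` will not find missing values; `isna()` or `isnull()` is the reliable check.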


### Creating a simple DataFrame 

In [2]:
# dictionary of lists 
dict1 = { 'Names':['Saumu','Alice','Smith',np.nan,'Bob'],
        'Math':[90, 70, np.nan, np.nan,95], 
        'English': [86, 65, 56, np.nan,np.nan], 
        'History':[np.nan, 60, 72, np.nan,98]} 
  
# creating a dataframe from the dictionary above 
df = pd.DataFrame(dict1) 
    
df

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,
1,Alice,70.0,65.0,60.0
2,Smith,,56.0,72.0
3,,,,
4,Bob,95.0,,98.0


### 1. Checking for missing values using isnull()

In [3]:
df1= df.copy()
df1.isnull()  # or df1.isna()
# The inverse of isnull() is notnull()

Unnamed: 0,Names,Math,English,History
0,False,False,False,True
1,False,False,False,False
2,False,True,False,False
3,True,True,True,True
4,False,False,True,False
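
The boolean frame returned by `isnull()` can also be used as a filter. The following is a minimal sketch, rebuilding the same DataFrame so the snippet runs on its own; `has_missing`, `rows_with_gaps` and `complete_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Saumu", "Alice", "Smith", np.nan, "Bob"],
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

# True for every row that contains at least one missing value
has_missing = df.isna().any(axis=1)

rows_with_gaps = df[has_missing]   # rows that need attention
complete_rows = df[~has_missing]   # rows with no missing values

print(rows_with_gaps.index.tolist())
print(complete_rows.index.tolist())
```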


### 2. Counting missing values: per column and for the whole DataFrame

In [4]:
df2 = df.copy()
# number of missing values for the whole dataframe
print(sum(df2.isna().sum()))
print(70*"#")
# number of missing values for each column
print(df2.isna().sum())
print(70*"#")
# number of missing values in a specific column. Syntax: df2["name_of_col"].isna().sum()
df2["Math"].isna().sum()


7
######################################################################
Names      1
Math       2
English    2
History    2
dtype: int64
######################################################################


2
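
The same idea extends to rows and to proportions. As a sketch (the names `missing_per_row` and `missing_share` are illustrative), summing along `axis=1` counts missing values per row, and taking the mean of the boolean frame gives the fraction missing in each column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Saumu", "Alice", "Smith", np.nan, "Bob"],
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)
print(missing_per_row.tolist())

# Fraction of missing values in each column (True counts as 1, False as 0)
missing_share = df.isna().mean()
print(missing_share)
```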

### 3. Filling missing values

#### a) Using fillna( )

In [5]:
# Example of replacing every NaN value with 0. This also fills the
# non-numeric Names column, as the output shows.
# Remember that replacing NaN values with zero may not make sense
# because it will affect the statistics (for example, the column mean).
df3 = df.copy()
df3.fillna(0) # fillna returns a new DataFrame; df3 itself is unchanged
# unless you assign the result back, i.e., df3 = df3.fillna(0)

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,0.0
1,Alice,70.0,65.0,60.0
2,Smith,0.0,56.0,72.0
3,0,0.0,0.0,0.0
4,Bob,95.0,0.0,98.0
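
Since filling with zero distorts the statistics, one common alternative is to pass `fillna` a dictionary with a different fill value per column. The sketch below uses the column mean for the scores and a placeholder string for names; the choice of mean and of the label "Unknown" is an assumption made for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Saumu", "Alice", "Smith", np.nan, "Bob"],
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

# One fill value per column: a label for names, the column mean for scores
fill_values = {
    "Names": "Unknown",
    "Math": df["Math"].mean(),
    "English": df["English"].mean(),
    "History": df["History"].mean(),
}

filled = df.fillna(fill_values)
print(filled)
```

By default `mean()` skips NaN values, so each mean is computed from the observed scores only.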


In [6]:
# Filling null values with the previous ones

In [7]:
df4 = df.copy()
df4.fillna(method="pad") 
# Note that this method cannot fill NaN values in the first row
# because the first row has no PREVIOUS VALUE.

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,
1,Alice,70.0,65.0,60.0
2,Smith,70.0,56.0,72.0
3,Smith,70.0,56.0,72.0
4,Bob,95.0,56.0,98.0


In [8]:
# Filling null value with the next ones
df5 = df.copy()
df5.fillna(method="bfill") 
# Note that this method cannot fill NaN values in the last row
# because the last row has no NEXT VALUE.


Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,60.0
1,Alice,70.0,65.0,60.0
2,Smith,95.0,56.0,72.0
3,Bob,95.0,,98.0
4,Bob,95.0,,98.0
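
Recent pandas releases mark the `method` argument of `fillna` as deprecated in favour of the dedicated `ffill()` and `bfill()` methods. The sketch below shows these equivalents and, as an illustrative extra, chains them so that gaps in the first and last rows are both covered.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Saumu", "Alice", "Smith", np.nan, "Bob"],
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

forward = df.ffill()    # same idea as fillna(method="pad")
backward = df.bfill()   # same idea as fillna(method="bfill")

# Forward fill first, then backward fill whatever is still missing at the top
both = df.ffill().bfill()

print(both)
```

Whether copying a neighbouring row's value is meaningful depends on the data; it suits ordered data such as time series better than unrelated records like the ones here.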


#### b) Filling null values using the replace( ) method

In [9]:
df6 = df.copy()
# replace NaN values with a specific value, -999
df6.replace(np.nan,-999)

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,-999.0
1,Alice,70.0,65.0,60.0
2,Smith,-999.0,56.0,72.0
3,-999,-999.0,-999.0,-999.0
4,Bob,95.0,-999.0,98.0
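
A sentinel such as -999 behaves like a real number in calculations, so `replace()` is often more useful in the opposite direction: turning sentinel codes back into NaN. This is a rough sketch restricted to the numeric columns; the names `scores`, `sentinel` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

# Replacing NaN with a sentinel pulls the sentinel into every statistic
sentinel = scores.replace(np.nan, -999)
print(sentinel["Math"].mean())

# Converting the sentinel back to NaN lets pandas skip those entries again
restored = sentinel.replace(-999, np.nan)
print(restored["Math"].mean())
```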


#### c) Filling missing values with interpolate( ) using the linear method

In [10]:
df6 = df.copy()
df6.interpolate(method ='linear', limit_direction ='forward') 
# Note that the linear method ignores the index and treats the values as equally spaced.
# Another option for limit_direction is "backward".

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,
1,Alice,70.0,65.0,60.0
2,Smith,78.333333,56.0,72.0
3,,86.666667,56.0,85.0
4,Bob,95.0,56.0,98.0


In [11]:
# As the output shows, the NaN in the first row of History is not filled: the fill direction is
# forward and there is no previous value that could be used in the interpolation. Trailing NaN
# values, as in English, are filled with the last valid value. The non-numeric Names column is left unchanged.

# The same explanation holds for backward linear interpolation, except that the NaN values at the
# end of a column (here in English) are the ones left unfilled.

In [12]:
df6 = df.copy()
df6.interpolate(method ='linear', limit_direction ='backward') 

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,60.0
1,Alice,70.0,65.0,60.0
2,Smith,78.333333,56.0,72.0
3,,86.666667,,85.0
4,Bob,95.0,,98.0
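
To cover both ends in a single call, `limit_direction` also accepts "both". The sketch below applies it to the numeric columns only, since interpolation is not defined for text columns such as Names; selecting them with `select_dtypes` is a choice made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Saumu", "Alice", "Smith", np.nan, "Bob"],
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

# Keep only the numeric columns before interpolating
scores = df.select_dtypes(include="number")

# Interior gaps are interpolated linearly; leading and trailing gaps
# take the nearest valid value
filled = scores.interpolate(method="linear", limit_direction="both")
print(filled)
```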


### 4. Drop Missing Values

In [17]:
# Dropping any row with at least one missing value
df7 = df.copy()
df7.dropna(axis=0) # axis=1 for dropping columns

Unnamed: 0,Names,Math,English,History
1,Alice,70.0,65.0,60.0


In [14]:
# Dropping rows if all values in that row are missing.
df8 = df.copy()
df8.dropna(how = 'all') 

Unnamed: 0,Names,Math,English,History
0,Saumu,90.0,86.0,
1,Alice,70.0,65.0,60.0
2,Smith,,56.0,72.0
4,Bob,95.0,,98.0
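
Between "any" and "all" there are finer-grained options. As a sketch, `subset` restricts the check to chosen columns and `thresh` sets the minimum number of non-missing values required to keep a row or column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Saumu", "Alice", "Smith", np.nan, "Bob"],
    "Math": [90, 70, np.nan, np.nan, 95],
    "English": [86, 65, 56, np.nan, np.nan],
    "History": [np.nan, 60, 72, np.nan, 98],
})

# Drop rows only when the Math score is missing
has_math = df.dropna(subset=["Math"])
print(has_math.index.tolist())

# Keep rows that have at least 4 non-missing values
mostly_complete_rows = df.dropna(thresh=4)
print(mostly_complete_rows.index.tolist())

# Keep columns that have at least 4 non-missing values
mostly_complete_cols = df.dropna(axis=1, thresh=4)
print(mostly_complete_cols.columns.tolist())
```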


In [None]:
# Drop column with one mi

### EXERCISE AND SOLUTION (Handling missing values in Employee data)

In [20]:
# Loading employee.csv file
employee_df = pd.read_csv("employee.csv")
# Displaying first seven rows
employee_df.head(7)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Rose,Female,7/6/2002,3:57 PM,63494,19.385,True,Human Resources
1,Rose,Female,8/25/2002,5:12 AM,134505,11.051,True,Marketing
2,Randy,,2/6/1986,3:04 PM,133943,8.94,True,Sales
3,Steven,Male,11/21/2006,8:30 AM,83706,6.96,True,Human Resources
4,Christopher,Male,4/22/2000,10:15 AM,37919,11.449,False,
5,Jane,Female,1/12/1992,1:23 PM,51923,13.623,False,Business Development
6,Joe,Male,12/8/1998,10:28 AM,126120,1.02,False,


In [25]:
# Replace missing values in the Gender column with "No Gender"
# Count the number of missing values in each column
# Drop rows with missing First Name


In [24]:
employee_df.isna().sum()

First Name            67
Gender               145
Start Date             0
Last Login Time        0
Salary                 0
Bonus %                0
Senior Management     67
Team                  43
dtype: int64
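
One possible way to work through the three exercise steps is sketched here. Because `employee.csv` is not included, the snippet builds a small hypothetical stand-in that reuses a few of the real column names; the rows themselves are made up for illustration. With the real file loaded, the same three statements should apply to `employee_df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employee.csv (invented rows, real column names)
employee_df = pd.DataFrame({
    "First Name": ["Rose", "Randy", np.nan, "Joe"],
    "Gender": ["Female", np.nan, "Male", np.nan],
    "Team": ["Marketing", "Sales", np.nan, np.nan],
})

# 1. Replace missing values in the Gender column with "No Gender"
employee_df["Gender"] = employee_df["Gender"].fillna("No Gender")

# 2. Count the number of missing values in each column
missing_counts = employee_df.isna().sum()
print(missing_counts)

# 3. Drop rows with a missing First Name
employee_df = employee_df.dropna(subset=["First Name"])
print(employee_df)
```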