## Missing Data in DataFrame
In many scenarios, the data you are working with will not have values in every column.
Some columns may contain junk records, and others may be empty or missing (also referred to as NA). NaN (Not a Number) is the default missing-value marker, and it comes from NumPy. Pandas is built on top of NumPy, so it uses the same value to represent a non-existent entry, which helps computational speed. In some cases None may also appear, and it has to be treated as missing ("NA") as well.
By default, NaN is of float type. If an integer column has at least one missing value, the entire column is typed as float.
This section looks at ways to work with this kind of data in pandas.
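
The float conversion described above is easy to check on a small Series. The sketch below is only an illustration: the variable names and sample values are made up for this example, and the last line uses pandas' nullable "Int64" dtype, which is an optional alternative rather than something the rest of this section relies on.

```python
import numpy as np
import pandas as pd

# An integer column without gaps keeps an integer dtype
complete = pd.Series([28, 34, 40])

# A single NaN forces the whole column to float64, because NaN is a float
with_gap = pd.Series([28, 34, np.nan])

# None in a numeric column is converted to NaN as well
with_none = pd.Series([28, None, 40])

# Optional: the nullable "Int64" dtype keeps integers alongside missing values
nullable = pd.Series([28, None, 40], dtype="Int64")

print(complete.dtype)
print(with_gap.dtype)             # float64
print(with_none.dtype)            # float64
print(with_none.isna().tolist())  # [False, True, False]
print(nullable.dtype)             # Int64
```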

In [117]:
# Importing pandas and numpy modules
import pandas as pd
import numpy as np
# Create a Dict object with some missing and invalid data
people = {"first":["Christie", "Ian", "John", "Donald", np.nan, None, 'Not available'],
          "last":["Bell", "Miller", "Smith", "Jones", np.nan, np.nan, 'Not available'],
          "email":["christie.bell@email.com", "ian.miller@email.com", "john.smith@email.com", "donald.jones@domain.com", np.nan, np.nan, 'NA'],
          "age": ['28', '34', '40', '55', np.nan, 'not known', None]
        }

In [118]:
# Creating the DataFrame
people_df = pd.DataFrame(people)
people_df

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55
4,,,,
5,,,,not known
6,Not available,Not available,NA,


### Checking for missing data
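The isna() and isnull() functions check whether a field contains a missing entry. Placeholder strings such as 'Not available', 'NA' or 'not known' are ordinary values, so they are not flagged as missing.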

In [119]:
people_df.isna()  # isnull() is an alias and does exactly the same thing

,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,True,True,True,True
5,True,True,True,False
6,False,False,False,True
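
The boolean table from isna() is often easier to use once it is summarised. The sketch below is a minimal illustration on a small, made-up frame (sample_df and its values are assumptions for this example, not the people_df above): it counts missing entries per column and flags the rows that contain any.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; names and values are hypothetical
sample_df = pd.DataFrame({
    "first": ["Christie", np.nan, None],
    "age": ["28", np.nan, "not known"],
})

# True counts as 1, so summing the mask counts missing entries per column
missing_per_column = sample_df.isna().sum()

# any(axis=1) flags each row that has at least one missing entry
rows_with_missing = sample_df.isna().any(axis=1)

# notna() is the inverse of isna()
present_mask = sample_df.notna()

print(missing_per_column)
print(rows_with_missing.tolist())  # [False, True, True]
```

Note that 'not known' is counted as present, since it is a regular string.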


### Deleting the missing data
To delete records with missing values, use the dropna() function. Its how keyword argument controls when a record is dropped ('any' or 'all' values missing), and its axis keyword argument controls whether rows or columns are dropped.

In [120]:
# The defaults for dropna() are axis='index' and how='any'
# The other accepted values are axis='columns' and how='all'
people_df.dropna()  # Removes index 4, 5 and 6, as each has at least one missing value

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55


In [121]:
people_df.dropna(how='all', inplace=False)  # Removes only index 4, as all of its columns are missing

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55
5,,,,not known
6,Not available,Not available,NA,


In [122]:
# Drop a record only if all of the listed columns ('last' and 'email') are missing
people_df.dropna(axis='index', how='all', subset=['last', 'email'], inplace=False)

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55
6,Not available,Not available,NA,
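
The examples above all drop rows. As a minimal sketch of the other options, the snippet below uses a small made-up frame (sample_df and the notes column are assumptions for this example) to drop an entirely empty column with axis='columns', and to keep only rows with a minimum number of non-missing values through the thresh keyword argument of dropna().

```python
import numpy as np
import pandas as pd

# Small illustrative frame; names and values are hypothetical
sample_df = pd.DataFrame({
    "first": ["Christie", "Ian", np.nan],
    "last": ["Bell", np.nan, np.nan],
    "notes": [np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing ('notes' here)
no_empty_columns = sample_df.dropna(axis="columns", how="all")

# Keep only rows that have at least two non-missing values
at_least_two = sample_df.dropna(thresh=2)

print(list(no_empty_columns.columns))  # ['first', 'last']
print(at_least_two.index.tolist())     # [0]
```

thresh is passed on its own here, without how, since the two express competing rules for when a row is dropped.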


### Filling missing values
A missing value can be replaced with a meaningful value using the fillna() function. This is mostly needed before performing an aggregation on a numerical column.

In [123]:
# Fill NaN values with the given value, in the age column only
people_df['age'] = people_df['age'].fillna('0')
people_df

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55
4,,,,0
5,,,,not known
6,Not available,Not available,NA,0


In [124]:
people_df.fillna('-', inplace=True)
people_df

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55
4,-,-,-,0
5,-,-,-,not known
6,Not available,Not available,NA,0
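
Because the fill value feeds directly into any later aggregation, it is worth checking how it changes the result. The sketch below is an illustration on a made-up Series of ages stored as strings (the names and values are assumptions for this example): it converts the column to numbers with pd.to_numeric(), then compares the mean when missing values are skipped, filled with 0, or filled with the column mean.

```python
import numpy as np
import pandas as pd

# Ages stored as strings, with missing and invalid entries (hypothetical data)
ages = pd.Series(["28", "34", np.nan, "not known", None])

# errors='coerce' turns anything that is not a number into NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")

# Aggregations skip NaN by default, so only 28 and 34 are averaged
mean_skipping = numeric_ages.mean()

# Filling with 0 pulls the mean down, since the zeros are now counted
mean_zero_filled = numeric_ages.fillna(0).mean()

# Filling with the column mean leaves the mean unchanged
mean_mean_filled = numeric_ages.fillna(mean_skipping).mean()

print(mean_skipping)     # 31.0
print(mean_zero_filled)  # 12.4
print(mean_mean_filled)  # 31.0
```

Which fill value is appropriate depends on what the aggregate is meant to represent; 0 is rarely a neutral choice for a column such as age.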


### Replacing invalid data
Invalid or bad data can be replaced with the desired data using the replace() function.

In [125]:
# Dict mapping each invalid value to the value that should replace it
replacement = {'Not available': '-',
               'NA': '-',
               'not known': '0'
              }
people_df.replace(replacement, inplace=True)
people_df

,first,last,email,age
0,Christie,Bell,christie.bell@email.com,28
1,Ian,Miller,ian.miller@email.com,34
2,John,Smith,john.smith@email.com,40
3,Donald,Jones,donald.jones@domain.com,55
4,-,-,-,0
5,-,-,-,0
6,-,-,-,0
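
replace() can also map placeholder strings to NaN instead of to another string, so that isna(), dropna() and fillna() then treat them as missing. The sketch below is one possible approach on a small made-up frame (sample_df, the placeholders mapping and the values are assumptions for this example), not a step from the walkthrough above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; names and values are hypothetical
sample_df = pd.DataFrame({
    "first": ["Christie", "Not available"],
    "email": ["christie.bell@email.com", "NA"],
    "age": ["28", "not known"],
})

# Map every placeholder string to NaN so pandas sees it as missing
placeholders = {"Not available": np.nan, "NA": np.nan, "not known": np.nan}
cleaned_df = sample_df.replace(placeholders)

# The placeholders are now reported by isna() and removed by dropna()
missing_counts = cleaned_df.isna().sum()
complete_rows = cleaned_df.dropna()

print(missing_counts)
print(complete_rows.index.tolist())  # [0]
```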
