# Filling missing data and replacing incorrect data

In an organization, data comes from many different sources across the globe, each following its own standards and quality controls. When working with such data, we often see missing or incorrect values in a particular field. For example:
•	" " (a space) in an amount field
•	"$", "?", "aa" or "," in an amount field (an amount field should be numeric so that it can be used in mathematical calculations)
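
A minimal sketch of how such values can be flagged, assuming a small made-up `amount` column (the sample values are illustrative and not taken from the dataset used below). `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed as a number into `NaN`, so entries that were present in the raw data but failed to parse are the suspicious ones.

```python
import pandas as pd

# Hypothetical sample of an amount field as it might arrive from a source system
amount = pd.Series(["120.50", " ", "$", "?", "aa", ",", "75", None])

# Values that cannot be parsed as numbers become NaN
amount_numeric = pd.to_numeric(amount, errors="coerce")

# Entries that were present in the raw data but are not valid numbers
invalid_mask = amount.notna() & amount_numeric.isna()
invalid_values = amount[invalid_mask].tolist()

print(invalid_values)
```

Keeping the raw column next to the converted one makes it possible to tell a value that was never supplied apart from one that was supplied but is unusable.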

Most of these errors come from human data-entry mistakes, data that was intentionally not updated, equipment errors, faulty measurements, an incorrect ETL (Extraction, Transformation and Loading) process, or other faulty processes. Faulty or missing data can lead to incorrect calculations, which in turn can produce a biased or underfit model.
Data can be missing completely at random (due to the reasons explained above), or because a field is optional: for example, a form where the cell number is mandatory but the home number is optional, or a survey question that can be skipped.

So before tackling a machine learning problem, it is of the utmost importance to clean, standardize and prepare the data so that it can be used effectively to build efficient machine learning models. Preparing the data takes up the majority of a data scientist's or data engineer's time (roughly 50% to 70% of the total).

In this tutorial we will learn how to identify missing or incorrect data and how to replace it. We will use Python code to illustrate each step.

Data - we will use the height and weight dataset.


In [2]:
# Import libraries for data manipulation
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22

In [5]:
# Load the dataset
dataset = pd.read_csv("C://Users//jagan//OneDrive//Documents//jagannathbanerjee.com//Blog//Height_Weight_Data.csv")
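
If a file is already known to use placeholder strings for blanks, `pd.read_csv` can mark them as missing at load time through its `na_values` parameter. The sketch below is only an illustration: it reads an in-memory CSV with made-up rows and made-up placeholders (`?`, `aa`) instead of the real file, so the values are assumptions.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Index,Height(Inches),Weight(Pounds)
1,65.78,112.99
2,71.52,?
3,,153.03
4,68.22,aa
"""

# Treat the listed placeholder strings as missing, in addition to pandas' defaults
sample = pd.read_csv(io.StringIO(csv_text), na_values=["?", "aa"])

print(sample.dtypes)
print(sample.isna().sum())
```

This only helps once the placeholders are known. In this tutorial we load the file as-is and look for the problems afterwards.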

### Step 1 : Get the shape of the dataset (rows, columns) 

In [8]:
dataset.shape

(200, 3)

#### Observation - we have 200 rows and 3 columns.

### Step 2 : Get the column names

In [14]:
list(dataset.columns)

['Index', 'Height(Inches)', 'Weight(Pounds)']

#### We have 3 columns: 'Index', 'Height(Inches)' and 'Weight(Pounds)'. It is good practice to remove parentheses from column names and use underscores instead.

In [17]:
# Renaming the columns using pandas rename function.
dataset = dataset.rename(columns={'Height(Inches)':'Height_in_Inches', 'Weight(Pounds)' : 'Weight_in_Pounds'})

# Getting the column names again
list(dataset.columns)

['Index', 'Height_in_Inches', 'Weight_in_Pounds']
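
When a dataset has many columns, renaming them one by one becomes tedious. As a rough sketch, `rename` also accepts a function, so the cleanup can be applied to every column at once. The helper `clean_column_name` is a hypothetical name introduced here, and the names it produces differ slightly from the hand-written ones above (`Height_Inches` rather than `Height_in_Inches`).

```python
import re
import pandas as pd

def clean_column_name(name: str) -> str:
    """Replace each run of non-alphanumeric characters with a single underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Drop underscores left over at either end, e.g. from a closing parenthesis
    return cleaned.strip("_")

# Empty frame with the original column names, used only to demonstrate the renaming
df = pd.DataFrame(columns=["Index", "Height(Inches)", "Weight(Pounds)"])
df = df.rename(columns=clean_column_name)

print(list(df.columns))
```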

### Step 3 : Get the column information or Datatype

In [18]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
Index               200 non-null int64
Height_in_Inches    197 non-null float64
Weight_in_Pounds    198 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 4.8+ KB


#### Look out!
Weight_in_Pounds is non-numeric, which raises a red flag. We expect weight to be a numeric field, but it turns out to be of type object (non-numeric). We will examine this field further.
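
To see why this happens, here is a small sketch with two made-up one-column CSV snippets (the column name `w` and the values are assumptions). A single entry that cannot be parsed as a number is enough for `read_csv` to stop treating the whole column as numeric.

```python
import io
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical column with only valid numbers
clean = pd.read_csv(io.StringIO("w\n112.99\n136.49\n"))

# Same column with one stray character in place of a number
dirty = pd.read_csv(io.StringIO("w\n112.99\n?\n"))

print(clean["w"].dtype, is_numeric_dtype(clean["w"]))
print(dirty["w"].dtype, is_numeric_dtype(dirty["w"]))
```

In the second frame even the valid entry `112.99` is stored as text, which is why `describe()` skips such a column by default.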

### Step 4 : Getting basic statistics for the columns

In [20]:
dataset.describe()  # Basic Statistics on Numeric Columns

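|       | Index      | Height_in_Inches |
|-------|------------|------------------|
| count | 200.0      | 197.0            |
| mean  | 100.5      | 67.590508        |
| std   | 57.879185  | 5.211051         |
| min   | 1.0        | 0.0              |
| 25%   | 50.75      | 66.5             |
| 50%   | 100.5      | 67.93            |
| 75%   | 150.25     | 69.2             |
| max   | 200.0      | 73.9             |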


#### Observation for Height_in_Inches Column:
1. Look at the "count" row. Index has 200 values while Height_in_Inches has only 197, so 3 rows are missing a height.
2. Look at the "min" row. Height_in_Inches contains the value 0. A minimum height of 0 is impossible, so this must be an error.
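
Both observations can also be checked directly instead of being read off the summary table. The sketch below uses a short made-up series of heights, so the counts it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical heights: two missing values and one impossible zero
heights = pd.Series(
    [65.78, np.nan, 0.0, 71.52, np.nan, 68.22],
    name="Height_in_Inches",
)

n_missing = heights.isna().sum()  # rows with no value at all
n_zero = (heights == 0).sum()     # rows with an impossible height of 0

print(n_missing, n_zero)
```

The same two expressions can be applied to `dataset['Height_in_Inches']`.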

In [22]:
dataset.describe(include=['O'])  # Frequency table for non numeric columns

|        | Weight_in_Pounds |
|--------|------------------|
| count  | 198.0            |
| unique | 194.0            |
| top    | 123.49           |
| freq   | 2.0              |


#### Observation for Weight_in_Pounds Column:
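1. Look at the "count" row. It shows 198 non-null values, so 2 rows are missing data.
2. Look at the "unique" row. There are only 194 distinct values among the 198 non-null entries, so some values repeat (the most frequent one, 123.49, appears twice). Because the column is stored as object, at least some of its entries must be non-numeric. We will find them.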
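
The relationship between "count", "unique" and the number of rows can be sketched on a short made-up column (the values below are assumptions, not rows from the dataset). `count()` ignores missing entries, `nunique()` counts distinct non-missing values, and `value_counts()` shows which values repeat.

```python
import pandas as pd

# Hypothetical object column: numbers stored as text, one repeat, one bad entry, one missing
weights = pd.Series(
    ["112.99", "136.49", "112.99", "?", None],
    name="Weight_in_Pounds",
)

n_rows = len(weights)          # all rows, including missing ones
n_present = weights.count()    # non-missing entries only
n_unique = weights.nunique()   # distinct non-missing values

n_missing = n_rows - n_present
n_repeats = n_present - n_unique

# Values that occur more than once
counts = weights.value_counts()
repeated = counts[counts > 1]

print(n_missing, n_repeats)
print(repeated)
```

Note that a gap between "count" and "unique" only signals repeated values. A non-numeric entry such as `?` still counts as one more distinct value, so it has to be found separately.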