In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

# Data Preprocessing

For data preprocessing we use the Social_Network_Ads_1 data set, which contains customer information. A company shows a large number of ads to customers through social media (the company wants to show ads only to the customers who purchase the product). The target variable is `Purchased`. Based on the other information (Country, Gender, Salary, etc.), our task is to estimate whether a particular customer purchased the product, or, ideally, whether there is any chance that the customer will purchase it.

In [2]:
data = pd.read_csv(r"Social_Network_Ads_1.csv")

In [3]:
data

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
0,15624510,01-03-2012,France,Male,19,19000.0,0
1,15810944,01-04-2012,Italy,Male,35,20000.0,0
2,15668575,01-05-2012,France,Female,26,43000.0,0
3,15603246,01-06-2012,Germany,Female,27,57000.0,0
4,15804002,01-09-2012,France,Male,19,76000.0,0
...,...,...,...,...,...,...,...
398,15691863,08-05-2013,France,Female,46,41000.0,1
399,15706071,08-06-2013,Italy,Male,51,23000.0,1
400,15654296,08-07-2013,Italy,Female,50,20000.0,1
401,15755018,08-08-2013,Germany,Male,36,33000.0,0


In [4]:
data.shape

(403, 7)

## Checking for any missing values

What? --> What is a missing value?

Why? --> Why do we need to handle it?

Example: Let's assume that our data is --> 200, 400, nan, 500, 750

A machine learning model is a mathematical model. Suppose the model computes something like this:

200 + 400 + nan + 500 + 750 -> Is this valid? No. Any arithmetic that involves nan returns nan, so the result carries no information.

i.e., we can't use nan values for calculations in a machine learning model.
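The example above can be checked directly. The following is a minimal sketch using the same five numbers; the variable names are illustrative only. It shows that a plain sum returns nan, while NaN-aware helpers skip the missing entry.

```python
import numpy as np
import pandas as pd

values = np.array([200, 400, np.nan, 500, 750])

# Plain arithmetic propagates NaN: the result is NaN, not an error.
plain_total = values.sum()

# NaN-aware alternatives skip the missing entry instead.
nan_aware_total = np.nansum(values)
pandas_total = pd.Series(values).sum()  # pandas skips NaN by default (skipna=True)

print(plain_total, nan_aware_total, pandas_total)
```

Skipping the missing entry is itself a modelling decision, which is why the handling strategy is chosen explicitly.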
   

How? --> How do we handle missing values?

Either fill the missing value:

1. Fill with the mean, median, or mode
2. Fill by analyzing the data
   - 2.1 GroupBy analysis
   - 2.2 Statistical analysis
3. Fill with a random selection from the variable
4. Fill with some relevant value
5. Predict the missing value

OR drop the missing value:

1. Drop the row (when you have a huge data set)
2. Drop the entire column (when 30%-40% of the values in the column are missing)
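The two drop options can be sketched on a small frame. This is a toy example with made-up values, not the Social_Network_Ads_1 data; the `Notes` column and the 0.30 cut-off are assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the Social_Network_Ads_1 data).
toy = pd.DataFrame({
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000.0, np.nan, 43000.0, 57000.0, 76000.0],
    "Notes": [np.nan, np.nan, "vip", np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()

# Drop columns that are mostly empty; 0.30 is an assumed cut-off.
max_missing = 0.30
cols_kept = toy.loc[:, missing_share <= max_missing]

# Then drop the remaining rows that still contain a missing value.
cleaned = cols_kept.dropna(axis=0)

print(missing_share)
print(cleaned.shape)
```

Dropping the sparse column first keeps more rows: calling `dropna()` on the original toy frame would leave only the single row that has a `Notes` value.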

In [5]:
data.isnull().any()

User ID            False
Date               False
Country            False
Gender             False
Age                False
EstimatedSalary     True
Purchased          False
dtype: bool

In [6]:
data.isnull().any().sum() # Gives the number of columns with missing values

1

In [7]:
data.isnull().sum()

User ID            0
Date               0
Country            0
Gender             0
Age                0
EstimatedSalary    2
Purchased          0
dtype: int64

In [8]:
data["EstimatedSalary"].isnull().sum()

2

# Handling Missing Values

### Filling with mean or median

In [9]:
data.mean(numeric_only=True)

User ID            1.569141e+07
Age                3.769231e+01
EstimatedSalary    8.236658e+04
Purchased          3.598015e-01
dtype: float64

In [10]:
data["EstimatedSalary"].mean()

82366.58354114713

In [11]:
data.fillna(data["EstimatedSalary"].mean()).isnull().any() # fillna returns a copy; assign the result back (or pass inplace=True) to update data

User ID            False
Date               False
Country            False
Gender             False
Age                False
EstimatedSalary    False
Purchased          False
dtype: bool

In [12]:
data["EstimatedSalary"].median()

70000.0

Here the mean is 82K+ and the median is 70K. This large gap suggests that the mean is being pulled up by outliers: the mean is greater than the median, which indicates outliers on the right (high-salary) side. In this case the median is the safer fill value.
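To see why a single large value separates the mean from the median, here is a small sketch with made-up salaries (not taken from the data set). One high value drags the mean far to the right while the median stays put.

```python
import pandas as pd

# Four ordinary salaries and one extreme value (made-up numbers).
salaries = pd.Series([20000, 30000, 40000, 50000, 500000])

mean_salary = salaries.mean()      # pulled up by the extreme value
median_salary = salaries.median()  # middle value, unaffected by it
skewness = salaries.skew()         # positive when the long tail is on the right

print(mean_salary, median_salary, skewness)
```

A mean well above the median, together with positive skew, is the same pattern seen in `EstimatedSalary`.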

In [13]:
data.fillna(data["EstimatedSalary"].median())

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
0,15624510,01-03-2012,France,Male,19,19000.0,0
1,15810944,01-04-2012,Italy,Male,35,20000.0,0
2,15668575,01-05-2012,France,Female,26,43000.0,0
3,15603246,01-06-2012,Germany,Female,27,57000.0,0
4,15804002,01-09-2012,France,Male,19,76000.0,0
...,...,...,...,...,...,...,...
398,15691863,08-05-2013,France,Female,46,41000.0,1
399,15706071,08-06-2013,Italy,Male,51,23000.0,1
400,15654296,08-07-2013,Italy,Female,50,20000.0,1
401,15755018,08-08-2013,Germany,Male,36,33000.0,0


## Mode

Mode is used to fill categorical data, because the mean and median can't be computed for categorical data. It gives the most frequently occurring value.
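As a small illustration on a categorical column, the sketch below fills a missing country with the most frequent one. The Series is made up for this example and is not read from the data set.

```python
import pandas as pd

# Made-up categorical column with one missing entry.
countries = pd.Series(["France", "Italy", None, "France", "Germany"])

# mode() returns a Series (there can be ties), so take the first entry.
most_common = countries.mode()[0]

# Fill the gap with the most frequent category.
filled = countries.fillna(most_common)

print(most_common)
print(filled.tolist())
```

When several values tie for most frequent, `mode()` returns all of them in sorted order, so `[0]` picks the first of the tied values.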

In [16]:
data["EstimatedSalary"].mode() 

0    72000.0
dtype: float64

mode() returns a Series, so to get the mode value itself we need to index into it.

In [17]:
data["EstimatedSalary"].mode()[0]

72000.0

In [20]:
data["EstimatedSalary"].fillna(data["EstimatedSalary"].mode()[0]).isnull().sum() # To update the data frame, assign the filled column back to data["EstimatedSalary"]

0

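When filling with the mean and median we called data.fillna(), which fills every missing value in the whole DataFrame with the given number. That was safe here only because EstimatedSalary is the only column with missing values. When filling with the mode we called data[Column_Name].fillna(), which restricts the fill to that one column.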
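The difference between the two calls shows up as soon as a second column has gaps. The sketch below uses a made-up two-column frame (not the real data) and also shows the dict form of `fillna`, which scopes the fill to named columns.

```python
import numpy as np
import pandas as pd

# Made-up frame in which two columns have a missing value in the same row.
toy = pd.DataFrame({
    "Age": [19.0, np.nan, 26.0],
    "EstimatedSalary": [19000.0, np.nan, 43000.0],
})

salary_median = toy["EstimatedSalary"].median()

# Whole-frame fill: the salary median also lands in the Age column.
filled_everywhere = toy.fillna(salary_median)

# Column-scoped fill: a dict maps each column to its own fill value.
filled_scoped = toy.fillna({"EstimatedSalary": salary_median})

print(filled_everywhere)
print(filled_scoped)
```

Both calls return new DataFrames, so the result has to be assigned to a name to be kept.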

## bfill() and ffill()

bfill -> Backward Fill

ffill -> Forward Fill

When the data depends on time, i.e. when the current row depends on the previous row, and there is a NaN (missing) value, we can fill it with the previous row's value (ffill) or the next row's value (bfill).
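Because "previous" and "next" refer to row order, the rows need to be in time order before filling. The sketch below uses three made-up rows that start out unsorted; the column names `Date` and `Value` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up readings, deliberately out of date order.
readings = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-19", "2012-01-17", "2012-01-18"]),
    "Value": [52000.0, 65000.0, np.nan],
})

# Sort by date so that neighbouring rows are neighbours in time.
ordered = readings.sort_values("Date").reset_index(drop=True)

# Forward fill copies the previous row's value; backward fill copies the next one.
ordered["ffill"] = ordered["Value"].ffill()
ordered["bfill"] = ordered["Value"].bfill()

print(ordered)
```

Without the sort, the same calls would copy values from rows that are adjacent in the file but not in time.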


In [24]:
data[data.isnull().any(axis=1)] # Shows the rows that contain missing values

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
10,15570769,1/18/2012,Italy,Female,26,,0
167,15762228,8/30/2012,France,Female,22,,0


In [25]:
data.iloc[8:15]

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
8,15600575,1/13/2012,Italy,Male,25,33000.0,0
9,15727311,1/17/2012,Germany,Female,35,65000.0,0
10,15570769,1/18/2012,Italy,Female,26,,0
11,15606274,1/19/2012,France,Female,26,52000.0,0
12,15746139,1/20/2012,Italy,Male,20,86000.0,0
13,15704987,1/23/2012,France,Male,32,18000.0,0
14,15628972,1/24/2012,France,Male,18,82000.0,0


In [26]:
data.bfill().iloc[8:15] # The missing EstimatedSalary in row 10 is filled with the next row's value

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
8,15600575,1/13/2012,Italy,Male,25,33000.0,0
9,15727311,1/17/2012,Germany,Female,35,65000.0,0
10,15570769,1/18/2012,Italy,Female,26,52000.0,0
11,15606274,1/19/2012,France,Female,26,52000.0,0
12,15746139,1/20/2012,Italy,Male,20,86000.0,0
13,15704987,1/23/2012,France,Male,32,18000.0,0
14,15628972,1/24/2012,France,Male,18,82000.0,0


In [28]:
data.ffill().iloc[8:15] # The missing EstimatedSalary in row 10 is filled with the previous row's value

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
8,15600575,1/13/2012,Italy,Male,25,33000.0,0
9,15727311,1/17/2012,Germany,Female,35,65000.0,0
10,15570769,1/18/2012,Italy,Female,26,65000.0,0
11,15606274,1/19/2012,France,Female,26,52000.0,0
12,15746139,1/20/2012,Italy,Male,20,86000.0,0
13,15704987,1/23/2012,France,Male,32,18000.0,0
14,15628972,1/24/2012,France,Male,18,82000.0,0


## GroupBy

In [29]:
data[data.isnull().any(axis=1)]

Unnamed: 0,User ID,Date,Country,Gender,Age,EstimatedSalary,Purchased
10,15570769,1/18/2012,Italy,Female,26,,0
167,15762228,8/30/2012,France,Female,22,,0


When we use a statistic (mean, median, or mode) to fill missing values, every missing value receives the same value.
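One way to avoid a single global fill value is to compute the statistic per group. The following is a minimal sketch on made-up data; grouping by `Country` and using the median are assumptions for illustration, not necessarily the grouping used for this data set.

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing salary in each country.
toy = pd.DataFrame({
    "Country": ["Italy", "Italy", "Italy", "France", "France", "France"],
    "EstimatedSalary": [20000.0, 30000.0, np.nan, 70000.0, 90000.0, np.nan],
})

# transform("median") returns one value per row: the median of that row's group.
group_median = toy.groupby("Country")["EstimatedSalary"].transform("median")

# Each missing salary is filled with the median of its own country.
toy["EstimatedSalary"] = toy["EstimatedSalary"].fillna(group_median)

print(toy)
```

Here the Italian gap is filled with the Italian median and the French gap with the French median, rather than both receiving one overall value.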