# Data pre-processing 

Data pre-processing refers to the transformations applied to data before it is fed to a learning algorithm.
The scikit-learn library provides built-in functions for the most common pre-processing steps. Besides scikit-learn, we will also use NumPy and pandas as supporting libraries for data manipulation.

This session covers:

1. Loading the data set
2. Handling missing values
3. Scaling
4. Binning
5. Encoding




# Loading data set

In [70]:
import pandas as pd

My_dataset = pd.read_csv('Dataset.csv')

In [71]:
print(My_dataset.size)   # total number of elements (rows x columns)
print(My_dataset.shape)  # (number of rows, number of columns)
print(My_dataset.ndim)   # a DataFrame has 2 dimensions
print(My_dataset.dtypes) # data type of each feature


288
(24, 12)
2
Gender                object
Married               object
Time                  object
Dependents             int64
Education             object
Self_Employed         object
ApplicantIncome      float64
CoapplicantIncome    float64
LoanAmount             int64
Loan_Amount_Term     float64
Credit_History         int64
Property_Area         object
dtype: object
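
As a quick sanity check on the numbers above, `size` is simply rows times columns (24 x 12 = 288). The sketch below reproduces that relationship on a small made-up frame. The frame `toy` and its values are illustrative assumptions, not rows from `Dataset.csv`.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration (not the real Dataset.csv).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Dependents": [0, 1, 2],
    "ApplicantIncome": [5849.0, 4583.0, None],
})

n_rows, n_cols = toy.shape
print(toy.size == n_rows * n_cols)  # size is rows * columns
print(toy.head(2))                  # first two rows only
toy.info()                          # dtypes and non-null counts in one view
```

`info()` is a convenient companion to `dtypes` because it also reports how many non-null values each column holds.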


In [72]:
print(My_dataset)

    Gender Married      Time  Dependents     Education Self_Employed  \
0     Male      No   4:20 AM           0      Graduate            No   
1     Male     Yes   4:21 AM           1      Graduate            No   
2     Male     Yes   3:20 AM           0      Graduate            No   
3     Male     Yes   4:20 AM           2      Graduate            No   
4     Male     Yes   5:20 AM           0           NaN            No   
5     Male     Yes   6:20 AM           3      Graduate            No   
6     Male     Yes   7:20 AM           0      Graduate            No   
7     Male     Yes   8:20 AM           2  Not Graduate            No   
8     Male      No   9:20 AM           0  Not Graduate            No   
9     Male     Yes  10:20 AM           0  Not Graduate            No   
10    Male     Yes  11:20 AM           0      Graduate            No   
11    Male     Yes  12:20 PM           2      Graduate            No   
12    Male      No  13:20 PM           2      Graduate          

# Dealing with missing values

In [73]:
print(My_dataset.isnull().sum()) # number of missing values per feature

Gender               0
Married              0
Time                 0
Dependents           0
Education            1
Self_Employed        0
ApplicantIncome      1
CoapplicantIncome    1
LoanAmount           0
Loan_Amount_Term     3
Credit_History       0
Property_Area        0
dtype: int64
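
Besides the raw counts, it is often useful to see the share of missing values and the affected rows. The following is a minimal sketch on a made-up frame; `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (not the real Dataset.csv).
toy = pd.DataFrame({
    "Education": ["Graduate", None, "Not Graduate", "Graduate"],
    "ApplicantIncome": [5849.0, np.nan, 3000.0, 2583.0],
    "LoanAmount": [128, 66, 120, 141],
})

missing_count = toy.isnull().sum()              # missing values per column
missing_share = toy.isnull().mean() * 100       # same, as a percentage of rows
rows_with_gaps = toy[toy.isnull().any(axis=1)]  # rows with at least one gap

print(missing_count)
print(missing_share)
print(rows_with_gaps)
```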


# Imputing missing values in Numerical features

In [77]:
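# ApplicantIncome: fill missing values with the column median
median_income = My_dataset['ApplicantIncome'].median()
My_dataset['ApplicantIncome'] = My_dataset['ApplicantIncome'].fillna(median_income)

# CoapplicantIncome: fill missing values with the column mean
mean_coincome = My_dataset['CoapplicantIncome'].mean()
My_dataset['CoapplicantIncome'] = My_dataset['CoapplicantIncome'].fillna(mean_coincome)

# Loan_Amount_Term: fill missing values with the column mean
mean_term = My_dataset['Loan_Amount_Term'].mean()
My_dataset['Loan_Amount_Term'] = My_dataset['Loan_Amount_Term'].fillna(mean_term)

print(My_dataset)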


    Gender Married      Time  Dependents     Education Self_Employed  \
0     Male      No   4:20 AM           0      Graduate            No   
1     Male     Yes   4:21 AM           1      Graduate            No   
2     Male     Yes   3:20 AM           0      Graduate            No   
3     Male     Yes   4:20 AM           2      Graduate            No   
4     Male     Yes   5:20 AM           0           NaN            No   
5     Male     Yes   6:20 AM           3      Graduate            No   
6     Male     Yes   7:20 AM           0      Graduate            No   
7     Male     Yes   8:20 AM           2  Not Graduate            No   
8     Male      No   9:20 AM           0  Not Graduate            No   
9     Male     Yes  10:20 AM           0  Not Graduate            No   
10    Male     Yes  11:20 AM           0      Graduate            No   
11    Male     Yes  12:20 PM           2      Graduate            No   
12    Male      No  13:20 PM           2      Graduate          
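
The same median and mean imputation can also be done with scikit-learn's `SimpleImputer`, which is handy when the fill values must later be reused on new data. This is only a sketch on a made-up frame; `toy` and its values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame (not the real Dataset.csv).
toy = pd.DataFrame({
    "ApplicantIncome": [2000.0, 3000.0, np.nan, 10000.0],
    "Loan_Amount_Term": [360.0, np.nan, 360.0, 180.0],
})

median_imputer = SimpleImputer(strategy="median")
mean_imputer = SimpleImputer(strategy="mean")

# fit_transform expects a 2-D input and returns a 2-D array,
# so the result is flattened before being assigned back to the column.
toy["ApplicantIncome"] = median_imputer.fit_transform(toy[["ApplicantIncome"]]).ravel()
toy["Loan_Amount_Term"] = mean_imputer.fit_transform(toy[["Loan_Amount_Term"]]).ravel()

print(toy)
```

The median is less sensitive to a few very large incomes than the mean, which is one reason to prefer it for skewed features.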

# Imputing missing values in Categorical features

In [79]:
# Fill missing Education values with the category "Graduate"
My_dataset["Education"] = My_dataset["Education"].fillna("Graduate")
print(My_dataset)

    Gender Married      Time  Dependents     Education Self_Employed  \
0     Male      No   4:20 AM           0      Graduate            No   
1     Male     Yes   4:21 AM           1      Graduate            No   
2     Male     Yes   3:20 AM           0      Graduate            No   
3     Male     Yes   4:20 AM           2      Graduate            No   
4     Male     Yes   5:20 AM           0      Graduate            No   
5     Male     Yes   6:20 AM           3      Graduate            No   
6     Male     Yes   7:20 AM           0      Graduate            No   
7     Male     Yes   8:20 AM           2  Not Graduate            No   
8     Male      No   9:20 AM           0  Not Graduate            No   
9     Male     Yes  10:20 AM           0  Not Graduate            No   
10    Male     Yes  11:20 AM           0      Graduate            No   
11    Male     Yes  12:20 PM           2      Graduate            No   
12    Male      No  13:20 PM           2      Graduate          
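
Instead of hard-coding the replacement category, the most frequent value can be computed from the column itself. A minimal sketch, assuming a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy column (not the real Dataset.csv).
toy = pd.DataFrame({"Education": ["Graduate", None, "Not Graduate", "Graduate"]})

# mode() ignores missing values by default and returns the most frequent value(s)
most_frequent = toy["Education"].mode()[0]
toy["Education"] = toy["Education"].fillna(most_frequent)

print(toy["Education"].tolist())
```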

# Normalizing features using Min-Max in Scikit Learn

In [81]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

min_max = MinMaxScaler()

# Rescale every numeric (int64 or float64) feature to the range [0, 1].
# MinMaxScaler expects a 2-D array, so each column is reshaped to (n_samples, 1)
# and the scaled result is flattened again before it is stored.
num_cols = My_dataset.select_dtypes(include=['int64', 'float64']).columns
for col in num_cols:
    values = np.array(My_dataset[col]).reshape(-1, 1)
    My_dataset[col] = min_max.fit_transform(values).ravel()

print(My_dataset)

    Gender Married      Time  Dependents     Education Self_Employed  \
0     Male      No   4:20 AM    0.000000      Graduate            No   
1     Male     Yes   4:21 AM    0.333333      Graduate            No   
2     Male     Yes   3:20 AM    0.000000      Graduate            No   
3     Male     Yes   4:20 AM    0.666667      Graduate            No   
4     Male     Yes   5:20 AM    0.000000      Graduate            No   
5     Male     Yes   6:20 AM    1.000000      Graduate            No   
6     Male     Yes   7:20 AM    0.000000      Graduate            No   
7     Male     Yes   8:20 AM    0.666667  Not Graduate            No   
8     Male      No   9:20 AM    0.000000  Not Graduate            No   
9     Male     Yes  10:20 AM    0.000000  Not Graduate            No   
10    Male     Yes  11:20 AM    0.000000      Graduate            No   
11    Male     Yes  12:20 PM    0.666667      Graduate            No   
12    Male      No  13:20 PM    0.666667      Graduate          
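
Min-max scaling maps each value x to (x - min) / (max - min), which is why the `Dependents` values 0, 1, 2, 3 become 0, 0.333, 0.667, 1 in the output above. The sketch below checks the formula by hand against `MinMaxScaler`; the array `dependents` is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values matching the range of the Dependents column.
dependents = np.array([0, 1, 2, 3], dtype=float)

# Manual min-max scaling: (x - min) / (max - min)
manual = (dependents - dependents.min()) / (dependents.max() - dependents.min())

# Same computation through scikit-learn (needs a 2-D input)
scaled = MinMaxScaler().fit_transform(dependents.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, scaled))
```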

# Preprocessing special features

In [85]:
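# Reload the raw data; this discards the imputation and scaling done above
My_dataset = pd.read_csv('Dataset.csv')

# Turn strings such as '4:20 AM' into integers such as 420.
# Dropping the AM/PM suffix is safe here only because the hours in this data
# set appear to run on a 24-hour clock already (e.g. '13:20 PM').
My_dataset['Time'] = My_dataset['Time'].str.replace(' AM', '', regex=False)
My_dataset['Time'] = My_dataset['Time'].str.replace(' PM', '', regex=False)
My_dataset['Time'] = My_dataset['Time'].str.replace(':', '', regex=False)

My_dataset['Time'] = My_dataset['Time'].astype(int)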

# Bin the Time values into 5 bins
bins = [0, 500, 1200, 1600, 2000, 5000]
labels = ['early morning', 'Morning', 'afternoon', 'eve', 'late eve']

# Intervals are right-inclusive, e.g. (0, 500] is labelled 'early morning'
My_dataset['Time'] = pd.cut(My_dataset['Time'], bins=bins, labels=labels)

print(My_dataset)

    Gender Married           Time  Dependents     Education Self_Employed  \
0     Male      No  early morning           0      Graduate            No   
1     Male     Yes  early morning           1      Graduate            No   
2     Male     Yes  early morning           0      Graduate            No   
3     Male     Yes  early morning           2      Graduate            No   
4     Male     Yes        Morning           0           NaN            No   
5     Male     Yes        Morning           3      Graduate            No   
6     Male     Yes        Morning           0      Graduate            No   
7     Male     Yes        Morning           2  Not Graduate            No   
8     Male      No        Morning           0  Not Graduate            No   
9     Male     Yes        Morning           0  Not Graduate            No   
10    Male     Yes        Morning           0      Graduate            No   
11    Male     Yes      afternoon           2      Graduate            No   
12    Male  
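
One detail of `pd.cut` worth knowing: by default each bin includes its right edge but not its left edge, so a value equal to the lowest edge (here 0, i.e. midnight) falls outside every bin and becomes NaN. The sketch below illustrates this with made-up values; `times` and the label spellings are assumptions for illustration.

```python
import pandas as pd

# Illustrative time values in HHMM integer form.
times = pd.Series([0, 420, 500, 501, 1220])
bins = [0, 500, 1200, 1600, 2000, 5000]
labels = ["early morning", "morning", "afternoon", "evening", "late evening"]

# Default: intervals are (left, right], so 0 is not covered by the first bin
default_cut = pd.cut(times, bins=bins, labels=labels)

# include_lowest=True makes the first interval closed on the left as well
inclusive_cut = pd.cut(times, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

If midnight values can occur in the data, `include_lowest=True` (or a first edge below 0) avoids silently introducing new missing values.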

# Requirements to use Scikit learn for supervised modeling

1. The data set is divided into two sets: the indicator (feature) set and the target set.
2. Both sets should be numeric. If some attributes are categorical, they must be converted to numerical values first.
3. Both sets should have a specific shape. For example, if the data set is (100 x 10), where the 10 features include one target feature, then the indicator set will be (100 x 9) and the target set will be (100 x 1).


# Converting selected features into numerical using LabelEncoder

In [86]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the distinct labels and numbers them, so 'No' -> 0 and 'Yes' -> 1
labelencoder = LabelEncoder()
My_dataset['Married'] = labelencoder.fit_transform(My_dataset['Married'])

In [87]:
print(My_dataset['Married'])

0     0
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     0
9     1
10    1
11    1
12    0
13    1
14    0
15    1
16    1
17    1
18    1
19    1
20    1
21    1
22    1
23    1
Name: Married, dtype: int64
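
To see which code belongs to which label, the fitted encoder exposes `classes_`, and `inverse_transform` maps the codes back. Note that scikit-learn documents `LabelEncoder` as intended for target labels; for several feature columns at once, `OrdinalEncoder` is one alternative. The sketch below uses a made-up `toy` frame, which is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame (not the real Dataset.csv).
toy = pd.DataFrame({
    "Married": ["No", "Yes", "Yes", "No"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

encoder = LabelEncoder()
married_codes = encoder.fit_transform(toy["Married"])
print(encoder.classes_)                          # sorted labels; position = code
print(encoder.inverse_transform(married_codes))  # back to the original strings

# OrdinalEncoder encodes several feature columns in one call
ordinal = OrdinalEncoder()
encoded = ordinal.fit_transform(toy[["Married", "Education"]])
print(encoded)
```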
