# Data Preprocessing

Data preprocessing is a crucial part of machine learning. Without it, your ML algorithm may not work properly. Think of it like planning a vacation: you first arrange travel, accommodation, and so on. That planning is your preprocessing; it is boring, but it is necessary.

### Importing Libraries
Here we import three libraries:
1. numpy - The fundamental package for scientific computing with Python. Mostly used for numerical and mathematical operations.
2. matplotlib - A Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Used for plotting data.
3. pandas - An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Used for working with tabular datasets.

In [2]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing data

Using pandas, we import our dataset from a .csv file. I have another notebook dedicated to importing datasets. We load the data into a pandas data frame, which is a heterogeneous tabular data structure with labeled axes (rows and columns).

In [3]:
#importing data
df = pd.read_csv('Data.csv')
df.head()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


head() shows the first few records of the data frame (five by default). Because a dataset can have many records, it is a good idea to use head() here. head(n) returns the first n records, and we can store them in another data frame. <br>
For example, df_20rec = df.head(20) stores the first 20 records in df_20rec.
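
As a small illustration of `head(n)`, the sketch below builds a toy data frame in memory so it runs without `Data.csv`. The names `toy` and `first_three` are introduced here for illustration only and are not part of the notebook.

```python
import pandas as pd

# Toy frame built in memory; values mirror the first rows of the dataset
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0],
})

first_three = toy.head(3)   # first 3 records, returned as a new data frame
print(first_three)
print(len(toy.head()))      # with no argument, head() returns at most 5 rows
```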

**Note:** Python uses 0-based indexing, as the result above shows, whereas R uses 1-based indexing for records.

#### Independent and dependent variables

Independent variables (also referred to as features) are the inputs to the process being analyzed. Dependent variables are the outputs of the process.

For example, in the data set, the independent variables are the input of the purchasing process being analyzed. The result (whether a user purchased or not) is the dependent variable.

Let's retrieve the independent variable matrix and the dependent variable vector:

In [14]:
#Getting dependent and independent variable matrix
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

Here, in [:,:-1], the first : selects all rows and :-1 selects all columns except the last one. Similarly, [:,:3] means all rows and the first 3 columns, and [:,3:] means all rows and all columns except the first 3.
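
The slicing rules above can be checked on a tiny frame. This is a minimal sketch; the frame `toy` and the variable names are assumptions made for the example, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

features = toy.iloc[:, :-1]   # all rows, every column except the last
target = toy.iloc[:, -1]      # all rows, only the last column
first_two = toy.iloc[:, :2]   # all rows, first 2 columns
rest = toy.iloc[:, 2:]        # all rows, all columns except the first 2

print(list(features.columns))
print(target.name)
print(list(first_two.columns))
print(list(rest.columns))
```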

In [15]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [16]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

### Taking care of missing data

One option is to remove the records that contain missing values, but that is risky because they may contain critical information. Instead, we can replace the missing values with the column mean.
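
To compare the two options side by side, here is a sketch on a toy frame using plain pandas (`dropna` and `fillna`), which is a different route from the scikit-learn imputer used later in this notebook. The frame and variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Option 1: discard every row that has at least one missing value
dropped = toy.dropna()

# Option 2: keep all rows and fill each gap with its column mean
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(toy), len(dropped))   # rows before and after dropping
print(filled)
```

Dropping removes half of this toy frame, which shows why filling is often preferred on small datasets.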

***Check whether the data contains missing values***

In [7]:
df.apply(lambda x : sum(pd.isnull(x))/len(df)*100)

Country       0.0
Age          10.0
Salary       10.0
Purchased     0.0
dtype: float64

It shows that the Salary and Age columns each have 10% missing values.
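
The same percentage can also be computed without a lambda. The sketch below, on a toy frame invented for the example, relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
})

# isna() marks missing cells as True; mean() then gives the missing fraction per column
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```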

***Replacing with mean***

In [8]:
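# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer from sklearn.impute is its replacement
from sklearn.impute import SimpleImputer

# Replace NaN entries in the Age and Salary columns with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])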

scikit-learn (sklearn) is a machine learning library; we use its imputer class here. Besides the mean, other strategies such as median and most frequent can be used to fill in missing values.
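
As a sketch of one alternative strategy, the example below fills a gap with the median, using `SimpleImputer` from `sklearn.impute` (the imputer class in current scikit-learn releases). The `ages` array is made up for the example and includes an outlier to show why the median can be a better choice than the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column of ages with a missing entry and an outlier (90)
ages = np.array([[44.0], [27.0], [np.nan], [38.0], [90.0]])

median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = median_imputer.fit_transform(ages)

print(filled.ravel())   # the gap is filled with the median of 27, 38, 44, 90
```

Passing `strategy="most_frequent"` instead fills each gap with the most common value in the column.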

In [9]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### Categorical data

Machine learning models are based on mathematical equations, so categorical data causes problems. We have to encode the categories, that is, transform them into numerical values. There are many encoding techniques. We will use the sklearn library here again.


In [10]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

In [11]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

The problem is that the model will treat 2 as greater than 0, even though there is no ordering between these categories.<br>
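
To see where the integers come from, the sketch below inspects a fitted `LabelEncoder` on a short list of country names chosen for the example. The codes follow the sorted order of the labels, so they carry no real-world meaning.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["France", "Spain", "Germany", "Spain"])

print(list(encoder.classes_))   # labels in sorted order; position = assigned code
print(list(codes))              # Spain gets 2 only because it sorts last
```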

Therefore, we use dummy columns instead: we create one column per country and put 1 in the column matching that record's country and 0 in the others. The one-hot encoder does this for us.
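
To make the dummy-column idea concrete, here is a small sketch using pandas `get_dummies`. Note that this is a different tool from the scikit-learn encoder applied in this notebook, and the `countries` frame is invented for the example.

```python
import pandas as pd

countries = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# One column per distinct country; 1 marks the country of that row
dummies = pd.get_dummies(countries["Country"], dtype=int)
print(dummies)
```

Each row contains exactly one 1, so no country is ranked above another.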

In [12]:
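# One-hot encoding
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer selects the column to encode instead
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode column 0 (Country), pass the other columns through, and return a dense float array
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)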

In [13]:
X

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04]])

In [17]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [18]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)