### Data Pre-processing Stage

  This notebook contains the basic data pre processing steps.

Let's take a sample dataset for this exercise.
This dataset named "data.csv" contains whether a user purchased the product or not.
The users data has age,salary and the country they belonged to.

In [48]:
###############################################################
#       Step 1 : Importing the libraries                      #
###############################################################


# NumPy is module for Python. The name is an acronym for "Numeric Python" or "Numerical Python".
# This makes sure that the precompiled mathematical and numerical functions 
# and functionalities of Numpy guarantee great execution speed.

import numpy as np

# Pandas is an open-source Python Library providing high-performance data manipulation 
# and analysis tool using its powerful data structures. 
# The name Pandas is derived from the word Panel Data – an Econometrics from Multidimensional data.

import pandas as pd


# The OS module in Python provides a way of using operating system dependent functionality. 
# The functions that the OS module provides allows you to interface with the underlying operating system 
# that Python is running on – be that Windows, Mac or Linux.

import os

In [49]:
###############################################################
#       Step 2 : Importing the Dataset                        #
###############################################################

#Read the 'Data.csv' and store the data in the vairable dataset.
dataset = pd.read_csv("../input/Data.csv")
print('Load the datasets...')


# Print the shape of the dataset
print ('dataset: %s'%(str(dataset.shape)))


Load the datasets...
dataset: (15, 4)


The dataset contains 15 rows and 4 columns

In [50]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,India,34.0,92000.0,Yes
1,Sri lanka,22.0,25000.0,Yes
2,China,31.0,74000.0,Yes
3,Sri lanka,29.0,,No
4,China,55.0,98000.0,Yes
5,India,24.0,30000.0,No
6,Sri lanka,28.0,40000.0,No
7,India,,60000.0,No
8,China,51.0,89000.0,Yes
9,India,44.0,78000.0,Yes


In [51]:
# print the dataset
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [52]:
# Separate the dependent and independent variables

# Independent variable
# iloc[rows,columns]
# Take all rows
# Take last but one column from the dataset (:-1)


# Dependent variable
# iloc[rows,columns]
# Take all rows
# Take last column from the dataset (:-1)


In [53]:
# Print the X and Y
print(X)
print(y)

[['India' 34.0 92000.0]
 ['Sri lanka' 22.0 25000.0]
 ['China' 31.0 74000.0]
 ['Sri lanka' 29.0 nan]
 ['China' 55.0 98000.0]
 ['India' 24.0 30000.0]
 ['Sri lanka' 28.0 40000.0]
 ['India' nan 60000.0]
 ['China' 51.0 89000.0]
 ['India' 44.0 78000.0]
 ['Sri lanka' 21.0 20000.0]
 ['China' 25.0 30000.0]
 ['India' 33.0 45000.0]
 ['India' 42.0 65000.0]
 ['Sri lanka' 33.0 22000.0]]
['Yes' 'Yes' 'Yes' 'No' 'Yes' 'No' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'Yes'
 'Yes' 'No']


In [54]:
dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

#### 1. Handle Missing Data

There are few missing data in the Age and salary columns (NaN values).

#### i. Deleting Rows:
*      We cannot remove the rows with the missing data as it will affect the output of the  machine learning algorithm.
*      However we can delete a particular row if it has a null value for a particular feature and a particular column if it has more than 70-75% of missing values.
      

#### ii. Replacing With Mean/Median/Mode:
*      This strategy can be applied on a feature which has numeric data like the age of a person.
*      We can calculate the mean, median or mode of the feature and replace it with the missing values.    
*     The loss of the data can be negated by this method which yields better results compared to removal of rows and  
*       columns.
*      Replacing with the above three approximations are a statistical approach of handling the missing values. 
*     This method is also called as leaking the data while training. 
*     Another way is to approximate it with the deviation of neighbouring values. 
*     This works better if the data is linear.


* # <font color='lime'>Solution 1 : Dropna</font>

In [63]:
df1 = dataset.copy()
df1.dropna(inplace=True)

print("Before:",df1.shape)

# drop rows with missing values
dataset.dropna(inplace=True)

# summarize the shape of the data with missing rows removed
print("After:",dataset)

Before: (13, 4)
After:       Country   Age   Salary Purchased
0       India  34.0  92000.0       Yes
1   Sri lanka  22.0  25000.0       Yes
2       China  31.0  74000.0       Yes
4       China  55.0  98000.0       Yes
5       India  24.0  30000.0        No
6   Sri lanka  28.0  40000.0        No
8       China  51.0  89000.0       Yes
9       India  44.0  78000.0       Yes
10  Sri lanka  21.0  20000.0        No
11      China  25.0  30000.0       Yes
12      India  33.0  45000.0       Yes
13      India  42.0  65000.0       Yes
14  Sri lanka  33.0  22000.0        No


# <font color='chartreuse'>Solution 2 : Fillna</font>

In [56]:
df2 = dataset.copy()
df2.dropna(inplace=True)

print("Before:",df1.shape)

# drop rows with missing values
dataset.fillna(df2.mean(),inplace=True)

# summarize the shape of the data with missing rows removed
print("After:",df2.shape)

Before: (13, 4)
After: (13, 4)


# <font color='darkgreen'>Solution 3 : Scikit-Learn</font>

#### 2. Encode the Categorical data

Categorical data are variables that contain label values rather than numeric values.
Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

This means that categorical data must be converted to a numerical form.

In our dataset there are 2 columns with categorical data.

The First column which contains the country and the last column purchased.

#### i.  Label Encoder: 

    * It is used to transform non-numerical labels to numerical labels (or nominal categorical variables).
    * Numerical labels are always between 0 and n_classes-1.     

#### ii. OneHotEncoder:
    * Encode categorical integer features using a one-hot aka one-of-K scheme.
    * The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) 
      features.
    * The output will be a sparse matrix where each column corresponds to one possible value of one feature.
    * It is assumed that input features take on values in the range [0, n_values]
    * This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs
      with the standard kernels.        

 # <font color='darkorchid'>Solution 1 : Label Encoder</font>

In [66]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_country = LabelEncoder()
dataset['Country_Label'] = label_encoder_country.fit_transform(dataset['Country'])

dataset['Country_Label'] = label_encoder_country.fit_transform(dataset['Country'])
print("\nAfter Label Encoding for Country:")
print(dataset[['Country', 'Country_Label']].head())


After Label Encoding for Country:
     Country  Country_Label
0      India              1
1  Sri lanka              2
2      China              0
4      China              0
5      India              1


Now the categorical data of the country value is changed to numerical value.

| Country | Value |
|:--------|:------|
| China   |   0   |  
| India   |   1   |   
| Srilanka|   2   |   


#### Dummy Encoding

    * The above encoding will result in a problem.
    * The label encoding transforms the data as shown in the table above.
    * The Machine learning algorithm will assume that China>India>Sri Lanka.
    * But this is not the case. We just converted the categorical value and assigned it to a numeric value.
    * Hence there is a need to apply Dummy encoding to the above dataset.

| Country | China | India | Sri Lanka |
|:--------|:------|:------|:----------|
| China   |   1   |  0    |    0      |   
| India   |   0   |  1    |    0      |   
| Srilanka|   0   |  0    |    1      |   
| India   |   0   |  1    |    0      |  
| Srilanka|   0   |  0    |    1      |  
| China   |   1   |  0    |    0      |  
  



# <font color='darkorchid'>Solution 2 : ColumnTransformer</font>

In [58]:
# Applying the OneHotEncoder to the first column[0]


# <font color='magenta'>Solution 3 : LabelEncoder for labels</font>

In [68]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#  One-Hot Encoding for 'Country'
one_hot_encoder = OneHotEncoder(sparse=False)  # Use sparse=False to get a dense array
country_encoded = one_hot_encoder.fit_transform(dataset[['Country']])
country_encoded_df = pd.DataFrame(country_encoded, columns=one_hot_encoder.get_feature_names_out(['Country']))


ValueError: could not convert string to float: 'Sri lanka'

# <font color='aqua'>6- Splitting the dataset</font>
Splitting the dataset is the next step in data preprocessing in machine learning. Every dataset for Machine Learning model must be split into two separate sets – training set and test set. 

# <font color='lightskyblue'>7- Feature scaling</font>
Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on common grounds.

Another reason why feature scaling is applied is that few algorithms like gradient descent converge much faster with feature scaling than without it.

# <font color='red'>MinMax Scaler</font>
MinMax Scaler shrinks the data within the given range, usually of 0 to 1. It transforms data by scaling features to a given range. It scales the values to a specific value range without changing the shape of the original distribution.

# <font color='red'>Standard Scaler</font>
StandardScaler follows Standard Normal Distribution (SND). Therefore, it makes mean = 0 and scales the data to unit variance.

Now the all the data are in same scale. We can now apply different Machine learning model to the dataset.