# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Practice Preprocessing Machine Learning</p>


Data preprocessing covers the steps used to clean, transform, or encode raw data so that a machine learning algorithm can parse it easily.

For a model to make accurate and precise predictions, the algorithm must be able to interpret the data's features easily.




<img src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/627d122b8fdb884d672952bf_61f7bfab94334458028eec7d_data-preprocessing-cover.png" width=55% />

<img src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/613749410d056eb67ec4b11f_model-building.png" width=55% />

The majority of real-world datasets for machine learning are prone to missing, inconsistent, and noisy values because they come from heterogeneous sources.

Applying data mining algorithms to this noisy data would not give quality results, as they would fail to identify patterns effectively. Data preprocessing is therefore important for improving overall data quality.

* Duplicate or missing values may give an incorrect view of the overall statistics of data.
* Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to false predictions.
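
As a rough illustration of the two points above, the sketch below uses made-up numbers (they are not taken from Data.csv) to show how a single outlier shifts the mean while the median barely moves, and how a duplicated row inflates the row count.

```python
import pandas as pd

# Hypothetical salaries, for illustration only
clean = pd.Series([50000, 52000, 54000])
with_outlier = pd.Series([50000, 52000, 54000, 1_000_000])

print(clean.mean())           # 52000.0
print(with_outlier.mean())    # 289000.0, pulled up by one extreme value
print(with_outlier.median())  # 53000.0, far less affected

# A duplicated row makes Spain look twice as common as it really is
rows = pd.DataFrame({'Country': ['Spain', 'Spain', 'France'],
                     'Age': [27, 27, 44]})
deduped = rows.drop_duplicates()
print(len(rows), len(deduped))  # 3 2
```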

Quality decisions must be based on quality data. Data preprocessing is how we get this quality data; without it, we would just have a <font color='red'>Garbage In, Garbage Out scenario</font>.

<img src="https://miro.medium.com/max/720/1*2JLGXEdZ8Bi2jwBDstIyLg.jpeg" />

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:130%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Table Of Contents</p>   
    

    
|No  | Contents |No  | Contents  |
|:---| :---     |:---| :----     |
|1   | [<font color="#254441"> Importing Libraries</font>](#1) |5   | [<font color="#254441"> Fillna (Solution 2)</font>](#5) |
|2   | [<font color="#254441"> Importing Dataset</font>](#2) |6   | [<font color="#254441"> Scikit-learn (Solution 3)</font>](#6) |
|3   | [<font color="#254441"> Missing Values</font>](#3) |7   | [<font color="#254441"> Encoding the Independent Variable</font>](#7) |
|4   | [<font color="#254441"> Dropna (Solution 1)</font>](#4) |8   | [<font color="#254441"> Encoding the Dependent Variable</font>](#8) |


<a id="1"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Importing Libraries</p>

In [142]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

<a id="2"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Importing Dataset</p>

__Now we have a small dataset that we want to examine__

In [143]:
df = pd.read_csv('Data.csv')

In [144]:
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [145]:
df.columns

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')

In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes


In [147]:
df.shape

(10, 4)

<a id="3"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Missing Values</p>

One of the major problems in the data cleaning / exploratory data analysis phase is handling missing values. A missing value is a data value that is not stored for a variable in an observation. The problem is common in almost all research, and it can significantly affect the conclusions drawn from the data.
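
Before choosing a fix, it usually helps to see where the gaps are. The sketch below is one possible way to do that with pandas; the three-row frame named sample is a stand-in built from rows 4 to 6 of the dataset shown earlier, not the full file.

```python
import numpy as np
import pandas as pd

# Stand-in for rows 4-6 of Data.csv
sample = pd.DataFrame({
    'Country': ['Germany', 'France', 'Spain'],
    'Age': [40.0, 35.0, np.nan],
    'Salary': [np.nan, 58000.0, 52000.0],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```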

### Sources of Missing Values

Before we dive into the code, it's important to understand where missing data comes from. Here are some typical reasons why data is missing:

* User forgot to fill in a field.
* Data was lost while transferring manually from a legacy database.
* There was a programming error.
* Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.

<img src="https://miro.medium.com/max/720/1*0WNawqcbpDqNfu_ZDlgJ1w.jpeg" />


<a id="4"></a>
## <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Dropna (Solution 1)</p>

### The dropna() function removes missing values; by default it drops every row that contains at least one.

In [148]:
df_dropna = df.copy()

In [149]:
print('Before:', df_dropna.shape)

df_dropna.dropna(inplace=True)

print('After:', df_dropna.shape)

Before: (10, 4)
After: (8, 4)


In [150]:
df_dropna.isnull().value_counts()

Country  Age    Salary  Purchased
False    False  False   False        8
dtype: int64

In [151]:
df_dropna

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
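
Dropping every incomplete row cost two of the ten observations here. As a hedged sketch, dropna() also takes parameters such as subset, how, and axis that narrow what gets removed. The frame named sample below is a stand-in built from rows 4 to 6 of the dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for rows 4-6 of Data.csv
sample = pd.DataFrame({
    'Country': ['Germany', 'France', 'Spain'],
    'Age': [40.0, 35.0, np.nan],
    'Salary': [np.nan, 58000.0, 52000.0],
    'Purchased': ['Yes', 'Yes', 'No'],
})

# Drop a row only when Salary is missing; a missing Age is tolerated
only_salary = sample.dropna(subset=['Salary'])

# Drop a row only when every value in it is missing (none here)
all_missing = sample.dropna(how='all')

# Drop columns, rather than rows, that contain missing values
complete_columns = sample.dropna(axis=1)

print(only_salary.shape, all_missing.shape, list(complete_columns.columns))
```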


<a id="5"></a>
## <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Fillna (Solution 2)</p>

The fillna() method replaces missing (NaN) values with a specified value.

fillna() returns a new DataFrame unless the <font color='red'>inplace</font> parameter is set to True, in which case it replaces the values in the original DataFrame instead.
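
The difference between the two modes can be checked on a tiny frame. The sketch below also shows that fillna() accepts a dict, so each column can get its own fill value; the frame named sample and the choice of median for Age and mean for Salary are illustrative assumptions.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'Age': [40.0, np.nan],
                       'Salary': [np.nan, 52000.0]})

# Without inplace=True the original is left untouched
filled = sample.fillna(0)
print(sample.isnull().sum().sum())  # 2
print(filled.isnull().sum().sum())  # 0

# A dict gives each column its own replacement value
per_column = sample.fillna({'Age': sample['Age'].median(),
                            'Salary': sample['Salary'].mean()})
print(per_column)
```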

In [152]:
dfFill = df.copy()
dfFillZero = df.copy()

In [153]:
# Average only the numeric columns; Country and Purchased hold strings
dfFill.fillna(dfFill.mean(numeric_only=True), inplace=True)

print (dfFill.isnull().sum())
dfFill

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64




Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [154]:
dfFillZero.fillna(0, inplace=True)
print(dfFillZero.isnull().sum())
dfFillZero

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,0.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,0.0,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


<a id="6"></a>
## <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Scikit-learn (Solution 3)</p>

The <font color='red'>**scikit-learn library**</font> provides the SimpleImputer pre-processing class that can be used to replace missing values.

It is a flexible class that lets you specify which value counts as missing through the missing_values parameter (it can be something other than NaN) and the strategy used to replace it (<font color='red'>**such as mean, median, or most frequent value (mode)**</font>). SimpleImputer accepts array-like input, including DataFrames, and by default returns a NumPy array; the cells below work on the NumPy array taken from .values.
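
As a minimal sketch of the DataFrame route, the imputer can be fitted on the numeric columns directly and the result written back. The frame named sample is a stand-in built from rows 4 to 6 of the dataset, and numeric_cols is just a helper name used here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for rows 4-6 of Data.csv
sample = pd.DataFrame({
    'Country': ['Germany', 'France', 'Spain'],
    'Age': [40.0, 35.0, np.nan],
    'Salary': [np.nan, 58000.0, 52000.0],
})

numeric_cols = ['Age', 'Salary']
imputer = SimpleImputer(strategy='mean')

# fit_transform returns a NumPy array by default
imputed = imputer.fit_transform(sample[numeric_cols])
sample[numeric_cols] = imputed

print(imputer.statistics_)  # per-column means learned during fit
print(sample)
```

The statistics_ attribute holds the fill value learned for each column, which is handy when the same imputer is later applied to new data with transform().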

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/Scikit-learn.jpg" width=85% />

In [155]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values

### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Mean</p>

In [156]:
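from sklearn.impute import SimpleImputer
# Replace NaN in the numeric columns (Age, Salary) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])                    # learn the per-column means
X[:, 1:3] = imputer.transform(X[:, 1:3])  # write the filled values back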

In [157]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Median</p>

In [158]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 61000.0]
 ['France' 35.0 58000.0]
 ['Spain' 38.0 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Most_frequent</p>

In [167]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 48000.0]
 ['France' 35.0 58000.0]
 ['Spain' 27.0 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Constant</p>

In [160]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values
imputer = SimpleImputer(missing_values=np.nan, strategy='constant')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 'missing_value']
 ['France' 35.0 58000.0]
 ['Spain' 'missing_value' 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
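
The 'missing_value' strings above appear because X has object dtype (it also holds the Country strings) and no fill_value was given; in that case scikit-learn falls back to the string 'missing_value'. A possible way to keep the columns numeric is to pass an explicit fill_value on a float array, as sketched below with a small made-up array named numeric.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Float array, so a numeric constant can be used
numeric = np.array([[40.0, np.nan],
                    [np.nan, 52000.0]])

imputer = SimpleImputer(strategy='constant', fill_value=0)
filled = imputer.fit_transform(numeric)
print(filled)
```

Whether 0 is a sensible constant depends on the feature; an age or salary of zero can mislead a model just as a missing value can.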


<a id="7"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Encoding the Independent Variable</p>

One-hot encoding converts a categorical variable into a set of binary columns, one per category, each holding 1 where the row belongs to that category and 0 otherwise. Most machine learning and deep learning algorithms expect numeric input, so this step lets them use categorical features without implying an order among the categories.

The image below shows what we want to achieve by implementing One-Hot Encoding.

<img src="https://miro.medium.com/max/1400/1*dWvkew37QCveEekRdTirsw.png" />

### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Pandas (get_dummies)</p>

In [161]:
pd.get_dummies(df, columns=['Country'])

Unnamed: 0,Age,Salary,Purchased,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,No,1,0,0
1,27.0,48000.0,Yes,0,0,1
2,30.0,54000.0,No,0,1,0
3,38.0,61000.0,No,0,0,1
4,40.0,,Yes,0,1,0
5,35.0,58000.0,Yes,1,0,0
6,,52000.0,No,0,0,1
7,48.0,79000.0,Yes,1,0,0
8,50.0,83000.0,No,0,1,0
9,37.0,67000.0,Yes,1,0,0
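
Two optional parameters of get_dummies are worth knowing. Recent pandas releases (2.0 and later, to the best of my knowledge) return boolean columns by default, so dtype=int is one way to get the 0/1 output shown above; drop_first=True drops one column per feature, since it can be inferred from the others. The frame named countries is a small stand-in.

```python
import pandas as pd

countries = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

# Explicit integer dtype gives 0/1 regardless of the pandas default
full = pd.get_dummies(countries, columns=['Country'], dtype=int)
print(full)

# Drop the first category (France); it is implied when the others are 0
reduced = pd.get_dummies(countries, columns=['Country'], dtype=int,
                         drop_first=True)
print(reduced)
```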


In [162]:
pd.get_dummies(df)

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,44.0,72000.0,1,0,0,1,0
1,27.0,48000.0,0,0,1,0,1
2,30.0,54000.0,0,1,0,1,0
3,38.0,61000.0,0,0,1,1,0
4,40.0,,0,1,0,0,1
5,35.0,58000.0,1,0,0,0,1
6,,52000.0,0,0,1,1,0
7,48.0,79000.0,1,0,0,0,1
8,50.0,83000.0,0,1,0,1,0
9,37.0,67000.0,1,0,0,0,1
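
scikit-learn offers an alternative in OneHotEncoder, which remembers the categories seen during fit and can therefore encode new data consistently. The sketch below is an assumption about how it could be applied to the Country column, not something taken from the original notebook; the frame named countries and the unseen value 'Italy' are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

countries = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

# handle_unknown='ignore' encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# The default output is a sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(countries[['Country']]).toarray()
print(encoder.categories_[0])  # categories in sorted order
print(encoded)

# A country that was not present during fit
unseen = encoder.transform(pd.DataFrame({'Country': ['Italy']})).toarray()
print(unseen)
```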


<a id="8"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Encoding the Dependent Variable</p>

In [163]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder sorts the class labels, so 'No' becomes 0 and 'Yes' becomes 1
le = LabelEncoder()
y = le.fit_transform(y)

In [164]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
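
To check which integer stands for which label, and to map predictions back to the original strings, the fitted encoder exposes classes_ and inverse_transform(). A minimal sketch on the first five labels of the dataset (the array named labels is a stand-in for y):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First five values of the Purchased column
labels = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)  # the position in this array is the encoded integer
print(encoded)

# Map integers back to the original labels
decoded = le.inverse_transform([1, 0])
print(decoded)
```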
