# Missing Data


In [5]:
import numpy as np
import pandas as pd
df=pd.read_csv('./Data.csv')
X=df.iloc[:,:-1].values
y=df.iloc[:,-1].values
X[:5,:],y[:5,]

(array([['France', 44.0, 72000.0],
        ['Spain', 27.0, 48000.0],
        ['Germany', 30.0, 54000.0],
        ['Spain', 38.0, 61000.0],
        ['Germany', 40.0, nan]], dtype=object),
 array(['No', 'Yes', 'No', 'No', 'Yes'], dtype=object))

In [6]:
#Taking care of missing data
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])
X[:5,:]

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778]], dtype=object)

The **fit** part extracts information from the data to which the object is applied (here, the imputer
spots the missing values and computes the mean of each column). The **transform** part then applies a
transformation (here, the imputer replaces each missing value by the mean of its column).
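
To make the two steps concrete, here is a minimal sketch on a small made-up array (the values are illustrative and not taken from Data.csv). After `fit`, the learned column means are available in the `statistics_` attribute; `transform` then uses them to fill the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): columns are age and salary
toy = np.array([[40.0, 50000.0],
                [np.nan, 60000.0],
                [20.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(toy)                 # fit: learn the mean of each column, ignoring NaN
learned_means = imputer.statistics_
filled = imputer.transform(toy)  # transform: replace each NaN by its column mean

print(learned_means)
print(filled)
```

Calling `fit_transform` would do both steps at once; keeping them separate makes it clear that the means are learned once and can be reused on other data.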


## Is replacing by the mean the best strategy to handle missing values?
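If, for example, a large share of the values in a column is missing, mean substitution is not the best choice: every gap receives the same value, which shrinks the variance of the column and can distort its relationship with the other features.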

Other strategies include median imputation (`strategy='median'`), most-frequent imputation (`strategy='most_frequent'`),
or prediction imputation.

**Prediction imputation**: 

Take the feature column that contains the missing values and treat it as the dependent variable, with the other columns as the independent variables.

Then split the dataset into a training set and a test set: the training set contains all the observations (rows) where that column has no missing value, and the test set contains all the observations where it is missing.

Finally, fit a prediction model (k-NN is a good choice here; use a classifier for a categorical column and a regressor for a numeric one) to predict the missing values in the test set, and replace the missing values with the predictions. A great strategy!
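
The steps above can be sketched as follows. This is a minimal illustration on made-up numbers, not the only way to do it: the array `data`, the choice of `n_neighbors=2`, and the use of `KNeighborsRegressor` (because the column with the gap is numeric) are all assumptions of this sketch.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up): column 0 is age, column 1 is salary with one missing value
data = np.array([[25.0, 40000.0],
                 [30.0, 50000.0],
                 [35.0, 60000.0],
                 [40.0, 70000.0],
                 [32.0, np.nan]])

target_col = 1
missing = np.isnan(data[:, target_col])

# "Training set": rows where the target column is known
X_known = np.delete(data[~missing], target_col, axis=1)
y_known = data[~missing, target_col]
# "Test set": rows where the target column is missing
X_missing = np.delete(data[missing], target_col, axis=1)

# Numeric column, so a k-NN regressor is used (assumption of this sketch);
# for a categorical column KNeighborsClassifier would play the same role
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_known, y_known)

imputed = data.copy()
imputed[missing, target_col] = knn.predict(X_missing)
print(imputed)
```

With more than one predictor column it would usually make sense to scale the features before fitting k-NN, since the neighbours are found by distance. scikit-learn also ships `sklearn.impute.KNNImputer`, which packages a similar idea.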

# Categorical Data
Machine learning models are based on mathematical equations, which is why we need to encode the categorical variables as numbers.

In [7]:
#Encode categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()
X[:,0]=labelencoder.fit_transform(X[:,0])
labelencoder_y=LabelEncoder()
y=labelencoder_y.fit_transform(y)
X[:5,:],y

(array([[0, 44.0, 72000.0],
        [2, 27.0, 48000.0],
        [1, 30.0, 54000.0],
        [2, 38.0, 61000.0],
        [1, 40.0, 63777.77777777778]], dtype=object),
 array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1]))

*There is a problem!*

These are actually three categories with no relational order between them, but the integer codes 0, 1 and 2 suggest such an order to the model.

We cannot compare France, Spain and Germany by saying that Spain is greater than Germany or Germany is greater than France.

> Dummy variables!

## Dummy Encoding


In [8]:
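from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed from OneHotEncoder in scikit-learn 0.22;
# ColumnTransformer encodes column 0 and passes the other columns through
ct=ColumnTransformer([('onehot',OneHotEncoder(),[0])],remainder='passthrough',sparse_threshold=0)
X=ct.fit_transform(X).astype(float)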
X

If you used a LabelEncoder before this OneHotEncoder only to convert the categories to integers, that step is no longer required: recent versions of scikit-learn let OneHotEncoder encode string categories directly.


array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])
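
To see where the first three columns come from, the dummy variables for the first five rows can be reproduced by hand with NumPy. This is only an illustrative sketch of the idea, not how `OneHotEncoder` is implemented; the `codes` array copies the label-encoded country column shown earlier.

```python
import numpy as np

# Integer codes from the label encoder (0 = France, 1 = Germany, 2 = Spain)
codes = np.array([0, 2, 1, 2, 1])

# Row i of the identity matrix is the dummy vector for category i
dummies = np.eye(3)[codes]
print(dummies)
```

Each row contains a single 1, so no category is "greater" than another.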

# Splitting data into training and testing sets


In [46]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

#### What is the difference between the training set and the test set?
The training set is the subset of your data on which your model learns how to predict the dependent
variable from the independent variables. The test set is the complementary subset, on
which you evaluate your model to see whether it correctly predicts the dependent variable from the
independent variables.
#### Why do we split on the dependent variable?
Because we want the values of the dependent variable to be well distributed across the training and test
sets. For example, if the training set contained only one value of the dependent variable, our model would
not be able to learn any relationship between the independent and dependent variables.
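
By default `train_test_split` shuffles the rows at random, so a balanced split is likely but not guaranteed on small data. One way to enforce it, sketched here on made-up arrays (`X_toy` and `y_toy` are hypothetical), is the `stratify` parameter:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 observations, 5 of each class
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# stratify=y_toy keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Class counts in the training and test sets
print(np.bincount(y_tr), np.bincount(y_te))
```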

# Feature Scaling

**Why do we need feature scaling?**
Many machine learning models are based on Euclidean distance (the square root of the sum of the squared coordinate differences). Without feature scaling, the Euclidean distance is dominated by the feature with the largest range (here, salary). So for such models we need to put the variables on the same scale.

Even when a model is not based on Euclidean distance, feature scaling is often still worthwhile because gradient-based training algorithms tend to converge much faster.
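
A quick check with the first two observations from the data above (age and salary) illustrates how one feature can dominate the distance. The variable names are only for this sketch.

```python
import numpy as np

# Two observations from the data above: (age, salary)
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

squared_diffs = (a - b) ** 2             # per-feature contribution to the squared distance
distance = np.sqrt(squared_diffs.sum())  # Euclidean distance
salary_share = squared_diffs[1] / squared_diffs.sum()

print(distance)
print(salary_share)
```

The age difference of 17 years barely changes the result: almost the entire distance comes from the salary column.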


Two types:
1. standardization

\begin{equation}
x_{stand}=\frac{x-mean(x)}{std(x)}
\end{equation}

> rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance).


2. normalization

\begin{equation}
x_{norm}=\frac{x-min(x)}{max(x)-min(x)}
\end{equation}

> rescales the values into the range [0, 1].
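
Both formulas can be checked directly with NumPy. This is a minimal sketch using four salary values from the data above; note that `np.std` computes the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

# Salary values from the first rows of the data above
x = np.array([72000.0, 48000.0, 54000.0, 61000.0])

x_stand = (x - x.mean()) / x.std()            # standardization
x_norm = (x - x.min()) / (x.max() - x.min())  # normalization (min-max)

print(x_stand.mean(), x_stand.std())
print(x_norm.min(), x_norm.max())
```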

#### Do we really have to apply Feature Scaling on the dummy variables?
Yes, if you want to optimize the accuracy of your model predictions.
No, if you want to keep as much interpretability as possible in your model.

#### When should we use Standardization and Normalization?
Generally, standardization works well when the data is roughly normally distributed and is less sensitive to outliers, while normalization suits data with known bounds and few outliers. When in doubt, go for standardization. In practice, both scaling methods are often tested.

In [48]:
from sklearn.preprocessing import StandardScaler
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)

It's important to fit the object to X_train first so that X_train and X_test are scaled on the same basis.

X_test is transformed with the mean and standard deviation learned from X_train, because the StandardScaler object was fitted to X_train only. This also keeps information from the test set out of the training step.
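
The following sketch, on made-up one-column arrays (`X_tr` and `X_te` are hypothetical), shows that the scaler stores the training statistics in `mean_` and `scale_` and reuses them for the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration)
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[2.0], [5.0]])

sc = StandardScaler()
X_tr_scaled = sc.fit_transform(X_tr)  # learns mean and std from the training set only
X_te_scaled = sc.transform(X_te)      # reuses the training mean and std

print(sc.mean_, sc.scale_)
print(X_te_scaled.ravel())
```

The scaled test set does not generally have mean 0 and standard deviation 1, and that is expected: it is expressed in the units learned from the training set.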

