# Part 1 - Data Preprocessing

### Index

- [Dependent and Independent Variables](#depind)
- [Missing values](#missval)
- [Categorical variables](#catvar)
- [Training and test set](#traintest)
- [Feature Scaling](#featscale)

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [64]:
df = pd.read_csv('Data.csv')

<a id='depind'></a>
##### 1. Dependent and Independent Variables
> We split the dataframe into independent variables (the features) and the dependent variable (the target).

In [65]:
# Independent variables (features): every column except the last
x = df.iloc[:, :-1].values
# Dependent variable (target): the last column
y = df.iloc[:, -1].values
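
As a small illustration of this split, the sketch below builds a hypothetical four-column frame and separates the last column from the rest. The column names and values are made up for the example and are not taken from `Data.csv`. Indexing the target with `-1` rather than a fixed position is one way to keep the split valid if feature columns are added later.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
toy = pd.DataFrame({
    'country': ['France', 'Spain', 'Germany'],
    'age': [44, 27, 30],
    'salary': [72000, 48000, 54000],
    'purchased': ['No', 'Yes', 'No'],
})

# Every column except the last is treated as a feature.
toy_x = toy.iloc[:, :-1].values
# The last column is treated as the target.
toy_y = toy.iloc[:, -1].values

print(toy_x.shape)  # (3, 3)
print(list(toy_y))
```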

<a id='missval'> </a>
##### 2. Missing values
> We now need to fill in the missing values in our dataset. Each missing entry is replaced with the mean of its column. We do this using scikit-learn.

In [66]:
from sklearn.impute import SimpleImputer

In [67]:
# Replace each NaN in the numeric columns (columns 1 and 2) with that column's mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
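
To make the mean strategy concrete, here is a minimal NumPy sketch of the same idea, written by hand rather than with scikit-learn. The array values are hypothetical and chosen only for the example.

```python
import numpy as np

# Hypothetical numeric columns with one missing entry each (values are made up).
data = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Column means computed over the non-missing entries only.
col_means = np.nanmean(data, axis=0)

# Wherever an entry is NaN, substitute the mean of its column.
filled = np.where(np.isnan(data), col_means, data)
print(filled)
```

The same effect is visible in the output above: `63777.77777777778` and `38.77777777777778` are the means of the nine known values in their respective columns.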

<a id='catvar'></a>
##### 3. Categorical variables
> We encode categorical variables with scikit-learn's encoder classes, using label encoding and one-hot encoding.

In [68]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [69]:
# LabelEncoder maps each class to an integer; it suits binary targets such as true/false or yes/no.
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y=y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [70]:
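# First label-encode column 0, then expand it into one-hot (dummy) columns.
from sklearn.compose import ColumnTransformer
label_x = LabelEncoder()
x[:, 0] = label_x.fit_transform(x[:, 0])
# ColumnTransformer applies the one-hot encoder to column 0 and passes the other columns through.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x), dtype=float)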
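
The two encoding steps can be sketched by hand in NumPy, which may help show what the encoders produce. This is an illustration under assumed data, not the scikit-learn implementation; the category values are hypothetical.

```python
import numpy as np

# Hypothetical category column (values are made up for the example).
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# Label encoding: np.unique returns the sorted distinct categories and,
# with return_inverse=True, the integer code of each entry.
categories, codes = np.unique(countries, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the indicator vector for code i.
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(codes)       # [0 2 1 2]
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its category, so no artificial ordering between categories is implied, unlike the integer codes.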

<a id='traintest'></a>
##### 4. Training set and Test set
> The data is split into two sets: a training set used to fit the model and a test set held out to evaluate it.

In [71]:
from sklearn.model_selection import train_test_split

In [78]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
y_train, y_test

(array([1, 1, 1, 0, 1, 0, 0, 1]), array([0, 0]))
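
For intuition, a shuffled hold-out split can be written in a few lines of NumPy. The helper `simple_split` below is hypothetical and hand-rolled for illustration; it is not scikit-learn's `train_test_split` and will not reproduce its exact shuffling for a given seed.

```python
import numpy as np

def simple_split(x, y, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a fraction as the test set (illustrative helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(np.ceil(test_size * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # The same indices select rows from x and y, so features and labels stay aligned.
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical data: 10 rows, where row i has features [2i, 2i + 1] and label i.
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

x_tr, x_te, y_tr, y_te = simple_split(features, labels)
print(len(x_tr), len(x_te))  # 8 2
```

With 10 rows and `test_size=0.2`, two rows are held out, which matches the sizes of `y_train` and `y_test` shown above.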

<a id='featscale'></a>
##### 5. Feature scaling
> Our features should be on comparable scales. If they are not, a feature with a much larger numeric range can dominate what the model learns, so we normalize or standardize the data.

In [74]:
from sklearn.preprocessing import StandardScaler

In [79]:
# The scaler stores the mean and standard deviation learned from the training set; the same values are then applied to the test set.
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
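
Standardization rescales each column as `z = (value - mean) / std`. The sketch below does this by hand in NumPy on hypothetical data, to show why the statistics are learned from the training set and then reused for the test set. It is an illustration of the idea, not the `StandardScaler` source.

```python
import numpy as np

# Hypothetical train/test features on very different scales (values are made up).
train = np.array([[30.0, 50000.0], [40.0, 60000.0], [50.0, 70000.0]])
test = np.array([[35.0, 65000.0]])

# "Fit": learn the per-column mean and standard deviation from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither dominates because of its units. The test row is scaled with the training statistics, which keeps information about the test set out of the fitting step.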

> Whether to scale the dummy variables produced by one-hot encoding depends on the application. Here they are scaled together with the other features. This can improve the accuracy of our model, but the scaled values are no longer 0 or 1, which makes it harder to read off the original category.