# Section 2
In this section we will create a data preprocessing template.

---------------------

## Lecture 8 - Get the dataset

Arrange all the data in a single file, for example a CSV file. Here we will use the [Data.csv](Data.csv) file.

-------------------------------

## Lecture 9 - Importing the libraries


In [3]:
"""numpy: 
An array object of arbitrary homogeneous items
Fast mathematical operations over arrays
Linear Algebra, Fourier Transforms, Random Number Generation"""

import numpy as np
import matplotlib.pyplot as plt  # drawing plots
import pandas as pd  # import and manage datasets

-------------

### Data.csv

- It has 4 columns and 10 rows.
- Index no of __Country__ column = 0
- Index no of __Age__ column = 1
- Index no of __Salary__ column = 2
- Index no of __Purchased__ column = 3
- Index of _ROW_ starts from _0_ as well (in this file we have rows 0-9).

#### X:
Columns 0, 1 and 2 are the independent variables (features).

#### y:
Column 3 is the dependent variable.

#### Missing values:
r4c2 (row 4, column 2: _Salary_) and r6c1 (row 6, column 1: _Age_)
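
As a quick check of the layout described above, the following sketch rebuilds the table inline and locates the missing cells. The values are taken from the outputs printed later in this section; the `Yes`/`No` spelling of the _Purchased_ labels and the inline `DataFrame` (used instead of reading `Data.csv`) are assumptions made so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for Data.csv, rebuilt from the values printed later in this section.
# Assumption: the Purchased labels are spelled "Yes" and "No".
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

print(dataset.shape)  # (number of rows, number of columns)

# (row index, column index) of every missing value, both counted from 0
missing_rows, missing_cols = np.where(dataset.isna().to_numpy())
missing_positions = list(zip(missing_rows.tolist(), missing_cols.tolist()))
print(missing_positions)
```

With the real file, `dataset = pd.read_csv('Data.csv')` would replace the inline table and the rest of the check would stay the same.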

-----------

## Lecture 10 - Importing the dataset

In [5]:
dataset = pd.read_csv('Data.csv')

# all rows, all columns except the last column
X = dataset.iloc[:, :-1].values

# all rows, and only the 4th column (the column with index 3)
y = dataset.iloc[:, 3].values

#### Output:

![DatasetImport](Img/L9DatasetImport.JPG)

--------------

### iloc Vs loc

_iloc_: position based | _loc_: label based

In [7]:
s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])

In [8]:
s

49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: float64

In [9]:
s.iloc[:3]  # positions 0 to 2 (the stop position 3 is excluded)

49   NaN
48   NaN
47   NaN
dtype: float64

In [10]:
s.loc[:3]  # up to and including label 3

49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
dtype: float64
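
The same distinction applies to a `DataFrame`, which is how `X` and `y` are selected in this section. The sketch below uses a hypothetical two-row frame whose row labels are deliberately not 0 and 1, so that positions and labels differ.

```python
import pandas as pd

# Hypothetical two-row frame; the index labels 10 and 11 are chosen
# so that row labels and row positions do not coincide.
df = pd.DataFrame(
    {
        "Country": ["France", "Spain"],
        "Age": [44, 27],
        "Salary": [72000, 48000],
        "Purchased": ["No", "Yes"],
    },
    index=[10, 11],
)

X_by_position = df.iloc[:, :-1]             # every column except the last one
X_by_label = df.loc[:, "Country":"Salary"]  # label slice: the end label is included

age_by_position = df.iloc[0, 1]   # first row, second column, whatever their labels
age_by_label = df.loc[10, "Age"]  # row labelled 10, column labelled "Age"

print(X_by_position.equals(X_by_label))
print(age_by_position, age_by_label)
```

Note that `df.loc[0]` would raise a `KeyError` here, because no row carries the label 0.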

-------------

## Lecture 12: Taking care of missing data

In our dataset, two values are missing. So,

- we can either delete those rows
- or populate the two missing values with the mean, median or mode of their column

Here we are going to take _mean_ values.

In [11]:
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
imputer = imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])

#### Imputer

__Definition:__ Imputer(self, missing_values="NaN", strategy="mean", axis=0, verbose=0, copy=True)

__Type:__ Class in sklearn.preprocessing.imputation module



__missing_values:__ integer or “NaN”, optional (default=”NaN”)
- The placeholder for the missing values.
- All occurrences of missing_values will be imputed.
- For missing values encoded as np.nan, use the string value “NaN”.

__strategy:__ string, optional (default=”mean”)
The imputation strategy.

- If “mean”, then replace missing values using the mean along the axis.
- If “median”, then replace missing values using the median along the axis.
- If “most_frequent”, then replace missing values using the most frequent value along the axis.

__axis:__ integer, optional (default=0)
The axis along which to impute.

- If axis=0, then impute along columns.
- If axis=1, then impute along rows.
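
The `Imputer` class documented above was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions, a roughly equivalent sketch uses `SimpleImputer` from `sklearn.impute`, which always imputes column by column and therefore has no `axis` parameter. The array below holds only the _Age_ and _Salary_ columns, typed in by hand so that the snippet is self-contained.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary (columns 1 and 2 of X), with the two gaps written as np.nan
age_salary = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(age_salary)  # learn the column means, then fill the gaps

print(imputer.statistics_)         # the per-column means used for filling
print(filled[6, 0], filled[4, 1])  # the two values that were missing
```

The two filled values should agree with the 38.78 and 63777.78 shown in the output of `X` in this lecture.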

In [12]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

--------------------

## Lecture 13: Encoding categorical data

We will encode _countries_:

- Since we have three countries, we need 3 columns

By encoding, we want to achieve the following result for countries:

- Column 0 corresponds to France
- Column 1 corresponds to Germany
- Column 2 corresponds to Spain

![CategoricalDataIntro](Img/CategoricalDataIntro.jpg)

In [13]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()

In [14]:
X

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04]])

#### X (in plot, format %.0f):

![CategoricalDataX](Img/CategoricalData_X.JPG)

#### LabelEncoder

- Encode labels with value between 0 and n_classes-1.

##### For our dataset, we need only LabelEncoder to encode y (_yes_ and _no_ into _1_ and _0_).

#### OneHotEncoder

__Definition:__ OneHotEncoder(self, n_values="auto", categorical_features="all", dtype=np.float64, sparse=True, handle_unknown='error')

- Encode categorical integer features using a one-hot aka one-of-K scheme.
- The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature. 
- It is assumed that input features take on values in the range [0, n_values).

__categorical_features:__ “all” or array of indices or mask.

Specify what features are treated as categorical.

- ‘all’ (default): All features are treated as categorical.
- array of indices: Array of categorical feature indices.
- mask: Array of length n_features and with dtype=bool.

Non-categorical features are always stacked to the right of the matrix.
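
The `categorical_features` parameter described above was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions, one way to get the same column layout is to wrap `OneHotEncoder` in a `ColumnTransformer`. The following is a minimal sketch under that assumption; the four-row frame is a hand-typed excerpt of the features, and `sparse_threshold=0` is set only to force a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First four rows of the features (no missing values among them)
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

encoder = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",  # keep Age and Salary unchanged, to the right
    sparse_threshold=0,       # always return a dense array
)
X_encoded = np.asarray(encoder.fit_transform(features), dtype=float)

# Columns: France, Germany, Spain (alphabetical order), then Age, Salary
print(X_encoded)
```

With this approach `OneHotEncoder` works on the string column directly, so the separate `LabelEncoder` pass over the country column is not needed.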

In [15]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [16]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

![Categorical Data y](Img/CategoricalData_y.JPG)
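
To see which label ends up as which integer, the fitted encoder can be inspected. The sketch below assumes the _Purchased_ labels are spelled `Yes` and `No`; the label list is reconstructed from the encoded `y` shown above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumed spelling of the labels; the order follows the encoded y above
y_labels = np.array(["No", "Yes", "No", "No", "Yes",
                     "Yes", "No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_labels)

print(encoder.classes_)  # sorted labels; the position of a label is its code
print(y_encoded)

# inverse_transform maps the integers back to the original labels
y_decoded = encoder.inverse_transform(y_encoded)
```

Because the classes are sorted, `No` becomes 0 and `Yes` becomes 1.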

-----------

## Lecture 14: Splitting the dataset into _training set_ and _test set_

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [19]:
X_train

array([[  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04]])

![X_train](Img/X_train.JPG)

In [20]:
X_test

array([[  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04]])

![X_test](Img/X_test.JPG)

In [21]:
y_train

array([1, 1, 1, 0, 1, 0, 0, 1])

![y_train](Img/y_train.JPG)

In [22]:
y_test

array([0, 0])

![y_test](Img/y_test.JPG)
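
The roles of `test_size` and `random_state` can be checked on a small stand-in array. In this sketch `X_demo` is a hypothetical 10-row feature matrix; `y_demo` reuses the encoded `y` from Lecture 13.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # hypothetical features, 10 rows
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 puts 20% of the 10 rows (2 rows) into the test set
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# The same random_state gives the same shuffle, so both splits are identical
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Fixing `random_state` is what makes the outputs in these notes reproducible; leaving it unset gives a different split on each run.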

---------------

## Lecture 15: Feature scaling

### Euclidean distance:

![Euclidean distance](Img/Euclidean_distance.png)

In [39]:
from IPython.display import Math
Math(r'\text{Euclidean distance} = \sqrt{(x_{2}-x_{1})^{2}+(y_{2}-y_{1})^{2}}')


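In our dataset, _age_ and _salary_ are not on the same scale: ages are in the tens while salaries are in the tens of thousands. In a Euclidean distance, the squared _salary_ differences are so much larger that they dominate the _age_ differences. That's why we need to scale the features.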
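
A small numeric check makes this concrete. The sketch below takes _Age_ and _Salary_ from the first four rows of the dataset and measures how much of the squared distance between rows 0 and 1 comes from _age_, before and after scaling; `age_share` is a helper name introduced here for illustration.

```python
import numpy as np

# Age and Salary of the first four rows (none of them has a missing value)
rows = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])

def age_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to age."""
    squared_diff = (a - b) ** 2
    return squared_diff[0] / squared_diff.sum()

distance_raw = np.linalg.norm(rows[0] - rows[1])
age_share_raw = age_share(rows[0], rows[1])

# Standardise each column: subtract its mean, divide by its standard deviation
scaled = (rows - rows.mean(axis=0)) / rows.std(axis=0)
distance_scaled = np.linalg.norm(scaled[0] - scaled[1])
age_share_scaled = age_share(scaled[0], scaled[1])

print(f"raw:    distance = {distance_raw:.3f}, age share = {age_share_raw:.7f}")
print(f"scaled: distance = {distance_scaled:.3f}, age share = {age_share_scaled:.3f}")
```

On the raw values the distance is almost exactly the salary difference of 24000, so _age_ contributes next to nothing; after scaling, both features contribute in comparable amounts.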

### Standardisation:

In [41]:
Math(r'x_{stand} = \frac{x-x_{mean}}{x_{\text{standard deviation}}}')


### Normalisation:

In [42]:
Math(r'x_{norm} = \frac{x-x_{min}}{x_{max}-x_{min}}')


In [43]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

To standardise the data (zero mean and unit standard deviation), we subtract the mean and then divide the result by the standard deviation.

x′ = (x−μ)/σ
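
The standardisation and normalisation formulas from this lecture can be checked by hand with NumPy. This sketch uses the nine known _Age_ values from the dataset; `np.std` computes the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

# The nine known Age values from the dataset
age = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0])

# Standardisation: (x - mean) / standard deviation
age_stand = (age - age.mean()) / age.std()

# Normalisation: (x - min) / (max - min)
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand.mean().round(6), age_stand.std().round(6))  # mean and spread after standardising
print(age_norm.min(), age_norm.max())                       # range after normalising
```

Standardised values have zero mean and unit standard deviation but no fixed range, while normalised values always fall between 0 and 1.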

We do that on the training set. Then we have to apply the same transformation to the test set (e.g. in model selection), using the same two parameters μ and σ that were computed from the training set.

Hence, the fit() method of every sklearn transformer just calculates the parameters (e.g. μ and σ in the case of StandardScaler) and saves them as internal object state. Afterwards, we can call its transform() method to apply the transformation to any particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set X, but it also returns a transformed X′. Internally, it just calls first fit() and then transform() on the same data.
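
The split between `fit()` and `transform()` can be verified directly: the parameters learned from the training data are stored in the scaler's `mean_` and `scale_` attributes and reused for the test data. The small arrays below are illustrative values, not the exact rows of the split above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Age/Salary rows standing in for the training and test sets
train = np.array([[40.0, 63000.0], [37.0, 67000.0], [27.0, 48000.0], [48.0, 79000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
scaler.fit(train)                      # only computes and stores mu and sigma
mu, sigma = scaler.mean_, scaler.scale_

test_scaled = scaler.transform(test)   # applies the stored mu and sigma
test_manual = (test - mu) / sigma      # the same formula written out by hand

# fit_transform() on the training set gives the same result as fit() then transform()
train_scaled = StandardScaler().fit_transform(train)

print(np.allclose(test_scaled, test_manual))
print(np.allclose(train_scaled, scaler.transform(train)))
```

Calling `fit_transform()` on the test set instead would compute a new μ and σ from the test rows, so the two sets would no longer be on the same scale.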

For _categorical data_ (the dummy variables), whether to apply _feature scaling_ depends on the situation.
If we do scale them, the columns no longer hold plain 0/1 values, so we lose the easy reading of which country each row belongs to, but everything will be on the same scale.
In our example, we did apply _feature scaling_ to them.
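
For comparison, here is a sketch of the other option, which is not what this lecture does: scaling only the numeric columns and leaving the dummy columns as 0/1. It assumes the column layout produced in Lecture 13 (three country columns followed by _Age_ and _Salary_) and uses a few illustrative rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed layout after encoding: [France, Germany, Spain, Age, Salary]
X_train = np.array([
    [0, 1, 0, 40, 63000],
    [1, 0, 0, 37, 67000],
    [0, 0, 1, 27, 48000],
    [1, 0, 0, 48, 79000],
], dtype=float)
X_test = np.array([[0, 1, 0, 30, 54000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Scale only Age and Salary (columns 3 and 4); the dummy columns stay untouched
X_train_scaled[:, 3:] = scaler.fit_transform(X_train[:, 3:])
X_test_scaled[:, 3:] = scaler.transform(X_test[:, 3:])

print(X_train_scaled)
```

The country columns remain readable as 0/1, at the cost of not having every column on exactly the same scale.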

__X_train fit_transform:__

![X_train_fit_transform](Img/X_train_fit_transform.JPG)

__X_test transform:__

![X_test_transform](Img/X_test_transform.JPG)

We don't need to scale y here, because it is a categorical variable that only takes the values 0 and 1.

------------------

## Lecture 17: Data preprocessing final template

In [44]:
# -*- coding: utf-8 -*-

# Part 1 - Section 2: Data Preprocessing

# Importing the libraries

import numpy as np  # mathematics
import matplotlib.pyplot as plt  # drawing plots
import pandas as pd  # import and manage datasets

# Importing the datasets
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""

### If required, take care of missing data and encode categorical data
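
For readers on scikit-learn 0.22 or later, the steps of this section can be put together as follows. This is a sketch under stated assumptions, not the course's template: it swaps in `SimpleImputer` and `ColumnTransformer` for the removed `Imputer` and `categorical_features`, and it rebuilds the data inline (with assumed `Yes`/`No` labels) instead of reading `Data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Stand-in for pd.read_csv('Data.csv'); the "Yes"/"No" spelling is an assumption
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})
X = dataset.iloc[:, :-1].copy()
y = dataset.iloc[:, 3].values

# Taking care of missing data (column means)
numeric_columns = ["Age", "Salary"]
X[numeric_columns] = SimpleImputer(strategy="mean").fit_transform(X[numeric_columns])

# Encoding categorical data
encoder = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X = np.asarray(encoder.fit_transform(X), dtype=float)
y = LabelEncoder().fit_transform(y)

# Splitting the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

print(X_train.shape, X_test.shape)
```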


----------------