# ML 101

This notebook covers common methods for dataset pre-processing, cleaning, and normalization.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Loading a sample dataset

Let us consider a toy dataset with only four columns (three features and one target):
1. Country (String)
2. Age (Int)
3. Salary (Int)
4. Purchased (Yes/No)

In [None]:
# import dataset
df = pd.read_csv('../datasets/data_prep.csv')
# print the first rows of the dataset
df.head()

In [None]:
# print the last rows of the dataset
df.tail()

In [None]:
# viewing statistical info about dataset
df.describe()

The dataset may contain duplicated rows caused by errors during acquisition.

In [None]:
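# show the duplicated rows (every repeat after the first occurrence)
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)
# drop the duplicates, keeping the first occurrence of each row
df = df.drop_duplicates()
# confirm that no duplicates remain (this should print an empty DataFrame)
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)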
df.describe()

## Missing Data

Another common issue is the presence of missing values.

In [None]:
# count the missing values in each column
df.isnull().sum()

### Missing Data on categorical fields

There are two approaches:
1. Drop the rows with missing values
2. Replace them with the most frequent element

In [None]:
# Approach 1 (disabled): drop the rows with missing categorical values
# df.dropna(how='any', subset=['Country', 'Purchased'], inplace=True)
# Approach 2: replace missing values with the most frequent value in that column
# (the bar plots show how often each value occurs)
ax = df[['Country']].value_counts().plot(kind='barh')
plt.show()
ax = df[['Purchased']].value_counts().plot(kind='barh')
plt.show()
df['Country'] = df['Country'].fillna(df['Country'].value_counts().index[0])
df['Purchased'] = df['Purchased'].fillna(df['Purchased'].value_counts().index[0])
df.isnull().sum()
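
The two approaches can be compared side by side on a tiny example. This is a minimal sketch, not the notebook's dataset: the `toy` DataFrame and its values are made up for illustration, and `mode()[0]` is used as an equivalent way of picking the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Country': ['France', 'Spain', np.nan, 'France', 'Germany'],
    'Purchased': ['Yes', np.nan, 'No', 'No', 'No'],
})

# Approach 1: drop every row that has a missing categorical value
dropped = toy.dropna(how='any', subset=['Country', 'Purchased'])

# Approach 2: replace missing values with the most frequent value (the mode)
filled = toy.copy()
for col in ['Country', 'Purchased']:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(len(toy), len(dropped))        # rows before and after dropping
print(filled['Country'].tolist())
print(filled['Purchased'].tolist())
```

Dropping loses two of the five rows here, while filling keeps every row at the cost of guessing the missing values.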

## Split the Dataset

In [None]:
# Splitting the dataset into independent variables (X) and the dependent variable (y)
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values

In [None]:
print(X)

In [None]:
print(y)

### Replace Missing numerical data

We will use the capabilities of Scikit-learn to deal with missing values in the numerical fields.

In [None]:
# replacing the missing values in the age and salary column with the mean
# import the SimpleImputer class from the sklearn library
from sklearn.impute import SimpleImputer
# help(SimpleImputer)
print(X[:, 1:3])

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [None]:
print(X[:, 1:3])
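
To see what the imputer does, here is a minimal sketch on a made-up array. The names `toy_num`, `toy_imputer` and `toy_filled` and the values are hypothetical. After fitting, `statistics_` holds the per-column means that replace the missing entries.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Age/Salary values; np.nan marks the missing entries
toy_num = np.array([
    [30.0, 50000.0],
    [np.nan, 60000.0],
    [50.0, np.nan],
])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_filled = toy_imputer.fit_transform(toy_num)

# Per-column means learned by fit, computed from the non-missing values only
print(toy_imputer.statistics_)
# The missing Age becomes 40 and the missing Salary becomes 55000
print(toy_filled)
```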

## Convert Categorical Data

Most learning algorithms work on numbers, so the optimization process cannot handle categorical data directly.
There are two possibilities:
1. **Label encoding**: convert each value in a column to a number.
2. **One Hot encoding**: convert each category value into a new column and assign a 1 or 0 (True/False) value to that column.

Label encoding has the advantage that it is straightforward but it has the disadvantage that the numeric values can be “misinterpreted” by the algorithms.
For example, the value 0 is obviously less than the value 4, but does that ordering really correspond to anything in the real data?
One Hot Encoding has the benefit of not weighting a value improperly but does have the downside of adding more columns to the data set.
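
The difference is easy to see on a small made-up column. This is a sketch for illustration only: the `countries` list is hypothetical, `LabelEncoder` is applied to a feature here just to show the integer codes, and `pd.get_dummies` stands in for one-hot encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical country values, made up for illustration
countries = ['France', 'Spain', 'Germany', 'France']

# Label encoding: one integer per category (classes are sorted alphabetically)
label_enc = LabelEncoder()
labels = label_enc.fit_transform(countries)
print(label_enc.classes_)
print(labels)

# One-hot encoding: one 0/1 column per category, exactly one 1 per row
one_hot = pd.get_dummies(pd.Series(countries), dtype=int)
print(one_hot)
```

With label encoding, Spain (2) ends up "greater than" Germany (1), an ordering that means nothing. The one-hot columns avoid this but add one column per category.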

In [None]:
# Handling Categorical Data
# One Hot Encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
print(X)

In [None]:
print(y)

In [None]:
# Encoding the target variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print(y)

In [None]:
# Splitting Dataset into Training and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

## Data Scaling and Normalization 

One of the reasons that it's easy to get confused between scaling and normalization is that the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that in scaling you're changing the range of your data, while in normalization you're changing the shape of its distribution. Let's look at each of these options in more depth.

### Scaling

This means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, such as support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.

For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions of the world. With currency, you can convert between currencies. But what if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).

By scaling your variables, you can help compare different variables on equal footing. To help solidify what scaling looks like, let's look at a made-up example. (Don't worry, we'll work with real data in just a second, this is just to help illustrate my point.)
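
A minimal sketch of such a made-up example, assuming min-max scaling with scikit-learn's `MinMaxScaler`. The prices are invented, and the 1:100 exchange rate follows the text above. Each column is mapped to the 0-1 range with (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices of the same five products; columns: Yen, US Dollars
prices = np.array([
    [100.0, 1.0],
    [250.0, 2.5],
    [500.0, 5.0],
    [750.0, 7.5],
    [1000.0, 10.0],
])

# Min-max scaling maps each column to [0, 1] independently
scaled = MinMaxScaler().fit_transform(prices)
print(scaled)
```

After scaling, the two columns are identical, so a distance-based method no longer treats the Yen column as 100 times more important than the Dollar column.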

### Normalization

Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.

>[Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution): Also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.

In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)
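
As an illustration of normalization in this sense, the sketch below applies a Box-Cox transform from SciPy to synthetic, strongly skewed data. Box-Cox is one possible choice and is an assumption here, since the notebook does not name a specific method. It requires strictly positive values, and the sample is randomly generated rather than taken from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample (exponential), generated for illustration only
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=1000)

# Box-Cox picks a power transform that makes the data as close to normal as it can
normalized, fitted_lambda = stats.boxcox(skewed)

# Skewness is 0 for a perfectly symmetric distribution
print(stats.skew(skewed), stats.skew(normalized))
```

The transformed sample should be far less skewed than the original, which is the change in the "shape of the distribution" described above.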

In [None]:
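# Feature Scaling
# StandardScaler standardizes each column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit on the training set only, scaling the columns from index 4 onward
# (the numeric features that follow the one-hot encoded columns)
X_train[:, 4:] = sc.fit_transform(X_train[:, 4:])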

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
# reuse the mean and standard deviation learned from the training set (no refit on test data)
X_test[:, 4:] = sc.transform(X_test[:, 4:])

In [None]:
print(X_test)
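
The cells above fit the scaler on the training set and only transform the test set, so no information from the test data leaks into the preprocessing. A minimal sketch of what that means numerically, using made-up single-column arrays (`toy_train`, `toy_test` and `toy_sc` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for one numeric feature
toy_train = np.array([[20.0], [30.0], [40.0]])
toy_test = np.array([[30.0], [50.0]])

toy_sc = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data
toy_train_std = toy_sc.fit_transform(toy_train)
# transform reuses those training statistics on the test data
toy_test_std = toy_sc.transform(toy_test)

print(toy_sc.mean_, toy_sc.scale_)
print(toy_train_std.ravel())
print(toy_test_std.ravel())
```

The standardized training column has mean 0 and standard deviation 1. The test column generally does not, because it is measured against the training statistics.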