<center>
  <a href="MLSD-02-DataPreprocessing-A.ipynb" target="_self">Data Preprocessing A</a> | <a href="./">Content Page</a> | <a href="MLSD-02-DataPreprocessing-C.ipynb">Data Preprocessing C</a> | <a href="MLSD-02-DataPreprocessing-Ex-1.ipynb">Data Preprocessing Exercise</a>
</center>

# <center>DATA PREPROCESSING B</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Data Preprocessing 
<b>Dataset</b>: Pima Indians Diabetes data set.<br>
<b>Tasks</b>: 
- To read in and explore data set.
- To rescale data. 
- To binarize data (i.e. make them binary).
- To standardize data.

## Read in and Explore Data Set

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Data set link
path = "./data/diabetes/diabetes.csv"

# Load the CSV file at the given path into a DataFrame
df = pd.read_csv(path)
df.head()

In [None]:
# Restructure dataframe to Numpy arrays
array = df.values
print(array)

In [None]:
# Separate array into input and output components
X = array[:,0:8]
y = array[:,8]
print("\nX values\n", X[0:5,:]) # print first 5 rows with all columns
print("\ny values\n", y[0:5])   # print first 5 values of y corresponding to first 5 rows of X

## Rescale Data

- When attributes are on different scales, many machine learning algorithms benefit from rescaling them to a common scale.
- This helps optimization algorithms at the core of many machine learning methods, such as gradient descent.
- It also helps algorithms that weight inputs, such as regression and neural networks, and algorithms that use distance measures, such as K-Nearest Neighbors.
- We can rescale data in scikit-learn with the `MinMaxScaler` class.
- Tree-based methods (e.g. XGBoost, LightGBM), however, are largely insensitive to feature scaling, because they split on thresholds of individual features.
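As a minimal sketch of what min-max rescaling computes, the cell below applies the formula `(x - min) / (max - min)` column by column and compares the result with `MinMaxScaler`. The array `X_demo` and the names `manual` and `sk` are made-up illustrations, not part of the diabetes data set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [4.0, 500.0]])

# Manual min-max rescaling, computed per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation using scikit-learn
sk = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual, sk))
```

After rescaling, the smallest value in each column maps to 0 and the largest to 1, regardless of the original units.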

In [None]:
# Rescale X
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

In [None]:
# Summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])

**Observations**:
- All values are in the range between 0 and 1.

## Binarize Data
- We can transform data using a binary threshold: all values above the threshold are marked 1, and all values equal to or below it are marked 0.
- This is called binarizing or thresholding the data. It is useful when you have probabilities that you want to turn into crisp values.
- It is also useful in feature engineering, when you want to add new features that flag something meaningful.
- We can create binary attributes in scikit-learn with the `Binarizer` class.
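To illustrate the probability use case mentioned above, here is a small sketch that turns probabilities into crisp 0/1 values with a threshold of 0.5. The `probs` array is hypothetical and the 0.5 cut-off is an assumption chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities (a single column)
probs = np.array([[0.10], [0.50], [0.51], [0.90]])

# Values strictly greater than the threshold become 1; all others become 0
labels = Binarizer(threshold=0.5).fit_transform(probs)

# Equivalent NumPy expression, for comparison
manual = (probs > 0.5).astype(float)

print(labels.ravel())
```

Note that a value exactly equal to the threshold (0.50 here) is marked 0, not 1.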

In [None]:
# Import libraries
from sklearn.preprocessing import Binarizer

In [None]:
# Data set link
path = "./data/diabetes/diabetes.csv"

# Load the CSV file at the given path into a DataFrame
df = pd.read_csv(path)
df.head()

In [None]:
# Restructure dataframe to Numpy arrays
array = df.values
print(array)

In [None]:
# Separate array into input and output components
X = array[:, 0:8]
y = array[:, 8]
print(X[0:5,:]) # print first 5 rows with all columns

In [None]:
# Binarize X
binarizer = Binarizer(threshold=0.0).fit(X)  # values <= 0 become 0; values > 0 become 1
binaryX = binarizer.transform(X)

In [None]:
# Summarize transformed data
np.set_printoptions(precision = 3)
print(binaryX[0:5,:])

**Observations**:
- All values less than or equal to 0 are marked 0, and all values above 0 are marked 1.

## Standardize Data
- Standardization is a useful technique for transforming attributes that have a Gaussian distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
- We can standardize data using scikit-learn with the `StandardScaler` class.
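The standardization described above amounts to computing a z-score, `(x - mean) / std`, for each column. The sketch below does this by hand on a made-up array (`X_demo` is hypothetical) and compares it with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features with different means and spreads
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [4.0, 500.0]])

# Manual z-score per column; np.std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
manual = (X_demo - mean) / std

# The same transformation using scikit-learn
sk = StandardScaler().fit_transform(X_demo)

print(manual)
print("Matches StandardScaler:", np.allclose(manual, sk))
```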

In [None]:
# Import libraries
from sklearn.preprocessing import StandardScaler

In [None]:
# Data set link
path = "./data/diabetes/diabetes.csv"

# Load the CSV file at the given path into a DataFrame
df = pd.read_csv(path)
df.head()

In [None]:
# Restructure dataframe to Numpy arrays
array = df.values
print(array)

In [None]:
# Separate array into input and output components
X = array[:, 0:8]
y = array[:, 8]

In [None]:
# Scale X
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

In [None]:
# Summarize transformed data
np.set_printoptions(precision = 3)
print(rescaledX[0:5,:])

**Observations**:
- The values for each attribute now have a mean value of 0 and a standard deviation of 1.
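- The features are now on a comparable scale. For roughly Gaussian attributes most values fall between -3 and 3, but standardization does not bound the range, so outliers can lie outside it.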
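Standardization fixes the mean and standard deviation of each attribute, but it does not by itself cap the values. The sketch below checks this on a hypothetical column (`col`, invented for illustration) that contains a single large outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column: 19 zeros and one large outlier
col = np.array([0.0] * 19 + [100.0]).reshape(-1, 1)

z = StandardScaler().fit_transform(col)

print("mean:", round(float(z.mean()), 6))
print("std: ", round(float(z.std()), 6))
print("max: ", round(float(z.max()), 3))  # exceeds 3 because of the outlier
```

A similar check on the standardized diabetes features (for example, printing the minimum and maximum of each column) would show whether any of them fall outside the -3 to 3 band.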
