<center>
  <a href="MLSD-02-DataPreprocessing-A.ipynb" target="_self">Data Preprocessing A</a> | <a href="./">Content Page</a> | <a href="MLSD-02-DataPreprocessing-C.ipynb">Data Preprocessing C</a> | <a href="MLSD-02-DataPreprocessing-Ex-1.ipynb">Data Preprocessing Exercise</a>
</center>

# <center>DATA PREPROCESSING B</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Data Preprocessing 
<b>Dataset</b>: Pima Indians Diabetes data set.<br>
<b>Tasks</b>: 
- To read in and explore data set.
- To rescale data. 
- To binarize data (i.e. make them binary).
- To standardize data.

## Read in and Explore Data Set

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [2]:
# Path to the data set
path = "./data/diabetes/diabetes.csv"

# Load the CSV file at the given path into a DataFrame
df = pd.read_csv(path)
df.head()

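| | preg_count | glucose_concentration | blood_pressure | skin_thickness | serum_insulin | bmi | pedigree_function | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |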


In [3]:
# Convert the DataFrame to a NumPy array
array = df.values
print(array)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]


In [4]:
# Separate array into input and output components
X = array[:,0:8]
y = array[:,8]
print("\nX values\n", X[0:5,:]) # print first 5 rows with all columns
print("\ny values\n", y[0:5])   # print first 5 values of y corresponding to first 5 rows of X


X values
 [[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01]]

y values
 [1. 0. 1. 0. 1.]
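
The positional slices above depend on the label being the last of the nine columns. As a minimal sketch, assuming the column names shown in the `df.head()` output, the same split can be made by column name. The small DataFrame below is rebuilt from the five rows shown above purely for illustration, and the names `X_named` and `y_named` are hypothetical.

```python
import pandas as pd

# First five rows, copied from the df.head() output above (illustration only)
df_sample = pd.DataFrame({
    "preg_count": [6, 1, 8, 1, 0],
    "glucose_concentration": [148, 85, 183, 89, 137],
    "blood_pressure": [72, 66, 64, 66, 40],
    "skin_thickness": [35, 29, 0, 23, 35],
    "serum_insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "pedigree_function": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
    "class": [1, 0, 1, 0, 1],
})

# Inputs: every column except the label
X_named = df_sample.drop(columns="class").to_numpy()
# Output: the label column only
y_named = df_sample["class"].to_numpy()

print(X_named.shape)  # (5, 8)
print(y_named)        # [1 0 1 0 1]
```

Selecting by name keeps the split correct even if the column order in the file changes.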


## Rescale Data

- When attributes have widely varying scales, many machine learning algorithms benefit from rescaling them to a common scale.
- Rescaling helps the optimization algorithms at the core of many machine learning methods, such as gradient descent.
- It also helps algorithms that weight their inputs, such as regression and neural networks, and algorithms that use distance measures, such as K-Nearest Neighbors.
- Tree-based methods (e.g., XGBoost, LightGBM), by contrast, are largely insensitive to feature scaling.
- In scikit-learn, we can rescale data with the `MinMaxScaler` class.
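
For the default range of 0 to 1, min-max rescaling maps each column with `x' = (x - min) / (max - min)`. The NumPy sketch below applies that formula to the first five rows of `X` only, so its numbers will differ from results computed on the full data set. The names `X_sample` and `X_minmax` are hypothetical.

```python
import numpy as np

# First five rows of X, copied from the output above (illustration only)
X_sample = np.array([
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0],
    [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0],
    [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0],
    [0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0],
])

col_min = X_sample.min(axis=0)    # per-column minimum
col_max = X_sample.max(axis=0)    # per-column maximum
col_range = col_max - col_min
col_range[col_range == 0] = 1.0   # guard: avoid dividing by zero for constant columns

# Min-max formula applied column by column via broadcasting
X_minmax = (X_sample - col_min) / col_range

np.set_printoptions(precision=3, suppress=True)
print(X_minmax)
```

Each column's smallest value maps to 0 and its largest to 1; everything else falls in between.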

In [5]:
# Rescale X
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

In [6]:
# Summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]


**Observations**:
- Every attribute now lies in the range 0 to 1: the smallest value in each column maps to 0 and the largest maps to 1.
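
To illustrate why this matters for distance-based algorithms, the sketch below compares the Euclidean distance between the first two rows before and after rescaling, using only two attributes. The raw values come from the `X` output and the rescaled values from the output above; the variable names are hypothetical.

```python
import numpy as np

# glucose_concentration and pedigree_function for the first two rows (raw values)
a_raw = np.array([148.0, 0.627])
b_raw = np.array([85.0, 0.351])

# The same two attributes after min-max rescaling (from the output above)
a_scaled = np.array([0.744, 0.234])
b_scaled = np.array([0.427, 0.117])

def pedigree_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the second attribute."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()

raw_dist = np.linalg.norm(a_raw - b_raw)           # about 63.0, almost entirely glucose
scaled_dist = np.linalg.norm(a_scaled - b_scaled)  # about 0.34

print(raw_dist, pedigree_share(a_raw, b_raw))
print(scaled_dist, pedigree_share(a_scaled, b_scaled))
```

On the raw values the pedigree attribute contributes a negligible fraction of the squared distance; after rescaling it contributes roughly 12%, so it can actually influence a nearest-neighbor comparison.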

## Binarize Data
- We can transform data using a binary threshold: all values above the threshold are marked 1, and all values equal to or below it are marked 0.
- This is called binarizing or thresholding the data. It is useful when you have probabilities that you want to turn into crisp values.
- It is also useful in feature engineering, when you want to add new features that flag something meaningful.
- In scikit-learn, we can create binary attributes with the `Binarizer` class.
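
As a small illustration of turning probabilities into crisp values, the sketch below thresholds a made-up array of probabilities at 0.5 with plain NumPy. The array `probs` and the 0.5 cut-off are assumptions chosen for the example, not values from the data set.

```python
import numpy as np

# Made-up probabilities (illustration only)
probs = np.array([0.10, 0.50, 0.51, 0.93])
threshold = 0.5

# Strictly greater than the threshold -> 1; equal to or below -> 0
labels = (probs > threshold).astype(int)

print(labels)  # [0 0 1 1]
```

Note that a value exactly equal to the threshold (0.50 here) is marked 0, matching the rule stated above.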

In [7]:
# Import libraries
from sklearn.preprocessing import Binarizer

In [8]:
# Path to the data set
path = "./data/diabetes/diabetes.csv"

# Load the CSV file at the given path into a DataFrame
df = pd.read_csv(path)
df.head()

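| | preg_count | glucose_concentration | blood_pressure | skin_thickness | serum_insulin | bmi | pedigree_function | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |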


In [9]:
# Convert the DataFrame to a NumPy array
array = df.values
print(array)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]


In [10]:
# Separate array into input and output components
X = array[:, 0:8]
y = array[:, 8]
print(X[0:5,:]) # print first 5 rows with all columns

[[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01]]


In [11]:
# Binarize X
binarizer = Binarizer(threshold=0.0).fit(X)  # values less than or equal to 0 become 0; values above 0 become 1
binaryX = binarizer.transform(X)

In [12]:
# Summarize transformed data
np.set_printoptions(precision = 3)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]


**Observations**:
- All values less than or equal to 0 are marked 0, and all values above 0 are marked 1.
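
The same thresholding can be sketched in plain NumPy. The example below applies a threshold of 0 to the first five rows of `X` and also counts the zero entries per column, which is one way such binary flags can be put to use. The names `X_sample`, `binary_sample`, and `zero_counts` are hypothetical.

```python
import numpy as np

# First five rows of X, copied from the output above (illustration only)
X_sample = np.array([
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0],
    [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0],
    [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0],
    [0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0],
])

# Values above 0 become 1.0; values less than or equal to 0 become 0.0
binary_sample = (X_sample > 0.0).astype(float)
print(binary_sample)

# Number of entries flagged 0 in each column of the sample
zero_counts = (binary_sample == 0).sum(axis=0)
print(zero_counts)  # [1 0 0 1 3 0 0 0]
```

The first printed array matches the five rows of `binaryX` shown above.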

## Standardize Data
- Standardization transforms attributes that have differing means and standard deviations so that each has a mean of 0 and a standard deviation of 1. It is most appropriate for attributes that are roughly Gaussian, which it maps to a standard Gaussian distribution.
- We can standardize data using scikit-learn with the `StandardScaler` class.
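
Standardization computes a z-score for each value, `z = (x - mean) / std`, column by column. The NumPy sketch below applies this to the first five rows of `X` only, so its numbers will differ from results computed on the full data set. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses. The names `X_sample` and `X_std` are hypothetical.

```python
import numpy as np

# First five rows of X, copied from the earlier output (illustration only)
X_sample = np.array([
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0],
    [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0],
    [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0],
    [0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0],
])

col_mean = X_sample.mean(axis=0)         # per-column mean
col_std = X_sample.std(axis=0, ddof=0)   # per-column population standard deviation
col_std[col_std == 0] = 1.0              # guard: avoid dividing by zero for constant columns

# z-score applied column by column via broadcasting
X_std = (X_sample - col_mean) / col_std

np.set_printoptions(precision=3, suppress=True)
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```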

In [13]:
# Import libraries
from sklearn.preprocessing import StandardScaler

In [14]:
# Path to the data set
path = "./data/diabetes/diabetes.csv"

# Load the CSV file at the given path into a DataFrame
df = pd.read_csv(path)
df.head()

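| | preg_count | glucose_concentration | blood_pressure | skin_thickness | serum_insulin | bmi | pedigree_function | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |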


In [15]:
# Convert the DataFrame to a NumPy array
array = df.values
print(array)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]


In [16]:
# Separate array into input and output components
X = array[:, 0:8]
y = array[:, 8]

In [17]:
# Standardize X
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

In [18]:
# Summarize transformed data
np.set_printoptions(precision = 3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]


**Observations**:
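- Each attribute now has a mean of 0 and a standard deviation of 1.
- The features are now on a comparable scale, but standardization does not confine them to a fixed interval. Most values fall roughly between -3 and 3, while outliers lie beyond that range (e.g., 5.485 for `pedigree_function` in the fifth row).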
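
Standardized values have no fixed upper or lower bound. The sketch below uses made-up data (19 zeros and a single value of 100, chosen only for illustration) to show one extreme value receiving a z-score well beyond 3 while the column as a whole still has mean 0 and standard deviation 1.

```python
import numpy as np

# Made-up column: 19 ordinary values and one extreme value (illustration only)
values = np.array([0.0] * 19 + [100.0])

# z-score with the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std()

print(z.max())            # roughly 4.36, outside the interval -3 to 3
print(z.mean(), z.std())  # approximately 0 and 1
```

If a bounded range is required, min-max rescaling is the transformation that guarantees one.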
