<center>
  <a href="MLSD-02-DataPreprocessing-D.ipynb" target="_self">Data Preprocessing D</a> | <a href="./">Content Page</a> | <a href="MLSD-02-DataPreprocessing-Ex-1.ipynb">Data Preprocessing Exercise</a>
</center>

# <center>DATA PREPROCESSING E</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Data Preprocessing
<b>Dataset</b>: Loan Prediction data set.<br>
<b>Tasks</b>: 
- To read in and explore the data set.
- To rescale features.
- To standardize features. 

## Read in and Explore Data Set

In [None]:
# Import libraries
import pandas as pd

In [None]:
# Read in data
X_train = pd.read_csv('./data/loanPrediction/X_train.csv')
y_train = pd.read_csv('./data/loanPrediction/y_train.csv')
X_test = pd.read_csv('./data/loanPrediction/X_test.csv')
y_test = pd.read_csv('./data/loanPrediction/y_test.csv')

In [None]:
# Print first 5 rows
X_train.head()

In [None]:
y_train

In [None]:
# Visualize histograms of columns with float64 or int64 data
import matplotlib.pyplot as plt

numeric_cols = X_train.select_dtypes(include=['float64', 'int64'])
numeric_cols.hist(figsize=(11, 11))
plt.show()

**Observations**:
- ApplicantIncome and CoapplicantIncome are in a similar range (0 to 50,000 dollars).
- LoanAmount is in thousands of dollars and ranges from 0 to 600.
- Loan_Amount_Term differs from the other variables because its unit is months rather than dollars.
- If we apply distance-based methods such as kNN to these features, the feature with the largest range will dominate the distance computation and we will obtain less accurate predictions.
- To overcome this, use feature scaling.
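
The effect of an unscaled feature on a distance-based method can be illustrated with a small sketch. The three applicants below are made-up values, not rows from the loan data set, and the ranges used for rescaling (income 0 to 50000, term 0 to 480) are assumptions for illustration only.

```python
import math

# Hypothetical applicants: (ApplicantIncome in dollars, Loan_Amount_Term in months)
a = (5000.0, 360.0)
b = (5100.0, 12.0)    # similar income, very different loan term
c = (9000.0, 360.0)   # same loan term, very different income

# Euclidean distance on the raw values: the income column dominates,
# so b looks much closer to a than c does
dist_ab = math.dist(a, b)
dist_ac = math.dist(a, c)
print(dist_ab, dist_ac)

# Rescale each feature by an assumed range so both columns lie in [0, 1]
def rescale(point, ranges=(50000.0, 480.0)):
    return tuple(value / r for value, r in zip(point, ranges))

# After rescaling, the large difference in loan term is no longer hidden
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))
print(scaled_ab, scaled_ac)
```

On the raw values `b` is the nearer neighbour of `a`; after rescaling `c` is. Which neighbour is "right" depends on the problem, but the sketch shows that the raw ordering is driven almost entirely by the column with the largest numeric range.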

## Rescale Features

### Using k Nearest Neighbors Classifier

In [None]:
# Fitting k-NN on the unscaled data set
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn=KNeighborsClassifier(n_neighbors=5)
X_train_limited = X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']]
X_test_limited = X_test[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']]
y_train = y_train['Target'].values # convert y_train from a column vector to a 1d array
knn.fit(X_train_limited,y_train)

# Checking the model's accuracy
accuracy_score(y_test,knn.predict(X_test_limited))

In [None]:
# Importing MinMaxScaler and initializing it
from sklearn.preprocessing import MinMaxScaler
min_max=MinMaxScaler()

# Scaling down both train and test data set
# (fit the scaler on the training data only, then reuse it on the test data)
X_train_minmax=min_max.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_minmax=min_max.transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

In [None]:
# Fitting k-NN on our scaled data set
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_minmax,y_train)

# Checking the model's accuracy
accuracy_score(y_test,knn.predict(X_test_minmax))

**Observations**:
- The accuracy has increased from 61% to 75%.
- This suggests that, before scaling, the features with a larger range were dominating the distance computation in kNN and therefore the predictions.
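
Min-max scaling maps each feature to the interval [0, 1] using `x' = (x - min) / (max - min)`. The following is a minimal sketch of that formula on small made-up arrays (not the loan data), with the minimum and maximum taken from the training rows only and then reused on the test rows.

```python
import numpy as np

# Made-up values: columns are (income in dollars, loan term in months)
train = np.array([[2000.0, 120.0],
                  [5000.0, 360.0],
                  [10000.0, 480.0]])
test = np.array([[6000.0, 240.0]])

# Learn the per-column minimum and maximum from the training rows only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# Apply the same transformation to both sets
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

Because the statistics come from the training rows, a test value outside the training range would map outside [0, 1]; that is expected and keeps both sets on the same scale.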

### Using Logistic Regression Classifier

In [None]:
# Fitting logistic regression on the unscaled data set
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log = LogisticRegression(penalty='l2',C=.01)
log.fit(X_train_limited,y_train)

# Checking the model's accuracy
accuracy_score(y_test,log.predict(X_test_limited))

In [None]:
# Importing MinMaxScaler and initializing it
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()

# Scaling down both train and test data set
# (fit the scaler on the training data only, then reuse it on the test data)
X_train_minmax=min_max.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_minmax=min_max.transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

In [None]:
# Fitting logistic regression on the min-max scaled data set
log = LogisticRegression(penalty='l2',C=.01)
log.fit(X_train_minmax,y_train)

# Checking the model's accuracy
accuracy_score(y_test,log.predict(X_test_minmax))

**Observations**:
- The accuracy has increased only slightly, from 61% to 63%.
- In logistic regression, each feature is assigned a weight or coefficient (Wi). If a feature has a relatively large range but contributes little to the objective function, logistic regression can assign a very small value to its coefficient, which neutralizes the dominant effect of that feature.
- Distance-based methods such as kNN have no such built-in mechanism, so they require scaling.
- To improve the logistic regression, use feature standardization.
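
Standardization rescales each feature to zero mean and unit variance using `z = (x - mean) / std`. A minimal sketch of the formula on made-up values follows; it uses the population standard deviation (`ddof=0`), which is what scikit-learn's `StandardScaler` computes.

```python
import numpy as np

# Made-up training values: columns are (income in dollars, loan term in months)
train = np.array([[2000.0, 120.0],
                  [5000.0, 360.0],
                  [11000.0, 480.0]])

# Per-column statistics learned from the training rows
mean = train.mean(axis=0)
std = train.std(axis=0)          # population standard deviation (ddof=0)

train_std = (train - mean) / std

print(train_std.mean(axis=0))    # approximately 0 for each column
print(train_std.std(axis=0))     # approximately 1 for each column
```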

## Standardize Features

### Using Logistic Regression Classifier

In [None]:
# Standardizing the train and test data
from sklearn.preprocessing import StandardScaler

# Standardize X using StandardScaler
scaler = StandardScaler().fit(X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
rescaledX_train = scaler.transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
rescaledX_test = scaler.transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

# Fitting logistic regression on our standardized data set
from sklearn.linear_model import LogisticRegression
log=LogisticRegression(penalty='l2',C=.01)
log.fit(rescaledX_train,y_train)

# Checking the model's accuracy
accuracy_score(y_test,log.predict(rescaledX_test))

**Observations**:
- The accuracy has increased from 61% to 75%.
- This suggests that standardizing the data helps increase prediction accuracy when using an estimator with L1 or L2 regularization, because the penalty is applied to all coefficients alike regardless of each feature's scale.
- Other learners such as kNN with a Euclidean distance measure, k-means, SVM, perceptron, neural networks, linear discriminant analysis, and principal component analysis may also perform better with standardized data.
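
One way to make sure the scaler is fitted on the training data only is to chain it with the estimator in a scikit-learn pipeline. The sketch below is not part of the original notebook: it uses a synthetic data set from `make_classification` as a stand-in for the loan data, and the inflated second column is an assumption meant to mimic a feature with a large range.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: 5 numeric features)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X[:, 1] *= 10000  # inflate one column to mimic a large-range feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() learns the scaling statistics from X_tr only;
# predict() reuses those statistics on X_te.
# LogisticRegression applies an L2 penalty by default.
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.01))
model.fit(X_tr, y_tr)

acc = accuracy_score(y_te, model.predict(X_te))
print(acc)
```

With this arrangement the same `model` object can be passed to cross-validation utilities without the test folds influencing the scaling statistics.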
