<h1 align="center"><font size="5">Classification with Python</font></h1>

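In this notebook we load a dataset with the pandas library, apply several classification algorithms (K Nearest Neighbor, Decision Tree, Support Vector Machine, and Logistic Regression), and use accuracy evaluation metrics to find the best one for this specific dataset.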

Let's first load the required libraries:
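
No import cell survives at this point in the notebook. The sketch below is an assumption inferred from the names the later cells use (`np`, `pd`, `plt`, and `preprocessing`); it is not the original cell:

```python
# Assumed imports: inferred from the names used later in this notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
```

Other imports (seaborn, individual scikit-learn models and metrics) appear in the cells that need them.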

### About dataset

This dataset is about past loans. The __Loan_train.csv__ data set includes details of 346 customers whose loans have already been paid off or have defaulted. It includes the following fields:

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| Loan_status    | Whether a loan is paid off or in collection                                           |
| Principal      | Basic principal loan amount at the                                                    |
| Terms          | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
| Effective_date | When the loan was originated and took effect                                          |
| Due_date       | Since it’s a one-time payoff schedule, each loan has a single due date                |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

# Let's download the dataset

### Load Data From CSV File and show the first 5 rows.

In [None]:
# Notice that the data has some extra columns.
# This is how read_csv reads CSVs by default: it doesn't distinguish between an index and ordinary columns.
# To avoid this, use an argument called index_col.
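
As a small illustration of that point, here is a sketch using an in-memory CSV with made-up values (not the loan data) that was saved together with its index. Without `index_col`, the old index comes back as an extra `Unnamed: 0` column:

```python
import io

import pandas as pd

# A tiny CSV that was saved with its index: the first header cell is empty.
csv_text = ",Principal,age\n0,1000,45\n1,800,33\n"

plain = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))    # includes the leftover index column
print(list(indexed.columns))  # only the real data columns
```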

In [None]:
# How many columns and rows are there in the data?

### Convert to date time object 

In order to work with dates, you'll have to convert them to datetime objects.

Why? Because strings sort character by character: as strings, 10/7/2016 comes before 9/8/2016,
but September 8 is earlier than October 7, so that's not the proper sorting for dates.

Also, datetime objects allow you to do date arithmetic (such as subtracting two dates), index by year, quarter, month, etc.
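
A minimal sketch of the difference, using made-up dates rather than the loan data (the month/day/year format is an assumption about how the dates are written):

```python
import pandas as pd

# Made-up dates in month/day/year format.
raw = pd.Series(["10/7/2016", "9/8/2016", "11/5/2016"])

# As strings, values sort character by character, so "9/..." lands last.
string_order = sorted(raw)
print(string_order)

# As datetimes, they sort chronologically.
dates = pd.to_datetime(raw, format="%m/%d/%Y")
date_order = dates.sort_values().dt.strftime("%Y-%m-%d").tolist()
print(date_order)

# Datetime columns also support arithmetic and component access.
span_days = (dates.max() - dates.min()).days
months = dates.dt.month.tolist()
print(span_days, months)
```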

In [None]:
### First let's see which of the columns should be datetime objects.
## See what format they're in.
## Then convert to datetime object. 
# Hint: You will have to reassign the existing column.


# Data visualization and pre-processing



Let’s see how many of each class are in our data set.

In [None]:
# Print how many loans were paid off and how many went to collection.

Let's plot some columns to understand the data better:

In [None]:
# Note: installing seaborn might take a few minutes
!conda install -c anaconda seaborn -y

In [None]:
import seaborn as sns

#Plot one histogram each for male and female, and stack the columns into 
# 'paid' and 'collection'
# Here's one way to do it with seaborn:

bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

# Pre-processing:  Feature selection/extraction

### Let's look at the day of the week people get the loan

In [None]:
# Assign a new column called df['dayofweek'] that contains the day of the week.

## I'll help you with the plotting..
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()


We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization with a threshold: day 4 and later count as the weekend, and earlier days do not.

In [None]:
# Create a new column called 'weekend' that is 1 if the day is a weekend, else 0.

df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
df.head()

## Convert Categorical features to numerical values

Let's look at gender:

In [None]:
# Group by gender and get the distribution of loans in percentage terms per gender.

86% of females pay off their loans, while only 73% of males do.
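
Percentages like these can be computed with `groupby` followed by `value_counts(normalize=True)`. A minimal sketch on made-up rows (the status strings `PAIDOFF` and `COLLECTION` are assumed labels, not taken from the file):

```python
import pandas as pd

# Made-up rows, not the loan data.
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female"],
    "loan_status": ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION",
                    "PAIDOFF", "COLLECTION"],
})

# Share of each loan_status within each gender (each group sums to 1).
shares = toy.groupby("Gender")["loan_status"].value_counts(normalize=True)
print(shares)
```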

Let's convert male to 0 and female to 1:


In [None]:
# Convert the gender column where male = 0 and female = 1.
# You can overwrite the pre-existing column.


## One Hot Encoding  - you can read about one-hot encoding [here](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).
#### How about education?


In [None]:
# Group by education and get the distribution of loans in percentage terms per education.

#### Feature before One Hot Encoding

In [None]:
df[['Principal','terms','age','Gender','education']].head()

#### Use one hot encoding technique to convert categorical variables to binary variables and append them to the feature DataFrame

In [None]:
Feature = df[['Principal','terms','age','Gender','weekend']]

## Hint: You'll have to use a one-hot encoding function to one-hot encode education. Then reassign the 

# Many people like to get rid of one One-Hot Encoding column. After all, if all values are 0, then that one dropped column should be 1. Read about the implications of dropping that one column here:

# https://inmachineswetrust.com/posts/drop-first-columns/ 

# And then decide whether you want to drop that column or not. There's no right answer and you can experiment with both.
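
To make the drop-one-column trade-off concrete, here is a sketch with `pd.get_dummies` on made-up rows. The category names are placeholders, and `toy`, `full`, and `reduced` are hypothetical names used only for this illustration:

```python
import pandas as pd

# Made-up rows; the category names are placeholders.
toy = pd.DataFrame({
    "age": [45, 33, 27],
    "education": ["college", "High School or Below", "Bachelor"],
})

# One indicator column per category.
full = pd.concat([toy[["age"]], pd.get_dummies(toy["education"])], axis=1)

# Same encoding with the first category dropped.
reduced = pd.concat(
    [toy[["age"]], pd.get_dummies(toy["education"], drop_first=True)], axis=1
)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In `reduced`, a row whose remaining indicator columns are all 0 belongs to the dropped category, so no information is lost.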

### Feature selection

Let's define the feature set, X:

In [None]:
X = Feature
X[0:5]

What are our labels?

In [None]:
# Assign your label to a variable, y. Make sure y is a numpy array, not a pandas Series.
y = df['loan_status'].values
y[0:5]

## Normalize Data 

Before we standardize the data, let's do a train-test split. This is necessary because 
if we standardize the data using the test set, we'll already have some of the test set's
information when training a model.
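
A minimal sketch of that order of operations on synthetic data. The arrays and the `_demo`, `_tr`, and `_te` names are hypothetical stand-ins for the loan features; `test_size=0.2` and `random_state=4` mirror the values used later in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the loan features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# Learn the mean and variance from the training rows only...
scaler = StandardScaler().fit(X_tr)

# ...then apply that same transformation to both sets.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(2))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(2))  # usually not exactly 0
```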

In [None]:
from sklearn.model_selection import train_test_split

# Use an appropriate test_size - you can read more about that here:

# https://machinelearningmastery.com/much-training-data-required-machine-learning/

# TODO: Split your X, y data into train, test. 
X_train, X_test, y_train, y_test = #

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Now let's normalize our data! Norma-what?
Read [this post before proceeding](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e). Ask yourself, why am I doing this? What's the purpose?

One interesting point about scaling your data is that there's no one-size-fits-all approach. Many people have their own opinion backed by their own research. For example, read this blog post:  https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf

So, how you scale your data is up to you, but scaling your data is a common practice, so try your hand now.
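
To see how two common choices differ, here is a sketch that applies `StandardScaler` and `MinMaxScaler` to one made-up column containing an outlier (the values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column with an outlier.
col = np.array([[300.0], [800.0], [1000.0], [1000.0], [5000.0]])

standardized = StandardScaler().fit_transform(col)  # zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(col)       # rescaled to [0, 1]

print(standardized.ravel().round(2))
print(min_maxed.ravel().round(2))
```

With min-max scaling the outlier pins the top of the range, so the remaining values are squeezed toward 0; standardization leaves the outlier several standard deviations out instead.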

In [None]:
## Something I would advise is to try different scaling functions - StandardScaler, MinMaxScaler - and see which one works best for your intended purposes.

In [None]:
from sklearn import preprocessing
# TODO: Scale your training data

# Classification 

Now it is your turn: use the training set to build an accurate model, then use the test set to report the accuracy of the model.
You should use the following algorithms:
- K Nearest Neighbor(KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression



__Notice:__
- You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.
- You should use the scikit-learn, SciPy, or NumPy libraries for developing the classification algorithms.
- You should include the code of the algorithm in the following cells.

# K Nearest Neighbor(KNN)
Notice: You should find the best k to build the model with the best accuracy.  
**Warning:** To find the best __k__, split Loan_train.csv into train and test portions; do not tune __k__ on the separate loan_test.csv used for the final evaluation.
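
One possible way to search for k is sketched below. Note that it swaps in 5-fold cross-validation instead of the single train/test split this notebook asks for, and it runs on synthetic data; `X_demo`, `y_demo`, and `candidate_ks` are hypothetical names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate k.
candidate_ks = range(1, 11)
cv_acc = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in candidate_ks
]

best_k = candidate_ks[int(np.argmax(cv_acc))]
print("best k:", best_k, "cv accuracy:", round(max(cv_acc), 3))
```

Averaging over folds makes the choice of k less sensitive to one lucky or unlucky split, which matters on a dataset of only a few hundred rows.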

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 5 # Starting you off with 5, but try to find the best parameter for k.

neighK6 = # TODO

# TODO: Fit your model and get the accuracies accordingly.

from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neighK6.predict(X_train)))

# What do you think yhat is? It holds the model's predictions for X_test; y_test holds the ground truth.
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
##  Let's see what happens to the accuracy of your models as you go from 1 to N neighbors...

Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

ConfusionMx = []

# Iterate n from 1 to Ks - 1
    
    # Initialize and train a model using X_train and y_train.
    neigh = 
    
    # Predict the X_test values
    yhat =
    
    # Store the accuracy score of y_test in the mean_acc array
    mean_acc[n-1] = #

    # Store the standard error of the accuracy to gauge how uncertain it is.
    std_acc[n-1] = np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

# Plot a line with x= a range from 1 to Ks, and y being mean_acc. Color it green.
plt.plot( # )
    
# We'll now fill between the line to display the standard deviation of your results.
plt.fill_between(range(1,Ks),
                 mean_acc - 1 * std_acc,
                 mean_acc + 1 * std_acc, 
                 alpha=0.10) # Alpha sets the transparency of the shaded band.
    
# Now add a legend and label the axes.
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
    
# Let's display the best accuracy and what the value of k is at that accuracy.
best_k = #
best_acc = # 
print( "Best accuracy:", best_acc, "k=", best_k)

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

#Modelling - Look up what max_depth and criterion are if you don't know already.
# Choose a max_depth you'll go with - Again, it's a process of trial and error.

Tree = DecisionTreeClassifier(criterion="entropy", max_depth = )
Tree

In [None]:
# Fit the tree to X_train and y_train


In [None]:
# Predict the values in the test set.
predTree = # Predict using the Tree object

print(predTree[0:5])
print(y_test[0:5])

from sklearn import metrics
import matplotlib.pyplot as plt

# What's the accuracy score of your model?
print("Accuracy: ", )

In [None]:
# Let's install some visualization libraries...
!conda install -c conda-forge pydotplus -y
!conda install -c conda-forge python-graphviz -y
%matplotlib inline 

In [None]:
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree

# This library will let you see the tree in full glory! 

# You may have to debug this a bit.. it's an old version of the library and it took me
# a long time to install all dependencies. If this is troublesome, please skip.

dot_data = StringIO()
filename = "loan.png"
featureNames = list(Feature.columns)  # the columns the tree was trained on (assumes X = Feature)
targetNames = df['loan_status'].unique().tolist()

tree.export_graphviz(Tree, feature_names=featureNames, out_file=dot_data, class_names=np.unique(y_train), filled=True, special_characters=True, rotate=False)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
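
If the pydotplus and Graphviz setup above proves troublesome, scikit-learn also ships `sklearn.tree.plot_tree`, which draws a fitted tree with matplotlib alone. A minimal sketch on synthetic data (the feature and class names are placeholders, and `demo_tree` is a hypothetical stand-in for your fitted `Tree`):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in; feature and class names are placeholders.
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
demo_tree.fit(X_demo, y_demo)

fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(
    demo_tree,
    feature_names=["f0", "f1", "f2", "f3"],
    class_names=["class 0", "class 1"],
    filled=True,
    ax=ax,
)
plt.show()
```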

# Support Vector Machine

In [None]:
# Let's fit an SVM. What kernel should we use? Again, a process of trial and error...
# Read this guide to get you started: https://dataaspirant.com/svm-kernels/

from sklearn import svm
clf = svm.SVC( #TODO )
# Fit the model to the training data.    

In [None]:
# Predict the test set.
yhat = # TODO
yhat [0:5]



In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
np.set_printoptions(precision=2)


What is a [confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)?

Read that guide for a better understanding. Please send a message to the Discord channel if you don't know what's happening in the function below.
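
Before reading the plotting function, it may help to see the raw matrix on a handful of made-up labels. This is a sketch only; the status strings `PAIDOFF` and `COLLECTION` are assumed label values:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions.
y_true = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF"]
y_pred = ["PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF", "PAIDOFF"]

labels = ["PAIDOFF", "COLLECTION"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, columns are predicted classes, in `labels` order.
print(cm)
print(classification_report(y_true, y_pred, labels=labels))
```

Here the first row says that, of the four loans that were truly paid off, three were predicted correctly and one was predicted as collection.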

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
    
# Do a confusion matrix of your test data.
cnf_matrix = # TODO
    
# Print a classification report.

plt.figure()
# Plot the confusion matrix using the function above.

In [None]:
from sklearn.metrics import f1_score
# Print the f1 score of your y_test using the `f1_score` function

from sklearn.metrics import jaccard_score  # replaces jaccard_similarity_score, which recent scikit-learn releases removed
# Look up what the Jaccard index is and print it out.
# You can find more information here: https://en.wikipedia.org/wiki/Jaccard_index
# What does the Jaccard index between your test labels and your predictions tell you?
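
A small worked example of both metrics on made-up binary labels. It uses `jaccard_score`, since `jaccard_similarity_score` is no longer available in recent scikit-learn releases; the labels below are invented, with 1 as the positive class:

```python
from sklearn.metrics import f1_score, jaccard_score

# Made-up binary labels (1 = positive class).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

# Jaccard index for the positive class: TP / (TP + FP + FN).
jac = jaccard_score(y_true, y_pred)

# F1 for the positive class: 2*TP / (2*TP + FP + FN).
f1 = f1_score(y_true, y_pred)

print(jac, f1)
```

With 3 true positives, 1 false positive, and 1 false negative, the Jaccard index is 3/5 and F1 is 6/8. If your labels are strings rather than 0/1, both functions accept a `pos_label` argument naming the positive class.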



# Logistic Regression

In [None]:
# Let's now predict what
df['loan_status'] = df['loan_status'].astype('int')

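# Split first, then fit the scaler on the training rows only, so that no
# test-set statistics leak into training (see "Normalize Data" above).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)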

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LogR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LogR

In [None]:
yhat = LogR.predict(X_test)
yhat
yhat_prob = LogR.predict_proba(X_test)
yhat_prob

In [None]:
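from sklearn.metrics import jaccard_score  # replaces jaccard_similarity_score, which recent scikit-learn releases removed
jaccard_score(y_test, yhat)  # pass pos_label=... if your labels are not 0/1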
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)

# Model Evaluation using Test set

In [None]:
from sklearn.metrics import jaccard_score  # replaces jaccard_similarity_score, which recent scikit-learn releases removed
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

First, download and load the test set:

In [None]:
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv

### Load Test set for evaluation 

In [None]:
test_df = pd.read_csv('loan_test.csv')
test_df.head()

In [None]:
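# TODO: Build the test feature matrix by applying to test_df the same
# pre-processing steps you applied to df, then scale it with the scaler
# you fitted on the training data (do not refit the scaler here).
X = # TODO
Y = test_df['loan_status'].values
Y[0:5]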

In [None]:
#test the KNN algorithm already trained.
yhatKNN = # TODO
KNNJaccard = # TODO
KNNF1 = # TODO
print("Avg F1-score: %.2f" % KNNF1 )
print("KNN Jaccard Score: %.2f" % KNNJaccard)


# Test the Decision Tree model and see what we get.
yhatDEC = # TODO
DTJaccard = # TODO
DTF1 = # TODO
print("Avg F1-score: %.2f" % DTF1 )
print("Decision Tree Jaccard Score: %.2f" % DTJaccard)

# Now test the SVM model and see what we get.
yhatSVM = # TODO
SVMJaccard = # TODO
SVMF1 = # TODO
print("Avg F1-score: %.2f" % SVMF1)
print("SVM Jaccard score: %.2f" % SVMJaccard)

# Lastly, let's see what the logistic regressor tells us...
yhatLOG = # TODO
yhatLOGproba = #TODO
LogRJaccard = # TODO
LogRF1 = # TODO
Logloss = #TODO
print("LogLoss: %.2f" % Logloss)
print("Avg F1-score: %.4f" % LogRF1)
print("LOG Jaccard score: %.4f" % LogRJaccard)

# Report
You should be able to report the accuracy of the built model using different evaluation metrics:

In [None]:
# Make a pandas DataFrame with all your results.
# You could even use seaborn's heatmap in order to make a prettier, more digestible visualization.
# Was your intuition confirmed?


<h4>Inspiration from <a href="https://www.coursera.org/learn/machine-learning-with-python">Saeed Aghabozorgi</a></h4>