# Machine Learning Model Generation - Multiclass Classification

### 1. [Tutorial links](#tutorials)
### 2. [Load CSV File](#csv_file)
### 3. [Split the dataset. Technique: Train-Test split](#train_test)
### 4. [Generate Multiclass Classification Models](#multiclass_train_test)

> #### 4.1. [Logistic Regression Model](#lr_train_test)
> #### 4.2. [Random Forest Model](#rf_train_test)
> #### 4.3. [k-Nearest Neighbor Model](#knn_train_test)
> #### 4.4. [Support Vector Machine (SVM) Model](#svm_train_test)
> #### 4.5. [XGBoost Classifier Model](#xgb_train_test)


### 5. [Analyze the model results](#analyze_results)

> #### 5.1. [Print Accuracy of all models](#print_acc)

## <a id='tutorials'>1. Tutorial links</a>

### Load libraries

In [1]:
# Import pandas library
import pandas as pd

# Import numpy library
import numpy as np

# Import LogisticRegression library
from sklearn.linear_model import LogisticRegression

# Import RandomForestClassifier library
from sklearn.ensemble import RandomForestClassifier

# Import KNeighborsClassifier library
from sklearn.neighbors import KNeighborsClassifier

# Import Support Vector Machine (SVM) library
from sklearn import svm

# Import XGBClassifier library
from xgboost import XGBClassifier

# Import Train-Test split library
from sklearn.model_selection import train_test_split

# Import KFold split library
from sklearn.model_selection import KFold

# Import accuracy score computing library
from sklearn.metrics import accuracy_score

# Import metrics library
from sklearn import metrics

# Import matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline

# Import warnings
import warnings
warnings.filterwarnings('ignore')

### Source data location and data dictionary

## <a id='csv_file'>2. Load CSV file from GitHub or local disk</a>

In [2]:
# Load Wine Dataset from GitHub
# wine_data = pd.read_csv("https://raw.githubusercontent.com/socratesk/YHatSchoolOfAI/master/data/wineclass.csv")

# Load Wine Dataset from local disk
wine_data = pd.read_csv("data/wineclass.csv")

# Print the shape
print(wine_data.shape)

# Print the first few rows to visualize the data
wine_data.head()

(178, 14)


| | class | alcohol | malic_acid | ash | ash_alcalinity | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_diluted_wine | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


## <a id='train_test'>3. Split the dataset. Technique: Train-Test split</a>

In [3]:
# Set the fraction of rows held out for the Test set (0.2 gives an 80:20 Train-Test split)
SPLIT_RATIO = 0.2

# Split the dataset
X_train, X_test, Y_train, Y_test = train_test_split(wine_data.drop('class',  axis = 1), 
                                                    wine_data['class'], 
                                                    test_size=SPLIT_RATIO, 
                                                    random_state = 225)

# Print the shape of the Train set
print("Train dataset: ", X_train.shape, Y_train.shape)

# Print the shape of the Test set
print("Test dataset: ", X_test.shape, Y_test.shape)

Train dataset:  (142, 13) (142,)
Test dataset:  (36, 13) (36,)
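The split above is purely random, so the class mix in the Train and Test sets can drift apart on a dataset this small. A minimal sketch of a stratified split follows. It assumes scikit-learn's bundled `load_wine` data as a stand-in for `data/wineclass.csv` (its labels run 0 to 2 rather than 1 to 3), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine data stands in for data/wineclass.csv
X, y = load_wine(return_X_y=True)

# stratify=y keeps the class proportions roughly equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=225, stratify=y)

# Share of each class in the Train and Test sets
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)

print("Train class share:", train_share)
print("Test class share: ", test_share)
```

Whether stratification changes the accuracies reported in this notebook is not something this sketch establishes; it only shows how to request it.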


## <a id='multiclass_train_test'>4. Generate Multiclass Classification Models</a>

### <a id='lr_train_test'>4.1 Logistic Regression Model</a>

In [4]:
# Generate a Logistic Regression object
lr_model = LogisticRegression( multi_class='ovr') # solver='liblinear',

# Train a Logistic Regression model with Train dataset
lr_model.fit(X_train, Y_train)

# Predict the multiclass outcome
y_hat_lr = lr_model.predict(X_test)

# Compute the accuracy score
accuracy_lr = accuracy_score(Y_test, y_hat_lr)

# Print the accuracy score
print("Accuracy of Logistic Regression model: ", accuracy_lr)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_lr, rownames=['Actual'], colnames=['Predicted'], margins=True)

Accuracy of Logistic Regression model:  0.9722222222222222


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 12 | 1 | 0 | 13 |
| 2 | 0 | 11 | 0 | 11 |
| 3 | 0 | 0 | 12 | 12 |
| All | 12 | 12 | 12 | 36 |


#### 4.1.1 Print Logistic Regression Model parameters

In [5]:
# Print the Logistic Regression Model
lr_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

#### 4.1.2 To output the Multiclass Probability using Logistic Regression model ...

As discussed in an earlier class, a classifier is trained on the Train dataset and produces a model that can predict the probability of each outcome, not only the outcome itself.

The example above used `lr_model.predict(X_test)` to predict the class label. To get the probability of each class instead, use `predict_proba`. Each row of the result holds one probability per class, in the order given by `lr_model.classes_`.

In [6]:
# Predict the probability of each class
y_hat_lr_proba = lr_model.predict_proba(X_test)

# Print the first 8 rows to visualize the prediction
y_hat_lr_proba[:8]

array([[3.25387787e-04, 1.14311342e-04, 9.99560301e-01],
       [9.90860715e-01, 1.72734703e-03, 7.41193802e-03],
       [2.04164828e-03, 9.97679422e-01, 2.78929778e-04],
       [2.79216779e-04, 7.54560454e-04, 9.98966223e-01],
       [1.38918724e-01, 8.56645786e-01, 4.43549076e-03],
       [9.92316657e-01, 2.53397417e-03, 5.14936908e-03],
       [1.32750779e-02, 1.60445883e-01, 8.26279039e-01],
       [5.68124955e-07, 4.12972513e-01, 5.87026919e-01]])
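The relationship between `predict_proba` and `predict` can be checked directly: each row of probabilities sums to 1, and the most probable column corresponds to the predicted label. The sketch below assumes scikit-learn's bundled `load_wine` data in place of `data/wineclass.csv` (labels 0 to 2) and a default `LogisticRegression` with a raised `max_iter`, not the exact settings used above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine data stands in for data/wineclass.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=225)

# max_iter is raised because the features are not scaled
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)

# Each row is a probability distribution over the classes
row_sums = proba.sum(axis=1)

# Map the most probable column back to a class label via classes_
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]
labels_from_predict = model.predict(X_test)

print("Rows sum to 1:", np.allclose(row_sums, 1.0))
print("argmax matches predict:", np.array_equal(labels_from_proba, labels_from_predict))
```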

#### 4.1.3 Area Under the Curve (AUC) for Multiclass classification

To draw the ROC curve and compute the Area Under the Curve (AUC) for multiclass classification, refer to this scikit-learn example: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
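If only the AUC number is needed and not the plot, `roc_auc_score` accepts the class probabilities directly. This is a minimal sketch, assuming scikit-learn's bundled `load_wine` data in place of `data/wineclass.csv` and macro-averaged one-vs-rest scoring; other averaging choices are equally valid.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine data stands in for data/wineclass.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=225)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Multiclass AUC needs the per-class probabilities, not the predicted labels
proba = model.predict_proba(X_test)

# One-vs-rest AUC, averaged over the classes with equal weight
auc_ovr = roc_auc_score(y_test, proba, multi_class='ovr', average='macro')
print("Macro one-vs-rest ROC AUC:", auc_ovr)
```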

### <a id='rf_train_test'>4.2 Random Forest Model</a>

In [7]:
# Generate a Random Forest Classifier object
rf_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=1)

# n_estimators - represents no of trees in the forest
# n_jobs - No of cores to be used
# max_depth - depth of each tree in the forest
# min_samples_split - Minimum number of samples required to split an internal node
# min_samples_leaf  - Minimum number of samples required to be at a leaf node
# max_features - the number of features to consider when looking for best split

# Train a Random Forest model with Train dataset
rf_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_rf = rf_model.predict(X_test)

# Compute the accuracy score
accuracy_rf = accuracy_score(Y_test, y_hat_rf)

# Print the accuracy score
print("Accuracy of Random Forest model: ", accuracy_rf)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_rf, rownames=['Actual'], colnames=['Predicted'], margins=True)

Accuracy of Random Forest model:  1.0


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 13 | 0 | 0 | 13 |
| 2 | 0 | 11 | 0 | 11 |
| 3 | 0 | 0 | 12 | 12 |
| All | 13 | 11 | 12 | 36 |


#### 4.2.1 Print Random Forest Model parameters

In [8]:
# Print the Random Forest Model
rf_model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

#### 4.2.2 To output the Multiclass Probability using Random Forest model ...

In [None]:
# Predict the probability of each class
y_hat_rf_proba = rf_model.predict_proba(X_test)

# Print the first 8 rows to visualize the prediction
y_hat_rf_proba[:8]

### <a id='knn_train_test'>4.3 k-Nearest Neighbor Model</a>

In [9]:
# Generate a k-Nearest Neighbor object
knn_model = KNeighborsClassifier(n_neighbors = 9)

# Train a k-Nearest Neighbor model with Train dataset
knn_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_knn = knn_model.predict(X_test)

# Compute the accuracy score
accuracy_knn = accuracy_score(Y_test, y_hat_knn)

# Print the accuracy score
print("Accuracy of k-Nearest Neighbor model: ", accuracy_knn)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_knn, rownames=['Actual'], colnames=['Predicted'], margins=True)

Accuracy of k-Nearest Neighbor model:  0.6666666666666666


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 11 | 0 | 2 | 13 |
| 2 | 1 | 9 | 1 | 11 |
| 3 | 2 | 6 | 4 | 12 |
| All | 14 | 15 | 7 | 36 |
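k-Nearest Neighbor classifies by distance, so a feature with a large numeric range (here `proline`, with values in the hundreds and thousands) can dominate features such as `hue`. One possible way to check whether this affects the result is to standardize the features inside a pipeline. The sketch below assumes scikit-learn's bundled `load_wine` data in place of `data/wineclass.csv`; it is an illustration, not part of the original notebook run.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for data/wineclass.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=225)

# k-NN on the raw features, as in the cell above
raw_knn = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)

# k-NN on standardized features; the scaler is fitted on the Train set only
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=9)).fit(X_train, y_train)

acc_raw = accuracy_score(y_test, raw_knn.predict(X_test))
acc_scaled = accuracy_score(y_test, scaled_knn.predict(X_test))

print("k-NN accuracy, raw features:   ", acc_raw)
print("k-NN accuracy, scaled features:", acc_scaled)
```

Putting the scaler in the pipeline keeps Test rows out of the mean and standard deviation estimates. The same wrapper can be tried with the SVM model, which is also sensitive to feature scale.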


#### 4.3.1 Print k-Nearest Neighbor Model parameters

In [None]:
# Print the k-Nearest Neighbor Model
knn_model

#### 4.3.2 To output the Multiclass Probability using k-Nearest Neighbor model ...

In [None]:
<< HOME WORK: Your code goes here >>

### <a id='svm_train_test'>4.4 Support Vector Machine (SVM) Model</a>

In [10]:
# Generate a Support Vector Machine (SVM) object
svm_model = svm.SVC(gamma='scale')

# Train a Support Vector Machine model with Train dataset
svm_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_svm = svm_model.predict(X_test)

# Compute the accuracy score
accuracy_svm = accuracy_score(Y_test, y_hat_svm)

# Print the accuracy score
print("Accuracy of Support Vector Machine model: ", accuracy_svm)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_svm, rownames=['Actual'], colnames=['Predicted'], margins=True)

Accuracy of Support Vector Machine model:  0.7222222222222222


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 11 | 0 | 2 | 13 |
| 2 | 1 | 9 | 1 | 11 |
| 3 | 2 | 4 | 6 | 12 |
| All | 14 | 13 | 9 | 36 |


#### 4.4.1 Print Support Vector Machine (SVM) Model parameters

In [None]:
# Print the Support Vector Machine Model
svm_model

#### 4.4.2 To output the Multiclass Probability using Support Vector Machine (SVM) model ...

In [None]:
<< HOME WORK: Your code goes here >>

### <a id='xgb_train_test'>4.5 XGBoost Classifier Model</a>

In [11]:
# Generate an XGBoost object
xgb_model = XGBClassifier(learning_rate=0.01,
                      subsample=0.75,
                      colsample_bytree=0.72,
                      min_child_weight=8,
                      max_depth=5)

# Train an XGBoost model with Train dataset
# (recent XGBoost releases expect class labels starting at 0, so shift 1..3 down to 0..2)
xgb_model.fit(X_train, Y_train - 1)

# Predict the outcome and shift the labels back to 1..3
y_hat_xgb = xgb_model.predict(X_test) + 1

# Compute the accuracy score
accuracy_xgb = accuracy_score(Y_test, y_hat_xgb)

# Print the accuracy score
print("Accuracy of XGBoost model: ", accuracy_xgb)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_xgb, rownames=['Actual'], colnames=['Predicted'], margins=True)

Accuracy of XGBoost model:  1.0


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 13 | 0 | 0 | 13 |
| 2 | 0 | 11 | 0 | 11 |
| 3 | 0 | 0 | 12 | 12 |
| All | 13 | 11 | 12 | 36 |


#### 4.5.1 Print XGBoost Classifier Model parameters

In [None]:
# Print the XGBoost Model
xgb_model

#### 4.5.2 To output the Multiclass Probability using XGBoost Classifier model ...

In [None]:
<< HOME WORK: Your code goes here >>

## <a id='analyze_results'>5. Analyze the model results</a> 

### <a id='print_acc'>5.1. Print Accuracy of all models</a> 

In [12]:
# Create a dataframe with Accuracy
acc_df = pd.DataFrame(
                    {'Metrics': ['Accuracy'],
                    'Logistic Regression': [accuracy_lr],
                    'Random Forest': [accuracy_rf],
                    'k-NN': [accuracy_knn],
                    'SVM': [accuracy_svm],
                    'XGBoost': [accuracy_xgb]}
)

acc_df

| | Metrics | Logistic Regression | Random Forest | k-NN | SVM | XGBoost |
|---|---|---|---|---|---|---|
| 0 | Accuracy | 0.972222 | 1.0 | 0.666667 | 0.722222 | 1.0 |
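These accuracies come from a single 36-row Test set, so each misclassified row moves the score by almost 3 percentage points. The `KFold` class imported at the top can give a steadier estimate by averaging over several splits. A minimal sketch, assuming scikit-learn's bundled `load_wine` data in place of `data/wineclass.csv` and using the Random Forest model as the example; the fold count and seeds are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Assumption: the bundled wine data stands in for data/wineclass.csv
X, y = load_wine(return_X_y=True)

# 5 folds; shuffle because the rows are ordered by class
cv = KFold(n_splits=5, shuffle=True, random_state=225)

model = RandomForestClassifier(n_estimators=100, random_state=225)

# One accuracy score per fold
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

The same `cv` object can be passed to `cross_val_score` for each of the other models to compare them on identical folds.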
