# Machine Learning Model Generation - Multiclass Classification

### 1. [Tutorial links](#tutorials)
### 2. [Load CSV File](#csv_file)
### 3. [Split the dataset. Technique: Train-Test split](#train_test)
### 4. [Generate Multiclass Classification Models](#multiclass_train_test)

> #### 4.1. [Logistic Regression Model](#lr_train_test)
> #### 4.2. [Random Forest Model](#rf_train_test)
> #### 4.3. [k-Nearest Neighbor Model](#knn_train_test)
> #### 4.4. [Support Vector Machine (SVM) Model](#svm_train_test)
> #### 4.5. [XGBoost Classifier Model](#xgb_train_test)


### 5. [Analyze the model results](#analyze_results)

> #### 5.1. [Print Accuracy of all models](#print_acc)

## <a id='tutorials'>1. Tutorial links</a>

### Load libraries

In [1]:
# Import pandas library
import pandas as pd

# Import numpy library
import numpy as np

# Import LogisticRegression library
from sklearn.linear_model import LogisticRegression

# Import RandomForestClassifier library
from sklearn.ensemble import RandomForestClassifier

# Import KNeighborsClassifier library
from sklearn.neighbors import KNeighborsClassifier

# Import Support Vector Machine (SVM) library
from sklearn import svm

# Import XGBClassifier library
from xgboost import XGBClassifier

# Import Train-Test split library
from sklearn.model_selection import train_test_split

# Import KFold split library
from sklearn.model_selection import KFold

# Import accuracy score computing library
from sklearn.metrics import accuracy_score

# Import metrics library
from sklearn import metrics

# Import matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline

# Import warnings library and suppress warning messages
import warnings
warnings.filterwarnings('ignore')

### Source data location and data dictionary

## <a id='csv_file'>2. Load CSV file from GitHub or local disk</a>

In [10]:
# Load Wine Dataset from Github
wine_data = pd.read_csv("https://raw.githubusercontent.com/socratesk/YHatSchoolOfAI/master/data/wineclass.csv")

# Load Wine Dataset from local disk
# wine_data = pd.read_csv("data/wineclass.csv")
                       
# Print the shape
print (wine_data.shape)

# Show the first few rows to inspect the data
wine_data.head()

(178, 14)


| | class | alcohol | malic_acid | ash | ash_alcalinity | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_diluted_wine | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


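### Describe the dataframe

Note: the execution counts show that this cell was run after the rescaling cell that follows it in this notebook, so `ash_alcalinity`, `magnesium`, and `proline` already appear in their rescaled units below.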

In [15]:
wine_data.describe()

| | class | alcohol | malic_acid | ash | ash_alcalinity | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_diluted_wine | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 1.938202 | 13.000618 | 2.336348 | 2.366517 | 9.747472 | 9.974157 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 7.468933 |
| std | 0.775035 | 0.811827 | 1.117146 | 0.274344 | 1.669782 | 1.428248 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 3.149075 |
| min | 1.0 | 11.03 | 0.74 | 1.36 | 5.3 | 7.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 2.78 |
| 25% | 1.0 | 12.3625 | 1.6025 | 2.21 | 8.6 | 8.8 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 5.005 |
| 50% | 2.0 | 13.05 | 1.865 | 2.36 | 9.75 | 9.8 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 6.735 |
| 75% | 3.0 | 13.6775 | 3.0825 | 2.5575 | 10.75 | 10.7 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 9.85 |
| max | 3.0 | 14.83 | 5.8 | 3.23 | 15.0 | 16.2 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 16.8 |


### Rescale a few features of the dataframe

In [14]:
# Goal: bring the maximum value of each feature down to roughly 15, so that all features are on a comparable scale.

wine_data['ash_alcalinity'] = wine_data['ash_alcalinity'] / 2

wine_data['magnesium'] = wine_data['magnesium'] / 10

wine_data['proline'] = wine_data['proline'] / 100

## <a id='train_test'>3. Split the dataset. Technique: Train-Test split</a>

In [16]:
# Hold out 20% of the rows for testing (80:20 Train-Test split)
SPLIT_RATIO = 0.2

# Split the dataset
X_train, X_test, Y_train, Y_test = train_test_split(wine_data.drop('class',  axis = 1), 
                                                    wine_data['class'], 
                                                    test_size=SPLIT_RATIO, 
                                                    random_state = 1024)

# Print the shape of the Train set
print("Train dataset: ", X_train.shape, Y_train.shape)

# Print the shape of the Test set
print("Test dataset: ", X_test.shape, Y_test.shape)

Train dataset:  (142, 13) (142,)
Test dataset:  (36, 13) (36,)
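
With a dataset this small, a purely random split can leave the classes unevenly represented in the Test set. The sketch below illustrates the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. It uses synthetic labels (60, 70 and 50 rows per class) rather than the wine data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 60 / 70 / 50 rows for classes 1, 2, 3
y = np.array([1] * 60 + [2] * 70 + [3] * 50)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks for the same class proportions in Train and Test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1024, stratify=y
)

# Count how many rows of each class ended up in the Test split
classes, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(classes.tolist(), counts.tolist()))
print(test_counts)
```

The 36 Test rows split 12 / 14 / 10 across the three classes, mirroring the 60 : 70 : 50 ratio of the full synthetic set.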


## <a id='multiclass_train_test'>4. Generate Multiclass Classification Models</a>

### <a id='lr_train_test'>4.1 Logistic Regression Model</a>

In [17]:
# Generate a Logistic Regression object
lr_model = LogisticRegression(multi_class='ovr') # solver='liblinear',

# Train a Logistic Regression model with Train dataset
lr_model.fit(X_train, Y_train)

# Predict the multiclass outcome
y_hat_lr = lr_model.predict(X_test)

# Print predicted value
y_hat_lr_class = ['Class ' + str(y) for y in y_hat_lr]
print ("Predicted few values:", y_hat_lr_class[0:8])

# Compute the accuracy score
accuracy_lr = accuracy_score(Y_test, y_hat_lr)

# Print accuracy score
print ("Accuracy of Logistic Regression model:", accuracy_lr)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_lr, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted few values: ['Class 1', 'Class 3', 'Class 2', 'Class 2', 'Class 3', 'Class 3', 'Class 1', 'Class 1']
Accuracy of Logistic Regression model: 0.9166666666666666


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 13 | 1 | 0 | 14 |
| 2 | 0 | 11 | 1 | 12 |
| 3 | 0 | 1 | 9 | 10 |
| All | 13 | 13 | 10 | 36 |
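
The accuracy printed above can be read directly off the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of predictions. The sketch below re-enters the Logistic Regression matrix by hand and also derives per-class recall and precision; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the Logistic Regression model
# (rows: actual class 1-3, columns: predicted class 1-3)
cm = np.array([
    [13, 1, 0],
    [0, 11, 1],
    [0, 1, 9],
])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # share of each actual class that was found
precision = np.diag(cm) / cm.sum(axis=0)   # share of each predicted class that was right

print("Accuracy :", accuracy)
print("Recall   :", recall.round(3))
print("Precision:", precision.round(3))
```

Here 33 of the 36 Test rows lie on the diagonal, which gives the same 0.9167 reported by `accuracy_score`.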


#### 4.1.1 Print Logistic Regression Model parameters

In [18]:
# Print the Logistic Regression Model
lr_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

#### 4.1.2 To output the Multiclass Probability using Logistic Regression model ...

As discussed in an earlier class, a classifier is trained on the Train dataset, and the resulting model can predict the probability of each outcome, not only the most likely class.

The example above used `lr_model.predict(X_test)` to predict the class label. To get the probability of each class instead, use `predict_proba`:

In [19]:
# Predict the probability of each class
y_hat_lr_proba = lr_model.predict_proba(X_test)

# Print first 8 rows to visualize the prediction.
y_hat_lr_proba[:8]

array([[7.76672359e-01, 1.95535145e-01, 2.77924966e-02],
       [6.66844960e-04, 6.21201538e-02, 9.37213001e-01],
       [4.36034167e-02, 9.53877841e-01, 2.51874235e-03],
       [9.28693024e-02, 8.69881385e-01, 3.72493124e-02],
       [1.80654119e-02, 7.68357289e-04, 9.81166231e-01],
       [2.15146848e-02, 9.24153334e-02, 8.86069982e-01],
       [9.99406425e-01, 5.51987470e-05, 5.38376064e-04],
       [9.32128107e-01, 6.70689411e-02, 8.02951472e-04]])
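
`predict` and `predict_proba` are linked: the predicted class is the one with the highest probability in each row, with columns ordered as in the model's `classes_` attribute. The sketch below checks this on the first four rows printed above, rounded to four decimals, and assumes the class labels are 1, 2 and 3.

```python
import numpy as np

# First four probability rows from the output above, rounded to 4 decimals
proba = np.array([
    [0.7767, 0.1955, 0.0278],
    [0.0007, 0.0621, 0.9372],
    [0.0436, 0.9539, 0.0025],
    [0.0929, 0.8699, 0.0372],
])
classes = np.array([1, 2, 3])  # assumed to match lr_model.classes_

# Pick the column with the highest probability and map it to its class label
predicted = classes[proba.argmax(axis=1)]
print(predicted)  # [1 3 2 2]

# Each row of probabilities sums to (approximately) 1
row_sums = proba.sum(axis=1)
print(row_sums.round(3))
```

The result matches the first four labels printed by `lr_model.predict(X_test)` earlier: Class 1, Class 3, Class 2, Class 2.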

#### 4.1.3 To plot ROC curves and compute the Area Under the Curve (AUC) for multiclass classification, refer to the link below

https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
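
A multiclass AUC can also be summarized as a single number. The sketch below is an illustration under several assumptions: it uses scikit-learn's bundled copy of the wine data (`load_wine`, with labels 0 to 2) instead of the CSV loaded above, standardizes the features in a pipeline, and calls `roc_auc_score` with `multi_class='ovr'`, which needs scikit-learn 0.22 or later.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled wine data (assumption: used here in place of wineclass.csv)
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1024, stratify=y
)

# Standardize the features, then fit a Logistic Regression model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# One-vs-rest AUC, macro-averaged over the three classes
proba = model.predict_proba(X_te)
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr")
print("Macro one-vs-rest AUC:", round(auc_ovr, 4))
```

`roc_auc_score` needs the class probabilities from `predict_proba`, not the hard labels from `predict`, because the ROC curve is built by sweeping a threshold over those probabilities.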

### <a id='rf_train_test'>4.2 Random Forest Model</a>

In [20]:
# Generate a Random Forest Classifier object
rf_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=2)

# n_estimators - number of trees in the forest
# n_jobs - number of CPU cores to use
# max_depth - maximum depth of each tree in the forest
# min_samples_split - minimum number of samples required to split an internal node
# min_samples_leaf  - minimum number of samples required at a leaf node
# max_features - number of features to consider when looking for the best split

# Train a Random Forest model with Train dataset
rf_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_rf = rf_model.predict(X_test)

# Print predicted value
y_hat_rf_class = ['Class ' + str(y) for y in y_hat_rf]
print ("Predicted few values:", y_hat_rf_class[0:8])

# Compute the accuracy score
accuracy_rf = accuracy_score(Y_test, y_hat_rf)

# Print accuracy score
print ("Accuracy of Random Forest model: ", accuracy_rf)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_rf, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted few values: ['Class 1', 'Class 3', 'Class 1', 'Class 2', 'Class 3', 'Class 3', 'Class 1', 'Class 1']
Accuracy of Random Forest model:  0.9722222222222222


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 14 | 0 | 0 | 14 |
| 2 | 0 | 11 | 1 | 12 |
| 3 | 0 | 0 | 10 | 10 |
| All | 14 | 11 | 11 | 36 |


#### 4.2.1 Print Random Forest Model parameters

In [21]:
# Print the Random Forest Model
rf_model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=2, verbose=0, warm_start=False)

#### 4.2.2 To output the Multiclass Probability using Random Forest model ...

In [22]:
# Predict the probability of each class
y_hat_rf_proba = rf_model.predict_proba(X_test)

# Print first 8 rows to visualize the prediction.
y_hat_rf_proba[:8]

array([[0.47238095, 0.39678571, 0.13083333],
       [0.015     , 0.14316667, 0.84183333],
       [0.636     , 0.33233333, 0.03166667],
       [0.01      , 0.7242381 , 0.2657619 ],
       [0.00666667, 0.093     , 0.90033333],
       [0.00666667, 0.22607143, 0.7672619 ],
       [0.97483333, 0.02516667, 0.        ],
       [0.85      , 0.14      , 0.01      ]])

### <a id='knn_train_test'>4.3 k-Nearest Neighbor Model</a>

In [23]:
# Generate a k-Nearest Neighbor object
knn_model = KNeighborsClassifier(n_neighbors = 3)

# Train a k-Nearest Neighbor model with Train dataset
knn_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_knn = knn_model.predict(X_test)

# Print predicted value
y_hat_knn_class = ['Class ' + str(y) for y in y_hat_knn]
print ("Predicted few values:", y_hat_knn_class[0:8])

# Compute the accuracy score
accuracy_knn = accuracy_score(Y_test, y_hat_knn)

# Print accuracy score
print ("Accuracy of k-Nearest Neighbor model: ", accuracy_knn)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_knn, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted few values: ['Class 1', 'Class 3', 'Class 1', 'Class 2', 'Class 3', 'Class 2', 'Class 1', 'Class 1']
Accuracy of k-Nearest Neighbor model:  0.9166666666666666


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 14 | 0 | 0 | 14 |
| 2 | 1 | 10 | 1 | 12 |
| 3 | 0 | 1 | 9 | 10 |
| All | 15 | 11 | 10 | 36 |


#### 4.3.1 Print k-Nearest Neighbor Model parameters

In [24]:
# Print the k-Nearest Neighbor Model
knn_model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

#### 4.3.2 To output the Multiclass Probability using k-Nearest Neighbor model ...

In [None]:
<< HOME WORK: Your code goes here >>

### <a id='svm_train_test'>4.4 Support Vector Machine (SVM)  Model</a>

In [26]:
# Generate a Support Vector Machine (SVM) object
svm_model = svm.SVC(kernel='linear', random_state=2) # gamma=0.05, degree=5, kernel='linear'

# Train a Support Vector Machine model with Train dataset
svm_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_svm = svm_model.predict(X_test)

# Print predicted value
y_hat_svm_class = ['Class ' + str(y) for y in y_hat_svm]
print ("Predicted few values:", y_hat_svm_class[0:8])

# Compute the accuracy score
accuracy_svm = accuracy_score(Y_test, y_hat_svm)

# Print accuracy score
print ("Accuracy of Support Vector Machine model: ", accuracy_svm)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_svm, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted few values: ['Class 1', 'Class 3', 'Class 1', 'Class 2', 'Class 3', 'Class 2', 'Class 1', 'Class 1']
Accuracy of Support Vector Machine model:  0.9166666666666666


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 14 | 0 | 0 | 14 |
| 2 | 0 | 11 | 1 | 12 |
| 3 | 0 | 2 | 8 | 10 |
| All | 14 | 13 | 9 | 36 |


#### 4.4.1 Print Support Vector Machine (SVM) Model parameters

In [27]:
# Print the Support Vector Machine Model
svm_model

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=2,
  shrinking=True, tol=0.001, verbose=False)

#### 4.4.2 To output the Multiclass Probability using Support Vector Machine (SVM) model ...

In [None]:
<< HOME WORK: Your code goes here >>

### <a id='xgb_train_test'>4.5 XGBoost Classifier Model</a>

In [25]:
# Generate an XGBoost object
xgb_model = XGBClassifier(learning_rate=0.01,
                      subsample=0.75, 
                      colsample_bytree=0.72, 
                      min_child_weight=8,
                      max_depth=5,
                      random_state=2)

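# Train an XGBoost model with Train dataset
# Note: newer XGBoost releases expect class labels 0..n_classes-1; if fit() rejects
# the labels 1-3 used here, encode them first (for example with sklearn's LabelEncoder).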
xgb_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_xgb = xgb_model.predict(X_test)

# Print predicted value
y_hat_xgb_class = ['Class ' + str(y) for y in y_hat_xgb]
print ("Predicted few values:", y_hat_xgb_class[0:8])

# Compute the accuracy score
accuracy_xgb = accuracy_score(Y_test, y_hat_xgb)

# Print accuracy score
print ("Accuracy of XGBoost model: ", accuracy_xgb)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_xgb, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted few values: ['Class 1', 'Class 3', 'Class 1', 'Class 2', 'Class 3', 'Class 3', 'Class 1', 'Class 1']
Accuracy of XGBoost model:  0.9722222222222222


| Actual \ Predicted | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 14 | 0 | 0 | 14 |
| 2 | 0 | 11 | 1 | 12 |
| 3 | 0 | 0 | 10 | 10 |
| All | 14 | 11 | 11 | 36 |


#### 4.5.1 Print XGBoost Classifier Model parameters

In [28]:
# Print the XGBoost Model
xgb_model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.72, gamma=0,
       learning_rate=0.01, max_delta_step=0, max_depth=5,
       min_child_weight=8, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='multi:softprob', random_state=2,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.75, verbosity=1)

#### 4.5.2 To output the Multiclass Probability using XGBoost Classifier model ...

In [None]:
<< HOME WORK: Your code goes here >>

## <a id='analyze_results'>5. Analyze the model results</a> 

### <a id='print_acc'>5.1. Print Accuracy of all models</a> 

In [29]:
# Create a dataframe with Accuracy
acc_df = pd.DataFrame(
                    {'Metrics': ['Accuracy'],
                    'Logistic Regression': [accuracy_lr],
                    'Random Forest': [accuracy_rf],
                    'k-NN': [accuracy_knn],
                    'SVM': [accuracy_svm],
                    'XGBoost': [accuracy_xgb]}
)

acc_df

| | Metrics | Logistic Regression | Random Forest | k-NN | SVM | XGBoost |
|---|---|---|---|---|---|---|
| 0 | Accuracy | 0.916667 | 0.972222 | 0.916667 | 0.916667 | 0.972222 |
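
A single 36-row Test set makes these accuracies coarse: one misclassified wine changes the score by about 0.028, which is half the gap between the best and the other models. `KFold` is imported at the top of this notebook but never used. The sketch below shows how k-fold cross-validation could give a steadier comparison. It assumes scikit-learn's bundled `load_wine` data rather than the CSV above, and compares only two of the models, with the same settings used earlier.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Bundled wine data (assumption: used here in place of wineclass.csv)
X, y = load_wine(return_X_y=True)

# 5 shuffled folds; every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=1024)

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, min_samples_leaf=2, random_state=2
    ),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
}

# Collect the per-fold accuracies of each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how much of a difference between two models is noise from the particular split.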
