<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
     <img style="float: right; padding-right: 10px" width="100" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>
     </div>

**Clemson University**<br>
**Instructor(s):** Aaron Masino <br>

## Lab 6: Classification With Logistic Regression and Tree Models

In [None]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
import seaborn as sns

from sklearn.datasets import load_wine, load_iris
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, confusion_matrix,  ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
from sklearn.model_selection import learning_curve
from sklearn.decomposition import PCA
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [None]:
SEED = 654321

# Learning Goals

By the end of this lab, you should be able to:
- Apply the scikit-learn LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier classes to learn classification models from data
- Apply the scikit-learn metrics (e.g., confusion_matrix) to trained classifier models
- Analyze classification metrics to compare different models and assess performance
- Apply the scikit-learn GridSearchCV and model selection principles to select the best model

# Part 1 Wine Data Classification

## Wine Data Exploration
Let's load the _wine_ dataset and briefly explore some of its characteristics. The _wine_ dataset is a _toy_ dataset that is available in scikit-learn. We can load the dataset using scikit-learn's `load_wine` function, which returns an sklearn `Bunch` object. Although we can work with this object directly in scikit-learn models, we'll convert it to our more familiar Pandas DataFrame.

For more information, see [scikit-learn toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html).

In [None]:
# load the wine dataset from sklearn
wine = load_wine()

# the wine object is a dictionary-like object NOT a pandas dataframe
print(wine.DESCR)

# we can work with the wine object directly, but let's convert it to a pandas dataframe to be consistent with our prior examples
df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
df_wine['target'] = wine.target
display(df_wine.head())
print(df_wine.target.value_counts())

Let's take a look at the distribution of the variables in the dataset. We will examine box plots of the different features when grouped by the target variable. Ideally, we would like to see that the distributions are different across the different values of the target variable (e.g., have different means or variance). Although such univariate differences are not a necessary condition for the features to be useful in a classification model, they are usually a sufficient condition.

In [None]:
def plt_box_grid_by_target(data, target_label, num_cols = 4, fig_size = (20, 20), add_xlabel = True, label_fontsize = 20):
    l = data.columns.tolist()
    l.remove(target_label)
    num_rows = -(-len(l) // num_cols)
    fig, ax = plt.subplots(num_rows, num_cols, figsize=fig_size, squeeze=False)
    row = 0
    col = 0
    cnt = 0
    for c in l:
        sns.boxplot(y=c, hue=target_label, ax=ax[row, col], data=data)
        if add_xlabel:
            # label with the plotted feature itself so the label is right even if the target is not the last column
            ax[row, col].set_xlabel(c, fontsize=label_fontsize)
        cnt += 1
        if cnt % num_cols == 0:
            row += 1
            col = 0
        else:
            col += 1
    fig.tight_layout()
    plt.show()
    return fig, ax

plt_box_grid_by_target(df_wine, 'target', num_cols=4, fig_size=(20, 20));

The fact that many of the variables have different distributions when segmented by the target label suggests we should be able to use these features to create a well-performing classification model!
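As a numeric complement to the box plots, the sketch below computes the mean of each feature within each class and ranks the features by how far apart those class means are. The ranking heuristic (range of class means divided by the overall standard deviation) is our own illustrative choice, not a standard scikit-learn metric.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Rebuild the same frame used in this lab: 13 features plus a 'target' column
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target

# Mean of every feature within each class (3 rows x 13 columns)
class_means = df.groupby('target').mean()

# Rough heuristic (an assumption for illustration): range of the class means,
# expressed in units of the feature's overall standard deviation
spread = (class_means.max() - class_means.min()) / df.drop(columns='target').std()

print(class_means.round(2).T)
print(spread.sort_values(ascending=False).head())
```

Features near the top of the ranking are the ones whose box plots look most different across classes.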

## Introduction to Sklearn Classifiers with Wine Data 
Let's first introduce some of the many classifier models available in scikit-learn. For now, we just want to see how we use the data to create these models, get the parameters, and evaluate the results. In later sections, we'll address using a systematic approach for model selection. We'll consider three specific classification models:
- Logistic Regression
- Decision Tree
- Random Forest 

There are many, many more. For a complete review of all of the supervised learning (regression and classification) models available in scikit-learn, see [here](https://scikit-learn.org/stable/supervised_learning.html).

### Logistic Regression with Sklearn
We'll begin by fitting a logistic regression model using the scikit-learn `LogisticRegression` class (see documentation [here](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)).

As we've seen previously with sklearn models, the standard pattern for training a classification model using sklearn modules is:
```
my_model = Some_SKLearn_Classifier(arg_1=value, ...)
my_model.fit(my_X_data, my_y_data)
```

In [None]:
# create a Logistic Regression model
# we set penalty=None to avoid regularization
model_lr = LogisticRegression(random_state=SEED, penalty=None, solver='lbfgs')

# fit the model to our wine data - we should be using a training set, but we'll get to that later
# we'll use all the features as our X and the target as our y
model_lr.fit(df_wine.drop('target', axis=1), df_wine.target)

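# get the coefficients of the model
# for this 3-class problem, intercept_ and coef_ hold one row per class; we print only the class 0 row
# print the coefficients with the feature names
print('Intercept:', model_lr.intercept_[0])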
print('\nCoefficients:')
for i, c in enumerate(model_lr.coef_[0]):
    print(f'{df_wine.columns[i]}: {c:.4f}')


Let's take a look at the confusion matrix. Remember, we're looking at `training set` performance here, so the results are likely to be optimistic. We'll address this later.

In [None]:

# first we need the model predictions 
predictions = model_lr.predict(df_wine.drop('target', axis=1))

# next we can use the confusion_matrix function to get the confusion matrix
cm = confusion_matrix(df_wine.target, predictions)

# we can use the ConfusionMatrixDisplay to plot the confusion matrix
disp = ConfusionMatrixDisplay(cm, display_labels=[f'Class {i}' for i in range(3)])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix For Wine Data Using Logistic Regression')
plt.show()

Let's also look at some of the standard performance metrics. There are many performance metrics available in the sklearn library (see [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#)). We will use the `classification_report` function, which provides a summary table of the _per-class_ performance.

In [None]:
print(classification_report(df_wine.target, predictions, target_names=[f'Class {i}' for i in range(3)]))

To keep the table small and accommodate multiclass scenarios, scikit-learn presents _per-class_ metrics, treating each class in turn as the _positive_ class and all other classes together as the _negative_ class (one vs. rest). Let's break these down starting with the per-class rows. For a given class, let _P_ be the number of actual positive samples (reported in the `support` column), _N_ be the number of actual negative samples, _PP_ be the number of predicted positives, _PN_ be the number of predicted negatives, _TP_ be the number of true positives (correctly predicted positives), _FP_ be the number of false positives, _TN_ be the number of true negatives (correctly predicted negatives), and _FN_ be the number of false negatives:
- __class 0 precision ($C_0P$)__: equal to _TP/PP=TP_/(_TP_+_FP_) also known as _positive predictive value_ (PPV)
- __class 0 recall ($C_0R$)__: equal to _TP/P=TP_/(_TP_+_FN_) also known as _sensitivity_
- __class 0 f1-score__ : equal to $2\frac{C_0P\times C_0R}{C_0P+C_0R}$
- __class 1 precision ($C_1P$)__: equal to _TP/PP=TP_/(_TP_+_FP_)
- __class 1 recall ($C_1R$)__: equal to _TP/P=TP_/(_TP_+_FN_)
- __class 1 f1-score__ : equal to $2\frac{C_1P\times C_1R}{C_1P+C_1R}$
- __class 2 precision ($C_2P$)__: equal to _TP/PP=TP_/(_TP_+_FP_)
- __class 2 recall ($C_2R$)__: equal to _TP/P=TP_/(_TP_+_FN_)
- __class 2 f1-score__ : equal to $2\frac{C_2P\times C_2R}{C_2P+C_2R}$

The _accuracy_ row is simply the overall model accuracy: the number of correctly classified samples (the sum of the confusion matrix diagonal) divided by the total number of samples.

The _macro avg_ row is the unweighted average over the class metric scores for each class. For example, the _macro avg precision_ = $\frac{1}{3}(C_0P + C_1P + C_2P)$. The _macro avg_ treats all classes equally.

The _weighted avg_ row is similar to the _macro avg_ but weights each class metric score by the support proportion. For example, the _weighted avg precision_ = $\frac{1}{178}(59C_0P + 71C_1P + 48C_2P)$
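To make these definitions concrete, here is a minimal sketch that recomputes the report's quantities by hand from a confusion matrix. The matrix is hypothetical (made up for illustration, not the output of any model in this lab); only its row sums are chosen to match the wine class supports of 59, 71, and 48.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[50,  9,  0],
               [ 5, 60,  6],
               [ 0,  4, 44]])

tp = np.diag(cm)              # true positives for each class
support = cm.sum(axis=1)      # actual samples per class (P)
predicted = cm.sum(axis=0)    # predicted samples per class (PP)

precision = tp / predicted    # TP / (TP + FP)
recall = tp / support         # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_precision = precision.mean()
weighted_precision = np.sum(support * precision) / support.sum()

print('precision:', precision.round(3))
print('recall:   ', recall.round(3))
print('f1-score: ', f1.round(3))
print(f'accuracy: {accuracy:.3f}')
print(f'macro avg precision: {macro_precision:.3f}')
print(f'weighted avg precision: {weighted_precision:.3f}')
```

Note that the false positives for a class come from the other classes' rows, which is why precision uses the column sums rather than the support.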

### Decision Tree Classifier
Now let's see how to build a decision tree classifier with scikit-learn. We'll use the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) class.

In [None]:
model_dt = DecisionTreeClassifier(random_state=SEED)
model_dt.fit(df_wine.drop('target', axis=1), df_wine.target)

Let's see the confusion matrix for the decision tree.

In [None]:
predictions = model_dt.predict(df_wine.drop('target', axis=1))
cm = confusion_matrix(df_wine.target, predictions)
disp = ConfusionMatrixDisplay(cm, display_labels=[f'Class {i}' for i in range(3)])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix For Wine Data Using Decision Tree')
plt.show()

Wow, the model got every __training__ example correct. Is this a good thing? We should be concerned about overfitting. Let's take a look at the decision tree. The depth of the tree might provide some information regarding the potential for overfitting. We can use the scikit-learn [plot_tree](https://scikit-learn.org/1.5/modules/generated/sklearn.tree.plot_tree.html) function.

In [None]:
fig = plt.figure(figsize=(16,16))
# pass the feature names as a plain list of strings
plot_tree(model_dt, feature_names=df_wine.columns[:-1].tolist(), class_names=[f'Class {i}' for i in range(3)], filled=True);

This seems like a fairly complex tree for a dataset with only 178 training samples. We should definitely be concerned about overfitting. We'll address this later on.
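If we want numbers rather than a picture, a fitted tree reports its own size. The sketch below is one hedged way to quantify the concern: it fits an unrestricted tree on the full wine data (as we did above, but with an arbitrary seed of 0 rather than the lab's `SEED`) and compares the training accuracy against a 5-fold cross-validated accuracy.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Unrestricted tree, fit on all of the data
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

depth = tree.get_depth()        # longest root-to-leaf path
n_leaves = tree.get_n_leaves()  # number of terminal nodes
train_acc = tree.score(X, y)    # accuracy on the data the tree was fit to

# Accuracy estimated on held-out folds (each fold's tree never sees its validation rows)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f'depth={depth}, leaves={n_leaves}')
print(f'training accuracy={train_acc:.3f}, 5-fold CV accuracy={cv_acc:.3f}')
```

A gap between the two accuracies is the usual symptom of overfitting.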

### Random Forest Classifier
Finally, let's see how to build a random forest classifier with scikit-learn. Recall, a random forest is an ensemble classifier that builds many decision trees using a bagging approach wherein multiple training samples are created with bootstrapping. We'll use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) module. 
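As a quick reminder of what bootstrapping means, the sketch below draws a single bootstrap sample of row indices with NumPy. It is only an illustration of the sampling idea (the seed is arbitrary), not scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
n_samples = 178                  # number of rows in the wine dataset

# One bootstrap sample: draw n_samples row indices WITH replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Some rows appear several times and others not at all
unique_frac = np.unique(boot_idx).size / n_samples
print(f'fraction of distinct rows in this bootstrap sample: {unique_frac:.3f}')
```

On average about 63% of the rows (roughly $1 - 1/e$) appear in any one bootstrap sample, so each tree in the forest is trained on a somewhat different view of the data.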

In [None]:
model_rf = RandomForestClassifier(random_state=SEED)
model_rf.fit(df_wine.drop('target', axis=1), df_wine.target)

We again look at the confusion matrix to see how the model did on the _training_ data.

In [None]:
predictions = model_rf.predict(df_wine.drop('target', axis=1))
cm = confusion_matrix(df_wine.target, predictions)
disp = ConfusionMatrixDisplay(cm, display_labels=[f'Class {i}' for i in range(3)])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix For Wine Data Using Random Forest')
plt.show()

## Wine Data Model Selection with GridSearch and Cross Validation

So far we have been training individual models using all of the data. There are several problems with this approach. First, we have not used a held-out test set, which means we only have performance evaluation on the training data, which can be overly optimistic (remember bias and overfitting are always a concern). Additionally, all of our models have one or more hyperparameters. Recall, hyperparameters are used to tune model performance (e.g., regularization) but are not learned from the data. We need a principled approach to select the best model and hyperparameter combination.

For this example, let's put all the tools that we've learned together and apply them in a systematic model selection and evaluation process. We will want to complete the following steps:
1. Check for missing data
2. Split our data into a training and test set
3. Standardize the data using the mean and variance from the training data.
4. Select our model classes and corresponding hyperparameter grids
5. Apply cross validation to find the best model and hyperparameter combination and then retrain the best combination on all training data 
6. Evaluate the best model on the test data

### Step 1: Check for missing data

In [None]:
# check the wine dataset for missing values
print(df_wine.isnull().sum())

We don't have any missing data in this case. If we did, we would need to assess if it was missing completely at random, missing at random, or missing not at random. From there, we could determine if we should impute missing data or drop certain features.

### Step 2: Train and test set split

As always, we need to split our data into training and test samples. 

In [None]:
# let's shuffle the wine data
df_wine = df_wine.sample(frac=1, random_state=SEED).reset_index(drop=True)

# let's split the df_wine data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_wine.drop('target', axis=1), df_wine.target, test_size=0.2, random_state=SEED)

### Step 3: Standardize the feature values

Now let's standardize our continuous features. Our features are all continuous. However, if we had qualitative feature variables, we would also need to convert those to dummy variables in this step.

In [None]:
# let's standardize the data
scaler = StandardScaler()

# fit the scaler to the training data
scaler.fit(X_train)
# transform the training data and put it into a new dataframe
X_train_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)

# transform the testing data and put it into a new dataframe
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# show the scaled training data
display(X_train_scaled.head())
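To see why we fit the scaler on the training data only, here is a small NumPy sketch of the same arithmetic on hypothetical random data (the array shapes and distribution parameters are made up for illustration). It mirrors what `StandardScaler` does, including its use of the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 training rows and 25 test rows with 3 features
X_tr = rng.normal(loc=[10.0, 0.5, 200.0], scale=[2.0, 0.1, 50.0], size=(100, 3))
X_te = rng.normal(loc=[10.0, 0.5, 200.0], scale=[2.0, 0.1, 50.0], size=(25, 3))

# The statistics come from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)   # population std (ddof=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma   # the test rows reuse the training statistics

print('train mean:', X_tr_scaled.mean(axis=0).round(6))
print('train std: ', X_tr_scaled.std(axis=0).round(6))
print('test mean: ', X_te_scaled.mean(axis=0).round(3))
```

The scaled training features have mean 0 and standard deviation 1 exactly, while the scaled test features are only approximately standardized. That is expected: no information from the test set leaks into the preprocessing.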


### Step 4: Define the model & hyperparameter grid options (THIS PART IS NEW!!)

We now need to decide which models we want to train and compare. Here we are again considering logistic regression, decision tree, and random forest classifiers. These are subjective choices and in different scenarios, we might select other types of classifier models. 

Each of the models we are considering has different hyperparameters. For example, logistic regression has the parameter `C`, the _inverse regularization_ strength ($C=1/\lambda$), i.e., the reciprocal of the regularization parameter we've seen previously, so smaller values of `C` mean stronger regularization. The decision tree classifier has different parameters, one of which is `max_depth`, which specifies the maximum depth of the tree (the largest number of successive splits on any path from the root to a leaf). The random forest has `n_estimators`, which specifies the number of trees to build, and the `max_depth` for each tree. Each of these models actually has additional hyperparameters (refer to the documentation). To evaluate the hyperparameters, we will specify a dictionary that indicates the values we want to test for each parameter. For models with more than one hyperparameter, we will be testing the cross product of the specified values.
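To illustrate what "cross product" means here, the sketch below enumerates every combination for the random forest grid used in this lab, using only the Python standard library. The variable names are ours, chosen for illustration.

```python
from itertools import product

# The random forest grid used in this lab
rf_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 3, 5]}

# Every combination of one value per hyperparameter
names = list(rf_grid)
combos = [dict(zip(names, values)) for values in product(*rf_grid.values())]

for combo in combos:
    print(combo)
print(f'{len(combos)} combinations')
```

Three values for each of two hyperparameters gives 9 combinations. With 5-fold cross validation, that is 45 model fits for the random forest alone, which is why grids are usually kept small.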

In [None]:
# create a model dictionary
models = {
    'logistic_regression': LogisticRegression(random_state=SEED, max_iter=1000, tol=0.001),
    'decision_tree': DecisionTreeClassifier(random_state=SEED),
    'random_forest': RandomForestClassifier(random_state=SEED)
}

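# create a hyperparameter dictionary for the models
# we need to use the saga solver for l1 and elastic net regularization
# elasticnet also requires l1_ratio, so it gets its own sub-grid (0.5 is an assumed, illustrative value)
param_grid = {'logistic_regression': [{'C': [0.1, 1, 10],
                                       'penalty': ['l1', 'l2'],
                                       'solver': ['saga']},
                                      {'C': [0.1, 1, 10],
                                       'penalty': ['elasticnet'],
                                       'l1_ratio': [0.5],
                                       'solver': ['saga']}],
              'decision_tree': {'max_depth': [None, 3, 5]},
              'random_forest': {'n_estimators': [10, 50, 100],
                                'max_depth': [None, 3, 5]}
}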

### Step 5: Grid search with cross-validation & final model training (THIS PART IS NEW!!)
Now we want to train each model with each combination of hyperparameters and evaluate them to find the best performing one. However, we don't want to evaluate all of these models on the test set (why?). Instead, we will apply cross validation for __every__ model and hyperparameter combination. This will provide an average cross-validation score for each model and hyperparameter combination. 

Our __final selection__ will be the model and hyperparameter combination with the best cross-validation performance. To obtain a __final model__, we will use the best model class and hyperparameter combination and train it on the entire _training_ set. Fortunately, the scikit-learn [GridSearchCV](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.score_samples) class does most of the work for us. For a given model class (e.g., `LogisticRegression`), `GridSearchCV` will perform cross validation for each of the specified hyperparameter combinations. With `refit=True`, it will also retrain that model on all of the data passed to `fit` using the best hyperparameters for that model class. Let's see how to use this below:

In [None]:
# declare variables to store the best model and score
model_best_wine = None
score_best_wine = 0.0

# specify the number of folds for cross-validation
K_folds = 5
# specify the scoring method to use in cross-validation
score = 'balanced_accuracy' # we'll use balanced accuracy since our classes are imbalanced
# loop through the models using GridSearchCV to tune the hyperparameters
for name, model in models.items():
    print(f'Tuning {name} ...........................')
    # create a GridSearchCV object for this model
    grid_search = GridSearchCV(model, param_grid[name], cv=K_folds, scoring=score, refit=True, n_jobs=-1)
    grid_search.fit(X_train_scaled, y_train)
    print(f'Best parameters: {grid_search.best_params_}')
    print(f'Best validation score: {grid_search.best_score_}')
    # check if this model is the best so far
    if grid_search.best_score_ > score_best_wine:
        print(f'*******New best model found: {name}')
        score_best_wine = grid_search.best_score_
        model_best_wine = grid_search.best_estimator_
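The loop above only prints the winner for each model class. If we want to see how every hyperparameter combination scored, the fitted `GridSearchCV` object keeps the details in its `cv_results_` attribute. The sketch below is a simplified, standalone illustration: it tunes only the decision tree and, for brevity, fits on the full unscaled wine data rather than on our training split.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'max_depth': [None, 3, 5]},
                  cv=5, scoring='balanced_accuracy', refit=True)
gs.fit(X, y)

# One row per hyperparameter combination, averaged over the 5 folds
results = pd.DataFrame(gs.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
print('best:', gs.best_params_, round(gs.best_score_, 3))
```

Here `best_score_` is simply the largest `mean_test_score` in the table. Looking at `std_test_score` alongside it is a useful check on whether the differences between combinations are larger than the fold-to-fold variation.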


### Wine Data Test Set Evaluation
Now that we have compared all the model and hyperparameter combinations that we are going to consider, we can examine model performance on the test set. 

First, let's get the predictions for the test set using the final model.

In [None]:
# let's evaluate the best model on the test set
# the grid search object has already refit the best model on the training set
# let's get the predictions on the test set
predictions = model_best_wine.predict(X_test_scaled)

In [None]:
# let's get the classification report
print(classification_report(y_test, predictions, target_names=[f'Class {i}' for i in range(3)]))

Now let's look at the confusion matrix.

In [None]:
# let's get the confusion matrix
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(cm, display_labels=[f'Class {i}' for i in range(3)])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix For Wine Data Using Best Model')
plt.show()

We see that the final model does extremely well on the test data. This is certainly not the normal situation on real-world problems. 

# Part 2 Iris Data Set
Let's repeat the process from above on another data set. Here we load the Iris dataset which was first published by the famous statistician, Sir R.A. Fisher. Per the scikit-learn [documentation](https://scikit-learn.org/stable/datasets/toy_dataset.html), "_The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other_".

In [None]:
# load the iris dataset from sklearn
iris = load_iris()

# the iris object is a dictionary-like object NOT a pandas dataframe
print(iris.DESCR)

# we can work with the iris object directly, but let's convert it to a pandas dataframe to be consistent with our prior examples
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target
display(df_iris.head())
print(df_iris.target.value_counts())

Let's again take a quick look at the feature distributions within the iris classes.

In [None]:
plt_box_grid_by_target(df_iris, 'target', num_cols=4, fig_size=(12,4));

We (okay, I) would like to make this problem a little more difficult. So, in the cell block below, we are going to add noise to the data features. We would (almost) never do this in practice (there are some exceptions in deep learning). We're doing it here to make sure the model doesn't perform perfectly on the test set so we can illustrate performance evaluation.

In [None]:
# let's shuffle the iris data
df_iris = df_iris.sample(frac=1, random_state=SEED).reset_index(drop=True)

# add gaussian noise to the iris data to make it more interesting
np.random.seed(SEED)
df_iris.iloc[:, :-1] += np.random.normal(0, 0.4, df_iris.iloc[:, :-1].shape)

# let's split the df_iris data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_iris.drop('target', axis=1), df_iris.target, test_size=0.3, random_state=SEED)

We now follow the same steps as before. First, we standardize the data.

In [None]:
# let's standardize the data
scaler = StandardScaler()

# fit the scaler to the training data
scaler.fit(X_train)
# transform the training data and put it into a new dataframe
X_train_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)

# transform the testing data and put it into a new dataframe
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# show the scaled training data
display(X_train_scaled.head())

We'll use the same models and hyperparameter combinations as we used for the wine dataset (specified above). We again perform a systematic approach using grid search and cross validation to determine the best model for this dataset.

In [None]:
model_best_iris = None
score_best_iris = 0.0
K_folds = 5
score = 'accuracy' # we'll use accuracy since our classes are balanced
for name, model in models.items():
    print(f'Tuning {name} ...........................')
    grid_search = GridSearchCV(model, param_grid[name], cv=K_folds, scoring=score, refit=True, n_jobs=-1)
    grid_search.fit(X_train_scaled, y_train)
    print(f'Best parameters: {grid_search.best_params_}')
    print(f'Best validation score: {grid_search.best_score_}')
    if grid_search.best_score_ > score_best_iris:
        print(f'*******New best model found: {name}')
        score_best_iris = grid_search.best_score_
        model_best_iris = grid_search.best_estimator_

Let's get the predictions and plot the confusion matrix.

In [None]:
# let's evaluate the best model on the test set
# the grid search object has already refit the best model on the training set
# let's get the predictions on the test set
predictions = model_best_iris.predict(X_test_scaled)

# let's get the confusion matrix
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(cm, display_labels=[f'Class {i}' for i in range(3)])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix For Iris Data Using Best Model')
plt.show()

We see that the best model performs reasonably well on the test data, but not perfectly - though if we remove the intentionally added noise, it will do much better. Let's look at the classification report to get the point metrics.

In [None]:
# let's get the classification report
print(classification_report(y_test, predictions, target_names=[f'Class {i}' for i in range(3)]))

Let's also look at the Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC). Recall that the ROC curve lets us examine what happens as we change the decision threshold for assigning a sample to a class based on the probability of the sample belonging to that class as predicted by the model. The ROC and AUC metrics are typically computed for binary outcomes. In this example, which is a multi-class scenario, we need to address this in order to create the ROC. We will consider each class separately. That is, we will create an ROC for each class for which we will treat the class of interest as the _positive_ class and all other classes as the _negative_ class. This will give us one ROC for each class.

In [None]:
# let's plot the roc curve for each class
# we need to get the probabilities for each class
probs = model_best_iris.predict_proba(X_test_scaled)
fpr = dict()
tpr = dict()
roc_auc = {0:0, 1:0, 2:0}
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test, probs[:, i], pos_label=i)
    # roc_auc expects binary values, so we need to convert the target to binary
    # we're using one vs. all, so we'll convert the target to 1 for the class we're interested in and 0 for the other classes
    y_test_i = np.array(y_test == i).astype(int)
    roc_auc[i] = roc_auc_score(y_test_i, probs[:, i])
    plt.plot(fpr[i], tpr[i], label=f'Class {i} (AUC = {roc_auc[i]:.2f})')
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve For Iris Data Using Best Model')
plt.legend()
plt.show()
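Each point on an ROC curve corresponds to one decision threshold. To make that concrete, the sketch below computes a few points by hand on a tiny hypothetical example (the labels and probabilities are made up, and `roc_point` is our own helper, not a scikit-learn function).

```python
import numpy as np

# Hypothetical one-vs-rest labels (1 = class of interest) and predicted probabilities
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when samples with score >= threshold are called positive."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr_t = tp / np.sum(y_true == 1)   # true positive rate (recall)
    fpr_t = fp / np.sum(y_true == 0)   # false positive rate
    return fpr_t, tpr_t

for t in (0.85, 0.5, 0.2):
    fpr_t, tpr_t = roc_point(y_true, scores, t)
    print(f'threshold={t:.2f}: FPR={fpr_t:.2f}, TPR={tpr_t:.2f}')
```

Lowering the threshold moves us up and to the right along the curve: we catch more true positives at the cost of more false positives.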

Finally, let's look at the learning curves. The learning curves examine the model performance as a function of the size of the training data. This analysis can help determine if our model is affected by bias, variance (overfitting), or both. The learning curve graph contains two line plots: (1) training - this represents the model performance on the training data with different training set sizes. (2) validation - this represents the model performance on a held out validation set with different training set sizes. There are a few key things we can interpret from learning curves:
1. If the training score is <1 for the largest available training set size, the model may suffer from bias (insufficient capacity)
2. If the training and validation scores are different for the largest training set sizes (e.g., the validation score is less than the training score), the model is likely affected by variance (overfitting)
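These two checks can also be read off numerically. The sketch below applies them to hypothetical score arrays shaped like the output of `learning_curve` (rows are training set sizes, columns are cross-validation folds). The numbers are invented purely to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical scores: rows = increasing training set sizes, columns = CV folds
demo_train_scores = np.array([[1.00, 1.00, 1.00],
                              [0.97, 0.96, 0.98],
                              [0.93, 0.94, 0.92]])
demo_val_scores = np.array([[0.70, 0.75, 0.72],
                            [0.82, 0.80, 0.84],
                            [0.88, 0.86, 0.87]])

# Look at the largest training set size (the last row)
final_train = demo_train_scores[-1].mean()
final_val = demo_val_scores[-1].mean()
gap = final_train - final_val

print(f'final training score:   {final_train:.2f}  (below 1 hints at bias)')
print(f'final validation score: {final_val:.2f}')
print(f'train - validation gap: {gap:.2f}  (a persistent gap hints at variance)')
```

In this made-up example both effects are present but modest: the training score has dropped below 1, and a small gap to the validation score remains.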

In [None]:
# let's plot the learning curve for the best model
# we'll use accuracy as our scoring metric, matching the grid search above
train_sizes, train_scores, test_scores = learning_curve(model_best_iris, X_train_scaled, y_train, cv=5, scoring='accuracy', n_jobs=-1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

# let's create the plot
plt.figure(figsize=(10, 6))
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
                label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
                label="Cross-validation score")
plt.xlabel("Training examples")
plt.ylabel("Score")
plt.title("Learning Curve For Iris Data Using Best Model")
plt.legend()
plt.show()