# The Wheat Seeds Dataset involves the prediction of species given measurements of seeds from different varieties of wheat.

# created by Nikolay K. MTK: 673010
It is a multiclass (3-class) classification problem. 
The number of observations for each class is balanced. 
There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:

1. Area.
2. Perimeter.
3. Compactness.
4. Length of kernel.
5. Width of kernel.
6. Asymmetry coefficient.
7. Length of kernel groove.
8. Class (1, 2, 3).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33% (70 of 210 observations, since the three classes are balanced).
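As a sanity check on the baseline, the snippet below counts the classes and computes the accuracy of always predicting the most common one. It is a minimal sketch that assumes exactly 70 observations per class, matching the balanced classes described above; the `labels` list is a stand-in for the real class column.

```python
from collections import Counter

# Stand-in for the class column: 3 balanced classes, 70 observations each (assumption).
labels = [1] * 70 + [2] * 70 + [3] * 70

# Always predicting the most common class gets this fraction of rows right.
class_counts = Counter(labels)
majority_class, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

With balanced classes the baseline is simply one divided by the number of classes, so any useful model should score well above one third.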

In [128]:
# Let's lay out our plan here
solution_steps = [
    "1. Getting the data ready",
    "2. Choose the right estimator/algorithm for our problems",
    "3. Fit the model/algorithm and use it to make predictions on our data",
    "4. Evaluating a model",
    "5. Improve a model",
    "6. Save and load a trained model",
]

In [129]:
solution_steps

['1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make predictions on our data',
 '4. Evaluating a model',
 '5. Improve a model',
 '6. Save and load a trained model']

In [10]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [49]:
# Load the wheat seeds dataset from an Excel file (pandas is imported above)
wheat_seeds_class = pd.read_excel("data/seeds.xlsx")
wheat_seeds_class

Unnamed: 0,Area,Perimeter,Compactness,Length of kernel,Width of kernel,Asymmetry coefficient,Length of kernel groove,"Class (1, 2, 3)"
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
...,...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870,3
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003,3
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056,3
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044,3


# 1. Getting our data ready to be used with machine learning

Split the data into features and labels (usually `X` and `y`).

In [74]:
# Features
X = wheat_seeds_class.drop("Class (1, 2, 3)", axis=1)
X.head()

Unnamed: 0,Area,Perimeter,Compactness,Length of kernel,Width of kernel,Asymmetry coefficient,Length of kernel groove
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175


In [71]:
# Labels
y = wheat_seeds_class["Class (1, 2, 3)"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Class (1, 2, 3), dtype: int64
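The same feature/label split can be sketched without pandas, which makes the idea explicit: every column except the class becomes a feature, and the class column becomes the label. This sketch assumes a CSV-style layout and reuses the first three rows shown earlier; the `raw` string, the shortened `Class` column name and the variable names are illustrative only.

```python
import csv
import io

# First three rows of the dataset in CSV form (illustrative layout, "Class" shortened).
raw = """Area,Perimeter,Compactness,Length of kernel,Width of kernel,Asymmetry coefficient,Length of kernel groove,Class
15.26,14.84,0.8710,5.763,3.312,2.221,5.220,1
14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
14.29,14.09,0.9050,5.291,3.337,2.699,4.825,1
"""

target = "Class"
X_rows, y_labels = [], []
for row in csv.DictReader(io.StringIO(raw)):
    # Label: the class column, as an integer.
    y_labels.append(int(row[target]))
    # Features: every other column, converted to float.
    X_rows.append({name: float(value) for name, value in row.items() if name != target})

print(len(X_rows), "rows,", len(X_rows[0]), "features")
print(y_labels)
```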

In [77]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((168, 7), (42, 7), (168,), (42,))
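To see where the 168/42 split comes from, here is a minimal sketch of a shuffled hold-out split on row indices. It is not Scikit-Learn's implementation: `split_indices` is a hypothetical helper, and its shuffle will not reproduce the exact rows that `train_test_split` selects with `random_state=42`.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(210, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the model is later scored only on rows it never saw during fitting.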

# 2. Choose the right estimator/algorithm for our problems

* Scikit-Learn refers to machine learning models and algorithms as estimators.
* This is a classification problem: we are predicting a category (1, 2 or 3), so it is a 3-class classification problem.
* A classification estimator is often referred to as a model, an estimator (as in the Scikit-Learn documentation) or `clf` (short for classifier).

* Hyperparameters are like knobs on an oven you can tune to cook your favourite dish.

# Using the "choosing the right estimator" flowchart from the Scikit-Learn page
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## 2.1 Picking a machine learning model for a classification problem
The map suggests trying LinearSVC first.

In [139]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Instantiate LinearSVC and fit it to the training data
clf = LinearSVC(dual=False)
clf.fit(X_train, y_train)

# Score the LinearSVC on the training data (this is not a held-out estimate)
clf.score(X_train, y_train)

0.9583333333333334

In [141]:
wheat_seeds_class["Class (1, 2, 3)"].value_counts()

3    70
2    70
1    70
Name: Class (1, 2, 3), dtype: int64

In [142]:
# Import the Random Forest Classifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Instantiate the Random Forest Classifier and fit it to the training data
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Score the Random Forest Classifier on the training data (this is not a held-out estimate)
clf.score(X_train, y_train)

1.0
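A training score of 1.0 is not by itself evidence of a good model, because a model can simply memorise its training rows. The sketch below makes that concrete with a hypothetical `Memorizer` class (not part of Scikit-Learn) trained on a few rows taken from the table printed earlier: it is perfect on the rows it has seen and falls back to the majority class on anything new.

```python
from collections import Counter

class Memorizer:
    """Toy model (hypothetical): looks up rows it has seen, otherwise guesses."""

    def fit(self, X, y):
        # Remember every training row and the most common training label.
        self.table_ = {tuple(row): label for row, label in zip(X, y)}
        self.default_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table_.get(tuple(row), self.default_) for row in X]

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

# (Area, Perimeter, Compactness) for a few rows of the table printed earlier.
X_seen = [(15.26, 14.84, 0.8710), (14.88, 14.57, 0.8811), (14.29, 14.09, 0.9050),
          (13.84, 13.94, 0.8955), (12.19, 13.20, 0.8783), (11.23, 12.88, 0.8511),
          (13.20, 13.66, 0.8883)]
y_seen = [1, 1, 1, 1, 3, 3, 3]
X_new = [(16.14, 14.99, 0.9034), (11.84, 13.21, 0.8521)]
y_new = [1, 3]

model = Memorizer().fit(X_seen, y_seen)
train_score = model.score(X_seen, y_seen)
new_score = model.score(X_new, y_new)
print(f"seen rows: {train_score:.2f}, unseen rows: {new_score:.2f}")
```

This is why the score that matters is the one computed on held-out data such as `X_test`.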

Tidbit:

1. If you have structured data, use ensemble methods
2. If you have unstructured data, use deep learning or transfer learning

# 3. Fit the model/algorithm on our data and use it to make predictions

## 3.1 Fitting the model to the data

Different names for:

* `X` = features, features variables, data
* `y` = labels, targets, target variables

In [90]:
# Fit the model to the data (training the ML model)
clf.fit(X_train, y_train)

RandomForestClassifier()

In [92]:
# Evaluate the Random Forest Classifier (use the patterns the model has learned)
clf.score(X_test, y_test)

0.8571428571428571

# Random Forest model deep dive
More information about the Random Forest model:

https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76

## 3.2 Make predictions using a ML model
There are two ways to make predictions:

1. `predict()`
2. `predict_proba()`

In [94]:
X_test.head()

Unnamed: 0,Area,Perimeter,Compactness,Length of kernel,Width of kernel,Asymmetry coefficient,Length of kernel groove
30,13.16,13.82,0.8662,5.454,2.975,0.8551,5.056
172,11.27,12.97,0.8419,5.088,2.763,4.309,5.0
84,19.51,16.71,0.878,6.366,3.801,2.962,6.185
199,12.76,13.38,0.8964,5.073,3.155,2.828,4.83
60,11.42,12.86,0.8683,5.008,2.85,2.7,4.607


In [96]:
# Use a trained model to make predictions
clf.predict(X_test)

array([1, 3, 2, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1, 2, 1, 2, 1, 1, 3, 2, 2, 1,
       3, 2, 2, 1, 1, 2, 1, 3, 3, 3, 2, 1, 2, 2, 3, 2, 2, 3, 1, 3],
      dtype=int64)

In [98]:
np.array(y_test)

array([1, 3, 2, 3, 1, 3, 1, 3, 1, 3, 2, 3, 3, 2, 1, 2, 3, 1, 3, 2, 2, 1,
       3, 2, 2, 3, 1, 2, 1, 3, 3, 1, 2, 1, 2, 2, 3, 2, 2, 3, 3, 3],
      dtype=int64)

In [100]:
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.8571428571428571

In [102]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8571428571428571
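Both calls above compute the same thing: the fraction of test rows where the prediction matches the true label. As a minimal sketch, the calculation can be reproduced by hand with the two arrays printed above (copied here as plain lists; `y_true` stands in for `y_test`).

```python
# Predictions and true labels copied from the two arrays printed above.
y_preds = [1, 3, 2, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1, 2, 1, 2, 1, 1, 3, 2, 2, 1,
           3, 2, 2, 1, 1, 2, 1, 3, 3, 3, 2, 1, 2, 2, 3, 2, 2, 3, 1, 3]
y_true = [1, 3, 2, 3, 1, 3, 1, 3, 1, 3, 2, 3, 3, 2, 1, 2, 3, 1, 3, 2, 2, 1,
          3, 2, 2, 3, 1, 2, 1, 3, 3, 1, 2, 1, 2, 2, 3, 2, 2, 3, 3, 3]

# Accuracy is the fraction of positions where prediction and truth agree.
n_correct = sum(pred == true for pred, true in zip(y_preds, y_true))
accuracy = n_correct / len(y_true)
print(f"{n_correct} of {len(y_true)} correct -> accuracy {accuracy:.4f}")
```

36 of the 42 test rows match, which gives the 0.857 reported by `clf.score()`, `np.mean()` and `accuracy_score()`.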

Make predictions with `predict_proba()` - use this if someone asks you "what's the probability your model is assigning to each prediction?"

In [108]:
# predict_proba() returns the predicted probability of each class for every sample
clf.predict_proba(X_test[:10])

array([[0.85, 0.  , 0.15],
       [0.  , 0.  , 1.  ],
       [0.  , 1.  , 0.  ],
       [0.67, 0.  , 0.33],
       [0.63, 0.  , 0.37],
       [0.  , 0.  , 1.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.74, 0.26, 0.  ],
       [0.48, 0.  , 0.52]])

In [109]:
# Let's predict() on the same data...
clf.predict(X_test[:10])

array([1, 3, 2, 1, 1, 3, 1, 3, 1, 3], dtype=int64)
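The two outputs above are related: for each row, `predict()` returns the class with the highest probability from `predict_proba()`. The sketch below checks this with the probabilities copied from the output above, assuming the columns are in sorted class order (1, 2, 3).

```python
# Class labels in column order (assumed to be the sorted labels 1, 2, 3).
classes = [1, 2, 3]

# Probabilities copied from the predict_proba() output above.
probabilities = [
    [0.85, 0.00, 0.15],
    [0.00, 0.00, 1.00],
    [0.00, 1.00, 0.00],
    [0.67, 0.00, 0.33],
    [0.63, 0.00, 0.37],
    [0.00, 0.00, 1.00],
    [1.00, 0.00, 0.00],
    [0.00, 0.00, 1.00],
    [0.74, 0.26, 0.00],
    [0.48, 0.00, 0.52],
]

# For each row, pick the class whose probability is largest.
predictions = [classes[row.index(max(row))] for row in probabilities]
print(predictions)
```

The last row (0.48 vs 0.52) shows why the probabilities are worth inspecting: the predicted class is 3, but the model is close to undecided.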

In [111]:
X_test[:10]

Unnamed: 0,Area,Perimeter,Compactness,Length of kernel,Width of kernel,Asymmetry coefficient,Length of kernel groove
30,13.16,13.82,0.8662,5.454,2.975,0.8551,5.056
172,11.27,12.97,0.8419,5.088,2.763,4.309,5.0
84,19.51,16.71,0.878,6.366,3.801,2.962,6.185
199,12.76,13.38,0.8964,5.073,3.155,2.828,4.83
60,11.42,12.86,0.8683,5.008,2.85,2.7,4.607
155,11.19,13.05,0.8253,5.25,2.675,5.813,5.219
45,13.8,14.04,0.8794,5.376,3.155,1.56,4.961
182,12.19,13.36,0.8579,5.24,2.909,4.857,5.158
9,16.44,15.25,0.888,5.884,3.505,1.969,5.533
196,12.79,13.53,0.8786,5.224,3.054,5.483,4.958


# 4. Evaluating a machine learning model
Three ways to evaluate Scikit-Learn models/estimators:

1. Estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions

You can read more about these here: https://scikit-learn.org/stable/modules/model_evaluation.html

In [116]:
# Import cross_val_score (used in the next cells)
from sklearn.model_selection import cross_val_score

# Evaluate using the estimator's built-in score() method
clf.score(X_test, y_test)

0.8571428571428571

In [118]:
cross_val_score(clf, X, y, cv=5)

array([0.9047619 , 0.92857143, 0.97619048, 0.97619048, 0.69047619])

In [120]:
cross_val_score(clf, X, y, cv=10)

array([0.85714286, 0.95238095, 0.95238095, 0.9047619 , 1.        ,
       0.95238095, 1.        , 0.95238095, 0.66666667, 0.85714286])
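Cross-validation splits the data into `cv` folds, holds each fold out once for scoring, and trains on the rest. The sketch below shows the index bookkeeping for 5 folds over 210 rows and then averages the five scores printed above. It is a simplification under stated assumptions: `k_fold_indices` is a hypothetical helper that uses plain contiguous folds, whereas `cross_val_score` with a classifier and an integer `cv` uses stratified folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(210, k=5))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train)} test={len(test)}")

# Averaging the per-fold scores gives a single summary number.
cv5_scores = [0.9047619, 0.92857143, 0.97619048, 0.97619048, 0.69047619]
mean_cv5 = sum(cv5_scores) / len(cv5_scores)
print(f"mean 5-fold accuracy: {mean_cv5:.3f}")
```

The mean is usually a steadier estimate than a single train/test split, and the spread between folds (here from about 0.69 to 0.98) shows how much the score depends on which rows are held out.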

In [123]:
# Different classification metrics

# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_preds))

# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

0.8571428571428571
[[10  0  1]
 [ 0 14  0]
 [ 5  0 12]]
              precision    recall  f1-score   support

           1       0.67      0.91      0.77        11
           2       1.00      1.00      1.00        14
           3       0.92      0.71      0.80        17

    accuracy                           0.86        42
   macro avg       0.86      0.87      0.86        42
weighted avg       0.88      0.86      0.86        42
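The numbers in the classification report can be derived directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. A minimal sketch, using the matrix copied from the output above:

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
labels = [1, 2, 3]
cm = [[10, 0, 1],
      [0, 14, 0],
      [5, 0, 12]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    support = sum(cm[i])                            # row sum
    precision = true_positives / predicted_as_label
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, support)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")

print(f"accuracy={accuracy:.2f}")
```

Reading the matrix this way shows where the errors are: five class 3 seeds were predicted as class 1, which lowers the precision of class 1 and the recall of class 3, while class 2 is classified perfectly.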



# 5. Improve through experimentation
There are two main ways to improve a model's baseline metrics (the first evaluation metrics you get): from a data perspective and from a model perspective.

From a data perspective, ask:

* Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.
* Could we improve our data? This could mean filling in missing values or finding a better encoding (turning things into numbers) strategy.

From a model perspective, ask:

* Is there a better model we could use? If you've started out with a simple model, could you use a more complex one? (We saw an example of this when looking at the Scikit-Learn machine learning map; ensemble methods are generally considered more complex models.)
* Could we improve the current model? If the model you're using performs well straight out of the box, can the hyperparameters be tuned to make it even better?

Hyperparameters are settings on a model that you can adjust to change how it finds patterns, and potentially improve it. Adjusting hyperparameters is referred to as hyperparameter tuning.

In [125]:
# How to find a model's hyperparameters
clf = RandomForestClassifier()
clf.get_params() # returns a dict of the model's hyperparameters and their current values

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [131]:
# Example of adjusting hyperparameters by hand

# Split data into X & y
X = wheat_seeds_class.drop("Class (1, 2, 3)", axis=1) # use all columns except target
y = wheat_seeds_class["Class (1, 2, 3)"] # we want to predict y using X

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate two models with different settings
clf_1 = RandomForestClassifier(n_estimators=100)
clf_2 = RandomForestClassifier(n_estimators=200)

# Fit both models on training data
clf_1.fit(X_train, y_train)
clf_2.fit(X_train, y_train)

# Evaluate both models on test data and see which is best
print(clf_1.score(X_test, y_test))
print(clf_2.score(X_test, y_test))

0.9433962264150944
0.9622641509433962


In [133]:
# Example of adjusting hyperparameters computationally (recommended)

from sklearn.model_selection import RandomizedSearchCV

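# Define a grid of hyperparameters
# NOTE: "auto" (equivalent to "sqrt" for classifiers) was removed in scikit-learn 1.3; drop it on newer versions
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}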

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Set n_jobs to -1 to use all cores (NOTE: n_jobs=-1 is broken as of 8 Dec 2019, using n_jobs=1 works)
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 hyperparameter combinations
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print progress for each fit

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

# Find the best hyperparameters
print(rs_clf.best_params_)

# Scoring automatically uses the best hyperparameters
rs_clf.score(X_test, y_test)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=6, min_samples_leaf=2, max_features=auto, max_depth=30 
[CV]  n_estimat

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10, total=   1.2s
[CV] n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10 
[CV]  n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10, total=   1.3s
[CV] n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10 
[CV]  n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10, total=   1.3s
[CV] n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10 
[CV]  n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10, total=   1.2s
[CV] n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10 
[CV]  n_estimators=1200, min_samples_split=4, min_samples_leaf=4, max_features=auto, max_depth=10, total=   1.3s
[CV] n_estimators=500, min_samples_split=6,

[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   1.0s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   1.0s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   1.0s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   1.0s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   1.0s
{'n_estimators': 100, 'min_samples_split': 

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   24.5s finished


0.9523809523809523
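The log above reports 50 fits because the search evaluates only 10 randomly chosen combinations from the grid, each with 5-fold cross-validation. The sketch below illustrates the sampling idea in plain Python; no model is trained, the seed is arbitrary, and the sampled combinations will not match the ones Scikit-Learn picked.

```python
import itertools
import random

# Same grid as above, as plain Python values (no model is trained here).
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Every possible combination of one value per hyperparameter.
names = list(grid)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*grid.values())]

# Randomized search evaluates only a random subset of them.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = rng.sample(all_combinations, k=10)

n_fits = len(candidates) * 5  # 5-fold cross-validation per candidate
print(len(all_combinations), "combinations,", len(candidates), "sampled,", n_fits, "fits")
```

An exhaustive search over the same grid would need 540 × 5 = 2700 fits, which is the trade-off that makes a randomized search attractive as a first pass.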

# 6. Save and reload your trained model

You can save and load a model with pickle.

In [135]:
# Saving a model with pickle
import pickle

# Save an existing model to file (the with block closes the file afterwards)
with open("rs_random_forest_model_wheat_1.pkl", "wb") as f:
    pickle.dump(rs_clf, f)

In [136]:
# Load a saved pickle model (the with block closes the file afterwards)
with open("rs_random_forest_model_wheat_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

# Evaluate loaded model
loaded_pickle_model.score(X_test, y_test)

0.9523809523809523
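The loaded model scores the same as the original because pickle restores the object's full state. A minimal sketch of that round trip, using an in-memory byte string and a plain dictionary as a stand-in for the trained model (both are illustrative assumptions):

```python
import pickle

# Stand-in for a trained model: any picklable Python object works the same way.
model_stand_in = {"name": "random_forest", "n_estimators": 100}

# Serialise to bytes and restore, mirroring dump()/load() on a file.
payload = pickle.dumps(model_stand_in)
restored = pickle.loads(payload)

# The restored object is an equal copy, not the same object in memory.
print(restored == model_stand_in, restored is model_stand_in)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.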