# Review of scikit-learn via the Kaggle Leaf Classification competition
One of the Kaggle "Playground" competitions involves image-based leaf identification. It's a great competition for reviewing how to use scikit-learn (sklearn, for short) and various statistical/machine learning techniques for an important classification problem. It also lets us review a little numpy, pandas and matplotlib.

Let's start by checking out the details of the competition.

https://www.kaggle.com/c/leaf-classification

I've already downloaded the data, so let's check that out too.

* `data/` - folder containing the test and train data in csv format
* `images/` - folder containing the leaf images

A couple things to note about this problem:

* There are quite a few numeric features (192 predictor columns) but we really don't know exactly how they were computed nor what they mean other than that they describe the margin, shape, and texture of the leaves.
* We can get started by simply using these 192 input variables and NOT doing any image analysis ourselves. Later, we can push ourselves to generate our own features doing our own image analysis.
* It's NOT a binary classification problem. There are 99 leaf classes with 10 samples per class.
* The contest does NOT want binary predictions for each sample, but instead wants probabilities for each sample being in each of the 99 classes. We'll see that sklearn makes it easy to predict either classes or probabilities.
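
To make the classes-versus-probabilities distinction concrete before we touch the leaf data, here's a minimal sketch on a stand-in dataset. It assumes sklearn's built-in iris data (3 classes rather than 99) and a shallow decision tree; neither is part of the competition. The point is only the shape of what `predict` and `predict_proba` return.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the leaf data: 150 samples, 4 features, 3 classes
X_toy, y_toy = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_toy, y_toy)

classes = clf.predict(X_toy)       # one class label per sample
probs = clf.predict_proba(X_toy)   # one probability per sample per class

print(classes.shape)               # (150,)
print(probs.shape)                 # (150, 3)
print(probs.sum(axis=1)[:5])       # the probabilities in each row sum to 1
```

For the leaf problem the probability matrix has the same layout, just with 99 columns.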

This notebook has multiple learning objectives related to Python, modeling, installing software, using external libraries and tools and more.

* Basic review of sklearn (and some numpy, pandas, and matplotlib)
* Checking out the notebooks for the PDSH book (extremely good)
* Creating numpy arrays from pandas dataframes
* The sklearn estimator API workflow (we won't use Pipeline objects until later in the module)
* Visualizing matrices and decision trees
* Quickly fitting and scoring multiple types of ML models
* Submitting entries to a Kaggle competition
* Combining models into an ensemble model using voting classifiers

## Intro to scikit-learn from Python Data Science Handbook

Let's start by reviewing part of Jake Vanderplas' notebook entitled `05.02-Introducing-Scikit-Learn.ipynb` from the [PDSH set of notebooks](https://github.com/jakevdp/PythonDataScienceHandbook) (which you should have already downloaded when you were setting up your machine for this course). This book and associated notebooks are a fantastic resource for learning to effectively use Jupyter notebooks (Ch1), numpy (Ch2), pandas (Ch3), matplotlib (Ch4) and scikit-learn (Ch5). 

In particular, we'll review how to represent data in scikit-learn and the basic estimator API for fitting and using statistical/machine learning models.

In addition, the [sklearn Getting Started guide](https://scikit-learn.org/stable/getting_started.html) gives a concise high-level overview of using scikit-learn.


## Getting started with the Leaf Classification problem

Let's do our standard pandas, numpy and matplotlib imports.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image

We are also going to need the `mpimg` submodule from matplotlib. You can find a nice [image tutorial for matplotlib here](https://matplotlib.org/stable/tutorials/introductory/images.html) if you want to know more about representing images with numpy arrays - we did a little of this back in the pcda class when we [recolorized a picture of a Blackburnian Warbler](http://www.sba.oakland.edu/faculty/isken/courses/mis5470_w21/modeling3_intro_scikit_learn.html#unsupervised-learning-with-r-and-python) using cluster analysis.

In [None]:
import matplotlib.image as mpimg

Need the standard magic command so our plots are displayed inline in the notebook.

In [None]:
%matplotlib inline

In order to use matplotlib with anything other than PNG images, we need Pillow. Let's check if it's installed. SPOILER ALERT: We installed it as part of the `aap` conda env.

In [None]:
# Linux or mac
# conda list | grep 'pillow'

# Windows Anaconda command prompt
# conda list | findstr "pillow"

In [None]:
# I've already installed it
#!conda install pillow

Let's look at a few random pics. I've heavily commented this code to serve as a bit of a matplotlib review.

In [None]:
# Choose three random numbers between 1 and 1500. These will correspond to picture filenames.
# The image files are numbered starting at 1, so we avoid 0 (randint excludes the upper bound).
picnums = list(np.random.randint(1, 1501, size=3))
print(picnums)

# Notice the use of a simple list comprehension to get a list of filenames with paths
paths_to_pics = ['./images/'+str(picnum)+'.jpg' for picnum in picnums]
print(paths_to_pics)

# Create an empty figure object with matplotlib and set the figure size
plt.figure(figsize=(10.0, 3.0))

# Loop over a range of ints to use as index nums for the pic list
for i in range(3):
    img = mpimg.imread(paths_to_pics[i]) # Read the image
    plt.subplot(1, 3, i + 1)             # Add a new subplot to our figure (1 row by 3 cols and this is i+1)
    plt.axis('off')                      # Suppress the axes display
    plt.title(paths_to_pics[i])          # Add a title for this image
    plt.imshow(img)                      # Display the image within the subplot

plt.tight_layout()  # Tweak the subplot layout
plt.show()          # Show the entire plot

### Focus of this notebook
We are NOT going to be getting into details of different classification algorithms in this notebook. Instead, we'll see how easy it is with `scikit-learn` to try out several different classification techniques, even combining them via ensembles, and make submissions to a Kaggle competition. One of the strengths of `scikit-learn` is that it strives to have a very consistent interface no matter which technique you are using. In general, our approach will be:

* instantiate a specific classifier model object
* fit the model using our input data
* make predictions using the model for the test data
* write out a submission file for Kaggle
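
The four steps above can be sketched as one small function. This is only an illustration under a few assumptions: `fit_predict_write` is a hypothetical helper name (it's not used elsewhere in this notebook), the data is sklearn's built-in iris set rather than the leaf data, and the csv is written to a temporary folder.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_predict_write(model, X, y, X_new, class_names, csv_path):
    """Hypothetical helper: fit, score, predict probabilities, write a csv."""
    model.fit(X, y)                        # fit the model
    train_score = model.score(X, y)        # accuracy on the training data
    probs = model.predict_proba(X_new)     # class probabilities for new rows
    out = pd.DataFrame(probs, columns=class_names)
    out.index.name = "id"
    out.to_csv(csv_path)                   # write the submission-style file
    return train_score, out


iris = load_iris()
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy_submission.csv"
    score, toy_probs = fit_predict_write(
        DecisionTreeClassifier(random_state=0),   # instantiate the classifier
        iris.data, iris.target, iris.data[:5],
        list(iris.target_names), csv_path,
    )
    written = pd.read_csv(csv_path, index_col=0)

print(f"Training accuracy: {score:.3f}")
print(written.shape)   # 5 rows, one column per class
```

Because every sklearn classifier exposes the same `fit`, `score` and `predict_proba` methods, the same function would work unchanged with a different model object passed in.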

As you are going through this notebook, you might want to activate the integrated Table of Contents in the left sidebar. This is a new feature in Jupyter Lab 3 (it used to be an extension). Here's what it looks like:

In [None]:
Image("toc.png")

## Read and explore the training and test data
Let's read the data into Pandas dataframes since we are familiar with those and it's easy to convert them to numpy arrays as needed. Since there is an `id` column in the dataset and it's the first column, we'll tell pandas to use it as the `Index` for the `DataFrame`.

In [None]:
train_df = pd.read_csv("data/train.csv", index_col=0)
test_df = pd.read_csv("data/test.csv", index_col=0)

In [None]:
train_df.head()

In [None]:
train_df.tail()

In [None]:
test_df.head()

In [None]:
train_df.info()

Notice a few things:

* other than the `species` column, all of the other columns are numeric.
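* the target column, `species`, is a string. We'll convert it to integer codes for use with `scikit-learn`. Its classifiers can handle string labels too, but integer codes line up directly with the columns of the predicted probability matrices we'll see later.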

One nice feature of sklearn is that it includes functionality for many parts of the statistical predictive modeling workflow. In this case, we can use the `LabelEncoder` object from the `preprocessing` module to recode the target value. As you'll see from the following doc page, `LabelEncoder` is intended for use with **target** values, not for the features.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Before just copying and pasting and editing the entire example code from the above page to accomplish the encoding, let's explore a few pieces at a time so that we see what's going on. I encourage you to do this throughout the course and beyond so that you develop an understanding of what's happening in a sequence of sklearn code lines.

In [None]:
from sklearn import preprocessing

# Create a LabelEncoder object
le = preprocessing.LabelEncoder()

# Use its fit method to fit the labels in the species column
le.fit(train_df['species'])

So, what information does our `LabelEncoder` object, `le`, actually contain after using its `fit` method? A few strategies for exploring such things are:

* Use the Python `dir` function on the object to see all of its attributes
* Look at the sklearn API for `LabelEncoder` - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.fit
* Look in the sklearn User Guide for `LabelEncoder` - https://scikit-learn.org/stable/modules/preprocessing_targets.html#label-encoding

In [None]:
print(dir(le))

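Hmm, that `classes_` attribute (note the trailing underscore) looks like it might contain the values of our target. BTW, what's up with that trailing underscore in `classes_`? In sklearn, attributes whose names end with an underscore hold values that were estimated from the data during `fit`. Don't confuse this with the general Python convention of adding a trailing underscore to a name (e.g. `class_`) to avoid a conflict with a reserved Python keyword. Search the mighty interweb for "python trailing underscore convention" to read about the latter.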

In [None]:
# Let's look at the classes that were fit
targets = list(le.classes_)
print(targets)
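
In sklearn, attributes ending in an underscore are the ones estimated from data by `fit`. A quick way to see this is to check for `classes_` before and after fitting. Here's a minimal sketch using a fresh encoder and a few made-up labels (not the leaf species):

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
before_fit = hasattr(toy_le, "classes_")
print(before_fit)    # nothing has been learned yet

toy_le.fit(["oak", "maple", "oak", "birch"])   # made-up labels
after_fit = hasattr(toy_le, "classes_")
print(after_fit)     # the attribute is created by fit

print(toy_le.classes_)   # the sorted unique labels
```

The same pattern holds for the models we fit later: anything with a trailing underscore only exists once `fit` has been called.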

Now that the encoder is fit, we use its `transform` method to apply it to our `species` column. To start with, I'm **not** going to add a new column to the dataframe. Start by just sticking the output into a variable that we can then explore and make sure things are working ok.

<div class="alert alert-info">
  <b>I highly recommend checking intermediate values and data types as you are developing sklearn code. Not only will you catch errors more easily and reduce stress and frustration, but you will also learn more about how sklearn is working.</b>
</div>

In [None]:
encoded_target = le.transform(train_df['species'])
encoded_target

Looks pretty good. Notice the data type of the `encoded_target` variable.

In [None]:
type(encoded_target)

We can go backwards with the `inverse_transform` method.

In [None]:
le.inverse_transform(encoded_target)

Now that we know things are working, let's redo the `transform` and stuff the results into a new column in `train_df`.

In [None]:
# Add new column to train dataframe with encoded target values
train_df['target'] = le.transform(train_df['species'])

In [None]:
train_df.head()

Ok, almost ready to build models. **We've got all of our features as well as our target variable as numeric columns in a pandas dataframe**. The string `species` column is still in there as well. When we fit models, we'll be pulling out the features and the encoded target variable into numpy arrays to pass into the various modeling functions. Just to facilitate subsetting our dataframe, let's create a list of the feature column names in `train_df` - they are the columns numbered 1 through 192 (the slice `[1:193]` used below excludes its endpoint).

What column is column 0?

In [None]:
features = list(train_df.columns[1:193])
print(features)
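
Slicing by position works, but it quietly depends on the column order. As an alternative, here's a sketch that picks the feature columns by name instead. It uses a tiny made-up dataframe with the same layout (species first, target last); the column names are only illustrative.

```python
import pandas as pd

# Tiny made-up frame with the same layout: species, features, then target
toy_df = pd.DataFrame({
    "species": ["a", "b"],
    "margin1": [0.1, 0.2],
    "shape1": [0.3, 0.4],
    "texture1": [0.5, 0.6],
    "target": [0, 1],
})

by_position = list(toy_df.columns[1:4])                      # columns 1, 2 and 3
by_name = list(toy_df.columns.drop(["species", "target"]))   # everything else

print(by_position)
print(by_name)
```

Both approaches give the same list here. The name-based version keeps working if columns get added or reordered.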

A common convention in sklearn is to use `X` for the feature matrix and `y` for the target vector. Also, while several of the sklearn modeling functions can take either numpy arrays or pandas dataframes as input, we will explicitly create `X` and `y` as numpy arrays (using the `np.array()` function). It's more standard to use arrays in sklearn and most examples in the documentation will do so. 

After creating the `X` and `y` arrays, notice how I check their shape to make sure things look ok before trying to build models. Also notice the use of `f-strings`. These are a [newish way of printing formatted strings](https://realpython.com/python-f-strings/) and I've become a big fan of them.

In [None]:
# Set the features variable, X for both the training and test data
X = np.array(train_df[features])
X_test = test_df[features] # I'll leave X_test as a pandas DataFrame just to show that they work too

# Set the target variable, y
y = np.array(train_df["target"])

print(f"X is a {X.shape} matrix of type {type(X)}\n")
print(f"X_test is a {X_test.shape} matrix of type {type(X_test)}\n")
print(f"y is a {y.shape} vector of type {type(y)}\n")
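
As an aside, pandas has its own conversion method, `DataFrame.to_numpy()`, which gives the same result as `np.array()` for an all-numeric dataframe. Here's a small sketch on made-up data, along with the kind of row-count check that's worth doing before fitting anything:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"f1": [0.1, 0.2, 0.3],
                       "f2": [1.0, 2.0, 3.0],
                       "target": [0, 1, 0]})

a = np.array(toy_df[["f1", "f2"]])     # the approach used in this notebook
b = toy_df[["f1", "f2"]].to_numpy()    # pandas' own conversion method
y_toy = toy_df["target"].to_numpy()

print(type(a), a.shape)
print(np.array_equal(a, b))            # same values either way

# The feature matrix and target vector must have the same number of rows
assert a.shape[0] == y_toy.shape[0]
```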


## Technique 1: Decision trees
Let's start with a simple decision tree. The sklearn doc page for decision trees, http://scikit-learn.org/stable/modules/tree.html, has a good review of the pros and cons of decision trees as well as info on fitting, predicting and visualizing with decision trees.

### Step 1 - create and fit the decision tree model

The first step is always to figure out which modules and/or functions you need to import from sklearn. The API is well organized and usually it is just a matter of navigating the table of contents from the sklearn home page, using the search box, or a simple web search to find the main doc page for the model you need. The examples on those pages will show the necessary imports.

In [None]:
from sklearn.tree import DecisionTreeClassifier

Now we can instantiate a new model object, fit it, and score the fit. Note that the decision tree is initialized with one parameter:

* `min_samples_split=20` --> a node in the tree must have at least 20 samples to be considered for splitting
* all other parameters are set to their default values

See http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier to read about all the decision tree parameters.

In [None]:
# Create a DecisionTreeClassifier model. 
tree_1 = DecisionTreeClassifier(min_samples_split=20)
# Fit the model using our features and target variables
tree_1.fit(X, y)
# Get % accuracy on the training data using the score() method
tree_1.score(X, y)

Ok, we've got just over 82% accuracy on the training data.

### Step 2 - make predictions for test dataset using tree and write submission file

Since this is our first use of `scikit-learn`, let's predict both classes (using `predict`) and probabilities (using `predict_proba`) just to see what these things look like.

In [None]:
# Classes - if we forced sklearn to pick a class for each data row, these are what it picked
tree_1_testclasses = tree_1.predict(X_test)
tree_1_testclasses[:5]

In [None]:
# Class probabilities - just the first three rows and first 15 cols
tree_1_testprobs = tree_1.predict_proba(X_test)
tree_1_testprobs[:3, :15]

Notice for the first row, since the predicted class was 51 and we are only showing the first 15 columns of probabilities, they are all small (actually, zero). In row 2, the predicted class was 6 and you can see that the predicted probability at index 6 of the second row (the seventh element, since everything starts at index 0) is $1.0$. In row 3, the predicted class was 14. Notice for this case that the algorithm estimated a probability of $0.42105263$ for class 1 and $0.47368421$ for class 14. Since $0.47368421 > 0.42105263$, the predicted class would be 14 if we were forced to pick a class.
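
Here's the row 3 reasoning as a tiny numpy sketch. The two probabilities are the ones quoted above; they happen to equal 8/19 and 9/19, which is what you'd get if the tree leaf held 19 training samples (that reading is my inference, not something sklearn reports). Where the leftover 2/19 sits is made up for illustration.

```python
import numpy as np

n_classes = 99
row = np.zeros(n_classes)
row[1] = 8 / 19     # 0.42105263...
row[14] = 9 / 19    # 0.47368421...
row[30] = 2 / 19    # hypothetical position for the remaining probability

print(round(row.sum(), 8))   # the probabilities in a row sum to 1
print(np.argmax(row))        # index of the largest probability -> 14
```

Picking a class from a row of probabilities is just an `argmax` over that row.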

Now write out the csv file for Kaggle submission. Note that there's a subfolder named `archived_submits` that contains previously created submission files. I'll submit this file to show how to do it, but I've also included my Kaggle model scoring results at the bottom of this notebook if you don't want to create a Kaggle account and try this for yourself.

In [None]:
df_tree_1_testprobs = pd.DataFrame(tree_1_testprobs, 
                                   columns=targets, 
                                   index=test_df.index)

df_tree_1_testprobs.to_csv("output/tree_1_submission.csv")
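
Before uploading a file, it doesn't hurt to sanity check the dataframe. Here's a minimal sketch: `check_submission` is a hypothetical helper (not part of sklearn or Kaggle) and the class names and ids are made up. It checks that there's one column per class, no missing values, and that each row of probabilities sums to 1.

```python
import numpy as np
import pandas as pd


def check_submission(df, n_classes):
    """Hypothetical sanity check for a probability-style submission frame."""
    assert df.shape[1] == n_classes, "expected one column per class"
    assert not df.isna().any().any(), "missing values found"
    assert np.allclose(df.sum(axis=1), 1.0), "each row should sum to 1"
    return True


toy = pd.DataFrame(
    [[0.7, 0.2, 0.1], [0.0, 1.0, 0.0]],
    columns=["Acer_A", "Acer_B", "Quercus_C"],   # made-up class names
    index=pd.Index([4, 7], name="id"),           # made-up sample ids
)
ok = check_submission(toy, n_classes=3)
print(ok)
```

For the real submission, the call would use the probability dataframe and `n_classes=99`.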

### Digression - visualization of probability matrices and trees
Visualizations can help in understanding how these techniques work and provide a way to compare solutions from different models. Here are a few examples.

Here's a very simple color map of the probability matrix. The darker the point, the higher the probability.

In [None]:
plt.matshow(tree_1_testprobs[:200, :100], cmap='Blues')
plt.show()

Speaking of visualization, yes, you can actually get a picture of the tree itself using this function. This uses a piece of software known as [Graphviz](https://graphviz.org/). It has bindings available for many languages, including Python. 

Make sure that `python-graphviz` is installed if you want the following to work. It should already be installed in the `aap` conda environment.

```
conda install python-graphviz
```

In [None]:
from sklearn.tree import export_graphviz
import subprocess

There's a lot going on in the following function. I encourage you to spend some time making sense of it. I've got some explanatory text in the cells below the function.

In [None]:
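def visualize_tree(tree, feature_names, dot_filename, png_filename):
    """Create tree png using graphviz.

    Args
    ----
    tree -- fitted scikit-learn DecisionTreeClassifier.
    feature_names -- list of feature names.
    dot_filename -- path of the intermediate dot file to write.
    png_filename -- path of the png file to create.
    """
    # Write the tree structure in Graphviz dot format
    with open(dot_filename, 'w') as f:
        export_graphviz(tree, out_file=f,
                        feature_names=feature_names)

    # Pass the command as a list and let subprocess run the program directly.
    # Combining a list with shell=True on Linux/mac runs a bare `dot` and
    # drops the remaining arguments. (On Windows, if `dot` is a .bat wrapper
    # in your install, you may need to add shell=True back.)
    command = ["dot", "-Tpng", dot_filename, "-o", png_filename]
    try:
        subprocess.run(args=command, check=True)
    except (OSError, subprocess.CalledProcessError) as err:
        # Report the problem instead of exiting
        print(f"Could not run dot, ie graphviz, to produce visualization: {err}")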

In [None]:
visualize_tree(tree_1, features, "output/dt.dot", "output/dt.png")

In case you're wondering, the function above builds the command as a Python list and hands it to the `subprocess` module to run. Learn more about [spawning subprocesses from within Python code from the docs](https://docs.python.org/3/library/subprocess.html). Each element of the list (after the first) is a command line option or argument for the `dot` program. Joining the list elements with the string `join` method shows the equivalent command line.

In [None]:
command = ["dot", "-Tpng", "output/dt.dot", "-o", "output/dt.png"]
" ".join(command)

If the png file fails to get generated after calling `visualize_tree`, you can try running the command from the command prompt (in the same directory as this notebook). 

```
dot -Tpng output/dt.dot -o output/dt.png
```
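
If you aren't sure whether Graphviz is reachable from Python at all, you can check for the `dot` executable on your PATH using the standard library. This is just a diagnostic sketch; it doesn't install or fix anything.

```python
import shutil

# shutil.which returns the full path to the executable, or None if not found
dot_path = shutil.which("dot")

if dot_path is None:
    print("Graphviz 'dot' was not found on PATH")
else:
    print(f"Found dot at {dot_path}")
```

If `dot` isn't found, activating the right conda environment before launching Jupyter is the first thing to try.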

In [None]:
from IPython.display import Image
Image("output/dt.png")


### A second tree
Let's try to grow a bigger tree by making the node splitting parameter smaller (set it to 2). Give this a shot yourself (copy-paste-edit is your friend). Look closely at the resulting score for the fitted model. What does this suggest about the model? Answer at bottom of notebook.

In [None]:
# 1. Create model object


# 2. Fit model and score the fit


# 3. Predict using test data


# 4. Create submission file


Let's go to Kaggle and submit these two models.

## Technique 2: Random forest
We've seen that random forests are generalizations of simple decision trees that can help avoid overfitting, reduce variance and result in a better predictor. A random forest is an example of an *ensemble* method - we put a bunch of trees together and let them vote on the results. It is an extension of *bagging*. In bagged decision trees we create a bunch of trees using several datasets resampled with replacement from our original data (i.e. a resampled dataset can contain duplicate rows and leave out other rows entirely). Random forests take this notion of resampling even further and only consider a random subset of the available variables at each split in the tree construction process. Here's a graphical representation of bagged trees (or random forests if you pretend that the variables are being resampled, too).

In [None]:
Image("random_forest_diagram.png")
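
The two kinds of resampling can be sketched with nothing but numpy. The row and feature counts match our training data (99 classes x 10 samples = 990 rows, 192 features); the seed and the square-root rule for the number of features per split are assumptions on my part (square root is a common choice for classification). This only draws index numbers, it doesn't build any trees.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 990, 192

# Bagging: draw n_rows row indices WITH replacement
boot_rows = rng.integers(0, n_rows, size=n_rows)
n_unique = np.unique(boot_rows).size
print(f"{n_unique} distinct rows drawn, {n_rows - n_unique} rows left out")

# Random forest twist: only a random subset of features is considered at a split
n_sub = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=n_sub, replace=False)
print(sorted(split_features.tolist()))
```

Each tree in the forest gets its own bootstrap sample of rows, and each split within a tree gets its own random subset of features.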

Check out the sklearn docs on ensemble modeling - http://scikit-learn.org/stable/modules/ensemble.html.

Look at how easy it is to use all the same ideas above to quickly try out this technique.

As always start by importing the model type you want to use. In our case, a `RandomForestClassifier`. Then go through the same sequence of steps we went through before:

    create model --> fit and score model on training data --> make predictions on test data --> create Kaggle submission file

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# 1. Create model
randforest_1 = RandomForestClassifier()

# 2. Fit model
randforest_1.fit(X, y)
print(f"Score: {randforest_1.score(X, y)}")

# 3. Predict using test data
randforest_1_testprobs = randforest_1.predict_proba(X_test)
df_randforest_1_testprobs = pd.DataFrame(randforest_1_testprobs, columns=targets, index=test_df.index)

# 4. Create submission file
df_randforest_1_testprobs.to_csv("output/randforest_1_submission.csv")

Whoa! We got a perfect score on the training data. What do you think about that? Let's go to Kaggle and see how we do on the real test data.

## Technique 3: Logistic regression
There's a version of logistic regression known as *multinomial logistic regression* that can be used for classification problems with more than two classes. We will be exploring this method in greater detail in a subsequent notebook. 

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

We know the drill by now...

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# 1. Create model
logistic_1 = LogisticRegression()

# 2. Fit model
logistic_1.fit(X, y)
# Notice how we can control formatting in f-strings. This is just like we do with format() method.
print(f"Score: {logistic_1.score(X, y):.4f}")

# 3. Predict using test data
logistic_1_testprobs = logistic_1.predict_proba(X_test)
df_logistic_1_testprobs = pd.DataFrame(logistic_1_testprobs, columns=targets, index=test_df.index)

# 4. Create submission file
df_logistic_1_testprobs.to_csv("output/logistic_1_submission.csv")

Not a very good score on the training data. Let's go see how we do on test.

## Technique 4: k-Nearest Neighbor
This is a simple technique and we should give it a try for this problem since we have all numeric data that has already been rescaled to be on a common scale. 

http://scikit-learn.org/stable/modules/neighbors.html

<div class="alert alert-warning">
  <b>Remember, any ML technique that relies on computing a distance between vectors should have the data rescaled so that the distance metric isn't influenced by the units of measurement of the variables.</b>
</div>
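
Here's a small sketch of why this matters, using made-up measurements in very different units (metres and grams) and sklearn's `StandardScaler`. Our leaf features don't need this step; the example is only to show the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: height in metres, weight in grams
X_raw = np.array([[1.60, 60000.0],
                  [1.90, 60500.0],
                  [1.61, 62000.0]])

# In raw units the Euclidean distance is essentially just the weight difference
raw_d01 = np.linalg.norm(X_raw[0] - X_raw[1])
raw_d02 = np.linalg.norm(X_raw[0] - X_raw[2])
print(raw_d01, raw_d02)

# After standardizing, each column has mean 0 and standard deviation 1,
# so both features contribute to the distance
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_d01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_d02 = np.linalg.norm(X_scaled[0] - X_scaled[2])
print(scaled_d01, scaled_d02)
```

In the raw data, the big difference in height between the first two rows has almost no effect on their distance.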

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# 1. Create model
k = 5
knn_1 = KNeighborsClassifier(k)

# 2. Fit model
knn_1.fit(X, y)
print(f"Score: {knn_1.score(X, y):.4f}")

# 3. Predict using test data
knn_1_testprobs = knn_1.predict_proba(X_test)
df_knn_1_testprobs = pd.DataFrame(knn_1_testprobs, columns=targets, index=test_df.index)

# 4. Create submission file
df_knn_1_testprobs.to_csv("output/knn_1_submission.csv")

## Technique 5: Put 'em together into an ensemble
While it may seem counterintuitive, combining a bunch of different models into an overall classifier by doing some sort of voting (for class prediction) or weighted averaging (for class probabilities) has worked pretty well in practice. In fact, state of the art weather forecasting models tend to be ensemble models.

In [None]:
Image("weather_ensemble.png")

We'll gather up all the models we just fit and create a *soft voting* classifier using equal weights for the models. The soft style of voting equates to weighted averaging of the predicted probabilities where we get to specify the model weights.

* http://scikit-learn.org/stable/modules/ensemble.html
* http://scikit-learn.org/stable/modules/ensemble.html#votingclassifier
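
The arithmetic behind soft voting is just a weighted average of the probability rows. Here's a sketch with made-up probabilities from three models for a single sample and three classes:

```python
import numpy as np

# Made-up predicted probabilities for one sample (rows = models, cols = classes)
model_probs = np.array([
    [0.6, 0.3, 0.1],   # model A
    [0.2, 0.5, 0.3],   # model B
    [0.3, 0.4, 0.3],   # model C
])
weights = [1.0, 1.0, 1.0]   # equal weights

avg = np.average(model_probs, axis=0, weights=weights)
print(avg)                  # weighted mean probability per class
print(np.argmax(avg))       # the soft vote picks the class with the largest mean -> 1
```

Changing the weights shifts the average toward the models you trust more.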

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
# 1. Create ensemble model with weights for each submodel
# (tree_2 is the model from the "A second tree" exercise)
ensemble_1 = VotingClassifier(estimators=[('tree_1', tree_1),
                                          ('tree_2', tree_2),
                                          ('randforest_1', randforest_1),
                                          ('logistic_1', logistic_1),
                                          ('knn_1', knn_1)], 
                              voting='soft', weights=[1.0, 1.0, 1.0, 1.0, 1.0])

# 2. Fit model
ensemble_1.fit(X, y)
print(f"Score: {ensemble_1.score(X, y):.4f}")

# 3. Predict using test data
ensemble_1_testprobs = ensemble_1.predict_proba(X_test)
df_ensemble_1_testprobs = pd.DataFrame(ensemble_1_testprobs, columns=targets, index=test_df.index)

# 4. Create submission file
df_ensemble_1_testprobs.to_csv("output/ensemble_1_submission.csv")

Submit this to Kaggle to see how we did.

Try creating a second ensemble model with different weights to see if you can improve on your performance on the test data.

## Closing Thoughts

This was just a bit of review of sklearn, pandas, numpy, Jupyter notebooks, conda, submitting entries to Kaggle and some basic modeling workflows. Now we will take a closer look at some more advanced statistical/machine learning techniques including logistic regression models with regularization and boosted trees. We will also learn about analysis pipelines, and setting up a good project structure with cookiecutters.

## Answers

In [None]:
# 1. Create model
tree_2 = DecisionTreeClassifier(min_samples_split=2) # 2 is default too

# 2. Fit model and score the fit
tree_2.fit(X, y)
print(tree_2.score(X, y))

# 3. Predict using test data
tree_2_testprobs = tree_2.predict_proba(X_test)
df_tree_2_testprobs = pd.DataFrame(tree_2_testprobs, columns=targets, index=test_df.index)

# 4. Create submission file
df_tree_2_testprobs.to_csv("output/tree_2_submission.csv")

Whoa. We got a perfect score for our fitted model. We have definitely overfitted the model by allowing many more splits. Such a model is unlikely to perform well on new data.
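
One way to see overfitting without going to Kaggle is to compare training accuracy with cross-validated accuracy. Here's a sketch on a stand-in dataset (sklearn's built-in digits data, not the leaf data), using a fully grown tree like `tree_2`:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = load_digits(return_X_y=True)   # stand-in data, not the leaf set

deep_tree = DecisionTreeClassifier(min_samples_split=2, random_state=0)
train_acc = deep_tree.fit(X_toy, y_toy).score(X_toy, y_toy)

# 5-fold cross-validation: each fold is scored on rows the tree never saw
cv_acc = cross_val_score(deep_tree, X_toy, y_toy, cv=5).mean()

print(f"Training accuracy:        {train_acc:.3f}")
print(f"Cross-validated accuracy: {cv_acc:.3f}")
```

The gap between the two numbers is a rough measure of how optimistic the training score is.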

My Kaggle scores for the various models (lower is better):

* tree_1: 8.61652
* tree_2: 10.52444
* randforest_1: 0.69329
* logistic_1: 4.08320
* ensemble_1: 0.81326
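
To put these numbers in context: as I understand it, the competition scores submissions with multi-class logarithmic loss, which sklearn provides as `log_loss`. Here's a sketch on made-up predictions showing how confident correct answers and uniform guessing score:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 2]   # made-up true class indices for three samples

confident_right = np.array([[0.90, 0.05, 0.05],
                            [0.05, 0.90, 0.05],
                            [0.05, 0.05, 0.90]])
uniform_guess = np.full((3, 3), 1 / 3)

ll_right = log_loss(y_true, confident_right, labels=[0, 1, 2])
ll_uniform = log_loss(y_true, uniform_guess, labels=[0, 1, 2])

print(ll_right)      # equals -ln(0.9)
print(ll_uniform)    # equals ln(3)
print(np.log(99))    # what a uniform guess over 99 classes would score
```

Log loss heavily punishes putting a probability of (nearly) zero on the true class. That's a plausible reason why single decision trees, which produce lots of hard 0 and 1 probabilities, score so much worse here than the random forest.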