[![Binder](https://mybinder.org/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek09.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week09.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week09.ipynb)

# Week 9: Introduction to machine learning

- Machine learning involves building mathematical models to turn data into *information*.

- These models depend on *tunable* parameters that can be adjusted.
  
- In this way the model can be considered to be 'learning' from the data as it tunes the parameters accordingly.

- Once these models have been fit to some data set, sometimes referred to as "trained", they can be used to convert data to information on other data sets. 

- The effectiveness of the model depends on many factors, one of which is the size of the training data.

We will be using `scikit-learn` for much of our ML discussion. 
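
To make 'tunable parameters' concrete, here is a minimal sketch that is not part of the original notebook. The data are made up for illustration: ten points on the line $y = 2x + 1$. Fitting a `LinearRegression` tunes two parameters, the slope and the intercept, and the fitted model can then be applied to new data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: ten points on the line y = 2x + 1
x = np.arange(10).reshape(-1, 1)   # scikit-learn expects a 2D feature array
y = 2 * x.ravel() + 1

# "Training" tunes the slope and the intercept to fit the data
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# The trained model turns new data into information
new_value = model.predict([[20]])
print("prediction at x = 20:", new_value[0])
```

The same pattern, `fit` on training data and then `predict` on other data, is what we use for the classifiers later on.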

There isn't just one tool or one 'ML algorithm'.

![](imgs/scikitlearn1.png)

![](imgs/scikitlearn2.png)

## Two flavors

There are two fundamental flavors: *supervised* and *unsupervised* 'learning'.

The distinction is simply about how the learning, or training, process works.

### Supervised learning

Supervised learning occurs when training data is labeled. 

Main categories of supervised learning:
1. Classification
2. Regression

Examples: 
1. Measurements of different species of Iris are compared against their species labels to find a pattern (classification).
2. Determining a continuous function, so that events can be predicted in the future (regression).

Another classification example: 

Blueberry muffin or chihuahua? 

![](imgs/blueberry_chihuahua.jpeg)

### Unsupervised learning

Unsupervised learning occurs when training data is *not* labeled. 

Main categories:
1. Clustering
2. Association
3. Dimension Reduction

Examples include:
1. MRI scans are searched for problematic areas (clustering).
2. 'Market basket analysis', which is the 'customers who bought X also bought Y' idea (association).
3. Obtaining only the relevant information (e.g. 95% of total variance) of a data set (dimension reduction).

Because of the nature of unsupervised learning -- no need for labels -- it is easy to use multiple tools.

For example, reduce the dimension then cluster. 

This has the potential to save a lot of time and electricity compared with clustering without dimension reduction.
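
Here is a rough sketch of 'reduce the dimension, then cluster' on the iris measurements. It is not part of the original notebook, and the particular tools (`StandardScaler`, `PCA`, `KMeans`) and the choice of three clusters are assumptions made for illustration. Passing a float to `n_components` asks `PCA` to keep enough components to explain that fraction of the total variance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Unsupervised: we only use the measurements, never the species labels
X = load_iris().data

# Put all features on the same scale before looking at variance
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain 95% of the total variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print("shape before:", X.shape, "shape after:", X_reduced.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())

# Cluster in the reduced space (3 clusters is our assumption here)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_reduced)
print("first ten cluster labels:", cluster_labels[:10])
```

On a data set this small the savings are negligible. The point is only the order of operations.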

### Back to irises

Let's look at the Iris dataset again.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn 
sklearn.__version__

In [None]:
# We'll discuss this as we get to it...
from sklearn.metrics import classification_report

## Classification of Irises

Let's load the iris data set.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
iris

In [None]:
ser = pd.Series(iris.target_names[iris.target], name='species')
df_labeled = pd.DataFrame(
	iris.data, 
	columns=iris.feature_names, 
)
df_labeled = pd.concat([ser, df_labeled], axis=1)

In [None]:
df_labeled.head()

We can get some quick info about the individual species.

In [None]:
(df_labeled
	.query('species == "setosa"')
	.describe()
)
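
To compare all three species at once, one option is a `groupby`. This is a small sketch that rebuilds the data frame so it runs on its own; the name `df` and the choice of summary statistics are just for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# One row of summary statistics per species for a single measurement
summary = (df
    .groupby('species')['sepal length (cm)']
    .agg(['mean', 'std', 'min', 'max'])
)
print(summary)
```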

Let's try to visualize their differences.

In [None]:
setosa_sepal = df_labeled.query("species == 'setosa'")['sepal length (cm)']
versicolor_sepal = df_labeled.query("species == 'versicolor'")['sepal length (cm)']
virginica_sepal = df_labeled.query("species == 'virginica'")['sepal length (cm)']

# Three overlaid histograms help to compare distributions
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=10)

# Plots 
plt.hist(setosa_sepal, label='setosa', **kwargs)
plt.hist(versicolor_sepal, label='versicolor', **kwargs)
plt.hist(virginica_sepal, label='virginica', **kwargs)
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Density")	# density=True normalizes each histogram
plt.legend()
plt.title('Sepal length distributions of different species')
_ = plt.show()

Instead of viewing the three separately, let's stack them.

In [None]:
sepaldata = [setosa_sepal, versicolor_sepal, virginica_sepal]

kwargs = dict(histtype='barstacked', density=True, bins=10)

plt.hist(sepaldata,  label=['setosa','versicolor','virginica'], **kwargs)
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Density")	# density=True normalizes the histogram
plt.legend()
plt.title('Sepal length distributions of different species')
_ = plt.show()

Remember the plot we did last week comparing lots of pairs of variables? 

We can do something similar very easily with `pandas`.

In [None]:
pd.plotting.scatter_matrix(
	df_labeled, 
	c=iris.target, 
	figsize=(15, 15), 
	marker='o', 
	hist_kwds={'bins': 20, 'alpha':.6, 'edgecolor':'black'}, 
	s=60, 
	alpha=.7
)
plt.show()

You might be able to get the histograms to look nice.

I could not figure out an *easy* way to distinguish species. (I'm sure there's a way...)

We can use `seaborn` to do this elegantly.

In [None]:
import seaborn as sns

sns.set_context('notebook')
_ = sns.pairplot(df_labeled, hue="species")

## Building the model

**Goal:** an algorithm that, given some measurements of an Iris, tells us whether it is 
- setosa 
- versicolor 
- virginica

### Step 1: Splitting the data

We need to split our data into two sets:
- training data
- validation data

We use training data to ... train our model.

We use validation data to ... validate our model.

`scikit-learn` already has something for this: `train_test_split`.

In [None]:
from sklearn.model_selection import train_test_split
train_test_split?

Let's see how it works.

In [None]:
test_data = np.random.randint(0, 100, size=(8, 3))
print(test_data)

In [None]:
train_test_split(test_data, test_size=0.25)	# Try test_size and train_size

We will go with an 80/20 split between training and validation data.

In [None]:
X = df_labeled.drop('species', axis=1)
y = df_labeled['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
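
The split above is random, so the numbers change on every run. If you want a reproducible split that also keeps the species balanced, `train_test_split` accepts `random_state` and `stratify`. The following is a sketch under those assumptions (the seed `0` is arbitrary), written so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# random_state fixes the shuffle; stratify keeps the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data,
    iris.target,
    test_size=0.2,
    random_state=0,
    stratify=iris.target,
)

print(X_tr.shape, X_te.shape)
# Number of flowers of each species in the held-out data
print(np.bincount(y_te))
```

With 50 flowers per species and a 20% hold-out, stratifying gives 10 of each species in the held-out data.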

### Step 2: Choosing the model

We will choose to use a decision tree model. 

Here's an example.

(Can you tell it's made by someone from Silicon Valley???)

![](https://images.ctfassets.net/wp1lcwdav1p1/4cpLu1KCkDsmNG3up3Hivs/a8e5c327b618b23c8d306bf1a2764cb9/Screen_Shot_2022-07-25_at_12.15.04_PM.png?w=1500&q=60)

What happens is that the algorithm iteratively splits the parameter space.

For the iris data set, the parameter space is, say, $\mathbb{R}^4$ or $\mathbb{R}_{>0}^4$.

As in the example above:
- the whole space is divided into two
- then one piece is divided into two

In total, three subspaces -- seen as leaves of the tree.
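
To see what 'two splits, three leaves' looks like in code, here is a hand-written tree for the iris data. This is only a sketch: the function `tiny_tree` and its thresholds (2.5 cm and 1.75 cm) are picked by hand for illustration, not learned by any algorithm.

```python
from sklearn.datasets import load_iris

iris = load_iris()

def tiny_tree(petal_length, petal_width):
    """A hand-written decision tree with two splits and three leaves."""
    # First split: cut the whole space in two
    if petal_length < 2.5:
        return 'setosa'
    # Second split: cut one of the two pieces in two again
    if petal_width < 1.75:
        return 'versicolor'
    return 'virginica'

# Columns 2 and 3 are petal length and petal width
predicted = [tiny_tree(row[2], row[3]) for row in iris.data]
actual = iris.target_names[iris.target]

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(f"Accuracy on the full data set: {accuracy:.2%}")
```

A decision tree classifier does the same thing, except that it chooses the variables and the thresholds from the training data.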

Let's work with the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from `scikit-learn`.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(max_depth=1)	# Try different values for max_depth
clf.fit(X_train, y_train)					# Train the classifier
predictions = clf.predict(X_test)			# Test the classifier
score = accuracy_score(y_test, predictions)	# Evaluate the classifier
print(f"Accuracy: {score:.2%}")

OK, so .... what does that mean?

We can use a *confusion matrix* to look a little more closely at the performance of our classification algorithm.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

_ = ConfusionMatrixDisplay(
	confusion_matrix(y_test, predictions),
	display_labels=['setosa', 'versicolor', 'virginica']
).plot()

- The rows describe the *actual* values.
- The columns describe the *predicted* values.
- Ideally, everything is on the diagonal.
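
Here is a toy example with made-up labels, small enough to check by hand. The lists `actual` and `predicted` are invented for illustration. Two of the six flowers are misclassified, and they show up off the diagonal.

```python
from sklearn.metrics import confusion_matrix

# Six made-up flowers: what they are and what a model said they are
actual    = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
predicted = ['setosa', 'setosa', 'versicolor', 'virginica',  'virginica', 'versicolor']
labels = ['setosa', 'versicolor', 'virginica']

# Rows are actual species, columns are predicted species
cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)

# Accuracy is the diagonal sum divided by the total count
accuracy = cm.trace() / cm.sum()
print(f"Accuracy: {accuracy:.2%}")
```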

We can see weights for each of the variables based on how important they were in the classification.

In [None]:
pd.Series(clf.feature_importances_, index=iris.feature_names)

Now let's actually look at the decision tree.

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(15,10))
plot_tree(clf, filled=True)
plt.show()
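
If the picture is hard to read, the same tree can be printed as text with `export_text`. This is a sketch on the full iris data with `max_depth=2`; both choices, and the name `tree`, are ours and made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree trained on all of the data, just to have something to print
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```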

#### A note on the Gini (diversity) index

This is also called Gini impurity.

The idea is that the Gini index is a real number between $0$ and $1$. 

The precise formula and explanation are not needed for us.

- Closer to $0$ means less diverse,
- Closer to $1$ means more diverse.

(This is not the same thing as the Gini coefficient from economics, although both are named after the same [Gini](https://en.wikipedia.org/wiki/Corrado_Gini).)
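
For the curious, the standard definition is $G = 1 - \sum_k p_k^2$, where $p_k$ is the proportion of samples in the node belonging to class $k$. The helper `gini_impurity` below is our own small sketch of that formula, not a `scikit-learn` function.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    counts = np.asarray(counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A node with only one species is not diverse at all
print(gini_impurity([50, 0, 0]))
# Two species in equal parts
print(gini_impurity([50, 50, 0]))
# All three species in equal parts: as diverse as three classes can get
print(gini_impurity([50, 50, 50]))
```

With $K$ classes the largest possible value is $1 - 1/K$, so with three species the index never goes above $2/3$.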

What follows is code taken from `scikit-learn`'s [tutorial on decision trees](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html). 

They have a nice way to look at how the parameter space is cut by the decision tree.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "ryb"
# plot_step = 0.02


for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    ax = plt.subplot(2, 3, pairidx + 1)
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=plt.cm.RdYlBu,
        response_method="predict",
        ax=ax,
        xlabel=iris.feature_names[pair[0]],
        ylabel=iris.feature_names[pair[1]],
    )

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(
            X[idx, 0],
            X[idx, 1],
            c=color,
            label=iris.target_names[i],
            edgecolor="black",
            s=15,
        )

plt.suptitle("Decision surface of decision trees trained on pairs of features")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")

### Overfitting

One of the biggest disadvantages of decision trees is a phenomenon called *overfitting*.

Overfitting occurs when the model is fit too closely to particulars of the given data set.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/2048px-Overfitting.svg.png)

- The black line represents a model that is not overfitted.
- The green line represents a model that is overfitted.
- The red and blue dots with a black outline represent new data.

Overfitting is very common with decision trees. 
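
One rough way to look for overfitting is to compare accuracy on the training data with accuracy on held-out data as the tree is allowed to grow deeper. The sketch below does this for a few values of `max_depth`; the depths and the seed are arbitrary choices for illustration, and the iris data is small and clean, so the effect may be mild here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

# max_depth=None lets the tree grow until every leaf is pure
scores = {}
for depth in [1, 2, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # score() returns the accuracy on the given data
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train {scores[depth][0]:.2%}, test {scores[depth][1]:.2%}")
```

A training accuracy that keeps climbing while the test accuracy stalls or drops is the warning sign.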

There are ways to mitigate this, but they aren't perfect.

One way to mitigate it is the following.
1. Plant a bunch of decision trees -- grow a decision forest. 
2. Allow them each to become overfitted.
3. Take an average.
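
The three steps can be sketched by hand. What follows is a simplified illustration, not how `scikit-learn` implements its forests: each tree is trained on a random resample of the training data (drawn with replacement), and the 'average' is a majority vote. The number of trees (25) and the seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Step 1: plant a tree on a random resample of the training data
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: no max_depth, so each tree is free to overfit its own sample
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Step 3: every tree votes on every test flower; the majority wins
votes = np.stack([tree.predict(X_te) for tree in trees])  # one row per tree
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

accuracy = (forest_pred == y_te).mean()
print(f"Accuracy of the hand-made forest: {accuracy:.2%}")
```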

In [None]:
from sklearn.ensemble import RandomForestClassifier

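# X and y were overwritten in the decision-boundary loop above, so rebuild them
X = df_labeled.drop('species', axis=1)
y = df_labeled['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)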

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred_forest = clf.predict(X_test)
accuracy = accuracy_score(y_test, pred_forest)
print(f'Accuracy: {accuracy:.2%}')

In [None]:
_ = ConfusionMatrixDisplay(
	confusion_matrix(y_test, pred_forest),
	display_labels=['setosa', 'versicolor', 'virginica']
).plot()