# CPSC 330 Lecture 3

In [None]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 9)

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# conda install python-graphviz
# pip install graphviz  (the Graphviz binaries must be installed separately)
import graphviz

In [None]:
# conda install scikit-learn
from sklearn.tree import DecisionTreeClassifier, export_graphviz

In [None]:
# pip install git+https://github.com/mgelbart/plot-classifier.git
from plot_classifier import plot_classifier

# Lecture outline
- Introduction to supervised learning (10 mins)
- Tabular data (5 mins)
- Training a decision tree using scikit-learn (20 mins)
- Decision tree splitting rules (5 mins)
- Break (5 mins)
- ML model parameters and hyperparameters (10 mins)
- True/False questions (25 min)

## Introduction to supervised learning (10 mins)

## An example of supervised learning
- In supervised learning, we have a set of features, $X$, with associated targets, $y$
- We wish to find a model function that relates $X$ to $y$
- Then use that model function to predict future observations.

In [None]:
df = pd.read_csv('data/cities_USA.csv', index_col=0)
df

In [None]:
blue = df.query('vote == "blue"')
red  = df.query('vote == "red"')

In [None]:
plt.scatter(blue["lon"], blue["lat"], color="blue", alpha=0.3);
plt.scatter(red["lon"], red["lat"], color="red", alpha=0.3);
plt.ylabel("latitude");
plt.xlabel("longitude");

What are $X$ and $y$ here?

In [None]:
X = df[["lon", "lat"]]
X

In [None]:
y = df["vote"]  # a single column name gives a 1-dimensional Series
y

- Note that $X$ is a 2-dimensional array, whereas $y$ is 1-dimensional.
- In CPSC 340 we would say $X$ is a matrix and $y$ is a vector, but here we're avoiding linear algebra.
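
In pandas, whether a selection is 1-dimensional or 2-dimensional depends on whether you index with a column name or with a list of column names. Here is a small sketch of the difference, using made-up rows rather than the lecture data (`toy`, `X_toy`, `y_toy` and `y_toy_2d` are illustrative names):

```python
import pandas as pd

# Hypothetical toy rows standing in for the cities data (values are made up).
toy = pd.DataFrame({
    "lon": [-80.2, -122.3, -97.7, -74.0],
    "lat": [25.8, 47.6, 30.3, 40.7],
    "vote": ["blue", "blue", "red", "blue"],
})

X_toy = toy[["lon", "lat"]]  # list of column names -> DataFrame (2-D)
y_toy = toy["vote"]          # single column name -> Series (1-D)
y_toy_2d = toy[["vote"]]     # list with one name -> still a DataFrame (2-D)

print(X_toy.shape, X_toy.ndim)
print(y_toy.shape, y_toy.ndim)
print(y_toy_2d.shape, y_toy_2d.ndim)
```

The first entry of `X_toy.shape` is the number of examples and the second is the number of features.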

#### Classification vs Regression
- Variables can be characterized as quantitative/numeric or qualitative/categorical
- **Classification** = prediction of a categorical target (e.g. red vs. blue)
- **Regression** = prediction of a quantitative response
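
As a rough illustration of the distinction, the sketch below fits one tree to a categorical target and another to a numeric target. The fish measurements are invented for this example; `DecisionTreeRegressor` is scikit-learn's regression counterpart of the classifier imported above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up data: one feature (fish length in cm) and two possible targets.
length_cm = np.array([[10], [12], [14], [40], [45], [50]])
species = np.array(["perch", "perch", "perch", "pike", "pike", "pike"])  # categorical
weight_g = np.array([20.0, 30.0, 45.0, 600.0, 800.0, 1000.0])           # numeric

clf = DecisionTreeClassifier(max_depth=1).fit(length_cm, species)
reg = DecisionTreeRegressor(max_depth=1).fit(length_cm, weight_g)

new_fish = np.array([[42]])
predicted_species = clf.predict(new_fish)  # classification: returns a class label
predicted_weight = reg.predict(new_fish)   # regression: returns a number

print(predicted_species, predicted_weight)
```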

<img src='img/regr.png' width="750">

### Classification vs Regression questions

Which of these are examples of classification? (To answer on Piazza)

1. Predicting the price of a house based on features like number of rooms.
2. Predicting if a house will sell or not based on features like the price of the house, number of rooms, etc.
3. Predicting your grade based on past grades.
4. Predicting whether you should bicycle to work tomorrow based on the weather forecast.


#### Unsupervised learning

- Later in the course we will discuss _unsupervised learning_ as well.
- It is kind of like supervised learning but without the "y".
- More on supervised vs. unsupervised learning later.

## Tabular data and terminology (5 min)
- For ML we typically work with "tabular data"
- Rows are examples
- Columns are features (the last column is typically the target)

- This dataset contains longitude and latitude data for 400 cities in the US
- Each city is labelled as `red` or `blue` depending on how it voted in the 2012 election.
- The cities data was sampled from (http://simplemaps.com/static/demos/resources/us-cities/cities.csv). The election information was collected from Wikipedia.

### Terminology
- You will see many different terms for the same concepts in machine learning and statistics
- See the MDS terminology resource [here](https://ubc-mds.github.io/resources_pages/terminology/).

Of particular note:

- **examples** = rows = samples = records = instances (the number of examples is usually denoted by $n$)
- **features** = inputs = predictors = explanatory variables = regressors = independent variables = covariates (the number of features is usually denoted by $d$)
- **targets** = outputs = outcomes = response variable = dependent variable = labels (if categorical).
- **training** = learning = fitting

In [None]:
df

In [None]:
df.shape

In the full data set we have 400 examples of 3 variables (2 features, 1 target).

## Training a decision tree using scikit-learn (20 min)

### Using scikit-learn's fit/predict

In [None]:
model = DecisionTreeClassifier(max_depth=1)
model

We'll pick a few examples at random just for a toy example.

In [None]:
df = pd.read_csv('data/cities_USA.csv', index_col=0).sample(6, random_state=100)
df

In [None]:
X = df.drop(columns=['vote'])
y = df[['vote']]

In [None]:
model.fit(X, y)

In [None]:
df

In [None]:
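# With out_file=None, export_graphviz returns the DOT source as a string.
# class_names must follow the sorted label order stored in model.classes_.
graphviz.Source(export_graphviz(model,
                                out_file=None,
                                feature_names=list(X.columns),
                                class_names=["blue", "red"],
                                impurity=False))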
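
If Graphviz is not available, one alternative (not used in the lecture) is scikit-learn's `export_text`, which prints the split rules as indented text. The sketch below uses made-up rows, and `toy`, `X_toy`, `y_toy` and `stump` are illustrative names. It also prints `classes_`, the sorted label order that `class_names` is expected to follow.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows standing in for the sampled cities (values are illustrative).
toy = pd.DataFrame({
    "lon": [-120.0, -118.0, -100.0, -95.0, -80.0, -75.0],
    "lat": [38.0, 34.0, 31.0, 30.0, 41.0, 40.0],
    "vote": ["blue", "blue", "red", "red", "blue", "blue"],
})
X_toy, y_toy = toy[["lon", "lat"]], toy["vote"]

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# export_text returns the fitted tree as an indented string of split rules.
rules = export_text(stump, feature_names=list(X_toy.columns))
print(rules)
print(stump.classes_)  # label order used internally by the tree
```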

In [None]:
plt.figure()
ax = plt.gca()
plot_classifier(X, y, model, ax=ax, ticks=True);
plt.ylabel("latitude");
plt.xlabel("longitude");

In [None]:
model.score(X, y)
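
For classifiers, `score` reports accuracy: the fraction of examples whose predicted label matches the true label. The sketch below checks this by hand on made-up rows (`toy` and `stump` are illustrative names, not the lecture's variables):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows chosen so that a single split cannot classify them all correctly.
toy = pd.DataFrame({
    "lon": [-120.0, -118.0, -100.0, -95.0, -80.0, -75.0],
    "lat": [38.0, 30.5, 31.0, 30.0, 41.0, 29.0],
    "vote": ["blue", "blue", "red", "red", "blue", "red"],
})
X_toy, y_toy = toy[["lon", "lat"]], toy["vote"]

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Accuracy by hand: fraction of predictions equal to the true labels.
predictions = stump.predict(X_toy)
manual_accuracy = np.mean(predictions == y_toy.to_numpy())

print(manual_accuracy, stump.score(X_toy, y_toy))
```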

- We can also predict a brand new (made-up) point

In [None]:
X

In [None]:
made_up_X = np.array([-85, 30])
model.predict(made_up_X[np.newaxis])
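
When a model has been fitted on a DataFrame, another option is to pass new points as a DataFrame with the same column names in the same order, so that each value is explicitly tied to a feature (recent scikit-learn versions may warn when a model fitted with feature names is given a bare array). A minimal sketch on made-up rows, where `toy`, `stump` and `new_city` are illustrative names:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows (values are illustrative, not the lecture data).
toy = pd.DataFrame({
    "lon": [-120.0, -118.0, -100.0, -95.0, -80.0, -75.0],
    "lat": [38.0, 34.0, 31.0, 30.0, 41.0, 40.0],
    "vote": ["blue", "blue", "red", "red", "blue", "blue"],
})
X_toy, y_toy = toy[["lon", "lat"]], toy["vote"]
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# One new example as a single-row DataFrame with matching column names.
new_city = pd.DataFrame({"lon": [-85.0], "lat": [30.0]})
prediction = stump.predict(new_city)
print(prediction)
```

`predict` always expects a 2-dimensional input (one row per example), which is why the array version above adds an axis with `np.newaxis`.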

In [None]:
model.fit(X, y)

Let's now fit trees of increasing depth on the full data set.

In [None]:
df = pd.read_csv('data/cities_USA.csv', index_col=0)
X = df.drop(columns=['vote'])
y = df[['vote']]

In [None]:
model = DecisionTreeClassifier(max_depth=1)
model.fit(X,y)
plt.figure()
ax = plt.gca()
plot_classifier(X, y, model, ax=ax, ticks=True);
plt.ylabel("latitude");
plt.xlabel("longitude");

In [None]:
model.score(X,y)

In [None]:
model = DecisionTreeClassifier(max_depth=2)
model.fit(X,y)
plt.figure()
ax = plt.gca()
plot_classifier(X, y, model, ax=ax, ticks=True);
plt.ylabel("latitude");
plt.xlabel("longitude");

In [None]:
model.score(X,y)

In [None]:
model = DecisionTreeClassifier()
model.fit(X,y)
plt.figure()
ax = plt.gca()
plot_classifier(X, y, model, ax=ax, ticks=True);
plt.ylabel("latitude");
plt.xlabel("longitude");

In [None]:
model.score(X,y)
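
The same comparison can be written as a loop over `max_depth`. The sketch below uses synthetic points with a made-up labelling rule instead of the cities file, so the exact scores will differ from the ones above; `X_syn`, `y_syn` and `train_scores` are illustrative names. Note that these are scores on the training data itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cities data: 400 random (lon, lat) points.
rng = np.random.default_rng(0)
X_syn = rng.uniform(low=[-125.0, 25.0], high=[-65.0, 50.0], size=(400, 2))

# Made-up labelling rule plus noise, so a shallow tree cannot be perfect.
signal = np.sin(X_syn[:, 0] / 8.0) + (X_syn[:, 1] - 37.0) / 10.0
y_syn = np.where(signal + rng.normal(0.0, 0.5, size=400) > 0, "red", "blue")

train_scores = {}
for depth in [1, 2, 4, None]:  # None means no limit on depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_syn, y_syn)
    train_scores[depth] = tree.score(X_syn, y_syn)
    print(f"max_depth={depth}: training accuracy={train_scores[depth]:.3f}, "
          f"actual depth={tree.get_depth()}")
```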

## Decision tree splitting rules (5 mins)

- You saw in the video that a tree with only one split is called a "decision stump"
- How do we decide how to split the data?
- The basic idea is to pick a criterion (see [here](https://scikit-learn.org/stable/modules/tree.html#mathematical-formulation)) and then choose, among the possible splits, the one that scores best on it (for impurity measures, the split that reduces impurity the most).
- It turns out accuracy is not a good criterion for choosing splits, so we use fancier measures like "entropy" or "gini impurity".
- The basic idea is to try and make each leaf as "pure" as possible.
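
To make "purity" concrete, here is a small worked sketch of Gini impurity, $1 - \sum_k p_k^2$, and entropy, $-\sum_k p_k \log_2 p_k$, where $p_k$ is the fraction of class $k$ in a node. The class counts are invented for illustration: a parent node with 400 examples of each class and two hypothetical candidate splits. Both splits misclassify the same fraction of examples, but the impurity measures prefer split B because it produces a pure leaf.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (in bits) of a node given its per-class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

def error_rate(counts):
    """Fraction misclassified when predicting the majority class."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.max(p)

def weighted(measure, children):
    """Average a node measure over child nodes, weighted by child size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * measure(c) for c in children)

# Hypothetical (class 0 count, class 1 count) in each child node.
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

for name, split in [("A", split_a), ("B", split_b)]:
    print(name,
          round(weighted(error_rate, split), 4),
          round(weighted(gini, split), 4),
          round(weighted(entropy, split), 4))
```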

## Break (5 mins)

## ML model parameters and hyperparameters (10 mins)

- When you call `fit`, a bunch of values get set, like the split variables and split thresholds. 
- These are called **parameters**
- But even before calling `fit` on a specific data set, we can set some "knobs" that control the learning.
- These are called **hyperparameters**

In [None]:
df = pd.read_csv('data/cities_USA.csv', index_col=0)
X = df.drop(columns=['vote'])
y = df[['vote']]
df

In scikit-learn, hyperparameters are set in the constructor:

In [None]:
model = DecisionTreeClassifier(max_depth=3) 
model.fit(X, y);

Here, `max_depth` is a hyperparameter. There are many, many more! See [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
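
One way to see the difference in code is sketched below on synthetic data (`X_syn`, `y_syn` and `model_syn` are illustrative names). Hyperparameters can be read back with `get_params()` before `fit` is ever called, while the learned split features and thresholds only exist afterwards, in the fitted `tree_` attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a made-up labelling rule (not the cities data).
rng = np.random.default_rng(0)
X_syn = rng.uniform(size=(200, 2))
y_syn = np.where(X_syn[:, 0] + X_syn[:, 1] > 1.0, "red", "blue")

model_syn = DecisionTreeClassifier(max_depth=3)

# Hyperparameters: set by us in the constructor, known before fitting.
print(model_syn.get_params()["max_depth"])
print(hasattr(model_syn, "tree_"))  # nothing has been learned yet

model_syn.fit(X_syn, y_syn)

# Parameters: learned by fit and stored on the fitted tree.
print(model_syn.tree_.feature)    # feature index split on at each node (negative at leaves)
print(model_syn.tree_.threshold)  # split threshold at each node
print(model_syn.get_depth())      # actual depth, at most max_depth
```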



In [None]:
# class_names must follow the sorted label order in model.classes_,
# which is ["blue", "red"] for these labels.
graphviz.Source(export_graphviz(model,
                                out_file=None,
                                feature_names=list(X.columns),
                                class_names=["blue", "red"],
                                impurity=True))

To summarize:

- **parameters** are automatically learned by the algorithm during training
- **hyperparameters** are specified based on:
    - expert knowledge
    - heuristics, or 
    - systematic/automated optimization (more on that later on)

## Preview of next class...

- Why not just use a very deep decision tree for every supervised learning problem and get super high accuracy?

## True/False questions (25 min)

Which of these are true about decision trees?

1. Decision trees are typically binary trees (2 children per node).
2. Typically, the features that we split on at each node are chosen by a human.
3. A decision stump is a decision tree with depth $\leq 3$.
4. The same feature can be split on multiple times in a tree with depth > 1.

<br><br><br><br><br><br>

For each of the following, answer with `fit` or `predict`:

1. At least for decision trees, this is where most of the hard work is done.
2. Only takes `X` as an argument.
3. In scikit-learn, we can ignore its output.
4. Is called first (before the other one).