# Supervised learning

Overview of today's topics:
  - model training, evaluation, and tuning
  - binomial and multinomial logistic regression
  - decision trees and random forests
  - *k*-nearest neighbors
  - naive bayes
  - perceptrons, support vector machines, and the kernel trick
  
In machine learning, we feed data to an algorithm so that it can make predictions and extract information. Machine learning's broad categories:
  - supervised learning: train a model on observed (labeled) data to predict unobserved data
    - classification: predict categorical variable
    - regression: predict continuous variable
  - unsupervised learning: discover structure in and extract information from unlabeled data
    - clustering: assign observations to groups based on their features
    - dimensionality reduction: transform many features to a lower-dimension space
  - reinforcement learning: train model by rewarding it when it takes correct action
  - artificial neural networks and deep learning
    
Basic machine learning tasks:
  - data collection and cleaning
  - feature selection: select a relevant subset of features to train your model
  - feature extraction: apply a function to a feature to create a new feature, dimensionality reduction, scaling
  - model choice: [identify](https://scikit-learn.org/stable/tutorial/machine_learning_map/) the right learning algorithm for the task
  - model training: train the model on a set of data (if the model is parametric, we often call this step "parameter estimation")
  - model evaluation: assess its performance on a set of testing data: did it overfit or underfit?
  - model selection and hyperparameter tuning: adjust the features and the hyperparameters the algorithm uses to fit the model for optimum performance
  - prediction: use the model to make predictions on unseen data

### Probability refresher

**Probability** is the ratio of the number of outcomes in which an event occurs to the number of all possible outcomes, whereas the **odds** are the ratio of the event occurring to it not occurring. That is, the odds are the ratio of the probability of an event occurring to the probability of it not occurring: $\text{odds}=\frac{p}{1-p}$ and conversely $p=\frac{\text{odds}}{1 + \text{odds}}$.

For example, if there are 8 blue marbles and 2 red marbles in an urn, the probability of drawing a blue marble is $\frac{8}{8+2}=0.8$, its odds are $\frac{8}{2}=4$ (often expressed 4:1) which is equivalent to $\frac{0.8}{1-0.8}=4$, and its log-odds are therefore $\log(4)=1.386$.

**Log-odds** (the logarithm of the odds) are useful because they take odds that are asymmetrically distributed around 1 and transform them symmetrically around 0, such that $\log(4)=-\log(\frac{1}{4})$, and allow us to linearly combine **odds ratios** by simply adding and subtracting (because the log of a ratio is the log of the numerator minus the log of the denominator). In other words, the *odds* are the ratio of two probabilities, and an *odds ratio* is the ratio of two odds: useful when comparing the odds of a "what if" scenario to the odds of the base scenario, as we will see.
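
The marble example above can be checked numerically. This is a minimal sketch: the helper names `prob_to_odds` and `odds_to_prob` are hypothetical, introduced only for illustration.

```python
import math


def prob_to_odds(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability: odds / (1 + odds)."""
    return odds / (1 + odds)


# 8 blue marbles and 2 red marbles in the urn
p_blue = 8 / (8 + 2)
odds_blue = prob_to_odds(p_blue)
log_odds_blue = math.log(odds_blue)

# the log-odds of red are the negative of the log-odds of blue
log_odds_red = math.log(prob_to_odds(1 - p_blue))

print(p_blue, odds_blue, log_odds_blue, log_odds_red)
```

The last two values should be equal in magnitude and opposite in sign, which is the symmetry around 0 described above.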

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)

In [None]:
# load CA tract-level census variables
df = pd.read_csv('../../data/census_tracts_data_ca.csv', dtype={'GEOID10':str}).set_index('GEOID10')
df.shape

In [None]:
df.head()

## 1. Logistic Regression

Logistic regression is a regression analysis technique used when the response is binary. It is used as a classification method, but it belongs to the family of generalized linear models and uses the same linear predictor as linear regression: it generates a continuous probability value before converting it to a classification prediction. Logistic regression uses maximum likelihood estimation (with regularization by default in scikit-learn) to estimate the parameters of a logit model. It maximizes the (log) [likelihood](https://en.wikipedia.org/wiki/Likelihood_function) function, which is equivalent to minimizing a cost function (the negative log-likelihood) with an iterative optimizer such as gradient descent.

The **logit** model of some probability $p$ represents its log-odds:

$\text{logit}(p) = \log{\frac{p}{1-p}} = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$

The logit function is the inverse of the logistic function. It takes a value $p$ that ranges from 0 to 1 and converts it to a value that ranges from $-\infty$ to $+\infty$, which is necessary for regression analysis. In our example, $p$ represents the probability of being assigned to one of the classes in our classification scheme.
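
To see that the logit and logistic functions undo each other, here is a small sketch using NumPy only. The function names `logit` and `logistic` are defined here for illustration and are not part of the notebook's imports.

```python
import numpy as np


def logistic(z):
    """Map a log-odds value in (-inf, +inf) to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))


def logit(p):
    """Map a probability in (0, 1) to its log-odds in (-inf, +inf)."""
    return np.log(p / (1 - p))


p = np.array([0.01, 0.2, 0.5, 0.8, 0.99])
z = logit(p)          # log-odds: negative below p=0.5, zero at 0.5, positive above
p_back = logistic(z)  # round trip recovers the original probabilities

print(z)
print(p_back)
```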

**Today we will build some simple models to predict tract poverty status.**

In [None]:
# classify tracts into high poverty vs not
df['poverty'] = (df['pct_below_poverty'] > 20).astype(int)
df['poverty'].value_counts().sort_index()

In [None]:
# feature selection: which features are important for predicting our categories?
response = 'poverty'
predictors = ['median_age', 'pct_renting', 'pct_bachelors_degree', 'pct_english_only']
data = df[[response] + predictors].dropna()
y = data[response]
X = data[predictors]

In [None]:
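# feature scaling: important for optimal performance especially if algorithm
# uses gradient descent or requires regularization
# note: for simplicity we fit the scaler on all the data before splitting, which
# lets test-set statistics leak into training; fitting it on the training set
# only is the stricter practice
X_std = StandardScaler().fit_transform(X)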

In [None]:
# split data into 70/30 training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.3)

In [None]:
# train model on training data then use it to make predictions with test data
blr = LogisticRegression()
y_pred = blr.fit(X_train, y_train).predict(X_test)

In [None]:
# inspect the probabilities
probs = blr.predict_proba(X_test)
df_probs = pd.DataFrame(probs, columns=blr.classes_)
df_probs['pred'] = y_pred
df_probs['actual'] = y_test.values
df_probs.head()

Manually calculate the probability of observation $i$ belonging to class $1$ as $p = \frac{e^{\beta X_i}}{1 + e^{\beta X_i}}$ where the decision function is $\text{logit}(p) = \beta X$

In [None]:
# calculate the logit (log-odds) for observation 0 and convert to probability
# this is its probability of being assigned to class 1
log_odds = np.dot(blr.coef_, X_test[0]) + blr.intercept_
odds = np.exp(log_odds)
prob = odds / (1 + odds)
prob

In [None]:
# now it's your turn
# what is the predicted probability of test case 9 being a high-poverty tract?


## 2. Classification into multiple categories

Binomial logistic regression's predictions assign observations to one of two classes, but many real-world scenarios require three or more classes. The rest of today's examples will explore multi-class supervised learning classification by categorizing tracts into "low", "mid", or "high" poverty status. We will treat these class labels as if they were not ordinal.

In [None]:
# create a poverty classification variable
# by default, set all as mid poverty tracts
df['poverty'] = 'mid'

# identify all low poverty tracts
mask_low = df['pct_below_poverty'] <= 5
df.loc[mask_low, 'poverty'] = 'low'

# identify all high poverty tracts
mask_high = df['pct_below_poverty'] >= 25
df.loc[mask_high, 'poverty'] = 'high'

df['poverty'].value_counts().sort_index()

In [None]:
# feature selection
response = 'poverty'
predictors = ['median_age', 'pct_renting', 'pct_bachelors_degree', 'pct_english_only']
data = df[[response] + predictors].dropna()
y = data[response]
X = data[predictors]

In [None]:
# feature scaling
X_std = StandardScaler().fit_transform(X)

In [None]:
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.3)

## 3. Multinomial Logistic Regression

Multinomial logistic regression generalizes binomial logistic regression to multiple classes. That is, it is a regression analysis technique used when the response is categorical and contains >2 classes. Multinomial logistic regression uses the [softmax](https://en.wikipedia.org/wiki/Softmax_function) function to generalize the logistic function to multiple inputs: in probability theory, the softmax represents a probability distribution across a set of possible classes.
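
As a rough illustration of the softmax function, the sketch below turns a vector of class scores into probabilities that sum to 1. The `softmax` helper and the example scores are made up for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np


def softmax(z):
    """Convert a vector of real-valued scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())  # shift by the max to avoid overflow
    return exp_z / exp_z.sum()


scores = np.array([2.0, 1.0, -1.0])  # hypothetical scores for three classes
probs = softmax(scores)
print(probs, probs.sum())

# with two classes and one score fixed at 0, softmax reduces to the logistic function
z = 1.7
two_class = softmax([z, 0.0])[0]
logistic_z = 1 / (1 + np.exp(-z))
print(two_class, logistic_z)
```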

In [None]:
# train model on training data then use it to make predictions with test data
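# with the default lbfgs solver, scikit-learn fits a multinomial (softmax) model
# automatically when y has more than two classes (the explicit multi_class
# argument is deprecated in recent scikit-learn releases)
mlr = LogisticRegression(C=1)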
y_pred = mlr.fit(X_train, y_train).predict(X_test)

### Making sense of the probabilities

Let's inspect the estimated probabilities of observation $i$ belonging to class $c$ given $\beta$ and $X_i$, the estimated coefficients and $i$'s features.

Then, manually calculate the logit, normalize via softmax, and compare.

In [None]:
probs = mlr.predict_proba(X_test)
df_probs = pd.DataFrame(probs, columns=mlr.classes_)
df_probs['pred'] = y_pred
df_probs['actual'] = y_test.values
df_probs.head()

In [None]:
# calculate the logit (log-odds) then normalize with softmax function
i = 0  # pick an observation
log_odds = np.dot(mlr.coef_, X_test[i]) + mlr.intercept_
prob = np.exp(log_odds) / np.exp(log_odds).sum()
prob

### Making sense of the coefficients

Logistic regression is a parametric method. We have estimated the parameters (coefficients) of a logit model. scikit-learn is a machine learning package and treats logistic regression in the predictive ML paradigm: for a traditional statistical inference treatment of logistic regression, use the statsmodels package instead.

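Each estimated coefficient is the log of an odds ratio (pay attention to the difference between "log odds" and "log [of the] odds ratio"). An **odds ratio** is the ratio of two odds: here, the odds of the outcome after a 1-unit increase in the predictor divided by the odds of the outcome before that increase. Thus the logit coefficient $\beta_{c,k}=\log\frac{\text{odds}(y=c | X_k+1)}{\text{odds}(y=c | X_k)}$ is the ceteris paribus log of the odds of an observation being in class $c$ if $X_k$ is incremented by $1$ divided by the odds of it being in class $c$ if nothing changes. Conversely, the odds ratio is the exponentiated logit coefficient: $\text{odds ratio} = \frac{\text{odds}(y=c | X_k+1)}{\text{odds}(y=c | X_k)} = e^{\beta_{c,k}}$. This interpretation is exact for binomial logistic regression. In the multinomial (softmax) model, $e^{\beta_{c,k}}$ is the multiplicative change in class $c$'s unnormalized score $e^{\beta_c X}$, which is the quantity the code below treats as the "odds". Because the other classes' scores change too, the true odds $\frac{p_c}{1-p_c}$ generally change by a different factor.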

In [None]:
# estimated coefficients on each variable, for each class
df_coeffs = pd.DataFrame(mlr.coef_, columns=X.columns, index=mlr.classes_)
df_coeffs

In [None]:
# calculate the odds ratio for some class and some predictor
# a 1-unit increase in predictor k increases the odds of class c by what %
B_ck = df_coeffs.loc['low', 'pct_english_only']
odds_ratio = np.exp(B_ck)
odds_ratio

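Given an odds ratio $\rho$, the percent change $\delta$ in the odds can be calculated as $\delta = 100(\rho - 1)$.

That is, a 1-unit increase in the percent that speak English-only at home is associated with a $\delta$% change in the odds of being classified in the low poverty category (an increase if $\delta > 0$, a decrease if $\delta < 0$). Because we standardized the predictors, one unit here means one standard deviation of that variable, not one percentage point.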
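
The conversion from a coefficient to a percent change in the odds can be sketched as follows. The coefficient values here are hypothetical, chosen only to show one positive and one negative case; they are not estimates from our model.

```python
import numpy as np

# hypothetical logit coefficients (not taken from the fitted model)
beta_pos = 0.25
beta_neg = -0.25

for beta in (beta_pos, beta_neg):
    odds_ratio = np.exp(beta)             # rho = e^beta
    pct_change = 100 * (odds_ratio - 1)   # delta = 100 * (rho - 1)
    print(f'beta={beta:+.2f}  odds ratio={odds_ratio:.4f}  change in odds={pct_change:+.1f}%')

# keep the positive case for inspection
pct_change_pos = 100 * (np.exp(beta_pos) - 1)
```

Note that equal and opposite coefficients do not give equal and opposite percent changes, because the odds ratio is multiplicative.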

In [None]:
# manually calculate the odds ratio for some observation, class, and predictor
i = 0  # observation in position 0 (you can pick any one)
c = 1  # class in position 1 (ie, "low")
k = 3  # predictor in position 3 (ie, "pct_english_only")

# calculate the logit of class c if nothing changes, then convert to odds
x0 = X_test[i]
log_odds0 = np.dot(mlr.coef_, X_test[i]) + mlr.intercept_
odds0 = np.exp(log_odds0[c]) # convert log-odds to odds

# calculate the logit of class c if we increase k by 1, then convert to odds
x1 = x0.copy()
x1[k] = x1[k] + 1
log_odds1 = np.dot(mlr.coef_, x1) + mlr.intercept_
odds1 = np.exp(log_odds1[c]) # convert log-odds to odds

# calculate the odds ratio
odds_ratio = odds1 / odds0
odds_ratio

## 4. Model Assessment

The model's performance can be assessed via several **validation** methods, including:

  - holdout method: fit model to one subset of data, then test it on a different subset
  - *k*-fold cross validation: divide data into *k* groups then, for each group, train the model on all the *other* groups, then test the model on the group and record its assessment score
  - bootstrapping: sample with replacement from dataset to assess accuracy
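
As a sketch of *k*-fold cross validation, the example below uses scikit-learn's `cross_val_score` on synthetic data from `make_classification` (an assumption made so the block runs on its own; swap in your own `X` and `y`). Wrapping the scaler and the model in a pipeline means the scaler is refit on the training folds only in each round.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: 500 observations, 4 features, 2 classes
X_demo, y_demo = make_classification(n_samples=500, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

# scaling happens inside each fold, so the held-out fold never informs the scaler
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross validation: one accuracy score per held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores)
print(scores.mean(), scores.std())
```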
  
Typical assessments to report include:
  - misclassification error rate or, alternatively, accuracy
  - precision: what share of all true + false positives are true positives? That is, among everything predicted to be in this class, how many were right?
  - recall, aka sensitivity = true positive rate: what share of all true positives + false negatives are true positives? That is, among all the actual items in this class, how many were predicted correctly?
  - specificity = true negative rate: what share of all true negatives + false positives are true negatives?
  - $F_1$ score: an overall measure of accuracy: the harmonic mean of precision and recall
  - plot ROC curves: true positive rate vs false positive rate, and measure area under curve
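
The definitions above can be worked through by hand on a tiny binary example. The labels below are made up for illustration; the manual results are compared against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# hypothetical true labels and predictions (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# count the four cells of the confusion matrix
tp = ((y_true == 1) & (y_hat == 1)).sum()  # true positives
fp = ((y_true == 0) & (y_hat == 1)).sum()  # false positives
fn = ((y_true == 1) & (y_hat == 0)).sum()  # false negatives
tn = ((y_true == 0) & (y_hat == 0)).sum()  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # sensitivity, true positive rate
specificity = tn / (tn + fp)   # true negative rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, specificity, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```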
  
**Bias-variance** tradeoff: you want a model that both 1) captures the nuanced patterns of the training data and 2) generalizes well to new data. However, you typically cannot improve both at the same time: you must trade them off. Overfitting means high variance: your model is too sensitive to noise in the training data and may need regularization, which reduces model complexity. Underfitting means high bias: your model is too smooth and misses important details in the training data. Judicious feature selection, dimensionality reduction, and larger training sample sizes can reduce variance (overfitting). Adding additional predictors can reduce bias (underfitting).

For example, with logistic regression, you can adjust regularization with the *C* parameter, which is the inverse of L2 regularization/shrinkage (i.e., lower *C* values give you higher regularization). If your model need not be specified according to specific theory, but rather just needs the greatest predictive accuracy, consider using something like stepwise selection for feature selection.

In [None]:
# calculate misclassification error rate and accuracy
misclassified = (y_test != y_pred).sum()
error_rate = misclassified / len(y_test)
accuracy = 1 - error_rate
error_rate, accuracy

In [None]:
# how did the classifier perform?
# the report tells about the quality of its predictions
# support means how many of each class it saw
print(classification_report(y_test, y_pred))

In [None]:
# now it's your turn
# adjust the C parameter of the logistic regression model
# how does it impact our model's accuracy? why?


In [None]:
# helper function to visualize the model's decision surface
# fits model pairwise to just 2 features at a time and plots them
def plot_decision(X, y, feature_names, classifier):
    
    class_colors = {'high': 'r', 'mid': 'y', 'low': 'b'}
    class_ints = {'high': 0, 'mid': 1, 'low': 2}
    pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    fig, axes = plt.subplots(2, 3, figsize=(9, 6))
    for ax, pair in zip(axes.flat, pairs):
        
        # take the two corresponding features
        Xp = X[:, pair]
        x_min, x_max = Xp[:, 0].min() - 1, Xp[:, 0].max() + 1
        y_min, y_max = Xp[:, 1].min() - 1, Xp[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                             np.arange(y_min, y_max, 0.02))
        
        # fit model to the two features, predict for meshgrid points, then plot
        Z = classifier.fit(Xp, y).predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        for cat, i in class_ints.items():
            Z[np.where(Z==cat)] = i
        cs = ax.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.7)
    
        # scatter plot each class in same color as corresponding contour
        for cat, color in class_colors.items():
            idx = np.where(y == cat)
            ax.scatter(Xp[idx, 0], Xp[idx, 1], c=color, label=cat, s=1)

        ax.set_xlabel(feature_names[pair[0]])
        ax.set_ylabel(feature_names[pair[1]])
        ax.figure.tight_layout()        
    plt.legend()

In [None]:
# plot the model's decision surface
# fits model pairwise to just 2 features at a time and plots them
plot_decision(X_train, y_train, X.columns, mlr)

Look at the figure above. Are our classes linearly separable? What are the implications?

**Next steps**: to select the best model for your task, you should choose several algorithms and compare their performance against each other, evaluating and tuning them iteratively. Consider using a [hyperparameter optimization](https://scikit-learn.org/stable/modules/grid_search.html) technique.

  1. choose an appropriate learning algorithm for your task
  2. choose a key performance metric to evaluate
  3. choose a classification algorithm and an optimization algorithm
  4. train the model and evaluate its performance
  5. tune its hyperparameters to improve performance then re-assess
  
The rest of the lecture will investigate other candidate algorithms for our task and consider their strengths and limitations.

## 5. Decision Trees and Random Forests

The logistic regression models we saw earlier were parametric models: they learn a classification function with a fixed set of parameters. A **decision tree** is a nonparametric model that recursively partitions the feature space into boxes. A single tree has a tendency to overfit the data, but ensembles help counteract this. Ensemble techniques, which build many individual models then combine them, can be broadly divided into bagging and boosting methods:

  - bagging: train many models on different random bootstrap (ie, with replacement) samples of the data then average them ("bootstrap aggregating")
  - boosting: build ensemble incrementally by emphasizing the previously misclassified data points as you train each new model

A **random forest** is an ensemble learning method that trains many decision trees, each on a bootstrap sample of the data and with a random subset of features considered at each split, then aggregates their predictions (bagging). This helps correct for decision trees' overfitting the training data. Random forests tend to work well out of the box, handle nonlinearity well, and work well in very high dimension spaces. Note we don't have to standardize the data to train a decision tree or random forest.
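
To make the bagging idea concrete, here is a hand-rolled sketch: draw bootstrap samples, fit one tree per sample, then take a majority vote. It uses synthetic two-class data and hypothetical variable names, and it is a simplification rather than how `RandomForestClassifier` is implemented (scikit-learn's forest averages the trees' predicted class probabilities instead of counting hard votes).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# synthetic stand-in data with labels 0/1
X_demo, y_demo = make_classification(n_samples=600, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

n_trees = 25  # odd number so a two-class majority vote cannot tie
votes = np.zeros((n_trees, len(X_te)), dtype=int)
for t in range(n_trees):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # consider a random subset of features at each split, as a random forest does
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=t)
    votes[t] = tree.fit(X_tr[idx], y_tr[idx]).predict(X_te)

# majority vote across the trees
y_bagged = (votes.mean(axis=0) > 0.5).astype(int)

single_accuracy = (votes[0] == y_te).mean()
bagged_accuracy = (y_bagged == y_te).mean()
print(single_accuracy, bagged_accuracy)
```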

In [None]:
# train model on training data then use it to make predictions with test data
# set max_depth
dt = DecisionTreeClassifier(max_depth=3)
y_pred = dt.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface
plot_decision(X_train, y_train, X.columns, dt)

In [None]:
# now it's your turn
# don't prune the tree: set max_depth to None. what happens? why?


In [None]:
%%time
# train model on training data then use it to make predictions with test data
# use 1,000 decision trees and all available CPUs
rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
y_pred = rf.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface
plot_decision(X_train, y_train, X.columns, rf)

## 6. *k*-Nearest Neighbors

kNN is a nonlinear, nonparametric, lazy-learning model and represents an example of instance-based learning.

By "lazy" we mean that it does not learn a classification function, but rather just memorizes the entire training dataset to subsequently find nearest neighbors. It can require a lot of memory but works well with a small number of dimensions, though less well with high dimensionality: nearest-neighbor search is hard in high-dimension feature spaces because of the curse of dimensionality.
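
The prediction step can be sketched in a few lines of NumPy: compute the distance from a new point to every stored training point, take the *k* closest, and return their most common label. The `knn_predict` helper and the toy data are hypothetical, for illustration only.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # positions of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # most common label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]


# toy data: two well-separated groups
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array(['low', 'low', 'low', 'high', 'high', 'high'])

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

There is no training step beyond storing `X_train` and `y_train`, which is what "lazy" refers to above.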

In [None]:
# train model on training data then use it to make predictions with test data
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
y_pred = knn.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface
plot_decision(X_train, y_train, X.columns, knn)

## 7. Naïve Bayes

Naïve Bayes is a high bias/low variance classifier that is less likely to overfit small training datasets than a low bias/high variance classifier is (such as kNN or logistic regression). It is a simple algorithm and converges quickly but strongly assumes that the features are conditionally independent of each other given the class (hence, naïve).
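
The independence assumption is what makes the calculation simple: the class score is the log prior plus a sum of per-feature Gaussian log densities. The sketch below computes this by hand on synthetic data and compares it with `GaussianNB`. All variable names are hypothetical, and tiny differences from scikit-learn are possible because it adds a small variance-smoothing term.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

classes = np.unique(y_demo)
# class priors, and per-class mean and variance of each feature
log_prior = np.log([np.mean(y_demo == c) for c in classes])
mu = np.array([X_demo[y_demo == c].mean(axis=0) for c in classes])
var = np.array([X_demo[y_demo == c].var(axis=0) for c in classes])

# Gaussian log density of every feature under every class, summed across
# features (this sum is the "naive" conditional independence assumption)
diff = X_demo[:, None, :] - mu[None, :, :]
log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=2)

# pick the class with the highest log prior + log likelihood
manual_pred = classes[(log_prior + log_lik).argmax(axis=1)]

sk_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
agreement = np.mean(manual_pred == sk_pred)
print(agreement)
```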

In [None]:
# train model on training data then use it to make predictions with test data
gnb = GaussianNB(priors=None)
y_pred = gnb.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface
plot_decision(X_train, y_train, X.columns, gnb)

## 8. Perceptron

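A perceptron is a simple linear, binary, parametric classifier. It is a very simple single-layer neural network: too simple to be useful for many real-world tasks. scikit-learn's `Perceptron` handles our three classes by fitting one binary perceptron per class (one-vs-rest).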
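
For intuition, here is a minimal sketch of the classic perceptron learning rule for two classes coded as -1 and +1: whenever a point is misclassified, nudge the weights toward it. The `train_perceptron` function and toy data are hypothetical, and this is not scikit-learn's implementation.

```python
import numpy as np


def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Fit weights w and bias b with the perceptron update rule (y in {-1, +1})."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # a point is misclassified if it falls on the wrong side of the hyperplane
            if yi * (np.dot(w, xi) + b) <= 0:
                w += eta * yi * xi
                b += eta * yi
                errors += 1
        if errors == 0:  # converged: every training point is classified correctly
            break
    return w, b


# toy linearly separable data
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 3.0],
                  [-1.0, -2.0], [-2.0, -1.0], [-3.0, -3.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])

w, b = train_perceptron(X_toy, y_toy)
pred = np.where(X_toy @ w + b > 0, 1, -1)
print(w, b, pred)
```

The loop only stops early when the classes are linearly separable. Otherwise it keeps updating until `n_epochs` runs out, which is one reason the perceptron is limited in practice.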

In [None]:
# train model on training data then use it to make predictions with test data
ppn = Perceptron(eta0=1)
y_pred = ppn.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface
plot_decision(X_train, y_train, X.columns, ppn)

## 9. Support Vector Machines

SVMs are models that extend the perceptron to find an optimal hyperplane providing the largest margin (ie, separation) between the classes of training data (the training points that lie closest to the hyperplane, and thus determine the margin, are the "support vectors"). In other words, SVM finds the hyperplane that maximizes the distance between it and the nearest data point on either side of it. SVMs can classify data linearly/parametrically or, using the [kernel trick](https://en.wikipedia.org/wiki/Kernel_method), nonlinearly/nonparametrically if the classes are not linearly separable.

An SVM with a linear kernel is very similar to logistic regression. But an SVM might be a better choice than logistic regression if the problem is not linearly separable or has a high-dimensional feature space. Choosing the right kernel can be challenging and the results are not straightforwardly explainable. SVMs can also be very inefficient to train, so they are not a good choice for large training data sets.

Tuning the SVM's hyperparameters is critical! Here we fit an untuned model as a quick demo, but you should run something like [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for tuning.
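
A grid search over `C` and `gamma` might look like the sketch below. It runs on synthetic data so that it is self-contained, and the grid values are arbitrary starting points rather than recommendations. The `svc__` prefix refers to the step name that `make_pipeline` assigns to the `SVC` estimator.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

# scale inside the pipeline so each cross-validation fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# candidate hyperparameter values (arbitrary illustrative grid)
param_grid = {'svc__C': [0.1, 1, 10],
              'svc__gamma': [0.01, 0.1, 1]}

# try every combination with 5-fold cross validation
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```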

In [None]:
# train model on training data then use it to make predictions with test data
# train the linear SVM (namely, support vector classification)
svc = SVC(kernel='linear', C=1)
y_pred = svc.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

### SVMs for nonlinear classification with a kernel function

We can turn a linear SVM model into a nonlinear model by using the *kernel trick* to operate in a higher-dimension feature space. In this example, we will use the radial basis function, aka Gaussian kernel.
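
The RBF kernel measures similarity between two points as $K(x, z) = e^{-\gamma \lVert x - z \rVert^2}$. The sketch below computes it by hand for two made-up points and checks it against scikit-learn's `rbf_kernel`. The `rbf` helper is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])

# similarity falls off much faster with distance when gamma is large
k_small_gamma = rbf(x, z, gamma=0.2)
k_large_gamma = rbf(x, z, gamma=10)
print(k_small_gamma, k_large_gamma)

# same value from scikit-learn's pairwise implementation
k_sklearn = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=0.2)[0, 0]
print(k_sklearn)
```

With a large `gamma`, each training point only influences its immediate neighborhood, so the decision boundary can wrap tightly around individual points.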

In [None]:
# train model on training data then use it to make predictions with test data
svc_kt = SVC(kernel='rbf', gamma=0.2, C=1)
y_pred = svc_kt.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface: this is slow
plot_decision(X_train, y_train, X.columns, svc_kt)

In [None]:
# now it's your turn
# try a polynomial kernel. how does that impact model performance?


Higher gamma parameter values lead to tighter decision boundaries.

In [None]:
# train model on training data then use it to make predictions with test data
svc_kt2 = SVC(kernel='rbf', gamma=10, C=1)
y_pred = svc_kt2.fit(X_train, y_train).predict(X_test)

In [None]:
# how did the classifier perform?
print(classification_report(y_test, y_pred))

In [None]:
# plot the model's decision surface: this is slow
plot_decision(X_train, y_train, X.columns, svc_kt2)

Here we see poor generalization because the model was overfitted with that high gamma value: the training overemphasized small fluctuations in the training data.

## Self-paced exercise

Scroll back up to the steps at the bottom of the "model assessment" section. Work through those tasks, considering how to improve your model's performance through better feature selection/extraction, hyperparameter optimization, training, and testing.