# Lecture 13: ML Examples and Ethics

## What features distinguish a house in New York from a house in San Francisco?

### First, some intuition

Let's say you had to determine whether a home is in San Francisco or in New York. In machine learning terms, categorizing data points is a **classification task**.

* San Francisco is hilly... so elevation may be a helpful feature
* With the data here, homes >~73m should be classified as San Francisco homes
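As a minimal sketch, the elevation rule above can be written as a single if-then statement. The function name and the sample elevations are hypothetical; the 73 m threshold is the rough guess from these notes, not a fitted value.

```python
# Rough threshold from the notes (~73 m); an eyeballed guess, not a fitted value.
ELEVATION_THRESHOLD_M = 73.0


def classify_by_elevation(elevation_m: float) -> str:
    """Label a home using one if-then rule on elevation."""
    if elevation_m > ELEVATION_THRESHOLD_M:
        return "San Francisco"
    return "New York"


# Made-up elevations (in meters), for illustration only.
for elevation in [5.0, 73.0, 120.0]:
    print(elevation, classify_by_elevation(elevation))
```

Every home at or below the threshold falls through to "New York", including any low-lying San Francisco home, which is why one feature alone is not enough.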

### Adding nuance

Adding another **dimension** allows for more nuance. For example, New York apartments can be extremely expensive per square foot.

So visualizing elevation and price per square foot in a **scatterplot** helps us distinguish lower elevation homes.

The data suggests that, among homes at or below 73 meters, those that cost more than $19,116.7 per square meter are in New York City.

Dimensions in a data set are called **features**, **predictors**, or **variables**.

* Elevation isn't a perfect feature for classification, so we can look at its relationship to other features, like *price per square foot*

### Drawing boundaries

Boundaries can be drawn so that if a house falls in the *green box*, it's classified as a San Francisco home. Blue box, New York. Statistical learning figures out how to best draw these boxes.

Our training set will use 7 different **features**. A **scatterplot matrix** shows the relationships between each pair of features. Patterns are clear, but boundaries for delineation are not obvious.

### And now, machine learning

Determining the best boundary is where **machine learning** comes in.

**Decision trees** are one example of a machine learning method for classification tasks.

### Finding better boundaries

We guessed ~73m before. Let's improve on that guess...

A **histogram** helps display frequency of homes by elevation more easily.

The highest home in New York sits at ~73m, but most New York homes are at lower elevations.

### Your first fork

A decision tree uses if-then statements to define patterns in the data.

In machine learning, the splits are called **forks** and they split the data into branches based on some value.

The value that splits the branches is the **split point**. Homes to the left get categorized differently than those on the right.

### Tradeoffs

Splitting at ~73m incorrectly classifies some San Francisco homes as New York homes.

San Francisco homes that were misclassified are **false negatives**.

If you split to capture *every* home in San Francisco, you'll also get a bunch of New York homes (**false positives**).
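The tradeoff can be sketched by counting both kinds of error at two different split points. The data below are made up for illustration (not the lecture's data set), and San Francisco is treated as the positive class, matching the definitions above.

```python
# Made-up (elevation_m, city) pairs for illustration only.
homes = [
    (4.0, "NY"), (10.0, "NY"), (30.0, "NY"), (73.0, "NY"),
    (8.0, "SF"), (45.0, "SF"), (90.0, "SF"), (150.0, "SF"),
]


def count_errors(homes, split_point):
    """Predict SF above the split point and NY otherwise; SF is the positive class."""
    false_negatives = 0  # SF homes predicted as NY
    false_positives = 0  # NY homes predicted as SF
    for elevation, city in homes:
        predicted = "SF" if elevation > split_point else "NY"
        if city == "SF" and predicted == "NY":
            false_negatives += 1
        elif city == "NY" and predicted == "SF":
            false_positives += 1
    return false_negatives, false_positives


high_split = count_errors(homes, 73.0)  # split at the highest NY home
low_split = count_errors(homes, 5.0)    # split low enough to catch every SF home
print("split at 73 m (FN, FP):", high_split)
print("split at 5 m  (FN, FP):", low_split)
```

With this toy data, the high split misses two San Francisco homes and the low split sweeps in three New York homes: lowering one kind of error raises the other.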

### The best split

The best split point aims for branches that are as homogeneous (pure) as possible.
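One common way to score purity is Gini impurity. The notes do not say which measure the example uses, so treat the choice as an assumption. The sketch below tries each midpoint between neighboring values and keeps the split with the lowest weighted impurity; the function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure branch, 0.5 for a 50/50 two-class mix."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(values, labels):
    """Try midpoints between sorted unique values.

    Returns (split_point, weighted_impurity) for the purest pair of branches.
    """
    candidates = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(candidates, candidates[1:]):
        split = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= split]
        right = [lab for v, lab in zip(values, labels) if v > split]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (split, weighted)
    return best


# Made-up elevations (meters) and cities, chosen so a perfect split exists.
elevations = [4.0, 10.0, 30.0, 73.0, 90.0, 150.0]
cities = ["NY", "NY", "NY", "NY", "SF", "SF"]
split_point, impurity = best_split(elevations, cities)
print(split_point, impurity)
```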

### Recursion

Additional split points are determined through repetition (recursion).
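A rough sketch of that recursion: pick the best fork, then run the same procedure on each branch until a branch is pure or a depth limit is reached. Gini impurity, the `max_depth` parameter, the dictionary layout, and the data are all assumptions made for illustration, not the lecture's implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0.0 means the branch is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, depth=0, max_depth=3):
    """Return a leaf (a label) or a fork (a dict holding two sub-trees)."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    if best is None:  # identical rows cannot be separated
        return majority(labels)
    _, feature, threshold, left, right = best

    def grow(indices):
        # Recursion: the same procedure runs again on one branch's homes.
        return build_tree([rows[i] for i in indices],
                          [labels[i] for i in indices], depth + 1, max_depth)

    return {"feature": feature, "threshold": threshold,
            "left": grow(left), "right": grow(right)}


def predict(tree, row):
    """Follow forks until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Made-up rows: (elevation_m, price_per_sqft). No single fork separates them.
rows = [(5, 2000), (30, 1900), (10, 600), (40, 700), (90, 2100), (150, 1800)]
labels = ["NY", "NY", "SF", "SF", "SF", "SF"]
tree = build_tree(rows, labels)
print(tree)
```

Because neither feature separates this toy data on its own, the tree needs a second fork inside one of the first fork's branches.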

### Growing a tree

Additional forks add new information to improve **prediction accuracy**.

Adding several more layers gets our example model accuracy up to 96%.

It's possible to add branches until your model is **100% accurate** on the training data.

### Making predictions

The decision tree **model** can then predict which homes are in which city.

Here, we're using the **training data**.

Because our tree was trained on this data and we grew the tree to 100% accuracy, each house is perfectly sorted.

### Reality check

But... how does this tree do on the data that the model hasn't seen before?

The *test set* then makes its way through the decision tree.

Ideally the tree should perform similarly on both known and unknown data.

When it doesn't, the extra errors on the test set are due to **overfitting**. Fitting every single detail of the training data produced a tree that modeled unimportant features, so it could not reach similar accuracy on new data.

### Recap

1. Machine learning identifies patterns using **statistical learning** and computers by unearthing **boundaries** in data sets. You can use it to make predictions.
2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
3. **Overfitting** happens when some boundaries are based on *distinctions that don't make a difference*. You can see if a model overfits by having test data flow through the model.

## What Can Be Done About Overfitting

### Bias-Variance Tradeoff

* **High variance** models make mistakes in *inconsistent* ways
* **Biased models** tend to be overly simple and not reflect reality
* What to do:
    * Consider tuning parameters in the model
        * can avoid overfitting by setting minimum node size threshold (fewer splits, variance decreased)
    * Changing model approach
        * Bagging, boosting, and ensemble methods
    * Reconsider data splitting approach
        * Training + test?
        * LOOCV
        * K-fold CV

### Can we determine what that function (`f`) *is* using these data?

* $y = f(x) + noise$
    * Linear regression
    * Quadratic regression
    * Piecewise linear nonparametric regression
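Written out, the three candidate forms for $f$ look roughly like this. The coefficient symbols $\beta_j$ and the sorted training points $(x_i, y_i)$ are notation assumed here, and the third form takes "piecewise linear nonparametric regression" to mean joining adjacent training points with straight segments.

$$
\begin{aligned}
\text{Linear:} \quad & f(x) = \beta_0 + \beta_1 x \\
\text{Quadratic:} \quad & f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 \\
\text{Piecewise linear:} \quad & f(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i}\,(x - x_i) \quad \text{for } x_i \le x \le x_{i+1}
\end{aligned}
$$

The linear and quadratic forms have a fixed number of parameters to estimate, while the piecewise form passes through every training point, which makes it the most flexible of the three and the most prone to fitting noise.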

## The Data Partition Method

1. Randomly choose 30% of the data to be in a **test set**
2. The remainder is a **training set**

### Train the model on your **training set**

3. Perform your regression on the training set

### Assess future performance using the **test set**

4. Estimate your future performance with the test set
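Steps 1 through 4 can be sketched with a simple linear fit. The synthetic data, the fixed random seed, the helper names, and the use of mean squared error as the score are all assumptions made for illustration.

```python
import random


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x; returns a predict function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(predict, points):
    """Mean squared error of a predict function on (x, y) points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so the split is reproducible
# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for illustration).
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(20)]

shuffled = data[:]
rng.shuffle(shuffled)
n_test = round(0.3 * len(shuffled))   # step 1: randomly chosen 30% test set
test_set = shuffled[:n_test]
training_set = shuffled[n_test:]      # step 2: the remainder is the training set

model = fit_line(training_set)        # step 3: regression on the training set
test_mse = mse(model, test_set)       # step 4: estimate of future performance
print(len(training_set), len(test_set), test_mse)
```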

### Go through this process for each possible model

### Pros and Cons of Data Partitioning

* Pros
    * Simple approach
    * can choose model with best test-set score
* Cons
    * Model fit on 30% less data than you have
    * Without a large data set, removing 30% of the data could bias prediction


## Leave-One-Out Cross Validation (LOOCV)

For $k=1$ to $R$:

1. Let $(x_k, y_k)$ be the $k^{th}$ record
2. Temporarily remove $(x_k, y_k)$ from the dataset
3. Train on the remaining $R-1$ datapoints
4. Note your error on $(x_k, y_k)$

When you've done all the points, report the mean error.
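The loop above might look like this with a linear fit as the model. The helper names and the data are hypothetical, and squared error is assumed as the error measure since the notes do not specify one.

```python
def fit_line(points):
    """Least-squares fit of y = intercept + slope * x; returns a predict function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return lambda x: (mean_y - slope * mean_x) + slope * x


def loocv_mse(points):
    """Leave each record out once, train on the rest, average the squared errors."""
    errors = []
    for k in range(len(points)):
        held_out_x, held_out_y = points[k]        # steps 1-2: set record k aside
        remaining = points[:k] + points[k + 1:]   # the other R - 1 records
        predict = fit_line(remaining)             # step 3: train without record k
        errors.append((predict(held_out_x) - held_out_y) ** 2)  # step 4
    return sum(errors) / len(errors)


# Made-up data that roughly follows y = 2x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]
print(loocv_mse(data))
```

The model is refit once per record, so the cost grows with the size of the data set, which is the computational expense noted in the comparison that follows.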

### Method Comparison

**Data Partitioning**

* Pros: Cheap
* Cons: Variance, unreliable estimate of future performance

**LOOCV**

* Pros: Uses all your data
* Cons: Computationally expensive, has weird behavior

## k-Fold Cross Validation

* For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.
* For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.
* For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.
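A minimal sketch of the same procedure with $k = 3$, run once per candidate model so the scores can be compared. The two candidate models (a mean-only baseline and a linear fit), the synthetic data, and the function names are assumptions for illustration; they are not the models or data from the worked example.

```python
import random


def fit_mean(points):
    """Baseline model: always predict the mean of the training y values."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x; returns a predict function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return lambda x: (mean_y - slope * mean_x) + slope * x


def k_fold_mse(points, fit, k=3, seed=0):
    """Hold out each of k partitions in turn; return the mean squared error."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal partitions
    squared_errors = []
    for i, held_out in enumerate(folds):
        # Train on every point that is not in the held-out partition.
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        predict = fit(training)
        squared_errors.extend((predict(x) - y) ** 2 for x, y in held_out)
    return sum(squared_errors) / len(squared_errors)


rng = random.Random(1)
# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for illustration).
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(30)]

scores = {
    "mean only": k_fold_mse(data, fit_mean),
    "linear": k_fold_mse(data, fit_line),
}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

Choosing the model with the lowest cross-validated error is the same logic the clicker question below applies to the $MSE_{3fold}$ values.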

**CLICKER QUESTION**

Given the example we just worked, how would you model these data?

A) Linear regression ($MSE_{3fold} = 2.05$)

**B) Quadratic regression ($MSE_{3fold} = 1.11$)**

C) Piecewise linear nonparametric regression ($MSE_{3fold} = 2.93$)

**CLICKER QUESTION**

Which approach would you use to limit overfitting?

A) Data partitioning

B) LOOCV

C) k-fold CV

## Predictive Analysis Ethics

When models are trained on historical data, predictions will perpetuate historical biases.

### What to do about bias

1. Anticipate and plan for potential biases before model generation. Check for biases after.
2. Have diverse teams.
3. Use machine learning to improve lives rather than for punitive purposes.
4. Revisit your models. Update your algorithms.
5. You are responsible for the models you put out into the world, unintended consequences and all.

## Discussed so far...

* Data partitioning
* Feature selection
* Supervised and unsupervised machine learning
    * Continuous variables: regression (supervised) and dimensionality reduction (unsupervised)
    * Categorical variables: classification (supervised, decision trees) or clustering (unsupervised)
* Model assessment
    * Continuous: RMSE (and Accuracy)
    * Categorical: Accuracy, Sensitivity, Specificity, AUC
* Biased data can and will lead to biased predictions
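For reference, most of the assessment metrics listed above can be computed in a few lines. This is a sketch with made-up labels and hypothetical function names; AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def classification_metrics(actual, predicted, positive):
    """Accuracy, sensitivity, and specificity for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of positives correctly found
        "specificity": tn / (tn + fp),  # share of negatives correctly found
    }


def rmse(actual, predicted):
    """Root mean squared error for continuous predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up labels, with SF treated as the positive class.
actual = ["SF", "SF", "SF", "SF", "NY", "NY", "NY", "NY"]
predicted = ["SF", "SF", "SF", "NY", "NY", "NY", "SF", "SF"]
metrics = classification_metrics(actual, predicted, positive="SF")
print(metrics)
```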

### Prediction Approach

Which would be the most predictive of your future success?

A) Grade in COGS 108

B) COGS 108 attendance

C) Gender

D) Hair color

E) Something else

N = 254

Train the model: N = 178 (70% of the data)

Test the model: N = 76 (30% of the data) - predicted success in test set

Assess the prediction model.

#### Think about whether the models you're building should even be built.

### Predictive algorithms should (*at a minimum*) be FAT

* Fair: Lacking biases which create unfair and discriminatory outcomes
    * For whom does this algorithm fail?
    * Steps to take:
        1. Verify data about individual is correct
        2. Carry out "sensitivity test"
* Accountable/Accurate: Answerable to the people subject to them
    * Correct data used? Is there a mechanism for appeal?
* Transparent: Open about how and why particular decisions were made
    * Think carefully about what transparency is (handing over source code likely isn't the answer)

### *A Mulching Proposal: Analyzing and Improving an Algorithmic System for Turning the Elderly into High-Nutrient Slurry* (Keyes et al., 2019)

https://dl.acm.org/doi/10.1145/3290607.3310433

* Fair: Equally considers all elderly individuals
* Accurate: Pre = Has mechanism for appeal; Post = Compensation
* Transparent: Website with all features, testable

Checklists are helpful, but they're not an excuse for thoughtlessness.