**ATOC4500 Data Science Lab - Lecture #7 Notebook** 

*last updated: February 26, 2022* 

*Written by Prof. Kay (Jennifer.E.Kay@colorado.edu), based on examples in [Data Science from Scratch by Joel Grus](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/machine_learning.py) and [Scikit-learn](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html)*

In [1]:
## Load Python Packages

import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.gridspec import GridSpec

import random
from typing import TypeVar, List, Tuple

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

##**Example of generating a data variable and then performing a "Train, Test Split"**##

*Look at the code below to analyze the data variable X.  How is this data variable generated? How much of the data variable is put into the training dataset? How much of the data variable is put into the test dataset? How is the function "assert" being used to test the code?*

In [2]:
#import random
#from typing import TypeVar, List, Tuple
X = TypeVar('X')  # generic type to represent a data point

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
    """Split data into fractions [prob, 1 - prob]"""
    data = data[:]                    # Make a shallow copy
    random.shuffle(data)              # because shuffle modifies the list.
    cut = int(len(data) * prob)       # Use prob to find a cutoff
    return data[:cut], data[cut:]     # and split the shuffled list there.

data = [n for n in range(1000)]
train, test = split_data(data, 0.75)

# The proportions should be correct
assert len(train) == 750
assert len(test) == 250

# And the original data should be preserved (in some order)
assert sorted(train + test) == data

Y = TypeVar('Y')  # generic type to represent output variables

def train_test_split(xs: List[X],
                     ys: List[Y],
                     test_pct: float) -> Tuple[List[X], List[X], List[Y], List[Y]]:
    # Generate the indices and split them.
    idxs = [i for i in range(len(xs))]
    train_idxs, test_idxs = split_data(idxs, 1 - test_pct)

    return ([xs[i] for i in train_idxs],  # x_train
            [xs[i] for i in test_idxs],   # x_test
            [ys[i] for i in train_idxs],  # y_train
            [ys[i] for i in test_idxs])   # y_test

xs = [x for x in range(1000)]  # xs are 0 ... 999
ys = [2 * x for x in xs]       # each y_i is twice x_i
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.25)

# Check that the proportions are correct
assert len(x_train) == len(y_train) == 750
assert len(x_test) == len(y_test) == 250

# Check that the corresponding data points are paired correctly.
assert all(y == 2 * x for x, y in zip(x_train, y_train))
assert all(y == 2 * x for x, y in zip(x_test, y_test))
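
Because `split_data` relies on `random.shuffle`, each run produces a different split. The sketch below shows one possible way to make the split reproducible so the same train and test sets come back every time. The helper name `split_data_seeded` and its `seed` parameter are hypothetical additions for illustration; they are not part of the original notebook.

```python
import random
from typing import List, Tuple, TypeVar

T = TypeVar("T")  # generic type to represent a data point


def split_data_seeded(data: List[T], prob: float, seed: int) -> Tuple[List[T], List[T]]:
    """Split data into fractions [prob, 1 - prob] using a private, seeded generator."""
    rng = random.Random(seed)          # independent generator; global random state is untouched
    shuffled = data[:]                 # shallow copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prob)    # index where the training portion ends
    return shuffled[:cut], shuffled[cut:]


data = list(range(1000))
train_a, test_a = split_data_seeded(data, 0.75, seed=42)
train_b, test_b = split_data_seeded(data, 0.75, seed=42)

# The same seed gives the same split, and no point lands in both sets.
assert train_a == train_b and test_a == test_b
assert set(train_a).isdisjoint(test_a)
```

Using a separate `random.Random` instance, rather than calling `random.seed`, keeps the reproducibility local to the split and leaves any other random draws in the notebook unaffected.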

##**Example of overfitting vs. underfitting using a polynomial**##

*Look at the code below to analyze the data variable X.  How is this data variable generated? How many samples are drawn, and how much noise is added to the true function? How do the polynomial fits of degree 1, 4, and 15 compare with the true function, and which of them underfit or overfit? How is cross-validation being used to evaluate each model?*

In [3]:
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]

# True function that the noisy samples are drawn from.
# It must be defined before it is used below.
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    # Fit a polynomial of the given degree with ordinary least squares
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using crossvalidation
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )
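
The call to `cross_val_score` with `cv=10` scores each pipeline on ten held-out folds rather than on a single test set. The sketch below illustrates the idea using plain Python index lists. `k_fold_indices` is a hypothetical helper written for this note, not scikit-learn's implementation, and it assumes contiguous, unshuffled folds.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        # Spread any remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


# Same sizes as the polynomial example: 30 samples, 10 folds.
folds = k_fold_indices(n_samples=30, n_folds=10)
for train_idx, test_idx in folds[:3]:
    print(f"train on {len(train_idx)} samples, hold out indices {test_idx}")
```

With 30 samples and 10 folds, each model is fit on 27 points and scored on the remaining 3. The MSE in each plot title is the mean of those ten fold scores, and the value in parentheses is their standard deviation.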

##**Assessing Outcomes: Example of calculating accuracy, precision, recall**##

*Predict outcome Y if and only if X.  For example, predict outcome of the last name Smith (Y) if this person's occupation is farming (X).*

*Note: 8% of Americans have the last name Smith, 2% of Americans are farmers.*
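
The confusion-matrix counts for this rule can be worked out from the two percentages above. The sketch below assumes a population of 10,000 people and assumes that last name and occupation are independent; both assumptions, and all variable names, are introduced here for illustration.

```python
population = 10_000   # assumed population size
p_smith = 0.08        # fraction with the last name Smith
p_farmer = 0.02       # fraction who are farmers

farmers = round(population * p_farmer)   # people the rule predicts are Smith
smiths = round(population * p_smith)     # people who actually are Smith

# Under independence, 8% of farmers are also named Smith.
tp = round(farmers * p_smith)            # farmer and Smith
fp = farmers - tp                        # farmer, not Smith
fn = smiths - tp                         # Smith, not a farmer
tn = population - tp - fp - fn           # neither

print(f"tp={tp}, fp={fp}, fn={fn}, tn={tn}")
```

Under these assumptions the four counts sum to the full population, and they can be passed directly to functions that take `tp`, `fp`, `fn`, and `tn`.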

In [None]:
# --------------------------- Define Functions -----------------------

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

#### Data for Confusion Matrix
true_positive=16   ##is a farmer (predict last name Smith), last name is Smith
false_positive=184 ##is a farmer (predict last name Smith), last name is NOT Smith
true_negative=9016 ##is NOT a farmer (predict last name NOT Smith), last name is NOT Smith
false_negative=784 ##is NOT a farmer (predict last name NOT Smith), last name is Smith

calc_accuracy=accuracy(true_positive, false_positive, false_negative, true_negative)
print(f'the accuracy is: {calc_accuracy*100} %')

calc_precision=precision(true_positive, false_positive, false_negative, true_negative)
print(f'the precision is: {calc_precision*100} %')

calc_recall=recall(true_positive, false_positive, false_negative, true_negative)
print(f'the recall is: {calc_recall*100} %')
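
To put the accuracy value in context, it can help to compare it against a trivial baseline. The sketch below compares the farmer rule with a hypothetical classifier that always predicts "not Smith", using the same population of 10,000 as the counts above. The baseline and the variable names are assumptions added for illustration.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Rule from the example: predict Smith if and only if the person is a farmer.
farmer_rule = accuracy(tp=16, fp=184, fn=784, tn=9016)

# Baseline: never predict Smith, so all 800 Smiths become false negatives.
always_negative = accuracy(tp=0, fp=0, fn=800, tn=9200)

print(f"farmer rule accuracy:        {farmer_rule:.2%}")
print(f"always 'not Smith' accuracy: {always_negative:.2%}")
```

The baseline scores higher on accuracy even though it never identifies a single Smith, which is why precision and recall are reported alongside accuracy. Precision is undefined for the baseline because it makes no positive predictions (`tp + fp` is 0), so a function that computes `tp / (tp + fp)` would raise a `ZeroDivisionError` for it.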

## WRITE NOTES HERE:##