## Overfitting and Underfitting

## Overfitting
    Producing a model that performs well on the data you train it on but generalizes poorly to new data. This could involve learning noise in the data, or it could involve learning to identify specific inputs rather than whatever factors are actually predictive of the desired output.
    - When your data has too many features, it is easy to overfit.

## Underfitting
    Producing a model that doesn’t perform well even on the training data. Typically when this happens you decide your model isn’t good enough and keep looking for a better one.
    - When your data doesn’t have enough features, your model is likely to underfit.
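As a rough illustration of both failure modes, the sketch below compares three hypothetical models on made-up data where the label is simply whether `x` is odd. The data, the model names (`memorizer`, `constant`, `parity_rule`), and the helper `fraction_correct` are assumptions for this example only, and the split is a plain "first 150 versus the rest" so the numbers are deterministic.

```python
from typing import Callable, Dict, List, Tuple

# Made-up data: the label is 1 when x is odd, 0 when x is even.
points: List[Tuple[int, int]] = [(x, x % 2) for x in range(200)]
train_pts, test_pts = points[:150], points[150:]

def fraction_correct(predict: Callable[[int], int],
                     pts: List[Tuple[int, int]]) -> float:
    """Fraction of points whose label the model predicts correctly."""
    return sum(predict(x) == y for x, y in pts) / len(pts)

# Overfits: memorizes each training input and guesses 0 for anything unseen.
memory: Dict[int, int] = dict(train_pts)

def memorizer(x: int) -> int:
    return memory.get(x, 0)

# Underfits: ignores the input entirely.
def constant(x: int) -> int:
    return 0

# Captures the factor that actually predicts the label.
def parity_rule(x: int) -> int:
    return x % 2

for name, model in [("memorizer", memorizer),
                    ("constant", constant),
                    ("parity_rule", parity_rule)]:
    print(name,
          fraction_correct(model, train_pts),
          fraction_correct(model, test_pts))
```

The memorizer is perfect on the training points but no better than the constant guess on the held-out points, which is the signature of overfitting. The constant model is poor on both, which is the signature of underfitting.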


## Train and Test

In [1]:
import random
from typing import TypeVar, List, Tuple
X = TypeVar('X')  # generic type to represent a data point

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
    """Split data into fractions [prob, 1 - prob]"""
    data = data[:]                    # Make a shallow copy
    random.shuffle(data)              # because shuffle modifies the list.
    cut = int(len(data) * prob)       # Use prob to find a cutoff
    return data[:cut], data[cut:]     # and split the shuffled list there.

data = [n for n in range(1000)]
train, test = split_data(data, 0.75)

In [2]:
Y = TypeVar('Y')  # generic type to represent output variables

def train_test_split(xs: List[X],
                     ys: List[Y],
                     test_pct: float) -> Tuple[List[X], List[X], List[Y], List[Y]]:
    # Generate the indices and split them, so each x stays paired with its y.
    idxs = [i for i in range(len(xs))]
    train_idxs, test_idxs = split_data(idxs, 1 - test_pct)

    return ([xs[i] for i in train_idxs],  # x_train
            [xs[i] for i in test_idxs],   # x_test
            [ys[i] for i in train_idxs],  # y_train
            [ys[i] for i in test_idxs])   # y_test

xs = [x for x in range(1000)]
ys = [2 * x for x in xs]  # each y_i is twice x_i

x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.25)
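A few sanity checks can confirm that a split like this has the right proportions and keeps each x paired with its y. The sketch below is one way to do that. It uses a hypothetical helper, `seeded_split_indices`, which is not part of the code above and which shuffles with a private seeded generator so the split is repeatable from run to run.

```python
import random
from typing import List, Tuple

def seeded_split_indices(n: int, test_pct: float,
                         seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 with a seeded generator and split them."""
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)   # leaves the global random state alone
    cut = int(n * (1 - test_pct))
    return idxs[:cut], idxs[cut:]

xs = list(range(1000))
ys = [2 * x for x in xs]

train_idxs, test_idxs = seeded_split_indices(len(xs), 0.25, seed=12)
x_train = [xs[i] for i in train_idxs]
y_train = [ys[i] for i in train_idxs]
x_test = [xs[i] for i in test_idxs]
y_test = [ys[i] for i in test_idxs]

# The proportions are as requested.
assert len(x_train) == len(y_train) == 750
assert len(x_test) == len(y_test) == 250

# Every x is still paired with its own y.
assert all(y == 2 * x for x, y in zip(x_train, y_train))
assert all(y == 2 * x for x, y in zip(x_test, y_test))

# No point is lost or duplicated across the two sets.
assert sorted(train_idxs + test_idxs) == list(range(1000))
```

Splitting indices rather than the data itself is what keeps the x and y lists aligned. The fixed seed is a convenience for debugging, not a requirement.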

### True positive
    “This message is spam, and we correctly predicted spam.”
### False positive (Type 1 error)
    “This message is not spam, but we predicted spam.”
### False negative (Type 2 error)
    “This message is spam, but we predicted not spam.”
### True negative
    “This message is not spam, and we correctly predicted not spam.”

![Screenshot%202021-12-03%20at%208.32.11%20PM.png](attachment:Screenshot%202021-12-03%20at%208.32.11%20PM.png)
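To make the four outcomes concrete, here is a minimal sketch that tallies them from paired lists of actual and predicted labels. The function name `confusion_counts` and the six example labels are made up for illustration. The return order `(tp, fp, fn, tn)` is chosen to match the argument order of the metric functions in these notes.

```python
from typing import List, Tuple

def confusion_counts(actual: List[bool],
                     predicted: List[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for paired actual and predicted labels."""
    tp = fp = fn = tn = 0
    for is_spam, said_spam in zip(actual, predicted):
        if said_spam and is_spam:
            tp += 1          # spam, predicted spam
        elif said_spam and not is_spam:
            fp += 1          # not spam, predicted spam
        elif not said_spam and is_spam:
            fn += 1          # spam, predicted not spam
        else:
            tn += 1          # not spam, predicted not spam
    return tp, fp, fn, tn

# Hypothetical labels for six messages (True means spam).
actual    = [True, True,  False, False, True, False]
predicted = [True, False, True,  False, True, False]

print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```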

## Accuracy
    Accuracy is defined as the fraction of correct predictions.

In [9]:
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

accuracy(70, 4930, 13930, 981070)

0.98114
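A high accuracy can be misleading when positives are rare. As a rough check, the sketch below compares the counts above with a hypothetical baseline that never predicts positive. The baseline is an assumption added for illustration. Its counts follow from the same totals: 14,000 actual positives all become false negatives, and 986,000 actual negatives all become true negatives.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)

# Counts from the example above.
model_accuracy = accuracy(70, 4930, 13930, 981070)

# Hypothetical baseline: always predict "negative".
actual_positives = 70 + 13930      # tp + fn
actual_negatives = 4930 + 981070   # fp + tn
baseline_accuracy = accuracy(0, 0, actual_positives, actual_negatives)

print(model_accuracy, baseline_accuracy)
```

The baseline scores 0.986 against the model's 0.98114, even though it never identifies a single positive. This is why accuracy alone says little about a model on imbalanced data.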

## Precision
    Measures how accurate our positive predictions are.

In [10]:
def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

precision(70, 4930, 13930, 981070)

0.014

## Recall
    Measures what fraction of the actual positives our model identifies.

In [12]:
# Fraction of the actual positives that the model identifies

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

recall(70, 4930, 13930, 981070)

0.005

## Precision and Recall Combine into the $F1$ Score

In [15]:
def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)

    return 2 * p * r / (p + r)

f1_score(70, 4930, 13930, 981070)

0.00736842105263158
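The F1 formula can also be written directly in terms of the counts, since $2pr/(p+r)$ simplifies to $2\,tp/(2\,tp + fp + fn)$. The sketch below checks that equivalence on the counts above and compares F1 with the plain arithmetic mean. The name `f1_from_counts` is made up for this example.

```python
def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

def f1_from_counts(tp: int, fp: int, fn: int, tn: int) -> float:
    """Algebraically equivalent form that uses the counts directly."""
    return 2 * tp / (2 * tp + fp + fn)

counts = (70, 4930, 13930, 981070)
p, r = precision(*counts), recall(*counts)
f1 = f1_score(*counts)
arithmetic_mean = (p + r) / 2

# Both forms agree up to floating-point rounding.
assert abs(f1 - f1_from_counts(*counts)) < 1e-12

# F1 lies between recall and precision, and below their arithmetic mean.
assert min(p, r) <= f1 <= max(p, r)
assert f1 <= arithmetic_mean
```

One practical difference: `f1_score` as written divides by zero when there are no true positives, whereas the count-based form returns 0.0 in that case as long as there is at least one false positive or false negative.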

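### This is the harmonic mean of _precision_ and _recall_, and it necessarily lies between them
    A model that predicts “yes” only when it is very confident tends to have high precision but low recall; a model that predicts “yes” at even slight confidence tends to have high recall but low precision. Precision and recall usually trade off against each other: improving one tends to worsen the other. Choosing the right threshold is a matter of finding the right tradeoff between precision and recall.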
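The threshold tradeoff can be sketched with a small set of made-up model scores. Everything in the block below (the `scored` list, the helper `precision_recall_at`, and the three thresholds) is hypothetical and chosen only to show the pattern. The helper assumes at least one prediction and one actual positive at each threshold, so it does not guard against division by zero.

```python
from typing import List, Tuple

# Hypothetical (score, is_actually_positive) pairs.
scored: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

def precision_recall_at(threshold: float) -> Tuple[float, float]:
    """Predict positive when score >= threshold; return (precision, recall)."""
    tp = sum(1 for s, pos in scored if s >= threshold and pos)
    fp = sum(1 for s, pos in scored if s >= threshold and not pos)
    fn = sum(1 for s, pos in scored if s < threshold and pos)
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.85, 0.5, 0.2):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

On this toy data, lowering the threshold from 0.85 to 0.2 raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.571.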

# <font color= 'red'> Bias-Variance Tradeoff