# Zero-Information Prediction Function (Classification)
For classification, let $y_{mode}$ be the most frequently occurring class in the training data.
  - The prediction function that always predicts $y_{mode}$ is called the  
    - zero-information prediction function, or 
    - no-information prediction function   
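
A minimal sketch of this baseline, using only the Python standard library. The label list and the name `fit_zero_information_classifier` are illustrative assumptions, not part of the notes. The baseline's accuracy on the training labels is simply the relative frequency of $y_{mode}$.

```python
from collections import Counter

def fit_zero_information_classifier(train_labels):
    """Return a prediction function that ignores its input and always
    predicts the most frequent training label (y_mode)."""
    y_mode, _count = Counter(train_labels).most_common(1)[0]
    return lambda x: y_mode

# Hypothetical training labels, for illustration only.
train_labels = ["spam", "ham", "ham", "ham", "spam"]
predict = fit_zero_information_classifier(train_labels)

# Training accuracy of the baseline = relative frequency of the mode.
baseline_accuracy = sum(predict(None) == y for y in train_labels) / len(train_labels)
print(predict("any input"), baseline_accuracy)
```

Any real model should be compared against this number before its accuracy is taken seriously.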

# Zero-Information Prediction Function (Regression)
- What’s the right zero-information prediction function for square loss?   
  - Mean of the training data labels   
- What’s the right zero-information prediction function for absolute loss?   
  - Median of the training data labels  
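
A small numerical check of these two answers, as a sketch: among constant predictions, the mean should give the lower average square loss and the median the lower average absolute loss. The data and the helper names are hypothetical.

```python
from statistics import mean, median

def avg_square_loss(c, ys):
    """Average square loss of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def avg_absolute_loss(c, ys):
    """Average absolute loss of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)

# Hypothetical training labels with one outlier, for illustration only.
ys = [1, 2, 2, 3, 10]
y_mean, y_median = mean(ys), median(ys)

print("mean:", y_mean, "median:", y_median)
print("square loss at mean vs. median:",
      avg_square_loss(y_mean, ys), avg_square_loss(y_median, ys))
print("absolute loss at mean vs. median:",
      avg_absolute_loss(y_mean, ys), avg_absolute_loss(y_median, ys))
```

The outlier pulls the mean away from the bulk of the labels, which is why the two baselines differ here.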

# Regularized Linear Model
- Whatever fancy model you are using (gradient boosting, neural networks, etc.), always spend some time building a linear baseline model  
  - Build a regularized linear model  
- If your fancier model isn’t beating the linear model  
  - perhaps something’s wrong with your fancier model (e.g. hyperparameter settings), or  
  - you don’t have enough data to beat the simpler model  
- Prefer simpler models if performance is the same  
  - usually cheaper to train and easier to deploy


# Oracle Models
- Often helpful to get an upper bound on achievable performance  
- What’s the best-performing prediction function you can get, looking at your validation/test data?   
  - Its performance estimates the Bayes risk (i.e. the optimal error rate). 
  - This won’t always be 0 - why?  
- Using the same model class as your ML model  
  - fit to the validation/test data without regularization  
  - Performance will tell us the limit of our model class, even with infinite training data  
  - Gives an estimate of the approximation error of our hypothesis space

# Describing Classifier Performance  
## Confusion Matrix  
A **confusion matrix** summarizes results for a binary classification problem:  
<div align="center"><img src = "./confusion matrix.jpg" width = '500' height = '100' align = center /></div>  

- a is the number of examples of Class 0 that the classifier predicted [correctly] as Class 0.  
- b is the number of examples of Class 1 that the classifier predicted [incorrectly] as Class 0.  
- c is the number of examples of Class 0 that the classifier predicted [incorrectly] as Class 1.  
- d is the number of examples of Class 1 that the classifier predicted [correctly] as Class 1.  
 
- **Accuracy** is the fraction of correct predictions  
$$\frac{a + d}{a + b + c + d}$$
- **Error rate** is the fraction of incorrect predictions  
$$\frac{b + c}{a + b + c + d}$$  

We can talk about accuracy of different subgroups of examples:   
- **Accuracy for examples** of class 0:  
$$\frac{a}{a+c}$$  
- **Accuracy for examples predicted** to have class 0:  
$$\frac{a}{a + b}$$  
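
The four counts and the derived quantities can be sketched in a few lines of Python. The labels and predictions below are hypothetical, and `confusion_counts` is an illustrative helper rather than a library function.

```python
def confusion_counts(y_true, y_pred):
    """Return (a, b, c, d) following the notation above:
    a: true 0, predicted 0    b: true 1, predicted 0
    c: true 0, predicted 1    d: true 1, predicted 1
    """
    pairs = list(zip(y_true, y_pred))
    a = pairs.count((0, 0))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((1, 1))
    return a, b, c, d

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0]

a, b, c, d = confusion_counts(y_true, y_pred)
total = a + b + c + d
accuracy = (a + d) / total
error_rate = (b + c) / total
accuracy_class_0 = a / (a + c)      # among examples whose true class is 0
accuracy_predicted_0 = a / (a + b)  # among examples predicted as class 0
print(a, b, c, d, accuracy, error_rate, accuracy_class_0, accuracy_predicted_0)
```

Note that the two subgroup accuracies condition on different things (true class vs. predicted class) and generally differ.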

## Issue with Accuracy and Error Rate  
Consider a no-information classifier that achieves the following   
<div align="center"><img src = "./issue.jpg" width = '500' height = '100' align = center /></div>  

- Accuracy is about 99.9% and the error rate is about 0.09%.  
- Two lessons: 
  - Accuracy numbers are meaningless without knowing the **no-information rate** or **base rate**.  
  - Accuracy alone doesn’t capture what’s going on (0% success on class 1).

## Positive and Negative Classes
- In many contexts, it’s very natural to identify a **positive class** and a **negative class**  
   - pregnancy test (positive = you’re pregnant)
   - radar system (positive = threat detected)
   - searching for documents about bitcoin (positive = document is about bitcoin)
   - statistical hypothesis testing (positive = reject the null hypothesis)
###  FP, FN, TP, TN
- Let’s denote the positive class by + and negative class by −:
<div align="center"><img src = "./T.jpg" width = '500' height = '100' align = center /></div>  

- **TP** is the number of true positives: predicted correctly as Class +.  
- **FP** is the number of false positives: predicted incorrectly as Class + (i.e. true class −)  
- **TN** is the number of true negatives: predicted correctly as Class −.   
- **FN** is the number of false negatives: predicted incorrectly as Class − (i.e. true class +)

### Precision and Recall
- The **precision** is the accuracy of the positive predictions  
$$\frac{TP}{TP + FP}$$  
  - High precision means few false alarms among the positive predictions (if you test positive, you’re probably positive)
- The **recall** is the accuracy of the positive class  
$$\frac{TP}{TP + FN}$$
  - High recall means you’re not missing many positives


## Information Retrieval
- Consider a database of 100,000 documents  
- Query for bitcoin returns 200 documents  
- 100 of them are actually about bitcoin  
- 50 documents about bitcoin were not returned  

<div align="center"><img src = "./bitcoin.jpg" width = '500' height = '100' align = center /></div>  

- The **precision** is the accuracy of the + predictions: TP / (TP + FP) = 100 / 200 = 50%.  
  - 50% of the documents offered as relevant are actually relevant  
- The **recall** is the accuracy of the positive class: TP / (TP + FN) = 100 / (100 + 50) ≈ 67%.  
  - 67% of the relevant documents were found (or “recalled”).   
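
The same arithmetic as a short Python sketch. All counts come from the example above; TN is derived as the remainder of the 100,000 documents. Accuracy is computed as well, to show how it can look excellent while precision is only 50%.

```python
total_docs = 100_000
returned = 200           # documents the query returned (predicted +)
tp = 100                 # returned and actually about bitcoin
fn = 50                  # about bitcoin but not returned
fp = returned - tp       # returned but not about bitcoin
tn = total_docs - tp - fp - fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / total_docs
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.4f}")
```

Accuracy is dominated by the very large number of true negatives, which is why precision and recall are the more informative numbers here.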

## $F_1$ Score  
- We really want high precision **and** high recall  
- But to choose a “best” model, we need a single number performance summary  
- The **F-measure** or $F_1$ score is the harmonic mean of precision and recall:   
$$F_{1}=2 \cdot \frac{1}{\frac{1}{\text { recall }}+\frac{1}{\text { precision }}}=2 \cdot \frac{\text { precision } \cdot \text { recall }}{\text { precision }+\text { recall }}$$  

## $F_\beta$ Score  
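- $F_\beta$ score for $\beta > 0$:
$$F_{\beta}=\left(1+\beta^{2}\right) \cdot \frac{\text { precision } \cdot \text { recall }}{\left(\beta^{2} \cdot \text { precision }\right)+\text { recall }}$$  
- $\beta = 1$ recovers $F_1$; $\beta > 1$ puts more weight on recall, $\beta < 1$ puts more weight on precision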
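
A sketch of the formula in Python, evaluated at the precision (50%) and recall (67%) from the information retrieval example. The function name `f_beta` and the choice to return 0 when the denominator is 0 are assumptions made for this illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta score; beta=1 gives the F_1 score (harmonic mean)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention assumed here when precision = recall = 0
    return (1 + beta ** 2) * precision * recall / denom

precision, recall = 0.5, 2 / 3
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 3))
```

Since recall is higher than precision in this example, the score rises as $\beta$ grows.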

## TPR, FNR, FPR, TNR
- **True positive rate** is the accuracy on the positive class: TP / (FN + TP) 
  - same as recall, also called **sensitivity**
- **False negative rate** is the error rate on the positive class: FN / (FN + TP) (“miss rate”) 
- **False positive rate** is the error rate on the negative class: FP / (FP + TN)  
  - also called fall-out or false alarm rate   
- **True negative rate** is the accuracy on the negative class: TN / (FP + TN) (“specificity”)
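
The four rates can be sketched as a single helper. The counts are hypothetical and `rates` is an illustrative name. The two rates on each true class sum to 1.

```python
def rates(tp, fn, fp, tn):
    """Compute the four rates from confusion-matrix counts."""
    positives = tp + fn   # examples whose true class is +
    negatives = fp + tn   # examples whose true class is -
    return {
        "TPR": tp / positives,  # recall / sensitivity
        "FNR": fn / positives,  # miss rate
        "FPR": fp / negatives,  # fall-out
        "TNR": tn / negatives,  # specificity
    }

# Hypothetical counts, for illustration only.
r = rates(tp=40, fn=10, fp=5, tn=45)
print(r)
```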

## Medical Diagnostic Test: Sensitivity and Specificity
- **Sensitivity** is another name for TPR and recall  
  - What fraction of people with disease do we identify as having disease?   
  - How “sensitive” is our test to indicators of disease?   
- **Specificity** is another name for TNR   
  - What fraction of people without disease do we identify as being without disease?   
  - High specificity means few false alarms  
- In medical diagnosis, we want both sensitivity and specificity to be high.

# Thresholding Classification Score Function  
## The Classification Problem
- Action space $A = R$, output space $Y = \{-1, 1\}$  
- **Real-valued** prediction function $f : X \to R$, called the **score function**. 
- Convention is:  
$$\begin{array}{l}
f(x)>0 \Longrightarrow \text { Predict } 1 \\
f(x)<0 \Longrightarrow \text { Predict }-1
\end{array}$$

## Example: Scores, Predictions, and Labels
<div align="center"><img src = "./result.jpg" width = '500' height = '100' align = center /></div>  

- Performance measures (predicting + iff Score > 0)  
  - Error Rate = 4/12 ≈ 0.33 
  - Precision = 4/6 ≈ 0.67 
  - Recall = 4/6 ≈ 0.67 
  - $F_1$ = 4/6 ≈ 0.67   
- Now predict + iff Score > 2: 
  - Error Rate = 2/12 ≈ 0.17 
  - Precision = 4/4 = 1.0 
  - Recall = 4/6 ≈ 0.67 
  - $F_1$ = 0.8   
- Now predict + iff Score > -1: 
  - Error Rate = 2/12 ≈ 0.17 
  - Precision = 6/8 = 0.75 
  - Recall = 6/6 = 1.0 
  - $F_1$ ≈ 0.86   
- Generally, different thresholds on the score function lead to  
  - different confusion matrices 
  - different performance metrics   
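
A sketch of the threshold sweep in Python. The scores and labels are hypothetical: they are not the table from the figure, only a 12-example dataset chosen so that the three thresholds give the same summary numbers as listed above. `metrics_at_threshold` is an illustrative helper.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Predict + iff score > threshold, then compute summary metrics."""
    preds = [1 if s > threshold else -1 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == -1 for p, y in zip(preds, labels))
    fn = sum(p == -1 and y == 1 for p, y in zip(preds, labels))
    error_rate = (fp + fn) / len(labels)
    precision = tp / (tp + fp)   # assumes at least one positive prediction
    recall = tp / (tp + fn)      # assumes at least one positive example
    f1 = 2 * precision * recall / (precision + recall)  # assumes tp > 0
    return error_rate, precision, recall, f1

# Hypothetical scores: six positives followed by six negatives.
scores = [3.1, 2.8, 2.5, 2.2, -0.3, -0.7, 1.5, 0.6, -1.4, -2.0, -2.6, -3.3]
labels = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1]

for t in (0, 2, -1):
    err, prec, rec, f1 = metrics_at_threshold(scores, labels, t)
    print(f"threshold={t:>2}: error={err:.2f} precision={prec:.2f} "
          f"recall={rec:.2f} F1={f1:.2f}")
```

Raising the threshold trades recall for precision, and lowering it does the reverse.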

# The Performance Curve  
##  Precision-Recall Curve

<div align="center"><img src = "./precision recall curve.jpg" width = '500' height = '100' align = center /></div>   

## Receiver Operating Characteristic (ROC) Curve
- Recall FPR and TPR: 
  - FPR = FP / (number of negative examples) 
  - TPR = TP / (number of positive examples)  
- As we decrease the threshold from $+\infty$ to $-\infty$,  
  - the number of positives predicted increases - some correct, some incorrect. 
  - So both FP and TP increase (or stay the same)  
- The ROC curve charts TPR vs FPR as we vary the threshold  
<div align="center"><img src = "./roc.jpg" width = '500' height = '100' align = center /></div>    
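
A minimal sketch of how the points on an ROC curve can be computed by sweeping the threshold downward. The scores and labels are hypothetical, `roc_points` is an illustrative helper, and the code assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the threshold sweeps from +inf to -inf."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Sort by decreasing score so that lowering the threshold admits one
    # example (or one group of tied examples) at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold = +inf: nothing is predicted positive
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example in a group of tied scores.
        is_last_of_group = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_group:
            points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores: six positives followed by six negatives.
scores = [3.1, 2.8, 2.5, 2.2, -0.3, -0.7, 1.5, 0.6, -1.4, -2.0, -2.6, -3.3]
labels = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1), and neither coordinate ever decreases along the way.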

## Comparing ROC Curves
<div align="center"><img src = "./3roc.jpg" width = '500' height = '100' align = center /></div>   

- Here we have ROC curves for 3 score functions 
- For different FPRs, different score functions give better TPRs  
- No score function dominates another at every FPR  
- Can we come up with an overall performance measure for a score function?  

## Area Under the ROC Curve
- AUC ROC = area under the ROC curve 
- Often just referred to as “AUC”  
- A single number commonly used to summarize classifier performance
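
One way to compute AUC without drawing the curve, sketched below, uses a standard equivalence that is not stated in these notes: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The data and the name `auc_by_pairs` are hypothetical.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))

# Hypothetical scores: six positives followed by six negatives.
scores = [3.1, 2.8, 2.5, 2.2, -0.3, -0.7, 1.5, 0.6, -1.4, -2.0, -2.6, -3.3]
labels = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1]

auc = auc_by_pairs(scores, labels)
print(f"AUC = {auc:.3f}")
```

This pairwise version is quadratic in the number of examples, so it is meant for illustration on small data rather than for production use.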