# Intro to Data Science

[Gina Sprint](https://ginasprint.com/)

# Classifier Evaluation
What are our learning objectives for this lesson?
* Evaluate classifier performance using different metrics
* Divide a dataset into training and testing sets using different approaches

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm up Task(s)
Open our T-shirt example notes from last class
1. Normalize the weight values using min-max scaling (see our work with normalizing the height column for an example)
1. Calculate the distance between each training instance and the unseen instance using $distance(a, b) = \sqrt{\sum (a_i - b_i)^2}$
1. Which k=3 training instances have the smallest distances?
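
The three warm-up steps can be sketched in plain Python. The (height, weight) values below are made up for illustration and are not the T-shirt data from class. The helper names `min_max`, `scale_instance`, and `euclidean_distance` are also hypothetical. Note that the unseen instance is scaled with the training set's min and max.

```python
import math

def min_max(value, lo, hi):
    """Min-max scale a single value into the range [0, 1]."""
    return (value - lo) / (hi - lo)

def euclidean_distance(a, b):
    """distance(a, b) = sqrt(sum((a_i - b_i)^2))"""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

# Made-up (height in cm, weight in kg) instances, NOT the class dataset
train = [(158, 58), (160, 59), (163, 61), (165, 62), (170, 68)]
unseen = (161, 61)

# Step 1: normalize each column using the training min and max
columns = list(zip(*train))
mins = [min(col) for col in columns]
maxs = [max(col) for col in columns]

def scale_instance(instance):
    return [min_max(v, lo, hi) for v, lo, hi in zip(instance, mins, maxs)]

scaled_train = [scale_instance(instance) for instance in train]
scaled_unseen = scale_instance(unseen)

# Step 2: distance between each training instance and the unseen instance
distances = [(euclidean_distance(instance, scaled_unseen), index)
             for index, instance in enumerate(scaled_train)]

# Step 3: the k=3 training instances with the smallest distances
k = 3
nearest_indexes = [index for _, index in sorted(distances)[:k]]
print(nearest_indexes)
```

Sorting `(distance, index)` pairs keeps track of which training row each distance belongs to, which is needed later to look up the neighbors' class labels.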

## Today
1. Finish our k nearest neighbors algorithm example
1. Code up our k nearest neighbors algorithm in ClassificationFun
1. Break
1. Finish ClassificationFun (train/test split and accuracy)
1. Go over project part 4

## TODO
1. Review all of the "fun" in-class examples we did
    1. PythonBasicsFun
    1. PandasFun
    1. DataCleaningFun
    1. DataVisualizationFun
    1. JupyterNotebookFun
    1. ClassificationFun
1. Work on project part 4 and project part 3 (BONUS)
    * Note: I moved project part 3 to be BONUS (extra credit) 😀
    * Note: If you are unable to get all code working for the project, please write Python comments to describe what you tried and what the code should do
1. Turn in the project by submitting your GitHub repository URL in Moodle

## Thank you for a great first time teaching in China!!

## Metrics
### Accuracy
Accuracy: the percentage of test instances correctly classified by the classifier
* Sometimes called "recognition rate"
* Use the [KNeighborsClassifier.score(X, y)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score) method to determine the accuracy of a kNN classifier
    * Description: "Return the mean accuracy on the given test data and labels."
    * Note: for other classifiers, check the documentation for the `score()` method to see what the default evaluation metric is
* Warning: accuracy can be misleading when the class labels are unevenly distributed
    * e.g., lots of negative cases that are easily detected (99% accuracy is possible when 99% of the dataset is the negative class)
    * The high overall accuracy then masks poor performance on the positive cases
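
To see the warning in action, here is a minimal sketch with made-up labels. The `accuracy` helper and the "always predict negative" classifier are hypothetical and only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the actual label."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred)
                  if actual == predicted)
    return correct / len(y_true)

# Made-up unbalanced dataset: 99 negative instances and 1 positive instance
y_true = ["neg"] * 99 + ["pos"]

# A "classifier" that ignores its input and always predicts negative
y_pred = ["neg"] * len(y_true)

acc = accuracy(y_true, y_pred)
positives_found = sum(1 for actual, predicted in zip(y_true, y_pred)
                      if actual == "pos" and predicted == "pos")

print("accuracy:", acc)
print("positive cases correctly classified:", positives_found)
```

The accuracy looks excellent even though the classifier never finds a single positive case, which is why the additional metrics in the next section are worth looking into.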

### More Classifier Evaluation Metrics to Look Into
* Error rate: 1 - accuracy
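* Precision: measure of "exactness" (of the instances predicted positive, the fraction that are actually positive)
* Recall (AKA sensitivity): measure of "completeness" (of the instances that are actually positive, the fraction that were predicted positive)
* F-Measure (AKA F1 score): combines the two via the harmonic mean of precision and recall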
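
These metrics can be computed from counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses made-up labels and a hypothetical helper named `precision_recall_f1`. Returning 0.0 when a denominator is zero is an assumption made here, not a universal convention.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == positive_label and p == positive_label)
    fp = sum(1 for a, p in pairs if a != positive_label and p == positive_label)
    fn = sum(1 for a, p in pairs if a == positive_label and p != positive_label)

    # Assumption: report 0.0 when a metric is undefined (zero denominator)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up actual and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive_label=1)
accuracy = sum(1 for a, p in zip(y_true, y_pred) if a == p) / len(y_true)
error_rate = 1 - accuracy

print(precision, recall, f1, error_rate)
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3 and recall is 2/4. scikit-learn offers ready-made versions of these metrics in its `sklearn.metrics` module.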

## Training and Testing
Building a classifier starts with a learning (training) phase
* Based on predefined set of examples (AKA the training set)

The classifier is then evaluated for predictive accuracy
* Based on another set of examples (AKA the testing set)
* We use the actual labels of the examples to test the predictions

In general, we want to try to avoid overfitting
* That is, encoding particular characteristics/anomalies of the training set into the classifier
* The opposite problem is "underfitting" (a model that is too simple to capture the pattern, e.g., linear when the relationship is polynomial)

There are several different ways to select training and testing sets:
1. The train/test split (holdout) method
    * The one we will go over
2. Random subsampling
3. $k$-Fold cross validation and variants
4. Bootstrap method

### Train/Test Split (Holdout) Method
In the train/test split method, the dataset is divided into two sets, the training and the testing set. The training set is used to build the model and the testing set is used to evaluate the model (e.g. the model's accuracy).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)
(image from https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png)

Approaches to the train/test split method
* Randomly divide the dataset into a training set and a test set
* Partition evenly (1:1) or, e.g., $\frac{2}{3}$ training to $\frac{1}{3}$ test (2:1)
* This is random selection without replacement
* Use the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to apply the holdout method to a dataset
    * Description: "Split arrays or matrices into random train and test subsets"
    * `test_size` parameter: "If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25"
    * `random_state` parameter: "Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls."
    * `stratify` parameter: "If not None, data is split in a stratified fashion, using this as the class labels."
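
Putting the pieces together, here is a minimal sketch of the holdout method followed by a kNN accuracy check. The height/weight values and the "M"/"L" labels are made up for illustration (they are not the class dataset), and normalization is skipped to keep the example short. With 12 rows, `test_size=4` gives the 2:1 split described above.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up (height in cm, weight in kg) instances with T-shirt size labels
X = [[158, 58], [158, 59], [160, 59], [160, 60], [163, 60], [163, 61],
     [165, 62], [165, 65], [168, 62], [168, 63], [170, 64], [170, 68]]
y = ["M"] * 6 + ["L"] * 6

# Hold out 4 of the 12 instances for testing (2:1 training to test)
# random_state makes the shuffle reproducible
# stratify=y keeps the M/L proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, random_state=0, stratify=y)

# Train on the training set only
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Evaluate on the held-out testing set; score() returns the mean accuracy
test_accuracy = clf.score(X_test, y_test)
print("training instances:", len(X_train))
print("testing instances:", len(X_test))
print("accuracy:", test_accuracy)
```

Changing `random_state` produces a different random split, so the accuracy can vary from run to run on a small dataset like this one.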