Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# LDA and KNN classification: Problem solving

In this session, you'll work through a complete example using a new dataset, `binary`.

## Load the dataframe

The `binary.csv` dataset contains 4 variables:

| Variable    | Type    | Description           |
|:-------------|:---------|:-----------------------|
| admit | Nominal   | the admittance status (0=not admitted, 1=admitted) |
| gre  | Ratio   | the student's GRE score  |
| gpa | Ratio   | the student's GPA |
| rank  | Ordinal   | rank of the institution (1=highest to 4=lowest prestige)  |


Start by importing `pandas`.

Load a dataframe with `binary.csv` and display the dataframe.

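|     | admit | gre | gpa  | rank |
|:----|:------|:----|:-----|:-----|
| 0   | 0     | 380 | 3.61 | 3    |
| 1   | 1     | 660 | 3.67 | 3    |
| 2   | 1     | 800 | 4.00 | 1    |
| 3   | 1     | 640 | 3.19 | 4    |
| 4   | 0     | 520 | 2.93 | 4    |
| ... | ...   | ... | ...  | ...  |
| 395 | 0     | 620 | 4.00 | 2    |
| 396 | 0     | 560 | 3.04 | 3    |
| 397 | 0     | 460 | 2.63 | 2    |
| 398 | 0     | 700 | 3.65 | 2    |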
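
The loading step might look like the following sketch. Because `binary.csv` itself is not reproduced here, the example reads a stand-in built from the first five rows displayed above; the variable name `dataframe` and the file path mentioned in the comment are assumptions.

```python
import io

import pandas as pd

# Stand-in for binary.csv: only the first five rows shown in this notebook.
csv_text = """admit,gre,gpa,rank
0,380,3.61,3
1,660,3.67,3
1,800,4.00,1
1,640,3.19,4
0,520,2.93,4
"""

# With the real file this would be pd.read_csv("binary.csv") (path assumed).
dataframe = pd.read_csv(io.StringIO(csv_text))
print(dataframe)
```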


## Prepare the train/test data

To train the classifiers, you need to split the dataframe into training data and testing data.

Start by creating a dataframe `Y` that just has `admit` in it, and then display `Y` so you can be sure it worked.

|     | admit |
|:----|:------|
| 0   | 0     |
| 1   | 1     |
| 2   | 1     |
| 3   | 1     |
| 4   | 0     |
| ... | ...   |
| 395 | 0     |
| 396 | 0     |
| 397 | 0     |
| 398 | 0     |
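
A minimal sketch of selecting `Y`, assuming the data is held in a variable named `dataframe` (an assumed name) and using a five-row stand-in for the real file. Double brackets keep the result a dataframe rather than a series.

```python
import pandas as pd

# Stand-in data: the first five rows shown in this notebook.
dataframe = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "rank": [3, 3, 1, 4, 4],
})

# A list of column names returns a one-column dataframe.
Y = dataframe[["admit"]]
print(Y)
```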


Next do the same thing for `X` using the other columns in the dataframe.

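|     | gre | gpa  | rank |
|:----|:----|:-----|:-----|
| 0   | 380 | 3.61 | 3    |
| 1   | 660 | 3.67 | 3    |
| 2   | 800 | 4.00 | 1    |
| 3   | 640 | 3.19 | 4    |
| 4   | 520 | 2.93 | 4    |
| ... | ... | ...  | ...  |
| 395 | 620 | 4.00 | 2    |
| 396 | 560 | 3.04 | 3    |
| 397 | 460 | 2.63 | 2    |
| 398 | 700 | 3.65 | 2    |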
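
For `X`, one option is to drop the label column instead of listing every predictor. This is a sketch on the same kind of five-row stand-in, again assuming the variable name `dataframe`.

```python
import pandas as pd

# Stand-in data: the first five rows shown in this notebook.
dataframe = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "rank": [3, 3, 1, 4, 4],
})

# Everything except the label column becomes the predictors.
X = dataframe.drop(columns=["admit"])
print(X)
```

Selecting with `dataframe[["gre", "gpa", "rank"]]` gives the same result and makes the chosen predictors explicit.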


To split the data into training and testing data, import `model_selection`.

Now split the data into training and testing data, using `test_size` at one of the following 3 values depending on your birthday:

If your birthday is in:

- Jan, Feb, Mar, Apr, use `0.2` 
- May, Jun, Jul, Aug, use `0.4`
- Sep, Oct, Nov, Dec, use `0.6`

So depending on your birthday, we'll use 20, 40, or 60% of the data for testing.
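
The split itself could be sketched as below with `test_size=0.2`. The 100-row frame is a hypothetical stand-in, and the names `Xtrain`, `Xtest`, and `Ytrain` are assumptions (only `Ytest` is named in this notebook). The `random_state` argument is optional and is included here only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Hypothetical stand-in: 100 rows with the same column names as binary.csv.
X = pd.DataFrame({
    "gre": np.linspace(220, 800, 100),
    "gpa": np.linspace(2.0, 4.0, 100),
    "rank": np.tile([1, 2, 3, 4], 25),
})
Y = pd.DataFrame({"admit": np.tile([0, 0, 1, 0], 25)})

# test_size=0.2 holds out 20% of the rows for testing.
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, test_size=0.2, random_state=0
)
print(len(Xtrain), len(Xtest))
```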

## Train LDA

First, import `discriminant_analysis`.

Next define the LDA model, e.g. using `create`.

Now perform LDA.
Because we need `ravel` to reformat the data for `sklearn`, go ahead and import `numpy`.

Train LDA and do the predictions in one cell.
Save the predictions in the variable `ldaPredictions`, and then show the predictions to make sure it worked.

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1])

You should see a mix of `1` and `0` in the predictions. 
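
As a rough sketch of training and predicting in one cell, the example below fits `LinearDiscriminantAnalysis` on randomly generated stand-in data, so its predictions will not match the output above. The synthetic columns, the rule used to generate `admit`, and the names `lda`, `Xtrain`, `Xtest`, and `Ytrain` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import discriminant_analysis, model_selection

# Hypothetical stand-in for binary.csv: random values, same column names.
rng = np.random.default_rng(0)
n = 400
dataframe = pd.DataFrame({
    "gre": rng.integers(220, 801, size=n),
    "gpa": rng.uniform(2.0, 4.0, size=n).round(2),
    "rank": rng.integers(1, 5, size=n),
})
# Made-up labeling rule so that the two classes are learnable.
dataframe["admit"] = (dataframe["gpa"] + rng.normal(0, 0.5, size=n) > 3.3).astype(int)

Y = dataframe[["admit"]]
X = dataframe[["gre", "gpa", "rank"]]
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(X, Y, test_size=0.2)

lda = discriminant_analysis.LinearDiscriminantAnalysis()
# np.ravel flattens the one-column dataframe into the 1-D array sklearn expects.
ldaPredictions = lda.fit(Xtrain, np.ravel(Ytrain)).predict(Xtest)
print(ldaPredictions)
```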

**QUESTION:**

Do you think `0` or `1` is more common in this dataset?
What could you do with the dataframe to check?

**ANSWER: (click here to edit)**

*`0` looks more common. An easy way to check would be to use `describe` on the dataframe: because `admit` is coded as 0 or 1, its mean is the proportion of `1`s.*
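
A small sketch of that check is shown below. It uses a stand-in `admit` column built from the test-set class counts in the classification reports in this notebook (57 zeros and 23 ones), not the full dataset, so the proportions are only illustrative.

```python
import pandas as pd

# Stand-in column: 57 zeros and 23 ones, the test-set supports reported
# in this notebook's classification reports.
dataframe = pd.DataFrame({"admit": [0] * 57 + [1] * 23})

# For a 0/1 column, the mean from describe() is the share of 1s.
share_of_ones = dataframe.describe().loc["mean", "admit"]
print(share_of_ones)

# value_counts gives the same information as explicit proportions per class.
print(dataframe["admit"].value_counts(normalize=True))
```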

## KNN

First import `neighbors`.

Next define the KNN model, e.g. using `create`.

Now train KNN and do the predictions in one cell.
Save the predictions in the variable `knnPredictions`, and then show the predictions to make sure it worked.

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
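
The KNN step follows the same pattern. This sketch uses `KNeighborsClassifier` with its default of 5 neighbors on randomly generated stand-in data, so the predictions will differ from the output above; the synthetic columns, the labeling rule, and the names `knn`, `Xtrain`, `Xtest`, and `Ytrain` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, neighbors

# Hypothetical stand-in for binary.csv: random values, same column names.
rng = np.random.default_rng(1)
n = 400
dataframe = pd.DataFrame({
    "gre": rng.integers(220, 801, size=n),
    "gpa": rng.uniform(2.0, 4.0, size=n).round(2),
    "rank": rng.integers(1, 5, size=n),
})
# Made-up labeling rule so that the two classes are learnable.
dataframe["admit"] = (dataframe["gpa"] + rng.normal(0, 0.5, size=n) > 3.3).astype(int)

Y = dataframe[["admit"]]
X = dataframe[["gre", "gpa", "rank"]]
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(X, Y, test_size=0.2)

knn = neighbors.KNeighborsClassifier()  # default n_neighbors is 5
knnPredictions = knn.fit(Xtrain, np.ravel(Ytrain)).predict(Xtest)
print(knnPredictions)
```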

## Classifier evaluation

To see if the models are any good, do some evaluations.

First import `metrics`.

### Accuracy

Calculate the LDA accuracy.

0.7375

And calculate the KNN accuracy.

0.7125
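
Accuracy is the fraction of predictions that match the true labels. The toy example below uses made-up labels (not the notebook's data) to show the call; in the notebook, the first argument would be `Ytest` and the second the predictions from each model.

```python
from sklearn import metrics

# Made-up labels for illustration: 3 of the 4 predictions are correct.
true_labels = [0, 1, 1, 0]
predictions = [0, 1, 0, 0]

accuracy = metrics.accuracy_score(true_labels, predictions)
print(accuracy)
```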

**QUESTION:**

Which has higher accuracy for you, LDA or KNN?
Why do you think that might be?

**ANSWER: (click here to edit)**

*LDA is higher, but just by a bit. I would expect LDA to do better on linear problems in general; since LDA does better here, this could be a linear problem.*

### Recall/Precision per class

Get the LDA `classification_report`.

              precision    recall  f1-score   support

           0       0.76      0.91      0.83        57
           1       0.58      0.30      0.40        23

    accuracy                           0.74        80
   macro avg       0.67      0.61      0.62        80
weighted avg       0.71      0.74      0.71        80
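
To see where the numbers for class `1` come from, the sketch below rebuilds labels from counts that are consistent with the LDA report above: 52 true negatives, 5 false positives, 16 false negatives, and 7 true positives. These counts are inferred from the rounded report and the accuracy of 0.7375; they are not output from the notebook.

```python
import numpy as np
from sklearn import metrics

# Inferred counts: of 57 true 0s, 52 predicted 0 and 5 predicted 1;
# of 23 true 1s, 16 predicted 0 and 7 predicted 1.
y_true = np.array([0] * 57 + [1] * 23)
y_pred = np.array([0] * 52 + [1] * 5 + [0] * 16 + [1] * 7)

precision_1 = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP) = 7 / 12
recall_1 = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN) = 7 / 23
accuracy = metrics.accuracy_score(y_true, y_pred)      # (TN + TP) / 80 = 59 / 80

print(round(precision_1, 2), round(recall_1, 2), accuracy)
print(metrics.classification_report(y_true, y_pred))
```

Low recall on class `1` means most admitted students in the test set were predicted as not admitted, even though overall accuracy looks reasonable.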



And get the KNN `classification_report`.

              precision    recall  f1-score   support

           0       0.77      0.84      0.81        57
           1       0.50      0.39      0.44        23

    accuracy                           0.71        80
   macro avg       0.64      0.62      0.62        80
weighted avg       0.70      0.71      0.70        80



**QUESTION:**

Is there a particular class that LDA, KNN, or both do worse on?
Why do you think that might be?

**ANSWER: (click here to edit)**

*About 69% of the `admit`s are `0`, so the classifiers have a harder time with the `1`s because they are less common. This is called imbalanced classes and is a problem often seen in the real world.*

### Comparing classifiers

There are several ways to compare classifiers, including using accuracy and the classification report.

But it is also interesting to ask how much the classifiers *agree* with each other on both correct and incorrect answers.

The easiest way to do this is to use `accuracy` again, but use `knnPredictions` and `ldaPredictions` instead of `Ytest`.

Try it.

0.85
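
The agreement check can be reproduced from the two prediction arrays printed earlier in this notebook. This sketch copies those arrays verbatim and passes one in place of the true labels; counting the disagreements directly is an extra step added here for illustration.

```python
import numpy as np
from sklearn import metrics

# Prediction arrays copied from the LDA and KNN outputs in this notebook.
ldaPredictions = np.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
     0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
     0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1])
knnPredictions = np.array(
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
     1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
     0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
     0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# Treat one model's predictions as the "truth" for the other.
agreement = metrics.accuracy_score(ldaPredictions, knnPredictions)
disagreements = int(np.sum(ldaPredictions != knnPredictions))
print(agreement, disagreements)
```

Because accuracy is symmetric in its two arguments, swapping the order of the arrays gives the same agreement value.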

**QUESTION:**

Are you surprised or not surprised by the amount of agreement between the classifiers?
Why?

**ANSWER: (click here to edit)**

*I'm somewhat surprised that the agreement is not higher since they have almost the same level of accuracy. This suggests that each is getting some answers correct that the other is failing to get correct.*

## Submit your work

When you have finished the notebook, please download it, log in to [OKpy](https://okpy.org/) using "Student Login", and submit it there.

Then let your instructor know on Slack.
