# Project: Grading with K-Means (30 pts)
## Due Sunday November 14th, 2021

In this project we consider a grading data set that I obtained from a colleague at Duke University. It contains 95 students' grades, both in raw numerical form and as letter grades with $\pm$ decorations, e.g. $B+$ or $A-$.

In essence, grading is a partitioning task. At the end of the semester we must partition students into As, Bs, Cs, Ds, and Fs. Adding plus or minus decorations is a further refinement of this partition.

To make grading less ad hoc, let's use $K$-means clustering with $K=5$ (and later with $K=13$) to automatically partition our students into clusters.

Because the data has already been labeled by a professor, we can measure how our partitioning compares with the professor's, and we can compute other statistics such as purity.

### Question 0: Loading the Data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

# Read dataset to pandas dataframe
data = pd.read_csv('grades.csv')

print(data.shape)
data[60:95]

### Question 1 (1 pt)

Import the grades.csv file from Blackboard as a Pandas DataFrame. Create another DataFrame that includes only the 'raw grade' and 'raw letter' columns.

The 'raw grade' column includes numerical scores between 56.1 and 99.6. The 'raw letter' column should contain letter grades between $F$ and $A+$. A plus or minus (for some reason a $B-$ is represented as a $B=$ in this data set) is called a "decoration" of the letter grade.
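A minimal sketch of the column selection, using a small made-up DataFrame in place of grades.csv. The `student` column and the three sample rows are invented for illustration, and rewriting `=` as `-` is only one possible way to handle the odd $B=$ encoding; the assignment does not require it.

```python
import pandas as pd

# Toy stand-in for grades.csv; the values and the 'student' column are made up.
toy = pd.DataFrame({
    "student": [1, 2, 3],
    "raw grade": [91.2, 81.5, 77.0],
    "raw letter": ["A-", "B=", "C+"],
})

# Keep only the two columns of interest; .copy() gives an independent DataFrame.
grades = toy[["raw grade", "raw letter"]].copy()

# Optional: rewrite the 'B=' style minus sign as a conventional '-'.
grades["raw letter"] = grades["raw letter"].str.replace("=", "-", regex=False)

print(grades)
```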

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 2 (3 pts)

Run the $K$-means clustering algorithm with $K=5$. Here $K=5$ corresponds to the 5 letter grades $A, B, C, D$ and $F$. In this question, all $\pm$ decorations are ignored, so an $A$ and an $A-$ are regarded as the same.

***Make sure to set random_state=0 when you run KMeans!***
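As a rough sketch of the mechanics, here is scikit-learn's `KMeans` applied to a handful of made-up scores. The data and the choice of 3 clusters are assumptions made to keep the example small; the assignment itself uses $K=5$ on the real grades.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up scores forming three well-separated groups.
scores = np.array([58.0, 60.5, 61.0, 78.0, 79.5, 80.0, 95.0, 96.5, 98.0])

# KMeans expects a 2D array of shape (n_samples, n_features),
# so the 1D scores are reshaped into a single column.
X = scores.reshape(-1, 1)

# n_init is set explicitly because its default changed across scikit-learn versions.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

print(labels)                            # one integer cluster label per score
print(kmeans.cluster_centers_.ravel())   # one center per cluster
```

Note that the integer labels carry no ordering: label 0 is not necessarily the lowest-scoring group.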

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 3 (5 pts)

Create a 1D scatter plot where the points are colored by the prediction value given by the algorithm. Set the y value of each point in your scatter plot to correspond to the undecorated letter grade, measured as a GPA, i.e. A is a 4.0, B is a 3.0, and so on.
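One possible shape for such a plot, sketched on made-up data. The scores, letters, and cluster labels are invented, and the values C = 2.0, D = 1.0 and F = 0.0 simply extend the "A is 4.0, B is 3.0" pattern, which is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: raw scores, professor letters, and cluster labels.
scores = np.array([58.0, 66.0, 75.0, 84.0, 88.0, 95.0])
letters = ["F", "D", "C", "B+", "B", "A-"]
labels = np.array([2, 2, 0, 1, 1, 3])  # hypothetical K-means output

# Undecorated letter -> GPA points (values below B are an assumption).
gpa_points = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# letter[0] drops any trailing decoration such as '+', '-' or '='.
y = np.array([gpa_points[letter[0]] for letter in letters])

fig, ax = plt.subplots()
points = ax.scatter(scores, y, c=labels, cmap="viridis")
ax.set_xlabel("raw grade")
ax.set_ylabel("GPA of undecorated letter")
fig.colorbar(points, ax=ax, label="K-means label")
# In a notebook the figure renders at the end of the cell;
# in a plain script, call plt.show() here.
```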

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 4 (2 pts)

Create (or append to) a Pandas DataFrame with 'raw grade', 'raw letter' and a new column 'K-Means Letter' that records the label number predicted by the K-means algorithm.
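A small sketch of adding such a column, on made-up rows and made-up cluster labels. `DataFrame.assign` returns a new DataFrame and leaves the original untouched; assigning to `grades["K-Means Letter"]` directly would modify it in place instead.

```python
import numpy as np
import pandas as pd

# Made-up rows and hypothetical K-means labels, one label per row.
grades = pd.DataFrame({
    "raw grade": [91.2, 81.5, 77.0],
    "raw letter": ["A-", "B", "C+"],
})
labels = np.array([3, 4, 0])

# The column name contains a space and a hyphen, so it is passed via a dict.
with_clusters = grades.assign(**{"K-Means Letter": labels})

print(with_clusters)
```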

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 5 (5 pts)

Notice that the K-means algorithm uses numbers $\{0,1, \ldots, K-1\}$ to label clusters, whereas professors use grades $\{A,B,C,D,F\}$ to label clusters. For example, in my solution to this project, the K-means label `0` *roughly* corresponded to the letter grade $C$ (and $C+$) and the K-means label `4` *roughly* corresponded to the letter grade $B$ (and $B-$).

![K-Means Prediction](Grade-k-means-predict.png)

Write a small program that returns the percentage of agreement between the professor's assigned letter grade (undecorated, so $A$ is the same as $A-$) and your K-means predictions.

For example, in the screenshot above, at least two $B-$'s were misclassified as $C$'s, so our rate of agreement should be less than 98%.

**Compute percentage of agreement between k-means with K=5 and the professor's grading schema.**
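Before agreement can be measured, each cluster number has to be translated into a letter. The sketch below uses one possible rule, which is an assumption rather than the intended solution: rank the clusters by their mean raw score and hand out $A$ through $F$ in that order. All data is made up.

```python
import numpy as np

# Made-up scores, professor letters (undecorated), and cluster labels.
scores = np.array([58.0, 66.0, 68.0, 75.0, 84.0, 88.0, 95.0, 97.0])
prof = np.array(["F", "D", "D", "C", "B", "B", "A", "A"])
labels = np.array([2, 2, 4, 0, 1, 1, 3, 3])  # hypothetical K-means output

# Mean score of each cluster, then clusters sorted from highest to lowest mean.
cluster_ids = np.unique(labels)
means = np.array([scores[labels == c].mean() for c in cluster_ids])
ranked = cluster_ids[np.argsort(-means)]

# The highest-scoring cluster becomes 'A', the next 'B', and so on.
to_letter = dict(zip(ranked, ["A", "B", "C", "D", "F"]))
predicted = np.array([to_letter[c] for c in labels])

# Fraction of students whose predicted letter matches the professor's.
agreement = 100.0 * np.mean(predicted == prof)
print(f"{agreement:.1f}% agreement")  # 87.5% agreement
```

Here the student with a 66 lands in the lowest cluster and is counted as a disagreement, which gives 7 matches out of 8.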

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 6 (3 pts each part)

Below is some code to compute the purity of a clustering schema for 5 clusters. We introduced this in our second lecture on clustering.

In [None]:
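## Provided Code
# Note: `predict` and `y` are expected to be NumPy arrays of integer labels in
# 0..num_clusters-1 (encode letter grades as integers before calling).
def clusters_stats(predict, y, num_clusters=5):
    """Return one row per cluster: (majority count, cluster size, majority class)."""
    stats = np.zeros((num_clusters, 3))
    for i in range(num_clusters):
        indices = np.where(predict == i)
        cluster = y[indices]
        stats[i, :] = clust_stats(cluster, num_clusters)
    return stats

def clust_stats(cluster, num_clusters=5):
    """Count the true classes inside a single cluster."""
    class_freq = np.zeros(num_clusters)
    for i in range(num_clusters):
        class_freq[i] = np.count_nonzero(cluster == i)
    most_freq = np.argmax(class_freq)
    n_majority = np.max(class_freq)
    n_all = np.sum(class_freq)
    return (n_majority, n_all, most_freq)

def clusters_purity(stats, num_clusters=5):
    """Overall purity: total majority count divided by total number of points."""
    majority_sum = stats[:, 0].sum()
    n = stats[:, 1].sum()
    return majority_sum / n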

### Question 6 Part A (3 pts)

Compute the overall purity of your k-means clustering schema using the code above.
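To make the quantity concrete, here is a compact, independent sketch of purity on made-up integer labels. The helper name `purity_sketch` is hypothetical and is not part of the provided code; it assumes the true classes are already encoded as non-negative integers.

```python
import numpy as np

def purity_sketch(predict, y):
    """Fraction of points whose true class is the majority class of their cluster."""
    majority_total = 0
    for c in np.unique(predict):
        # Count how often each true class occurs inside cluster c.
        counts = np.bincount(y[predict == c])
        majority_total += counts.max()
    return majority_total / len(y)

# Made-up example: 3 clusters, true classes encoded as integers 0..2.
predict = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 1, 1, 2, 0])

print(purity_sketch(predict, y))  # (2 + 3 + 1) / 8 = 0.75
```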

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 6 Part B (3 pts)

Using the markdown cell below, calculate the purity of each predicted k-means cluster and verify the above calculation of the overall purity. For example, with random_state=0, k-means cluster label 3 corresponds entirely to As so it has 100% purity. On the other hand, cluster label 1 has a mix of As and Bs and should have a cluster purity of roughly 70%.
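One way to organize the verification is sketched here. The symbols are introduced only for illustration: $m_k$ is the majority count of cluster $k$ (column 0 of the `clusters_stats` output), $n_k$ is its size (column 1), and $N = \sum_k n_k$ is the total number of students.

$$p_k = \frac{m_k}{n_k}, \qquad \text{purity} = \frac{\sum_{k} m_k}{\sum_{k} n_k} = \sum_{k} \frac{n_k}{N}\, p_k$$

In other words, the overall purity is the size-weighted average of the per-cluster purities $p_k$, not their plain mean, so large clusters count for more.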

YOUR ANSWER HERE

## Question 7 (8 pts as distributed below)

Repeat Questions 2 through 6 with $K=13$, which corresponds to the 13 possible *decorated* letter grades $\{A+,A,A-,B+,B,B-,C+,C,C-,D+,D,D-,F\}$. 

This means:

- Run $K=13$-means clustering
- Create a scatter plot where the points are colored by the 13 possible prediction values given by the algorithm, with weighted GPA as the y-coordinate.
- Create a new Pandas DataFrame with 'raw grade', 'raw letter' and a new column 'K-Means Letter' that records the label number predicted by the K-means algorithm.
- Compute the percentage of agreement between the professor's *decorated* letter grades and the determination by K-means, for $K=13$.
- Compute the purity for each of the 13 predicted clusters as well as the overall purity.


### Question 7.2: Run k=13 means (1 pt)

Make sure to use `random_state=0`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 7.3 Plot (2 pts)

Separate out the y values based on the letter grade by using weighted GPA or 0-12.
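Both encodings mentioned above can be sketched with plain dictionaries. The ordinal codes follow the order of the 13 grades listed in Question 7; the weighted GPA values (offsets of 0.3, so that $A+$ is 4.3) are an assumption, since institutions differ on them. The helper `weighted_gpa` is a hypothetical name.

```python
# The 13 decorated grades from best to worst, as listed in Question 7.
decorated = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]

# Option 1: ordinal codes 0..12, with 12 for A+ and 0 for F.
ordinal = {letter: len(decorated) - 1 - i for i, letter in enumerate(decorated)}

# Option 2: a weighted GPA. The +/- 0.3 offsets are an assumption.
base = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
offset = {"+": 0.3, "-": -0.3, "=": -0.3}  # '=' is this data set's minus sign

def weighted_gpa(letter):
    """Base points of the letter plus the offset of its decoration, if any."""
    points = base[letter[0]]
    if len(letter) > 1:
        points += offset[letter[1]]
    return round(points, 1)

print(ordinal["A+"], ordinal["B-"], ordinal["F"])                   # 12 7 0
print(weighted_gpa("A+"), weighted_gpa("B="), weighted_gpa("F"))    # 4.3 2.7 0.0
```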

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 7.4 (1 pt)

Append to your existing DataFrame, or create a new one, so that it has a 'K-Means Letter' column recording the label number predicted by K-means for $K=13$.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 7.5 (2 pts)

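Compute the percentage of agreement between the professor's *decorated* letter grades and the K-means predictions for $K=13$.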

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 7.6A (1 pt)

Compute overall purity score using functions above.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question 7.6B (1 pt)

Verify the above purity score by computing the purity of each individual cluster. Write this in a markdown cell below:

YOUR ANSWER HERE