# Assignment 2

(๑• .̫ •๑)

Your last pokemon adventure went well, but you aren't quite the very best like no one ever was. Faithful to your data scientist ways, you decide to further analyse your pokedex to improve your training.

The data can be found under `pokedex/pokemons.csv`, and is the same as in assignment 1. Run the cell below to get an overview of the dataset:

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('pokedex/pokemons.csv')
df.head()

## Problem 1

Analysing and grouping "smart" pokemons by `Type 1` wasn't very successful last assignment: we got a headache from trying to train a Psyduck. Since then, however, we have learnt a powerful unsupervised learning method for analysing **clusters** in our datasets.

💪 **Task: Use k-Means clustering to find 4 clusters in the pokemon dataset, and store the predictions in a vector called `y_kmeans`.**  
Pro-tip 1: You should only take into account the `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, `Speed`, and `HP` columns.  
Pro-tip 2: Please use the `random_state=42` argument when constructing your sklearn class, to make sure your results are reproducible. Marks won't be taken off for using the wrong random seed, but the unit tests won't pass!  
Pro-tip 3: We have seen in lectures that sklearn expects NumPy `ndarray`s as arguments to its training and prediction methods. Whilst that is true, it also accepts pandas `DataFrame`s directly and converts them to `ndarray`s internally. You can use whichever you prefer.
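
As a reminder of the general k-Means workflow in sklearn, here is a minimal sketch on made-up data. The `toy` frame, its columns `a` and `b`, and the choice of 2 clusters are assumptions for illustration only; this is not the assignment solution.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data with two obvious groups (hypothetical columns `a` and `b`).
toy = pd.DataFrame({
    'a': [1.0, 1.2, 0.8, 9.0, 9.3, 8.7],
    'b': [2.0, 1.9, 2.2, 8.0, 8.1, 7.8],
})

# n_init is set explicitly because its default differs between sklearn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=42)

# fit_predict learns the centroids and returns one cluster label per row.
toy_labels = model.fit_predict(toy[['a', 'b']])

print(toy_labels)
```

The cluster numbers themselves are arbitrary: what matters is which rows share a label.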

In [None]:
# INSERT YOUR CODE HERE

In [None]:
def test_kmeans():
    assert len(y_kmeans) == 800, f'The size of your prediction vector is wrong: {len(y_kmeans)}. There should be 800, one per pokemon.'
    unique_clusters = len(np.unique(y_kmeans))
    assert unique_clusters == 4, f'There should be 4 unique clusters, your prediction vector has {unique_clusters}'
    assert y_kmeans.mean() == 1.5025, f'Something is not quite right with your prediction vector. Have you used a random seed of 42?'
    print('Success! 🎉')
    return

test_kmeans()

## Problem 2

Now that we have clustered our pokemons, we'd like to explore these groups. Specifically, we'd like to know the mean stats of each cluster, so we can compare their average strengths and weaknesses.


💪 **Task: Group the pokemons by cluster, and calculate the mean statistics of each group. Save this in a `DataFrame` called `cluster_means`. For example, you should be able to clearly read the average `Defense` of cluster 2 in your `cluster_means` `DataFrame`.**   
Pro-tip 1: Adding a `Cluster` column to `df` will allow you to work on a single `DataFrame` and make the task much easier 🙃  
Pro-tip 2: You should only expect numerical columns in `cluster_means`, since the mean of a string is undefined.
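
The group-then-average pattern can be sketched on a tiny made-up table. The `toy` frame and its values are hypothetical; `numeric_only=True` is used here as one way to skip string columns, and selecting the numeric columns by hand would work too.

```python
import pandas as pd

# Made-up data: four creatures with a string column and two numeric stats.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Attack': [10, 20, 30, 50],
    'Defense': [5, 15, 40, 60],
})

# Hypothetical cluster labels, one per row.
toy['Cluster'] = [0, 0, 1, 1]

# One row per cluster, one column per numeric stat; `Name` is dropped.
toy_means = toy.groupby('Cluster').mean(numeric_only=True)

print(toy_means)
```

With the cluster as the index, a single value can be read with `toy_means.loc[1, 'Defense']`.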

In [None]:
# INSERT YOUR CODE HERE

In [None]:
import math

def test_cluster_means():
    assert len(cluster_means) == 4, f'Your dataframe has {len(cluster_means)} rows, but 4 are expected: one per cluster'
    assert 'Attack' in cluster_means.columns, f'Your dataframe should contain the Attack column'
    assert math.isclose(cluster_means.values.sum(), 5276.0872, rel_tol=1e-5), f'Something is not quite right with your cluster means. Have you used a random seed of 42?'
    print('Success! 🎉')
    return cluster_means
    
test_cluster_means()

🧠 **Bonus Question: Inspect the clusters and their traits. What do you think the clusters represent? Try to identify what makes each cluster stand out and qualitatively describe the "identity" of each cluster.**

ℹ️ Notice how building these kinds of clustered "profiles" goes beyond what we could have done just by manipulating the `DataFrame`. Last assignment, we split the pokemons by type, but k-Means groups them by how close their stats are to each other, which creates more natural groupings.

## Problem 3

We're getting an idea of what our clusters represent, and how their distributions vary. However, we have recently acquired data visualization powers ⚡️, so we'd like to visualize these differences. 

💪 **Task: Visualize some aspect of `cluster_means`. Feel free to focus on a particular column, or to aggregate some of the data. The graph should show some differences between the clusters. Be creative!**   
Pro-tip 1: Don't overthink the chart content, you will mostly be graded on healthy visualization practices.  
Pro-tip 2: Try to use the matplotlib API instead of the `DataFrame.plot` method built into pandas. This should give you more control and allow you to create a more effective visualization.  
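
For reference, a grouped bar chart built directly with the matplotlib API might look like the sketch below. The `toy_means` values are made up, and the chart type is just one possible choice, not the expected answer.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up per-cluster means (hypothetical values, not the real pokedex results).
toy_means = pd.DataFrame(
    {'Attack': [50.0, 90.0, 70.0], 'Defense': [60.0, 80.0, 110.0]},
    index=pd.Index([0, 1, 2], name='Cluster'),
)

fig, ax = plt.subplots(figsize=(6, 4))

x = np.arange(len(toy_means))          # one slot per cluster
width = 0.8 / len(toy_means.columns)   # share each slot between the stats

# Draw one set of bars per stat, shifted sideways so they sit next to each other.
for i, stat in enumerate(toy_means.columns):
    ax.bar(x + i * width, toy_means[stat], width=width, label=stat)

# Centre the tick labels under each group of bars.
ax.set_xticks(x + width * (len(toy_means.columns) - 1) / 2)
ax.set_xticklabels([f'Cluster {c}' for c in toy_means.index])
ax.set_ylabel('Mean stat value')
ax.set_title('Mean stats per cluster (toy data)')
ax.legend(title='Stat')
# In a notebook the figure renders when the cell finishes; in a script, call plt.show().
```

Labelled axes, a title, and a legend are the kind of "healthy visualization practices" worth keeping whatever chart you pick.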

In [None]:
# INSERT YOUR CODE HERE

🧠 **Bonus Question: Why did you choose this data to plot? Why did you represent it in this particular way?**

## Problem 4

We have shown differences in the cluster average statistics with a beautiful graph. Now, we want to visualize the cluster assignments of ALL of the data. However, we have six "stats" columns, and even the world of pokemon is only three dimensional... Prepare for trouble, and make it double, it's time for dimensionality reduction!

💪 **Task: Reduce the dimensions of the pokemon dataset using PCA. Store the principal components in a NumPy `ndarray` called `components`. The unit test will call the `plot_PCA()` function to display the data points and their color-coded cluster assignments.**   
Pro-tip 1: You should only use the numerical columns: `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, `Speed`, and `HP`.   
Pro-tip 2: Think of how many dimensions you must reduce the dataset to, so that we are able to visualize it. It's the same as we did in class!  
Pro-tip 3: Please use the `random_state=42` argument when constructing your sklearn class, to make sure your results are reproducible. Marks won't be taken off for using the wrong random seed, but the unit tests won't pass!  
Pro-tip 4: We have seen in lectures that sklearn expects NumPy `ndarray`s as arguments to its training and prediction methods. Whilst that is true, it also accepts pandas `DataFrame`s directly and converts them to `ndarray`s internally. You can use whichever you prefer.  
Pro-tip 5: The `plot_PCA()` function uses the `y_kmeans` predictions to pick marker colors. Make sure you have finished problem 1 and run its cells so that `y_kmeans` is available here.
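
As a refresher, the sklearn PCA workflow can be sketched on random made-up data. The `toy` array and the choice of 3 components are arbitrary assumptions for illustration; choosing the right number for this problem is left to you.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 points with 5 features (hypothetical, not the pokedex).
rng = np.random.default_rng(42)
toy = rng.normal(size=(100, 5))

# n_components is the number of dimensions kept after the projection.
pca = PCA(n_components=3, random_state=42)

# fit_transform learns the principal axes and projects every row onto them.
toy_components = pca.fit_transform(toy)

print(toy_components.shape)           # (100, 3)
print(pca.explained_variance_ratio_)  # share of variance captured by each axis
```

The result is a plain NumPy `ndarray` with one row per data point and one column per kept component.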


In [None]:
# INSERT YOUR CODE HERE

In [None]:
import math

import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

def plot_PCA(components):
    # assign a color to each prediction
    colors = ['blue', 'red', 'green', 'orange']
    features_colors = [colors[y] for y in y_kmeans]

    # plot the PCA components
    fig = plt.figure()
    ax = fig.add_subplot(111)  # an integer, not a string: recent matplotlib rejects '111'
    ax.scatter(components[:, 0], components[:, 1],
                c=features_colors, marker='o',
                alpha=0.4)
    
    ax.set_title('PCA visualization of pokemon k-Means clusters')

    legends = [legend(i, c) for i, c in enumerate(colors)]
    ax.legend(handles=legends, loc='upper left')
    
    plt.show()

def legend(i, color):
    return Line2D([0], [0], marker='o', color='w', label=f'Cluster {i}',
                  markerfacecolor=color, markersize=8)

def test_pca():
    rows, columns = components.shape
    assert columns == 2, f'Your components have {columns} dimensions. In order to visualise the data, we expect 2 dimensions.'
    assert rows == 800, f'Your components have {rows} data points, but 800 are expected, one per pokemon.'
    assert math.isclose(components[42, 1], -18.321118, rel_tol=1e-5), f'Something is not quite right with your dimensional reduction. Have you used a random seed of 42?'
    print('Success! 🎉')
    plot_PCA(components)
    
test_pca()

🧠 **Bonus Question: Do you think this matches the results of problem 2? Why? What do the 2 principal axes seem to represent?**

## Problem 5

An old man once told you how to catch Weedles. 🐛 But he also said that winning battles comes down to unique fighting styles. We want to find the pokemons that stand out the most from the rest.

💪 **Task: Use Gaussian distribution anomaly detection to identify the top 1% most unique pokemons. Use the resulting predictions vector to filter our `df` `DataFrame`, and save the outlier pokemons in a new `DataFrame` called `outliers`.**   
Pro-tip 1: You should only use the numerical columns: `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, `Speed`, and `HP`.   
Pro-tip 2: Please use the `random_state=42` argument when constructing your sklearn class, to make sure your results are reproducible. Marks won't be taken off for using the wrong random seed, but the unit tests won't pass!  
Pro-tip 3: We have seen in lectures that sklearn expects NumPy `ndarray`s as arguments to its training and prediction methods. Whilst that is true, it also accepts pandas `DataFrame`s directly and converts them to `ndarray`s internally. Use whichever you prefer.  
Pro-tip 4: Remember that the `contamination` argument sets the proportion of our dataset we expect to be outliers.  
Pro-tip 5: It could help to add the predictions in an `Outlier` column to the original `df`, to make the filtering of the anomalous pokemons easier 🙃 
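
The predict-then-filter pattern can be sketched on made-up data. This assumes the Gaussian anomaly detection from lectures corresponds to sklearn's `EllipticEnvelope`, which the assignment does not state explicitly; the `toy` frame, its columns, and the 5% contamination are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

# Made-up data: 200 roughly Gaussian points, plus one planted anomaly in row 0.
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(200, 2)), columns=['x', 'y'])
toy.loc[0, ['x', 'y']] = [8.0, -8.0]

# contamination is the fraction of rows we expect to be outliers.
detector = EllipticEnvelope(contamination=0.05, random_state=42)

# fit_predict returns 1 for inliers and -1 for outliers.
toy['Outlier'] = detector.fit_predict(toy[['x', 'y']])

# Keep only the rows flagged as outliers.
toy_outliers = toy[toy['Outlier'] == -1]

print(toy_outliers)
```

The planted point in row 0 should appear among the flagged rows, alongside the most extreme of the random points.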


In [None]:
# INSERT YOUR CODE HERE

In [None]:
def test_anomaly_detection():
    assert len(outliers) == 8, f'You found {len(outliers)} outliers, but we expected 800 * 1% = 8' 
    assert outliers['Total'].sum() == 4284, f'Something is not quite right with your anomaly detection. Have you used a random seed of 42?'
    print('Success! 🎉')
    return outliers
    
test_anomaly_detection()

🧠 **Bonus Question: Is this what you expected? Can you explain why these pokemons are outliers? Can you spot a pattern?**