# The Sorting Hat 
The Sorting Hat assigns new witches and wizards to the houses they will belong to for the remainder of their stay at Hogwarts.

![Sorting Hat](https://vignette.wikia.nocookie.net/harrypotter/images/6/62/Sorting_Hat.png/revision/latest?cb=20161120072849)

Presumably, the Hat uses some **prior knowledge** to inform its decision about where a student will best fit in. From a Machine Learning perspective, we can view this process as a **classification** task: given some **labeled** data (for example, information about previous Hogwarts students, the **data**, and which house each belonged to, the **label/class**), can we build a model that can **predict** which house a new student belongs to?

### In order to understand the magical ways of the Hat, we will perform the following:
1. Generate a dataset *...mmhhm...* I mean survey some previous Hogwarts students
2. Do some basic visualization to investigate how our features separate the classes
3. Teach a machine learning model about previous students
4. Predict your house and visualize where you stand relative to past students!

In an attempt to organize things a bit, I've put some functions in a script called [sortinghat_functions.py](https://github.com/michaelsilverstein/TheSortingHat/blob/master/sortinghat_functions.py).

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sortinghat_functions as sh

houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
palette = dict(zip(houses, ['red', 'gold', 'lightblue', 'green']))
```

# (1) Generate dataset
The Sorting Hat itself has been endowed with years upon years of knowledge about different students. Unfortunately, the Hat wasn't available to send me all of its data, so we will have to generate it ourselves. In my opinion, generating datasets is a valuable exercise in and of itself. A machine learning dataset consists of a few components: **samples** (in this case, students), **features** (measured characteristics), and **class labels** (in this case, the house each student belonged to). In general, a machine learning **training set** (the dataset containing the previous measurements we wish to learn from) looks like this:

| Sample | Feature 1 | $\cdot\cdot\cdot$ | Feature N | Class |
| --- | --- | --- | --- | --- |
| Sample$_1$ | Observation$_{1,1}$ | $\cdot\cdot\cdot$ | Observation$_{1,N}$ | Class$_1$ |
|  $\cdot\cdot\cdot$ |  $\cdot\cdot\cdot$ | $\cdot\cdot\cdot$ | $\cdot\cdot\cdot$| $\cdot\cdot\cdot$ |
| Sample$_M$ | Observation$_{M,1}$ | $\cdot\cdot\cdot$ | Observation$_{M,N}$ | Class$_M$ |
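
As a concrete illustration of that layout, here is a minimal sketch of a tiny training set in pandas. The feature names (`bravery`, `ambition`) and every value are made up for the example; they are not the Hat's data.

```python
import pandas as pd

# Three made-up students (samples), two made-up features, one class label each
toy = pd.DataFrame(
    {
        'bravery': [9.1, 3.2, 4.8],
        'ambition': [4.0, 2.5, 9.3],
        'house': ['Gryffindor', 'Hufflepuff', 'Slytherin'],
    },
    index=['student_1', 'student_2', 'student_3'],
)

# Models are usually given the features (X) and the class labels (y) separately
X = toy.drop(columns='house')
y = toy['house']
print(X.shape, y.shape)  # (3, 2) (3,)
```

Each row is one sample, each feature column is one measured characteristic, and the `house` column holds the class label.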

Now, we have to imagine that the Sorting Hat has gathered information on all sorts of features, some of which will have more discriminatory power than others. For example, below is the height distribution of the students from each house. 
```python
"""Generate height example"""
# Assume same mean height, standard deviation, and class size for each house
mean_height = 5*12 + 7  # 5 ft 7 in, expressed in inches
std = 5
n = 20
# The seed for a random process establishes where the process "starts from". This allows us to
# reproduce the same "random" data every time the code is run.
random_seed = 1

data = sh.generate_feature(mean_height, std, n, houses, random_seed)
df = pd.DataFrame(data, columns=['height', 'house'])
g = sns.FacetGrid(df, hue='house', aspect=2.5)
# `fill=True` replaces the deprecated `shade=True` (requires seaborn >= 0.11)
g.map(sns.kdeplot, 'height', fill=True).add_legend(title='House')
plt.xlabel('Height (in)')
plt.yticks([])
plt.show()
```
![Height distribution](figures/height_dist.png) 

As we can see, this feature (height) does not provide much discriminatory power between the different classes: if all we knew about the students was their height, we would have very little ability to distinguish which ones belonged to which house. Below we will generate data for some more features we believe the Hat may have observed.
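
Before moving on, here is a rough way to put a number on how little height tells us. This is only a sketch under stated assumptions: `simulate_feature` is a hypothetical stand-in for `sh.generate_feature` (I am assuming every house draws from the same normal distribution, which may not match the helper script exactly), and 67 inches is just an illustrative mean. The "classifier" simply assigns each student to the house whose average height is closest to their own.

```python
import numpy as np
import pandas as pd

houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']

def simulate_feature(mean, std, n, houses, seed):
    """Hypothetical stand-in for sh.generate_feature: n normal draws per house."""
    rng = np.random.default_rng(seed)  # seeding makes the draws reproducible
    rows = [(value, house) for house in houses for value in rng.normal(mean, std, n)]
    return pd.DataFrame(rows, columns=['height', 'house'])

df_sim = simulate_feature(mean=67, std=5, n=20, houses=houses, seed=1)

# Average height per house; these should all land close to one another
house_means = df_sim.groupby('house')['height'].mean()

# Nearest-mean "classifier": pick the house whose mean height is closest
heights = df_sim['height'].to_numpy()
dist = np.abs(heights[:, None] - house_means.to_numpy()[None, :])
predicted = house_means.index.to_numpy()[dist.argmin(axis=1)]

accuracy = float((predicted == df_sim['house'].to_numpy()).mean())
print(house_means.round(2))
print(f"Nearest-mean accuracy: {accuracy:.2f} (chance is {1 / len(houses):.2f})")
```

Because all four houses share the same underlying distribution here, any gap between the accuracy and the 25% chance level comes from sampling noise (and from scoring on the same students used to compute the means), not from real signal in the feature.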