Copyright 2020 Andrew M. Olney, Dale Bowman and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Clustering: Problem solving

In this session, you will look at a dataset of teeth from different animals.

Each row contains the name of an animal with measurements of these variables for teeth:

| Variable | Type  | Description                    |
|:----------|:-------|:--------------------------------|
| Name     | Nominal | the name of the animal         |
| I        | Ratio | the number of top incisors     |
| i        | Ratio | the number of bottom incisors  |
| C        | Ratio | the number of top canines      |
| c        | Ratio | the number of bottom canines   |
| P        | Ratio | the number of top premolars    |
| p        | Ratio | the number of bottom premolars |
| M        | Ratio | the number of top molars       |
| m        | Ratio | the number of bottom molars    |

From *Dentition of Mammals*, Hartigan (1975), p. 170.

First, you will cluster the data using hierarchical clustering (a dendrogram), followed by k-means clusters with different sizes of *g*.

## Hierarchical clustering

We need to load the data into a dataframe, so start with importing `pandas`.

Now load the data in `"datasets/teeth.csv"` into a dataframe, remembering that `Name` is an ID column.
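As a rough sketch of this step, the block below reads a CSV with `Name` as the index. Passing `index_col="Name"` is one way to treat `Name` as an ID column, and it is an assumption here rather than the required solution. So that the sketch runs without the file, it reads three hypothetical stand-in rows (`animal_a` and so on, with made-up counts) from a string; the real call is shown in a comment.

```python
import io

import pandas as pd

# Hypothetical stand-in rows so the sketch runs without the file.
# These values are illustrative only and are NOT taken from teeth.csv.
csv_text = """Name,I,i,C,c,P,p,M,m
animal_a,2,3,1,1,3,3,3,3
animal_b,3,3,1,1,4,4,2,3
animal_c,1,1,0,0,2,1,3,3
"""

# With the real file, the call would look like:
#   dataframe = pd.read_csv("datasets/teeth.csv", index_col="Name")
dataframe = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# Name is now the row index, so only the numeric tooth counts remain as columns.
print(dataframe)
```

Keeping `Name` out of the columns means every remaining column is numeric, which is what the clustering steps later in the session expect.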

Now create a dendrogram using this dataframe.

Start by importing `plotly.figure_factory` and `scipy.cluster.hierarchy`.

Now create a dendrogram using a `linkagefun` like before.

If it's hard to read because of the size, try using the plot interactive tools that appear in the top right of the plot when you hover over it.

-----------------
**QUESTION:**

At what point on the y-axis (between 0 and 7) would you draw a horizontal line to get the best clusters?
Why wouldn't you draw it higher or lower?
How many clusters would this give you?

**ANSWER: (click here to edit)**


<hr>

## K-means

Do K-means with this same data, using the number of clusters you identified above using the dendrogram.

First import `sklearn.cluster`.

Create a `KMeans` with your number of clusters and store it in a variable.

Using `fit_predict`, get the clusters and display them.

Add these predictions as a new column, `cluster`, in your dataframe, converting them to `str` type for plotting:
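The three K-means steps above can be sketched as follows. The cluster count of 3, the `n_init` and `random_state` settings, and the toy dataframe are all assumptions for illustration; use the number of clusters you chose from your own dendrogram.

```python
import pandas as pd
import sklearn.cluster as sc

# Hypothetical toy data standing in for the teeth dataframe (values are made up).
dataframe = pd.DataFrame(
    {"I": [3, 3, 0, 0, 1, 1], "P": [4, 4, 3, 2, 0, 0], "M": [3, 2, 3, 3, 1, 0]},
    index=["animal_a", "animal_b", "animal_c", "animal_d", "animal_e", "animal_f"],
)

# n_clusters=3 is a placeholder; random_state only makes the run repeatable.
kmeans = sc.KMeans(n_clusters=3, n_init=10, random_state=0)

# fit_predict fits the model and returns one integer label per row.
clusters = kmeans.fit_predict(dataframe)
print(clusters)

# Store the labels as strings so plotting treats them as categories, not numbers.
dataframe["cluster"] = clusters.astype(str)
```

Note that the labels are computed before the `cluster` column is added, so the labels themselves are never used as a clustering feature.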

Take a look at your clusters with this trick to show the whole thing:

- `print with dataframe do to_string using`
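In plain Python, this trick amounts to printing the result of `to_string`, which renders every row instead of the shortened preview. The sketch below uses a hypothetical 100-row dataframe purely to show that nothing is cut off.

```python
import pandas as pd

# Hypothetical 100-row dataframe (made-up values), long enough that the
# default display would shorten it with "...".
dataframe = pd.DataFrame({"I": range(100), "cluster": ["0"] * 100})

# to_string renders the whole dataframe as text, with no rows elided.
text = dataframe.to_string()
print(text)
```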

-----------------
**QUESTION:**

For each one of your clusters, what kind of animal does the cluster correspond to?

**ANSWER: (click here to edit)**


<hr>

### Scatterplots

Make 3 scatterplots

- I vs. P
- I vs. M
- P vs. M

and color the points by cluster in each one.
This is necessary because our data has many dimensions, but our plots only have two dimensions.

First, import `plotly.express`.

Create the I vs P scatterplot, but remove `ols`.

-----------------
**QUESTION:**

Mouse over each dot to see what cluster it belongs to (it will be darker if there are many datapoints under it).
Which clusters are well separated? 
Which are not?

**ANSWER: (click here to edit)**


<hr>

Create the I vs M scatterplot.

-----------------

**QUESTION:**

Mouse over each dot to see what cluster it belongs to (it will be darker if there are many datapoints under it).
Which clusters are well separated? 
Which are not?

**ANSWER: (click here to edit)**


<hr>

Create the P vs M scatterplot.

-----------------

**QUESTION:**

Mouse over each dot to see what cluster it belongs to (it will be darker if there are many datapoints under it).
Which clusters are well separated? 
Which are not?

**ANSWER: (click here to edit)**


<hr>

### Summary

Now consider all your plots and clusters.

-----------------
**QUESTION:**

Do you still think your number of clusters is the best? 
Why or why not?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

If any of your scatterplot clusters were not well separated, does that concern you?

**ANSWER: (click here to edit)**


<hr>
