## Side notes 
_(code snippets, summaries, resources, etc.)_
- Exact videos from Lesson 8 in Intro to Machine Learning.
- [Notes from lesson](https://www.evernote.com/shard/s37/nl/1033921319/f451c63d-723b-48cb-ab71-803e23322059/) copied from Evernote, straight from html to markdown auto-conversion.

__Resources:__
- [Visualization of clustering](http://www.naftaliharris.com/blog/visualizing-k-means-clustering)
- Sklearn's [Overview of Clustering](http://scikit-learn.org/stable/modules/clustering.html)
- [K-Means User Guide](http://scikit-learn.org/stable/modules/clustering.html#k-means)
- Sklearn's [comparison table of the clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html) (useful summary)

# Clustering

## Unsupervised Learning

__definition:__ Finding structure in data without labels 

### Example: Clustering Movies

Labels can be inferred from the clusters after the fact (e.g. which genres match which clusters)

<img src="clustering_images/A2FF619E-6D13-48C0-A31E-687CE9B0FFF4.png" width="653" height="365" />

### K-means algorithm

**Algorithm steps:** (that run iteratively)
1.  Assign (or associate)
2.  Optimize
3.  Repeat

__1) Assign:__ Starting from randomly placed cluster centers, assign each point to its nearest center
- Quiz: Which points are closer to Centroid 1 than Centroid 2 (green point lower than other)?
    - To determine: draw the perpendicular bisector of the line connecting the two randomly chosen test centers (green points)
    - Points above this line are closer to Centroid 1, points below it to Centroid 2

<img src="clustering_images/82935E96-CAFA-462A-8136-9FE08067BA34.png" width="417" height="335" />
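The assign step can be sketched in a few lines of NumPy. This is only an illustration, not code from the lesson: the function name `assign_to_nearest` and the sample points are made up. Picking the nearest center by distance gives the same answer as checking which side of the perpendicular bisector a point falls on.

```python
import numpy as np

def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest center."""
    # distances[i, j] = Euclidean distance from point i to center j
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Made-up 2-D points and two made-up centers (hypothetical values)
points = np.array([[1.0, 4.0], [2.0, 5.0], [4.0, 1.0], [5.0, 2.0]])
centers = np.array([[1.5, 4.5], [4.5, 1.5]])

labels = assign_to_nearest(points, centers)
print(labels)  # index 0 = first center, index 1 = second center
```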

__2) Optimize:__ Move each cluster center so that the total *quadratic distance* (or *quadratic error*) between the center and its assigned points is minimized

- Think of lines between points to appropriate cluster centers as rubber bands whose total energy/length must be minimized
- Quiz, same question but for Centroid 2:

<img src="clustering_images/80EC9A9B-73DE-448C-BF40-03FAACC6B29F.png" width="417" height="318" />
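A minimal sketch of the optimize step, assuming the assignments are already known. For squared (quadratic) distance, the center that minimizes the total error for a cluster is the mean of its points. The helper names and sample data below are hypothetical, and the sketch assumes every cluster has at least one point.

```python
import numpy as np

def update_centers(points, labels, n_clusters):
    """Move each center to the mean of the points assigned to it."""
    return np.array([points[labels == k].mean(axis=0) for k in range(n_clusters)])

def total_squared_distance(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return float(((points - centers[labels]) ** 2).sum())

# Made-up points, assignments, and starting centers (hypothetical values)
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
old_centers = np.array([[0.0, 0.0], [12.0, 0.0]])

new_centers = update_centers(points, labels, n_clusters=2)
error_before = total_squared_distance(points, labels, old_centers)
error_after = total_squared_distance(points, labels, new_centers)
print(new_centers)
print(error_before, error_after)  # the error should not increase
```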

__3) Iteratively Assigning and Optimizing:__ Repeat the assign and optimize steps until the centers stop moving, to reach this optimized/assigned result

<img src="clustering_images/32C8089B-3EC8-4DD0-AFEA-5495AC0FEB1C.png" width="653" height="333" />
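Putting the steps together, the full loop might look roughly like the sketch below. It is a bare-bones illustration rather than the scikit-learn implementation: the function name `k_means`, the stopping rule (stop when the centers no longer move), and the sample data are all assumptions made for this example.

```python
import numpy as np

def k_means(points, initial_centers, max_iter=100):
    """Alternate assign and optimize until the centers stop moving."""
    centers = initial_centers.astype(float)
    for _ in range(max_iter):
        # Assign: each point goes to its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Optimize: each center moves to the mean of its points
        # (a center with no points stays where it is)
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break  # converged: another pass would change nothing
        centers = new_centers
    return labels, centers

# Made-up data: two small groups, with deliberately rough starting centers
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
initial_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

labels, centers = k_means(points, initial_centers)
print(labels)
print(centers)
```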

### K-Means Cluster Visualization

Useful tool to play around with: <http://www.naftaliharris.com/blog/visualizing-k-means-clustering/>

__Example 1__
<img src="clustering_images/F5B44326-4A4D-497F-994D-3F16CBF8D425.png" width="571" height="301" />

__Example 2__
- Using Uniform Points selection on site
- K-means can give a mathematical description of messy data
    - Not sure how many starting centroids, so starting with two.

After two assignments/optimizations:
<img src="clustering_images/ADB399C4-C600-4BF7-B040-7170EE545AF6.png" width="575" height="301" />

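Initial placement of the centroids can have a large impact on where k-means ends up clustering the data
- Easy way to mitigate this problem: run the algorithm several times from different starting centroids (see `n_init` under Sklearn)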

## Sklearn

[Overview of Clustering](http://scikit-learn.org/stable/modules/clustering.html)

Comparison of different clustering algorithms in scikit-learn:
-   <img src="clustering_images/3E0DA856-3336-460F-ACF8-F302198947A2.png" width="803" height="647" />

K-Means:
- challenging to decide how many clusters we want to try

### [`sklearn.cluster.KMeans()`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)

**Important parameters:**
- `n_clusters` (most important)
    - almost always change the default value of 8
- `max_iter`
    - maximum number of iterations of the assignment/optimization steps in a single run
    - Default is often appropriate
- `n_init`
    - how many times the algorithm re-initializes with different starting centroids (the best run is kept)
    - set it higher than the default of 10 for data prone to the problem of initial points producing vastly different clusterings (note: newer scikit-learn releases default to `n_init='auto'` rather than 10)
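As a rough illustration of these parameters, the sketch below fits `KMeans` on made-up data. The data, the parameter values, and `random_state=0` are assumptions chosen for the example, not values from the lesson.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated groups of 2-D points
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])

kmeans = KMeans(
    n_clusters=2,    # number of clusters to look for
    max_iter=300,    # cap on assign/optimize iterations per run
    n_init=10,       # number of runs from different starting centroids
    random_state=0,  # fixed seed so the example is repeatable
)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid coordinates
print(kmeans.inertia_)          # total squared distance to the closest centroid
```

Setting `n_init` explicitly also keeps the behaviour the same across scikit-learn versions whose defaults differ.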

### Challenges/limitations with K-Means

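__definition:__ Local minimum: a stable clustering that K-Means converges to, but which is not necessarily the best possible clustering
- K-Means is referred to as a “_(local) hill-climbing algorithm_”: its output depends on where the centroids start, so the algorithm will not always provide the same clustering output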
- E.g. This stable solution is a possible output:
    - Two clusters will “fight” for the same points
    - Because initial assignment could bias output

<img src="clustering_images/15686A7D-D7AB-4732-852D-68649F08D79E.png" width="571" height="324" />

Quiz: Could there be a poor local minimum result of k-means in this example?
- Yes, but unlikely
- Rule of thumb: More clusters, more local minima (so we are forced to run the algorithm multiple times)

<img src="clustering_images/B801BE80-FFCB-4716-9C3A-3C21CC3E0277.png" width="563" height="318" />
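One way to see the local-minimum issue is to run K-Means once per seed with purely random initialization and compare the resulting errors. The sketch below does that on made-up data. The three blob centers, the seeds, and the choice of `init="random"` with `n_init=1` are assumptions for illustration; the inertias may or may not differ for any particular data set.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three blobs of 2-D points
rng = np.random.RandomState(42)
true_centers = [(0, 0), (6, 0), (3, 5)]
X = np.vstack([rng.normal(c, 0.6, size=(40, 2)) for c in true_centers])

# One run per seed, each from a single random initialization
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)

best_seed = int(np.argmin(inertias))
print(inertias)   # differing values would indicate different local minima
print(best_seed)  # the run with the lowest total squared distance
```

Keeping the run with the lowest inertia is essentially what a larger `n_init` does automatically.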

## Mini-Project! with K-Means and Enron financial data

**Overview**
- "In this project, we’ll apply k-means clustering to our Enron financial data. Our final goal, of course, is to identify persons of interest; since we have labeled data, this is not a question that particularly calls for an unsupervised approach like k-means clustering.
- Nonetheless, you’ll get some hands-on practice with k-means in this project, and play around with feature scaling, which will give you a sneak preview of the next lesson’s material."

**Initial scatterplot**

- with salary as the input variable, and exercised\_stock\_options as the output variable

<img src="clustering_images/C48B1B14-F828-499C-BAA9-EB928B0017A5.png" width="752" height="650" />

### Quiz: Deploying Clustering with 2 features

Deploy k-means clustering on the financial\_features data, with 2 clusters specified as a parameter:

<img src="clustering_images/CE415F54-020F-47ED-AC53-896B09599441.png" width="725" height="555" />

### Quiz: Deploying Clustering with 3 features

Add a third feature to features\_list, “total\_payments”. Now rerun clustering, using 3 input features instead of 2 (obviously we can still only visualize the original 2 dimensions).
- This new clustering, using 3 features, couldn’t have been guessed by eye–it was the k-means algorithm that identified it.

<img src="clustering_images/9DE625BE-827E-4542-A007-45A4E550EFB6.png" width="728" height="549" />
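The comparison between 2 and 3 input features can be sketched as follows. The project itself loads the Enron data with the course's helper code, which is not reproduced here; the random numbers below are stand-ins that only borrow the feature names, so the resulting clusters say nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data (hypothetical values, NOT the Enron data set)
rng = np.random.RandomState(1)
salary = rng.uniform(50_000, 1_000_000, size=30)
exercised_stock_options = rng.uniform(0, 30_000_000, size=30)
total_payments = rng.uniform(0, 100_000_000, size=30)

features_2 = np.column_stack([salary, exercised_stock_options])
features_3 = np.column_stack([salary, exercised_stock_options, total_payments])

labels_2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features_2)
labels_3 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features_3)

# Cluster numbers are arbitrary, so check agreement under both labelings
same = np.mean(labels_2 == labels_3)
agreement = max(same, 1 - same)
print(agreement)  # fraction of points grouped the same way in both runs
```

Because the third feature takes part in the distance calculation, points can change clusters even though a scatterplot of the first two features looks unchanged.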

### Quiz: Range of Salary, Stock Option, etc.

(excluding ‘NaN’ values)

<img src="clustering_images/4EFF5037-CDEC-4F94-8AC3-EC9F518C2DB8.png" width="538" height="174" />
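Finding a feature's range while skipping missing entries might look like the sketch below. It assumes the data is a dictionary of per-person dictionaries in which missing values are stored as the string `'NaN'`; the names and numbers are invented for the example.

```python
# Hypothetical stand-in for the data dictionary (invented names and values)
data_dict = {
    "PERSON A": {"salary": 200000, "exercised_stock_options": "NaN"},
    "PERSON B": {"salary": "NaN", "exercised_stock_options": 1500000},
    "PERSON C": {"salary": 350000, "exercised_stock_options": 40000},
}

def feature_range(data, feature):
    """Return (min, max) of a feature, ignoring entries equal to 'NaN'."""
    values = [person[feature] for person in data.values()
              if person[feature] != "NaN"]
    return min(values), max(values)

salary_range = feature_range(data_dict, "salary")
stock_range = feature_range(data_dict, "exercised_stock_options")
print(salary_range)
print(stock_range)
```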

This is prep for a sneak peek at feature scaling