**<font size=5>Visualizing and Understanding K-Means Clusters</font>**

The purpose of this notebook is to analyze the 1995 U.S. News and World Report college statistics dataset using K-means clustering. In this notebook I generate the clusters and then look at a couple of different ways of visualizing and understanding the cluster output. Let's begin with the usual: import statements, data load, and a quick look at the dataset.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
#import sklearn.cluster.hierarchical as hclust
from sklearn import preprocessing
import seaborn as sns

In [None]:
df = pd.read_csv('../input/College.csv')

In [None]:
print(df.shape)
df.head()

The Kaggle [site](https://www.kaggle.com/flyingwombat/us-news-and-world-reports-college-data/home) has the descriptions of each data column, copied here for easy reference:
* Private: A factor with levels No and Yes indicating private or public university
* Apps: Number of applications received
* Accept: Number of applications accepted
* Enroll: Number of new students enrolled
* Top10perc: Pct. new students from top 10% of H.S. class
* Top25perc: Pct. new students from top 25% of H.S. class
* F.Undergrad: Number of fulltime undergraduates
* P.Undergrad: Number of parttime undergraduates
* Outstate: Out-of-state tuition
* Room.Board: Room and board costs
* Books: Estimated book costs
* Personal: Estimated personal spending
* PhD: Pct. of faculty with Ph.D.’s
* Terminal: Pct. of faculty with terminal degree
* S.F.Ratio: Student/faculty ratio
* perc.alumni: Pct. alumni who donate
* Expend: Instructional expenditure per student
* Grad.Rate: Graduation rate

**<font size=5>Features</font>**

Note that there's a categorical variable in our data: 'Private'. Categorical variables are tricky for clustering. K-means works on numeric distances, so you can't cluster on a categorical variable directly; you'd have to map it to numbers first. This can be intuitive for ordinal data, but for non-ordinal categorical variables, the assigned numerical values can shape the clusters in ways that say nothing meaningful about the underlying data. 'Private' is a binary variable, yes or no, but mapping it to 0 or 1 would have an outsized impact on clustering, since every point would sit all the way at the min or the max of this variable while the other variables are continuous. For now, we will disregard this variable.
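
To make the "outsized impact" point concrete, here is a minimal sketch on made-up data (not the college dataset). It compares how much a 0/1 column and a continuous column already scaled to [0, 1] each contribute to the squared distance between randomly paired points. The variable names and the random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: one continuous column in [0, 1] and one 0/1 flag
continuous = rng.uniform(0, 1, size=n)
binary = rng.integers(0, 2, size=n).astype(float)

# Squared differences between randomly paired points, per column
idx_a = rng.integers(0, n, size=5000)
idx_b = rng.integers(0, n, size=5000)
cont_sq = (continuous[idx_a] - continuous[idx_b]) ** 2
bin_sq = (binary[idx_a] - binary[idx_b]) ** 2

mean_cont_sq = cont_sq.mean()
mean_bin_sq = bin_sq.mean()
print(f"mean squared difference, continuous: {mean_cont_sq:.3f}")
print(f"mean squared difference, binary:     {mean_bin_sq:.3f}")
```

The binary column can only contribute 0 or 1 to a squared distance, so on average it weighs noticeably more than the continuous one in this toy setup. Real columns are not uniformly distributed, so treat the size of the gap as illustrative.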

In [None]:
#exclude the categorical column and the college names
features = df.drop(['Private', 'Unnamed: 0'],axis=1)

There are three columns - 'Apps', 'Accept', and 'Enroll' - that can be collapsed into percentages if we choose. The absolute numbers could be informative - maybe a high number of acceptances means we are looking at a very large school, for example. However, if two schools have an "Accept" of, say, 1000, this could mean very different things if "Apps" was 10,000 (10% acceptance rate) versus 2,000 (50%). So let's create a % accepted column (Accept / Apps) and % enroll column (Enroll / Accept).

In [None]:
features['Acceptperc'] = features['Accept'] / features['Apps']
features['Enrollperc'] = features['Enroll'] / features['Accept']

In [None]:
features.describe()

**Normalization**

Note that the different columns have very different ranges. If we don't normalize them, the columns with wider ranges will contribute disproportionately to the distances that separate the clusters.

In [None]:
scaler = preprocessing.MinMaxScaler()
features_normal = scaler.fit_transform(features)

In [None]:
pd.DataFrame(features_normal).describe()

Now all of our variables are scaled to be distributed between 0 and 1.
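
For reference, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below checks that formula against the scaler on a small made-up array; the values are hypothetical and only stand in for two columns with very different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for two columns with very different ranges
X = np.array([[1000.0, 0.10],
              [2000.0, 0.50],
              [10000.0, 0.90]])

# Manual min-max scaling, column by column
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```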

**<font size=5>K-Means Clustering</font>**

How many clusters should we group these colleges into? We can use the elbow method to decide. Plot the sum of squared distances of the data points from their cluster's center (the inertia) for increasing numbers of clusters and see if you can find a clear cluster number where the decrease in distortion starts to level off. A quick tutorial that walked me through this part of the code is [here](https://pythonprogramminglanguage.com/kmeans-elbow-method/).
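
The "sum of squared distances" described above is what scikit-learn exposes as `inertia_`. As a rough sketch, it can be recomputed by hand from the fitted centers and labels. The blob data here is made up, and `n_init=10` and `random_state=0` are choices made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up 2-D points in three loose blobs
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in (0.2, 0.5, 0.8)])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up each point's own cluster center, then sum the squared distances to it
own_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - own_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```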

In [None]:
inertia = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(features_normal)  # fit once per candidate k
    inertia.append(kmeanModel.inertia_)

In [None]:
# Plot the elbow
plt.plot(K, inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()

The elbow method is subjective but it looks like 4 might be the pivot point we're looking for. Let's try 4 clusters.
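
One way to make the "where does it level off" judgment a little less eyeball-driven is to print the relative drop in inertia from each k to the next. This is only a sketch: the inertia values below are hypothetical, not the output of the run above, and you would substitute the `inertia` list computed earlier.

```python
import numpy as np

# Hypothetical inertia values for k = 1..9 (not from the college data)
inertia_example = [400.0, 280.0, 215.0, 170.0, 158.0, 148.0, 140.0, 133.0, 127.0]
ks = list(range(1, len(inertia_example) + 1))

values = np.array(inertia_example)
# Fractional decrease in inertia when going from k to k + 1
rel_drop = (values[:-1] - values[1:]) / values[:-1]

for k, drop in zip(ks[1:], rel_drop):
    print(f"k={k}: inertia fell {drop:.1%} relative to k={k - 1}")
```

In this made-up series the drops shrink sharply after k=4. The choice of k is still a judgment call, but the numbers make the leveling-off easier to point at.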

In [None]:
kmeans = KMeans(n_clusters=4).fit(features_normal)

In [None]:
labels = pd.DataFrame(kmeans.labels_) #This is where the label output of the KMeans we just ran lives. Make it a dataframe so we can concatenate back to the original data
labeledColleges = pd.concat((features,labels),axis=1)
labeledColleges = labeledColleges.rename({0:'labels'},axis=1)

In [None]:
labeledColleges.head()

**<font size=5>Visualization</font>**
    
*(Nota bene: I'm plotting the original data in these visualizations, not their normalized scaled versions. We clustered based on the normalized data but I wanted to see how that translates to the colleges' actual stats)*

The original dataset had 18 features. We dropped one and added two more, so we clustered on 19. We have 4 clusters of points in 19-dimensional space, which is hard to visualize. If we only had two attributes, we could look at how the clusters separate like this:

In [None]:
sns.lmplot(x='Top10perc',y='S.F.Ratio',data=labeledColleges,hue='labels',fit_reg=False)

Here we plotted the Top 10 Percent column ("Pct. new students from top 10% of H.S. class") versus the Student/Faculty ratio column and color-coded each data point by the cluster to which it was assigned. You can start to get the sense of which clusters have lower student/faculty ratios or are more selective in the students they accept. However, we can't see 4 clearly distinct clusters just by plotting these two variables; we have 17 other variables contributing to the separation that we have to consider to get the full picture. We can't plot all 19 variables together on one plot like the one above. We could plot every variable against every other variable:

In [None]:
sns.pairplot(labeledColleges,hue='labels')

This is nice for scanning by eye and seeing what variables give you nice separation and getting a sense for what happened in the clusters, but there's a lot going on and it's hard to get a quick answer to questions like "what features tend to define cluster 0? How about cluster 3?" Let's try visualizing each variable separately using strip plots and swarm plots.
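
A quick numeric complement to the plots is a table of per-cluster means, which speaks directly to "what features tend to define cluster 0?". Here is a minimal sketch on a made-up frame shaped like `labeledColleges`; the values and the name `toy` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for labeledColleges: feature columns plus a 'labels' column
toy = pd.DataFrame({
    'Top10perc': [80, 75, 20, 25, 30, 22],
    'Outstate': [20000, 18000, 7000, 8000, 9000, 7500],
    'labels': [1, 1, 0, 0, 2, 0],
})

# Mean of every feature within each cluster
cluster_means = toy.groupby('labels').mean()
print(cluster_means)
```

On the real data the equivalent call would be `labeledColleges.groupby('labels').mean()`, run while the frame still holds only numeric columns.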

In [None]:
labeledColleges['Constant'] = "Data" #This is just to add something constant for the strip/swarm plots' X axis. Can be anything you want it to be.

In [None]:
sns.stripplot(x=labeledColleges['Constant'],y=labeledColleges['Top10perc'],hue=labeledColleges['labels'],jitter=True)

This is a strip plot. Seaborn plots one data point for each row and we've color coded the points by the cluster to which they were assigned. Adding jitter fans out the points horizontally. In a strip plot, the points can overlap. In a swarm plot (below), the points cannot overlap.

In [None]:
sns.swarmplot(x=labeledColleges['Constant'],y=labeledColleges['Top10perc'],hue=labeledColleges['labels'])

Let's look at all the features. 

In [None]:
f, axes = plt.subplots(4, 5, figsize=(20, 25), sharex=False) #create a 4x5 grid of empty figures where we will plot our feature plots. With 19 features, one panel stays empty.
f.subplots_adjust(hspace=0.2, wspace=0.7) #Scooch em apart, give em some room
plot_cols = labeledColleges.columns[:-2] #skip the last two columns because I don't want to plot labels or constant
#Step through every column that I want to plot. This is a 4x5 grid, so divmod by 5 turns the running index into a (row, column) position
for i, col in enumerate(plot_cols):
    row, pos = divmod(i, 5) #so if i=6 it is row 1 column 1
    ax = sns.stripplot(x=labeledColleges['Constant'],y=labeledColleges[col].values,hue=labeledColleges['labels'],jitter=True,ax=axes[row,pos])
    ax.set_title(col)

In [None]:
f, axes = plt.subplots(4, 5, figsize=(20, 25), sharex=False)
f.subplots_adjust(hspace=0.2, wspace=0.7)
plot_cols = labeledColleges.columns[:-2] #skip labels and constant
for i, col in enumerate(plot_cols):
    row, pos = divmod(i, 5) #map the running index onto the 4x5 grid
    ax = sns.swarmplot(x=labeledColleges['Constant'],y=labeledColleges[col].values,hue=labeledColleges['labels'],ax=axes[row,pos])
    ax.set_title(col)

So, if you were looking for a college in 1995, you could scan these clusters and get a sense for which cluster might offer what you're looking for. Do you want a more exclusive school? Look for clusters that plot higher in Top10perc and Top25perc. In my run that was cluster 1, but cluster 1 schools are also more expensive, with higher out-of-state tuition and room and board costs. (K-means numbers its clusters arbitrarily, so the labels may differ when you rerun the notebook.) Maybe you're looking for a big school: look for clusters with higher numbers of full-time undergrad students. When you find a cluster you like, you can see the college list here:

In [None]:
colleges = df['Unnamed: 0']
colleges = pd.concat((colleges,labels),axis=1)
colleges = colleges.rename({'Unnamed: 0':'College',0:'Cluster'},axis=1)
sortcolleges = colleges.sort_values(['Cluster'])
pd.set_option('display.max_rows', 1000)
sortcolleges
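
If you only want the schools from one cluster rather than the full sorted list, a filter like the following sketch would do it. The frame `toy_colleges` and its contents are made up; with the real data you would filter the `colleges` frame built above.

```python
import pandas as pd

# Made-up stand-in for the `colleges` frame
toy_colleges = pd.DataFrame({
    'College': ['College A', 'College B', 'College C', 'College D'],
    'Cluster': [1, 0, 1, 2],
})

chosen = 1  # whichever cluster looked interesting in the plots
in_cluster = toy_colleges.loc[toy_colleges['Cluster'] == chosen, 'College'].tolist()
print(in_cluster)

# How many schools landed in each cluster
cluster_sizes = toy_colleges['Cluster'].value_counts().sort_index()
print(cluster_sizes)
```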