# Clustering colleges
Cluster colleges into two groups, private and public, using K-Means clustering, then check the clusters against the known `Private` label.

In [3]:
import pandas as pd
import seaborn as sns

### Import and explore data

In [4]:
df = pd.read_csv('College_Data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


### Train model

In [13]:
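# Exclude the school name and the true label so the model only sees the numeric features
ignore = ['Unnamed: 0', 'Private']
feats = [el for el in df.columns if el not in ignore]

# Import and train model
from sklearn.cluster import KMeans
# Cluster ids (0/1) are arbitrary and can differ between runs;
# pass random_state to KMeans to make them reproducible
model = KMeans(n_clusters=2)
model.fit(df[feats])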

# How did the model do?
print('Cluster centers: \n', model.cluster_centers_)
print('Labels: \n', model.labels_)

Cluster centers: 
 [[1.81323468e+03 1.28716592e+03 4.91044843e+02 2.53094170e+01
  5.34708520e+01 2.18854858e+03 5.95458894e+02 1.03957085e+04
  4.31136472e+03 5.41982063e+02 1.28033632e+03 7.04424514e+01
  7.78251121e+01 1.40997010e+01 2.31748879e+01 8.93204634e+03
  6.51195815e+01]
 [1.03631389e+04 6.55089815e+03 2.56972222e+03 4.14907407e+01
  7.02037037e+01 1.30619352e+04 2.46486111e+03 1.07191759e+04
  4.64347222e+03 5.95212963e+02 1.71420370e+03 8.63981481e+01
  9.13333333e+01 1.40277778e+01 2.00740741e+01 1.41705000e+04
  6.75925926e+01]]
Labels: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 

These outputs don't tell us much on their own. K-Means only assigns each school to cluster 0 or cluster 1; it does not say which cluster is private and which is public. Since we do have the true labels, however, we can compare actual to predicted.
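
Before casting anything to numbers, a cross-tabulation is a quick way to see how the clusters line up with the true categories. The sketch below uses a hypothetical six-school stand-in for `df['Private']` and `model.labels_`; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for df['Private'] and model.labels_ (six schools only)
private = pd.Series(['Yes', 'Yes', 'No', 'Yes', 'No', 'No'], name='Private')
cluster = pd.Series([0, 0, 1, 0, 1, 0], name='cluster')

# Rows: true category; columns: cluster id assigned by K-Means
table = pd.crosstab(private, cluster)
print(table)
```

On the real data, the equivalent call would be `pd.crosstab(df['Private'], model.labels_)`. Whichever cluster holds most of the `Yes` rows is the natural candidate for "private".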

In [16]:
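# Cast 'Yes' to 0 and 'No' to 1
# K-Means cluster ids are arbitrary: with the opposite cast the model appears to perform very poorly
actual = df['Private'].map({'Yes': 0, 'No': 1})

from sklearn.metrics import classification_report, confusion_matrix
# Note: scikit-learn expects (y_true, y_pred). Here the cluster labels are passed first,
# so the rows and 'support' below refer to clusters, and precision and recall are
# swapped relative to the usual convention.
print('Classification report: \n', classification_report(model.labels_, actual))
print('Confusion matrix: \n', confusion_matrix(model.labels_, actual))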

Classification report: 
               precision    recall  f1-score   support

           0       0.94      0.79      0.86       669
           1       0.35      0.69      0.46       108

    accuracy                           0.78       777
   macro avg       0.64      0.74      0.66       777
weighted avg       0.86      0.78      0.81       777

Confusion matrix: 
 [[531 138]
 [ 34  74]]


In [15]:
df['Private'].value_counts()

Yes    565
No     212
Name: Private, dtype: int64
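
The class counts above give a baseline for judging the 0.78 accuracy: a rule that labels every school as private would already be right for the majority class. A minimal sketch of that comparison, using only the numbers printed in the outputs above (the variable names are illustrative):

```python
# Counts copied from the value_counts() output and the confusion matrix above
n_private, n_public = 565, 212
n_total = n_private + n_public

# Accuracy of always guessing the larger class
baseline = max(n_private, n_public) / n_total

# Agreement between clusters and labels: the diagonal of the confusion matrix
agreement = (531 + 74) / n_total

print(f'Majority-class baseline: {baseline:.3f}')
print(f'Cluster/label agreement: {agreement:.3f}')
```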

The model performs reasonably well overall: its clusters agree with the true labels for 78% of schools, even though K-Means never sees the `Private` column and works only from the numeric features. The one caveat is that cluster ids are arbitrary, so the model may have grouped the schools the other way around, in which case the agreement is very poor (about 22%). It seems more likely that the clusters correspond to "Private" = 0 and "Public" = 1.
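
The ambiguity in cluster numbering can be checked directly. The sketch below rebuilds per-school arrays from the confusion matrix printed earlier (an assumption made so the example runs without the CSV), compares agreement under both possible mappings, and also computes the adjusted Rand index, which is unaffected by how the clusters are numbered.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Rebuild per-school arrays from the confusion matrix printed above
# (rows: cluster label, columns: actual, with 'Yes' -> 0 and 'No' -> 1)
labels = np.repeat([0, 0, 1, 1], [531, 138, 34, 74])
actual = np.repeat([0, 1, 0, 1], [531, 138, 34, 74])

# Agreement under the chosen mapping and under the flipped one
acc_as_is = accuracy_score(actual, labels)
acc_flipped = accuracy_score(actual, 1 - labels)

# The adjusted Rand index does not depend on how cluster ids are numbered
ari = adjusted_rand_score(actual, labels)

print(f'Agreement as-is:   {acc_as_is:.3f}')
print(f'Agreement flipped: {acc_flipped:.3f}')
print(f'Adjusted Rand index: {ari:.3f}')
```

With the notebook's own variables, the same calls would take `actual` and `model.labels_`. Picking the mapping with the higher agreement is a common convention for scoring clusters against known classes, but it remains a choice made after the fact rather than something the model reports.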