# Machine Learning with Python
# Part 3. Unsupervised Learning - Clustering

Author: Kang P. Lee<br>
References:
- scikit-learn documentation (http://scikit-learn.org/stable/documentation.html)
- Introduction to Machine Learning with Python (http://shop.oreilly.com/product/0636920030515.do)
- Major League Baseball data from SeanLahman.com (http://www.seanlahman.com/baseball-archive/statistics/)
- Baseball statistics (https://en.wikipedia.org/wiki/Baseball_statistics)

## Set the Goal

Let's build a clustering model on the Major League Baseball dataset that groups all batters into several clusters of similar batters.

For example, can we identify a group of sluggers or a group of bench players?

## Import Modules

In [1]:
import pandas as pd
from sklearn.cluster import KMeans

## Load the Dataset into a Pandas Dataframe

In [2]:
df = pd.read_csv("Batting.csv")

## Filter Out Unnecessary Data

In [3]:
df2k = df[df.yearID >= 2000]

In [4]:
df2k.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23566 entries, 79250 to 102815
Data columns (total 22 columns):
playerID    23566 non-null object
yearID      23566 non-null int64
stint       23566 non-null int64
teamID      23566 non-null object
lgID        23566 non-null object
G           23566 non-null int64
AB          23566 non-null int64
R           23566 non-null int64
H           23566 non-null int64
2B          23566 non-null int64
3B          23566 non-null int64
HR          23566 non-null int64
RBI         23566 non-null float64
SB          23566 non-null float64
CS          23566 non-null float64
BB          23566 non-null int64
SO          23566 non-null float64
IBB         23566 non-null float64
HBP         23566 non-null float64
SH          23566 non-null float64
SF          23566 non-null float64
GIDP        23566 non-null float64
dtypes: float64(9), int64(10), object(3)
memory usage: 4.1+ MB


In [5]:
df2k = df2k.drop(["playerID", "yearID", "stint", "teamID", "lgID"], axis=1)

## Prepare Data for Modeling

In [6]:
X = df2k.copy()

There is no <i>y</i> in unsupervised learning, because the data has no target labels to predict. For the same reason, we do not split the data into training and test sets here.
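k-means assigns points by Euclidean distance, so columns with large numeric ranges (here AB, which runs into the hundreds) weigh more heavily than columns such as 3B or CS. This notebook clusters the raw counts. As an optional variation, the features could be standardized first. The sketch below uses scikit-learn's `StandardScaler` on a made-up toy frame; `X_toy` and its values are illustrative assumptions, not rows from Batting.csv.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for X with three of the batting columns (values are made up).
X_toy = pd.DataFrame({
    "G":  [150, 120, 30, 10],
    "AB": [580, 400, 20, 5],
    "HR": [30, 12, 0, 0],
})

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_toy),
    columns=X_toy.columns,
    index=X_toy.index,
)

print(X_scaled.round(2))
```

In the notebook itself, the same two lines would be applied to `X` before calling `kmeans.fit`. Note that the resulting centroids are then expressed in standardized units rather than raw counts.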

## Modeling with k-Means Clustering

In [7]:
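# Full signature with defaults in the scikit-learn version used for this notebook
# (newer releases have removed precompute_distances and n_jobs):
# KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', 
#        verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')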

kmeans = KMeans(n_clusters=5, random_state=0)     # k = 5

In [8]:
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
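For intuition about what `fit` does, here is a simplified NumPy sketch of the standard k-means iteration (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. This is not scikit-learn's implementation, which adds k-means++ initialization, several restarts (`n_init`), and a tolerance-based stopping rule. The function name `simple_kmeans` and the toy data are assumptions made for illustration.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (not k-means++).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        # An empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so that labels match the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

# Two well-separated toy groups in 2-D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = simple_kmeans(X_toy, k=2)
print(centers)
print(labels)
```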

In [9]:
kmeans.cluster_centers_           # Centroid coordinates: one row per cluster, one column per feature

array([[2.35111017e+01, 7.42626091e+00, 6.31623750e-01, 1.25508974e+00,
        2.19692133e-01, 1.95076967e-02, 8.05135845e-02, 5.18479109e-01,
        6.34886855e-02, 2.98645102e-02, 4.50095765e-01, 2.45208200e+00,
        1.47549124e-02, 5.19968788e-02, 4.32503370e-01, 3.75257147e-02,
        1.38185430e-01],
       [1.16970063e+02, 3.92674895e+02, 5.16444328e+01, 1.03831933e+02,
        2.06785714e+01, 2.16754202e+00, 1.16790966e+01, 4.96670168e+01,
        6.33718487e+00, 2.68277311e+00, 3.70677521e+01, 7.65399160e+01,
        2.78939076e+00, 4.17804622e+00, 2.31250000e+00, 3.15756303e+00,
        9.14548319e+00],
       [8.06253310e+01, 2.27492939e+02, 2.83005296e+01, 5.71690203e+01,
        1.14210062e+01, 1.25948808e+00, 5.98455428e+00, 2.69682259e+01,
        3.38349515e+00, 1.49161518e+00, 2.08464254e+01, 4.80167696e+01,
        1.33539276e+00, 2.33583407e+00, 1.78949691e+00, 1.75551633e+00,
        5.13901147e+00],
       [1.48959961e+02, 5.59242644e+02, 8.34973468e+01, 1.571

In [10]:
kmeans.labels_                    # Stores the cluster labels of data points 

array([2, 2, 0, ..., 3, 2, 0], dtype=int32)

Each training data point in X is assigned a cluster label, which is an integer from 0 to k-1.
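A label is simply the index of the closest centroid. The sketch below checks this on toy data (the array `X_toy` is an assumption for illustration) by recomputing the nearest row of `cluster_centers_` by hand, and then uses `predict` to label a point that was not part of the fit. `n_init=10` is passed explicitly so the call behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups in 2-D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Distance from every point to every centroid, then the index of the nearest one.
dists = np.linalg.norm(X_toy[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print((nearest == km.labels_).all())

# A new point gets the label of whichever centroid it is closest to.
new_label = km.predict(np.array([[9.5, 10.0]]))
print(new_label)
```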

In [11]:
df2k["label"] = kmeans.labels_

In [12]:
df2k.head(10)

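| | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 79250 | 80 | 215 | 31 | 59 | 15 | 1 | 3 | 29.0 | 2.0 | 1.0 | 21 | 38.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2 |
| 79251 | 79 | 157 | 22 | 34 | 7 | 1 | 6 | 12.0 | 1.0 | 1.0 | 14 | 51.0 | 2.0 | 1.0 | 0.0 | 1.0 | 2.0 | 2 |
| 79252 | 35 | 5 | 1 | 2 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 79253 | 154 | 576 | 103 | 182 | 42 | 10 | 25 | 79.0 | 28.0 | 8.0 | 100 | 116.0 | 9.0 | 1.0 | 0.0 | 3.0 | 12.0 | 3 |
| 79254 | 62 | 1 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 79255 | 66 | 2 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 79256 | 119 | 350 | 59 | 101 | 19 | 1 | 15 | 60.0 | 5.0 | 5.0 | 54 | 68.0 | 2.0 | 7.0 | 0.0 | 3.0 | 6.0 | 1 |
| 79257 | 54 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 79258 | 21 | 45 | 9 | 13 | 1 | 0 | 4 | 7.0 | 0.0 | 0.0 | 3 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 79259 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |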


In [13]:
df2k.label.value_counts()         # Counts the rows in each cluster

0    14097
4     3226
2     2265
3     2074
1     1904
Name: label, dtype: int64

In [14]:
df2k[df2k.label == 0].sample(n=10, replace=False, random_state=0)

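| | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 79536 | 56 | 8 | 0 | 2 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 102347 | 14 | 40 | 5 | 8 | 0 | 0 | 1 | 3.0 | 0.0 | 0.0 | 2 | 14.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0 |
| 94315 | 11 | 11 | 1 | 2 | 0 | 0 | 0 | 2.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 88357 | 63 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 79308 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 100182 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 95866 | 32 | 47 | 1 | 5 | 0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0 | 21.0 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0 |
| 79643 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 80032 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 95292 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |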


In [15]:
df2k[df2k.label == 3].sample(n=10, replace=False, random_state=0)

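| | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 93430 | 144 | 514 | 47 | 127 | 28 | 5 | 13 | 70.0 | 4.0 | 3.0 | 54 | 134.0 | 4.0 | 5.0 | 0.0 | 7.0 | 12.0 | 3 |
| 91326 | 149 | 503 | 93 | 123 | 28 | 0 | 34 | 90.0 | 1.0 | 0.0 | 91 | 147.0 | 9.0 | 4.0 | 0.0 | 4.0 | 17.0 | 3 |
| 93211 | 139 | 517 | 71 | 151 | 33 | 1 | 23 | 82.0 | 2.0 | 1.0 | 59 | 102.0 | 11.0 | 3.0 | 0.0 | 6.0 | 11.0 | 3 |
| 83376 | 158 | 559 | 50 | 134 | 30 | 2 | 23 | 80.0 | 2.0 | 2.0 | 37 | 103.0 | 4.0 | 5.0 | 1.0 | 6.0 | 13.0 | 3 |
| 95641 | 136 | 512 | 57 | 128 | 28 | 0 | 13 | 60.0 | 14.0 | 6.0 | 23 | 77.0 | 0.0 | 2.0 | 3.0 | 6.0 | 6.0 | 3 |
| 95881 | 159 | 585 | 86 | 152 | 45 | 0 | 24 | 90.0 | 8.0 | 4.0 | 48 | 140.0 | 2.0 | 5.0 | 0.0 | 4.0 | 7.0 | 3 |
| 83489 | 156 | 588 | 94 | 166 | 39 | 5 | 29 | 101.0 | 8.0 | 3.0 | 62 | 89.0 | 10.0 | 1.0 | 0.0 | 3.0 | 14.0 | 3 |
| 82375 | 156 | 635 | 101 | 197 | 56 | 5 | 24 | 120.0 | 5.0 | 2.0 | 41 | 63.0 | 4.0 | 6.0 | 0.0 | 11.0 | 17.0 | 3 |
| 99472 | 133 | 494 | 47 | 141 | 23 | 1 | 15 | 66.0 | 3.0 | 0.0 | 21 | 75.0 | 2.0 | 13.0 | 0.0 | 3.0 | 18.0 | 3 |
| 97557 | 159 | 601 | 66 | 158 | 27 | 0 | 25 | 76.0 | 2.0 | 1.0 | 38 | 73.0 | 3.0 | 0.0 | 3.0 | 2.0 | 14.0 | 3 |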


In [16]:
df2k.groupby("label").mean()

| label | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 23.511102 | 7.426261 | 0.631624 | 1.25509 | 0.219692 | 0.019508 | 0.080514 | 0.518479 | 0.063489 | 0.029865 | 0.450096 | 2.452082 | 0.014755 | 0.051997 | 0.432503 | 0.037526 | 0.138185 |
| 1 | 116.951155 | 392.58666 | 51.634979 | 103.801996 | 20.67437 | 2.168067 | 11.673319 | 49.657563 | 6.332983 | 2.682773 | 37.050945 | 76.535189 | 2.78729 | 4.178046 | 2.314076 | 3.157563 | 9.141807 |
| 2 | 80.618543 | 227.453863 | 28.298896 | 57.160706 | 11.419426 | 1.259161 | 5.98543 | 26.961148 | 3.384989 | 1.492274 | 20.849007 | 48.015011 | 1.335982 | 2.336865 | 1.788962 | 1.754967 | 5.137748 |
| 3 | 148.951784 | 559.206365 | 83.481196 | 157.095468 | 31.722276 | 3.256027 | 20.620058 | 79.982642 | 11.427194 | 4.15622 | 56.809065 | 101.902122 | 5.157184 | 5.722758 | 2.0892 | 4.887175 | 12.917551 |
| 4 | 40.591445 | 92.718537 | 10.00248 | 20.722257 | 4.026348 | 0.422195 | 1.83788 | 9.263174 | 1.117173 | 0.526038 | 7.402666 | 22.986981 | 0.412895 | 0.845319 | 2.154371 | 0.622133 | 1.987601 |
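To come back to the opening question about sluggers, the per-cluster counting totals can be turned into rate statistics. The sketch below uses the standard definitions of batting average (H / AB) and slugging percentage (total bases / AB, where total bases = H + 2B + 2·3B + 3·HR). The numbers are copied from the rows for labels 0 and 3 in the table above. In the notebook, `means` would instead be `df2k.groupby("label").mean()`; the names `means` and `rates` are assumptions for illustration.

```python
import pandas as pd

# Per-cluster means for labels 0 and 3, copied from the groupby output above.
means = pd.DataFrame(
    {
        "AB": [7.426261, 559.206365],
        "H":  [1.25509, 157.095468],
        "2B": [0.219692, 31.722276],
        "3B": [0.019508, 3.256027],
        "HR": [0.080514, 20.620058],
    },
    index=pd.Index([0, 3], name="label"),
)

# H already counts every hit once, so extra bases are added on top of it.
total_bases = means["H"] + means["2B"] + 2 * means["3B"] + 3 * means["HR"]

rates = pd.DataFrame({
    "AVG": means["H"] / means["AB"],   # batting average
    "SLG": total_bases / means["AB"],  # slugging percentage
})
print(rates.round(3))
```

These are ratios of cluster means, which is not the same as averaging each player's own AVG or SLG, so treat them as a rough profile of the cluster rather than a typical member's line.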


## Exercises

Try a different number of clusters <i>k</i>, say k = 10, and examine each cluster to see what its members have in common.

In [17]:
# kmeans = KMeans(n_clusters=YOUR_K)
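One common way to compare several values of k is to look at `inertia_`, the within-cluster sum of squared distances that k-means minimizes, and see where adding clusters stops helping much (often called the elbow heuristic). The sketch below runs on synthetic data from `make_blobs`, which is an assumption for illustration; for the exercise, replace `X_toy` with `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four groups (illustrative stand-in for X).
X_toy, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

# Fit one model per candidate k and record its inertia.
inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia generally keeps falling as k grows, so the lowest value is not the goal. The heuristic looks for the k after which the drop flattens out, and the final choice still depends on whether the clusters make sense for the question at hand.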