<h3 style='color: blue'>K-means clustering: NBA player</h3>
<p> &emsp; K-means clustering is a type of unsupervised learning, used when the data are unlabeled. The goal of the algorithm is to find groups in the data, with the number of groups given by the variable K. The algorithm works iteratively, assigning each data point to one of the K groups based on the features provided, so that points are clustered by feature similarity.</p>

<h4 style='color: red'>Processes:</h4>
<ol>
    <li>Randomly pick K samples as the initial centroids (a sensible initial partition).</li>
    <li>Assign each sample to the cluster whose centroid is <em>closest</em> to it (by Euclidean distance).</li>
    <li>Recalculate each centroid as the <em>mean value</em> of all samples in its cluster.</li>
    <li>Repeat steps 2 and 3 until no sample changes cluster, or until the tolerance or the maximum number of iterations chosen by the user is reached.</li>
</ol>
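
<p>The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name <em>kmeans_sketch</em>, its parameters and the toy data are all assumptions made for this example, and the stopping rule shown (total centroid shift below <em>tol</em>) is only one of several possible choices.</p>

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Illustrative k-means following steps 1-4 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have (almost) stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Toy data, made up for illustration: two well-separated groups of points
toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
centroids, labels = kmeans_sketch(toy, k=2)
print(centroids)
print(labels)
```

<p>On this toy data the first three points should end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initialization.</p>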

<h4 style='color: red'>Results:</h4>
<ol>
    <li>The centroids of the K clusters, which can be used to label new data points</li>
    <li>Labels for the training data (each data point is assigned to a single cluster)</li>
</ol>
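
<p>As a small illustration of these two results, the sketch below fits scikit-learn's <em>KMeans</em> on made-up toy data (not the NBA data) and reads the centroids from <em>cluster_centers_</em> and the training labels from <em>labels_</em>. The toy arrays and the choice of <em>n_clusters=2</em> are assumptions for the example only.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data, made up for illustration: two well-separated groups of points
toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Result 1: the centroids of the K clusters
print(km.cluster_centers_)

# Result 2: one cluster label per training sample
print(km.labels_)

# The centroids can then label unseen points: each new point gets the
# label of its nearest centroid
new_points = np.array([[0.0, 0.5], [9.5, 9.0]])
new_labels = km.predict(new_points)
print(new_labels)
```

<p>Cluster numbers are arbitrary, so only the grouping matters, not whether a given cluster is called 0 or 1.</p>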

<h5>Reference:</h5>
<ol>
    <li><a href='http://mnemstudio.org/clustering-k-means-example-1.htm'>k-Means: Step-By-Step</a></li>
    <li><a href='https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials'>Introduction to K-means Clustering</a></li>
    <li><a href='https://www.naftaliharris.com/blog/visualizing-k-means-clustering/'>Visualizing K-Means Clustering</a></li>
</ol>

<h4 style='color: red'>sklearn.metrics:</h4>
<p>The <em>sklearn.metrics</em> module includes score functions, performance metrics, pairwise metrics and distance computations.</p>

<p style='color: blue'>sklearn.metrics.<b>silhouette_samples</b>(X, labels, metric='euclidean', **kwds)</p>
<p>Compute the Silhouette Coefficient for each sample.</p>
    
<p>The Silhouette Coefficient measures how well each sample is clustered with the samples that are similar to it.</p>
    
<p>Clustering models with a high Silhouette Coefficient are said to be dense (samples in the same cluster are similar to each other) and well separated (samples in different clusters are not very similar to each other).</p>

<p>The Silhouette Coefficient is calculated using the mean intra-cluster distance (<mark>a</mark>) and the mean nearest-cluster distance (<mark>b</mark>) for each sample. The Silhouette Coefficient for a sample is <mark>(b-a)/max(a,b)</mark>, which ranges from -1 to 1.</p>
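
<p>To make the formula concrete, the sketch below computes (b-a)/max(a,b) by hand on a tiny one-dimensional toy dataset and compares the result with <em>silhouette_samples</em>. The helper <em>silhouette_by_hand</em>, the toy values and the labels are assumptions made for this illustration. Here a is averaged over the <em>other</em> members of the sample's own cluster, and b is the smallest mean distance to any other cluster.</p>

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data, made up for illustration: two clusters on a line
X_toy = np.array([[0.0], [1.0], [5.0], [6.0], [7.0]])
labels_toy = np.array([0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels):
    """Hypothetical helper: per-sample (b - a) / max(a, b)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a single-member cluster is scored 0
        # a: mean distance to the other members of the same cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the sample itself)
        a = dist[i, same].sum() / (n_same - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

by_hand = silhouette_by_hand(X_toy, labels_toy)
from_sklearn = silhouette_samples(X_toy, labels_toy, metric='euclidean')
print(by_hand)
print(from_sklearn)
```

<p>For the sample at 1.0, for example, a = 1 (its distance to 0.0) and b = (4 + 5 + 6) / 3 = 5, so its coefficient is (5 - 1) / 5 = 0.8.</p>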

In [1]:
# import packages
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

In [3]:
# Load data file as pandas dataframe
df = pd.read_csv('player_traditional.csv')
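# Keep every column from the third onward as features
# (assumes the first two columns are not statistics to cluster on)
X = df.iloc[:, 2:]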
print(X)

      MIN   PTS   FGM   FGA   FG%  3PM  3PA   3P%  FTM  FTA ...   DREB   REB  \
0     7.4   2.2   0.8   1.9  40.5  0.2  0.5  50.0  0.4  0.9 ...    1.3   1.6   
1    13.7   5.0   1.9   4.6  40.3  0.7  2.0  37.5  0.5  0.6 ...    0.8   1.1   
2    28.7  12.7   4.9  10.8  45.4  1.0  3.3  28.8  2.0  2.7 ...    3.6   5.1   
3     3.3   0.2   0.0   0.8   0.0  0.0  0.4   0.0  0.2  0.4 ...    0.6   0.6   
4     7.5   3.5   1.3   3.0  42.6  0.2  0.8  20.0  0.8  1.1 ...    1.3   1.8   
5    32.3  14.0   5.6  11.8  47.3  1.3  3.6  35.5  1.6  2.0 ...    5.4   6.8   
6    14.1   8.1   3.6   7.1  49.9  0.0  0.0   0.0  1.0  1.3 ...    3.1   4.2   
7    29.1   8.7   3.0   7.6  39.3  1.1  3.5  33.0  1.6  2.2 ...    6.1   7.4   
8    10.3   2.9   1.0   2.7  37.5  0.5  1.5  31.8  0.4  0.5 ...    0.7   0.8   
9    15.1   7.4   2.9   5.7  51.7  0.0  0.0   0.0  1.5  2.4 ...    4.2   6.2   
10   15.5   6.7   2.4   5.9  39.9  0.6  1.8  32.9  1.4  1.9 ...    2.5   2.9   
11   15.5   6.0   2.0   5.0  39.3  1.4  

In [13]:
# Standardize
sc = StandardScaler()
sc.fit(X)

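# Per-feature means: sc.mean_ should match the pandas column means
X_mean = X.mean(axis=0)
print(sc.mean_)
print(X_mean)

# Per-feature variances, from which scale_ is computed
print(sc.var_)

# Per-feature standard deviations. Note: sc.scale_ is the population
# standard deviation (ddof=0), while pandas' std() defaults to ddof=1,
# so the two printouts differ slightly.
X_std = X.std(axis=0)
print(sc.scale_)
print(X_std)

# Transform X to zero mean and unit variance per feature
X_train_std = sc.transform(X)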
print(X_train_std)

[ 19.89958848   8.42674897   3.12016461   6.91399177  44.03703704
   0.768107     2.1845679   27.71769547   1.42489712   1.86069959
  71.82798354   0.85144033   2.71522634   3.56522634   1.83024691
   1.09814815   0.62530864   0.39259259   1.68024691   4.17901235
   0.24074074  -0.34032922]
MIN     19.899588
PTS      8.426749
FGM      3.120165
FGA      6.913992
FG%     44.037037
3PM      0.768107
3PA      2.184568
3P%     27.717695
FTM      1.424897
FTA      1.860700
FT%     71.827984
OREB     0.851440
DREB     2.715226
REB      3.565226
AST      1.830247
TOV      1.098148
STL      0.625309
BLK      0.392593
PF       1.680247
DD2      4.179012
TD3      0.240741
+/-     -0.340329
dtype: float64
[  8.19322221e+01   3.66232351e+01   4.64864689e+00   2.04286108e+01
   1.01768505e+02   5.69661849e-01   3.76266308e+00   2.20456724e+02
   2.03923610e+00   2.98345548e+00   3.52375349e+02   5.75707802e-01
   3.23972701e+00   5.88543688e+00   3.11272710e+00   6.09564472e-01
   1.68310090e-01   1