<h3 style='color: blue'>K-means clustering: NBA players</h3>
<p> &emsp; K-means clustering is a type of unsupervised learning, which is used when we have unlabeled
data. The goal of the algorithm is to find groups in the data, with the number of groups given by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.</p>

<h4 style='color: red'>Processes:</h4>
<ol>
    <li>Randomly pick K centroids (a sensible initial partition) from the samples.</li>
    <li>Assign the remaining individuals (samples) to the centroid (cluster) they are <em>closest</em> to (by Euclidean distance).</li>
    <li>Recalculate each centroid as the <em>mean value</em> of all samples in its cluster.</li>
    <li>Repeat steps 2 and 3 until no sample changes cluster, or until the tolerance or the maximum number of iterations chosen by the user is reached.</li>
</ol>
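
<p>The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name <em>kmeans_sketch</em>, the toy data, and the choice to stop when the centroids move less than <em>tol</em> are all assumptions made for the example.</p>

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=300, tol=1e-4, seed=0):
    """Plain k-means following the four steps listed above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every sample to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids move less than the tolerance
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tol:
            break
    # Final assignment against the final centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return centroids, labels

# Toy data: two well-separated blobs (made up for illustration)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(centroids)
print(labels)
```

<p>On data this cleanly separated, each blob should end up in its own cluster. On real data the result depends on the starting centroids, which is why several restarts are usually run.</p>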

<h4 style='color: red'>Results:</h4>
<ol>
    <li>The centroids of the K clusters, which can be used to label data</li>
    <li>Labels for the training data (each data point is assigned to a single cluster)</li>
</ol>
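
<p>Both results are exposed on a fitted scikit-learn <em>KMeans</em> object as <em>cluster_centers_</em> and <em>labels_</em>. The sketch below uses made-up toy data (not the player dataset) and shows that labeling new points amounts to finding the nearest centroid.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two blobs, invented for this example
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
print(km.cluster_centers_)  # result 1: one centroid per cluster
print(km.labels_)           # result 2: one label per training sample

# Labeling new points: pick the index of the nearest centroid
X_new = np.array([[0.2, -0.1], [3.8, 4.1]])
dists = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print(nearest, km.predict(X_new))  # the two arrays should agree
```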

<h4 style='color: red'>Confusing KMeans parameters:</h4>

<table align='left' style='margin-bottom: 10px'>
    <tr style='border: 1px solid black'>
        <th style='text-align: left; border-right: 1px solid black'>max_iter</th>
        <td style='text-align: left'>
           Maximum number of iterations of the K-means algorithm for a single run. (int, default: 300)
        </td>
    </tr>
    <tr style='border: 1px solid black'>
        <th style='text-align: left; border-right: 1px solid black'>n_init</th>
        <td style='text-align: left'>
           Number of times the K-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. (int, default: 10)
        </td>
    </tr>
</table>

<h5>Example:</h5>
<p>With <mark>max_iter=300</mark> and <mark>n_init=15</mark>, KMeans will choose initial centroids 15 times, and each run will use up to 300 iterations. The best of those 15 runs (the one with the lowest inertia) will be the final result.</p>
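
<p>A small sketch of how the two parameters interact, using random toy data rather than the player dataset. It compares 15 separate single-start runs with one estimator that performs 15 restarts internally. The seeds and data are arbitrary choices for illustration.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))  # made-up data

# One run per seed: inertia depends on where the centroids start
single_runs = [
    KMeans(n_clusters=3, init='random', n_init=1, max_iter=300,
           random_state=s).fit(X_toy).inertia_
    for s in range(15)
]

# 15 restarts inside one estimator: only the lowest-inertia run is kept
km_multi = KMeans(n_clusters=3, init='random', n_init=15, max_iter=300,
                  random_state=0).fit(X_toy)

print(min(single_runs), max(single_runs))   # spread across single runs
print(km_multi.inertia_, km_multi.n_iter_)  # n_iter_ never exceeds max_iter
```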
<p>With k-means++ initialization, each new centroid is sampled from the data points with probability proportional to <mark>D(x)^2</mark>, where D(x) is the distance between a candidate point x and the nearest centroid that has already been chosen.</p>
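
<p>The D(x)^2 weighting can be sketched as follows. This is a simplified version of k-means++ seeding written for illustration. The function name <em>kmeans_pp_init</em> and the toy data are assumptions, and scikit-learn's own implementation may differ in its details.</p>

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Simplified k-means++ seeding (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # First centroid: a sample chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        chosen = np.array(centroids)
        # D(x)^2: squared distance from each sample to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to D(x)^2;
        # already-chosen samples have D(x) = 0 and cannot be picked again
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Toy data, invented for this example
X_demo = np.random.default_rng(2).normal(size=(50, 2))
init_centroids = kmeans_pp_init(X_demo, k=3)
print(init_centroids)
```

<p>Points far from every existing centroid are more likely to be picked, so the starting centroids tend to be spread out across the data.</p>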
    
<h5>Reference:</h5>
<ol>
    <li><a href='http://mnemstudio.org/clustering-k-means-example-1.htm'>k-Means: Step-By-Step</a></li>
    <li><a href='https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials'>Introduction to K-means Clustering</a></li>
    <li><a href='https://www.naftaliharris.com/blog/visualizing-k-means-clustering/'>Visualizing K-Means Clustering</a></li>
    <li><a href='https://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work'>How exactly does k-means++ work?</a></li>
    <li><a href='https://stats.stackexchange.com/questions/246061/what-are-the-advantages-of-the-pre-defined-initial-centroids-in-clustering'>What are the advantages of the pre-defined initial centroids in clustering?</a></li>
    <li><a href='https://stackoverflow.com/questions/40895697/sklearn-kmeans-parameter-confusion'>Sklearn Kmeans parameter confusion?</a></li>
</ol>

<h4 style='color: red'>sklearn.metrics:</h4>
<p>The <em>sklearn.metrics</em> module includes score functions, performance metrics, pairwise
metrics, and distance computations.</p>

<p style='color: blue'>sklearn.metrics.<b>silhouette_samples</b>(X, labels, metric='euclidean', **kwds)</p>
<p>Compute the Silhouette Coefficient for each sample.</p>
    
<p>The Silhouette Coefficient is a measure of how well samples are clustered with other samples that are similar to themselves.</p>
    
<p>Clustering models with a high Silhouette Coefficient are said to be dense, where samples in the same cluster are similar to each other, and well separated, where samples in different clusters are not very similar to each other.</p>

<p>The Silhouette Coefficient is calculated using the mean intra-cluster distance (<mark>a</mark>) and the mean nearest-cluster distance (<mark>b</mark>) for each sample. The Silhouette Coefficient for a sample is <mark>(b-a)/max(a,b)</mark>.</p>
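
<p>The formula can be checked by hand on a tiny example. The six points and their labels below are made up, and the helper <em>silhouette_for</em> is a hypothetical name used only for this sketch. Its output is compared against <em>silhouette_samples</em>.</p>

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two small, clearly separated clusters (invented data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_for(i, X, labels):
    """(b - a) / max(a, b) for sample i, computed directly from the definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # the sample itself is excluded from its own cluster mean
    a = dists[same].mean()  # mean intra-cluster distance
    # mean distance to each other cluster; b is the smallest of these
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

by_hand = np.array([silhouette_for(i, X_toy, labels) for i in range(len(X_toy))])
from_sklearn = silhouette_samples(X_toy, labels, metric='euclidean')
print(by_hand)
print(from_sklearn)
```

<p>Values close to 1 indicate a sample that sits well inside its own cluster and far from the nearest other cluster.</p>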

<h4 style='color: red'>scipy.spatial.distance.pdist / scipy.spatial.distance.squareform:</h4>
<p><mark>scipy.spatial.distance.pdist:</mark> Pairwise distances between observations in n-dimensional space.</p>
<p><mark>scipy.spatial.distance.squareform:</mark> Converts a vector-form (condensed) distance vector, as returned by pdist, to a square-form distance matrix, and vice versa.</p>
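
<p>A short sketch of the round trip between the two forms, using three made-up points whose distances are easy to verify by hand.</p>

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three collinear points; pairwise distances are 5, 10 and 5
pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per pair, ordered (0,1), (0,2), (1,2)
condensed = pdist(pts, metric='euclidean')
print(condensed)

# Square form: symmetric n x n matrix with zeros on the diagonal
square = squareform(condensed)
print(square)

# squareform also converts back from square to condensed form
print(squareform(square))
```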

<h5>Reference:</h5>
<p><a href='https://stackoverflow.com/questions/32946241/scipy-pdist-on-a-pandas-dataframe'>scipy pdist() on a pandas DataFrame</a></p>
<p><a href='https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/'>Scipy Hierarchical Clustering and Dendrogram Tutorial</a></p>
<p><a href='https://stackoverflow.com/questions/37712465/what-is-the-meaning-of-the-return-values-of-the-scipy-cluster-hierarchy-linkage'>What is the meaning of the return values of the scipy.cluster.hierarchy.linkage?</a></p>
<p><a href='https://stackoverflow.com/questions/36847022/what-numbers-that-i-can-put-in-numpy-random-seed'>What numbers that I can put in numpy.random.seed()?</a></p>

In [1]:
# import packages
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram

In [2]:
# Load data file as pandas dataframe
df = pd.read_csv('player_traditional.csv')
X = df.iloc[:,2:]
print(X)

      MIN   PTS   FGM   FGA   FG%  3PM  3PA   3P%  FTM  FTA ...   DREB   REB  \
0     7.4   2.2   0.8   1.9  40.5  0.2  0.5  50.0  0.4  0.9 ...    1.3   1.6   
1    13.7   5.0   1.9   4.6  40.3  0.7  2.0  37.5  0.5  0.6 ...    0.8   1.1   
2    28.7  12.7   4.9  10.8  45.4  1.0  3.3  28.8  2.0  2.7 ...    3.6   5.1   
3     3.3   0.2   0.0   0.8   0.0  0.0  0.4   0.0  0.2  0.4 ...    0.6   0.6   
4     7.5   3.5   1.3   3.0  42.6  0.2  0.8  20.0  0.8  1.1 ...    1.3   1.8   
5    32.3  14.0   5.6  11.8  47.3  1.3  3.6  35.5  1.6  2.0 ...    5.4   6.8   
6    14.1   8.1   3.6   7.1  49.9  0.0  0.0   0.0  1.0  1.3 ...    3.1   4.2   
7    29.1   8.7   3.0   7.6  39.3  1.1  3.5  33.0  1.6  2.2 ...    6.1   7.4   
8    10.3   2.9   1.0   2.7  37.5  0.5  1.5  31.8  0.4  0.5 ...    0.7   0.8   
9    15.1   7.4   2.9   5.7  51.7  0.0  0.0   0.0  1.5  2.4 ...    4.2   6.2   
10   15.5   6.7   2.4   5.9  39.9  0.6  1.8  32.9  1.4  1.9 ...    2.5   2.9   
11   15.5   6.0   2.0   5.0  39.3  1.4  

In [3]:
# Standardize
sc = StandardScaler()
sc.fit(X)

# X's mean values of each feature
X_mean = X.mean(axis = 0)
print(sc.mean_)
print(X_mean)

# X's variances of each feature which are used to compute scale_
print(sc.var_)

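# X's standard deviations of each feature
# Note: pandas .std() defaults to the sample standard deviation (ddof=1), while
# StandardScaler uses the population standard deviation (ddof=0), so X_std
# differs slightly from sc.scale_
X_std = X.std(axis = 0)
print(sc.scale_)
print(X_std)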

# Transform X 
X_train_std = sc.transform(X)
print(X_train_std)

[ 19.89958848   8.42674897   3.12016461   6.91399177  44.03703704
   0.768107     2.1845679   27.71769547   1.42489712   1.86069959
  71.82798354   0.85144033   2.71522634   3.56522634   1.83024691
   1.09814815   0.62530864   0.39259259   1.68024691   4.17901235
   0.24074074  -0.34032922]
MIN     19.899588
PTS      8.426749
FGM      3.120165
FGA      6.913992
FG%     44.037037
3PM      0.768107
3PA      2.184568
3P%     27.717695
FTM      1.424897
FTA      1.860700
FT%     71.827984
OREB     0.851440
DREB     2.715226
REB      3.565226
AST      1.830247
TOV      1.098148
STL      0.625309
BLK      0.392593
PF       1.680247
DD2      4.179012
TD3      0.240741
+/-     -0.340329
dtype: float64
[  8.19322221e+01   3.66232351e+01   4.64864689e+00   2.04286108e+01
   1.01768505e+02   5.69661849e-01   3.76266308e+00   2.20456724e+02
   2.03923610e+00   2.98345548e+00   3.52375349e+02   5.75707802e-01
   3.23972701e+00   5.88543688e+00   3.11272710e+00   6.09564472e-01
   1.68310090e-01   1

In [4]:
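# Standard k-means with randomly chosen initial centroids
km_norm = KMeans(n_clusters=3, init='random', max_iter=300, tol=1e-04, random_state=0)
# fit() returns the fitted estimator itself (not labels), so y_km is the same object as km_norm
y_km = km_norm.fit(X_train_std)
y_km.predict(X_train_std)  # return value is discarded
# fit_predict() refits the model and returns the labels directly; because y_km is km_norm,
# y_km.labels_ below reflects this second fit, so both prints show the same labels
y_2_km = km_norm.fit_predict(X_train_std)
print(y_km.labels_)
print(y_2_km)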

[1 1 0 1 1 0 1 0 1 0 1 1 0 0 1 0 1 0 1 2 0 0 0 0 1 2 1 1 2 1 0 1 1 1 0 0 0
 1 1 1 1 0 2 1 1 1 0 1 2 1 1 0 0 0 1 1 1 1 2 1 1 0 1 2 0 1 1 0 2 1 0 1 1 1
 1 1 2 1 0 0 1 1 0 0 1 0 1 1 1 2 1 1 2 0 0 1 1 0 1 0 1 0 1 1 1 2 1 1 2 2 0
 1 1 1 2 1 0 0 1 0 1 2 1 0 1 1 0 0 0 1 1 1 1 2 2 0 2 0 1 1 0 1 0 0 2 0 0 0
 0 0 1 0 0 1 1 0 1 1 1 0 2 1 2 2 0 0 1 0 2 1 1 1 0 0 1 1 2 0 0 1 0 0 0 0 1
 2 0 0 1 1 0 0 0 0 2 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 2 1 1 0 0 1 0 2 0 0 0 0
 0 1 1 1 2 1 0 1 1 2 1 0 1 0 1 1 0 0 1 1 1 1 1 1 1 0 2 1 0 1 0 1 0 0 0 1 2
 2 1 0 0 2 0 0 0 2 2 1 1 0 2 0 1 1 2 1 0 2 0 1 1 2 2 1 1 1 1 0 1 1 2 1 0 1
 0 1 1 0 1 0 1 1 1 0 2 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 2 1
 1 1 1 1 1 1 1 0 0 2 0 0 0 1 0 2 1 1 0 2 0 2 1 0 1 1 1 1 0 0 1 1 1 0 1 0 0
 0 2 2 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 2 0 0 0 0 1 1 0 1 1 2 2 2 0 1 1
 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 2 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0
 1 0 1 0 1 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0
 0 0 0 0 0]
[1 1 0 1 1 0 

In [5]:
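# k-means++ method: initial centroids are chosen with D(x)^2 weighting
km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, tol=1e-04)
# fit() returns the fitted estimator itself, so y_km_pp is the same object as km_pp
y_km_pp = km_pp.fit(X_train_std)
y_km_pp.predict(X_train_std)  # return value is discarded
# fit_predict() refits the model and returns the labels; no random_state is set,
# so the cluster numbering may change from one execution to the next
y_km_pp_2 = km_pp.fit_predict(X_train_std)

# Both prints show the labels of the second fit, since y_km_pp is km_pp
print(y_km_pp.labels_)
print(y_km_pp_2)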


[0 0 1 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 2 1 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 1
 0 0 0 0 1 2 0 0 0 1 0 2 0 0 1 1 1 0 0 0 0 2 0 0 1 0 2 1 0 0 1 2 0 1 0 0 0
 0 0 2 0 1 1 0 0 1 1 0 1 0 0 0 2 0 0 2 1 1 0 0 1 0 1 0 1 0 0 0 2 0 0 2 2 1
 0 0 0 2 0 1 1 0 1 0 2 0 1 0 0 1 1 1 0 0 0 0 2 2 1 2 1 0 0 1 0 1 1 2 1 1 1
 1 1 0 1 1 0 0 1 0 0 0 1 2 0 2 2 1 1 0 1 2 0 0 0 1 1 0 0 2 1 1 0 1 1 1 1 0
 2 1 1 0 0 1 1 1 1 2 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 2 0 0 1 1 0 1 2 1 1 1 1
 1 0 0 0 2 0 1 0 0 2 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 2 0 1 0 1 0 1 1 1 0 2
 2 0 1 1 2 1 1 1 2 2 0 0 1 2 1 0 0 2 0 1 2 1 0 0 2 2 0 0 0 0 1 0 0 2 0 1 0
 1 0 0 1 0 1 0 0 0 1 2 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 1 2 0
 0 0 0 0 0 0 0 1 1 2 1 1 1 0 1 2 0 0 1 2 1 2 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1
 1 2 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 2 1 1 1 1 0 0 1 0 0 2 2 2 1 0 0
 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 2 0 0 1 1 1 1 1 0 1 1 0 1 1 0 0 0 1 1
 0 1 0 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 1 1 1 0 1
 1 1 1 1 1]
[0 0 1 0 0 1 

In [6]:
# Pairwise Euclidean distance matrix between rows, as input for hierarchical clustering
new_df = pd.read_csv('player_traditional2.csv')
# print(new_df.columns)

row_dist = pd.DataFrame(squareform(pdist(new_df, metric='euclidean')))
row_dist


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,476,477,478,479,480,481,482,483,484,485
0,0.000000,38.003026,44.131961,64.743108,41.630878,50.222306,60.915762,39.843695,35.474780,57.456331,...,50.548294,42.764237,58.465887,33.223787,42.719668,45.889650,45.581685,55.680697,51.409240,63.014205
1,38.003026,0.000000,24.186360,63.931682,20.042954,26.385413,39.419031,21.531140,9.675226,46.161889,...,24.553208,20.523401,43.330474,32.628055,23.384610,27.352331,14.973310,31.271233,32.829864,41.296005
2,44.131961,24.186360,0.000000,66.348323,27.987497,13.263107,34.273313,10.673800,26.274512,35.846618,...,16.945796,10.549408,32.659914,27.395803,15.614737,7.655717,18.197802,20.919608,16.379866,34.913894
3,64.743108,63.931682,66.348323,0.000000,53.101601,75.960582,58.736360,63.004524,55.751682,57.789272,...,72.197368,70.576129,59.521845,64.018044,67.526735,71.724682,70.424144,81.060533,66.230356,63.372391
4,41.630878,20.042954,27.987497,53.101601,0.000000,35.650947,23.744473,28.129877,13.741543,30.828883,...,36.116755,31.004355,28.479642,25.183526,21.835293,34.175869,29.908527,42.728796,31.756889,27.045517
5,50.222306,26.385413,13.263107,75.960582,35.650947,0.000000,42.306619,16.069225,32.222663,45.474498,...,14.286707,11.033585,42.284749,36.444753,21.788759,9.573923,16.341665,14.776332,21.491161,41.044488
6,60.915762,39.419031,34.273313,58.736360,23.744473,42.306619,0.000000,38.870683,35.493521,20.251913,...,44.402252,41.077123,13.306765,33.748333,29.465913,40.636560,42.441254,48.522057,33.440395,10.723805
7,39.843695,21.531140,10.673800,63.004524,28.129877,16.069225,38.870683,0.000000,23.487443,40.654028,...,16.221282,11.559412,38.106692,29.427538,20.306403,13.222708,16.570154,23.512975,22.520879,39.134256
8,35.474780,9.675226,26.274512,55.751682,13.741543,32.222663,35.493521,23.487443,0.000000,41.382243,...,30.046131,25.681511,39.133617,29.502542,23.932196,31.147392,22.076005,37.743079,33.451009,38.574473
9,57.456331,46.161889,35.846618,57.789272,30.828883,45.474498,20.251913,40.654028,41.382243,0.000000,...,50.754901,45.189933,9.921189,28.870227,29.196233,41.474932,49.537662,54.810309,29.861011,20.237589


In [21]:
a = np.random.multivariate_normal([0, 0], [[1,0],[0,100]], 5)
print(a)

[[ -0.82737475 -23.18235809]
 [ -1.03879472  -1.42685925]
 [ -0.6182129    5.35646919]
 [  0.6003344    1.11075041]
 [ -0.5590925   16.06892693]]
