<img src="https://juniorworld.github.io/python-workshop-2018/img/portfolio/week9.png" width="350px">

---

# Unsupervised Machine Learning

- Train a model using only the input data, without labelled outcomes
- Dimension Reduction: PCA or SVD
- Clustering Analysis: Modularity, KMeans

## PCA: Principal Component Analysis
- Purpose: Dimension Reduction + Visualization
- Input: high-dimensional data
- Output: low-dimensional data
- Survey: 100 items -> 5 components

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

In [None]:
iris=load_iris()

In [None]:
print(iris.feature_names)

In [None]:
data=iris.data

In [None]:
data.shape

In [None]:
pca = PCA(n_components=2)
pca.fit(data)

In [None]:
pca.explained_variance_ratio_

In [None]:
pca = PCA(n_components=3) #initialize a PCA decomposer
pca.fit(data) #train this decomposer with current data set

In [None]:
pca.explained_variance_ratio_

In [None]:
iris_pca=pca.transform(data)

In [None]:
iris_pca[0,:]

In [None]:
#plot this matrix
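import plotly.plotly as py #in plotly 4 and later this module lives in the separate chart_studio package: import chart_studio.plotly as py
import plotly.graph_objs as go

py.sign_in('USER NAME','API TOKEN') #replace with your own plotly user name and API token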

In [None]:
trace=go.Scatter3d(
                 x=iris_pca[:,0],
                 y=iris_pca[:,1],
                 z=iris_pca[:,2],
                 mode='markers',
                 marker={'size':3,'color':iris.target})
py.iplot([trace],filename='iris pca')

In [None]:
trace=go.Scatter(
                 x=iris_pca[:,0],
                 y=iris_pca[:,1],
                 mode='markers',
                 marker={'size':3,'color':iris.target})
py.iplot([trace],filename='iris pca')

Try another data set...

In [None]:
! pip3 install matplotlib

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
#load digits data
digits = load_digits()

`digits` contains three main arrays: `images` (the 8x8 pictures), `data` (the same pictures flattened to 64 numbers each), and `target` (the labels)

In [None]:
#"images" contains 8x8 images of each data point.
plt.gray() 
plt.matshow(digits.images[0])
plt.show() 

In [None]:
#"target" contains the label (true digit) of each data point
digits.target[0]

In [None]:
#"data" contains the brightness numbers of each data point
data=digits.data

In [None]:
data.shape #overall 1797 cases x 64 brightness numbers

In [None]:
data[0,:] #first case

In [None]:
#Find the minimum number of components which can explain over 80% of data variance
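One possible way to approach this exercise, as a sketch rather than the intended solution: fit a PCA that keeps every component, accumulate the explained variance ratios, and look for the first position where the running total exceeds 0.8. The names `pca_full`, `cumulative` and `n_components_80` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

data = load_digits().data  # 1797 cases x 64 brightness numbers

# PCA() without n_components keeps all components,
# so the variance ratio of every component is available
pca_full = PCA()
pca_full.fit(data)

# running total of explained variance: entry i covers components 0..i
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# position of the first entry above 0.8, plus 1 to turn an index into a count
n_components_80 = int(np.argmax(cumulative > 0.8)) + 1
print(n_components_80, cumulative[n_components_80 - 1])
```

`np.argmax` on a boolean array returns the position of the first `True`, which is why it can be used to locate the threshold crossing.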




In [None]:
pca = PCA(n_components=3) #specify the component number to 3
pca.fit(data)
digits_3d=pca.transform(data) #obtain the transformed data

In [None]:
trace=go.Scatter3d(
        x=digits_3d[:,0],
        y=digits_3d[:,1],
        z=digits_3d[:,2],
        mode='markers',
        marker={'size':2,'color':digits.target,'colorscale':'Rainbow'},
        text=digits.target)
py.iplot([trace],filename='digits space')

In [None]:
trace=go.Scatter(
        x=digits_3d[:,0],
        y=digits_3d[:,1],
        mode='markers',
        marker={'size':2,'color':digits.target,'colorscale':'Rainbow'},
        text=digits.target)
py.iplot([trace],filename='digits space')

---
## Break
---

## KMeans
- Purpose: cluster data points according to their Euclidean distance
- Input: observation data
- Output: predicted groups
- Procedure:
    - STEP 1. Initialize K cluster centroids
    - STEP 2. Calculate the distance between each data point and each centroid
    - STEP 3. Assign data point to the cluster whose centroid is closest to it
    - STEP 4. Update the cluster centroids with new group
    - STEP 5. Repeat STEP 2~4 for a fixed number of iterations or until convergence

<h3 style="color: red">1. Step-by-Step Breakdowns</h3>

### STEP 1. Initialize K cluster centroids

In [None]:
#randomly pick K points as centroids
#Suppose: K=10
random_centroids_index=np.random.choice(range(data.shape[0]),10,replace=False) #replace=False avoids picking the same point twice

In [None]:
random_centroids=data[random_centroids_index,:]

### STEP 2. Calculate the pairwise distance between data point and each centroid

**a) VECTOR NORM**

<img src="https://juniorworld.github.io/python-workshop-2018/img/vector_norm.png" width="200px" align='left'>

In [None]:
#According to Pythagorean theorem, if U=(U1,U2,U3,...,Un), ‖U‖=sqrt(U1^2 + U2^2 + U3^2 +...+ Un^2)
a=np.array([0,1,2,3])
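norm=np.sqrt(np.sum(a**2)) #one possible answer: square each element, sum them, then take the square root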

In [None]:
#SHORTCUT: Use np.linalg.norm() to calculate the vector norm
np.linalg.norm(a)

**b) VECTOR SUBTRACTION**

<img src="https://juniorworld.github.io/python-workshop-2018/img/vector_minus.png" width="250px" align='left'>

In [None]:
a=np.array([0,1])
b=np.array([0,2])
np.linalg.norm(a-b)

In [None]:
#get the pairwise distance between first data point and first cluster's centroid
np.linalg.norm(data[0]-random_centroids[0])

In [None]:
#get the distance between all data and first centroid in one line
np.linalg.norm(data-random_centroids[0],axis=1)

### STEP 3. Assign each digit data point to its closest centroid

In [None]:
#Calculate the distance between each data point and each centroid, assign point to the cluster depending on this distance
#You need to get a list of cluster assignment
#HINT: you can use np.argmin() to find the index of minimum value
#----------------------------------
def single_run_KMeans(data,centroids):
    clusters=[] #initialization
    
    
    #Write your code here
    clusters=np.array(clusters)
    return(clusters)
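One possible way to fill in `single_run_KMeans`, offered as a sketch and not as the official answer: loop over the points, measure the distance to every centroid with `np.linalg.norm(..., axis=1)`, and keep the index returned by `np.argmin`. The toy centroids and points below are made up for illustration.

```python
import numpy as np

def single_run_KMeans(data, centroids):
    clusters = []  # initialization
    for point in data:
        # Euclidean distance from this point to every centroid
        distances = np.linalg.norm(point - centroids, axis=1)
        # index of the closest centroid = cluster assignment
        clusters.append(np.argmin(distances))
    return np.array(clusters)

# made-up example: two centroids, four points
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_data = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [10.0, 11.0]])
toy_clusters = single_run_KMeans(toy_data, toy_centroids)
print(toy_clusters)  # [0 1 0 1]
```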

In [None]:
first_run_cluster=single_run_KMeans(data,random_centroids)

In [None]:
first_run_cluster[0] #first point belongs to this cluster after first run

### STEP 4. Update the centroids

Centroids: centers of a group of points/vectors
    - Measure: average of coordinates

<img src="https://juniorworld.github.io/python-workshop-2018/img/centroids.png" width="250px" align='left'>

In [None]:
a=[[1,2,3],
   [2,3,4],
   [4,5,6],
   [6,7,8]]
print(np.mean(a,axis=0)) #column means
print(np.mean(a,axis=1)) #row means

In [None]:
#Update the centroids of clusters
def update_centroids(data,clusters):
    centroids=np.array([])
    for i in range(10):
        
        #Write your code here
    
    return(centroids)
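A possible sketch of `update_centroids`: for each cluster, select its members with a boolean mask and average them column by column. Two things here are assumptions not found in the exercise: the extra parameter `k` (defaulting to 10, the value hard-coded in the skeleton) and the rule of re-seeding an empty cluster with a random data point.

```python
import numpy as np

def update_centroids(data, clusters, k=10):
    centroids = []
    for i in range(k):
        members = data[clusters == i]  # all points assigned to cluster i
        if len(members) > 0:
            # new centroid = column means of the members
            centroids.append(members.mean(axis=0))
        else:
            # assumption: an empty cluster is re-seeded with a random data point
            centroids.append(data[np.random.randint(len(data))])
    return np.array(centroids)

# made-up example: four points already split into two clusters
toy_data = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [10.0, 11.0]])
toy_clusters = np.array([0, 1, 0, 1])
toy_centroids = update_centroids(toy_data, toy_clusters, k=2)
print(toy_centroids)  # rows: [0.5, 1.0] and [9.5, 10.0]
```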

In [None]:
data[first_run_cluster==0]

In [None]:
update_centroids(data,first_run_cluster) #get our second-run centroids

### STEP 5. Calculate the Loss

Loss in Machine Learning ≈ (lack of) Goodness-of-fit in Social Sciences: the lower the loss, the better the model fits the data
- Types of loss: L1 (absolute error), L2 (squared error) and logistic/cross-entropy
    - L1: mean(abs(y-ŷ))
    - L2: mean((y-ŷ)^2)
    - Log: mean(-sum(y*log(ŷ)))
- For KMeans, we use L2 loss:
    - average squared distance between points and their centroids
    - formula: mean(‖x-centroid‖^2), where x is a data point and centroid is the centroid of its cluster

In [None]:
def loss(data,clusters,centroids): #clusters is needed to know which centroid each point belongs to
    ls=[]
    
    #WRITE YOUR CODE HERE
        
    return(ls)
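A possible sketch of the loss function, assuming it should return a single number (the later cells compare it against a threshold) and take the three arguments used in the calls that follow: data, cluster assignments, and centroids. Indexing `centroids[clusters]` looks up the assigned centroid of every point in one step.

```python
import numpy as np

def loss(data, clusters, centroids):
    # centroids[clusters] has one row per data point: its assigned centroid
    diffs = data - centroids[clusters]
    # squared Euclidean distance of each point to its centroid
    squared_distances = np.sum(diffs ** 2, axis=1)
    # L2 loss: average of the squared distances
    return np.mean(squared_distances)

# made-up example: squared distances are 1, 2, 4 and 1
toy_data = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [10.0, 11.0]])
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_clusters = np.array([0, 1, 0, 1])
toy_loss = loss(toy_data, toy_clusters, toy_centroids)
print(toy_loss)  # 2.0
```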

In [None]:
#performance of first run
loss(data,first_run_cluster,random_centroids)

### Training the model for a fixed number of iterations (integrated)

In [None]:
centroids=random_centroids
loss_list=[]
for run in range(20):
    clusters=single_run_KMeans(data,centroids)
    current_loss=loss(data,clusters,centroids)
    print(run,current_loss)
    loss_list.append(current_loss)
    centroids=update_centroids(data,clusters)

In [None]:
#elbow method of finding the optimal number of iterations
trace=go.Scatter(
    x=list(range(20)),
    y=loss_list,
    mode='lines'
)
py.iplot([trace],filename='learning curve')

In [None]:
#have a look at the cluster results
digits.target[clusters==0]

### Training the model until convergence (integrated)

In [None]:
centroids=random_centroids
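current_loss=10000 #start above the threshold so the loop runs at least once
loss_list=[]
while current_loss>700: #700 is a hand-picked threshold; the loop never stops if the loss cannot fall below it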
    clusters=single_run_KMeans(data,centroids)
    current_loss=loss(data,clusters,centroids)
    print(current_loss)
    loss_list.append(current_loss)
    centroids=update_centroids(data,clusters)
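An alternative stopping rule, sketched under assumptions: instead of a fixed loss threshold, stop when the loss improves by less than a small tolerance between two runs, with a cap on the number of runs as a safety net. The helper names (`assign_clusters`, `mean_squared_distance`, `recompute_centroids`), the values of `tolerance` and `max_runs`, and the synthetic two-blob data are all made up for this illustration so that the block runs on its own.

```python
import numpy as np

def assign_clusters(data, centroids):
    # distance of every point (rows) to every centroid (columns)
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)

def mean_squared_distance(data, clusters, centroids):
    return np.mean(np.sum((data - centroids[clusters]) ** 2, axis=1))

def recompute_centroids(data, clusters, old_centroids):
    new_centroids = old_centroids.copy()
    for i in range(len(old_centroids)):
        members = data[clusters == i]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

rng = np.random.default_rng(0)
# made-up data: two blobs of 50 points each
toy_data = np.vstack((rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))))
centroids = toy_data[rng.choice(len(toy_data), 2, replace=False)]

tolerance = 1e-6  # assumed threshold on the loss improvement
max_runs = 100    # safety cap so the loop always ends
loss_list = []
for run in range(max_runs):
    clusters = assign_clusters(toy_data, centroids)
    current_loss = mean_squared_distance(toy_data, clusters, centroids)
    # stop once the loss no longer improves by more than the tolerance
    if loss_list and loss_list[-1] - current_loss < tolerance:
        loss_list.append(current_loss)
        break
    loss_list.append(current_loss)
    centroids = recompute_centroids(toy_data, clusters, centroids)

print(len(loss_list), loss_list[-1])
```

Because each assignment step and each centroid update can only lower (or keep) the loss, the improvement eventually drops under the tolerance, so this rule does not depend on knowing a good threshold in advance.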

## KMeans++: An Improvement of KMeans

- New way of initialization
    - STEP 1: Randomly pick one centroid from the points
    - STEP 2: Calculate the distance _D(k)_ between each point and its nearest centroid
    - STEP 3: Pick one more centroid with probability proportional to _D(k)_ (the original k-means++ algorithm weights by the squared distance _D(k)²_; this notebook uses the plain distance)
    - STEP 4: Repeat STEP 2 and STEP 3 until the number of centroids reaches the required value

In [None]:
#STEP 1
first_centroid_index=np.random.choice(range(data.shape[0]),1)
first_centroid=data[first_centroid_index]

In [None]:
#HINT 1: You can use np.random.choice(range(data.shape[0]),1,p=[probability list]) to pick the index of one point randomly with a given list of probabilities
#HINT 2: You can use np.vstack((array1,array2)) to add new row
#WRITE YOUR CODE HERE
centroids=first_centroid
for i in range(9):
    

    centroids=np.vstack((centroids,next_centroid)) #numpy arrays have no append(); stack the new row as in HINT 2
centroids=np.array(centroids)

In [None]:
centroids=first_centroid
for i in range(9):
    distances=[]
    for j in data:
        dist=np.min(np.linalg.norm(j-centroids,axis=1)) #axis=1 gives one distance per centroid; take the nearest
        distances.append(dist)
    distances=np.array(distances)/sum(distances)
    next_centroid_index=np.random.choice(range(data.shape[0]),1,p=distances)
    next_centroid=data[next_centroid_index]
    centroids=np.vstack((centroids,next_centroid))
centroids=np.array(centroids)

# Combine everything into a giant function
- input: data
- parameter: init = random/kmeans++, iteration = num/convergence threshold
- output: a list of cluster assignments

In [None]:
def KMeans(data,init='random',iteration=10):



    return(clusters)
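A possible sketch of the combined function, under stated assumptions. It is named `kmeans_from_scratch` (a made-up name, chosen so it does not clash with sklearn's `KMeans` imported later), it takes extra hypothetical parameters `k` and `seed`, it supports only a fixed number of iterations (no convergence threshold), and its `kmeans++` branch follows this notebook's rule of weighting by the plain distance. It assumes the data contains at least `k` distinct points.

```python
import numpy as np

def _distances(data, centroids):
    # distance of every point (rows) to every centroid (columns)
    return np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)

def kmeans_from_scratch(data, k=10, init='random', iteration=10, seed=None):
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    if init == 'random':
        # pick k distinct points as starting centroids
        centroids = data[rng.choice(n, k, replace=False)].astype(float)
    elif init == 'kmeans++':
        centroids = data[rng.choice(n, 1)].astype(float)
        for _ in range(k - 1):
            # distance of every point to its nearest existing centroid
            d = np.min(_distances(data, centroids), axis=1)
            # next centroid: probability proportional to that distance
            next_index = rng.choice(n, 1, p=d / d.sum())
            centroids = np.vstack((centroids, data[next_index]))
    else:
        raise ValueError("init must be 'random' or 'kmeans++'")
    for _ in range(iteration):
        clusters = np.argmin(_distances(data, centroids), axis=1)
        for i in range(k):
            members = data[clusters == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[i] = members.mean(axis=0)
    # final assignment against the last set of centroids
    return np.argmin(_distances(data, centroids), axis=1)

# made-up data: two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
toy_data = np.vstack((rng.normal(0, 1, (50, 2)), rng.normal(20, 1, (50, 2))))
labels_random = kmeans_from_scratch(toy_data, k=2, init='random', seed=0)
labels_pp = kmeans_from_scratch(toy_data, k=2, init='kmeans++', seed=0)
```

For the iris exercise, a call such as `kmeans_from_scratch(load_iris().data, k=3, init='kmeans++')` would be one way to apply it; `k=3` is an assumption based on the three iris species.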

In [None]:
#APPLY KMeans() to iris data




## SHORTCUT: sklearn function

documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

In [None]:
from sklearn.cluster import KMeans #note: this import replaces any KMeans() function you defined yourself earlier in the notebook
kmeans=KMeans(n_clusters=10,init='k-means++',max_iter=20)
kmeans.fit(data)

In [None]:
kmeans.labels_

In [None]:
digits.target[kmeans.labels_==1]

## Find the best value of K

In [None]:
#a data set about development score of countries
countries=pd.read_csv('https://juniorworld.github.io/python-workshop-2018/doc/country-index.csv')

In [None]:
countries.head()

In [None]:
kmeans=KMeans(n_clusters=5,init='k-means++')
kmeans.fit(countries.iloc[:,3:])

In [None]:
countries['countries'][kmeans.labels_==0]

In [None]:
ls_list=[]
for i in range(1,21):
    kmeans=KMeans(n_clusters=i,init='k-means++')
    kmeans.fit(countries.iloc[:,3:])
    ls=loss(np.array(countries.iloc[:,3:]),kmeans.labels_,kmeans.cluster_centers_)
    ls_list.append(ls)

In [None]:
#elbow method of finding the optimal K
trace=go.Scatter(
    x=list(range(1,21)),
    y=ls_list,
    mode='lines'
)
py.iplot([trace],filename='learning curve')
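If the hand-written `loss()` function is not available, a similar elbow curve can be sketched with the `inertia_` attribute of a fitted sklearn `KMeans`, which holds the sum of squared distances of the samples to their closest cluster center. Dividing by the number of samples gives a mean comparable to the L2 loss above. The synthetic data from `make_blobs` and the names `X`, `k_values` and `mean_losses` are assumptions for this illustration; the country data is not used here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# made-up data: 300 points drawn around 4 centers
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

k_values = list(range(1, 11))
mean_losses = []
for k in k_values:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X)
    # inertia_ is the SUM of squared distances; divide by n to get the mean
    mean_losses.append(km.inertia_ / X.shape[0])

for k, value in zip(k_values, mean_losses):
    print(k, round(value, 2))
```

Plotting `mean_losses` against `k_values` and looking for the point where the curve flattens is the same elbow reading as in the cell above.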