# Exploratory Data Analysis - Assignment 10

## Data and Package Import

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_excel('data/impurity_dataset-training.xlsx')

In [None]:
def is_real_and_finite(x):
    if not np.isreal(x):
        return False
    elif not np.isfinite(x):
        return False
    else:
        return True

In [None]:
all_data = df[df.columns[1:]].values #drop the first column (date)
numeric_map = df[df.columns[1:]].applymap(is_real_and_finite)
real_rows = numeric_map.all(axis=1).copy().values #True if all values in a row are real numbers
X_dow = np.array(all_data[real_rows,:-5], dtype='float') #drop the last 5 cols that are not inputs
y_dow = np.array(all_data[real_rows,-3], dtype='float')
y_dow = y_dow.reshape(-1,1)

## 1. k-Means

In [None]:
def dist(pt1, pt2):
    "Euclidean distance between two points"
    #note that this can also be performed with np.linalg.norm(pt1-pt2)
    return np.sqrt(sum([(xi-yi)**2 for xi, yi in zip(pt1, pt2)]))

def expected_assignment(pt, cluster_centers):
    # Expectation: find the cluster center closest to the point
    dists = [dist(pt,ci) for ci in cluster_centers] #<- find distance to each center
    min_index = dists.index(min(dists)) #<- find the index (cluster) with the minimum dist
    return min_index

def new_centers(cluster_points, centers):
    # Maximization: maximize the proximity of centers to points in a cluster
    centers = list(centers)
    for i,ci in enumerate(cluster_points):
        if len(ci) > 0: #skip empty clusters so their centers stay where they are
            centers[i] = np.mean(ci, axis=0)
    return centers

def dist_centers(centers):
    "Pairwise distances between all cluster centers (any number of centers)"
    n = len(centers)
    return np.array([dist(centers[i], centers[j]) for i in range(n) for j in range(i+1, n)])

**Modify the code from the topic notes into a function for k-means clustering.**  

This function should take the following arguments:  
- the dataset `X`
- the initial guesses `centers`

Convergence criterion: iterate until the maximum change in the distances between cluster centers is below the tolerance `tol` (default 0.1).
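
The expectation and maximization steps described above can be sketched as follows. This is a minimal illustration, not the reference solution: `kmeans_sketch`, `pairwise_center_dists`, and `max_iter` are hypothetical names, and it assumes the convergence criterion means the largest change in the pairwise center-to-center distances between two successive iterations.

```python
import numpy as np

def pairwise_center_dists(centers):
    # Distances between every pair of centers (upper triangle only)
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    return d[np.triu_indices(len(centers), k=1)]

def nearest_center(X, centers):
    # Index of the closest center for every row of X
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans_sketch(X, centers, tol=0.1, max_iter=100):
    # Hypothetical helper; assumes at least two centers
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        labels = nearest_center(X, centers)           # expectation step
        updated = centers.copy()
        for k in range(len(centers)):                 # maximization step
            members = X[labels == k]
            if len(members) > 0:                      # empty clusters keep their center
                updated[k] = members.mean(axis=0)
        change = np.abs(pairwise_center_dists(updated)
                        - pairwise_center_dists(centers)).max()
        centers = updated
        if change < tol:
            break
    return centers, nearest_center(X, centers)
```

Returning the labels alongside the centers is a design choice made here so the result can be plotted directly; the template function in this notebook returns only the centers.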

In [None]:
def kmeans(X, centers, tol = 0.1):
    #new_centers = ?
    return new_centers

**Use TSNE on `X_dow` and reduce its dimensionality to 2.**
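
As a rough illustration of the call pattern, the sketch below runs `TSNE` on synthetic data standing in for `X_dow`. The array `X_demo` and the fixed `random_state` are assumptions made only so the example is self-contained and repeatable.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # synthetic stand-in for X_dow

# n_components=2 requests a two-dimensional embedding;
# random_state is fixed here only for repeatability (an assumption)
tsne = TSNE(n_components=2, random_state=0)
X_demo_2d = tsne.fit_transform(X_demo)
print(X_demo_2d.shape)
```

t-SNE embeddings depend on the random seed and on `perplexity`, so the coordinates obtained for the real data may differ from run to run unless a seed is fixed.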

In [None]:
from sklearn.manifold import TSNE

# X_tsne = ?

**Pass `X_tsne` and the initial guess `centers` to the `kmeans` function you created.**  

Plot the clustering result, color-coding the points by cluster. Mark the cluster centers with `*` markers (`marker="*"`).
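
One possible way to draw such a plot is sketched below on synthetic points. The names `points`, `labels`, and `centers_found` are hypothetical placeholders for the embedded data, the cluster assignments, and the converged centers.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-ins: two groups of 2-D points with known labels
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
centers_found = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])

fig, ax = plt.subplots()
# Color each point by its cluster label
ax.scatter(points[:, 0], points[:, 1], c=labels, cmap='viridis', s=15)
# Overlay the cluster centers as large stars
ax.scatter(centers_found[:, 0], centers_found[:, 1],
           marker='*', s=300, c='red', edgecolors='k')
ax.set_xlabel('component 1')
ax.set_ylabel('component 2')
```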

In [None]:
centers = [[5, 0], [7, 5], [100, 100], [15, 20]]

# kmeans(X_tsne, centers)

fig, ax = plt.subplots()

**Use the built-in `scikit-learn` `KMeans` model to perform k-means clustering.**

Set `n_clusters` to 4 and fit the model to `X_tsne`.  
Plot the clustering result, color-coding the points by cluster. Mark the cluster centers with `*` markers (`marker="*"`).
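
The fitting step might look like the sketch below, shown on synthetic blobs instead of `X_tsne`. `X_demo` and `centers_sk` are hypothetical names, and `n_init` and `random_state` are set explicitly only to make the example repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four groups, standing in for X_tsne
X_demo, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X_demo)    # cluster index for each point
centers_sk = model.cluster_centers_   # array of shape (n_clusters, n_features)
```

The `labels` and `centers_sk` arrays can then be passed to `ax.scatter` for the color-coded plot.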

In [None]:
from sklearn.cluster import KMeans

fig, ax = plt.subplots()

**Do the results of your implementation match the `scikit-learn` implementation? If not, briefly explain what might cause the discrepancy.**

## 2. Silhouette Score vs. `bandwidth` for Mean Shift

**Load the MNIST dataset.**
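
A minimal loading sketch follows. It assumes that "MNIST" here refers to the 8x8 handwritten digits dataset bundled with `scikit-learn`, which is suggested by the `reshape(8,8)` call in `show_image` later in this notebook. The names `X_mnist` and `y_mnist` are hypothetical.

```python
from sklearn.datasets import load_digits

digits = load_digits()
X_mnist = digits.data    # one flattened 8x8 image (64 pixel values) per row
y_mnist = digits.target  # integer labels 0-9
print(X_mnist.shape, y_mnist.shape)
```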

**Use `KernelPCA` on the MNIST dataset.**  

Set `n_components = 2, kernel = 'rbf'` and use default values for all other hyperparameters.
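
The call pattern can be sketched on synthetic data as shown below. `X_demo` is a stand-in for the digit data, used only so that the example runs on its own.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # synthetic stand-in for the digit data

# RBF kernel, two output components, all other hyperparameters left at defaults
kpca = KernelPCA(n_components=2, kernel='rbf')
X_demo_kpca = kpca.fit_transform(X_demo)
print(X_demo_kpca.shape)
```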

In [None]:
from sklearn.decomposition import KernelPCA

# X_kpca = ?

**Plot the silhouette score as a function of `bandwidth` for the `MeanShift` model.**  

Apply the mean shift algorithm to `X_kpca`, varying `bandwidth` over the values [0.01, 0.05, 0.1, 0.2, 0.3].  
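
A sketch of the bandwidth scan is given below, using synthetic blobs in place of `X_kpca`. The guard around `silhouette_score` is an addition made here: the score is only defined when the number of clusters is between 2 and `n_samples - 1`, which a very small or very large bandwidth may violate.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for X_kpca: three tight 2-D blobs
blob_centers = np.array([[0.0, 0.0], [0.6, 0.0], [0.3, 0.5]])
X_demo = np.vstack([c + rng.normal(0, 0.03, (40, 2)) for c in blob_centers])

bandwidths = [0.01, 0.05, 0.1, 0.2, 0.3]
scores = []
for bw in bandwidths:
    labels = MeanShift(bandwidth=bw).fit_predict(X_demo)
    n_found = len(np.unique(labels))
    # Record NaN when the silhouette score is undefined for this clustering
    if 2 <= n_found <= len(X_demo) - 1:
        scores.append(silhouette_score(X_demo, labels))
    else:
        scores.append(np.nan)

fig, ax = plt.subplots()
ax.plot(bandwidths, scores, marker='o')
ax.set_xlabel('bandwidth')
ax.set_ylabel('silhouette score')
```

The bandwidth with the highest finite score would be the natural candidate for the best model.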

In [None]:
from sklearn.metrics import silhouette_score

fig, ax = plt.subplots()

**Plot the resulting clustering from the best model.**  

Plot the clustering result, color-coding the points by cluster. Mark the cluster centers with `*` markers (`marker="*"`).

In [None]:
fig, ax = plt.subplots()

## 3. Generative Model for Handwritten Digits

**Select the points labeled as 6 in the MNIST dataset.**

In [None]:
# X_mnist_6 = ?

**Train a kernel density estimation (KDE) model.**

Use a bandwidth of 0.35 and a Gaussian kernel. 

**Visualize an example of a synthetic 6 generated by the KDE model.**
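
The three steps above (selecting the 6s, fitting the KDE, and drawing a synthetic sample) could look roughly like this. The sketch assumes the 8x8 digits dataset from `scikit-learn`; `X_six`, `synthetic`, and `synthetic_image` are hypothetical names, and `random_state` is fixed only for repeatability.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity

digits = load_digits()
X_six = digits.data[digits.target == 6]  # keep only rows labeled 6

# Gaussian kernel with the bandwidth given in the prompt
kde = KernelDensity(bandwidth=0.35, kernel='gaussian').fit(X_six)

# Draw one synthetic sample and reshape it into an 8x8 image
synthetic = kde.sample(n_samples=1, random_state=0)
synthetic_image = synthetic[0].reshape(8, 8)
print(synthetic.shape, synthetic_image.shape)
```

Because `synthetic` has one row per sample, it can be passed to a plotting helper together with the row index 0.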

In [None]:
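def show_image(digit_data, n, ax=None):
    "Display the n-th digit as an 8x8 image with a colorbar"
    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = ax.figure #use the figure that owns the supplied axis
    img = digit_data[n].reshape(8,8)
    colormap = ax.imshow(img,cmap='binary')
    fig.colorbar(colormap, ax=ax)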

**6745 Only: Find the optimal number of Gaussians by using BIC.**  

Use a Gaussian mixture model (GMM) for this problem.  
Set `covariance_type='full'` and train the GMM models on `X_mnist_6`.  
You should search over `n_components` from 2 to 40.  
Plot the BIC vs. `n_components`.
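
The BIC search can be sketched as below on synthetic 2-D data. The data, the reduced search range of 1 to 6, and the names `X_demo`, `bics`, and `best_n` are assumptions chosen to keep the example fast; the assignment itself asks for 2 to 40 components on `X_mnist_6`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for X_mnist_6: three 2-D blobs
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(0, 0.5, (100, 2)) for c in blob_centers])

n_range = range(1, 7)  # smaller than the assignment's range, for speed
bics = []
for n in n_range:
    gmm = GaussianMixture(n_components=n, covariance_type='full', random_state=0)
    gmm.fit(X_demo)
    bics.append(gmm.bic(X_demo))  # lower BIC indicates a better trade-off

best_n = list(n_range)[int(np.argmin(bics))]
print('Lowest BIC at n_components =', best_n)
```

Plotting `bics` against `list(n_range)` gives the BIC curve; the minimum marks the preferred number of components under this criterion.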

In [None]:
from sklearn.mixture import GaussianMixture

fig, ax = plt.subplots()

# optimal_n = ?
print('Optimal number of Gaussians: {}'.format(optimal_n))

**6745 Only: Which model (GMM or KDE) would you expect to perform better in a Bayesian classification scheme? Briefly explain.**