# Unsupervised Learning - K Means Clustering 


### PCA: Principal Component Analysis
PCA reduces the dimensionality of a dataset by projecting the data onto its principal components, the orthogonal directions along which the data varies most. The goal is to reduce the number of features while losing only a small amount of information. With multi-feature datasets, PCA is often used to make the data plottable in 2 or 3 dimensions or to speed up machine learning algorithms.

During dimensionality reduction with PCA, we will use the `explained_variance_ratio_` attribute, which tells you what fraction of the total variance (information) each principal component retains. This matters because projecting, say, a 4-dimensional space onto 2 dimensions loses some of the information in the dataset. It is therefore recommended to compare the cumulative explained variance ratio of the candidate dimensions (e.g., 2D vs. 3D) before choosing one.

For this exercise, PCA will be used solely for visualization. Therefore, we will run K-means on the dataset first and then use PCA to visualize the resulting clusters.
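
To make the idea of explained variance concrete, here is a minimal NumPy-only sketch of what PCA computes. The data is synthetic and hypothetical (it is not the wine dataset), and every variable name is an assumption chosen for illustration. In practice `sklearn.decomposition.PCA` does this work for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 4 features, two of them strongly correlated
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    0.9 * base[:, 0] + rng.normal(scale=0.1, size=200),
    base[:, 1],
    rng.normal(scale=0.2, size=200),
])

# Center the data and compute the covariance matrix of the features
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components;
# eigh returns eigenvalues in ascending order, so reverse them
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance carried by each component, and the running total
explained_variance_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Project the 4-D data onto the first two principal components
X_2d = Xc @ eigvecs[:, :2]

print(explained_variance_ratio)
print(cumulative)
```

The second entry of `cumulative` is the share of variance kept by a 2D projection and the third entry is the share kept by a 3D projection, which is the comparison described above.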

### Reference links for PCA
* https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python
* https://www.youtube.com/watch?v=kw9R0nD69OU
* https://www.youtube.com/watch?v=_nZUhV-qhZA
* https://www.youtube.com/watch?v=kuzJJgPBrqc
* https://www.youtube.com/watch?v=kApPBm1YsqU&t=709s

### Dataset for the project and its objective.
Davis is an intern at a winery. For her summer project, she has to cluster wines based on the results of their physicochemical analysis. Once the project is complete, the product development team will merge the clusters with sales data and use the result as a reference for developing wine profiles that lift sales.

Wine Dataset for clustering (https://archive.ics.uci.edu/ml/datasets/wine)

### Algorithm outline

Steps to take:
1. Explore the data and confirm that the model assumptions are met
2. Prepare the data for clustering
3. Determine the optimum number of clusters, k, using the elbow method and the silhouette method
4. Explore the results of fitting K-means to the dataset while changing k
5. Interpret and discuss the clusters
6. Set up PCA
7. Visualize clusters using PCA


In [None]:
# 1. Explore data and confirm the K-means model assumptions are met

# 1-1-a: import pandas
import pandas as pd

# 1-1-b: read the dataset in CSV format as a DataFrame

# 1-1-c: use info() to verify the data type of each column

# 1-1-d: use head() to preview the dataset; check whether there are any categorical columns and confirm there is no label column

# 1-1-e: use describe() to review summary statistics of the data

#1-1-f: check if the data has null values
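
The exploration steps above can be sketched as follows. As an assumption, this sketch loads scikit-learn's bundled copy of the UCI Wine data through `load_wine` instead of reading the CSV file, so that it runs without a download. The name `wine_df` is hypothetical.

```python
from sklearn.datasets import load_wine

# Assumption: the bundled copy of the UCI Wine data stands in for the CSV file
wine = load_wine(as_frame=True)
wine_df = wine.data  # features only; the class label is kept separately in wine.target

wine_df.info()             # data type and non-null count of each column
print(wine_df.head())      # preview the first rows
print(wine_df.describe())  # summary statistics per column

null_counts = wine_df.isnull().sum()  # number of missing values per column
print(null_counts)
```

If you read the data with `pd.read_csv` instead, the same four inspection calls apply unchanged.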


In [None]:
# 2-1 identify categorical columns and keep only the numerical columns in the dataset
# depending on the dataset, one can define a categorical column as a column with
# 5 or fewer unique values
# e.g. Mall_Customers = Mall_Customers.loc[:, Mall_Customers.nunique() > 5]
# under this definition, the Gender column of Mall_Customers is treated as a categorical feature.

# 2-1-a: define a new dataframe that keeps only the columns with more than 5 unique values

# 2-1-b: remove columns with string or object data types using a for loop

# 2-1-c. Review the result of the data frame cleaning 
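
Here is a minimal sketch of the filtering steps above, run on a tiny made-up customer table so the effect of each step is visible. The table and the names `customers` and `numeric_df` are hypothetical. The loop uses `is_numeric_dtype` rather than comparing against `object`, which is an assumption made so the check also works with pandas' newer string dtypes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy table: one ID column, one categorical column, two numeric columns
customers = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3", "C4", "C5", "C6", "C7"],
    "Gender": ["M", "F", "F", "M", "F", "M", "F"],
    "Age": [19, 21, 20, 23, 31, 22, 35],
    "Score": [39.0, 81.0, 6.0, 77.0, 40.0, 76.0, 94.0],
})

# 2-1-a: keep only columns with more than 5 unique values (drops Gender)
numeric_df = customers.loc[:, customers.nunique() > 5]

# 2-1-b: drop any remaining non-numeric columns (drops CustomerID)
for column in list(numeric_df.columns):
    if not is_numeric_dtype(numeric_df[column]):
        numeric_df = numeric_df.drop(columns=column)

# 2-1-c: review the result
print(numeric_df.columns.tolist())
```

Note that the unique-value filter alone is not enough here: the ID column has many unique values but is still not a usable numeric feature, which is why the dtype check follows it.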


In [None]:
# 2-2 use visualization to understand data's skewness

# 2-2-a: import seaborn statistical visualization library

# 2-2-b: plot a pairwise relationship with kernel density estimates (KDE)
# KDE estimates the probability density of a continuous random variable; it is a smoothed version of a histogram

# 2-2-c: import matplotlib library

# 2-2-d: use the function plt.show() to show the plot


In [None]:
# 2-3 apply a logarithmic transformation to make the distributions more symmetric (less skewed, closer to bell-shaped)
# the log works only on positive values; for zero or negative values, add a constant before the log transform or use a cube root transformation

# 2-3-a: import the numpy library to apply the logarithmic transformation to the dataframe

# 2-3-b: create a new dataframe after applying np.log() onto the original dataframe

# 2-3-c: plot a pairwise relationship with kernel density estimates (KDE)

# 2-3-d: use the function plt.show() to show the plot
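
The effect of the log transformation can be checked numerically as well as visually. The sketch below uses a small made-up, right-skewed column (not the wine data), and the names `skewed` and `log_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: each value is 10x the previous one
skewed = pd.DataFrame({"feature": [1.0, 10.0, 100.0, 1000.0, 10000.0]})

# np.log applied to a DataFrame returns a DataFrame with the same shape and labels
log_df = np.log(skewed)

# Sample skewness before and after: positive means a long right tail, 0 means symmetric
skew_before = skewed["feature"].skew()
skew_after = log_df["feature"].skew()

print(skew_before, skew_after)
```

In this toy case the logged values are evenly spaced, so the skewness drops to essentially zero. Real features will usually end up less skewed rather than perfectly symmetric.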


In [None]:
# 2-4 standardize data using StandardScaler (mean = 0, std = 1)

# 2-4-a: check if data is already standardized (mean = 0, std = 1) by applying the describe() function

# 2-4-b: import StandardScaler library as the dataset is not standardized (mean != 0, std != 1)

# 2-4-c: initialize a scaler using StandardScaler()

# 2-4-d: fit the scaler using fit()

# 2-4-e: scale and center the data using transform()

# 2-4-f: create a pandas DataFrame with the standardized data

# 2-4-g: print summary statistics of the dataset using describe()
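
A minimal sketch of the standardization steps above, on a tiny made-up table. The column values and the names `raw_df`, `scaler`, and `scaled_df` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
raw_df = pd.DataFrame({
    "alcohol": [12.0, 13.0, 14.0, 13.5],
    "proline": [500.0, 1000.0, 1500.0, 750.0],
})

scaler = StandardScaler()            # 2-4-c: initialize the scaler
scaler.fit(raw_df)                   # 2-4-d: learn each column's mean and standard deviation
scaled = scaler.transform(raw_df)    # 2-4-e: center and scale; returns a NumPy array by default

# 2-4-f: wrap the array in a DataFrame, restoring the column names
scaled_df = pd.DataFrame(scaled, columns=raw_df.columns, index=raw_df.index)

# 2-4-g: summary statistics of the standardized data
print(scaled_df.describe())
```

One detail to expect in the output: `StandardScaler` divides by the population standard deviation, while `describe()` reports the sample standard deviation, so the printed `std` will be slightly above 1 on small datasets.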


In [None]:
# 3 Determine K

# 3-1 Elbow Method

# 3-1-a: import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

# 3-1-b: initiate a list called wcss

# 3-1-c-1: build a for loop over the k values in range(1, 11)
    
    # 3-1-c-2: initialize a KMeans instance with the number of clusters set to k; to make the
    # result reproducible, set the random_state parameter to any fixed integer
    
    # 3-1-c-3: use the fit() function on the standardized data set into the KMeans instance
    
    # 3-1-c-4: append inertia_ attribute of instance to wcss
    # inertia_ attribute is the sum of squared distances of samples to their closest cluster center.

# 3-1-d: assign title to the elbow method plot using plt.title()

# 3-1-e: assign x axis label using plt.xlabel()

# 3-1-f: assign y axis label using plt.ylabel()

# 3-1-g: use sns.pointplot to draw a line plot of K vs. Within Cluster Sum of Squares (WCSS)

# 3-1-h: show the plot using plt.show()
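
The elbow loop above can be sketched as follows. As assumptions, the sketch uses scikit-learn's bundled wine data, skips the log transformation, fixes `random_state=42`, and sets `n_init=10` explicitly. Plotting is left out so the example stays focused on how `wcss` is built.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: bundled wine data, standardized, stands in for the prepared dataset
X_scaled = StandardScaler().fit_transform(load_wine().data)

k_values = range(1, 11)
wcss = []
for k in k_values:
    # A fixed random_state makes the centroid initialization reproducible
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    # inertia_ is the sum of squared distances of samples to their closest cluster center
    wcss.append(kmeans.inertia_)

for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

For k = 1 on standardized data, the WCSS equals the number of samples times the number of features, which is a handy sanity check. The elbow is the k after which further increases reduce WCSS only slowly.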


In [None]:
#3-2 Silhouette Method to compare between K = 3,4,5

# 3-2-a: import silhouette_score from sklearn.metrics
from sklearn.metrics import silhouette_score

# 3-2-b: initiate a list called si

# 3-2-c-1: because we want to compare k = 3, 4, and 5,
# build a for loop over the k values in range(3, 6)

    # 3-2-c-2: initialize a KMeans instance with the number of clusters set to k; to make the
    # result reproducible, set the random_state parameter to any fixed integer
    
    # 3-2-c-3: use the fit() function on the standardized data set into the KMeans instance
    
    # 3-2-c-4: assign the labels_ attribute of the instance to a variable called labels
    # labels_ attribute is labels or cluster of each point
    
    # 3-2-c-5: apply the silhouette_score function which returns the mean silhouette coefficient of the
    # overall samples using the euclidean distance

# 3-2-d: assign title to the silhouette method plot using plt.title()

# 3-2-e: assign x axis label using plt.xlabel()

# 3-2-f: assign y axis label using plt.ylabel()

# 3-2-g: use sns.pointplot to draw a line plot of K vs. Silhouette score (si)

# 3-2-h: show the plot using plt.show()
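
A minimal sketch of the silhouette comparison above, under the same assumptions as before: bundled wine data, no log transformation, `random_state=42`, and `n_init=10`. The helper variable `best_k` is hypothetical and is not part of the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: bundled wine data, standardized, stands in for the prepared dataset
X_scaled = StandardScaler().fit_transform(load_wine().data)

k_values = list(range(3, 6))  # k = 3, 4, 5
si = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    labels = kmeans.labels_  # cluster index assigned to each sample
    # Mean silhouette coefficient over all samples, using Euclidean distance
    si.append(silhouette_score(X_scaled, labels, metric="euclidean"))

# The k with the highest mean silhouette score is the best-separated candidate
best_k = k_values[si.index(max(si))]
print(dict(zip(k_values, si)), best_k)
```

Silhouette scores range from -1 to 1, and higher values indicate tighter, better-separated clusters. The score is undefined for a single cluster, which is why the loop cannot start at k = 1.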

In [None]:
#4. test & learn

# 4-1: initialize `KMeans` with 3 clusters

# 4-2: fit the model on the pre-processed standardized dataset

# 4-3: assign the generated labels to a new column

# 4-4: group the dataset by the segment labels and calculate average feature values

# 4-5: sort the clusters by the average value of a feature of interest

# 4-6: add a new column 'sampleSize' to ensure the cluster size is meaningful

# 4-7: print the average column values per each segment
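
The profiling steps above might look like the sketch below. As assumptions, it uses the bundled wine data, clusters on the standardized values but reports averages in the original units, and sorts by `alcohol` purely as an example feature. The names `profiled` and `summary` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: bundled wine data stands in for the prepared dataset
wine_df = load_wine(as_frame=True).data
X_scaled = StandardScaler().fit_transform(wine_df)

# 4-1, 4-2: fit K-means with 3 clusters on the standardized data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# 4-3: attach the labels to a copy of the original (unscaled) features
profiled = wine_df.assign(cluster=kmeans.labels_)

# 4-4: average feature values per cluster
summary = profiled.groupby("cluster").mean()

# 4-6: number of samples in each cluster
summary["sampleSize"] = profiled.groupby("cluster").size()

# 4-5: sort clusters by one feature (alcohol is an arbitrary example choice)
summary = summary.sort_values("alcohol")

# 4-7: print the per-cluster averages
print(summary.round(2))
```

Reporting the averages in original units makes the one-sentence cluster summaries easier to write than reading standardized z-scores.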


In [None]:
# Write a sentence summary of each cluster

In [None]:
# 5 Principal Component Analysis

# 5-1: import PCA from sklearn.decomposition

# 5-2: initialize a 2-component PCA instance with n_components = 2 and assign it to the variable pca_2

# 5-3: use the fit_transform function with wine_data_df as input (fit_transform(wine_data_df))
# and assign the result to the variable pca_2_result

# 5-4: print the cumulative variance explained by 2 principal components using pca_2.explained_variance_ratio_

# 5-5: initialize a 3-component PCA instance with n_components = 3 and assign it to the variable pca_3

# 5-6: use the fit_transform function with wine_data_df as input (fit_transform(wine_data_df))
# and assign to the variable pca_3_result

# 5-7: print the cumulative variance explained by 3 principal components using pca_3.explained_variance_ratio_
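
A minimal sketch of the PCA steps above. As an assumption, `X_scaled` (the standardized bundled wine data) stands in for `wine_data_df`. The names `cum_2` and `cum_3` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: standardized bundled wine data stands in for wine_data_df
X_scaled = StandardScaler().fit_transform(load_wine().data)

# 5-2, 5-3: project onto the first two principal components
pca_2 = PCA(n_components=2)
pca_2_result = pca_2.fit_transform(X_scaled)

# 5-4: cumulative share of variance kept by the 2D projection
cum_2 = pca_2.explained_variance_ratio_.sum()

# 5-5, 5-6: project onto the first three principal components
pca_3 = PCA(n_components=3)
pca_3_result = pca_3.fit_transform(X_scaled)

# 5-7: cumulative share of variance kept by the 3D projection
cum_3 = pca_3.explained_variance_ratio_.sum()

print(f"2 components: {cum_2:.3f}, 3 components: {cum_3:.3f}")
```

The gap between `cum_2` and `cum_3` shows how much extra information the third axis buys, which helps decide whether the 3D plot is worth its added complexity.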


In [None]:
# 5-4-a: create a dataframe called pcf_2_df using the pca_2_result

# 5-4-b: print pcf_2_df using the head()


In [None]:
# 5-5-a: create a dataframe called pcf_3_df using the pca_3_result

# 5-5-b: print pcf_3_df using the head()

In [None]:
# 6 Principal component Analysis visualization in 2D

# 6-1-a: use sns.set() to configure figure aesthetics
sns.set(style='white', rc={'figure.figsize':(9,6)},font_scale=1.1)

# 6-1-b: use plt.figure() to initiate an empty figure/plot

# 6-1-c: use plt.figure() with the figsize parameter to set the figure size to 10 x 10

# 6-1-d: use plt.xticks() and use the parameter fontsize to set the ticks font size to be 12

# 6-1-e: use plt.yticks() and use the parameter fontsize to set the ticks font size to be 14

# 6-2-a: use plt.scatter to show pca_2_result in scatter plots
# make sure to set the parameter c as kmeans_labels and parameter cmap as 'spring'

# 6-2-b: assign x axis label using plt.xlabel()

# 6-2-c: assign y axis label using plt.ylabel()

# 6-2-d: assign title using plt.title()

# 6-2-e: Show the plot using plt.show()
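
The 2D plotting steps above can be wrapped in a small helper. This is a sketch only: the function name `plot_clusters_2d`, the axis labels, and the title are assumptions, and the demo call uses random placeholder points rather than real PCA output.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters_2d(points, labels):
    """Scatter 2D points colored by cluster label and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(10, 10))
    # c maps each point to its cluster label; cmap picks the color scheme
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="spring")
    ax.set_xlabel("Principal component 1")
    ax.set_ylabel("Principal component 2")
    ax.set_title("K-means clusters in 2D PCA space")
    ax.tick_params(labelsize=12)
    return fig, ax


# Placeholder data for demonstration only (not PCA output)
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(30, 2))
demo_labels = rng.integers(0, 3, size=30)

fig, ax = plot_clusters_2d(demo_points, demo_labels)
```

In the notebook you would call `plot_clusters_2d(pca_2_result, kmeans_labels)` followed by `plt.show()`.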


In [None]:
# 7 Principal component Analysis visualization in 3D

# 7-1: import mplot3d from mpl_toolkits
from mpl_toolkits import mplot3d 

# 7-2: assign x, y, z value from the pca_3_result

# 7-3: use plt.figure() with the figsize parameter to set the figure size to 10 x 7
# and assign it to the variable fig

# 7-4: set the axes using plt.axes(projection="3d") and assign to the variable ax
ax = plt.axes(projection="3d")
  
# 7-5: apply the scatter3D function to ax instance
# make sure to set the parameters c as kmeans_labels and parameter cmap as 'spring'

# 7-6: assign title using plt.title()
  
# 7-7: Show the plot using plt.show()


# End of the Exercise. Congrats!


## Want to explore more?

Here are some suggestions:

* Try with a different combination of features
* Try other datasets
* Try some feature engineering (create your own features)
* Try new models
* Try creating K-Means from scratch

### Other datasets to play with

1. Credit Card Dataset for clustering (https://www.kaggle.com/arjunbhasin2013/ccdata)
2. Cities Dataset for clustering (https://www.kaggle.com/sourabhsabharwal/cities-dataset-for-clustering)
3. Breast cancer data for clustering (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))