# Homework 20: Unsupervised Machine Learning

## Step 1: Prepare the Data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

### Load the Myopia dataset

In [None]:
# Load the preprocessed myopia CSV file
file_path = Path("Resource/myopia.csv")
df_myopia = pd.read_csv(file_path)
df_myopia.head()

In [None]:
# Column names
df_myopia.columns

In [None]:
# There are 81 myopic (1) and 537 non-myopic (0) child samples

df_myopia["MYOPIC"].value_counts()

#### Preprocess the data

In [None]:
# Split the DataFrame into data and target

y = df_myopia["MYOPIC"].values
X = df_myopia.drop("MYOPIC", axis=1)

In [None]:
# Split the data into two groups, the training and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Create a scaler to standardize the data; StandardScaler is a common default choice.

scaler = StandardScaler()

In [None]:
# Fit the standard scaler to the X_train data

scaler.fit(X_train)

In [None]:
# Transform the X_train and X_test data
# Note: the scaler that transforms both X_train and X_test was fit on X_train only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
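
The reason for fitting the scaler on the training set only can be seen in a small sketch. It uses synthetic data rather than the myopia file; the array `X_demo`, its shape, and its column scales are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (assumption: 200 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 50.0, -3.0], scale=[2.0, 10.0, 0.5], size=(200, 3))

X_tr, X_te = train_test_split(X_demo, random_state=42)

# Learn the mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Training columns are standardized exactly; test columns only approximately,
# because the test rows played no part in estimating the statistics
print("train mean:", X_tr_scaled.mean(axis=0).round(3))
print("train std: ", X_tr_scaled.std(axis=0).round(3))
print("test mean: ", X_te_scaled.mean(axis=0).round(3))
```

Keeping the test rows out of `fit` means the test set behaves like unseen data when the model is evaluated.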

### Create predictions with KNN

In [None]:
# Instantiate the KNN model and make predictions.
# An odd n_neighbors is preferred for binary classification because it avoids tied votes; even values are allowed.

knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

In [None]:
# Compute the accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
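
Because the classes are imbalanced (537 of the 618 samples, about 87%, are non-myopic), accuracy alone can look good even for a model that always predicts 0. The sketch below compares KNN with a majority-class baseline on synthetic data; the dataset from `make_classification`, its feature count, and the list of k values are assumptions for illustration, not results for the myopia data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption: roughly 87% class 0, 10 arbitrary features)
X_syn, y_syn = make_classification(
    n_samples=618, n_features=10, n_informative=5,
    weights=[0.87, 0.13], random_state=0,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=42
)

syn_scaler = StandardScaler().fit(Xs_train)
Xs_train_scaled = syn_scaler.transform(Xs_train)
Xs_test_scaled = syn_scaler.transform(Xs_test)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(Xs_train_scaled, ys_train)
baseline_acc = accuracy_score(ys_test, baseline.predict(Xs_test_scaled))
print(f"baseline accuracy: {baseline_acc:.3f}")

# Try a few odd k values and report accuracy alongside balanced accuracy
results = {}
for k in (1, 3, 5, 7, 9):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(Xs_train_scaled, ys_train)
    pred = knn_k.predict(Xs_test_scaled)
    results[k] = (accuracy_score(ys_test, pred), balanced_accuracy_score(ys_test, pred))
    print(f"k={k}: accuracy={results[k][0]:.3f}, balanced accuracy={results[k][1]:.3f}")
```

Balanced accuracy averages the recall of each class, so it gives the minority class the same weight as the majority class.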

## Step 2: Perform Dimensionality Reduction with PCA
Dimensionality reduction techniques can speed up machine learning by reducing the size of large datasets while preserving most of the useful information needed to fit a predictive model.

Principal Component Analysis (PCA) is the dimensionality reduction technique I will use for this dataset.
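
To show what PCA computes, here is a minimal sketch that reproduces the explained variance ratios by hand with a singular value decomposition and compares them with scikit-learn. The synthetic array `X_pca_demo` and its construction are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic correlated data (assumption: 300 rows, 4 features)
base = rng.normal(size=(300, 2))
X_pca_demo = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1],
    rng.normal(size=300),
])

# Manual PCA: center the data, take the SVD, project onto the top-2 directions
X_centered = X_pca_demo - X_pca_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
manual_ratio = (S ** 2) / np.sum(S ** 2)
manual_scores = X_centered @ Vt[:2].T

# The same decomposition through scikit-learn
pca_demo = PCA(n_components=2, svd_solver="full")
sk_scores = pca_demo.fit_transform(X_pca_demo)

print("manual ratio :", manual_ratio[:2].round(4))
print("sklearn ratio:", pca_demo.explained_variance_ratio_.round(4))
```

The projected scores agree up to the sign of each component, since the direction of a principal axis is arbitrary.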

In [None]:
# Dependencies
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


In [None]:
# On a Windows machine, set this before importing KMeans (ideally before any scikit-learn import) to avoid a known memory leak.
import os
os.environ["OMP_NUM_THREADS"] = '1'

In [None]:
# Initialize PCA model

pca = PCA(n_components=2)

# Get two principal components for the data
# Note: df_myopia still includes the MYOPIC target column and is not scaled here
myopia = pca.fit_transform(df_myopia)

In [None]:
# Transform PCA data to a DataFrame
df_myopia_pca = pd.DataFrame(
    data=myopia, columns=["principal component 1", "principal component 2"]
)
df_myopia_pca.head()

In [None]:
# Fetch the explained variance
pca.explained_variance_ratio_

### Sample Analysis
According to the explained variance ratios, the first principal component accounts for approximately 73% of the variance and the second for approximately 16%. Together, the two components retain approximately 89% of the variance in the original dataset. Next, we check whether increasing the number of principal components to 3 raises the explained variance.

In [None]:
# Initialize PCA model for 3 principal components
pca = PCA(n_components=3)

# Get three principal components for the myopia data
myopia_pca = pca.fit_transform(df_myopia)

In [None]:
# Transform PCA data to a DataFrame
df_myopia_pca = pd.DataFrame(
    data=myopia_pca,
    columns=["principal component 1", "principal component 2", "principal component 3"],
)
df_myopia_pca.head()

In [None]:
# Fetch the explained variance
pca.explained_variance_ratio_

### Sample Analysis
The first principal component explains 73% of the variance, the second 16%, and the third 1%, for a cumulative total of 90%. Adding the third component therefore raises the explained variance by only about one percentage point.
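
The cumulative totals can be checked with `np.cumsum`, and PCA can also be asked to pick the number of components for a variance target. The sketch below first uses the rounded ratios quoted above, then runs on synthetic data; `X_var_demo`, its column scales, and the 0.90 target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rounded ratios quoted in the analysis above
quoted = np.array([0.73, 0.16, 0.01])
print("cumulative:", np.cumsum(quoted))

# Synthetic data whose columns have very different variances (assumption)
rng = np.random.default_rng(2)
X_var_demo = rng.normal(size=(400, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# A float n_components with svd_solver="full" keeps just enough components
# to exceed that fraction of explained variance
pca_auto = PCA(n_components=0.90, svd_solver="full").fit(X_var_demo)
cumulative = np.cumsum(pca_auto.explained_variance_ratio_)
print("components kept:", pca_auto.n_components_)
print("cumulative ratio:", cumulative.round(3))
```

This avoids refitting PCA by hand for each candidate number of components.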

## Step 3: Perform a Cluster Analysis with K-means

In [None]:
# Initialize the K-means with K = 3
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=5)

In [None]:
# Fit the model
model.fit(df_myopia_pca)

In [None]:
# Get predictions
predictions = model.predict(df_myopia_pca)
print(predictions)

In [None]:
# Add a new class column to df_myopia
df_myopia["class"] = model.labels_
df_myopia.head()

In [None]:
new_df = df_myopia.copy()
new_df['cluster'] = predictions

In [None]:
new_df.head(20)
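
One way to inspect cluster assignments is to cross-tabulate them against a known label. The sketch below does this on synthetic blobs where the true group is known; the data, the column names `x1` and `x2`, and the `true_group` column are assumptions for illustration and say nothing about how the myopia clusters relate to `MYOPIC`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with known groups (assumption: 3 blobs)
X_blob, true_group = make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=5).fit(X_blob)

df_demo = pd.DataFrame(X_blob, columns=["x1", "x2"])
df_demo["cluster"] = km_demo.labels_
df_demo["true_group"] = true_group

# On the data the model was fit to, labels_ and predict() give the same assignments
same = bool((km_demo.predict(X_blob) == km_demo.labels_).all())
print("labels_ matches predict():", same)

# Rows are K-means clusters, columns are the known groups
table = pd.crosstab(df_demo["cluster"], df_demo["true_group"])
print(table)
```

Cluster numbers are arbitrary, so cluster 0 need not correspond to group 0; what matters is whether each row is dominated by a single column.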

In [None]:
# Initialize the K-means model
model = KMeans(n_clusters=3, random_state=0)

# Fit the model
model.fit(df_myopia_pca)

# Predict clusters
predictions = model.predict(df_myopia_pca)

# Add the predicted class columns
# df_myopia_pca["class"] = model.labels_
# df_myopia_pca.head(50)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


In [None]:
# Generate 3 clusters of random data
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=3,
                    cluster_std=0.60, random_state=0)

In [None]:
# Plot the data
plt.scatter(data[:, 0], data[:, 1])

In [None]:
# Use n_clusters=3 as the k value
# The data above was generated with centers=3, so k=3 matches the number of true clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)

# Fit the model to the data
kmeans.fit(data)

In [None]:
# Predict the clusters
predicted_clusters = kmeans.predict(data)

In [None]:
# Plot the predicted clusters to see if the model predicted the correct clusters
# This is visual validation that the model was trained correctly.

plt.scatter(data[:, 0], data[:, 1], c=predicted_clusters, s=50, cmap='viridis')

In [None]:


inertia = []
# Candidate k values from 1 to 10
k = list(range(1, 11))


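# Look for the best k
# Note: df_myopia here includes the MYOPIC target and the added "class" column, not the PCA-reduced features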
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(df_myopia)
    inertia.append(km.inertia_)

# Define a DataFrame to plot the elbow curve with matplotlib
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

plt.plot(df_elbow['k'], df_elbow['inertia'])
plt.xticks(range(1,11))
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

## Step 4: Make a Recommendation
At 1 cluster on the x-axis, the inertia is at its highest point. Increasing k to 2 clusters decreases the inertia sharply. At 3 clusters the drop is smaller, and the inertia then declines only gradually as k increases further.

Therefore, I would say that the elbow of the curve is at k = 3, because any k larger than 3 yields only a minimal further decrease in inertia, which measures the model's error.
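
Reading an elbow by eye is subjective, so it can help to print how much each additional cluster reduces the inertia. This is a minimal sketch on synthetic blobs, not on the myopia data; `X_elbow`, the range of k values, and the fractional-drop measure are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption: 3 blobs, not the myopia data)
X_elbow, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

ks = list(range(1, 8))
inertias = []
for n in ks:
    km_n = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X_elbow)
    inertias.append(km_n.inertia_)

inertias = np.array(inertias)
# Fractional drop in inertia when going from k-1 to k clusters
pct_drop = (inertias[:-1] - inertias[1:]) / inertias[:-1]

for n, drop in zip(ks[1:], pct_drop):
    print(f"k={n}: inertia falls by {drop:.1%} relative to k={n - 1}")
```

A k after which the fractional drop becomes small is a candidate for the elbow; the choice still calls for judgment and knowledge of the data.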

Based on the findings, my recommendation is that these patients could be clustered into three groups. K-means assigns each patient to one of the three clusters according to the distance from each cluster's centroid, so patients with similar values across the dataset's dimensions end up in the same group.