# Unsupervised Machine Learning

The previous lesson introduced supervised machine learning, where a model is trained on labeled data, that is, data whose correct output is already known. Building on that lesson, today we turn to unsupervised learning, which works with unlabeled data.

### Lesson overview

- What is Unsupervised Machine Learning
- Different kinds of Unsupervised ML models
- Use cases for Unsupervised ML models
- Setting up a Model
- Challenges that come with Unsupervised ML models

Let's get started!

## What is Unsupervised Machine Learning?

Unsupervised machine learning finds patterns and relationships in unlabeled data without being told the correct output. A common goal is to group or categorize the data points. Many different models exist, each suited to different use cases.

## Different Types of Unsupervised ML models

#### Clustering
Grouping data points so that points in the same group (cluster) are more similar to each other than to points in other groups. Centroid-based methods such as K-means use centroids* to form these groupings, which can then be used to categorize the data points.

- *centroids: The centers of clusters, which are initially randomly generated. Each centroid is then moved to the average position of the points closest to it. This is repeated until the centroids no longer move.

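#### Dimensionality Reduction
Reducing the number of features while retaining as much of the variation in the data as possible. A common technique is PCA (Principal Component Analysis), which combines the original features into a smaller set of components that capture the most *important* directions of variation.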

#### Association Rule Learning
Grouping items by how dependent they are on one another. This process relies on which items appear together rather than on the attributes of the items themselves. For example, Amazon's system may notice that two items are usually bought together; later, when one of those items is put into a cart, the other is recommended to the user.

![image.png](attachment:image.png)

## Use Cases for Unsupervised Machine Learning

- Customer segmentation in marketing
- Anomaly detection in cybersecurity
- Data compression and visualization
- Recommender systems (e.g., movie or product recommendations)

## Creating Unsupervised ML models

### Setting up an Unsupervised ML model

To further understand how these are incorporated into our data preprocessing steps, let us work through an example by loading in the iris dataset...

Using this newfound insight, the next step is to develop some hypotheses. This is important because it sets us up for tuning the hyperparameters of our unsupervised ML model. The first thing we can hypothesize is how many potential groups, or **clusters**, can be made.

### How clustering works in Unsupervised Machine Learning

Looking at the scatterplot again, we can identify 3 distinct clusters, which we can use as a starting point for our model. First, we must split our data into training and testing sets, as in the Intro to Machine Learning lesson. The testing set lets us check the viability of the model and its predictive capabilities.
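The split described above might look roughly like the following. This is a minimal sketch, not the lesson's official solution: the 80/20 ratio, the seed of 42, and the variable names `train_data`, `test_data`, and `test_clusters` are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Load the iris measurements; the species labels are deliberately left unused
iris = datasets.load_iris()
data = iris.data

# Hold out 20% of the rows (ratio and seed are illustrative assumptions)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Fit on the training rows only, using our hypothesis of 3 clusters
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(train_data)

# Assign each held-out row to its nearest learned centroid
test_clusters = model.predict(test_data)
print(train_data.shape, test_data.shape)
print(test_clusters[:10])
```

Because there are no labels to score against, the held-out rows are mainly useful for checking that new points fall into sensible clusters.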

### What is K-means?

K-means is a popular unsupervised learning model. The idea behind K-means is to group objects that have similar qualities into the same cluster. It uses centroids to mark the center of each cluster.

**Centroids** - Points at the center of each cluster, which are initially randomly generated

Centroids are recalculated repeatedly until the process reaches a point of no change, where recalculating again would yield the same result. These centroids can then be used to categorize unlabeled data in a relatively effective manner.
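To make the recalculation loop concrete, here is a toy version written with NumPy. It is only a sketch under stated assumptions: `kmeans_sketch` is a hypothetical helper (not part of any library), the starting centroids are picked from the data points themselves, and the two blobs of points are made up.

```python
import numpy as np

def kmeans_sketch(points, k, max_iters=100, seed=0):
    """Toy k-means returning (centroids, labels). Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs of made-up 2-D points
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
```

In practice a library implementation such as scikit-learn's `KMeans` is preferable, since it handles initialization and repeated restarts more carefully.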

In [None]:
# Importing necessary libraries
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

# Loading the iris dataset
iris = datasets.load_iris()
data = iris.data
target = iris.target

# Using KMeans clustering
# 3 clusters, as there are 3 classes in the Iris dataset; n_init and random_state make runs repeatable
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(data)
labels = model.labels_

# Plotting the clusters
# For simplicity, let's use only the first two features (sepal length and sepal width)
x = data[:, 0]
y = data[:, 1]
plt.scatter(x, y, c=labels, cmap='rainbow')

# Plotting the cluster centers
centers = model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('K-Means Clustering on Iris Dataset')
plt.legend()
plt.show()

# Comparing predicted clusters with the actual classes
df = pd.DataFrame({'Labels': labels, 'Actual Classes': target})
ct = pd.crosstab(df['Labels'], df['Actual Classes'])
print(ct)

### Model Performance

Testing your model's performance is the last step in the loop and can provide important information about what further tuning may be required to optimize your model. For unsupervised models there are a multitude of ways to measure performance, but for our purposes we will use the silhouette score.

### Silhouette Score

Silhouette score measures how compact and well separated the clusters are. For each point, it compares the average distance to the other points in the same cluster with the average distance to the points in the nearest other cluster. Scores range from -1 to 1, and values closer to 1 indicate that points sit close to their own cluster and far from the others. Essentially, it measures how good the clusters are at grouping data.
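As a rough illustration, the score can be computed with scikit-learn's `silhouette_score` on a K-means model fitted to the full iris measurements. The choice of 3 clusters and the seed are assumptions carried over from the clustering hypothesis; this is a sketch rather than the lesson's reference solution.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the iris measurements (labels are not needed for this metric)
iris = datasets.load_iris()
data = iris.data

# Cluster the data and keep the cluster assigned to each row
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(data)

# Average silhouette over all points: needs the features and the cluster labels
score = silhouette_score(data, labels)
print("The average silhouette score is:", score)
```

Rerunning this with different values of `n_clusters` is one simple way to compare clustering hypotheses.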

### Using Principal Component Analysis to make it better

In practice, K-means clustering is often used jointly with PCA. This is because the features fed into K-means can often be reduced to fewer dimensions while maintaining most of the variance in the data.
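One plausible way to combine the two is sketched below. Keeping 2 components is an illustrative assumption, and the names `cluster_df` and `clusters` are chosen only to match the silhouette cell later in this lesson; the original notebook may have built them differently.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd

iris = datasets.load_iris()
data = iris.data

# Project the 4 measurements onto 2 principal components (2 is an assumption)
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# Share of the total variance kept by each component
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Weight of each original feature in each component
weights = pd.DataFrame(pca.components_, columns=iris.feature_names, index=["PC1", "PC2"])
print(weights)

# Cluster in the reduced space
cluster_df = pd.DataFrame(reduced, columns=["PC1", "PC2"])
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(cluster_df)
print(cluster_df.assign(cluster=clusters).head())
```

The `weights` table shows how strongly each original measurement contributes to each component, which is what "feature weights" usually refers to in a PCA context.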

In [None]:
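# Show a pairplot using seaborn's pairplot function
# Importing necessary libraries
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Loading the iris dataset
iris = datasets.load_iris()
data = iris.data
target = iris.target

# Build a DataFrame so each axis is labeled with its feature name
iris_df = pd.DataFrame(data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[target]

# Create a pairplot; coloring by species is an assumption so the palette has an effect
sns.pairplot(iris_df, hue='species', palette="husl", diag_kind='kde')

# Show the plot
plt.show()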

Examining the pairplots above can give us a lot of valuable information. Each panel of a pairplot is a scatterplot of two features, so we can look at any one of them and examine the trends and patterns that arise from plotting the selected inputs.

<span style = "background-color: yellow">
TODO: Examine the graph. Identify any patterns or trends. Think about what this may mean for identifying what species of iris flower it is. Use 2 minutes to write down any thoughts and then compare your findings with a partner.
</span>

In [None]:
# Show the weights of the features used

### Remeasuring Model Performance after PCA

In [None]:
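# Compute the silhouette score
# cluster_df (the features that were clustered) and clusters (the cluster assigned
# to each row) are assumed to be defined in an earlier cell
from sklearn.metrics import silhouette_score
silhouette_avg = silhouette_score(cluster_df, clusters)
print("The average silhouette score is:", silhouette_avg)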

### Association Rule Learning

Quick Guide + more info
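As a rough illustration of the bought-together idea described earlier, the sketch below counts how often pairs of items share a basket and derives two common rule measures, support and confidence. The baskets are made up, and the names `transactions`, `pair_counts`, and `rules` are hypothetical; a real project would more likely use a dedicated library implementing an algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, purely for illustration
transactions = [
    {"phone", "phone case"},
    {"phone", "phone case", "charger"},
    {"phone", "charger"},
    {"laptop", "mouse"},
    {"phone", "phone case"},
]

# How many baskets contain each item, and each pair of items
item_counts = Counter(item for basket in transactions for item in basket)
pair_counts = Counter(
    pair for basket in transactions for pair in combinations(sorted(basket), 2)
)

n = len(transactions)
rules = {}
for (a, b), count in pair_counts.items():
    support = count / n  # share of all baskets containing both items
    # Confidence of "a -> b": of the baskets with a, the share that also has b
    rules[(a, b)] = {"support": support, "confidence": count / item_counts[a]}
    rules[(b, a)] = {"support": support, "confidence": count / item_counts[b]}

for (antecedent, consequent), m in sorted(rules.items()):
    print(f"{antecedent} -> {consequent}: "
          f"support={m['support']:.2f}, confidence={m['confidence']:.2f}")
```

A rule with high confidence, such as "phone case -> phone" in this toy data, is the kind of relationship a recommender could act on when one of the items lands in a cart.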

In [None]:
# Code