# Unsupervised Machine Learning

In the previous lesson we went over some important vocabulary for Machine Learning. We learned that for supervised learning your model will be trained on labeled data. In today's lesson we will be going over how to train a model with data that is not labeled, which we call Unsupervised Machine Learning.

## What is Unsupervised Machine Learning?

Unsupervised Machine Learning finds patterns and relationships in unlabeled data on its own, then uses those patterns to group or categorize the examples. There are several types of machine learning, and three ideas are especially important in Unsupervised ML:

- Anomaly Detection
- Dimensionality Reduction
- Clustering
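To make the three ideas concrete, here is a minimal sketch that applies one scikit-learn tool for each to the iris measurements. The choice of `IsolationForest` and `PCA` is an assumption made for illustration (this lesson itself only works with `KMeans`), and the variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# 150 flowers x 4 measurements; the species labels are deliberately ignored
features = load_iris().data

# Anomaly detection: -1 marks a flower the model considers unusual, 1 a typical one
anomaly_flags = IsolationForest(random_state=0).fit_predict(features)

# Dimensionality reduction: compress the 4 columns into 2 new ones
reduced = PCA(n_components=2).fit_transform(features)

# Clustering: assign each flower to one of 3 groups
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

print(anomaly_flags[:5], reduced.shape, cluster_ids[:5])
```

None of the three calls is given the species labels, which is what makes them unsupervised.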

To see how these fit into our data preprocessing steps, let's work through an example by loading the iris dataset.

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris data (a dictionary-like object holding features, labels, and metadata)
iris_data = load_iris()

# Put the measurements into a DataFrame, using the feature names as column headers
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_df.head()



You may notice this is the same dataset we used last time so you will have some familiarity working with it.

Now before we input our data into any model it is best to examine it and get a better understanding of what we are working with...

But before we can do that we will need to import some important packages that will allow for basic data analysis...

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

## Data Exploration

Now let's move on to some basic EDA that we learned in our Time Series Analysis lesson.

In [None]:
# Use the `.info()` and `.describe()` methods to summarize the data
iris_df.info()
iris_df.describe()

The output of `.info()` shows whether any column has missing values. The `.dropna()` method then drops any examples (rows) that have missing values.

In [None]:
# dropna() returns a new DataFrame; with inplace=True it returns None, so never assign that result
iris_df = iris_df.dropna()
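The iris data has no missing values, so the effect of `.dropna()` is easier to see on a small made-up table. The sketch below uses a hypothetical `toy_df` with one missing measurement and counts missing values per column before dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table with one missing measurement
toy_df = pd.DataFrame({"length": [1.4, np.nan, 4.7], "width": [0.2, 0.3, 1.4]})

# Count the missing values in each column
missing_per_column = toy_df.isna().sum()

# Keep only the rows where every column has a value
cleaned_df = toy_df.dropna()

print(missing_per_column)
print(len(toy_df), "rows before,", len(cleaned_df), "rows after")
```

Counting first is a useful habit: if a large share of rows would be dropped, it may be worth reconsidering which columns to keep.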

### Data Visualization

Now that our data has no missing values, we can move on to looking at potential features. Let's visualize how each potential input plots against every other input; in other words, we plot every column against every other column. This gives us a visual sense of what natural clusters form in the data, which will be useful later when deciding which features to use. For this we will use a pairplot.

In [None]:
sns.pairplot(iris_df)
plt.show()

<span style = "background-color: yellow">
TODO: Examine the graph. What patterns, trends, or other interesting features do you see? Think about what they may mean for identifying the species of an iris flower. Use 2 minutes to write down your thoughts, then compare your findings with a partner.
</span>

Examining the pairplot above gives us a lot of valuable information. Now we can take a single panel from the pairplot, view it as its own scatterplot, and examine any trends and patterns that arise from plotting the selected inputs.

In [None]:
# Plot petal length (x-axis) against petal width (y-axis)
plt.scatter(iris_df['petal length (cm)'], iris_df['petal width (cm)'])

Using this newfound insight, the next step is to develop some hypotheses. This is important because it sets us up for tuning the hyperparameters of our unsupervised ML model. The first thing we can hypothesize is how many potential groups, or **clusters**, can be made.

### What is Clustering and Why is it important?

Clustering is an important aspect of Unsupervised Machine Learning because clusters indicate potential labels and groupings for your unlabeled data. The better defined these groups are, the more sound your predictions will be. This is why preprocessing and filtering the data is so important: it lets us find which features work and which don't.

### How clustering works in Unsupervised Machine Learning

Looking at the scatterplot again, we can pick out roughly 3 clusters: one that is clearly separated and two that sit close together. We can use this as a base for our model. First, we split our data into training and testing sets, as we did in the Intro to Machine Learning lesson. The testing set is for checking the viability of the model and its predictive capabilities later.

In [None]:
from sklearn.model_selection import train_test_split

# iris_df has no 'class' column, so take the features and the true species labels from iris_data
X, y = iris_data.data, iris_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


Now we import the KMeans clustering class from the sklearn library and fit it to two of our features.

In [None]:
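from sklearn.cluster import KMeans

# The two features we picked from the scatterplot
x = 'petal length (cm)'
y = 'petal width (cm)'
# Select those two columns from the raw feature array
X = iris_data.data[:, [iris_data.feature_names.index(x), iris_data.feature_names.index(y)]]

# Fit KMeans with 3 clusters; random_state makes the run repeatable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)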


### What is K-means?

K-means is a popular unsupervised learning model. The idea behind it is to group together objects that have similar qualities. The technique uses centroids to mark the center of each cluster.

**Centroids** - The points at the center of each cluster. Their starting positions are chosen at random, and each one is then moved to the average position of the points assigned to it.

![Centroid Diagram](Centroids.png)

Centroids are recalculated repeatedly until the process reaches a point of no change, where recalculating again would yield the same result. The final centroids can then be used to categorize unlabeled data in a relatively effective manner.
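The loop described above can be sketched in plain Python so that each step is visible. This is an illustrative version written for this lesson, not scikit-learn's implementation; the function name `kmeans`, its parameters, and the `toy_points` data are all hypothetical.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means on a list of (x, y) tuples. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k data points chosen at random
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        # Stop once recalculating gives the same centroids again
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Made-up data: two small groups of points
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
```

Because the starting centroids are random, a different `seed` can lead to different cluster numbers and occasionally to a different grouping, which is one reason library implementations usually run several initializations.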

In [None]:
clusters = kmeans.labels_
cluster_df = pd.DataFrame(np.hstack((X, clusters.reshape(-1,1))), columns = [x,y,"class"])

### Plotting your predictions

Below is the original scatterplot of the selected features, colored by the true species labels:

In [None]:
# iris_df has no 'class' column, so color the points with the true labels from iris_data
sns.scatterplot(x=x, y=y, data=iris_df, hue=iris_data.target)

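These are the clusters our model predicted. The cluster numbers are arbitrary, so they need not match the species numbers in the plot above: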

In [None]:
sns.scatterplot(x=x, y=y, data = cluster_df, hue = 'class')

### Model Performance

Testing your model's performance is the last step in the loop and can tell you what further tuning may be required to optimize your model. For unsupervised models there are a multitude of ways to measure performance, but for our purposes we will use the silhouette score.

### Silhouette Score

Silhouette score measures how compact and well separated the clusters are. For each point, it compares the average distance to the other points in its own cluster with the average distance to the points in the nearest other cluster. Scores range from -1 to 1, and values closer to 1 mean the clusters do a better job of grouping the data.
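To show what the score is actually computing, here is a small plain-Python sketch. For each point it takes `a`, the mean distance to the other points in its own cluster, and `b`, the mean distance to the points in the nearest other cluster, then averages `(b - a) / max(a, b)` over all points. The function name and toy data are hypothetical, and the sketch assumes at least two clusters and no duplicate points.

```python
import math


def silhouette_score_sketch(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    cluster_ids = set(labels)
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other points in each cluster
        mean_dist = {}
        for cid in cluster_ids:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == cid and j != i]
            if dists:
                mean_dist[cid] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # a point alone in its cluster scores 0 by convention
            continue
        a = mean_dist[own]                                        # compactness
        b = min(d for cid, d in mean_dist.items() if cid != own)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up data: two tight pairs of points that are far apart
toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good_score = silhouette_score_sketch(toy_points, [0, 0, 1, 1])  # pairs grouped together
bad_score = silhouette_score_sketch(toy_points, [0, 1, 0, 1])   # pairs split apart
print(good_score, bad_score)
```

Grouping the nearby points together gives a clearly positive score, while splitting them across clusters gives a negative one.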

In [None]:
from sklearn.metrics import silhouette_score

# Compute the silhouette score on the features only (not the cluster label column)
silhouette_avg = silhouette_score(X, clusters)
print("The average silhouette score is:", silhouette_avg)
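One way to check the earlier hypothesis about the number of clusters is to fit the model for several values of `n_clusters` and compare the silhouette scores. The sketch below does this for the two petal features; the range of `k` values and the variable names are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

iris = load_iris()
# Column positions of the two petal features
petal_cols = [iris.feature_names.index(name)
              for name in ("petal length (cm)", "petal width (cm)")]
petal_features = iris.data[:, petal_cols]

# Fit one model per candidate k and record its silhouette score
scores_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(petal_features)
    scores_by_k[k] = silhouette_score(petal_features, labels)

for k, score in scores_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
```

The highest score does not have to coincide with the number of species, so treat it as one piece of evidence alongside what you saw in the plots.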