# Unsupervised Learning

- Learning without labels: the data has no target label (no target column).
- We do detective work instead, identifying patterns in the data.


![](https://i.vas3k.ru/7vx.jpg)

![](https://i.vas3k.ru/7w1.jpg)

### Exercise 1:
- Create a `pairplot` of the iris dataset and send a screenshot in the chat.

In [None]:
import seaborn as sns

df = sns.load_dataset('iris')

df.head(2)


In [None]:
sns.pairplot(df)

Example:
- I give you a box of Legos to sort. Do you:
    - sort by color?
    - shape?
    - size?
- Sorting creates groups of Legos, but the groups come without names: we won't know what the colors are called.

What do we do with unsupervised learning?

- **Clustering**: Grouping similar data points into clusters (e.g., customer segmentation using K-Means or DBSCAN).
- **Association Rule Learning**: Finding relationships between variables (e.g., market basket analysis with Apriori or FP-Growth).
- **Dimensionality Reduction**: Simplifying datasets by reducing features while retaining essential information (e.g., PCA, t-SNE).
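
As a quick illustration of the dimensionality reduction item, here is a minimal sketch that squeezes the four iris features into two with PCA. The variable names (`X_iris`, `pca`, `X_2d`) are made up for this example and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Four numeric features per flower
X_iris = load_iris().data

# Project onto the two directions that keep the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_iris)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The two new columns are easy to scatter-plot, which is one common reason to reduce dimensions before looking for clusters.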

### Clustering:
- Grouping data. The algorithms we cover:
1. **KMeans**
2. **Hierarchical Clustering**
3. **DBSCAN**: *Density-Based Spatial Clustering of Applications with Noise*

#### K-Means algorithm

![](https://i.vas3k.ru/7w6.jpg)

Algorithm:
1. **Initialization**: Randomly select K points as the initial cluster centroids.
2. **Assignment**: Assign each data point to the nearest centroid.
3. **Update**: Recalculate the centroids by averaging the points in each cluster.
4. **Convergence**: Repeat the assignment and update steps until the centroids stabilize.
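
The four steps above can be sketched with plain NumPy. This is a teaching sketch, not how scikit-learn implements `KMeans`; the function name `kmeans_sketch` and the toy array `X_blobs` are made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: distance from every point to every centroid, take the nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of toy data
X_blobs = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
blob_labels, blob_centroids = kmeans_sketch(X_blobs, k=2)
print(blob_labels)
print(blob_centroids)
```

Which blob ends up as cluster 0 and which as cluster 1 depends on the random start, so only the grouping is meaningful, not the ID.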

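Random initialization (naive):
- K random data points are selected as the starting centroids.

K-Means++ initialization:
- Centroids are chosen so that they are spread out more evenly.
- The idea is to push the initial centroids far apart.
- The first centroid is a random data point; each later one is sampled with probability proportional to its squared distance from the nearest centroid already chosen, so far-away points are favoured.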
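
A minimal NumPy sketch of that seeding idea, assuming Euclidean distance. The helper name `kmeans_pp_init` and the toy array `X_seed` are hypothetical, and scikit-learn's own version adds refinements that are left out here.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k starting centroids from the rows of X, k-means++ style."""
    rng = np.random.default_rng(seed)
    # First centroid: one data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so points far from the current centroids are favoured
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

X_seed = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
start_centroids = kmeans_pp_init(X_seed, k=2)
print(start_centroids)
```

Already-chosen points have a squared distance of 0, so they cannot be picked a second time.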


```py
KMeans(init='random')     # naive initialization
KMeans(init='k-means++')  # the scikit-learn default
```

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

import pandas as pd

In [None]:
data = load_iris()

df_iris = pd.DataFrame(data.data, columns= data.feature_names)
df_iris.head(2)

In [None]:
# Kmeans

model = KMeans(
    n_clusters= 3,
    init = 'random',
    random_state = 42
)

In [None]:
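# fit on the feature columns only, so re-running this cell never clusters on the old 'cluster' column
df_iris['cluster'] = model.fit_predict(df_iris[data.feature_names])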

In [None]:


# scatter plot
import seaborn as sns

sns.pairplot(data=df_iris, hue= 'cluster', palette='husl')

In [None]:
iris_target = pd.DataFrame(data.target)
iris_target[0]

In [None]:
# cross tabulations
print(pd.crosstab(iris_target[0], df_iris.cluster))

### Exercise 
- Cluster the `penguins` dataset.

In [None]:
df_penguins = sns.load_dataset('penguins') # but this df has labels too (the `species` column)
df_penguins.head(2)

# remove the label and use the unlabelled dataset
# (drop rows with missing values first: KMeans does not accept NaN)
# then cluster it and check whether the algorithm actually found the three categories

---
## **Hierarchical Clustering**
- agglomerative:
    - bottom-up approach
    - each data point starts as its own cluster
    - you merge the closest pair of clusters
    - and repeat until only one cluster remains

- divisive:
    - top-down approach
    - you start with all data points in a single cluster
    - split it into smaller ones
    - until each cluster is a single data point
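
The agglomerative (bottom-up) loop can be sketched in a few lines. This naive version uses single linkage (distance between the two closest members) and stops at a chosen number of clusters, whereas the cells below use Ward linkage from SciPy and scikit-learn. The function name `agglomerative_sketch` and the toy array `X_line` are made up for this example.

```python
import numpy as np

def agglomerative_sketch(X, n_clusters):
    """Naive bottom-up merging with single linkage."""
    # Start: every point is its own cluster (lists of row indices)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between any two members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        # Merge the closest pair of clusters
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

X_line = np.array([[0.0], [1.0], [10.0], [11.0], [25.0]])
merged = agglomerative_sketch(X_line, n_clusters=3)
print(merged)  # [[0, 1], [2, 3], [4]]
```

Letting the loop run down to `n_clusters=1` and recording each merge distance gives exactly the information a dendrogram draws.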
    

![](https://assets.ibm.com/is/image/ibm/7-1_dendrogram-diagram-with-h1-h2-lines:16x9?fmt=png-alpha&dpr=on%2C1.25&wid=960&hei=540)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering, DBSCAN

from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
iris = load_iris()


X = iris.data

y = iris.target

In [None]:
# perform the hierarchical clustering

linked = linkage(X, method = 'ward')


In [None]:

# Reduce the number of labels shown on the dendrogram
iris.target_names[y]  # species name for every sample

# --------------- List Comprehension ----------------------------------------------------
# keep the species name for every 5th sample and blank out the rest
reduced_labels = [iris.target_names[label] if index % 5 == 0 else '' for index, label in enumerate(y)]
# reduced_labels

In [None]:
plt.figure(figsize=(10,7))

dendrogram(linked, labels= reduced_labels,  leaf_rotation= 90, leaf_font_size= 12, color_threshold= 6  )

plt.axhline(y = 10, color = "r", linestyle = '--')
plt.show()
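
A horizontal line on a dendrogram can be read as a cut: every branch that crosses it becomes one cluster. As a hedged sketch, SciPy's `fcluster` can perform that cut at the same height as the dashed line. `fcluster` is not used elsewhere in this notebook, and the names `X_demo`, `linked_demo` and `flat_labels` are made up for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X_demo = load_iris().data
linked_demo = linkage(X_demo, method='ward')

# Cut the tree at height 10: merges above that distance are undone
flat_labels = fcluster(linked_demo, t=10, criterion='distance')

# Cluster IDs from fcluster start at 1, not 0
print(np.unique(flat_labels, return_counts=True))
```

Moving `t` up gives fewer, larger clusters; moving it down gives more, smaller ones.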

In [None]:
# Agglomerative clustering to get clusters

hc_model = AgglomerativeClustering(n_clusters= 3, linkage= 'ward')

hc_cluster = hc_model.fit_predict(X)
hc_cluster

In [None]:
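# cluster IDs are arbitrary, so this lookup only names the species correctly if ID k happens to match species k
predictions = iris.target_names[hc_cluster]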

**DBSCAN**
- Core Points: points with at least `min_samples` points (counting themselves) within the `eps` radius
- Border Points: points within `eps` of a core point that are not core points themselves
- Noise Points: points that fit as neither core nor border points
- eps: epsilon, the neighbourhood radius around each point (float, default=0.5)
    - The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- min_samples: the minimum number of points needed to form a dense region (default=5)
    - The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
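
To see the three point types side by side, here is a small sketch on hand-made data. The array `X_db` and the `eps=0.4` setting are invented for this illustration; `core_sample_indices_` and `labels_` are attributes of a fitted scikit-learn `DBSCAN`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: five tightly packed points, one point on the edge
# of that group, and one point far away from everything
X_db = np.array([
    [0.00, 0.00], [0.10, 0.00], [0.00, 0.10], [0.10, 0.10], [0.05, 0.05],
    [0.47, 0.05],
    [5.00, 5.00],
])

db = DBSCAN(eps=0.4, min_samples=5).fit(X_db)

# Core points are reported directly; noise points get the label -1;
# border points are whatever is left (in a cluster, but not core)
core_mask = np.zeros(len(X_db), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

roles = np.where(core_mask, "core", np.where(border_mask, "border", "noise"))
for point, label, role in zip(X_db, db.labels_, roles):
    print(point, label, role)
```

The edge point has only three points within `eps` (itself and two neighbours), so it is not core, but it still joins the cluster because it lies within `eps` of a core point.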

In [None]:
# DBSCAN 


dbscan_model = DBSCAN(eps = 0.5, min_samples= 5)

dbs_cluster = dbscan_model.fit_predict(X)
dbs_cluster

# -1 will be points treated as NOISE

- https://medium.com/@abhaysingh71711/hierarchical-clustering-a-tree-based-approach-to-data-grouping-241131b1c4c5
- https://mbrenndoerfer.com/writing/hierarchical-clustering-complete-guide-dendrograms-linkage-criteria

Please implement unsupervised learning on the `penguins` or `tips` dataset.
Apply the `hierarchical` method and `DBSCAN`.

### Evaluation of Clustering Results:

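1. **Silhouette Score:**
- Measures how similar a point is to its own cluster compared with the nearest other cluster
- Range: -1 to 1
- Closer to 1 means the point sits well inside its cluster, far from neighbouring clusters
- Closer to 0 means it is at the boundary between two clusters
- Below 0 means it may have been assigned to the wrong cluster
- Higher is better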
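
For a single point the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. A small sketch that works this out by hand for one point and compares it with scikit-learn's `silhouette_samples`; the toy arrays `X_sil` and `labels_sil` are made up for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four points on a line, two clusters
X_sil = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_sil = np.array([0, 0, 1, 1])

# Silhouette of the first point (value 0.0, cluster 0), by hand
a = abs(0.0 - 1.0)                            # mean distance inside its own cluster
b = (abs(0.0 - 10.0) + abs(0.0 - 11.0)) / 2   # mean distance to the other cluster
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_sil, labels_sil)[0]
print(s_manual, s_sklearn)
```

`silhouette_score` is simply the mean of these per-point values over the whole dataset.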

2. **Davies-Bouldin Index**
- Measures the average similarity between each cluster and the cluster most similar to it
- Low similarity means the clusters are compact and far apart
- Lower values indicate better clustering (the minimum is 0)

3. **Adjusted Rand Index (ARI)**
- Compares the clustering with a known ground truth
- Only usable if you have ground truth labels, i.e. a labelled dataset
- Measures the agreement between predicted clusters and actual labels: 1 is a perfect match, values near 0 are what random labelling would give
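
One property worth seeing in code: ARI only cares about which points are grouped together, not about the cluster IDs themselves. A minimal sketch with made-up label lists (`true_labels`, `renamed`, `one_cluster`):

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping, different cluster IDs: still a perfect score
renamed = [2, 2, 0, 0, 1, 1]
ari_renamed = adjusted_rand_score(true_labels, renamed)
print(ari_renamed)  # 1.0

# Everything lumped into one cluster: no better than chance
one_cluster = [0, 0, 0, 0, 0, 0]
ari_lumped = adjusted_rand_score(true_labels, one_cluster)
print(ari_lumped)  # 0.0
```

This is why ARI is safer than comparing cluster IDs to class IDs directly, since a clustering algorithm numbers its clusters in an arbitrary order.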

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

hcsc = silhouette_score(iris.data, hc_cluster)

hcdbs = davies_bouldin_score(iris.data, hc_cluster)

hcari = adjusted_rand_score(iris.target, hc_cluster)

In [None]:
iris.target

In [None]:
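# same three metrics for the DBSCAN result
# (noise points, labelled -1, are scored as if they formed one more cluster)
dbsc = silhouette_score(iris.data, dbs_cluster)

dbdbs = davies_bouldin_score(iris.data, dbs_cluster)

dbari = adjusted_rand_score(iris.target, dbs_cluster)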