# <center> Clustering

### <center> Unsupervised learning with K-Means and Hierarchical Clustering

<center><img src="superVSunsuper.png">

<center><img src='unsupervised.jpg'>

## <center> K-Means Clustering

In [None]:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X,y = make_blobs(n_features=2, n_samples=10, centers=1, random_state=4)

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c='black')

In [None]:
from sklearn.cluster import KMeans

In [None]:
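# Fit K-Means with 3 clusters; fixing random_state makes the result reproducible
k_cluster = KMeans(n_clusters=3, random_state=4).fit(X)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c=k_cluster.predict(X))
# Overlay the cluster centers in black
centers = k_cluster.cluster_centers_
plt.scatter(centers[:,0],centers[:,1],s=200, c='black')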

In [None]:
k_cluster.predict(X)

In [None]:
X,y = make_blobs(n_features=2, n_samples=1000, centers=11, random_state=4, cluster_std=2)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c='black')

## <center> Elbow Method
<center> Finding the best value of K

In [None]:
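plt.figure(figsize=(20,10))
# Plot inertia (within-cluster sum of squared distances) against K and look for the "elbow"
ks = range(1,10)
plt.plot(ks, [KMeans(n_clusters=k, random_state=4).fit(X).inertia_ for k in ks], marker='o')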
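
The `inertia_` value plotted above is the sum of squared distances from each sample to the center of its assigned cluster. As a minimal sketch (the names `X_demo`, `km`, and `manual_inertia` are illustrative and not part of the original notebook), it can be recomputed by hand and compared with the value scikit-learn reports:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data, generated the same way as in the cells above
X_demo, _ = make_blobs(n_features=2, n_samples=1000, centers=11, random_state=4, cluster_std=2)

# n_init and random_state are set explicitly so the run is reproducible
km = KMeans(n_clusters=4, n_init=10, random_state=4).fit(X_demo)

# Squared distance from every sample to the center of its assigned cluster
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centers) ** 2).sum()

# The two values should agree up to floating-point error
print(manual_inertia, km.inertia_)
```

Inertia always decreases as K grows, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.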

In [None]:
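# Refit with the K chosen from the elbow plot; random_state keeps the result reproducible
k_cluster = KMeans(n_clusters=4, random_state=4).fit(X)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c=k_cluster.predict(X))
# Overlay the cluster centers in black
centers = k_cluster.cluster_centers_
plt.scatter(centers[:,0],centers[:,1],s=200, c='black')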

## <center> Hierarchical Clustering

In [None]:
X,y = make_blobs(n_features=2, n_samples=10, centers=1, random_state=4)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c='black')

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
hier_cluster = AgglomerativeClustering().fit(X)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c=hier_cluster.labels_)

In [None]:
hier_cluster.labels_

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
# Ward linkage matches the default linkage used by AgglomerativeClustering
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()
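
A dendrogram can also be cut programmatically to obtain flat cluster labels. The sketch below is an illustration rather than part of the original notebook: it assumes Ward linkage (the default linkage of `AgglomerativeClustering`) and uses SciPy's `fcluster` with `criterion='maxclust'` to request at most three clusters. The names `X_demo`, `Z_demo`, and `flat_labels` are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Illustrative data, generated the same way as the 10-sample example above
X_demo, _ = make_blobs(n_features=2, n_samples=10, centers=1, random_state=4)

# Build the merge tree with Ward linkage
Z_demo = linkage(X_demo, method='ward')

# Cut the tree so that at most 3 flat clusters remain (labels start at 1, not 0)
flat_labels = fcluster(Z_demo, t=3, criterion='maxclust')
print(flat_labels)

# Number of samples in each flat cluster
print(np.bincount(flat_labels)[1:])
```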

In [None]:
hier_cluster = AgglomerativeClustering(n_clusters=5).fit(X)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c=hier_cluster.labels_)

In [None]:
X,y = make_blobs(n_features=2, n_samples=1000, centers=11, random_state=4, cluster_std=2)
plt.figure(figsize=(10,10))
plt.scatter(X[:,0],X[:,1],s=300, c='black')

In [None]:
# Ward linkage matches the default linkage used by AgglomerativeClustering
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()

#### <center> Checking how many samples are in each cluster

In [None]:
hier_cluster = AgglomerativeClustering(n_clusters=3).fit(X)

In [None]:
import pandas as pd
pd.Series(hier_cluster.labels_).value_counts()

#### <center> Checking the statistics of each cluster

In [None]:
n_clusters = 5
X,y = make_blobs(n_features=5, n_samples=1000, centers=11, random_state=4, cluster_std=2)
X = X + 25
hier_cluster = AgglomerativeClustering(n_clusters=n_clusters).fit(X)

In [None]:
# Get the mean value of each feature in each cluster and add it to a DataFrame
cluster_stats = pd.DataFrame()
for i in range(n_clusters):
    cluster_stats['Cluster ' + str(i+1)] = X[hier_cluster.labels_==i].mean(axis=0)
# Rows are features (not clusters) at this point, so label them by the number of columns in X
cluster_stats.index = ['X' + str(i) for i in range(X.shape[1])]
cluster_stats = cluster_stats.T
cluster_stats

In [None]:
for i in cluster_stats.columns:
    plt.bar(cluster_stats.index, cluster_stats[i])
    plt.title(i)
    plt.show()

# <center> Activity

Using customer data in <i>Mall_Customers.csv</i>, perform market segmentation with both K-Means and Hierarchical Clustering, choosing the best number of clusters for each method.

Use statistics of each cluster to summarize the differences between the customer segments.

In which segment would you put a 34-year-old male customer who earns $75,000 a year and has a spending score of 75?
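
One point to keep in mind for the last question: `KMeans` has a `predict` method for new samples, but `AgglomerativeClustering` does not. A possible workaround, sketched here on synthetic data rather than on Mall_Customers.csv, is to assign the new sample to the cluster whose mean is closest. The data, the new sample, and the names `X_demo`, `centroids`, and `new_sample` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the mall customer data)
X_demo, _ = make_blobs(n_features=3, n_samples=200, centers=4, random_state=4)
n_clusters = 4
hier = AgglomerativeClustering(n_clusters=n_clusters).fit(X_demo)

# Mean of each cluster, one row per cluster
centroids = np.array([X_demo[hier.labels_ == i].mean(axis=0) for i in range(n_clusters)])

# Hypothetical new sample: assign it to the cluster with the nearest mean
new_sample = np.array([0.0, 0.0, 0.0])
distances = np.linalg.norm(centroids - new_sample, axis=1)
nearest_cluster = int(np.argmin(distances))
print(nearest_cluster)
```

When features are on very different scales (for example income and age), it is usually worth standardizing them before clustering and before measuring distances, so that no single feature dominates.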