# ProgrammingAssignment06_clusterAnalysis 

## 1. k-means using scikit-learn
The healthy_lifestyle dataset contains information on lifestyle measures such as amount of sunshine, pollution, and happiness levels for 44 major cities around the world. Apply k-means clustering to the cities' number of hours of sunshine and happiness levels.

- Import the needed packages for clustering.
- Initialize and fit a k-means clustering model using sklearn's KMeans() function.
- Use the user-defined number of clusters, init='random', n_init=10, random_state=123, and algorithm='elkan'.
- Find the cluster centroids and inertia.

Ex: If the input is: 4

the output should be:

- Centroids: [[ 0.8294  0.2562]
 [ 1.3106 -1.887 ]
 [-0.9471  0.8281]
 [-0.6372 -0.7943]]
- Inertia: 16.4991
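
Inertia is the sum of squared distances from each sample to its nearest centroid. The sketch below recomputes it by hand and compares it with the value scikit-learn reports. The random array `X_demo` is an assumption standing in for the standardized features, so the numbers will not match the example output.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the standardized features (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(44, 2))

# Same settings as the task statement, with k fixed at 4 for illustration
kmeans_demo = KMeans(n_clusters=4, init='random', n_init=10,
                     random_state=123, algorithm='elkan')
kmeans_demo.fit(X_demo)

# Squared distance from every sample to every centroid, shape (44, 4)
diffs = X_demo[:, None, :] - kmeans_demo.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# Inertia = sum over samples of the squared distance to the closest centroid
manual_inertia = sq_dists.min(axis=1).sum()

print("Manual inertia: ", round(manual_inertia, 4))
print("sklearn inertia:", round(kmeans_demo.inertia_, 4))
```

Because inertia always shrinks as the number of clusters grows, it is most useful for comparing fits that use the same k.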

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Load dataset
healthy = pd.read_csv('healthy_lifestyle.csv')

# Input the number of clusters
number = int(input())

# Define input features
X = healthy[['sunshine_hours', 'happiness_levels']]

# Drop rows with any NaN values
X = X.dropna()

# Use StandardScaler() to standardize input features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Initialize a k-means clustering algorithm with the settings from the task statement
kmeans = KMeans(n_clusters=number, init='random', n_init=10, random_state=123, algorithm='elkan')

# Fit the algorithm to the input features
kmeans.fit(X)

# Print the cluster centroids and inertia
centroids = kmeans.cluster_centers_
print("Centroids:", np.round(centroids, 4))

inertia = kmeans.inertia_
print("Inertia:", np.round(inertia, 4))
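
The centroids printed by the cell above are in standardized units (z-scores), not in hours of sunshine or happiness scores. One way to read them on the original scale is `StandardScaler.inverse_transform()`. A minimal sketch, assuming a small made-up dataframe `demo` in place of `healthy_lifestyle.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only (not taken from healthy_lifestyle.csv)
demo = pd.DataFrame({
    'sunshine_hours': [1500, 1600, 1700, 3000, 3100, 3200],
    'happiness_levels': [7.4, 7.2, 7.3, 5.1, 5.3, 5.0],
})

scaler_demo = StandardScaler()
X_demo = scaler_demo.fit_transform(demo)

kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=123).fit(X_demo)

# Centroids in standardized units, then mapped back to the original units
centroids_std = kmeans_demo.cluster_centers_
centroids_orig = scaler_demo.inverse_transform(centroids_std)

print(pd.DataFrame(centroids_orig, columns=demo.columns).round(2))
```

Standardization is an affine transformation, so each back-transformed centroid equals the mean of its cluster's members in the original units.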


## 2. Hierarchical clustering using scikit-learn
The healthy_lifestyle dataset contains information on lifestyle measures such as amount of sunshine, pollution, and happiness levels for 44 major cities around the world. Apply agglomerative clustering to the cities' number of hours of sunshine and happiness levels using both sklearn and SciPy.

- Import the needed packages for agglomerative clustering from sklearn and SciPy.
- Initialize and fit an agglomerative clustering model using sklearn's AgglomerativeClustering() function. Use the user-defined number of clusters and ward linkage.
- Add cluster labels to the input feature dataframe.
- Calculate the distances between all instances using SciPy's pdist() function.
- Convert the condensed distance vector returned by pdist() to a square matrix using SciPy's squareform() function.
- Define a clustering model with ward linkage using SciPy's linkage() function.

Ex: If the input is: 4

the output should be:

|       | sunshine_hours | happiness_levels | labels |
|-------|----------------|------------------|--------|
| 0     | -0.691660      | 1.025642         | 3      |
| 1     | 0.695725       | 0.801124         | 0      |
| 2     | -0.645295      | 0.872562         | 3      |
| 3     | -0.757641      | 0.933794         | 3      |
| 4     | -1.098246      | 1.229750         | 3      |


First five rows of the linkage matrix from SciPy:
    
 - [[39. 40.  0.  2.]
 [28. 43.  0.  3.]
 [ 7. 18.  0.  2.]
 [ 8. 42.  0.  2.]
 [ 0.  3.  0.  2.]]
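
Each row of a SciPy linkage matrix records one merge: the two cluster indices joined, the distance between them, and the size of the new cluster. With n observations, indices below n are original observations and index n + i refers to the cluster created in row i. For instance, a size of 3 in the second row of the example suggests that index 43 there refers to the cluster formed in the first row rather than to an original observation. A small sketch on four made-up one-dimensional points (an assumption for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up 1-D points, chosen so that no two merges tie
points = np.array([[0.0], [1.0], [5.0], [7.0]])

Z = linkage(points, method='ward')

# Columns: index of cluster A, index of cluster B, merge distance, new cluster size
for i, (a, b, dist, size) in enumerate(Z):
    new_id = len(points) + i  # id assigned to the cluster created by this merge
    print(f"row {i}: merge {int(a)} and {int(b)} at distance {dist:.3f} "
          f"-> cluster {new_id} with {int(size)} points")
```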

In [None]:
# Import needed packages
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage
import pandas as pd
import numpy as np

# Load dataset
healthy = pd.read_csv('healthy_lifestyle.csv')

# Input the number of clusters
number = int(input())

# Define input features
X = healthy[['sunshine_hours', 'happiness_levels']]

# Drop rows with any NaN values
X = X.dropna()

# Use StandardScaler() to standardize input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=['sunshine_hours', 'happiness_levels'])

# Initialize and fit an agglomerative clustering model using ward linkage
agglo_clustering = AgglomerativeClustering(n_clusters=number, linkage='ward')
X_scaled['labels'] = agglo_clustering.fit_predict(X_scaled)

# Round to match example output format
X_scaled['sunshine_hours'] = X_scaled['sunshine_hours'].round(6)
X_scaled['happiness_levels'] = X_scaled['happiness_levels'].round(6)

# Print the DataFrame
print(X_scaled.head())

# Calculate the distances between all instances
distances = pdist(X_scaled[['sunshine_hours', 'happiness_levels']])

# Convert the distance matrix to a square matrix
square_distances = squareform(distances)

# Define a clustering model with ward linkage
# Note: linkage() reads a 2-D array as raw observations, not as precomputed distances
clustersHealthyScipy = linkage(square_distances, method='ward')
print("First five rows of the linkage matrix from SciPy:\n", np.round(clustersHealthyScipy[:5, :], 1))
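
`pdist()` returns a condensed vector holding the n(n-1)/2 pairwise distances, and `squareform()` expands it into an n-by-n symmetric matrix with a zero diagonal. SciPy's `linkage()` interprets a 1-D input as condensed distances and a 2-D input as observations, so the form passed in matters. The sketch below compares the two forms on synthetic data (the array `X_demo` is an assumption):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 2))  # synthetic observations, illustration only

condensed = pdist(X_demo)       # 1-D vector: one Euclidean distance per pair
square = squareform(condensed)  # 2-D matrix: symmetric, zero diagonal

print("Condensed shape:", condensed.shape)
print("Square shape:   ", square.shape)

# Linkage built from the condensed distances matches linkage built from the points
Z_from_condensed = linkage(condensed, method='ward')
Z_from_points = linkage(X_demo, method='ward')
print("Same linkage:", np.allclose(Z_from_condensed, Z_from_points))
```

The square form is convenient for looking up the distance between two specific instances, while the condensed form is the one `linkage()` accepts as precomputed distances.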


## 3. DBSCAN using scikit-learn
- Increase the **number of points sampled to 500**.
- Apply the DBSCAN model with **epsilon=1** and **min_samples=8** to identify the number of core points and outliers (noise).
- Ex: With epsilon=1, min_samples=10, and 100 sampled points:
  - the number of core points = 85
  - the number of outliers = 11

In [None]:
# Import needed packages
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load dataset and take a random sample of 500 instances
data = pd.read_csv('customer_personality.csv').sample(500, random_state=123)

# Use StandardScaler() to standardize input features
X = data[['Fruits', 'Meats']]
scaler = StandardScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X)

# Apply DBSCAN with epsilon=1 and min_samples=8
dbscan = DBSCAN(eps=1, min_samples=8)
dbscan.fit(X)

# Print the cluster labels, the number of core points, and the number of outliers
print('Labels:', dbscan.labels_)
print('Number of core points:', len(dbscan.core_sample_indices_))
print('Number of outliers:', np.sum(dbscan.labels_ == -1))

# Add the cluster labels to the dataset as strings
data['clusters'] = dbscan.labels_.astype(str)

# Sort by cluster label for plotting purposes
data.sort_values(by='clusters', inplace=True)

# Plot clusters on the original data
p = sns.scatterplot(data=data, x='Fruits', y='Meats', hue='clusters', style='clusters')
p.set_xlabel('Fruits', fontsize=16)
p.set_ylabel('Meats', fontsize=16)
p.legend(title='DBSCAN')
plt.show()
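
DBSCAN places every point in one of three groups: core points, border points (assigned to a cluster but without enough neighbors to be core), and noise, which scikit-learn labels as -1. In the example from the task statement, 85 core points and 11 outliers out of 100 samples leave 4 border points. The sketch below counts all three groups on made-up points; the arrays `grid`, `edge`, and `far` and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points: a dense 5x5 grid, one point near its edge, and two far-away points
grid = np.array([[x, y] for x in np.arange(5) * 0.1 for y in np.arange(5) * 0.1])
edge = np.array([[0.6, 0.0]])
far = np.array([[10.0, 10.0], [-10.0, -10.0]])
X_demo = np.vstack([grid, edge, far])

db = DBSCAN(eps=0.25, min_samples=5).fit(X_demo)

# Boolean mask of core points, built from their indices
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True

is_noise = db.labels_ == -1       # DBSCAN labels noise as -1
is_border = ~is_core & ~is_noise  # in a cluster, but not dense enough to be core

print("Core points:  ", is_core.sum())
print("Border points:", is_border.sum())
print("Outliers:     ", is_noise.sum())
```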
