# Assignment 4

Welcome to your machine learning assignment! In this assignment, we will explore the fascinating world of unsupervised learning and delve into the K-means clustering algorithm. We will be using Python and the scikit-learn library to implement and evaluate the performance of the K-means clustering algorithm.

In this assignment, I will provide you with a dataset containing various data points. We will use this dataset to identify clusters within the data using the K-means algorithm. By grouping similar data points together, K-means clustering allows us to gain insights and discover patterns in the data.

Once we have implemented the K-means algorithm and obtained the cluster labels and cluster centers, we will evaluate the performance of the clustering algorithm. We will use the silhouette score, a widely used metric for unsupervised clustering evaluation, to assess the quality of the identified clusters.

Furthermore, we will visualize the data and the identified clusters, allowing us to gain a better understanding of the clustering results and how the data points are grouped together.

Are you ready? Let's dive into the exciting world of K-means clustering and unleash the power of unsupervised learning!

The dataset for my example can be found here: [Mall Customer Segmentation Data](https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python)

## Part 1 - How it's done

In this part I demonstrate how to train a model on existing data. I load the dataset and extract the relevant columns for clustering: 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)'.

I generate an elbow plot to find a suitable number of clusters. Here we can see that 4 clusters is a fitting number, since that's where the plot starts to flatten out. I then initialize the K-means clustering algorithm with `n_clusters=4`, indicating that we want to identify 4 different customer segments.

After fitting the model to the data, I obtain the predicted labels and cluster centers. The labels indicate which segment each customer belongs to, and the centers represent the centroids of each cluster.

I visualize the data and the identified clusters using a scatter plot. Each customer is represented by the two principal components computed from the scaled features, and colored according to their assigned cluster. The cluster centers are marked with red crosses.

By running this code, you can observe how the K-means algorithm segments customers based on their characteristics and buying patterns. Feel free to customize the code further to suit your specific needs or explore different aspects of customer segmentation.

### 1. Import the required libraries

First, I import the necessary libraries and modules that will be used in the code:

- `numpy` is imported and aliased as `np` to provide functions for mathematical operations and array manipulations. It is used, for example, to build the tick positions for the colorbar in the final cluster plot.
- `matplotlib.pyplot` module is imported and aliased as `plt` to create various types of plots, such as line plots, scatter plots, and bar plots. It is used to generate the elbow plot and visualize the data and clusters.
- `KMeans` is a class in the `sklearn.cluster` module that implements the K-means clustering algorithm. By importing `KMeans`, we can create an instance of the K-means clustering algorithm to cluster data points.
- `pandas` is imported and aliased as `pd` to provide functions and data structures to work with tabular data, such as CSV files. It is used to read the customer data from a CSV file and extract the relevant columns for clustering.
- `silhouette_score` is a function in the `sklearn.metrics` module. It is used to evaluate the quality of the clusters.
- `StandardScaler` is a class in the `sklearn.preprocessing` module. It is used to scale the features before clustering.
- `PCA` is a class in the `sklearn.decomposition` module. It is used to reduce the scaled features to two principal components.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

### 2. Read the customer data from a CSV file

Here I am reading a CSV file named 'A4_Mall_Customers.csv' using the `pd.read_csv()` function from the pandas library.

By using `pd.read_csv()`, I can load the contents of the CSV file into a pandas `DataFrame` object named `df`. This DataFrame allows me to easily manipulate and analyze the data within the CSV file.

The CSV file needs to be located in the same directory as the code file. If it is located elsewhere, I would need to provide the correct path to the file.

Once the CSV file is read and stored in the DataFrame `df`, I can perform various operations on the data, such as data cleaning, exploration, visualization, or applying machine learning algorithms for further analysis.

In [None]:
df = pd.read_csv('A4_Mall_Customers.csv')
df

### 3. Extract the relevant columns for clustering

Now I am accessing specific columns of the DataFrame `df` and storing them in a new DataFrame named X.

Using the double square brackets `[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]`, I am selecting the columns 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)' from the original DataFrame `df`.

By doing this, I am creating a subset of the original DataFrame that contains only the selected columns. This can be useful when we want to focus on specific features or variables for further analysis or modeling.

The resulting DataFrame `X` will have the same number of rows as the original DataFrame, but it will only contain the columns 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)'. This subset of columns can be used for various purposes, such as clustering, regression, or any other analysis that requires working with specific features of the data.

By executing this code, I am creating a new DataFrame `X` that contains the desired columns from the original DataFrame `df`, allowing me to work with a more focused set of features for subsequent steps in my analysis.

In [None]:
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

### 4. Scale the dataset using StandardScaler

Scaling the dataset with StandardScaler ensures that the features have similar scales. K-means relies on distances, so without scaling a feature with large values would dominate the result. It is important to apply the same fitted scaler to new data points before using the trained K-means model to assign them to clusters. To scale the data we follow these steps:

1. Create an instance of the StandardScaler class.
2. Fit the scaler on our dataset.
3. Transform the dataset using the fitted scaler. This scales the features of our dataset based on the mean and standard deviation calculated during the fit step.

In [None]:
# 1
scaler = StandardScaler()

# 2
scaler.fit(X)

# 3
X_scaled = scaler.transform(X)


# Output our results
print("Shape of X_scaled:", X_scaled.shape)
print("Mean of each feature in X_scaled:", X_scaled.mean(axis=0))
print("Standard deviation of each feature in X_scaled:", X_scaled.std(axis=0))
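
To make the fit and transform steps less of a black box, here is a minimal sketch of what the scaler computes. The toy array `X_toy` and the point `new_customer` are made-up values for illustration, not part of the mall dataset. The sketch checks that `StandardScaler` matches the formula `(x - mean) / std`, and shows how a new data point is scaled with the already-fitted scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 customers, 2 features on very different scales
X_toy = np.array([[19, 15000.0],
                  [21, 15500.0],
                  [35, 40000.0],
                  [50, 62000.0],
                  [64, 98000.0]])

scaler_toy = StandardScaler()
X_toy_scaled = scaler_toy.fit_transform(X_toy)

# StandardScaler computes z = (x - mean) / std per feature,
# using the population standard deviation (ddof=0, the NumPy default)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(np.allclose(X_toy_scaled, X_manual))

# A new data point is transformed with the scaler fitted on the training data,
# so it is measured against the same mean and standard deviation
new_customer = np.array([[30, 50000.0]])
new_scaled = scaler_toy.transform(new_customer)
print(new_scaled)
```

Note that `transform()` is called on the new point, not `fit()` or `fit_transform()`, since refitting would replace the stored mean and standard deviation.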

### 5. Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex data and helps make the data easier to work with by transforming it into a new set of variables called principal components. Sometimes, data can have many different measurements or features, which can be overwhelming. PCA simplifies the data by finding the most important patterns or directions in the data.

The K-means algorithm looks for groups or clusters in the data. PCA can help by re-expressing the data along the directions where it varies the most, which often (but not always) makes the clusters easier to see. It also lets us plot our three features in two dimensions.

When we use PCA we simplify and clean the data, making it easier to understand and find patterns. This helps in identifying distinct groups or clusters in the data, even if we have limited knowledge about machine learning.

The code below does the following:

1. Create an instance of the PCA class, specifying 2 as the number of variables / principal components.
2. Fit the PCA-object on our dataset.
3. Transform our scaled data using the fitted PCA-object. This projects the data onto the principal components, which capture the most important patterns in the data.

*Steps 2 and 3 can be combined by using `pca.fit_transform(X_scaled)`.*

In [None]:
# 1
pca = PCA(n_components=2)

# 2
pca.fit(X_scaled)

# 3
X_pca = pca.transform(X_scaled)

# Output the shape of the transformed data
X_pca.shape

X_pca
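
As a rough illustration of the steps above, the following sketch uses randomly generated data (`X_demo`, a hypothetical stand-in for `X_scaled`) to check two things: that `fit()` followed by `transform()` gives the same result as `fit_transform()`, and how much of the total variance the two components keep. The attribute `explained_variance_ratio_` is part of scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for X_scaled: 200 samples, 3 features,
# where the third feature is partly a copy of the first
X_demo = rng.normal(size=(200, 3))
X_demo[:, 2] = 0.8 * X_demo[:, 0] + 0.2 * X_demo[:, 2]

# Variant A: fit and transform as two separate steps
pca_a = PCA(n_components=2)
pca_a.fit(X_demo)
X_a = pca_a.transform(X_demo)

# Variant B: both steps in one call
pca_b = PCA(n_components=2)
X_b = pca_b.fit_transform(X_demo)

print(np.allclose(X_a, X_b))

# Share of the total variance captured by each component, and by both together
print(pca_a.explained_variance_ratio_)
print(pca_a.explained_variance_ratio_.sum())
```

If the summed ratio is low, the two-dimensional plot hides a lot of the structure in the data, which is worth keeping in mind when reading the cluster plots.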

### 6. Generate elbow plot to find the right amount of clusters

I am performing K-means clustering on the PCA-transformed data stored in `X_pca` to determine the optimal number of clusters using the elbow method. I then plot an elbow curve to visualize the relationship between the number of clusters and the inertia.

I first initialize an empty list called `inertia_values`. This list will be used to store the inertia (sum of squared distances to the nearest centroid) for different values of `k`.

Afterwards, I create a range of values for `k` using `range(1, 15)`. This range represents the numbers of clusters that will be tested (1 to 14).

I then enter a `for` loop to iterate over each value of k. For each iteration, I perform the following steps:

1. Initialize the K-means clustering algorithm using `KMeans(n_clusters=k, random_state=0, n_init=10)`. This creates an instance of the KMeans class with the specified number of clusters (`k`), a random seed value of 0 for reproducibility, and 10 runs with different initial centroids.

2. Fit the K-means model to the data using `kmeans.fit(X_pca)`. This step calculates the cluster centers and assigns data points to their nearest centroids based on the specified `k` value.

3. Compute the inertia of the K-means model using `kmeans.inertia_`, which represents the sum of squared distances of samples to their closest cluster center.

4. Append the inertia value to the `inertia_values` list using `inertia_values.append(kmeans.inertia_)`.

After the loop finishes, I plot the elbow curve using `plt.plot()` with `k_values` on the x-axis and `inertia_values` on the y-axis. This curve helps visualize the inertia values for different numbers of clusters.

I add labels and a title to the plot using `plt.title()`, `plt.xlabel()`, and `plt.ylabel()`. I display the plot using `plt.show()`.

So, the code will iterate over different values of k, fit the K-means model to the data, compute the inertia values, and generate an elbow plot. This allows me to determine the optimal number of clusters for customer segmentation based on the elbow point where the inertia values start to level off.

In [None]:
# Initialize a list to store the inertia values
inertia_values = []

# Try different values of k
k_values = range(1, 15)
for k in k_values:
    # Initialize K-means clustering algorithm
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    
    # Fit the model to the data
    kmeans.fit(X_pca)
    
    # Get the inertia (sum of squared distances to the nearest centroid)
    inertia_values.append(kmeans.inertia_)
    
# Plot the elbow curve
plt.plot(k_values, inertia_values, marker='o')
plt.title('Elbow Plot (Customer Segmentation)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()
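
To show what the inertia on the y-axis actually measures, here is a small sketch that recomputes it by hand. It uses synthetic data from scikit-learn's `make_blobs` instead of the customer data, and the names `X_blobs`, `km`, and `manual_inertia` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups, used only for illustration
X_blobs, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_blobs)

# Look up the center of the cluster each sample was assigned to
assigned_centers = km.cluster_centers_[km.labels_]

# Inertia: sum of squared distances from each sample to its assigned center
manual_inertia = ((X_blobs - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating point rounding. Since adding clusters can only bring samples closer to some center, inertia keeps shrinking as `k` grows, which is why we look for the bend in the curve rather than its minimum.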

In [None]:
# Let's save the number of clusters we want as K to use for later.
# Based on the elbow graph, 4 seems like a good fit
# If your choice here gives a bad result you can just 
# change this number and run the code again from here
K = 4

### 7. Initialize K-means clustering algorithm

In this code, I am initializing the K-means clustering algorithm using the `KMeans` class from scikit-learn, with the number of clusters stored in `K` (4 in this example).

I create an instance of the `KMeans` class and assign it to the variable `kmeans`. The `n_clusters` parameter is set to `K`, indicating that I want to divide the data into 4 clusters.

Additionally, I set the `random_state` parameter to 0 to ensure reproducibility of the results.

The K-means algorithm is typically run multiple times with different centroid seeds (random initial positions of the cluster centers). The parameter `n_init` specifies the number of runs to perform. Among these runs, the one with the lowest inertia is selected. A common value to start with here is 10.

The code will create a K-means clustering algorithm object (`kmeans`) ready to be fitted to the data to identify the 4 clusters based on their characteristics and patterns.

In [None]:
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10)

### 8. Fit the model to the data

Here I am fitting the K-means clustering algorithm to the PCA-transformed data stored in `X_pca` using the `fit()` method of the `kmeans` object.

By calling `kmeans.fit(X_pca)`, I am training the K-means model on the data `X_pca`. The algorithm will calculate the cluster centers and assign each data point to its nearest centroid based on the specified number of clusters and the two principal components in `X_pca`.

This step is crucial as it performs the actual clustering of the data. The algorithm iteratively adjusts the cluster centers to minimize the inertia (sum of squared distances to the nearest centroid) until convergence.

After this code is executed, the K-means model has learned the cluster centers based on the data and is ready to make predictions or perform further analysis using the trained clusters.

In [None]:
kmeans.fit(X_pca)
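
The iterative adjustment described above can be sketched in a few lines of NumPy. This is a simplified version of the classic procedure (often called Lloyd's algorithm), not scikit-learn's actual implementation, which adds smarter initialization and several optimizations. The function name `lloyd_kmeans` and the demo data are made up for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every sample
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its samples
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic groups around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])

labels_demo, centers_demo = lloyd_kmeans(X_demo, k=2)
print(centers_demo)
```

Each pass alternates between assigning samples to the nearest center and moving the centers, and neither step can increase the inertia, so the loop settles on a (possibly local) optimum. That dependence on the starting positions is the reason `KMeans` repeats the procedure `n_init` times.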

### 9. Get the predicted labels and cluster centers

Below I am using the fitted K-means clustering algorithm (kmeans) to obtain the cluster labels and cluster centers.

By accessing `kmeans.labels_`, I am retrieving the cluster labels for each data point in `X_pca`. The resulting `labels` array will contain an assigned cluster label for each data point, indicating which cluster that data point belongs to.

Similarly, by accessing `kmeans.cluster_centers_`, I am obtaining the coordinates of the cluster centers for each of the clusters identified by the K-means algorithm. The resulting `centers` array will contain the centroid coordinates for each cluster.

These cluster labels and cluster centers can be used for various purposes. The cluster labels can help in understanding the assignment of data points to different clusters, while the cluster centers provide insight into the representative characteristics of each cluster.

Now I can access the cluster labels and cluster centers obtained from the fitted K-means model, allowing me to further analyze and interpret the results of the clustering algorithm.

In [None]:
labels = kmeans.labels_
centers = kmeans.cluster_centers_

centers
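
The centers printed above are expressed in principal components, which are hard to read as customer characteristics. One possible way to get a feel for them, sketched below on made-up data, is to map the centers back through PCA and the scaler with their `inverse_transform()` methods. All names ending in `_demo`, and the random data in `X_raw`, are assumptions for this illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the three customer features (age, income, score)
X_raw = np.column_stack([
    rng.uniform(18, 70, size=200),
    rng.uniform(15, 140, size=200),
    rng.uniform(1, 100, size=200),
])

# Same pipeline as in the notebook: scale, reduce to 2 components, cluster
scaler_demo = StandardScaler().fit(X_raw)
X_scaled_demo = scaler_demo.transform(X_raw)
pca_demo = PCA(n_components=2).fit(X_scaled_demo)
X_pca_demo = pca_demo.transform(X_scaled_demo)
km_demo = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X_pca_demo)

# Undo PCA, then undo the scaling. Because PCA dropped one component,
# the result is an approximation in the original units, not an exact value.
centers_scaled = pca_demo.inverse_transform(km_demo.cluster_centers_)
centers_original = scaler_demo.inverse_transform(centers_scaled)

print(centers_original.round(1))
```

Each row is one cluster center, with columns in the same order as the original features.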

### 10. Visualize the data and clusters

Here I am creating a scatter plot to visualize the customer segmentation results using the PCA-transformed data (`X_pca`), the cluster labels (`labels`), and the cluster centers (`centers`).

Using `plt.scatter()`, I plot the data points from `X_pca`, with the first principal component on the x-axis and the second on the y-axis. The `c=labels` argument colors each point according to its cluster.

I also use `plt.scatter()` to plot the cluster centers on the scatter plot. The x-coordinates of the centers are taken from `centers[:, 0]` and the y-coordinates from `centers[:, 1]`. The `color` parameter is set to 'red' to color the cluster centers red, and the `marker` parameter is set to 'x' to mark the cluster centers with a cross.

This code visualizes the customer segmentation results on a scatter plot, where each data point is colored according to its assigned cluster label, and the cluster centers are marked with red crosses. This visualization helps in understanding how customers are grouped along the two principal components, which combine age, annual income, and spending score.

In [None]:
# Plot the data with cluster assignments
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-means Clustering on PCA Data')
plt.legend()
plt.show()

### 11. Evaluate the performance using silhouette score

The Silhouette Score is a measure that quantifies how well each data point fits into its assigned cluster. It provides an indication of how compact and well-separated the clusters are.

Its value ranges from -1 to 1. When the average Silhouette Score across all data points is closer to 1, it suggests that the clusters are well-separated and distinct. Conversely, a score closer to 0 or negative values may indicate overlapping or poorly separated clusters.

In [None]:
silhouette_avg = silhouette_score(X_pca, labels)
print(f"Silhouette Score: {silhouette_avg}")
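
For a single data point the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points in the nearest other cluster. The sketch below, on synthetic `make_blobs` data with made-up variable names, uses scikit-learn's `silhouette_samples` to show that the overall score is simply the average of these per-point values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with three groups, used only for illustration
X_blobs, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)
labels_blobs = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_blobs)

# One silhouette value per sample, each between -1 and 1
per_sample = silhouette_samples(X_blobs, labels_blobs)

# silhouette_score is the mean of the per-sample values
print(np.isclose(per_sample.mean(), silhouette_score(X_blobs, labels_blobs)))

# Mean silhouette per cluster: a low value points at a weak cluster
for cluster_id in np.unique(labels_blobs):
    print(cluster_id, per_sample[labels_blobs == cluster_id].mean())
```

Looking at the values per cluster can reveal that a decent overall score hides one poorly separated cluster.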

### 12. More clusters

If we want more clusters (for more fine-grained differentiation) while maintaining a silhouette score near the one we have, we can run the same loop as in step 6 but plot the silhouette score for each value of `k` instead of the inertia.

There are a few caveats to this method:

- Increasing the number of clusters to maximize the silhouette score can lead to overfitting. Overfitting occurs when the model becomes too complex and starts capturing noise or random variations in the data, rather than meaningful patterns.
- The more clusters you have, the fewer samples there will be in each. Too few samples per cluster makes the model and the clusters irrelevant.
- For every additional value of `k` (number of clusters to test) we have to train the model once more, which can be time- and resource-consuming on large datasets.
- The silhouette score is just one metric for evaluating cluster quality. It may not capture all aspects of cluster quality.
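
The second caveat is easy to check in practice. As a small sketch, assuming synthetic data of the same size as our dataset (200 samples from `make_blobs`), the loop below counts the samples per cluster with `np.bincount` for a few values of `k`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 200 samples, used only for illustration
X_blobs, _ = make_blobs(n_samples=200, centers=4, random_state=0)

for k in (4, 10, 30):
    labels_k = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X_blobs)
    # Number of samples assigned to each cluster label 0..k-1
    sizes = np.bincount(labels_k, minlength=k)
    print(f"k={k}: smallest cluster has {sizes.min()} samples, "
          f"average is {sizes.mean():.1f}")
```

With 200 samples and 30 clusters the average cluster holds fewer than 7 samples, which is very little to base a customer segment on.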

With that in mind, let's test anyways:

In [None]:
# Initialize a list to store the silhouette scores
sc_values = []
max_score = -1  # Lowest possible silhouette score
max_score_k = 0

# Try different values of k
k_values = range(5, 15)
for k in k_values:
    # Initialize K-means clustering algorithm
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)

    # Fit the model to the data
    kmeans.fit(X_pca)
    labels = kmeans.labels_

    # Get the silhouette score
    sc_value = silhouette_score(X_pca, labels)
    sc_values.append(sc_value)

    # Save the highest value and the k that produced it
    if sc_value > max_score:
        max_score, max_score_k = sc_value, k

# Plot the silhouette scores
plt.plot(k_values, sc_values, marker='o')
plt.title(f"optimal K: {max_score_k} with score: {max_score}")
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

Here we can see that 10 clusters might be fitting, since the score is almost as high as with 4 clusters. If we were to continue we would get even higher scores at 26, 30, and 37 clusters, but that's way too many clusters since our data only contains 200 samples (see for yourself with `df.describe()`).

Let's plot the data with 10 clusters by doing the same as before:

1. Initialize and train the model
2. Get the labels and centers
3. Plot the results

In [None]:
kmeans = KMeans(n_clusters=10, random_state=0, n_init=10)

kmeans.fit(X_pca)

labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Define a list of distinct colors for the labels (0 to 9)
label_colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'purple', 'orange', 'brown']

# Create a discrete colormap with one color per label. The boundaries
# at -0.5, 0.5, ..., 9.5 put each integer label in its own color bin.
from matplotlib.colors import BoundaryNorm, ListedColormap
cmap = ListedColormap(label_colors)
norm = BoundaryNorm(np.arange(-0.5, 10.5, 1), cmap.N)

# Color the points through the colormap so that the colorbar matches them
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap=cmap, norm=norm)

# Create a colorbar with discrete colors and one tick per label
cb = plt.colorbar(scatter, ticks=np.arange(0, 10, 1))
cb.set_label("Cluster Labels")

plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-means Clustering on PCA Data')
plt.legend()
plt.show()

In [None]:
df["labels"] = labels # Add labels as new column
df["pca1"] = X_pca[:, 0]
df["pca2"] = X_pca[:, 1]
df

In [None]:
df[df["labels"] == 7] # Get all label 7

In [None]:
#1 df["Age"].value_counts()
#2 df["Age"].describe()
#3 df[df["labels"] == 7]["Age"].describe()
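
Instead of inspecting one label at a time, the clusters can be compared side by side with `groupby()`. The sketch below uses a tiny hand-written table (`df_demo`, with made-up values) that mimics the structure of `df` after the `labels` column has been added.

```python
import pandas as pd

# Hypothetical miniature of the customer table with a "labels" column
df_demo = pd.DataFrame({
    "Age": [19, 21, 45, 50, 33, 36],
    "Annual Income (k$)": [15, 16, 60, 62, 120, 118],
    "Spending Score (1-100)": [39, 81, 50, 48, 90, 12],
    "labels": [0, 0, 1, 1, 2, 2],
})

# Average of every feature within each cluster
profile = df_demo.groupby("labels").mean()
print(profile)

# Number of customers in each cluster
sizes = df_demo["labels"].value_counts().sort_index()
print(sizes)
```

On the real data, non-numeric columns such as `Gender` would need to be left out before calling `mean()`, for example by selecting the numeric columns first.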

--- 

## Part 2 - Your turn!

Now it's your turn to do the same thing but with *Credit card data* instead. Train a model and try to find a fitting number of clusters based on the number of samples. These are the basic steps:

- Load the data
- Extract the relevant columns
- Scale the data
- Reduce the dimensions using PCA
- Plot the inertia elbow plot and make a guess for the number of clusters
- Train your model
- Plot the segmented data and evaluate the model

The dataset can be found and downloaded here [Kaggle CreditCardData](https://www.kaggle.com/datasets/vikashsingh999/creditcarddata)

### 1. Load and prepare the dataset

Load the data with `pd.read_csv()`.

Explore the columns with `df.info()` or on Kaggle.

Use `df.drop('COLUMN_NAME', axis=1)` to remove irrelevant columns.

Use `df.dropna()` to drop invalid rows.
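
As a hedged sketch of the two cleaning calls above, the example below builds a tiny table by hand. The column names `CUST_ID`, `BALANCE`, and `PURCHASES` and all values are assumptions for illustration; check `df.info()` for the columns that actually exist in the downloaded file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the real column names come from df.info()
df_demo = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [40.9, 3202.4, np.nan, 1666.6],
    "PURCHASES": [95.4, 0.0, 773.1, 1499.0],
})

# drop() and dropna() return new DataFrames, so the result must be assigned
df_clean = df_demo.drop("CUST_ID", axis=1)  # remove an identifier column
df_clean = df_clean.dropna()                # remove rows with missing values

print(df_clean.shape)
print(df_clean.isna().sum().sum())
```

An identifier column carries no information about customer behavior, and `StandardScaler`, `PCA`, and `KMeans` cannot handle text or missing values, which is why both steps come before scaling.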

In [27]:
import os
import warnings

# Limit OpenMP threads to avoid the KMeans memory leak warning on Windows with MKL
# (Intel Math Kernel Library). The variable is read when the numeric libraries are
# first loaded, so it is set before numpy and scikit-learn are imported.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Suppress the memory leak warning for KMeans on Windows
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")

# Step 1: Read the customer data from a CSV file
df = pd.read_csv('A4_Mall_Customers.csv')

# Step 2: Extract the relevant columns for clustering
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Step 3: Scale the dataset using StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Step 4: Principal Component Analysis (PCA) to reduce to 2D for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 5: Generate elbow plot to find the right number of clusters
inertia_values = []
k_values = range(1, 15)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_pca)
    inertia_values.append(kmeans.inertia_)

plt.plot(k_values, inertia_values, marker='o')
plt.title('Elbow Plot (Customer Segmentation)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

# Based on the elbow plot, choose the optimal K (e.g., K = 4)
K = 4

# Step 6: Initialize K-means clustering algorithm
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10)

# Step 7: Fit the model to the data
kmeans.fit(X_pca)

# Step 8: Get the predicted labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Step 9: Visualize the data and clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-means Clustering on PCA Data')
plt.legend()
plt.show()

# Step 10: Evaluate the performance using silhouette score
silhouette_avg = silhouette_score(X_pca, labels)
print(f"Silhouette Score: {silhouette_avg}")

# Step 11: Explore more clusters and evaluate silhouette score
sc_values = []
max_score = 0
max_score_k = 0
for k in range(5, 15):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_pca)
    labels = kmeans.labels_
    sc_value = silhouette_score(X_pca, labels)
    sc_values.append(sc_value)
    if sc_value > max_score:
        max_score, max_score_k = sc_value, k

# Step 12: Plot silhouette score for different values of k
plt.plot(range(5, 15), sc_values, marker='o')
plt.title(f"Optimal K: {max_score_k} with score: {max_score}")
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

# Print the optimal K and its silhouette score
print(f"Optimal number of clusters (K): {max_score_k} with Silhouette Score: {max_score}")


Silhouette Score: 0.42092860310837976
Optimal number of clusters (K): 11 with Silhouette Score: 0.4012100267330176


### 2. Scale the dataset

Scale your data using StandardScaler to make sure that the features have similar scales.

1. Create an instance of the StandardScaler class.
2. Fit the scaler on your dataset.
3. Transform the dataset using the fitted scaler.

In [55]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Load the dataset
df = pd.read_csv(r'C:\Users\nayif\py3b_nayef_omer\week9\A4_Mall_Customers.csv')  # Use the correct path
print(df.head())

# Step 2: Extract relevant columns for clustering
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Step 3: Scale the dataset
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Output scaling information
print("Shape of X_scaled:", X_scaled.shape)
print("Mean of each feature in X_scaled:", X_scaled.mean(axis=0))
print("Standard deviation of each feature in X_scaled:", X_scaled.std(axis=0))

# Step 4: Apply PCA (Principal Component Analysis)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Output transformed data shape
print("PCA transformed data shape:", X_pca.shape)

# Step 5: Generate elbow plot to find the optimal number of clusters
inertia_values = []
k_values = range(1, 15)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_pca)
    inertia_values.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(k_values, inertia_values, marker='o')
plt.title('Elbow Plot (Customer Segmentation)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

# Step 6: Choose the optimal number of clusters based on the elbow plot
K = 4  # Change this based on the elbow plot's optimal point

# Step 7: Fit K-means model
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10)
kmeans.fit(X_pca)

# Step 8: Get the predicted labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Step 9: Visualize the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title(f'K-means Clustering (k={K})')
plt.legend()
plt.show()

# Step 10: Evaluate performance using silhouette score
silhouette_avg = silhouette_score(X_pca, labels)
print(f"Silhouette Score: {silhouette_avg}")

# Step 11: Experiment with more clusters and evaluate silhouette score
sc_values = []
max_score = 0
max_score_k = 0

for k in range(5, 15):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_pca)
    labels = kmeans.labels_
    sc_value = silhouette_score(X_pca, labels)
    sc_values.append(sc_value)
    if sc_value > max_score:
        max_score = sc_value
        max_score_k = k

# Plot silhouette scores for different values of k
plt.plot(range(5, 15), sc_values, marker='o')
plt.title(f"Optimal K: {max_score_k} with score: {max_score}")
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

# Step 12: Visualize the final clustering result with the optimal k
kmeans_final = KMeans(n_clusters=max_score_k, random_state=0, n_init=10)
kmeans_final.fit(X_pca)
labels_final = kmeans_final.labels_
centers_final = kmeans_final.cluster_centers_

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_final, cmap='viridis')
plt.scatter(centers_final[:, 0], centers_final[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title(f'K-means Clustering (k={max_score_k})')
plt.legend()
plt.show()


   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
Shape of X_scaled: (200, 3)
Mean of each feature in X_scaled: [-1.02140518e-16 -2.13162821e-16 -1.46549439e-16]
Standard deviation of each feature in X_scaled: [1. 1. 1.]
PCA transformed data shape: (200, 2)
Silhouette Score: 0.42092860310837976


### 3. Principal Component Analysis

Use PCA to simplify and clean the data, which will make it easier to understand and find patterns.

Do the following:

1. Create an instance of the PCA class and specify the number of variables / principal components.
2. Fit the PCA-object on our dataset.
3. Transform the dataset using the fitted PCA-object (try doing this at the same time as step 2 by using `fit_transform()`).
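
When a dataset has many columns, it is less obvious how many principal components to keep. One possible approach, sketched below on randomly generated data (`X_demo` is a hypothetical stand-in for your scaled data), is to look at the cumulative explained variance. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps just enough components to reach that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical scaled data: 6 features driven by 2 underlying factors plus noise
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_demo = factors @ mixing + 0.1 * rng.normal(size=(500, 6))

# Fit with all components to see how the explained variance accumulates
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))

# A float in (0, 1) keeps enough components to explain that share of variance
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_demo)
print(pca_90.n_components_)
```

Two components are still the practical choice when the goal is a scatter plot, but the cumulative values show how much information that plot leaves out.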

In [87]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Load the dataset
df = pd.read_csv(r'C:\Users\nayif\py3b_nayef_omer\week9\A4_Mall_Customers.csv')  # Replace with the correct file path
print("Dataset Head:")
print(df.head())

# Step 2: Extract relevant columns for PCA (e.g., 'Age', 'Annual Income (k$)', 'Spending Score (1-100)')
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Step 3: Scale the dataset (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Create PCA instance and specify the number of components (e.g., 2 components)
pca = PCA(n_components=2)

# Step 5: Fit and transform the dataset with PCA
X_pca = pca.fit_transform(X_scaled)

# Step 6: Print the transformed dataset and explained variance
print("\nPCA Transformed Data:")
print(X_pca[:5])  # Print first 5 rows of the transformed dataset

print("\nExplained Variance Ratio for Each Principal Component:")
print(pca.explained_variance_ratio_)


Dataset Head:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

PCA Transformed Data:
[[-0.61572002 -1.76348088]
 [-1.66579271 -1.82074695]
 [ 0.33786191 -1.67479894]
 [-1.45657325 -1.77242992]
 [-0.03846521 -1.66274012]]

Explained Variance Ratio for Each Principal Component:
[0.44266167 0.33308378]


### 4. Find the right amount of clusters

Do the same thing as in **Part 1** to generate an elbow plot of the inertia values from the model. When the plot is displayed, set K below to the number of clusters you think will fit best.

In [83]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 1: Generate elbow plot to find the optimal number of clusters
inertia_values = []
k_values = range(1, 11)  # Testing for k = 1 to 10 clusters

# Run K-means for each k and compute inertia
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_pca)  # Use PCA-transformed data
    inertia_values.append(kmeans.inertia_)

# Step 2: Plot the elbow curve
plt.plot(k_values, inertia_values, marker='o')
plt.title('Elbow Plot for Optimal Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()


In [85]:
# After inspecting the elbow plot, set K to the optimal number of clusters
K = 4  # Replace with your chosen value based on the elbow plot

# Step 2: Fit K-means model with chosen K
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10)
kmeans.fit(X_pca)  # Use PCA-transformed data

# Step 3: Get the predicted labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Step 4: Visualize the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title(f'K-means Clustering (k={K})')
plt.legend()
plt.show()


### 5. Visualize and evaluate

Initialize and train your model using `KMeans()` and `kmeans.fit()`. Then extract the labels and centers and plot them using `plt.scatter()`. You can add the argument `s=1` in `plt.scatter()` to make the markers smaller if they are too large to differentiate.

After showing the plot, calculate and print the silhouette score.

In [71]:
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Step 1: Train the K-means model using the chosen K
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10)
kmeans.fit(X_pca)  # Use the PCA-transformed data

# Step 2: Extract the labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Step 3: Visualize the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=1)  # s=1 to make markers smaller
plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red', label='Cluster Centers')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title(f'K-means Clustering (k={K})')
plt.legend()
plt.show()

# Step 4: Evaluate the clustering using silhouette score
silhouette_avg = silhouette_score(X_pca, labels)
print(f"Silhouette Score: {silhouette_avg}")


Silhouette Score: 0.42092860310837976
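
To make the number above easier to interpret, here is a minimal sketch of how a silhouette score is built up. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. The point's score is `(b - a) / max(a, b)`, and the overall score is the mean over all points. The function `silhouette_by_hand` and the toy data are illustrative assumptions, not part of the assignment. The sketch assumes Euclidean distance and at least two clusters.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient, computed directly from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: score stays 0 by convention
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy data: two tight pairs of points that are far apart from each other
toy_X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_by_hand(toy_X, [0, 0, 1, 1])  # pairs grouped correctly
bad = silhouette_by_hand(toy_X, [0, 1, 0, 1])   # pairs split across clusters
print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

Scores near 1 mean points sit much closer to their own cluster than to the nearest other one, scores near 0 mean clusters touch or overlap, and negative scores suggest points are assigned to the wrong cluster.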


### 6. Test more clusters

Use the technique from the end of **Part 1** to see what higher values of K would result in. This dataset is about 20 times larger, so each iteration of the loop (training the model and calculating the score) takes longer, but more clusters are also "allowed" without leaving too few samples per cluster.

First, plot the score for different numbers of clusters:

In [73]:
# Step 1: Test multiple values of K (starting from 5 to higher values)
sc_values = []  # To store silhouette scores
k_values = range(5, 21)  # You can try different ranges, here we try 5 to 20 clusters

# Step 2: Loop over each K value
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_pca)  # Fit the model using PCA-transformed data
    labels = kmeans.labels_
    
    # Calculate silhouette score for each K
    sc_value = silhouette_score(X_pca, labels)
    sc_values.append(sc_value)

# Step 3: Plot silhouette scores for different values of K
plt.plot(k_values, sc_values, marker='o')
plt.title('Silhouette Score vs Number of Clusters')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()

# Optional: Identify the optimal K based on the maximum silhouette score
optimal_k = k_values[sc_values.index(max(sc_values))]
print(f"Optimal number of clusters based on silhouette score: {optimal_k}")


Optimal number of clusters based on silhouette score: 17
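
Because the silhouette score compares every point with every other point, the loop above gets slow as the dataset grows. One option is to score a random subsample using the `sample_size` and `random_state` parameters of `silhouette_score`. The sketch below uses synthetic data from `make_blobs` as a stand-in (an assumption, not the assignment dataset), and the names `X_demo` and `labels_demo` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption: not the assignment dataset)
X_demo, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Cluster the full dataset as usual
labels_demo = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X_demo)

# Score on all points versus a random subsample of 500 points
full_score = silhouette_score(X_demo, labels_demo)
sampled_score = silhouette_score(X_demo, labels_demo, sample_size=500, random_state=0)

print(f"Full: {full_score:.3f}  Sampled: {sampled_score:.3f}")
```

The sampled score is an estimate, so small differences between neighbouring values of K should not be over-interpreted when sampling is used.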


### 7. Visualize more clusters

Select a value for K other than the one you set right after the elbow plot.

Train the model and plot the clusters to see the results.

In [77]:
# Step 1: Select a new value for K (other than the one chosen after the elbow plot)
K_new = 6  # You can change this value to a different one than the initial choice from the elbow plot

# Step 2: Train the KMeans model with the new value of K
kmeans_new = KMeans(n_clusters=K_new, random_state=0, n_init=10)
kmeans_new.fit(X_pca)  # Fit the model using PCA-transformed data

# Step 3: Extract the labels and cluster centers
labels_new = kmeans_new.labels_
centers_new = kmeans_new.cluster_centers_

# Step 4: Plot the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_new, cmap='viridis', s=10)  # Smaller points
plt.scatter(centers_new[:, 0], centers_new[:, 1], marker='x', color='red', label='Cluster Centers')
plt.title(f'K-means Clustering (k={K_new})')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
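
When trying higher values of K, it can help to check that no cluster ends up with too few samples, as mentioned in step 6. A minimal sketch using `np.bincount` is shown below. The helper `cluster_sizes` is hypothetical and the labels are made up for illustration.

```python
import numpy as np

def cluster_sizes(labels):
    """Count how many samples fall in each cluster label (0..K-1)."""
    return np.bincount(np.asarray(labels))

# Made-up labels for illustration only
example_labels = [0, 2, 1, 2, 2, 0, 1, 2]
sizes = cluster_sizes(example_labels)

print("Samples per cluster:", sizes)
print("Smallest cluster:", sizes.min())
```

In the notebook, calling `cluster_sizes(labels_new)` would give the sizes for the model trained above.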


## Complete!

Submit your work by pushing the changes to GitHub, inviting the teacher(s) to your repository, and submitting the repository link on ItsLearning under Assignment 4.