

# **IFI8410: Programming for Business**



## **Assignment 11**

In [None]:
# Do not change the content of this cell. Execute this cell first, and every time after you restart the kernel.
%reload_ext autoreload
%autoreload 2




## Clustering methods.

### Problem 11.1: K-Means Clustering with Elbow Method

**Problem Statement:**

You are tasked with clustering data points into groups using the K-Means clustering algorithm. Your implementation should also identify the optimal number of clusters using the Elbow Method and return the optimal k along with the coordinates of one of the centroids.


**We generate a data set for clustering:**

Use `sklearn.datasets.make_blobs` to generate a synthetic dataset:

- n_samples=300 (300 data points).

- n_features=2 (2 features).

- centers=5 (5 distinct cluster centers).

- cluster_std=1.0 (standard deviation for clusters).

- random_state=42 (seed for reproducibility).


**Tasks:**

**1. Apply the K-Means Algorithm:**

- Use K-Means clustering from sklearn.cluster.KMeans to group the data.

- Train the model for k values ranging from 1 to 10 and record the Within-Cluster Sum of Squares (WCSS) for each k.

- Use n_init=1.

- Use random_state=42 (fixed seed for reproducibility).
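
The WCSS recorded in this step is what scikit-learn exposes as the fitted model's `inertia_` attribute. The following is a minimal sketch on made-up toy data (the array `X_toy` is an assumption for illustration, not the assignment dataset) that recomputes the value by hand to show what it measures.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (illustrative only): two tight groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

model = KMeans(n_clusters=2, n_init=1, random_state=42).fit(X_toy)

# WCSS by hand: squared distance from each point to its assigned centroid.
assigned = model.cluster_centers_[model.labels_]
wcss_manual = ((X_toy - assigned) ** 2).sum()

print(model.inertia_, wcss_manual)   # the two values should agree
print(model.cluster_centers_[0])     # centroid of the cluster labelled 0
```

Row `i` of `cluster_centers_` belongs to the cluster with label `i`, so index 0 gives the centroid of Cluster 0.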


**2. Determine the Optimal Number of Clusters:**

- Plot the WCSS against k values to find the "elbow point," which indicates the optimal k.

- Select the value of k where the WCSS stops decreasing significantly.
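
One possible way to turn "stops decreasing significantly" into a rule is sketched here. It assumes a maximum-distance-to-chord heuristic, which the assignment does not prescribe; the function name `elbow_by_max_distance` and the WCSS numbers are made up for illustration. Reading the elbow off the plot by eye remains a valid approach.

```python
import numpy as np

def elbow_by_max_distance(ks, wcss):
    """Return the k whose (k, WCSS) point lies farthest from the chord
    joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (wcss - wcss[-1]) / (wcss[0] - wcss[-1])
    # After rescaling, the chord runs from (0, 1) to (1, 0): the line x + y = 1.
    distance = np.abs(x + y - 1) / np.sqrt(2)
    return int(ks[np.argmax(distance)])

# Made-up WCSS values for illustration only (not computed from the assignment data).
ks = list(range(1, 11))
wcss = [1000, 620, 380, 210, 60, 52, 46, 41, 37, 34]
print(elbow_by_max_distance(ks, wcss))
```

The heuristic assumes the WCSS values decrease from the first k to the last; on a curve without a clear bend it still returns some k, so the plot is worth checking alongside it.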

**3. Return Results:**

- Train the K-Means model using the optimal k.

- Return the optimal k and the coordinates of the centroid for one of the clusters (e.g., Cluster 0).


**Output**: The function should return:

- The optimal number of clusters (k).
  
- The coordinates of the centroid for Cluster 0.

In [None]:
# Generate synthetic dataset
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans_with_elbow_method(X):
    
    # Write your code here





In [None]:
# Example usage
optimal_k, centroid = kmeans_with_elbow_method(X)
print(f"Optimal k: {optimal_k}")
print(f"Centroid (Cluster 0): {centroid}")

#### Save your solution to a file ...

In [None]:
%%writefile solution_11_1.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans_with_elbow_method(X):



    


#### Test 11.1 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 1

### Problem 11.2: K-Means vs. K-Means++ Clustering Comparison

**Problem Statement:**

Your task is to implement clustering on a synthetic dataset using both K-Means and K-Means++ initialization methods. The dataset will consist of distinct clusters generated using sklearn.datasets.make_blobs. Perform multiple runs to highlight the differences in centroid positions due to initialization randomness.


**We generate a data set for clustering:**

Use `sklearn.datasets.make_blobs` to create a dataset with:

- n_samples=300 (300 data points).

- n_features=2 (2 features).

- centers=5 (5 distinct cluster centers).

- cluster_std=1.0 (standard deviation for clusters).

- random_state=42 (fixed seed for reproducibility).


**Tasks:**

**1. Apply K-Means Clustering:**

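- Train the K-Means algorithm using two different initialization methods:

    * random: Initial centroids are picked at random from the data points.

    * k-means++: Initial centroids are chosen to be spread apart; each new centroid is sampled with probability proportional to its squared distance from the centroids already chosen.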

- Fix the number of clusters (k=5).

- Perform multiple runs (e.g., 10 runs) for each method to observe variations in centroid positions.

- Use n_init=1.

- Use random_state=42 (fixed seed for reproducibility).
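
To make the difference between the two initialization methods concrete, here is a minimal NumPy sketch of the basic k-means++ seeding idea. It is not scikit-learn's implementation, which differs in detail; the function name `kmeans_pp_seed` and the toy data are assumptions for illustration.

```python
import numpy as np

def kmeans_pp_seed(X, k, rng):
    """Pick k initial centers from the rows of X using squared-distance sampling."""
    X = np.asarray(X, dtype=float)
    # First center: one data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to d2,
        # so points far from the existing centers are favored.
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centers.append(X[idx])
    return np.array(centers)

rng = np.random.default_rng(0)
# Toy data (illustrative only): three tight, well-separated groups.
X_toy = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])
seeds = kmeans_pp_seed(X_toy, k=3, rng=rng)
print(seeds)
```

A point that is already a center has squared distance zero and therefore cannot be drawn again, whereas purely random initialization can place several starting centroids inside the same group.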

**2. Compare Centroid Convergence:**

- Record the positions of centroids after each run for both methods.

- Identify if centroids consistently converge to the same positions with k-means++.
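
Cluster labels are arbitrary, so two runs can find the same centroids in a different order. The sketch below shows one way to compare runs regardless of order; the helper names `canonical_order` and `same_centroids`, the tolerance, and the example coordinates are all assumptions for illustration.

```python
import numpy as np

def canonical_order(centroids):
    """Sort centroid rows by x, then by y, so label order no longer matters."""
    c = np.asarray(centroids, dtype=float)
    return c[np.lexsort((c[:, 1], c[:, 0]))]

def same_centroids(a, b, tol=1e-6):
    """True if two runs found the same centroids, up to relabelling."""
    return np.allclose(canonical_order(a), canonical_order(b), atol=tol)

run_a = [[0.0, 1.0], [4.0, 5.0], [8.0, 2.0]]
run_b = [[8.0, 2.0], [0.0, 1.0], [4.0, 5.0]]   # same centroids, different label order
run_c = [[0.0, 1.0], [4.0, 5.0], [7.0, 3.0]]   # one centroid differs

print(same_centroids(run_a, run_b))
print(same_centroids(run_a, run_c))
```

Sorting by coordinate can misalign rows when two centroids have nearly identical x values, so treat this as a quick check rather than a robust matching procedure.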

**3. Output Results:**

- Return the final centroid positions for one specific run for both methods.

- Highlight the differences in convergence patterns between random and k-means++.


**Output**:

- Centroid positions for both K-Means (random) and K-Means++ (k-means++) for a specific run.

- Visualization of centroid positions across multiple runs for both methods.


In [None]:
# Generate synthetic dataset
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans_vs_kmeanspp(X):

    # Write your code here




In [None]:
# Example usage
random_centroids, kmeanspp_centroids = kmeans_vs_kmeanspp(X)
print(f"Random Initialization Centroids (Run 1): {random_centroids}")
print(f"K-Means++ Initialization Centroids (Run 1): {kmeanspp_centroids}")

#### Save your solution to a file ...

In [None]:
%%writefile solution_11_2.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans_vs_kmeanspp(X):





#### Test 11.2 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 2

### Problem 11.3: Agglomerative Hierarchical Clustering (AHC) on Concentric Rings

**Description**:

Cluster data arranged in concentric rings using Agglomerative Hierarchical Clustering (AHC) with three linkage methods. Visualize and save the resulting clusters as .png images for comparison.


**We generate a data set for clustering:**

Generate a dataset with concentric rings using `sklearn.datasets.make_circles`:

- n_samples=500: 500 points.

- noise=0.05: Adds slight noise.

- factor=0.5: Scale factor between the inner and outer circles (the inner circle has half the radius of the outer one).


**Tasks:**

Perform AHC using `sklearn.cluster.AgglomerativeClustering` with the following linkage methods:

- Single Linkage

- Complete Linkage

- Average Linkage

For each linkage method:

- Plot and label the clusters.

- Save the plot as a **.png** file.


**Submission**:

Submit the **.png** images (**ahc_single.png, ahc_complete.png, ahc_average.png**) showcasing the clusters for each linkage method.
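
As a rough illustration of how the linkage method is selected, the sketch below runs `AgglomerativeClustering` with each of the three linkages on a smaller demo dataset and prints the cluster sizes. The choice of `n_clusters=2`, the 200-point sample size, and the variable names are assumptions for this sketch; plotting and saving are left out.

```python
from collections import Counter

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles

# Smaller demo dataset (200 points is an arbitrary choice for this sketch).
X_demo, _ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=42)

labels_by_linkage = {}
for linkage in ("single", "complete", "average"):
    # n_clusters=2 is assumed here because the data has two rings.
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels_by_linkage[linkage] = model.fit_predict(X_demo)
    # Cluster sizes give a first hint of how each linkage split the data.
    print(linkage, Counter(labels_by_linkage[linkage].tolist()))
```

Cluster sizes alone do not show whether the rings were separated correctly; the scatter plots requested above are what make the differences between linkages visible.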

In [None]:
# Generate concentric rings dataset
from sklearn.datasets import make_circles
X, _ = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

def plot_ahc_clustering(X):
    
    # Write your code here





In [None]:
# Example usage
plot_ahc_clustering(X)

#### Save your solution to a file ...

In [None]:
%%writefile solution_11_3.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

def plot_ahc_clustering(X):





#### Test 11.3 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 3

### Problem 11.4: DBSCAN Clustering with Blobs and Rings

**Problem Statement:**

You are tasked with clustering data points using the DBSCAN clustering algorithm. Your implementation should apply DBSCAN to two different datasets and visualize the results.


**We generate two data sets for clustering:**

Dataset 1: Random blobs (clusters of varying densities).

- Use `make_blobs` to create 3 clusters with varying densities.

- Parameters:

    * n_samples=500 (500 data points)

    * centers=3 (3 distinct cluster centers)

    * cluster_std=[1.0, 2.5, 0.5] (standard deviations for clusters)

    * random_state=42 (for reproducibility)

Dataset 2: Concentric rings.

- Use make_circles to generate two concentric rings.
  
- Parameters:
    * n_samples=500 (500 data points)
      
    * factor=0.5 (scale factor between the inner and outer rings)
      
    * noise=0.05 (adds noise to the data)


**Tasks:**

1. Apply the **DBSCAN** Algorithm:
   
- Use DBSCAN from `sklearn.cluster.DBSCAN` for clustering.
- Train DBSCAN on each dataset with different values for eps and min_samples:
    * eps: The maximum distance between two points to consider them neighbors.
    * min_samples: The minimum number of points required to form a dense cluster.

2. Visualize the Results:
   
- Plot the clustering results for both datasets in a 2x1 grid.
- Save the plot as a **.png** image.
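
Unlike K-Means, DBSCAN does not take a number of clusters; it labels points it considers noise with `-1`. The sketch below shows how `eps` and `min_samples` play out on a tiny made-up dataset. The toy points and the parameter values are assumptions chosen for illustration, not recommended settings for the assignment datasets.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (illustrative only): two tight groups and one isolated point.
X_toy = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
                  [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
                  [10.0, 0.0]])

# eps and min_samples are arbitrary values picked to suit this toy data.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_toy)

# DBSCAN marks noise points with the label -1, so exclude it when counting clusters.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print(labels, n_clusters, n_noise)
```

Here the two groups form separate clusters and the isolated point is flagged as noise. When plotting, it can help to draw the `-1` points in a distinct color so they are not mistaken for a cluster.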


In [None]:
from sklearn.datasets import make_blobs, make_circles

# Dataset 1: Random blobs
X_blobs, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

# Dataset 2: Concentric rings
X_circles, _ = make_circles(n_samples=500, factor=0.5, noise=0.05)

datasets = {"Blobs": X_blobs, "Rings": X_circles}

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def apply_dbscan_and_plot(datasets):

    # Write your code here

    

    

In [None]:
# Apply DBSCAN and plot results
apply_dbscan_and_plot(datasets)

#### Save your solution to a file ...

In [None]:
%%writefile solution_11_4.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def apply_dbscan_and_plot(datasets):





#### Test 11.4 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 4

### Problem 11.5: Comparison of Clustering Techniques on Complex Data

**Problem Statement:**

You are tasked with comparing six different clustering techniques on a dataset made of two interleaved half-moon shapes. The goal is to understand how each method performs on clusters that are not linearly separable and to visualize the clustering results.


**We generate a data set for clustering:**

- Use `make_moons` from `sklearn.datasets` to generate a dataset with two interleaved half-moon shapes.

- Dataset Parameters:

    * n_samples=500 (500 data points)

    * noise=0.05 (introduce noise for complexity)

    * random_state=42 (ensure reproducibility)


**Tasks**

1. Apply Clustering Techniques:

Apply the following six clustering algorithms to the dataset:

**KMeans++:** 

- Use **KMeans** from sklearn.cluster.
- Initialization method: k-means++.
- Number of clusters: 2.

**Agglomerative Hierarchical Clustering (AHC):**

- Use **AgglomerativeClustering** from sklearn.cluster.
- Number of clusters: 2.

**DBSCAN:**

- Use **DBSCAN** from sklearn.cluster.
- eps=0.2 (radius of neighborhood).
- min_samples=5 (minimum number of points in a neighborhood to form a cluster).

**Spectral Clustering:**

- Use **SpectralClustering** from sklearn.cluster.
- Number of clusters: 2.
- affinity='nearest_neighbors'.

**Gaussian Mixture Model (GMM):**

- Use **GaussianMixture** from sklearn.mixture.
- Number of components: 2.

**BIRCH:**

- Use **Birch** from sklearn.cluster.
- Number of clusters: 2.

2. Visualize Clustering Results: For each clustering method:

- Plot the data points colored by their assigned cluster.
- Display all six clustering results side by side in a 2x3 grid for comparison.
- Save the visualization as **comparison_of_clustering_techniques.png**.
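
The 2x3 layout can be sketched as follows, using placeholder points and random labels rather than real clustering output. The method names, the temporary output path, and the file name `grid_demo.png` are hypothetical; the `Agg` backend line is only there so the sketch runs without a display and can be dropped inside the notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))  # placeholder points, not the half-moon data
# Placeholder labels keyed by hypothetical method names.
demo_results = {f"Method {i + 1}": rng.integers(0, 2, size=60) for i in range(6)}

# One subplot per method, laid out as 2 rows by 3 columns.
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (name, labels) in zip(axes.ravel(), demo_results.items()):
    ax.scatter(X_demo[:, 0], X_demo[:, 1], c=labels, cmap="viridis", s=15)
    ax.set_title(name)
fig.tight_layout()

# Save to a temporary folder; the file name is hypothetical.
out_path = os.path.join(tempfile.mkdtemp(), "grid_demo.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)
```

Iterating over `axes.ravel()` pairs each dictionary entry with one panel in row-major order, which keeps the plotting loop independent of how the labels were produced.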


In [None]:
# Create a dataset with half-moon shapes
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering, Birch
from sklearn.mixture import GaussianMixture

def apply_clustering_methods(X):

    # Write your code here

    
    
    return results


def plot_clustering_results(X, clustering_results):

    # Write your code here




In [None]:
# Apply clustering methods
clustering_results = apply_clustering_methods(X)

# Plot and save results
plot_clustering_results(X, clustering_results)

#### Save your solution to a file ...

In [None]:
%%writefile solution_11_5.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering, Birch
from sklearn.mixture import GaussianMixture

def apply_clustering_methods(X):


    
    
    return results


def plot_clustering_results(X, clustering_results):






#### Test 11.5 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 5

# Run all the tests again ...

In [None]:
! ./test/run_test.sh

# Homework Submission

- This homework is due by **2024-12-04, 6:00 PM (EST)**.

- Make sure that all your programs and output files are in the exact folder as specified in the instructions.

- All file names on this system are case sensitive. If you copy your work from a local computer to your home directory on ARC, verify that the file names still match exactly.

**Execute the cell below to submit your assignment**

In [None]:
! ./submit.sh -y