# Data mining

# Lesson 2

# Data clustering using different algorithms

### **Objective:**
To learn and implement various clustering techniques for working with real data. Students will be able to understand how the algorithms divide data into clusters in different ways and where they can be applied.

### **Description**

Clustering is the task of dividing a set of objects into groups (clusters) so that objects within the same cluster are as similar to each other as possible, while objects from different clusters are different. In this lab activity, students will study several clustering methods to help them understand their applicability, features, and drawbacks. We will use several clustering algorithms such as KMeans, DBSCAN, Gaussian Mixture Models (GMM), hierarchical clustering, and the Girvan-Newman algorithm for working with graphs.
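
Hierarchical clustering is listed above but has no exercise of its own, so here is a minimal sketch of what it could look like with scikit-learn's `AgglomerativeClustering`. The synthetic points from `make_blobs`, the choice of three clusters, and Ward linkage are all assumptions for illustration; they are not taken from the lab's data.csv.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Assumption: synthetic 2-D points stand in for the lab's (latitude, longitude) pairs
demo_coords, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

# Agglomerative (bottom-up) clustering: every point starts as its own cluster,
# and the two closest clusters are merged until n_clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(demo_coords)

# Cluster sizes, indexed by cluster label
print(np.bincount(agg_labels))
```

Unlike KMeans, this approach has no centroids and no random initialization; the result depends on the linkage criterion instead.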

### Libraries that we use:

- [Pandas](https://pandas.pydata.org/) - a library for working with tabular data, which will help us in the data preparation phase.
- [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) - for data visualization and identifying interesting patterns.
- [Scikit-learn](https://scikit-learn.org/stable/) - a machine learning library for building and evaluating models.
- [NetworkX](https://networkx.org/) - a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
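
The Girvan-Newman algorithm mentioned in the description works on graphs rather than coordinates: it repeatedly removes the edge with the highest betweenness until the graph splits into communities. A minimal sketch with NetworkX follows; the built-in karate club graph is an assumption chosen for illustration and is not part of the lab's data.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Assumption: a small built-in example graph, not the lab's dataset
graph = nx.karate_club_graph()

# girvan_newman returns an iterator; each item is a tuple of node sets
# describing the communities after one more split
communities_iter = girvan_newman(graph)
first_split = next(communities_iter)

communities = [sorted(community) for community in first_split]
for index, members in enumerate(communities):
    print(f"Community {index}: {len(members)} nodes")
```

Calling `next(communities_iter)` again would yield the next, finer partition with one more community.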

### Structure of the laboratory work:

- We have geospatial data (point coordinates with an activity intensity value) and want to group the points into clusters using KMeans, DBSCAN, and GMM, then compare the results.

Our **data.csv** has the following columns:

    "latitude",
    "longitude",
    "activity_intensity"

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data (read_csv already returns a DataFrame)
geo_data = pd.read_csv('data.csv')
# Preview the first rows
print(geo_data.head())


## **Exercise 1:** Clustering using KMeans
- Apply KMeans to cluster geospatial data into k clusters.
- Experiment with different values of k and use the elbow method to determine the optimal number of clusters.
- Visualize the clusters and centroids.

In [None]:
from sklearn.cluster import KMeans

coordinates = geo_data[['latitude', 'longitude']].values
# Find the optimal number of clusters using the elbow method
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(coordinates)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 5))
plt.plot(range(1, 10), inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

# Apply KMeans with the optimal number of clusters (e.g., k=3)
kmeans = KMeans(n_clusters=3, random_state=42)
geo_data['kmeans_cluster'] = kmeans.fit_predict(coordinates)

# Visualize KMeans clusters
plt.figure(figsize=(10, 6))
plt.scatter(geo_data['longitude'], geo_data['latitude'], c=geo_data['kmeans_cluster'], cmap='Set2', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 0], color='red', marker='X', s=100, label='Centroids')
plt.title('KMeans Clustering of Geospatial Data')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.show()
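
The elbow in the inertia curve can be hard to read by eye, so a second criterion is often useful. The sketch below computes the silhouette score for several values of k. It runs on synthetic points from `make_blobs`, which is an assumption standing in for the `coordinates` array from data.csv; `silhouette_by_k` and `best_k` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic stand-in for the (latitude, longitude) pairs in data.csv
demo_coords, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

silhouette_by_k = {}
for k in range(2, 10):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(demo_coords)
    silhouette_by_k[k] = silhouette_score(demo_coords, labels)

# Higher is better: points are close to their own cluster and far from the others
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(f"k with the highest silhouette score: {best_k}")
```

If the silhouette score and the elbow point to different values of k, it is worth visualizing both clusterings before choosing.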


## **Exercise 2:** Clustering using DBSCAN
- Apply DBSCAN to group densely packed points.
- Tune the eps and min_samples parameters to refine clustering results.
- Identify noise points (if any) and visualize the clusters.

In [None]:
from sklearn.cluster import DBSCAN

# Apply DBSCAN; eps is measured in the same units as the coordinates
dbscan = DBSCAN(eps=0.0015, min_samples=5)  # Adjust eps and min_samples for better clustering
geo_data['dbscan_cluster'] = dbscan.fit_predict(coordinates)

# Visualize DBSCAN clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(geo_data['longitude'], geo_data['latitude'], c=geo_data['dbscan_cluster'], cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Cluster')
plt.title('DBSCAN Clustering of Geospatial Data')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

# Analyze noise points
noise_points = geo_data[geo_data['dbscan_cluster'] == -1]
print(f"Number of noise points: {len(noise_points)}")
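
One common heuristic for tuning `eps` is to look at the distance from each point to its k-th nearest neighbour, with k tied to `min_samples`. The sketch below computes that sorted k-distance curve on synthetic points (an assumption standing in for `coordinates`). Taking the 90th percentile as a starting value is also only an illustrative assumption; in practice the "knee" of the curve is usually read from a plot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Assumption: synthetic stand-in for the (latitude, longitude) pairs in data.csv
demo_coords, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

min_samples = 5
# DBSCAN counts the point itself toward min_samples, and kneighbors on the
# training data also returns each point as its own nearest neighbour (distance 0)
nn = NearestNeighbors(n_neighbors=min_samples).fit(demo_coords)
distances, _ = nn.kneighbors(demo_coords)

# Distance to the farthest of the min_samples neighbours, sorted ascending
k_distances = np.sort(distances[:, -1])

# Hypothetical starting point for eps; refine it by inspecting the curve
eps_candidate = np.percentile(k_distances, 90)
print(f"Candidate eps: {eps_candidate:.4f}")
```

Plotting `k_distances` with `plt.plot` shows a mostly flat curve that rises sharply near the end; values of `eps` around that rise tend to separate dense regions from noise.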


## **Exercise 3:** Clustering using GMM
- Apply GMM for probabilistic clustering of geospatial data.
- Analyze the cluster probabilities and assign points to the cluster with the highest probability.
- Visualize the results with a probabilistic heatmap.

In [None]:
from sklearn.mixture import GaussianMixture

# Apply GMM with 3 clusters
gmm = GaussianMixture(n_components=3, random_state=42)
geo_data['gmm_cluster'] = gmm.fit_predict(coordinates)

# Visualize GMM clusters
plt.figure(figsize=(10, 6))
plt.scatter(geo_data['longitude'], geo_data['latitude'], c=geo_data['gmm_cluster'], cmap='Dark2', alpha=0.6)
plt.title('GMM Clustering of Geospatial Data')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

# Visualize probabilistic cluster memberships, one panel per component
# (drawing every component on the same axes would hide all but the last one)
probs = gmm.predict_proba(coordinates)
fig, axes = plt.subplots(1, probs.shape[1], figsize=(15, 5), sharex=True, sharey=True)
for i, ax in enumerate(axes):
    sc = ax.scatter(geo_data['longitude'], geo_data['latitude'], c=probs[:, i], cmap='viridis', vmin=0, vmax=1, alpha=0.6)
    ax.set_title(f'Cluster {i}')
    ax.set_xlabel('Longitude')
axes[0].set_ylabel('Latitude')
fig.colorbar(sc, ax=axes, label='Probability')
fig.suptitle('GMM Probabilistic Heatmap')
plt.show()
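
To make the link between soft and hard assignments explicit, the sketch below takes the most probable component for each point and flags points whose top probability is low. The synthetic data and the 0.9 confidence threshold are assumptions for illustration; `gmm_demo`, `probs_demo`, and `uncertain` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumption: synthetic stand-in for the (latitude, longitude) pairs in data.csv
demo_coords, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

gmm_demo = GaussianMixture(n_components=3, random_state=42).fit(demo_coords)
probs_demo = gmm_demo.predict_proba(demo_coords)  # shape: (n_samples, n_components)

# Hard assignment: the component with the highest membership probability
hard_labels = probs_demo.argmax(axis=1)

# Points whose best probability is below a hypothetical threshold sit between clusters
confidence = probs_demo.max(axis=1)
uncertain = confidence < 0.9
print(f"Uncertain points: {int(uncertain.sum())} of {len(demo_coords)}")
```

The `hard_labels` array matches what `predict` returns, so the probabilities add information only for the borderline points.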


## Conclusion:

In this lab we:

- Described each clustering algorithm and its characteristics.
- Visualized the clustering results for KMeans, DBSCAN, and GMM.
- Analyzed the differences between the clustering methods, including noise handling, cluster shapes, and performance.
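
One way to put numbers on the differences between methods is to measure how much their label assignments agree. The sketch below uses the adjusted Rand index on a synthetic two-moons dataset, which is an assumption chosen because its non-convex shapes tend to separate the methods. All parameter values are illustrative and are not tuned for data.csv.

```python
from itertools import combinations

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Assumption: synthetic non-convex clusters, not the lab's dataset
demo_points, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = {
    'kmeans': KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(demo_points),
    'dbscan': DBSCAN(eps=0.2, min_samples=5).fit_predict(demo_points),
    'gmm': GaussianMixture(n_components=2, random_state=42).fit_predict(demo_points),
}

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean chance-level agreement
agreement = {}
for first, second in combinations(labels, 2):
    agreement[(first, second)] = adjusted_rand_score(labels[first], labels[second])
    print(f"{first} vs {second}: ARI = {agreement[(first, second)]:.3f}")

# DBSCAN is the only one of the three that can label points as noise (-1)
dbscan_noise = int((labels['dbscan'] == -1).sum())
print(f"DBSCAN noise points: {dbscan_noise}")
```

The same comparison can be run on the `kmeans_cluster`, `dbscan_cluster`, and `gmm_cluster` columns created in the exercises.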


