 <h1>Problem Statement</h1>

 <p>Organizing and participating in effective study sessions for the data community is challenging due to differing schedules, academic backgrounds, and geographical locations. Individuals often struggle to find like-minded peers for collaboration, resulting in missed opportunities for learning and knowledge exchange. The MVP aims to address these challenges by providing an Intelligent Session Matching system that connects users based on common availability and interests.</p>

 <h1>Solution</h1>

 <p>The goal is to develop a <strong>K-Means-based model</strong> that groups students by their similarities (availability, skills, location, etc.) and assigns them to groups for study sessions. This approach streamlines the process of matching users for study groups.</p>

# Dataset

 The dataset used to train this model has the following fields:

 1. **UserID**: Uniquely identifies each user. Not used for K-Means but helps distinguish users.

 2. **Latitude**: Represents the user's geographic latitude, used to match users who are close to each other.

 3. **Longitude**: Represents the user's geographic longitude, also used for proximity-based matching.

 4. **Availability (Hour_0 to Hour_23)**: Binary values (0 or 1) representing the user's hourly availability.

 5. **Days_Available (Monday to Sunday)**: Binary values indicating which days of the week the user is available.

 6. **Skill_Level**: Indicates the user's skill level (e.g., Beginner, Intermediate, Advanced).

 7. **Preferred_Group_Size**: User's preference for group size (Small, Medium, Large).

 8. **Topics of Interest**: One-hot encoded columns representing the user's areas of interest (e.g., Python, Machine Learning, etc.).
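
 The field list above can be turned into a quick schema check before any preprocessing. The following is a minimal sketch, not part of the original pipeline: the `Hour_0` to `Hour_23` and weekday names follow the descriptions above, the topic columns are the ones referenced later in the variance check, and the exact column spellings in `data.csv` are an assumption. The helper `check_schema` and the single demo row are hypothetical.

```python
import pandas as pd

# Column names assumed from the dataset description; adjust to match data.csv.
hour_cols = [f"Hour_{h}" for h in range(24)]
day_cols = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
topic_cols = ["Big Data", "Data Analysis", "Machine Learning", "Python", "SQL", "Statistics"]
expected_cols = ["UserID", "Latitude", "Longitude", *hour_cols, *day_cols,
                 "Skill_Level", "Preferred_Group_Size", *topic_cols]

def check_schema(df):
    """Return (missing columns, binary columns holding values other than 0/1)."""
    missing = [c for c in expected_cols if c not in df.columns]
    binary_cols = [c for c in hour_cols + day_cols + topic_cols if c in df.columns]
    non_binary = [c for c in binary_cols if not df[c].isin([0, 1]).all()]
    return missing, non_binary

# One made-up user, for illustration only.
row = {c: 0 for c in expected_cols}
row.update({"UserID": 1, "Latitude": 6.52, "Longitude": 3.38, "Hour_18": 1,
            "Monday": 1, "Skill_Level": "Beginner",
            "Preferred_Group_Size": "Small", "Python": 1})
demo = pd.DataFrame([row])
print(check_schema(demo))
```

 Running the same check on the loaded `data` frame would flag missing columns or stray values before they reach K-Means.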

In [None]:
import pandas as pd
data = pd.read_csv("./data/data.csv")
print(data.head())


# Data Preprocessing

 To prepare the dataset for the K-Means algorithm, we need to:

 1. **Convert categorical features**: Categorical data such as `Skill_Level` and `Preferred_Group_Size` should be encoded into numeric values.

 2. **Normalize numerical features**: Numerical features with large values (e.g., Latitude and Longitude) should be normalized to avoid bias in the distance calculations used in K-Means.

# Categorical Data Conversion

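 We use `LabelEncoder` to convert categorical columns like `Skill_Level` and `Preferred_Group_Size` to numerical form so that K-Means can work with them. Note that `LabelEncoder` assigns integer codes in alphabetical order of the category names, so the codes do not necessarily follow the natural order of the levels (for example, `Advanced` is coded below `Beginner`).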

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Convert categorical data into numeric form
label_encoder = LabelEncoder()

data['Skill_Level'] = label_encoder.fit_transform(data['Skill_Level'])
data['Preferred_Group_Size'] = label_encoder.fit_transform(data['Preferred_Group_Size'])
print(data.head())
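
 Since K-Means works with distances, one possible alternative is an explicit mapping that makes the integer codes follow the natural rank of each category. The sketch below is an illustration under assumptions: the category names are taken from the examples in the dataset description, their exact spelling in `data.csv` is not confirmed, and the three sample rows are made up.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from data.csv.
sample = pd.DataFrame({
    "Skill_Level": ["Beginner", "Advanced", "Intermediate"],
    "Preferred_Group_Size": ["Large", "Small", "Medium"],
})

# Explicit orderings (assumed spellings) so numeric distance follows rank.
skill_order = {"Beginner": 0, "Intermediate": 1, "Advanced": 2}
size_order = {"Small": 0, "Medium": 1, "Large": 2}

encoded = sample.copy()
encoded["Skill_Level"] = encoded["Skill_Level"].map(skill_order)
encoded["Preferred_Group_Size"] = encoded["Preferred_Group_Size"].map(size_order)

# Values missing from a mapping become NaN, which makes typos easy to spot.
assert not encoded.isna().any().any()
print(encoded)
```

 With this encoding, a Beginner is one step from Intermediate and two steps from Advanced, which is the relationship the distance metric is expected to see.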


# Normalization of Numeric Features

 We use `StandardScaler` to normalize numerical features like `Latitude` and `Longitude` to ensure that the distance metric used by K-Means is not biased toward larger values.

In [None]:
scaler = StandardScaler()
numeric_features = ['Latitude', 'Longitude']
data[numeric_features] = scaler.fit_transform(data[numeric_features])

# Check the processed data
print(data.head())


# Dropping Unnecessary Columns

 We drop columns such as `UserID` that are not needed for clustering. K-Means relies on numerical data, and `UserID` is just an identifier.

In [None]:
data = data.drop("UserID",axis=1)
print(data.head())


# Train-Test Split

 We split the dataset into a training set and a test set with 20% of the data being used for testing. This ensures we can validate the model after training.

In [None]:
from sklearn.model_selection import train_test_split
# No target labels are needed for clustering, so only the feature table is split
X_train, X_test = train_test_split(data, test_size=0.20, random_state=42)
print(X_train.head())
print(X_test.head())


# Finding the Optimal `k` Using the Elbow Method

 The Elbow Method is used to determine the optimal number of clusters (`k`). By plotting the sum of squared errors (SSE) for different values of `k`, we can identify where the SSE starts to level off, indicating the optimal number of clusters.

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Use the Elbow Method to find the optimal number of clusters
sse = []  # Sum of squared errors
k_range = range(1, 100)  # Test values of k from 1 to 99

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(data)
    sse.append(kmeans.inertia_)  # Inertia is the sum of squared distances to the nearest cluster center

# Plot the Elbow curve
plt.plot(k_range, sse, "bx-")
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of squared errors (SSE)')
plt.title('Elbow Method for Optimal k')
plt.show()
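
 Reading the elbow off a plot is subjective, so it can help to locate it programmatically. The sketch below uses one common heuristic (not the only one, and not something the original analysis used): pick the point farthest from the straight line joining the first and last points of the rescaled curve. The function name `find_elbow` and the SSE values are made up for illustration.

```python
import numpy as np

def find_elbow(k_values, sse_values):
    """Return the k whose point lies farthest from the chord joining the curve's endpoints."""
    k = np.asarray(k_values, dtype=float)
    s = np.asarray(sse_values, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    k_norm = (k - k.min()) / (k.max() - k.min())
    s_norm = (s - s.min()) / (s.max() - s.min())
    # Perpendicular distance of every point from the first-to-last chord.
    x0, y0, x1, y1 = k_norm[0], s_norm[0], k_norm[-1], s_norm[-1]
    dist = np.abs((y1 - y0) * k_norm - (x1 - x0) * s_norm + x1 * y0 - y1 * x0)
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(k[np.argmax(dist)])

# Made-up SSE values with a visible bend at k = 4.
example_k = list(range(1, 9))
example_sse = [1000, 600, 350, 150, 130, 115, 105, 100]
print(find_elbow(example_k, example_sse))
```

 In the notebook this would be called as `find_elbow(list(k_range), sse)`; the result is a suggestion to compare against the plot, not a replacement for inspecting it.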


# Validating Clustering with Silhouette Score

 The **Silhouette Score** is a metric used to evaluate how well the clustering is performing. It ranges from -1 to 1: values near 1 mean points sit close to their own cluster and far from neighboring ones, while values near 0 indicate overlapping clusters. We can try multiple values of `k` to find the one that maximizes the silhouette score.

In [None]:
from sklearn.metrics import silhouette_score

range_n_clusters = range(2, 100)
silhouette_avg = []

for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)  # Fixed seed for reproducible scores
    kmeans.fit(data)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(data, cluster_labels))

plt.plot(range_n_clusters, silhouette_avg, "bx-")
plt.xlabel("Values of K") 
plt.ylabel("Silhouette score") 
plt.title("Silhouette analysis For Optimal k")
plt.show()
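
 The best `k` by silhouette score can also be read directly from the scores rather than from the plot. The following is a minimal, self-contained sketch on synthetic data: `make_blobs` with four hand-picked centers stands in for the user table, so the numbers say nothing about the real dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the user table: four well-separated blobs.
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

# Score each candidate k and keep the results in a dict keyed by k.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

 Applied to the notebook's variables, the equivalent would be `range_n_clusters[silhouette_avg.index(max(silhouette_avg))]`.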


# Building the K-Means Model with Selected `k`

 Based on the Elbow Method and Silhouette Score, we choose an optimal value for `k` and apply the K-Means algorithm.


In [None]:
# Set the optimal number of clusters
k_optimal = 75  # Use the best value from the analysis
kmeans = KMeans(n_clusters=k_optimal, random_state=42)
X_train = X_train.copy()  # Work on a copy to avoid pandas' SettingWithCopyWarning
X_train['Cluster'] = kmeans.fit_predict(X_train)  # Assign a cluster label to each user

# Check the cluster assignments
print(X_train['Cluster'].head())
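
 The held-out test split can be used to check that the fitted model assigns unseen users to sensible clusters. The sketch below shows the idea on synthetic data, under the assumption that comparing silhouette scores on the two splits is an acceptable sanity check; the blob data, `k = 4`, and the variable names are illustrative, not the project's actual values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed user features.
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X_demo, _ = make_blobs(n_samples=500, centers=centers, cluster_std=0.7, random_state=42)
train_demo, test_demo = train_test_split(X_demo, test_size=0.20, random_state=42)

# Fit on the training split only, then assign held-out rows to the nearest centroid.
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(train_demo)
test_labels = model.predict(test_demo)

train_score = silhouette_score(train_demo, model.labels_)
test_score = silhouette_score(test_demo, test_labels)
print(f"train silhouette: {train_score:.3f}, test silhouette: {test_score:.3f}")
```

 A test score far below the training score would suggest the clusters do not generalize to new users. In the notebook, the same `predict` call on `X_test` would also be the basis for placing a newly registered user into an existing group.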


# Checking Variance Within Each Cluster

We calculate the variance within each cluster to ensure that users within the same cluster are similar in terms of their features (e.g., skill level, preferred group size).

In [None]:
feature_cols = ['Skill_Level', 'Preferred_Group_Size', 'Big Data', 'Data Analysis', 'Machine Learning', 'Python', 'SQL', 'Statistics']
variances = []
for cluster_id in sorted(X_train['Cluster'].unique()):
    cluster_data = X_train[X_train['Cluster'] == cluster_id]
    # ddof=0 gives a variance of 0 (not NaN) for clusters with a single member
    variance = cluster_data[feature_cols].var(ddof=0).mean()
    assert variance < 1, f"High variance found in cluster {cluster_id}"  # Adjust threshold as necessary
    variances.append(variance)
    print(variance)
print(sum(variances) / len(variances))  # Mean variance across the clusters actually present
print("Within Cluster Variance Test: PASS")


# Visualizing Clusters Using PCA

 To visualize the clustering results, we use **Principal Component Analysis (PCA)** to reduce the dataset to two dimensions and plot the clusters.

In [None]:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce dimensionality to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train.drop(columns=['Cluster']))

# Plot the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=X_train['Cluster'], cmap='viridis')
plt.title('K-Means Clustering of Users')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
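
 A two-component projection can hide much of the structure in a high-dimensional dataset, so it is worth checking how much variance the plot actually represents. The sketch below illustrates `explained_variance_ratio_` on a random matrix; the synthetic data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic 10-feature matrix standing in for the training features.
X_demo = rng.normal(size=(200, 10))
X_demo[:, 0] *= 5.0  # Give one feature a much larger variance than the rest

pca_demo = PCA(n_components=2).fit(X_demo)
ratios = pca_demo.explained_variance_ratio_
# Fraction of total variance captured by each component, and by both together.
print(ratios, ratios.sum())
```

 In the notebook, `pca.explained_variance_ratio_.sum()` gives the same figure for the real projection. A low value means that clusters overlapping in the scatter plot may still be well separated in the full feature space.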



# Saving the Clustered Data

 We save the training dataset with the assigned cluster labels to a CSV file for further analysis or use in the session matching application.

In [None]:
X_train.to_csv("./data/output.csv")

# Conclusion
In this project, we developed a K-Means based model to cluster users for organizing study groups based on their availability, skill level, location, and interests. We used the Elbow Method and Silhouette Score to determine the optimal number of clusters (`k`), validated the results by checking the within-cluster variance, and visualized the clusters using PCA.

Further improvements can include:
1. Testing with real-world data to check and improve clustering quality.
2. Refining the model by adding additional features.
3. Automating the session-matching process based on the user's nearest cluster.

This solution helps in efficiently organizing users into study groups, facilitating collaboration and learning.
