## Cell 1: Importing Libraries, Loading the Dataset, and Initial EDA

In [None]:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pandas as pd

# Load the dataset
data_path = 'file_goes_here.csv'  # Replace with the path to the CSV file as needed
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
print(df.head())


## Cell 2: Data Preparation for Clustering

In this section, we prepare our dataset for the clustering analysis. The dataset consists of students' academic scores (`Semester_Total`) and their attendance records (`Absence_Percentage`). 

### Feature Selection

We start by extracting the relevant features for clustering:

- `Semester_Total`: The total academic score for the semester, which may reflect the students' academic performance.
- `Absence_Percentage`: The percentage of classes the student has missed, which can be an indicator of engagement or external factors affecting the student's ability to attend.

These two features are chosen because they are likely to provide meaningful insights when distinguishing between different groups of students.

### Standardization

To ensure that each feature contributes equally to the distance calculations during clustering, we standardize the features using `StandardScaler`. This scaling technique transforms our data such that the distribution of each feature has a mean value of 0 and a standard deviation of 1. In mathematical terms, for each feature:

- The mean (μ) is subtracted from each data point.
- The result is then divided by the standard deviation (σ).

This step matters because K-means, the clustering algorithm used below, relies on the Euclidean distance between data points and is therefore sensitive to the scale of each feature. The mean and standard deviation used for each feature are printed after the `Data after Scaling` plot.
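
As a quick check of the formula described above, z = (x − μ) / σ, the following sketch standardizes a small array by hand and compares the result with `StandardScaler`. The values in `X_demo` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: columns are (Semester_Total, Absence_Percentage)
X_demo = np.array([
    [85.0, 5.0],
    [70.0, 12.0],
    [55.0, 30.0],
    [92.0, 2.0],
])

# Manual z-score: subtract the column mean, then divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is also np.std's default.
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)
X_manual = (X_demo - mu) / sigma

# The same transformation done by scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual and StandardScaler results match:", np.allclose(X_manual, X_sklearn))
print("Column means after scaling:", X_manual.mean(axis=0).round(6))
print("Column standard deviations after scaling:", X_manual.std(axis=0).round(6))
```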

### Visualization

After scaling, we visualize the standardized features using a scatter plot. This gives us a preliminary look at our data in the standardized feature space and may reveal any apparent groupings or outliers that exist before we proceed with the clustering algorithm.


In [None]:

# Selecting the relevant features for clustering
X = df[['Semester_Total', 'Absence_Percentage']].values

# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Visualizing the data (Optional)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=50)
plt.title("Data after Scaling")
plt.xlabel('Semester Total (Standardized)')
plt.ylabel('Absence Percentage (Standardized)')
plt.show()

# Print the mean and standard deviation used for scaling
print("Feature Means:", scaler.mean_)
print("Feature Standard Deviations:", scaler.scale_)

# Create a DataFrame from the scaled features
scaled_df = pd.DataFrame(X_scaled, columns=['Semester_Total_Scaled', 'Absence_Percentage_Scaled'])

# Add columns for the mean and standard deviation
scaled_df['Semester_Total_Mean'] = scaler.mean_[0]
scaled_df['Semester_Total_Std'] = scaler.scale_[0]
scaled_df['Absence_Percentage_Mean'] = scaler.mean_[1]
scaled_df['Absence_Percentage_Std'] = scaler.scale_[1]

# Export to CSV
scaled_df.to_csv('scaled_data.csv', index=False)


In [None]:
# Load the new dataset
data_path = 'scaled_data.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
print(df.head())

### Breakdown of Each Column
`Semester_Total_Scaled`: This column contains the standardized values of the `Semester_Total` feature. Because `StandardScaler` was used, the column has a mean of 0 and a standard deviation of 1, and each value is the number of standard deviations by which the original score lies above (positive) or below (negative) the mean.

`Absence_Percentage_Scaled`: Likewise, this column contains the standardized values of the `Absence_Percentage` feature. Standardization puts both features on a common scale without changing the shape of their distributions, which allows a more meaningful comparison between the variables.

`Semester_Total_Mean`: This is the mean of the `Semester_Total` feature before it was scaled. All the values in this column are the same, indicating that the mean value was calculated from the original data and then used to scale the `Semester_Total` feature.

`Semester_Total_Std`: This column shows the standard deviation of the `Semester_Total` feature before scaling. It's used alongside the mean to scale the data. The standard deviation is a measure of the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.

`Absence_Percentage_Mean`: This is the mean of the `Absence_Percentage` feature before it was scaled. Similar to `Semester_Total_Mean`, it's constant across all rows, indicating it was computed from the entire dataset before scaling.

`Absence_Percentage_Std`: This column shows the standard deviation of the `Absence_Percentage` feature before it was scaled. Like the `Semester_Total_Std`, it's a measure of how much the absence percentages vary from the average absence percentage.

### What's Happening with the Data?
**Standardization**: The original `Semester_Total` and `Absence_Percentage` features have been standardized. Standardization typically transforms data to have a mean of 0 and a standard deviation of 1. This process makes the data unitless and brings different variables to a comparable scale, making it easier to apply machine learning algorithms effectively.

**Record of Parameters**: The mean and standard deviation for each feature before scaling are recorded. This is crucial for several reasons:
* Interpretability: You can understand how the data was transformed and interpret the scaled values in the context of the original data distribution.
* Consistency: If you want to apply the same transformation to new data (e.g., during a model deployment phase), you'll need these parameters to ensure that the scaling is consistent with the original model training data.
* Reversibility: If you need to convert the scaled data back to the original scale for interpretation or reporting, these values will be necessary to reverse the scaling.
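
The consistency and reversibility points can be sketched with the stored parameters alone. The example below builds a small in-memory stand-in for `scaled_data.csv` (the numbers and the names `scaled_demo` and `new_total` are hypothetical), restores the original scores with x = z · σ + μ, and scales one new score with the same parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for scaled_data.csv, built in memory so the sketch runs on its own
raw = np.array([[85.0, 5.0], [70.0, 12.0], [55.0, 30.0], [92.0, 2.0]])
mean, std = raw.mean(axis=0), raw.std(axis=0)
scaled = (raw - mean) / std
scaled_demo = pd.DataFrame({
    'Semester_Total_Scaled': scaled[:, 0],
    'Absence_Percentage_Scaled': scaled[:, 1],
    'Semester_Total_Mean': mean[0],
    'Semester_Total_Std': std[0],
    'Absence_Percentage_Mean': mean[1],
    'Absence_Percentage_Std': std[1],
})

# Reversibility: x = z * sigma + mu recovers the original scores
restored_total = (
    scaled_demo['Semester_Total_Scaled'] * scaled_demo['Semester_Total_Std']
    + scaled_demo['Semester_Total_Mean']
)

# Consistency: scale a new, unseen score with the stored parameters
new_total = 78.0
stored_mean = scaled_demo['Semester_Total_Mean'].iloc[0]
stored_std = scaled_demo['Semester_Total_Std'].iloc[0]
new_total_scaled = (new_total - stored_mean) / stored_std

print("Restored scores:", restored_total.round(2).tolist())
print("New score in standardized units:", round(float(new_total_scaled), 3))
```

When the fitted `scaler` object is still in memory, `scaler.transform` and `scaler.inverse_transform` perform the same two operations.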

### Why is this Useful?
1. Model Readiness: Many machine learning algorithms perform better or converge faster when features are on a similar scale. Standardization changes the scale of each feature, not the shape of its distribution.
2. Scale Sensitivity Reduction: Distance-based methods, such as K-Nearest Neighbors and K-Means clustering, can perform better when no single feature dominates the distance calculation merely because it has a larger numeric range. Standardization does not remove outliers; extreme values still affect the mean and standard deviation.
3. Improved Learning: Features with large ranges can disproportionately influence neural networks and other models trained by gradient descent. Scaling can mitigate this issue.
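
To illustrate the effect on distance calculations, the sketch below uses four made-up students whose scores are on a much larger numeric range than their absence percentages (an assumption for the demonstration; the real features may be closer in range). Before scaling, the Euclidean distance is driven almost entirely by the score; after scaling, both features contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: scores out of 500, absences in percent
X_demo = np.array([
    [450.0, 2.0],   # student 0
    [440.0, 30.0],  # student 1: similar score to student 0, very different absences
    [300.0, 3.0],   # student 2: similar absences to student 0, very different score
    [310.0, 28.0],  # student 3
])

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

# Distances from student 0 on the raw scale
raw_similar_score = dist(X_demo[0], X_demo[1])
raw_similar_absence = dist(X_demo[0], X_demo[2])

# The same distances after standardization
Z = StandardScaler().fit_transform(X_demo)
scaled_similar_score = dist(Z[0], Z[1])
scaled_similar_absence = dist(Z[0], Z[2])

print(f"Raw:    0-1 = {raw_similar_score:.2f}, 0-2 = {raw_similar_absence:.2f}")
print(f"Scaled: 0-1 = {scaled_similar_score:.2f}, 0-2 = {scaled_similar_absence:.2f}")
```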


## Cell 3: Using the Elbow Method & Silhouette Scores to Calculate the Optimal Number of Clusters

### Choose a Range for K

`K_range = range(1, 11)` is a Python statement that creates a range object representing the integers from 1 to 10 (the start, 1, is included; the stop, 11, is excluded). It is used here to iterate over candidate values of K for K-means clustering and to find the most suitable one.

**Components:**
* `range(start, stop)`: This is a built-in Python function that generates a sequence of numbers. `range(1, 11)` generates the numbers from 1 to 10.
* `K_range`: This is the variable name where the range object is stored. The name suggests it's being used to store a range of values for "K".

**Uses in Clustering (K-means):**

Determining Optimal K: In K-means clustering, "K" represents the number of clusters you want to divide your data into. However, the optimal number of clusters is not always apparent and often needs to be determined empirically. A common method for finding the optimal K is the Elbow Method, where you run K-means for a range of K values and then plot the results to look for an "elbow" point where the rate of decrease sharply changes. This point is often considered a good trade-off for the number of clusters.

Iterating Through K Values: The variable `K_range` can be used in a for loop to run K-means with each K value in turn. For each K, you might calculate metrics such as the within-cluster sum of squared errors (WSS, which scikit-learn calls inertia) or the Silhouette Score to assess the quality of the clustering. The Silhouette Score requires at least two clusters, so it cannot be computed for K = 1.

### Run K-means for Each K and Calculate Inertia

This cell's code block is part of a process used to find the best number of groups, or "clusters", for organizing a dataset based on similarities among the data points:

- **Tracking Performance**: It initializes an empty list named `inertias`, which will be used to record a score that indicates how well the data points within each cluster are grouped.

- **Testing Cluster Counts**: It then sets up a loop to test different numbers of clusters, ranging from 1 to a maximum number (this maximum is determined by `K_range`).

- **Building Clusters**: For each number in that range, it creates a new K-means model and fits it to the standardized data. The `n_init` parameter is set to 5, so each model is run from 5 different initial centroid positions and the run with the lowest inertia is kept.

- **Evaluating the Clusters**: After fitting, the code reads the model's inertia: the sum of squared distances from each point to its nearest cluster center. Lower inertia means tighter, more internally coherent clusters.

- **Recording Results**: The calculated inertia is added to the list created earlier. Once the loop is done, the code prints out the list of inertias.

The inertias hint at the best number of clusters by showing how much each additional cluster tightens the clustering. Inertia generally keeps decreasing as clusters are added, so we look for the point beyond which adding more clusters improves it only slightly. Choosing that point is known as the "elbow" method.

In [None]:
K_range = range(1, 11)

inertias = []

for k in K_range:
    # Create a KMeans instance with k clusters and explicitly set n_init to 5
    kmeans = KMeans(n_clusters=k, n_init=5, random_state=42)
    
    # Fit the model to the scaled data
    kmeans.fit(X_scaled)
    
    # Append the inertia to the list of inertias
    inertias.append(kmeans.inertia_)

# Print the inertias for review
print(inertias)
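
To make the meaning of inertia concrete, the sketch below recomputes it by hand as the sum of squared distances from each point to its assigned cluster center and compares the result with `inertia_`. It runs on synthetic two-group data (`X_demo` is a stand-in, not the student dataset).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two well-separated groups of 30 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(-2, -2), scale=0.5, size=(30, 2)),
    rng.normal(loc=(2, 2), scale=0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, n_init=5, random_state=42).fit(X_demo)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centers) ** 2).sum())

print("Manual inertia: ", round(manual_inertia, 4))
print("KMeans inertia_:", round(float(km.inertia_), 4))
```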


### Plot the Elbow Curve

The code below does not locate the elbow automatically. Inspect the plot and choose the value of k after which the inertia begins to decrease markedly more slowly.

In [None]:

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.title('Elbow Method For Optimal k')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(K_range)
plt.show()
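
If a numeric suggestion is wanted alongside the visual check, one simple heuristic (not part of the original notebook, and only one of several possible approaches) is to pick the point on the curve that lies farthest from the straight line joining its first and last points. The function name `elbow_by_chord_distance` and the inertia values below are made up for illustration; treat the result as a hint to compare against the plot.

```python
import numpy as np

def elbow_by_chord_distance(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the straight
    line joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance from each point to the line through the end points
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distances = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(distances)])

# Made-up inertia values with a visible bend at k = 3
demo_ks = list(range(1, 11))
demo_inertias = [1000, 520, 210, 180, 160, 145, 132, 121, 112, 105]

suggested_k = elbow_by_chord_distance(demo_ks, demo_inertias)
print("Suggested elbow at k =", suggested_k)
```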


### Calculate the Silhouette Score (Higher is Better)

This code block is part of a data analysis pipeline that reads a dataset, selects specific features for clustering, scales those features, applies K-means clustering, and then evaluates the clustering result using the Silhouette Score.

#### When or If Changes Need to Be Done

If you want to experiment with the number of clusters (`n_clusters`), you would change the value from 5 to the desired number of clusters.

If you want to ensure the stability of the results, you might change `n_init` to a higher value to perform more runs with different centroid seeds.

The `random_state` can be adjusted or removed if you want different results each time for comparison.

If you decide to use a different set of features for clustering, you would modify the X DataFrame to include those columns.

If the Silhouette Score is low, it may suggest that the chosen number of clusters is not ideal, and you might consider adjusting `n_clusters` or revisiting your feature selection or preprocessing steps.

At the end, the coordinates of the cluster centers are printed from the `kmeans.cluster_centers_` attribute. These coordinates are in standardized units, not in the original score and percentage scales.

In [None]:
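# Reload the original dataset (df was overwritten with the scaled data above)
# Replace the path below with the one used in Cell 1 if it differs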
data_path = 'applied_college.csv'
df = pd.read_csv(data_path)

# These are the columns to be used for clustering
X = df[['Semester_Total', 'Absence_Percentage']]

# Scale the features before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-means clustering (n_init tells how many times to run the clustering with different centroid seeds)
kmeans = KMeans(n_clusters=5, n_init=20, random_state=0).fit(X_scaled)

# Calculate Silhouette Score
silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)

# Print the Silhouette Score (Higher is better)
print(f"The Silhouette Score for the clustering is: {silhouette_avg}")
print("\nCluster centers (standardized units):")
print(kmeans.cluster_centers_)
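
The cell above scores a single choice of `n_clusters`. To compare several values, the same calculation can be repeated in a loop, as in the sketch below. It runs on synthetic data with three loose groups (`X_demo` is a stand-in for the two features, not the real dataset) and starts at k = 2 because the Silhouette Score is undefined for a single cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (Semester_Total, Absence_Percentage): three loose groups
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(88, 8), scale=2, size=(40, 2)),
    rng.normal(loc=(68, 22), scale=2, size=(40, 2)),
    rng.normal(loc=(48, 36), scale=2, size=(40, 2)),
])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-means for each candidate k and record the average silhouette score
silhouette_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo_scaled)
    silhouette_by_k[k] = silhouette_score(X_demo_scaled, labels)

# The k with the highest score is the candidate favored by this metric
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```

On real data the highest Silhouette Score and the elbow do not always agree, so it is reasonable to weigh both together with how interpretable the resulting groups are.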

## Cell 4: Plotting the Results

This cell is used to visualize data that has been organized into groups, or "clusters", based on similarity. Here's what each part does:

- **Setting Up the Plot**: It prepares a visual space, setting the size of the plot so the data will be easy to see.

- **Displaying the Data**: It creates a scatter plot, a type of graph that shows individual data points on X-Y axes. Each point represents one row of the dataset, positioned by its standardized `Semester_Total` and `Absence_Percentage` values. Each point is colored according to the cluster it belongs to, showing the grouping created by the clustering process.

- **Highlighting the Centers**: The code also marks the "center" of each cluster with a special symbol (an 'X') and a different color. These centers are like the average location for each group, showing the middle ground of all the points in that cluster.

- **Adding Details**: To make the graph more informative, it includes a title, labels for each axis, and a color bar that maps the colors to cluster labels.

- **Presenting the Plot**: Finally, the graph is displayed to the user, allowing them to see the distribution of the data and how it's been clustered.


In [None]:

# Create a scatter plot of the two features, color-coded by cluster label
plt.figure(figsize=(10, 8))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', edgecolor='k')

# Plot the centroids
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.75, marker='X')

# Optional enhancements
plt.title('Clustered Data Points')
plt.xlabel('Semester_Total (scaled)')
plt.ylabel('Absence_Percentage (scaled)')
plt.colorbar(label='Cluster Label')

# Show the plot
plt.show()
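
The plot above is drawn in standardized units, which are hard to read as scores and percentages. The sketch below shows one way to describe clusters in the original units: converting the cluster centers back with `inverse_transform` and summarizing each cluster with `groupby`. It uses synthetic data, and the names `demo_df`, `demo_scaler`, and `demo_kmeans` are hypothetical stand-ins for the notebook's `df`, `scaler`, and `kmeans`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two groups of 40 students each
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({
    'Semester_Total': np.concatenate([rng.normal(88, 2, 40), rng.normal(55, 2, 40)]),
    'Absence_Percentage': np.concatenate([rng.normal(8, 2, 40), rng.normal(30, 2, 40)]),
})

features = ['Semester_Total', 'Absence_Percentage']
demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo_df[features].to_numpy())
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo_scaled)

# Cluster centers converted from standardized units back to the original units
centers_original = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=features,
)
print(centers_original.round(1))

# Per-cluster size and feature means, computed on the unscaled columns
demo_df['Cluster'] = demo_kmeans.labels_
summary = demo_df.groupby('Cluster').agg(
    Students=('Semester_Total', 'size'),
    Mean_Total=('Semester_Total', 'mean'),
    Mean_Absence=('Absence_Percentage', 'mean'),
)
print(summary.round(1))
```

A table like `summary` gives a starting point for labeling each cluster before the qualitative factors listed below are considered.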


## Points to Consider when Categorizing Clusters

1. Demographics: Age, gender, nationality, and socioeconomic status could reveal patterns in attendance and performance.
2. Academic Performance: Average grades, major subjects, and performance trends over time. For example, performance in core vs. elective courses.
3. Attendance Patterns: Frequency and timing of absences (e.g., more absences around certain events or times of the semester). For example, correlation between attendance and exam periods or deadlines.
4. Engagement: Participation in class, such as involvement in discussions, group work, and other interactive segments of the course. Engagement in extracurricular activities, which could be an indicator of overall engagement with the college experience.
5. Financial Status: Scholarship status, need for financial aid, or employment alongside studies, which might impact both attendance and performance.
6. Housing and Commute: Whether students live on-campus, off-campus, or commute from home, and how this might impact their attendance and performance.
7. Health and Well-being: Access to and use of health services, and any reported health issues that might affect attendance. Mental health considerations, particularly stress or anxiety related to studies or personal life.
8. Study Habits: Time spent on self-study, group study sessions, or utilization of academic support services such as tutoring. For example, preferred study times and methods (e.g., late-night studying vs. regular hours, digital vs. traditional materials).
9. Technology Access: Access to and usage of technology for learning, which could influence study habits and academic performance.
10. Language Proficiency: Proficiency in the language of instruction, which in many Middle Eastern universities is English, and how this impacts academic performance.
11. Cultural Factors: Cultural obligations, such as family responsibilities or religious practices, that might influence attendance and study patterns.
12. Feedback from Instructors: Instructors' observations about students' behavior, participation, and any challenges they might be facing.

This Python notebook was created by S. Hatting for the University of Tabuk's English Language Institute in January 2024.