# Student Engagement Recommendation System
This project implements a recommendation system to personalize educational material suggestions for students based on their engagement data and interests.

In [88]:
# Import required packages and libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

# Generating synthetic data 

## We'll create synthetic datasets for:
- *Student Data*
- *Material Data*
- *Engagement Data*

For the data generation, we will be using randomly generated data to simulate a realistic student dataset.
We define the number of students and create unique StudentIDs using a formatted string.
The students are randomly assigned to various courses, academic years, and interests. Each student can have multiple interests, randomly selected from a predefined list.
We simulate student performance by drawing scores from a normal distribution (mean 65, standard deviation 20) and clipping them to the 0 to 100 range.
This synthetic student data will be stored in a pandas DataFrame called students, with columns for StudentID, Course, Year, Interests, and Performance.



In [89]:
# 1. Generate Synthetic Student Data
np.random.seed(42)
num_students = 100

student_ids = [f"S{str(i).zfill(3)}" for i in range(1, num_students + 1)]
courses = ["Computer Science", "Mechanical Engineering", "Electrical Engineering", "Civil Engineering"]
years = [1, 2, 3, 4]
interests_list = ["AI", "Blockchain", "Environmental Science", "Robotics", "Data Science", "Cybersecurity", "Energy", "Automation"]

students = pd.DataFrame({
    "StudentID": student_ids,
    "Course": np.random.choice(courses, num_students),
    "Year": np.random.choice(years, num_students),
    "Interests": [np.random.choice(interests_list, size=np.random.randint(1,4), replace=False).tolist() for _ in range(num_students)],
    "Performance": np.clip(np.random.normal(65, 20, num_students).astype(int), 0, 100)  # Average quiz scores: normal distribution with mean 65 and standard deviation 20, clipped to 0-100
})



## Study Material Data
In this step, we generate synthetic data for the learning materials that will be recommended to the students:

We define the number of materials (num_materials) to be generated and assign unique MaterialIDs in a similar way as student IDs.
Each material is associated with a subject, which is randomly chosen from the list of student interests (interests_list). This simulates the different topics the materials cover.

The difficulty level of the materials is randomly chosen from predefined levels: "Easy", "Medium", and "Hard".
Each material is assigned a Popularity score, a random integer between 50 and 99 (a uniform draw from [50, 100) truncated to an integer), representing how popular the material is based on user interactions or engagement.

The ContentLength of each material is also randomly generated, ranging from 15 to 45 pages, to represent the material's size or time to consume.
The synthetic material data is stored in a pandas DataFrame called materials, with columns for MaterialID, Subject, Difficulty, Popularity, and ContentLength.

In [90]:
# 2. Generate Synthetic Material Data
num_materials = 50
material_ids = [f"M{str(i).zfill(3)}" for i in range(1, num_materials + 1)]
subjects = interests_list
difficulty_levels = ["Easy", "Medium", "Hard"]
materials = pd.DataFrame({
    "MaterialID": material_ids,
    "Subject": np.random.choice(subjects, num_materials),
    "Difficulty": np.random.choice(difficulty_levels, num_materials),
    "Popularity": np.random.uniform(50, 100, num_materials).astype(int),  
    "ContentLength": np.random.randint(3, 10, num_materials) * 5  
})



## Data for student-material engagement
We create an empty list engagements to store individual engagement records for each student.
For each student in the dataset, a random set of 5 to 14 materials is selected without replacement from the available material_ids to simulate the materials they have viewed (np.random.randint(5, 15) excludes its upper bound).

For each viewed material, we log the following information:

- StudentID: The ID of the student.
- MaterialID: The ID of the material that the student viewed.
- Viewed: A flag indicating that the student has viewed the material (set to 1).
- Rating: A randomly assigned rating between 1 and 5, representing how much the student liked the material.

This engagement data is then stored in a pandas DataFrame called engagements, which captures the interaction between students and materials, forming the foundation for the recommendation system's training and evaluation.

In [91]:
# 3. Generate Synthetic Engagement Data
engagements = []
for student in students["StudentID"]:
    viewed_materials = np.random.choice(material_ids, size=np.random.randint(5,15), replace=False)
    for material in viewed_materials:
        engagements.append({
            "StudentID": student,
            "MaterialID": material,
            "Viewed": 1,
            "Rating": np.random.randint(1,6)  
        })

engagements = pd.DataFrame(engagements)

In [92]:
# Let's have a look at the data
pd.set_option('display.width', 1000)
print(students.head())
print(materials.head())
print(engagements.head(20))


  StudentID                  Course  Year                                          Interests  Performance
0      S001  Electrical Engineering     3  [Environmental Science, Data Science, Blockchain]           21
1      S002       Civil Engineering     2                    [Cybersecurity, Blockchain, AI]          100
2      S003        Computer Science     2                         [Energy, AI, Data Science]           29
3      S004  Electrical Engineering     4                [Automation, Environmental Science]           94
4      S005  Electrical Engineering     2              [Cybersecurity, Data Science, Energy]           46
  MaterialID       Subject Difficulty  Popularity  ContentLength
0       M001        Energy     Medium          60             40
1       M002  Data Science       Easy          85             45
2       M003            AI       Easy          74             25
3       M004            AI     Medium          87             45
4       M005            AI       Easy  

# Data preprocessing
As the data was generated synthetically using random or semi-random techniques (e.g., from normal distributions for performance scores), we do not require specific preprocessing steps typically necessary for real-world datasets such as handling missing values, outliers, or scaling numerical features.

Instead, the focus of preprocessing here is to:

- Transform categorical data into a machine-readable format (e.g., one-hot encoding for student interests).
- Convert difficulty levels into numerical values for easier mathematical operations.
- Merge the datasets (students, materials, engagements) to create a comprehensive profile that can be used directly for further analysis and for building the recommendation system.

Thus, we skip traditional preprocessing steps like data cleaning and normalization, as the generated data is already in a consistent, usable format.


# Data processing 
We perform the following preprocessing steps to prepare the dataset for the recommendation system:
### One-Hot Encode Interests:
We use MultiLabelBinarizer() to transform the list of student interests into a one-hot encoded format.
Each column represents an interest, and each row for a student will have a 1 if they have that particular interest, otherwise 0.
The resulting encoded interests are added to the students DataFrame, replacing the original 'Interests' column.
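
To make the encoding concrete, here is a minimal plain-Python sketch of what multi-label one-hot encoding produces. It does not call `MultiLabelBinarizer` itself; the helper name `one_hot_interests` and the toy interest lists are assumptions made for illustration. Like `MultiLabelBinarizer` with default settings, it orders the columns alphabetically.

```python
def one_hot_interests(interest_lists):
    """Return (columns, rows): sorted label names and one 0/1 row per input list."""
    columns = sorted({label for labels in interest_lists for label in labels})
    rows = [[1 if col in labels else 0 for col in columns] for labels in interest_lists]
    return columns, rows


# Toy data, not taken from the generated students DataFrame
toy_interests = [["AI", "Robotics"], ["Energy"], ["Robotics", "AI", "Energy"]]
columns, rows = one_hot_interests(toy_interests)

print(columns)
for row in rows:
    print(row)
```

Each row has one entry per distinct interest, so a student with three interests gets three 1s regardless of the order in which the interests were listed.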
### Map Difficulty Levels:
The difficulty levels of the materials (Easy, Medium, Hard) are mapped to numerical values using a dictionary:

- Easy → 1
- Medium → 2
- Hard → 3

This mapping is stored in a new column called DifficultyNum in the materials DataFrame to facilitate mathematical computations in the recommendation logic.
### Merge Engagement Data:
We merge the engagements DataFrame with the materials DataFrame using the MaterialID column to enrich the engagement data with details about the materials (e.g., difficulty, popularity).
We then merge this enriched engagement data with the students DataFrame using the StudentID column, combining both student profiles and their interactions with the materials into a comprehensive student_profiles DataFrame.

In [93]:
# 4. Data Preprocessing

# a. One-Hot Encode Interests
mlb = MultiLabelBinarizer()
interests_encoded = pd.DataFrame(mlb.fit_transform(students['Interests']), columns=mlb.classes_)
students = pd.concat([students.drop('Interests', axis=1), interests_encoded], axis=1)

# b. Map Difficulty Levels
difficulty_mapping = {"Easy":1, "Medium":2, "Hard":3}
materials["DifficultyNum"] = materials["Difficulty"].map(difficulty_mapping)

# c. Merge Engagement Data
student_engagement = engagements.merge(materials, on="MaterialID", how="left")
student_profiles = students.merge(student_engagement, on="StudentID", how="left")



# Define Recommendation function
The recommendation system is designed to suggest the most relevant learning materials to a student based on a combination of their interests, academic performance, and the popularity of the materials. Here’s the approach:

Extract the student's profile: For the given student, we fetch their interests and performance from the student dataset.

Calculate similarity based on interests: We compute a vector for the student's interests and use it to calculate the cosine similarity between the student's interest profile and the subject of each material. This generates an "Interest Score" for each material.

Performance compatibility: We assess how well the material's difficulty matches the student's performance level. The numeric difficulty (1 to 3) is scaled to 33, 66, or 99 so that it is comparable with the 0 to 100 performance score; the closer the two values are, the higher the compatibility score.

Total score calculation: The final score is a weighted combination of the interest similarity score (45%), performance compatibility (35%), and material popularity (20%). This total score determines the ranking of each material for the student.

Excluding already viewed materials (optional): Materials that the student has already viewed could be excluded from recommendations, though this step is commented out in this version.

Top N recommendations: The system sorts the materials based on their total score and returns the top N materials as the recommendations.
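
The scoring steps above can be traced by hand for a single student and material. The sketch below is plain Python rather than the notebook's pandas code; the function names `cosine` and `total_score` and all input values are assumptions chosen for illustration (in particular, the popularity of 91 is an assumed value).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def total_score(interest_vec, subject_vec, difficulty_num, performance, popularity):
    """Weighted score: 45% interest, 35% performance compatibility, 20% popularity."""
    interest = cosine(interest_vec, subject_vec)
    compatibility = 1 - abs(difficulty_num * 33 - performance) / 100
    return 0.45 * interest + 0.35 * compatibility + 0.2 * popularity / 100


# Illustrative inputs (assumed values): a student with 3 of 8 interests and a
# performance of 100, scored against a Hard material (DifficultyNum = 3) with
# popularity 91 whose subject is one of the student's interests.
student_vec = [1, 1, 0, 1, 0, 0, 0, 0]
material_vec = [0, 0, 0, 1, 0, 0, 0, 0]

interest_score = cosine(student_vec, material_vec)
score = total_score(student_vec, material_vec, difficulty_num=3, performance=100, popularity=91)
print(round(interest_score, 4))  # equals 1 / sqrt(3)
print(round(score, 4))
```

Because every material has exactly one subject, the interest score is either 0 (no match) or 1/sqrt(n) for a student with n interests. A student with a single interest therefore gets an interest score of 1.0 for matching materials, while a student with three interests gets about 0.58.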

In [94]:
# 5. Recommendation Algorithm

def recommend_materials(student_id, top_n=5):
    # Get the student's profile
    student = students[students["StudentID"] == student_id].iloc[0]
    # Calculate similarity based on interests
    student_interest_vector = student[mlb.classes_].values.reshape(1, -1)
    material_subjects = materials["Subject"].apply(lambda x: [x])
    material_subjects_encoded = pd.DataFrame(mlb.transform(material_subjects), columns=mlb.classes_)
    # Compute cosine similarity
    similarities = cosine_similarity(student_interest_vector, material_subjects_encoded)
    materials["InterestScore"] = similarities[0]
    # Calculate performance compatibility
    materials["PerformanceCompatibility"] = 1 - abs(materials["DifficultyNum"] * 33 - student["Performance"])/100
    # Calculate total score
    materials["TotalScore"] = (0.45 * materials["InterestScore"] +
                               0.35 * materials["PerformanceCompatibility"] +
                               0.2 * materials["Popularity"]/100)
    # Exclude materials already viewed
    # viewed = engagements[engagements["StudentID"] == student_id]["MaterialID"].unique()
    # recommendations = materials[~materials["MaterialID"].isin(viewed)]
    # Get top N recommendations
    top_recommendations = materials.sort_values(by="TotalScore", ascending=False).head(top_n)
    return top_recommendations[["MaterialID", "Subject", "Difficulty", "TotalScore"]]

#### IMPORTANT ####
####################################################################################################################
"""Viewed materials are kept in the recommendations so that the algorithm can be evaluated; otherwise MAP@K and NDCG@K would be 0.
For real recommendations, uncomment the two lines above and sort `recommendations` instead of `materials`, since students should not be
recommended materials they have already completed. (Materials that were viewed but not completed could still be recommended; that
distinction is not implemented here.)"""
####################################################################################################################
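
As a rough illustration of the exclusion step described in the note above, the following plain-Python sketch drops viewed materials before taking the top N. The helper `top_n_unseen`, the scores, and the viewing history are hypothetical; the notebook itself would do the equivalent with the commented-out pandas filter.

```python
def top_n_unseen(scored_materials, viewed_ids, top_n=5):
    """Return the top_n highest-scoring (material_id, score) pairs not in viewed_ids."""
    unseen = [(mid, score) for mid, score in scored_materials if mid not in viewed_ids]
    return sorted(unseen, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical scores and viewing history for one student
scored = [("M001", 0.61), ("M002", 0.79), ("M003", 0.42), ("M004", 0.77), ("M005", 0.70)]
viewed = {"M002", "M005"}

unseen_top = top_n_unseen(scored, viewed, top_n=2)
print(unseen_top)
```

With this filter in place, metrics that treat viewed materials as the relevant set would all be 0, so evaluating a filtered recommender would need relevance labels that are not the full viewing history, for example a held-out portion of the engagements.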


# Generate Recommendations for Each Student
In this step, we iterate through each student in the dataset and generate personalized recommendations for them:

Generate recommendations for each student: For every student in the dataset, the recommend_materials function is called to calculate the top materials based on their interests, performance, and the material's popularity.

Store recommendations: We store the recommendations for each student in a dictionary called student_recommendations, where the key is the student's ID, and the value is a DataFrame of the top recommended materials.

Example output: To demonstrate, we print out the top recommendations for a specific student, in this case, student with ID "S002".

In [95]:
# 6. Generate Recommendations for Each Student
student_recommendations = {}
for student_id in students["StudentID"]:
    recs = recommend_materials(student_id)
    student_recommendations[student_id] = recs

# Example: Print recommendations for a specific student
student_id = "S002"
print(f"Top recommendations for Student {student_id}:")
print(student_recommendations[student_id])

Top recommendations for Student S002:
   MaterialID        Subject Difficulty  TotalScore
19       M020  Cybersecurity       Hard    0.788308
13       M014             AI       Hard    0.782308
8        M009  Cybersecurity       Hard    0.766308
31       M032  Cybersecurity       Hard    0.764308
24       M025     Blockchain       Hard    0.734308


# Evaluating the Algorithm
### 1) Calculate MAP@K for the Recommendation System
We implement a function to compute the Mean Average Precision at K (MAP@K) score, a common metric for evaluating recommendation systems.

Function Overview: The calculate_map_at_k function calculates the MAP@K score based on the following parameters:

- recommendations: A dictionary with student IDs as keys and DataFrames of recommended materials (with a MaterialID column) as values.
- engagements: A DataFrame containing data about materials each student has engaged with.
- k: The number of top recommendations to consider.

Average Precision Calculation: For each student, the function:

- Extracts the top K recommendations.
- Identifies relevant materials based on student engagement.
- Accumulates the precision at each position where a recommended material is relevant, then divides by the number of hits.

MAP@K Score Calculation: The MAP@K score is the mean of these average precision values. Students with no hits in their top K are skipped rather than counted as 0, so the score is an average over students with at least one hit; this is why MAP@1 is exactly 1.0.

Example Execution: The code includes an example that computes and prints the MAP@K scores for K values from 1 to 5.

This evaluation helps assess how effectively the recommendation system identifies materials that students are likely to engage with.

In [96]:
def calculate_map_at_k(recommendations, engagements, k):
    """
    Calculate MAP@K for the recommendation system.

    Parameters:
        recommendations: Dictionary where keys are student IDs and values are DataFrames of recommended materials with a 'MaterialID' column.
        engagements: DataFrame with student engagement data containing 'StudentID' and 'MaterialID'.
        k: Number of top recommendations to consider.

    Returns:
        MAP@K score.
    """
    average_precision_scores = []

    for student_id, recommended in recommendations.items():
        # Get top K recommendations
        top_k_recommended = recommended.head(k)['MaterialID'].tolist()
        
        # Get relevant items (engaged materials)
        relevant_items = engagements[engagements["StudentID"] == student_id]["MaterialID"].tolist()

        # Calculate Precision at K
        hits = 0
        precision_at_k = 0

        for i, item in enumerate(top_k_recommended):
            if item in relevant_items:
                hits += 1
                precision_at_k += hits / (i + 1)  # Precision = hits / (position + 1)

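        # Students with no hits are skipped, so MAP@K below averages only over students with at least one hit
        if hits > 0:
            average_precision = precision_at_k / min(hits, k)  # Average precision
            average_precision_scores.append(average_precision)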

    # MAP@K
    map_at_k = sum(average_precision_scores) / len(average_precision_scores) if average_precision_scores else 0
    return map_at_k

# Example usage
for i in range(1,6):
    k = i # Set K for MAP@K
    map_at_k_score = calculate_map_at_k(student_recommendations, engagements, k)
    print(f"MAP@{k}: {map_at_k_score:.4f}")


MAP@1: 1.0000
MAP@2: 0.8750
MAP@3: 0.6865
MAP@4: 0.5643
MAP@5: 0.5052
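
The function above averages only over students who have at least one hit. For comparison, here is a sketch of a stricter variant that counts students without hits as 0. It keeps the notebook's normalization (dividing by the number of hits) and works on plain lists and sets instead of DataFrames; the function names and the toy data are assumptions for illustration.

```python
def average_precision_at_k(recommended_ids, relevant_ids, k):
    """AP@K for one user: mean of precision at each hit position, 0.0 if no hits."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(recommended_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def map_at_k_all_users(recommendations, relevant, k):
    """Mean of AP@K over every user, counting users without hits as 0."""
    if not recommendations:
        return 0.0
    scores = [
        average_precision_at_k(recs, relevant.get(user, set()), k)
        for user, recs in recommendations.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical recommendations and viewing histories for three students
recs = {
    "S001": ["M001", "M002", "M003"],
    "S002": ["M004", "M005", "M006"],
    "S003": ["M007", "M008", "M009"],
}
seen = {"S001": {"M001", "M003"}, "S002": {"M005"}, "S003": set()}

map_all = map_at_k_all_users(recs, seen, k=3)
print(round(map_all, 4))
```

In this toy case the three average precision values are 5/6, 1/2, and 0. Averaging over all three students gives 4/9, whereas skipping the student without hits would give 2/3. At K=1 the stricter variant reduces to the share of students whose top recommendation is a viewed material.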


### 2) Calculate NDCG@K for the Recommendation System
This section implements a function to compute the Normalized Discounted Cumulative Gain at K (NDCG@K), which evaluates the ranking quality of our recommendations.

Function Purpose: The calculate_ndcg_at_k function calculates NDCG@K using:

- recommendations: Dictionary of student IDs and DataFrames of their recommended materials.
- engagements: DataFrame of student engagement data.
- k: Number of top recommendations to consider.

DCG Calculation: For each student, the function:

- Retrieves the top K recommendations.
- Identifies relevant materials based on engagement.
- Computes DCG by summing binary relevance scores (1 for a viewed material), each discounted by the logarithm of its position.

NDCG Calculation: NDCG is derived by dividing DCG by the Ideal DCG (IDCG), giving a score between 0 and 1.

Example Execution: The code snippet computes NDCG@K scores for K values from 1 to 5, offering insight into the ranking quality of the recommendations.
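
To see how the discounting and normalization interact, the sketch below works through NDCG for a single recommendation list in plain Python. The helper `dcg`, the hit pattern, and the number of viewed materials are assumed values for illustration only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


# Hypothetical top-5 list: 1 marks a recommended material the student viewed
hit_pattern = [0, 1, 0, 0, 1]
num_relevant = 8  # assumed number of materials this student viewed
k = len(hit_pattern)

ideal_pattern = [1] * min(num_relevant, k)  # best case: every slot is a hit
ndcg = dcg(hit_pattern) / dcg(ideal_pattern)

print(round(dcg(hit_pattern), 4), round(dcg(ideal_pattern), 4), round(ndcg, 4))
```

Since every student in the synthetic data views at least 5 materials, the ideal list for K up to 5 always consists of K hits. A student's NDCG@K therefore reaches 1 only when all K recommended materials were viewed, which helps explain why the averaged scores are low.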

In [97]:

def calculate_ndcg_at_k(recommendations, engagements, k):
    """
    Calculate NDCG@K for the recommendation system.

    Parameters:
        recommendations: Dictionary where keys are student IDs and values are DataFrames of recommended MaterialIDs.
        engagements: DataFrame with student engagement data containing 'StudentID' and 'MaterialID'.
        k: Number of top recommendations to consider.

    Returns:
        NDCG@K score.
    """
    ndcg_scores = []

    for student_id, recommended in recommendations.items():
        # Get top K recommendations
        top_k_recommended = recommended.head(k)['MaterialID'].tolist()
        
        # Get relevant items (engaged materials)
        relevant_items = engagements[engagements["StudentID"] == student_id]["MaterialID"].tolist()

        # Calculate DCG
        dcg = 0.0
        for i, item in enumerate(top_k_recommended):
            if item in relevant_items:
                # Assign relevance score (can be 1 for engaged materials)
                relevance_score = 1
                # Use the position in the list for discounting
                dcg += relevance_score / np.log2(i + 2)  # i + 2 to avoid log(1) for the first position

        # Calculate IDCG (Ideal DCG)
        ideal_relevance_scores = [1] * min(len(relevant_items), k)  # All relevant items have a relevance score of 1
        idcg = sum(relevance / np.log2(i + 2) for i, relevance in enumerate(ideal_relevance_scores))

        # Calculate NDCG
        ndcg = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg)

    # Average NDCG@K
    ndcg_at_k = sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0
    return ndcg_at_k

# Example usage
for i in range(1,6):
    k = i  # Set K for NDCG@K
    ndcg_at_k_score = calculate_ndcg_at_k(student_recommendations, engagements, k)
    print(f"NDCG@{k}: {ndcg_at_k_score:.4f}")


NDCG@1: 0.2100
NDCG@2: 0.1674
NDCG@3: 0.1727
NDCG@4: 0.1824
NDCG@5: 0.1834
