<a href="https://colab.research.google.com/github/parimmu/mentalhealthcheckinpowerapp/blob/master/cse_163_final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The Factors that Affect Students’ Grades and Performance**


# **Motivation**
In the process of learning, many students encounter unstable grades or inefficient learning methods. Understanding which factors affect achievement not only helps students choose better-targeted learning methods, but also enables teachers and parents to provide better support. Moreover, when students’ overall academic level improves, the overall level of education rises with it, which helps develop more skilled people and promotes social progress.


# **Data Setting**
This project uses three publicly available datasets from Kaggle, each containing information about students’ demographic characteristics, learning behaviors, and academic performance. Together, these datasets give a broad view of the factors that may affect students’ final grades, making them suitable for building a predictive model and exploring relationships between variables.
The titles and links of the datasets used in this project are listed below:

Student Final Grade Prediction - Multi Linear Regression https://www.kaggle.com/datasets/rabieelkharoua/students-performance-dataset/data

Student Performance Data Set https://www.kaggle.com/datasets/joebeachcapital/students-performance/data

Student Performance Data Set https://www.kaggle.com/datasets/mariazhokhova/higher-education-students-performance-evaluation


# **Challenge Goals**

1. Multiple Datasets:
We use three datasets: Student Final Grade Prediction - Multi Linear Regression, Student Performance Data Set, and Students Performance Dataset. These datasets are merged with Pandas’ pd.merge() function, which combines them on shared features such as gender, age, and parental education level.
This operation produces a comprehensive dataset that includes variables from all three sources. At least two research questions involve multiple datasets. For example, we compare the academic performance of students from different demographic groups using variables from all three datasets. Additionally, we explore whether students’ average scores are influenced by parental education level and study time, using the merged data.

2. Machine Learning:
We apply three machine learning algorithms from Scikit-learn: Linear Regression, Decision Tree Regression, and Random Forest Regression. For Decision Tree Regression, for example, we adjust the maximum depth (max_depth) and the minimum number of samples required to split a node (min_samples_split). We also use a Linear Regression model to predict students’ average grades: Pandas and Scikit-learn are used to process the data and train the model, predict() is used to make predictions, and MAE, MSE, and R² are used to evaluate model performance.
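
For the first challenge goal, the merge on shared columns can be sketched as follows. This is a minimal illustration, not the project's actual data: the two small DataFrames and their values are made up, and only the column names follow the ones used later in the notebook.

```python
import pandas as pd

# Toy stand-ins for two of the cleaned datasets (all values are made up).
shared = ["sex", "Student Age", "final_score", "parental_education"]
df_a = pd.DataFrame({
    "sex": [1, 0],
    "Student Age": [18, 19],
    "final_score": [3.5, 3.0],
    "parental_education": [4, 3],
})
df_b = pd.DataFrame({
    "sex": [1, 0],
    "Student Age": [18, 20],
    "final_score": [3.5, 2.5],
    "parental_education": [4, 2],
})

# With no `on=` argument, pd.merge joins on every column the two frames share.
# how="outer" keeps rows from both sides; a row that appears identically in
# both frames is matched and therefore appears only once in the result.
merged = pd.merge(df_a[shared], df_b[shared], how="outer")
print(merged)
```

Because every column is a join key here, the outer merge behaves like a union of rows. If duplicate students across sources should be kept as separate records, pd.concat() would be the alternative to consider.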


# **Methods**
**Data Loading & Cleaning & Preprocessing:**
Three datasets are imported and initially processed to ensure that the data is clean and suitable for subsequent analysis. Steps: Import the three datasets using pandas.read_csv(). Use df.describe() to view an overview of the data. Delete missing values (dropna()) or fill them with the mean (fillna()). Use df.drop() to drop columns that are not relevant to the study.
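
The cleaning steps above can be sketched on a tiny in-memory CSV. The file content and its values are hypothetical stand-ins for one of the Kaggle files; the column names are borrowed from the first dataset only for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for one of the Kaggle files.
raw = io.StringIO(
    "StudentID,Age,StudyTimeWeekly,GPA\n"
    "1,17,5.0,3.2\n"
    "2,18,,2.8\n"
    "3,16,12.5,\n"
)
df = pd.read_csv(raw)

# Quick numeric overview of the data.
print(df.describe())

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill missing numeric values with the mean of their column.
filled = df.fillna(df.mean(numeric_only=True))

# Drop a column that is not relevant to the analysis.
filled = filled.drop(columns=["StudentID"])
print(filled)
```

Dropping rows is the simpler choice but shrinks the dataset; filling with the mean keeps every row at the cost of flattening the variation in that column.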

**Understand data distribution and characteristics:**
Understand data distribution, relationships between features, and factors that may affect student achievement. Steps: Use df.describe() and df.value_counts() to view the data distribution. Use groupby() to compute average grades by different variables (e.g., gender, race/ethnicity, parental education level).
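
As a small illustration of these steps, the sketch below counts group sizes and averages scores per group. The DataFrame is made up; the column names mirror the ones used in the merged dataset.

```python
import pandas as pd

# Made-up records; only the column names follow the notebook.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "parental_education": [2, 3, 3, 2],
    "final_score": [3.0, 3.5, 2.5, 2.0],
})

# How many students fall into each category.
counts = df["sex"].value_counts()

# Average final score per group.
mean_by_sex = df.groupby("sex")["final_score"].mean()
mean_by_parent = df.groupby("parental_education")["final_score"].mean()

print(counts)
print(mean_by_sex)
print(mean_by_parent)
```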

**Data Visualization:**
Use graphs to visualize the distribution of achievement, the differences between groups, and the relationships between variables. Steps: Create charts using Matplotlib and Seaborn. For example, a boxplot (sns.boxplot()) compares achievement differences by gender and race, and a scatterplot (sns.scatterplot()) shows the relationship between study time, parental education level, and achievement.
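
A minimal boxplot sketch is shown below, assuming Seaborn and Matplotlib are installed. The data is made up, and the non-interactive "Agg" backend is an assumption chosen so the sketch also runs without a display; in a notebook that line can be omitted and plt.show() called instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up scores for two groups.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "final_score": [2.0, 3.0, 3.5, 2.5, 3.0, 4.0],
})

# One box per category on the x-axis, showing the spread of final scores.
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x="sex", y="final_score", data=df, ax=ax)
ax.set_title("Achievement Differences by Gender")
ax.set_xlabel("Gender")
ax.set_ylabel("Final Score")
```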

**Machine Learning & Model Training:**
Use machine learning models to predict students’ grades based on their personal characteristics and study habits. Use Linear Regression, Decision Tree Regression, and Random Forest Regression from Scikit-learn. Train each model using study time, parental education level, and other variables as features, and predict the average score. Use train_test_split() to divide the data into training and testing sets.
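
The training step can be sketched as below. The data is synthetic (the target is built as an exact linear function of the features purely for illustration), and the hyperparameter values are example settings rather than tuned choices.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 20 students with two numeric features.
df = pd.DataFrame({
    "study_time": [1, 2, 3, 4, 5] * 4,
    "parental_education": [1, 2, 3, 4] * 5,
})
# Assumption for illustration only: the score is an exact linear function.
df["final_score"] = 1.0 + 0.4 * df["study_time"] + 0.1 * df["parental_education"]

X = df[["study_time", "parental_education"]]
y = df["final_score"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=5, min_samples_split=5, random_state=42
    ),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and predict the held-out scores.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Keeping the models in a dictionary makes it easy to loop over them and compare results under the same train/test split.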

**Model Performance Evaluation:**
Compare the performance of different machine learning models to determine which one provides the most accurate predictions. Calculate MAE, MSE, and R² for each model. Visualize the predictions using scatterplot() to compare actual and predicted scores.
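
To make the three metrics concrete, here is a small worked example on made-up observed and predicted scores. MAE averages the absolute errors, MSE averages the squared errors, and R² compares the squared error against the variance of the observed scores.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed and predicted final scores.
y_true = [3.0, 2.0, 4.0, 1.0]
y_pred = [2.5, 2.0, 3.5, 2.0]

# Absolute errors are 0.5, 0.0, 0.5, 1.0, so MAE = 2.0 / 4 = 0.5.
mae = mean_absolute_error(y_true, y_pred)

# Squared errors are 0.25, 0.0, 0.25, 1.0, so MSE = 1.5 / 4 = 0.375.
mse = mean_squared_error(y_true, y_pred)

# Total sum of squares around the mean (2.5) is 5.0, so R² = 1 - 1.5 / 5.0 = 0.7.
r2 = r2_score(y_true, y_pred)

print("MAE =", round(mae, 4))
print("MSE =", round(mse, 4))
print("R² =", round(r2, 4))
```

Lower MAE and MSE mean smaller errors, while an R² closer to 1 means the model explains more of the variation in the observed scores.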


In [None]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
#from sklearn.linear_model import LinearRegression
#from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Read the CSV files
def load_and_clean_data():
  """
  Loads three CSV files containing student performance data, drops rows with missing values
  and standardizes column names across the datasets.
  Rename columns related to final grades, parental education, gender, age, and student ID to ensure consistency for merging and analysis.
  Returns three cleaned DataFrames for further processing.
  """
  df1 = pd.read_csv("/content/drive/My Drive/CSE163_project/Student_performance_data_.csv")
  df2 = pd.read_csv("/content/drive/My Drive/CSE163_project/StudentsPerformance_with_headers.csv")
  df3 = pd.read_csv("/content/Student performance data/DATA.csv", delimiter=";")

  df1 = df1.dropna()
  df2 = df2.dropna()
  df3 = df3.dropna()

  df1["final_score"] = df1["GPA"]
  df2["final_score"] = df2["Cumulative grade point average in the last semester (/4.00)"]
  df3["final_score"] = df3["grade_previous"]

  df1["parental_education"] = df1["ParentalEducation"]
  df2["parental_education"] = df2[["Mother’s education", 'Father’s education ']].max(axis=1)
  df3["parental_education"] = df3[["mother_ed", "farther_ed"]].max(axis=1)

  df1 = df1.rename(columns={"Gender": "sex"})
  df1 = df1.rename(columns={"Age": "Student Age"})
  df1 = df1.rename(columns={"StudentID": "student_id"})
  df2 = df2.rename(columns={"STUDENT ID": "student_id"})
  df2 = df2.rename(columns={"Sex": "sex"})
  df3 = df3.rename(columns={"age": "Student Age"})
  return df1, df2, df3


def categorize_study_time(hours):
    """
    Categorizes weekly study hours into five groups based on their value.
    Takes an integer representing study hours
    Returns a category from 1 to 5, where 1 is 0 hours, 2 is less than 5, 3 is less than 10
    4 is less than 20, and 5 is 20 or more hours.
    """
    if hours == 0:
        return 1
    elif hours < 5:
        return 2
    elif hours < 10:
        return 3
    elif hours < 20:
        return 4
    else:
        return 5

def process_study_time(df1, df2, df3):
    """
    In this function, it takes df1, df2 and df3 as arguments
    Standardizes study time data across three datasets by mapping weekly study hours into consistent group labels.
    Applies the categorize_study_time function to one dataset and maps predefined group labels to the others.
    Returns the three updated DataFrames with a new 'study_time_group' column for each.
    """
    df1["study_time"] = df1["StudyTimeWeekly"].apply(categorize_study_time)
    df1["study_time_group"] = df1["study_time"].map({
        1: "0h", 2: "<5h", 3: "6~10h", 4: "11~20h", 5: "20+h"
    })

    df2["study_time_group"] = df2["Weekly study hours"].map({
        1: "0h", 2: "<5h", 3: "6~10h", 4: "11~20h", 5: "20+h"
    })

    df3["study_time_group"] = df3["weekly_study_hours"].map({
        1: "0h", 2: "<5h", 3: "6~10h", 4: "11~20h", 5: "20+h"
    })

    return df1, df2, df3
# The map method (found via a Google search) unifies the study time labels across the three datasets so they are easy to read.


def merge_datasets(df1, df2, df3):
    """
    In this function, it takes df1, df2 and df3 as arguments
    Merges three standardized datasets into one DataFrame by combining common columns, including sex, age, final score,
    parental education, and study time group. Performs an outer join to ensure all records are included,
    and returns the merged DataFrame.
    """
    common_columns = ["sex", "Student Age", "final_score", "parental_education", "study_time_group"]
    merged_df = pd.merge(df1[common_columns], df2[common_columns], how="outer")
    merged_df = pd.merge(merged_df, df3[common_columns], how="outer")
    return merged_df

def test_merge_datasets():
    df1 = pd.DataFrame({
        'sex': ['M'],
        'Student Age': [18],
        'final_score': [3.5],
        'parental_education': [4],
        'study_time_group': ['<5h']
    })

    df2 = pd.DataFrame({
        'sex': ['F'],
        'Student Age': [19],
        'final_score': [3.0],
        'parental_education': [3],
        'study_time_group': ['6~10h']
    })

    df3 = pd.DataFrame({
        'sex': ['M'],
        'Student Age': [20],
        'final_score': [2.5],
        'parental_education': [2],
        'study_time_group': ['11~20h']
    })

    merged = merge_datasets(df1, df2, df3)
    expected_columns = ['sex', 'Student Age', 'final_score', 'parental_education', 'study_time_group']
    assert list(merged.columns) == expected_columns

def group_by_analysis(merged_df):
    """
    In this function, it takes merged_df as argument.
    Calculates the mean final score by gender, parental education, and study time group.
    Returns three groupings with their corresponding average scores to compare performance across these factors.
    """
    gender_group = merged_df.groupby("sex")["final_score"].mean()
    parent_group = merged_df.groupby("parental_education")["final_score"].mean()
    study_time_group = merged_df.groupby("study_time_group")["final_score"].mean()
    return gender_group, parent_group, study_time_group

def test_group_by_analysis():
  df = pd.DataFrame({
      'sex': ['M', 'F'],
      'parental_education': [4, 4],
      'study_time_group': ['<5h', '6~10h'],
      'final_score': [3.0, 4.0]
  })

  gender_group, parent_group, study_group = group_by_analysis(df)

  assert gender_group['M'] == 3.0
  assert gender_group['F'] == 4.0
  assert parent_group[4] == 3.5
  assert study_group['<5h'] == 3.0


def generate_statistics(merged_df):
    """
    In this function, it takes merged_df as argument.
    Generates basic descriptive statistics and counts for the merged dataset, including summaries for sex,
    parental education, and study time groups.
    Returns the descriptive statistics and counts for further interpretation.
    """
    data_description = merged_df.describe()
    value_counts_gender = merged_df["sex"].value_counts()
    value_counts_parent_edu = merged_df["parental_education"].value_counts()
    value_counts_study_time = merged_df["study_time_group"].value_counts()
    return data_description, value_counts_gender, value_counts_parent_edu, value_counts_study_time

def test_generate_statistics():
    df = pd.DataFrame({
        'sex': ['M', 'F', 'F'],
        'parental_education': [3, 4, 4],
        'study_time_group': ['<5h', '6~10h', '6~10h'],
        'final_score': [2.5, 3.5, 4.0]
    })

    desc, gender_counts, parent_counts, study_counts = generate_statistics(df)
    assert gender_counts['F'] == 2
    assert gender_counts['M'] == 1
    assert parent_counts[4] == 2
    assert parent_counts[3] == 1
    assert study_counts['6~10h'] == 2
    assert study_counts['<5h'] == 1

def standardize_columns(df1, df2, df3, merged_df):
    """
    In this function, it takes df1, df2, df3 and merged_df as arguments.
    Converts the parental education column to Int64 type across all datasets and the merged DataFrame.
    Standardizes the gender column in the merged DataFrame by mapping the codes 1 and 2 to 'Male' and 0 to 'Female',
    and converts parental education to categorical type for grouped analysis.
    Returns the updated datasets and merged DataFrame.
    """
    # Make sure that 'parental_education' has a consistent integer type in all datasets
    df1["parental_education"] = df1["parental_education"].astype(float).astype("Int64")
    df2["parental_education"] = df2["parental_education"].astype(float).astype("Int64")
    df3["parental_education"] = df3["parental_education"].astype(float).astype("Int64")

    # Convert 'parental_education' in the merged DataFrame to the same data type
    merged_df["parental_education"] = merged_df["parental_education"].astype(float).astype("Int64")

    # Replace numeric gender codes with readable labels
    if "sex" in merged_df.columns:
        merged_df["sex"] = merged_df["sex"].replace({1: "Male", 0: "Female", 2: "Male"})  # Standardizing gender labels

    # Convert parental education into categorical to ensure correct ordering
    merged_df["parental_education"] = merged_df["parental_education"].astype("category")
    return df1, df2, df3, merged_df

def filter_study_time_groups(merged_df):
    """
    In this function, it takes merged_df as argument.
    Filters out study time groups with fewer than six observations to improve analysis reliability.
    Returns the filtered DataFrame with only groups that have more than five records.
    """
    # Check the number of students in each study time group
    study_time_counts = merged_df["study_time_group"].value_counts()

    # Filter out study time groups with very few observations.
    # isin checks whether each value appears in a given list or other iterable:
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
    filtered_df = merged_df[merged_df["study_time_group"].isin(study_time_counts[study_time_counts > 5].index)]
    return filtered_df

def plot_graphs(filtered_df):
    """
    In this function, it takes filtered_df as argument.
    Creates multiple visualizations to compare final scores by gender, parental education, and study time groups by using
    boxplots, scatterplots, and lineplots.
    Uses Seaborn and Matplotlib to display relationships between these variables and outputs the plots for interpretation.
    """
    # Boxplot: Achievement differences by gender
    """
    The boxplot comparing final scores by gender shows no significant difference between male and female students.
    The distribution of scores is similar across both groups, suggesting that gender does not play a major role in
    academic performance. The median scores are nearly identical, with slight variations in spread, indicating that
    factors beyond gender are more influential in determining student success.

    """
    plt.figure(figsize=(8, 6))
    sns.boxplot(x="sex", y="final_score", data=filtered_df)
    plt.title("Achievement Differences by Gender")
    plt.xlabel("Gender")
    plt.ylabel("Final Score")
    plt.show()

    # Boxplot: Achievement differences by parental education level
    """
    Students with higher parental education levels tend to have slightly higher final scores,
    but the effect is not strongly linear. While the median scores increase slightly with higher
    parental education, the spread of scores suggests high variability within each group.
    This indicates that parental education may contribute to academic performance but is not the sole
    determining factor—other influences such as school environment, personal motivation, and support systems likely play a role.
    """

    plt.figure(figsize=(10, 6))
    sns.boxplot(x="parental_education", y="final_score", data=filtered_df)
    plt.title("Achievement Differences by Parental Education Level")
    plt.xlabel("Parental Education Level")
    plt.ylabel("Final Score")
    plt.xticks(rotation=45)
    plt.show()

    # Scatterplot: Relationship between study time, parental education level, and achievement
    """
    The scatter plot relates study time, parental education, and final scores.
    It shows that students who study more tend to have slightly higher scores, but the
    difference is not drastic. Additionally, the impact of study time varies across different
    parental education levels, reinforcing the idea that study habits alone do not guarantee
    success. Other contributing factors such as teaching quality, learning strategies, and
    personal discipline likely shape academic outcomes.
    """

    plt.figure(figsize=(10, 6))
    sns.scatterplot(x="parental_education", y="final_score", hue="study_time_group", data=filtered_df, alpha=0.7)
    plt.title("Relationship Between Parental Education, Study Time, and Achievement")
    plt.xlabel("Parental Education Level")
    plt.ylabel("Final Score")
    plt.legend(title="Study Time Group")
    plt.xticks(rotation=45)
    plt.show()

    # Line plot: mean final score by parental education level, one line per study time group
    plt.figure(figsize=(10, 6))
    sns.lineplot(
        data=filtered_df,
        x="parental_education",
        y="final_score",
        hue="study_time_group"
    )
    plt.title("Improved Interaction Between Study Time and Parental Education on Final Grades")
    plt.xlabel("Parental Education Level")
    plt.ylabel("Final Score")
    plt.legend(title="Study Time Group")
    plt.xticks(rotation=45)
    plt.show()

# Run the loading, cleaning, merging, and plotting pipeline
df1, df2, df3 = load_and_clean_data()
df1, df2, df3 = process_study_time(df1, df2, df3)
merged_df = merge_datasets(df1, df2, df3)
df1, df2, df3, merged_df = standardize_columns(df1, df2, df3, merged_df)
filtered_df = filter_study_time_groups(merged_df)
plot_graphs(filtered_df)

def preprocess_data(merged_df):
  """
  This function encodes categorical variables and splits the dataset into training and testing sets.
  With parameter merged_df (DataFrame), the cleaned and merged dataset containing student performance data.
  Returns training and testing sets for features and target variables.
  """
  # merge_datasets keeps "study_time_group" (not "study_time"), so use that column as the feature
  features = ["study_time_group", "parental_education", "sex", "Student Age"]
  # One-hot encode the non-numeric columns; drop_first avoids a redundant dummy column
  X = pd.get_dummies(merged_df[features], drop_first = True)
  y = merged_df["final_score"]

  return train_test_split(X, y, test_size = 0.2, random_state = 42)

def train_and_evaluate_models(X_train, X_test, y_train, y_test):
  """
  This function trains and evaluates machine learning models.
  With parameters:
  X_train, X_test, y_train, y_test - training and testing sets
  returns:
  result (dict): Model performance metrics (MAE, MSE, R²) for each model
  trained_models (dict): Dictionary of trained models
  """
  models = {
      #"Linear Regression": LinearRegression(),
      #"Decision Tree": DecisionTreeRegressor(max_depth = 5, min_samples_split=5)
      "Random Forest": RandomForestRegressor(n_estimators = 100, random_state = 42),
      "Support Vector Regression": SVR(kernel = "linear")
  }

  result = {}
  trained_models = {}

  for model_name, model in models.items():

    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    result[model_name] = {"MAE": mae, "MSE": mse, "R²": r2}
    trained_models[model_name] = model

    # Plot observed vs. predicted final scores for this model
    visualize_predictions(y_test, y_pred, model_name)

  return result, trained_models

def visualize_predictions(y_test, y_pred, model_name):
  """
  This function generates a scatter plot comparing actual vs. predicted final scores.
  With parameters:
  y_test(Series): Actual final scores.
  y_pred(array): Predicted final scores.
  model_name(str): Name of the model used for prediction.
  """
  plt.figure(figsize = (8, 6))
  sns.scatterplot(x=y_test, y=y_pred, alpha = 0.7)
  plt.xlabel("Observed Final Score")
  plt.ylabel("Predicted Final Score")
  plt.title(model_name + ": Observed vs. Predicted Final Scores")
  plt.axline((0,0), slope = 1, color = "black", linestyle = "--")
  plt.show()

def feature_importance_analysis(model, X_train):
  """
  This function analyzes and visualizes feature importance for the Random Forest model.
  With parameters:
  model(RandomForestRegressor): Trained Random Forest model
  X_train(DataFrame): Training feature set.
  """
  if isinstance(model, RandomForestRegressor):
    importance = model.feature_importances_
    feature_names = X_train.columns

    plt.figure(figsize = (8,6))
    sns.barplot(x=importance, y=feature_names)
    plt.xlabel("Feature Importance Score")
    plt.ylabel("Features")
    plt.title("Feature Importance in Predicting Final Grades")
    plt.show()

#training models after dataset processing
df1, df2, df3 = load_and_clean_data()
df1, df2, df3 = process_study_time(df1, df2, df3)
merged_df = merge_datasets(df1, df2, df3)
filtered_df = filter_study_time_groups(merged_df)
X_train, X_test, y_train, y_test = preprocess_data(filtered_df)
results, trained_models = train_and_evaluate_models(X_train, X_test, y_train, y_test)

#print results
for model, metrics in results.items():
  print(model +" Performance:")
  for metric, value in metrics.items():
    print(metric +  " = " + str(round(value, 4)))
  print()

#Analyze feature importance (only for Random Forest)
feature_importance_analysis(trained_models["Random Forest"], X_train)

#Visualize the Decision Tree
#plt.figure(dpi = 300)
#plot_tree(
    #models["Decision Tree"],
    #feature_names = X_train.columns,
    #filled = True,
    #impurity = False,
    #proportion = True,
    #rounded = True,
    #max_depth = 2,
    #fontsize = 5
#)
#plt.show()



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/CSE163_project/Student_performance_data_.csv'

# New Section

In [None]:
filtered_df

NameError: name 'filter_study_time_groups' is not defined