# **Synthetic Learning Behavior Analysis: Transform**

## Objectives

* By the end of the transformation phase, I will:
    1. Encode and transform features.
    2. Run statistical tests and validate hypotheses.
    3. Visualize results and build a dashboard for communication.
    4. Build a model that is ready for real-world use.


## Inputs

* [Task outline](https://code-institute-org.github.io/5P-Assessments-Handbook/da-ai-bootcamp-capstone-prelims.html)
* Extract phase
* personalized_learning_dataset_copy.csv 


## Outputs

* Transformed dataset.
* Statistical tests that show whether features are associated.
* PowerBI Dashboard.
* Logistic Regression and ML Model 

---

# Import key libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from feature_engine.encoding import OneHotEncoder
from sklearn.pipeline import Pipeline
import pingouin as pg # pingouin must be installed first; it provides the statistical tests.

# Data upload

In [3]:
df = pd.read_csv("../data/transformed_data/personalized_learning_dataset_transformed.csv")
df.head(5)

Unnamed: 0,Age,Education_Level,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Feedback_Score,Dropout_Likelihood,Gender_Female,Gender_Male,Course_Name_Machine Learning,Course_Name_Python Basics,Course_Name_Data Science,Course_Name_Web Development,Learning_Style_Visual,Learning_Style_Reading/Writing,Learning_Style_Kinesthetic
0,15,0,171,4,67,2,89,1,51,1,0,1,0,1,0,0,0,1,0,0
1,49,1,156,4,64,0,94,1,92,5,0,0,1,0,1,0,0,0,1,0
2,20,1,217,2,55,2,67,1,45,1,0,1,0,0,1,0,0,0,1,0
3,37,1,489,1,65,43,60,2,59,4,0,1,0,0,0,1,0,1,0,0
4,34,2,496,3,59,34,88,1,93,3,0,1,0,0,1,0,0,1,0,0


---

# Statistical tests

From the Extract phase, we know that the synthetic dataset has features that are not normally distributed. Let me confirm that here.

In [4]:
pg.normality(data = df.sample(n = 5000), alpha = 0.05)
# Checking for normality on a sample. As the original dataset has 10,000 rows, I am using a smaller sample of 5,000 for testing.

Unnamed: 0,W,pval,normal
Age,0.953077,2.492659e-37,False
Education_Level,0.804485,3.676601e-61,False
Time_Spent_on_Videos,0.956831,3.994997e-36,False
Quiz_Attempts,0.855167,1.200496e-55,False
Quiz_Scores,0.954341,6.2162249999999995e-37,False
Forum_Participation,0.956378,2.8310969999999998e-36,False
Assignment_Completion_Rate,0.953736,4.003278e-37,False
Engagement_Level,0.805101,4.215336e-61,False
Final_Exam_Score,0.957164,5.1594409999999996e-36,False
Feedback_Score,0.886264,2.117744e-51,False


The observation is in line with what I learnt from the Extract phase. None of the features are normally distributed.
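
The `W` column in the output suggests a Shapiro-Wilk test. As a minimal sketch of how that test behaves, the block below runs `scipy.stats.shapiro` on two hypothetical features of 5,000 values each. Both samples are generated here for illustration only and are not taken from the project dataset. SciPy notes that Shapiro-Wilk p-values may be inaccurate above 5,000 observations, which is one reason to test on a sample of that size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical features, generated only for illustration.
uniform_feature = rng.uniform(15, 50, size=5000)
normal_feature = rng.normal(loc=32, scale=5, size=5000)

for name, values in [("uniform", uniform_feature), ("normal", normal_feature)]:
    w_stat, p_value = stats.shapiro(values)
    print(f"{name:>8}: W={w_stat:.3f}, p={p_value:.3g}, normal={p_value > 0.05}")

# Keep the uniform result for inspection.
w_uniform, p_uniform = stats.shapiro(uniform_feature)
```

A flat, uniform-looking feature is rejected as non-normal even though it has no outliers, so a `False` in the `normal` column says nothing about the shape beyond "not a bell curve".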

## Statistical method: Justification

As none of the features are normally distributed, I will use non-parametric tests. The specific test will depend on the hypothesis being tested.

Here are a couple of non-parametric tests:
* Mann-Whitney U-Test
* Kruskal-Wallis Test
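
The choice between the two tests listed above usually comes down to the number of independent groups: Mann-Whitney U for two groups, Kruskal-Wallis for three or more. The sketch below illustrates that rule with SciPy. The helper `compare_groups` and the sample data are hypothetical and are not part of this notebook.

```python
import numpy as np
from scipy import stats

def compare_groups(groups, alpha=0.05):
    """Pick a rank-based test from the number of independent groups."""
    if len(groups) < 2:
        raise ValueError("Need at least two groups to compare.")
    if len(groups) == 2:
        name = "Mann-Whitney U"
        statistic, p_value = stats.mannwhitneyu(*groups, alternative="two-sided")
    else:
        name = "Kruskal-Wallis"
        statistic, p_value = stats.kruskal(*groups)
    return {"test": name,
            "statistic": float(statistic),
            "p_value": float(p_value),
            "reject_null": bool(p_value < alpha)}

rng = np.random.default_rng(42)
# Hypothetical skewed samples; these are not drawn from the project dataset.
two_groups = [rng.exponential(1.0, 200), rng.exponential(1.0, 200)]
three_groups = [rng.exponential(scale, 200) for scale in (1.0, 1.0, 3.0)]

result_two = compare_groups(two_groups)
result_three = compare_groups(three_groups)
print(result_two["test"], round(result_two["p_value"], 3))
print(result_three["test"], round(result_three["p_value"], 3))
```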

---

# Business requirement #1: Learner clusters

User story: As a digital learning service provider, we want to group learners and enable adaptive learning experiences, so that we engage better with the existing users.

In [5]:
pip install nbformat

Note: you may need to restart the kernel to use updated packages.


Freezing the requirement.txt in the terminal now.

# Business requirement #2: Dropout likelihood

User story: As a program manager, I want to be able to predict dropout probability, so that we can engage with high-risk users.

**Hypotheses:**

2.1. Learning style impacts dropout likelihood

2.2. Course choice impacts dropout likelihood

2.3. Time spent on videos impacts dropout likelihood

## 2.1. Learning style impacts dropout likelihood

**Note**: I am testing two categorical features here. The Chi-Squared test works on categorical variables stored as object-type data, not just on integer codes, so I will reuse the dataset from the pre-transformation phase and run a Chi-Squared test of independence.
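
To make the test less of a black box, here is a small worked example of the Pearson Chi-Squared statistic in plain Python. The 2x2 counts are hypothetical and no continuity correction is applied, so a library call on the same table may report a slightly different value.

```python
# Hypothetical 2x2 counts: rows are two groups, columns are the outcomes "No" / "Yes".
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic, without any continuity correction.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(round(chi2, 4), dof)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.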

In [6]:
df_old = pd.read_csv("../data/copied_data/personalized_learning_dataset_copy.csv")
df_old.head(5)

Unnamed: 0,Student_ID,Age,Gender,Education_Level,Course_Name,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Learning_Style,Feedback_Score,Dropout_Likelihood
0,S00001,15,Female,High School,Machine Learning,171,4,67,2,89,Medium,51,Visual,1,No
1,S00002,49,Male,Undergraduate,Python Basics,156,4,64,0,94,Medium,92,Reading/Writing,5,No
2,S00003,20,Female,Undergraduate,Python Basics,217,2,55,2,67,Medium,45,Reading/Writing,1,No
3,S00004,37,Female,Undergraduate,Data Science,489,1,65,43,60,High,59,Visual,4,No
4,S00005,34,Female,Postgraduate,Python Basics,496,3,59,34,88,Medium,93,Visual,3,No


In [7]:
# pg.chi2_independence returns the expected counts first, then the observed counts.
expected, observed, stats = pg.chi2_independence(data = df_old,
                                                 x = "Learning_Style",
                                                 y = "Dropout_Likelihood")

stats

Unnamed: 0,test,lambda,chi2,dof,pval,cramer,power
0,pearson,1.0,0.3039,3.0,0.959293,0.005513,0.068472
1,cressie-read,0.666667,0.303654,3.0,0.95934,0.00551,0.068457
2,log-likelihood,0.0,0.303165,3.0,0.959432,0.005506,0.068426
3,freeman-tukey,-0.5,0.302801,3.0,0.959501,0.005503,0.068403
4,mod-log-likelihood,-1.0,0.302439,3.0,0.959569,0.005499,0.06838
5,neyman,-2.0,0.301725,3.0,0.959704,0.005493,0.068335


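With p ≈ 0.96 (well above 0.05) and Cramér's V ≈ 0.006, I fail to reject the null hypothesis: there is no evidence that learning style is associated with dropout likelihood.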
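
The `cramer` column can be reproduced by hand from the Pearson `chi2` value. The sketch below assumes n = 10,000 rows (the dataset size mentioned in the normality check) and a 4 x 2 table, which is inferred from `dof = 3`. The helper name `cramers_v` is my own.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: the chi-squared statistic rescaled to the 0..1 range."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# chi2 is the Pearson value from the table above.
# n and the table shape are assumptions stated in the lead-in.
v = cramers_v(chi2=0.3039, n=10_000, n_rows=4, n_cols=2)
print(round(v, 6))
```

A value this close to 0 means the association is negligible in size, not just statistically non-significant.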

In [13]:
contingency_table = pd.crosstab(df_old["Learning_Style"],
                                df_old["Dropout_Likelihood"],
                                normalize = "index")
#Creating a contingency table to plot categorical variables.

contingency_table_melted = contingency_table.reset_index().melt(
    id_vars = "Learning_Style",
    var_name = "Dropout_Likelihood",
    value_name = "Proportion"
) #Melting the contingency table for easier plotting.

fig = px.bar(
    data_frame = contingency_table_melted,
    x = "Learning_Style",
    y = "Proportion",
    color = "Dropout_Likelihood",
    barmode = "group",
    title = "Proportion of Dropout Likelihood by Learning Style")

fig.update_layout(
    xaxis_title = "Learning Style",
    yaxis_title = "Proportion",
    legend_title = "Dropout Likelihood")

fig.show() #Visualizing the relationship between Learning Style and Dropout Likelihood using a bar plot.






## 2.2. Course choice impacts dropout likelihood

In [8]:
expected, observed, stats = pg.chi2_independence(data = df_old,
                                                 x = "Course_Name",
                                                 y = "Dropout_Likelihood")

stats

Unnamed: 0,test,lambda,chi2,dof,pval,cramer,power
0,pearson,1.0,5.037829,4.0,0.283438,0.022445,0.399038
1,cressie-read,0.666667,5.034226,4.0,0.283804,0.022437,0.398767
2,log-likelihood,0.0,5.028061,4.0,0.28443,0.022423,0.398302
3,freeman-tukey,-0.5,5.024344,4.0,0.284809,0.022415,0.398022
4,mod-log-likelihood,-1.0,5.021402,4.0,0.285109,0.022408,0.397801
5,neyman,-2.0,5.017838,4.0,0.285472,0.022401,0.397532


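With p ≈ 0.28 (above 0.05) and Cramér's V ≈ 0.02, I fail to reject the null hypothesis: there is no evidence that course choice is associated with dropout likelihood.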

In [14]:
contingency_table_course = pd.crosstab(df_old["Course_Name"],
    df_old["Dropout_Likelihood"],
    normalize = "index")
# Creating a contingency table for Course_Name vs Dropout_Likelihood.

contingency_table_course_melted = contingency_table_course.reset_index().melt(
    id_vars = "Course_Name",
    var_name = "Dropout_Likelihood",
    value_name = "Proportion"
) # Melting the contingency table for easier plotting.

fig = px.bar(
    data_frame = contingency_table_course_melted,
    x = "Course_Name",
    y = "Proportion",
    color = "Dropout_Likelihood",
    barmode = "group",
    title = "Proportion of Dropout Likelihood by Course Choice"
)

fig.update_layout(
    xaxis_title = "Course Name",
    yaxis_title = "Proportion",
    legend_title = "Dropout Likelihood"
)

fig.show() # Visualizing the relationship between Course Choice and Dropout Likelihood using a bar plot.





I used GitHub Copilot to generate code similar to what I wrote for 2.1.

## 2.3. Time spent on videos impacts dropout likelihood

**Note:** This hypothesis involves a continuous variable (minutes spent on videos) and a categorical variable (dropout likelihood). I will use the Mann-Whitney U test to compare the two dropout groups.

In [None]:
pg.mwu(x = df_old["Time_Spent_on_Videos"], y = df_old["Dropout_Likelihood"])
# This call will not work: pg.mwu expects two numeric samples, so Time_Spent_on_Videos must first be split into one sample per dropout group.

In [10]:
group_yes = df_old[df_old["Dropout_Likelihood"] == "Yes"]["Time_Spent_on_Videos"]
group_no = df_old[df_old["Dropout_Likelihood"] == "No"]["Time_Spent_on_Videos"]


pg.mwu(x = group_yes, y = group_no)

Unnamed: 0,U-val,alternative,p-val,RBC,CLES
MWU,7943901.0,two-sided,0.519207,0.009381,0.50469
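
The two effect sizes in the output can be recovered from `U-val` and the group sizes. In the sketch below the group sizes are an assumption: they are derived from the 80.43% "No" share of 10,000 rows reported elsewhere in this notebook. The formulas used are the common ones (CLES as U divided by the number of pairs, RBC as 2 * CLES - 1), and they reproduce the reported values here.

```python
# Assumed group sizes, derived from the 80.43% "No" share of 10,000 rows.
n_no = 8043
n_yes = 10_000 - n_no

u_value = 7_943_901.0  # U-val from the Mann-Whitney output

# Common-language effect size: U divided by the number of (yes, no) pairs.
cles = u_value / (n_yes * n_no)

# Rank-biserial correlation: the same information on a -1..1 scale.
rbc = 2 * cles - 1

print(round(cles, 5), round(rbc, 6))
```

A CLES of about 0.5 means that a randomly chosen learner from one group watched longer than a randomly chosen learner from the other group roughly half the time, which is what no difference looks like.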


In [11]:
fig = px.box(data_frame = df_old,
             x = "Time_Spent_on_Videos",
             y = "Dropout_Likelihood",
             color = "Dropout_Likelihood",
             title = "Box Plot of Time Spent on Videos by Dropout Likelihood",
             labels = {"Time_Spent_on_Videos": "Time Spent on Videos (minutes)",
                       "Dropout_Likelihood": "Dropout Likelihood"
                       },
             width = 1000,
             height = 600
    )
fig.show() #Visualizing the relationship between Time_Spent_on_Videos and Dropout_Likelihood using a box plot.

#I'm ignoring the FutureWarning for now.

  sf: grouped.get_group(s if len(s) > 1 else s[0])


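With p ≈ 0.52 (above 0.05), I fail to reject the null hypothesis: there is no evidence that time spent on videos differs between the two dropout groups.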

---

# Business requirement #2: Logistic Regression for dropout prediction

In [17]:
from sklearn.model_selection import train_test_split # Importing train_test_split for splitting the dataset

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(["Dropout_Likelihood"], axis=1),
    df["Dropout_Likelihood"],
    test_size = 0.3,
    random_state = 42
)

print("Training set size: ",
      X_train.shape, 
      y_train.shape,
      "\n Testing set size: ",
      X_test.shape,
      y_test.shape)


Training set size:  (7000, 19) (7000,) 
 Testing set size:  (3000, 19) (3000,)
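
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in the training and testing sets. The sketch below uses the `stratify` argument of `train_test_split` on hypothetical data. The `_demo` names and the random features are illustrative only and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: 10,000 rows, 3 random features, about 20% positive labels.
X_demo = rng.normal(size=(10_000, 3))
y_demo = (rng.random(10_000) < 0.2).astype(int)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=42,
    stratify=y_demo,  # keep the class ratio the same in both splits
)

print("Train:", X_train_demo.shape, y_train_demo.shape,
      "positive share:", round(y_train_demo.mean(), 4))
print("Test: ", X_test_demo.shape, y_test_demo.shape,
      "positive share:", round(y_test_demo.mean(), 4))
```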


In [32]:
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler #Feature scaling (standardizes each feature to zero mean and unit variance).

from sklearn.feature_selection import SelectFromModel #Helps the model select the most relevant features.

from sklearn.linear_model import LogisticRegression #This will be the final step in modeling.

#Importing the necessary libraries to create a model.

def dropout_prediction_log_reg():
    """This function creates a pipeline to predict student dropout.
    I am using a simple logistic regression model as I only need to predict a binary outcome."""
    pipeline = Pipeline([
        ("feature_scaling", StandardScaler()),
        ("feature_selection", SelectFromModel(LogisticRegression(
            class_weight="balanced", #This helps to handle class imbalance in the dataset.
            random_state=42))),
        ("model", LogisticRegression(
            class_weight="balanced",
            random_state=42))
    ])

    return pipeline
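
As a quick illustration of the scaling step, the sketch below applies `StandardScaler` to the first five `Time_Spent_on_Videos` values from the data preview and compares the result with `MinMaxScaler`. It is a standalone example and is not part of the pipeline above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# First five Time_Spent_on_Videos values from the data preview.
minutes = np.array([[171.0], [156.0], [217.0], [489.0], [496.0]])

standardized = StandardScaler().fit_transform(minutes)
min_maxed = MinMaxScaler().fit_transform(minutes)

# StandardScaler centers on 0 with unit variance, so values can be negative.
print("StandardScaler mean/std:", standardized.mean().round(6), standardized.std().round(6))
print("StandardScaler range:   ", standardized.min().round(3), standardized.max().round(3))

# MinMaxScaler is the scaler that squeezes values into the 0..1 range.
print("MinMaxScaler range:     ", min_maxed.min().round(3), min_maxed.max().round(3))
```

Standardized inputs also make the logistic regression coefficients comparable across features, since each one then refers to a change of one standard deviation.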

**Lessons learned:** 
Because the dropout likelihood is overwhelmingly "No" (80.43%), the dataset is imbalanced. Through iterating with ChatGPT, I understood that there is a way to compensate for this. Setting `class_weight="balanced"` weights each class inversely to its frequency, so the model is penalized more heavily for misclassifying the rarer "Yes" cases.
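
To show what the "balanced" setting does numerically, the sketch below applies the heuristic documented by scikit-learn, `n_samples / (n_classes * class_count)`, to the class counts of the training split. The counts are taken from the `support` column of the training classification report in this notebook, and the classes are named by size to keep the arithmetic independent of label names.

```python
# Class counts in the training split, read from the "support" column
# of the training classification report.
class_counts = {"majority": 5612, "minority": 1388}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * class_count).
weights = {label: n_samples / (n_classes * count)
           for label, count in class_counts.items()}

for label, weight in weights.items():
    print(f"{label}: weight = {weight:.3f}")
```

Each minority-class error therefore counts roughly four times as much as a majority-class error during training.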

In [33]:
pipeline = dropout_prediction_log_reg()
pipeline.fit(X_train, y_train) #Fitting the model for training.

I have finished training the model. Before evaluating its performance, I will check the model coefficients.

In [34]:
def log_reg_coef(model, columns):
    """I will extract the coefficients of the logistic regression model to
        understand the impact of each feature on dropout likelihood."""
    coeff_df = pd.DataFrame(
        model.coef_,
        index = ["Coefficient"],
        columns = columns).T.sort_values(by = ["Coefficient"], key = abs,
                                         ascending = False)
    print(coeff_df)

In [35]:
log_reg_coef(model = pipeline["model"],
             columns = X_train.columns[pipeline["feature_selection"]
                                       .get_support()])

                                Coefficient
Gender_Male                        0.158434
Gender_Female                      0.119374
Course_Name_Web Development       -0.050082
Engagement_Level                   0.047610
Learning_Style_Visual             -0.044148
Learning_Style_Reading/Writing    -0.034814


These coefficients reflect the synthetic sample data. With real data, I would expect features such as engagement level, education level, or feedback score to carry larger coefficients.

When actual data is available, the model needs to be updated and rerun, but the basic logic would not change. As I set out to do, I have created a plug-and-play model that is designed to be refit on actual data.
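
One way to read the printed coefficients is to convert them to odds ratios with `exp(coefficient)`. The sketch below does this for the values in the output, assuming that class 1 encodes "Yes" and that the features were standardized by the pipeline, so each ratio refers to a change of one standard deviation.

```python
import math

# Coefficients copied from the printed output.
coefficients = {
    "Gender_Male": 0.158434,
    "Gender_Female": 0.119374,
    "Course_Name_Web Development": -0.050082,
    "Engagement_Level": 0.047610,
    "Learning_Style_Visual": -0.044148,
    "Learning_Style_Reading/Writing": -0.034814,
}

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-standard-deviation increase in the scaled feature.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<32} {ratio:.3f}")
```

All ratios sit close to 1, which is consistent with the hypothesis tests: none of these features shifts the odds of dropout by much in the synthetic data.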

In [36]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def log_reg_report(X, y, pipeline, label_map):
    """ This function will:
     1. Predict the outcome (dropout class)
     2. Evaluate model performance through a classification report and a confusion matrix"""
    
    prediction = pipeline.predict(X)

    print("### Classification Report ###")
    print(classification_report(y, prediction, target_names = label_map), "\n")

    print("\n")

    print("--- Confusion Matrix ---")
    print(pd.DataFrame(
        confusion_matrix(y_true = prediction, y_pred = y),
        columns = [["Actual " + sub for sub in label_map]],
        index = [["Predicted " + sub for sub in label_map]]
    ), "\n")

def log_reg_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    """This function will evaluate the model performance on both training and testing sets."""
    print("### Training Set Performance ### \n")
    log_reg_report(X_train, y_train, pipeline, label_map)

    print("### Testing Set Performance ### \n")
    log_reg_report(X_test, y_test, pipeline, label_map)


In [37]:
log_reg_performance(
    X_train = X_train,
    y_train = y_train,
    X_test = X_test,
    y_test = y_test,
    pipeline = pipeline,
    label_map = ["no", "yes"] # Labels follow the sorted class order: 0 = no, 1 = yes.
)

### Training Set Performance ### 

### Classification Report ###
              precision    recall  f1-score   support

          no       0.82      0.44      0.57      5612
         yes       0.21      0.60      0.31      1388

    accuracy                           0.47      7000
   macro avg       0.51      0.52      0.44      7000
weighted avg       0.70      0.47      0.52      7000
 



--- Confusion Matrix ---
              Actual no Actual yes
Predicted no        2456       556
Predicted yes       3156       832 

### Testing Set Performance ### 

### Classification Report ###
              precision    recall  f1-score   support

          no       0.80      0.44      0.57      2431
         yes       0.18      0.54      0.27       569

    accuracy                           0.46      3000
   macro avg       0.49      0.49      0.42      3000
weighted avg       0.69      0.46      0.51      3000
 



--- Confusion Matrix ---
              Actual no Actual yes
Predicted no    

**Code credit:** [LMS](https://learn.codeinstitute.net/courses/course-v1:CodeInstitute+ADAT+3/courseware/5d589ae79ce74ea091f4580abb8b3f0c/39c88ed9a910406fb631fa4420c10d45/), retyped the code from the LMS.

**Issue**: The logistic regression model has run into a problem. Because the dataset is imbalanced (80.43 percent of learners are labelled as unlikely to drop out), I need a more robust model.

I added `class_weight="balanced"` to deal with the imbalanced data. I no longer run into the following UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.

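Accuracy is about 0.47 on the training set and 0.46 on the testing set, so the model is underfitting rather than overfitting. I will try another model.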
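
As a check on the reports above, the training confusion matrix can be turned back into the headline metrics. The sketch below names the classes by size (majority with support 5612, minority with support 1388) rather than by label, because only the counts matter for the arithmetic. It also adds the accuracy of a trivial baseline that always predicts the majority class, which is my own addition for comparison.

```python
# Counts from the training confusion matrix.
maj_correct, maj_wrong = 2456, 3156   # majority class (support 5612)
min_correct, min_wrong = 832, 556     # minority class (support 1388)

total = maj_correct + maj_wrong + min_correct + min_wrong

accuracy = (maj_correct + min_correct) / total
minority_recall = min_correct / (min_correct + min_wrong)
# Everything predicted as minority: true minority plus misclassified majority.
minority_precision = min_correct / (min_correct + maj_wrong)
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = (maj_correct + maj_wrong) / total

print(f"accuracy           = {accuracy:.2f}")
print(f"minority recall    = {minority_recall:.2f}")
print(f"minority precision = {minority_precision:.2f}")
print(f"baseline accuracy  = {baseline_accuracy:.2f}")
```

The model's accuracy is well below the majority-class baseline, so with imbalanced data the per-class recall and precision are more informative than accuracy alone.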

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [38]:
from sklearn.model_selection import train_test_split # Importing train_test_split for splitting the dataset

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(["Dropout_Likelihood"], axis=1),
    df["Dropout_Likelihood"],
    test_size = 0.3,
    random_state = 42
)

print("Training set size: ",
      X_train.shape, 
      y_train.shape,
      "\n Testing set size: ",
      X_test.shape,
      y_test.shape)


Training set size:  (7000, 19) (7000,) 
 Testing set size:  (3000, 19) (3000,)


In [40]:
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectFromModel #Helps the model select the most relevant features.

from sklearn.ensemble import RandomForestClassifier #Trying classification with a different model.

#Importing the necessary libraries to create a model.

def dropout_prediction_random_forest():
    """This function creates a pipeline to predict student dropout."""
    pipeline = Pipeline([
        ("feature_selection", SelectFromModel(RandomForestClassifier(
            class_weight="balanced",
            random_state=42),
            threshold="median")), #Added threshold to use features only above the median importance.
        ("model", RandomForestClassifier(
            class_weight="balanced",
            random_state=42))
    ])

    return pipeline

In [41]:
pipeline = dropout_prediction_random_forest()
pipeline.fit(X_train, y_train) #Fitting the model for training.
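
A minimal sketch of how a pipeline like this could be inspected after fitting is shown below. It runs on hypothetical data from `make_classification` (about 80/20 class balance, 19 features), uses `_demo` names so it does not overwrite the notebook's variables, and reports how many features the median threshold keeps along with the balanced accuracy. The smaller `n_estimators` value and the choice of balanced accuracy as the metric are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=2000, n_features=19,
                                     n_informative=5, weights=[0.8, 0.2],
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

demo_pipeline = Pipeline([
    ("feature_selection", SelectFromModel(
        RandomForestClassifier(n_estimators=50, class_weight="balanced",
                               random_state=42),
        threshold="median")),  # keep features at or above the median importance
    ("model", RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                     random_state=42)),
])
demo_pipeline.fit(X_tr, y_tr)

# Boolean mask of the features that survived the selection step.
kept = demo_pipeline["feature_selection"].get_support()
n_kept = int(kept.sum())

# Balanced accuracy averages recall over classes, so the majority class
# cannot dominate the score.
score = balanced_accuracy_score(y_te, demo_pipeline.predict(X_te))

print("Features kept:", n_kept, "of", X_demo.shape[1])
print("Balanced accuracy on the test split:", round(score, 3))
```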

# Challenges