# **Synthetic Learning Behavior Analysis: Transform**

## Objectives

* By the end of the transformation phase, I will:
    1. Encode and transform features.
    2. Run statistical tests and validate hypotheses.
    3. Visualize results and build a dashboard for communication.
    4. Build a model that is ready for real-world use.


## Inputs

* [Task outline](https://code-institute-org.github.io/5P-Assessments-Handbook/da-ai-bootcamp-capstone-prelims.html)
* Extract phase
* personalized_learning_dataset_copy.csv 


## Outputs

* Transformed dataset.
* Statistical tests that assess how features interact.
* PowerBI Dashboard.
* Logistic Regression and ML Model 

---

# Import key libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from feature_engine.encoding import OneHotEncoder
from sklearn.pipeline import Pipeline
import pingouin as pg  # pingouin must be installed first; it provides the statistical tests.

# Data upload

In [25]:
df = pd.read_csv("../data/transformed_data/personalized_learning_dataset_transformed.csv")
df.head(5)

Unnamed: 0,Age,Education_Level,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Feedback_Score,Dropout_Likelihood,Gender_Female,Gender_Male,Course_Name_Machine Learning,Course_Name_Python Basics,Course_Name_Data Science,Course_Name_Web Development,Learning_Style_Visual,Learning_Style_Reading/Writing,Learning_Style_Kinesthetic
0,15,0,171,4,67,2,89,1,51,1,0,1,0,1,0,0,0,1,0,0
1,49,1,156,4,64,0,94,1,92,5,0,0,1,0,1,0,0,0,1,0
2,20,1,217,2,55,2,67,1,45,1,0,1,0,0,1,0,0,0,1,0
3,37,1,489,1,65,43,60,2,59,4,0,1,0,0,0,1,0,1,0,0
4,34,2,496,3,59,34,88,1,93,3,0,1,0,0,1,0,0,1,0,0


---

# Statistical tests

From the Extract phase, we know that the features in the synthetic dataset are not normally distributed. Let me confirm that here.

In [4]:
pg.normality(data=df.sample(n=5000), alpha=0.05)
# Check normality on a random sample of 5,000 rows. The original dataset has 10,000 rows, so I am testing on a smaller sample.

Unnamed: 0,W,pval,normal
Age,0.952445,1.588994e-37,False
Education_Level,0.803199,2.766548e-61,False
Time_Spent_on_Videos,0.95553,1.4960169999999999e-36,False
Quiz_Attempts,0.850966,3.677029e-56,False
Quiz_Scores,0.953115,2.561093e-37,False
Forum_Participation,0.955024,1.026642e-36,False
Assignment_Completion_Rate,0.951869,1.059617e-37,False
Engagement_Level,0.805198,4.306917e-61,False
Final_Exam_Score,0.955718,1.721713e-36,False
Feedback_Score,0.887276,3.016777e-51,False


The observation is in line with what I learnt from the Extract phase. None of the features are normally distributed.
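`pg.normality` runs the Shapiro-Wilk test by default, which is where the `W` column comes from. The sketch below repeats that check on a single column with SciPy directly. The data is a synthetic, uniformly distributed sample that I made up as a stand-in for a feature such as `Age`; it is not the real dataset. SciPy also notes that Shapiro-Wilk p-values may be inaccurate above 5,000 observations, which is one more reason to test on a sample of that size.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for one feature: 5,000 uniformly distributed integers.
rng = np.random.default_rng(seed=0)
sample = rng.integers(low=15, high=50, size=5000)

# Shapiro-Wilk returns the W statistic and a p-value.
w_stat, p_value = stats.shapiro(sample)

# Same decision rule as the `normal` column: treat as normal only if p > alpha.
alpha = 0.05
is_normal = p_value > alpha
print(f"W={w_stat:.4f}, p={p_value:.3g}, normal={is_normal}")
```

Passing `random_state` to `df.sample` would make the sampled rows, and therefore the reported statistics, reproducible between runs.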

## Statistical method: Justification

As all the features are non-normally distributed, I will need non-parametric tests. The actual test will depend on the hypothesis being tested.

Here are a couple of non-parametric tests:
* Mann-Whitney U-Test
* Kruskal-Wallis Test
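As a rule of thumb, the Mann-Whitney U-Test compares a numeric variable between two independent groups, while the Kruskal-Wallis Test extends the comparison to three or more groups. Here is a minimal sketch of the Kruskal-Wallis case using SciPy. The three groups and their values are hypothetical toy numbers, not taken from the dataset.

```python
from scipy import stats

# Hypothetical minutes spent on videos for three engagement groups (toy values).
low = [110, 125, 130, 142, 150]
medium = [210, 225, 240, 255, 270]
high = [410, 430, 455, 470, 490]

# Kruskal-Wallis compares the rank distributions of all groups at once.
h_stat, p_value = stats.kruskal(low, medium, high)

alpha = 0.05
print(f"H={h_stat:.2f}, p={p_value:.4f}, reject_null={p_value < alpha}")
```

Because the test works on ranks rather than raw values, it does not require the normality assumption that the features above fail.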

---

# Business requirement #1: Learner clusters

User story: As a digital learning service provider, we want to group learners and enable adaptive learning experiences, so that we engage better with the existing users.

In [None]:
pip install nbformat

I am freezing requirement.txt in the terminal now.

# Business requirement #2: Dropout likelihood

User story: As a program manager, I want to be able to predict dropout probability, so that we can engage with high-risk users.

**Hypotheses:**

2.1. Learning style impacts dropout likelihood

2.2. Course choice impacts dropout likelihood

2.3. Time spent on videos impacts dropout likelihood

## 2.1. Learning style impacts dropout likelihood

**Note**: I am testing two categorical features here. The Chi-Squared Test works on categorical variables stored as object-type (string) data, not just integers. I will therefore reuse the dataset from the pre-transformation phase and run a Chi-Squared Test.

In [26]:
df_old = pd.read_csv("../data/copied_data/personalized_learning_dataset_copy.csv")
df_old.head(5)

Unnamed: 0,Student_ID,Age,Gender,Education_Level,Course_Name,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Learning_Style,Feedback_Score,Dropout_Likelihood
0,S00001,15,Female,High School,Machine Learning,171,4,67,2,89,Medium,51,Visual,1,No
1,S00002,49,Male,Undergraduate,Python Basics,156,4,64,0,94,Medium,92,Reading/Writing,5,No
2,S00003,20,Female,Undergraduate,Python Basics,217,2,55,2,67,Medium,45,Reading/Writing,1,No
3,S00004,37,Female,Undergraduate,Data Science,489,1,65,43,60,High,59,Visual,4,No
4,S00005,34,Female,Postgraduate,Python Basics,496,3,59,34,88,Medium,93,Visual,3,No


In [28]:
# pg.chi2_independence returns (expected, observed, stats), in that order.
expected, observed, stats = pg.chi2_independence(data = df_old,
                                                 x = "Learning_Style",
                                                 y = "Dropout_Likelihood")

stats

Unnamed: 0,test,lambda,chi2,dof,pval,cramer,power
0,pearson,1.0,0.3039,3.0,0.959293,0.005513,0.068472
1,cressie-read,0.666667,0.303654,3.0,0.95934,0.00551,0.068457
2,log-likelihood,0.0,0.303165,3.0,0.959432,0.005506,0.068426
3,freeman-tukey,-0.5,0.302801,3.0,0.959501,0.005503,0.068403
4,mod-log-likelihood,-1.0,0.302439,3.0,0.959569,0.005499,0.06838
5,neyman,-2.0,0.301725,3.0,0.959704,0.005493,0.068335


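**Result:** Fail to reject the null hypothesis. With p ≈ 0.96 (well above 0.05) and Cramér's V ≈ 0.006, the test finds no evidence that learning style is associated with dropout likelihood.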

## 2.2. Course choice impacts dropout likelihood

In [30]:
expected, observed, stats = pg.chi2_independence(data = df_old,
                                                 x = "Course_Name",
                                                 y = "Dropout_Likelihood")

stats

Unnamed: 0,test,lambda,chi2,dof,pval,cramer,power
0,pearson,1.0,5.037829,4.0,0.283438,0.022445,0.399038
1,cressie-read,0.666667,5.034226,4.0,0.283804,0.022437,0.398767
2,log-likelihood,0.0,5.028061,4.0,0.28443,0.022423,0.398302
3,freeman-tukey,-0.5,5.024344,4.0,0.284809,0.022415,0.398022
4,mod-log-likelihood,-1.0,5.021402,4.0,0.285109,0.022408,0.397801
5,neyman,-2.0,5.017838,4.0,0.285472,0.022401,0.397532


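**Result:** Fail to reject the null hypothesis. With p ≈ 0.28 (above 0.05) and Cramér's V ≈ 0.022, the test finds no evidence that course choice is associated with dropout likelihood.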
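The `cramer` column reports Cramér's V, an effect size derived from the chi-squared statistic: V = sqrt(chi2 / (n * min(rows - 1, cols - 1))). The sketch below recomputes it from the Pearson rows of the two tables above. I am assuming n = 10,000 (the dataset size mentioned earlier) and inferring the contingency table shapes from the reported degrees of freedom (4 x 2 for learning style, 5 x 2 for course), so treat those shapes as assumptions.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


n = 10_000  # assumed: dataset size stated earlier in the notebook

# Table shapes are inferred from dof = (rows - 1) * (cols - 1); an assumption.
v_style = cramers_v(0.3039, n, n_rows=4, n_cols=2)     # dof = 3
v_course = cramers_v(5.037829, n, n_rows=5, n_cols=2)  # dof = 4

print(round(v_style, 6), round(v_course, 6))
```

Values this close to zero indicate a negligible association, which is consistent with the large p-values in both tests.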

## 2.3. Time spent on videos impacts dropout likelihood

**Note:** This hypothesis involves a continuous variable (the number of minutes spent on videos) and a two-level categorical variable (dropout likelihood). To compare a continuous variable between two groups without assuming normality, I am using the Mann-Whitney U-Test.

In [32]:
pg.mwu( x = df_old["Time_Spent_on_Videos"], y = df_old["Dropout_Likelihood"])

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
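The `TypeError` most likely arises because `y` receives the string-valued `Dropout_Likelihood` column, whereas the Mann-Whitney U-Test expects two numeric samples, one per group. A minimal sketch of the intended comparison follows: split `Time_Spent_on_Videos` by dropout group, then compare the two samples. The toy DataFrame and its values are hypothetical, and the `"Yes"` label is an assumption, since only `"No"` appears in the preview above. SciPy is used here so the sketch runs on its own.

```python
import pandas as pd
from scipy import stats

# Toy stand-in for df_old (hypothetical values; the "Yes" label is assumed).
toy = pd.DataFrame({
    "Time_Spent_on_Videos": [171, 156, 217, 489, 496, 120, 95, 310],
    "Dropout_Likelihood": ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
})

# Split the numeric column into one sample per dropout group.
stayed = toy.loc[toy["Dropout_Likelihood"] == "No", "Time_Spent_on_Videos"]
dropped = toy.loc[toy["Dropout_Likelihood"] == "Yes", "Time_Spent_on_Videos"]

# Two-sided Mann-Whitney U-Test on the two numeric samples.
u_stat, p_value = stats.mannwhitneyu(stayed, dropped, alternative="two-sided")
print(f"U={u_stat}, p={p_value:.3f}")
```

With pingouin, the same two Series can be passed as `pg.mwu(x=stayed, y=dropped)`.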

# Challenges