# **Synthetic Learning Behavior Analysis: Transform**

## Objectives

* By the end of the transformation phase, I will:
    1. Encode and transform features.
    2. Run statistical tests and validate hypotheses.
    3. Visualize results and build a dashboard for communication.
    4. Build a model that is ready for real-world use.


## Inputs

* [Task outline](https://code-institute-org.github.io/5P-Assessments-Handbook/da-ai-bootcamp-capstone-prelims.html)
* Extract phase
* personalized_learning_dataset_copy.csv 


## Outputs

* Transformed dataset.
* Statistical tests that show how features interact.
* PowerBI Dashboard.
* Logistic Regression and ML Model 

---

# Import key libraries

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from sklearn.pipeline import Pipeline


# Data reupload

In [3]:
df = pd.read_csv("../data/copied_data/personalized_learning_dataset_copy.csv")
df.head(10)

Unnamed: 0,Student_ID,Age,Gender,Education_Level,Course_Name,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Learning_Style,Feedback_Score,Dropout_Likelihood
0,S00001,15,Female,High School,Machine Learning,171,4,67,2,89,Medium,51,Visual,1,No
1,S00002,49,Male,Undergraduate,Python Basics,156,4,64,0,94,Medium,92,Reading/Writing,5,No
2,S00003,20,Female,Undergraduate,Python Basics,217,2,55,2,67,Medium,45,Reading/Writing,1,No
3,S00004,37,Female,Undergraduate,Data Science,489,1,65,43,60,High,59,Visual,4,No
4,S00005,34,Female,Postgraduate,Python Basics,496,3,59,34,88,Medium,93,Visual,3,No
5,S00006,34,Male,Undergraduate,Web Development,184,1,87,34,70,Medium,43,Visual,4,No
6,S00007,45,Male,High School,Cybersecurity,454,3,69,46,83,Low,37,Kinesthetic,5,No
7,S00008,47,Male,High School,Cybersecurity,425,2,62,23,52,High,35,Reading/Writing,5,No
8,S00009,48,Male,Undergraduate,Cybersecurity,359,1,59,10,88,Medium,49,Reading/Writing,2,No
9,S00010,45,Female,Undergraduate,Data Science,263,4,63,30,99,Low,61,Auditory,3,No


---

# Data transformation plan

In [5]:
# Reviewing the categorical columns as I want to determine how I will be transforming them.
df_categorical_cols = df.select_dtypes(include=["object"])
df_categorical_cols

Unnamed: 0,Student_ID,Gender,Education_Level,Course_Name,Engagement_Level,Learning_Style,Dropout_Likelihood
0,S00001,Female,High School,Machine Learning,Medium,Visual,No
1,S00002,Male,Undergraduate,Python Basics,Medium,Reading/Writing,No
2,S00003,Female,Undergraduate,Python Basics,Medium,Reading/Writing,No
3,S00004,Female,Undergraduate,Data Science,High,Visual,No
4,S00005,Female,Postgraduate,Python Basics,Medium,Visual,No
...,...,...,...,...,...,...,...
9995,S09996,Female,Undergraduate,Machine Learning,Medium,Kinesthetic,No
9996,S09997,Male,Postgraduate,Machine Learning,Medium,Reading/Writing,Yes
9997,S09998,Female,Postgraduate,Machine Learning,High,Visual,No
9998,S09999,Male,High School,Python Basics,Medium,Visual,No


In [14]:
# For better visibility, looping over the columns (skipping Student_ID) and printing the number of unique values and the values themselves.
for col in df_categorical_cols.columns[1:]:
    print(f"{col}: ", df_categorical_cols[col].nunique(),
          df_categorical_cols[col].unique())

Gender:  3 ['Female' 'Male' 'Other']
Education_Level:  3 ['High School' 'Undergraduate' 'Postgraduate']
Course_Name:  5 ['Machine Learning' 'Python Basics' 'Data Science' 'Web Development'
 'Cybersecurity']
Engagement_Level:  3 ['Medium' 'High' 'Low']
Learning_Style:  4 ['Visual' 'Reading/Writing' 'Kinesthetic' 'Auditory']
Dropout_Likelihood:  2 ['No' 'Yes']
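
The cell above lists which categories exist but not how often each one occurs. A minimal sketch of a frequency check follows, assuming a small hypothetical `toy` frame in place of `df_categorical_cols` and a hypothetical helper `category_shares`. `value_counts(normalize=True)` is the standard pandas call for relative frequencies.

```python
import pandas as pd

# Hypothetical toy frame standing in for the categorical columns of `df`.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male",
               "Other", "Female", "Male", "Female"],
    "Learning_Style": ["Visual", "Auditory", "Visual", "Kinesthetic",
                       "Visual", "Reading/Writing", "Visual", "Auditory"],
})


def category_shares(frame: pd.DataFrame) -> dict:
    """Return, per column, each category's share of rows (between 0 and 1)."""
    return {col: frame[col].value_counts(normalize=True) for col in frame.columns}


shares = category_shares(toy)
for col, share in shares.items():
    print(f"{col}:")
    print(share.round(3).to_string())
```

Running the same helper on `df_categorical_cols.drop(columns="Student_ID")` would show whether a category such as "Other" is rare enough to justify grouping.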


**My transformation plan:**

1. I do not see a need for the Student_ID feature. I will remove it before saving a transformed version of the file.

2. Gender: OneHotEncoder and RareLabelEncoder; values do not represent any order. In addition to OneHotEncoder, I will also use RareLabelEncoder as "Other" gender samples are less than 1 percent of the total gender values.

3. Education_Level: Manual ordinal encoding; I will encode using 0, 1, and 2 to handle the values.

4. Course_Name: OneHotEncoder; values do not represent any order.

5. Engagement_Level: Manual ordinal encoding; I will encode using 0, 1, and 2 to handle the values.

6. Learning_Style: OneHotEncoder; values do not represent any order.

7. Dropout_Likelihood: Manual binary encoding; I will encode using 0 and 1.
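
The rare-label step in item 2 can be sketched in plain pandas. This illustrates the idea only and is not feature_engine's implementation: the helper `group_rare_labels`, the 1 percent tolerance, and the 200-row sample are all assumptions made for the example.

```python
import pandas as pd


def group_rare_labels(series: pd.Series, tol: float = 0.01,
                      label: str = "Rare") -> pd.Series:
    """Replace categories whose share of rows is below `tol` with `label`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < tol].index
    # Keep values that are not rare; overwrite the rest with `label`.
    return series.where(~series.isin(rare), label)


# Hypothetical sample: one "Other" in 200 rows (0.5 percent of the values).
gender = pd.Series(["Female"] * 100 + ["Male"] * 99 + ["Other"], name="Gender")
grouped = group_rare_labels(gender, tol=0.01)
print(grouped.value_counts().to_dict())
```

feature_engine's `RareLabelEncoder` exposes similar `tol` and `replace_with` parameters. Its `n_categories` parameter sets the minimum number of categories a variable needs before any grouping is applied, so it is worth checking that setting in the documentation for a three-category column like Gender.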

### Rationale behind my choice

As I am dealing with synthetic data, I want to control how the features are encoded. By manually encoding key features, I can decide which features have an inherent order in them and which don't. 

I will use OneHotEncoder where there is no order. While this increases the number of features, I will apply feature selection during modeling.

Models treat encoded values as numbers. If I use manual encoding to assign gender or course names 0, 1, 2, and so on, it creates an illusion of order because 1 > 0. In reality, there is no order in gender or course names.

**Explanation credit:** I iterated with ChatGPT to understand how I should encode the features. Earlier, I wanted to use manual encoding as this allows me to have greater control over the encoding and the number of features. However, I understood that some models are sensitive to order. Hence, I read more about encoding and changed my approach accordingly.
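
The illusion of order can be made concrete with a small comparison. The sketch below assumes an arbitrary integer coding of the five course names (the order is made up for illustration) and uses `pd.get_dummies` as a stand-in for a one-hot encoder.

```python
import numpy as np
import pandas as pd

courses = pd.Series(
    ["Machine Learning", "Python Basics", "Data Science",
     "Web Development", "Cybersecurity"],
    name="Course_Name",
)

# Integer codes in an arbitrary order: gaps between codes look like distances.
int_codes = pd.Series(range(len(courses)), index=courses)
gap_ml_py = abs(int_codes["Machine Learning"] - int_codes["Python Basics"])
gap_ml_cyber = abs(int_codes["Machine Learning"] - int_codes["Cybersecurity"])

# One-hot vectors: row i of `vectors` encodes courses[i].
vectors = pd.get_dummies(courses).astype(int).to_numpy()
dist_ml_py = np.linalg.norm(vectors[0] - vectors[1])
dist_ml_cyber = np.linalg.norm(vectors[0] - vectors[4])

print("Integer gaps:", gap_ml_py, gap_ml_cyber)
print("One-hot distances:", round(dist_ml_py, 3), round(dist_ml_cyber, 3))
```

With integer codes, Machine Learning sits four times further from Cybersecurity than from Python Basics, purely because of the labels chosen. With one-hot vectors, every pair of distinct courses is equally far apart, which matches the fact that the courses have no order.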

---

# Data transformation

I will first complete the manual encoding.

In [None]:
# Manual ordinal encoding for Education_Level as it has a natural order (Postgraduate > Undergraduate > High School).
df["Education_Level"] = (
    df["Education_Level"].replace({
        "High School": 0,
        "Undergraduate": 1,
        "Postgraduate": 2
    }).astype(int)
)

# Manual ordinal encoding for Engagement_Level as it has a natural order (High > Medium > Low).
df["Engagement_Level"] = (
    df["Engagement_Level"].replace({
        "Low": 0,
        "Medium": 1,
        "High": 2
    }).astype(int)
)

# Manual binary encoding for Dropout_Likelihood as it is a binary feature (Yes/No).
df["Dropout_Likelihood"] = (
    df["Dropout_Likelihood"].replace({
        "Yes": 1,
        "No": 0
    }).astype(int)
)

df.head(10)  # Checking the changes.

Unnamed: 0,Student_ID,Age,Gender,Education_Level,Course_Name,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Learning_Style,Feedback_Score,Dropout_Likelihood
0,S00001,15,Female,0,Machine Learning,171,4,67,2,89,1,51,Visual,1,0
1,S00002,49,Male,1,Python Basics,156,4,64,0,94,1,92,Reading/Writing,5,0
2,S00003,20,Female,1,Python Basics,217,2,55,2,67,1,45,Reading/Writing,1,0
3,S00004,37,Female,1,Data Science,489,1,65,43,60,2,59,Visual,4,0
4,S00005,34,Female,2,Python Basics,496,3,59,34,88,1,93,Visual,3,0
5,S00006,34,Male,1,Web Development,184,1,87,34,70,1,43,Visual,4,0
6,S00007,45,Male,0,Cybersecurity,454,3,69,46,83,0,37,Kinesthetic,5,0
7,S00008,47,Male,0,Cybersecurity,425,2,62,23,52,2,35,Reading/Writing,5,0
8,S00009,48,Male,1,Cybersecurity,359,1,59,10,88,1,49,Reading/Writing,2,0
9,S00010,45,Female,1,Data Science,263,4,63,30,99,0,61,Auditory,3,0
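
The mappings above assume that every label in a column appears in the dictionary. With `.replace`, a label that is missing from the dictionary is left unchanged, so the following `.astype(int)` raises an error without saying which label caused it. A minimal sketch of a stricter alternative follows, assuming a hypothetical helper `encode_with_mapping` that reports unexpected labels by name before converting.

```python
import pandas as pd


def encode_with_mapping(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers and fail loudly if any label is not in `mapping`."""
    unmapped = set(series.unique()) - set(mapping)
    if unmapped:
        names = sorted(map(str, unmapped))
        raise ValueError(f"Unmapped values in {series.name}: {names}")
    return series.map(mapping).astype(int)


education_order = {"High School": 0, "Undergraduate": 1, "Postgraduate": 2}

# Hypothetical three-value sample standing in for df["Education_Level"].
sample = pd.Series(["High School", "Postgraduate", "Undergraduate"],
                   name="Education_Level")
encoded = encode_with_mapping(sample, education_order)
print(encoded.tolist())
```

A misspelled label such as "High school" would trigger the `ValueError` here, which makes the source of the problem explicit.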


In [20]:
df.isnull().sum() #Checking if the transformed dataset has any null values.

Student_ID                    0
Age                           0
Gender                        0
Education_Level               0
Course_Name                   0
Time_Spent_on_Videos          0
Quiz_Attempts                 0
Quiz_Scores                   0
Forum_Participation           0
Assignment_Completion_Rate    0
Engagement_Level              0
Final_Exam_Score              0
Learning_Style                0
Feedback_Score                0
Dropout_Likelihood            0
dtype: int64

**Code credit:** I created the code for Education_Level and Dropout_Likelihood and asked GitHub Copilot to create a similar piece for Engagement_Level.

**Thought credit:** ChatGPT recommended that I also check for missing values.

The changes are indeed reflected. I will now proceed with one-hot encoding and then combine both sets of transformed features into the transformed dataset.
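
The remaining steps of the plan, dropping Student_ID and one-hot encoding the nominal columns, could look like the sketch below. It assumes a hypothetical three-row `sample` in place of `df` and uses `pd.get_dummies` rather than feature_engine's `OneHotEncoder`, so it illustrates the target shape of the data rather than the final pipeline.

```python
import pandas as pd

# Hypothetical three-row sample in the shape of `df` after manual encoding.
sample = pd.DataFrame({
    "Student_ID": ["S00001", "S00002", "S00003"],
    "Gender": ["Female", "Male", "Female"],
    "Education_Level": [0, 1, 1],
    "Course_Name": ["Machine Learning", "Python Basics", "Python Basics"],
    "Learning_Style": ["Visual", "Reading/Writing", "Reading/Writing"],
    "Dropout_Likelihood": [0, 0, 0],
})

nominal_cols = ["Gender", "Course_Name", "Learning_Style"]

# Drop the identifier, then one-hot encode only the nominal columns.
# Columns not listed in `columns=` pass through unchanged.
transformed = pd.get_dummies(
    sample.drop(columns=["Student_ID"]),
    columns=nominal_cols,
    dtype=int,
)
print(transformed.columns.tolist())
```

Passing `columns=` keeps the manually encoded features as they are and adds one indicator column per category, named `<column>_<category>`.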