# **Synthetic Learning Behavior Analysis: Load**

## Objectives

* By the end of the load phase, I will:
    1. Summarize the ETL steps.
    2. Outline the next steps.


## Inputs

* [Task outline](https://code-institute-org.github.io/5P-Assessments-Handbook/da-ai-bootcamp-capstone-prelims.html)
* Extract phase and transformation phase outcomes
* personalized_learning_dataset_transformed.csv 


## Outputs

* Summary of the ETL process
* Next steps

---

# Import key libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Data upload

In [2]:
df = pd.read_csv("../data/transformed_data/personalized_learning_dataset_transformed.csv")
df.head(5)

Unnamed: 0,Age,Education_Level,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Feedback_Score,Dropout_Likelihood,Gender_Female,Gender_Male,Course_Name_Machine Learning,Course_Name_Python Basics,Course_Name_Data Science,Course_Name_Web Development,Learning_Style_Visual,Learning_Style_Reading/Writing,Learning_Style_Kinesthetic
0,15,0,171,4,67,2,89,1,51,1,0,1,0,1,0,0,0,1,0,0
1,49,1,156,4,64,0,94,1,92,5,0,0,1,0,1,0,0,0,1,0
2,20,1,217,2,55,2,67,1,45,1,0,1,0,0,1,0,0,0,1,0
3,37,1,489,1,65,43,60,2,59,4,0,1,0,0,0,1,0,1,0,0
4,34,2,496,3,59,34,88,1,93,3,0,1,0,0,1,0,0,1,0,0


---

# Sanity check

Checking whether the transformation has left any unintended or missing values.

In [4]:
df.isnull().sum()

Age                               0
Education_Level                   0
Time_Spent_on_Videos              0
Quiz_Attempts                     0
Quiz_Scores                       0
Forum_Participation               0
Assignment_Completion_Rate        0
Engagement_Level                  0
Final_Exam_Score                  0
Feedback_Score                    0
Dropout_Likelihood                0
Gender_Female                     0
Gender_Male                       0
Course_Name_Machine Learning      0
Course_Name_Python Basics         0
Course_Name_Data Science          0
Course_Name_Web Development       0
Learning_Style_Visual             0
Learning_Style_Reading/Writing    0
Learning_Style_Kinesthetic        0
dtype: int64

No data is missing from the transformed dataset.

In [5]:
df_duplicate = df[df.duplicated()]
df_duplicate

Unnamed: 0,Age,Education_Level,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Feedback_Score,Dropout_Likelihood,Gender_Female,Gender_Male,Course_Name_Machine Learning,Course_Name_Python Basics,Course_Name_Data Science,Course_Name_Web Development,Learning_Style_Visual,Learning_Style_Reading/Writing,Learning_Style_Kinesthetic


The transformation has not introduced any duplicate rows.

In [3]:
(df == 0).sum()

Age                                  0
Education_Level                   2923
Time_Spent_on_Videos                 0
Quiz_Attempts                        0
Quiz_Scores                          0
Forum_Participation                204
Assignment_Completion_Rate           0
Engagement_Level                  2093
Final_Exam_Score                     0
Feedback_Score                       0
Dropout_Likelihood                8043
Gender_Female                     5114
Gender_Male                       5301
Course_Name_Machine Learning      7957
Course_Name_Python Basics         8006
Course_Name_Data Science          8016
Course_Name_Web Development       8047
Learning_Style_Visual             7475
Learning_Style_Reading/Writing    7446
Learning_Style_Kinesthetic        7557
dtype: int64

Forum_Participation contains 204 zero values (not nulls). I assumed that they represent learners who did not participate in forum discussions within the synthetic learning dataset.

Transformation included ordinal (manual) and OneHot encoding. The zero values seen above in the encoded columns are category codes rather than missing data.
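
The reasoning above can be backed by a small check that the encoded columns hold only the codes listed in the legend. The sketch below is illustrative rather than part of the original pipeline: the helper name `check_encoding` and the `sample` frame (built from the five rows shown by `df.head(5)`) are my own assumptions, and it also assumes each OneHot group flags at most one category per row.

```python
import pandas as pd

# Expected codes per manually encoded column, taken from the legend in this notebook.
ORDINAL_CODES = {
    "Education_Level": {0, 1, 2},
    "Engagement_Level": {0, 1, 2},
    "Dropout_Likelihood": {0, 1},
}

# OneHot column groups, keyed by the original feature name.
ONE_HOT_GROUPS = {
    "Gender": ["Gender_Female", "Gender_Male"],
    "Course_Name": [
        "Course_Name_Machine Learning",
        "Course_Name_Python Basics",
        "Course_Name_Data Science",
        "Course_Name_Web Development",
    ],
    "Learning_Style": [
        "Learning_Style_Visual",
        "Learning_Style_Reading/Writing",
        "Learning_Style_Kinesthetic",
    ],
}


def check_encoding(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for column, allowed in ORDINAL_CODES.items():
        unexpected = set(frame[column].unique().tolist()) - allowed
        if unexpected:
            problems.append(f"{column}: unexpected codes {sorted(unexpected)}")
    for feature, columns in ONE_HOT_GROUPS.items():
        block = frame[columns]
        if not block.isin([0, 1]).all().all():
            problems.append(f"{feature}: non-binary values in OneHot columns")
        elif (block.sum(axis=1) > 1).any():
            problems.append(f"{feature}: more than one category flagged in a row")
    return problems


# Encoded columns of the five rows displayed by df.head(5) above.
sample = pd.DataFrame({
    "Education_Level": [0, 1, 1, 1, 2],
    "Engagement_Level": [1, 1, 1, 2, 1],
    "Dropout_Likelihood": [0, 0, 0, 0, 0],
    "Gender_Female": [1, 0, 1, 1, 1],
    "Gender_Male": [0, 1, 0, 0, 0],
    "Course_Name_Machine Learning": [1, 0, 0, 0, 0],
    "Course_Name_Python Basics": [0, 1, 1, 0, 1],
    "Course_Name_Data Science": [0, 0, 0, 1, 0],
    "Course_Name_Web Development": [0, 0, 0, 0, 0],
    "Learning_Style_Visual": [1, 0, 0, 1, 1],
    "Learning_Style_Reading/Writing": [0, 1, 1, 0, 0],
    "Learning_Style_Kinesthetic": [0, 0, 0, 0, 0],
})

print(check_encoding(sample))
```

Running `check_encoding(df)` on the full transformed dataset would apply the same checks to every row.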

---

# Data dictionary

|Feature| Explanation|
|------------------------------|-----------------------------------------------------------------------------| 
|Student_ID| Unique identifier for each learner|
|Age| Age of the learners|
|Gender| Gender of the learners|
|Education_Level| The level of schooling|
|Course_Name| Learners' choice of digital training session|
|Time_Spent_on_Videos| The number of minutes learners spend on reviewing videos|
|Quiz_Attempts| The number of attempts at a quiz|
|Quiz_Scores| The measurable outcome of a quiz (measured in percentage)|
|Forum_Participation| The number of times learners participated in a discussion forum|
|Assignment_Completion_Rate| The percentage of completed assignments|
|Engagement_Level| The level of learner engagement (based on activity metrics)|
|Final_Exam_Score| The measurable outcome of the learning session (measured in percentage)|
|Learning_Style| Preferred method of learning|
|Feedback_Score| Learners' rating of the course (on a scale up to 5)|
|Dropout_Likelihood| Whether a learner is likely to drop out of the course (Yes/No)|

**Transformed features: Legend**

| Feature            | Original Value     | Encoded Value | Method                |
|--------------------|-------------------|---------------|-----------------------|
| Gender             | Male              | -             | OneHot Encoding       |
|              | Female            | -             | OneHot Encoding       |
|              | Other             | -             | OneHot Encoding       |
| Education_Level    | High School       | 0             | Manual, ordinal       |
|     | Undergraduate     | 1             | Manual, ordinal       |
|     | Postgraduate      | 2             | Manual, ordinal       |
| Course_Name        | Machine_Learning  | -             | OneHot Encoding       |
|        | Python_Basics     | -             | OneHot Encoding       |
|         | Data_Science      | -             | OneHot Encoding       |
|         | Web_Development   | -             | OneHot Encoding       |
|         | Cybersecurity     | -             | OneHot Encoding       |
| Engagement_Level   | Low               | 0             | Manual, ordinal       |
|    | Medium            | 1             | Manual, ordinal       |
|    | High              | 2             | Manual, ordinal       |
| Learning_Style     | Visual            | -             | OneHot Encoding       |
|     | Reading/Writing   | -             | OneHot Encoding       |
|      | Kinesthetic       | -             | OneHot Encoding       |
|      | Auditory          | -             | OneHot Encoding       |
| Dropout_Likelihood | No                | 0             | Manual, ordinal       |
|  | Yes               | 1             | Manual, ordinal       |

**Table modification credit:** I used GitHub Copilot to modify the table because the one I created ran into issues. I then removed the repeated feature names.
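
To read the encoded columns back as labels, the legend can be turned into lookup tables. This is a minimal sketch, not the code used in the transformation notebook: `decode_ordinal` and `decode_one_hot` are hypothetical helper names, and the OneHot part assumes that a row with all zeros in a group belongs to the category without its own column (for example, `Other` for Gender). The third row of the example frame is made up to illustrate that case.

```python
import pandas as pd

# Code-to-label lookups, copied from the legend table above.
ORDINAL_LABELS = {
    "Education_Level": {0: "High School", 1: "Undergraduate", 2: "Postgraduate"},
    "Engagement_Level": {0: "Low", 1: "Medium", 2: "High"},
    "Dropout_Likelihood": {0: "No", 1: "Yes"},
}


def decode_ordinal(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the manually encoded columns mapped back to labels."""
    decoded = frame.copy()
    for column, labels in ORDINAL_LABELS.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(labels)
    return decoded


def decode_one_hot(frame: pd.DataFrame, prefix: str, omitted_label: str) -> pd.Series:
    """Collapse the OneHot columns starting with `prefix_` into one label column.

    Assumption: rows where every column in the group is 0 belong to the
    category that has no column of its own (`omitted_label`).
    """
    columns = [c for c in frame.columns if c.startswith(prefix + "_")]
    block = frame[columns]
    # idxmax picks the column holding the 1; strip the prefix to get the label.
    labels = block.idxmax(axis=1).str[len(prefix) + 1:]
    return labels.where(block.sum(axis=1) > 0, omitted_label)


# Small illustrative frame; the last row is hypothetical (all-zero Gender group).
example = pd.DataFrame({
    "Education_Level": [0, 1, 2],
    "Gender_Female": [1, 0, 0],
    "Gender_Male": [0, 1, 0],
})

print(decode_ordinal(example)["Education_Level"].tolist())
print(decode_one_hot(example, "Gender", "Other").tolist())
```

The same call pattern would apply to the other groups, for instance `decode_one_hot(df, "Course_Name", "Cybersecurity")`, under the same assumption about the omitted category.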

---

# ETL summary

**Extract:**
1. Loaded the dataset from Kaggle

2. Checked for missing data, null values, and duplicates

3. Drafted business problems and explained ethical implications

4. Captured assumptions and added a disclaimer regarding the synthetic nature of the data

5. Inspected data distribution, analyzed correlation, probed for outliers, and visualized the data for exploration

6. Determined transformation steps and copied the dataset

7. Explained key concepts such as mean, median, and standard deviation at various points as part of exploratory data analysis

**Transform:**
1. Encoded the categorical variables in two steps
    a. Manual ordinal encoding where categories formed a natural order
    b. OneHot encoding where there was no inherent order

2. Ran correlation analysis on transformed data

3. Verified data distribution, determined research methodology, conducted statistical tests on the hypotheses, explained the results, and visualized them

4. Explained key statistical concepts such as probability, hypotheses, p-value, and the like

5. Captured business recommendations and explained why they matter and how they need to be interpreted in the real world

6. Created two machine learning models and contextualized the results given that the data is synthetic

7. Captured challenges, solutions, and ethical implications

**Load:**
1. Reloaded the transformed dataset and performed a sanity check

2. Summarized the ETL process

3. Captured the next steps

---

# Next steps

1. Collate real-world data. Such data can be found internally or sourced externally.

2. Based on real data, the following need to be reviewed and updated:

    a. Features present in the dataset

    b. Correlation between features
    
    c. Outliers
    
    d. Data distribution

    e. Data transformation
    
    f. Research methodology and the hypotheses tested

    g. Machine learning models and pipeline

3. In the next sprint, review how data bias will be handled

4. Build a dashboard that enables exploring the current dataset
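
As a starting point for step 2a, a real-world dataset could be compared against the features in the data dictionary. The sketch below is only an illustration: `compare_features` is a hypothetical helper, and the `real_world` frame with its `Region` column is invented to show what the report looks like.

```python
import pandas as pd

# Feature names from the data dictionary in this notebook.
EXPECTED_FEATURES = [
    "Student_ID", "Age", "Gender", "Education_Level", "Course_Name",
    "Time_Spent_on_Videos", "Quiz_Attempts", "Quiz_Scores",
    "Forum_Participation", "Assignment_Completion_Rate", "Engagement_Level",
    "Final_Exam_Score", "Learning_Style", "Feedback_Score", "Dropout_Likelihood",
]


def compare_features(frame: pd.DataFrame, expected=EXPECTED_FEATURES) -> dict:
    """Report expected features missing from `frame` and columns not in the dictionary."""
    actual = set(frame.columns)
    return {
        "missing": [name for name in expected if name not in actual],
        "unexpected": sorted(actual - set(expected)),
    }


# Hypothetical real-world extract with only a few columns, one of them new.
real_world = pd.DataFrame(columns=["Age", "Gender", "Region"])
report = compare_features(real_world)
print(report["missing"])
print(report["unexpected"])
```

Features listed under `missing` would need to be sourced or dropped from the analysis, while those under `unexpected` would need a data dictionary entry before being used.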


---

**Note:**
I have created a [Power BI dashboard](https://app.powerbi.com/groups/me/reports/4ccffe8b-95d8-4c7b-9dec-535c2629c6bf/70d7640d009e4ea9983d?experience=power-bi&clientSideAuth=0) that helps visualize and explore the dataset further.