# **Synthetic Learning Behavior Analysis: Extract**

## Objectives

* By the end of the extraction phase, I will:
    1. Import the dataset from Kaggle.
    2. Check the data for missing and null values, and duplicates.
    3. Define a data dictionary.
    4. Analyze summary statistics.
    5. Analyze correlation.
    6. Explore further through visualizations.
    7. State assumptions.
    8. Summarize findings.


## Inputs

* [Task outline](https://code-institute-org.github.io/5P-Assessments-Handbook/da-ai-bootcamp-capstone-prelims.html)
* [Kaggle dataset](https://www.kaggle.com/datasets/adilshamim8/personalized-learning-and-adaptive-education-dataset/data)
* Libraries from the requirements.txt file 


## Outputs

* Overview of the dataset, its features, and how they interact.
* Basic visualizations that complement the overview.
* Summary of the findings and an overview of the next step. 

## Disclaimer

* This dataset is synthetically created, as called out by the owner in Kaggle. The observations I find here may reflect real-world behavior to some degree. However, I will not use the dataset to draw causal inferences. Having said that, I will analyze data and develop exploratory models that will serve as placeholders for similar real-world applications.



---

# Import key libraries

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Change working directory

* The notebooks are assumed to live in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mshin\\vs-code-projects\\synthetic_learning_behavior_analysis\\project_work_jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mshin\\vs-code-projects\\synthetic_learning_behavior_analysis'
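Note that re-running the `os.chdir` cell moves one level further up each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `project_work_jupyter_notebooks` (taken from the path printed above); the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path

# Folder name taken from the path printed above; adjust if yours differs.
NOTEBOOK_DIR_NAME = "project_work_jupyter_notebooks"


def move_to_project_root(notebook_dir_name: str = NOTEBOOK_DIR_NAME) -> Path:
    """Move to the parent folder only when the cwd is the notebooks subfolder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the (possibly unchanged) working directory for confirmation.
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell is safe to re-run.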

# Data import

I manually downloaded the source data file from [Kaggle](https://www.kaggle.com/datasets/adilshamim8/personalized-learning-and-adaptive-education-dataset/data), saved it locally (in `data/source_data`), and imported it into VS Code.

In [26]:
df = pd.read_csv("data/source_data/personalized_learning_dataset.csv")
print(df.shape)
df.head(10)

(10000, 15)


Unnamed: 0,Student_ID,Age,Gender,Education_Level,Course_Name,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Learning_Style,Feedback_Score,Dropout_Likelihood
0,S00001,15,Female,High School,Machine Learning,171,4,67,2,89,Medium,51,Visual,1,No
1,S00002,49,Male,Undergraduate,Python Basics,156,4,64,0,94,Medium,92,Reading/Writing,5,No
2,S00003,20,Female,Undergraduate,Python Basics,217,2,55,2,67,Medium,45,Reading/Writing,1,No
3,S00004,37,Female,Undergraduate,Data Science,489,1,65,43,60,High,59,Visual,4,No
4,S00005,34,Female,Postgraduate,Python Basics,496,3,59,34,88,Medium,93,Visual,3,No
5,S00006,34,Male,Undergraduate,Web Development,184,1,87,34,70,Medium,43,Visual,4,No
6,S00007,45,Male,High School,Cybersecurity,454,3,69,46,83,Low,37,Kinesthetic,5,No
7,S00008,47,Male,High School,Cybersecurity,425,2,62,23,52,High,35,Reading/Writing,5,No
8,S00009,48,Male,Undergraduate,Cybersecurity,359,1,59,10,88,Medium,49,Reading/Writing,2,No
9,S00010,45,Female,Undergraduate,Data Science,263,4,63,30,99,Low,61,Auditory,3,No


This dataset contains **10,000 rows** (or observations) and **15 columns** (or features). 

**Disclaimer:** After going through the Provenance section in Kaggle, I understand that this dataset is synthetically generated. While it is based on real-world observations, I am aware that it is not the ground truth. I will ensure that my observations, claims, or inferences are framed carefully.

In [29]:
df.info()  # Reviewing the data types to understand the data better

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Student_ID                  10000 non-null  object
 1   Age                         10000 non-null  int64 
 2   Gender                      10000 non-null  object
 3   Education_Level             10000 non-null  object
 4   Course_Name                 10000 non-null  object
 5   Time_Spent_on_Videos        10000 non-null  int64 
 6   Quiz_Attempts               10000 non-null  int64 
 7   Quiz_Scores                 10000 non-null  int64 
 8   Forum_Participation         10000 non-null  int64 
 9   Assignment_Completion_Rate  10000 non-null  int64 
 10  Engagement_Level            10000 non-null  object
 11  Final_Exam_Score            10000 non-null  int64 
 12  Learning_Style              10000 non-null  object
 13  Feedback_Score              10000 non-null  int

This dataset contains learner behaviors in a Learning Management System. It contains a good mix of numerical data (Time_Spent_on_Videos, Quiz_Scores, and others), descriptive data (Course_Name and Learning_Style), and categorical data (Engagement_Level and Dropout_Likelihood).
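One way to check this mix programmatically is to split the columns by dtype. The following is a minimal sketch, using a two-row stand-in frame built from the first two rows shown above rather than the full file. The variable names are illustrative. Note that dtype alone cannot separate descriptive from categorical columns, so that distinction still needs a manual read.

```python
import pandas as pd

# Stand-in frame: a few columns from the first two rows of df.head(10).
sample = pd.DataFrame({
    "Student_ID": ["S00001", "S00002"],
    "Age": [15, 49],
    "Course_Name": ["Machine Learning", "Python Basics"],
    "Quiz_Scores": [67, 64],
    "Engagement_Level": ["Medium", "Medium"],
})

# Numeric columns versus everything else (text-like columns).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

The same two `select_dtypes` calls can be applied to `df` directly.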

In [11]:
df.isnull().sum() #Checking missing or null values in the dataset.

Student_ID                    0
Age                           0
Gender                        0
Education_Level               0
Course_Name                   0
Time_Spent_on_Videos          0
Quiz_Attempts                 0
Quiz_Scores                   0
Forum_Participation           0
Assignment_Completion_Rate    0
Engagement_Level              0
Final_Exam_Score              0
Learning_Style                0
Feedback_Score                0
Dropout_Likelihood            0
dtype: int64

There are no missing or NaN values. Since this is an engineered dataset, that is not surprising.

In [25]:
(df == 0).sum() #Checking for zero values in the dataset as this can cause a problem during statistical analysis.

Student_ID                      0
Age                             0
Gender                          0
Education_Level                 0
Course_Name                     0
Time_Spent_on_Videos            0
Quiz_Attempts                   0
Quiz_Scores                     0
Forum_Participation           204
Assignment_Completion_Rate      0
Engagement_Level                0
Final_Exam_Score                0
Learning_Style                  0
Feedback_Score                  0
Dropout_Likelihood              0
dtype: int64

There are **204** zero-value entries under **Forum_Participation**. In a dataset of 10,000 entries, 204 learners (about 2%) not interacting on a forum is realistic.

**Assumption:** The zero-value entries within Forum_Participation are valid and true entries. 
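To put the zero counts in proportion, the share of zeros per numeric column can be computed alongside the raw counts. This is a minimal sketch; the helper name `zero_share` is hypothetical, and the stand-in frame uses two columns from the first four rows shown earlier, not the full dataset.

```python
import pandas as pd


def zero_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of zero entries in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    # The mean of a boolean mask is the fraction of True values.
    return (numeric == 0).mean() * 100


# Stand-in frame: two columns from the first four rows of df.head(10).
sample = pd.DataFrame({
    "Forum_Participation": [2, 0, 2, 43],
    "Quiz_Attempts": [4, 4, 2, 1],
})
shares = zero_share(sample)
print(shares)

# The same arithmetic for the counts reported above: 204 zeros in 10,000 rows.
print(204 * 100 / 10_000)
```

Running `zero_share(df)` on the full frame would give the percentage for every numeric column at once.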

In [22]:
df_duplicate = df[df.duplicated()] #Checking for duplicated values in the dataset.
df_duplicate

Unnamed: 0,Student_ID,Age,Gender,Education_Level,Course_Name,Time_Spent_on_Videos,Quiz_Attempts,Quiz_Scores,Forum_Participation,Assignment_Completion_Rate,Engagement_Level,Final_Exam_Score,Learning_Style,Feedback_Score,Dropout_Likelihood


An empty DataFrame indicates that there are no duplicate rows. I want to reconfirm this.

In [24]:
df.duplicated().sum() #Rechecking to confirm that there are indeed no duplicates.

0

There are no duplicates. The author has created a clean dataset, so the challenge lies in how I analyze it.
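Because `df.duplicated()` compares whole rows, a unique `Student_ID` on every row would by itself guarantee zero full-row duplicates. A possible follow-up, sketched here on a small stand-in frame with illustrative values (not taken from the dataset), is to check the identifier and the remaining columns separately:

```python
import pandas as pd

# Illustrative values only: rows 1 and 3 match once the ID is ignored.
sample = pd.DataFrame({
    "Student_ID": ["S00001", "S00002", "S00003"],
    "Age": [15, 49, 15],
    "Gender": ["Female", "Male", "Female"],
})

# Full-row check, as in the cells above.
full_row_dupes = sample.duplicated().sum()

# Is the identifier itself unique?
ids_unique = sample["Student_ID"].is_unique

# Rows that repeat once the identifier is ignored.
dupes_ignoring_id = sample.drop(columns="Student_ID").duplicated().sum()

print(full_row_dupes, ids_unique, dupes_ignoring_id)
```

Whether repeats that ignore the identifier matter depends on the analysis; with many columns they may simply be coincidental matches.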

---

# Data dictionary

Here are the different features (columns) along with their meaning. This helps with reviewing and understanding the data better.

## Dictionary

|Feature| Explanation|
|------------------------------|-----------------------------------------------------------------------------| 
|Student_ID| Unique identifier for each learner|
|Age| Age of the learners|
|Gender| Gender of the learners|
|Education_Level| The level of schooling|
|Course_Name| Learners' choice of digital training session|
|Time_Spent_on_Videos| The number of minutes learners spend on reviewing videos|
|Quiz_Attempts| The number of trials on a quiz|
|Quiz_Scores| The measurable outcome of a quiz (measured in percentage)|
|Forum_Participation| The number of times learners participated in a discussion forum|
|Assignment_Completion_Rate| The percentage of assignments completed|
|Engagement_Level| The level of learner engagement (based on activity metrics)|
|Final_Exam_Score| The measurable outcome of the learning session (measured in percentage)|
|Learning_Style| Preferred method of learning|
|Feedback_Score| Learners' rating of the course (out of 5)|
|Dropout_Likelihood| Whether a learner is likely to drop out of the course (stored as a label such as `No`, not a numeric probability)|


**Source credit:** I populated this section based on previous industry experience. I also used Kaggle for certain contextual information (such as how time spent on videos is measured).
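The dictionary can also be kept in code so it is easy to check against the DataFrame's columns. This is a minimal sketch, abbreviated to three entries from the table above; the names `data_dictionary` and `undocumented_columns` are hypothetical, and the `Extra_Column` in the stand-in frame is invented purely to show the check firing.

```python
import pandas as pd

# Abbreviated: three entries copied from the table above.
data_dictionary = {
    "Student_ID": "Unique identifier for each learner",
    "Age": "Age of the learners",
    "Course_Name": "Learners' choice of digital training session",
}


def undocumented_columns(frame: pd.DataFrame, dictionary: dict) -> list:
    """Return the columns in the frame that have no dictionary entry."""
    return [col for col in frame.columns if col not in dictionary]


# Stand-in frame with one invented column that the dictionary lacks.
sample = pd.DataFrame(
    columns=["Student_ID", "Age", "Course_Name", "Extra_Column"]
)
missing = undocumented_columns(sample, data_dictionary)
print(missing)
```

With all 15 features entered, `undocumented_columns(df, data_dictionary)` should return an empty list.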