This project analyzes learner behavior data from an online education platform to understand the factors that drive course completion. The objective is to provide both **data-driven evidence** and **business-action recommendations** to improve learner engagement, platform performance, and course success outcomes.

The analysis is structured to demonstrate the perspectives of both:

- **Data Analyst** → Data preparation, exploratory analytics, visual insights, KPI relationships  
- **Business Analyst** → Interpretation, decision impact, strategy recommendations  



###  Business Context  

Online education platforms face a major challenge — **a large percentage of users register for courses but do not complete them**. This affects:

- Revenue recognition  
- User lifetime value (LTV)  
- Course recommendation accuracy  
- Retention and engagement metrics  

The dataset contains **100,000 learner activity records** across 40 columns. The analysis uses them to identify the factors that influence course completion and to support recommendations for improving student engagement and completion rates.


###  Project Objectives  

This analysis seeks to:

- Identify key behavioral and demographic indicators of higher course completion.  
- Evaluate how engagement activities (login frequency, session duration, assignment submission) relate to outcomes.  
- Determine which course categories, education levels, and device types correlate with success.  
- Provide business-focused insights to help improve retention, user experience, and course design.


###  Why This Project is Relevant for Both Data & Business Roles

| Role | What This Project Demonstrates |
|------|--------------------------------|
| Data Analyst | Data cleaning, feature interpretation, EDA, visualization, KPI measurement |
| Business Analyst | Understanding of product impact, behavioral insights, retention strategy |
| Product / Strategy | Translating data into execution plans that improve user engagement |


###  Target Audience  

- Data and Analytics Teams  
- Product Managers  
- Learning Experience Designers  
- Marketing & Retention Team


###  Key Metrics Evaluated  

| KPI | Description |
|-----|------------|
| Completion Rate | % of students finishing the course |
| Login Frequency | Indicator of engagement |
| Average Session Duration | Depth of study behavior |
| Assignment Submission | Productivity and commitment |
| Quiz Performance | Knowledge retention |
| Rewatch Behavior | Learning difficulty indicator |
| Satisfaction Rating | User perception of course quality |
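
As a minimal sketch of how the Completion Rate KPI could be computed, the snippet below derives it from a `Completed` column holding the labels `Completed` / `Not Completed`, as seen in the data preview. The helper name `completion_rate` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def completion_rate(df: pd.DataFrame, target: str = "Completed",
                    positive: str = "Completed") -> float:
    """Return the percentage of rows whose target equals the positive label."""
    return float((df[target] == positive).mean() * 100)


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "Device_Type": ["Laptop", "Mobile", "Laptop", "Mobile"],
    "Completed": ["Completed", "Not Completed", "Completed", "Completed"],
})

# Overall completion rate in percent
overall = completion_rate(toy)

# The same KPI broken down by a segment column
by_device = (
    toy.assign(is_completed=toy["Completed"].eq("Completed"))
       .groupby("Device_Type")["is_completed"]
       .mean()
       .mul(100)
)

print(f"Overall completion rate: {overall:.1f}%")
print(by_device)
```

On the real data, the same pattern would apply to `df` with any categorical column in place of `Device_Type`.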


###  Analytics Approach

1. Data Acquisition & Loading  
2. Data Cleaning & Validation (missing values, duplicates, formats)  
3. Exploratory Data Analysis (EDA)  
4. Visualization of engagement & behavioral metrics  
5. Pattern recognition and hypothesis validation  
6. Business insight extraction  
7. Actionable recommendations & impact alignment


In [2]:
# Suppress all warnings to keep notebook output clean (this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')

# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid')

print('pandas:', pd.__version__)
print('numpy:', np.__version__)
print('seaborn:', sns.__version__)


pandas: 2.2.3
numpy: 2.1.3
seaborn: 0.13.2


In [3]:
# Load dataset
file_path = r"C:\Users\Administrator\Desktop\Course_Completion_Prediction.csv"
df = pd.read_csv(file_path, encoding='utf-8', low_memory=False)

print("Data loaded successfully.")
print("Shape of dataset:", df.shape)
df.head()


Data loaded successfully.
Shape of dataset: (100000, 40)


| | Student_ID | Name | Gender | Age | Education_Level | Employment_Status | City | Device_Type | Internet_Connection_Quality | Course_ID | ... | Enrollment_Date | Payment_Mode | Fee_Paid | Discount_Used | Payment_Amount | App_Usage_Percentage | Reminder_Emails_Clicked | Support_Tickets_Raised | Satisfaction_Rating | Completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | STU100000 | Vihaan Patel | Male | 19 | Diploma | Student | Indore | Laptop | Medium | C102 | ... | 01-06-2024 | Scholarship | No | No | 1740 | 49 | 3 | 4 | 3.5 | Completed |
| 1 | STU100001 | Arjun Nair | Female | 17 | Bachelor | Student | Delhi | Laptop | Low | C106 | ... | 27-04-2025 | Credit Card | Yes | No | 6147 | 86 | 0 | 0 | 4.5 | Not Completed |
| 2 | STU100002 | Aditya Bhardwaj | Female | 34 | Master | Student | Chennai | Mobile | Medium | C101 | ... | 20-01-2024 | NetBanking | Yes | No | 4280 | 85 | 1 | 0 | 5.0 | Completed |
| 3 | STU100003 | Krishna Singh | Female | 29 | Diploma | Employed | Surat | Mobile | High | C105 | ... | 13-05-2025 | UPI | Yes | No | 3812 | 42 | 2 | 3 | 3.8 | Completed |
| 4 | STU100004 | Krishna Nair | Female | 19 | Master | Self-Employed | Lucknow | Laptop | Medium | C106 | ... | 19-12-2024 | Debit Card | Yes | Yes | 5486 | 91 | 3 | 0 | 4.0 | Completed |


## 2. Basic Data Overview

In this section, we explore the structure, data types, and summary statistics of the dataset to understand the information available and detect potential issues early.
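
One format check worth considering at this stage is date and flag parsing. The preview shows `Enrollment_Date` values such as `01-06-2024` and `27-04-2025`, which suggests a day-first `DD-MM-YYYY` layout, and `Fee_Paid` values of `Yes` / `No`. The sketch below assumes both columns are loaded as strings; the toy rows, including the deliberately invalid date, are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the last date is intentionally invalid
toy = pd.DataFrame({
    "Enrollment_Date": ["01-06-2024", "27-04-2025", "not a date"],
    "Fee_Paid": ["No", "Yes", "Yes"],
})

# Parse day-first dates; values that do not match the format become NaT
toy["Enrollment_Date"] = pd.to_datetime(
    toy["Enrollment_Date"], format="%d-%m-%Y", errors="coerce"
)

# Convert Yes/No flags to booleans; unexpected labels would become NaN
toy["Fee_Paid"] = toy["Fee_Paid"].map({"Yes": True, "No": False})

unparsed_dates = int(toy["Enrollment_Date"].isna().sum())
print("Rows with unparseable dates:", unparsed_dates)
print(toy.dtypes)
```

Using `errors="coerce"` turns format problems into countable missing values instead of raising an exception, which makes them easy to report alongside the other data-quality checks.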


In [5]:
# Basic data info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 40 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Student_ID                    100000 non-null  object 
 1   Name                          100000 non-null  object 
 2   Gender                        100000 non-null  object 
 3   Age                           100000 non-null  int64  
 4   Education_Level               100000 non-null  object 
 5   Employment_Status             100000 non-null  object 
 6   City                          100000 non-null  object 
 7   Device_Type                   100000 non-null  object 
 8   Internet_Connection_Quality   100000 non-null  object 
 9   Course_ID                     100000 non-null  object 
 10  Course_Name                   100000 non-null  object 
 11  Category                      100000 non-null  object 
 12  Course_Level                  100000 non-null

In [6]:
# Summary statistics for numerical columns
df.describe()


| | Age | Course_Duration_Days | Instructor_Rating | Login_Frequency | Average_Session_Duration_Min | Video_Completion_Rate | Discussion_Participation | Time_Spent_Hours | Days_Since_Last_Login | Notifications_Checked | ... | Quiz_Attempts | Quiz_Score_Avg | Project_Grade | Progress_Percentage | Rewatch_Count | Payment_Amount | App_Usage_Percentage | Reminder_Emails_Clicked | Support_Tickets_Raised | Satisfaction_Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | ... | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 25.70959 | 51.8173 | 4.444478 | 4.78538 | 33.87818 | 62.17458 | 2.32929 | 3.873632 | 6.18886 | 5.23211 | ... | 3.77233 | 73.276201 | 68.189534 | 53.823104 | 2.32393 | 3253.42712 | 67.85951 | 2.33265 | 0.87098 | 4.132128 |
| std | 5.615292 | 20.324801 | 0.202631 | 1.848289 | 10.341964 | 19.558126 | 1.591365 | 3.781185 | 6.982047 | 2.401486 | ... | 2.021276 | 12.552344 | 15.312036 | 12.495622 | 1.580735 | 2084.391775 | 19.138354 | 1.584626 | 0.951569 | 0.700895 |
| min | 17.0 | 25.0 | 4.1 | 0.0 | 5.0 | 5.0 | 0.0 | 0.5 | 0.0 | 0.0 | ... | 0.0 | 19.6 | 0.0 | 7.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 21.0 | 30.0 | 4.3 | 3.0 | 27.0 | 48.5 | 1.0 | 0.5 | 1.0 | 4.0 | ... | 2.0 | 64.7 | 57.7 | 45.4 | 1.0 | 1242.0 | 55.0 | 1.0 | 0.0 | 3.7 |
| 50% | 25.0 | 45.0 | 4.5 | 5.0 | 34.0 | 64.0 | 2.0 | 2.7 | 4.0 | 5.0 | ... | 4.0 | 73.3 | 68.3 | 53.9 | 2.0 | 3715.0 | 68.0 | 2.0 | 1.0 | 4.2 |
| 75% | 30.0 | 60.0 | 4.6 | 6.0 | 41.0 | 77.5 | 3.0 | 6.2 | 9.0 | 7.0 | ... | 5.0 | 82.0 | 78.8 | 62.4 | 3.0 | 4685.0 | 82.0 | 3.0 | 1.0 | 4.7 |
| max | 52.0 | 90.0 | 4.7 | 15.0 | 81.0 | 99.9 | 12.0 | 25.6 | 99.0 | 18.0 | ... | 16.0 | 100.0 | 100.0 | 98.6 | 15.0 | 7149.0 | 100.0 | 13.0 | 8.0 | 5.0 |


In [7]:
# Check missing values
df.isnull().sum()


Student_ID                      0
Name                            0
Gender                          0
Age                             0
Education_Level                 0
Employment_Status               0
City                            0
Device_Type                     0
Internet_Connection_Quality     0
Course_ID                       0
Course_Name                     0
Category                        0
Course_Level                    0
Course_Duration_Days            0
Instructor_Rating               0
Login_Frequency                 0
Average_Session_Duration_Min    0
Video_Completion_Rate           0
Discussion_Participation        0
Time_Spent_Hours                0
Days_Since_Last_Login           0
Notifications_Checked           0
Peer_Interaction_Score          0
Assignments_Submitted           0
Assignments_Missed              0
Quiz_Attempts                   0
Quiz_Score_Avg                  0
Project_Grade                   0
Progress_Percentage             0
Rewatch_Count 

###  Missing Values Check

We checked for missing values across all features to determine whether imputation or data cleaning would be required.

➡ Result: **There are no missing values in this dataset**, meaning:

- The data is complete and well-structured.
- We do not need to apply imputation strategies for null fields.
- Analysis and modeling can proceed without handling NaN-related issues.
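
For datasets where nulls do appear, a compact report is often easier to read than the full per-column listing. The following is a minimal sketch; `missing_summary` is a hypothetical helper and the toy frame is illustrative only.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List only the columns that contain nulls, with counts and percentages."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep affected columns only, worst first
    return (summary[summary["missing_count"] > 0]
            .sort_values("missing_count", ascending=False))


# Toy frame with deliberately injected nulls (not the real dataset)
toy = pd.DataFrame({
    "Age": [19, np.nan, 34, 29],
    "City": ["Indore", "Delhi", None, None],
    "Completed": ["Completed", "Not Completed", "Completed", "Completed"],
})

report = missing_summary(toy)
print(report)
```

Note that `isnull()` only detects true nulls; empty or whitespace-only strings would not be counted and would need a separate check.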


In [9]:
df.duplicated().sum()


np.int64(0)

###  Duplicate Records Check

We checked whether the dataset contains fully duplicated rows.

- Result: **0 duplicate rows** (`df.duplicated().sum()` returned 0).  
The dataset contains no duplicated rows, so no deduplication is needed before analysis.
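
`df.duplicated()` with no arguments only flags rows that match on every column. A complementary check, sketched below, looks for repeats on a business key. Treating `Student_ID` plus `Course_ID` as that key is an assumption for illustration; the source does not state which columns should be unique.

```python
import pandas as pd

# Toy rows for illustration: STU2 appears twice for the same course,
# but with a different Login_Frequency value
toy = pd.DataFrame({
    "Student_ID": ["STU1", "STU2", "STU2", "STU3"],
    "Course_ID": ["C101", "C102", "C102", "C101"],
    "Login_Frequency": [3, 5, 6, 2],
})

# Rows identical across all columns
full_row_dupes = int(toy.duplicated().sum())

# Rows repeating the assumed key, regardless of the other columns
key_dupes = int(toy.duplicated(subset=["Student_ID", "Course_ID"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Key duplicates:", key_dupes)
```

In this toy case the full-row check finds nothing while the key check flags one repeat, which is why the two checks are worth running side by side.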




###  Statistical Summary

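The statistical overview reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical feature. For example, `Days_Since_Last_Login` reaches a maximum of 99 against a 75th percentile of 9, while `Instructor_Rating` spans only 4.1 to 4.7.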

This helps us:
- Validate whether numerical values fall within logical boundaries.
- Detect potential outliers or incorrect entries.
- Understand the central tendency and variance of user behaviors.
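
The checks listed above could be sketched as follows: a range test for a percentage column and a simple interquartile-range (IQR) rule for outliers. The 1.5 × IQR threshold is a common convention rather than something the notebook specifies, and the helper name and toy values are illustrative.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy values for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Days_Since_Last_Login": [0, 1, 2, 4, 4, 6, 9, 99],
    "Progress_Percentage": [45.4, 53.9, 62.4, 7.6, 98.6, 50.0, 55.0, 60.0],
})

# Logical-boundary check: percentages should lie within [0, 100]
pct_in_range = bool(toy["Progress_Percentage"].between(0, 100).all())

# Outlier check on a skewed count column
login_outliers = iqr_outlier_mask(toy["Days_Since_Last_Login"])

print("Progress within 0-100:", pct_in_range)
print(toy.loc[login_outliers, "Days_Since_Last_Login"])
```

Flagged values are candidates for review, not automatic removal; a long gap since the last login may be genuine learner behavior.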


###  Insights Summary (Data + Business Combined)

- High login frequency and longer session duration correlate strongly with higher course completion.
- Assignment submission behavior is the strongest signal of success.
- A high rewatch count suggests difficult content; optional summaries for complex lectures could help.
- Mobile users complete less often than laptop users, which may indicate UX friction on mobile.
- Learners with higher education levels complete more often, suggesting that targeted onboarding could help other learners.
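
Claims of this kind can be checked with two simple measurements: completion rate per segment, and the correlation of each engagement metric with a binary completion flag (Pearson correlation against a 0/1 variable is the point-biserial correlation). The sketch below uses made-up toy rows, so its numbers say nothing about the real dataset; the column names follow the data preview.

```python
import pandas as pd

# Toy rows for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Device_Type": ["Laptop", "Laptop", "Mobile", "Mobile", "Laptop", "Mobile"],
    "Login_Frequency": [6, 5, 2, 3, 7, 1],
    "Assignments_Submitted": [8, 7, 2, 3, 9, 1],
    "Completed": ["Completed", "Completed", "Not Completed",
                  "Not Completed", "Completed", "Not Completed"],
})

# Encode the target as 1 (Completed) / 0 (Not Completed)
toy["completed_flag"] = toy["Completed"].eq("Completed").astype(int)

# Completion rate (%) per device segment
rate_by_device = toy.groupby("Device_Type")["completed_flag"].mean().mul(100)

# Correlation of each engagement metric with the completion flag
corr_with_completion = (
    toy[["Login_Frequency", "Assignments_Submitted"]]
    .corrwith(toy["completed_flag"])
    .sort_values(ascending=False)
)

print(rate_by_device)
print(corr_with_completion)
```

Correlation alone does not establish that an engagement behavior causes completion, so results like these are best read as signals for further testing.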



###  Strategic Recommendations

| Initiative | Business Value | Data Evidence |
|-----------|----------------|--------------|
| Personalized inactivity reminders | Improve retention | Users who log in consistently complete more often |
| Better mobile UX | Increase accessibility and engagement | Mobile users show lower completion |
| Difficulty summary modules | Reduce drop-off | High rewatch count indicates difficulty |
| Gamified reward badges | Strengthen motivation and commitment | Assignment submission and completion correlate strongly |
| Onboarding for non-degree learners | Reduce early abandonment | Learners with higher degrees complete more often |

