<h1 align="center">Online Courses Analysis</h1>


### Description
This dataset captures user engagement metrics from an online course platform, facilitating analyses on factors influencing course completion. It includes user demographics, course-specific data, and engagement metrics.

| Feature            | Description                                                                                     |
|-----------------------|-------------------------------------------------------------------------------------------------|
| UserID                | Unique identifier for each user                                                                  |
| CourseCategory        | Category of the course taken by the user (e.g., Programming, Business, Arts)                     |
| TimeSpentOnCourse      | Total time spent by the user on the course in hours                                              |
| NumberOfVideosWatched | Total number of videos watched by the user                                                       |
| NumberOfQuizzesTaken  | Total number of quizzes taken by the user                                                        |
| QuizScores            | Average scores achieved by the user in quizzes (percentage)                                      |
| CompletionRate        | Percentage of course content completed by the user                                               |
| DeviceType            | Type of device used by the user (0: Desktop, 1: Mobile)                                          |
| CourseCompletion      | Course completion status (0: Not Completed, 1: Completed)                                        |


# Import necessary libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Data Reading & Understanding

In [2]:
# Load the dataset
df = pd.read_csv(r"C:\Users\Lenovo\Desktop\Eslam_Final_Project\Sourse\online_course_engagement_data.csv")

In [3]:
# Display the first 10 rows of the dataset and its shape
print(df.shape)
df.head(10)

(9180, 9)


|   | UserID | CourseCategory | TimeSpentOnCourse | NumberOfVideosWatched | NumberOfQuizzesTaken | QuizScores | CompletionRate | DeviceType | CourseCompletion |
|---|--------|----------------|-------------------|-----------------------|----------------------|------------|----------------|------------|------------------|
| 0 | 5618 | Health | 29.97971934613741 | 17 | 3 | 50.365655948359226 | 20.860773 | 1 | 0 |
| 1 | 4326 | Arts | 27.802639509751515 | 1 | 5 | 62.61596979322466 | 65.632415 | 1 | 0 |
| 2 | 5849 | Arts | 86.82048469711872 | 14 | 2 | 78.4589624023972 | 63.812007 | 1 | 1 |
| 3 | 4992 | Science | 35.03842663461649 | 17 | 10 | 59.19885273381349 | 95.433162 | 0 | 1 |
| 4 | 3866 | Programming | 92.4906469645332 | 16 | 0 | 98.42828500171674 | 18.102478 | 0 | 0 |
| 5 | 8650 | Health | 79.46612884332329 | 12 | 7 | 70.2333289468058 | 76.484023 | 0 | 1 |
| 6 | 4321 | Health | 78.9087242374251 | 10 | 2 | 86.83653261494925 | 22.588896 | 1 | 0 |
| 7 | 4589 | Business | 12.06823675215978 | 16 | 3 | 61.55364646476677 | 27.410991 | 1 | 0 |
| 8 | 4215 | Business | 81.93570917985394 | 8 | 4 | 90.26456414855328 | 33.308437 | 0 | 1 |
| 9 | 8089 | Programming | 83.39402571104472 | 15 | 10 | 63.956352955520536 | 33.2613 | 1 | 0 |


In [4]:
# Get information about the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9180 entries, 0 to 9179
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   UserID                 9180 non-null   int64  
 1   CourseCategory         9180 non-null   object 
 2   TimeSpentOnCourse      9083 non-null   object 
 3   NumberOfVideosWatched  9092 non-null   object 
 4   NumberOfQuizzesTaken   9180 non-null   int64  
 5   QuizScores             9084 non-null   object 
 6   CompletionRate         9180 non-null   float64
 7   DeviceType             9180 non-null   int64  
 8   CourseCompletion       9180 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 645.6+ KB
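
The `df.info()` output above lists `TimeSpentOnCourse`, `NumberOfVideosWatched`, and `QuizScores` as `object`, which usually means some entries are not parseable as numbers. A minimal sketch for listing those entries, using a toy series rather than the real data (the helper name `non_numeric_tokens` is hypothetical):

```python
import pandas as pd

def non_numeric_tokens(series: pd.Series) -> list:
    """Return the distinct non-null values that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # A token is "bad" if it was present in the input but became NaN after parsing
    bad_mask = series.notna() & parsed.isna()
    return sorted(series[bad_mask].astype(str).unique())

# Toy stand-in for a column such as NumberOfVideosWatched
toy = pd.Series(["17", "1", None, "?", "14", "?"])
print(non_numeric_tokens(toy))
```

Running the helper on each `object` column of `df` would show which placeholder strings need to be treated as missing before type conversion.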


In [5]:
# Describe the dataset 
df.describe()

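|       | UserID | NumberOfQuizzesTaken | CompletionRate | DeviceType | CourseCompletion |
|-------|--------|----------------------|----------------|------------|------------------|
| count | 9180.0 | 9180.0 | 9180.0 | 9180.0 | 9180.0 |
| mean | 4502.338671 | 5.09488 | 50.305617 | 0.501416 | 0.396078 |
| std | 2596.313009 | 3.155766 | 28.940782 | 0.500025 | 0.489108 |
| min | 1.0 | 0.0 | 0.009327 | 0.0 | 0.0 |
| 25% | 2255.75 | 2.0 | 25.609713 | 0.0 | 0.0 |
| 50% | 4493.5 | 5.0 | 50.151207 | 1.0 | 0.0 |
| 75% | 6754.25 | 8.0 | 75.514245 | 1.0 | 1.0 |
| max | 9000.0 | 10.0 | 99.979711 | 1.0 | 1.0 |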


In [6]:
# Get number of unique values in each column and their unique values 
for col in df.columns:
    print(f"Column: {col}")
    print(f"Number of Unique Values: {df[col].nunique()}")
    print("Unique Values:")
    print(df[col].unique())
    print('-' * 40)


Column: UserID
Number of Unique Values: 8123
Unique Values:
[5618 4326 5849 ... 6323 3652 5595]
----------------------------------------
Column: CourseCategory
Number of Unique Values: 5
Unique Values:
['Health' 'Arts' 'Science' 'Programming' 'Business']
----------------------------------------
Column: TimeSpentOnCourse
Number of Unique Values: 7969
Unique Values:
['29.97971934613741' '27.802639509751515' '86.82048469711872' ...
 '38.212511524771706' '70.04866545544596' '93.58978112503895']
----------------------------------------
Column: NumberOfVideosWatched
Number of Unique Values: 22
Unique Values:
['17' '1' '14' '16' '12' '10' '8' '15' '3' '13' nan '7' '20' '6' '11' '0'
 '5' '18' '19' '9' '2' '4' '?']
----------------------------------------
Column: NumberOfQuizzesTaken
Number of Unique Values: 11
Unique Values:
[ 3  5  2 10  0  7  4  9  1  8  6]
----------------------------------------
Column: QuizScores
Number of Unique Values: 7981
Unique Values:
['50.365655948359226' '62.61596

# Data Cleaning & Preprocessing 

In [7]:
# Replace '?' with NaN in the entire DataFrame
df.replace('?', np.nan, inplace=True)

In [8]:
# Check for missing values in each column
df.isna().sum()

UserID                     0
CourseCategory             0
TimeSpentOnCourse        181
NumberOfVideosWatched    183
NumberOfQuizzesTaken       0
QuizScores               183
CompletionRate             0
DeviceType                 0
CourseCompletion           0
dtype: int64

In [9]:
# Check for duplicates
df.duplicated().sum()

971

In [10]:
# Drop duplicates and reset the index
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.duplicated().sum()

0
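
The raw data has 8,123 unique `UserID` values, yet 8,209 rows remain after exact duplicates are dropped, so some users must appear in more than one non-identical row. A small sketch of how exact duplicates and repeated IDs could be told apart, on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "QuizScores": [50.0, 50.0, 60.0, 75.0, 80.0],
})

# keep=False marks every member of a duplicate group, not just the later copies
exact_dupes = toy[toy.duplicated(keep=False)]

# Same cleaning step as in the notebook: drop exact duplicate rows
deduped = toy.drop_duplicates().reset_index(drop=True)

# UserIDs that still appear more than once after exact duplicates are gone
repeated_ids = (
    deduped.loc[deduped["UserID"].duplicated(keep=False), "UserID"]
    .unique()
    .tolist()
)
print(len(exact_dupes), len(deduped), repeated_ids)
```

Whether repeated IDs are separate enrollments or data-entry noise cannot be decided from the notebook alone; the sketch only shows how to surface them.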

In [11]:
# Convert columns to the correct data types
df['TimeSpentOnCourse'] = pd.to_numeric(df['TimeSpentOnCourse'], errors='coerce').astype('float')
df['NumberOfVideosWatched'] = pd.to_numeric(df['NumberOfVideosWatched'], errors='coerce').round().astype('Int64')
df['QuizScores'] = pd.to_numeric(df['QuizScores'], errors='coerce').astype('float')
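
To make the conversion above concrete, here is a toy illustration of what `errors='coerce'` and the nullable `Int64` dtype do to unparseable and missing entries. The input values are invented for the example:

```python
import pandas as pd

raw = pd.Series(["17", "?", None, "8.0"])

# Unparseable strings and missing entries both become NaN
as_float = pd.to_numeric(raw, errors="coerce")

# Rounding first keeps the cast to the nullable integer dtype safe; NaN becomes <NA>
as_int = as_float.round().astype("Int64")

print(as_float.tolist())
print(as_int.dtype, as_int.dropna().tolist())
```

The nullable `Int64` dtype is what allows `NumberOfVideosWatched` to stay an integer column while still holding missing values until they are imputed.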

In [12]:
missing_cols = df.columns[df.isna().any()].tolist()
missing_cols

['TimeSpentOnCourse', 'NumberOfVideosWatched', 'QuizScores']

In [13]:
# Fill missing values in numerical columns with each column's median
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns           
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())  
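
A quick way to sanity-check median imputation is to confirm that no missing values remain and to record how many values were filled per column, since every imputed row lands exactly on the median. A minimal sketch on a toy frame (values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TimeSpentOnCourse": [10.0, np.nan, 30.0, 50.0, np.nan],
    "QuizScores": [60.0, 75.0, np.nan, 90.0, 75.0],
})

imputed_counts = toy.isna().sum()   # how many values will be imputed per column
medians = toy.median()              # per-column medians, NaN ignored
filled = toy.fillna(medians)        # each NaN gets its own column's median

remaining_nans = int(filled.isna().sum().sum())
print(imputed_counts.to_dict(), remaining_nans)
```

Keeping `imputed_counts` around is useful later: a histogram bin that contains the median will be inflated by that many rows.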

In [14]:
# check if there are outliers in the numerical columns
cols = ['TimeSpentOnCourse', 'NumberOfVideosWatched', 'CompletionRate', 'NumberOfQuizzesTaken', 'QuizScores']
for col in cols:
    fig = px.box(df, x=col)
    fig.show()
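
Box plots flag outliers visually; the same idea can be expressed numerically with the common 1.5 × IQR rule. The sketch below is an assumption about how one might quantify it, not a step the notebook performs, and Plotly's quartile computation may differ slightly from pandas' default interpolation:

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data with one obvious extreme value
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())
```

Applied per column to `df`, `mask.sum()` would give an outlier count to report alongside each box plot.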

In [15]:
# Show the shape after cleaning (only the 971 duplicate rows were dropped; no outliers were removed)
df.shape

(8209, 9)

In [16]:
# Save the cleaned data to a CSV file (to_csv returns None when given a path, so no assignment is needed)
# df.to_csv('cleaned_data.csv', index=False)

# Exploratory Data Analysis

In [17]:
count_course_completion = df['CourseCompletion'].value_counts()
count_course_completion

CourseCompletion
0    4641
1    3568
Name: count, dtype: int64

In [18]:
px.pie(df, names='CourseCompletion', template='plotly_dark') 

 **Completion Rates**:
  - **4,641** records are marked as not completed.
  - **3,568** records are marked as completed.

  About **56.5%** of records are not completed and **43.5%** are completed, so the two classes are moderately imbalanced rather than evenly split. The following cells look at which engagement factors are associated with completion.
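
The class shares can be derived directly from the counts printed by `value_counts()`. The sketch rebuilds those two counts by hand so it runs without the dataset; on the real frame, `df['CourseCompletion'].value_counts(normalize=True)` would be the equivalent call:

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 4641, 1: 3568}, name="count")

# Convert to percentages of all records, rounded to one decimal place
shares = (counts / counts.sum() * 100).round(1)
print(shares.to_dict())
```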


In [19]:
course_category_counts = df['CourseCategory'].value_counts()
course_category_counts

CourseCategory
Business       1671
Programming    1655
Science        1654
Health         1648
Arts           1581
Name: count, dtype: int64

In [20]:
px.pie(df, names='CourseCategory', template='plotly_dark', hole=0.5)

1. **Business:**
   - 1,671 records, the largest category in the dataset.

2. **Programming:**
   - 1,655 records, slightly behind Business.

3. **Science:**
   - 1,654 records, essentially tied with Programming.

4. **Health:**
   - 1,648 records, slightly fewer than Science.

5. **Arts:**
   - 1,581 records, the smallest category.

Overall, records are spread almost evenly across the five categories: each accounts for roughly 19% to 20% of the 8,209 rows, and the gap between the largest (Business) and smallest (Arts) category is only 90 records.

In [21]:
# Define the columns
cols = ['TimeSpentOnCourse', 'NumberOfVideosWatched', 'CompletionRate', 'NumberOfQuizzesTaken', 'QuizScores']

# Number of columns to plot
num_columns = len(cols)

# Determine the number of rows and columns needed for the subplots grid
num_cols = 3  # Number of columns in the grid
num_rows = (num_columns + num_cols - 1) // num_cols  # Calculate number of rows required

# Create a subplot figure
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=cols)

# Loop through each column and add a histogram to the subplot grid
for i, col in enumerate(cols):
    row = i // num_cols + 1  # Calculate the row index
    col_pos = i % num_cols + 1  # Calculate the column index
    
    # Add histogram for each column
    fig.add_trace(go.Histogram(x=df[col], name=col), row=row, col=col_pos)

# Update layout for the entire figure
fig.update_layout(height=800, width=1000, showlegend=True, title_text='Histograms of DataFrame Columns')

# Show tick labels on every subplot axis
fig.update_xaxes(showticklabels=True)
fig.update_yaxes(showticklabels=True)

# Display the plot
fig.show()


1. **Course Category Distribution**:
   - The course categories, ranked from most to least frequent, are:
     1. **Business**
     2. **Programming**
     3. **Science**
     4. **Health**
     5. **Arts**

2. **Time Spent on Courses**:
   - The most populated histogram bin is **49 to 51 hours**. Median imputation in the cleaning step may contribute to this peak.
   - The least populated bin is **99 to 101 hours**, at the upper edge of the range.

3. **Number of Videos Watched**:
   - The most common value is **10 videos watched**.

4. **Number of Quizzes Taken**:
   - Values are spread fairly evenly from 0 to 10, with no dominant peak.

5. **Quiz Scores**:
   - The most populated bin is around **75**. This is the histogram peak, not the maximum score: the first rows of the dataset already contain scores above 90.
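
Reading the tallest bar off a histogram can be made reproducible by computing the bin counts explicitly. The following is a sketch with `numpy.histogram` on invented values; the helper name `modal_bin` and the bin width are assumptions, and Plotly picks its own bin edges, so the result will not necessarily match the figure bin for bin:

```python
import numpy as np

def modal_bin(values, bin_width):
    """Return (low_edge, high_edge, count) of the most populated fixed-width bin."""
    values = np.asarray(values, dtype=float)
    # Fixed-width edges from the floored minimum up to at least the maximum
    edges = np.arange(np.floor(values.min()), values.max() + bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    i = int(counts.argmax())
    return float(edges[i]), float(edges[i + 1]), int(counts[i])

toy = [10, 49.5, 50, 50.2, 50.9, 75, 99]
low, high, n = modal_bin(toy, bin_width=2)
print(low, high, n)
```

Comparing `n` with the number of imputed values in a column gives a rough sense of how much of a peak is due to median filling.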


In [22]:
# Define the columns
cols = ['TimeSpentOnCourse', 'NumberOfVideosWatched', 'NumberOfQuizzesTaken', 'QuizScores', 'CompletionRate', 'CourseCategory']

for col in cols:
    fig = px.histogram(df, x=col, barmode='group', color='CourseCompletion', marginal='box', template='plotly_dark')
    fig.show()

1. **Course Completion by Time Spent:**
   - Among users who spent less than 20 hours on a course, fewer than 25% completed it.
   - Above 70 hours, completed and non-completed counts are roughly equal.

2. **Course Completion by Number of Videos Watched:**
   - Users who watched fewer than 5 videos show a higher share of non-completion.
   - Users who watched more than 6 videos show a lower share of non-completion, so watching more videos is associated with higher completion.

3. **Course Completion by Number of Quizzes Taken:**
   - As the number of quizzes taken increases, the share of completions rises and the share of non-completions falls, indicating a positive association between quizzes taken and course completion.
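
The observations above are read off grouped histograms; the same comparison can be computed as a completion share per bin. A minimal sketch on a toy frame, where the bin edges (0, 20, 70, 100) are an assumption chosen to mirror the thresholds mentioned in the text:

```python
import pandas as pd

toy = pd.DataFrame({
    "TimeSpentOnCourse": [5, 15, 18, 45, 60, 75, 90, 95],
    "CourseCompletion":  [0, 0, 1, 0, 1, 1, 0, 1],
})

# Assumed bin edges; intervals are closed on the right, e.g. (20, 70]
bins = [0, 20, 70, 100]
labels = ["0-20h", "20-70h", "70-100h"]
toy["TimeBin"] = pd.cut(
    toy["TimeSpentOnCourse"], bins=bins, labels=labels, include_lowest=True
)

# Mean of a 0/1 column is the share of completed records in each bin
rate_by_bin = toy.groupby("TimeBin", observed=True)["CourseCompletion"].mean()
print(rate_by_bin)
```

The same pattern works for `NumberOfVideosWatched` and `NumberOfQuizzesTaken` by grouping on the column directly instead of on a binned version.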

In [23]:
df = df.drop(columns=['UserID', 'DeviceType'])
# Calculate the correlation matrix
correlation_matrix = df.corr(numeric_only=True)

# Create the heatmap using Plotly Express
fig = px.imshow(correlation_matrix,
                text_auto=True,  # Automatically show values in the heatmap cells
                color_continuous_scale='Viridis',  # Use 'Viridis' for a perceptually uniform color scale
                labels={'color': 'Correlation'},  # Label for the color bar
                title='Correlation Heatmap')  # Title for the heatmap

# Update layout to improve appearance
fig.update_layout(
    height=600,  # Set the height of the figure
    width=800,  # Set the width of the figure
    xaxis_title='Variables',  # Label for the x-axis
    yaxis_title='Variables',  # Label for the y-axis
    title_x=0.5  # Center the title
)

# Display the heatmap
fig.show()
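
When the goal is to understand completion, it can help to pull just the target's column out of the correlation matrix and rank the features by it. A sketch on a toy frame with invented values (not a result from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "NumberOfQuizzesTaken": [0, 2, 4, 6, 8, 10],
    "QuizScores": [55.0, 80.0, 60.0, 90.0, 70.0, 85.0],
    "CourseCompletion": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every numeric feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["CourseCompletion"]
    .drop("CourseCompletion")          # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(target_corr)
```

Because `CourseCompletion` is binary, these Pearson values are point-biserial correlations; they describe linear association only and do not imply causation.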

In [24]:
!pipreqs ./

INFO: Not scanning for jupyter notebooks.
Please, verify manually the final list of requirements.txt to avoid possible dependency confusions.
INFO: Successfully saved requirements file in ./requirements.txt
