Project Overview
This project analyzes a Udemy course dataset to uncover trends in pricing, enrollment, revenue, and content characteristics. The analysis combines two tools: Python for automated data processing and visualization, and Excel for statistical modeling, correlation analysis, and content classification.
Dataset Description
The dataset includes 3,672 course records with the following key attributes:
Course Metadata: Title, ID, URL, Subject, and Level.
Engagement Metrics: Number of subscribers and reviews.
Course Specifics: Price, content duration (in hours), number of lectures, and publication timestamp.
Financial Data: Revenue, a derived column calculated from price and subscriber count.
Tools & Technologies
Python:
Pandas & NumPy: Data cleaning, manipulation, and feature engineering.
Matplotlib & Seaborn: Exploratory Data Analysis (EDA) and data visualization.
Excel:
Statistical reporting and summary metrics.
Correlation analysis and classification modeling.
Key Analysis Components
- Data Cleaning & Preprocessing (Python)
Duplicate Removal: Identified and removed duplicate course entries to ensure data integrity.
Type Conversion: Converted published_timestamp to datetime objects to extract publication year, date, and time.
Feature Engineering: Created a revenue column by calculating the product of price and num_subscribers.
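The three cleaning steps above can be sketched in pandas as follows. This is a minimal illustration on a made-up three-row sample, not the project's actual script: the course_id column name, the year column, and the clean helper are assumptions, while price, num_subscribers, published_timestamp, and revenue come from the description above.

```python
import pandas as pd

# Hypothetical sample standing in for the real CSV. The course_id column
# name is an assumption; the other column names appear in this README.
raw = pd.DataFrame({
    "course_id": [1, 1, 2],
    "price": [20, 20, 50],
    "num_subscribers": [100, 100, 10],
    "published_timestamp": [
        "2016-03-01T10:00:00Z",
        "2016-03-01T10:00:00Z",
        "2017-07-15T08:30:00Z",
    ],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, parse timestamps, and add a revenue column."""
    out = df.drop_duplicates().copy()
    # Parse the ISO 8601 strings into timezone-aware datetimes.
    out["published_timestamp"] = pd.to_datetime(out["published_timestamp"], utc=True)
    # Extract the publication year for later grouping (hypothetical column name).
    out["year"] = out["published_timestamp"].dt.year
    # Revenue is the product of price and subscriber count.
    out["revenue"] = out["price"] * out["num_subscribers"]
    return out.reset_index(drop=True)


courses = clean(raw)
print(courses[["course_id", "year", "revenue"]])
```

Returning a copy rather than modifying the input keeps the raw frame available for comparison, for example to count how many duplicates were dropped.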
- Statistical Analysis & Classification (Excel)
Revenue Summary:
Total revenue generated across all courses: ~881,674,940
Average Content Duration: ~4.10 hours
Correlations:
Price vs. Subscribers: Found a very weak positive correlation (0.05), suggesting that price is not a primary driver for enrollment volume.
Price vs. Reviews: Analyzed the relationship between course cost and user feedback frequency.
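The correlations in this project were computed in Excel. As a cross-check, the same Pearson coefficients can be obtained in pandas; the sketch below uses made-up numbers, and the num_reviews column name is an assumption.

```python
import pandas as pd

# Hypothetical toy data; the real analysis uses the full course dataset.
sample = pd.DataFrame({
    "price": [10, 20, 30, 40, 50],
    "num_subscribers": [120, 80, 150, 60, 130],
    "num_reviews": [12, 9, 20, 5, 14],  # assumed column name
})

# Full pairwise Pearson correlation matrix.
corr_matrix = sample.corr(method="pearson")

# A single pair, comparable to Excel's CORREL over the same two columns.
price_vs_subs = sample["price"].corr(sample["num_subscribers"])

print(corr_matrix.round(2))
print(f"price vs. subscribers: {price_vs_subs:.2f}")
```

A coefficient near zero, such as the 0.05 reported above, means the two columns have almost no linear relationship. It does not rule out a non-linear one.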
Content Duration Classification: Categorized courses by duration using the mean (4.10 hours) and standard deviation (6.06 hours):
Normal Content: Courses within a standard duration range (~92.13% of the dataset, 3,383 courses).
Long Content: Courses significantly exceeding the mean duration (~7.87% of the dataset, 289 courses).
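The exact cut-off between the two classes is not stated here. A minimal sketch, assuming that "Long" means more than one standard deviation above the mean (4.10 + 6.06 = 10.16 hours), could look like this. The threshold, the classify_duration helper, and the content_duration column name are all assumptions.

```python
import pandas as pd

MEAN_HOURS = 4.10  # mean duration reported above
STD_HOURS = 6.06   # standard deviation reported above

# Assumed rule: a course is "Long" if it exceeds mean + 1 standard deviation.
LONG_THRESHOLD = MEAN_HOURS + STD_HOURS


def classify_duration(hours: float, threshold: float = LONG_THRESHOLD) -> str:
    """Label a course by its duration in hours."""
    return "Long Content" if hours > threshold else "Normal Content"


# Hypothetical durations, not taken from the dataset.
durations = pd.Series([0.5, 4.0, 10.0, 12.5, 40.0], name="content_duration")
labels = durations.apply(classify_duration)

# Share of each class, as a fraction of all courses.
shares = labels.value_counts(normalize=True)
print(shares)
```

If the Excel workbook uses a different rule, such as two standard deviations, only LONG_THRESHOLD needs to change.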
- Exploratory Data Analysis (Python)
Yearly Revenue Trends: Visualized how total revenue evolved over time using line and bar charts.
Subject Analysis: Analyzed revenue distribution across different subjects (e.g., Web Development, Business Finance) to identify high-performing categories.
Market Share: Used pie charts to visualize the percentage contribution of each year and subject to the total revenue.
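The aggregations behind these charts can be sketched as below. The data is made up, the year and subject column names and the plot_overview helper are assumptions, and the plotting uses pandas' matplotlib wrapper, which may differ from the original script.

```python
import pandas as pd

# Hypothetical cleaned data with one row per course.
courses = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017],
    "subject": [
        "Web Development", "Business Finance", "Web Development",
        "Web Development", "Business Finance",
    ],
    "revenue": [1000, 500, 3000, 1500, 2000],
})

# Total revenue per publication year (input for line and bar charts).
yearly_revenue = courses.groupby("year")["revenue"].sum()

# Total revenue per subject, highest first, and each subject's percentage share.
subject_revenue = courses.groupby("subject")["revenue"].sum().sort_values(ascending=False)
subject_share = subject_revenue / subject_revenue.sum() * 100


def plot_overview(yearly: pd.Series, by_subject: pd.Series,
                  path: str = "revenue_overview.png") -> None:
    """Save a line chart of yearly revenue next to a pie chart of subject share."""
    # Imported lazily so the aggregation above works without a plotting backend.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, (ax_line, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
    yearly.plot(kind="line", marker="o", ax=ax_line, title="Total revenue by year")
    by_subject.plot(kind="pie", autopct="%1.1f%%", ax=ax_pie,
                    title="Revenue share by subject")
    ax_pie.set_ylabel("")  # hide the default series-name label on the pie
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


print(yearly_revenue)
print(subject_share.round(2))
```

Calling plot_overview(yearly_revenue, subject_revenue) writes the figure to disk. The year with peak revenue is available as yearly_revenue.idxmax().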
Summary of Insights
Top Performing Year: The analysis identifies the specific year with peak revenue and course publication activity.
Content Strategy: Most courses (~92%) fall in the "Normal" duration class, and the average course runs about 4.10 hours, which suggests a preference for concise, focused learning modules.
Revenue Drivers: While individual course prices vary, the bulk of revenue is driven by high subscriber counts in specific high-demand subjects such as Web Development.
Project Structure
udemy_courses_analysis.py: Main script for data cleaning, transformation, and plotting.
Report.csv: Summary of high-level project metrics.
Correlations.csv: Detailed statistical correlations between price, engagement, and duration.
ClassifyContentDuration.csv: Classification logic and results for course lengths.
How to Run
Ensure Python 3.x is installed along with pandas, numpy, matplotlib, and seaborn.
Place the dataset udemy_online_education_courses_dataset.csv in the project directory.
Run the analysis script:
    python udemy_courses_analysis_second_project.py