iemad406/Udemy-Data-Analysis-using-python

Project Overview

This project analyzes a Udemy course dataset to uncover trends in pricing, enrollment, revenue, and content characteristics. The analysis combines two tools: Python for automated data processing and visualization, and Excel for statistical modeling, correlation analysis, and content classification.

Dataset Description

The dataset includes 3,672 course records with the following key attributes:

Course Metadata: Title, ID, URL, Subject, and Level.

Engagement Metrics: Number of subscribers and reviews.

Course Specifics: Price, content duration (in hours), number of lectures, and publication timestamp.

Financial Data: Calculated revenue based on price and subscriber count.

Tools & Technologies

Python:

pandas & NumPy: Data cleaning, manipulation, and feature engineering.

Matplotlib & Seaborn: Exploratory Data Analysis (EDA) and data visualization.

Excel:

Statistical reporting and summary metrics.

Correlation analysis and classification modeling.

Key Analysis Components

  1. Data Cleaning & Preprocessing (Python)

Duplicate Removal: Identified and removed duplicate course entries to ensure data integrity.

Type Conversion: Converted published_timestamp to datetime objects to extract publication year, date, and time.

Feature Engineering: Created a revenue column by calculating the product of price and num_subscribers.
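The three cleaning steps above can be sketched as follows. This is a minimal illustration, not the code in the project script: the sample rows, the helper name clean_courses, and the derived column names year, date, and time are assumptions, and the timestamp strings are assumed to be ISO 8601.

```python
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical rows; column names follow the README).
raw = pd.DataFrame({
    "course_id": [1, 1, 2, 3],
    "course_title": ["A", "A", "B", "C"],
    "price": [20, 20, 0, 50],
    "num_subscribers": [100, 100, 500, 10],
    "published_timestamp": [
        "2013-05-01T10:00:00Z", "2013-05-01T10:00:00Z",
        "2015-07-20T08:30:00Z", "2016-01-02T23:15:00Z",
    ],
})


def clean_courses(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: de-duplicate, parse timestamps, add revenue."""
    out = df.drop_duplicates().copy()  # drop rows that are exact duplicates
    # Parse the timestamp strings into timezone-aware datetimes.
    out["published_timestamp"] = pd.to_datetime(out["published_timestamp"], utc=True)
    out["year"] = out["published_timestamp"].dt.year
    out["date"] = out["published_timestamp"].dt.date
    out["time"] = out["published_timestamp"].dt.time
    # Revenue as defined in the README: price times subscriber count.
    out["revenue"] = out["price"] * out["num_subscribers"]
    return out.reset_index(drop=True)


courses = clean_courses(raw)
```

On the real data, raw would presumably come from pd.read_csv on the dataset file. Note that this revenue figure is an estimate: it assumes every subscriber paid the listed price.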

  2. Statistical Analysis & Classification (Excel)

Revenue Summary: Total revenue generated across all courses: ~881,674,940

Average Content Duration: ~4.10 hours

Correlations:

Price vs. Subscribers: Found a very weak positive correlation (0.05), suggesting that price is not a primary driver for enrollment volume.

Price vs. Reviews: Analyzed the relationship between course cost and user feedback frequency.
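The correlations were computed in Excel, but the same check can be sketched in Python with pandas. The snippet below assumes the Pearson coefficient is the measure used, assumes the review column is named num_reviews, and runs on made-up sample values, so its numbers are not the project's results.

```python
import pandas as pd

# Made-up sample values; only the column names mirror the dataset description.
sample = pd.DataFrame({
    "price": [20, 50, 100, 150, 200],
    "num_subscribers": [1200, 300, 900, 450, 1000],
    "num_reviews": [80, 10, 60, 25, 90],
})

# Pairwise Pearson correlation matrix for the three columns.
corr = sample[["price", "num_subscribers", "num_reviews"]].corr(method="pearson")

price_vs_subscribers = corr.loc["price", "num_subscribers"]
price_vs_reviews = corr.loc["price", "num_reviews"]
```

A coefficient near 0, such as the 0.05 reported for price vs. subscribers, means the two columns have almost no linear relationship. It does not rule out a non-linear one.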

Content Duration Classification: Categorized courses by duration using the mean (4.10 hours) and standard deviation (6.06 hours):

Normal Content: Courses within a standard duration range (~92.13% of the dataset, 3,383 courses).

Long Content: Courses significantly exceeding the mean duration (~7.87% of the dataset, 289 courses).
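The README does not state the exact cutoff between the two classes. As a sketch only, the snippet below assumes a course counts as "Long Content" when its duration exceeds the mean plus one standard deviation. The function name classify_duration, the multiplier k, and the sample durations are all hypothetical.

```python
import pandas as pd

# Hypothetical durations in hours, with one clear outlier.
durations = pd.Series(
    [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0, 30.0],
    name="content_duration",
)


def classify_duration(hours: pd.Series, k: float = 1.0) -> pd.Series:
    """Label each course 'Long Content' if it exceeds mean + k * std (assumed rule)."""
    # Series.std() uses the sample standard deviation (ddof=1) by default.
    cutoff = hours.mean() + k * hours.std()
    return hours.gt(cutoff).map({True: "Long Content", False: "Normal Content"})


labels = classify_duration(durations)
share = labels.value_counts(normalize=True) * 100  # percentage per class
```

With the reported mean of 4.10 and standard deviation of 6.06, this assumed rule would put the cutoff at about 10.16 hours. Whether that reproduces the 289 long courses depends on the rule actually used in the Excel sheet.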

  3. Exploratory Data Analysis (Python)

Yearly Revenue Trends: Visualized how total revenue evolved over time using line and bar charts.

Subject Analysis: Analyzed revenue distribution across different subjects (e.g., Web Development, Business Finance) to identify high-performing categories.

Market Share: Used pie charts to visualize the percentage contribution of each year and subject to the total revenue.
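The numbers behind these charts come from grouping revenue by year or by subject. The sketch below shows one way to compute the totals and percentage shares. The helper revenue_share, the column names year and subject, and the sample rows are assumptions for illustration, not taken from the project script.

```python
import pandas as pd

# Hypothetical cleaned rows with a year, a subject and a revenue column.
courses = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015, 2016],
    "subject": [
        "Web Development", "Business Finance", "Web Development",
        "Graphic Design", "Web Development",
    ],
    "revenue": [1000, 500, 3000, 500, 5000],
})


def revenue_share(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Total revenue per group plus each group's percentage of the overall total."""
    totals = df.groupby(by)["revenue"].sum()
    share_pct = (100 * totals / totals.sum()).round(2)
    return pd.DataFrame({"revenue": totals, "share_pct": share_pct})


by_year = revenue_share(courses, "year")
by_subject = revenue_share(courses, "subject").sort_values("revenue", ascending=False)
```

Either table can then be plotted. For example, by_year["revenue"].plot(kind="bar") draws a bar chart through pandas' Matplotlib integration, and kind="pie" on the share_pct column gives the market-share view.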

Summary of Insights

Top Performing Year: The analysis identifies the specific year with peak revenue and course publication activity.

Content Strategy: Most courses on the platform (~92%) fall in the "Normal" duration class, around the mean of about 4 hours, which suggests a preference for concise, focused learning modules.

Revenue Drivers: While individual course prices vary, the bulk of revenue is driven by high subscriber counts in specific high-demand subjects such as Web Development.

Project Structure

udemy_courses_analysis.py: Main script for data cleaning, transformation, and plotting.

Report.csv: Summary of high-level project metrics.

Correlations.csv: Detailed statistical correlations between price, engagement, and duration.

ClassifyContentDuration.csv: Classification logic and results for course lengths.

How to Run

Ensure Python 3.x is installed along with pandas, NumPy, matplotlib, and seaborn.

Place the dataset udemy_online_education_courses_dataset.csv in the project directory.

Run the analysis script:

    python udemy_courses_analysis_second_project.py
