Udemy Data Project

A data project to explore data collected from Udemy's webpage

Files

/Data/Udemy_Raw.csv --> Contains raw data extracted from Udemy
/Data/Udemy_Clean.csv --> Cleaned version of raw data
Data Collection - Udemy Scrape.py --> Used for web scraping of Udemy
Data Cleaning.ipynb --> Cleans the raw data obtained

Data Collection Method

Data was scraped from Udemy's webpage (https://www.udemy.com/) using a combination of Selenium and BeautifulSoup. To obtain a variety of data, course details were scraped from various subcategories, and the courses were drawn from a good mix of the filters used. These filters ranked courses by popularity, highest ratings, newest courses and free courses.

Data Cleaning

The data was relatively clean because the web scraping was controlled and customized, so only minimal cleaning was needed.

Data Details

This dataset consists of 16,429 observations (rows) and 20 variables (columns).
The data file (Udemy_Clean.csv) consists of the following columns:

Title: Title of Udemy course
Overall_Rating: Overall average rating given by users who have taken the course
Best_Rating: Highest rating given by users who have taken the course
Worst_Rating: Lowest rating given by users who have taken the course
No_of_Ratings: Total number of ratings given for the course
Category: First level of categorization for course
Subcategory: Second level of categorization for course
Topic: Third level of categorization for course
Instructor: Name of instructor of course
Language: Original language in which the course is conducted
SkillsFuture: Indicates whether course is SkillsFuture Credit eligible
- For more information about SkillsFuture, you can refer to this link: https://www.myskillsfuture.gov.sg/content/portal/en/training-exchange/course-landing.html?cid=a1:Google%7Ca2:SEM%7Ca3:%2001062021_linkto_landing-courses
No_of_Practice_Test: Number of practice tests provided as part of the course materials
No_of_Articles: Number of articles provided as part of the course materials
No_of_Coding_Exercises: Number of coding exercises provided as part of the course materials
Video_Duration_Hr: Total duration of course videos (in hours)
Bestseller: Whether or not course is a bestseller
Price: Original price of course
Discounted_Price: Price of course after discount

Exploratory Data Analysis

An analysis of the Udemy data was conducted to investigate the relationship between various features and the target variable, "Price". A few interesting observations emerged:

The number of practice tests has the lowest correlation with price, at -0.08.
- More practice tests would be expected to mean more course material, and hence a higher price.
- Instead, the relationship is negative, although at -0.08 it is very weak. This is surprising, because more course material would logically command a higher price, not a lower one. The relationship could possibly be due to a confounder.
An interesting finding was that some of the free courses provided were relatively highly rated (3.7 and above).
- Based on the samples taken, the course content quality is good despite the fact that the course is free.
Safety & First Aid was the subcategory with the lowest average price.
- Safety and first aid are vital life-saving skills hence I would have expected that they would be quite popular and priced highly.
- One plausible explanation is that such courses require in-person classes because of practical components such as CPR, so they may be less desirable online.
- Another possible reason would be that as a form of encouragement for users to pick up such important skills, Udemy would offer these courses for free.
Discounted courses had a lower average price compared to courses that are not discounted.
- Discounts could be given for more pricey courses in order to make them more affordable and encourage users to pick up these courses.
- Alternatively, this could be a form of price discrimination, where customers are charged a premium for a quality course. This gives users a wider variety, since they can choose between the various classes of courses.
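The two kinds of summary used in the observations above (a correlation with price and an average price per subcategory) can be sketched in plain Python. This is an illustrative sketch only: the helper names `pearson` and `mean_price_by` are hypothetical, the rows are made-up toy values rather than rows from Udemy_Clean.csv, and the project notebooks may compute these figures differently.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_price_by(rows, key):
    """Average Price for each distinct value of the column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["Price"])
    return {value: sum(prices) / len(prices) for value, prices in groups.items()}


# Made-up toy rows for illustration; NOT taken from the real dataset.
rows = [
    {"Subcategory": "A", "No_of_Practice_Test": 0, "Price": 100.0},
    {"Subcategory": "A", "No_of_Practice_Test": 2, "Price": 80.0},
    {"Subcategory": "B", "No_of_Practice_Test": 4, "Price": 60.0},
    {"Subcategory": "B", "No_of_Practice_Test": 6, "Price": 40.0},
]

corr = pearson(
    [row["No_of_Practice_Test"] for row in rows],
    [row["Price"] for row in rows],
)
avg_price = mean_price_by(rows, "Subcategory")
```

In the toy rows the price falls in a perfectly straight line as practice tests increase, so `corr` is at the negative extreme of the scale. A value such as -0.08 sits very close to zero by comparison.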

Project 1: Price Mechanism

Model

Objective: To build a model to determine the price mechanism of Udemy courses
Models: Multiple Linear Regression, Random Forest Regression, Gradient Boosting Models (XGBoost, LightGBM)
Evaluation metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)

RMSE and MAE are commonly used metrics for evaluating models, especially on platforms like Kaggle.
However, both metrics have limitations.
- RMSE penalizes larger errors more than smaller errors, which inflates the error score when a few predictions are far off.
- MAE weights all errors linearly, so it does not distinguish a few large errors from many small ones.
Hence, both are combined: MAE is used for adjustments within models during optimization, and RMSE is used to compare the different models.
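The difference between the two metrics can be seen in a minimal sketch. The functions below follow the standard definitions of RMSE and MAE; the numbers are made up for illustration and are not project results.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values: both prediction sets have the same total absolute error (8).
actual = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 12.0, 12.0, 12.0]    # four errors of 2
one_big_error = [10.0, 10.0, 10.0, 18.0]  # a single error of 8

mae_even, mae_big = mae(actual, even_errors), mae(actual, one_big_error)
rmse_even, rmse_big = rmse(actual, even_errors), rmse(actual, one_big_error)
```

Both prediction sets have an MAE of 2.0, but the RMSE rises from 2.0 to 4.0 when the same total error is concentrated in one prediction.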

Evaluation

The root mean squared error (RMSE) is computed between log(predicted price) and log(actual price), so that errors on the more costly and the less costly courses affect the results to the same extent.

Model Scores (RMSE)

Multiple Linear Regression: 0.633
Random Forest Regression: 0.574
XGB Regression: 0.571
LightGBM Regression: 0.561

Best performing model: Light Gradient Boosting Machine Model

Note: The RMSE is still relatively high and unsatisfactory. This could be due to limitations of the assumptions made for the project, which are discussed in the Project Limitations section.
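The reason for taking logs before computing the RMSE in this section can be illustrated with a short sketch. The function name `log_rmse` is hypothetical and the prices are made up. The sketch also assumes all prices are strictly positive; how free courses are handled is not covered here.

```python
from math import log, sqrt


def log_rmse(actual, predicted):
    """RMSE between log(actual) and log(predicted); assumes positive prices."""
    squared = [(log(a) - log(p)) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Made-up prices: each prediction is half of the actual price.
cheap = log_rmse([20.0], [10.0])     # absolute error of 10
costly = log_rmse([200.0], [100.0])  # absolute error of 100
```

On the log scale, both predictions produce the same error (log 2, roughly 0.693), even though the absolute errors differ by a factor of ten. The error reflects the ratio between predicted and actual price rather than the difference in currency units.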

Project 2: Discount Classification

Model

Objective: To build a model to classify whether a course will be given a discount
Models: Decision Tree Classifier, Random Forest Classifier, Logistic Regression, Naive Bayes Classification, Support Vector Classifier
Evaluation metrics: Classification Accuracy, ROC Area Under Curve

The ROC curve shows how well the model separates the two classes based on its predicted probabilities.
- Hence, the ROC area under curve is used to tune and improve the various models.
The accuracy score shows the proportion of predictions that match the actual value.
- This metric is used to compare the actual performance of the models.
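The two metrics above can be sketched in plain Python. Here the ROC area under curve is computed through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The labels, scores and 0.5 threshold are made-up assumptions for illustration, not values from the project.

```python
def accuracy(labels, predictions):
    """Proportion of predictions that match the actual labels."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = discounted, 0 = not discounted.
labels = [1, 1, 1, 0]
scores = [0.9, 0.8, 0.3, 0.4]  # hypothetical predicted probabilities
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
```

Accuracy depends on the chosen threshold, while the AUC depends only on how the scores rank the examples, so the two metrics can disagree when the classes are imbalanced.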

Evaluation

Model Scores (Classification Accuracy)

Decision Tree, Score: 90.99%
Random Forest, Score: 85.19%
Logistic Regression, Score: 92.35%
Naive Bayes, Score: 23.21%
Support Vector, Score: 92.96%

Best performing model: Support Vector Classifier

Project Limitations

One limitation of this project relates to its biggest assumption: that Udemy's price mechanism follows a rational relationship with related factors, such as the rating of the course. However, this assumption fails to take into account that Udemy uses discounts as a marketing scheme (urgency marketing), and prices are often changed artificially based on your browser cookies. More can be read on this website: https://skillscouter.com/how-often-does-udemy-have-sales/.
It is therefore hard to estimate the actual pricing of courses. In reality, the actual price of a Udemy course could be defined as the discounted price, since the original price acts as a decoy option to entice consumers to buy. That is hard to capture, however, because the discounts change constantly based on your cookies.
The results for the classification project are also unexpectedly high, which may be an indicator of overfitting. This could be due to a lack of data available to ensure that the model is not overtrained.


keliza-toh/Udemy-Data-Project
