A data project to explore data collected from Udemy's webpage
/Data/Udemy_Raw.csv --> Contains raw data extracted from Udemy
/Data/Udemy_Clean.csv --> Cleaned version of raw data
Data Collection - Udemy Scrape.py --> Used for web scraping of Udemy
Data Cleaning.ipynb --> Cleans the raw data obtained
Data was scraped using a combination of Selenium and BeautifulSoup to obtain information from Udemy's webpage (https://www.udemy.com/). To obtain a variety of data, course details were scraped from various subcategories, and the courses were drawn from a mix of listing filters. These filters ranked courses by popularity, highest ratings, newest courses and free courses.
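The parsing half of that workflow can be sketched as below. This is only an illustration, not the code in Data Collection - Udemy Scrape.py: the HTML snippet, the tag and class names, and the `parse_courses` helper are all hypothetical, and the sample values are made up. In the real script the HTML would presumably come from Selenium (for example via `driver.page_source`) after the page has rendered.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these tag and class names are placeholders,
# not Udemy's real page structure. The values are made up.
SAMPLE_HTML = """
<div class="course-card">
  <h3 class="course-title">Intro to Python</h3>
  <span class="course-rating">4.5</span>
  <span class="course-price">19.99</span>
</div>
<div class="course-card">
  <h3 class="course-title">First Aid Basics</h3>
  <span class="course-rating">4.1</span>
  <span class="course-price">Free</span>
</div>
"""


def parse_courses(page_source):
    """Return one dict per course card found in the rendered HTML."""
    soup = BeautifulSoup(page_source, "html.parser")
    courses = []
    for card in soup.find_all("div", class_="course-card"):
        title = card.find("h3", class_="course-title").get_text(strip=True)
        rating = card.find("span", class_="course-rating").get_text(strip=True)
        price = card.find("span", class_="course-price").get_text(strip=True)
        courses.append({
            "Title": title,
            "Overall_Rating": float(rating),
            "Price": price,  # kept as text; "Free" needs handling during cleaning
        })
    return courses


courses = parse_courses(SAMPLE_HTML)
for course in courses:
    print(course)
```

Keeping the parsing in a function that takes plain HTML makes it possible to test it on a saved page without launching a browser.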
The data was relatively clean because the web scraping was controlled and customized. As a result, only minimal cleaning was needed.
This dataset consists of 16,429 observations (rows) and 20 variables (columns).
The data file (Udemy_Clean.csv) consists of the following columns:
- Title: Title of Udemy course
- Overall_Rating: Overall average rating given by users who have taken the course
- Best_Rating: Highest rating given by users who have taken the course
- Worst_Rating: Lowest rating given by users who have taken the course
- No_of_Ratings: Total number of ratings given for the course
- Category: First level of categorization for course
- Subcategory: Second level of categorization for course
- Topic: Third level of categorization for course
- Instructor: Name of instructor of course
- Language: Original language in which the course is conducted
- SkillsFuture: Indicates whether course is SkillsFuture Credit eligible
- For more information about SkillsFuture, you can refer to this link: https://www.myskillsfuture.gov.sg/content/portal/en/training-exchange/course-landing.html?cid=a1:Google%7Ca2:SEM%7Ca3:%2001062021_linkto_landing-courses
- No_of_Practice_Test: Number of practice tests provided as part of the course materials
- No_of_Articles: Number of articles provided as part of the course materials
- No_of_Coding_Exercises: Number of coding exercises provided as part of the course materials
- Video_Duration_Hr: Total duration of course videos (in hours)
- Bestseller: Whether or not course is a bestseller
- Price: Original price of course
- Discounted_Price: Price of course after discount
An analysis of the Udemy data was conducted to investigate the relationship between various features and the target variable, "Price". A few interesting observations emerged:
- The number of practice tests has the lowest correlation of -0.08 with price.
- More practice tests would be expected to mean more course material, and hence a higher price.
- Instead, the relationship is negative. This is surprising, because it suggests that more course material goes with a lower price. Since the correlation is weak, this relationship could possibly be due to a confounder.
- An interesting finding was that some of the free courses provided were relatively highly rated (3.7 and above).
- Based on the samples taken, this suggests that course content can be of good quality even when the course is free.
- Safety & First Aid was the subcategory with the lowest average price.
- Safety and first aid are vital life-saving skills, hence I would have expected these courses to be quite popular and priced highly.
- One plausible explanation is that such courses require in-person classes because of practical components such as CPR, so they may be less desirable online.
- Another possible reason is that Udemy offers these courses for free to encourage users to pick up such important skills.
- Discounted courses had a lower average price compared to courses that are not discounted.
- One might expect discounts to be given on pricier courses, to make them more affordable and encourage users to take them up.
- The other approach is price discrimination, where customers are charged a premium for a quality course. This creates a wider variety for users, who can choose between the various classes of courses.
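The correlation quoted in the first observation above can be illustrated with a small sketch. It assumes the figure is a Pearson correlation coefficient, which the analysis does not state explicitly, and the toy values below are made up rather than taken from Udemy_Clean.csv. On the real data, something like pandas' `DataFrame.corr()` would be the usual shortcut.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values standing in for No_of_Practice_Test and Price.
practice_tests = [0, 0, 1, 2, 5, 6]
prices = [120.0, 90.0, 100.0, 60.0, 80.0, 40.0]

r = pearson_r(practice_tests, prices)
print(f"correlation between practice tests and price: {r:.2f}")
```

A value near zero, like the -0.08 reported above, indicates almost no linear relationship, so the sign on its own should be read with caution.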
Objective: To build a model to determine the price mechanism of Udemy courses
Models: Multiple Linear Regression, Random Forest Regression, Gradient Boosting Models (XGBoost, LightGBM)
Evaluation metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)
- RMSE and MAE are commonly used metrics for evaluating regression models, especially on platforms like Kaggle.
- However, both metrics have limitations.
- RMSE penalizes larger errors more heavily than smaller ones, so a few large errors can inflate the score.
- MAE weights every error in proportion to its size, so it does not single out large errors.
- Hence, both are used: MAE for tuning within each model, and RMSE for comparing the different models.
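The difference between the two metrics can be seen in a minimal sketch. The numbers are made up for illustration: both sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


actual = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]       # four errors of size 2
one_big_error = [10.0, 10.0, 10.0, 18.0]   # a single error of size 8

print("even errors:   MAE", mae(actual, even_errors), "RMSE", rmse(actual, even_errors))
print("one big error: MAE", mae(actual, one_big_error), "RMSE", rmse(actual, one_big_error))
```

MAE is identical for both sets of predictions, while RMSE doubles for the set with the single large miss.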
The root mean squared error (RMSE) is computed between log(predicted price) and log(actual price), so that errors on more costly and less costly courses affect the result to the same extent.
Model Scores (RMSE)
- Multiple Linear Regression: 0.633
- Random Forest Regression: 0.574
- XGB Regression: 0.571
- LightGBM Regression: 0.561
Best performing model: Light Gradient Boosting Machine Model
Note: The RMSE is still relatively high and unsatisfactory. This could be due to limitations of the assumptions made for the project, which are discussed in a later section.
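The log-scale RMSE behind the scores above can be sketched as follows. The `rmse_log` helper and the prices are illustrative assumptions. The sketch also assumes strictly positive prices, since the logarithm of zero is undefined; how free courses were handled in the actual modelling is not stated here.

```python
import math


def rmse_log(actual_prices, predicted_prices):
    """RMSE between log(predicted) and log(actual); prices must be > 0."""
    squared = [
        (math.log(p) - math.log(a)) ** 2
        for a, p in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared) / len(squared))


# The same 50% overestimate on a cheap and on an expensive course.
cheap = rmse_log([20.0], [30.0])
expensive = rmse_log([200.0], [300.0])

print(f"cheap course error:     {cheap:.3f}")
print(f"expensive course error: {expensive:.3f}")
```

Because the log turns ratios into differences, the same relative miss contributes the same error regardless of the course's price level.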
Objective: To build a model to classify whether a course will be given a discount
Models: Decision Tree Classifier, Random Forest Classifier, Logistic Regression, Naive Bayes Classification, Support Vector Classifier
Evaluation metrics: Classification Accuracy, ROC Area Under Curve
- The ROC curve shows how well the model separates the two classes based on its predicted probabilities.
- Hence, ROC area under curve is used to tune and improve each model.
- The accuracy score is the proportion of predictions that match the actual value.
- This metric is used to compare the actual performance of the models.
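Both metrics can be sketched in plain Python. The labels and scores are made up, and the `roc_auc` helper uses the pairwise interpretation of AUC (the chance that a randomly chosen positive is scored above a randomly chosen negative), which is a simple way to compute it on small samples rather than necessarily how the project computed it. With scikit-learn, `accuracy_score` and `roc_auc_score` would be the usual choices.

```python
def accuracy(y_true, y_pred):
    """Proportion of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = discounted, 0 = not discounted.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]               # predicted probability of a discount
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # labels at a 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy: {acc:.3f}, ROC AUC: {auc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC looks only at how the scores rank the examples, which is why the two can disagree.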
Model Scores (Classification Accuracy)
- Decision Tree, Score: 90.99%
- Random Forest, Score: 85.19%
- Logistic Regression, Score: 92.35%
- Naive Bayes, Score: 23.21%
- Support Vector, Score: 92.96%
Best performing model: Support Vector Classifier
One limitation of this project relates to its biggest assumption: that Udemy's price mechanism follows a rational relationship with related factors such as the rating of the course. However, this assumption does not account for the fact that Udemy uses discounts as a marketing scheme (urgency marketing). Prices are often changed artificially based on your browser cookies. More can be read on this website: https://skillscouter.com/how-often-does-udemy-have-sales/.
This makes it hard to estimate the actual pricing of courses. In reality, the actual price of a Udemy course can be defined as the discounted price, since the listed price acts as a decoy option to entice consumers to buy. The discounted price is itself hard to capture, because the discounts change constantly based on your cookies.
The results for the classification project are also unexpectedly high, which may be an indicator of overfitting. This could be due to a lack of data available to ensure that the model is not overtrained.
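One common way to probe for overfitting is k-fold cross-validation, where every observation is held out exactly once and scores on held-out folds are compared with training scores. Whether the project used it is not stated, so the sketch below is only a suggestion; the `k_fold_indices` helper is hypothetical and splits indices contiguously without shuffling. In practice the rows would usually be shuffled first, for example with scikit-learn's `KFold(shuffle=True)`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A large gap between the training score and the average held-out score across folds would support the overfitting explanation.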