In this EDA, we will explore the courses offered by UDEMY.

- Let's import the required libraries

In [1]:
import pandas as pd
import numpy as np

import plotly.io as pio
pio.renderers.default = 'iframe'

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [2]:
df = pd.read_csv('../input/udemy-courses/udemy_courses.csv')
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance


In [3]:
df.shape

(3678, 12)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB


In [5]:
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64

In [6]:
df.describe()

Unnamed: 0,course_id,price,num_subscribers,num_reviews,num_lectures,content_duration
count,3678.0,3678.0,3678.0,3678.0,3678.0,3678.0
mean,675972.0,66.049483,3197.150625,156.259108,40.108755,4.094517
std,343273.2,61.005755,9504.11701,935.452044,50.383346,6.05384
min,8324.0,0.0,0.0,0.0,0.0,0.0
25%,407692.5,20.0,111.0,4.0,15.0,1.0
50%,687917.0,45.0,911.5,18.0,25.0,2.0
75%,961355.5,95.0,2546.0,67.0,45.75,4.5
max,1282064.0,200.0,268923.0,27445.0,779.0,78.5


Let's summarize what we have got from the dataset.

- Our dataset has information about the courses offered by UDEMY.
- The course ID ('course_id') and course URL ('url') are not needed for our analysis, so we will drop them.
- The course publication date is stored as an object (string) and needs to be converted to a datetime.
- There are no missing values, which simplifies the data preparation stage.
- The 'level' column is a categorical variable; it would be good to see whether there are any significant differences among the levels.
- The numerical variables deserve special attention in further analysis.

- Let's make the necessary adjustments before moving to the analysis part.

In [7]:
df['date'] = pd.to_datetime(df['published_timestamp'])

In [8]:
df = df.drop(['course_id','url','published_timestamp'], axis=1)
df.sample(2)

Unnamed: 0,course_title,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,subject,date
1951,Piano From Zero To Pro - Beginner Essentials T...,True,70,811,129,52,Beginner Level,3.5,Musical Instruments,2016-03-16 15:28:29+00:00
30,Python Algo Stock Trading: Automate Your Trading!,True,95,1165,21,41,Beginner Level,2.5,Business Finance,2017-05-28 23:41:03+00:00


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   course_title      3678 non-null   object             
 1   is_paid           3678 non-null   bool               
 2   price             3678 non-null   int64              
 3   num_subscribers   3678 non-null   int64              
 4   num_reviews       3678 non-null   int64              
 5   num_lectures      3678 non-null   int64              
 6   level             3678 non-null   object             
 7   content_duration  3678 non-null   float64            
 8   subject           3678 non-null   object             
 9   date              3678 non-null   datetime64[ns, UTC]
dtypes: bool(1), datetime64[ns, UTC](1), float64(1), int64(4), object(3)
memory usage: 262.3+ KB


- Seems OK. Let's move on to the next step: the **analysis part**.

### Analysis Part

In [10]:
df.describe()

Unnamed: 0,price,num_subscribers,num_reviews,num_lectures,content_duration
count,3678.0,3678.0,3678.0,3678.0,3678.0
mean,66.049483,3197.150625,156.259108,40.108755,4.094517
std,61.005755,9504.11701,935.452044,50.383346,6.05384
min,0.0,0.0,0.0,0.0,0.0
25%,20.0,111.0,4.0,15.0,1.0
50%,45.0,911.5,18.0,25.0,2.0
75%,95.0,2546.0,67.0,45.75,4.5
max,200.0,268923.0,27445.0,779.0,78.5


Let's look at some of the information we can get from the above table:

- At first glance, every variable has a minimum of 0, while the maximum values range from 78.5 (content duration) to 268,923 (number of subscribers).
- The mean and median values also differ considerably. Every variable has a noticeably higher mean than median, which is a good sign of a highly skewed distribution, more specifically a right-skewed distribution with possible outliers on the maximum side. It is worth keeping this in mind for further analysis.

- For the aforementioned reasons, the median is used in the following lines to summarize the above table.

- The median price is $45.

- The median number of subscribers per course is around 912.

- The median number of reviews is 18.
- The median number of lectures is 25.
- The median content duration is 2 hours.
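
The gap between mean and median can be put into numbers. The following is a minimal sketch that re-enters the mean and median values printed by `df.describe()` by hand and computes their ratio. The `summary` frame and the `mean_to_median` column are names introduced here for illustration only; a ratio well above 1 is consistent with a right-skewed distribution.

```python
import pandas as pd

# Mean and median (50%) values copied by hand from the df.describe() output above.
summary = pd.DataFrame(
    {
        "mean": [66.049483, 3197.150625, 156.259108, 40.108755, 4.094517],
        "median": [45.0, 911.5, 18.0, 25.0, 2.0],
    },
    index=["price", "num_subscribers", "num_reviews", "num_lectures", "content_duration"],
)

# Ratio of mean to median: values above 1 mean the mean is pulled up by large values.
# "mean_to_median" is an illustrative column name, not part of the dataset.
summary["mean_to_median"] = (summary["mean"] / summary["median"]).round(2)

print(summary.sort_values("mean_to_median", ascending=False))
```

On the full dataset, `df.skew(numeric_only=True)` should give a more direct skewness estimate per column.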


- OK, let's see this analysis in Plotly.

#### **Prices of UDEMY Courses**

In [11]:
fig = px.histogram(df, x= 'price', title='Prices of UDEMY Courses')

fig.show()

As seen in the histogram, UDEMY has 310 free courses, and 295 of its courses are priced at $200. As expected, the distribution is highly right-skewed.

#### **Number of Subscribers of UDEMY Courses**

In [12]:
fig = px.histogram(df, x= 'num_subscribers', title='Number of Subscribers of UDEMY Courses')

fig.show()

The number of subscribers ranges from 0 to 268,923, which gives a highly skewed distribution.

#### **Number of Reviews of UDEMY Courses**

In [13]:
fig = px.histogram(df, x= 'num_reviews', title='Number of Reviews of UDEMY Courses')

fig.show()

The number of reviews ranges from 0 to 27,445, which again gives a highly skewed distribution.

#### **Number of Lectures of UDEMY Courses**

In [14]:
fig = px.histogram(df, x= 'num_lectures', title='Number of Lectures of UDEMY Courses')

fig.show()

From the histogram of the number of lectures, we can see that many courses fall in the 20-45 range. But as mentioned before, and as is easy to see in the histogram, the data are highly skewed with outliers.

#### **Durations of UDEMY Courses**

In [15]:
fig = px.histogram(df, x= 'content_duration', title='Durations of UDEMY Courses')

fig.show()

From the histogram of course durations, we can see that many courses fall in the 0-3 hours range. But as mentioned before, and as is easy to see in the histogram, the data are highly skewed with outliers.
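
One way to make "outliers" concrete is Tukey's rule, which flags values above Q3 + 1.5 × IQR. The sketch below is only an illustration: it re-enters the quartiles and maxima from the `df.describe()` output shown earlier, and the `stats` frame and its column names are introduced here rather than taken from the notebook.

```python
import pandas as pd

# Quartiles (25%, 75%) and maxima copied by hand from the df.describe() output.
stats = pd.DataFrame(
    {
        "q1": [20.0, 111.0, 4.0, 15.0, 1.0],
        "q3": [95.0, 2546.0, 67.0, 45.75, 4.5],
        "max": [200.0, 268923.0, 27445.0, 779.0, 78.5],
    },
    index=["price", "num_subscribers", "num_reviews", "num_lectures", "content_duration"],
)

# Tukey's upper fence: anything above Q3 + 1.5 * IQR is treated as an outlier.
iqr = stats["q3"] - stats["q1"]
stats["upper_fence"] = stats["q3"] + 1.5 * iqr

# Does the largest observed value lie beyond the fence?
stats["max_beyond_fence"] = stats["max"] > stats["upper_fence"]

print(stats)
```

With these numbers, the price maximum of 200 stays inside its fence of 207.5, while the maxima of the other four variables lie far beyond theirs. The 1.5 multiplier is a convention, not a property of this dataset.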

- Before moving on to the details, let's see the correlation matrix for our dataset.

In [16]:
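# Keep only numeric and boolean columns: recent pandas versions raise an error
# if text columns such as course_title are passed to corr().
df.select_dtypes(include=['number', 'bool']).corr()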

Unnamed: 0,is_paid,price,num_subscribers,num_reviews,num_lectures,content_duration
is_paid,1.0,0.328513,-0.266159,-0.087471,0.112574,0.094417
price,0.328513,1.0,0.050769,0.113696,0.33016,0.29345
num_subscribers,-0.266159,0.050769,1.0,0.649946,0.157746,0.161839
num_reviews,-0.087471,0.113696,0.649946,1.0,0.243029,0.228889
num_lectures,0.112574,0.33016,0.157746,0.243029,1.0,0.801647
content_duration,0.094417,0.29345,0.161839,0.228889,0.801647,1.0


In [17]:
index_vals = df['level'].astype('category').cat.codes
fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='price',
                                 values=df['price']),
                            dict(label='num_subscribers',
                                 values=df['num_subscribers']),
                            dict(label='num_reviews',
                                 values=df['num_reviews']),
                            dict(label='num_lectures',
                                 values=df['num_lectures']),
                           dict(label='content_duration',
                                 values=df['content_duration'])],
                showupperhalf=False, 
                text=df['level'],
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='UDEMY Courses',
    width=1000,
    height=1000,
)

fig.show()

Based on the results:
- There is a moderate positive relationship (about 0.65) between the number of reviews and the number of subscribers.
- There is also a strong positive relationship (about 0.80) between the number of lectures in a course and the duration of the course.
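
Since these variables are heavily right-skewed, Pearson's coefficient (the default of `corr()`) can be influenced by a few extreme courses. A common complement is a rank-based (Spearman) correlation. The sketch below uses a small made-up frame (`toy`, with hypothetical values, not the Udemy data) and computes Spearman as Pearson on ranks, to show how the two can differ for a skewed but perfectly monotonic relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows exponentially with x, so it is strongly right-skewed
# but still perfectly monotonic in x.
x = np.arange(1, 9)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association and is pulled down by the curvature.
pearson = toy["x"].corr(toy["y"])

# Spearman is Pearson computed on the ranks of the values.
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real dataset, the built-in `corr(method='spearman')` on the numeric columns should give the same kind of rank-based matrix.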

- After getting an overall picture of the data, we can go into more detail.

### UDEMY Courses Based on the **Subject**

- Let's see UDEMY courses by their subjects.

In [18]:
np.round(df['subject'].value_counts(normalize=True),2)

Web Development        0.33
Business Finance       0.32
Musical Instruments    0.18
Graphic Design         0.16
Name: subject, dtype: float64

- Overall, 33% of the Udemy courses are in Web Development and 32% are in Business Finance. The remaining 34% consists of Musical Instruments (18%) and Graphic Design (16%) courses.
- Courses on Business Finance and Web Development together make up almost 2 out of 3 courses.

In [19]:
fig = px.histogram(df, x="subject", title='Course Count by Subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **UDEMY Courses By Subject in Each Year**

In [20]:
df['year']= df['date'].dt.year
subject_by_year = df.groupby('year')['subject'].value_counts().reset_index(level=0).rename(columns={'subject': 'subject count'}, index={'index': 'Subject'})
subject_by_year

subject,year,subject count
Web Development,2011,5
Web Development,2012,19
Graphic Design,2012,10
Musical Instruments,2012,10
Business Finance,2012,6
Business Finance,2013,84
Web Development,2013,56
Musical Instruments,2013,39
Graphic Design,2013,23
Business Finance,2014,192
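
The same per-year counts can also be built as a flat table with `groupby(...).size()`, which does not depend on the name pandas gives to the `value_counts()` result (that name has changed between pandas versions). This is a sketch on a tiny made-up frame; `toy` and its rows are hypothetical and only stand in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df with just the two columns needed here.
toy = pd.DataFrame(
    {
        "year": [2012, 2012, 2012, 2013, 2013],
        "subject": [
            "Web Development",
            "Web Development",
            "Graphic Design",
            "Business Finance",
            "Web Development",
        ],
    }
)

# Count rows per (year, subject) pair and return a flat, long-format table.
subject_counts = (
    toy.groupby(["year", "subject"])
    .size()
    .reset_index(name="subject count")
)

print(subject_counts)
```

A long-format table like this can be passed to `px.line` with `color='subject'` instead of using the index for the color.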


In [21]:
fig = px.line(subject_by_year, x='year', y='subject count', color= subject_by_year.index, title='UDEMY Courses By Subject in Each Year')
fig.show()

- From the line plot we can see that the number of Udemy courses on Web Development and Business Finance increased significantly until 2015.
- The number of Business Finance courses stayed almost the same in 2016, while Web Development courses continued to increase significantly.

### **Based on the Level of the Courses**

- Let's see UDEMY courses by their levels.

In [22]:
np.round(df['level'].value_counts(normalize=True),2)

All Levels            0.52
Beginner Level        0.35
Intermediate Level    0.11
Expert Level          0.02
Name: level, dtype: float64

- Overall, 52% of the Udemy courses contain material for learners of all levels.
- Beginner-level courses make up 35% of all courses.
- About 1 out of 10 courses offered by UDEMY is at the intermediate level.
- Only 2 out of 100 courses offered by UDEMY are aimed at advanced or expert-level learners.

In [23]:
fig = px.histogram(df, x="level", title='Course Count by Level of Courses')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **UDEMY Courses By Level in Each Year**

In [24]:
level_by_year = df.groupby('year')['level'].value_counts().reset_index(level=0).rename(columns={'level': 'level count'}, index={'index': 'Level_of_Courses'})
level_by_year

level,year,level count
All Levels,2011,5
All Levels,2012,33
Beginner Level,2012,9
Intermediate Level,2012,3
All Levels,2013,102
Beginner Level,2013,73
Intermediate Level,2013,18
Expert Level,2013,9
All Levels,2014,272
Beginner Level,2014,155


In [25]:
fig = px.line(level_by_year, x='year', y='level count', color= level_by_year.index, title='UDEMY Courses By Level in Each Year')
fig.show()

- From the line plot we can see that the number of Udemy courses at the all-levels, beginner and intermediate levels increased significantly each year.

- On the other hand, the number of expert-level courses offered by UDEMY is inconsistent from year to year.


### UDEMY Courses: Number of Subscribers, Number of Reviews and Number of Lectures by Year

In [26]:
df1 = df.groupby('year')[['num_subscribers','num_reviews','num_lectures']].sum().reset_index()
df1

Unnamed: 0,year,num_subscribers,num_reviews,num_lectures
0,2011,119028,4041,574
1,2012,555339,10272,2374
2,2013,1723438,48585,7261
3,2014,1930406,86667,19288
4,2015,3475324,196810,41930
5,2016,2966644,195429,50854
6,2017,988941,32917,25239


In [27]:
fig = px.line(df1, x='year', y=['num_subscribers','num_reviews','num_lectures'])
fig.show()

- As seen in the line chart, the total number of subscribers increased steadily until 2015 and then decreased by around half a million in 2016. Since the data do not cover all of 2017, we cannot draw any conclusions for that year.
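
The size of the 2016 drop can be checked directly from the yearly totals. The sketch below re-enters the `num_subscribers` column of `df1` by hand and takes year-over-year differences; the `yearly` frame and the `change` column are names introduced here for illustration, and 2017 is left out because it is a partial year.

```python
import pandas as pd

# Yearly subscriber totals copied by hand from the df1 output above (2017 omitted: partial year).
yearly = pd.DataFrame(
    {
        "year": [2011, 2012, 2013, 2014, 2015, 2016],
        "num_subscribers": [119028, 555339, 1723438, 1930406, 3475324, 2966644],
    }
)

# Year-over-year change in the total number of subscribers.
yearly["change"] = yearly["num_subscribers"].diff()

print(yearly)
```

Note that these totals group courses by the year they were published, so the drop describes courses published in 2016, not sign-ups that happened during 2016.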

### Price & Courses

In [28]:
paid_by_year = df.groupby('year')['is_paid'].value_counts().reset_index(level=0).rename(columns={'is_paid': 'paid_free count'}, index={'index': 'is_paid'})
paid_by_year

is_paid,year,paid_free count
True,2011,5
True,2012,41
False,2012,4
True,2013,185
False,2013,17
True,2014,439
False,2014,52
True,2015,952
False,2015,62
True,2016,1109


In [29]:
fig = px.line(paid_by_year, x='year', y='paid_free count', color= paid_by_year.index)
fig.show()

- The numbers of both free and paid courses increased each year.
- Yep, agreed, there is not much of an increase in the free courses. It's a tough world.
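
To compare free and paid courses on the same footing, it can help to look at the share of free courses per year rather than raw counts. The sketch below re-enters the counts from the `paid_by_year` output for the years where both the paid and free rows are visible (2012 to 2015); the `counts` frame, the `kind` labels and the `free_share` column are illustrative names, not part of the notebook.

```python
import pandas as pd

# Counts copied by hand from the paid_by_year output (2012-2015 only).
counts = pd.DataFrame(
    {
        "year": [2012, 2012, 2013, 2013, 2014, 2014, 2015, 2015],
        "kind": ["paid", "free"] * 4,
        "n_courses": [41, 4, 185, 17, 439, 52, 952, 62],
    }
)

# One row per year, one column per kind of course.
wide = counts.pivot(index="year", columns="kind", values="n_courses")

# Fraction of the courses published in each year that are free.
wide["free_share"] = (wide["free"] / (wide["free"] + wide["paid"])).round(3)

print(wide)
```

On the full data, `df.groupby('year')['is_paid'].mean()` should give the paid share for every year in one step, since `is_paid` is boolean.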

### Top Paid Courses

In [30]:
top_15_paid_courses = df[df['price']!=0][['course_title','year','subject','num_subscribers']].sort_values(by= 'num_subscribers',ascending=False).head(15)
top_15_paid_courses

Unnamed: 0,course_title,year,subject,num_subscribers
3230,The Web Developer Bootcamp,2015,Web Development,121584
3232,The Complete Web Developer Course 2.0,2016,Web Development,114512
2619,Learn Javascript & JQuery From Scratch,2013,Web Development,84897
3247,JavaScript: Understanding the Weird Parts,2015,Web Development,79612
1979,Pianoforall - Incredible New Way To Learn Pian...,2014,Musical Instruments,75499
3204,Angular 4 (formerly Angular 2) - The Complete ...,2016,Web Development,73783
2701,Become a Web Developer from Scratch,2011,Web Development,69186
3246,Learn and Understand AngularJS,2014,Web Development,59361
3251,Learn and Understand NodeJS,2015,Web Development,58208
2662,The Complete HTML & CSS Course - From Novice T...,2015,Web Development,57422


In [31]:
fig = px.bar(top_15_paid_courses, y= 'num_subscribers', x='course_title', hover_data = top_15_paid_courses[['year','subject']], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Almost all of the top 15 paid courses are in the Web Development area, except for one course in the Musical Instruments area.

### Top Free Courses

In [32]:
top_15_free_courses = df[df['price']==0][['course_title','year','subject','num_subscribers']].sort_values(by= 'num_subscribers',ascending=False).head(15)
top_15_free_courses

Unnamed: 0,course_title,year,subject,num_subscribers
2827,Learn HTML5 Programming From Scratch,2013,Web Development,268923
3032,Coding for Entrepreneurs Basic,2013,Web Development,161029
2783,Build Your First Website in 1 Week with HTML5 ...,2014,Web Development,120291
1896,Free Beginner Electric Guitar Lessons,2012,Musical Instruments,101154
2589,Web Design for Web Developers: Build Beautiful...,2015,Web Development,98867
3289,Practical PHP: Master the Basics and Code Dyna...,2014,Web Development,83737
3665,Beginner Photoshop to HTML5 and CSS3,2012,Web Development,73110
2782,Web Development By Doing: HTML / CSS From Scratch,2013,Web Development,72932
3325,HTML and CSS for Beginners - Build a Website &...,2015,Web Development,70773
492,Bitcoin or How I Learned to Stop Worrying and ...,2013,Business Finance,65576


In [33]:
fig = px.bar(top_15_free_courses, y= 'num_subscribers', x='course_title', hover_data = top_15_free_courses[['year','subject']], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The top 15 free courses are mostly in the Web Development area, but they also include other subject areas.

### Top 15 Reviewed Courses

In [34]:
top_15_reviewed = df[['course_title','year','subject','is_paid','num_reviews']].sort_values(by='num_reviews', ascending=False).head(15)

top_15_reviewed

Unnamed: 0,course_title,year,subject,is_paid,num_reviews
3230,The Web Developer Bootcamp,2015,Web Development,True,27445
3232,The Complete Web Developer Course 2.0,2016,Web Development,True,22412
3204,Angular 4 (formerly Angular 2) - The Complete ...,2016,Web Development,True,19649
3247,JavaScript: Understanding the Weird Parts,2015,Web Development,True,16976
3254,Modern React with Redux,2015,Web Development,True,15117
3246,Learn and Understand AngularJS,2014,Web Development,True,11580
3251,Learn and Understand NodeJS,2015,Web Development,True,11123
2827,Learn HTML5 Programming From Scratch,2013,Web Development,False,8629
3228,Angular 2 with TypeScript for Beginners: The P...,2016,Web Development,True,8341
1979,Pianoforall - Incredible New Way To Learn Pian...,2014,Musical Instruments,True,7676


In [35]:
fig = px.bar(top_15_reviewed , y= 'num_reviews', x='course_title', hover_data = top_15_reviewed[['year','subject', 'is_paid']], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The top 15 reviewed courses are all in the Web Development area, except for one course. In addition, 11 of the 15 top-reviewed courses are paid courses.
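
Raw review counts favor courses with many subscribers, so a per-subscriber review rate can be a useful companion figure. The sketch below re-enters a few rows that appear in the tables above with full (untruncated) titles; the `courses` frame and the `reviews_per_subscriber` column are names introduced here, and the ratio is only a rough engagement proxy, not a quality rating.

```python
import pandas as pd

# Rows copied by hand from the top-subscribed and top-reviewed tables above.
courses = pd.DataFrame(
    {
        "course_title": [
            "The Web Developer Bootcamp",
            "The Complete Web Developer Course 2.0",
            "JavaScript: Understanding the Weird Parts",
            "Learn and Understand AngularJS",
            "Learn and Understand NodeJS",
            "Learn HTML5 Programming From Scratch",
        ],
        "is_paid": [True, True, True, True, True, False],
        "num_subscribers": [121584, 114512, 79612, 59361, 58208, 268923],
        "num_reviews": [27445, 22412, 16976, 11580, 11123, 8629],
    }
)

# Reviews per subscriber (illustrative metric, not a column of the dataset).
courses["reviews_per_subscriber"] = (
    courses["num_reviews"] / courses["num_subscribers"]
).round(3)

print(courses.sort_values("reviews_per_subscriber", ascending=False))
```

Applied to the whole dataset, courses with zero subscribers would need to be filtered out first to avoid dividing by zero.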

### Top 15 Expensive Courses

In [36]:
top_15_price = df[['course_title','year','subject','num_subscribers', 'price']].sort_values(by=['price','num_subscribers'], ascending=False).head(15)

top_15_price

Unnamed: 0,course_title,year,subject,num_subscribers,price
3230,The Web Developer Bootcamp,2015,Web Development,121584,200
3232,The Complete Web Developer Course 2.0,2016,Web Development,114512,200
1979,Pianoforall - Incredible New Way To Learn Pian...,2014,Musical Instruments,75499,200
1213,Photoshop for Entrepreneurs - Design 11 Practi...,2016,Graphic Design,36288,200
3233,Ultimate Web Designer & Developer Course: Buil...,2015,Web Development,33788,200
3206,PHP for Beginners -Become a PHP Master - Proje...,2015,Web Development,28880,200
2621,The Ultimate Web Developer How To Guide,2015,Web Development,24861,200
1526,How To Make Graphics For A Website,2014,Graphic Design,24857,200
3117,1 Hour JavaScript,2013,Web Development,22999,200
2755,Become A Web Developer And Seller - Build Webs...,2013,Web Development,21730,200


In [37]:
fig = px.bar(top_15_price , y= 'num_subscribers', x='course_title', hover_data = top_15_price[['price','year']], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The most expensive courses cost $200, and all of the subject areas appear in the list of the top 15 most expensive courses.

- Thanks to the dataset contributor for this data. The only thing missing, for me, is some information about course ratings. We can make some assumptions based on the number of subscribers or the number of reviews, but that still does not give us enough confidence to judge the quality of the courses.

- It was quite a pleasure to share this detailed, beginner-friendly EDA with you. Thanks for your time.

- All the best