### Data Preprocessing
Data preprocessing transforms raw or collected data into a format that is suitable for analysis or modeling. Its main goals are to clean the data, handle missing values, deal with outliers, and get the data ready for the modeling or analysis steps that follow. Typical tasks include:

1. **Data Cleaning:**
   - Removing or correcting inaccurate or inconsistent data.
   - Handling duplicate records or entries.
   - Standardizing data formats and values.

2. **Handling Missing Data:**
   - Identifying and handling missing values, which can involve imputation (replacing missing values with estimated values) or removal of rows or columns with excessive missing data.

3. **Data Transformation:**
   - Encoding categorical variables into numerical format, often using techniques like one-hot encoding or label encoding.
   - Scaling or normalizing numerical features to bring them to a common scale.
   - Logarithmic or power transformations to address data skewness.

4. **Handling Outliers:**
   - Identifying and addressing outliers, which can involve removing or transforming extreme values.

5. **Feature Engineering:**
   - Creating new features based on domain knowledge or insights from the data.
   - Reducing dimensionality through techniques like Principal Component Analysis (PCA).

6. **Data Reduction:**
   - Reducing the size of the dataset while preserving its essential characteristics, often through techniques like sampling or aggregation.

7. **Data Integration:**
   - Combining data from multiple sources into a single dataset.

8. **Data Formatting:**
   - Ensuring that the data is in the correct data types (e.g., dates are in date format) and has consistent units.

9. **Data Splitting:**
   - Splitting the data into training, validation, and test sets for machine learning tasks.

10. **Handling Imbalanced Data:**
    - Addressing class imbalance in classification problems through techniques like oversampling, undersampling, or using synthetic data generation methods.
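
A few of the steps above can be sketched in pandas. This is a minimal illustration on a hypothetical toy frame (the `raw` data and the column choices are assumptions, not part of the Udemy dataset), covering duplicate removal, median imputation, one-hot encoding, and min-max scaling:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Udemy dataset
raw = pd.DataFrame({
    "price": [20.0, 20.0, None, 200.0],
    "level": ["Beginner", "Beginner", "Expert", "Expert"],
})

# Data cleaning: drop exact duplicate rows
clean = raw.drop_duplicates().reset_index(drop=True)

# Missing data: impute price with the column median
clean["price"] = clean["price"].fillna(clean["price"].median())

# Transformation: one-hot encode the categorical column
encoded = pd.get_dummies(clean, columns=["level"])

# Transformation: min-max scale price to the range [0, 1]
price = encoded["price"]
encoded["price_scaled"] = (price - price.min()) / (price.max() - price.min())
```

Median imputation and min-max scaling are only one possible choice for each step; the right technique depends on the data and the model.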


### What is EDA?
EDA stands for Exploratory Data Analysis. It is an approach to data analysis that focuses on summarizing and visualizing data to better understand its underlying structure, patterns, and relationships. EDA is typically one of the first steps in the data analysis process and is used to gain insights into the data before more advanced statistical or machine learning techniques are applied.

Key objectives of EDA include:

1. **Data Summarization**: EDA involves calculating summary statistics such as mean, median, variance, and percentiles to get a basic understanding of the data's central tendencies and spread.

2. **Data Visualization**: Visualization techniques, such as histograms, scatter plots, box plots, and heatmaps, are used to visually represent the data. These visualizations can reveal patterns, trends, outliers, and potential issues in the data.

3. **Data Cleaning**: EDA often highlights missing data, duplicate records, and outliers, which may require data cleaning or preprocessing before further analysis.

4. **Pattern Discovery**: Analysts use EDA to discover patterns, correlations, and relationships within the data. This can involve identifying clusters, trends, or anomalies that may inform subsequent analysis.

5. **Hypothesis Generation**: EDA can help generate hypotheses about the data, which can be tested using statistical methods or machine learning techniques.

6. **Data Exploration**: EDA encourages the exploration of data from various angles and perspectives to gain a comprehensive understanding of its characteristics.

EDA is a crucial step in the data analysis process as it helps analysts and data scientists make informed decisions about how to proceed with data modeling, hypothesis testing, or other more advanced analytical tasks. It allows them to uncover valuable insights, identify data quality issues, and develop a deeper familiarity with the dataset they are working with.

## Udemy Course Analysis
Udemy is one of the largest online learning platforms, offering a wide range of courses across categories like Business, Technology, Music, Health, and more. With thousands of courses and millions of learners, understanding what makes a course successful is crucial for both the platform and instructors.

The goal of this project is to perform Exploratory Data Analysis (EDA) and Data Analysis on Udemy’s course dataset to answer questions such as:

* What types of courses are most popular across categories?

* How do course features (price, duration, number of lectures, content length, etc.) influence enrollment and ratings?

* Are free courses significantly different from paid courses in terms of enrollments and ratings?

* How do course levels (Beginner, Intermediate, Expert) impact student engagement?

* What trends can be observed in course publishing dates (yearly/monthly trends)?

* Are there any outliers or patterns in pricing, ratings, or enrollment distributions?

By analyzing this dataset, we aim to uncover insights on learner preferences, pricing strategies, and engagement drivers, which can help instructors design better courses and help Udemy optimize its platform offerings.

In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv(r"C:\Users\aishw\Downloads\udemy_courses.csv",header=0)
df

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75.0,2792,923,274,,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,True,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,True,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,True,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,True,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development


* **course_id** - Unique identifier for each course on Udemy. Serves as the primary key.  
* **course_title** - Title of the course as listed on Udemy. Provides information about the course’s subject/topic.  
* **url** - URL link to the course page on Udemy. Useful for reference but not for analysis.  
* **is_paid** - Indicates whether the course is paid (True/1) or free (False/0).  
* **price** - Price of the course (in USD). For free courses, the price is 0.  
* **num_subscribers** - Number of students enrolled (subscribed) in the course. A measure of course popularity.  
* **num_reviews** - Number of reviews/ratings received from students. Indicates engagement and feedback.  
* **num_lectures** - Total number of lectures included in the course. Represents course size/content volume.  
* **level** - Difficulty level of the course (e.g., Beginner, Intermediate, Expert, All Levels).  
* **content_duration** - Total length of the course content in hours. Indicates how long it would take to complete.  
* **published_timestamp** - Date and time when the course was published on Udemy. Useful for analyzing trends over time.  
* **subject** - Broad subject category of the course (e.g., Business Finance, Graphic Design, Musical Instruments, Web Development).  

In [7]:
df.shape

(3678, 12)

In [8]:
df.head(10)

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75.0,2792,923,274,,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
5,192870,Trading Penny Stocks: A Guide for All Levels I...,https://www.udemy.com/trading-penny-stocks-a-g...,True,150.0,9221,138,25,All Levels,3.0,2014-05-02T15:13:30Z,Business Finance
6,739964,Investing And Trading For Beginners: Mastering...,https://www.udemy.com/investing-and-trading-fo...,True,65.0,1540,178,26,Beginner Level,,2016-02-21T18:23:12Z,Business Finance
7,403100,"Trading Stock Chart Patterns For Immediate, Ex...",https://www.udemy.com/trading-chart-patterns-f...,True,95.0,2917,148,23,All Levels,2.5,2015-01-30T22:13:03Z,Business Finance
8,476268,Options Trading 3 : Advanced Stock Profit and ...,https://www.udemy.com/day-trading-stock-option...,True,195.0,5172,34,38,Expert Level,2.5,2015-05-28T00:14:03Z,Business Finance
9,1167710,The Only Investment Strategy You Need For Your...,https://www.udemy.com/the-only-investment-stra...,True,200.0,827,14,15,All Levels,1.0,2017-04-18T18:13:32Z,Business Finance


In [9]:
df.tail(10)

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
3668,270976,A how to guide in HTML,https://www.udemy.com/a-how-to-guide-in-html/,False,0.0,7318,205,8,Beginner Level,0.583333,2014-08-10T20:19:10Z,Web Development
3669,679992,Building Better APIs with GraphQL,https://www.udemy.com/building-better-apis-wit...,True,50.0,555,89,16,All Levels,2.5,2015-11-29T22:02:02Z,Web Development
3670,330900,Learn Grunt with Examples: Automate Your Front...,https://www.udemy.com/learn-grunt-automate-you...,True,20.0,496,113,17,All Levels,1.0,2014-12-19T21:38:54Z,Web Development
3671,667122,Build A Stock Downloader With Visual Studio 20...,https://www.udemy.com/csharpyahoostockdownloader/,True,20.0,436,36,22,Intermediate Level,1.5,2015-11-19T17:22:47Z,Web Development
3672,865438,jQuery UI in Action: Build 5 jQuery UI Projects,https://www.udemy.com/jquery-ui-practical-buil...,True,150.0,382,28,140,All Levels,15.5,2016-10-10T22:00:32Z,Web Development
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,True,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,True,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,True,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,True,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development
3677,297602,Using MODX CMS to Build Websites: A Beginner's...,https://www.udemy.com/using-modx-cms-to-build-...,True,45.0,901,36,20,Beginner Level,2.0,2014-09-28T19:51:11Z,Web Development


**df.describe()** - Provides descriptive statistics for the numeric columns.

In [11]:
df.describe()

Unnamed: 0,course_id,price,num_subscribers,num_reviews,num_lectures,content_duration
count,3678.0,3673.0,3678.0,3678.0,3678.0,3675.0
mean,675972.0,66.139396,3197.150625,156.259108,40.108755,4.097392
std,343273.2,60.998537,9504.11701,935.452044,50.383346,6.055466
min,8324.0,0.0,0.0,0.0,0.0,0.0
25%,407692.5,20.0,111.0,4.0,15.0,1.0
50%,687917.0,45.0,911.5,18.0,25.0,2.0
75%,961355.5,95.0,2546.0,67.0,45.75,4.5
max,1282064.0,200.0,268923.0,27445.0,779.0,78.5
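
The gap between the 75th percentile and the maximum of `num_subscribers` in the table above hints at extreme values. One common way to flag them is the 1.5 × IQR rule. The sketch below applies it to hypothetical subscriber counts (the `subs` values are made up for illustration):

```python
import pandas as pd

# Hypothetical subscriber counts; the last value mimics an extreme course
subs = pd.Series([0, 111, 900, 920, 2500, 2600, 268923])

q1, q3 = subs.quantile(0.25), subs.quantile(0.75)
iqr = q3 - q1                      # interquartile range
upper_fence = q3 + 1.5 * iqr       # values above this are flagged

outliers = subs[subs > upper_fence]
```

The 1.5 multiplier is a convention rather than a rule derived from this dataset, so flagged courses should be inspected before being removed.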


In [12]:
df.describe(include="all")

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
count,3678.0,3678,3678,3678,3673.0,3678.0,3678.0,3678.0,3675,3675.0,3678,3678
unique,,3663,3672,2,,,,,4,,3672,4
top,,Acoustic Blues Guitar Lessons,https://www.udemy.com/cfa-level-2-quantitative...,True,,,,,All Levels,,2017-07-02T14:29:35Z,Web Development
freq,,3,2,3368,,,,,1926,,2,1200
mean,675972.0,,,,66.139396,3197.150625,156.259108,40.108755,,4.097392,,
std,343273.2,,,,60.998537,9504.11701,935.452044,50.383346,,6.055466,,
min,8324.0,,,,0.0,0.0,0.0,0.0,,0.0,,
25%,407692.5,,,,20.0,111.0,4.0,15.0,,1.0,,
50%,687917.0,,,,45.0,911.5,18.0,25.0,,2.0,,
75%,961355.5,,,,95.0,2546.0,67.0,45.75,,4.5,,


**df.info()** - Gives structural information about the data: column names, non-null counts, and dtypes.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3673 non-null   float64
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3675 non-null   object 
 9   content_duration     3675 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(2), int64(4), object(5)
memory usage: 319.8+ KB
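
The non-null counts above show that `price`, `level`, and `content_duration` each have a few missing entries (3673, 3675, and 3675 non-null out of 3678 rows). A direct way to count them is `isna().sum()`. The sketch below uses a hypothetical three-row frame with the same column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same three partially-missing columns
sample = pd.DataFrame({
    "price": [200.0, np.nan, 45.0],
    "level": ["All Levels", None, "Intermediate Level"],
    "content_duration": [1.5, 39.0, np.nan],
})

missing_per_column = sample.isna().sum()       # number of missing cells
missing_share = sample.isna().mean() * 100     # percentage missing
```

On the real data, `df.isna().sum()` would presumably report the same gaps that `df.info()` implies.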


In [15]:
df.dtypes

course_id                int64
course_title            object
url                     object
is_paid                   bool
price                  float64
num_subscribers          int64
num_reviews              int64
num_lectures             int64
level                   object
content_duration       float64
published_timestamp     object
subject                 object
dtype: object
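
Note that `published_timestamp` is stored as `object` (plain strings). To analyze yearly or monthly publishing trends, it would first need to be parsed as a datetime. A minimal sketch, using two timestamp strings copied from the rows displayed earlier:

```python
import pandas as pd

# Two timestamps copied from the rows shown earlier
stamps = pd.Series(["2017-01-18T20:58:58Z", "2016-12-19T19:26:30Z"])

# The trailing "Z" marks UTC, so parse as timezone-aware datetimes
published = pd.to_datetime(stamps, utc=True)

years = published.dt.year
months = published.dt.month
```

Applied to the full dataset, this would be `pd.to_datetime(df["published_timestamp"], utc=True)`, assuming every row follows the same ISO 8601 format.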

In [16]:
df.columns

Index(['course_id', 'course_title', 'url', 'is_paid', 'price',
       'num_subscribers', 'num_reviews', 'num_lectures', 'level',
       'content_duration', 'published_timestamp', 'subject'],
      dtype='object')

In [17]:
df.price.describe()

count    3673.000000
mean       66.139396
std        60.998537
min         0.000000
25%        20.000000
50%        45.000000
75%        95.000000
max       200.000000
Name: price, dtype: float64

In [18]:
df[["price","num_subscribers"]].describe()

Unnamed: 0,price,num_subscribers
count,3673.0,3678.0
mean,66.139396,3197.150625
std,60.998537,9504.11701
min,0.0,0.0
25%,20.0,111.0
50%,45.0,911.5
75%,95.0,2546.0
max,200.0,268923.0


In [46]:
# Continuous numerical features ---> describe()
# Summary statistics (including the average) of course price on Udemy
df["price"].describe()

count    3673.000000
mean       66.139396
std        60.998537
min         0.000000
25%        20.000000
50%        45.000000
75%        95.000000
max       200.000000
Name: price, dtype: float64

In [48]:
df["price"].mean()

66.13939558943643

In [50]:
df["price"].median()

45.0
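
The mean price (about 66.14) is well above the median (45.0), which usually indicates a right-skewed distribution: a small number of expensive courses pull the average up. The sketch below shows the same effect on hypothetical prices and uses `Series.skew()` to quantify it:

```python
import pandas as pd

# Hypothetical prices: many cheap courses and a few expensive ones
prices = pd.Series([20, 20, 25, 45, 50, 95, 200])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()   # positive for a right-skewed distribution
```

For skewed features like this, the median is often the more representative "typical" value.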

In [20]:
# Discrete or Categorical Feature ---> value_counts(),unique(),nunique(),crosstab()

In [56]:
# Series.value_counts() --> frequency of each value within a column
# Find the number of courses at each level.
df["level"].value_counts()

level
All Levels            1926
Beginner Level        1270
Intermediate Level     421
Expert Level            58
Name: count, dtype: int64

In [72]:
# Find the percentage of paid vs. free courses
df["is_paid"].value_counts()/len(df)*100

is_paid
True     91.571506
False     8.428494
Name: count, dtype: float64
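
The same percentages can be obtained with the `normalize=True` argument of `value_counts()`, which avoids dividing by `len(df)` by hand. A small sketch on hypothetical flags:

```python
import pandas as pd

# Hypothetical flags: 3 paid, 1 free
is_paid = pd.Series([True, True, False, True], name="is_paid")

manual = is_paid.value_counts() / len(is_paid) * 100
built_in = is_paid.value_counts(normalize=True) * 100
```

The two agree here because the column has no missing values. If it had any, `value_counts()` would drop them by default, so the manual version would sum to less than 100 while the normalized one would still sum to 100.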

In [84]:
df["num_lectures"].value_counts()

num_lectures
12     121
15     109
13     107
14     105
11     104
      ... 
362      1
156      1
202      1
225      1
152      1
Name: count, Length: 229, dtype: int64

In [88]:
# Series.unique() --> array of the unique values in a column
df["subject"].unique()

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)

In [90]:
df["level"].unique()

array(['All Levels', nan, 'Intermediate Level', 'Beginner Level',
       'Expert Level'], dtype=object)

In [92]:
df["price"].unique()

array([200.,  75.,  45.,  95., 150.,  65., 195.,  30.,  20.,  50., 175.,
       140., 115., 190., 125.,  60., 145., 105., 155., 185., 180., 120.,
        25., 160.,  40.,   0., 100.,  nan,  90.,  35.,  80.,  70.,  55.,
       165., 130.,  85., 170., 110., 135.])

In [94]:
df["num_lectures"].unique()

array([ 51, 274,  36,  26,  25,  23,  38,  15,  76,  17,  19,  16,  42,
        52,  12,  39,  40,  50,  81,  37,  41,  35,  80,  22,  28,  68,
        61, 138, 110, 174, 103,  79, 227,  43,  46,  62,  53,  77,  20,
        47,  33,  11, 102,  45,  32,  30,  18,  60,  54,  24, 134,   5,
        10,  49,  14,   6, 108,  57,   9,  13,   8, 462,  29,  59, 284,
        55,  34,  31, 544,  66,  21,  88,  44,  27,  48,  90,   7,  97,
       128,  63, 235, 211, 100,  82, 123, 332, 272,  69, 129, 316,  70,
       105, 176,  91,  64,  72,   4,  58, 142, 395, 194, 527,  74,  84,
        87,  65, 460, 101,  95, 107, 113,  71, 145,  75, 444,   0, 127,
        98, 286, 120, 130,  73, 121,  56, 158, 241,  86, 187, 111,  85,
       150,  96,  94, 119,  78, 122, 124, 163, 131,  67, 141, 118, 166,
       154, 185, 207, 225, 202, 115, 156,  83,  99, 196, 162,  89, 362,
       136, 310, 104, 291, 144, 161, 224, 240, 183, 192, 309, 215, 106,
       321, 151,  92, 126, 112,  93, 125, 348, 402, 135, 171, 21

In [100]:
# Series.nunique() --> number of unique values in a column (NaN is excluded by default)
df["level"].nunique()

4
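
`unique()` listed five entries for `level` (including `nan`), while `nunique()` returns 4. The difference comes from the `dropna` argument, which defaults to `True`. A sketch using the same five values:

```python
import numpy as np
import pandas as pd

levels = pd.Series(["All Levels", np.nan, "Intermediate Level",
                    "Beginner Level", "Expert Level"])

without_nan = levels.nunique()               # NaN ignored by default
with_nan = levels.nunique(dropna=False)      # NaN counted as a value
counts = levels.value_counts(dropna=False)   # includes a row for NaN
```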

In [102]:
df["level"].describe()

count           3675
unique             4
top       All Levels
freq            1926
Name: level, dtype: object

In [104]:
df["num_lectures"].nunique()

229

In [108]:
df["course_id"].nunique()

3672
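
There are 3672 unique `course_id` values but 3678 rows, so some course IDs appear more than once. The sketch below shows one way to inspect and drop such repeats on a hypothetical frame (the `courses` data is made up; whether the real repeats are exact duplicate rows would need to be checked):

```python
import pandas as pd

# Hypothetical frame: course_id 2 appears twice
courses = pd.DataFrame({
    "course_id": [1, 2, 2, 3],
    "course_title": ["A", "B", "B", "C"],
})

# keep=False marks every row that shares its course_id with another row
repeated = courses[courses.duplicated(subset="course_id", keep=False)]

# Keep only the first occurrence of each course_id
deduped = courses.drop_duplicates(subset="course_id", keep="first")
```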

In [110]:
# pd.crosstab --> cross-tabulate two (or more) categorical columns against each other
?pd.crosstab

Signature:
pd.crosstab(
    index,
    columns,
    values=None,
    rownames=None,
    colnames=None,
    aggfunc=None,
    margins: 'bool' = False,
    margins_name: 'Hashable' = 'All',
    dropna: 'bool' = True,
    normalize: "bool | Literal[0, 1, 'all', 'index', 'columns']" = False,
) -> 'DataFrame'

In [118]:
df.dtypes

course_id                int64
course_title            object
url                     object
is_paid                   bool
price                  float64
num_subscribers          int64
num_reviews              int64
num_lectures             int64
level                   object
content_duration       float64
published_timestamp     object
subject                 object
dtype: object

In [122]:
df["is_paid"] = df["is_paid"].astype("str")

In [124]:
df.dtypes

course_id                int64
course_title            object
url                     object
is_paid                 object
price                  float64
num_subscribers          int64
num_reviews              int64
num_lectures             int64
level                   object
content_duration       float64
published_timestamp     object
subject                 object
dtype: object

In [126]:
df["is_paid"].unique()

array(['True', 'False'], dtype=object)

In [136]:
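# Assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas
df["is_paid"] = df["is_paid"].replace({'True': "Paid", 'False': "Unpaid"})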
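
As an alternative to converting the booleans to strings and then replacing them, the labels can be produced in one step with `Series.map`. A minimal sketch on hypothetical flags (the "Paid"/"Unpaid" labels follow the ones used in this notebook):

```python
import pandas as pd

flags = pd.Series([True, False, True], name="is_paid")

# Map the booleans straight to labels; no string conversion needed
labels = flags.map({True: "Paid", False: "Unpaid"})
```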

In [138]:
df

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development


In [144]:
# Percentage of all courses that are paid or free, broken down by course level
pd.crosstab(index=df["level"],columns=df["is_paid"],normalize=True)*100

level,Paid,Unpaid
All Levels,49.088435,3.319728
Beginner Level,30.258503,4.29932
Expert Level,1.578231,0.0
Intermediate Level,10.639456,0.816327
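
With `normalize=True`, every cell is a share of the grand total, so the whole table sums to 100. To get the paid/free split within each level instead, `normalize="index"` makes each row sum to 100. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "level": ["Beginner", "Beginner", "Expert", "Expert"],
    "is_paid": ["Paid", "Unpaid", "Paid", "Paid"],
})

# Each cell as a percentage of all rows
overall = pd.crosstab(toy["level"], toy["is_paid"], normalize=True) * 100

# Each cell as a percentage of its own row (level)
by_level = pd.crosstab(toy["level"], toy["is_paid"], normalize="index") * 100
```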


In [176]:
# Find the average course price for each combination of subject and course level
pd.crosstab(index=df["level"],columns = df["subject"],
            values=df["price"],aggfunc='mean')

level,Business Finance,Graphic Design,Musical Instruments,Web Development
All Levels,75.130058,62.701342,55.978261,82.526555
Beginner Level,54.017857,53.477366,43.378378,68.363171
Expert Level,95.967742,70.0,36.428571,113.666667
Intermediate Level,66.054688,49.824561,51.039604,71.259259
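
A `crosstab` with `values` and `aggfunc` is closely related to a `groupby` followed by `unstack`. The sketch below compares the two on hypothetical toy data (the subject names and prices are made up):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "level": ["Beginner", "Beginner", "Expert", "Expert"],
    "subject": ["Web", "Web", "Web", "Design"],
    "price": [20.0, 40.0, 100.0, 60.0],
})

via_crosstab = pd.crosstab(index=toy["level"], columns=toy["subject"],
                           values=toy["price"], aggfunc="mean")

# Same averages: group by both keys, then pivot subject into columns
via_groupby = toy.groupby(["level", "subject"])["price"].mean().unstack()
```

Combinations that never occur (here, a Beginner course in Design) appear as `NaN` in both results.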


#### Accessing the data
1. Labels / Indexes
2. Conditional Approach

In [191]:
# Access every second row from label 100 through 500, columns course_title to num_lectures
df.loc[100:500:2,"course_title":"num_lectures"]

Unnamed: 0,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures
100,High performance Stock Trading using key Optio...,https://www.udemy.com/high-performance-stock-t...,Paid,40.0,2103,15,6
102,Basics of Economics (College Level),https://www.udemy.com/economics-for-accounting...,Paid,25.0,2516,12,20
104,Accounting for Beginners : Learn Basics in und...,https://www.udemy.com/accounting-for-beginners...,Paid,50.0,1971,13,20
106,Fundamentals of Forex Trading,https://www.udemy.com/fundamentals-of-forex-tr...,Unpaid,,17160,620,23
108,Website Investing 101 - Buying & Selling Onlin...,https://www.udemy.com/cash-flow-website-invest...,Unpaid,0.0,6811,151,51
...,...,...,...,...,...,...,...
492,Bitcoin or How I Learned to Stop Worrying and ...,https://www.udemy.com/bitcoin-or-how-i-learned...,Unpaid,0.0,65576,936,24
494,Forex Basics,https://www.udemy.com/forex-basics/,Unpaid,0.0,22344,712,26
496,Bitcoin - Ethereum: Trading -Watch me manage m...,https://www.udemy.com/bitcoin-tips/,Paid,165.0,431,58,22
498,FastTrack to Stock Trading Strategies,https://www.udemy.com/fasttrack-to-stock-tradi...,Paid,25.0,5685,5,12


In [195]:
# Fetch only course_title, price, subject, level and num_lectures for all courses
df.loc[:,["course_title","price","subject","level","num_lectures"]]

Unnamed: 0,course_title,price,subject,level,num_lectures
0,Ultimate Investment Banking Course,200.0,Business Finance,All Levels,51
1,Complete GST Course & Certification - Grow You...,75.0,Business Finance,,274
2,Financial Modeling for Business Analysts and C...,45.0,Business Finance,Intermediate Level,51
3,Beginner to Pro - Financial Analysis in Excel ...,95.0,Business Finance,All Levels,36
4,How To Maximize Your Profits Trading Options,200.0,Business Finance,Intermediate Level,26
...,...,...,...,...,...
3673,Learn jQuery from Scratch - Master of JavaScri...,100.0,Web Development,All Levels,21
3674,How To Design A WordPress Website With No Codi...,25.0,Web Development,Beginner Level,42
3675,Learn and Build using Polymer,40.0,Web Development,All Levels,48
3676,CSS Animations: Create Amazing Effects on Your...,50.0,Web Development,All Levels,38


In [199]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   object 
 4   price                3673 non-null   float64
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3675 non-null   object 
 9   content_duration     3675 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: float64(2), int64(4), object(6)
memory usage: 344.9+ KB


In [205]:
df.iloc[100:500,[1,3,5,6,8,9]]

Unnamed: 0,course_title,is_paid,num_subscribers,num_reviews,level,content_duration
100,High performance Stock Trading using key Optio...,Paid,2103,15,All Levels,1.5
101,Introduction to Financial Statement Analysis,Paid,1480,25,Beginner Level,1.5
102,Basics of Economics (College Level),Paid,2516,12,All Levels,1.5
103,Stock Market Investing for Beginners,Unpaid,50855,2698,Beginner Level,1.5
104,Accounting for Beginners : Learn Basics in und...,Paid,1971,13,Beginner Level,1.5
...,...,...,...,...,...,...
495,Accounting Bank Reconciliation Statement (Coll...,Paid,2189,4,All Levels,1.5
496,Bitcoin - Ethereum: Trading -Watch me manage m...,Paid,431,58,All Levels,2.5
497,FOREX : Learn Technical Analysis,Paid,1199,37,All Levels,18.0
498,FastTrack to Stock Trading Strategies,Paid,5685,5,All Levels,3.5
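
`loc` and `iloc` treat slice endpoints differently: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. A small sketch on a hypothetical ten-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})   # default RangeIndex 0..9

by_label = toy.loc[2:6]         # labels 2, 3, 4, 5, 6 (end included)
by_position = toy.iloc[2:6]     # positions 2, 3, 4, 5 (end excluded)
every_second = toy.loc[2:6:2]   # labels 2, 4, 6
```

By the same logic, `df.loc[100:500]` covers 401 rows on a default integer index, whereas `df.iloc[100:500]` covers 400.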


In [213]:
# Find the courses priced above 150 USD
df[df["price"]>150]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
8,476268,Options Trading 3 : Advanced Stock Profit and ...,https://www.udemy.com/day-trading-stock-option...,Paid,195.0,5172,34,38,Expert Level,2.5,2015-05-28T00:14:03Z,Business Finance
9,1167710,The Only Investment Strategy You Need For Your...,https://www.udemy.com/the-only-investment-stra...,Paid,200.0,827,14,15,All Levels,1.0,2017-04-18T18:13:32Z,Business Finance
10,592338,Forex Trading Secrets of the Pros With Amazon'...,https://www.udemy.com/trading-with-amazons-aws...,Paid,200.0,4284,93,76,,5.0,2015-09-11T16:47:02Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3620,1227578,Learning Path: The Road to Elasticsearch,https://www.udemy.com/learning-path-the-road-t...,Paid,200.0,50,5,60,Beginner Level,5.0,2017-05-29T17:56:24Z,Web Development
3642,709324,Learn Web Development by Creating a Social Net...,https://www.udemy.com/meteor-tutorial/,Paid,200.0,442,48,80,Beginner Level,6.5,2015-12-30T16:53:44Z,Web Development
3647,975916,17 Complete JavaScript projects explained st...,https://www.udemy.com/17-complete-javascript-p...,Paid,185.0,327,26,106,Beginner Level,9.5,2016-10-26T14:03:38Z,Web Development
3652,919354,Learn Bootstrap 4 The Most Popular HTML5 CSS3 ...,https://www.udemy.com/learn-bootstrap-4-the-mo...,Paid,200.0,279,37,119,All Levels,10.0,2017-04-25T00:57:35Z,Web Development


In [219]:
# Filter the course details for Intermediate Level courses
df[df["level"]=="Intermediate Level"]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
15,504036,Short Selling: Learn To Sell Stocks Before The...,https://www.udemy.com/short-selling-learn-to-s...,Paid,75.0,2276,106,19,Intermediate Level,1.5,2015-06-22T21:18:35Z,Business Finance
27,447362,Create Your Own Hedge Fund: Trade Stocks Like ...,https://www.udemy.com/create-your-own-hedge-fu...,Paid,175.0,4005,237,25,Intermediate Level,2.0,2015-04-12T20:13:47Z,Business Finance
35,434774,Options Trading Stocks: Proven Toolbox For Fin...,https://www.udemy.com/trading-stock-options-ii...,Paid,195.0,7884,118,68,Intermediate Level,10.0,2015-05-19T23:25:41Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3619,973932,Learn Spring Security 4 Intermediate - Hands On,https://www.udemy.com/learn-spring-security-4-...,Paid,95.0,227,16,20,Intermediate Level,2.0,2016-10-03T16:16:36Z,Web Development
3635,576054,"WordPress Development - Themes, Plugins & Sing...",https://www.udemy.com/wordpress-development-cr...,Paid,150.0,817,164,131,Intermediate Level,19.5,2016-03-02T06:26:16Z,Web Development
3639,261148,create a search engine for your website!,https://www.udemy.com/create-a-simple-php-mysq...,Paid,20.0,1832,6,12,Intermediate Level,1.5,2014-07-17T18:28:04Z,Web Development
3644,554136,Building Responsive Websites with Bootstrap 3 ...,https://www.udemy.com/building-responsive-webs...,Paid,75.0,1322,14,27,Intermediate Level,3.0,2015-07-22T22:54:03Z,Web Development


In [227]:
# Select Intermediate Level courses priced above 150 USD
df[(df["price"]>150) & (df["level"]=="Intermediate Level")]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
27,447362,Create Your Own Hedge Fund: Trade Stocks Like ...,https://www.udemy.com/create-your-own-hedge-fu...,Paid,175.0,4005,237,25,Intermediate Level,2.0,2015-04-12T20:13:47Z,Business Finance
35,434774,Options Trading Stocks: Proven Toolbox For Fin...,https://www.udemy.com/trading-stock-options-ii...,Paid,195.0,7884,118,68,Intermediate Level,10.0,2015-05-19T23:25:41Z,Business Finance
125,528784,Stock market Investing Encyclopedia: How to in...,https://www.udemy.com/stockmarket/,Paid,200.0,3143,11,39,Intermediate Level,3.0,2015-11-10T22:55:53Z,Business Finance
147,1070886,Python Algo Trading: FX Trading with Oanda,https://www.udemy.com/python-algo-trading-fx-t...,Paid,200.0,453,42,33,Intermediate Level,3.0,2017-03-14T00:39:45Z,Business Finance
274,867440,"Bitcoin: el futuro del dinero, hoy",https://www.udemy.com/bitcoin-el-futuro-del-di...,Paid,200.0,57,22,19,Intermediate Level,3.0,2016-06-06T00:06:37Z,Business Finance
332,990440,My Forex Strategy that win consistently over a...,https://www.udemy.com/my-forex-strategy-that-h...,Paid,200.0,204,23,9,Intermediate Level,1.5,2016-10-29T14:51:53Z,Business Finance
415,1208148,Coaching Course:Investment Analysis for your c...,https://www.udemy.com/coaching-courseinvestmen...,Paid,200.0,1,0,6,Intermediate Level,0.566667,2017-06-23T16:35:04Z,Business Finance
437,130366,Trading: High-ROI Trading,https://www.udemy.com/the-high-roi-trading-vid...,Paid,190.0,126,20,47,Intermediate Level,12.5,2014-02-05T19:02:33Z,Business Finance
750,971110,The Truths about (in)secure Retirement,https://www.udemy.com/retirement-planning-calc...,Paid,200.0,86,6,32,Intermediate Level,4.5,2016-10-03T18:42:18Z,Business Finance


In [235]:
# Select courses that are Unpaid, priced at 200 USD, or Expert Level
df[(df["is_paid"]=="Unpaid") | (df["price"]==200) | (df["level"]=="Expert Level")]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.500000,2017-01-18T20:58:58Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.000000,2016-12-13T14:57:18Z,Business Finance
8,476268,Options Trading 3 : Advanced Stock Profit and ...,https://www.udemy.com/day-trading-stock-option...,Paid,195.0,5172,34,38,Expert Level,2.500000,2015-05-28T00:14:03Z,Business Finance
9,1167710,The Only Investment Strategy You Need For Your...,https://www.udemy.com/the-only-investment-stra...,Paid,200.0,827,14,15,All Levels,1.000000,2017-04-18T18:13:32Z,Business Finance
10,592338,Forex Trading Secrets of the Pros With Amazon'...,https://www.udemy.com/trading-with-amazons-aws...,Paid,200.0,4284,93,76,,5.000000,2015-09-11T16:47:02Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3652,919354,Learn Bootstrap 4 The Most Popular HTML5 CSS3 ...,https://www.udemy.com/learn-bootstrap-4-the-mo...,Paid,200.0,279,37,119,All Levels,10.000000,2017-04-25T00:57:35Z,Web Development
3653,1248172,Essentials of Spring 5.0 for Developers,https://www.udemy.com/essentials-of-spring-50-...,Paid,125.0,34,2,21,Expert Level,1.500000,2017-06-11T18:34:40Z,Web Development
3654,949134,The Extreme Web Development Course - For Begin...,https://www.udemy.com/the-extreme-web-developm...,Paid,200.0,1420,62,152,All Levels,5.500000,2016-09-04T20:51:08Z,Web Development
3665,21386,Beginner Photoshop to HTML5 and CSS3,https://www.udemy.com/psd-html5-css3/,Unpaid,0.0,73110,1716,22,All Levels,2.000000,2012-07-27T12:54:57Z,Web Development
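
The filters above combine boolean masks with `&` (and) and `|` (or). Each comparison needs its own parentheses because these operators bind more tightly than `==` or `>`. Below is a minimal sketch on a small made-up frame (the rows are illustrative, not taken from the Udemy dataset), with `isin` as an alternative when matching several values in one column.

```python
import pandas as pd

# Toy data reusing the dataset's column names; the values are made up
courses = pd.DataFrame({
    "course_title": ["A", "B", "C", "D"],
    "is_paid": ["Paid", "Unpaid", "Paid", "Paid"],
    "price": [200.0, 0.0, 95.0, 175.0],
    "level": ["Intermediate Level", "All Levels", "Expert Level", "Intermediate Level"],
})

# AND: both conditions must hold
pricey_intermediate = courses[(courses["price"] > 150) & (courses["level"] == "Intermediate Level")]

# OR: at least one condition must hold
unpaid_or_200 = courses[(courses["is_paid"] == "Unpaid") | (courses["price"] == 200)]

# isin: match any of several values in a single column
selected_levels = courses[courses["level"].isin(["Expert Level", "All Levels"])]

print(pricey_intermediate["course_title"].tolist())
print(unpaid_or_200["course_title"].tolist())
print(selected_levels["course_title"].tolist())
```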


In [245]:
# Select course_title, price, subject, num_lectures and content_duration for Unpaid courses
df[df["is_paid"]=="Unpaid"][["course_title","price","subject","num_lectures","content_duration"]]

Unnamed: 0,course_title,price,subject,num_lectures,content_duration
95,Options Trading 101: The Basics,0.0,Business Finance,11,0.550000
103,Stock Market Investing for Beginners,,Business Finance,15,1.500000
106,Fundamentals of Forex Trading,,Business Finance,23,1.000000
108,Website Investing 101 - Buying & Selling Onlin...,0.0,Business Finance,51,2.000000
112,Stock Market Foundations,,Business Finance,9,2.000000
...,...,...,...,...,...
3638,Building a Search Engine in PHP & MySQL,0.0,Web Development,12,2.500000
3643,CSS Image filters - The modern web images colo...,0.0,Web Development,16,1.500000
3651,Drupal 8 Site Building,0.0,Web Development,48,4.500000
3665,Beginner Photoshop to HTML5 and CSS3,0.0,Web Development,22,2.000000


In [255]:
# Same selection with .loc: the row mask and the column list go in one indexing step
df.loc[df["is_paid"]=="Unpaid",["course_title","price","subject","num_lectures","content_duration"]]

Unnamed: 0,course_title,price,subject,num_lectures,content_duration
95,Options Trading 101: The Basics,0.0,Business Finance,11,0.550000
103,Stock Market Investing for Beginners,,Business Finance,15,1.500000
106,Fundamentals of Forex Trading,,Business Finance,23,1.000000
108,Website Investing 101 - Buying & Selling Onlin...,0.0,Business Finance,51,2.000000
112,Stock Market Foundations,,Business Finance,9,2.000000
...,...,...,...,...,...
3638,Building a Search Engine in PHP & MySQL,0.0,Web Development,12,2.500000
3643,CSS Image filters - The modern web images colo...,0.0,Web Development,16,1.500000
3651,Drupal 8 Site Building,0.0,Web Development,48,4.500000
3665,Beginner Photoshop to HTML5 and CSS3,0.0,Web Development,22,2.000000
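
The two cells above return the same rows. The difference matters mostly when assigning values: `.loc[mask, column]` writes to the original frame in one step, whereas chaining two selections may write to a temporary copy. A small sketch with made-up rows:

```python
import pandas as pd

# Toy data; values are illustrative only
courses = pd.DataFrame({
    "course_title": ["A", "B", "C"],
    "is_paid": ["Paid", "Unpaid", "Unpaid"],
    "price": [50.0, None, 0.0],
})

unpaid = courses["is_paid"] == "Unpaid"

# Two-step selection and single-step .loc selection return the same data
two_step = courses[unpaid][["course_title", "price"]]
one_step = courses.loc[unpaid, ["course_title", "price"]]
same_result = two_step.equals(one_step)

# For assignment, use .loc so the original frame is modified directly
courses.loc[unpaid, "price"] = 0.0

print(same_result)
print(courses["price"].tolist())
```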


### Data Cleaning
Data cleaning means fixing messy data in your dataset. Most machine learning models cannot work with messy data such as missing values or inconsistent formats, so you have to clean the data before building a model.

Messy data could be:

Empty cells  
Data in wrong format  
Wrong data  
Duplicates  
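
As a rough illustration, the four kinds of problems listed above can each be counted with a one-line check. The frame below is made up for the example, and treating zero lectures as "wrong data" is an assumption about what counts as invalid.

```python
import pandas as pd

# Toy data containing one example of each problem; values are made up
raw = pd.DataFrame({
    "course_id": [1, 2, 2, 3],
    "price": [20.0, None, None, 45.0],
    "num_lectures": [10, 0, 0, 25],
    "published_timestamp": ["2017-01-18T20:58:58Z", "2016-05-16T18:28:30Z",
                            "2016-05-16T18:28:30Z", "2015-12-30T16:41:42Z"],
})

report = {
    # Empty cells: total number of missing values
    "empty_cells": int(raw.isnull().sum().sum()),
    # Wrong format: timestamps stored as text rather than as datetimes
    "timestamp_is_text": not pd.api.types.is_datetime64_any_dtype(raw["published_timestamp"]),
    # Wrong data: a course with zero lectures is assumed to be invalid
    "zero_lecture_rows": int((raw["num_lectures"] == 0).sum()),
    # Duplicates: rows identical to an earlier row in every column
    "duplicate_rows": int(raw.duplicated().sum()),
}
print(report)
```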

#### 1. Handling Missing values

In [266]:
df.isnull().sum()/len(df)*100

course_id              0.000000
course_title           0.000000
url                    0.000000
is_paid                0.000000
price                  0.135943
num_subscribers        0.000000
num_reviews            0.000000
num_lectures           0.000000
level                  0.081566
content_duration       0.081566
published_timestamp    0.000000
subject                0.000000
dtype: float64
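
The percentage of missing values per column can also be written as the mean of the boolean mask, since `True` counts as 1. A short sketch on made-up data showing that both forms agree:

```python
import pandas as pd

# Toy frame: 2 of 4 prices and 1 of 4 levels are missing
frame = pd.DataFrame({
    "price": [10.0, None, 30.0, None],
    "level": ["a", "b", None, "d"],
})

pct_by_sum = frame.isnull().sum() / len(frame) * 100
pct_by_mean = frame.isnull().mean() * 100

print(pct_by_mean)
```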

In [268]:
df[df["price"].isnull()]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
103,133536,Stock Market Investing for Beginners,https://www.udemy.com/the-beginners-guide-to-t...,Unpaid,,50855,2698,15,Beginner Level,1.5,2013-12-25T19:53:34Z,Business Finance
106,265960,Fundamentals of Forex Trading,https://www.udemy.com/fundamentals-of-forex-tr...,Unpaid,,17160,620,23,All Levels,1.0,2014-08-29T20:10:38Z,Business Finance
112,191854,Stock Market Foundations,https://www.udemy.com/how-to-invest-in-the-sto...,Unpaid,,19339,794,9,Beginner Level,2.0,2014-03-31T21:35:06Z,Business Finance
128,777444,Corporate Finance - A Brief Introduction,https://www.udemy.com/finance-a-brief-introduc...,Unpaid,,11724,649,17,Beginner Level,1.5,2016-03-04T05:58:09Z,Business Finance
143,48841,Accounting in 60 Minutes - A Brief Introduction,https://www.udemy.com/accounting-in-60-minutes...,Unpaid,,56659,4397,16,Beginner Level,1.5,2013-04-07T21:39:25Z,Business Finance


In [270]:
df.dropna()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
5,192870,Trading Penny Stocks: A Guide for All Levels I...,https://www.udemy.com/trading-penny-stocks-a-g...,Paid,150.0,9221,138,25,All Levels,3.0,2014-05-02T15:13:30Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development


In [274]:
df.dropna(subset=["price"])

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development


In [278]:
df.dropna(how="all",subset=["price","content_duration"])

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development


In [280]:
df.dropna(how="any",subset=["price","content_duration"])

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development
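
The truncated outputs above look identical, so the difference between `how="all"` and `how="any"` is easier to see on a tiny made-up frame: `"all"` drops a row only when every column in `subset` is missing, while `"any"` drops it as soon as one of them is missing.

```python
import numpy as np
import pandas as pd

# Toy data: B lacks a price, D lacks a duration, C lacks both
frame = pd.DataFrame({
    "course_title": ["A", "B", "C", "D"],
    "price": [20.0, np.nan, np.nan, 45.0],
    "content_duration": [1.5, 2.0, np.nan, np.nan],
})

cols = ["price", "content_duration"]

# how="all": drop a row only if every column in cols is missing
kept_all = frame.dropna(how="all", subset=cols)

# how="any": drop a row if at least one column in cols is missing
kept_any = frame.dropna(how="any", subset=cols)

print(kept_all["course_title"].tolist())
print(kept_any["course_title"].tolist())
```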


In [282]:
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  5
num_subscribers        0
num_reviews            0
num_lectures           0
level                  3
content_duration       3
published_timestamp    0
subject                0
dtype: int64

In [286]:
# fillna(value=...) replaces every missing value in the frame with the given default
df.fillna(value="not found")

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,not found,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development
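
Filling the whole frame with one string also places that string into numeric columns such as `price`. Passing a dictionary to `fillna` sets a separate default per column instead. A minimal sketch on made-up data (the defaults `0.0` and `"Unknown"` are illustrative choices, not values from the original notebook):

```python
import numpy as np
import pandas as pd

# Toy data with one missing price and one missing level
frame = pd.DataFrame({
    "price": [200.0, np.nan, 45.0],
    "level": ["All Levels", None, "Beginner Level"],
})

# One placeholder for everything: the numeric column now contains a string
blanket = frame.fillna("not found")

# Per-column defaults keep the numeric column numeric
per_column = frame.fillna({"price": 0.0, "level": "Unknown"})

print(blanket)
print(per_column)
```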


In [288]:
# Assign the result back to the column. Calling fillna(inplace=True) on a column
# selection raises a FutureWarning and stops modifying df in pandas 3.0.
df["price"] = df["price"].fillna(0)


In [290]:
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  3
content_duration       3
published_timestamp    0
subject                0
dtype: int64

In [294]:
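# Forward fill: copy the last valid value downward into missing cells.
# fillna(method="ffill") is deprecated in recent pandas; ffill() is the replacement.
df.ffill()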


Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development
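
Forward fill copies the previous row's value, so it is mainly meaningful when row order carries information, and it cannot fill a gap in the first row. A short sketch on a made-up Series, including the `limit` parameter that caps how many consecutive gaps are filled:

```python
import numpy as np
import pandas as pd

# Toy series: a leading gap and a run of two gaps
durations = pd.Series([np.nan, 1.5, np.nan, np.nan, 3.0])

# Every gap after the first valid value is filled; the leading gap stays missing
filled = durations.ffill()

# limit=1 fills at most one consecutive gap after each valid value
filled_limited = durations.ffill(limit=1)

print(filled.tolist())
print(filled_limited.tolist())
```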


In [296]:
df.level.unique()

array(['All Levels', nan, 'Intermediate Level', 'Beginner Level',
       'Expert Level'], dtype=object)

In [298]:
df.level.mode()

0    All Levels
Name: level, dtype: object

In [300]:
df.level.value_counts()

level
All Levels            1926
Beginner Level        1270
Intermediate Level     421
Expert Level            58
Name: count, dtype: int64

In [304]:
df.level.mode()[0]

'All Levels'

In [308]:
import warnings
warnings.filterwarnings("ignore")

In [310]:
# Fill missing levels with the most frequent level, assigning the result back
df["level"] = df["level"].fillna(df["level"].mode()[0])

In [312]:
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       3
published_timestamp    0
subject                0
dtype: int64

In [314]:
df.content_duration.mean()

4.097392290250612

In [316]:
df.content_duration.median()

2.0

In [318]:
df.content_duration.max()

78.5

In [320]:
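# Fill missing durations with the median, which is less affected by very long courses than the mean
df["content_duration"] = df["content_duration"].fillna(df["content_duration"].median())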

In [322]:
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64
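
The cells above fill the categorical `level` column with its mode and the numeric `content_duration` column with its median. The median is a common choice when a column is skewed (here the mean is about 4.1 hours, the median 2.0 and the maximum 78.5). A small sketch on made-up rows showing both imputations together:

```python
import numpy as np
import pandas as pd

# Toy data: one missing level, one missing duration, one very long course
frame = pd.DataFrame({
    "level": ["All Levels", "Beginner Level", None, "All Levels"],
    "content_duration": [1.0, 2.0, np.nan, 78.5],
})

mean_duration = frame["content_duration"].mean()      # pulled up by the 78.5 outlier
median_duration = frame["content_duration"].median()  # stays at a typical value

# Categorical column: fill with the most frequent value
frame["level"] = frame["level"].fillna(frame["level"].mode()[0])

# Skewed numeric column: fill with the median
frame["content_duration"] = frame["content_duration"].fillna(median_duration)

print(round(mean_duration, 2), median_duration)
print(frame)
```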

#### 2. Handling wrongly formatted data

In [325]:
df.dtypes

course_id                int64
course_title            object
url                     object
is_paid                 object
price                  float64
num_subscribers          int64
num_reviews              int64
num_lectures             int64
level                   object
content_duration       float64
published_timestamp     object
subject                 object
dtype: object

In [327]:
df["published_timestamp"] = pd.to_datetime(df["published_timestamp"])

In [329]:
df.dtypes

course_id                            int64
course_title                        object
url                                 object
is_paid                             object
price                              float64
num_subscribers                      int64
num_reviews                          int64
num_lectures                         int64
level                               object
content_duration                   float64
published_timestamp    datetime64[ns, UTC]
subject                             object
dtype: object

In [333]:
df["published_year"] = df["published_timestamp"].dt.year
df

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,published_year
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18 20:58:58+00:00,Business Finance,2017
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,Paid,75.0,2792,923,274,All Levels,39.0,2017-03-09 16:34:20+00:00,Business Finance,2017
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19 19:26:30+00:00,Business Finance,2016
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30 20:07:24+00:00,Business Finance,2017
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13 14:57:18+00:00,Business Finance,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14 17:36:46+00:00,Web Development,2016
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10 22:24:30+00:00,Web Development,2017
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30 16:41:42+00:00,Web Development,2015
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11 19:06:15+00:00,Web Development,2016


In [337]:
df["published_year"].value_counts()

published_year
2016    1206
2015    1014
2017     715
2014     491
2013     202
2012      45
2011       5
Name: count, dtype: int64
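
Once a column holds real datetimes, the `.dt` accessor exposes parts such as year and month. The sketch below uses made-up strings and assumes that some values might be unparseable. In that case `errors="coerce"` converts them to `NaT` instead of raising an error.

```python
import pandas as pd

# Toy timestamps in the dataset's ISO format, plus one invalid value
stamps = pd.Series(["2017-01-18T20:58:58Z", "2016-12-19T19:26:30Z", "not a date"])

# errors="coerce" turns unparseable strings into NaT; utc=True yields a UTC-aware column
parsed = pd.to_datetime(stamps, errors="coerce", utc=True)

# Date parts are available through the .dt accessor
years = parsed.dt.year
months = parsed.dt.month_name()

print(parsed)
print(years.tolist(), months.tolist())
```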

#### 3. Handling Wrong data

In [340]:
df.describe()

Unnamed: 0,course_id,price,num_subscribers,num_reviews,num_lectures,content_duration,published_year
count,3678.0,3678.0,3678.0,3678.0,3678.0,3678.0,3678.0
mean,675972.0,66.049483,3197.150625,156.259108,40.108755,4.095682,2015.431213
std,343273.2,61.005755,9504.11701,935.452044,50.383346,6.053291,1.185317
min,8324.0,0.0,0.0,0.0,0.0,0.0,2011.0
25%,407692.5,20.0,111.0,4.0,15.0,1.0,2015.0
50%,687917.0,45.0,911.5,18.0,25.0,2.0,2016.0
75%,961355.5,95.0,2546.0,67.0,45.75,4.5,2016.0
max,1282064.0,200.0,268923.0,27445.0,779.0,78.5,2017.0


In [342]:
df[df["num_lectures"]==0]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,published_year
892,627332,Mutual Funds for Investors in Retirement Accounts,https://www.udemy.com/mutual-funds-for-investo...,Paid,20.0,0,0,0,All Levels,0.0,2015-12-17 05:38:38+00:00,Business Finance,2015


In [352]:
df.drop("url",axis=1,inplace=True)

In [354]:
df

Unnamed: 0,course_id,course_title,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,published_year
0,1070968,Ultimate Investment Banking Course,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18 20:58:58+00:00,Business Finance,2017
1,1113822,Complete GST Course & Certification - Grow You...,Paid,75.0,2792,923,274,All Levels,39.0,2017-03-09 16:34:20+00:00,Business Finance,2017
2,1006314,Financial Modeling for Business Analysts and C...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19 19:26:30+00:00,Business Finance,2016
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30 20:07:24+00:00,Business Finance,2017
4,1011058,How To Maximize Your Profits Trading Options,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13 14:57:18+00:00,Business Finance,2016
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14 17:36:46+00:00,Web Development,2016
3674,1088178,How To Design A WordPress Website With No Codi...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10 22:24:30+00:00,Web Development,2017
3675,635248,Learn and Build using Polymer,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30 16:41:42+00:00,Web Development,2015
3676,905096,CSS Animations: Create Amazing Effects on Your...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11 19:06:15+00:00,Web Development,2016


In [360]:
df[df["num_lectures"]==0].index

Index([892], dtype='int64')

In [364]:
df.drop(df[df["num_lectures"]==0].index,inplace=True)
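
Dropping by index labels works. An equivalent approach is to keep only the rows that pass a validity check. A minimal sketch on made-up rows, assuming a course with zero lectures is invalid:

```python
import pandas as pd

# Toy data: course B has no lectures
frame = pd.DataFrame({
    "course_title": ["A", "B", "C"],
    "num_lectures": [51, 0, 25],
    "content_duration": [1.5, 0.0, 3.0],
})

# Option 1: look up the index labels of the invalid rows and drop them
dropped = frame.drop(frame[frame["num_lectures"] == 0].index)

# Option 2: keep only the valid rows with a boolean mask
kept = frame[frame["num_lectures"] > 0]

print(dropped.equals(kept))
print(kept)
```

Both options keep the original index labels, so a `reset_index(drop=True)` may be useful afterwards if a continuous index is needed.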

#### 4. Handling Duplicate entries

In [369]:
df["course_id"].nunique()

3671

In [371]:
# Show rows that are exact duplicates of an earlier row
df[df.duplicated()]

Unnamed: 0,course_id,course_title,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,published_year
787,837322,Essentials of money value: Get a financial Life !,Paid,20.0,0,0,20,All Levels,0.616667,2016-05-16 18:28:30+00:00,Business Finance,2016
788,1157298,Introduction to Forex Trading Business For Beg...,Paid,20.0,0,0,27,Beginner Level,1.5,2017-04-23 16:19:01+00:00,Business Finance,2017
894,1035638,Understanding Financial Statements,Paid,25.0,0,0,10,All Levels,1.0,2016-12-15 14:56:17+00:00,Business Finance,2016
1100,1084454,CFA Level 2- Quantitative Methods,Paid,40.0,0,0,35,All Levels,5.5,2017-07-02 14:29:35+00:00,Business Finance,2017
1473,185526,MicroStation - Células,Paid,20.0,0,0,9,Beginner Level,0.616667,2014-04-15 21:48:55+00:00,Graphic Design,2014
2561,28295,Learn Web Designing & HTML5/CSS3 Essentials in...,Paid,75.0,43285,525,24,All Levels,4.0,2013-01-03 00:55:31+00:00,Web Development,2013


In [373]:
df.drop_duplicates(inplace=True)

In [375]:
df

Unnamed: 0,course_id,course_title,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,published_year
0,1070968,Ultimate Investment Banking Course,Paid,200.0,2147,23,51,All Levels,1.5,2017-01-18 20:58:58+00:00,Business Finance,2017
1,1113822,Complete GST Course & Certification - Grow You...,Paid,75.0,2792,923,274,All Levels,39.0,2017-03-09 16:34:20+00:00,Business Finance,2017
2,1006314,Financial Modeling for Business Analysts and C...,Paid,45.0,2174,74,51,Intermediate Level,2.5,2016-12-19 19:26:30+00:00,Business Finance,2016
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,Paid,95.0,2451,11,36,All Levels,3.0,2017-05-30 20:07:24+00:00,Business Finance,2017
4,1011058,How To Maximize Your Profits Trading Options,Paid,200.0,1276,45,26,Intermediate Level,2.0,2016-12-13 14:57:18+00:00,Business Finance,2016
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,Paid,100.0,1040,14,21,All Levels,2.0,2016-06-14 17:36:46+00:00,Web Development,2016
3674,1088178,How To Design A WordPress Website With No Codi...,Paid,25.0,306,3,42,Beginner Level,3.5,2017-03-10 22:24:30+00:00,Web Development,2017
3675,635248,Learn and Build using Polymer,Paid,40.0,513,169,48,All Levels,3.5,2015-12-30 16:41:42+00:00,Web Development,2015
3676,905096,CSS Animations: Create Amazing Effects on Your...,Paid,50.0,300,31,38,All Levels,3.0,2016-08-11 19:06:15+00:00,Web Development,2016
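
By default `duplicated()` and `drop_duplicates()` compare all columns. When a key such as `course_id` should be unique, the `subset` parameter restricts the comparison to that key, and `keep` decides which copy survives. A sketch on made-up rows:

```python
import pandas as pd

# Toy data: course 2 is repeated exactly, course 3 is repeated with changed values
frame = pd.DataFrame({
    "course_id": [1, 2, 2, 3, 3],
    "course_title": ["A", "B", "B", "C", "C (updated)"],
    "price": [20.0, 25.0, 25.0, 40.0, 45.0],
})

# Rows identical to an earlier row in every column
exact_dupes = int(frame.duplicated().sum())

# Rows whose course_id already appeared, even if other columns differ
id_dupes = int(frame.duplicated(subset=["course_id"]).sum())

# Keep the first row per course_id and rebuild a continuous index
deduped = frame.drop_duplicates(subset=["course_id"], keep="first").reset_index(drop=True)

print(exact_dupes, id_dupes)
print(deduped)
```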
