# 0. Clean Dataset
The purpose of this Jupyter Notebook is to analyze and clean the [Udemy Courses dataset](https://www.kaggle.com/datasets/hossaingh/udemy-courses?select=Course_info.csv) gathered from Kaggle. This dataset is used under Creative Commons.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Course_info.csv')

In [3]:
df.head(5)

Unnamed: 0,id,title,is_paid,price,headline,num_subscribers,avg_rating,num_reviews,num_comments,num_lectures,content_length_min,published_time,last_update_date,category,subcategory,topic,language,course_url,instructor_name,instructor_url
0,4715.0,Online Vegan Vegetarian Cooking School,True,24.99,Learn to cook delicious vegan recipes. Filmed ...,2231.0,3.75,134.0,42.0,37.0,1268.0,2010-08-05T22:06:13Z,2020-11-06,Lifestyle,Food & Beverage,Vegan Cooking,English,/course/vegan-vegetarian-cooking-school/,Angela Poch,/user/angelapoch/
1,1769.0,The Lean Startup Talk at Stanford E-Corner,False,0.0,Debunking Myths of Entrepreneurship A startup ...,26474.0,4.5,709.0,112.0,9.0,88.0,2010-01-12T18:09:46Z,,Business,Entrepreneurship,Lean Startup,English,/course/the-lean-startup-debunking-myths-of-en...,Eric Ries,/user/ericries/
2,5664.0,"How To Become a Vegan, Vegetarian, or Flexitarian",True,19.99,Get the tools you need for a lifestyle change ...,1713.0,4.4,41.0,13.0,14.0,82.0,2010-10-13T18:07:17Z,2019-10-09,Lifestyle,Other Lifestyle,Vegan Cooking,English,/course/see-my-personal-motivation-for-becomin...,Angela Poch,/user/angelapoch/
3,7723.0,How to Train a Puppy,True,199.99,Train your puppy the right way with Dr. Ian Du...,4988.0,4.8,395.0,88.0,36.0,1511.0,2011-06-20T20:08:38Z,2016-01-13,Lifestyle,Pet Care & Training,Pet Training,English,/course/complete-dunbar-collection/,Ian Dunbar,/user/ian-dunbar/
4,8157.0,Web Design from the Ground Up,True,159.99,Learn web design online: Everything you need t...,1266.0,4.75,38.0,12.0,38.0,569.0,2011-06-23T18:31:20Z,,Design,Web Design,Web Design,English,/course/web-design-from-the-ground-up/,E Learning Lab,/user/edwin-ang-2/


## 0.1 Check for Null values
I will check for null values using the `info()` method, which reports the non-null count for each column. Any column whose count is lower than the total number of entries contains null values.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209734 entries, 0 to 209733
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  209734 non-null  float64
 1   title               209734 non-null  object 
 2   is_paid             209734 non-null  bool   
 3   price               209734 non-null  float64
 4   headline            209707 non-null  object 
 5   num_subscribers     209734 non-null  float64
 6   avg_rating          209734 non-null  float64
 7   num_reviews         209734 non-null  float64
 8   num_comments        209734 non-null  float64
 9   num_lectures        209734 non-null  float64
 10  content_length_min  209734 non-null  float64
 11  published_time      209734 non-null  object 
 12  last_update_date    209597 non-null  object 
 13  category            209734 non-null  object 
 14  subcategory         209734 non-null  object 
 15  topic               208776 non-nul
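
Because `info()` lists non-null counts, the number of nulls has to be inferred by comparing each count against the total number of rows. A minimal sketch of a more direct check is shown below. It uses a small hypothetical frame (`sample`) with made-up values as a stand-in for `df`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative, not from the dataset.
sample = pd.DataFrame({
    "title": ["Course A", "Course B", "Course C"],
    "headline": ["Intro", np.nan, "Advanced"],
    "topic": [np.nan, np.nan, "Python"],
})

# Count null values per column.
null_counts = sample.isnull().sum()

# Keep only the names of columns that contain at least one null.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(null_counts)
print(columns_with_nulls)
```

Running the same two lines against `df` instead of `sample` should list the null-containing columns without reading them off the `info()` table by hand.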

From the output of the `info()` method, we can see that the columns containing null values are the ones listed in `nan_columns` below.

In [5]:
nan_columns = ['headline', 'last_update_date', 'topic', 'instructor_name', 'instructor_url']

These null values won't affect our analysis because they occur in columns that are not going to be used in our project.

In [6]:
# Show every row that has a null in at least one of these columns
df[df[nan_columns].isnull().any(axis=1)]

Unnamed: 0,id,title,is_paid,price,headline,num_subscribers,avg_rating,num_reviews,num_comments,num_lectures,content_length_min,published_time,last_update_date,category,subcategory,topic,language,course_url,instructor_name,instructor_url
1,1769.0,The Lean Startup Talk at Stanford E-Corner,False,0.00,Debunking Myths of Entrepreneurship A startup ...,26474.0,4.500000,709.0,112.0,9.0,88.0,2010-01-12T18:09:46Z,,Business,Entrepreneurship,Lean Startup,English,/course/the-lean-startup-debunking-myths-of-en...,Eric Ries,/user/ericries/
4,8157.0,Web Design from the Ground Up,True,159.99,Learn web design online: Everything you need t...,1266.0,4.750000,38.0,12.0,38.0,569.0,2011-06-23T18:31:20Z,,Design,Web Design,Web Design,English,/course/web-design-from-the-ground-up/,E Learning Lab,/user/edwin-ang-2/
10,8318.0,Navigating the MBA Admissions Process,True,49.99,MBA Admission: The Complete Course on How to G...,794.0,4.100000,27.0,16.0,10.0,236.0,2011-07-12T04:11:59Z,,Teaching & Academics,Teacher Training,MBA Admissions,English,/course/business-school/,Clear Admit & Beat The GMAT,/user/clearadmitbeatthegmat/
11,8422.0,Kundalini Yoga to Heal Stress and Anxiety by V...,True,49.99,Kundalini Yoga is highly effective simple yoga...,1322.0,4.450000,196.0,76.0,5.0,140.0,2012-09-12T23:47:03Z,,Health & Fitness,Yoga,Kundalini,English,/course/kundalini-yoga-to-heal-stress-and-anxi...,Valinda Cochella,/user/viriamkaur/
15,8467.0,The Lean Startup,True,39.99,Learn how to apply the method that is transfor...,5566.0,4.166666,720.0,163.0,6.0,158.0,2011-07-11T06:29:02Z,,Business,Entrepreneurship,Lean Startup,English,/course/the-lean-startup/,Eric Ries,/user/ericries/
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209421,4903158.0,クラウドサービスを安全に利用するための「クラウドサービス安全利用の手引き」,False,0.00,クラウドサービスを安全に利用するためには、いくつかチェックすべきポイントがありますが、情報セ...,26.0,0.000000,0.0,0.0,30.0,75.0,2022-09-30T10:03:43Z,2022-09-29,IT & Software,Network & Security,,Japanese,/course/fbyohbmp/,佐藤 豊史,/user/zuo-teng-li-shi/
209441,4903668.0,Parenting course on how to help children learn...,False,0.00,"English reading, writing and vocabulary learni...",530.0,4.500000,1.0,0.0,12.0,56.0,2022-10-03T09:56:30Z,2022-09-29,Teaching & Academics,Language Learning,,English,/course/parenting-course-on-how-to-help-childr...,Brittani Gabriel,/user/evgeniya-pislegina-3/
209447,4903794.0,CONFIANCE ULTIME,False,0.00,avoir une confiance à toute épreuve,37.0,0.000000,0.0,0.0,13.0,59.0,2022-09-29T14:14:26Z,2022-09-28,Personal Development,Self Esteem & Confidence,,French,/course/confiance-ultime/,Nathan Claire,/user/nathan-claire-3/
209540,4906252.0,합격을 부르는 면접 스피치,False,0.00,"면접 스피치의 원칙, 킬러 콘텐츠, 차별화된 답변 메이킹을 통해 면접에 합격하는 강의",15.0,0.000000,0.0,0.0,5.0,119.0,2022-09-30T06:34:04Z,2022-09-29,Personal Development,Career Development,,Korean,/course/interview_pass/,RAN CHOI,/user/coeran-2/
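
The claim that the nulls do not affect the analysis can also be checked in code. The sketch below is one possible way to do it, under two assumptions: `sample` is a hypothetical stand-in for `df`, and `analysis_columns` is an assumed list of the columns the project relies on (the notebook does not name them).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only.
sample = pd.DataFrame({
    "price": [24.99, 0.0, 19.99],
    "avg_rating": [3.75, 4.5, 4.4],
    "topic": ["Vegan Cooking", np.nan, "Lean Startup"],
})

# Assumed list of columns used in the analysis (not stated in the notebook).
analysis_columns = ["price", "avg_rating"]

# True when none of the analysis columns contains a null value.
analysis_is_complete = not sample[analysis_columns].isnull().any().any()

# Rows with a null in any column, for comparison.
rows_with_nulls = sample[sample.isnull().any(axis=1)]

print(analysis_is_complete)
print(len(rows_with_nulls))
```

If `analysis_is_complete` is `True`, the rows with nulls can stay in the frame, since the gaps fall only in columns the analysis never reads.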


## 0.2 Datetime Correction
Two columns, `published_time` and `last_update_date`, are stored as strings (`object` dtype) and need to be converted to a datetime type using the `pd.to_datetime()` function.

In [7]:
df['published_time'] = pd.to_datetime(df['published_time'])

In [8]:
df['last_update_date'] = pd.to_datetime(df['last_update_date'])
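
The two columns use different string formats, which affects the result of the conversion. The sketch below illustrates this on two small hypothetical Series (`published` and `updated`) whose values copy the formats seen in the preview of the data. It is an illustration of `pd.to_datetime()` behavior, not a statement about the full dataset.

```python
import pandas as pd

# Illustrative values in the same formats as the two columns.
published = pd.Series(["2010-08-05T22:06:13Z", "2010-01-12T18:09:46Z"])
updated = pd.Series(["2020-11-06", None])

published_dt = pd.to_datetime(published)
updated_dt = pd.to_datetime(updated)

# The trailing "Z" makes the parsed values timezone-aware (UTC).
print(published_dt.dt.tz)

# Date-only strings stay timezone-naive, and missing values become NaT.
print(updated_dt.dt.tz)
print(updated_dt.isna().sum())

# After conversion, datetime accessors such as .dt.year are available.
print(published_dt.dt.year.tolist())
```

One consequence worth keeping in mind: a timezone-aware column and a timezone-naive column cannot be compared or subtracted directly, so one of them would need to be adjusted first (for example with `dt.tz_localize(None)`) if the analysis ever combines the two.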

## 0.3 Questions:
Finally, the questions that this project will answer are:

1. What are the top courses on Udemy?

2. Are paid courses more likely to be highly rated than free courses?

3. Is there a shift in course category over the years?

4. What is the recommended length of a course and lectures, and what category will demand less time to create a course?