<div class="alert alert-info">

## Project Description

This project analyzes and explores a Udemy dataset containing information about online courses. The dataset includes details such as course titles, pricing, enrollment statistics, and instructor information. The aim is to gain insights into the online course market, identify popular trends, and understand the factors that influence course success.




<br>
    <b>Data Source:</b> <a href="https://www.kaggle.com/datasets/hossaingh/udemy-courses?resource=download&select=Course_info.csv">Kaggle</a>
<br>
    
<b>Dataset description</b>  
`course_info.csv`: course-level data from Udemy, with the following columns:    
* `id`: A unique identifier for each course.
* `title`: The title or name of the course.
* `is_paid`: A boolean (True/False) indicating whether the course is paid or free.
* `price`: The price of the course. If the course is free, the price will be 0.
* `headline`: A brief description or headline of the course content.
* `num_subscribers`: The number of subscribers or students enrolled in the course.
* `avg_rating`: The average rating given by students who have taken the course.
* `num_reviews`: The number of reviews the course has received.
* `num_comments`: The number of comments posted by students or users about the course.
* `num_lectures`: The number of lectures or lessons in the course.
* `content_length_min`: The total length of the course content in minutes.
* `published_time`: The date and time when the course was initially published.
* `last_update_date`: The date when the course was last updated.
* `category`: The broad category to which the course belongs (e.g., Lifestyle, Business, Design, etc.).
* `subcategory`: The subcategory within the broader category (e.g., Food & Beverage, Entrepreneurship, Web Design, etc.).
* `topic`: The specific topic or subject of the course.
* `language`: The language in which the course is conducted.
* `course_url`: The URL or link to access the course.
* `instructor_name`: The name of the course instructor.
* `instructor_url`: The URL or link to the instructor's profile.
</div>
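The column descriptions above imply a simple invariant: a free course (`is_paid` is False) should have a `price` of 0. The snippet below is a minimal sketch of how that could be checked. The toy rows and the helper name `find_price_conflicts` are illustrative assumptions, not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

# Toy rows that mimic the `is_paid` and `price` columns (values are made up).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "is_paid": [True, False, False, True],
    "price": [24.99, 0.0, 9.99, 0.0],
})

def find_price_conflicts(courses: pd.DataFrame) -> pd.DataFrame:
    """Return rows where `is_paid` and `price` disagree (hypothetical helper)."""
    free_but_priced = ~courses["is_paid"] & (courses["price"] > 0)
    paid_but_zero = courses["is_paid"] & (courses["price"] == 0)
    return courses[free_but_priced | paid_but_zero]

conflicts = find_price_conflicts(sample)
print(conflicts)
```

Running the same helper on the full `df` would show whether the two columns agree before either one is used in the analysis.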

---

## Table of Contents

<a href="#Step-1.-Download-and-prepare-data-for-analysis">Step 1. Download and prepare data for analysis</a>
* <a href="#Load-Libraries">Load libraries</a>
* <a href="#Load-Datasets">Load datasets</a>
* <a href="#Missing-Values-and-Duplicates">Missing values and duplicates</a>

### Load Libraries

In [74]:
# import libraries
import pandas as pd
import numpy as np
import re
import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
import neattext.functions as nfx

### Load Datasets

In [75]:
# load dataset
df = pd.read_csv('course_info.csv')
display(df.head())

|  | id | title | is_paid | price | headline | num_subscribers | avg_rating | num_reviews | num_comments | num_lectures | content_length_min | published_time | last_update_date | category | subcategory | topic | language | course_url | instructor_name | instructor_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4715.0 | Online Vegan Vegetarian Cooking School | True | 24.99 | Learn to cook delicious vegan recipes. Filmed ... | 2231.0 | 3.75 | 134.0 | 42.0 | 37.0 | 1268.0 | 2010-08-05T22:06:13Z | 2020-11-06 | Lifestyle | Food & Beverage | Vegan Cooking | English | /course/vegan-vegetarian-cooking-school/ | Angela Poch | /user/angelapoch/ |
| 1 | 1769.0 | The Lean Startup Talk at Stanford E-Corner | False | 0.0 | Debunking Myths of Entrepreneurship A startup ... | 26474.0 | 4.5 | 709.0 | 112.0 | 9.0 | 88.0 | 2010-01-12T18:09:46Z |  | Business | Entrepreneurship | Lean Startup | English | /course/the-lean-startup-debunking-myths-of-en... | Eric Ries | /user/ericries/ |
| 2 | 5664.0 | How To Become a Vegan, Vegetarian, or Flexitarian | True | 19.99 | Get the tools you need for a lifestyle change ... | 1713.0 | 4.4 | 41.0 | 13.0 | 14.0 | 82.0 | 2010-10-13T18:07:17Z | 2019-10-09 | Lifestyle | Other Lifestyle | Vegan Cooking | English | /course/see-my-personal-motivation-for-becomin... | Angela Poch | /user/angelapoch/ |
| 3 | 7723.0 | How to Train a Puppy | True | 199.99 | Train your puppy the right way with Dr. Ian Du... | 4988.0 | 4.8 | 395.0 | 88.0 | 36.0 | 1511.0 | 2011-06-20T20:08:38Z | 2016-01-13 | Lifestyle | Pet Care & Training | Pet Training | English | /course/complete-dunbar-collection/ | Ian Dunbar | /user/ian-dunbar/ |
| 4 | 8157.0 | Web Design from the Ground Up | True | 159.99 | Learn web design online: Everything you need t... | 1266.0 | 4.75 | 38.0 | 12.0 | 38.0 | 569.0 | 2011-06-23T18:31:20Z |  | Design | Web Design | Web Design | English | /course/web-design-from-the-ground-up/ | E Learning Lab | /user/edwin-ang-2/ |


In [76]:
# df.info() prints its summary directly and returns None, which display() then shows as "None"
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209734 entries, 0 to 209733
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  209734 non-null  float64
 1   title               209734 non-null  object 
 2   is_paid             209734 non-null  bool   
 3   price               209734 non-null  float64
 4   headline            209707 non-null  object 
 5   num_subscribers     209734 non-null  float64
 6   avg_rating          209734 non-null  float64
 7   num_reviews         209734 non-null  float64
 8   num_comments        209734 non-null  float64
 9   num_lectures        209734 non-null  float64
 10  content_length_min  209734 non-null  float64
 11  published_time      209734 non-null  object 
 12  last_update_date    209597 non-null  object 
 13  category            209734 non-null  object 
 14  subcategory         209734 non-null  object 
 15  topic               208776 non-nul

None

In [77]:
# display summary statistics of the numerical columns
display(df.describe())

| statistic | id | price | num_subscribers | avg_rating | num_reviews | num_comments | num_lectures | content_length_min |
|---|---|---|---|---|---|---|---|---|
| count | 209734.0 | 209734.0 | 209734.0 | 209734.0 | 209734.0 | 209734.0 | 209734.0 | 209734.0 |
| mean | 3015403.0 | 81.665529 | 3096.992 | 3.747179 | 244.358812 | 44.874589 | 36.548395 | 265.558856 |
| std | 1342558.0 | 117.317846 | 15581.32 | 1.533711 | 2458.098276 | 355.773107 | 51.871962 | 454.448676 |
| min | 1769.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1950734.0 | 19.99 | 26.0 | 3.8 | 3.0 | 1.0 | 11.0 | 69.0 |
| 50% | 3292863.0 | 34.99 | 206.0 | 4.333334 | 17.0 | 5.0 | 22.0 | 133.0 |
| 75% | 4189458.0 | 99.99 | 1435.0 | 4.625 | 74.0 | 18.0 | 42.0 | 289.0 |
| max | 4914146.0 | 999.99 | 1752364.0 | 5.0 | 436457.0 | 39040.0 | 1095.0 | 22570.0 |


## Data Cleaning
### Convert data types

In [78]:
# convert date columns to datetime data type
df['published_time'] = pd.to_datetime(df['published_time'])
df['last_update_date'] = pd.to_datetime(df['last_update_date'], errors='coerce')
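To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up values (the strings below are illustrative and not taken from the dataset). Anything that cannot be parsed becomes `NaT` instead of raising, so it is worth counting `NaT` values after the conversion. The second part shows that a timestamp with a trailing `Z` is parsed as timezone-aware.

```python
import pandas as pd

# Made-up raw values: a valid date, a missing value, and an unparseable string.
raw = pd.Series(["2020-11-06", None, "not a date"])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")
n_unparsed = int(parsed.isna().sum())
print(parsed)
print("values that became NaT:", n_unparsed)

# ISO 8601 strings ending in "Z" are parsed as timezone-aware (UTC) timestamps.
stamped = pd.to_datetime(pd.Series(["2010-08-05T22:06:13Z"]))
print(stamped.dt.tz)
```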

In [79]:
# convert columns to integers
numeric_columns = ['id', 'num_subscribers', 'num_reviews', 'num_comments', 'num_lectures', 'content_length_min']
df[numeric_columns] = df[numeric_columns].astype(int)
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209734 entries, 0 to 209733
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype              
---  ------              --------------   -----              
 0   id                  209734 non-null  int64              
 1   title               209734 non-null  object             
 2   is_paid             209734 non-null  bool               
 3   price               209734 non-null  float64            
 4   headline            209707 non-null  object             
 5   num_subscribers     209734 non-null  int64              
 6   avg_rating          209734 non-null  float64            
 7   num_reviews         209734 non-null  int64              
 8   num_comments        209734 non-null  int64              
 9   num_lectures        209734 non-null  int64              
 10  content_length_min  209734 non-null  int64              
 11  published_time      209734 non-null  datetime64[ns, UTC]
 12  last_update_date

None
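The cast to `int` above works because none of the six columns contains missing values. As a hedged illustration on toy data (not on the dataset itself), the sketch below shows what would happen if a column did contain `NaN`, and how pandas' nullable `Int64` dtype could be used in that case.

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value (made-up numbers).
counts = pd.Series([37.0, 9.0, np.nan])

# A plain int cast cannot represent NaN, so pandas raises an error.
try:
    counts.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable "Int64" dtype stores whole numbers and keeps the gap as <NA>.
nullable_counts = counts.astype("Int64")
print(cast_failed)
print(nullable_counts)
```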

### Missing Values and Duplicates

In [81]:
# check for missing values in each column
display(df.isnull().sum())

id                      0
title                   0
is_paid                 0
price                   0
headline               27
num_subscribers         0
avg_rating              0
num_reviews             0
num_comments            0
num_lectures            0
content_length_min      0
published_time          0
last_update_date      137
category                0
subcategory             0
topic                 958
language                0
course_url              0
instructor_name         5
instructor_url        427
dtype: int64
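Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of a summary that reports both, shown on a toy frame. The helper name `missing_summary` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with a few gaps (made-up values).
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "topic": ["Python", None, "Excel", None],
    "headline": ["x", "y", None, "z"],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any (hypothetical helper)."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": missing,
        "pct_missing": (missing / len(frame) * 100).round(2),
    })
    # Keep only columns with gaps, largest first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(df)` on the full dataset would put the counts above in proportion to its 209,734 rows.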

In [82]:
# replace missing dates in last_update_date with published_time
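# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about
df['last_update_date'] = df['last_update_date'].fillna(df['published_time'])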
display(df.isnull().sum())

id                      0
title                   0
is_paid                 0
price                   0
headline               27
num_subscribers         0
avg_rating              0
num_reviews             0
num_comments            0
num_lectures            0
content_length_min      0
published_time          0
last_update_date        0
category                0
subcategory             0
topic                 958
language                0
course_url              0
instructor_name         5
instructor_url        427
dtype: int64
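One detail to watch in this step: according to the dtype listing earlier, `published_time` is timezone-aware (UTC), while `last_update_date` is parsed from plain dates. Mixing the two in one column can leave it with an inconsistent type. The sketch below, on toy data, shows one possible way to align them before filling. Dropping the timezone and truncating to midnight are assumptions about what is wanted here, not something the notebook itself does.

```python
import pandas as pd

# Toy frame mirroring the two date columns (made-up rows).
sample = pd.DataFrame({
    "published_time": pd.to_datetime(["2010-01-12T18:09:46Z", "2010-08-05T22:06:13Z"]),
    "last_update_date": pd.to_datetime([None, "2020-11-06"]),
})

# Drop the timezone and the time of day so the fallback matches the date-only column.
fallback = sample["published_time"].dt.tz_localize(None).dt.normalize()

# Fill gaps with the aligned publication date and assign the result back.
sample["last_update_date"] = sample["last_update_date"].fillna(fallback)
print(sample.dtypes)
print(sample)
```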

In [85]:
display(df.duplicated().sum())

0
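Note that `df.duplicated()` only flags rows that are identical in every column. Since `id` is described as a unique identifier, a stricter check could look for repeated `id` values. The toy example below sketches the difference; the rows are made up, and keeping the last occurrence is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy frame: the same course id appears twice with different subscriber counts.
sample = pd.DataFrame({
    "id": [1, 2, 2],
    "title": ["Course A", "Course B", "Course B"],
    "num_subscribers": [100, 250, 260],
})

# Full-row comparison finds nothing because the rows differ in one column.
full_row_dupes = int(sample.duplicated().sum())

# Restricting the comparison to `id` reveals the repeated course.
id_dupes = int(sample.duplicated(subset="id").sum())

# Keep one row per id (here: the last occurrence).
deduped = sample.drop_duplicates(subset="id", keep="last")
print(full_row_dupes, id_dupes)
print(deduped)
```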

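<b>Conclusion</b>  
We decided to drop the `headline`, `last_update_date`, `topic`, `instructor_name`, and `instructor_url` columns, the five columns in which missing values were found. No duplicate rows were detected.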
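The decision above could be carried out with `DataFrame.drop`. The snippet is a sketch on a toy frame that contains only some of the listed columns, which is why `errors="ignore"` is passed; the variable names `columns_to_drop` and `trimmed` are illustrative.

```python
import pandas as pd

# Columns named in the conclusion.
columns_to_drop = [
    "headline", "last_update_date", "topic", "instructor_name", "instructor_url",
]

# Toy frame with a subset of the dataset's columns (made-up values).
sample = pd.DataFrame({
    "id": [1, 2],
    "title": ["Course A", "Course B"],
    "headline": ["Short pitch", None],
    "topic": [None, "Python"],
})

# errors="ignore" skips listed columns that are absent from this toy frame.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

On the full dataset, where all five columns exist, `errors="ignore"` could be left out so that a misspelled column name raises an error instead of passing silently.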


<a href="#Table-of-Contents">Back to top</a>

## EDA