<div class="alert alert-info">

## Project Description

The objective of this project, from a data analyst's perspective, is to explore and analyze a Udemy dataset containing information about online courses. The dataset includes details such as course titles, pricing, enrollment statistics, and instructor information. The project aims to gain insight into the online course market, identify popular trends, and understand the factors that influence course success.




<br>
    <b>Data Source:</b> <a href="https://www.kaggle.com/datasets/hossaingh/udemy-courses?resource=download&select=Course_info.csv">Kaggle</a>
<br>
    
<b>Dataset description</b>  
`course_info.csv`: information about each Udemy course    
* `id`: A unique identifier for each course.
* `title`: The title or name of the course.
* `is_paid`: A boolean (True/False) indicating whether the course is paid or free.
* `price`: The price of the course. If the course is free, the price will be 0.
* `headline`: A brief description or headline of the course content.
* `num_subscribers`: The number of subscribers or students enrolled in the course.
* `avg_rating`: The average rating given by students who have taken the course.
* `num_reviews`: The number of reviews the course has received.
* `num_comments`: The number of comments posted by students or users about the course.
* `num_lectures`: The number of lectures or lessons in the course.
* `content_length_min`: The total length of the course content in minutes.
* `published_time`: The date and time when the course was initially published.
* `last_update_date`: The date when the course was last updated.
* `category`: The broad category to which the course belongs (e.g., Lifestyle, Business, Design, etc.).
* `subcategory`: The subcategory within the broader category (e.g., Food & Beverage, Entrepreneurship, Web Design, etc.).
* `topic`: The specific topic or subject of the course.
* `language`: The language in which the course is conducted.
* `course_url`: The URL or link to access the course.
* `instructor_name`: The name of the course instructor.
* `instructor_url`: The URL or link to the instructor's profile.
</div>

---

## Table of Contents

<a href="#Step-1.-Download-and-prepare-data-for-analysis">Step 1. Download and prepare data for analysis</a>
* <a href="#Load-Libraries">Load libraries</a>
* <a href="#Load-Datasets">Load datasets</a>
* <a href="#Missing-Values-and-Duplicates">Missing values and duplicates</a>

### Load Libraries

In [None]:
# import libraries
import pandas as pd
import numpy as np
import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Load Datasets

In [12]:
# load dataset
course_info = pd.read_csv('course_info.csv')
comments = pd.read_csv('comments.csv')

# create a list of datasets
datasets = [course_info, comments]
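
Since `comments.csv` holds over nine million rows, it may help to restrict columns and set dtypes while loading. The following is only a sketch on a tiny in-memory stand-in for the file: the sample rows are made up, and the choice of which columns to keep and the `float32` dtype for `rate` are assumptions, not part of the original analysis.

```python
import io
import pandas as pd

# Tiny stand-in for comments.csv (hypothetical rows, same column names as the real file)
sample_csv = io.StringIO(
    "id,course_id,rate,date,display_name,comment\n"
    "1,10,5.0,2021-06-29T18:54:25-07:00,Ann,Great course\n"
    "2,10,3.5,2020-10-19T06:35:37-08:00,,Too short\n"
)

comments_sample = pd.read_csv(
    sample_csv,
    # Assumption: display_name is not needed downstream, so it is skipped
    usecols=["id", "course_id", "rate", "date", "comment"],
    dtype={"id": "int64", "course_id": "int64", "rate": "float32"},
)

# The timestamps carry different UTC offsets, so normalize them to UTC
comments_sample["date"] = pd.to_datetime(comments_sample["date"], utc=True)

print(comments_sample.dtypes)
```

With the real file, the same arguments would be passed to `pd.read_csv('comments.csv', ...)`.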

In [14]:
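def get_df_name(df):
    '''Return the name of the first global variable bound to `df`.'''
    # The identity check (`is`) avoids comparing DataFrame contents.
    # This relies on globals(), so it only works for module-level variables.
    name = [x for x in globals() if globals()[x] is df][0]
    return name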

# display the first five rows
for dataset in datasets:
    display(get_df_name(dataset))
    display(dataset.head())

'course_info'

| | id | title | is_paid | price | headline | num_subscribers | avg_rating | num_reviews | num_comments | num_lectures | content_length_min | published_time | last_update_date | category | subcategory | topic | language | course_url | instructor_name | instructor_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4715.0 | Online Vegan Vegetarian Cooking School | True | 24.99 | Learn to cook delicious vegan recipes. Filmed ... | 2231.0 | 3.75 | 134.0 | 42.0 | 37.0 | 1268.0 | 2010-08-05T22:06:13Z | 2020-11-06 | Lifestyle | Food & Beverage | Vegan Cooking | English | /course/vegan-vegetarian-cooking-school/ | Angela Poch | /user/angelapoch/ |
| 1 | 1769.0 | The Lean Startup Talk at Stanford E-Corner | False | 0.0 | Debunking Myths of Entrepreneurship A startup ... | 26474.0 | 4.5 | 709.0 | 112.0 | 9.0 | 88.0 | 2010-01-12T18:09:46Z | | Business | Entrepreneurship | Lean Startup | English | /course/the-lean-startup-debunking-myths-of-en... | Eric Ries | /user/ericries/ |
| 2 | 5664.0 | How To Become a Vegan, Vegetarian, or Flexitarian | True | 19.99 | Get the tools you need for a lifestyle change ... | 1713.0 | 4.4 | 41.0 | 13.0 | 14.0 | 82.0 | 2010-10-13T18:07:17Z | 2019-10-09 | Lifestyle | Other Lifestyle | Vegan Cooking | English | /course/see-my-personal-motivation-for-becomin... | Angela Poch | /user/angelapoch/ |
| 3 | 7723.0 | How to Train a Puppy | True | 199.99 | Train your puppy the right way with Dr. Ian Du... | 4988.0 | 4.8 | 395.0 | 88.0 | 36.0 | 1511.0 | 2011-06-20T20:08:38Z | 2016-01-13 | Lifestyle | Pet Care & Training | Pet Training | English | /course/complete-dunbar-collection/ | Ian Dunbar | /user/ian-dunbar/ |
| 4 | 8157.0 | Web Design from the Ground Up | True | 159.99 | Learn web design online: Everything you need t... | 1266.0 | 4.75 | 38.0 | 12.0 | 38.0 | 569.0 | 2011-06-23T18:31:20Z | | Design | Web Design | Web Design | English | /course/web-design-from-the-ground-up/ | E Learning Lab | /user/edwin-ang-2/ |


'comments'

Unnamed: 0,id,course_id,rate,date,display_name,comment
0,88962892,3173036,1.0,2021-06-29T18:54:25-07:00,Rahul,I think a beginner needs more than you think.\...
1,125535470,4913148,5.0,2022-10-07T11:17:41-07:00,Marlo,Aviva is such a natural teacher and healer/hea...
2,68767147,3178386,3.5,2020-10-19T06:35:37-07:00,Yamila Andrea,Muy buena la introducción para entender la bas...
3,125029758,3175814,5.0,2022-09-30T21:13:49-07:00,Jacqueline,This course is the best on Udemy. This breakd...
4,76584052,3174896,4.5,2021-01-30T08:45:11-08:00,Anthony,I found this course very helpful. It was full ...
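
The `get_df_name` helper above looks names up in `globals()`, which can break inside functions or when two variables point to the same frame. One possible alternative, sketched here with two made-up miniature frames (`course_info_demo` and `comments_demo` are hypothetical stand-ins for the real datasets), is to keep the names and frames together in a dictionary:

```python
import pandas as pd

# Hypothetical miniature frames standing in for the real datasets
course_info_demo = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
comments_demo = pd.DataFrame({"id": [10], "course_id": [1]})

# Map each name to its frame explicitly instead of searching globals()
named_datasets = {"course_info": course_info_demo, "comments": comments_demo}

for name, df in named_datasets.items():
    print(name, df.shape)
    print(df.head())
```

In a notebook, `print` could be swapped for `display` to keep the rich table rendering.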


### Missing Values and Duplicates

In [16]:
for dataset in datasets:
    display(get_df_name(dataset))
    # info() prints its report and returns None, so call it without display()
    dataset.info()
    display(dataset.isnull().sum())
    display(dataset.duplicated().sum())

'course_info'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209734 entries, 0 to 209733
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  209734 non-null  float64
 1   title               209734 non-null  object 
 2   is_paid             209734 non-null  bool   
 3   price               209734 non-null  float64
 4   headline            209707 non-null  object 
 5   num_subscribers     209734 non-null  float64
 6   avg_rating          209734 non-null  float64
 7   num_reviews         209734 non-null  float64
 8   num_comments        209734 non-null  float64
 9   num_lectures        209734 non-null  float64
 10  content_length_min  209734 non-null  float64
 11  published_time      209734 non-null  object 
 12  last_update_date    209597 non-null  object 
 13  category            209734 non-null  object 
 14  subcategory         209734 non-null  object 
 15  topic               208776 non-nul

id                      0
title                   0
is_paid                 0
price                   0
headline               27
num_subscribers         0
avg_rating              0
num_reviews             0
num_comments            0
num_lectures            0
content_length_min      0
published_time          0
last_update_date      137
category                0
subcategory             0
topic                 958
language                0
course_url              0
instructor_name         5
instructor_url        427
dtype: int64

0

'comments'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9411727 entries, 0 to 9411726
Data columns (total 6 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   course_id     int64  
 2   rate          float64
 3   date          object 
 4   display_name  object 
 5   comment       object 
dtypes: float64(1), int64(2), object(3)
memory usage: 430.8+ MB


id                  0
course_id           0
rate                0
date                0
display_name    75362
comment          6333
dtype: int64

0
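
The raw counts above are easier to compare when expressed as a share of all rows. A minimal sketch, assuming a hypothetical helper named `missing_summary` and a made-up four-row frame (neither is part of the original notebook):

```python
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up demo data with the same kind of gaps as course_info
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "topic": ["x", None, None, "y"],
    "headline": ["h", "h", None, "h"],
})

print(missing_summary(demo))
```

Applied to the real data, this would be called as `missing_summary(course_info)` and `missing_summary(comments)`.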

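<b>Conclusion</b>  
Neither dataset contains duplicated rows. In `comments`, `display_name` (75,362) and `comment` (6,333) have missing values. Next step: delete `headline`, `last_update_date`, `topic`, `instructor_name`, and `instructor_url`, the five `course_info` columns that contain missing values.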
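
The deletion could look roughly like the sketch below. It assumes that "delete" means dropping the listed columns entirely rather than dropping the rows with gaps, and it runs on a made-up two-row frame (`demo` and `cleaned` are hypothetical names) instead of the real `course_info`.

```python
import pandas as pd

# Columns named in the conclusion
cols_to_drop = ["headline", "last_update_date", "topic", "instructor_name", "instructor_url"]

# Made-up stand-in for course_info with a few of its columns
demo = pd.DataFrame({
    "id": [1.0, 2.0],
    "title": ["A", "B"],
    "headline": ["h", None],
    "last_update_date": [None, "2020-11-06"],
    "topic": ["t", None],
    "instructor_name": ["n", "m"],
    "instructor_url": [None, "/user/m/"],
})

# errors="ignore" keeps the call safe if a column was already removed
cleaned = demo.drop(columns=cols_to_drop, errors="ignore")

print(cleaned.columns.tolist())
print(cleaned.isna().sum().sum())
```

Dropping into a new variable rather than in place keeps the original frame available in case any of the removed columns turn out to be needed later.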


<a href="#Table-of-Contents">Back to top</a>