# Data Collection Notebook

### Objectives:

- Fetch Udemy course data from Kaggle using token and save it as raw data.
- Inspect the data and save it under outputs/datasets/collection for further processing.

### Inputs:

- Kaggle JSON file - the authentication token.

### Outputs:

- Generate Dataset: outputs/datasets/collection/TelcoCustomerChurn.csv

### Change working directory
We need to change the working directory from its current folder to its parent folder

 - Access the current directory with os.getcwd()

In [5]:
import os
current_dir = os.getcwd()
current_dir

'/Users/panda/Desktop/code_institue_projects/portfolio-projects/learning_trends_analyzer'

In [4]:
# In case you want to go one directory back

os.chdir(os.path.dirname(current_dir))
current_dir

'/Users/panda/Desktop/code_institue_projects/portfolio-projects/learning_trends_analyzer/jupyter_notebooks'

In [6]:
current_dir

'/Users/panda/Desktop/code_institue_projects/portfolio-projects/learning_trends_analyzer'

In [54]:
# If you want to change the directory to sub directory

#os.chdir(os.path.join(current_dir, 'learning_trends_analyzer/jupyter_notebooks'))

### Install python packages in the notebooks

In [43]:
%pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Fetch Data from Kaggle

In [69]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: kaggle.json: No such file or directory


In [66]:
KaggleDatasetPath = "andrewmvd/udemy-courses"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/andrewmvd/udemy-courses
License(s): other
Downloading udemy-courses.zip to inputs/datasets/raw
100%|█████████████████████████████████████████| 200k/200k [00:00<00:00, 731kB/s]
100%|█████████████████████████████████████████| 200k/200k [00:00<00:00, 729kB/s]


### Steps to download dataset directly

If you'd like to download the dataset from Kaggle without using the API, you can follow these steps:

**Step 1: Sign in to Kaggle**
- Go to the Kaggle dataset page: [Udemy Courses Dataset](https://www.kaggle.com/datasets/andrewmvd/udemy-courses).
- Log in to your Kaggle account.

**Step 2: Download the Dataset Manually**
- On the dataset page, you’ll see a **Download** button on the top right.
- Click on the **Download** button to download the dataset as a `.zip` file.

**Step 3: Unzip the Dataset**
- After downloading the `.zip` file, you need to extract its contents
  
**Step 4: Access the Dataset**
Once unzipped, you can access the dataset and begin using it.

### Load and Inspect data

In [28]:
import pandas as pd
df = pd.read_csv(f"inputs/datasets/raw/udemy_courses.csv")
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance


Dataframe Summary

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB


Descriptive statistics of the data

In [30]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
course_id,3678.0,675971.963295,343273.15604,8324.0,407692.5,687917.0,961355.5,1282064.0
price,3678.0,66.049483,61.005755,0.0,20.0,45.0,95.0,200.0
num_subscribers,3678.0,3197.150625,9504.11701,0.0,111.0,911.5,2546.0,268923.0
num_reviews,3678.0,156.259108,935.452044,0.0,4.0,18.0,67.0,27445.0
num_lectures,3678.0,40.108755,50.383346,0.0,15.0,25.0,45.75,779.0
content_duration,3678.0,4.094517,6.05384,0.0,1.0,2.0,4.5,78.5


Check for null values so that we can impute if there's any. No null value found!

In [31]:
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64

Bring all the duplicated rows to check if we can drop them in next step

In [32]:
df[df.duplicated(keep=False)]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
453,837322,Essentials of money value: Get a financial Life !,https://www.udemy.com/essentials-of-money-value/,True,20,0,0,20,All Levels,0.616667,2016-05-16T18:28:30Z,Business Finance
454,1157298,Introduction to Forex Trading Business For Beg...,https://www.udemy.com/introduction-to-forex-tr...,True,20,0,0,27,Beginner Level,1.5,2017-04-23T16:19:01Z,Business Finance
463,1084454,CFA Level 2- Quantitative Methods,https://www.udemy.com/cfa-level-2-quantitative...,True,40,0,0,35,All Levels,5.5,2017-07-02T14:29:35Z,Business Finance
778,1035638,Understanding Financial Statements,https://www.udemy.com/understanding-financial-...,True,25,0,0,10,All Levels,1.0,2016-12-15T14:56:17Z,Business Finance
787,837322,Essentials of money value: Get a financial Life !,https://www.udemy.com/essentials-of-money-value/,True,20,0,0,20,All Levels,0.616667,2016-05-16T18:28:30Z,Business Finance
788,1157298,Introduction to Forex Trading Business For Beg...,https://www.udemy.com/introduction-to-forex-tr...,True,20,0,0,27,Beginner Level,1.5,2017-04-23T16:19:01Z,Business Finance
894,1035638,Understanding Financial Statements,https://www.udemy.com/understanding-financial-...,True,25,0,0,10,All Levels,1.0,2016-12-15T14:56:17Z,Business Finance
1100,1084454,CFA Level 2- Quantitative Methods,https://www.udemy.com/cfa-level-2-quantitative...,True,40,0,0,35,All Levels,5.5,2017-07-02T14:29:35Z,Business Finance
1234,185526,MicroStation - Células,https://www.udemy.com/microstation-celulas/,True,20,0,0,9,Beginner Level,0.616667,2014-04-15T21:48:55Z,Graphic Design
1473,185526,MicroStation - Células,https://www.udemy.com/microstation-celulas/,True,20,0,0,9,Beginner Level,0.616667,2014-04-15T21:48:55Z,Graphic Design


In [34]:
df.drop_duplicates(keep='first', inplace=False, subset='course_id')
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB


In [36]:
df['published_timestamp'] = pd.to_datetime(df['published_timestamp'])

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   course_id            3678 non-null   int64              
 1   course_title         3678 non-null   object             
 2   url                  3678 non-null   object             
 3   is_paid              3678 non-null   bool               
 4   price                3678 non-null   int64              
 5   num_subscribers      3678 non-null   int64              
 6   num_reviews          3678 non-null   int64              
 7   num_lectures         3678 non-null   int64              
 8   level                3678 non-null   object             
 9   content_duration     3678 non-null   float64            
 10  published_timestamp  3678 non-null   datetime64[ns, UTC]
 11  subject              3678 non-null   object             
dtypes: bool(1), datetime

### Push files to Repo

In [40]:
import os
try:
  os.makedirs(name='outputs/datasets/collection')
except Exception as e:
  print(e)
df.to_csv(f"outputs/datasets/collection/udemy_courses.csv",index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
