# Data Exploration Notebook

### Objectives:

**Thorough Exploratory Data Analysis (EDA):**
    Study each column in depth, along with the relationships between columns.

### Inputs:

inputs/datasets/raw/udemy_courses.csv

### Outputs:

Code that answers business requirement 1 and can be reused to build the Streamlit app.

### 1. Import libraries and get the current directory path

In [12]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

current_dir = os.getcwd()

In [2]:
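# Move the working directory up one level, to the project root.
# current_dir was captured before this call, so it still shows the notebook folder.
os.chdir(os.path.dirname(current_dir))
current_dir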

'/Users/panda/Desktop/code_institue_projects/portfolio-projects/learning_trends_analyzer/jupyter_notebooks'
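
Because `current_dir` is read before `os.chdir` runs, re-running the import cell and then the cell above would climb one more level each time. A minimal sketch of a more repeatable alternative, assuming the project root is the nearest directory that contains an `inputs` folder. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the notebook:

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "inputs") -> Path:
    """Return `start` or its nearest ancestor that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then walk upwards one parent at a time.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")
```

In the notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`, which lands in the same place however many times the cell is run.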

### 2. Load the dataset and preview the first five rows

In [5]:
df = pd.read_csv("inputs/datasets/raw/udemy_courses.csv")
df.head()

| | course_id | course_title | url | is_paid | price | num_subscribers | num_reviews | num_lectures | level | content_duration | published_timestamp | subject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1070968 | Ultimate Investment Banking Course | https://www.udemy.com/ultimate-investment-bank... | True | 200 | 2147 | 23 | 51 | All Levels | 1.5 | 2017-01-18T20:58:58Z | Business Finance |
| 1 | 1113822 | Complete GST Course & Certification - Grow You... | https://www.udemy.com/goods-and-services-tax/ | True | 75 | 2792 | 923 | 274 | All Levels | 39.0 | 2017-03-09T16:34:20Z | Business Finance |
| 2 | 1006314 | Financial Modeling for Business Analysts and C... | https://www.udemy.com/financial-modeling-for-b... | True | 45 | 2174 | 74 | 51 | Intermediate Level | 2.5 | 2016-12-19T19:26:30Z | Business Finance |
| 3 | 1210588 | Beginner to Pro - Financial Analysis in Excel ... | https://www.udemy.com/complete-excel-finance-c... | True | 95 | 2451 | 11 | 36 | All Levels | 3.0 | 2017-05-30T20:07:24Z | Business Finance |
| 4 | 1011058 | How To Maximize Your Profits Trading Options | https://www.udemy.com/how-to-maximize-your-pro... | True | 200 | 1276 | 45 | 26 | Intermediate Level | 2.0 | 2016-12-13T14:57:18Z | Business Finance |


### 3. Get the statistical summary and the dataset's complete information

In [10]:
df.describe()

| | course_id | price | num_subscribers | num_reviews | num_lectures | content_duration |
|---|---|---|---|---|---|---|
| count | 3678.0 | 3678.0 | 3678.0 | 3678.0 | 3678.0 | 3678.0 |
| mean | 675972.0 | 66.049483 | 3197.150625 | 156.259108 | 40.108755 | 4.094517 |
| std | 343273.2 | 61.005755 | 9504.11701 | 935.452044 | 50.383346 | 6.05384 |
| min | 8324.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 407692.5 | 20.0 | 111.0 | 4.0 | 15.0 | 1.0 |
| 50% | 687917.0 | 45.0 | 911.5 | 18.0 | 25.0 | 2.0 |
| 75% | 961355.5 | 95.0 | 2546.0 | 67.0 | 45.75 | 4.5 |
| max | 1282064.0 | 200.0 | 268923.0 | 27445.0 | 779.0 | 78.5 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB


### 4. Data Preprocessing

Data preprocessing cleans the dataset by handling missing values, removing duplicates and correcting inconsistencies. The `df.info()` output shows that `published_timestamp` is stored as `object` (plain strings), so we convert it to datetime, then check for duplicate rows and null values.

In [19]:
# Convert published_timestamp to datetime
df['published_timestamp'] = pd.to_datetime(df['published_timestamp'])
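
The timestamps in the preview end in `Z`, so they parse as timezone-aware UTC values. The sketch below, run on three timestamps copied from the preview rather than on the full dataset, shows the conversion and how year and month columns might be derived from it for trend analysis. The column names `published_year` and `published_month` are illustrative assumptions, not columns from the dataset:

```python
import pandas as pd

# Three timestamps copied from the preview, standing in for the full dataset.
sample = pd.DataFrame(
    {
        "published_timestamp": [
            "2017-01-18T20:58:58Z",
            "2017-03-09T16:34:20Z",
            "2016-12-19T19:26:30Z",
        ]
    }
)

# utc=True makes the timezone handling explicit;
# errors="coerce" turns any unparseable string into NaT instead of raising.
sample["published_timestamp"] = pd.to_datetime(
    sample["published_timestamp"], utc=True, errors="coerce"
)

# Hypothetical helper columns for grouping courses by publication date.
sample["published_year"] = sample["published_timestamp"].dt.year
sample["published_month"] = sample["published_timestamp"].dt.month
```

With `errors="coerce"`, a follow-up `isna().sum()` on the column would reveal how many values failed to parse.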

In [15]:
# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

Duplicates: 6
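
The count says that six rows are exact repeats of earlier rows, but not which ones. A sketch on a tiny hand-made frame (the values are invented for illustration) shows how the repeated rows could be inspected with `keep=False` and then dropped:

```python
import pandas as pd

# Invented toy data: rows 0 and 2 are exact duplicates of each other.
toy = pd.DataFrame(
    {
        "course_id": [101, 102, 101, 103],
        "price": [20, 45, 20, 0],
    }
)

# keep=False flags every member of a duplicate group, not only the later repeats.
repeated_rows = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

Applied to the notebook's data, the same two calls would be `df[df.duplicated(keep=False)]` and `df.drop_duplicates()`.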


In [16]:
# Check for null values
print(f"Null Values: {df.isnull().sum()}")

Null Values: course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64
