# Data Cleaning Notebook

### Objectives:

Clean the raw Udemy courses dataset so it is ready for further processing

### Inputs:

inputs/datasets/raw/udemy_courses.csv

### Outputs:

Cleaned dataset, saved as outputs/datasets/cleaned/cleanedDataset.csv

### 1. Import libraries and get the current directory path

In [1]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
current_dir = os.getcwd()

In [3]:
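# Move the working directory up one level, from jupyter_notebooks to the project root.
# Note: current_dir still holds the original path, so the value shown below is the old location.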

os.chdir(os.path.dirname(current_dir))
current_dir

'/Users/panda/Desktop/code_institue_projects/portfolio-projects/learning_trends_analyzer/jupyter_notebooks'
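
Re-running the first cell and then the `os.chdir` cell moves up one more directory each time. A minimal sketch of a guard against that, assuming the notebooks live in a `jupyter_notebooks` folder directly under the project root (as the printed path suggests). `find_project_root` is a hypothetical helper, not part of this notebook, and the `/tmp/project` path is only an example.

```python
from pathlib import Path


def find_project_root(start, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `start` if `start` is the notebooks folder, else `start` itself."""
    start = Path(start)
    return start.parent if start.name == notebooks_dir else start


# Calling it on the notebooks folder moves up one level...
root = find_project_root("/tmp/project/jupyter_notebooks")
# ...and calling it again on the result changes nothing, so re-running is safe.
root_again = find_project_root(root)
print(root, root_again)
```

In the notebook this could be used as `os.chdir(find_project_root(os.getcwd()))`.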

### 2. Load the raw dataset

In [4]:
df = pd.read_csv("inputs/datasets/raw/udemy_courses.csv")
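
After loading, it usually helps to look at the first few rows and the overall shape. The sketch below uses a small made-up DataFrame (`sample`) as a stand-in for the real file, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Udemy data; the values are made up for illustration.
sample = pd.DataFrame({
    "course_id": [1, 2, 3, 4, 5, 6],
    "course_title": ["A", "B", "C", "D", "E", "F"],
    "price": [20, 0, 45, 200, 95, 50],
})

preview = sample.head()  # first five rows by default
print(preview)
print(sample.shape)      # (number of rows, number of columns)
```

On the real data the equivalent calls would be `df.head()` and `df.shape`.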

### 3. Check for null values

In [5]:
# Check for missing values
df.isnull().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64
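
Raw counts are easier to judge next to the share of rows they represent. A possible sketch, where `missing_summary` is a hypothetical helper and `sample` is made-up data containing a few gaps:

```python
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Made-up data with deliberate gaps, for illustration only.
sample = pd.DataFrame({
    "price": [20, None, 45, 200],
    "level": ["Beginner", "All Levels", None, None],
})
summary = missing_summary(sample)
print(summary)
```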

### 4. Drop rows with missing essential data

In [6]:
# Drop rows with missing essential data (e.g., 'course_id', 'course_title', 'price').
# The null check above found no missing values, so this is a safeguard and removes nothing here.
df.dropna(subset=['course_id', 'course_title', 'price'], inplace=True)
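
To confirm how many rows a `dropna` call actually removes, the row count can be compared before and after. A small sketch on made-up data: only rows missing one of the essential columns are dropped, while a gap in a non-essential column such as `level` is left alone.

```python
import pandas as pd

# Made-up rows: the second lacks a title, the third lacks only a level.
sample = pd.DataFrame({
    "course_id": [1, 2, 3],
    "course_title": ["Intro to Python", None, "Excel Basics"],
    "price": [20, 50, 45],
    "level": ["Beginner", "All Levels", None],
})

essential = ["course_id", "course_title", "price"]
rows_before = len(sample)
cleaned = sample.dropna(subset=essential)
rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```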

### 5. Encode is_paid as 0/1

In [7]:
# Handle categorical data - Encoding 'is_paid' as 0/1
df['is_paid'] = df['is_paid'].astype(int)
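
The cast above works when pandas has parsed `is_paid` as a boolean column. If the column arrived as text such as `"True"`/`"False"`, `astype(int)` would raise an error. A defensive sketch under that assumption follows; `encode_is_paid` is a hypothetical helper and the accepted spellings are a guess.

```python
import pandas as pd


def encode_is_paid(column):
    """Encode a boolean or 'True'/'False' text column as 0/1 integers."""
    if column.dtype == bool:
        return column.astype(int)
    # Assumed text spellings; anything else is treated as an error.
    mapping = {"true": 1, "false": 0}
    encoded = column.astype(str).str.strip().str.lower().map(mapping)
    if encoded.isnull().any():
        raise ValueError("unexpected value in is_paid column")
    return encoded.astype(int)


from_bool = encode_is_paid(pd.Series([True, False, True]))
from_text = encode_is_paid(pd.Series(["True", "FALSE ", "true"]))
print(from_bool.tolist(), from_text.tolist())
```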

### 6. Convert published_timestamp to datetime format

In [8]:
# Convert 'published_timestamp' to datetime format
df['published_timestamp'] = pd.to_datetime(df['published_timestamp'])
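
`pd.to_datetime` raises on values it cannot parse. If malformed timestamps are a concern, `errors="coerce"` turns them into `NaT` so they can be counted and inspected. A sketch on made-up strings (the timestamps below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up values; the middle one is deliberately invalid.
raw = pd.Series(["2017-01-18T20:58:58Z", "not a date", "2016-06-30T16:40:46Z"])

parsed = pd.to_datetime(raw, errors="coerce", utc=True)
n_bad = int(parsed.isna().sum())  # values that could not be parsed
print(parsed)
print(f"{n_bad} unparseable timestamp(s)")
```

Once parsed, components such as `parsed.dt.year` become available for later analysis.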

### 7. Remove duplicates

In [13]:
# Remove duplicates
df.drop_duplicates(inplace=True)

In [14]:
df.shape

(3672, 12)
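
Counting duplicates before dropping them makes the effect of this step visible. The sketch below uses made-up data and also resets the index afterwards, which the notebook does not do (its final index still runs to 3677); whether to reset is a design choice.

```python
import pandas as pd

# Made-up rows: the second and third are exact duplicates.
sample = pd.DataFrame({
    "course_id": [1, 2, 2, 3],
    "course_title": ["A", "B", "B", "C"],
})

n_full_dupes = int(sample.duplicated().sum())                       # identical rows
n_id_dupes = int(sample.duplicated(subset=["course_id"]).sum())     # repeated ids
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_full_dupes, n_id_dupes, deduped.shape)
```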

### 8. Handle Outliers in Numerical Columns (e.g., 'price')

In [15]:
# Handle outliers in numerical columns, e.g., 'price'
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# .copy() avoids a SettingWithCopyWarning when columns are modified in later cells
df = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)].copy()
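
The same IQR rule can be wrapped in a small function so it is reusable for other numeric columns. This is a sketch: `iqr_bounds` is a hypothetical helper and the prices are made up.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR below Q1 and above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices with one obvious outlier.
prices = pd.Series([0, 20, 20, 25, 30, 40, 50, 1000])
lower, upper = iqr_bounds(prices)
kept = prices[prices.between(lower, upper)]  # inclusive on both ends
print(lower, upper, len(kept))
```

Here Q1 is 20 and Q3 is 42.5, so the fences are -13.75 and 76.25 and only the value 1000 is removed.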

### 9. Clip negative content_duration values to zero

In [17]:
# A duration cannot be negative, so replace any negative 'content_duration' with 0
df['content_duration'] = df['content_duration'].clip(lower=0)
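
Note that `clip` replaces out-of-range values rather than removing rows, so it can be worth counting how many values are affected first. A sketch on made-up durations:

```python
import pandas as pd

# Made-up durations in hours; one is negative.
durations = pd.Series([1.5, -2.0, 0.0, 78.5])

n_negative = int((durations < 0).sum())  # values that clip will change
clipped = durations.clip(lower=0)
print(n_negative, clipped.tolist())
```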

### 10. Final Data Check

In [19]:
# Final data check
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3672 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   course_id            3672 non-null   int64              
 1   course_title         3672 non-null   object             
 2   url                  3672 non-null   object             
 3   is_paid              3672 non-null   int64              
 4   price                3672 non-null   int64              
 5   num_subscribers      3672 non-null   int64              
 6   num_reviews          3672 non-null   int64              
 7   num_lectures         3672 non-null   int64              
 8   level                3672 non-null   object             
 9   content_duration     3672 non-null   float64            
 10  published_timestamp  3672 non-null   datetime64[ns, UTC]
 11  subject              3672 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(1), int64(6), object(4)

### 11. Save the cleaned data

In [20]:
try:
    # Create the outputs/datasets/cleaned folder
    os.makedirs(name='outputs/datasets/cleaned')
except OSError as e:
    # Typically raised because the folder already exists; report it and continue
    print(e)

[Errno 17] File exists: 'outputs/datasets/cleaned'


In [21]:
df.to_csv("outputs/datasets/cleaned/cleanedDataset.csv", index=False)
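
CSV files do not store dtypes, so the datetime column comes back as plain text unless it is parsed again on load. A sketch of a round-trip check, written against a temporary folder and made-up rows so that it does not touch the real output file:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows standing in for the cleaned DataFrame.
sample = pd.DataFrame({
    "course_id": [1, 2],
    "published_timestamp": pd.to_datetime(
        ["2017-01-18T20:58:58Z", "2016-06-30T16:40:46Z"], utc=True
    ),
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "outputs" / "datasets" / "cleaned"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_file = out_dir / "cleanedDataset.csv"
    sample.to_csv(out_file, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store
    reloaded = pd.read_csv(out_file, parse_dates=["published_timestamp"])

print(reloaded.dtypes)
```

Downstream notebooks that read `cleanedDataset.csv` would need the same `parse_dates` argument to get datetimes back.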