# ML Modeling and Evaluation Notebook

### Objectives:

Fit and evaluate a regression model

### Inputs:

outputs/datasets/cleaned/cleanedDataset.csv

### Outputs:

### Price Optimization Based on Course Popularity

Business Case:
Analyzing the relationship between price, number of subscribers, and number of reviews can inform a dynamic pricing strategy. For example, courses with more subscribers or more reviews may support higher prices, while courses with fewer subscribers could be offered discounts or special promotions to attract more learners. Aligning course prices with demand and perceived value in this way could boost revenue.

- Perform regression analysis to understand how price correlates with factors like subscribers and reviews.
- Segment courses by popularity (e.g., top 10%, bottom 10%) and analyze the pricing strategies for each group.
- Implement dynamic pricing algorithms based on market demand and course popularity.
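
The segmentation step in the list above could be sketched roughly as follows. This is a minimal illustration on a small made-up DataFrame, not the notebook's dataset. The column names `price` and `num_subscribers` match the cleaned data, while the helper name `segment_by_popularity`, the `segment` column, and the 10% cut-offs are assumptions chosen for illustration.

```python
import pandas as pd

def segment_by_popularity(df, col="num_subscribers", lower_q=0.10, upper_q=0.90):
    """Label each row 'bottom', 'middle', or 'top' using quantiles of `col` (hypothetical helper)."""
    low, high = df[col].quantile([lower_q, upper_q])
    out = df.copy()
    out["segment"] = "middle"
    out.loc[out[col] <= low, "segment"] = "bottom"
    out.loc[out[col] >= high, "segment"] = "top"
    return out

# Toy data for illustration only (not taken from the course dataset)
toy = pd.DataFrame({
    "price": [20, 25, 40, 45, 50, 60, 75, 95, 150, 200],
    "num_subscribers": [10, 50, 120, 300, 450, 800, 1500, 2200, 5000, 12000],
})

segmented = segment_by_popularity(toy)

# Compare pricing across the popularity segments
price_by_segment = segmented.groupby("segment")["price"].agg(["count", "mean", "median"])
print(price_by_segment)
```

Comparing the mean and median price per segment gives a first view of whether popular courses are already priced differently, before any regression model is fitted.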

### 1. Import libraries and get the current directory path

In [7]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Move up one directory so the relative paths below resolve (re-running this cell moves up again)
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))

### 2. Check the data

In [8]:
df = pd.read_csv("outputs/datasets/cleaned/cleanedDataset.csv")

In [9]:
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,1,200,2147,23,51,All Levels,1.5,2017-01-18 20:58:58+00:00,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,1,75,2792,923,274,All Levels,39.0,2017-03-09 16:34:20+00:00,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,1,45,2174,74,51,Intermediate Level,2.5,2016-12-19 19:26:30+00:00,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,1,95,2451,11,36,All Levels,3.0,2017-05-30 20:07:24+00:00,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,1,200,1276,45,26,Intermediate Level,2.0,2016-12-13 14:57:18+00:00,Business Finance


In [10]:
print('Unique Subjects',df['subject'].unique())
print('Unique Levels',df['level'].unique())

Unique Subjects ['Business Finance' 'Graphic Design' 'Musical Instruments'
 'Web Development']
Unique Levels ['All Levels' 'Intermediate Level' 'Beginner Level' 'Expert Level']


### 3. Data Cleaning

In [11]:
def clean_data(df):
    # Keep only paid courses; copy so later column assignments do not touch a slice of the original
    df = df[df['is_paid'] == 1].copy()

    # Drop columns that are not used as model features
    df = df.drop(['course_title', 'course_id', 'url', 'published_timestamp', 'is_paid'], axis=1)

    # Fill missing values in numeric columns with the column mean
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

    # Fill missing values in categorical columns (if any) with the most frequent value
    categorical_columns = df.select_dtypes(include=[object]).columns
    for col in categorical_columns:
        df[col] = df[col].fillna(df[col].mode()[0])

    return df

# Applying the clean_data function
df = clean_data(df)

In [12]:
df.head(3)

Unnamed: 0,price,num_subscribers,num_reviews,num_lectures,level,content_duration,subject
0,200,2147,23,51,All Levels,1.5,Business Finance
1,75,2792,923,274,All Levels,39.0,Business Finance
2,45,2174,74,51,Intermediate Level,2.5,Business Finance


In [13]:
df.shape

(3362, 7)

### 4. Feature Engineering

In [14]:
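# Feature Engineering
def feature_engineering(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Encode categorical variables (level and subject) as integer codes
    le_level = LabelEncoder()
    df['level'] = le_level.fit_transform(df['level'])

    le_subject = LabelEncoder()
    df['subject'] = le_subject.fit_transform(df['subject'])

    # Derived features
    df['review_subscriber_ratio'] = df['num_reviews'] / df['num_subscribers']
    df['course_popularity'] = df['num_subscribers'] * df['num_reviews']
    df['lectures_to_duration'] = df['num_lectures'] / df['content_duration']  # Lectures per unit of content duration

    # Division by zero (no subscribers or zero duration) yields inf/NaN; set those ratios to 0
    ratio_cols = ['review_subscriber_ratio', 'lectures_to_duration']
    df[ratio_cols] = df[ratio_cols].replace([np.inf, -np.inf], np.nan).fillna(0)

    return df, le_level, le_subject

df, le_level, le_subject = feature_engineering(df)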

In [15]:
df.head(3)

Unnamed: 0,price,num_subscribers,num_reviews,num_lectures,level,content_duration,subject,review_subscriber_ratio,course_popularity,lectures_to_duration
0,200,2147,23,51,0,1.5,0,0.010713,49381,34.0
1,75,2792,923,274,0,39.0,0,0.330587,2577016,7.025641
2,45,2174,74,51,3,2.5,0,0.034039,160876,20.4
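
`LabelEncoder` assigns integer codes in sorted order of the class labels, so the codes in the output above (for example `level` 3 for 'Intermediate Level') are arbitrary identifiers rather than an ordering of difficulty. The sketch below shows one way to inspect such a mapping. It fits a fresh encoder on the four level names listed in section 2 instead of reusing the notebook's `le_level`, and the variable names `levels`, `le`, `mapping`, and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Level names as printed in section 2 of this notebook
levels = ["All Levels", "Intermediate Level", "Beginner Level", "Expert Level"]

le = LabelEncoder()
le.fit(levels)

# classes_ holds the labels in sorted order; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels
decoded = list(le.inverse_transform([0, 3]))
print(decoded)
```

In the notebook itself, the same information should be available from `le_level.classes_` and `le_subject.classes_`, which is why `feature_engineering` returns both encoders.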


### 5. Scaling Features