# Feature Engineering Notebook

### Objectives:

Engineer features for classification, regression, and clustering models.

### Inputs:

outputs/datasets/cleaned/cleanedDataset.csv

### Outputs:

Generate a list of variables to engineer.

### 1. Import libraries and set the working directory

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OrdinalEncoder
from feature_engine.outliers import Winsorizer
from feature_engine.selection import SmartCorrelatedSelection
import scipy.stats as stats
import warnings

# Set styles
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')

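# Move up one directory so that relative paths resolve from the project root.
# Run this only once per session: each run moves up another level.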
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
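
Because `os.chdir(os.path.dirname(current_dir))` moves up one level every time the cell runs, re-running it can leave the notebook outside the project. A minimal sketch of a more repeatable alternative follows. The helper name `find_project_root` is hypothetical, and it assumes the `outputs` folder named under Inputs sits directly in the project root.

```python
import os


def find_project_root(start_dir, marker="outputs"):
    """Walk upward from start_dir until a directory containing `marker` is found.

    `marker` is an assumption: a folder expected to exist only in the project root.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isdir(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise FileNotFoundError(f"No '{marker}' directory found above {start_dir}")
        current = parent


# Usage: safe to run repeatedly, since it resolves to the same root each time.
# os.chdir(find_project_root(os.getcwd()))
```

Running the usage line a second time finds the same root again instead of climbing further up the tree.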

### 2. Check the data

In [4]:
df = pd.read_csv("outputs/datasets/cleaned/cleanedDataset.csv")
df['published_timestamp'] = pd.to_datetime(df['published_timestamp'])

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3672 entries, 0 to 3671
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   course_id            3672 non-null   int64              
 1   course_title         3672 non-null   object             
 2   url                  3672 non-null   object             
 3   is_paid              3672 non-null   int64              
 4   price                3672 non-null   int64              
 5   num_subscribers      3672 non-null   int64              
 6   num_reviews          3672 non-null   int64              
 7   num_lectures         3672 non-null   int64              
 8   level                3672 non-null   object             
 9   content_duration     3672 non-null   float64            
 10  published_timestamp  3672 non-null   datetime64[ns, UTC]
 11  subject              3672 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(1), int64(6), object(4)
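
The `published_timestamp` column is timezone-aware (UTC). As a rough illustration of what that means for a single value, the sketch below parses one timestamp string in the format this notebook displays, using only the standard library rather than pandas. The derived date parts are an assumption about what later steps might use, not something this notebook computes.

```python
from datetime import datetime, timedelta

# One timestamp string in the format shown in this notebook's output
raw_value = "2015-04-08 18:11:52+00:00"

# fromisoformat keeps the +00:00 offset, so the result is timezone-aware
parsed = datetime.fromisoformat(raw_value)
is_utc = parsed.utcoffset() == timedelta(0)

# Hypothetical date parts that could serve as numeric features
date_parts = {"year": parsed.year, "month": parsed.month, "hour": parsed.hour}

print(is_utc, date_parts)
```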

### 3. Split Dataset into Train and Test

In [6]:
# Split the dataset into train and test
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# Verify the shape of train and test datasets
train_set.shape, test_set.shape

((2937, 12), (735, 12))
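
The split sizes follow from the row count and `test_size`. A short sketch of the arithmetic, assuming scikit-learn rounds the test count up and gives the remainder to the train set when only `test_size` is passed:

```python
import math

n_samples = 3672  # rows reported by df.info()
test_size = 0.2   # fraction passed to train_test_split

# 0.2 * 3672 = 734.4, which is rounded up for the test set
n_test = math.ceil(test_size * n_samples)

# The train set receives whatever is left
n_train = n_samples - n_test

print(n_train, n_test)  # 2937 735
```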

### 4. Feature Engineering for Categorical Variables

In [8]:
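# Select the categorical variables to encode
categorical_vars = ['course_title', 'level', 'subject']

# Keep a copy of the categorical columns from the train set for inspection
df_cat = train_set[categorical_vars].copy()

# Fit the ordinal encoder on the train set only, then apply it to both sets.
# 'arbitrary' assigns integers to categories without using a target variable.
# Note: course_title is high-cardinality (each of the first rows below gets a
# new code), so titles seen only in the test set have no learned mapping.
encoder = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_vars)
train_set_encoded = encoder.fit_transform(train_set)
test_set_encoded = encoder.transform(test_set)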

# Display the transformed train dataset
train_set_encoded.head(20)

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
978,467238,0,https://www.udemy.com/advance-technical-analysis/,1,50,152,6,14,0,3.5,2015-04-08 18:11:52+00:00,0
1146,66383,1,https://www.udemy.com/beginners-guide-to-techn...,1,40,829,78,50,1,8.5,2013-09-15 15:06:02+00:00,0
3543,826366,2,https://www.udemy.com/building-a-mvc-5-members...,1,95,654,86,239,2,25.5,2016-06-26 21:15:57+00:00,1
3468,633606,3,https://www.udemy.com/essential-jquery-training/,1,20,1098,15,54,3,2.5,2015-11-13 19:37:48+00:00,1
2494,948440,4,https://www.udemy.com/psd-to-html5-beginner-to...,1,40,1502,218,49,3,6.0,2016-10-03 21:11:56+00:00,1
2500,1009254,5,https://www.udemy.com/api-development/,1,165,7057,655,65,1,18.5,2016-11-12 18:53:51+00:00,1
2401,1104500,6,https://www.udemy.com/instant-harmonica-play-1...,1,40,31,2,14,3,1.0,2017-05-19 12:26:28+00:00,2
67,408440,7,https://www.udemy.com/how-to-win-97-percent-of...,1,125,5050,461,26,3,1.5,2015-02-10 04:21:40+00:00,0
3372,1024888,8,https://www.udemy.com/how-to-make-a-modern-wor...,0,0,6856,137,19,1,2.0,2016-11-29 23:01:43+00:00,1
1113,791574,9,https://www.udemy.com/succeed-in-futures-even-...,0,0,3014,19,15,3,1.0,2017-01-03 05:58:06+00:00,0
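
To make the encoded columns above easier to read, here is a minimal pure-Python sketch of arbitrary ordinal encoding. It assumes categories are numbered in order of first appearance, which is consistent with the 0, 1, 2, ... sequence in `course_title` above. The function names and the level labels are hypothetical and only for illustration; this is not the `feature_engine` implementation.

```python
def fit_arbitrary_mapping(values):
    """Assign 0, 1, 2, ... to categories in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def apply_mapping(values, mapping):
    """Encode values; categories missing from the mapping become None."""
    return [mapping.get(value) for value in values]


# Hypothetical labels, not taken from the dataset
train_levels = ["All Levels", "Beginner Level", "All Levels", "Expert Level"]
test_levels = ["Beginner Level", "Intermediate Level"]

# Learn the mapping from the train values only, then reuse it for the test values
level_mapping = fit_arbitrary_mapping(train_levels)
train_encoded = apply_mapping(train_levels, level_mapping)  # [0, 1, 0, 2]
test_encoded = apply_mapping(test_levels, level_mapping)    # [1, None]
```

The `None` in `test_encoded` shows why a near-unique column such as `course_title` is a risky candidate for this encoding: most test-set titles would have no code learned from the train set.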
