In [5]:
# imports

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### What Is Feature Engineering?


Feature engineering is the practice of using business knowledge to select and transform the features of raw data.

Data rarely arrives in a format suitable for model training. When a data professional first works with a dataset, some cleaning is almost always necessary, because raw data can lead to inefficient models. However, any strategy for editing, deleting, or creating data within a set must rest on the premise that the solution to the business problem is not affected. This is the basis of Feature Engineering.

### Some techniques to apply in Feature Engineering

So far, we have seen that we need to understand the business and the problem to be solved. Next, we collect the data, which is the raw material for any model. Then we carry out exploratory analysis, the step in which we dig into the data to understand the relationships within it, possible gaps, and so on. Now it's time to select the attributes that will be used for model training. Once they are chosen, there are techniques to ensure that these attributes are in a suitable format for training. Let's learn more about these techniques:

### Feature Engineering for Feature Selection

#### Filters

Filter methods assess how closely each attribute relates to the output variable. For instance, in a fraud analysis context, how much does the "Age" attribute help determine whether a purchase is fraudulent? Questions like this can be answered through correlation analysis, for example. Note that this process is independent of any machine learning model: the scores come from statistics computed directly on the data, and the final decision about which attributes to keep is made by the analyst, often with the help of charts. The advantage of filters is their low computational cost, since no model has to be trained. Filters are a common attribute selection technique in the data science community.
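
To make the idea concrete, here is a minimal sketch of a correlation-based filter. The data is synthetic and the column names (`age`, `purchase_amount`, `noise`, `is_fraud`) as well as the `0.2` threshold are assumptions made only for illustration; they do not come from a real fraud dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are illustrative only.
rng = np.random.default_rng(0)
n = 500
purchase_amount = rng.normal(100, 30, size=n)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "purchase_amount": purchase_amount,
    "noise": rng.normal(0, 1, size=n),
    # By construction, the target depends only on purchase_amount
    "is_fraud": (purchase_amount + rng.normal(0, 10, size=n) > 130).astype(int),
})

# Absolute correlation of each attribute with the target, highest first
correlations = (
    df.drop(columns="is_fraud")
    .corrwith(df["is_fraud"])
    .abs()
    .sort_values(ascending=False)
)

# The threshold is a human decision; 0.2 is an arbitrary choice for this sketch
threshold = 0.2
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print("Selected attributes:", selected)
```

No model is trained here: the ranking comes only from a statistic computed on the data, and the analyst still chooses the cut-off.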

#### Wrapper Methods


This technique involves instantiating an algorithm and passing our dataset to it. The algorithm trains a model on different subsets of attributes and determines, through an evaluation process (pay attention to this part), which combination of attributes is best for our training. An example is scikit-learn's RFE (Recursive Feature Elimination), which repeatedly fits a model and discards the least important attributes until the desired number remains.
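
A minimal sketch of RFE in practice follows. The dataset is generated with `make_classification`, and the choice of `LogisticRegression` as the wrapped estimator and of keeping 3 attributes are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 attributes, of which 3 carry signal (illustrative only)
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# RFE wraps an estimator and removes one attribute per iteration (step=1)
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=3, step=1)
selector.fit(X, y)

selected_mask = selector.support_  # True for the attributes that were kept
ranking = selector.ranking_        # 1 = kept; higher numbers were dropped earlier

print("Kept attributes (column indices):", selected_mask.nonzero()[0])
print("Ranking:", ranking)
```

Because the estimator is refitted at every elimination step, the cost grows with the number of attributes, which is the main trade-off of wrapper methods.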

#### Embedded Methods

While wrapper methods determine the best combination of attributes through an evaluation process, embedded methods achieve this based on the results of the model training itself. This means that training and feature selection happen at the same time. In scikit-learn, for example, estimators such as RandomForestClassifier and RandomForestRegressor already incorporate techniques during training that determine the most relevant attributes for the model. However, this doesn't mean that we can stop caring about attribute selection when using RandomForest models. Embedded methods can incur high computational costs, especially with large datasets, which can be problematic.

For example, applying a filter before RandomForest, so that it deals with a smaller set of attributes, is an appropriate solution.
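
As a rough illustration of an embedded method, the sketch below trains a `RandomForestClassifier` on synthetic data and reads the importances it computed during training. The generated dataset and the `feature_i` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 attributes, of which 3 carry signal (illustrative only)
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Training the forest also produces one importance score per attribute
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

The scores are a by-product of training, so no separate selection step runs; the analyst can then decide how many of the top-ranked attributes to keep.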

### Feature Engineering for Handling Categorical Variables

#### One-Hot-Encoding 🔥

Categorical variables are typically not in a suitable format for model training. They commonly arrive as text, like the course category in the dataset below. The One-Hot-Encoding technique converts these values into numerical data by creating one binary column per category. A practical example of this:

In [4]:
base_example_one_hot = pd.read_csv('online_course_engagement_data.csv')

base_example_one_hot.head()

# Let's apply one-hot-encoding on column 'CourseCategory'

| Unnamed: 0 | UserID | CourseCategory | TimeSpentOnCourse | NumberOfVideosWatched | NumberOfQuizzesTaken | QuizScores | CompletionRate | DeviceType | CourseCompletion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5618 | Health | 29.979719 | 17 | 3 | 50.365656 | 20.860773 | 1 | 0 |
| 1 | 4326 | Arts | 27.80264 | 1 | 5 | 62.61597 | 65.632415 | 1 | 0 |
| 2 | 5849 | Arts | 86.820485 | 14 | 2 | 78.458962 | 63.812007 | 1 | 1 |
| 3 | 4992 | Science | 35.038427 | 17 | 10 | 59.198853 | 95.433162 | 0 | 1 |
| 4 | 3866 | Programming | 92.490647 | 16 | 0 | 98.428285 | 18.102478 | 0 | 0 |


In [11]:
# Count the rows in each course category (the DataFrame was loaded in the cell above)

base_example_one_hot["CourseCategory"].value_counts()

CourseCategory
Business       1837
Health         1821
Science        1814
Programming    1810
Arts           1718
Name: count, dtype: int64

#### Applying One-Hot-Encoding technique

In [10]:
categorical_column = ['CourseCategory']

# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder and build a DataFrame with one binary column per category,
# keeping the original index so the concat below aligns row by row
base_example_one_hot_encoded = pd.DataFrame(
    encoder.fit_transform(base_example_one_hot[categorical_column]),
    columns=encoder.get_feature_names_out(categorical_column),
    index=base_example_one_hot.index,
)

# Drop the original text column and append the encoded columns
base_preprocessed = pd.concat(
    [base_example_one_hot.drop(columns=categorical_column), base_example_one_hot_encoded],
    axis=1,
)

base_preprocessed.head()

| Unnamed: 0 | UserID | TimeSpentOnCourse | NumberOfVideosWatched | NumberOfQuizzesTaken | QuizScores | CompletionRate | DeviceType | CourseCompletion | CourseCategory_Arts | CourseCategory_Business | CourseCategory_Health | CourseCategory_Programming | CourseCategory_Science |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5618 | 29.979719 | 17 | 3 | 50.365656 | 20.860773 | 1 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 4326 | 27.80264 | 1 | 5 | 62.61597 | 65.632415 | 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5849 | 86.820485 | 14 | 2 | 78.458962 | 63.812007 | 1 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4992 | 35.038427 | 17 | 10 | 59.198853 | 95.433162 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 3866 | 92.490647 | 16 | 0 | 98.428285 | 18.102478 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


Now, instead of a text column with the name of the course category, we have one binary column per category, filled with 1 if the record belongs to that category and 0 if it does not.
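
For quick exploration, pandas offers a shortcut that produces a similar result without scikit-learn. The sketch below uses `pd.get_dummies` on a tiny hand-made DataFrame; the four rows are made up for illustration and are not taken from the dataset above. This is an alternative to the `OneHotEncoder` approach shown earlier, not a replacement for it.

```python
import pandas as pd

# Hypothetical toy data, used only to show the shape of the result
toy = pd.DataFrame({"CourseCategory": ["Health", "Arts", "Arts", "Science"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["CourseCategory"], dtype=int)

print(encoded)
```

One practical difference: `OneHotEncoder` remembers the categories it saw during `fit`, so the same encoding can be reapplied to new data, whereas `get_dummies` only looks at the DataFrame it is given.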