In [1]:
# imports

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### What is Feature Engineering?


It's a technique that uses business knowledge to select and transform the features of raw data.

Data rarely arrives in a format suitable for model training. When a data professional first works with a dataset, some cleaning is almost always necessary, because raw data can lead to inefficient models. However, any strategy for editing, deleting, or creating data within a set must rest on the premise that the solution to the business problem will not be affected. This is the basis of Feature Engineering.

### Some techniques to apply in Feature Engineering

By now we understand that we need knowledge of the business and of the problem to be solved. Next, we collect the data, which is the raw material for any model. We then perform exploratory analysis, a step in which we dig into the data to understand the relationships between variables, possible gaps, and so on. Now it's time to select the attributes that will be used for model training. Once they are chosen, there are techniques to ensure that their format is suitable for the process. Let's learn more about these techniques:

### Feature Engineering for Feature Selection

#### Filters

This involves applying techniques to assess how closely an attribute relates to the output variable. For instance, in a fraud analysis context, how important is the "Age" attribute for determining whether a purchase is fraudulent or not? Questions like this can be answered through correlation analysis, for example. It's important to note that this process is independent of any particular model: the scores come from the data alone, and although we can use charts to visualize which attributes make sense for training, the final decision is human-driven. The advantage of filters is that they are computationally cheap, since no model has to be trained to score the attributes, which helps explain why they are so common in the data science community.
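
A minimal sketch of a correlation-based filter, assuming a small fraud dataset. The column names (`Age`, `PurchaseAmount`, `IsFraud`) and all values are invented for illustration and do not come from the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: names and values are made up for this example.
toy = pd.DataFrame({
    "Age":            [22, 35, 47, 51, 29, 63, 41, 38],
    "PurchaseAmount": [900, 120, 80, 60, 700, 40, 150, 300],
    "IsFraud":        [1, 0, 0, 0, 1, 0, 0, 0],
})

# Pearson correlation of each feature with the target,
# sorted by absolute strength (strongest first).
correlations = (
    toy.drop(columns="IsFraud")
       .corrwith(toy["IsFraud"])
       .sort_values(key=abs, ascending=False)
)

print(correlations)
```

The output is only a ranking. Deciding where to draw the line between "keep" and "drop" is still up to the analyst.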

#### Wrapper Methods


This technique involves instantiating an algorithm and passing our dataset to it. The algorithm trains a model on different subsets of attributes and determines, through an evaluation process (pay attention to this part), which combination of attributes is best for our training. An example of an algorithm that implements this technique is scikit-learn's RFE (Recursive Feature Elimination), which repeatedly fits an estimator and discards the least important attributes until the requested number remains.
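
A minimal sketch of RFE on synthetic data. The dataset comes from `make_classification`, and both the choice of `LogisticRegression` as the estimator and the target of 3 features are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# RFE fits the estimator, drops the weakest feature, and repeats
# until only n_features_to_select features remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

Because the estimator is refitted at every elimination step, the cost grows with the number of attributes, which is the main trade-off against filters.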

#### Embedded Methods

While wrapper methods determine the best combination of attributes through a separate evaluation process, embedded methods achieve this based on the results of the model training itself. This means that training and feature selection happen at the same time. In scikit-learn, for example, estimators such as RandomForestClassifier and RandomForestRegressor compute an importance score for each attribute as part of training (exposed as `feature_importances_`). However, this doesn't mean that when using RandomForest models we can stop worrying about attribute selection. Embedded methods can incur high computational costs, especially with large datasets, which can be problematic.
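
As a rough illustration, the sketch below trains a `RandomForestClassifier` on synthetic data and reads the importances it computed during training. The dataset and hyperparameters are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances are a by-product of training and sum to 1.
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)

for feature_index, score in ranked:
    print(f"feature {feature_index}: {score:.3f}")
```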

Applying filtering before RandomForest, so it deals with a smaller set of attributes, is an appropriate solution, for example.
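
One possible way to combine the two ideas is a scikit-learn `Pipeline` that filters first and trains the forest afterwards. This is a sketch under assumptions: the synthetic dataset, the `f_classif` score, and `k=5` are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, of which only 4 carry signal.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4,
    n_redundant=0, random_state=0,
)

pipeline = Pipeline([
    # Filter step: keep the 5 features with the highest ANOVA F-score.
    ("filter", SelectKBest(score_func=f_classif, k=5)),
    # Embedded step: the forest only ever sees the 5 surviving features.
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(X, y)

kept = pipeline.named_steps["filter"].get_support()
print(kept.sum())
print(pipeline.named_steps["forest"].n_features_in_)
```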

### Feature Engineering for handling categorical variables

#### One-Hot-Encoding 🔥

Categorical variables are typically not in a suitable format for model training. It's common for them to arrive as text, like the 'CourseCategory' column in the dataset used below. The One-Hot-Encoding technique converts these values into numerical data by creating one binary column per category. A practical example of this:

In [2]:
base_example_one_hot = pd.read_csv('online_course_engagement_data.csv')

base_example_one_hot.head()

# Let's apply one-hot-encoding on column 'CourseCategory'

Unnamed: 0,UserID,CourseCategory,TimeSpentOnCourse,NumberOfVideosWatched,NumberOfQuizzesTaken,QuizScores,CompletionRate,DeviceType,CourseCompletion
0,5618,Health,29.979719,17,3,50.365656,20.860773,1,0
1,4326,Arts,27.80264,1,5,62.61597,65.632415,1,0
2,5849,Arts,86.820485,14,2,78.458962,63.812007,1,1
3,4992,Science,35.038427,17,10,59.198853,95.433162,0,1
4,3866,Programming,92.490647,16,0,98.428285,18.102478,0,0


In [11]:
base_example_one_hot = pd.read_csv('online_course_engagement_data.csv')

base_example_one_hot["CourseCategory"].value_counts()

CourseCategory
Business       1837
Health         1821
Science        1814
Programming    1810
Arts           1718
Name: count, dtype: int64

#### Applying One-Hot-Encoding technique

In [13]:
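categorical_column = ['CourseCategory']

# sparse_output=False returns a dense array (scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder and wrap the result in a DataFrame, one binary column per category
base_example_one_hot_encoded = pd.DataFrame(encoder.fit_transform(base_example_one_hot[categorical_column]))

# Name the new columns after the categories, e.g. CourseCategory_Arts
base_example_one_hot_encoded.columns = encoder.get_feature_names_out(categorical_column)

# Drop the original text column and reset the index so both frames align row by row
base_preprocessed = base_example_one_hot.drop(categorical_column, axis=1).reset_index(drop = True)

# Join the remaining columns with the encoded ones, side by side
base_preprocessed = pd.concat([base_preprocessed, base_example_one_hot_encoded], axis = 1)

base_preprocessed.head(10)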

Unnamed: 0,UserID,TimeSpentOnCourse,NumberOfVideosWatched,NumberOfQuizzesTaken,QuizScores,CompletionRate,DeviceType,CourseCompletion,CourseCategory_Arts,CourseCategory_Business,CourseCategory_Health,CourseCategory_Programming,CourseCategory_Science
0,5618,29.979719,17,3,50.365656,20.860773,1,0,0.0,0.0,1.0,0.0,0.0
1,4326,27.80264,1,5,62.61597,65.632415,1,0,1.0,0.0,0.0,0.0,0.0
2,5849,86.820485,14,2,78.458962,63.812007,1,1,1.0,0.0,0.0,0.0,0.0
3,4992,35.038427,17,10,59.198853,95.433162,0,1,0.0,0.0,0.0,0.0,1.0
4,3866,92.490647,16,0,98.428285,18.102478,0,0,0.0,0.0,0.0,1.0,0.0
5,8650,79.466129,12,7,70.233329,76.484023,0,1,0.0,0.0,1.0,0.0,0.0
6,4321,78.908724,10,2,86.836533,22.588896,1,0,0.0,0.0,1.0,0.0,0.0
7,4589,12.068237,16,3,61.553646,27.410991,1,0,0.0,1.0,0.0,0.0,0.0
8,4215,81.935709,8,4,90.264564,33.308437,0,1,0.0,1.0,0.0,0.0,0.0
9,8089,83.394026,15,10,63.956353,33.2613,1,0,0.0,0.0,0.0,1.0,0.0


Now, instead of text with the name of the course category, we have binary columns filled with 1 if the row belongs to that category and 0 if it does not.

A plausible question at this point is: "Why, instead of generating one binary column per category, doesn't the One-Hot Encoder simply assign an increasing or random value to each category?" The answer is that the model might interpret such values as ordered, so that category X, assigned the value 1, would seem to be "less than" category Y, assigned the value 2, even though no such order exists. 😉
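
To make the difference concrete, here is a small sketch comparing integer codes with one-hot columns. The four-row series is a made-up sample that reuses category names from the dataset.

```python
import pandas as pd

# Hypothetical four-row sample of the category column.
categories = pd.Series(["Health", "Arts", "Science", "Arts"], name="CourseCategory")

# Integer codes: compact, but they imply Health (0) < Arts (1) < Science (2),
# an order that has no meaning for course categories.
codes, uniques = pd.factorize(categories)
print(list(codes))  # [0, 1, 2, 1]

# One-hot columns: every category gets its own 0/1 column, so no order is implied.
one_hot = pd.get_dummies(categories, dtype=int)
print(one_hot)
```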

A well-known drawback of One-Hot Encoding is that it creates one new column per distinct value. When a categorical variable has many distinct values, the number of columns grows significantly, making the dataset larger and more complex. In such cases, One-Hot Encoding might not be the best solution.
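
A quick sketch of this growth, using a hypothetical `City` column with 500 distinct values (the column and its values are invented for the example):

```python
import pandas as pd

# Hypothetical high-cardinality column: 2000 rows, 500 distinct cities.
cities = pd.Series([f"city_{i % 500}" for i in range(2000)], name="City")

encoded = pd.get_dummies(cities, dtype=int)

# A single text column has turned into one column per distinct value.
print(cities.nunique())
print(encoded.shape)
```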


#### Frequency Encoding 🔢

It is a technique that involves replacing categorical values with the total count of their occurrences. It's quite straightforward; let’s use the same dataset from the previous example.

So, to recap, the total count of values in the 'CourseCategory' column is:

In [4]:
base_example_encoding_frequency = pd.read_csv('online_course_engagement_data.csv')
 
base_example_encoding_frequency['CourseCategory'].value_counts()

CourseCategory
Business       1837
Health         1821
Science        1814
Programming    1810
Arts           1718
Name: count, dtype: int64

Now let's apply the Frequency Encoding technique with the code below:

In [5]:
frequency = base_example_encoding_frequency['CourseCategory'].value_counts()
 
base_example_encoding_frequency["CourseCategory"] = base_example_encoding_frequency["CourseCategory"].map(frequency)
 
base_example_encoding_frequency

Unnamed: 0,UserID,CourseCategory,TimeSpentOnCourse,NumberOfVideosWatched,NumberOfQuizzesTaken,QuizScores,CompletionRate,DeviceType,CourseCompletion
0,5618,1821,29.979719,17,3,50.365656,20.860773,1,0
1,4326,1718,27.802640,1,5,62.615970,65.632415,1,0
2,5849,1718,86.820485,14,2,78.458962,63.812007,1,1
3,4992,1814,35.038427,17,10,59.198853,95.433162,0,1
4,3866,1810,92.490647,16,0,98.428285,18.102478,0,0
...,...,...,...,...,...,...,...,...,...
8995,8757,1821,37.445225,14,4,54.469359,32.990704,1,0
8996,894,1814,48.631443,7,7,59.413257,0.254625,0,0
8997,6323,1821,38.212512,3,3,69.508297,70.188159,1,0
8998,3652,1821,70.048665,13,10,79.655182,72.975225,1,1


Notice the new values in the "CourseCategory" column: the text values have been replaced with their total counts.

**How useful is this?** It is useful in datasets where how often a category appears is relevant for predicting the target variable. For example, when creating a model to predict whether products left in a marketplace cart will be purchased later, the frequency of appearance of certain products can be significant. Products with high viewing frequency may be more likely to be purchased by a user who left them in the cart out of indecision. Another example is in recommendation algorithms: imagine a streaming service that uses the number of times a user has watched a particular genre of movie or series to recommend similar productions.

However, we should avoid frequency encoding when two distinct categories have, or could have, the same number of appearances. In these cases both categories receive the same encoded value and the model can no longer tell them apart, so information is lost. Additionally, keep in mind that the algorithm will assume a relationship between these counts and the target variable during training. If this is not the case for the dataset, it is better to use another technique.
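
The collision problem can be seen in a tiny sketch. The `Plan` column and its values are hypothetical and unrelated to the course dataset.

```python
import pandas as pd

# Hypothetical column where two categories appear equally often.
plans = pd.Series(["basic", "basic", "pro", "pro", "team"], name="Plan")

frequency = plans.value_counts()
encoded = plans.map(frequency)

# "basic" and "pro" both occur twice, so both are encoded as 2.
print(encoded.tolist())  # [2, 2, 2, 2, 1]

# Three distinct categories went in, only two distinct values came out.
print(plans.nunique(), encoded.nunique())
```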