# FEATURE TRANSFORMATION

At this stage of the project, several feature transformation techniques are applied so that the variables meet the requirements of the algorithms used in the modelling phase.

As discussed during the exploratory data analysis stage, two different models will be developed:
1. A lead segmentation model that helps sales and marketing teams identify the company's different lead profiles.
2. A predictive lead scoring model that identifies people who are most likely to convert into paying customers.

In both cases, categorical features have to be transformed into numerical features. Since the categorical features in the dataset are nominal (their categories have no inherent order), one-hot encoding will be used for this purpose.

The lead segmentation model will use unsupervised modelling techniques based on the K-means algorithm. Because K-means is distance-based, it is very sensitive to differences in feature scale, so rescaling has to be applied to bring all features onto the same scale. Since the categorical features will be one-hot encoded into 0/1 columns, min-max scaling is the most sensible choice here: it maps each numerical feature to the range 0 to 1, matching the range of the encoded columns.
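
To illustrate why scale matters for a distance-based algorithm, here is a small sketch on made-up data. The array `X` and the helper `dist` are hypothetical and not part of the project code. Min-max scaling computes `x' = (x - min) / (max - min)` per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): column 0 is a large-scale numeric feature,
# column 1 is a one-hot style 0/1 flag
X = np.array([[100.0, 0.0],
              [2000.0, 0.0],
              [110.0, 1.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: row 0 looks far closer to row 2 than to row 1, purely because
# of the magnitude of column 0; the flag barely matters
print(dist(X[0], X[1]), dist(X[0], X[2]))

# Min-max scaling: x' = (x - min) / (max - min), computed per column
X_scaled = MinMaxScaler().fit_transform(X)

# Scaled: both columns lie in [0, 1] and contribute comparably
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```

After scaling, a difference in the 0/1 flag weighs about as much as the full range of the numeric feature, which is the behaviour wanted when both kinds of columns feed the same distance computation.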

It also has to be decided whether to discretise or binarise features. Since prediction matters more than interpretability in this project, and since one of the models is based on a segmentation algorithm, neither discretisation nor binarisation will be applied.

Finally, note that class balancing is not necessary, as both classes (converted=1, converted=0) are sufficiently represented in the dataset.
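
A check of this kind can be sketched as follows. The values in `toy_target` are invented for illustration and do not reflect the actual class proportions of the project dataset.

```python
import pandas as pd

# Toy target (hypothetical values, not the project dataset)
toy_target = pd.Series([1, 0, 0, 1, 0, 1, 0, 0], name='converted')

# Share of each class; a heavily skewed split would suggest rebalancing
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```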

## IMPORTING PACKAGES

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler


#To increase autocomplete response speed
%config IPCompleter.greedy=True

## DATA IMPORTATION

Project path.

In [2]:
project_path = (r'C:\Users\pedro\PEDRO\DS\Portfolio\LEAD_SCORING').replace('\\','/')

Names of data files.

In [3]:
cat_name = 'cat_result_eda.pickle'
num_name = 'num_result_eda.pickle'

Importing the data.

In [4]:
cat = pd.read_pickle(project_path + '/02_Data/03_Work/' + cat_name)
num = pd.read_pickle(project_path + '/02_Data/03_Work/' + num_name)

Separating the target.

In [5]:
target = num[['converted']].copy().reset_index(drop=True)

## TRANSFORMATION OF CATEGORICAL FEATURES

### One Hot Encoding

Selecting nominal features to be encoded using OHE:

In [6]:
var_ohe = ['lead_origin', 'source', 'do_not_email', 'do_not_call', 'last_activity',
           'last_notable_activity', 'country', 'city', 'specialization', 'ocupation',
           'hear_about', 'matters_most', 'lead_magnet']

Instantiating:

In [7]:
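# scikit-learn >= 1.2 renamed `sparse` to `sparse_output` (use sparse=False on older versions)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')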

Training and applying encoding:

In [8]:
cat_ohe = ohe.fit_transform(cat[var_ohe])

Saving as a dataframe:

In [9]:
cat_ohe = pd.DataFrame(cat_ohe, columns = ohe.get_feature_names_out())

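## TRANSFORMATION OF NUMERICAL FEATURES

As decided above, neither discretisation nor binarisation is applied, so the numerical features need no transformation at this step; they are only rescaled later.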

## JOINING TRANSFORMED DATASETS

In [10]:
df = pd.concat([cat_ohe,num.reset_index()],axis=1)
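
The index reset matters because `pd.concat(axis=1)` aligns rows on index labels, not on position. The sketch below uses two hypothetical toy frames, `left` and `right`, to show what happens when the labels do not match. In the cell above, `reset_index()` is called without `drop=True`, so the previous index is kept as a column.

```python
import pandas as pd

# Toy frames (hypothetical): 'left' has a fresh RangeIndex, as a dataframe
# built from a NumPy array does; 'right' keeps a non-default index
left = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'b': [10, 20, 30]}, index=[5, 8, 9])

# concat(axis=1) aligns on index labels, so mismatched labels yield NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Resetting the index first gives a clean row-by-row join
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
print(aligned)
```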

## FEATURE RESCALING

### Using Min-Max scaling

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7125 entries, 0 to 7124
Data columns (total 80 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   lead_origin_API                                   7125 non-null   float64
 1   lead_origin_Landing Page Submission               7125 non-null   float64
 2   lead_origin_Lead Add Form                         7125 non-null   float64
 3   lead_origin_Others                                7125 non-null   float64
 4   source_Direct Traffic                             7125 non-null   float64
 5   source_Google                                     7125 non-null   float64
 6   source_Olark Chat                                 7125 non-null   float64
 7   source_Organic Search                             7125 non-null   float64
 8   source_Others                                     7125 non-null   float64
 9   source_Reference   

Selecting the features to be rescaled using min-max scaling:

In [12]:
var_mms = df.iloc[:,74:-1].columns
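
Positional slicing depends on the exact number and order of columns. As a hedged alternative, the same selection can be made by name. The sketch below uses a toy frame: the numerical column names `total_visits` and `time_on_site`, and the variables `toy_df`, `ohe_columns` and `toy_var_mms`, are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical columns) mimicking the layout: OHE columns, id,
# numerical features, target
toy_df = pd.DataFrame({
    'source_Google': [1.0, 0.0],
    'source_Others': [0.0, 1.0],
    'id': [101, 102],
    'total_visits': [3, 7],
    'time_on_site': [120, 45],
    'converted': [0, 1],
})
ohe_columns = ['source_Google', 'source_Others']

# Exclude encoded columns, the identifier and the target by name
excluded = set(ohe_columns) | {'id', 'converted'}
toy_var_mms = [col for col in toy_df.columns if col not in excluded]
print(toy_var_mms)
```

On this toy frame the result matches `toy_df.iloc[:, 3:-1].columns`, but it would keep working if the number of encoded columns changed.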

Instantiating:

In [13]:
mms = MinMaxScaler()

Training and applying min-max scaling:

In [14]:
df_mms = mms.fit_transform(df[var_mms])

Saving as a dataframe:

In [15]:
#Adding suffixes to feature names
nombres_mms = [variable + '_mms' for variable in var_mms]

#Saving as dataframe
df_mms = pd.DataFrame(df_mms,columns = nombres_mms)
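
If new leads have to be scaled later, the fitted scaler can be reused rather than refitted. The following is a minimal sketch under that assumption; the frames `train` and `new_leads` and the column name `total_visits` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical column name and values)
train = pd.DataFrame({'total_visits': [0.0, 5.0, 10.0]})
new_leads = pd.DataFrame({'total_visits': [2.5, 20.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)

# The fitted scaler stores the per-feature minimum and maximum
print(scaler.data_min_, scaler.data_max_)

# transform() reuses those values; inputs outside the fitted range
# fall outside [0, 1]
new_scaled = scaler.transform(new_leads)
print(new_scaled.ravel())
```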

## JOINING RESCALED DATASETS

In [16]:
df_input = pd.concat([df.id,cat_ohe,df_mms,target],axis=1)

## SAVING DATASETS AFTER DATA TRANSFORMATION

The df_input dataframe is saved once the data transformation procedures have been applied. It is stored in pickle format so that metadata changes, such as column data types, are not lost.

### Defining dataset name

In [17]:
path_df_input = project_path + '/02_Data/03_Work/' + 'df_input.pickle'

### Saving the dataset

In [18]:
df_input.to_pickle(path_df_input)
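
The difference between pickle and a plain-text format can be sketched with a round trip on a toy frame. The data, file names and variables below are hypothetical; a temporary directory is used so that nothing is written to the project folders.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical) with a dtype that CSV cannot represent
toy = pd.DataFrame({'city': pd.Categorical(['City A', 'City B', 'City A']),
                    'converted': [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'toy.pickle')
    csv_path = os.path.join(tmp_dir, 'toy.csv')

    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

# Pickle keeps the categorical dtype; CSV reads it back as plain strings
print(from_pickle.dtypes['city'], from_csv.dtypes['city'])
```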