This phase compiles and synthesises the final processes applied to the initial dataset structure and to each of its features, packaging them into functions and pipelines.

Note: All processes have already been discussed in the corresponding stages of the project.

## IMPORTING PACKAGES

In [37]:
import numpy as np
import pandas as pd
import cloudpickle

#To increase autocomplete response speed
%config IPCompleter.greedy=True

from janitor import clean_names

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

#Disabling warnings
import warnings
warnings.filterwarnings("ignore")

## DATA IMPORTATION

### Project path

In [38]:
project_path = (r'C:\Users\pedro\PEDRO\DS\Portfolio\LEAD_SCORING').replace('\\','/')
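
The backslash replacement above can also be done with the standard library's `pathlib`. This is a minimal sketch, assuming the same Windows-style root; the subfolder and file names are the ones used in the data importation cell of this notebook, and the variable names are illustrative only.

```python
from pathlib import PureWindowsPath

# Same project root as above, parsed as a Windows path
root = PureWindowsPath(r'C:\Users\pedro\PEDRO\DS\Portfolio\LEAD_SCORING')

# as_posix() renders the path with forward slashes
project_path_posix = root.as_posix()

# The / operator joins path components without manual string concatenation
data_path = (root / '02_Data' / '01_Originals' / 'Leads.csv').as_posix()

print(project_path_posix)
print(data_path)
```

`PureWindowsPath` only manipulates the path string, so the sketch behaves the same on any operating system.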

### Names of data files

In [39]:
data_file_name = 'Leads.csv'

### Data importation

In [40]:
full_path = project_path + '/02_Data/01_Originals/' + data_file_name

df = pd.read_csv(full_path,sep=',')

### Selecting only final features

#### Loading the final features list

In [41]:
names_final_features = project_path + '/05_Results/' + 'final_features.pickle'

pd.read_pickle(names_final_features).sort_index().index.to_list()

['activity_score_mms',
 'city_Mumbai',
 'city_Other Cities',
 'city_Thane & Outskirts',
 'country_India',
 'do_not_call_No',
 'do_not_email_No',
 'hear_about_Multiple Sources',
 'hear_about_Student of SomeSchool',
 'last_activity_Converted to Lead',
 'last_activity_Email Opened',
 'last_activity_Form Submitted on Website',
 'last_activity_Olark Chat Conversation',
 'last_activity_Others',
 'last_activity_Page Visited on Website',
 'last_activity_SMS Sent',
 'last_activity_Unknown',
 'last_notable_activity_Email Link Clicked',
 'last_notable_activity_Modified',
 'last_notable_activity_SMS Sent',
 'lead_magnet_No',
 'lead_origin_Lead Add Form',
 'matters_most_Better Career Prospects',
 'ocupation_Others',
 'ocupation_Student',
 'ocupation_Unemployed',
 'ocupation_Working Professional',
 'page_views_per_visit_mms',
 'profile_score_mms',
 'source_Direct Traffic',
 'source_Google',
 'source_Olark Chat',
 'source_Others',
 'source_Reference',
 'source_Referral Sites',
 'source_Welingak Websi

#### Manually writing the list of final features (without suffixes)

In [42]:
final_features = ['activity_score',
                  'city',
                  'country',
                  'do_not_call',
                  'do_not_email',
                  'hear_about',
                  'last_activity',                  
                  'last_notable_activity',                  
                  'lead_magnet',   
                  'lead_origin',  
                  'matters_most',     
                  'ocupation',
                  'page_views_per_visit',
                  'profile_score',
                  'source',
                  'specialization',
                  'total_time_website',
                  'total_visits']
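
As a cross-check of the manual list, the base names could also be derived from the transformed names by prefix matching against the raw column names. The sketch below is hypothetical and not part of the original workflow: `base_feature_names` is an illustrative helper, and the two input lists are small hand-made subsets rather than the real data.

```python
def base_feature_names(encoded_names, raw_columns):
    """Map each transformed name to the longest raw column name it starts with."""
    # Longest candidates first, so 'last_notable_activity' is tried before shorter names
    candidates = sorted(raw_columns, key=len, reverse=True)
    found = set()
    for name in encoded_names:
        for col in candidates:
            if name == col or name.startswith(col + '_'):
                found.add(col)
                break
    return sorted(found)

# Illustrative subset of the transformed names listed above
encoded = ['activity_score_mms', 'city_Mumbai', 'last_activity_SMS Sent',
           'last_notable_activity_Modified', 'source_Google', 'source_Others']

# Illustrative subset of raw columns ('tags' is a made-up column that is never matched)
raw = ['activity_score', 'city', 'last_activity', 'last_notable_activity', 'source', 'tags']

base_names = base_feature_names(encoded, raw)
print(base_names)
```

Features that were kept for other reasons would still need to be added by hand, so this only complements the manual list.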

#### Creating the process-feature matrix

To compile and synthesise the final processes applied both to the initial dataset structure and to each of its features, an Excel workbook named 'Production stage_Processes Design' was prepared. It can be found in the folder '01_Documents'.

#### Update imported packages

Go back to the top (importing packages section) and update the imports so that they include only the packages that will finally be used.

## DATASET WRANGLING

### Formatting feature names

In [43]:
df = clean_names(df)

In [44]:
df.rename(columns={'lead_number':'id',
                   'lead_source':'source',
                   'totalvisits':'total_visits',
                   'total_time_spent_on_website':'total_time_website',
                   'how_did_you_hear_about_x_education':'hear_about',
                   'what_is_your_current_occupation':'ocupation',
                   'what_matters_most_to_you_in_choosing_a_course':'matters_most',
                   'asymmetrique_activity_score':'activity_score',
                   'asymmetrique_profile_score':'profile_score',
                   'a_free_copy_of_mastering_the_interview':'lead_magnet'},
          inplace=True)
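
To make the two renaming steps easier to follow, the sketch below imitates the effect of `clean_names` with plain pandas. It is only a rough approximation of what the janitor function does, `approx_clean_names` is a hypothetical helper, and the raw header strings are assumed rather than read from the file.

```python
import re
import pandas as pd

def approx_clean_names(df):
    """Rough stand-in for janitor.clean_names: lowercase, other characters to underscores."""
    def clean(name):
        name = re.sub(r'[^0-9a-zA-Z]+', '_', name.strip().lower())
        return name.strip('_')
    return df.rename(columns=clean)

# Assumed raw headers (not read from Leads.csv)
raw = pd.DataFrame(columns=['Lead Number', 'TotalVisits',
                            'How did you hear about X Education'])

cleaned_columns = list(approx_clean_names(raw).columns)
print(cleaned_columns)
```

The resulting names match the keys of the `rename` dictionary above, which is why that dictionary is written in lowercase with underscores.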

### Deleting records

#### By duplicated

In [45]:
df.drop_duplicates(inplace = True)

#### By EDA

In [46]:
df = df.loc[~((df.last_activity=='Email Bounced')|(df.last_notable_activity=='Email Bounced'))]

#### For x

Creating x by keeping only features included in the final list.

In [47]:
x = df[final_features].copy()

#### For y

Specifying the target.

In [48]:
target = 'converted'

Creating y.

In [49]:
y = df[target].copy()

## CREATING THE PIPELINE

### Instantiating Data Quality

#### Creating the function

In [14]:
x.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8914 entries, 0 to 9239
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   activity_score         4866 non-null   float64
 1   city                   7543 non-null   object 
 2   country                6519 non-null   object 
 3   do_not_call            8914 non-null   object 
 4   do_not_email           8914 non-null   object 
 5   hear_about             6834 non-null   object 
 6   last_activity          8811 non-null   object 
 7   last_notable_activity  8914 non-null   object 
 8   lead_magnet            8914 non-null   object 
 9   lead_origin            8914 non-null   object 
 10  matters_most           6342 non-null   object 
 11  ocupation              6353 non-null   object 
 12  page_views_per_visit   8791 non-null   float64
 13  profile_score          4866 non-null   float64
 14  source                 8883 non-null   object 
 15  spec

In [50]:
def data_quality(df):
    # Correcting feature types (astype returns a copy, so the input frame is left untouched)
    temp = df.astype({'total_visits':'Int64'})
    
    # Unifying repeated source categories
    # (assignment instead of inplace=True on a column, which is unreliable in recent pandas)
    temp['source'] = temp['source'].replace('google','Google')
    
    # Nulls imputation by mode
    def impute_mode(feature):
        return feature.fillna(feature.mode()[0])
    
    var_impute_mode = ['ocupation', 'country', 'lead_origin', 'do_not_email', 'do_not_call', 'last_notable_activity',
                       'lead_magnet']
    temp[var_impute_mode] = temp[var_impute_mode].apply(impute_mode)
    
    # Nulls imputation by value
    temp[['last_activity','matters_most']] = temp[['last_activity','matters_most']].fillna('Unknown')
    temp[['hear_about','specialization','city']] = temp[['hear_about','specialization','city']].fillna('Select')
    temp[['source']] = temp[['source']].fillna('Null')
    
    # Nulls imputation by median (cast to int for integer features)
    def impute_median(feature):
        if pd.api.types.is_integer_dtype(feature):
            return feature.fillna(int(feature.median()))
        else:
            return feature.fillna(feature.median())
    
    var_impute_median = ['total_visits','total_time_website','page_views_per_visit','activity_score','profile_score']
    temp[var_impute_median] = temp[var_impute_median].apply(impute_median)
    
    # Outliers - grouping atypical categories (relative frequency below the threshold)
    def group_atypical_categories(var, threshold=0.01, group_name='Others'):
        frequencies = var.value_counts(normalize=True)
        below_threshold = frequencies.loc[frequencies < threshold].index
        return np.where(var.isin(below_threshold), group_name, var)
    
    var_group = ['lead_origin', 'source', 'last_activity', 'last_notable_activity', 'country', 'specialization',
                 'ocupation', 'hear_about', 'matters_most']
    
    for feature in var_group:
        temp[feature] = group_atypical_categories(temp[feature], threshold=0.01, group_name='Others')
    
    # Merging the original 'Other' category into the new 'Others' group
    temp['hear_about'] = temp['hear_about'].replace('Other','Others')
    
    # Outliers - ad hoc winsorisation
    temp['total_visits'] = temp['total_visits'].clip(0,50)
    temp['page_views_per_visit'] = temp['page_views_per_visit'].clip(0,20)
    
    return temp
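
The two outlier treatments inside `data_quality` can be tried on toy data. The sketch below mirrors the nested grouping helper as a standalone function (`group_rare` is an illustrative name) and applies the same clipping bounds to a made-up column; none of the values come from the project data.

```python
import numpy as np
import pandas as pd

def group_rare(var, threshold=0.01, group_name='Others'):
    # Relative frequency of each category
    frequencies = var.value_counts(normalize=True)
    # Categories whose share falls below the threshold
    rare = frequencies[frequencies < threshold].index
    return pd.Series(np.where(var.isin(rare), group_name, var), index=var.index)

# Toy column: 'Bing' appears once in 200 rows (0.5%), below the 1% threshold
source = pd.Series(['Google'] * 150 + ['Olark Chat'] * 49 + ['Bing'])
grouped_counts = group_rare(source).value_counts().to_dict()
print(grouped_counts)

# Ad hoc winsorisation: values outside [0, 50] are pulled to the nearest bound
visits = pd.Series([0, 3, 251])
clipped = visits.clip(0, 50).tolist()
print(clipped)
```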

#### Turning it into a transformer

In [51]:
do_data_quality = FunctionTransformer(data_quality)
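
One property worth keeping in mind is that `FunctionTransformer` is stateless: `fit` learns nothing, and the wrapped function runs again on whatever data reaches `transform`. Since `data_quality` computes modes, medians and category frequencies from the frame it receives, those statistics come from each scoring batch rather than from the training data. The sketch below illustrates this with a hypothetical stand-in function (`fill_with_median`) and made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def fill_with_median(df):
    # Hypothetical stand-in for data_quality: imputes nulls with the median of the frame received
    return df.fillna(df.median())

ft = FunctionTransformer(fill_with_median)

train = pd.DataFrame({'v': [1.0, 2.0, 3.0, np.nan]})   # median of known values: 2.0
new = pd.DataFrame({'v': [10.0, 20.0, np.nan]})        # median of known values: 15.0

ft.fit(train)                     # nothing is stored from the training data
filled_new = ft.transform(new)    # the median is recomputed on the new batch

imputed_value = filled_new['v'].iloc[2]
print(imputed_value)
```

The imputed value is the median of the new batch, not of the training frame. On very small scoring batches, the imputed values and the categories grouped as rare may therefore differ from those seen during training.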

### Instantiating Feature Transformation

In [52]:
# One Hot Encoding
var_ohe = ['lead_origin', 'source', 'do_not_email', 'do_not_call', 'last_activity', 'last_notable_activity', 
           'country', 'city', 'specialization', 'ocupation', 'hear_about', 'matters_most', 'lead_magnet']
ohe = OneHotEncoder(handle_unknown='ignore')

# Min-max scaling
var_mms = ['total_visits','total_time_website','page_views_per_visit','activity_score','profile_score']
mms = MinMaxScaler()
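
The behaviour of these two transformers on values not seen during fitting can be checked in isolation. A minimal sketch with made-up values: with `handle_unknown='ignore'` an unseen category is encoded as a row of zeros, and `MinMaxScaler` with its default settings does not clip values outside the fitted range.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# One hot encoding: fit on two categories, then transform a category never seen in fit
ohe_demo = OneHotEncoder(handle_unknown='ignore')
ohe_demo.fit(np.array([['Google'], ['Olark Chat']], dtype=object))
unseen = ohe_demo.transform(np.array([['Bing']], dtype=object)).toarray()
print(unseen)

# Min-max scaling: fit on the range [0, 10], then transform a value above that range
mms_demo = MinMaxScaler()
mms_demo.fit(np.array([[0.0], [10.0]]))
scaled = mms_demo.transform(np.array([[5.0], [20.0]])).ravel()
print(scaled)
```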

### Creating the preprocessing pipe

#### Creating the column transformer

In [53]:
ct = make_column_transformer(
    (ohe, var_ohe),
    (mms, var_mms),
    remainder='drop')

#### Creating the pre-processing pipeline

In [54]:
pipe_prepro = make_pipeline(do_data_quality, 
                            ct)

### Instantiating the model

#### Instantiating the model

In [55]:
model = LogisticRegression(solver='saga',
                           C=0.9,
                           penalty='l2',
                           max_iter=1000,
                           n_jobs=-1)

#### Creating the final training pipe

In [56]:
pipe_training = make_pipeline(pipe_prepro, model)
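
Because `make_pipeline` names each step after its lowercase class name, the nested structure can be inspected through `named_steps`. A minimal sketch with placeholder components, where an identity `FunctionTransformer` and a `MinMaxScaler` stand in for the real pre-processing steps:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Placeholder inner pipe standing in for the pre-processing pipeline
inner = make_pipeline(FunctionTransformer(), MinMaxScaler())
outer = make_pipeline(inner, LogisticRegression())

step_names = list(outer.named_steps)
inner_names = list(outer.named_steps['pipeline'].named_steps)

print(step_names)
print(inner_names)
```

With the pipes defined in this notebook, the same access pattern would reach the estimator through `pipe_training.named_steps['logisticregression']`.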

#### Saving the final training pipe

In [63]:
name_pipe_training = 'pipe_training.pickle'

path_pipe_training = project_path + '/04_Models/' + name_pipe_training

with open(path_pipe_training, mode='wb') as file:
   cloudpickle.dump(pipe_training, file)

#### Training the final execution pipe

In [57]:
pipe_execution = pipe_training.fit(x,y)
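
Note that `fit` returns the estimator itself, so after this cell `pipe_execution` and `pipe_training` refer to the same fitted object. If an unfitted copy is wanted alongside the fitted one, `sklearn.base.clone` is one option. A minimal sketch with a toy pipeline and made-up data:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
fitted = pipe.fit(X, y)

# fit returns self: both names point to the same object
same_object = fitted is pipe

# clone builds a new, unfitted estimator with the same parameters
unfitted_copy = clone(pipe)
copy_is_fitted = hasattr(unfitted_copy.named_steps['logisticregression'], 'coef_')

print(same_object, copy_is_fitted)
```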

## SAVING THE PIPE

### Naming the final execution pipe

In [65]:
name_pipe_execution = 'pipe_execution.pickle'

### Saving the final execution pipe

In [66]:
path_pipe_ejecucion = project_path + '/04_Models/' + name_pipe_execution

with open(path_pipe_ejecucion, mode='wb') as file:
   cloudpickle.dump(pipe_execution, file)
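
Before relying on a saved pipe, it is worth checking that it can be loaded back and gives the same predictions. The sketch below shows such a round trip using the standard `pickle` module, a temporary folder and a toy pipeline, all of which are assumptions made for illustration. The notebook itself writes with `cloudpickle.dump`, whose files are, according to the cloudpickle documentation, read back with `pickle.load`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
fitted = make_pipeline(MinMaxScaler(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'pipe_demo.pickle'   # illustrative file name
    with open(path, mode='wb') as file:
        pickle.dump(fitted, file)
    with open(path, mode='rb') as file:
        reloaded = pickle.load(file)

# The reloaded pipe should reproduce the original predictions
same_predictions = np.allclose(fitted.predict_proba(X), reloaded.predict_proba(X))
print(same_predictions)
```

The loading environment needs versions of scikit-learn and the other libraries that are compatible with those used when saving.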

## UNSEEN DATA MODEL PERFORMANCE

The model is tested on previously unseen data, using the validation dataset reserved during the set-up phase at the beginning of the project. This also checks that every transformation and calculation works on new raw data, as it must once the model is put into production.

In [58]:
# Loading validation data
val_data_file_name = 'validation.csv'
val_full_path = project_path + '/02_Data/02_Validation/' + val_data_file_name
df_test = pd.read_csv(val_full_path,sep=',')

# Adapting validation data structure
df_test = clean_names(df_test) \
           .rename(columns={'lead_number':'id',
                            'lead_source':'source',
                            'totalvisits':'total_visits',
                            'total_time_spent_on_website':'total_time_website',
                            'how_did_you_hear_about_x_education':'hear_about',
                            'what_is_your_current_occupation':'ocupation',
                            'what_matters_most_to_you_in_choosing_a_course':'matters_most',
                            'asymmetrique_activity_score':'activity_score',
                            'asymmetrique_profile_score':'profile_score',
                            'a_free_copy_of_mastering_the_interview':'lead_magnet'}) \
           .drop_duplicates() \
           .set_index('id')
df_test = df_test.loc[~((df_test.last_activity=='Email Bounced')|(df_test.last_notable_activity=='Email Bounced'))]

final_features = ['activity_score','city','country','do_not_call','do_not_email','hear_about','last_activity',                  
                  'last_notable_activity','lead_magnet','lead_origin','matters_most','ocupation',
                  'page_views_per_visit','profile_score','source','specialization','total_time_website','total_visits']
# x and y            
x_test = df_test[final_features]
y_test = df_test['converted']

# Making predictions
pred_test = pipe_execution.predict_proba(x_test)[:,1]

# Checking validation metrics
from sklearn.metrics import roc_auc_score
print('ROC_AUC_score (unseen data):', roc_auc_score(y_test, pred_test))

ROC_AUC_score (unseen data): 0.9147799832878742
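
As a reminder of what this metric measures, ROC AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below checks this on four made-up scores; none of the values come from the validation set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Score difference for every (positive, negative) pair
diff = pos[:, None] - neg[None, :]

# Correctly ordered pairs count 1, ties count 0.5
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
library_auc = roc_auc_score(y_true, scores)

print(pairwise_auc, library_auc)
```

Three of the four pairs are ordered correctly here, so both calculations give 0.75.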
