<a href="https://colab.research.google.com/github/harnalashok/classification/blob/main/lead_scoirngAutoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 24th May, 2021
# AutoML:
# Ref: https://evalml.alteryx.com/en/stable/demos/lead_scoring.html
# Lead scoring:
#      https://towardsdatascience.com/a-true-end-to-end-ml-example-lead-scoring-f5b52e9a3c80
# Kaggle 
#      https://www.kaggle.com/ashydv/leads-dataset

# Building a Lead Scoring Model with EvalML

In this example, we will build an optimized lead scoring model using EvalML. To optimize the pipeline, we will set up an objective function that maximizes the revenue generated by true positives while accounting for the cost of false positives. At the end of this demo, we also show that training with the right objective performs significantly better than optimizing a generic machine learning metric like AUC.

## Install packages

EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

Combined with Featuretools and Compose, EvalML can be used to create end-to-end supervised machine learning solutions.

In [1]:
# Install evalml. Some of the packages installed are:
# imbalanced-learn, scikit-optimize, plotly, category-encoders,
# graphviz, lightgbm, shap, statsmodels, catboost, scikit-learn,
# featuretools, matplotlib, nltk, xgboost,
# pmdarima (ARIMA models; originally named pyramid-arima),
# kaleido (static image export for plotly figures)

!pip install evalml
! pip install dask[dataframe] --upgrade 

Collecting evalml
  Downloading https://files.pythonhosted.org/packages/56/f3/236e32338f9d51afd65c2418bdcae3f0ed97f2e5f980a0e5784c18ebb579/evalml-0.24.1-py3-none-any.whl (6.2MB)
Collecting category-encoders>=2.0.0
  Downloading https://files.pythonhosted.org/packages/44/57/fcef41c248701ee62e8325026b90c432adea35555cbc870aff9cfba23727/category_encoders-2.2.2-py2.py3-none-any.whl (80kB)
Collecting graphviz>=0.13
  Downloading https://files.pythonhosted.org/packages/86/86/89ba50ba65928001d3161f23bfa03945ed18ea13a1d1d44a772ff1fa4e7a/graphviz-0.16-py2.py3-none-any.whl
Collecting pmdarima==1.8.0
  Downloading https://files.pythonhosted.org/packages/e4/a8/bdf15174e35d072e145d16388b1d3bc7605b752610170cb022a290411427/pmdarima-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (1.5MB)
Collecting colorama>=0.

Collecting dask[dataframe]
  Downloading https://files.pythonhosted.org/packages/b9/a0/0905a1112dc3801304348ac0af0e641a2fbe12fe163ab5c3a43b2e88092d/dask-2021.5.0-py3-none-any.whl (960kB)
Collecting fsspec>=0.6.0
  Downloading https://files.pythonhosted.org/packages/bc/52/816d1a3a599176057bf29dfacb1f8fadb61d35fbd96cb1bab4aaa7df83c0/fsspec-2021.5.0-py3-none-any.whl (111kB)
Collecting partd>=0.3.10
  Downloading https://files.pythonhosted.org/packages/41/94/360258a68b55f47859d72b2d0b2b3cfe0ca4fbbcb81b78812bd00ae86b7c/partd-1.2.0-py3-none-any.whl
Collecting locket
  Downloading https://files.pythonhosted.org/packages/50/b8/e789e45b9b9c2db75e9d9e6ceb022c8d1d7e49b2c085ce8c05600f90a96b/locket-0.2.1-py2.py3-none-any.whl
ERROR: distributed 2021.5.0 has requirement cloudpickle>=1.5.0, but you'll have cloudpickle 1.3.0 which is incompatible.
Installing co

In [11]:
import evalml
from evalml import AutoMLSearch
from evalml.objectives import LeadScoring

from urllib.request import urlopen
import pandas as pd
import woodwork as ww

In [18]:
# Display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


## Configure LeadScoring

To optimize the pipelines toward the specific business needs of this model, you can set your own assumptions for how much value is gained through true positives and how much is lost through false positives. These parameters are:

> true_positives - dollar amount to be gained with a successful lead
>
> false_positives - dollar amount to be lost with an unsuccessful lead, expressed as a negative number

Using these parameters, EvalML builds a pipeline that will maximize the amount of revenue per lead generated.

In [12]:
lead_scoring_objective = LeadScoring(
    true_positives=1000,   # dollars gained per successful lead
    false_positives=-10,   # dollars lost per unsuccessful lead (negative)
)
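
To make the objective concrete, the sketch below recomputes a profit-per-lead score in plain Python. It is an illustration under an assumed payoff model: each flagged lead that converts earns `true_positives`, each flagged lead that does not convert costs `false_positives`, and the total is averaged over all leads. The helper name `profit_per_lead` and the toy labels are hypothetical, and this is not EvalML's own code.

```python
def profit_per_lead(y_true, y_predicted, true_positives=1000, false_positives=-10):
    """Average dollar value per lead under the assumed payoff model."""
    if len(y_true) != len(y_predicted):
        raise ValueError("y_true and y_predicted must have the same length")
    if not y_true:
        raise ValueError("need at least one lead")

    pairs = list(zip(y_true, y_predicted))
    # Flagged leads that really converted
    tp_count = sum(1 for actual, predicted in pairs if predicted and actual)
    # Flagged leads that did not convert
    fp_count = sum(1 for actual, predicted in pairs if predicted and not actual)

    profit = true_positives * tp_count + false_positives * fp_count
    return profit / len(y_true)


# Toy labels (hypothetical): two of five leads actually converted.
y_true = [True, False, False, True, False]

# A cautious model flags two leads: one hit, one miss.
y_pred = [True, True, False, False, False]
score = profit_per_lead(y_true, y_pred)

# Flagging every lead catches both conversions at the cost of three misses.
all_flagged = profit_per_lead(y_true, [True] * len(y_true))

print(score, all_flagged)
```

With a false positive costing far less than a true positive earns, flagging more leads scores higher here, which is the kind of trade-off a generic metric does not capture.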

## Dataset

We will use a dataset detailing each customer’s job, country, state, zip, online action, the dollar amount of that action, and whether they were a successful lead. The data is split across three CSV files: customers, interactions, and previous leads.

In [14]:
customers_data = urlopen('https://featurelabs-static.s3.amazonaws.com/lead_scoring_ml_apps/customers.csv')
interactions_data = urlopen('https://featurelabs-static.s3.amazonaws.com/lead_scoring_ml_apps/interactions.csv')
leads_data = urlopen('https://featurelabs-static.s3.amazonaws.com/lead_scoring_ml_apps/previous_leads.csv')

In [15]:
#print(customers_data.read())
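
Leaving the `print(customers_data.read())` line commented out matters: a response object from `urlopen` behaves like a one-shot byte stream, so reading it for inspection would leave nothing for `pd.read_csv` to parse afterwards. A minimal sketch of that behavior, using an in-memory `io.BytesIO` buffer as a stand-in for the network response (an assumption made so the example runs offline, with made-up CSV content):

```python
import io

# In-memory stand-in for the object returned by urlopen (assumption).
stream = io.BytesIO(b"customer_id,label\n1,True\n2,False\n")

first_read = stream.read()   # consumes the whole stream
second_read = stream.read()  # nothing is left to read

print(first_read)
print(second_read)
```

To peek at the raw bytes and still load the table, one option is to call `urlopen` again to get a fresh response before passing it to `pd.read_csv`.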

In [16]:
customers = pd.read_csv(customers_data)
interactions = pd.read_csv(interactions_data)
leads = pd.read_csv(leads_data)

In [19]:
customers.shape    # (1000, 11)
customers.head()

(1000, 11)

| | customer_id | date_registered | birthday | job | phone | email | country | state | zip | owner | company |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 460429349361 | 2017-08-10 11:04:45 | | Engineer, mining | +1-283-990-1507x7713 | christian92@gmail.com | | NY | 60091.0 | Kathleen Hawkins MD | 618541400000.0 |
| 1 | 392559384176 | 2017-08-10 23:13:51 | | Arts administrator | 400.808.2148 | jenniferdavis@carter-ellis.biz | US | CA | | John Edwards | 833099000000.0 |
| 2 | 674438580580 | 2017-08-11 08:35:32 | | Psychologist, forensic | (299)543-9962 | wwelch@lee.com | US | CA | | John Edwards | 211482700000.0 |
| 3 | 364017777045 | 2017-08-11 10:15:37 | | Air cabin crew | +1-213-455-5314 | xjones@smith.net | US | | 60091.0 | Erica Anderson | |
| 4 | 551397602202 | 2017-08-11 13:33:23 | | Press sub | 619.795.6618 | walterromero@gmail.com | US | | | Kathleen Hawkins MD | |


In [21]:
interactions.shape  # (5625, 7)
interactions.head()

(5625, 7)

| | id | time | customer_id | action | amount | session | referrer |
|---|---|---|---|---|---|---|---|
| 0 | 807870369974 | 2017-08-17 18:18:45 | 676332384432 | contact_form | | 531094009776 | https://www.twitter.com |
| 1 | 889936071815 | 2017-08-18 18:58:39 | 676332384432 | contact_form | | 531094009776 | https://www.twitter.com |
| 2 | 890016715098 | 2017-08-22 15:54:26 | 756895858030 | purchase | 50.54 | 355429835992 | |
| 3 | 539965059120 | 2017-08-24 23:25:33 | 676332384432 | page_view | | 531094009776 | |
| 4 | 496056352403 | 2017-08-25 09:32:25 | 676332384432 | page_view | | 531094009776 | https://medium.com/article |


In [23]:
leads.shape  # (584, 3)
leads.head()

(584, 3)

| | customer_id | time | label |
|---|---|---|---|
| 0 | 961424493033 | 2017-09-17 08:20:09 | False |
| 1 | 739795366381 | 2017-09-20 18:14:56 | False |
| 2 | 433081973416 | 2017-09-26 12:14:42 | False |
| 3 | 178336564320 | 2017-09-29 12:48:25 | False |
| 4 | 203924762965 | 2017-10-02 02:57:06 | False |
