# Feature Exploration

Given the [Kaggle Dataset on Lead Scoring](https://www.kaggle.com/datasets/amritachatterjee09/lead-scoring-dataset?select=Lead+Scoring.csv), I will first do some general EDA to discover potential trends within the data to determine possible next steps.

From the Kaggle site, X Education is an education company selling online courses to industry professionals. The typical conversion rate is ~30%. The goal is to increase the conversion rate, with the CEO hoping to achieve a lead conversion rate of about 80%. Our task is to assign a lead score to each lead such that customers with a higher lead score *h* have a higher chance of converting (i.e., create a propensity score).

In [6]:
# Modules
import numpy as np
import pandas as pd
from pathlib import Path
import os
from ydata_profiling import ProfileReport

In [7]:
# Pull in data
raw_data = pd.read_csv(Path(os.getcwd()).parents[0].joinpath("data", "Lead Scoring.csv"))

# General EDA

Clearly, the goal of this exercise is to increase the conversion rate, making the `Converted` column our target variable.

To get a clear picture of how variables might relate to one another, I will be using the `ProfileReport` function to create an HTML page summarizing distributions, correlations, and other measures.

In [None]:
# Use ProfileReport to create HTML doc of general summary
profile = ProfileReport(raw_data, title = "Profiling Report")
profile.to_file("feature_eda.html")

In [None]:
# Or run as follows:
profile.to_notebook_iframe()

In [13]:
print(f"{len(raw_data.columns)} Columns, Names: {raw_data.columns}")

37 Columns, Names: Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about X Education',
       'What is your current occupation',
       'What matters most to you in choosing a course', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Tags', 'Lead Quality',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'Asymmetrique Activity Index',
       'Asymmetrique Profile Index', 'Asymmetrique Activity Score',
       'Asymmetrique Profile Score',
       'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype='object')

## `feature_eda.html` Analysis

The `ProfileReport` gives some great insights as a first pass. Our dataset contains 37 variables and 9240 observations. We see that `Prospect ID` and `Lead Number` are both keys for the dataset. The `Do Not Call` variable has only 2 true instances, making it an easy variable to remove in our modeling process. Our target variable `Converted` shows that about 38.5% of people in this dataset convert, somewhat higher than the ~30% mentioned in the prompt.

### Questions

- What does `Select` mean? Guessing that's the default answer asking someone to select an option?
  - If so, probably can treat as `Missing` as well...
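If that guess holds, the recoding is a one-line replacement. Below is a minimal sketch on a toy frame (the column values are illustrative, not taken from the real data), assuming `Select` really is an unanswered dropdown default:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data; values are illustrative only.
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", None, "Select"],
    "City": ["Mumbai", "Select", "Other Cities", None],
})

# Assumption: "Select" is a placeholder, so treat it the same as a missing value.
toy_clean = toy.replace("Select", np.nan)

# Share of missing values per column after the replacement.
missing_share = toy_clean.isna().mean()
print(missing_share)
```

Note that the missing percentages quoted from the report would go up after this step, since `Select` rows would then count as missing.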

### `Converted` Variable Info

- 38.5% converted / 61.5% did not
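The split can be double-checked directly in pandas. This is a small sketch on a toy series, assuming `Converted` is coded as 0/1 (the toy values are made up); on the real data the same two calls would be applied to `raw_data["Converted"]`:

```python
import pandas as pd

# Toy target column standing in for raw_data["Converted"] (assumed 1 = converted).
converted = pd.Series([1, 0, 0, 1, 0], name="Converted")

# With a 0/1 target, the mean is the conversion rate.
conversion_rate = converted.mean()

# value_counts(normalize=True) returns both class shares, which must sum to 1.
class_shares = converted.value_counts(normalize=True)

print(f"Converted: {conversion_rate:.1%}")
print(class_shares)
```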

### Variables w/ High Correlation

- `Total Time Spent on Website` (0.426)
- `Asymmetrique Activity Score` (0.419)
- `Lead Quality` (0.659)
- `Tags` (0.931)
- `Last Activity` (0.396)
- `Lead Source` (0.336)
- `Lead Origin` (0.325)
- `What is your current occupation` (0.302)
- `Lead Profile` (0.379)
- `Last Notable Activity` (0.38)
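Several of these variables are categorical, so the numbers above are association measures rather than plain Pearson correlations. I have not confirmed exactly which measure the report used for each pair; Cramér's V is one common choice, and the sketch below (a hypothetical helper `cramers_v`, without bias correction, run on toy data) shows how such a value could be spot-checked by hand:

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V (no bias correction) between two categorical series."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")

# Toy data: the category perfectly determines the target.
toy = pd.DataFrame({
    "Tags": ["a", "a", "b", "b", "a", "b"],
    "Converted": [1, 1, 0, 0, 1, 0],
})
v_perfect = cramers_v(toy["Tags"], toy["Converted"])

# Toy data: the category carries no information about the target.
v_none = cramers_v(pd.Series(["a", "a", "b", "b"]), pd.Series([0, 1, 0, 1]))
print(v_perfect, v_none)
```

A value near 1, such as the one reported for `Tags`, can also be a sign that the variable is recorded after the outcome is known, which would be worth checking before modeling.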

### Variables Maybe Needing Preprocessing (Grouping)

- `Country`: 26.6% missing, heavily favoring India, and all other countries are under 1%...
- `City` (?): Missing 15.4%, but seems to only apply to people in India
- `Last Notable Activity`: 16 total groups
- `Lead Profile`: Missing 29.3% of data
- `Lead Source`: Need to apply some data cleaning (e.g., `Google` vs. `google` are treated as different categories)
- `Lead Quality`: 51.6% missing!
- `Tags`: 36.3% missing, many categories
- `What is your current occupation`: 29.1% missing, majority are `unemployed`, and other categories (`Working Professional` and `Student`) make up 10%; the remaining categories have very few observations
- `How did you hear about X Education`: 10 categories, 23.9% missing, most are `Select`
- `Specialization`: 15.6% missing, 21% `Select`, >20 categories
- `Last Activity`: >15 categories. However, many seem relevant (unsub, unreachable) - need to do more analysis
- `Page Views Per Visit`(?): many 0s, maybe interesting to see relationships between 0 vs. non-0, or if numerical better
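Two of the recurring fixes in this list are normalizing inconsistent labels (as with `Lead Source`) and collapsing rare categories (as with `Country` or `Tags`). The sketch below shows one possible way to do both on a toy series. The helper `group_rare`, the `"Other"` label, and the threshold are my own illustrative choices, not decisions made yet for the real data:

```python
import pandas as pd

def group_rare(col: pd.Series, min_share: float = 0.01, other: str = "Other") -> pd.Series:
    """Replace categories whose share of non-missing rows is below min_share."""
    shares = col.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare (missing values stay missing).
    return col.where(~col.isin(rare), other)

# Toy values standing in for a column like `Lead Source`.
toy_source = pd.Series(["Google", "google", " Google ", "Direct Traffic", "bing"])

# Normalize whitespace and case so "Google" and "google" become one category.
normalized = toy_source.str.strip().str.lower()

# Collapse anything under 25% of rows (a large threshold, only because the toy data is tiny).
grouped = group_rare(normalized, min_share=0.25)
print(grouped.tolist())
```

Lowercasing everything is the bluntest option; mapping known variants to a canonical spelling would preserve the original capitalization.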



### Variables to Remove

In general, we can exclude 13 variables (shown below + why).

- `I agree to pay the amount through cheque`: 1 value
- `Get updates on DM Content`: 1 value
- `Update me on Supply Chain Content`: 1 value
- `Receive More Updates About Our Courses`: 1 value
- `Through Recommendations`: only 7 Trues
- `Digital Advertisement`: only 4 Trues
- `Newspaper`: only 1 True
- `X Education Forums`: only 1 True
- `Newspaper Article`: only 2 Trues
- `Magazine`: only 1 True
- `Search`: only 14 Trues
- `What matters most to you in choosing a course`: only 3 categories, but 2 have under 5 total obs
- `Do Not Call`: only 2 Trues
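The list above was put together by reading the report, but a similar screen can be run in code. This is a rough sketch with a hypothetical helper `low_information_columns` and an arbitrary 1% threshold, shown on toy data; it is not guaranteed to reproduce the exact 13 columns on the real dataset:

```python
import pandas as pd

def low_information_columns(df: pd.DataFrame, max_minority: float = 0.01) -> list:
    """Columns where everything except the most common value covers at most max_minority of rows."""
    flagged = []
    for name in df.columns:
        # value_counts sorts by frequency, so the first entry is the top value's share.
        shares = df[name].value_counts(normalize=True, dropna=False)
        if shares.empty or 1.0 - shares.iloc[0] <= max_minority:
            flagged.append(name)
    return flagged

# Toy frame: one constant column, one near-constant column, one balanced column.
toy = pd.DataFrame({
    "Magazine": ["No"] * 200,
    "Newspaper": ["No"] * 199 + ["Yes"],
    "Lead Origin": ["API"] * 100 + ["Landing Page Submission"] * 100,
})
flagged = low_information_columns(toy)
print(flagged)
```

Counting missing values as their own level (`dropna=False`) is a deliberate choice here, so a column that is mostly missing is not mistaken for a constant one.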

In [21]:
data = raw_data.copy()
data = data.drop(["I agree to pay the amount through cheque",
                  "Get updates on DM Content",
                  "Update me on Supply Chain Content",
                  "Receive More Updates About Our Courses",
                  "Through Recommendations",
                  "Digital Advertisement",
                  "Newspaper",
                  "X Education Forums",
                  "Newspaper Article",
                  "Magazine",
                  "Search",
                  "What matters most to you in choosing a course",
                  "Do Not Call"], axis=1)

## Variable Preprocessing

Now, we'll be looking at the variables mentioned earlier that probably need preprocessing.

### 