# ExtraaLearn Project

## Context

The EdTech industry has grown rapidly over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The shift to online delivery has contributed substantially to this growth. Features such as easy information sharing, a personalized learning experience, and transparent assessment have made online education a preferred alternative to traditional education.

In the wake of Covid-19, the online education sector has grown rapidly and is attracting many new customers. This growth has brought many new companies into the industry. With digital marketing resources widely available and easy to use, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. EdTech companies obtain leads from various sources, such as:

* The customer interacts with the marketing front on social media or other online platforms.
* The customer browses the website/app and downloads the brochure.
* The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. To do so, a representative from the organization contacts the lead by phone or email to share further details.

## Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads generated regularly, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
* Analyze the data and build an ML model to help identify which leads are more likely to convert to paid customers.
* Find the factors driving the lead conversion process.
* Create a profile of the leads that are likely to convert.


## Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.


**Data Dictionary**
* ID: ID of the lead
* age: Age of the lead
* current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'
* first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website', 'Mobile App'
* profile_completed: Percentage of the profile that the lead has filled in on the website/mobile app. Values include Low (0-50%), Medium (50-75%), High (75-100%)
* website_visits: How many times has a lead visited the website
* time_spent_on_website: Total time spent on the website
* page_views_per_visit: Average number of pages on the website viewed during the visits.
* last_activity: Last interaction between the lead and ExtraaLearn. 
    * Email Activity: Sought details about the program through email, representative shared information with the lead (such as the program brochure), etc.
    * Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.
    * Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.

* print_media_type1: Flag indicating whether the lead had seen an ExtraaLearn ad in a newspaper.
* print_media_type2: Flag indicating whether the lead had seen an ExtraaLearn ad in a magazine.
* digital_media: Flag indicating whether the lead had seen an ExtraaLearn ad on digital platforms.
* educational_channels: Flag indicating whether the lead had heard about ExtraaLearn through educational channels such as online forums, discussion threads, educational websites, etc.
* referral: Flag indicating whether the lead had heard about ExtraaLearn through a referral.
* status: Flag indicating whether the lead was converted to a paid customer or not.

## Importing necessary libraries and data

In [29]:
import pandas as pd
import numpy as np

## Data Overview

- Observations
- Sanity checks

In [49]:
# initialize a dataframe from the csv data
df = pd.read_csv('Data/ExtraaLearn.csv')

In [50]:
# initial inspection of the data
df.head()

|   | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
| 1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.32 | Website Activity | No | No | No | Yes | No | 0 |
| 2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
| 3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
| 4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |


In [51]:
# look at the columns and datatypes for each column 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4612 non-null   object 
 1   age                    4612 non-null   int64  
 2   current_occupation     4612 non-null   object 
 3   first_interaction      4612 non-null   object 
 4   profile_completed      4612 non-null   object 
 5   website_visits         4612 non-null   int64  
 6   time_spent_on_website  4612 non-null   int64  
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object 
 9   print_media_type1      4612 non-null   object 
 10  print_media_type2      4612 non-null   object 
 11  digital_media          4612 non-null   object 
 12  educational_channels   4612 non-null   object 
 13  referral               4612 non-null   object 
 14  status                 4612 non-null   int64  
dtypes: f

In [52]:
# function to inspect the data type, the number of distinct values and the values themselves for each column

def inspect_variable_values(data: pd.DataFrame):
    
    # iterate through each column
    for i, col in enumerate(data.columns):
        
        # get the distinct values in the column (unique() returns a NumPy array here, not a Series)
        vals: np.ndarray = data[col].unique()
        
        # get the data type of the variables
        data_type = vals.dtype
        
        # print the column name
        print(f'{i + 1}. {col} \n')
        
        # print the columns data type
        print(f'Data type: {data_type}')
        
        # print the number of values the variable takes on
        print("Number of values:", len(vals))
        
        # print the values inside the column
        
        # if the variable takes on fewer than 20 distinct values, print all of them
        if len(vals) < 20: 
            print("Values taken on by the variable:",vals)
        else:
            # otherwise (20 or more distinct values), print the range of values if the variable is numeric
            # np.issubdtype covers any integer or float width, not only int64 and float64
            if np.issubdtype(data_type, np.number):
                print("Values taken on by the variable:",f"Range: {vals.min()} - {vals.max()}")
            else:
                # otherwise the values are categorical, so print only the first 20 values
                print("Values taken on by the variable:", vals[:20])
        
        # new line between columns        
        print()

In [53]:
# call to the inspection function 
inspect_variable_values(df)

1. ID 

Data type: object
Number of values: 4612
Values taken on by the variable: ['EXT001' 'EXT002' 'EXT003' 'EXT004' 'EXT005' 'EXT006' 'EXT007' 'EXT008'
 'EXT009' 'EXT010' 'EXT011' 'EXT012' 'EXT013' 'EXT014' 'EXT015' 'EXT016'
 'EXT017' 'EXT018' 'EXT019' 'EXT020']

2. age 

Data type: int64
Number of values: 46
Values taken on by the variable: Range: 18 - 63

3. current_occupation 

Data type: object
Number of values: 3
Values taken on by the variable: ['Unemployed' 'Professional' 'Student']

4. first_interaction 

Data type: object
Number of values: 2
Values taken on by the variable: ['Website' 'Mobile App']

5. profile_completed 

Data type: object
Number of values: 3
Values taken on by the variable: ['High' 'Medium' 'Low']

6. website_visits 

Data type: int64
Number of values: 27
Values taken on by the variable: Range: 0 - 30

7. time_spent_on_website 

Data type: int64
Number of values: 1623
Values taken on by the variable: Range: 0 - 2537

8. page_views_per_visit 

Data type: fl

### Observations:

The data consists of 4612 rows and 15 columns (variables). None of the columns contains null values.

The set of features consists of the following numerical, categorical and binary variables:

Numerical Variables:

1. Age, an integer ranging from 18 to 63.
2. Number of website visits, an integer ranging from 0 to 30.
3. Time spent on the website, an integer (presumably seconds; the data dictionary does not state the unit) ranging from 0 to 2537.
4. Page views per visit, a floating point number.

Categorical Variables:

1. Current Occupation, Values: Unemployed, Professional, Student
2. First Interaction, Values: Website, Mobile App
3. Percentage Profile Completed, Values: High, Medium, Low
4. Last Activity, Values: Website Activity, Email Activity, Phone Activity

Binary Variables:

1. Print Media Type 1
2. Print Media Type 2
3. Digital Media
4. Educational Channels
5. Referral

The target (response) variable is the binary 'status' variable, which indicates whether the lead was converted to a paying customer or not.

The ID variable is unique to each lead and is therefore irrelevant to the analysis. 
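
The observations above (no null values, one unique ID per lead) can be verified directly rather than read off the printed output. The sketch below is a minimal example of such sanity checks. The helper name `sanity_report` is hypothetical, and the small `sample` frame is hand-built from the first rows shown by `df.head()` so that the snippet runs without the CSV file.

```python
import pandas as pd


def sanity_report(data: pd.DataFrame, id_col: str = "ID") -> dict:
    """Collect basic sanity checks: shape, missing values, duplicates, ID uniqueness."""
    return {
        "n_rows": len(data),
        "n_cols": data.shape[1],
        # total number of missing cells across the whole frame
        "n_missing": int(data.isnull().sum().sum()),
        # rows that are exact copies of an earlier row
        "n_duplicate_rows": int(data.duplicated().sum()),
        # True when every row carries its own ID value
        "id_is_unique": bool(data[id_col].is_unique),
    }


# illustrative sample only (assumption): a few columns from the first rows of df.head()
sample = pd.DataFrame({
    "ID": ["EXT001", "EXT002", "EXT003"],
    "age": [57, 56, 52],
    "current_occupation": ["Unemployed", "Professional", "Professional"],
    "status": [1, 0, 0],
})

report = sanity_report(sample)
print(report)
```

On the full data the same helper would be called as `sanity_report(df)`, before the `ID` column is dropped.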

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

In [54]:
# first drop the irrelevant column (id)
df = df.drop(columns = 'ID')

In [55]:
# check that the column has been dropped
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   age                    4612 non-null   int64  
 1   current_occupation     4612 non-null   object 
 2   first_interaction      4612 non-null   object 
 3   profile_completed      4612 non-null   object 
 4   website_visits         4612 non-null   int64  
 5   time_spent_on_website  4612 non-null   int64  
 6   page_views_per_visit   4612 non-null   float64
 7   last_activity          4612 non-null   object 
 8   print_media_type1      4612 non-null   object 
 9   print_media_type2      4612 non-null   object 
 10  digital_media          4612 non-null   object 
 11  educational_channels   4612 non-null   object 
 12  referral               4612 non-null   object 
 13  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(9)
memory usage: 504.6+ 

In [60]:
# list of column names for numerical variables
numerical_vars: list = []

# list of column names for categorical variables
categorical_vars: list = []

# list of column names for binary variables
binary_vars: list = []

In [61]:
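# loop through the feature columns and add each column name to the appropriate list
# the target `status` is excluded up front, so it never has to be removed from a list afterwards

for col in df.columns.drop("status"):
    # numeric columns with more than two distinct values are treated as numerical
    # (nunique() counts distinct values; len() of the column would only give the row count)
    if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > 2:
        numerical_vars.append(col)
    # columns containing the flag value "Yes" are treated as binary
    elif (df[col] == "Yes").any():
        binary_vars.append(col)
    else:
        categorical_vars.append(col)

In [62]:
print("Numerical Variables: ",numerical_vars)
print("Categorical Variables: ",categorical_vars)
print("Binary Variables: ",binary_vars)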

Numerical Variables:  ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
Categorical Variables:  ['current_occupation', 'first_interaction', 'profile_completed', 'last_activity']
Binary Variables:  ['print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral']
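
With the columns grouped this way, a first univariate pass could summarise the numerical variables with descriptive statistics and the categorical and binary variables with value shares. The following is a minimal sketch under stated assumptions: `summarize_features` is a hypothetical helper, and `sample` is a hand-built frame using a few columns from the `df.head()` rows, so the numbers it produces say nothing about the full dataset.

```python
import pandas as pd


def summarize_features(data: pd.DataFrame, numerical: list, categorical: list):
    """Return descriptive statistics for numeric columns and value shares for the rest."""
    # one row per numeric column: count, mean, std, min, quartiles, max
    numeric_summary = data[numerical].describe().T
    # share of rows taking each value, per categorical or binary column
    shares = {col: data[col].value_counts(normalize=True) for col in categorical}
    return numeric_summary, shares


# illustrative sample only (assumption): a few columns from the rows shown by df.head()
sample = pd.DataFrame({
    "age": [57, 56, 52, 53, 23],
    "website_visits": [7, 2, 3, 4, 4],
    "current_occupation": ["Unemployed", "Professional", "Professional",
                           "Unemployed", "Student"],
    "referral": ["No", "No", "No", "No", "No"],
})

numeric_summary, shares = summarize_features(
    sample,
    numerical=["age", "website_visits"],
    categorical=["current_occupation", "referral"],
)
print(numeric_summary)
print(shares["current_occupation"])
```

On the full data the call would be `summarize_features(df, numerical_vars, categorical_vars + binary_vars)`.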


**Questions**
1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status? 
3. The company uses multiple modes to interact with prospects. Which way of interaction works best? 
4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels has the highest lead conversion rate?
5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?
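
Most of the questions above reduce to comparing the conversion rate (the mean of `status`) across the values of one feature. The sketch below shows one possible way to compute that. The helper name `conversion_rate_by` is hypothetical, and the five-row `sample` frame is copied from the `df.head()` output purely to show the shape of the result; rates computed on five rows are not findings.

```python
import pandas as pd


def conversion_rate_by(data: pd.DataFrame, feature: str, target: str = "status") -> pd.DataFrame:
    """Count leads and conversions per value of `feature` and derive the conversion rate."""
    grouped = data.groupby(feature)[target].agg(leads="count", converted="sum")
    grouped["conversion_rate"] = grouped["converted"] / grouped["leads"]
    # highest-converting value first
    return grouped.sort_values("conversion_rate", ascending=False)


# illustrative sample only (assumption): a few columns from the rows shown by df.head()
sample = pd.DataFrame({
    "current_occupation": ["Unemployed", "Professional", "Professional",
                           "Unemployed", "Student"],
    "first_interaction": ["Website", "Mobile App", "Website", "Website", "Website"],
    "status": [1, 0, 0, 1, 0],
})

rates = conversion_rate_by(sample, "current_occupation")
print(rates)
```

The same call works for `first_interaction`, `last_activity`, `profile_completed` and each of the binary channel flags, so questions 1 to 5 can all be approached by looping over the relevant columns of the full `df`.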

## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling 
- Any other preprocessing steps (if needed)
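
For the "preparing data for modeling" step, tree-based models in common libraries expect numeric inputs, so the text-valued columns need encoding. The sketch below is one possible encoding, not necessarily the one used later in this notebook: Yes/No flags become 1/0, `profile_completed` is mapped to an ordered integer (the Low < Medium < High ordering follows the data dictionary), and the remaining categorical columns are one-hot encoded. The names `encode_features` and `PROFILE_ORDER` are hypothetical, and `sample` is hand-built from the `df.head()` rows.

```python
import pandas as pd

# ordinal mapping (assumption): ordering taken from the data dictionary ranges
PROFILE_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def encode_features(data: pd.DataFrame, binary_cols: list, nominal_cols: list,
                    target: str = "status"):
    """Split off the target and convert text-valued features to numbers."""
    X = data.drop(columns=[target]).copy()
    y = data[target]

    # Yes/No flags -> 1/0
    for col in binary_cols:
        X[col] = X[col].map({"Yes": 1, "No": 0})

    # ordered categories -> integers that preserve the order
    X["profile_completed"] = X["profile_completed"].map(PROFILE_ORDER)

    # unordered categories -> one-hot columns (first level dropped to avoid redundancy)
    X = pd.get_dummies(X, columns=nominal_cols, drop_first=True, dtype=int)
    return X, y


# illustrative sample only (assumption): a few columns from the rows shown by df.head()
sample = pd.DataFrame({
    "age": [57, 56, 52, 53, 23],
    "current_occupation": ["Unemployed", "Professional", "Professional",
                           "Unemployed", "Student"],
    "first_interaction": ["Website", "Mobile App", "Website", "Website", "Website"],
    "profile_completed": ["High", "Medium", "Medium", "High", "High"],
    "digital_media": ["Yes", "No", "Yes", "No", "No"],
    "status": [1, 0, 0, 1, 0],
})

X, y = encode_features(
    sample,
    binary_cols=["digital_media"],
    nominal_cols=["current_occupation", "first_interaction"],
)
print(X.head())
```

Dropping the first level of each one-hot group is a design choice that matters mainly for linear models; for tree-based models it is optional and could be switched off with `drop_first=False`.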

## EDA

- It is a good idea to explore the data once again after manipulating it.

## Building a Decision Tree model

## Do we need to prune the tree?

## Building a Random Forest model

## Do we need to prune the tree?

## Actionable Insights and Recommendations