# ExtraaLearn Project

## Context

The EdTech industry has grown rapidly over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. Features such as easy information sharing, a personalized learning experience, and transparent assessment have made online education preferable to traditional education.

Due to Covid-19, the online education sector has grown rapidly and is attracting many new customers. This growth has brought many new companies into the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. EdTech companies obtain leads from various sources, such as:

* The customer interacts with the marketing front on social media or other online platforms. 
* The customer browses the website/app and downloads the brochure.
* The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. To do this, a representative from the organization contacts the lead by phone or email to share further details.

## Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
* Analyze and build an ML model to help identify which leads are more likely to convert to paid customers
* Find the factors driving the lead conversion process
* Create a profile of the leads that are likely to convert


## Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.


**Data Dictionary**
* ID: ID of the lead
* age: Age of the lead
* current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'
* first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website' and 'Mobile App'
* profile_completed: Percentage of the profile the lead has filled in on the website/mobile app. Values include Low (0-50%), Medium (50-75%), and High (75-100%)
* website_visits: Number of times the lead has visited the website
* time_spent_on_website: Total time spent on the website
* page_views_per_visit: Average number of pages on the website viewed during the visits
* last_activity: Last interaction between the lead and ExtraaLearn
    * Email Activity: Sought details about the program through email, representative shared information with the lead (such as the program brochure), etc.
    * Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.
    * Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.

* print_media_type1: Flag indicating whether the lead had seen an ExtraaLearn ad in a newspaper.
* print_media_type2: Flag indicating whether the lead had seen an ExtraaLearn ad in a magazine.
* digital_media: Flag indicating whether the lead had seen an ExtraaLearn ad on digital platforms.
* educational_channels: Flag indicating whether the lead had heard about ExtraaLearn through education channels such as online forums, discussion threads, educational websites, etc.
* referral: Flag indicating whether the lead had heard about ExtraaLearn through a referral.
* status: Flag indicating whether the lead was converted to a paid customer or not.
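
The data dictionary can double as a sanity check on the categorical columns. The sketch below is a minimal, hedged example: `EXPECTED_VALUES` and `unexpected_values` are hypothetical names introduced here, the sample rows are made up, and the dictionary only says values "include" these labels, so an unexpected value is a prompt to look closer rather than proof of an error.

```python
import pandas as pd

# Allowed values taken from the data dictionary above.
EXPECTED_VALUES = {
    "current_occupation": {"Professional", "Unemployed", "Student"},
    "first_interaction": {"Website", "Mobile App"},
    "profile_completed": {"Low", "Medium", "High"},
    "last_activity": {"Email Activity", "Phone Activity", "Website Activity"},
}


def unexpected_values(frame, expected=EXPECTED_VALUES):
    """Map each checked column to the set of values not listed in the dictionary."""
    problems = {}
    for column, allowed in expected.items():
        extra = set(frame[column].dropna().unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


# Made-up rows for illustration; the second row has a misspelt occupation.
sample = pd.DataFrame({
    "current_occupation": ["Professional", "Studnet"],
    "first_interaction": ["Website", "Mobile App"],
    "profile_completed": ["High", "Low"],
    "last_activity": ["Email Activity", "Phone Activity"],
})
print(unexpected_values(sample))
```

On the real data, the same function could be called with the loaded DataFrame in place of `sample`.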

## Importing necessary libraries and data

In [36]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = pd.read_csv('ExtraaLearn.csv')

# Check that the data loaded correctly
print(data.head())


       ID  age current_occupation first_interaction profile_completed  \
0  EXT001   57         Unemployed           Website              High   
1  EXT002   56       Professional        Mobile App            Medium   
2  EXT003   52       Professional           Website            Medium   
3  EXT004   53         Unemployed           Website              High   
4  EXT005   23            Student           Website              High   

   website_visits  time_spent_on_website  page_views_per_visit  \
0               7                   1639                 1.861   
1               2                     83                 0.320   
2               3                    330                 0.074   
3               4                    464                 2.057   
4               4                    600                16.914   

      last_activity print_media_type1 print_media_type2 digital_media  \
0  Website Activity               Yes                No           Yes   
1  Website Activit

## Data Overview

- Observations
- Sanity checks

In [None]:
print("Descriptions:\n\n", data.describe())

# count all NaN
print("\n\nNulls:\n\n", data.isnull().sum())

# Iterate over numerical columns and count zeros in each
print("\n")
for column in data.select_dtypes(include=['int64', 'int32', 'float64', 'float32']).columns:
    zero_count = data[column].value_counts().get(0, 0)
    print(f"Column '{column}' has {zero_count} zeros.\n")

# I'm guessing the lack of web visits is from those who used the app instead?

# Are there any odd ages? Find the lowest and highest value for 'age'
min_age = data['age'].min()
max_age = data['age'].max()

print(f"\nLowest age: {min_age}\n")
print(f"Highest age: {max_age}\n\n")

# count rows, how big the dataset is
total_rows = len(data)

# Print the total number of rows
print(f"Total number of rows in the dataset: {total_rows}")

# List data types
print("\nData Types:\n")
data.dtypes

# Notes: fewer than 1/3 of leads become customers (about 30%). It could probably be improved, but it does
# seem like a good percentage. I wonder whether someone is more likely to become a customer based on
# referrals. Other factors matter too, but that's the most interesting question to me, so I will answer it
# along with the others.

Descriptions:

                age  website_visits  time_spent_on_website  \
count  4612.000000     4612.000000            4612.000000   
mean     46.201214        3.566782             724.011275   
std      13.161454        2.829134             743.828683   
min      18.000000        0.000000               0.000000   
25%      36.000000        2.000000             148.750000   
50%      51.000000        3.000000             376.000000   
75%      57.000000        5.000000            1336.750000   
max      63.000000       30.000000            2537.000000   

       page_views_per_visit       status  
count           4612.000000  4612.000000  
mean               3.026126     0.298569  
std                1.968125     0.457680  
min                0.000000     0.000000  
25%                2.077750     0.000000  
50%                2.792000     0.000000  
75%                3.756250     1.000000  
max               18.434000     1.000000  


Nulls:

 ID                       0
age      

ID                        object
age                        int64
current_occupation        object
first_interaction         object
profile_completed         object
website_visits             int64
time_spent_on_website      int64
page_views_per_visit     float64
last_activity             object
print_media_type1         object
print_media_type2         object
digital_media             object
educational_channels      object
referral                  object
status                     int64
dtype: object

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**
1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status? 
3. The company uses multiple modes to interact with prospects. Which way of interaction works best? 
4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels has the highest lead conversion rate?
5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

In [33]:
# 1.
# Group by 'current_occupation' and calculate conversion rate (status = 1)
occupation_conversion_rate = data.groupby('current_occupation')['status'].mean()

# Display the result
print(occupation_conversion_rate)

#2. 
# Group by 'first_interaction' and calculate conversion rate (status = 1)
interaction_conversion_rate = data.groupby('first_interaction')['status'].mean()

# Display the result
print(interaction_conversion_rate)

#3. 
# Group by 'last_activity' and calculate conversion rate (status = 1)
last_activity_conversion_rate = data.groupby('last_activity')['status'].mean()

# Display the result
print(last_activity_conversion_rate)

#4. 
# Group by each channel and calculate conversion rates (status = 1)
channels_conversion_rate = {
    'print_media_type1': data.groupby('print_media_type1')['status'].mean(),
    'print_media_type2': data.groupby('print_media_type2')['status'].mean(),
    'digital_media': data.groupby('digital_media')['status'].mean(),
    'educational_channels': data.groupby('educational_channels')['status'].mean(),
    'referral': data.groupby('referral')['status'].mean()
}

# Display the results
for channel, conversion_rate in channels_conversion_rate.items():
    print(f"Conversion rates for {channel}:")
    print(conversion_rate)
    print("\n")

#5. 
# Group by 'profile_completed' and calculate conversion rate (status = 1)
profile_conversion_rate = data.groupby('profile_completed')['status'].mean()

# Display the result
print(profile_conversion_rate)

# Some observations: working professionals, those whose first interaction was through the website, and those
# who were referred had a higher chance of becoming a customer. The biggest was referrals, which I was
# curious about earlier and it should definitely be exploited by ExtraaLearn.
# Though, taking a second glance at the data, there aren't very many referrals right now so the data on it
# is limited by sample size. I still think it's a good prospect, but it needs more testing to confirm.

# Interestingly, more leads became customers if they interacted with the website first. This could mean
# two things: app viewers are more casual (which seems counterintuitive since it likely involves extra steps),
# or it means the app isn't very good. I would look into that if I were working on the app team at this
# company.

# In terms of how the company interacts with leads, website seems to work best. Email isn't far off,
# but phone calls seem to turn people away from being customers.

# The media type categories are basically the same, so they seem to have little impact on the conversion rate,
# except, surprisingly, the educational channels seem to decrease the lead chance to become a customer.

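# Profile completion is strongly associated with conversion: about 42% of leads with a High-completion
# profile converted, versus about 19% for Medium and about 7% for Low. This may reflect intent rather than
# cause (leads who already plan to enroll may be more likely to fill in their profile), so I would look into it.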

current_occupation
Professional    0.355122
Student         0.117117
Unemployed      0.265788
Name: status, dtype: float64
first_interaction
Mobile App    0.105314
Website       0.455940
Name: status, dtype: float64
last_activity
Email Activity      0.303336
Phone Activity      0.213128
Website Activity    0.384545
Name: status, dtype: float64
Conversion rates for print_media_type1:
print_media_type1
No     0.29599
Yes    0.31992
Name: status, dtype: float64


Conversion rates for print_media_type2:
print_media_type2
No     0.297328
Yes    0.321888
Name: status, dtype: float64


Conversion rates for digital_media:
digital_media
No     0.295961
Yes    0.318786
Name: status, dtype: float64


Conversion rates for educational_channels:
educational_channels
No     0.302022
Yes    0.279433
Name: status, dtype: float64


Conversion rates for referral:
referral
No     0.290772
Yes    0.677419
Name: status, dtype: float64


profile_completed
High      0.417845
Low       0.074766
Medium    0.188
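
The conversion rates above are group means, so they hide how many leads sit behind each rate. A small, hedged sketch of how the count could be shown next to the rate follows; the `leads` frame is made up for illustration and only borrows the column names from the real data.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the leads data (illustrative only).
leads = pd.DataFrame({
    "referral": ["No", "No", "No", "No", "Yes", "Yes"],
    "status":   [0, 1, 0, 0, 1, 1],
})

# 'mean' is the conversion rate; 'count' is the number of leads behind that rate.
summary = leads.groupby("referral")["status"].agg(["mean", "count"])
print(summary)
```

Applied to the real data, the `count` column would show directly how small the referral group is.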

## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling 
- Any other preprocessing steps (if needed)

In [64]:
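# Check the columns before any operations.
# Note: `df` and `categorical_columns` are not defined in the cells shown above; they are assumed to come
# from an earlier cell (the output below shows that `df` also carries a derived 'total_activity' column).
print(df.columns)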

# Remove the original unwanted columns (before one-hot encoding)
df_cleaned = df.drop(['print_media_type1', 'print_media_type2', 'digital_media'], axis=1)

# Update the categorical_columns list to exclude dropped columns
categorical_columns_updated = [col for col in categorical_columns if col in df_cleaned.columns]

# Print updated categorical_columns list to verify
print(categorical_columns_updated)

# Apply one-hot encoding to the updated categorical columns
df_encoded = pd.get_dummies(df_cleaned, columns=categorical_columns_updated, drop_first=False)

# Check the resulting columns after encoding
print(df_encoded.columns)

# Prepare features (X) and target (y)
X = df_encoded.drop(['status', 'ID'], axis=1)  # Drop target and ID
y = df_encoded['status']  # Target variable

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Index(['ID', 'age', 'current_occupation', 'first_interaction',
       'profile_completed', 'website_visits', 'time_spent_on_website',
       'page_views_per_visit', 'last_activity', 'print_media_type1',
       'print_media_type2', 'digital_media', 'educational_channels',
       'referral', 'status', 'total_activity'],
      dtype='object')
['current_occupation', 'first_interaction', 'profile_completed', 'last_activity', 'educational_channels', 'referral']
Index(['ID', 'age', 'website_visits', 'time_spent_on_website',
       'page_views_per_visit', 'status', 'total_activity',
       'current_occupation_Professional', 'current_occupation_Student',
       'current_occupation_Unemployed', 'first_interaction_Mobile App',
       'first_interaction_Website', 'profile_completed_High',
       'profile_completed_Low', 'profile_completed_Medium',
       'last_activity_Email Activity', 'last_activity_Phone Activity',
       'last_activity_Website Activity', 'educational_channels_No',
       'educa

## Building a Decision Tree model

In [65]:
# Initialize and train the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = dt_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# This took a long time. The first version of the model was quick and easy to build, but it wasn't as
# accurate as I thought it could be. I noticed 'Professional' was missing from the training data for some
# reason, so I fixed that because occupation had a sizable impact on the conversion rate. I also noticed
# that the media types didn't predict much, so I dropped them to see whether the model became more accurate.
# The print statements in the preprocessing cell are there because I kept getting errors that the media
# columns were not being dropped properly from the training data. The cell would look cleaner without them,
# but I kept them as a record of the process.


Accuracy: 0.8082340195016251
Confusion Matrix:
[[564  85]
 [ 92 182]]
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.87      0.86       649
           1       0.68      0.66      0.67       274

    accuracy                           0.81       923
   macro avg       0.77      0.77      0.77       923
weighted avg       0.81      0.81      0.81       923



## Model Performance evaluation and improvement

The model is good at predicting non-customers but weaker at predicting customers: the F1-score is 0.86 for non-customers and 0.67 for customers. Several things might improve it further, such as combining website visits with time spent on website into a single, more condensed feature.
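
The per-class scores in the classification report can be derived by hand from the confusion matrix. This short sketch does so with the decision tree matrix printed above; `prf` is a hypothetical helper name, not part of scikit-learn.

```python
# Confusion matrix copied from the decision tree output above.
# Rows are actual classes, columns are predicted classes (scikit-learn's layout).
conf_matrix = [[564, 85],
               [92, 182]]

(tn, fp), (fn, tp) = conf_matrix


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (converted): true positives are the bottom-right cell.
customer = prf(tp, fp, fn)
# Class 0 (not converted): swap the roles of the two classes.
non_customer = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("customer     P/R/F1:", [round(v, 2) for v in customer])
print("non-customer P/R/F1:", [round(v, 2) for v in non_customer])
print("accuracy:", round(accuracy, 4))
```

The rounded values match the classification report, which confirms that the gap between the two classes comes from the 85 false positives and 92 false negatives.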

Overfitting can't be judged from these test-set scores alone; that requires comparing them with the scores on the training data. An unconstrained decision tree like this one can memorize its training set, so this is worth checking.
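
One common way to check for overfitting is to compare training and test accuracy for the same model. The sketch below illustrates the idea on synthetic data from `make_classification`, which is an assumption standing in for the leads data; the `max_depth=5` setting is only an example value, not a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the leads data (an assumption, not the real dataset).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for depth in (None, 5):  # None = unconstrained tree, as in the model above
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    results[depth] = {"train": train_acc, "test": test_acc}
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

A large gap between the training and test scores suggests overfitting; limiting depth or pruning usually narrows it, sometimes at the cost of training accuracy.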

Non-customers outnumber customers by roughly 70% to 30%, which makes the customer class harder for the model to learn and may call for adjustments such as class weighting.
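
To make the imbalance concrete, the class counts can be recovered from the `describe()` output (4612 rows, mean `status` of 0.298569) and turned into inverse-frequency class weights. This is a sketch of the usual "balanced" heuristic, n_samples / (n_classes * class_count); the variable names are illustrative.

```python
# Class counts derived from the describe() output above:
# 4612 rows with a mean status of 0.298569 gives 1377 converted leads.
n_total = 4612
n_converted = round(n_total * 0.298569)
counts = {0: n_total - n_converted, 1: n_converted}

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
n_classes = len(counts)
class_weight = {label: n_total / (n_classes * count) for label, count in counts.items()}

print(counts)
print({label: round(weight, 3) for label, weight in class_weight.items()})
```

A dictionary like this can be passed as the `class_weight` argument of `DecisionTreeClassifier` or `RandomForestClassifier`; passing `class_weight="balanced"` applies the same heuristic to the training labels automatically.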

The model produced 85 false positives and 92 false negatives out of 923 test leads. It therefore missed about a third of the actual customers (92 of 274), so there is room to improve, but all things considered the model does a reasonable job.

## Building a Random Forest model

In [66]:
from sklearn.ensemble import RandomForestClassifier
# train_test_split and the metric functions were already imported at the top of the notebook

# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=10)  # You can adjust hyperparameters
rf_model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy: 0.8743228602383532
Confusion Matrix:
[[607  42]
 [ 74 200]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.94      0.91       649
           1       0.83      0.73      0.78       274

    accuracy                           0.87       923
   macro avg       0.86      0.83      0.84       923
weighted avg       0.87      0.87      0.87       923



## Model Performance evaluation and improvement

This model is noticeably more accurate: accuracy rises from 0.81 to 0.87, and recall on customers rises from 0.66 to 0.73. There are a few possible explanations: the data may be more complex than it appears, the decision tree may have been overfitting, or the imbalance between customers and non-customers may make a random forest more reliable than the single decision tree used before. Computationally, a random forest isn't an issue on this data set because it doesn't require much to run. If the data set were huge or very complex, though, it might become impractical.
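
A single train/test split can flatter one model by chance, so a cross-validated comparison is a useful cross-check. The sketch below uses synthetic data as a stand-in (an assumption) and scores both models on recall for the positive class, since finding likely customers is the goal here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace X, y with the encoded leads features.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated recall on the positive (converted) class
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    cv_scores[name] = scores
    print(f"{name}: mean recall={scores.mean():.3f} (std {scores.std():.3f})")
```

If the random forest still leads across folds on the real data, the improvement is less likely to be an artifact of one particular split.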

## Actionable Insights and Recommendations

One thing I would really consider, if I worked here, would be to look more closely at the data. The model improved when I removed the marketing-channel columns, but it seems unlikely that marketing has no effect at all. I think the marketing either reached very specific people, in which case the company should find out who they were, or it simply isn't very effective. Also, as mentioned before, there are very few referrals in the data, so referrals might be more or less impactful than my evaluation shows.
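
How much the small referral sample matters can be gauged with a confidence interval around the conversion rate. Here is a sketch using the Wilson score interval; the counts are hypothetical, because the output shows only the rate (about 0.677) and not the number of referred leads, and `wilson_interval` is a helper name introduced here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: both reproduce the observed rate of about 0.677,
# but the notebook output does not show how many referred leads there were.
for converted, referred in [(21, 31), (63, 93)]:
    low, high = wilson_interval(converted, referred)
    print(f"{converted}/{referred}: {low:.3f} to {high:.3f}")
```

The smaller the referred group, the wider the interval, which is why the referral effect needs more data before it drives a major decision.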

If referrals really are as impactful as the data shows, then there should definitely be a strong incentive to refer others. Maybe it's something the company can display as a feature before someone becomes a customer, to show that being a customer has benefits beyond the product itself. That could attract more customers directly and also increase the number of people referred, a win-win.

I would shift focus away from students and from phone/SMS interactions, since both are associated with low conversion rates (about 12% and 21% respectively). Target the marketing at those who are not students (particularly working professionals), and focus energy on interactions through the website.

Figure out why the app seems to deter people from becoming customers. If the company can't figure it out, then I would just remove it. It appears to do more harm than good, and it costs money to maintain.

