# SET UP

## CREATING PROJECT ENVIRONMENT

To keep the project dependencies isolated from the system and from other projects, a dedicated environment named *pf_leadscoring* is created in this section using *Conda*. Its specification is then exported to the file *pf_leadscoring.yml*.

conda create --name pf_leadscoring python numpy pandas matplotlib seaborn scikit-learn scipy sqlalchemy xgboost jupyter

conda activate pf_leadscoring

conda install -c conda-forge pyjanitor scikit-plot yellowbrick imbalanced-learn jupyter_contrib_nbextensions cloudpickle

conda install -c districtdatalabs yellowbrick

pip install category_encoders

conda env export > pf_leadscoring.yml
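Once the environment is active, a quick check from Python can confirm that the packages are visible to the interpreter. The following is a minimal sketch, not part of the original setup: `missing_modules` is a hypothetical helper, and the list of import names is an assumption derived from the install commands above (import names do not always match package names, e.g. `pyjanitor` is imported as `janitor`).

```python
from importlib.util import find_spec

# Import names assumed from the install commands above; package names and
# import names differ for some of them (scikit-learn -> sklearn, etc.).
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "seaborn", "sklearn", "scipy",
    "sqlalchemy", "xgboost", "janitor", "scikitplot", "yellowbrick",
    "imblearn", "cloudpickle", "category_encoders",
]


def missing_modules(modules):
    """Return the module names that cannot be located in the active environment."""
    return [name for name in modules if find_spec(name) is None]


# An empty list means every module was found.
print(missing_modules(REQUIRED_MODULES))
```

Running this in a notebook started from the activated environment should print an empty list if the installation succeeded.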

## IMPORTING PACKAGES

In [1]:
import os
import numpy as np
import pandas as pd

#To increase autocomplete response speed
%config IPCompleter.greedy=True

## CREATING PROJECT DIRECTORY

Defining root directory where the project is to be created:

In [2]:
# Backslashes are replaced explicitly so the result does not depend on os.sep
root = r'C:\Users\pedro\PEDRO\DS\Portfolio'.replace('\\', '/') + '/'
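As an alternative to string replacement, the same normalization can be sketched with `pathlib`. This is only an illustration under the assumption that a forward-slash string ending in `/` is what the rest of the notebook expects; `root_alt` is a hypothetical name used to avoid overwriting `root`.

```python
from pathlib import PureWindowsPath

# PureWindowsPath parses Windows-style paths on any operating system.
root_path = PureWindowsPath(r'C:\Users\pedro\PEDRO\DS\Portfolio')

# as_posix() returns the same path written with forward slashes.
root_alt = root_path.as_posix() + '/'
print(root_alt)
```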

Defining project name:

In [4]:
dir_name = 'LEAD_SCORING'

### Creating project directory and structure

In [12]:
path = root + dir_name

try:
    os.mkdir(path)
    os.mkdir(path + '/01_Documents')
    os.mkdir(path + '/02_Data')
    os.mkdir(path + '/02_Data/01_Originals')
    os.mkdir(path + '/02_Data/02_Validation')
    os.mkdir(path + '/02_Data/03_Work')
    os.mkdir(path + '/02_Data/04_Caches')
    os.mkdir(path + '/03_Notebooks')
    os.mkdir(path + '/03_Notebooks/01_Functions')
    os.mkdir(path + '/03_Notebooks/02_Development')
    os.mkdir(path + '/03_Notebooks/03_System')
    os.mkdir(path + '/04_Models')
    os.mkdir(path + '/05_Results')
    os.mkdir(path + '/09_Others')
    
except OSError:
    print("Creation of the %s directory has failed." % path)
else:
    print("%s directory has been successfully created." % path)

C:/Users/pedro/PEDRO/DS/Portfolio/LEAD_SCORING directory has been successfully created.
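The cell above fails if it is run a second time, because `os.mkdir` raises an error when a folder already exists. A possible variant that can be rerun safely is sketched below. It assumes the same folder layout; `create_structure` and `SUBFOLDERS` are hypothetical names, and the demo runs in a temporary directory so the real project folder is not touched.

```python
import os
import tempfile

# Same layout as above; parent folders are created implicitly by os.makedirs.
SUBFOLDERS = [
    '01_Documents',
    '02_Data/01_Originals',
    '02_Data/02_Validation',
    '02_Data/03_Work',
    '02_Data/04_Caches',
    '03_Notebooks/01_Functions',
    '03_Notebooks/02_Development',
    '03_Notebooks/03_System',
    '04_Models',
    '05_Results',
    '09_Others',
]


def create_structure(base_path, subfolders=SUBFOLDERS):
    """Create every subfolder under base_path; existing folders are left untouched."""
    for sub in subfolders:
        os.makedirs(os.path.join(base_path, sub), exist_ok=True)


# Demo in a temporary directory (an assumption made for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'LEAD_SCORING')
    create_structure(demo_path)
    create_structure(demo_path)  # a second call does not raise
    print(sorted(os.listdir(demo_path)))
```

With `exist_ok=True`, rerunning the notebook after adding a new folder to the list only creates what is missing.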


Setting the project directory as the working directory:

In [9]:
os.chdir(path)

### Creating Environment.yml file

The **pf_leadscoring.yml** file can be found in the '/01_Documents' folder of the project directory.

This file lists the specific versions of the packages used in the project and can be used to replicate the environment in the future if needed.

## CREATING INITIAL DATASETS

The original dataset **Leads.csv** can be found in the folder '/02_Data/01_Originals' together with its associated metadata file **Leads Data Dictionary.pdf**. This dataset was provided by [Ashish](https://www.kaggle.com/ashydv) and was downloaded from [Kaggle](https://www.kaggle.com/datasets/ashydv/leads-dataset). It contains a list of leads for an education company called X Education, which sells online courses. For each lead, the dataset includes a variety of features as well as whether or not that lead converted into a customer.

### Data importation

In [5]:
data_file_name = 'Leads.csv'
full_path = path + '/02_Data/01_Originals/' + data_file_name

Brief review of the file content:

In [37]:
open(full_path,'r').readlines()[:3]

['Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity\n',
 '7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0,0,0,Page Visited on Website,,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Select,Select,02.Med
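The preview above reads the whole file into memory and leaves the file handle open. A lighter variant is sketched below; `preview_lines` is a hypothetical helper, the `utf-8` encoding is an assumption about the file, and the demo uses a small made-up file rather than **Leads.csv**.

```python
from itertools import islice
import os
import tempfile


def preview_lines(file_path, n=3, encoding='utf-8'):
    """Return the first n lines of a text file without reading the rest."""
    with open(file_path, 'r', encoding=encoding) as f:
        return list(islice(f, n))


# Demo on a small made-up file (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, 'demo.csv')
    with open(demo_file, 'w', encoding='utf-8') as f:
        f.write('Lead Number,Converted\n660737,0\n660728,0\n660727,1\n')
    print(preview_lines(demo_file, n=2))
```

In the notebook, the equivalent call would be `preview_lines(full_path)`.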

In [6]:
data = pd.read_csv(full_path,sep=',')
data

| | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.00 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.50 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.00 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.00 | ... | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No | No | 1 | 2.0 | 1428 | 1.00 | ... | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9235 | 19d6451e-fcd6-407c-b83b-48e1af805ea9 | 579564 | Landing Page Submission | Direct Traffic | Yes | No | 1 | 8.0 | 1845 | 2.67 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 15.0 | 17.0 | No | No | Email Marked Spam |
| 9236 | 82a7005b-7196-4d56-95ce-a79f937a158d | 579546 | Landing Page Submission | Direct Traffic | No | No | 0 | 2.0 | 238 | 2.00 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 19.0 | No | Yes | SMS Sent |
| 9237 | aac550fe-a586-452d-8d3c-f1b62c94e02c | 579545 | Landing Page Submission | Direct Traffic | Yes | No | 0 | 2.0 | 199 | 2.00 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 13.0 | 20.0 | No | Yes | SMS Sent |
| 9238 | 5330a7d1-2f2b-4df4-85d6-64ca2f6b95b9 | 579538 | Landing Page Submission | Google | No | No | 1 | 3.0 | 499 | 3.00 | ... | No | | Other Metro Cities | 02.Medium | 02.Medium | 15.0 | 16.0 | No | No | SMS Sent |


### Extracting and reserving production script validation dataset

A random 20% of the data is set aside to simulate the unseen data that the model will receive once it is in production. This reserved set makes it possible to check how the production script performs on data that played no part in development.

In [39]:
val = data.sample(frac = 0.2)

In [40]:
validation_file_name = 'validation.csv'
full_path = path + '/02_Data/02_Validation/' + validation_file_name

val.to_csv(full_path,index=False)

### Extracting and saving work dataset

In [41]:
work = data.loc[~data.index.isin(val.index)]

In [42]:
work_file_name = 'work.csv'
full_path = path + '/02_Data/03_Work/' + work_file_name

work.to_csv(full_path,index=False)
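Because `data.sample` is called without a seed, rerunning the notebook produces a different validation set each time. A reproducible variant of the split is sketched below, under stated assumptions: `split_validation` is a hypothetical helper, `random_state=42` is an arbitrary choice, the DataFrame index is assumed to be unique, and the toy DataFrame is made-up data used only for illustration.

```python
import pandas as pd


def split_validation(df, frac=0.2, random_state=42):
    """Return (val, work): a seeded random fraction of the rows and the remaining rows."""
    val = df.sample(frac=frac, random_state=random_state)
    # Equivalent to df.loc[~df.index.isin(val.index)] when the index is unique.
    work = df.drop(val.index)
    return val, work


# Toy data for illustration only (not the real leads dataset).
toy = pd.DataFrame({'Lead Number': range(100, 110), 'Converted': [0, 1] * 5})
val, work = split_validation(toy)

# The two subsets share no rows and together cover the original data.
assert val.index.intersection(work.index).empty
assert len(val) + len(work) == len(toy)
```

Fixing the seed means the same rows are reserved for validation on every run, which keeps later results comparable.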