# Purpose
This notebook documents the initial data discovery for a Springboard project: Opportunity Pipeline Forecasting. The goal is to understand the tables, columns, and information flow, and to identify data issues and work toward resolving them. By the end of this activity, the data sources and their treatment are finalized. Code in this notebook will not be part of the production code.

# Initialization

In [72]:
%load_ext autoreload
%autoreload 2 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [73]:
import os
import os.path as op
import shutil

# standard third party imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#from imblearn.over_sampling import SMOTE
pd.options.mode.use_inf_as_na = True

from datetime import datetime
from dateutil.relativedelta import relativedelta

In [74]:
# os, op, pd and relativedelta are already imported in the previous cell
import great_expectations as ge
os.environ['TA_DEBUG'] = "False"
os.environ['TA_ALLOW_EXCEPTIONS'] = "True"

In [75]:
import warnings

warnings.filterwarnings('ignore', message="The sklearn.metrics.classification module", category=FutureWarning)
warnings.filterwarnings('ignore', message=".*title_format is deprecated. Please use title instead.*")
warnings.filterwarnings('ignore', message="The default value of regex will change from True to False in a future version.", 
                        category=FutureWarning)
warnings.filterwarnings('ignore', message="this method is deprecated in favour of `Styler.to_html()`")

In [76]:
%%time
from ta_lib.core.api import (
    create_context,
    get_package_path,
    display_as_tabs,
    initialize_environment,
    string_cleaning,
    setanalyse,
    merge_expectations
)
import ta_lib.core.api as dataset
import ta_lib.eda.api as analysis
import ta_lib.reports.api as health

CPU times: total: 0 ns
Wall time: 0 ns


In [77]:
# Initialization
initialize_environment(debug=False, hide_warnings=True)

# Data

## Background

The sales group of a technology and chip manufacturing company is given annual sales targets in \\$. The team works toward these targets by pursuing multiple opportunities across customers and business segments. An opportunity is a potential customer with specific product requirements (type, design, and quantity) and negotiated \\$ values.
The sales enablement group of the same organization has identified that, while a number of opportunities are being pursued, the sales group lacks a means to quantify the conversion potential of this group of opportunities (hereafter referred to as the sales pipeline portfolio) within a specific timeframe (say, a quarter). The conversion potential of a sales pipeline portfolio is determined by:
 - Quality of an opportunity: How likely is an opportunity to convert within a specific timeframe?


Below is an overview of each of the keys in the dataset:

| | Key | Dataset | Details |
|--|--|--|--|
| 1 | Opportunity ID | prod_data.csv & opp_data.csv | Opportunity Identifier |
| 2 | Product ID | prod_data.csv | Product Identifier |
| 3 | Product Segment Name | prod_data.csv | Product Segment |
| 4 | Product Status | prod_data.csv | Product Status (Pending, Win Approved, Win Submitted, Lost, Cancelled) |
| 5 | Product \\$ | prod_data.csv | Price in Proposal |
| 6 | Product Quantity | prod_data.csv | Quantity in Proposal |
| 7 | Decision Date | prod_data.csv & opp_data.csv | Opportunity Decision Date (Deadline to decide on opportunity) |
| 8 | Snapshot Time | prod_data.csv | Time at which the product snapshot was taken |
| 9 | Transition To Stage | opp_data.csv | Opportunity Stage at Snapshot Time |
| 10 | Transition To Timestamp | opp_data.csv | Snapshot Time |
| 11 | Transition From Stage Name | opp_data.csv | Opportunity Stage at previous Snapshot | 
| 12 | Transition From Timestamp | opp_data.csv | Previous Snapshot Time| 
| 13 | Customer Name | opp_data.csv | Customer Name | 
| 14 | Risk Status | opp_data.csv | Opportunity Risk |
| 15 | Creation Date | opp_data.csv | Opportunity Creation Date |
| 16 | Opportunity Status | opp_data.csv | Opportunity Status (Open, Closed Won, Closed Lost) |
| 17 | Opportunity Type | opp_data.csv | Opportunity Type (Deal if the customer is interested in an existing product configuration; Design if the customer is interested in an entirely new design) |
| 18 | Core Consumption Market | opp_data.csv | Consumption Market (Entertainment, Office Use, etc.) |
| 19 | Core Product Segment | opp_data.csv | Product Segment (Processors, Graphics Card, Motherboard, etc.) |
| 20 | Core Sales Segment | opp_data.csv | Sales Segment |
| 21 | Geography | opp_data.csv | Customer Geography |
| 22 | Core Product Application | opp_data.csv | Product Application (PC, Server, Mobile, Tablet, etc.) |

In [78]:
config_path = op.join('conf', 'config.yml')
context = create_context(config_path)
dataset.list_datasets(context)

['/raw/opportunity',
 '/raw/product',
 '/cleaned/opportunity',
 '/cleaned/product',
 '/processed/merged_final_dataset',
 '/train/features',
 '/train/target',
 '/test/features',
 '/test/target',
 '/test/test2']

In [79]:
# Loading all datasets in a loop
data = dict()
for i in dataset.list_datasets(context):
    if '/raw/' in i:
        key_ = i.replace('/raw/','')+'_df'
        data[key_] = dataset.load_dataset(context,i)
        # Standardize column names
        data[key_].columns = string_cleaning(data[key_].columns,lower=True)

## Exploratory Data Analysis

### Shape of Data

In [80]:
(
    pd.DataFrame({x:data[x].shape for x in data.keys()})
    .T
    .rename(columns={0:'rows',1:'columns'})
    .sort_values('rows',ascending=False)
)

Unnamed: 0,rows,columns
opportunity_df,313571,16
product_df,142431,8


## Variable summary

In [81]:
summaries = [analysis.get_variable_summary(data[x]) for x in data.keys()]
display_as_tabs([(x, summaries[idx]) for idx, x in enumerate(data.keys())])

## Health Analysis

Get an overview of the overall health of your dataset. This is usually quick to compute and hopefully highlights some problems to focus on.



In [82]:
summaries_and_plots = [analysis.get_data_health_summary(data[x], return_plot=True) for x in data.keys()]
plots = [x[1] for x in summaries_and_plots]
display_as_tabs([(x, plots[idx]) for idx, x in enumerate(data.keys())])

### Summary Plot

Provides a high-level summary of the health of your dataset.

**Watch out for:**

* too few numeric values
* high % of missing values
* high % of duplicate values
* high % of duplicate columns 

In [83]:
summaries_and_plots = [analysis.get_data_health_summary(data[x], return_plot=True) for x in data.keys()]
plots = [x[1] for x in summaries_and_plots]
display_as_tabs([(x, plots[idx]) for idx, x in enumerate(data.keys())])

**Dev NOTES**

<details>
1. Datatypes : We have both numeric and other types. The bulk of them seem to be numeric. `Numeric` is defined to be one of [float|int|date] and the rest are categorized as `Others`. A column is assumed to have `date` values if it has the string `date` in the column name.

**[TODO]** We prob. need more types: integral, float, bool, dates/timestamps, strings. We have this functionality in Dataprocessor.

2. The missing value plot seems to indicate missing values are not present but we do have them. 

**[TODO]** The plot can be improved to better display small values

3. We are looking for duplicate observations (rows in the data). The plot shows the % of rows that are an exact replica of another row (using `df.duplicated`)

4. We are looking for duplicate features (columns in the data).

**[TODO]** The tigerml code seems complicated but it looks like we compare each column against all other similar columns (numeric/categoric) after dropping nans, infs


**[TODO]** We need better data inspectors. The current data inspectors show columns from the dataframe used to construct the plot and **not** the original data. This does not make sense for an end-user who didn't explicitly construct the intermediate data used for the plot. It would be more meaningful to have labels that match the legends (e.g. unique_columns:100%, duplicate_columns:0). Also, the y-axis label doesn't convey anything. The x-axis prob. needs a scale (0 to 100%).

</details>
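
The duplicate checks described in the dev notes can be approximated with plain pandas. The snippet below is a minimal sketch on a hypothetical toy frame; it is not the tigerml implementation, which also separates numeric and categorical columns and drops NaNs and infs before comparing.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0, and column "b" repeats column "a".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 1.0, 4.0],
    "b": [1.0, 2.0, 1.0, 4.0],
    "c": [0.5, 1.5, 0.5, 2.5],
})

# Percentage of rows that are an exact replica of an earlier row.
pct_duplicate_rows = toy.duplicated().mean() * 100

# Columns that replicate an earlier column: transpose, then reuse duplicated().
duplicate_columns = toy.columns[toy.T.duplicated()].tolist()

print(pct_duplicate_rows, duplicate_columns)
```

Transposing is convenient for small frames; on wide or very long data it can be slow and memory hungry.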

### Missing Values summary

This provides an overall view focusing on the amount of missing values in the dataset.

**Watch out for:**
* A few columns have significant number of missing values 
* Most columns have significant number of missing values

In [84]:
summaries_and_plots = [analysis.get_missing_values_summary(data[x], return_plot=True) for x in data.keys()]
plots = [x[1] for x in summaries_and_plots]
display_as_tabs([(x, plots[idx]) for idx, x in enumerate(data.keys())])

**Dev notes:**

<details>
    
    * By default, the following are considered missing/NA values : `[np.nan, pd.NaT, 'NA', None]`
    * additional values can be passed to tigerml (add_additional_na_values)
    * these are applied to all columns.
    
    * some of the above information can be learnt from the data discovery step (see discussion below)
    
</details>
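
To illustrate how extra NA tokens change the missing-value picture, here is a minimal pandas sketch. The toy data and the `extra_na_values` list are assumptions made for illustration; this does not call tigerml.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing real nulls with string placeholders.
toy = pd.DataFrame({
    "geography": ["Geo 1", "Geo NA", None, "Geo 3"],
    "risk status": ["NA", "High", np.nan, "Low"],
})

# Tokens treated as missing in addition to NaN/None ('Geo NA' is an assumed extra token).
extra_na_values = ["NA", "Geo NA"]

# A cell counts as missing if it is null or equals one of the extra tokens.
missing_mask = toy.isna() | toy.isin(extra_na_values)
missing_pct = missing_mask.mean() * 100

print(missing_pct)
```

As in the notes above, the extra tokens are applied to every column, so a token that is a legitimate value in one column would be miscounted there.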

### Duplicate Columns

In [85]:
summaries = [analysis.get_duplicate_columns(data[x]) for x in data.keys()]
display_as_tabs([(x, summaries[idx]) for idx, x in enumerate(data.keys())])

### Outlier Checks

In [86]:
summaries = [analysis.get_outliers(data[x]) for x in data.keys()]
display_as_tabs([(x, summaries[idx]) for idx, x in enumerate(data.keys())])

In [87]:
data['opportunity_df'].head()

Unnamed: 0,ïopportunity id,transition to stage,transition to timestamp,transition from stage name,transition from timestamp,customer name,risk status,creation date,decision date,opportunity status,opportunity type,core consumption market,core product segment,core sales segment,geography,core product application
0,5,Stage 3,20150211130001,,,Customer 83,,2015-02-11,2015-03-12,,,Core Market 11,Core Prd Seg 3,Sales Segment 8,Geo NA,Prd App 4
1,5,Stage 3,20150213050002,Stage 3,20150210000000.0,Customer 83,,2015-02-11,2015-03-12,,,Core Market 11,Core Prd Seg 3,Sales Segment 8,Geo NA,Prd App 4
2,5,Stage 3,20150218210007,Stage 3,20150210000000.0,Customer 83,,2015-02-11,2015-03-12,,,Core Market 11,Core Prd Seg 3,Sales Segment 8,Geo NA,Prd App 4
3,5,Stage 3,20150304210002,Stage 3,20150220000000.0,Customer 83,,2015-02-11,2015-03-12,,,Core Market 11,Core Prd Seg 3,Sales Segment 8,Geo NA,Prd App 4
4,5,Stage 3,20150304210002,Stage 3,20150300000000.0,Customer 83,,2015-02-11,2015-03-12,Open,,Core Market 11,Core Prd Seg 3,Sales Segment 8,Geo NA,Prd App 4


In [88]:
data['product_df'].head()

Unnamed: 0,ïopportunity id,product id,product segment name,product status,product $,product quantity,decision date,snapshot time
0,1361,6669,Product Segment NA,Win Approved,0.0,1.0,2015-08-12,20151125050002
1,1361,6669,Product Segment NA,Win Approved,0.0,1.0,2015-08-12,20151025050002
2,1361,6669,Product Segment NA,Win Approved,0.0,1.0,2015-08-12,20160105050003
3,1361,6669,Product Segment NA,Win Approved,0.0,1.0,2015-08-12,20160101050003
4,1361,6669,Product Segment NA,Pending,0.0,1.0,2015-08-31,20150525050003


## Data Cleaning

### opportunity data
* transition to timestamp, transition from timestamp: convert the timestamps to datetime objects
* core consumption market, core product segment, core sales segment, geography, core product application: convert the string labels to integers (e.g. `Geo 1` becomes 1; NA labels become -1)
* transition to stage, transition from stage name, customer name, risk status, opportunity status, opportunity type: encode as numerical values
* drop duplicates
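
The raw `transition from timestamp` values appear as floats (for example `20150210000000.0`) because the column contains missing values. The sketch below shows one way to parse such a column, assuming the values follow the `%Y%m%d%H%M%S` layout; the sample values are taken from the `head()` output above, and the approach is an illustration rather than the cleaning code used in this notebook.

```python
import numpy as np
import pandas as pd

# Timestamps stored as floats, with a missing value in between.
raw = pd.Series([20150211130001.0, np.nan, 20150304210002.0])

# Drop the nulls, cast float -> int -> str so the trailing ".0" never reaches the parser.
as_text = raw.dropna().astype("int64").astype(str)

# Parse with an explicit format, then restore the original index (missing rows become NaT).
parsed = pd.to_datetime(as_text, format="%Y%m%d%H%M%S").reindex(raw.index)

print(parsed)
```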

### product data
* snapshot time: convert the timestamp to a datetime object
* product status: encode as numerical values
* sort the data
* remove duplicates
* drop the product segment name column, as it has only one unique value
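
The label-to-integer conversion listed above can also be written without explicit loops. This is a minimal sketch on hypothetical labels; it assumes every valid label ends in a run of digits and uses -1 as the sentinel for NA labels, as the cleaning cell does.

```python
import pandas as pd

# Hypothetical labels in the style of the 'geography' column.
labels = pd.Series(["Geo 1", "Geo NA", "Geo 12", None])

# Extract the trailing digits; labels without any (e.g. 'Geo NA', missing) become NaN.
codes = labels.str.extract(r"(\d+)$", expand=False)

# Convert to numbers and use -1 for labels that had no numeric suffix.
codes = pd.to_numeric(codes).fillna(-1).astype(int)

print(codes.tolist())
```

Unlike taking only the last character of the label, the regular expression keeps multi-digit suffixes such as `12` intact.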

In [89]:
data={}
for i in dataset.list_datasets(context):
    if '/raw/' in i:
        dataset_name = i.replace('/raw/','')
        key_ = dataset_name+'_df'
        data[key_] = dataset.load_dataset(context,i)
        
        # converting column names to lowercase for uniformity
        data[key_].columns = string_cleaning(data[key_].columns,lower=True)
        data[key_].rename(columns = {'ïopportunity id' : 'opportunity id'}, inplace = True)
        if key_ == 'opportunity_df':
            
            # Convert transition timestamps to datetime objects
            data[key_]['transition from timestamp']=pd.to_datetime(data[key_]['transition from timestamp'], format='%Y%m%d%H%M%S')
            data[key_]['transition to timestamp']=pd.to_datetime(data[key_]['transition to timestamp'], format='%Y%m%d%H%M%S')
                        
            # Encode 'transition to stage', 'transition from stage name', 'customer name', 'risk status', 
            # 'opportunity status', and 'opportunity type' as numerical values
            for col in ['transition to stage', 'transition from stage name']:
                temp=[]
                for i in data[key_][col]:
                    if type(i)==str:
                        temp.append(int(i[-1]))
                    else:
                        temp.append(-1)           
                data[key_][col]=temp 
                
            
            temp=[]
            for i in data[key_]["customer name"]:
                if i!='Customer NA':
                    cust_no = int(i.split(" ")[-1])
                    temp.append(cust_no)
                else:
                    temp.append(-1)           
            data[key_]["customer name"]=temp
            
            for col in ['risk status', 'opportunity status', 'opportunity type']:
                unique_vals = data[key_][col].unique()
                val_dict = {unique_vals[i]: i for i in range(len(unique_vals))}
                temp = []
                for val in data[key_][col]:
                    temp.append(val_dict[val])
                data[key_][col] = temp
            
            # Encode 'core consumption market', 'core product segment', and 'core sales segment' as numerical values
            for col in ['core consumption market', 'core product segment','core sales segment']:
                temp = []
                for val in data[key_][col]:
                    if val != 'Customer NA':
                        num = int(val.split(" ")[-1])
                        temp.append(num)
                    else:
                        temp.append(-1)
                data[key_][col] = temp
            
            temp=[]
            for i in data[key_]["geography"]:
                if i!='Geo NA':
                    geo_no = int(i.split(" ")[-1])
                    temp.append(geo_no)
                else:
                    temp.append(-1)           
            data[key_]["geography"]=temp
            
            temp=[]
            for i in data[key_]["core product application"]:
                if i!='Prd App 4 NA':
                    prod_app_no = int(i.split(" ")[-1])
                    temp.append(prod_app_no)
                else:
                    temp.append(-1)           
            data[key_]["core product application"]=temp
            
            #removing duplicates
            data['opportunity_df']=data['opportunity_df'].drop_duplicates(subset=["opportunity id",
            "transition to stage","transition from stage name","customer name","risk status","creation date","decision date",
            "opportunity status","opportunity type","core sales segment","geography","core product application"],keep="first")

            opportunity = data['opportunity_df']
        if key_ == 'product_df':
            
            # Convert snapshot time to a datetime object
            data[key_]['snapshot time']=pd.to_datetime(data[key_]['snapshot time'], format='%Y%m%d%H%M%S')
            
            
            temp=[]
            prod_status=data[key_]['product status'].unique()
            prod_stat_dict= {prod_status[i] : i for i in range(len(prod_status)) }
            for i in data[key_]["product status"]:
                temp.append(prod_stat_dict[i])         
            data[key_]["product status"]=temp
            
            #sorting in ascending order
            data['product_df'].sort_values(by=['opportunity id','product id','snapshot time','decision date','product $','product quantity'],inplace=True)

            #dropping product segment name as it has only NA value
            data['product_df'].drop(columns=['product segment name'],inplace=True)
            
            #dropping duplicates
            data['product_df'].drop_duplicates(subset=['opportunity id','product id','product status','product $','product quantity'],inplace=True)
            product=data['product_df']
        
        dataset.save_dataset(context, data[key_], 'cleaned/'+dataset_name, index = False)

In [90]:
product.head()

Unnamed: 0,opportunity id,product id,product status,product $,product quantity,decision date,snapshot time
245,5,404,1,230400.0,1200.0,12-03-2015,2015-02-11 21:00:01
272,5,404,0,230400.0,1200.0,16-03-2015,2015-03-16 13:00:01
526,19,16377,1,0.0,400000.0,14-08-2015,2015-05-11 05:00:04
455,19,16377,0,0.0,400000.0,18-08-2015,2015-08-18 13:00:02
443,19,16378,1,0.0,200000.0,14-08-2015,2015-05-11 05:00:04


In [91]:
product.shape

(9376, 7)

In [92]:
opportunity.shape

(36135, 16)

In [93]:
merged_df=pd.merge(opportunity,product,how='inner',on=['opportunity id','decision date'])
dataset.save_dataset(context,merged_df , 'processed/'+'merged_final_dataset')

In [94]:
merged_df.shape

(23474, 21)
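
An inner merge silently drops keys that exist in only one table, so it can help to count matches before relying on the merged shape. The sketch below uses hypothetical toy frames and the same two merge keys; `indicator=True` is a standard `pandas.merge` option that adds a `_merge` column.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned opportunity and product tables.
opp = pd.DataFrame({
    "opportunity id": [5, 5, 19],
    "decision date": ["12-03-2015", "16-03-2015", "14-08-2015"],
})
prod = pd.DataFrame({
    "opportunity id": [5, 19, 19],
    "decision date": ["12-03-2015", "14-08-2015", "18-08-2015"],
    "product id": [404, 16377, 16377],
})

keys = ["opportunity id", "decision date"]

# An outer merge with indicator=True labels each row 'both', 'left_only' or 'right_only'.
check = opp.merge(prod, on=keys, how="outer", indicator=True)
match_counts = check["_merge"].value_counts()

print(match_counts)
```

Only the rows labelled `both` would survive the inner merge; the other two labels show what is lost from each side.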

In [95]:
merged_df.head()

Unnamed: 0,opportunity id,transition to stage,transition to timestamp,transition from stage name,transition from timestamp,customer name,risk status,creation date,decision date,opportunity status,...,core consumption market,core product segment,core sales segment,geography,core product application,product id,product status,product $,product quantity,snapshot time
0,5,3,2015-02-11 13:00:01,-1,NaT,83,0,11-02-2015,12-03-2015,0,...,11,3,8,-1,4,404,1,230400.0,1200.0,2015-02-11 21:00:01
1,5,3,2015-02-13 05:00:02,3,2015-02-11 13:00:01,83,0,11-02-2015,12-03-2015,0,...,11,3,8,-1,4,404,1,230400.0,1200.0,2015-02-11 21:00:01
2,5,3,2015-03-04 21:00:02,3,2015-03-04 21:00:02,83,0,11-02-2015,12-03-2015,1,...,11,3,8,-1,4,404,1,230400.0,1200.0,2015-02-11 21:00:01
3,5,4,2015-03-16 13:00:01,3,2015-03-04 21:00:02,83,0,11-02-2015,12-03-2015,2,...,11,3,8,-1,4,404,1,230400.0,1200.0,2015-02-11 21:00:01
4,5,4,2015-03-17 13:00:01,4,2015-03-16 13:00:01,83,0,11-02-2015,12-03-2015,2,...,11,3,8,1,4,404,1,230400.0,1200.0,2015-02-11 21:00:01


## Feature Engineering

In [96]:
#time taken to transition from one stage to another
merged_df['transition days']=(merged_df['transition to timestamp']-merged_df['transition from timestamp']).dt.days

In [97]:
# Unit cost of each product line in the proposal
merged_df['cost_per_product']=merged_df['product $']/merged_df['product quantity']

# Difference between the unit cost and the average unit cost of the same product
average_price = merged_df.groupby('product id')['cost_per_product'].mean()
merged_df['diff_avgcost'] = merged_df['cost_per_product'] - merged_df['product id'].map(average_price)
merged_df['diff_avgcost'].min()

-280.0
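
The same two features can be sketched on toy data with `groupby(...).transform`, which broadcasts each product's mean back onto its rows. Treating a zero quantity as missing is an assumption made here for illustration; the notebook instead relies on `use_inf_as_na`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row has a zero quantity.
toy = pd.DataFrame({
    "product id": [404, 404, 16377],
    "product $": [230400.0, 115200.0, 0.0],
    "product quantity": [1200.0, 1200.0, 0.0],
})

# Treat a zero quantity as missing so the division yields NaN rather than inf or 0/0.
quantity = toy["product quantity"].replace(0, np.nan)
toy["cost_per_product"] = toy["product $"] / quantity

# Deviation of each row's unit cost from the mean unit cost of its product.
product_mean = toy.groupby("product id")["cost_per_product"].transform("mean")
toy["diff_avgcost"] = toy["cost_per_product"] - product_mean

print(toy)
```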

In [98]:
# Total time taken to reach a decision: the number of days between the first recorded timestamp of an
# opportunity id and each row on which its status is win (2) or lost (3).
# Reset the index after sorting so that the positional loop below follows the sorted order.
merged_df = merged_df.sort_values(by=['opportunity id', 'transition to timestamp']).reset_index(drop=True)
merged_df["time difference"] = np.nan
first_transition_from_timestamp = None

for i in range(1, len(merged_df)):
    
    if merged_df.loc[i, "opportunity id"] != merged_df.loc[i-1, "opportunity id"]:
        # New opportunity: start from its 'transition from timestamp', or 'transition to timestamp' if that is missing
        first_transition_from_timestamp = merged_df.loc[i, "transition from timestamp"]
        if pd.isnull(first_transition_from_timestamp):
            first_transition_from_timestamp = merged_df.loc[i, "transition to timestamp"]
    elif first_transition_from_timestamp is None:
        first_transition_from_timestamp = merged_df.loc[i, "transition from timestamp"]
    
    if merged_df.loc[i, "opportunity status"] in [2,3]:
        # .loc avoids chained assignment, which may write to a copy
        merged_df.loc[i, "time difference"] = (merged_df.loc[i, "transition to timestamp"] - first_transition_from_timestamp).total_seconds()/86400
        
#print(first_transition_from_timestamp)
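
A row-by-row loop over tens of thousands of rows is slow, and the same idea can be expressed with `groupby`. The following is a sketch on hypothetical toy data, not a drop-in replacement: it assumes the start of an opportunity is the first available timestamp in sorted order, which may differ from the loop above in edge cases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: opportunity 5 is won on its third row, opportunity 7 stays open.
toy = pd.DataFrame({
    "opportunity id": [5, 5, 5, 7],
    "transition from timestamp": pd.to_datetime(
        [None, "2015-02-11 13:00:01", "2015-03-04 21:00:02", None]),
    "transition to timestamp": pd.to_datetime(
        ["2015-02-11 13:00:01", "2015-03-04 21:00:02",
         "2015-03-16 13:00:01", "2015-04-01 09:00:00"]),
    "opportunity status": [1, 1, 2, 1],
}).sort_values(["opportunity id", "transition to timestamp"])

# Start of each opportunity: the 'from' timestamp, falling back to the 'to' timestamp.
start = toy["transition from timestamp"].fillna(toy["transition to timestamp"])
first_seen = start.groupby(toy["opportunity id"]).transform("first")

# Days elapsed, kept only on rows where the status is win (2) or lost (3).
elapsed_days = (toy["transition to timestamp"] - first_seen).dt.total_seconds() / 86400
toy["time difference"] = elapsed_days.where(toy["opportunity status"].isin([2, 3]))

# Earliest decision time per opportunity; NaN if no decision has been recorded.
time_to_decision = toy.groupby("opportunity id")["time difference"].min()

print(time_to_decision)
```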

In [99]:
df_min=merged_df[['opportunity id','time difference']]
merged_df=merged_df.rename(columns={'time difference':'time taken for decision'})
df_min = df_min.groupby('opportunity id', as_index=False)['time difference'].min()
merged_df=pd.merge(merged_df, df_min, on=['opportunity id'], how='left')

In [100]:
df_min.isnull().sum()

opportunity id       0
time difference    906
dtype: int64

In [101]:
merged_df.drop(['time taken for decision'],axis=1,inplace=True)
merged_df=merged_df.rename(columns={'time difference':'time taken for decision'})
merged_df

Unnamed: 0,opportunity id,transition to stage,transition to timestamp,transition from stage name,transition from timestamp,customer name,risk status,creation date,decision date,opportunity status,...,core product application,product id,product status,product $,product quantity,snapshot time,transition days,cost_per_product,diff_avgcost,time taken for decision
0,5,3,2015-02-11 13:00:01,-1,NaT,83,0,11-02-2015,12-03-2015,0,...,4,404,1,230400.0,1200.0,2015-02-11 21:00:01,,192.0,0.0,33.000000
1,5,3,2015-02-13 05:00:02,3,2015-02-11 13:00:01,83,0,11-02-2015,12-03-2015,0,...,4,404,1,230400.0,1200.0,2015-02-11 21:00:01,1.0,192.0,0.0,33.000000
2,5,3,2015-03-04 21:00:02,3,2015-03-04 21:00:02,83,0,11-02-2015,12-03-2015,1,...,4,404,1,230400.0,1200.0,2015-02-11 21:00:01,0.0,192.0,0.0,33.000000
3,5,4,2015-03-16 13:00:01,3,2015-03-04 21:00:02,83,0,11-02-2015,12-03-2015,2,...,4,404,1,230400.0,1200.0,2015-02-11 21:00:01,11.0,192.0,0.0,33.000000
4,5,4,2015-03-17 13:00:01,4,2015-03-16 13:00:01,83,0,11-02-2015,12-03-2015,2,...,4,404,1,230400.0,1200.0,2015-02-11 21:00:01,1.0,192.0,0.0,33.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23469,42789,1,2016-06-12 21:00:04,6,2016-06-10 07:13:44,9787,0,15-11-2015,01-09-2016,3,...,5,38158,1,6000.0,1000.0,2015-11-16 13:00:03,2.0,6.0,0.0,66.666678
23470,42789,1,2016-06-15 13:00:03,1,2016-06-12 21:00:04,9787,0,15-11-2015,01-09-2016,3,...,5,38158,1,60000.0,10000.0,2015-11-15 13:00:03,2.0,6.0,0.0,66.666678
23471,42789,1,2016-06-15 13:00:03,1,2016-06-12 21:00:04,9787,0,15-11-2015,01-09-2016,3,...,5,38158,1,6000.0,1000.0,2015-11-16 13:00:03,2.0,6.0,0.0,66.666678
23472,42789,1,2016-07-01 05:00:04,1,2016-06-25 05:00:03,9787,1,15-11-2015,01-09-2016,3,...,5,38158,1,60000.0,10000.0,2015-11-15 13:00:03,6.0,6.0,0.0,66.666678


In [106]:
merged_df['decision date'] = pd.to_datetime(merged_df['decision date'], format = '%d-%m-%Y')
merged_df['creation date'] = pd.to_datetime(merged_df['creation date'], format = '%d-%m-%Y')

In [107]:
merged_df['decision days']=(merged_df['decision date']-merged_df['creation date']).dt.days
merged_df['decision days'].mean()

85.372667632274

In [109]:
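print(merged_df['decision date'].max())
# There appears to be a typo in the decision date, so we convert 2215-03-20 to 2015-03-20
merged_df['decision date'] = merged_df['decision date'].replace(pd.Timestamp('2215-03-20'), pd.Timestamp('2015-03-20'))
# Recompute 'decision days', which was derived from the uncorrected date in the previous cell
merged_df['decision days']=(merged_df['decision date']-merged_df['creation date']).dt.days
print(merged_df['decision date'].max())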

2215-03-20 00:00:00
2020-03-31 00:00:00
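
Rather than spotting such a typo by inspecting the maximum, implausible dates can be flagged against a cutoff. This is a minimal sketch: the `upper_bound` value and the 200-year correction are assumptions chosen to match the single typo seen above, not a general rule.

```python
import pandas as pd

# Hypothetical decision dates, including the mistyped year 2215.
dates = pd.Series(pd.to_datetime(
    ["20-03-2215", "31-03-2020", "12-03-2015"], format="%d-%m-%Y"))

# Assumed plausibility cutoff; anything later is treated as a data-entry error.
upper_bound = pd.Timestamp("2100-01-01")
suspect = dates[dates > upper_bound]

# Shift only the suspect dates back by 200 years, i.e. read 2215 as 2015.
fixed = dates.mask(dates > upper_bound, dates - pd.DateOffset(years=200))

print(suspect.tolist(), fixed.max())
```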


## Define DV

Opportunity status is missing in fewer than 0.5% of cases, so we can ignore these.
In around 18% of cases the opportunity status is open. We can't simply drop these, so we will save them to a separate file.

In [110]:
# Encoded values of opportunity status:
# 0 : na
# 1 : open
# 2 : win
# 3 : lost
merged_df['opportunity status'].value_counts()
# For a small proportion of rows (<0.5%) the opportunity status is 0. It turns out that this value is given to an
# opportunity before it is considered open. Every opportunity id should have a single opportunity status, so we
# will change these values to the opportunity's final status.

1    11807
2     9816
0      971
3      880
Name: opportunity status, dtype: int64

In [111]:
merged_df["opportunity status"] = merged_df.groupby('opportunity id')["opportunity status"].transform("max")
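
The effect of `transform("max")` is easiest to see on a small example. The sketch below uses hypothetical rows with the status codes listed above; it relies on the closed codes (2, 3) being larger than the NA and open codes (0, 1), which depends on the order in which the values were encoded.

```python
import pandas as pd

# Hypothetical rows: opportunity 5 ends as win (2), opportunity 7 is still open (1).
toy = pd.DataFrame({
    "opportunity id": [5, 5, 5, 7, 7],
    "opportunity status": [0, 1, 2, 0, 1],  # 0: na, 1: open, 2: win, 3: lost
})

# Broadcast the highest status code of each opportunity onto all of its rows.
toy["final status"] = (
    toy.groupby("opportunity id")["opportunity status"].transform("max")
)

print(toy)
```

Note that an opportunity recorded as both win and lost would be labelled lost under this rule, since 3 is the larger code.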

In [112]:
# For opportunities that are still open we don't know their final status, so we keep them in a separate dataset.
# After testing our model, we can predict the opportunity status for these rows.
test2=merged_df[merged_df['opportunity status']==1]
test2.shape
dataset.save_dataset(context, test2, 'test/'+'test2')

In [113]:
merged_df = merged_df[~((merged_df['opportunity status'] == 0) | (merged_df['opportunity status'] == 1))]

In [114]:
merged_df.shape

(19755, 26)

In [115]:
# Recode the DV as binary: lost (3) becomes 0, win (2) becomes 1
merged_df['opportunity status'].replace(3,0,inplace=True)
merged_df['opportunity status'].replace(2,1,inplace=True)
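
An explicit mapping is an alternative way to express this recoding in one step. This is a sketch on hypothetical values; with `map`, any status code not listed in the dictionary becomes NaN, which makes unexpected values visible.

```python
import pandas as pd

# Hypothetical status codes after open and NA rows have been removed.
status = pd.Series([2, 3, 2, 3, 2], name="opportunity status")

# Win (2) becomes 1 and lost (3) becomes 0.
target = status.map({2: 1, 3: 0})

# Share of won rows in the toy data.
win_rate = target.mean()

print(target.tolist(), win_rate)
```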

In [116]:
dataset.save_dataset(context,merged_df , 'processed/'+'merged_final_dataset')

In [None]:
# Next steps:
# - one hot encoding
# - key drivers: interaction with target variable
# - make initial pipeline
# - model tuning with GridSearchCV
# - select best estimators
# - make pipeline using best estimators
# - model evaluation
# - classification error analysis
# - model summary
# - most influential features
# - classification reports
# - model comparison

