# BUSINESS PROBLEM

**Company Name:**

Insurance All Company

---

**What does the company do?**

It sells health insurance to its customers.

---

**What's the business problem?**

The company wants to identify which customers are most likely to buy its new product, auto insurance.

---

**What is the main strategy?**

The company will initially call 5,000 customers, so we need to know which ones to call.

---

**What questions do we need to answer?**



1.   What **percentage of the customers** interested in buying auto insurance will the sales team reach by making **5,000 calls**? And what is the **financial return**, compared with the **random model**, if each auto insurance policy costs **1,000 reais**?

2.   What if we **increase** the number of calls to **10,000**?

3.   And what if we **increase** it to **20,000** calls?
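
All three questions come down to the same calculation: rank the customers by score, take the top k, and compare the number of interested customers reached with what k random calls would reach on average. The sketch below runs that calculation on synthetic data. It assumes every interested customer who is reached buys one policy at 1,000 reais; the function name `campaign_gain` and the toy scores are illustrative and not part of the project.

```python
import numpy as np

def campaign_gain(y_true, scores, n_calls, price=1000.0):
    """Compare calling the top-scored customers with calling at random."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_calls = min(n_calls, len(y_true))

    order = np.argsort(-scores, kind="stable")   # highest score first
    reached_model = int(y_true[order][:n_calls].sum())
    total_interested = int(y_true.sum())

    # A random call list reaches interested customers in proportion to its size.
    reached_random = total_interested * n_calls / len(y_true)

    return {
        "recall_model": reached_model / total_interested,
        "recall_random": n_calls / len(y_true),
        "revenue_model": reached_model * price,
        "revenue_random": reached_random * price,
    }

# Synthetic example: 100 customers, 10 interested, scores that rank them first.
y = np.array([1] * 10 + [0] * 90)
s = np.linspace(1.0, 0.0, num=100)
result = campaign_gain(y, s, n_calls=20)
print(result)
```

Calling the function with `n_calls` set to 5,000, 10,000 and 20,000 on the real scores would answer the three questions in turn.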



## Solution Planning

**What is the solution?**

We need to develop a machine learning model that ranks customers by their probability of buying the new product, auto insurance.

---

**How are we going to deliver the solution?**

We are going to build an API that returns a score for each customer, based on the machine learning model that ranks them, and deploy it to a cloud platform.

---

**Where is it hosted?**

The API is hosted on the Heroku platform:

https://health-insurance-score-27.herokuapp.com/predict

---

**Which are the INPUTS?**

*   **Id** : Unique ID for the customer
*   **Gender** : Gender of the customer
*   **Age** : Age of the customer
*   **Driving License** : 0 = Customer does not have DL, 1 = Customer already has DL
*   **Region Code** : Unique code for the region of the customer
*   **Previously Insured** : 1 = Customer already has Vehicle Insurance, 0 = Customer doesn't have Vehicle Insurance
*   **Vehicle Age** : Age of the Vehicle
*   **Vehicle Damage** : 1 = Customer got his/her vehicle damaged in the past. 0 = Customer didn't get his/her vehicle damaged in the past.
*   **Annual Premium** : The amount customer needs to pay as premium in the year
*   **Policy Sales Channel** : Anonymized code for the channel used to reach the customer, i.e., different agents, over mail, over phone, in person, etc.
*   **Vintage** : Number of Days, Customer has been associated with the company
*   **Response** : 1 = Yes, 0 = No

---

**Which are the OUTPUTS?**

All of the above, except **Response**, plus:
*   **Score**
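
As a rough illustration of how a client could call the API, the sketch below serializes customer records to JSON and posts them to the endpoint listed above. The request format (a JSON list of records with snake_case field names) is an assumption, since the source does not document it, and the endpoint may no longer be reachable. The sample values are made up.

```python
import json

API_URL = "https://health-insurance-score-27.herokuapp.com/predict"

def build_payload(customers):
    """Serialize a list of customer records (dicts) to a JSON string."""
    return json.dumps(customers)

def request_scores(customers, url=API_URL, timeout=30):
    """POST the records and return the decoded JSON response (assumed format)."""
    import requests  # imported lazily so build_payload works offline

    response = requests.post(
        url,
        data=build_payload(customers),
        headers={"Content-Type": "application/json"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()

# Field names follow the INPUTS list in snake_case; the values are made up.
sample = [{
    "id": 1, "gender": "Male", "age": 44, "driving_license": 1,
    "region_code": 28, "previously_insured": 0, "vehicle_age": "> 2 Years",
    "vehicle_damage": "Yes", "annual_premium": 40454.0,
    "policy_sales_channel": 26, "vintage": 217,
}]
payload = build_payload(sample)
print(payload)
```

`request_scores( sample )` would send the request; it is left uncalled here so the example runs without network access.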

## Process Planning


**Where is the data?**

The data is available on the **AWS** platform.

---

**Which DBMS?**

Postgres

**CREDENTIALS:**
*   HOST = comunidade-ds-postgres.c50pcakiuwi3.us-east-1.rds.amazonaws.com
*   PORT = 5432
*   Database = comunidadedsdb
*   Username = member
*   Password = cdspa
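
A common alternative to writing credentials into the notebook is to read them from environment variables. The sketch below shows one way to do that. The variable names (`PA004_DB_HOST` and so on) and the function `load_db_config` are hypothetical and chosen only for this example.

```python
import os

def load_db_config(env=os.environ):
    """Read PostgreSQL connection settings from a mapping of environment variables."""
    return {
        "host": env["PA004_DB_HOST"],
        "port": int(env.get("PA004_DB_PORT", "5432")),
        "database": env["PA004_DB_NAME"],
        "user": env["PA004_DB_USER"],
        "password": env["PA004_DB_PASSWORD"],
    }

# Example with an explicit mapping; the values are placeholders.
example_env = {
    "PA004_DB_HOST": "db.example.com",
    "PA004_DB_NAME": "exampledb",
    "PA004_DB_USER": "reader",
    "PA004_DB_PASSWORD": "change-me",
}
config = load_db_config(example_env)
print({key: value for key, value in config.items() if key != "password"})
```

The resulting dictionary uses the same keyword names that the notebook passes to `pg.connect`, so `pg.connect( **config )` would open the connection.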

## Tools Planning

We are going to use **Python** and its libraries to collect, **visualize**, **prepare**, **transform**, and **select** the data, then **model** and **predict** the probability of auto insurance **acquisition**.

---

**STEPS PLANNING**

*   **Data Description** - to get to know our dataset
*   **Feature Engineering** - to derive features and raise hypotheses that may lead to insights
*   <s>Feature Filtering</s>
*   **Exploratory Data Analysis** - to understand how relevant each feature is to the business model
*   **Data Preparation** - rescaling, encoding and transforming
*   **Feature Selection** - to identify the most relevant features
*   **Machine Learning Modeling** - to test several machine learning models
*   **Cross Validation** - to cross-validate the best-performing models
*   **Hyperparameter Fine Tuning** - to find the best parameters for the selected models
*   **Business Questions** - to translate ML metrics into business results
*   **Deploy** - to build our API and deploy it on the Heroku platform


# 0.0. IMPORTS

## 0.0.0 LIBRARIES

In [2]:
!pip install scikit-plot

Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Installing collected packages: scikit-plot
Successfully installed scikit-plot-0.3.7


In [3]:
import pickle
import requests
import json
import math
import random
import warnings
import os
import pandas                    as pd
import numpy                     as np
import seaborn                   as sns
import scikitplot                as skplt
import xgboost                   as xgb
import random                    as rd
import psycopg2                  as pg

from google.colab                import drive
from IPython.core.display        import HTML
from IPython.display             import Image
from tabulate                    import tabulate
from sklearn.ensemble            import RandomForestClassifier
from lightgbm                    import LGBMClassifier
from sklearn.tree                import DecisionTreeClassifier
from sklearn.neighbors           import KNeighborsClassifier
from sklearn.naive_bayes         import GaussianNB
from sklearn.linear_model        import LogisticRegression
from scipy                       import stats                                   as ss
from matplotlib                  import pyplot                                  as plt
from sklearn                     import preprocessing                           as pp
from sklearn                     import model_selection                         as ms
from sklearn                     import ensemble                                as en
from sklearn                     import neighbors                               as nh
from sklearn                     import linear_model                            as lm
from sklearn                     import metrics                                 as m
from scikitplot                  import metrics                                 as mt
from sklearn.metrics             import confusion_matrix, classification_report

warnings.filterwarnings("ignore")


In [4]:
# drive.mount('/content/drive')

## 0.0.1. Helper Functions

### Models

In [5]:
def models_train( models, x_train, y_train, x_val, y_val, predict = 'predict', metric = 'multi', verbose=1):
    metrics = pd.DataFrame()
    for model in models:
        if verbose == 1:
            print( model.__class__.__name__ )
        model.fit( x_train, y_train )
        if predict == 'predict':
            yhat = model.predict( x_val )
        elif predict == 'predict_proba':
            yhat = model.predict_proba( x_val )
            yhat = yhat[:, 1]
        
        if metric == 'multi':
            metrics = pd.concat( [metrics, multi_class_metrics( model.__class__.__name__, y_val, yhat, verbose )] )
        elif metric == 'binary':
            metrics = pd.concat( [metrics, binary_class_metrics( model.__class__.__name__, y_val, yhat, verbose )] )
            
    return metrics

In [6]:
def modeling( models, x_train, y_train, x_test, y_test, verbose=True):
    metrics = pd.DataFrame()
    models_performance = pd.DataFrame()
    i = 0
    for model in models:
        if verbose:
            print( model.__class__.__name__ + ' - ' + str( i ) )
        model.fit( x_train, y_train )

        yhat = model.predict( x_test )
        yhat_proba = model.predict_proba( x_test )[:, 1]

        modeling = pd.DataFrame( [model.__class__.__name__ + ' - ' + str( i )] ).T
        i = i + 1

        # AUC_ROC
        roc = m.roc_auc_score( y_test, yhat_proba )
        df_roc = pd.DataFrame( [roc] )

        # TopK Score
        knum = y_test.value_counts().count() - 1
        topk = m.top_k_accuracy_score( y_test, yhat_proba, k = knum )
        df_topk = pd.DataFrame( [topk] )    

        # Precision Score
        precision = m.precision_score( y_test, yhat )
        df_precision = pd.DataFrame( [precision] ).T

        # Recall Score
        recall = m.recall_score( y_test, yhat )
        df_recall = pd.DataFrame( [recall] ).T

        # F1 Score
        f1 = m.f1_score( y_test, yhat )
        df_f1 = pd.DataFrame( [f1] ).T

        # Accuracy Score
        accuracy = m.accuracy_score( y_test, yhat )
        df_accuracy = pd.DataFrame( [accuracy] ).T
    
        metrics = pd.concat( [modeling, df_roc, df_topk, df_f1, df_precision, df_recall, df_accuracy] ).T.reset_index()
        metrics.columns = ['Index', 'Model', 'ROC AUC', 'Top K Score', 'F1', 'Precision', 'Recall', 'Accuracy']

        models_performance = pd.concat( [models_performance, metrics] ).reset_index().drop( ['Index', 'index'], axis=1 )

    return models_performance

### Metrics

In [7]:
def numerical_attributes( data ):
    
    # central tendency (quantile, median) & dispersion - std, min, max, range, skew, kurtosis
    d0 = pd.DataFrame( data.apply( lambda x: x.quantile( 0 ) ) ).T
    d1 = pd.DataFrame( data.apply( lambda x: x.quantile( 0.25 ) ) ).T
    d2 = pd.DataFrame( data.apply( lambda x: x.quantile( 0.50 ) ) ).T
    d3 = pd.DataFrame( data.apply( lambda x: x.quantile( 0.75 ) ) ).T
    d4 = pd.DataFrame( data.apply( lambda x: x.quantile( 1 ) ) ).T 
    d5 = pd.DataFrame( data.apply( lambda x: x.max() - x.min() ) ).T
    d6 = pd.DataFrame( data.apply( np.mean ) ).T 
    d7 = pd.DataFrame( data.apply( lambda x: x.std() ) ).T
    d8 = pd.DataFrame( data.apply( lambda x: x.skew() ) ).T 
    d9 = pd.DataFrame( data.apply( lambda x: x.kurtosis() ) ).T
    
    
    # concatenate
    aux = pd.concat( [d0, d1, d2, d3, d4, d5, d6, d7, d8, d9] ).T.reset_index()
    aux.columns = ['ATTRIBUTES', 'MIN', 'Q1', 'MEDIAN', 'Q3', 'MAX', 'RANGE', 'MEAN', 'STD', 'SKEW', 'KURTOSIS']
    return aux


def multi_class_metrics( model, y_test, yhat, verbose = 0 ):
    
    model = pd.DataFrame( [model] ).T

    # Precision Score
    precision = m.precision_score( y_test, yhat )
    df_precision = pd.DataFrame( [precision] ).T
    
    #Recall Score
    recall = m.recall_score( y_test, yhat )
    df_recall = pd.DataFrame( [recall] ).T
    
    # F1 Score
    f1 = m.f1_score( y_test, yhat )
    df_f1 = pd.DataFrame( [f1] ).T

    # Accuracy Score
    accuracy = m.accuracy_score( y_test, yhat )
    df_accuracy = pd.DataFrame( [accuracy] ).T
    
    metrics = pd.concat( [model, df_f1, df_precision, df_recall, df_accuracy] ).T.reset_index()
    metrics.columns = ['Index', 'Model', 'F1', 'Precision', 'Recall', 'Accuracy']
    # the 'Index' helper column is kept here; cross_validation drops it after stacking the folds
    if verbose == 1:
        print( 'Precision Score: {}'.format( precision ) )
        print( 'Recall Score: {}'.format( recall ) )
        print( 'F1 Score: {}'.format( f1 ) )
        print( 'Accuracy: {}'.format( accuracy ) )

        # Classification Report
        print( m.classification_report( y_test, yhat ) )

        # Confusion Matrix
        mt.plot_confusion_matrix( y_test, yhat, normalize=False, figsize=( 12, 12 ) )
        
    return metrics

def binary_class_metrics( model, y_test, yhat, verbose = 1 ):
    
    model = pd.DataFrame( [model] ).T

    # AUC_ROC
    roc = m.roc_auc_score( y_test, yhat )
    rocdf = pd.DataFrame( [roc] )
    
    # TopK Score
    knum = y_test.value_counts().count() - 1
    topk = m.top_k_accuracy_score( y_test, yhat, k = knum )
    topkdf = pd.DataFrame( [topk] )    
    
    metrics = pd.concat( [model, rocdf, topkdf] ).T.reset_index()
    metrics.columns = ['Index', 'Model', 'ROC AUC', 'Top K Score']
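    # the 'Index' helper column is kept here; cross_validation drops it after stacking the folds
    if verbose == 1:
        print( 'ROC AUC: {}'.format( roc ) )
        print( 'Top K Score: {}'.format( topk ) )
        # yhat may hold probabilities, so threshold at 0.5 to get class labels for the reports
        yhat_label = ( np.asarray( yhat ) >= 0.5 ).astype( int )
        # Classification Report
        print( m.classification_report( y_test, yhat_label ) )
        # Confusion Matrix
        mt.plot_confusion_matrix( y_test, yhat_label, normalize = False, figsize = ( 12, 12 ) )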
        
    return metrics

def precision_at_k( df, yhat_proba, target, perc = 0.25 ):
    k = int( np.floor( len( df ) * perc ) )
    
    df['score'] = yhat_proba[:, 1].tolist()
    df = df.sort_values( 'score', ascending=False )
    df = df.reset_index( drop=True )
    df['ranking'] = df.index + 1
    df['precision_at_k'] = df[target].cumsum() / df['ranking']

    # row k - 1 (0-based) holds the value for the top k customers
    return df.loc[max( k - 1, 0 ), 'precision_at_k']

def recall_at_k( df, yhat_proba, target, perc = 0.25):
    k = int( np.floor( len( df ) * perc ) )

    df['score'] = yhat_proba[:, 1].tolist()
    df = df.sort_values( 'score', ascending=False)
    df = df.reset_index( drop = True )
    df['recall_at_k'] = df[target].cumsum() / df[target].sum()
    
    # row k - 1 (0-based) holds the value for the top k customers
    return df.loc[max( k - 1, 0 ), 'recall_at_k']

def top_k_performance( df, proba, response, perc ):
    rows = []
    target_total = df[response].sum()
    for i in proba:
        # rank a copy once per model; df itself keeps its original row order,
        # because the helpers below attach the scores to it by position
        ranked = df.copy()
        ranked['score'] = i[:, 1].tolist()
        ranked = ranked.sort_values( 'score', ascending=False )

        for j in perc:
            k = int( np.floor( len( df ) * j ) )

            target_at_k = ranked[response][:k].sum()
            target_perc = target_at_k / target_total

            precision = precision_at_k( df, i, response, j )
            recall = recall_at_k( df, i, response, j )

            rows.append( {'Model': 'Model',
                          'perc': j,
                          'k': k,
                          'precision': precision,
                          'recall': recall,
                          'target_total': target_total,
                          'target_at_k': target_at_k,
                          'perc_target': target_perc} )
    return pd.DataFrame( rows )

# Suppress scientific notation
np.set_printoptions(suppress=True)
pd.set_option('display.float_format', lambda x: '%.4f' % x)
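
To make the ranking metrics concrete, here is a small worked example with eight made-up customers. It follows the same recipe as `precision_at_k` and `recall_at_k` (sort by score, cumulative sum of the target), but on a toy frame so the numbers can be checked by hand. The column values are invented for illustration.

```python
import pandas as pd

ranked = pd.DataFrame({
    "response": [1, 0, 1, 1, 0, 0, 1, 0],
    "score":    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
}).sort_values("score", ascending=False).reset_index(drop=True)

ranked["ranking"] = ranked.index + 1
ranked["precision_at_k"] = ranked["response"].cumsum() / ranked["ranking"]
ranked["recall_at_k"] = ranked["response"].cumsum() / ranked["response"].sum()

k = 4
# With a 0-based index, row k - 1 summarizes the top k customers.
precision_k = ranked.loc[k - 1, "precision_at_k"]
recall_k = ranked.loc[k - 1, "recall_at_k"]
print(precision_k, recall_k)
```

Among the top 4 customers, 3 are interested, so precision at 4 is 3/4. The frame holds 4 interested customers in total, so recall at 4 is also 3/4.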

### Graphics

In [8]:
def graphic_percentage( ax, total ):
    for p in ax.patches:
        height = p.get_height()
        ax.text( p.get_x() + p.get_width() / 2.,
                 height,
                 '{:1.2f}'.format( height / total * 100 ),
                 ha = "center" ) 
    plt.show()

def cramer_v( x, y ):
    cm = pd.crosstab( x, y ).values
    
    n = cm.sum()
    r, k = cm.shape
    chi2 = ss.chi2_contingency( cm )[0]
    
    # bias correction: subtract from phi2 = chi2 / n, not from chi2 itself
    phi2corr = max( 0, chi2 / n - (k-1)*(r-1)/(n-1) )
    kcorr = k - ( k - 1 ) ** 2 / ( n - 1 )
    rcorr = r - ( r - 1 ) ** 2 / ( n - 1 )
    
    b = phi2corr / ( min( kcorr - 1, rcorr - 1) )
    
    v = np.sqrt( b )
    return v

def corr_cramer_v( categorical_attributes ):
    cat_attributes_list = categorical_attributes.columns.tolist()

    corr_dict = {}

    for i in range( len( cat_attributes_list ) ):
        corr_list = []
        for j in range( len( cat_attributes_list ) ):
            ref = cat_attributes_list[i]
            feat = cat_attributes_list[j]
            corr = cramer_v( categorical_attributes[ref], categorical_attributes[feat] )
            corr_list.append( corr )
        corr_dict[ref]= corr_list
    return corr_dict

def jupyter_settings():
    %matplotlib inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>' ) )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
jupyter_settings();
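
The bias-corrected Cramér's V behind `cramer_v` can be checked by hand on two extreme contingency tables. The sketch below is a NumPy-only version written for this check. It computes the plain Pearson chi-squared statistic itself, whereas `ss.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so results on 2x2 tables can differ slightly from the notebook's function. The name `cramers_v_corrected` is illustrative.

```python
import numpy as np

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table of counts."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    r, k = obs.shape

    # Pearson chi-squared statistic, without continuity correction.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()

    # Correct phi2 = chi2 / n and the table dimensions for small-sample bias.
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    kcorr = k - (k - 1) ** 2 / (n - 1)
    rcorr = r - (r - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))

independent = cramers_v_corrected([[25, 25], [25, 25]])   # no association
associated = cramers_v_corrected([[50, 0], [0, 50]])      # perfect association
print(independent, associated)
```

The two tables bracket the range of the statistic: no association gives 0, and a perfect one-to-one association gives 1.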

### Dataset Connection

In [9]:
def connection_db():
    host = 'comunidade-ds-postgres.c50pcakiuwi3.us-east-1.rds.amazonaws.com'
    port = 5432
    database = 'comunidadedsdb'
    username = 'member'
    pwd = 'cdspa'
    
    conn = pg.connect( user = username,
                       password = pwd,
                       host = host,
                       port = port,
                       database = database)
    
    return conn

def query_db():
    query_tables = """
    SELECT *
    FROM pa004.users u INNER JOIN pa004.vehicle v ON ( u.id = v.id )
                       INNER JOIN pa004.insurance i ON ( u.id = i.id )
    """

    df = pd.read_sql( query_tables, conn )
    conn.close()
    return df

### Cross Validation

In [10]:
def cross_validation( model_name, model, X, Y, n, verbose = 0 ):
  i = 1
  c = pd.DataFrame()
  d = pd.DataFrame()
  
  kfold = ms.StratifiedKFold( n_splits = n, shuffle = True, random_state = 27)

  for train_cv, test_cv in kfold.split( X, Y ):
    print( 'KFold Number {}/{}'.format( i, n ) )

    if verbose == 1:
      print("TRAIN:", train_cv, "\nTEST:", test_cv)

    x_train, x_test = X.iloc[train_cv], X.iloc[test_cv]
    y_train, y_test = Y.iloc[train_cv], Y.iloc[test_cv]
    
    # Modeling
    model = model.fit( x_train, y_train )
    yhat_model = model.predict( x_test )
    yhat_proba = model.predict_proba( x_test )[:, 1]
    
    a = binary_class_metrics( '{} - {}/{}'.format( model_name, i, n ), y_test, yhat_proba, 0 )
    b = multi_class_metrics( '{} - {}/{}'.format( model_name, i, n ), y_test, yhat_model, 0 )
    c = pd.concat( [c, a] ).reset_index().drop(['index', 'Index'], axis=1)
    d = pd.concat( [d, b] ).reset_index().drop(['index', 'Index'], axis=1)
              
    i = i + 1
  
  # Features OUTPUT
  name      = pd.DataFrame( { 'Model': ['{} Average'.format( model_name )] } ).T
  roc_auc   = pd.DataFrame( { 'ROC AUC': [c['ROC AUC'].mean()] } ).T
  top_k     = pd.DataFrame( { 'Top K Score': [c['Top K Score'].mean()] } ).T
  f1        = pd.DataFrame( { 'F1': [d['F1'].mean()] } ).T
  precision = pd.DataFrame( { 'Precision': [d['Precision'].mean()] } ).T
  recall    = pd.DataFrame( { 'Recall': [d['Recall'].mean()] } ).T
  accuracy  = pd.DataFrame( { 'Accuracy': [d['Accuracy'].mean()] } ).T
  avg       = pd.concat( [name, roc_auc, top_k, f1, precision, recall, accuracy] ).T
  
  cv_list = c.merge( d, on='Model', how='right' )
  cv_list = pd.concat( [cv_list, avg] ).reset_index().drop( 'index', axis=1 )
  avg     = avg.reset_index().drop( 'index', axis=1 )
  return cv_list, avg

## 0.0.2. Loading data

In [None]:
conn = connection_db()
df_raw = query_db()

In [None]:
# df_raw = pd.read_csv( '/content/drive/MyDrive/Colab/data/pa004/train.csv' )
df_raw = df_raw.loc[:,~df_raw.columns.duplicated()]

In [None]:
df_raw.columns

# 1.0. STEP 01 - DATA DESCRIPTION

In [None]:
df1 = df_raw.copy()

In [None]:
df1.head()

## 1.1. Rename Columns

In [None]:
df1.columns

In [None]:
cols_new = ['id', 'gender', 'age', 'region_code', 'policy_sales_channel',
            'driving_license', 'vehicle_age', 'vehicle_damage', 'previously_insured',
            'annual_premium', 'vintage', 'response']

df1.columns = cols_new

## 1.2. Data Dimensions

In [None]:
print( 'Number of Rows: {}'.format( df1.shape[0] ) )
print( 'Number of Columns: {}'.format( df1.shape[1] ) )

## 1.3. Data Types

In [None]:
df1.dtypes

## 1.4. Check NA

In [None]:
df1.isna().sum()

## <s>1.5. Fillout NA</s>

## 1.6. Change Data Types

In [None]:
df1['region_code'] = df1['region_code'].astype('int64')
df1['annual_premium'] = df1['annual_premium'].astype('int64')
df1['policy_sales_channel'] = df1['policy_sales_channel'].astype('int64')

In [None]:
df1['vehicle_damage'].unique()

In [None]:
df1.dtypes

## 1.7 Check Balance Data

In [None]:
df1['response'].value_counts( normalize=True )

## 1.8. Descriptive Statistics

In [None]:
num_attributes = df1.select_dtypes( include=['int64', 'float64'] )
cat_attributes = df1.select_dtypes( exclude=['int64', 'float64'] )

### 1.8.1. Numerical Attributes

In [None]:
numerical_attributes( num_attributes )

### 1.8.2. Categorical Attributes

In [None]:
cat_attributes.apply( lambda x: x.unique().shape[0] )

# 2.0. STEP 02 - FEATURE ENGINEERING

In [None]:
df2 = df1.copy()

## 2.1. Hypothesis Mind Map

In [None]:
# Image('path')

## 2.2. Business Search

**Relevant Features to Business Model that are not included**

**1.** Driving License Time

**2.** Garage

**3.** Security Alarm

**4.** Marital Status

**5.** Vehicle Model

**6.** Vehicle (more details about it)

**7.** State

**8.** City

**9.** Children

## 2.3. Hypotheses

**1.** Customers with annual_premium >= 30564 (MEAN) are more interested in the offer

**2.** Customers with age >= 49 (Q3) are less interested in the offer

**3.** Customers with vintage >= 227 (Q3) are more interested in the offer

**4.** Customers with driving_license == 0 (MIN) are less interested in the offer

**5.** Customers with previously_insured == 0 (MIN) are more interested in the offer

**6.** Customers with gender == 'Female' are less interested in the offer

**7.** Customers with vehicle_damage == 0 (MIN) are less interested in the offer

## 2.4. Feature Engineering

In [None]:
# vehicle age
df2['vehicle_age'] = df2['vehicle_age'].apply( lambda x: 'over_2_years' if x == '> 2 Years'
                                              else 'between_1_2_years' if x == '1-2 Year'
                                              else 'below_1_year' )
# vehicle damage
df2['vehicle_damage'] = df2['vehicle_damage'].apply( lambda x: 1 if x == 'Yes' 
                                                    else 0 )
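
An explicit mapping is one alternative to the chained conditionals above. Its advantage is that labels missing from the mapping show up as `NaN` instead of falling silently into the last bucket. The sketch below assumes the third raw label is `'< 1 Year'`, which the source does not show, and uses a made-up `'unknown'` label to demonstrate the check.

```python
import pandas as pd

vehicle_age_map = {
    "> 2 Years": "over_2_years",
    "1-2 Year": "between_1_2_years",
    "< 1 Year": "below_1_year",   # assumed raw label, not shown in the source
}

raw = pd.Series(["> 2 Years", "1-2 Year", "< 1 Year", "unknown"])
mapped = raw.map(vehicle_age_map)

# Labels missing from the mapping become NaN and can be listed explicitly.
unmapped = raw[mapped.isna()].unique().tolist()
print(mapped.tolist(), unmapped)
```

On the real data, an empty `unmapped` list would confirm that the three categories cover every row.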

In [None]:
# # data split - test, train, validation
# X = df2.drop('response', axis=1)
# y = df2['response'].copy()

# X_TRAIN, x_test, Y_TRAIN, y_test = ms.train_test_split( X, y, test_size=0.15 )

# # x_train, x_validation, y_train, y_validation = ms.train_test_split( X_TRAIN, Y_TRAIN, test_size=0.20 )

# df2 = pd.concat( [X_TRAIN, Y_TRAIN], axis=1 )

In [None]:
df2['vehicle_damage'].unique()

In [None]:
df2.dtypes

# <s>3.0. STEP 03 - VARIABLE FILTERING</s>

In [None]:
df3 = df2.copy()
df3.head()

## <s>3.1. Row Filtering</s>


## <s>3.2. Column Selection</s>

# 4.0. STEP 04 - EXPLORATORY DATA ANALYSIS

In [None]:
df4 = df3.copy()

## 4.1. Univariate Analysis

### 4.1.1. Response Variable

In [None]:
# plot graphic - histplot, etc. (distplot is deprecated in recent seaborn releases)
sns.histplot( df4['response'] );

### 4.1.2. Numerical Variable

#### 4.1.2.0 OVERVIEW

In [None]:
# plot graphic num_attributes - hist, etc..
num_attributes = df4.select_dtypes( include=['int64', 'float64'] )
num_attributes.hist( bins=25 );

#### 4.1.2.1 Age

In [None]:
sns.boxplot( x='response', y='age', data=df4 );

In [None]:
aux00 = df4.loc[df4['response'] == 0, 'age']
plt.subplot( 2, 1, 1 )
sns.histplot( aux00 );

plt.subplot( 2, 1, 2 )
aux00 = df4.loc[df4['response'] == 1, 'age']
sns.histplot( aux00 );

#### 4.1.2.2 Annual Premium

In [None]:
aux1 = df4[( df4['annual_premium'] > 10000 ) & ( df4['annual_premium'] < 80000 )]
sns.boxplot( x = 'response', y = 'annual_premium', data = aux1 );

Response = 0 & Response = 1

In [None]:
aux00 = aux1.loc[aux1['response'] == 0, 'annual_premium']
plt.subplot( 2, 1, 1 )
sns.histplot( aux00 );

plt.subplot( 2, 1, 2 )
aux00 = aux1.loc[aux1['response'] == 1, 'annual_premium']
sns.histplot( aux00 );

#### 4.1.2.3 Driving License


In [None]:
aux = df4[['driving_license', 'response']].groupby('response').sum().reset_index()
aux['driving_license_lic'] = aux['driving_license'] / aux['driving_license'].sum()
# aux.head()
sns.barplot(x='response', y='driving_license', data=aux)

#### 4.1.2.4 Region Code


In [None]:
ax0 = df4[['id', 'region_code', 'response']].groupby( ['region_code', 'response'] ).count().reset_index()
sns.scatterplot(x='region_code', y='id', hue='response', data=ax0);

#### 4.1.2.5 Previously insured


In [None]:
pd.crosstab( df4['previously_insured'], df4['response'] ).apply( lambda x: x / x.sum(), axis=1 )

#### 4.1.2.6 Vehicle Age


In [None]:
df4[['vehicle_age', 'response']].value_counts( normalize=True ).reset_index()

#### 4.1.2.7 Policy Sales Channel


In [None]:
plt.figure( figsize=(24,12) )
aux = df4[['policy_sales_channel', 'response']].groupby('policy_sales_channel').sum().reset_index()
sns.barplot( x='policy_sales_channel', y='response', data=aux );

#### 4.1.2.8 Vintage

In [None]:
sns.boxplot( x='response', y='vintage', data=df4 );

Response = 0 & Response = 1

In [None]:
aux00 = df4.loc[df4['response'] == 0, 'vintage']
plt.subplot( 2, 1, 1 )
sns.histplot( aux00 );

plt.subplot( 2, 1, 2 )
aux00 = df4.loc[df4['response'] == 1, 'vintage']
sns.histplot( aux00 );

### <s>4.1.3. Categorical Variable</s>

## 4.2. Bivariate Analysis

### 1. Customers with annual_premium >= 30564 (MEAN) are more interested in the offer

**TRUE** About 58.34% of the interested customers have annual_premium >= 30564.

In [None]:
aux = df4.copy()
aux['more_30564'] = df4['annual_premium'].apply(lambda x: 0 if x<30564 else 1)
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['more_30564']].groupby( 'more_30564' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='more_30564', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

### 2. Customers with age >= 49 (Q3) are less interested in the offer

**TRUE** About 68.98% of the interested customers are under 49 years old.

In [None]:
aux = df4.copy()
aux['age_49'] = df4['age'].apply(lambda x: 0 if x<49 else 1)
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['age_49']].groupby( 'age_49' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='age_49', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

### 3. Customers with vintage >= 227 (Q3) are more interested in the offer

**FALSE** About 74.92% of the interested customers have vintage < 227.

In [None]:
aux = df4.copy()
aux['vintage_227'] = df4['vintage'].apply(lambda x: 0 if x<227 else 1)
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['vintage_227']].groupby( 'vintage_227' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='vintage_227', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

### 4. Customers with driving_license == 0 (MIN) are less interested in the offer

**TRUE** Only about 0.10% of the interested customers have no driving license.

In [None]:
aux = df4.copy()
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['driving_license']].groupby( 'driving_license' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='driving_license', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

### 5. Customers with previously_insured == 0 (MIN) are more interested in the offer

**TRUE** About 99.67% of the interested customers were not previously insured.

In [None]:
aux = df4.copy()
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['previously_insured']].groupby( 'previously_insured' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='previously_insured', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

### 6. Customers with gender == 'Female' are less interested in the offer

**TRUE** Women account for about 39% of the interested customers, while men account for about 61%.

In [None]:
aux = df4.copy()
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['gender']].groupby( 'gender' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='gender', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

### 7. Customers with vehicle_damage == 0 (MIN) are less interested in the offer

**TRUE** About 97.97% of the interested customers have had their vehicle damaged in the past.

In [None]:
aux = df4.copy()
aux1 = aux[aux['response'] == 1]

aux1 = aux1[['vehicle_damage']].groupby( 'vehicle_damage' ).size().reset_index().rename(columns={0:'qtd'})
ax1 = sns.barplot( x='vehicle_damage', y='qtd', data=aux1 )

total = sum(aux1['qtd'])
graphic_percentage( ax1, total )

## Hypothesis Summary

In [None]:
tab = [['Hypothesis', 'Conclusion', 'Relevance'],
      ['H1', 'true', 'medium'],
      ['H2', 'true', 'medium'],
      ['H3', 'false', 'low'],
      ['H4', 'true', 'high'],
      ['H5', 'true', 'high'],
      ['H6', 'true', 'medium'],
      ['H7', 'true', 'high']]

print( tabulate( tab, headers='firstrow'))
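
The bar charts in this section show how the interested customers split across each group. A complementary view is the conversion rate inside each group, which accounts for the groups having different sizes. The toy example below computes both numbers for the age cut-off. The ten rows are invented and only illustrate the calculation.

```python
import pandas as pd

toy = pd.DataFrame({
    "age":      [25, 30, 35, 40, 45, 47, 50, 55, 60, 65],
    "response": [0,  0,  1,  1,  0,  1,  0,  1,  0,  0],
})
toy["age_49"] = (toy["age"] >= 49).astype(int)

# Share of the interested customers that falls in each group (what the bar charts show).
share_of_interested = toy.loc[toy["response"] == 1, "age_49"].value_counts(normalize=True)

# Conversion rate inside each group: interested customers / group size.
conversion_rate = toy.groupby("age_49")["response"].mean()

print(share_of_interested.sort_index())
print(conversion_rate)
```

In this toy data the under-49 group holds 75% of the interested customers and converts at 50%, against 25% for the older group. On the real data the two views may or may not agree, so it can be worth checking both before ranking the hypotheses.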

## 4.3. Multivariate Analysis

### 4.3.1 Numerical Attributes

In [None]:
correlation = num_attributes.corr( method='pearson')
sns.heatmap(correlation, annot=True )

### 4.3.2 Categorical Attributes

In [None]:
a = df4.select_dtypes( include='object' ).copy()
a['female'] = df4['gender'].apply( lambda x: 1 if x == 'Female' else 0 )
a['male'] = df4['gender'].apply( lambda x: 1 if x == 'Male' else 0 )

a['below_1_year'] = df4['vehicle_age'].apply( lambda x: 1 if x == 'below_1_year' else 0 )
a['between_1_2_years'] = df4['vehicle_age'].apply( lambda x: 1 if x == 'between_1_2_years' else 0 )
a['over_2_years'] = df4['vehicle_age'].apply( lambda x: 1 if x == 'over_2_years' else 0 )

In [None]:
a1 = cramer_v(a['female'], a['female'])
a2 = cramer_v(a['female'], a['male'])
a3 = cramer_v(a['female'], a['below_1_year'])
a4 = cramer_v(a['female'], a['between_1_2_years'])
a5 = cramer_v(a['female'], a['over_2_years'])

a6 = cramer_v(a['male'], a['female'])
a7 = cramer_v(a['male'], a['male'])
a8 = cramer_v(a['male'], a['below_1_year'])
a9 = cramer_v(a['male'], a['between_1_2_years'])
a10 = cramer_v(a['male'], a['over_2_years'])

a11 = cramer_v(a['below_1_year'], a['female'])
a12 = cramer_v(a['below_1_year'], a['male'])
a13 = cramer_v(a['below_1_year'], a['below_1_year'])
a14 = cramer_v(a['below_1_year'], a['between_1_2_years'])
a15 = cramer_v(a['below_1_year'], a['over_2_years'])

a16 = cramer_v(a['between_1_2_years'], a['female'])
a17 = cramer_v(a['between_1_2_years'], a['male'])
a18 = cramer_v(a['between_1_2_years'], a['below_1_year'])
a19 = cramer_v(a['between_1_2_years'], a['between_1_2_years'])
a20 = cramer_v(a['between_1_2_years'], a['over_2_years'])

a21 = cramer_v(a['over_2_years'], a['female'])
a22 = cramer_v(a['over_2_years'], a['male'])
a23 = cramer_v(a['over_2_years'], a['below_1_year'])
a24 = cramer_v(a['over_2_years'], a['between_1_2_years'])
a25 = cramer_v(a['over_2_years'], a['over_2_years'])

d = pd.DataFrame( {'female': [a1, a2, a3, a4, a5],
                   'male': [a6, a7, a8, a9, a10],
                   'below_1_year': [a11, a12, a13, a14, a15],
                   'between_1_2_years': [a16, a17, a18, a19, a20],
                   'over_2_years': [a21, a22, a23, a24, a25]})
d = d.set_index( d.columns )

sns.heatmap( d, annot=True, linewidths=.5)

## 4.4. Hypothesis Insights

**<u>KEEP AN EYE</u>**

From the charts used to check the hypotheses, the features that appear to matter are:

1) **driving_license**

2) **previously_insured**

3) **vehicle_damage**

4) **age**

5) **gender**

6) **vehicle_age**

<s>7) **region_code**</s>

Since the origin of **region_code** is not clearly explained, we should keep an eye on this feature going forward.

In the next steps, we should dig deeper with the help of an algorithm to estimate how relevant these features are to the model.



# 5.0. STEP 05 - DATA PREPARATION

In [None]:
df5 = df4.copy()

## 5.1 Standardization

In [None]:
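# Subtract the mean and divide by the standard deviation
# (named scaler_annual_premium so it does not shadow scipy.stats, imported as ss)
scaler_annual_premium = pp.StandardScaler().fit(df5[['annual_premium']])

# annual_premium
df5['annual_premium'] = scaler_annual_premium.transform( df5[['annual_premium']] )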

## 5.2 Rescaling

In [None]:
# Rescale to the [0, 1] interval, since these variables are not normally distributed
mms_age = pp.MinMaxScaler().fit( df5[['age']] )
mms_vintage = pp.MinMaxScaler().fit( df5[['vintage']] )

# age
df5['age'] = mms_age.transform( df5[['age']] )

# vintage
df5['vintage'] = mms_vintage.transform( df5[['vintage']] )
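
For reference, the two transformations used in 5.1 and 5.2 reduce to simple formulas. The sketch below applies them by hand with NumPy on a made-up array. `StandardScaler` uses the population standard deviation, which is also the default of `np.std`.

```python
import numpy as np

x = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Standardization: z = (x - mean) / std, with the population standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-max rescaling: (x - min) / (max - min), which maps the values to [0, 1].
rescaled = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(rescaled)
```

Standardization centers the values on 0 with unit variance, while min-max rescaling keeps the shape of the distribution and only changes its range.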

## 5.3 Encoding

In [None]:
# Convert categorical variables to numeric while respecting their nature - values between 0 and 1
# gender - Frequency Encoding / *Target Encoding / Weighted Target Encoding
target_gender = df5.groupby( 'gender' )['response'].mean()
df5.loc[:, 'gender'] = df5['gender'].map( target_gender )

# region_code - Frequency Encoding / *Target Encoding / Weighted Target Encoding
target_region = df5.groupby( 'region_code' )['response'].mean()
df5.loc[:, 'region_code'] = df5['region_code'].map( target_region )

# vehicle_age - One Hot Encoding / Order Encoding / Frequency Encoding / *Target Encoding / Weighted Target Encoding
target_vehicle_age = df5.groupby( 'vehicle_age' )['response'].mean()
df5.loc[:, 'vehicle_age'] = df5['vehicle_age'].map( target_vehicle_age )

# policy_sales_channel - Target Encoding / *Frequency Encoding
fe_policy = df5.groupby( 'policy_sales_channel' ).size() / len( df5 )
df5.loc[:, 'policy_sales_channel'] = df5['policy_sales_channel'].map( fe_policy )
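
The encoders above are fitted on the whole of `df5`, before the train/test split. One possible variant, sketched below under assumptions, fits the target and frequency encoders on the training rows only and then applies them to unseen rows, so the test target does not influence the encoding. The toy data, the `new_rows` frame and the fallback values are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up.
train = pd.DataFrame({
    'region_code': [28, 28, 8, 8, 8, 41],
    'policy_sales_channel': [26, 26, 152, 152, 152, 124],
    'response': [1, 0, 0, 0, 1, 0],
})
new_rows = pd.DataFrame({'region_code': [28, 99], 'policy_sales_channel': [152, 7]})

# Fit the encoders on the training rows only.
target_region = train.groupby('region_code')['response'].mean()
fe_policy = train.groupby('policy_sales_channel').size() / len(train)

# Categories never seen in training map to NaN, so fall back to the global
# response rate (target encoding) or to 0 (frequency encoding).
global_rate = train['response'].mean()
new_rows['region_code'] = new_rows['region_code'].map(target_region).fillna(global_rate)
new_rows['policy_sales_channel'] = new_rows['policy_sales_channel'].map(fe_policy).fillna(0.0)

print(new_rows)
```

The same fallback question arises in production, where the API may receive a `region_code` or `policy_sales_channel` that was absent from the training data.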

## 5.4 Data Split

In [None]:
df5.head()

In [None]:
# data split - test, train, validation
X = df5.drop('response', axis=1)
y = df5['response'].copy()

X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = ms.train_test_split( X, y, test_size=0.2, shuffle=True, stratify=y )

df5 = pd.concat( [X_TRAIN, Y_TRAIN], axis=1 )

# 6.0. STEP 06 - FEATURE SELECTION

In [None]:
df6 = df5.copy()

In [None]:
x_train_n = X_TRAIN.drop( 'id', axis=1 )
y_train_n = Y_TRAIN.copy()

In [None]:
x_train_n.head()

## 6.1 Feature Importance

In [None]:
# feature importance
# x_train_n = x_train.drop( 'id', axis=1 ).copy()
# y_train_n = y_train.copy()

# define RandomForestClassifier
model = RandomForestClassifier()

# fit the model
model.fit(x_train_n, y_train_n)

In [None]:
# get importance
importance = model.feature_importances_

# summarize feature importance
for i,v in enumerate( importance ):
    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance
feat_imp = pd.DataFrame( {'feature': x_train_n.columns,
                          'feature_importance': importance} ).sort_values( 'feature_importance', ascending=False ).reset_index( drop=True )

sns.barplot( x='feature_importance', y='feature', data=feat_imp, orient='h', color='royalblue' );

### 6.1.1 Columns Selected

Here we can see that **7 columns** are relevant to our **model**, all with a **feature importance** above **0.05**.

In [None]:
cols_selected = ['vintage', 'annual_premium', 'age', 'region_code', 'vehicle_damage', 'policy_sales_channel', 'previously_insured']
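
The list above is typed by hand. A minimal sketch of deriving it from the importance table with the same 0.05 cut-off could look like this. The importance values below are hypothetical placeholders; the real ones come from `model.feature_importances_`.

```python
import pandas as pd

# Hypothetical importances, for illustration only; the real values come from
# model.feature_importances_ in the cell above.
feat_imp = pd.DataFrame({
    'feature': ['vintage', 'annual_premium', 'age', 'gender', 'driving_license'],
    'feature_importance': [0.27, 0.24, 0.16, 0.005, 0.0005],
})

threshold = 0.05  # cut-off quoted in the text
selected = feat_imp.loc[feat_imp['feature_importance'] > threshold, 'feature'].tolist()
print(selected)
```

Deriving the list this way keeps it in sync with the model whenever the Random Forest is refitted.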

# 7.0. STEP 07 - MACHINE LEARNING MODELING

In [None]:
df7 = df6.copy()

In [None]:
X_TRAIN = X_TRAIN[cols_selected]
X_TEST = X_TEST[cols_selected]

## 7.1 Comparing Models

In [None]:
models_performance = pd.DataFrame()
models_performance1 = pd.DataFrame()
models_list=[KNeighborsClassifier(n_jobs=-1),
             LogisticRegression(penalty='l2', solver='newton-cg'),
             GaussianNB(),
             LGBMClassifier(),
             xgb.XGBClassifier(objective='binary:logistic',
                               eval_metric='error',
                               n_estimators=100,
                               eta=0.01,
                               max_depth=10,
                               subsample=0.7,
                               colsample_bytree=0.9),
             RandomForestClassifier(),
             DecisionTreeClassifier(criterion='entropy', random_state=0)]

ma = modeling( models_list, X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, False )
ma
# me = pd.concat( [me, ma] )
# me.reset_index().reset_index().drop( ['index'], axis=1 )

### 7.1.1 KNN Model

In [None]:
# model prediction - generalization power
knn_model = models_list[0].fit( X_TRAIN, Y_TRAIN )
yhat_knn = knn_model.predict_proba( X_TEST )
yhat = knn_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

### 7.1.2 Logistic Regression Model

In [None]:
# model prediction - generalization power
lr_model = models_list[1].fit( X_TRAIN, Y_TRAIN )
yhat_lr = lr_model.predict_proba( X_TEST )
yhat = lr_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

### 7.1.3 GaussianNB Model

In [None]:
# model prediction - generalization power
from sklearn.metrics import plot_confusion_matrix
gnb_model = models_list[2].fit( X_TRAIN, Y_TRAIN )
yhat_gbn = gnb_model.predict_proba( X_TEST )
yhat = gnb_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

### 7.1.4 LGBM Model

In [None]:
# model prediction - generalization power
lgbm_model = models_list[3].fit( X_TRAIN, Y_TRAIN )
yhat_lgbm = lgbm_model.predict_proba( X_TEST )
yhat = lgbm_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

### 7.1.5 XGB Model

In [None]:
# model prediction - generalization power
xgb_model = models_list[4].fit( X_TRAIN, Y_TRAIN )
yhat_xgb = xgb_model.predict_proba( X_TEST )
yhat = xgb_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

### 7.1.6 Random Forest Classifier

In [None]:
# model prediction - generalization power
rf_model = models_list[5].fit( X_TRAIN, Y_TRAIN )
yhat_rf = rf_model.predict_proba( X_TEST )
yhat = rf_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

### 7.1.7 Decision Tree

In [None]:
# model prediction - generalization power
dt_model = models_list[6].fit( X_TRAIN, Y_TRAIN )
yhat_dt = dt_model.predict_proba( X_TEST )
yhat = dt_model.predict( X_TEST )
print( classification_report( Y_TEST, yhat ) )

## 7.2 Comparing Cumulative Curve

### 7.2.1 K Nearest Neighbors

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_knn, figsize=(10, 10), title='K Nearest Neighbors' );

### 7.2.2 Logistic Regression

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_lr, figsize=(10, 10), title='Logistic Regression' );

### 7.2.3 GaussianNB

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_gbn, figsize=(10, 10), title='GaussianNB' );

### 7.2.4 LightGBM

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_lgbm, figsize=(10, 10), title='LGBM' );

### 7.2.5 XGBoost

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_xgb, figsize=(10, 10), title='XGBoost' );

### 7.2.6 Random Forest Classifier

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_rf, figsize=(10, 10), title='Random Forest' );

### 7.2.7 Decision Tree

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_dt, figsize=(10, 10), title='Decision Tree' );

## 7.3 Comparing Lift Curve

### 7.3.1 K Nearest Neighbors

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_knn, figsize=(12,6), title='K Nearest Neighbors' );

### 7.3.2 Logistic Regression

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_lr, figsize=(12,6), title='Logistic Regression' );

### 7.3.3 GaussianNB

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_gbn, figsize=(12,6), title='GaussianNB' );

### 7.3.4 LightGBM

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_lgbm, figsize=(12,6), title='LGBM' );

### 7.3.5 XGBoost

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_xgb, figsize=(12,6), title='XGBoost' );

### 7.3.6 Random Forest

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_rf, figsize=(12,6), title='Random Forest' );

### 7.3.7 Decision Tree

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_dt, figsize=(12,6), title='Decision Tree' );
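
To make the two kinds of curves above easier to read, here is a minimal sketch of how a single point on each can be computed by hand. The function name `gain_and_lift` and the toy labels and scores are hypothetical; the plots themselves come from `scikit-plot`.

```python
import numpy as np

def gain_and_lift(y_true, score, perc):
    """Cumulative gain and lift for the top `perc` fraction ranked by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(score), kind='stable')  # highest score first
    n = len(y_true)
    k = int(np.floor(n * perc))
    gain = y_true[order][:k].sum() / y_true.sum()  # share of all positives captured
    lift = gain / (k / n)                          # relative to random selection
    return gain, lift

# Hypothetical labels and scores, already listed from highest to lowest score.
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gain, lift = gain_and_lift(y_true, score, 0.2)
print('gain: {:.3f}, lift: {:.3f}'.format(gain, lift))
```

In this toy case the top 20% of the ranking holds two of the three positives, so the gain is about 0.67 and the lift is about 3.33: the ranked list finds positives a bit over three times faster than calling at random.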

# 8.0. STEP 08 - CROSS VALIDATION

In [None]:
df8 = df7.copy()
models_performance = pd.DataFrame()
avg_performance = pd.DataFrame()

## 8.1 K Nearest Neighbors

In [None]:
performance, avg = cross_validation( 'KNN', models_list[0], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

In [None]:
models_performance

## 8.2 Logistic Regression

In [None]:
performance, avg = cross_validation( 'Logistic', models_list[1], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

## 8.3 GaussianNB

In [None]:
performance, avg = cross_validation( 'GaussianNB', models_list[2], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

## 8.4 LightGBM

In [None]:
performance, avg = cross_validation( 'LGBM', models_list[3], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

## 8.5 XGBoost

In [None]:
performance, avg = cross_validation( 'XGBoost', models_list[4], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

## 8.6 Random Forest

In [None]:
performance, avg = cross_validation( 'RF', models_list[5], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

## 8.7 Decision Tree

In [None]:
performance, avg = cross_validation( 'Decision Tree', models_list[6], X_TRAIN, Y_TRAIN, 10 )
models_performance = models_performance.append( performance )
avg_performance = avg_performance.append( avg )

## 8.8 Cross Validation Performance

In [None]:
avg_performance.sort_values( 'Recall', ascending=False )

In [None]:
models_performance.sort_values( 'Precision', ascending=False )
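
The `cross_validation` helper used in this section is defined elsewhere in the notebook and is not shown here. As an assumption about what such a helper might do, the sketch below runs a stratified k-fold loop and scores each fold with recall in the top 20% of the ranking. The synthetic data, the `recall_at_k` function and the choice of `GaussianNB` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

def recall_at_k(y_true, proba, perc=0.2):
    """Share of all positives found in the top `perc` fraction ranked by score."""
    order = np.argsort(-proba)
    k = int(len(y_true) * perc)
    return y_true[order][:k].sum() / y_true.sum()

# Synthetic, imbalanced data standing in for X_TRAIN / Y_TRAIN (about 12% positives).
X, y = make_classification(n_samples=1000, n_features=7, weights=[0.88, 0.12], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in skf.split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])[:, 1]
    scores.append(recall_at_k(y[val_idx], proba))

print('recall@20%: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))
```

Stratified folds keep the share of interested customers roughly constant across folds, which matters with a target this imbalanced.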

# 9.0 STEP 09 - HYPERPARAMETER FINE TUNING

Here I selected the

In [None]:
max_eval = 5
me = pd.DataFrame()

## 9.1 KNN

In [None]:
knn_parameter = {'n_neighbors': [2, 3, 5],
                 'weights': ['uniform', 'distance'],
                 'leaf_size': [10, 20, 30, 40, 50],
                 'p': [1, 2, 3, 4, 5],
                 'n_jobs': [-1]}

In [None]:
# Random hyperparameter selection
for i in range( max_eval ):
    hp = {k: rd.sample( v, 1 )[0] for k, v in knn_parameter.items()}
    print( hp )
    model_knn = KNeighborsClassifier( n_neighbors = hp['n_neighbors'],
                                      weights = hp['weights'],
                                      leaf_size = hp['leaf_size'],
                                      p = hp['p'],
                                      n_jobs = hp['n_jobs'] ).fit( X_TRAIN, Y_TRAIN )

    ma = modeling( [model_knn], X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, False )
    me = pd.concat( [me, ma] )
me.reset_index().drop( ['index'], axis=1 )
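
The loop above draws each hyperparameter independently on every iteration, so the same combination can be drawn twice and results change between runs. A minimal sketch of a seeded sampler that skips repeated combinations is shown below. The helper `sample_params` and the seed value are hypothetical additions, not part of the notebook.

```python
import random

knn_parameter = {'n_neighbors': [2, 3, 5],
                 'weights': ['uniform', 'distance'],
                 'leaf_size': [10, 20, 30, 40, 50],
                 'p': [1, 2, 3, 4, 5],
                 'n_jobs': [-1]}

def sample_params(grid, rng):
    """Draw one value per hyperparameter from the grid."""
    return {name: rng.choice(values) for name, values in grid.items()}

rng = random.Random(42)   # fixed seed so the search is reproducible
max_eval = 5
seen = set()
candidates = []
while len(candidates) < max_eval:
    hp = sample_params(knn_parameter, rng)
    key = tuple(sorted(hp.items()))
    if key not in seen:   # skip combinations that were already drawn
        seen.add(key)
        candidates.append(hp)

for hp in candidates:
    print(hp)
```

Each dictionary in `candidates` can then be unpacked into the estimator, for example `KNeighborsClassifier( **hp )`.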

## 9.2 Random Forest

In [None]:
rf_parameter = {'bootstrap': [True, False],
                'max_depth': [10, 30, 50, 70, 90, None],
                'max_features': ['auto', 'sqrt'],
                'min_samples_leaf': [1, 2, 4],
                'min_samples_split': [2, 5, 9],
                'n_estimators': [200, 400, 800, 1200],
                'n_jobs': [-1],
                'random_state': [420]}

In [None]:
# # Random hyperparameter selection
# for i in range ( max_eval ):
#     hp = {k: rd.sample( v, 1 )[0] for k, v in rf_parameter.items()}
#     print( hp )
#     model_rf = RandomForestClassifier( bootstrap = hp['bootstrap'],
#                                        max_depth = hp['max_depth'],
#                                        max_features = hp['max_features'],
#                                        min_samples_leaf = hp['min_samples_leaf'],
#                                        min_samples_split = hp['min_samples_split'],
#                                        n_estimators = hp['n_estimators'],
#                                        n_jobs = hp['n_jobs'],
#                                        random_state = hp['random_state'] ).fit( X_TRAIN, Y_TRAIN )
                                                                  
#     ma = modeling( [model_rf], X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, False )
#     me = pd.concat( [me, ma] )
# me.reset_index().drop( ['index'], axis=1 )

## 9.3 XGBoost

In [None]:
# xgb_parameter = {'n_estimators': [200, 400, 800, 1200, 1600],
#                  'eta': [0.01, 0.03, 0.05],
#                  'max_depth': [3, 5, 7, 9],
#                  'subsample': [0.1, 0.5, 0.7],
#                  'colsample_bytree': [0.3, 0.7, 0.9]}

In [None]:
# # Random hyperparameter selection
# for i in range ( max_eval ):
#     hp = {k: rd.sample( v, 1 )[0] for k, v in xgb_parameter.items()}
#     print( hp )
#     model_xgb = xgb.XGBClassifier( objective = 'binary:logistic',
#                                    n_estimators = hp['n_estimators'],
#                                    eta = hp['eta'],
#                                    max_depth = hp['max_depth'],
#                                    subsample = hp['subsample'],
#                                    colsample_bytree = hp['colsample_bytree'] ).fit( X_TRAIN, Y_TRAIN )
                                                                  
#     ma = modeling( [model_xgb], X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, False )
#     me = pd.concat( [me, ma] )
# me.reset_index().drop( ['index'], axis=1 )

## 9.4 GaussianNB

In [None]:
# # NO PARAMETERS TO FINE TUNE
# model_gnb = GaussianNB()
# ma = modeling( [model_gnb], X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, False  )
# me = pd.concat( [me, ma] ).reset_index().drop('index', axis=1)

## 9.4 Hyperparameter Performance

In [None]:
me.sort_values( 'Recall', ascending=False )

The best results are:

**Random Forest** with a SCORE of **87.7%**

**KNN** with a SCORE of **87.6%**

**GNB** with a RECALL of **90%**

We will therefore keep analyzing all three to be sure which one to use, since the first two achieved very similar SCOREs and GNB has a much higher Recall.

In [None]:
knn = KNeighborsClassifier( n_neighbors = 5,
                            weights = 'distance',
                            leaf_size = 40,
                            p = 1,
                            n_jobs = -1).fit( X_TRAIN, Y_TRAIN )

knn1 = KNeighborsClassifier( n_neighbors = 2,
                            weights = 'distance',
                            leaf_size = 30,
                            p = 1,
                            n_jobs = -1).fit( X_TRAIN, Y_TRAIN )
gnb = GaussianNB().fit( X_TRAIN, Y_TRAIN )
rf = RandomForestClassifier().fit( X_TRAIN, Y_TRAIN )

# lgbm = LGBMClassifier( num_leaves = 75,                             
#                        max_depth = 5,
#                        min_split_gain = 0.1,
#                        min_child_weight = 3,
#                        subsample = 1.0,
#                        colsample_bytree = 0.7 ).fit( X_TRAIN, Y_TRAIN )

# rf = RandomForestClassifier( bootstrap = True,
#                              max_depth = 30,
#                              max_features = 'sqrt',
#                              min_samples_leaf = 4,
#                              min_samples_split = 9,
#                              n_estimators = 1200,
#                              n_jobs = -1,
#                              random_state = 420 ).fit( X_TRAIN, Y_TRAIN )
models_tunned = [knn, knn1, gnb, rf]
                                

mode = modeling( models_tunned, X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, False )
mode

## 9.5 Final Model

### 9.5.1 GNB Tuned Model

In [None]:
yhat_gnb = gnb.predict( X_TEST )
yhat_proba_gnb = gnb.predict_proba( X_TEST )
x_test_copy = X_TEST.copy()
x_test_copy['score_gnb'] = yhat_gnb.tolist()
x_test_copy['score_proba_gnb'] = yhat_proba_gnb[:,1].tolist()
x_test_copy['true_response'] = Y_TEST.copy()

### 9.5.2 Random Forest Tuned Model

In [None]:
yhat_rf = rf.predict( X_TEST )
yhat_proba_rf = rf.predict_proba( X_TEST )
x_test_copy['score_rf'] = yhat_rf.tolist()
x_test_copy['score_proba_rf'] = yhat_proba_rf[:,1].tolist()
# x_test_copy['score rf'] = x_test_copy[:,1].tolist()

### 9.5.3 KNN Tuned Model

In [None]:
yhat_knn = knn.predict( X_TEST )
yhat_proba_knn = knn.predict_proba( X_TEST )

x_test_copy['score_knn'] = yhat_knn.tolist()
x_test_copy['score_proba_knn'] = yhat_proba_knn[:,1].tolist()
# x_test_copy['score rf'] = x_test_copy[:,1].tolist()

In [None]:
x_test_copy[x_test_copy['true_response'] == 1].sample(10)

### 9.5.3 Comparing Tuned Models' Cumulative Curve

In [None]:
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_proba_gnb, figsize = ( 10, 10 ), title = 'GNB - Cumulative Gain' );
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_proba_rf, figsize = ( 10, 10 ), title = 'RF - Cumulative Gain' );
skplt.metrics.plot_cumulative_gain( Y_TEST, yhat_proba_knn, figsize = ( 10, 10 ), title = 'KNN - Cumulative Gain' );

### 9.5.4 Comparing Tuned Models' LIFT Curve

In [None]:
skplt.metrics.plot_lift_curve( Y_TEST, yhat_proba_gnb, figsize = ( 10, 10 ), title = 'GNB - LIFT Curve' );
skplt.metrics.plot_lift_curve( Y_TEST, yhat_proba_rf, figsize = ( 10, 10 ), title = 'RF - LIFT Curve' );
skplt.metrics.plot_lift_curve( Y_TEST, yhat_proba_knn, figsize = ( 10, 10 ), title = 'KNN - LIFT Curve' );

### 9.5.5 Comparing Precision and Recall

In [None]:
precision_gnb = precision_at_k( X_TEST, yhat_proba_gnb, 'score', 0.2 )
recall_gnb = recall_at_k( X_TEST, yhat_proba_gnb, 'score', 0.2 )

precision_rf = precision_at_k( X_TEST, yhat_proba_rf, 'score', 0.2 )
recall_rf = recall_at_k( X_TEST, yhat_proba_rf, 'score', 0.2 )

precision_knn = precision_at_k( X_TEST, yhat_proba_knn, 'score', 0.2 )
recall_knn = recall_at_k( X_TEST, yhat_proba_knn, 'score', 0.2 )

print( 'GNB Precision at K: {}'.format( precision_gnb ) )
print( 'GNB Recall at K: {}'.format( recall_gnb ) )

print( '---------- ## ----------' )

print( 'RF Precision at K: {}'.format( precision_rf ) )
print( 'RF Recall at K: {}'.format( recall_rf ) )

print( '---------- ## ----------' )

print( 'KNN Precision at K: {}'.format( precision_knn ) )
print( 'KNN Recall at K: {}'.format( recall_knn ) )
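
The `precision_at_k` and `recall_at_k` helpers called above are defined elsewhere in the notebook. As an assumption about what these metrics measure, here is a minimal self-contained sketch that ranks customers by score and evaluates the top fraction against the true responses. The function name `precision_recall_at_k` and the toy data are hypothetical.

```python
import pandas as pd

def precision_recall_at_k(y_true, score, perc=0.2):
    """Precision and recall within the top `perc` fraction ranked by score."""
    df = pd.DataFrame({'response': list(y_true), 'score': list(score)})
    df = df.sort_values('score', ascending=False).reset_index(drop=True)
    k = max(int(len(df) * perc), 1)
    hits = df['response'].iloc[:k].sum()   # true positives among the top k
    precision = hits / k                   # share of the k calls that hit
    recall = hits / df['response'].sum()   # share of all positives reached
    return precision, recall

# Hypothetical responses and scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
score = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

precision, recall = precision_recall_at_k(y_true, score, perc=0.4)
print('precision@k: {:.2f}, recall@k: {:.2f}'.format(precision, recall))
```

Note that both metrics need the true response column: computing them from the predicted scores alone would only summarize the scores themselves.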

In [1]:
proba = [yhat_proba_gnb, yhat_proba_rf, yhat_proba_knn]
perc = [0.01, 0.1, 0.2, 0.4, 0.5]

In [None]:
df_final_performance = pd.DataFrame()
df_final_performance = top_k_performance( X_TEST, proba, 'score', perc)
df_final_performance.loc[:4, 'Model'] = 'GNB'
df_final_performance.loc[5:9, 'Model'] = 'Random Forest'
df_final_performance.loc[10:, 'Model'] = 'KNN'


In [None]:
X_TEST.shape[0]

In [None]:
df_final_performance.sort_values(['Model', 'perc'], ascending=True)

In [None]:
# k -- number of top-ranked customers to sample
k = 38111

# aff -- fraction of the original test dataset that k represents
aff = k / x_test_copy.shape[0]

# target_at_k -- sort by the model's score, take the first "k" rows and count how many are true positives
target_at_k = x_test_copy.sort_values('score_proba_gnb', ascending=False)[:k]
# target_at_k['true_response'].sum()
target_at_k = target_at_k[target_at_k['true_response'] == 1]['true_response'].count()

# target_total -- all rows with true_response = 1 in the data; output = 9342
target_total = x_test_copy[x_test_copy['true_response'] == 1]['true_response'].count()

# perc_target -- target_at_k divided by target_total
perc_target = target_at_k/target_total

perc_target

# aff

In [None]:
# def top_k_performance( df, proba, response, perc ):
#     df_final_performance = pd.DataFrame()
#     for i in proba:   
#         for j in perc:
#             k = int( np.floor( len( df ) * j ) )
            
#             target_total = int( df['score'].count()/2 )
            
#             df['score'] = i[:, 1].tolist()
#             df = df.sort_values( 'score', ascending=False )
            
#             target_at_k = df['score'][:k].count()
#             target_perc = target_at_k / target_total

#             precision = precision_at_k( df, i, response, j )
#             recall = recall_at_k( df, i, response, j )

#             df_final_performance = df_final_performance.append( {'Model': 'Model',
#                                                                  'perc': j,
#                                                                  'k': k,
#                                                                  'precision': precision,
#                                                                  'recall': recall,
#                                                                  'target_total': target_total,
#                                                                  'target_at_k': target_at_k,
#                                                                  'perc_target': target_perc}, ignore_index=True)
#     return df_final_performance

# def precision_at_k( df, yhat_proba, target, perc = 0.25 ):
#     k = int( np.floor( len( df ) * perc ) )
    
#     df[target] = yhat_proba[:, 1].tolist()
#     df = df.sort_values( target, ascending=False ).reset_index( drop=True )
#     df['ranking'] = df.index + 1
#     df['precision_at_k'] = df[target].cumsum() / df['ranking']

#     return df.loc[k, 'precision_at_k']

# def recall_at_k( df, yhat_proba, target, perc = 0.25):
#     k = int( np.floor( len( df ) * perc ) )

#     df[target] = yhat_proba[:, 1].tolist()
#     df = df.sort_values( target, ascending=False).reset_index( drop = True )
#     df['recall_at_k'] = df[target].cumsum() / df[target].sum()
    
#     return df.loc[k, 'recall_at_k']

### 9.5.6 Final Considerations about what kind of Model to use

In [None]:
# pickle.dump( rf, open( '../model/Random_Forest_Model.pkl' , 'wb' ) )
# pickle.dump( lgbm, open( '../model/LGBM_Model.pkl','wb' ) )

# rf_size = os.stat( '../model/Random_Forest_Model.pkl' ).st_size / 1024
# lgbm_size = os.stat( '../model/LGBM_Model.pkl' ).st_size / 1024

# print( 'Random Forest model size: {0:.2f} KB'.format( rf_size ) )
# print( 'LGBM model size: {0:.2f} KB'.format( lgbm_size ) )

Since the models perform similarly, we chose to use the **LGBM** model in production: the **Random Forest** is roughly **7000 times** the size of the **LGBM**, so this choice minimizes the company's **cloud storage** cost.
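
The size comparison can also be made without writing files, by measuring the pickled bytes in memory. The sketch below uses two plain Python objects as hypothetical stand-ins for the fitted models; `pickled_size_kb` is an illustrative helper, not part of the notebook.

```python
import pickle

def pickled_size_kb(obj):
    """Size of the pickled object in kilobytes, without touching the disk."""
    return len(pickle.dumps(obj)) / 1024

# Hypothetical stand-ins for a small and a large fitted model.
small_model = {'params': list(range(10))}
large_model = {'params': list(range(100_000))}

small_kb = pickled_size_kb(small_model)
large_kb = pickled_size_kb(large_model)
print('small: {:.2f} KB, large: {:.2f} KB, ratio: {:.0f}x'.format(small_kb, large_kb, large_kb / small_kb))
```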

### 9.5.7 Defining Threshold

In [None]:
threshold = [0.7, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9, 0.95]
threshold_performance = pd.DataFrame()

for i in threshold:
    calc_threshold = lambda x: 0 if x < i else 1
    prediction = list( map( calc_threshold, yhat_proba[:,1] ) )
    threshold_performance = threshold_performance.append( multi_class_metrics( i, Y_TEST, prediction, 0 ) )

threshold_performance.reset_index().drop( ['index', 'Index'], axis=1 )

We can see that a good threshold for our model is around **0.8**.
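
The `multi_class_metrics` helper used in the sweep is defined elsewhere in the notebook. As a minimal sketch of the same idea, the code below applies each threshold to the predicted probabilities and reports precision and recall. The function `sweep_thresholds` and the toy labels and probabilities are hypothetical.

```python
import numpy as np

def sweep_thresholds(y_true, proba, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    rows = []
    for t in thresholds:
        pred = (proba >= t).astype(int)  # same rule as: 0 if x < t else 1
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows

# Hypothetical labels and probabilities, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
proba = [0.95, 0.85, 0.82, 0.60, 0.40, 0.10]

rows = sweep_thresholds(y_true, proba, [0.5, 0.8, 0.9])
for t, precision, recall in rows:
    print('threshold {:.2f}: precision {:.2f}, recall {:.2f}'.format(t, precision, recall))
```

Raising the threshold trades recall for precision, which is why the choice depends on how many calls the sales team can afford.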

### 9.5.8 Saving all parameters and model

In [None]:
# pickle.dump( lgbm, open( '/content/drive/MyDrive/Colab/model/pa004/LGBM_Model.pkl', 'wb' ) )
# pickle.dump( ss, open( '/content/drive/MyDrive/Colab/parameter/pa004/annual_premium_scaler.pkl', 'wb' ) )
# pickle.dump( mms_age, open( '/content/drive/MyDrive/Colab/parameter/pa004/age_scaler.pkl', 'wb' ) )
# pickle.dump( mms_vintage, open( '/content/drive/MyDrive/Colab/parameter/pa004/vintage_scaler.pkl', 'wb' ) )
# pickle.dump( target_gender, open( '/content/drive/MyDrive/Colab/parameter/pa004/gender_scaler.pkl', 'wb' ) )
# pickle.dump( target_region, open( '/content/drive/MyDrive/Colab/parameter/pa004/region_code_scaler.pkl', 'wb' ) )
# pickle.dump( target_vehicle_age, open( '/content/drive/MyDrive/Colab/parameter/pa004/vehicle_age_scaler.pkl', 'wb' ) )
# pickle.dump( fe_policy, open( '/content/drive/MyDrive/Colab/parameter/pa004/policy_sales_channel_scaler.pkl', 'wb' ) )

# 10.0. STEP 10 - BUSINESS QUESTIONS

## 10.1 Insights Learned

### 1) Attributes Insights

About **12%** of the dataset reported being **interested** in the cross-sell offered by **Insurance All**.

---

The "**annual_premium**" feature has a **RANGE** very close to its **MAX**, which may indicate the **presence of outliers**; these should be studied in the next cycles.

---

Some **features** that should be in the **dataset** for a **better understanding** and **development of the business model** are missing, as shown in section **2.2 Business Search**; this should be reported to the **business team**.

---

The "**age**" feature shows a stronger preference for the cross-sell offer among customers aged **40 to 50**.

---

The "**region_code**" feature shows a stronger preference for the offer for some of its values, but **we do not have enough detail** to explain why.

---

The "**vehicle_age**" feature shows a stronger preference for the offer among customers whose vehicles are **1 to 2 years old**.

---

The "**policy_sales_channel**" feature shows a stronger preference for the offer for some of its values, but here too **we do not have enough detail** to explain why.

---

The "**vehicle_damage**" feature shows that people whose car **has already been damaged** are more interested in the offer.

---

We could therefore observe that the features that appear to be important are:

**1) driving_license**

**2) previously_insured**

**3) vehicle_damage**

**4) age**

**5) gender**

**6) vehicle_age**

<s>**7) region_code**</s>

## 10.2 Some possible questions asked by the CEO

### 10.2.1 What **percentage of customers** interested in purchasing vehicle insurance will the sales team reach by making **5,000 calls**? And what is the **financial return**, compared with the **random model**, if each vehicle insurance policy costs **1000 reais**?

In [None]:
calls = 5000
total = X_TEST.shape[0]
perc_df = calls / total
price = 1000
resposta = top_k_performance( X_TEST, [yhat_proba], 'score', [perc_df] )
print( 'The dataset has {} rows in total'.format( total ) )
print( 'And 5000 calls represent {0:.2f}% of the dataset'.format( perc_df*100 ) )
resposta['R$ GNB Model'] = resposta['target_at_k'] * price
resposta['target_random'] = resposta['perc'] * resposta['target_total']
resposta['R$ Random Model'] = resposta['target_random'] * price
resposta['R$ Final'] = resposta['R$ GNB Model'] - resposta['R$ Random Model']
resposta['% Final'] = resposta['R$ GNB Model'] * 100 / resposta['R$ Random Model']
resposta[['k', 'perc_target', 'R$ GNB Model', 'R$ Random Model', 'R$ Final', '% Final']]

**A:**

If we call **5000 people**, we will be calling **6.56%** of our **dataset (x_test)** and will reach around **14.9%** of all potentially interested customers.

That is about **2.3 times** more than random selection, according to the **LIFT Curve** of the **GNB** model.

This brings a return **R$ 2,738,901** higher than the random approach.
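
The lift quoted above follows directly from the two percentages: the share of interested customers reached divided by the share of the dataset called. The sketch below reproduces that ratio and the revenue comparison made in the cell. The two percentages come from the answer above; the customer counts passed to `extra_revenue` are hypothetical round numbers, not the notebook's results.

```python
def lift(recall_at_k, perc_called):
    """How many times better than random the ranked list is."""
    return recall_at_k / perc_called

def extra_revenue(target_at_k, target_total, perc_called, price):
    """Revenue of the ranked list minus the expected revenue of random calls."""
    model_revenue = target_at_k * price
    random_revenue = perc_called * target_total * price
    return model_revenue - random_revenue

# Percentages reported for 5,000 calls: 14.9% of interested customers, 6.56% of the dataset.
lift_5000 = lift(0.149, 0.0656)
print('lift: {:.2f}'.format(lift_5000))

# Hypothetical counts: 300 interested customers reached out of 1000, calling 10% of the base.
extra = extra_revenue(target_at_k=300, target_total=1000, perc_called=0.10, price=1000)
print('extra revenue: R$ {:,.0f}'.format(extra))
```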

### 10.2.2 What if we **increase** the number of calls to **10,000**?

In [None]:
calls = 10000
perc_df = calls / total
price = 1000
resposta = top_k_performance( X_TEST, [yhat_proba], 'score', [perc_df] )
print( 'The dataset has {} rows in total'.format( total ) )
print( 'And 10000 calls represent {0:.2f}% of the dataset'.format( perc_df*100 ) )
resposta['R$ LGBM Model'] = resposta['target_at_k'] * price
resposta['target_random'] = resposta['perc'] * resposta['target_total']
resposta['R$ Random Model'] = resposta['target_random'] * price
resposta['R$ Final'] = resposta['R$ LGBM Model'] - resposta['R$ Random Model']
resposta['% Final'] = resposta['R$ LGBM Model'] * 100 / resposta['R$ Random Model']
resposta[['k', 'perc_target', 'R$ LGBM Model', 'R$ Random Model', 'R$ Final', '% Final']]

**A:**

If we call **10000 people**, we will be calling **13.12%** of our **dataset (x_test)** and will reach around **29.7%** of all potentially interested customers.

That is about **2.26 times** more than random selection, according to the **LIFT Curve** of the **GNB** model.

This brings the company a return **R$ 5,443,255** higher than the random approach.

### 10.2.3 And what if we now **increase** it to **20,000** calls?

In [None]:
calls = 20000
perc_df = calls / total
price = 1000
resposta = top_k_performance( X_TEST, [yhat_proba], 'score', [perc_df] )
print( 'The dataset has {} rows in total'.format( total ) )
print( 'And 20000 calls represent {0:.2f}% of the dataset'.format( perc_df*100 ) )
resposta['R$ LGBM Model'] = resposta['target_at_k'] * price
resposta['target_random'] = resposta['perc'] * resposta['target_total']
resposta['R$ Random Model'] = resposta['target_random'] * price
resposta['R$ Final'] = resposta['R$ LGBM Model'] - resposta['R$ Random Model']
resposta['% Final'] = resposta['R$ LGBM Model'] * 100 / resposta['R$ Random Model']
resposta[['k', 'perc_target', 'R$ LGBM Model', 'R$ Random Model', 'R$ Final', '% Final']]

**A:**

If we call **20000 people**, we will be calling **26.24%** of our **dataset (x_test)** and will reach around **58.7%** of all potentially interested customers.

That is about **2.23 times** more than random selection, according to the **LIFT Curve** of the **GNB** model.

This brings the company a return **R$ 10,668,655** higher than the random approach.

# 11.0. STEP 11 - DEPLOY MODEL TO PRODUCTION

## 11.1 HealthInsurance Class

In [None]:
import pickle
import pandas as pd
import numpy as np

class HealthInsurance( object ):
  def __init__( self ):
    self.home_path = ''
    self.annual_premium_scaler = pickle.load( open( self.home_path + 'features/annual_premium_scaler.pkl','rb' ) )
    self.age_scaler = pickle.load( open( self.home_path + 'features/age_scaler.pkl','rb' ) )
    self.vintage_scaler = pickle.load( open( self.home_path + 'features/vintage_scaler.pkl', 'rb' ) )
    self.gender_scaler = pickle.load( open( self.home_path + 'features/gender_scaler.pkl', 'rb' ) )
    self.region_code_scaler = pickle.load( open( self.home_path + 'features/region_code_scaler.pkl', 'rb' ) )
    self.vehicle_age_scaler = pickle.load( open( self.home_path + 'features/vehicle_age_scaler.pkl', 'rb' ) )
    self.policy_sales_channel_scaler = pickle.load( open( self.home_path + 'features/policy_sales_channel_scaler.pkl', 'rb' ) )

  def data_cleaning( self, df ):
    cols_new = ['id', 'gender', 'age', 'region_code', 'policy_sales_channel',
                'driving_license', 'vehicle_age', 'vehicle_damage', 'previously_insured',
                'annual_premium', 'vintage', 'response']
    df.columns = cols_new

    return df

  def feature_engineering( self, df ):
    df['region_code'] = df['region_code'].astype('int64')
    df['annual_premium'] = df['annual_premium'].astype('int64')
    df['policy_sales_channel'] = df['policy_sales_channel'].astype('int64')

    # vehicle age
    df['vehicle_age'] = df['vehicle_age'].apply( lambda x: 'over_2_years' if x == '> 2 Years' else 'between_1_2_years' if x == '1-2 Year' else 'below_1_year' )

    # vehicle damage
    df['vehicle_damage'] = df['vehicle_damage'].apply( lambda x: 1 if x == 'Yes' else 0 )

    return df

  def data_preparation( self, df ):
    # annual_premium
    df['annual_premium'] = self.annual_premium_scaler.transform( df[['annual_premium']].values )

    # age
    df['age'] = self.age_scaler.transform( df[['age']].values )

    # vintage
    df['vintage'] = self.vintage_scaler.transform( df[['vintage']].values )

    # gender - Frequency Encoding / *Target Encoding / Weighted Target Encoding
    df.loc[:, 'gender'] = df['gender'].map( self.gender_scaler )

    # region_code - Frequency Encoding / *Target Encoding / Weighted Target Encoding
    df.loc[:, 'region_code'] = df['region_code'].map( self.region_code_scaler )

    # vehicle_age - One Hot Encoding / Order Encoding / Frequency Encoding / *Target Encoding / Weighted Target Encoding
    df.loc[:, 'vehicle_age'] = df['vehicle_age'].map( self.vehicle_age_scaler )

    # policy_sales_channel - Target Encoding / *Frequency Encoding
    df.loc[:, 'policy_sales_channel'] = df['policy_sales_channel'].map( self.policy_sales_channel_scaler )
    
    # select columns
    cols_selected = ['vintage', 'annual_premium', 'age', 'region_code', 'vehicle_damage', 'policy_sales_channel', 'previously_insured']

    return df[cols_selected]
    
  def get_prediction( self, model, original_data, test_data ):
    # model prediction
    pred = model.predict_proba( test_data )

    # join prediction into original_data
    original_data['score'] = pred[:, 1].tolist()

    # threshold
    self.threshold = lambda x: 0 if x < 0.33 else original_data['score']
    original_data.loc[:, 'score'] = original_data['score'].map( self.threshold )
    
    # sort_values
    original_data = original_data.sort_values( 'score', ascending = False )

    return original_data.to_json( orient='records', date_format='iso' )
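
The thresholding and ranking step in `get_prediction` can be checked in isolation. The following is a minimal sketch, assuming a hypothetical array of probabilities in place of the real `model.predict_proba` output and a made-up `id` column; only the 0.33 cutoff and the descending sort come from the method above.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical stand-in for model.predict_proba(): column 1 is P(interested)
pred = np.array([[0.90, 0.10], [0.40, 0.60], [0.65, 0.35], [0.20, 0.80]])

# Hypothetical customer ids, in the same row order as the predictions
original_data = pd.DataFrame({'id': [1, 2, 3, 4]})
original_data['score'] = pred[:, 1]

# Zero out scores below the cutoff, keep the rest unchanged
above_cutoff = original_data['score'] >= 0.33
original_data['score'] = original_data['score'].where(above_cutoff, 0.0)

# Rank customers from most to least likely to buy
ranked = original_data.sort_values('score', ascending=False)

# Serialize the same way the API does, then parse it back for inspection
records = json.loads(ranked.to_json(orient='records', date_format='iso'))
print(records)
```

Using `Series.where` here is a vectorized alternative to mapping a lambda over each score; both give the same result for this cutoff.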

## 10.2 API Handler

In [None]:
import os
import pickle
import pandas as pd
from flask import Flask, request, Response
from healthinsurance.HealthInsurance import HealthInsurance

# loading model
model = pickle.load( open( 'model/LGBM_Model.pkl', 'rb' ) )

# initialize API
app = Flask( __name__ )

@app.route( '/predict', methods=['POST'] )
def healthinsurance_predict():
    test_json = request.get_json()
    if test_json: # there is data
        if isinstance( test_json, dict ): # single example
            test_raw = pd.DataFrame( test_json, index=[0] )

        else: # multiple examples
            test_raw = pd.DataFrame( test_json, columns=test_json[0].keys() )
        
        # Instantiate HealthInsurance Class
        pipeline = HealthInsurance()
        
        # data cleaning
        df = pipeline.data_cleaning( test_raw )
        
        # feature engineering
        df = pipeline.feature_engineering( df )
        
        # data preparation
        df = pipeline.data_preparation( df )
        
        # prediction
        df_response = pipeline.get_prediction( model, test_raw, df )
        
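        # df_response is already a JSON string, so set the matching content type
        return Response( df_response, status=200, mimetype='application/json' )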
        
    else:
        return Response( '{}', status=200, mimetype='application/json')
    
if __name__ == '__main__':
    # environment variables are strings, so cast the port to int
    port = int( os.environ.get( 'PORT', 5000 ) )
    app.run( '0.0.0.0', port=port )
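
The handler accepts either a single JSON object or a list of objects. The branch that turns the payload into a DataFrame can be sketched on its own as below; `payload_to_frame` is a hypothetical helper introduced only for illustration, and the two-field payloads are made up.

```python
import pandas as pd


def payload_to_frame(payload):
    """Hypothetical helper mirroring the dict/list branch of the handler."""
    if isinstance(payload, dict):
        # Single example: scalar values need an explicit one-row index
        return pd.DataFrame(payload, index=[0])
    # Multiple examples: one row per dict, columns taken from the first record
    return pd.DataFrame(payload, columns=list(payload[0].keys()))


single = payload_to_frame({'id': 1, 'age': 44})
batch = payload_to_frame([{'id': 1, 'age': 44}, {'id': 2, 'age': 23}])

print(single.shape, batch.shape)
```

Both cases produce the same column layout, so the rest of the pipeline does not need to know which form the client sent.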


## 10.3 API Tester

In [None]:
import pandas as pd
import numpy  as np
import pickle
import requests
import json

# needed for ms.train_test_split below
from sklearn import model_selection as ms

In [None]:
df_raw = pd.read_csv( '/home/jocafneto/repositorio/allPAs/health_insurance/data/train.csv' )

df1 = df_raw.copy()

cols_new = ['id', 'gender', 'age', 'driving_license', 'region_code',
            'previously_insured', 'vehicle_age', 'vehicle_damage', 'annual_premium',
            'policy_sales_channel', 'vintage', 'response']

df1.columns = cols_new

df2 = df1.copy()

df3 = df2.copy()

df4 = df3.copy()

X = df4.drop('response', axis=1)
y = df4['response'].copy()

x_train, x_validation, y_train, y_validation = ms.train_test_split( X, y, test_size=0.20 )

df5 = pd.concat( [x_train, y_train], axis=1 )

# loading test dataset (copy to avoid modifying x_validation in place)
df_test = x_validation.copy()
df_test['response'] = y_validation

df_test = df_test.sample(10)

df_test

# convert dataframe to json
data = json.dumps( df_test.to_dict( orient='records' ) )

# API Call
url = 'https://health-insurance-score-27.herokuapp.com/predict'
header = { 'Content-type': 'application/json' }

r = requests.post( url, data=data, headers=header )
print( 'Status Code {}'.format( r.status_code ) )


# return json to dataframe
d1 = pd.DataFrame( data=r.json(), columns=r.json()[0].keys() )

d2 = d1[['id', 'score']]

d2

d3 = pd.merge(df_test, d2, how='left', on='id')

d3
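
Once the scores are merged back onto the labelled sample, one way to relate them to the business question is to check how many interested customers fall in the top-k ranked rows. This is a minimal sketch with made-up data standing in for `df_test` and the API response `d2`; the value of `k` is likewise an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for df_test and for the API response (d2)
df_test = pd.DataFrame({'id': [10, 11, 12, 13], 'response': [0, 1, 0, 1]})
d2 = pd.DataFrame({'id': [13, 11, 12, 10], 'score': [0.9, 0.7, 0.4, 0.0]})

# Attach each customer's score to the labelled rows, then rank by score
d3 = pd.merge(df_test, d2, how='left', on='id')
d3 = d3.sort_values('score', ascending=False).reset_index(drop=True)

# Share of all interested customers captured within the top-k ranked rows
k = 2
recall_at_k = d3.head(k)['response'].sum() / d3['response'].sum()
print(recall_at_k)
```

With real data, `k` would correspond to the number of calls the sales team can make.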