# Agency Data Analysis in Python

This version does some things the `R` version does not: it tunes hyperparameters and uses different algorithms for the predictions. It also uses multiprocessing implicitly, through the `sklearn` libraries. I'll compare and contrast with what I did in `R` as well.

Part of this analysis is also instructional: only two algorithms are used (`DecisionTreeClassifier` and `GaussianNB`), but any classifier could have been used with proper hyperparameter tuning through `GridSearchCV` (cross-validated parameter search).

Let's get to it, Boppers. Set up the data et al.

In [15]:
# Imports - let this stand alone
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Data

Not much to note here, except how `groupby` is used with a `lambda` function to filter out `transaction_type` categories with three or fewer instances. The algo complains (logically) when a class has too few instances.

In [2]:
# Data load and prune
agency_df = pd.read_excel("../data/AgencyData_clean.xlsx")
print("Original agency df shape:", agency_df.shape)

# Trim the phat
agency_df_pruned = agency_df.drop(['account_name', 'branch_name'], axis=1)
print("Agency df pruned shape:", agency_df_pruned.shape)

# Need to remove values for transaction_type that have too few instances
agency_df_used = agency_df_pruned.groupby('transaction_type').filter(lambda x : len(x)>3)
print("Agency data used shape:", agency_df_used.shape)

Original agency df shape: (2376, 14)
Agency df pruned shape: (2376, 12)
Agency data used shape: (2371, 12)
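
As a minimal sketch of the `groupby(...).filter(...)` step, here is the same idea on a toy frame. The category names, counts, and the `premium` column are made up for illustration and are not taken from the agency data.

```python
import pandas as pd

# Toy data: category names and counts are hypothetical
toy = pd.DataFrame({
    "transaction_type": ["New"] * 5 + ["Renewal"] * 4 + ["Rewrite"] * 2,
    "premium": range(11),
})

# filter() keeps or drops whole groups; here a group survives only if it has more than 3 rows
kept = toy.groupby("transaction_type").filter(lambda g: len(g) > 3)

print("Before:", toy["transaction_type"].value_counts().to_dict())
print("After: ", kept["transaction_type"].value_counts().to_dict())
```

The lambda receives each group as its own DataFrame, so `len(g)` is that category's row count, and the rows of any group that fails the test are dropped together.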


## One-hot Encoding Reusable Function

In [3]:
def ohe(df, feature):
    # encode
    df = pd.concat([df,pd.get_dummies(df[feature], prefix=feature)],axis=1)
    # now drop the field, it's no longer needed
    df.drop([feature],axis=1, inplace=True)
    return df
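
To see what the helper produces, here is a small sketch on a toy frame. It assumes a variant named `ohe_sketch` (hypothetical) that returns a new frame rather than dropping in place, and it passes `dtype=int` so the dummy columns are `1`/`0` regardless of pandas version. The `lob` values and `premium` numbers are invented.

```python
import pandas as pd

def ohe_sketch(df, feature):
    # Build one 0/1 column per distinct value of the feature
    dummies = pd.get_dummies(df[feature], prefix=feature, dtype=int)
    # Drop the original text column and append the dummy columns
    return pd.concat([df.drop(columns=[feature]), dummies], axis=1)

# Hypothetical values, for illustration only
toy = pd.DataFrame({"lob": ["Auto", "Home", "Auto"],
                    "premium": [100, 250, 120]})

encoded = ohe_sketch(toy, "lob")
print(encoded)
```

Each distinct value becomes its own column named `<feature>_<value>`, so a feature with many distinct values widens the frame considerably.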

## Encode the Data

For the algo to work, the data has to be numeric, even the categorical data. To facilitate this, I used one-hot encoding (the function in the last code block) for the nominal categorical data: any feature that's a list of text options is converted into one column per value, where `1` indicates the text value **is** that option and `0` means it is **not** that option.

The `policy_term` is ordinal since 6 months is always less than 12 months. For that, I used a simple mapping to `1` or `2`, since there are only two options.

Finally, I had to convert the effective date to its integer representation.

In [4]:
agency_df_used = ohe(agency_df_used, 'account_type')
agency_df_used = ohe(agency_df_used, 'assigned_agent')
agency_df_used = ohe(agency_df_used, 'lob')
agency_df_used = ohe(agency_df_used, 'master_company')
agency_df_used = ohe(agency_df_used, 'policy_type')
agency_df_used = ohe(agency_df_used, 'rating_state')
agency_df_used = ohe(agency_df_used, 'status')

# Simple replace for ordinal value
policy_term_mapper = {"6 Months": 1, "12 Months":2}
agency_df_used.replace(policy_term_mapper, inplace=True)

# Convert the time
agency_df_used['eff_date_int'] = pd.to_datetime(agency_df_used['effective_date']).astype(np.int64)
agency_df_used.drop(['effective_date'],axis=1, inplace=True)
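
The ordinal mapping and the date conversion can be sketched on a toy frame as below. This is an illustration under assumptions: the dates are invented, and the mapping is applied to the `policy_term` column only (with `map`) rather than to the whole frame, which is a slightly narrower variant of the `replace` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "policy_term": ["6 Months", "12 Months", "6 Months"],
    "effective_date": ["2019-01-01", "2019-07-01", "2020-01-01"],
})

# Ordinal encoding: the integers preserve the order 6 months < 12 months
toy["policy_term"] = toy["policy_term"].map({"6 Months": 1, "12 Months": 2})

# Datetimes become integer timestamps (nanoseconds since the epoch
# when the column has the datetime64[ns] dtype)
toy["eff_date_int"] = pd.to_datetime(toy["effective_date"]).astype(np.int64)
toy = toy.drop(columns=["effective_date"])

print(toy)
```

The integer timestamps are very large compared to the 0/1 dummy columns. That does not matter for a decision tree, which only compares values against thresholds, but it can matter for scale-sensitive models.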

## Set features, target, test, and train

The target is `transaction_type`.

In [5]:
# Set features and target
target = agency_df_used['transaction_type']
features = agency_df_used.loc[:, agency_df_used.columns != 'transaction_type']

# Create training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=1)

## Vanilla DecisionTreeClassifier and GaussianNB

Without cross-validated hyperparameter tuning, the scores are pretty bad. Here's proof.

In [7]:
dt_class_bland = DecisionTreeClassifier(random_state=0)
dt_class_bland.fit(features_train, target_train)
y_predict = dt_class_bland.predict(features_test)
acc_dt = accuracy_score(target_test, y_predict)
print("Bland DecisionTree Classifier accuracy score:", format(acc_dt, '%'))

nb_class_bland = GaussianNB()
nb_class_bland.fit(features_train, target_train)
y_predict = nb_class_bland.predict(features_test)
acc_nb = accuracy_score(target_test, y_predict)
print("Bland Naive Bayes Classifier accuracy score:", format(acc_nb, '%'))

Bland DecisionTree Classifier accuracy score: 58.178752%
Bland Naive Bayes Classifier accuracy score: 47.892074%
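
One way to put an untuned score in context is to compare it with a majority-class baseline, which always predicts the most frequent label. The sketch below uses synthetic data from `make_classification` as a stand-in, since it is only meant to show the mechanics; the numbers it prints say nothing about the agency data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in data (an assumption, not the agency data)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Baseline: always predict the most frequent training label
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Untuned tree, same settings as the "bland" classifier above
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

acc_majority = accuracy_score(y_test, majority.predict(X_test))
acc_tree = accuracy_score(y_test, tree.predict(X_test))
print("Majority-class baseline:", format(acc_majority, '%'))
print("Untuned decision tree:", format(acc_tree, '%'))
```

If the most frequent class covers a large share of the rows, a raw accuracy figure can look better than it is, so the gap above the baseline is the more informative number.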


## Cross Validation Setup and Execution

A lot is about to happen. First, specific parameters for the `DecisionTreeClassifier` and `GaussianNB` from `sklearn` are set up. You can add/remove from this list and alter the available options.

From there, a `GridSearchCV` (cross validation) is run with each classifier and each grid. Then the training data is fit and the scores output.

In [23]:
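def randomSearchCV(classifier, the_grid):
    # Despite the name, this is an exhaustive grid search: every combination
    # in the_grid is scored with 3-fold cross-validation
    rtc_random = GridSearchCV(estimator=classifier,
                              param_grid=the_grid,
                              cv = 3, verbose=2, n_jobs = -1)
    # Uses the notebook-level features_train and target_train
    rtc_random.fit(features_train, target_train)
    return rtc_random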

# Running in main enables multiprocessor functionality
if __name__ == '__main__':
    scores_dict = {}

    dt_random_grid = {'criterion': ['gini'],
               'splitter': ['best', 'random'],
               'max_features': [None, 'sqrt'],  # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
               'min_samples_split': [2, 3, 4, 5],
               'min_samples_leaf': [1, 2, 3, 4]}

    dt_classifier = DecisionTreeClassifier(random_state=0)
    rtc_dt = randomSearchCV(dt_classifier, dt_random_grid)
    print("Best parameters:", rtc_dt.best_params_)
    print("Best score:", rtc_dt.best_score_)

    nb_random_grid = {}
    nb_classifier = GaussianNB()
    rtc_nb = randomSearchCV(nb_classifier, nb_random_grid)
    print("Best parameters:", rtc_nb.best_params_)
    print("Best score:", rtc_nb.best_score_)

Fitting 3 folds for each of 64 candidates, totalling 192 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Best parameters: {'criterion': 'gini', 'max_features': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'splitter': 'random'}
Best score: 0.7137232845894264
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best parameters: {}
Best score: 0.5264341957255344


[Parallel(n_jobs=-1)]: Done 192 out of 192 | elapsed:    0.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.0s finished
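
The search above tries every combination in the grid. When a grid gets larger, `RandomizedSearchCV` is an alternative that scores only a fixed number of sampled combinations. A minimal sketch, assuming synthetic stand-in data and a slightly wider grid than the one used above (the added `'entropy'` option and `n_iter=20` are illustrative choices, not the settings used in this analysis):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the agency data)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)

# 2 * 2 * 2 * 4 * 4 = 128 possible combinations
param_distributions = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_features': [None, 'sqrt'],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3, 4],
}

# Score only 20 randomly sampled combinations instead of all 128
search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20, cv=3, random_state=0)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best score:", search.best_score_)
```

The trade-off is coverage for time: the randomized search may miss the single best combination, but its cost is fixed by `n_iter` no matter how many options are added to the grid.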


## Observations

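The tuned decision tree scores about 13 percentage points higher than the untuned one, and that's without very deep hyperparameter tuning. The Naive Bayes grid is empty, so its smaller difference comes from cross-validating on the training data rather than from tuning. I think more can be done with both the parameters used and the value options listed in `dt_random_grid`.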

For my last project, I'll improve upon this to go much deeper.

In [25]:
score_imp_dt = rtc_dt.best_score_ - acc_dt
print("Decision Tree Improvement: ", score_imp_dt)

score_imp_nb = rtc_nb.best_score_ - acc_nb
print("Naive Bayes Improvement: ", score_imp_nb)

Decision Tree Improvement:  0.13193576351016834
Naive Bayes Improvement:  0.04751345373565247
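
Two caveats apply to these differences: `best_score_` is the mean cross-validated accuracy on the training folds, while the "bland" scores were measured on the held-out test set, and the Naive Bayes grid was empty. A like-for-like comparison could look like the sketch below, which tunes `GaussianNB`'s `var_smoothing` and scores both models on the same test split. The data is synthetic and the `var_smoothing` range is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption, not the agency data)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Untuned model, scored on the test set
baseline = GaussianNB().fit(X_train, y_train)
acc_baseline = accuracy_score(y_test, baseline.predict(X_test))

# 13 candidate values from 1e-12 to 1 (the default, 1e-9, is among them)
nb_grid = {'var_smoothing': np.logspace(-12, 0, 13)}
search = GridSearchCV(GaussianNB(), param_grid=nb_grid, cv=3)
search.fit(X_train, y_train)

# best_estimator_ is refit on all training data; score it on the same test set
acc_tuned = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print("Best parameters:", search.best_params_)
print("Untuned test accuracy:", format(acc_baseline, '%'))
print("Tuned test accuracy:", format(acc_tuned, '%'))
```

Scoring both models on the same held-out rows makes the difference attributable to the tuning itself.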
