# Agency Data Analysis in Python

This version does something the `R` version does not: hyperparameter tuning. It also uses multiprocessing, implicitly, through the `sklearn` libraries. The end result is a superior score.

Part of this analysis is also instructional; while only one algorithm is used (`DecisionTreeClassifier`), any could have been used with proper hyperparameter tuning through `RandomizedSearchCV` (randomized search with cross-validation).

Let's get to it, Boppers. First, set up the data et al.

In [7]:
# Imports - let this stand alone
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Data

Not much to note here, except how `groupby` is used with a `lambda` function to filter out `transaction_type` categories with 3 or fewer instances. The algo complains (logically) when a category has that few instances.

In [2]:
# Data load and prune
agency_df = pd.read_excel("../data/AgencyData_clean.xlsx")
print("Original agency df shape:", agency_df.shape)

# Trim the phat
agency_df_pruned = agency_df.drop(['account_name', 'branch_name'], axis=1)
print("Agency df pruned shape:", agency_df_pruned.shape)

# Need to remove values for transaction_type that have too few instances
agency_df_used = agency_df_pruned.groupby('transaction_type').filter(lambda x : len(x)>3)
print("Agency data used shape:", agency_df_used.shape)

Original agency df shape: (2376, 14)
Agency df pruned shape: (2376, 12)
Agency data used shape: (2371, 12)
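
To see what the `groupby` plus `filter` step does in isolation, here is a minimal sketch on a toy frame. The category labels and row counts are made up for illustration and are not taken from the agency data.

```python
import pandas as pd

# Toy frame; labels and counts are hypothetical
toy = pd.DataFrame({
    "transaction_type": ["New"] * 5 + ["Renewal"] * 4 + ["Rewrite"] * 3 + ["Audit"] * 1,
    "premium": range(13),
})

# Keep only the rows whose category has more than 3 members
kept = toy.groupby("transaction_type").filter(lambda g: len(g) > 3)

print("Before:", toy["transaction_type"].value_counts().to_dict())
print("After: ", kept["transaction_type"].value_counts().to_dict())
```

`filter` returns the original rows (not an aggregate), so the "Rewrite" and "Audit" rows simply disappear and everything else is untouched.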


## One-hot Encoding Reusable Function

In [3]:
def ohe(df, feature):
    # encode
    df = pd.concat([df,pd.get_dummies(df[feature], prefix=feature)],axis=1)
    # now drop the field, it's no longer needed
    df.drop([feature],axis=1, inplace=True)
    return df
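
A quick way to check what `ohe` produces is to run it on a tiny frame. This is only a sketch: the function is repeated so the block runs on its own, and the `lob` values and `premium` column are made-up stand-ins.

```python
import pandas as pd

def ohe(df, feature):
    # Same logic as the function above, repeated so this block is self-contained
    df = pd.concat([df, pd.get_dummies(df[feature], prefix=feature)], axis=1)
    return df.drop(columns=[feature])

# Hypothetical values, for illustration only
toy = pd.DataFrame({"lob": ["Auto", "Home", "Auto"], "premium": [100, 250, 120]})
encoded = ohe(toy, "lob")

print(encoded.columns.tolist())
print(encoded)
```

The original text column is gone and each distinct value gets its own indicator column, named with the feature as a prefix.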

## Encode the Data

For the algo to work, the data has to be numeric, even the categorical data. To facilitate this, I used one-hot encoding (the function in the last code block) for the nominal (unordered) categorical data. That is, any feature that's a list of text options is converted into one column per option, holding either `1` or `0`, where `1` indicates the text value **is** that option and `0` means it is **not** that option.

The `policy_term` is ordinal since 6 months is always less than 12 months. For that, I used a very simple mapping to convert to `1` or `2`, since there are only two options.

Finally, I had to convert the effective date to its integer representation.

In [4]:
agency_df_used = ohe(agency_df_used, 'account_type')
agency_df_used = ohe(agency_df_used, 'assigned_agent')
agency_df_used = ohe(agency_df_used, 'lob')
agency_df_used = ohe(agency_df_used, 'master_company')
agency_df_used = ohe(agency_df_used, 'policy_type')
agency_df_used = ohe(agency_df_used, 'rating_state')
agency_df_used = ohe(agency_df_used, 'status')

# Simple replace for ordinal value
policy_term_mapper = {"6 Months": 1, "12 Months":2}
agency_df_used.replace(policy_term_mapper, inplace=True)

# Convert the time
agency_df_used['eff_date_int'] = pd.to_datetime(agency_df_used['effective_date']).astype(np.int64)
agency_df_used.drop(['effective_date'],axis=1, inplace=True)
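
As a rough illustration of the date conversion, the sketch below applies the same `pd.to_datetime(...).astype(np.int64)` call to three made-up dates. The point is that the resulting integers keep the chronological order, which is all a tree needs in order to split on them.

```python
import numpy as np
import pandas as pd

# Made-up dates, deliberately out of order
dates = pd.Series(["2019-01-15", "2018-06-01", "2019-07-15"])
as_int = pd.to_datetime(dates).astype(np.int64)

print(as_int.dtype)
# Sorting the integers gives the same row order as sorting the dates would
print(as_int.sort_values().index.tolist())
```

With a nanosecond-resolution datetime column the integers count nanoseconds since the Unix epoch; the exact unit depends on the column's resolution, so treat the raw magnitudes as an implementation detail.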

## Set features, target, test, and train

The target is `transaction_type`.

In [5]:
# Set features and target
target = agency_df_used['transaction_type']
features = agency_df_used.loc[:, agency_df_used.columns != 'transaction_type']

# Create training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=1)
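
For a sense of the split sizes, this sketch runs the same `train_test_split` call on a plain array with as many entries as the pruned data has rows (2,371). The array is only a stand-in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(2371)  # stand-in for the 2,371 rows kept after pruning
train_rows, test_rows = train_test_split(rows, test_size=0.25, random_state=1)

print("Train rows:", len(train_rows))
print("Test rows: ", len(test_rows))
```

Since 25% of 2,371 is not a whole number, the test set is rounded up and the training set gets the remainder.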

## Vanilla DecisionTreeClassifier

Without cross-validation parameter tuning, you'd get a pretty bad score. Here's proof.

In [9]:
dt_class_bland = DecisionTreeClassifier(random_state=0)
dt_class_bland.fit(features_train, target_train)
y_predict = dt_class_bland.predict(features_test)
acc = accuracy_score(target_test, y_predict)
print("Bland accuracy score:", format(acc, '%'))

Bland accuracy score: 58.178752%
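
For reference, `accuracy_score` is just the fraction of predictions that match the true labels. A minimal sketch with made-up labels, comparing it against the same fraction computed by hand:

```python
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
y_true = ["New", "Renewal", "Renewal", "Cancel", "New"]
y_pred = ["New", "Renewal", "New", "Cancel", "Renewal"]

toy_acc = accuracy_score(y_true, y_pred)
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy_score:", format(toy_acc, '%'))
print("by hand:       ", format(manual, '%'))
```

Three of the five predictions match, so both lines report 60%.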


## Cross Validation Setup and Execution

A lot is about to happen. First, specific parameters for the `DecisionTreeClassifier` from `sklearn` are set up. You can add to or remove from this list and alter the available options.

In [12]:
# Set up cross validation validator
criterion = ['gini', 'entropy']
splitter = ['best', 'random']
# Note: 'auto' behaves like 'sqrt' for this classifier; scikit-learn 1.3+ no longer accepts it
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]

random_grid = {'criterion': criterion,
               'splitter': splitter,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

def randomSearchCV(the_grid):
    decision_tree_classifier = DecisionTreeClassifier(random_state=0)
    rtc_random = RandomizedSearchCV(estimator = decision_tree_classifier,
                                    param_distributions = the_grid, n_iter = 100,
                                    cv = 3, verbose=2, random_state=42, n_jobs = -1)
    rtc_random.fit(features_train, target_train)
    return rtc_random

# n_jobs = -1 (set above) is what spreads the fits across all cores; the
# __main__ guard keeps spawned worker processes from re-running the search
if __name__ == '__main__':
    rtc = randomSearchCV(random_grid)
    print("Best parameters:", rtc.best_params_)
    print("Best score:", format(rtc.best_score_, '%'))
    print("Error score:", rtc.error_score)
    print("Scoring?", rtc.scoring)
    the_predict = rtc.predict

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    5.9s


Best parameters: {'max_features': 'sqrt', 'max_depth': 60, 'splitter': 'random', 'criterion': 'entropy', 'min_samples_split': 10, 'min_samples_leaf': 1}
Best score: 70.584927%
Error score: raise
Scoring? None


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    6.6s finished
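
To put `n_iter = 100` in context, it helps to count how many combinations the grid contains. The sketch below rebuilds the same grid and counts it with `ParameterGrid`, which only enumerates the combinations and does not fit anything.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same value lists as random_grid above, rebuilt so this block runs on its own
max_depth = [int(x) for x in np.linspace(10, 110, num=11)] + [None]
grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_features": ["auto", "sqrt"],
    "max_depth": max_depth,
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_combinations = len(ParameterGrid(grid))
print("Combinations in the grid:", n_combinations)
print("Share sampled by n_iter = 100:", format(100 / n_combinations, '.1%'))
```

The grid holds 2 x 2 x 2 x 12 x 3 x 3 = 864 combinations, so the randomized search tries a little under 12% of them. That is the trade-off against an exhaustive grid search: far fewer fits, at the cost of possibly missing the best combination.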


## Observations

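The tuned score improvement is shown below: roughly 12.4 percentage points, from 58.18% to 70.58%. This is quite good and done without very deep hyperparameter tuning. One caveat: the tuned figure is the mean cross-validated score on the training data (`best_score_`), while the bland figure was measured on the test set, so the two are not strictly like-for-like. I think more can be done with both the parameters used and the value options noted within the `random_grid`.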

In [13]:
score_improvement = format(rtc.best_score_ - acc, '%')
print("Score Improvement: ", score_improvement)

Score Improvement:  12.406175%
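
One possible follow-up, sketched here on synthetic data rather than the agency spreadsheet, is to score the tuned model on the same held-out test rows as the untuned one. Everything in this block (the `make_classification` data, the smaller grid, the variable names) is an assumption for illustration; the pattern relies only on `RandomizedSearchCV` refitting the best candidate by default, so that `search.predict` uses the tuned tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the agency data
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Untuned baseline
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Small hypothetical grid, kept tiny so the sketch runs quickly
small_grid = {"criterion": ["gini", "entropy"],
              "max_depth": [5, 10, 20, None],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 4]}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), small_grid,
                            n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)

# Both models are now scored on the same test rows
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
tuned_acc = accuracy_score(y_test, search.predict(X_test))
print("Baseline test accuracy:", format(baseline_acc, '%'))
print("Tuned test accuracy:   ", format(tuned_acc, '%'))
print("Mean CV accuracy:      ", format(search.best_score_, '%'))
```

In the notebook itself the equivalent would be `accuracy_score(target_test, rtc.predict(features_test))`, which can then be compared directly with the bland score.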
