# Summary

The aim of this notebook was to identify the most suitable regressor(s) for predicting the underwriting gain/loss value (UWG).
However, the models performed poorly (the linear models below reach a cross-validated R² of roughly 0.11 at best), which indicates that the available data, in its current form, is not well suited to regression.

Several iterations of manual feature engineering and Recursive Feature Elimination (RFE) were tried in an attempt to predict the UWG, without success at the time of writing.

The lack of success so far is attributed to the distribution of the data, which in aggregate shows no clear correlation with the UWG. Further examination, clustering, and/or feature engineering of the data may be required to explore regression exhaustively.

The following regression models were evaluated. To reduce runtime, all hyperparameter tuning (via grid search) and RFE cross-validation were carried out on Google Colab:
- Linear Regression
- Decision Tree
- Random Forest
- Extra Trees
- AdaBoost
- Gradient Boosting

Lastly, **Unsupervised Learning** was employed to find clusters on which regression models could be applied separately. However, the clustering (K-Means, DBSCAN, Hierarchical) models did not yield meaningful or interpretable results.

# Import Packages

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import numpy as np
%matplotlib inline
%config IPCompleter.greedy=True

df = pd.read_csv('../data/df_model_trimmed.csv')

sns.set(style='darkgrid')
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.feature_selection import RFECV
import itertools

In [2]:
df.head(2)

Unnamed: 0,year,company,auwgr,lkpp,hlr_lag1,hlr_lag2,hlr_lag3,hlr_lag4,hlr_lag5,mer,...,class_fire,class_health,class_mac,class_mahl,class_motor,class_others,class_pa,class_prof_indm,class_pub_lia,class_wic
0,2005,c166,1.403847,9.746642,-0.653108,-0.67628,-0.702346,-0.71422,-0.734678,-0.295015,...,0,0,0,0,0,0,1,0,0,0
1,2006,c166,1.897721,4.531747,-0.950801,-0.67628,-0.702346,-0.71422,-0.734678,-0.334269,...,0,0,0,0,0,0,1,0,0,0


In [3]:
df.shape

(4347, 33)

# Drop Dummies

In [4]:
# Drop the 'class' dummy columns
droplist = ['class_bonds','class_cnstr_engr','class_cpr','class_fire','class_health',
            'class_mac','class_mahl','class_motor','class_others','class_pa',
            'class_prof_indm','class_pub_lia','class_wic']
df.drop(columns=droplist,inplace=True)
df.shape

(4347, 20)

# Get features from dataframe

In [5]:
print(df.shape)
# Keep numeric columns only, excluding the target ('auwgr') and 'year'
features = [col for col in df.select_dtypes(include='number').columns if col not in ('auwgr', 'year')]
print(features)

(4347, 20)
['lkpp', 'hlr_lag1', 'hlr_lag2', 'hlr_lag3', 'hlr_lag4', 'hlr_lag5', 'mer', 'der', 'oer', 'prem_write_net_lag1', 'claim_set_net_lag1', 'exp_management_lag1', 'exp_comm_incur_net_lag1', 'exp_other_lag1', 'prem_liab_diff_lag1', 'claim_liab_diff_lag1', 'uw_gain_lag1']


In [6]:
df.isna().sum().any()

False

In [7]:
X = df[features]
y = df['auwgr']
print(X.shape)
print(y.shape)

(4347, 17)
(4347,)
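
The summary attributes the weak results to the features showing no clear correlation with the target. One quick way to check that claim is to rank the features by their correlation with `y`. The sketch below is illustrative only: `target_correlations` is a hypothetical helper that is not part of the original notebook, and the data is synthetic rather than the notebook's `df`.

```python
import numpy as np
import pandas as pd


def target_correlations(X, y, method="pearson"):
    """Correlation of every column of X with y, strongest (by absolute value) first."""
    corr = X.corrwith(y, method=method)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data (assumption): one informative column, one pure-noise column
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "signal": np.arange(200, dtype=float),
    "noise": rng.normal(size=200),
})
demo_y = 2.0 * demo_X["signal"] + rng.normal(size=200)

demo_corr = target_correlations(demo_X, demo_y)
print(demo_corr)
```

On the notebook's data the equivalent call would be `target_correlations(X, y)`. Pearson correlation only captures linear association, so a low value does not by itself rule out a non-linear relationship.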


## Regression Model 1

### Train/Test Split

In [23]:
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

(3260, 17) (1087, 17)
(3260,) (1087,)


### Standard Scaler

In [24]:
ss = StandardScaler()
ss.fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

### Linear, Ridge, Lasso (with standard scaled data)

In [25]:
# LINEAR REG - Instantiate and score using cross validation (3 folds)
linreg = LinearRegression()
print('\nLINEAR REG cross-val mean score:')
print('X-Val score MEAN using X_train\t\t',cross_val_score(linreg, X_train_sc, y_train,cv=3).mean())

# RIDGE - Instantiate and score using cross validation (3 folds)
ridge=RidgeCV(alphas=np.linspace(.1, 10, 100))
print('\nRIDGE cross-val mean score:')
print('X-Val score MEAN using X_train\t\t',cross_val_score(ridge, X_train_sc, y_train,cv=3).mean())

# LASSO - Instantiate and score using cross validation (3 folds)
lasso = LassoCV(n_alphas=200,cv=3)
print('\nLASSO cross-val mean score:')
print('X-Val score MEAN using X_train\t\t',cross_val_score(lasso, X_train_sc, y_train,cv=3).mean())


LINEAR REG cross-val mean score:
X-Val score MEAN using X_train		 0.11119262833850345

RIDGE cross-val mean score:
X-Val score MEAN using X_train		 0.11351653946765368

LASSO cross-val mean score:
X-Val score MEAN using X_train		 0.1131184168029213
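
In the cell above, the scaler is fit on the whole of `X_train` before cross-validation, so every validation fold has already contributed to the scaling statistics. The effect is usually small for `StandardScaler`, but wrapping the scaler and the model in a pipeline removes it, because the scaler is then re-fit on the training part of each fold. A minimal sketch, assuming synthetic data in place of the notebook's features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): five unscaled features, linear target plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# The scaler is fit inside each CV fold, on that fold's training rows only
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=3)  # default scoring for regressors is R^2
print("mean CV R^2:", scores.mean())
```

With the notebook's variables this would be `cross_val_score(pipe, X_train, y_train, cv=3)`, passing the unscaled `X_train`.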


## Regression Model 2

### Power Transformer

In [11]:
pt_x = PowerTransformer() # transform X
pt_x.fit(X_train)
X_train_pt = pt_x.transform(X_train)
X_test_pt = pt_x.transform(X_test)

pt_y = PowerTransformer() # transform Y
# PowerTransformer requires a matrix/DataFrame, so we use .to_frame() method on y_train
# subsequently we use .ravel() to flatten it into an array (which is required for cross_val later)
pt_y.fit(y_train.to_frame())
y_train_pt = pt_y.transform(y_train.to_frame()).ravel()
y_test_pt = pt_y.transform(y_test.to_frame()).ravel()

### Linear, Ridge, Lasso (with power transformed data)

In [12]:
# LINEAR REG - Instantiate and score using cross validation (3 folds)
linreg = LinearRegression()
print('\nLINEAR REG cross-val mean score:')
print('X-Val score MEAN using X_train\t\t',cross_val_score(linreg, X_train_pt, y_train_pt,cv=3).mean())

# RIDGE - Instantiate and score using cross validation (3 folds)
ridge=RidgeCV(alphas=np.linspace(.1, 10, 100))
print('\nRIDGE cross-val mean score:')
print('X-Val score MEAN using X_train\t\t',cross_val_score(ridge, X_train_pt, y_train_pt,cv=3).mean())

# LASSO - Instantiate and score using cross validation (3 folds)
lasso = LassoCV(n_alphas=200,cv=3)
print('\nLASSO cross-val mean score:')
print('X-Val score MEAN using X_train\t\t',cross_val_score(lasso, X_train_pt, y_train_pt,cv=3).mean())


LINEAR REG cross-val mean score:
X-Val score MEAN using X_train		 0.08187312635603868

RIDGE cross-val mean score:
X-Val score MEAN using X_train		 0.08303968253782978

LASSO cross-val mean score:
X-Val score MEAN using X_train		 0.08247678166047641
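
Note that the scores above are computed on the power-transformed target, so they are not directly comparable with the Regression Model 1 scores, which are on the original scale. One way to keep the comparison on the original scale is scikit-learn's `TransformedTargetRegressor`, which transforms `y` for fitting and inverse-transforms the predictions before scoring. A minimal sketch, assuming synthetic, mildly skewed data in place of the notebook's:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in data (assumption): positive, right-skewed target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
latent = X_demo @ np.array([0.5, -0.3, 0.2, 0.0]) + rng.normal(scale=0.1, size=300)
y_demo = (latent + 5.0) ** 2

model = TransformedTargetRegressor(
    regressor=make_pipeline(PowerTransformer(), LinearRegression()),  # transforms X
    transformer=PowerTransformer(),                                   # transforms y
)
# R^2 is computed against the original (untransformed) y
scores = cross_val_score(model, X_demo, y_demo, cv=3)
print("mean CV R^2 on the original scale:", scores.mean())
```

This also removes the need for the manual `.to_frame()` and `.ravel()` steps, since the wrapper reshapes a one-dimensional `y` internally.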


## Decision Tree Regressor

### Initial Hyperparameters

In [13]:
dtreg = DecisionTreeRegressor()
dtreg.fit(X_train_sc,y_train) # Fit on standard-scaled data
# Evaluate model.
print(dtreg.score(X_train_sc,y_train))
print(dtreg.score(X_test_sc,y_test))

0.999999999490874
-0.7812532255355314
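
A training R² close to 1 combined with a negative test R² is the typical signature of an unconstrained tree memorising noise: with no depth limit, the tree keeps splitting until almost every training row sits in its own leaf. The sketch below reproduces that pattern on synthetic data (an assumption, not the notebook's data) and compares it with a depth-limited tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): weak signal in one feature, heavy noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=2.0, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)                  # no depth limit
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)  # constrained

deep_train, deep_test = deep.score(X_tr, y_tr), deep.score(X_te, y_te)
shallow_train, shallow_test = shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)
print("deep    train/test R^2:", deep_train, deep_test)
print("shallow train/test R^2:", shallow_train, shallow_test)
```

Limiting the depth cannot create signal that is not in the data, but it narrows the gap between training and test scores, which makes the test score a more honest guide.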


### GridSearchCV

In [None]:
# param_grid = [{'max_depth':range(2,1000),
#                'min_samples_split':range(2,21)
#               }]
# reg = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)
# reg.fit(X_train, y_train)
# reg.best_params_
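
The commented-out grid above spans 998 depths times 19 split sizes, close to 19,000 candidates with 5 fits each, which is presumably why it was run on Colab. A tree trained on a few thousand rows is unlikely to grow beyond a few dozen levels, so most of those depth values probably yield identical trees. A compact grid is a cheaper starting point. The sketch below is illustrative: the grid values are assumptions rather than the notebook's settings, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# 4 x 3 = 12 candidates, each evaluated with 5-fold cross-validation
param_grid = {
    "max_depth": [2, 4, 8, None],       # None lets the tree grow without a limit
    "min_samples_split": [2, 10, 20],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best mean CV R^2:", search.best_score_)
```

If the best value lands on the edge of the grid, the grid can be widened in that direction and the search repeated.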

### Results

In [None]:
# dtreg = DecisionTreeRegressor(max_depth=2, min_samples_split=15)
# dtreg.fit(X_train,y_train) # Use un-scaled data
# # Evaluate model.
# print(dtreg.score(X_train,y_train))
# print(dtreg.score(X_test,y_test))

## Random Forest Regressor

### Initial Hyperparameters

In [14]:
rfreg = RandomForestRegressor(n_estimators=10) # 10 trees (the scikit-learn default before v0.22; the default is 100 from v0.22 onward)
rfreg.fit(X_train_sc,y_train) # Fit on standard-scaled data
# Evaluate model
print(rfreg.score(X_train_sc,y_train))
print(rfreg.score(X_test_sc,y_test))

0.81783645478352
-0.016826547420873972


### GridSearchCV

In [None]:
# param_grid = [{'n_estimators':[50,100,200],
#                'max_depth':range(2,50),
#                'min_samples_split':range(2,20),
#                'oob_score':[True]
#               }]
# reg = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
# reg.fit(X_train, y_train)
# reg.best_params_

### Results

In [None]:
# rfreg = RandomForestRegressor(n_estimators=100,max_depth=2,min_samples_split=10,oob_score=True)
# rfreg.fit(X_train,y_train) # Use un-scaled data
# # Evaluate model
# print(rfreg.score(X_train,y_train))
# print(rfreg.score(X_test,y_test))

## Extra Trees Regressor

### Initial Hyperparameters

In [15]:
etreg = ExtraTreesRegressor(bootstrap=True,oob_score=True,warm_start=False,n_estimators=100)
etreg.fit(X_train_sc,y_train) # Fit on standard-scaled data
# Evaluate model
print(etreg.score(X_train_sc,y_train))
print(etreg.score(X_test_sc,y_test))

0.8785319032217249
0.09059870568796191
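
The model above is built with `bootstrap=True` and `oob_score=True`, but the out-of-bag estimate is never read. After fitting, it is available as the `oob_score_` attribute: an R² computed for each training row using only the trees that did not see that row, which gives a validation-style estimate without touching the test set. A minimal sketch, assuming synthetic data in place of the notebook's:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=400)

# oob_score=True requires bootstrap=True, as in the notebook's cell
et = ExtraTreesRegressor(n_estimators=100, bootstrap=True, oob_score=True, random_state=0)
et.fit(X_demo, y_demo)

train_r2 = et.score(X_demo, y_demo)
print("train R^2:", train_r2)
print("OOB R^2:  ", et.oob_score_)
```

For the notebook's model the equivalent would be `print(etreg.oob_score_)` after the `fit` call.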


### GridSearchCV

In [None]:
# param_grid = [{'n_estimators':[100,200,300],
#                'max_depth':range(2,50),
#                'min_samples_split':range(2,20),
#                'oob_score':[True],
#                'bootstrap':[True]
#               }]

# reg = GridSearchCV(ExtraTreesRegressor(), param_grid, cv=5)
# reg.fit(X_train, y_train)
# reg.best_params_

### Results

In [None]:
# etreg = ExtraTreesRegressor(bootstrap=True,max_depth=23,min_samples_split=14,n_estimators=100,oob_score=True)
# etreg.fit(X_train,y_train) # Use un-scaled data
# # Evaluate model
# print(etreg.score(X_train,y_train))
# print(etreg.score(X_test,y_test))

# Continued in notebooks "06 Annex-1 Part 2 (Regression and Clustering)" and "06 Annex-1 Part 3 (RFE-CV)"