## Predictive modeling with a Tweedie-distributed response
*Gorkem Ozkaya*

In this notebook we generate data from a Tweedie GLM with a log link, then compare three predictive models on it:
* Tweedie GLM with the `cplm` package
* Gradient boosting with the `xgboost` package, using the `reg:tweedie` objective, where the loss is the negative log-likelihood of the Tweedie distribution as a function of the mean
* Gradient boosting with the `xgboost` package, using the `reg:linear` objective (named `reg:squarederror` in newer XGBoost releases), where the loss is ordinary least squares
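
For reference, the model and the Tweedie loss can be written out as follows. The coefficient values are the ones used in `gen_data` in this notebook; the loss is the Tweedie negative log-likelihood with the dispersion and all terms that do not depend on the mean dropped, which is, as far as we understand it, the form that `reg:tweedie` minimizes.

$$
\log \mu_i = 7 + 0.1\,x_{i1} + 0.2\,x_{i2} - 0.3\,x_{i3}, \qquad \operatorname{Var}(Y_i) = \phi\,\mu_i^{p}, \qquad 1 < p < 2
$$

$$
\ell(\mu_i; y_i) = -\,\frac{y_i\,\mu_i^{1-p}}{1-p} + \frac{\mu_i^{2-p}}{2-p}
$$

With $p = 1.5$ the first term becomes $2\,y_i\,\mu_i^{-1/2}$ and the second $2\,\mu_i^{1/2}$, so the loss is minimized at $\mu_i = y_i$.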

In [1]:
library(tweedie)
library(cplm)
library(xgboost)

Warning message: package ‘cplm’ was built under R version 3.3.2
Loading required package: coda
Warning message: package ‘coda’ was built under R version 3.3.2
Loading required package: Matrix
Loading required package: splines


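The following function generates data according to this model: the mean is log-linear in three standard normal covariates, and the response is Tweedie with variance power `p` and dispersion `phi`.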

In [2]:
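gen_data <- function(N = 10000, p = 1.5, phi = 100) {
  # three independent standard normal covariates
  var1 <- rnorm(N)
  var2 <- rnorm(N)
  var3 <- rnorm(N)

  # true mean under the log link
  mu <- exp(7 + 0.1 * var1 + 0.2 * var2 - 0.3 * var3)

  # one Tweedie draw per observation, each with its own mean
  resp <- numeric(N)
  for (i in seq_len(N)) {
    resp[i] <- rtweedie(1, xi = p, mu = mu[i], phi = phi)
  }

  data.frame(var1, var2, var3, mu = mu, resp)
}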
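
To see what kind of response this produces, here is a minimal base-R sketch of the same draw, assuming the standard compound Poisson-gamma representation of the Tweedie distribution for 1 < p < 2. The helper `rtweedie_cpg` is a hypothetical name introduced for illustration; it is not part of the `tweedie` package and is not what `gen_data` calls.

```r
# Hypothetical helper (not from the tweedie package): Tweedie draws for
# 1 < p < 2 via the compound Poisson-gamma representation.
rtweedie_cpg <- function(mu, p = 1.5, phi = 100) {
  stopifnot(p > 1, p < 2)
  lambda <- mu^(2 - p) / (phi * (2 - p))  # Poisson rate: expected number of claims
  alpha  <- (2 - p) / (p - 1)             # gamma shape of a single claim
  gam    <- phi * (p - 1) * mu^(p - 1)    # gamma scale of a single claim

  n_claims <- rpois(length(mu), lambda)
  y <- numeric(length(mu))                # observations with no claims stay at 0
  pos <- n_claims > 0
  # a sum of n gamma(alpha, gam) claims is gamma(n * alpha, gam)
  y[pos] <- rgamma(sum(pos), shape = n_claims[pos] * alpha, scale = gam[pos])
  y
}

set.seed(42)
mu <- rep(exp(7), 1e5)
y <- rtweedie_cpg(mu)

# empirical versus theoretical mean and share of exact zeros
c(mean = mean(y), target_mean = exp(7))
c(zero_share = mean(y == 0), target_zero = exp(-sqrt(exp(7)) / 50))
```

At the baseline mean `exp(7)` with `p = 1.5` and `phi = 100`, the theoretical probability of an exact zero is `exp(-sqrt(exp(7)) / 50)`, roughly one half, so the response is a mix of many zeros and a right-skewed positive part.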

### Generating training and test sets

In [3]:
dt <- gen_data()
dt_test <- gen_data()

# for xgboost models
d_train <- xgb.DMatrix(data = as.matrix(dt[, c("var1", "var2", "var3")]), label = dt$resp, missing = NA)
d_test <- xgb.DMatrix(data = as.matrix(dt_test[, c("var1", "var2", "var3")]), label = dt_test$resp, missing = NA)

### Tweedie GLM

In [4]:
model_glm <- cpglm(resp ~ var1+var2+var3, data=dt, optimizer="bobyqa")

y_hat_glm <- predict(model_glm, dt_test[, c("var1", "var2", "var3")])
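
As a rough cross-check that does not need `cplm`, a log-link quasi-Poisson fit in base R targets the same mean-model coefficients. This is a swapped-in technique, not what `cpglm` does, and it does not estimate the variance power. The sketch below regenerates comparable data with base R only (an assumption made to keep it self-contained) and compares the fitted coefficients with the true values 7, 0.1, 0.2 and -0.3.

```r
set.seed(1)
N <- 10000
X <- data.frame(var1 = rnorm(N), var2 = rnorm(N), var3 = rnorm(N))
mu <- exp(7 + 0.1 * X$var1 + 0.2 * X$var2 - 0.3 * X$var3)

# Tweedie(p = 1.5, phi = 100) response through its compound Poisson-gamma form
p <- 1.5
phi <- 100
n_claims <- rpois(N, mu^(2 - p) / (phi * (2 - p)))
claim_total <- rgamma(
  N,
  shape = pmax(n_claims, 1) * (2 - p) / (p - 1),  # pmax avoids a zero shape
  scale = phi * (p - 1) * mu^(p - 1)
)
X$resp <- ifelse(n_claims > 0, claim_total, 0)

# quasi-Poisson with log link: same mean structure, variance proportional to mu
fit_qp <- glm(resp ~ var1 + var2 + var3,
              family = quasipoisson(link = "log"), data = X)
round(coef(fit_qp), 3)
```

The estimates should land close to the true coefficients, which is the property the Tweedie GLM relies on in the comparison later in the notebook.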

### XGBoost with Tweedie objective

In [5]:
params_tweedie <- list(
  objective = 'reg:tweedie',
  eval_metric = 'rmse', 
  tweedie_variance_power = 1.5,
  max_depth = 6,
  eta = 0.01)


bst_tweedie <- xgb.train(
  data = d_train, 
  params = params_tweedie, 
  maximize = FALSE,
  watchlist = list(train = d_train, test=d_test), 
  nrounds = 1000,
  print_every_n = 50)


y_hat_xgb_tweedie <- predict(bst_tweedie, d_test)

[1]	train-rmse:2434.154053	test-rmse:2363.083252 
[51]	train-rmse:2432.750000	test-rmse:2361.693115 
[101]	train-rmse:2429.011963	test-rmse:2358.010254 
[151]	train-rmse:2419.379639	test-rmse:2348.615234 
[201]	train-rmse:2396.393555	test-rmse:2326.561035 
[251]	train-rmse:2349.364014	test-rmse:2282.937500 
[301]	train-rmse:2273.946045	test-rmse:2217.568848 
[351]	train-rmse:2183.551514	test-rmse:2148.278076 
[401]	train-rmse:2099.968750	test-rmse:2096.677734 
[451]	train-rmse:2034.676758	test-rmse:2068.325195 
[501]	train-rmse:1988.494629	test-rmse:2056.002930 
[551]	train-rmse:1955.826294	test-rmse:2051.948486 
[601]	train-rmse:1933.223145	test-rmse:2050.793213 
[651]	train-rmse:1915.505981	test-rmse:2050.909424 
[701]	train-rmse:1902.271729	test-rmse:2051.531982 
[751]	train-rmse:1892.087158	test-rmse:2052.246094 
[801]	train-rmse:1883.345703	test-rmse:2053.188232 
[851]	train-rmse:1875.134644	test-rmse:2053.934082 
[901]	train-rmse:1866.986572	test-rmse:2054.508057 
[951]	train-rms
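
To make the objective concrete, here is a small base-R sketch of the Tweedie loss as a function of the raw score `f = log(mu)`, together with its first and second derivatives. The function names are hypothetical and this is our reading of what `reg:tweedie` optimizes, not code taken from XGBoost.

```r
# Tweedie negative log-likelihood in the raw score f = log(mu),
# up to terms that do not depend on f (hypothetical helper names).
tweedie_loss <- function(f, y, p = 1.5) {
  -y * exp((1 - p) * f) / (1 - p) + exp((2 - p) * f) / (2 - p)
}

# first derivative with respect to f
tweedie_grad <- function(f, y, p = 1.5) {
  -y * exp((1 - p) * f) + exp((2 - p) * f)
}

# second derivative with respect to f
tweedie_hess <- function(f, y, p = 1.5) {
  -y * (1 - p) * exp((1 - p) * f) + (2 - p) * exp((2 - p) * f)
}

# finite-difference check of the gradient at an arbitrary point
f0 <- 6.5
y0 <- 1200
eps <- 1e-5
num_grad <- (tweedie_loss(f0 + eps, y0) - tweedie_loss(f0 - eps, y0)) / (2 * eps)
c(analytic = tweedie_grad(f0, y0), numeric = num_grad)
```

The gradient vanishes at `f = log(y)`, and the Hessian is positive for non-negative `y` when 1 < p < 2, so each observation's loss is convex in the raw score.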

### XGBoost with least-squares objective

In [6]:
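# separate parameter list; tweedie_variance_power is dropped because it
# only applies to the reg:tweedie objective
params_leastsq <- list(
  objective = 'reg:linear',  # named 'reg:squarederror' in newer xgboost releases
  eval_metric = 'rmse', 
  max_depth = 6,
  eta = 0.01)


bst_leastsq <- xgb.train(
  data = d_train, 
  params = params_leastsq, 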
  maximize = FALSE,
  watchlist = list(train = d_train, test=d_test), 
  nrounds = 1000,
  print_every_n = 50)


y_hat_xgb_leastsq <- predict(bst_leastsq, d_test)

[1]	train-rmse:2427.306885	test-rmse:2357.136719 
[51]	train-rmse:2172.122314	test-rmse:2149.021484 
[101]	train-rmse:2051.062012	test-rmse:2072.892578 
[151]	train-rmse:1988.768799	test-rmse:2049.608398 
[201]	train-rmse:1953.366211	test-rmse:2044.666260 
[251]	train-rmse:1931.090088	test-rmse:2044.839233 
[301]	train-rmse:1915.889771	test-rmse:2046.154785 
[351]	train-rmse:1899.971924	test-rmse:2048.905762 
[401]	train-rmse:1885.940796	test-rmse:2051.149902 
[451]	train-rmse:1872.962769	test-rmse:2052.896484 
[501]	train-rmse:1863.371216	test-rmse:2054.953125 
[551]	train-rmse:1854.415283	test-rmse:2056.223877 
[601]	train-rmse:1845.067139	test-rmse:2057.261230 
[651]	train-rmse:1836.041260	test-rmse:2058.142578 
[701]	train-rmse:1827.057129	test-rmse:2059.070068 
[751]	train-rmse:1817.310669	test-rmse:2060.233643 
[801]	train-rmse:1808.161621	test-rmse:2061.356201 
[851]	train-rmse:1799.128540	test-rmse:2062.604248 
[901]	train-rmse:1790.119873	test-rmse:2063.698975 
[951]	train-rms

### Comparing the results of the three models
Now we compare the performance of the three models on the test set using the Gini index, one of the standard performance measures in insurance:

In [7]:
df_gini <- data.frame(y = dt_test$resp, glm = y_hat_glm, xgb_tweedie = y_hat_xgb_tweedie, xgb_leastsq = y_hat_xgb_leastsq)
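# constant base premium; not used below, since base = NULL makes gini()
# take each score in turn as the base
df_gini$base <- mean(df_gini$y)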
gini(loss = "y" , score = c("glm", "xgb_tweedie", "xgb_leastsq"), base = NULL, data=df_gini)


Call:
gini(loss = "y", score = c("glm", "xgb_tweedie", "xgb_leastsq"), 
    base = NULL, data = df_gini)

Gini indices:
             glm       xgb_tweedie  xgb_leastsq
glm           0.00000   0.17174     -0.06209   
xgb_tweedie  15.61175   0.00000      8.08790   
xgb_leastsq  14.46925   5.17869      0.00000   

Standard errors:
             glm    xgb_tweedie  xgb_leastsq
glm          0.000  1.058        1.055      
xgb_tweedie  1.092  0.000        1.101      
xgb_leastsq  1.052  1.070        0.000      

The selected score is glm.
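
For intuition about what these numbers measure, here is a minimal base-R sketch of a Gini index built on the ordered Lorenz curve: observations are sorted by the relativity score/base, and cumulative loss share is plotted against cumulative base-premium share. The helper `gini_ordered` is a hypothetical name; it uses a simple trapezoid rule, computes no standard errors, and is not guaranteed to reproduce the `cplm` output exactly.

```r
# Hypothetical helper: Gini index (in percent) from the ordered Lorenz curve.
gini_ordered <- function(loss, score, base) {
  ord <- order(score / base)                           # sort by relativity
  prem_share <- c(0, cumsum(base[ord]) / sum(base))    # x-axis: premium share
  loss_share <- c(0, cumsum(loss[ord]) / sum(loss))    # y-axis: loss share
  # trapezoid-rule area under the ordered Lorenz curve
  area <- sum(diff(prem_share) *
              (head(loss_share, -1) + tail(loss_share, -1)) / 2)
  100 * (1 - 2 * area)  # twice the area between the diagonal and the curve
}

loss <- 1:100
base <- rep(1, 100)

gini_ordered(loss, score = loss, base = base)        # score ranks losses correctly
gini_ordered(loss, score = 101 - loss, base = base)  # score ranks losses in reverse
```

A positive value means the score identifies risks that the base premium underprices; a value near zero means the score adds nothing over the base.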

### Discussion
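The Gini matrix is read row by row: each row fixes one model as the base premium, and each column gives the Gini index of another model's score against that base. With the GLM as the base, neither XGBoost score improves on it (0.17 and -0.06, both well within one standard error of zero), whereas the GLM score reaches about 15 against either XGBoost base. The GLM is therefore the selected score. This is expected, because we generated the data exactly according to the Tweedie GLM assumptions.

Between the two XGBoost models, on the other hand, neither the *least squares* nor the *Tweedie* objective is clearly better. Each gains against the other when used as the alternative score (8.09 for least squares against the Tweedie base, 5.18 for Tweedie against the least-squares base), and their best printed test RMSE values are close (about 2051 for Tweedie and 2045 for least squares).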