## Development tutorial

### Getting started
All exercises rely on chainladder v0.5.5 or later.  `pandas 1.0` also introduced breaking changes, so date slicing may behave differently if you are using an earlier version of pandas.

In [1]:
import chainladder as cl
import pandas as pd
print('chainladder:' + cl.__version__)
print('pandas:' + pd.__version__)

chainladder:0.7.9
pandas:1.0.4


### Should you develop with Chain-Ladder?

The Chain Ladder method rests on two strong assumptions: independence across origin years and independence across valuation years.
Mack developed tests to verify whether these assumptions hold, and these tests are implemented in chainladder.

You should verify that your data passes these tests at the required confidence level. If it does not, consider whether the
development would be better done another way, for example with an AR model.
Below is an example that runs both tests on the `raa` sample triangle.

In [3]:
raa = cl.load_sample('raa')
raa

Unnamed: 0,12,24,36,48,60,72,84,96,108,120
1981,5012,8269.0,10907.0,11805.0,13539.0,16181.0,18009.0,18608.0,18662.0,18834.0
1982,106,4285.0,5396.0,10666.0,13782.0,15599.0,15496.0,16169.0,16704.0,
1983,3410,8992.0,13873.0,16141.0,18735.0,22214.0,22863.0,23466.0,,
1984,5655,11555.0,15766.0,21266.0,23425.0,26083.0,27067.0,,,
1985,1092,9565.0,15836.0,22169.0,25955.0,26180.0,,,,
1986,1513,6445.0,11702.0,12935.0,15852.0,,,,,
1987,557,4020.0,10946.0,12314.0,,,,,,
1988,1351,6947.0,13112.0,,,,,,,
1989,3133,5395.0,,,,,,,,
1990,2063,,,,,,,,,


In [7]:
print('Correlation across valuation years? ', raa.valuation_correlation(p_critical=.1, total=True).z_critical.values)
print('Correlation across origin years? ', raa.development_correlation(p_critical=.5).t_critical.values)

Correlation across valuation years?  [[False]]
Correlation across origin years?  [[False]]


The tests above find no significant correlation in the `raa` triangle in either case, suggesting that Chain Ladder is an appropriate method to develop it.
Reading Mack (1993) and Mack (1997) [refs] is recommended to ensure a proper understanding of the methodology and the choice of `p_critical`.

Mack (1997) differs from Mack (1993) in how it tests for correlation across valuation years. The earlier paper looks at the aggregate of all years, while the later paper suggests checking independence for each valuation year and, if dependence appears in a given year, reducing the weight of that year in the development process [how?].
To test each valuation year separately, run:

In [8]:
# Setting total=False provides a year-by-year test
raa.valuation_correlation(p_critical=.1, total=False).z_critical

Unnamed: 0,1982,1983,1984,1985,1986,1987,1988,1989,1990
1981,False,False,False,False,False,False,False,False,False


Note that the tests are run across all 4 dimensions of the `Triangle`, and the output of each test is itself a `Triangle`.

### Estimator Basics

All development methods follow the `sklearn` estimator API.  These estimators have a few properties that are worth getting used to.

You instantiate the estimator with your choice of assumptions.  Where you don't specify an assumption, a default is chosen for you.

In [11]:
cl.Development()

Development(average='volume', drop=None, drop_high=None, drop_low=None,
            drop_valuation=None, fillna=None, n_periods=-1,
            sigma_interpolation='log-linear')

We have now chosen an estimator and its assumptions (even if default), but we have not yet shown the estimator a `Triangle`.  So far it is merely a set of instructions on how to fit development patterns; no patterns exist yet.

All estimators have a `fit` method to which you can pass a triangle.  Let's `fit` a `Development` estimator to a `Triangle`, and assign the estimator to a variable so we can reference its attributes.

In [12]:
genins = cl.load_sample('genins')
genins

Unnamed: 0,12,24,36,48,60,72,84,96,108,120
2001,357848,1124788.0,1735330.0,2218270.0,2745596.0,3319994.0,3466336.0,3606286.0,3833515.0,3901463.0
2002,352118,1236139.0,2170033.0,3353322.0,3799067.0,4120063.0,4647867.0,4914039.0,5339085.0,
2003,290507,1292306.0,2218525.0,3235179.0,3985995.0,4132918.0,4628910.0,4909315.0,,
2004,310608,1418858.0,2195047.0,3757447.0,4029929.0,4381982.0,4588268.0,,,
2005,443160,1136350.0,2128333.0,2897821.0,3402672.0,3873311.0,,,,
2006,396132,1333217.0,2180715.0,2985752.0,3691712.0,,,,,
2007,440832,1288463.0,2419861.0,3483130.0,,,,,,
2008,359480,1421128.0,2864498.0,,,,,,,
2009,376686,1363294.0,,,,,,,,
2010,344014,,,,,,,,,


In [13]:
dev = cl.Development().fit(genins)

Now that we have `fit` a `Development` estimator, it has many additional properties that didn't exist before fitting.  For example, we can view the `ldf_`

In [15]:
dev.ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


We can view the `cdf_`

In [16]:
dev.cdf_

Unnamed: 0,12-Ult,24-Ult,36-Ult,48-Ult,60-Ult,72-Ult,84-Ult,96-Ult,108-Ult
(All),14.4466,4.1387,2.3686,1.6252,1.3845,1.2543,1.1547,1.0956,1.0177
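
The relationship between the two attributes can be checked by hand. The sketch below is plain Python and does not call `chainladder`; it assumes no tail factor and uses the six-decimal LDFs that this tutorial prints for `genins`. Each age-to-ultimate factor is the product of all later age-to-age factors.

```python
from itertools import accumulate
from operator import mul

# Volume-weighted LDFs for `genins`, copied from the six-decimal
# `dev.ldf_` output printed in this tutorial.
ages = ['12', '24', '36', '48', '60', '72', '84', '96', '108']
ldf = [3.490607, 1.747333, 1.457413, 1.173852, 1.103824,
       1.086269, 1.053874, 1.076555, 1.017725]

# Multiply from the oldest age backwards, then restore the original order.
# Assumption: no tail factor, so the last CDF equals the last LDF.
cdf = list(accumulate(reversed(ldf), mul))[::-1]

for age, factor in zip(ages, cdf):
    print(f"{age}-Ult: {factor:.4f}")
```

The printed values should agree with the `cdf_` output above up to rounding of the inputs.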


Notice these extra attributes have a trailing underscore (`_`).  This is the `sklearn` API convention, used to quickly distinguish attributes that are assumptions (i.e. that exist pre-fit) from those that are estimated from the data (and only exist post-fit).

In [17]:
print('Assumption parameter (no underscore):', dev.average)
print('Estimated parameter (underscore):\n',   dev.ldf_)

Assumption parameter (no underscore): volume
Estimated parameter (underscore):
           12-24     24-36     36-48     48-60     60-72     72-84     84-96    96-108   108-120
(All)  3.490607  1.747333  1.457413  1.173852  1.103824  1.086269  1.053874  1.076555  1.017725
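
To make the convention concrete, here is a minimal sketch of an estimator written in the same style. `ToyDevelopment` is a hypothetical class invented for illustration and is not part of `chainladder` or `sklearn`; it only shows where the two kinds of attribute come from.

```python
class ToyDevelopment:
    """Hypothetical sklearn-style estimator (illustration only)."""

    def __init__(self, average='volume'):
        # Assumption chosen by the user: exists before fitting, no underscore.
        self.average = average

    def fit(self, prior, later):
        # Estimated from data: created only by fit, trailing underscore.
        if self.average == 'volume':
            self.ldf_ = sum(later) / sum(prior)
        elif self.average == 'simple':
            ratios = [nxt / cur for cur, nxt in zip(prior, later)]
            self.ldf_ = sum(ratios) / len(ratios)
        else:
            raise ValueError(f"unsupported average: {self.average!r}")
        return self  # returning self allows ToyDevelopment().fit(...).ldf_


est = ToyDevelopment(average='simple')
print(hasattr(est, 'ldf_'))   # False: nothing has been estimated yet
est.fit(prior=[100.0, 200.0], later=[150.0, 260.0])
print(hasattr(est, 'ldf_'))   # True: fit created the attribute
```

Returning `self` from `fit` is what allows the chained `cl.Development().fit(genins)` style used throughout this tutorial.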


### Development Averaging

Now that we have a grounding in triangle manipulation and the basics of estimators, we can start getting more creative with customizing our `Development` factors.

The basic `Development` estimator uses a weighted regression through the origin for estimating parameters.  Mack showed that using weighted regressions allows for:
1. `volume` weighted average development patterns<br>
2. `simple` average development factors<br>
3. OLS `regression` estimate of development factor where the regression equation is Y = mX + 0<br>

While he posited this framework to suggest the `MackChainladder` stochastic method, it is an elegant form even for deterministic development pattern selection.
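
The three averages can be reproduced as one weighted least-squares fit through the origin with different weights. The sketch below is plain Python rather than the library's implementation; the helper name `wls_through_origin` and its `power` argument are assumptions made for illustration. It uses the 12- and 24-month columns of `genins` for origins 2001 to 2009.

```python
# 12- and 24-month cumulative losses for origins 2001-2009 of `genins`.
x = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
y = [1124788, 1236139, 1292306, 1418858, 1136350, 1333217, 1288463, 1421128, 1363294]


def wls_through_origin(x, y, power):
    """Slope m minimising sum(w * (y - m*x)**2) with weights w = 1 / x**power."""
    w = [1.0 / xi ** power for xi in x]
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den


volume = wls_through_origin(x, y, power=1)      # reduces to sum(y) / sum(x)
simple = wls_through_origin(x, y, power=2)      # reduces to mean(y / x)
regression = wls_through_origin(x, y, power=0)  # unweighted, no intercept

print(f"volume={volume:.4f} simple={simple:.4f} regression={regression:.4f}")
```

The `volume` and `simple` results should agree with the 12-24 factors reported by `Development(average='volume')` and `Development(average='simple')` in this tutorial.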

In [18]:
vol = cl.Development(average='volume').fit(genins).ldf_
vol

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


In [19]:
sim = cl.Development(average='simple').fit(genins).ldf_
sim

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5661,1.7456,1.452,1.181,1.1112,1.0848,1.0527,1.0748,1.0177


In most cases, estimator attributes are `Triangle`s themselves and can be manipulated just like raw triangles.

In [20]:
print('LDF Type: ', type(vol))
print('Difference between volume and simple average:')
vol-sim

LDF Type:  <class 'chainladder.core.triangle.Triangle'>
Difference between volume and simple average:


Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),-0.0755,0.0018,0.0055,-0.0071,-0.0074,0.0015,0.0011,0.0018,


Choosing how you average your LDFs can be done independently for each age-to-age period.  For example, we can use `volume` averaging on the first pattern, `simple` the second, `regression` the third, and then repeat the cycle as follows:

In [21]:
cl.Development(average=['volume', 'simple', 'regression']*3).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7456,1.4619,1.1739,1.1112,1.0873,1.0539,1.0748,1.0177
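
As a rough illustration of what a per-period choice means, the sketch below recomputes the first two factors by hand in plain Python: `volume` for 12-24 and `simple` for 24-36. The helper `average_ldf` is a hypothetical name, and `regression` is left out to keep the example short.

```python
# First three development ages of `genins`, one value per origin year.
col12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686, 344014]
col24 = [1124788, 1236139, 1292306, 1418858, 1136350, 1333217, 1288463, 1421128, 1363294]
col36 = [1735330, 2170033, 2218525, 2195047, 2128333, 2180715, 2419861, 2864498]


def average_ldf(prior, later, how):
    """Average the link ratios later/prior over origins present in both columns."""
    pairs = list(zip(prior, later))  # zip stops at the shorter (later) column
    if how == 'volume':
        return sum(nxt for _, nxt in pairs) / sum(cur for cur, _ in pairs)
    if how == 'simple':
        return sum(nxt / cur for cur, nxt in pairs) / len(pairs)
    raise ValueError(f"unsupported average: {how!r}")


columns = [col12, col24, col36]
methods = ['volume', 'simple']  # one choice per age-to-age period
ldfs = [average_ldf(columns[i], columns[i + 1], how) for i, how in enumerate(methods)]
print([round(f, 4) for f in ldfs])
```

The two values should match the 12-24 and 24-36 entries of the output above.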


As another example, we can use `volume`-weighting for the first and last three patterns with `simple` averaging in between.

In [22]:
cl.Development(average=['volume']+['simple']*5+['volume']*3).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7456,1.452,1.181,1.1112,1.0848,1.0539,1.0766,1.0177


### Averaging Period

`Development` comes with an `n_periods` parameter that allows you to select the latest `n` valuation periods for fitting your development patterns.  `n_periods=-1` is used to indicate the usage of all available periods.

In [23]:
cl.Development(n_periods=3).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4604,1.8465,1.392,1.1539,1.0849,1.0974,1.0539,1.0766,1.0177


The units of `n_periods` follow the `origin_grain` of your triangle.

In [24]:
dev = cl.Development(n_periods=5).fit(genins)
print('Using ' + str(dev.n_periods) + str(genins.origin_grain) + ' Avg')
dev.ldf_

Using 5Y Avg


Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.2448,1.7867,1.4682,1.1651,1.1038,1.0863,1.0539,1.0766,1.0177


Much like `average`, `n_periods` can also be set for each age-to-age period individually.

In [25]:
cl.Development(n_periods=[8,2,6,5,-1,2,-1,-1,5]).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5325,1.9502,1.4808,1.1651,1.1038,1.0825,1.0539,1.0766,1.0177


Note that if you want more `n_periods` than are available for any particular age-to-age period, all available periods will be used instead.

In [26]:
cl.Development(n_periods=[1,2,3,4,5,6,7,8,9]).fit(genins).ldf_ == \
cl.Development(n_periods=[1,2,3,4,5,4,3,2,1]).fit(genins).ldf_

True
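
The effect of `n_periods` on a single age-to-age factor can be sketched by hand. The plain-Python helper below (`volume_ldf` is a hypothetical name, not a `chainladder` function) restricts the volume-weighted 12-24 factor of `genins` to the latest origins and falls back to all available periods when more are requested than exist.

```python
# 12- and 24-month cumulative losses for origins 2001-2009 of `genins`.
col12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
col24 = [1124788, 1236139, 1292306, 1418858, 1136350, 1333217, 1288463, 1421128, 1363294]


def volume_ldf(prior, later, n_periods=-1):
    """Volume-weighted LDF over the latest n_periods origins (-1 means all).

    Assumes prior and later are aligned by origin and n_periods is -1 or >= 1.
    """
    available = len(later)
    n = available if n_periods == -1 else min(n_periods, available)
    return sum(later[-n:]) / sum(prior[-n:])


print(round(volume_ldf(col12, col24, n_periods=3), 4))
print(round(volume_ldf(col12, col24, n_periods=5), 4))
# Asking for more periods than exist uses everything that is available.
print(volume_ldf(col12, col24, n_periods=20) == volume_ldf(col12, col24))
```

The first two values should match the 12-24 factors from the `n_periods=3` and `n_periods=5` fits above.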

### Dropping problematic link-ratios

Even with `n_periods`, there are situations where you might want to be more surgical in your picks.  For example, you could have a valuation period with bad data and wish to omit the entire diagonal from your averaging.

In [27]:
cl.Development(drop_valuation='2004').fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.3797,1.7517,1.4426,1.1651,1.1038,1.0863,1.0539,1.0766,1.0177


Maybe you want to do olympic averaging (i.e. excluding the high and low link-ratio from each period).

In [28]:
cl.Development(drop_high=True, drop_low=True).fit(genins).ldf_

  "drop_high and drop_low cannot be computed "


Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5201,1.7277,1.4351,1.193,1.1018,1.0825,1.0573,1.0766,1.0177
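
For a single age-to-age period, olympic averaging can be sketched in plain Python as follows. This is an illustration under the assumption that the remaining link-ratios are still volume-weighted; `olympic_volume_ldf` is a hypothetical helper, not a library function.

```python
# 12- and 24-month cumulative losses for origins 2001-2009 of `genins`.
col12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
col24 = [1124788, 1236139, 1292306, 1418858, 1136350, 1333217, 1288463, 1421128, 1363294]


def olympic_volume_ldf(prior, later):
    """Volume-weighted LDF after removing the highest and lowest link ratio."""
    ratios = [nxt / cur for cur, nxt in zip(prior, later)]
    high = ratios.index(max(ratios))
    low = ratios.index(min(ratios))
    kept = [i for i in range(len(ratios)) if i not in (high, low)]
    return sum(later[i] for i in kept) / sum(prior[i] for i in kept)


print(round(olympic_volume_ldf(col12, col24), 4))
```

Here the 2004 ratio (the highest) and the 2005 ratio (the lowest) are excluded, and the result should match the 12-24 factor in the output above.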


Or maybe there is just a single outlier link-ratio that you don't think is indicative of future development.  For these, you can specify the intersection of the origin and development age of the **denominator** of the link-ratio to `drop`.

In [29]:
cl.Development(drop=('2004', 12)).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.3797,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177
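
The effect of dropping one link-ratio can be verified by hand. The plain-Python sketch below recomputes the volume-weighted 12-24 factor with and without origin 2004; `volume_ldf` and its `drop` argument are hypothetical names used for illustration only.

```python
# 12- and 24-month cumulative losses for origins 2001-2009 of `genins`.
origins = ['2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009']
col12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
col24 = [1124788, 1236139, 1292306, 1418858, 1136350, 1333217, 1288463, 1421128, 1363294]


def volume_ldf(origins, prior, later, drop=()):
    """Volume-weighted LDF, skipping link ratios whose origin is in `drop`."""
    kept = [(cur, nxt) for o, cur, nxt in zip(origins, prior, later) if o not in drop]
    return sum(nxt for _, nxt in kept) / sum(cur for cur, _ in kept)


print(round(volume_ldf(origins, col12, col24), 4))                 # all origins
print(round(volume_ldf(origins, col12, col24, drop={'2004'}), 4))  # without 2004
```

Only the 12-24 factor changes, because the dropped cell is the denominator of a single link-ratio; the two values should match the 12-24 entries of the default fit and of the `drop=('2004', 12)` fit.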


If there is more than one troublesome outlier, you can also pass a list to the `drop` argument.

In [30]:
cl.Development(drop=[('2004', 12), ('2003', 24)]).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.3797,1.7517,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


### Transformers
In `sklearn` parlance, there are two types of estimators: Transformers (which `Development` is) and Predictors.  The `Development` object is a means of creating development patterns, but it is not itself a reserving model.  Transformers come with the `transform` and `fit_transform` methods.  These return a `Triangle` object augmented with additional information for use in a subsequent IBNR model.

In [31]:
# One flag per age-to-age period (9 in total): drop the high for the first four only
transformed_triangle = cl.Development(drop_high=[True]*4+[False]*5).fit_transform(genins)
transformed_triangle

Unnamed: 0,12,24,36,48,60,72,84,96,108,120
2001,357848,1124788.0,1735330.0,2218270.0,2745596.0,3319994.0,3466336.0,3606286.0,3833515.0,3901463.0
2002,352118,1236139.0,2170033.0,3353322.0,3799067.0,4120063.0,4647867.0,4914039.0,5339085.0,
2003,290507,1292306.0,2218525.0,3235179.0,3985995.0,4132918.0,4628910.0,4909315.0,,
2004,310608,1418858.0,2195047.0,3757447.0,4029929.0,4381982.0,4588268.0,,,
2005,443160,1136350.0,2128333.0,2897821.0,3402672.0,3873311.0,,,,
2006,396132,1333217.0,2180715.0,2985752.0,3691712.0,,,,,
2007,440832,1288463.0,2419861.0,3483130.0,,,,,,
2008,359480,1421128.0,2864498.0,,,,,,,
2009,376686,1363294.0,,,,,,,,
2010,344014,,,,,,,,,


Our transformed triangle behaves like our original `genins` triangle.  However, notice that its `link_ratio` excludes any dropped values you specified.

In [32]:
transformed_triangle.link_ratio.heatmap(cmap='PuBu')

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
2001,3.1432,1.5428,1.2783,,1.2092,1.0441,1.0404,1.063,1.0177
2002,3.5106,1.7555,1.5453,1.1329,1.0845,1.1281,1.0573,1.0865,
2003,4.4485,1.7167,1.4583,1.2321,1.0369,1.12,1.0606,,
2004,,1.5471,,1.0725,1.0874,1.0471,,,
2005,2.5642,1.873,1.3615,1.1742,1.1383,,,,
2006,3.3656,1.6357,1.3692,1.2364,,,,,
2007,2.9228,1.8781,1.4394,,,,,,
2008,3.9533,,,,,,,,
2009,3.6192,,,,,,,,


In [33]:
print(type(transformed_triangle))
transformed_triangle.latest_diagonal

<class 'chainladder.core.triangle.Triangle'>


Unnamed: 0,2010
2001,3901463
2002,5339085
2003,4909315
2004,4588268
2005,3873311
2006,3691712
2007,3483130
2008,2864498
2009,1363294
2010,344014


However, it has other attributes that make it IBNR model-ready.

In [34]:
transformed_triangle.cdf_

Unnamed: 0,12-Ult,24-Ult,36-Ult,48-Ult,60-Ult,72-Ult,84-Ult,96-Ult,108-Ult
(All),13.1367,3.887,2.2809,1.6131,1.3845,1.2543,1.1547,1.0956,1.0177


`fit_transform()` is equivalent to calling `fit` and `transform` in succession on the same triangle.  Again, this should feel very familiar to the `sklearn` practitioner.

In [35]:
cl.Development().fit_transform(genins) == cl.Development().fit(genins).transform(genins)

True

The reason you might want to use `fit` and `transform` separately is to apply development patterns to a different triangle.  For example, we can:

1. Extract the commercial auto triangles from the `clrd` dataset<br>
2. Summarize to an industry level and `fit` a `Development` object<br>
3. `transform` the individual company triangles with the industry development patterns<br>

In [36]:
clrd = cl.load_sample('clrd')
comauto = clrd[clrd['LOB']=='comauto']['CumPaidLoss']

comauto_industry = comauto.sum()
industry_dev = cl.Development().fit(comauto_industry)

industry_dev.transform(comauto)

Unnamed: 0,Triangle Summary
Valuation:,1997-12
Grain:,OYDY
Shape:,"(157, 1, 10, 10)"
Index:,"[GRNAME, LOB]"
Columns:,[CumPaidLoss]


### Working with multidimensional triangles

Several (though not all) of the estimators in `chainladder` can be fit to several triangles simultaneously.  While this can be a convenient shorthand, it will use the same assumptions across every triangle.

In [37]:
clrd = cl.load_sample('clrd').groupby('LOB').sum()['CumPaidLoss']
print('Fitting to ' + str(len(clrd.index)) + ' industries simultaneously.')
cl.Development().fit_transform(clrd).cdf_

Fitting to 6 industries simultaneously.


Unnamed: 0,Triangle Summary
Valuation:,2262-03
Grain:,OYDY
Shape:,"(6, 1, 1, 9)"
Index:,[LOB]
Columns:,[CumPaidLoss]


For greater control, you can slice individual triangles out and fit separate patterns to each.

In [38]:
print(cl.Development(average='simple').fit(clrd.loc['wkcomp']))
print(cl.Development(n_periods=4).fit(clrd.loc['ppauto']))
print(cl.Development(average='regression', n_periods=6).fit(clrd.loc['comauto']))

Development(average='simple', drop=None, drop_high=None, drop_low=None,
            drop_valuation=None, fillna=None, n_periods=-1,
            sigma_interpolation='log-linear')
Development(average='volume', drop=None, drop_high=None, drop_low=None,
            drop_valuation=None, fillna=None, n_periods=4,
            sigma_interpolation='log-linear')
Development(average='regression', drop=None, drop_high=None, drop_low=None,
            drop_valuation=None, fillna=None, n_periods=6,
            sigma_interpolation='log-linear')
