## Exercise for Chapter 3

This exercise is designed to help you use the pandas package to import and preprocess data and to perform basic statistical analysis. Later we will see how data-generating events can produce data of interest to insurance analysts.

In this chapter we look at the Local Government Property Insurance Fund. The fund insures property owned by local governments, including:

* government buildings,

* educational institutions,

* public libraries, and

* motor vehicles.

Over a thousand local government units are covered by the fund, which charges about \$25 million in annual premiums and provides insurance coverage of about \$75 billion.

**Example 1** Import the claims dataset, ClaimsExperienceData.csv, from my GitHub repository. Then write Python commands to answer the following questions.

In [1]:
import pandas as pd

claims = pd.read_csv('/Users/Kaemyuijang/SCMA248/Data/ClaimsExperienceData.csv')

1. How many claims observations are there in this dataset?

In [2]:
claims.shape[0]

5639

2. How many variables (features) are there in this dataset? List (print out) all the features. 

In [3]:
# A cell only displays its last expression, so print the column count explicitly.
print(claims.shape[1])
claims.columns

23
Index(['PolicyNum', 'Year', 'LnCoverage', 'BCcov', 'Premium', 'Freq', 'Deduct',
       'y', 'lny', 'yAvg', 'lnDeduct', 'Fire5', 'NoClaimCredit', 'TypeCity',
       'TypeCounty', 'TypeMisc', 'TypeSchool', 'TypeTown', 'TypeVillage',
       'AC00', 'AC05', 'AC10', 'AC15'],
      dtype='object')


## Description of Rating Variables

One of the important tasks of insurance analysts is to develop models to represent and manage the two outcome variables, **frequency** and **severity**. 

However, when actuaries and other financial analysts use those models, they do so in the context of external variables. 

In general statistical terminology, one might call these explanatory or predictor variables.

Because of our insurance focus, we call them **rating variables** as they are useful in setting insurance rates and premiums.

The following table describes the rating variables considered.

These are variables that you think might naturally be related to claims outcomes.

<!-- To handle the skewness, we henceforth focus on logarithmic transformations of coverage and deductibles. -->

<!-- For our immediate purposes, the coverage is our first rating variable. Other things being equal, we would expect that policyholders with larger coverage have larger claims. We will make this vague idea much more precise as we proceed, and also justify this expectation with data. -->

**Variable**  | **Description**
----- | -------------
EntityType    | Categorical variable that is one of six types: (Village, City, County, Misc, School, or Town) 
LnCoverage    | Total building and content coverage, in logarithmic millions of dollars
LnDeduct      | Deductible, in logarithmic dollars
AlarmCredit   | Categorical variable that is one of four types: (0, 5, 10, or 15) for automatic smoke alarms in main rooms
NoClaimCredit | Binary variable to indicate no claims in the past two years
Fire5         | Binary variable to indicate the fire class is below 5 (The range of fire class is 0 to 10)  
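
In the dataset itself, `EntityType` and `AlarmCredit` do not appear as single columns. The column list instead shows indicator columns (`TypeCity` to `TypeVillage`, and `AC00` to `AC15`) that appear to encode them. As a minimal sketch, assuming each row has exactly one indicator equal to 1 within each group, the categorical labels could be rebuilt with `idxmax(axis=1)`. The three rows below are made up for illustration and are not taken from the data.

```python
import pandas as pd

# Made-up rows that mimic the indicator (one-hot) columns of the dataset
toy = pd.DataFrame({
    "TypeCity":    [1, 0, 0],
    "TypeCounty":  [0, 0, 0],
    "TypeMisc":    [0, 0, 0],
    "TypeSchool":  [0, 1, 0],
    "TypeTown":    [0, 0, 0],
    "TypeVillage": [0, 0, 1],
    "AC00": [0, 0, 1],
    "AC05": [0, 0, 0],
    "AC10": [0, 0, 0],
    "AC15": [1, 1, 0],
})

type_cols = [c for c in toy.columns if c.startswith("Type")]
alarm_cols = ["AC00", "AC05", "AC10", "AC15"]

# idxmax(axis=1) returns, for each row, the name of the column holding the 1
toy["EntityType"] = toy[type_cols].idxmax(axis=1).str.replace("Type", "", regex=False)
toy["AlarmCredit"] = toy[alarm_cols].idxmax(axis=1).str[2:].astype(int)

print(toy[["EntityType", "AlarmCredit"]])
```

If a row had no indicator set, `idxmax` would still return the first column name, so the one-indicator-per-row assumption is worth checking first (for example with `toy[type_cols].sum(axis=1)`).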

**In what follows, for illustration, we consider the claims data for the year 2010.**

3. How many policies are there in 2010? 

Name the answer with the variable name **num_policies**. 

Hint: one may use the `.value_counts()` method, which returns a Series containing counts of unique values. Alternatively, if you want to count `False` and `True` separately, you can use `pd.Series.sum()` together with the negation operator `~`.

In [4]:
temp = claims['Year']  == 2010
temp.value_counts()

False    4529
True     1110
Name: Year, dtype: int64

In [5]:
num_policies = temp.sum()

In [6]:
(~temp).sum()

4529
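
The counting idiom used above can be tried on a small made-up series. The sketch below uses hypothetical years (not the fund's data) and shows that summing a Boolean mask counts the `True` entries, while `~` flips the mask so that the sum counts the `False` entries.

```python
import pandas as pd

# Hypothetical toy data, not taken from ClaimsExperienceData.csv
years = pd.Series([2008, 2009, 2010, 2010, 2006, 2007], name="Year")

mask = years == 2010              # Boolean Series: True where the year is 2010

n_true = int(mask.sum())          # True counts as 1, so the sum is the number of matches
n_false = int((~mask).sum())      # ~ negates the mask element-wise

counts = mask.value_counts()      # the same information as a Series indexed by False/True

print(n_true, n_false)
print(counts)
```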

4. How many claims are there in 2010? Assign the result to the variable **num_claims**.

In [7]:
claims2010 = claims[temp]

In [8]:
claims2010.columns

Index(['PolicyNum', 'Year', 'LnCoverage', 'BCcov', 'Premium', 'Freq', 'Deduct',
       'y', 'lny', 'yAvg', 'lnDeduct', 'Fire5', 'NoClaimCredit', 'TypeCity',
       'TypeCounty', 'TypeMisc', 'TypeSchool', 'TypeTown', 'TypeVillage',
       'AC00', 'AC05', 'AC10', 'AC15'],
      dtype='object')

In [9]:
claims2010.sum()

PolicyNum        1.652057e+08
Year             2.231100e+06
LnCoverage       2.488480e+03
BCcov            4.577870e+10
Premium          1.590532e+07
Freq             1.377000e+03
Deduct           3.994500e+06
y                3.665931e+07
lny              3.736787e+03
yAvg             2.270177e+07
lnDeduct         8.013576e+03
Fire5            6.220000e+02
NoClaimCredit    6.270000e+02
TypeCity         1.560000e+02
TypeCounty       7.100000e+01
TypeMisc         1.230000e+02
TypeSchool       3.110000e+02
TypeTown         1.850000e+02
TypeVillage      2.640000e+02
AC00             3.460000e+02
AC05             8.200000e+01
AC10             9.000000e+01
AC15             5.920000e+02
dtype: float64

In [10]:
num_claims = claims2010['Freq'].sum()
print(num_claims)

1377


5. Which policy number has the maximum number of claims, and what is that number of claims?

In [11]:
claims2010.sort_values('Freq', ascending = False).head(2)

Unnamed: 0,PolicyNum,Year,LnCoverage,BCcov,Premium,Freq,Deduct,y,lny,yAvg,...,TypeCity,TypeCounty,TypeMisc,TypeSchool,TypeTown,TypeVillage,AC00,AC05,AC10,AC15
1406,138109,2010,6.338472,565930800.0,124504,239,25000.0,223784.65,12.318439,936.337448,...,0,0,0,1,0,0,0,0,0,1
100,120030,2010,7.801717,2444797000.0,391168,103,50000.0,4920530.65,15.408927,47772.14223,...,0,1,0,0,0,0,0,0,0,1


# Hard coding the row label

claims2010.loc[1406,'Freq']

With `.idxmax()`, we can get the index label of the row in which the maximum value of `Freq` occurs.

See https://www.geeksforgeeks.org/get-the-index-of-maximum-value-in-dataframe-column/.

In [12]:
print(claims2010['Freq'].idxmax())

ind_freq_max = claims2010['Freq'].idxmax()

max_claims = claims2010.loc[ind_freq_max,'Freq'] 

1406
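
To see why `.idxmax()` is preferable to hard coding a row label, here is a small sketch on made-up data (the policy numbers, counts and index labels are all hypothetical). Note that `.idxmax()` returns the index label rather than the position, and that it returns the first label when the maximum is tied.

```python
import pandas as pd

# Hypothetical policies; the index labels are deliberately not 0, 1, 2, ...
toy = pd.DataFrame(
    {"PolicyNum": [101, 102, 103, 104], "Freq": [0, 7, 2, 7]},
    index=[10, 25, 31, 47],
)

label = toy["Freq"].idxmax()            # index label of the first maximum
policy = toy.loc[label, "PolicyNum"]    # .loc looks rows up by label
max_freq = toy.loc[label, "Freq"]

# All policies sharing the maximum, in case of ties
tied = toy.loc[toy["Freq"] == toy["Freq"].max(), "PolicyNum"].tolist()

print(label, policy, max_freq, tied)
```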


6. Calculate the proportion of policyholders who did not have any claims (store the number of such policyholders in **num_policies_no_claims**).

In [13]:
# Use value_counts() and .sort_index() to obtain the number of
# policies for each number of claims.

freq_counts = claims2010['Freq'].value_counts().sort_index()

# Look up the count by its label (0 claims) rather than by position.
num_policies_no_claims = freq_counts.loc[0]

In [14]:
# Calculate the proportion of policyholders who did not have any claims.

round(num_policies_no_claims / num_policies,4)

0.6369

In [15]:
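# Caution: this divides by the total number of claims (1377), not by the number
# of policies (1110), so it is not the proportion of policyholders without claims.
(claims2010['Freq'].value_counts())[0]/claims2010['Freq'].sum()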

0.5134350036310821

7. Calculate the proportion of policyholders who had only one claim.

In [16]:
num_policies_one_claims = (claims2010['Freq'].value_counts()).sort_index()[1]

In [17]:
round(num_policies_one_claims / num_policies,4)

0.1883
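
As an alternative sketch, the mean of a Boolean mask gives a proportion directly, because `True` counts as 1 and `False` as 0. The claim counts below are made up, and `value_counts(normalize=True)` is shown as a second route to the same numbers.

```python
import pandas as pd

# Made-up claim counts for ten policies
freq = pd.Series([0, 0, 1, 0, 3, 1, 0, 0, 2, 0], name="Freq")

prop_no_claims = (freq == 0).mean()     # 6 of the 10 policies have no claims
prop_one_claim = (freq == 1).mean()     # 2 of the 10 policies have one claim

# normalize=True turns the counts into proportions of all policies
props = freq.value_counts(normalize=True).sort_index()

print(prop_no_claims, prop_one_claim)
print(props)
```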

8. Calculate the average number of claims for this sample. 

In [18]:
num_claims/num_policies

1.2405405405405405

9. The `describe()` method computes summary statistics, such as percentiles, the mean and the standard deviation, for the numerical values of a Series or DataFrame.

Applied to the 2010 data, what do we get when we run the command `claims2010.describe()`?

In [19]:
claims2010.describe()

Unnamed: 0,PolicyNum,Year,LnCoverage,BCcov,Premium,Freq,Deduct,y,lny,yAvg,...,TypeCity,TypeCounty,TypeMisc,TypeSchool,TypeTown,TypeVillage,AC00,AC05,AC10,AC15
count,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,...,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0,1110.0
mean,148833.995495,2010.0,2.241874,41242070.0,14329.113514,1.240541,3598.648649,33026.4,3.366475,20452.05,...,0.140541,0.063964,0.110811,0.28018,0.166667,0.237838,0.311712,0.073874,0.081081,0.533333
std,16131.790893,0.0,1.962844,114243200.0,24663.572338,8.154437,8787.925562,428778.2,4.573141,392724.1,...,0.347704,0.244799,0.314039,0.44929,0.372846,0.425951,0.463402,0.261683,0.273083,0.499113
min,120002.0,2010.0,-4.575223,10304.0,9.0,0.0,500.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,138104.25,2010.0,0.900512,2460876.0,1633.5,0.0,500.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,150302.0,2010.0,2.523816,12476150.0,6365.0,0.0,1000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,160628.75,2010.0,3.705105,40654310.0,17923.75,1.0,2500.0,4139.75,8.32831,2818.75,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
max,180791.0,2010.0,7.801717,2444797000.0,391168.0,239.0,100000.0,12922220.0,16.374459,12922220.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


10. A common method for determining the severity distribution is to look at the distribution of the sample of 1,377 claims. Another typical strategy is to look at the **distribution of average claims among policyholders who have made claims**.

In our 2010 sample, how many policyholders have made claims?

In [20]:
num_policies - num_policies_no_claims

403

11. The average claim for the 209 policyholders who had only one claim is the same as the single claim they had. 

Write the command(s) to list the average claim of these 209 policyholders.

In [21]:
selected_index = (claims2010['Freq'] == 1)

claims2010[selected_index][['Freq','y']]


Unnamed: 0,Freq,y
4,1,6838.87
9,1,9711.28
14,1,10323.50
24,1,3469.79
31,1,35000.00
...,...,...
5534,1,1851.48
5568,1,3405.00
5603,1,20679.58
5635,1,168304.05


12. Calculate the average claim of the policyholder with the maximum number of claims.

Recall the definitions from question 5:

ind_freq_max = claims2010['Freq'].idxmax()

max_claims = claims2010.loc[ind_freq_max,'Freq']

In [22]:
claims2010.loc[ind_freq_max,'y'] / claims2010.loc[ind_freq_max,'Freq'] 

936.3374476987448

In [23]:
claims.describe()

Unnamed: 0,PolicyNum,Year,LnCoverage,BCcov,Premium,Freq,Deduct,y,lny,yAvg,...,TypeCity,TypeCounty,TypeMisc,TypeSchool,TypeTown,TypeVillage,AC00,AC05,AC10,AC15
count,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,...,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0,5639.0
mean,148880.491222,2007.979784,2.132881,37280850.0,14796.029793,1.109239,3364.869658,17287.3,2.753088,9291.565,...,0.140628,0.058166,0.107998,0.283206,0.172194,0.237808,0.463025,0.042383,0.057989,0.436602
std,15911.165649,1.415949,1.977179,103402000.0,25520.718048,8.549179,8273.670512,230462.7,4.31127,197514.1,...,0.347668,0.234078,0.310405,0.450596,0.377582,0.425779,0.498675,0.20148,0.233743,0.496008
min,120002.0,2006.0,-4.717555,8937.0,9.0,0.0,500.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,138112.0,2007.0,0.785066,2192556.0,1612.5,0.0,500.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,150350.0,2008.0,2.429532,11353570.0,6578.0,0.0,1000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,160619.0,2009.0,3.606218,36826540.0,18184.0,1.0,2500.0,2181.825,7.687917,1519.706,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
max,180791.0,2010.0,7.801717,2444797000.0,412328.0,263.0,100000.0,12922220.0,16.374459,12922220.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [24]:
claims.mean()

PolicyNum        1.488805e+05
Year             2.007980e+03
LnCoverage       2.132881e+00
BCcov            3.728085e+07
Premium          1.479603e+04
Freq             1.109239e+00
Deduct           3.364870e+03
y                1.728730e+04
lny              2.753088e+00
yAvg             9.291565e+03
lnDeduct         7.157236e+00
Fire5            5.552403e-01
NoClaimCredit    3.286044e-01
TypeCity         1.406278e-01
TypeCounty       5.816634e-02
TypeMisc         1.079979e-01
TypeSchool       2.832062e-01
TypeTown         1.721937e-01
TypeVillage      2.378081e-01
AC00             4.630254e-01
AC05             4.238340e-02
AC10             5.798901e-02
AC15             4.366022e-01
dtype: float64

## Part 2

1. Create a table that shows the 2010 claims frequency distribution. The table should contain the number of policies, the number of claims and the proportion (broken down by the number of claims).

1.1. How many policyholders in the 2010 claims data have 9 or more claims?

1.2. What percentage of policyholders have exactly 3 claims?

Goal: the table should tell us the (percentage) proportion of policyholders who did not have any claims, only one claim and so on. 
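
One possible way to build such a table is sketched below on made-up claim counts. The cut-off of 9 follows question 1.1. The column names `Policies`, `Claims` and `Proportion` are illustrative choices rather than names from the dataset.

```python
import pandas as pd

# Made-up claim counts for twelve policies
freq = pd.Series([0, 0, 0, 1, 1, 2, 0, 3, 12, 0, 9, 1], name="Freq")

# Collapse counts of 9 or more into a single category
capped = freq.clip(upper=9)

table = pd.DataFrame({
    "Policies": capped.value_counts().sort_index(),   # number of policies per category
    "Claims": freq.groupby(capped).sum(),             # total number of claims per category
})
# Percentage of all policies in each category
table["Proportion"] = (100 * table["Policies"] / table["Policies"].sum()).round(2)
table.index = [str(k) if k < 9 else "9+" for k in table.index]

print(table)
```

On the real data, `freq` would be replaced by `claims2010['Freq']`.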

2. From those 403 policyholders who made at least one claim, create a table that provides information about the distribution of average claim amounts in year 2010.

2.1. What is the mean of the average claim amounts?

2.2. What is the third quartile of the average claim amounts?

First, we add a column named `ClaimsAvg`, representing the average cost per claim for each observation. The average cost per claim (or claim average) is calculated by dividing the total claim amount by the number of claims.
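
The average cost per claim is the total claim amount (`y`) divided by the number of claims (`Freq`), which can be sketched as follows on made-up rows. Policies with no claims would lead to a division by zero, so this sketch sets their average to 0. That convention is an assumption, and leaving them as missing values would be equally defensible.

```python
import pandas as pd

# Made-up rows: Freq is the number of claims, y the total claim amount
toy = pd.DataFrame({"Freq": [0, 1, 4], "y": [0.0, 2500.0, 12000.0]})

# Average cost per claim = total claim amount / number of claims.
# where() turns zero counts into NaN to avoid dividing by zero;
# fillna(0.0) is this sketch's convention for policies without claims.
toy["ClaimsAvg"] = (toy["y"] / toy["Freq"].where(toy["Freq"] > 0)).fillna(0.0)

# Summary of the averages among policies with at least one claim
summary = toy.loc[toy["Freq"] > 0, "ClaimsAvg"].describe()

print(toy)
print(summary[["mean", "75%"]])
```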

3. Consider the claims data over the 5 years 2006-2010 inclusive. Create a table that shows how the average claim, the average frequency, the average coverage and the number of policyholders vary over time.

3.1 What can you say about the number of policyholders over this period?

3.2 How does the average coverage change over this period?
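
A grouped summary of this kind could be sketched with `groupby` and named aggregation. The rows below are made up, the output column names (`NumPolicies`, `AvgFreq`, `AvgCoverage`, `AvgClaim`) are illustrative, and the sketch assumes that `BCcov` holds the coverage amount. Here the average claim per year is taken as the total claim amount divided by the total number of claims in that year, which is one of several reasonable definitions.

```python
import pandas as pd

# Made-up policy-year records using the dataset's column names
toy = pd.DataFrame({
    "PolicyNum": [1, 2, 3, 1, 2],
    "Year":      [2009, 2009, 2009, 2010, 2010],
    "Freq":      [0, 2, 1, 1, 3],
    "y":         [0.0, 4000.0, 500.0, 1000.0, 9000.0],
    "BCcov":     [1.0e6, 5.0e6, 3.0e6, 1.2e6, 5.4e6],
})

# One row per year, with one named output column per statistic
by_year = toy.groupby("Year").agg(
    NumPolicies=("PolicyNum", "count"),
    AvgFreq=("Freq", "mean"),
    AvgCoverage=("BCcov", "mean"),
    TotalClaims=("Freq", "sum"),
    TotalAmount=("y", "sum"),
)
by_year["AvgClaim"] = by_year["TotalAmount"] / by_year["TotalClaims"]

print(by_year[["NumPolicies", "AvgFreq", "AvgCoverage", "AvgClaim"]])
```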