In [10]:
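# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
# Widen the notebook container so wide DataFrames are easier to read
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline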
import matplotlib.pyplot as plt
import pandas as pd


# Data Analysis with Python for Excel Users - Merge and Reshape Data

<!-- PELICAN_BEGIN_SUMMARY -->

As a CPA, I have had to work with datasets from data processing applications such as insurance applications, general ledgers
and regulatory reporting systems.

The process is to bring all the data together and then build reinsurance analyses for pricing treaties, pro forma financial analyses
for new business opportunities, regulatory compliance reports and risk assessments for stress tests.

As the old saying goes, the data must be valid; otherwise it is "garbage in, garbage out".

The first step of the data preparation process is loading, validating, merging and reshaping the data.

<!-- PELICAN_END_SUMMARY -->

### How to get a clean and valid dataset?
-  Join two or more dataframes by using a key or composite key
-  Spot invalid data by using pandas
    - pandas offers functionality similar to Excel's VLOOKUP, SUMIF, etc.
    - Every fellow accountant reading this blog might be asking, "Why don't I just continue using Excel?"
    - It took me a while to understand that pandas offers a much more effective way to deal with huge datasets
-  Remove duplicate data
-  Calculate the average or maximum value of a given dataset
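The steps in this list can be sketched on a toy dataset. This is only an illustration: the two small frames and their values are made up, and the column names are assumptions loosely modelled on the example workbook.

```python
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
premium = pd.DataFrame({
    "Policy Number": [101, 102, 102, 103],
    "LOB": ["Dentist", "Phy", "Phy", "Dentist"],
    "Premium": [12000, 150000, 150000, 18000],
})
loss = pd.DataFrame({
    "Policy Number": [101, 103],
    "Loss": [5000, 9000],
})

# Remove exact duplicate rows (policy 102 appears twice).
premium = premium.drop_duplicates()

# VLOOKUP-style lookup: bring Loss onto each policy.
# Policies without a loss record get NaN in the Loss column.
merged = premium.merge(loss, on="Policy Number", how="left")

# SUMIF/AVERAGEIF/MAXIFS-style aggregation by line of business.
summary = merged.groupby("LOB")["Premium"].agg(["sum", "mean", "max"])
print(summary)
```

A left merge is used here because it behaves most like VLOOKUP: every row of the left table is kept, and missing matches show up as NaN rather than as an error value.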

### Example - Reinsurance Premium Loss Data
The example merges and reshapes data from different applications in order to determine the business's reinsurance coverage.

I understand this is an oversimplified reinsurance example.
- the example data is generated by randomization algorithms
- the input data is an Excel file consisting of 3 tabs/sheets: premium, loss and reinsurance

### Understand the business: what is an abnormal value for this business?
- If the average premium per policy is between 10K and 200K, a premium outside this range is invalid
- If the maximum ceded reinsurance retention is 500K, a ceded recoverable greater than 500K is invalid
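Rules like these translate directly into boolean masks. The sketch below applies the two thresholds to a few made-up rows. The `Ceded Recoverable` column name is an assumption (the workbook may use a different name), and the premium range is applied to each policy's own premium.

```python
import pandas as pd

# Hypothetical rows; "Ceded Recoverable" is an assumed column name.
policies = pd.DataFrame({
    "Policy Number": [1, 2, 3, 4],
    "Premium": [5132, 133494, 232058, 50000],
    "Ceded Recoverable": [0, 100000, 250000, 600000],
})

# Thresholds taken from the business rules above.
PREMIUM_MIN, PREMIUM_MAX = 10_000, 200_000
RETENTION_MAX = 500_000

# Series.between() is inclusive on both ends by default.
policies["premium_ok"] = policies["Premium"].between(PREMIUM_MIN, PREMIUM_MAX)
policies["ceded_ok"] = policies["Ceded Recoverable"] <= RETENTION_MAX

# Keep every row that breaks at least one rule.
invalid = policies[~(policies["premium_ok"] & policies["ceded_ok"])]
print(invalid)
```

Keeping the two checks as separate columns makes it easy to report why a row was rejected, much like adding helper columns in a spreadsheet.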

### Display Excel Tabs 

In [11]:
# Note: df is a pd.ExcelFile handle on the workbook, not a DataFrame
df = pd.ExcelFile('data/Premium_Loss.xlsx')
print('df.sheet_names=%s' % df.sheet_names)

df.sheet_names=['premium', 'loss', 'reinsurance']


### Determine which policies have incurred a loss
- Use the merge function with Policy Number as the key
- merge defaults to an inner join, so only policies with a matching loss record are kept

In [12]:
fs1 = df.parse("premium")
fs2 = df.parse("loss")
df1 = pd.merge(fs1, fs2, on='Policy Number')
df1.head(3)

Unnamed: 0,Policy Number,First Name_x,Last Name_x,Policy Effective Date,LOB,Policy Type_x,Premium_x,Coverage,Excess Coverage,Excess Policy,Claim Number,Policy Type_y,First Name_y,Last Name_y,Claim Report Date,LAE,Loss,Premium_y
0,10880,Kurtis,Dumm,2017-07-08,Phy,CM,232058,1000000,3900000,96289.0,880206,CM,Kurtis,Dumm,2016-09-24,74793.0,180260.0,232058.0
1,10948,Florencia,Bilyk,2017-06-05,Podiatrist,CM,5132,1000000,3900000,99671.0,880208,CM,Florencia,Bilyk,2017-04-22,38573.0,289389.0,5132.0
2,10244,Wonda,Hallsworth,2017-08-07,Dentist,OCC,13330,1000000,3900000,92319.0,880016,OCC,Wonda,Hallsworth,2015-09-14,44072.0,167121.0,13330.0
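The `_x` and `_y` suffixes in the output appear because both sheets carry columns with the same name (First Name, Last Name, Policy Type, Premium). One way to tidy this up, shown here as a sketch rather than as part of the original workflow, is to merge on a composite key and give the remaining overlaps explicit suffixes. The frames below reuse a few values from the output above, and the choice of `Policy Type` as a second key is an assumption.

```python
import pandas as pd

# Small frames reusing values from the merged output above.
premium = pd.DataFrame({
    "Policy Number": [10880, 10948],
    "Policy Type": ["CM", "CM"],
    "Premium": [232058, 5132],
})
loss = pd.DataFrame({
    "Policy Number": [10880, 10948],
    "Policy Type": ["CM", "CM"],
    "Premium": [232058.0, 5132.0],
    "Loss": [180260.0, 289389.0],
})

# Composite key: the shared key columns appear only once in the result.
# Other overlapping columns get readable suffixes instead of _x/_y.
merged = pd.merge(
    premium,
    loss,
    on=["Policy Number", "Policy Type"],
    suffixes=("_premium", "_loss"),
    validate="one_to_many",  # raises MergeError if a key repeats in premium
)
print(merged.columns.tolist())
```

The `validate` argument is optional, but it is a cheap guard against accidentally duplicated policies inflating the merged row count.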


### Identify Loss Ratio
- Use an outer merge to match losses with policies
- This is very useful for quickly identifying current calendar year loss experience by line of business (LOB)
- Fields with no match display as NaN (NaT for dates)

In [13]:
fs1 = df.parse("premium")
fs2 = df.parse("loss")
df1 = pd.merge(fs1, fs2, on='Policy Number', how='outer')
df1.head(3)

Unnamed: 0,Policy Number,First Name_x,Last Name_x,Policy Effective Date,LOB,Policy Type_x,Premium_x,Coverage,Excess Coverage,Excess Policy,Claim Number,Policy Type_y,First Name_y,Last Name_y,Claim Report Date,LAE,Loss,Premium_y
0,10880,Kurtis,Dumm,2017-07-08,Phy,CM,232058.0,1000000.0,3900000.0,96289.0,880206.0,CM,Kurtis,Dumm,2016-09-24,74793.0,180260.0,232058.0
1,10948,Florencia,Bilyk,2017-06-05,Podiatrist,CM,5132.0,1000000.0,3900000.0,99671.0,880208.0,CM,Florencia,Bilyk,2017-04-22,38573.0,289389.0,5132.0
2,10962,Taisha,Whack,2017-09-19,Dentist,OCC,11308.0,1000000.0,3900000.0,91809.0,,,,,NaT,,,
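The outer merge only lines the data up; the ratio itself still has to be computed. A minimal sketch follows, assuming the loss ratio is defined as (Loss + LAE) / Premium, which is an assumption and not stated in this post. The rows reuse values from the output above.

```python
import pandas as pd

# Rows reuse values from the outer-merge output above.
premium = pd.DataFrame({
    "Policy Number": [10880, 10948, 10962],
    "LOB": ["Phy", "Podiatrist", "Dentist"],
    "Premium": [232058, 5132, 11308],
})
loss = pd.DataFrame({
    "Policy Number": [10880, 10948],
    "LAE": [74793.0, 38573.0],
    "Loss": [180260.0, 289389.0],
})

# indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(premium, loss, on="Policy Number", how="outer", indicator=True)

# Policies with no claim are flagged as 'left_only'.
no_claims = merged.loc[merged["_merge"] == "left_only", "Policy Number"].tolist()

# Sum by line of business; NaN losses are skipped, so they count as zero.
by_lob = merged.groupby("LOB")[["Premium", "Loss", "LAE"]].sum()

# Assumed definition: (Loss + LAE) / Premium.
by_lob["Loss Ratio"] = (by_lob["Loss"] + by_lob["LAE"]) / by_lob["Premium"]
print(by_lob)
```

Because the outer merge keeps policies without claims, their premium still counts in the denominator, which is what a line-of-business loss ratio needs.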


### Ceded Loss Ratio
- Identify policies that trigger reinsurance coverage by matching policies to ceded losses
- Add a ceded loss ratio to help determine and adjust reinsurance coverage
- Track the loss/LAE ratio
- Generate a reinsurance report based on policy effective date and loss date
- Generate a report of losses that hit the reinsurance layer

In [14]:
fs3 = df.parse("reinsurance")
# fs1 is the premium sheet parsed in the earlier cell
df2 = pd.merge(fs1, fs3, on='Policy Number', how='inner')
df2.tail(3)

Unnamed: 0,Policy Number,First Name_x,Last Name_x,Policy Effective Date,LOB_x,Policy Type,Premium_x,Coverage,Excess Coverage,Excess Policy,...,Report Date,Policy Start Date,Policy End Date,Policy Form,Premium_y,Indemnity Reserve,ALAE Reserve,Net ALAE Reserve,Indemnity Payment,ALAE Payment
21,12602,Penelope,Eleonora,2017-08-27,Phy,OCC,133494,1000000,3900000,98244.0,...,2017-10-15,2017-08-27,2018-08-27,OCC,133494,50000,20000,,73677,145562
22,12652,,Brasher Hospital,2017-07-25,Hospital,Tail,371474,1000000,3900000,95279.0,...,2017-10-15,2017-07-25,2018-07-25,Tail,371474,50000,20000,,73677,145562
23,19162,Austin,Jonesy,2017-11-10,Phy,CM,152892,1000000,3900000,90538.0,...,2018-06-14,2017-08-18,2018-08-18,OCC,69853,50000,20000,,81447,166528
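The merged frame has the pieces needed for a ceded loss ratio, but the ratio is not computed above. The sketch below is one possible reading. It assumes the reserve and payment columns on the reinsurance sheet are ceded amounts, and that the ratio is total incurred (reserves plus payments) divided by the policy premium. Neither assumption is confirmed by the data, so treat this as an illustration only. The rows reuse values from the output above.

```python
import pandas as pd

# Rows reuse values from the df2.tail(3) output above.
df2 = pd.DataFrame({
    "Policy Number": [12602, 12652, 19162],
    "Premium_x": [133494, 371474, 152892],
    "Indemnity Reserve": [50000, 50000, 50000],
    "ALAE Reserve": [20000, 20000, 20000],
    "Indemnity Payment": [73677, 73677, 81447],
    "ALAE Payment": [145562, 145562, 166528],
})

# Assumption: incurred = reserves + payments, for both indemnity and ALAE.
incurred_cols = [
    "Indemnity Reserve", "ALAE Reserve",
    "Indemnity Payment", "ALAE Payment",
]
df2["Incurred"] = df2[incurred_cols].sum(axis=1)

# Assumed definition of the ceded loss ratio: incurred / policy premium.
df2["Ceded Loss Ratio"] = df2["Incurred"] / df2["Premium_x"]
print(df2[["Policy Number", "Incurred", "Ceded Loss Ratio"]])
```

Sorting this frame by `Ceded Loss Ratio` would give a quick view of which policies put the most pressure on the reinsurance layer.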
