In [25]:
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd


# Merge and Reshape Data


<!-- PELICAN_BEGIN_SUMMARY -->

As a CPA, I work with datasets exported from many data processing applications and use them for financial data analysis.

In my case, the data comes from insurance and general ledger software systems.
It then has to be loaded into another software system used for reporting, such as Cognos.

In many cases, the way the data is stored in files or databases is not well suited to data analysis.

The first step in my financial data analysis and modeling is to prepare the data.
The preparation process involves loading, validating, merging, and reshaping the data.

<!-- PELICAN_END_SUMMARY -->

How do you get a clean and valid dataset for data analysis?

-  Join two or more dataframes on a key or composite key
-  Spot invalid data with pandas
    - it offers functionality similar to Excel's VLOOKUP, SUMIF, etc.
    - I expect every fellow accountant reading this to be asking, "Why don't I just continue using Excel?"
-  Remove duplicate data
-  Understand the business, for example the average or maximum value of a given dataset

Data preparation helps in spotting unusual values in the dataset. Those irregularities may need further investigation before the data can be validated and made sense of.
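
The checks listed above can be sketched on a tiny hand-made table. The column names mirror the sample workbook used later in this post, but the rows and the variable names (`premium`, `premium_by_lob`, `premium_stats`) are made up for illustration only.

```python
import pandas as pd

# Hypothetical premium rows; policy 10002 is entered twice on purpose
premium = pd.DataFrame({
    "Policy Number": [10001, 10002, 10002, 10003],
    "LOB": ["Dentist", "Phy", "Phy", "Dentist"],
    "Premium": [12000, 230000, 230000, 11000],
})

# Remove exact duplicate rows
premium = premium.drop_duplicates()

# Rough SUMIF equivalent: total premium per line of business
premium_by_lob = premium.groupby("LOB")["Premium"].sum()

# Quick sanity check on the size of the values
premium_stats = premium["Premium"].agg(["mean", "max"])

print(premium_by_lob)
print(premium_stats)
```

A premium that sits far above the mean, like the Phy policy here, is the kind of value worth a second look before any reporting is done.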
    
## Obtain Sample Reinsurance Premium Loss Data

- The data I am using is mock data generated by randomization algorithms.
- The input is an Excel file with 3 tabs/sheets:
    - premium
    - loss
    - reinsurance
- How to display each tab name
- How to merge data using the key "Policy Number"
- Generate a reinsurance report based on policy effective date and loss date
- Generate a report of losses that hit the reinsurance layer
- Generate the Loss/LAE ratio against premium


## Display Excel tab names

In [56]:
# df is a pd.ExcelFile handle for the whole workbook, not a DataFrame
df = pd.ExcelFile('../extra/Premium_Loss.xlsx')
print('df.sheet_names=%s' % df.sheet_names)

df.sheet_names=['premium', 'loss', 'reinsurance']


## Determine which policy has incurred loss

- Use the pandas merge function with Policy Number as the key

In [57]:
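# Load the premium and loss sheets as DataFrames
fs1 = df.parse("premium")
fs2 = df.parse("loss")
# pd.merge defaults to an inner join when `how` is not specified,
# so only policies that appear in both sheets are kept
df1 = pd.merge(fs1, fs2, on='Policy Number')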
df1.head()

Unnamed: 0,Policy Number,First Name_x,Last Name_x,Policy Effective Date,LOB,Policy Type_x,Premium_x,Coverage,Excess Coverage,Excess Policy,Claim Number,Policy Type_y,First Name_y,Last Name_y,Claim Report Date,LAE,Loss,Premium_y
0,10880,Kurtis,Dumm,2017-07-08,Phy,CM,232058,1000000,3900000,96289.0,880206,CM,Kurtis,Dumm,2016-09-24,74793.0,180260.0,232058.0
1,10948,Florencia,Bilyk,2017-06-05,Podiatrist,CM,5132,1000000,3900000,99671.0,880208,CM,Florencia,Bilyk,2017-04-22,38573.0,289389.0,5132.0
2,10244,Wonda,Hallsworth,2017-08-07,Dentist,OCC,13330,1000000,3900000,92319.0,880016,OCC,Wonda,Hallsworth,2015-09-14,44072.0,167121.0,13330.0
3,10509,Patrica,Hartle,2017-02-09,Podiatrist,OCC,5647,1000000,3900000,96730.0,880041,OCC,Patrica,Hartle,2013-11-16,88069.0,102638.0,5647.0
4,10700,Sabra,Higgenbotham,2017-10-07,PA,OCC,12021,1000000,3900000,90239.0,880205,OCC,Sabra,Higgenbotham,2013-12-05,97247.0,37639.0,12021.0
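
In the output above, pandas added `_x` and `_y` to columns that exist in both sheets, such as Premium. As a minimal sketch on made-up frames (the rows and the suffix names are assumptions, not part of the original workbook), the `suffixes` argument gives those columns more readable names, and `validate` checks that each policy number appears only once on the premium side.

```python
import pandas as pd

# Hypothetical stand-ins for the premium and loss sheets
premium = pd.DataFrame({
    "Policy Number": [10001, 10002, 10003],
    "Premium": [12000, 230000, 11000],
})
loss = pd.DataFrame({
    "Policy Number": [10002, 10003],
    "Premium": [230000, 11000],
    "Loss": [180000.0, 37000.0],
})

# suffixes replaces the default _x/_y on overlapping column names;
# validate raises a MergeError if a policy number repeats in `premium`
with_loss = pd.merge(
    premium, loss,
    on="Policy Number",
    suffixes=("_premium", "_loss"),
    validate="one_to_many",
)
print(with_loss.columns.tolist())
```

Policy 10001 has no claim, so the default inner join drops it from `with_loss`.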


## Identify Loss Ratio

- We can also use an outer merge to see the loss experience of every policy, whether or not it has a matching claim
- This is extremely useful for business leaders who need to quickly identify the current calendar year loss experience of each line of business

In [58]:
# how='outer' keeps every policy; columns from the other sheet show NaN (NaT for dates) when there is no match
fs1 = df.parse("premium")
fs2 = df.parse("loss")
df1 = pd.merge(fs1, fs2, on='Policy Number', how='outer')
df1.head()

Unnamed: 0,Policy Number,First Name_x,Last Name_x,Policy Effective Date,LOB,Policy Type_x,Premium_x,Coverage,Excess Coverage,Excess Policy,Claim Number,Policy Type_y,First Name_y,Last Name_y,Claim Report Date,LAE,Loss,Premium_y
0,10880,Kurtis,Dumm,2017-07-08,Phy,CM,232058.0,1000000.0,3900000.0,96289.0,880206.0,CM,Kurtis,Dumm,2016-09-24,74793.0,180260.0,232058.0
1,10948,Florencia,Bilyk,2017-06-05,Podiatrist,CM,5132.0,1000000.0,3900000.0,99671.0,880208.0,CM,Florencia,Bilyk,2017-04-22,38573.0,289389.0,5132.0
2,10962,Taisha,Whack,2017-09-19,Dentist,OCC,11308.0,1000000.0,3900000.0,91809.0,,,,,NaT,,,
3,11028,Yun,Linely,2017-10-11,Dentist,OCC,13381.0,1000000.0,3900000.0,94102.0,,,,,NaT,,,
4,10244,Wonda,Hallsworth,2017-08-07,Dentist,OCC,13330.0,1000000.0,3900000.0,92319.0,880016.0,OCC,Wonda,Hallsworth,2015-09-14,44072.0,167121.0,13330.0
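
The outer merge above lists the policies, but it does not yet compute a ratio. The sketch below is one possible way to do that on made-up rows. It assumes the ratio is (Loss + LAE) / Premium summed per line of business, which is my reading of "Loss/LAE ratio" rather than a definition taken from the workbook. The `indicator=True` argument adds a `_merge` column that flags policies with no claim.

```python
import pandas as pd

# Hypothetical rows; policy 10001 has no claim
premium = pd.DataFrame({
    "Policy Number": [10001, 10002, 10003],
    "LOB": ["Dentist", "Phy", "Dentist"],
    "Premium": [10000, 200000, 10000],
})
loss = pd.DataFrame({
    "Policy Number": [10002, 10003],
    "Loss": [90000.0, 4000.0],
    "LAE": [10000.0, 1000.0],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(premium, loss, on="Policy Number", how="outer", indicator=True)

# Policies without a claim come back as NaN; count them as zero loss
merged[["Loss", "LAE"]] = merged[["Loss", "LAE"]].fillna(0)

# Assumed definition: (Loss + LAE) / Premium per line of business
by_lob = merged.groupby("LOB")[["Premium", "Loss", "LAE"]].sum()
by_lob["Loss/LAE Ratio"] = (by_lob["Loss"] + by_lob["LAE"]) / by_lob["Premium"]
print(by_lob)
```

Each policy in this sketch has at most one claim. With several claims per policy, the premium would repeat on every claim row and would need to be de-duplicated before summing.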


## Ceded Loss Ratio
- Match reinsurance records to policies by Policy Number to find which policies trigger reinsurance coverage and to see the ceded loss ratio
- This is extremely useful for business leaders when determining reinsurance coverage

In [59]:
fs3 = df.parse("reinsurance")
df2 = pd.merge(fs1, fs3, on='Policy Number', how='inner')
df2.head(5)

Unnamed: 0,Policy Number,First Name_x,Last Name_x,Policy Effective Date,LOB_x,Policy Type,Premium_x,Coverage,Excess Coverage,Excess Policy,...,Report Date,Policy Start Date,Policy End Date,Policy Form,Premium_y,Indemnity Reserve,ALAE Reserve,Net ALAE Reserve,Indemnity Payment,ALAE Payment
0,10880,Kurtis,Dumm,2017-07-08,Phy,CM,232058,1000000,3900000,96289.0,...,2016-09-24,2017-07-08,2018-07-08,CM,232058,50000,20000,,180260,74793
1,10880,Kurtis,Dumm,2017-07-08,Phy,CM,232058,1000000,3900000,96289.0,...,2018-02-03,2017-07-08,2018-07-08,CM,232058,50000,20000,,66591,48555
2,10948,Florencia,Bilyk,2017-06-05,Podiatrist,CM,5132,1000000,3900000,99671.0,...,2017-04-22,2017-06-05,2018-06-05,CM,5132,50000,20000,,289389,38573
3,10948,Florencia,Bilyk,2017-06-05,Podiatrist,CM,5132,1000000,3900000,99671.0,...,2018-05-27,2017-06-05,2018-06-05,CM,5132,50000,20000,,75211,256554
4,10962,Taisha,Whack,2017-09-19,Dentist,OCC,11308,1000000,3900000,91809.0,...,2017-08-31,2017-09-19,2018-09-19,OCC,11308,50000,20000,,73677,145562
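
The output above shows the same policy on more than one row, because a policy can have several reinsurance records. Summing Premium over those rows would count it more than once. The sketch below, on made-up rows, sums the payments per policy first and then merges. The "Paid Ratio" column is an illustrative assumption: a true ceded loss ratio would also depend on the ceded premium and the reinsurance layer, which this post does not define.

```python
import pandas as pd

# Hypothetical rows; policy 10001 has two reinsurance records
premium = pd.DataFrame({
    "Policy Number": [10001, 10002],
    "Premium": [200000, 10000],
})
reinsurance = pd.DataFrame({
    "Policy Number": [10001, 10001, 10002],
    "Indemnity Payment": [60000, 20000, 3000],
    "ALAE Payment": [15000, 5000, 1000],
})

# Sum payments per policy first so each premium is counted once
paid = reinsurance.groupby("Policy Number", as_index=False)[
    ["Indemnity Payment", "ALAE Payment"]
].sum()

ceded = pd.merge(premium, paid, on="Policy Number", how="inner")

# Assumed measure: total paid (indemnity + ALAE) divided by premium
ceded["Paid Ratio"] = (
    ceded["Indemnity Payment"] + ceded["ALAE Payment"]
) / ceded["Premium"]
print(ceded)
```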
