Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as any collaborators you worked with:

In [None]:
COLLABORATORS = ""

## To receive credit for this assignment, you must also fill out the [AI Use survey](https://forms.gle/ZhR5k8TdAeN8rj4CA)


---

In [None]:
%matplotlib inline
%precision 16
import numpy
import matplotlib.pyplot as plt
import pandas as pd

# Final Project

This notebook will provide a brief structure and rubric for presenting your final project. 

The purpose of the project is 2-fold
* To give you an opportunity to work on a problem you are truly interested in (as this is the best way to actually learn something)
* To demonstrate to me that you understand the overall workflow of problem solving from problem selection to implementation to discussion 

You can choose any subject area that interests you as long as there is a computational component to it.  However, please do not reuse projects or homeworks you have done in other classes.  This should be **your** original work.

**You can work in teams, but clearly identify each person's contribution** and every team member should hand in their own copy of the notebook.

### Structure
There are 5 parts for a total of 100 points that provide the overall structure of a mini research project.

* Abstract
* Introduction and Problem Description
* Brief discussion of Computational approach and import of any additional packages
* Implementation including tests
* Discussion of results and future directions

For grading purposes, please try to make this notebook entirely self contained. 

The project is worth about 2 problem sets and should be of comparable length (please: I will have about 100 of these to read and I am not expecting full 10 page papers).  The actual project does not necessarily have to work but in that case you should demonstrate that you understand why it did not work and what steps you would take next to fix it.

Have fun

## Abstract [10 pts]

Provide a 1-2 paragraph abstract of the project in the style of a research paper.  The abstract should contain

* A brief description of the problem
* A brief justification describing why this problem is important/interesting to you
* A general description of the computational approach
* A brief summary of what you did and what you learned


Every year, extreme weather events cause damage to homes across the United States, many of which are insured. For a property insurance carrier, it is important to model the risks and expected losses that arise from underwriting home insurance policies, especially in states that experience a high frequency of extreme weather events. Thus, this project aims to use FEMA data to model the claims, expected losses, and profits that an insurer might face in states like Florida and Washington. I am personally interested in exploring this problem because I am an aspiring actuary and would like to explore Catastrophe Modeling as applied to the property insurance space, which is full of complex problems and real-world implications.

I began by collecting data from FEMA's National Risk Index, which provides county-level metrics like proprietary natural hazard risk scores and Expected Annual Loss (EAL). Using the EAL and other assumptions about the insurer (such as market share, baseline claim rate, and average insured home value), Generalized Linear Models (GLMs) are fitted to estimate frequency and severity parameters for the underlying claim distributions. Monte Carlo simulation and numerical quadrature are used to capture typical insurance policy mechanics like deductibles and coinsurance rates, which factor into estimating the expected profit for the insurer in each state.

## Introduction [15 pts]

In ~4-5 paragraphs, describe 
* The general problem you want to solve
* Why it is important and what you hope to achieve.

Please provide basic **references**, particularly if you are reproducing results from a paper. Also include any basic equations you plan to solve. 

Please use proper spelling and grammar. 

Property insurance carriers are interested in understanding the risks, as well as the payoffs and potential profits involved in underwriting policies (i.e. providing insurance contracts) in a given market. This involves modeling the expected losses and premiums that the insurer could face, as well as modeling tail-end outcomes that correspond to a year with a high frequency and/or severity of extreme weather events that trigger claims (i.e. liabilities for the insurer) to be filed. These model considerations are especially important for insurers underwriting policies in states that experience relatively many extreme weather events, like Florida and Washington.

Florida is notorious for its exposure to hurricanes, while Washington is a geographically diverse state that normally experiences torrential rainfall, flooding, wildfires, and winter storms. Overall, climate is one of the most important considerations when creating insurance models that forecast losses and profits in any state. Climate change and new regulations continually affect the property insurance industry, making the problem of financial forecasting a dynamic and crucial one for insurers. To illustrate, California passed new legislation this summer that allows insurance companies to develop Wildfire Catastrophe Models to aid their fire insurance underwriting, whereas beforehand, most companies chose to stay out of California by default because of the high level of risk that they were not allowed to properly estimate.

Furthermore, the impact of this forecasting problem extends far beyond predicting a hypothetical insurance company's bottom line: the financial well-being of homeowners is also at stake when an insurer develops such a model, as it helps the insurer design the terms of insurance policies in a given area. However, this project is not a full-fledged catastrophe model that considers several types of perils (e.g. fire, flood). Instead, it is an end-to-end model that takes FEMA National Risk Index data, which is publicly available at the county level, to fit statistical models and run Monte Carlo simulations that forecast claims in the states of Florida and Washington. Numerical integration is also used to solve one form of a classical profit equation in insurance, as follows.

The Gauss-Legendre quadrature method is used to approximate the insurer's expected payout per claim, which is modeled by

$$E[(X-d)^+]=\int_d^{\infty} (x-d)f_X(x)dx,$$

where the claim severity $X$ follows a Gamma distribution with density $f_X$ and $d$ is the deductible. The insurer's share of this amount is then set by the coinsurance rate $c$.

### References

Federal Emergency Management Agency. (n.d.). National risk index: Data & resources. FEMA.
https://hazards.fema.gov/nri/data-resources

## Computational  Methods [10 pts]

Describe the specific approach you will take to solve some concrete aspect of the general problem. 

You should  include all the numerical or computational methods you intend to use.  These can include methods or packages  we did not discuss in class but provide some reference to the method. You do not need to explain in detail how the methods work, but you should describe their basic functionality and justify your choices. 




The project models insured losses and profits for a small insurer in the states of Florida and Washington by combining a traditional frequency-severity statistical model with stochastic Monte Carlo simulations and deterministic numerical integration. There are 3 computational stages in the model pipeline:

#### 1. Parameter estimation and perturbation for underlying claim distributions

As mentioned previously, the FEMA data is provided at the county level and is used to fit two complementary Generalized Linear Models (GLMs) in the classic Frequency-Severity approach. Breaking down losses into the frequency and severity of claims is natural, as the total cost of the claims an insurer can expect to be liable for is the total number of claims times the expected cost per claim.

The Frequency GLM models the claim counts $Y_i$ per county $i$ with the Poisson distribution with mean $\mu_i$, which is the canonical choice for modeling count data in actuarial applications. 

$$ Y_i \sim \text{Poisson}(\mu_i), \; \text{log}(\mu_i)=\text{log}(\text{exposure}_i) + x_i\beta$$
$$\implies \mu_i =\text{exposure}_i * \text{exp}(x_i\beta)$$

Furthermore, the logarithmic link makes the model multiplicative: the factor $\text{exp}(x_i\beta)$ scales the exposure to give the expected count $\mu_i$. This is easy to interpret because the predictor $x_i$ is FEMA's proprietary risk score for each county, rescaled to $[0,1]$. The exposure offset allows the rate to be modeled per unit of exposure, where the exposure of county $i$ is its number of homes times the market share of the insurer (i.e. the number of policies in the insurer's portfolio).

Because actual claim count data is not publicly available, I construct a synthetic expected count using a baseline claim rate (CITE) and county exposures:

$$ \text{num claims}_i^{\text{target}} = \text{exposure}_i \times \text{base rate} \times \text{risk rel}_i $$

The Severity GLM models the severity $X_i$ (i.e. cost) per claim with a Gamma GLM and log link:

$$X_i \sim \text{Gamma}(\mu_i, \; \phi \mu_i^2), \; \text{log}(\mu_i)=x_i \eta,$$

where the Gamma distribution is written in terms of its mean $\mu_i$ and variance $\phi \mu_i^2$, and $\phi$ is a scale parameter. Gamma is a good distribution choice for cost data because claim severities are positive, continuous, and right-skewed (CITE). The interpretation of the log link is that a unit increase in the $j$-th predictor multiplies the mean severity $\mu$ by $e^{\eta_j}$.

Both GLMs are estimated by iteratively reweighted least squares (IRLS) with the `GLM.fit()` method in `statsmodels`. Finite-difference gradient checks are conducted and condition numbers of the design matrix are computed later to check that the GLM fits are well-conditioned.
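
To make the IRLS step concrete, here is a minimal NumPy sketch of a Poisson GLM with log link and exposure offset. It is only an illustration of the technique, not the fitting code used in this project (which relies on `statsmodels`). The function name `poisson_irls`, the synthetic county data, and the "true" coefficients are all assumptions made for this example.

```python
import numpy as np

def poisson_irls(X, y, offset, n_iter=50, tol=1e-10):
    """Fit a Poisson GLM with log link and a fixed offset by IRLS (illustrative)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = offset + X @ beta            # linear predictor including the offset
        mu = np.exp(eta)                   # fitted mean under the log link
        # Working response (offset removed); IRLS weights are W = diag(mu)
        z = (eta - offset) + (y - mu) / mu
        XtW = X.T * mu                     # X^T W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical county-level data (not FEMA values)
rng = np.random.default_rng(0)
n = 200
risk = rng.uniform(0.0, 1.0, n)                      # rescaled risk score in [0, 1]
exposure = rng.integers(500, 5000, n).astype(float)  # policies per county
X = np.column_stack([np.ones(n), risk])              # intercept + risk score
true_beta = np.array([np.log(0.05), 0.8])            # assumed for the demo
y = rng.poisson(exposure * np.exp(X @ true_beta))

beta_hat = poisson_irls(X, y, np.log(exposure))
mu_hat = exposure * np.exp(X @ beta_hat)
print(beta_hat)
```

At convergence the score equations $X^T(y-\hat{\mu})=0$ hold, which gives a simple check that the iteration reached the maximum likelihood estimate.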

Then, these baseline actuarial parameters derived by the GLMs are perturbed stochastically before Monte Carlo simulation:

$$\text{Poisson: } \lambda_{\text{sim}} = \lambda_{\text{base}}(1+\epsilon_\lambda), \quad \epsilon_\lambda \sim \mathcal{N}(0,\sigma_\lambda^2),$$
$$\text{Gamma: } \mu_{\text{sim}} = \mu_{\text{base}}(1+\epsilon_\mu), \quad \epsilon_\mu \sim \mathcal{N}(0,\sigma_\mu^2).$$

This step introduces controlled uncertainty to reflect the heterogeneity of different counties in each state as well as possible model misspecification.
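
The perturbation step can be sketched as follows. The base parameters, the values of $\sigma_\lambda$ and $\sigma_\mu$, and the small positive floor are all assumptions for illustration; the floor is a defensive guard added here because a normal perturbation can in principle push a parameter below zero.

```python
import numpy as np

rng = np.random.default_rng(42)
N_SIM = 10_000

# Hypothetical baseline parameters for three counties (not fitted values)
lambda_base = np.array([120.0, 45.0, 300.0])          # expected claim counts
mu_base = np.array([18_000.0, 22_000.0, 15_000.0])    # mean claim severities
sigma_lambda, sigma_mu = 0.10, 0.10                   # assumed perturbation scales

# One independent perturbation per simulation and county
eps_lambda = rng.normal(0.0, sigma_lambda, size=(N_SIM, lambda_base.size))
eps_mu = rng.normal(0.0, sigma_mu, size=(N_SIM, mu_base.size))

# Multiplicative perturbation, clipped to stay strictly positive
lambda_sim = np.clip(lambda_base * (1.0 + eps_lambda), 1e-8, None)
mu_sim = np.clip(mu_base * (1.0 + eps_mu), 1e-8, None)
```

Because the perturbations have mean zero, the simulated parameters are centered on the GLM estimates, and $\sigma_\lambda$, $\sigma_\mu$ control how far they spread.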

#### 2. Monte Carlo and Numerical Integration for simulating claims and profits

On each Monte Carlo draw (`N_SIM` = 10,000), the claim count for each county is sampled as follows:

$$N \sim \text{Poisson}(\lambda_{\text{sim}})$$

Instead of simulating individual claim severities, which would be computationally expensive and subject to Monte Carlo noise, the expected payout per claim is computed according to a standard policy scheme that includes a deductible $d$ and the coinsurance rate $c$:

$$c \cdot E[(X-d)^+] = c \int_d^{\infty} (x-d)f_X(x)dx, \; \; X \sim \text{Gamma}(k, \; \theta)$$

A numerical integration approach is appropriate here because the deductible truncates the Gamma distribution, so this integral has no elementary closed-form expression (it can be written in terms of the incomplete gamma function, which must itself be evaluated numerically). The implementation in the next cell evaluates the integral using n-point Gauss-Legendre quadrature:

$$\int_d^u g(x)dx \approx \sum_{i=1}^n w_i g(t_i),$$

where $t_i, w_i$ are the Legendre nodes and weights mapped $[-1,1] \to [d,u]$, and $u$, the 99.99th percentile of the Gamma distribution, is set as the upper bound of integration. This approach is a deterministic and numerically stable way of estimating the expected loss per claim.
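
A minimal sketch of this quadrature, under assumed parameter values, is shown below. The function names, the choice of $n=64$ nodes, and the values of $k$, $\theta$, $d$, and $c$ are illustrative assumptions. The reference value uses the identity $E[(X-d)^+] = k\theta\, S_{k+1}(d) - d\, S_k(d)$, where $S_a$ is the survival function of a Gamma distribution with shape $a$ and scale $\theta$; it serves only as a check on the quadrature.

```python
import numpy as np
from scipy.stats import gamma

def expected_payout(k, theta, d, c, n=64, q=0.9999):
    """c * E[(X-d)^+] for X ~ Gamma(shape=k, scale=theta), by Gauss-Legendre."""
    dist = gamma(a=k, scale=theta)
    u = dist.ppf(q)                       # upper bound: q-th quantile of X
    if u <= d:                            # deductible above the truncation point
        return 0.0
    t, w = np.polynomial.legendre.leggauss(n)    # nodes/weights on [-1, 1]
    x = 0.5 * (u - d) * t + 0.5 * (u + d)        # map nodes to [d, u]
    integrand = (x - d) * dist.pdf(x)
    return c * 0.5 * (u - d) * np.sum(w * integrand)

def expected_payout_reference(k, theta, d, c):
    """Same quantity over [d, infinity) via Gamma survival functions."""
    return c * (k * theta * gamma.sf(d, a=k + 1, scale=theta)
                - d * gamma.sf(d, a=k, scale=theta))

# Hypothetical policy and severity parameters
k, theta, d, c = 2.0, 10_000.0, 1_000.0, 0.8
approx = expected_payout(k, theta, d, c)
ref = expected_payout_reference(k, theta, d, c)
print(approx, ref, abs(approx - ref) / ref)
```

The small gap between the two values comes mainly from cutting the integral off at $u$ rather than from the quadrature rule itself, so raising the quantile `q` is the way to shrink it.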

For each simulation trial,

$$\text{CountyLoss}_j = N_j * E[(X-d)^+] * c$$

and the losses for each state are accumulated across each of its counties.
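
The loss accumulation for one state can be sketched as below. The Poisson rates and per-claim payouts are placeholder numbers chosen for illustration, and for simplicity the rates are held fixed across simulations rather than perturbed.

```python
import numpy as np

rng = np.random.default_rng(7)
N_SIM = 10_000

# Hypothetical inputs for three counties in one state
lambda_sim = np.array([120.0, 45.0, 300.0])               # Poisson claim rates
payout_per_claim = np.array([9_500.0, 12_000.0, 8_000.0]) # c * E[(X-d)^+]

# Claim counts N_j for every simulation and county
claims = rng.poisson(lambda_sim, size=(N_SIM, lambda_sim.size))

# CountyLoss_j = N_j * c * E[(X-d)^+], then sum over counties for the state
county_loss = claims * payout_per_claim
state_loss = county_loss.sum(axis=1)

print(state_loss.mean(), state_loss.std())
```

Each entry of `state_loss` is one simulated annual loss for the state, so the array as a whole is the empirical loss distribution.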

#### 3. Compute financial outcomes and Bootstrap Confidence Intervals

Premiums are set by applying a loading factor of 1.2 to the expected loss in each state. Profit in each Monte Carlo simulation is then the premium minus that simulation's loss for the state, which yields an empirical distribution of potential financial outcomes. 

Using the empirical profit distribution from the simulations, the Value at Risk (VaR) and Tail Value at Risk (TVaR) are taken as the empirical 5th percentile of profit and the mean of the worst 5% of profit outcomes, respectively. 
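
A short sketch of these financial outcomes follows. The simulated losses here are placeholder Gamma draws rather than model output, and taking the expected loss to be the mean of the simulated losses is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder state losses standing in for the Monte Carlo output
state_loss = rng.gamma(shape=50.0, scale=80_000.0, size=10_000)

LOADING = 1.2                               # premium = 1.2 x expected loss
premium = LOADING * state_loss.mean()       # expected loss assumed = simulated mean
profit = premium - state_loss               # one profit value per simulation

alpha = 0.05
var_5 = np.quantile(profit, alpha)          # VaR: 5th percentile of profit
tvar_5 = profit[profit <= var_5].mean()     # TVaR: mean of the worst 5% of profits

print(premium, profit.mean(), var_5, tvar_5)
```

By construction the TVaR is never better than the VaR, since it averages only the outcomes at or below the 5th percentile.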

A 95% confidence interval for the expected profit is constructed using nonparametric bootstrap resampling, which produces uncertainty bands without assuming normality:

$$ \hat{\mu}_b^* = \frac{1}{n}\sum_{i=1}^n \text{profit}_i^{*(b)}, \quad b=1,...,B$$ 
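
The bootstrap can be sketched as below. The function name `bootstrap_mean_ci`, the number of resamples $B$, and the placeholder profit sample are assumptions for illustration. The interval is built with the percentile method, which is one common choice and is assumed here.

```python
import numpy as np

def bootstrap_mean_ci(x, B=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of x."""
    rng = np.random.default_rng(seed)
    n = x.size
    boot_means = np.empty(B)
    for b in range(B):
        # Resample n values with replacement and record the resample mean
        boot_means[b] = rng.choice(x, size=n, replace=True).mean()
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(boot_means, [tail, 1.0 - tail])
    return lo, hi

# Placeholder profit sample standing in for the simulated profits
rng = np.random.default_rng(3)
profit = 500_000.0 - rng.gamma(shape=50.0, scale=8_000.0, size=5_000)

lo, hi = bootstrap_mean_ci(profit)
print(lo, profit.mean(), hi)
```

For a sample this large the interval width should be close to the normal-theory value $2 \times 1.96 \, s/\sqrt{n}$, which is a quick sanity check on the resampling code.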

**If you need to install or import any additional python packages,  please provide complete installation instructions in the code block below**


In [None]:
# Provide complete installation or import information for external packages or modules here e.g.

# the following packages are included in the Anaconda distribution

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import gamma
from numpy.linalg import cond

## Implementation [50 pts]

Use the Markdown and Code blocks below to implement and document your methods including figures.  Only the first markdown block will be a grading cell but please add (not copy) cells in this section to organize your work. 

Please make the description of your problem readable by interlacing clear explanatory text with code (again with proper grammar and spelling). 
All code should be well described and commented.

For at least one routine you code below, you should provide a test block (e.g. using `numpy.testing` routines, or a convergence plot) to validate your code.  

An **important** component of any computational paper is to demonstrate to yourself and others that your code is producing correct results.

YOUR ANSWER HERE

## Discussion [15 pts]

Evaluate the results of your project including 
* Why should I believe that your numerical results are correct (convergence, test cases etc)?
* Did the project work (in your opinion)?
* If yes:  what would be the next steps to try
* If no:  Explain why your approach did not work and what you might do to fix it.


YOUR ANSWER HERE