Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as any collaborators you worked with:

In [77]:
COLLABORATORS = ""

## To receive credit for this assignment, you must also fill out the [AI Use survey](https://forms.gle/ZhR5k8TdAeN8rj4CA)


---

In [78]:
%matplotlib inline
%precision 16
import numpy
import matplotlib.pyplot as plt
import pandas as pd

# Final Project

This notebook will provide a brief structure and rubric for presenting your final project. 

The purpose of the project is 2-fold
* To give you an opportunity to work on a problem you are truly interested in (as this is the best way to actually learn something)
* To demonstrate to me that you understand the overall workflow of problem solving from problem selection to implementation to discussion 

You can choose any subject area that interests you as long as there is a computational component to it.  However, please do not reuse projects or homeworks you have done in other classes.  This should be **your** original work.

**You can work in teams, but clearly identify each person's contribution**, and every team member should hand in their own copy of the notebook.

### Structure
There are 5 parts for a total of 100 points that provide the overall structure of a mini research project.

* Abstract
* Introduction and Problem Description
* Brief discussion of Computational approach and import of any additional packages
* Implementation including tests
* Discussion of results and future directions

For grading purposes, please try to make this notebook entirely self contained. 

The project is worth about 2 problem sets and should be of comparable length (please: I will have about 100 of these to read and I am not expecting full 10 page papers).  The actual project does not necessarily have to work but in that case you should demonstrate that you understand why it did not work and what steps you would take next to fix it.

Have fun

## Abstract [10 pts]

Provide a 1-2 paragraph abstract of the project in the style of a research paper.  The abstract should contain

* A brief description of the problem
* A brief justification describing why this problem is important/interesting to you
* A general description of the computational approach
* A brief summary of what you did and what you learned


Every year, extreme weather events cause damage to homes across the United States, many of which are insured. For a property insurance carrier, it is important to model the risks and expected losses that arise from underwriting home insurance policies, especially in states that experience a high frequency of extreme weather events. Thus, this project aims to use FEMA data to model the claims, expected losses, and profits that an insurer might face in states like Florida and Washington. I am personally interested in exploring this problem because I am an aspiring actuary and would like to explore Catastrophe Modeling as applied to the property insurance space, which is full of complex problems and real-world implications.

I began by collecting data from FEMA's National Risk Index, which provides county-level metrics like proprietary natural hazard risk scores and Expected Annual Loss (EAL). Using the EAL together with assumptions about the insurer (such as market share, baseline claim rate, and average insured home value), I fit Generalized Linear Models (GLMs) to estimate frequency and severity parameters for the underlying claim distributions. Monte Carlo simulation and numerical quadrature are then used to capture typical insurance policy mechanics like deductibles and coinsurance rates, which factor into estimating the expected profit for the insurer in each state.

## Introduction [15 pts]

In ~4-5 paragraphs, describe 
* The general problem you want to solve
* Why it is important and what you hope to achieve.

Please provide basic **references**, particularly if you are reproducing results from a paper. Also include any basic equations you plan to solve. 

Please use proper spelling and grammar. 

Property insurance carriers are interested in understanding the risks, as well as the payoffs and potential profits involved in underwriting policies (i.e. providing insurance contracts) in a given market. This involves modeling the expected losses and premiums that the insurer could face, as well as modeling tail-end outcomes that correspond to a year with a high frequency and/or severity of extreme weather events that trigger claims (i.e. liabilities for the insurer) to be filed. These model considerations are especially important for insurers underwriting policies in states that experience relatively many extreme weather events, like Florida and Washington.

Florida is notorious for its exposure to hurricanes, while Washington is a geographically diverse state that regularly experiences torrential rainfall, flooding, wildfires, and winter storms. Overall, climate is one of the most important considerations when creating insurance models that forecast losses and profits in any state. Climate change and new regulations continually affect the property insurance industry, making the problem of financial forecasting a dynamic and crucial one for insurers. To illustrate, California passed new legislation this summer that allows insurance companies to develop Wildfire Catastrophe Models to aid their fire insurance underwriting, whereas beforehand most companies chose to stay out of California by default because of the high level of risk that they were not allowed to properly estimate.

Furthermore, the impact of this forecasting problem extends far beyond predicting a hypothetical insurance company's bottom line: the financial well-being of homeowners is also at stake when an insurer develops such a model, as it helps the insurer design the terms of insurance policies in a given area. However, this project is not a full-fledged catastrophe model that considers several types of perils (e.g. fire, flood). It is an end-to-end model that takes FEMA National Risk Index data, which is publicly available at the county level, fits statistical models, and runs Monte Carlo simulations to forecast claims in the states of Florida and Washington. Numerical integration is also used to solve one form of a classical profit equation in insurance, as follows.

The Gauss-Legendre quadrature method is used to approximate the insurer's expected payout per claim, which is modeled by

$$E[(X-d)^+]=\int_d^{\infty} (x-d)f_X(x)dx,$$

where the claim severity $X$ follows a Gamma distribution with density $f_X$ and $d$ is the deductible. The insurer's share of this amount is obtained by multiplying by the coinsurance rate $c$.

### References

[1] Federal Emergency Management Agency. (n.d.). National risk index: Data & resources. FEMA.
https://hazards.fema.gov/nri/data-resources   

[2] Insurance Information Institute. (2025). Homeowners insurance claims frequency: About one in 18 insured homes has a claim. Retrieved from https://www.iii.org/fact-statistic/facts-statistics-homeowners-and-renters-insurance

[3] USAFacts. (2023). Why are US homes getting bigger while households shrink? Retrieved from https://usafacts.org/data/topics/people-society/population-and-demographics/population-data/average-family-size/

[4] Fu, L. (2004). Severity distributions for GLMs: Gamma or lognormal? Casualty Actuarial Society. Gamma distributions are commonly used for modeling claim severity due to their support on positive values and ability to capture right-skewed behavior.

## Computational  Methods [10 pts]

Describe the specific approach you will take to solve some concrete aspect of the general problem. 

You should  include all the numerical or computational methods you intend to use.  These can include methods or packages  we did not discuss in class but provide some reference to the method. You do not need to explain in detail how the methods work, but you should describe their basic functionality and justify your choices. 




The project models insured losses and profits for a small insurer in the states of Florida and Washington by combining a traditional frequency-severity statistical model with stochastic Monte Carlo simulations and deterministic numerical integration. There are 3 computational stages in the predictive model pipeline:

1. Parameter estimation and perturbation for underlying claim distributions
The county-level FEMA data is fitted by two complementary Generalized Linear Models (GLMs), following the classic Frequency-Severity approach. The Frequency GLM models claim counts with a Poisson distribution, and the Severity GLM models individual claim costs with a Gamma distribution. Breaking down losses into the frequency and severity of claims is natural, as the total cost of claims an insurer can expect to be liable for is the expected number of claims times the expected cost per claim. Both GLMs are estimated using Iteratively Reweighted Least Squares (IRLS) with the statsmodels Python package. To account for heterogeneity between counties in a state, the fitted parameters for each GLM are stochastically perturbed by Gaussian noise.

2. Monte Carlo and Numerical Integration for simulating claims and profits
Claim counts are simulated by a Poisson distribution with the perturbed parameter for each county. The justification for the numerical integration is explained in a later markdown cell in the implementation.

3. Compute financial outcomes and Bootstrap Confidence Intervals
State-level premiums and profits are computed by simple arithmetic on the expected losses that the model predicts. From the empirical distribution of simulated profits, Value at Risk and Tail Value at Risk are calculated to better understand the risk surrounding profits in each state. The uncertainty in the expected profit is captured by nonparametric bootstrap resampling, which yields a 95% confidence interval without assuming normality in the empirical distribution of simulated profits.

In summary, this end-to-end approach to an insurance model consists of fitting GLMs, stochastically perturbing parameters, running Monte Carlo simulations and approximations with numerical integration, and computing profit and risk metrics.

**If you need to install or import any additional python packages,  please provide complete installation instructions in the code block below**


In [79]:
# Provide complete installation or import information for external packages or modules here e.g.

# the following packages are included in the Anaconda distribution

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import gamma
from numpy.linalg import cond

## Implementation [50 pts]

Use the Markdown and Code blocks below to implement and document your methods including figures.  Only the first markdown block will be a grading cell but please add (not copy) cells in this section to organize your work. 

Please make the description of your problem readable by interlacing clear explanatory text with code (again with proper grammar and spelling). 
All code should be well described and commented.

For at least one routine you code below, you should provide a test block (e.g. using `numpy.testing` routines, or a convergence plot) to validate your code.  

An **important** component of any computational paper is to demonstrate to yourself and others that your code is producing correct results.

The implementation is organized into three parts, in addition to the global variable cell:

1. Loading and preprocessing the data

2. The three computational steps of the model pipeline described in the previous section

3. Numerical diagnostics to assess the success of the model

In [80]:
# Global Variables
MARKET_SHARE = 0.05 # percent of households in a state that the insurer covers
AVERAGE_HOUSEHOLD_SIZE = 2.5 # US average (reference 3)
BASE_RATE = 1 / 18  # US average annual claim probability per property (reference 2)
NUM_TRIALS = 10000  # Monte Carlo draws per county/state

In [81]:
# Preprocessing

# Initialize states
FLORIDA = {
    "name": "Florida",
}
WASHINGTON = {
    "name": "Washington",
}

# Load + clean FEMA data
def load_and_clean_fema(file_path):
    df = pd.read_csv(file_path)[
        ["COUNTY", "POPULATION", "BUILDVALUE", "EAL_VALT", "RISK_VALUE", "RISK_SCORE"] # keep only the fields that are used in the model
    ]

    df = df.rename(columns={
        "BUILDVALUE": "exposure",
        "EAL_VALT": "EAL",
        "POPULATION": "population"
    })

    numeric_cols = ["exposure", "EAL", "population", "RISK_VALUE", "RISK_SCORE"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

    return df.dropna(subset=["exposure", "EAL"]) # ensure no NA values


FLORIDA["df_clean"] = load_and_clean_fema("fl_county.csv")
WASHINGTON["df_clean"] = load_and_clean_fema("wa_county.csv")

# Preprocessing for GLM / Monte Carlo
def preprocess_location_df(state_dict):
    df = state_dict["df_clean"].copy()

    df["properties"] = df["population"] / AVERAGE_HOUSEHOLD_SIZE # proxy for no. of properties
    df["insured_properties"] = df["properties"] * MARKET_SHARE # i.e. the "exposure" of the insurer
    df["lambda_base"] = df["insured_properties"] * BASE_RATE # set a deterministic rate parameter for frequency
    df["severity_mean"] = df["EAL"] / df["lambda_base"] # set the deterministic severity parameter as FEMA's empirical Expected Annual Loss / frequency rate
    df["risk_scaled"] = df["RISK_VALUE"] / df["RISK_VALUE"].max() # rescale FEMA RISK_VALUE by its state maximum so the predictor is at most 1

    state_dict["df_glm"] = df
    return state_dict


FLORIDA = preprocess_location_df(FLORIDA)
WASHINGTON = preprocess_location_df(WASHINGTON)

#### 1. Parameter estimation and perturbation for underlying claim distributions

The Frequency GLM models the claim counts $Y_i$ per county $i$ with the Poisson distribution with mean $\mu_i$, which is the canonical choice for modeling count data in actuarial applications. 

$$ Y_i \sim \text{Poisson}(\mu_i), \; \log(\mu_i)=\log(\text{exposure}_i) + x_i\beta$$
$$\implies \mu_i =\text{exposure}_i \times \exp(x_i\beta)$$

Furthermore, under the logarithmic link the linear predictor acts multiplicatively: the factor $\exp(x_i\beta)$ scales the claim rate per unit of exposure. This is easy to interpret, since the predictor $x_i$ is FEMA's proprietary risk score for each county, rescaled to $[0,1]$. The exposure offset allows the rate to be modeled per unit of exposure, where exposure is the estimated number of homes in a county times the market share of the insurer (i.e. the number of policies in their portfolio).

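Because actual claim count data is not publicly available, I construct a synthetic expected count using a baseline claim rate (reference 2) and county exposures:

$$ \text{num claims}_i^{\text{target}} = \text{exposure}_i \times \text{base rate} \times \text{risk rel}_i, $$

where $\text{risk rel}_i$ is a county risk relativity. The preprocessing cell above takes $\text{risk rel}_i = 1$, so the target is simply the exposure times the base rate.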

The Severity GLM models the severity $X_i$ (i.e. cost) per claim with a Gamma GLM and log link:

$$X_i \sim \text{Gamma}\big(\text{mean}=\mu_i, \; \text{variance}=\phi \mu_i^2\big), \; \log(\mu_i)=x_i \eta,$$

where $\phi$ is the dispersion parameter. Gamma is a good distribution choice for cost data because claim severities are positive, continuous, and right-skewed (reference 4). Under the log link, a unit increase in the predictor multiplies the expected severity by $e^{\eta_1}$, holding all other covariates fixed.
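
Reading the two Gamma arguments above as the mean and the variance, a short worked conversion shows how they relate to the shape/scale form that `scipy.stats.gamma` expects. The symbols $k$ (shape) and $\theta$ (scale) are introduced here for illustration:

$$\mathbb{E}[X]=k\theta=\mu, \qquad \operatorname{Var}(X)=k\theta^2=\phi\mu^2 \;\implies\; k=\frac{1}{\phi}, \quad \theta=\phi\mu.$$

For example, the fixed shape $k=4$ used in the simulation cell corresponds to $\phi=0.25$, i.e. a per-claim coefficient of variation of $\sqrt{\phi}=0.5$.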

Both GLMs are estimated by IRLS with the statsmodels `GLM.fit()` method. Later in the notebook, finite-difference gradient checks are conducted and condition numbers of the design matrices are computed to check that the fits are numerically well-behaved.
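
IRLS is compact enough to sketch directly. The snippet below is an illustrative NumPy version for a Poisson GLM with log link and offset; it is not the statsmodels implementation, and the function name `irls_poisson` and the synthetic exposures are assumptions made for this example. The target is built as exposure times a constant base rate, mirroring how `lambda_base` is constructed in the preprocessing cell.

```python
import numpy as np

def irls_poisson(X, y, offset, n_iter=25, tol=1e-10):
    """Illustrative IRLS for a Poisson GLM with log link and an offset."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = offset + X @ beta              # linear predictor including the offset
        mu = np.exp(eta)                     # inverse of the log link
        z = (eta - offset) + (y - mu) / mu   # working response, offset removed
        W = mu                               # working weights for Poisson + log link
        XtW = X.T * W                        # weight each observation (column of X.T)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical data: 50 counties, target = exposure * constant base rate
rng = np.random.default_rng(0)
n = 50
exposure = rng.uniform(500, 10_000, size=n)
risk = rng.uniform(0, 1, size=n)
X = np.column_stack([np.ones(n), risk])
base_rate = 1 / 18
y = exposure * base_rate

beta_hat = irls_poisson(X, y, offset=np.log(exposure))
print("intercept:", beta_hat[0], "risk coefficient:", beta_hat[1])
```

Because the target here is exactly the offset times a constant, the fitted intercept equals the log of the base rate and the coefficient on the risk covariate is zero up to rounding: a fit on a deterministic target of this form cannot attribute any effect to the covariate.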

In [82]:
def fit_glm_baselines(location):
    """
    The previous synthetic baselines are now replaced by the fitted GLM means, incorporating both exposure and county risk scores.
    """
    df = location["df_glm"].copy()

    # Frequency GLM (Poisson)
    y_freq = df["lambda_base"]

    X_freq = sm.add_constant(df[["risk_scaled"]])

    freq_model = sm.GLM(
        y_freq,
        X_freq,
        family=sm.families.Poisson(),
        offset=np.log(df["insured_properties"])
    )

    freq_results = freq_model.fit()

    df["lambda_base"] = freq_results.fittedvalues.clip(lower=0.01)

    # Severity GLM (Gamma)
    y_sev = df["severity_mean"]
    X_sev = sm.add_constant(df[["risk_scaled"]])

    sev_model = sm.GLM(
        y_sev,
        X_sev,
        family=sm.families.Gamma(sm.families.links.Log())  # capitalized link class; the lowercase alias is deprecated in recent statsmodels
    )

    sev_results = sev_model.fit()

    # Replace severity mean with fitted mean
    df["severity_mean"] = sev_results.fittedvalues.clip(lower=1000)

    # Store results (useful for diagnostics section)
    location["df_glm"] = df
    location["glm_results"] = {
        "frequency": freq_results,
        "severity": sev_results
    }

    return location


Then, these baseline actuarial parameters derived by the GLMs are perturbed stochastically before Monte Carlo simulation:

$$\text{Poisson: } \lambda_{\text{sim}} = \lambda_{\text{base}}(1+\epsilon_\lambda), \quad \epsilon_\lambda \sim \mathcal{N}(0,\sigma_\lambda^2),$$
$$\text{Gamma: } \mu_{\text{sim}} = \mu_{\text{base}}(1+\epsilon_\mu), \quad \epsilon_\mu \sim \mathcal{N}(0,\sigma_\mu^2).$$

This step introduces controlled uncertainty to reflect the heterogeneity of different counties in each state and model misspecification.
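
To make the size of this perturbation concrete, the following minimal sketch applies the multiplicative Gaussian noise to a hypothetical constant baseline rate and inspects the resulting relative spread. The baseline value of 300, the sample size, and the seed are illustrative assumptions, not values taken from the FEMA data.

```python
import numpy as np

rng = np.random.default_rng(0)             # seed chosen only for reproducibility
lambda_base = np.full(100_000, 300.0)      # hypothetical baseline claim rate
sigma_lambda = 0.05                        # relative standard deviation

# lambda_sim = lambda_base * (1 + eps), eps ~ N(0, sigma^2), floored at 0.01
eps = rng.normal(0.0, sigma_lambda, size=lambda_base.size)
lambda_sim = np.clip(lambda_base * (1.0 + eps), 0.01, None)

rel = lambda_sim / lambda_base
print("mean of relative factor:", rel.mean())
print("std of relative factor: ", rel.std())
```

With a 5% relative standard deviation the factor $1+\epsilon$ stays far from zero, so the lower clip is a safeguard that would only matter for much larger noise levels.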

In [83]:

def perturb_county_level_parameters(location, perturb_std_lambda=0.05, perturb_std_severity=0.1, random_state=None):
    """
    Produce county-level stochastic frequency (lambda) and severity.
    
    Args:
        location: dict containing df_glm
        perturb_std_lambda: relative std dev for lambda perturbation (5% default)
        perturb_std_severity: relative std dev for severity perturbation (10% default)
        random_state: seed for reproducibility
    """
    rng = np.random.default_rng(random_state)
    df = location["df_glm"].copy()
    
    # Perturb lambda around base
    df["lambda"] = df["lambda_base"] * (1 + rng.normal(0, perturb_std_lambda, size=len(df)))
    df["lambda"] = df["lambda"].clip(lower=0.01)  # prevent negative lambda

    # Perturb severity around mean
    df["severity_mean"] = df["severity_mean"] * (1 + rng.normal(0, perturb_std_severity, size=len(df)))
    df["severity_mean"] = df["severity_mean"].clip(lower=1000)  # prevent absurdly low severity

    
    location["df_sim"] = df
    return location

# Apply GLM to both states
FLORIDA = fit_glm_baselines(FLORIDA)
WASHINGTON = fit_glm_baselines(WASHINGTON)

# Apply stochastic perturbations now
FLORIDA = perturb_county_level_parameters(FLORIDA, random_state=42)
WASHINGTON = perturb_county_level_parameters(WASHINGTON, random_state=42)

# Quick check of lambda and severity parameters after stochastic perturbation
FLORIDA["df_sim"][["COUNTY", "insured_properties", "lambda", "severity_mean"]].head()




| | COUNTY | insured_properties | lambda | severity_mean |
|---|---|---|---|---|
| 0 | Alachua | 5559.68 | 313.577026 | 429623.898001 |
| 1 | Baker | 557.04 | 29.337465 | 488323.16583 |
| 2 | Bay | 3497.38 | 201.589481 | 442977.102268 |
| 3 | Bradford | 564.66 | 32.845276 | 392420.351588 |
| 4 | Brevard | 12121.34 | 607.715664 | 402776.018916 |


#### 2. Monte Carlo and Numerical Integration for simulating claims and profits

On each of the `NUM_TRIALS` = 10,000 Monte Carlo draws, the claim count for each county is sampled as follows:

$$N \sim \text{Poisson}(\lambda_{\text{sim}})$$

Instead of simulating individual claim severities, which would be computationally expensive and subject to Monte Carlo noise, the expected payout per claim is computed according to a standard policy scheme that includes a deductible $d$ and the coinsurance rate $c$:

$$c \, E[(X-d)^+] = c \int_d^{\infty} (x-d)f_X(x)\,dx, \qquad X \sim \text{Gamma}(k, \; \theta)$$

A numerical integration approach is convenient here because the deductible truncates the Gamma distribution, so the integral has no elementary closed form for general shape $k$ (it can be written in terms of the incomplete gamma function, which also provides a check on the quadrature). The implementation in the next cell evaluates the integral using n-point Gauss-Legendre quadrature:

$$\int_d^u g(x)dx \approx \sum_{i=1}^n w_i g(t_i),$$

where $t_i, w_i$ are the Legendre nodes and weights mapped $[-1,1] \to [d,u]$, and $u$, the 99.99th percentile of the Gamma distribution, is set as the upper bound of integration. This approach is a deterministic and numerically stable way of estimating the expected loss per claim.

For each simulation trial,

$$\text{CountyLoss}_j = N_j * E[(X-d)^+] * c$$

and the losses for each state are accumulated across each of its counties.
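
As a sanity check on the quadrature, the same expectation can be written with Gamma survival functions: $E[(X-d)^+] = k\theta\,S_{k+1}(d) - d\,S_k(d)$, where $S_k$ is the survival function of $\text{Gamma}(k,\theta)$ (this follows from $x f_k(x) = k\theta f_{k+1}(x)$). The sketch below compares the two; the function names and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gamma

def stop_loss_closed_form(d, shape, scale):
    """E[(X - d)^+] for X ~ Gamma(shape, scale), via survival functions."""
    return (shape * scale * gamma.sf(d, a=shape + 1, scale=scale)
            - d * gamma.sf(d, a=shape, scale=scale))

def stop_loss_quadrature(d, shape, scale, n=200, q=0.9999):
    """Same quantity by n-point Gauss-Legendre on [d, u], with u the q-quantile."""
    u = gamma.ppf(q, a=shape, scale=scale)
    xs, ws = np.polynomial.legendre.leggauss(n)   # nodes and weights on [-1, 1]
    t = 0.5 * (xs + 1.0) * (u - d) + d            # map nodes to [d, u]
    w = 0.5 * (u - d) * ws                        # rescale weights
    return np.sum(w * (t - d) * gamma.pdf(t, a=shape, scale=scale))

# Illustrative values: deductible 10,000, shape 4, mean severity 400,000
d, shape, mean = 10_000.0, 4.0, 400_000.0
exact = stop_loss_closed_form(d, shape, mean / shape)
approx = stop_loss_quadrature(d, shape, mean / shape)
rel_err = abs(approx - exact) / exact
print("closed form:", exact)
print("quadrature: ", approx)
print("relative error:", rel_err)
```

Truncating the integral at the 99.99th percentile discards a small part of the tail, so the quadrature value sits slightly below the closed-form value; raising `q` shrinks that gap.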

In [84]:
# Monte Carlo Simulation with Quadrature for Severity

def expected_payout_quadrature(deductible, coinsurance, shape, scale, n=200):
    """
    Computes E[(X - deductible)^+] * coinsurance for Gamma(shape, scale) using n-point Gauss-Legendre quadrature.
    """
    # Upper limit chosen as 99.99th percentile of severity distribution
    upper = gamma.ppf(0.9999, a=shape, scale=scale)
    
    # Gauss–Legendre nodes & weights on [-1,1]
    xs, ws = np.polynomial.legendre.leggauss(n)
    
    # Transform to [deductible, upper]
    t = 0.5 * (xs + 1) * (upper - deductible) + deductible
    w = 0.5 * (upper - deductible) * ws

    # Integrand: (x - d) * f_X(x)
    integrand = (t - deductible) * gamma.pdf(t, a=shape, scale=scale)
    
    # Multiply by coinsurance factor
    return coinsurance * np.sum(w * integrand)


def apply_deductible_coinsurance_vectorized(county_df, deductible=10000, coinsurance=0.8, n_sim=NUM_TRIALS, random_state=42):
    """
    Monte Carlo simulation of state losses involves: Poisson frequency simulation and expected payout approximated by G-L quadrature
    """
    rng = np.random.default_rng(random_state)
    state_losses = np.zeros(n_sim)

    for _, row in county_df.iterrows():
        
        # Simulate claim frequency per county
        n_claims = rng.poisson(lam=row['lambda'], size=n_sim)

        # Quadrature: expected payout per claim
        severity_shape = 4  # can be tuned
        severity_scale = row['severity_mean'] / severity_shape

        E_payout = expected_payout_quadrature(deductible=deductible, coinsurance=coinsurance, shape=severity_shape, scale=severity_scale)

        # Total losses per simulation draw
        county_total = n_claims * E_payout

        # Aggregate for state-level losses
        state_losses += county_total

    return state_losses


# Quick sense check with first 5 counties
FL_sim_losses = apply_deductible_coinsurance_vectorized(FLORIDA['df_sim'].head(5), n_sim=NUM_TRIALS)
WA_sim_losses = apply_deductible_coinsurance_vectorized(WASHINGTON['df_sim'].head(5), n_sim=NUM_TRIALS)

print(f"Florida sample losses (first 5 counties, {NUM_TRIALS} draws):")
print(pd.Series(FL_sim_losses).describe())

print(f"\nWashington sample losses (first 5 counties, {NUM_TRIALS} draws):")
print(pd.Series(WA_sim_losses).describe())


Florida sample losses (first 5 counties, 10000 draws):
count    1.000000e+04
mean     3.872161e+08
std      1.128472e+07
min      3.335626e+08
25%      3.796284e+08
50%      3.871139e+08
75%      3.948156e+08
max      4.345081e+08
dtype: float64

Washington sample losses (first 5 counties, 10000 draws):
count    1.000000e+04
mean     1.137988e+08
std      5.313799e+06
min      9.615461e+07
25%      1.102478e+08
50%      1.137569e+08
75%      1.172989e+08
max      1.372933e+08
dtype: float64


#### 3. Compute financial outcomes and Bootstrap Confidence Intervals

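The premium is the mean simulated loss multiplied by a loading factor of 1.2. Profit in each Monte Carlo draw is the premium minus that draw's loss, which yields an empirical distribution of potential financial outcomes. Note that the losses passed to this step in the next cell are the five-county sample losses computed above, so the reported figures cover those counties rather than each full state.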

Using the empirical profit distribution from our simulations, the Value at Risk and Tail Value at Risk are taken as the empirical 5th percentile of profit and the mean of the worst 5% of profit outcomes, respectively.
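
A minimal sketch of these profit and tail-risk calculations, assuming hypothetical loss draws from a Gamma distribution (the distribution, its parameters, and the seed are placeholders rather than model output):

```python
import numpy as np

rng = np.random.default_rng(1)
losses = rng.gamma(shape=50.0, scale=2.0e6, size=10_000)  # hypothetical loss draws

loading = 1.2
premium = loading * losses.mean()       # premium = loaded expected loss
profits = premium - losses              # one profit value per simulated loss

var_95 = np.percentile(profits, 5)                 # 5th percentile of profit
tvar_95 = profits[profits <= var_95].mean()        # mean of the worst 5% of profits

print("mean profit:", profits.mean())
print("VaR (5th percentile):", var_95)
print("TVaR (mean of worst 5%):", tvar_95)
```

By construction the mean profit equals 20% of the mean loss, and the Tail Value at Risk can never exceed the Value at Risk because it averages only outcomes at or below that percentile.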

A 95% confidence interval for the expected profit is constructed using nonparametric bootstrap resampling, which produces uncertainty bands without assuming normality of the simulated profit outcomes $\{\Pi_1,\dots,\Pi_n\}$. For each bootstrap replicate $b$, the profits are resampled with replacement and averaged:

$$ \hat{\mu}_b^* = \frac{1}{n}\sum_{i=1}^n \Pi_i^{*(b)}, \: b=1,...,B$$ 
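
The percentile bootstrap can be sketched in a few lines. The helper name `bootstrap_mean_ci`, the number of replicates, and the normally distributed placeholder profits are assumptions for illustration only.

```python
import numpy as np

def bootstrap_mean_ci(samples, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    n = samples.size
    idx = rng.integers(0, n, size=(n_boot, n))   # resample indices with replacement
    boot_means = samples[idx].mean(axis=1)       # one mean per bootstrap replicate
    alpha = 1.0 - level
    return np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

rng = np.random.default_rng(2)
profits = rng.normal(20e6, 5e6, size=1_000)      # hypothetical profit draws
lo, hi = bootstrap_mean_ci(profits)

# Normal-theory width for comparison: 2 * 1.96 * s / sqrt(n)
normal_width = 2 * 1.96 * profits.std(ddof=1) / np.sqrt(profits.size)
print("bootstrap CI:", lo, hi)
print("bootstrap width / normal width:", (hi - lo) / normal_width)
```

An interval of this kind reflects only the Monte Carlo sampling error in the mean of the simulated profits, which shrinks as the number of draws grows. It does not capture uncertainty in the model inputs, such as the base claim rate or the fitted GLM parameters.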

In [85]:
# Financial computations and confidence interval bootstrapping

def compute_metrics(state_losses, loading=1.2, n_bootstrap=NUM_TRIALS, random_state=42):
    """
    Compute premium, profit samples, VaR, TVaR, and bootstrap CI.
    Inputs:
        state_losses: array of total state-level losses (after deductible & coinsurance)
        loading: premium multiplier
        n_bootstrap: number of bootstrap resamples for mean profit CI
    Returns:
        metrics: dict with expected loss, premium, VaR, TVaR, profit samples, bootstrap CI
    """
    rng = np.random.default_rng(random_state)
    state_losses = np.array(state_losses)
    
    # Expected loss and premium
    expected_loss = state_losses.mean()
    premium = expected_loss * loading
    
    # Profit samples
    profit_samples = premium - state_losses
    
    # Risk metrics
    VaR_95 = np.percentile(profit_samples, 5) # Value at Risk
    TVaR_95 = profit_samples[profit_samples <= VaR_95].mean() # Tail Value at Risk for worst 5% of profit outcomes
    
    # Bootstrap 95% Confidence Interval for mean profit
    bootstrap_means = np.array([rng.choice(profit_samples, size=len(profit_samples), replace=True).mean() for _ in range(n_bootstrap)])
    ci_lower, ci_upper = np.percentile(bootstrap_means, [2.5, 97.5])
    
    metrics = {
        'expected_loss': expected_loss,
        'premium': premium,
        'VaR_95': VaR_95,
        'TVaR_95': TVaR_95,
        'profit_mean': profit_samples.mean(),
        'profit_std': profit_samples.std(),
        'profit_bootstrap_CI': (ci_lower, ci_upper)
    }
    
    return metrics

def print_metrics(metrics, state_name):
    """
    Nicely format and print financial metrics for a given state
    """
    print(f"=== {state_name} Metrics ===")
    print(f"Expected Loss: ${metrics['expected_loss']/1e6:,.2f} M")
    print(f"Premium:       ${metrics['premium']/1e6:,.2f} M")
    print(f"Profit Mean:   ${metrics['profit_mean']/1e6:,.2f} M ± ${metrics['profit_std']/1e6:,.2f} M (std)")
    print(f"VaR 5%:        ${metrics['VaR_95']/1e6:,.2f} M")
    print(f"TVaR 5%:       ${metrics['TVaR_95']/1e6:,.2f} M")
    ci_lower, ci_upper = metrics['profit_bootstrap_CI']
    print(f"Bootstrap 95% CI for mean profit: ${ci_lower/1e6:,.2f} M – ${ci_upper/1e6:,.2f} M")
    print("\n")

# Apply metrics to each state
FL_metrics = compute_metrics(FL_sim_losses)
WA_metrics = compute_metrics(WA_sim_losses)

# Print nicely for both states
print_metrics(FL_metrics, "Florida")
print_metrics(WA_metrics, "Washington")

=== Florida Metrics ===
Expected Loss: $387.22 M
Premium:       $464.66 M
Profit Mean:   $77.44 M ± $11.28 M (std)
VaR 5%:        $58.68 M
TVaR 5%:       $53.93 M
Bootstrap 95% CI for mean profit: $77.22 M – $77.67 M


=== Washington Metrics ===
Expected Loss: $113.80 M
Premium:       $136.56 M
Profit Mean:   $22.76 M ± $5.31 M (std)
VaR 5%:        $13.91 M
TVaR 5%:       $11.54 M
Bootstrap 95% CI for mean profit: $22.66 M – $22.86 M




#### GLM Diagnostics

This cell runs diagnostic checks on the fitted frequency and severity GLMs:

- Displaying a full GLM statistics summary for the Poisson (frequency) and Gamma (severity) models
- Computing design matrix condition numbers to assess numerical stability
- Performing finite-difference gradient checks as a further test of whether the fits are numerically well-behaved
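
A finite-difference gradient check is most informative when both sides differentiate the same function. As a hedged illustration, the sketch below compares a central-difference gradient of the Poisson log-likelihood (with offset) against the analytical score $X^\top(y-\mu)$ on synthetic data. All function names, data, and parameter values are hypothetical and independent of the fitted models.

```python
import numpy as np

def poisson_loglik(beta, X, y, offset):
    """Poisson log-likelihood with log link, dropping the log(y!) constant."""
    eta = offset + X @ beta
    return np.sum(y * eta - np.exp(eta))

def poisson_score(beta, X, y, offset):
    """Analytical gradient of the log-likelihood: X^T (y - mu)."""
    mu = np.exp(offset + X @ beta)
    return X.T @ (y - mu)

def fd_gradient(f, beta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at beta."""
    grad = np.zeros_like(beta)
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = eps
        grad[i] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad

# Synthetic Poisson data with an exposure offset
rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
offset = np.log(rng.uniform(500, 10_000, n))
beta_true = np.array([-2.9, 0.4])
y = rng.poisson(np.exp(offset + X @ beta_true)).astype(float)

# Evaluate both gradients at a point away from the optimum
beta_test = beta_true + np.array([0.1, 0.1])
g_fd = fd_gradient(lambda b: poisson_loglik(b, X, y, offset), beta_test)
g_an = poisson_score(beta_test, X, y, offset)
print("finite-difference:", g_fd)
print("analytical score: ", g_an)
```

Agreement between the two vectors validates the analytical gradient. The score is generally nonzero away from the maximum likelihood estimate and vanishes only at the fitted coefficients, so a nonzero value at an arbitrary point is not a warning sign.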

In [86]:
epsilon = 1e-5 # for finite-difference analysis

def finite_difference_gradient(X, beta, mu_transform, epsilon=1e-5):
    """
    Central finite-difference gradient of sum_i mu_i(beta), where mu = mu_transform(X @ beta)
    """
    grad = np.zeros_like(beta)

    for i in range(len(beta)):
        beta_plus = beta.copy()
        beta_minus = beta.copy()
        beta_plus[i] += epsilon
        beta_minus[i] -= epsilon
        mu_plus = mu_transform(X @ beta_plus)
        mu_minus = mu_transform(X @ beta_minus)

        grad[i] = (mu_plus.sum() - mu_minus.sum()) / (2 * epsilon)
    return grad


def glm_diagnostics(results, X, y, model_name, mu_transform, analytical_grad):
    """
    Print GLM summary, design matrix diagnostics, condition number, and finite-difference vs analytical gradient checks.
    """
    print(f"\n{'='*70}")
    print(f"{model_name} GLM Diagnostics")
    print(f"{'='*70}\n")

    # Model summary
    print(results.summary())

    # Design matrix diagnostics
    X_df = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])
    print("\nDesign matrix describe():")
    print(X_df.describe())

    print("\nDesign matrix condition number:")
    print(cond(X))

    # Gradient checks
    beta = results.params.values
    mu_hat = results.fittedvalues.values

    grad_fd = finite_difference_gradient(
        X=X,
        beta=beta,
        mu_transform=mu_transform,
        epsilon=epsilon
    )

    grad_an = analytical_grad(X, y, mu_hat)

    print("\nGradient comparison:")
    print("Finite-difference gradient:", grad_fd)
    print("Analytical gradient:       ", grad_an)

# Run diagnostics on FLORIDA
df = FLORIDA["df_glm"]

#Frequency GLM (Poisson)
freq_results = FLORIDA["glm_results"]["frequency"]

X_freq = sm.add_constant(df[["risk_scaled"]]).values
y_freq = df["lambda_base"].values

glm_diagnostics(results=freq_results, X=X_freq, y=y_freq, model_name="Frequency (Poisson)", mu_transform=np.exp, analytical_grad=lambda X, y, mu: X.T @ (y - mu))

# Severity GLM (Gamma, log link)
sev_results = FLORIDA["glm_results"]["severity"]

X_sev = sm.add_constant(df[["risk_scaled"]]).values
y_sev = df["severity_mean"].values

glm_diagnostics(results=sev_results, X=X_sev, y=y_sev, model_name="Severity (Gamma, log link)", mu_transform=np.exp, analytical_grad=lambda X, y, mu: X.T @ ((y - mu) / (mu ** 2)))



Frequency (Poisson) GLM Diagnostics

                 Generalized Linear Model Regression Results                  
Dep. Variable:            lambda_base   No. Observations:                   67
Model:                            GLM   Df Residuals:                       65
Model Family:                 Poisson   Df Model:                            1
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -223.34
Date:                Fri, 12 Dec 2025   Deviance:                  -4.8675e-13
Time:                        23:02:28   Pearson chi2:                 1.70e-25
No. Iterations:                     6   Pseudo R-squ. (CS):         -1.985e-13
Covariance Type:            nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const       



## Discussion [15 pts]

Evaluate the results of your project including 
* Why should I believe that your numerical results are correct (convergence, test cases etc)?
* Did the project work (in your opinion)?
* If yes:  what would be the next steps to try
* If no:  Explain why your approach did not work and what you might do to fix it.


I will now evaluate the value and efficiency of the project by assessing whether the model's outputs and diagnostics are sensible, and by checking whether each computational stage in the pipeline was executed appropriately and contributed to the results and interpretability of the model.

Beginning with the numerical results from the last diagnostics cell, the Frequency GLM, based on a Poisson distribution, appears to exhibit perfect separation, judging by the warning that the code raises. This is likely because the synthetic frequency rate `lambda_base` was derived deterministically from the start, so the GLM was unable to find any signal in the predictor: the target is almost entirely determined by the offset term (see step 1 of the Implementation section). This is evidenced by the fitted coefficient for the predictor `risk_scaled` being nearly zero (8.465E-16) with a p-value of 1.0. The same critique applies to the Severity GLM, as the perfect separation warning is also raised there.

However, the numerical methods themselves were executed correctly: the IRLS algorithm converged in 6 iterations, the condition numbers of $\approx 5.1$ do not raise any concerns that the design matrices are ill-conditioned, and the Monte Carlo simulations produced reasonable magnitudes of losses, premiums, and profits for states like Florida and Washington. For example, it makes sense that Florida would experience far larger losses than Washington, if only because of the size of each state. At the same time, state-level losses in the range of hundreds of millions of dollars could be reasonable for an insurer whose market share is 5% of each state. Still, the finite-difference gradients for both GLMs are nonzero. Since the finite-difference routine differentiates $\sum_i \mu_i$ rather than the log-likelihood, a nonzero value is expected there and is not by itself a sign of numerical instability; the more telling symptom is that the GLMs do not meaningfully capture variation across counties in each state with the current covariates.

I think this project did work: the stages of fitting GLMs, perturbing parameters, running Monte Carlo simulations and approximations with numerical integration, and computing profit and risk metrics were all fully functional and interpretable steps in the end-to-end insurance model I set out to develop. While the modeling logic was sound, I think my GLMs ended up being degenerate because the FEMA data did not provide enough signal in the covariates to meaningfully capture variation in claim frequency and severity.

The next steps include either finding real-world claim count data or simulating realistic claim counts to fit the Frequency GLM on, because it clearly was not able to learn from the synthetically constructed, deterministic frequency and severity parameters. In addition, more covariates could be added to the GLMs, such as population density and historical loss ratios (a metric commonly tracked by insurers). Since there are a few global variables (i.e. model hyperparameters) at play in this project, such as the baseline claim rate and insurer market share, sensitivity analyses could be conducted to measure the response of state-level profit and loss metrics. Furthermore, the spirit of catastrophe modeling could be infused into this insurance model by stochastically perturbing the severity distributions with more than just Gaussian noise, in order to represent extreme events like hurricanes in Florida or flooding in Washington.