# Project - Predicting Property Damage Costs

ProGuard Insurance is a leading insurance company specializing in property insurance.

They cover a wide range of properties including homes, commercial buildings, and rental units.

However, their current method cannot accurately estimate property damage costs. Because these estimates directly affect premium pricing, the inaccuracy leads to dissatisfied policyholders and financial losses for the company.

They want to use regression techniques to build a predictive model for property damage costs, ensuring fair and accurate premium pricing.

The company will adopt the model only if the average percentage difference between the predicted and actual values is below 20%.
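
The acceptance criterion above can be read as a mean absolute percentage error (MAPE) below 20%. That reading is an assumption, since the brief does not name the metric. The helper name `mean_absolute_percentage_error` and the sample numbers in this sketch are illustrative only.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Return the mean of |actual - predicted| / |actual|, as a percentage."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("actual values must be non-zero for MAPE")
        total += abs(a - p) / abs(a)
    return 100.0 * total / len(actual)


# Illustrative values only, not taken from the dataset
actual_costs = [10000.0, 20000.0, 5000.0]
predicted_costs = [11000.0, 18000.0, 5500.0]

# Each prediction is off by 10% of the actual value, so the MAPE is about 10
mape = mean_absolute_percentage_error(actual_costs, predicted_costs)
meets_threshold = mape < 20.0
```

Under this interpretation, a model would qualify for adoption when `meets_threshold` is true on held-out data.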


--- 

## Data Information

The dataset combines property characteristics, local weather data, and claim history for the last 5 years, all obtained from the company database.


| Column Name            | Details                                                                                                                                          |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| property_id         | Nominal. Unique identifier for each property.                                                                                                   |
| property_type                    | Nominal. Type of the property, such as 'Single-Family Home', 'Apartment', or 'Commercial Building'.                                                                       |
| property_age        | Discrete. Age of the property in years.                                                                              |
| crime_rate          | Continuous. The rate of criminal incidents per 1,000 residents in the vicinity of the insured property.                                                                                        |
| construction_materials        | Nominal. Materials used in the property's construction, such as 'Wood', 'Concrete', or 'Steel'.                                                                                     |
| weather_precipitation        | Continuous. Average annual precipitation in inches in the property's location.                                                                               |
| weather_temperature          | Continuous. Average annual temperature in degrees Fahrenheit in the property's location.                                                                                          |
| claim_type        | Nominal. Type of insurance claim filed, such as 'Fire', 'Water Damage', or 'Theft'.           |
| claim_amount      | Continuous. The amount of the insurance claim filed for property damage. |
| property_damage_cost      | Continuous. The cost of property damage associated with each historical claim. |
| insured_value     | Continuous. The value of the property insured by the policyholder. |
| policyholder_age  | Discrete. Age of the policyholder. |
| property_size     | Continuous. Size of the property in square feet. |
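
One way to carry the measurement levels in the table into pandas is to map nominal columns to `category`, discrete columns to integers, and continuous columns to floats. The mapping below is a sketch rather than part of the original project: `COLUMN_DTYPES` and `apply_schema` are hypothetical names, the two-row frame is made up, and `insured_value` is kept as a float here because the sample values carry cents.

```python
import pandas as pd

# Hypothetical mapping from the data dictionary to pandas dtypes
COLUMN_DTYPES = {
    'property_id': 'string',
    'property_type': 'category',
    'property_age': 'int64',
    'crime_rate': 'float64',
    'construction_materials': 'category',
    'weather_precipitation': 'float64',
    'weather_temperature': 'float64',
    'claim_type': 'category',
    'claim_amount': 'float64',
    'property_damage_cost': 'float64',
    'insured_value': 'float64',
    'policyholder_age': 'int64',
    'property_size': 'float64',
}


def apply_schema(df):
    """Cast the columns that are present to the dtypes in COLUMN_DTYPES."""
    present = {col: dtype for col, dtype in COLUMN_DTYPES.items() if col in df.columns}
    return df.astype(present)


# Made-up rows for illustration only
sample = pd.DataFrame({
    'property_id': [8953, 5587],
    'property_type': ['Single-Family Home', 'Apartment'],
    'property_age': [46, 48],
    'claim_type': ['Fire', None],
    'claim_amount': [16277.94, 2956.98],
})
typed = apply_schema(sample)
```

Missing `claim_type` entries stay missing after the cast, so they can still be counted or imputed later.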


# Data Generation

Using the `faker`, `random`, and `pandas` libraries, a dataset can be generated that includes the following relationships between variables:

1. **Property Type and Claim Type Relationship**: Different property types (Single-Family Home, Apartment, Commercial Building) have varying probabilities of experiencing specific claim types (Fire, Water Damage, Theft). This relationship reflects the differing risks associated with each property type.

2. **Property Age and Property Damage Cost Relationship**: Older properties (higher property age) may have higher property damage costs due to wear and tear. This is reflected in the age-specific damage multipliers: 1.2 for properties under 10 years old, 1.0 for properties aged 10 to 29 years, and 1.5 for properties aged 30 years or more.

3. **Construction Material and Property Damage Cost Relationship**: The choice of construction material (Wood, Concrete, Steel) affects the property damage cost. Steel has the highest damage multiplier (1.2, versus 1.0 for Concrete and 0.8 for Wood), meaning properties made of steel are more costly to repair or replace.

4. **Property Size and Crime Rate Relationship**: Crime rate is correlated with property size. Larger properties tend to have a higher crime rate, which links property size to the risk of property-related incidents.

5. **Weather Conditions and Claim Type Relationship**: Weather conditions (precipitation and temperature) are intended to affect the likelihood of specific claim types. For example, higher precipitation may lead to more Water Damage claims, and colder regions may have more Fire claims. The script below samples weather independently of claim type, so this relationship is not encoded in the generated data.

6. **Property Size and Insured Value Relationship**: Property size is proportional to insured value, scaled by a property-type-specific multiplier. Larger properties therefore have a higher insured value, reflecting the higher cost to insure a larger property.

7. **Outliers and Missing Values**: Outliers in insured value (about 3% of rows, multiplied by 2.5), missing values in property type and claim type (about 5% of rows), and negative property ages (about 1% of rows) add noise, mimicking the variability and data quality issues often encountered in real-world datasets.
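
Relationship 5 can be encoded by adjusting the claim-type weights before sampling. The sketch below is one possible approach, not the method used by the generation script, which conditions `claim_type` on property type only. The function name `weather_adjusted_weights`, the 40-inch and 40 °F thresholds, and the 1.5 boost factor are all assumptions chosen for illustration.

```python
import random

CLAIM_TYPES = ['Fire', 'Water Damage', 'Theft']


def weather_adjusted_weights(base_weights, precipitation, temperature,
                             wet_threshold=40.0, cold_threshold=40.0, boost=1.5):
    """Scale base claim-type weights by weather, then renormalize to sum to 1.

    The thresholds and the boost factor are illustrative assumptions.
    """
    fire, water, theft = base_weights
    if precipitation > wet_threshold:
        water *= boost  # wetter locations: more Water Damage claims
    if temperature < cold_threshold:
        fire *= boost   # colder locations: more Fire claims
    total = fire + water + theft
    return [fire / total, water / total, theft / total]


rng = random.Random(0)  # local generator so the example is repeatable

# Base weights for Apartments and Commercial Buildings, in a wet, warm location
weights = weather_adjusted_weights([0.3, 0.6, 0.1], precipitation=85.0, temperature=70.0)
claim_type = rng.choices(CLAIM_TYPES, weights=weights)[0]
```

Renormalizing keeps the weights interpretable as probabilities, although `random.choices` would also accept unnormalized weights.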

import pandas as pd
from faker import Faker
import random

# Instantiate Faker object with the US locale
fake = Faker('en_US')

# Set the seed values for reproducibility
# (Faker and the standard random module keep separate random state)
Faker.seed(20230807)
random.seed(20230807)

# Set the number of rows in the dataset
num_rows = 9350

# Create a dictionary of empty lists to store the data, one list per column
data = {
    'property_id': [],
    'property_type': [],
    'property_age': [],
    'crime_rate': [],
    'construction_materials': [],
    'weather_precipitation': [],
    'weather_temperature': [],
    'claim_type': [],
    'claim_amount': [],
    'property_damage_cost': [],
    'insured_value': [],
    'policyholder_age': [],
    'property_size': []
}

# Define average claim amounts for different claim types
claim_type_avg_amounts = {
    'Fire': 15000,
    'Water Damage': 8000,
    'Theft': 3000
}

# Property type-specific size multipliers
property_type_size_multipliers = {
    'Single-Family Home': 0.2,
    'Apartment': 0.1,
    'Commercial Building': 0.5
}

# Construction material-specific damage multipliers
construction_material_damage_multipliers = {
    'Wood': 0.8,
    'Concrete': 1.0,
    'Steel': 1.2
}

# Age-specific damage multipliers
age_damage_multipliers = {
    0: 1.2,  # New properties may have higher costs due to construction quality
    10: 1.0,  # Middle-aged properties
    30: 1.5   # Older properties may have more wear and tear
}

# Generate random data for each column
for _ in range(num_rows):
    # Generate a unique property ID
    property_id = fake.unique.random_number(digits=4)

    # Generate property_type column
    property_type = random.choices(
        ['Single-Family Home', 'Apartment', 'Commercial Building'],
        weights=[0.45, 0.3, 0.25]
    )[0]

    # Generate property_age column and introduce negative values
    property_age = random.randint(1, 50)
    if random.random() < 0.01:
        property_age = -property_age

    # Generate construction_materials column based on property_type
    if property_type == 'Single-Family Home':
        construction_materials = 'Wood'
    else:
        construction_materials = random.choices(
            ['Wood', 'Concrete', 'Steel'],
            weights=[0.5, 0.3, 0.2]
        )[0]

    # Generate weather features
    # Weather in the US: Precipitation (in inches) and Temperature (in Fahrenheit)
    if random.random() < 0.1:
        # Higher precipitation for some locations (10% of the time)
        weather_precipitation = round(random.uniform(30, 120), 2)
    else:
        weather_precipitation = round(random.uniform(10, 40), 2)

    # Temperature ranges based on US climate
    if random.random() < 0.3:
        # Colder regions (30% of the time)
        weather_temperature = round(random.uniform(0, 60), 2)
    else:
        # Warmer regions
        weather_temperature = round(random.uniform(40, 95), 2)

    # Generate claim_type column based on property type
    if property_type == 'Single-Family Home':
        # Single-Family Homes are more likely to have Fire claims
        claim_type = random.choices(
            ['Fire', 'Water Damage', 'Theft'],
            weights=[0.6, 0.3, 0.1]
        )[0]
    else:
        # Apartments and Commercial Buildings are more likely to have Water Damage claims
        claim_type = random.choices(
            ['Fire', 'Water Damage', 'Theft'],
            weights=[0.3, 0.6, 0.1]
        )[0]

    # Generate claim amount based on claim type
    claim_amount = round(
        random.normalvariate(claim_type_avg_amounts[claim_type], claim_type_avg_amounts[claim_type] * 0.2),
        2
    )

    # Calculate property damage cost based on claim type, amount, construction material, and property age
    age_damage_multiplier = None
    for age_threshold in sorted(age_damage_multipliers.keys(), reverse=True):
        if property_age >= age_threshold:
            age_damage_multiplier = age_damage_multipliers[age_threshold]
            break
    if age_damage_multiplier is None:
        age_damage_multiplier = 1.0  # Default multiplier

    material_damage_multiplier = construction_material_damage_multipliers[construction_materials]
    property_damage_cost = round(
        random.uniform(0.5 * claim_amount, 2 * claim_amount) * material_damage_multiplier * age_damage_multiplier, 2
    )

    # Generate insured value based on claim amount and property damage cost
    insured_value = round(random.uniform(1.5 * (claim_amount + property_damage_cost), 3 * (claim_amount + property_damage_cost)), 2)

    # Generate policyholder age
    policyholder_age = random.randint(18, 80)

    # Generate property size based on insured value and property type
    property_size_multiplier = property_type_size_multipliers[property_type]
    property_size = round(insured_value * property_size_multiplier)

    # Generate crime rate based on property size
    crime_rate_multiplier = random.uniform(0.01, 0.2)
    crime_rate = round(property_size * crime_rate_multiplier, 1)

    # Introduce outliers in insured_value
    if random.random() < 0.03:
        insured_value *= 2.5

    # Introduce missing values in property_type and claim_type
    if random.random() < 0.05:
        property_type = '-'
        claim_type = None

    # Append the generated data to the dictionary
    data['property_id'].append(property_id)
    data['property_type'].append(property_type)
    data['property_age'].append(property_age)
    data['crime_rate'].append(crime_rate)
    data['construction_materials'].append(construction_materials)
    data['weather_precipitation'].append(weather_precipitation)
    data['weather_temperature'].append(weather_temperature)
    data['claim_type'].append(claim_type)
    data['claim_amount'].append(claim_amount)
    data['property_damage_cost'].append(property_damage_cost)
    data['insured_value'].append(insured_value)
    data['policyholder_age'].append(policyholder_age)
    data['property_size'].append(property_size)

# Create a DataFrame from the generated data
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv("data/property.csv", index=False)

# Display the first few rows of the DataFrame
df.head()


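|   | property_id | property_type | property_age | crime_rate | construction_materials | weather_precipitation | weather_temperature | claim_type | claim_amount | property_damage_cost | insured_value | policyholder_age | property_size |
|---|-------------|---------------|--------------|------------|------------------------|-----------------------|---------------------|------------|--------------|----------------------|---------------|------------------|---------------|
| 0 | 8953 | Single-Family Home | 46 | 1115.0 | Wood | 34.81 | 67.7 | Fire | 16277.94 | 12556.07 | 75524.76 | 49 | 15105 |
| 1 | 5587 | Single-Family Home | 48 | 538.9 | Wood | 19.92 | 33.9 | Theft | 2956.98 | 5262.57 | 17250.54 | 27 | 3450 |
| 2 | 7057 | Apartment | 42 | 926.9 | Wood | 36.87 | 56.79 | Water Damage | 9878.04 | 20999.01 | 84403.03 | 64 | 8440 |
| 3 | 280 | Apartment | 31 | 864.3 | Wood | 10.77 | 53.3 | Fire | 13603.57 | 17943.45 | 63843.83 | 77 | 6384 |
| 4 | 9375 | Commercial Building | 36 | 6994.4 | Wood | 13.2 | 42.8 | Fire | 16461.37 | 35005.05 | 120537.94 | 46 | 60269 |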
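
Because the script deliberately injects data-quality problems, a quick check after generation helps confirm they are present. The sketch below counts them on a small hand-made frame. The helper name `summarize_injected_issues` and the 1.5 × IQR outlier rule are assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize_injected_issues(df):
    """Count the data-quality issues the generation script introduces."""
    # Outliers flagged with the common 1.5 * IQR rule (an assumed convention)
    q1 = df['insured_value'].quantile(0.25)
    q3 = df['insured_value'].quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        'negative_property_age': int((df['property_age'] < 0).sum()),
        'placeholder_property_type': int((df['property_type'] == '-').sum()),
        'missing_claim_type': int(df['claim_type'].isna().sum()),
        'insured_value_outliers': int((df['insured_value'] > upper_fence).sum()),
    }


# Hand-made rows for illustration only
sample = pd.DataFrame({
    'property_age': [46, -12, 31, 8, 20],
    'property_type': ['Apartment', '-', 'Apartment',
                      'Single-Family Home', 'Commercial Building'],
    'claim_type': ['Fire', None, 'Theft', 'Fire', 'Water Damage'],
    'insured_value': [60000.0, 62000.0, 58000.0, 61000.0, 400000.0],
})
issues = summarize_injected_issues(sample)
```

On the full dataset, the multiplied insured values may overlap with the natural spread of the column, so the IQR count will not necessarily match the 3% injection rate.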
