<br>

# Técnicas Matemáticas para Big Data - Car Crash Insurance Payment using Fuzzy Logic
<br><br>


GROUP NN:
- Liliana Ribeiro - Nº 108713 - 33% Work Participation
- Brayan Munoz - Nº 130722 - 33% Work Participation
- Esha Bint E Ghazali - Nº 130726 - 33% Work Participation

<br><br>

## 1. Introduction to the problem of study [1,0 valor]

### Problem Context

In the insurance industry, determining appropriate compensation for vehicle accident claims is a complex decision-making process that involves analyzing multiple uncertain and imprecise factors. Traditional binary logic systems often fail to capture the nuanced reality of accident scenarios, where variables like damage severity and driver fault exist on a continuous spectrum rather than in discrete categories.

### The Challenge

Insurance adjusters must evaluate:
- *Damage Severity*: The extent of vehicle damage (ranging from minor scratches to total loss)
- *Driver Fault*: The degree of responsibility attributed to the insured driver (from completely innocent to entirely at fault)
- *Witness Testimony*: Additional evidence that may support or contradict the initial assessment

The relationship between these variables is not straightforward. For instance, high fault typically reduces payment, but the interaction with damage severity creates complex scenarios:
- High fault + High damage → Low payment (driver primarily responsible)
- Low fault + High damage → High payment (significant compensation needed)
- Low fault + Low damage → Low payment (minor incident, minor compensation)
- Moderate fault + Moderate damage → Uncertain (requires fuzzy reasoning)

### Why Fuzzy Logic?

Classical binary logic cannot adequately model linguistic terms like "slightly damaged," "moderately at fault," or "somewhat credible witnesses." Fuzzy logic provides a mathematical framework to handle this imprecision, allowing the system to:

1. Model human reasoning and expert judgment
2. Handle uncertainty and vagueness in input data
3. Provide smooth, continuous transitions between decision boundaries
4. Incorporate multiple variables with complex interdependencies

This project implements a *Fuzzy Inference System (FIS)* to automate and standardize insurance payment decisions while maintaining the flexibility to handle real-world ambiguity.

<br><br>
## 2. Brief and general description of the approach and methods used [1,5 valor]

### Overall Approach

This project employs a *Fuzzy Inference System (FIS)* based on the Mamdani method to model the insurance claim decision-making process. The system transforms crisp numerical inputs into linguistic variables, applies expert-defined rules, and produces a defuzzified payment recommendation.

### System Architecture

The fuzzy logic system consists of four main components:

#### 1. *Fuzzification*
Converts crisp input values into degrees of membership across multiple fuzzy sets:
- *Damage Input*: Mapped to linguistic terms (e.g., "Low," "Medium," "High")
- *Fault Input*: Categorized by responsibility level (e.g., "None," "Partial," "Complete")
- *Witness Input*: Assessed for credibility and relevance (e.g., "Weak," "Moderate," "Strong")

Each input variable uses membership functions (typically triangular or trapezoidal) to determine the degree to which the input belongs to each fuzzy set.
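As an illustration of the fuzzification step, the following minimal sketch implements triangular and trapezoidal membership functions in plain Python and applies them to a damage score on a 0-100 scale. The breakpoints (20, 50, 80) and the helper name `fuzzify_damage` are assumptions chosen for the example, not values taken from the project.

```python
def trimf(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapmf(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def fuzzify_damage(score):
    """Map a crisp damage score (0-100) to membership degrees.

    The breakpoints below are illustrative assumptions.
    """
    return {
        "low": trapmf(score, 0, 0, 20, 50),       # left shoulder
        "medium": trimf(score, 20, 50, 80),
        "high": trapmf(score, 50, 80, 100, 100),  # right shoulder
    }


degrees = fuzzify_damage(65)
print(degrees)
```

A score of 65 belongs partly to "medium" and partly to "high", which is the overlap that lets the system move smoothly between categories.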

#### 2. *Rule Base*
A collection of IF-THEN rules that encode expert knowledge and insurance policies:

- IF damage IS high AND fault IS low THEN payment IS high
- IF damage IS low AND fault IS high THEN payment IS very_low
- IF damage IS medium AND fault IS medium AND witnesses IS strong THEN payment IS medium_high


These rules capture the complex interactions between variables that would be difficult to express in traditional mathematical formulas.

#### 3. *Inference Engine*
Applies fuzzy logic operators to evaluate all rules simultaneously:
- *AND operator*: Typically implemented as minimum (min) or product
- *OR operator*: Typically implemented as maximum (max)
- *Aggregation*: Combines the outputs of all activated rules using max or sum

The engine determines which rules fire and to what degree based on the fuzzified inputs.
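The sketch below shows one way the rule base and inference step could be evaluated, using min for AND and max to aggregate rules that share an output term. The three rules are the examples listed in the rule base above; the membership degrees and the names `RULES` and `fire_rules` are hypothetical and only serve the illustration.

```python
# Hypothetical membership degrees, as produced by a fuzzification step.
inputs = {
    "damage": {"low": 0.0, "medium": 0.5, "high": 0.5},
    "fault": {"low": 0.7, "medium": 0.3, "high": 0.0},
    "witnesses": {"weak": 0.0, "moderate": 0.2, "strong": 0.8},
}

# Each rule is (antecedents, output term); antecedents are joined with AND.
RULES = [
    ((("damage", "high"), ("fault", "low")), "high"),
    ((("damage", "low"), ("fault", "high")), "very_low"),
    ((("damage", "medium"), ("fault", "medium"), ("witnesses", "strong")), "medium_high"),
]


def fire_rules(inputs, rules):
    """Return the activation level of every output term."""
    activation = {}
    for antecedents, out_term in rules:
        # AND operator: the rule is only as strong as its weakest condition.
        strength = min(inputs[var][term] for var, term in antecedents)
        # Aggregation: keep the strongest rule for each output term.
        activation[out_term] = max(activation.get(out_term, 0.0), strength)
    return activation


activation = fire_rules(inputs, RULES)
print(activation)
```

Replacing `min` with a product would give the other common AND operator mentioned above, without changing the rest of the structure.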

#### 4. *Defuzzification*
Converts the fuzzy output into a crisp payment value using methods such as:
- *Centroid Method*: Calculates the center of gravity of the aggregated fuzzy output
- *Mean of Maximum*: Takes the average of the values with maximum membership
- *Weighted Average*: Computes a weighted mean based on rule strengths
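To make the first two methods concrete, here is a minimal sketch that defuzzifies a Mamdani-style output by sampling the payment universe on a grid. The output sets in `PAYMENT_SETS` (payment as a percentage of the claim) and the grid resolution are assumptions made for the example.

```python
def trimf(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical output sets for payment, as a percentage of the claim.
PAYMENT_SETS = {
    "low": (0, 25, 50),
    "medium": (25, 50, 75),
    "high": (50, 75, 100),
}


def aggregated_membership(x, activation):
    """Clip each output set at its rule strength and combine with max."""
    return max(
        min(strength, trimf(x, *PAYMENT_SETS[term]))
        for term, strength in activation.items()
    )


def sample_output(activation, steps=1000):
    """Sample the aggregated output on an evenly spaced grid over [0, 100]."""
    xs = [100 * i / steps for i in range(steps + 1)]
    mus = [aggregated_membership(x, activation) for x in xs]
    return xs, mus


def centroid(activation, steps=1000):
    """Center of gravity of the aggregated output."""
    xs, mus = sample_output(activation, steps)
    total = sum(mus)
    if total == 0:
        raise ValueError("no rule fired, the output is undefined")
    return sum(x * mu for x, mu in zip(xs, mus)) / total


def mean_of_maximum(activation, steps=1000):
    """Average of the grid points where the aggregated membership peaks."""
    xs, mus = sample_output(activation, steps)
    peak = max(mus)
    best = [x for x, mu in zip(xs, mus) if abs(mu - peak) < 1e-9]
    return sum(best) / len(best)


activation = {"medium": 0.3, "high": 0.5}
print(centroid(activation), mean_of_maximum(activation))
```

The centroid takes the weaker "medium" rule into account, while mean of maximum only looks at the plateau of the strongest rule, so the two methods can return noticeably different payments for the same inputs.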

### Methodology Steps

1. *Define Input/Output Variables*: Identify linguistic variables and their ranges
2. *Design Membership Functions*: Create appropriate shapes (triangular, Gaussian, etc.)
3. *Establish Rule Base*: Formulate expert rules based on domain knowledge
4. *Configure Inference System*: Select operators and aggregation methods
5. *Implement in Python*: Use libraries like scikit-fuzzy or simpful
6. *Validate and Test*: Compare outputs against expert decisions and edge cases
7. *Optimize*: Tune membership functions and rules to improve accuracy

### Advantages of This Approach

- *Handles Uncertainty*: Manages imprecise and incomplete information naturally
- *Interpretable*: Rules are human-readable and align with expert reasoning
- *Flexible*: Easy to modify rules or add new variables without restructuring the entire system
- *Non-linear Modeling*: Captures complex relationships without requiring explicit mathematical formulations
- *Robust*: Tolerates noisy or imprecise inputs, because overlapping membership functions turn small input changes into small output changes

### Expected Outcomes

The system will produce payment recommendations that:
- Reflect realistic expert judgment
- Provide consistent decisions for similar cases
- Handle edge cases and ambiguous scenarios gracefully
- Offer transparency through rule traceability


<br><br>
## 3. Brief History and literature review of the problem and methods/algorithms [1,5 valor]

### 3.1 Historical Context of Vehicle Crash Analysis

The analysis of vehicle crashes and their implications for insurance has been a subject of research for several decades. As road traffic accidents became a leading cause of mortality and economic loss globally, the need for accurate risk assessment and claim prediction systems grew significantly. The World Health Organization (WHO) reported that road traffic deaths reached 1.35 million annually, making road traffic injuries the leading cause of death among people aged 5-29 years [1].

Traditional approaches to insurance claim assessment relied heavily on deterministic models and crisp logic systems. However, these methods often failed to capture the inherent uncertainty and imprecision in accident scenarios, where factors like damage severity, driver fault, and witness credibility exist on continuous spectrums rather than in discrete categories.

### 3.2 Evolution of Fuzzy Logic in Insurance

The application of fuzzy logic to insurance problems was pioneered by de Wit in 1982 [5], who first recognized that underwriting decisions involve subjective assessments that binary logic cannot adequately capture [2]. This seminal work laid the foundation for incorporating linguistic variables and fuzzy set theory into insurance decision-making processes.

Over the past 25 years, the use of fuzzy logic in insurance has expanded considerably [2]. Fuzzy logic techniques have been applied in insurance-related areas including:

- **Classification and Underwriting**: Lemaire (1990) used fuzzy expert systems to provide flexible definitions of preferred policyholders, employing continuous membership functions and various fuzzy intersection operators [2].

- **Risk Assessment**: Multiple studies have demonstrated that fuzzy set theory provides a realistic approach to formal risk analysis, particularly when dealing with imprecise definitions and measurements [2].

- **Pricing and Ratemaking**: Young (1996, 1997) [6] illustrated how fuzzy logic could be used to make pricing decisions in group health insurance that consistently consider supplementary data, including vague or linguistic objectives of the insurer [2].

### 3.3 Machine Learning Approaches to Driver Behavior Analysis

Recent advances in sensor technology and machine learning have enabled more sophisticated approaches to driver behavior analysis. Yuksel & Atmaca [1] developed a driver risk assessment system using machine learning and fuzzy logic, reporting up to 100% accuracy in identifying risky driving behaviors with their best-performing classifier.

Their study compared seven widely-used machine learning algorithms:
- C4.5 Decision Tree (98.37% accuracy)
- Random Forest (99.82% accuracy)
- Artificial Neural Network (99.73% accuracy)
- Support-Vector Machine (98.37% accuracy)
- K-Nearest Neighbor (99.91% accuracy)
- Naive Bayes (93.01% accuracy)
- **K-Star (100% accuracy)** - the best performing algorithm

The authors demonstrated that using accelerometer and gyroscope sensors with a rolling window approach and statistical features could effectively identify four major risky driving behaviors: sudden acceleration, sudden deceleration, sudden right turn, and sudden left turn [1].
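The rolling window idea can be sketched as follows: a sensor signal is cut into fixed-size windows and each window is summarized by a few statistics. This is only an illustration of the general technique; the window size, the step, and the use of population moments are assumptions, and the code does not reproduce the exact feature set or parameters of [1].

```python
import math


def window_features(values):
    """Summary statistics of one window, using population (biased) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2,
        "std": math.sqrt(m2),
        # Skewness and (non-excess) kurtosis are undefined for a constant
        # window; 0.0 is used here as a placeholder.
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
    }


def rolling_features(signal, window, step=1):
    """Compute window_features over every full window of the signal."""
    return [
        window_features(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, step)
    ]


# Hypothetical accelerometer readings along one axis.
signal = [0.1, 0.0, 0.2, 1.5, 2.1, 0.3, 0.1, 0.0]
features = rolling_features(signal, window=4, step=2)
print(len(features), features[0]["mean"])
```

Each resulting feature dictionary would then serve as one input row for a classifier.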

### 3.4 Vehicle Crash Modeling Using Neural-Fuzzy Systems

The complexity of vehicle crash dynamics has led researchers to explore hybrid approaches combining multiple computational intelligence techniques. Zhao et al. [3] presented a novel Adaptive Neural-Fuzzy Inference System (ANFIS) approach to reconstruct kinematics of colliding vehicles.

### 3.5 Risk Factor Prioritization in Insurance

Understanding which factors contribute most significantly to risk is crucial for fair and accurate insurance pricing. Esfandabadi et al. [4] addressed this challenge in the context of comprehensive automobile insurance in Iran using a hybrid multi-criteria decision-making model combining:

- **Fuzzy Delphi Method (FDM)**: To identify important risk factors through expert consensus
- **Fuzzy Analytic Hierarchy Process (FAHP)**: To prioritize and weight the identified factors
- **Similarity Aggregation Method (SAM)**: To combine individual expert opinions into group consensus

Their findings revealed that behavioral characteristics and driving experience were more important than vehicle specifications in determining risk levels. The top three risk factors identified were [4]:
1. Traffic offences including speed violations (highest weight)
2. Claim history in third-party liability insurance
3. Driving experience


### 3.6 Fuzzy Systems in Insurance: A Comprehensive Framework

Shapiro [2] provides an extensive review of fuzzy logic applications in insurance, categorizing them by methodology:

#### 3.6.1 Fuzzy Set Theory and Linguistic Variables
- Risk capacity assessment
- Definition and measurement of uncertainty in risk management
- Optimal excess of loss retention in reinsurance programs

#### 3.6.2 Fuzzy Arithmetic
- Fuzzy future and present values of cash amounts
- Premium computation for insurance policies
- Cash-flow matching with uncertain occurrence dates

#### 3.6.3 Fuzzy Inference Systems
- Life and health underwriting
- Group health insurance selection processes
- Occupational injury risk evaluation
- Financial forecasting

#### 3.6.4 Fuzzy Clustering
- Risk classification in life and non-life insurance
- Credibility estimation
- Age groupings in general insurance

### 3.7 Integration of Multiple Data Sources

Modern insurance systems must integrate diverse data sources to make accurate assessments. The literature demonstrates several successful approaches:

1. **Sensor Data Integration**: Yuksel & Atmaca [1] showed that combining accelerometer and gyroscope data with statistical features (mean, variance, standard deviation, skewness, kurtosis) provides robust behavior identification.

2. **Expert Knowledge Elicitation**: Multiple studies emphasize the importance of properly processing expert opinion rather than using it raw [2, 4].

3. **Historical Claims Data**: Derrig & Ostaszewski demonstrated the value of fuzzy techniques in pattern recognition for risk and claim classification using Massachusetts automobile insurance data [2].

### 3.8 Relevance to Current Project

This literature review establishes several key principles relevant to our car crash insurance claim prediction system:

1. **Fuzzy Logic Appropriateness**: The inherent uncertainty in damage assessment and fault determination makes fuzzy logic an ideal framework [2].

2. **Multi-Factor Consideration**: Effective systems must consider multiple interacting factors including vehicle characteristics, driver behavior, environmental conditions, and accident circumstances [4].

3. **Hybrid Approaches**: Combining fuzzy logic with other computational intelligence techniques (neural networks, genetic algorithms) often yields superior results [1, 3].

4. **Validation Importance**: Rigorous comparison with alternative methods and validation against expert decisions is essential for system credibility [1].

5. **Interpretability**: Fuzzy rule-based systems provide transparency and human interpretability, which is crucial for insurance applications where decisions must be explainable [2].

Our proposed system builds upon these foundations by applying fuzzy inference to car crash data, incorporating damage severity and driver responsibility factors to predict appropriate insurance claim amounts. This approach addresses the identified gap in the literature where most studies focus on risk classification or behavior prediction rather than direct claim amount determination.


<br><br>
## 4. About the main method/algorithm used [1,5 valor]

<br><br>

## 5. Python imports and global configurations [0,5 valor]

### Install and import the necessary libraries to build the Fuzzy Inference System and perform other methods

In [None]:
# %pip install pandas
# %pip install seaborn
# %pip install matplotlib
# %pip install numpy
# %pip install pomegranate
# %pip install torch
# %pip install Pillow
# %pip install scikit-fuzzy

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
data = pd.read_excel('data/new dataset.xlsx')
data

FileNotFoundError: [Errno 2] No such file or directory: 'data/new dataset.xlsx'

In [None]:
# Function to calculate Fault Score (0-100)
def calculate_fault_score(row):
    """
    Calculate fault score based on:
    - Primary Factor: Different factors have different fault weights
    - Hour: Late night/early morning hours increase fault
    - Weekend: Weekend incidents may indicate riskier behavior
    """
    score = 0
    
    # Primary Factor contribution (20-60 points)
    primary_factor = str(row['Primary factor']).lower()
    if 'speed' in primary_factor or 'reckless' in primary_factor:
        score += 60
    elif 'fail' in primary_factor or 'violation' in primary_factor:
        score += 50
    elif 'follow' in primary_factor or 'distance' in primary_factor:
        score += 40
    elif 'signal' in primary_factor or 'turn' in primary_factor:
        score += 35
    elif 'under influence' in primary_factor or 'alcohol' in primary_factor:
        score += 60
    else:
        score += 20
    
    # Hour contribution (5-15 points) - risky hours increase fault
    hour = row['Hour']
    if 0 <= hour < 6 or 22 <= hour < 24:  # Late night/early morning
        score += 15
    elif 17 <= hour < 20:  # Rush hour
        score += 10
    else:
        score += 5
    
    # Weekend contribution (5-15 points)
    if row['Weekend?'] == 'Yes':
        score += 15
    else:
        score += 5
    
    return min(score, 100)  # Cap at 100 (the branches above sum to at most 90)


# Function to calculate Severity Score (0-100)
def calculate_severity_score(row):
    """
    Calculate severity score based on:
    - Collision Type: More severe collision types score higher
    - Injury Type: More serious injuries score higher
    """
    score = 0
    
    # Collision Type contribution (15-50 points)
    collision_type = str(row['Collision type']).lower()
    if 'head-on' in collision_type or 'head on' in collision_type:
        score += 50
    elif 'angle' in collision_type:
        score += 40
    elif 'rear-end' in collision_type or 'rear end' in collision_type:
        score += 30
    elif 'sideswipe' in collision_type:
        score += 25
    elif 'single' in collision_type:
        score += 20
    else:
        score += 15
    
    # Injury Type contribution (5-50 points)
    # Note: the substring test for 'incapacitating' also matches 'non-incapacitating'.
    injury_type = str(row['Injury type']).lower()
    if 'fatality' in injury_type or 'fatal' in injury_type:
        score += 50
    elif 'serious' in injury_type or 'incapacitating' in injury_type:
        score += 45
    elif 'evident' in injury_type or 'visible' in injury_type:
        score += 35
    elif 'possible' in injury_type or 'complaint' in injury_type:
        score += 25
    elif 'pdo' in injury_type or 'property' in injury_type:
        score += 10
    else:
        score += 5

    return min(score, 100)  # Cap at 100, mirroring calculate_fault_score

In [None]:
# Calculate scores for all records
data['Fault_Score'] = data.apply(calculate_fault_score, axis=1)
data['Severity_Score'] = data.apply(calculate_severity_score, axis=1)

# Display sample of calculated scores
print("Sample of calculated scores:")
print(data[['Primary factor', 'Hour', 'Weekend?', 'Fault_Score', 
            'Collision type', 'Injury type', 'Severity_Score']].head(10))
print(f"\nFault Score - Min: {data['Fault_Score'].min()}, Max: {data['Fault_Score'].max()}, Mean: {data['Fault_Score'].mean():.2f}")
print(f"Severity Score - Min: {data['Severity_Score'].min()}, Max: {data['Severity_Score'].max()}, Mean: {data['Severity_Score'].mean():.2f}")

<br><br>

## 6. Dataset and variables explanation [1,5 valor]

The Car Crash Dataset from Kaggle provides a comprehensive compilation of information related to road accidents. This dataset includes various factors that influence accidents, such as collision severity, weather conditions, road types, and contributing factors. These attributes offer valuable insights into accident patterns and contribute to the analysis of road safety.

For the purpose of this project, we will use the dataset to focus on damage severity and driver responsibility, ultimately predicting the claim amount. The dataset contains multiple variables that can help us determine both the extent of damage and the fault of the driver, two key factors in the assessment of insurance claims.

### Selected Variables and Their Explanation

We have chosen two main variables for this project: damage severity and driver responsibility. These variables are crucial in determining the claim amount in an insurance context. Below is the detailed explanation of the selected variables and how they will be used:

1. **Damage Severity**
    - Description: This variable indicates the extent of damage caused by the accident, ranging from minor damage to total destruction. The severity of the damage directly impacts the repair costs, which are a key component of the claim amount.
    - How it’s used: Collision Type and Injury Type are combined into a severity score between 0 and 100:
        - Collision Type: Different types of collisions (e.g., head-on, angle, rear-end, sideswipe) cause different levels of damage, so more severe collision types score higher.
        - Injury Type: More serious injuries (e.g., fatal or incapacitating) indicate a more violent crash and score higher.

2. **Driver Responsibility**
    - Description: This variable captures the level of fault attributed to the driver, ranging from no responsibility to full responsibility. Driver responsibility is crucial for assessing the claim amount, because the distribution of fault influences the payout.
    - How it’s used: Primary Factor, Hour, and Weekend are combined into a fault score between 0 and 100:
        - Primary Factor: The main cause recorded for the accident, such as speeding, failure to yield, or following too closely. The more directly the factor relates to the driver’s own actions, the higher the responsibility.
        - Hour: Accidents late at night or early in the morning add more to the fault score than rush-hour or daytime accidents.
        - Weekend: Weekend accidents add more to the fault score, on the assumption that they may indicate riskier behavior.


#### Damage Severity:
The **damage severity** is directly related to the cost of repairs and the overall cost of the claim. Since the primary aim is to predict the claim amount, understanding the extent of the damage is essential.
The literature reviewed above focuses on predicting accident risk and accident types and does not specifically address how to predict the claim amount. The variables selected here, collision type and injury type, are relevant because they reflect the level of damage, which in turn drives the claim.

#### Driver Responsibility:
The **driver’s level of responsibility** determines the liability and, consequently, the claim payout. Insurance companies often adjust payouts based on fault, so the payout may be reduced when the insured driver is responsible. By analyzing the primary factor together with the hour and weekend indicators, we estimate how much responsibility the driver holds, which is a critical part of calculating the claim amount.

<br><br>

## 7. Main code as possible solution to the problem [1,5 valor] 

<br><br>

## 8. Analysis of Example 1 [3,0 valor]

<br><br>

## 9. Analysis of Example 2 [3,0 valor]

<br><br>
## 10. Pros and cons of the approach [2,0 valor]

<br><br>
## 11. Future improvements [2,0 valor]

<br>
<div style="text-align: center;">
    <br><br>
    <p style="font-size: 40px;">References [1,0 valor]</p>
</div>
<br>

<ol>
    <li>
        Yuksel, A. S., & Atmaca, S. (2020). 
        <b>Driver’s black box: a system for driver risk assessment using machine learning and fuzzy logic</b>. 
        <i>Journal of Intelligent Transportation Systems</i>, 25(5), 482–500. 
        <a href="https://doi.org/10.1080/15472450.2020.1852083" target="_blank">https://doi.org/10.1080/15472450.2020.1852083</a>
    </li>
    <br>
    <li>
        Shapiro, A. (2007). 
        <b>An Overview of Insurance Uses of Fuzzy Logic</b>. 
        <i>Computational Intelligence in Economics and Finance: Volume II</i>, 25–61. 
        <a href="https://doi.org/10.1007/978-3-540-72821-4_2" target="_blank">https://doi.org/10.1007/978-3-540-72821-4_2</a>
    </li>
    <br>
    <li>
        Zhao, L., Pawlus, W., Karimi, H. R., & Robbersmyr, K. G. (2014). 
        <b>Data-Based Modeling of Vehicle Crash Using Adaptive Neural-Fuzzy Inference System</b>. 
        <i>IEEE/ASME Transactions on Mechatronics</i>, 19(2), 684–696. 
        <a href="https://doi.org/10.1109/TMECH.2013.2255422" target="_blank">https://doi.org/10.1109/TMECH.2013.2255422</a>
    </li>
    <br>
    <li>
        Esfandabadi, Z. S., Ranjbari, M., & Scagnelli, S. D. (2020). 
        <b>Prioritizing Risk-level Factors in Comprehensive Automobile Insurance Management: A Hybrid Multi-criteria Decision-making Model</b>. 
        <i>Global Business Review</i>, 24(5), 972-989. 
        <a href="https://doi.org/10.1177/0972150920932287" target="_blank">https://doi.org/10.1177/0972150920932287</a> (Original work published 2023)
    </li>    <br>
    <li>
        de Wit, G. W. (1982). 
        <b>Underwriting and uncertainty</b>. 
        <i>Insurance: Mathematics and Economics</i>, 1(4), 277–285. 
        <a href="https://doi.org/10.1016/0167-6687(82)90028-2" target="_blank">https://doi.org/10.1016/0167-6687(82)90028-2</a>
    </li>    <br>
    <li>
        Young, V. R. (1996). 
        <b>Insurance Rate Changing: A Fuzzy Logic Approach</b>. 
        <i>The Journal of Risk and Insurance</i>, 63(3), 461–484. 
        <a href="https://doi.org/10.2307/253621" target="_blank">https://doi.org/10.2307/253621</a>
    </li>
</ol>
