# NMVCCS Actuarial Field Extraction

## Description
This notebook extracts relevant data for actuarial analysis from NMVCCS (National Motor Vehicle Crash Causation Survey) XML and HTML files.

## Features
- **XML Parsing**: Structured extraction from NMVCCS XML files
- **HTML Parsing**: Extraction from the Section 508-compliant HTML case files
- **Bulk Processing**: Processing directories containing thousands of files
- **Structured Output**: CSV with 50+ actuarial variables per case

## Input
- Single XML files or directories
- Single HTML files or directories  
- Mixed directories (XML + HTML)

## Output
- `output.csv`: Raw extracted data
- `DB.csv`: Final database with `;` separator

## Package Installation

Installing necessary packages for data extraction:
- **BeautifulSoup4**: HTML parsing
- **html2text**: HTML to text conversion

In [4]:
# !pip install bs4 html2text

Collecting html2text
  Downloading html2text-2025.4.15-py3-none-any.whl.metadata (4.1 kB)
Downloading html2text-2025.4.15-py3-none-any.whl (34 kB)
Installing collected packages: html2text
Successfully installed html2text-2025.4.15


## Library Imports

Importing all necessary libraries for processing:

In [None]:
import pandas as pd
import xml.etree.ElementTree as ET
import os
import glob
from tqdm import tqdm
import re
from bs4 import BeautifulSoup
import html2text

## XML Data Extraction

### Variables Extracted from XML
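The function extracts the following data categories. In the column names, `{i}` is the vehicle's position in the file (starting at 1) and `{loc}` is the tire's `TIRE_LOCATION` attribute:

| Category | Columns | Description |
|----------|---------|-------------|
| **Case Info** | `CaseID`, `CaseStr`, `NumOfVehicle` | Case identifiers (attributes of the root element) |
| **Summary** | `CaseSummary` | Textual case description, whitespace-normalized |
| **Derived Crash Fields** | One column per child of `XML_CRASHDERIVED` | Date and time details |
| **Crash** | `CrashTime`, `CrashSeverity`, `CrashSeverityCode` | Crash time and KABCOU severity |
| **Vehicles** | `Vehicle{i}_Year`, `Vehicle{i}_Make`, `Vehicle{i}_Model`, `Vehicle{i}_BodyType`, `Vehicle{i}_Odometer` | Vehicle characteristics |
| **Tires** | `Vehicle{i}_Tire{loc}_Depth`, `Vehicle{i}_Tire{loc}_Pressure` | Tire tread depth and pressure |
| **Damage** | `Vehicle{i}_DamageExtent` | Damage extent (CDC) |
| **Drivers** | `Vehicle{i}_DriverAge`, `Vehicle{i}_DriverSex`, `Vehicle{i}_DriverInjurySeverity` | Demographics and injuries |
| **Pre-crash** | `Vehicle{i}_CriticalEvent`, `Vehicle{i}_CriticalReason`, `Vehicle{i}_RouteFrequency`, `Vehicle{i}_DriverFatigue`, `Vehicle{i}_AlcoholTest`, `Vehicle{i}_Surveillance` | Critical events and driver state |
| **Environment** | `RoadSurfaceType`, `RoadSurfaceCondition`, `RoadwayAlignment`, `RoadwayVerticalProfile`, `AtmosphericCondition`, `NaturalLighting` | Environmental conditions |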

In [1]:
def extract_actuarial_data_from_xml(xml_file_or_directory, output_csv=None):
    """
    Extract data relevant to actuarial analysis from NMVCCS XML files.
    
    Args:
        xml_file_or_directory: A single XML file or a directory containing XML files
        output_csv: Optional path where the resulting CSV is saved
        
    Returns:
        pandas DataFrame with the extracted data
    """
    # Determine whether the input is a file or a directory
    if os.path.isfile(xml_file_or_directory):
        xml_files = [xml_file_or_directory]
    elif os.path.isdir(xml_file_or_directory):
        xml_files = glob.glob(os.path.join(xml_file_or_directory, "*.xml"))
    else:
        raise ValueError("Il percorso fornito non è né un file né una directory valida")
    
    # List that collects the extracted records
    data_list = []
    
    # Process each XML file
    for xml_file in tqdm(xml_files, desc="Elaborazione file XML"):
        try:
            # Parse the XML file
            tree = ET.parse(xml_file)
            root = tree.getroot()
            
            # Dictionary holding the data for this case
            case_data = {}
            
            # Extract basic case information
            case_data['CaseID'] = root.get('CaseID')
            case_data['CaseStr'] = root.get('CaseStr')
            case_data['NumOfVehicle'] = root.get('NumOfVehicle')
            
            # Extract the case summary
            summary_elem = root.find(".//XML_CASESUMMARY/SUMMARY")
            if summary_elem is not None and summary_elem.text:
                # Clean the text by collapsing runs of whitespace and line breaks
                summary_text = re.sub(r'\s+', ' ', summary_elem.text).strip()
                case_data['CaseSummary'] = summary_text
            
            # Copy every derived crash field (crash date and time details) by tag name
            crash_derived = root.find(".//XML_CRASHDERIVED")
            if crash_derived is not None:
                for elem in crash_derived:
                    case_data[elem.tag] = elem.text
            
            # Extract crash event information
            crash_elem = root.find(".//CRASH")
            if crash_elem is not None:
                case_data['CrashTime'] = crash_elem.find("TIME").text if crash_elem.find("TIME") is not None else None
                kabcou_elem = crash_elem.find("KABCOU")
                if kabcou_elem is not None:
                    case_data['CrashSeverity'] = kabcou_elem.get('attrCatStr')
                    case_data['CrashSeverityCode'] = kabcou_elem.get('value')
            
            # Extract information for each vehicle involved
            for i, vehicle_elem in enumerate(root.findall(".//GeneralVehicle"), 1):
                vehicle_id = vehicle_elem.get('VEHICLEID')
                
                # Basic vehicle information
                vehicle = vehicle_elem.find(".//VEHICLE")
                if vehicle is not None:
                    case_data[f'Vehicle{i}_Year'] = vehicle.find("MODELYEAR").text if vehicle.find("MODELYEAR") is not None else None
                    
                    make_elem = vehicle.find("MAKE")
                    if make_elem is not None:
                        case_data[f'Vehicle{i}_Make'] = make_elem.get('attrCatStr')
                    
                    model_elem = vehicle.find("MODEL")
                    if model_elem is not None:
                        case_data[f'Vehicle{i}_Model'] = model_elem.get('attrCatStr')
                    
                    body_type_elem = vehicle.find("BODY_TYPE")
                    if body_type_elem is not None:
                        case_data[f'Vehicle{i}_BodyType'] = body_type_elem.get('attrCatStr')
                    
                    case_data[f'Vehicle{i}_Odometer'] = vehicle.find("ODOMETER").text if vehicle.find("ODOMETER") is not None else None
                
                # Tire information
                tires = vehicle_elem.findall(".//TIRE")
                for tire in tires:
                    tire_loc = tire.get('TIRE_LOCATION')
                    tire_depth = tire.find("TIRE_TREAD_DEPTH")
                    if tire_depth is not None:
                        case_data[f'Vehicle{i}_Tire{tire_loc}_Depth'] = tire_depth.text
                    
                    tire_pressure = tire.find("TIRE_PRESSURE")
                    if tire_pressure is not None and tire_pressure.text and tire_pressure.text.isdigit():
                        case_data[f'Vehicle{i}_Tire{tire_loc}_Pressure'] = tire_pressure.text
                
                # CDC (Collision Deformation Classification)
                cdc = vehicle_elem.find(".//CDC")
                if cdc is not None:
                    damage_extent = cdc.find("DAMAGEEXTENT")
                    if damage_extent is not None:
                        case_data[f'Vehicle{i}_DamageExtent'] = damage_extent.get('attrCatStr')
                
                # Driver information
                driver = root.find(f".//Occupant[@VEHICLEID='{vehicle_id}']/OccupantByVehicle/OCCUPANT[ROLE='1']")
                if driver is not None:
                    case_data[f'Vehicle{i}_DriverAge'] = driver.find("AGE").text if driver.find("AGE") is not None else None
                    
                    sex_elem = driver.find("SEX_PREGNANCY")
                    if sex_elem is not None:
                        case_data[f'Vehicle{i}_DriverSex'] = sex_elem.get('attrCatStr')
                    
                    kabcou_elem = driver.find("KABCOU")
                    if kabcou_elem is not None:
                        case_data[f'Vehicle{i}_DriverInjurySeverity'] = kabcou_elem.get('attrCatStr')
                
                # Pre-crash information for the vehicle
                precrash = root.find(f".//PrecrashAssessmentForm[@VEHICLEID='{vehicle_id}']")
                if precrash is not None:
                    # Critical Event
                    critical_event = precrash.find(".//PRECRASH/CRITICAL_EVENT")
                    if critical_event is not None:
                        case_data[f'Vehicle{i}_CriticalEvent'] = critical_event.get('attrCatStr')
                    
                    # Critical Reason
                    critical_reason = precrash.find(".//PRECRASH/CRITICAL_REASON")
                    if critical_reason is not None:
                        case_data[f'Vehicle{i}_CriticalReason'] = critical_reason.get('attrCatStr')
                    
                    # Driver Experience
                    route_freq = precrash.find(".//DRIVER_BEHAVIOR/THIS_ROUTE_FREQUENCY")
                    if route_freq is not None:
                        case_data[f'Vehicle{i}_RouteFrequency'] = route_freq.get('attrCatStr')
                    
                    # Driver Fatigue
                    fatigue = precrash.find(".//FATIGUE/DRIVER_FATIGUE")
                    if fatigue is not None:
                        case_data[f'Vehicle{i}_DriverFatigue'] = fatigue.get('attrCatStr')
                    
                    # Alcohol/Drug
                    alcohol_test = precrash.find(".//DRIVER_HEALTH/ALCOHOL_TEST_RESULT")
                    if alcohol_test is not None:
                        case_data[f'Vehicle{i}_AlcoholTest'] = alcohol_test.get('attrCatStr')
                    
                    # Distraction/Inattention
                    surveillance = precrash.find(".//DRIVER_BEHAVIOR/SURVEILLANCE")
                    if surveillance is not None:
                        case_data[f'Vehicle{i}_Surveillance'] = surveillance.get('attrCatStr')
            
            # Extract roadway/environment information
            roadway = root.find(".//ROADWAY")
            if roadway is not None:
                surface_type = roadway.find("SURFACE_TYPE")
                if surface_type is not None:
                    case_data['RoadSurfaceType'] = surface_type.get('attrCatStr')
                
                surface_cond = roadway.find("SURFACE_CONDITION")
                if surface_cond is not None:
                    case_data['RoadSurfaceCondition'] = surface_cond.get('attrCatStr')
                
                roadway_align = roadway.find("ROADWAY_ALIGN")
                if roadway_align is not None:
                    case_data['RoadwayAlignment'] = roadway_align.get('attrCatStr')
                
                roadway_profile = roadway.find("ROADWAY_VERT_PROFILE")
                if roadway_profile is not None:
                    case_data['RoadwayVerticalProfile'] = roadway_profile.get('attrCatStr')
            
            # Extract atmospheric conditions
            atmospheric = root.find(".//ATMOSPHERIC_CONDITION/ATMOSPHERICCONDITION")
            if atmospheric is not None:
                case_data['AtmosphericCondition'] = atmospheric.get('attrCatStr')
            
            # Extract natural lighting
            natural_lighting = root.find(".//PRECRASHVEHICLE/NATURAL_LIGHTING")
            if natural_lighting is not None:
                case_data['NaturalLighting'] = natural_lighting.get('attrCatStr')
            
            # Append this case's data to the main list
            data_list.append(case_data)
            
        except Exception as e:
            print(f"Errore nell'elaborazione del file {xml_file}: {str(e)}")
    
    # Build a DataFrame from the collected data
    df = pd.DataFrame(data_list)
    
    # Save the DataFrame to a CSV file if requested
    if output_csv:
        df.to_csv(output_csv, index=False)
        print(f"Dati salvati in {output_csv}")
    
    return df
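
Most coded fields in the function above are read from the `attrCatStr` attribute rather than from the element text. The pattern can be tried on a small fragment, as in the sketch below. The fragment is hand-written for illustration and is not actual NMVCCS content, and `get_label` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only (an assumption, not real NMVCCS data)
SAMPLE = """
<Case CaseID="1" NumOfVehicle="1">
  <CRASH>
    <TIME>14:30</TIME>
    <KABCOU value="3" attrCatStr="B - Non-incapacitating injury"/>
  </CRASH>
</Case>
"""


def get_label(parent, path):
    """Return the attrCatStr label of the first match, or None if absent."""
    elem = parent.find(path)
    # Compare with None explicitly: an Element without children is falsy
    return elem.get("attrCatStr") if elem is not None else None


root = ET.fromstring(SAMPLE)
severity = get_label(root, ".//CRASH/KABCOU")       # label stored in the attribute
missing = get_label(root, ".//CRASH/NOT_PRESENT")   # absent element gives None
crash_time = root.findtext(".//CRASH/TIME")         # plain text field
```

A helper of this kind would replace the repeated `find` / `is not None` / `get('attrCatStr')` triplets, at the cost of one extra function call per field.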

## HTML Data Extraction

### HTML Parsing Challenges
NMVCCS HTML files present specific challenges:
- **Complex Structure**: Nested tables and divs with dynamic IDs
- **Layout Variability**: Differences between file versions
- **Scattered Data**: Information for one case is spread across multiple sections
- **Encoding Issues**: Files are read as UTF-8 and undecodable bytes are ignored

### Extraction Strategy
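- **BeautifulSoup**: Parses each file with the built-in `html.parser` backend
- **Pattern Matching**: Regular expressions pull the case number, ages, odometer readings, and tire measurements out of cell text
- **DOM Navigation**: Sections are located through named anchors (for example `GV_Vehicle{i}`) and read from the table that follows
- **Fallback Logic**: Crash severity is looked up a second time across the whole document, and the tread depth column falls back to a fixed index when its header is not found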
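
The pattern-matching step can be sketched in isolation. The title pattern is the one used in the extraction function below. The `parse_case_id` and `parse_int` helpers and the sample strings are assumptions added for illustration. `parse_int` shows one way to handle comma thousands separators, which a bare `\d+` match would cut off at the first comma.

```python
import re

CASE_ID_RE = re.compile(r"NMVCCS Case (\d+-\d+-\d+)")


def parse_case_id(title):
    """Return the case number embedded in a page title, or None."""
    match = CASE_ID_RE.search(title)
    return match.group(1) if match else None


def parse_int(text):
    """Return the first integer in text, tolerating comma thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group(0).replace(",", "")) if match else None


case_id = parse_case_id("NMVCCS Case 2005-45-88")  # hypothetical title
odometer = parse_int("123,456 km")                  # hypothetical cell text
unknown = parse_int("Unknown")                      # no digits, so None
```

Whether the real HTML files use thousands separators is not established here. If they do not, the simpler `\d+` pattern gives the same result.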

In [2]:
def extract_actuarial_data_from_html(html_file_or_directory, output_csv=None):
    """
    Extract data relevant to actuarial analysis from NMVCCS HTML files.
    
    Args:
        html_file_or_directory: A single HTML file or a directory containing HTML files
        output_csv: Optional path where the resulting CSV is saved
        
    Returns:
        pandas DataFrame with the extracted data
    """
    # Determine whether the input is a file or a directory
    if os.path.isfile(html_file_or_directory):
        html_files = [html_file_or_directory]
    elif os.path.isdir(html_file_or_directory):
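        # Collect both .html and .htm files, matching what extract_actuarial_data() counts
        html_files = glob.glob(os.path.join(html_file_or_directory, "*.html"))
        html_files += glob.glob(os.path.join(html_file_or_directory, "*.htm"))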
    else:
        raise ValueError("Il percorso fornito non è né un file né una directory valida")
    
    # List that collects the extracted records
    data_list = []
    
    # Process each HTML file
    for html_file in tqdm(html_files, desc="Elaborazione file HTML"):
        try:
            # Read the HTML file
            with open(html_file, 'r', encoding='utf-8', errors='ignore') as file:
                html_content = file.read()
            
            # Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')
            
            # Dictionary holding the data for this case
            case_data = {}
            
            # Extract the case number from the title
            case_title = soup.find('title')
            if case_title:
                case_match = re.search(r'NMVCCS Case (\d+-\d+-\d+)', case_title.text)
                if case_match:
                    case_data['CaseID'] = case_match.group(1)
            
            # Extract the case summary
            # First look for the "Case Summary" section
            summary_section = soup.find('a', {'name': 'Summary'})
            if summary_section:
                # Look for the Case Summary table
                summary_header = None
                for table in soup.find_all('table'):
                    for row in table.find_all('tr', {'class': 'heading'}):
                        if 'Case Summary' in row.text:
                            summary_header = table
                            break
                    if summary_header:
                        break
                
                if summary_header:
                    # Find the following div that contains the summary text
                    summary_div = summary_header.find_next_sibling('div', id='indent')
                    if summary_div and summary_div.find('td'):
                        summary_text = summary_div.find('td').text.strip()
                        summary_text = re.sub(r'\s+', ' ', summary_text)
                        case_data['CaseSummary'] = summary_text
            
            # Extract basic crash information
            # Map each row label in the Crash Overview table to its output column
            crash_field_map = {
                'Case Number': 'CaseNum',
                'Date': 'CrashDate',
                'Day of Week': 'DayOfWeek',
                'PAR Time of Crash': 'CrashTime',
                'Crash Level KABCOU': 'CrashSeverity',
            }
            crash_info_table = soup.find('th', string='Crash Overview')
            if crash_info_table:
                crash_table = crash_info_table.find_parent('table').find_next_sibling('div').find('table')
                # find_all('tr') is recursive, so rows of nested tables are visited as well
                for row in crash_table.find_all('tr'):
                    cells = row.find_all(['th', 'td'])
                    # Only label/value rows are of interest
                    if len(cells) == 2:
                        key = cells[0].text.strip()
                        if key in crash_field_map:
                            case_data[crash_field_map[key]] = cells[1].text.strip()
            # Also search the whole document for Crash Level KABCOU as a fallback
            kabcou_elem = soup.find('th', string=lambda s: s and 'Crash Level KABCOU' in s)
            if kabcou_elem and kabcou_elem.find_next_sibling('td'):
                case_data['CrashSeverity'] = kabcou_elem.find_next_sibling('td').text.strip()
            # Extract vehicle information
            # Instead of searching for headings, look up the vehicle sections directly
            for i in range(1, 11):  # Consider up to 10 vehicles (usually there are only a few)
                # Find the table with the vehicle information
                vehicle_section = soup.find('a', {'name': f'GV_Vehicle{i}'})
                if not vehicle_section:
                    continue  # Skip if this vehicle does not exist
                
                vehicle_table = vehicle_section.find_next('table', {'class': 'output'})
                if vehicle_table:
                    for row in vehicle_table.find_all('tr'):
                        cells = row.find_all(['th', 'td'])
                        if len(cells) == 2:
                            key = cells[0].text.strip()
                            value = cells[1].text.strip()
                            if key == 'Model Year':
                                case_data[f'Vehicle{i}_Year'] = value
                            elif key == 'Make':
                                case_data[f'Vehicle{i}_Make'] = value
                            elif key == 'Model':
                                case_data[f'Vehicle{i}_Model'] = value
                            elif key == 'Body Type':
                                case_data[f'Vehicle{i}_BodyType'] = value
                            elif key == 'Odometer Reading':
                                # Keep only the numeric part of the odometer reading
                                odometer_match = re.search(r'(\d+)', value)
                                if odometer_match:
                                    case_data[f'Vehicle{i}_Odometer'] = odometer_match.group(1)
                
                # Extract tire information
                tire_section = soup.find('a', {'name': f'GV_Tire{i}'})
                if tire_section:
                    # Scan the elements that follow this anchor for tire tables
                    current_element = tire_section
                    tires_found = False
                    while current_element and not (current_element.name == 'a' and current_element.get('name', '').startswith('GV_') and not current_element.get('name', '') == f'GV_Tire{i}'):
                        current_element = current_element.find_next()
                        
                        # Check whether this is a tire table
                        if current_element and current_element.name == 'table':
                            headers = [th.text.strip() for th in current_element.find_all('th')]
                            if len(headers) >= 5 and 'Location' in headers and 'Tread Depth' in headers:
                                tires_found = True
                                # Process the table rows (skip the header row)
                                for row in current_element.find_all('tr')[1:]:
                                    cells = row.find_all('td')
                                    if len(cells) >= 7:
                                        location = cells[0].text.strip()
                                        # Convert the location into a code
                                        loc_code = ''
                                        if 'Left' in location:
                                            loc_code = 'L'
                                        elif 'Right' in location:
                                            loc_code = 'R'
                                        
                                        if 'Front' in location:
                                            loc_code += 'F'
                                        elif 'Rear' in location:
                                            loc_code += 'R'
                                        
                                        # Find the index of the tread depth column
                                        tread_depth_idx = headers.index('Tread Depth (mm)') if 'Tread Depth (mm)' in headers else 6
                                        pressure_idx = tread_depth_idx + 1
                                        
                                        if tread_depth_idx < len(cells):
                                            tread_depth = cells[tread_depth_idx].text.strip()
                                            if tread_depth and re.search(r'\d+', tread_depth):
                                                case_data[f'Vehicle{i}_Tire{loc_code}_Depth'] = re.search(r'\d+', tread_depth).group(0)
                                        
                                        if pressure_idx < len(cells):
                                            pressure = cells[pressure_idx].text.strip()
                                            pressure_match = re.search(r'\d+', pressure)
                                            if pressure_match:
                                                case_data[f'Vehicle{i}_Tire{loc_code}_Pressure'] = pressure_match.group(0)
                
                # Extract CDC (Collision Deformation Classification) information
                cdc_section = soup.find('a', {'name': f'GV_CDC{i}'})
                if cdc_section:
                    cdc_table = cdc_section.find_next('table', {'class': 'output'})
                    if cdc_table:
                        for row in cdc_table.find_all('tr'):
                            cells = row.find_all(['th', 'td'])
                            if len(cells) == 2 and 'Extent' in cells[0].text:
                                case_data[f'Vehicle{i}_DamageExtent'] = cells[1].text.strip()
                
                # Extract driver information
                driver_section = soup.find('a', {'name': f'OC_OccupantV{i}O1'})
                if driver_section:
                    driver_table = driver_section.find_next('table', {'class': 'output'})
                    if driver_table:
                        for row in driver_table.find_all('tr'):
                            cells = row.find_all(['th', 'td'])
                            if len(cells) == 2:
                                key = cells[0].text.strip()
                                value = cells[1].text.strip()
                                if key == 'Age':
                                    age_match = re.search(r'(\d+)', value)
                                    if age_match:
                                        case_data[f'Vehicle{i}_DriverAge'] = age_match.group(1)
                                elif key == 'Sex':
                                    case_data[f'Vehicle{i}_DriverSex'] = value
                                elif key == 'Occupant KABCOU Rating':
                                    case_data[f'Vehicle{i}_DriverInjurySeverity'] = value
                
                # Extract pre-crash information for the vehicle
                precrash_section = soup.find('a', {'name': f'PA_Precrash{i}'})
                if precrash_section:
                    # Walk through the divs and tables that follow the pre-crash anchor
                    current_element = precrash_section
                    while current_element and not (current_element.name == 'a' and 'PA_SupportData' in current_element.get('name', '')):
                        current_element = current_element.find_next()
                        
                        # Look for the Critical Event and Critical Reason rows
                        if current_element and current_element.name == 'tr' and current_element.find('th'):
                            header = current_element.find('th').text.strip()
                            if header == 'Critical Pre-Crash Event':
                                value = current_element.find('td').text.strip()
                                case_data[f'Vehicle{i}_CriticalEvent'] = value
                            elif header == 'Critical Reason for Critical Pre-Crash Event':
                                value = current_element.find('td').text.strip()
                                case_data[f'Vehicle{i}_CriticalReason'] = value
                
                # Extract driver fatigue information
                support_data_section = soup.find('a', {'name': f'PA_SupportData{i}'})
                if support_data_section:
                    next_elems = support_data_section.find_next_siblings(['div', 'table'])
                    for elem in next_elems:
                        if 'Fatigue' in elem.text:
                            fatigue_tables = elem.find_all('table')
                            for table in fatigue_tables:
                                for row in table.find_all('tr'):
                                    cells = row.find_all(['th', 'td'])
                                    if len(cells) == 2 and 'Driver Fatigue' in cells[0].text:
                                        case_data[f'Vehicle{i}_DriverFatigue'] = cells[1].text.strip()
            
            # Extract roadway/environment information
            roadway_sections = soup.find_all(string=lambda s: s and 'Roadway' in s)
            for roadway_text in roadway_sections:
                parent = roadway_text.parent
                if parent and parent.name == 'td':
                    roadway_table = parent.find_parent('table')
                    if roadway_table:
                        next_div = roadway_table.find_next_sibling('div', id='indent')
                        if next_div:
                            road_table = next_div.find('table')
                            if road_table:
                                for row in road_table.find_all('tr'):
                                    cells = row.find_all(['th', 'td'])
                                    if len(cells) == 2:
                                        key = cells[0].text.strip()
                                        value = cells[1].text.strip()
                                        if 'Type of Road Surface' in key:
                                            case_data['RoadSurfaceType'] = value
                                        elif 'Condition of Road Surface' in key:
                                            case_data['RoadSurfaceCondition'] = value
                                        elif 'Roadway Horizontal Alignment' in key:
                                            case_data['RoadwayAlignment'] = value
                                        elif 'Roadway Vertical Profile' in key:
                                            case_data['RoadwayVerticalProfile'] = value
            
            # Extract atmospheric conditions
            for table in soup.find_all('table'):
                for row in table.find_all('tr', {'class': 'highlightrow'}):
                    if row.find('th') and 'Atmospheric Condition' in row.find('th').text:
                        atmospheric_value = row.find('td')
                        if atmospheric_value:
                            case_data['AtmosphericCondition'] = atmospheric_value.text.strip()
            
            # Append this case's data to the main list
            data_list.append(case_data)
            
        except Exception as e:
            print(f"Errore nell'elaborazione del file {html_file}: {str(e)}")
    
    # Build a DataFrame from the collected data
    df = pd.DataFrame(data_list)
    
    # Save the DataFrame to a CSV file if requested
    if output_csv:
        df.to_csv(output_csv, index=False)
        print(f"Dati salvati in {output_csv}")
    
    return df
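
The HTML path builds tire column names from a location label, whereas the XML path uses the raw `TIRE_LOCATION` attribute, so the two formats can produce differently named tire columns. The label-to-code rule from the function above can be isolated as a small helper. The name `tire_location_code` and the sample labels are assumptions for illustration.

```python
def tire_location_code(location):
    """Map a label such as 'Left Front' to a two-letter code such as 'LF'.

    Mirrors the rule in the HTML extraction function: side first (L/R),
    then axle (F/R); any part that is not recognized is left out.
    """
    side = "L" if "Left" in location else "R" if "Right" in location else ""
    axle = "F" if "Front" in location else "R" if "Rear" in location else ""
    return side + axle


codes = [tire_location_code(label) for label in ("Left Front", "Right Rear", "Spare")]
```

An empty code (as for the hypothetical `"Spare"` label) yields a column named `Vehicle{i}_Tire_Depth`, so several unrecognized labels on one vehicle would overwrite each other.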

## Unified Extraction Function

### Automatic Format Handling
The `extract_actuarial_data()` function automatically handles:
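- **Single Files**: `.xml`, `.html`, or `.htm`, dispatched by extension
- **Directories**: May mix XML and HTML files; each format is processed in its own pass
- **Combination**: The per-format DataFrames are concatenated with `pd.concat`
- **Validation**: A `ValueError` is raised for an unsupported extension, an invalid path, or a directory without XML or HTML files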

In [None]:
def extract_actuarial_data(file_or_directory, output_csv=None):
    """
    Extract data relevant to actuarial analysis from NMVCCS files (XML or HTML).
    
    Args:
        file_or_directory: A single file or a directory containing files
        output_csv: Optional path where the resulting CSV is saved
        
    Returns:
        pandas DataFrame with the extracted data
    """
    # Determine whether the input is a file or a directory
    if os.path.isfile(file_or_directory):
        file_ext = os.path.splitext(file_or_directory)[1].lower()
        if file_ext == '.xml':
            return extract_actuarial_data_from_xml(file_or_directory, output_csv)
        elif file_ext in ['.html', '.htm']:
            return extract_actuarial_data_from_html(file_or_directory, output_csv)
        else:
            raise ValueError(f"Formato file non supportato: {file_ext}")
    elif os.path.isdir(file_or_directory):
        # Find all XML and HTML files in the directory
        xml_files = glob.glob(os.path.join(file_or_directory, "*.xml"))
        html_files = glob.glob(os.path.join(file_or_directory, "*.html"))
        htm_files = glob.glob(os.path.join(file_or_directory, "*.htm"))
        html_files.extend(htm_files)  # Combine the .html and .htm files
        
        # Process all files and merge the results
        data_frames = []
        
        if xml_files:
            xml_df = extract_actuarial_data_from_xml(file_or_directory)
            data_frames.append(xml_df)
            print(f"Estratti dati da {len(xml_files)} file XML")
        
        if html_files:
            html_df = extract_actuarial_data_from_html(file_or_directory)
            data_frames.append(html_df)
            print(f"Estratti dati da {len(html_files)} file HTML")
        
        if not data_frames:
            raise ValueError("Nessun file XML o HTML trovato nella directory")
        
        # Concatenate all DataFrames
        combined_df = pd.concat(data_frames, ignore_index=True)
        
        # Save the DataFrame to a CSV file if requested
        if output_csv:
            combined_df.to_csv(output_csv, index=False)
            print(f"Dati salvati in {output_csv}")
        
        return combined_df
    else:
        raise ValueError("Il percorso fornito non è né un file né una directory valida")
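
The extension-based dispatch can be expressed on its own. This is a minimal sketch, assuming hypothetical helpers `classify` and `group_by_format` that are not part of the notebook.

```python
from pathlib import Path

XML_EXTS = {".xml"}
HTML_EXTS = {".html", ".htm"}


def classify(path):
    """Return 'xml', 'html', or None based on the lower-cased file extension."""
    ext = Path(path).suffix.lower()
    if ext in XML_EXTS:
        return "xml"
    if ext in HTML_EXTS:
        return "html"
    return None


def group_by_format(paths):
    """Split paths into per-format lists, ignoring unsupported extensions."""
    groups = {"xml": [], "html": []}
    for p in paths:
        kind = classify(p)
        if kind is not None:
            groups[kind].append(p)
    return groups


groups = group_by_format(["a.XML", "b.html", "c.htm", "notes.txt"])
```

Grouping the paths once and handing each list to its parser is one possible way to avoid scanning the directory a second time inside the format-specific functions.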

## Extraction Execution

### Single File Test
Testing the function on a sample HTML file:

In [1]:
# Usage example
if __name__ == "__main__":
    # Process a single XML file
    # df = extract_actuarial_data("path/to/CaseForm.xml", "output.csv")
    
    # Process a single HTML file
    df = extract_actuarial_data("2005045588642_direct2.html", "output.csv")
    
    # Process every XML and HTML file in a directory
    # df = extract_actuarial_data("path/to/directory", "output.csv")

Elaborazione file HTML: 100%|██████████| 1/1 [00:00<00:00,  2.21it/s]

Dati salvati in output.csv





## Bulk Processing

### Complete Directory Processing
Processing all XML and HTML files in the `nmvccs_xml_files` directory (4 XML and 6,926 HTML files in the run below):
- **Progress Bar**: One tqdm progress bar per format
- **Error Handling**: A file that raises an exception is reported and skipped, so the run continues
- **Throughput**: The HTML pass ran at about 2.3 files per second, roughly 50 minutes for 6,926 files
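
For a run of this size, it can help to collect the paths that failed instead of only printing them, so they can be inspected or retried afterwards. In the sketch below, `process_files` and `fake_parser` are hypothetical names, and the parser is a stand-in for the real extraction logic.

```python
def process_files(paths, parse_one):
    """Apply parse_one to each path and return (records, failures)."""
    records, failures = [], []
    for path in paths:
        try:
            records.append(parse_one(path))
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            failures.append((path, repr(exc)))
    return records, failures


def fake_parser(path):
    """Stand-in parser (hypothetical): fails on names containing 'bad'."""
    if "bad" in path:
        raise ValueError("malformed file")
    return {"CaseID": path}


records, failures = process_files(["c1.xml", "bad.xml", "c2.xml"], fake_parser)
```

The `failures` list could then be written to a log file next to `output.csv`.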

In [3]:
df = extract_actuarial_data("nmvccs_xml_files", "output.csv")

Elaborazione file XML: 100%|██████████| 4/4 [00:00<00:00, 181.88it/s]


Estratti dati da 4 file XML


Elaborazione file HTML: 100%|██████████| 6926/6926 [49:55<00:00,  2.31it/s]  


Estratti dati da 6926 file HTML
Dati salvati in output.csv


## Data Saving and Cleaning

### Output Format
- **DB.csv**: Main database with `;` as the field separator (common in European locales, where `,` is the decimal mark)
- **Deduplication**: Exact duplicate rows are removed with `drop_duplicates()`
- **Encoding**: UTF-8, the pandas `to_csv` default

In [4]:
# Drop exact duplicate rows first so that they are not written to DB.csv
df = df.drop_duplicates()

In [7]:
df.to_csv('DB.csv', sep=';', index=False)
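
Anything that reads `DB.csv` must be told about the `;` separator; with pandas that is `pd.read_csv('DB.csv', sep=';')`. The round trip is sketched below with the standard-library `csv` module to keep the example free of dependencies. The rows are invented for illustration and are not NMVCCS data.

```python
import csv
import io

# Invented rows for illustration only
rows = [
    {"CaseID": "1", "CaseSummary": "Vehicle 1, a sedan, left the road"},
    {"CaseID": "1", "CaseSummary": "Vehicle 1, a sedan, left the road"},  # exact duplicate
    {"CaseID": "2", "CaseSummary": "Rear-end collision; wet road"},
]

# Remove exact duplicates while keeping first-seen order
seen, unique_rows = set(), []
for row in rows:
    key = tuple(row.items())
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# Write with ';' as the delimiter; fields that contain ';' are quoted automatically
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["CaseID", "CaseSummary"], delimiter=";")
writer.writeheader()
writer.writerows(unique_rows)

# Read back with the same delimiter
buffer.seek(0)
loaded = list(csv.DictReader(buffer, delimiter=";"))
```

Free-text summaries are the fields most likely to contain the separator, which is why the quoting behavior matters for this dataset.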

## Results

### Generated Dataset
The process produces a structured dataset with:
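- **Actuarial Variables**: 50+ columns for risk analysis
- **Coverage**: Crash, vehicle, tire, damage, driver, pre-crash, roadway, and weather fields
- **Data Quality**: Values are kept as extracted strings; no type conversion or range validation is applied in the code above
- **Standard Format**: CSV, readable by spreadsheet and statistical tools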

### Possible Uses
- **Risk Assessment**: Insurance risk profiling
- **Claims Analysis**: Claim cost analysis
- **Underwriting**: Pricing model development
- **Safety Research**: Accident prevention research