ENH: Make pd.to_datetime with format parameter more robust to dirty data #62840

@CoderMoray

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When pd.to_datetime is called with an explicit format parameter, it strictly expects every value in the input series/column to match that exact format. If even a single value deviates (e.g., due to dirty or corrupted data), the function either raises an error (with errors='raise', the default) or silently converts every non-conforming value to NaT (with errors='coerce'). Users are therefore often forced to clean or validate the data manually before calling to_datetime, which is cumbersome and interrupts the data-processing workflow, especially when the user believes the majority of the data follows the specified format.
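For illustration, here is the current behavior with a single malformed entry, using only the public API:

```python
import pandas as pd

# A mostly-clean column with a single corrupted entry.
s = pd.Series(["2023-01-01", "2023-01-02", "2023-01-03", "not-a-date"])

# Default errors='raise': one bad value aborts the entire conversion.
try:
    pd.to_datetime(s, format="%Y-%m-%d")
except ValueError as exc:
    print("parse failed:", exc)

# errors='coerce': non-conforming values are silently replaced with NaT.
coerced = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
print(coerced.isna().sum())  # number of values that failed to parse
```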

Feature Description

I propose modifying the internal behavior of pd.to_datetime when a format parameter is explicitly provided. The goal is to make the function more robust to a small proportion of dirty data that does not match the specified format, while still catching cases where the format is fundamentally incorrect for the data. This change would be internal and would not modify the errors parameter's existing behavior.

The suggested internal logic would be:

  1. Check for Explicit Format: If the format argument is provided (i.e., it is neither None nor 'mixed'), proceed to the next steps. Otherwise, keep the current behavior.
  2. Initial Parsing Attempt: Call the internal datetime parsing logic using the provided format and errors='coerce'. This produces an intermediate result series, let's call it temp_result.
  3. Calculate Success Rate: Calculate the proportion of non-NA/NaT values in temp_result compared to the total number of input values.
  4. Apply Threshold Logic:
    • If the success rate (proportion of non-NaT values in temp_result) is greater than a defined threshold (e.g., 50%): Return temp_result. This means the specified format is considered appropriate for the majority of the data, and the few entries that failed to parse are marked as NaT.
    • If the success rate is less than or equal to the threshold: Revert to the current behavior, which is equivalent to errors='raise'. An error should be raised, indicating that the data format does not match the specified format (or only a very small minority does). This ensures strictness when the format is likely incorrect.

Pseudocode for the core logic:

def to_datetime(..., format=None, errors='raise', ...):
    # ... (existing validation and setup code) ...

    if format is not None and format != 'mixed':
        # --- New Robust Logic Starts Here ---
        # Step 1: Parse with format using errors='coerce'
        temp_result = _internal_parse_function(input_arg, format=format, errors='coerce')
        
        # Step 2: Calculate success rate (guarding against empty input)
        total_values = len(temp_result)
        successful_parses = temp_result.notna().sum() # Count non-NaT values
        success_rate = successful_parses / total_values if total_values else 0.0

        # Step 3: Apply threshold (e.g., 50%)
        THRESHOLD = 0.5 # This value might need discussion/configurability later
        if success_rate > THRESHOLD:
            # Most data matched the format, return with NaT for failures
            return temp_result
        else:
            # Too little data matched, raise error like errors='raise'
            # This would likely involve re-parsing with errors='raise' or
            # raising a specific error calculated from temp_result's NaT locations
            return _internal_parse_function(input_arg, format=format, errors='raise') # Or equivalent error logic
        # --- New Robust Logic Ends Here ---
    else:
        # ... (existing logic for format=None, format='mixed', etc.) ...
        return _internal_parse_function(input_arg, format=format, errors=errors)

This approach aims to make to_datetime more robust to small amounts of dirty data when a specific format is expected, reducing the need for pre-processing, while still catching cases where the format is fundamentally wrong.

Describe alternatives you've considered

Currently, users must handle this scenario manually:

  • Pre-process the data to clean or identify/replace problematic entries before calling pd.to_datetime.
  • Call pd.to_datetime with errors='coerce', check the number of NaT values generated, and decide if the result is acceptable or if further action is needed.
  • Use pd.to_datetime without a format (relying on the default flexible parser), which might be slower and less precise if the user knows the intended format.

The proposed solution would encapsulate this common pattern of "try strict format, check success, decide action" into the function itself, based only on the user providing an explicit format.
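The "try strict format, check success, decide action" pattern can already be sketched in user code against the public API. The wrapper below is a hypothetical illustration (the name `to_datetime_robust` and the 0.5 threshold are assumptions from this proposal, not part of pandas); the actual enhancement would live inside pandas internals:

```python
import pandas as pd

def to_datetime_robust(values, fmt, threshold=0.5):
    # Hypothetical user-side sketch of the proposed threshold logic.
    # Step 1: parse leniently so failures become NaT.
    parsed = pd.to_datetime(values, format=fmt, errors="coerce")
    # Step 2: fraction of values that matched the format.
    success_rate = parsed.notna().mean()
    if success_rate > threshold:
        # Step 3a: majority matched -- keep NaT for the few failures.
        return parsed
    # Step 3b: too little data matched -- re-parse strictly so the
    # original errors='raise' error surfaces.
    return pd.to_datetime(values, format=fmt, errors="raise")
```

With three of four values matching (75% > 50%), the wrapper returns a result containing one NaT; if most values fail to match, the strict ValueError is raised instead.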


Additional Context

This enhancement would primarily benefit data analysts and engineers working with potentially messy datasets where a datetime column is expected to follow a specific format, but a few entries might be incorrect. It streamlines the common workflow of parsing datetimes with an expected format while gracefully handling minor inconsistencies. The change is localized to the scenario where format is specified, minimizing the impact on other existing behaviors.

A potential future optimization could involve an initial pass to assess the overall "date-likeness" or format compliance of the data before attempting the main parsing logic. Such a pre-check could potentially offer even better performance or more precise error reporting in cases where the data is largely non-conforming. However, this initial implementation focuses on the simpler threshold-based approach described above, which provides immediate benefits with a more contained change. The pre-check idea could be explored in a follow-up enhancement if deemed necessary.
