ENH: Make pd.to_datetime with format parameter more robust to dirty data #62840

@CoderMoray

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When pd.to_datetime is called with an explicit format parameter, it strictly expects every value in the input series/column to match that exact format. If even a single value deviates (e.g., due to dirty or corrupted data), the function either raises an error (with errors='raise', the default) or silently converts every non-conforming value to NaT (with errors='coerce'). Users are therefore often forced to clean or validate the data manually before calling to_datetime, which is cumbersome and interrupts the data-processing workflow, especially when the user believes the majority of the data follows the specified format.
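For illustration, here is the current behavior with a single malformed entry, using only the public API:

```python
import pandas as pd

# A mostly-clean column with a single corrupted entry.
s = pd.Series(["2023-01-01", "2023-01-02", "2023-01-03", "not-a-date"])

# Default errors='raise': one bad value aborts the entire conversion.
try:
    pd.to_datetime(s, format="%Y-%m-%d")
except ValueError as exc:
    print("parse failed:", exc)

# errors='coerce': non-conforming values are silently replaced with NaT.
coerced = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
print(coerced.isna().sum())  # number of values that failed to parse
```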

Feature Description

I propose modifying the internal behavior of pd.to_datetime when a format parameter is explicitly provided. The goal is to make the function more robust to a small proportion of dirty data that does not match the specified format, while still catching cases where the format is fundamentally incorrect for the data. This change would be internal and would not modify the errors parameter's existing behavior.

The suggested internal logic would be:

  1. Check for Explicit Format: If the format argument is provided (i.e., it is neither None nor 'mixed'), proceed to the next steps. Otherwise, keep the current behavior.
  2. Initial Parsing Attempt: Call the internal datetime parsing logic using the provided format and errors='coerce'. This produces an intermediate result series, let's call it temp_result.
  3. Calculate Success Rate: Calculate the proportion of non-NA/NaT values in temp_result compared to the total number of input values.
  4. Apply Threshold Logic:
    • If the success rate (proportion of non-NaT values in temp_result) is greater than a defined threshold (e.g., 50%): Return temp_result. This means the specified format is considered appropriate for the majority of the data, and the few entries that failed to parse are marked as NaT.
    • If the success rate is less than or equal to the threshold: Revert to the current behavior, which is equivalent to errors='raise'. An error should be raised, indicating that the data format does not match the specified format (or only a very small minority does). This ensures strictness when the format is likely incorrect.

Pseudocode for the core logic:

def to_datetime(..., format=None, errors='raise', ...):
    # ... (existing validation and setup code) ...

    if format is not None and format != 'mixed':
        # --- New Robust Logic Starts Here ---
        # Step 1: Parse with format using errors='coerce'
        temp_result = _internal_parse_function(input_arg, format=format, errors='coerce')
        
        # Step 2: Calculate success rate (guarding against empty input)
        total_values = len(temp_result)
        successful_parses = temp_result.notna().sum() # Count non-NaT values
        success_rate = successful_parses / total_values if total_values else 0.0

        # Step 3: Apply threshold (e.g., 50%)
        THRESHOLD = 0.5 # This value might need discussion/configurability later
        if success_rate > THRESHOLD:
            # Most data matched the format, return with NaT for failures
            return temp_result
        else:
            # Too little data matched, raise error like errors='raise'
            # This would likely involve re-parsing with errors='raise' or
            # raising a specific error calculated from temp_result's NaT locations
            return _internal_parse_function(input_arg, format=format, errors='raise') # Or equivalent error logic
        # --- New Robust Logic Ends Here ---
    else:
        # ... (existing logic for format=None, format='mixed', etc.) ...
        return _internal_parse_function(input_arg, format=format, errors=errors)

This approach aims to make to_datetime more robust to small amounts of dirty data when a specific format is expected, reducing the need for pre-processing, while still catching cases where the format is fundamentally wrong.

Describe alternatives you've considered

Currently, users must handle this scenario manually:

  • Pre-process the data to clean or identify/replace problematic entries before calling pd.to_datetime.
  • Call pd.to_datetime with errors='coerce', check the number of NaT values generated, and decide if the result is acceptable or if further action is needed.
  • Use pd.to_datetime without a format (relying on the default flexible parser), which might be slower and less precise if the user knows the intended format.

The proposed solution would encapsulate this common pattern of "try strict format, check success, decide action" into the function itself, based only on the user providing an explicit format.
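The "try strict format, check success, decide action" pattern can already be sketched in user code against the public API. The wrapper below is a hypothetical illustration (the name `to_datetime_robust` and the 0.5 threshold are assumptions from this proposal, not part of pandas); the actual enhancement would live inside pandas internals:

```python
import pandas as pd

def to_datetime_robust(values, fmt, threshold=0.5):
    # Hypothetical user-side sketch of the proposed threshold logic.
    # Step 1: parse leniently so failures become NaT.
    parsed = pd.to_datetime(values, format=fmt, errors="coerce")
    # Step 2: fraction of values that matched the format.
    success_rate = parsed.notna().mean()
    if success_rate > threshold:
        # Step 3a: majority matched -- keep NaT for the few failures.
        return parsed
    # Step 3b: too little data matched -- re-parse strictly so the
    # original errors='raise' error surfaces.
    return pd.to_datetime(values, format=fmt, errors="raise")
```

With three of four values matching (75% > 50%), the wrapper returns a result containing one NaT; if most values fail to match, the strict ValueError is raised instead.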


Additional Context

This enhancement would primarily benefit data analysts and engineers working with potentially messy datasets where a datetime column is expected to follow a specific format, but a few entries might be incorrect. It streamlines the common workflow of parsing datetimes with an expected format while gracefully handling minor inconsistencies. The change is localized to the scenario where format is specified, minimizing the impact on other existing behaviors.

A potential future optimization could involve an initial pass to assess the overall "date-likeness" or format compliance of the data before attempting the main parsing logic. Such a pre-check could potentially offer even better performance or more precise error reporting in cases where the data is largely non-conforming. However, this initial implementation focuses on the simpler threshold-based approach described above, which provides immediate benefits with a more contained change. The pre-check idea could be explored in a follow-up enhancement if deemed necessary.
