# Data Cleaning and Preprocessing

## Overview
This notebook covers essential steps for preparing raw data before analysis. The goal is to ensure that datasets are structured, free from inconsistencies, and ready for merging or visualization. The steps include:

- Handling missing values
- Formatting column names
- Filtering and removing duplicates
- Creating new features if necessary

Data cleaning is an early step in most data analysis pipelines. It protects the integrity and reliability of the insights drawn from the dataset.

---

## Importing Necessary Libraries
This notebook relies primarily on **pandas** for data wrangling and **chardet** for encoding detection.  
Visualisation libraries are *not* required at this stage and have been removed to keep the environment lean.
- `pandas`: For working with structured data (tables, CSV, Excel)
- `numpy`: For numerical computations
- `chardet`: For detecting the character encoding of input files

In [1]:
# Import Libraries
import os
import pandas as pd
import numpy as np
import chardet
from pathlib import Path

# Use the notebook's working directory as the reference point.
data_folder = Path.cwd()

# Go one level up (to the parent folder) and then into "02 Data".
input_path = data_folder.parent / '02 Data' / '00_original_data'
output_path = data_folder.parent / '02 Data' / '01_processed_data' / '01_clean_data'

# Create the output folder if it doesn't exist
os.makedirs(output_path, exist_ok=True)

print("Data folder:", data_folder)
print("Input path:", input_path)
print("Output path:", output_path)

from IPython.display import display, HTML

# Force Jupyter Notebook to use all available horizontal space
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 1000)          # Set width to a large number
pd.set_option('display.max_colwidth', None)     # Show full column content if needed
pd.set_option('display.float_format', lambda x: f"{x:,.2f}".replace(',', ' '))  # Two decimals, space as thousands separator

Data folder: C:\Users\User\Dropbox\Personal\CareerFoundry\06 Sourcing data\Notebook folder\03 scripts
Input path: C:\Users\User\Dropbox\Personal\CareerFoundry\06 Sourcing data\Notebook folder\02 Data\00_original_data
Output path: C:\Users\User\Dropbox\Personal\CareerFoundry\06 Sourcing data\Notebook folder\02 Data\01_processed_data\01_clean_data


### Verify input folder and list available files
This cell checks that `input_path` exists, lists every file in it, and asks which ones to process. Only `.csv` and `.pkl` files are loaded in the later cells; other formats are skipped.

In [2]:
# Verify the input folder exists and list available files.
files_list = []  # Stays empty if the folder is missing or the selection is invalid
if not os.path.exists(input_path):
    print(f"Error: The folder '{input_path}' does not exist. Please ensure the base folder is correct.")
else:
    available_files = sorted(os.listdir(input_path))  # Sorted so the numbering is stable
    print("Available files in the input folder:")
    for idx, f in enumerate(available_files, start=1):
        print(f"{idx}. {f}")
    
    file_numbers_input = input(
        "\nEnter the file numbers to process (comma-separated), or leave blank to process all files: "
    ).strip()
    
    if file_numbers_input:
        try:
            indices = [int(num.strip()) for num in file_numbers_input.split(',') if num.strip()]
            # Validate indices and build the list of selected files.
            files_list = [available_files[i-1] for i in indices if 1 <= i <= len(available_files)]
            if not files_list:
                print("No valid file numbers were entered.")
        except ValueError:
            print("Error: Please enter valid numbers separated by commas.")
            files_list = []
    else:
        files_list = available_files

    print("\nFiles selected for processing:", files_list)

Available files in the input folder:
1. service_types.csv
2. weekly_deliveries.csv
3. work_time_and_km.csv



Enter the file numbers to process (comma-separated), or leave blank to process all files:  3



Files selected for processing: ['work_time_and_km.csv']


### Initialize report generation variables
These variables will be used to track processing statistics and generate a summary report.

---

In [3]:
# Starting report generation
summary_lines = []
current_file = files_list[0]  # Get the first file from the list of selected files
file_path = os.path.join(input_path, current_file)

### Helper – `preview_csv_with_delimiters`
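Detects the file encoding with `chardet` from the first 100,000 bytes, then previews the first three rows with each candidate delimiter so the correct one can be chosen by eye. Returns the detected encoding and a dictionary of preview DataFrames keyed by delimiter; delimiters that fail to parse are reported and left out.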

In [4]:
def preview_csv_with_delimiters(file_path, 
                                fallback_encodings=('latin1', 'ISO-8859-1', 'cp1252'), 
                                possible_delimiters=(',', ';', '\t', '|')):
    """
    Detects the file encoding, then previews the first 3 rows of the CSV using various delimiters.
    Returns the detected encoding and a dictionary of previews keyed by delimiter.
    Note: `fallback_encodings` is currently unused.
    """
    # Detect encoding using a sample from the file.
    with open(file_path, 'rb') as f:
        rawdata = f.read(100000)  # Read first 100k bytes
    detection = chardet.detect(rawdata)
    # chardet reports None when it cannot decide, so fall back to UTF-8 explicitly.
    encoding = detection.get('encoding') or 'utf-8'
    print(f"Detected encoding: {encoding}\n")
    
    previews = {}
    for delim in possible_delimiters:
        print(f"Preview using delimiter {repr(delim)}:")
        try:
            # Read only the first 3 rows for preview
            df_preview = pd.read_csv(file_path, encoding=encoding, sep=delim, nrows=3)
            previews[delim] = df_preview
            print(df_preview)
        except Exception as e:
            print(f"Failed with delimiter {repr(delim)}. Error: {e}")
        print("-" * 50 + "\n")
    return encoding, previews

# Dictionary to store the DataFrames for each file.
df = {}

for file in files_list:
    file_path = os.path.join(input_path, file)
    if os.path.exists(file_path):
        if file.endswith('.csv'):
            print(f"Processing CSV file: {file}")
            # Preview the CSV with different delimiters.
            encoding, previews = preview_csv_with_delimiters(file_path)
            
            # List the delimiter options with numbers.
            possible_delimiters = [',', ';', '\t', '|']
            print("Select the correct delimiter from the options below:")
            for idx, delim in enumerate(possible_delimiters, start=1):
                print(f"{idx}. {repr(delim)}")
            
            # Ask the user to choose the correct delimiter by number.
            while True:
                try:
                    selected_number = int(input("Enter the number corresponding to the correct delimiter: "))
                    if 1 <= selected_number <= len(possible_delimiters):
                        selected_delim = possible_delimiters[selected_number - 1]
                        break
                    else:
                        print("Invalid number. Please choose a valid option.")
                except ValueError:
                    print("Please enter a valid integer.")
            
            # Now load the full CSV using the selected delimiter.
            try:
                df_csv = pd.read_csv(file_path, encoding=encoding, sep=selected_delim)
                df[file] = df_csv
                print(f"Successfully loaded '{file}' with delimiter {repr(selected_delim)} " \
                      f"(rows: {df_csv.shape[0]}, columns: {df_csv.shape[1]})")
            except Exception as e:
                print(f"Error loading CSV file {file} with delimiter {repr(selected_delim)}: {e}")
                continue
        elif file.endswith('.pkl'):
            try:
                df[file] = pd.read_pickle(file_path)
                print(f"Loaded pickle file '{file}' (rows: {df[file].shape[0]}, columns: {df[file].shape[1]})")
            except Exception as e:
                print(f"Error reading pickle file {file}: {e}")
                continue
        else:
            print(f"Skipping unsupported file format: {file}")
            continue
        
        print("=" * 100 + "\n")
    else:
        print(f"File {file} not found and will be skipped.")

# Logging details.
report_details = [f"File: {current_file}"]
report_details.append(f"Total loaded files: {len(df)}")
modifications = []

Processing CSV file: work_time_and_km.csv
Detected encoding: UTF-8-SIG

Preview using delimiter ',':
Failed with delimiter ','. Error: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

--------------------------------------------------

Preview using delimiter ';':
         Date  Year  Month  Day  Route  Route_id Start_time  end_time   time distance
0  17.03.2025  2025      3   17    102         2   09:00:00  19:25:00  10,42      115
1  18.03.2025  2025      3   18    202         2   09:34:00  19:02:00   9,47    53,19
2  19.03.2025  2025      3   19    302         2   08:22:00  17:36:00   9,23    49,32
--------------------------------------------------

Preview using delimiter '\t':
  Date;Year;Month;Day;Route;Route_id;Start_time;end_time;time;distance
0               17.03.2025;2025;3;17;102;2;09:00:00;19:25:00;10,42;115
1              18.03.2025;2025;3;18;202;2;09:34:00;19:02:00;9,47;53,19
2              19.03.2025;2025;3;19;302;2;08:22:00;17:36:00;9,23;49,32
------

Enter the number corresponding to the correct delimiter:  2


Successfully loaded 'work_time_and_km.csv' with delimiter ';' (rows: 20, columns: 10)
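
The delimiter above is chosen by eye. As a possible complement, the standard library's `csv.Sniffer` can propose a delimiter from a text sample. The sketch below is an illustration only: `sniff_delimiter` is a hypothetical helper that is not part of this notebook, and the sample rows are copied from the preview output above. Sniffing is a heuristic and can fail on short or irregular samples, so the manual prompt remains a sensible fallback.

```python
import csv


def sniff_delimiter(sample_text, candidates=",;\t|"):
    """Return the delimiter csv.Sniffer considers most likely, or None if it cannot decide."""
    try:
        dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    except csv.Error:
        return None  # Heuristic failed; ask the user instead
    return dialect.delimiter


# Sample rows taken from the preview of work_time_and_km.csv
sample = (
    "Date;Year;Month;Day;Route;Route_id;Start_time;end_time;time;distance\n"
    "17.03.2025;2025;3;17;102;2;09:00:00;19:25:00;10,42;115\n"
    "18.03.2025;2025;3;18;202;2;09:34:00;19:02:00;9,47;53,19\n"
    "19.03.2025;2025;3;19;302;2;08:22:00;17:36:00;9,23;49,32\n"
)

detected = sniff_delimiter(sample)
print(repr(detected))
```

The returned value could pre-select the default answer in the delimiter prompt rather than replace the prompt altogether.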



### Quick HTML preview of all loaded DataFrames
Shows the first two rows of every imported file for a visual sanity check.

In [5]:
# present all imported DataFrames

for file_name, data in df.items():
    html = data.to_html(max_rows=2, max_cols=30)
    display(HTML(f'<h4>{file_name}</h4><div style="overflow-x: auto; width:100%;">{html}</div>'))

          Date  Year  Month  Day  Route  Route_id  Start_time  end_time   time  distance
0   17.03.2025  2025      3   17    102         2    09:00:00  19:25:00  10,42       115
..         ...   ...    ...  ...    ...       ...         ...       ...    ...       ...
19  21.03.2025  2025      3   21    543        43    10:56:00  15:44:00    4,8        53


In [6]:
df[current_file].shape

(20, 10)

## Handling Missing Values
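Missing data can affect analysis accuracy. The cell below first lets you exclude columns that are not needed; the per-column missing-value check runs in a later cell. Typical strategies include: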
- Removing rows/columns with excessive missing values
- Imputing missing values based on statistical methods
- Using placeholders for unknown values
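
The three strategies listed above can be sketched on a small example. This is a minimal illustration, not the notebook's interactive logic: the toy DataFrame, its column names, and the 50% threshold are all assumptions chosen for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "route": ["A", "B", None, "D"],
    "distance": [10.0, np.nan, 30.0, 50.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# 1. Remove columns where more than half of the values are missing (assumed threshold).
share_missing = toy.isna().mean()
toy = toy.drop(columns=share_missing[share_missing > 0.5].index)

# 2. Impute a numeric column with its median.
toy["distance"] = toy["distance"].fillna(toy["distance"].median())

# 3. Use a placeholder for unknown text values.
toy["route"] = toy["route"].fillna("unknown")

print(toy)
```

Which strategy fits depends on the column: dropping is safest when a column carries almost no information, while imputation changes the distribution and should be recorded in the modification log.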

In [7]:
# Data Customization – Exclude Columns
data = df[current_file]
exclude_cols_input = input("\nEnter columns to exclude (comma-separated), or press Enter to skip: ").strip()
if exclude_cols_input:
    exclude_cols = [col.strip() for col in exclude_cols_input.split(',') if col.strip()]
    data = data.drop(columns=exclude_cols, errors='ignore')
    modifications.append(f"Excluded columns: {', '.join(exclude_cols)}")
    print(f"Columns excluded: {exclude_cols}")
else:
    print("No columns were excluded.")

# Update the DataFrame in the dictionary.
df[current_file] = data


Enter columns to exclude (comma-separated), or press Enter to skip:  


No columns were excluded.


## Formatting and Standardizing Column Names
To ensure consistency across datasets, column names should follow one convention. The cell below renames columns interactively, one at a time; the target convention is:
- Lowercase letters
- Underscores instead of spaces
- No special characters
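
The convention above can also be applied automatically. The following is a minimal sketch under stated assumptions: `standardize_column_name` is a hypothetical helper that does not exist in this notebook, and the example column names are made up. Note that it drops every character outside `a-z`, `0-9`, and `_`, including accented letters.

```python
import re

import pandas as pd


def standardize_column_name(name):
    """Lowercase the name, turn whitespace into underscores, drop other special characters."""
    name = str(name).strip().lower()
    name = re.sub(r"\s+", "_", name)        # spaces and tabs -> underscore
    name = re.sub(r"[^0-9a-z_]", "", name)  # remove anything else
    return name


# Hypothetical column names for illustration
example = pd.DataFrame(columns=["Start_time", "Route id", "Distance (km)"])
example.columns = [standardize_column_name(c) for c in example.columns]
print(list(example.columns))
```

Two different original names can collapse to the same standardized name, so it is worth checking `example.columns.is_unique` afterwards.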

In [8]:
# Data Customization – Rename Columns
while True:
    print("\nCurrent column names:")
    print(list(data.columns))
    col_to_rename = input("Enter column name to rename, or press Enter to stop renaming: ").strip()
    if not col_to_rename:
        break
    if col_to_rename in data.columns:
        new_name = input(f"Enter new name for column '{col_to_rename}': ").strip()
        data.rename(columns={col_to_rename: new_name}, inplace=True)
        modifications.append(f"Renamed column '{col_to_rename}' to '{new_name}'")
        print(f"Renamed '{col_to_rename}' to '{new_name}'")
    else:
        print(f"Column '{col_to_rename}' not found.")
        
df[current_file] = data


Current column names:
['Date', 'Year', 'Month', 'Day', 'Route', 'Route_id', 'Start_time', 'end_time', 'time', 'distance']


Enter column name to rename, or press Enter to stop renaming:  


In [9]:
# Example function to clean currency-like strings and convert to float
def parse_currency_string(value):
    """
    Attempts to remove currency symbols, spaces, and convert commas to dots,
    then parses the result as a float. Leaves NaNs as NaN.
    """
    if pd.isna(value):
        return value  # Keep NaNs as is
    # Remove '$' if present (or other currency symbols)
    value = str(value).replace('$', '')
    # Remove spaces
    value = value.replace(' ', '')
    # Replace comma with dot (assuming comma decimal format)
    value = value.replace(',', '.')
    # Convert to float
    return float(value)

# List of available data types (key + explanation)
available_dtypes = [
    ('int', "Integer (no decimal part). E.g. 42"),
    ('float', "Floating-point number (decimal allowed). E.g. 3.14"),
    ('str', "String (text). E.g. 'Hello world'"),
    ('bool', "Boolean (True or False)"),
    ('datetime64[ns]', "Date and time in Pandas datetime format"),
    ('category', "Categorical data (saves memory if repetitive values)"),
]

# Data Customization – Change Data Types by column number
while True:
    
    # Number each column
    columns_list = list(data.columns)
    print("\nColumns:")
    for i, col in enumerate(columns_list, start=1):
        print(f"{i}. {col} (current dtype: {data[col].dtype})")
    
    # Ask user which column to convert by number
    col_number_input = input("\nEnter the column number to change data type, or press Enter to stop:\n").strip()
    if not col_number_input:
        break  # Stop if user presses Enter without a choice
    
    try:
        col_number = int(col_number_input)
        if 1 <= col_number <= len(columns_list):
            col_to_cast = columns_list[col_number - 1]
        else:
            print("Invalid column number. Please try again.")
            continue
    except ValueError:
        print("Please enter a valid integer for the column number.")
        continue
    
    # Show available data types by number
    print("\nAvailable data types:")
    for idx, (dt_key, dt_expl) in enumerate(available_dtypes, start=1):
        print(f"{idx}. {dt_key} - {dt_expl}")
    
    # Ask user for the new data type by number
    dtype_choice = input(f"\nEnter the number corresponding to the desired data type for '{col_to_cast}':\n").strip()
    try:
        dtype_choice_num = int(dtype_choice)
        if 1 <= dtype_choice_num <= len(available_dtypes):
            new_dtype = available_dtypes[dtype_choice_num - 1][0]
        else:
            print("Invalid number. Please choose a valid option.")
            continue
    except ValueError:
        print("Please enter a valid integer.")
        continue
    
    # Try direct casting first
    try:
        data[col_to_cast] = data[col_to_cast].astype(new_dtype)
        modifications.append(f"Changed data type of '{col_to_cast}' to {new_dtype}")
        print(f"Changed data type of '{col_to_cast}' to {new_dtype}")
    except Exception as e:
        print(f"Failed to change data type of '{col_to_cast}' to {new_dtype}: {e}")
        
        # Attempt auto-clean only if user wants int or float
        if new_dtype in ["int", "float"]:
            print(f"\nAttempting to auto-clean the '{col_to_cast}' column for {new_dtype} conversion...")

            try:
                # 1. Clean the column (remove currency symbols, fix decimal separators, etc.)
                temp_col = data[col_to_cast].apply(parse_currency_string)
                
                # 2. Handle missing values before casting to int/float
                missing_count = temp_col.isna().sum()
                if missing_count > 0:
                    print(f"Found {missing_count} missing (NaN) values in '{col_to_cast}'.")
                    print("How would you like to handle these missing values?")
                    print("1. Drop rows with missing values")
                    print("2. Fill missing values with 0")
                    print("3. Fill missing values with a custom value")
                    print("4. Leave them as NaN (only works for float)")
                    mv_choice = input("Enter the number of your choice (1/2/3/4): ").strip()
                    
                    if mv_choice == "1":
                        temp_col = temp_col.dropna()
                        print("Dropped rows with missing values.")
                    elif mv_choice == "2":
                        temp_col = temp_col.fillna(0)
                        print("Filled missing values with 0.")
                    elif mv_choice == "3":
                        fill_val = input("Enter the value to fill missing values: ")
                        # Convert fill_val to the appropriate numeric type
                        if new_dtype == "float":
                            fill_val = float(fill_val)
                        elif new_dtype == "int":
                            fill_val = int(float(fill_val))  # in case user typed "3.0"
                        temp_col = temp_col.fillna(fill_val)
                        print(f"Filled missing values with '{fill_val}'.")
                    elif mv_choice == "4":
                        if new_dtype == "float":
                            print("Leaving missing values as NaN.")
                        else:
                            print("Cannot leave NaN if converting to int. Attempting fill with 0.")
                            temp_col = temp_col.fillna(0)
                    else:
                        print("Invalid choice. Leaving missing values as NaN for now.")
                
                # 3. Convert to float or int as requested
                if new_dtype == "int":
                    temp_col = temp_col.astype(int)
                else:
                    temp_col = temp_col.astype(float)
                
                # 4. Show side-by-side comparison for first 5 rows
                comparison_df = pd.DataFrame({
                    "original": data[col_to_cast].head(5),
                    "converted": temp_col.head(5)
                })
                
                print("\nPreview of original vs. converted values (first 5 rows):")
                print(comparison_df)
                
                # 5. Prompt user confirmation
                confirm = input("\nDoes this look correct? (y/n): ").strip().lower()
                if confirm == "y":
                    # If we dropped rows, we need to align the main DataFrame with temp_col’s index
                    data = data.reindex(temp_col.index)  # In case some rows were dropped
                    data[col_to_cast] = temp_col
                    modifications.append(f"Auto-cleaned and changed data type of '{col_to_cast}' to {new_dtype}")
                    print(f"Successfully auto-cleaned and changed data type of '{col_to_cast}' to {new_dtype}")
                else:
                    print("No changes applied.")
            
            except Exception as e2:
                print(f"Auto-cleaning also failed: {e2}")
        else:
            print("Auto-cleaning is only implemented for int or float conversions.")

# Finally, store the updated DataFrame back if needed
df[current_file] = data


Columns:
1. Date (current dtype: object)
2. Year (current dtype: int64)
3. Month (current dtype: int64)
4. Day (current dtype: int64)
5. Route (current dtype: int64)
6. Route_id (current dtype: int64)
7. Start_time (current dtype: object)
8. end_time (current dtype: object)
9. time (current dtype: object)
10. distance (current dtype: object)



Enter the column number to change data type, or press Enter to stop:
 1



Available data types:
1. int - Integer (no decimal part). E.g. 42
2. float - Floating-point number (decimal allowed). E.g. 3.14
3. str - String (text). E.g. 'Hello world'
4. bool - Boolean (True or False)
5. datetime64[ns] - Date and time in Pandas datetime format
6. category - Categorical data (saves memory if repetitive values)



Enter the number corresponding to the desired data type for 'Date':
 5


Changed data type of 'Date' to datetime64[ns]

Columns:
1. Date (current dtype: datetime64[ns])
2. Year (current dtype: int64)
3. Month (current dtype: int64)
4. Day (current dtype: int64)
5. Route (current dtype: int64)
6. Route_id (current dtype: int64)
7. Start_time (current dtype: object)
8. end_time (current dtype: object)
9. time (current dtype: object)
10. distance (current dtype: object)



Enter the column number to change data type, or press Enter to stop:
 7



Available data types:
1. int - Integer (no decimal part). E.g. 42
2. float - Floating-point number (decimal allowed). E.g. 3.14
3. str - String (text). E.g. 'Hello world'
4. bool - Boolean (True or False)
5. datetime64[ns] - Date and time in Pandas datetime format
6. category - Categorical data (saves memory if repetitive values)



Enter the number corresponding to the desired data type for 'Start_time':
 5


Changed data type of 'Start_time' to datetime64[ns]

Columns:
1. Date (current dtype: datetime64[ns])
2. Year (current dtype: int64)
3. Month (current dtype: int64)
4. Day (current dtype: int64)
5. Route (current dtype: int64)
6. Route_id (current dtype: int64)
7. Start_time (current dtype: datetime64[ns])
8. end_time (current dtype: object)
9. time (current dtype: object)
10. distance (current dtype: object)



Enter the column number to change data type, or press Enter to stop:
 8



Available data types:
1. int - Integer (no decimal part). E.g. 42
2. float - Floating-point number (decimal allowed). E.g. 3.14
3. str - String (text). E.g. 'Hello world'
4. bool - Boolean (True or False)
5. datetime64[ns] - Date and time in Pandas datetime format
6. category - Categorical data (saves memory if repetitive values)



Enter the number corresponding to the desired data type for 'end_time':
 5


Changed data type of 'end_time' to datetime64[ns]

Columns:
1. Date (current dtype: datetime64[ns])
2. Year (current dtype: int64)
3. Month (current dtype: int64)
4. Day (current dtype: int64)
5. Route (current dtype: int64)
6. Route_id (current dtype: int64)
7. Start_time (current dtype: datetime64[ns])
8. end_time (current dtype: datetime64[ns])
9. time (current dtype: object)
10. distance (current dtype: object)



Enter the column number to change data type, or press Enter to stop:
 9



Available data types:
1. int - Integer (no decimal part). E.g. 42
2. float - Floating-point number (decimal allowed). E.g. 3.14
3. str - String (text). E.g. 'Hello world'
4. bool - Boolean (True or False)
5. datetime64[ns] - Date and time in Pandas datetime format
6. category - Categorical data (saves memory if repetitive values)



Enter the number corresponding to the desired data type for 'time':
 2


Failed to change data type of 'time' to float: could not convert string to float: '10,42'

Attempting to auto-clean the 'time' column for float conversion...

Preview of original vs. converted values (first 5 rows):
  original  converted
0    10,42      10.42
1     9,47       9.47
2     9,23       9.23
3     9,48       9.48
4     9,17       9.17



Does this look correct? (y/n):  y


Successfully auto-cleaned and changed data type of 'time' to float

Columns:
1. Date (current dtype: datetime64[ns])
2. Year (current dtype: int64)
3. Month (current dtype: int64)
4. Day (current dtype: int64)
5. Route (current dtype: int64)
6. Route_id (current dtype: int64)
7. Start_time (current dtype: datetime64[ns])
8. end_time (current dtype: datetime64[ns])
9. time (current dtype: float64)
10. distance (current dtype: object)



Enter the column number to change data type, or press Enter to stop:
 10



Available data types:
1. int - Integer (no decimal part). E.g. 42
2. float - Floating-point number (decimal allowed). E.g. 3.14
3. str - String (text). E.g. 'Hello world'
4. bool - Boolean (True or False)
5. datetime64[ns] - Date and time in Pandas datetime format
6. category - Categorical data (saves memory if repetitive values)



Enter the number corresponding to the desired data type for 'distance':
 2


Failed to change data type of 'distance' to float: could not convert string to float: '53,19'

Attempting to auto-clean the 'distance' column for float conversion...

Preview of original vs. converted values (first 5 rows):
  original  converted
0      115     115.00
1    53,19      53.19
2    49,32      49.32
3    43,14      43.14
4   101,08     101.08



Does this look correct? (y/n):  y


Successfully auto-cleaned and changed data type of 'distance' to float

Columns:
1. Date (current dtype: datetime64[ns])
2. Year (current dtype: int64)
3. Month (current dtype: int64)
4. Day (current dtype: int64)
5. Route (current dtype: int64)
6. Route_id (current dtype: int64)
7. Start_time (current dtype: datetime64[ns])
8. end_time (current dtype: datetime64[ns])
9. time (current dtype: float64)
10. distance (current dtype: float64)



Enter the column number to change data type, or press Enter to stop:
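
Two details of the run above are worth noting. Casting `Date` with `astype('datetime64[ns]')` leaves the day/month order to pandas' inference, and casting the time-of-day columns attaches a calendar date to them (2025-05-16 appears in the outlier previews further down). The sketch below shows one possible alternative with explicit formats. It is an assumption-laden illustration, not the notebook's method: the inline sample mimics a few columns of `work_time_and_km.csv`, and `hours_worked` is a hypothetical check column.

```python
import io

import pandas as pd

# Inline sample that mimics part of the source file
raw = io.StringIO(
    "Date;Start_time;end_time;time;distance\n"
    "17.03.2025;09:00:00;19:25:00;10,42;115\n"
    "18.03.2025;09:34:00;19:02:00;9,47;53,19\n"
)

# decimal=',' makes pandas read '10,42' as 10.42 while loading.
sample = pd.read_csv(raw, sep=";", decimal=",")

# An explicit day-first format avoids ambiguity for dates such as 05.03.2025.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d.%m.%Y")

# Store times of day as durations since midnight, so no calendar date is attached.
for col in ["Start_time", "end_time"]:
    sample[col] = pd.to_timedelta(sample[col])

# Hypothetical consistency check against the 'time' column (hours)
sample["hours_worked"] = (sample["end_time"] - sample["Start_time"]).dt.total_seconds() / 3600

print(sample.dtypes)
```

With `decimal=','` set at load time, the auto-clean fallback for `time` and `distance` would not be needed for this file.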
 


### Handle missing values
Runs per-column NaN checks, prints a summary, and asks how to handle each affected column. IQR-based outlier detection follows in a separate cell.

In [10]:
# Checking for missing values in the dataset
print("\nChecking for missing values...")
missing_summary = data.isnull().sum()
missing_summary = missing_summary[missing_summary > 0]
rows_before = len(data)

if not missing_summary.empty:
    print("Columns with missing values and their counts:")
    print(missing_summary)
    
    print("\nColumn statistics:")
    display(data.describe())
    
    # Loop through each column that has missing values
    for column in missing_summary.index:
        # Display the first 5 rows where the current column has missing values
        print(f"\nFirst 5 rows where '{column}' is missing:")
        display(data[data[column].isnull()].head())
        
        fill_method = input(
            f"Enter method to handle missing values for '{column}' "
            "(mean, median, drop, or custom value), or press Enter to skip: "
        ).strip()
        
        if fill_method == 'mean':
            data[column] = data[column].fillna(data[column].mean())
            modifications.append(f"Filled missing values in '{column}' with mean")
        
        elif fill_method == 'median':
            data[column] = data[column].fillna(data[column].median())
            modifications.append(f"Filled missing values in '{column}' with median")
        
        elif fill_method == 'drop':
            data.dropna(subset=[column], inplace=True)
            modifications.append(f"Dropped rows with missing values in '{column}'")
        
        elif fill_method:
            try:
                if pd.api.types.is_numeric_dtype(data[column]):  # integer, float, and other numeric types
                    value = float(fill_method)
                else:
                    value = fill_method
                
                data[column] = data[column].fillna(value)
                modifications.append(f"Filled missing values in '{column}' with custom value: {value}")
            
            except ValueError:
                print(f"Invalid custom value for '{column}', skipping...")
else:
    print("No missing values detected.")

rows_after = len(data)
report_details.append(f"Rows dropped due to missing values: {rows_before - rows_after}")

df[current_file] = data


Checking for missing values...
No missing values detected.


### Duplicate row management
Identify and optionally drop exact duplicates, reporting the count removed.

In [11]:
# Duplicate Row Management

print("\nChecking for duplicate rows...")
duplicates = data.duplicated().sum()
print(f"Found {duplicates} duplicate rows.")
report_details.append(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print("Preview of duplicate rows:")
    display(data[data.duplicated()].head())
    drop_dup = input("Do you want to drop duplicates? (yes/no): ").strip().lower()
    if drop_dup == 'yes':
        rows_before_dup = len(data)
        data.drop_duplicates(inplace=True)
        rows_after_dup = len(data)
        modifications.append("Dropped duplicate rows")
        report_details.append(f"Rows dropped due to duplicates: {rows_before_dup - rows_after_dup}")
        print("Duplicates dropped.")
    else:
        print("Duplicates not dropped.")
        
df[current_file] = data


Checking for duplicate rows...
Found 0 duplicate rows.
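
`data.duplicated()` only flags rows that match in every column. A dataset can contain no exact duplicates and still hold repeated records for the same key. The sketch below illustrates the difference; the toy rows are invented, and treating `Date` plus `Route` as a key is an assumption about this dataset rather than something the source confirms.

```python
import pandas as pd

# Hypothetical rows: same date and route, slightly different distance
toy = pd.DataFrame({
    "Date": ["17.03.2025", "17.03.2025", "18.03.2025"],
    "Route": [102, 102, 202],
    "distance": [115.0, 114.0, 53.19],
})

# Rows identical in every column
exact_count = int(toy.duplicated().sum())

# All rows that share the assumed key (keep=False marks every member of a group)
key_mask = toy.duplicated(subset=["Date", "Route"], keep=False)
key_count = int(key_mask.sum())

print(exact_count, key_count)
print(toy[key_mask])
```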


### Outlier detection with the IQR rule  
For every numeric column, compute quartiles and flag observations lying outside **[Q1 – 1.5 × IQR, Q3 + 1.5 × IQR]**. Store counts per column for reporting and preview the first offending rows.  

In [12]:
# Outlier Detection

print("\nDetecting outliers in numeric columns...")
outliers_info = {}
for col in data.select_dtypes(include=['number']).columns:
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outlier_rows = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    if not outlier_rows.empty:
        outliers_info[col] = len(outlier_rows)
        print(f"Column '{col}': {len(outlier_rows)} outlier rows detected.")
        display(outlier_rows.head())
        
if outliers_info:
    modifications.append(f"Outliers detected: {outliers_info}")
    report_details.append("Outlier detection completed.")
else:
    report_details.append("No outliers detected.")


Detecting outliers in numeric columns...
Column 'Route_id': 5 outlier rows detected.


Unnamed: 0,Date,Year,Month,Day,Route,Route_id,Start_time,end_time,time,distance
0,2025-03-17,2025,3,17,102,2,2025-05-16 09:00:00,2025-05-16 19:25:00,10.42,115.0
1,2025-03-18,2025,3,18,202,2,2025-05-16 09:34:00,2025-05-16 19:02:00,9.47,53.19
2,2025-03-19,2025,3,19,302,2,2025-05-16 08:22:00,2025-05-16 17:36:00,9.23,49.32
3,2025-03-20,2025,3,20,402,2,2025-05-16 08:32:00,2025-05-16 18:01:00,9.48,43.14
4,2025-03-21,2025,3,21,502,2,2025-05-16 07:30:00,2025-05-16 16:40:00,9.17,101.08


Column 'time': 4 outlier rows detected.


Unnamed: 0,Date,Year,Month,Day,Route,Route_id,Start_time,end_time,time,distance
15,2025-03-17,2025,3,17,143,43,2025-05-16 13:30:00,2025-05-16 17:15:00,3.75,31.0
16,2025-03-18,2025,3,17,243,43,2025-05-16 08:00:00,2025-05-16 20:21:00,12.35,74.0
18,2025-03-20,2025,3,20,443,43,2025-05-16 11:46:00,2025-05-16 16:18:00,4.53,56.0
19,2025-03-21,2025,3,21,543,43,2025-05-16 10:56:00,2025-05-16 15:44:00,4.8,53.0


Column 'distance': 1 outlier rows detected.


Unnamed: 0,Date,Year,Month,Day,Route,Route_id,Start_time,end_time,time,distance
0,2025-03-17,2025,3,17,102,2,2025-05-16 09:00:00,2025-05-16 19:25:00,10.42,115.0
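
In the output above, `Route_id` is flagged although it appears to be an identifier rather than a measurement, so the IQR rule says little about it. The sketch below wraps the same rule in a function that can skip such columns. It is a minimal illustration: `iqr_outlier_counts` and its `skip_columns` parameter are hypothetical names, and the toy values are invented.

```python
import pandas as pd


def iqr_outlier_counts(frame, skip_columns=()):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    counts = {}
    numeric = frame.select_dtypes(include="number")
    numeric = numeric.drop(columns=list(skip_columns), errors="ignore")
    for col in numeric.columns:
        q1 = numeric[col].quantile(0.25)
        q3 = numeric[col].quantile(0.75)
        iqr = q3 - q1
        mask = (numeric[col] < q1 - 1.5 * iqr) | (numeric[col] > q3 + 1.5 * iqr)
        if mask.any():
            counts[col] = int(mask.sum())
    return counts


# Hypothetical values for illustration
toy = pd.DataFrame({
    "Route_id": [2, 2, 2, 2, 43],
    "distance": [50.0, 52.0, 49.0, 51.0, 115.0],
})

all_counts = iqr_outlier_counts(toy)
measure_counts = iqr_outlier_counts(toy, skip_columns=["Route_id"])
print(all_counts, measure_counts)
```

With only 20 rows in the file, the quartiles are sensitive to single values, so flagged rows are better treated as candidates for review than as errors.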


### Ask for export format (CSV / PKL)
Interactive prompt determining output format.

In [13]:
# Ask user for file format preference: CSV or pkl
file_format = input("Enter desired output file format (csv or pkl): ").strip().lower()
while file_format not in ['csv', 'pkl']:
    file_format = input("Invalid format. Please enter 'csv' or 'pkl': ").strip().lower()

# Create default filename based on original (without extension)
original_name_without_ext = os.path.splitext(current_file)[0]
default_output_filename = f"{original_name_without_ext}_clean"

# Prompt for filename but use default if empty
output_filename = input(f"Enter the desired file name (without extension) or press Enter for '{default_output_filename}': ").strip()
if not output_filename:
    output_filename = default_output_filename

output_file = os.path.join(output_path, f"{output_filename}.{file_format}")

# Save the processed DataFrame in the selected format
if file_format == 'csv':
    data.to_csv(output_file, index=False)
elif file_format == 'pkl':
    data.to_pickle(output_file)

print(f"\n✅ Processed file saved to: {output_file}")
report_details.append(f"Processed file saved to: {output_file}")
report_details.append(f"Total rows in the exported file: {len(data)}")

# Update the stored data frame for the current file
df[current_file] = data

Enter desired output file format (csv or pkl):  csv
Enter the desired file name (without extension) or press Enter for 'work_time_and_km_clean':  



✅ Processed file saved to: C:\Users\User\Dropbox\Personal\CareerFoundry\06 Sourcing data\Notebook folder\02 Data\01_processed_data\01_clean_data\work_time_and_km_clean.csv
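
One consequence of choosing CSV is that column dtypes are not stored in the file, so datetime columns come back as text unless they are parsed again on load; a pickle keeps the dtypes. The sketch below demonstrates the CSV round trip in memory. The two-row frame is invented for illustration and no file is written.

```python
import io

import pandas as pd

# Hypothetical cleaned frame with a datetime column
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2025-03-17", "2025-03-18"]),
    "time": [10.42, 9.47],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain reload: the date column is no longer a datetime
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with parse_dates: the datetime dtype is restored
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```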
