<h2>This Python example scans a local folder and converts 25+ .csv files into Parquet files.</h2>

- Scans a folder (~/Desktop/csvinput) for CSV files.
- Uses glob to find files matching a pattern (here, every file ending in .csv).
- Creates an output structure (~/Desktop/csvdatabase/<table_name>/), with one subfolder per CSV, named after the file.
- Reads each CSV, trying several encodings in turn so that one undecodable file does not stop the run.
- Converts the CSV into Parquet format (a compressed, columnar storage format that’s typically smaller on disk and faster to query than CSV).
- Saves the Parquet file in its own subfolder.

<h2>Error handling for bad data</h2>
Bad data is an unfortunate reality we all have to deal with. In the example below, the Python script skips any .csv row it cannot parse (on_bad_lines='skip'). Since this is test data, I decided to skip the bad rows. With production data, you would want different error handling that identifies the bad rows so they can be fixed.
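
For production data, one option is to capture the rejected rows instead of dropping them silently. The sketch below assumes pandas 1.4 or later, where `on_bad_lines` accepts a callable when `engine='python'`. The helper name `read_csv_collecting_bad_rows` and the in-memory sample data are hypothetical and only there for illustration.

```python
import io

import pandas as pd


def read_csv_collecting_bad_rows(source, encoding="utf-8"):
    """Read a CSV and return (DataFrame, list of rows pandas rejected)."""
    bad_rows = []

    def remember_bad_line(fields):
        # pandas passes the offending row as a list of strings.
        bad_rows.append(fields)
        # Returning None tells pandas to leave the row out of the DataFrame.
        return None

    df = pd.read_csv(
        source,
        encoding=encoding,
        engine="python",  # callables are only supported by the python engine
        on_bad_lines=remember_bad_line,
    )
    return df, bad_rows


# Hypothetical sample: the second data row has one field too many.
sample = io.StringIO("id,name\n1,alice\n2,bob,extra\n3,carol\n")
df, bad_rows = read_csv_collecting_bad_rows(sample)
print(f"Kept {len(df)} rows, rejected {len(bad_rows)}: {bad_rows}")
```

The rejected rows could then be written to a log or a separate file for review, rather than disappearing from the output.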

In [None]:
import pandas as pd
import os
from glob import glob

csv_folder = os.path.expanduser('~/Desktop/csvinput')
database_folder = os.path.expanduser('~/Desktop/csvdatabase')
os.makedirs(database_folder, exist_ok=True)

csv_files = glob(os.path.join(csv_folder, '*.csv'))

for file in csv_files:
    table_name = os.path.splitext(os.path.basename(file))[0]
    table_folder = os.path.join(database_folder, table_name)
    os.makedirs(table_folder, exist_ok=True)
    output_path = os.path.join(table_folder, f'{table_name}.parquet')

    # Try several encodings in turn; the first one that reads the file wins
    encodings = ['utf-8', 'utf-8-sig', 'latin1', 'cp1252']
    df = None
    for enc in encodings:
        try:
            # Use python engine and skip bad lines
            df = pd.read_csv(file, encoding=enc, engine='python', on_bad_lines='skip')
            print(f"Read {file} using {enc}")
            break
        except Exception as e:
            print(f"Failed with encoding {enc}: {e}")
    
    if df is None:
        print(f"Skipping {file} completely due to unreadable encoding or malformed rows")
        continue

    df.to_parquet(output_path, engine='pyarrow', index=False)
    print(f"Saved {table_name} as Parquet in {table_folder}")
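
One caveat about an encoding fallback list like the one in the script: `latin1` maps every possible byte to a character, so decoding with it never raises an error. Any encoding listed after it is never tried, and a file that is really `cp1252` can be read "successfully" with the wrong characters. The sketch below shows this with plain bytes and the standard library only. The helper `first_working_encoding` and the sample string are hypothetical.

```python
def first_working_encoding(raw, encodings):
    """Return the first encoding that decodes raw without error, else None."""
    for enc in encodings:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Hypothetical sample: text saved by a Windows tool using cp1252.
original = "café €5"
raw = original.encode("cp1252")

# Same order as the script: latin1 comes before cp1252.
chosen = first_working_encoding(raw, ["utf-8", "utf-8-sig", "latin1", "cp1252"])
print(chosen)

# latin1 decodes without complaint but turns the euro sign into a control character.
print(raw.decode("latin1") == original)
print(raw.decode("cp1252") == original)
```

If the input files are likely to come from Windows tools, placing `cp1252` ahead of `latin1` and keeping `latin1` as the last resort is one way to reduce silent mis-decoding.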

