The article walks through cleaning more than 15 common file formats (structured, semi-structured, and unstructured) in Python, following a “dirty input → cleaned output → code” pattern for each format.

## Core ideas

- Treat “file format” as part of your data contract: you standardize I/O per format (e.g., CSV, Parquet, PDF) and keep cleaning logic close to ingestion. 
- Use format-specific libraries (pandas, pyarrow, sqlite3, json, yaml, HTML/XML parsers, PDF/Doc readers, audio/video/image libs) but normalize everything into a small number of canonical tabular or document schemas. 
- Design repeatable, testable cleaning functions: each format gets a loader + cleaner that handles types, missing values, deduping, normalization (dates, categories, casing), and structural fixes.
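
One way to picture the "each format gets a loader + cleaner" idea is a small registry keyed by file extension. The sketch below is an illustration, not code from the article: the names `LOADERS`, `register`, `clean_record`, and `load_and_clean` are hypothetical, and only CSV, TSV, and JSON are wired in using the standard library.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List

Record = Dict[str, object]
Loader = Callable[[Path], List[Record]]

# Hypothetical registry: file extension -> loader for that format.
LOADERS: Dict[str, Loader] = {}


def register(*extensions: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a loader for one or more extensions."""
    def wrap(func: Loader) -> Loader:
        for ext in extensions:
            LOADERS[ext.lower()] = func
        return func
    return wrap


def _read_delimited(path: Path, delimiter: str) -> List[Record]:
    with path.open(newline="", encoding="utf-8") as fh:
        return [dict(row) for row in csv.DictReader(fh, delimiter=delimiter)]


@register(".csv")
def load_csv(path: Path) -> List[Record]:
    return _read_delimited(path, ",")


@register(".tsv")
def load_tsv(path: Path) -> List[Record]:
    return _read_delimited(path, "\t")


@register(".json")
def load_json(path: Path) -> List[Record]:
    data = json.loads(path.read_text(encoding="utf-8"))
    # Accept either a single object or a list of objects.
    return data if isinstance(data, list) else [data]


def clean_record(record: Record) -> Record:
    """Shared cleaning: snake_case keys, trimmed strings, empty string -> None."""
    cleaned: Record = {}
    for key, value in record.items():
        if key is None:  # csv.DictReader uses None for surplus fields
            continue
        name = "_".join(key.strip().lower().split())
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[name] = value
    return cleaned


def load_and_clean(path: str) -> List[Record]:
    """Pick the loader by extension, then apply the shared cleaner."""
    p = Path(path)
    loader = LOADERS.get(p.suffix.lower())
    if loader is None:
        raise ValueError(f"No loader registered for {p.suffix!r}")
    return [clean_record(r) for r in loader(p)]
```

Adding a format then means registering one more loader; the cleaning step and the output shape stay the same for every caller.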

## Structured formats (tabular, DB, logs)

The article covers: CSV, TSV, Excel, Parquet, SQLite, generic `.db` files, and log-like delimited text.

- CSV/TSV:  
  - Use `pandas.read_csv` with explicit `dtype`, `na_values`, and encoding; fix delimiters, header issues, and bad quoting; then standardize column names, parse dates, and drop/flag bad rows. 
  - For TSV, specify `sep="\t"` and handle embedded tabs/quotes carefully.  

- Excel (multi-sheet):  
  - Use `pandas.read_excel` with `sheet_name=None` to load all sheets, then iterate to clean each sheet consistently (schema alignment, types, missing values).  
  - Useful when business users keep slightly different schemas per sheet that must be unified.

- Parquet:  
  - Use `pyarrow`/`pandas.read_parquet` when CSV is too slow or too large; enforce schema (column names and types) and handle partitioned datasets.
  - Good for large-scale analytics pipelines where schema drift must be detected early.  

- SQLite / `.db`:  
  - Connect via `sqlite3`/SQLAlchemy, inspect tables, and pull them into pandas; once the data is in a DataFrame, apply the same cleaning as for CSV.
  - Helps standardize “offline SQL” sources (local app DBs, exports from legacy systems).  

- Log-like / delimited text:  
  - Use `pandas.read_csv` with a custom or regex separator (regex separators use the slower Python parsing engine), or parse line by line with `re`; extract structured fields (timestamp, level, message, JSON payload), then normalize.
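
The CSV steps above (explicit `dtype` and `na_values`, standardized column names, parsed dates, flagged bad rows, deduplication) can be sketched as follows. This is a minimal sketch under assumptions: the sample data, the column names `amount` and `order_date`, and the helper names `load_csv` and `clean_orders` are invented for illustration.

```python
import io

import pandas as pd

# Invented sample: messy headers, padded names, a missing marker,
# a duplicate row, and an unparseable date.
RAW = """Order ID, Customer Name ,Amount,Order Date
1001, Ada Lovelace ,19.99,2024-01-05
1002,Bob,N/A,2024-01-06
1002,Bob,N/A,2024-01-06
1003,Cy,12.50,not a date
"""


def load_csv(source, sep: str = ",") -> pd.DataFrame:
    """Read every column as text so nothing is coerced before cleaning."""
    return pd.read_csv(
        source,
        sep=sep,
        dtype=str,
        na_values=["", "NA", "N/A", "null"],
        keep_default_na=False,
        encoding="utf-8",
        skipinitialspace=True,
    )


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Standardize column names: trim, lowercase, snake_case.
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )
    # Trim whitespace in every (still textual) cell.
    for col in out.columns:
        out[col] = out[col].str.strip()
    # Convert types; unparseable values become NaN/NaT instead of raising.
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    out["order_date"] = pd.to_datetime(
        out["order_date"], format="%Y-%m-%d", errors="coerce"
    )
    # Flag bad rows rather than dropping them silently.
    out["is_valid"] = out["amount"].notna() & out["order_date"].notna()
    # Remove exact duplicate rows.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean_orders(load_csv(io.StringIO(RAW)))
print(cleaned)
```

For TSV input the same loader should work with `sep="\t"`. Once Excel sheets, Parquet files, or SQLite tables are in a DataFrame, a cleaner of this shape can be reused unchanged.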

## Semi-structured formats (hierarchical text, config, web)

The article covers: JSON, newline-delimited JSON, XML/HTML-like markup, YAML/config files, and API responses.

- JSON / NDJSON:  
  - Load with `json` or `pandas.read_json` (pass `lines=True` for NDJSON); flatten nested structures with `pandas.json_normalize`; handle missing keys by providing defaults and validating against the expected schema.
  - For NDJSON, treat each line as a record and stream for large files.  

- XML / HTML:  
  - Parse with `lxml` or `BeautifulSoup`; map tags/attributes to a tabular or document schema.
  - Clean by stripping markup noise, normalizing whitespace, resolving malformed tags, and enforcing required fields.  

- YAML / config:  
  - Use PyYAML (`yaml.safe_load`) to load; validate against a schema (e.g., required keys, allowed enums, types) and fill defaults for missing values.
  - Useful when configs from different environments drift.  

- API-like JSON:  
  - Handle pagination, nested `data`/`meta` shapes, and rate-limit-induced partial results; normalize into canonical tables such as `users` and `events`.
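
For NDJSON, the "one record per line, flatten, fill defaults" idea can be sketched with the standard library alone. The hand-written `flatten` below stands in for `json_normalize`. The `DEFAULTS` mapping and the function `clean_ndjson` are hypothetical names, and treating a missing key as "use the default" is an assumption about the desired policy.

```python
import io
import json
from typing import Any, Dict, List, TextIO, Tuple

# Hypothetical expected schema: flattened field -> default when the key is missing.
DEFAULTS: Dict[str, Any] = {"id": None, "user.name": "unknown", "user.country": None}


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"user": {"name": 1}} -> {"user.name": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def clean_ndjson(stream: TextIO) -> Tuple[List[Dict[str, Any]], List[int]]:
    """Stream line by line; return (normalized rows, rejected line numbers)."""
    rows: List[Dict[str, Any]] = []
    rejected: List[int] = []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line_no)
            continue
        if not isinstance(record, dict):
            rejected.append(line_no)
            continue
        flat = flatten(record)
        # Keep only the expected fields, filling defaults for missing keys.
        rows.append({field: flat.get(field, default) for field, default in DEFAULTS.items()})
    return rows, rejected


RAW = (
    '{"id": 1, "user": {"name": "Ada", "country": "UK"}}\n'
    "\n"
    '{"id": 2, "user": {}}\n'
    "{broken\n"
)
rows, rejected = clean_ndjson(io.StringIO(RAW))
print(rows, rejected)
```

Because the function iterates over the stream rather than reading it whole, the same code should work on a large file opened with `open(path, encoding="utf-8")`.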

## Unstructured formats (documents, media)

The article highlights patterns more than perfect extraction; cleaning means “extract usable structure and text,” not just pretty output. 

- PDF:  
  - Use PDF text extractors to pull text, then clean line breaks, hyphenation, headers/footers, and multi-column layouts as far as the layout allows.
  - Tables may require specialized libraries or exporting to CSV first.  

- Word docs (`.docx`):  
  - Use `python-docx` or similar to convert paragraphs and tables into structured text or tabular data; then reapply the same tabular cleaning patterns.

- HTML reports / emails:  
  - Strip styling, nav, and boilerplate; keep main content and tables; standardize date and currency formats. 

- Images, audio, video:  
  - Use format-specific libraries mostly for metadata (EXIF, duration, resolution) and to drive downstream pipelines, not deep semantic cleaning. 
  - Clean by normalizing paths, formats, and metadata consistency (e.g., required tags, codec, sample rate).  
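
The PDF text cleanup described above (line breaks, hyphenation, repeated headers/footers) can be sketched on text that has already been extracted. The sketch assumes some PDF library has produced one string per page; the function names are hypothetical and every rule is a heuristic rather than the article's exact method.

```python
import math
import re
from collections import Counter
from typing import List


def _key(line: str) -> str:
    # Mask digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line)


def strip_repeated_lines(pages: List[str], min_share: float = 0.6) -> List[List[str]]:
    """Drop lines that recur on most pages (likely headers or footers)."""
    page_lines = [[ln.strip() for ln in page.splitlines()] for page in pages]
    counts: Counter = Counter()
    for lines in page_lines:
        counts.update({_key(ln) for ln in lines if ln})  # count once per page
    # Require at least two pages so a single-page document loses nothing.
    threshold = max(2, math.ceil(min_share * len(pages)))
    repeated = {k for k, n in counts.items() if n >= threshold}
    return [[ln for ln in lines if _key(ln) not in repeated] for lines in page_lines]


def reflow(lines: List[str]) -> str:
    """Rejoin hyphenated words and soft-wrapped lines into paragraphs."""
    text = "\n".join(lines)
    # "clean-\ning" -> "cleaning". Heuristic: this also merges genuine
    # compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines are treated as wraps.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def clean_pdf_text(pages: List[str]) -> str:
    kept = strip_repeated_lines(pages)
    return reflow([ln for lines in kept for ln in lines])


pages = [
    "ACME Report\nData clean-\ning is mostly\nroutine work.\n\nSecond paragraph.\nPage 1",
    "ACME Report\nIt continues\nhere.\nPage 2",
]
print(clean_pdf_text(pages))
```

Multi-column layouts and tables are not handled here; they usually need layout-aware extraction before a text-level cleaner like this one is useful.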

## Reusable cleaning patterns

Across all formats, the article pushes a consistent pattern: loader → validator → cleaner → normalized output. 

- Loader: Minimal function to read the raw file into an in-memory representation (DataFrame, dict/list, text blob, tree, etc.).  
- Validator: Assertions or pydantic/JSON Schema/marshmallow-like checks for required columns/fields, types, ranges, and allowed values. 
- Cleaner:  
  - Standardize column/field names (snake_case, trim spaces).  
  - Convert types (str → int/float/bool/datetime).  
  - Handle missing values (drop, impute, or flag).  
  - Deduplicate rows/records.  
  - Normalize categorical values and free text casing.  
- Normalized output: A small set of canonical schemas (fact-like tables, dimension-like lookups, or document chunks) that downstream systems rely on.
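
The loader → validator → cleaner → normalized output pattern can be sketched end to end with the standard library. Everything specific here is an assumption made for illustration: the `users`-style schema (`user_id`, `signup_date`, `plan`), the allowed plans, and the policy of rejecting rows with bad types while mapping unknown categories to `"unknown"`.

```python
import csv
import io
from datetime import datetime
from typing import Dict, List, TextIO, Tuple

# Hypothetical canonical schema for a users-like table.
REQUIRED = {"user_id", "signup_date", "plan"}
ALLOWED_PLANS = {"free", "pro"}


def snake_case(name: str) -> str:
    return "_".join(name.strip().lower().split())


def load(stream: TextIO) -> Tuple[List[str], List[Dict[str, str]]]:
    """Loader: raw file -> (column names, rows), with no cleaning yet."""
    reader = csv.DictReader(stream)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def validate(columns: List[str]) -> None:
    """Validator: fail fast when required columns are missing."""
    missing = REQUIRED - {snake_case(c) for c in columns}
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")


def clean(rows: List[Dict[str, str]]) -> Tuple[List[dict], List[dict]]:
    """Cleaner: rename, convert types, normalize categories, deduplicate."""
    cleaned, rejected, seen = [], [], set()
    for raw in rows:
        row = {snake_case(k): (v or "").strip() for k, v in raw.items() if k is not None}
        try:
            user_id = int(row["user_id"])
            signup = datetime.strptime(row["signup_date"], "%Y-%m-%d").date()
        except ValueError:
            rejected.append(raw)  # keep bad rows for inspection, do not guess
            continue
        plan = row["plan"].lower()
        if plan not in ALLOWED_PLANS:
            plan = "unknown"  # flag unexpected or missing categories
        if user_id in seen:
            continue  # drop duplicate records by key
        seen.add(user_id)
        cleaned.append({"user_id": user_id, "signup_date": signup, "plan": plan})
    return cleaned, rejected


def run(stream: TextIO) -> Tuple[List[dict], List[dict]]:
    columns, rows = load(stream)
    validate(columns)
    return clean(rows)


RAW = (
    "User ID,Signup Date,Plan\n"
    "1,2024-01-05,Pro\n"
    "1,2024-01-05,Pro\n"
    "2,2024-02-30,free\n"
    "3,2024-03-01,\n"
)
users, rejected = run(io.StringIO(RAW))
print(users, len(rejected))
```

Keeping the four stages as separate functions lets each be unit-tested on its own, and only `load` needs to change when the same schema arrives in a different file format.
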