# Load raw data

Common formats: CSV, Excel, JSON, Parquet.

Minimal steps and checks:
- Detect and record input path/URL and file format.
- For very large files, stream in chunks (`chunksize`) or sample the first rows with `nrows`.
- Basic read examples (use the reader that matches the format): `pandas.read_csv`, `pandas.read_excel`, `pandas.read_json`, `pandas.read_parquet`.
- Use read parameters to improve safety and performance: `parse_dates`, `dtype`, `usecols`, `compression`. These apply to `read_csv`; other readers accept only a subset.
- After loading: inspect `df.head()`, `df.shape`, `df.info()` to validate column names and basic types.
- Record provenance: input filename, read options, row counts (raw and after any immediate filtering).
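
The loading steps above can be sketched as follows. This is a minimal illustration, not a fixed recipe: the in-memory CSV, the column names, and the `load_csv` helper are all hypothetical stand-ins for a real path or URL and your own schema.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file path or URL.
RAW_CSV = """order_id,order_date,amount,region
1,2024-01-05,10.50,north
2,2024-01-06,,south
3,2024-01-07,7.25,north
4,2024-01-08,3.00,east
"""


def load_csv(source, **read_options):
    """Hypothetical helper: read a CSV and return (DataFrame, provenance)."""
    df = pd.read_csv(source, **read_options)
    provenance = {
        # Keep the path/URL if one was given, else the type of the buffer.
        "source": source if isinstance(source, str) else type(source).__name__,
        "format": "csv",
        "read_options": {k: repr(v) for k, v in read_options.items()},
        "raw_rows": len(df),
    }
    return df, provenance


# Explicit read parameters: parse dates, pin dtypes, keep only needed columns.
df, provenance = load_csv(
    io.StringIO(RAW_CSV),
    parse_dates=["order_date"],
    dtype={"order_id": "int64", "region": "string"},
    usecols=["order_id", "order_date", "amount", "region"],
)

# Sampling: read only the first two rows.
sample = pd.read_csv(io.StringIO(RAW_CSV), nrows=2)

# Streaming: iterate over chunks of three rows instead of loading everything.
chunk_rows = [len(chunk) for chunk in pd.read_csv(io.StringIO(RAW_CSV), chunksize=3)]

# Post-load inspection.
print(df.head())
print(df.shape)
df.info()
print(provenance)
```

Returning the provenance dictionary alongside the frame keeps the read options and row count next to the data, so they can be logged or saved with any later outputs.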

# Generate data profile (types, missingness, basic stats)

Key profiling outputs to capture:
- Data types: `df.dtypes` to find numeric, datetime, categorical, and object types.
- Missingness: per-column missing counts (`df.isnull().sum()`) and percentages (`(df.isnull().mean()*100).round(2)`).
- Numeric summaries: `df.describe(include='number')` for mean, std, min/max, quartiles.
- Categorical summaries: frequency tables (`df[col].value_counts(dropna=False).head()`), unique counts (`df.nunique()`).
- Correlations: correlation matrix for numeric features (`df.select_dtypes('number').corr()`).
- Spot checks: look at random rows (`df.sample(5)`) and check string columns for inconsistent encodings or stray whitespace.
- Optional full reports: use profiling tools like `ydata_profiling` (formerly pandas-profiling) to export an HTML report for sharing.
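
As a rough sketch, the profiling outputs listed above can be gathered in one place. The sample frame and the `profile` helper below are assumptions made for illustration; only the pandas calls come from the list.

```python
import pandas as pd

# Hypothetical sample data with a missing value and a whitespace problem.
df = pd.DataFrame({
    "amount": [10.5, None, 7.25, 3.0],
    "quantity": [1, 2, 2, 5],
    "region": ["north", " south", "north", None],
})


def profile(df):
    """Hypothetical helper: collect per-column and numeric summaries."""
    columns = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_count": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    return {
        "columns": columns,
        "numeric_summary": numeric.describe(),
        "correlations": numeric.corr(),
    }


report = profile(df)
print(report["columns"])
print(report["numeric_summary"])
print(report["correlations"])

# Categorical summary, keeping missing values visible in the counts.
print(df["region"].value_counts(dropna=False).head())

# Spot check: count non-missing strings with leading or trailing whitespace.
region = df["region"].dropna()
has_whitespace = int((region != region.str.strip()).sum())
print("values with stray whitespace:", has_whitespace)
```

Keeping the per-column table as a DataFrame makes it easy to save next to the data or to compare against the same table after cleaning.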

# Clean/convert column types

Common conversions and validation steps:
- Numeric coercion: convert numeric-like columns with `pd.to_numeric(..., errors='coerce')` and inspect resulting `NaN` counts before imputing or removing.
- Datetime parsing: use `pd.to_datetime(..., errors='coerce')`; pass an explicit `format` when the layout is known, and `utc=True` when timestamps need to be timezone-aware.
- Categorical conversion: convert low-cardinality strings to `category` dtype to save memory and clarify intent.
- String normalization: trim whitespace, unify case, and replace empty strings with missing values (`.str.strip().str.lower().replace('', pd.NA)`).
- Missing-value handling: decide per-column strategy (drop, `fillna()` with a statistic or sentinel, or model-based imputation).
- Validation: after conversions, re-check `df.dtypes`, `df.isnull().sum()`, and spot-check changed values with `value_counts()` or `head()`.
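- Persisting cleaned data: prefer Parquet (`df.to_parquet`), which is compact and preserves column dtypes; use CSV (`df.to_csv`) when plain-text interoperability matters, noting that dtypes must be re-specified on reload. Record the cleaning steps applied.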
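
The conversions above might look like this on a small frame. This is a sketch under stated assumptions: the column names and values are made up, and the median fill stands in for whatever per-column strategy fits your data.

```python
import pandas as pd

# Hypothetical raw data: everything arrives as strings, some values are bad.
raw = pd.DataFrame({
    "amount": ["10.5", "n/a", "7.25", "3"],
    "order_date": ["2024-01-05", "2024-01-06", "not a date", "2024-01-08"],
    "region": [" North", "south ", "", "NORTH"],
})

clean = raw.copy()

# Numeric coercion: unparseable values become NaN instead of raising.
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Datetime parsing with an explicit format; bad values become NaT.
clean["order_date"] = pd.to_datetime(
    clean["order_date"], format="%Y-%m-%d", errors="coerce"
)

# String normalization: trim, lowercase, and treat empty strings as missing.
region = clean["region"].str.strip().str.lower().replace("", pd.NA)

# Categorical conversion for the low-cardinality column.
clean["region"] = region.astype("category")

# Inspect what coercion introduced before deciding how to handle it.
print(clean.dtypes)
print(clean.isnull().sum())
print(clean["region"].value_counts(dropna=False))

# One possible missing-value strategy (an assumption): fill with the median.
filled = clean["amount"].fillna(clean["amount"].median())
print(filled)
```

Coercing first and counting the resulting missing values separately from any fill step makes it clear how many values were lost to parsing rather than absent in the source.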