# Level 1 - Week 1 Practice (Starter Notebook)

Starter code for environment setup and basic data profiling.

## References (docs)
- Python `venv` (official): https://docs.python.org/3/library/venv.html
- pip user guide (official): https://pip.pypa.io/en/stable/user_guide/
- Python errors/exceptions (official): https://docs.python.org/3/tutorial/errors.html
- Python `logging` (official): https://docs.python.org/3/library/logging.html
- Pandas getting started: https://pandas.pydata.org/docs/getting_started/index.html
- Pandas I/O (CSV): https://pandas.pydata.org/docs/user_guide/io.html


## Setup

Run this in an environment that already has `pandas` installed.
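
Before running the cells, it can help to confirm that the interpreter can actually see `pandas`. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of this notebook.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the subset of `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Inside a venv, `sys.prefix` points at the venv and differs from `sys.base_prefix`.
in_venv = sys.prefix != sys.base_prefix
missing = missing_packages(['pandas'])

print('Inside a virtual environment:', in_venv)
print('Missing packages:', missing or 'none')
```

If `pandas` is reported missing, install it into the active environment with pip before continuing.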


In [None]:
import logging
from pathlib import Path

import pandas as pd


In [None]:
logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')
logger = logging.getLogger('week1')

OUTPUT_DIR = Path('output')
OUTPUT_DIR.mkdir(exist_ok=True)
OUTPUT_DIR


## Create or load a CSV

To profile your own dataset, replace the sample generation below with `pd.read_csv(...)`.


In [None]:
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [23, None, 31, 45, 29],
    'country': ['US', 'US', 'SG', None, 'CN'],
    'purchase_amount': [12.5, 0.0, 7.99, 103.2, None],
})
csv_path = OUTPUT_DIR / 'sample.csv'
df.to_csv(csv_path, index=False)
logger.info('Wrote sample CSV to: %s', csv_path)
df
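
For a real dataset, the sample generation above could be swapped for a small loader that fails early with a clear message. This is a sketch under assumptions: `load_csv` is a hypothetical helper, and treating a header-only file as an error is a design choice rather than something this notebook requires.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Load a CSV into a DataFrame, raising early if the file is missing or has no rows."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f'CSV not found: {path}')
    df = pd.read_csv(path)
    if df.empty:
        # A header-only file parses fine but leaves nothing to profile.
        raise ValueError(f'CSV has no rows: {path}')
    return df
```

Calling `load_csv(csv_path)` in place of the `pd.DataFrame({...})` literal leaves the rest of the notebook unchanged.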


## Basic profiling

Starter checks you can reuse in `data_profile.py`.


In [None]:
df = pd.read_csv(csv_path)
logger.info('Read CSV: rows=%s cols=%s', len(df), len(df.columns))

profile = {
    'n_rows': int(df.shape[0]),
    'n_cols': int(df.shape[1]),
    'dtypes': {k: str(v) for k, v in df.dtypes.items()},
    # Cast counts to plain ints so the dict stays JSON-serializable.
    'missing_by_col': {k: int(v) for k, v in df.isna().sum().items()},
    'duplicate_rows': int(df.duplicated().sum()),
}
profile


In [None]:
numeric_stats = df.select_dtypes(include='number').describe().to_dict()
numeric_stats
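
To reuse these checks in `data_profile.py`, one option is to wrap them in a function that takes any DataFrame. The sketch below mirrors the `profile` dict above; the function name `profile_dataframe` and the tiny example frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def profile_dataframe(df):
    """Collect basic shape, dtype, missing-value, and duplicate checks in one dict."""
    return {
        'n_rows': int(df.shape[0]),
        'n_cols': int(df.shape[1]),
        'dtypes': {col: str(dtype) for col, dtype in df.dtypes.items()},
        # Plain ints keep the result JSON-serializable.
        'missing_by_col': {col: int(n) for col, n in df.isna().sum().items()},
        'duplicate_rows': int(df.duplicated().sum()),
    }


# Tiny illustrative frame (not the notebook's sample data).
example = pd.DataFrame({'x': [1, 1, None], 'y': ['a', 'a', 'b']})
example_profile = profile_dataframe(example)
print(example_profile['missing_by_col'])
print(example_profile['duplicate_rows'])
```

Keeping the function free of file I/O and logging makes it easy to call from both the notebook and a script.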


## Export outputs (JSON + Markdown)

Export artifacts to `output/` so results can be inspected and compared across runs.


In [None]:
import json

out_json = OUTPUT_DIR / 'profile.json'
out_md = OUTPUT_DIR / 'profile.md'

with out_json.open('w', encoding='utf-8') as f:
    json.dump({'profile': profile, 'numeric_stats': numeric_stats}, f, ensure_ascii=False, indent=2)

md_lines = [
    '# Data Profile',
    '',
    # List items keep rows and cols on separate lines when the Markdown is rendered.
    '- Rows: {}'.format(profile['n_rows']),
    '- Cols: {}'.format(profile['n_cols']),
    '',
    '## Missing Values',
    '',
]
for k, v in profile['missing_by_col'].items():
    md_lines.append('- **{}**: {}'.format(k, v))

out_md.write_text('\n'.join(md_lines), encoding='utf-8')

logger.info('Wrote: %s', out_json)
logger.info('Wrote: %s', out_md)
str(out_json), str(out_md)
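
A quick way to confirm the export worked is to read the JSON back and check its top-level keys. This is a minimal sketch assuming the layout written above (`profile` and `numeric_stats`); `load_profile` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_profile(json_path):
    """Read a profile JSON file and verify the expected top-level keys exist."""
    data = json.loads(Path(json_path).read_text(encoding='utf-8'))
    for key in ('profile', 'numeric_stats'):
        if key not in data:
            raise KeyError(f'Missing key in profile JSON: {key}')
    return data
```

For example, `load_profile(out_json)['profile']['n_rows']` should match `len(df)` from the profiling step.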
