Automated EDA, model training and reporting for car price data — a Python CLI plus a Streamlit dashboard that runs a reproducible analysis pipeline and produces human‑readable reports and charts.
AI Car Price Data Analyst is an end‑to‑end, portfolio‑focused project that demonstrates practical data workflows:
- Ingest CSVs, validate and clean data.
- Run automated EDA and generate visualizations.
- Train baseline and improved regression models.
- Produce a plain‑text report (CLI) and an interactive Streamlit UI for exploration and report download.
This repository emphasizes reproducible pipeline design, testing, and clear reporting rather than maximizing predictive performance on the included synthetic sample.
- CSV ingestion and validation (default: `data/example.csv`).
- Automated EDA: types, head, row/column counts, missing value rates, suspicious columns (see the sketch after this list).
- Price‑specific analysis: correlations, qualitative strength labels, short interpretations.
- ML workflows:
  - Model v1 — LinearRegression baseline on numeric features.
  - Model v2 — OneHotEncoded categoricals + RandomForest / GradientBoosting comparison; selects best by R²; reports feature importances.
- Visualizations: Matplotlib / Seaborn static plots + interactive ECharts in Streamlit.
- Streamlit UI with three tabs: Home, Analysis, Interactive Explorer.
- Unit tests (pytest) and a CI-friendly layout.
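For illustration, here is a minimal sketch of the missingness and suspicious-column diagnostics, assuming plain pandas; the helper name and return format are hypothetical, not the repository's actual API:

```python
# Hypothetical EDA helper: missing-value rates plus "suspicious" columns
# (a single unique value), as described in the feature list above.
import pandas as pd

def basic_diagnostics(df: pd.DataFrame) -> dict:
    """Return basic shape info, missing-value rates, and constant columns."""
    missing_rates = df.isna().mean().round(4)  # per-column fraction of NaNs
    suspicious = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "missing_rates": missing_rates.to_dict(),
        "suspicious_columns": suspicious,
    }

if __name__ == "__main__":
    print(basic_diagnostics(pd.read_csv("data/example.csv")))
```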
### CLI

```bash
python src/main.py
# Generates report.txt (project root) and PNG files under plots/
```

### Streamlit

```bash
streamlit run src/app.py
# Open http://localhost:8501
```

Recommended: create an isolated environment (conda/mamba or venv).
### Conda / mamba (recommended for reproducibility)

```bash
mamba env create -f environment.yml
conda activate ai-data
```

### Pip (lightweight)

```bash
pip install -r requirements.txt
```
### CLI

```bash
# place CSV in data/ (or rely on the script's discovery)
python src/main.py
```

### Streamlit
- Open the app, upload a CSV or use the built‑in sample dataset.
- Click "Run analysis" on the Analysis tab and download the full report.
```text
ai-data-analyst/
├─ README.md
├─ data/                  # input CSVs (e.g. example.csv from Kaggle cars-pre)
├─ plots/                 # generated PNG charts
├─ reports/               # (optional) saved reports
├─ report.txt             # sample CLI output
├─ src/
│  ├─ data_loading.py     # CSV loading & validation
│  ├─ eda.py              # EDA functions, diagnostics, and plotting
│  ├─ modeling.py         # model training, evaluation, selection
│  ├─ reporting.py        # report composition & save utility
│  ├─ main.py             # CLI pipeline orchestrator
│  └─ app.py              # Streamlit UI (Home, Analysis, Interactive)
└─ tests/
   └─ test_main.py        # pytest unit tests
```
- `src/data_loading.py`
  - Locate and read CSV files, basic validation, and path utilities.
- `src/eda.py`
  - Basic overview, column analysis, missingness, correlations, static plot generation.
- `src/modeling.py`
  - Baseline numeric model (LinearRegression), advanced models (RandomForest, GradientBoosting) with ColumnTransformer preprocessing.
- `src/reporting.py`
  - Format and save the full text report (`report.txt`).
- `src/main.py`
  - Orchestrates the CLI run: find file → load → run pipeline → save report (a minimal sketch follows this list).
- `src/app.py`
  - Streamlit UI with: Home (upload), Analysis (run pipeline + display), Interactive Explorer (ECharts).
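Here is a minimal sketch of how the CLI orchestration in `src/main.py` might be wired together; the imported function names are illustrative assumptions, not the modules' confirmed APIs:

```python
# Hypothetical wiring of the CLI pipeline: find file -> load -> EDA ->
# modeling -> save report. Function names are assumptions for illustration.
from data_loading import find_csv, load_csv
from eda import run_eda
from modeling import train_models
from reporting import save_report

def main() -> None:
    path = find_csv("data")              # locate an input CSV
    df = load_csv(path)                  # read + validate
    eda_summary = run_eda(df)            # EDA text + PNGs under plots/
    model_summary = train_models(df)     # baseline and improved models
    save_report(eda_summary + model_summary, "report.txt")

if __name__ == "__main__":
    main()
```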
The Streamlit app includes an Interactive Explorer powered by ECharts (via `streamlit-echarts`):

- Filters:
  - Brand (multi-select)
  - Year (slider)
  - Engine Size (slider)
  - Price (range slider)
- Chart types:
  - Scatter: Mileage vs Price (by Brand)
  - Bar: Average Price by Brand
- Purpose: quick, client-ready visual exploration with zoom and save-as-image features.
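A condensed sketch of one Explorer chart (the Mileage vs Price scatter) using `st_echarts`; the option dictionary follows the standard ECharts schema, and the column names match the sample dataset:

```python
# Brand filter + ECharts scatter of Mileage vs Price, with the toolbox
# providing the zoom and save-as-image features mentioned above.
import pandas as pd
import streamlit as st
from streamlit_echarts import st_echarts

df = pd.read_csv("data/example.csv")
brands = st.multiselect("Brand", sorted(df["Brand"].unique()))
view = df[df["Brand"].isin(brands)] if brands else df

options = {
    "xAxis": {"type": "value", "name": "Mileage"},
    "yAxis": {"type": "value", "name": "Price"},
    "tooltip": {"trigger": "item"},
    "toolbox": {"feature": {"dataZoom": {}, "saveAsImage": {}}},
    "series": [{"type": "scatter", "data": view[["Mileage", "Price"]].values.tolist()}],
}
st_echarts(options=options, height="400px")
```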
- Basic overview
  - Number of rows: 2,500
  - Number of columns: 10
  - First 5 rows (printed)
- Column analysis
  - Numeric columns: Car ID, Year, Engine Size, Mileage, Price
  - Categorical columns: Brand, Fuel Type, Transmission, Condition, Model
  - Missing values: per-column percentages
  - Suspicious columns: columns with a single unique value
- Price relationships
  - Correlations with Price and qualitative strength labels (e.g., "weak negative correlation"); see the labeling sketch after this list
- Modeling summary
  - Model v1: LinearRegression — R², MAE, short interpretation
  - Model v2: RandomForest & GradientBoosting — R²/MAE for each, best model highlighted, top feature importances
- Visualizations
  - List of saved PNG files in `plots/`
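One plausible way to derive those qualitative strength labels from a correlation coefficient; the thresholds below are illustrative, not the repository's exact cut-offs:

```python
# Map a Pearson correlation coefficient to a human-readable label.
# Threshold values are assumptions chosen for illustration.
def correlation_label(r: float) -> str:
    direction = "positive" if r >= 0 else "negative"
    magnitude = abs(r)
    if magnitude < 0.3:
        strength = "weak"
    elif magnitude < 0.6:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction} correlation"

print(correlation_label(-0.12))  # -> "weak negative correlation"
```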
- Model v1 (baseline)
  - Numeric features only (ID-like columns removed)
  - Simple LinearRegression, 80/20 train/test split
  - Metrics: R², MAE
- Model v2 (improved; sketched after this list)
  - ColumnTransformer: numeric passthrough + OneHotEncoder for categoricals
  - Trains RandomForestRegressor and GradientBoostingRegressor
  - Selects best model by test R²; reports per-model metrics and feature importances (tree models)
- Evaluation
  - Interpret metrics in the report; provide short human‑readable guidance on model quality.
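A condensed, self-contained sketch of the Model v2 recipe (numeric passthrough plus OneHotEncoder, two tree models, best chosen by test R²); the column names mirror the sample dataset, and the hyperparameters are defaults, not the repository's exact settings:

```python
# Model v2 sketch: ColumnTransformer preprocessing, two tree regressors,
# metrics per model, and selection of the best model by test R².
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data/example.csv")
numeric = ["Year", "Engine Size", "Mileage"]
categorical = ["Brand", "Fuel Type", "Transmission", "Condition", "Model"]
X, y = df[numeric + categorical], df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("num", "passthrough", numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

results = {}
for name, model in [("RandomForest", RandomForestRegressor(random_state=42)),
                    ("GradientBoosting", GradientBoostingRegressor(random_state=42))]:
    pipe = Pipeline([("prep", preprocess), ("model", model)]).fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {"R2": r2_score(y_test, pred), "MAE": mean_absolute_error(y_test, pred)}

best = max(results, key=lambda k: results[k]["R2"])  # best model by test R²
print(results, "-> best:", best)
```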
- Static: histograms, correlation heatmap, bar plots for brands/models (Matplotlib / Seaborn); see the sketch after this list.
- Interactive: ECharts charts embedded in Streamlit for exploration.
- Saved to `plots/` as PNG for inclusion in reports or portfolio screenshots.
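A minimal sketch of the static chart generation, assuming standard Matplotlib/Seaborn calls; the output file names are illustrative:

```python
# Save a price histogram and a correlation heatmap as PNGs under plots/.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so this also works in CLI runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/example.csv")
Path("plots").mkdir(exist_ok=True)

sns.histplot(df["Price"], bins=30)
plt.savefig("plots/price_hist.png", dpi=150, bbox_inches="tight")
plt.close()

sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.savefig("plots/correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.close()
```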
- Unit tests: `tests/test_main.py` (pytest).
- CI: GitHub Actions workflow runs tests on push/PR (see `.github/workflows/ci.yml`).
- Run locally:

  ```bash
  pip install pytest
  pytest
  ```
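An illustrative unit test in the spirit of `tests/test_main.py`; it targets the hypothetical `basic_diagnostics` helper sketched earlier, so the import path is an assumption:

```python
# Check that constant columns are flagged and shape info is reported.
import pandas as pd
from eda import basic_diagnostics  # hypothetical helper from the EDA sketch

def test_diagnostics_flags_constant_column():
    df = pd.DataFrame({"Price": [1.0, 2.0, None], "Constant": ["x", "x", "x"]})
    result = basic_diagnostics(df)
    assert result["n_rows"] == 3
    assert "Constant" in result["suspicious_columns"]
```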
- The sample dataset is derived from a Kaggle cars dataset (cars‑pre).
- It contains a weak/noisy linear signal for `Price` — low or negative R² values are expected with simple baselines.
- Purpose: demonstrate a robust, reproducible pipeline (EDA → preprocessing → modeling → reporting) rather than to deliver high predictive accuracy on this specific synthetic sample.
- For serious modeling, provide higher‑quality data, stronger feature engineering, and cross‑validation/hyperparameter tuning.
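As a pointer in that direction, a hedged sketch of swapping the single 80/20 split for 5-fold cross-validation on a Model v2-style pipeline:

```python
# Cross-validated R² for the OneHotEncoder + RandomForest pipeline;
# a fold-averaged score is more robust than one train/test split.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data/example.csv")
numeric = ["Year", "Engine Size", "Mileage"]
categorical = ["Brand", "Fuel Type", "Transmission", "Condition", "Model"]
pipe = Pipeline([
    ("prep", ColumnTransformer([("num", "passthrough", numeric),
                                ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)])),
    ("model", RandomForestRegressor(random_state=42)),
])
scores = cross_val_score(pipe, df[numeric + categorical], df["Price"], cv=5, scoring="r2")
print(f"R² per fold: {scores.round(3)}  mean: {scores.mean():.3f}")
```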
Tested with these versions (pinned in `requirements.txt` / `environment.yml`):
- Python: 3.10.12
- pandas: 2.2.3
- numpy: 2.1.3
- matplotlib: 3.9.2
- seaborn: 0.13.2
- scikit-learn: 1.7.0
- scipy: 1.15.3
- streamlit: 1.51.0
- streamlit-echarts: 0.4.0
- pytest: 8.4.0
Dependency files:

- `environment.yml` (conda/mamba): reproducible environment with pinned binaries.

  ```bash
  mamba env create -f environment.yml
  conda activate ai-data
  ```

- `requirements.txt` (pip): lightweight installs for venv/virtualenv.

  ```bash
  pip install -r requirements.txt
  ```

Why both:

- Use `environment.yml` for exact, binary‑compatible environments (recommended for reproducibility).
- Use `requirements.txt` for simple pip installs or when deploying to pip‑based environments.
### Short term
- Improve UX in Streamlit (progress indicators, better layout).
- Add imputation, outlier detection, and a dedicated preprocessing class.
- Save richer report formats (Markdown / HTML) with embedded images.
### Medium term
- Add model explainability (SHAP) and feature pipelines.
- Add cross‑validation and hyperparameter tuning.
### Long term
- Containerize (Docker) and deploy Streamlit (Streamlit Cloud / cloud provider).
- Convert pipeline into a reusable library / CLI with better configuration flags.
- Data ingestion & validation
- Exploratory Data Analysis (EDA)
- Feature handling & basic preprocessing
- Model training, selection & evaluation
- Visualization (static + interactive)
- Unit testing and CI practices
- Simple interactive web UI (Streamlit)