Automated EDA, model training and reporting for car price data — a Python CLI plus a Streamlit dashboard that runs a reproducible analysis pipeline and produces human‑readable reports and charts.
AI Car Price Data Analyst is an end‑to‑end, portfolio‑focused project that demonstrates practical data workflows:
- Ingest CSVs, validate and clean data.
- Run automated EDA and generate visualizations.
- Train baseline and improved regression models.
- Produce a plain‑text report (CLI) and an interactive Streamlit UI for exploration and report download.
This repository emphasizes reproducible pipeline design, testing, and clear reporting rather than maximizing predictive performance on the included synthetic sample.
- CSV ingestion and validation (default: `data/example.csv`).
- Automated EDA: types, head, row/column counts, missing value rates, suspicious columns (see the sketch after this list).
- Price‑specific analysis: correlations, qualitative strength labels, short interpretations.
- ML workflows:
  - Model v1 — LinearRegression baseline on numeric features.
  - Model v2 — OneHotEncoded categoricals + RandomForest / GradientBoosting comparison; selects best by R²; reports feature importances.
- Visualizations: Matplotlib / Seaborn static plots + interactive ECharts in Streamlit.
- Streamlit UI with three tabs: Home, Analysis, Interactive Explorer.
- Unit tests (pytest) and a CI-friendly layout.
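For illustration, here is a minimal sketch of the missingness and suspicious-column diagnostics, assuming plain pandas; the helper name and return format are hypothetical, not the repository's actual API:

```python
# Hypothetical EDA helper: missing-value rates plus "suspicious" columns
# (a single unique value), as described in the feature list above.
import pandas as pd

def basic_diagnostics(df: pd.DataFrame) -> dict:
    """Return basic shape info, missing-value rates, and constant columns."""
    missing_rates = df.isna().mean().round(4)  # per-column fraction of NaNs
    suspicious = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "missing_rates": missing_rates.to_dict(),
        "suspicious_columns": suspicious,
    }

if __name__ == "__main__":
    print(basic_diagnostics(pd.read_csv("data/example.csv")))
```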
### CLI

```bash
python src/main.py
# Generates report.txt (project root) and PNG files under plots/
```

### Streamlit

```bash
streamlit run src/app.py
# Open http://localhost:8501
```

Recommended: create an isolated environment (conda/mamba or venv).
### Conda / mamba (recommended for reproducibility)

```bash
mamba env create -f environment.yml
conda activate ai-data
```

### Pip (lightweight)

```bash
pip install -r requirements.txt
```
### CLI

```bash
# place CSV in data/ (or rely on the script's discovery)
python src/main.py
```

### Streamlit
- Open the app, upload a CSV or use the built‑in sample dataset.
- Click "Run analysis" on the Analysis tab and download the full report.
```text
ai-data-analyst/
├─ README.md
├─ data/                  # input CSVs (e.g. example.csv from Kaggle cars-pre)
├─ plots/                 # generated PNG charts
├─ reports/               # (optional) saved reports
├─ report.txt             # sample CLI output
├─ src/
│  ├─ data_loading.py     # CSV loading & validation
│  ├─ eda.py              # EDA functions, diagnostics, and plotting
│  ├─ modeling.py         # model training, evaluation, selection
│  ├─ reporting.py        # report composition & save utility
│  ├─ main.py             # CLI pipeline orchestrator
│  └─ app.py              # Streamlit UI (Home, Analysis, Interactive)
└─ tests/
   └─ test_main.py        # pytest unit tests
```
- `src/data_loading.py`
  - Locate and read CSV files, basic validation, and path utilities.
- `src/eda.py`
  - Basic overview, column analysis, missingness, correlations, static plot generation.
- `src/modeling.py`
  - Baseline numeric model (LinearRegression), advanced models (RandomForest, GradientBoosting) with ColumnTransformer preprocessing.
- `src/reporting.py`
  - Format and save the full text report (`report.txt`).
- `src/main.py`
  - Orchestrates the CLI run: find file → load → run pipeline → save report (a minimal sketch follows this list).
- `src/app.py`
  - Streamlit UI with: Home (upload), Analysis (run pipeline + display), Interactive Explorer (ECharts).
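Here is a minimal sketch of how the CLI orchestration in `src/main.py` might be wired together; the imported function names are illustrative assumptions, not the modules' confirmed APIs:

```python
# Hypothetical wiring of the CLI pipeline: find file -> load -> EDA ->
# modeling -> save report. Function names are assumptions for illustration.
from data_loading import find_csv, load_csv
from eda import run_eda
from modeling import train_models
from reporting import save_report

def main() -> None:
    path = find_csv("data")              # locate an input CSV
    df = load_csv(path)                  # read + validate
    eda_summary = run_eda(df)            # EDA text + PNGs under plots/
    model_summary = train_models(df)     # baseline and improved models
    save_report(eda_summary + model_summary, "report.txt")

if __name__ == "__main__":
    main()
```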
The Streamlit app includes an Interactive Explorer powered by ECharts (via `streamlit-echarts`):

- Filters:
  - Brand (multi-select)
  - Year (slider)
  - Engine Size (slider)
  - Price (range slider)
- Chart types:
  - Scatter: Mileage vs Price (by Brand)
  - Bar: Average Price by Brand
- Purpose: quick, client-ready visual exploration with zoom and save-as-image features.
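A condensed sketch of one Explorer chart (the Mileage vs Price scatter) using `st_echarts`; the option dictionary follows the standard ECharts schema, and the column names match the sample dataset:

```python
# Brand filter + ECharts scatter of Mileage vs Price, with the toolbox
# providing the zoom and save-as-image features mentioned above.
import pandas as pd
import streamlit as st
from streamlit_echarts import st_echarts

df = pd.read_csv("data/example.csv")
brands = st.multiselect("Brand", sorted(df["Brand"].unique()))
view = df[df["Brand"].isin(brands)] if brands else df

options = {
    "xAxis": {"type": "value", "name": "Mileage"},
    "yAxis": {"type": "value", "name": "Price"},
    "tooltip": {"trigger": "item"},
    "toolbox": {"feature": {"dataZoom": {}, "saveAsImage": {}}},
    "series": [{"type": "scatter", "data": view[["Mileage", "Price"]].values.tolist()}],
}
st_echarts(options=options, height="400px")
```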
- Basic overview
  - Number of rows: 2,500
  - Number of columns: 10
  - First 5 rows (printed)
- Column analysis
  - Numeric columns: Car ID, Year, Engine Size, Mileage, Price
  - Categorical columns: Brand, Fuel Type, Transmission, Condition, Model
  - Missing values: per-column percentages
  - Suspicious columns: columns with a single unique value
- Price relationships
  - Correlations with Price and qualitative strength labels (e.g., "weak negative correlation"); see the labeling sketch after this list
- Modeling summary
  - Model v1: LinearRegression — R², MAE, short interpretation
  - Model v2: RandomForest & GradientBoosting — R²/MAE for each, best model highlighted, top feature importances
- Visualizations
  - List of saved PNG files in `plots/`
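One plausible way to derive those qualitative strength labels from a correlation coefficient; the thresholds below are illustrative, not the repository's exact cut-offs:

```python
# Map a Pearson correlation coefficient to a human-readable label.
# Threshold values are assumptions chosen for illustration.
def correlation_label(r: float) -> str:
    direction = "positive" if r >= 0 else "negative"
    magnitude = abs(r)
    if magnitude < 0.3:
        strength = "weak"
    elif magnitude < 0.6:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction} correlation"

print(correlation_label(-0.12))  # -> "weak negative correlation"
```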
- Model v1 (baseline)
  - Numeric features only (ID-like columns removed)
  - Simple LinearRegression, 80/20 train/test split
  - Metrics: R², MAE
- Model v2 (improved; sketched after this list)
  - ColumnTransformer: numeric passthrough + OneHotEncoder for categoricals
  - Trains RandomForestRegressor and GradientBoostingRegressor
  - Selects best model by test R²; reports per-model metrics and feature importances (tree models)
- Evaluation
  - Interpret metrics in the report; provide short human‑readable guidance on model quality.
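A condensed, self-contained sketch of the Model v2 recipe (numeric passthrough plus OneHotEncoder, two tree models, best chosen by test R²); the column names mirror the sample dataset, and the hyperparameters are defaults, not the repository's exact settings:

```python
# Model v2 sketch: ColumnTransformer preprocessing, two tree regressors,
# metrics per model, and selection of the best model by test R².
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data/example.csv")
numeric = ["Year", "Engine Size", "Mileage"]
categorical = ["Brand", "Fuel Type", "Transmission", "Condition", "Model"]
X, y = df[numeric + categorical], df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("num", "passthrough", numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

results = {}
for name, model in [("RandomForest", RandomForestRegressor(random_state=42)),
                    ("GradientBoosting", GradientBoostingRegressor(random_state=42))]:
    pipe = Pipeline([("prep", preprocess), ("model", model)]).fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {"R2": r2_score(y_test, pred), "MAE": mean_absolute_error(y_test, pred)}

best = max(results, key=lambda k: results[k]["R2"])  # best model by test R²
print(results, "-> best:", best)
```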
- Static: histograms, correlation heatmap, bar plots for brands/models (Matplotlib / Seaborn); see the sketch after this list.
- Interactive: ECharts charts embedded in Streamlit for exploration.
- Saved to `plots/` as PNG for inclusion in reports or portfolio screenshots.
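A minimal sketch of the static chart generation, assuming standard Matplotlib/Seaborn calls; the output file names are illustrative:

```python
# Save a price histogram and a correlation heatmap as PNGs under plots/.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so this also works in CLI runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/example.csv")
Path("plots").mkdir(exist_ok=True)

sns.histplot(df["Price"], bins=30)
plt.savefig("plots/price_hist.png", dpi=150, bbox_inches="tight")
plt.close()

sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.savefig("plots/correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.close()
```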
- Unit tests: `tests/test_main.py` (pytest).
- CI: GitHub Actions workflow runs tests on push/PR (see `.github/workflows/ci.yml`).
- Run locally:

  ```bash
  pip install pytest
  pytest
  ```
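An illustrative unit test in the spirit of `tests/test_main.py`; it targets the hypothetical `basic_diagnostics` helper sketched earlier, so the import path is an assumption:

```python
# Check that constant columns are flagged and shape info is reported.
import pandas as pd
from eda import basic_diagnostics  # hypothetical helper from the EDA sketch

def test_diagnostics_flags_constant_column():
    df = pd.DataFrame({"Price": [1.0, 2.0, None], "Constant": ["x", "x", "x"]})
    result = basic_diagnostics(df)
    assert result["n_rows"] == 3
    assert "Constant" in result["suspicious_columns"]
```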
- The sample dataset is derived from a Kaggle cars dataset (cars‑pre).
- It contains a weak/noisy linear signal for `Price` — low or negative R² values are expected with simple baselines.
- Purpose: demonstrate a robust, reproducible pipeline (EDA → preprocessing → modeling → reporting) rather than to deliver high predictive accuracy on this specific synthetic sample.
- For serious modeling, provide higher‑quality data, stronger feature engineering, and cross‑validation/hyperparameter tuning.
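As a pointer in that direction, a hedged sketch of swapping the single 80/20 split for 5-fold cross-validation on a Model v2-style pipeline:

```python
# Cross-validated R² for the OneHotEncoder + RandomForest pipeline;
# a fold-averaged score is more robust than one train/test split.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data/example.csv")
numeric = ["Year", "Engine Size", "Mileage"]
categorical = ["Brand", "Fuel Type", "Transmission", "Condition", "Model"]
pipe = Pipeline([
    ("prep", ColumnTransformer([("num", "passthrough", numeric),
                                ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)])),
    ("model", RandomForestRegressor(random_state=42)),
])
scores = cross_val_score(pipe, df[numeric + categorical], df["Price"], cv=5, scoring="r2")
print(f"R² per fold: {scores.round(3)}  mean: {scores.mean():.3f}")
```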
Tested with these versions (pinned in `requirements.txt` / `environment.yml`):
- Python: 3.10.12
- pandas: 2.2.3
- numpy: 2.1.3
- matplotlib: 3.9.2
- seaborn: 0.13.2
- scikit-learn: 1.7.0
- scipy: 1.15.3
- streamlit: 1.51.0
- streamlit-echarts: 0.4.0
- pytest: 8.4.0
Dependency files:

- `environment.yml` (conda/mamba): reproducible environment with pinned binaries.

  ```bash
  mamba env create -f environment.yml
  conda activate ai-data
  ```

- `requirements.txt` (pip): lightweight installs for venv/virtualenv.

  ```bash
  pip install -r requirements.txt
  ```

Why both:

- Use `environment.yml` for exact, binary‑compatible environments (recommended for reproducibility).
- Use `requirements.txt` for simple pip installs or when deploying to pip‑based environments.
### Short term
- Improve UX in Streamlit (progress indicators, better layout).
- Add imputation, outlier detection, and a dedicated preprocessing class.
- Save richer report formats (Markdown / HTML) with embedded images.
### Medium term
- Add model explainability (SHAP) and feature pipelines.
- Add cross‑validation and hyperparameter tuning.
### Long term
- Containerize (Docker) and deploy Streamlit (Streamlit Cloud / cloud provider).
- Convert pipeline into a reusable library / CLI with better configuration flags.
- Data ingestion & validation
- Exploratory Data Analysis (EDA)
- Feature handling & basic preprocessing
- Model training, selection & evaluation
- Visualization (static + interactive)
- Unit testing and CI practices
- Simple interactive web UI (Streamlit)