AI Car Price Data Analyst

Automated EDA, model training and reporting for car price data — a Python CLI plus a Streamlit dashboard that runs a reproducible analysis pipeline and produces human‑readable reports and charts.

Overview

AI Car Price Data Analyst is an end‑to‑end, portfolio‑focused project that demonstrates practical data workflows:

  • Ingest CSV files, then validate and clean the data.
  • Run automated EDA and generate visualizations.
  • Train baseline and improved regression models.
  • Produce a plain‑text report (CLI) and an interactive Streamlit UI for exploration and report download.

This repository emphasizes reproducible pipeline design, testing, and clear reporting rather than maximizing predictive performance on the included synthetic sample.

Key features

  • CSV ingestion and validation (default: data/example.csv).
  • Automated EDA: types, head, row/column counts, missing-value rates, suspicious columns (see the sketch after this list).
  • Price‑specific analysis: correlations, qualitative strength labels, short interpretations.
  • ML workflows:
    • Model v1 — LinearRegression baseline on numeric features.
    • Model v2 — OneHotEncoded categoricals + RandomForest / GradientBoosting comparison; selects best by R²; reports feature importances.
  • Visualizations: Matplotlib / Seaborn static plots + interactive ECharts in Streamlit.
  • Streamlit UI with three tabs: Home, Analysis, Interactive Explorer.
  • Unit tests (pytest) and a CI-friendly layout.
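
A minimal sketch of the automated EDA step, assuming a pandas DataFrame; the function name and return format here are illustrative, not the repository's actual API:

import pandas as pd

def eda_summary(df: pd.DataFrame) -> dict:
    # Illustrative EDA summary: shape, dtypes, missing rates, suspicious columns.
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Share of missing values per column, as a percentage.
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        # "Suspicious" columns: a single unique value carries no signal.
        "suspicious": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
    }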

Demo

CLI

python src/main.py
# Generates report.txt (project root) and PNG files under plots/

Streamlit

streamlit run src/app.py
# Open http://localhost:8501

Installation

Recommended: create an isolated environment (conda/mamba or venv).

Conda / mamba (recommended for reproducibility)

mamba env create -f environment.yml
conda activate ai-data

Pip (lightweight)

pip install -r requirements.txt

Quick usage

CLI

# place CSV in data/ (or rely on the script's discovery)
python src/main.py

Streamlit

  • Open the app, upload a CSV or use the built‑in sample dataset.
  • Click "Run analysis" on the Analysis tab and download the full report.

Project structure

ai-data-analyst/
├─ README.md
├─ data/                 # input CSVs (e.g. example.csv from Kaggle cars-pre)
├─ plots/                # generated PNG charts
├─ reports/              # (optional) saved reports
├─ report.txt            # sample CLI output
├─ src/
│  ├─ data_loading.py    # CSV loading & validation
│  ├─ eda.py             # EDA functions, diagnostics, and plotting
│  ├─ modeling.py        # model training, evaluation, selection
│  ├─ reporting.py       # report composition & save utility
│  ├─ main.py            # CLI pipeline orchestrator
│  └─ app.py             # Streamlit UI (Home, Analysis, Interactive)
└─ tests/
   └─ test_main.py       # pytest unit tests

Architecture & file responsibilities

  • src/data_loading.py
    • Locate and read CSV files, basic validation, and path utilities.
  • src/eda.py
    • Basic overview, column analysis, missingness, correlations, static plot generation.
  • src/modeling.py
    • Baseline numeric model (LinearRegression), advanced models (RandomForest, GradientBoosting) with ColumnTransformer preprocessing.
  • src/reporting.py
    • Format and save the full text report (report.txt).
  • src/main.py
    • Orchestrates the CLI run: find file → load → run pipeline → save report (sketched after this list).
  • src/app.py
    • Streamlit UI with: Home (upload), Analysis (run pipeline + display), Interactive Explorer (ECharts).
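
A condensed sketch of how src/main.py could wire these pieces together; the EDA and modeling steps are left as comments because the actual function names in eda.py and modeling.py are not reproduced here:

from pathlib import Path
import pandas as pd

def run(csv_path: str = "data/example.csv") -> None:
    # 1. Locate and load the input file (data_loading.py).
    path = Path(csv_path)
    if not path.exists():
        raise FileNotFoundError(f"No input CSV found at {path}")
    df = pd.read_csv(path)

    # 2. Run EDA and modeling (eda.py, modeling.py); placeholders here.
    # eda_results = run_eda(df)
    # model_results = train_models(df)

    # 3. Compose and save the plain-text report (reporting.py).
    report = f"Rows: {len(df)}\nColumns: {df.shape[1]}\n"
    Path("report.txt").write_text(report, encoding="utf-8")

if __name__ == "__main__":
    run()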

Interactive Explorer (Streamlit)

The Streamlit app includes an Interactive Explorer powered by ECharts (via streamlit-echarts):

  • Filters:
    • Brand (multi-select)
    • Year (slider)
    • Engine Size (slider)
    • Price (range slider)
  • Chart types:
    • Scatter: Mileage vs Price (by Brand)
    • Bar: Average Price by Brand
  • Purpose: quick, client-ready visual exploration with zoom and save-as-image features.
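
A minimal sketch of the Mileage-vs-Price scatter using streamlit-echarts; the option structure is standard ECharts, but the exact options used in src/app.py may differ:

import pandas as pd
from streamlit_echarts import st_echarts

df = pd.read_csv("data/example.csv")
option = {
    "xAxis": {"type": "value", "name": "Mileage"},
    "yAxis": {"type": "value", "name": "Price"},
    "tooltip": {"trigger": "item"},
    # dataZoom provides the zoom behaviour; toolbox adds save-as-image.
    "dataZoom": [{"type": "inside"}, {"type": "slider"}],
    "toolbox": {"feature": {"saveAsImage": {}}},
    "series": [{"type": "scatter", "data": df[["Mileage", "Price"]].values.tolist()}],
}
st_echarts(options=option, height="400px")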

Example output (CLI report - excerpts)

  • Basic overview
    • Number of rows: 2,500
    • Number of columns: 10
    • First 5 rows: (printed)
  • Column analysis
    • Numeric columns: Car ID, Year, Engine Size, Mileage, Price
    • Categorical columns: Brand, Fuel Type, Transmission, Condition, Model
    • Missing values: per-column percentages
    • Suspicious columns: columns with a single unique value
  • Price relationships
    • Correlations with Price and qualitative strength (e.g., "weak negative correlation")
  • Modeling summary
    • Model v1: LinearRegression — R², MAE, short interpretation
    • Model v2: RandomForest & GradientBoosting — R²/MAE for each, best model highlighted, top feature importances
  • Visualizations
    • List of saved PNG files in plots/

Models & methodology

  • Model v1 (baseline)
    • Numeric features only (ID-like columns removed)
    • Simple LinearRegression, 80/20 train/test split
    • Metrics: R², MAE
  • Model v2 (improved)
    • ColumnTransformer: numeric passthrough + OneHotEncoder for categoricals
    • Trains RandomForestRegressor and GradientBoostingRegressor
    • Selects the best model by test R²; reports per-model metrics and feature importances (tree models); see the sketch after this list
  • Evaluation
    • Interpret metrics in the report; provide short human‑readable guidance on model quality.
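
A condensed sketch of the Model v2 flow described above; column names follow the sample dataset, and hyperparameters are library defaults, so details may differ from src/modeling.py:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data/example.csv")
X = df.drop(columns=["Price", "Car ID"])  # drop the target and the ID-like column
y = df["Price"]

categorical = X.select_dtypes(include="object").columns.tolist()
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # numeric features pass through unchanged
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for name, model in {
    "RandomForest": RandomForestRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)]).fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {"R2": r2_score(y_test, pred), "MAE": mean_absolute_error(y_test, pred)}

best = max(results, key=lambda name: results[name]["R2"])  # select best by test R²
print(best, results[best])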

Visualizations

  • Static: histograms, correlation heatmap, bar plots for brands/models (Matplotlib / Seaborn); a minimal sketch follows this list.
  • Interactive: ECharts charts embedded in Streamlit for exploration.
  • Saved to plots/ as PNG for inclusion in reports or portfolio screenshots.
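
A minimal sketch of one static plot, the correlation heatmap, saved under plots/; the filename and figure size are illustrative:

from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/example.csv")
Path("plots").mkdir(exist_ok=True)

fig, ax = plt.subplots(figsize=(8, 6))
# numeric_only avoids errors on categorical columns (pandas >= 2.0)
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
fig.tight_layout()
fig.savefig("plots/correlation_heatmap.png", dpi=150)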

Tests & CI

  • Unit tests: tests/test_main.py (pytest); an illustrative example follows below.
  • CI: GitHub Actions workflow runs tests on push/PR (see .github/workflows/ci.yml).
  • Run locally:
    pip install pytest
    pytest
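
An illustrative smoke test in the spirit of tests/test_main.py; the actual assertions in the repository may differ:

import pandas as pd

def test_sample_csv_loads_with_expected_columns():
    # The bundled sample dataset should load and expose the Price target.
    df = pd.read_csv("data/example.csv")
    assert not df.empty
    assert "Price" in df.columns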

Dataset notes & limitations

  • The sample dataset is derived from a Kaggle cars dataset (cars‑pre).
  • It contains only a weak, noisy linear signal for Price, so low or negative R² values are expected with simple baselines.
  • Purpose: demonstrate a robust, reproducible pipeline (EDA → preprocessing → modeling → reporting) rather than to deliver high predictive accuracy on this specific synthetic sample.
  • For serious modeling, provide higher‑quality data, stronger feature engineering, and cross‑validation/hyperparameter tuning.

Verified environment & dependencies

Tested with these versions (pinned in requirements.txt / environment.yml):

  • Python: 3.10.12
  • pandas: 2.2.3
  • numpy: 2.1.3
  • matplotlib: 3.9.2
  • seaborn: 0.13.2
  • scikit-learn: 1.7.0
  • scipy: 1.15.3
  • streamlit: 1.51.0
  • streamlit-echarts: 0.4.0
  • pytest: 8.4.0

Dependency files:

  • environment.yml (conda/mamba): reproducible environment with pinned binaries.
    mamba env create -f environment.yml
    conda activate ai-data
  • requirements.txt (pip): lightweight installs for venv/virtualenv.
    pip install -r requirements.txt

Why both:

  • Use environment.yml for exact, binary‑compatible environments (recommended for reproducibility).
  • Use requirements.txt for simple pip installs or when deploying to pip‑based environments.

Roadmap & extensions

Short term

  • Improve UX in Streamlit (progress indicators, better layout).
  • Add imputation, outlier detection, and a dedicated preprocessing class.
  • Save richer report formats (Markdown / HTML) with embedded images.

Medium term

  • Add model explainability (SHAP) and feature pipelines.
  • Add cross‑validation and hyperparameter tuning.

Long term

  • Containerize (Docker) and deploy Streamlit (Streamlit Cloud / cloud provider).
  • Convert pipeline into a reusable library / CLI with better configuration flags.

Skills demonstrated

  • Data ingestion & validation
  • Exploratory Data Analysis (EDA)
  • Feature handling & basic preprocessing
  • Model training, selection & evaluation
  • Visualization (static + interactive)
  • Unit testing and CI practices
  • Simple interactive web UI (Streamlit)
