Excel Analyzing

A comprehensive Python pipeline that treats Excel workbooks as databases and their sheets as tables. Built to PEP 8 standards with object-oriented design principles.

Features

  • Excel Processing Pipeline: Recursively discover and process Excel workbooks (.xlsx, .xls, .xlsm, .xlsb)
  • Database-like Treatment: Treat workbooks as databases and sheets as tables with proper schema detection
  • Data Type Inference: Automatically detect column data types (string, integer, float, boolean, datetime, date)
  • Data Cleaning: Drop empty rows/columns, clean column names, handle null values
  • Multiple Environments: Development, test, and production configurations
  • Web Interface: Django-based web interface for managing and querying data
  • REST API: RESTful API for programmatic access
  • Command Line Interface: Rich CLI for batch processing and analysis
  • Database Storage: PostgreSQL backend with SQLAlchemy ORM
  • Validation: Pydantic models for data validation and serialization
  • Testing: Comprehensive test suite with pytest
  • Documentation: Detailed markdown documentation

Architecture

The codebase is organized into the following modules; the containerized microservices deployment with hostname-based service discovery is shown in the next subsection:

excel_analyzing/
├── core/              # Configuration management and settings
│   ├── config.py      # Environment-specific configuration loading
│   ├── data_types.py  # Data type definitions and enums
│   ├── schema.py      # Schema validation utilities
│   └── cleaning.py    # Data cleaning utilities
├── models/            # Data models and persistence
│   ├── schemas.py     # Pydantic models for validation
│   └── database.py    # SQLAlchemy ORM models and database management
├── pipeline/          # Excel processing pipeline
│   ├── processor.py   # Pandas-based Excel data processing engine
│   └── orchestrator.py # Pipeline orchestration and database integration
├── web/               # Django web application
│   ├── apps/api/      # REST API endpoints
│   ├── apps/workbooks/ # Workbook management interface
│   ├── settings/      # Environment-specific Django settings
│   └── urls.py        # URL routing configuration
├── utils/             # Shared utility functions
└── cli.py             # Rich CLI with comprehensive commands

Service Architecture (Containerized)

Production Environment:
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│   prod-web-service  │◄──►│  prod-db-service    │◄──►│ prod-cache-service  │
│   (Django/Alpine)   │    │ (PostgreSQL/Alpine) │    │   (Redis/Alpine)    │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
           │
           ▼
┌─────────────────────┐    ┌─────────────────────┐
│ prod-worker-service │    │ prod-nginx-service  │
│  (Celery/Alpine)    │    │   (Nginx/Alpine)    │
└─────────────────────┘    └─────────────────────┘

Configuration Management

The application uses centralized, environment-specific configuration organized under env/<service>/<subservice>/:

  • Dynamic Loading: Configuration automatically loads based on ENVIRONMENT variable
  • Hostname-Based Services: Uses service discovery (e.g., prod-web-service, dev-db-service)
  • Multi-Environment Support: development, test, production with isolated configurations
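
A minimal sketch of what this environment-driven loading could look like, assuming plain key=value .env files under env/<service>/<subservice>/ (the actual loader in core/config.py may differ):

# Hypothetical sketch only -- the real loader lives in excel_analyzing/core/config.py.
import os
from pathlib import Path

def load_env_file(service: str, subservice: str) -> dict[str, str]:
    """Read key=value pairs from env/<service>/<subservice>/.env.<ENVIRONMENT>."""
    environment = os.environ.get("ENVIRONMENT", "development")
    env_path = Path("env") / service / subservice / f".env.{environment}"
    settings: dict[str, str] = {}
    if env_path.exists():
        for line in env_path.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                settings[key.strip()] = value.strip()
    return settings

# Example: PostgreSQL settings for the currently selected environment
db_settings = load_env_file("database", "postgresql")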

Technologies Used

  • Python 3.10+: Core programming language (minimum version declared in pyproject.toml)
  • Pandas: Data manipulation and analysis engine for Excel processing
  • OpenPyXL: Modern Excel file (.xlsx, .xlsm) reading and writing
  • xlrd: Legacy Excel file (.xls) support
  • Pydantic: Data validation, serialization, and settings management
  • SQLAlchemy: Database ORM with async support
  • PostgreSQL: Primary database with hostname-based service discovery
  • Redis: Cache service for session management and task queuing
  • Django 5.2+: Web framework with modern async capabilities
  • Django REST Framework: API development with comprehensive endpoints
  • Click: Command line interface with rich terminal output
  • Rich: Terminal formatting and progress indicators
  • Docker: Containerized deployment with Alpine Linux base images
  • Gunicorn: WSGI server for production deployment
  • pytest: Comprehensive testing framework with multiple test categories
  • Black: Code formatting with 88-character line length
  • Flake8: Code linting and style enforcement
  • mypy: Static type checking
  • pre-commit: Git hooks for code quality

Installation

Prerequisites

  • Python 3.10+ (Required - see pyproject.toml)
  • PostgreSQL 12+ or containerized database service
  • Redis (for caching and session management)
  • Docker & Docker Compose (for containerized deployment)
  • Git (for version control)

Development Setup

  1. Clone the repository:
git clone https://github.com/nullroute-commits/excel_analyzing.git
cd excel_analyzing
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
# Development with all optional dependencies
pip install -e ".[dev,test]"

# Or install from requirements files
pip install -r requirements-dev.txt
  4. Set up environment configuration:
# The application uses structured environment configuration
# Example environment files are in env/*/
export ENVIRONMENT=development  # or test, production
  5. Initialize the database:
# Using the CLI tool
excel-analyze init-db

# Or reset if needed
excel-analyze reset-db --confirm
  6. Start development services (choose one):

    Option A: Docker Compose (Recommended)

    docker-compose -f docker-compose.dev.yml up -d

    Option B: Local Development

    # Start Django development server
    python manage.py runserver
    
    # In another terminal, start worker processes if needed
    excel-analyze process /path/to/excel/files --recursive

Production Setup

  1. Deploy with Docker Compose (Recommended):
# Production deployment with all services
docker-compose up -d

# This starts:
# - prod-web-service (Django app)
# - prod-db-service (PostgreSQL)
# - prod-cache-service (Redis)
# - prod-nginx-service (Nginx reverse proxy)
# - prod-worker-service (Background processing)
  2. Manual Installation:
# Install production dependencies
pip install -e ".[prod]"

# Set up environment variables for production
export ENVIRONMENT=production
export DATABASE_HOST=prod-db-service  # or your PostgreSQL hostname
export REDIS_HOST=prod-cache-service   # or your Redis hostname
export DJANGO_SECRET_KEY=your-secret-key
export DJANGO_ALLOWED_HOSTS=your-domain.com,prod-web-service

# Initialize the database
excel-analyze init-db

# Run with Gunicorn
gunicorn --bind 0.0.0.0:8000 --workers 4 excel_analyzing.web.wsgi:application
  3. Environment-Specific Configuration:

    The application automatically loads configurations from:

    env/web/django/.env.production
    env/database/postgresql/.env.production
    env/cache/redis/.env.production
    env/processing/core/.env.production
    

Usage

Command Line Interface

The CLI provides comprehensive commands for Excel processing:

Process Excel files in a directory:

excel-analyze process /path/to/excel/files --recursive

Analyze a specific workbook with detailed output:

excel-analyze analyze /path/to/workbook.xlsx

Database management:

# Initialize database tables
excel-analyze init-db

# Reset database (with confirmation)
excel-analyze reset-db --confirm

# List processed workbooks
excel-analyze list-workbooks

Query data from processed sheets:

excel-analyze query workbook_name sheet_name --filter "column > 100" --limit 20

Processing options:

excel-analyze process /path/to/files \
  --recursive \
  --drop-empty-rows \
  --drop-empty-columns \
  --clean-column-names \
  --null-threshold 0.8

Python API

Basic Processing Pipeline:

from excel_analyzing.pipeline.orchestrator import ExcelPipeline
from excel_analyzing.models.schemas import ProcessingOptions

# Create pipeline with custom processing options
options = ProcessingOptions(
    drop_empty_rows=True,
    drop_empty_columns=True,
    clean_column_names=True,
    infer_data_types=True,
    null_threshold=0.9,
    max_sample_size=100
)
pipeline = ExcelPipeline(options)

# Process a single workbook
result = pipeline.process_workbook("path/to/workbook.xlsx")
if result.success:
    print(f"Processed {result.rows_processed} rows in {result.processing_time_seconds:.2f}s")
    workbook_info = result.workbook
    for sheet in workbook_info.sheets:
        print(f"Sheet: {sheet.name} ({sheet.row_count} rows, {sheet.column_count} columns)")
else:
    print(f"Processing failed: {result.error_message}")

# Process entire directory
results = pipeline.process_directory("path/to/directory", recursive=True)
successful = [r for r in results if r.success]
print(f"Successfully processed {len(successful)} out of {len(results)} files")

Direct Data Processing:

from excel_analyzing.pipeline.processor import ExcelDataProcessor
from excel_analyzing.models.schemas import ProcessingOptions

# Initialize processor
processor = ExcelDataProcessor(ProcessingOptions())

# Load and analyze workbook
workbook_info = processor.load_workbook("path/to/file.xlsx")

# Access processed data
df = processor.get_dataframe("filename", "sheet_name")
if df is not None:
    print(df.head())
    
    # Apply filters using pandas query syntax
    filtered_df = processor.apply_filter("filename", "sheet_name", "column_name > 100")
    
    # Get summary statistics
    summary = processor.get_summary_statistics("filename", "sheet_name")
    print(f"Memory usage: {summary['memory_usage_mb']:.2f} MB")

Database Integration:

from excel_analyzing.models.database import db_manager
from sqlalchemy.orm import Session

# Get database session
session: Session = next(db_manager.get_session())

# Query workbooks directly
from excel_analyzing.models.database import WorkbookModel, SheetModel
workbooks = session.query(WorkbookModel).all()
for wb in workbooks:
    print(f"Workbook: {wb.file_name} ({wb.sheet_count} sheets)")
    for sheet in wb.sheets:
        print(f"  - {sheet.name}: {sheet.row_count} rows")

session.close()

Web Interface

Development Server:

# Start Django development server
python manage.py runserver

# Access the web interface at http://localhost:8000
# API endpoints available at http://localhost:8000/api/

Production Deployment:

# Using Docker Compose (includes Nginx reverse proxy)
docker-compose up -d

# Access at configured domain or http://localhost:80
# HTTPS available if SSL certificates configured

API Endpoints:

  • Workbooks: GET/POST /api/workbooks/
  • Workbook Details: GET/PUT/DELETE /api/workbooks/{id}/
  • Data Queries: POST /api/query/

Authentication: API endpoints require authentication except for data queries

Configuration

Environment Variables

The application supports environment-specific configuration with automatic loading:

| Variable | Description | Default | Environments |
|---|---|---|---|
| ENVIRONMENT | Application environment | development | development/test/production |
| DATABASE_HOST | Database hostname (service discovery) | db-service | Environment-specific prefixes |
| DATABASE_PORT | Database port | 5432 | All |
| DATABASE_NAME | Database name | excel_analyzing | All |
| REDIS_HOST | Redis hostname (service discovery) | cache-service | Environment-specific prefixes |
| REDIS_PORT | Redis port | 6379 | All |
| DJANGO_SECRET_KEY | Django secret key | Auto-generated | All |
| DJANGO_ALLOWED_HOSTS | Comma-separated allowed hosts | web-service,localhost | All |
| DEBUG | Enable debug mode | False | development=True |
| MAX_FILE_SIZE_MB | Maximum file size to process (MB) | 100 | All |
| CHUNK_SIZE | Processing chunk size | 1000 | All |
| LOG_LEVEL | Logging level | INFO | All |

Hostname-Based Service Discovery

The application uses hostname-based service discovery for containerized deployments:

| Environment | Web Service | Database Service | Cache Service |
|---|---|---|---|
| Development | dev-web-service:8000 | dev-db-service:5432 | dev-cache-service:6379 |
| Test | test-web-service:8000 | test-db-service:5432 | test-cache-service:6379 |
| Production | prod-web-service:8000 | prod-db-service:5432 | prod-cache-service:6379 |
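
For illustration, connection URLs could be assembled from the variables above, with the hostname defaults replaced by the per-environment service names at deploy time (a sketch, not the project's actual connection code):

# Sketch only: variable names and defaults follow the tables above; the
# project's real connection handling may differ (credentials omitted here).
import os

db_host = os.environ.get("DATABASE_HOST", "db-service")
db_port = os.environ.get("DATABASE_PORT", "5432")
db_name = os.environ.get("DATABASE_NAME", "excel_analyzing")
redis_host = os.environ.get("REDIS_HOST", "cache-service")
redis_port = os.environ.get("REDIS_PORT", "6379")

# In containers these hostnames resolve to e.g. prod-db-service / prod-cache-service.
DATABASE_URL = f"postgresql://{db_host}:{db_port}/{db_name}"
REDIS_URL = f"redis://{redis_host}:{redis_port}/0"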

Configuration File Structure

env/
├── web/django/           # Django-specific settings
│   ├── .env.development
│   ├── .env.test
│   └── .env.production
├── database/postgresql/  # Database connection settings
│   ├── .env.development
│   ├── .env.test
│   └── .env.production
├── cache/redis/         # Redis cache settings
│   ├── .env.development
│   ├── .env.test
│   └── .env.production
└── processing/core/     # Processing pipeline settings
    ├── .env.development
    ├── .env.test
    └── .env.production

Processing Options

Configure Excel processing behavior:

from excel_analyzing.models.schemas import ProcessingOptions

options = ProcessingOptions(
    drop_empty_rows=True,           # Drop completely empty rows
    drop_empty_columns=True,        # Drop completely empty columns
    infer_data_types=True,          # Automatically infer column data types
    clean_column_names=True,        # Clean and normalize column names
    max_sample_size=100,            # Maximum sample size for type inference
    null_threshold=0.9              # Drop columns with >90% null values
)

Data Pipeline

Pipeline Architecture

The Excel processing pipeline implements a 4-stage architecture with comprehensive data handling:

1. Discovery & Validation

  • Recursive scanning of directories for Excel files
  • File type validation: Supports .xlsx, .xls, .xlsm, .xlsb formats
  • Size filtering: Configurable file size limits (default: 100MB)
  • Temporary file exclusion: Skips files starting with ~$
  • Accessibility checks: Validates file permissions and readability
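
A rough sketch of this discovery stage, assuming it is driven by simple path scanning (the actual implementation in the pipeline package may differ):

# Illustrative sketch of the discovery rules listed above; not the project's exact code.
from pathlib import Path

EXCEL_SUFFIXES = {".xlsx", ".xls", ".xlsm", ".xlsb"}
MAX_FILE_SIZE_MB = 100  # default limit, configurable via MAX_FILE_SIZE_MB

def discover_excel_files(root: str, recursive: bool = True) -> list[Path]:
    """Find Excel files, skipping temporary (~$) files and oversized workbooks."""
    base = Path(root)
    candidates = base.rglob("*") if recursive else base.glob("*")
    found: list[Path] = []
    for path in candidates:
        if not path.is_file() or path.suffix.lower() not in EXCEL_SUFFIXES:
            continue
        if path.name.startswith("~$"):  # Excel lock/temporary files
            continue
        if path.stat().st_size > MAX_FILE_SIZE_MB * 1024 * 1024:
            continue
        found.append(path)
    return found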

2. Processing & Analysis

  • Workbook metadata extraction using OpenPyXL and xlrd engines
  • Automatic header detection: Intelligent row scanning for headers
  • Data type inference: Automatic detection of string, integer, float, boolean, datetime, date types
  • Column analysis: Null counts, unique values, sample data extraction
  • Data cleaning pipeline:
    • Column name normalization (spaces → underscores, special character handling)
    • Sheet name cleaning and validation
    • Empty row/column removal (configurable)
    • Null threshold filtering (configurable percentage)
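
The column-name normalization and data type inference steps above could look roughly like the following (a sketch for illustration; pipeline/processor.py implements the real logic and may behave differently):

# Sketch only: approximates the cleaning/inference rules described above.
import re
import pandas as pd

def clean_column_name(name: str) -> str:
    """Lowercase, strip special characters, and convert spaces to underscores."""
    cleaned = re.sub(r"[^\w\s]", "", str(name)).strip().lower()
    return re.sub(r"\s+", "_", cleaned) or "unnamed"

def infer_column_type(series: pd.Series, sample_size: int = 100) -> str:
    """Classify a column as string, integer, float, boolean, datetime, or date."""
    sample = series.dropna().head(sample_size)
    if sample.empty:
        return "string"
    if pd.api.types.is_bool_dtype(sample):
        return "boolean"
    if pd.api.types.is_integer_dtype(sample):
        return "integer"
    if pd.api.types.is_float_dtype(sample):
        return "float"
    try:
        parsed = pd.to_datetime(sample, errors="raise")
        return "date" if (parsed.dt.normalize() == parsed).all() else "datetime"
    except (ValueError, TypeError):
        return "string"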

Processing Options Configuration:

ProcessingOptions(
    drop_empty_rows=True,           # Remove completely empty rows
    drop_empty_columns=True,        # Remove completely empty columns
    infer_data_types=True,          # Automatic type detection
    clean_column_names=True,        # Normalize column names
    max_sample_size=100,            # Sample size for type inference
    null_threshold=0.9              # Drop columns with >90% null values
)

3. Database Storage & Persistence

  • Hierarchical data model: Workbooks → Sheets → Columns
  • Metadata storage: File information, processing timestamps, statistics
  • Column schema storage: Data types, null counts, sample values
  • Processing results tracking: Success/failure, timing, error messages
  • Incremental updates: Support for reprocessing and updates

Database Schema:

workbooks (id, file_path, file_name, file_size_bytes, sheet_count, created_at, processed_at)
├── sheets (id, workbook_id, name, original_name, row_count, column_count, has_header, header_row)
│   └── columns (id, sheet_id, name, original_name, position, data_type, null_count, unique_count)
└── processing_results (id, workbook_id, success, error_message, processing_time_seconds)
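
A minimal sketch of how the top of this hierarchy might map onto the SQLAlchemy ORM classes referenced elsewhere (WorkbookModel, SheetModel); column definitions follow the outline above and may not match models/database.py exactly:

# Sketch of the hierarchical schema as SQLAlchemy ORM models; illustrative only.
from datetime import datetime

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class WorkbookModel(Base):
    __tablename__ = "workbooks"

    id = Column(Integer, primary_key=True)
    file_path = Column(String, nullable=False)
    file_name = Column(String, nullable=False)
    file_size_bytes = Column(Integer)
    sheet_count = Column(Integer, default=0)
    created_at = Column(DateTime, default=datetime.utcnow)
    processed_at = Column(DateTime)

    sheets = relationship("SheetModel", back_populates="workbook")

class SheetModel(Base):
    __tablename__ = "sheets"

    id = Column(Integer, primary_key=True)
    workbook_id = Column(Integer, ForeignKey("workbooks.id"), nullable=False)
    name = Column(String, nullable=False)
    original_name = Column(String)
    row_count = Column(Integer)
    column_count = Column(Integer)
    has_header = Column(Boolean, default=True)
    header_row = Column(Integer)

    workbook = relationship("WorkbookModel", back_populates="sheets")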

4. Querying & Data Access

  • Pandas integration: Direct DataFrame access for processed data
  • Query capabilities: SQL-like filtering using pandas query syntax
  • Transformation support: Grouping, sorting, column selection
  • Export functionality: Multiple output formats
  • Statistical analysis: Summary statistics and data profiling

Query Examples:

# Filter data using pandas query syntax
df = processor.apply_filter("workbook", "sheet", "column_name > 100 and status == 'active'")

# Apply transformations
transformed = processor.apply_transformation("workbook", "sheet", {
    "operation": "group_by",
    "columns": ["category"],
    "agg_func": "mean"
})

# Get summary statistics
stats = processor.get_summary_statistics("workbook", "sheet")
print(f"Memory usage: {stats['memory_usage_mb']:.2f} MB")
print(f"Numeric columns: {stats['numeric_columns']}")

Testing

The project implements comprehensive testing infrastructure with multiple test categories:

Test Categories

Run all tests:

pytest

Run with coverage reporting:

pytest --cov=excel_analyzing --cov-report=html --cov-report=term-missing

Run specific test categories:

# Unit tests (fast, isolated)
pytest tests/unit/ -v

# Integration tests (database, service integration)
pytest tests/integration/ -v

# Regression tests (ensure no breaking changes)
pytest tests/regression/ -v

# Security tests (vulnerability scanning)
pytest tests/security/ -v

# Performance tests (benchmarking)
pytest tests/performance/ -v

# End-to-end tests (full workflow)
pytest tests/e2e/ -v

# Architecture validation tests
python test_architecture.py

Run tests by markers:

pytest -m "unit"           # Unit tests only
pytest -m "integration"    # Integration tests only
pytest -m "slow"           # Long-running tests
pytest -m "security"       # Security tests only

Testing Infrastructure

Test Configuration:

  • pytest with comprehensive markers and fixtures
  • pytest-django for Django integration testing
  • pytest-cov for coverage reporting
  • factory-boy for test data generation
  • pytest-playwright for end-to-end browser testing
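
As an example of how the markers are used, a hypothetical unit test (file name and assertions are illustrative, not taken from the repository) might look like:

# Hypothetical tests/unit/test_options.py -- illustrative only.
import pytest

from excel_analyzing.models.schemas import ProcessingOptions

@pytest.mark.unit
def test_processing_options_are_overridable():
    options = ProcessingOptions(null_threshold=0.5, max_sample_size=10)
    assert options.null_threshold == 0.5
    assert options.max_sample_size == 10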

Architecture Testing:

  • Environment configuration validation
  • Hostname-based service discovery verification
  • Alpine container usage validation
  • Docker Compose configuration testing
  • Multistage Dockerfile verification

Coverage Requirements:

  • Unit test coverage: >90% target
  • Integration tests: All environments tested
  • Security tests: No critical/high vulnerabilities
  • Performance tests: <10% regression tolerance

Continuous Integration

GitHub Actions Workflows:

  • Comprehensive testing pipeline: Multi-environment testing
  • Security testing pipeline: Vulnerability scanning
  • Performance benchmarking: Automated performance tracking
  • Code quality checks: Linting, formatting, type checking

Code Quality

Format code with Black:

black excel_analyzing/

Lint with Flake8:

flake8 excel_analyzing/

Type checking with mypy:

mypy excel_analyzing/

Deployment

Containerized Deployment (Recommended)

Production deployment with Docker Compose:

# Start all services in production mode
docker-compose up -d

# Services started:
# - prod-web-service: Django application (port 8000)
# - prod-db-service: PostgreSQL database
# - prod-cache-service: Redis cache
# - prod-nginx-service: Nginx reverse proxy (ports 80/443)
# - prod-worker-service: Background processing

Development deployment:

docker-compose -f docker-compose.dev.yml up -d

Test deployment:

docker-compose -f docker-compose.test.yml up -d

Docker Architecture

Multi-stage Dockerfile:

  • Base stage: Alpine Linux with system dependencies
  • Dependencies stage: Python package installation
  • Builder stage: Application code integration
  • Production stage: Optimized runtime image

Security Features:

  • Non-root user: Application runs as user excel
  • Alpine Linux: Minimal attack surface
  • Health checks: Container health monitoring
  • Resource limits: Memory and CPU constraints

Manual Deployment

System requirements:

# Install system dependencies
sudo apt-get update
sudo apt-get install python3.10 postgresql-client redis-tools

# Install application
pip install -e ".[prod]"

Environment setup:

export ENVIRONMENT=production
export DATABASE_HOST=your-db-host
export REDIS_HOST=your-redis-host
export DJANGO_SECRET_KEY=your-secret-key
export DJANGO_ALLOWED_HOSTS=your-domain.com

Run with Gunicorn:

gunicorn --bind 0.0.0.0:8000 \
         --workers 4 \
         --timeout 120 \
         --worker-class sync \
         excel_analyzing.web.wsgi:application

Load Balancing & Scaling

Horizontal scaling:

  • Multiple web service replicas
  • Shared database and cache
  • Nginx load balancing
  • Worker service scaling

Performance optimization:

  • Static file serving: Nginx with compression
  • Database connection pooling: SQLAlchemy pool management
  • Redis caching: Session and data caching
  • Async processing: Background task queuing
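
As a sketch of the connection-pooling point above (the actual pool configuration lives in excel_analyzing.models.database and may use different values):

# Illustrative SQLAlchemy pool tuning; credentials omitted, values are examples.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://prod-db-service:5432/excel_analyzing",
    pool_size=10,        # persistent connections kept open in the pool
    max_overflow=20,     # additional connections allowed under burst load
    pool_pre_ping=True,  # validate connections before handing them out
    pool_recycle=1800,   # recycle connections after 30 minutes
)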

API Documentation

REST API Endpoints

Base URL: http://localhost:8000/api/ (development) or https://your-domain.com/api/ (production)

Authentication: Most endpoints require authentication (Django session or token-based)

Workbook Management

List workbooks:

GET /api/workbooks/
Authorization: Required
Response: JSON array of workbook objects

Create workbook (upload and process):

POST /api/workbooks/
Authorization: Required
Content-Type: multipart/form-data
Body: Excel file upload
Response: 201 Created with workbook ID

Get workbook details:

GET /api/workbooks/{id}/
Authorization: Required
Response: Detailed workbook information with sheets

Update workbook:

PUT /api/workbooks/{id}/
Authorization: Required
Body: JSON with updated fields
Response: Updated workbook object

Delete workbook:

DELETE /api/workbooks/{id}/
Authorization: Required
Response: 204 No Content

Data Querying

Query sheet data:

POST /api/query/
Authorization: Not required (for testing)
Content-Type: application/json
Body: {
  "workbook": "workbook_name",
  "sheet": "sheet_name", 
  "filters": {
    "column_name": {"operator": ">", "value": 100},
    "status": {"operator": "==", "value": "active"}
  },
  "limit": 1000,
  "offset": 0
}
Response: {
  "results": [...],
  "count": 1500,
  "filters_applied": {...}
}
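
A hypothetical client-side call to this endpoint using the requests library (not part of this project) might look like:

# Illustrative client call to POST /api/query/; payload mirrors the example above.
import requests

payload = {
    "workbook": "workbook_name",
    "sheet": "sheet_name",
    "filters": {"column_name": {"operator": ">", "value": 100}},
    "limit": 1000,
    "offset": 0,
}

response = requests.post("http://localhost:8000/api/query/", json=payload, timeout=30)
response.raise_for_status()
data = response.json()
print(f"{data['count']} matching rows; returned {len(data['results'])}")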

Error Responses

All endpoints return consistent error responses:

{
  "error": "Error description",
  "details": "Additional error details",
  "code": "ERROR_CODE"
}

HTTP Status Codes:

  • 200 OK: Successful request
  • 201 Created: Resource created successfully
  • 400 Bad Request: Invalid request data
  • 401 Unauthorized: Authentication required
  • 404 Not Found: Resource not found
  • 500 Internal Server Error: Server error

WebSocket Events (Future Enhancement)

The application is designed to support real-time updates:

  • workbook.processing.started - Processing started
  • workbook.processing.progress - Processing progress update
  • workbook.processing.completed - Processing completed
  • workbook.processing.failed - Processing failed

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes following PEP 8 standards
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Support

For support and questions:

  • Create an issue on GitHub
  • Check the documentation
  • Review the test examples
