kerski/python-wheel-file-template

Marvel Character Data Pipeline

A comprehensive Python data pipeline for extracting, cleaning, and validating Marvel character data from FiveThirtyEight's dataset. This project includes pytest unit tests, Great Expectations data validation, and automated wheel packaging for deployment to Microsoft Fabric environments.

πŸ“‹ Table of Contents

  • πŸ“– Project Overview
  • ✨ Features
  • πŸ“ Project Structure
  • πŸš€ Getting Started
  • πŸ’» Local Development
  • πŸ§ͺ Testing
  • πŸ”§ Building the Wheel
  • 🌐 Fabric Deployment
  • πŸ“š Usage
  • πŸ”„ Data Pipeline Stages
  • πŸ“Š Great Expectations Setup (Optional)
  • πŸ› Troubleshooting
  • πŸ“ Development Roadmap
  • 🀝 Contributing
  • πŸ“„ License
  • πŸ™ Acknowledgments

πŸ“– Project Overview

This project extracts Marvel character data from the FiveThirtyEight dataset, applies comprehensive data cleaning transformations, and validates the output using both pytest (for code testing) and Great Expectations (for data quality validation).

Data Source: FiveThirtyEight Marvel Dataset

Key Capabilities:

  • Extract data from web sources or local files
  • Clean and standardize character names, identities, alignments, appearances, years, and status
  • Validate data quality with configurable rules
  • Package as a wheel for deployment to Fabric environments
  • Fail-fast validation to prevent bad data propagation

✨ Features

  • Modular Design: Separate modules for extraction, cleaning, and validation
  • Comprehensive Testing: Full pytest suite with unit and integration tests
  • Data Quality: Great Expectations integration for production validation
  • Automated Build: Single-command wheel building and deployment
  • Fabric Integration: Ready-to-use Environment configuration
  • Type Safety: Type hints throughout the codebase
  • Logging: Detailed logging at every pipeline stage
  • Error Handling: Graceful error handling with informative messages

πŸ“ Project Structure

wheel-file-demo/
β”œβ”€β”€ python/                        # Python package and development
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   └── marvel_pipeline/       # Main package
β”‚   β”‚       β”œβ”€β”€ __init__.py        # Package initialization
β”‚   β”‚       β”œβ”€β”€ extract.py         # Data extraction module
β”‚   β”‚       β”œβ”€β”€ clean.py           # Data cleaning module
β”‚   β”‚       └── validate.py        # Data validation module
β”‚   β”œβ”€β”€ tests/                     # Test suite
β”‚   β”‚   β”œβ”€β”€ conftest.py            # Pytest fixtures
β”‚   β”‚   β”œβ”€β”€ test_extract.py        # Extract module tests
β”‚   β”‚   β”œβ”€β”€ test_clean.py          # Clean module tests
β”‚   β”‚   └── test_integration.py    # Integration tests
β”‚   β”œβ”€β”€ scripts/                   # Utility scripts
β”‚   β”‚   β”œβ”€β”€ build_wheel.py         # Wheel build script
β”‚   β”‚   └── init_great_expectations.py # GE initialization
β”‚   β”œβ”€β”€ dist/                      # Built wheel packages
β”‚   β”œβ”€β”€ build/                     # Build artifacts
β”‚   β”œβ”€β”€ requirements.txt           # Core dependencies
β”‚   β”œβ”€β”€ requirements-dev.txt       # Dev dependencies
β”‚   β”œβ”€β”€ setup.py                   # Package setup
β”‚   β”œβ”€β”€ pyproject.toml             # Build configuration
β”‚   └── pytest.ini                 # Pytest configuration
β”œβ”€β”€ fabric-artifacts/              # Microsoft Fabric resources
β”‚   β”œβ”€β”€ env-example.Environment/   # Fabric Environment
β”‚   β”‚   β”œβ”€β”€ .platform              # Environment metadata
β”‚   β”‚   β”œβ”€β”€ Setting/
β”‚   β”‚   β”‚   └── Sparkcompute.yml   # Spark configuration
β”‚   β”‚   └── Libraries/
β”‚   β”‚       └── CustomLibraries/   # Wheel packages go here
β”‚   └── nb-marvel-extract.Notebook/ # Fabric notebook
β”‚       └── notebook-content.py    # Notebook implementation
β”œβ”€β”€ venv/                          # Virtual environment (not in git)
β”œβ”€β”€ .gitignore                     # Git ignore rules
└── README.md                      # This file

πŸš€ Getting Started

Prerequisites

  • Python 3.8 or higher
  • Git
  • Access to Microsoft Fabric (for Fabric deployment)

Quick Start

  1. Clone the repository

    git clone <your-repo-url>
    cd wheel-file-demo
  2. Set up virtual environment

    # Create virtual environment
    python -m venv venv
    
    # Activate (Windows)
    venv\Scripts\activate
    
    # Activate (Linux/Mac)
    source venv/bin/activate
  3. Install dependencies

    # Navigate to python directory
    cd python
    
    # Install core dependencies
    pip install -r requirements.txt
    
    # Install dev dependencies
    pip install -r requirements-dev.txt
    
    # Install package in editable mode
    pip install -e .
  4. Run tests

    # From the python directory
    pytest tests/ -v
  5. Build the wheel

    # From the python directory
    python scripts/build_wheel.py

πŸ’» Local Development

Setting Up Your Environment

  1. Activate virtual environment (if not already active):

    venv\Scripts\activate  # Windows
    source venv/bin/activate  # Linux/Mac
  2. Navigate to python directory:

    cd python
  3. Install in editable mode for development:

    pip install -e .

    This allows you to make changes to the code without reinstalling.

Running the Pipeline Locally

Create a Python script or use the Python REPL:

from marvel_pipeline import extract, clean, validate

# Extract data
df = extract.fetch_marvel_data(
    url=extract.DEFAULT_MARVEL_URL,
    output_path="data/marvel-raw.csv"
)

# Clean data
df_cleaned = clean.clean_marvel_data(df)

# Save cleaned data
df_cleaned.to_csv("data/marvel-cleaned.csv", index=False)

# Validate
result = validate.run_validation("data/marvel-cleaned.csv")
print(f"Validation: {'PASSED' if result['success'] else 'FAILED'}")

Code Style

  • Follow PEP 8 guidelines
  • Use type hints for function parameters and returns
  • Add docstrings to all functions (NumPy style)
  • Keep functions focused and modular
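As a concrete illustration of these conventions, here is a hypothetical helper in the style the project describes (type hints plus a NumPy-style docstring); it is not part of the actual package:

```python
def normalize_alignment(value: str, default: str = "Unknown") -> str:
    """Standardize a raw alignment label.

    Parameters
    ----------
    value : str
        Raw alignment string, e.g. "Good Characters".
    default : str, optional
        Label returned when the value is empty or unrecognized.

    Returns
    -------
    str
        One of "Good", "Bad", "Neutral", or the default.
    """
    mapping = {"good": "Good", "bad": "Bad", "neutral": "Neutral"}
    stripped = value.strip().lower()
    key = stripped.split()[0] if stripped else ""
    return mapping.get(key, default)
```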

πŸ§ͺ Testing

Running All Tests

pytest tests/ -v

Running Specific Test Files

pytest tests/test_extract.py -v
pytest tests/test_clean.py -v
pytest tests/test_integration.py -v

Running Tests by Marker

# Unit tests only
pytest tests/ -m unit -v

# Integration tests only
pytest tests/ -m integration -v

# Skip slow tests
pytest tests/ -m "not slow" -v

Code Coverage

# From python directory
pytest tests/ --cov=src/marvel_pipeline --cov-report=html

View coverage report: Open python/htmlcov/index.html in your browser

Test Structure

  • Unit Tests: Test individual functions in isolation
  • Integration Tests: Test the complete pipeline end-to-end
  • Fixtures: Reusable test data in conftest.py
  • Mocking: Mock external requests for fast, reliable tests
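A sketch of the fixture-and-stub style described above (the helper names here are hypothetical; the real tests live in `python/tests/` and mock the actual extract functions):

```python
import io

import pandas as pd

def load_csv_from_url(url: str, reader=pd.read_csv) -> pd.DataFrame:
    """Toy extract helper; `reader` is injectable so tests stay offline."""
    return reader(url)

def sample_raw_csv() -> io.StringIO:
    """Fixture-style helper: a tiny in-memory Marvel-like CSV."""
    return io.StringIO("name,ALIGN\nSpider-Man,Good Characters\n")

def test_load_csv_parses_expected_columns():
    # Stub out the network-facing reader for a fast, reliable unit test.
    df = load_csv_from_url(
        "https://example.com/marvel.csv",
        reader=lambda _url: pd.read_csv(sample_raw_csv()),
    )
    assert list(df.columns) == ["name", "ALIGN"]
    assert df.loc[0, "name"] == "Spider-Man"
```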

πŸ”§ Building the Wheel

The build script automates the entire packaging process:

# From the python directory
cd python
python scripts/build_wheel.py

What it does:

  1. βœ“ Cleans old build artifacts
  2. βœ“ Runs the test suite (fails if tests fail)
  3. βœ“ Builds the wheel package
  4. βœ“ Copies wheel to Fabric Environment directory
  5. βœ“ Reports success with next steps

Options:

# Skip tests (not recommended)
python scripts/build_wheel.py --skip-tests

Output:

  • Wheel file: python/dist/marvel_pipeline-0.1.0-py3-none-any.whl
  • Deployed to: fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/
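The five steps above can be sketched roughly as follows. This is an illustration only: the real logic lives in `python/scripts/build_wheel.py`, the helper name is hypothetical, and it assumes the PyPA `build` package is installed:

```python
import shutil
import subprocess
import sys
from pathlib import Path

def build_wheel(python_dir: Path, custom_libs: Path, skip_tests: bool = False) -> None:
    # 1. Clean old build artifacts
    for stale in ("build", "dist"):
        shutil.rmtree(python_dir / stale, ignore_errors=True)
    # 2. Run the test suite; check=True makes the build fail if tests fail
    if not skip_tests:
        subprocess.run([sys.executable, "-m", "pytest", "tests/", "-v"],
                       cwd=python_dir, check=True)
    # 3. Build the wheel (assumes `pip install build`)
    subprocess.run([sys.executable, "-m", "build", "--wheel"],
                   cwd=python_dir, check=True)
    # 4. Copy the wheel into the Fabric Environment directory,
    #    removing any previous version first
    custom_libs.mkdir(parents=True, exist_ok=True)
    for old in custom_libs.glob("marvel_pipeline-*.whl"):
        old.unlink()
    for whl in (python_dir / "dist").glob("*.whl"):
        shutil.copy2(whl, custom_libs)
```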

🌐 Fabric Deployment

Deploying to Microsoft Fabric

  1. Build the wheel (if not already done):

    cd python
    python scripts/build_wheel.py
  2. Commit changes including the wheel:

    git add fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/*.whl
    git commit -m "Deploy marvel-pipeline v0.1.0"
    git push
  3. Sync workspace in Fabric:

    • Go to your Fabric workspace
    • Navigate to Source Control settings
    • Click "Sync" to pull latest changes
  4. Attach Environment to notebook:

    • Open the notebook: nb-marvel-extract
    • Click "Environment" in the toolbar
    • Select env-example
    • Wait for environment to attach
  5. Run the notebook:

    • Execute cells to run the pipeline
    • Data is saved to lakehouse at /lakehouse/default/Files/

Environment Configuration

The fabric-artifacts/env-example.Environment includes:

  • Custom Libraries: marvel-pipeline wheel package
  • Spark Configuration: 8 cores, 56GB memory
  • Runtime: Fabric Runtime 1.3

Custom Libraries Directory

The fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/ directory contains wheel files that are automatically loaded when the Environment is attached to a notebook.

Important Notes:

  • Wheel files in this directory ARE tracked by git (unlike typical Python build artifacts, which are usually git-ignored)
  • This is required for Fabric workspace sync
  • When you rebuild the wheel, the old version is automatically removed
  • The build script (scripts/build_wheel.py) handles copying wheels here

Updating the Package:

  1. Make code changes in python/src/marvel_pipeline/
  2. Update tests in python/tests/
  3. Run: cd python && python scripts/build_wheel.py
  4. Wheel is automatically copied to CustomLibraries
  5. Commit and sync with Fabric

πŸ“š Usage

In Fabric Notebooks

from marvel_pipeline import extract, clean, validate

# Configuration
MARVEL_URL = "https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv"
RAW_PATH = "/lakehouse/default/Files/marvel-raw.csv"
CLEANED_PATH = "/lakehouse/default/Files/marvel-cleaned.csv"

# Extract
df = extract.fetch_marvel_data(url=MARVEL_URL, output_path=RAW_PATH)

# Clean
df_cleaned = clean.clean_marvel_data(df)
df_cleaned.to_csv(CLEANED_PATH, index=False)

# Validate
result = validate.run_validation(CLEANED_PATH)

In Local Scripts

from marvel_pipeline import extract, clean, validate

# Extract from web
df = extract.fetch_marvel_data()

# Or load from local file
df = extract.load_local_csv("path/to/data.csv")

# Clean
df_cleaned = clean.clean_marvel_data(df)

# Validate DataFrame directly
result = validate.validate_dataframe(df_cleaned)

πŸ”„ Data Pipeline Stages

1. Extract

Module: marvel_pipeline.extract

  • Fetches data from URL via HTTP requests
  • Saves raw CSV to specified location
  • Returns pandas DataFrame
  • Handles network errors gracefully
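A minimal sketch of what this stage might look like; the real implementation is in `marvel_pipeline/extract.py` and may differ in signature and error handling:

```python
import logging
from pathlib import Path
from typing import Optional
from urllib.error import URLError

import pandas as pd

logger = logging.getLogger(__name__)

DEFAULT_MARVEL_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/comic-characters/marvel-wikia-data.csv"
)

def fetch_marvel_data(url: str = DEFAULT_MARVEL_URL,
                      output_path: Optional[str] = None) -> pd.DataFrame:
    """Download the Marvel CSV, optionally persist it, return a DataFrame."""
    try:
        df = pd.read_csv(url)
    except URLError as exc:  # network failure: fail with a clear message
        raise RuntimeError(f"Could not fetch Marvel data from {url}") from exc
    if output_path:
        # Save the raw CSV to the requested location
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(output_path, index=False)
        logger.info("Saved raw data to %s", output_path)
    return df
```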

2. Clean

Module: marvel_pipeline.clean

Cleaning Functions:

  • clean_names(): Strip whitespace, standardize formatting
  • clean_identity(): Standardize to Secret/Public/Unknown
  • clean_alignment(): Standardize to Good/Bad/Neutral/Unknown
  • clean_appearances(): Convert to numeric, handle nulls
  • clean_year(): Parse dates, validate range (1939-2026)
  • clean_status(): Standardize to Living/Deceased/Unknown
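As a hedged sketch of one of these steps, a `clean_identity()`-style function might look like this (the actual column names and mapping rules in `marvel_pipeline/clean.py` may differ):

```python
import pandas as pd

# Illustrative mapping; the real module may recognize more raw labels.
_IDENTITY_MAP = {
    "secret identity": "Secret",
    "public identity": "Public",
}

def clean_identity(df: pd.DataFrame, column: str = "ID") -> pd.DataFrame:
    """Standardize identity labels to Secret/Public/Unknown."""
    out = df.copy()
    # Normalize whitespace and case before mapping; nulls become "Unknown".
    normalized = out[column].fillna("").str.strip().str.lower()
    out[column] = normalized.map(_IDENTITY_MAP).fillna("Unknown")
    return out
```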

Main Function: clean_marvel_data()

  • Applies all cleaning functions
  • Removes duplicate rows
  • Resets DataFrame index
  • Logs cleaning summary

3. Validate

Module: marvel_pipeline.validate

Validation Checks:

  • DataFrame not empty
  • Expected columns present
  • No completely null rows
  • Name column has values
  • Appearances are numeric and non-negative
  • Year values in reasonable range

Validation Modes:

  • Great Expectations (if configured)
  • Basic validation (fallback)

Behavior:

  • Raises ValidationError on failure
  • Logs warnings for non-critical issues
  • Returns validation results dict
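The fail-fast behavior described above can be sketched as follows; the actual checks, column names, and exception class in `marvel_pipeline/validate.py` may differ:

```python
import pandas as pd

class ValidationError(Exception):
    """Raised when a data quality check fails."""

def validate_dataframe(df: pd.DataFrame,
                       required_columns=("name", "ALIGN", "APPEARANCES")) -> dict:
    """Run basic checks and raise ValidationError on any failure."""
    failures = []
    if df.empty:
        failures.append("DataFrame is empty")
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        failures.append(f"Missing columns: {missing}")
    if "APPEARANCES" in df.columns:
        appearances = pd.to_numeric(df["APPEARANCES"], errors="coerce")
        if (appearances.dropna() < 0).any():
            failures.append("Negative appearance counts")
    if failures:
        # Fail fast so bad data never propagates downstream
        raise ValidationError("; ".join(failures))
    return {"success": True, "checks_failed": failures}
```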

πŸ“Š Great Expectations Setup (Optional)

The pipeline includes basic validation by default. For advanced data quality monitoring, you can configure Great Expectations:

Initialize Great Expectations

# Run initialization script from python directory
cd python
python scripts/init_great_expectations.py

This creates the python/great_expectations/ directory with:

  • Configuration files
  • Datasource definitions
  • Expectation suites
  • Checkpoint configurations

Configure Datasources

Edit python/great_expectations/great_expectations.yml to add CSV datasources for Marvel data.

Create Expectation Suites

# Interactive suite creation
great_expectations suite new

# Name it 'marvel_suite' to match validation module

Define expectations for your data:

  • Column existence (expect_column_to_exist)
  • Data types (expect_column_values_to_be_of_type)
  • Null constraints (expect_column_values_to_not_be_null)
  • Categorical values (expect_column_values_to_be_in_set)
  • Numeric ranges (expect_column_values_to_be_between)
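Great Expectations persists suites as JSON; a `marvel_suite` covering the checks above might contain entries shaped like this (illustrative only, since the exact schema depends on the installed GE version):

```python
# Illustrative expectation suite for the cleaned Marvel data; column
# names and the value set are assumptions, not the project's actual suite.
marvel_suite = {
    "expectation_suite_name": "marvel_suite",
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "name"}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "name"}},
        {"expectation_type": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "ALIGN",
                    "value_set": ["Good", "Bad", "Neutral", "Unknown"]}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "Year", "min_value": 1939, "max_value": 2026}},
    ],
}
```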

Create Checkpoints

# Create checkpoint for automated validation
great_expectations checkpoint new marvel_checkpoint

View Data Docs

# Build interactive documentation
great_expectations docs build

Opens a browser with data quality reports and validation history.

Integration with Pipeline

The marvel_pipeline.validate module automatically:

  • Looks for Great Expectations config in python/great_expectations/ directory
  • Uses checkpoint named marvel_checkpoint
  • Uses expectation suite named marvel_suite
  • Falls back to basic validation if GE is not configured

Note: The uncommitted/ subdirectory contains local data docs and validation results. These files are ignored by git and should not be committed.

οΏ½πŸ› Troubleshooting

Common Issues

Virtual Environment Not Activating

Windows:

# If script execution is disabled
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Then activate
venv\Scripts\activate

Linux/Mac:

source venv/bin/activate

Import Errors

# Navigate to python directory and reinstall package in editable mode
cd python
pip install -e .

Tests Failing

# Navigate to python directory
cd python

# Check dependencies
pip install -r requirements.txt -r requirements-dev.txt

# Run with verbose output
pytest tests/ -v -s

# Run specific failing test
pytest tests/test_clean.py::test_clean_names -v

Wheel Not Loading in Fabric

  1. Verify wheel is in correct directory:

    fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/
    
  2. Check Environment is attached to notebook

  3. Restart notebook kernel

  4. Verify git sync completed in Fabric workspace

Great Expectations Not Found

# Navigate to python directory
cd python

# Initialize GE (optional)
python scripts/init_great_expectations.py

# Or install GE
pip install great-expectations

Getting Help

  • Check error logs for detailed messages
  • Review test output: pytest tests/ -v
  • Inspect data: df.info(), df.head(), df.describe()
  • Enable debug logging in modules

πŸ“ Development Roadmap

  • Add support for additional comic datasets (DC Comics)
  • Implement data profiling reports
  • Add CLI interface for command-line execution
  • Create Fabric pipeline for scheduled execution
  • Add data lineage tracking
  • Implement incremental data updates

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Run tests: pytest tests/ -v
  5. Commit changes: git commit -m 'Add amazing feature'
  6. Push to branch: git push origin feature/amazing-feature
  7. Open a Pull Request

πŸ“„ License

This project is provided as-is for educational and development purposes.

πŸ™ Acknowledgments

  • Data Source: FiveThirtyEight for the Marvel character dataset
  • Tools: Great Expectations, pytest, pandas, Microsoft Fabric

Built with ❀️ for data quality and reproducible pipelines
