A comprehensive Python data pipeline for extracting, cleaning, and validating Marvel character data from FiveThirtyEight's dataset. This project includes pytest unit tests, Great Expectations data validation, and automated wheel packaging for deployment to Microsoft Fabric environments.
## Table of Contents

- Project Overview
- Features
- Project Structure
- Getting Started
- Local Development
- Testing
- Building the Wheel
- Fabric Deployment
- Usage
- Data Pipeline Stages
- Troubleshooting
## Project Overview

This project extracts Marvel character data from the FiveThirtyEight dataset, applies comprehensive data cleaning transformations, and validates the output using both pytest (for code testing) and Great Expectations (for data quality validation).
Data Source: FiveThirtyEight Marvel Dataset
Key Capabilities:
- Extract data from web sources or local files
- Clean and standardize character names, identities, alignments, appearances, years, and status
- Validate data quality with configurable rules
- Package as a wheel for deployment to Fabric environments
- Fail-fast validation to prevent bad data propagation
## Features

- **Modular Design**: Separate modules for extraction, cleaning, and validation
- **Comprehensive Testing**: Full pytest suite with unit and integration tests
- **Data Quality**: Great Expectations integration for production validation
- **Automated Build**: Single-command wheel building and deployment
- **Fabric Integration**: Ready-to-use Environment configuration
- **Type Safety**: Type hints throughout the codebase
- **Logging**: Detailed logging at every pipeline stage
- **Error Handling**: Graceful error handling with informative messages
## Project Structure

```text
wheel-file-demo/
├── python/                             # Python package and development
│   ├── src/
│   │   └── marvel_pipeline/            # Main package
│   │       ├── __init__.py             # Package initialization
│   │       ├── extract.py              # Data extraction module
│   │       ├── clean.py                # Data cleaning module
│   │       └── validate.py             # Data validation module
│   ├── tests/                          # Test suite
│   │   ├── conftest.py                 # Pytest fixtures
│   │   ├── test_extract.py             # Extract module tests
│   │   ├── test_clean.py               # Clean module tests
│   │   └── test_integration.py         # Integration tests
│   ├── scripts/                        # Utility scripts
│   │   ├── build_wheel.py              # Wheel build script
│   │   └── init_great_expectations.py  # GE initialization
│   ├── dist/                           # Built wheel packages
│   ├── build/                          # Build artifacts
│   ├── requirements.txt                # Core dependencies
│   ├── requirements-dev.txt            # Dev dependencies
│   ├── setup.py                        # Package setup
│   ├── pyproject.toml                  # Build configuration
│   └── pytest.ini                      # Pytest configuration
├── fabric-artifacts/                   # Microsoft Fabric resources
│   ├── env-example.Environment/        # Fabric Environment
│   │   ├── .platform                   # Environment metadata
│   │   ├── Setting/
│   │   │   └── Sparkcompute.yml        # Spark configuration
│   │   └── Libraries/
│   │       └── CustomLibraries/        # Wheel packages go here
│   └── nb-marvel-extract.Notebook/     # Fabric notebook
│       └── notebook-content.py         # Notebook implementation
├── venv/                               # Virtual environment (not in git)
├── .gitignore                          # Git ignore rules
└── README.md                           # This file
```
## Getting Started

### Prerequisites

- Python 3.8 or higher
- Git
- Access to Microsoft Fabric (for Fabric deployment)
1. **Clone the repository**

   ```bash
   git clone <your-repo-url>
   cd wheel-file-demo
   ```

2. **Set up a virtual environment**

   ```bash
   # Create virtual environment
   python -m venv venv

   # Activate (Windows)
   venv\Scripts\activate

   # Activate (Linux/Mac)
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   # Navigate to python directory
   cd python

   # Install core dependencies
   pip install -r requirements.txt

   # Install dev dependencies
   pip install -r requirements-dev.txt

   # Install package in editable mode
   pip install -e .
   ```

4. **Run tests**

   ```bash
   # From the python directory
   pytest tests/ -v
   ```

5. **Build the wheel**

   ```bash
   # From the python directory
   python scripts/build_wheel.py
   ```
## Local Development

1. Activate the virtual environment (if not already active):

   ```bash
   venv\Scripts\activate     # Windows
   source venv/bin/activate  # Linux/Mac
   ```

2. Navigate to the python directory:

   ```bash
   cd python
   ```

3. Install in editable mode for development:

   ```bash
   pip install -e .
   ```

   This allows you to make changes to the code without reinstalling.
Create a Python script or use the Python REPL:

```python
from marvel_pipeline import extract, clean, validate

# Extract data
df = extract.fetch_marvel_data(
    url=extract.DEFAULT_MARVEL_URL,
    output_path="data/marvel-raw.csv"
)

# Clean data
df_cleaned = clean.clean_marvel_data(df)

# Save cleaned data
df_cleaned.to_csv("data/marvel-cleaned.csv", index=False)

# Validate
result = validate.run_validation("data/marvel-cleaned.csv")
print(f"Validation: {'PASSED' if result['success'] else 'FAILED'}")
```

Code style guidelines:

- Follow PEP 8 guidelines
- Use type hints for function parameters and returns
- Add docstrings to all functions (NumPy style)
- Keep functions focused and modular
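For instance, a small helper following these conventions might look like the sketch below. The function name and mapping rules here are illustrative only, not part of the actual package:

```python
# Illustrative only: a hypothetical cleaning helper showing the project's
# conventions (type hints, NumPy-style docstring, focused scope).
from typing import Optional


def normalize_status(value: Optional[str]) -> str:
    """Standardize a raw status value.

    Parameters
    ----------
    value : str or None
        Raw status string from the source data.

    Returns
    -------
    str
        One of ``"Living"``, ``"Deceased"``, or ``"Unknown"``.
    """
    if value is None or not value.strip():
        return "Unknown"
    cleaned = value.strip().lower()
    if "living" in cleaned or "alive" in cleaned:
        return "Living"
    if "deceased" in cleaned or "dead" in cleaned:
        return "Deceased"
    return "Unknown"
```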
## Testing

Run the full suite:

```bash
pytest tests/ -v
```

Run specific test files:

```bash
pytest tests/test_extract.py -v
pytest tests/test_clean.py -v
pytest tests/test_integration.py -v
```

Run tests by marker:

```bash
# Unit tests only
pytest tests/ -m unit -v

# Integration tests only
pytest tests/ -m integration -v

# Skip slow tests
pytest tests/ -m "not slow" -v
```

Generate a coverage report:

```bash
# From python directory
pytest tests/ --cov=src/marvel_pipeline --cov-report=html
```

Then open `python/htmlcov/index.html` in your browser to view it.

Test organization:

- **Unit Tests**: Test individual functions in isolation
- **Integration Tests**: Test the complete pipeline end-to-end
- **Fixtures**: Reusable test data in `conftest.py`
- **Mocking**: Mock external requests for fast, reliable tests
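As a sketch of the mocking approach (the actual tests in `tests/test_extract.py` may differ; `fetch_csv` here is a stand-in, not the package's real function), a network call can be faked so no real HTTP request is made:

```python
# Sketch: patch the download call so the test is fast and deterministic.
# Function names are illustrative, not the actual test suite.
import io
import urllib.request
from unittest.mock import patch

import pandas as pd

SAMPLE_CSV = "name,ALIGN,APPEARANCES\nSpider-Man,Good Characters,4043\n"


def fetch_csv(url: str) -> pd.DataFrame:
    """Minimal stand-in for the extract module's download logic."""
    with urllib.request.urlopen(url) as response:
        return pd.read_csv(io.StringIO(response.read().decode("utf-8")))


def test_fetch_csv_parses_response() -> None:
    with patch("urllib.request.urlopen") as mock_urlopen:
        # The mocked response yields our sample bytes instead of hitting the web.
        mock_urlopen.return_value.__enter__.return_value.read.return_value = (
            SAMPLE_CSV.encode("utf-8")
        )
        df = fetch_csv("https://example.com/marvel.csv")
    assert df.shape == (1, 3)
    assert df.loc[0, "name"] == "Spider-Man"
```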
## Building the Wheel

The build script automates the entire packaging process:

```bash
# From the python directory
cd python
python scripts/build_wheel.py
```

What it does:

- ✓ Cleans old build artifacts
- ✓ Runs the test suite (fails if tests fail)
- ✓ Builds the wheel package
- ✓ Copies the wheel to the Fabric Environment directory
- ✓ Reports success with next steps

Options:

```bash
# Skip tests (not recommended)
python scripts/build_wheel.py --skip-tests
```

Output:

- Wheel file: `python/dist/marvel_pipeline-0.1.0-py3-none-any.whl`
- Deployed to: `fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/`
## Fabric Deployment

1. **Build the wheel** (if not already done):

   ```bash
   cd python
   python scripts/build_wheel.py
   ```

2. **Commit changes**, including the wheel:

   ```bash
   git add fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/*.whl
   git commit -m "Deploy marvel-pipeline v0.1.0"
   git push
   ```

3. **Sync the workspace in Fabric**:
   - Go to your Fabric workspace
   - Navigate to Source Control settings
   - Click "Sync" to pull the latest changes

4. **Attach the Environment to the notebook**:
   - Open the notebook `nb-marvel-extract`
   - Click "Environment" in the toolbar
   - Select `env-example`
   - Wait for the environment to attach

5. **Run the notebook**:
   - Execute cells to run the pipeline
   - Data is saved to the lakehouse at `/lakehouse/default/Files/`
The `fabric-artifacts/env-example.Environment` includes:

- **Custom Libraries**: the marvel-pipeline wheel package
- **Spark Configuration**: 8 cores, 56 GB memory
- **Runtime**: Fabric Runtime 1.3

The `fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/` directory contains wheel files that are loaded automatically when the Environment is attached to a notebook.

**Important notes:**

- Wheel files in this directory ARE tracked by git (unlike typical Python packages)
- This is required for Fabric workspace sync
- When you rebuild the wheel, the old version is removed automatically
- The build script (`scripts/build_wheel.py`) handles copying wheels here

**Updating the package:**

1. Make code changes in `python/src/marvel_pipeline/`
2. Update tests in `python/tests/`
3. Run `cd python && python scripts/build_wheel.py`
4. The wheel is copied to CustomLibraries automatically
5. Commit and sync with Fabric
## Usage

In a Fabric notebook:

```python
from marvel_pipeline import extract, clean, validate

# Configuration
MARVEL_URL = "https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv"
RAW_PATH = "/lakehouse/default/Files/marvel-raw.csv"
CLEANED_PATH = "/lakehouse/default/Files/marvel-cleaned.csv"

# Extract
df = extract.fetch_marvel_data(url=MARVEL_URL, output_path=RAW_PATH)

# Clean
df_cleaned = clean.clean_marvel_data(df)
df_cleaned.to_csv(CLEANED_PATH, index=False)

# Validate
result = validate.run_validation(CLEANED_PATH)
```

Locally:

```python
from marvel_pipeline import extract, clean, validate

# Extract from web
df = extract.fetch_marvel_data()

# Or load from a local file
df = extract.load_local_csv("path/to/data.csv")

# Clean
df_cleaned = clean.clean_marvel_data(df)

# Validate the DataFrame directly
result = validate.validate_dataframe(df_cleaned)
```

## Data Pipeline Stages

### Extract

Module: `marvel_pipeline.extract`
- Fetches data from URL via HTTP requests
- Saves raw CSV to specified location
- Returns pandas DataFrame
- Handles network errors gracefully
### Clean

Module: `marvel_pipeline.clean`

Cleaning functions:

- `clean_names()`: Strip whitespace, standardize formatting
- `clean_identity()`: Standardize to Secret/Public/Unknown
- `clean_alignment()`: Standardize to Good/Bad/Neutral/Unknown
- `clean_appearances()`: Convert to numeric, handle nulls
- `clean_year()`: Parse dates, validate range (1939-2026)
- `clean_status()`: Standardize to Living/Deceased/Unknown

Main function: `clean_marvel_data()`

- Applies all cleaning functions
- Removes duplicate rows
- Resets the DataFrame index
- Logs a cleaning summary
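As an example of one such step, here is a sketch of alignment cleaning, assuming the FiveThirtyEight `ALIGN` column with raw values like "Good Characters". The real `clean_alignment()` may differ in name handling and defaults:

```python
# Sketch of one cleaning step; the real clean.clean_alignment may differ.
# Raw values such as "Good Characters" are standardized to
# Good/Bad/Neutral/Unknown.
import pandas as pd

_ALIGNMENT_MAP = {
    "good characters": "Good",
    "bad characters": "Bad",
    "neutral characters": "Neutral",
}


def clean_alignment(df: pd.DataFrame, column: str = "ALIGN") -> pd.DataFrame:
    """Standardize the alignment column to Good/Bad/Neutral/Unknown."""
    out = df.copy()
    normalized = out[column].fillna("").str.strip().str.lower()
    # Anything not in the map (including nulls) falls back to "Unknown".
    out[column] = normalized.map(_ALIGNMENT_MAP).fillna("Unknown")
    return out
```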
### Validate

Module: `marvel_pipeline.validate`

Validation checks:
- DataFrame not empty
- Expected columns present
- No completely null rows
- Name column has values
- Appearances are numeric and non-negative
- Year values in reasonable range
Validation Modes:
- Great Expectations (if configured)
- Basic validation (fallback)
Behavior:

- Raises `ValidationError` on failure
- Logs warnings for non-critical issues
- Returns a validation results dict
The pipeline includes basic validation by default. For advanced data quality monitoring, you can configure Great Expectations:
```bash
# Run initialization script from python directory
cd python
python scripts/init_great_expectations.py
```

This creates the `python/great_expectations/` directory with:
- Configuration files
- Datasource definitions
- Expectation suites
- Checkpoint configurations
Edit `python/great_expectations/great_expectations.yml` to add CSV datasources for the Marvel data.
```bash
# Interactive suite creation
great_expectations suite new

# Name it 'marvel_suite' to match the validation module
```

Define expectations for your data:

- Column existence (`expect_column_to_exist`)
- Data types (`expect_column_values_to_be_of_type`)
- Null constraints (`expect_column_values_to_not_be_null`)
- Categorical values (`expect_column_values_to_be_in_set`)
- Numeric ranges (`expect_column_values_to_be_between`)
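For reference, Great Expectations persists rules like these as a JSON suite. An illustrative fragment follows; the column names and value sets are assumptions based on the FiveThirtyEight dataset and the cleaning rules above, not the project's actual suite:

```json
{
  "expectation_suite_name": "marvel_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {"column": "name"}
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {"column": "ALIGN", "value_set": ["Good", "Bad", "Neutral", "Unknown"]}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "Year", "min_value": 1939, "max_value": 2026}
    }
  ]
}
```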
```bash
# Create checkpoint for automated validation
great_expectations checkpoint new marvel_checkpoint
```

```bash
# Build interactive documentation
great_expectations docs build
```

This opens a browser with data quality reports and validation history.
The `marvel_pipeline.validate` module automatically:

- Looks for the Great Expectations config in the `python/great_expectations/` directory
- Uses the checkpoint named `marvel_checkpoint`
- Uses the expectation suite named `marvel_suite`
- Falls back to basic validation if GE is not configured

Note: The `uncommitted/` subdirectory contains local data docs and validation results. These files are ignored by git and should not be committed.
## Troubleshooting

**Virtual environment won't activate**

Windows:

```powershell
# If script execution is disabled
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Then activate
venv\Scripts\activate
```

Linux/Mac:

```bash
source venv/bin/activate
```

**Import errors for `marvel_pipeline`**

```bash
# Navigate to python directory and reinstall package in editable mode
cd python
pip install -e .
```

**Test failures**

```bash
# Navigate to python directory
cd python

# Check dependencies
pip install -r requirements.txt -r requirements-dev.txt

# Run with verbose output
pytest tests/ -v -s

# Run a specific failing test
pytest tests/test_clean.py::test_clean_names -v
```
**Wheel not found in Fabric**

1. Verify the wheel is in the correct directory: `fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/`
2. Check that the Environment is attached to the notebook
3. Restart the notebook kernel
4. Verify the git sync completed in the Fabric workspace
**Great Expectations issues**

```bash
# Navigate to python directory
cd python

# Initialize GE (optional)
python scripts/init_great_expectations.py

# Or install GE
pip install great-expectations
```

**General debugging tips**

- Check error logs for detailed messages
- Review test output: `pytest tests/ -v`
- Inspect the data: `df.info()`, `df.head()`, `df.describe()`
- Enable debug logging in the modules
## Future Enhancements

- Add support for additional comic datasets (DC Comics)
- Implement data profiling reports
- Add a CLI interface for command-line execution
- Create a Fabric pipeline for scheduled execution
- Add data lineage tracking
- Implement incremental data updates
## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Run the tests: `pytest tests/ -v`
5. Commit changes: `git commit -m 'Add amazing feature'`
6. Push to the branch: `git push origin feature/amazing-feature`
7. Open a Pull Request
## License

This project is provided as-is for educational and development purposes.

## Acknowledgments

- **Data Source**: FiveThirtyEight for the Marvel character dataset
- **Tools**: Great Expectations, pytest, pandas, Microsoft Fabric
Built with ❤️ for data quality and reproducible pipelines