A comprehensive Python data pipeline for extracting, cleaning, and validating Marvel character data from FiveThirtyEight's dataset. This project includes pytest unit tests, Great Expectations data validation, and automated wheel packaging for deployment to Microsoft Fabric environments.
## Table of Contents

- Project Overview
- Features
- Project Structure
- Getting Started
- Local Development
- Testing
- Building the Wheel
- Fabric Deployment
- Usage
- Data Pipeline Stages
- Troubleshooting
## Project Overview

This project extracts Marvel character data from the FiveThirtyEight dataset, applies comprehensive data cleaning transformations, and validates the output using both pytest (for code testing) and Great Expectations (for data quality validation).
Data Source: FiveThirtyEight Marvel Dataset
Key Capabilities:
- Extract data from web sources or local files
- Clean and standardize character names, identities, alignments, appearances, years, and status
- Validate data quality with configurable rules
- Package as a wheel for deployment to Fabric environments
- Fail-fast validation to prevent bad data propagation
## Features

- **Modular Design**: Separate modules for extraction, cleaning, and validation
- **Comprehensive Testing**: Full pytest suite with unit and integration tests
- **Data Quality**: Great Expectations integration for production validation
- **Automated Build**: Single-command wheel building and deployment
- **Fabric Integration**: Ready-to-use Environment configuration
- **Type Safety**: Type hints throughout the codebase
- **Logging**: Detailed logging at every pipeline stage
- **Error Handling**: Graceful error handling with informative messages
## Project Structure

```text
wheel-file-demo/
├── python/                             # Python package and development
│   ├── src/
│   │   └── marvel_pipeline/            # Main package
│   │       ├── __init__.py             # Package initialization
│   │       ├── extract.py              # Data extraction module
│   │       ├── clean.py                # Data cleaning module
│   │       └── validate.py             # Data validation module
│   ├── tests/                          # Test suite
│   │   ├── conftest.py                 # Pytest fixtures
│   │   ├── test_extract.py             # Extract module tests
│   │   ├── test_clean.py               # Clean module tests
│   │   └── test_integration.py         # Integration tests
│   ├── scripts/                        # Utility scripts
│   │   ├── build_wheel.py              # Wheel build script
│   │   └── init_great_expectations.py  # GE initialization
│   ├── dist/                           # Built wheel packages
│   ├── build/                          # Build artifacts
│   ├── requirements.txt                # Core dependencies
│   ├── requirements-dev.txt            # Dev dependencies
│   ├── setup.py                        # Package setup
│   ├── pyproject.toml                  # Build configuration
│   └── pytest.ini                      # Pytest configuration
├── fabric-artifacts/                   # Microsoft Fabric resources
│   ├── env-example.Environment/        # Fabric Environment
│   │   ├── .platform                   # Environment metadata
│   │   ├── Setting/
│   │   │   └── Sparkcompute.yml        # Spark configuration
│   │   └── Libraries/
│   │       └── CustomLibraries/        # Wheel packages go here
│   └── nb-marvel-extract.Notebook/     # Fabric notebook
│       └── notebook-content.py         # Notebook implementation
├── venv/                               # Virtual environment (not in git)
├── .gitignore                          # Git ignore rules
└── README.md                           # This file
```
## Getting Started

### Prerequisites

- Python 3.8 or higher
- Git
- Access to Microsoft Fabric (for Fabric deployment)
1. **Clone the repository**

   ```bash
   git clone <your-repo-url>
   cd wheel-file-demo
   ```

2. **Set up a virtual environment**

   ```bash
   # Create virtual environment
   python -m venv venv

   # Activate (Windows)
   venv\Scripts\activate

   # Activate (Linux/Mac)
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   # Navigate to python directory
   cd python

   # Install core dependencies
   pip install -r requirements.txt

   # Install dev dependencies
   pip install -r requirements-dev.txt

   # Install package in editable mode
   pip install -e .
   ```

4. **Run tests**

   ```bash
   # From the python directory
   pytest tests/ -v
   ```

5. **Build the wheel**

   ```bash
   # From the python directory
   python scripts/build_wheel.py
   ```
## Local Development

1. Activate the virtual environment (if not already active):

   ```bash
   venv\Scripts\activate     # Windows
   source venv/bin/activate  # Linux/Mac
   ```

2. Navigate to the python directory:

   ```bash
   cd python
   ```

3. Install in editable mode for development:

   ```bash
   pip install -e .
   ```

   This allows you to make changes to the code without reinstalling.
Create a Python script or use the Python REPL:

```python
from marvel_pipeline import extract, clean, validate

# Extract data
df = extract.fetch_marvel_data(
    url=extract.DEFAULT_MARVEL_URL,
    output_path="data/marvel-raw.csv"
)

# Clean data
df_cleaned = clean.clean_marvel_data(df)

# Save cleaned data
df_cleaned.to_csv("data/marvel-cleaned.csv", index=False)

# Validate
result = validate.run_validation("data/marvel-cleaned.csv")
print(f"Validation: {'PASSED' if result['success'] else 'FAILED'}")
```

Code style guidelines:

- Follow PEP 8 guidelines
- Use type hints for function parameters and returns
- Add docstrings to all functions (NumPy style)
- Keep functions focused and modular
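For instance, a small helper following these conventions might look like the sketch below. The function name and mapping rules here are illustrative only, not part of the actual package:

```python
# Illustrative only: a hypothetical cleaning helper showing the project's
# conventions (type hints, NumPy-style docstring, focused scope).
from typing import Optional


def normalize_status(value: Optional[str]) -> str:
    """Standardize a raw status value.

    Parameters
    ----------
    value : str or None
        Raw status string from the source data.

    Returns
    -------
    str
        One of ``"Living"``, ``"Deceased"``, or ``"Unknown"``.
    """
    if value is None or not value.strip():
        return "Unknown"
    cleaned = value.strip().lower()
    if "living" in cleaned or "alive" in cleaned:
        return "Living"
    if "deceased" in cleaned or "dead" in cleaned:
        return "Deceased"
    return "Unknown"
```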
## Testing

Run the full suite:

```bash
pytest tests/ -v
```

Run specific test files:

```bash
pytest tests/test_extract.py -v
pytest tests/test_clean.py -v
pytest tests/test_integration.py -v
```

Run tests by marker:

```bash
# Unit tests only
pytest tests/ -m unit -v

# Integration tests only
pytest tests/ -m integration -v

# Skip slow tests
pytest tests/ -m "not slow" -v
```

Generate a coverage report:

```bash
# From python directory
pytest tests/ --cov=src/marvel_pipeline --cov-report=html
```

Then open `python/htmlcov/index.html` in your browser to view it.

Test organization:

- **Unit Tests**: Test individual functions in isolation
- **Integration Tests**: Test the complete pipeline end-to-end
- **Fixtures**: Reusable test data in `conftest.py`
- **Mocking**: Mock external requests for fast, reliable tests
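As a sketch of the mocking approach (the actual tests in `tests/test_extract.py` may differ; `fetch_csv` here is a stand-in, not the package's real function), a network call can be faked so no real HTTP request is made:

```python
# Sketch: patch the download call so the test is fast and deterministic.
# Function names are illustrative, not the actual test suite.
import io
import urllib.request
from unittest.mock import patch

import pandas as pd

SAMPLE_CSV = "name,ALIGN,APPEARANCES\nSpider-Man,Good Characters,4043\n"


def fetch_csv(url: str) -> pd.DataFrame:
    """Minimal stand-in for the extract module's download logic."""
    with urllib.request.urlopen(url) as response:
        return pd.read_csv(io.StringIO(response.read().decode("utf-8")))


def test_fetch_csv_parses_response() -> None:
    with patch("urllib.request.urlopen") as mock_urlopen:
        # The mocked response yields our sample bytes instead of hitting the web.
        mock_urlopen.return_value.__enter__.return_value.read.return_value = (
            SAMPLE_CSV.encode("utf-8")
        )
        df = fetch_csv("https://example.com/marvel.csv")
    assert df.shape == (1, 3)
    assert df.loc[0, "name"] == "Spider-Man"
```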
## Building the Wheel

The build script automates the entire packaging process:

```bash
# From the python directory
cd python
python scripts/build_wheel.py
```

What it does:

- ✓ Cleans old build artifacts
- ✓ Runs the test suite (fails if tests fail)
- ✓ Builds the wheel package
- ✓ Copies the wheel to the Fabric Environment directory
- ✓ Reports success with next steps

Options:

```bash
# Skip tests (not recommended)
python scripts/build_wheel.py --skip-tests
```

Output:

- Wheel file: `python/dist/marvel_pipeline-0.1.0-py3-none-any.whl`
- Deployed to: `fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/`
## Fabric Deployment

1. **Build the wheel** (if not already done):

   ```bash
   cd python
   python scripts/build_wheel.py
   ```

2. **Commit changes**, including the wheel:

   ```bash
   git add fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/*.whl
   git commit -m "Deploy marvel-pipeline v0.1.0"
   git push
   ```

3. **Sync the workspace in Fabric**:
   - Go to your Fabric workspace
   - Navigate to Source Control settings
   - Click "Sync" to pull the latest changes

4. **Attach the Environment to the notebook**:
   - Open the notebook `nb-marvel-extract`
   - Click "Environment" in the toolbar
   - Select `env-example`
   - Wait for the environment to attach

5. **Run the notebook**:
   - Execute cells to run the pipeline
   - Data is saved to the lakehouse at `/lakehouse/default/Files/`
The `fabric-artifacts/env-example.Environment` includes:

- **Custom Libraries**: the marvel-pipeline wheel package
- **Spark Configuration**: 8 cores, 56 GB memory
- **Runtime**: Fabric Runtime 1.3

The `fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/` directory contains wheel files that are loaded automatically when the Environment is attached to a notebook.

**Important notes:**

- Wheel files in this directory ARE tracked by git (unlike typical Python packages)
- This is required for Fabric workspace sync
- When you rebuild the wheel, the old version is removed automatically
- The build script (`scripts/build_wheel.py`) handles copying wheels here

**Updating the package:**

1. Make code changes in `python/src/marvel_pipeline/`
2. Update tests in `python/tests/`
3. Run `cd python && python scripts/build_wheel.py`
4. The wheel is copied to CustomLibraries automatically
5. Commit and sync with Fabric
## Usage

In a Fabric notebook:

```python
from marvel_pipeline import extract, clean, validate

# Configuration
MARVEL_URL = "https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv"
RAW_PATH = "/lakehouse/default/Files/marvel-raw.csv"
CLEANED_PATH = "/lakehouse/default/Files/marvel-cleaned.csv"

# Extract
df = extract.fetch_marvel_data(url=MARVEL_URL, output_path=RAW_PATH)

# Clean
df_cleaned = clean.clean_marvel_data(df)
df_cleaned.to_csv(CLEANED_PATH, index=False)

# Validate
result = validate.run_validation(CLEANED_PATH)
```

Locally:

```python
from marvel_pipeline import extract, clean, validate

# Extract from web
df = extract.fetch_marvel_data()

# Or load from a local file
df = extract.load_local_csv("path/to/data.csv")

# Clean
df_cleaned = clean.clean_marvel_data(df)

# Validate the DataFrame directly
result = validate.validate_dataframe(df_cleaned)
```

## Data Pipeline Stages

### Extract

Module: `marvel_pipeline.extract`
- Fetches data from URL via HTTP requests
- Saves raw CSV to specified location
- Returns pandas DataFrame
- Handles network errors gracefully
### Clean

Module: `marvel_pipeline.clean`

Cleaning functions:

- `clean_names()`: Strip whitespace, standardize formatting
- `clean_identity()`: Standardize to Secret/Public/Unknown
- `clean_alignment()`: Standardize to Good/Bad/Neutral/Unknown
- `clean_appearances()`: Convert to numeric, handle nulls
- `clean_year()`: Parse dates, validate range (1939-2026)
- `clean_status()`: Standardize to Living/Deceased/Unknown

Main function: `clean_marvel_data()`

- Applies all cleaning functions
- Removes duplicate rows
- Resets the DataFrame index
- Logs a cleaning summary
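As an example of one such step, here is a sketch of alignment cleaning, assuming the FiveThirtyEight `ALIGN` column with raw values like "Good Characters". The real `clean_alignment()` may differ in name handling and defaults:

```python
# Sketch of one cleaning step; the real clean.clean_alignment may differ.
# Raw values such as "Good Characters" are standardized to
# Good/Bad/Neutral/Unknown.
import pandas as pd

_ALIGNMENT_MAP = {
    "good characters": "Good",
    "bad characters": "Bad",
    "neutral characters": "Neutral",
}


def clean_alignment(df: pd.DataFrame, column: str = "ALIGN") -> pd.DataFrame:
    """Standardize the alignment column to Good/Bad/Neutral/Unknown."""
    out = df.copy()
    normalized = out[column].fillna("").str.strip().str.lower()
    # Anything not in the map (including nulls) falls back to "Unknown".
    out[column] = normalized.map(_ALIGNMENT_MAP).fillna("Unknown")
    return out
```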
### Validate

Module: `marvel_pipeline.validate`

Validation checks:
- DataFrame not empty
- Expected columns present
- No completely null rows
- Name column has values
- Appearances are numeric and non-negative
- Year values in reasonable range
Validation Modes:
- Great Expectations (if configured)
- Basic validation (fallback)
Behavior:

- Raises `ValidationError` on failure
- Logs warnings for non-critical issues
- Returns a validation results dict
The pipeline includes basic validation by default. For advanced data quality monitoring, you can configure Great Expectations:
```bash
# Run initialization script from python directory
cd python
python scripts/init_great_expectations.py
```

This creates the `python/great_expectations/` directory with:
- Configuration files
- Datasource definitions
- Expectation suites
- Checkpoint configurations
Edit `python/great_expectations/great_expectations.yml` to add CSV datasources for the Marvel data.
```bash
# Interactive suite creation
great_expectations suite new

# Name it 'marvel_suite' to match the validation module
```

Define expectations for your data:

- Column existence (`expect_column_to_exist`)
- Data types (`expect_column_values_to_be_of_type`)
- Null constraints (`expect_column_values_to_not_be_null`)
- Categorical values (`expect_column_values_to_be_in_set`)
- Numeric ranges (`expect_column_values_to_be_between`)
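For reference, Great Expectations persists rules like these as a JSON suite. An illustrative fragment follows; the column names and value sets are assumptions based on the FiveThirtyEight dataset and the cleaning rules above, not the project's actual suite:

```json
{
  "expectation_suite_name": "marvel_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {"column": "name"}
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {"column": "ALIGN", "value_set": ["Good", "Bad", "Neutral", "Unknown"]}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "Year", "min_value": 1939, "max_value": 2026}
    }
  ]
}
```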
```bash
# Create checkpoint for automated validation
great_expectations checkpoint new marvel_checkpoint
```

```bash
# Build interactive documentation
great_expectations docs build
```

This opens a browser with data quality reports and validation history.
The `marvel_pipeline.validate` module automatically:

- Looks for the Great Expectations config in the `python/great_expectations/` directory
- Uses the checkpoint named `marvel_checkpoint`
- Uses the expectation suite named `marvel_suite`
- Falls back to basic validation if GE is not configured

Note: The `uncommitted/` subdirectory contains local data docs and validation results. These files are ignored by git and should not be committed.
## Troubleshooting

**Virtual environment won't activate**

Windows:

```powershell
# If script execution is disabled
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Then activate
venv\Scripts\activate
```

Linux/Mac:

```bash
source venv/bin/activate
```

**Import errors for `marvel_pipeline`**

```bash
# Navigate to python directory and reinstall package in editable mode
cd python
pip install -e .
```

**Test failures**

```bash
# Navigate to python directory
cd python

# Check dependencies
pip install -r requirements.txt -r requirements-dev.txt

# Run with verbose output
pytest tests/ -v -s

# Run a specific failing test
pytest tests/test_clean.py::test_clean_names -v
```
**Wheel not found in Fabric**

1. Verify the wheel is in the correct directory: `fabric-artifacts/env-example.Environment/Libraries/CustomLibraries/`
2. Check that the Environment is attached to the notebook
3. Restart the notebook kernel
4. Verify the git sync completed in the Fabric workspace
**Great Expectations issues**

```bash
# Navigate to python directory
cd python

# Initialize GE (optional)
python scripts/init_great_expectations.py

# Or install GE
pip install great-expectations
```

**General debugging tips**

- Check error logs for detailed messages
- Review test output: `pytest tests/ -v`
- Inspect the data: `df.info()`, `df.head()`, `df.describe()`
- Enable debug logging in the modules
## Future Enhancements

- Add support for additional comic datasets (DC Comics)
- Implement data profiling reports
- Add a CLI interface for command-line execution
- Create a Fabric pipeline for scheduled execution
- Add data lineage tracking
- Implement incremental data updates
## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Run the tests: `pytest tests/ -v`
5. Commit changes: `git commit -m 'Add amazing feature'`
6. Push to the branch: `git push origin feature/amazing-feature`
7. Open a Pull Request
## License

This project is provided as-is for educational and development purposes.

## Acknowledgments

- **Data Source**: FiveThirtyEight for the Marvel character dataset
- **Tools**: Great Expectations, pytest, pandas, Microsoft Fabric
Built with ❤️ for data quality and reproducible pipelines