# Mayhem Project - Attribute Conflation Pipeline (Colab Setup)

This notebook runs the complete attribute conflation pipeline for the Mayhem project.
It trains ML models on the full Yelp dataset (~150k records) and evaluates them against real-world Overture data.

## 1. Setup Environment

Clones the repository, checks out the development branch, pulls the Git LFS data, and installs dependencies.


In [None]:
# Repository URL
REPO_URL = "https://github.com/project-terraforma/Mayhem_Attribute_Conflation.git"
REPO_NAME = REPO_URL.split('/')[-1].replace('.git', '')
BRANCH_NAME = "feature/local-pipeline-dev"  # Active development branch

# Clone and Setup
!git clone {REPO_URL}
%cd {REPO_NAME}
!git checkout {BRANCH_NAME}
!git lfs pull  # Ensure Yelp data is downloaded

# Install Dependencies
!pip install -r requirements.txt
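
If `git lfs pull` does not complete, the data files stay behind as small text pointer stubs, and the pipeline may fail later with confusing parse errors. The check below is a minimal sketch for catching that early. It relies only on the Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`. The function names are illustrative, and which directory holds the Yelp data is an assumption you need to fill in for this repository.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file (per the Git LFS pointer spec).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like an un-fetched Git LFS pointer stub."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root, pattern="*"):
    """List files under `root` that are still LFS pointers.

    `root` and `pattern` are caller-supplied; nothing here assumes a
    particular repository layout.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob(pattern) if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers("data")` (assuming the dataset lives under a `data` directory) should return an empty list after a successful pull. Any paths it does return are files that still need to be fetched.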

## 2. Run Full Pipeline (Massive Scale)

This single command orchestrates the entire workflow:
1.  **Generate Synthetic Data:** Creates 150,000+ training records from the Yelp dataset.
2.  **Extract Features:** Processes features for Name, Phone, Website, Address, and Category.
3.  **Train Models:** Trains Logistic Regression, Random Forest, and Gradient Boosting models on the full dataset.
4.  **Evaluate:** Tests models and rule-based baselines on the 200-record manual Golden Dataset.
5.  **Inference:** Runs final predictions on the 2,000 Overture Project B samples.

**Note:** A full run may take 30-60 minutes because of the dataset size.


In [None]:
# Run pipeline with ALL Yelp records (limit=0 means no limit)
!python scripts/run_algorithm_pipeline.py --synthetic-limit 0
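
The "0 means no limit" convention used by `--synthetic-limit` can be sketched as follows. This is not the code in `scripts/run_algorithm_pipeline.py`, whose internals are not shown here. The parser and the `take_records` helper are hypothetical and illustrate one common way such a flag is handled.

```python
import argparse
from itertools import islice


def parse_args(argv=None):
    """Hypothetical parser mirroring the --synthetic-limit flag."""
    parser = argparse.ArgumentParser(description="Illustrative pipeline entry point")
    parser.add_argument(
        "--synthetic-limit",
        type=int,
        default=0,
        help="Maximum number of source records to use; 0 means no limit",
    )
    return parser.parse_args(argv)


def take_records(records, limit):
    """Yield at most `limit` records, or all of them when limit is 0."""
    if limit < 0:
        raise ValueError("limit must be zero or a positive integer")
    if limit == 0:
        return iter(records)
    return islice(records, limit)
```

If the real script treats positive values the same way, a small value such as `--synthetic-limit 1000` could serve as a quick smoke test before the full run. That behavior is an assumption, so check the script's argument handling first.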

## 3. Analyze Results

Generates a summary table of F1 score, accuracy, and compute metrics for every approach.


In [None]:
!python scripts/analyze_results.py
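
As a reference for reading the summary table, the sketch below computes accuracy and F1 for binary labels in plain Python. It is not the logic of `scripts/analyze_results.py`, which is not shown here. The function name and the 0/1 label encoding are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and accuracy for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every division so empty or one-sided inputs return 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either one is low. Accuracy can stay high on an imbalanced label set even when the positive class is predicted poorly, which is why the two are worth reading side by side.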

## 4. Download Results (Optional)

Zips the evaluation results and trained models (`data/results/` and `models/ml/`) and downloads the archive to your local machine.


In [None]:
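# Bundle evaluation results and trained models into one archive
!zip -r mayhem_results.zip data/results/ models/ml/

# Trigger a browser download (works only inside Google Colab)
from google.colab import files
files.download('mayhem_results.zip')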
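
Outside Colab, `google.colab` cannot be imported and the `zip` command may not be available. The sketch below builds the same kind of archive with only the Python standard library. The `zip_directories` helper is hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def zip_directories(directories, archive_path):
    """Write every file under the given directories into one zip archive.

    Missing directories are skipped. Returns the number of files written.
    """
    written = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for directory in directories:
            directory = Path(directory)
            if not directory.is_dir():
                continue  # Nothing to archive for this entry
            for file in sorted(directory.rglob("*")):
                if file.is_file():
                    # Keep the path inside the archive so layout is preserved
                    zf.write(file, arcname=file.as_posix())
                    written += 1
    return written
```

Run from the repository root, `zip_directories(["data/results", "models/ml"], "mayhem_results.zip")` would produce an archive comparable to the one created by the cell above. Keep the archive path outside the directories being zipped so the archive does not try to include itself.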