# Mayhem Project - Attribute Conflation Pipeline (Colab Setup)

This notebook guides you through setting up and running the attribute conflation pipeline for the Mayhem project in Google Colab.
It covers cloning the repository, installing dependencies, and running the full pipeline for a single attribute (`name`) as a demonstration.
You can extend this to run for all attributes and further experiments.

## 1. Clone the Repository

First, we'll clone the GitHub repository. Make sure `REPO_URL` in the cell below is correct.
If your repository is private, generate a Personal Access Token (PAT) on GitHub and embed it in the URL (e.g., `https://<PAT>@github.com/your-username/your-repo.git`).

The cell defaults to the `project-terraforma` repository. If you work from a fork or a different repository, replace the value of `REPO_URL` with its URL.


In [None]:
# Replace with your actual repository URL
REPO_URL = "https://github.com/project-terraforma/Mayhem_Attribute_Conflation.git"
# Take the last path component and strip only a trailing ".git"
REPO_NAME = REPO_URL.rstrip('/').split('/')[-1].removesuffix('.git')

!git clone {REPO_URL}
%cd {REPO_NAME}

# IMPORTANT: Check out the working branch to get the latest changes
BRANCH = "local_model_annotation"
print(f"Checking out branch: {BRANCH}")
!git checkout {BRANCH}
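
For a private repository, the PAT-in-URL form mentioned above can be built in code instead of typed into the notebook. The following is a minimal sketch, not part of the repository: `with_token` is a hypothetical helper, and it assumes an `https://` URL of the form shown in the example.

```python
from urllib.parse import urlsplit, urlunsplit


def with_token(repo_url: str, token: str) -> str:
    """Return repo_url with the token embedded as https://<token>@host/path.

    Hypothetical helper for illustration; it is not part of the Mayhem repo.
    """
    parts = urlsplit(repo_url)
    if parts.scheme != "https":
        raise ValueError("expected an https:// repository URL")
    # Drop any credentials already present in the URL, then prepend the token.
    host = parts.netloc.rsplit("@", 1)[-1]
    return urlunsplit(
        (parts.scheme, f"{token}@{host}", parts.path, parts.query, parts.fragment)
    )


# In Colab, read the token interactively so it is not stored in a cell:
#   from getpass import getpass
#   AUTH_URL = with_token(REPO_URL, getpass("GitHub PAT: "))
#   !git clone {AUTH_URL}
```

Reading the token with `getpass` keeps it out of the saved notebook, although it still ends up in the clone's `.git/config` as part of the remote URL.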

## 2. Install Dependencies

Next, we'll install all the necessary Python packages listed in your `requirements.txt` file.


In [None]:
!pip install -r requirements.txt
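
If a later step fails with an import error, it can help to confirm that the install actually succeeded. The sketch below is an optional check, not something the repository provides: `missing_packages` is a hypothetical helper that only tests whether each listed distribution is installed, and it does not verify version constraints.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def missing_packages(requirement_lines):
    """Return names from requirement lines that are not installed."""
    missing = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r/-e
        # Keep only the name: cut at the first specifier, extra, marker, or space.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


# Usage from inside the repository directory:
#   with open("requirements.txt") as fh:
#       print(missing_packages(fh) or "all requirements installed")
```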

## 3. Run the Pipeline for 'name' Attribute

This section runs the full pipeline for the `name` attribute:
1.  Generate synthetic training data from Yelp.
2.  Process this synthetic data into features.
3.  Train ML models.
4.  Evaluate the best model on the 200 real-world records.
5.  Run inference on the 2,000 Overture records.

The pipeline also logs inference time and memory usage.

In [None]:
ATTRIBUTE = 'name'

print(f"\n--- Running pipeline for attribute: {ATTRIBUTE} ---")

# 1. Generate synthetic dataset
!python scripts/generate_synthetic_dataset.py

# 2. Process synthetic data into features
!python -m scripts.process_synthetic_data --attribute {ATTRIBUTE}

# 3. Train ML models
!python scripts/train_models.py --features data/processed/features_{ATTRIBUTE}_synthetic.parquet --output-dir models/ml_models/{ATTRIBUTE}

# 4. Evaluate on real-world data (200 records)
!python -m scripts.evaluate_real_data --attribute {ATTRIBUTE} --model models/ml_models/{ATTRIBUTE}/best_model_gradient_boosting.joblib

# 5. Run inference on 2,000 Overture records
!python -m scripts.run_inference --attribute {ATTRIBUTE} --model models/ml_models/{ATTRIBUTE}/best_model_gradient_boosting.joblib
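
The pipeline scripts handle their own logging of inference time and memory usage. To take a similar measurement around your own code in the notebook, here is a minimal sketch using only the standard library. `measure` is a hypothetical helper, and the repository's scripts may measure these quantities differently. Note that `tracemalloc` only sees allocations made through Python's memory allocator.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload; in practice this would be a model's predict call.
result, elapsed, peak = measure(lambda n: [i * i for i in range(n)], 2000)
print(f"{len(result)} records in {elapsed:.4f}s, peak {peak / 1024:.1f} KiB")
```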

## 4. View Results

The generated files are now in the repository directory:
*   `data/synthetic_golden_dataset_2k.json` (The synthetic training data)
*   `data/processed/features_name_synthetic.parquet` (Extracted features)
*   `models/ml_models/name/` (Trained model and summary)
*   `data/results/final_evaluation_report.txt` (Evaluation on 200 real records)
*   `data/results/final_conflated_names_2k.json` (Inference results on 2,000 Overture records)
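
To confirm that every step produced its output, you can check the paths listed above. This is a small sketch with a hypothetical helper, `check_outputs`. It assumes the current working directory is the repository root, which is the case after the `%cd` in section 1.

```python
from pathlib import Path

# Paths copied from the list above.
EXPECTED_OUTPUTS = [
    "data/synthetic_golden_dataset_2k.json",
    "data/processed/features_name_synthetic.parquet",
    "models/ml_models/name",
    "data/results/final_evaluation_report.txt",
    "data/results/final_conflated_names_2k.json",
]


def check_outputs(root=".", expected=EXPECTED_OUTPUTS):
    """Map each expected relative path to True if it exists under root."""
    return {rel: (Path(root) / rel).exists() for rel in expected}


for rel, found in check_outputs().items():
    print(f"{'ok     ' if found else 'MISSING'}  {rel}")
```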

You can also commit these changes back to your GitHub repository from Colab if you configure git credentials.

## 5. Next Steps

To run the pipeline for the remaining attributes (phone, website, address, category):
1.  Copy and paste the code block from section 3.
2.  Change `ATTRIBUTE = 'name'` to `ATTRIBUTE = 'phone'`, then `ATTRIBUTE = 'website'`, etc.
3.  After running for all attributes, run the consolidation script:
    ```bash
    !python scripts/consolidate_results.py
    ```
4.  **Scaling and Refinement:**
    *   Modify `scripts/generate_synthetic_dataset.py` to generate a larger dataset (e.g., 10,000 or 50,000 records) to leverage the full Yelp dataset.
    *   Experiment with different ML models (Varnit's `scripts/train_models.py` already supports Logistic Regression, Random Forest, Gradient Boosting).
    *   Incorporate rule-based baselines (`scripts/baseline_heuristics.py` from Varnit's branch) and compare performance.
    *   Perform error analysis (OKR 2, KR2) on misclassified records from `scripts/evaluate_real_data.py` to identify patterns and refine features or rules.
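
Instead of copying the section 3 cell once per attribute, the per-attribute commands can be generated in a loop. The sketch below rests on a few assumptions: that the synthetic dataset (step 1) only needs to be generated once because its command takes no attribute argument, and that the best model file is named `best_model_gradient_boosting.joblib` for every attribute, as it is in the `name` example. `pipeline_commands` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess

ATTRIBUTES = ["name", "phone", "website", "address", "category"]


def pipeline_commands(attribute):
    """Return steps 2-5 from section 3 for one attribute, as argument lists."""
    # Assumption: gradient boosting is the saved best model for every attribute.
    model = f"models/ml_models/{attribute}/best_model_gradient_boosting.joblib"
    return [
        ["python", "-m", "scripts.process_synthetic_data", "--attribute", attribute],
        ["python", "scripts/train_models.py",
         "--features", f"data/processed/features_{attribute}_synthetic.parquet",
         "--output-dir", f"models/ml_models/{attribute}"],
        ["python", "-m", "scripts.evaluate_real_data",
         "--attribute", attribute, "--model", model],
        ["python", "-m", "scripts.run_inference",
         "--attribute", attribute, "--model", model],
    ]


def run_all(attributes=ATTRIBUTES, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for attribute in attributes:
        for cmd in pipeline_commands(attribute):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)  # stop at the first failure


run_all()  # dry run: prints the commands without executing them
```

Call `run_all(dry_run=False)` from the repository root to execute the commands. If a different model wins for some attribute, adjust the model path for that attribute before running evaluation and inference.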
