# Restaurant Graph Pipeline (fz dataset)

## Notebook Overview & Requirements

This notebook builds the full pipeline for the restaurant (fz) dataset for graph-based constraint/repair experiments:
- Clean raw ARFF and extract `area_code`
- Construct similarity edges based on `(addr, city)`
- Build graphs: constraint graph `S` and instance graph `G`
- Produce cleaned ground truth `G_opt` with no violations
- Persist artifacts for downstream perturbation/repair experiments
- Import to Neo4j and inject label noise

Tips: tune `ADDRESS_DISTANCE_THRESHOLD` to control similarity density, and make sure `.env` defines the Neo4j connection settings before running the import cells.

## Setup & Tunable Parameters

- `ARFF_PATH`: path to input ARFF (FZ dataset).
- `OUTPUT_DIR`: destination for intermediate and cleaned artifacts.
- `ADDRESS_DISTANCE_THRESHOLD`: exclusive upper bound on the `addr` edit distance within the same `city`; two addresses are similar when their distance is strictly below this value.
- If `networkx` is available, graph stats/cleaning use it; otherwise a lightweight fallback is used.


In [1]:
# Setup and Parameters

# Imports
import os
from pathlib import Path
from datetime import datetime

import pandas as pd

# Edit distance
try:
    import Levenshtein
    def string_distance(a, b):
        return Levenshtein.distance(str(a), str(b))
except ImportError as e:
    raise RuntimeError("python-Levenshtein is required; install via pyproject or pip.") from e

# Optional: NetworkX for convenience (graphs + stats)
try:
    import networkx as nx
    NX_AVAILABLE = True
except Exception:
    NX_AVAILABLE = False

# Parameters (tune as needed)
ARFF_PATH = Path("datasets/restaurant/fz.arff")
OUTPUT_DIR = Path("datasets/temp")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

ADDRESS_DISTANCE_THRESHOLD = 7  # addr pairs with edit distance strictly below this are similar

print("NetworkX available:", NX_AVAILABLE)
print("ARFF path:", ARFF_PATH)
print("Output dir:", OUTPUT_DIR)

NetworkX available: True
ARFF path: datasets\restaurant\fz.arff
Output dir: datasets\temp


## Load ARFF

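- Parses the ARFF header to derive column names and reads the `@data` section into a DataFrame.
- Assumes a simple, dense ARFF (no sparse rows) whose string values are double-quoted.
- Output: `df` with columns `name`, `addr`, `city`, `phone`, `type`, `class`. As the preview shows, every column except `name` keeps its literal quote characters; `extract_area_code` strips them from `phone` in the next step.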
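
The retained quotes are consistent with fields separated by a comma followed by a space, although the raw file is not shown here, so that is an assumption. Under that assumption, a minimal standard-library sketch of a loader that strips them could look like the following; `parse_simple_arff` and the sample text are hypothetical and not part of the notebook.

```python
import csv
from io import StringIO

def parse_simple_arff(text):
    """Parse a dense ARFF string into (columns, rows).

    Assumes attribute names contain no spaces and that data rows are
    comma-separated with optional double quotes around values.
    """
    columns, data_lines, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            data_lines.append(line)
        elif lower.startswith("@attribute"):
            columns.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    # skipinitialspace=True lets the reader treat `, "value"` as a quoted field
    reader = csv.reader(StringIO("\n".join(data_lines)),
                        quotechar='"', skipinitialspace=True)
    rows = [dict(zip(columns, record)) for record in reader]
    return columns, rows

# Hypothetical two-column sample, not the real fz.arff
sample = (
    "@relation fz\n"
    "@attribute name string\n"
    "@attribute addr string\n"
    "@data\n"
    "\"art's deli\", \"12224 ventura blvd.\"\n"
)
columns, rows = parse_simple_arff(sample)
print(columns, rows)
```

With pandas, passing `skipinitialspace=True` to `pd.read_csv` should have a similar effect, but switching loaders changes the stored strings and therefore the edit distances computed later.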

In [2]:
# Load ARFF into DataFrame
from io import StringIO

def arff_to_dataframe(filepath: Path) -> pd.DataFrame:
    data = False
    header = ""
    csv_content = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if "@attribute" in line.lower():
                attributes = line.split()
                attri_idx = next(i for i, x in enumerate(attributes) if x.lower() == "@attribute")
                column_name = attributes[attri_idx + 1]
                header = header + column_name + ","
            elif "@data" in line.lower():
                data = True
                header = header.rstrip(',') + '\n'
                csv_content.append(header)
            elif data and line.strip():
                csv_content.append(line + '\n')
    csv_string = ''.join(csv_content)
    df_local = pd.read_csv(StringIO(csv_string), quotechar='"')
    return df_local

# Load
df = arff_to_dataframe(ARFF_PATH)
print(f"Loaded {len(df)} rows; columns: {list(df.columns)}")
df.head(3)

Loaded 864 rows; columns: ['name', 'addr', 'city', 'phone', 'type', 'class']


| | name | addr | city | phone | type | class |
|---|---|---|---|---|---|---|
| 0 | arnie morton's of chicago | "435 s. la cienega blv." | "los angeles" | "310/246-1501" | "american" | '0' |
| 1 | arnie morton's of chicago | "435 s. la cienega blvd." | "los angeles" | "310-246-1501" | "steakhouses" | '0' |
| 2 | art's delicatessen | "12224 ventura blvd." | "studio city" | "818/762-1221" | "american" | '1' |


## Enrich: Area Code

- Derives `area_code` from the first three digits of `phone` (after stripping separators).
- Validates expected columns: `name`, `phone`, `addr`, `city`.
- Output: `df` includes new `area_code` column (may be `None`).
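
To make the extraction rule concrete, here is a small regex-based sketch. It is a hypothetical variant, not the notebook's function: it drops every non-digit character, so it also accepts formats such as `(310) 246-1501`, which the notebook's version maps to `None`.

```python
import re

def area_code_from_phone(phone):
    """Return the first three digits of a phone string, or None.

    Hypothetical variant: removes ALL non-digit characters first,
    which is more permissive than the notebook's extract_area_code.
    """
    if phone is None:
        return None
    digits = re.sub(r"\D", "", str(phone))
    return digits[:3] if len(digits) >= 3 else None

examples = ['"310/246-1501"', "818-762-1221", "(310) 246-1501", "12", None]
codes = [area_code_from_phone(p) for p in examples]
print(codes)
```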

In [3]:
# Extract area code and basic cleaning

def extract_area_code(phone):
    if pd.isna(phone) or phone is None:
        return None
    s = str(phone).strip().strip('"').strip("'")
    digits = s.replace('-', '').replace('/', '').replace(' ', '')
    if len(digits) >= 3 and digits[:3].isdigit():
        return digits[:3]
    return None

# Ensure expected columns exist
expected_cols = {"name", "phone", "addr", "city"}
missing = expected_cols - set(df.columns)
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

# Compute area_code
df["area_code"] = df["phone"].apply(extract_area_code)

print("Area code extraction:")
print("  total:", len(df))
print("  with area_code:", df["area_code"].notna().sum())
print("  unique area_codes:", df["area_code"].nunique())

df[["name", "phone", "area_code"]].head(5)

Area code extraction:
  total: 864
  with area_code: 864
  unique area_codes: 11


| | name | phone | area_code |
|---|---|---|---|
| 0 | arnie morton's of chicago | "310/246-1501" | 310 |
| 1 | arnie morton's of chicago | "310-246-1501" | 310 |
| 2 | art's delicatessen | "818/762-1221" | 818 |
| 3 | art's deli | "818-762-1221" | 818 |
| 4 | hotel bel-air | "310/472-1211" | 310 |


## Similarity Edges

- Builds undirected similarity edges where:
  - Restaurants share the same `city`, and
  - Levenshtein distance on `addr` is `< ADDRESS_DISTANCE_THRESHOLD`.
- Output: `similarity_edges` as list of `(i, j)` index pairs.
- Note: the loop visits all N(N-1)/2 pairs (372,816 for the 864 rows here), which is acceptable for small datasets.
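
Because only same-city pairs can match, one way to cut the number of comparisons is to group rows by `city` first and compare within each group. The sketch below illustrates that idea on plain dictionaries. `similarity_edges_by_city` is a hypothetical helper, and the pure-Python `edit_distance` stands in for `Levenshtein.distance` so the example has no dependencies.

```python
from collections import defaultdict
from itertools import combinations

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity_edges_by_city(records, threshold):
    """Return sorted (i, j) pairs with i < j, same city, addr distance < threshold."""
    by_city = defaultdict(list)
    for idx, rec in enumerate(records):
        by_city[rec["city"]].append(idx)
    edges = []
    for members in by_city.values():
        for i, j in combinations(members, 2):
            if edit_distance(records[i]["addr"], records[j]["addr"]) < threshold:
                edges.append((i, j))
    return sorted(edges)

# Toy records modelled on the first rows of the dataset
records = [
    {"addr": "435 s. la cienega blv.", "city": "los angeles"},
    {"addr": "435 s. la cienega blvd.", "city": "los angeles"},
    {"addr": "12224 ventura blvd.", "city": "studio city"},
]
edges = similarity_edges_by_city(records, threshold=7)
print(edges)
```

The result set is the same as the all-pairs loop would give; only the number of distance computations changes, and the saving depends on how evenly rows are spread across cities.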


In [4]:
# Build similarity pairs based on (addr, city)

similarity_edges = []
N = len(df)
print(f"Building similarity pairs over {N} restaurants...")
for i in range(N):
    ai = df.iloc[i]
    for j in range(i+1, N):
        aj = df.iloc[j]
        if ai["city"] != aj["city"]:
            continue
        d = string_distance(ai["addr"], aj["addr"])
        if d < ADDRESS_DISTANCE_THRESHOLD:
            similarity_edges.append((i, j))

print("Similarity summary:")
print("  total edges:", len(similarity_edges))
if N:
    print("  avg degree:", (2*len(similarity_edges))/N)
print("Sample edges:", similarity_edges[:10])

Building similarity pairs over 864 restaurants...
Similarity summary:
  total edges: 5307
  avg degree: 12.284722222222221
Sample edges: [(0, 1), (0, 22), (0, 242), (1, 22), (1, 242), (2, 3), (2, 37), (2, 706), (3, 37), (3, 706)]


## Persist: Raw Graph Inputs

- Writes two files for downstream steps:
  - `restaurants_YYYYMMDD-HHMMSS.txt`: tab-separated `id, name, area_code, addr, city`.
  - `restaurant_similarities_YYYYMMDD-HHMMSS.txt`: one edge per line as `(i,j)`.
- These represent the uncleaned instance graph inputs.


In [5]:
# Persist raw (uncleaned) graph inputs to the temp output directory

ts = datetime.now().strftime("%Y%m%d-%H%M%S")
restaurants_path = OUTPUT_DIR / f"restaurants_{ts}.txt"
similarities_path = OUTPUT_DIR / f"restaurant_similarities_{ts}.txt"

# Ensure area_code exists to avoid KeyError if earlier cell wasn't run
if "area_code" not in df.columns:
    def _extract_area_code_local(phone):
        if pd.isna(phone) or phone is None:
            return None
        s = str(phone).strip().strip('"').strip("'")
        digits = s.replace('-', '').replace('/', '').replace(' ', '')
        if len(digits) >= 3 and digits[:3].isdigit():
            return digits[:3]
        return None
    df["area_code"] = df["phone"].apply(_extract_area_code_local)

with open(restaurants_path, "w", encoding="utf-8") as f:
    f.write("id\tname\tarea_code\taddr\tcity\n")
    for idx, row in df.iterrows():
        f.write(f"{idx}\t{row['name']}\t{row['area_code']}\t{row['addr']}\t{row['city']}\n")

with open(similarities_path, "w", encoding="utf-8") as f:
    for i, j in similarity_edges:
        f.write(f"({i},{j})\n")

print("Saved:")
print("  ", restaurants_path)
print("  ", similarities_path)

Saved:
   datasets\temp\restaurants_20260127-155333.txt
   datasets\temp\restaurant_similarities_20260127-155333.txt


## Graphs & Cleaning

- Constraint graph `S`: nodes are `area_code` values; only self-loops are allowed (same-area connections).
- Instance graph `G`: nodes are restaurants (with `label = area_code`), edges from `similarity_edges`.
- Violation: an edge `(u, v)` where `label(u) != label(v)`.
- Cleaning: iteratively remove the lower-degree endpoint of the first found violating edge until none remain, producing `G_opt`.
- Outputs: pre/post statistics and number of removed nodes.
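
The cleaning rule can be illustrated on a toy graph without NetworkX. This is a sketch under stated assumptions: `clean_graph` is a hypothetical helper, and it scans edges in sorted order for determinism, whereas the notebook follows NetworkX's edge iteration order, so the removed nodes may differ on real data.

```python
def clean_graph(labels, edges):
    """Greedy cleaning: repeatedly drop the lower-degree endpoint
    of the first violating edge until no violation remains.

    labels: dict node -> label; edges: iterable of (u, v) pairs.
    Returns (kept_nodes, kept_edges, removed_in_order).
    """
    edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    removed = []
    while True:
        violation = next(
            (e for e in sorted(edges) if labels[e[0]] != labels[e[1]]), None
        )
        if violation is None:
            break
        u, v = violation
        deg_u = sum(1 for e in edges if u in e)
        deg_v = sum(1 for e in edges if v in e)
        drop = u if deg_u <= deg_v else v  # ties drop u, as in the notebook
        removed.append(drop)
        edges = {e for e in edges if drop not in e}
    kept_nodes = set(labels) - set(removed)
    return kept_nodes, edges, removed

# Node 2 carries a different area code than its only neighbour, node 1
labels = {0: "310", 1: "310", 2: "818", 3: "310"}
edges = [(0, 1), (1, 2), (0, 3), (1, 3)]
kept_nodes, kept_edges, removed = clean_graph(labels, edges)
print(kept_nodes, kept_edges, removed)
```

Here the only violation is `(1, 2)`; node 1 has degree 3 and node 2 has degree 1, so node 2 is dropped and the remaining triangle is violation-free. The heuristic is greedy and does not guarantee a minimum number of removals.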

In [6]:
# Build graphs (S, G) and clean to ground truth (G_opt)

# Constraint graph S: nodes = area codes; only self-loops allowed
labels = sorted(df["area_code"].dropna().unique().tolist())

if NX_AVAILABLE:
    S = nx.Graph()
    S.add_nodes_from(labels)
    S.add_edges_from((ac, ac) for ac in labels)
else:
    S = {ac: {ac} for ac in labels}  # adjacency-by-label for checks

# Instance graph G: nodes = restaurant indices, label = area_code, edges from similarity pairs
if NX_AVAILABLE:
    G = nx.Graph()
    for idx, row in df.iterrows():
        G.add_node(int(idx))
        G.nodes[int(idx)]["label"] = row["area_code"]
        G.nodes[int(idx)]["name"] = row["name"]
        G.nodes[int(idx)]["addr"] = row["addr"]
        G.nodes[int(idx)]["city"] = row["city"]
    for u, v in similarity_edges:
        if u != v:
            G.add_edge(int(u), int(v))
else:
    # Lightweight structure without networkx
    G_nodes = {int(idx): {
        "label": row["area_code"],
        "name": row["name"],
        "addr": row["addr"],
        "city": row["city"],
    } for idx, row in df.iterrows()}
    G_edges = {(min(int(u), int(v)), max(int(u), int(v))) for (u, v) in similarity_edges if u != v}

# Violation check: neighbors must share same area_code

def has_edge_in_S(lu, lv):
    if NX_AVAILABLE:
        return S.has_edge(lu, lv)
    return lv in S.get(lu, set())

# Helpers to get the FIRST violating edge with respect to the CURRENT graph state
if NX_AVAILABLE:
    def first_violation(Gx):
        for (u, v) in Gx.edges():
            lu = Gx.nodes[u].get("label")
            lv = Gx.nodes[v].get("label")
            if not has_edge_in_S(lu, lv):
                return (u, v)
        return None

    def count_violations(Gx):
        c = 0
        for (u, v) in Gx.edges():
            lu = Gx.nodes[u].get("label")
            lv = Gx.nodes[v].get("label")
            if not has_edge_in_S(lu, lv):
                c += 1
        return c
else:
    def first_violation(_unused):
        for (u, v) in G_edges:
            lu = G_nodes[u]["label"]
            lv = G_nodes[v]["label"]
            if not has_edge_in_S(lu, lv):
                return (u, v)
        return None

    def count_violations_non_nx():
        c = 0
        for (u, v) in G_edges:
            lu = G_nodes[u]["label"]
            lv = G_nodes[v]["label"]
            if not has_edge_in_S(lu, lv):
                c += 1
        return c

# Cleaning: iteratively remove the lower-degree endpoint of first violation until none remain
removed = set()
if NX_AVAILABLE:
    G_opt = G.copy()
    viol_before = count_violations(G_opt)
    while True:
        pair = first_violation(G_opt)
        if not pair:
            break
        u, v = pair
        # Guard in case of transient references
        if u not in G_opt or v not in G_opt:
            continue
        drop = u if G_opt.degree[u] <= G_opt.degree[v] else v
        removed.add(drop)
        G_opt.remove_node(drop)
    viol_after = count_violations(G_opt)
else:
    # Non-NX cleaning
    viol_before = count_violations_non_nx()
    edges_before = len(G_edges)  # capture before cleaning mutates G_edges
    remaining_nodes = set(G_nodes.keys())
    while True:
        pair = first_violation(None)
        if not pair:
            break
        u, v = pair
        deg_u = sum(1 for e in G_edges if u in e)
        deg_v = sum(1 for e in G_edges if v in e)
        drop = u if deg_u <= deg_v else v
        removed.add(drop)
        remaining_nodes.discard(drop)
        G_edges = {e for e in G_edges if drop not in e}
    viol_after = count_violations_non_nx()

print("Graph stats (before cleaning):")
if NX_AVAILABLE:
    print(f"  |V|={len(G.nodes)}, |E|={len(G.edges)}")
else:
    print(f"  |V|={len(G_nodes)}, |E|={edges_before}")

print("Cleaning results:")
print("  violations before:", viol_before)
print("  violations after:", viol_after)
print("  removed nodes:", len(removed))
if NX_AVAILABLE:
    print(f"  G_opt |V|={len(G_opt.nodes)}, |E|={len(G_opt.edges)}")
else:
    print(f"  G_opt |V|={len(remaining_nodes)}, |E|={len(G_edges)}")

Graph stats (before cleaning):
  |V|=864, |E|=5307
Cleaning results:
  violations before: 32
  violations after: 0
  removed nodes: 15
  G_opt |V|=849, |E|=5245


## Persist: Cleaned Ground Truth

- Writes `restaurants_cleaned_*.txt` and `restaurant_similarities_cleaned_*.txt`, mirroring the raw formats but restricted to `G_opt`.
- These files serve as the canonical, violation-free ground truth for experiments.

In [7]:
# Persist cleaned ground truth artifacts
ts2 = datetime.now().strftime("%Y%m%d-%H%M%S")
clean_restaurants_path = OUTPUT_DIR / f"restaurants_cleaned_{ts2}.txt"
clean_similarities_path = OUTPUT_DIR / f"restaurant_similarities_cleaned_{ts2}.txt"

# Helper to iterate nodes/edges in G_opt independent of NetworkX
if NX_AVAILABLE:
    nodes_iter = sorted(G_opt.nodes)
    edges_iter = sorted((min(u, v), max(u, v)) for (u, v) in G_opt.edges())
else:
    nodes_iter = sorted(remaining_nodes)
    edges_iter = sorted(G_edges)

# Map from id to row for writing attributes
by_id = {int(idx): row for idx, row in df.iterrows()}

with open(clean_restaurants_path, "w", encoding="utf-8") as f:
    f.write("id\tname\tarea_code\taddr\tcity\n")
    for rid in nodes_iter:
        r = by_id.get(int(rid))
        if r is None:
            continue
        f.write(f"{rid}\t{r['name']}\t{r['area_code']}\t{r['addr']}\t{r['city']}\n")

with open(clean_similarities_path, "w", encoding="utf-8") as f:
    for u, v in edges_iter:
        f.write(f"({u},{v})\n")

print("Saved cleaned ground truth:")
print("  ", clean_restaurants_path)
print("  ", clean_similarities_path)

Saved cleaned ground truth:
   datasets\temp\restaurants_cleaned_20260127-155349.txt
   datasets\temp\restaurant_similarities_cleaned_20260127-155349.txt


## Import into Neo4j

- Uses `.env` for: `NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`, `NEO4J_CONSTRAINT_DB`, `NEO4J_INSTANCE_DB`.
- Ground-truth (GT) DB, named by `NEO4J_CONSTRAINT_DB`: loads `G_opt` (cleaned) and creates `AreaCode` nodes with self-loops via `:ALLOWED`.
- Instance DB, named by `NEO4J_INSTANCE_DB`: starts as an exact copy of the GT DB (noise is injected later).
- Validates that violations are zero after import.

### Neo4j Import Details

- Constraints: unique on `Restaurant(id)` and `AreaCode(code)`.
- Restaurants: properties `name`, `addr`, `city`, `area_code`, and `label = area_code`.
- Similarity: undirected similarity imported as two directed `:SIMILAR` relationships.
- Constraint graph `S`: collected distinct labels → `AreaCode` nodes with self `:ALLOWED` loops.
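
Storing each undirected similarity as two directed `:SIMILAR` relationships affects how violations are counted: a Cypher pattern of the form `(a)-[:SIMILAR]->(b)` matches every violating pair once per direction. The in-memory sketch below mirrors that counting; `count_directed_violations` is a hypothetical helper and does not talk to Neo4j.

```python
def count_directed_violations(labels, undirected_edges):
    """Count label mismatches over both directions of each undirected edge."""
    directed = [(a, b) for a, b in undirected_edges]
    directed += [(b, a) for a, b in undirected_edges]
    return sum(
        1
        for a, b in directed
        if labels[a] is not None
        and labels[b] is not None
        and labels[a] != labels[b]
    )

# Toy data: one mismatching undirected edge ("1", "2")
labels = {"0": "310", "1": "310", "2": "818"}
undirected_edges = [("0", "1"), ("1", "2")]
directed_count = count_directed_violations(labels, undirected_edges)
print(directed_count)
```

Halving the directed count gives the number of violating undirected pairs, which is the figure comparable to the NetworkX-based count used during cleaning.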

In [8]:
import os
import pathlib
from dotenv import load_dotenv
from neo4j import GraphDatabase

# --- Env ---
env_path = pathlib.Path.cwd() / ".env"
load_dotenv(dotenv_path=env_path, override=True)

def _strip_quotes(v):
    return None if v is None else v.strip().strip('"').strip("'")

URI = _strip_quotes(os.getenv("NEO4J_URI"))
USERNAME = _strip_quotes(os.getenv("NEO4J_USERNAME"))
PASSWORD = _strip_quotes(os.getenv("NEO4J_PASSWORD"))
GT_DB = _strip_quotes(os.getenv("NEO4J_CONSTRAINT_DB")) or "restaurants-gt"   # env var says "constraint", but this DB holds the ground truth (GT)
INSTANCE_DB = _strip_quotes(os.getenv("NEO4J_INSTANCE_DB")) or "restaurants-instance"

AUTH = (USERNAME, PASSWORD)
print("Neo4j URI:", URI)
print("GT_DB:", GT_DB)
print("INSTANCE_DB:", INSTANCE_DB)

# --- Neo4j helpers ---
def clear_database(driver, database):
    driver.execute_query("MATCH (n) DETACH DELETE n", database_=database)

def setup_database(driver, database):
    driver.execute_query("""
        CREATE CONSTRAINT restaurant_id_unique IF NOT EXISTS
        FOR (r:Restaurant) REQUIRE r.id IS UNIQUE
    """, database_=database)

    driver.execute_query("""
        CREATE CONSTRAINT areacode_code_unique IF NOT EXISTS
        FOR (a:AreaCode) REQUIRE a.code IS UNIQUE
    """, database_=database)

def import_instance(driver, database, restaurants_list, edges_list):
    driver.execute_query("""
        UNWIND $restaurants AS r
        MERGE (n:Restaurant {id: r.id})
        SET n.name = r.name,
            n.addr = r.addr,
            n.city = r.city,
            n.area_code = r.area_code,
            n.label = r.area_code
    """, restaurants=restaurants_list, database_=database)

    # undirected SIMILAR as two directed edges
    driver.execute_query("""
        UNWIND $pairs AS p
        MATCH (a:Restaurant {id: p[0]})
        MATCH (b:Restaurant {id: p[1]})
        MERGE (a)-[:SIMILAR]->(b)
        MERGE (b)-[:SIMILAR]->(a)
    """, pairs=edges_list, database_=database)

def build_S(driver, database):
    driver.execute_query("""
        MATCH (r:Restaurant)
        WITH DISTINCT r.label AS code
        WHERE code IS NOT NULL
        MERGE (:AreaCode {code: code})
    """, database_=database)

    driver.execute_query("""
        MATCH (a:AreaCode)
        MERGE (a)-[:ALLOWED]->(a)
    """, database_=database)

def count_violations(driver, database):
    rec = driver.execute_query("""
        MATCH (a:Restaurant)-[:SIMILAR]->(b:Restaurant)
        WHERE a.label IS NOT NULL AND b.label IS NOT NULL AND a.label <> b.label
        RETURN count(*) AS v
    """, database_=database).records[0]
    return rec["v"]

# --- Load payloads from the CLEANED ground-truth files (written by the persist cell above) ---
def load_clean_restaurants(path):
    rows = []
    with open(path, encoding="utf-8") as f:
        next(f)  # header
        for line in f:
            rid, name, area_code, addr, city = line.rstrip("\n").split("\t")
            rows.append({
                "id": str(rid),
                "name": name,
                "area_code": None if area_code in ("None", "") else area_code,
                "addr": addr,
                "city": city
            })
    return rows

def load_clean_edges(path):
    edges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            s = line.strip().lstrip("(").rstrip(")")
            if not s:
                continue
            a, b = [x.strip() for x in s.split(",")]
            edges.append((str(a), str(b)))
    return edges

gt_restaurants = load_clean_restaurants(clean_restaurants_path)
gt_edges = load_clean_edges(clean_similarities_path)
print("GT payload:", len(gt_restaurants), "restaurants,", len(gt_edges), "edges")

# --- Run import ---
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    print("Connected to Neo4j.")

    # GT_DB = cleaned G_opt
    clear_database(driver, GT_DB)
    setup_database(driver, GT_DB)
    import_instance(driver, GT_DB, gt_restaurants, gt_edges)
    build_S(driver, GT_DB)
    print("GT violations:", count_violations(driver, GT_DB), "(should be 0)")

    # INSTANCE_DB = exact copy of GT for now (noise later)
    clear_database(driver, INSTANCE_DB)
    setup_database(driver, INSTANCE_DB)
    import_instance(driver, INSTANCE_DB, gt_restaurants, gt_edges)
    build_S(driver, INSTANCE_DB)
    print("INSTANCE violations:", count_violations(driver, INSTANCE_DB), "(should be 0)")

print("Import done.")

Neo4j URI: neo4j://127.0.0.1:7687
GT_DB: restaurant-constraint
INSTANCE_DB: restaurant-instance
GT payload: 849 restaurants, 5245 edges
Connected to Neo4j.
GT violations: 0 (should be 0)
INSTANCE violations: 0 (should be 0)
Import done.


## Inject Label Noise (for experiments)

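- Environment variable `FRAUD_NUMBER` (e.g., `10` or `10x`) controls how many restaurants to relabel; only its digit characters are kept, so `10x` is read as 10, and the default is 10 when it is unset or contains no digits.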
- Randomly selects `K` restaurants and changes their `label` to a different `AreaCode`.
- Annotates modified nodes with `:Fraudulent`, `noise_type = "label_noise"`, and `noise_old_label`.
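- Reports the violation count before/after to quantify injected inconsistency. The count is over directed `:SIMILAR` relationships, so each violating undirected pair contributes 2.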
- Use `SEED` for reproducibility.
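
The relabelling step can be previewed in memory before touching the database. The sketch below is a simplified illustration: `plan_label_noise` is a hypothetical helper, and it draws the new label from the codes that differ from the old one instead of retrying until the draw differs, so for the same seed it will not reproduce the notebook's exact choices.

```python
import random

def plan_label_noise(labels, k, seed=42):
    """Pick k node ids and assign each a label different from its current one.

    labels: dict id -> current label. Returns a list of update dicts
    shaped like the notebook's `updates` payload.
    """
    rng = random.Random(seed)
    codes = sorted({lab for lab in labels.values() if lab is not None})
    if len(codes) < 2:
        raise ValueError("Need at least 2 distinct labels to inject noise.")
    chosen = rng.sample(sorted(labels), min(k, len(labels)))
    updates = []
    for rid in chosen:
        old = labels[rid]
        new = rng.choice([c for c in codes if c != old])
        updates.append({"id": rid, "old_label": old, "new_label": new})
    return updates

# Toy labels keyed by string ids, as in the Neo4j import
toy_labels = {"0": "310", "1": "310", "2": "818", "3": "213"}
updates = plan_label_noise(toy_labels, k=2, seed=42)
print(updates)
```

Using a dedicated `random.Random(seed)` instance keeps the selection reproducible without reseeding the global generator.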

In [9]:
import random

# ---- Parameters ----
# Read FRAUD_NUMBER safely (handles "10x" -> 10)
raw_k = os.getenv("FRAUD_NUMBER", "10")
digits = "".join(ch for ch in raw_k if ch.isdigit())
K = int(digits) if digits else 10

SEED = 42  # reproducible noise
print("Using K =", K, " (from FRAUD_NUMBER =", raw_k, ")")
print("Using SEED =", SEED)

def fetch_restaurant_ids(driver, database):
    res = driver.execute_query("""
        MATCH (r:Restaurant)
        RETURN r.id AS id
        ORDER BY toInteger(r.id)
    """, database_=database)
    return [str(r["id"]) for r in res.records]

def fetch_area_codes(driver, database):
    res = driver.execute_query("""
        MATCH (a:AreaCode)
        RETURN a.code AS code
        ORDER BY a.code
    """, database_=database)
    return [str(r["code"]) for r in res.records]

def inject_label_noise(driver, database, k, seed=42):
    rng = random.Random(seed)

    ids = fetch_restaurant_ids(driver, database)
    codes = fetch_area_codes(driver, database)

    if not ids:
        raise RuntimeError("No Restaurant nodes found.")
    if len(codes) < 2:
        raise RuntimeError("Need at least 2 distinct AreaCodes to inject label noise.")

    k = min(k, len(ids))
    chosen_ids = rng.sample(ids, k)

    # Fetch current labels for chosen nodes
    res = driver.execute_query("""
        UNWIND $ids AS id
        MATCH (r:Restaurant {id: id})
        RETURN r.id AS id, r.label AS old_label
    """, ids=chosen_ids, database_=database)

    updates = []
    for rec in res.records:
        rid = str(rec["id"])
        old = rec["old_label"]
        # choose a *different* label
        new = old
        while new == old:
            new = rng.choice(codes)
        updates.append({"id": rid, "old_label": old, "new_label": new})

    # Apply updates
    driver.execute_query("""
        UNWIND $updates AS u
        MATCH (r:Restaurant {id: u.id})
        SET r.noise_old_label = u.old_label,
            r.label = u.new_label,
            r.noise_type = "label_noise"
        SET r:Fraudulent
    """, updates=updates, database_=database)

    return updates

def count_violations(driver, database):
    rec = driver.execute_query("""
        MATCH (a:Restaurant)-[:SIMILAR]->(b:Restaurant)
        WHERE a.label IS NOT NULL AND b.label IS NOT NULL AND a.label <> b.label
        RETURN count(*) AS v
    """, database_=database).records[0]
    return int(rec["v"])

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()

    # Safety: INSTANCE should start clean
    v0 = count_violations(driver, INSTANCE_DB)
    print("INSTANCE violations BEFORE noise:", v0)

    updates = inject_label_noise(driver, INSTANCE_DB, K, seed=SEED)
    print(f"Injected label noise into {len(updates)} restaurants.")

    v1 = count_violations(driver, INSTANCE_DB)
    print("INSTANCE violations AFTER noise:", v1)

    # Quick peek: show a few modified nodes
    sample = updates[:10]
    print("Sample updates:", sample)

Using K = 10  (from FRAUD_NUMBER = 10x )
Using SEED = 42
INSTANCE violations BEFORE noise: 0
Injected label noise into 10 restaurants.
INSTANCE violations AFTER noise: 136
Sample updates: [{'id': '668', 'old_label': '213', 'new_label': '818'}, {'id': '123', 'old_label': '212', 'new_label': '770'}, {'id': '34', 'old_label': '213', 'new_label': '212'}, {'id': '773', 'old_label': '702', 'new_label': '805'}, {'id': '295', 'old_label': '212', 'new_label': '702'}, {'id': '264', 'old_label': '310', 'new_label': '100'}, {'id': '238', 'old_label': '818', 'new_label': '100'}, {'id': '151', 'old_label': '404', 'new_label': '212'}, {'id': '768', 'old_label': '702', 'new_label': '310'}, {'id': '113', 'old_label': '718', 'new_label': '310'}]
