# Auto-Detect: Data-Driven Error Detection in Tables

#### 1. Paper
https://dl.acm.org/doi/10.1145/3183713.3196889


#### 2. Basics
* Statistics-based technique that leverages co-occurrence statistics from a large table corpus for error detection
    * Co-occurrence is measured with a statistical measure called pointwise mutual information (PMI)
    * This score is then normalized to a scale between -1 and 1
    * Smoothing is applied to the co-occurrence statistics so that patterns absent from the training set do not all receive negative scores
* Cells are converted into patterns using generalization languages, derived from a set of possible generalizations for the English language
* Picks the most suitable generalization language based on a static precision requirement
* Uses an ensemble of generalization languages to judge the compatibility of different values
* Different languages are sensitive to different types of misalignments
* Aims to resemble human intuition about errors
* Can be adapted to a configurable memory budget for client-side applications through the use of the count-min sketch (CM-Sketch) data structure
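
The co-occurrence scoring from the first bullet can be sketched as follows. This is a minimal illustration, not the code behind `autodetect`: the function name `npmi`, the `smoothing` parameter, and the choice to interpolate the observed count with the count expected under independence are assumptions made for this example.

```python
import math


def npmi(count_a, count_b, count_ab, total, smoothing=0.1):
    """Normalized PMI in [-1, 1] for two patterns (illustrative sketch).

    count_a, count_b: number of columns containing each pattern
    count_ab: number of columns containing both patterns
    total: number of columns in the corpus
    """
    # Count we would expect if the two patterns occurred independently.
    expected_ab = count_a * count_b / total
    # Assumed smoothing: mix observed and expected counts so that
    # unseen pairs do not end up with probability zero.
    smoothed_ab = (1 - smoothing) * count_ab + smoothing * expected_ab

    p_a, p_b, p_ab = count_a / total, count_b / total, smoothed_ab / total
    if p_ab <= 0.0:
        return -1.0  # never observed together and no smoothing applied
    if p_ab >= 1.0:
        return 1.0  # degenerate case: both patterns occur in every column

    pmi = math.log(p_ab / (p_a * p_b))
    # Dividing by -log p(a, b) maps the score onto [-1, 1].
    return pmi / -math.log(p_ab)
```

Scores near 1 mean the two patterns almost always appear together, scores near 0 mean they look independent, and scores near -1 mean they rarely share a column. Without smoothing, every unseen pair would receive exactly -1.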

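To make the idea of generalization languages more concrete, here is a small sketch of how a cell value could be turned into a pattern. The two languages `digits_only` and `coarse`, the class symbols `\D`, `\L`, `\S`, and the run-length notation are hypothetical choices for this example and may differ from what `autodetect` uses internally.

```python
from itertools import groupby


def digits_only(ch):
    # Hypothetical language: generalize digits, keep everything else literal.
    return r"\D" if ch.isdigit() else ch


def coarse(ch):
    # Hypothetical language: digits, letters, and all other characters
    # each collapse into a single class.
    if ch.isdigit():
        return r"\D"
    if ch.isalpha():
        return r"\L"
    return r"\S"


def generalize(value, language):
    """Map every character to its class and run-length encode the classes."""
    parts = []
    for cls, run in groupby(language(ch) for ch in value):
        n = len(list(run))
        # Two-character tokens are classes; single characters are literals.
        parts.append(f"{cls}[{n}]" if len(cls) == 2 else cls * n)
    return "".join(parts)


print(generalize("2018-08-12", digits_only))  # \D[4]-\D[2]-\D[2]
print(generalize("2018/08/12", digits_only))  # \D[4]/\D[2]/\D[2]
print(generalize("2018/08/12", coarse))       # \D[4]\S[1]\D[2]\S[1]\D[2]
```

Under `digits_only` the two dates produce different patterns because the separator is kept, while under `coarse` they collapse to the same pattern. This is one way to see why different languages are sensitive to different types of misalignments.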

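The memory budget mentioned in the last bullet relies on the count-min sketch, which stores approximate counts in a fixed-size table. The class below is a minimal sketch of that data structure, not the implementation used in this repository: the class name, the default `width` and `depth`, and the SHA-256 based hashing are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter (illustrative sketch)."""

    def __init__(self, width=1024, depth=4):
        # Memory use is width * depth counters, regardless of how many keys are added.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One bucket per row, derived from a hash salted with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over all rows
        # is the tightest estimate and never falls below the true count.
        return min(self.table[row][col] for row, col in self._buckets(key))


sketch = CountMinSketch(width=256, depth=4)
sketch.add("pattern_a|pattern_b", 3)  # hypothetical key for a pattern pair
print(sketch.estimate("pattern_a|pattern_b"))
```

A smaller `width` saves memory at the cost of more collisions, which makes the counts, and therefore the co-occurrence statistics, less precise.
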
#### 3. Demonstration

In [None]:
import dill

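# Load the trained model; the context manager closes the file afterwards
with open("autodetect.pkl", "rb") as model_file:
    autodetect = dill.load(model_file)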
autodetect.trainings_set.add_redis_connections()


demos = [
    ("2000-24-12", "2018-08-12"),
    ("2003-24-12", "2038-08-12"),
    ("01.12.2070", "3123-08-12"),
    ("01.12.2070", "24/08/2952"),
    ("Mr. Smith", "Mr. Homes"),
    ("Mrs. Smith", "Mr. Smith"),
    ("14-28", "na-na"),
    ("20-23", "14-18"),
    ("3:28 min", "3 23 minutes"),
    ("June 2000", "01.06.2000"),
    ("0.26", "26%"),
    ("(511) 325161", "511 325-161"),
    ("511 325612", "511-32-32-51"),
]

for demo in demos:
    try:
        compatible, confidence = autodetect.predict(demo[0], demo[1])
        answer = "compatible" if compatible else "incompatible"
        print(f"{demo[0]} and {demo[1]} are {answer}, confidence: {confidence}")
    except Exception as e:
        print(f"Something went wrong: {e}")

#### 4. Test performance

In [None]:
import pprint
from src.test_auto_detect import read_test_files

pp = pprint.PrettyPrinter(depth=1)

# Run the predictor against every test file and print the resulting statistics
test_files = read_test_files("test_data")
for test_file in test_files:
    statistics = test_file.test(autodetect.predict)
    print(f"=== Testing against {test_file.name} ===")
    print("Statistics:")
    pp.pprint(statistics)
    print()
    print()

#### 5. Try yourself

In [None]:
value1 = "Put something here"
value2 = "Put something else here"

In [None]:
try:
    compatible, confidence = autodetect.predict(value1, value2)
    answer = "compatible" if compatible else "incompatible"
    # Report the values entered above, not the last pair from the demo loop
    print(f"{value1} and {value2} are {answer}, confidence: {confidence}")
except Exception as e:
    print(f"Something went wrong: {e}")