## Preprocess Data for Algorithmic Merging
The merging algorithm in merge.py employs an O(n!) brute-force search, where n is the number of highly divergent item descriptions. To be computationally feasible, n must be 8 or less, so item descriptions are cleaned by hand until they are sufficiently similar across data sets. This notebook preprocesses the subset of participants whose receipts were transcribed in all three data sets.
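
To illustrate why n has to stay small, the sketch below brute-forces an alignment between two short item lists by scoring every permutation. It is not the code in merge.py, which is not shown in this notebook: the `similarity` function (a character-based ratio from `difflib`, used here in place of word vectors) and the `best_alignment` helper are hypothetical stand-ins. The loop at the end prints how quickly the number of orderings grows.

```python
import itertools
import math
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for a word-vector similarity: a string ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def best_alignment(left, right):
    """Score every ordering of `right` against `left` and keep the best one.

    Assumes both lists have the same length n, so n! orderings are scored.
    """
    best_score, best_perm = -1.0, None
    for perm in itertools.permutations(right):
        score = sum(similarity(a, b) for a, b in zip(left, perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(best_perm)


left = ['chocolate ice cream', 'brown bread', 'sweetener']
right = ['sweetner', 'ice cream chocolate', 'bread brown']
aligned = best_alignment(left, right)
print(aligned)

# Number of orderings an exhaustive search has to score for each n.
for n in (4, 8, 12):
    print(n, math.factorial(n))
```

At n = 8 there are 40,320 orderings per receipt, which is still quick to enumerate; at n = 12 there are roughly 479 million, which is why divergent items are reduced by hand first.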

In [1]:
import re
import datetime

import pandas as pd
import gensim.downloader as api

import merge

In [2]:
pd.set_option('display.max_rows', 200)

In [None]:
%time word_vectors = api.load("glove-wiki-gigaword-300") # 50, 100, 200, 300 sizes available

In [None]:
DATA_PATH = '../Data/'
FILES = ['max', 'maria', 'samantha']
COLS = [0, 1, 2, 3, 5]  # Index, ID, Session, Receipt, Item
DTYPES = {'ID': 'uint8', 'Session': 'uint8', 'Receipt': 'uint8', 'Item': str}

dfs = [pd.read_csv(DATA_PATH + 'clean_' + file + '.csv', index_col=0, usecols=COLS, dtype=DTYPES) for file in FILES]

Restrict data set to shared participants

In [None]:
ids_shared = set.intersection(*[set(df.ID.unique()) for df in dfs])
dfs = [df[df.ID.isin(ids_shared)].reset_index(drop=True) for df in dfs]

In [None]:
initial_row_counts = [df.shape[0] for df in dfs] # used at the end to compute drop rate

In [None]:
for df in dfs:
    df.info()  # info() prints directly and returns None
    print()

Item descriptions are optionally formatted as "item (modifier)", where the modifier is usually an adjective such as a flavor, as in "ice cream (chocolate)". The reformat_modifier function removes this formatting by moving the modifier to the beginning of the text and dropping the parentheses. The Item strings are additionally cleaned by removing punctuation, possessive "'s", and the word "coupon", and by stripping white space.

In [None]:
paren = re.compile(r'\(.+\)')

def reformat_modifier(text):
    m = paren.search(text)
    if m:
        text = ' '.join([m.group(0)[1:-1], text])
        text = paren.sub('', text)
    return text

In [None]:
for df in dfs:
    df.Item = (df.Item
               .apply(reformat_modifier)
               .str.replace(r'[/(),"&]', ' ', regex=True)
               .str.replace(r'?', '', regex=False)
               .str.replace(r"'s", '', regex=False)
               .str.replace(r"coupon", '', regex=False)
               .str.strip())
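
As a quick check of the cleaning steps above, the following sketch applies them to a few made-up strings without pandas. `reformat_modifier` is copied from the cell above; `clean_item` is a hypothetical plain-Python equivalent of the chained `.str` operations, written only for this illustration.

```python
import re

paren = re.compile(r'\(.+\)')


def reformat_modifier(text):
    """Move a parenthesised modifier to the front of the item text."""
    m = paren.search(text)
    if m:
        text = ' '.join([m.group(0)[1:-1], text])
        text = paren.sub('', text)
    return text


def clean_item(text):
    """Hypothetical plain-Python equivalent of the chained .str operations."""
    text = reformat_modifier(text)
    text = re.sub(r'[/(),"&]', ' ', text)        # punctuation becomes a space
    for literal in ('?', "'s", 'coupon'):        # literal substrings are removed
        text = text.replace(literal, '')
    return text.strip()


examples = ['ice cream (chocolate)', "kellogg's cereal", 'milk?', 'bread coupon']
for raw in examples:
    print(repr(raw), '->', repr(clean_item(raw)))
```

Note that `.+` in the pattern is greedy, so a string containing two parenthesised groups is treated as a single span from the first `(` to the last `)`.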

### Examination of Variation within Sessions

In [None]:
pd.concat([df.groupby(by=['ID', 'Session']).Item.count() for df in dfs], axis=1, ignore_index=True)

The following transcription sessions are dropped by inspection.

In [None]:
DROPPED_SESSIONS = [(114, 4), (130, 1), (135, 6), (153, 2), (153, 6)]  # (ID, Session)

for df in dfs:
    for pid, session in DROPPED_SESSIONS:
        df.drop(df[(df.ID == pid) & (df.Session == session)].index, inplace=True)

### Examination of Variation in Receipt Count
The merge algorithm operates on receipts and requires each data set to recognize the same number of receipts per session per ID. The receipt numbers are examined for variation between the data sets.

In [None]:
pd.concat([df.groupby(by=['ID', 'Session']).Receipt.unique() for df in dfs], axis=1, ignore_index=True)
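
The table above can also be filtered programmatically rather than scanned by eye. The sketch below is one possible approach, using small made-up frames rather than the study data; `receipt_count_mismatches` is a hypothetical helper that keeps only the (ID, Session) rows whose number of distinct receipts differs between data sets.

```python
import pandas as pd


def receipt_count_mismatches(frames):
    """Return (ID, Session) rows whose distinct receipt counts differ across frames."""
    counts = pd.concat(
        [df.groupby(['ID', 'Session']).Receipt.nunique() for df in frames],
        axis=1, ignore_index=True)
    # dropna=False so a session missing from one frame also counts as a mismatch.
    return counts[counts.nunique(axis=1, dropna=False) > 1]


# Toy data: the second frame has an extra receipt stub for ID 114, Session 5.
a = pd.DataFrame({'ID': [114, 114, 127, 127],
                  'Session': [5, 5, 2, 2],
                  'Receipt': [1, 1, 1, 2]})
b = pd.DataFrame({'ID': [114, 114, 127, 127],
                  'Session': [5, 5, 2, 2],
                  'Receipt': [1, 2, 1, 2]})

mismatches = receipt_count_mismatches([a, b])
print(mismatches)
```

This compares counts only, so two data sets that number the same receipts differently would not be flagged; the full table of receipt numbers is still worth inspecting.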

Discrepancies are corrected by inspection. In some cases a receipt stub remained as an artifact of previous data cleaning and is dropped. In other cases two receipts had been combined into one by the same cleaning, and the affected rows are reassigned to the correct receipt number.

ID: 114, Session: 5, Receipts: [1]	[1, 2]	[1]

In [None]:
pd.concat([df.loc[(df.ID == 114) & (df.Session == 5), ['Receipt', 'Item']].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
dfs[1].drop(dfs[1][(dfs[1].ID == 114) & (dfs[1].Session == 5) & (dfs[1].Receipt == 2)].index, inplace=True)

ID: 127, Session: 2, Receipts: [1, 2, 3]	[1, 2]	[1, 2, 3]

In [None]:
pd.concat([df.loc[(df.ID == 127) & (df.Session == 2), ['Receipt', 'Item']].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
dfs[1].loc[(dfs[1].ID == 127) & (dfs[1].Session == 2), ['Receipt', 'Item']]

In [None]:
dfs[1].loc[831:839, 'Receipt'] = 3

ID: 127, Session: 5, Receipts: [1, 2, 3, 4]	[1, 2, 3]	[1, 2, 3, 4]

In [None]:
pd.concat([df.loc[(df.ID == 127) & (df.Session == 5), ['Receipt', 'Item']].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
dfs[1].loc[(dfs[1].ID == 127) & (dfs[1].Session == 5), ['Receipt', 'Item']]

In [None]:
dfs[1].loc[880:901, 'Receipt'] = 4

ID: 135, Session: 2, Receipts: [1, 2, 3, 4, 5, 6, 7, 8]	[1, 2, 3, 4, 5, 6, 7, 8, 9]	[1, 2, 3, 4, 5, 6, 7, 8]

In [None]:
pd.concat([df.loc[(df.ID == 135) & (df.Session == 2), ['Receipt', 'Item']].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
dfs[1].drop(dfs[1][(dfs[1].ID == 135) & (dfs[1].Session == 2) & (dfs[1].Receipt == 9)].index, inplace=True)

### Examination of Item Divergence
Receipts are examined individually for highly divergent items. Corrections are made by inspection.

In [None]:
merge.divergence([dfs[0], dfs[1]], word_vectors)
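
The implementation of `merge.divergence` is not shown in this notebook. As a rough illustration of what flagging divergent items might look like, the sketch below marks items in one list that have no close counterpart in another. Everything in it is an assumption: the `divergent_items` helper, the 0.8 threshold, and the use of `difflib` string similarity in place of word vectors.

```python
from difflib import SequenceMatcher

MAX_DIVERGENT = 8  # feasibility limit for the brute-force merge stated in the introduction


def divergent_items(left, right, threshold=0.8):
    """Indices of items in `left` with no item in `right` at or above `threshold`."""
    flagged = []
    for i, item in enumerate(left):
        best = max((SequenceMatcher(None, item, other).ratio() for other in right),
                   default=0.0)
        if best < threshold:
            flagged.append(i)
    return flagged


left = ['quinoa brown bread', 'sweetener', 'almond milk']
right = ['quino brown bread', 'sweetner', 'orange juice']

flagged = divergent_items(left, right)
print(flagged, 'feasible' if len(flagged) <= MAX_DIVERGENT else 'needs cleaning')
```

A character-based ratio treats misspellings as near matches, whereas a word-vector measure may not, so the items flagged here will generally differ from those reported by `merge.divergence`.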

ID: 135, Session: 2, Receipt: 3, Div: 8! [1, 3, 4, 5, 6, 7, 8, 9]

In [None]:
pd.concat([df.loc[(df.ID == 135) & (df.Session == 2) & (df.Receipt == 3), 'Item'].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
for df in dfs:
    df.drop(df[(df.ID == 135) & (df.Session == 2) & (df.Receipt == 3)].index, inplace=True)

ID: 135, Session: 2, Receipt: 6, Div: 8! [0, 1, 2, 10, 14, 15, 16, 17]

In [None]:
pd.concat([df.loc[(df.ID == 135) & (df.Session == 2) & (df.Receipt == 6), 'Item'].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
dfs[1].loc[(dfs[1].ID == 135) & (dfs[1].Session == 2) & (dfs[1].Receipt == 6) & (dfs[1].Item == 'quino brown bread'), 'Item'] = 'quinoa brown bread'
dfs[1].loc[(dfs[1].ID == 135) & (dfs[1].Session == 2) & (dfs[1].Receipt == 6) & (dfs[1].Item == 'sweetner'), 'Item'] = 'sweetener'

In [None]:
merge.divergence([dfs[0], dfs[2]], word_vectors)

In [None]:
merge.divergence([dfs[1], dfs[2]], word_vectors)

ID: 135, Session: 2, Receipt: 6, Div: 8! [1, 2, 3, 5, 15, 16, 17, 18]

In [None]:
pd.concat([df.loc[(df.ID == 135) & (df.Session == 2) & (df.Receipt == 6), 'Item'].reset_index(drop=True) for df in dfs],
          axis=1, ignore_index=True)

In [None]:
dfs[2].loc[(dfs[2].ID == 135) & (dfs[2].Session == 2) & (dfs[2].Receipt == 6) & (dfs[2].Item == 'quiona'), 'Item'] = 'quinoa'

### Results

In [None]:
for df in dfs:
    df.info()
    print()

In [None]:
for initial_row_count, df in zip(initial_row_counts, dfs):
    print(f'Total row reduction: {initial_row_count - df.shape[0]} ({(initial_row_count - df.shape[0]) / initial_row_count:.0%})')

In [34]:
for i, df in enumerate(dfs):
    df = df.reset_index(drop=True)
    df.to_csv(f'{DATA_PATH}all_three_{FILES[i]}_to_merge.csv')