## Preprocess Data for Algorithmic Merging
The merging algorithm in `merge.py` employs an O(n!) brute-force search, where n is the number of highly divergent item descriptions on a receipt. To be computationally feasible, n must be 8 or less. Divergent descriptions are cleaned by hand until the data sets are sufficiently similar. This notebook preprocesses the subset of participants whose receipts were transcribed in Mar's and S's data sets only.
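To make the feasibility limit concrete, the short sketch below prints how many orderings a factorial-time search would have to score for a few values of n. The helper name `search_space` is hypothetical and only illustrates the growth rate; it is not part of `merge.py`.

```python
import math


def search_space(n: int) -> int:
    """Number of candidate orderings for n highly divergent items."""
    return math.factorial(n)


# Compare a few sizes around the stated limit of 8.
for n in (5, 7, 8, 10):
    print(f"n={n:>2}: {search_space(n):>9,} permutations")
```

Going from n = 8 to n = 10 multiplies the work by 90, which is why divergent items are reduced by hand before merging.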

In [1]:
import re
import datetime

import pandas as pd
import gensim.downloader as api

import merge

In [2]:
pd.set_option('display.max_rows', 200)

In [3]:
%time word_vectors = api.load("glove-wiki-gigaword-300") # 50, 100, 200, 300 sizes available

CPU times: total: 1min 1s
Wall time: 1min 1s


In [4]:
DATA_PATH = '../Data/'
FILES = ['mar', 's']
COLS = [0, 1, 2, 3, 5]  # Index, ID, Session, Receipt, Item
DTYPES = {'ID': 'uint8', 'Session': 'uint8', 'Receipt': 'uint8', 'Item': str}

dfs = [pd.read_csv(DATA_PATH + 'clean_' + file + '.csv', index_col=0, usecols=COLS, dtype=DTYPES) for file in FILES]
df_m = pd.read_csv(DATA_PATH + 'clean_m.csv', index_col=0, usecols=COLS, dtype=DTYPES)

Restrict the data sets to participants shared by Mar and S but absent from M

In [5]:
ids_shared = set.intersection(*[set(df.ID.unique()) for df in dfs]) - set(df_m.ID.unique())
dfs = [df[df.ID.isin(ids_shared)].reset_index(drop=True) for df in dfs]
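The set arithmetic in the cell above can be checked on toy data. The sketch below uses made-up participant IDs and hypothetical frame names (`df_mar`, `df_s`, `df_m_toy`); it keeps only IDs present in both of the first two frames and absent from the third.

```python
import pandas as pd

# Toy stand-ins for the three transcriptions (IDs are made up).
df_mar = pd.DataFrame({'ID': [101, 101, 148, 150]})
df_s = pd.DataFrame({'ID': [101, 148, 148, 151]})
df_m_toy = pd.DataFrame({'ID': [101, 150]})

toy_dfs = [df_mar, df_s]

# IDs present in every frame of toy_dfs.
shared = set.intersection(*[set(df.ID.unique()) for df in toy_dfs])

# Remove IDs that the third transcription also covers.
only_two = shared - set(df_m_toy.ID.unique())

# Keep only the rows belonging to the remaining participants.
filtered = [df[df.ID.isin(only_two)].reset_index(drop=True) for df in toy_dfs]

print(only_two)
print([len(df) for df in filtered])
```

Here ID 101 is dropped because it appears in all three frames, and 150 and 151 are dropped because they are not shared by the first two.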

In [6]:
initial_row_counts = [df.shape[0] for df in dfs] # used at the end to compute drop rate

In [7]:
for df in dfs:
    print(df.info())
    print()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       69 non-null     uint8 
 1   Session  69 non-null     uint8 
 2   Receipt  69 non-null     uint8 
 3   Item     69 non-null     object
dtypes: object(1), uint8(3)
memory usage: 887.0+ bytes
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       71 non-null     uint8 
 1   Session  71 non-null     uint8 
 2   Receipt  71 non-null     uint8 
 3   Item     71 non-null     object
dtypes: object(1), uint8(3)
memory usage: 909.0+ bytes
None



### Examination of Variation within Sessions

In [8]:
pd.concat([df.groupby(by=['ID', 'Session']).Item.count() for df in dfs], axis=1, ignore_index=True)

| ID | Session | 0 | 1 |
|---|---|---|---|
| 148 | 2 | 69 | 71 |


### Examination of Variation in Receipt Count
The merge algorithm operates on receipts and requires each data set to recognize the same number of receipts per session per ID. The receipt numbers are compared across the data sets to check for discrepancies.

In [9]:
pd.concat([df.groupby(by=['ID', 'Session']).Receipt.unique() for df in dfs], axis=1, ignore_index=True)

| ID | Session | 0 | 1 |
|---|---|---|---|
| 148 | 2 | [1, 2] | [1, 2] |
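With more participants, comparing the receipt lists by eye becomes error prone. The sketch below is one possible automated check, not something taken from `merge.py`: the helper `receipt_mismatches` is hypothetical, and it compares only the number of distinct receipts per (ID, Session), not the receipt labels themselves.

```python
import pandas as pd


def receipt_mismatches(dfs):
    """Return (ID, Session) rows whose receipt counts differ between data sets."""
    counts = pd.concat(
        [df.groupby(['ID', 'Session']).Receipt.nunique() for df in dfs],
        axis=1,
        keys=list(range(len(dfs))),
    )
    # A row agrees only if every data set reports the same count.
    # A session missing from one data set appears as NaN and is flagged too.
    agrees = counts.eq(counts.iloc[:, 0], axis=0).all(axis=1)
    return counts[~agrees]


# Toy data: the second frame has an extra receipt for ID 149, Session 1.
a = pd.DataFrame({'ID': [148, 148, 149], 'Session': [2, 2, 1], 'Receipt': [1, 2, 1]})
b = pd.DataFrame({'ID': [148, 148, 149, 149], 'Session': [2, 2, 1, 1], 'Receipt': [1, 2, 1, 2]})

mismatches = receipt_mismatches([a, b])
print(mismatches)
```

An empty result means the data sets agree on receipt counts for every session.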


### Examination of Item Divergence
Receipts are examined individually for highly divergent items. Corrections are made by inspection.

In [10]:
merge.divergence([dfs[0], dfs[1]], word_vectors)

ID: 148, Session: 2, Receipt: 1, Div: 5! [2, 3, 6, 9, 10]
ID: 148, Session: 2, Receipt: 2, Div: 7! [10, 16, 40, 41, 48, 49, 54]



Both receipts have at most 8 highly divergent items (5 and 7), so this very small data set needs no manual corrections.
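For intuition about why the number of divergent items matters, here is a minimal sketch of a brute-force alignment over all permutations. It is an illustration under stated assumptions, not the implementation in `merge.py`: the functions `similarity` and `best_alignment` are hypothetical, and `difflib.SequenceMatcher` stands in for the word-vector similarity the notebook loads.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    # Stand-in string similarity in [0, 1]; an assumption for this sketch,
    # not the word-vector measure used by the merge module.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_alignment(left, right):
    """Try every ordering of `right` and keep the one most similar to `left`."""
    assert len(left) == len(right)
    best_score, best_perm = -1.0, None
    for perm in permutations(range(len(right))):  # n! candidates
        score = sum(similarity(left[i], right[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_perm


# Made-up item descriptions from two transcriptions of the same receipt.
left = ['whole milk', 'banana', 'cheddar cheese']
right = ['cheddar', 'milk whole 1l', 'bananas']

# perm[i] is the index in `right` matched to left[i].
print(best_alignment(left, right))
```

Every additional divergent item multiplies the number of candidate orderings, so trimming the list by hand before the search keeps the run time manageable.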

### Results

In [11]:
for initial_row_count, df in zip(initial_row_counts, dfs):
    print(f'Total row reduction: {initial_row_count - df.shape[0]} ({(initial_row_count - df.shape[0]) / initial_row_count:.0%})')

Total row reduction: 0 (0%)
Total row reduction: 0 (0%)


In [12]:
for i, df in enumerate(dfs):
    df = df.reset_index(drop=True)
    df.to_csv(f'{DATA_PATH}only_two_{FILES[i]}_to_merge.csv')