## Merging Repeated Transcription Data
Two collections of participants were transcribed more than once. Repeated transcription introduces several kinds of variation: alternate spellings ("mangos" vs. "mangoes"), synonyms ("beef patty" vs. "hamburger"), common typos, and documented differences in the number of items transcribed, which cause misalignments. The repeated data sets are merged algorithmically. Finally, the merged sets are concatenated with the transcription data unique to each transcriber.
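
As a rough illustration of why spelling variants and synonyms need different treatment, the sketch below scores two pairs with a character-level ratio from the standard library. It is not taken from merge.py; the helper name `spelling_similarity` is hypothetical and only shows the general idea.

```python
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Alternate spellings share most of their characters, so they score high.
spelling_score = spelling_similarity("mangos", "mangoes")

# Synonyms share few characters, so a character-level measure scores them low
# even though they describe the same item.
synonym_score = spelling_similarity("beef patty", "hamburger")

print(f"spelling variant: {spelling_score:.2f}, synonym: {synonym_score:.2f}")
```

A character-level score like this can catch typos and spelling variants, but it cannot recognize synonyms.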

In [1]:
import copy

import pandas as pd
import gensim.downloader as api
from importlib import reload

import merge

In [2]:
pd.set_option('display.max_rows', 1000)

When two item descriptions diverge substantially, for example "hamburger meat" and "beef patties", we compute their similarity with pre-trained word embeddings from [gensim](https://github.com/RaRe-Technologies/gensim-data#models).

In [3]:
%time word_vectors = api.load("glove-wiki-gigaword-300") # 50, 100, 200, 300 sizes available

Wall time: 1min 29s
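
The sketch below shows one common way to compare multi-word descriptions with embeddings: average the word vectors of each phrase and take the cosine similarity. It is an assumption about the general technique, not a description of what merge.py does. The toy vectors and the helper names `phrase_vector` and `phrase_similarity` are hypothetical, chosen so the example runs without downloading a model.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real GloVe embeddings
# (hypothetical values chosen only for illustration).
toy_vectors = {
    "hamburger": np.array([0.9, 0.1, 0.0]),
    "meat": np.array([0.8, 0.2, 0.1]),
    "beef": np.array([0.85, 0.15, 0.05]),
    "patties": np.array([0.9, 0.2, 0.0]),
    "mango": np.array([0.0, 0.1, 0.9]),
}


def phrase_vector(phrase, vectors):
    """Average the vectors of the words in `phrase` that have an embedding."""
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        return None
    return np.mean([vectors[w] for w in words], axis=0)


def phrase_similarity(a, b, vectors):
    """Cosine similarity of averaged phrase vectors; 0.0 if either phrase is unknown."""
    va, vb = phrase_vector(a, vectors), phrase_vector(b, vectors)
    if va is None or vb is None:
        return 0.0
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))


close = phrase_similarity("hamburger meat", "beef patties", toy_vectors)
far = phrase_similarity("hamburger meat", "mango", toy_vectors)
print(f"related: {close:.3f}, unrelated: {far:.3f}")
```

With the loaded model, `word_vectors.n_similarity(['hamburger', 'meat'], ['beef', 'patties'])` performs a comparable computation, provided every word is in the vocabulary.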


### All Three
A collection of participants was transcribed in all three data sets. Merging the three item columns takes two passes of merge.py: the first two data sets are merged, and the result is then merged with the third.

In [4]:
%%time
DATA_PATH = '../Data/'
FILES_ALL_THREE = ['all_three_max_to_merge', 'all_three_maria_to_merge', 'all_three_samantha_to_merge']
COLS = [0, 1, 2, 3, 4]  # Index, ID, Session, Receipt, Item
DTYPES = {'ID': 'uint8', 'Session': 'uint8', 'Receipt': 'uint8', 'Item': 'string'}

dfs_all_three = [pd.read_csv(DATA_PATH + file + '.csv', index_col=0, usecols=COLS, dtype=DTYPES) for file in FILES_ALL_THREE]

Wall time: 218 ms


In [None]:
%time df_merged01, df_merged01_wv = merge.merge([dfs_all_three[0], dfs_all_three[1]], word_vectors)

ID:   0%|          | 0/6 [00:00<?, ?it/s]

In [None]:
print(f"The merge used word_vectors heavily on {df_merged01_wv.shape[0]} rows")

In [None]:
df_merged01_wv

73 items were corrected by hand. An empty string (19 cases) indicates no obvious connection between the items.

In [None]:
df_merged01.iloc[61, 3] = "chocolate bars"
df_merged01.iloc[73, 3] = "hamburgers"
df_merged01.iloc[76, 3] = "hot dogs"
df_merged01.iloc[106, 3] = "pie"
df_merged01.iloc[125, 3] = ""
df_merged01.iloc[136, 3] = "salad"
df_merged01.iloc[156, 3] = "mango popsicles"
df_merged01.iloc[162, 3] = ""
df_merged01.iloc[163, 3] = ""
df_merged01.iloc[191, 3] = ""
df_merged01.iloc[205, 3] = ""
df_merged01.iloc[216, 3] = "maple cereal"
df_merged01.iloc[205, 3] = ""  # repeats the assignment to row 205 above
df_merged01.iloc[221, 3] = "walnuts"
df_merged01.iloc[223, 3] = "tilapia"
df_merged01.iloc[230, 3] = "havarti cheese"
df_merged01.iloc[258, 3] = "swiss cheese"
df_merged01.iloc[259, 3] = "havarti cheese"
df_merged01.iloc[279, 3] = "caesar salad"
df_merged01.iloc[294, 3] = "fruit"
df_merged01.iloc[306, 3] = "cake"
df_merged01.iloc[331, 3] = "blueberries"
df_merged01.iloc[332, 3] = "blueberries"
df_merged01.iloc[345, 3] = "salad"
df_merged01.iloc[400, 3] = "hot dogs"
df_merged01.iloc[401, 3] = ""
df_merged01.iloc[402, 3] = ""
df_merged01.iloc[416, 3] = "mint ice cream"
df_merged01.iloc[424, 3] = ""
df_merged01.iloc[432, 3] = "yogurt"
df_merged01.iloc[434, 3] = "yogurts"
df_merged01.iloc[447, 3] = ""
df_merged01.iloc[491, 3] = "fruit"
df_merged01.iloc[492, 3] = "fruit"
df_merged01.iloc[520, 3] = "milk"
df_merged01.iloc[521, 3] = "corvina"
df_merged01.iloc[526, 3] = "fruit"
df_merged01.iloc[553, 3] = "milk"
df_merged01.iloc[555, 3] = "beans"
df_merged01.iloc[559, 3] = "strawberries"
df_merged01.iloc[560, 3] = ""
df_merged01.iloc[594, 3] = "strawberries"
df_merged01.iloc[609, 3] = ""
df_merged01.iloc[610, 3] = ""
df_merged01.iloc[611, 3] = "candy"
df_merged01.iloc[627, 3] = "milk"
df_merged01.iloc[629, 3] = "strawberries"
df_merged01.iloc[637, 3] = "blackberry jam"
df_merged01.iloc[642, 3] = "chocolate candy"
df_merged01.iloc[665, 3] = "strawberries"
df_merged01.iloc[677, 3] = "milk"
df_merged01.iloc[685, 3] = ""
df_merged01.iloc[703, 3] = "salmon"
df_merged01.iloc[704, 3] = "strawberries"
df_merged01.iloc[705, 3] = "strawberries"
df_merged01.iloc[715, 3] = "milk"
df_merged01.iloc[730, 3] = "corvina"
df_merged01.iloc[741, 3] = "milk"
df_merged01.iloc[748, 3] = ""
df_merged01.iloc[768, 3] = "blueberries"
df_merged01.iloc[769, 3] = "blueberries"
df_merged01.iloc[772, 3] = ""
df_merged01.iloc[812, 3] = ""
df_merged01.iloc[815, 3] = "beef"
df_merged01.iloc[823, 3] = "romaine lettuce"
df_merged01.iloc[825, 3] = "garlic toast"
df_merged01.iloc[846, 3] = "vegetables"
df_merged01.iloc[854, 3] = "mangos"
df_merged01.iloc[889, 3] = ""
df_merged01.iloc[890, 3] = "chips"
df_merged01.iloc[891, 3] = "chips"
df_merged01.iloc[892, 3] = ""
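
The hand corrections above could also be kept in a mapping from row position to corrected item and applied in a loop. The sketch below uses a toy DataFrame with the same column order; `df_toy` and `corrections` are hypothetical stand-ins, not the real merged data.

```python
import pandas as pd

# Toy stand-in for a merged DataFrame; "Item" is the fourth column (position 3).
df_toy = pd.DataFrame({
    "ID": [1, 1, 1],
    "Session": [1, 1, 1],
    "Receipt": [1, 1, 2],
    "Item": ["choclate bars", "hamburger meat", "unknown"],
})

# Look up the column position by name instead of hard-coding 3.
ITEM_COL = df_toy.columns.get_loc("Item")

# Hypothetical corrections: row position -> corrected item ("" = no obvious match).
corrections = {0: "chocolate bars", 1: "hamburgers", 2: ""}

for row, item in corrections.items():
    df_toy.iloc[row, ITEM_COL] = item

n_empty = sum(1 for item in corrections.values() if item == "")
print(f"{len(corrections)} rows corrected, {n_empty} left empty")
```

Keeping the corrections in one mapping makes the counts quoted in the text easy to verify with `len(corrections)`.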

Next, merge the previous result with the third data set. We first check the divergence to confirm that the two data sets are similar enough for the merge to be computationally reasonable: a divergence of 8 or less.

In [None]:
merge.divergence([df_merged01, dfs_all_three[2]], word_vectors)

In [None]:
%time df_merged01_2, df_merged01_2_wv = merge.merge([df_merged01, dfs_all_three[2]], word_vectors)

In [None]:
print(f"The merge used word_vectors heavily on {df_merged01_2_wv.shape[0]} rows")

In [None]:
df_merged01_2_wv

32 items were corrected by hand. An empty string (8 cases) indicates no obvious connection between the items.

In [None]:
df_merged01_2.iloc[57, 3] = "candy bars"
df_merged01_2.iloc[76, 3] = ""
df_merged01_2.iloc[86, 3] = ""
df_merged01_2.iloc[87, 3] = "bread"
df_merged01_2.iloc[135, 3] = "barbecue sauce"
df_merged01_2.iloc[147, 3] = "salad dressing"
df_merged01_2.iloc[153, 3] = "salad dressing"
df_merged01_2.iloc[308, 3] = "cake"
df_merged01_2.iloc[346, 3] = "salad"
df_merged01_2.iloc[379, 3] = "dog treats"
df_merged01_2.iloc[401, 3] = ""
df_merged01_2.iloc[410, 3] = ""
df_merged01_2.iloc[471, 3] = ""
df_merged01_2.iloc[487, 3] = ""
df_merged01_2.iloc[488, 3] = ""
df_merged01_2.iloc[499, 3] = "soda"
df_merged01_2.iloc[516, 3] = "cheese"
df_merged01_2.iloc[564, 3] = "whipped cream"
df_merged01_2.iloc[595, 3] = "hot dogs"
df_merged01_2.iloc[640, 3] = "blackberry jam"
df_merged01_2.iloc[641, 3] = "raspberry jam"
df_merged01_2.iloc[673, 3] = "milk"
df_merged01_2.iloc[681, 3] = "whipped cream"
df_merged01_2.iloc[686, 3] = "tomato sauce"
df_merged01_2.iloc[747, 3] = ""
df_merged01_2.iloc[799, 3] = "sauce"
df_merged01_2.iloc[826, 3] = "romaine lettuce"
df_merged01_2.iloc[829, 3] = "soda"
df_merged01_2.iloc[855, 3] = "mangos"
df_merged01_2.iloc[888, 3] = "grain"
df_merged01_2.iloc[901, 3] = "cranberry sauce"

### Only Two
A collection of participants was transcribed only in Maria's and Samantha's data sets.

In [None]:
%%time
FILES_ONLY_TWO = ['only_two_maria_to_merge', 'only_two_samantha_to_merge']

dfs_only_two = [pd.read_csv(DATA_PATH + file + '.csv', index_col=0, usecols=COLS, dtype=DTYPES) for file in FILES_ONLY_TWO]

In [None]:
merge.divergence(dfs_only_two, word_vectors)

In [None]:
%time df_merged_only_two, df_merged_only_two_wv = merge.merge(dfs_only_two, word_vectors)

In [None]:
print(f"The merge used word_vectors heavily on {df_merged_only_two_wv.shape[0]} rows")

In [None]:
df_merged_only_two_wv

5 items were corrected by hand. An empty string (4 cases) indicates no obvious connection between the items.

In [None]:
df_merged_only_two.iloc[8, 3] = ""
df_merged_only_two.iloc[9, 3] = ""
df_merged_only_two.iloc[65, 3] = "soda"
df_merged_only_two.iloc[66, 3] = ""
df_merged_only_two.iloc[69, 3] = ""

### Concatenate All Sources of Data
After merging, there are five sources of unique data. Three are unique to the transcribers and two are the results of the merges above.

In [None]:
%%time
FILES_CLEAN = ['clean_max', 'clean_maria', 'clean_samantha']
COLS_CLEAN = [0, 1, 2, 3, 5]

dfs_clean = [pd.read_csv(DATA_PATH + file + '.csv', index_col=0, usecols=COLS_CLEAN, dtype=DTYPES) for file in FILES_CLEAN]

In [None]:
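# Keep only the rows whose ID was recorded by a single transcriber.
dfs_unique = copy.deepcopy(dfs_clean)
for i in range(len(dfs_unique)):
    for j, df_other in enumerate(dfs_clean):
        if i != j:
            # Drop any ID that also appears in another transcriber's data.
            dfs_unique[i] = dfs_unique[i][~dfs_unique[i].ID.isin(df_other.ID.unique())]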

dfs_all = [*dfs_unique, df_merged01_2, df_merged_only_two]
print(*[df.ID.unique() for df in dfs_all], sep='\n')
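
The filtering step above can be checked on toy data. In this sketch, ID 1 appears for all three transcribers, ID 2 for two of them, and IDs 3 to 5 for one each, so only IDs 3, 4, and 5 should survive. The frames and names are hypothetical.

```python
import pandas as pd

# Toy frames, one per transcriber.
toy_frames = [
    pd.DataFrame({"ID": [1, 3], "Item": ["milk", "bread"]}),
    pd.DataFrame({"ID": [1, 2, 4], "Item": ["milk", "soda", "salad"]}),
    pd.DataFrame({"ID": [1, 2, 5], "Item": ["milk", "soda", "cake"]}),
]

unique_frames = []
for i, frame in enumerate(toy_frames):
    # Collect every ID recorded by any other transcriber.
    other_ids = pd.concat(
        [other["ID"] for j, other in enumerate(toy_frames) if j != i]
    ).unique()
    # Keep only rows whose ID no other transcriber recorded.
    unique_frames.append(frame[~frame["ID"].isin(other_ids)])

unique_ids = [sorted(f["ID"].tolist()) for f in unique_frames]
print(unique_ids)
```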

In [None]:
for df in dfs_all:
    print(df.info())
    print()

In [None]:
df_merged_final = pd.concat(dfs_all).reset_index()
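
A small sketch of what `reset_index()` does after concatenation, using hypothetical toy frames. With an unnamed index the old row labels become a column called "index"; with a named index, as when a file is read with `index_col=0`, the column takes that name instead.

```python
import pandas as pd

part_a = pd.DataFrame({"ID": [1, 1], "Item": ["milk", "bread"]})
part_b = pd.DataFrame({"ID": [2], "Item": ["soda"]})

stacked = pd.concat([part_a, part_b])
# Row labels from each source repeat after concatenation.
labels_before = stacked.index.tolist()

# reset_index() keeps the old labels as a column;
# reset_index(drop=True) discards them.
kept = stacked.reset_index()
dropped = stacked.reset_index(drop=True)

print(labels_before)
print(kept.columns.tolist())
print(dropped.columns.tolist())
```

Whether to keep the old labels depends on whether the per-source row positions are still needed downstream.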

In [None]:
print(df_merged_final.info())
display(df_merged_final.head())

In [None]:
df_merged_final.to_csv(f'{DATA_PATH}merged_full.csv')