## Merging Repeated Transcription Data
Two collections of participants were transcribed more than once. Repeated transcription produces several types of variation: alternate spellings ("mangos" vs "mangoes"), synonyms ("beef patty" vs "hamburger"), common typos, and the documented variation in the number of items transcribed, which leads to misalignments. The repeated transcriptions are merged algorithmically. Finally, the merged sets are concatenated with the transcription data unique to each transcriber.

In [1]:
import copy

import pandas as pd
import gensim.downloader as api
from importlib import reload

import merge

In [2]:
pd.set_option('display.max_rows', 1000)

When two item descriptions are sufficiently divergent, for example "hamburger meat" compared with "beef patties," we use pre-trained GloVe word embeddings loaded through [gensim](https://github.com/RaRe-Technologies/gensim-data#models) to compute their similarity.
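One common way to compare two short descriptions with word embeddings is to average the vectors of their words and measure the distance between the averages. The sketch below illustrates that idea with small made-up 3-dimensional vectors so that it runs without downloading a model. The `TOY_VECTORS` table and the `phrase_vector` and `phrase_distance` helpers are hypothetical, and this is not necessarily the computation that `merge.py` performs.

```python
import math

# Made-up 3-dimensional vectors for illustration only;
# the GloVe model loaded in this notebook has 300 dimensions.
TOY_VECTORS = {
    "hamburger": [0.9, 0.1, 0.0],
    "meat": [0.8, 0.2, 0.1],
    "beef": [0.85, 0.15, 0.05],
    "patties": [0.9, 0.2, 0.0],
    "lotion": [0.0, 0.1, 0.9],
}


def phrase_vector(phrase, vectors):
    """Average the vectors of the words in `phrase` that have an embedding."""
    known = [vectors[word] for word in phrase.split() if word in vectors]
    if not known:
        raise ValueError(f"no known words in {phrase!r}")
    return [sum(dim) / len(known) for dim in zip(*known)]


def phrase_distance(a, b, vectors):
    """Euclidean distance between the averaged vectors of two phrases."""
    return math.dist(phrase_vector(a, vectors), phrase_vector(b, vectors))


close = phrase_distance("hamburger meat", "beef patties", TOY_VECTORS)
far = phrase_distance("hamburger meat", "lotion", TOY_VECTORS)
print(close < far)
```

With the real model, the dictionary lookup would presumably be replaced by indexing the loaded vectors, for example `word_vectors[word]`.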

In [3]:
%time word_vectors = api.load("glove-wiki-gigaword-300") # 50, 100, 200, 300 sizes available

Wall time: 2min 12s


### All Three
A collection of participants was transcribed in all three data sets. Merging three data columns requires two passes of `merge.merge`: the first two data sets are merged, and the result is then merged with the third.
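Merging three columns with a pairwise merge is a left fold over the list of columns. The sketch below shows the pattern with `functools.reduce`. The `merge_pair` function is a hypothetical stand-in that keeps the longer description at each position; it is not the logic in `merge.py`, and the item lists are invented.

```python
from functools import reduce


def merge_pair(left, right):
    """Hypothetical pairwise merge: keep the longer description at each position."""
    return [a if len(a) >= len(b) else b for a, b in zip(left, right)]


# Three invented transcriptions of the same two-item receipt.
columns = [
    ["cola", "hot dogs"],
    ["diet cola", "franks"],
    ["coke", "hot dog buns"],
]

# Two pairwise merges: columns[0] with columns[1], then that result with columns[2].
merged = reduce(merge_pair, columns)
print(merged)
```

The same fold extends to any number of transcribers, at the cost of one additional pairwise merge per extra column.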

In [4]:
%%time
DATA_PATH = '../Data/'
FILES_ALL_THREE = ['all_three_max_to_merge', 'all_three_maria_to_merge', 'all_three_samantha_to_merge']
COLS = [0, 1, 2, 3, 4]  # Index, ID, Session, Receipt, Item
DTYPES = {'ID': 'uint8', 'Session': 'uint8', 'Receipt': 'uint8', 'Item': 'string'}

dfs_all_three = [pd.read_csv(DATA_PATH + file + '.csv', index_col=0, usecols=COLS, dtype=DTYPES) for file in FILES_ALL_THREE]

Wall time: 2.02 s


In [5]:
%time df_merged01, df_merged01_wv = merge.merge([dfs_all_three[0], dfs_all_three[1]], word_vectors)

ID:   0%|          | 0/6 [00:00<?, ?it/s]

Wall time: 27min 26s


In [6]:
print(f"The merge used word_vectors heavily on {df_merged01_wv.shape[0]} rows")

The merge used word_vectors heavily on 142 rows


In [7]:
df_merged01_wv

Unnamed: 0,ID,Session,Receipt,Item1,Item2,Item,Distance
25,114,1,3,coca cola soda,diet cola,cola pepsi,4.81381
26,114,1,3,coca cola soda,diet cola,cola pepsi,4.81381
27,114,1,3,coca cola soda,diet cola,cola pepsi,4.81381
28,114,1,3,coca cola soda,diet cola,cola pepsi,4.81381
61,114,2,2,candy bars,assorted hershey chocolate,cookies,6.910234
65,114,3,1,coca cola soda,diet cola,cola pepsi,4.81381
73,114,3,1,hamburgers,frozen cooked burgers,burgers sandwiches,6.494545
76,114,3,1,hot dogs,meat franks,dog,8.567453
86,114,3,2,bistro salad bowl,chef salad,salad restaurant,5.505913
100,114,5,1,coca cola soda,diet cola,cola pepsi,4.81381


73 items were corrected by hand. An empty string (19 cases) indicates that there was no obvious connection between the items.

In [8]:
df_merged01.iloc[61, 3] = "chocolate bars"
df_merged01.iloc[73, 3] = "hamburgers"
df_merged01.iloc[76, 3] = "hot dogs"
df_merged01.iloc[106, 3] = "pie"
df_merged01.iloc[125, 3] = ""
df_merged01.iloc[136, 3] = "salad"
df_merged01.iloc[156, 3] = "mango popsicles"
df_merged01.iloc[162, 3] = ""
df_merged01.iloc[163, 3] = ""
df_merged01.iloc[191, 3] = ""
df_merged01.iloc[205, 3] = ""
df_merged01.iloc[216, 3] = "maple cereal"
df_merged01.iloc[205, 3] = ""
df_merged01.iloc[221, 3] = "walnuts"
df_merged01.iloc[223, 3] = "tilapia"
df_merged01.iloc[230, 3] = "havarti cheese"
df_merged01.iloc[258, 3] = "swiss cheese"
df_merged01.iloc[259, 3] = "havarti cheese"
df_merged01.iloc[279, 3] = "caesar salad"
df_merged01.iloc[294, 3] = "fruit"
df_merged01.iloc[306, 3] = "cake"
df_merged01.iloc[331, 3] = "blueberries"
df_merged01.iloc[332, 3] = "blueberries"
df_merged01.iloc[345, 3] = "salad"
df_merged01.iloc[400, 3] = "hot dogs"
df_merged01.iloc[401, 3] = ""
df_merged01.iloc[402, 3] = ""
df_merged01.iloc[416, 3] = "mint ice cream"
df_merged01.iloc[424, 3] = ""
df_merged01.iloc[432, 3] = "yogurt"
df_merged01.iloc[434, 3] = "yogurts"
df_merged01.iloc[447, 3] = ""
df_merged01.iloc[491, 3] = "fruit"
df_merged01.iloc[492, 3] = "fruit"
df_merged01.iloc[520, 3] = "milk"
df_merged01.iloc[521, 3] = "corvina"
df_merged01.iloc[526, 3] = "fruit"
df_merged01.iloc[553, 3] = "milk"
df_merged01.iloc[555, 3] = "beans"
df_merged01.iloc[559, 3] = "strawberries"
df_merged01.iloc[560, 3] = ""
df_merged01.iloc[594, 3] = "strawberries"
df_merged01.iloc[609, 3] = ""
df_merged01.iloc[610, 3] = ""
df_merged01.iloc[611, 3] = "candy"
df_merged01.iloc[627, 3] = "milk"
df_merged01.iloc[629, 3] = "strawberries"
df_merged01.iloc[637, 3] = "blackberry jam"
df_merged01.iloc[642, 3] = "chocolate candy"
df_merged01.iloc[665, 3] = "strawberries"
df_merged01.iloc[677, 3] = "milk"
df_merged01.iloc[685, 3] = ""
df_merged01.iloc[703, 3] = "salmon"
df_merged01.iloc[704, 3] = "strawberries"
df_merged01.iloc[705, 3] = "strawberries"
df_merged01.iloc[715, 3] = "milk"
df_merged01.iloc[730, 3] = "corvina"
df_merged01.iloc[741, 3] = "milk"
df_merged01.iloc[748, 3] = ""
df_merged01.iloc[768, 3] = "blueberries"
df_merged01.iloc[769, 3] = "blueberries"
df_merged01.iloc[772, 3] = ""
df_merged01.iloc[812, 3] = ""
df_merged01.iloc[815, 3] = "beef"
df_merged01.iloc[823, 3] = "romaine lettuce"
df_merged01.iloc[825, 3] = "garlic toast"
df_merged01.iloc[846, 3] = "vegetables"
df_merged01.iloc[854, 3] = "mangos"
df_merged01.iloc[889, 3] = ""
df_merged01.iloc[890, 3] = "chips"
df_merged01.iloc[891, 3] = "chips"
df_merged01.iloc[892, 3] = ""
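The hand corrections above could also be kept in a single table and applied in a loop. The sketch below is one possible arrangement on a toy frame; the `corrections` dictionary and the toy data are hypothetical and only mirror the column layout used in this notebook.

```python
import pandas as pd

# Toy frame standing in for a merged result; 'Item' is column 3, as in the cells above.
df = pd.DataFrame({
    "ID": [114, 114, 114],
    "Session": [2, 3, 3],
    "Receipt": [2, 1, 1],
    "Item": ["cookies", "burgers sandwiches", "dog"],
})

# Hypothetical corrections table: row position -> corrected item.
# An empty string means no obvious connection between the items.
corrections = {0: "chocolate bars", 1: "hamburgers", 2: ""}

item_col = df.columns.get_loc("Item")
for row, item in corrections.items():
    df.iloc[row, item_col] = item

n_empty = sum(1 for item in corrections.values() if item == "")
print(f"{len(corrections)} items corrected, {n_empty} left empty")
```

Keeping the corrections in one table lets the counts quoted in the text be computed rather than tallied by hand.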

Next, the previous result is merged with the third data set. We first examine the divergence to check that the data sets are similar enough for the computation to be reasonable, which here means a divergence of 8 or less for each receipt.

In [9]:
merge.divergence([df_merged01, dfs_all_three[2]], word_vectors)

ID: 114, Session: 1, Receipt: 1, Div: 0! []
ID: 114, Session: 1, Receipt: 2, Div: 0! []
ID: 114, Session: 1, Receipt: 3, Div: 0! []
ID: 114, Session: 2, Receipt: 1, Div: 0! []
ID: 114, Session: 2, Receipt: 2, Div: 1! [7]
ID: 114, Session: 3, Receipt: 1, Div: 3! [1, 2, 9]
ID: 114, Session: 3, Receipt: 2, Div: 0! []
ID: 114, Session: 5, Receipt: 1, Div: 2! [1, 2]
ID: 114, Session: 6, Receipt: 1, Div: 1! [2]
ID: 114, Session: 6, Receipt: 2, Div: 2! [2, 9]
ID: 114, Session: 6, Receipt: 3, Div: 0! []

ID: 137, Session: 1, Receipt: 1, Div: 2! [10, 20]
ID: 137, Session: 1, Receipt: 2, Div: 6! [5, 6, 7, 8, 9, 10]
ID: 137, Session: 4, Receipt: 1, Div: 2! [10, 11]
ID: 137, Session: 4, Receipt: 2, Div: 0! []

ID: 153, Session: 1, Receipt: 1, Div: 1! [14]
ID: 153, Session: 1, Receipt: 2, Div: 4! [0, 9, 10, 14]
ID: 153, Session: 5, Receipt: 1, Div: 7! [0, 8, 14, 27, 28, 29, 30]
ID: 153, Session: 5, Receipt: 2, Div: 6! [3, 4, 5, 9, 11, 12]

ID: 127, Session: 1, Receipt: 1, Div: 1! [9]
ID: 127, Sessi
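To check the threshold without reading every line, the printed report can be parsed and filtered. The sketch below assumes the report is available as text in the format shown above; the `over_threshold` helper is hypothetical, and the sample lines are copied from the output.

```python
import re

# Lines copied from the divergence output above.
report = """\
ID: 114, Session: 2, Receipt: 2, Div: 1! [7]
ID: 137, Session: 1, Receipt: 2, Div: 6! [5, 6, 7, 8, 9, 10]
ID: 153, Session: 5, Receipt: 1, Div: 7! [0, 8, 14, 27, 28, 29, 30]
"""

LINE = re.compile(r"ID: (\d+), Session: (\d+), Receipt: (\d+), Div: (\d+)!")
MAX_DIVERGENCE = 8  # threshold stated in the text


def over_threshold(text, limit=MAX_DIVERGENCE):
    """Return (ID, Session, Receipt, Div) tuples whose divergence exceeds `limit`."""
    rows = [tuple(map(int, match.groups())) for match in LINE.finditer(text)]
    return [row for row in rows if row[3] > limit]


print(over_threshold(report))
```

An empty list means every parsed receipt is within the threshold.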

In [10]:
%time df_merged01_2, df_merged01_2_wv = merge.merge([df_merged01, dfs_all_three[2]], word_vectors)
df_merged01_2.Item = df_merged01_2.Item.astype('string')

ID:   0%|          | 0/6 [00:00<?, ?it/s]

Wall time: 1min 36s


In [11]:
print(f"The merge used word_vectors heavily on {df_merged01_2_wv.shape[0]} rows")

The merge used word_vectors heavily on 89 rows


In [12]:
df_merged01_2_wv

Unnamed: 0,ID,Session,Receipt,Item1,Item2,Item,Distance
25,114,1,3,cola pepsi,cola diet soda,cola coke,5.05694
26,114,1,3,cola pepsi,cola diet soda,cola coke,5.05694
27,114,1,3,cola pepsi,cola diet soda,cola coke,5.05694
28,114,1,3,cola pepsi,cola diet soda,cola coke,5.05694
61,114,2,2,hershey nuggets,assorted candy bars,chocolate,8.14262
69,114,3,1,cola pepsi,cola diet soda,cola coke,5.05694
76,114,3,1,hot dogs,cola diet soda,drink,8.80994
86,114,3,2,salad restaurant,bistro bowl salad,salad cafe,5.19363
87,114,3,2,nut,walnut multigrain bread,nut pecan,7.949779
99,114,5,1,cola pepsi,cola diet soda,cola coke,5.05694


32 items were corrected by hand. An empty string (8 cases) indicates that there was no obvious connection between the items.

In [15]:
df_merged01_2.iloc[57, 3] = "candy bars"
df_merged01_2.iloc[76, 3] = ""
df_merged01_2.iloc[86, 3] = ""
df_merged01_2.iloc[87, 3] = "bread"
df_merged01_2.iloc[135, 3] = "barbecue sauce"
df_merged01_2.iloc[147, 3] = "salad dressing"
df_merged01_2.iloc[153, 3] = "salad dressing"
df_merged01_2.iloc[308, 3] = "cake"
df_merged01_2.iloc[346, 3] = "salad"
df_merged01_2.iloc[379, 3] = "dog treats"
df_merged01_2.iloc[401, 3] = ""
df_merged01_2.iloc[410, 3] = ""
df_merged01_2.iloc[471, 3] = ""
df_merged01_2.iloc[487, 3] = ""
df_merged01_2.iloc[488, 3] = ""
df_merged01_2.iloc[499, 3] = "soda"
df_merged01_2.iloc[516, 3] = "cheese"
df_merged01_2.iloc[564, 3] = "whipped cream"
df_merged01_2.iloc[595, 3] = "hot dogs"
df_merged01_2.iloc[640, 3] = "blackberry jam"
df_merged01_2.iloc[641, 3] = "raspberry jam"
df_merged01_2.iloc[673, 3] = "milk"
df_merged01_2.iloc[681, 3] = "whipped cream"
df_merged01_2.iloc[686, 3] = "tomato sauce"
df_merged01_2.iloc[747, 3] = ""
df_merged01_2.iloc[799, 3] = "sauce"
df_merged01_2.iloc[826, 3] = "romaine lettuce"
df_merged01_2.iloc[829, 3] = "soda"
df_merged01_2.iloc[855, 3] = "mangos"
df_merged01_2.iloc[888, 3] = "grain"
df_merged01_2.iloc[901, 3] = "cranberry sauce"

### Only Two
A collection of participants was transcribed only in Maria's and Samantha's data sets.

In [16]:
%%time
FILES_ONLY_TWO = ['only_two_maria_to_merge', 'only_two_samantha_to_merge']

dfs_only_two = [pd.read_csv(DATA_PATH + file + '.csv', index_col=0, usecols=COLS, dtype=DTYPES) for file in FILES_ONLY_TWO]

Wall time: 225 ms


In [17]:
merge.divergence(dfs_only_two, word_vectors)

ID: 148, Session: 2, Receipt: 1, Div: 5! [2, 3, 6, 9, 10]
ID: 148, Session: 2, Receipt: 2, Div: 7! [10, 16, 40, 41, 48, 49, 54]



In [18]:
%time df_merged_only_two, df_merged_only_two_wv = merge.merge(dfs_only_two, word_vectors)
df_merged_only_two.Item = df_merged_only_two.Item.astype('string')

ID:   0%|          | 0/1 [00:00<?, ?it/s]

Wall time: 23min 37s


In [20]:
print(f"The merge used word_vectors heavily on {df_merged_only_two_wv.shape[0]} rows")

The merge used word_vectors heavily on 13 rows


In [21]:
df_merged_only_two_wv

Unnamed: 0,ID,Session,Receipt,Item1,Item2,Item,Distance
8,148,2,1,disinfectant wipes,clorox,bleach,7.699841
9,148,2,1,cleaning tool,simply juice,using,7.489419
10,148,2,1,cleaning tool,simply juice,using,7.489419
48,148,2,2,plackers dental floss,floss picks,floss teeth,4.41709
50,148,2,2,starbucks coffee,beans coffee,coffee espresso,4.263258
53,148,2,2,crumbled feta,feta cheese,feta cheeses,4.138771
61,148,2,2,nitrile gloves,vinyl gloves,gloves latex,5.01352
63,148,2,2,zip locks bags,sandwich bag,bag plastic,7.139863
65,148,2,2,sprite,lemon lime soda,juice,8.022551
66,148,2,2,laundry detergent,lotion,shampoo,8.104012


5 items were corrected by hand. An empty string (4 cases) indicates that there was no obvious connection between the items.

In [22]:
df_merged_only_two.iloc[8, 3] = ""
df_merged_only_two.iloc[9, 3] = ""
df_merged_only_two.iloc[65, 3] = "soda"
df_merged_only_two.iloc[66, 3] = ""
df_merged_only_two.iloc[69, 3] = ""

### Concatenate All Sources of Data
After merging, there are five sources of unique data. Three are unique to the transcribers and two are the results of the merges above.

In [23]:
%%time
FILES_CLEAN = ['clean_max', 'clean_maria', 'clean_samantha']
COLS_CLEAN = [0, 1, 2, 3, 5]

dfs_clean = [pd.read_csv(DATA_PATH + file + '.csv', index_col=0, usecols=COLS_CLEAN, dtype=DTYPES) for file in FILES_CLEAN]

Wall time: 561 ms


In [24]:
dfs_unique = copy.deepcopy(dfs_clean)
for i in range(len(dfs_unique)):
    for j, df_other in enumerate(dfs_clean):
        if i != j:
            # Drop participants that also appear in another transcriber's data
            dfs_unique[i] = dfs_unique[i][~dfs_unique[i].ID.isin(df_other.ID.unique())]

dfs_all = [*dfs_unique, df_merged01_2, df_merged_only_two]
print(*[df.ID.unique() for df in dfs_all], sep='\n')

[129 136 144]
[154 159 128 119 134]
[131 139 145 149 157 162 113 118 126]
[114 137 153 127 130 135]
[148]
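Before concatenating, it can be worth confirming that no participant ID appears in more than one source. The sketch below runs that check on the ID lists printed above; the `overlapping_ids` helper is hypothetical. With the actual frames, each set could be built from `df.ID.unique()`.

```python
from itertools import combinations

# ID lists copied from the output above, one set per source.
id_sets = [
    {129, 136, 144},
    {154, 159, 128, 119, 134},
    {131, 139, 145, 149, 157, 162, 113, 118, 126},
    {114, 137, 153, 127, 130, 135},
    {148},
]


def overlapping_ids(sets):
    """Return {(i, j): shared IDs} for every pair of sources that share an ID."""
    return {
        (i, j): a & b
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if a & b
    }


overlaps = overlapping_ids(id_sets)
print(overlaps or "no participant appears in more than one source")
```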


In [25]:
for df in dfs_all:
    print(df.info())
    print()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 509 entries, 0 to 1536
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       509 non-null    uint8 
 1   Session  509 non-null    uint8 
 2   Receipt  509 non-null    uint8 
 3   Item     509 non-null    string
dtypes: string(1), uint8(3)
memory usage: 9.4 KB
None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1593 entries, 0 to 2055
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       1593 non-null   uint8 
 1   Session  1593 non-null   uint8 
 2   Receipt  1593 non-null   uint8 
 3   Item     1593 non-null   string
dtypes: string(1), uint8(3)
memory usage: 29.6 KB
None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1110 entries, 0 to 1109
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       1110 non-null   uint8 
 1   Session  111

In [26]:
df_merged_final = pd.concat(dfs_all).reset_index(drop=True)

In [27]:
df_merged_final = df_merged_final[df_merged_final.Item != ''].reset_index(drop=True)

In [28]:
print(df_merged_final.info())
display(df_merged_final.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4085 entries, 0 to 4084
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       4085 non-null   uint8 
 1   Session  4085 non-null   uint8 
 2   Receipt  4085 non-null   uint8 
 3   Item     4085 non-null   string
dtypes: string(1), uint8(3)
memory usage: 44.0 KB
None


Unnamed: 0,ID,Session,Receipt,Item
0,129,1,1,grapefruit soda
1,129,1,1,grapefruit soda
2,129,1,1,dried fruit
3,129,1,1,lemon wafers
4,129,1,1,vanilla wafers


In [29]:
df_merged_final.to_csv(f'{DATA_PATH}merged_full.csv')