# Fetch Data Challenge
You will be provided with a dataset of offers and some associated metadata around the retailers and brands that are sponsoring the offer.

You will also be provided with a dataset of some brands that we support on our platform, and the categories that those products belong
to.

# Acceptance Criteria
- If a user searches for a category (ex. diapers) the tool should return a list of offers that are relevant to that
category.
- If a user searches for a brand (ex. Huggies) the tool should return a list of offers that are relevant to that brand.
- If a user searches for a retailer (ex. Target) the tool should return a list of offers that are relevant to that retailer.
- The tool should also return the similarity score between the text input and each returned offer.
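
The criteria above can be sketched as a scored lookup. The following is a minimal illustration, not the final model: it assumes Jaccard token overlap as a stand-in for the similarity measure, and the helper names (`tokenize`, `jaccard`, `search_offers`) are hypothetical. The offer strings are taken from `offer_retailer.csv`.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search_offers(query, offers, top_k=3):
    """Return up to top_k (offer, score) pairs with score > 0, best first."""
    q = tokenize(query)
    scored = [(offer, jaccard(q, tokenize(offer))) for offer in offers]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Offer strings taken from offer_retailer.csv.
offers = [
    "Spend $10 at KFC",
    "Spend $270 at Pavilions",
    "Sargento Product",
    "Good Humor Viennetta Frozen Vanilla Cake",
]
results = search_offers("KFC", offers)
print(results)
```

Token overlap only finds literal matches, so a category query such as "diapers" would still need the brand and category tables (or an embedding model) to reach offers that never mention the word.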

# What we seek below:
- Detailed responses to each problem, with a focus on the production pipeline surrounding the model.
- Identification of several useful techniques for approaching eReceipt classification and entity extraction.
- Demonstrated knowledge of recent innovations in NLP and a willingness to treat the problem as software engineering rather than as an academic exercise.

## Data Inspection and EDA

In [1]:
# check current path
import os

# # Load the Drive helper and mount
# from google.colab import drive
# # This will prompt for authorization.
# drive.mount('/content/drive/')
# path_gdrive = '/content/drive/MyDrive/Colab Datasets/Fetch'
# os.chdir(path_gdrive)
# print(os.getcwd())

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/MyDrive/Colab Datasets/Fetch


In [None]:
!ls

 00-Data-EDA.ipynb    categories.csv			 offer_retailer.csv
 brand_category.csv  'Fetch _ Data Scientist, NLP.pdf'


### categories.csv

In [2]:
import pandas as pd

df_categories = pd.read_csv('categories.csv')
df_categories

Unnamed: 0,CATEGORY_ID,PRODUCT_CATEGORY,IS_CHILD_CATEGORY_TO
0,1f7d2fa7-a1d7-4969-aaf4-1244f232c175,Red Pasta Sauce,Pasta Sauce
1,3e48a9b3-1ab2-4f2d-867d-4a30828afeab,Alfredo & White Pasta Sauce,Pasta Sauce
2,09f3decc-aa93-460d-936c-0ddf06b055a3,Cooking & Baking,Pantry
3,12a89b18-4c01-4048-94b2-0705e0a45f6b,Packaged Seafood,Pantry
4,2caa015a-ca32-4456-a086-621446238783,Feminine Hygeine,Health & Wellness
...,...,...,...
113,0b039c0e-d33d-4356-b57b-83352d98623f,Frozen Turkey,Frozen Meat
114,3b79dd23-c298-4429-b9bf-ce5803b594eb,Frozen Chicken,Frozen Meat
115,7fbb4211-de07-4074-b359-aea21a7ad50c,Frozen Beef,Frozen Meat
116,a9ace557-fce3-4eec-9536-0e4b399987b7,Frozen Seafood,Frozen Meat


In [3]:
# How many categories are there? How many category IDs?
df_categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   CATEGORY_ID           118 non-null    object
 1   PRODUCT_CATEGORY      118 non-null    object
 2   IS_CHILD_CATEGORY_TO  118 non-null    object
dtypes: object(3)
memory usage: 2.9+ KB


In [4]:
cat_features = df_categories.select_dtypes(include=['object']).columns.tolist()

In [5]:
# number of unique values in each categorical column
for col in cat_features:
    print(col, df_categories[col].nunique())

CATEGORY_ID 118
PRODUCT_CATEGORY 118
IS_CHILD_CATEGORY_TO 23


In [6]:
# check which categories appear both as a product category and as a parent category
all_categories = df_categories['PRODUCT_CATEGORY'].unique().tolist()
upper_categories = df_categories['IS_CHILD_CATEGORY_TO'].unique().tolist()

common_categories = [i for i in all_categories if i in upper_categories]
print(common_categories)
print(len(common_categories))

['Candy', 'Frozen', 'Dairy', 'Household Supplies', 'Puffed Snacks', 'Oral Care', 'Spirits', 'Pasta & Noodles', 'Mature']
9
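
The same overlap can also be computed with set operations, which additionally separate leaf categories from top-level parents. This is a small sketch on a toy subset of the two columns; the values are taken from the notebook outputs, but the subset itself is chosen only for illustration.

```python
# Toy subset of the two columns; values are taken from the notebook outputs.
product_categories = ["Wine", "Beer", "Spirits", "Frozen Beef", "Candy"]
parent_categories = ["Alcohol", "Frozen Meat", "Spirits", "Candy", "Snacks"]

# Categories that are both a child and a parent (intermediate nodes).
both = sorted(set(product_categories) & set(parent_categories))

# Categories that never act as a parent (leaves).
leaves = sorted(set(product_categories) - set(parent_categories))

# Parents that never appear as a product category (top-level nodes).
top_level = sorted(set(parent_categories) - set(product_categories))

print(both)       # ['Candy', 'Spirits']
print(leaves)     # ['Beer', 'Frozen Beef', 'Wine']
print(top_level)  # ['Alcohol', 'Frozen Meat', 'Snacks']
```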


In [35]:
upper_categories

['Pasta Sauce',
 'Pantry',
 'Health & Wellness',
 'Deli & Bakery',
 'Dairy',
 'Beverages',
 'Frozen',
 'Home & Garden',
 'Snacks',
 'Baby & Toddler',
 'Household Supplies',
 'Alcohol',
 'Oral Care',
 'Spirits',
 'Pasta & Noodles',
 'Puffed Snacks',
 'Meat & Seafood',
 'Beauty',
 'Mature',
 'Animals & Pet Supplies',
 'Sports Drinks & Enhanced Waters',
 'Frozen Meat',
 'Candy']

In [30]:
import re

def search_list(pattern, string_list):
    """Return the strings matching a regex pattern (case-insensitive), or None if nothing matches."""
    matches = [s for s in string_list if re.search(pattern, s, flags=re.IGNORECASE)]
    if not matches:
        print("no matches")
        return None
    return matches

In [33]:
search_list(r'Alcohol', all_categories)

['Wine']

In [37]:
df_categories[df_categories["PRODUCT_CATEGORY"].eq('Wine')]

Unnamed: 0,CATEGORY_ID,PRODUCT_CATEGORY,IS_CHILD_CATEGORY_TO
50,ca1c0f4d-3efc-4978-8357-69862996f416,Wine,Alcohol


In [38]:
df_categories[df_categories["IS_CHILD_CATEGORY_TO"].eq('Alcohol')]

Unnamed: 0,CATEGORY_ID,PRODUCT_CATEGORY,IS_CHILD_CATEGORY_TO
29,a3ef4899-2bd2-4cac-bb31-a46ef1169c8c,Beer,Alcohol
36,6b770f8d-09b5-4491-8bbb-31ee38645b20,Malt Beverages,Alcohol
39,d5cba6a8-9ac7-40dd-9692-f9e890a48ca4,"Hard Seltzers, Sodas, Waters, Lemonades & Teas",Alcohol
40,b06d9098-f313-4ba8-88aa-a001db2759d8,Hard Ciders,Alcohol
50,ca1c0f4d-3efc-4978-8357-69862996f416,Wine,Alcohol
82,2def9983-9d56-4872-a5b9-7aa8bbc2331c,Spirits,Alcohol


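For categories.csv, we know:
- There are 118 categories in total, each with a unique CATEGORY_ID.
- There are 23 distinct parent categories (IS_CHILD_CATEGORY_TO), 9 of which also appear as product categories.
- The remaining 118 - 9 = 109 categories are leaf categories that never act as a parent.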
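
Since some categories are both a child and a parent, the hierarchy appears to go deeper than two levels in places, so a search for a parent such as "Alcohol" may need to expand to everything beneath it. The sketch below assumes a simple parent-to-children map. The Alcohol rows come from the notebook output; the "Vodka" row is hypothetical and only illustrates a third level. `descendants` is an illustrative helper, not part of the notebook.

```python
from collections import defaultdict

# (PRODUCT_CATEGORY, IS_CHILD_CATEGORY_TO) pairs.
rows = [
    ("Beer", "Alcohol"),
    ("Wine", "Alcohol"),
    ("Spirits", "Alcohol"),
    ("Vodka", "Spirits"),  # hypothetical row, for illustration only
    ("Frozen Beef", "Frozen Meat"),
]

# Map each parent to its direct children, preserving row order.
children = defaultdict(list)
for child, parent in rows:
    children[parent].append(child)

def descendants(category):
    """Return every category below `category`, depth-first, in row order."""
    found = []
    seen = set()
    stack = list(reversed(children.get(category, [])))
    while stack:
        node = stack.pop()
        if node in seen:  # guard against cycles in dirty data
            continue
        seen.add(node)
        found.append(node)
        stack.extend(reversed(children.get(node, [])))
    return found

print(descendants("Alcohol"))  # ['Beer', 'Wine', 'Spirits', 'Vodka']
print(descendants("Wine"))     # []
```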

### offer_retailer.csv

In [39]:
df_retailer = pd.read_csv('offer_retailer.csv')
df_retailer

Unnamed: 0,OFFER,RETAILER,BRAND
0,Spend $50 on a Full-Priced new Club Membership,SAMS CLUB,SAMS CLUB
1,"Beyond Meat® Plant-Based products, spend $25",,BEYOND MEAT
2,Good Humor Viennetta Frozen Vanilla Cake,,GOOD HUMOR
3,"Butterball, select varieties, spend $10 at Dil...",DILLONS FOOD STORE,BUTTERBALL
4,"GATORADE® Fast Twitch®, 12-ounce 12 pack, at A...",AMAZON,GATORADE
...,...,...,...
379,Spend $10 at KFC,KFC,KFC
380,Sargento Product,,SARGENTO
381,Thomas'® Bagel Thins,,THOMAS
382,Spend $270 at Pavilions,PAVILIONS,PAVILIONS


In [9]:
df_retailer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OFFER     384 non-null    object
 1   RETAILER  238 non-null    object
 2   BRAND     384 non-null    object
dtypes: object(3)
memory usage: 9.1+ KB


In [10]:
# notice a column with NaN; check null counts
df_retailer.isnull().sum()

OFFER         0
RETAILER    146
BRAND         0
dtype: int64
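
146 of the 384 offers (about 38%) have no retailer. These rows appear to be brand-level offers that are not tied to a specific store, so they should probably be kept rather than dropped. A minimal sketch of how they could be separated and made safe for text matching, using five rows copied from the preview above:

```python
import pandas as pd

# Five rows copied from the offer_retailer.csv preview.
df = pd.DataFrame({
    "OFFER": [
        "Spend $50 on a Full-Priced new Club Membership",
        "Beyond Meat® Plant-Based products, spend $25",
        "Good Humor Viennetta Frozen Vanilla Cake",
        "Spend $10 at KFC",
        "Sargento Product",
    ],
    "RETAILER": ["SAMS CLUB", None, None, "KFC", None],
    "BRAND": ["SAMS CLUB", "BEYOND MEAT", "GOOD HUMOR", "KFC", "SARGENTO"],
})

# Fraction of offers with no retailer.
null_share = df["RETAILER"].isna().mean()

# Offers that are not tied to a retailer.
brand_only = df[df["RETAILER"].isna()]

# Replace missing retailers with an empty string before any string matching.
retailer_text = df["RETAILER"].fillna("")

print(null_share)
print(brand_only["BRAND"].tolist())
```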

In [11]:
# check the number of unique values in each column
for col in df_retailer.columns:
    print(col, df_retailer[col].nunique())

OFFER 376
RETAILER 61
BRAND 144
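
OFFER has 376 unique values across 384 rows, so some offer texts repeat. A short sketch of how the repeated rows could be inspected; the data is a toy frame, and the duplicated "Sargento Product" row with retailer "TARGET" is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "OFFER": ["Spend $10 at KFC", "Sargento Product", "Sargento Product"],
    "RETAILER": ["KFC", None, "TARGET"],  # the TARGET row is hypothetical
    "BRAND": ["KFC", "SARGENTO", "SARGENTO"],
})

# Rows beyond the first occurrence of each offer text.
n_extra = len(df) - df["OFFER"].nunique()

# keep=False marks every copy of a repeated offer, not only the later ones.
dupes = df[df["OFFER"].duplicated(keep=False)]

print(n_extra)
print(dupes)
```

If the repeated texts differ only by retailer, they are distinct offers and should stay; if the whole row repeats, `df.drop_duplicates()` would remove the copies.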


In [42]:
# check for overlap between retailers and brands
retailers = df_retailer['RETAILER'].unique().tolist()
brands = df_retailer['BRAND'].unique().tolist()

# which retailers also appear as brands?
[i for i in retailers if i in brands]

['SAMS CLUB',
 'ZAXBYS',
 'SUBWAY',
 'SHAWS',
 'ACME',
 'KFC',
 'CASEYS GENERAL STORE',
 'RANDALLS',
 'VONS',
 'FRESH THYME MARKET',
 'ALBERTSONS',
 'BJS WHOLESALE',
 'TOM THUMB',
 'SAFEWAY',
 'TGI FRIDAYS',
 'PAVILIONS',
 'STAR MARKET',
 'BLUE APRON',
 'MCALISTERS DELI',
 'COSTCO',
 'FARMER BOYS',
 'CHEWY',
 'DICKEYS BARBECUE PIT',
 'CVS',
 'BURGER KING']

In [None]:
# TODO: also check the length distribution of the "OFFER" column
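
One way this check could be done is with pandas string accessors. The sketch below uses four offer strings from the preview above rather than the full column, so the summary statistics are illustrative only.

```python
import pandas as pd

# Four offer strings copied from the offer_retailer.csv preview.
offers = pd.Series(
    [
        "Spend $10 at KFC",
        "Sargento Product",
        "Spend $270 at Pavilions",
        "Good Humor Viennetta Frozen Vanilla Cake",
    ],
    name="OFFER",
)

# Length in characters and in whitespace-separated words.
char_len = offers.str.len()
word_len = offers.str.split().str.len()

# Summary statistics (count, mean, std, min, quartiles, max) for both lengths.
summary = pd.DataFrame({"chars": char_len, "words": word_len}).describe()
print(summary)
```

On the real data, `df_retailer["OFFER"]` would replace the toy series. Short offers carry little text to match against, which matters when choosing a similarity method.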

### brand_category.csv

In [14]:
df_brand = pd.read_csv('brand_category.csv')
df_brand

Unnamed: 0,BRAND,BRAND_BELONGS_TO_CATEGORY,RECEIPTS
0,CASEYS GEN STORE,Tobacco Products,2950931
1,CASEYS GEN STORE,Mature,2859240
2,EQUATE,Hair Removal,893268
3,PALMOLIVE,Bath & Body,542562
4,DAWN,Bath & Body,301844
...,...,...,...
9901,WIBBY BREWING,Beer,11
9902,LA FETE DU ROSE,Wine,11
9903,BIG ISLAND BREWHAUS,Beer,11
9904,BRIDGE LANE,Wine,11


In [15]:
df_brand.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9906 entries, 0 to 9905
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   BRAND                      9905 non-null   object
 1   BRAND_BELONGS_TO_CATEGORY  9906 non-null   object
 2   RECEIPTS                   9906 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 232.3+ KB


In [19]:
# why is one BRAND null?
df_brand[df_brand["BRAND"].isnull()]

Unnamed: 0,BRAND,BRAND_BELONGS_TO_CATEGORY,RECEIPTS
6624,,Beer,33
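
One possible explanation, not confirmed by the data shown here, is that the brand is a literal string such as "NA" or "NULL", which `pd.read_csv` treats as missing by default. The sketch below demonstrates that behavior on a hypothetical two-row CSV; the "NA" brand is an assumption made purely for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first brand is the literal string "NA".
csv_text = (
    "BRAND,BRAND_BELONGS_TO_CATEGORY,RECEIPTS\n"
    "NA,Beer,33\n"
    "WIBBY BREWING,Beer,11\n"
)

# Default parsing converts "NA" to NaN.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps such strings as-is.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["BRAND"].isna().sum())
print(literal["BRAND"].tolist())
```

If the value turns out to be genuinely missing, `df_brand.dropna(subset=["BRAND"])` would remove the row.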


In [20]:
# check the number of unique values in each column
for col in df_brand.columns:
    print(col, df_brand[col].nunique())

BRAND 8521
BRAND_BELONGS_TO_CATEGORY 118
RECEIPTS 2097


One brand can belong to multiple categories (9,906 rows cover only 8,521 unique brands), and one category contains multiple brands.
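
This many-to-many relationship can be checked with a grouped unique count in each direction. A minimal sketch, using the first five rows of the `brand_category.csv` preview above:

```python
import pandas as pd

# First five rows of the brand_category.csv preview.
df = pd.DataFrame({
    "BRAND": ["CASEYS GEN STORE", "CASEYS GEN STORE", "EQUATE", "PALMOLIVE", "DAWN"],
    "BRAND_BELONGS_TO_CATEGORY": [
        "Tobacco Products", "Mature", "Hair Removal", "Bath & Body", "Bath & Body",
    ],
    "RECEIPTS": [2950931, 2859240, 893268, 542562, 301844],
})

# Number of distinct categories per brand, and distinct brands per category.
cats_per_brand = df.groupby("BRAND")["BRAND_BELONGS_TO_CATEGORY"].nunique()
brands_per_cat = df.groupby("BRAND_BELONGS_TO_CATEGORY")["BRAND"].nunique()

# Brands that span more than one category.
multi_category_brands = cats_per_brand[cats_per_brand > 1].index.tolist()

print(multi_category_brands)          # ['CASEYS GEN STORE']
print(brands_per_cat["Bath & Body"])  # 2
```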