In [1]:
# Fetch provides value to our user base through the rich variety of offers that are active in the app. 
# We want our users to be able to easily seek out offers in the app, so that they get the most out of using the app and our partners get the most out of their relationship with Fetch.
# For this assignment, you will build a tool that allows users to intelligently search for offers via text input from the user.
# You will be provided with a dataset of offers and some associated metadata around the retailers and brands that are sponsoring the offer. 
# You will also be provided with a dataset of some brands that we support on our platform, and the categories that those products belong to.

# Acceptance Criteria
# If a user searches for a category (ex. diapers) the tool should return a list of offers that are relevant to that category.
# If a user searches for a brand (ex. Huggies) the tool should return a list of offers that are relevant to that brand.
# If a user searches for a retailer (ex. Target) the tool should return a list of offers that are relevant to that retailer.
# The tool should also return the score that was used to measure the similarity of the text input with each offer.

# Feel free to make the deployment of this tool as sophisticated as you'd like, whether it's a simple command line tool, a web app, or something else entirely. 
# We're looking for a tool that is easy to use and provides a good user experience.
# Your submission must include:
# A link to a GitHub repository containing your code
# A link to a hosted version of your tool (if applicable)
# A brief writeup of your approach to the problem, including any assumptions you made and any tradeoffs you considered
# Instructions on how to run your tool locally, if applicable

In [2]:
# IMPORTS
import texthero as hero # if the environment fails to import this, first run: pip install gensim==3.8.1
import pandas as pd
pd.set_option('display.max_colwidth', 100) # want to be able to read the offers. 
import numpy as np
import tensorflow_hub as hub 

# Load pre-trained universal sentence encoder model 
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4") 

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [3]:
# READING IN DATA
from pathlib import Path
# Local data folder; change this to wherever the CSVs live on your machine.
DATA_DIR = Path(r"C:\Users\laura\Desktop\Career Prep\Fetch\DS_NLP_search_data")
df_offers = pd.read_csv(DATA_DIR / "offer_retailer.csv")
df_brand = pd.read_csv(DATA_DIR / "brand_category.csv")
df_categories = pd.read_csv(DATA_DIR / "categories.csv")

In [4]:
# checking overlap of categories before merge
s_cat = set(df_brand['BRAND_BELONGS_TO_CATEGORY'])
print(len(s_cat))
s_cat_prod = set(df_categories['PRODUCT_CATEGORY'])
print(len(s_cat_prod))
s_overlap = s_cat.intersection(s_cat_prod)
print(len(s_overlap))

118
118
118


In [5]:
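# Inner merge on BRAND: offers whose brand is missing from df_brand are dropped,
# and a brand listed under several categories yields one row per category.
df_interim = df_offers.merge(df_brand, on='BRAND')
# Align the key name so the parent category can be joined on.
df_categories = df_categories.rename(columns={'PRODUCT_CATEGORY': 'BRAND_BELONGS_TO_CATEGORY'})
df_final = df_interim.merge(df_categories[['BRAND_BELONGS_TO_CATEGORY', 'IS_CHILD_CATEGORY_TO']], on='BRAND_BELONGS_TO_CATEGORY')
df_final
# Note: I wasn't able to figure out what the 'RECEIPTS' column means.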

Unnamed: 0,OFFER,RETAILER,BRAND,BRAND_BELONGS_TO_CATEGORY,RECEIPTS,IS_CHILD_CATEGORY_TO
0,"Beyond Meat® Plant-Based products, spend $25",,BEYOND MEAT,Plant-Based Meat,1584,Meat & Seafood
1,"Beyond Steak™ Plant-Based seared tips, 10 ounce at Target",TARGET,BEYOND MEAT,Plant-Based Meat,1584,Meat & Seafood
2,"Beyond Steak™ Plant-Based seared tips, 10 ounce, buy 2 at H-E-B",H-E-B,BEYOND MEAT,Plant-Based Meat,1584,Meat & Seafood
3,"Beyond Steak™ Plant-Based seared tips, 10 ounce at H-E-B",H-E-B,BEYOND MEAT,Plant-Based Meat,1584,Meat & Seafood
4,"Beyond Steak™ Plant-Based seared tips, 10 ounce, buy 2 at Target",TARGET,BEYOND MEAT,Plant-Based Meat,1584,Meat & Seafood
...,...,...,...,...,...,...
771,"Glad® Trash Bags, 4 OR 8 Gallon",,GLAD,Food Storage,268,Household Supplies
772,Glad® ForceFlex Max Strength Trash Bags,,GLAD,Food Storage,268,Household Supplies
773,:ratio™ KETO* Friendly Cereal OR Granola,,RATIO,Yogurt,1131,Dairy
774,Nature Valley™ Protein Granola,,NATURE VALLEY,Trail Mix,370,Snacks
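
The two merges above use pandas' default inner join, so an offer can disappear (its brand has no row in the brand table) or be repeated (its brand is listed under more than one category). The sketch below shows one way to audit that with `indicator=True`. The frames `offers` and `brands` and every value in them are made-up stand-ins, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df_offers and df_brand.
offers = pd.DataFrame({
    "OFFER": ["Offer A", "Offer B", "Offer C"],
    "BRAND": ["BRAND_X", "BRAND_Y", "BRAND_Z"],
})
brands = pd.DataFrame({
    "BRAND": ["BRAND_X", "BRAND_X", "BRAND_Y"],
    "BRAND_BELONGS_TO_CATEGORY": ["Cat 1", "Cat 2", "Cat 1"],
})

# A left merge keeps every offer; the _merge column records where each row matched.
audit = offers.merge(brands, on="BRAND", how="left", indicator=True)

# Offers an inner merge would silently drop (no matching brand row).
unmatched = audit.loc[audit["_merge"] == "left_only", "OFFER"].tolist()

# Offers that appear more than once because their brand maps to several categories.
counts = audit["OFFER"].value_counts()
duplicated = counts[counts > 1].index.tolist()

print("dropped by an inner merge:", unmatched)
print("repeated across categories:", duplicated)
```

Running the same check on the real frames would show whether the row count of `df_final` reflects lost offers, repeated offers, or both.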


In [6]:
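# Cleaning and normalizing the text columns.
# hero.clean applies texthero's default pipeline (lowercasing, removing punctuation, stopwords, etc., and filling NaN).
# OFFER keeps its raw text for display; the other columns are overwritten with their cleaned versions.
df_final['OFFER_clean'] = hero.clean(df_final['OFFER'])
df_final['RETAILER'] = hero.clean(df_final['RETAILER'])
df_final['BRAND'] = hero.clean(df_final['BRAND'])
df_final['BRAND_BELONGS_TO_CATEGORY'] = hero.clean(df_final['BRAND_BELONGS_TO_CATEGORY'])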

# get embeddings for the cleaned offer column (pass a plain list of strings to the encoder)
embeddings = embed(df_final['OFFER_clean'].tolist())
# store each embedding as a plain Python list, one per row
df_final['use'] = np.array(embeddings).tolist()
# df_final['pca'] = hero.pca(df_final['use'])
# ^ tried dimensionality reduction, but it was less successful than using the raw embeddings.
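
The acceptance criteria ask for a similarity score alongside each returned offer, and the notebook does not show how the stored embeddings are compared to a query. A minimal sketch of one common choice, cosine similarity, follows. The function name `rank_by_cosine` and the 3-dimensional toy vectors are assumptions for illustration; in the real tool the query vector would presumably come from running the cleaned search text through the same `embed` model, which returns 512-dimensional vectors.

```python
import numpy as np

def rank_by_cosine(query_vec, offer_vecs, top_k=3):
    """Return (row_index, score) pairs for the top_k most similar offers."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(offer_vecs, dtype=float)
    # Normalise both sides so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    # Sort descending by score and keep the best top_k rows.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for the offer embeddings and an embedded query.
offer_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vec = [1.0, 0.0, 0.0]

results = rank_by_cosine(query_vec, offer_vecs, top_k=2)
print(results)
```

The returned indices can be used with `df_final.iloc[...]` to pull the matching offers, and the scores can be shown next to them.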

In [7]:
df_final = df_final.rename(columns={'BRAND_BELONGS_TO_CATEGORY': 'CATEGORY'})
# renaming for more user-friendly search

In [8]:
df_final.to_csv("C:\\Users\\laura\\Desktop\\Career Prep\\Fetch\\offer_search_tool\\SearchData.csv", index = False)
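
One caveat when saving to CSV: a column of Python lists, like `use`, is written out as text, so each embedding comes back as a string when the file is read again. The sketch below shows the round trip on a toy frame and one way to recover the numeric vectors. It assumes the embeddings contain only finite floats, so their printed form is valid JSON; the frame `df` and its values are illustrative only.

```python
import io
import json

import numpy as np
import pandas as pd

# Toy frame with a list-valued column, mirroring the shape of df_final['use'].
df = pd.DataFrame({"OFFER": ["Offer A", "Offer B"], "use": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After the round trip each embedding is a string such as "[0.1, 0.2]".
was_string = isinstance(loaded.loc[0, "use"], str)

# Parse the strings back into lists, then stack them into a 2-D array.
loaded["use"] = loaded["use"].apply(json.loads)
matrix = np.array(loaded["use"].tolist())
print(was_string, matrix.shape)
```

An alternative that avoids the parsing step is to save the embedding matrix separately (for example with `np.save`) and keep only the text columns in the CSV.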