Notes, thoughts, and analysis:

I first tested the process on a single PDP (product detail page) (blocks 1-5). Then I implemented a crawler (block 6). Filtering and parsing the crawler's output turned out to be too complicated to implement by hand, so I chose a small set of webpages (17 URLs in block 7) to form a larger document for testing (blocks 7 and 8).

I implemented the chat functionality with two libraries: a question-answering BERT model (loaded through transformers) and LangChain. I have not carried out comprehensive testing, but LangChain performs better overall, and it does better on recommendation questions than on questions about product details.

In the future we might face a design choice between optimizing the accuracy of answers to detail questions and giving contextual recommendations. I am not familiar with webpage indexing, so I would like to know which should carry more weight. For the former, we could fine-tune the model. For the latter, we could retrieve more information about related products from the PDPs.

### 1. Get contents from a single webpage

In [1]:
import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.jcrew.com/p/womens/categories/clothing/blazers/lady-jacket/odette-sweater-lady-jacket-in-cotton-blend-boucleacute/BR789?display=standard&fit=Classic&color_name=kelly-green&colorProductCode=BR789'
response = requests.get(url)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Hard-coded for this site: keep the last non-empty <script> tag, which holds the product data as JSON
allTags = soup.find_all('script')
descriptionStr = None
for tag in allTags:
    if tag.string and tag.string.strip():
        descriptionStr = tag.string.strip()
descriptionDict = json.loads(descriptionStr)['props']['initialState']['products']['productsByProductCode']
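
The loop above keeps whichever non-empty script tag comes last, so it only works while the page puts its state JSON in the final script. A more defensive variant could try each script in turn and keep the first one that parses as JSON and contains the expected key path. The sketch below assumes the script contents have already been collected into a list of strings (for example `[tag.string for tag in allTags]`); `pick_product_state` and the sample data are hypothetical.

```python
import json

# Key path used in the cell above to reach the product data.
STATE_PATH = ("props", "initialState", "products", "productsByProductCode")

def pick_product_state(script_texts):
    """Return the first productsByProductCode dict found in the scripts, else None."""
    for text in script_texts:
        if not text or not text.strip():
            continue  # empty or external script
        try:
            node = json.loads(text)
        except json.JSONDecodeError:
            continue  # plain JavaScript, not JSON
        # Walk down the key path, giving up as soon as a level is missing.
        for key in STATE_PATH:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node is not None:
            return node
    return None

# Hypothetical stand-ins for script contents; only the second holds product state.
sample_scripts = [
    "window.dataLayer = [];",
    json.dumps({"props": {"initialState": {"products": {
        "productsByProductCode": {"BR789": {"productCode": "BR789"}}}}}}),
    None,
]
found = pick_product_state(sample_scripts)
print(found)
```

Returning `None` instead of raising lets the caller decide whether a page without product state is an error or simply not a PDP.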

### 2. View the data

In [2]:
def print_nested_dict_keys(d, prefix=""):
    for k, v in d.items():
        if isinstance(v, dict):
            print_nested_dict_keys(v, prefix + k + ".")
        else:
            print(prefix + k + str(v))
print_nested_dict_keys(descriptionDict)

BR789.productCodeBR789
BR789.productDataFetchedFalse
BR789.lastUpdated0
BR789.deliveryMethod
BR789.variationProductCodeNone
BR789.productNameOdette sweater lady jacket in cotton-blend boucl&eacute;
BR789.pdpIntlMessage
BR789.isPreorderTrue
BR789.isProductOutOfStockFalse
BR789.shipRestrictedFalse
BR789.isFindInStoreFalse
BR789.styledWithSkus
BR789.shopTheLookItems[]
BR789.swatchOrderAlphabeticalFalse
BR789.isFreeShippingFalse
BR789.brand
BR789.brandLink
BR789.marketplaceAttributesNone
BR789.sizeChart1,0
BR789.url/p/womens/categories/clothing/blazers/lady-jacket/odette-sweater-lady-jacket-in-cotton-blend-boucleacute/BR789?display=standard
BR789.promoTextNone
BR789.excludePromoTrue
BR789.defaultColorName
BR789.baseProductCodeBR789
BR789.defaultColorCode
BR789.baseProductColorCode
BR789.shotTypes['eiec']
BR789.genderwomen
BR789.productDescriptionRomanceSomewhere between a jacket and a cardigan, this easy layer is perfect for days that feel somewhere between seasons. Featuring a crochet tri
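
In the output above, each key runs straight into its value (`BR789.productCodeBR789`), and the product name still carries the HTML entity `&eacute;`. A small sketch of a more readable dump, assuming the same nested-dict shape; `flatten` and the miniature `sample` dict are hypothetical, and `html.unescape` comes from the standard library.

```python
import html

def flatten(d, prefix=""):
    """Yield (dotted_path, value) pairs for every non-dict leaf of a nested dict."""
    for key, value in d.items():
        path = prefix + str(key)
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

# Hypothetical miniature of the structure printed above.
sample = {"BR789": {
    "productCode": "BR789",
    "productName": "Odette sweater lady jacket in cotton-blend boucl&eacute;",
    "isPreorder": True,
}}

lines = []
for path, value in flatten(sample):
    if isinstance(value, str):
        value = html.unescape(value)  # decodes entities such as "&eacute;"
    lines.append(f"{path} = {value!r}")
print("\n".join(lines))
```

Decoding the entities before the text is written to the context file would also keep them out of the retrieved chunks.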

### 3. Use Depth First Search, load selected tags and values

In [3]:
# Ignored related products for now, focus only on product details
info = {'productCode':0,'title':'', 'gender':'', 'listPrice':'', 'productDescriptionRomance':'', 'productDescriptionTech':'', 'productDescriptionFit':'', 'colorsList':None, 'sizesMap':None}
def loadInfo(d):
    # Depth-first walk over nested dicts; fills the module-level `info` dict in place
    for k, v in d.items():
        if k in info:
            if k == 'listPrice':
                info[k] = str(v['formatted'])
                continue
            if isinstance(v, dict): info[k] = list(v.keys())
            else: info[k] = v

        if isinstance(v, dict):
            loadInfo(v)
loadInfo(descriptionDict)

def contextualize(info): 
    if info['colorsList'] is not None:
        colorsList = []
        for element in info['colorsList']:
            for color in element['colors']: 
                if 'name' in color: colorsList.append(color['name'])
                if 'color' in color: colorsList.append(color['color'])
        info['colorsList'] = colorsList
    context = ''
    context += str(info['productCode']) + ' is ' + str(info['title']) + '\n'
    context += str(info['productCode']) + ' is for ' + str(info['gender']) + '\n'
    context += str(info['productCode']) + ' costs ' + str(info['listPrice']) + '\n'
    context += 'For ' + str(info['productCode']) + ', ' + info['productDescriptionRomance'] + '\n'
    for description in info['productDescriptionTech']:
        context += str(info['productCode']) + ' : ' + description + '\n'
    for description in info['productDescriptionFit']:
        context += str(info['productCode']) + ' has ' + description + '\n'

    if info['colorsList'] is not None:
        # join the names so the line does not end in a dangling ', and '
        context += str(info['productCode']) + ' has color ' + ', and '.join(info['colorsList'])
    context += '\n'
    
    if info['sizesMap'] is not None:
        # str() guards against the integer default productCode (0)
        sizes = [size for size in info['sizesMap'] if size is not None]
        context += str(info['productCode']) + ' has size ' + ', and '.join(sizes)
    return context

context = contextualize(info)
with open('context.txt', 'w') as file:
    file.write(context)
    file.write('\n\n')
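
`loadInfo` above writes into a module-level `info` dict, which has to be reset by hand before each new product. One way to avoid that shared state is an iterative depth-first search that returns a fresh dict. This is a sketch under assumptions, not a drop-in replacement: `collect_fields` is a hypothetical name, it keeps the first match per key rather than the last, and it returns raw values without the `listPrice` and dict-to-key-list handling of the original.

```python
def collect_fields(tree, wanted):
    """Iterative depth-first search; returns {key: value} for keys in `wanted`."""
    found = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        for key, value in node.items():
            if key in wanted and key not in found:
                found[key] = value  # first match wins
            if isinstance(value, dict):
                stack.append(value)  # descend into nested dicts only
    return found

# Hypothetical miniature of the product data.
sample = {"BR789": {
    "productCode": "BR789",
    "gender": "women",
    "price": {"listPrice": {"amount": 178, "formatted": "$178.00"}},
}}

fields = collect_fields(sample, {"productCode", "gender", "listPrice"})
print(fields["productCode"], fields["gender"], fields["listPrice"]["formatted"])
```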

### 4. Try a BERT model as baseline

In [105]:
from transformers import BertForQuestionAnswering
from transformers import pipeline
from transformers import AutoTokenizer
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)

In [106]:
questions = [
    "what is this product",
    "what is this product made of",
    "Is this product a dress or a jacket",
    "what is the title of this product",
    "what is the price for this product",
    "is this product for man or women",
    "is this product for summer or winter"
]
for question in questions:

    answer = nlp({
        'question':question,
        'context': context
    })
    print(question + " : " + str(answer))

what is this product : {'score': 0.07867526262998581, 'start': 323, 'end': 362, 'answer': 'this sweet style is soft, snug and chic'}
what is this product made of : {'score': 0.04123115539550781, 'start': 323, 'end': 362, 'answer': 'this sweet style is soft, snug and chic'}
Is this product a dress or a jacket : {'score': 0.31645408272743225, 'start': 165, 'end': 175, 'answer': 'a cardigan'}
what is the title of this product : {'score': 0.04958398640155792, 'start': 127, 'end': 132, 'answer': 'BR789'}
what is the price for this product : {'score': 0.5577917695045471, 'start': 115, 'end': 122, 'answer': '$178.00'}
is this product for man or women : {'score': 0.13194237649440765, 'start': 78, 'end': 83, 'answer': 'women'}
is this product for summer or winter : {'score': 0.0428488552570343, 'start': 208, 'end': 248, 'answer': 'days that feel somewhere between seasons'}
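
Most of the scores above are low, so one simple way to use the baseline is to accept an answer only when its score clears a threshold. The sketch below copies the answers from the output above (scores rounded to four decimal places); `confident_answers` and the 0.3 cut-off are hypothetical and have not been tuned.

```python
# (question, score, answer) triples copied from the output above.
results = [
    ("what is this product", 0.0787, "this sweet style is soft, snug and chic"),
    ("what is this product made of", 0.0412, "this sweet style is soft, snug and chic"),
    ("Is this product a dress or a jacket", 0.3165, "a cardigan"),
    ("what is the title of this product", 0.0496, "BR789"),
    ("what is the price for this product", 0.5578, "$178.00"),
    ("is this product for man or women", 0.1319, "women"),
    ("is this product for summer or winter", 0.0428, "days that feel somewhere between seasons"),
]

THRESHOLD = 0.3  # hypothetical cut-off, not tuned

def confident_answers(rows, threshold=THRESHOLD):
    """Keep only (question, answer) pairs whose score reaches the threshold."""
    return [(question, answer) for question, score, answer in rows if score >= threshold]

kept = confident_answers(results)
for question, answer in kept:
    print(question, "->", answer)
```

With this cut-off only two of the seven answers survive, which gives a rough sense of how rarely the baseline is confident on this context.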


### 5. Try LangChain

In [8]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
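import os
# Read the key from the environment instead of hard-coding it in the notebook
# (set OPENAI_API_KEY in the shell before starting Jupyter).
if 'OPENAI_API_KEY' not in os.environ:
    raise RuntimeError('OPENAI_API_KEY is not set')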

loader = TextLoader('context.txt')
documents = loader.load()

In [9]:
textSplitter = RecursiveCharacterTextSplitter(chunk_size = 200, chunk_overlap = 20)
texts = textSplitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)
retriever = db.as_retriever()

In [10]:
docs = retriever.get_relevant_documents("How much is this lady jacket?")
print("\n\n".join([x.page_content[:200] for x in docs[:2]]))

BR789 is J.Crew: Odette sweater lady jacket in cotton-blend boucl&eacute; for women
BR789 is for women
BR789 costs $178.00

For BR789, Somewhere between a jacket and a cardigan, this easy layer is perfect for days that feel somewhere between seasons. Featuring a crochet trim, gold button details and hook-and-eye closures,
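
Conceptually, the retriever ranks the stored chunks by how similar their vectors are to the query vector and returns the closest ones. The toy sketch below illustrates that ranking step only. It swaps in bag-of-words counts and cosine similarity for the OpenAI embeddings and the FAISS index, so it is not what LangChain does internally; `bag_of_words`, `cosine`, `top_k`, and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = bag_of_words(query)
    return sorted(chunks, key=lambda chunk: cosine(query_vec, bag_of_words(chunk)), reverse=True)[:k]

# Hypothetical chunks in the style of context.txt.
chunks = [
    "BR789 is for women\nBR789 costs $178.00",
    "BR789 : 70% cotton/30% polyamide.\nBR789 : Machine wash.",
    "BR789 has Cropped fit.\nBR789 has Hits slightly above hip.",
]
best = top_k("Is BR789 machine wash?", chunks, k=1)
print(best[0])
```

Word overlap cannot relate "How much" to "costs", which is the kind of match learned embeddings are meant to capture.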


### 6. Crawl the webpage

In [57]:
visited = set()
pdps = set()
def crawl_site(url):
    # set up a set to store visited URLs
    
    
    # add the starting URL to the list of URLs to visit
    urls_to_visit = [url]

    while urls_to_visit:
        # get the next URL to visit
        url = urls_to_visit.pop(0)

        # if the URL has already been visited, skip it
        if url in visited:
            continue

        # make a request to the URL
        print(url)
        try:
            response = requests.get(url)

            soup = BeautifulSoup(response.content, 'html.parser')


            allTags = soup.find_all('script')
            descriptionStr = None
            for tag in allTags:
                if tag.string and tag.string.strip():
                    descriptionStr = tag.string.strip()
            descriptionDict = None
            try:
                descriptionDict = json.loads(descriptionStr)
            except (TypeError, json.JSONDecodeError):
                # no script content (descriptionStr is None) or not JSON: skip this page
                continue
            if 'props' in descriptionDict:
                if 'initialState' in descriptionDict['props']:
                    if 'products' in descriptionDict['props']['initialState']:
                        if 'productsByProductCode' in descriptionDict['props']['initialState']['products']:
                            pdps.add(url)

            # add any new URLs on the page to the list of URLs to visit
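            for link in soup.find_all('a'):
                href = link.get('href')
                # keep only site-relative links ('/...'); prefixing the site root
                # to an absolute URL would yield an invalid address
                if href is None or not href.startswith('/') or href.startswith('//'):
                    continue
                href = 'https://www.jcrew.com' + href
                if href not in visited:
                    urls_to_visit.append(href)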

            # mark the URL as visited
            visited.add(url)
        except requests.exceptions.RequestException as e:
            continue

crawl_site('https://www.jcrew.com/plp/mens/categories/clothing/pants-and-chinos?style-fit=mens-pants-skinny&intcmp=mheader_pants-fit-skinny')


https://www.jcrew.com/plp/mens/categories/clothing/pants-and-chinos?style-fit=mens-pants-skinny&intcmp=mheader_pants-fit-skinny
https://www.jcrew.com/l/rewards?intcmp=seealloffersnav_1_signupfree
https://www.jcrew.com/l/credit_card?intcmp=seealloffersnav_1_applytoday
https://www.jcrew.comhttps://d.comenity.net/jcrew/?intcmp=seealloffersnav_1_manageyourcard
https://www.jcrew.com/s/rewards
https://www.jcrew.com/
https://www.jcrew.com/checkout/cart
https://www.jcrew.com/plp/womens/features/new-arrivals
https://www.jcrew.com/plp/mens/features/new-arrivals
https://www.jcrew.com/plp/girls/features/new-arrivals
https://www.jcrew.com/plp/boys/features/new-arrivals
https://www.jcrew.com/plp/womens/features/the-wedding-shop
https://www.jcrew.com/plp/womens/features/the-linen-shop
https://www.jcrew.com/plp/womens/features/the-work-remix
https://www.jcrew.com/plp/womens/features/olympias-picks
https://www.jcrew.com/plp/womens/features/resort-wear
https://www.jcrew.com/plp/womens/features/straw-acc

KeyboardInterrupt: 
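
The output above includes `https://www.jcrew.comhttps://d.comenity.net/jcrew/...`, which comes from prefixing the site root to a link that was already absolute. A sketch of link normalization with the standard library's `urljoin`, which resolves relative links and leaves absolute ones untouched; `normalize_link` is a hypothetical helper, and restricting the crawl to the `www.jcrew.com` host is an assumption about what the crawler should cover.

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.jcrew.com"

def normalize_link(page_url, href, allowed_host="www.jcrew.com"):
    """Resolve href against the page URL; return None for off-site or non-HTTP links."""
    if not href:
        return None
    absolute = urljoin(page_url, href)
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https") or parts.netloc != allowed_host:
        return None
    return absolute

on_site = normalize_link(BASE + "/", "/s/rewards")
off_site = normalize_link(
    BASE + "/",
    "https://d.comenity.net/jcrew/?intcmp=seealloffersnav_1_manageyourcard",
)
print(on_site)
print(off_site)
```

Inside the crawler, a `None` result would simply mean the link is not queued.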

### 7. Form a larger context

In [4]:
# use 17 URLs found by the crawler for testing
pdps = ['https://www.jcrew.com/p/womens/categories/clothing/blazers/lady-jacket/odette-sweater-lady-jacket-in-cotton-blend-boucleacute/BR789?display=standard&fit=Classic&color_name=mediterranean-navy&colorProductCode=BR789',
'https://www.jcrew.com/m/womens/categories/shoes/flats/hazel-strappy-sandals-in-leather/MP099?display=standard&fit=Classic&color_name=burnt-caramel&colorProductCode=BR616',
'https://www.jcrew.com/p/womens/categories/clothing/dresses-and-jumpsuits/maxine-v-neck-shift-dress-in-linen/BR470?display=standard&fit=Classic&color_name=brilliant-kelly&colorProductCode=BR470',
'https://www.jcrew.com/p/womens/categories/clothing/sweaters/pullovers/cropped-crochet-tank-top-in-silk-cotton-blend/BO171?display=standard&fit=Classic&color_name=warm-ivory&colorProductCode=BO171',
'https://www.jcrew.com/p/womens/categories/clothing/dresses-and-jumpsuits/squareneck-mini-sweater-dress/BR343?display=standard&fit=Classic&color_name=bright-patina&colorProductCode=BR343',
'https://www.jcrew.com/p/womens/categories/clothing/dresses-and-jumpsuits/relaxed-fit-short-sleeve-baird-mcnutt-irish-linen-shirtdress/AY623?display=standard&fit=Classic&color_name=navy&colorProductCode=AY623',
'https://www.jcrew.com/r/shop-the-look?externalProductCodes=BO171-SR1435:BP499-NA6445:BP329-NA6095:H8908-BR6404:BG613-NA6167:BO299-YL5433:BG933-BL8133:BI521-NA5810:AZ717-EE0543:AK085-EB4815&intcmp=w_na_look1',
'https://www.jcrew.com/r/shop-the-look?externalProductCodes=BO171-SR1435:BP499-NA6445:BP329-NA6095:H8908-BR6404:BG613-NA6167:BO299-YL5433&intcmp=w_na_look2',
'https://www.jcrew.com/p/womens/categories/accessories/scarves-hats/hats/open-weave-packable-straw-hat/BH859?display=standard&fit=Classic&color_name=natural-straw&colorProductCode=BH859',
'https://www.jcrew.com/p/womens/categories/clothing/shirts-and-tops/button-up-bow-top-in-cotton-poplin-eyelet/BP771?display=standard&fit=Classic&color_name=navy&colorProductCode=BP771',
'https://www.jcrew.com/m/womens/categories/shoes/flats/menorca-toe-ring-slingback-sandals-in-leather/MP886?display=standard&fit=Classic&color_name=metallic-silver&colorProductCode=BR603',
'https://www.jcrew.com/p/mens/categories/shoes/dress-shoes/camden-loafers-in-leather/AV166?display=standard&fit=Classic&color_name=dress-brown&colorProductCode=AV166',
'https://www.jcrew.com/m/mens/categories/clothing/shirts/seersucker/short-sleeve-yarn-dyed-seersucker-shirt/MP604?display=standard&fit=Classic&color_name=seersucker-stripe-fade&colorProductCode=BE982',
'https://www.jcrew.com/p/mens/categories/clothing/t-shirts/performance/performance-t-shirt-with-coolmaxreg/BO520?display=standard&fit=Classic&color_name=soft-aqua&colorProductCode=BO520',
'https://www.jcrew.com/p/mens/categories/clothing/shirts/secret-wash/secret-wash-cotton-poplin-shirt/BJ706?display=standard&fit=Classic&color_name=henry-stripe-green-blue&colorProductCode=BJ706',
'https://www.jcrew.com/m/mens/categories/clothing/shorts/linen-and-seersucker/9quot-stretch-seersucker-short/MP949?display=standard&fit=9 Inch&color_name=sky-blue-white&colorProductCode=BO857',
'https://www.jcrew.com/p/mens/categories/clothing/shirts/shirt-jackets/wallace-amp-barnes-slub-poplin-military-shirt/BO866?display=standard&fit=Classic&color_name=classic-khaki&colorProductCode=BO866']

In [5]:
context = ''
for url in pdps:
    print(url)
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')


    allTags = soup.find_all('script')
    descriptionStr = None
    for tag in allTags:
        if tag.string and tag.string.strip():
            descriptionStr = tag.string.strip()
    descriptionDict = json.loads(descriptionStr)['props']['initialState']['products']['productsByProductCode']

    info = {'productCode':0,'title':'', 'gender':'', 'listPrice':'', 'productDescriptionRomance':'', 'productDescriptionTech':'', 'productDescriptionFit':'', 'colorsList':None, 'sizesMap':None}
    loadInfo(descriptionDict)
    print(info)
    currContext = contextualize(info)
    context += currContext + '\n\n\n'

https://www.jcrew.com/p/womens/categories/clothing/blazers/lady-jacket/odette-sweater-lady-jacket-in-cotton-blend-boucleacute/BR789?display=standard&fit=Classic&color_name=mediterranean-navy&colorProductCode=BR789
{'productCode': 'BR789', 'title': 'J.Crew: Odette sweater lady jacket in cotton-blend boucl&eacute; for women', 'gender': 'women', 'listPrice': '$178.00', 'productDescriptionRomance': 'Somewhere between a jacket and a cardigan, this easy layer is perfect for days that feel somewhere between seasons. Featuring a crochet trim, gold button details and hook-and-eye closures, this sweet style is soft, snug and chic.', 'productDescriptionTech': ['70% cotton/30% polyamide.', 'Machine wash.', 'Import.'], 'productDescriptionFit': ['Cropped fit.', 'Hits slightly above hip.', 'Body length: 18 3/4".', 'Sleeve length: 30".'], 'colorsList': [{'price': {'amount': 178, 'formatted': '$178.00'}, 'colors': [{'productCode': 'BR789', 'code': 'YL5659', 'name': 'CRISP YELLOW'}, {'productCode': 'BR7

In [6]:
with open('largercontext.txt', 'w') as file:
    file.write(context)
    file.write('\n\n')

### 8. Ask LangChain Again

In [11]:
loader = TextLoader('largercontext.txt')
documents = loader.load()
textSplitter = RecursiveCharacterTextSplitter(chunk_size = 200, chunk_overlap = 20)
texts = textSplitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)
retriever = db.as_retriever()

mini length. Plus, this dress is made with a lightweight, super-breathable linen (aka our version of personal AC).

BR470 is J.Crew: Maxine V-neck shift dress in linen for women
BR470 is for women
BR470 costs $118.00
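
With `chunk_size = 200`, a product record can be cut in the middle, as in the retrieved chunk that starts with "mini length. Plus, this dress...". An alternative worth trying is one chunk per product. The sketch below assumes records are separated by the three newlines that block 7 appends after each product; `split_by_product` and the sample text are hypothetical.

```python
import re

def split_by_product(text):
    """Split the combined context on runs of three or more newlines."""
    return [block.strip() for block in re.split(r"\n{3,}", text) if block.strip()]

# Hypothetical two-product excerpt in the style of largercontext.txt.
sample = (
    "BR470 is J.Crew: Maxine V-neck shift dress in linen for women\n"
    "BR470 is for women\n"
    "BR470 costs $118.00\n"
    "\n\n\n"
    "BJ706 is J.Crew: Secret Wash cotton poplin shirt for men\n"
    "BJ706 is for men\n"
    "BJ706 costs $89.50\n"
    "\n\n\n"
)
records = split_by_product(sample)
print(len(records))
```

Whole-product chunks keep the price, materials, and fit lines together, at the cost of larger chunks to embed.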


In [14]:
questions = ["Recommend me a summer dress for women",
             "Recommend me a cheap shirt for men",
             "What is the price for the product BR470",
             "What is the product Open-weave packable straw hat for women made of",
             "What size does BP771 have"]
for question in questions:
    docs = retriever.get_relevant_documents(question)
    print(question)
    print("\n\n".join([x.page_content[:200] for x in docs[:2]]))
    print("\n\n\n")

Recommend me a summer dress for women
mini length. Plus, this dress is made with a lightweight, super-breathable linen (aka our version of personal AC).

BR470 is J.Crew: Maxine V-neck shift dress in linen for women
BR470 is for women
BR470 costs $118.00




Recommend me a cheap shirt for men
BE981 is J.Crew: Short-sleeve seersucker shirt in print for men
BE981 is for men
BE981 costs $79.50

BJ706 is J.Crew: Secret Wash cotton poplin shirt for men
BJ706 is for men
BJ706 costs $89.50




What is the price for the product BR470
BR470 is J.Crew: Maxine V-neck shift dress in linen for women
BR470 is for women
BR470 costs $118.00

BR470 : 100% linen.
BR470 : Lined.
BR470 : Machine wash.
BR470 : Import.
BR470 has Sheath silhouette.
BR470 has Falls above knee, 38 1/2" from high point of shoulder (based on a size 6).




What is the product Open-weave packable straw hat for women made of
BH859 is J.Crew: Open-weave packable straw hat for women
BH859 is for women
BH859 costs $69.50

For BH859, 