## Veridion Challenge #2

For this challenge, I created a Named Entity Recognition (NER) model using the spaCy library. The model was trained with over 100 websites scrapped from the given URLs. The model was trained with the 'PRODUCT' entity.

In [41]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import time
import json
import os

`avoid.txt` contains words that should be ignored when analyzing the text. The words are separated by a newline character. I chose to use a regex pattern to find the words to find because I want to match the whole word and not just a substring. (e.g. 'us' should not match 'Cirrus')

In [42]:
with open('avoid.txt', 'r') as f:
    avoid = f.read().splitlines()

# Compile a regex pattern to match whole words from to_avoid list
pattern_to_avoid = r"\b(?:" + "|".join(map(re.escape, avoid)) + r")\b"

While I was analyzing some webpages, I noticed that some of them had their product names in classes like `'product-title'` or `'product--item'`. I decided to use this information to filter out the products I wanted to use. So I created a regular expression to match these classes.

`avoid_find` list contains substrings / symbols that should be avoided when searching for the product names, even if they belong to a substring (e.g. '199$' should be found because of the dollar symbol)

In [43]:
df = pd.read_csv("pages.csv")
urls = df["max(page)"].tolist()
pattern = re.compile(r".*(product|grid).*(title|item|image).*")
avoid_find = ["$", "€", "£", ".com", "width", "height", "depth", "©"]

This function downloads content from a URL using `requests` and includes error handling. It sets a custom User-Agent header and retries failed downloads after a delay (default 1 second). On success, it returns the downloaded content.

In [44]:
def fetch_with_delays(url, delay=1):
    # Set a custom User-Agent to mimic a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=3)
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        time.sleep(delay)
    return None

Here, I extract the products from the websites. Firstly, I use BeautifulSoup to parse the HTML content of the response. Then, I target the elements with classes matching a pattern (`pattern` regex) and headers (h1-h6). Various filters are applied to extracted text to improve accuracy:
- Non-empty text with multiple words
- Avoid text matching a pattern (pattern_to_avoid)
- Remove stop words from avoid_find list
- Clean extra spaces
- Keep unique entries

In [45]:
def extract_products(response):
    soup = BeautifulSoup(response.content, "html.parser")
    matching_elements = soup.find_all(class_=lambda x: x and pattern.search(x))

    headers = {}
    for i in range(1, 7):  # for h1 to h6
        headers[f"h{i}"] = [
            header.text.strip()
            for header in soup.find_all(f"h{i}")
            if header.text.strip()
            and len(header.text.split()) > 1
            and len(header.text.split()) < 10
            and not re.search(pattern_to_avoid, header.text, re.IGNORECASE)
        ]
    matching_elements = [
        element.text.strip()
        for element in matching_elements
        if element.text.strip()
        and len(element.text.split()) > 1
        and len(element.text.split()) < 10
        and not re.search(pattern_to_avoid, element.text, re.IGNORECASE)
    ]

    all_texts = matching_elements
    all_texts += [header for header_list in headers.values() for header in header_list]
    # split elements when \n is met
    all_texts = [
        text
        for text in all_texts
        for text in text.split("\n")
        if text.strip() and len(text.split()) > 1
    ]
    all_texts = [text for text in all_texts if not any(m in text for m in avoid_find)]

    all_texts = [re.sub(r"\s+", " ", text) for text in all_texts]
    # keep unique elements
    all_texts = list(set(all_texts))
    return all_texts



This function only extracts the text from a website.

In [46]:
def extract_text(response):
    clean_text = " ".join(BeautifulSoup(response.text, "html.parser").stripped_strings)
    # remove any newlines
    clean_text = clean_text.replace("\n", " ")
    return clean_text

To improve the training data, I later decided to include more websites as sources. I find all the links in the HTML page + filter out the unwanted links (`links_avoid`).
Multiple filters are applied to the links found:
- The links are not Null/None
- They don't have contain any string to avoid
- insert a `/` incase they don't have it
- Format the final link

Then, we are downloading the content from each link. A valid link is when the response is not None and it contains product entities. The first 5 links with its data will be saved.

In [47]:
def extract_links(response):
    soup = BeautifulSoup(response.content, "html.parser")
    links = [link.get("href") for link in soup.find_all("a")]
    domain = response.url.split("/")[2]

    links_avoid = ['javascript', 'js', 'mailto', 'tel', 'pdf', 'cart']

    if domain[0:3] != "www":
        domain = "www." + domain

    url = response.url.split("/")
    url = url[0] + "//" + url[1] + url[2]

    links = [link for link in links if link != None]
    links = [link for link in links if not any(avoid in link.lower() for avoid in links_avoid)]
    links = [link[1:] if link[0] == "/" else link for link in links]
    links = [f'{url}/{link}' for link in links if link[0:4] != "http"]

    for link in links:
        link_response = fetch_with_delays(link)
        if link_response is None:
            continue
        else: 
            products = extract_products(link_response)
            if products:
                return (link, products)
    return None

Now, we are going to loop throught products and see their position in the website's text.

In [48]:
def extract_entities(text, products):
    entities = []
    for product in products:
        finding = text
        while finding.find(product) != -1:
            starting_index = finding.find(product)
            tuple = (starting_index, starting_index + len(product), 'PRODUCT')
            entities.append(tuple)
            finding = finding[starting_index + len(product):]
    return entities

In [49]:
TRAIN_DATA = []
sites_visited = 0
LABELS = ['PRODUCT']

Most links don't work at all, so I will use the first 100 links that I can access. Firstly, I append the training data for the main link and then for its children (links that I found later). I recommend running with the `train_data.json` file in working directory because it takes a lot of time to load all of the websites.

In [50]:
# check if the file exists
if not os.path.exists("train_data.json"):
    for url in urls:  # Crawl first 100 URLs
        if sites_visited >= 100:
            break
        try:
            response = fetch_with_delays(url)
            if response is None:
                continue

            sites_visited += 1

            products = extract_products(response)
            text = extract_text(response)
            extra_link = extract_links(response)

            entities_main = extract_entities(text, products)
            TRAIN_DATA.append((text, {"entities": entities_main}))
            
            # print(extra_link)
            if extra_link:
                extra_text = extract_text(fetch_with_delays(extra_link[0]))
                entities_extra = extract_entities(extra_text, extra_link[1])
                TRAIN_DATA.append((extra_text, {"entities": entities_extra}))
                
        except Exception as e:
            print(f"Error processing {url}: {e}")
            continue

    with open("train_data.json", "w") as f:
        json.dump(TRAIN_DATA, f)
else:
    with open("train_data.json", "r") as f:
        TRAIN_DATA = json.load(f)


Error fetching https://home-buy.com.au/products/bridger-pendant-larger-lamp-metal-brass: 404 Client Error: Not Found for url: https://home-buy.com.au/products/bridger-pendant-larger-lamp-metal-brass
Error fetching https://beckurbanfurniture.com.au/products/page/2/: HTTPSConnectionPool(host='beckurbanfurniture.com.au', port=443): Max retries exceeded with url: /products/page/2/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7faf892665f0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
Error fetching https://livingedge.com.au/home-experience: 500 Server Error: Internal Server Error for url: https://livingedge.com.au/home-experience
Error fetching https://livingedge.com.au/professional-experience: 500 Server Error: Internal Server Error for url: https://livingedge.com.au/professional-experience
Error fetching https://edenliving.online/collections/summerloving/products/nice-lounge-1: 404 Client Error: Not Found for url: https:/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Error fetching https://sika-design.com/pages/retailers-and-partners: HTTPSConnectionPool(host='sika-design.com', port=443): Max retries exceeded with url: /pages/retailers-and-partners (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fb07708cf40>, 'Connection to sika-design.com timed out. (connect timeout=3)'))
Error fetching https://sika-design.com/pages/catalogues: HTTPSConnectionPool(host='sika-design.com', port=443): Max retries exceeded with url: /pages/catalogues (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fb07708cc40>, 'Connection to sika-design.com timed out. (connect timeout=3)'))
Error fetching https://sika-design.com/pages/press-info: HTTPSConnectionPool(host='sika-design.com', port=443): Max retries exceeded with url: /pages/press-info (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fb07708d720>, 'Connection to sika-design.com timed out. (connect timeout=3)'))
Error processing

Preparing spaCy.

In [51]:
!python3 -m spacy download en_core_web_lg
!python3 -m spacy init fill-config base_config.cfg config.cfg

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m43.0 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [52]:
import spacy
from spacy.util import filter_spans
from spacy.tokens import DocBin
from tqdm import tqdm


def get_spacy_doc(file, data):
    nlp = spacy.blank("en")
    db = DocBin()

    for text, annot in tqdm(data):
        doc = nlp.make_doc(text)
        annot = annot['entities']

        ents = []
        entity_indices = []

        for start, end, label in annot:
            skip_entity = False
            for idx in range(start, end):
                if idx in entity_indices:
                    skip_entity = True
                    break
            if skip_entity:
                continue

            entity_indices = entity_indices + list(range(start, end))
            try:
                span = doc.char_span(start, end, label=label, alignment_mode='strict')
            except:
                continue

            if span is None:
                err_data = str([start, end]) + "   " + str(text) + "\n"
                file.write(err_data)
            else:
                ents.append(span)

        doc.ents = filter_spans(ents)
        db.add(doc)


    return db

We also have a set of data for training and a set for testing

In [53]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(TRAIN_DATA, test_size=0.2)


print(len(train), len(test))

file = open("log.txt", "w")
db = get_spacy_doc(file, train)
db.to_disk("train.spacy")

db = get_spacy_doc(file, test)
db.to_disk("test.spacy")

file.close()

120 30


100%|██████████| 120/120 [00:01<00:00, 77.76it/s]
100%|██████████| 30/30 [00:00<00:00, 105.79it/s]


It's tiiime to train the module!

In [54]:
!python3 -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./test.spacy

[38;5;2m✔ Created output directory: output[0m
[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[38;5;4mℹ To switch to GPU 0, use the option: --gpu-id 0[0m
[1m
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    413.33    0.00    0.00    0.00    0.00
  1     200        754.26  22653.83    0.00    0.00    0.00    0.00
  3     400         83.19   4255.88    1.87   17.65    0.99    0.02
  5     600        125.18   3431.24    4.48   15.09    2.63    0.04
  7     800        217.46   2921.13   12.14   30.67    7.57    0.12
  8    1000        484.30   2753.18   17.04   22.22   13.82    0.17
 10    1200        395.18   2291.31    7.88   50.00    4.28   

Now we are looking throught the other websites to check if the module is trained.

In [55]:
nlp_ner = spacy.load("./output/model-best")
sites_visited = 100
for url in urls[sites_visited:]:
    try:
        response = fetch_with_delays(url)
        if response is None:
            continue
        
        print(url)
        products = extract_products(response)
        text = extract_text(response)

        doc = nlp_ner(text)
        for ent in doc.ents:
            print(ent.text, ent.label_)

    except Exception as e:
        continue

https://www.normann-copenhagen.com/en/en/products/tivoli/lamps/toli-lamp--20-cm-eu-5005028
Benches Chair Accessories PRODUCT
Throw Blankets Cushions Bed Linen Tea Towels & Dishcloths Textiles PRODUCT
Lounge Chairs PRODUCT
Benches Chair Accessories PRODUCT
Most Popular PRODUCT
Phantom Lamp EU Small PRODUCT
Phantom Lamp EU Medium White 1250.00 EUR PRODUCT
Tub Wall Lamp EU White 310.00 EUR + Add to shopping bag PRODUCT
Tub Wall Lamp EU Black 310.00 EUR + Add to shopping bag PRODUCT
Tub Wall Lamp EU Aluminum PRODUCT
Wall Ø20 EU White 215.00 EUR + Add to shopping bag PRODUCT
Error fetching https://www.gowfb.com/products/patio-set-covers: 404 Client Error: Not Found for url: https://www.gowfb.com/products/patio-set-covers
https://allwoodfurn.com/products/group-119-rustic-two-tone-gathering-table-and-barstools
Error fetching https://decorum-shop.co.uk/products/gift-card-10-25-50-100: 404 Client Error: Not Found for url: https://decorum-shop.co.uk/products/gift-card-10-25-50-100
https://lostin

Error fetching https://www.jandjtreasuretrove.net/apps/webstore/products/show/8098794: HTTPSConnectionPool(host='www.jandjtreasuretrove.net', port=443): Max retries exceeded with url: /apps/webstore/products/show/8098794 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
Error fetching https://www.idcmn.com/products/sun-company-861013-queen-bed&source=webclient: 404 Client Error: Not Found for url: https://www.idcmn.com/products/sun-company-861013-queen-bed&source=webclient
Error fetching https://calligarisla.com/products/living/sofas/metro/minimal-and-contemporary-modular-sofa: HTTPSConnectionPool(host='calligarisla.com', port=443): Max retries exceeded with url: /products/living/sofas/metro/minimal-and-contemporary-modular-sofa (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04c1fb5b0>: Failed to establish a new connection: [Errno -5] No

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://sika-design.com/products/belladonna-nature
Rocking Chair PRODUCT
Scent Sticks PRODUCT
Celia Headboard PRODUCT
Alexander Teak Coffee Table PRODUCT
James Exterior Trolley PRODUCT
Ottoman Large PRODUCT
Roger Stool PRODUCT
Chain for Hanging Egg Chair PRODUCT
Luis Bench PRODUCT
Blues Counter Stool PRODUCT
Marche Basket PRODUCT
Carlo Bar Trolley PRODUCT
Salsa Bar Stool PRODUCT
Charlottenborg Coffee Table PRODUCT
Healthy Animal | Large PRODUCT
Isabell Bar Stool PRODUCT
Anna Exterior Side Table PRODUCT
Anna Exterior Side Table PRODUCT
George Teak Bench PRODUCT
George Teak Extendable Table 200/280x100 cm PRODUCT
Lucas Teak Extendable Table 200/280X100 cm PRODUCT
Alfred Teak Side Table PRODUCT
Simone Stool PRODUCT
Tangelo Lamp Shade | Extra Small PRODUCT
Healthy Animal | Small PRODUCT
Jute Carpet 200X300 PRODUCT
Jute Carpet 140X200 PRODUCT
Bedspread 240X260 cm PRODUCT
Bedspread 240X260 cm PRODUCT
Bedspread 240X260 cm PRODUCT
Bedspread 240X260 cm PRODUCT
Bedspread 240X260 cm PRODUCT
Bedsp

I've learned a lot of new things while trying to create this project. I'm really glad I stuck with it! There were definitely some challenges along the way, but figuring out how to extract data from websites and how to train an NER module using spaCy really helped me improve my problem-solving skills and my vision about ML/AI. Now I feel much more confident tackling complex tasks.
