# spaCy 2025 with NVIDIA Product Data — Exhaustive Tests

This notebook exercises spaCy's latest pipelines on NVIDIA product descriptions.  
It compares **transformer** and **vector** models, tests **tokenization**, **POS**, **dependency parsing**, **NER**, **vector similarity**, and includes **performance benchmarking**.

> If running in a restricted environment without internet access, the scraping step will fall back to a small, hardcoded list.

## 1) Setup

In [1]:

# Install core dependencies (comment out if already installed)
# Note: If you're on a managed environment, installs may be restricted.
!pip install -q spacy requests numpy beautifulsoup4

# Download spaCy models (comment out if already present)
!python -m spacy download en_core_web_trf
!python -m spacy download en_core_web_md


Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<1.0.0,>=0.2.2 (from en-core-web-trf==3.8.0)
  Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_tokenizers-0.0.9-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl (237 kB)

## 2) Imports

In [2]:

import spacy, requests, time, numpy as np
from spacy.pipeline import EntityRuler
from spacy import displacy
from bs4 import BeautifulSoup


## 3) Fetch NVIDIA Product Data (with fallback)

In [3]:

def fetch_nvidia_products(max_items=40):
    url = "https://www.nvidia.com/en-us/data-center/products/"
    try:
        resp = requests.get(url, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'html.parser')
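        # Simple heuristic: anchor texts that contain 'NVIDIA' and look like product names.
        # Caveat: get_text(strip=True) joins nested text nodes with no separator, so a link
        # title and its blurb can run together (e.g. "NVIDIA APIsExplore, ...").
        anchors = [a.get_text(strip=True) for a in soup.find_all('a')]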
        products = [a for a in anchors if "NVIDIA" in a and len(a) <= 80]
        # Deduplicate and trim
        seen, deduped = set(), []
        for p in products:
            if p not in seen:
                seen.add(p)
                deduped.append(p)
        return deduped[:max_items] if deduped else None
    except Exception as e:
        print(f"[warn] Could not fetch live data: {e}")
        return None

nvidia_products = fetch_nvidia_products()
if not nvidia_products:
    nvidia_products = [
        "NVIDIA GeForce RTX 4090",
        "NVIDIA A100 Tensor Core GPU",
        "NVIDIA DGX H100 system",
        "NVIDIA HGX platform",
        "NVIDIA Blackwell architecture",
        "NVIDIA RTX 5090 Ti — upcoming GPU",
        "NVIDIA Grace Hopper Superchip",
        "NVIDIA DGX Station A100",
    ]

print(f"Loaded {len(nvidia_products)} product strings.")
for s in nvidia_products[:10]:
    print("•", s)


Loaded 37 product strings.
• Artificial Intelligence Computing Leadership from NVIDIA
• NVIDIA APIsExplore, test, and deploy AI models and agents
• Private RegistryGuide for using NVIDIA NGC private registry with GPU cloud
• NVIDIA NGCAccelerated, containerized AI models and SDKs
• G-SYNC MonitorsSmooth, tear-free gaming with NVIDIA G-SYNC monitors
• NVIDIA StudioHigh performance laptops and desktops, purpose-built for creators
• NVIDIA AppOptimize gaming, streaming, and AI-powered creativity
• NVIDIA RTX PRO DesktopsPowerful AI, graphics, rendering, and compute workloads
• NVIDIA Mission Control
• NVIDIA AI Enterprise Platform
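
Several of the strings above run a link title straight into its blurb ("NVIDIA APIsExplore, ..."). A likely cause is that `get_text(strip=True)` joins the text nodes nested inside each anchor with an empty separator. The sketch below reproduces the effect with the standard library's `html.parser` on a made-up anchor and shows that joining with an explicit separator keeps title and blurb apart. The markup, the `AnchorText` class, and the `anchor_texts` helper are illustrative assumptions, not NVIDIA's actual HTML or part of the notebook. With BeautifulSoup, passing a separator as the first argument, e.g. `a.get_text(" | ", strip=True)`, should have the same effect.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text of each <a> element, joining nested text nodes with `sep`."""

    def __init__(self, sep=""):
        super().__init__()
        self.sep = sep
        self.in_anchor = False
        self.parts = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if self.in_anchor and text:
            self.parts.append(text)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.anchors.append(self.sep.join(self.parts))
            self.in_anchor = False


def anchor_texts(html, sep=""):
    """Return one string per anchor in `html` (hypothetical helper)."""
    parser = AnchorText(sep)
    parser.feed(html)
    parser.close()
    return parser.anchors


# Hypothetical markup: a title and a blurb in separate child elements.
sample = '<a href="#"><span>NVIDIA APIs</span><span>Explore, test, and deploy AI models</span></a>'

print(anchor_texts(sample))          # empty separator: title and blurb fused
print(anchor_texts(sample, " | "))   # explicit separator: title and blurb kept apart
```

Splitting on the separator afterwards makes it possible to keep only the title part before passing strings to spaCy.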


## 4) Load spaCy Models

In [4]:

# Heavy, accurate (Transformer-based) model
nlp_trf = spacy.load("en_core_web_trf")
# Lighter, fast model with word vectors
nlp_md  = spacy.load("en_core_web_md")

print("Transformer pipeline:", nlp_trf.pipe_names)
print("MD pipeline:", nlp_md.pipe_names)

# Vector existence checks
print("Has vector for 'GPU' (MD):", nlp_md.vocab["GPU"].has_vector)
print("Has vector for 'GPU' (TRF):", nlp_trf.vocab["GPU"].has_vector)


Transformer pipeline: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
MD pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Has vector for 'GPU' (MD): True
Has vector for 'GPU' (TRF): False


## 5) Add Custom Entity Rules (Domain Boost)

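In [6]:

# Add simple patterns to help tag product-like spans.
# The ruler is added to the MD pipeline only; the TRF results rely on the stock NER.
# Reuse an existing ruler so the cell can be re-run (add_pipe raises on a duplicate name).
if "entity_ruler" in nlp_md.pipe_names:
    ruler = nlp_md.get_pipe("entity_ruler")
else:
    ruler = nlp_md.add_pipe("entity_ruler", before="ner")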
patterns = [
    {"label": "PRODUCT", "pattern": [{"LOWER": "nvidia"}, {"IS_TITLE": True}]},
    {"label": "PRODUCT", "pattern": [{"LOWER": "dgx"}, {"IS_ALPHA": True}]},
    {"label": "PRODUCT", "pattern": [{"LOWER": "rtx"}, {"IS_DIGIT": True}]},
    {"label": "PRODUCT", "pattern": [{"LOWER": "blackwell"}]},
]
ruler.add_patterns(patterns)
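
Token flags such as `IS_TITLE`, `IS_ALPHA`, and `IS_DIGIT` are easy to misjudge on product names. The sketch below approximates the four patterns above in plain Python. It assumes that spaCy's flags behave like `str.istitle()`, `str.isalpha()`, and `str.isdigit()`, and that whitespace splitting is close enough to spaCy's tokenization for these strings. `rule_hits` is a hypothetical helper for illustration; it is not the EntityRuler itself and does not resolve overlapping matches.

```python
def rule_hits(text):
    """Return the spans that the four patterns would plausibly match in `text`."""
    tokens = text.split()  # assumption: whitespace split approximates spaCy tokens here
    hits = []
    for first, second in zip(tokens, tokens[1:]):
        head = first.lower()
        if head == "nvidia" and second.istitle():    # {"LOWER": "nvidia"}, {"IS_TITLE": True}
            hits.append(f"{first} {second}")
        if head == "dgx" and second.isalpha():       # {"LOWER": "dgx"}, {"IS_ALPHA": True}
            hits.append(f"{first} {second}")
        if head == "rtx" and second.isdigit():       # {"LOWER": "rtx"}, {"IS_DIGIT": True}
            hits.append(f"{first} {second}")
    # {"LOWER": "blackwell"} is a single-token pattern
    hits.extend(tok for tok in tokens if tok.lower() == "blackwell")
    return hits


for product in [
    "NVIDIA GeForce RTX 4090",
    "NVIDIA DGX H100 system",
    "NVIDIA DGX Station A100",
    "NVIDIA Grace Hopper Superchip",
    "NVIDIA Blackwell architecture",
]:
    print(f"{product!r:35} -> {rule_hits(product)}")
```

Under these assumptions, "NVIDIA GeForce" and "NVIDIA DGX" are not matched by the first pattern because neither second token is title-cased, and "DGX H100" is missed because "H100" is not purely alphabetic. Inspecting `doc.ents` on the real pipeline is the way to confirm.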

## 6) Analysis Helpers

In [7]:

def analyze_texts(nlp, texts):
    results = []
    for text in texts:
        doc = nlp(text)
        tokens = [(t.text, t.pos_, t.dep_) for t in doc]
        ents = [(e.text, e.label_) for e in doc.ents]
        # Similarity to the concept 'GPU'. Pipelines without static vectors
        # (e.g. en_core_web_trf) emit a warning and return 0.0 rather than failing.
        sim_gpu = None
        try:
            sim_gpu = float(doc.similarity(nlp("GPU")))
        except Exception:
            pass
        results.append({
            "text": text,
            "tokens": tokens,
            "entities": ents,
            "similarity_to_GPU": sim_gpu
        })
    return results

def print_comparison(res_md, res_trf, products):
    for i, txt in enumerate(products):
        print("TEXT:", txt)
        print("  MD  ents:", res_md[i]["entities"])
        print("  TRF ents:", res_trf[i]["entities"])
        print("  MD  sim(GPU):", res_md[i]["similarity_to_GPU"])
        print("  TRF sim(GPU):", res_trf[i]["similarity_to_GPU"])
        print("-"*60)


## 7) Run Tests (NER, Similarity)

In [8]:

res_md  = analyze_texts(nlp_md,  nvidia_products)
res_trf = analyze_texts(nlp_trf, nvidia_products)
print_comparison(res_md, res_trf, nvidia_products[:20])



TEXT: Artificial Intelligence Computing Leadership from NVIDIA
  MD  ents: [('NVIDIA', 'ORG')]
  TRF ents: [('NVIDIA', 'ORG')]
  MD  sim(GPU): 0.304282009601593
  TRF sim(GPU): 0.0
------------------------------------------------------------
TEXT: NVIDIA APIsExplore, test, and deploy AI models and agents
  MD  ents: [('NVIDIA', 'ORG'), ('AI', 'ORG')]
  TRF ents: [('NVIDIA', 'ORG')]
  MD  sim(GPU): 0.3404432535171509
  TRF sim(GPU): 0.0
------------------------------------------------------------
TEXT: Private RegistryGuide for using NVIDIA NGC private registry with GPU cloud
  MD  ents: [('NVIDIA', 'ORG'), ('GPU', 'ORG')]
  TRF ents: [('NVIDIA NGC', 'ORG')]
  MD  sim(GPU): 0.4650319218635559
  TRF sim(GPU): 0.0
------------------------------------------------------------
TEXT: NVIDIA NGCAccelerated, containerized AI models and SDKs
  MD  ents: [('NVIDIA NGCAccelerated', 'NORP')]
  TRF ents: [('NVIDIA', 'ORG')]
  MD  sim(GPU): 0.446361780166626
  TRF sim(GPU): 0.0
----------------------

## 8) Performance Benchmark

In [9]:

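def time_pipeline(nlp, texts, reps=50):
    # perf_counter is a monotonic, high-resolution clock intended for timing intervals
    start = time.perf_counter()
    for _ in range(reps):
        for t in texts:
            _ = nlp(t)
    # Average seconds for one full pass over `texts`
    return round((time.perf_counter() - start) / reps, 4)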

print("MD  avg sec/rep:", time_pipeline(nlp_md,  nvidia_products))
print("TRF avg sec/rep:", time_pipeline(nlp_trf, nvidia_products))


MD  avg sec/rep: 0.2534
TRF avg sec/rep: 3.2063
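
The per-repetition averages are easier to compare as throughput. The snippet below is a small worked calculation using only the figures printed above (37 product strings, 0.2534 s and 3.2063 s per pass). `throughput` is a hypothetical helper name. The numbers describe this single run on unspecified hardware and should not be read as general model speeds.

```python
def throughput(n_texts, sec_per_rep):
    """Texts processed per second, given the time for one pass over all texts."""
    return n_texts / sec_per_rep


N_TEXTS = 37       # from "Loaded 37 product strings."
MD_SEC = 0.2534    # MD  avg sec/rep
TRF_SEC = 3.2063   # TRF avg sec/rep

md_tps = throughput(N_TEXTS, MD_SEC)
trf_tps = throughput(N_TEXTS, TRF_SEC)
slowdown = TRF_SEC / MD_SEC

print(f"MD : {md_tps:.1f} texts/sec")
print(f"TRF: {trf_tps:.1f} texts/sec")
print(f"TRF is about {slowdown:.1f}x slower per pass")
```

For this run that works out to roughly 146 texts per second for MD against roughly 11.5 for TRF.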


## 9) Visualization (Dependency Parse)

In [10]:

doc = nlp_trf("NVIDIA DGX H100 system accelerates training workloads.")
displacy.render(doc, style="dep", jupyter=True, options={"distance": 110})


## 10) Optional: Similarity Clustering (vectors)

In [11]:

# Uses MD model vectors to compute pairwise cosine similarity.
# For larger corpora, consider FAISS or a vector DB.
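def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against all-zero vectors (e.g. texts made only of out-of-vocabulary tokens)
    return float(np.dot(a, b) / denom) if denom else 0.0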

vectors = [nlp_md(t).vector for t in nvidia_products]
for i in range(len(nvidia_products)):
    for j in range(i+1, len(nvidia_products)):
        print(f"Sim({nvidia_products[i][:30]}..., {nvidia_products[j][:30]}...) =",
              round(cosine(vectors[i], vectors[j]), 4))


Sim(Artificial Intelligence Comput..., NVIDIA APIsExplore, test, and ...) = 0.734
Sim(Artificial Intelligence Comput..., Private RegistryGuide for usin...) = 0.7287
Sim(Artificial Intelligence Comput..., NVIDIA NGCAccelerated, contain...) = 0.6363
Sim(Artificial Intelligence Comput..., G-SYNC MonitorsSmooth, tear-fr...) = 0.5171
Sim(Artificial Intelligence Comput..., NVIDIA StudioHigh performance ...) = 0.7271
Sim(Artificial Intelligence Comput..., NVIDIA AppOptimize gaming, str...) = 0.6897
Sim(Artificial Intelligence Comput..., NVIDIA RTX PRO DesktopsPowerfu...) = 0.5805
Sim(Artificial Intelligence Comput..., NVIDIA Mission Control...) = 0.8269
Sim(Artificial Intelligence Comput..., NVIDIA AI Enterprise Platform...) = 0.695
Sim(Artificial Intelligence Comput..., NVIDIA Run:ai...) = 0.5025
Sim(Artificial Intelligence Comput..., AI WorkbenchSimplify AI develo...) = 0.5691
Sim(Artificial Intelligence Comput..., API CatalogExplore NVIDIA's AI...) = 0.7635
Sim(Artificial Intelligence Comp
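
The loop above prints pairwise similarities but stops short of grouping them. One simple way to turn such scores into clusters is to link every pair whose cosine similarity clears a threshold and take the connected components. The sketch below does this with toy 3-dimensional vectors and an arbitrary threshold of 0.9; both are placeholders, as are the names `cosine_py` and `threshold_clusters`. With the notebook's data, the inputs would be `vectors` and a threshold chosen by inspecting the printed scores.

```python
import math


def cosine_py(a, b):
    """Cosine similarity for plain sequences; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def threshold_clusters(vectors, threshold):
    """Group indices into connected components of the 'similarity >= threshold' graph."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_py(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data: two tight pairs and one zero vector that joins nothing.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 0.0],
]
clusters = threshold_clusters(toy_vectors, threshold=0.9)
print(clusters)
```

Single-link grouping like this can chain loosely related items together, so for real data a library clustering routine may be a better fit.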


---

## README — How to Use & Extend

**What this notebook does**  
1. Pulls (or falls back to) NVIDIA product text and runs it through two spaCy models:
   - `en_core_web_trf` (Transformer-based; higher accuracy, slower)
   - `en_core_web_md` (Faster; includes word vectors)
2. Compares **entities** (NER) and **similarity** to the concept “GPU” across texts. The TRF pipeline has no static word vectors, so its similarity scores come back as 0.0.
3. Benchmarks **inference speed**.
4. Visualizes a **dependency parse**.
5. Provides an optional **similarity clustering** demo using vectors.

**When to use which model?**  
- Use **TRF** when accuracy matters most and you can absorb the extra latency (about 12.7x slower per pass than MD in the benchmark above).  
- Use **MD** for lower latency and simpler similarity operations using static vectors.

**Extending this notebook**  
- Add an **EntityRuler** with more patterns or train a **custom NER** on your labeled NVIDIA catalog.  
- Use **SentenceTransformers** for stronger semantic embeddings and cluster with **FAISS**.  
- For production, precompute vectors, cache results, and batch with `nlp.pipe`.
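
The caching and batching advice can be sketched without committing to a particular pipeline. In the example below, `batched`, `CachedBatchAnalyzer`, and `fake_analyze_batch` are hypothetical names; the stand-in function just uppercases text so the block runs on its own. With spaCy, the batch function could wrap `nlp.pipe`, for example `lambda chunk: list(nlp.pipe(chunk))`.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class CachedBatchAnalyzer:
    """Analyze texts in batches, computing each distinct text only once."""

    def __init__(self, analyze_batch, batch_size=32):
        self.analyze_batch = analyze_batch  # callable: list of str -> list of results
        self.batch_size = batch_size
        self.cache = {}

    def __call__(self, texts):
        # dict.fromkeys de-duplicates while preserving first-seen order
        missing = [t for t in dict.fromkeys(texts) if t not in self.cache]
        for chunk in batched(missing, self.batch_size):
            for text, result in zip(chunk, self.analyze_batch(chunk)):
                self.cache[text] = result
        return [self.cache[t] for t in texts]


calls = []


def fake_analyze_batch(chunk):
    """Stand-in for a real pipeline call; records batch sizes for inspection."""
    calls.append(len(chunk))
    return [text.upper() for text in chunk]


analyzer = CachedBatchAnalyzer(fake_analyze_batch, batch_size=2)
first = analyzer(["NVIDIA DGX H100 system", "NVIDIA HGX platform", "NVIDIA HGX platform"])
second = analyzer(["NVIDIA HGX platform", "NVIDIA Blackwell architecture"])
print(first)
print(second)
print(calls)  # batch sizes actually sent to the stand-in
```

Only texts not seen before reach the batch function, so repeated product strings cost a dictionary lookup instead of a pipeline call.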

**Troubleshooting**  
- If installs fail, run the notebook in an environment with `pip` enabled.  
- If model downloads fail, manually install models or use already-packaged models in your environment.  
- Some environments restrict network calls; in that case the notebook falls back to the built-in list.

---
