# NVIDIA Product Texts — Exhaustive NLP Testing (Hugging Face Transformers)

"""
This notebook walks through an NLP workflow for analyzing **NVIDIA product descriptions** with the **Hugging Face Transformers** library.

It runs four tasks with pre-trained transformer models: named entity recognition (NER), summarization, zero-shot classification, and embedding similarity.
"""


In [None]:
# === Setup ===
# Install dependencies
!pip install -q transformers datasets torch sentencepiece accelerate
!pip install -q beautifulsoup4 requests numpy

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModel
from datasets import load_dataset
import numpy as np
import requests
from bs4 import BeautifulSoup

In [None]:
# === Fetch NVIDIA Product Data ===
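def fetch_nvidia_products():
    """Scrape product names from NVIDIA's data center page; fall back to a static list."""
    fallback = [
        "NVIDIA GeForce RTX 4090",
        "NVIDIA A100 Tensor Core GPU",
        "NVIDIA DGX H100 system",
        "NVIDIA HGX platform",
        "NVIDIA Blackwell architecture",
        "NVIDIA RTX 5090 Ti upcoming GPU",
    ]
    try:
        url = "https://www.nvidia.com/en-us/data-center/products/"
        response = requests.get(url, timeout=10)  # avoid hanging on a stalled connection
        response.raise_for_status()               # treat HTTP errors as failures
        soup = BeautifulSoup(response.text, 'html.parser')
        # Keep short, unique link texts that mention NVIDIA
        names = {a.text.strip() for a in soup.find_all('a') if 'NVIDIA' in a.text}
        products = sorted(n for n in names if len(n) < 100)[:20]
    except requests.RequestException:
        # Network or HTTP failure: use the static list instead
        products = []
    # An empty scrape (for example, a page rendered client-side) also falls back
    return products or fallback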

nvidia_products = fetch_nvidia_products()
print("Sample products:", nvidia_products[:5])
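The filtering step inside `fetch_nvidia_products` can be exercised without network access if it is separated from the HTTP call. The following is a minimal sketch under that assumption; `select_product_names` and its parameters are hypothetical names, not part of the notebook, and the sample link texts are hand-written.

```python
def select_product_names(link_texts, keyword="NVIDIA", max_len=100, limit=20):
    """Filter raw link texts down to a short, de-duplicated list of names.

    Hypothetical helper: mirrors the keyword and length filter used above.
    """
    cleaned = (t.strip() for t in link_texts)
    # Keep texts that mention the keyword and are short enough to be a name
    kept = [t for t in cleaned if keyword in t and len(t) < max_len]
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(kept))[:limit]


# Hand-written stand-ins for the text of <a> tags on a page
sample_links = [
    "  NVIDIA DGX H100 system ",
    "Contact us",
    "NVIDIA HGX platform",
    "NVIDIA DGX H100 system",
]
print(select_product_names(sample_links))
```

Preserving first-seen order keeps the result stable between runs, which a plain `set` does not guarantee.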

In [None]:
# === Pipelines Setup ===

# 1. Named Entity Recognition
ner_pipe = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# 2. Summarization
sum_pipe = pipeline("summarization", model="facebook/bart-large-cnn")

# 3. Zero-shot classification (categorize products)
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# 4. Embedding model for similarity
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(embed_model_name)
embed_model = AutoModel.from_pretrained(embed_model_name)

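def get_embedding(text):
    """Return an L2-normalized, mean-pooled embedding of shape (1, hidden_size)."""
    inputs = tok(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = embed_model(**inputs)
    # A plain mean over tokens is fine here: a single text is encoded without padding
    emb = outputs.last_hidden_state.mean(dim=1)
    return emb / emb.norm()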
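`get_embedding` averages every token vector, which is safe for one text at a time because no padding is added. If the function were extended to encode a padded batch, padding positions would need to be excluded through the attention mask. Below is a minimal sketch of that masked mean pooling on toy data; `masked_mean_pool` is a hypothetical helper, and it uses plain Python lists so it runs without loading a model.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, then L2-normalize."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    mean = [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Guard against a zero vector to avoid division by zero
    return [x / norm for x in mean] if norm > 0 else mean


# Two real tokens followed by one padding token (toy 2-dimensional vectors)
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))
```

Without the mask, the padding vector would dominate the average and distort every similarity computed from it.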

In [None]:
# === Exhaustive Testing ===

for text in nvidia_products:
    print(f"\n=== {text} ===")
    print("NER:", ner_pipe(text))
    # Clamp max_length to the range [5, 20] based on word count so it never drops below min_length.
    # max_length counts tokens, not words, so the word count is only a rough proxy.
    summary_max_length = max(min(len(text.split()), 20), 5)
    print("Summary:", sum_pipe(text, max_length=summary_max_length, min_length=5, do_sample=False)[0]['summary_text'])
    print("Category:", zero_shot(text, candidate_labels=["GPU", "Server", "Chip", "AI System", "Platform"]))
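The length clamp used for summarization in the loop above can be pulled into a small function so its edge cases are easy to check. This is a sketch only; `summary_length_bounds`, `floor`, and `ceiling` are hypothetical names, and word count remains a rough stand-in for token count.

```python
def summary_length_bounds(text, floor=5, ceiling=20):
    """Return (min_length, max_length) for a summarization call.

    Hypothetical helper: max_length follows the word count, clamped
    to the range [floor, ceiling]; min_length is always floor.
    """
    n_words = len(text.split())
    max_length = max(min(n_words, ceiling), floor)
    return floor, max_length


# A 3-word name, a 12-word text, and a 50-word text
for sample in ["NVIDIA HGX platform", "word " * 12, "word " * 50]:
    print(summary_length_bounds(sample))
```

For a three-word product name both bounds equal 5, which leaves the summarizer almost no room to shorten anything.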

In [None]:
# === Similarity Testing ===

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embs = [get_embedding(t).numpy().flatten() for t in nvidia_products]
print("\nPairwise product similarities:")
for i in range(len(nvidia_products)):
    for j in range(i + 1, len(nvidia_products)):
        sim = cosine(embs[i], embs[j])
        if sim > 0.7:
            print(f"{nvidia_products[i]} ↔ {nvidia_products[j]} = {sim:.2f}")
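The nested index loops above can also be written with `itertools.combinations`, returning the matching pairs sorted by similarity instead of printing them as they are found. The sketch below uses toy vectors rather than real embeddings; `cosine_similarity` and `similar_pairs` are hypothetical helpers written in plain Python so the example runs on its own.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_pairs(names, vectors, threshold=0.7):
    """Return (name_i, name_j, similarity) above threshold, most similar first."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(vectors), 2):
        sim = cosine_similarity(a, b)
        if sim > threshold:
            pairs.append((names[i], names[j], sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy 2-dimensional vectors, not real embeddings
names = ["item_a", "item_b", "item_c"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
for left, right, sim in similar_pairs(names, vectors):
    print(f"{left} ↔ {right} = {sim:.2f}")
```

Returning a sorted list makes it straightforward to keep only the top few pairs when the product list grows.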

"""
README — Hugging Face Transformers NVIDIA NLP Test

This notebook runs four Hugging Face pipeline tasks over a list of NVIDIA product names and prints the raw outputs for inspection:

### Tasks Covered
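- **NER**: Tags entities with the model's CoNLL-2003 label set (PER, ORG, LOC, MISC); there is no dedicated label for GPU models or architectures, so product names can only surface as one of these generic types
- **Summarization**: Generates concise summaries of NVIDIA product descriptions; bare product names are too short to summarize meaningfully, so longer descriptions give more useful output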
- **Zero-shot Classification**: Assigns each product to categories like GPU, AI System, or Platform
- **Embedding Similarity**: Measures semantic closeness among product names

### Models Used
- `dslim/bert-base-NER` — general-purpose named entity recognition
- `facebook/bart-large-cnn` — summarization
- `facebook/bart-large-mnli` — zero-shot classification
- `sentence-transformers/all-MiniLM-L6-v2` — sentence embeddings for similarity

### How to Run
1. Install dependencies via pip commands in the setup cell.
2. Run all cells sequentially in a Jupyter or Colab environment.
3. Review outputs printed for each NVIDIA product name.
4. Modify `candidate_labels` or extend with your own data for custom evaluations.

### Extension Ideas
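- Replace `MiniLM` with a larger Hugging Face embedding model such as `BAAI/bge-large-en-v1.5`, which may improve similarity quality. OpenAI's `text-embedding-3-large` is another option, but it is served through the OpenAI API and cannot be loaded with `AutoModel`.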
- Use `faiss` or `qdrant` for scalable vector similarity search.
- Add question-answering (`pipeline('question-answering')`) for product feature extraction.
- Integrate results into a FastAPI or Streamlit dashboard.
"""
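When experimenting with different `candidate_labels`, it can help to reduce each zero-shot result to its best label instead of reading the full dictionary. The sketch below assumes the result has the shape the zero-shot-classification pipeline returns for a single input (`labels` and `scores` lists sorted by descending score); `top_label` and `min_score` are hypothetical names, and the scores in the mock result are made up for illustration.

```python
def top_label(result, min_score=0.5):
    """Return (label, score) for the best label, or (None, score) if it is weak.

    Assumes 'labels' and 'scores' are parallel lists sorted by descending
    score, as in a single-input zero-shot-classification result.
    """
    label, score = result["labels"][0], result["scores"][0]
    return (label, score) if score >= min_score else (None, score)


# Hand-written stand-in for a pipeline result; the scores are invented
mock_result = {
    "sequence": "example product text",
    "labels": ["AI System", "Server", "GPU"],
    "scores": [0.62, 0.25, 0.13],
}
print(top_label(mock_result))
print(top_label(mock_result, min_score=0.7))
```

Returning `None` for low-confidence results makes it easier to spot products that none of the candidate labels describe well.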