# NLP using Hugging Face Models

### A Tour of NLP Tasks with `pipeline()`

The `pipeline()` function is the easiest entry point. It bundles a pre-trained model with its tokenizer and the pre- and post-processing steps the task needs. You provide the task name and, optionally, a specific model from the [Hugging Face Hub](https://huggingface.co/models).


In [None]:
!nvidia-smi

Tue Aug 26 15:00:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   50C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 0) Setup

In [None]:
# in a fresh environment (Python 3.10+ recommended)
!pip install -U transformers torch torchvision torchaudio sentencepiece accelerate datasets sentence-transformers --quiet
# (Optional, for faster CPU inference)
!pip install -U optimum --quiet

In [None]:
# Quick sanity check
import transformers, torch, platform
print(transformers.__version__, torch.__version__, platform.python_version())

4.55.4 2.8.0+cu126 3.12.11


---

## 1) What is `pipeline()` doing?

* **`pipeline(...)`**: a one-liner that downloads a model and its tokenizer, then handles pre- and post-processing for common tasks. It is the fastest way to try a model.
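
To make the post-processing step concrete, here is a minimal sketch of what a classification pipeline does with a model's raw outputs: apply softmax to the logits, then map the winning index to a label name. The logits and the `ID2LABEL` mapping are made-up illustrative values, not output from a real model.

```python
import math

# Hypothetical label mapping, similar in spirit to the id2label entry in a model config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": ID2LABEL[best], "score": probs[best]}

# Made-up logits for a single input sentence.
result = postprocess([-2.0, 3.0])
print(result)
```

The returned dictionary has the same `label`/`score` shape as the sentiment pipeline output later in this notebook.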


We’ll use `pipeline()` for seven common NLP tasks:

1. Sentiment Analysis
2. Text Classification (zero-shot, multi-class)
3. Named Entity Recognition (NER)
4. Question Answering (extractive)
5. Summarization
6. Translation
7. Text Generation

---

## 2) Device selection (CPU/GPU)

In [None]:
import torch
DEVICE = 0 if torch.cuda.is_available() else -1  # pipeline uses -1 for CPU, 0 for first GPU
print("Using", "GPU" if DEVICE==0 else "CPU")

Using GPU


---

## 3) Sentiment Analysis Pipeline (binary sentiment on SST-2)

In [None]:
from transformers import pipeline

sent_pipeline = pipeline("sentiment-analysis",
                         model="distilbert-base-uncased-finetuned-sst-2-english",
                         device=DEVICE)

texts = ["I love this library!", "This is the worst experience ever."]
print(sent_pipeline(texts))  # [{'label': 'POSITIVE', 'score': ...}, ...]


[{'label': 'POSITIVE', 'score': 0.9998852014541626}, {'label': 'NEGATIVE', 'score': 0.9997727274894714}]


---

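## 4) Text Classification (zero-shot, multi-class example)

> Zero-shot classification uses an NLI model to score the text against each candidate label, so no topic-specific fine-tuning is needed. With `multi_label=False` (the default), the scores are softmax-normalized across the candidate labels.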


In [None]:
import torch
from transformers import pipeline

# Use the first GPU (device=0) when one is available; otherwise fall back to CPU (device=-1).
DEVICE = 0 if torch.cuda.is_available() else -1

# The zero-shot pipeline takes the candidate labels at call time
clf = pipeline(
    task="zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=DEVICE,              # 0 for GPU, -1 for CPU
)

text = ["I bought a new GPU and optimized my PyTorch code.", "NIFTY50 gave a gap down opening today. Alas!"]
candidate_labels = ["technology", "sports", "finance"]

# For mutually exclusive labels (pick the single best), keep multi_label=False (default)
result = clf(
    sequences=text,
    candidate_labels=candidate_labels,
    # hypothesis_template controls how labels are framed. Default works well:
    # hypothesis_template="This example is {}."
)

result


[{'sequence': 'I bought a new GPU and optimized my PyTorch code.',
  'labels': ['technology', 'sports', 'finance'],
  'scores': [0.9735428690910339, 0.01587299257516861, 0.010584195144474506]},
 {'sequence': 'NIFTY50 gave a gap down opening today. Alas!',
  'labels': ['finance', 'technology', 'sports'],
  'scores': [0.6778068542480469, 0.23001892864704132, 0.09217418730258942]}]
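
The scoring behind these results can be sketched in a few lines. Each candidate label is turned into a hypothesis with the template, the NLI model produces an entailment score per hypothesis, and with `multi_label=False` those scores are softmax-normalized across labels. The entailment logits below are made-up numbers standing in for model output, and the helper names are hypothetical.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Frame every candidate label as an NLI hypothesis."""
    return [template.format(label) for label in labels]

def normalize_entailment(entail_logits):
    """Softmax over the per-label entailment logits (single-label case)."""
    peak = max(entail_logits)
    exps = [math.exp(x - peak) for x in entail_logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["technology", "sports", "finance"]
hypotheses = build_hypotheses(labels)

# Made-up entailment logits, one per hypothesis (a real run gets these from the model).
entail_logits = [4.0, -1.0, -0.5]
scores = normalize_entailment(entail_logits)

# Sort labels by score, highest first, as the pipeline output does.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized across labels, they always sum to 1, which is why this mode suits mutually exclusive labels.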

---

## 5) Named Entity Recognition (NER)

## 🌐 What is Named Entity Recognition (NER)?

**Named Entity Recognition (NER)** is an **NLP (Natural Language Processing)** task where we automatically detect and classify important pieces of information (called *entities*) in text.

Entities can be:

* **Names of people** → *"Elon Musk"*
* **Organizations** → *"Google"*
* **Locations** → *"New York City"*
* **Dates/Times** → *"August 2025"*
* **Monetary values** → *"\$5 million"*
* **Product names, diseases, laws, chemicals, etc.** (depending on the domain)

👉 In simple terms: NER finds **“who, where, when, what”** in unstructured text.

---

## 📊 Business Applications Across Domains

### 1. **Finance & Banking**

* Extracting **company names, financial instruments, monetary values** from analyst reports.
* Detecting **fraudulent entities** in transaction logs.
* Automating **compliance checks** (e.g., identifying mentions of sanctioned individuals or organizations).

---

### 2. **Healthcare & Pharma**

* Extracting **disease names, drugs, symptoms, treatments** from medical reports and research papers.
* Building **clinical decision support systems**.
* Analyzing **adverse event reports** for pharmaceuticals.

---

### 3. **Retail & E-commerce**

* Identifying **brand names, product types, SKUs** from customer reviews.
* Improving **search and recommendation systems**.
* Tracking **competitor product mentions** across social media.

---

### 4. **Legal & Government**

* Extracting **laws, regulations, case citations, organization names** from contracts or court documents.
* Supporting **due diligence** in mergers & acquisitions.
* Automating **document classification** in large legal repositories.

---

### 5. **Media & Customer Insights**

* Identifying **people, places, companies, events** in news articles.
* Tracking **brand reputation** by monitoring mentions on Twitter, blogs, forums.
* Analyzing **customer complaints** for named products or services.

---

### 6. **Cybersecurity**

* Extracting **IP addresses, domain names, malware names** from threat intelligence reports.
* Automating **incident reports** for faster response.

---

✅ **Summary**: NER turns messy, free-form text into **structured data** that businesses can analyze, search, and act upon.

### 5) NER using Pipeline

In [None]:
import torch
from transformers import pipeline
from transformers.utils import logging as hf_logging

hf_logging.set_verbosity_error()   # show only errors from Transformers (hides warnings)

DEVICE = 0 if torch.cuda.is_available() else -1  # first GPU if present, else CPU

# "token-classification" is the full task name; "ner" is an alias for it.
# aggregation_strategy merges sub-word tokens into whole entities:
# "max" tags each word with its highest-scoring token; "simple" is the basic alternative.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="max",
               device=DEVICE)

text = "Hugging Face is based in New York City and was founded by Julien, Thomas, and Clément."
ner(text)

[{'entity_group': 'ORG',
  'score': np.float32(0.96009326),
  'word': 'Hugging Face',
  'start': 0,
  'end': 12},
 {'entity_group': 'LOC',
  'score': np.float32(0.9995024),
  'word': 'New York City',
  'start': 25,
  'end': 38},
 {'entity_group': 'PER',
  'score': np.float32(0.9989139),
  'word': 'Julien',
  'start': 58,
  'end': 64},
 {'entity_group': 'PER',
  'score': np.float32(0.9964598),
  'word': 'Thomas',
  'start': 66,
  'end': 72},
 {'entity_group': 'PER',
  'score': np.float32(0.99871016),
  'word': 'Clément',
  'start': 78,
  'end': 85}]
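
The `entity_group` entries above come from merging per-token tags. As a rough illustration, the sketch below groups word-level BIO tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity) into spans. The tagged tokens are hand-written for illustration, and `group_entities` is a hypothetical helper; the real pipeline also handles sub-word pieces, scores, and character offsets, which this sketch leaves out.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged words that share a label into one entity."""
    entities = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            # Outside any entity: close the open span, if there is one.
            if current is not None:
                entities.append(current)
                current = None
            continue
        prefix, label = tag.split("-", 1)
        starts_new = (
            prefix == "B"
            or current is None
            or current["entity_group"] != label
        )
        if starts_new:
            if current is not None:
                entities.append(current)
            current = {"entity_group": label, "word": word}
        else:
            current["word"] += " " + word
    if current is not None:
        entities.append(current)
    return entities

# Hand-written tags for part of the example sentence.
tagged = [
    ("Hugging", "B-ORG"), ("Face", "I-ORG"),
    ("is", "O"), ("based", "O"), ("in", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"), ("City", "I-LOC"),
]
entities = group_entities(tagged)
print(entities)
```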

---

## 6) Question Answering (Extractive)

In [None]:
qa = pipeline("question-answering",
              model="deepset/roberta-base-squad2",
              device=DEVICE)

context = "The Apollo 11 mission landed on the Moon in 1969. Neil Armstrong and Buzz Aldrin walked on its surface."
qa(question="Who walked on the Moon?", context=context)

{'score': 0.9734604868608585,
 'start': 50,
 'end': 80,
 'answer': 'Neil Armstrong and Buzz Aldrin'}
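
"Extractive" means the answer is a span copied out of the context: `start` and `end` are character offsets, so `context[50:80]` is exactly the answer string. Internally, the model scores every token as a possible start and a possible end of the answer. The sketch below shows a simplified version of the span selection, using made-up logits over a hand-tokenized piece of the context. The helper name `best_span` is hypothetical, and the real pipeline additionally handles tokenization, excludes question tokens, and normalizes scores.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) token indices maximizing start + end score, with start <= end."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    for i, start_score in enumerate(start_logits):
        last = min(i + max_answer_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if score > best_score:
                best_score, best_start, best_end = score, i, j
    return best_start, best_end

tokens = ["Neil", "Armstrong", "and", "Buzz", "Aldrin", "walked", "on", "its", "surface"]
# Made-up logits: high start score on "Neil", high end score on "Aldrin".
start_logits = [5.0, 0.1, 0.2, 1.5, 0.1, 0.0, 0.0, 0.0, 0.0]
end_logits = [0.1, 1.0, 0.0, 0.2, 4.0, 0.3, 0.0, 0.0, 0.5]

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)

# The character offsets in the pipeline output index directly into the context string.
context = ("The Apollo 11 mission landed on the Moon in 1969. "
           "Neil Armstrong and Buzz Aldrin walked on its surface.")
print(context[50:80])
```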

In [None]:
!nvidia-smi

---

## 7) Summarization Pipeline


In [None]:
summ = pipeline("summarization",
                model="facebook/bart-large-cnn",
                device=DEVICE)

long_text = """1. Generative AI is a subset of artificial intelligence that focuses on creating new and original content, rather than just analyzing and processing existing data.

2. It uses algorithms and machine learning techniques to generate new ideas, designs, or solutions based on a set of input data or parameters.

3. Generative AI has a wide range of applications, including creating art, music, and text, as well as assisting in product design and optimization. It has the potential to revolutionize industries by automating creative tasks and providing innovative solutions."""
summ(long_text, max_length=80, min_length=30, do_sample=False)

[{'summary_text': 'Generative AI is a subset of artificial intelligence that focuses on creating new and original content. It uses algorithms and machine learning techniques to generate new ideas, designs, or solutions based on a set of input data or parameters. Generative AI has a wide range of applications, including creating art, music, and text.'}]
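
The example text is short, but real documents can exceed the input length a summarization model accepts. One common workaround is to split the text into chunks at sentence boundaries and summarize each chunk separately. The sketch below is one simple way to do that; the helper name `chunk_by_sentences` and the word budget are assumptions for illustration, and word counts only approximate the token counts the model actually sees.

```python
import re

def chunk_by_sentences(text, max_words=400):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five six. Seven eight nine."
chunks = chunk_by_sentences(demo, max_words=6)
print(chunks)
```

Each chunk could then be passed to `summ(...)` in turn and the partial summaries joined. A single sentence longer than the budget still becomes its own oversized chunk in this sketch.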

---

## 8) Translation (EN → FR) Pipeline

In [None]:
translate = pipeline("translation_en_to_fr",
                     model="Helsinki-NLP/opus-mt-en-fr",
                     device=DEVICE)
translate("This notebook teaches Hugging Face pipelines and auto classes.")

[{'translation_text': "Ce cahier enseigne les pipelines Hugging Face et les classes d'auto."}]

In [None]:
translate_de = pipeline(
    "translation_en_to_de",
    model="Helsinki-NLP/opus-mt-en-de",
    device=DEVICE
)

print(translate_de("This notebook teaches Hugging Face pipelines and auto classes."))
# Returns a list with one dict per input, keyed by 'translation_text'


[{'translation_text': 'Dieses Notebook lehrt Hugging Face Pipelines und Autoklassen.'}]


In [None]:
translate_hi = pipeline(
    "translation_en_to_hi",
    model="Helsinki-NLP/opus-mt-en-hi",
    device=DEVICE
)

print(translate_hi("Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language."))


[{'translation_text': 'स्वाभाविक भाषा अनुवाद (एनएलपी) कलावाद (अंग्रेज़ी) का एक शाखा है जो कंप्यूटरों को समझने, व्याख्या करने, और मानव भाषा तैयार करने में समर्थ करता है ।'}]
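
The three translation cells follow the same naming pattern for the task string and the model ID. A small helper can capture it; `opus_mt_names` is a hypothetical name, and the pattern is inferred from the three pairs used above. Not every language pair is guaranteed to have a matching checkpoint on the Hub, so check that the model exists before relying on it.

```python
def opus_mt_names(src, tgt):
    """Build the pipeline task name and Helsinki-NLP model ID for a language pair."""
    task = f"translation_{src}_to_{tgt}"
    model = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    return task, model

# Reproduce the task/model strings used in the cells above.
for target in ["fr", "de", "hi"]:
    task, model = opus_mt_names("en", target)
    print(task, "->", model)
```

The returned pair can be passed straight to `pipeline(task, model=model, device=DEVICE)`.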


---

## 9) Text Generation (causal LM)

### What is Causal Language Modeling? (Brief)

* **Causal Language Modeling (CLM)** = *next-token prediction*: model sees tokens **to the left only** and learns to predict the **next** token.

  * Used by **decoder-only** models (GPT-style).
  * Good for free-form generation and completion.

* **Masked Language Modeling (MLM)** = *fill in the blank*: model sees tokens on **both sides** (bidirectional) and predicts the **masked** token(s).

  * Used by **encoder-only** models (BERT-style).
  * Great for understanding tasks like NER, classification, and **fill-mask**.
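
The difference between "left only" and "both sides" is usually implemented as an attention mask. As a minimal sketch (plain Python lists rather than tensors, with hypothetical helper names), row `i` of each mask marks which positions token `i` is allowed to attend to:

```python
def causal_mask(n):
    """Token i may attend to positions 0..i only (lower-triangular mask)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every position."""
    return [[1] * n for _ in range(n)]

tokens = ["AI", "writes", "code"]

print("Causal (GPT-style):")
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>8}", row)

print("Bidirectional (BERT-style):")
for token, row in zip(tokens, bidirectional_mask(len(tokens))):
    print(f"{token:>8}", row)
```

In the causal mask, the first token sees only itself and the last token sees the whole prefix, which is what makes left-to-right generation possible.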


In [None]:
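gen = pipeline("text-generation", model="distilgpt2", device=DEVICE)
# max_length counts prompt tokens too; recent Transformers releases may let the pipeline's default
# max_new_tokens override it (the output below runs past 50 tokens). Use max_new_tokens to cap the continuation.
output = gen("In a future where AI writes code,", max_length=50, num_return_sequences=1,
             do_sample=True, top_p=0.95, temperature=0.8)
print(output[0]["generated_text"])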

In a future where AI writes code, is the future of AI?
The future of AI is a big question, and yet we have a lot of good science in it, and it is a question that has to be answered in a long time.
We have a lot of good science in it, and it is a question that has to be answered in a long time. It is a question that has to be answered in a long time.
We are the first to talk about the future of AI.
This is the next stage in the development of AI. The next step is to develop AI, which will hopefully be able to achieve a much bigger goal than what was envisioned.
We have about 2-3 years of development of AI, which will hopefully be able to achieve a much bigger goal than what was envisioned. In the past, we had been thinking of ways to use AI to develop AI. We had a few of the ideas that were in the works, but now we have more ideas.
We have a couple of ideas to talk about, but one is that the main goal is to do something new, which is to use AI to develop AI. It is not that we are not wo
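
The `temperature` and `top_p` arguments used in the generation call shape how each next token is sampled. The sketch below illustrates the idea on a made-up four-token vocabulary: temperature rescales the logits before softmax, and top-p (nucleus) filtering keeps the smallest set of most-likely tokens whose cumulative probability reaches `top_p`. This is a simplified illustration with hypothetical helper names, not the library's actual implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution; above 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize the survivors

vocab = ["code", "poems", "music", "bananas"]
probs = [0.5, 0.3, 0.15, 0.05]           # made-up next-token probabilities
nucleus = top_p_filter(probs, top_p=0.9)  # drops the unlikely tail ("bananas")

rng = random.Random(0)
choice = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
print("kept:", [vocab[i] for i in nucleus], "sampled:", vocab[choice])

# Lower temperature concentrates probability on the top-scoring token.
print(softmax([2.0, 1.0, 0.0], temperature=0.8))
```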

## “What to use when” cheat sheet

* **Binary/Multiclass Sentiment** → `distilbert-base-uncased-finetuned-sst-2-english` (binary), or a classifier fine-tuned on your own labels (multi-class)
* **NER** → `dslim/bert-base-NER`
* **Q&A** → `deepset/roberta-base-squad2`
* **Summarization** → `facebook/bart-large-cnn`
* **Translation** → `Helsinki-NLP/opus-mt-*`
* **Generation** → `distilgpt2` (toy), Llama-family or other larger models for real applications
* **Fill-mask** → `bert-base-uncased`
* **Embeddings** → `sentence-transformers/all-MiniLM-L6-v2` (fast, solid baseline)
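
Embeddings are not demonstrated in this notebook, so here is a minimal sketch of how they are typically used: each text is mapped to a vector, and texts are compared by cosine similarity. The three-dimensional vectors below are made up for illustration; a real embedding model such as the one listed above returns much longer vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for model output.
query = [0.9, 0.1, 0.0]
documents = {
    "gpu": [0.8, 0.2, 0.1],
    "stocks": [0.0, 0.1, 0.9],
}

# Rank documents by similarity to the query and pick the closest one.
best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print("closest document:", best)
```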

---


In [None]:
!nvidia-smi

Tue Aug 26 16:14:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             31W /   70W |    6810MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                