# **Introduction to TextBlob in Python**

---

👋 **Hey there!** Ready to explore how Python can understand human language? Let’s meet your new buddy: **TextBlob**!

---

### 🤖 What is TextBlob?

Think of **TextBlob** as your smart assistant that can read and analyze text *almost like a human*—but faster and in Python! It helps with:

* Figuring out what someone *feels* in a sentence (positive or negative?) 🎭
* Splitting text into words and sentences ✂️
* Correcting spelling mistakes ✍️
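* Translating from one language to another 🌍 (older TextBlob releases only; the feature has since been removed)
* And even tagging parts of speech and pulling out noun phrases!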

---

### 🧠 Why use TextBlob?

Because it’s **super beginner-friendly**! You don’t need to write complex code. Just one or two lines, and boom—you’ve got insights from text.

---

### ✨ Let’s try it out!

```python
from textblob import TextBlob

text = "I really love programming with Python!"
blob = TextBlob(text)

print("Words:", blob.words)
print("Sentiment:", blob.sentiment)
```

📌 Output (illustrative; the exact scores depend on the TextBlob version and its lexicon):

```
Words: ['I', 'really', 'love', 'programming', 'with', 'Python']
Sentiment: Sentiment(polarity=0.9, subjectivity=0.6)
```

🎯 That `polarity` tells us the sentence is **very positive**!

---

Ready to explore more like **spelling correction** or **translation**?

Try these next:

```python
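TextBlob("I havv goo grammer.").correct()
# translate() was deprecated in TextBlob 0.16 and removed in later releases,
# so the next line only works on older versions
blob.translate(to='hi')  # Translates to Hindi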
```

---

TextBlob turns text into insights, effortlessly. A perfect tool if you're starting out in **Natural Language Processing**!




# 1. Text Annotation

In [1]:
from textblob import TextBlob

text = "I love programming in Python. It's such a versatile language!"
blob = TextBlob(text)

In [2]:
# Get Polarity and Subjectivity
print("Polarity:", blob.sentiment.polarity)
print("Subjectivity:", blob.sentiment.subjectivity)

Polarity: 0.25
Subjectivity: 0.55


# 2. Detect positive or negative sentiment in a sentence

In [3]:
def detect_sentiment(text):
    blob = TextBlob(text)
    if blob.sentiment.polarity > 0:
        return "Positive"
    elif blob.sentiment.polarity < 0:
        return "Negative"
    else:
        return "Neutral"
    
print(detect_sentiment("i hate the interface"))
print(detect_sentiment('I love design'))

Negative
Positive


# 3. Sentiment of Multiple Sentences (List of Reviews)

In [4]:
reviews = ["I love the product!", "Worst camera ever!", "Battery life is ok"]
for review in reviews:
    polarity = TextBlob(review).sentiment.polarity
    label = "Positive" if polarity > 0 else "Negative" if polarity < 0 else "Neutral"
    print(f"'{review}' - Sentiment: {label}")

'I love the product!' - Sentiment: Positive
'Worst camera ever!' - Sentiment: Negative
'Battery life is ok' - Sentiment: Positive


# 4. Count Positive/Negative/Neutral in Text list

In [5]:
reviews = ["Great product!", "Worst experience ever!", "It's okay, not bad but not great.", "Not Bad"]

counts = {"Positive": 0, "Negative": 0, "Neutral": 0}
for review in reviews:
    polarity = TextBlob(review).sentiment.polarity
    if polarity > 0:
        counts["Positive"] += 1
    elif polarity < 0:
        counts["Negative"] += 1
    else:
        counts["Neutral"] += 1
print("Counts:", counts)

Counts: {'Positive': 3, 'Negative': 1, 'Neutral': 0}
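The tally itself does not depend on TextBlob, so it can be separated from the scoring step. The sketch below does the same count with `collections.Counter`. The helper name `tally_labels` and the sample scores are made up for illustration; in the notebook the scores would come from `TextBlob(review).sentiment.polarity`.

```python
from collections import Counter

def tally_labels(polarities):
    """Count how many scores are positive, negative, or exactly zero."""
    def label(p):
        return "Positive" if p > 0 else "Negative" if p < 0 else "Neutral"

    counts = Counter(label(p) for p in polarities)
    # Counter omits labels it never saw, so fill in zeros for a stable report.
    return {name: counts.get(name, 0) for name in ("Positive", "Negative", "Neutral")}

# Hypothetical scores standing in for TextBlob(review).sentiment.polarity.
print(tally_labels([1.0, -1.0, 0.35, 0.0]))
```

Keeping the counting separate makes it easy to swap in another scorer later without touching the report.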


# 5. Simple GUI For Sentiment (Using Tkinter)

In [6]:
# from tkinter import *
# from textblob import TextBlob

# def analyze():
#     text = entry.get()
#     blob = TextBlob(text)
#     result.set(f"Polarity: {blob.sentiment.polarity:.2f}")

# root = Tk()
# root.title("Sentiment Analyzer")

# entry = Entry(root, width=40)
# entry.pack(pady=10)


# Button(root, text="Analyze", command=analyze).pack()

# result = StringVar()
# Label(root, textvariable=result).pack(pady=5)

# root.mainloop()


# Sentiment Annotation Types – Python Examples

In **TextBlob**, **polarity** is a number that tells you how **positive or negative** a piece of text is.

### 📊 Polarity Score Range:

* **-1.0** → *very negative*
* **0.0** → *neutral*
* **+1.0** → *very positive*

### 🧠 How it works:

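TextBlob's default analyzer uses a built-in sentiment lexicon (a list of words, each with a polarity and a subjectivity score). It averages the scores of the lexicon words it finds in the text, adjusting for negations such as "not" and intensifiers such as "very".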
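
To make the averaging idea concrete, here is a deliberately simplified sketch. It is not TextBlob's implementation: the four-word `TOY_LEXICON` and its values are invented for illustration, and negation and intensifiers are ignored.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0].
# These values are illustrative only, not entries from TextBlob's lexicon.
TOY_LEXICON = {"love": 0.5, "great": 0.8, "worst": -1.0, "poor": -0.4}

def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of every lexicon word found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    # No known words means no evidence either way, so report neutral.
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("Great camera, poor battery"))    # averages 0.8 and -0.4
print(toy_polarity("The parcel arrived on Monday"))  # no lexicon words found
```

The second call shows why a lexicon approach returns exactly 0.0 for text with no known words: there is nothing to average.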

---

### 🧪 Example:

```python
from textblob import TextBlob

text = TextBlob("I love this beautiful weather!")
print(text.sentiment.polarity)
```

🔍 Output (illustrative; the exact score depends on the TextBlob version and its lexicon):

```
0.85  # Very positive!
```

Another example:

```python
TextBlob("This is the worst movie ever.").sentiment.polarity
```

🔍 Output:

```
-1.0  # Extremely negative
```

So, polarity helps you measure the *emotion* behind text—perfect for reviews, feedback analysis, or even chatbot reactions!

1. Positive Sentiment
Example Sentence: "I love this product!"


In [7]:
from textblob import TextBlob

text = "I love this product!"
blob = TextBlob(text)
print("Polarity:", blob.sentiment.polarity)  # > 0 → Positive

Polarity: 0.625


2. Negative Sentiment
Example Sentence: "This is the worst service ever."

In [8]:
text = "This is the worst service ever."
blob = TextBlob(text)
print("Polarity:", blob.sentiment.polarity)  # < 0 → Negative


Polarity: -1.0


3. Neutral Sentiment
Example Sentence: "It was okay, nothing special."

In [9]:
text = "It was okay, nothing special."
blob = TextBlob(text)
print("Polarity:", blob.sentiment.polarity)  # Expected ≈ 0, but the lexicon scores this sentence as positive

Polarity: 0.4285714285714286


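4. Mixed Sentiment
Example Sentence: "The food was great, but pricey."

In [10]:
text = "The food was great, but pricey."
blob = TextBlob(text)
print("Polarity:", blob.sentiment.polarity)  # The score matches "great" alone, so the result looks purely positive

Polarity: 0.8


In [11]:
polarity = blob.sentiment.polarity  # score the sentence above, not a variable left over from an earlier cell
if 0 < polarity < 0.4:
    print("Mixed Sentiment")
else:
    print("Not flagged as mixed")

Not flagged as mixed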


5. Sarcasm/Irony (lexicon-based tools such as TextBlob and VADER miss it; detecting it needs context-aware models)
Example Sentence: "Oh great, another rainy day."

In [12]:
# %pip install vaderSentiment

In [13]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = "Oh great, another rainy day."
score = analyzer.polarity_scores(text)
print(score)

{'neg': 0.155, 'neu': 0.357, 'pos': 0.488, 'compound': 0.5859}
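
VADER's documentation suggests turning the `compound` score into a label with thresholds of ±0.05. The sketch below applies that convention; the function name `label_compound` is hypothetical. Applied to the score above, it labels the sarcastic sentence as positive, which shows that a word-level lexicon does not pick up sarcasm.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound score printed above for "Oh great, another rainy day."
print(label_compound(0.5859))
```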


6. Emotion-Based Tags (using NRCLex or similar)
Example Sentence: "I'm furious about this delay."

In [14]:
%pip install nrclex


In [15]:
import nltk
nltk.download('punkt')

In [16]:
import nltk
nltk.download('punkt_tab')
from nrclex import NRCLex

text = "I'm furious about this delay."
emotion = NRCLex(text)
print(emotion.top_emotions)



[('anger', 0.25), ('negative', 0.25), ('disgust', 0.25)]


7. Aspect-Based Sentiment
Example Sentence: "Camera is great, but battery life is poor."

In [17]:
from textblob import TextBlob

text = "Camera is great, but battery life is poor."
blob = TextBlob(text)

# Aspect extraction (manually here)
aspects = {
    "Camera": TextBlob("Camera is great").sentiment.polarity,
    "Battery": TextBlob("battery life is poor").sentiment.polarity
}

print(aspects)

{'Camera': 0.8, 'Battery': -0.4}
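
The clauses above were split by hand. A rough way to automate that step is to split on commas, semicolons, and "but", then attach each clause to the first aspect keyword it mentions. This is a sketch under simplifying assumptions: the helper names are made up, and `toy_score` is a stand-in scorer so the block runs without TextBlob.

```python
import re

def split_clauses(text):
    """Split on commas, semicolons, and the conjunction 'but'."""
    parts = re.split(r"\s*(?:[,;]|\bbut\b)\s*", text, flags=re.IGNORECASE)
    return [p.strip(" .!?") for p in parts if p.strip(" .!?")]

def aspect_sentiment(text, aspects, score):
    """Score each clause and attach it to the first aspect keyword it mentions."""
    results = {}
    for clause in split_clauses(text):
        for aspect in aspects:
            if aspect.lower() in clause.lower():
                results[aspect] = score(clause)
                break
    return results

# Stand-in scorer with invented values; to use real scores, pass
# lambda clause: TextBlob(clause).sentiment.polarity instead.
TOY_SCORES = {"great": 0.8, "poor": -0.4}

def toy_score(clause):
    return sum(v for word, v in TOY_SCORES.items() if word in clause.lower())

print(aspect_sentiment("Camera is great, but battery life is poor.",
                       ["Camera", "Battery"], toy_score))
```

Keyword matching only works when the aspect word appears literally in the clause; pronouns and synonyms ("it", "the lens") would need more than this.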


# Intent Annotation

1. Basic Rule-Based intent detection

In [18]:
def detect_intent(text):
    text = text.lower()
    if "hi" in text or "hello" in text:
        return "Greeting"
    elif "bye" in text or "see you" in text:
        return "Goodbye"
    elif "order" in text:
        return "OrderProduct"
    elif "weather" in text:
        return "GetWeather"
    elif "remind" in text:
        return "SetReminder"
    else:
        return "Unknown"

print(detect_intent("Hi, I want to check the weather."))

Greeting
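
One caveat with `"hi" in text`: it is a substring test, so it also fires inside words such as "this" or "think". The variant below matches keywords as whole words with a regular expression. The names `INTENT_RULES` and `detect_intent_words` are hypothetical; the rules mirror the function above.

```python
import re

# Ordered (intent, keywords) rules; the first matching rule wins.
INTENT_RULES = [
    ("Greeting", ["hi", "hello"]),
    ("Goodbye", ["bye", "see you"]),
    ("OrderProduct", ["order"]),
    ("GetWeather", ["weather"]),
    ("SetReminder", ["remind"]),
]

def detect_intent_words(text):
    """Match keywords as whole words so 'hi' does not fire inside 'this'."""
    text = text.lower()
    for intent, keywords in INTENT_RULES:
        for keyword in keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return intent
    return "Unknown"

print(detect_intent_words("Is this the weather service?"))
print(detect_intent_words("Hi, I want to check the weather."))
```

The trade-off is that whole-word matching no longer catches inflected forms such as "reminder", so those would have to be listed explicitly.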


# 2. Using scikit-learn - Text classification (intent detection)

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
texts = [
    "Hi there", "Goodbye", "I want to order pizza",
    "What's the weather", "Remind me to study"
]
labels = ["Greeting", "Goodbye", "OrderProduct", "GetWeather", "SetReminder"]

# Vectorize and train
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, labels)

# Prediction
test = ["Book pizza", "Hey!", "Will it rain tomorrow?"]
test_vector = vectorizer.transform(test)
print(model.predict(test_vector))

['OrderProduct' 'GetWeather' 'GetWeather']
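
The last two predictions should be read with care. Neither "Hey!" nor "Will it rain tomorrow?" shares a word with the training sentences, so the model has no evidence and, with one example per class, the tie appears to be broken by class order. The sketch below checks vocabulary overlap in plain Python. It assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters); the helper names are made up.

```python
import re

def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def known_words(sentence, vocabulary):
    """Return the tokens of `sentence` that were seen during training."""
    return [tok for tok in tokenize(sentence) if tok in vocabulary]

train_texts = ["Hi there", "Goodbye", "I want to order pizza",
               "What's the weather", "Remind me to study"]
vocabulary = {tok for text in train_texts for tok in tokenize(text)}

for sentence in ["Book pizza", "Hey!", "Will it rain tomorrow?"]:
    print(f"{sentence!r}: known words = {known_words(sentence, vocabulary)}")
```

An empty list is a signal to return "Unknown" instead of trusting the classifier.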


# 3. Advanced BERT-based intent classification using transformers

In [20]:
from transformers import pipeline

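# bert-base-uncased has no fine-tuned classification head, so the head is
# randomly initialized and the LABEL_0/LABEL_1 scores below carry no meaning
# until the model is trained on labeled intent data.
classifier = pipeline("text-classification", model="bert-base-uncased", top_k=None)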

input_text = "I need to cancel my order"
output = classifier(input_text)

print(output)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


[[{'label': 'LABEL_0', 'score': 0.558242917060852}, {'label': 'LABEL_1', 'score': 0.44175708293914795}]]


# 4. Intent Annotation on Multiple Inputs (Batch)


In [21]:
inputs = ["What's the weather?", "Remind me to drink water", "Hi!", "Cancel my booking"]

for i in inputs:
    print(f"Input: {i} → Intent: {detect_intent(i)}")

Input: What's the weather? → Intent: GetWeather
Input: Remind me to drink water → Intent: SetReminder
Input: Hi! → Intent: Greeting
Input: Cancel my booking → Intent: Unknown


# 5. Simple GUI For Intent Detector (Tkinter)

In [22]:
# from tkinter import *

# def classify_intent():
#     text = entry.get()
#     intent = detect_intent(text)
#     result.set(f"Intent: {intent}")

# root = Tk()
# root.title("Intent Detector")

# entry = Entry(root, width=40)
# entry.pack(pady=10)

# Button(root, text="Detect", command=classify_intent).pack()

# result = StringVar()
# Label(root, textvariable=result).pack(pady=5)

# root.mainloop()

# Entity Annotation


Entity Annotation Programs (NER) in Python
🔹 1. Basic Named Entity Recognition using spaCy

In [23]:
# %pip install spacy
# !python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs in California."

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "→", ent.label_)

Apple Inc. → ORG
Steve Jobs → PERSON
California → GPE


# 2. Highlight Entities in Sentence

In [24]:
# from spacy import displacy

# doc = nlp("Barack Obama was born in Hawaii in 1961.")
# displacy.render(doc, style="ent")

from spacy import displacy
from IPython.display import display, HTML

doc = nlp("Barack Obama was born in Hawaii in 1961.")
html = displacy.render(doc, style="ent", jupyter=False)  # Returns HTML
display(HTML(html))


3. Custom Entity Recognition (Rule-based with spaCy)

In [25]:
from spacy.tokens import Span

text = "Ravneet bought a PS5 from Amazon."
doc = nlp(text)

# Manually mark "PS5" as PRODUCT
span = Span(doc, 3, 4, label="PRODUCT")
doc.ents = list(doc.ents) + [span]

for ent in doc.ents:
    print(ent.text, ent.label_)


PS5 PRODUCT
Amazon ORG


4. Extract Specific Entity Types Only

In [26]:
text = "Elon Musk launched SpaceX in the United States."

doc = nlp(text)

for ent in doc.ents:
    if ent.label_ in ["PERSON", "ORG", "GPE"]:
        print(ent.text, "→", ent.label_)

Elon Musk → PERSON
the United States → GPE
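
Once entities are available as `(text, label)` pairs, filtering and grouping them is ordinary Python. The sketch below groups pairs by label and optionally keeps only some labels. The pairs are hard-coded so the block runs without a spaCy model; with spaCy they could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The function name `group_by_label` is hypothetical.

```python
from collections import defaultdict

def group_by_label(entities, keep=None):
    """Group (text, label) pairs by label, optionally keeping only some labels."""
    grouped = defaultdict(list)
    for text, label in entities:
        if keep is None or label in keep:
            grouped[label].append(text)
    return dict(grouped)

# Hard-coded pairs taken from the first spaCy example in this section.
pairs = [("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"), ("California", "GPE")]
print(group_by_label(pairs, keep={"PERSON", "GPE"}))
```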


5. Entity Recognition with NLTK (Basic)

In [27]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

In [29]:
# Ensure required NLTK resources are installed
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')  # required for ne_chunk
nltk.download('punkt')  # for word_tokenize
nltk.download('averaged_perceptron_tagger')  # for pos_tag

from nltk import word_tokenize, pos_tag, ne_chunk

text = "Bill Gates founded Microsoft in 1975."

# Tokenize text into words
tokens = word_tokenize(text)

# Tag words with part-of-speech labels
tags = pos_tag(tokens)

# Perform Named Entity Recognition
tree = ne_chunk(tags)

# Print the named entity tree
print(tree)


(S
  (PERSON Bill/NNP)
  (PERSON Gates/NNP)
  founded/VBD
  (ORGANIZATION Microsoft/NNP)
  in/IN
  1975/CD
  ./.)
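
In the tree above, "Bill" and "Gates" are tagged as two separate PERSON chunks. A simple post-processing step is to merge neighbouring tokens that carry the same label. The sketch works on plain `(token, label)` pairs copied from the tree, so it runs without NLTK; `merge_adjacent` is a made-up helper name.

```python
def merge_adjacent(tagged):
    """Join neighbouring tokens that carry the same entity label.

    `tagged` is a list of (token, label) pairs, with label None for
    tokens outside any entity.
    """
    merged = []
    for token, label in tagged:
        if label is not None and merged and merged[-1][1] == label:
            merged[-1] = (merged[-1][0] + " " + token, label)
        else:
            merged.append((token, label))
    # Drop the non-entity tokens from the result.
    return [(text, label) for text, label in merged if label is not None]

# Labels copied from the tree printed above.
tagged = [("Bill", "PERSON"), ("Gates", "PERSON"), ("founded", None),
          ("Microsoft", "ORGANIZATION"), ("in", None), ("1975", None), (".", None)]
print(merge_adjacent(tagged))
```

This heuristic would also merge two different people named back to back, so it is a convenience and not a replacement for a better model.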


# Advanced: Fine-tuned Transformers for NER with Hugging Face

In [30]:
from transformers import pipeline

# Use a BERT model fine-tuned for NER; aggregation_strategy="simple" merges word pieces into whole entities
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Sundar Pichai is the CEO of Google based in California."

results = ner_pipeline(text)
for entity in results:
    print(entity['word'], "→", entity['entity_group'])


Sundar Pichai → PER
Google → ORG
California → LOC


# Types of Text Classification

# 1. Basic Rule-Based Classifier

In [31]:
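def classify_text(text):
    text = text.lower()  # match keywords regardless of capitalization
    # Note: these are substring checks, so "hi" also matches inside "this"
    if "buy now" in text or "win" in text:
        return "Spam"
    elif "hello" in text or "hi" in text:
        return "Greeting"
    else:
        return "General"

print(classify_text("Win a free iPhone!"))

Spam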


# 2. Using scikit-learn (TF-IDF + Naive Bayes)

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
texts = ["I love this!", "This is terrible", "Win cash now!", "Hello there", "Goodbye"]
labels = ["Positive", "Negative", "Spam", "Greeting", "Farewell"]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train model
model = MultinomialNB()
model.fit(X, labels)

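# Test prediction
# Neither test sentence shares a word with the training data ("won" is not "win"),
# so the model has no evidence and both predictions fall back to the same class.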
test = ["You won a prize!", "That was amazing"]
X_test = vectorizer.transform(test)
print(model.predict(X_test))


['Farewell' 'Farewell']
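
For readers who want to see what the TF-IDF step computes, here is a plain-Python sketch. It assumes the smoothed formula that scikit-learn documents for `TfidfVectorizer` defaults, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row. The function name `tfidf` and the three sample documents are made up.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["win cash now", "win a prize", "hello there"]
for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

"win" appears in two documents, so it gets a lower weight than "cash" or "prize", which appear in one. That is the point of the IDF term: common words count for less.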


# 3. Advanced: Using Hugging Face Transformers (BERT, RoBERTa)

In [33]:
from transformers import pipeline

# Name the model explicitly; this is the checkpoint the pipeline otherwise picks by default
classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
result = classifier("I absolutely love the new update!")
print(result)

[{'label': 'POSITIVE', 'score': 0.999871015548706}]


# 4. Multi-Class Text Classification (Custom Categories)

In [34]:
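texts = [
    "The election results were announced",
    "Ronaldo scored two goals",
    "NASA launched a new satellite"
]
labels = ["Politics", "Sports", "Science"]

# Same vectorizer + classifier setup as in example 2 (imports from In [32])
vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(texts), labels)

print(model.predict(vectorizer.transform(["Ronaldo scored again"])))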

# 5. GUI-Based Text Classification (Tkinter)

In [35]:
# from tkinter import *

# def classify():
#     text = entry.get().lower()
#     if "win" in text:
#         result.set("Spam")
#     elif "love" in text:
#         result.set("Positive")
#     else:
#         result.set("Neutral")

# root = Tk()
# root.title("Text Classifier")

# entry = Entry(root, width=50)
# entry.pack(pady=10)

# Button(root, text="Classify", command=classify).pack()
# result = StringVar()
# Label(root, textvariable=result).pack(pady=5)

# root.mainloop()


# Linguistic Annotation

Using spaCy, NLTK, and Stanza

1. POS Tagging with spaCy

In [36]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

for token in doc:
    print(token.text, "→", token.pos_)

The → DET
quick → ADJ
brown → ADJ
fox → NOUN
jumps → VERB
over → ADP
the → DET
lazy → ADJ
dog → NOUN
. → PUNCT


2. Lemmatization using spaCy

In [37]:
for token in doc:
    print(token.text, "→", token.lemma_)

The → the
quick → quick
brown → brown
fox → fox
jumps → jump
over → over
the → the
lazy → lazy
dog → dog
. → .


3. Dependency Parsing with spaCy (Syntax Tree)

In [38]:
for token in doc:
    print(token.text, "←", token.head.text, "(", token.dep_, ")")

The ← fox ( det )
quick ← fox ( amod )
brown ← fox ( amod )
fox ← jumps ( nsubj )
jumps ← jumps ( ROOT )
over ← jumps ( prep )
the ← dog ( det )
lazy ← dog ( amod )
dog ← over ( pobj )
. ← jumps ( punct )
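
The flat list above is easier to read as a tree. The sketch below rebuilds it from `(token, head, relation)` triples copied from the output, so it runs without spaCy. `build_tree` and `show` are made-up helper names, and keying on token text is a simplification: with repeated words, real code should key on the token index (`token.i`) instead.

```python
from collections import defaultdict

# (token, head, relation) triples copied from the output above.
triples = [
    ("The", "fox", "det"), ("quick", "fox", "amod"), ("brown", "fox", "amod"),
    ("fox", "jumps", "nsubj"), ("jumps", "jumps", "ROOT"), ("over", "jumps", "prep"),
    ("the", "dog", "det"), ("lazy", "dog", "amod"), ("dog", "over", "pobj"),
    (".", "jumps", "punct"),
]

def build_tree(triples):
    """Return the root token and a head -> [children] mapping."""
    children = defaultdict(list)
    root = None
    for token, head, relation in triples:
        if relation == "ROOT":
            root = token
        else:
            children[head].append(token)
    return root, dict(children)

def show(node, children, depth=0):
    """Print the tree with two spaces of indentation per level."""
    print("  " * depth + node)
    for child in children.get(node, []):
        show(child, children, depth + 1)

root, children = build_tree(triples)
show(root, children)
```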


4. POS Tagging with NLTK

In [39]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "Ravneet is writing code."
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)

print(tags)

[('Ravneet', 'NNP'), ('is', 'VBZ'), ('writing', 'VBG'), ('code', 'NN'), ('.', '.')]



5. Dependency and POS using Stanza (Stanford NLP)

In [None]:
# %pip install stanza

In [None]:
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en')

doc = nlp("The girl read a book.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text} → POS: {word.upos}, Dep: {word.deprel}")


6. Coreference Resolution (with HuggingFace)

In [None]:
# ✅ Install Transformers (if needed in Google Colab)
!pip install -q transformers

# 🧠 Load Libraries
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import re
from IPython.display import display, HTML

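# 🔍 Load Model and Tokenizer
# Caution: this checkpoint is distributed for the `fastcoref` package. Loaded through
# AutoModelForTokenClassification, it most likely gets a newly initialized classification
# head, so the per-token labels below should not be read as real coreference clusters.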
model_name = "biu-nlp/lingmess-coref"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# ✍️ Input Text
text = "Ravneet submitted her paper. She hopes the reviewers will appreciate her work."

# 🔡 Tokenization
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# 📊 Prediction: Get token-level labels
logits = outputs.logits
predicted_labels = torch.argmax(logits, dim=-1).squeeze().tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())

# 🔁 Merge BPE tokens into readable words with labels
def merge_tokens_and_labels(tokens, labels):
    words, current_word, current_label = [], "", None
    for token, label in zip(tokens, labels):
        if token in ["<s>", "</s>"]:
            continue
        clean_token = token.replace("Ġ", " ") if token.startswith("Ġ") else token
        if token.startswith("Ġ") and current_word:
            words.append((current_word.strip(), current_label))
            current_word = clean_token
            current_label = label
        else:
            current_word += clean_token
            current_label = label
    if current_word:
        words.append((current_word.strip(), current_label))
    return words

resolved = merge_tokens_and_labels(tokens, predicted_labels)

# 🖨️ Print results
print("🔁 Tokens + Coreference Labels\n" + "-" * 40)
for word, label in resolved:
    print(f"{word:20} → Label {label}")


Visualize Coreference Labels in Color

In [None]:
# 🎨 HTML Highlight by Coreference Label
def highlight_coref(text, resolved_words):
    html = text
    seen = set()
    for word, label in resolved_words:
        if label != 0 and word not in seen:
            seen.add(word)
            html = re.sub(rf"\b{re.escape(word)}\b",
                          f"<mark style='background-color:#ffff99'>{word}</mark>", html)
    return html

highlighted = highlight_coref(text, resolved)
display(HTML(f"<p style='font-family:monospace;font-size:16px'>{highlighted}</p>"))
