
In [1]:
# NLP Assignment Solution Using Generative AI

## Task 1: Text Preprocessing Using spaCy

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Natural Language Processing (NLP) is a fascinating field of AI that enables computers to understand human language."

doc = nlp(text)

# Tokenization: lowercase each token and drop punctuation and digit-only tokens
tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_digit]
print("Tokenized Text:", tokens)

# Stopword Removal and Lemmatization
lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
print("Lemmatized Text:", lemmas)





Tokenized Text: ['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'of', 'ai', 'that', 'enables', 'computers', 'to', 'understand', 'human', 'language']
Lemmatized Text: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'ai', 'enable', 'computer', 'understand', 'human', 'language']


What the Code Does:

This code uses the spaCy library to split a sentence into smaller units (called tokens) and to reduce each word to its dictionary form (called its lemma).

Tokenization: It separates a sentence into words and punctuation marks. The code then drops the punctuation and digit-only tokens and lowercases what remains.

Lemmatization: It maps each word to its base dictionary form. For example, “running” becomes “run.” Before printing the lemmas, the code also removes stop words (very common words such as "is" and "the").

Example Input: "Artificial Intelligence is shaping the future."

If you run the code on this sentence instead of the one in the cell, it splits it into lowercased tokens like:

Tokens: ['artificial', 'intelligence', 'is', 'shaping', 'the', 'future']

And the lemmatizer maps words to their base forms:

"shaping" becomes "shape."

"is" becomes "be," although "is" is a stop word, so it is filtered out of the printed lemma list.

So, tokenization breaks the sentence into smaller parts, and lemmatization reduces those parts to their base forms.
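To make the two steps concrete without loading a spaCy model, here is a minimal pure-Python sketch. It is a toy stand-in, not how spaCy works internally: the `STOP_WORDS` set, the `LEMMA_TABLE` dictionary, and the helper names are all assumptions made up for this illustration, and they only cover the example sentence.

```python
import re

# Toy stand-ins for spaCy's resources (assumptions for illustration only).
STOP_WORDS = {"is", "the", "a", "of", "that", "to"}
LEMMA_TABLE = {"is": "be", "shaping": "shape", "computers": "computer"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def is_punct(token):
    """Return True when the token contains no letters, digits, or underscores."""
    return re.fullmatch(r"[^\w\s]+", token) is not None


def lemmatize(token):
    """Look up the lowercased token in the toy table; fall back to the token itself."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


sentence = "Artificial Intelligence is shaping the future."

# Step 1: tokenization keeps punctuation as separate tokens.
raw_tokens = tokenize(sentence)

# Step 2: lowercase and drop punctuation, mirroring the first print in the cell.
words = [t.lower() for t in raw_tokens if not is_punct(t)]

# Step 3: drop stop words and punctuation, then lemmatize what is left.
lemmas = [
    lemmatize(t)
    for t in raw_tokens
    if not is_punct(t) and t.lower() not in STOP_WORDS
]

print("Raw tokens:", raw_tokens)
print("Words:", words)
print("Lemmas:", lemmas)
```

The point of the sketch is the order of operations: the stop-word filter runs on the original tokens, so "is" never reaches the lemmatizer in the second list.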

In [2]:
## Task 2: Sentiment Analysis Using Fine-Tuned Model

from transformers import pipeline

sentiment_model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

label_map = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

text_samples = [
    "I love this product! It's amazing.",
    "The service was terrible and I am very disappointed.",
    "The movie was okay, not the best but not the worst."
]

sentiment_counts = {"Positive": 0, "Neutral": 0, "Negative": 0}

for text in text_samples:
    sentiment = sentiment_model(text)
    label = label_map[sentiment[0]['label']]
    confidence = round(sentiment[0]['score'], 2)
    sentiment_counts[label] += 1
    print(f"Text: {text}")
    print(f"Sentiment Analysis Result: {label} (Confidence: {confidence})")
    print("-" * 50)

print("Sentiment Distribution:", sentiment_counts)





Text: I love this product! It's amazing.
Sentiment Analysis Result: Positive (Confidence: 0.99)
--------------------------------------------------
Text: The service was terrible and I am very disappointed.
Sentiment Analysis Result: Negative (Confidence: 0.98)
--------------------------------------------------
Text: The movie was okay, not the best but not the worst.
Sentiment Analysis Result: Neutral (Confidence: 0.49)
--------------------------------------------------
Sentiment Distribution: {'Positive': 1, 'Neutral': 1, 'Negative': 1}
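The Neutral prediction above has a confidence of only 0.49, so it may be worth separating confident predictions from uncertain ones. The sketch below shows one way to do that using the labels and scores copied from the recorded output. The `flag_uncertain` helper and the 0.60 cutoff are assumptions chosen for illustration, not part of the transformers API.

```python
# Records copied from the recorded output above.
results = [
    {"text": "I love this product! It's amazing.",
     "label": "Positive", "score": 0.99},
    {"text": "The service was terrible and I am very disappointed.",
     "label": "Negative", "score": 0.98},
    {"text": "The movie was okay, not the best but not the worst.",
     "label": "Neutral", "score": 0.49},
]


def flag_uncertain(records, threshold=0.60):
    """Return the records whose score falls below the threshold.

    The threshold is an arbitrary assumption; tune it for your own data.
    """
    return [record for record in records if record["score"] < threshold]


uncertain = flag_uncertain(results)
for record in uncertain:
    print(f"Low confidence ({record['score']}): {record['label']} -> {record['text']}")
```

A flagged record is not necessarily wrong; it only marks a prediction where the model spread its probability across several labels.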


In [3]:


## Task 3: Named Entity Recognition Using Generative AI with Token Aggregation

import warnings
warnings.filterwarnings("ignore")

ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

text_sample = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in California. The company made $274.5 billion in revenue in 2020."

entities = ner_pipeline(text_sample)
entities_sorted = sorted(entities, key=lambda x: x['score'], reverse=True)

print("Named Entity Recognition (NER) Results:")
for entity in entities_sorted:
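    # Cast to a built-in float first: the pipeline returns NumPy float32 scores,
    # and rounding those directly can print values like 0.9900000095367432.
    confidence = round(float(entity['score']), 2)
    print(f"Entity: {entity['word']:<15} | Type: {entity['entity_group']:<5} | Confidence: {confidence}")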






Named Entity Recognition (NER) Results:
Entity: California      | Type: LOC   | Confidence: 1.0
Entity: Apple Inc       | Type: ORG   | Confidence: 1.0
Entity: Steve Jobs      | Type: PER   | Confidence: 0.99
Entity: Steve Wozniak   | Type: PER   | Confidence: 0.89
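The `aggregation_strategy="simple"` argument is why "Steve Jobs" appears as one entity instead of two separate tokens. The following is a simplified pure-Python sketch of that kind of grouping, not the library's actual implementation: the `group_entities` function, the B-/I- tags, and the per-token scores are hypothetical values made up to show the idea of merging consecutive tokens and averaging their scores.

```python
from statistics import mean


def group_entities(tagged_tokens):
    """Merge consecutive tagged tokens into entity spans.

    tagged_tokens is a list of (word, tag, score) tuples with tags such as
    "B-PER", "I-PER", or "O". A new span starts on a "B-" tag or when the
    entity type changes; "O" tokens close the current span.
    """
    spans = []
    current = None
    for word, tag, score in tagged_tokens:
        if tag == "O":
            if current is not None:
                spans.append(current)
                current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        starts_new = (
            current is None or prefix == "B" or current["type"] != entity_type
        )
        if starts_new:
            if current is not None:
                spans.append(current)
            current = {"type": entity_type, "words": [word], "scores": [score]}
        else:
            current["words"].append(word)
            current["scores"].append(score)
    if current is not None:
        spans.append(current)
    return [
        {
            "entity_group": span["type"],
            "word": " ".join(span["words"]),
            "score": round(mean(span["scores"]), 2),
        }
        for span in spans
    ]


# Hypothetical per-token predictions for part of the sample sentence.
tokens = [
    ("Steve", "B-PER", 0.99), ("Jobs", "I-PER", 0.99),
    ("and", "O", 0.0),
    ("Steve", "B-PER", 0.90), ("Wozniak", "I-PER", 0.88),
    ("in", "O", 0.0),
    ("California", "B-LOC", 1.0),
]

for entity in group_entities(tokens):
    print(entity)
```

Averaging is one reasonable way to score a merged span; a span with one weak token ends up with a lower score than its strongest token.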


In [4]:
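## Task 4: Machine Translation Using Generative AI

# Install sacremoses before building the pipeline. The tokenizer for this model
# recommends it, and installing it after translating has no effect on this run.
import subprocess
import sys
subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", "sacremoses"], check=True)

translation_pipeline = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")

sentence = "Natural language processing is a branch of AI."
translated_text = translation_pipeline(sentence)[0]['translation_text']
print("Translated Text:", translated_text)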




Translated Text: El procesamiento del lenguaje natural es una rama de la IA.


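Rather than installing a package unconditionally, a cell can first check whether it is already available. This is a minimal sketch using only the standard library; the `is_installed` and `missing_modules` helpers are hypothetical names introduced here, and the check only covers top-level module names.

```python
import importlib.util


def is_installed(module_name):
    """Return True when a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(module_names):
    """Return the subset of module names that cannot be found."""
    return [name for name in module_names if not is_installed(name)]


# Check the optional dependency used in this task before building the pipeline.
missing = missing_modules(["sacremoses"])
if missing:
    print("Install before running the pipeline:", ", ".join(missing))
else:
    print("All optional dependencies are available.")
```

Running the check at the top of the cell keeps the install step and the code that benefits from it in the right order.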

In [5]:

## Task 5: Topic Modeling & Classification Using Generative AI

topic_model_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

articles = {
    "Article 1": "The new government policy on climate change was announced today.",
    "Article 2": "The stock market has seen an unexpected rise this week.",
    "Article 3": "Scientists discovered a new exoplanet that could support life.",
    "Article 4": "The latest smartphone has innovative features never seen before."
}

categories = ["Politics", "Finance", "Science", "Technology"]

classified_articles = {}
for article, content in articles.items():
    classification = topic_model_pipeline(content, candidate_labels=categories)
    classified_articles[article] = (classification['labels'][0], round(classification['scores'][0], 2))

# Sort Articles by Confidence
classified_articles_sorted = sorted(classified_articles.items(), key=lambda x: x[1][1], reverse=True)

print("Automatically Classified Articles:")
print("{:<10} | {:<15} | {:<10}".format("Article", "Category", "Confidence"))
print("-" * 40)
for article, (category, confidence) in classified_articles_sorted:
    print(f"{article:<10} | {category:<15} | {confidence:<10}")



Automatically Classified Articles:
Article    | Category        | Confidence
----------------------------------------
Article 3  | Science         | 0.93      
Article 4  | Technology      | 0.93      
Article 1  | Politics        | 0.64      
Article 2  | Finance         | 0.55      
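The confidence values in this table come from comparing the candidate labels against each other. With its default single-label settings, a zero-shot pipeline turns each label into a hypothesis, scores how strongly the text entails it, and normalizes those entailment scores with a softmax so they sum to 1. The sketch below illustrates that last step in pure Python. The logit values are hypothetical numbers invented for illustration, and `rank_labels` is a made-up helper, not a transformers function.

```python
import math


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(entailment_logits):
    """Return (label, probability) pairs sorted from most to least likely."""
    labels = list(entailment_logits)
    probabilities = softmax([entailment_logits[label] for label in labels])
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)


# Hypothetical entailment logits for one article (not real model output).
hypothetical_logits = {
    "Politics": 2.1,
    "Finance": 0.3,
    "Science": -0.5,
    "Technology": 0.9,
}

ranking = rank_labels(hypothetical_logits)
for label, probability in ranking:
    print(f"{label:<12} {probability:.2f}")
```

Because the scores are normalized across the candidate list, adding or removing a category changes every confidence value, which is one reason a score like 0.55 should be read relative to the other labels offered.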
