Q1: Implementing an RNN for Text Generation

In [3]:
# ✅ Step 1: Import Libraries
import tensorflow as tf
import numpy as np
import time

# ✅ Step 2: Load Shakespeare Text
path = tf.keras.utils.get_file("shakespeare.txt",
        "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt")
with open(path, 'rb') as f:  # close the file handle once the text is read
    text = f.read().decode(encoding='utf-8')

# ✅ Step 3: Preprocess Text
vocab = sorted(set(text))
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)
chars_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
seq_length = 100
sequences = tf.data.Dataset.from_tensor_slices(all_ids).batch(seq_length + 1, drop_remainder=True)

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# ✅ Step 4: Batch and Shuffle
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

# ✅ Step 5: Define LSTM Model (Subclassed)
class TextGenModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(rnn_units, return_sequences=True, return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)
        if states is None:
            x, state_h, state_c = self.lstm(x, training=training)
        else:
            x, state_h, state_c = self.lstm(x, initial_state=states, training=training)
        x = self.dense(x, training=training)

        if return_state:
            return x, [state_h, state_c]
        else:
            return x

# ✅ Step 6: Build & Compile Model
vocab_size = len(ids_from_chars.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

model = TextGenModel(vocab_size, embedding_dim, rnn_units)

loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss)

# ✅ Step 7: Train (small epoch for testing)
EPOCHS = 1  # Set to 10 or 20 for better output later
model.fit(dataset, epochs=EPOCHS)


172/172 ━━━━━━━━━━━━━━━━━━━━ 18s 69ms/step - loss: 3.1691


<keras.src.callbacks.history.History at 0x7c9ff7148190>
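
The input/target split from Step 3 can be illustrated on a short string. This is a minimal sketch in plain Python (no TensorFlow), using a hypothetical 5-character chunk in place of the 101-character chunks used above. The target is the input shifted one character to the left, so at every position the model is trained to predict the next character.

```python
def split_input_target(sequence):
    """Return (input, target), where target is the input shifted by one step."""
    return sequence[:-1], sequence[1:]

chunk = list("Hello")  # stands in for one (seq_length + 1)-long chunk
input_text, target_text = split_input_target(chunk)

# Each input character is paired with the character that follows it.
for inp, tgt in zip(input_text, target_text):
    print(f"input {inp!r} -> target {tgt!r}")
```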

In [4]:
class OneStep(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
        super().__init__()
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars
        self.temperature = temperature

        skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
        sparse_mask = tf.SparseTensor(
            indices=skip_ids,
            values=[-float('inf')]*len(skip_ids),
            dense_shape=[len(ids_from_chars.get_vocabulary())])
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function
    def generate_one_step(self, inputs, states=None):
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.ids_from_chars(input_chars).to_tensor()

        predicted_logits, states = self.model(inputs=input_ids, states=states, return_state=True)
        predicted_logits = predicted_logits[:, -1, :] / self.temperature
        predicted_logits += self.prediction_mask

        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)
        predicted_chars = self.chars_from_ids(predicted_ids)

        return predicted_chars, states
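
The division by `self.temperature` in `generate_one_step` controls how random the sampling is. The following is a minimal NumPy sketch of that effect. The logits are made-up values for a three-character vocabulary (an assumption for illustration, not numbers taken from the trained model), and the `softmax` helper is defined here only to show the resulting probabilities.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits, not model output

# Dividing logits by the temperature before softmax reshapes the distribution.
probs_by_temperature = {t: softmax(logits / t) for t in (0.5, 1.0, 2.0)}
for t, probs in probs_by_temperature.items():
    print(f"temperature={t}: {np.round(probs, 3)}")
```

Lower temperatures concentrate probability on the highest-scoring character, which gives more predictable text. Higher temperatures flatten the distribution, which gives more varied but noisier text.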


In [5]:
# Create instance of OneStep
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

states = None
next_char = tf.constant(["Once upon a time,"])
result = [next_char]

for _ in range(500):
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)

generated_text = tf.strings.join(result)
print("\n--- Generated Text ---\n")
print(generated_text[0].numpy().decode('utf-8'))



--- Generated Text ---

Once upon a time,

POUPGE:
Th sol on hesp, thet thee nor ny ofmind dunca and ond to mith on cotran, jew an ne whrast nobe:
Ond thilld?
Goth ell al noun ba fray limf and me hromk;
Mte, sepalad
Lino coothing ish his at thee pime.
Mlelcol, he'l my bightrl rime ciea if blwithiD, furebinad you lete seash; wuve rir to grie tad ip mreod lue ascere's fapl ond hawid tood co hustre on this tuer;
Sowly in fhat, touth, is und aild hy toldy Isun t'lllly, fore prove cuxd tante ppousldfor oo heove, yound the cony thend, ang wo


Q2: NLP Preprocessing Pipeline

In [9]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

# Download stopwords only
nltk.download('stopwords', force=True)

# ✅ Step 1: Input sentence
sentence = "NLP techniques are used in virtual assistants like Alexa and Siri."

# ✅ Step 2: Custom Tokenizer (no punkt)
tokens = re.findall(r'\b\w+\b', sentence)

# ✅ Step 3: Stopword removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# ✅ Step 4: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]

# ✅ Output
print("1. Original Tokens:")
print(tokens)

print("\n2. Tokens Without Stopwords:")
print(filtered_tokens)

print("\n3. Stemmed Words:")
print(stemmed_words)


1. Original Tokens:
['NLP', 'techniques', 'are', 'used', 'in', 'virtual', 'assistants', 'like', 'Alexa', 'and', 'Siri']

2. Tokens Without Stopwords:
['NLP', 'techniques', 'used', 'virtual', 'assistants', 'like', 'Alexa', 'Siri']

3. Stemmed Words:
['nlp', 'techniqu', 'use', 'virtual', 'assist', 'like', 'alexa', 'siri']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


1. What is the difference between stemming and lemmatization? Provide examples with the word “running.”

Stemming is a rule-based process that chops off word endings to reduce a word to its root form, often without considering whether the result is a valid word.

Lemmatization, on the other hand, uses a dictionary and grammar rules to convert a word to its base or lemma form, which is always a valid word.
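
The two approaches described above can be contrasted with a toy sketch in plain Python. Both helpers are hypothetical: `naive_stem` is a deliberately crude suffix stripper (much simpler than the Porter stemmer used in the code cell), and `LEMMAS` is a three-entry stand-in for a real lexicon such as WordNet.

```python
def naive_stem(word):
    """Crude rule-based stemmer: strip the first matching suffix, nothing else."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMAS = {"running": "run", "ran": "run", "techniques": "technique"}

def lookup_lemma(word):
    """Dictionary-based lemmatization: return the known base form, else the word."""
    return LEMMAS.get(word, word)

for w in ("running", "ran", "techniques"):
    print(f"{w:>10} | stem: {naive_stem(w):<9} | lemma: {lookup_lemma(w)}")
```

The rule-based version cannot relate "ran" to "run" and can leave fragments that are not words, while the dictionary lookup always returns a valid base form for the words it knows.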

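| Feature | Stemming | Lemmatization |
|---|---|---|
| Logic | Rule-based suffix stripping | Dictionary + grammar-based |
| Accuracy | Less accurate | More accurate |
| Example ("running") | → run with the Porter stemmer (a cruder suffix stripper could leave runn) | → run (when tagged as a verb) |
| Output Valid Word? | Not always (e.g., "techniques" → techniqu) | Yes |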

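📝 Example:

"running" → Stemming (Porter): run

"running" → Lemmatization (as a verb): run

Here both give the same result. They differ on words like "techniques", where the Porter stemmer in the code above outputs techniqu and a lemmatizer returns technique.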

2. Why might removing stop words be useful in some NLP tasks, and when might it actually be harmful?

✅ Useful When:
Stopwords (like "the", "is", "in") add little meaning for the task.

Removing them reduces dimensionality and noise.

It can improve performance in:

Text classification

Topic modeling

Clustering

❌ Harmful When:
Stopwords carry important meaning in context.

This matters in tasks like:

Machine Translation

Question Answering

Text Summarization

Sentiment Analysis

Removing stopwords like “not,” “never,” or “until” can reverse or distort the meaning of the sentence.

🔹 Example: In sentiment analysis, “not good” becomes “good” if “not” is removed, which leads to incorrect sentiment.
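
The negation problem can be reproduced in a few lines. This is a minimal sketch in plain Python: `STOP_WORDS` is a hypothetical five-word list chosen for illustration (real lists are much longer), and the `keep` parameter is one possible way to protect negations, not a standard API.

```python
import re

# Hypothetical mini stopword list for illustration only.
STOP_WORDS = {"the", "was", "is", "a", "not"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(sentence, keep=frozenset()):
    """Drop stopwords, except any listed in `keep`."""
    tokens = re.findall(r"\b\w+\b", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

review = "The movie was not good"
without_negation = remove_stopwords(review)                # negation is lost
with_negation = remove_stopwords(review, keep=NEGATIONS)   # negation is preserved

print(without_negation)
print(with_negation)
```

Keeping a small allow-list of negation words is a common compromise when stopword removal is still wanted for a sentiment task.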

Q3: Named Entity Recognition with spaCy

In [10]:
# ✅ Step 1: Install spaCy and English model (uncomment if using in Colab)
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

import spacy

# ✅ Step 2: Load the small English model
nlp = spacy.load("en_core_web_sm")

# ✅ Step 3: Input sentence
text = "Barack Obama served as the 44th President of the United States and won the Nobel Peace Prize in 2009."

# ✅ Step 4: Apply spaCy NER
doc = nlp(text)

# ✅ Step 5: Print all named entities
print("Named Entities Found:\n")
for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Start Char: {ent.start_char}, End Char: {ent.end_char}")
    print("---")


Named Entities Found:

Entity: Barack Obama
Label: PERSON
Start Char: 0, End Char: 12
---
Entity: 44th
Label: ORDINAL
Start Char: 27, End Char: 31
---
Entity: the United States
Label: GPE
Start Char: 45, End Char: 62
---
Entity: the Nobel Peace Prize
Label: WORK_OF_ART
Start Char: 71, End Char: 92
---
Entity: 2009
Label: DATE
Start Char: 96, End Char: 100
---


1. How does NER differ from POS tagging in NLP?

Difference Between NER and POS Tagging
| Feature | NER (Named Entity Recognition) | POS Tagging (Part-of-Speech Tagging) |
|---|---|---|
| Purpose | Identifies real-world entities like names, dates | Identifies grammatical role of each word |
| Output | Entity label (e.g., PERSON, ORG, DATE) | POS tag (e.g., Noun, Verb, Adjective) |
| Example | "Barack Obama" → PERSON | "Obama" → Proper Noun (NNP) |
| Use Case | Used in Information Extraction, Search Engines | Used in Syntax Analysis, Parsing |

🧠 NER helps machines understand who/what/where in a sentence.
🧠 POS helps machines understand the role of each word in grammar.
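
One structural difference is that NER labels spans of text, while POS tagging labels every word. The sketch below illustrates this in plain Python, without running spaCy: the `(start_char, end_char, label)` triples are copied from the NER output printed above, and the word splitting uses a simple regular expression rather than spaCy's tokenizer (an assumption made for illustration).

```python
import re

text = ("Barack Obama served as the 44th President of the United States "
        "and won the Nobel Peace Prize in 2009.")

# (start_char, end_char, label) triples copied from the spaCy output above.
entities = [(0, 12, "PERSON"), (27, 31, "ORDINAL"), (45, 62, "GPE"),
            (71, 92, "WORK_OF_ART"), (96, 100, "DATE")]

# NER output is span-level: slicing by character offsets recovers each entity.
entity_texts = [text[start:end] for start, end, _ in entities]

# Word-level view: a POS tagger would tag every one of these words,
# while NER leaves the words outside any entity span unlabeled.
words = [(m.start(), m.group()) for m in re.finditer(r"\w+", text)]
outside = [w for pos, w in words
           if not any(start <= pos < end for start, end, _ in entities)]

print(entity_texts)
print(f"{len(words)} words, {len(outside)} outside any entity: {outside}")
```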

2. Describe two applications that use NER in the real world.

1. Financial News Analysis
NER helps extract company names, stock tickers, and monetary values from articles.

Example:
“Apple acquired Beats for $3 billion”
→ Entities: Apple (ORG), Beats (ORG), $3 billion (MONEY)

2. Search Engines / Virtual Assistants
Improves query understanding by detecting entities in user questions.

Example:
“When did Nelson Mandela become president?”
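→ Entities: Nelson Mandela (PERSON). The word “president” is a title rather than a named entity, and spaCy's label set has no TITLE label, which matches the output above where “President” is left untagged.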

Q4: Scaled Dot-Product Attention

In [11]:
import numpy as np
import tensorflow as tf

# ✅ Step 1: Define Q, K, V
Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
K = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
V = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# ✅ Step 2: Scaled Dot-Product Attention Function
def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]  # key dimension d_k (per attention head; d_model / num_heads in multi-head attention)

    # Step 1: Compute raw attention scores → Q @ K^T
    scores = np.matmul(Q, K.T)

    # Step 2: Scale by sqrt(d_k)
    scaled_scores = scores / np.sqrt(d_k)

    # Step 3: Apply softmax to get attention weights
    attention_weights = tf.nn.softmax(scaled_scores, axis=-1).numpy()

    # Step 4: Multiply attention weights by V
    output = np.matmul(attention_weights, V)

    return attention_weights, output

# ✅ Step 3: Run attention
attention_weights, output = scaled_dot_product_attention(Q, K, V)

# ✅ Step 4: Display results
print("Attention Weights (Softmax Output):")
print(np.round(attention_weights, 4))

print("\nFinal Output (Attention × V):")
print(np.round(output, 4))


Attention Weights (Softmax Output):
[[0.7311 0.2689]
 [0.2689 0.7311]]

Final Output (Attention × V):
[[2.0758 3.0758 4.0758 5.0758]
 [3.9242 4.9242 5.9242 6.9242]]


1. Why do we divide the attention score by √d in the scaled dot-product attention formula?

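If the components of Q and K are roughly independent with zero mean and unit variance, their dot product has variance d, so raw scores grow in magnitude as the dimension increases. Large scores push softmax into a saturated region where one weight is close to 1 and the rest are close to 0, which produces extremely small gradients and makes learning difficult.

✅ Dividing by √d (where d is the key dimension) brings the score variance back to about 1, which keeps softmax out of saturation, stabilizes gradients, and improves training convergence.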
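
The effect of the scaling can be checked empirically. This is a minimal NumPy sketch, assuming query and key components drawn independently from a standard normal distribution. The dimension `d_k = 64`, the sample count, and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 10_000

# Random queries and keys with zero-mean, unit-variance components (an assumption).
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw_scores = np.sum(q * k, axis=-1)        # one dot product per sample
scaled_scores = raw_scores / np.sqrt(d_k)  # the scaling used in attention

raw_var = raw_scores.var()
scaled_var = scaled_scores.var()
print(f"variance of raw scores:    {raw_var:.2f} (d_k = {d_k})")
print(f"variance of scaled scores: {scaled_var:.2f}")
```

Under these assumptions the raw variance should come out near `d_k` and the scaled variance near 1, which is the range where softmax still produces useful gradients.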

2. How does self-attention help the model understand relationships between words in a sentence?

Self-attention lets each word attend to (or look at) every other word in the sequence, regardless of position.

✅ This helps the model:

Capture contextual meaning (e.g., "bank" in "river bank" vs "money bank")

Understand long-range dependencies

Process words in parallel, enabling transformer models like BERT and GPT

Example:
In the sentence “The dog that chased the cat barked,”
→ self-attention allows “barked” to focus on “dog” instead of “cat”.

Q5: Sentiment Analysis with HuggingFace Transformers

In [12]:
# ✅ Step 1: Install HuggingFace Transformers if needed
# Uncomment below line if using Google Colab
# !pip install transformers

from transformers import pipeline

# ✅ Step 2: Load pre-trained sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

# ✅ Step 3: Input sentence
sentence = "Despite the high price, the performance of the new MacBook is outstanding."

# ✅ Step 4: Run sentiment analysis
result = classifier(sentence)[0]

# ✅ Step 5: Print output
print("Sentiment:", result['label'])
print("Confidence Score:", round(result['score'], 4))


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Sentiment: POSITIVE
Confidence Score: 0.9998


1. What is the main architectural difference between BERT and GPT? Which uses an encoder and which uses a decoder?

The main architectural difference lies in how they use the Transformer architecture:

BERT (Bidirectional Encoder Representations from Transformers)
→ Uses only the encoder part of the Transformer
→ Processes text bidirectionally (looks both left and right)

GPT (Generative Pre-trained Transformer)
→ Uses only the decoder part of the Transformer
→ Processes text unidirectionally (left-to-right)

Summary:

BERT = Encoder (good for understanding tasks like classification, QA)

GPT = Decoder (good for generation tasks like text completion, summarization)
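
The bidirectional versus left-to-right distinction comes down to the attention mask. The sketch below shows this in NumPy under simplifying assumptions: `attention_weights` is a hypothetical helper, the scores are all zeros for a 4-token sequence, and a real model would compute scores from learned projections.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Row-wise softmax of a square score matrix, optionally with a causal mask."""
    scores = scores.astype(float)
    if causal:
        seq_len = scores.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # hypothetical uniform scores for a 4-token sequence

encoder_style = attention_weights(scores)               # every token sees all tokens
decoder_style = attention_weights(scores, causal=True)  # token i sees tokens 0..i

print(np.round(encoder_style, 2))
print(np.round(decoder_style, 2))
```

In the encoder-style matrix every row spreads weight over all four positions. In the decoder-style matrix the entries above the diagonal are zero, so each token can only use itself and earlier tokens, which is what makes left-to-right generation possible.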

2. Why is using pre-trained models (like BERT or GPT) beneficial for NLP applications instead of training from scratch?

Using pre-trained models is beneficial because:

✅ They are trained on massive datasets (e.g., Wikipedia, BooksCorpus), which gives them a strong understanding of language.

✅ Saves time and computational cost: pre-training from scratch requires billions of words of text (BERT was pre-trained on about 3.3 billion) and days of compute on powerful GPUs or TPUs.

✅ Can be fine-tuned on small datasets for specific tasks like sentiment analysis, NER, or summarization.

✅ Often achieve state-of-the-art accuracy in NLP tasks even with minimal labeled data.

📌 In short, pre-trained models let you stand on the shoulders of giants — reusing powerful language understanding without starting from zero.