# NLP 3rd assignment - RNN and LSTM, Sentence Completion

### By: Idan Dunsky & Yaniv Kaveh-Shtul

# Imports

Import the necessary libraries and modules.

In [1]:
# !pip install --pre torch torchvision torchaudio -i https://download.pytorch.org/whl/nightly/cu118

In [2]:
import numpy as np
import tensorflow as tf
import nltk
import requests

from transformers import pipeline
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from bs4 import BeautifulSoup
from nltk.corpus import stopwords


# Creating Corpus

Download the NLTK resources (tokenizer models, stopword list, and WordNet).

In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))

file_encoding = 'latin-1'


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Idan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Idan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Idan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Web Scraping

We download the Wikipedia page on natural language processing and use it as our corpus.

In [4]:
# Specify the URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/Natural_language_processing'

# Send a GET request to fetch the raw HTML content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the main content text
    # Wikipedia's main content is typically within <div> tags with the 'mw-parser-output' class
    content_div = soup.find('div', class_='mw-parser-output')

    # Initialize an empty list to hold all text content
    all_text = []

    # Extract text from various elements
    for element in content_div.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'blockquote']):
        all_text.append(element.get_text())

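    # Keep the extracted text blocks as a list; each entry is one corpus line
    corpus = all_text
else:
    # Fail early so later cells do not hit an undefined `corpus`
    raise RuntimeError(f"Failed to fetch {url}: HTTP {response.status_code}")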

Stopwords are common words in a language that are often filtered out in natural language processing (NLP) tasks because they carry less meaningful information compared to other words. Examples include "and," "the," "is," and "in." Removing stopwords helps in reducing the dimensionality of text data and focusing on the more significant words that contribute to the meaning or topic of the text.

In the next code segment we remove the stopwords from the corpus.

In [5]:
# Function to remove stop words
def remove_stopwords(sentence):
    return ' '.join([word for word in sentence.split() if word not in stop_words])

# Remove stop words from the corpus
corpus = [remove_stopwords(sentence) for sentence in corpus]
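
Note that `remove_stopwords` compares each token with the lowercase NLTK list exactly as written, so capitalized or punctuation-attached tokens such as "The" or "and," are kept. A minimal sketch of a case-insensitive variant follows. The small `demo_stop_words` set and the name `remove_stopwords_ci` are illustrative assumptions and are not part of the pipeline used in this notebook.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word set (an assumption for this demo).
demo_stop_words = {"the", "is", "in", "and", "a", "of"}

def remove_stopwords_ci(sentence, stop_words=demo_stop_words):
    """Drop stop words, ignoring case and surrounding punctuation when matching."""
    kept = []
    for word in sentence.split():
        # Normalize only for the comparison; the original token is kept as-is.
        key = word.lower().strip(string.punctuation)
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)

print(remove_stopwords_ci("The cat is in the hat, and the dog"))
```

Switching to such a variant would change the corpus, so the training results reported below would no longer apply as-is.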

Tokenization is the process of breaking text into smaller units called tokens, which can be words, sentences, or individual characters depending on the granularity required. Here we use the Keras `Tokenizer` (not the nltk tokenizers) to map each word in the corpus to an integer index; `total_words` adds 1 because index 0 is reserved for padding. For every line we then build all prefixes of length two or more, left-pad them to a common length, and split each padded sequence into predictors (all tokens but the last) and a label (the last token), which is one-hot encoded.


In [6]:

# Tokenize the sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
# Convert sentences to sequences of integers
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences to the same length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# Split data into predictors and labels
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]

# Convert labels to one-hot encoded vectors
label = tf.keras.utils.to_categorical(label, num_classes=total_words)
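
To make the sequence construction concrete, here is a small pure-Python sketch of the same idea on a made-up line of four token IDs. The helper names `build_prefix_sequences` and `pad_pre` and the IDs themselves are assumptions for illustration; the notebook itself relies on `pad_sequences` from Keras.

```python
def build_prefix_sequences(token_ids):
    """Return every prefix of length >= 2, mirroring the n-gram loop above."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

def pad_pre(sequence, length, pad_value=0):
    """Left-pad a sequence with pad_value up to the given length."""
    return [pad_value] * (length - len(sequence)) + list(sequence)

token_ids = [4, 7, 2, 9]  # hypothetical IDs for a four-word line
prefixes = build_prefix_sequences(token_ids)
max_len = max(len(p) for p in prefixes)
padded = [pad_pre(p, max_len) for p in prefixes]

# The last ID of each row is the label; everything before it is the predictor.
predictors_demo = [row[:-1] for row in padded]
labels_demo = [row[-1] for row in padded]

for x, y in zip(predictors_demo, labels_demo):
    print(x, "->", y)
```

A line of n tokens therefore contributes n - 1 training examples, each asking the model to predict one word from the words before it.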

Now that our corpus is ready, let's play with it.

# Predict next word with RNN

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed for sequence data. Unlike traditional neural networks, RNNs have connections that form directed cycles, allowing them to maintain a hidden state that can capture information from previous time steps. This makes RNNs particularly effective for tasks involving sequential data, such as time series prediction, natural language processing, and speech recognition. The key feature of RNNs is their ability to use their internal state to process variable-length sequences of inputs, enabling them to model temporal dependencies and patterns in the data.
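
The recurrence described above can be sketched in a few lines of plain Python. This is a simplified illustration of an Elman-style step with a tanh activation, not the Keras implementation; the weight values and the names `rnn_step` and `run_rnn` are assumptions chosen for the demo.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_x, W_h, b):
    """Feed a sequence of input vectors through the cell, starting from a zero state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Hypothetical weights for a cell with 2 inputs and 2 hidden units.
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.4], [-0.3, 0.0]]
b = [0.0, 0.1]

short_seq = [[1.0, 0.0]]
long_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Sequences of different lengths yield a hidden state of the same size.
print(len(run_rnn(short_seq, W_x, W_h, b)), len(run_rnn(long_seq, W_x, W_h, b)))
```

The same weights are reused at every time step, which is what lets the cell handle inputs of any length.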

In [7]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

RNN_model = Sequential()
RNN_model.add(Embedding(total_words, 10))
RNN_model.add(SimpleRNN(100))
RNN_model.add(Dense(total_words, activation='softmax'))

RNN_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
RNN_model.summary()

RNN_model.fit(predictors, label, epochs=50, verbose=1)

Epoch 1/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.0029 - loss: 7.4040
Epoch 2/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0186 - loss: 7.1122
Epoch 3/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0247 - loss: 6.8909
Epoch 4/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0260 - loss: 6.8017
Epoch 5/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0272 - loss: 6.7033
Epoch 6/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0355 - loss: 6.4393
Epoch 7/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0418 - loss: 6.1851
Epoch 8/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.0571 - loss: 5.9125
Epoch 9/50
103/103 ━━━━━━━━

<keras.src.callbacks.history.History at 0x13e05ee73e0>

In [8]:
def predict_next_word(model, tokenizer, text, max_sequence_len):
    token_list = tokenizer.texts_to_sequences([text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted_probs = model.predict(token_list, verbose=0)[0]
    predicted_index = np.argmax(predicted_probs)
    predicted_word = tokenizer.index_word[predicted_index]
    return predicted_word, predicted_probs[predicted_index]

# Example usage
text = "Natural language processing systems developed"
predicted_word, RNN_probability = predict_next_word(RNN_model, tokenizer, text, max_sequence_len)
print(f"Next word: {predicted_word}, Probability: {RNN_probability}")

Next word: 2012, Probability: 0.48660901188850403
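
`predict_next_word` returns only the single most probable word. When inspecting a model it can help to look at several candidates, so here is a small sketch that ranks the top `k` entries of a probability vector. The function name `top_k_words` and the demo vocabulary are assumptions. It also skips index 0, which the Keras `Tokenizer` reserves for padding and which therefore has no entry in `index_word`.

```python
def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs.

    Index 0 is skipped: Keras reserves it for padding, so it has no entry
    in index_word.
    """
    ranked = sorted(range(1, len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_word[i], probs[i]) for i in ranked[:k]]

# Hypothetical vocabulary and probabilities, for illustration only.
demo_index_word = {1: "language", 2: "model", 3: "data"}
demo_probs = [0.05, 0.20, 0.60, 0.15]

print(top_k_words(demo_probs, demo_index_word, k=2))
```

With the notebook's objects, the same helper could be called on `model.predict(token_list, verbose=0)[0]` together with `tokenizer.index_word`.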


In [9]:
# Compute and print accuracy on the training data
loss, accuracy = RNN_model.evaluate(predictors, label, verbose=0)
print(f"Model accuracy on training data: {accuracy:.4f}")

Model accuracy on training data: 0.9440


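As observed, the Recurrent Neural Network (RNN) model achieved an accuracy of approximately `0.94` in next-word prediction. Note that this is measured on the training data, so it reflects how well the model fits the corpus rather than how well it generalizes.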
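
To estimate performance on unseen data, one option is to hold out part of the samples before training. The sketch below only produces shuffled index lists; the name `train_val_indices`, the 20% fraction, and the seed are assumptions for illustration, and no held-out evaluation was run in this notebook.

```python
import random

def train_val_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_indices(10, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

With the arrays from this notebook, `predictors[train_idx]` and `label[train_idx]` could be passed to `fit`, with `validation_data=(predictors[val_idx], label[val_idx])`. Because prefixes of the same line would land on both sides of such a split, splitting by corpus line before building the sequences would be a stricter test.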

# Predict next word with LSTM

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to effectively learn and remember over long sequences of data, addressing the limitations of traditional RNNs. LSTM units contain mechanisms called gates (input, forget, and output gates) that regulate the flow of information, allowing the network to maintain and update its cell state over long time periods. This makes LSTMs particularly effective for tasks involving sequential data, such as language modeling, speech recognition, and time series prediction, where maintaining long-term dependencies is crucial.
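
For reference, the gate mechanism described above is commonly written as follows. This is the standard textbook formulation rather than a description of Keras internals; the symbols are assumptions introduced here: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W$, $U$, $b$ the learned weights and biases of each gate.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive update of $c_t$ is what allows information to persist across many time steps.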

In [10]:
from tensorflow.keras.layers import LSTM

LSTM_model = Sequential()
LSTM_model.add(Embedding(total_words, 10))
LSTM_model.add(LSTM(100))
LSTM_model.add(Dense(total_words, activation='softmax'))

LSTM_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
LSTM_model.summary()

LSTM_model.fit(predictors, label, epochs=50, verbose=1)

Epoch 1/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - accuracy: 0.0109 - loss: 7.3482
Epoch 2/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 28ms/step - accuracy: 0.0266 - loss: 6.9482
Epoch 3/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - accuracy: 0.0246 - loss: 6.8606
Epoch 4/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - accuracy: 0.0229 - loss: 6.7174
Epoch 5/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - accuracy: 0.0197 - loss: 6.5932
Epoch 6/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - accuracy: 0.0240 - loss: 6.3771
Epoch 7/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 28ms/step - accuracy: 0.0338 - loss: 6.1189
Epoch 8/50
103/103 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - accuracy: 0.0327 - loss: 5.9715
Epoch 9/50
103/103

<keras.src.callbacks.history.History at 0x13e0e2817f0>

In [11]:
text = "Natural language processing systems developed"
predicted_word, LSTM_probability = predict_next_word(LSTM_model, tokenizer, text, max_sequence_len)
print(f"Next word: {predicted_word}, Probability: {LSTM_probability}")

Next word: complex, Probability: 0.138286754488945


In [12]:
# Compute and print accuracy on the training data
loss, accuracy = LSTM_model.evaluate(predictors, label, verbose=0)
print(f"Model accuracy on training data: {accuracy:.4f}")

Model accuracy on training data: 0.7602


As observed, the obtained accuracy is approximately `0.76`, which is significantly lower than the accuracy achieved with the RNN model after the same 50 epochs. Both figures are training accuracies.
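
One plausible contributor to this gap, offered here as an assumption rather than a tested conclusion, is model size: an LSTM layer has four weight blocks where a simple RNN has one, so it may need more than 50 epochs to fit the same data. The sketch below computes the recurrent-layer parameter counts for the configuration used above (embedding size 10, 100 units), assuming the usual formulas with bias terms. The function names are illustrative.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias for a simple RNN layer."""
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    """An LSTM has four such weight blocks (three gates plus the candidate)."""
    return 4 * units * (input_dim + units + 1)

rnn_count = simple_rnn_params(10, 100)
lstm_count = lstm_params(10, 100)
print(rnn_count, lstm_count)
```

These counts cover only the recurrent layer; the embedding and output layers are the same size in both models.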

# GPT-2

We will now utilize the pretrained GPT-2 model to perform sentence completion tasks.

In [13]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens({'pad_token': '<|pad|>', 'bos_token': '<|startoftext|>'})

2

In [14]:
# Set the model into evaluation mode
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [15]:
# Helper function that tokenizes the input, generates a completion, and decodes it, all in one
def generate_completion(model, tokenizer, text, max_length=15):
    inputs = tokenizer.encode(text, return_tensors="pt")
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(inputs, max_length=max_length, num_return_sequences=1, pad_token_id=tokenizer.pad_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

We will utilize GPT-2 to attempt the completion of these five partial sentences extracted from our corpus.

In [16]:
partial_sentences = [
    "The authors claimed that within",
    "Using almost no information about",
    "As a result, a great",
    "enormous amount of non-annotated",
    "statistical and neural networks, on"
]

## Completion

In [17]:
for partial_sentence in partial_sentences:
    completion = generate_completion(model, tokenizer, partial_sentence)
    print(f"Partial sentence ===> '{partial_sentence}'")
    print(f"Completion ===> '{completion}'\n")



Partial sentence ===> 'The authors claimed that within'
Completion ===> 'The authors claimed that within the first year of the study, the number of'

Partial sentence ===> 'Using almost no information about'
Completion ===> 'Using almost no information about the source of the information, the FBI has not'

Partial sentence ===> 'As a result, a great'
Completion ===> 'As a result, a great deal of the work that we do is done'

Partial sentence ===> 'enormous amount of non-annotated'
Completion ===> 'enormous amount of non-annotated data.

The data'

Partial sentence ===> 'statistical and neural networks, on'
Completion ===> 'statistical and neural networks, on the basis of the results of the previous'



# Sentiment Analysis 

In [35]:
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device='cuda')

# Perform sentiment analysis on the corpus
sentiments = [sentiment_analyzer(sentence)[0] for sentence in corpus]

# Calculate sentiment distribution
positive_count = sum(1 for sentiment in sentiments if sentiment['label'] == 'POSITIVE')
negative_count = sum(1 for sentiment in sentiments if sentiment['label'] == 'NEGATIVE')
# This SST-2 model only emits POSITIVE or NEGATIVE, so the remainder is always 0
neutral_count = len(corpus) - positive_count - negative_count

total_count = len(corpus)
positive_percentage = (positive_count / total_count) * 100
negative_percentage = (negative_count / total_count) * 100
neutral_percentage = (neutral_count / total_count) * 100

# Report the statistics of the sentiment distribution
print("\nSentiment distribution in the corpus:")
print(f"Positive: {positive_percentage:.2f}%")
print(f"Negative: {negative_percentage:.2f}%")
print(f"Neutral: {neutral_percentage:.2f}%")



Sentiment distribution in the corpus:
Positive: 62.83%
Negative: 37.17%
Neutral: 0.00%
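
The neutral share is 0.00% because the SST-2 checkpoint used here is a binary classifier and only emits `POSITIVE` or `NEGATIVE`. A more general way to tabulate whatever labels a classifier returns is sketched below with `collections.Counter`. The name `label_distribution` and the demo labels are assumptions for illustration.

```python
from collections import Counter

def label_distribution(labels):
    """Return the percentage of each label, keyed by label name."""
    counts = Counter(labels)
    total = len(labels)
    return {name: 100.0 * n / total for name, n in counts.items()}

# Hypothetical labels standing in for [s['label'] for s in sentiments].
demo_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
print(label_distribution(demo_labels))
```

This reports only the labels that actually occur, so it would not print a category the model cannot produce.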


# Summary

The steps that we have taken in this notebook are:

1. **Corpus Creation**: We created a corpus from a Wikipedia page on Natural Language Processing (NLP) using BeautifulSoup for web scraping.

2. **Data Processing**: We removed stopwords, tokenized the text with the Keras `Tokenizer`, and built padded n-gram sequences for training.

3. **Predictive Modeling**: We used Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks to predict the next word in a sequence.

4. **GPT-2 Application**: We employed the GPT-2 model to complete sentences from the processed corpus.

5. **Sentiment Analysis**: We classified each corpus entry as positive or negative with a pretrained DistilBERT sentiment model and reported the resulting distribution.
