<div style="display: flex; align-items: flex-start;">
  <div>
      <h1>CS 195: Natural Language Processing</h1>
      <h2>Neural Language Modeling</h2>
      <br>
    <a href="https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_1_NeuralLanguageModeling.ipynb">
      <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
    </a>
  </div>
  <div style="margin-left: 20px;">
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/dalle_neural_net_viz.png?raw=1" width="500" style="display: block;">
  </div>
</div>


**Cover Illustration:** generated by DALL·E through the ChatGPT 4 interface, prompted for a visualization of the network used in the code below. *That's not what I meant.*

## Announcement

AI - English Faculty Candidate: Gabriel Ford

Meeting with students: Thursday at 3:30pm in Howard 309

Scholarly Presentation: Friday at 9:00am in Howard ???

## Reference

SLP: Neural Networks and Neural Language Models, Chapter 7 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/7.pdf



In [1]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Installing collected packages: dill, multiprocess, huggingface-hub, datasets
Successfully installed datasets-2.14.6

## Dataset for today

AG News dataset
* short news articles
* four classes: World, Sports, Business, Sci/Tech

https://huggingface.co/datasets/ag_news


In [2]:
from datasets import load_dataset
data = load_dataset("ag_news")

print(data["train"]["text"][0])

# 0 is World
# 1 is Sports
# 2 is Business
# 3 is Sci/Tech
print(data["train"]["label"][0])



Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


## Review: Text Classification with Keras

Last time, we saw
* how to do text classification when there are more than 2 classes
    - one-hot encoded output layer, one node per class, *softmax* output
    - categorical crossentropy loss
* embedding layer
    - pad sequences to all be same length
    - learn vector for each word representing word semantics
    
<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/neural_text_classification.png?raw=1">
</div>

image source: SLP Fig. 7.11, https://web.stanford.edu/~jurafsky/slp3/7.pdf
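
To make the softmax output and the categorical cross-entropy loss concrete, here is a minimal pure-Python sketch for a single example with four classes. The raw scores are made-up numbers, not output from any model in this notebook, and the helper names `softmax` and `categorical_crossentropy` are only illustrative (Keras computes both internally).

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the largest score first for numerical stability
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Loss for one example: -sum(target * log(probability))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)

# Made-up raw scores for the classes World, Sports, Business, Sci/Tech
scores = [1.0, 0.5, 3.0, 0.2]
probabilities = softmax(scores)

# One-hot target for class 2 (Business)
target = [0, 0, 1, 0]
loss = categorical_crossentropy(target, probabilities)

print([round(p, 3) for p in probabilities])
print("loss:", round(loss, 3))
```

The loss only looks at the probability assigned to the correct class, so it shrinks toward 0 as that probability approaches 1.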

*Pooling* aggregates all of the word embeddings into a single fixed-length vector, for example by taking the element-wise maximum or average.
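
As a rough illustration of what the pooling options do, the sketch below applies max pooling, average pooling, and flattening to a tiny made-up set of embeddings. It is plain Python rather than Keras, the function names are hypothetical, and the numbers are invented for the example.

```python
def global_max_pooling(embeddings):
    """Element-wise maximum across all word vectors."""
    return [max(column) for column in zip(*embeddings)]

def global_average_pooling(embeddings):
    """Element-wise mean across all word vectors."""
    return [sum(column) / len(column) for column in zip(*embeddings)]

def flatten(embeddings):
    """Concatenate all word vectors end to end."""
    return [value for vector in embeddings for value in vector]

# Made-up 3-dimensional embeddings for a 4-word text
embeddings = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.5],
    [-0.2, 0.7, 0.0],
    [0.3, -0.1, 0.8],
]

print("max:    ", global_max_pooling(embeddings))
print("average:", [round(v, 3) for v in global_average_pooling(embeddings)])
print("flatten length:", len(flatten(embeddings)))
```

Pooling produces a vector whose size equals the embedding dimension regardless of text length, while flattening produces one value per word per dimension.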

In [None]:
from datasets import load_dataset
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, GlobalMaxPooling1D, GlobalAveragePooling1D
import numpy as np


data = load_dataset("ag_news")

# Prepare the tokenizer and fit on the training text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data["train"]["text"])
vocabulary_size = len(tokenizer.word_index) + 1

# Convert text to sequence of integers
train_sequences = tokenizer.texts_to_sequences(data["train"]["text"])
test_sequences = tokenizer.texts_to_sequences(data["test"]["text"])

# Pad sequences to ensure uniform length; you can decide the max length based on your dataset's characteristics
max_length = 100  # This should be adjusted based on the dataset
train_encoding_array = pad_sequences(train_sequences, maxlen=max_length, padding='post')
test_encoding_array = pad_sequences(test_sequences, maxlen=max_length, padding='post')

# Convert labels to one-hot vectors
train_labels = data["train"]["label"]
test_labels = data["test"]["label"]
train_labels_array = to_categorical(train_labels, num_classes=4)
test_labels_array = to_categorical(test_labels, num_classes=4)

#create a neural network architecture
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=max_length))
model.add(Flatten())
#use one of these instead of Flatten() to try a pooling method
#model.add(GlobalMaxPooling1D())
#model.add(GlobalAveragePooling1D())
model.add(Dense(20, activation='relu'))  # input size is inferred from the previous layer
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_encoding_array, train_labels_array, epochs=10, verbose=1)

loss, accuracy = model.evaluate(test_encoding_array, test_labels_array)
print(f"Test accuracy: {accuracy*100:.2f}%")


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 89.91%


## Neural Language Modeling

**Neural Language Modeling:** predict next word(s) from previous ones - like what we did with Markov models

Like classification, but the output layer is a softmax over every word in the vocabulary, so each word is its own class

Often a first step before extending the model to do summarization, translation, dialog, and other NLP tasks

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/neural_language_modeling.png?raw=1">
</div>

image source: SLP Fig. 7.13, https://web.stanford.edu/~jurafsky/slp3/7.pdf

## A Neural Language Model in Keras

Let's start by sampling some data from our news dataset

Then split into a training and testing set

In [41]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
from keras.utils import to_categorical
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
import random

data = load_dataset("ag_news")

data_subset, _ = train_test_split(data["train"]["text"],train_size=500)
train_data, test_data = train_test_split(data_subset,train_size=0.8)

# Prepare the tokenizer and fit it on the whole sampled subset (train and test)
# so that words appearing only in the test texts still get an index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data_subset)
vocabulary_size = len(tokenizer.word_index) + 1
print("Vocabulary size:",vocabulary_size)

# Convert text to sequences of integers
train_texts = tokenizer.texts_to_sequences(train_data)




Vocabulary size: 5278
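
To see where the integer indices and the `+ 1` in `vocabulary_size` come from, here is a simplified stand-in for the tokenizer written in plain Python. It is only a sketch: the helper names are hypothetical, it splits on whitespace without stripping punctuation, and the corpus is made up. The key ideas it mirrors are that more frequent words get smaller indices, numbering starts at 1, and index 0 is left free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; index 0 is left unused for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer index; skip unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Tiny made-up corpus
corpus = ["the cat sat", "the dog sat", "the cat ran"]
word_index = build_word_index(corpus)
vocabulary_size = len(word_index) + 1  # +1 because index 0 is reserved

print(word_index)
print(texts_to_sequences(["the dog ran"], word_index))
print("Vocabulary size:", vocabulary_size)
```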


## Preparing training examples

We want to take the sequences like

In [33]:
print( train_texts[0] )

[372, 1761, 45, 268, 316, 1324, 32, 32, 372, 4092, 12, 4, 135, 1891, 3, 268, 316, 20, 61, 16, 917, 101, 65, 26, 4093, 45, 1, 1325, 277, 4094, 321, 70, 4095, 3, 4096]


and slide a window across to predict the next word

In [34]:
print("Use",train_texts[0][0:5],"to predict",train_texts[0][5])
print("Use",train_texts[0][1:6],"to predict",train_texts[0][6])
print("Use",train_texts[0][2:7],"to predict",train_texts[0][7])
print("Use",train_texts[0][3:8],"to predict",train_texts[0][8])
print("etc.")

Use [372, 1761, 45, 268, 316] to predict 1324
Use [1761, 45, 268, 316, 1324] to predict 32
Use [45, 268, 316, 1324, 32] to predict 32
Use [268, 316, 1324, 32, 32] to predict 372
etc.
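
The same sliding window is easier to read on words than on integer indices. The sketch below uses a made-up sentence and a hypothetical helper named `sliding_windows`; it is meant only to illustrate the idea, not to replace the notebook's own code.

```python
def sliding_windows(tokens, sequence_length):
    """Return (context, target) pairs; each context has `sequence_length` tokens."""
    return [(tokens[i - sequence_length:i], tokens[i])
            for i in range(sequence_length, len(tokens))]

# Made-up sentence, split on whitespace
words = "the government said that it would raise taxes".split()

pairs = sliding_windows(words, sequence_length=5)
for context, target in pairs:
    print(context, "->", target)

# A text with n tokens yields n - sequence_length examples
# (none at all if n <= sequence_length)
print(len(pairs), "examples from", len(words), "words")
```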


## Group Discussion

What data structures (lists, matrices, etc.) do we need to prepare to make this a classification problem?

## Preparing all of the examples

In [42]:
# Decide the sequence length
sequence_length = 5  # Length of the input sequence before predicting the next word

# Create the sequences
predictor_sequences = []
targets = []
for text in train_texts:
    for i in range(sequence_length, len(text)):
        # Take the sequence of tokens as input and the next token as target
        curr_target = text[i]
        curr_predictor_sequence = text[i-sequence_length:i]
        predictor_sequences.append(curr_predictor_sequence)
        targets.append(curr_target)


print("Number of sequences:",len(predictor_sequences))


print("First train text:",train_texts[0])
print("Example sequence 0:",predictor_sequences[0]," target:",targets[0])
print("Example sequence 1:",predictor_sequences[1]," target:",targets[1])
print("Example sequence 2:",predictor_sequences[2]," target:",targets[2])
print("Example sequence 3:",predictor_sequences[3]," target:",targets[3])
print("Example sequence 4:",predictor_sequences[4]," target:",targets[4])
print("Example sequence 5:",predictor_sequences[5]," target:",targets[5])


Number of sequences: 13857
First train text: [2310, 585, 49, 402, 447, 2310, 1729, 73, 17, 53, 12, 18, 1097, 1128, 5, 1, 49, 402, 4, 1, 1072, 34, 13, 1, 424, 4803, 4804, 3, 4805, 20, 23, 1983, 4806, 1342, 12, 22, 684, 88, 2311]
Example sequence 0: [2310, 585, 49, 402, 447]  target: 2310
Example sequence 1: [585, 49, 402, 447, 2310]  target: 1729
Example sequence 2: [49, 402, 447, 2310, 1729]  target: 73
Example sequence 3: [402, 447, 2310, 1729, 73]  target: 17
Example sequence 4: [447, 2310, 1729, 73, 17]  target: 53
Example sequence 5: [2310, 1729, 73, 17, 53]  target: 12


## Padding

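The sliding window above only emits full windows, so every predictor sequence already has exactly `sequence_length` tokens (texts too short to fill one window produce no examples). We pad anyway, just in case a shorter sequence slips in.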

In [43]:
# Pad sequences to ensure uniform length
predictor_sequences_padded = pad_sequences(predictor_sequences, maxlen=sequence_length, padding='pre')
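
For reference, here is a plain-Python sketch of what `padding='pre'` does to inputs of different lengths: short sequences get zeros added on the left, and sequences that are too long keep only their last `maxlen` items. The helper `pad_pre` is hypothetical and handles one sequence at a time, whereas `pad_sequences` works on a whole list and returns an array.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

sequence_length = 5

print(pad_pre([372, 1761], sequence_length))                            # too short
print(pad_pre([372, 1761, 45, 268, 316], sequence_length))              # exact length
print(pad_pre([372, 1761, 45, 268, 316, 1324, 32], sequence_length))    # too long
```

Padding on the left keeps the most recent words next to the position being predicted, which is useful later when a starter string for text generation is shorter than the window.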

## The target output

Since we're making this into a classification problem, the output layer needs to have one node for each word in the vocabulary.

Target values need to be transformed into a one-hot encoded vector

In [44]:

# Convert output to one-hot encoding
target_word_one_hot = to_categorical(targets, num_classes=vocabulary_size)

print("Predictors words 0:", predictor_sequences_padded[0])
print("target word 0:", targets[0])
print("target word 0 one hot encoded:", target_word_one_hot[0])

Predictors words 0: [2310  585   49  402  447]
target word 0: 2310
target word 0 one hot encoded: [0. 0. 0. ... 0. 0. 0.]
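
The printed vector is mostly hidden behind the `...`, so here is a small sketch of the same encoding with a toy vocabulary where the whole vector is visible. The helper `one_hot` and the toy sizes are assumptions made for illustration.

```python
def one_hot(index, num_classes):
    """Vector of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# Toy vocabulary of 8 entries (index 0 reserved for padding)
toy_vocabulary_size = 8
target = 5

encoded = one_hot(target, toy_vocabulary_size)
print(encoded)
print("position of the 1:", encoded.index(1.0))
```

In the real data the vector has one entry per vocabulary word, with the single 1 sitting at the target word's index.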


## Preparing the test set

We have to do all of those same things for the test set.

**Group Exercise:** Turn this into a function so that you can use it to prepare both the training and test sets.

In [48]:
def prep_data(texts, sequence_length, tokenizer):
    """Turn raw texts into padded predictor sequences and one-hot targets."""
    # Convert text to sequences of integers
    token_sequences = tokenizer.texts_to_sequences(texts)

    # Slide a window across each text
    predictor_sequences = []
    targets = []
    for text in token_sequences:
        for i in range(sequence_length, len(text)):
            # Take the sequence of tokens as input and the next token as target
            predictor_sequences.append(text[i-sequence_length:i])
            targets.append(text[i])

    # Pad sequences to ensure uniform length
    predictor_sequences_padded = pad_sequences(predictor_sequences, maxlen=sequence_length, padding='pre')

    # Convert targets to one-hot encoding; the vocabulary size comes from the tokenizer
    targets_one_hot = to_categorical(targets, num_classes=len(tokenizer.word_index) + 1)
    return predictor_sequences_padded, targets_one_hot

print("Raw Test Data:", test_data)
predictor_sequences_padded_test, target_word_one_hot_test = prep_data(test_data, sequence_length, tokenizer)
print("Processed Test Data: ", target_word_one_hot_test)


## Designing the Neural Network

We'll start with a simple network like the one we used for text classification.
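
Before building it, it can help to estimate how big this network is. The sketch below counts trainable parameters for an embedding of size 50, a flattened window of 5 words, a hidden layer of 100 nodes, and a softmax output over the vocabulary. The vocabulary size of 5278 is taken from the output printed earlier and will differ between runs; the helper `dense_params` is hypothetical.

```python
def dense_params(inputs, outputs):
    """Weights plus one bias per output node."""
    return inputs * outputs + outputs

# Values assumed from this notebook's run
vocabulary_size = 5278
embedding_dim = 50
sequence_length = 5
hidden_nodes = 100

embedding = vocabulary_size * embedding_dim    # one vector per word
flattened = sequence_length * embedding_dim    # size of the Flatten output
hidden = dense_params(flattened, hidden_nodes)
output = dense_params(hidden_nodes, vocabulary_size)

print("embedding:", embedding)
print("hidden:   ", hidden)
print("output:   ", output)
print("total:    ", embedding + hidden + output)
```

Most of the parameters sit in the embedding and output layers, both of which grow with the vocabulary, so the vocabulary size largely determines how big the model is.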

In [46]:

# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(Flatten())
model.add(Dense(100, activation="relu"))
model.add(Dense(vocabulary_size, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model - you can also pass in the test set
model.fit(predictor_sequences_padded, target_word_one_hot, epochs=10, verbose=1, validation_data=(predictor_sequences_padded_test, target_word_one_hot_test))

# The model can now be used to predict the next word in a sequence


## Testing the final model on the test set

In [10]:
loss, accuracy = model.evaluate(predictor_sequences_padded_test, target_word_one_hot_test)
print(f"Test accuracy: {accuracy*100:.2f}%")

Test accuracy: 6.11%


## Text Generation

We can use this model to successively generate words based on previous ones - like our Markov sequence on steroids.

Let's see how this works

In [11]:
starter_string = "the government said that it"
tokens_list = tokenizer.texts_to_sequences([starter_string])
print(tokens_list)

tokens_array = np.array(tokens_list)
print(tokens_array)

[[1, 152, 18, 13, 24]]
[[  1 152  18  13  24]]


The model predicts a probability for each possible word in the vocabulary

In [12]:
predicted_probabilities = model.predict(tokens_array,verbose=0)
print(predicted_probabilities)
print("We get a probability for each of the",len(predicted_probabilities[0]),"words")

[[7.8775934e-23 1.5505616e-04 1.6108165e-04 ... 2.4824672e-11
  4.7401978e-18 1.9716550e-12]]
We get a probability for each of the 5302 words


Then we can pick the word with the highest probability

In [25]:
predicted_index = np.argmax(predicted_probabilities)
print("word index:",predicted_index)
predicted_word = tokenizer.index_word[predicted_index]
print("word:",predicted_word)

word index: 818
word: maker


Or you could generate a random word according to these probabilities (like we did with Markov text generation)
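
A minimal sketch of the random alternative, using only the standard library: instead of always taking the most probable index, draw an index at random with each one weighted by its predicted probability. The distribution here is made up, and `sample_index` is a hypothetical helper, not part of Keras.

```python
import random

def sample_index(probabilities, rng=random):
    """Pick an index at random, weighted by its probability."""
    indices = range(len(probabilities))
    # random.choices normalizes the weights, so they need not sum to exactly 1
    return rng.choices(indices, weights=probabilities, k=1)[0]

# Made-up distribution over a 5-word toy vocabulary (index 0 is padding)
probabilities = [0.0, 0.1, 0.6, 0.05, 0.25]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_index(probabilities, rng) for _ in range(1000)]

for index in range(len(probabilities)):
    print(index, samples.count(index))
```

With the model's output, something like `sample_index(predicted_probabilities[0].tolist())` could stand in for `np.argmax`. Sampling tends to avoid the repeated loops that always picking the top word can fall into, at the cost of occasionally choosing unlikely words.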

### putting it all together

In [28]:
starter_string = "the government said that was"
tokens_list = tokenizer.texts_to_sequences([starter_string])
tokens = tokens_list[0]

for i in range(50):
    curr_seq = tokens[-sequence_length:]
    curr_array = np.array([curr_seq])
    predicted_probabilities = model.predict(curr_array,verbose=0)
    predicted_index = np.argmax(predicted_probabilities)
    predicted_word = tokenizer.index_word[predicted_index]
    print(predicted_word+" ",end="")
    tokens.append(predicted_index)


pay near than the growth 39 s victory all ticker despite three deal it as minutes israeli wednesday announced up lt three the this 39 s victory drugs its on minister new as surprise to two fullquote 500 that the growth in the growth in the driven of the world's 

## Applied Exploration

Pick another dataset and get it working with this code
* you will likely need to prepare the text a little differently - do you need to first break it into sentences?
* describe your dataset and what you did to prepare it

Perform a neural language modeling experiment
* experiment with something to try to find a well-performing model
    * sliding window size
    * number of hidden nodes in the network
    * learning rate
* describe what you did and write up an interpretation of your results
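
One way to organize such an experiment is a small grid search over the settings listed above. The sketch below is only a scaffold under stated assumptions: `run_experiments` and `placeholder_score` are hypothetical names, and the placeholder scorer does not train anything. You would replace it with your own function that prepares the data for the given window size, builds and fits the model, and returns a score such as validation accuracy.

```python
from itertools import product

def run_experiments(train_and_score, window_sizes, hidden_sizes, learning_rates):
    """Try every combination of settings and return results sorted best-first."""
    results = []
    for window, hidden, rate in product(window_sizes, hidden_sizes, learning_rates):
        score = train_and_score(window, hidden, rate)
        results.append({"window": window, "hidden": hidden,
                        "learning_rate": rate, "score": score})
    return sorted(results, key=lambda r: r["score"], reverse=True)

# Placeholder scorer so the sketch runs on its own. It does NOT train anything;
# replace it with a function that builds, fits, and evaluates your model.
def placeholder_score(window, hidden, rate):
    return 0.0

results = run_experiments(placeholder_score,
                          window_sizes=[3, 5, 8],
                          hidden_sizes=[50, 100],
                          learning_rates=[0.001, 0.01])

print(len(results), "configurations")
print(results[0])
```

Keeping every configuration and its score in one list makes it easier to write up which settings were tried and how they compared.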

In [None]:
from keras import optimizers

# An example of changing the learning rate: compile with a configured optimizer, then fit as before
optimizer = optimizers.Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy',optimizer=optimizer, metrics=["accuracy"])

