# CS 195: Natural Language Processing
## Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F5_4_Embeddings.ipynb)


## References

Word2Vec Tutorial - The Skip-Gram Model by Chris McCormick: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Word2Vec - Negative Sampling made easy by Munesh Lakhey: https://medium.com/@mnshonco/word2vec-negative-sampling-made-easy-9a587cb4695f

Keras Embedding Layer: https://keras.io/api/layers/core_layers/embedding/

Keras Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [None]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Installing collected packages: dill, multiprocess, huggingface-hub, datasets
Successfully installed datasets-2.14.6

## Dataset for today

AG News dataset
* short news articles
* four classes: World, Sports, Business, Sci/Tech

https://huggingface.co/datasets/ag_news


In [None]:
from datasets import load_dataset
data = load_dataset("ag_news")

print(data["train"]["text"][0])

# 0 is World
# 1 is Sports
# 2 is Business
# 3 is Sci/Tech
print(data["train"]["label"][0])


Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


## Categorical classification example (with > 2 classes)

So far we've only seen binary classification examples
* these used `binary_crossentropy` loss and a sigmoid output unit

For more than two classes, use `categorical_crossentropy` loss and a softmax output layer with one unit per class
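
To make this concrete, here is a minimal NumPy sketch of what a softmax output and the categorical cross-entropy loss compute for a single example. The raw scores are made-up values, and the helpers `softmax` and `categorical_crossentropy` are written just for this illustration (Keras has its own implementations).

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; the resulting probabilities are unchanged
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def categorical_crossentropy(one_hot_target, probabilities):
    # only the probability assigned to the true class contributes to the loss
    return -np.sum(one_hot_target * np.log(probabilities))

# made-up raw scores for the four classes: World, Sports, Business, Sci/Tech
scores = np.array([1.0, 0.5, 3.0, 0.2])
probabilities = softmax(scores)

# the true class is Business (index 2), written as a one-hot vector
target = np.array([0.0, 0.0, 1.0, 0.0])
loss = categorical_crossentropy(target, probabilities)

print(probabilities)   # four probabilities that sum to 1
print(loss)            # smaller when the true class gets more probability
```

The loss shrinks toward 0 as the probability of the true class approaches 1, which is what training pushes the network toward.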

In [None]:
from datasets import load_dataset
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
import numpy as np


data = load_dataset("ag_news")
print("Here's an example of a sentence from the dataset:",data["train"]["text"][0])

# Prepare the tokenizer and fit on the training text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data["train"]["text"])

# Convert text to sequence of integers
train_sequences = tokenizer.texts_to_sequences(data["train"]["text"])
test_sequences = tokenizer.texts_to_sequences(data["test"]["text"])
print("Here's an example of a tokenized sentence converted into a sequence of integers:",train_sequences[0])

# Pad sequences to ensure uniform length; you can decide the max length based on your dataset's characteristics
max_length = 100  # This should be adjusted based on the dataset
train_encoding_array = pad_sequences(train_sequences, maxlen=max_length, padding='post')
test_encoding_array = pad_sequences(test_sequences, maxlen=max_length, padding='post')
print("Here's an example after it has been padded:",train_encoding_array[0])


# Convert labels to one-hot vectors
train_labels = data["train"]["label"]
test_labels = data["test"]["label"]
train_labels_array = to_categorical(train_labels, num_classes=4)
test_labels_array = to_categorical(test_labels, num_classes=4)
print("Here's an example of what the target label looks like:", train_labels_array[0])


#create a neural network architecture
model = Sequential()
model.add(Dense(20, input_dim=max_length, activation='relu'))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_encoding_array, train_labels_array, epochs=10, verbose=1)

loss, accuracy = model.evaluate(test_encoding_array, test_labels_array)
print(f"Test accuracy: {accuracy*100:.2f}%")

Here's an example of a sentence from the dataset: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
Here's an example of a tokenized sentence converted into a sequence of integers: [442, 441, 1681, 14528, 108, 64, 1, 850, 21, 21, 753, 8196, 442, 6640, 10231, 2927, 4, 5810, 25989, 40, 4049, 797, 332]
Here's an example after it has been padded: [  442   441  1681 14528   108    64     1   850    21    21   753  8196
   442  6640 10231  2927     4  5810 25989    40  4049   797   332     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0 

## Why does this do so badly?

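No matter how many epochs you run this for, it will not get any better

**The problem:** the integer encoding. The tokenizer's integers are arbitrary IDs (assigned by word frequency), but a `Dense` layer treats them as magnitudes, so word 442 looks almost the same as word 441 and very different from word 14528, even though the numbers say nothing about meaning.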
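
To see how little the raw IDs mean, here is a small sketch using indexes read off the tokenized example above (the sentence begins "Wall St. Bears Claw ..."). The helper `id_gap` is made up for this illustration.

```python
# indexes read off the tokenized example sentence above
word_ids = {"wall": 442, "st": 441, "bears": 1681, "claw": 14528}

def id_gap(word_a, word_b):
    # the numeric difference is the only "distance" visible in raw integer inputs
    return abs(word_ids[word_a] - word_ids[word_b])

print(id_gap("wall", "st"))      # 1
print(id_gap("bears", "claw"))   # 12847
```

Neither gap reflects how related the words are; it only reflects where each word landed in the tokenizer's index.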

So what do we do if we still want to feed the model *sequential* data, rather than just a bag of words?

## Word Embeddings

Don't just represent words with a single number - use a whole vector

<div>
   <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/embeddings.png?raw=1">
</div>

image source: https://stackoverflow.com/questions/46155868/keras-embedding-layer

In reality, you can't label the dimensions with exact meanings like "living being", "feline", etc.
* but you don't need to!

## How do we come up with these embeddings?

First, we'll come up with a "fake" learning problem and then extract information about words from the model

**Fake learning problem:** skip-grams, where the model predicts the words that appear around a given word

We can pick any window size - this one uses size 2 (2 words before and 2 after)


<div>
   <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/skip_gram_problem.png?raw=1">
</div>

image source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
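
The windowing idea can be sketched in a few lines of plain Python. This is only an illustration of how (word, context word) pairs come out of a window of size 2; the function name `skipgram_pairs` is made up here, and it does no negative sampling.

```python
def skipgram_pairs(words, window_size=2):
    pairs = []
    for i, center in enumerate(words):
        # the window covers up to window_size words on each side of the center word
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=2)

for center, context in pairs[:5]:
    print(center, "->", context)
```

Words near the start or end of the sentence get fewer pairs because their window is cut off.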

## One Hot Encoding

Before we can create a model for this, we need an initial numerical encoding of words

**One Hot Encoding** uses a vector with length equal to the size of the vocabulary
* 1 in the spot representing that word
* 0 in all others

|            | the | quick | brown | fox | jumps | over | lazy | dog |
|------------|-----|-----|-----|----|-----|-----|-----|-----|
| "fox" | 0   | 0   | 0   | 1  | 0   | 0   | 0   | 0   |
| "dog" | 0 | 0   | 0   | 0   | 0  | 0   | 0   | 1   |

We can use Keras' `to_categorical` for this

In [None]:
from keras.utils import to_categorical

#put a 1 at index 3 of 8
print( to_categorical(3, num_classes=8)  )

[0. 0. 0. 1. 0. 0. 0. 0.]


## Let's get started

Here are some toy sentences we can work with.

We'll use Keras' tokenizer and show the mapping of words to indexes it came up with

In [None]:
from keras.preprocessing.text import Tokenizer

# Sample data
sentences = [
    "I adopted some dogs from the animal shelter",
    "don't you know that dogs and cats both like scritches",
    "are cats or dogs your favorite animal",
    "I have heard that dogs can be obedient",
    "I have heard that cats can be independent",
    "sharks live in the ocean",
    "many birds fly to get around"
]

# Tokenize and create vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)


{'dogs': 1, 'i': 2, 'that': 3, 'cats': 4, 'the': 5, 'animal': 6, 'have': 7, 'heard': 8, 'can': 9, 'be': 10, 'adopted': 11, 'some': 12, 'from': 13, 'shelter': 14, "don't": 15, 'you': 16, 'know': 17, 'and': 18, 'both': 19, 'like': 20, 'scritches': 21, 'are': 22, 'or': 23, 'your': 24, 'favorite': 25, 'obedient': 26, 'independent': 27, 'sharks': 28, 'live': 29, 'in': 30, 'ocean': 31, 'many': 32, 'birds': 33, 'fly': 34, 'to': 35, 'get': 36, 'around': 37}


### So now we know all the words in our vocabulary

0 is reserved as a special index to represent "None" (for example, padding), so no word in the vocabulary is assigned to it

In [None]:
vocabulary_size = len(tokenizer.word_index) + 1 #we also have to account for 0
print(vocabulary_size)

38


### Now we tokenize the sentences and convert them to their indexes

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[2, 11, 12, 1, 13, 5, 6, 14], [15, 16, 17, 3, 1, 18, 4, 19, 20, 21], [22, 4, 23, 1, 24, 25, 6], [2, 7, 8, 3, 1, 9, 10, 26], [2, 7, 8, 3, 4, 9, 10, 27], [28, 29, 30, 5, 31], [32, 33, 34, 35, 36, 37]]


### Creating the skip grams

Keras has a nice function for this too.

In [None]:
from keras.preprocessing.sequence import skipgrams

sequence_skipgrams = skipgrams(sequences[0],vocabulary_size=vocabulary_size,window_size=3)
print(sequence_skipgrams)

([[11, 1], [12, 31], [5, 6], [2, 10], [5, 13], [1, 4], [5, 10], [6, 37], [13, 27], [12, 20], [5, 2], [1, 2], [14, 32], [1, 5], [11, 18], [5, 1], [13, 11], [12, 2], [12, 16], [12, 13], [14, 6], [13, 14], [12, 4], [11, 12], [12, 17], [6, 1], [6, 14], [13, 12], [12, 1], [14, 11], [1, 12], [13, 19], [1, 30], [6, 32], [1, 22], [11, 13], [5, 9], [13, 5], [2, 11], [5, 14], [2, 1], [5, 10], [1, 11], [6, 5], [11, 2], [11, 12], [11, 5], [14, 5], [13, 18], [1, 26], [2, 24], [1, 36], [2, 12], [5, 12], [6, 34], [1, 6], [1, 5], [6, 2], [2, 14], [13, 36], [11, 35], [14, 20], [13, 1], [12, 11], [6, 13], [12, 5], [13, 10], [13, 26], [13, 6], [1, 13], [5, 3], [14, 13]], [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1])


### Group Discussion

What does this output mean?

What are the different pairs of numbers?

What are all the 0s and 1s at the end?

The `skipgrams` function is doing something called **negative sampling**. Write your guess for what that means here.

**Negative sampling:**

Skipgram documentation: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/skipgrams
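
One way to start on these questions is to decode a few pairs back into words. The sketch below copies a handful of entries from the `tokenizer.word_index` output and the first four pairs and labels from the `skipgrams` output above, so it runs without Keras. The name `index_to_word` is made up for this example.

```python
# a few entries copied from the tokenizer.word_index output, flipped to index -> word
index_to_word = {1: "dogs", 2: "i", 5: "the", 6: "animal", 10: "be",
                 11: "adopted", 12: "some", 31: "ocean"}

# the first four pairs and labels copied from the skipgrams output
pairs = [[11, 1], [12, 31], [5, 6], [2, 10]]
labels = [1, 0, 1, 0]

decoded = [(index_to_word[a], index_to_word[b], label)
           for (a, b), label in zip(pairs, labels)]

for word, other, label in decoded:
    print(f"{word:>8} {other:<8} label={label}")
```

Compare the decoded pairs with the first sentence, "I adopted some dogs from the animal shelter": which pairs contain words that sit near each other in that sentence?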

### Group Exercise: Prepare data for the skip-gram model

Use `sequence_skipgrams` to prepare the data

Each training example should be the one-hot encoded word concatenated with the one-hot encoded context word

Example: if "brown" is index 3 and "fox" is index 4 (and vocab size is 8), then we have

"brown":`[0,0,0,1,0,0,0,0]`

"fox":`[0,0,0,0,1,0,0,0]`

training example input: `[0,0,0,1,0,0,0,0, 0,0,0,0,1,0,0,0]`

*Hint:* you will need to use the `to_categorical` function

Then, do it for all the skipgrams

In [None]:
# sequence_skipgrams is a tuple: (list of [word, context] index pairs, list of 0/1 labels)
pairs, labels = sequence_skipgrams

inputs_array = []
for word_idx, context_idx in pairs:
    # one-hot encode both words of the pair, then join them into one input vector
    first = to_categorical(word_idx, num_classes=vocabulary_size)
    second = to_categorical(context_idx, num_classes=vocabulary_size)
    inputs_array.append(np.concatenate([first, second]))

inputs_array = np.array(inputs_array)
print(inputs_array[0])

# one 0/1 label per pair
target_array = np.array(labels)
print(target_array)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
[1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0
 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 1 0 1]


## A simple model

Let's draw what this model looks like on the whiteboard

Then we'll train it with the data we prepared

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# Model
embedding_model = Sequential()
# we input a one-hot vector for the word and its context word, so vocabulary_size*2
embedding_model.add(Dense(50, input_dim=vocabulary_size*2, activation='relu'))
embedding_model.add(Dense(1, activation='sigmoid'))

embedding_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

embedding_model.fit(inputs_array, target_array, epochs=5000, verbose=0)


<keras.src.callbacks.History at 0x7f47ec231480>

## Where is the word embedding in all this?

The weights going from the word's one-hot input position to the 50 hidden layer nodes represent what the model learned about this word, so we'll use that row of the weight matrix as the embedding
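
Why a row of weights? Multiplying a one-hot vector by a weight matrix simply selects one row. The NumPy sketch below shows this with toy sizes and random weights standing in for the trained first layer (the sizes and values are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size = 8   # toy size for illustration, not the 38 used above
hidden_size = 3       # toy size, the model above uses 50

# stand-in for the first Dense layer's weights: one row per input position
weights = rng.normal(size=(vocabulary_size * 2, hidden_size))

word_index = 3
# concatenated input: the word's one-hot vector followed by an all-zero context half
one_hot_input = np.zeros(vocabulary_size * 2)
one_hot_input[word_index] = 1.0

# multiplying the one-hot input by the weights just picks out one row
hidden_contribution = one_hot_input @ weights
print(np.allclose(hidden_contribution, weights[word_index]))  # True
```

With the concatenated encoding used here, row `word_index` holds the weights for the word in the first position of a pair, while row `vocabulary_size + word_index` holds the weights for the same word when it appears as the context word.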

In [None]:
word = "dogs"
print("Word:", word)
word_index = tokenizer.word_index[word]
print("index:",word_index)
weights = embedding_model.layers[0].get_weights()[0]
word_embedding = weights[word_index]
print(f"The embedding for the word '{word}' is: {word_embedding.flatten()}")

Word: dogs
index: 1
The embedding for the word 'dogs' is: [ 2.12283000e-01  9.83469859e-02  7.61701539e-02 -6.49205625e-01
 -8.63002598e-01  1.94406599e-01  2.75113106e-01 -3.69346410e-01
  3.37518811e-01 -7.15716183e-01  1.30428210e-01  1.82840765e-01
  7.66289905e-02  2.88704336e-01  2.12408453e-01  6.27622008e-02
  1.12945236e-01 -5.11130691e-02 -3.05207133e-01  2.07711250e-01
  1.94762766e-01  1.62421718e-01  3.31495643e-01 -3.64795655e-01
  1.23716764e-01  1.49313480e-01  5.83440736e-02  1.90845057e-01
 -8.09678733e-01 -2.07483754e-01  2.70185113e-01 -1.22212195e+00
  7.17773065e-02  2.87975788e-01 -4.30566201e-04  3.51905882e-01
  6.25417009e-02  2.89752841e-01  1.58535704e-01  5.53404689e-02
  4.16187942e-01  2.62047172e-01  2.56710231e-01 -1.68569311e-02
  5.20816028e-01 -2.35952482e-01 -4.45004813e-02  2.61496127e-01
 -4.32497710e-01  1.53638929e-01]


### We could make this into a function

In [None]:
def get_embedding(word, embedding_model):
    # look up the word's index, then return the matching row of first-layer weights
    # (relies on the global `tokenizer` fitted on the toy sentences above)
    word_index = tokenizer.word_index[word]
    weights = embedding_model.layers[0].get_weights()[0]
    word_embedding = weights[word_index]
    return word_embedding

cats_embedding = get_embedding("cats", embedding_model)
dogs_embedding = get_embedding("dogs", embedding_model)
shelter_embedding = get_embedding("shelter", embedding_model)
sharks_embedding = get_embedding("sharks", embedding_model)
print(cats_embedding)
print(dogs_embedding)

# squared Euclidean distance between two embeddings: smaller means closer together
print( np.sum(np.square(cats_embedding-dogs_embedding)) )
print( np.sum(np.square(sharks_embedding-shelter_embedding)) )

[-0.10914464  0.17390142 -0.01029862  0.158516   -0.02062754  0.02293271
  0.14058737  0.04191242  0.11293747 -0.19899604 -0.17482455  0.18022014
  0.12450443  0.15460493  0.0625229  -0.11306057  0.1376953  -0.14687513
 -0.21631452 -0.03614095  0.19271927 -0.16892806 -0.12047939 -0.1218401
  0.02500333 -0.06849563  0.17632602 -0.07890664  0.09372477 -0.1228691
 -0.09057519 -0.19648427  0.06983532  0.07548179  0.16925766  0.02875933
  0.11240996  0.0753357   0.15577562  0.09267502  0.07587199 -0.04630536
 -0.11116095  0.02958173  0.0735891   0.06195076  0.20934044 -0.1900896
  0.20788597 -0.04554768]
[ 2.12283000e-01  9.83469859e-02  7.61701539e-02 -6.49205625e-01
 -8.63002598e-01  1.94406599e-01  2.75113106e-01 -3.69346410e-01
  3.37518811e-01 -7.15716183e-01  1.30428210e-01  1.82840765e-01
  7.66289905e-02  2.88704336e-01  2.12408453e-01  6.27622008e-02
  1.12945236e-01 -5.11130691e-02 -3.05207133e-01  2.07711250e-01
  1.94762766e-01  1.62421718e-01  3.31495643e-01 -3.64795655e-01
  1
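
Squared Euclidean distance is one way to compare embeddings. Cosine similarity is another common choice, since it compares direction and ignores vector length. Here is a minimal sketch with made-up 3-dimensional vectors (not the trained embeddings); `cosine_similarity` is a helper written for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    # near 1: same direction, near 0: orthogonal, near -1: opposite direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up toy vectors, only for illustration
vec_a = np.array([1.0, 2.0, 0.0])
vec_b = np.array([2.0, 4.0, 0.0])   # same direction as vec_a, twice as long
vec_c = np.array([0.0, 0.0, 3.0])   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))    # close to 1 despite the different lengths
print(cosine_similarity(vec_a, vec_c))    # 0 for orthogonal vectors
print(np.sum(np.square(vec_a - vec_b)))   # squared Euclidean distance, as used above
```

The same function could be applied to the arrays returned by `get_embedding`.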

## Applied Exploration

Create word embeddings for the AG News dataset.

Put all of the code into one cell so it isn't spread throughout the notebook.

Show some example word embeddings.

Describe your results and reflect on them:
* How many unique words does the dataset have?
* How many training epochs do you think are appropriate? Why?
* How could you go about figuring out if these embeddings are useful?



## The Keras Embedding Layer

Keras provides an `Embedding` layer that you can put at the beginning of your network which allows you to learn the embeddings as part of the main training process.

Let's see how it does when we include this in our example from the beginning of this notebook.
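
Conceptually, an `Embedding` layer is a trainable lookup table: each integer in the input sequence is swapped for a row of a weight matrix. The NumPy sketch below mimics only the forward pass with toy sizes and random weights (assumptions for illustration); it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_dim, max_length = 10, 4, 6   # toy sizes for illustration

# stand-in for the layer's trainable weights: one row (vector) per word index
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_dim))

# a batch of two padded integer sequences (0 is the padding index)
batch = np.array([[3, 7, 1, 0, 0, 0],
                  [2, 2, 9, 5, 4, 0]])

embedded = embedding_matrix[batch]             # look up one vector per token
flattened = embedded.reshape(len(batch), -1)   # what Flatten does before the Dense layer

print(embedded.shape)    # (2, 6, 4)
print(flattened.shape)   # (2, 24)
```

With `max_length = 100` and `output_dim=50`, the same steps give each example 100 * 50 = 5000 values going into the first `Dense` layer.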

In [None]:
from datasets import load_dataset
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
import numpy as np


data = load_dataset("ag_news")

# Prepare the tokenizer and fit on the training text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data["train"]["text"])
vocabulary_size = len(tokenizer.word_index) + 1

# Convert text to sequence of integers
train_sequences = tokenizer.texts_to_sequences(data["train"]["text"])
test_sequences = tokenizer.texts_to_sequences(data["test"]["text"])



# Pad sequences to ensure uniform length; you can decide the max length based on your dataset's characteristics
max_length = 100  # This should be adjusted based on the dataset
train_encoding_array = pad_sequences(train_sequences, maxlen=max_length, padding='post')
test_encoding_array = pad_sequences(test_sequences, maxlen=max_length, padding='post')


# Convert labels to one-hot vectors
train_labels = data["train"]["label"]
test_labels = data["test"]["label"]
train_labels_array = to_categorical(train_labels, num_classes=4)
test_labels_array = to_categorical(test_labels, num_classes=4)

#create a neural network architecture
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=max_length))
model.add(Flatten())
model.add(Dense(20, activation='relu'))  # input size (max_length * 50) is inferred from Flatten
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_encoding_array, train_labels_array, epochs=10, verbose=1)

loss, accuracy = model.evaluate(test_encoding_array, test_labels_array)
print(f"Test accuracy: {accuracy*100:.2f}%")


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 89.84%


### Creative Synthesis

Dimensionality of thought?
Creating a more efficient language?


Want it to:
- generate from a large corpus
- break each value up into a "word"
- propose new, never-before-seen words?
- find a set of modifiers that's efficient for expressing things
- translate from English to the ultimate best language