# **Autoencoder Model for Word Embedding**

# I. Import necessary things

In [None]:
import tensorflow as tf
from tensorflow import keras
import re
import nltk
import numpy as np
from scipy.spatial import distance

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# II. Load corpus and preprocess it

In [None]:
corpus = """President Biden represented Delaware for 36 years in the US Senate before becoming the 47th Vice President of the United States. As President, Biden will restore America’s leadership and build our communities back better.
Joseph Robinette Biden, Jr. was born in Scranton, Pennsylvania, the first of four children of Catherine Eugenia Finnegan Biden and Joseph Robinette Biden, Sr. In 1953, the Biden family moved to Claymont, Delaware. President Biden graduated from the University of Delaware and Syracuse Law School and served on the New Castle County Council.
At age 29, President Biden became one of the youngest people ever elected to the United States Senate. Just weeks after his Senate election, tragedy struck the Biden family when his wife Neilia and daughter Naomi were killed, and sons Hunter and Beau were critically injured, in an auto accident.
Biden was sworn into the U.S. Senate at his sons’ hospital bedsides and began commuting from Wilmington to Washington every day, first by car, and then by train, in order to be with his family. He would continue to do so throughout his time in the Senate. 
Biden married Jill Jacobs in 1977, and in 1980, their family was complete with the birth of Ashley Blazer Biden. A lifelong educator, Jill earned her doctorate in education and returned to teaching as an English professor at a community college in Virginia.
Beau Biden, Attorney General of Delaware and Joe Biden’s eldest son, passed away in 2015 after battling brain cancer with the same integrity, courage, and strength he demonstrated every day of his life. Beau’s fight with cancer inspires the mission of President Biden’s life — ending cancer as we know it.
As a Senator from Delaware for 36 years, President Biden established himself as a leader in facing some of our nation’s most important domestic and international challenges. As Chairman or Ranking Member of the Senate Judiciary Committee for 16 years, Biden is widely recognized for his work writing and spearheading the Violence Against Women Act  — the landmark legislation that strengthens penalties for violence against women, creates unprecedented resources for survivors of assault, and changes the national dialogue on domestic and sexual assault.
As Chairman or Ranking Member of the Senate Foreign Relations Committee for 12 years, Biden played a pivotal role in shaping U.S. foreign policy. He was at the forefront of issues and legislation related to terrorism, weapons of mass destruction, post-Cold War Europe, the Middle East, Southwest Asia, and ending apartheid.
As Vice President, Biden continued his leadership on important issues facing the nation and represented our country abroad. Vice President Biden convened sessions of the President’s Cabinet, led interagency efforts, and worked with Congress in his fight to raise the living standards of middle-class Americans, reduce gun violence, address violence against women, and end cancer as we know it.
Biden helped President Obama pass and then oversaw the implementation of the Recovery Act — the biggest economic recovery plan in the history of the nation and our biggest and strongest commitment to clean energy. The President’s plan prevented another Great Depression, created and saved millions of jobs, and led to 75 uninterrupted months of job growth by the end of the administration. And Biden did it all with less than 1% in waste, abuse, or fraud — the most efficient government program in our country’s history.
President Obama and Vice President Biden also secured the passage of the Affordable Care Act, which reduced the number of uninsured Americans by 20 million by the time they left office and banned insurance companies from denying coverage due to pre-existing conditions.
He served as the point person for U.S. diplomacy throughout the Western Hemisphere, strengthened relationships with our allies both in Europe and the Asia-Pacific, and led the effort to bring 150,000 troops home from Iraq.
In a ceremony at the White House, President Obama awarded Biden the Presidential Medal of Freedom with Distinction — the nation’s highest civilian honor.
After leaving the White House, the Bidens continued their efforts to expand opportunity for every American with the creation of the Biden Foundation, the Biden Cancer Initiative, the Penn Biden Center for Diplomacy and Global Engagement, and the Biden Institute at the University of Delaware.
On April 25, 2019, Biden announced his candidacy for President of the United States. Biden’s candidacy was built from the beginning around 3 pillars: the battle for the soul of our nation, the need to rebuild our middle class — the backbone of our country, and a call for unity, to act as One America. It was a message that would only gain more resonance in 2020 as we confront a pandemic, an economic crisis, urgent calls for racial justice, and the existential threat of climate change.
"""

In [None]:
print(corpus)

President Biden represented Delaware for 36 years in the US Senate before becoming the 47th Vice President of the United States. As President, Biden will restore America’s leadership and build our communities back better.
Joseph Robinette Biden, Jr. was born in Scranton, Pennsylvania, the first of four children of Catherine Eugenia Finnegan Biden and Joseph Robinette Biden, Sr. In 1953, the Biden family moved to Claymont, Delaware. President Biden graduated from the University of Delaware and Syracuse Law School and served on the New Castle County Council.
At age 29, President Biden became one of the youngest people ever elected to the United States Senate. Just weeks after his Senate election, tragedy struck the Biden family when his wife Neilia and daughter Naomi were killed, and sons Hunter and Beau were critically injured, in an auto accident.
Biden was sworn into the U.S. Senate at his sons’ hospital bedsides and began commuting from Wilmington to Washington every day, first by ca

In [None]:
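def text_cleaner(text):
    # lower-case the text
    text = text.lower()
    # drop possessive 's (straight or typographic apostrophe, as used in the corpus)
    text = re.sub(r"['’]s\b", "", text)
    # replace every non-letter character (punctuation, digits) with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    return text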
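
To see what the cleaner does to a sentence, the sketch below applies the same three steps to a short line adapted from the corpus. The helper name `clean_demo` is hypothetical; it matches both the straight and the typographic apostrophe, since the corpus text uses the latter.

```python
import re

def clean_demo(text):
    # Same steps as text_cleaner: lower-case, drop possessive 's,
    # then replace every non-letter character with a space.
    text = text.lower()
    text = re.sub(r"['’]s\b", "", text)
    return re.sub("[^a-z]", " ", text)

sample = "Beau’s fight with cancer, in 2015."
cleaned = clean_demo(sample)
print(cleaned.split())
```

Digits are removed along with punctuation, which is why "47th" in the corpus later shows up as the token `th`.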

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess_text(corpus):
  # clean the text, tokenize it, then drop English stop words
  corpus_clean = text_cleaner(corpus)

  stop_words = set(stopwords.words('english'))
  word_tokens = word_tokenize(corpus_clean)

  return [w for w in word_tokens if w not in stop_words]

corpus_token = preprocess_text(corpus)
print(corpus_token)

['president', 'biden', 'represented', 'delaware', 'years', 'us', 'senate', 'becoming', 'th', 'vice', 'president', 'united', 'states', 'president', 'biden', 'restore', 'america', 'leadership', 'build', 'communities', 'back', 'better', 'joseph', 'robinette', 'biden', 'jr', 'born', 'scranton', 'pennsylvania', 'first', 'four', 'children', 'catherine', 'eugenia', 'finnegan', 'biden', 'joseph', 'robinette', 'biden', 'sr', 'biden', 'family', 'moved', 'claymont', 'delaware', 'president', 'biden', 'graduated', 'university', 'delaware', 'syracuse', 'law', 'school', 'served', 'new', 'castle', 'county', 'council', 'age', 'president', 'biden', 'became', 'one', 'youngest', 'people', 'ever', 'elected', 'united', 'states', 'senate', 'weeks', 'senate', 'election', 'tragedy', 'struck', 'biden', 'family', 'wife', 'neilia', 'daughter', 'naomi', 'killed', 'sons', 'hunter', 'beau', 'critically', 'injured', 'auto', 'accident', 'biden', 'sworn', 'u', 'senate', 'sons', 'hospital', 'bedsides', 'began', 'commuti

In [None]:
dictionary = sorted(set(corpus_token))
mapping = {word: index for index, word in enumerate(dictionary)}
print(dictionary)

['abroad', 'abuse', 'accident', 'act', 'address', 'administration', 'affordable', 'age', 'allies', 'also', 'america', 'american', 'americans', 'announced', 'another', 'apartheid', 'april', 'around', 'ashley', 'asia', 'assault', 'attorney', 'auto', 'awarded', 'away', 'back', 'backbone', 'banned', 'battle', 'battling', 'beau', 'became', 'becoming', 'bedsides', 'began', 'beginning', 'better', 'biden', 'bidens', 'biggest', 'birth', 'blazer', 'born', 'brain', 'bring', 'build', 'built', 'cabinet', 'call', 'calls', 'cancer', 'candidacy', 'car', 'care', 'castle', 'catherine', 'center', 'ceremony', 'chairman', 'challenges', 'change', 'changes', 'children', 'civilian', 'class', 'claymont', 'clean', 'climate', 'cold', 'college', 'commitment', 'committee', 'communities', 'community', 'commuting', 'companies', 'complete', 'conditions', 'confront', 'congress', 'continue', 'continued', 'convened', 'council', 'country', 'county', 'courage', 'coverage', 'created', 'creates', 'creation', 'crisis', 'crit
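
As a small illustration of the vocabulary step, the sketch below builds the same kind of word-to-index mapping from the first few tokens of the corpus. The names `word_to_id` and `id_to_word` are made up for this example; the inverse lookup is not part of the original notebook and is only shown because it is handy when inspecting model outputs.

```python
tokens = ["president", "biden", "represented", "delaware", "president", "biden"]

vocab = sorted(set(tokens))                          # unique words in alphabetical order
word_to_id = {w: i for i, w in enumerate(vocab)}     # plays the role of `mapping`
id_to_word = {i: w for w, i in word_to_id.items()}   # hypothetical inverse lookup

encoded = [word_to_id[w] for w in tokens]
print(vocab)
print(encoded)
```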

# III. Convert corpus to one-hot vectors & Define embedding dim

In [None]:
embedding_dim = 20
vocab_size = len(dictionary)
print(vocab_size)

322


In [None]:
corpus_encode = [mapping[x] for x in corpus_token]
onehot_corpus = keras.utils.to_categorical(corpus_encode, num_classes=vocab_size)
print(onehot_corpus)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
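
The printed matrix looks like all zeros only because each row holds a single 1 among 322 columns and the middle of every row is elided. A minimal pure-Python sketch of the same encoding (the helper `one_hot` is hypothetical and stands in for `to_categorical` on a toy vocabulary of 4 words):

```python
def one_hot(indices, num_classes):
    # One row per index, with a single 1.0 at that index.
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

matrix = one_hot([2, 0, 3], num_classes=4)
for row in matrix:
    print(row)
```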


# IV. Define Autoencoder model

In [None]:
ae_model = keras.Sequential()
ae_model.add(keras.Input(shape=(vocab_size,)))
ae_model.add(keras.layers.Dense(embedding_dim, activation='relu'))
ae_model.add(keras.layers.Dense(vocab_size, activation='softmax'))

ae_model.compile(optimizer='adam', loss='categorical_crossentropy')

ae_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 20)                6460      
_________________________________________________________________
dense_1 (Dense)              (None, 322)               6762      
Total params: 13,222
Trainable params: 13,222
Non-trainable params: 0
_________________________________________________________________
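
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output. A short sketch (the helper name `dense_params` is made up for this check):

```python
def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

vocab_size, embedding_dim = 322, 20

encoder_params = dense_params(vocab_size, embedding_dim)   # first Dense layer
decoder_params = dense_params(embedding_dim, vocab_size)   # second Dense layer
total_params = encoder_params + decoder_params

print(encoder_params, decoder_params, total_params)
```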


# V. Train AE model

In [None]:
ae_model.fit(x=onehot_corpus, y=onehot_corpus, batch_size=32, epochs=1000)
ae_model.save("ae_model.h5")

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
...

# **How to use trained model**

# VI. Load trained model & Get output of 1st FC layer

In [None]:
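reconstructed_model = keras.models.load_model("ae_model.h5")
# The first Dense layer is the encoder; select it by position so the code does not depend on the auto-generated layer name ('dense')
we_model = keras.models.Model(inputs=reconstructed_model.inputs, outputs=reconstructed_model.layers[0].output)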

Define a function that one-hot encodes a list of words

In [None]:
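def encode_onehot(mapping, list_words):
  output = []

  for word in list_words:
    word_vector = [0 for _ in range(vocab_size)]

    # words missing from the vocabulary keep an all-zero vector
    if word in mapping:
      word_index = mapping[word]
      word_vector[word_index] = 1

    output.append(word_vector)

  # return one 2-D array so Keras treats it as a single batch of inputs
  return np.array(output, dtype="float32")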
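
One behaviour worth noting is what happens to a word that is not in the vocabulary: it is not rejected, it simply gets a vector of zeros. The sketch below shows this on a toy mapping; `encode_onehot_demo` and `toy_mapping` are hypothetical names, and the vector length is taken from the mapping rather than from the global `vocab_size`.

```python
def encode_onehot_demo(mapping, words):
    vectors = []
    for word in words:
        vec = [0] * len(mapping)
        if word in mapping:          # unknown words stay all-zero
            vec[mapping[word]] = 1
        vectors.append(vec)
    return vectors

toy_mapping = {"biden": 0, "delaware": 1, "president": 2}
vectors = encode_onehot_demo(toy_mapping, ["president", "obama"])
print(vectors)
```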

# VII. Test model

In [None]:
input_sentence = "Joe Biden is the most remarkable President of American"

preprocess_sentence = preprocess_text(input_sentence)
onehot_sentence = encode_onehot(mapping, preprocess_sentence)

embedded_sentence = we_model.predict(onehot_sentence)
print(embedded_sentence)

[[2.0053334e+00 5.1371992e-01 2.2566528e+00 2.1551943e+00 1.9029834e+00
  2.1072187e+00 2.1747961e+00 1.7129862e+00 2.3267765e+00 2.1576881e-05
  5.4421163e-01 1.7988737e+00 1.0395503e-01 2.2385924e+00 2.5971813e+00
  1.3055776e+00 0.0000000e+00 2.0863397e+00 1.9921701e+00 1.3828278e-05]
 [2.0406063e+00 2.0557201e+00 0.0000000e+00 0.0000000e+00 2.0035925e+00
  1.9182003e+00 0.0000000e+00 3.1561041e-01 1.9562075e+00 0.0000000e+00
  6.1267555e-02 1.9691452e+00 2.0380962e+00 1.9881845e+00 1.9928606e+00
  1.7761350e-02 1.8381437e+00 0.0000000e+00 0.0000000e+00 8.3208084e-05]
 [1.1867527e+00 1.1664162e+00 1.0411265e+00 1.0105231e+00 1.1206063e+00
  1.1076965e+00 1.0481298e+00 1.0385220e+00 1.1045996e+00 1.0470943e+00
  1.0287194e+00 1.1747605e+00 1.1412333e+00 1.0980086e+00 1.1137074e+00
  1.1674483e+00 1.0353850e+00 9.8835534e-01 9.9193412e-01 1.0151087e+00]
 [1.7765069e+00 4.9233437e-05 2.8252602e-05 1.4586878e-01 1.9355775e+00
  7.5221062e-05 1.9189520e+00 3.6903620e-02 1.9803149e+00 1.9
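
Each row above is the output of the first Dense layer, i.e. `relu(x · W + b)`. For a one-hot input this reduces to one row of `W` plus the bias, and for an all-zero input (a word outside the vocabulary) it reduces to `relu(b)` alone. The sketch below illustrates this with made-up toy weights; `dense_relu`, `W` and `b` are hypothetical and are not the trained values.

```python
def dense_relu(x, weights, bias):
    # weights[i][j] connects input i to hidden unit j
    out = []
    for j in range(len(bias)):
        z = bias[j] + sum(x[i] * weights[i][j] for i in range(len(x)))
        out.append(max(0.0, z))      # ReLU
    return out

W = [[0.5, -2.0],
     [1.5, 0.25],
     [-1.0, 1.0]]
b = [1.0, 0.5]

known = dense_relu([0, 1, 0], W, b)     # row 1 of W plus the bias
unknown = dense_relu([0, 0, 0], W, b)   # bias only
print(known, unknown)
```

Under this reading, every out-of-vocabulary word in a test sentence would map to the same embedding, so it is better to filter such words out before comparing distances.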

In [None]:
word_0 = "Biden"
word_1 = "President"
word_2 = "American"

preprocess_words = preprocess_text(' '.join([word_0, word_1, word_2]))
onehot_words = encode_onehot(mapping, preprocess_words)

word_0_eb, word_1_eb, word_2_eb = we_model.predict(onehot_words)

print("OUTPUT EMBEDDING")
print(word_0_eb)
print(word_1_eb)
print(word_2_eb)

dst_0_1 = distance.euclidean(word_0_eb, word_1_eb)
dst_1_2 = distance.euclidean(word_1_eb, word_2_eb)
dst_0_2 = distance.euclidean(word_0_eb, word_2_eb)



print("OUTPUT DISTANCE")
print("0 vs 1: ", dst_0_1)
print("1 vs 2: ", dst_1_2)
print("0 vs 2: ", dst_0_2)

OUTPUT EMBEDDING
[2.0406063e+00 2.0557201e+00 0.0000000e+00 0.0000000e+00 2.0035925e+00
 1.9182003e+00 0.0000000e+00 3.1561041e-01 1.9562075e+00 0.0000000e+00
 6.1267555e-02 1.9691452e+00 2.0380962e+00 1.9881845e+00 1.9928606e+00
 1.7761350e-02 1.8381437e+00 0.0000000e+00 0.0000000e+00 8.3208084e-05]
[1.7765069e+00 4.9233437e-05 2.8252602e-05 1.4586878e-01 1.9355775e+00
 7.5221062e-05 1.9189520e+00 3.6903620e-02 1.9803149e+00 1.9547096e+00
 1.9841278e+00 1.7180223e+00 2.2914963e+00 2.3072863e+00 2.1604791e+00
 9.7393990e-05 2.2235661e+00 1.9239359e+00 0.0000000e+00 1.9434297e+00]
[2.2424121e+00 3.0469060e-02 1.5836087e+00 0.0000000e+00 5.2452087e-06
 6.1060345e-01 2.3691382e+00 2.2468843e+00 1.7541101e+00 2.2264832e-01
 2.4107976e+00 1.2147101e+00 1.8974285e+00 2.1949997e+00 2.5987625e-05
 2.2531798e+00 2.3310676e+00 1.9789790e+00 5.7436585e-02 4.8447657e-01]
OUTPUT DISTANCE
0 vs 1:  5.211884498596191
1 vs 2:  5.243825435638428
0 vs 2:  6.428466320037842
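
The first distance can be reproduced directly from the two embeddings printed above, without SciPy. The sketch also computes a cosine similarity, which is not used in the original notebook and is shown only as a complementary measure that ignores vector length; the helper names are made up for this example.

```python
import math

# Embeddings copied from the printed output above
biden = [2.0406063e+00, 2.0557201e+00, 0.0, 0.0, 2.0035925e+00,
         1.9182003e+00, 0.0, 3.1561041e-01, 1.9562075e+00, 0.0,
         6.1267555e-02, 1.9691452e+00, 2.0380962e+00, 1.9881845e+00, 1.9928606e+00,
         1.7761350e-02, 1.8381437e+00, 0.0, 0.0, 8.3208084e-05]
president = [1.7765069e+00, 4.9233437e-05, 2.8252602e-05, 1.4586878e-01, 1.9355775e+00,
             7.5221062e-05, 1.9189520e+00, 3.6903620e-02, 1.9803149e+00, 1.9547096e+00,
             1.9841278e+00, 1.7180223e+00, 2.2914963e+00, 2.3072863e+00, 2.1604791e+00,
             9.7393990e-05, 2.2235661e+00, 1.9239359e+00, 0.0, 1.9434297e+00]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

d = euclidean(biden, president)
cos = cosine_similarity(biden, president)
print(round(d, 2))   # matches the "0 vs 1" distance reported above
print(cos)
```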


# VIII. Evaluate the result

According to the result above, the distance between **Biden** and **President** is 5.21, smaller than both the President - American distance (5.24) and the Biden - American distance (6.43).

This result looks plausible: in the first sentence of the corpus, Biden and President are next to each other and belong to the same sentence, whereas American is far from both. Note, however, that the margin between 5.21 and 5.24 is small, and that the autoencoder is trained to reconstruct each word from its own one-hot vector, so it never sees neighbouring words. The agreement with word positions in the corpus may therefore be coincidental rather than something the model has learned.

**Corpus:** **President** **Biden** represented Delaware for 36 years in the US Senate before becoming the 47th Vice President of the United States...