# Homework: Word Embedding

In this exercise, you will work on the skip-gram neural network architecture for Word2Vec. You will be using Keras to train your model. 

You must complete the following tasks:
1. Read/clean text files
2. Indexing (Assign a number to each word)
3. Create skip-grams (inputs for your model)
4. Create the skip-gram neural network model
5. Visualization
6. Evaluation (Using pre-trained, not using pre-trained)
    (classify topic from 4 categories) 
    
This notebook assumes you have already installed Tensorflow and Keras with python3 and had GPU enabled. If you run this exercise on GCloud using the provided disk image you are all set.



In [14]:
# %tensorflow_version 2.x
%matplotlib inline
import numpy as np
import pandas as pd
import math
import glob
import re
import random
import collections
import os
import sys
import tensorflow as tf
from keras.preprocessing import sequence
from keras.models import Sequential, Model
from keras.layers import GRU, Dropout
from keras.models import load_model
from keras.layers import Embedding, Reshape, Activation, Input, Dense, Masking, Conv1D, Bidirectional
from tensorflow.python.keras.layers.merge import Dot
from tensorflow.python.keras.utils import np_utils
from tensorflow.python.keras.utils.data_utils import get_file
from tensorflow.python.keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import skipgrams
from keras.preprocessing import sequence
from keras import backend as K
from keras.optimizers import Adam

random.seed(42)

# Step 1: Read/clean text files

The given code can be used to processed the pre-tokenzied text file from the wikipedia corpus. In your homework, you must replace those text files with raw text files.  You must use your own tokenizer to process your text files

In [15]:
# from google.colab import drive
# drive.mount('/content/drive')

In [16]:
# import shutil
# shutil.copy("/content/drive/MyDrive/FRA 501 IntroNLP&DL/Dataset/wiki.zip","/content/wiki.zip")
# shutil.copy("/content/drive/MyDrive/FRA 501 IntroNLP&DL/Dataset/BEST-TrainingSet.zip","/content/BEST-TrainingSet.zip")

In [17]:
# !unzip wiki.zip
# !unzip BEST-TrainingSet.zip

In [18]:
#Step 1: read the wikipedia text file
with open("wiki/thwiki_chk.txt") as f:
    #the delimiter is one or more whitespace characters
    input_text = re.compile(r"\s+").split(f.read()) 
    #exclude an empty string from our input
    input_text = [word for word in input_text if word != ''] 

In [19]:
tokens = input_text
print(tokens[:10])
print("total word count:", len(tokens))

['หน้า', 'หลัก', 'วิกิพีเดีย', 'ดำเนินการ', 'โดย', 'มูลนิธิ', 'วิกิ', 'มีเดีย', 'องค์กร', 'ไม่']
total word count: 36349066


# Step 2: Indexing (Assign a number to each word)

The code below generates an indexed dataset(each word is represented by a number), a dictionary, a reversed dictionary

## <font color='purple'>Homework Question 1:</font>
<font color='purple'>“UNK” is often used to represent an unknown word (a word which does not exist in your dictionary/training set). You can also represent a rare word with this token as well.  How do you define a rare word in your program? Explain in your own words and capture the screenshot of your code segment that is a part of this process</font>

 + <font color='purple'>edit or replace create_index with your own code to set a threshold for rare words and replace them with "UNK"</font>

## <font color='red'>Answer</font>

1. Count number of UNK word (if word count <= min_thres_unk, then it is UNK) and at token 'UNK' into word_count dictionary which value is UNK count (need to remove 'UNK' in word_count dictionary because there are word wich spell UNK in input_text).
![]()

2. Sorted dictionary word_count by using its value.
![]()

3. Use dicionary to transform word in dataset into sequencs of uniuqe number for each word (for word which has frequency more than min_thres_unk will have their own uniuqe number. for word spell 'UNK' and others use 'UNK' number which is 9 in this dataset)
![]()

In [20]:
#step 2:Build dictionary and build a dataset(replace each word with its index)
from collections import defaultdict

def create_index(input_text, min_thres_unk = 1, max_word_count = None):
    # TODO#1 : edit or replace this function
    words = [word for word in input_text ]
    word_count = list()

    #word_count => list of count number of word in each unique words  [(word, count), ...]
    #use set and len to get the number of unique words
    word_count.extend(collections.Counter(words).most_common(len(set(words))))
    #include a token for unknown word
    # word_count.append(("UNK",0))
    #print out 10 most frequent words
    print(word_count[-10:])

    #thresold to token UNK
    count_unk = 0
    idx_unk = 0
    #loop for counting UNK Word wich frequency equal or less than thresold
    for i, pair in enumerate(word_count):
        if pair[1] <= min_thres_unk:
            count_unk += pair[1]
        if pair[0] == "UNK": #count "UNK" in original input text
            count_unk += pair[1]
            idx_unk = i

    pop_data = word_count.pop(idx_unk) #pop word 'UNK' which in sentence in input text
    word_count.append(("UNK", count_unk))

    print('UNK count:', count_unk)
    print('pop data:', pop_data)
    print('add new data:', word_count[-1])
    
    #sort dict word count by using value
    word_count = [pair for pair in word_count if pair[1] > min_thres_unk]
    word_count = sorted(word_count, key=lambda x: x[1], reverse=True)

    #Rank theshold frequency
    if max_word_count != None:
        word_count = word_count[:max_word_count]

    #dictionary => is dict consist of word and unique number for each word {("for_keras_zero_padding", 0), (word1, 1), ...}
    dictionary = dict()
    dictionary["for_keras_zero_padding"] = 0
    for word in word_count:
        dictionary[word[0]] = len(dictionary)

    #reverse_dictionary is just reverse dictionary : swap values and keys
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))

    #create data set sequences by dictionary
    data = list()
    for word in input_text:
        if word in dictionary.keys():
            data.append(dictionary[word])
        else: data.append(dictionary['UNK'])

    return data, dictionary, reverse_dictionary

# call method with min_thres_unk=1ß
dataset, dictionary, reverse_dictionary = create_index(tokens, 1)
print('Number of sequence in dataset:', len(dataset))
print('Number of word in dictionary:', len(dictionary))

[('ไคเซอร์วิลเฮ็ล์ม', 1), ('Jugen', 1), ('เมืองลอเรนซ์เคิร์ช', 1), ('รัป', 1), ('ค็อปเฟอร์มันน์', 1), ('กับค็อปเฟอร์มันน์', 1), ('เมลท์', 1), ('ลิเซลอตต์', 1), ('(ก.พ.', 1), ('ทักกีสำเร็จการ', 1)]
UNK count: 406196
pop data: ('UNK', 4)
add new data: ('UNK', 406196)
Number of sequence in dataset: 36349066
Number of word in dictionary: 295164


In [21]:
print("output sample (dataset):",dataset[:10])
print("output sample (dictionary):",{k: dictionary[k] for k in list(dictionary)[:13]})
print("output sample (reverse dictionary):",{k: reverse_dictionary[k] for k in list(reverse_dictionary)[:13]})

output sample (dataset): [230, 209, 2454, 574, 16, 1830, 7150, 3125, 682, 25]
output sample (dictionary): {'for_keras_zero_padding': 0, 'ที่': 1, 'ใน': 2, 'เป็น': 3, 'และ': 4, 'การ': 5, 'มี': 6, 'ของ': 7, 'ได้': 8, 'UNK': 9, ')': 10, '"': 11, 'จาก': 12}
output sample (reverse dictionary): {0: 'for_keras_zero_padding', 1: 'ที่', 2: 'ใน', 3: 'เป็น', 4: 'และ', 5: 'การ', 6: 'มี', 7: 'ของ', 8: 'ได้', 9: 'UNK', 10: ')', 11: '"', 12: 'จาก'}


# Step3: Create skip-grams (inputs for your model)
Keras has a skipgrams-generator, the cell below shows us how it generates skipgrams 

## <font color='blue'>Homework Question 2:</font>
<font color='blue'>The negative samples are sampled from sampling_table.  Look through Keras source code to find out how they sample negative samples. Discuss the sampling technique taught in class and compare it to the Keras source code.</font>



<font color='red'>Q2: PUT YOUR ANWSER HERE!!!</font>

Negative sample 
function softmax ใช้การคำนวณมาก

In [22]:
# Step 3: Create data samples
vocab_size = len(dictionary)
skip_window = 1       # How many words to consider left and right.

# TODO#2 check out keras source code and find out how their sampling technique works. Describe it in your own words.
sample_set= dataset[:10]
sampling_table = sequence.make_sampling_table(vocab_size)
couples, labels = skipgrams(sample_set, vocab_size, window_size=skip_window, sampling_table=sampling_table)
word_target, word_context = zip(*couples)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print(couples, labels)
print("number of couples:", len(couples))

for i in range(8):
    print(reverse_dictionary[couples[i][0]],reverse_dictionary[couples[i][1]])

[[209, 83708], [209, 2454], [2454, 574], [209, 235515], [25, 682], [2454, 209], [3125, 145853], [2454, 115575], [25, 3408], [209, 230], [3125, 682], [3125, 7150], [3125, 285707], [2454, 219950]] [0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
number of couples: 14
หลัก VPNs
หลัก วิกิพีเดีย
วิกิพีเดีย ดำเนินการ
หลัก LaVoie
ไม่ องค์กร
วิกิพีเดีย หลัก
มีเดีย -House
วิกิพีเดีย บวรราชสกุล


In [23]:
vocab_size

295164

In [24]:
sample_set

[230, 209, 2454, 574, 16, 1830, 7150, 3125, 682, 25]

In [25]:
sampling_table[:20]

array([0.00315225, 0.00315225, 0.00547597, 0.00741556, 0.00912817,
       0.01068435, 0.01212381, 0.01347162, 0.01474487, 0.0159558 ,
       0.0171136 , 0.01822533, 0.01929662, 0.02033198, 0.02133515,
       0.02230924, 0.02325687, 0.02418031, 0.02508148, 0.02596208])

# Step 4: create the skip-gram model
## <font color='blue'>Homework Question 3:</font>
 <font color='blue'>Q3:  In your own words, discuss why Sigmoid is chosen as the activation function in the  skip-gram model.</font>

<font color='red'>Q3: PUT YOUR ANSER HERE!!!</font>

In [26]:
#reference: https://github.com/nzw0301/keras-examples/blob/master/Skip-gram-with-NS.ipynb
dim_embedddings = 32
V= len(dictionary)

#step1: select the embedding of the target word from W
w_inputs = Input(shape=(1, ), dtype='int32')
w = Embedding(V+1, dim_embedddings)(w_inputs)

#step2: select the embedding of the context word from C
c_inputs = Input(shape=(1, ), dtype='int32')
c  = Embedding(V+1, dim_embedddings)(c_inputs)

#step3: compute the dot product:c_k*v_j
o = Dot(axes=2)([w, c])
o = Reshape((1,), input_shape=(1, 1))(o)

#step4: normailize dot products into probability
o = Activation('sigmoid')(o)
#TO DO#4 Question: Why sigmoid?

SkipGram = Model(inputs=[w_inputs, c_inputs], outputs=o)
SkipGram.summary()
opt=Adam(lr=0.01)
SkipGram.compile(loss='binary_crossentropy', optimizer=opt)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 input_1 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 embedding_1 (Embedding)        (None, 1, 32)        9445280     ['input_2[0][0]']                
                                                                                                  
 embedding (Embedding)          (None, 1, 32)        9445280     ['input_1[0][0]']                
                                                                                              

  super().__init__(name, **kwargs)


In [29]:
# you don't have to spend too much time training for your homework, you are allowed to do it on a smaller corpus
# currently the dataset is 1/20 of the full text file.
for _ in range(2):
    prev_i=0
    #it is likely that your GPU won't be able to handle large input
    #just do it 100000 words at a time
    for i in range(len(dataset)//100000):
        #generate skipgrams
        data, labels = skipgrams(sequence=dataset[prev_i*100000:(i*100000)+100000], vocabulary_size=V, window_size=2, negative_samples=4.)
        x = [np.array(x) for x in zip(*data)]
        y = np.array(labels, dtype=np.int32)
        if x:
            loss = SkipGram.train_on_batch(x, y)
        prev_i = i 
        print(loss,i*100000)

    SkipGram.save_weights('my_skipgram32_weights-hw.h5')


0.6915693879127502 0
0.690997302532196 100000
0.6901803016662598 200000
0.6889075636863708 300000
0.6874406933784485 400000
0.6856717467308044 500000
0.6837462782859802 600000
0.681596577167511 700000
0.6787207126617432 800000
0.6758802533149719 900000
0.672584593296051 1000000
0.668102502822876 1100000
0.6645346283912659 1200000
0.6588806509971619 1300000
0.653477668762207 1400000
0.6492652893066406 1500000
0.6420718431472778 1600000
0.6341586709022522 1700000
0.626152753829956 1800000
0.6190935969352722 1900000
0.6121807098388672 2000000
0.6014773845672607 2100000
0.5899662375450134 2200000
0.5803203582763672 2300000
0.5700560212135315 2400000
0.5602562427520752 2500000
0.5495749115943909 2600000
0.5413700342178345 2700000
0.531571626663208 2800000
0.5164780020713806 2900000
0.5011336803436279 3000000
0.48799949884414673 3100000
0.4764813780784607 3200000
0.46674808859825134 3300000
0.45204806327819824 3400000
0.43374186754226685 3500000
0.419826865196228 3600000
0.4061056971549988 3

In [None]:
SkipGram.save_weights('my_skipgram32_weights-hw.h5')

In [None]:
#Get weight of the embedding layer
final_embeddings=SkipGram.get_weights()[0]
print(final_embeddings)
print(final_embeddings.shape)

# Step 5: Intrinsic Evaluation: Word Vector Analogies
## <font color='blue'>Homework Question 4: </font>
<font color='blue'> Read section 2.1 and 2.3 in this [lecture note](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes02-wordvecs2.pdf). Come up with 10 semantic analogy examples and report results produced by your word embeddings. Discuss t-SNE in 2 dimensions. </font>


In [None]:
# TODO#4:Come up with 10 semantic analogy examples and report results produced by your word embeddings anf do t-SNE in 2 dimensions.
#and tell us what you observe 

# Step 6: Extrinsic Evaluation

## <font color='blue'>Homework Question5:</font>
<font color='blue'>
Use the word embeddings from the skip-gram model as pre-trained weights (GloVe and fastText) in a classification model. Compare the result the with the same classification model that does not use the pre-trained weights. 
</font>


In [None]:
all_news_filepath = glob.glob('BEST-TrainingSet/news/*.txt')
all_novel_filepath = glob.glob('BEST-TrainingSet/novel/*.txt')
all_article_filepath = glob.glob('BEST-TrainingSet/article/*.txt')
all_encyclopedia_filepath = glob.glob('BEST-TrainingSet/encyclopedia/*.txt')

In [None]:
#preparing data for the classificaiton model
#In your homework, we will only use the first 2000 words in each text file
#any text file that has less than 2000 words will be padded
#reason:just to make this homework feasible under limited time and resource
import keras
import tensorflow

max_length = 2000
def word_to_index(word):
    if word in dictionary:
        return dictionary[word]
    else:#if unknown
        return dictionary["UNK"]


def prep_data():
    input_text = list()
    for textfile_path in [all_news_filepath, all_novel_filepath, all_article_filepath, all_encyclopedia_filepath]:
        for input_file in textfile_path:
            f = open(input_file,"r") #open file with name of "*.txt"
            text = re.sub(r'\|', ' ', f.read()) # replace separation symbol with white space           
            text = re.sub(r'<\W?\w+>', '', text)# remove <NE> </NE> <AB> </AB> tags
            text = text.split() #split() method without an argument splits on whitespace 
            indexed_text = list(map(lambda x:word_to_index(x), text[:max_length])) #map raw word string to its index   
            if 'news' in input_file:
                input_text.append([indexed_text,0]) 
            elif 'novel' in input_file:
                input_text.append([indexed_text,1]) 
            elif 'article' in input_file:
                input_text.append([indexed_text,2]) 
            elif 'encyclopedia' in input_file:
                input_text.append([indexed_text,3]) 
            
            f.close()
    random.shuffle(input_text)
    return input_text

input_data = prep_data()
train_data = input_data[:int(len(input_data)*0.6)]
val_data = input_data[int(len(input_data)*0.6):int(len(input_data)*0.8)]
test_data = input_data[int(len(input_data)*0.8):]

train_input = [data[0] for data in train_data]
train_input = keras.utils.pad_sequences(train_input, maxlen=max_length) #padding
train_target = [data[1] for data in train_data]
train_target=to_categorical(train_target, num_classes=4)

val_input = [data[0] for data in val_data]
val_input = keras.utils.pad_sequences(val_input, maxlen=max_length) #padding
val_target = [data[1] for data in val_data]
val_target=to_categorical(val_target, num_classes=4)

test_input = [data[0] for data in test_data]
test_input = keras.utils.pad_sequences(test_input, maxlen=max_length) #padding
test_target = [data[1] for data in test_data]
test_target=to_categorical(test_target, num_classes=4)

del input_data, val_data,train_data, test_data

In [None]:
#the classification model
#TODO#5 find out how to initialize your embedding layer with pre-trained weights, evaluate and observe
#don't forget to compare it with the same model that does not use pre-trained weights
#you can use your own model too! and feel free to customize this model as you wish
# more information --> https://keras.io/examples/nlp/pretrained_word_embeddings/
# fastText --> https://fasttext.cc/docs/en/crawl-vectors.html (optional)
# !wget --no-check-certificate https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

cls_model = Sequential()
cls_model.add(Embedding(len(dictionary)+1, 32, input_length=max_length,mask_zero=True)) 
cls_model.add(GRU(32))
cls_model.add(Dropout(0.5))
cls_model.add(Dense(4, activation='softmax'))
opt=Adam(lr=0.01)
cls_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
cls_model.summary()
print('Train...')
cls_model.fit(train_input, train_target,
          epochs=10,
          validation_data=[val_input, val_target])

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 2000, 32)          9445280   
                                                                 
 gru_3 (GRU)                 (None, 32)                6336      
                                                                 
 dropout_3 (Dropout)         (None, 32)                0         
                                                                 
 dense_3 (Dense)             (None, 4)                 132       
                                                                 
Total params: 9,451,748
Trainable params: 9,451,748
Non-trainable params: 0
_________________________________________________________________
Train...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2611c9c5820>

In [None]:
results = cls_model.evaluate(test_input, test_target)
print("test loss, test acc:", results)