#  Subwords and Embeddings

This notebook contains the core of the project. We will do three things:<br>

First, we will use the **subword-nmt** method (byte pair encoding, originally developed for neural machine translation) to segment text into **subword** units. <br>Then we will **train GloVe embeddings** on our dataset. <br>Finally, we will use **TensorBoard** to visualize the words transformed into vectors.

### Imports

In [5]:
!pip install subword-nmt

Collecting subword-nmt
  Downloading https://files.pythonhosted.org/packages/26/08/58267cb3ac00f5f895457777ed9e0d106dbb5e6388fa7923d8663b04b849/subword_nmt-0.3.6-py2.py3-none-any.whl
Installing collected packages: subword-nmt
Successfully installed subword-nmt-0.3.6


In [0]:

import sys, os
import collections
import nltk
import numpy as np
import pandas as pd
import re
import tempfile
import time
import subprocess


from google.colab import drive
from nltk.corpus import stopwords
from sklearn.externals import joblib
from subprocess import Popen, PIPE, check_call
from subword_nmt import apply_bpe
from subword_nmt.learn_joint_bpe_and_vocab import learn_joint_bpe_and_vocab


%reload_ext autoreload
%autoreload 2

pd.set_option('display.max_colwidth', -1)

### Colab setup

In [4]:
drive.mount('/content/gdrive')

NameError: ignored

In [0]:
cd ~/..


/


In [0]:
os.getcwd()

'/'

**Notice:** Paths that contain "My Drive" cause problems because of the whitespace. For now, Google Colab does not allow renaming the folder, e.g. to "MyDrive". One workaround is to create a symbolic link so that the whitespace in "My Drive" is avoided.

In [0]:
# Create a symbolic link to avoid issues with the whitespace in "My Drive"
!ln -s ~/../content/gdrive/"My Drive"/ /MyDrive


Remember to stay in the current directory '/' so that all cells execute successfully.

In [0]:
PROJECT_HOME_PATH = os.path.join('MyDrive', 'NmtPolishLanguage')
DATA_PATH = os.path.join(PROJECT_HOME_PATH, 'DATA')

In [0]:
os.path.exists(PROJECT_HOME_PATH)

True

### Load dataset

In [0]:
df_data = joblib.load(os.path.join(DATA_PATH, 'interim', 'hate_speach_mod.dat'))

In [0]:
df_data.head()

Unnamed: 0,text,text_mod,target
0,"Dla mnie faworytem do tytułu będzie Cracovia. Zobaczymy, czy typ się sprawdzi.",faworytem do tytulu cracovia zobaczymy typ sprawdzi,0
1,@anonymized_account @anonymized_account Brawo ty Daria kibic ma być na dobre i złe,anonymizedaccount anonymizedaccount brawo daria kibic dobre zle,0
2,"@anonymized_account @anonymized_account Super, polski premier składa kwiaty na grobach kolaborantów. Ale doczekaliśmy czasów.",anonymizedaccount anonymizedaccount super polski premier sklada kwiaty grobach kolaborantow doczekalismy czasow,0
3,@anonymized_account @anonymized_account Musi. Innej drogi nie mamy.,anonymizedaccount anonymizedaccount musi innej drogi nie mamy,0
4,"Odrzut natychmiastowy, kwaśna mina, mam problem",odrzut natychmiastowy kwasna mina problem,0


In [0]:
notes = df_data.text_mod
notes = list(notes)

print(f'Number of notes: {len(notes)}')

Number of notes: 10039


In [0]:
notes[:10]

['faworytem do tytulu cracovia zobaczymy typ sprawdzi',
 'anonymizedaccount anonymizedaccount brawo daria kibic dobre zle',
 'anonymizedaccount anonymizedaccount super polski premier sklada kwiaty grobach kolaborantow doczekalismy czasow',
 'anonymizedaccount anonymizedaccount musi innej drogi nie mamy',
 'odrzut natychmiastowy kwasna mina problem',
 'fajny xdd pamietam spoznilam pierwsze zajecia sporo kare kazal usiasc pierwszej lawce xd',
 'anonymizedaccount nie szczescia',
 'anonymizedaccount dawno kogos wrednego nie widzialam xd',
 'anonymizedaccount anonymizedaccount zaleglosci wazne wezwania do zaplaty klub nie wywiazal',
 'anonymizedaccount anonymizedaccount anonymizedaccount anonymizedaccount brudzinski jestes klamca marnym kutasem anonymizedaccount']

### Word frequency

In [0]:
words_lst = [word.split(' ') for word in notes]
words_lst_flatten = [item for sublist in words_lst for item in sublist]
words_lst_flatten = pd.Series(words_lst_flatten)
vocab_count = words_lst_flatten.value_counts()

In [0]:
vocab_count.head(10)

anonymizedaccount    13747
nie                  3097 
do                   945  
rt                   681  
od                   365  
tez                  336  
moze                 309  
chyba                251  
sa                   214  
wiem                 176  
dtype: int64

In [0]:
print(f'Number of unique words in corpus: {len(vocab_count)}')

Number of unique words in corpus: 21948


## Subword NMT

In this step we use the *subword-nmt* method (byte pair encoding) to generate subwords.

If you wish to dive into details of this approach, here is the original paper: https://arxiv.org/abs/1508.07909

Source code: https://github.com/rsennrich/subword-nmt
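
To give an intuition for what the library does, here is a minimal sketch of the merge-learning loop described in the paper: start from characters, then repeatedly merge the most frequent adjacent pair. It is a toy illustration, not the subword-nmt implementation; the names `get_pair_counts`, `merge_pair` and `learn_bpe` and the toy word counts are introduced here only for the example. Attaching `</w>` to the last character is an assumption chosen to mirror the merge log printed further down.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {space-separated symbols: frequency} dict."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    # The lookarounds make sure we only match whole symbols, not parts of them.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(word_counts, num_merges):
    """Return the list of merges learned from a {word: frequency} dict."""
    # Each word becomes a sequence of characters; the last one carries '</w>'.
    vocab = {
        ' '.join(list(word[:-1]) + [word[-1] + '</w>']): freq
        for word, freq in word_counts.items()
    }
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Hypothetical toy counts, not taken from the dataset.
toy_counts = {'nie': 5, 'niech': 2, 'dnie': 1}
print(learn_bpe(toy_counts, num_merges=2))
```

Frequent words such as `nie` end up as a single symbol after a few merges, while rare words stay split into smaller units.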



In [0]:
Path = collections.namedtuple('Path', ['name'])


class LearnJointBpeVocabArgs:
    """
    Helper for passing arguments to ``learn_joint_bpe_and_vocab``. This class
    needs to exist because ``learn_joint_bpe_and_vocab.py`` expects an
    argparse-style object whose file arguments expose a ``name`` attribute
    (hence the ``Path`` namedtuple above).
    """

    def __init__(self, input_, output, vocab, symbols=10000, min_frequency=5, verbose=False, separator='@@', total_symbols=5000):
        self.input = [Path(input_)]
        self.output = Path(output)
        self.vocab = [Path(vocab)]
        self.symbols = symbols
        self.min_frequency = min_frequency
        self.verbose = verbose
        self.separator = separator
        self.total_symbols = total_symbols

        
def generate_subwords_vocab(notes, output_codes_path, output_vocab_path, symbols=10000, 
                           min_frequency=5, verbose=False, separator='@@', total_symbols=5000):
  
    # temp file used here because learn_joint_bpe_and_vocab does not accept anything other than file
    with tempfile.NamedTemporaryFile('wt', encoding='utf-8', delete=False) as notes_temp_file:
        notes_temp_file.writelines(note + '\n' for note in notes)
        notes_temp_file_path = notes_temp_file.name
        
        
    # this function automatically saves the result
    learn_joint_bpe_and_vocab(LearnJointBpeVocabArgs(
        input_ = notes_temp_file_path,
        output = output_codes_path,
        vocab = output_vocab_path,
        verbose = verbose,
        symbols = symbols,
        min_frequency = min_frequency,
        separator = separator,
        total_symbols = total_symbols
    ))
  
  
    os.remove(notes_temp_file_path)
  

In [0]:
CODES_PATH = os.path.join(PROJECT_HOME_PATH, 'subwords', 'codes_15000_5_15000.txt')
VOCAB_PATH = os.path.join(PROJECT_HOME_PATH, 'subwords', 'vocab_15000_5_15000.txt')

In [0]:
generate_subwords_vocab(
    notes=notes,
    output_codes_path = CODES_PATH,
    output_vocab_path = VOCAB_PATH,
    verbose=True,
    symbols=15000,
    min_frequency=5,
    total_symbols=15000
)

Number of word-internal characters: 26
Number of word-final characters: 26
Reducing number of merge operations by 52
pair 0: z e -> ze (frequency 17152)
pair 1: a n -> an (frequency 16780)
pair 2: a c -> ac (frequency 16487)
pair 3: m i -> mi (frequency 16158)
pair 4: o n -> on (frequency 15745)
pair 5: c o -> co (frequency 14177)
pair 6: u n -> un (frequency 14158)
pair 7: ze d -> zed (frequency 13966)
pair 8: y mi -> ymi (frequency 13836)
pair 9: co un -> coun (frequency 13764)
pair 10: an on -> anon (frequency 13763)
pair 11: zed ac -> zedac (frequency 13762)
pair 12: zedac coun -> zedaccoun (frequency 13762)
pair 13: zedaccoun t</w> -> zedaccount</w> (frequency 13762)
pair 14: ymi zedaccount</w> -> ymizedaccount</w> (frequency 13762)
pair 15: anon ymizedaccount</w> -> anonymizedaccount</w> (frequency 13762)
pair 16: n i -> ni (frequency 7622)
pair 17: i e -> ie (frequency 5768)
pair 18: ni e</w> -> nie</w> (frequency 4319)
pair 19: p o -> po (frequency 4061)
pair 20: s t -> st (fre

Now we replace the words in our **notes** with the corresponding subwords and save them all as one text file called **corpus**. In the next step we use **corpus** to train GloVe embeddings.
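
In the segmented corpus every subword that does not end a word is marked with the `@@` separator, so the segmentation can be reversed when needed. Below is a minimal sketch of that reversal; `undo_bpe` is a helper name introduced here, and the segmented string is a made-up example rather than real output of the model.

```python
import re


def undo_bpe(line, separator='@@'):
    """Join subword units back into words by dropping the separator markers."""
    sep = re.escape(separator)
    # Remove "@@ " between subwords and a dangling "@@" at the end of the line.
    return re.sub(rf'({sep} )|({sep} ?$)', '', line)


# Hypothetical segmentation, for illustration only.
segmented = 'do@@ cze@@ kalismy czasow'
print(undo_bpe(segmented))
```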

In [0]:
CORPUS_OUTPUT_PATH = os.path.join(DATA_PATH, 'glove', 'corpus_15000_5_15000.txt')

In [0]:
# Minimum frequency a subword needs in order to be used.
vocab_threshold = 5

with open(VOCAB_PATH, encoding='utf-8') as vocab_file:
    vocab = apply_bpe.read_vocabulary(vocab_file, threshold=vocab_threshold)
        
with open(CODES_PATH, encoding='utf-8') as codes_file:
    bpe = apply_bpe.BPE(codes_file, vocab=vocab)
    
with open(CORPUS_OUTPUT_PATH, mode='x', encoding='utf-8') as output_file:
  output_file.writelines(bpe.process_line(note) + '\n' for note in notes)
  

## Glove embeddings

In [0]:
# If you don't need subwords for your problem, the two following cells
# generate the corpus and the vocab without them.

# Uncomment both cells and execute them.


In [0]:
"""
CORPUS_OUTPUT_PATH = os.path.join(DATA_PATH, 'glove', 'corpus.txt')

with open(CORPUS_OUTPUT_PATH, mode='w', encoding='utf-8') as output_file:
    output_file.writelines(note + '\n' for note in notes)

"""

In [0]:
"""
VOCAB_PATH = os.path.join(PROJECT_HOME_PATH, 'subwords', 'vocab.txt')

# Number of most common words to be taken into consideration.
vocab_threshold = 20000

words_lst = [word.split(' ') for word in notes]
words_lst_flatten = [item for sublist in words_lst for item in sublist]
words_lst_flatten = pd.Series(words_lst_flatten)
vocab = words_lst_flatten.value_counts()

vocab[:vocab_threshold].to_csv(VOCAB_PATH, sep=' ', index=True, header=False)

"""

#### Download GloVe repository

In [0]:
glove_repo_link = 'https://github.com/stanfordnlp/GloVe.git'
GLOVE_PATH = os.path.join(PROJECT_HOME_PATH, 'GloVe')


if not os.path.exists(GLOVE_PATH):
  print(f'Downloading GloVe project from the repository and placing under PROJECT_HOME_PATH: {glove_repo_link}')
  os.chdir(PROJECT_HOME_PATH)
  !git clone https://github.com/stanfordnlp/GloVe.git
  
else:
  print(f'GloVe project is already downloaded')

GloVe project is already downloaded


In [0]:
 cd ~/..


#### Execute make

In [0]:
# make compiles the GloVe C sources into the binaries used below
# (vocab_count, cooccur, shuffle, glove), placed in GloVe/build.

In [0]:
def execute_script(file_path):
    p = Popen(['./{}'.format(file_path)], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    output, err = p.communicate()
    rc = p.returncode
    return output, err

In [0]:
# make GloVe

if not os.path.exists(GLOVE_PATH):
    print(f'Please download GloVe project from the repository and place it under: {GLOVE_PATH}')
else:
    os.chdir(GLOVE_PATH)
    file_name = 'build_glove.sh'
    with open('./{}'.format(file_name), 'w') as file_handle:
        file_handle.write('#!/bin/bash\n')
        file_handle.write('cd {}\n'.format(GLOVE_PATH))
        file_handle.write('make\n')
    os.chmod(file_name, 0o777)
    print('Executing make')
    output, err = execute_script('./{}'.format(file_name))
    print('./{}'.format(file_name))
    os.remove(file_name)
 
    print('Finished')


Executing make
./build_glove.sh
Finished
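
Note that the cell above prints "Finished" even if `make` fails, because the output and the return code of the script are not inspected. A possible alternative, sketched under the assumption that calling `make` directly in `GLOVE_PATH` is enough, is to let `subprocess.run` raise on a non-zero exit code. `run_and_report` is a hypothetical helper name, and a stand-in command is used so the sketch runs anywhere.

```python
import subprocess
import sys


def run_and_report(command, cwd=None):
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    result = subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str
        check=True,               # fail loudly instead of silently
    )
    return result.stdout, result.stderr


# Stand-in command; for GloVe the call would presumably be
# run_and_report(['make'], cwd=GLOVE_PATH).
out, err = run_and_report([sys.executable, '-c', 'print("built")'])
print(out)
```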


In [0]:
cd ~/..

In [0]:
GLOVE_BIN_PATH = os.path.join(PROJECT_HOME_PATH, 'GloVe', 'build')

In [0]:
ls

[0m[01;34mbin[0m/      [01;34mdatalab[0m/  [01;34mhome[0m/   [01;34mlib64[0m/  [01;36mMyDrive[0m@  [01;34mroot[0m/  [01;34msrv[0m/    [30;42mtmp[0m/    [01;34mvar[0m/
[01;34mboot[0m/     [01;34mdev[0m/      [01;34mlib[0m/    [01;34mmedia[0m/  [01;34mopt[0m/      [01;34mrun[0m/   [01;34mswift[0m/  [01;34mtools[0m/
[01;34mcontent[0m/  [01;34metc[0m/      [01;34mlib32[0m/  [01;34mmnt[0m/    [01;34mproc[0m/     [01;34msbin[0m/  [01;34msys[0m/    [01;34musr[0m/


### Generate co-occurrence statistics

The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.

The core training code is separated from these preprocessing steps and can be executed  independently.
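
To make the idea of a co-occurrence matrix concrete, here is a toy pure-Python sketch that counts co-occurrences inside a sliding window. It assumes a symmetric window and weights each pair by the inverse of the distance between the two words, which we believe matches the defaults of the `cooccur` tool, but treat that as an assumption and check the tool's help output. `count_cooccurrences` is a name introduced here for illustration only.

```python
import collections


def count_cooccurrences(lines, window_size=15):
    """Accumulate symmetric, inverse-distance-weighted co-occurrence counts."""
    counts = collections.defaultdict(float)
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            # Look back at most window_size tokens from the current position.
            for j in range(max(0, i - window_size), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


toy_lines = ['nie mamy drogi', 'nie mamy']
toy_counts = count_cooccurrences(toy_lines, window_size=2)
print(toy_counts[('nie', 'mamy')], toy_counts[('nie', 'drogi')])
```

Only the non-zero entries are stored, which is also why the real tool can handle a large vocabulary.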

In [0]:
subwords_symbols = 15000
subwords_min_frequency = 5
glove_windows_size = 15
glove_iterations = 15
glove_vector_size = 50

In [0]:
subwords_params = str(subwords_symbols) + '_' + str(subwords_min_frequency)
cooccur_params = str(subwords_params) + '_' + str(glove_windows_size)
COOCCUR_PATH = os.path.join(DATA_PATH, 'glove', f'cooccurrence_{cooccur_params}')

In [0]:
def run_cooccur(vocab_path, corpus_path, cooccur_output_path, windows_size=15, verbose=False):
    
    if os.path.splitext(cooccur_output_path)[1]:
        raise ValueError(f'cooccur_output_path must not have any extension: {cooccur_output_path}')
    cooccur_output_path = cooccur_output_path + '.bin'
    cooccur_shuf_path = cooccur_output_path.replace('.bin', '.shuf.bin')

    check_call(
              os.path.join(GLOVE_BIN_PATH, 'cooccur') +
              f' -vocab-file {vocab_path} -window-size {windows_size} -verbose {2 if verbose else 0}'
              f' < {corpus_path} > {cooccur_output_path}',
              shell=True
              )


    check_call(
              os.path.join(GLOVE_BIN_PATH, 'shuffle') +
              f' -verbose {2 if verbose else 0} < {cooccur_output_path} > {cooccur_shuf_path}',
              shell=True
              )    


In [0]:
#! MyDrive/NmtPolishLanguage/GloVe/build/cooccur -vocab-file MyDrive/NmtPolishLanguage/subwords/vocab_15000_5_15000.txt -window-size 15 -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/corpus_15000_5_15000.txt > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin
#! MyDrive/NmtPolishLanguage/GloVe/build/shuffle -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.shuf.bin

# subprocess.run('MyDrive/NmtPolishLanguage/GloVe/build/cooccur -vocab-file MyDrive/NmtPolishLanguage/subwords/vocab_15000_5_15000.txt -window-size 15 -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/corpus_15000_5_15000.txt > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin', shell=True, check=True)
# subprocess.run('MyDrive/NmtPolishLanguage/GloVe/build/shuffle -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.shuf.bin',shell=True, check=True)


In [0]:
run_cooccur(VOCAB_PATH, CORPUS_OUTPUT_PATH, COOCCUR_PATH, windows_size=glove_windows_size)

### Train GloVe embeddings

In [0]:
glove_params = str(cooccur_params) + '_' + str(glove_iterations) + '_' + str(glove_vector_size)
VECTORS_PATH = os.path.join(DATA_PATH, 'glove', f'vectors_{glove_params}')

In [0]:
def run_glove(cooccur_path, vocab_path, vectors_output_path, vector_size=50, iterations=15,
             learning_rate=0.05, x_max=100, alpha=0.75, verbose=False, word2vec_format=False):
    if os.path.splitext(vectors_output_path)[1]:
        raise ValueError(f'vectors_output_path must not have any extension: {vectors_output_path}')
        
    cooccur_path = cooccur_path + '.shuf.bin'
    threads = os.cpu_count()
    
    # glove automatically adds extension
    vectors_output_path = vectors_output_path.replace('.txt.', '')
    check_call(
        os.path.join(GLOVE_BIN_PATH, 'glove') +
        f' -input-file {cooccur_path} -vocab-file {vocab_path} -write-header {int(word2vec_format)}'
        f' -vector-size {vector_size} -iter {iterations} -eta {learning_rate} -x-max {x_max}'
        f' -alpha {alpha} -threads {threads} -save-file {vectors_output_path}'
        f' -verbose {2 if verbose else 0}',
        shell=True
    )

In [0]:
run_glove(COOCCUR_PATH, VOCAB_PATH, VECTORS_PATH, vector_size=glove_vector_size)
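
The `x_max` and `alpha` arguments of `run_glove` correspond to the weighting function of the GloVe objective, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise, which limits the influence of very frequent pairs. A minimal sketch of that function with the default values used above (`glove_weight` is a name introduced here for illustration):

```python
def glove_weight(cooccurrence_count, x_max=100, alpha=0.75):
    """Weight of a co-occurrence count in the GloVe loss."""
    if cooccurrence_count < x_max:
        return (cooccurrence_count / x_max) ** alpha
    return 1.0


for count in (1, 50, 100, 500):
    print(count, glove_weight(count))
```

Counts at or above `x_max` all get the same weight of 1, so lowering `x_max` may be worth trying on a small corpus like ours.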

### Visualize embeddings in Tensorboard

In [0]:
import tensorflow as tf
from tensorboard import main as tb
from tensorflow.contrib.tensorboard.plugins import projector

In [0]:
VECTORS_FILE_PATH = VECTORS_PATH + '.txt'
words, embeddings = [], []

with open(VECTORS_FILE_PATH, encoding='utf-8') as input_file:
    for line in input_file:
        word, *embedding = line.split()
        words.append(word)
        embeddings.append(np.array(embedding, dtype=float))
embeddings = np.array(embeddings)

In [183]:
print(f'\tembeddings shape: {embeddings.shape}\n\twords len: {len(words)}' )

	embeddings shape: (6329, 50)
	words len: 6329
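
With `words` and `embeddings` loaded, a quick sanity check before opening TensorBoard is to look up nearest neighbours by cosine similarity. The sketch below is illustrative only: `nearest_neighbours` is a hypothetical helper, not part of the project, and the toy vectors are made up so that the cell runs on its own.

```python
import numpy as np


def nearest_neighbours(query, words, embeddings, k=5):
    """Return the k words most similar to `query` by cosine similarity."""
    index = words.index(query)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    similarities = unit @ unit[index]
    order = np.argsort(-similarities)  # most similar first
    return [(words[i], float(similarities[i])) for i in order if i != index][:k]


# Made-up vectors; with the real data, pass `words` and `embeddings` instead.
toy_words = ['a', 'b', 'c']
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest_neighbours('a', toy_words, toy_embeddings, k=2))
```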


In [0]:
TENSORBOARD_PATH = os.path.join(PROJECT_HOME_PATH, 'tensorboard')

In [0]:
def save_for_projector(vectors, tensorboard_output_path, metadata, name):
    metadata_output_path = os.path.join(tensorboard_output_path, 'metadata.tsv')
    model_checkpoint_path = os.path.join(tensorboard_output_path, 'model.ckpt')
    
    # if the metadata has more than one column, the first row must be a header row
    with open(metadata_output_path, 'w', encoding='utf-8') as output_file:
        output_file.writelines(word + '\n' for word in metadata)
        
    session = tf.InteractiveSession()
    with tf.device("/cpu:0"):
        vectors_var = tf.Variable(vectors, name=name, trainable=False)
        
    tf.global_variables_initializer().run()
    saver = tf.train.Saver()
    writer = tf.summary.FileWriter(tensorboard_output_path)
    config = projector.ProjectorConfig()
    config.model_checkpoint_path = model_checkpoint_path
    embedding = config.embeddings.add()
    embedding.tensor_name = vectors_var.name
    embedding.metadata_path = metadata_output_path
    projector.visualize_embeddings(writer, config)
    saver.save(session, model_checkpoint_path)
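
The function above writes a single-column metadata file, which must not have a header. If you want extra columns in the projector (for example word counts), the file needs a header row and tab-separated values. A minimal sketch, assuming a hypothetical helper `write_metadata` and using two counts from the frequency table earlier in the notebook:

```python
import csv
import io


def write_metadata(rows, header, output_file):
    """Write multi-column projector metadata as TSV with a header row."""
    writer = csv.writer(output_file, delimiter='\t', lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)


# An in-memory buffer stands in for the real metadata.tsv file.
buffer = io.StringIO()
write_metadata([('nie', 3097), ('do', 945)], ('word', 'count'), buffer)
print(buffer.getvalue())
```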

In [190]:
len(words)

6329

In [191]:
save_for_projector(embeddings, TENSORBOARD_PATH, words, name='embeddings')

Instructions for updating:
Colocations handled automatically by placer.


In [0]:
# Run tensorboard

# In terminal:
#$tensorboard --logdir=TENSORBOARD_PATH --port=6009

# If you run TensorBoard on a server, use port forwarding when logging in: ssh [login@server] -L [port]:localhost:[port]