# Subwords and Embeddings

In this notebook we first experiment with the following method:<br>
**Unsupervised Word Segmentation into Subword Units.** <br>
Then we train **GloVe embeddings on our dataset.** <br>
Finally we use **TensorBoard** to visualize the words transformed into vectors.

### Imports

In [1]:
!pip install subword-nmt

Collecting subword-nmt
  Downloading https://files.pythonhosted.org/packages/26/08/58267cb3ac00f5f895457777ed9e0d106dbb5e6388fa7923d8663b04b849/subword_nmt-0.3.6-py2.py3-none-any.whl
Installing collected packages: subword-nmt
Successfully installed subword-nmt-0.3.6


In [0]:

import sys, os
import collections
import nltk
import numpy as np
import pandas as pd
import re
import tempfile
import time
import subprocess


from google.colab import drive
from nltk.corpus import stopwords
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from subprocess import Popen, PIPE, check_call
from subword_nmt import apply_bpe
from subword_nmt.learn_joint_bpe_and_vocab import learn_joint_bpe_and_vocab


%reload_ext autoreload
%autoreload 2

pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in newer pandas

### Colab setup

In [13]:
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [14]:
cd ~/..


/


In [15]:
os.getcwd()

'/'

**Note:** Paths that contain the "My Drive" folder are awkward to use because of the whitespace in the name, and at the time of writing Google Colab does not allow renaming the folder (e.g. to "MyDrive"). One workaround is to create a symbolic link whose name contains no whitespace.

In [16]:
# Create a symbolic link to avoid issues with whitespace in "My Drive"
!ln -s ~/../content/gdrive/"My Drive"/ /MyDrive


ln: failed to create symbolic link '/MyDrive/My Drive': Function not implemented


Stay in the root directory '/' for the rest of the notebook: the paths defined below are relative to it.
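
The whitespace matters mostly because later cells build shell command strings from these paths. A minimal sketch of the problem, using a hypothetical path and the standard-library `shlex` module (neither is used elsewhere in this notebook): an unquoted path is split into two arguments, while a quoted one stays intact.

```python
import shlex

# Hypothetical path, used only to illustrate the problem.
path = '/content/gdrive/My Drive/NmtPolishLanguage'

# A shell splits an unquoted command string on whitespace.
unquoted = shlex.split(f'ls {path}')
# Quoting the path keeps it as a single argument.
quoted = shlex.split(f'ls {shlex.quote(path)}')

print(unquoted)  # ['ls', '/content/gdrive/My', 'Drive/NmtPolishLanguage']
print(quoted)    # ['ls', '/content/gdrive/My Drive/NmtPolishLanguage']
```

The symbolic link sidesteps the issue entirely, which is why the rest of the notebook can interpolate paths into commands without quoting.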

In [0]:
PROJECT_HOME_PATH = os.path.join('MyDrive', 'NmtPolishLanguage')
DATA_PATH = os.path.join(PROJECT_HOME_PATH, 'DATA')

In [18]:
os.path.exists(PROJECT_HOME_PATH)

True

In [19]:
os.path.exists(DATA_PATH)

True

### Load dataset

In [0]:
df_data = joblib.load(os.path.join(DATA_PATH, 'interim', 'hate_speech_mod.dat'))

In [21]:
df_data.tail()

Unnamed: 0,text,text_mod,target
10036,@anonymized_account Ty zagrasz? Nie wiedziałem 😉,zagrasz nie wiedzialem,0
10037,"@anonymized_account @anonymized_account A VAR nie miał poprawić jakości sędziowania, tylko efekt końcowy - mniej wypaczonych wyników, mniej skandali.",var nie mial poprawic jakosci sedziowania efekt koncowy mniej wypaczonych wynikow mniej skandali,0
10038,"@anonymized_account @anonymized_account Szanowany, bo kolega ładnie go pożegnał ?",szanowany kolega ladnie pozegnal,0
10039,@anonymized_account @anonymized_account @anonymized_account A kto inny ma się bić? Każdy zwyciezca ligi wojewódzkiej gra w barazach.,bic zwyciezca ligi wojewodzkiej gra barazach,0
10040,@anonymized_account A wróżbita Maciej mówi że zrozumiemy,wrozbita maciej mowi zrozumiemy,0


In [0]:
df_data.reset_index(drop=True, inplace=True)

In [23]:
notes = df_data.text_mod
notes = list(notes)

print(f'Number of notes: {len(notes)}')

Number of notes: 10025


Print example sentences:

In [24]:
for i in range(10):
  print(f'note {i}: {notes[i]}')

note 0: faworytem do tytulu cracovia zobaczymy typ sprawdzi
note 1: brawo daria kibic dobre zle
note 2: super polski premier sklada kwiaty grobach kolaborantow doczekalismy czasow
note 3: musi innej drogi nie mamy
note 4: odrzut natychmiastowy kwasna mina problem
note 5: fajny xdd pamietam spoznilam pierwsze zajecia sporo kare kazal usiasc pierwszej lawce xd
note 6: nie szczescia
note 7: dawno kogos wrednego nie widzialam xd
note 8: zaleglosci wazne wezwania do zaplaty klub nie wywiazal
note 9: brudzinski jestes klamca marnym kutasem


### Word frequency

In [0]:
words_lst = [word.split(' ') for word in notes]
words_lst_flatten = [item for sublist in words_lst for item in sublist]
words_lst_flatten = pd.Series(words_lst_flatten)
vocab_count = words_lst_flatten.value_counts()

In [15]:
vocab_count[:10]

nie      3097
do       945 
rt       681 
od       365 
tez      336 
moze     309 
chyba    251 
sa       214 
wiem     176 
wiec     166 
dtype: int64

In [16]:
print(f'Number of unique words in corpus: {len(vocab_count)}')

Number of unique words in corpus: 21947


## Unsupervised Word Segmentation into Subword Units

In this step we will transform our corpus to be represented with generated subwords.

If you wish to dive into details of this approach, here is the original paper: https://arxiv.org/abs/1508.07909

Source code: https://github.com/rsennrich/subword-nmt
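
To give an intuition for what the library learns, here is a toy sketch of the byte pair encoding idea described in the paper: repeatedly merge the most frequent pair of adjacent symbols. It is not the `subword-nmt` implementation; the helper names, the `</w>` end-of-word marker and the toy word counts are assumptions chosen for illustration.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy word counts; each word is split into characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
print(vocab)
```

After three merges the frequent ending `est</w>` has become a single symbol, while rarer words stay split into characters. The number of merges plays the role of the `symbols` parameter used below.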



In [0]:
Path = collections.namedtuple('Path', ['name'])


class LearnJointBpeVocabArgs:
    """
    Helper for passing arguments to ``learn_joint_bpe_and_vocab``. This class
    needs to exist because ``learn_joint_bpe_and_vocab.py`` reads its settings
    from an argparse-style object and accesses the ``name`` attribute of the
    file arguments.
    """

    def __init__(self, input_, output, vocab, symbols=10000, min_frequency=5, verbose=False, separator='@@', total_symbols=10000):
        self.input = [Path(input_)]
        self.output = Path(output)
        self.vocab = [Path(vocab)]
        self.symbols = symbols
        self.min_frequency = min_frequency
        self.verbose = verbose
        self.separator = separator
        self.total_symbols = total_symbols

        
def generate_subwords_vocab(notes, output_codes_path, output_vocab_path, symbols=10000, 
                           min_frequency=5, verbose=False, separator='@@', total_symbols=10000):
  
    # temp file used here because learn_joint_bpe_and_vocab does not accept anything other than file
    with tempfile.NamedTemporaryFile('wt', encoding='utf-8', delete=False) as notes_temp_file:
        notes_temp_file.writelines(note + '\n' for note in notes)
        notes_temp_file_path = notes_temp_file.name
        
        
    # this function automatically saves the result
    learn_joint_bpe_and_vocab(LearnJointBpeVocabArgs(
        input_ = notes_temp_file_path,
        output = output_codes_path,
        vocab = output_vocab_path,
        verbose = verbose,
        symbols = symbols,
        min_frequency = min_frequency,
        separator = separator,
        total_symbols = total_symbols
    ))
  
  
    os.remove(notes_temp_file_path)
  

In [0]:
# Set number of symbols to be used:
symbols = 500

In [0]:
CODES_PATH = os.path.join(PROJECT_HOME_PATH, 'subwords', f'codes_{symbols}_5_{symbols}.txt')
VOCAB_PATH = os.path.join(PROJECT_HOME_PATH, 'subwords', f'vocab_{symbols}_5_{symbols}.txt')

In [38]:
generate_subwords_vocab(
    notes=notes,
    output_codes_path = CODES_PATH,
    output_vocab_path = VOCAB_PATH,
    verbose=False,
    symbols=symbols,
    min_frequency=5,
    total_symbols=symbols
)

Number of word-internal characters: 26
Number of word-final characters: 26
Reducing number of merge operations by 52


Now we replace the words in our **notes** with the corresponding subwords and save everything as one text file called **corpus**.

In [0]:
CORPUS_OUTPUT_PATH = os.path.join(DATA_PATH, 'glove', f'corpus_{symbols}_5_{symbols}.txt')

In [0]:
# Minimum frequency a subword must have to be used.
vocab_threshold = 5

with open(VOCAB_PATH, encoding='utf-8') as vocab_file:
    vocab = apply_bpe.read_vocabulary(vocab_file, threshold=vocab_threshold)
        
with open(CODES_PATH, encoding='utf-8') as codes_file:
    bpe = apply_bpe.BPE(codes_file, vocab=vocab)
    
with open(CORPUS_OUTPUT_PATH, mode='x', encoding='utf-8') as output_file:
  output_file.writelines(bpe.process_line(note) + '\n' for note in notes)
  

In [0]:
corpus = pd.read_csv(os.path.join(DATA_PATH, 'glove', 'corpus_500_5_500.txt'), header = None, names=['text_subwords'])
subwords_500 = pd.concat([df_data['text_mod'], corpus['text_subwords']], axis=1)

In [44]:
subwords_500.head()

Unnamed: 0,text_mod,text_subwords
0,faworytem do tytulu cracovia zobaczymy typ sprawdzi,fa@@ wo@@ ry@@ tem do ty@@ tu@@ lu c@@ ra@@ co@@ vi@@ a z@@ ob@@ a@@ czy@@ my ty@@ p spraw@@ dzi
1,brawo daria kibic dobre zle,bra@@ wo d@@ ar@@ i@@ a kibi@@ c dob@@ r@@ e z@@ le
2,super polski premier sklada kwiaty grobach kolaborantow doczekalismy czasow,su@@ pe@@ r pol@@ ski pre@@ mie@@ r sk@@ la@@ da k@@ wia@@ ty g@@ ro@@ ba@@ ch ko@@ la@@ bo@@ ra@@ n@@ tow do@@ cze@@ k@@ ali@@ smy cza@@ so@@ w
3,musi innej drogi nie mamy,mu@@ si in@@ nej d@@ ro@@ gi nie ma@@ my
4,odrzut natychmiastowy kwasna mina problem,od@@ rz@@ u@@ t na@@ ty@@ ch@@ mia@@ st@@ ow@@ y k@@ wa@@ s@@ na mi@@ na prob@@ lem
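
The `@@` suffix marks a unit that continues into the next token, so the segmentation is reversible. A minimal sketch of undoing it, where `restore_words` is a hypothetical helper and not part of `subword-nmt`:

```python
import re

def restore_words(segmented):
    """Join subword units back into words by dropping the '@@ ' separators."""
    return re.sub(r'@@( |$)', '', segmented)

segmented = 'bra@@ wo d@@ ar@@ i@@ a kibi@@ c dob@@ r@@ e z@@ le'
print(restore_words(segmented))  # brawo daria kibic dobre zle
```

This reproduces the `text_mod` value of row 1 above, which is a quick way to check that no information was lost during segmentation.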


In [0]:
corpus = pd.read_csv(os.path.join(DATA_PATH, 'glove', 'corpus_1000_5_1000.txt'), header = None, names=['text_subwords'])
subwords_1000 = pd.concat([df_data['text_mod'], corpus['text_subwords']], axis=1)

In [27]:
subwords_1000.head()

Unnamed: 0,text_mod,text_subwords
0,faworytem do tytulu cracovia zobaczymy typ sprawdzi,fa@@ wo@@ ry@@ tem do ty@@ tu@@ lu c@@ ra@@ co@@ vi@@ a zoba@@ czy@@ my ty@@ p spraw@@ dzi
1,brawo daria kibic dobre zle,brawo dar@@ ia kibi@@ c dob@@ re zle
2,super polski premier sklada kwiaty grobach kolaborantow doczekalismy czasow,supe@@ r polski pre@@ mie@@ r skla@@ da k@@ wia@@ ty gro@@ ba@@ ch ko@@ la@@ bo@@ ran@@ tow do@@ cze@@ k@@ ali@@ smy cza@@ sow
3,musi innej drogi nie mamy,musi in@@ nej dro@@ gi nie mamy
4,odrzut natychmiastowy kwasna mina problem,od@@ rzu@@ t na@@ ty@@ ch@@ mia@@ st@@ ow@@ y k@@ wa@@ s@@ na mi@@ na problem


In [0]:
corpus = pd.read_csv(os.path.join(DATA_PATH, 'glove', 'corpus_5000_5_5000.txt'), header = None, names=['text_subwords'])
subwords_5000 = pd.concat([df_data['text_mod'], corpus['text_subwords']], axis=1)

In [45]:
subwords_5000.head()

Unnamed: 0,text_mod,text_subwords
0,faworytem do tytulu cracovia zobaczymy typ sprawdzi,fawory@@ tem do tytu@@ lu cracovia zobaczymy ty@@ p spraw@@ dzi
1,brawo daria kibic dobre zle,brawo dar@@ ia kibi@@ c dobre zle
2,super polski premier sklada kwiaty grobach kolaborantow doczekalismy czasow,super polski premier skla@@ da kwiaty gro@@ bach kola@@ bo@@ ran@@ tow do@@ czek@@ alismy czasow
3,musi innej drogi nie mamy,musi innej drogi nie mamy
4,odrzut natychmiastowy kwasna mina problem,od@@ rzut na@@ ty@@ ch@@ mia@@ stow@@ y kwa@@ s@@ na mina problem


In [0]:
corpus = pd.read_csv(os.path.join(DATA_PATH, 'glove', 'corpus_10000_5_10000.txt'), header = None, names=['text_subwords'])
subwords_10000 = pd.concat([df_data['text_mod'], corpus['text_subwords']], axis=1)

In [50]:
subwords_10000.tail()

Unnamed: 0,text_mod,text_subwords
10020,zagrasz nie wiedzialem,zagra@@ sz nie wiedzialem
10021,var nie mial poprawic jakosci sedziowania efekt koncowy mniej wypaczonych wynikow mniej skandali,var nie mial po@@ prawi@@ c ja@@ kosci sedziow@@ ania efekt koncow@@ y mniej wy@@ pa@@ czo@@ nych wynikow mniej skan@@ dali
10022,szanowany kolega ladnie pozegnal,szan@@ ow@@ any kolega ladnie pozeg@@ nal
10023,bic zwyciezca ligi wojewodzkiej gra barazach,bi@@ c zwy@@ ciez@@ ca ligi woje@@ wo@@ dz@@ kiej gra bara@@ zach
10024,wrozbita maciej mowi zrozumiemy,w@@ roz@@ bi@@ ta ma@@ ciej mowi zrozumie@@ my


In [0]:
hate_speech_subwords = pd.concat([df_data, corpus['text_subwords']], axis=1)
hate_speech_subwords = hate_speech_subwords[['text', 'text_mod', 'text_subwords', 'target']]

In [58]:
hate_speech_subwords.tail(3)

Unnamed: 0,text,text_mod,text_subwords,target
10022,"@anonymized_account @anonymized_account Szanowany, bo kolega ładnie go pożegnał ?",szanowany kolega ladnie pozegnal,szan@@ ow@@ any kolega ladnie pozeg@@ nal,0
10023,@anonymized_account @anonymized_account @anonymized_account A kto inny ma się bić? Każdy zwyciezca ligi wojewódzkiej gra w barazach.,bic zwyciezca ligi wojewodzkiej gra barazach,bi@@ c zwy@@ ciez@@ ca ligi woje@@ wo@@ dz@@ kiej gra bara@@ zach,0
10024,@anonymized_account A wróżbita Maciej mówi że zrozumiemy,wrozbita maciej mowi zrozumiemy,w@@ roz@@ bi@@ ta ma@@ ciej mowi zrozumie@@ my,0


In [57]:
joblib.dump(hate_speech_subwords, os.path.join(DATA_PATH, 'interim', 'hate_speech_subwords.dat'))

['MyDrive/NmtPolishLanguage/DATA/interim/hate_speech_subwords.dat']

In [0]:
def tag_subword(text, subword):
  
  words = text.split()
  return 1 if subword in words else 0

In [0]:
res = subwords_10000.apply(lambda x: tag_subword(x['text_subwords'], 'wymysli@@'), axis=1)
res_idx = res[res == 1].index

In [119]:
subwords_10000.iloc[res_idx]

Unnamed: 0,text_mod,text_subwords
453,ej sluchajcie zajebistego wymyslilam nnjaki najlepszy argument bonnie pamietnikow wampirow nnbo nnie,ej sluchajcie zajebi@@ stego wymysli@@ lam nn@@ jaki najlepszy argument bo@@ nnie pamie@@ t@@ nikow wa@@ mpi@@ row nn@@ bo nnie
454,rt ej sluchajcie zajebistego wymyslilam nnjaki najlepszy argument bonnie pamietnikow wampirow nnbo nnie,rt ej sluchajcie zajebi@@ stego wymysli@@ lam nn@@ jaki najlepszy argument bo@@ nnie pamie@@ t@@ nikow wa@@ mpi@@ row nn@@ bo nnie
516,komentarz raz slyszac trybuny zastanawiam wymyslil mecze pn obejrzal choc,komentarz raz slysz@@ ac trybu@@ ny zastanawiam wymysli@@ l mecze pn obejrz@@ al choc
2377,pewnie macie same pytania nie potraficie wymyslic nowych,pewnie macie same pytania nie potrafi@@ cie wymysli@@ c nowych
2872,prosty przepis wymyslili karakan wyglada mial pelne pieluchomajtki,prosty przepis wymysli@@ li kara@@ kan wyglada mial pelne pie@@ lu@@ cho@@ maj@@ tki
3461,wymyslilem kapitalny zart glowa spocilanw sposob denis urubko wita mieszkancami malajownhi malajen,wymysli@@ lem kapi@@ tal@@ ny zart glowa spo@@ ci@@ lan@@ w sposob de@@ ni@@ s uru@@ b@@ ko wi@@ ta mieszkan@@ cami mala@@ jow@@ n@@ hi mala@@ je@@ n
3462,rt wymyslilem kapitalny zart glowa spocilanw sposob denis urubko wita mieszkancami malajow,rt wymysli@@ lem kapi@@ tal@@ ny zart glowa spo@@ ci@@ lan@@ w sposob de@@ ni@@ s uru@@ b@@ ko wi@@ ta mieszkan@@ cami mala@@ jo@@ w
3666,spoko wymyslilam cala kampanie doktor wszystko widziala kazala chlopakom czytac puscila oczko girl power,spoko wymysli@@ lam cala kampanie do@@ ktor wszystko widzi@@ ala kaz@@ ala chlopa@@ kom czytac pusci@@ la o@@ czko girl po@@ we@@ r
4498,wymyslilas,wymysli@@ las
5060,skoro dzban wymyslil nie wypychania pilki bok nie dziwie kazda kiwka kwestionowana,skoro dz@@ ban wymysli@@ l nie wy@@ py@@ cha@@ nia pilki bo@@ k nie dziwie kazda ki@@ wka kwest@@ io@@ nowana


In [0]:
corpus = pd.read_csv(os.path.join(DATA_PATH, 'glove', 'corpus_15000_5_15000.txt'), header = None, names=['text_subwords'])
subwords_15000 = pd.concat([df_data['text_mod'], corpus['text_subwords']], axis=1)

In [109]:
subwords_15000.tail(3)

Unnamed: 0,text_mod,text_subwords
10022,szanowany kolega ladnie pozegnal,szan@@ ow@@ any kolega ladnie pozeg@@ nal
10023,bic zwyciezca ligi wojewodzkiej gra barazach,bi@@ c zwy@@ ciez@@ ca ligi woje@@ wo@@ dz@@ kiej gra bara@@ zach
10024,wrozbita maciej mowi zrozumiemy,w@@ roz@@ bi@@ ta ma@@ ciej mowi zrozumie@@ my


In [0]:
res = subwords_15000.apply(lambda x: tag_subword(x['text_subwords'], 'wymysli@@'), axis=1)
res_idx = res[res == 1].index

In [120]:
subwords_15000.iloc[res_idx]

Unnamed: 0,text_mod,text_subwords
453,ej sluchajcie zajebistego wymyslilam nnjaki najlepszy argument bonnie pamietnikow wampirow nnbo nnie,ej sluchajcie zajebi@@ stego wymysli@@ lam nn@@ jaki najlepszy argument bo@@ nnie pamie@@ t@@ nikow wa@@ mpi@@ row nn@@ bo nnie
454,rt ej sluchajcie zajebistego wymyslilam nnjaki najlepszy argument bonnie pamietnikow wampirow nnbo nnie,rt ej sluchajcie zajebi@@ stego wymysli@@ lam nn@@ jaki najlepszy argument bo@@ nnie pamie@@ t@@ nikow wa@@ mpi@@ row nn@@ bo nnie
516,komentarz raz slyszac trybuny zastanawiam wymyslil mecze pn obejrzal choc,komentarz raz slysz@@ ac trybu@@ ny zastanawiam wymysli@@ l mecze pn obejrz@@ al choc
2377,pewnie macie same pytania nie potraficie wymyslic nowych,pewnie macie same pytania nie potrafi@@ cie wymysli@@ c nowych
2872,prosty przepis wymyslili karakan wyglada mial pelne pieluchomajtki,prosty przepis wymysli@@ li kara@@ kan wyglada mial pelne pie@@ lu@@ cho@@ maj@@ tki
3461,wymyslilem kapitalny zart glowa spocilanw sposob denis urubko wita mieszkancami malajownhi malajen,wymysli@@ lem kapi@@ tal@@ ny zart glowa spo@@ ci@@ lan@@ w sposob de@@ ni@@ s uru@@ b@@ ko wi@@ ta mieszkan@@ cami mala@@ jow@@ n@@ hi mala@@ je@@ n
3462,rt wymyslilem kapitalny zart glowa spocilanw sposob denis urubko wita mieszkancami malajow,rt wymysli@@ lem kapi@@ tal@@ ny zart glowa spo@@ ci@@ lan@@ w sposob de@@ ni@@ s uru@@ b@@ ko wi@@ ta mieszkan@@ cami mala@@ jo@@ w
3666,spoko wymyslilam cala kampanie doktor wszystko widziala kazala chlopakom czytac puscila oczko girl power,spoko wymysli@@ lam cala kampanie do@@ ktor wszystko widzi@@ ala kaz@@ ala chlopa@@ kom czytac pusci@@ la o@@ czko girl po@@ we@@ r
4498,wymyslilas,wymysli@@ las
5060,skoro dzban wymyslil nie wypychania pilki bok nie dziwie kazda kiwka kwestionowana,skoro dz@@ ban wymysli@@ l nie wy@@ py@@ cha@@ nia pilki bo@@ k nie dziwie kazda ki@@ wka kwest@@ io@@ nowana


## GloVe embeddings

In [0]:
# If your problem does not need subwords, the two cells below show how to
# generate the corpus and the vocabulary from whole words instead.

# To use them, remove the surrounding triple quotes and execute both cells.


In [0]:
"""
CORPUS_OUTPUT_PATH = os.path.join(DATA_PATH, 'glove', 'corpus.txt')

with open(CORPUS_OUTPUT_PATH, mode='w', encoding='utf-8') as output_file:
    output_file.writelines(note + '\n' for note in notes)

"""

In [0]:
"""
VOCAB_PATH = os.path.join(PROJECT_HOME_PATH, 'subwords', 'vocab.txt')

# Number of most common words to be taken into consideration.
vocab_threshold = 20000

words_lst = [word.split(' ') for word in notes]
words_lst_flatten = [item for sublist in words_lst for item in sublist]
words_lst_flatten = pd.Series(words_lst_flatten)
vocab = words_lst_flatten.value_counts()

vocab[:vocab_threshold].to_csv(VOCAB_PATH, sep=' ', index=True, header=False)

"""

#### Download GloVe repository

In [0]:
glove_repo_link = 'https://github.com/stanfordnlp/GloVe.git'
GLOVE_PATH = os.path.join(PROJECT_HOME_PATH, 'GloVe')


if not os.path.exists(GLOVE_PATH):
  print(f'Downloading GloVe project from the repository and placing under PROJECT_HOME_PATH: {glove_repo_link}')
  os.chdir(PROJECT_HOME_PATH)
  !git clone https://github.com/stanfordnlp/GloVe.git
  
else:
  print(f'GloVe project is already downloaded')

GloVe project is already downloaded


In [0]:
 cd ~/..


/


#### Execute make

In [0]:
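def execute_script(file_path):
    # Run the script and capture its output instead of printing it
    p = Popen(['./{}'.format(file_path)], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    output, err = p.communicate()
    # Report failures; otherwise a broken build would pass silently
    if p.returncode != 0:
        print(f'{file_path} failed with return code {p.returncode}')
    return output, err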

In [0]:
# make GloVe

if not os.path.exists(GLOVE_PATH):
    print(f'Please download GloVe project from the repository and place it under: {GLOVE_PATH}')
else:
    os.chdir(GLOVE_PATH)
    file_name = 'build_glove.sh'
    with open('./{}'.format(file_name), 'w') as file_handle:
        file_handle.write('#!/bin/bash\n')
        file_handle.write('cd {}\n'.format(GLOVE_PATH))
        file_handle.write('make\n')
    os.chmod(file_name, 0o777)
    print('Executing make')
    output, err = execute_script('./{}'.format(file_name))
    print('./{}'.format(file_name))
    os.remove(file_name)
 
    print('Finished')


Executing make
./build_glove.sh
Finished


In [0]:
cd ~/..

/


In [0]:
GLOVE_BIN_PATH = os.path.join(PROJECT_HOME_PATH, 'GloVe', 'build')

In [0]:
ls

[0m[01;34mbin[0m/      [01;34mdatalab[0m/  [01;34mhome[0m/   [01;34mlib64[0m/  [01;36mMyDrive[0m@  [01;34mroot[0m/  [01;34msrv[0m/    [30;42mtmp[0m/    [01;34mvar[0m/
[01;34mboot[0m/     [01;34mdev[0m/      [01;34mlib[0m/    [01;34mmedia[0m/  [01;34mopt[0m/      [01;34mrun[0m/   [01;34mswift[0m/  [01;34mtools[0m/
[01;34mcontent[0m/  [01;34metc[0m/      [01;34mlib32[0m/  [01;34mmnt[0m/    [01;34mproc[0m/     [01;34msbin[0m/  [01;34msys[0m/    [01;34musr[0m/


### Generate co-occurrence statistics

The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.

The core training code is separated from these preprocessing steps and can be executed independently.
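
To make the idea of a co-occurrence matrix concrete, here is a small pure-Python sketch of how such counts can be collected. It is not the GloVe `cooccur` tool: `count_cooccurrences` is a hypothetical helper, and the symmetric window, the 1/distance weighting and the per-line context are assumptions about sensible defaults rather than a description of the binary.

```python
import collections

def count_cooccurrences(sentences, window_size=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = collections.defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Look back at most window_size tokens.
            for j in range(max(0, i - window_size), i):
                weight = 1.0 / (i - j)  # closer neighbours count more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

counts = count_cooccurrences(['musi innej drogi nie mamy'], window_size=2)
print(counts[('musi', 'innej')])  # 1.0
print(counts[('musi', 'drogi')])  # 0.5
print(counts[('musi', 'nie')])    # 0.0 (outside the window)
```

Only the non-zero entries need to be stored, which is why the real tool writes a list of (word, context, count) records instead of a dense matrix.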

In [0]:
subwords_symbols = 10000
subwords_min_frequency = 5
glove_windows_size = 15
glove_iterations = 15
glove_vector_size = 50

In [0]:
subwords_params = str(subwords_symbols) + '_' + str(subwords_min_frequency)
cooccur_params = str(subwords_params) + '_' + str(glove_windows_size)
COOCCUR_PATH = os.path.join(DATA_PATH, 'glove', f'cooccurrence_{cooccur_params}')

In [0]:
def run_cooccur(vocab_path, corpus_path, cooccur_output_path, windows_size=15, verbose=False):
    # The '.bin' and '.shuf.bin' extensions are appended here, so the caller passes a bare path.
    if os.path.splitext(cooccur_output_path)[1]:
        raise ValueError(f'cooccur_output_path must not have any extension: {cooccur_output_path}')
    cooccur_output_path = cooccur_output_path + '.bin'
    cooccur_shuf_path = cooccur_output_path.replace('.bin', '.shuf.bin')

    # Build the co-occurrence records from the corpus.
    check_call(
        os.path.join(GLOVE_BIN_PATH, 'cooccur') +
        f' -vocab-file {vocab_path} -window-size {windows_size} -verbose {2 if verbose else 0}'
        f' < {corpus_path} > {cooccur_output_path}',
        shell=True
    )

    # Shuffle the co-occurrence records before training.
    check_call(
        os.path.join(GLOVE_BIN_PATH, 'shuffle') +
        f' -verbose {2 if verbose else 0} < {cooccur_output_path} > {cooccur_shuf_path}',
        shell=True
    )


In [0]:
#! MyDrive/NmtPolishLanguage/GloVe/build/cooccur -vocab-file MyDrive/NmtPolishLanguage/subwords/vocab_15000_5_15000.txt -window-size 15 -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/corpus_15000_5_15000.txt > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin
#! MyDrive/NmtPolishLanguage/GloVe/build/shuffle -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.shuf.bin

# subprocess.run('MyDrive/NmtPolishLanguage/GloVe/build/cooccur -vocab-file MyDrive/NmtPolishLanguage/subwords/vocab_15000_5_15000.txt -window-size 15 -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/corpus_15000_5_15000.txt > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin', shell=True, check=True)
# subprocess.run('MyDrive/NmtPolishLanguage/GloVe/build/shuffle -verbose 0 < MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.bin > MyDrive/NmtPolishLanguage/DATA/glove/cooccurrence_15000_5_15.shuf.bin',shell=True, check=True)


In [0]:
run_cooccur(VOCAB_PATH, CORPUS_OUTPUT_PATH, COOCCUR_PATH, windows_size=glove_windows_size)

### Train GloVe embeddings

In [0]:
glove_params = str(cooccur_params) + '_' + str(glove_iterations) + '_' + str(glove_vector_size)
VECTORS_PATH = os.path.join(DATA_PATH, 'glove', f'vectors_{glove_params}')

In [0]:
def run_glove(cooccur_path, vocab_path, vectors_output_path, vector_size=50, iterations=15,
             learning_rate=0.05, x_max=100, alpha=0.75, verbose=False, word2vec_format=False):
    if os.path.splitext(vectors_output_path)[1]:
        raise ValueError(f'vectors_output_path must not have any extension: {vectors_output_path}')

    cooccur_path = cooccur_path + '.shuf.bin'
    threads = os.cpu_count()

    # glove appends the '.txt' extension itself, so the path is passed without one
    check_call(
        os.path.join(GLOVE_BIN_PATH, 'glove') +
        f' -input-file {cooccur_path} -vocab-file {vocab_path} -write-header {int(word2vec_format)}'
        f' -vector-size {vector_size} -iter {iterations} -eta {learning_rate} -x-max {x_max}'
        f' -alpha {alpha} -threads {threads} -save-file {vectors_output_path}'
        f' -verbose {2 if verbose else 0}',
        shell=True
    )

In [0]:
run_glove(COOCCUR_PATH, VOCAB_PATH, VECTORS_PATH, vector_size=glove_vector_size)
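
The `x_max` and `alpha` arguments of `run_glove` control the weighting function from the GloVe paper, which limits the influence of very frequent pairs and down-weights rare ones. A small sketch of that function, with `glove_weight` as a hypothetical helper name and the defaults taken from `run_glove` above:

```python
def glove_weight(x, x_max=100, alpha=0.75):
    """Weight applied to a co-occurrence count x in the GloVe loss."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in (1, 10, 100, 1000):
    print(count, round(glove_weight(count), 4))
```

On a corpus as small as ours most pairs have low counts, so lowering `x_max` may be worth trying; this is a suggestion to experiment with, not a tested recommendation.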

### Visualize embeddings in TensorBoard

In [0]:
import tensorflow as tf
from tensorboard import main as tb
from tensorflow.contrib.tensorboard.plugins import projector

In [0]:
VECTORS_FILE_PATH = VECTORS_PATH + '.txt'
words, embeddings = [], []

with open(VECTORS_FILE_PATH, encoding='utf-8') as input_file:
    for line in input_file:
        word, *embedding = line.split()
        words.append(word)
        embeddings.append(np.array(embedding, dtype=float))
embeddings = np.array(embeddings)

In [0]:
print(f'\tembeddings shape: {embeddings.shape}\n\twords len: {len(words)}' )

	embeddings shape: (6329, 50)
	words len: 6329
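
Before opening the projector, a quick sanity check is to look up nearest neighbours by cosine similarity. The sketch below uses toy vectors so that it runs on its own; `nearest_neighbours` is a hypothetical helper, and with the real data you would pass `words` and `embeddings` from the cell above.

```python
import numpy as np

def nearest_neighbours(query, words, embeddings, top_n=3):
    """Return the top_n words closest to `query` by cosine similarity."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit @ unit[words.index(query)]
    order = np.argsort(-similarities)
    return [words[i] for i in order if words[i] != query][:top_n]

# Toy vectors standing in for the trained embeddings.
toy_words = ['nie', 'tak', 'mecz', 'liga']
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
print(nearest_neighbours('mecz', toy_words, toy_embeddings, top_n=1))  # ['liga']
```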


In [0]:
TENSORBOARD_PATH = os.path.join(PROJECT_HOME_PATH, 'tensorboard')

In [0]:
def save_for_projector(vectors, tensorboard_output_path, metadata, name):
    metadata_output_path = os.path.join(tensorboard_output_path, 'metadata.tsv')
    model_checkpoint_path = os.path.join(tensorboard_output_path, 'model.ckpt')
    
    # if the metadata has more than one column, the first row must be a header row
    with open(metadata_output_path, 'w', encoding='utf-8') as output_file:
        output_file.writelines(word + '\n' for word in metadata)
        
    session = tf.InteractiveSession()
    with tf.device("/cpu:0"):
        vectors_var = tf.Variable(vectors, name=name, trainable=False)
        
    tf.global_variables_initializer().run()
    saver = tf.train.Saver()
    writer = tf.summary.FileWriter(tensorboard_output_path)
    config = projector.ProjectorConfig()
    config.model_checkpoint_path = model_checkpoint_path
    embedding = config.embeddings.add()
    embedding.tensor_name = vectors_var.name
    embedding.metadata_path = metadata_output_path
    projector.visualize_embeddings(writer, config)
    saver.save(session, model_checkpoint_path)

In [0]:
len(words)

6329

In [0]:
save_for_projector(embeddings, TENSORBOARD_PATH, words, name='embeddings')

Instructions for updating:
Colocations handled automatically by placer.


In [0]:
# Run TensorBoard

# In a terminal (replace <TENSORBOARD_PATH> with the value of TENSORBOARD_PATH):
# $ tensorboard --logdir=<TENSORBOARD_PATH> --port=6009

# If TensorBoard runs on a remote server, forward the port when logging in: ssh [login@server] -L [port]:localhost:[port]