<a href="https://colab.research.google.com/github/iyves/ru_col_suggest/blob/master/attest_collocation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab notebook shows how to lemmatize collocations and get suggested replacements for them.

In [2]:
# Initialize the Colab: mount Google Drive (the repository is cloned and the dependencies are installed in the cells below)
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Note: adjust this path to match the Google Drive folder structure of the active account
%cd drive/MyDrive/
! git clone https://github.com/iyves/ru_col_suggest.git

/content/drive/MyDrive
fatal: destination path 'ru_col_suggest' already exists and is not an empty directory.
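
The `fatal: destination path ... already exists` message above appears whenever the cell is rerun after the repository has been cloned. A minimal sketch of one way to avoid it, assuming the clone lives in a folder containing a `.git` directory; the helper names `git_command_for` and `sync_repo` are hypothetical and not part of the repository:

```python
import os
import subprocess

REPO_URL = "https://github.com/iyves/ru_col_suggest.git"

def git_command_for(dest, repo_url=REPO_URL):
    """Return the git command to run: pull if dest is already a clone, clone otherwise."""
    if os.path.isdir(os.path.join(dest, ".git")):
        return ["git", "-C", dest, "pull"]
    return ["git", "clone", repo_url, dest]

def sync_repo(dest):
    """Run the chosen git command and raise if git reports an error."""
    subprocess.run(git_command_for(dest), check=True)

# In the notebook, after changing into drive/MyDrive/:
# sync_repo("ru_col_suggest")
```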


In [None]:
%cd /content/drive/MyDrive/ru_col_suggest/src/models

# Downloading the tokenization models
!pip install ufal.udpipe
!pip install treetaggerwrapper

# Note: Make sure to update the version to the latest (see https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
!wget https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.4.tar.gz
!wget https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
!wget https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/install-tagger.sh
!wget http://corpus.leeds.ac.uk/mocky/russian.par.gz
!sh install-tagger.sh
!wget http://corpus.leeds.ac.uk/mocky/ru-table.tab

# Installation for Windows: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/#Windows
# Move the bin, lib, and cmd folders to the src/models folder

/content/drive/MyDrive/ru_col_suggest/src/models
Collecting ufal.udpipe
  Downloading ufal.udpipe-1.2.0.3.tar.gz (304 kB)
Building wheels for collected packages: ufal.udpipe
  Building wheel for ufal.udpipe (setup.py) ... done
  Created wheel for ufal.udpipe: filename=ufal.udpipe-1.2.0.3-cp37-cp37m-linux_x86_64.whl size=5626696 sha256=09f4b60e20e9977840f7f00610d4fccf7253cd5355bd95d45f7e39718eb1ed5e
  Stored in directory: /root/.cache/pip/wheels/b8/b5/8e/3da091629a21ce2d10bf90759d0cb034ba10a5cf7a01e83d64
Successfully built ufal.udpipe
Installing collected packages: ufal.udpipe
Successfully installed ufal.udpipe-1.2.0.3
Collecting treetaggerwrapper
  Downloading treetaggerwrapper-2.3.tar.gz (43 kB)
Building wheels for collected packages: treetaggerwrapper
  Building wheel for treetaggerwrapper (setup.py) ... done
  Created wheel for tr

In [3]:
# If the GitHub repository is already cloned, this is the only setup cell that needs to be rerun after restarting the runtime
# Note: be sure to mount Google Drive before continuing
%cd /content/drive/MyDrive/ru_col_suggest/
!git fetch
!git pull
# !git reset --hard origin/master

!pip install wget

# For static word embedding models
!pip install glove-python-binary
!pip install gensim

# For dynamic word embedding models
!pip uninstall -y tensorflow
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

# For attesting collocations
!pip install mysql-connector-python

# Note: be sure to update the config.ini file and restart the runtime

/content/drive/MyDrive/ru_col_suggest
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 4), reused 6 (delta 4), pack-reused 0
Unpacking objects: 100% (6/6), done.
From https://github.com/iyves/ru_col_suggest
   f7a04ce..0a425fc  master     -> origin/master
Checking out files: 100% (64/64), done.
HEAD is now at 0a425fc Fixed error with attesting and getting frequency for trigrams.
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-6hq20dok
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-6hq20dok
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
tokenizers                    0.10.3
transformers                  4.10.0.dev0


## Lemmatization of collocations

In [None]:
from src.tokenizer import Tokenizer

input_collocations = ["исследуем вопрос", "по мнению авторов"]

# Lemmatization via UDPipe
print("\n"*2, "Lemmatization via UDPipe")
tokenizer = Tokenizer(Tokenizer.Method.UDPIPE)
udpipe_output = tokenizer.tokenize(input_collocations)
for inp, outp in zip(input_collocations, udpipe_output):
  print(inp, "->", outp)


# Lemmatization via TreeTagger
print("\n"*2, "Lemmatization via TreeTagger")
tokenizer = Tokenizer(Tokenizer.Method.TREETAGGER)
treetagger_output = tokenizer.tokenize(input_collocations)
for inp, outp in zip(input_collocations, treetagger_output):
  print(inp, "->", outp)

Loading the model from /content/drive/MyDrive/ru_col_suggest/src/models/udpipe_syntagrus.model...




 Lemmatization via UDPipe
исследуем вопрос -> исследовать_VERB вопрос_NOUN
по мнению авторов -> по_ADP мнение_NOUN автор_NOUN


 Lemmatization via TreeTagger
исследуем вопрос -> исследовать_V вопрос_N
по мнению авторов -> по_S мнение_N автор_N
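
Both tokenizers return each collocation as a space-separated string of `lemma_TAG` tokens, with UDPipe using tags such as `VERB`/`NOUN`/`ADP` and TreeTagger using `V`/`N`/`S`. A minimal sketch of how such a string could be split back into (lemma, tag) pairs; `split_tagged` is a hypothetical helper, not a function from `src.tokenizer`:

```python
def split_tagged(ngram):
    """Split a 'lemma_TAG lemma_TAG ...' string into (lemma, tag) pairs."""
    pairs = []
    for token in ngram.split():
        # Split on the last underscore so lemmas containing '_' stay intact
        lemma, sep, tag = token.rpartition("_")
        if not sep:
            # No tag present: keep the whole token as the lemma
            lemma, tag = token, None
        pairs.append((lemma, tag))
    return pairs

print(split_tagged("по_S мнение_N автор_N"))
```

The same helper works for the output of either tokenizer, since it makes no assumption about the tag set.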


## Getting suggested collocations from static word embeddings (w2v, fastText, GloVe)

In [4]:
from src.static_embedder import LossLogger, StaticEmbedder

w2v_path = "/content/drive/MyDrive/models/lemma/w2v/w2v_treetagger.model"
fasttext_path = "/content/drive/MyDrive/models/lemma/fastText/fastText_treetagger.model"
glove_path = "/content/drive/MyDrive/models/lemma/glove/glove_treetagger.txt.word2vec"

# Load static word embedding models pretrained on a treetagger-lemmatized corpus
w2v_model = StaticEmbedder(StaticEmbedder.Model.WORD2VEC, w2v_path)
fasttext_model = StaticEmbedder(StaticEmbedder.Model.FASTTEXT, fasttext_path)
glove_model = StaticEmbedder(StaticEmbedder.Model.GLOVE, glove_path, binary=False)

In [None]:
# Get some suggested collocations
# fixed_positions has one flag per word: 1 keeps the word, 0 marks the word to replace
treetagger_output = ["исследовать_V вопрос_N", "по_S мнение_N автор_N"]

for ngram, fixed_positions in zip(treetagger_output, [[1, 0], [1, 1, 0]]):
  query = ngram.lower().split(" ")
  
  w2v_results = w2v_model.suggest_collocations(query, fixed_positions, topn=5)
  fasttext_results = fasttext_model.suggest_collocations(query, fixed_positions, topn=5)
  glove_results = glove_model.suggest_collocations(query, fixed_positions, topn=25)
  
  print("Results for", query, ":")
  print("_____w2v_____")
  for res in w2v_results:
    print(res)
  print("\n_____fastText_____")
  for res in fasttext_results:
    print(res)
  print("\n_____GloVe_____")
  for res in glove_results:
    print(res)
  print("-"*75)

Results for ['исследовать_v', 'вопрос_n'] :
_____w2v_____
('исследовать проблема', 0, 0.6838977336883545)
('исследовать задача', 1, 0.48253941535949707)
('исследовать спор', 2, 0.459981232881546)
('исследовать вопрсы', 3, 0.44508951902389526)

_____fastText_____
('исследовать новопрос', 0, 0.8441401720046997)
('исследовать вопроа', 3, 0.800765872001648)

_____GloVe_____
('исследовать итервьюера', 23, 0.5414384603500366)
---------------------------------------------------------------------------
Results for ['по_s', 'мнение_n', 'автор_n'] :
_____w2v_____
('по мнение исследователь', 0, 0.6782662272453308)
('по мнение ученый', 2, 0.5683425664901733)
('по мнение историк', 3, 0.443101167678833)

_____fastText_____
('по мнение исследователь', 0, 0.6990058422088623)
('по мнение втор', 1, 0.6199074387550354)
('по мнение ученый', 2, 0.6077357530593872)
('по мнение пвтор', 3, 0.5964410305023193)

_____GloVe_____
('по мнение помпонио', 1, 0.7873562574386597)
('по мнение ряженки', 2, 0.74589282274
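
Judging by the output, each result is a tuple of (suggested collocation, rank, similarity score), and the word at the position flagged `0` is swapped for its nearest neighbours in the embedding space. The sketch below illustrates that idea with cosine similarity over a toy vocabulary. It is not the `StaticEmbedder` implementation: the vectors are invented, the function `suggest` is hypothetical, and unlike the real output it keeps the POS tags in the suggestions.

```python
import math

# Toy 2-d vectors invented for illustration; real models use hundreds of dimensions
TOY_VECTORS = {
    "вопрос_n": (1.0, 0.2),
    "проблема_n": (0.9, 0.3),
    "задача_n": (0.6, 0.7),
    "автор_n": (0.1, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def suggest(query, fixed_positions, vectors, topn=2):
    """Replace the first position flagged 0 with its nearest neighbours."""
    target = fixed_positions.index(0)
    word = query[target]
    scored = sorted(
        ((cosine(vectors[word], vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    results = []
    for rank, (score, other) in enumerate(scored[:topn]):
        candidate = query[:target] + [other] + query[target + 1:]
        results.append((" ".join(candidate), rank, score))
    return results

for res in suggest(["исследовать_v", "вопрос_n"], [1, 0], TOY_VECTORS):
    print(res)
```

Only the replaced word needs a vector here, so the fixed word `исследовать_v` can be absent from the toy vocabulary.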

## Getting suggested collocations from dynamic word embeddings (RoBERTa, t5)

In [None]:
from src.dynamic_embedder import DynamicEmbedder

bert_path = "/content/drive/MyDrive/models/lemma/RuBERT_treetagger_lemma"

# Load a dynamic word embedding model trained on a treetagger-lemmatized corpus
bert_model = DynamicEmbedder(DynamicEmbedder.Model.BERT, bert_path)

In [None]:
# Get some suggested collocations
treetagger_output = ["исследовать_V <mask>_N", 
                     "по_S мнение_N <mask>_N", 
                     "рассматривать_V <mask>_N <mask>_N"]

for ngram in treetagger_output:
  query = ngram.lower()
  bert_results = bert_model.suggest_collocations(query, topn=5)
  
  print("Results for", query, ":")
  print("_____RoBERTa_____")
  for res in bert_results:
    print(res)
  print("-"*75)

Results for исследовать_v <mask>_n :
_____RoBERTa_____
('исследовать конституция', 0)
('исследовать законодательство', 1)
('исследовать автореферат', 2)
('исследовать диссертация', 3)
('исследовать закон', 4)
---------------------------------------------------------------------------
Results for по_s мнение_n <mask>_n :
_____RoBERTa_____
('по мнение NUMarab', 0)
('по мнение автор', 1)
('по мнение ученый', 2)
('по мнение эксперт', 3)
('по мнение судья', 4)
---------------------------------------------------------------------------
Results for рассматривать_v <mask>_n <mask>_n :
_____RoBERTa_____
('рассматривать международный NUMarab', 0)
('рассматривать международный монография', 1)
('рассматривать международный проблема', 2)
('рассматривать международный суд', 3)
('рассматривать международный целесообразный', 4)
('рассматривать содержание NUMarab', 1)
('рассматривать содержание монография', 2)
('рассматривать содержание проблема', 3)
('рассматривать содержание суд', 4)
('рассматривать 
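
The queries above are written by hand: the lemma to be replaced is swapped for the `<mask>` token while its POS tag is kept. A minimal sketch of how such queries could be built from a lemmatized n-gram; `mask_positions` is a hypothetical helper, not part of `src.dynamic_embedder`:

```python
def mask_positions(ngram, positions, mask="<mask>"):
    """Replace the lemma at each 0-based position with the mask token, keeping its tag."""
    tokens = ngram.split()
    for i in positions:
        # rpartition splits on the last underscore: 'автор_N' -> ('автор', '_', 'N')
        _, sep, tag = tokens[i].rpartition("_")
        tokens[i] = f"{mask}{sep}{tag}" if sep else mask
    return " ".join(tokens)

print(mask_positions("по_S мнение_N автор_N", [2]))
```

Passing several positions, for example `[1, 2]` on a trigram, produces a query with two masks like the third example above.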

## Getting suggested collocations & statistics

In [None]:
# Run this cell to print the runtime's public IP address, then allow that IP on the test gcloud server
!curl ipecho.net/plain

In [4]:
from src.get_collocation_replacements import get_collocation_replacements
from src.dynamic_embedder import DynamicEmbedder
from src.static_embedder import LossLogger, StaticEmbedder

bigrams = [l.lower() for l in ["исследовать_V вопрос_N", "школа_N экономика_N"]]
trigrams = [l.lower() for l in ["прийти_V к_S вывод_N", "по_S мнение_N автор_N"]]

w2v_path = "/content/drive/MyDrive/models/lemma/w2v/w2v_treetagger.model"
bert_path = "/content/drive/MyDrive/models/lemma/RuBERT_treetagger_lemma"

In [10]:
bigram_results = get_collocation_replacements(collocations=bigrams, 
                                              staticModel=True, 
                                              modelType=StaticEmbedder.Model.WORD2VEC,
                                              modelSrc=w2v_path, 
                                              binaryModel=True, # only relevant for GloVe
                                              cossim=True,
                                              topn=3,
                                              replace1=True, replace2=True, replace3=False,
                                              include_pmi=True, include_t_score=True, include_ngram_freq=True)

In [7]:
trigram_results = get_collocation_replacements(collocations=trigrams, 
                                              staticModel=False, 
                                              modelType=DynamicEmbedder.Model.BERT,
                                              modelSrc=bert_path,
                                              mask="<mask>",
                                              topn=3,
                                              replace1=True, replace2=True, replace3=True,
                                              include_pmi=True, include_t_score=True, include_ngram_freq=True)
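
The `include_pmi` and `include_t_score` flags request association measures for each suggested collocation. For orientation, the sketch below computes the standard textbook versions of both measures for a bigram from raw corpus counts. This is an assumption about what the measures mean in general: `src.get_collocation_replacements` may use a different log base, normalization, or corpus size, so the values need not match the notebook's output. The counts in the example are invented.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2 of observed vs. expected co-occurrence."""
    return math.log2((f_xy * n) / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

# Invented counts: the bigram occurs 8 times, its words 16 and 32 times,
# in a corpus of 1024 tokens
print(pmi(8, 16, 32, 1024))      # log2(16) = 4.0
print(t_score(8, 16, 32, 1024))  # (8 - 0.5) / sqrt(8)
```

A positive PMI means the words co-occur more often than chance would predict, while the t-score also rewards frequent pairs, which is why the two measures can rank the same suggestions differently.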

In [11]:
def print_results(res):
  for k,v in res.items():
    print(f"Results for '{k}':")
    for val in v:
      print(" "*5, val)
    print("-"*75)

print("*"*5, "BIGRAMS WITH W2V", "*"*5)
print_results(bigram_results)
print("\n")
print("*"*5, "TRIGRAMS WITH BERT", "*"*5)
print_results(trigram_results)

***** BIGRAMS WITH W2V *****
Results for 'исследовать_v вопрос_n':
      {'suggested': 'анализировать вопрос', 'rank': 0, 'cosScore': 0.5687171816825867, 'numReplaced': 1, 'pmi': -1.9326116004215073, 't_score': -713.0812254264505, 'ngram_freq': 71}
      {'suggested': 'рассматривать вопрос', 'rank': 1, 'cosScore': 0.5575243234634399, 'numReplaced': 1, 'pmi': -0.8382014019729593, 't_score': -322.48555306741184, 'ngram_freq': 2998}
      {'suggested': 'изучить вопрос', 'rank': 2, 'cosScore': 0.553035318851471, 'numReplaced': 1, 'pmi': -1.5870879940255196, 't_score': -255.31760297908693, 'ngram_freq': 46}
      {'suggested': 'исследовать проблема', 'rank': 0, 'cosScore': 0.6838977336883545, 'numReplaced': 1, 'pmi': -0.722491713486314, 't_score': -132.48833138166378, 'ngram_freq': 959}
      {'suggested': 'анализировать вопрос', 'rank': 0, 'cosScore': 0.5687171816825867, 'numReplaced': 2, 'pmi': -1.9326116004215073, 't_score': -713.0812254264505, 'ngram_freq': 71}
      {'suggested': 'расс