# Creation of models

This notebook creates word embeddings. The main idea is to train one Word2vec model per parliamentary term ("kadencja", about 4 years each). The input is lemmatized text from the Polish Parliamentary Corpus.

This notebook is based on a complete Word2vec example by Kavita Ganesan (https://github.com/kavgan/nlp-in-practice/tree/master/word2vec).


### Libraries imports

First, we load the libraries. We chose the Gensim library for training Word2vec because it is relatively easy to use and many tutorials are available online.

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging
import re
import pandas as pd
import string
import os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


The following test sentence, taken from the dataset, will help in writing the cleaning function:

In [2]:
test = 'niestety , działanie ten nie dać żaden rezultat . składać my obietnica bez pokrycie . sytuacja być obecnie taki , że w bliski czas praca utracić 1100 pracownik fabryka'

In [3]:
test

'niestety , działanie ten nie dać żaden rezultat . składać my obietnica bez pokrycie . sytuacja być obecnie taki , że w bliski czas praca utracić 1100 pracownik fabryka'

As the original dataset (30 GB compressed) did not fit into laptop memory, we first preprocessed it on a Google Colab instance and then downloaded only the parts we needed. The preprocessing removed part-of-speech tags, entity recognition annotations and most other metadata. The dataset downloaded from Colab consists of txt files, one file per parliamentary session. Below is an excerpt from one such file:

```
sprawozdanie stenograficzny z 266 posiedzenie sejm ustawodawczy z dzień 30 listopad 1921 rok .
( początek posiedzenie o godzina . 4 minimum . 40 po pół . )
przedstawiciel rząd :
minister rolnictwo i dobro państwowy Józef Raczyński , minister poczt i telegraf Władysław Stesłowicz , minister robota publiczny Gabriel Narutowicz , minister praca i opieka społeczny Ludwik Darowski , minister sprawiedliwość Bronisław sobolewski , kierownik ministerstwo aprowizacja Jan Stoiński , kierownik ministerstwo zdrowie publiczny doktor . Witold Chodźko , prezes główny urząd ziemski Władysław Kiernik .
podsekretarz stan :
w ministerstwo sprawiedliwość Zygmunt Rymowicz .
```

The first step in loading the data is to set up the paths:

In [4]:
path_to_files = 'data/drive/MyDrive/out_sejm/' # test location
path_to_files = 'data/takeout-20210110T123042Z-001/Takeout/Dysk/out_sejm/' # full location

In [5]:
files_names = os.listdir(path_to_files)

In [6]:
years = [name.split('_')[0] for name in files_names]
years = sorted(list(set(years)))

In [7]:
years

['1919-1922',
 '1922-1927',
 '1928-1930',
 '1930-1935',
 '1935-1938',
 '1938-1939',
 '1943-1947',
 '1947-1952',
 '1952-1956',
 '1957-1961',
 '1961-1965',
 '1965-1969',
 '1969-1972',
 '1972-1976',
 '1976-1980',
 '1980-1985',
 '1985-1989',
 '1989-1991',
 '1991-1993',
 '1993-1997',
 '1997-2001',
 '2001-2005']
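The prefix-based grouping used above can be sketched on its own. This is a minimal illustration, assuming only that each file name starts with the term followed by an underscore; the example file names and the `group_by_term` helper are hypothetical and do not come from the dataset.

```python
# Hypothetical file names; only the "<term>_<rest>" shape is assumed.
example_names = [
    "1919-1922_posiedzenie_001.txt",
    "1919-1922_posiedzenie_002.txt",
    "1922-1927_posiedzenie_001.txt",
]


def group_by_term(names):
    """Map each term prefix (text before the first underscore) to its files."""
    groups = {}
    for name in names:
        term = name.split("_")[0]
        groups.setdefault(term, []).append(name)
    return groups


groups = group_by_term(example_names)
terms = sorted(groups)
print(terms)
print(len(groups["1919-1922"]))
```

Grouping by the exact prefix, rather than testing whether the term string occurs anywhere in the file name, avoids accidental matches if a term string ever appeared elsewhere in a name.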

In [8]:
files_names_current_kadencja = [i for i in files_names if years[0] in i]

The next function performs the actual cleanup of the data: lowercasing, splitting on whitespace, removing digits, punctuation and stop words, etc.:

In [9]:
from stop_words import get_stop_words

In [10]:
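# Load the Polish stop-word list once instead of once per token.
STOP_WORDS = set(get_stop_words('pl'))

def clean_item(item):
    # Lowercase and split on whitespace.
    x = item.lower().split()
    removed_digits = [re.sub('[0-9]', '', i) for i in x]
    # string.punctuation covers ASCII punctuation only.
    removed_punct = [i.strip(string.punctuation) for i in removed_digits]
    removed_stopwords = [i for i in removed_punct if i not in STOP_WORDS]
    # Drop tokens that became empty after stripping digits and punctuation.
    removed_whitespace = [i for i in removed_stopwords if i != '']
    return removed_whitespace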

Test on the sentence defined earlier:

In [11]:
clean_item(test)

['niestety',
 'działanie',
 'dać',
 'rezultat',
 'składać',
 'obietnica',
 'pokrycie',
 'sytuacja',
 'obecnie',
 'w',
 'bliski',
 'czas',
 'praca',
 'utracić',
 'pracownik',
 'fabryka']
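Because `string.punctuation` contains ASCII characters only, typographic marks stay attached to tokens; this is why entries such as '˝legalizacji' show up in the similarity lists at the end of this notebook. The sketch below is one possible variant, not the function used for the trained models. The `clean_tokens` name, the small `STOP_WORDS` subset and the extra characters in `PUNCT` are assumptions made for illustration.

```python
import re
import string

# Hypothetical stop-word subset for illustration only; the notebook itself
# uses get_stop_words('pl') from the stop_words package.
STOP_WORDS = frozenset({"ten", "nie", "żaden", "my", "bez", "być", "taki", "że"})

# ASCII punctuation plus a few typographic marks (an assumed, incomplete list).
PUNCT = string.punctuation + "˝„”“«»"


def clean_tokens(line, stop_words=STOP_WORDS):
    """Lowercase, split, strip digits and punctuation, drop stop words."""
    tokens = []
    for raw in line.lower().split():
        token = re.sub(r"[0-9]", "", raw).strip(PUNCT)
        if token and token not in stop_words:
            tokens.append(token)
    return tokens


print(clean_tokens("˝Legalizacji˝ nie dać żaden rezultat , 1100 pracownik ."))
# ['legalizacji', 'dać', 'rezultat', 'pracownik']
```

Retraining with a different cleaning function would change the vocabulary, so results would no longer match the outputs shown in this notebook.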

And a test on an actual file:

In [12]:
with open(path_to_files+files_names_current_kadencja[0], 'r') as file:
    for i, line in enumerate(file):
        print(clean_item(line))
        break

['sprawozdanie', 'stenograficzny', 'z', 'posiedzenie', 'sejm', 'ustawodawczy', 'z', 'dzień', 'listopad', 'rok']


This function reads the data from all files of the selected kadencja. It is written as a generator because all the files together would not fit into memory. Gensim models can consume sentences lazily, so there is no need to load the data all at once.

In [13]:
def yield_all_kadencja(path_to_files, files_names_current_kadencja):
    # Stream cleaned token lists line by line, one file at a time.
    for file_name in files_names_current_kadencja:
        with open(path_to_files+file_name, 'r') as file:
            for line in file:
                yield clean_item(line)

In [14]:
a = yield_all_kadencja(path_to_files, files_names_current_kadencja)

In [15]:
for i, item in enumerate(a):
    print(item)
    if i==10:
        break

['sprawozdanie', 'stenograficzny', 'z', 'posiedzenie', 'sejm', 'ustawodawczy', 'z', 'dzień', 'listopad', 'rok']
['początek', 'posiedzenie', 'o', 'godzina', 'minimum', 'pół']
['przedstawiciel', 'rząd']
['minister', 'rolnictwo', 'i', 'dobro', 'państwowy', 'józef', 'raczyński', 'minister', 'poczt', 'i', 'telegraf', 'władysław', 'stesłowicz', 'minister', 'robota', 'publiczny', 'gabriel', 'narutowicz', 'minister', 'praca', 'i', 'opieka', 'społeczny', 'ludwik', 'darowski', 'minister', 'sprawiedliwość', 'bronisław', 'sobolewski', 'kierownik', 'ministerstwo', 'aprowizacja', 'jan', 'stoiński', 'kierownik', 'ministerstwo', 'zdrowie', 'publiczny', 'doktor', 'witold', 'chodźko', 'prezes', 'główny', 'urząd', 'ziemski', 'władysław', 'kiernik']
['podsekretarz', 'stan']
['w', 'ministerstwo', 'sprawiedliwość', 'zygmunt', 'rymowicz']
['otwierać', 'posiedzenie', 'protokół', 'posiedzenie', 'uważać', 'za', 'przyjęty', 'gdyż', 'wnieść', 'przeciw', 'zarzut', 'protokół', 'posiedzenie', 'leżeć', 'w', 'biuro', 
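The lazy behaviour of such a generator can be checked on a self-contained toy example. This is a minimal sketch: the `iter_lines` helper, the temporary files and their contents are stand-ins invented for illustration, and the cleaning step is reduced to a plain whitespace split.

```python
import itertools
import os
import tempfile


def iter_lines(directory, file_names):
    """Yield whitespace-split lines one at a time, never holding a whole file."""
    for file_name in file_names:
        with open(os.path.join(directory, file_name), encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two tiny stand-in files (hypothetical names and contents).
    samples = [
        ("a.txt", "otwierać posiedzenie\nprotokół przyjęty\n"),
        ("b.txt", "minister praca\n"),
    ]
    for name, text in samples:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    stream = iter_lines(tmp_dir, ["a.txt", "b.txt"])
    first_two = list(itertools.islice(stream, 2))  # pulls only two lines
    remaining = list(stream)  # continues from where the stream stopped

print(first_two)
print(remaining)
```

Only the lines that are actually requested get read and tokenized, which is what keeps memory use flat regardless of how many files a kadencja contains.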

## Training the Word2Vec model

The next step is to train the Word2Vec model on one kadencja, to check whether the results are reasonable and to get a rough benchmark of execution speed.

One subtlety is that gensim models accept a restartable iterable, but not a plain generator. A generator can be traversed only once, whereas an iterable object that creates a new iterator on every pass can be traversed many times. Word2Vec goes through the dataset several times: once to build the vocabulary and then at least once per epoch. The class below wraps a generator function in such a restartable iterable object.

In [16]:
class MakeIter():
    def __init__(self, generator_func, **kwargs):
        self.generator_func = generator_func
        self.kwargs = kwargs
    def __iter__(self):
        return self.generator_func(**self.kwargs)
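The difference between a one-shot generator and a restartable wrapper can be seen in a small standalone sketch. `count_up` is a toy generator function standing in for `yield_all_kadencja`, and `Restartable` is a renamed copy of the wrapper pattern above; both names are chosen only for this illustration.

```python
def count_up(limit):
    """A tiny generator function standing in for yield_all_kadencja."""
    for value in range(limit):
        yield value


class Restartable:
    """Calls the generator function afresh on every iter() call."""

    def __init__(self, generator_func, **kwargs):
        self.generator_func = generator_func
        self.kwargs = kwargs

    def __iter__(self):
        return self.generator_func(**self.kwargs)


one_shot = count_up(3)
first_pass = list(one_shot)
second_pass = list(one_shot)  # the generator is already exhausted

wrapped = Restartable(count_up, limit=3)
wrapped_first = list(wrapped)
wrapped_second = list(wrapped)  # a new generator is created each time

print(first_pass, second_pass)
print(wrapped_first, wrapped_second)
```

A second pass over the bare generator yields nothing, so a model that needs several passes would see an empty dataset after the first one.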

In [17]:
path_to_files

'data/takeout-20210110T123042Z-001/Takeout/Dysk/out_sejm/'

Test of iterator:

In [18]:
sentences = MakeIter(yield_all_kadencja, path_to_files = path_to_files, files_names_current_kadencja =files_names_current_kadencja )

In [19]:
for i in sentences:
    print(i)
    break

['sprawozdanie', 'stenograficzny', 'z', 'posiedzenie', 'sejm', 'ustawodawczy', 'z', 'dzień', 'listopad', 'rok']


Now the actual model training. We drop words that occur fewer than 2 times. The size of the output vector is set to 128; multiples of 4 are preferable for performance, as stated in the gensim docs. We chose a fairly large window of 10 because Polish has a relatively free word order. Note that from gensim 4.0 onwards the size parameter is named vector_size.

In [27]:
model = gensim.models.Word2Vec (sentences, size=128, window=10, min_count=2, workers=10)

2021-01-10 22:35:01,046 : INFO : collecting all words and their counts
2021-01-10 22:35:01,051 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-10 22:35:04,512 : INFO : PROGRESS: at sentence #10000, processed 335991 words, keeping 15872 word types
2021-01-10 22:35:08,436 : INFO : PROGRESS: at sentence #20000, processed 695438 words, keeping 22032 word types
2021-01-10 22:35:12,758 : INFO : PROGRESS: at sentence #30000, processed 1085464 words, keeping 26520 word types
2021-01-10 22:35:16,696 : INFO : PROGRESS: at sentence #40000, processed 1437433 words, keeping 29695 word types
2021-01-10 22:35:20,251 : INFO : PROGRESS: at sentence #50000, processed 1780904 words, keeping 32542 word types
2021-01-10 22:35:23,522 : INFO : PROGRESS: at sentence #60000, processed 2164652 words, keeping 35333 word types
2021-01-10 22:35:26,714 : INFO : PROGRESS: at sentence #70000, processed 2513471 words, keeping 37533 word types
2021-01-10 22:35:29,880 : INFO : PROGRESS
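The effect of the min_count and window settings used above can be illustrated without gensim. The sketch below is a simplified picture under stated assumptions, not gensim's implementation: `prune_vocabulary` and `context_words` are hypothetical helpers, and real implementations may additionally subsample frequent words or vary the effective window.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Keep only the words whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


def context_words(sentence, index, window):
    """Return up to `window` words on each side of the word at `index`."""
    left = sentence[max(0, index - window):index]
    right = sentence[index + 1:index + 1 + window]
    return left + right


toy_sentences = [
    ["minister", "praca"],
    ["minister", "rolnictwo"],
    ["praca", "sejm"],
]
kept = prune_vocabulary(toy_sentences, min_count=2)
print(kept)  # {'minister': 2, 'praca': 2}

sentence = ["a", "b", "c", "d", "e"]
print(context_words(sentence, 2, window=1))   # ['b', 'd']
print(context_words(sentence, 2, window=10))  # ['a', 'b', 'd', 'e']
```

With a window of 10, most short parliamentary sentences fall entirely inside the context of each of their words, which suits a language where related words need not be adjacent.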

## Run in batch

The model trained without any problems, so we can move directly to batch processing. In total there are 22 kadencje, so this can take a while. We did not parallelise further, as gensim already trains with multiple worker threads under the hood (the workers parameter).

There are 2 kadencje where the number of sessions is extraordinarily large (above 5000), while the remaining periods have around 1000 each. To make comparisons between periods fairer, we randomly sample 5000 files for the large periods.
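The capping step can be made repeatable by sampling with a fixed seed. This is a minimal sketch of that idea, not the code used for the saved models: the `cap_files` helper, the seed value and the toy file names are assumptions for illustration.

```python
import random


def cap_files(file_names, limit, seed):
    """Return at most `limit` file names, sampled reproducibly when capped."""
    if len(file_names) <= limit:
        return list(file_names)
    # Sorting first makes the result independent of directory listing order.
    return random.Random(seed).sample(sorted(file_names), limit)


# Hypothetical file names standing in for one large kadencja.
toy_files = [f"2001-2005_{n:05d}.txt" for n in range(20)]

sample_a = cap_files(toy_files, limit=5, seed=0)
sample_b = cap_files(toy_files, limit=5, seed=0)
small = cap_files(toy_files[:3], limit=5, seed=0)

print(len(sample_a), sample_a == sample_b, len(small))
```

Using a dedicated `random.Random` instance keeps the sampling independent of any other code that touches the global random state.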

After training, each model is saved to the models/ folder, with one file per period. These models are then loaded in the next notebook.

In [20]:
import random

In [21]:
# Starting at index 21 runs only the last kadencja (2001-2005), as in the output below;
# start the range at 0 to train a model for every kadencja.
for year_idx in range(21, len(years)):
    year = years[year_idx]
    print(f'running year {year}')
    files_names_current_kadencja = [i for i in files_names if year in i]
    print(f'number of files: {len(files_names_current_kadencja)}')
    # Cap very large periods at 5000 randomly chosen files (no seed is set, so the sample differs between runs).
    if len(files_names_current_kadencja) > 5000:
        print('Number of files exceeds 5000. Randomly sampling the files to reduce to that value.')
        files_names_current_kadencja = random.sample(files_names_current_kadencja, 5000)

    sentences = MakeIter(yield_all_kadencja, path_to_files=path_to_files, files_names_current_kadencja=files_names_current_kadencja)

    # Print the first cleaned sentence as a quick sanity check.
    for i in sentences:
        print(i)
        break
    model = gensim.models.Word2Vec(sentences, size=128, window=10, min_count=2, workers=8)

    model_file_name = f'model_{year}.model'
    print(f'saving model to {model_file_name}')
    model.save(f'models/{model_file_name}')

2021-01-11 13:31:24,716 : INFO : collecting all words and their counts
2021-01-11 13:31:24,724 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


running year 2001-2005
number of files: 22236
Number of files exceeds 5000. Randomly sampling the files to reduce to that value.
['otwierać', 'posiedzenie', 'komisja', 'witać', 'wszystek', 'obecny', 'dzisiejszy', 'posiedzenie', 'komisja', 'kontynuować', 'praca', 'nad', 'projekt', 'ustawa', 'prawo', 'spółdzielczy', 'sprawa', 'który', 'chcieć', 'by', 'państwo', 'omówić', 'dzisiejszy', 'posiedzenie', 'komisja']


2021-01-11 13:31:36,251 : INFO : PROGRESS: at sentence #10000, processed 469905 words, keeping 15479 word types
2021-01-11 13:31:45,600 : INFO : PROGRESS: at sentence #20000, processed 957082 words, keeping 20664 word types
2021-01-11 13:31:55,120 : INFO : PROGRESS: at sentence #30000, processed 1443008 words, keeping 24808 word types
2021-01-11 13:32:05,522 : INFO : PROGRESS: at sentence #40000, processed 1968387 words, keeping 28134 word types
2021-01-11 13:32:15,315 : INFO : PROGRESS: at sentence #50000, processed 2448875 words, keeping 30942 word types
2021-01-11 13:32:25,522 : INFO : PROGRESS: at sentence #60000, processed 2955976 words, keeping 33329 word types
2021-01-11 13:32:34,932 : INFO : PROGRESS: at sentence #70000, processed 3486538 words, keeping 35647 word types
2021-01-11 13:32:45,737 : INFO : PROGRESS: at sentence #80000, processed 3961655 words, keeping 37804 word types
2021-01-11 13:32:55,214 : INFO : PROGRESS: at sentence #90000, processed 4407895 words, keeping 39

saving model to model_2001-2005.model


2021-01-11 13:40:08,795 : INFO : saved models/model_2001-2005.model


## Example outputs

The last step in this notebook is to assess the validity of the models by inspecting their outputs empirically. One important remark: although the whole Parliamentary corpus is huge, we divided it into ~20 parts and trained separate embeddings on each part. Embedding models generally need large datasets to train well, which is why the quality of our embeddings will not be outstanding. However, as we demonstrate below, the outputs are aligned with common sense.

In [30]:
# model.wv.vocab

Get most similar words to 'kobieta' and 'mężczyzna':

In [22]:
for i, j in zip(model.wv.most_similar (positive='kobieta',topn=30), model.wv.most_similar (positive='mężczyzna',topn=30)):
    print(i, j)


2021-01-11 14:14:43,283 : INFO : precomputing L2-norms of word weight vectors


('mężczyzna', 0.9025676846504211) ('kobieta', 0.9025676250457764)
('ciąża', 0.6820798516273499) ('wiek', 0.6958349943161011)
('wiek', 0.6750679016113281) ('płeć', 0.6423795223236084)
('egzystencja', 0.6424736976623535) ('równy', 0.6396583318710327)
('bezrobotny', 0.6402149200439453) ('egzystencja', 0.6380705833435059)
('emeryt', 0.630551815032959) ('ciąża', 0.606557309627533)
('płeć', 0.623595118522644) ('emeryt', 0.6050634980201721)
('seksualny', 0.6216563582420349) ('bezrobotny', 0.6043360829353333)
('rencista', 0.6213173866271973) ('rencista', 0.5981982946395874)
('dysfunkcyjny', 0.6186416149139404) ('małżeństwo', 0.59466552734375)
('inwalida', 0.6132781505584717) ('terror', 0.5885027647018433)
('wychowywać', 0.6132751703262329) ('inwalida', 0.5874441266059875)
('młody', 0.608988344669342) ('wychowywać', 0.5849355459213257)
('macierzyństwo', 0.6034623980522156) ('inwalidzki', 0.5836641788482666)
('godność', 0.5933580994606018) ('godność', 0.5746052861213684)
('wdowa', 0.585260331630
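The scores in the output above are cosine similarities between word vectors. The ranking can be reproduced in miniature with plain Python. In this sketch the two-dimensional vectors are made up for illustration and are not taken from the trained model, and the `most_similar` function here is a hypothetical stand-in, not gensim's method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn):
    """Rank all other words by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, chosen only so that the ranking is easy to follow.
toy_vectors = {
    "kobieta": [1.0, 0.0],
    "mężczyzna": [0.8, 0.6],
    "wiek": [0.6, 0.8],
    "ryś": [0.0, 1.0],
}

top = most_similar("kobieta", toy_vectors, topn=2)
print(top)
```

Because the measure depends only on the angle between vectors, two words score highly when they occur in similar contexts, whatever the length of their vectors.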

Some other words:

In [24]:
model.wv.most_similar (positive='katolik',topn=30)

[('wypędzić', 0.8161985874176025),
 ('sowiecki', 0.8075307607650757),
 ('grodno', 0.8066316246986389),
 ('przywódca', 0.803073525428772),
 ('tajlandia', 0.7983343601226807),
 ('łemko', 0.7961394786834717),
 ('kazachstan', 0.7936525344848633),
 ('biskup', 0.7930058240890503),
 ('anglik', 0.7912821769714355),
 ('brytyjczyk', 0.7883634567260742),
 ('okupant', 0.786532461643219),
 ('żyd', 0.7861233949661255),
 ('bośnia', 0.7859318852424622),
 ('birkenau', 0.7840434908866882),
 ('litwin', 0.782244086265564),
 ('nabożeństwo', 0.779995858669281),
 ('szwajcaria', 0.7796066999435425),
 ('arcybiskup', 0.7770388722419739),
 ('metropolita', 0.7761392593383789),
 ('nuncjusz', 0.7718625664710999),
 ('kardynał', 0.7712398767471313),
 ('prawosławie', 0.770678699016571),
 ('normandia', 0.7699627876281738),
 ('hamburg', 0.7693799734115601),
 ('ewangelicki', 0.7691755294799805),
 ('ukrainiec', 0.7686465978622437),
 ('antypolski', 0.7676368951797485),
 ('wileński', 0.765537679195404),
 ('mołotow', 0.76524

In [28]:
w1 = "antysemityzm"
model.wv.most_similar (positive=w1,topn=30)

[('antysemicki', 0.6343580484390259),
 ('odkłamać', 0.5965306758880615),
 ('konotacja', 0.5928347110748291),
 ('superrecenzja', 0.5905281901359558),
 ('pamiątka', 0.5877782106399536),
 ('georóżnorodności', 0.5860491394996643),
 ('znaczeniowy', 0.5841020941734314),
 ('ryś', 0.581831693649292),
 ('kormoran', 0.5804616212844849),
 ('kryptoanaliza', 0.5763731002807617),
 ('trzmielić', 0.5738934278488159),
 ('barwa', 0.5734844207763672),
 ('pejoratywny', 0.5731850266456604),
 ('˝legalizacji', 0.5718041062355042),
 ('folder', 0.5678712129592896),
 ('środowiska˝', 0.5677784085273743),
 ('kaszub', 0.5658748745918274),
 ('˝broni', 0.5655239820480347),
 ('ponownej˝', 0.562920093536377),
 ('sanktuarium', 0.5613067150115967),
 ('gładki', 0.560999870300293),
 ('trójca', 0.5606735348701477),
 ('kuplerstwo', 0.5596817135810852),
 ('celoma', 0.5579859614372253),
 ('kompaktowy', 0.5577108860015869),
 ('automatycznej˝', 0.555688738822937),
 ('nabożeństwo', 0.555428147315979),
 ('publicysta', 0.554144084