# Comparing and checking similar words across different embedding methods

1. Word2Vec methods: both CBOW and skip-gram are available. Following the paper, we use skip-gram for our experiment.
2. For the new experiment, a BERT model is used to create word embeddings. This is a multi-step process based on averaging; the exact computation is described in the report.
3. The semantic similarities are still preserved after the multiple averaging steps.
4. The exact implementation of the embeddings can be found in `src/embedding.py` and `src/bert*`.
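
The multi-step averaging mentioned in item 2 could look roughly like the sketch below. The exact computation is only described in the report, so everything here is an assumption for illustration: the function names are hypothetical, the vectors are toy data, and the two steps (averaging subword pieces within one occurrence, then averaging over all occurrences of the word) are one plausible reading, not the confirmed method.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def word_embedding_from_occurrences(
    occurrences: Sequence[Sequence[Sequence[float]]],
) -> Vector:
    """Hypothetical two-step average (an assumption, not the report's method).

    Step 1: average the subword-piece vectors inside each occurrence.
    Step 2: average the resulting per-occurrence vectors.
    """
    per_occurrence = [mean_vector(pieces) for pieces in occurrences]
    return mean_vector(per_occurrence)


# Toy data: one word seen twice; the first occurrence is split into two pieces.
occurrences = [
    [[1.0, 0.0], [3.0, 2.0]],  # occurrence 1: two subword pieces
    [[0.0, 1.0]],              # occurrence 2: a single piece
]
embedding = word_embedding_from_occurrences(occurrences)
print(embedding)
```

Averaging twice keeps one fixed-size vector per vocabulary word, which is what a static embedding matrix needs.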

# Data preparation

In [1]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
%matplotlib inline
import time

from src.prepare_dataset import TextDataLoader
# initialise TextDataLoader for the 20 Newsgroups data source
# fetch the data from scikit-learn, tokenize it and remove special characters
textsloader = TextDataLoader(source="20newsgroups", train_size=None, test_size=None)
textsloader.load_tokenize_texts("20newsgroups")
# preprocess the data with the following steps:
textsloader.preprocess_texts(length_one_remove=True,
                             punctuation_lower=True,
                             stopwords_filter=True)
# split the data into train, test and validation sets; build the vocabulary from the training set
min_df = 30
textsloader.split_and_create_voca_from_trainset(max_df=0.7, min_df=min_df, stopwords_remove_from_voca=True)

# create the BOW representation for the ETM model
for_lda_model = False
word2id, id2word, train_set, test_set, val_set = textsloader.create_bow_and_savebow_for_each_set(for_lda_model=for_lda_model)

loading texts: ...
train-size after loading: 11314
test-size after loading: 7532
finished load!
start: preprocessing: ...
preprocssing step: remove stopwords
finised: preprocessing!
vocab-size in df: 8496
preprocessing remove stopwords from vocabulary
start creating vocabulary ...
length of the vocabulary: 8496
length word2id list: 8496
length id2word list: 8496
finished: creating vocabulary
train-size-after-all: 11214
test-size-after-all: 7532
validation-size-after-all: 100
test-size-after-all: 11214
test-indices-length: 11214
test-size-after-all: 100
test-indices-length: 100
test-size-after-all: 7532
test-indices-length: 7532
length train-documents-indices : 1150368
length of the vocabulary: 8496


start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow re
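
The cell above filters the vocabulary with `min_df` and `max_df` and then builds a bag-of-words representation. The internals of `split_and_create_voca_from_trainset` are not shown here, so the following is only a minimal sketch under the assumption that `min_df` is an absolute document count and `max_df` a fraction of documents (the convention used by scikit-learn's `CountVectorizer`). The helper names and the toy documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocabulary(docs: List[List[str]], min_df: int, max_df: float) -> List[str]:
    """Keep words whose document frequency lies in [min_df, max_df * n_docs]."""
    # Count each word at most once per document.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    upper = max_df * len(docs)
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= upper)


def to_bow(doc: List[str], word2id: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Return (token ids, counts) for one document, dropping unknown words."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    ids = sorted(counts)
    return ids, [counts[i] for i in ids]


toy_docs = [
    ["hockey", "team", "game", "game"],
    ["hockey", "goalie", "game"],
    ["space", "orbit", "game"],
    ["space", "hockey"],
]
toy_vocab = build_vocabulary(toy_docs, min_df=2, max_df=0.8)
toy_word2id = {w: i for i, w in enumerate(toy_vocab)}
bow = to_bow(toy_docs[0], toy_word2id)
print(toy_vocab, bow)
```

Words that occur in too few documents ("team", "goalie", "orbit") drop out of the toy vocabulary, and the first document is reduced to ids and counts of the remaining words.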

In [2]:
# check the sizes of the different data sets
print(f'Size of the vocabulary after preprocessing: {len(textsloader.vocabulary)}')
print(f'Size of train set: {len(train_set["tokens"])}')
print(f'Size of val set: {len(val_set["tokens"])}')
print(f'Size of test set: {len(test_set["test"]["tokens"])}')

# rebuild the documents after preprocessing; they are lists of words and are used to train the word embeddings
docs_tr, docs_t, docs_v = textsloader.get_docs_in_words_for_each_set()
del docs_t
del docs_v
train_docs_df = pd.DataFrame()
train_docs_df['text-after-preprocessing'] = [' '.join(doc) for doc in docs_tr[:100]]
train_docs_df

Size of the vocabulary after preprocessing: 8496
Size of train set: 11214
Size of val set: 100
Size of test set: 7532


| | text-after-preprocessing |
|---|---|
| 0 | biochem nwu jackson swimming pool defense nntp... |
| 1 | apollo hp red herring police state usa nntp po... |
| 2 | hades coos dartmouth brian hughes installing r... |
| 3 | jaeger buphy bu gregg jaeger rushdie boston un... |
| 4 | king eng umd doug boom computer design lab mar... |
| ... | ... |
| 95 | physics ca campbell pc windows os unix reply p... |
| 96 | ncr jim sharp parts information distribution w... |
| 97 | sera zuma serdar argic nazi germany armenians ... |
| 98 | chips astro temple charlie mathew bible resear... |


# Word embedding with skip-gram

1. Train the word embeddings
2. Save them to a txt file
3. Load them again
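
Steps 2 and 3 can be sketched as follows. The file format (one `word v1 v2 ...` line per word) and the helper names are assumptions made for illustration; the actual format written by `create_and_save_vocab_embedding` is not shown in this notebook.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def save_embeddings(embeddings: Dict[str, List[float]], path: Path) -> None:
    """Write one 'word v1 v2 ...' line per word (assumed format)."""
    with path.open("w", encoding="utf-8") as f:
        for word, vec in embeddings.items():
            f.write(word + " " + " ".join(repr(x) for x in vec) + "\n")


def load_embeddings(path: Path) -> Dict[str, List[float]]:
    """Read a file written by save_embeddings back into a dict."""
    embeddings: Dict[str, List[float]] = {}
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            word, *values = line.split()
            embeddings[word] = [float(x) for x in values]
    return embeddings


toy_embeddings = {"attendance": [0.25, -0.5], "arena": [0.125, 0.75]}
with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "toy_embeddings.txt"
    save_embeddings(toy_embeddings, file_path)
    restored = load_embeddings(file_path)
print(restored == toy_embeddings)
```

Writing the values with `repr` keeps the full float precision, so the round trip returns exactly the vectors that were saved.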

In [3]:
from src.embedding import WordEmbeddingCreator

In [4]:
vocab = list(word2id.keys()) # vocabulary after preprocessing and creating bow
word2vec_model = 'skipgram'

save_path = Path.cwd() / f'prepared_data/min_df_{min_df}'
figures_path = Path.cwd() / f'figures/min_df_{min_df}'
save_path.mkdir(parents=True, exist_ok=True)
figures_path.mkdir(parents=True, exist_ok=True)

if word2vec_model != "bert":
    # only for cbow and skipgram model
    wb_creator = WordEmbeddingCreator(model_name=word2vec_model, documents=docs_tr, save_path=save_path)
    wb_creator.train(min_count=0, embedding_size=300)
    wb_creator.create_and_save_vocab_embedding(vocab, save_path)

train word-embedding with skipgram


  2%|▏         | 186/8496 [00:00<00:04, 1859.78it/s]

length of vocabulary from word-embedding with skipgram: 8496
length of vocabulary after creating BOW: 8496


100%|██████████| 8496/8496 [00:03<00:00, 2750.96it/s]
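
Skip-gram trains the model to predict the surrounding context words from a center word, whereas CBOW predicts the center word from its context. The sketch below only shows how (center, context) training pairs can be generated from a token list. It is a simplified illustration with a hypothetical helper name and a fixed window, not the training procedure used inside `WordEmbeddingCreator`.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["hockey", "team", "arena"], window=1)
print(pairs)
```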


In [5]:
# print a word together with the first five components of its own vector
print(f'word: {vocab[0]} - vector: {list(wb_creator.model.wv[vocab[0]])[:5]}')

word: attendance - vector: [0.018075468, -0.12066888, -0.0019907411, 0.12662512, -0.07401212]


# Comparing similar words from Word2Vec (gensim and our own cosine implementation)

In [6]:
v = vocab[0]
vec = list(wb_creator.model.wv[v])
print(f'word-embedding of the word-- {v}: ')
print(f'dim of vector: {len(vec)}')

word-embedding of the word-- attendance: 
dim of vector: 300


In [7]:
# using gensim function
wb_creator.find_most_similar_words(n_neighbor=10, word=v)

[('canadians', 0.9249528050422668),
 ('franchise', 0.9242219924926758),
 ('opportunities', 0.9136689901351929),
 ('pennant', 0.9123069047927856),
 ('sixth', 0.9095301628112793),
 ('clubs', 0.9085661172866821),
 ('goaltender', 0.9073867201805115),
 ('basketball', 0.9072387218475342),
 ('arena', 0.9059695601463318),
 ('goalies', 0.905768096446991)]

In [8]:
# using self-implemented cosine
wb_creator.find_similar_words_self_implemented(10, vocab, v)

{'canadians': 0.92495286,
 'franchise': 0.92422205,
 'opportunities': 0.91366893,
 'pennant': 0.91230685,
 'sixth': 0.9095302,
 'clubs': 0.9085662,
 'goaltender': 0.9073868,
 'basketball': 0.9072388,
 'arena': 0.9059696,
 'goalies': 0.90576816}
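
The two results above agree up to print precision. A self-implemented nearest-neighbour search by cosine similarity can be sketched as below. This is a pure-Python illustration with hypothetical function names and toy vectors, not the code of `find_similar_words_self_implemented`.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(word: str, embeddings: Dict[str, List[float]], n: int) -> Dict[str, float]:
    """Return the n words with the highest cosine similarity to `word`."""
    query = embeddings[word]
    scores = {w: cosine(query, vec) for w, vec in embeddings.items() if w != word}
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]
    return dict(top)


toy_vectors = {
    "attendance": [1.0, 0.0],
    "arena": [2.0, 0.0],
    "snow": [0.0, 3.0],
    "clubs": [1.0, 1.0],
}
neighbours = most_similar("attendance", toy_vectors, n=2)
print(neighbours)
```

Cosine similarity ignores vector length, so "arena" (same direction, twice the length) ranks first in the toy data.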

# Comparing similar words between Word2Vec and BERT

In [9]:
# this file contains only the BERT vocabulary, not the embeddings
with open('prepared_data/bert_vocab.txt') as f:
    lines = f.readlines()
read_bert_vocab = [line.rstrip("\n") for line in lines]
print(len(read_bert_vocab))

104008


In [10]:
from src.embedding import BertEmbedding
bert_eb = BertEmbedding('prepared_data')  # directory that contains the file bert_vocab_embedding.txt
bert_eb.get_bert_embeddings(vocab)

read word-embeddings with bert from file...
something wrong at the embedding.py/read_prefitted_embeddings
use for testing bert
bert-embedding ready!


In [11]:
print(bert_eb.bert_embeddings.shape)

(8473, 768)


In [12]:
# check the consistency of the vocabularies: print every word that has no BERT embedding
for w in vocab:
    if w not in bert_eb.bert_vocab:
        print(w)

uu
gotta
aa
tt
ee
bb
ss
ll
cc
ccc
iii
pp
gonna
aaa
dd
mmm
xxx
oo
mm
wanna
rr
ii
xx
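
The check prints 23 words, which matches the shapes seen so far: 8496 vocabulary words minus 23 missing ones gives the 8473 rows of `bert_embeddings`. When the same check is needed as data rather than as printed output, the vocabulary can be split into covered and missing words, as in this sketch. The helper name and the toy word lists are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def split_by_coverage(vocab: Iterable[str], reference: Set[str]) -> Tuple[List[str], List[str]]:
    """Split `vocab` into words found in `reference` and words missing from it."""
    covered: List[str] = []
    missing: List[str] = []
    for word in vocab:
        (covered if word in reference else missing).append(word)
    return covered, missing


toy_vocab = ["attendance", "gotta", "arena", "uu"]
toy_reference = {"attendance", "arena", "movie"}  # a set gives fast lookups
covered, missing = split_by_coverage(toy_vocab, toy_reference)
print(covered, missing)
```

Both lists keep the original order of the vocabulary, so the covered words can still be aligned with the rows of an embedding matrix.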


In [13]:
bert_eb.find_similar_words(v, 10, vocab)

{'im': 0.7970070481718493,
 'movie': 0.7721749805193895,
 'favorite': 0.7692880935677894,
 'tad': 0.7638804330223218,
 'saturday': 0.7591672712979697,
 'lit': 0.7589368140810222,
 'wv': 0.7582765865219752,
 'family': 0.7568920617426359,
 'coincidence': 0.7543518625683001,
 'snow': 0.7526957799229693}
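
For the word `attendance`, the two neighbour lists share no word. One simple way to quantify this is the Jaccard overlap of the two lists. The metric is a choice made here for illustration and is not part of the original experiment; the word lists are copied from the outputs above.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


word2vec_neighbours = ["canadians", "franchise", "opportunities", "pennant", "sixth",
                       "clubs", "goaltender", "basketball", "arena", "goalies"]
bert_neighbours = ["im", "movie", "favorite", "tad", "saturday",
                   "lit", "wv", "family", "coincidence", "snow"]
overlap = jaccard(word2vec_neighbours, bert_neighbours)
print(overlap)  # 0.0, the lists are disjoint
```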