# Vergleichen und Kontrollieren die ähnlichen Wörter von unterschiedlichen Embeddings-Methoden

1. Word2Vec-Methoden: CBOW und Skipgram möglich. Nach dem Paper benutzen wir Skipgram für unser Experiment
2. Für das neue Experiment wird BERT-Modell verwendet, um Wortembeddings zu erstellen. Das ist ein mehrschritten-Prozess: Durchschnitt. Die genaue Berechnung wird in dem Bericht beschrieben
3. Durch mehrere Schritten Durchschnitt sind die semantischen Ähnlichkeiten noch bebeihalten. 
4. Die genaue Implementierung für Embeddings in dem `src\embedding.py` und `src/bert*` zu sehen

# Datenvorbereitung

Da wir Skipgramm und Bert vergleichen möchten, stellen wir in dem textsloader.preprocess, da Bert später auch mit benutzt wird:
1. use_bert_embedding = True
2. Ziel: damit wir das Vocabular konsitent haben können

In [1]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
%matplotlib inline
import time

from src.prepare_dataset import TextDataLoader
# init TextDataLoader für die Datenquelle 20 News Groups
# Daten abrufen vom Sklearn, tokenisieren und besondere Charaktern entfernen
textsloader = TextDataLoader(source="20newsgroups", train_size=None, test_size=None)
textsloader.load_tokenize_texts("20newsgroups")
# Vorverarbeitung von Daten mit folgenden Schritten:
textsloader.preprocess_texts(length_one_remove=True, 
                             punctuation_lower = True, 
                             stopwords_filter = True,
                             use_bert_embedding = True)
# Daten zerlegen für Train, Test und Validation. Erstellen Vocabular aus dem Trainset
min_df= 30
textsloader.split_and_create_voca_from_trainset(max_df=0.7, 
                                                min_df=min_df, 
                                                stopwords_remove_from_voca=True)

# Erstellen BOW-Repräsentation für ETM Modell
for_lda_model = False
word2id, id2word, train_set, test_set, val_set = textsloader.create_bow_and_savebow_for_each_set(for_lda_model=for_lda_model)

loading texts: ...
train-size after loading: 11314
test-size after loading: 7532
finished load!
start: preprocessing: ...
preprocessing step: remove stopwords
will use bert embedding, so delete words from not_in_bert_vocab.txt
finised: preprocessing!
vocab-size in df: 8473
preprocessing remove stopwords from vocabulary
start creating vocabulary ...
length of the vocabulary: 8473
length word2id list: 8473
length id2word list: 8473
finished: creating vocabulary
train-size-after-all: 11214
test-size-after-all: 7532
validation-size-after-all: 100
test-size-after-all: 11214
test-indices-length: 11214
test-size-after-all: 100
test-indices-length: 100
test-size-after-all: 7532
test-indices-length: 7532
length train-documents-indices : 1146648
length of the vocabulary: 8473


start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow 

In [2]:
# Kontrollieren die Größen von verschiedenen Datensätzen
print(f'Size of the vocabulary after prprocessing ist: {len(textsloader.vocabulary)}')
print(f'Size of train set: {len(train_set["tokens"])}')
print(f'Size of val set: {len(val_set["tokens"])}')
print(f'Size of test set: {len(test_set["test"]["tokens"])}')

# re-erstellen von Dokumenten nach der Vorverarbeitungen. Die Dokumenten sind in Wörtern und werden für Word-Embedding Training benutzt
docs_tr, docs_t, docs_v = textsloader.get_docs_in_words_for_each_set()
del docs_t
del docs_v
train_docs_df = pd.DataFrame()
train_docs_df['text-after-preprocessing'] = [' '.join(doc) for doc in docs_tr[:100]]
train_docs_df

Size of the vocabulary after prprocessing ist: 8473
Size of train set: 11214
Size of val set: 100
Size of test set: 7532


Unnamed: 0,text-after-preprocessing
0,biochem nwu jackson swimming pool defense nntp...
1,apollo hp red herring police state usa nntp po...
2,hades coos dartmouth brian hughes installing r...
3,jaeger buphy bu gregg jaeger rushdie boston un...
4,king eng umd doug boom computer design lab mar...
...,...
95,physics ca campbell pc windows os unix reply p...
96,ncr jim sharp parts information distribution w...
97,sera zuma serdar argic nazi germany armenians ...
98,chips astro temple charlie mathew bible resear...


# WordEmbedding mit Skipgramm

1. Trainieren Word-Embedding 
2. Speichern in txt File
3. Wiederaufrufen

In [3]:
from src.embedding import WordEmbeddingCreator

In [4]:
vocab = list(word2id.keys()) # vocabulary after preprocessing and creating bow
word2vec_model = 'skipgram'

save_path = Path.joinpath(Path.cwd(), f'prepared_data/min_df_{min_df}')
figures_path = Path.joinpath(Path.cwd(), f'figures/min_df_{min_df}')
Path(save_path).mkdir(parents=True, exist_ok=True)
Path(figures_path).mkdir(parents=True, exist_ok=True)

if word2vec_model!="bert":
    # only for cbow and skipgram model
    wb_creator = WordEmbeddingCreator(model_name=word2vec_model, documents = docs_tr, save_path= save_path)
    wb_creator.train(min_count=0, embedding_size= 300)
    wb_creator.create_and_save_vocab_embedding(vocab, save_path)

train word-embedding with skipgram


  3%|▎         | 290/8473 [00:00<00:02, 2865.55it/s]

length of vocabulary from word-embedding with skipgram: 8473
length of vocabulary after creating BOW: 8473


100%|██████████| 8473/8473 [00:03<00:00, 2302.49it/s]


In [5]:
print(f'word: {vocab[1]} - vector: {list(wb_creator.model.wv.__getitem__(vocab[0]))[:5]} ')

word: sfu - vector: [-0.0062408987, 0.027312417, -0.030454846, 0.039604533, 0.024855938] 


# Vergleichen ähnliche Wörter von Word2Vec (gensim und eigene Cosine)

In [34]:
v = vocab[9]
vec = list(wb_creator.model.wv.__getitem__(v))
print(f'word-embedding of the word-- {v}: ')
print(f'dim of vector: {len(vec)}')

word-embedding of the word-- believing: 
dim of vector: 300


In [35]:
# using gensim function
wb_creator.find_most_similar_words(n_neighbor=10, word=v)

[('sinner', 0.9474649429321289),
 ('loves', 0.934866726398468),
 ('sinful', 0.9320639371871948),
 ('believer', 0.9307171106338501),
 ('judgement', 0.9187048673629761),
 ('damnation', 0.9175612330436707),
 ('deeds', 0.9131304025650024),
 ('savior', 0.9112656116485596),
 ('baptism', 0.9112151861190796),
 ('eternal', 0.9087705016136169)]

In [38]:
# using self-implemented cosine
sw = wb_creator.find_similar_words_self_implemented(10, vocab, v)
df = pd.DataFrame()
df['Ähnliches Wort'] = list(sw.keys())
df['Cosinus-Ähnlichkeit'] = list(sw.values())
print(df.to_latex(index=False))

\begin{tabular}{lr}
\toprule
Ähnliches Wort &  Cosinus-Ähnlichkeit \\
\midrule
        sinner &             0.947465 \\
         loves &             0.934867 \\
        sinful &             0.932064 \\
      believer &             0.930717 \\
     judgement &             0.918705 \\
     damnation &             0.917561 \\
         deeds &             0.913130 \\
        savior &             0.911266 \\
       baptism &             0.911215 \\
       eternal &             0.908771 \\
\bottomrule
\end{tabular}



# Vergleich ähnliche Wörter zwischen Word2Vec und Bert

In [9]:
# only bert_vocab, not embeddings in this file
with open('prepared_data/bert_vocab.txt') as f:
    lines = f.readlines()
readed_bert_vocab = [e.split("\n")[0] for e in lines]
print(len(readed_bert_vocab))

104008


In [10]:
from src.embedding import BertEmbedding
bert_eb = BertEmbedding('prepared_data') #directory, where the txt.file of bert_vocab_embedding.txt ist
bert_eb.get_bert_embeddings(vocab)
print(bert_eb.bert_embeddings.shape)

read word-embeddings with bert from file...
bert-embedding ready!


In [12]:
# controll the consitence of vocabulary
for w in vocab:
    if w not in bert_eb.bert_vocab:
        print(w)

In [39]:
sw = bert_eb.find_similar_words(v, 10, vocab)
df = pd.DataFrame()
df['Ähnliches Wort'] = list(sw.keys())
df['Cosinus-Ähnlichkeit'] = list(sw.values())
print(df.to_latex(index=False))

\begin{tabular}{lr}
\toprule
Ähnliches Wort &  Cosinus-Ähnlichkeit \\
\midrule
        belief &             0.895478 \\
      believed &             0.886011 \\
      believes &             0.845822 \\
     accepting &             0.838407 \\
       beliefs &             0.831869 \\
     convinced &             0.817502 \\
     rejecting &             0.816063 \\
         faith &             0.815969 \\
      claiming &             0.806874 \\
        accept &             0.799588 \\
\bottomrule
\end{tabular}

