# Vergleichen und Kontrollieren die ähnlichen Wörter von unterschiedlichen Embeddings-Methoden

1. Word2Vec-Methoden: CBOW und Skipgram möglich. Nach dem Paper benutzen wir Skipgram für unser Experiment
2. Für das neue Experiment wird BERT-Modell verwendet, um Wortembeddings zu erstellen. Das ist ein mehrschritten-Prozess: Durchschnitt. Die genaue Berechnung wird in dem Bericht beschrieben
3. Durch mehrere Schritten Durchschnitt sind die semantischen Ähnlichkeiten noch bebeihalten. 
4. Die genaue Implementierung für Embeddings in dem `src\embedding.py` und `src/bert*` zu sehen

# Datenvorbereitung

Da wir Skipgramm und Bert vergleichen möchten, stellen wir in dem textsloader.preprocess, da Bert später auch mit benutzt wird:
1. use_bert_embedding = True
2. Ziel: damit wir das Vocabular konsitent haben können

In [1]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
%matplotlib inline
import time

from src.prepare_dataset import TextDataLoader
# init TextDataLoader für die Datenquelle 20 News Groups
# Daten abrufen vom Sklearn, tokenisieren und besondere Charaktern entfernen
textsloader = TextDataLoader(source="20newsgroups", train_size=None, test_size=None)
textsloader.load_tokenize_texts("20newsgroups")
# Vorverarbeitung von Daten mit folgenden Schritten:
textsloader.preprocess_texts(length_one_remove=True, 
                             punctuation_lower = True, 
                             stopwords_filter = True,
                             use_bert_embedding = True)
# Daten zerlegen für Train, Test und Validation. Erstellen Vocabular aus dem Trainset
min_df= 100
textsloader.split_and_create_voca_from_trainset(max_df=0.7, 
                                                min_df=min_df, 
                                                stopwords_remove_from_voca=True)

# Erstellen BOW-Repräsentation für ETM Modell
for_lda_model = False
word2id, id2word, train_set, test_set, val_set = textsloader.create_bow_and_savebow_for_each_set(for_lda_model=for_lda_model)

loading texts: ...
train-size after loading: 11314
test-size after loading: 7532
finished load!
start: preprocessing: ...
preprocessing step: remove stopwords
will use bert embedding, so delete words from not_in_bert_vocab.txt
finised: preprocessing!
vocab-size in df: 3095
preprocessing remove stopwords from vocabulary
start creating vocabulary ...
length of the vocabulary: 3095
length word2id list: 3095
length id2word list: 3095
finished: creating vocabulary
save docs in txt...
save docs finished
train-size-after-all: 11214
test-size-after-all: 7532
validation-size-after-all: 100
test-size-after-all: 11214
test-indices-length: 11214
test-size-after-all: 100
test-indices-length: 100
test-size-after-all: 7532
test-indices-length: 7532
length train-documents-indices : 893379
length of the vocabulary: 3095


start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised creating bow input!

start: creating bow representation...
finised cre

In [2]:
# Kontrollieren die Größen von verschiedenen Datensätzen
print(f'Size of the vocabulary after prprocessing ist: {len(textsloader.vocabulary)}')
print(f'Size of train set: {len(train_set["tokens"])}')
print(f'Size of val set: {len(val_set["tokens"])}')
print(f'Size of test set: {len(test_set["test"]["tokens"])}')

# re-erstellen von Dokumenten nach der Vorverarbeitungen. Die Dokumenten sind in Wörtern und werden für Word-Embedding Training benutzt
docs_tr, docs_t, docs_v = textsloader.get_docs_in_words_for_each_set()
del docs_t
del docs_v
train_docs_df = pd.DataFrame()
train_docs_df['text-after-preprocessing'] = [' '.join(doc) for doc in docs_tr[:100]]
train_docs_df

Size of the vocabulary after prprocessing ist: 3095
Size of train set: 11214
Size of val set: 100
Size of test set: 7532
save docs in txt...
save docs finished


Unnamed: 0,text-after-preprocessing
0,jackson defense nntp posting host university i...
1,apollo hp red police state usa nntp posting ho...
2,dartmouth brian hughes installing ram quadra r...
3,bu boston university physics department articl...
4,king eng umd doug computer design lab maryland...
...,...
95,physics ca pc windows os unix reply physics ca...
96,ncr jim parts information distribution world n...
97,sera zuma serdar argic nazi germany armenians ...
98,chips astro temple bible research temple unive...


# WordEmbedding mit Skipgramm

1. Trainieren Word-Embedding 
2. Speichern in txt File
3. Wiederaufrufen

In [3]:
from src.embedding import WordEmbeddingCreator

In [4]:
vocab = list(word2id.keys()) # vocabulary after preprocessing and creating bow
word2vec_model = 'skipgram'

save_path = Path.joinpath(Path.cwd(), f'prepared_data/min_df_{min_df}')
figures_path = Path.joinpath(Path.cwd(), f'figures/min_df_{min_df}')
Path(save_path).mkdir(parents=True, exist_ok=True)
Path(figures_path).mkdir(parents=True, exist_ok=True)

if word2vec_model!="bert":
    # only for cbow and skipgram model
    wb_creator = WordEmbeddingCreator(model_name=word2vec_model, documents = docs_tr, save_path= save_path)
    wb_creator.train(min_count=0, embedding_size= 300)
    wb_creator.create_and_save_vocab_embedding(vocab, save_path)

train word-embedding with skipgram


 13%|█▎        | 401/3095 [00:00<00:00, 4000.86it/s]

length of vocabulary from word-embedding with skipgram: 3095
length of vocabulary after creating BOW: 3095


100%|██████████| 3095/3095 [00:01<00:00, 2778.57it/s]


In [5]:
print(f'word: {vocab[1]} - vector: {list(wb_creator.model.wv.__getitem__(vocab[0]))[:5]} ')

word: report - vector: [0.09772752, -0.014075972, -0.2555263, -0.056844436, -0.21444546] 


# Vergleichen ähnliche Wörter von Word2Vec (gensim und eigene Cosine)

In [6]:
v = vocab[9]
vec = list(wb_creator.model.wv.__getitem__(v))
print(f'word-embedding of the word-- {v}: ')
print(f'dim of vector: {len(vec)}')

word-embedding of the word-- supported: 
dim of vector: 300


In [7]:
# using gensim function
wb_creator.find_most_similar_words(n_neighbor=10, word=v)

[('supports', 0.8612833023071289),
 ('implemented', 0.7653976082801819),
 ('multi', 0.7261040210723877),
 ('provided', 0.7183036804199219),
 ('tools', 0.7166996598243713),
 ('implementation', 0.7062899470329285),
 ('implement', 0.6892890930175781),
 ('capabilities', 0.6522724628448486),
 ('requirements', 0.6459758281707764),
 ('multiple', 0.6458479762077332)]

In [8]:
# using self-implemented cosine
sw = wb_creator.find_similar_words_self_implemented(10, vocab, v)
df = pd.DataFrame()
df['Ähnliches Wort'] = list(sw.keys())
df['Cosinus-Ähnlichkeit'] = list(sw.values())
print(df.to_latex(index=False))
del sw
del df

\begin{tabular}{lr}
\toprule
Ähnliches Wort &  Cosinus-Ähnlichkeit \\
\midrule
      supports &             0.861283 \\
   implemented &             0.765398 \\
         multi &             0.726104 \\
      provided &             0.718304 \\
         tools &             0.716700 \\
implementation &             0.706290 \\
     implement &             0.689289 \\
  capabilities &             0.652272 \\
  requirements &             0.645976 \\
      multiple &             0.645848 \\
\bottomrule
\end{tabular}



In [9]:
#del wb_creator
del textsloader
del word2id
del id2word
del train_set
del test_set
del val_set

# Vergleich ähnliche Wörter zwischen Word2Vec und Bert

In [10]:
# only bert_vocab, not embeddings in this file
with open('prepared_data/bert_vocab.txt') as f:
    lines = f.readlines()
readed_bert_vocab = [e.split("\n")[0] for e in lines]
print(len(readed_bert_vocab))

104008


In [11]:
from src.embedding import BertEmbedding
bert_eb = BertEmbedding('prepared_data') #directory, where the txt.file of bert_vocab_embedding.txt ist
bert_eb.get_bert_embeddings(vocab)
print(bert_eb.bert_embeddings.shape)

read word-embeddings with bert from file...
prepared_data/bert_vocab_embedding.txt
bert-embedding ready!
(3095, 768)


In [12]:
#my vocabular
print(len(vocab))

3095


In [13]:
print(f'find similar words for {v}: \n')
sw = bert_eb.find_similar_words(v, 10, vocab)
df = pd.DataFrame()
df['Ähnliches Wort'] = list(sw.keys())
df['Cosinus-Ähnlichkeit'] = list(sw.values())
print(df.to_latex(index=False))

find similar words for supported: 

\begin{tabular}{lr}
\toprule
Ähnliches Wort &  Cosinus-Ähnlichkeit \\
\midrule
       support &             0.933962 \\
    supporting &             0.917919 \\
      supports &             0.881589 \\
      provided &             0.860958 \\
   implemented &             0.860854 \\
      included &             0.856517 \\
      accepted &             0.845091 \\
     installed &             0.843016 \\
     protected &             0.835109 \\
      approved &             0.834749 \\
\bottomrule
\end{tabular}



In [14]:
# controll the consitence of vocabulary
for w in vocab:
    if w not in bert_eb.bert_vocab:
        print(w)