# Comparing and checking the similar words produced by different embedding methods

1. Word2Vec offers two variants, CBOW and skip-gram. Following the paper, we use skip-gram for our experiment.
2. For the new experiment, a BERT model is used to create word embeddings. This is a multi-step process based on averaging; the exact computation is described in the report.
3. The semantic similarities are still preserved after the multiple averaging steps.
4. The exact implementation of the embeddings can be found in `src/embedding.py` and `src/bert*`.
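
The multi-step averaging mentioned in the list above can be pictured roughly as follows. This is only a sketch under assumptions: the exact computation is the one described in the report, and the helper names `average_subwords` and `average_occurrences` as well as the toy vectors are hypothetical, not taken from `src/bert*`.

```python
import numpy as np

def average_subwords(subword_vectors):
    """Step 1 (assumed): average the vectors of the sub-word pieces of one word occurrence."""
    return np.mean(np.asarray(subword_vectors, dtype=float), axis=0)

def average_occurrences(occurrence_vectors):
    """Step 2 (assumed): average the vectors of one word over all of its occurrences."""
    return np.mean(np.asarray(occurrence_vectors, dtype=float), axis=0)

# Toy 3-dimensional vectors standing in for BERT hidden states (made-up numbers).
occurrence_1 = average_subwords([[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])  # word split into two pieces
occurrence_2 = average_subwords([[0.0, 1.0, 1.0]])                   # word kept as one piece
word_vector = average_occurrences([occurrence_1, occurrence_2])
print(word_vector)
```

Each averaging step returns a vector of the same dimension as its inputs, so the final word vector can be compared with cosine similarity just like a Word2Vec vector.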

# Data preparation

Since we want to compare skip-gram and BERT, we set the following option in `textsloader.preprocess_texts`, because BERT is also used later on:
1. `use_bert_embedding = True`
2. Goal: keep the vocabulary consistent between both methods.
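
The preprocessing log further down reports that words listed in `not_in_bert_vocab.txt` are deleted when this option is set. How `preprocess_texts` does this internally is not shown here, so the following is just a minimal sketch of the idea; the helper `drop_unknown_words` and the toy data are hypothetical.

```python
def drop_unknown_words(docs, not_in_bert):
    """Remove every token that appears in the list of words BERT has no embedding for."""
    blocked = set(not_in_bert)  # set lookup keeps the filter fast for large vocabularies
    return [[tok for tok in doc if tok not in blocked] for doc in docs]

# Made-up stand-ins for the blocklist and for two tokenized documents.
not_in_bert = ["bricklin", "lerxst"]
docs = [["car", "bricklin", "door"], ["engine", "lerxst"]]
filtered = drop_unknown_words(docs, not_in_bert)
print(filtered)
```

After such a filter, every remaining token has an embedding in both methods, which is what makes the later neighbour lists comparable.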

In [6]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
%matplotlib inline
import time

from src.prepare_dataset import TextDataLoader
# init TextDataLoader for the 20 Newsgroups data source
# fetch the data from sklearn, tokenize it and remove special characters
textsloader = TextDataLoader(source="20newsgroups", train_size=None, test_size=None)
textsloader.load_tokenize_texts("20newsgroups")
# preprocess the data with the following options:
textsloader.preprocess_texts(length_one_remove=True,
                             punctuation_lower=True,
                             stopwords_filter=False,
                             use_bert_embedding=True)
# split the data into train, test and validation sets; build the vocabulary from the train set
min_df = 2
textsloader.split_and_create_voca_from_trainset(max_df=0.7,
                                                min_df=min_df,
                                                stopwords_remove_from_voca=False)

# create the BOW representation for the ETM model
for_lda_model = False
word2id, id2word, train_set, test_set, val_set = textsloader.create_bow_and_savebow_for_each_set(for_lda_model=for_lda_model)

loading texts: ...
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





train-size after loading: 11314
test-size after loading: 7532
finished load!
start: preprocessing: ...
will use bert embedding, so delete words from not_in_bert_vocab.txt
finised: preprocessing!
vocab-size in df: 57723
start creating vocabulary ...
length of 

In [7]:
# check the sizes of the different data sets
print(f'Size of the vocabulary after prprocessing ist: {len(textsloader.vocabulary)}')
print(f'Size of train set: {len(train_set["tokens"])}')
print(f'Size of val set: {len(val_set["tokens"])}')
print(f'Size of test set: {len(test_set["test"]["tokens"])}')

# rebuild the documents after preprocessing; each document is a list of words and is used to train the word embeddings
docs_tr, docs_t, docs_v = textsloader.get_docs_in_words_for_each_set()
del docs_t
del docs_v
train_docs_df = pd.DataFrame()
train_docs_df['text-after-preprocessing'] = [' '.join(doc) for doc in docs_tr[:100]]
train_docs_df

Size of the vocabulary after prprocessing ist: 54715
Size of train set: 11214
Size of val set: 100
Size of test set: 7532
save docs in txt...
save docs finished


| | text-after-preprocessing |
|---|---|
| 0 | biochem nwu edu jackson re swimming pool defen... |
| 1 | goykhman apollo hp com red herring re welcome ... |
| 2 | hades coos dartmouth edu brian hughes re insta... |
| 3 | jaeger buphy bu edu gregg jaeger re inimitable... |
| 4 | sysmgr king eng umd edu doug mohney re boom wh... |
| ... | ... |
| 95 | rcampbel weejordy physics mun ca roderick camp... |
| 96 | jiml strauss ftcollinsco ncr com jim need shar... |
| 97 | sera zuma uucp serdar argic nazi germany armen... |
| 98 | chips astro ocis temple edu charlie mathew bib... |


# Word embedding with skip-gram

1. Train the word embedding
2. Save it to a txt file
3. Load it again
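
Steps 2 and 3 can be illustrated with a small round trip through a text file. The file layout used here (one `word v1 v2 ...` line per word) is an assumption for illustration, and `save_embeddings` and `load_embeddings` are hypothetical helpers; the format actually written by `WordEmbeddingCreator` may differ.

```python
import tempfile
from pathlib import Path

def save_embeddings(embeddings, path):
    """Write one 'word v1 v2 ...' line per word (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, vector in embeddings.items():
            f.write(word + " " + " ".join(repr(float(x)) for x in vector) + "\n")

def load_embeddings(path):
    """Read the file back into a {word: [floats]} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            embeddings[word] = [float(x) for x in values]
    return embeddings

# Toy embeddings with made-up numbers.
toy = {"car": [0.1, 0.2, 0.3], "door": [0.4, 0.5, 0.6]}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_embeddings.txt"
    save_embeddings(toy, path)
    reloaded = load_embeddings(path)
print(reloaded == toy)
```

Writing the values with `repr` keeps the full float precision, so the reloaded vectors are identical to the saved ones.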

In [8]:
from src.embedding import WordEmbeddingCreator

In [None]:
vocab = list(word2id.keys()) # vocabulary after preprocessing and creating bow
word2vec_model = 'skipgram'

save_path = Path.joinpath(Path.cwd(), f'prepared_data/min_df_{min_df}')
figures_path = Path.joinpath(Path.cwd(), f'figures/min_df_{min_df}')
Path(save_path).mkdir(parents=True, exist_ok=True)
Path(figures_path).mkdir(parents=True, exist_ok=True)

if word2vec_model!="bert":
    # only for cbow and skipgram model
    wb_creator = WordEmbeddingCreator(model_name=word2vec_model, documents = docs_tr, save_path= save_path)
    wb_creator.train(min_count=0, embedding_size= 300)
    wb_creator.create_and_save_vocab_embedding(vocab, save_path)
del docs_tr

train word-embedding with skipgram


  0%|          | 48/54715 [00:00<01:56, 468.97it/s]

length of vocabulary from word-embedding with skipgram: 54715
length of vocabulary after creating BOW: 54715


  8%|▊         | 4290/54715 [00:09<01:34, 533.00it/s]

In [None]:
print(f'word: {vocab[0]} - vector: {list(wb_creator.model.wv[vocab[0]])[:5]}')

# Comparing similar words from Word2Vec (gensim vs. our own cosine implementation)
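
For orientation, a self-implemented cosine search can look like the sketch below. It is not the code of `find_similar_words_self_implemented`; the function `most_similar` and the toy vectors are hypothetical, and the sketch assumes that no embedding is the zero vector.

```python
import numpy as np

def most_similar(query, vocab, vectors, n_neighbor=10):
    """Return {word: cosine similarity} for the n_neighbor nearest words, excluding the query."""
    matrix = np.asarray(vectors, dtype=float)
    # Normalize each row to unit length so that a dot product equals the cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    # Sort by descending similarity and skip the query word itself.
    ranked = [(vocab[i], float(sims[i])) for i in np.argsort(-sims) if vocab[i] != query]
    return dict(ranked[:n_neighbor])

# Toy 2-dimensional embeddings with made-up numbers.
toy_vocab = ["car", "truck", "banana"]
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
neighbours = most_similar("car", toy_vocab, toy_vectors, n_neighbor=2)
print(neighbours)
```

If the gensim result and the self-implemented result list the same words with the same scores, the cosine implementation can be trusted for the BERT embeddings as well.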

In [None]:
v = vocab[9]
vec = list(wb_creator.model.wv[v])
print(f'word-embedding of the word-- {v}: ')
print(f'dim of vector: {len(vec)}')

In [None]:
# using gensim function
wb_creator.find_most_similar_words(n_neighbor=10, word=v)

In [None]:
# using self-implemented cosine
sw = wb_creator.find_similar_words_self_implemented(10, vocab, v)
df = pd.DataFrame()
df['Ähnliches Wort'] = list(sw.keys())
df['Cosinus-Ähnlichkeit'] = list(sw.values())
print(df.to_latex(index=False))
del sw
del df

In [None]:
#del wb_creator
del textsloader
del word2id
del id2word
del train_set
del test_set
del val_set

# Comparing similar words between Word2Vec and BERT
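
The notebook compares the two neighbour lists by inspection. If a single number is wanted, one possible option is the Jaccard overlap of the two lists. This measure is an assumption added for illustration, not part of the original experiment; `neighbour_overlap` and the example words are hypothetical.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap of two neighbour lists: |A and B| / |A or B|."""
    set_a, set_b = set(neighbours_a), set(neighbours_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists share nothing
    return len(set_a & set_b) / len(union)

# Made-up neighbour lists for the same query word.
skipgram_neighbours = ["truck", "vehicle", "engine", "door"]
bert_neighbours = ["vehicle", "truck", "automobile", "bus"]
overlap = neighbour_overlap(skipgram_neighbours, bert_neighbours)
print(round(overlap, 3))
```

The measure ignores the ranking inside each list, so it only tells how many neighbours the two methods share.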

In [None]:
# this file contains only the BERT vocabulary, not the embeddings
with open('prepared_data/bert_vocab.txt') as f:
    lines = f.readlines()
readed_bert_vocab = [line.rstrip("\n") for line in lines]
print(len(readed_bert_vocab))

In [None]:
from src.embedding import BertEmbedding
bert_eb = BertEmbedding('prepared_data')  # directory that contains the file bert_vocab_embedding.txt
bert_eb.get_bert_embeddings(vocab)
print(bert_eb.bert_embeddings.shape)

In [None]:
# my vocabulary
print(len(vocab))

In [None]:
print(f'find similar words for {v}: \n')
sw = bert_eb.find_similar_words(v, 10, vocab)
df = pd.DataFrame()
df['Ähnliches Wort'] = list(sw.keys())
df['Cosinus-Ähnlichkeit'] = list(sw.values())
print(df.to_latex(index=False))

In [None]:
# check the consistency of the vocabulary: print every word that is missing from the BERT vocabulary
for w in vocab:
    if w not in bert_eb.bert_vocab:
        print(w)