<a href="https://colab.research.google.com/github/nicolaiberk/GermanNPEmbs/blob/main/emb_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
## estimate word embeddings from newspaper data
## code adapted from https://github.com/damian0604/embeddingworkshop/blob/main/04exercise.ipynb
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import string
import re
import os
import pandas as pd
import csv
import sys
import ast
import time


# tqdm allows you to display progress bars in loops
from tqdm import tqdm
from datetime import datetime

import gensim

csv.field_size_limit(sys.maxsize)

# enable INFO-level logging so gensim reports its progress
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# get full set of news articles
if not os.path.isfile('newspapers/_bild_articles.csv'):
    os.system('mkdir newspapers')
    os.system('wget -O newspapers/articles.zip https://www.dropbox.com/sh/r6k4qk9flgz0agu/AAA5ZLsuOwk9UWiEsLAOFmDSa?dl=0')
    os.system('unzip newspapers/articles.zip -d newspapers')
    os.system('rm newspapers/articles.zip')
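
The shell calls above assume a Unix-like environment with `wget` and `unzip` on the path. A minimal sketch of the same fetch-and-unpack step using only the standard library might look like this. The helper names `fetch` and `unpack` are hypothetical, and whether the Dropbox link serves a zip archive directly is an assumption carried over from the original cell.

```python
import os
import urllib.request
import zipfile


def fetch(url, zip_path):
    """Download url to zip_path, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(zip_path) or '.', exist_ok=True)
    urllib.request.urlretrieve(url, zip_path)


def unpack(zip_path, target_dir):
    """Extract the archive into target_dir, delete it, return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    os.remove(zip_path)
    return names
```

Calling `fetch(url, 'newspapers/articles.zip')` followed by `unpack('newspapers/articles.zip', 'newspapers')` would then stand in for the four `os.system` lines.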

In [3]:
# load all texts
if 'artcls' not in locals():
  for filename in tqdm(os.listdir('newspapers')):
    if 'artcls' in locals():
      print(f'\nLoaded {artcls.shape[0]} articles')
      artcls = pd.concat([artcls, pd.read_csv('newspapers/'+filename)])  # DataFrame.append was removed in pandas 2.0
    else:
      artcls = pd.read_csv('newspapers/'+filename)
  print(f'Loaded {artcls.shape[0]} articles, done.')

  artcls = artcls.reset_index()


# keep only rows whose text is a string (drops missing values, which pandas reads as NaN floats)
stringvar = [isinstance(i, str) for i in artcls.text]
artcls = artcls[stringvar]
del stringvar

print(artcls.text[0])

Loaded 150648 articles
Loaded 249494 articles
Loaded 338146 articles
Loaded 411554 articles
Loaded 630597 articles
Loaded 942622 articles
Loaded 1269462 articles
Loaded 1532728 articles
Loaded 1598095 articles
Loaded 1634513 articles
100%|██████████| 11/11 [01:42<00:00,  9.33s/it]
Loaded 2474182 articles, done.
Zum ersten Mal seit dem Amoklauf von Newtown sind in den USA Befürworter und Gegner von schärferen Waffengesetzen vor den Senat getreten. Die frühere demokratische Abgeordnete Gabrielle Giffords, selbst Opfer einer Schusswaffen-Attacke, sagte an ihre ehemaligen Kollegen gerichtet: "Zu viele Kinder sterben. Zu viele Kinder. Wir müssen etwas unternehmen!" Giffords rief den Kongress zum Handeln auf. "Wir müssen etwas tun. Es wird schwer sein, aber jetzt ist die Zeit." Giffords war im Januar 2011 bei einem Besuch in ihrem Wahlkreis im Bundesstaat Arizona von einem jungen Mann aus nächster Nähe in den Kopf geschossen worden. Die Politikerin überlebte schwer verletzt. Bei der Attacke starben sechs Menschen, unter ihnen ein neunjähriges Mädchen. Giffords wurde von ihrem Ehemann Mark Kelly begleitet. Die Politikerin und der Ex-Astronaut hatten Anfang Januar die Initiative "Americans for Responsible Solutions" (Amerikaner für verantwortungsbewusste Lösungen) gegrü

In [4]:
# keep only the article text column
artcls = artcls.text

In [5]:
# split articles into sentences, strip punctuation, lowercase, and deduplicate
print('\nCutting into sentences:')
uniquesentences = set()
trans = str.maketrans('', '', string.punctuation) # translation table that deletes ASCII punctuation
for article in tqdm(artcls):
  # note: sent_tokenize defaults to the English Punkt model; language='german' selects the German one
  sentences = sent_tokenize(article)
  for sentence in sentences:
    # a set ignores duplicates, so no membership check is needed
    uniquesentences.add(sentence.translate(trans).lower())

del artcls

  0%|          | 93/2214853 [00:00<39:50, 926.38it/s]


Cutting into sentences:


100%|██████████| 2214853/2214853 [42:14<00:00, 873.72it/s]
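
`string.punctuation` covers ASCII characters only, so typographic marks that are common in German newspaper text, such as „ and “, survive the translation step above. A minimal sketch of a broader normaliser follows, assuming that every character in a Unicode punctuation category should be dropped as well. `normalise` is a hypothetical helper, not part of the original pipeline.

```python
import string
import unicodedata

ASCII_TABLE = str.maketrans('', '', string.punctuation)


def normalise(sentence):
    """Lowercase and remove ASCII punctuation plus Unicode punctuation (category P*)."""
    stripped = sentence.translate(ASCII_TABLE)
    kept = (ch for ch in stripped if not unicodedata.category(ch).startswith('P'))
    return ''.join(kept).lower()


sample = '„Wir müssen etwas tun“, sagte sie.'
print(sample.translate(ASCII_TABLE).lower())  # ASCII-only table, as in the cell above
print(normalise(sample))                      # typographic quotes removed as well
```

With the ASCII-only table, a quoted and an unquoted copy of the same sentence would count as two distinct entries in the set, which can inflate the number of unique sentences.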


In [6]:
# report the number of unique sentences
print(f"\nWe now have {len(uniquesentences)} unique sentences.")


We now have 42301471 unique sentences.


In [7]:
# with open('uniquesentences.txt', 'w') as fo:
#   writer = csv.writer(fo)
#   for sentence in tqdm(uniquesentences):
#     writer.writerow([sentence])

In [8]:
# del(uniquesentences)

In [9]:
# print('Append split sentences')
# tokenizedsentences = []
# i = 0
# with open('uniquesentences.txt', mode = 'r') as fi:
#     reader = csv.reader(fi)
#     next(reader)
#     for sentence in tqdm(reader):
#       tokenizedsentences.append(sentence[0].split())
#       if i < 2:
#         i += 1
#         print(sentence[0].split())

In [10]:
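# gensim iterates over the corpus twice (vocabulary scan, then training), but a generator
# can only be consumed once, so build one lazy tokenizer per pass
tokenizedsentences = (sentence.split() for sentence in uniquesentences)
tokenizedsentences2 = (sentence.split() for sentence in uniquesentences)
# both generators keep a reference to the set, so this only removes the name
del uniquesentences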
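
A generator expression can be consumed only once, which is why the cell above builds two of them. A minimal sketch of an alternative follows, assuming a small wrapper class (the name `TokenizedCorpus` is hypothetical) whose `__iter__` starts a fresh pass each time, so a single object can serve both the vocabulary scan and the training pass.

```python
class TokenizedCorpus:
    """Restartable iterable that lazily splits each sentence on whitespace."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        # a new generator is created on every call, so each pass starts from the beginning
        for sentence in self.sentences:
            yield sentence.split()


# toy stand-in for the set of unique sentences
toy_sentences = ['die politikerin überlebte schwer verletzt', 'zu viele kinder sterben']

corpus = TokenizedCorpus(toy_sentences)
first_pass = list(corpus)
second_pass = list(corpus)      # same tokens again

generator = (s.split() for s in toy_sentences)
once = list(generator)
again = list(generator)         # empty: the generator is already exhausted
```

An object like this could be passed to both `build_vocab` and `train`, and it would also allow training for more than one epoch, which a one-shot generator cannot support.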

In [11]:
print(f"Started setting up the model at {datetime.now()}")
model = gensim.models.Word2Vec(size=300, min_count=100) # 300-dimensional vectors; ignore words occurring fewer than 100 times (gensim >= 4.0 renames size to vector_size)
model.build_vocab(tokenizedsentences)
print(f"Started training at {datetime.now()}")
model.train(tokenizedsentences2, total_examples=model.corpus_count, epochs=1)
print(f"Finished training at {datetime.now()}")

2021-07-31 12:05:47,433 : INFO : collecting all words and their counts
2021-07-31 12:05:47,435 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-07-31 12:05:47,512 : INFO : PROGRESS: at sentence #10000, processed 167919 words, keeping 35002 word types


Started setting up the model at 2021-07-31 12:05:47.428705


2021-07-31 12:05:47,590 : INFO : PROGRESS: at sentence #20000, processed 335969 words, keeping 56899 word types
2021-07-31 12:05:47,671 : INFO : PROGRESS: at sentence #30000, processed 502234 words, keeping 74681 word types
2021-07-31 12:05:47,753 : INFO : PROGRESS: at sentence #40000, processed 667870 words, keeping 90427 word types
2021-07-31 12:05:47,833 : INFO : PROGRESS: at sentence #50000, processed 835554 words, keeping 105072 word types
2021-07-31 12:05:47,917 : INFO : PROGRESS: at sentence #60000, processed 1002680 words, keeping 118613 word types
2021-07-31 12:05:48,008 : INFO : PROGRESS: at sentence #70000, processed 1169548 words, keeping 131314 word types
2021-07-31 12:05:48,093 : INFO : PROGRESS: at sentence #80000, processed 1337180 words, keeping 143494 word types
2021-07-31 12:05:48,178 : INFO : PROGRESS: at sentence #90000, processed 1504801 words, keeping 154979 word types
2021-07-31 12:05:48,267 : INFO : PROGRESS: at sentence #100000, processed 1671968 words, keepin

Started training at 2021-07-31 12:13:53.187597


2021-07-31 12:13:54,251 : INFO : EPOCH 1 - PROGRESS: at 0.06% examples, 330223 words/s, in_qsize 5, out_qsize 0
2021-07-31 12:13:55,257 : INFO : EPOCH 1 - PROGRESS: at 0.14% examples, 373902 words/s, in_qsize 5, out_qsize 0
2021-07-31 12:13:56,266 : INFO : EPOCH 1 - PROGRESS: at 0.21% examples, 388548 words/s, in_qsize 5, out_qsize 0
2021-07-31 12:13:57,268 : INFO : EPOCH 1 - PROGRESS: at 0.29% examples, 396503 words/s, in_qsize 6, out_qsize 0
2021-07-31 12:13:58,277 : INFO : EPOCH 1 - PROGRESS: at 0.37% examples, 399167 words/s, in_qsize 4, out_qsize 1
2021-07-31 12:14:07,547 : INFO : EPOCH 1 - PROGRESS: at 0.42% examples, 161208 words/s, in_qsize 5, out_qsize 0
2021-07-31 12:14:08,549 : INFO : EPOCH 1 - PROGRESS: at 0.49% examples, 177665 words/s, in_qsize 5, out_qsize 0
2021-07-31 12:14:09,566 : INFO : EPOCH 1 - PROGRESS: at 0.57% examples, 191440 words/s, in_qsize 5, out_qsize 0
2021-07-31 12:14:10,579 : INFO : EPOCH 1 - PROGRESS: at 0.64% examples, 204564 words/s, in_qsize 5, out_

Finished training at 2021-07-31 12:46:21.804100
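
The `min_count=100` setting discards every word that occurs fewer than 100 times in the corpus before training starts. The effect can be illustrated with a plain frequency count. This is only a sketch of the idea with a toy corpus and a hypothetical helper `prune_vocab`, not gensim's internal implementation.

```python
from collections import Counter


def prune_vocab(tokenized_sentences, min_count):
    """Return the words that occur at least min_count times across all sentences."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# toy corpus, invented for illustration
toy_corpus = [
    ['zu', 'viele', 'kinder', 'sterben'],
    ['zu', 'viele', 'kinder'],
    ['wir', 'müssen', 'etwas', 'tun'],
]
vocab = prune_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

Only the words seen at least twice remain in this toy vocabulary; rare words get no vector at all, so querying them later raises a `KeyError`.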


In [12]:
print('Saving model:')
model.save("np_emb")
print('Model finished!')

2021-07-31 12:46:21,822 : INFO : saving Word2Vec object under np_emb, separately None
2021-07-31 12:46:21,828 : INFO : storing np array 'vectors' to np_emb.wv.vectors.npy


Saving model:


2021-07-31 12:46:22,022 : INFO : not storing attribute vectors_norm
2021-07-31 12:46:22,025 : INFO : storing np array 'syn1neg' to np_emb.trainables.syn1neg.npy
2021-07-31 12:46:22,387 : INFO : not storing attribute cum_table
2021-07-31 12:46:22,912 : INFO : saved np_emb


Model finished!


In [13]:
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")

2021-07-31 12:46:22,924 : INFO : saving Word2VecKeyedVectors object under word2vec.wordvectors, separately None
2021-07-31 12:46:22,926 : INFO : storing np array 'vectors' to word2vec.wordvectors.vectors.npy
2021-07-31 12:46:23,227 : INFO : not storing attribute vectors_norm
2021-07-31 12:46:23,650 : INFO : saved word2vec.wordvectors


In [15]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')

2021-07-31 12:47:15,971 : INFO : loading Word2VecKeyedVectors object from word2vec.wordvectors
2021-07-31 12:47:16,505 : INFO : loading vectors from word2vec.wordvectors.vectors.npy with mmap=r
2021-07-31 12:47:16,528 : INFO : setting ignored attribute vectors_norm to None
2021-07-31 12:47:16,529 : INFO : loaded word2vec.wordvectors


In [16]:
wv.most_similar('flüchtling', topn=10)  # ten nearest neighbours by cosine similarity, queried on the vectors reloaded above

2021-07-31 12:47:19,961 : INFO : precomputing L2-norms of word weight vectors


[('kriegsflüchtling', 0.7087794542312622),
 ('asylsuchender', 0.7084465026855469),
 ('migrant', 0.7026200890541077),
 ('syrer', 0.6834374666213989),
 ('asylbewerber', 0.674170732498169),
 ('häftling', 0.6647118926048279),
 ('afghane', 0.6590137481689453),
 ('kurde', 0.6281337738037109),
 ('muslim', 0.6219367980957031),
 ('terrorist', 0.6130465269088745)]
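
`most_similar` ranks vocabulary entries by cosine similarity to the query vector. A minimal sketch of that ranking in plain Python follows, using made-up three-dimensional vectors. The values and the helper names `cosine` and `nearest` are illustrative assumptions, not taken from the trained model or from gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, topn=10):
    """Rank all words except the query by cosine similarity, highest first."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# toy vectors, invented for illustration
toy_vectors = {
    'flüchtling': [1.0, 0.0, 0.0],
    'migrant': [0.9, 0.1, 0.0],
    'zeitung': [0.0, 1.0, 0.0],
}
ranking = nearest('flüchtling', toy_vectors, topn=2)
print(ranking)
```

The scores in the real output above can be read the same way: values closer to 1 mean the two words appear in more similar contexts in the newspaper corpus.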