# Preprocessing for Word2vec
This notebook preprocesses the input sentences before they are embedded with Word2vec. The aim is to increase embedding coverage, i.e. to reduce the number of words that have no pretrained embedding.

## Setup

In [0]:
!pip install tqdm --upgrade
!pip install pandas
!pip install PyDrive

Requirement already up-to-date: tqdm in /usr/local/lib/python3.6/dist-packages (4.43.0)


Imports:

In [0]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

This cell loads the Word2vec embeddings pretrained on the Google News dataset. To run it, first download [this file](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view?usp=sharing) to your Google Drive, then replace `<YOUR_ID_HERE>` with the ID of your copy of the file.

In [0]:
from gensim.models import KeyedVectors
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

english = drive.CreateFile({'id': '<YOUR_ID_HERE>'})
english.GetContentFile('GoogleNews-vectors-negative300.bin.gz')

embeddings_index = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

Loading the input sentences:

In [0]:
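def load_data(file):
    # Read as UTF-8 (the data contains accented names such as 'José').
    # readlines() keeps the trailing '\n' on every line.
    with open(file, 'r', encoding='utf-8') as f:
        return f.readlines()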
    
train_en = pd.DataFrame(load_data("train.ende.src"), columns=["sentence"])
dev_en = pd.DataFrame(load_data("dev.ende.src"), columns=["sentence"])
test_en = pd.DataFrame(load_data("test.ende.src"), columns=["sentence"])
train_de = pd.DataFrame(load_data("train.ende.mt"), columns=["sentence"])
dev_de = pd.DataFrame(load_data("dev.ende.mt"), columns=["sentence"])
test_de = pd.DataFrame(load_data("test.ende.mt"), columns=["sentence"])
print("Train shape: ", train_en.shape)
print("Test shape: ", test_en.shape)
print(train_en.head())

Train shape :  (7000, 1)
Test shape :  (1000, 1)
                                            sentence
0  José Ortega y Gasset visited Husserl at Freibu...
1  However, a disappointing ninth in China meant ...
2  In his diary, Chase wrote that the release of ...
3  Heavy arquebuses mounted on wagons were called...
4  Once North Pacific salmon die off after spawni...


Builds the vocabulary from the input sentences:

In [0]:
def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [0]:
sentences = train_en["sentence"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:10]})


100%|██████████| 7000/7000 [00:00<00:00, 218865.33it/s]

100%|██████████| 7000/7000 [00:00<00:00, 146596.14it/s]

{'José': 5, 'Ortega': 1, 'y': 6, 'Gasset': 1, 'visited': 8, 'Husserl': 1, 'at': 491, 'Freiburg': 1, 'in': 2079, '1934.': 1}
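The same counts can also be obtained with `collections.Counter` from the standard library. The following is a minimal sketch on two made-up sentences; the helper name `build_vocab_counter` and the sample data are illustrative and not part of the notebook:

```python
from collections import Counter

def build_vocab_counter(sentences):
    """Count word occurrences across tokenised sentences (hypothetical helper)."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)  # adds 1 for every token in the sentence
    return vocab

# Made-up sample data, tokenised with str.split() as in the notebook.
sample = [s.split() for s in ["the cat sat", "the dog sat down."]]
toy_vocab = build_vocab_counter(sample)
print(dict(toy_vocab))
```

Because `str.split()` only splits on whitespace, `down.` is counted as its own word with the full stop attached, just like `1934.` in the output above.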





Checks the percentage of words in the vocabulary of the input data which currently have Word2vec embeddings:

In [0]:
import operator 

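def check_coverage(vocab, embeddings_index):
    """
    :param vocab: dictionary of words and their count
    :param embeddings_index: loaded Word2vec KeyedVectors
    :return: list of (word, count) pairs without an embedding, most frequent first
    """
    known = {}    # words that have an embedding
    oov = {}      # out-of-vocabulary words and their counts
    covered = 0   # number of tokens with an embedding
    missing = 0   # number of tokens without an embedding
    for word in tqdm(vocab):
        try:
            known[word] = embeddings_index[word]
            covered += vocab[word]
        except KeyError:
            # The word has no pretrained embedding.
            oov[word] = vocab[word]
            missing += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(covered / (covered + missing)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x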

In [0]:
oov = check_coverage(vocab,embeddings_index)


100%|██████████| 33774/33774 [00:00<00:00, 64800.59it/s]

Found embeddings for 59.01% of vocab
Found embeddings for  71.10% of all text
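The two percentages measure different things: the first counts distinct words, the second weights each word by how often it occurs. A minimal sketch with made-up counts may make the difference concrete; here a plain `set` stands in for the Word2vec vocabulary, and the `coverage` helper is hypothetical:

```python
def coverage(vocab, known_words):
    """Return (vocab coverage, text coverage) as fractions (hypothetical helper)."""
    covered_types = sum(1 for w in vocab if w in known_words)
    covered_tokens = sum(c for w, c in vocab.items() if w in known_words)
    total_tokens = sum(vocab.values())
    return covered_types / len(vocab), covered_tokens / total_tokens

# Made-up counts: a plain set stands in for the Word2vec vocabulary.
toy_vocab = {"in": 6, "visited": 2, "and": 3, "1934.": 1}
toy_known = {"in", "visited"}

vocab_cov, text_cov = coverage(toy_vocab, toy_known)
print(f"{vocab_cov:.2%} of vocab, {text_cov:.2%} of all text")
```

In this toy example half of the distinct words are covered, but two thirds of the tokens are, because the covered words happen to be the frequent ones. A single frequent word without an embedding lowers the text coverage far more than a rare one.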





Prints the most common words in the vocabulary without an embedding so we can conduct cleaning of the data:

In [0]:
oov[:10]

[('and', 3871),
 ('of', 2774),
 ('to', 1710),
 ('a', 1500),
 ('However,', 84),
 ('him.', 48),
 ('BC.', 34),
 ('"The', 33),
 ('12', 30),
 ('10', 29)]
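Many of these entries are ordinary words with punctuation still attached (`However,`, `him.`, `"The`) or multi-digit numbers, which suggests what the cleaning steps should target. As a rough sketch, the tokens can be grouped by their likely cause; the `describe_oov` helper and its three labels are assumptions made for illustration:

```python
import string

def describe_oov(token):
    """Label a token by its likely reason for having no embedding (rough heuristic)."""
    if any(ch in string.punctuation for ch in token):
        return "punctuation"
    if any(ch.isdigit() for ch in token):
        return "digits"
    return "other"

# Tokens copied from the output above.
for token in ["and", "However,", "him.", '"The', "12"]:
    print(token, "->", describe_oov(token))
```

Tokens labelled `other`, such as `and`, `of`, `to` and `a`, are missing from the pretrained vocabulary as they stand, so the cleaning steps in this notebook do not recover them.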

Removes the punctuation from the data for which there are no embeddings:

In [0]:
def clean_text(x):

    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x
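The three loops in `clean_text` can also be expressed as one translation table. The sketch below is an illustrative variant, not the notebook's code: the name `clean_text_translate` is hypothetical, and it assumes the same character sets as above (`/`, `-` and `'` become spaces, `&` is padded with spaces, and the remaining punctuation is deleted):

```python
# Characters handled the same way as in clean_text above.
TO_SPACE = "/-'"
TO_REMOVE = '?!.,"#$%()*+:;<=>@[\\]^_`{|}~' + '“”’'

# Build one translation table: spaces, padded ampersand, deletions.
TABLE = {ord(ch): ' ' for ch in TO_SPACE}
TABLE[ord('&')] = ' & '
TABLE.update({ord(ch): None for ch in TO_REMOVE})

def clean_text_translate(x):
    """Single-pass variant of clean_text (hypothetical name)."""
    return str(x).translate(TABLE)

print(clean_text_translate("rock'n'roll & R&B, (1934).").split())
```

Note that replacing the apostrophe with a space splits contractions and possessives into separate tokens, so `rock'n'roll` becomes `rock`, `n` and `roll`.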

In [0]:
train_en["sentence"] = train_en["sentence"].progress_apply(lambda x: clean_text(x))
dev_en["sentence"] = dev_en["sentence"].progress_apply(lambda x: clean_text(x))
test_en["sentence"] = test_en["sentence"].progress_apply(lambda x: clean_text(x))
train_de["sentence"] = train_de["sentence"].progress_apply(lambda x: clean_text(x))
dev_de["sentence"] = dev_de["sentence"].progress_apply(lambda x: clean_text(x))
test_de["sentence"] = test_de["sentence"].progress_apply(lambda x: clean_text(x))
sentences = train_en["sentence"].apply(lambda x: x.split())
vocab = build_vocab(sentences)


100%|██████████| 7000/7000 [00:00<00:00, 92806.36it/s]

100%|██████████| 1000/1000 [00:00<00:00, 64246.06it/s]

100%|██████████| 1000/1000 [00:00<00:00, 62894.43it/s]

100%|██████████| 7000/7000 [00:00<00:00, 89403.01it/s]

100%|██████████| 1000/1000 [00:00<00:00, 53953.67it/s]

100%|██████████| 1000/1000 [00:00<00:00, 58469.42it/s]

100%|██████████| 7000/7000 [00:00<00:00, 153147.05it/s]


In [0]:
oov = check_coverage(vocab,embeddings_index)


100%|██████████| 27616/27616 [00:00<00:00, 195856.79it/s]

Found embeddings for 89.66% of vocab
Found embeddings for  84.90% of all text





Replaces the digits of multi-digit numbers with `#` so that they match the `#`-masked number tokens in the pretrained vocabulary. Single digits are left unchanged:

In [0]:
import re

def clean_numbers(x):

    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
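To illustrate the effect of the four substitutions, here is a single-pass sketch that should produce the same result for runs of ASCII digits: a run of two to four digits becomes the same number of `#` characters, and a run of five or more becomes `#####`. The name `clean_numbers_single_pass` and the sample sentence are made up for illustration:

```python
import re

def clean_numbers_single_pass(x):
    """Single-pass variant of clean_numbers (hypothetical name)."""
    # Runs of 2+ digits become as many '#' as digits, capped at five.
    return re.sub(r'[0-9]{2,}', lambda m: '#' * min(len(m.group()), 5), x)

print(clean_numbers_single_pass("In 1934 he had 12 of 123456 and 7"))
```

The year and the two-digit number keep their length as `####` and `##`, the six-digit number is capped at `#####`, and the single digit `7` is untouched.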

In [0]:
train_en["sentence"] = train_en["sentence"].progress_apply(lambda x: clean_numbers(x))
dev_en["sentence"] = dev_en["sentence"].progress_apply(lambda x: clean_numbers(x))
test_en["sentence"] = test_en["sentence"].progress_apply(lambda x: clean_numbers(x))
train_de["sentence"] = train_de["sentence"].progress_apply(lambda x: clean_numbers(x))
dev_de["sentence"] = dev_de["sentence"].progress_apply(lambda x: clean_numbers(x))
test_de["sentence"] = test_de["sentence"].progress_apply(lambda x: clean_numbers(x))
sentences = train_en["sentence"].progress_apply(lambda x: x.split())
vocab = build_vocab(sentences)


100%|██████████| 7000/7000 [00:00<00:00, 63012.54it/s]

100%|██████████| 1000/1000 [00:00<00:00, 44207.34it/s]

100%|██████████| 1000/1000 [00:00<00:00, 55572.83it/s]

100%|██████████| 7000/7000 [00:00<00:00, 59824.01it/s]

100%|██████████| 1000/1000 [00:00<00:00, 45606.12it/s]

100%|██████████| 1000/1000 [00:00<00:00, 44949.25it/s]

100%|██████████| 7000/7000 [00:00<00:00, 206273.38it/s]

100%|██████████| 7000/7000 [00:00<00:00, 173996.25it/s]


In [0]:
oov = check_coverage(vocab,embeddings_index)


100%|██████████| 26770/26770 [00:00<00:00, 253600.27it/s]

Found embeddings for 92.63% of vocab
Found embeddings for  87.75% of all text





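Exports all the data. On the English training set, preprocessing improved embedding coverage of all text from 71.10% to 87.75%, and coverage of the vocabulary from 59.01% to 92.63%.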

In [0]:
import os
import numpy as np

# np.savetxt does not create missing directories, so make sure the target exists.
os.makedirs('processed', exist_ok=True)

# newline='' because the sentences still end with the '\n' kept by readlines().
np.savetxt('processed/train.ende.src', train_en["sentence"].values, fmt="%s", newline='')
np.savetxt('processed/dev.ende.src', dev_en["sentence"].values, fmt="%s", newline='')
np.savetxt('processed/test.ende.src', test_en["sentence"].values, fmt="%s", newline='')
np.savetxt('processed/train.ende.mt', train_de["sentence"].values, fmt="%s", newline='')
np.savetxt('processed/dev.ende.mt', dev_de["sentence"].values, fmt="%s", newline='')
np.savetxt('processed/test.ende.mt', test_de["sentence"].values, fmt="%s", newline='')
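
Writing with `newline=''` only produces one sentence per line if every sentence still carries its own trailing newline. A quick round-trip check can confirm this. The sketch below uses made-up sentences and a temporary directory rather than the notebook's data, and plain file I/O instead of `np.savetxt`:

```python
import os
import tempfile

# Made-up sentences; each keeps its trailing newline, as readlines() returns them.
sentences = ["Ortega visited Freiburg in ####\n", "Heavy arquebuses on wagons\n"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.ende.src")
    # writelines adds no separators, so the existing '\n' characters delimit lines.
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(sentences)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = f.readlines()

print(reloaded == sentences)
```

Running the same comparison on the real exported files, for example checking that each one has as many lines as its input file, would catch a sentence that lost or gained a line break during cleaning.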