## From raw text to word embeddings using pretrained embeddings from 2014 English Wikipedia

Download the IMDB data as raw text and load the reviews into a list of strings, one string per review. Prepare the labels: neg = 0, pos = 1.

In [6]:
import os

imdb_dir = '/home/user/development/datasets/aclImdb'
train_dir = os.path.join(imdb_dir, 'test')  # note: despite its name, train_dir points at the 'test' split in this run

print(train_dir)

/home/user/development/datasets/aclImdb/test


In [2]:
! ls

6.1.3_From_raw_text_to_word_embeddings.ipynb


In [4]:
! ls ../

base_image	 datasets      docker-JabRef  Pdatascience  test
build_image_gpu  docker-ionic  nlp_tests      README.md


In [7]:
labels = []
texts = []  # read in all txt files from neg and pos

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # use a context manager so the file is closed even if reading fails
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

Tokenize the data: vectorize the text and prepare a training and validation split. To show how well pretrained word embeddings work when little training data is available, we restrict the training set to the first 200 examples (taken after shuffling).

In [38]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # keep at most 100 tokens per review (pad_sequences drops the earliest tokens by default)
training_samples = 200  # only train on 200 samples to mimic having little training data
validation_samples = 10000  # validate on up to 10k samples (fewer if the dataset is smaller)
max_words = 10000  # consider only the top 10k words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)  # turns the list of sequences into a 2D integer tensor of shape (samples, maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Shuffle first, then split into training and validation sets: the files were read as all neg followed by all pos, so the data is ordered by label
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples+validation_samples]
y_val = labels[training_samples: training_samples+validation_samples]

print('Shape x_train: ', x_train.shape)
print('Shape y_train: ', y_train.shape)
print('Shape x_val: ', x_val.shape)
print('Shape y_val: ', y_val.shape)

Found 42557 unique tokens.
Shape of data tensor: (5674, 100)
Shape of label tensor: (5674,)
Shape x_train:  (200, 100)
Shape y_train:  (200,)
Shape x_val:  (5474, 100)
Shape y_val:  (5474,)


Downloading the GloVe word embeddings

In [3]:
! cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal


In [1]:
! sudo apt install curl

sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?


In [2]:
!curl -O http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip

/bin/bash: curl: command not found
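
Both cells above fail because `curl` is not installed and `sudo` is not usable in this environment. One possible workaround, sketched here with only the Python standard library, is to download and unpack the archive from within the notebook. The helper name `fetch_and_unzip` and its parameters are hypothetical, chosen for illustration; the sketch assumes the URL serves a plain zip archive and that `dest_dir` is writable.

```python
import os
import urllib.request
import zipfile


def fetch_and_unzip(url, dest_dir, member=None):
    """Download a zip archive from `url` into `dest_dir` and extract it.

    If `member` is given, only that file is extracted.
    Returns the list of extracted member names.
    """
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    # skip the download if the archive is already present
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        names = [member] if member else zf.namelist()
        zf.extractall(dest_dir, members=names)
    return names
```

Assuming the Stanford URL from the cell above is still reachable, a call such as `fetch_and_unzip('http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip', 'glove', member='glove.6B.100d.txt')` would place the 100-dimensional file in the `glove` directory that the loading code expects. The archive is large, so the download can take a while.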


In [None]:
glove_dir = 'glove'

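embeddings_index = {}  # maps each word to its 100-dimensional GloVe vector
# each line holds a word followed by its space-separated vector components
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs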

print('Found %s word vectors.' % len(embeddings_index))
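
Once the vectors are loaded, a quick sanity check is to count how many of the tokenizer's most frequent words actually have a pretrained vector. The sketch below is one way to do that in plain Python. The function name `glove_coverage` and the toy dictionaries are hypothetical stand-ins for `word_index` and `embeddings_index`; the `i < max_words` condition assumes, as with the Keras `Tokenizer`, that only words with an index below `num_words` are kept.

```python
def glove_coverage(word_index, embeddings_index, max_words):
    """Return (found, considered) for words whose tokenizer index is below max_words."""
    considered = [w for w, i in word_index.items() if i < max_words]
    found = sum(1 for w in considered if w in embeddings_index)
    return found, len(considered)


# toy stand-ins for the real word_index and embeddings_index
toy_word_index = {'the': 1, 'movie': 2, 'zzyzx': 3, 'rare': 4}
toy_embeddings = {'the': [0.1, 0.2], 'movie': [0.3, 0.4], 'rare': [0.5, 0.6]}

found, considered = glove_coverage(toy_word_index, toy_embeddings, max_words=4)
print('%d of %d words have a pretrained vector' % (found, considered))
```

With the real data, the call would be `glove_coverage(word_index, embeddings_index, max_words)`. Words without a vector are typically left as all-zero rows when an embedding matrix is built.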

# move to version 2 of the book Deep Learning with Python, F. Chollet