<a href="https://colab.research.google.com/github/nRadis/info2049/blob/main/info2049.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [INFO2049] Assignment 2: Practical Project

Authors: Nicolas Radis & Antoine Dekyvere

Date: November 2020


# Introduction

For this project, we could choose between several subjects and picked "Text Summarization". We implement a seq2seq summarization system inspired by the following paper from Nallapati et al.: https://openreview.net/pdf?id=gZ9OMgQWoIAPowrRUAN6

# Theoretical focus

In this section, we describe the main steps we went through to complete our implementation.



1.   What is Text Summarization in Natural Language Processing?
2.   Seq2Seq Modeling
3.   Text Representation
4.   Performance Measures







## 1. What is Text Summarization in Natural Language Processing

"Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning".

There are two approaches to automatic summarization: **extractive** and **abstractive**.

The **extractive** approach selects passages from the source text and rearranges them to form a summary.

The **abstractive** approach writes a summary made of new sentences that need not appear in the source.

In this project, we implement the abstractive approach. It is more interesting because it can generate more human-like summaries than the extractive approach. It is also more challenging, since generating new sentences is harder than selecting pieces of existing text.
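
To make the contrast concrete, here is a minimal sketch of an extractive baseline. It is not the method used in this project: the function name `extractive_summary` and the frequency-based scoring are assumptions chosen for illustration. It only selects existing sentences, which is exactly what the abstractive approach goes beyond.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # A sentence scores the summed frequency of its words
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("The bishop was diagnosed with hepatitis. "
        "The diocese says the bishop contracted hepatitis in Italy. "
        "It rained.")
print(extractive_summary(text))
# The diocese says the bishop contracted hepatitis in Italy.
```

The summary is always a copy of source sentences, so it can never rephrase or merge information.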



## 2. Seq2Seq Modeling (Encoder & Decoder)

A **Seq2Seq model** (sequence to sequence) is a model built from two components: an **Encoder** and a **Decoder**. In our case, the objective is to build a text summarizer whose input is a long sequence of words (the articles) and whose output is a shorter sequence (the summaries, which we call titles). Since both the input and the output are sequences of several words, we treat the task as a Many-to-Many Seq2Seq problem.

![](https://drive.google.com/uc?export=view&id=14QMShhJRreE4RlLQD9xtpcMa-G0llNCX)(https://docs.chainer.org/en/stable/examples/seq2seq.html)

As the diagram above shows, we use an RNN (Recurrent Neural Network). An RNN is a type of neural network that processes a sequence step by step: at each time step it takes the current input together with its own output from the previous step. RNNs fit our needs because text summarization requires capturing the semantics and syntax of sentences, and reading words in order while carrying context forward helps the model do so.


Here is how an encoder-decoder works.

The encoder reads the input sentence word by word, in sequence. It consists of an embedding layer (see next point) and a recurrent layer, as shown on the diagram above. The recurrent layer generates hidden vectors from the embedding vectors. The hidden vector of the last time step is then used to initialize the decoder.

The decoder generates the output sequence starting from the encoder's final hidden state. At each step it predicts the next word, and that prediction is fed back as the input of the following step. In other words, the decoder is trained to predict the next word in the sequence given the previous words.
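
The data flow described above can be sketched with NumPy. This is an untrained toy, not our actual model: the sizes, the random weights and the function names are assumptions, and for brevity the encoder and decoder share one set of weights, which a real model would not do.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10, 4, 8   # toy sizes (assumptions)

E = rng.normal(size=(vocab_size, embed_dim))             # embedding table
W_xh = rng.normal(size=(embed_dim, hidden_dim)) * 0.1    # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1   # hidden -> hidden
W_hy = rng.normal(size=(hidden_dim, vocab_size)) * 0.1   # hidden -> vocabulary scores

def rnn_step(token_id, h):
    """One recurrent step: mix the current word embedding with the previous hidden state."""
    return np.tanh(E[token_id] @ W_xh + h @ W_hh)

def encode(token_ids):
    """Read the input word by word and return the hidden state of the last time step."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        h = rnn_step(t, h)
    return h

def greedy_decode(h, bos_id, eos_id, max_len=5):
    """Start from the encoder state and feed each predicted word back as the next input."""
    out, token = [], bos_id
    for _ in range(max_len):
        h = rnn_step(token, h)
        token = int(np.argmax(h @ W_hy))   # most likely next word
        if token == eos_id:
            break
        out.append(token)
    return out

summary_ids = greedy_decode(encode([3, 5, 7, 2]), bos_id=0, eos_id=1)
print(summary_ids)
```

Because the weights are random, the predicted ids are meaningless; the point is only that the last encoder state initializes the decoder, and that each prediction becomes the next decoder input.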







## 3. Text representation

For the text summarization to work, we first represent the words in a dictionary format. In other words, we map every distinct word to an integer.
For instance, for the article "Very good Article", we get the following dictionary.



```
dict["Very"] = 0
dict["good"] = 1
dict["Article"] = 2
```

The reverse mapping also needs to be implemented:



```
reverse_dict[0] = "Very"
reverse_dict[1] = "good"
reverse_dict[2] = "Article"
```
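
As a minimal sketch, both mappings can be built in a few lines. The function name `build_dictionaries` is hypothetical; the dictionaries actually used in this notebook are created later by `create_dictionary`.

```python
def build_dictionaries(text):
    """Map each distinct word to an integer (in order of first appearance) and back."""
    word_to_int = {}
    for word in text.split():
        if word not in word_to_int:
            word_to_int[word] = len(word_to_int)
    int_to_word = {index: word for word, index in word_to_int.items()}
    return word_to_int, int_to_word

word_to_int, int_to_word = build_dictionaries("Very good Article")
encoded = [word_to_int[w] for w in "Very good Article".split()]
decoded = " ".join(int_to_word[i] for i in encoded)
print(encoded)   # [0, 1, 2]
print(decoded)   # Very good Article
```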

We will also preprocess the data with several methods explained in the "Implementation" part. For the Seq2Seq algorithm to work properly, four built-in words are used:


1.   \<PAD>  :  Pads the sequences so that they all have the same length
2.   \<UNK>  :  Replaces a word that is not in the dictionary
3.   \<BOS>  :  Marks the beginning of a sentence
4.   \<EOS>  :  Marks the end of a sentence
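
The role of the four built-in words can be illustrated with a small sketch. The helper `encode_batch` and the index order of the special words are assumptions made for this example; they differ from the order used by `create_dictionary` in the "Implementation" part.

```python
SPECIAL = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]

def encode_batch(sentences, vocab_words):
    """Encode sentences as integer lists of equal length."""
    word_to_int = {w: i for i, w in enumerate(SPECIAL + vocab_words)}
    encoded = []
    for s in sentences:
        # Unknown words are replaced by <UNK>
        ids = [word_to_int.get(w, word_to_int["<UNK>"]) for w in s.split()]
        # Each sentence is wrapped between <BOS> and <EOS>
        encoded.append([word_to_int["<BOS>"]] + ids + [word_to_int["<EOS>"]])
    # Shorter sentences are padded with <PAD> up to the longest one
    max_len = max(len(ids) for ids in encoded)
    return [ids + [word_to_int["<PAD>"]] * (max_len - len(ids)) for ids in encoded]

batch = encode_batch(["very good article", "good movie"], ["very", "good", "article"])
print(batch)   # [[2, 4, 5, 6, 3], [2, 5, 1, 3, 0]]
```

In the second sentence, "movie" is not in the vocabulary and becomes `1` (\<UNK>), and the trailing `0` is the padding.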

Afterwards, we need to represent the words in a format that the neural network can process: word embeddings.

This concept consists in mapping every word of the dictionary to a vector of real numbers. Pre-trained models (Word2Vec, GloVe, FastText...) have already been trained on millions of texts, so that words used in similar contexts get similar vectors.
Once the words are represented this way, the neural network can exploit the meaning of the words in the articles, and thus produce a summary of them.
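
To give an intuition of what "similar vectors" means, here is a toy example with cosine similarity. The three-dimensional vectors below are made up for illustration; real pre-trained vectors have 300 dimensions in this project.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, invented for this example only
toy_embedding = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "table": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(toy_embedding["good"], toy_embedding["great"])
sim_far = cosine_similarity(toy_embedding["good"], toy_embedding["table"])
print(sim_close > sim_far)   # True
```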







## 4. Performance Measures

To be completed.

# Implementation

The implementation is divided into several parts:

1.   Connection to Google Drive
2.   Data Loading
3.   Data pre-processing
4.   Word Embeddings



In [1]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords

import pathlib
import os
import re
import time

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import matplotlib.pyplot as plt

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

Version:  2.3.0
Eager mode:  True
Hub version:  0.10.0
GPU is available


## 1. Connection to Google Drive


We load the data directly from Google Drive, so there is no need to store it on our local disks.

The dataset used for this project is the collection of CNN and DailyMail articles (https://www.tensorflow.org/datasets/catalog/cnn_dailymail).

Each example contains, on the one hand, the content of an article and, on the other hand, its highlights: a short summary of the article, which we call its title throughout this notebook.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
path = "/content/drive/My Drive/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Data Loading

In [3]:
train_data, test_data = tfds.load(name="cnn_dailymail", split=["train", "test"], 
                                  batch_size=-1, as_supervised=True)

train_articles_b, train_titles_b = tfds.as_numpy(train_data)
test_articles_b, test_titles_b = tfds.as_numpy(test_data)

Downloading and preparing dataset cnn_dailymail/plain_text/3.0.0 (download: 558.32 MiB, generated: 1.27 GiB, total: 1.82 GiB) to /root/tensorflow_datasets/cnn_dailymail/plain_text/3.0.0...


Dataset cnn_dailymail downloaded and prepared to /root/tensorflow_datasets/cnn_dailymail/plain_text/3.0.0. Subsequent calls will reuse this data.


Let's inspect the data that we have loaded.

In [4]:
# Check the number of entries
print("Training entries: {}".format(len(train_articles_b)))
print("Test entries: {}".format(len(test_articles_b)))

# Print 5 articles and 5 title to see the loaded data
print("Here are 5 articles : ")
print(train_articles_b[:5])

print("\n\n")

print("Here are the 5 titles corresponding to previous articles : ")
print(train_titles_b[:5])

Training entries: 287113
Test entries: 11490
Here are 5 articles : 
[b"By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection thro

## 3. Data pre-processing

Let's first decode the binary data into text (utf-8).

In [5]:
# Let's define the size of our train and test
train_size = 10000
test_size = 2000

train_articles = []
train_titles = []
test_articles = []
test_titles = []

for i in range(train_size):
  train_articles.append(train_articles_b[i].decode("utf-8"))
  train_titles.append(train_titles_b[i].decode("utf-8"))

for i in range(test_size):
  test_articles.append(test_articles_b[i].decode("utf-8"))
  test_titles.append(test_titles_b[i].decode("utf-8"))


In [6]:
# TODO : DELETE THE FOLLOWING LINES (JUST TO TEST MORE QUICKLY)
# We do a subset of the data to compute the results faster

sub_train_articles = []
sub_train_titles = []
for i in range(100):
  sub_train_articles.append(train_articles[i])
  sub_train_titles.append(train_titles[i])

# DELETE UNTIL THERE

We will now clean the texts in several steps:


1.   Put everything in lowercase
2.   Expand the English contractions (e.g. "can't" becomes "cannot")
3.   Remove all the unwanted characters
4.   Remove the stopwords from the articles. We remove the stopwords only from the articles, because they bring little to the training of our model. However, we keep them in the titles so that the titles sound more like natural sentences.
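
The four steps can be traced on one sentence with a reduced sketch. The tiny contraction and stopword sets below are hypothetical stand-ins; the real cleaning function defined in this section uses the full contraction list and the NLTK English stopwords.

```python
import re

# Tiny hypothetical subsets, for illustration only
toy_contractions = {"it's": "it is", "don't": "do not"}
toy_stopwords = {"it", "is", "a", "the", "do", "not"}

def toy_clean(text, remove_stopwords=True):
    text = text.lower()                                                 # 1. lowercase
    text = " ".join(toy_contractions.get(w, w) for w in text.split())   # 2. expand contractions
    text = re.sub(r"[^a-z0-9\s]", " ", text)                            # 3. drop unwanted characters
    words = text.split()
    if remove_stopwords:                                                # 4. optional stopword removal
        words = [w for w in words if w not in toy_stopwords]
    return " ".join(words)

print(toy_clean("It's a GOOD article!"))                           # good article
print(toy_clean("It's a GOOD article!", remove_stopwords=False))   # it is a good article
```

The first call shows what an article looks like after cleaning, the second what a title looks like.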



In [7]:
# List of English contractions : http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [8]:
def clean_text(text, remove_stopwords = True):
    '''Lowercase the text, expand contractions, strip unwanted characters and optionally remove stopwords'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms.
    # The text is already lowercase, so the keys (e.g. "I'm") and their
    # expansions are lowercased too; otherwise those entries would never match.
    lowered = {key.lower(): value.lower() for key, value in contractions.items()}
    text = " ".join(lowered.get(word, word) for word in text.split())
    
    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Remove stopwords if needed
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)

    return text

In [9]:
import nltk
nltk.download('stopwords')

# Cleaning data
for i in range(len(train_articles)):
  train_articles[i] = clean_text(train_articles[i], remove_stopwords=True)
  train_titles[i] = clean_text(train_titles[i], remove_stopwords=False)

for i in range(len(test_articles)):
  test_articles[i] = clean_text(test_articles[i], remove_stopwords=True)
  test_titles[i] = clean_text(test_titles[i], remove_stopwords=False)  # keep stopwords in titles, as for the training titles

# Converting the Python list to numpy.ndarray
train_articles = np.array(train_articles)
train_titles = np.array(train_titles)

test_articles = np.array(test_articles)
test_titles = np.array(test_titles)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
# TODO: DELETE
for i in range(len(sub_train_articles)):
  sub_train_articles[i] = clean_text(sub_train_articles[i], remove_stopwords=True)
  sub_train_titles[i] = clean_text(sub_train_titles[i], remove_stopwords=False)

sub_train_articles = np.array(sub_train_articles)
sub_train_titles = np.array(sub_train_titles)
# DELETE

Let's inspect the clean data

In [11]:
# Check the number of entries
print("Training entries: {}".format(len(train_articles)))
print("Test entries: {}".format(len(test_articles)))

# Print 5 articles and 5 title to see the loaded data
print("Here are 5 articles : ")
print(train_articles[:5])

print("\n\n")

print("Here are the 5 titles corresponding to previous articles : ")
print(train_titles[:5])

Training entries: 10000
Test entries: 2000
Here are 5 articles : 
['associated press published 14 11 est 25 october 2013 updated 15 36 est 25 october 2013 bishop fargo catholic diocese north dakota exposed potentially hundreds church members fargo grand forks jamestown hepatitis virus late september early october state health department issued advisory exposure anyone attended five churches took communion bishop john folda pictured fargo catholic diocese north dakota exposed potentially hundreds church members fargo grand forks jamestown hepatitis state immunization program manager molly howell says risk low officials feel important alert people possible exposure diocese announced monday bishop john folda taking time diagnosed hepatitis diocese says contracted infection contaminated food attending conference newly ordained bishops italy last month symptoms hepatitis include fever tiredness loss appetite nausea abdominal discomfort fargo catholic diocese north dakota pictured bishop loc

As we can see, the text is much more refined than before.

## 4. Word Embeddings

First, let's create the vocabulary: a dictionary that stores the number of occurrences of every word.

In [12]:
def add_vocabulary(vocabulary, data):
  '''Add the number of occurrences of every word in the data to the vocabulary given in args'''
  for sentence in data:
    for word in sentence.split():
      if word not in vocabulary:
        vocabulary[word] = 1 # First time we see the word: add it to the vocabulary
      else:
        vocabulary[word] += 1 # Increment the number of occurrences of the word

In [13]:
# Create the vocabulary for the train articles and titles
vocabulary = {}

add_vocabulary(vocabulary, train_articles)
add_vocabulary(vocabulary, train_titles)

print("There are {} words in the vocabulary.".format(len(vocabulary)))

There are 106463 words in the vocabulary.
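
For reference, the same counting can be written with `collections.Counter` from the standard library. This is an equivalent sketch on a toy corpus, not the code that produced the figure above; the name `count_words` is hypothetical.

```python
from collections import Counter

def count_words(sentences):
    """Count the occurrences of every whitespace-separated word."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts

toy_vocabulary = count_words(["bishop fargo diocese", "bishop hepatitis"])
print(toy_vocabulary["bishop"])   # 2
print(len(toy_vocabulary))        # 4
```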


In [39]:
def find_missing_words(vocabulary, embedding, threshold):
  '''
  Finds the number of words that are not in the word embedding given in arg,
  and that are used more than the threshold given.
  '''
  missing_words = 0
  for word, occurence in vocabulary.items():
    if occurence > threshold:
      if word not in embedding:
        missing_words += 1

  return missing_words



In [41]:
def create_dictionary(vocabulary, embedding, threshold):
  '''
  Creates both direction of dictionaries (int -> word  &&  word -> int)
  Returns a tuple (dico, reverse_dico)
  '''

  # We only put in the dictionary the words that appear at least `threshold`
  # times or that belong to the pre-trained vectors (Word2Vec, GloVe or FastText)

  dico = {} # (words -> int)
  reverse_dico = {} # (int -> words)

  # Built in words (see Theoretical Focus)
  builtin = ["<UNK>", "<PAD>", "<BOS>", "<EOS>"]

  # Add the words in the dictionary  (dico["ex1"] = 0, dico["ex2"] = 1, ...)
  cnt = 0
  for word, occurence in vocabulary.items():
    if occurence >= threshold or word in embedding:
      dico[word] = cnt
      cnt += 1

  # Add the built in words to the dico
  for b in builtin:
    dico[b] = len(dico)

  # Create the reverse dico (reverse_dico[0] = "ex1", reverse_dico[1] = "ex2", ...)
  for word, value in dico.items():
    reverse_dico[value] = word

  return dico, reverse_dico

In [52]:
def create_embedding_matrix(vector_dim, dico, embedding):
  '''
  Creates an embedding matrix of size N x M where N is the number of words in 
  the dico and M is the dimensions of the pre-trained vectors.
  Returns : embedding_matrix
  '''

  # create the matrix
  embedding_matrix = np.zeros((len(dico), vector_dim), dtype=np.float32)
  for word, index in dico.items():
    if word in embedding:
      embedding_matrix[index] = embedding[word]
    else:
      # If word not in the pre-trained vectors, create a random embedding for it
      new_embedding = np.array(np.random.uniform(-1.0, 1.0, vector_dim))
      embedding[word] = new_embedding
      embedding_matrix[index] = new_embedding

  return embedding_matrix


In [61]:
def text2integer(text, dico, eos=False):
  '''
  Converts the text to integers thanks to the dictionary.
  Returns a list of integers representing the text
  '''
  int_text = []
  for sentence in text:
    sentence_int = []
    for word in sentence.split():
      if word in dico:
        sentence_int.append(dico[word]) # Add the integer corresponding to the word 
      else:
        sentence_int.append(dico['<UNK>']) # Word not in the dico so add the integer corresponding to <UNK>

    if eos:
      sentence_int.append(dico['<EOS>'])
    
    int_text.append(sentence_int)

  # Return only after every sentence has been converted
  return int_text

In [44]:
def create_dico_matrix(vocabulary, embedding, threshold, vector_dim):
  '''
  Creates the dictionaries and the embedding matrix for the data given in args
  Returns a tuple (dico, reverse_dico, embedding_matrix)
  '''

  # Create the dictionaries
  dico, reverse_dico = create_dictionary(vocabulary, embedding, threshold)

  # Create the embedding matrix
  embedding_matrix = create_embedding_matrix(vector_dim, dico, embedding)

  return dico, reverse_dico, embedding_matrix 
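
The link between the dictionary and the embedding matrix can be checked on a toy case: row `i` of the matrix holds the vector of the word whose integer is `i`, so a sentence converted to integers selects its vectors by plain indexing. The names and two-dimensional vectors below are assumptions for illustration. Unlike `create_embedding_matrix`, this sketch leaves the pre-trained dictionary untouched.

```python
import numpy as np

toy_dico = {"good": 0, "article": 1, "<UNK>": 2}
toy_pretrained = {"good": np.array([0.1, 0.2], dtype=np.float32)}   # made-up vector
dim = 2

rng = np.random.default_rng(0)
matrix = np.zeros((len(toy_dico), dim), dtype=np.float32)
for word, index in toy_dico.items():
    if word in toy_pretrained:
        matrix[index] = toy_pretrained[word]          # reuse the pre-trained vector
    else:
        matrix[index] = rng.uniform(-1.0, 1.0, dim)   # random vector for missing words

# Convert a sentence to integers, then look up the rows of the matrix
ids = [toy_dico.get(w, toy_dico["<UNK>"]) for w in "good unknown article".split()]
vectors = matrix[ids]
print(ids)             # [0, 2, 1]
print(vectors.shape)   # (3, 2)
```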

We will now import the different pre-trained word embeddings : Word2Vec, GloVe and FastText.

We limit each set of pre-trained vectors to 200k words so that Google Colab does not crash: loading the full files requires more RAM than we are allowed. Loading all the pre-trained vectors would of course give better results, but the limit remains acceptable because we keep the 200k most frequent words.

In [45]:
limit_words = 200000
vector_dim = 300

# We choose a threshold of 20: a word of the vocabulary is kept if it has a
# pre-trained vector or if it appears at least 20 times. Frequent words without
# a pre-trained vector get a random embedding; the minimum count ensures that
# the model sees them often enough to learn something about them.
threshold = 20
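
The effect of the threshold can be seen on a toy vocabulary. The helper `kept_words` is hypothetical and only mirrors the condition used in `create_dictionary`; the counts and the set of pre-trained words are invented.

```python
def kept_words(vocabulary, embedding, threshold):
    """Words kept in the dictionary: frequent enough, or known to the pre-trained vectors."""
    return [w for w, n in vocabulary.items() if n >= threshold or w in embedding]

toy_vocabulary = {"bishop": 25, "fargo": 30, "folda": 3, "rare": 2}
toy_embedding = {"bishop", "rare"}   # words that have a pre-trained vector

print(kept_words(toy_vocabulary, toy_embedding, threshold=20))
# ['bishop', 'fargo', 'rare']
```

Here "fargo" is kept without a pre-trained vector because it is frequent, "rare" is kept despite its low count because it has a vector, and "folda" is dropped and would be encoded as \<UNK>.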

### Word2Vec

We use the pre-trained vectors trained on part of the Google News dataset (about 100 billion words). https://code.google.com/archive/p/word2vec/

In [16]:
from gensim.test.utils import datapath
from gensim.models import KeyedVectors

word2vec = KeyedVectors.load_word2vec_format('/content/drive/My Drive/info2049/GoogleNews-vectors-negative300.bin.gz', binary=True, limit=limit_words)


In [17]:
print("We have retrieved {} pre-trained vectors".format(len(word2vec.vocab)))
print("Each vectors has {} dimensions".format(len(word2vec["project"])))
print("Here is an example of vector for the word 'project' with word2vec\n")
print(word2vec["project"])

We have retrieved 200000 pre-trained vectors
Each vectors has 300 dimensions
Here is an example of vector for the word 'project' with word2vec

[-1.80664062e-02  8.54492188e-03  6.98242188e-02  3.07617188e-02
  8.05664062e-02  3.44238281e-02  4.41406250e-01  1.91650391e-02
  1.10473633e-02  9.13085938e-02 -8.05664062e-02  8.66699219e-03
 -5.66406250e-02 -2.79296875e-01 -3.04687500e-01  2.66113281e-02
 -1.01074219e-01 -2.44140625e-01  1.10473633e-02 -2.19726562e-02
 -1.27929688e-01  2.11914062e-01 -4.27246094e-02  6.34765625e-02
 -6.12792969e-02 -1.52343750e-01 -4.27246094e-02 -1.40625000e-01
 -2.41699219e-02 -1.74804688e-01  1.55639648e-02 -4.61425781e-02
 -1.83593750e-01 -8.74023438e-02  1.27929688e-01 -1.05957031e-01
  7.26318359e-03 -2.64892578e-02  1.35742188e-01 -1.41601562e-01
 -1.19628906e-02  2.43164062e-01  5.61523438e-02  1.40625000e-01
 -3.22265625e-01 -3.39843750e-01 -2.53906250e-01 -1.36718750e-01
 -2.08984375e-01  3.61328125e-01  1.34765625e-01 -1.11816406e-01
 -1.1425781

Let's see the number of words that are missing from word2vec.

In [38]:
missing_word2vec = find_missing_words(vocabulary, word2vec, threshold)

print("There are {} words that do not appear in the pre-trained word2vec vectors.".format(missing_word2vec))
print("In other words, it means that {}% of the words are not present in the {} most popular vectors of word2vec".format(round(missing_word2vec/len(vocabulary), 4 )*100, limit_words))

There are 4329 words that do not appear in the pre-trained word2vec vectors.
In other words, it means that 4.07% of the words are not present in the 200000 most popular vectors of word2vec


In [53]:
word2vec_dico, word2vec_reverse_dico, word2vec_embedding_matrix = create_dico_matrix(vocabulary, word2vec, threshold, vector_dim)

In [62]:
train_articles_int = text2integer(train_articles, word2vec_dico)
train_titles_int = text2integer(train_titles, word2vec_dico)

Let's see the result of a clean integer article:

In [65]:
print(train_articles[0])
print(train_articles_int[0])
print(word2vec_dico['associated'])

associated press published 14 11 est 25 october 2013 updated 15 36 est 25 october 2013 bishop fargo catholic diocese north dakota exposed potentially hundreds church members fargo grand forks jamestown hepatitis virus late september early october state health department issued advisory exposure anyone attended five churches took communion bishop john folda pictured fargo catholic diocese north dakota exposed potentially hundreds church members fargo grand forks jamestown hepatitis state immunization program manager molly howell says risk low officials feel important alert people possible exposure diocese announced monday bishop john folda taking time diagnosed hepatitis diocese says contracted infection contaminated food attending conference newly ordained bishops italy last month symptoms hepatitis include fever tiredness loss appetite nausea abdominal discomfort fargo catholic diocese north dakota pictured bishop located
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 12, 43577, 1

### GloVe

We have used these pre-trained word vectors : Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip

We use the 300-dimensional vectors.

https://nlp.stanford.edu/projects/glove/

In [18]:
glove = {}
cnt = 0
with open("/content/drive/My Drive/info2049/glove.6B.300d.txt", 'r', encoding="utf-8") as f:
    for line in f:
      if cnt == limit_words:
        break
      values = line.split()
      word = values[0]
      vector = np.asarray(values[1:], "float32")
      glove[word] = vector
      cnt += 1
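
The GloVe text format read above stores one word per line, followed by its vector components separated by spaces. The parsing can be tried without the real file on an in-memory sample; the three lines and the function name `load_text_vectors` below are made up for this sketch.

```python
import io
import numpy as np

def load_text_vectors(handle, limit=None):
    """Parse 'word v1 v2 ...' lines into a dict, stopping after `limit` words."""
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        values = line.rstrip().split(" ")
        vectors[values[0]] = np.asarray(values[1:], dtype="float32")
    return vectors

# Hypothetical three-word sample in the same layout as a GloVe text file
sample = io.StringIO("the 0.1 0.2 0.3\nproject -0.5 0.0 0.25\nbishop 1.0 1.0 1.0\n")
toy_glove = load_text_vectors(sample, limit=2)
print(sorted(toy_glove))            # ['project', 'the']
print(toy_glove["project"].shape)   # (3,)
```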

In [19]:
print("We have retrieved {} pre-trained vectors".format(len(glove)))
print("Each vectors has {} dimensions".format(len(glove["project"])))
print("Here is an example of vector for the word 'project' with GloVe\n")
print(glove["project"])

We have retrieved 200000 pre-trained vectors
Each vectors has 300 dimensions
Here is an example of vector for the word 'project' with GloVe

[-2.0397e-01 -3.5959e-02 -2.4745e-01 -5.5419e-01  6.7167e-03 -8.7778e-02
  2.3057e-01 -3.3634e-01 -2.1594e-01 -1.3637e+00  2.1076e-01 -4.4217e-01
  2.1688e-01  2.5215e-01  3.8284e-01  1.7151e-02  7.5829e-02  1.8668e-01
  2.5643e-01  4.7164e-01 -3.0530e-01  1.8262e-01 -1.3302e-01  2.1855e-01
 -3.9873e-02  1.9053e-01  3.3508e-01  1.9015e-01 -1.5546e-02  2.3514e-01
  7.2200e-01  2.9326e-01 -2.7213e-01  5.2866e-01 -1.2719e-01  2.0123e-01
 -2.4419e-01 -7.9395e-02  3.3330e-01  1.3958e-02 -2.7907e-01 -3.7687e-01
 -3.3006e-01  3.0789e-01 -1.2030e-01 -2.8289e-01 -8.8605e-02  1.3664e-01
 -1.6403e-01 -6.1411e-02  1.9604e-01  1.0830e-01 -3.6917e-01  1.8505e-03
 -2.9781e-01  3.5050e-01  4.3316e-01  4.4869e-01 -1.3611e-01  1.3710e-01
 -8.4922e-01  3.1850e-01 -4.3727e-02 -5.8593e-01  5.6550e-02  8.6663e-01
  4.2441e-01  3.1674e-01  5.9644e-02 -2.1432e-01  3.1

In [35]:
missing_glove = find_missing_words(vocabulary, glove, threshold)

print("There are {} words that do not appear in the pre-trained GloVe vectors.".format(missing_glove))
print("In other words, it means that {}% of the words are not present in the {} most popular vectors of GloVe".format(round(missing_glove/len(vocabulary), 4 )*100, limit_words))

There are 428 words that do not appear in the pre-trained GloVe vectors.
In other words, it means that 0.4% of the words are not present in the 200000 most popular vectors of GloVe


In [56]:
glove_dico, glove_reverse_dico, glove_embedding_matrix = create_dico_matrix(vocabulary, glove, threshold, vector_dim)

In [60]:
print(len(glove_dico))
print(type(glove_embedding_matrix))

71261
<class 'numpy.ndarray'>


### FastText

We have used these pre-trained word vectors : 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)

https://fasttext.cc/docs/en/english-vectors.html

In [20]:
fasttext = KeyedVectors.load_word2vec_format('/content/drive/My Drive/info2049/wiki-news-300d-1M.vec', binary=False, limit = limit_words)

In [25]:
print("We have retrieved {} pre-trained vectors".format(len(fasttext.vocab)))
print("Each vectors has {} dimensions".format(len(fasttext["project"])))
print("Here is an example of vector for the word 'project' with FastText\n")
print(fasttext["project"])

We have retrieved 200000 pre-trained vectors
Each vectors has 300 dimensions
Here is an example of vector for the word 'project' with FastText

[-3.380e-02 -4.460e-02  2.790e-02 -5.690e-02 -1.607e-01 -4.840e-02
 -3.050e-02  1.190e-01 -2.990e-02 -5.810e-02 -3.190e-02  2.200e-02
 -1.772e-01 -1.140e-01  9.730e-02  5.320e-02  1.827e-01 -4.940e-02
  5.960e-02 -8.740e-02 -1.185e-01  9.800e-03  8.250e-02  7.660e-02
 -6.190e-02 -1.337e-01 -2.268e-01 -5.860e-02 -9.640e-02 -1.860e-02
 -5.940e-02  1.009e-01 -4.900e-03  3.300e-02 -7.130e-02  9.940e-02
  5.170e-02 -1.640e-02 -1.100e-02 -5.210e-02  1.210e-02 -1.023e-01
  3.580e-02 -1.770e-02 -3.830e-02  6.560e-02  4.700e-03 -1.520e-02
 -6.710e-02  7.200e-02  1.091e-01  3.530e-02 -5.655e-01 -4.110e-02
  4.290e-02  1.730e-02  4.390e-02 -2.800e-03  4.200e-03 -1.830e-02
  1.230e-02 -6.680e-02  2.810e-02  2.260e-02  2.220e-02  4.460e-02
 -3.370e-02 -3.700e-02 -4.870e-02 -2.980e-02  3.440e-02  8.180e-02
 -1.006e-01  2.900e-03  1.594e-01 -3.910e-02 -8.480e

In [37]:
missing_fasttext = find_missing_words(vocabulary, fasttext, threshold)

print("There are {} words that do not appear in the pre-trained FastText vectors.".format(missing_fasttext))
print("In other words, it means that {}% of the words are not present in the {} most popular vectors of FastText".format(round(missing_fasttext/len(vocabulary), 4 )*100, limit_words))

There are 2865 words that do not appear in the pre-trained FastText vectors.
In other words, it means that 2.69% of the words are not present in the 200000 most popular vectors of FastText


In [57]:
fasttext_dico, fasttext_reverse_dico, fasttext_embedding_matrix = create_dico_matrix(vocabulary, fasttext, threshold, vector_dim)