Applied Deep Learning Final Project Fall 2019:

*   Shuai Hao (sh3831)
*   Bhaskar Ghosh (bg2625)










## Generating sentence embeddings for training, validation and test data
We are using the CNN/Daily Mail corpus to train our model. At this stage we have already preprocessed the data and created the training, validation and test sets. However, the training set contains more than 280,000 files, and further processing would take a substantial amount of time.

To reduce the time taken for further processing and training, we decided to do two things:
1. Shuffle the training dataset and choose a random sample of articles from the training, validation and test datasets. Choose the human abstracts for the same articles for evaluation.
2. Truncate the articles in the dataset to a maximum of 20 sentences (this processing has been done in a separate notebook).
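
Since the truncation in step 2 lives in a separate notebook, the following is only a minimal sketch of what it might look like, not the code we actually ran. `truncate_article` is a hypothetical helper, and the sketch assumes each article is a dict with an `inputs` list of sentence records (each with `text` and `tokens`), which is how the JSON files are read later in this notebook. The toy article is made-up data.

```python
MAX_SENTENCES = 20  # truncation limit described in step 2


def truncate_article(article, max_sentences=MAX_SENTENCES):
    """Return a shallow copy of the article with at most max_sentences sentences."""
    truncated = dict(article)
    truncated['inputs'] = article['inputs'][:max_sentences]
    return truncated


# Toy article with 25 short sentences (made-up data for illustration).
toy_article = {
    'id': 'doc-0',
    'inputs': [
        {'text': f'sentence {i}', 'tokens': ['sentence', str(i)]}
        for i in range(25)
    ],
}
short_article = truncate_article(toy_article)
print(len(toy_article['inputs']), len(short_article['inputs']))
```

The helper returns a copy, so the original article keeps all of its sentences.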

Import all the required libraries in one place.

In [0]:
import os
import json
from google.colab import drive
import pandas as pd
import numpy as np
import shutil
from random import sample

In [0]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Read GloVe embeddings from Google Drive
We are using pre-trained GloVe embeddings (200-dimensional, `glove.6B.200d.txt`) for this project.

In [0]:
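# quoting=3 (csv.QUOTE_NONE) keeps quote characters in tokens from being treated as field quoting
df = pd.read_csv('/content/gdrive/My Drive/Final Project/glove/glove.6B.200d.txt', sep=" ", quoting=3, header=None, index_col=0)
# map each word to its embedding vector (a NumPy array)
glove = {key: val.values for key, val in df.T.items()}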
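
By default pandas treats strings such as `null` or `nan` as missing values, which may affect tokens with those spellings. As an alternative, here is a minimal sketch of a line-by-line loader. `load_glove` is a hypothetical helper, and the tiny two-dimensional file it reads is made-up demonstration data, not the real GloVe file.

```python
import os
import tempfile

import numpy as np


def load_glove(path):
    """Read a GloVe-style text file into a dict mapping word -> float32 vector."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, values = parts[0], parts[1:]
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings


# Demonstration on a made-up two-dimensional file.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_glove.txt')
    with open(toy_path, 'w', encoding='utf-8') as f:
        f.write('the 0.1 0.2\nnull 0.3 0.4\n')
    toy_glove = load_glove(toy_path)

print(sorted(toy_glove))
```

Reading the file this way keeps every token as a plain string key, at the cost of a slower pure-Python loop.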

### Start preprocessing
Shuffle the training dataset and prepare a sample of about 25,000 articles.
(We started out with the whole dataset of 280,000 training articles, but we ran into issues while preprocessing the data and ended up with about 23,000 samples, which we used for further processing.)

In [0]:
training_dataset_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/inputs/train/'
shuffled_training_dataset_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/shuffled_inputs/train/'

In [0]:
training_files = os.listdir(training_dataset_path)
# random_files = sample(training_files, 50000)
# for file in random_files:
#   full_file_name = os.path.join(training_dataset_path, file)
#   if os.path.isfile(full_file_name):
#       shutil.copy(full_file_name, shuffled_training_dataset_path)

37064

### Function to generate sentence embeddings
The following function takes the sentences and tokens of one article, averages the GloVe vectors of the words in each sentence, and writes the resulting sentence embeddings to a JSON file with the given name in the specified folder.

In [0]:
def generate_sentence_embeddings(sentences, tokens, folder, filename):
  # Map each sentence's text to the average of its words' 200-d GloVe vectors.
  sentence_embeddings = {}
  tokens_list = tokens['tokens']
  for index, sentence in enumerate(tokens_list):
    s_index = sentences[index]
    embedding = np.zeros(200)
    for word in sentence:
      # Out-of-vocabulary words are skipped but still count toward the length.
      if word in glove:
        embedding += glove[word]
    # max(..., 1) avoids dividing by zero for an empty token list.
    sentence_embeddings[s_index] = (embedding / max(len(sentence), 1)).tolist()
  
  
  file_path = '/content/gdrive/My Drive/Final Project/result/cnn-dailymail/sentence_embeddings/' + folder + '/' + filename + '.json'
  with open(file_path, 'w') as f:
    json.dump(sentence_embeddings, f)
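
To make the averaging concrete, the sketch below computes one sentence embedding from a toy vocabulary. `mean_sentence_embedding` is a hypothetical helper and the three-dimensional vectors are made up. As in the function above, out-of-vocabulary words contribute nothing to the sum but still count toward the sentence length.

```python
import numpy as np


def mean_sentence_embedding(words, embeddings, dim):
    """Sum the vectors of known words and divide by the total number of words."""
    total = np.zeros(dim)
    for word in words:
        if word in embeddings:
            total += embeddings[word]
    # Guard against empty sentences to avoid dividing by zero.
    return total / max(len(words), 1)


toy_embeddings = {
    'cat': np.array([1.0, 0.0, 2.0]),
    'sat': np.array([3.0, 2.0, 0.0]),
}
# 'the' is out of vocabulary, so only two of the three words have vectors.
vector = mean_sentence_embedding(['the', 'cat', 'sat'], toy_embeddings, dim=3)
print(vector.tolist())
```

Dividing by the number of known words instead would be a different design choice; it keeps sentences with many unknown words from shrinking toward zero.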
  

### Generate embeddings for training data

In [0]:
files_list = os.listdir(training_dataset_path)
for file in files_list:
  tokens = {'tokens': []}
  sentences = []
  file_path = os.path.join(training_dataset_path, file)
  with open(file_path) as json_file:
    data = json.load(json_file)
  # collect the text and the token array of each sentence
  doc_id = data['id']
  for sentence_input in data['inputs']:
    sentences.append(sentence_input['text'])
    tokens['tokens'].append(sentence_input['tokens'])

  # write the sentence embeddings of this document to <doc_id>.json
  generate_sentence_embeddings(sentences, tokens, 'train', doc_id)




In [0]:
len(os.listdir('/content/gdrive/My Drive/Final Project/result/cnn-dailymail/sentence_embeddings/train/'))

23038

### Generate embeddings for the validation dataset

First, create a random sample of 5,000 articles from the validation dataset.

In [0]:
validation_dataset_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/inputs/valid/'
shuffled_validation_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/shuffled_inputs/valid/'

In [0]:
# shuffle the validation dataset to pick 5000 articles
validation_files = os.listdir(validation_dataset_path)
random_files = sample(validation_files, 5000)
for file in random_files:
  print('Copying....')
  full_file_name = os.path.join(validation_dataset_path, file)
  if os.path.isfile(full_file_name):
      shutil.copy(full_file_name, shuffled_validation_path)

In [0]:
shuffled_files_num = len(os.listdir(shuffled_validation_path))
shuffled_files_num

5000
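
Because `sample` is unseeded, rerunning the copy cell above would draw a different set of files and add them to the same destination folder. A minimal sketch of a reproducible variant follows. `copy_random_sample` and its `seed` parameter are hypothetical additions, and the demonstration uses temporary directories with made-up files rather than the Drive paths.

```python
import os
import random
import shutil
import tempfile


def copy_random_sample(src_dir, dst_dir, k, seed=42):
    """Copy k randomly chosen files from src_dir to dst_dir and return their names."""
    os.makedirs(dst_dir, exist_ok=True)
    file_names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    rng = random.Random(seed)  # fixed seed: the same files are chosen on every run
    chosen = rng.sample(file_names, min(k, len(file_names)))
    for name in chosen:
        shutil.copy(os.path.join(src_dir, name), dst_dir)
    return chosen


# Demonstration with temporary directories and made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'inputs')
    dst = os.path.join(tmp_dir, 'shuffled_inputs')
    os.makedirs(src)
    for i in range(10):
        with open(os.path.join(src, f'{i}.json'), 'w') as f:
            f.write('{}')
    copied = copy_random_sample(src, dst, k=4)
    copied_count = len(os.listdir(dst))

print(copied_count)
```

Sorting the file names before sampling matters here, since `os.listdir` does not guarantee an order and the seed alone would not fix the selection.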

Generate the embeddings for the shuffled validation data.

In [0]:
files_list = os.listdir(shuffled_validation_path)
for file in files_list:
  tokens = {'tokens': []}
  sentences = []
  file_path = os.path.join(shuffled_validation_path, file)
  with open(file_path) as json_file:
    data = json.load(json_file)
  # collect the text and the token array of each sentence
  doc_id = data['id']
  for sentence_input in data['inputs']:
    sentences.append(sentence_input['text'])
    tokens['tokens'].append(sentence_input['tokens'])

  # write the sentence embeddings of this document to <doc_id>.json
  generate_sentence_embeddings(sentences, tokens, 'valid', doc_id)

In [0]:
valid_embeddings_num = len(os.listdir('/content/gdrive/My Drive/Final Project/result/cnn-dailymail/sentence_embeddings/valid/'))
valid_embeddings_num

5000

### Generating embeddings for test data

First, create a random sample of 5,000 articles from the test dataset.

In [0]:
test_data_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/inputs/test/'
test_data_shuffled = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/inputs/test_shuffled/'

In [0]:
test_files = os.listdir(test_data_path)
random_files = sample(test_files, 5000)
for file in random_files:
  print('Copying....')
  full_file_name = os.path.join(test_data_path, file)
  if os.path.isfile(full_file_name):
      shutil.copy(full_file_name, test_data_shuffled)

Copying....

Now create the embeddings for the shuffled test data.

In [0]:
files_list = os.listdir(test_data_shuffled)
for file in files_list:
  tokens = {'tokens': []}
  sentences = []
  file_path = os.path.join(test_data_shuffled, file)
  with open(file_path) as json_file:
    data = json.load(json_file)
  # collect the text and the token array of each sentence
  doc_id = data['id']
  for sentence_input in data['inputs']:
    sentences.append(sentence_input['text'])
    tokens['tokens'].append(sentence_input['tokens'])

  # write the sentence embeddings of this document to <doc_id>.json
  generate_sentence_embeddings(sentences, tokens, 'test', doc_id)

In [0]:
len(os.listdir('/content/gdrive/My Drive/Final Project/result/cnn-dailymail/sentence_embeddings/test'))

5000
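
The training, validation and test loops above differ only in the source folder and the split name, so the file-reading part could be factored out. The following is a minimal sketch under that assumption. `read_article` and `iter_articles` are hypothetical helpers, and the demonstration article is made-up data in the same `id` / `inputs` layout as our input files.

```python
import json
import os
import tempfile


def read_article(file_path):
    """Return (doc_id, sentences, token_lists) for one preprocessed article file."""
    with open(file_path) as json_file:
        data = json.load(json_file)
    sentences = [item['text'] for item in data['inputs']]
    token_lists = [item['tokens'] for item in data['inputs']]
    return data['id'], sentences, token_lists


def iter_articles(directory):
    """Yield the parsed contents of every file in a directory, in sorted order."""
    for file_name in sorted(os.listdir(directory)):
        yield read_article(os.path.join(directory, file_name))


# Demonstration on a made-up article.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = {'id': 'doc-1', 'inputs': [
        {'text': 'Hello world .', 'tokens': ['hello', 'world', '.']},
        {'text': 'Bye .', 'tokens': ['bye', '.']},
    ]}
    with open(os.path.join(tmp_dir, 'doc-1.json'), 'w') as f:
        json.dump(toy, f)
    articles = list(iter_articles(tmp_dir))

print(articles[0][0], len(articles[0][1]))
```

Each split could then call `generate_sentence_embeddings(sentences, {'tokens': token_lists}, split_name, doc_id)` for every tuple the generator yields.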

### Source paths for embeddings and labels for training data
Set the paths to the embeddings and labels for the training data.

In [0]:
train_embedding_path = '/content/gdrive/My Drive/Final Project/result/cnn-dailymail/sentence_embeddings/train/'
train_label_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/labels/train/'

List the training label files in the path above.

In [0]:
train_label_files = os.listdir(train_label_path)

Destination label paths for training labels

In [0]:
new_train_label_path = '/content/gdrive/My Drive/Final Project/result/cnn-dailymail/labels/train/'

Copy label files to the new directory if the corresponding embeddings are available.

In [0]:
train_embeds = os.listdir(train_embedding_path)

In [0]:
for file_name in train_embeds:
  print(file_name)
  if file_name in train_label_files:
    shutil.copyfile(train_label_path + file_name, new_train_label_path + file_name)

### Source paths for embeddings and labels for validation data
Set the paths to the embeddings and labels for the validation data.

In [0]:
valid_embedding_path = '/content/gdrive/My Drive/Final Project/result/cnn-dailymail/sentence_embeddings/valid/'
valid_label_path = '/content/gdrive/My Drive/Final Project/summarization-datasets/data/cnn-dailymail/labels/valid/'

List the validation label files in the path above.

In [0]:
valid_label_files = os.listdir(valid_label_path)

Destination path for validation data labels

In [0]:
new_valid_label_path = '/content/gdrive/My Drive/Final Project/result/cnn-dailymail/labels/valid/'

Copy label files to the new directory if the corresponding embeddings exist.

In [0]:
valid_embeds = os.listdir(valid_embedding_path)

In [0]:
for file_name in valid_embeds:
  if file_name in valid_label_files:
    shutil.copyfile(valid_label_path + file_name, new_valid_label_path + file_name)
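
The membership checks above scan a Python list for every embedding file, which gets slow when the label folder holds hundreds of thousands of names. A minimal sketch of the same matching with sets is shown below. `files_with_embeddings` and `missing_labels` are hypothetical helpers, and the file names are made up.

```python
def files_with_embeddings(label_files, embedding_files):
    """Return the label file names that also appear among the embedding files."""
    embedding_set = set(embedding_files)  # constant-time membership tests
    return [name for name in label_files if name in embedding_set]


def missing_labels(label_files, embedding_files):
    """Return the embedding file names that have no matching label file."""
    label_set = set(label_files)
    return sorted(name for name in embedding_files if name not in label_set)


toy_labels = ['a.json', 'b.json', 'c.json']
toy_embeds = ['b.json', 'c.json', 'd.json']
to_copy = files_with_embeddings(toy_labels, toy_embeds)
unmatched = missing_labels(toy_labels, toy_embeds)
print(to_copy, unmatched)
```

Listing the unmatched names is also a quick way to spot embeddings that would be left without labels after the copy.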

Output for some of the cells has not been included here, as we performed our preprocessing in different notebooks to split up the work. The preprocessing code is now contained in only two notebooks.