<h2>Inspecting the Aligned Dataset</h2>
<p>We inspect a dataset that aligns sequences of Reddit comments with Wikipedia sentences. The HDF5 files (`reddit.h5` and `wikipedia.h5`) are built so that each sequence of Reddit comments is aligned with 20 Wikipedia sentences.</p>

In [1]:
import json
import numpy as np
import h5py

<p>We load `reddit.h5` and `wikipedia.h5`, which contain the aligned sequences of Reddit comments and Wikipedia sentences. In these files, each word is represented by its position in a shared dictionary. We also load that dictionary (`dictionary.json`), which lets us recover the actual word from its position.</p>

In [2]:
reddit_path = 'Aligned-Dataset/reddit.h5'
wikipedia_path = 'Aligned-Dataset/wikipedia.h5'
dictionary_path = 'Aligned-Dataset/dictionary.json'

reddit = h5py.File(reddit_path, 'r')
wikipedia = h5py.File(wikipedia_path, 'r')

with open(dictionary_path, 'r') as f:
    dictionary = json.load(f)
    # JSON object keys are strings, so convert the ids back to integers.
    id2word = {int(key): word for key, word in dictionary['id2word'].items()}
    word2id = dictionary['word2id']
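
<p>To make the id-to-word mapping concrete, here is a minimal sketch that decodes one padded row. The toy dictionary, its ids, and the helper `decode_ids` are invented for illustration; only the `id2word` / `word2id` layout and the `<PAD>` token follow the cell above.</p>

```python
# Toy stand-in for dictionary.json; the ids and words below are invented.
toy_dictionary = {
    'id2word': {'0': '<PAD>', '1': '<sot>', '2': '<eot>', '3': 'hello', '4': 'world'},
    'word2id': {'<PAD>': 0, '<sot>': 1, '<eot>': 2, 'hello': 3, 'world': 4},
}

# JSON object keys are always strings, so convert them back to integers.
toy_id2word = {int(key): word for key, word in toy_dictionary['id2word'].items()}
toy_word2id = toy_dictionary['word2id']

def decode_ids(ids, id2word, pad_id):
    """Map a padded row of word ids to a list of words, stopping at the first pad."""
    words = []
    for token in ids:
        if token == pad_id:
            break
        words.append(id2word[int(token)])
    return words

row = [1, 3, 4, 2, 0, 0]  # a padded row, shaped like one row of the HDF5 files
print(' '.join(decode_ids(row, toy_id2word, toy_word2id['<PAD>'])))
# prints: <sot> hello world <eot>
```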

In [3]:
def capitalise(string):
    # Return the string with its first character in upper case.
    return string[:1].upper() + string[1:]

def getAligned(index, dataset = 'train'):
    if dataset not in ('train', 'validate', 'test'):
        print('The available options for the dataset variable are: train, validate and test.')
        return
    if not 0 <= index < len(reddit[dataset]):
        print('The index exceeds the available examples in the %s Set.' % (dataset.title()))
        print('Pick an index between 0 and %d for the %s Set.' % (len(reddit[dataset]) - 1, dataset.title()))
        return

    pad = word2id['<PAD>']
    line_breaks = (word2id['<end>'], word2id['<eot>'])

    def decode(row, multiline = False):
        # Convert the ids before the first <PAD> back to words. For comment
        # sequences, start a new line after every <end> or <eot> token.
        pieces = []
        for token in row:
            if token == pad:
                break
            pieces.append(id2word[int(token)])
            pieces.append('\n' if multiline and token in line_breaks else ' ')
        return ''.join(pieces[:-1])

    # Read each HDF5 row once rather than once per token.
    sequence = decode(reddit[dataset][index], multiline = True)
    sentences = [decode(wikipedia[dataset][index * 20 + j]) for j in range(20)]

    print('Number: %d Sequence of Comments from the %s Set\n' % (index, dataset.title()))
    print(sequence)
    print('\n\nWikipedia Sentences for the Number: %d Sequence of Comments from the %s Set\n' % (index, dataset.title()))
    print('\n'.join(sentences))

<p>Running `getAligned(i, dataset)` prints the $i$-th sequence of comments along with the 20 Wikipedia sentences aligned with it. The dataset is split into training, validation and test sets in proportions of 80%, 10% and 10%, which gives the following options for the `dataset` variable:</p>
* `train` containing 11248 sequences of comments along with 224960 sentences
* `validate` containing 1406 sequences of comments along with 28120 sentences
* `test` containing 1406 sequences of comments along with 28120 sentences
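
<p>Because every sequence is aligned with exactly 20 sentences, the Wikipedia rows for a given sequence follow from simple index arithmetic. The sketch below assumes the row layout used by `getAligned` (row `index * 20 + j` for the `j`-th sentence); the helper name `wikipedia_rows` is hypothetical.</p>

```python
SENTENCES_PER_SEQUENCE = 20  # alignment ratio stated in the text

def wikipedia_rows(index, per_sequence=SENTENCES_PER_SEQUENCE):
    """Return the Wikipedia row indices aligned with comment sequence `index`."""
    start = index * per_sequence
    return range(start, start + per_sequence)

rows = wikipedia_rows(256)
print(rows[0], rows[-1])  # prints: 5120 5139
```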

<p>`<sot>` and `<eot>` are the start-of-title and end-of-title tokens of each sequence. Each comment in a sequence is wrapped in start-of-comment `<start>` and end-of-comment `<end>` tokens.</p>
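
<p>These markers make it straightforward to split a decoded sequence back into its title and individual comments. The following is a rough sketch using regular expressions; the helper `split_sequence` and the example text are made up for illustration and are not part of the dataset tooling.</p>

```python
import re

def split_sequence(text):
    """Split a decoded sequence into its title and its list of comments."""
    title_match = re.search(r'<sot>(.*?)<eot>', text, flags=re.DOTALL)
    title = title_match.group(1).strip() if title_match else ''
    # An empty comment such as "<start> <end>" is kept as an empty string.
    comments = [c.strip() for c in re.findall(r'<start>(.*?)<end>', text, flags=re.DOTALL)]
    return title, comments

# Invented example in the same layout that getAligned prints.
example = ('<sot> A made-up title <eot>\n'
           '<start> <end>\n'
           '<start> First comment . <end>\n'
           '<start> A reply . <end>')
title, comments = split_sequence(example)
print(title)     # prints: A made-up title
print(comments)  # prints: ['', 'First comment .', 'A reply .']
```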

In [4]:
getAligned(256, dataset = 'validate')

Number: 256 Sequence of Comments from the Validate Set

<sot> TIL Michael Crichton , author of Jurassic Park , felt his literature professor at Harvard was giving him unfair grades . To prove it , he turned in a paper by George Orwell and received a B- <eot>
<start> <end>
<start> I broke my wrist in college and forgot about a paper that was due , so I just printed out a paper I did in high school . In HS , I got a B- on the paper . In College , I got an A . I always thought my high school teacher was too harsh , and felt that it was proven with my self-plagiarism in college . <end>
<start> I did at least 5 book reports on Gary NaN The Hatchet . <end>
<start> If you did that many book reports on it how can you not know it is just " Hatchet " , not " The Hatchet " ? ( Never read it myself , but helped my daughter with her book report on it last fall ) <end>
<start> Because it was nearly 20 years ago . <end>
<start> Also , the ( very famous ) book is almost universally known as The Hatche