## Process the raw data
In this notebook, we build a dataset from raw data that has been processed with the tools Facebook has bundled together: the News Crawl data, the Moses tokenizer (from mosesdecoder), and fastBPE for byte-pair encoding.

The official repository is [here](https://github.com/facebookresearch/XLM).  I used Microsoft's version instead, found [here](https://github.com/xutaatmicrosoftdotcom/MASS-1#data-ready).

The relevant steps I used are as follows.
Download the GitHub [repository](https://github.com/facebookresearch/XLM), then fetch the BPE codes and vocabulary from a terminal, after going into your project folder:
```
wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr
```
Then run the bash script that extracts and processes the necessary data (about 10 GB of disk space is needed).  Activate any virtual environment you use beforehand, since the script relies on certain Python packages.  I also had to edit the bash script and add
```
python 
```
in front of the command lines that run the `preprocess.py` file.  We will use English as the source language and French as the target language.
```
./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
```
You should get a readout at the end indicating the locations of your train, valid, and test data.
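
Before moving on, it can help to confirm that the processed files actually exist. The sketch below uses a hypothetical helper, `missing_files`, which is not part of XLM or MASS. It assumes the output folder that this notebook reads from; `train.en.pth` is the only file name taken from this notebook, so add the valid and test names from your own readout.

```python
from pathlib import Path

def missing_files(folder, names):
    """Return the entries of `names` that are not files inside `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]

# Assumed location and file name; adjust both to match the script's readout.
print(missing_files("./data/processed/en-fr", ["train.en.pth"]))
```

An empty list means every listed file was found.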

In [2]:
from itertools import islice

data_path = "./data/processed/en-fr"
# Peek at the first 15 raw lines of the binary file.
with open(data_path + "/train.en.pth", mode='rb') as f:
    for line in islice(f, 15):
        print(line.rstrip())

b'\x80\x04\x95\r\x00\x00\x00\x00\x00\x00\x00\x8a'
b'l\xfc\x9cF\xf9 j\xa8P\x19.\x80\x04\x95\x04\x00\x00\x00\x00\x00\x00\x00M\xe9\x03.\x80\x04\x95X\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x10protocol_version\x94M\xe9\x03\x8c\rlittle_endian\x94\x88\x8c'
b'type_sizes\x94}\x94(\x8c\x05short\x94K\x02\x8c\x03int\x94K\x04\x8c\x04long\x94K\x04uu.\x80\x04\x95\x02\x00\x01\x00\x00\x00\x00\x00}\x94(\x8c\x04dico\x94\x8c\x13src.data.dictionary\x94\x8c'
b'Dictionary\x94\x93\x94)\x81\x94}\x94(\x8c\x07id2word\x94}\x94(K\x00\x8c\x03<s>\x94K\x01\x8c\x04</s>\x94K\x02\x8c\x05<pad>\x94K\x03\x8c\x05<unk>\x94K\x04\x8c'
b'<special0>\x94K\x05\x8c'
b'<special1>\x94K\x06\x8c'
b'<special2>\x94K\x07\x8c'
b'<special3>\x94K\x08\x8c'
b'<special4>\x94K\t\x8c'
b'<special5>\x94K'
b'\x8c'
b'<special6>\x94K\x0b\x8c'
b'<special7>\x94K\x0c\x8c'
b'<special8>\x94K\r\x8c'
b'<special9>\x94K\x0e\x8c\x01,\x94K\x0f\x8c\x01.\x94K\x10\x8c\x03the\x94K\x11\x8c\x01a\x94K\x12\x8c\x02to\x94K\x13\x8c\x01"\x94K\x14\x8c\x02of\x94K\x15\x8c\x03an

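This reads like nonsense because the `.pth` file is not plain text: it is a binary, pickle-serialized object written by PyTorch, so opening it directly shows raw bytes.  Pieces of the dictionary, such as `<s>`, `<pad>`, and `the`, are still recognizable.  Let us look at the contents after loading the file with torch, which requires the repository's `src.data.dictionary` module to be importable because the file stores a `Dictionary` object.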
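
To see why the raw bytes look this way, here is a small sketch using Python's standard `pickle` module, which `torch.save` uses by default. The toy dictionary is made up for illustration and is much smaller than the real saved object. The sketch assumes pickle protocol 4, which is what the leading `\x80\x04` bytes in the output above indicate.

```python
import pickle

# Toy stand-in for the saved object; the real file holds much more.
toy = {"id2word": {0: "<s>", 1: "</s>", 2: "<pad>", 3: "<unk>"}}

raw = pickle.dumps(toy, protocol=4)
print(raw[:2])          # protocol header: opcode 0x80 followed by the version byte
print(b"<pad>" in raw)  # short strings are stored as readable bytes

# Loading reverses the serialization and gives back an equal object.
restored = pickle.loads(raw)
print(restored == toy)
```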

In [5]:
import torch

data = torch.load(data_path+"/train.en.pth")
print(data.keys())
print(data['sentences'][0:42])
print(data['positions'][0:10])

dict_keys(['dico', 'positions', 'sentences', 'unk_words'])
[ 5543  1569 13902  7541    14 17242  2729    14    59 21255  4610  1936
     1   202    52    62    77  1396    78  8528  1337    15     1    19
  1010  1374    81   493    59 18566  3342    14    62    41   107    14
    19  9433    36   404    15     1]
[[  0  12]
 [ 13  22]
 [ 23  41]
 [ 42  61]
 [ 62  90]
 [ 91 127]
 [128 157]
 [158 177]
 [178 215]
 [216 241]]


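Preprocessing turned the corpus into a flat array of word indices (`sentences`) plus an array of `[start, end]` pairs (`positions`) that mark where each sentence lies in that array.  In the output above, the index `1` (`</s>`) sits at each sentence's end position.  A dataset loader now needs to take these arrays and split them into batches for training.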
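
As a quick illustration of how the two arrays fit together, the sketch below slices individual sentences out of the flat array. The numbers are copied from the output above. `get_sentence` is a hypothetical helper, and it assumes the end position is exclusive, which matches the sample since `sentences[12]`, `sentences[22]`, and `sentences[41]` are all `1`.

```python
import numpy as np

# First 42 word indices and first 3 position pairs, copied from the output above.
sentences = np.array([
    5543, 1569, 13902, 7541, 14, 17242, 2729, 14, 59, 21255, 4610, 1936,
    1, 202, 52, 62, 77, 1396, 78, 8528, 1337, 15, 1, 19,
    1010, 1374, 81, 493, 59, 18566, 3342, 14, 62, 41, 107, 14,
    19, 9433, 36, 404, 15, 1,
])
positions = np.array([[0, 12], [13, 22], [23, 41]])

def get_sentence(sentences, positions, i):
    """Return the word indices of sentence i (end position treated as exclusive)."""
    start, end = positions[i]
    return sentences[start:end]

for i in range(len(positions)):
    sent = get_sentence(sentences, positions, i)
    print(i, len(sent), sent)
```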

In [None]:
class CorpusDataset():
    
    def __init__(self, int_sent, pos, batch_size=256):
        """
        Constructs a dataset that can iterate over batches of sentences.
        :param int_sent : ndarray of word indices for the entire corpus
        :param pos : ndarray of [start, end] position pairs, one per sentence
        :param batch_size : number of sentences per batch (default 256)
        """
        self.int_sent = int_sent
        self.pos = pos
        self.batch_size = batch_size
        self.num_sent = len(self.pos)
        
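    def batch_iterator(self):
        ## TODO:
        raise NotImplementedError

    def __len__(self):
        # Number of sentences in the corpus
        return self.num_sent

    def _process_batch(self):
        ## TODO:
        raise NotImplementedError

    def __getitem__(self, idx):
        ## TODO:
        raise NotImplementedError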
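
One possible way to fill in the open methods is sketched below. This is not the finished implementation for this notebook. `SketchCorpusDataset`, the `pad_index` parameter, and the choice to pad each batch to its longest sentence are all assumptions made for illustration. The default `pad_index=2` assumes `<pad>` has index 2, as the dictionary dump above suggests, and the end position is again treated as exclusive.

```python
import numpy as np

class SketchCorpusDataset:
    """Hypothetical sketch: batches sentences stored as a flat index array."""

    def __init__(self, int_sent, pos, batch_size=256, pad_index=2):
        self.int_sent = int_sent
        self.pos = pos
        self.batch_size = batch_size
        self.pad_index = pad_index  # assumption: index 2 is <pad>

    def __len__(self):
        return len(self.pos)

    def __getitem__(self, idx):
        # Slice one sentence out of the flat array (end treated as exclusive).
        start, end = self.pos[idx]
        return self.int_sent[start:end]

    def _process_batch(self, sents):
        # Pad every sentence to the length of the longest one in the batch.
        lengths = np.array([len(s) for s in sents])
        batch = np.full((len(sents), lengths.max()), self.pad_index, dtype=np.int64)
        for row, s in enumerate(sents):
            batch[row, :len(s)] = s
        return batch, lengths

    def batch_iterator(self):
        # Walk through the corpus in order, batch_size sentences at a time.
        for first in range(0, len(self), self.batch_size):
            last = min(first + self.batch_size, len(self))
            yield self._process_batch([self[i] for i in range(first, last)])

# Toy data: three sentences of lengths 3, 2, and 1, each followed by index 1.
toy_sent = np.array([5, 6, 7, 1, 8, 9, 1, 10, 1])
toy_pos = np.array([[0, 3], [4, 6], [7, 8]])
ds = SketchCorpusDataset(toy_sent, toy_pos, batch_size=2)
for batch, lengths in ds.batch_iterator():
    print(batch, lengths)
```

Iterating in corpus order keeps the sketch simple. Shuffling, or grouping sentences of similar length to reduce padding, would be natural refinements.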