# <center>PROJECT SANDBOX</center>

## Documentation
The aim of this notebook is to provide a simple sandbox to test different NN architectures for the project. , here is a doc about the functions imported from `scripts` folder : 

- **`prepare_dataset(device,ratio=0.5,shuffle_ctx=False)`** :
    - **Input**:
        - device : a torch.device object
        - ratio : a float ratio between 0 and 1 that determines the average proportion of modern english verses in the data loader
        - shuffle_ctx : if `True`, shuffle the contexts within a Batch so that half of the `x_1` elements has a wrong context `ctx_1`. Useful to train the context recognizer model.
    - **Return** :
        - a torch Dataset | class : Shakespeare inherited from torch.utils.data.Dataset
        - a python word dictionary (aka tokenizer) | class : dict
    - **Tensors returned when loaded in the dataloader**:
        - x_1 : input verse (modern / shakespearian)
        - x_2 : output verse (modern / shakespearian)

        - ctx_1 = context of the input verse
        - ctx_2 = context of the output verse

        - len_x : length of the input verse
        - len_y : length of the output verse

        - len_ctx_x : length of the input verse context
        - len_ctx_y : length of the output verse context

        - label : label of the input verse (0 : modern, 1 : shakespearian)
        - label_ctx : label of the context (0 : wrong context, 1 : right context)
- **`string2code(string,dict)`** : 
    - **Input**:
        - string : a sentence
        - dict : a tokenizer
    - **Return** :
        - a torch Longtensor (sentence tokenized)
- **`code2string(torch.Longtensor,dict)`** : 
    - **Input**:
        - torch.Longtensor : a sentence tokenized
        - dict : a tokenizer
    - **Return** :
        - a string sentence

## Importing packages

In [1]:
from scripts.data_builders.prepare_dataset import prepare_dataset,string2code,code2string,assemble

import torch
import torchvision.datasets as datasets
import torch.nn.functional as F
from torch import nn
from torch import optim
from torch.utils.tensorboard import SummaryWriter
import ipdb
import pickle
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device = ",device)

device =  cpu


## Preprocessing data

In [2]:
train_data, dict_words = prepare_dataset(device,ratio=0.5,shuffle_ctx=True) #check with shift+tab to look at the data structure
batch_size = 20
dict_token = {b:a for a,b in dict_words.items()} #dict for code2string

train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           shuffle=True,collate_fn=train_data.collate)

Loading ...
- Shakespeare dataset length :  20316
- Corrupted samples (ignored) :  763


## Designing NN model

### Language Model 

In [3]:
dict_size=len(dict_words) #19089
d_embedding=300 #cf. paper Y.Kim 2014 Convolutional Neural Networks for Sentence Classification


In [26]:
class CoherenceClassifier(torch.nn.Module):
    def __init__(self,dict_size=dict_size,d_embedding=300,max_length=100):
        super().__init__()
        self.embed_layer=torch.nn.Embedding(dict_size+1,d_embedding,padding_idx=dict_size)

        self.conv_1 = torch.nn.Conv1d(d_embedding,3,kernel_size = 3, stride = 1)
        self.max_pool = torch.nn.MaxPool1d(3,2)
        self.relu = torch.nn.ReLU()
        self.linear = torch.nn.Linear(3,1)
        # self.f=lambda x: torch.norm(x,dim=1)**2 (I am not sure it is necessary at all)
        self.f=lambda x:x
        self.sigmoid=torch.nn.Sigmoid()
#         self.softmax=torch.nn.Softmax()
    
    def forward(self,x,ctx):
        x=self.embed_layer(x)
        ctx=self.embed_layer(ctx)
        input_=assemble(x,ctx,self.f)
#         input_=torch.cat((self.f(x),ctx),dim=1)
        input_ = self.conv_1(input_.transpose(1,2))
        input_ = self.max_pool(input_)
        input_ = self.relu(input_)
        u = torch.max(input_,2)[0]
        s = self.sigmoid(self.linear(u))
        return(s)

        
    



## Running model

In [17]:
import ipdb

In [18]:
for x,y , ctx_x,ctx_y , len_x,len_y , len_ctx_x,len_ctx_y, label,label_ctx in train_loader:
    
    for i in range(x.shape[0]):
        print("\n- x :")
        print(code2string(x[i],dict_token))
        print("- context of x :")
        print(code2string(ctx_x[i],dict_token))
        print("- context label :",label_ctx[i].item())
#         ipdb.set_trace()
    break


- x :
IT’S TOO BAD THEIR BEAUTY FADES RIGHT WHEN IT REACHES PERFECTION !
- context of x :
WOMEN ARE LIKE ROSES : THE MOMENT THEIR BEAUTY IS IN FULL BLOOM , IT’S ABOUT TO DECAY . THAT’S TRUE . WHO’S THE RULER HERE ? A DUKE WHO IS NOBLE IN NAME AND CHARACTER .
- context label : 1

- x :
IT WOULD BE BAD IF SHE KNEW ABOUT BENEDICK’S LOVE AND TEASED HIM ABOUT IT .
- context of x :
THERE IS SCORN AND DISDAIN IN HER EYES , AND THOSE SPARKLING EYES DESPISE EVERYTHING THEY LOOK UPON . SHE CAN’T EVEN IMAGINE WHAT “LOVE” IS . IT’S TRUE . AND SO SHE TURNS MEN INSIDE OUT AND NEVER ACKNOWLEDGES THE INTEGRITY AND MERIT THAT A MAN HAS .
- context label : 1

- x :
WHY SHOULD SHE WRITE TO EDMUND ?
- context of x :
COWARDS DIE MANY TIMES BEFORE THEIR DEATHS . THE VALIANT NEVER TASTE OF DEATH BUT ONCE . WHAT SAY THE AUGURERS ? THEY WOULD NOT HAVE YOU TO STIR FORTH TODAY .
- context label : 0

- x :
SO THE SINS OF MY MOTHER SHOULD BE VISITED UPON ME .
- context of x :
HONORS ON THIS MAN TO EASE OURSELVES 

In [19]:
from torch.optim import Adagrad
from torch.nn import BCELoss,CrossEntropyLoss

In [None]:
epochs=100
model=CoherenceClassifier()
optimizer=Adagrad(params=model.parameters(),lr=0.01)
loss_func=BCELoss()

for epoch in range(epochs):
    for x,y , ctx_x,ctx_y , len_x,len_y , len_ctx_x,len_ctx_y, label,label_ctx in train_loader:
        x[x==-1]=19089
        ctx_x[ctx_x==-1]=19089
#         ipdb.set_trace()
        label_pred=model.forward(x,ctx_x)
        loss=loss_func(label_pred,label.float())
        loss.backward()
        optimizer.step()
    if epoch %1==0:
        print("Epoch %d, loss %f"%(epoch,loss))

  return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)


Epoch 0, loss 0.608942
