<a href="https://colab.research.google.com/github/neohack22/IASD/blob/NLP/NLP/imdb_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this lab session is to implement the model proposed by Yoon Kim in 2014. The original paper can be found [here](https://www.aclweb.org/anthology/D14-1181).
Of course, PyTorch and TensorFlow implementations exist on the web, and they are more or less correct and efficient. However, it is important here to write it yourself: the goal is to better understand PyTorch and the convolution operation.

The road-map is to: 
- Implement the convolution and pooling 
- Add dropout on the last layer
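
As a first look at the dropout step of the road-map, here is a minimal sketch of how `nn.Dropout` behaves in training and in evaluation mode. The tensor size and the rate `p=0.5` are assumptions chosen for illustration, not values prescribed by this lab.

```python
import torch as th
import torch.nn as nn

th.manual_seed(1)

# Hypothetical size: a batch of 4 sentence vectors with 6 features each.
features = th.ones(4, 6)
dropout = nn.Dropout(p=0.5)

# Training mode: each entry is zeroed with probability p,
# and the surviving entries are scaled by 1 / (1 - p) = 2.
dropout.train()
train_out = dropout(features)

# Evaluation mode: dropout is the identity.
dropout.eval()
eval_out = dropout(features)

print(train_out)
print(eval_out)
```

Because the behavior depends on the mode, a model that contains dropout needs `model.train()` before training and `model.eval()` before evaluation.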

To start, it is useful to explore the convolution layers. In this lab, we consider the 1-dimensional convolution, followed by the corresponding max pooling.
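
To make the shapes concrete, here is a minimal sketch of a 1-dimensional convolution followed by max pooling over the whole sequence. All sizes are hypothetical, the input is random rather than real embeddings, and the ReLU is only an illustrative choice of non-linearity. It is a toy example to explore the layers, not the model to implement.

```python
import torch as th
import torch.nn as nn
import torch.nn.functional as F

th.manual_seed(1)

# Hypothetical sizes chosen for illustration only.
batch_size, seq_len, emb_dim = 2, 7, 5
n_filters, kernel_size = 3, 3

# nn.Conv1d expects (batch, channels, length):
# the embedding dimension plays the role of the channels.
x = th.randn(batch_size, emb_dim, seq_len)

conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters,
                 kernel_size=kernel_size)
# Output length is seq_len - kernel_size + 1 (no padding, stride 1).
feature_maps = F.relu(conv(x))

# Pooling with a window as long as the sequence keeps one value per filter.
pooled = F.max_pool1d(feature_maps, kernel_size=feature_maps.shape[2])
pooled = pooled.squeeze(2)

print(feature_maps.shape)  # (batch, n_filters, seq_len - kernel_size + 1)
print(pooled.shape)        # (batch, n_filters)
```

Note that the pooled output has a fixed size whatever the sentence length, which is what allows a linear layer to follow.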


We use the same dataset as before: IMDB. The first cells below are the same as in the previous lab session on this dataset (load the data, build the vocabulary, and prepare the data for the model).


# Data loading 


In [None]:
import re
import numpy as np
import torch as th
import torch.autograd as ag
import torch.nn.functional as F
import torch.nn as nn
import random

th.manual_seed(1) # set the seed 


def clean_str(string, tolower=True):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " ", string) ## remove 
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ", string) ## remove 
    string = re.sub(r"\)", " ", string)## remove 
    string = re.sub(r"\?", " ? ", string)  # plain "?": a "\?" replacement would leave a backslash in the token
    string = re.sub(r"\s{2,}", " ", string)
    if tolower:
        string = string.lower()
    return string.strip()


def loadTexts(filename, limit=-1):
    """
    Texts loader for imdb.
    If limit is set to -1, the whole dataset is loaded; otherwise at most
    limit non-empty lines are loaded.
    """
    dataset = []
    cpt = 1   # 1-based rank of the next line to keep
    skip = 0  # number of lines that are empty after cleaning
    with open(filename) as f:  # the file is closed even if an error occurs
        for line in f:
            cleanline = clean_str(line).split()
            if not cleanline:
                skip += 1
                continue
            dataset.append(cleanline)
            if limit > 0 and cpt >= limit:
                break
            cpt += 1
    # When the whole file is read, cpt is one more than len(dataset).
    print("Load ", cpt, " lines from ", filename , " / ", skip ," lines discarded")
    return dataset



Load the data 

In [None]:
LIM=-1
pathd = "/home/allauzen/cours/nlp-iasd/labs/"
txtfile=pathd+"imdb.pos"
postxt = loadTexts(txtfile,limit=LIM)
print(postxt[0:10])
print (len(postxt), " pos sentences")

txtfile=pathd+"imdb.neg"
negtxt = loadTexts(txtfile,limit=LIM)
print(negtxt[0:10])

print (len(negtxt), " neg sentences")


Load  299966  lines from  /home/allauzen/cours/nlp-iasd/labs/imdb.pos  /  35  lines discarded
[['excellent'], ['do', "n't", 'miss', 'it', 'if', 'you', 'can'], ['a', 'great', 'parody'], ['dreams', 'of', 'a', 'young', 'girl'], ['tromendous', 'piece', 'of', 'art'], ['funny', 'funny', 'movie', '!'], ['need', 'more', 'scifi', 'like', 'this'], ['pride', 'and', 'prejudice', 'is', 'absolutely', 'amazing', '!', '!'], ['scott', 'pilgrim', 'vs', 'the', 'world'], ['quirky', 'and', 'effective']]
299965  pos sentences
Load  299949  lines from  /home/allauzen/cours/nlp-iasd/labs/imdb.neg  /  52  lines discarded
[['typical', 'movie', 'where', 'best', 'parts', 'are', 'in', 'the', 'preview'], ['not', 'for', 'the', 'squeamish'], ['cool', 'when', 'i', 'was', 'kid'], ['i', 'appreciate', 'the', 'effort', 'but'], ['pretty', 'bad'], ['much', 'ado', 'about', 'nothing'], ['series', 'of', 'unlikely', 'events'], ['april', 'is', 'the', 'cruelest', 'month'], ['great', 'idea', 'but'], ['and', 'people', 'thought', 't

In [None]:
wfreq = {}
maxlength = 0
for sent in postxt+negtxt:
    maxlength = max(maxlength, len(sent))
    for w in sent:
        # Count each word, starting from 0 the first time it is seen.
        wfreq[w] = wfreq.get(w, 0) + 1
  

In [None]:
print(len(wfreq))
orderedvocab = []
for w in sorted(wfreq, key=wfreq.get, reverse=True):
    orderedvocab.append((w, wfreq[w]))

63699


In [None]:
print(orderedvocab[0:10])

[('!', 153714), ('the', 146409), ('a', 131821), ('of', 94543), ('movie', 80115), ('and', 63910), ('this', 53299), ('to', 46991), ('it', 46431), ('i', 44902)]
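
The counting and sorting steps above can also be sketched with `collections.Counter` from the standard library. The toy sentences below are made up for illustration and stand in for `postxt+negtxt`; this is an alternative formulation, not the code used in this lab.

```python
from collections import Counter

# Toy tokenized sentences (hypothetical data).
toy_sentences = [
    ["a", "great", "movie", "!"],
    ["a", "bad", "movie"],
    ["great", "!", "!"],
]

# Count every token of every sentence.
toy_freq = Counter(w for sent in toy_sentences for w in sent)

# List of (word, count) pairs, most frequent first.
toy_ordered = toy_freq.most_common()

print(len(toy_freq))
print(toy_ordered[0])
```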


In [None]:
VOCSIZE = 10000
w2idx = {}
idx2w = {}
w2idx["<pad>"]  = 0
w2idx["<unk>"] = 1
idx2w[1]="<unk>"
idx2w[0]="<pad>"

for i in range(VOCSIZE): 
    w, _ = orderedvocab[i]
    w2idx[w] = i+2
    idx2w[i+2] = w
    
print(len(w2idx), " == ",len(idx2w))
for i in range(1,6):
    print(i, idx2w[i], w2idx[idx2w[i]])


10002  ==  10002
1 <unk> 1
2 ! 2
3 the 3
4 a 4
5 of 5


In [None]:
NB_SENTENCES = 100000 # for each class
txtidx = []
maxlength = 0 
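# Note: the positive slice starts at index 1, so postxt[0] is not used.
for sent in postxt[1:NB_SENTENCES+1]+negtxt[:NB_SENTENCES]:
    maxlength = max(maxlength,len(sent))
    # Words outside the vocabulary are mapped to the "<unk>" index.
    isent = [w2idx.get(w, w2idx["<unk>"]) for w in sent]
    txtidx.append(th.LongTensor(isent))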
    
print(len(w2idx), " words in the vocab")
print(len(txtidx), " sentences")
print(maxlength, " is maximum sentence length")
print(txtidx[0])


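### For the labels
# The positive sentences come first in txtidx, so with this assignment
# label 0 marks a positive sentence and label 1 a negative one.
labels = th.ones([2*NB_SENTENCES])
labels[0:NB_SENTENCES] = 0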

10002  words in the vocab
200000  sentences
48  is maximum sentence length
tensor([ 36,  25, 381,  10,  58,  21,  83])


In [None]:
def idx2wordlist(idx_array): 
    l = []
    for i in idx_array: 
        l.append(idx2w[i.item()])
    return l
print(txtidx[0], txtidx[0].shape)

print(idx2wordlist(txtidx[0]))
print(postxt[1])

tensor([ 36,  25, 381,  10,  58,  21,  83]) torch.Size([7])
['do', "n't", 'miss', 'it', 'if', 'you', 'can']
['do', "n't", 'miss', 'it', 'if', 'you', 'can']
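
The sentences in `txtidx` have different lengths, whereas a convolution layer processes a rectangular batch. Here is a minimal sketch of one possible way to batch them, assuming padding with index 0 (the index reserved for `<pad>` above) and using `pad_sequence` from `torch.nn.utils.rnn`. The toy sequences are made up, and the lab may handle batching differently.

```python
import torch as th
from torch.nn.utils.rnn import pad_sequence

# Toy index sequences of different lengths (hypothetical data),
# standing in for a few entries of txtidx.
toy_batch = [
    th.LongTensor([36, 25, 381]),
    th.LongTensor([4, 5]),
    th.LongTensor([7]),
]

# padding_value=0 matches the index reserved for "<pad>".
padded = pad_sequence(toy_batch, batch_first=True, padding_value=0)

# Keep the true lengths in case the padding must be ignored later.
lengths = th.tensor([len(s) for s in toy_batch])

print(padded)
print(lengths)
```

Every row is padded on the right up to the length of the longest sequence in the batch.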


In [None]:
pack = (txtidx, labels, idx2w)
import pickle

if True:  # set to False to skip saving
    # The with-block closes the file once the dump is done.
    with open('imdb-200k', 'wb') as f:
        pickle.dump(pack, f)
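
A later session can read the saved tuple back with `pickle.load`. The sketch below runs the round trip on a toy pack written to a temporary file, so that it runs on its own. The file name and the toy contents are hypothetical; with the real data, the file to open would be `imdb-200k`.

```python
import os
import pickle
import tempfile

import torch as th

# Toy stand-in for (txtidx, labels, idx2w).
toy_pack = (
    [th.LongTensor([2, 3])],
    th.zeros(1),
    {0: "<pad>", 1: "<unk>", 2: "!", 3: "the"},
)

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy-pack")
with open(path, "wb") as f:
    pickle.dump(toy_pack, f)

# Unpack in the same order as the tuple was built.
with open(path, "rb") as f:
    txtidx_loaded, labels_loaded, idx2w_loaded = pickle.load(f)

print([idx2w_loaded[i.item()] for i in txtidx_loaded[0]])
```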