In this script I compute the word embeddings and prepare the data so that it can be fed directly to the CNN later on. The script works for both the Bengali and Hindi datasets; only the dataset name at the beginning of the script needs to change.

In [13]:
import torch
import csv
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from utils import *
import sys

Load the cleaned data saved by the previous script, together with the list of vocabulary items.

In [14]:
dataset_name = 'hindi'  # set to 'bangali' for the Bengali dataset
embedding_size = 300
data = load_data(dataset_name + '_tweets')
V = data['all_words']
new_tweets = data['tweets']
Y = data['Y']

The word2vec model. In the forward pass I skip the second layer and the final softmax layer, because I only need the embeddings, which are produced by the first layer.

In [15]:
class Word2Vec(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden_1 = nn.Linear(len(V),embedding_size)
    self.hidden_2 = nn.Linear(embedding_size,len(V))
    self.logsoftmax = nn.LogSoftmax()


  def forward(self, one_hot):
    out = self.hidden_1(one_hot)
    # print(out.shape)
    # out = self.hidden_2(out)
    # out = self.logsoftmax(out)
    return out

We need a function that turns a word into its one-hot encoding:

In [16]:
def word_to_one_hot(word):
    one_hot = torch.zeros(len(V))
    for i,w in enumerate(V):
        if word == w:
            one_hot[i] = 1
            break
    assert(torch.sum(one_hot)==1)
    return one_hot
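
As a side note on why the first layer alone gives the embeddings: feeding a one-hot vector to a linear layer selects one column of its weight matrix (plus the bias). The sketch below checks this on a toy layer; the sizes and the names `layer` and `word_index` are illustrative assumptions and are not part of this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for illustration (the notebook uses len(V) and 300).
vocab_size, emb_size = 5, 3
layer = nn.Linear(vocab_size, emb_size)  # stands in for hidden_1

word_index = 2
one_hot = torch.zeros(vocab_size)
one_hot[word_index] = 1.0

with torch.no_grad():
    via_one_hot = layer(one_hot)
    # nn.Linear stores its weight as (out_features, in_features), so the
    # embedding of word i is column i of the weight, plus the bias.
    via_lookup = layer.weight[:, word_index] + layer.bias

print(torch.allclose(via_one_hot, via_lookup))
```

If the per-word forward pass ever becomes a bottleneck, a direct column lookup like this could presumably replace it, though the notebook itself keeps the one-hot route.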


I load the embedding model trained in the previous script and define sen_len as the mean sentence length in the dataset, rounded down to an integer. The CNN needs all sentences to have the same length, but the sentences in the dataset vary in length. I therefore use the mean length as the target length for every sentence: longer sentences are cropped and shorter ones are padded until they reach this length.

In [17]:
model = Word2Vec()
model.load_state_dict(torch.load(dataset_name + '_embeddings_model'))
model.eval()
sen_len = int(sum([len(sentence) for sentence in new_tweets])/len(new_tweets))

For each word in each sentence, we first build its one-hot representation from the vocabulary and then pass it through the embedding model to get a feature vector for that word. Each sentence therefore yields a sen_len × 300 feature matrix, since the word2vec model gives 300 features per word and every tweet is cropped or padded to sen_len words. For tweets shorter than sen_len, the remaining rows are filled with random numbers, as in https://arxiv.org/pdf/1408.5882.pdf

In [18]:
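embeddings = torch.zeros((len(new_tweets),sen_len,embedding_size))
# Y = torch.zeros((len(new_tweets)))
# Inference only: no_grad keeps autograd from tracking the stored embeddings.
with torch.no_grad():
    for i,sentence in enumerate(new_tweets):
        print(i)  # progress indicator
        j = 0
        cnt = 0
        for word in sentence:
            if j == sen_len:
                break
            if word in V:
                embeddings[i][j] = model(word_to_one_hot(word))
                cnt += 1
            else:
                # Out-of-vocabulary word: its row stays all zeros.
                print('-----ERR------')
            j += 1
        # print(sen_len,len(sentence), cnt)
        # Pad the remaining positions with uniform random values.
        while j<sen_len:
            embeddings[i][j] = torch.rand(embedding_size)
            j += 1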
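
The crop-or-pad rule used above can be isolated into a small function, which makes it easier to check on toy input. This is only a sketch: `pad_or_crop` is a hypothetical helper that does not exist in this notebook or in `utils`, and the tensor sizes are made up for the example.

```python
import torch

def pad_or_crop(vectors, target_len):
    """Return a (target_len, dim) tensor built from an (n, dim) tensor.

    Hypothetical helper, not part of the notebook: longer inputs are cropped,
    shorter ones are padded with uniform random values in [0, 1).
    """
    n, dim = vectors.shape
    if n >= target_len:
        return vectors[:target_len].clone()
    padding = torch.rand(target_len - n, dim)
    return torch.cat([vectors, padding], dim=0)

short = torch.ones(2, 4)   # a 2-word "tweet" with 4-dim embeddings
long = torch.ones(7, 4)    # a 7-word "tweet"

print(pad_or_crop(short, 5).shape)  # torch.Size([5, 4])
print(pad_or_crop(long, 5).shape)   # torch.Size([5, 4])
```

The original rows are kept unchanged at the top of the result, so only the padded tail is random.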

Next we split the data into training, validation, and test sets and save the result as a dictionary so that we can load it later for the classification task. 80% of the data is used for training, 10% for validation, and 10% for testing.

In [19]:
Y = torch.from_numpy(Y)
print(torch.sum(Y))
import random
frac = 0.8
num_data = len(new_tweets)
num_train = int(frac * num_data)
num_test = int((num_data - num_train)/2)
num_val = num_data - (num_test + num_train)
indices = np.arange(num_data)
random.shuffle(indices)
indices_train = indices[:num_train]
indices_val = indices[num_train:num_train+num_val]
indices_test = indices[num_train+num_val:num_train+num_val+num_test]

data_split = {}
data_split['X'] = embeddings[indices_train]
data_split['Y'] = Y[indices_train]


val = {}
val['X'] = embeddings[indices_val]
val['Y'] = Y[indices_val]

data_split['val'] = val

test = {}
test['X'] = embeddings[indices_test]
test['Y'] = Y[indices_test]
data_split['test'] = test

save_data(data_split,dataset_name + '_data')


tensor(268)
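
The index arithmetic of the 80/10/10 split can be checked on its own, without the embeddings. The sketch below mirrors the sizes computed in the cell above; the function name `split_indices` and the fixed seed are assumptions added for illustration and reproducibility, and the cell above does not seed its shuffle.

```python
import numpy as np

def split_indices(num_data, frac=0.8, seed=0):
    """Shuffle range(num_data) and split it into train/val/test index arrays."""
    rng = np.random.default_rng(seed)  # fixed seed: an addition for reproducibility
    indices = rng.permutation(num_data)
    num_train = int(frac * num_data)
    num_test = (num_data - num_train) // 2
    num_val = num_data - num_train - num_test  # absorbs any rounding remainder
    train = indices[:num_train]
    val = indices[num_train:num_train + num_val]
    test = indices[num_train + num_val:]
    return train, val, test

train, val, test = split_indices(1001)
print(len(train), len(val), len(test))  # 800 101 100
```

Because the validation size is computed as the remainder, the three parts always cover every example exactly once, even when the dataset size is not divisible by ten.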
