### Text classification using ConvNet
Do the same, using a ConvNet.  
The ConvNet should get as input a 2D matrix where each column is an embedding vector of a single word, and words are in order. Use zero padding so that all matrices have a similar length.  
Some songs might be very long. Trim them so you keep a maximum of 128 words (after cleaning stop words and rare words).  
Initialize the embedding layer using the word vectors that you've trained before, but allow them to change during training.  

Extra: Try training the ConvNet with 2 slight modifications:
1. freezing the the weights trained using Word2vec (preventing it from updating)
1. random initialization of the embedding layer

You are encouraged to try this question on your own.  

You might prefer to get ideas from the paper "Convolutional Neural Networks for Sentence Classification" (Kim 2014, [link](https://arxiv.org/abs/1408.5882)).

There are several implementations of the paper code in PyTorch online (see for example [this repo](https://github.com/prakashpandey9/Text-Classification-Pytorch) for a PyTorch implementation of CNN and other architectures for text classification). If you get stuck, they might provide you with a reference for your own code.

In [1]:
import os
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.optim as optim
import torch.utils.data as data_utils

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, plot_confusion_matrix

In [2]:
DATA_FILE = 'lyrics.csv'
DATA_DIR = 'data'
MODELS_DIR = 'models'

MAX_N_WORDS = 128
MAX_FEATURES = 1000

In [3]:
class EpochLogger(CallbackAny2Vec):
    """Callback to log information about training"""
    
    def __init__(self):
        pass

    
w2v = Word2Vec.load(os.path.join(MODELS_DIR, 'w2v.model'))
# w2v.init_sims(replace=True)

df = pd.read_pickle(os.path.join(DATA_DIR, 'lyrics_df.pkl'))
df.drop(df[df.genre == 'Not Available'].index, axis=0, inplace=True)

First we will prepare the data for training - we split the data into train and test, crop lyrics to the the required length and pad them if needed, and convert each lyrics to a one hot, 2d representation using CountVectorizor

In [4]:
def clip_lyrics(row, n_words):
    return row.clean_lyrics[:n_words]


def lyrics_to_embedding(row, n_words, n_features, count_vect):
    
    embedding = count_vect.transform(row).toarray().argmax(axis=1)
    
    n_vects = embedding.shape[0]
    if n_vects < n_words:
        embedding = np.append(embedding, 
                              np.zeros(n_words - n_vects),
                              axis=0)
    return embedding


In [5]:

# df = df.iloc[idx].copy()
df['genre_code'] = df.genre.astype('category').cat.codes
df['cropped_lyrics'] = df.apply(clip_lyrics, args=(MAX_N_WORDS,), axis=1)

X_train, X_test, y_train, y_test = train_test_split(df.cropped_lyrics, 
                                                    df.genre_code, 
                                                    test_size=0.2, random_state=42)


In [6]:
%%time
vocab = w2v.wv.index2entity[:MAX_FEATURES]
count_vect = CountVectorizer(vocabulary=vocab).fit(X_train.str.join(' '))

CPU times: user 8.63 s, sys: 64.6 ms, total: 8.69 s
Wall time: 8.7 s


In [7]:
%%time
X_train = X_train.apply(lyrics_to_embedding, args=(MAX_N_WORDS, MAX_FEATURES, count_vect))
X_test = X_test.apply(lyrics_to_embedding, args=(MAX_N_WORDS, MAX_FEATURES, count_vect))

CPU times: user 1min 26s, sys: 64.8 ms, total: 1min 26s
Wall time: 1min 26s


Next we convert the processed data into dataloaders to be used for our model.

In [8]:
batch_size = 64

train_target = torch.tensor(y_train.values).long()
train = torch.tensor(np.stack(X_train.values)).long()
train_tensor = data_utils.TensorDataset(train, train_target) 
train_loader = data_utils.DataLoader(dataset=train_tensor, batch_size=batch_size, shuffle=True)

test_target = torch.tensor(y_test.values).long()
test = torch.tensor(np.stack(X_test.values)).long()
test_tensor = data_utils.TensorDataset(test, test_target) 
test_loader = data_utils.DataLoader(dataset=test_tensor, batch_size=batch_size, shuffle=True)

We define our model - the model will receive as input a 2d, one hot representation of the lyrics, as a sequence, and will make use of the word2vec embeddings, allowing them to change during training.

In [9]:

class ConvNet(nn.Module):
    def __init__(self, n_labels, initial_weights=None, freeze_weights=False,
                 in_channels=1, out_channels=100, 
                 kernels=[3,4,5], padding=0, stride=1, 
                 keep_probab = 0.5,
                ):
        super(ConvNet, self).__init__()
        
        self.allow_grad = True
        
        if initial_weights is None:
            raise NotImplementedError('Random Initialization Not Implemented Yet')
        else:
            self.init_weights = torch.tensor(initial_weights, dtype=torch.float)
            self.allow_grad = not freeze_weights
            vocab_size, embedding_length = initial_weights.shape

        
        self.kernels = np.asarray(kernels)
        
#         self.lr = lr        
        
        self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
        self.word_embeddings.weight = nn.Parameter(self.init_weights, requires_grad=self.allow_grad)
        
        self.conv1 = nn.Conv2d(in_channels, out_channels, 
                               kernel_size=(kernels[0], embedding_length), 
                               stride=embedding_length, 
                               )
        
        self.conv2 = nn.Conv2d(in_channels, out_channels, 
                               kernel_size=(kernels[1], embedding_length), 
                               stride=embedding_length, 
                               )
        
        self.conv3 = nn.Conv2d(in_channels, out_channels, 
                               kernel_size=(kernels[2], embedding_length), 
                               stride=embedding_length, 
                               )
        
        self.dropout = nn.Dropout(p=keep_probab)
        self.label = nn.Linear(len(kernels)*out_channels, n_labels)
        
    def _conv_block(self, x, conv_layer):
        conv_out = conv_layer(x)  # conv_out.size() = (batch_size, out_channels, dim, 1)
        activation = F.relu(conv_out.squeeze(3))  # activation.size() = (batch_size, out_channels, dim1)
        max_out = F.max_pool1d(activation, activation.size()[2]).squeeze(2)  # maxpool_out.size() = (batch_size, out_channels)
        return max_out

    def forward(self, x):
    
        x = self.word_embeddings(x)
        out = x.unsqueeze(1)
        max_out1 = self._conv_block(out, self.conv1)
        max_out2 = self._conv_block(out, self.conv2)
        max_out3 = self._conv_block(out, self.conv3)

        all_out = torch.cat((max_out1, max_out2, max_out3), 1)
        all_out = all_out.view(all_out.size(0), -1)
        fc_in = self.dropout(all_out)
        
        scores = self.label(fc_in)
        
        scores = F.softmax(scores, dim=1)
        return scores
        


In [10]:
# Use GPU if available, otherwise stick with cpu
use_cuda = torch.cuda.is_available()
torch.manual_seed(123)
device = torch.device("cuda" if use_cuda else "cpu")
print(device)

cuda


In [11]:
output_size = np.unique(y_train).size
bow_vocab = count_vect.vocabulary  

model = ConvNet(initial_weights=w2v.wv[bow_vocab], n_labels=output_size).to(device)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

In [12]:
def test():
    model.eval()  # set evaluation mode
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.cross_entropy(output, target, size_average=False).item() # sum up batch loss
            pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
            correct += (torch.max(pred, 1)[1].view(target.size()).data == target.data).sum().item()
            #correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

In [13]:
def train_loop(epoch, log_interval=200):
    model.train()  # set training mode
    
    iteration = 0
    for ep in range(epoch):
        start = time.time()

        for batch_idx, (X, target) in enumerate(train_loader):
            # bring data to the computing device, e.g. GPU
            X, target = Variable(X).to(device), Variable(target).to(device)
            # forward pass
            output = model(X)
            # compute loss: negative log-likelihood
            loss = F.cross_entropy(output, target)
            
            # backward pass
            # clear the gradients of all tensors being optimized.
            optimizer.zero_grad()
            # accumulate (i.e. add) the gradients from this forward pass
            loss.backward()
            # performs a single optimization step (parameter update)
            optimizer.step()
            
            if iteration % log_interval == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    ep, batch_idx * len(X), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))
            iteration += 1
            
        end = time.time()
        print('{:.2f}s'.format(end-start))
        test() # evaluate at the end of epoch

In [14]:
train_loop(20)

12.68s





Test set: Average loss: 2.0920, Accuracy: 2909/47638 (6%)

11.92s

Test set: Average loss: 2.0650, Accuracy: 2909/47638 (6%)

12.43s

Test set: Average loss: 2.0643, Accuracy: 2909/47638 (6%)

12.20s

Test set: Average loss: 2.0638, Accuracy: 2909/47638 (6%)

11.91s

Test set: Average loss: 2.0640, Accuracy: 2909/47638 (6%)

12.20s

Test set: Average loss: 2.0637, Accuracy: 2909/47638 (6%)

12.12s

Test set: Average loss: 2.0637, Accuracy: 2909/47638 (6%)

12.03s

Test set: Average loss: 2.0636, Accuracy: 2909/47638 (6%)

12.10s

Test set: Average loss: 2.0637, Accuracy: 2909/47638 (6%)

12.23s

Test set: Average loss: 2.0634, Accuracy: 2909/47638 (6%)

12.18s

Test set: Average loss: 2.0634, Accuracy: 2909/47638 (6%)

12.22s

Test set: Average loss: 2.0636, Accuracy: 2909/47638 (6%)

12.74s

Test set: Average loss: 2.0631, Accuracy: 2909/47638 (6%)

12.43s

Test set: Average loss: 2.0621, Accuracy: 2909/47638 (6%)

12.37s

Test set: Average loss: 2.0617, Accuracy: 2909/47638 (6%)

12

Unfortunatly, we were unable to create an efficient architecture for training, and much more tunning of the network is needed.