### Deep Learning approach using PyTorch

***Introduction***:
This is a simple deep learning approach to a text classification problem. It is inefficient: it takes a long time to train and uses only the "text" column. After 100 epochs it reaches roughly 79% accuracy on the validation set and a training loss of about 0.27.


# **Imports**
We need the following packages:
* pandas for loading the dataset
* PyTorch for building our deep learning model
* numpy for performing some calculations
* tqdm for showing training progress
* nltk for pre-processing (removing stop words)
* re for pre-processing (filtering strings)

In [1]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm
import numpy as np
import nltk
from nltk.corpus import stopwords
import re

# **Hyperparameters**
We need to define hyperparameters such as the batch size and learning rate. This is done at the start of the code rather than later for readability.

Here is a list of what they do:
1. BATCH_SIZE : How many items are used in each training step
2. LEARNING_RATE : The learning rate, i.e. how strongly the model's weights are updated in each training step
3. EPOCHS : How many passes over the training data the model trains for
4. VAL_SPLIT : The fraction of train.csv held out for validation. If it is set to 0.05, the first 95% of the rows in train.csv are used for training and the remaining 5% are used to measure the accuracy.

In [2]:
BATCH_SIZE = 64
LEARNING_RATE = 0.01
EPOCHS = 100
VAL_SPLIT = 0.05
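
To make VAL_SPLIT and BATCH_SIZE concrete, here is the arithmetic as a short sketch. The row count of 7613 is an assumption about train.csv (it is not stated in this notebook); the rounding mirrors the slicing used in the Dataset class.

```python
import math

BATCH_SIZE = 64
VAL_SPLIT = 0.05

n_rows = 7613  # assumed number of rows in train.csv, not taken from the notebook

# The first (1 - VAL_SPLIT) share of the rows is used for training, the rest for validation.
n_train = round(n_rows * (1 - VAL_SPLIT))
n_val = n_rows - n_train

# Number of batches the training DataLoader yields per epoch (the last one may be smaller).
batches_per_epoch = math.ceil(n_train / BATCH_SIZE)

print(n_train, n_val, batches_per_epoch)
# 7232 381 113
```

Under that assumption the result is 113 batches per epoch, which matches the 113/113 shown by the progress bar in the training output.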

# **Downloading list of stop words**
Stop words are very common words that add little meaning to a sentence. They are often removed before training NLP models.

Here is a sample list of stop words: and, but, how, in, on, or, the, what, will

Example, with the parts that get removed in bold:
**What is a** Neural Network**?**

The code below downloads a list of English stop words using nltk.

In [3]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jeffwa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
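
The effect of stop-word removal on the example sentence can be sketched without downloading anything. The set `sample_stop_words` and the helper `remove_stop_words` are hypothetical and only for illustration; the notebook itself uses the much larger nltk list. The question mark is left out here because punctuation is handled later, in the cleaning step.

```python
# Hand-picked stop words for illustration only (a tiny subset of what nltk provides).
sample_stop_words = {"what", "is", "a", "and", "but", "how", "in", "on", "or", "the", "will"}

def remove_stop_words(sentence, stop_words):
    """Lowercase the sentence, split it on whitespace and drop the stop words."""
    words = sentence.lower().split()
    return [word for word in words if word not in stop_words]

print(remove_stop_words("What is a Neural Network", sample_stop_words))
# ['neural', 'network']
```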


# **Building the encoder/decoder function**
Next we create the encoder and decoder functions. While we don't need a decoder function for text classification, I have included it anyway.
The encoder function splits the string into words, drops the stop words and looks up the index of each remaining word in the vocabulary. This way we get a list of integers that can be passed through the embedding layer of the model.

In [4]:
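vocabulary = []  # global word list, filled by add_vocab() of the Dataset class

def encode(text):
    # Map every non-stop word to its position in the vocabulary.
    # list.index is a linear search, so this gets slow for large vocabularies.
    return [vocabulary.index(word) for word in text.split(' ') if word not in stop_words]


def decode(seq):
    # Join with spaces so that the decoded words stay separated.
    return " ".join([vocabulary[index] for index in seq])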
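
Looking words up with `vocabulary.index` scans the whole list for every word, which is part of why this approach is slow. One possible alternative, sketched below, keeps a dictionary from word to index and reserves index 0 for padding so that a real word never shares the padding value. The names `PAD_TOKEN`, `word_to_index`, `index_to_word`, `add_words`, `encode_fast` and `decode_fast` are hypothetical and not part of the notebook.

```python
PAD_TOKEN = "<pad>"  # hypothetical placeholder that occupies index 0

word_to_index = {PAD_TOKEN: 0}
index_to_word = [PAD_TOKEN]

def add_words(text, stop_words=frozenset()):
    """Register every new non-stop word of `text` in both lookup tables."""
    for word in text.split():
        if word not in stop_words and word not in word_to_index:
            word_to_index[word] = len(index_to_word)
            index_to_word.append(word)

def encode_fast(text, stop_words=frozenset()):
    """Turn a string into a list of indices using one dictionary lookup per word."""
    return [word_to_index[word] for word in text.split() if word not in stop_words]

def decode_fast(seq):
    """Turn indices back into a space-separated string, skipping padding."""
    return " ".join(index_to_word[index] for index in seq if index != 0)

add_words("forest fire near town", stop_words={"near"})
encoded = encode_fast("forest fire near town", stop_words={"near"})
print(encoded)                     # [1, 2, 3]
print(decode_fast(encoded + [0]))  # forest fire town
```

Like the original `encode`, this sketch raises an error for a word that was never added to the vocabulary.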

Our Dataset class holds the train, test or validation dataset. The train and validation datasets are obtained by splitting train.csv as specified by VAL_SPLIT.

There are five main functions:
* **collate_fn**: The collate_fn function is passed to the DataLoader later. We need this because all items in a batch need to have the same length. All sentences are padded with zeros to the length of the longest sentence in the batch using the pad_sequence function provided by PyTorch.
* **len**: Returns the length of the dataset. This is required by PyTorch
* **getitem**: Returns an item from the dataset
* **clean**: This function pre-processes the dataset. First, it converts the entire dataset to lower case. Why? Because **HeLlO** and **hello** are the same thing. Then all usernames (an @ followed by word characters) are stripped since they add little meaning. After that we remove every character that is not a letter or whitespace, using a [regular expression](https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference). Finally, we build the vocabulary and apply the encoding to the entire text column.
* **add_vocab**: This adds every word that is not a stop word to the vocabulary
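
The cleaning steps can be tried on a single string. This sketch uses only `re` and applies the steps in the same order as the `clean` method: lower-casing, removing @usernames, then removing everything that is not a letter or whitespace. The sample tweet and the function name `clean_text` are made up for illustration.

```python
import re

def clean_text(text):
    """Apply the three cleaning steps to one string."""
    text = text.lower()                      # 'HeLlO' and 'hello' become the same token
    text = re.sub(r'@\w+', '', text)         # strip @usernames
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop digits, punctuation and other symbols
    return text

print(repr(clean_text("@user123 Fire near the A40, STAY SAFE!!")))
# ' fire near the a stay safe'
```

The space left behind by the removed username means that `split(' ')` produces an empty string as the first token; `str.split()` without arguments would avoid that.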

In [5]:
class dataset(Dataset):
    def __init__(self, split='train'):
        self.split = split
        
        if split == 'train':
            self.dataset = pd.read_csv('train.csv')
            # On Kaggle: self.dataset = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
            self.dataset = self.dataset.iloc[:round(len(self.dataset) * (1 - VAL_SPLIT))]
            
            self.clean()
            
            self.X = self.dataset['text']
            self.Y = self.dataset['target']
        elif split == 'val':
            self.dataset = pd.read_csv('train.csv')
            # On Kaggle: self.dataset = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
            self.dataset = self.dataset.iloc[round(len(self.dataset) * (1 - VAL_SPLIT)):]
            
            self.clean()
            
            self.X = self.dataset['text']
            self.Y = self.dataset['target']
        else:
            self.dataset = pd.read_csv('test.csv')
            # On Kaggle: self.dataset = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
            self.clean()
            
            self.X = self.dataset['text']
    
    def add_vocab(self):
        # `text` instead of `str` so the built-in type is not shadowed
        for text in self.dataset['text']:
            for item in text.split(' '):
                if item not in vocabulary and item not in stop_words:
                    vocabulary.append(item)
            
    def clean(self):
        self.dataset['text'] = self.dataset['text'].str.lower()
        self.dataset['text'] = self.dataset['text'].str.replace(r'@\w+', '', regex=True) # remove usernames (raw string avoids an invalid-escape warning)
        self.dataset['text'] = self.dataset['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
        
        self.add_vocab()
        
        self.dataset['text'] = self.dataset['text'].apply(encode)
        
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        if self.split == 'train' or self.split == 'val':
            return torch.tensor(self.X.iloc[idx], dtype=torch.long), torch.tensor(self.Y.iloc[idx], dtype=torch.long)
        else:
            return torch.tensor(self.X.iloc[idx], dtype=torch.long), torch.tensor(self.dataset['id'].iloc[idx], dtype=torch.int)

def collate_fn(batch):
    x, y = zip(*batch)
    return pad_sequence(x, batch_first=True, padding_value=0), torch.tensor(y)
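
What `pad_sequence` does inside `collate_fn` can be illustrated without PyTorch. The plain-Python sketch below pads every sequence in a batch with zeros up to the length of the longest one; `pad_batch` is a hypothetical helper for illustration, not a PyTorch function.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad each list with `padding_value` to the length of the longest list."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 2, 9], [7], [3, 4]]
print(pad_batch(batch))
# [[5, 2, 9], [7, 0, 0], [3, 4, 0]]
```

Because the notebook's vocabulary starts counting at 0, the padding value 0 is also the index of the first word in the vocabulary, so the model cannot tell the two apart.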


# **Creating the Model**
Our model consists of one [Embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) and four [Fully connected layers](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html). The embedding layer turns each word index into a 256-dimensional vector, which adds a dimension to our input. We therefore use torch.mean over the sequence dimension to reduce each tweet to a single vector before the final layer.

In [6]:
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.embedding = nn.Embedding(len(vocabulary), 256) # vocab size x embedding dimension
        self.fc1 = nn.Linear(256, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 512)
        self.fc4 = nn.Linear(512, 2)
        
    def forward(self, x):
        x = self.embedding(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = torch.mean(x, dim=1) 
        x = self.fc4(x)
        
        return x
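
The reduction done by `torch.mean(x, dim=1)` is an element-wise average over all positions of a tweet, padded positions included. The plain-Python sketch below shows the idea with toy 2-dimensional vectors; `mean_pool` and all values are made up for illustration. In the real model the padded positions are not zero vectors, since they carry the embedding of index 0 passed through the first three layers.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Two word vectors of one tweet, followed by one padded position (toy values).
word_vectors = [[3.0, 6.0], [3.0, 3.0]]
padding_vector = [0.0, 0.0]

print(mean_pool(word_vectors))                     # [3.0, 4.5]
print(mean_pool(word_vectors + [padding_vector]))  # [2.0, 3.0]
```

The same tweet therefore gets a different pooled vector depending on how much padding its batch needs. Averaging only over the non-padded positions would be one way to avoid that.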
        

# **Training the model**
First we create three instances of the Dataset class: one for training, one for validation and one for testing. Then we create a DataLoader for each of them.

Next we create a new instance of our model, an SGD optimizer and a CrossEntropyLoss function.

We define a function that calculates the accuracy as the share of correct predictions in a batch.

We define a function that evaluates the model's performance on the validation dataset using the previously defined function.

Then for each epoch we loop through the batches and calculate the loss between the target batch and the network's predictions. We compute the gradients using backpropagation and let the optimizer adjust the network's weights and biases. The model's accuracy on the validation dataset is assessed after each epoch.


In [7]:
train_dataset = dataset(split='train')
trainDataLoader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

val_dataset = dataset(split='val')
valDataLoader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

test_dataset = dataset(split='test')
testDataLoader = DataLoader(test_dataset, batch_size=1, shuffle=False)

net = Classifier()
optimizer = optim.SGD(net.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()

def calculate_accuracy(predictions, targets):
    predicted_labels = torch.argmax(predictions, dim=1)
    correct_predictions = (predicted_labels == targets).sum().item()
    total_predictions = targets.size(0)
    accuracy = correct_predictions / total_predictions
    return accuracy

def evaluate():
    accuracies = []
    for x, y in valDataLoader:
        # No gradients are needed for evaluation.
        with torch.no_grad():
            accuracies.append(calculate_accuracy(net(x), y))

    # Unweighted mean over batches: a smaller last batch counts as much as a full one.
    print(f'Validation accuracy {np.average(accuracies)}')
    
for epoch in range(EPOCHS):
    progress_bar = tqdm(trainDataLoader)
    
    accuracies = []
    losses = []
    
    for index, (x, y) in enumerate(progress_bar):        
        optimizer.zero_grad()
        
        output = net(x)
        
        loss = criterion(output, y)
        
        
        loss.backward()
        optimizer.step()
        
        losses.append(loss.item())
        accuracies.append(calculate_accuracy(output, y))
        
        progress_bar.set_description(f'Epoch [{epoch+1}/{EPOCHS}]')
        progress_bar.set_postfix(loss=np.average(losses), acc=np.average(accuracies))
        
    evaluate()

Epoch [1/100]: 100%|███| 113/113 [00:03<00:00, 35.54it/s, acc=0.572, loss=0.681]


Validation accuracy 0.45436304644808745


Epoch [2/100]: 100%|███| 113/113 [00:03<00:00, 36.20it/s, acc=0.576, loss=0.676]


Validation accuracy 0.45436304644808745


Epoch [3/100]: 100%|███| 113/113 [00:03<00:00, 34.78it/s, acc=0.576, loss=0.675]


Validation accuracy 0.4553876366120219


Epoch [4/100]: 100%|███| 113/113 [00:02<00:00, 42.22it/s, acc=0.576, loss=0.674]


Validation accuracy 0.453978825136612


Epoch [5/100]: 100%|███| 113/113 [00:02<00:00, 43.42it/s, acc=0.576, loss=0.674]


Validation accuracy 0.453978825136612


Epoch [6/100]: 100%|███| 113/113 [00:02<00:00, 43.26it/s, acc=0.574, loss=0.672]


Validation accuracy 0.46166325136612024


Epoch [7/100]: 100%|███| 113/113 [00:02<00:00, 43.29it/s, acc=0.576, loss=0.671]


Validation accuracy 0.46687158469945356


Epoch [8/100]: 100%|███| 113/113 [00:02<00:00, 44.08it/s, acc=0.573, loss=0.671]


Validation accuracy 0.45594262295081966


Epoch [9/100]: 100%|████| 113/113 [00:02<00:00, 44.42it/s, acc=0.578, loss=0.67]


Validation accuracy 0.47169569672131145


Epoch [10/100]: 100%|██| 113/113 [00:02<00:00, 43.78it/s, acc=0.576, loss=0.671]


Validation accuracy 0.47780054644808745


Epoch [11/100]: 100%|███| 113/113 [00:02<00:00, 45.17it/s, acc=0.572, loss=0.67]


Validation accuracy 0.48262465846994534


Epoch [12/100]: 100%|███| 113/113 [00:02<00:00, 42.97it/s, acc=0.575, loss=0.67]


Validation accuracy 0.4832650273224044


Epoch [13/100]: 100%|██| 113/113 [00:02<00:00, 44.38it/s, acc=0.579, loss=0.668]


Validation accuracy 0.4802766393442623


Epoch [14/100]: 100%|██| 113/113 [00:02<00:00, 44.50it/s, acc=0.577, loss=0.669]


Validation accuracy 0.4723360655737705


Epoch [15/100]: 100%|██| 113/113 [00:02<00:00, 42.66it/s, acc=0.577, loss=0.669]


Validation accuracy 0.4729764344262295


Epoch [16/100]: 100%|██| 113/113 [00:02<00:00, 41.99it/s, acc=0.581, loss=0.668]


Validation accuracy 0.4750683060109289


Epoch [17/100]: 100%|██| 113/113 [00:02<00:00, 39.72it/s, acc=0.575, loss=0.667]


Validation accuracy 0.48040471311475413


Epoch [18/100]: 100%|██| 113/113 [00:02<00:00, 42.48it/s, acc=0.578, loss=0.668]


Validation accuracy 0.4744279371584699


Epoch [19/100]: 100%|██| 113/113 [00:02<00:00, 44.61it/s, acc=0.579, loss=0.667]


Validation accuracy 0.4821123633879781


Epoch [20/100]: 100%|██| 113/113 [00:02<00:00, 42.16it/s, acc=0.581, loss=0.666]


Validation accuracy 0.4952612704918033


Epoch [21/100]: 100%|██| 113/113 [00:02<00:00, 43.69it/s, acc=0.581, loss=0.668]


Validation accuracy 0.493041325136612


Epoch [22/100]: 100%|██| 113/113 [00:02<00:00, 42.74it/s, acc=0.582, loss=0.667]


Validation accuracy 0.49291325136612024


Epoch [23/100]: 100%|██| 113/113 [00:02<00:00, 44.63it/s, acc=0.578, loss=0.667]


Validation accuracy 0.5040983606557378


Epoch [24/100]: 100%|██| 113/113 [00:02<00:00, 42.77it/s, acc=0.578, loss=0.666]


Validation accuracy 0.48561304644808745


Epoch [25/100]: 100%|███| 113/113 [00:02<00:00, 42.53it/s, acc=0.58, loss=0.665]


Validation accuracy 0.49667008196721313


Epoch [26/100]: 100%|██| 113/113 [00:02<00:00, 40.32it/s, acc=0.586, loss=0.665]


Validation accuracy 0.49094945355191255


Epoch [27/100]: 100%|██| 113/113 [00:02<00:00, 41.17it/s, acc=0.577, loss=0.665]


Validation accuracy 0.49419398907103823


Epoch [28/100]: 100%|███| 113/113 [00:02<00:00, 42.83it/s, acc=0.58, loss=0.664]


Validation accuracy 0.4886014344262295


Epoch [29/100]: 100%|██| 113/113 [00:02<00:00, 41.95it/s, acc=0.586, loss=0.664]


Validation accuracy 0.4796362704918033


Epoch [30/100]: 100%|██| 113/113 [00:02<00:00, 41.85it/s, acc=0.588, loss=0.662]


Validation accuracy 0.49278517759562845


Epoch [31/100]: 100%|██| 113/113 [00:02<00:00, 41.92it/s, acc=0.589, loss=0.664]


Validation accuracy 0.48497267759562845


Epoch [32/100]: 100%|██| 113/113 [00:02<00:00, 43.73it/s, acc=0.587, loss=0.662]


Validation accuracy 0.49824965846994534


Epoch [33/100]: 100%|███| 113/113 [00:02<00:00, 43.99it/s, acc=0.59, loss=0.661]


Validation accuracy 0.4853568989071038


Epoch [34/100]: 100%|████| 113/113 [00:02<00:00, 42.82it/s, acc=0.6, loss=0.659]


Validation accuracy 0.48757684426229514


Epoch [35/100]: 100%|██| 113/113 [00:02<00:00, 41.43it/s, acc=0.593, loss=0.659]


Validation accuracy 0.4920167349726776


Epoch [36/100]: 100%|██| 113/113 [00:02<00:00, 42.01it/s, acc=0.595, loss=0.658]


Validation accuracy 0.5003415300546448


Epoch [37/100]: 100%|██| 113/113 [00:02<00:00, 43.22it/s, acc=0.594, loss=0.659]


Validation accuracy 0.506958674863388


Epoch [38/100]: 100%|██| 113/113 [00:02<00:00, 43.14it/s, acc=0.598, loss=0.658]


Validation accuracy 0.49082137978142076


Epoch [39/100]: 100%|██| 113/113 [00:02<00:00, 42.74it/s, acc=0.604, loss=0.655]


Validation accuracy 0.5059340846994536


Epoch [40/100]: 100%|██| 113/113 [00:02<00:00, 43.35it/s, acc=0.604, loss=0.655]


Validation accuracy 0.5096909153005464


Epoch [41/100]: 100%|██| 113/113 [00:02<00:00, 41.47it/s, acc=0.609, loss=0.652]


Validation accuracy 0.5003415300546448


Epoch [42/100]: 100%|██| 113/113 [00:02<00:00, 41.52it/s, acc=0.612, loss=0.652]


Validation accuracy 0.521943306010929


Epoch [43/100]: 100%|███| 113/113 [00:02<00:00, 42.79it/s, acc=0.61, loss=0.649]


Validation accuracy 0.5150273224043715


Epoch [44/100]: 100%|██| 113/113 [00:02<00:00, 40.57it/s, acc=0.614, loss=0.649]


Validation accuracy 0.5064463797814208


Epoch [45/100]: 100%|██| 113/113 [00:02<00:00, 41.22it/s, acc=0.615, loss=0.648]


Validation accuracy 0.5091786202185792


Epoch [46/100]: 100%|██| 113/113 [00:02<00:00, 40.10it/s, acc=0.624, loss=0.645]


Validation accuracy 0.5257001366120219


Epoch [47/100]: 100%|██| 113/113 [00:02<00:00, 41.41it/s, acc=0.632, loss=0.642]


Validation accuracy 0.5350922131147541


Epoch [48/100]: 100%|██| 113/113 [00:02<00:00, 42.30it/s, acc=0.638, loss=0.639]


Validation accuracy 0.5432889344262295


Epoch [49/100]: 100%|██| 113/113 [00:02<00:00, 42.16it/s, acc=0.637, loss=0.637]


Validation accuracy 0.5487534153005464


Epoch [50/100]: 100%|███| 113/113 [00:02<00:00, 42.10it/s, acc=0.64, loss=0.634]


Validation accuracy 0.5746670081967213


Epoch [51/100]: 100%|██| 113/113 [00:02<00:00, 41.46it/s, acc=0.651, loss=0.629]


Validation accuracy 0.5798753415300547


Epoch [52/100]: 100%|███| 113/113 [00:02<00:00, 42.75it/s, acc=0.66, loss=0.624]


Validation accuracy 0.6552681010928961


Epoch [53/100]: 100%|███| 113/113 [00:02<00:00, 40.43it/s, acc=0.668, loss=0.62]


Validation accuracy 0.6329832650273224


Epoch [54/100]: 100%|██| 113/113 [00:02<00:00, 42.02it/s, acc=0.672, loss=0.615]


Validation accuracy 0.6525358606557378


Epoch [55/100]: 100%|██| 113/113 [00:02<00:00, 41.45it/s, acc=0.679, loss=0.608]


Validation accuracy 0.6646174863387978


Epoch [56/100]: 100%|██| 113/113 [00:02<00:00, 41.42it/s, acc=0.686, loss=0.601]


Validation accuracy 0.62713456284153


Epoch [57/100]: 100%|██| 113/113 [00:02<00:00, 41.68it/s, acc=0.692, loss=0.593]


Validation accuracy 0.6763575819672131


Epoch [58/100]: 100%|██| 113/113 [00:02<00:00, 40.71it/s, acc=0.703, loss=0.586]


Validation accuracy 0.6922387295081966


Epoch [59/100]: 100%|██| 113/113 [00:02<00:00, 41.88it/s, acc=0.709, loss=0.576]


Validation accuracy 0.6875426912568305


Epoch [60/100]: 100%|███| 113/113 [00:02<00:00, 42.99it/s, acc=0.72, loss=0.566]


Validation accuracy 0.7040642076502732


Epoch [61/100]: 100%|██| 113/113 [00:02<00:00, 42.10it/s, acc=0.728, loss=0.558]


Validation accuracy 0.6932633196721311


Epoch [62/100]: 100%|██| 113/113 [00:02<00:00, 41.00it/s, acc=0.736, loss=0.549]


Validation accuracy 0.7109801912568305


Epoch [63/100]: 100%|██| 113/113 [00:02<00:00, 40.53it/s, acc=0.741, loss=0.539]


Validation accuracy 0.7220372267759562


Epoch [64/100]: 100%|██| 113/113 [00:02<00:00, 40.84it/s, acc=0.747, loss=0.532]


Validation accuracy 0.7429986338797815


Epoch [65/100]: 100%|██| 113/113 [00:02<00:00, 39.94it/s, acc=0.754, loss=0.521]


Validation accuracy 0.7329661885245903


Epoch [66/100]: 100%|██| 113/113 [00:02<00:00, 41.06it/s, acc=0.756, loss=0.512]


Validation accuracy 0.7234887295081966


Epoch [67/100]: 100%|██| 113/113 [00:02<00:00, 42.17it/s, acc=0.767, loss=0.505]


Validation accuracy 0.7375341530054644


Epoch [68/100]: 100%|██| 113/113 [00:02<00:00, 41.53it/s, acc=0.768, loss=0.495]


Validation accuracy 0.7374060792349727


Epoch [69/100]: 100%|███| 113/113 [00:02<00:00, 40.36it/s, acc=0.773, loss=0.49]


Validation accuracy 0.7456028005464481


Epoch [70/100]: 100%|██| 113/113 [00:02<00:00, 42.87it/s, acc=0.777, loss=0.483]


Validation accuracy 0.7534153005464481


Epoch [71/100]: 100%|██| 113/113 [00:02<00:00, 41.96it/s, acc=0.781, loss=0.475]


Validation accuracy 0.7483350409836066


Epoch [72/100]: 100%|██| 113/113 [00:02<00:00, 40.94it/s, acc=0.784, loss=0.472]


Validation accuracy 0.7617400956284154


Epoch [73/100]: 100%|██| 113/113 [00:02<00:00, 43.31it/s, acc=0.787, loss=0.464]


Validation accuracy 0.7508111338797815


Epoch [74/100]: 100%|██| 113/113 [00:02<00:00, 42.72it/s, acc=0.797, loss=0.456]


Validation accuracy 0.7479508196721311


Epoch [75/100]: 100%|██| 113/113 [00:02<00:00, 41.16it/s, acc=0.796, loss=0.452]


Validation accuracy 0.7584955601092895


Epoch [76/100]: 100%|██| 113/113 [00:02<00:00, 40.96it/s, acc=0.799, loss=0.444]


Validation accuracy 0.7687841530054644


Epoch [77/100]: 100%|██| 113/113 [00:02<00:00, 41.97it/s, acc=0.805, loss=0.437]


Validation accuracy 0.7502988387978142


Epoch [78/100]: 100%|██| 113/113 [00:02<00:00, 41.29it/s, acc=0.808, loss=0.431]


Validation accuracy 0.7405225409836066


Epoch [79/100]: 100%|██| 113/113 [00:02<00:00, 41.07it/s, acc=0.812, loss=0.421]


Validation accuracy 0.7825734289617486


Epoch [80/100]: 100%|██| 113/113 [00:02<00:00, 43.31it/s, acc=0.816, loss=0.417]


Validation accuracy 0.7687841530054644


Epoch [81/100]: 100%|██| 113/113 [00:02<00:00, 42.22it/s, acc=0.819, loss=0.411]


Validation accuracy 0.7717725409836066


Epoch [82/100]: 100%|██| 113/113 [00:02<00:00, 38.53it/s, acc=0.821, loss=0.408]


Validation accuracy 0.7584955601092895


Epoch [83/100]: 100%|██| 113/113 [00:02<00:00, 41.64it/s, acc=0.829, loss=0.394]


Validation accuracy 0.7840249316939891


Epoch [84/100]: 100%|██| 113/113 [00:02<00:00, 42.49it/s, acc=0.834, loss=0.388]


Validation accuracy 0.7663080601092895


Epoch [85/100]: 100%|██| 113/113 [00:02<00:00, 40.99it/s, acc=0.837, loss=0.381]


Validation accuracy 0.7897455601092895


Epoch [86/100]: 100%|██| 113/113 [00:02<00:00, 43.48it/s, acc=0.843, loss=0.371]


Validation accuracy 0.7690403005464481


Epoch [87/100]: 100%|██| 113/113 [00:02<00:00, 42.03it/s, acc=0.843, loss=0.366]


Validation accuracy 0.779969262295082


Epoch [88/100]: 100%|███| 113/113 [00:02<00:00, 42.51it/s, acc=0.85, loss=0.357]


Validation accuracy 0.7798411885245903


Epoch [89/100]: 100%|██| 113/113 [00:02<00:00, 40.71it/s, acc=0.849, loss=0.357]


Validation accuracy 0.7847933743169399


Epoch [90/100]: 100%|██| 113/113 [00:02<00:00, 41.70it/s, acc=0.856, loss=0.348]


Validation accuracy 0.7842810792349727


Epoch [91/100]: 100%|███| 113/113 [00:02<00:00, 41.74it/s, acc=0.86, loss=0.335]


Validation accuracy 0.7902578551912569


Epoch [92/100]: 100%|██| 113/113 [00:02<00:00, 41.53it/s, acc=0.865, loss=0.329]


Validation accuracy 0.8023821721311476


Epoch [93/100]: 100%|██| 113/113 [00:02<00:00, 41.75it/s, acc=0.869, loss=0.322]


Validation accuracy 0.7922216530054644


Epoch [94/100]: 100%|██| 113/113 [00:02<00:00, 42.95it/s, acc=0.873, loss=0.316]


Validation accuracy 0.8004183743169399


Epoch [95/100]: 100%|██| 113/113 [00:02<00:00, 42.43it/s, acc=0.877, loss=0.306]


Validation accuracy 0.7872694672131147


Epoch [96/100]: 100%|██| 113/113 [00:02<00:00, 40.81it/s, acc=0.872, loss=0.311]


Validation accuracy 0.7850495218579235


Epoch [97/100]: 100%|██| 113/113 [00:02<00:00, 41.39it/s, acc=0.878, loss=0.298]


Validation accuracy 0.7945696721311476


Epoch [98/100]: 100%|██| 113/113 [00:02<00:00, 42.13it/s, acc=0.877, loss=0.293]


Validation accuracy 0.7971738387978142


Epoch [99/100]: 100%|██| 113/113 [00:02<00:00, 41.28it/s, acc=0.887, loss=0.286]


Validation accuracy 0.7902578551912569


Epoch [100/100]: 100%|█| 113/113 [00:02<00:00, 42.73it/s, acc=0.894, loss=0.271]


Validation accuracy 0.7929900956284154
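
One detail of `evaluate()` is worth spelling out: it averages the per-batch accuracies, so a smaller final batch counts as much as a full one. The sketch below contrasts that with dividing the total number of correct predictions by the total number of samples. The batch sizes and counts of correct predictions are made-up numbers for illustration, not results from this notebook.

```python
# (correct predictions, batch size) for three hypothetical validation batches.
# The last batch is smaller, as happens when the dataset size is not a multiple of BATCH_SIZE.
batches = [(48, 64), (48, 64), (2, 8)]

# What evaluate() reports: the unweighted mean of the per-batch accuracies.
mean_of_batch_accuracies = sum(correct / size for correct, size in batches) / len(batches)

# Sample-weighted alternative: total correct over total samples.
overall_accuracy = sum(correct for correct, _ in batches) / sum(size for _, size in batches)

print(round(mean_of_batch_accuracies, 3))  # 0.583
print(round(overall_accuracy, 3))          # 0.721
```

With only a handful of validation batches, as in this notebook, the difference between the two numbers can be noticeable.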


# **Submission**

We loop through the test DataLoader, predict a label for each tweet and collect the results in a list. A pandas DataFrame is then built from that list and written to submission.csv.

In [8]:
rows = []

# Gradients are not needed for inference.
with torch.no_grad():
    for text, tweet_id in testDataLoader:
        output = net(text)
        rows.append({'id': tweet_id.item(), 'target': torch.argmax(output).item()})

# Building the DataFrame once avoids concatenating inside the loop.
results = pd.DataFrame(rows, columns=['id', 'target'])
results.to_csv('submission.csv', index=False)
