# Feedforward Neural Networks

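A feedforward neural network is an artificial neural network in which the connections between nodes do not form cycles: information flows in one direction, from the input to the output. A convolutional neural network (CNN) is a typical example of a feedforward neural network.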

This notebook explores logistic regression and feedforward neural networks for binary text classification, using the PyTorch library.

## Recommended Preparation

Before starting this tutorial, you should have PyTorch installed
and have a basic understanding of tensors:

-  For installation instructions: https://pytorch.org/ 
-  Get started with PyTorch in general and learn the basics of Tensors: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
-  For a wide and deep overview: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

Credits: This notebook is based on the following notebook, which explains and implements every step of a feedforward neural network: https://github.com/dbamman/anlp21/blob/main/9.neural/FFNN.ipynb

## Data

The ``Datasets/large_movie_review_dataset`` directory contains the large movie review dataset we've explored before, split into train.tsv, dev.tsv, and test.tsv.

In [3]:
pip install torch


In [4]:
from collections import Counter
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
import torch
import torch.nn as nn
import random
import pandas as pd

In [5]:
def read_data(filename, max_data_points=None):
    """Read tab-separated (label, text) rows and return shuffled texts X and labels Y."""
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=cols[1]
            X.append(text)
            Y.append(label)

    # shuffle texts and labels together so each text keeps its label
    tmp = list(zip(X, Y))
    random.shuffle(tmp)
    X, Y = zip(*tmp)
    
    if max_data_points is None:
        return X, Y
    
    return X[:max_data_points], Y[:max_data_points]
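
As a quick illustration of the file layout `read_data` expects, the sketch below parses a single line in the same way. The label string `pos` and the review text are made up for this example; the actual label values in the dataset files may differ.

```python
# A made-up line in the assumed "label<TAB>text" layout (not taken from the dataset).
line = "pos\tA moving, well-acted film.\n"

# Same parsing steps as read_data: strip trailing whitespace, then split on tabs.
cols = line.rstrip().split("\t")
label, text = cols[0], cols[1]

print(label)
print(text)
```

Note that a review containing a tab character would be split into more than two columns, in which case only the part before the second tab ends up in `text`.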

In [6]:
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="Datasets/large_movie_review_dataset"

We'll limit the training and dev data to at most 10,000 data points each for this exercise (the dev set has only 5,000, so it is used in full).

In [7]:
trainX, trainY=read_data("%s/train.tsv" % directory, max_data_points=10000)

In [8]:
devX, devY=read_data("%s/dev.tsv" % directory, max_data_points=10000)

We'll represent each review as a vector of simple binary indicators, one for each of the 1,000 most frequent words in the vocabulary.

In [9]:
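# note: because analyzer is a callable (str.split), scikit-learn applies it to the raw text directly,
# so lowercase and strip_accents have no effect here; tokens are case-sensitive whitespace splits
vectorizer = CountVectorizer(max_features=1000, analyzer=str.split, lowercase=True, strip_accents=None, binary=True)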
X_train = vectorizer.fit_transform(trainX)
X_dev = vectorizer.transform(devX)

_,vocabSize=X_train.shape

le = preprocessing.LabelEncoder()
le.fit(trainY)

Y_train=le.transform(trainY)
Y_dev=le.transform(devY)

In [10]:
print("Shape of X_Train and Y_Train: ", X_train.shape, Y_train.shape)
print("Shape of X_dev and Y_dev: ", X_dev.shape, Y_dev.shape)

Shape of X_Train and Y_Train:  (10000, 1000) (10000,)
Shape of X_dev and Y_dev:  (5000, 1000) (5000,)
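
To make the binary-indicator representation concrete, here is a minimal hand-rolled sketch on a toy corpus. The three documents and the cutoff `k = 3` are invented for illustration, and the vocabulary selection is a simplified stand-in rather than `CountVectorizer`'s exact rule.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = ["a great great movie", "a dull movie", "great acting"]

# Count in how many documents each whitespace-separated token appears.
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

# Keep the k most frequent tokens; sort them for a stable column order.
k = 3
vocab = sorted(tok for tok, _ in doc_freq.most_common(k))

# One row per document, one column per vocabulary word:
# 1 if the word occurs in the document (however many times), else 0.
features = [[1 if tok in set(doc.split()) else 0 for tok in vocab] for doc in docs]

print(vocab)
for row in features:
    print(row)
```

The repeated word "great" in the first document still yields a single 1, which is what `binary=True` requests; words outside the kept vocabulary ("dull", "acting") are simply dropped.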


In [11]:
# split the data into a list of batches, each holding batch_size consecutive examples

def get_batches(x, y, batch_size=12):
    batches_x=[]
    batches_y=[]
    for i in range(0, len(x), batch_size):
        batches_x.append(x[i:i+batch_size])
        batches_y.append(y[i:i+batch_size])
    
    return batches_x, batches_y

We will convert the sparse matrices to dense matrices and then convert those to tensors for PyTorch.

In [12]:
train_batches_x, train_batches_y = get_batches(torch.from_numpy(X_train.todense()).float(), torch.LongTensor(Y_train))
dev_batches_x, dev_batches_y = get_batches(torch.from_numpy(X_dev.todense()).float(), torch.LongTensor(Y_dev))

In [13]:
# Print X and Y to see, step by step, how the data is transformed before it is fed to the model,
# and to check how many batches there are and how large each one is
print(torch.from_numpy(X_train.todense()).float())
print(torch.LongTensor(Y_train))
print(len(train_batches_x), len(train_batches_x[0]), len(train_batches_y))

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 1., 1.],
        [1., 0., 0.,  ..., 0., 0., 0.]])
tensor([0, 0, 0,  ..., 1, 0, 1])
834 12 834
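
The `834 12 834` above follows from simple arithmetic: 10,000 examples in batches of 12 give 833 full batches plus one smaller final batch. A small sketch of that calculation, where `batch_sizes` is a hypothetical helper that mirrors the slicing in `get_batches`:

```python
import math

def batch_sizes(n_examples, batch_size=12):
    """Sizes of the consecutive slices taken when stepping through n_examples."""
    return [min(batch_size, n_examples - start)
            for start in range(0, n_examples, batch_size)]

sizes = batch_sizes(10000, 12)

# Number of batches, size of the first batch, size of the last batch.
print(len(sizes), sizes[0], sizes[-1])
print(math.ceil(10000 / 12))
```

The last batch holds only 4 examples, so any code that assumes a fixed batch size of 12 would be wrong for it.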


In [14]:
# here we define our own accuracy_score-like function
def evaluate(model, batches_x, batches_y):

    model.eval()
    corr = 0.
    total = 0.
    # Why do we need torch.no_grad()?
    # It tells PyTorch not to track operations for gradient computation. During evaluation we never
    # call backward(), so skipping that bookkeeping saves memory and time. No weights are updated here.
    # model.eval() above is a separate switch: it puts layers such as dropout into inference mode.
    with torch.no_grad():
        for x, y in zip(batches_x, batches_y):
            y_preds=model(x)
            for idx, y_pred in enumerate(y_preds):
                prediction=torch.argmax(y_pred)
                if prediction == y[idx]:
                    corr += 1.
                total+=1                          
    return corr/total
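
The per-example loop in `evaluate` can also be written as a single tensor operation per batch. The sketch below shows the idea on one made-up batch; the logits and gold labels are hypothetical values chosen for illustration.

```python
import torch

# Hypothetical model outputs (logits) for a batch of 4 examples and 2 classes.
logits = torch.tensor([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 0.5],
                       [1.2, 1.1]])
# Hypothetical gold labels for the same 4 examples.
gold = torch.tensor([0, 1, 0, 0])

# argmax over the class dimension gives one predicted class per example.
predictions = logits.argmax(dim=1)

# Comparing elementwise and summing counts the correct predictions.
n_correct = (predictions == gold).sum().item()
accuracy = n_correct / len(gold)

print(predictions.tolist(), accuracy)
```

Summing `n_correct` and `len(gold)` over all batches would give the same overall accuracy as the explicit loop.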

Read more about no_grad() here: https://pytorch.org/docs/stable/generated/torch.no_grad.html
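
A small sketch of what `torch.no_grad()` changes in practice: the same layer produces the same values either way, but inside the context the result is not attached to a computation graph. The layer sizes and random input here are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)   # arbitrary small layer for the demonstration
x = torch.randn(3, 4)     # arbitrary random input batch

# Normal forward pass: the output is tracked for backpropagation.
out_tracked = layer(x)

# Forward pass under no_grad: same computation, no gradient tracking.
with torch.no_grad():
    out_untracked = layer(x)

print(out_tracked.requires_grad)
print(out_untracked.requires_grad)
print(torch.allclose(out_tracked, out_untracked))
```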

In [18]:
class LogisticRegressionClassifier(nn.Module):

    def __init__(self, input_dim, output_dim):
        super().__init__()
        # torch.nn.Linear transforms an input of size input_dim (e.g., 1000 above) to an output of size output_dim 
        # (e.g., 2 classes for positive/negative)
        self.linear1 = torch.nn.Linear(input_dim, output_dim) 
    
    def forward(self, input):
        x1 = self.linear1(input)

        return x1
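
The classifier above returns two raw scores (logits) per example rather than a probability. One way to see why this still amounts to logistic regression: a softmax over two logits equals a sigmoid applied to their difference. A minimal check, with made-up logit values:

```python
import torch

# Made-up logits for 3 examples; column 0 = class 0, column 1 = class 1.
logits = torch.tensor([[1.5, -0.5],
                       [0.0, 0.0],
                       [-2.0, 1.0]])

# Two-class softmax probabilities.
probs = torch.softmax(logits, dim=1)

# Logistic (sigmoid) function applied to the difference of the two logits.
p_class1 = torch.sigmoid(logits[:, 1] - logits[:, 0])

print(probs[:, 1])
print(p_class1)
```

Both printed tensors hold the same probabilities of class 1, up to floating-point rounding.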

In [19]:
class FFNN_1_Hidden_Layer(nn.Module):

    def __init__(self, input_dim, output_dim, hidden_dim=100):
        super().__init__()
        
        # the first layer transforms an input of size input_dim (e.g., 1,000 above) to an output of size hidden_dim (e.g., 100)
        self.linear1 = torch.nn.Linear(input_dim, hidden_dim)

        # the second layer transforms an input of size hidden_dim (e.g., 100) to an output of size output_dim (e.g., 2 classes for positive/negative)       
        self.linear2 = torch.nn.Linear(hidden_dim, output_dim)
    
    def forward(self, input): 
        # pass the input through the first layer
        layer1_output = self.linear1(input)
        
        # pass the output through a non-linearity (here, tanh)
        layer1_output = torch.tanh(layer1_output)
        
        # and then pass the output from that first layer as input to the second layer
        layer2_output = self.linear2(layer1_output)

        return layer2_output
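
To get a feel for how much larger the one-hidden-layer network is than logistic regression, the sketch below counts trainable parameters. It uses `nn.Sequential` as a stand-in with the same layer shapes (1,000 inputs, 100 hidden units, 2 outputs) so that it runs on its own; `count_parameters` is a hypothetical helper name.

```python
import torch.nn as nn

def count_parameters(model):
    """Total number of trainable scalar parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-ins with the same layer shapes as the two classifiers in this notebook.
logreg_like = nn.Linear(1000, 2)
ffnn_like = nn.Sequential(nn.Linear(1000, 100), nn.Tanh(), nn.Linear(100, 2))

# Linear(in, out) has in*out weights plus out biases.
print(count_parameters(logreg_like))  # 1000*2 + 2
print(count_parameters(ffnn_like))    # (1000*100 + 100) + (100*2 + 2)
```

With roughly fifty times as many parameters and only 10,000 training examples, the hidden-layer model has far more capacity to fit the training data.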

In [20]:
def train(model, model_filename, train_batches_x, train_batches_y, dev_batches_x, dev_batches_y):
    
    # initializing the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    losses = []
    
    # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
    cross_entropy=nn.CrossEntropyLoss()

    best_dev_acc = 0.
    
    # we'll only train for 5 epochs for this exercise, but in practice you'd want to train for more epochs
    # (in theory until you stop seeing improvements in accuracy on your *development* data)
    
    for epoch in range(5):
        model.train()

        for x, y in zip(train_batches_x, train_batches_y):
            y_pred=model(x)
            loss = cross_entropy(y_pred.view(-1, 2), y.view(-1))
            # store a plain float; appending the tensor itself would keep its computation graph in memory
            losses.append(loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        dev_accuracy=evaluate(model, dev_batches_x, dev_batches_y)
        
        # we're going to save the model that performs the best on *dev* data
        if dev_accuracy > best_dev_acc:
            torch.save(model.state_dict(), model_filename)
            print("%.3f is better than %.3f, saving model ..." % (dev_accuracy, best_dev_acc))
            best_dev_acc = dev_accuracy
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
            
    model.load_state_dict(torch.load(model_filename))            
    print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))
    return best_dev_acc
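
Both models return raw logits because `nn.CrossEntropyLoss` applies the log-softmax itself. The following sketch checks this on made-up values by comparing it with an explicit log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 3 examples and 2 classes, with made-up gold labels.
logits = torch.tensor([[2.0, 0.5],
                       [-1.0, 1.0],
                       [0.3, 0.2]])
gold = torch.tensor([0, 1, 1])

# What the training loop computes.
loss_ce = nn.CrossEntropyLoss()(logits, gold)

# The same quantity spelled out: log-softmax, then mean negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), gold)

print(loss_ce.item(), loss_manual.item())
```

A consequence is that adding a softmax layer at the end of the model before this loss would apply the normalization twice.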
    

In [21]:
logreg=LogisticRegressionClassifier(1000, 2)
dev_accuracy=train(logreg, "logreg.model", train_batches_x, train_batches_y, dev_batches_x, dev_batches_y)

0.843 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.843
0.850 is better than 0.843, saving model ...
Epoch 1, dev accuracy: 0.850
Epoch 2, dev accuracy: 0.847
Epoch 3, dev accuracy: 0.849
Epoch 4, dev accuracy: 0.848

Best Performing Model achieves dev accuracy of : 0.850


In [22]:
ffnn1=FFNN_1_Hidden_Layer(1000, 2, hidden_dim=100)
dev_accuracy=train(ffnn1, "ffnn1.model", train_batches_x, train_batches_y, dev_batches_x, dev_batches_y)

0.843 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.843
Epoch 1, dev accuracy: 0.839
Epoch 2, dev accuracy: 0.830
Epoch 3, dev accuracy: 0.830
Epoch 4, dev accuracy: 0.828

Best Performing Model achieves dev accuracy of : 0.843


Neural networks converge to different solutions as a function of their *initialization* (the random choice of the initial values for parameters).  Let's train the `FFNN_1_Hidden_Layer` model 10 times and then plot the distribution of dev accuracies using [pandas.DataFrame.plot.density](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.density.html). 

In [None]:
dev_accuracies=[]

for i in range(10):
    ffnn1=FFNN_1_Hidden_Layer(1000, 2, hidden_dim=100)
    dev_accuracy=train(ffnn1, "ffnn1.model", train_batches_x, train_batches_y, dev_batches_x, dev_batches_y)
    dev_accuracies.append(dev_accuracy)

In [None]:
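# background on kernel density plots: https://www.digitalocean.com/community/tutorials/seaborn-kdeplot
# (that article uses seaborn; here we use the pandas version, DataFrame.plot.kde,
# which is an alias of DataFrame.plot.density and requires SciPy)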
df=pd.DataFrame(dev_accuracies)
ax = df.plot.kde()
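
The run-to-run variation comes from the random number generator used to initialize the weights. As a hedged sketch of how to make a single run repeatable, the example below fixes the seed with `torch.manual_seed` before creating a layer; `make_layer` is a hypothetical helper, and the layer shape simply matches the first layer used above.

```python
import torch
import torch.nn as nn

def make_layer(seed):
    """Create a freshly initialized linear layer after fixing PyTorch's random seed."""
    torch.manual_seed(seed)
    return nn.Linear(1000, 100)

a = make_layer(0)
b = make_layer(0)   # same seed as a
c = make_layer(1)   # different seed

# Same seed gives identical initial weights; a different seed gives different ones.
print(torch.equal(a.weight, b.weight))
print(torch.equal(a.weight, c.weight))
```

For the experiment above, the seed should be left unset (or varied per run), since the point is to observe the spread across initializations.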

Try adding more layers to the FFNN below and experimenting with dropout rate, hidden layer sizes, and different choices of non-linearity.

In [23]:
class FFNN_Experiment(nn.Module):

    def __init__(self, input_dim, output_dim, hidden_dim=100):
        super().__init__()

        # hidden_dim is taken from the constructor argument, so different sizes can be tried when experimenting
        
        # the first layer transforms an input of size input_dim (e.g., 1000 above) to an output of size hidden_dim (e.g., 100)
        self.linear1 = torch.nn.Linear(input_dim, hidden_dim)

        # during training, a dropout layer randomly sets each element of the previous layer's output to 0 with probability p (here 0.2)
        self.dropout = nn.Dropout(p=0.2)

        # the second layer transforms an input of size hidden_dim (e.g., 100) to an output of size output_dim (e.g., 2 classes for positive/negative)       
        self.linear2 = torch.nn.Linear(hidden_dim, output_dim)
    
    def forward(self, input): 
        # pass the input through the first layer
        layer1_output = self.linear1(input)
        
        # pass that output through a non-linearity (here, tanh)
        # alternatives include torch.relu and torch.sigmoid
        layer1_output = torch.tanh(layer1_output)
        
        # then dropout some outputs during training time (not test time)
        layer1_output=self.dropout(layer1_output)
        
        # and then pass the output from that first layer as input to the second layer
        layer2_output = self.linear2(layer1_output)

        return layer2_output
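
The "training time, not test time" behavior of dropout is controlled by the module's mode, which is why `train` calls `model.train()` and `evaluate` calls `model.eval()`. A minimal sketch on a vector of ones (the input and the seed are arbitrary choices for the demonstration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)          # arbitrary seed so the demo is repeatable
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 1.25.
drop.train()
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()
kept_values = y_train[y_train != 0]

# Evaluation mode: dropout does nothing.
drop.eval()
y_eval = drop(x)

print(round(zero_fraction, 3))
print(kept_values.unique())
print(torch.equal(y_eval, x))
```

The rescaling of the surviving elements keeps the expected value of each output the same in both modes, so no adjustment is needed at test time.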

In [24]:
ffnn_e=FFNN_Experiment(1000, 2, hidden_dim=100)
dev_accuracy=train(ffnn_e, "ffnn_e.model", train_batches_x, train_batches_y, dev_batches_x, dev_batches_y)

0.845 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.845
Epoch 1, dev accuracy: 0.842
Epoch 2, dev accuracy: 0.839
Epoch 3, dev accuracy: 0.833
Epoch 4, dev accuracy: 0.831

Best Performing Model achieves dev accuracy of : 0.845
