## Yelp reviews sentiment analysis using pytorch

The below code is focused on exploring various aspects of Natural Language Processing (NLP) and building classifiers for sentiment analysis using PyTorch. Here's a breakdown of what the code does:

1. **Data Processing**: The code processes raw Yelp review data, converts it to a pandas DataFrame, and manipulates the data as needed.

2. **Data Vectorization and Vocabulary**: It creates a vocabulary for the NLP task and vectorizes the text reviews using one-hot encoding. A Vocabulary class is implemented to handle the mapping between tokens and indices.

3. **Data Processing in PyTorch**: The code uses PyTorch's DataLoader to batch, shuffle, and load the data in parallel using multiprocessing workers.

4. **Models Implementation**:
   - A simple perceptron (`ReviewClassifier`) with a single linear layer and sigmoid activation for binary classification.
   - A multi-layer perceptron (`MultiLayerPerceptron`) with hidden layers, ReLU activation, and dropout for regularization.
   - A Convolutional Neural Network (`ConvolutionNeuralNetwork`) using 1D convolutions for sequence-based analysis.

5. **Training and Evaluation**: The code includes training and evaluation loops for each epoch. It tracks train and validation loss, accuracy, and evaluates the model's performance on a test dataset.

6. **Inference and Prediction**: It demonstrates how to predict the sentiment rating of new reviews using the trained model and the vectorizer.

7. **Weight Inspection**: The code inspects the influential words in positive and negative reviews by analyzing the weights of the models first fully connected layer (`fcl1`).

8. **Hyperparameters Tuning and Tracking**: The code uses various hyperparameters and keeps track of training and validation loss, as well as accuracy.

9. **Device Usage and CUDA**: The code checks for the availability of CUDA and uses GPU acceleration if available.

Overall, it provides a comprehensive exploration of building and training different types of neural network models for sentiment analysis tasks, ranging from simple perceptrons to more complex models like multi-layer perceptrons and convolutional neural networks. It also covers preprocessing steps, vocabulary creation, and evaluation of the trained models.

### Data Processing: 
1. import libraries and data

In [4]:
import collections
import pandas as pd
import numpy as np
import re
from argparse import Namespace
import string

In [47]:
# argument definations 
arg = Namespace(
    islite = False,
    train_csv_lite_with_split = "data/yelp/reviews_with_splits_lite.csv",
    test_csv = "data/yelp/raw_test.csv",
    train_csv = "data/yelp/raw_train.csv",   
    test_split_ratio = 0.25,
    train_split_ratio = 0.75,
    seed = 1330,
    output_file_sentiment = "data/yelp/final_sentiment_list.csv",
    output_file_rating = "data/yelp/final_ratings_list.csv"
)

In [9]:
# reading of data and removing the null reviews
if not arg.islite:
  test_csv = pd.read_csv(arg.test_csv, header=None, names=["rating","review"])
  train_csv  = pd.read_csv(arg.train_csv, header=None, names = ["rating","review"])
  test_csv = test_csv[~pd.isnull(test_csv.review)]
  train_csv = train_csv[~pd.isnull(train_csv.review)]
  

In [10]:
# print the test
test_csv.head(6)

Unnamed: 0,rating,review
0,2,"Contrary to other reviews, I have zero complai..."
1,1,Last summer I had an appointment to get new ti...
2,2,"Friendly staff, same starbucks fair you get an..."
3,1,The food is good. Unfortunately the service is...
4,2,Even when we didn't have a car Filene's Baseme...
5,2,"Picture Billy Joel's \""Piano Man\"" DOUBLED mix..."


2. Partitions the training data into train and validation sets based on ratings, shuffles within rating groups, and combines with test data.
   Process the review text.
   Change the numerical ratings to sentiments (1 for negative and 2 for positive).
   

In [42]:
def preprocess_data(text):
  if type(text) == float:
        print(text)
  text = text.lower()
  text = re.sub(r"([.,!?])", r" \1 ", text)
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
  return text

def partition_dataset(train_csv, test_csv):
  
  rating_dict = collections.defaultdict(list)
  for _, row in train_csv.iterrows():
    rating_dict[row.rating].append(row.to_dict())
  final_list =[]
  np.random.seed(arg.seed)
  for _,  item_list in sorted(rating_dict.items()):
    np.random.shuffle(item_list)
    total_rows = len(item_list)
    total_train_required = int(arg.train_split_ratio*total_rows)
    total_test_required = int(arg.test_split_ratio*total_rows)
    # Give data point a split attribute
    for item in item_list[:total_train_required]:
        item['split'] = 'train'
    for item in item_list[total_train_required:total_train_required+total_test_required]:
        item['split'] = 'val'

    # Add to final list
    final_list.extend(item_list)
    for _, row in test_csv.iterrows():
      row_dict = row.to_dict()
      row_dict['split'] = 'test'
      final_list.append(row_dict)

    return final_list, pd.DataFrame(final_list)


In [43]:
final_list, final_list_df = partition_dataset(train_csv,test_csv)
final_list_df.review = final_list_df.review.apply(preprocess_data)
final_list_rating = final_list_df.copy()
final_list_df.split.value_counts()

train    210000
val       70000
test      38000
Name: split, dtype: int64

In [44]:
final_list_sentiment = final_list_df.copy()
final_list_sentiment['rating'] = final_list_sentiment.rating.apply({1: 'negative', 2: 'positive'}.get)
final_list_sentiment.head()

Unnamed: 0,rating,review,split
0,negative,used to come here a lot then i think the owner...,train
1,negative,got a notice for the preferred customer sale l...,train
2,negative,the burgers here are probably the best burgers...,train
3,negative,i am a road warrior . i am used to eating alon...,train
4,negative,i come to this place whenever i am in town for...,train


In [45]:
final_list_rating.head()

Unnamed: 0,rating,review,split
0,1,used to come here a lot then i think the owner...,train
1,1,got a notice for the preferred customer sale l...,train
2,1,the burgers here are probably the best burgers...,train
3,1,i am a road warrior . i am used to eating alon...,train
4,1,i come to this place whenever i am in town for...,train


In [48]:
final_list_rating.to_csv(arg.output_file_rating)
final_list_sentiment.to_csv(arg.output_file_sentiment)

### Data Vectorization

####  Data Vocabulary
Vocabulary class is commonly used in NLP tasks to convert words or tokens into numerical indices, which can then be used as input features for machine learning models. It provides methods for token addition, lookup, serialization, and deserialization, making it a fundamental component in many NLP pipelines.

In [87]:
class Vocabulary(object):
    """
    A class for processing text and building a vocabulary for mapping tokens to indices.
    """

    def __init__(self, token_to_idx=None, add_unk=True, unk_token='<UNK>'):
        """
        Initialize the Vocabulary object.
        
        Args:
            token_to_idx (dict): Mapping of tokens to indices.
            add_unk (bool): Whether to add an unknown token.
            unk_token (str): Token to represent unknown words.
        """
        if token_to_idx is None:
            token_to_idx = {}

        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}
        self._add_unk = add_unk
        self._unk_token = unk_token
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """
        Returns a dictionary that can be serialized.
        
        Returns:
            dict: Serializable representation of the Vocabulary object.
        """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """
        Instantiate a Vocabulary object from a serialized dictionary.
        
        Args:
            contents (dict): Serialized representation of the Vocabulary object.
        
        Returns:
            Vocabulary: Instantiated Vocabulary object.
        """
        return cls(**contents)

    def add_token(self, token):
        """
        Update the mapping dictionary based on the token and return its index.
        
        Args:
            token (str): Token to be added to the vocabulary.
        
        Returns:
            int: Index of the added token.
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token

        return index

    def lookup_token(self, token):
        """
        Retrieve the index of a token from the vocabulary.
        
        Args:
            token (str): Token to look up.
        
        Returns:
            int: Index of the token in the vocabulary.
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            if token not in self._token_to_idx:
                return 0
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """
        Retrieve the token based on the index from the mapping dictionary.
        
        Args:
            index (int): Index to look up.
        
        Returns:
            str: Token corresponding to the index.
        """
        if index not in self._idx_to_token:
            raise KeyError("The provided index: %d is not in the vocab" % index)
        else:
            return self._idx_to_token[index]

    def __str__(self):
        """
        Returns a string representation of the Vocabulary object.
        
        Returns:
            str: String representation of the Vocabulary object.
        """
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        """
        Returns the length of the vocabulary.
        
        Returns:
            int: Number of unique tokens in the vocabulary.
        """
        return len(self._token_to_idx)


ReviewVectorizer class is used to convert text reviews into numerical vectors using the provided vocabulary.

In [88]:
class ReviewVectorizer(object):
    """
    Used the Vocabulary class to convert the tokens into actual numerical vectors.
    """

    def __init__(self, review_vocab, rating_vocab):
        """
        Initialize the ReviewVectorizer.

        Args:
            review_vocab (Vocabulary): Maps word to integer.
            rating_vocab (Vocabulary): Maps class label to integer ("negative/positive").
        """
        self.review_vocab = review_vocab
        self.rating_vocab = rating_vocab

    def vectorize(self, review):
        """
        Vectorize a text review to one-hot encoding.

        Args:
            review (str): Text review.

        Returns:
            np.ndarray: One-hot encoding vector.
        """
        one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
        for token in review.split(" "):
            if token not in string.punctuation:
                one_hot[self.review_vocab.lookup_token(token)] = 1
        return one_hot

    @classmethod
    def from_dataframe(cls, review_df, cutoff=25):
        """
        Instantiate a vectorizer for reviews directly from a dataset DataFrame.

        Args:
            review_df (pandas.DataFrame): Dataset containing reviews and ratings.
            cutoff (int): Count threshold for word inclusion.

        Returns:
            ReviewVectorizer: Instantiated ReviewVectorizer.
        """
        review_vocab = Vocabulary(add_unk=True)
        rating_vocab = Vocabulary(add_unk=False)

        for rating in sorted(set(review_df.rating)):
            rating_vocab.add_token(rating)

        word_count = collections.Counter()

        for review in review_df.review:
            for word in review.split(" "):
                if word not in string.punctuation:
                    word_count[word] += 1

        for word, count in word_count.items():
            if count > cutoff:
                review_vocab.add_token(word)

        return cls(review_vocab, rating_vocab)

    @classmethod
    def from_serializable(cls, contents):
        """
        Instantiate a vectorizer from serialized contents.

        Args:
            contents (dict): Serialized representation of the vectorizer.

        Returns:
            ReviewVectorizer: Instantiated ReviewVectorizer.
        """
        review_vocab = Vocabulary.from_serializable(contents['review_vocab'])
        rating_vocab = Vocabulary.from_serializable(contents['rating_vocab'])
        return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)

    def to_serializable(self):
        """
        Convert the vectorizer to a serializable representation.

        Returns:
            dict: Serializable representation of the vectorizer.
        """
        return {'review_vocab': self.review_vocab.to_serializable(),
                'rating_vocab': self.rating_vocab.to_serializable()}



### Data Processing in PyTorch:

ReviewDataset class is designed for handling text classification tasks using PyTorch. The dataset contains reviews and their corresponding ratings, and the purpose of this class is to facilitate the loading, processing, and manipulation of the data for training, validation, and testing.

In [89]:
from torch.utils.data import Dataset, DataLoader

class ReviewDataset(Dataset):
    def __init__(self, review_df, vectorizer):
        """
        Initialize the ReviewDataset.

        Args:
            review_df (pandas.DataFrame): Dataset containing reviews and splits
            vectorizer (ReviewVectorizer): Vectorizer for converting reviews to vectors
        """
        self.review_df = review_df
        self._vectorizer = vectorizer

        # Split the dataset into train, validation, and test subsets
        self.train_df = self.review_df[self.review_df.split == 'train']
        self.train_size = len(self.train_df)
        self.val_df = self.review_df[self.review_df.split == 'val']
        self.validation_size = len(self.val_df)
        self.test_df = self.review_df[self.review_df.split == 'test']
        self.test_size = len(self.test_df)

        # Lookup dictionary for splits
        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

    @classmethod
    def load_dataset_and_make_vectorizer(cls, review_csv):
        """
        Load dataset and create a new vectorizer from scratch.

        Args:
            review_csv (str): Location of the dataset CSV file

        Returns:
            ReviewDataset: An instance of ReviewDataset
        """
        review_df = pd.read_csv(review_csv)
        train_review_df = review_df[review_df.split == 'train']
        return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df))
    
    
    @classmethod
    def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath):
        """
        Load dataset and a cached vectorizer.

        Args:
            review_csv (str): Location of the dataset CSV file
            vectorizer_filepath (str): Location of the saved vectorizer file

        Returns:
            ReviewDataset: An instance of ReviewDataset
        """
        review_df = pd.read_csv(review_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(review_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """
        Load a vectorizer from a file.

        Args:
            vectorizer_filepath (str): Location of the serialized vectorizer file

        Returns:
            ReviewVectorizer: An instance of ReviewVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return ReviewVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """
        Save the vectorizer to a file using JSON.

        Args:
            vectorizer_filepath (str): Location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """
        Get the associated vectorizer.

        Returns:
            ReviewVectorizer: The vectorizer used by the dataset
        """
        return self._vectorizer

    def set_split(self, split="train"):
        """
        Set the active split of the dataset.

        Args:
            split (str): One of "train", "val", or "test"
        """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        """
        Get the length of the active split.

        Returns:
            int: Number of data points in the active split
        """
        return self._target_size

    def __getitem__(self, index):
        """
        Get a data point from the dataset.

        Args:
            index (int): Index of the data point

        Returns:
            dict: Dictionary containing features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        review_vector = self._vectorizer.vectorize(row.review)
        rating_index = self._vectorizer.rating_vocab.lookup_token(row.rating)

        return {'x_data': review_vector, 'y_target': rating_index}

    def get_num_batches(self, batch_size):
        """
        Calculate the number of batches in the dataset.

        Args:
            batch_size (int): Size of each batch

        Returns:
            int: Number of batches in the dataset
        """
        return len(self) // batch_size


The function generates batches of data from a PyTorch dataset using a DataLoader, ensuring tensors are moved to a specified device.

In [90]:
from torch.utils.data import DataLoader

def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    Generates batches of data from a PyTorch dataset using DataLoader.
    
    Args:
        dataset (torch.utils.data.Dataset): The dataset to generate batches from.
        batch_size (int): The size of each batch.
        shuffle (bool, optional): Whether to shuffle the dataset. Default is True.
        drop_last (bool, optional): Whether to drop the last incomplete batch if its size is less than batch_size.
                                   Default is True.
        device (str, optional): The device to move the tensors to, e.g., "cuda" or "cpu". Default is "cpu".
    
    Yields:
        dict: A dictionary containing batched tensors moved to the specified device.
    """
    # Create a DataLoader instance
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    # Iterate through the batches
    for data_dict in dataloader:
        out_data_dict = {}
        # Move each tensor to the specified device
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict


### Deep Learning Models: Single Layer Perceptron

A basic perceptron comprises a sole linear layer that performs an affine transformation, followed by passing the output through an activation function. In this case, we are employing the sigmoid activation function, which introduces non-linearity to the model's output. This process enables the perceptron to capture complex relationships within the data and make predictions.

In [91]:
import torch
import torch.nn as nn

class ReviewClassifier(nn.Module):
    """
    A simple perceptron classifier using PyTorch's neural network module (nn.Module).
    
    Args:
        num_features (int): Number of input features for the linear layer.
    """
    def __init__(self, num_features):
        super(ReviewClassifier, self).__init__()
        self.fcl = nn.Linear(in_features=num_features, out_features=1)
  
    def forward(self, vectorized_review, apply_sigmoid=False):
        """
        Forward pass of the perceptron classifier.
        
        Args:
            vectorized_review (torch.Tensor): Input data tensor after vectorization.
            apply_sigmoid (bool): Whether to apply sigmoid activation to the output.
        
        Returns:
            torch.Tensor: Output tensor after the linear transformation and activation (if applied).
        """
        output = self.fcl(vectorized_review).squeeze()
        if apply_sigmoid:
            output = torch.sigmoid(output)
        return output


### Deep Learning Models:Multi-Layer Perceptron

A multilayer perceptron (MLP) comprises interconnected layers of neurons, with optional softmax output. Here, dropout regularization is applied to prevent overfitting. Dropout prevents coadaptation between neurons, mitigating overfitting by randomly deactivating connections during training. This promotes network robustness and avoids over-reliance on specific connections. Dropout introduces randomness without adding parameters, controlled by a "drop probability" hyperparameter (often 0.5). The provided MLP with dropout showcases its practical implementation for building resilient neural networks.

In [92]:
class MultiLayerPerceptron(nn.Module):
    """
    A simple multi-layer perceptron model.

    Args:
        in_dimension (int): Input dimension size.
        hidden_dimension (int): Hidden layer dimension size.
        out_dimension (int): Output dimension size.
    """
    def __init__(self, in_dimension, hidden_dimension, out_dimension):
        super(MultiLayerPerceptron, self).__init__()
        self.fcl1 = nn.Linear(in_dimension, hidden_dimension)
        self.fcl2 = nn.Linear(hidden_dimension, out_dimension)

    def forward(self, x_in, apply_softmax=False):
        """
        The forward pass for the multi-layer perceptron.

        Args:
            x_in (torch.Tensor): Input tensor.
            apply_softmax (bool): Whether to apply softmax to the output.

        Returns:
            torch.Tensor: Output tensor.
        """
        # Pass through the first linear layer
        first_layer_output = self.fcl1(x_in)

        # Apply ReLU activation to the intermediate result
        intermediate = F.relu(first_layer_output)

        # Apply dropout to the intermediate result and pass through the second linear layer
        output = self.fcl2(F.dropout(intermediate, p=0.5))

        if apply_softmax:
            # Apply softmax if specified
            output = F.softmax(output, dim=1)

        # Squeeze the output tensor
        return output.squeeze()


### Convolutions neural networks:

## Convolutional Neural Networks (CNNs):

CNNs involve convolving an input matrix with a kernel matrix through element-wise multiplication and summation. This produces an output matrix containing crucial information. The significance can be understood based on CNN dimensionality:

**Dimensionality:** CNNs come in 1D, 2D, and 3D (Conv1d, Conv2d, and Conv3d). 1D is ideal for time series, like text. 2D serves image processing with width and height. 3D suits videos, adding the time dimension.

**Channels:** These are input dimensions. For 2D convolutions, like images with color channels, there are 3 channels. Analogously, text uses a vocabulary length. Parameters: in_channel, out_channel.

**Kernel Size:** The size of the kernel matrix, akin to n-grams in NLP. Larger size captures more context.

**Stride:** Controls convolution steps. If stride equals kernel size, no overlap occurs, yielding smaller output. Smaller stride yields larger output.

**Padding:** Adds 0s to input edges, aligning with stride during convolution.

**Dilation:** Adjusts kernel application by introducing gaps.

CNNs aim to configure convolution layers for desired features. Initial layers extract features, followed by processing. In classification, Linear (fc) layers handle classification. Implementation involves designing a feature vector. An artificial data tensor mirrors real data in shape. Data tensor is 3D for minibatch, where sequences of one-hot vectors become matrices, constituting the input channels.

In [93]:
import torch.nn.functional as F

class ConvolutionNeuralNetwork(nn.Module):
    def __init__(self, in_channels, num_channels, output_size):
        super(ConvolutionNeuralNetwork, self).__init__()
        
        # Define the convolutional layers
        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=in_channels, out_channels=num_channels, kernel_size=1),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, kernel_size=1, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, kernel_size=1, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, kernel_size=1),
            nn.ELU()
        )
        
        # Define the fully connected layer
        self.fcl = nn.Linear(num_channels, output_size)

    def forward(self, x_in, apply_softmax=False):
        # Pass the input through the convolutional layers
        features = self.convnet(x_in)
        
        # Pass the features through the fully connected layer
        prediction_vector = self.fcl(features)
        
        if apply_softmax:
            # Apply softmax if required
            prediction_vector = F.softmax(prediction_vector, dim=2)

        return prediction_vector


### Training and Validation: 

In [94]:
import torch.optim as optim
from argparse import Namespace

args = Namespace(
    frequency_cutoff=25,
    model_state_file='model.pth',
    review_csv='data/yelp/reviews_with_splits_lite.csv',
    save_dir='data/yelp/',
    vectorizer_file='vectorizer.json',
    batch_size=256,
    early_stopping_criteria=5,
    learning_rate=0.001,
    num_epochs=5,
    seed=1330,
    cuda = True,
    device = torch.device("cuda"))


def make_train_state(args):
  return {'epoch_index':0,
          'train_loss':[],
          'train_acc':[],
          'val_loss':[],
          'val_acc':[],
          'test_loss':-1,
          'test_acc':1
  }

train_state = make_train_state(args)

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables efficient computation on NVIDIA GPUs, accelerating various tasks by offloading them to the GPU for faster processing compared to traditional CPU-based computations.

In [95]:
# Check the availabily of the CUDA
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed_all(args.seed)

if not torch.cuda.is_available():
    args.cuda = False

print("Using CUDA: {}".format(args.cuda))
args.device = torch.device("cuda" if args.cuda else "cpu")

Using CUDA: False


In [102]:
# Check if CUDA is available, otherwise use CPU
if not torch.cuda.is_available():
    args.cuda = False
    args.device = torch.device("cpu")
    
import json
# Load dataset and create vectorizer
dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv)
# Get the vectorizer from the dataset
vectorizer = dataset.get_vectorizer()

# Initialize the classifier (Choose one of the following)
# 1. For Single Layer Perceptron
# classifier = SingleLayerPerceptron(in_dimension=len(vectorizer.review_vocab), out_dimension=1)
# 2. For MultiLayer Perceptron
classifier = MultiLayerPerceptron(in_dimension=len(vectorizer.review_vocab), hidden_dimension=100, out_dimension=1)
# 3. For Convolutional Neural Network
# classifier = ConvolutionNeuralNetwork(in_channels=len(vectorizer.review_vocab), num_channels=256, output_size=1)

# Move the classifier to the selected device (CUDA or CPU)
classifier = classifier.to(args.device)

# Define the loss function and optimizer
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)

# Function to compute accuracy
def compute_accuracy(y_pred, y_target):
    y_target = y_target.cpu()
    y_pred_indices = (torch.sigmoid(y_pred) > 0.5).cpu().long()
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100


In [105]:
# Method to train and evaluate
for epoch_index in range(args.num_epochs):
    train_state['epoch_index'] = epoch_index
    
    # Set the dataset split to 'train'
    dataset.set_split('train')
    
    # Generate batches for training
    batch_generator = generate_batches(dataset, batch_size=args.batch_size, device=args.device)
    
    running_loss = 0.0
    running_acc = 0.0
    
    # Set the classifier in training mode
    classifier.train()
    
    for batch_index, batch_dict in enumerate(batch_generator):
        optimizer.zero_grad()
        
        # Extract input data
        x_in = batch_dict['x_data']
        
        # Forward pass
        y_pred = classifier(x_in.float())
        
        # Calculate loss
        loss = loss_func(y_pred, batch_dict['y_target'].float())
        loss_batch = loss.item()
        running_loss += (loss_batch - running_loss) / (batch_index + 1)
        
        # Backpropagation and optimization
        loss.backward()
        optimizer.step()
        
        # Calculate accuracy
        acc_batch = compute_accuracy(y_pred, batch_dict['y_target'])
        running_acc += (acc_batch - running_acc) / (batch_index + 1)
    
    # Update training state
    train_state['train_loss'].append(running_loss)
    train_state['train_acc'].append(running_acc)
    
    # Calculate average training loss and accuracy
    average_train_loss = np.mean(train_state['train_loss'])
    average_train_acc = np.mean(train_state['train_acc'])
    
    # Print training results
    print("Epoch: {}, Train Loss: {:.4f}, Train Accuracy: {:.4f}".format(epoch_index, average_train_loss, average_train_acc))

    # Validation loop
    # Set the dataset split to 'val'
    dataset.set_split('val')
    
    # Generate batches for validation
    batch_generator = generate_batches(dataset, batch_size=args.batch_size, device=args.device)
    
    running_loss = 0.0
    running_acc = 0.0
    
    # Set the classifier in evaluation mode
    classifier.eval()
    
    for batch_index, batch_dict in enumerate(batch_generator):
        # Forward pass
        y_pred = classifier(batch_dict['x_data'].float())
        
        # Calculate loss
        loss = loss_func(y_pred, batch_dict['y_target'].float())
        loss_batch = loss.item()
        running_loss += (loss_batch - running_loss) / (batch_index + 1)
        
        # Calculate accuracy
        acc_batch = compute_accuracy(y_pred, batch_dict['y_target'])
        running_acc += (acc_batch - running_acc) / (batch_index + 1)
    
    # Update training state
    train_state['val_loss'].append(running_loss)
    train_state['val_acc'].append(running_acc)
    
    # Calculate average validation loss and accuracy
    average_val_loss = np.mean(train_state['val_loss'])
    average_val_acc = np.mean(train_state['val_acc'])
    
    # Print validation results
    print("Epoch: {}, Val Loss: {:.4f}, Val Accuracy: {:.4f}".format(epoch_index, average_val_loss, average_val_acc))


Epoch: 0, Train Loss: 0.2103, Train Accuracy: 92.0398
Epoch: 0, Val Loss: 0.2186, Val Accuracy: 91.8742
Epoch: 1, Train Loss: 0.2074, Train Accuracy: 92.1529
Epoch: 1, Val Loss: 0.2167, Val Accuracy: 91.9329
Epoch: 2, Train Loss: 0.2008, Train Accuracy: 92.4074
Epoch: 2, Val Loss: 0.2156, Val Accuracy: 91.9388
Epoch: 3, Train Loss: 0.1930, Train Accuracy: 92.7007
Epoch: 3, Val Loss: 0.2153, Val Accuracy: 91.9189
Epoch: 4, Train Loss: 0.1850, Train Accuracy: 93.0078
Epoch: 4, Val Loss: 0.2165, Val Accuracy: 91.8713


### Testing:

In [106]:
# Testing accuracy
dataset.set_split('test')
batch_generator = generate_batches(dataset, batch_size=args.batch_size, device=args.device)

# Initialize variables to track loss and accuracy
running_loss = 0.0
running_acc = 0.0

# Set the classifier to evaluation mode
classifier.eval()

print("Testing")

# Iterate through batches in the test dataset
for batch_index, batch_dict in enumerate(batch_generator):
    # Get predictions from the classifier
    y_pred = classifier(batch_dict['x_data'].float())
    
    # Calculate the loss
    loss = loss_func(y_pred, batch_dict['y_target'].float())
    loss_batch = loss.item()
    running_loss += (loss_batch - running_loss) / (batch_index + 1)
    
    # Compute accuracy
    acc_batch = compute_accuracy(y_pred, batch_dict['y_target'])
    running_acc += (acc_batch - running_acc) / (batch_index + 1)

# Store the test loss and accuracy in the train_state dictionary
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

# Calculate average test loss and accuracy
average_test_loss = train_state['test_loss']
average_test_acc = train_state['test_acc']

# Print the results
print("Test Loss: {}, and Test Accuracy: {}".format(average_test_loss, average_test_acc))



Testing
Test Loss: 0.23524149926379326, and Test Accuracy: 91.56494140624997


In [107]:
def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
    """
    Predict the rating of a review using a trained classifier and vectorizer.

    Args:
        review (str): The input review text.
        classifier (nn.Module): The trained classifier model.
        vectorizer (ReviewVectorizer): The vectorizer used for encoding reviews.
        decision_threshold (float, optional): The decision threshold for classification.

    Returns:
        str: Predicted rating ('positive' or 'negative') for the review.
    """
    # Preprocess the review text
    review = preprocess_data(review)

    # Vectorize the preprocessed review
    vectorized_review = torch.tensor(vectorizer.vectorize(review=review)).to(args.device)

    # Pass the vectorized review through the classifier
    result = classifier(vectorized_review.view(1, -1))

    # Calculate the probability using sigmoid activation
    probability = torch.sigmoid(result).item()

    # Determine the predicted rating based on the decision threshold
    index = 1  # Default to positive rating
    if probability < decision_threshold:
        index = 0  # Predict negative rating

    # Look up the rating label using the index
    predicted_rating = vectorizer.rating_vocab.lookup_index(index)

    return predicted_rating


In [119]:
new_review = "Will come here again"
prediction = predict_rating(new_review, classifier, vectorizer)
print(f"Review: '{new_review}' --> Predicted Rating: {prediction}")

new_review = "if i could give . . . i would . don t do it ."
prediction = predict_rating(new_review, classifier, vectorizer)
print(f"Review: '{new_review}' --> Predicted Rating: {prediction}")

new_review = "great store with good discounts on quality products ."
prediction = predict_rating(new_review, classifier, vectorizer)
print(f"Review: '{new_review}' --> Predicted Rating: {prediction}")

new_review = "the only thing keeping me from raging on the employees right now is the insert of bread they give you with your sandwich . enjoy over paying for a worse version of jimmy johns . dont come here . "
prediction = predict_rating(new_review, classifier, vectorizer)
print(f"Review: '{new_review}' --> Predicted Rating: {prediction}")


Review: 'Will come here again' --> Predicted Rating: positive
Review: 'if i could give . . . i would . don t do it .' --> Predicted Rating: negative
Review: 'great store with good discounts on quality products .' --> Predicted Rating: positive
Review: 'the only thing keeping me from raging on the employees right now is the insert of bread they give you with your sandwich . enjoy over paying for a worse version of jimmy johns . dont come here . ' --> Predicted Rating: negative


In [124]:
# Weight inspection to identify most influential positive and negative words
fcl_weights = classifier.fcl1.weight.detach()[0]
_, indices = torch.sort(fcl_weights, dim=0, descending=True)
indices = indices.cpu()
indices = indices.numpy().tolist()

print("Top influential words in Positive Reviews:")
for i in range(20):
    word = vectorizer.review_vocab.lookup_index(indices[i])
    print(word)
print("")    
print("Top influential words in Negative Reviews:")
indices.reverse()
for i in range(20):
    word = vectorizer.review_vocab.lookup_index(indices[i])
    print(word)

Top influential words in Positive Reviews:
pleasantly
guide
deliciousness
phenomenal
nthank
solid
delicious
fabulous
drawback
excellent
cozy
scrumptious
yum
web
fear
ngreat
mongolian
consistent
mmmm
vegas

Top influential words in Negative Reviews:
worst
meh
mediocre
bland
terrible
unfriendly
blah
unfortunately
slowest
downhill
inedible
lacked
inconsistent
shame
horrible
rude
dried
towels
soggy
disgusting
