# Sentiment Analysis with RNNs and LSTMs using PyTorch
In this notebook we'll perform sentiment analysis on IMDB movie reviews, first with a basic RNN and then with an LSTM network to improve on it.
Data: download the dataset from Kaggle at [this url](https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format) and save its three files, `Train.csv`, `Valid.csv` and `Test.csv`, to the `.\data\imdb` folder.
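
Before running the cells below, it can help to confirm that the three files are where the notebook expects them. The following is a minimal sketch, assuming the folder layout described above; `missing_files` is a hypothetical helper written for this check, not part of any library.

```python
from pathlib import Path

EXPECTED_FILES = ("Train.csv", "Valid.csv", "Test.csv")


def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # assumed location: <current working directory>/data/imdb
    missing = missing_files(Path.cwd() / "data" / "imdb")
    if missing:
        print(f"Missing files: {missing}")
    else:
        print("All dataset files found.")
```

An empty list means all three files were found.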

In [19]:
import warnings
import logging
import logging.config

warnings.filterwarnings("ignore")
logging.config.fileConfig(fname="logging.config")

import sys
import os
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pytorch
import torch
print("Using Pytorch version: ", torch.__version__)
import torch.nn as nn
import torchmetrics
import torchsummary
import torchtext
from torchtext.legacy import data
# import the Pytorch training helper classes (mine)
import torch_training_toolkit as t3

# to ensure that you get consistent results across runs & machines
seed = 123
t3.seed_all(seed)

# tweak libraries for consistent outputs
np.set_printoptions(precision=4, linewidth=1024, suppress=True)
sns.set(style="darkgrid", context="notebook", font_scale=1.20)

logger = logging.getLogger(__name__)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DATA_PATH = pathlib.Path(os.getcwd()) / "data" / "imdb"

print(f"Will train model on {DEVICE}")
print(f"DATA_PATH = {DATA_PATH}")

Using Pytorch version:  2.1.0+cpu


ModuleNotFoundError: No module named 'torchtext.legacy'
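
The traceback above comes from the `from torchtext.legacy import data` line: the `torchtext.legacy` package was removed in torchtext 0.12, so recent installs cannot import it. One way to find out which code path is usable is to check for the module before importing it. This is only a sketch, and `has_module` is a hypothetical helper.

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be located for import, False otherwise."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # raised when a parent package is missing or has no usable spec
        return False


if has_module("torchtext.legacy"):
    print("torchtext.legacy is available; the Field-based cells should import.")
else:
    print("torchtext.legacy not found; build the dataset without the legacy API.")
```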

## Exploring the dataset
Let's explore the training dataset.

In [12]:
df = pd.read_csv(str(DATA_PATH / "Train.csv"))
df.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


In [13]:
print(f"Dataset shape: {df.shape}")
print("Label distribution:")
print(df["label"].value_counts())

Dataset shape: (40000, 2)
Label distribution:
label
0    20019
1    19981
Name: count, dtype: int64
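
The two labels are almost evenly split. A short sketch of the arithmetic, using the counts printed above, shows how little a classifier gains by always predicting the more frequent label; that accuracy is a useful floor to compare the RNN and LSTM against.

```python
# label counts copied from the value_counts() output above
label_counts = {0: 20019, 1: 19981}

total = sum(label_counts.values())
proportions = {label: count / total for label, count in label_counts.items()}

# accuracy of a classifier that always predicts the most frequent label
majority_baseline = max(proportions.values())

for label, share in proportions.items():
    print(f"label {label}: {share:.4%}")
print(f"majority-class baseline accuracy: {majority_baseline:.4%}")
```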


### Create an iterator-based dataset for the `text` column using torchtext

In [16]:
# NOTE: Field and TabularDataset belong to the legacy torchtext API
# (torchtext.legacy.data in 0.9-0.11, removed in 0.12)

def my_tokenizer(text):
    """A simple tokenizer: splits a sentence into whitespace-separated words."""
    return str(text).split()

# field definitions
TEXT = data.Field(sequential=True, lower=True, tokenize=my_tokenizer, use_vocab=True)
LABEL = data.Field(sequential=False, use_vocab=False)
# link the CSV columns to the fields
trainval_fields = [("text", TEXT), ("label", LABEL)]
# build the training & cross-validation datasets;
# splits() returns one dataset per file name passed in
train_data, val_data = data.TabularDataset.splits(
    path=str(DATA_PATH),
    train="Train.csv",
    validation="Valid.csv",
    format="csv",
    skip_header=True,
    fields=trainval_fields,
)

MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)

AttributeError: module 'torchtext.data' has no attribute 'Field'
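
Since `Field` is not available in this install, the tokenizing and vocabulary-building step can be approximated in plain Python. The sketch below is an assumption about what `TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)` was meant to produce: it lowercases and splits on whitespace like the tokenizer above, keeps the most frequent tokens, and reserves ids for padding and unknown tokens. The names `build_vocab` and `numericalize` and the special-token ids are illustrative choices, not torchtext's API.

```python
from collections import Counter

PAD_TOKEN, UNK_TOKEN = "<pad>", "<unk>"


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return str(text).lower().split()


def build_vocab(texts, max_size=25_000):
    """Map the `max_size` most frequent tokens to integer ids.

    Ids 0 and 1 are reserved for the padding and unknown tokens.
    """
    counter = Counter(token for text in texts for token in tokenize(text))
    vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for token, _ in counter.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab


def numericalize(text, vocab):
    """Convert a text to a list of token ids; unseen tokens map to <unk>."""
    unk_id = vocab[UNK_TOKEN]
    return [vocab.get(token, unk_id) for token in tokenize(text)]


# tiny stand-in corpus; in the notebook this would be the `text` column
texts = ["A great movie", "a terrible movie", "great acting"]
vocab = build_vocab(texts, max_size=3)
print(vocab)
print(numericalize("a great film", vocab))
```

With the real data, `texts` would be the `text` column of `Train.csv` and `max_size` would be `MAX_VOCAB_SIZE`.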

In [25]:
from torchtext.datasets import IMDB

def tokenize(label, line):
    return line.split()

train_iter = iter(IMDB(split="train"))
test_iter = iter(IMDB(split="test"))

In [26]:
from itertools import islice

MAX_COUNT = 10

# peek at the first MAX_COUNT (label, text) pairs
for label, line in islice(train_iter, MAX_COUNT):
    print(f"label: {label} - text: {line}")

AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)
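
This error is commonly reported when the optional `portalocker` package is missing from the environment; that is an assumption here, since the notebook does not show the installed packages. Whatever the cause, the first few reviews can also be previewed straight from the downloaded CSV files without the torchtext data pipeline. The sketch below assumes the `text` and `label` columns seen in `Train.csv`; `iter_reviews` and `preview` are hypothetical helpers.

```python
import csv
from itertools import islice


def iter_reviews(csv_path):
    """Yield (label, text) pairs from a CSV file with `text` and `label` columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield int(row["label"]), row["text"]


def preview(csv_path, max_count=10):
    """Return the first `max_count` (label, text) pairs from the file."""
    return list(islice(iter_reviews(csv_path), max_count))
```

In the notebook, `preview(DATA_PATH / "Train.csv")` would return the first ten pairs, which can then be printed in the same `label: ... - text: ...` form as the loop above.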