## Sentiment Classification with PyTorch

In [1]:
import sys, os
import random
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# PyTorch imports
import torch
import torch.nn as nn
from torchvision import datasets, transforms
import torchmetrics
import torchsummary

print(
    f"Using Pytorch version {torch.__version__}. "
    + f'GPU {"is available :)" if torch.cuda.is_available() else "is not available :("}'
)

# My helper functions for training/evaluating etc.
import torch_training_toolkit as t3

SEED = t3.seed_all()
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Using Pytorch version 2.0.1+cu117. GPU is not available :(


In [2]:
DATASET_BASE_PATH = pathlib.Path(os.getcwd()) / "data" / "sentiment140"
DATAFILE_PATH = DATASET_BASE_PATH / "training.1600000.processed.noemoticon.csv"
assert os.path.exists(DATAFILE_PATH), f"FATAL: {DATAFILE_PATH} - path does not exist!"

In [4]:
df = pd.read_csv(str(DATAFILE_PATH), header=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


We want the values from column 0 (the sentiment label) and column 5 (the tweet text). The dataset is supposed to label the sentiment of each tweet as negative (0), neutral (2), or positive (4).
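As a rough sketch of how those two columns could be picked out at load time, the snippet below names all six columns and keeps only the label and the tweet. The column names (`sentiment`, `id`, `date`, `query`, `user`, `text`) are illustrative choices made here, not names shipped with the dataset, and the two-row in-memory CSV is made-up stand-in data for the real file.

```python
import io

import pandas as pd

# Two made-up rows in the same six-column layout as the file above
# (stand-in data for illustration only).
sample_csv = io.StringIO(
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"first example tweet"\n'
    '4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,"second example tweet"\n'
)

# Hypothetical column names; the file itself has no header row.
column_names = ["sentiment", "id", "date", "query", "user", "text"]

# usecols keeps only the label (column 0) and the tweet (column 5).
subset = pd.read_csv(
    sample_csv,
    header=None,
    names=column_names,
    usecols=["sentiment", "text"],
)
print(subset)
```

Loading only the needed columns keeps memory use down on the full 1.6 million rows, though the notebook itself reads every column.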

In [5]:
df[0].value_counts()

0
0    800000
4    800000
Name: count, dtype: int64

There are no `neutral` tweets: only labels 0 and 4 appear, with 800,000 rows each.

In [8]:
df["sentiment_cat"] = df[0].astype("category")

In [10]:
df["sentiment"] = df["sentiment_cat"].cat.codes
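The two cells above turn the raw labels into consecutive integer codes. A small sketch on a toy Series (made-up values, not the real data) shows the effect: pandas sorts the distinct values into categories, so 0 becomes code 0 and 4 becomes code 1.

```python
import pandas as pd

# Toy labels using the same two values found in the dataset.
labels = pd.Series([0, 4, 4, 0, 4])

# Converting to a categorical dtype sorts the distinct values: [0, 4].
as_category = labels.astype("category")

# cat.codes gives each value's position in that sorted category list.
codes = as_category.cat.codes
print(codes.tolist())  # [0, 1, 1, 0, 1]

# Lookup from code back to the original label.
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
```

Binary targets in the range 0..1 are what most classification losses expect, which is presumably why the labels are recoded this way.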

In [12]:
# save the pre-processed dataset (no header row, no index column)
df.to_csv(str(DATASET_BASE_PATH / "train-processed.csv"), header=False, index=False)

In [None]:
# also save a random sample of 10,000 records
df.sample(10_000).to_csv(str(DATASET_BASE_PATH / "train-processed-sample.csv"), header=False, index=False)
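Note that `df.sample` without a `random_state` may draw a different subset on each run. The sketch below, on a made-up toy frame, shows one way to make the draw repeatable and to check the class balance of the result. The seed value 42 and the column names are assumptions for illustration, not values taken from this notebook.

```python
import pandas as pd

# Toy frame with alternating labels (stand-in for the real data).
toy = pd.DataFrame(
    {
        "sentiment": [0, 1] * 50,
        "text": [f"tweet {i}" for i in range(100)],
    }
)

# A fixed random_state makes the sample repeatable across runs.
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)

# Both draws select exactly the same rows, in the same order.
same_rows = sample_a.index.equals(sample_b.index)
print(same_rows)

# Share of each class in the sample; a random draw is only roughly balanced.
balance = sample_a["sentiment"].value_counts(normalize=True)
print(balance)
```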