# PyTorch sentiment analysis

#### Dataset: Sentiment140 (0: negative, 2: neutral, 4: positive)

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: the query (lyx). If there is no query, this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)
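
The raw file has no header row, so the six fields above have to be attached by position. A minimal sketch of that idea using only the standard library; the `COLUMNS` list and the sample row are illustrative assumptions assembled from the example values in the list, not an actual line of the dataset:

```python
import csv
import io

# Column names taken from the field list above, in file order.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# One illustrative row in the Sentiment140 layout (hypothetical, built from the examples above).
sample = "4,2087,Sat May 16 23:58:44 UTC 2009,lyx,robotickilldozr,Lyx is cool\n"

# fieldnames= tells DictReader to treat the first line as data, not as a header.
reader = csv.DictReader(io.StringIO(sample), fieldnames=COLUMNS)
rows = list(reader)
print(rows[0]["user"], "->", rows[0]["text"])
```

With pandas, the same names could be passed through the `names=` argument of `pd.read_csv` instead of relying on the integer column labels used in the rest of this notebook.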

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import random
import math

In [2]:
# engine='python' avoids the UTF-8 decoding error raised with the default parser;
# header=None because the file has no header row
dataset = pd.read_csv("training.1600000.processed.noemoticon.csv", engine = 'python', header = None)
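
The UTF-8 error mentioned in the comment typically appears when a file contains bytes that are not valid UTF-8. The sketch below reproduces that situation on a hypothetical byte string; it assumes, without verifying it for this file, that the data is Latin-1 encoded:

```python
# Hypothetical bytes: "café" encoded as Latin-1, where é is the single byte 0xE9.
raw = b"caf\xe9"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte UTF-8 sequence that is never completed here.
    utf8_ok = False

# Every byte value maps to a character in Latin-1, so this decode cannot fail.
text = raw.decode("latin-1")
print(utf8_ok, text)
```

If the Latin-1 assumption holds for the CSV, passing `encoding="latin-1"` to `pd.read_csv` would be an alternative to switching the parser engine.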

In [3]:
dataset.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   0       1600000 non-null  int64 
 1   1       1600000 non-null  int64 
 2   2       1600000 non-null  object
 3   3       1600000 non-null  object
 4   4       1600000 non-null  object
 5   5       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [5]:
dataset[0].value_counts()

0    800000
4    800000
Name: 0, dtype: int64

In [6]:
dataset["sentiment"] = dataset[0].replace(4,1)

In [7]:
dataset["sentiment"].value_counts()

0    800000
1    800000
Name: sentiment, dtype: int64
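
The relabeling above turns the 0/4 polarity codes into 0/1 class ids. As a rough sketch of the same idea with an explicit lookup table (the `LABEL_MAP` name and the toy labels are assumptions for illustration), an unexpected code such as the neutral label 2 would raise an error instead of passing through unchanged:

```python
from collections import Counter

# Explicit mapping from raw polarity codes to binary class ids.
LABEL_MAP = {0: 0, 4: 1}

# Toy stand-in for column 0 of the dataset.
raw_labels = [0, 4, 4, 0, 4]

# A KeyError here would signal a label outside {0, 4}.
binary_labels = [LABEL_MAP[label] for label in raw_labels]

counts = Counter(binary_labels)
print(binary_labels, dict(counts))
```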

In [8]:
dataset

| | 0 | 1 | 2 | 3 | 4 | 5 | sentiment |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | 0 |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | 0 |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1599995 | 4 | 2193601966 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | AmandaMarie1028 | Just woke up. Having no school is the best fee... | 1 |
| 1599996 | 4 | 2193601969 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | TheWDBoards | TheWDB.com - Very cool to hear old Walt interv... | 1 |
| 1599997 | 4 | 2193601991 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | bpbabe | Are you ready for your MoJo Makeover? Ask me f... | 1 |
| 1599998 | 4 | 2193602064 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | tinydiamondz | Happy 38th Birthday to my boo of alll time!!! ... | 1 |


In [12]:
tweet_df = dataset[[5,'sentiment']]

In [14]:
tweet_df.columns = ["tweet","label"]

In [16]:
tweet_df.to_csv("tweets_lstm.csv", index = False)  # do not write the row index as an extra column

In [18]:
# torchtext.legacy is only available in torchtext 0.9 to 0.11
from torchtext.legacy import data

#### Create dataset

In [29]:
label = data.LabelField()
tweet = data.Field(tokenize = "spacy",lower = True)

In [30]:
fields = [("tweet",tweet),("label",label)]
twitterDataset = data.TabularDataset(
    path = 'tweets_lstm.csv',
    format = 'CSV',
    fields = fields,
    skip_header = True,
)

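# split() takes the ratios as [train, test, valid] but returns the splits as (train, valid, test)
train, val, test = twitterDataset.split(split_ratio = [0.8, 0.1, 0.1])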

In [31]:
len(train),len(test),len(val)

(1280000, 160000, 160000)
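
These sizes are simply 80%, 10% and 10% of the 1,600,000 rows. The sketch below shows one way such a random split can be produced; `split_indices` is a hypothetical helper written for illustration and is not how torchtext implements `split()`:

```python
import random

def split_indices(n, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle range(n) and cut it into three consecutive chunks sized by `ratios`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    first = round(n * ratios[0])
    second = first + round(n * ratios[1])
    # The last chunk takes whatever remains, so no index is dropped.
    return indices[:first], indices[first:second], indices[second:]

part_a, part_b, part_c = split_indices(1000)
print(len(part_a), len(part_b), len(part_c))
```

Shuffling before cutting matters for this dataset because the file is ordered by label: the first rows are all negative and the last rows are all positive.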

In [32]:
vars(train.examples[7])

{'tweet': ['already',
  'i',
  'forgot',
  'you',
  'but',
  'your',
  'shadow',
  'always',
  'distrub',
  'me'],
 'label': '0'}

In [33]:
vocab_size = 20000
tweet.build_vocab(train,max_size = vocab_size)

In [34]:
len(tweet.vocab)

20002

**The extra 2 entries are the special tokens \<unk>, which stands in for words outside the vocabulary, and \<pad>, which pads the texts in a batch to the same length.**
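
To make the `max_size + 2` count concrete, here is a minimal pure-Python sketch of a frequency-capped vocabulary. `build_vocab` and the toy documents are assumptions for illustration; torchtext's own `build_vocab` has more options and is not reproduced here:

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens and prepend the special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

docs = [["i", "love", "it"], ["i", "hate", "it"], ["i", "forgot"]]
itos, stoi = build_vocab(docs, max_size=3)

# Tokens that did not make the cut fall back to the <unk> index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["i", "it", "you"]]
print(len(itos), encoded)
```

The vocabulary holds `max_size` regular tokens plus the two specials, which mirrors the 20000 + 2 = 20002 entries reported above.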

In [35]:
tweet.vocab.freqs.most_common(10)

[('i', 798981),
 ('!', 722657),
 ('.', 646778),
 (' ', 469865),
 ('to', 452234),
 ('the', 417462),
 (',', 386417),
 ('a', 304461),
 ('my', 252770),
 ('it', 242801)]

#### Create dataloader

In [37]:
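# sort_key tells BucketIterator how to order examples by length; the validation and test splits are sorted with it
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, val, test), batch_size = 32, sort_key = lambda x: len(x.tweet))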
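
The point of a bucketing iterator is to batch together texts of similar length so that little padding is needed. The following is a simplified sketch of that idea, not torchtext's actual algorithm; `bucket_batches` and the toy token lists are hypothetical:

```python
import random

def bucket_batches(seqs, batch_size, pad="<pad>", seed=0):
    """Group sequences of similar length, then pad each batch to its own max length."""
    ordered = sorted(seqs, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle the batch order, not the batch contents
    padded = []
    for batch in batches:
        width = max(len(s) for s in batch)
        padded.append([s + [pad] * (width - len(s)) for s in batch])
    return padded

tweets = [["a"], ["b", "c", "d", "e"], ["f", "g"], ["h", "i", "j"]]
batches = bucket_batches(tweets, batch_size=2)
total_pad = sum(row.count("<pad>") for batch in batches for row in batch)
print(total_pad)
```

Batching the same four toy sequences in their original order would pair lengths 1 with 4 and 2 with 3, which needs four padding tokens in total; grouping by length first needs fewer.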

#### Build LSTM model