<a href="https://www.kaggle.com/code/kartikeysharmaah/1rt720-notebook-2?scriptVersionId=231155280" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Encoder-based Sentiment Analysis
* 1.6 million tweets
* Use transformer (**encoder**) model
* Classify tweets as having positive or negative tone

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

**Dataset**
* target: 0 (negative) and 4 (positive)
* ID: tweet id
* text: the text in tweet
* labels generated automatically based on emoticons in tweets and replies.

In [2]:
!pip install torchsummary



In [3]:
## pytorch libraries
import torch
import torch.nn as nn
from torchsummary import summary
from torch.utils.data import Dataset, DataLoader

In [4]:
df = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1', names=['target', 'tweet_id', 'datetime', 'redundant', 'user', 'text'])

In [5]:
df.head()

| | target | tweet_id | datetime | redundant | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [6]:
## column statistics
## filled
df = df.astype({'tweet_id': str})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   target     1600000 non-null  int64 
 1   tweet_id   1600000 non-null  object
 2   datetime   1600000 non-null  object
 3   redundant  1600000 non-null  object
 4   user       1600000 non-null  object
 5   text       1600000 non-null  object
dtypes: int64(1), object(5)
memory usage: 73.2+ MB


In [7]:
## target
print(df.groupby('target').size())

target
0    800000
4    800000
dtype: int64


```Balanced dataset: 800k records each for positive and negative tweets```

In [8]:
## tweet id
print('Total tweet count: ', df.loc[:,'tweet_id'].count())
print('Unique:', df.loc[:,'tweet_id'].nunique())
print('Duplicate:', df.loc[:,'tweet_id'].duplicated().sum()) ## Inspect

duplicate_tweets = df.loc[df.index[df.loc[:,'tweet_id'].duplicated()==True],'tweet_id'].to_list()

Total tweet count:  1600000
Unique: 1598315
Duplicate: 1685


Example duplicate tweets😞 [contain both positive and negative labels]
* 1467863684
* 1467880442
* 1468053611
* 1468100580
* 1468115720

In [9]:
df.loc[df.index[df.loc[:,'tweet_id'].isin(duplicate_tweets)],['target','tweet_id','user','text']].sort_values(by='tweet_id').head()

| | target | tweet_id | user | text |
|---|---|---|---|---|
| 213 | 0 | 1467863684 | DjGundam | Awwh babs... you look so sad underneith that s... |
| 800261 | 4 | 1467863684 | DjGundam | Awwh babs... you look so sad underneith that s... |
| 275 | 0 | 1467880442 | iCalvin | Haven't tweeted nearly all day Posted my webs... |
| 800300 | 4 | 1467880442 | iCalvin | Haven't tweeted nearly all day Posted my webs... |
| 989 | 0 | 1468053611 | mariejamora | @hellobebe I also send some updates in plurk b... |


* drop **user**, **datetime**, **tweet_id** and **redundant**
* they don't contribute to predictions
* also drop duplicate rows shown above
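
The drop rule above removes every copy of a repeated tweet id, not only the later copies, because the copies carry conflicting labels. A minimal plain-Python sketch of that rule follows; the `rows` list and its values are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (target, tweet_id, text) rows; ids "b" and "d" repeat with conflicting labels.
rows = [
    (0, "a", "so tired"),
    (0, "b", "mixed feelings"),
    (4, "b", "mixed feelings"),
    (4, "c", "great day"),
    (0, "d", "hmm"),
    (4, "d", "hmm"),
]

id_counts = Counter(tweet_id for _, tweet_id, _ in rows)

# Keep a row only if its id occurs exactly once, so both copies of a duplicate go.
kept = [(target, text) for target, tweet_id, text in rows if id_counts[tweet_id] == 1]
print(kept)
```

In pandas, one way to express the same rule is `df[~df['tweet_id'].duplicated(keep=False)]`, since `keep=False` marks all copies of a duplicated value.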

In [10]:
df.drop(df.index[df.loc[:,'tweet_id'].isin(duplicate_tweets)],axis=0,inplace=True)
df.drop(['user','tweet_id','datetime','redundant'],axis=1,inplace=True)

In [11]:
## resulting dataframe
print(df.shape)
df.head()

(1596630, 2)


| | target | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


---

In [12]:
## numpy array
target = np.array(df.loc[:,'target'],dtype=np.float32)
text = np.array(df.loc[:,'text'],dtype=str)

In [13]:
from sklearn.model_selection import train_test_split
train_text, valid_text, train_target, valid_target = train_test_split(text,target,test_size=0.1,random_state=42)

In [14]:
print(train_target.shape)
print(valid_target.shape)

(1436967,)
(159663,)


In [15]:
## replace 4 with 1
train_target[train_target == 4] = 1
valid_target[valid_target == 4] = 1

**Token list**
* trim '.'
* ignore tags '@'
* split words which have '...' in between
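
The three rules above can be sketched as a small function that returns the tokens of one tweet. The name `split_tokens` and the sample sentence are illustrative assumptions; the notebook's own tokenizer counts tokens in a dictionary instead of returning a list.

```python
def split_tokens(text):
    """Return the tokens of one tweet under the three rules listed above."""
    tokens = []
    for word in str(text).lower().split(' '):
        if word == '' or word[0] == '@':   # skip empty pieces and @tags
            continue
        # trim leading/trailing dots, then split on any dots left inside the word
        for piece in word.strip('.').split('.'):
            if piece != '':
                tokens.append(piece)
    return tokens

print(split_tokens('@ABCD I hate mondays... very..much....'))
# ['i', 'hate', 'mondays', 'very', 'much']
```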

In [16]:
## A very basic tokenizer: split the
## words, trim '.' and ignore tags,
## then count every token in a
## dictionary

freqn_bag = {}

def tokenize(text):
    ## freqn_bag is only mutated, so no 'global' is needed
    for token in str(text).lower().split(' '):
        if len(token) == 0 or token[0] == '@':
            continue ## skip empty pieces and tags
        ## trim '.', then split on any '.' left inside the word
        for yoken in token.strip('.').split('.'):
            if yoken != '':
                freqn_bag[yoken] = freqn_bag.get(yoken, 0) + 1

In [17]:
## wrap every tweet in <cls> and <mask> so
## both tags get a vocabulary entry for the
## pre-training and fine-tuning parts. No
## <pad> tag is added in this cell.

for text in train_text:
    tokenize('<cls> '+ text+' <mask>')

In [18]:
## only take the first 50,000 
## tokens (size constraints)
token_bag = {}
limit = 50000 ## HYPERPARAMETER
count = 0
for key in dict(sorted(freqn_bag.items(),key=lambda item: item[1],reverse=True)).keys():
    if count >= limit:
        break
    token_bag[key] = count
    count = count + 1

In [19]:
token_length = len(token_bag)
print(f'Token length: {token_length}')

Token length: 50000
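
The vocabulary step keeps the most frequent tokens and numbers them by rank. A compact sketch of the same idea with `collections.Counter` is shown here; `build_vocab` and the toy word list are assumptions for illustration, not the notebook's code.

```python
from collections import Counter

def build_vocab(token_counts, limit):
    """Map the `limit` most frequent tokens to consecutive ids, most frequent first."""
    return {token: idx for idx, (token, _) in enumerate(token_counts.most_common(limit))}

counts = Counter(['good', 'day', 'good', 'bad', 'good', 'day'])
vocab = build_vocab(counts, limit=2)
print(vocab)
# {'good': 0, 'day': 1}
```

Tokens outside the top `limit` get no id, so a later id lookup has to skip them. Because ids follow frequency rank, the id that a special token such as `<mask>` receives depends on how often it occurs in the training text; looking it up by name (for example `token_bag['<mask>']`) is one way to avoid relying on a fixed position.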


---

In [20]:
## dataset classes using torch dataset classes
class SentimentDataset(Dataset):
    def __init__(self, text, target):
        super().__init__()
        self.text = text
        self.target = target

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        return self.text[idx], self.target[idx]

In [21]:
## dataset
train_dataset = SentimentDataset(train_text,train_target)
valid_dataset = SentimentDataset(valid_text,valid_target)

---

**Encoder (BERT like model)**
* word embedding and positional encoding
* Multiple transformer layers
* Multi-head self-attention blocks

[Figure: TransformerBlockSAMultiHead.svg]

In [22]:
## Multi-head self-attention block of a transformer.
## Self-attention lets every token 'attend' to (be
## aware of) the other tokens in the tweet. Several
## heads run in parallel, each working on its own
## slice of the embedding (D/H dimensions per head).

class MultiHeadSelfAttention(nn.Module): ## Multi-head
    def __init__(self, embedding_dimension, num_heads):
        super().__init__()
        assert embedding_dimension % num_heads == 0

        self.dim = embedding_dimension
        self.num_heads = num_heads
        self.head_dim = embedding_dimension//num_heads

        self.K = nn.Linear(self.dim,self.dim)
        self.V = nn.Linear(self.dim,self.dim)
        self.Q = nn.Linear(self.dim,self.dim)
        self.projection = nn.Linear(self.dim,self.dim,bias=False)

    def forward(self, x):

        B,N,D = x.shape

        Kx = self.K(x) ## BxNxD
        Qx = self.Q(x)
        Vx = self.V(x)

        Kx = torch.reshape(Kx,(B,N,self.num_heads,self.head_dim)) ## BxNxHx(D/H)
        Qx = torch.reshape(Qx,(B,N,self.num_heads,self.head_dim))
        Vx = torch.reshape(Vx,(B,N,self.num_heads,self.head_dim))

        Attx = nn.Softmax(dim=3)((1/np.sqrt(self.head_dim))*torch.transpose(Qx,1,2)@torch.transpose(torch.transpose(Kx,1,2),2,3)) ## BxHxNxN
        Satx = torch.transpose(Attx@torch.transpose(Vx,1,2),1, 2) ## BxNxHx(D/H)

        return self.projection(torch.reshape(Satx,(B,N,self.dim)))## BxNxD

```
## dummy code
tempx = torch.randn(2,3,256)
tempmodel = MultiHeadSelfAttention(256,8)
output = tempmodel(tempx)
print(output)
print(output.shape)
```
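
To make the arithmetic inside one head concrete, here is a dependency-free sketch of scaled dot-product attention on plain Python lists. It is a single head without the learned K, Q, V and projection layers, so it illustrates the formula only and is not the batched PyTorch module above; the toy input values are arbitrary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention; Q, K, V are lists of N vectors."""
    d = len(Q[0])
    out, weights = [], []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(head_dim)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # output = attention-weighted average of the value vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Toy input: 3 tokens, head dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
print(weights[0])
```

Each row of `weights` sums to 1, which corresponds to the softmax over the last dimension of the BxHxNxN tensor in the module.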

**Transformer Block**
* self-attention
* Layer Norm
* MLP
* skip connections

[Figure: TransformerBlock.svg]

In [23]:
## Each of transformer layer contains multi-head
## self-attention, with residual connections and
## layer norm. Then comes MLP, with skip connect
## and layer norm.
class MLP(nn.Module):
    def __init__(self, input_dimension):
        super().__init__()
        self.dim = input_dimension

        self.gelu = torch.nn.GELU(approximate='tanh')
        self.l1 = nn.Linear(self.dim,self.dim*4)
        self.l2 = nn.Linear(self.dim*4,self.dim)

    def forward(self, x):
        return self.l2(self.gelu(self.l1(x))) ## MLP!
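
For reference, the activation selected by `approximate='tanh'` can be written out by hand. The sketch below compares the tanh approximation with the exact GELU in plain Python; the function names are illustrative and the sample points are arbitrary.

```python
import math

def gelu_tanh(x):
    """Tanh approximation of GELU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gelu_tanh(x), gelu_exact(x))
```

The two curves stay close over this range, so the choice between them mostly affects speed rather than model behaviour.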

In [24]:
## One transformer layer is a combination of multi-head
## self-attention and  linear layers, interspersed with
## layer norm. There are residual connections over self
## attention and MLP block

class Transformer(nn.Module):
    def __init__(self, embedding_dimension, num_heads):
        super().__init__()
        
        self.dim = embedding_dimension
        self.num_heads = num_heads

        self.mhsa = MultiHeadSelfAttention(self.dim,self.num_heads)
        self.ln1  = nn.LayerNorm(self.dim)
        self.ln2  = nn.LayerNorm(self.dim)
        self.mlp  = MLP(self.dim)

    def forward(self, x):
        y1 = self.mhsa(x)
        y2 = x + y1
        y3 = self.ln1(y2)
        y4 = self.mlp(y3)
        y5 = y3 + y4
        y6 = self.ln2(y5)
        return y6 ## mhsa -> ln1->mlp->ln2

```
## dummy testing
tempx = torch.rand(2,3,256)
tempmodel = Transformer(256,8)
output = tempmodel(tempx)
print(output)
print(output.shape)
```
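
The forward pass above applies the same pattern twice: add the sub-layer output to its input, then normalize. A small plain-Python sketch of that pattern follows, assuming a layer norm without the learned scale and shift; `post_ln_sublayer` and the stand-in `double` sub-layer are hypothetical names used only for illustration.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def post_ln_sublayer(x, sublayer):
    """Residual connection followed by layer norm: LN(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Stand-in sub-layer for illustration only; in the block it is attention or the MLP.
double = lambda v: [2.0 * a for a in v]

h = post_ln_sublayer([1.0, 2.0, 3.0, 4.0], double)
print(h)
```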

In [25]:
class Encoder(nn.Module):
    def __init__(self, embedding_dimension, num_heads, label, tweet_length=400):
        super().__init__()

        ## Encoder block contains 2 transformer layers,
        ## followed by an MLP head that outputs one logit
        ## per vocabulary token. Inputs are the sum of
        ## token and positional embeddings.
        
        self.dim = embedding_dimension
        self.num_heads = num_heads
        self.label = label

        self.t1 = Transformer(self.dim,self.num_heads)
        self.t2 = Transformer(self.dim,self.num_heads)
        self.te = nn.Embedding(token_length, self.dim)
        self.pe = nn.Embedding(tweet_length, self.dim)

        self.pipelines = nn.Sequential(nn.Linear(self.dim,4*self.dim),nn.ReLU(),
                                       nn.Linear(4*self.dim,token_length))

    def forward(self, x):

        N = x.shape[1]
        embed = self.te(x) + self.pe(torch.arange(N,device=x.device))
        embed = self.t1(embed)
        embed = self.t2(embed)
        embed = self.pipelines(embed)
        return embed

In [26]:
loss = nn.CrossEntropyLoss()

In [27]:
## dummy testing 
## to make sure
## no NaN show up
tempx = torch.tensor([[1,44226,3444,24774,2],[1,100,18949,2,27654]])
tempy = torch.randint(low=0,high=2,size=(2,5,50000))
tempmodel = Encoder(256,8,1)
output = tempmodel(tempx)

In [28]:
print(output)
print(output.shape)
print(loss(output,tempy.float()))
print(f'Parameter: {sum(p.numel() for p in tempmodel.parameters())}')

tensor([[[-0.1356,  0.1453,  0.2589,  ...,  0.0474, -0.1551,  0.2649],
         [-0.0421, -0.0915,  0.3160,  ...,  0.0033, -0.3610,  0.3805],
         [-0.1301, -0.2099,  0.1268,  ...,  0.0019, -0.1569, -0.1955],
         [-0.2233, -0.2571,  0.3200,  ...,  0.0050, -0.2074,  0.1232],
         [-0.1039, -0.1025, -0.1284,  ...,  0.3545, -0.2583, -0.1243]],

        [[-0.1438,  0.1988,  0.1794,  ...,  0.1195, -0.1226,  0.3347],
         [-0.1449, -0.3391,  0.0843,  ...,  0.0694, -0.3047,  0.4355],
         [ 0.1611, -0.0832,  0.0740,  ...,  0.5237,  0.0113,  0.0962],
         [-0.1765, -0.2817,  0.1241,  ...,  0.1283, -0.2345, -0.2048],
         [ 0.1646, -0.0804,  0.1938,  ...,  0.4382, -0.0601, -0.0095]]],
       grad_fn=<ViewBackward0>)
torch.Size([2, 5, 50000])
tensor(4.0577, grad_fn=<DivBackward1>)
Parameter: 65994576
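
The printed parameter count can be reproduced by hand from the layer shapes defined above. The sketch below does the arithmetic in plain Python; `linear_params` is a helper introduced here for illustration.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

D, V, L = 256, 50000, 400   # embedding dim, vocabulary size, max positions

attention = 3 * linear_params(D, D) + linear_params(D, D, bias=False)  # K, V, Q + projection
mlp = linear_params(D, 4 * D) + linear_params(4 * D, D)
layer_norms = 2 * (2 * D)                                              # weight + bias each
block = attention + mlp + layer_norms

embeddings = V * D + L * D                                             # token + position tables
head = linear_params(D, 4 * D) + linear_params(4 * D, V)               # output MLP over the vocabulary

total = embeddings + 2 * block + head
print(total)
# 65994576
```

Most of the parameters sit in the final vocabulary layer (51,250,000) and the token embedding table (12,800,000); the two transformer layers together account for about 1.6 million.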


---
**Pre-training**   
* uses self-supervision: one word per tweet is masked
* and predicted from its context
* Model learns syntax of a sentence
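
One training example for this objective can be sketched as follows: hide a random position, prepend `<cls>`, and remember which id was hidden. The function name and the ids `cls_id=0`, `mask_id=1` are placeholders for illustration; the real ids come from the vocabulary.

```python
import random

def make_masked_example(token_ids, cls_id, mask_id, rng=random):
    """Mask one random position and prepend <cls>; return (inputs, masked_index, original_id)."""
    index = rng.randrange(len(token_ids))      # position to hide
    original = token_ids[index]
    masked = list(token_ids)
    masked[index] = mask_id
    # after prepending <cls>, the masked token sits one position further right
    return [cls_id] + masked, index + 1, original

rng = random.Random(0)
inputs, index, original = make_masked_example([17, 5, 42, 8], cls_id=0, mask_id=1, rng=rng)
print(inputs, index, original)
```

The model is then asked to predict `original` from the output at position `index`.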

In [29]:
## reverse tokens
reverse_token_bag = {}
for key, value in token_bag.items():
    reverse_token_bag[value] = key

In [30]:
## HYPERPARAMETERS
embedding_dimension = 256
num_heads = 8
learning_rate = 0.001

In [31]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [32]:
encoder_model = Encoder(embedding_dimension,num_heads,1).to(device=device)
optimizer = torch.optim.SGD(encoder_model.parameters(),lr=learning_rate,momentum=0.9) ## SGD

In [33]:
## convert words to numbers using
## token_bag; words outside the
## vocabulary are skipped. <cls> and
## <mask> are added by the caller.

def text_embedding(text):
    embedding = []
    token_list = str(text).lower().split(' ')
    for token in token_list:
        if len(token) == 0:
            continue
        if token[0] == '@':
            continue

        token = token.strip('.')
        yoken = ''
        for chars in token:
            if chars =='.':
                if yoken in token_bag.keys():
                    embedding.append(token_bag[yoken])
                yoken = ''
            else:
                yoken = yoken + chars
        if yoken in token_bag.keys():
            embedding.append(token_bag[yoken])
    return torch.tensor([embedding])

```
## dummy
text_embedding('<cls> I hate <mask> @ABCD very..much....')
```

In [34]:
step = 0
loss_avg = []
encoder_model.train()
print('Training....')
for text, target in train_dataset: ## target unused

    x = text_embedding(text)
    if x.shape[1] == 0:
        continue
    
    ## REMEMBER INDEX
    index = np.random.randint(low=0,high=len(x[0]))
    original = int(x[0,index])
    x[0,index] = 2 ## a <mask>

    ## ADD <cls>
    x = torch.cat((torch.tensor([[1]]), x) , dim=1)
    index = index+1
    y = torch.zeros(token_length)
    y[original] = 1.0
    
    x = x.to(device=device)
    y = y.to(device=device)

    ## TRAINING
    optimizer.zero_grad()
    logits = encoder_model(x)[0,index,:]
    loss_value = loss(logits,y)
    loss_avg.append(loss_value.item()) ## keep a float, not the autograd graph

    loss_value.backward()
    optimizer.step()
    if step%10000 == 0:
        mean_loss = torch.mean(torch.tensor(loss_avg)) ## way to capture progress
        print(f'Step: {step+1} \t Mean Loss: {mean_loss:.2f}')
        loss_avg  = []
    step += 1

Training....
Step: 1 	 Mean Loss: 11.26
Step: 10001 	 Mean Loss: 8.13
Step: 20001 	 Mean Loss: 7.67
Step: 30001 	 Mean Loss: 7.54
Step: 40001 	 Mean Loss: 7.42
Step: 50001 	 Mean Loss: 7.34
Step: 60001 	 Mean Loss: 7.18
Step: 70001 	 Mean Loss: 7.06
Step: 80001 	 Mean Loss: 7.04
Step: 90001 	 Mean Loss: 6.98
Step: 100001 	 Mean Loss: 6.83
Step: 110001 	 Mean Loss: 6.89
Step: 120001 	 Mean Loss: 6.80
Step: 130001 	 Mean Loss: 6.78
Step: 140001 	 Mean Loss: 6.75
Step: 150001 	 Mean Loss: 6.71
Step: 160001 	 Mean Loss: 6.64
Step: 170001 	 Mean Loss: 6.64
Step: 180001 	 Mean Loss: 6.58
Step: 190001 	 Mean Loss: 6.56
Step: 200001 	 Mean Loss: 6.52
Step: 210001 	 Mean Loss: 6.51
Step: 220001 	 Mean Loss: 6.44
Step: 230001 	 Mean Loss: 6.46
Step: 240001 	 Mean Loss: 6.49
Step: 250001 	 Mean Loss: 6.45
Step: 260001 	 Mean Loss: 6.34
Step: 270001 	 Mean Loss: 6.38
Step: 280001 	 Mean Loss: 6.35
Step: 290001 	 Mean Loss: 6.25
Step: 300001 	 Mean Loss: 6.25
Step: 310001 	 Mean Loss: 6.36
Step: 32

**FINAL LOSS:** 6.11   
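
To put the loss values in context, they can be compared with the cross-entropy of a model that guesses uniformly over the 50,000-token vocabulary. The sketch below assumes natural-log cross-entropy, as computed by `nn.CrossEntropyLoss`, and uses only the numbers reported above.

```python
import math

vocab_size = 50000
final_loss = 6.11            # value reported above

# Cross-entropy of a model that spreads probability evenly over the vocabulary.
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely candidates per masked word.
perplexity = math.exp(final_loss)

print(f'uniform baseline: {uniform_loss:.2f}, perplexity at final loss: {perplexity:.0f}')
```

The uniform baseline is roughly 10.8, close to the first logged value of 11.26, while a loss of 6.11 corresponds to choosing among about 450 equally likely words.
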
**save model parameters and dataset**

In [35]:
torch.save(encoder_model.state_dict(),'/kaggle/working/encoder_model.pth')
torch.save(train_dataset,'/kaggle/working/train_dataset.pt')
torch.save(valid_dataset,'/kaggle/working/valid_dataset.pt')

In [36]:
torch.save(token_bag,'/kaggle/working/token_bag.pt') ## token_bag needed during fine-tuning

**TEST**

In [115]:
x = torch.cat((torch.tensor([[1]]),text_embedding(train_dataset[1][0])),dim=1)
print(f'Original Sentence: {train_dataset[1][0]}')
print(f'Index to mask: 5')
x[0,5] = 2

x = x.to(device=device)
encoder_model.eval()
p = torch.argmax(nn.Softmax(dim=0)(encoder_model(x)[0,5,:]))
print(f'Predicted word: {reverse_token_bag[int(p)]}')

Original Sentence: @DaRealestDCG Hello there! Thanks for following... 
Index to mask: 5
Predicted word: following


In [116]:
x = torch.cat((torch.tensor([[1]]),text_embedding(train_dataset[9][0])),dim=1)
print(f'Original Sentence: {train_dataset[9][0]}')
print(f'Index to mask: 3')
x[0,3] = 2

x = x.to(device=device)
encoder_model.eval()
p = torch.argmax(nn.Softmax(dim=0)(encoder_model(x)[0,3,:]))
print(f'Predicted word: {reverse_token_bag[int(p)]}')

Original Sentence: Looking after my auntie's dog for two weeks. My kitten is totally unimpressed! Poor Miu 
Index to mask: 3
Predicted word: my


---
**END**