# **Sentiment Analysis with Deep Learning using BERT**


## **What is BERT?**

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found here (https://arxiv.org/abs/1810.04805).

HuggingFace documentation (https://huggingface.co/transformers/model_doc/bert.html)

## 1: Exploratory Data Analysis and Preprocessing

In [51]:
!pip install torch #no change

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [52]:
import torch #no change
from tqdm.notebook import tqdm #no change
import numpy as np #no change
import pandas as pd #no change

In [56]:
df = pd.read_csv('TextVsLabel.csv') #change


In [57]:
df.head() #no change

Unnamed: 0.1,Unnamed: 0,text,label_bias
0,0,"""Orange Is the New Black"" star Yael Stone is r...",Non-biased
1,1,"""We have one beautiful law,"" Trump recently sa...",Biased
2,2,"...immigrants as criminals and eugenics, all o...",Biased
3,3,...we sounded the alarm in the early months of...,Biased
4,4,[Black Lives Matter] is essentially a non-fals...,Biased


In [58]:
df.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [59]:
df.head()

Unnamed: 0,text,label_bias
0,"""Orange Is the New Black"" star Yael Stone is r...",Non-biased
1,"""We have one beautiful law,"" Trump recently sa...",Biased
2,"...immigrants as criminals and eugenics, all o...",Biased
3,...we sounded the alarm in the early months of...,Biased
4,[Black Lives Matter] is essentially a non-fals...,Biased


In [60]:
#set(df.category) #change
set(df.label_bias)

{'Biased', 'Non-biased'}

In [61]:
#df.category.value_counts() #change
df.label_bias.value_counts()

Non-biased    1863
Biased        1810
Name: label_bias, dtype: int64

In [8]:
#df = df[df.category.isin(['happy', 'not-relevant', 'angry', 'surprise', 'sad', 'disgust'])] #change

In [62]:
df.label_bias.value_counts() #change 

Non-biased    1863
Biased        1810
Name: label_bias, dtype: int64

In [64]:
#possible_labels = df.category.unique()  change
possible_labels = df.label_bias.unique() 

In [65]:
possible_labels

array(['Non-biased', 'Biased'], dtype=object)

In [66]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [67]:
label_dict

{'Non-biased': 0, 'Biased': 1}

In [68]:
#df.category.unique()
df.label_bias.unique()

array(['Non-biased', 'Biased'], dtype=object)

In [69]:
#df.category = df['category'].map(label_dict)
df.label_bias = df['label_bias'].map(label_dict)

In [70]:
#df.category.unique()
df.label_bias.unique()

array([0, 1])

In [71]:
df.head(10)

Unnamed: 0,text,label_bias
0,"""Orange Is the New Black"" star Yael Stone is r...",0
1,"""We have one beautiful law,"" Trump recently sa...",1
2,"...immigrants as criminals and eugenics, all o...",1
3,...we sounded the alarm in the early months of...,1
4,[Black Lives Matter] is essentially a non-fals...,1
5,[Democrats employ] their full arsenal to deleg...,1
6,[Newsoms's] obsession with masks has created a...,1
7,[Newsoms's] onslaught of propaganda ignores co...,1
8,[The police] now prefer to think of themselves...,1
9,‘A new low’: Washington Post media critic blow...,1


Classes are imbalanced as visible

## 2: Training/Validation Split

In [18]:
from sklearn.model_selection import train_test_split

In [73]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  #df.category.values, 
                                                  df.label_bias.values,
                                                  test_size=0.15, 
                                                  random_state=42,
                                                  stratify=df.label_bias.values)
                                                  #stratify=df.category.values)

In [74]:
    df.head()

Unnamed: 0,text,label_bias
0,"""Orange Is the New Black"" star Yael Stone is r...",0
1,"""We have one beautiful law,"" Trump recently sa...",1
2,"...immigrants as criminals and eugenics, all o...",1
3,...we sounded the alarm in the early months of...,1
4,[Black Lives Matter] is essentially a non-fals...,1


In [75]:
len(df)

3673

In [76]:
df['data_type'] = ['not_set']*df.shape[0]

In [77]:
df.head()

Unnamed: 0,text,label_bias,data_type
0,"""Orange Is the New Black"" star Yael Stone is r...",0,not_set
1,"""We have one beautiful law,"" Trump recently sa...",1,not_set
2,"...immigrants as criminals and eugenics, all o...",1,not_set
3,...we sounded the alarm in the early months of...,1,not_set
4,[Black Lives Matter] is essentially a non-fals...,1,not_set


In [78]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [79]:
df

Unnamed: 0,text,label_bias,data_type
0,"""Orange Is the New Black"" star Yael Stone is r...",0,train
1,"""We have one beautiful law,"" Trump recently sa...",1,train
2,"...immigrants as criminals and eugenics, all o...",1,val
3,...we sounded the alarm in the early months of...,1,train
4,[Black Lives Matter] is essentially a non-fals...,1,train
...,...,...,...
3668,You’ve heard of Jim Crow and Southern Segregat...,1,train
3669,Young female athletes’ dreams and accomplishme...,1,train
3670,"Young white men, reacting to social and educat...",1,val
3671,Young women taking part in high school and col...,1,train


In [80]:
#df.groupby(['category', 'data_type']).count()
df.groupby(['label_bias', 'data_type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,text
label_bias,data_type,Unnamed: 2_level_1
0,train,1584
0,val,279
1,train,1538
1,val,272


# 3. Loading Tokenizer and Encoding our Data

In [81]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [82]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [83]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', #bert-base-uncased using small bert model for simple data , bert-large-uncased fo large data
    do_lower_case=True
)

In [85]:
encoded_data_train = tokenizer.batch_encode_plus(
    
    df[df.data_type=='train'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
#labels_train = torch.tensor(df[df.data_type=='train'].category.values)
labels_train = torch.tensor(df[df.data_type=='train'].label_bias.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
#labels_val = torch.tensor(df[df.data_type=='val'].category.values)
labels_val = torch.tensor(df[df.data_type=='val'].label_bias.values)



In [86]:
input_ids_train

tensor([[ 101, 1000, 4589,  ...,    0,    0,    0],
        [ 101, 1000, 2057,  ...,    0,    0,    0],
        [ 101, 1012, 1012,  ...,    0,    0,    0],
        ...,
        [ 101, 2402, 2931,  ...,    0,    0,    0],
        [ 101, 2402, 2308,  ...,    0,    0,    0],
        [ 101, 7858, 2003,  ...,    0,    0,    0]])

In [87]:
#TensorDataset create a single variable which stores info of inputids , attentionmsk, labels
dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)

In [88]:
len(dataset_train)

3122

In [89]:
dataset_val.tensors

(tensor([[ 101, 1012, 1012,  ...,    0,    0,    0],
         [ 101, 1520, 1996,  ...,    0,    0,    0],
         [ 101, 1037, 3438,  ...,    0,    0,    0],
         ...,
         [ 101, 2017, 2089,  ...,    0,    0,    0],
         [ 101, 2017, 2342,  ...,    0,    0,    0],
         [ 101, 2402, 2317,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
         0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
         1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
         0, 1, 1, 0, 0, 0, 1, 1, 1,

# 4. Setting up BERT Pretrained Model

In [90]:
from transformers import BertForSequenceClassification

In [91]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-base-uncased', 
                                      num_labels = len(label_dict),#6
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# 5. Creating Data Loaders

In [92]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [93]:
dataset_train

<torch.utils.data.dataset.TensorDataset at 0x7ff3049b8970>

In [94]:
batch_size = 4

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=RandomSampler(dataset_val),
    batch_size=32
)

# 6. Setting Up Optimizer and Scheduler

In [95]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [96]:
optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8
)



In [97]:
epochs = 1

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

# 7. Defining our Performance Metrics

In [98]:
import numpy as np
from sklearn.metrics import f1_score

In [99]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [100]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

# 8. Creating our Training Loop

In [101]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [102]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [103]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [104]:
for epoch in tqdm(range(1, epochs+1)):
    model.train() #forward propagation
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward() #backwardprop
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})     
    
    torch.save(model, f'BERT_ft_Epoch{epoch}.model')
    
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/781 [00:00<?, ?it/s]


Epoch 1
Training loss: 0.5426440585533422


  0%|          | 0/18 [00:00<?, ?it/s]

Validation loss: 0.5494070814715492
F1 Score (weighted): 0.7504648320308159


# EVALUATION

In [106]:
import torch

In [107]:
headline = "Trump's indictment is sending shockwaves across the political landscape"

In [109]:
from transformers import BertTokenizer

In [110]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', #bert-base-uncased using small bert model for simple data , bert-large-uncased fo large data
    do_lower_case=True
)

In [111]:
device = torch.device('cpu')

In [112]:
print(device)

cpu


In [117]:
encoded_headline = tokenizer(headline, return_tensors = 'pt')

In [118]:
encoded_headline

{'input_ids': tensor([[  101,  8398,  1005,  1055, 24265,  2003,  6016,  5213, 16535,  2015,
          2408,  1996,  2576,  5957,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [119]:
input_ids = encoded_headline['input_ids'].to(device)
attention_msk = encoded_headline['attention_mask'].to(device)

In [121]:
path = '/content/BERT_ft_Epoch1.model'
model = torch.load(path, map_location = torch.device('cpu'))

In [122]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [123]:
model_output = model(input_ids,attention_msk)

In [124]:
model_output

SequenceClassifierOutput(loss=None, logits=tensor([[-0.4621,  1.3676]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [125]:
model_output_tensor = torch.tensor(model_output.logits)

  model_output_tensor = torch.tensor(model_output.logits)


In [126]:
model_output_tensor

tensor([[-0.4621,  1.3676]])

In [128]:
model_output_tensor_categoryIndex = int(torch.argmax(model_output_tensor))

In [129]:
model_output_tensor_categoryIndex

1

In [None]:
classes = {0: 'Non-biased', 1: 'Biased'}

In [131]:
classes[model_output_tensor_categoryIndex]

'Biased'