<a href="https://colab.research.google.com/github/SentiBert/Bert-Model/blob/master/SentiBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Checking the Colab configuration

In [1]:
import psutil

def get_size(num_bytes, suffix="B"):
    """Scale a byte count to a human-readable string, e.g. 1253656 -> '1.20MB'."""
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if num_bytes < factor:
            return f"{num_bytes:.2f}{unit}{suffix}"
        num_bytes /= factor
    # Fall through for values of 1024 PB or more.
    return f"{num_bytes:.2f}E{suffix}"

print("="*40, "Memory Information", "="*40)
svmem = psutil.virtual_memory()
print(f"Total: {get_size(svmem.total)}")
print(f"Available: {get_size(svmem.available)}")
print(f"Used: {get_size(svmem.used)}")
print(f"Percentage: {svmem.percent}%")

Total: 12.72GB
Available: 11.88GB
Used: 577.13MB
Percentage: 6.6%


In [2]:
! nvidia-smi

Thu Jul 23 23:19:30 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## EDA

In [3]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [5]:
df = pd.read_csv('/content/gdrive/My Drive/data/Hotel_Reviews_modified.csv')

In [6]:
df = df[["Review","Sentiment"]]

In [7]:
df = df.iloc[0:12000, 0:2]

In [8]:
df

Unnamed: 0,Review,Sentiment
0,I am so angry that i made this post available...,0
1,No Negative,0
2,Rooms are nice but for elderly a bit difficul...,0
3,My room was dirty and I was afraid to walk ba...,0
4,You When I booked with your company on line y...,0
...,...,...
11995,The staff were very welcoming right from the ...,1
11996,Everything,1
11997,breakfast location,1
11998,Very nice old hotel with a great climate,1


In [9]:
df.Review.iloc[0]

' I am so angry that i made this post available via all possible sites i use when planing my trips so no one will make the mistake of booking this place I made my booking via booking com We stayed for 6 nights in this hotel from 11 to 17 July Upon arrival we were placed in a small room on the 2nd floor of the hotel It turned out that this was not the room we booked I had specially reserved the 2 level duplex room so that we would have a big windows and high ceilings The room itself was ok if you don t mind the broken window that can not be closed hello rain and a mini fridge that contained some sort of a bio weapon at least i guessed so by the smell of it I intimately asked to change the room and after explaining 2 times that i booked a duplex btw it costs the same as a simple double but got way more volume due to the high ceiling was offered a room but only the next day SO i had to check out the next day before 11 o clock in order to get the room i waned to Not the best way to begin y

In [10]:
df.Sentiment.value_counts()

0    6908
1    5092
Name: Sentiment, dtype: int64
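
The counts above are easier to judge as proportions. A minimal sketch, assuming only the two counts printed by `value_counts()`; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 6908, 1: 5092}

total = sum(class_counts.values())
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"Sentiment {label}: {share:.1%} of {total} reviews")
```

The share of the larger class is also the accuracy of a model that always predicts label 0, which gives a baseline to compare the fine-tuned model against.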

In [11]:
possible_labels = df.Sentiment.unique()
possible_labels.sort()
print(possible_labels)

[0 1]


In [12]:
emotion_dict = {}
for index, label in enumerate(possible_labels):
    emotion_dict[label] = index

In [13]:
emotion_dict

{0: 0, 1: 1}

In [14]:
df['labels'] = df.Sentiment.replace(emotion_dict)
df.head()

Unnamed: 0,Review,Sentiment,labels
0,I am so angry that i made this post available...,0,0
1,No Negative,0,0
2,Rooms are nice but for elderly a bit difficul...,0,0
3,My room was dirty and I was afraid to walk ba...,0,0
4,You When I booked with your company on line y...,0,0


# Train/Test split

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    df.index.values,
    df.labels.values,
    test_size=0.15,
    random_state=17,
    stratify=df.labels.values)

In [17]:
df['data_type'] = ['not_set']*df.shape[0]

In [18]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_test, 'data_type'] = 'test'

In [19]:
df.groupby(['Sentiment','labels','data_type']).count()

Sentiment,labels,data_type,Review
0,0,test,1036
0,0,train,5872
1,1,test,764
1,1,train,4328
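
Because the split was stratified on the labels, both subsets should keep roughly the same class mix. The sketch below checks this from the counts in the table above; `split_counts` and `label_share` are hypothetical names introduced only for this illustration.

```python
# Row counts copied from the groupby output above: (label, split) -> reviews.
split_counts = {
    (0, "train"): 5872, (0, "test"): 1036,
    (1, "train"): 4328, (1, "test"): 764,
}

def label_share(split, label=0):
    """Fraction of reviews in `split` that carry `label`."""
    split_total = sum(n for (_, s), n in split_counts.items() if s == split)
    return split_counts[(label, split)] / split_total

train_share = label_share("train")
test_share = label_share("test")
print(f"label 0 share: train={train_share:.4f}, test={test_share:.4f}")
```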


# Loading tokenizer and encoding our data

In [20]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB

In [21]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [22]:
# Load the WordPiece tokenizer that matches the pretrained bert-base-uncased checkpoint
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)


In [24]:
# Convert each review to token ids and an attention mask with batch_encode_plus()
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Review.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # pad every review to max_length (replaces the deprecated pad_to_max_length=True)
    truncation=True,        # explicitly cut reviews longer than max_length
    max_length=256,
    return_tensors='pt'
)

encoded_data_test = tokenizer.batch_encode_plus(
    df[df.data_type=='test'].Review.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

# Split the encodings into the tensors BERT expects as input
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].labels.values)

input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(df[df.data_type=='test'].labels.values)


In [25]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)

In [26]:
len(dataset_train)

10200

In [27]:
len(dataset_test)

1800
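
To make the encoding step more concrete, here is a dependency-free sketch of what padding to a fixed `max_length` produces: a list of token ids plus an attention mask that marks real tokens with 1 and padding with 0. The helper `pad_or_truncate` and the example ids are hypothetical. It is a simplification: the real tokenizer truncates before adding special tokens, so it keeps the closing `[SEP]`, which this naive cut does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # cut sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks padding positions the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids; 101 and 102 stand in for [CLS] and [SEP].
input_ids, attention_mask = pad_or_truncate([101, 2307, 3309, 102], max_length=8)
print(input_ids)
print(attention_mask)
```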

# Setting up the BERT Pretrained Model

In [28]:
from transformers import BertForSequenceClassification

In [29]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = len(emotion_dict),
    output_attentions=False,
    output_hidden_states=False
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# Creating Data Loaders

In [30]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [31]:
batch_size = 16

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Evaluation does not need shuffling, so iterate the test set in order
dataloader_test = DataLoader(
    dataset_test,
    sampler=SequentialSampler(dataset_test),
    batch_size=batch_size
)
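
The number of batches per epoch follows directly from the dataset sizes and the batch size. A small sketch of that arithmetic, assuming the default `DataLoader` behaviour of keeping the last, smaller batch:

```python
import math

batch_size = 16
n_train, n_test = 10200, 1800  # dataset sizes reported above

# The last batch may be smaller than batch_size, so round up.
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
print(train_batches, test_batches)
```

These are the values that `len(dataloader_train)` and `len(dataloader_test)` return.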

# Setting up optimizer and scheduler

In [32]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [33]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    eps=1e-8
)

In [34]:
epochs = 10
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train)*epochs
)
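
The sketch below illustrates the shape of a linear schedule with warmup: the learning rate ramps up over the warmup steps, then decays linearly to zero at the last training step. `linear_schedule_factor` is a hypothetical stand-in written for illustration, not the library function itself, and the step count assumes 638 batches per epoch.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 638 * 10  # assumes 638 batches per epoch (10200 reviews / 16, rounded up)

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With zero warmup steps the rate starts at its full value. After two of the ten planned epochs it would have decayed to about 80% of `base_lr`.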

# Performance Metrics

In [35]:
import numpy as np

In [36]:
from sklearn.metrics import f1_score

In [37]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [54]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in emotion_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}')
        percent = (len(y_preds[y_preds==label])/(len(y_true)) * 100)
        print("Accuracy%:", percent, "\n")
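
Both metric helpers reduce each row of logits to a class index with `np.argmax`. The dependency-free sketch below shows that step on made-up logits and, for comparison, the softmax probabilities behind them. The helper names and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for three reviews (columns: class 0, class 1).
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
toy_preds = [argmax(row) for row in toy_logits]
toy_probs = [softmax(row) for row in toy_logits]
print(toy_preds)
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits gives the same prediction as taking it over the probabilities.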

# Creating our training loop

In [39]:
import random
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [40]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [41]:
def evaluate(dataloader_test):
    model.eval()
    loss_test_total = 0
    predictions, true_test = [], []
    
    for batch in dataloader_test:
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2].long()}
        
        with torch.no_grad():
            outputs = model(**inputs)
        
        loss = outputs[0]
        logits = outputs[1]
        loss_test_total += loss.item()
        
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_test.append(label_ids)
        
    loss_test_avg = loss_test_total/len(dataloader_test)
    
    predictions = np.concatenate(predictions, axis=0)
    true_test = np.concatenate(true_test, axis=0)
    
    return loss_test_avg, predictions, true_test

In [42]:
for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)

    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2].long()}

        outputs = model(**inputs)

        # outputs[0] is the loss, already averaged over the batch
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        # Clip the gradient norm to 1.0 to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})

    # Save a checkpoint after every epoch
    torch.save(model.state_dict(), f'/content/gdrive/My Drive/Models/SentiBERT_ft_epoch{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    test_loss, predictions, true_test = evaluate(dataloader_test)
    test_f1 = f1_score_func(predictions, true_test)
    tqdm.write(f'Validation loss: {test_loss}')
    tqdm.write(f'F1 score (weighted): {test_f1}')

Epoch 1
Training loss: 0.18100654756574422
Validation loss: 0.15944434995804213
F1 score (weighted): 0.9473530026676407

Epoch 2
Training loss: 0.100786598177399
Validation loss: 0.12572388130083548
F1 score (weighted): 0.969408605129098

KeyboardInterrupt: ignored

# Loading and evaluating the model

In [43]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = len(emotion_dict),
    output_attentions=False,
    output_hidden_states=False
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [44]:
model.to(device)
pass

In [45]:
model.load_state_dict(torch.load('/content/gdrive/My Drive/Models/SentiBERT_ft_epoch2.model'))

<All keys matched successfully>

In [46]:
_, predictions, true_test = evaluate(dataloader_test)

In [55]:
accuracy_per_class(predictions,true_test)

Class: 0
Accuracy: 1015/1036
Accuracy%: 97.97297297297297 

Class: 1
Accuracy: 730/764
Accuracy%: 95.54973821989529 
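
The per-class counts above are enough to recover the overall accuracy and the weighted F1 score by hand. A minimal sketch, assuming only the two `correct/total` pairs printed by `accuracy_per_class`; the variable names are illustrative.

```python
# Per-class results copied from the output above: label -> (correct, total).
per_class = {0: (1015, 1036), 1: (730, 764)}

correct = sum(c for c, _ in per_class.values())
total = sum(n for _, n in per_class.values())
accuracy = correct / total

# With two classes, every miss on one class is a false positive for the other.
missed = {label: n - c for label, (c, n) in per_class.items()}
f1 = {}
for label, (tp, n) in per_class.items():
    fn = missed[label]
    fp = missed[1 - label]
    f1[label] = 2 * tp / (2 * tp + fp + fn)

# Weight each class F1 by its number of true examples (its support).
weighted_f1 = sum(f1[label] * n for label, (_, n) in per_class.items()) / total
print(f"accuracy = {accuracy:.4f}, weighted F1 = {weighted_f1:.4f}")
```

The weighted F1 computed this way agrees with the value logged after epoch 2 during training, as expected, since the epoch 2 checkpoint is the one reloaded here.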

