# Sentiment Analysis in Course Review using DistilBERT Transfer Learning

## Background
In education companies, student feedback is a key input for improving the quality of learning. This feedback usually arrives as written text, which carries a range of insights worth exploring further.

From text data we can extract sentiment, that is, determine whether a piece of feedback is positive, neutral, or negative. When the volume of feedback is large, however, checking the sentiment of each item by hand becomes impractical. An automated process, sentiment analysis, is therefore needed to obtain the sentiment of each text.

## Install Package

In [1]:
# !pip install nlpaug

In [2]:
!pip install transformers


## Set Device

In [3]:
# check whether a GPU is available on this machine
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
# otherwise fall back to the CPU
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 2 GPU(s) available.
Device name: Tesla T4


## Import Packages

In [4]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import torch
import transformers
from transformers import DistilBertModel, DistilBertTokenizer
from torch.optim import AdamW

import os
import nltk
nltk.download("punkt")

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)



[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<torch._C.Generator at 0x7fc8d8d67d50>

## Load Dataset
This notebook uses the 100K Coursera's Course Reviews dataset, which was scraped from the Coursera website.

In [5]:
df = pd.read_csv("/kaggle/input/100k-courseras-course-reviews-dataset/reviews.csv")
df.head()

Unnamed: 0,Id,Review,Label
0,0,good and interesting,5
1,1,"This class is very helpful to me. Currently, I...",5
2,2,like!Prof and TAs are helpful and the discussi...,5
3,3,Easy to follow and includes a lot basic and im...,5
4,4,Really nice teacher!I could got the point eazl...,4


In [6]:
# shift the 1-5 star ratings to 0-4 class indices
df['Label'] = df['Label'].astype("category")
df['Label'] = df['Label'].cat.rename_categories(
    {
        5 : 4, 4 : 3, 3 : 2, 2: 1, 1: 0
    }
)

In [7]:
df.head()

Unnamed: 0,Id,Review,Label
0,0,good and interesting,4
1,1,"This class is very helpful to me. Currently, I...",4
2,2,like!Prof and TAs are helpful and the discussi...,4
3,3,Easy to follow and includes a lot basic and im...,4
4,4,Really nice teacher!I could got the point eazl...,3


## Text Analysis (EDA)
This section performs exploratory data analysis (EDA) to draw insights from the dataset.

### Class Distribution
Here we check the class distribution by counting the number of reviews that belong to each class.

In [8]:
# check class distribution
df["Label"].value_counts()

4    79173
3    18054
2     5071
0     2469
1     2251
Name: Label, dtype: int64
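
The counts above are heavily skewed toward the top class. As a small sketch (the dictionary simply restates the printed counts, and the variable names are illustrative), the share of each class and the ratio between the largest and smallest class can be computed like this:

```python
# Class counts copied from the value_counts() output above (labels 0-4).
counts = {4: 79173, 3: 18054, 2: 5071, 0: 2469, 1: 2251}

total = sum(counts.values())

# Fraction of all reviews that falls in each class.
shares = {label: n / total for label, n in counts.items()}

for label in sorted(shares):
    print(f"class {label}: {shares[label]:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest class: {imbalance_ratio:.1f}x")
```

An imbalance of this size is one reason to look at per-class metrics, not accuracy alone, when evaluating a classifier on this data.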

### Checking Sentence Length
Here we check sentence length by counting the number of words (NLTK tokens) in each review.

In [9]:
# count the word tokens in each review
from nltk.tokenize import word_tokenize

df['length_sen'] = df['Review'].apply(lambda x : len(word_tokenize(x)))

In [10]:
df.head()

Unnamed: 0,Id,Review,Label,length_sen
0,0,good and interesting,4,3
1,1,"This class is very helpful to me. Currently, I...",4,26
2,2,like!Prof and TAs are helpful and the discussi...,4,21
3,3,Easy to follow and includes a lot basic and im...,4,15
4,4,Really nice teacher!I could got the point eazl...,3,13


The maximum review length in the dataset is shown below.

In [11]:
# max length
max_len = max(df['length_sen'])
print("Max Length : ", max_len)

Max Length :  1461
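
The longest review is well above the 512 positions that the DistilBERT checkpoint used later can accept, so it helps to know how many reviews a given cap would cut. The sketch below uses a toy list in place of `df['length_sen']`. The helper names are hypothetical, and word counts are only a rough proxy for the WordPiece token counts the tokenizer produces.

```python
# Toy word counts standing in for df['length_sen'] (illustrative values only).
lengths = [3, 26, 21, 15, 13, 48, 120, 7, 640, 1461]

def share_longer_than(lengths, limit):
    """Return the fraction of entries strictly longer than `limit`."""
    return sum(1 for n in lengths if n > limit) / len(lengths)

def percentile(lengths, q):
    """Nearest-rank percentile (q in 0-100) of a list of lengths."""
    ordered = sorted(lengths)
    rank = max(1, -(-q * len(ordered) // 100))  # ceil(q * n / 100), at least 1
    return ordered[rank - 1]

print(share_longer_than(lengths, 512))
print(percentile(lengths, 90))
```

If only a small share of reviews exceeds the cap, truncating them loses little information while keeping the padded batches small.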


### Text Preprocessing

In [12]:
# Lowercase
df['Review'] = df['Review'].str.lower()

### Data Splitting

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["Review"], df["Label"], test_size=0.2, random_state=42, stratify = df["Label"])
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify = y_test)

In [14]:
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)

(85614,)
(10702,)
(10702,)
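
The three shapes above correspond to an 80/10/10 split. As a worked check, the sketch below reproduces them from the total row count. It assumes that `train_test_split` rounds the held-out size up, which is my understanding of scikit-learn's behavior and not something shown in this notebook.

```python
import math

n_total = 107018  # total rows, the sum of the class counts shown earlier

# First split: test_size=0.2 holds out ceil(0.2 * n) rows (assumed rounding rule).
n_holdout = math.ceil(0.2 * n_total)
n_train = n_total - n_holdout

# Second split: the held-out rows are divided in half (test_size=0.5).
n_test = math.ceil(0.5 * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```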


In [15]:
y_train.value_counts()

4    63338
3    14443
2     4057
0     1975
1     1801
Name: Label, dtype: int64

In [16]:
y_test.value_counts()

4    7917
3    1806
2     507
0     247
1     225
Name: Label, dtype: int64

In [17]:
y_val.value_counts()

4    7918
3    1805
2     507
0     247
1     225
Name: Label, dtype: int64

## Build BERT Model

- TODO: add an explanation of the BERT model (what is behind it, the basic algorithm)
- TODO: add an explanation of the PyTorch framework

In [18]:
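# Load the pretrained DistilBERT tokenizer and encoder.
# Note: distilbert-base-cased is an English, case-sensitive checkpoint, not a multilingual one;
# since the reviews were lowercased above, an uncased checkpoint may be a better fit.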

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertModel.from_pretrained('distilbert-base-cased')

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### BERT Data Loader

In [19]:
# dataset class that loads and tokenizes the reviews
from torch.utils.data import Dataset

class ReviewDataset(Dataset):
  def __init__(self, reviews, labels, max_len, tokenizer):
    self.reviews = reviews.reset_index()["Review"]
    self.labels = labels.reset_index()["Label"]
    self.max_len = max_len
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.labels)
  
  def __getitem__(self, idx):
    reviews = self.reviews[idx]
    labels = self.labels[idx]

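    encoding = self.tokenizer.encode_plus(
      reviews,
      add_special_tokens=True,      # add [CLS] and [SEP]
      max_length=self.max_len,      # cap on the sequence length in tokens
      padding='max_length',         # replaces the deprecated pad_to_max_length=True
      truncation=True,              # cut reviews longer than max_length
      return_attention_mask=True,   # 1 for real tokens, 0 for padding
      return_tensors='pt'           # return PyTorch tensors
    )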

    return {
        'review': reviews,
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten(),
        'label': torch.tensor(labels, dtype=torch.long)
    }
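
To make the roles of `max_length`, padding, truncation, and the attention mask concrete, here is a simplified sketch in plain Python. The function `pad_or_truncate` and the token ids are hypothetical. Unlike the real tokenizer, this version does not re-append the closing special token after truncation.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len."""
    # Truncation: keep at most max_len tokens.
    kept = token_ids[:max_len]
    # Padding: fill the remaining positions with pad_id.
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # Attention mask: 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Hypothetical token ids, not produced by the real tokenizer.
short_ids, short_mask = pad_or_truncate([101, 1363, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item ends up with the same length, the default collate function of `DataLoader` can stack the items into a single batch tensor.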

In [20]:
train_data = ReviewDataset(X_train,y_train, 512, tokenizer)
val_data = ReviewDataset(X_val,y_val, 512, tokenizer)
test_data = ReviewDataset(X_test,y_test, 512, tokenizer)

In [21]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
# shuffling is only needed for training; keep evaluation order fixed
val_dataloader = DataLoader(val_data, batch_size=32, shuffle=False)
test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

### BERT Transfer Learning

In [22]:
from torch import nn
class Bert_TL(nn.Module):
  def __init__(self,bert):
    super(Bert_TL, self).__init__()
    self.bert = bert
    self.fc1 = nn.Linear(768,512)
    self.relu = nn.ReLU()
    self.fc2 = nn.Linear(512,5)
  
  def forward(self, input_ids, attention_mask):
    hidden = self.bert(input_ids = input_ids, attention_mask = attention_mask)
    x = self.fc1(hidden[0][:, 0, :])
    x = self.relu(x)
    x = self.fc2(x)
    
    return x
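
The expression `hidden[0][:, 0, :]` in `forward` keeps only the vector at position 0 (the `[CLS]` token) of each sequence and feeds it to the classification head. The sketch below shows the same indexing on a random NumPy array standing in for the encoder output; the small batch and sequence sizes are arbitrary choices for illustration.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 4, 768

# Stand-in for hidden[0], the encoder output of shape (batch, seq_len, hidden_size).
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Same indexing as in forward(): keep position 0 of every sequence.
cls_vectors = last_hidden_state[:, 0, :]

print(cls_vectors.shape)
```

The resulting width of 768 per review is why `fc1` is defined as `nn.Linear(768, 512)`.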

In [23]:
model = Bert_TL(model)
model.to(device)  # use the device selected earlier instead of hard-coding "cuda"

Bert_TL(
  (bert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_featu

In [24]:
from sklearn.utils.class_weight import compute_class_weight

#compute the class weights
class_weights = compute_class_weight(class_weight = 'balanced', 
                                     classes = np.unique(y_train), 
                                     y = y_train)

class_weights_t = torch.tensor(class_weights, dtype=torch.float)

class_weights_t = class_weights_t.to(device)

print("Class Weights:",class_weights)
print(np.unique(y_train))

Class Weights: [8.66977215 9.50738479 4.22055706 1.18554317 0.27034008]
[0 1 2 3 4]
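
The printed weights can be reproduced by hand. As a sketch, the 'balanced' heuristic divides the number of samples by the number of classes times the class count; the dictionary below restates the training-set counts shown earlier, and the variable names are illustrative.

```python
# Training-set class counts, copied from y_train.value_counts() above.
train_counts = {0: 1975, 1: 1801, 2: 4057, 3: 14443, 4: 63338}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# 'balanced' heuristic: n_samples / (n_classes * count of the class).
weights = {c: n_samples / (n_classes * n) for c, n in train_counts.items()}

for c in sorted(weights):
    print(c, round(weights[c], 4))
```

Rare classes get weights well above 1 and the dominant class gets a weight below 1, so each class contributes roughly equally to the weighted loss.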


In [25]:
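# Set hyperparameters
learning_rate = 1e-3

# Freeze the DistilBERT encoder so that only the classification head is trained
for param in model.bert.parameters():
    param.requires_grad = False

# Class-weighted loss, and an optimizer over the trainable parameters only
loss_fn = nn.CrossEntropyLoss(weight=class_weights_t).to(device)
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=learning_rate)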
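
To illustrate what the class weights do inside the loss, here is a plain-Python sketch of a weighted cross-entropy. It follows my understanding of how `nn.CrossEntropyLoss` behaves with `weight` set and the default mean reduction, namely that the weighted per-sample losses are divided by the sum of the weights of the targets. The function name and the toy numbers are hypothetical.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, y in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_norm - row[y]
        weighted_sum += class_weights[y] * nll
        weight_total += class_weights[y]
    # Normalise by the sum of the weights of the targets in the batch.
    return weighted_sum / weight_total

# Toy batch: two samples, three classes (illustrative numbers).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
upweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0, 3.0])
print(uniform, upweighted)
```

In the toy batch the second sample is the poorly predicted one, so raising the weight of its class pulls the overall loss toward that sample's loss.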

In [26]:
def train_loop(
  data_loader, 
  model, 
  loss_fn, 
  optimizer, 
  device, 
  n_examples
):
  # put the model in training mode
  model = model.train()

  losses = []
  correct_predictions = 0
  
  # iterate over the batches
  for d in data_loader:

    # move the input ids, attention mask, and targets of the batch to the device
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["label"].to(device)

    # forward pass
    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )

    # predicted class = index of the largest logit
    _, preds = torch.max(outputs, dim=1)

    # compute the loss
    loss = loss_fn(outputs, targets)

    # count the correct predictions
    correct_predictions += torch.sum(preds == targets)

    # record the batch loss
    losses.append(loss.item())

    loss.backward()

    # clip the gradient norm to guard against exploding gradients
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()

  # return the accuracy and the mean loss
  return correct_predictions.double() / n_examples, np.mean(losses)


def test_loop(model, data_loader, loss_fn, device, n_examples):

  # put the model in evaluation mode
  model = model.eval()

  # list of batch losses and a counter for the correct predictions
  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      # move the input ids, attention mask, and targets of the batch to the device
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["label"].to(device)

      # forward pass
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )

      # predicted class = index of the largest logit
      _, preds = torch.max(outputs, dim=1)
      
      # compute the loss from the outputs and targets
      loss = loss_fn(outputs, targets)

      # count the correct predictions
      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  # return the accuracy and the mean loss
  return correct_predictions.double() / n_examples, np.mean(losses)
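
The call to `nn.utils.clip_grad_norm_` in `train_loop` rescales all gradients together whenever their joint L2 norm exceeds `max_norm`. The sketch below shows that idea in plain Python on toy numbers. The function name `clip_by_global_norm` and the small `eps` term are assumptions made for illustration, not a copy of the PyTorch implementation.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The scale factor is capped at 1, so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Toy gradients for two parameter tensors (illustrative numbers).
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Clipping bounds the size of a single update, which protects against exploding gradients; it does nothing to enlarge gradients that are already very small.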

In [27]:
# loss_fn = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 2
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_acc, train_loss = train_loop(train_dataloader, model, loss_fn, optimizer,device,len(X_train))
    # evaluate the model on the validation set
    val_acc, val_loss = test_loop(
      model,
      val_dataloader,
      loss_fn, 
      device, 
      len(X_val)
    )

    # print the train and validation loss and accuracy
    print(f'Train loss {train_loss} accuracy {train_acc}')
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print("Done!")

Epoch 1
-------------------------------




Train loss 1.2303594723789324 accuracy 0.6182867288060364
Val loss 1.160517922977903 accuracy 0.6810876471687535
Done!
Epoch 2
-------------------------------
Train loss 1.1725525234802243 accuracy 0.6434228046814774
Val loss 1.1602736528240034 accuracy 0.5584002990095309
Done!


### Prediction with Test Set

In [31]:
# prediction with test set
def get_predictions(model, data_loader):
  model = model.eval()

  predictions = []
  real_values = []

  with torch.no_grad():
    for d in data_loader:

      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["label"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      predictions.extend(preds)
      real_values.extend(targets)

  predictions = torch.stack(predictions).cpu()
  real_values = torch.stack(real_values).cpu()
  return predictions, real_values

In [34]:
from sklearn.metrics import classification_report
y_pred, y_test = get_predictions(model, test_dataloader)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.51      0.32      0.39       247
           1       0.27      0.31      0.29       225
           2       0.24      0.55      0.34       507
           3       0.22      0.51      0.31      1806
           4       0.92      0.58      0.71      7917

    accuracy                           0.56     10702
   macro avg       0.43      0.45      0.41     10702
weighted avg       0.75      0.56      0.61     10702
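
The two summary rows differ only in how the per-class scores are averaged: the macro average treats every class equally, while the weighted average weights each class by its support. As a worked check, the sketch below recomputes both from the per-class rows of the report. The inputs are the rounded values printed above, so the results are approximate.

```python
# Per-class (precision, recall, f1, support), copied from the report above.
report = {
    0: (0.51, 0.32, 0.39, 247),
    1: (0.27, 0.31, 0.29, 225),
    2: (0.24, 0.55, 0.34, 507),
    3: (0.22, 0.51, 0.31, 1806),
    4: (0.92, 0.58, 0.71, 7917),
}

total_support = sum(row[3] for row in report.values())

def macro_avg(idx):
    """Unweighted mean of one metric over the classes."""
    return sum(row[idx] for row in report.values()) / len(report)

def weighted_avg(idx):
    """Mean of one metric weighted by class support."""
    return sum(row[idx] * row[3] for row in report.values()) / total_support

for name, idx in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name}: macro={macro_avg(idx):.2f} weighted={weighted_avg(idx):.2f}")
```

The gap between the two averages reflects the class imbalance: class 4 dominates the weighted scores, while the macro scores expose the weaker results on the minority classes.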

