<a href="https://colab.research.google.com/github/nadyadtm/Sentiment-Analysis-of-Course-Review-using-DistilBERT-Transfer-Learning/blob/main/distilbert_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis in Course Review using DistilBERT Transfer Learning

## Background
In an education company, student feedback is an important input for improving the quality of learning. This feedback usually comes as written text, which carries a range of insights worth exploring further.

From text data we can extract sentiment, that is, whether a piece of feedback is positive, neutral, or negative. When the volume of feedback is large, however, checking the sentiment of each item by hand becomes impractical. An automated process is therefore needed to obtain the sentiment of the text, namely sentiment analysis.

## Install Package

In [1]:
# !pip install nlpaug

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


## Set Device

In [3]:
# Check whether a GPU is available on this machine
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
# Otherwise fall back to the CPU
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


## Import Packages

In [4]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import torch
import transformers
from transformers import DistilBertModel, DistilBertTokenizer
from torch.optim import AdamW

import os
import nltk
nltk.download("punkt")

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


<torch._C.Generator at 0x7f9a7c0f5f50>

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load Dataset
This notebook uses the 100K Coursera's Course Reviews dataset, which was scraped from the Coursera website.

In [6]:
df = pd.read_csv("/content/drive/MyDrive/Trial Project/reviews.csv")
df.head()

Unnamed: 0,Id,Review,Label
0,0,good and interesting,5
1,1,"This class is very helpful to me. Currently, I...",5
2,2,like!Prof and TAs are helpful and the discussi...,5
3,3,Easy to follow and includes a lot basic and im...,5
4,4,Really nice teacher!I could got the point eazl...,4


## Text Analysis (EDA)
This section performs exploratory data analysis (EDA) to gain insight into the dataset.

### Class Distribution
This section checks the class distribution by counting the number of reviews that belong to each class.

In [7]:
# check class distribution
import seaborn as sns
df["Label"].value_counts()

5    79173
4    18054
3     5071
1     2469
2     2251
Name: Label, dtype: int64
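
To make the imbalance concrete, the counts printed above can be turned into shares. The following is a minimal sketch in plain Python; the `counts` dictionary is copied by hand from the output and is not a variable in the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {5: 79173, 4: 18054, 3: 5071, 1: 2469, 2: 2251}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label in sorted(counts):
    print(f"label {label}: {counts[label]:>6d} reviews ({shares[label]:.1%})")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Label 5 alone covers roughly three quarters of the reviews, which is the motivation for rebalancing the classes before training.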

### Downsampling
Here we try downsampling labels 5 and 4, because they contain far more reviews than the other classes.

In [8]:
df_1_3 = df[(df["Label"]!=4) & (df["Label"]!=5)]
df_4 = df[df["Label"]==4]
df_5 = df[df["Label"]==5]

In [9]:
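from sklearn.utils import resample

# Downsample ratings 4 and 5 to the size of rating 3.
# replace=False draws without replacement, so no review is duplicated.
n_target = len(df_1_3[df_1_3["Label"]==3])
rat4_downsample = resample(df_4,
             replace=False,
             n_samples=n_target,
             random_state=42)

rat5_downsample = resample(df_5,
             replace=False,
             n_samples=n_target,
             random_state=42)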

In [10]:
df = pd.concat([df_1_3,rat4_downsample,rat5_downsample])
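
Downsampling usually draws without replacement, so that the reduced class contains no duplicated rows. The idea can be sketched with the standard library alone; the toy `rows` list below is hypothetical and stands in for the review DataFrame, and `random.sample` stands in for `sklearn.utils.resample`.

```python
import random

random.seed(42)

# Hypothetical toy data: (review_id, label) pairs with a dominant label 5.
rows = [(i, 5) for i in range(100)] + [(100 + i, 3) for i in range(10)]

majority = [r for r in rows if r[1] == 5]
minority = [r for r in rows if r[1] == 3]

# Draw as many majority rows as there are minority rows,
# without replacement, so no row is picked twice.
majority_down = random.sample(majority, k=len(minority))

balanced = minority + majority_down
ids = [r[0] for r in majority_down]
print(len(balanced), len(set(ids)) == len(ids))
```

Avoiding duplicates matters here because a duplicated review could otherwise end up in both the training and the test split.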

### Checking Sentence Length
This section checks sentence length by counting the number of words in each review.

In [11]:
# check len sentence
from nltk.tokenize import word_tokenize

df['length_sen'] = df['Review'].apply(lambda x : len(word_tokenize(x)))

In [12]:
df.head()

Unnamed: 0,Id,Review,Label,length_sen
7,7,I was disappointed because the name is mislead...,3,69
13,13,"Good content, but the course setting does (at ...",3,27
17,17,This course does not say anything about digiti...,2,18
19,19,"The course content is quite good, though it co...",3,32
48,48,I'll start by saying that this course gives a ...,3,122


The maximum review length in the dataset is shown below.

In [13]:
# max length
max_len = max(df['length_sen'])
print("Max Length : ", max(df['length_sen']))

Max Length :  1314
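
The longest review has 1314 words, well above the sequence length of 512 that is passed to `ReviewDataset` later in the notebook, so some reviews will be truncated. A rough way to see how many are affected is to count the reviews over the limit. The sketch below uses hypothetical word counts rather than the real `length_sen` column, and it compares words, whereas the tokenizer counts subword tokens, so the true share is likely higher.

```python
# Hypothetical word counts for a handful of reviews (not the real dataset).
lengths = [3, 12, 18, 27, 32, 69, 122, 540, 1314]
limit = 512  # sequence length used when the datasets are built

too_long = [n for n in lengths if n > limit]
share = len(too_long) / len(lengths)
print(f"{len(too_long)} of {len(lengths)} reviews exceed {limit} words ({share:.0%})")
```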


### Text Preprocessing

In [14]:
# Lowercase
df['Review'] = df['Review'].apply(lambda x : str.lower(x))

### Data Splitting

In [15]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["Review"], df["Label"], test_size=0.2, random_state=42, stratify = df["Label"])
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify = y_test)

In [16]:
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)

(15946,)
(1994,)
(1993,)
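
The printed shapes correspond to an approximate 80/10/10 split. The arithmetic can be reproduced by hand, assuming the held-out size is rounded up at each split, which matches the numbers shown above. The class sizes come from the earlier `value_counts()` output and the downsampling target of 5071 rows.

```python
import math

# Labels 1 and 2, plus three classes of 5071 rows after downsampling.
total = 2469 + 2251 + 3 * 5071

# First split: test_size=0.2, with the held-out size rounded up.
held_out = math.ceil(0.2 * total)
n_train = total - held_out

# Second split: test_size=0.5 of the held-out rows.
n_test = math.ceil(0.5 * held_out)
n_val = held_out - n_test

print(n_train, n_test, n_val)
```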


In [17]:
y_train.value_counts()

3    4057
4    4057
5    4056
1    1975
2    1801
Name: Label, dtype: int64

In [18]:
y_test.value_counts()

5    508
4    507
3    507
1    247
2    225
Name: Label, dtype: int64

In [19]:
y_val.value_counts()

5    507
4    507
3    507
1    247
2    225
Name: Label, dtype: int64

## Build BERT Model

- TODO: add an explanation of the BERT model (what is behind it, the basic algorithm)
- TODO: add an explanation of the PyTorch framework

In [20]:
# Load the DistilBERT tokenizer and model; the multilingual checkpoint is chosen because the data contains more than one language

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
model = DistilBertModel.from_pretrained('distilbert-base-multilingual-cased')

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### BERT Data Loader

In [21]:
# Dataset class for loading and tokenizing the reviews
from torch.utils.data import Dataset

class ReviewDataset(Dataset):
  def __init__(self, reviews, labels, max_len, tokenizer):
    # Reset the index so that items can be addressed by position 0..n-1
    self.reviews = reviews.reset_index()["Review"]
    self.labels = labels.reset_index()["Label"]
    self.max_len = max_len
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    reviews = self.reviews[idx]
    labels = self.labels[idx]

    # padding='max_length' replaces the deprecated pad_to_max_length=True
    encoding = self.tokenizer.encode_plus(
      reviews,
      add_special_tokens=True,
      max_length=self.max_len,
      padding='max_length',
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt'
    )

    return {
        'review': reviews,
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten(),
        'label': torch.tensor(labels, dtype=torch.long)
    }

In [22]:
train_data = ReviewDataset(X_train,y_train, 512, tokenizer)
val_data = ReviewDataset(X_val,y_val, 512, tokenizer)
test_data = ReviewDataset(X_test,y_test, 512, tokenizer)
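
Every review is brought to the same length of 512 by truncating long inputs and padding short ones, and the attention mask records which positions hold real tokens. The following is a simplified sketch of that idea in plain Python. The helper `pad_or_truncate` and the token ids are hypothetical, and the sketch ignores details of the real tokenizer, such as keeping the closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len."""
    ids = token_ids[:max_len]        # truncation
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_len - len(ids)     # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids; real ids come from the tokenizer.
short_ids, short_mask = pad_or_truncate([101, 2204, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```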

In [23]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=32, shuffle=True)

### Bert Transfer Learning

In [24]:
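from torch import nn

class Bert_TL(nn.Module):
  """DistilBERT encoder followed by a small classification head."""
  def __init__(self,bert):
    super(Bert_TL, self).__init__()
    self.bert = bert
    self.fc1 = nn.Linear(768,512)  # 768 is the DistilBERT hidden size
    self.relu = nn.ReLU()
    # 6 outputs: labels 1-5 are used directly as class indices, so index 0 is never a target
    self.fc2 = nn.Linear(512,6)

    self.softmax = nn.LogSoftmax(dim=1)

  def forward(self, input_ids, attention_mask):
    hidden = self.bert(input_ids = input_ids, attention_mask = attention_mask)
    # hidden[0] is the last hidden state; position 0 is the [CLS] token
    x = self.fc1(hidden[0][:, 0, :])
    x = self.relu(x)
    x = self.fc2(x)

    # Log-probabilities over the classes, as expected by NLLLoss
    x = self.softmax(x)

    return x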
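
The final layer has six outputs while the dataset has five ratings (1 to 5). Because the ratings are used directly as class indices, output index 0 never appears as a target. One alternative, not used in this notebook, is to shift the ratings to a zero-based encoding so that the head needs only five outputs. The mapping names below are hypothetical.

```python
# Star ratings as they appear in the Label column.
ratings = [1, 2, 3, 4, 5]

# Hypothetical zero-based encoding: rating r becomes class index r - 1.
to_index = {r: r - 1 for r in ratings}
to_rating = {i: r for r, i in to_index.items()}

num_classes = len(to_index)
print(num_classes, to_index[5], to_rating[0])
```

With such an encoding, predictions would have to be mapped back through `to_rating` before being reported as star ratings.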

In [25]:
model = Bert_TL(model)
model.to(device)

Bert_TL(
  (bert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_feat

In [26]:
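# Initialize the hyperparameters
learning_rate = 1e-3

# Initialize the loss function and the optimizer
# (NLLLoss expects log-probabilities, which the model's LogSoftmax layer provides)
loss_fn = nn.NLLLoss().to(device)
optimizer = AdamW(model.parameters(), lr=learning_rate)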
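
The model ends in `LogSoftmax` and is trained with `NLLLoss`. Together these two steps compute the same quantity as the cross-entropy on the raw scores. A minimal sketch in plain Python, with hypothetical scores for three classes and hand-written helper functions in place of the PyTorch modules:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

logits = [2.0, 0.5, -1.0]  # hypothetical raw scores for three classes
target = 0

log_probs = log_softmax(logits)
loss = nll(log_probs, target)

# Cross-entropy computed directly from the probabilities, for comparison.
denom = sum(math.exp(v) for v in logits)
probs = [math.exp(x) / denom for x in logits]
cross_entropy = -math.log(probs[target])

print(abs(loss - cross_entropy) < 1e-9)
```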

In [27]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, data in enumerate(dataloader):
        # Compute prediction and loss

        input_ids = data["input_ids"].to(device)
        attention_mask = data["attention_mask"].to(device)
        labels = data["label"].to(device)

        pred = model(input_ids, attention_mask)
        # _, preds = torch.max(pred, dim=1)
        loss = loss_fn(pred, labels)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            # data is a dict, so len(data) would count its keys; use the batch tensor instead
            loss, current = loss.item(), (batch + 1) * len(input_ids)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


# def test_loop(dataloader, model, loss_fn):
#     size = len(dataloader.dataset)
#     num_batches = len(dataloader)
#     test_loss, correct = 0, 0

#     with torch.no_grad():
#         for batch, data in enumerate(dataloader):
#           input_ids = data["input_ids"].to(device)
#           attention_mask = data["attention_mask"].to(device)
#           labels = data["label"].to(device)
          
#           pred = model(input_ids, attention_mask)
#           test_loss += loss_fn(pred, labels).item()
#           correct += (pred.argmax(1) == labels).type(torch.float).sum().item()

#     test_loss /= num_batches
#     correct /= size
#     print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")


def test_loop(model, data_loader, loss_fn, device, n_examples):

  # Switch the model to evaluation mode
  model = model.eval()

  # Per-batch losses and the running count of correct predictions
  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      # Get the input ids, attention mask, and targets for the batch
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["label"].to(device)

      # Run the model on the batch
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )

      # Take the predicted class
      _, preds = torch.max(outputs, dim=1)

      # Compute the loss from the outputs and the targets
      loss = loss_fn(outputs, targets)

      # Accumulate the number of correct predictions
      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  # Return the accuracy and the mean loss
  return correct_predictions.double() / n_examples, np.mean(losses)
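
`test_loop` sums the correct predictions over all batches and divides by `n_examples`, rather than averaging a per-batch accuracy. The two differ when the last batch is smaller than the others. A minimal sketch with hypothetical predictions and targets:

```python
# Hypothetical predictions and targets for two batches of unequal size.
batches = [
    ([5, 4, 3, 3], [5, 4, 3, 1]),  # 3 of 4 correct
    ([2, 5], [2, 4]),              # 1 of 2 correct
]

correct = 0
n_examples = 0
per_batch = []
for preds, targets in batches:
    hits = sum(p == t for p, t in zip(preds, targets))
    correct += hits
    n_examples += len(targets)
    per_batch.append(hits / len(targets))

accuracy = correct / n_examples                    # weights every example equally
mean_batch_accuracy = sum(per_batch) / len(per_batch)  # weights every batch equally
print(accuracy, mean_batch_accuracy)
```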

In [None]:
# loss_fn = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    # Evaluate the model on the validation set
    val_acc, val_loss = test_loop(
      model,
      val_dataloader,
      loss_fn,
      device,
      len(X_val)
    )

    # Print the validation loss and accuracy
    print(f'Val loss {val_loss} accuracy {val_acc}')

print("Done!")

In [None]:
for batch,data in enumerate(train_dataloader):
  input_ids = data["input_ids"].to(device)
  attention_mask = data["attention_mask"].to(device)

  pred = model(input_ids, attention_mask)
  labels = data["label"].to(device)