<a href="https://colab.research.google.com/github/luch91/30-DAYS-OF-AI/blob/main/Sentiment_Prediction_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Prediction Using DistilBERT

In [1]:
!pip install ktrain


Collecting ktrain
  Downloading ktrain-0.38.0.tar.gz (25.3 MB)
  Preparing metadata (setup.py) ... done
Collecting langdetect (from ktrain)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting cchardet (from ktrain)
  Downloading cchardet-2.1.7.tar.gz (653 kB)
  Preparing metadata (setup.py) ... done
Collecting syntok>1.3.3 (from ktrain)
  Downloading syntok-1.4.4-py3-none-any.whl (24 kB)
Collecting tika (from ktrain)
  Downloading tika-2.6.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting transfo

## Getting Dataset

In [2]:
!git clone https://github.com/luch91/IMDB-Movie-Reviews-Large-Dataset-50k.git

Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (10/10), 25.78 MiB | 26.67 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [3]:
# /content/IMDB-Movie-Reviews-Large-Dataset-50k

In [4]:
import pandas as pd
import numpy as np
import tensorflow as tf
import ktrain
from ktrain import text

In [5]:
# Read every column as a string so review text and labels are not type-coerced.
data_train = pd.read_excel("/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx", dtype=str)
data_test = pd.read_excel("/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx", dtype=str)

In [6]:
data_train.sample(5)

Unnamed: 0,Reviews,Sentiment
10164,"i wasn't a fan of seeing this movie at all, bu...",pos
23238,THIS POST MAY CONTAIN SPOILERS :<br /><br />Al...,pos
6009,These type of movies about young teenagers str...,pos
19818,"this movie is similar to Darkness Falls,and Th...",neg
6028,I did not like the pretentious and overrated A...,pos


In [7]:
data_test.sample(5)

Unnamed: 0,Reviews,Sentiment
4659,I sometimes enjoy really lousy movies....those...,neg
3669,"The more I watch Nicholas Cage, the more I app...",pos
17843,I revisited Grand Canyon earlier this year whe...,pos
10342,Direction must be the problem here. I recently...,neg
296,Simply put: the movie is boring. Cliché upon c...,neg
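The sampled reviews contain raw HTML line breaks (`<br /><br />`). The notebook passes them to the tokenizer unchanged; the following is a minimal sketch of one optional cleanup step, assuming you want the tags replaced by a space. The helper name `strip_br_tags` and the sample string are hypothetical and not part of the original notebook.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br_tags(review: str) -> str:
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", review)
    return re.sub(r"\s+", " ", without_tags).strip()


# Hypothetical review text, used only for illustration.
sample = "Great cast.<br /><br />Weak ending."
print(strip_br_tags(sample))
```

If this cleanup is wanted, it could be applied to the `Reviews` column of both DataFrames (for example with `Series.map`) before preprocessing.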


Print the text classifiers available in ktrain:

In [8]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) from keras_bert [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face transformers [https://arxiv.org/abs/1910.01108]


In [9]:
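# Tokenize both splits for DistilBERT; the test split doubles as validation data.
(train, val, preproc) = text.texts_from_df(
    train_df=data_train, text_column="Reviews", label_columns="Sentiment",
    val_df=data_test,
    maxlen=512,  # longer reviews are truncated to 512 tokens
    preprocess_mode="distilbert",
)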

['neg', 'pos']
   neg  pos
0  1.0  0.0
1  1.0  0.0
2  1.0  0.0
3  1.0  0.0
4  1.0  0.0
['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  1.0  0.0
3  0.0  1.0
4  1.0  0.0


preprocessing train...
language: en
train sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913
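The report above puts the 95th-percentile length at 598, which is above `maxlen=512`. If those lengths are measured in the same units as `maxlen`, more than 5% of reviews would be truncated. The sketch below shows one way to estimate that share from a list of per-review token counts. The function name `truncated_share` and the counts are hypothetical; they are not taken from the IMDB data.

```python
def truncated_share(lengths, maxlen):
    """Return the fraction of sequences that are longer than maxlen."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n > maxlen) / len(lengths)


# Hypothetical token counts, for illustration only.
token_counts = [120, 234, 480, 512, 598, 913, 150, 300, 90, 700]
print(truncated_share(token_counts, maxlen=512))  # 3 of 10 exceed 512 -> 0.3
```

A large share would be an argument for checking whether the truncated tail of long reviews carries sentiment that the model never sees.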


## Build Model

In [10]:
model = text.text_classifier(name="distilbert", train_data=train, preproc=preproc)

Is Multi-Label? False
maxlen is 512
done.


In [11]:
learner = ktrain.get_learner(model=model,
                             train_data=train,
                             val_data=val,
                             batch_size=6)

## Training the learner

In [12]:
learner.fit_onecycle(lr=2e-5, epochs=2)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7e180c1dd7b0>
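To give a feel for what a one-cycle policy does with the `lr=2e-5` passed above, here is a simplified sketch of a triangular schedule: the learning rate rises linearly to the maximum at the midpoint of training, then falls back. This is an illustration under stated assumptions, not ktrain's internal implementation. The function `one_cycle_lr`, the `div_factor` of 10, and the step count are all hypothetical, and momentum handling is omitted.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular one-cycle schedule (illustrative, not ktrain's own code).

    Starts at max_lr / div_factor, peaks at max_lr halfway through,
    and returns to max_lr / div_factor on the final step.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    base_lr = max_lr / div_factor
    half = (total_steps - 1) / 2
    # 0.0 at both ends of training, 1.0 at the midpoint.
    progress = 1.0 - abs(step - half) / half
    return base_lr + (max_lr - base_lr) * progress


total = 101  # hypothetical number of optimizer steps
for s in (0, 25, 50, 75, 100):
    print(s, one_cycle_lr(s, total, max_lr=2e-5))
```

In a real run the number of steps depends on the training set size, the batch size, and the number of epochs.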

In [13]:
predictor = ktrain.get_predictor(learner.model, preproc)

**Save Model**

In [14]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [18]:
predictor.save("/content/drive/MyDrive/distilBERT")

## Model Evaluation

In [22]:
data = ["I am yet to find a better movie than this",
        "the plot is so scattered that it hardly makes it enjoyable",
        "Well, I need to take back my recommendation",
        "I couldn't leave my seat for a second"]

In [23]:
predictor.predict(data)

['neg', 'neg', 'neg', 'neg']
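Spot-checking a few sentences is easier to interpret when the predictions are compared against expected labels. The sketch below computes plain accuracy for the four predictions shown above. The `hand_labels` list is a hypothetical annotation made for this illustration only; it is not part of the dataset or the original notebook.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction and label agree."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


predicted = ["neg", "neg", "neg", "neg"]    # output of predictor.predict(data) above
hand_labels = ["pos", "neg", "neg", "pos"]  # hypothetical annotations, for illustration
print(accuracy(hand_labels, predicted))     # 2 of 4 agree -> 0.5
```

Four sentences are far too few to judge the model; the same comparison on the full validation split would give a more meaningful figure.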

In [26]:
predictor.predict(data, return_proba=True)

array([[0.7803387 , 0.21966134],
       [0.99619496, 0.003805  ],
       [0.6002634 , 0.3997366 ],
       [0.62755245, 0.3724475 ]], dtype=float32)

In [25]:
predictor.get_classes()

['neg', 'pos']
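Assuming the columns of the probability array follow the order returned by `get_classes()`, each row can be mapped back to a label and a confidence score. The sketch below does this in plain Python with the values printed above, rounded as shown in the output; the helper `top_class` is a hypothetical name.

```python
def top_class(row, class_names):
    """Return (label, probability) for the highest-scoring class in a row."""
    best = max(range(len(row)), key=row.__getitem__)
    return class_names[best], row[best]


classes = ["neg", "pos"]  # order reported by predictor.get_classes()
probs = [
    [0.7803387, 0.21966134],
    [0.99619496, 0.003805],
    [0.6002634, 0.3997366],
    [0.62755245, 0.3724475],
]

for row in probs:
    label, score = top_class(row, classes)
    print(f"{label}: {score:.3f}")
```

Read this way, only the second sentence is classified with high confidence; the other three sit much closer to the decision boundary.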