# BERT (Bidirectional Encoder Representations from Transformers)

As the name suggests, BERT produces bidirectional encoder representations using the Transformer architecture.

It is a very large neural network: the largest published variant (BERT-large) has almost 340 million parameters, while the BERT-base variant downloaded later in this notebook has about 110 million.

We do not have to train the whole model from scratch, since several pretrained models are available online for tasks such as ours.

So we do the same: we download one of the pretrained models and fine-tune it on our data with the help of ktrain.

# Data download

In [1]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz
!ls

--2021-12-07 14:29:54--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-12-07 14:29:57 (27.1 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

aclImdb  aclImdb_v1.tar.gz  sample_data


# Alternative with TensorFlow Datasets (tfds)

In [2]:
!pip install tensorflow-datasets > /dev/null

In [3]:
import tensorflow_datasets as tfds

In [4]:
(ds_train,ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteINWZXC/imdb_reviews-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteINWZXC/imdb_reviews-test.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteINWZXC/imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [5]:
ds_info

tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
      title     = {Learning Word

# Implementation Steps

1. Basic data cleaning

2. Preprocess using text.texts_from_df

3. Define the model

4. Find the learning rate

5. Fit the model

PS: the steps are explained in detail during the implementation.

In [6]:
# Create a pandas DataFrame from each TensorFlow dataset object.
# take() is given a value above 25000 so that no examples are missed.
ds_train = tfds.as_dataframe(ds_train.take(25100), ds_info)
ds_test = tfds.as_dataframe(ds_test.take(25100), ds_info)

In [7]:
ds_train.head(5)
# The b prefixes show that the text is stored as bytes, so we need to decode it (as UTF-8) and do some basic cleaning of the data set

   label                                               text
0      0  b"This was an absolutely terrible movie. Don't...
1      0  b'I have been known to fall asleep during film...
2      0  b'Mann photographs the Alberta Rocky Mountains...
3      1  b'This is the kind of film for a snowy Sunday ...
4      1  b'As others have mentioned, all the women that...


In [9]:
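# As we can see, there are some unwanted characters in the text, so let's do some very basic cleaning
import re

# Raw strings avoid invalid-escape warnings; the patterns match the same characters as before
PUNCTUATION = re.compile(r"[.;:!'?,\"()\[\]]")
BREAKS_AND_SEPARATORS = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def basic_clean(txt):
  txt = txt.decode("utf-8").lower() # decode bytes to a string (removes the b prefix) and lowercase it
  txt = PUNCTUATION.sub("", txt) # remove punctuation
  txt = BREAKS_AND_SEPARATORS.sub(" ", txt) # replace double <br /> tags, hyphens and slashes with a space
  return txt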
ds_train['text'] =  ds_train['text'].apply(basic_clean)
ds_test['text'] =  ds_test['text'].apply(basic_clean)
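
To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same two patterns to a single made-up review. The sample text is hypothetical (it is not taken from the dataset), and the function is repeated here only so that the snippet runs on its own.

```python
import re

# Same character sets as the cleaning cell above, written as raw strings.
PUNCTUATION = re.compile(r"[.;:!'?,\"()\[\]]")
BREAKS_AND_SEPARATORS = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def basic_clean(txt):
    """Decode a bytes review, lowercase it, then strip punctuation and separators."""
    txt = txt.decode("utf-8").lower()
    txt = PUNCTUATION.sub("", txt)
    return BREAKS_AND_SEPARATORS.sub(" ", txt)

# Hypothetical review, written for illustration only.
sample = b"I loved it!<br /><br />A well-made, heart-warming film (9/10)."
cleaned = basic_clean(sample)
print(cleaned)  # i loved it a well made heart warming film 9 10
```

Note that the pattern only matches a doubled `<br /><br />` tag. A single `<br />` is not removed as a unit; only its slash is replaced by a space, so a stricter pattern may be worth considering if single tags occur in the data.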

In [10]:
# cleaned data
ds_train.head(5)

   label                                               text
0      0  this was an absolutely terrible movie dont be ...
1      0  i have been known to fall asleep during films ...
2      0  mann photographs the alberta rocky mountains i...
3      1  this is the kind of film for a snowy sunday af...
4      1  as others have mentioned all the women that go...


# What is ktrain and why do we use it?

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models.

ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners, typically with only a few lines of code.

Source: https://pythonrepo.com/repo/amaiya-ktrain-python-deep-learning

In other words, it makes the implementation of different deep learning models much simpler, as you'll see further on.


In [11]:
!pip install ktrain
import ktrain
from ktrain import text #ktrain text is primarily for text data


Collecting ktrain
  Downloading ktrain-0.28.3.tar.gz (25.3 MB)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok
  Downloading syntok-1.3.1.tar.gz (23 kB)
Collecting seqeval==0.0.19
  Downloading seqeval-0.0.19.tar.gz (30 kB)
Collecting transformers<=4.10.3,>=4.0.0
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [12]:
(X_train, y_train), (X_test, y_test), preprocess= text.texts_from_df(train_df = ds_train,
                  text_column = 'text',
                  label_columns = 'label',
                  val_df = ds_test,
                  maxlen = 400,
                  preprocess_mode = 'bert')
# texts_from_df is used because the data is in a DataFrame: train_df is the training data and val_df is the test data.
# text_column and label_columns name the columns that hold the reviews and the labels.
# preprocess_mode = 'bert' tells ktrain to preprocess the text the way BERT expects.
# maxlen is set to 400 because we keep all of the text, stopwords included, so the reviews are long. BERT accepts at most 512 tokens, so a larger value (for instance the full length of a review) would give an error.
# texts_from_df preprocesses the data and returns (X_train, y_train), (X_test, y_test) and the preprocess object, which is reused when the model is defined.


['not_label', 'label']
   not_label  label
0        1.0    0.0
1        1.0    0.0
2        1.0    0.0
3        0.0    1.0
4        0.0    1.0
['not_label', 'label']
   not_label  label
0        0.0    1.0
1        0.0    1.0
2        1.0    0.0
3        1.0    0.0
4        0.0    1.0
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
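
The `maxlen = 400` setting above means every review ends up as a fixed-length sequence. The sketch below illustrates that idea under simplifying assumptions: it works on plain word lists instead of BERT's WordPiece tokens, and the helper `pad_or_truncate` is a hypothetical name, not part of ktrain. ktrain's real preprocessing may differ in its details.

```python
def pad_or_truncate(tokens, maxlen, pad_token="[PAD]"):
    """Wrap tokens in [CLS]/[SEP] and force the result to exactly maxlen items.

    Illustrative only: real BERT preprocessing works on WordPiece ids.
    """
    body = tokens[: maxlen - 2]  # leave room for the two special tokens
    sequence = ["[CLS]"] + body + ["[SEP]"]
    return sequence + [pad_token] * (maxlen - len(sequence))

short = pad_or_truncate(["great", "movie"], maxlen=6)
long = pad_or_truncate(["a", "b", "c", "d", "e", "f", "g"], maxlen=6)

print(short)  # ['[CLS]', 'great', 'movie', '[SEP]', '[PAD]', '[PAD]']
print(long)   # ['[CLS]', 'a', 'b', 'c', 'd', '[SEP]']
```

Short reviews are padded and long reviews lose their tail, which is why a larger `maxlen` keeps more of each review at the cost of memory and training time.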


In [None]:
#The output above shows that ktrain has recognized that the data is not multi-label.
#It has detected the language of the text as en, which is English.
#It has also downloaded and extracted a pretrained BERT model, uncased_L-12_H-768_A-12.zip, as can be confirmed in the source code of the preprocessor: https://github.com/amaiya/ktrain/blob/master/ktrain/text/preprocessor.py
#In uncased_L-12_H-768_A-12, L = num_hidden_layers = 12, H = hidden_size = 768 and A = num_attention_heads = 12; this is the uncased BERT-base model
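
The naming scheme described above can be read mechanically. The following is a minimal sketch that splits such a name into its parts; `parse_bert_name` is a hypothetical helper written for this note, and it assumes the name follows the `<casing>_L-<layers>_H-<hidden>_A-<heads>` pattern seen in the output.

```python
import re

def parse_bert_name(name):
    """Split a name like 'uncased_L-12_H-768_A-12' into its hyperparameters."""
    match = re.fullmatch(r"(uncased|cased)_L-(\d+)_H-(\d+)_A-(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected model name: {name!r}")
    casing, layers, hidden, heads = match.groups()
    return {
        "casing": casing,
        "num_hidden_layers": int(layers),
        "hidden_size": int(hidden),
        "num_attention_heads": int(heads),
    }

config = parse_bert_name("uncased_L-12_H-768_A-12")
print(config)
# Each attention head works on hidden_size / num_attention_heads dimensions.
print(config["hidden_size"] // config["num_attention_heads"])  # 64
```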

In [13]:
#Here we define the model as a BERT text classifier. preproc takes the preprocess object returned by the previous step (created in 'bert' mode), and we also pass the training data
model = text.text_classifier(name = 'bert', train_data = (X_train, y_train),
                             preproc = preprocess)

Is Multi-Label? False
maxlen is 400
done.


In [14]:
learner = ktrain.get_learner(model = model,
                             train_data = (X_train, y_train),
                             val_data = (X_test,y_test),
                             batch_size = 6)
# We train on the training data and use the test data as validation data. We keep a low batch size of 6 for good performance; similarly low batch sizes are suggested at https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
# get_learner returns a Learner instance that can be used to tune and train the model.

In [17]:
learner.fit_onecycle(lr = 2e-5, epochs = 2)
# A learning rate of 2e-5 was found to be one of the best values, which is in line with the learning rates suggested at https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
# We run only 2 epochs because a high accuracy is reached easily and each epoch uses considerable resources.
# fit_onecycle trains with the one-cycle policy.
# The one-cycle policy varies the learning rate across iterations to help the model converge quickly. It follows the Cyclical Learning Rate (CLR) approach to reduce training time.
# Specifically, it uses one cycle that is shorter than the total number of iterations/epochs and lets the learning rate drop to several orders of magnitude below the initial learning rate for the remaining (last few) iterations. Source: https://derekchia.com/the-1-cycle-policy/
# The one-cycle policy also helps the model train in fewer epochs.



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f6e7a85cd10>
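
The one-cycle policy described in the comments of the training cell can be sketched as a simple schedule function. This is an illustration of the general idea only: the function name `one_cycle_lr`, the linear ramps, and the constants (start at a tenth of the maximum, finish at a thousandth, cycle over 90% of the steps) are assumptions made here, and ktrain's `fit_onecycle` may use a different shape internally.

```python
def one_cycle_lr(step, total_steps, max_lr,
                 start_div=10.0, final_div=1000.0, cycle_fraction=0.9):
    """Learning rate at `step` for a simplified one-cycle schedule.

    Phase 1: linear warm-up from max_lr / start_div to max_lr.
    Phase 2: linear decay back to max_lr / start_div.
    Phase 3: final anneal down to max_lr / final_div.
    """
    if total_steps < 4:
        raise ValueError("total_steps must be at least 4")
    half = int(total_steps * cycle_fraction) // 2
    base_lr = max_lr / start_div
    if step < half:
        return base_lr + (max_lr - base_lr) * step / half
    if step < 2 * half:
        return max_lr - (max_lr - base_lr) * (step - half) / half
    tail = total_steps - 2 * half
    progress = (step - 2 * half) / max(tail - 1, 1)
    final_lr = max_lr / final_div
    return base_lr + (final_lr - base_lr) * progress

total = 100  # hypothetical number of training iterations
schedule = [one_cycle_lr(s, total, max_lr=2e-5) for s in range(total)]
peak_step = schedule.index(max(schedule))
print(peak_step, schedule[0], max(schedule), schedule[-1])
```

The rate rises to its maximum in the middle of the cycle and ends well below where it started, which matches the behaviour the comments describe.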

# Conclusion

BERT is by far the best performing model in this assignment, and also by far the most computationally demanding. The results show that BERT, with its millions of parameters, captures the context of the language much better and is quite good at sentiment analysis.

The validation accuracy is almost 94% (93.96%), which is much higher than the validation accuracy of the fastText model and the test accuracy of all the other models.