We are going to use the Transformers library created by Hugging Face.
Through this library, we can download and use the pretrained BERT model, which is a transformer-based model.
The Transformers library lets you use a wide range of transformer-based models for tasks such as text classification and sentiment analysis, in many different languages.

In [1]:
# installing the libraries
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The Transformers library ships many pretrained transformer models, such as BERT and GPT. There are two ways to use a pretrained model.
# a) Using the Pipeline API

In [2]:
from transformers import pipeline
classifier=pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')

In [3]:
movie_review="A delightful experience that will make you forget about the recent tragedy of Infifnity war. Ant man and the wasp is the 20th MCU Movie which is directed by Peyton Reed who has directed the first Ant Man Movie and this Movie is about scott, who is  hoyse arrested due to the Events that took place due to the Sokovia Accords(Civil War) and Hank and Hope trying to Get Back Janet( Hanks wife and Hopes Mother) who they think is still alive in the Quantum Realm and thier only hope to get her Back is Scott. Well this movie isnt Great or Brilliant but its is a Good Movie."

In [4]:
# predict a star rating (1 to 5 stars) for the movie review
classifier(movie_review)

[{'label': '4 stars', 'score': 0.6429524421691895}]
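
The pipeline returns a list with one dictionary per input, and this model reports its prediction as a label string such as `'4 stars'`. If a numeric rating is more convenient, the label can be parsed. This is a minimal sketch: `stars_from_label` is a hypothetical helper name (not part of the library), and it assumes labels follow the `'<n> star(s)'` pattern seen in the output above.

```python
def stars_from_label(label):
    """Parse a label such as '4 stars' or '1 star' into an integer."""
    return int(label.split()[0])

# Same shape as the pipeline output shown above.
result = [{'label': '4 stars', 'score': 0.6429524421691895}]

rating = stars_from_label(result[0]['label'])
confidence = round(result[0]['score'], 2)
print(rating, confidence)
```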

Creating another pipeline for sentiment analysis, this time using the pretrained DistilBERT model.

In [5]:
classifier1=pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

In [6]:
classifier1('I like the movie')

[{'label': 'POSITIVE', 'score': 0.9998114109039307}]

# b) Using the tokenizer and the pretrained model separately
This method lets you load the model and tokenizer into memory yourself, which is also the starting point for fine-tuning the model on your own dataset.

In [7]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

Downloading the model and tokenizer for NLP tasks from the model hub.

In [8]:
model_name='nlptown/bert-base-multilingual-uncased-sentiment'
# this model is available only as a PyTorch checkpoint, so we pass from_pt=True to load it in TensorFlow
model=TFAutoModelForSequenceClassification.from_pretrained(model_name,from_pt=True)
tokenizer=AutoTokenizer.from_pretrained(model_name)
classifier=pipeline('sentiment-analysis',model=model,tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [9]:
classifier('This movie is not doing well')

[{'label': '1 star', 'score': 0.6130624413490295}]
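
The `score` reported next to the label is a probability. A minimal sketch of how such a score could be derived, assuming the pipeline applies a softmax over the model's five class logits and reports the largest probability. The logits below are hypothetical values chosen for illustration, not numbers produced by the model.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['1 star', '2 stars', '3 stars', '4 stars', '5 stars']
logits = [2.1, 1.4, 0.3, -1.0, -1.8]  # hypothetical logits, for illustration only

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
prediction = {'label': labels[best], 'score': probs[best]}
print(prediction)
```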

In [10]:
# using the tokenizer separately
tokenizer('this movie is not doing good')

{'input_ids': [101, 10372, 13113, 10127, 10497, 27799, 12050, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

#### Fine-tuning a pretrained model on a custom dataset using transformers:
Here we take a custom dataset and fine-tune the pretrained model to see how well it predicts the output.
The pretrained model we are going to use is DistilBERT.
The dataset is a spam vs. ham classification dataset.

In [11]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
df=pd.read_csv('drive/MyDrive/spam.csv',encoding="ISO-8859-1")
df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'], axis=1, inplace=True)
df.rename(columns={'v1':'label','v2':'message'},inplace=True)

In [13]:
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [14]:
df.shape

(5572, 2)

In [15]:
# creating independent and dependent features:
y=df.iloc[:,:-1]
X=df.iloc[:,-1:]

In [16]:
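# one-hot encode the label; drop_first=True keeps a single 'label_spam' column (1 = spam, 0 = ham)
# dtype=int keeps the labels as integers on pandas versions that would otherwise return booleans
y=pd.get_dummies(y,drop_first=True,dtype=int)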

In [17]:
X=X['message'].tolist()
y=y['label_spam'].tolist()
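
With only two classes, the encoding above reduces to a single 0/1 column. The same mapping can be written in plain Python, which makes the meaning of each value explicit. This is an illustrative sketch using the first five labels shown by `df.head()`, not a replacement for the cells above.

```python
# First five labels from df.head() above.
labels = ['ham', 'ham', 'spam', 'ham', 'ham']

# 1 marks spam and 0 marks ham, matching the 'label_spam' column.
encoded = [1 if label == 'spam' else 0 for label in labels]
print(encoded)  # [0, 0, 1, 0, 0]
```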

In [18]:
print(type(X),type(y))

<class 'list'> <class 'list'>


In [19]:
# train and test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
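
To see where the split sizes come from, here is a rough sketch of a shuffled 80/20 split in plain Python. `split_indices` is a hypothetical helper: it reproduces the sizes of the split (the test size is rounded up) but not the exact row assignment that `train_test_split` makes with `random_state=0`.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and split them into train and test parts."""
    n_test = math.ceil(n_samples * test_size)  # test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

# 5572 rows, as reported by df.shape above.
train_idx, test_idx = split_indices(5572, 0.2, seed=0)
print(len(train_idx), len(test_idx))  # 4457 1115
```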

In [20]:
X_train

['No no:)this is kallis home ground.amla home town is durban:)',
 'I am in escape theatre now. . Going to watch KAVALAN in a few minutes',
 'We walked from my moms. Right on stagwood pass right on winterstone left on victors hill. Address is &lt;#&gt;',
 'I dunno they close oredi not... ÌÏ v ma fan...',
 'Yo im right by yo work',
 '\\Its Ur luck to Love someone. Its Ur fortune to Love the one who Loves U. But',
 'He also knows about lunch menu only da. . I know',
 'Oh yeah! And my diet just flew out the window',
 "Nah it's straight, if you can just bring bud or drinks or something that's actually a little more useful than straight cash",
 'SplashMobile: Choose from 1000s of gr8 tones each wk! This is a subscrition service with weekly tones costing 300p. U have one credit - kick back and ENJOY',
 'Fighting with the world is easy, u either win or lose bt fightng with some1 who is close to u is dificult if u lose - u lose if u win - u still lose.',
 "Of cos can lar i'm not so ba dao ok...

# Transformer based model:
Now we are going to use the DistilBERT model for prediction.

In [21]:
from transformers import DistilBertTokenizerFast
tokenizer=DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

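# Tokenizer:
The tokenizer preprocesses the text before we feed it to the DistilBERT model. Each model has its own tokenizer, so the tokenizer must match the checkpoint being used.
Here we apply the tokenizer to the train and test sets separately. The tokenizer is pretrained and its vocabulary is fixed, so nothing is learned from our data at this step and tokenizing cannot by itself leak information; the two splits are encoded separately simply so that the held-out test set stays apart from the training data. With `padding=True`, each call pads to the longest sequence in that call, so the two encodings may end up with different sequence lengths.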

In [22]:
train_encoding=tokenizer(X_train, truncation=True, padding=True)
test_encoding=tokenizer(X_test, truncation=True, padding=True)
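
To make `padding=True` and `truncation=True` concrete, the sketch below pads a small batch of token-id lists to a common length and builds the matching attention mask. It is a simplified illustration, not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, the token ids are made up, the pad id of 0 is an assumption, and a real tokenizer keeps the closing special token when it truncates.

```python
PAD_ID = 0  # assumed id of the padding token

def pad_and_mask(batch_ids, max_length=None):
    """Pad every sequence to the longest one and mark real tokens with 1."""
    if max_length is not None:
        # Simplified truncation: just cut each sequence at max_length.
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Two made-up sequences of different lengths.
batch = pad_and_mask([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

The attention mask tells the model which positions are real tokens (1) and which are padding (0) that should be ignored.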

In [23]:
#train_encoding['input_ids']

# Converting encodings into dataset objects:
`tf.data.Dataset.from_tensor_slices()` converts the tokenized data into a format that can be fed to the model for training. It also lets you build input pipelines in which the data is shuffled and delivered to the model in batches efficiently.

In [24]:
import tensorflow as tf
train_dataset=tf.data.Dataset.from_tensor_slices((dict(train_encoding),y_train))
test_dataset=tf.data.Dataset.from_tensor_slices((dict(test_encoding),y_test))
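
Conceptually, slicing pairs the i-th row of every encoding field with the i-th label, giving one `(features, label)` example per message, and batching then groups those examples. The plain-Python sketch below mimics that behaviour on toy data; `slice_examples` and `batches` are hypothetical helpers for illustration and are not how `tf.data` is implemented.

```python
def slice_examples(features, labels):
    """Pair the i-th row of every feature with the i-th label."""
    return [({name: rows[i] for name, rows in features.items()}, labels[i])
            for i in range(len(labels))]

def batches(examples, batch_size):
    """Yield consecutive groups of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Toy encodings for three messages (made-up ids).
features = {
    'input_ids': [[101, 5, 102], [101, 6, 102], [101, 7, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
labels = [0, 1, 0]

examples = slice_examples(features, labels)
batch_sizes = [len(b) for b in batches(examples, batch_size=2)]
print(examples[1])
print(batch_sizes)  # [2, 1]
```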



In [25]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

In [26]:
training_args=TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
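
The `warmup_steps=500` setting means the learning rate is not applied at full strength from the first step. The sketch below shows one common shape for such a schedule: a linear ramp up to the base rate, followed by a linear decay to zero. Both the shape and the base rate of 5e-5 are assumptions made for illustration; the notebook does not show the exact schedule the trainer would use. The total step count assumes 4457 training examples, batches of 8, and 3 epochs.

```python
import math

def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

steps_per_epoch = math.ceil(4457 / 8)  # 558 batches per epoch
total_steps = steps_per_epoch * 3      # 3 epochs

for step in (0, 250, 500, 1000, total_steps):
    print(step, learning_rate(step, 5e-5, 500, total_steps))
```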



In [27]:
"""
with training_args.strategy.scope():
  model=TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
  trainer=TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset)
"""




In [28]:
import tensorflow as tf

In [None]:
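# The TFTrainer cell above is disabled, so load DistilBERT here (2 classes: ham/spam) instead of reusing the BERT model from In [8]
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))  # any Keras loss function can be used here
# shuffle over the whole training set; the dataset is already batched, so fit() needs no batch_size (batch(1) is slow, a larger batch trains faster)
model.fit(train_dataset.shuffle(len(X_train)).batch(1), epochs=1)  # train the model on our custom dataset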

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method






  36/4457 [..............................] - ETA: 8:11:54 - loss: 0.6613
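
The loss in the progress bar is sparse categorical cross-entropy computed from logits: the negative log of the softmax probability assigned to the true class. A minimal sketch in plain Python, using hypothetical logits, shows why a two-class model with no preference starts near ln 2 ≈ 0.693, which is close to the value logged above.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for a spam example (label 1).
undecided = sparse_ce_from_logits([0.0, 0.0], 1)   # no preference between classes
confident = sparse_ce_from_logits([-3.0, 3.0], 1)  # strongly predicts spam
print(undecided, confident)
```

As training progresses and the model grows more confident in the correct class, the per-example loss moves from the first value toward the second.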