
In [43]:
!pip install transformers



In [44]:
!pip install tensorflow



In [45]:
from transformers import pipeline

# Basic sentiment classification model with the default Transformers pipeline

In [46]:
sentiment_classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [47]:
sentiments = sentiment_classifier(['I am so happy to use the Transformer library','It is really helpful for NLP tasks'])
for sentiment in sentiments:
  print(sentiment)

{'label': 'POSITIVE', 'score': 0.9997748732566833}
{'label': 'POSITIVE', 'score': 0.9985191226005554}
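The pipeline returns one dict per input sentence, each with a `label` and a `score` key. A minimal sketch of post-processing such results follows. It uses the two results printed above as literal data and makes no model call; the `confident` helper and the 0.999 threshold are hypothetical and chosen only for illustration.

```python
# Results copied from the pipeline output above (no model is loaded here).
results = [
    {'label': 'POSITIVE', 'score': 0.9997748732566833},
    {'label': 'POSITIVE', 'score': 0.9985191226005554},
]


def confident(results, threshold=0.999):
    """Keep only the predictions whose score is at least `threshold`."""
    return [r for r in results if r['score'] >= threshold]


# Print the label and rounded score of every prediction that passes the filter.
for r in confident(results):
    print(f"{r['label']} ({r['score']:.4f})")
```

With the default threshold only the first result passes; lowering the threshold keeps both.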


Specify the model explicitly - distilbert-base-uncased-finetuned-sst-2-english (the same checkpoint the default pipeline loaded above)

In [48]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

In [49]:
sentiment_classifier = pipeline('text-classification', model=model_name)

In [50]:
sentiments = sentiment_classifier(['This restaurant is awesome','I am so happy about this food '])
for sentiment in sentiments:
  print(sentiment)

{'label': 'POSITIVE', 'score': 0.9998743534088135}
{'label': 'POSITIVE', 'score': 0.9998840093612671}


**More Customized Approach**

In [51]:
import tensorflow as tf
from transformers import AutoTokenizer,TFAutoModelForSequenceClassification

In [52]:
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [53]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [54]:
sentiment_classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

In [55]:
# sentiments = sentiment_classifier(['This restaurant is awesome','I am so happy about this food '])
# for sentiment in sentiments:
#   print(sentiment)

**More manual approach without using the pipeline**

In [56]:
tokens = tokenizer.tokenize('I am so happy to use the transformer library')
tokens

['i', 'am', 'so', 'happy', 'to', 'use', 'the', 'transform', '##er', 'library']

In [57]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[1045, 2572, 2061, 3407, 2000, 2224, 1996, 10938, 2121, 3075]
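The split of "transformer" into `transform` and `##er` comes from WordPiece-style subword tokenization: each word is broken into the longest pieces found in the vocabulary, and pieces that continue a word carry a `##` prefix. The sketch below is a simplified greedy longest-match version written for illustration, not the library's implementation. The toy `vocab` contains only the ten tokens and IDs printed above, and the `wordpiece` helper is hypothetical.

```python
# Toy vocabulary built from the tokens and IDs printed above.
# The real vocabulary has 30522 entries.
vocab = {
    'i': 1045, 'am': 2572, 'so': 2061, 'happy': 3407, 'to': 2000,
    'use': 2224, 'the': 1996, 'transform': 10938, '##er': 2121,
    'library': 3075,
}


def wordpiece(word, vocab, unk='[UNK]'):
    """Greedily split one lowercase word into the longest known pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


text = 'I am so happy to use the transformer library'
# The uncased model lowercases its input before splitting into pieces.
tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
# Every piece of this sentence is in the toy vocabulary, so a plain lookup works.
token_ids = [vocab[t] for t in tokens]
print(tokens)
print(token_ids)
```

The two printed lists reproduce the `tokenize` and `convert_tokens_to_ids` outputs shown above for this one sentence.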

In [58]:
input_ids = tokenizer('I am so happy to use the transformer library')
input_ids

{'input_ids': [101, 1045, 2572, 2061, 3407, 2000, 2224, 1996, 10938, 2121, 3075, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
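Calling the tokenizer directly returns the same ten IDs as `convert_tokens_to_ids`, wrapped in two extra IDs: 101 at the start and 102 at the end. These are the `[CLS]` and `[SEP]` special tokens listed in the tokenizer's configuration. A small sketch of that wrapping step in plain Python; the `add_special_tokens` helper is hypothetical and only mirrors what the output above shows.

```python
# IDs of [CLS] and [SEP], as seen at both ends of the output above.
CLS_ID, SEP_ID = 101, 102


def add_special_tokens(token_ids):
    """Wrap the IDs in [CLS] ... [SEP] and mark every position as real input."""
    input_ids = [CLS_ID] + token_ids + [SEP_ID]
    return {'input_ids': input_ids, 'attention_mask': [1] * len(input_ids)}


# The ten IDs produced by convert_tokens_to_ids earlier in the notebook.
token_ids = [1045, 2572, 2061, 3407, 2000, 2224, 1996, 10938, 2121, 3075]
encoded = add_special_tokens(token_ids)
print(encoded)
```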

In [59]:
X_train = ['This restaurant is awesome','I am so happy about this food ']

In [60]:
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors='tf')
batch

{'input_ids': <tf.Tensor 'Const_20:0' shape=(2, 9) dtype=int32>, 'attention_mask': <tf.Tensor 'Const_21:0' shape=(2, 9) dtype=int32>}
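Both tensors have shape (2, 9) because `padding=True` pads every sequence in the batch to the length of the longest one, and the attention mask records which positions are real tokens (1) and which are padding (0). A rough sketch of that step in plain Python, under stated assumptions: the `pad_batch` helper is hypothetical, `PAD_ID = 0` is assumed to be the ID of `[PAD]`, and the two ID lists are placeholders rather than real tokenizer output.

```python
PAD_ID = 0  # assumed ID of the [PAD] token


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build the masks."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Placeholder IDs of length 6 and 9; the longer one sets the batch width of 9.
short = [101, 11, 12, 13, 14, 102]
long = [101, 21, 22, 23, 24, 25, 26, 27, 102]
padded = pad_batch([short, long])
print(padded['input_ids'])
print(padded['attention_mask'])
```

Padding goes on the right here, matching the `padding_side='right'` setting shown in the tokenizer's representation.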

In [61]:
# Inference only: TensorFlow records gradients only inside a tf.GradientTape, so no no-grad context is needed here
outputs = model(batch)
print(f'\noutputs are :{outputs}\n\n')
predictions = tf.nn.softmax(outputs.logits,axis=1)
print(f'predictions are :{predictions}')


outputs are :TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor 'tf_distil_bert_for_sequence_classification_3_2/classifier/BiasAdd:0' shape=(2, 2) dtype=float32>, hidden_states=None, attentions=None)


predictions are :Tensor("Softmax:0", shape=(2, 2), dtype=float32)
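The model returns raw logits of shape (2, 2): one row per sentence, one column per class. `tf.nn.softmax` with `axis=1` turns each row into probabilities that sum to 1. A minimal sketch of the same computation in plain Python; the logit values are hypothetical, since the output above shows only symbolic tensors.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sentences and two classes.
logits = [[-4.2, 4.5], [-4.3, 4.6]]
probs = [softmax(row) for row in logits]
for row in probs:
    print(row)
```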


In [63]:
labels = tf.argmax(predictions,axis=1)
print(f'labels are :{labels}')
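# id2label is keyed by Python ints, so convert each element before the lookup
# (.numpy() assumes eager execution, the TensorFlow 2 default)
labels = [model.config.id2label[int(label_id)] for label_id in labels.numpy()]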
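The final step picks the most probable class in each row and maps its index to a readable label. A small sketch of that mapping in plain Python, under stated assumptions: the `id2label` dict is assumed to match `model.config.id2label` for this checkpoint, the probabilities are hypothetical, and `argmax` is a hypothetical helper standing in for `tf.argmax`.

```python
# Assumed to match model.config.id2label for this checkpoint.
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}

# Hypothetical class probabilities for two sentences.
probs = [[0.0002, 0.9998], [0.0001, 0.9999]]


def argmax(row):
    """Return the index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


label_ids = [argmax(row) for row in probs]
predicted = [id2label[i] for i in label_ids]
print(label_ids)
print(predicted)
```

The lookup works because the indices are plain Python ints, the same key type the dict uses.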