### Getting started with pipeline

The easiest way to use a pretrained model on a given task is to use pipeline. 🤗 Transformers provides the following tasks out of the box:

- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).
- Question answering: provide the model with some context and a question, and extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill in the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.

Let's see how this works for sentiment analysis (the other tasks are all covered in the task summary):



In [65]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [3]:
classifier('I am happy to   show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997319579124451}]

In [4]:
classifier('where can I find stability')

[{'label': 'NEGATIVE', 'score': 0.842963695526123}]

In [6]:
classifier('my sadness is my strength')

[{'label': 'POSITIVE', 'score': 0.9979116320610046}]

You can use it on a list of sentences, which will be preprocessed then fed to the model as a batch, returning a list of dictionaries like this one:

In [7]:
# Classify a list of strings in a single call
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
           "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


You can see that the second sentence has been classified as negative (the model must choose either positive or negative), but its score of 0.5309 is close to 0.5, so the prediction is far from confident.
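One way to act on such borderline scores is to flag predictions that fall below a confidence cut-off. The sketch below is only an illustration: the `results` list is copied from the output above, while the `flag_uncertain` helper and the 0.6 threshold are hypothetical choices, not part of 🤗 Transformers.

```python
# Results copied from the pipeline output above (scores rounded to 4 decimals).
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

# Hypothetical cut-off: predictions scoring below this are treated as uncertain.
CONFIDENCE_THRESHOLD = 0.6


def flag_uncertain(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Return a copy of each prediction with an added 'confident' boolean."""
    return [{**p, "confident": p["score"] >= threshold} for p in predictions]


flagged = flag_uncertain(results)
for item in flagged:
    status = "confident" if item["confident"] else "uncertain"
    print(f"{item['label']:<8} {item['score']:.4f} -> {status}")
```

The right threshold depends on the application; 0.6 is used here only to separate the two example sentences.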

By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can look at its model page to get more information about it. It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

Let's say we want to use another model; for instance, one that has been trained on French data. We can search through the model hub that gathers models pretrained on a lot of data by research labs, but also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags "French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's see how we can use it.

You can pass the name of the model directly to pipeline:

In [66]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [67]:
classifier('Je suis très heureux de montrer mon travail')

[{'label': '5 stars', 'score': 0.643379271030426}]

Now, to download the model and tokenizer we found previously, we just have to use the TFAutoModelForSequenceClassification.from_pretrained and AutoTokenizer.from_pretrained methods (feel free to replace model_name with any other model from the model hub):

In [63]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [68]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [69]:
classifier("I am a awesome girl")

[{'label': '5 stars', 'score': 0.8656461238861084}]

In [72]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary mapping strings to lists of ints. It contains the ids of the tokens, but also additional arguments that will be useful to the model. Here, for instance, we also have an attention mask, which tells the model which positions hold real tokens and which are padding:



In [73]:
inputs

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [74]:
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

In [75]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
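The padded batch printed above can be reproduced by hand to see what padding=True does. This is a minimal sketch in plain Python, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the padding id of 0 is an assumption taken from the zeros in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the output above, with the trailing padding removed.
first = [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]
second = [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence receives three padding ids, and its attention mask ends in three zeros, matching the second row of each array above.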


In [77]:
tf_outputs = model(tf_batch)

In 🤗 Transformers, model outputs behave like tuples (possibly with only one element). Here, since no labels were passed, the output contains just the final activations of the model (the logits):

In [78]:
print(tf_outputs)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[-2.6222007 , -2.7745318 , -0.8966622 ,  2.0137324 ,  3.3063853 ],
       [ 0.00635777, -0.12577419, -0.05034586, -0.16553022,  0.13285828]],
      dtype=float32)>, hidden_states=None, attentions=None)


In [79]:
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

In [82]:
inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
inputs["labels"] = tf.reshape(tf.constant(1), (-1, 1))  # Batch size 1


In [83]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=
array([[  101, 29155,   117, 11153, 14791, 10127, 18233, 10111,   102]],
      dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>, 'labels': <tf.Tensor: shape=(1, 1), dtype=int32, numpy=array([[1]], dtype=int32)>}

In [84]:
outputs = model(inputs)
loss, logits = outputs[:2]

In [85]:
loss

<tf.Tensor: shape=(1,), dtype=float32, numpy=array([3.80506], dtype=float32)>

In [86]:
logits

<tf.Tensor: shape=(1, 5), dtype=float32, numpy=
array([[-1.6257219 , -1.5413442 ,  0.01382296,  0.9951886 ,  1.7027003 ]],
      dtype=float32)>

In [88]:
logits.shape

TensorShape([1, 5])

In [80]:
tf_predictions

<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[0.00205668, 0.00176608, 0.01154936, 0.2120929 , 0.77253497],
       [0.20841917, 0.18262216, 0.19692987, 0.17550425, 0.23652449]],
      dtype=float32)>
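The probabilities above can be reproduced from the printed logits with a few lines of NumPy, which also makes it easy to map each row to a label. This is a sketch under one assumption: the label order ("1 star" through "5 stars") is inferred from the pipeline output earlier and is not read from the model configuration here.

```python
import numpy as np

# Logits copied from the printed TFSequenceClassifierOutput.
logits = np.array([
    [-2.6222007, -2.7745318, -0.8966622, 2.0137324, 3.3063853],
    [0.00635777, -0.12577419, -0.05034586, -0.16553022, 0.13285828],
])


def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the row maximum before exponentiating."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


probs = softmax(logits)

# Assumed label order for this checkpoint (index 0 = "1 star", index 4 = "5 stars").
labels = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
predicted = [labels[i] for i in probs.argmax(axis=-1)]

print(np.round(probs, 4))
print(predicted)
```

Both sentences get the same top label, but the second row is nearly uniform, so that prediction carries little confidence.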

Models are standard torch.nn.Module or tf.keras.Model, so you can use them in your usual training loop. 🤗 Transformers also provides a Trainer (or TFTrainer if you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision, etc.). See the training tutorial for more details.

In [27]:
import tensorflow as tf
tf_outputs = model(tf_batch, labels = tf.constant([1, 0]))

In [28]:
tf_outputs

TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([6.3389955, 1.568204 ], dtype=float32)>, logits=<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[-2.6222007 , -2.7745318 , -0.8966622 ,  2.0137324 ,  3.3063853 ],
       [ 0.00635777, -0.12577419, -0.05034586, -0.16553022,  0.13285828]],
      dtype=float32)>, hidden_states=None, attentions=None)
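The two loss values above are consistent with a per-example cross-entropy between the logits and the integer labels [1, 0]. The NumPy sketch below checks that arithmetic; it assumes this is how the loss is defined and is not the library's own loss code. The `cross_entropy` helper is hypothetical.

```python
import numpy as np

# Logits and labels copied from the cell and output above.
logits = np.array([
    [-2.6222007, -2.7745318, -0.8966622, 2.0137324, 3.3063853],
    [0.00635777, -0.12577419, -0.05034586, -0.16553022, 0.13285828],
])
labels = np.array([1, 0])


def cross_entropy(logits, labels):
    """Per-example loss: log-sum-exp of the logits minus the logit of the true class."""
    max_logit = logits.max(axis=-1, keepdims=True)
    log_sum_exp = max_logit[:, 0] + np.log(np.exp(logits - max_logit).sum(axis=-1))
    true_logit = logits[np.arange(len(labels)), labels]
    return log_sum_exp - true_logit


losses = cross_entropy(logits, labels)
print(np.round(losses, 4))
```

The first loss is large because label 1 ("2 stars" under the assumed ordering) is a class the model assigns very low probability to for that sentence.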

Once your model is fine-tuned, you can save it with its tokenizer in the following way:

In [29]:
tokenizer.save_pretrained("/content/tokenizer_est")
model.save_pretrained("/content/model_test")

In [30]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer_est")
model = TFAutoModel.from_pretrained("/content/model_test")


Some layers from the model checkpoint at /content/model_test were not used when initializing TFBertModel: ['classifier', 'dropout_37']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at /content/model_test.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


You can load this model back using the from_pretrained method by passing the directory name instead of the model name, as the cell above does with TFAutoModel.from_pretrained. One cool feature of 🤗 Transformers is that you can easily switch between PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow. Note that TFAutoModel loads only the base model without the classification head, which is why the log above reports the 'classifier' layer as unused.

You can also ask the model to return all hidden states and all attention weights:

In [31]:
tf_outputs = model(tf_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = tf_outputs[-2:]

In [32]:
len(all_hidden_states), len(all_attentions)

(13, 12)

In [33]:
all_attentions[0].shape

TensorShape([2, 12, 14, 14])

In [None]:
all_attentions[0]

### Sentiment analysis using DistilBERT

In [2]:
import pandas as pd
df = pd.read_csv('spam.csv')
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [36]:
df.shape

(5572, 2)

In [3]:
X=list(df['Message'])
y=list(df['Category'])

In [4]:
y=df['Category'].map(lambda x:1 if x=='spam' else 0)


In [41]:
X[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [40]:
y

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: Category, Length: 5572, dtype: int64

In [5]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)



In [15]:
len(X_train),len(X_test)

(4457, 1115)

In [3]:
import transformers
print(transformers.__version__)

4.41.2


In [42]:
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_data = tokenizer(X_train, return_tensors="np", padding=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

# 1-D array of integer labels, one per training example
labels = np.array(y_train)

In [43]:
from transformers import TFAutoModelForSequenceClassification
#from tensorflow.keras.optimizers import Adam

# Load and compile our model
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
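# Lower learning rates are often better for fine-tuning transformers.
# Note: 'rmsprop' uses the Keras default learning rate (0.001); the commented-out
# Adam optimizer below shows how to set a lower one.
model.compile(optimizer='rmsprop')  # No loss argument: the model uses its built-in loss
# optimizer = Adam(learning_rate=3e-5)  # Set learning rate here
# model.compile(optimizer=optimizer)  # No loss argument!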



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [18]:
# Reshape labels to match the first dimension of tokenized_data

labels = labels.reshape(tokenized_data['input_ids'].shape[0])

model.fit(tokenized_data, labels)



<tf_keras.src.callbacks.History at 0x7e1f605dc280>

In [19]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_59 (Dropout)        multiple                  0         
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [20]:
model.evaluate(tokenized_data, labels)



0.395347535610199

In [21]:
# Tokenize the test data for prediction
tokenized_testdata = tokenizer(X_test, return_tensors="np", padding=True)
test_labels = np.array(y_test)

In [31]:
pred=model.predict(tokenized_testdata)



In [36]:
# pred[0] holds the logits; pick the highest-scoring class for each example
output = np.argmax(pred[0], axis=1)

In [37]:
output.shape

(1115,)

In [38]:
from sklearn.metrics import confusion_matrix

cm=confusion_matrix(y_test,output)
cm

array([[955,   0],
       [160,   0]])
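The matrix above is worth reading closely. The sketch below derives a few metrics from it, using scikit-learn's convention that rows are true classes and columns are predicted classes, and the earlier mapping of spam to class 1. The numbers are copied from the output above; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true classes (0 = ham, 1 = spam); columns are predicted classes.
cm = np.array([[955, 0],
               [160, 0]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of test messages classified correctly
spam_recall = tp / (tp + fn)      # share of actual spam that was caught
predicted_spam = tp + fp          # how many messages were labelled spam at all

print(f"accuracy:       {accuracy:.4f}")
print(f"spam recall:    {spam_recall:.4f}")
print(f"predicted spam: {predicted_spam}")
```

Every test message was assigned to class 0, so the accuracy of roughly 0.86 only reflects the share of ham in the test set, and none of the 160 spam messages were detected.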

In [39]:
model.save_pretrained("/content/model_test")
tokenizer.save_pretrained("/content/tokenizer_test")

('/content/tokenizer_test/tokenizer_config.json',
 '/content/tokenizer_test/special_tokens_map.json',
 '/content/tokenizer_test/vocab.txt',
 '/content/tokenizer_test/added_tokens.json',
 '/content/tokenizer_test/tokenizer.json')

In [36]:
tokenized_data

{'input_ids': array([[  101,  2053,  1045, ...,     0,     0,     0],
        [  101,  2065,  2017, ...,     0,     0,     0],
        [  101,  2031,  2017, ...,     0,     0,     0],
        ...,
        [  101,  2005, 24471, ...,     0,     0,     0],
        [  101,  1054,  1057, ...,     0,     0,     0],
        [  101,  3461,  3110, ...,     0,     0,     0]]),
 'attention_mask': array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]])}