<a href="https://colab.research.google.com/github/just-joseph/NLP-basics/blob/main/NLP_applications_using_huggingface_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers
!pip install tensorflow
!pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.5.0+cpu
  Downloading https://download.pytorch.org/whl/cpu/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl (127.3MB)
Collecting torchvision==0.6.0+cpu
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.6.0%2Bcpu-cp37-cp37m-linux_x86_64.whl (5.7MB)
ERROR: torchtext 0.9.1 has requirement torch==1.8.1, but you'll have torch 1.5.0+cpu which is incompatible.
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.8.1+cu101
    Uninstalling torch-1.8.1+cu101:
      Successfully uninstalled torch-1.8.1+cu101
  Found existing installation: torchvision 0.9.1+cu101
    Uninstalling torchvision-0.9.1+cu101:
      Successfully uninstalled torchvision-0.9.1+cu101
Successfully installed torch-1.5.0+cpu torchvision-0.6.0+cpu


## Sentiment analysis

In [None]:
from transformers import pipeline

# Sentiment analysis pipeline
sentiment_classifier = pipeline('sentiment-analysis')

In [None]:
# Testing
sentiment_classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [None]:
results = sentiment_classifier(["We are very happy to show you the Transformers library.","We hope you won't hate it."])

for result in results:
  print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.504
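
The second score (0.504) is barely above 0.5, so the NEGATIVE label for that sentence is close to a coin flip. Below is a minimal sketch of flagging such low-confidence predictions. The `results` list is copied from the rounded output above; the `flag_uncertain` helper and the 0.75 cut-off are hypothetical choices for illustration, not part of the library.

```python
# Predictions copied from the (rounded) output above.
results = [
    {'label': 'POSITIVE', 'score': 0.9998},
    {'label': 'NEGATIVE', 'score': 0.504},
]

def flag_uncertain(predictions, threshold=0.75):
    """Keep the label when the score clears the threshold, else mark it UNCERTAIN.

    The threshold is an arbitrary illustrative value, not a library default.
    """
    return [
        p['label'] if p['score'] >= threshold else 'UNCERTAIN'
        for p in predictions
    ]

flags = flag_uncertain(results)
print(flags)
```

A reasonable cut-off depends on the application, so it would normally be tuned on labelled examples rather than fixed in advance.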


## Question Answering

In [None]:
# Question-answering pipeline (extractive: the answer is a span of the context)
qNa = pipeline("question-answering")
paragraph = ''' The number of lives claimed by the Covid-19 coronavirus in India escalated sharply to 640 on Wednesday morning, with the total tally of positive cases rapidly nearing the 20,000 mark. The Indian Medical Association (IMA) called off the White Alert protest of doctors after the association was assured by Union Home Minister Amit Shah that they would be provided security by government. Meanwhile, a civil aviation ministry employee tested coronavirus positive today after which the B wing of the ministry was sealed and sanitisation procedure was initiated. '''
# Pass the question together with the context to search for the answer
ans = qNa({'question': 'How much a total number of cases will India reach in the near future?',
           'context': paragraph})
print(ans)

{'score': 0.4759603440761566, 'start': 172, 'end': 183, 'answer': '20,000 mark'}
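
The `start` and `end` values in this output are character offsets into the context string, so slicing the context with them recovers the answer. The sketch below checks this on the first sentence of the paragraph, keeping the leading space that the original string has. The `ans` dictionary is copied from the output above rather than recomputed.

```python
# First sentence of the context, including the leading space from the original string.
context = (
    " The number of lives claimed by the Covid-19 coronavirus in India "
    "escalated sharply to 640 on Wednesday morning, with the total tally "
    "of positive cases rapidly nearing the 20,000 mark."
)

# Result copied from the pipeline output above.
ans = {'score': 0.4759603440761566, 'start': 172, 'end': 183, 'answer': '20,000 mark'}

# 'start' is inclusive and 'end' is exclusive, like a normal Python slice.
span = context[ans['start']:ans['end']]
print(span)
```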


## Fill mask

In [None]:
fillMask = pipeline("fill-mask")
print(fillMask(f"Artificial intelligence (AI) is an area of {fillMask.tokenizer.mask_token}"))

[{'sequence': 'Artificial intelligence (AI) is an area of concern', 'score': 0.5680367350578308, 'token': 2212, 'token_str': ' concern'}, {'sequence': 'Artificial intelligence (AI) is an area of research', 'score': 0.05392199754714966, 'token': 557, 'token_str': ' research'}, {'sequence': 'Artificial intelligence (AI) is an area of interest', 'score': 0.043272946029901505, 'token': 773, 'token_str': ' interest'}, {'sequence': 'Artificial intelligence (AI) is an area of expertise', 'score': 0.03091997094452381, 'token': 6424, 'token_str': ' expertise'}, {'sequence': 'Artificial intelligence (AI) is an area of vulnerability', 'score': 0.015377149917185307, 'token': 15661, 'token_str': ' vulnerability'}]
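
The pipeline returns a list of dictionaries, one per candidate token. A small sketch of ranking the candidates and extracting just the words follows. The `predictions` list is abbreviated from the output above (the `sequence` field is dropped for brevity). Note that each `token_str` carries a leading space, which the sketch strips.

```python
# Candidates copied from the output above ('sequence' omitted for brevity).
predictions = [
    {'score': 0.5680367350578308, 'token': 2212, 'token_str': ' concern'},
    {'score': 0.05392199754714966, 'token': 557, 'token_str': ' research'},
    {'score': 0.043272946029901505, 'token': 773, 'token_str': ' interest'},
    {'score': 0.03091997094452381, 'token': 6424, 'token_str': ' expertise'},
    {'score': 0.015377149917185307, 'token': 15661, 'token_str': ' vulnerability'},
]

# Sort explicitly by score so the code does not rely on the returned order.
ranked = sorted(predictions, key=lambda p: p['score'], reverse=True)

# Keep the three most likely words, without the leading space.
top_words = [p['token_str'].strip() for p in ranked[:3]]
print(top_words)
```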


## NER

In [None]:
namedEntityRecognition = pipeline("ner")
sentence = 'Larry Page is an owner and co-founder of Google Corporation.'
print(namedEntityRecognition(sentence))

[{'word': 'Larry', 'score': 0.9986725449562073, 'entity': 'I-PER', 'index': 1, 'start': 0, 'end': 5}, {'word': 'Page', 'score': 0.9992311596870422, 'entity': 'I-PER', 'index': 2, 'start': 6, 'end': 10}, {'word': 'Google', 'score': 0.9996185898780823, 'entity': 'I-ORG', 'index': 11, 'start': 41, 'end': 47}, {'word': 'Corporation', 'score': 0.9989662170410156, 'entity': 'I-ORG', 'index': 12, 'start': 48, 'end': 59}]
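
The output above tags each token separately, so "Larry" and "Page" appear as two entries even though they form one name. Below is a minimal sketch of merging adjacent tokens that share an entity tag, using the `index`, `start` and `end` fields from the output. The `group_entities` helper is hypothetical and written by hand here; depending on the installed version, the pipeline may also offer built-in grouping options.

```python
sentence = 'Larry Page is an owner and co-founder of Google Corporation.'

# Token-level predictions copied from the output above (scores omitted).
tokens = [
    {'word': 'Larry', 'entity': 'I-PER', 'index': 1, 'start': 0, 'end': 5},
    {'word': 'Page', 'entity': 'I-PER', 'index': 2, 'start': 6, 'end': 10},
    {'word': 'Google', 'entity': 'I-ORG', 'index': 11, 'start': 41, 'end': 47},
    {'word': 'Corporation', 'entity': 'I-ORG', 'index': 12, 'start': 48, 'end': 59},
]

def group_entities(tokens, text):
    """Merge consecutive tokens with the same tag into (text, tag) pairs."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        adjacent = (
            last is not None
            and last['entity'] == tok['entity']
            and tok['index'] == last['index'] + 1
        )
        if adjacent:
            # Extend the current group to cover this token.
            last['end'] = tok['end']
            last['index'] = tok['index']
        else:
            groups.append({key: tok[key] for key in ('entity', 'index', 'start', 'end')})
    # Slice the original text so spacing between tokens is preserved.
    return [(text[g['start']:g['end']], g['entity']) for g in groups]

entities = group_entities(tokens, sentence)
print(entities)
```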


## Summarization

In [None]:
summarizer = pipeline("summarization")
article = ''' The number of lives claimed by the Covid-19 coronavirus in India escalated sharply to 640 on Wednesday morning, with the total tally of positive cases rapidly nearing the 20,000 mark. The Indian Medical Association (IMA) called off the White Alert protest of doctors after the association was assured by Union Home Minister Amit Shah that they would be provided security by government. Meanwhile, a civil aviation ministry employee tested coronavirus positive today after which the B wing of the ministry was sealed and sanitisation procedure was initiated. '''
print(summarizer(article, max_length=90, min_length=20))

## Using other models to get better results or to experiment with them

By default, the model downloaded for the sentiment-analysis pipeline is “distilbert-base-uncased-finetuned-sst-2-english”. Its model page gives more information about it: it uses the DistilBERT architecture and has been fine-tuned on the SST-2 dataset for the sentiment analysis task.

Let’s say we want to use another model; for instance, one that has been trained on French data. We can search the model hub, which gathers models pretrained on large amounts of data by research labs as well as community models (usually versions of those big models fine-tuned on a specific dataset). Applying the tags “French” and “text-classification” suggests “nlptown/bert-base-multilingual-uncased-sentiment”. Let’s see how we can use it.

You can pass the name of the model directly to pipeline():

In [None]:
sentiment_classifier_2 = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

This classifier can now handle text in English and French, as well as Dutch, German, Italian and Spanish. Instead of a model name, you can also pass the path to a local folder where you have saved a pretrained model, or pass a model object together with its associated tokenizer.
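
Unlike the default model, this one rates text on a scale of one to five stars rather than POSITIVE/NEGATIVE. The notebook does not show its output, so the sketch below assumes the labels are strings such as "1 star" and "5 stars", as described on the model page. The `stars_to_sentiment` helper and its cut-offs are hypothetical, and the sample labels are illustrative inputs, not model output.

```python
def stars_to_sentiment(label):
    """Map a star-rating label (assumed format: '1 star' ... '5 stars')
    onto a coarse sentiment class. The cut-offs are an illustrative choice."""
    stars = int(label.split()[0])
    if stars <= 2:
        return 'NEGATIVE'
    if stars == 3:
        return 'NEUTRAL'
    return 'POSITIVE'

# Illustrative labels only; these were not produced by running the model.
sample_labels = ['1 star', '3 stars', '5 stars']
mapped = [stars_to_sentiment(label) for label in sample_labels]
print(mapped)
```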

## Zero-shot text classification

In [None]:
from transformers import pipeline
zero_shot_classifier = pipeline("zero-shot-classification")

In [None]:
candidate_labels = ["human name", "living entity", "place", "object", "occupation", "trait"]
print(zero_shot_classifier("Justin Joseph", candidate_labels))

{'sequence': 'Justin Joseph', 'labels': ['human name', 'occupation', 'trait', 'object', 'living entity', 'place'], 'scores': [0.4947136342525482, 0.28604793548583984, 0.09834138303995132, 0.05426729843020439, 0.039871685206890106, 0.02675805427134037]}
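
In this output the labels come back sorted by descending score, and the scores add up to roughly 1, so they can be read as a distribution over the candidate labels. A short sketch of pairing labels with scores follows; the `output` dictionary is copied from the result above.

```python
# Result copied from the zero-shot pipeline output above.
output = {
    'sequence': 'Justin Joseph',
    'labels': ['human name', 'occupation', 'trait', 'object', 'living entity', 'place'],
    'scores': [0.4947136342525482, 0.28604793548583984, 0.09834138303995132,
               0.05426729843020439, 0.039871685206890106, 0.02675805427134037],
}

# 'labels' and 'scores' are parallel lists, so zip pairs them up.
label_scores = dict(zip(output['labels'], output['scores']))

# The first label is the most likely one because the lists are sorted.
best_label = output['labels'][0]
total = sum(output['scores'])

print(best_label, round(label_scores[best_label], 4))
print(round(total, 4))
```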


## Extract embeddings

In [None]:
model = 'bert-base-uncased'
tokenizer = 'bert-base-uncased'

nlp_features = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
vector = nlp_features(['I am hosting a session'])

In [None]:
vector

[[[0.09493163973093033,
   0.3086220324039459,
   0.03897552937269211,
   -0.11560036987066269,
   -0.11470772325992584,
   -0.3752146065235138,
   0.20115214586257935,
   0.35256966948509216,
   -0.1285945475101471,
   -0.3272238075733185,
   -0.0430142842233181,
   -0.184303417801857,
   0.06787338852882385,
   0.30418461561203003,
   0.2804088592529297,
   -0.09695954620838165,
   -0.06829401105642319,
   0.3953365385532379,
   0.11791492998600006,
   -0.30550137162208557,
   -0.20970705151557922,
   -0.3151448667049408,
   -0.14482496678829193,
   -0.15501615405082703,
   -0.07630527764558792,
   -0.14232470095157623,
   -0.028971003368496895,
   0.1630312204360962,
   -0.011457648128271103,
   -0.16919122636318207,
   0.16906476020812988,
   0.05721445381641388,
   -0.10199295729398727,
   0.055853258818387985,
   0.049752723425626755,
   -0.09026310592889786,
   0.07826662808656693,
   -0.11986708641052246,
   -0.07075384259223938,
   0.009846257045865059,
   0.001918238704092800
   ...

In [None]:
len(vector[0][0])

768

In [None]:
len(vector[0])

7
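
The result is a nested list indexed as [sentence][token][dimension]: one sentence, 7 token vectors (consistent with the five words plus BERT's special [CLS] and [SEP] tokens), each of size 768. One common way to turn per-token vectors into a single sentence vector is to average them. The sketch below shows this mean pooling in plain Python; the `mean_pool` helper is hypothetical, and `fake_tokens` holds placeholder values standing in for `vector[0]`, not real embeddings.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    # zip(*...) walks the vectors dimension by dimension.
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_vectors)]

# Placeholder stand-in for vector[0]: 7 tokens x 768 dimensions.
# Token t is filled with the constant float(t); these are not real embeddings.
fake_tokens = [[float(t)] * 768 for t in range(7)]

sentence_vector = mean_pool(fake_tokens)
print(len(sentence_vector), sentence_vector[0])
```

With the real pipeline output, the equivalent call would be `mean_pool(vector[0])`. Whether to include the special tokens in the average is a design choice this sketch does not address.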