<a href="https://colab.research.google.com/github/mariaberardi/NLP_examples/blob/main/NLP_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I had to learn some NLP techniques for a project I was working on. This notebook collects the NLP tools I used in that work. For the project, my team used BERT via the Transformers library.

In [1]:
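#install and import the bert-as-service client and server (not used in the cells below)
!pip install bert-serving-client
!pip install bert-serving-server
from bert_serving.client import BertClient
#install the Transformers library, which provides the pipelines used below
!pip install transformers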

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-serving-client
  Downloading bert_serving_client-1.10.0-py2.py3-none-any.whl (28 kB)
Installing collected packages: bert-serving-client
Successfully installed bert-serving-client-1.10.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-serving-server
  Downloading bert_serving_server-1.10.0-py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 199 kB/s
Collecting GPUtil>=1.3.0
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
Building wheels for collected packages: GPUtil
  Building wheel for GPUtil (setup.py) ... done
  Created wheel for GPUtil: filename=GPUtil-1.4.0-py3-none-any.whl size=7410 sha256=0f5318a73b674334260026ae24ff05e82ffd11536bc7cc06a201df80462d1445
  Stored in directory: /root/.cache/pip/wheels/6e/f8/83/534c52482d6da64622ddbf72cd93c35d2ef2881b78fd08ff0c
Success

In [2]:
#import pipeline
#pipeline() builds ready-made functions that perform a number of NLP tasks using a pretrained model
from transformers import pipeline

The main tool used in my project was sentiment analysis. As our starting point, we used the sentiment-analysis pipeline from the Transformers library, which works out of the box.

In [3]:
#import sentiment analysis pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


I haven't specified a model here, so the pipeline falls back to a publicly available pretrained model (distilbert-base-uncased-finetuned-sst-2-english, as the message above reports). Most of my team's work was training our own model; some details on that will be in a separate notebook. The pretrained model is sufficient to demonstrate how these pipelines are used.
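The message printed when the pipeline was created recommends naming the model and revision explicitly. A minimal sketch of how that could look is below. The model name and revision are copied from that message; the names `PINNED_SENTIMENT` and `load_pinned_classifier` are my own, chosen only for illustration.

```python
# Model name and revision copied from the message printed by pipeline() above.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def load_pinned_classifier():
    """Build the sentiment pipeline with an explicit model and revision."""
    # Imported inside the function so the cell can be defined
    # even before transformers has been installed.
    from transformers import pipeline

    return pipeline(**PINNED_SENTIMENT)
```

Calling `classifier = load_pinned_classifier()` should then give the same classifier as before, without the warning about an unspecified model.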

In [4]:
classifier('I love ice cream.')

[{'label': 'POSITIVE', 'score': 0.9998069405555725}]

The output is a label, either POSITIVE or NEGATIVE, together with a score: the probability the model assigns to that label. The phrase "I love ice cream." is classified as positive with a score of about 0.9998 (99.98%). Here are a few other examples.

In [5]:
classifier('I m having a boring day.')

[{'label': 'NEGATIVE', 'score': 0.9995437264442444}]

In [6]:
classifier('I am going out')

[{'label': 'POSITIVE', 'score': 0.7241319417953491}]
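Since each result carries only the winning label and its score, it can be handy to put all results on one scale. The sketch below works on the three outputs copied from the cells above. It assumes the model is a two-class classifier whose two probabilities add up to 1, so the score of a NEGATIVE result can be flipped into a probability of POSITIVE. The helper names and the 0.9 threshold are hypothetical, not something from my project.

```python
# Outputs copied from the three sentiment-analysis cells above.
results = [
    {"label": "POSITIVE", "score": 0.9998069405555725},  # 'I love ice cream.'
    {"label": "NEGATIVE", "score": 0.9995437264442444},  # 'I m having a boring day.'
    {"label": "POSITIVE", "score": 0.7241319417953491},  # 'I am going out'
]


def positive_probability(result):
    """Probability of POSITIVE, assuming the two class probabilities sum to 1."""
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score


def is_confident(result, threshold=0.9):
    """True if the winning label's score reaches the (hypothetical) threshold."""
    return result["score"] >= threshold


for r in results:
    print(r["label"], round(positive_probability(r), 4), is_confident(r))
```

On this scale "I am going out" stands out: it is labelled POSITIVE, but with a much lower score than the other two sentences, which is what one would hope for a neutral statement.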

For our project's goal, positive and negative labels alone weren't quite sufficient. The next pipeline comes in handy when the user needs to specify their own labels.

In [7]:
classifier2 = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [8]:
classifier2("I will take a French class next semester", candidate_labels=["education", "medicine", "technology"])

{'sequence': 'I will take a French class next semester',
 'labels': ['education', 'technology', 'medicine'],
 'scores': [0.9502918720245361, 0.02883167937397957, 0.020876480266451836]}

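Now the output is a dictionary containing the input sequence and two lists, one with the labels and one with the score for each label. The two lists are in the same order, sorted from the highest score to the lowest.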
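Because the labels and scores line up position by position, they are easy to pair. The sketch below uses the output copied from the cell above; the variable names are my own.

```python
# Output copied from the zero-shot-classification cell above.
result = {
    "sequence": "I will take a French class next semester",
    "labels": ["education", "technology", "medicine"],
    "scores": [0.9502918720245361, 0.02883167937397957, 0.020876480266451836],
}

# Pair each label with its score; the lists are already aligned.
ranked = list(zip(result["labels"], result["scores"]))

# The first pair holds the highest score in this output.
best_label, best_score = ranked[0]

# A dictionary makes it easy to look up the score of one specific label.
score_by_label = dict(ranked)

print(best_label, round(best_score, 3))
print(round(score_by_label["medicine"], 3))
```

In this output the three scores add up to 1 (up to rounding), so they can be read as a split of probability across the candidate labels I supplied.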

In [9]:
classifier2("This is an election year", candidate_labels=["education", "politics", "business"])

{'sequence': 'This is an election year',
 'labels': ['politics', 'business', 'education'],
 'scores': [0.9928311109542847, 0.004880791064351797, 0.0022880847100168467]}

In [10]:
classifier2("Tom is training to become a pilot", candidate_labels=["career", "family"])

{'sequence': 'Tom is training to become a pilot',
 'labels': ['career', 'family'],
 'scores': [0.9786313772201538, 0.021368680521845818]}

The Transformers library provides many other pipelines like these, and they all follow a similar pattern. I include a few more examples next.

In [11]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [12]:
generator("For lunch I'm going to have")
#the output is the given input followed by a generated continuation

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'For lunch I\'m going to have a pizza. I plan to go up to my first of his many tables, his entire family, and tell them to look at each other.\n\n"Do you think he\'ll tell you that tomorrow?" I'}]

In [13]:
generator2 = pipeline("text-generation", model="distilgpt2")
#here the model is chosen explicitly; arguments such as the maximum length and the number of returned sequences are passed when calling the generator


In [14]:
generator2("For lunch I'm going to have",
           max_length=29, #maximum length of the generated text in tokens, prompt included
           num_return_sequences=2) #number of generated sequences to return

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "For lunch I'm going to have breakfast with a friend of mine, and I'm going to have lunch with an awesome guy in the house."},
 {'generated_text': "For lunch I'm going to have a little coffee on my hands and I\u202cll be going over my shoulder. I love this sandwich."}]
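Each `generated_text` starts with the prompt itself, so a small helper can strip it when only the new text is wanted. This is a sketch on the first output copied from the cell above; `continuation` is a hypothetical helper name. I believe the text-generation pipeline also accepts `return_full_text=False` for the same purpose, but I did not use it in this notebook.

```python
prompt = "For lunch I'm going to have"

# First output copied from the generator2 cell above.
outputs = [
    {
        "generated_text": "For lunch I'm going to have breakfast with a friend "
        "of mine, and I'm going to have lunch with an awesome guy in the house."
    },
]


def continuation(prompt, generated_text):
    """Return only the text that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # If the prompt is not a prefix, leave the text untouched.
    return generated_text


for out in outputs:
    print(continuation(prompt, out["generated_text"]).strip())
```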

In [15]:
unmasker = pipeline("fill-mask")
#this pipeline allows us to fill in a missing word in a phrase

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [16]:
unmasker("This class will teach you about <mask> models")
#the outputs are the most likely candidates for the masked word according to the model,
#together with the probability the model assigns to each candidate

[{'score': 0.23607781529426575,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This class will teach you about mathematical models'},
 {'score': 0.058790311217308044,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This class will teach you about predictive models'},
 {'score': 0.031382620334625244,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This class will teach you about building models'},
 {'score': 0.03129860386252403,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This class will teach you about computational models'},
 {'score': 0.0308521818369627,
  'token': 3034,
  'token_str': ' computer',
  'sequence': 'This class will teach you about computer models'}]
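The fill-mask output is a list of dictionaries, one per candidate, so it is easy to reduce to a short table of words and scores. The sketch below uses the scores and tokens copied from the output above (the `token` and `sequence` fields are left out for brevity); the variable names are my own.

```python
# Scores and tokens copied from the fill-mask output above.
predictions = [
    {"score": 0.23607781529426575, "token_str": " mathematical"},
    {"score": 0.058790311217308044, "token_str": " predictive"},
    {"score": 0.031382620334625244, "token_str": " building"},
    {"score": 0.03129860386252403, "token_str": " computational"},
    {"score": 0.0308521818369627, "token_str": " computer"},
]

# token_str carries a leading space, so strip it for display.
candidates = [(p["token_str"].strip(), p["score"]) for p in predictions]

# Total probability covered by the five listed candidates.
covered = sum(score for _, score in candidates)

for word, score in candidates:
    print(f"{word:15s} {score:.3f}")
print(f"covered by top 5: {covered:.3f}")
```

The five candidates together cover a bit under 40% of the probability, so the model spreads the rest over many other words; only "mathematical" stands out clearly here.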