## Hugging Face Pipelines
Tests based on https://huggingface.co/docs/transformers/en/main_classes/pipelines.

## Setting up environment
Installing dependencies and importing the Transformers library from Hugging Face.


In [None]:
!pip install transformers



In [None]:
import transformers
from transformers import pipeline
import numpy as np

## Testing basic concepts with Kerouac
**`pipeline()` function**: connects a model with its preprocessing and postprocessing steps, so raw text goes in and readable answers come out.  
It downloads a pretrained model fine-tuned for the specified task.  
It preprocesses the text, passes it to the model, and postprocesses the predictions.

Tasks available in `pipeline()` include:
*   feature-extraction: get the vector representation of a text
*   fill-mask: fill in the blanks in a given text
*   ner: named entity recognition, find which parts of the input text correspond to entities
*   question-answering: answer questions using information from a given context
*   sentiment-analysis: classify the sentiment of a text
*   zero-shot-classification: classify text against candidate labels supplied at inference time, without training on examples labelled with them
*   summarization
*   text-generation
*   translation
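
The postprocessing stage described above can be sketched in plain Python for a classification task: raw model scores (logits) are turned into probabilities with a softmax, and the highest one is reported with its label. This is only an illustration. The logits are made up, and the `id2label` mapping is an assumption modelled on a two-class sentiment model, not something read from a real checkpoint.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and format it like a pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits and label mapping, for illustration only.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess([-2.0, 3.0], id2label)
print(result)
```

The result has the same `label`/`score` layout as the sentiment-analysis outputs in the cells below.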

### Sentiment analysis

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I like too many things and get all confused and hung-up running from one falling star to another till i drop.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9995410442352295}]
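
The warning above recommends naming the model and revision explicitly. A minimal sketch of how that could look, using the model name and revision printed in the warning; `PINNED_MODEL` and `build_classifier` are hypothetical names chosen for this example, and the import is deferred so the configuration can be inspected without loading anything.

```python
# Model name and revision copied from the warning message above.
PINNED_MODEL = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the pipeline with an explicit model and revision (hypothetical helper)."""
    from transformers import pipeline  # imported lazily; downloads happen on call

    return pipeline(**PINNED_MODEL)


# Usage (downloads the model on first call):
# classifier = build_classifier()
# classifier("The only truth is music.")
```

Pinning this way should keep results reproducible if the default model for the task changes in a later release.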

In [None]:
# multiple sentences
classifier(
    ["Nothing behind me, everything ahead of me, as is ever so on the road.", "The only truth is music."]
)

[{'label': 'POSITIVE', 'score': 0.9822170734405518},
 {'label': 'NEGATIVE', 'score': 0.9679506421089172}]

In [None]:
classifier(["I hate being this happy", "I love being this hateful"])

[{'label': 'NEGATIVE', 'score': 0.9878647923469543},
 {'label': 'POSITIVE', 'score': 0.9458838701248169}]

### Zero-shot classification

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "What is that feeling when you're driving away from people and they recede on the plain till you see their specks dispersing? It's the too-huge world vaulting us, and it's good-bye. But we lean forward to the next crazy venture beneath the skies.",
    candidate_labels=["religion", "travel", "mysticism", "happy", "sad", "sappy"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': "What is that feeling when you're driving away from people and they recede on the plain till you see their specks dispersing? It's the too-huge world vaulting us, and it's good-bye. But we lean forward to the next crazy venture beneath the skies.",
 'labels': ['sad', 'travel', 'mysticism', 'sappy', 'happy', 'religion'],
 'scores': [0.6033982038497925,
  0.3093996047973633,
  0.05203428119421005,
  0.021787965670228004,
  0.00938243418931961,
  0.00399745674803853]}
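
The output above returns `labels` and `scores` as two parallel lists, sorted from most to least likely. A small sketch of pairing them up and keeping only the labels above a cutoff; the values are copied from the output above, and the 0.1 threshold is an arbitrary choice for illustration.

```python
# Labels and scores copied from the zero-shot output above.
result = {
    "labels": ["sad", "travel", "mysticism", "sappy", "happy", "religion"],
    "scores": [
        0.6033982038497925,
        0.3093996047973633,
        0.05203428119421005,
        0.021787965670228004,
        0.00938243418931961,
        0.00399745674803853,
    ],
}

# Pair each label with its score; the order is already descending.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# Keep labels above an arbitrary illustrative threshold.
THRESHOLD = 0.1
confident = [label for label, score in ranked if score >= THRESHOLD]

print(top_label, round(top_score, 3))
print(confident)
print(round(sum(result["scores"]), 6))  # the scores add up to roughly 1
```

Because the scores here sum to about 1, the labels compete with each other: a high score for "sad" necessarily lowers the others.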

### Text generation

In [None]:
# English prompt
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "The air was soft, the stars so fine, that I thought I was in a dream",
    max_length=60,
    num_return_sequences=5,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The air was soft, the stars so fine, that I thought I was in a dreamscape. I wasn't a dreamscape. Not just a dreamscape, not just a dreamscape. But then I saw that I was in the Dreamscape and I was in the dreamscape. A dream"},
 {'generated_text': "The air was soft, the stars so fine, that I thought I was in a dream come true, I had been born. The sky had fallen on me, and it was only on day two — morning, night, day and weekend. The stars were shining in the sky. I didn't"},
 {'generated_text': 'The air was soft, the stars so fine, that I thought I was in a dream.\u200d\n\n\n※There was more to the earth, and the stars were so hard to see, but I was all alone.\u200d\nAfterward the stars turned dark and a shadowless'},
 {'generated_text': 'The air was soft, the stars so fine, that I thought I was in a dream, the sun would be still shining.‹\n\n‹\nThe wind was shining at least three times higher than it is now, and the wind grew as fast as the moon and rain would hav
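
Each `generated_text` above starts with the original prompt. A short sketch of separating the model's continuation from the prompt, using the first generated sequence from the output above; `strip_prompt` is a hypothetical helper. The pipeline's `return_full_text=False` argument should give a similar result at generation time.

```python
prompt = "The air was soft, the stars so fine, that I thought I was in a dream"

# First generated sequence, copied from the output above.
generated_text = (
    "The air was soft, the stars so fine, that I thought I was in a dreamscape. "
    "I wasn't a dreamscape. Not just a dreamscape, not just a dreamscape. "
    "But then I saw that I was in the Dreamscape and I was in the dreamscape. A dream"
)


def strip_prompt(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text  # leave the text untouched if the prompt is not a prefix


continuation = strip_prompt(prompt, generated_text)
print(repr(continuation[:40]))
```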

### Mask filling


In [None]:
unmasker = pipeline("fill-mask")
unmasker("I woke up as the <mask> was reddening; and that was the one <mask> time in my life, the <mask> moment of all, when I didn't know who <mask> was - I was far away from <mask>, haunted and tired with <mask>", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[[{'score': 0.512516438961029,
   'token': 6360,
   'token_str': ' sky',
   'sequence': "<s>I woke up as the sky was reddening; and that was the one<mask> time in my life, the<mask> moment of all, when I didn't know who<mask> was - I was far away from<mask>, haunted and tired with<mask></s>"},
  {'score': 0.11220761388540268,
   'token': 3778,
   'token_str': ' sun',
   'sequence': "<s>I woke up as the sun was reddening; and that was the one<mask> time in my life, the<mask> moment of all, when I didn't know who<mask> was - I was far away from<mask>, haunted and tired with<mask></s>"}],
 [{'score': 0.06869398057460785,
   'token': 15567,
   'token_str': ' terrifying',
   'sequence': "<s>I woke up as the<mask> was reddening; and that was the one terrifying time in my life, the<mask> moment of all, when I didn't know who<mask> was - I was far away from<mask>, haunted and tired with<mask></s>"},
  {'score': 0.06080641224980354,
   'token': 94,
   'token_str': ' last',
   'sequence': "<s>I 

### NER

In [None]:
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
ner("My name is Heloísa and I am a MSc student at UFRGS in Brazil.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9938758,
  'word': 'Heloísa',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9950711,
  'word': 'UFRGS',
  'start': 45,
  'end': 50},
 {'entity_group': 'LOC',
  'score': 0.9995597,
  'word': 'Brazil',
  'start': 54,
  'end': 60}]
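
The `start` and `end` fields above are character offsets into the input string, so each entity can be recovered by slicing the original text. A quick sketch using the values from the output above (scores omitted for brevity):

```python
text = "My name is Heloísa and I am a MSc student at UFRGS in Brazil."

# Entities copied from the NER output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Heloísa", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "UFRGS", "start": 45, "end": 50},
    {"entity_group": "LOC", "word": "Brazil", "start": 54, "end": 60},
]

# Slice the original text with each (start, end) pair.
spans = {e["entity_group"]: text[e["start"]:e["end"]] for e in entities}
print(spans)
```

Slicing by offset is useful when the exact original surface form is needed, for example to highlight entities in the source text.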

### Q&A

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I study at?",
    context="My name is Heloísa and I am a MSc student at UFRGS in Brazil.",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.7654989957809448, 'start': 45, 'end': 50, 'answer': 'UFRGS'}

### Summarization

In [None]:
summarizer = pipeline("summarization")
summarizer(
    """
    I have lots of things to teach you now, in case we ever meet, concerning the message that was transmitted to me under a pine tree in North Carolina on a cold winter moonlit night.
    It said that Nothing Ever Happened, so don't worry. It's all like a dream. Everything is ecstasy, inside. We just don't know it because of our thinking-minds.
    But in our true blissful essence of mind is known that everything is alright forever and forever and forever. Close your eyes, let your hands and nerve-ends drop, stop breathing for 3 seconds,
    listen to the silence inside the illusion of the world, and you will remember the lesson you forgot, which was taught in immense milky way soft cloud innumerable worlds long ago and not even at all.
    It is all one vast awakened thing. I call it the golden eternity. It is perfect. We were never really born, we will never really die. It has nothing to do with the imaginary idea of a personal self,
    other selves, many selves everywhere: Self is only an idea, a mortal idea. That which passes into everything is one thing. It's a dream already ended. There's nothing to be afraid of and nothing to be
     glad about. I know this from staring at mountains months on end. They never show any expression, they are like empty space. Do you think the emptiness of space will ever crumble away?
     Mountains will crumble, but the emptiness of space, which is the one universal essence of mind, the vast awakenerhood, empty and awake, will never crumble away because it was never born.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': " It's all like a dream. Everything is ecstasy, inside. We just don't know it because of our thinking-minds . We were never really born, we will never really die. It has nothing to do with the imaginary idea of a personal self . Self is only an idea, a mortal idea ."}]

### Translation

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("""
Parmi les écrits en français de Kerouac, on trouve des lettres adressées à sa mère, quelques prières, dont le Notre Père,
qu'il a récité des milliers de fois dans son enfance et qu'il a pris le temps de recopier, et surtout de nombreuses nouvelles plus ou moins terminées ainsi
qu'un roman inachevé de 59 pages intitulé Old Bull in the Bowery, écrit en français à Mexico en 1952, auquel il manque quelques pages.
""")

[{'translation_text': "Among Kerouac's writings in French are letters addressed to his mother, some prayers, including the Our Father, which he has recited thousands of times in his childhood and which he took the time to copy, and especially many more or less completed news, as well as an unfinished 59-page novel entitled Old Bull in the Bowery, written in French in Mexico City in 1952, which is missing a few pages."}]

### Feature extraction - clinical notes

In [None]:
getembeddings = pipeline("feature-extraction", model="yikuan8/Clinical-Longformer")

Some weights of LongformerModel were not initialized from the model checkpoint at yikuan8/Clinical-Longformer and are newly initialized: ['longformer.pooler.dense.bias', 'longformer.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
embedding = getembeddings("""
Chief Complaint:
Patient is a 58-year-old male presenting with acute chest pain and shortness of breath that began approximately 2 hours prior to admission.

History of Present Illness:
The patient describes the pain as a pressing sensation behind the sternum, radiating to the left arm.
The pain was preceded by episodes of mild exertional dyspnea over the past week, which the patient did not seek medical attention for.
Denies associated symptoms of palpitations, dizziness, or loss of consciousness. Past medical history is significant for hypertension and
type 2 diabetes mellitus. The patient is a current smoker, with a 30-pack-year history, and reports occasional alcohol use.

Physical Examination:

General: The patient is alert and oriented, in mild distress due to pain.
Vital Signs: Blood pressure 160/90 mmHg, heart rate 110 bpm, respiratory rate 20 breaths/min, temperature 37.1°C, oxygen saturation 94% on room air.
Cardiovascular: Tachycardic regular rhythm, no murmurs, rubs, or gallops noted. Jugular venous pressure is not elevated.
Respiratory: Mild bilateral basilar crackles, no wheezes or stridor.
Abdomen: Soft, non-distended, with no tenderness or guarding.
Extremities: No edema, cyanosis, or clubbing. Pulses are palpable and symmetrical.
Assessment/Plan:
The clinical presentation is suggestive of an acute coronary syndrome, possibly a myocardial infarction.
Immediate steps include administration of aspirin, nitroglycerin, and morphine for pain control.
Additional diagnostic tests ordered include a 12-lead ECG, chest X-ray, and cardiac biomarkers.
A cardiology consult has been requested for further evaluation and management.
The patient has been advised to remain NPO (nothing by mouth) pending further evaluation.

""")  # synthetic clinical note, generated with GPT

Input ids are automatically padded from 418 to 512 to be a multiple of `config.attention_window`: 512


In [None]:
embedding_array = np.array(embedding)
embedding_array.shape  # (batch of 1, 418 tokens, 768 dimensions)
# i.e. a separate 768-dimensional embedding for each of the 418 tokens

(1, 418, 768)
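
To check the axis logic without downloading the model, here is a sketch with a synthetic array of the same shape as the one above. `mean_pool` is a hypothetical helper, and the random values only stand in for real embeddings.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average per-token vectors into one vector (hypothetical helper)."""
    arr = np.asarray(token_embeddings, dtype=np.float64)
    if arr.ndim == 3:
        if arr.shape[0] != 1:
            raise ValueError("expected a batch of exactly one document")
        arr = arr[0]  # drop the batch axis -> (tokens, dims)
    return arr.mean(axis=0)  # average over tokens -> (dims,)


# Synthetic stand-in with the same shape as the pipeline output above.
rng = np.random.default_rng(0)
fake_embedding = rng.normal(size=(1, 418, 768))

doc_vector = mean_pool(fake_embedding)
print(doc_vector.shape)
```

Averaging over the token axis gives one fixed-size vector per note, whatever the note's length.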

In [None]:
embedding_sq = np.squeeze(embedding_array, axis=0)  # drop the batch axis -> (418, 768)
embedding_avg = np.mean(embedding_sq, axis=0)  # average over tokens -> (768,)
embedding_avg, embedding_avg.shape

(array([-3.46009759e-02,  1.55596265e-01,  8.17671806e-02, -4.81228345e-03,
         2.54785595e-01,  1.09827688e-01,  7.33116215e-02,  1.52526702e-01,
         8.63002986e-02,  9.80462407e-02, -4.82862532e-02, -1.54222523e-01,
         2.61210673e-02,  1.17844121e-01,  2.03454963e-02, -2.54022934e-01,
         1.38637488e-01, -1.96570461e-01, -5.42501210e-02, -1.76759881e-01,
        -2.05944780e-02,  1.03985821e-01, -6.45535437e-02,  7.45781291e-02,
         3.21862262e-02,  3.03095159e-02, -2.89332050e-03,  3.11585434e-02,
         1.03879462e-01, -9.39867108e-03, -7.93160855e-02, -4.18044249e-02,
         3.51665218e-02, -2.32365232e-02,  2.23304626e-02,  6.10091227e-02,
         1.21380387e-02,  6.07826935e-03,  2.93582144e-01,  4.35516000e-03,
        -1.39483555e-02,  2.11589066e-01, -1.13473774e-01,  4.38919656e-02,
        -3.12074131e-02, -6.55760327e-02, -1.00282163e-01, -7.85054326e-02,
         2.05545377e-02,  7.78504587e-02, -3.13456294e-02, -2.73423467e-02,
        -2.1