# Intro
Transformers examples
Frank Soboczenski, PhD
June 6th, 2024


## First Example: Classification

Here we look at classifying the following text:


What do you notice?

https://pubmed.ncbi.nlm.nih.gov/
https://pubmed.ncbi.nlm.nih.gov/28157742/


In [11]:
from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('gubartz/st_scibert_pubmed_rct')
model = AutoModel.from_pretrained('gubartz/st_scibert_pubmed_rct')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Sentence embeddings:
tensor([[ 0.7809,  1.1117,  0.6039,  ..., -0.2821, -0.9588,  0.1228],
        [ 1.0622,  1.1331,  0.8132,  ...,  0.1721, -1.0030, -0.0026]])
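To make the role of the attention mask in `mean_pooling` concrete, here is a minimal sketch in plain Python (no model download needed). The toy vectors and the helper name `masked_mean` are illustrative assumptions, not part of the notebook; the padding vector is deliberately exaggerated so its effect on a naive average is visible.

```python
def masked_mean(token_vectors, mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    count = max(sum(mask), 1e-9)
    return [total / count for total in totals]


# Toy "sentence": two real tokens followed by one padding token (made-up values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
# Naive average over all positions, padding included, for comparison.
naive = [sum(column) / len(token_vectors) for column in zip(*token_vectors)]

print("masked mean:", pooled)
print("naive mean: ", naive)
```

The masked version averages only the two real tokens, while the naive version is pulled towards the padding vector.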


In [3]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('textattack/bert-base-uncased-SST-2')

# Input text
text = "Purpose of review: Death from stroke has decreased over the past decade, with stroke now the fifth leading cause of death in the United States. In addition, the incidence of new and recurrent stroke is declining, likely because of the increased use of specific prevention medications, such as statins and antihypertensives. Despite these positive trends in incidence and mortality, many strokes remain preventable. The major modifiable risk factors are hypertension, diabetes mellitus, tobacco smoking, and hyperlipidemia, as well as lifestyle factors, such as obesity, poor diet/nutrition, and physical inactivity. This article reviews the current recommendations for the management of each of these modifiable risk factors. Recent findings: It has been documented that some blood pressure medications may increase variability of blood pressure and ultimately increase the risk for stroke. Stroke prevention typically includes antiplatelet therapy (unless an indication for anticoagulation exists), so the most recent evidence supporting use of these drugs is reviewed. In addition, emerging risk factors, such as obstructive sleep apnea, electronic cigarettes, and elevated lipoprotein (a), are discussed. Summary: Overall, secondary stroke prevention includes a multifactorial approach. This article incorporates evidence from guidelines and published studies and uses an illustrative case study throughout the article to provide examples of secondary prevention management of stroke risk factors."

# Tokenize input text
encoded_input = tokenizer(text, return_tensors='pt')

# Get model output (inference only, so gradient tracking is disabled)
with torch.no_grad():
    output = model(**encoded_input)

# Get classification result
logits = output.logits
predicted_class_id = torch.argmax(logits, dim=1).item()

# Print classification result
if predicted_class_id == 0:
    classification_result = "negative"
else:
    classification_result = "positive"

print(f"Classification result: {classification_result}")


Classification result: negative
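`torch.argmax` only reports which class scored highest, not by how much. A softmax over the logits turns them into probabilities, which can help judge how confident the model is. The sketch below uses plain Python and hypothetical logit values (the notebook does not print the real ones); `softmax` is a helper defined here for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.3, -0.9]  # hypothetical values, not taken from the model run
labels = ["negative", "positive"]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.3f}")
print("Predicted:", labels[best])
```

With the real model, the equivalent step would be `torch.softmax(logits, dim=1)` applied to `output.logits`.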


In [8]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = BertForSequenceClassification.from_pretrained('allenai/scibert_scivocab_cased')

# Input text
text = "Purpose of review: Death from stroke has decreased over the past decade, with stroke now the fifth leading cause of death in the United States. In addition, the incidence of new and recurrent stroke is declining, likely because of the increased use of specific prevention medications, such as statins and antihypertensives. Despite these positive trends in incidence and mortality, many strokes remain preventable. The major modifiable risk factors are hypertension, diabetes mellitus, tobacco smoking, and hyperlipidemia, as well as lifestyle factors, such as obesity, poor diet/nutrition, and physical inactivity. This article reviews the current recommendations for the management of each of these modifiable risk factors. Recent findings: It has been documented that some blood pressure medications may increase variability of blood pressure and ultimately increase the risk for stroke. Stroke prevention typically includes antiplatelet therapy (unless an indication for anticoagulation exists), so the most recent evidence supporting use of these drugs is reviewed. In addition, emerging risk factors, such as obstructive sleep apnea, electronic cigarettes, and elevated lipoprotein (a), are discussed. Summary: Overall, secondary stroke prevention includes a multifactorial approach. This article incorporates evidence from guidelines and published studies and uses an illustrative case study throughout the article to provide examples of secondary prevention management of stroke risk factors."

# Tokenize input text
encoded_input = tokenizer(text, return_tensors='pt')

# Get model output (inference only, so gradient tracking is disabled)
with torch.no_grad():
    output = model(**encoded_input)

# Get classification result
logits = output.logits
predicted_class_id = torch.argmax(logits, dim=1).item()

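# Print classification result
# Note: the "negative"/"positive" names are carried over from the SST-2 example above;
# this checkpoint's classifier head is newly initialized (see the warning in the output),
# so the predicted label carries no meaning until the model is fine-tuned.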
if predicted_class_id == 0:
    classification_result = "negative"
else:
    classification_result = "positive"

print(f"Classification result: {classification_result}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Classification result: negative
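The warning above says that `classifier.weight` and `classifier.bias` were newly initialized. The following sketch, in plain Python, illustrates why that makes the prediction arbitrary: the same input features pushed through differently seeded random linear heads can yield different labels. The feature vector, the helper name `random_head_prediction`, and the standard deviation of 0.02 are assumptions chosen for illustration.

```python
import random


def random_head_prediction(features, seed, num_labels=2):
    """Predict a label with a randomly initialized linear head (bias = 0)."""
    rng = random.Random(seed)
    logits = []
    for _ in range(num_labels):
        # One row of random weights per label (small Gaussian values, assumed).
        weights = [rng.gauss(0.0, 0.02) for _ in features]
        logits.append(sum(w * x for w, x in zip(weights, features)))
    return max(range(num_labels), key=logits.__getitem__)


features = [0.4, -1.1, 0.7, 0.2]  # hypothetical pooled representation of the text
predictions = [random_head_prediction(features, seed) for seed in range(20)]

print(predictions)
```

Each seed stands in for one fresh load of the untrained head; the label depends on the random draw rather than on the text.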


## Summarisation

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Shobhank-iiitdwd/BERT_summary")
model = AutoModelForSeq2SeqLM.from_pretrained("Shobhank-iiitdwd/BERT_summary")

# Input text
text = "Purpose of review: Death from stroke has decreased over the past decade, with stroke now the fifth leading cause of death in the United States. In addition, the incidence of new and recurrent stroke is declining, likely because of the increased use of specific prevention medications, such as statins and antihypertensives. Despite these positive trends in incidence and mortality, many strokes remain preventable. The major modifiable risk factors are hypertension, diabetes mellitus, tobacco smoking, and hyperlipidemia, as well as lifestyle factors, such as obesity, poor diet/nutrition, and physical inactivity. This article reviews the current recommendations for the management of each of these modifiable risk factors. Recent findings: It has been documented that some blood pressure medications may increase variability of blood pressure and ultimately increase the risk for stroke. Stroke prevention typically includes antiplatelet therapy (unless an indication for anticoagulation exists), so the most recent evidence supporting use of these drugs is reviewed. In addition, emerging risk factors, such as obstructive sleep apnea, electronic cigarettes, and elevated lipoprotein (a), are discussed. Summary: Overall, secondary stroke prevention includes a multifactorial approach. This article incorporates evidence from guidelines and published studies and uses an illustrative case study throughout the article to provide examples of secondary prevention management of stroke risk factors."

# Tokenize input text
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)

# Generate summary
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print summary
print("Summary:", summary)

Summary: stroke is the fifth leading cause of death in the u. s., with stroke now the fifth most common cause. the major modifiable risk factors include hypertension, diabetes mellitus, tobacco smoking, and hyperlipidemia.
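The call above uses `max_length=512, truncation=True`, so anything beyond the limit is dropped before the model sees it. For longer inputs, one common workaround is to split the text into overlapping chunks and summarise each chunk separately. This is a minimal sketch under stated assumptions: whitespace-separated words are used as a rough stand-in for tokens (real counts come from the tokenizer), and `chunk_words`, `max_words`, and `overlap` are hypothetical names.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into word chunks of at most max_words, sharing `overlap` words."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 700-word text so the example runs without any external data.
long_text = " ".join(f"w{i}" for i in range(700))
chunks = chunk_words(long_text)

for index, chunk in enumerate(chunks):
    print(index, len(chunk.split()), "words")
```

Each chunk could then be passed through the tokenizer and `model.generate` as in the cell above.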


In [10]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
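As the printed output shows, the summarization pipeline returns a list with one dictionary per input, and the text sits under the `summary_text` key. A short sketch of pulling the string out, using the output above copied in as a literal so it runs without loading the model:

```python
# Output of the pipeline call above, pasted as a literal.
result = [{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]

# One dictionary per input document; take the first (and only) one.
summary = result[0]["summary_text"]

print(summary)
print("Words in summary:", len(summary.split()))
```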


## Vision Transformers
### SDO

In [None]:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification
from PIL import Image
import requests

url = 'https://sdo.gsfc.nasa.gov/assets/gallery/preview/211_coronalhole.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained("kenobi/SDO_VT1")
model = AutoModelForImageClassification.from_pretrained("kenobi/SDO_VT1")
inputs = feature_extractor(images=image, return_tensors="pt")

outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the three fine-tuned classes (NASA_SDO_Coronal_Hole, NASA_SDO_Coronal_Loop or NASA_SDO_Solar_Flare)
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])



Predicted class: NASA_SDO_Coronal_Hole
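The cell above prints only the top class. To see how the other classes scored, the logits can be ranked and mapped through `id2label`. The sketch below uses plain Python with hypothetical logit values; the three class names come from the comment in the cell, but their index order here is an assumption (the real mapping is in `model.config.id2label`).

```python
# Class names from the notebook comment; the index order is assumed.
id2label = {
    0: "NASA_SDO_Coronal_Hole",
    1: "NASA_SDO_Coronal_Loop",
    2: "NASA_SDO_Solar_Flare",
}
logits = [2.1, 0.3, -1.0]  # hypothetical scores, not from the model run


def rank_labels(logits, id2label):
    """Return (label, logit) pairs sorted from highest to lowest score."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(id2label[i], logits[i]) for i in order]


ranked = rank_labels(logits, id2label)
for label, score in ranked:
    print(f"{label}: {score:+.2f}")
```

With the real model, `logits[0].tolist()` would supply the list of scores.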


### GeneLab

In [13]:
import urllib.request

def download_image(url, filename):
    try:
        # Define custom headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }

        # Create a request with custom headers
        req = urllib.request.Request(url, headers=headers)

        # Open the URL and read the content
        with urllib.request.urlopen(req) as response:
            img_data = response.read()

        # Write the content to a file
        with open(filename, 'wb') as handler:
            handler.write(img_data)

        print(f"Image '{filename}' downloaded successfully")
    except Exception as e:
        print(f"Error downloading the image '{filename}':", e)

# List of URLs and corresponding filenames
urls = [
    ('https://roosevelt.devron-systems.com/HF/P242_73665006707-A6_002_008_proj.tif', 'P242_73665006707-A6_002_008_proj.tif'),
    ('https://roosevelt.devron-systems.com/HF/P278_73668090728-A7_003_027_proj.tif', 'P278_73668090728-A7_003_027_proj.tif')
]

# Download each image
for url, filename in urls:
    download_image(url, filename)

Image 'P242_73665006707-A6_002_008_proj.tif' downloaded successfully
Image 'P278_73668090728-A7_003_027_proj.tif' downloaded successfully
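`download_image` reports success whenever the request returns data, even if the server sent back something other than an image (for example an HTML error page). A quick sanity check is to look at the first four bytes, which for classic TIFF files are `II*\x00` (little-endian) or `MM\x00*` (big-endian). This is a minimal sketch; `looks_like_tiff` is a hypothetical helper, and the demo writes two small temporary files rather than touching the downloaded images.

```python
import os
import tempfile

# Magic numbers for classic TIFF (little-endian and big-endian byte order).
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(path):
    """Return True if the file starts with a classic TIFF signature."""
    with open(path, "rb") as handle:
        return handle.read(4) in TIFF_MAGIC


# Demo on throwaway files: one with a TIFF header, one with HTML content.
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.tif")
    bad_path = os.path.join(tmp, "bad.tif")
    with open(good_path, "wb") as handle:
        handle.write(b"II*\x00" + b"\x00" * 8)
    with open(bad_path, "wb") as handle:
        handle.write(b"<html>Not found</html>")
    good_ok = looks_like_tiff(good_path)
    bad_ok = looks_like_tiff(bad_path)

print("good.tif:", good_ok)
print("bad.tif: ", bad_ok)
```

In the notebook, the same check could be applied to each `filename` after `download_image` returns.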


In [14]:
#!pip install transformers --quiet # uncomment this pip install for local use if you do not have transformers installed
from transformers import AutoFeatureExtractor, AutoModelForImageClassification
from PIL import Image

# Load the image
#image = Image.open('P242_73665006707-A6_002_008_proj.tif') #First Image
image = Image.open('P278_73668090728-A7_003_027_proj.tif')  #Second Image

# Convert grayscale image to RGB
image_rgb = image.convert("RGB")

# Load the pre-trained feature extractor and classification model
feature_extractor = AutoFeatureExtractor.from_pretrained("kenobi/NASA_GeneLab_MBT")
model = AutoModelForImageClassification.from_pretrained("kenobi/NASA_GeneLab_MBT")

# Extract features from the image
inputs = feature_extractor(images=image_rgb, return_tensors="pt")

# Perform classification
outputs = model(**inputs)
logits = outputs.logits

# Obtain the predicted class index and label
predicted_class_idx = logits.argmax(-1).item()
predicted_class_label = model.config.id2label[predicted_class_idx]

print("Predicted class:", predicted_class_label)

Predicted class: XRay_irradiated_Nuclei
