In [None]:
# !pip install transformers

## Working with language models and tokenizers


### Load and use a pretrained BERT model

In [2]:
# Load the tokenizer for Google's pretrained BERT model (bert-base-uncased)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [3]:
text = "using transformers is easy!"
tokenizer(text)

{'input_ids': [101, 2478, 19081, 2003, 3733, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

**input_ids** holds the vocabulary ID of each token, and **token_type_ids** marks which segment each token belongs to (0 for the first sequence, 1 for the second), which matters for sentence-pair inputs.

**attention_mask** is a mask of 0s and 1s that tells the model which positions are real tokens (1) and which are padding (0), so that padding is ignored in the attention computation. Here every position is 1 because a single sequence needs no padding.

The BERT tokenizer adds a `[CLS]` token to the beginning and a `[SEP]` token to the end of the sequence; these are the IDs 101 and 102 in the output above.
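
To make the role of `attention_mask` concrete, here is a minimal pure-Python sketch of padding a batch and building the matching mask. The helper `pad_batch` and the second, shorter ID list are hypothetical, and the pad ID of 0 is assumed to correspond to `[PAD]` in the `bert-base-uncased` vocabulary. In practice the tokenizer can do this itself when called with `padding=True`.

```python
def pad_batch(batch, pad_id=0):
    """Pad every ID list to the longest length and build the attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 2478, 19081, 2003, 3733, 999, 102],  # IDs from the output above
    [101, 3733, 999, 102],                     # shorter, made-up second sequence
]
padded = pad_batch(batch)
print(padded["attention_mask"])
```

The shorter sequence ends up with trailing 0s in its mask, which is how the model knows to skip those positions.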


The same tokenizer works with both PyTorch- and TensorFlow-based Transformer models. To get tensors for a specific framework, pass `return_tensors="pt"` (PyTorch) or `return_tensors="tf"` (TensorFlow).

In [4]:
encoded_input = tokenizer(text, return_tensors="pt")

In [6]:
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-uncased')

In [7]:
output = model(**encoded_input)

In [8]:
from transformers import BertTokenizer, TFBertModel

# Load the TensorFlow version of the same checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

text = "Using transformers is easy!"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(**encoded_input)

In [9]:
from transformers import pipeline

# Fill-mask pipeline: predicts the most likely tokens for the [MASK] position
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("The man worked as a [MASK].")

[{'score': 0.09747565537691116,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the man worked as a carpenter.'},
 {'score': 0.052383214235305786,
  'token': 15610,
  'token_str': 'waiter',
  'sequence': 'the man worked as a waiter.'},
 {'score': 0.04962708428502083,
  'token': 13362,
  'token_str': 'barber',
  'sequence': 'the man worked as a barber.'},
 {'score': 0.037886086851358414,
  'token': 15893,
  'token_str': 'mechanic',
  'sequence': 'the man worked as a mechanic.'},
 {'score': 0.037680841982364655,
  'token': 18968,
  'token_str': 'salesman',
  'sequence': 'the man worked as a salesman.'}]

In [12]:
import pandas as pd

pd.DataFrame(unmasker("The man worked as a [MASK]."))

| | score | token | token_str | sequence |
|---|---|---|---|---|
| 0 | 0.097476 | 10533 | carpenter | the man worked as a carpenter. |
| 1 | 0.052383 | 15610 | waiter | the man worked as a waiter. |
| 2 | 0.049627 | 13362 | barber | the man worked as a barber. |
| 3 | 0.037886 | 15893 | mechanic | the man worked as a mechanic. |
| 4 | 0.037681 | 18968 | salesman | the man worked as a salesman. |


### Working with community-provided models

In [13]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sequence_to_classify = "I am going to france."

candidate_labels = ['travel', 'cooking', 'dancing']

classifier(sequence_to_classify, candidate_labels)

{'sequence': 'I am going to france.',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9866883754730225, 0.007197572849690914, 0.006114031653851271]}

### Working with multimodal transformers

In [24]:
from PIL import Image
import requests

url = "http://images.cocodataset.org/test-stuff2017/000000000448.jpg"
image = Image.open(requests.get(url, stream=True).raw)

In [None]:
image

In [16]:
prompt = "a photo of a "
class_names = ["fighting", "meeting"]

inputs = [prompt + class_name for class_name in class_names]


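### CLIPProcessor and CLIPModel

These are the components of Hugging Face's transformers library for preprocessing inputs for, and running, OpenAI's CLIP model. CLIP (Contrastive Language-Image Pretraining) is a multimodal model that learns the relationship between images and text and can link the two.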

In [25]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

# Load the model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of a "

class_names = ["fighting", "meeting"]

inputs = [prompt + class_name for class_name in class_names]

# Preprocess the inputs
inputs = processor(text=inputs, images=image, return_tensors="pt", padding=True)

# Run the model
outputs = model(**inputs)

# Get similarity scores
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)  # Convert to probabilities

print("Similarity probabilities:", probs)

Similarity probabilities: tensor([[4.5135e-04, 9.9955e-01]], grad_fn=<SoftmaxBackward0>)
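
`logits_per_image` holds one image-text similarity score per prompt, and the softmax turns those scores into probabilities that sum to 1. A minimal sketch of that step in plain Python follows; the two logit values are made up for illustration and are not the model's actual outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [15.0, 22.7]  # hypothetical similarity scores for two prompts
probs = softmax(logits)
print(probs)
```

A gap of a few points between logits is enough to push almost all of the probability onto one prompt, which is why the output above is so close to 0 and 1.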


### Working with benchmarks and datasets

transfer learning (TL)  
multitask learning (MTL)  
sequential transfer learning (STL)  


#### Benchmarks

Widely used benchmarks include General Language Understanding Evaluation (GLUE),  
Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME),  
and the Stanford Question Answering Dataset (SQuAD).

In [None]:
# !pip install datasets

In [27]:
from datasets import load_dataset

cola = load_dataset("glue", "cola")
cola['train'][25:28]

{'sentence': ['Harry coughed himself.',
  'Harry coughed us into a fit.',
  'Bill followed the road into the forest.'],
 'label': [0, 0, 1],
 'idx': [25, 26, 27]}

In [28]:
cola

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [30]:
cola['train'][12]

{'sentence': 'Bill rolled out of the room.', 'label': 1, 'idx': 12}

In [32]:
cola['validation'][68]

{'sentence': 'Which report that John was incompetent did he submit?',
 'label': 0,
 'idx': 68}

In [34]:
cola['test'][20]

{'sentence': 'Has John seen Mary?', 'label': -1, 'idx': 20}

In [37]:

sst2 = load_dataset('glue', 'sst2')

In [38]:
mrpc = load_dataset('glue', 'mrpc')

In [None]:
# !pip install --upgrade datasets

In [None]:
# !pip install datasets==1.18.4

In [None]:
# from pprint import pprint
# from datasets import list_datasets, list_metrics

# all_datasets = list_datasets()
# metrics = list_metrics()

# print(f"{len(all_datasets)} datasets and {len(metrics)} metrics exist in the hub\n")
# pprint(all_datasets[:20], compact=True)
# pprint(metrics, compact=True)


In [7]:
from datasets import load_dataset

en_de = load_dataset('xtreme', 'MLQA.en.de')

In [8]:
en_de

DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4517
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 512
    })
})

In [10]:
# View dataset as a pandas data frame
import pandas as pd

pd.DataFrame(en_de['test'][0:4])

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 037e8929e7e4d2f949ffbabd10f0f860499ff7c9 | Cell culture | An established or immortalized cell line has a... | Woraus besteht die Linie? | {'answer_start': [31], 'text': ['cell']} |
| 1 | 4b36724f3cbde7c287bde512ff09194cbba7f932 | Cell culture | The 19th-century English physiologist Sydney R... | Wann hat Roux etwas von seiner Medullarplatte ... | {'answer_start': [232], 'text': ['1885']} |
| 2 | 13e58403df16d88b0e2c665953e89575704942d4 | TRIPS Agreement | After the Uruguay round, the GATT became the b... | Was muss ratifiziert werden, wenn ein Land ger... | {'answer_start': [131], 'text': ['TRIPS']} |
| 3 | d23b5372af1de9425a4ae313c01eb80764c910d8 | TRIPS Agreement | Since TRIPS came into force, it has been subje... | Welche Teile der Welt kritisierten das TRIPS a... | {'answer_start': [67], 'text': ['developing co... |


### Selecting, sorting, filtering

#### Split

The `split` argument selects which split of the data to load. If it is `None` (the default), `load_dataset` returns a `DatasetDict` with all splits (train, test, validation, or any others). If `split` is specified, it returns a single `Dataset` rather than a dictionary.

In [11]:
# The first 300 examples of train plus the last 30% of validation
cola = load_dataset('glue', 'cola', split='train[:300]+validation[-30%:]')


In [12]:
# The first 100 examples from train plus the first 100 from validation

split='train[:100]+validation[:100]'

# The first 50% of train plus the first 30% of validation

split='train[:50%]+validation[:30%]'

# The first 20% of train plus examples 30 to 49 (slice 30:50) from validation

split='train[:20%]+validation[30:50]'
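
The slice syntax can be read as ordinary Python slicing over each split, with percentages converted to row counts. The sketch below uses a hypothetical helper, `resolve_slice`, which is not part of `datasets`. It assumes percent boundaries are rounded to the nearest row, which is consistent with the row counts shown in this notebook (1043 validation rows, of which the last 30% comes to 313).

```python
import re


def resolve_slice(spec, num_rows):
    """Turn a slice like '[:300]' or '[-30%:]' into (start, stop) row indices."""
    m = re.fullmatch(r"\[(-?\d+%?)?:(-?\d+%?)?\]", spec)
    if m is None:
        raise ValueError(f"unsupported slice: {spec}")

    def to_abs(bound, default):
        if bound is None:
            return default
        if bound.endswith("%"):
            # Assumption: percentages are rounded to the nearest row
            value = round(int(bound[:-1]) * num_rows / 100)
        else:
            value = int(bound)
        # Negative boundaries count from the end, as in Python slicing
        if value < 0:
            value += num_rows
        return max(0, min(value, num_rows))

    return to_abs(m.group(1), 0), to_abs(m.group(2), num_rows)


print(resolve_slice("[:300]", 8551))   # train slice
print(resolve_slice("[-30%:]", 1043))  # validation slice
```

Under these assumptions, `train[:300]+validation[-30%:]` covers 300 train rows plus validation rows 730 through 1042.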

In [14]:
# Sorting

cola.sort('label')['label'][:15]



[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [15]:

cola.sort('label')['label'][-15:]



[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

#### Indexing

You can also access several rows using slice notation or a list of indices.

In [16]:
cola[[6, 19, 44]]

{'sentence': ['Fred watered the plants flat.',
  'The professor talked us into a stupor.',
  'The trolley rumbled through the tunnel.'],
 'label': [1, 1, 1],
 'idx': [6, 19, 44]}

In [17]:

cola[42:46]

{'sentence': ['They made him to exhaustion.',
  'They made him into a monster.',
  'The trolley rumbled through the tunnel.',
  'The wagon rumbled down the road.'],
 'label': [0, 1, 1, 1],
 'idx': [42, 43, 44, 45]}
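
Both forms return a dict of columns rather than a list of rows. A minimal pure-Python sketch of that column-oriented selection follows, using a hypothetical `select_rows` helper (not part of `datasets`) and three rows copied from the output above.

```python
def select_rows(columns, key):
    """Pick rows from a column-oriented table; key is a slice or a list of ints."""
    if isinstance(key, slice):
        return {name: values[key] for name, values in columns.items()}
    return {name: [values[i] for i in key] for name, values in columns.items()}


table = {
    "sentence": [
        "They made him to exhaustion.",
        "They made him into a monster.",
        "The trolley rumbled through the tunnel.",
    ],
    "label": [0, 1, 1],
}
print(select_rows(table, slice(0, 2)))  # slice notation
print(select_rows(table, [0, 2]))       # list of indices
```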

#### Shuffling



In [18]:
cola.shuffle(seed=42)[:3]

{'sentence': ['Lou forgot the umbrella in the closet.',
  'It is the problem that he is here.',
  'I met the person who left.'],
 'label': [1, 0, 1],
 'idx': [904, 1017, 885]}

#### Dataset Filter and Map Function

In [19]:
# Get 3 sentences that include the term "kick", using filter
from pprint import pprint

cola = load_dataset('glue', 'cola', split='train[:100%]+validation[-30%:]')
pprint(cola.filter(lambda s: "kick" in s['sentence'])["sentence"][:3])

['Jill kicked the ball from home plate to third base.',
 'Fred kicked the ball under the porch.',
 'Fred kicked the ball behind the tree.']


In [20]:
# Get 3 acceptable sentences (label 1)
pprint(cola.filter(lambda s: s['label'] == 1)["sentence"][:3])

["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]


In [21]:
# Get 3 acceptable sentences, looking up the label ID by its name
cola.filter(lambda s: s['label'] == cola.features['label'].str2int('acceptable'))["sentence"][:3]

["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]

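The filter calls above keep only the rows for which the function returns true. A plain-Python sketch of that behavior is shown below; `filter_rows` is a hypothetical helper that illustrates the semantics and is not how `datasets` implements filtering. The sentences and labels are copied from CoLA rows shown earlier in this notebook.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


rows = [
    {"sentence": "Harry coughed himself.", "label": 0},
    {"sentence": "Harry coughed us into a fit.", "label": 0},
    {"sentence": "Bill followed the road into the forest.", "label": 1},
]

# Filter on a substring, then on the label value
coughed = filter_rows(rows, lambda r: "cough" in r["sentence"])
acceptable = filter_rows(rows, lambda r: r["label"] == 1)
print(len(coughed), len(acceptable))
```
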
#### Processing data with map function

`datasets.Dataset.map()` applies a processing function to each example in the dataset and returns a new dataset with the modified content.
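
Conceptually, the dict returned by the function is merged into each example: new keys add columns and existing keys are overwritten. The sketch below illustrates those semantics in plain Python; `map_rows` is a hypothetical helper, not the library's implementation, and the two rows are copied from CoLA examples shown earlier.

```python
def map_rows(rows, function):
    """Apply function to each row and merge the returned dict into a copy of the row."""
    return [{**row, **function(row)} for row in rows]


rows = [
    {"sentence": "Bill rolled out of the room.", "label": 1},
    {"sentence": "Harry coughed himself.", "label": 0},
]

# A new key adds a column
with_len = map_rows(rows, lambda e: {"len": len(e["sentence"])})

# An existing key overwrites that column; other columns are left unchanged
cut = map_rows(with_len, lambda e: {"sentence": e["sentence"][:20]})
print(cut[0])
```

Note that `len` still holds the original sentence length after truncation, because the second function only returns a new `sentence`.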

In [22]:
# Example: add a new feature holding the sentence length in characters
cola_new = cola.map(lambda e: {'len': len(e['sentence'])})

In [23]:
cola_new

Dataset({
    features: ['sentence', 'label', 'idx', 'len'],
    num_rows: 8864
})

In [24]:
pprint(cola_new[0:3])

{'idx': [0, 1, 2],
 'label': [1, 1, 1],
 'len': [71, 49, 48],
 'sentence': ["Our friends won't buy this analysis, let alone the next one we "
              'propose.',
              "One more pseudo generalization and I'm giving up.",
              "One more pseudo generalization or I'm giving up."]}


In [25]:
# Truncate each sentence to its first 20 characters (the 'len' column is left unchanged)
cola_cut = cola_new.map(lambda e: {'sentence': e['sentence'][:20]})

In [26]:
pprint(cola_cut[:3])

{'idx': [0, 1, 2],
 'label': [1, 1, 1],
 'len': [71, 49, 48],
 'sentence': ["Our friends won't bu",
              'One more pseudo gene',
              'One more pseudo gene']}


### Working with Local Files

In [27]:
import os
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [28]:
os.getcwd()

'/content'

In [29]:
os.listdir("/content/drive/My Drive/")

['french',
 'Colab Notebooks',
 'work',
 'other',
 'papers',
 'Untitled presentation.gslides',
 'Bromate',
 'llms',
 'Monday discussion meeting agenda.gsheet']

#### Generic loading scripts are provided for local CSV, TXT, and JSON files

In [None]:
# The data folder contains a.csv, b.csv, and c.csv, each a random portion of the SST-2 dataset
from datasets import load_dataset

# Load a single tab-separated file
data1 = load_dataset('csv', data_files='./data/a.csv', delimiter="\t")

# Load several files into one split
data2 = load_dataset('csv', data_files=['./data/a.csv','./data/b.csv', './data/c.csv'], delimiter="\t")

# Map files to named splits
data3 = load_dataset('csv', data_files={'train':['./data/a.csv','./data/b.csv'], 'test':['./data/c.csv']}, delimiter="\t")
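
To try the calls above without the original files, a small tab-separated file can be generated with the standard library. This is only a sketch: the `write_tsv` helper, the `sentence` and `label` column names, and the two example rows are assumptions rather than the real SST-2 data, and a temporary directory stands in for `./data`.

```python
import csv
import os
import tempfile


def write_tsv(path, rows, header=("sentence", "label")):
    """Write rows to a tab-separated file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


data_dir = tempfile.mkdtemp()  # stand-in for ./data
path = os.path.join(data_dir, "a.csv")
write_tsv(path, [("a charming film", 1), ("dull and lifeless", 0)])  # made-up rows

# Read the file back to check that the tab delimiter round-trips
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f, delimiter="\t"))
print(loaded)
```

A file written this way has the layout that `delimiter="\t"` expects: one header line, then one tab-separated example per line.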