### 1. Training a new tokenizer from an old one

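#### Assembling a corpus
🤗 Transformers has a very simple API for training a new tokenizer with the same characteristics as an existing one: AutoTokenizer.train_new_from_iterator(). To see it in action, suppose we want to train GPT-2 from scratch in a language other than English. The first task is to gather a large amount of data in that language into a training corpus. To provide an example everyone can follow, we won't use a language like Russian or Chinese here, but rather a collection of Python source code, which can be seen as a specialized kind of English text.

The 🤗 Datasets library can help us assemble a corpus of Python source code. We'll use the load_dataset() function to download and cache the CodeSearchNet dataset. This dataset was created for the CodeSearchNet challenge and contains millions of functions from open source libraries on GitHub in several programming languages. Here, we load the Python part of this dataset: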


In [1]:
from datasets import load_dataset
from pprint import pprint

# Loading can take a few minutes, so grab a coffee or tea while you wait.
raw_datasets = load_dataset("code_search_net", "python")


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 412178
    })
    test: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 22176
    })
    validation: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 23107
    })
})

In [3]:
raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

In [4]:
pprint(raw_datasets["train"][0])

{'func_code_string': 'def findArgs(args, prefixes):\n'
                     '\t\t"""\n'
                     '\t\tExtracts the list of arguments that start with any '
                     'of the specified prefix values\n'
                     '\t\t"""\n'
                     '\t\treturn list([\n'
                     '\t\t\targ for arg in args\n'
                     '\t\t\tif len([p for p in prefixes if '
                     'arg.lower().startswith(p.lower())]) > 0\n'
                     '\t\t])',
 'func_code_tokens': ['def',
                      'findArgs',
                      '(',
                      'args',
                      ',',
                      'prefixes',
                      ')',
                      ':',
                      'return',
                      'list',
                      '(',
                      '[',
                      'arg',
                      'for',
                      'arg',
                      'in',
                      'args

In [5]:
print(raw_datasets["train"][123456]["whole_func_string"])

def check_result(running, recurse=False, highstate=None):
    '''
    Check the total return value of the run and determine if the running
    dict has any issues
    '''
    if not isinstance(running, dict):
        return False

    if not running:
        return False

    ret = True
    for state_id, state_result in six.iteritems(running):
        expected_type = dict
        # The __extend__ state is a list
        if "__extend__" == state_id:
            expected_type = list
        if not recurse and not isinstance(state_result, expected_type):
            ret = False
        if ret and isinstance(state_result, dict):
            result = state_result.get('result', _empty)
            if result is False:
                ret = False
            # only override return value if we are not already failed
            elif result is _empty and isinstance(state_result, dict) and ret:
                ret = check_result(
                    state_result, recurse=True, highstate=highstate

In [6]:
import psutil

# Process.memory_info() reports values in bytes, so we convert them to megabytes.
# The rss attribute is the resident set size: the portion of the process's memory held in RAM.
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 182.55 MB



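The first thing we need to do is turn the dataset into an iterator of lists of texts, for instance, a list of lists of texts. Using lists of texts lets the tokenizer train faster (on batches of texts rather than one text at a time), and the corpus should be an iterator if we want to avoid loading everything into memory at once. If your corpus is large, you can take advantage of the fact that 🤗 Datasets does not load everything into RAM but stores the elements of the dataset on disk.

Doing the following would create a list of lists of 1,000 texts each, but would load everything into memory: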

In [7]:
# Don't uncomment the following lines unless your dataset is small.
#training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]
#print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")
# RAM used: 2142.84 MB

In [8]:
#print(len(training_corpus))

In [9]:
#training_corpus[:2]

With a Python generator, we can keep Python from loading anything into memory until it is actually needed. To create such a generator, just replace the square brackets with parentheses:

In [10]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")
# RAM used: 167.87 MB

RAM used: 182.80 MB


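The code above doesn't fetch any elements of the dataset; it only creates an object you can use in a Python for loop. The texts are loaded only when needed (that is, when the for loop reaches the step that requires them), and only 1,000 texts are loaded at a time. This way, you won't exhaust your memory even when processing a huge dataset.

The problem with a generator object is that it can only be used once. So the code below does not print the list of 10 numbers twice, as we might expect: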
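
The lazy behavior described above can be observed without downloading anything. The following is a minimal sketch that uses a plain Python list as a stand-in for the dataset; the helper name `batched` and the `log` argument are assumptions made for illustration and are not part of 🤗 Datasets.

```python
def batched(items, batch_size, log):
    """Yield successive slices of `items`, recording each slice start in `log`."""
    for start in range(0, len(items), batch_size):
        log.append(start)  # runs only when this batch is actually requested
        yield items[start : start + batch_size]


texts = [f"def f{i}(): pass" for i in range(2500)]  # stand-in corpus, not the real dataset
accessed = []
corpus = batched(texts, 1000, accessed)

print(accessed)  # [] because creating the generator slices nothing
first = next(corpus)
print(len(first), accessed)  # 1000 [0]
remaining = [len(batch) for batch in corpus]
print(remaining, accessed)  # [1000, 500] [0, 1000, 2000]
```

Each slice is taken only when the consumer asks for the next batch, which is why memory use stays bounded by the batch size.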

In [11]:
gen = (i for i in range(10))
print(list(gen))
print(list(gen))


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


The first print statement outputs the 10 numbers, but the second outputs an empty list. That's why we define a function that returns a generator instead:

In [12]:
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = get_training_corpus()


In [13]:
corpus = next(training_corpus)

In [14]:
len(corpus)

1000

You can also define the generator inside a for loop using the yield statement:

In [15]:
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]
        
        
training_corpus = get_training_corpus()

This produces the same generator as before, but lets you use more complex logic than a comprehension-style generator expression allows.
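
If the corpus needs to be traversed more than once, one alternative to calling the function again is an object whose `__iter__` method creates a fresh generator each time. This is a sketch under assumptions: the class name `BatchedCorpus` is hypothetical, and it operates on a plain list rather than a 🤗 Datasets object.

```python
class BatchedCorpus:
    """Re-iterable view over `texts` in batches of `batch_size` (hypothetical helper)."""

    def __init__(self, texts, batch_size=1000):
        self.texts = texts
        self.batch_size = batch_size

    def __iter__(self):
        # A new generator is created on every iteration, so nothing is "used up".
        for start in range(0, len(self.texts), self.batch_size):
            yield self.texts[start : start + self.batch_size]


corpus = BatchedCorpus([f"text {i}" for i in range(25)], batch_size=10)
first_pass = [len(batch) for batch in corpus]
second_pass = [len(batch) for batch in corpus]
print(first_pass, second_pass)  # [10, 10, 5] [10, 10, 5]
```

Unlike a bare generator, the second loop sees the same batches as the first.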

#### Training a new tokenizer
Now that our corpus is an iterator of batches of texts, we are ready to train a new tokenizer. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):

In [16]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [17]:
old_tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

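Even though we are going to train a new tokenizer, it's a good idea to start from an existing one rather than entirely from scratch. This way, we won't have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2's, and the only thing that will change is the vocabulary, which is determined by training on our corpus.

First, let's look at how this tokenizer treats an example function: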

In [18]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens


['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

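This tokenizer has a few special symbols, such as Ġ and Ċ, which denote spaces and newlines, respectively. As the output shows, this is not very efficient: the tokenizer returns an individual token for each space, when it could group runs of spaces into a single token (groups of four or eight spaces are very common in source code). It also splits the function name a bit oddly, apparently because it is not used to words containing the _ character.

Let's train a new tokenizer and see whether it solves these issues. For this, we'll use the train_new_from_iterator() method: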
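
The Ġ and Ċ symbols come from the byte-to-character table that GPT-2's byte-level tokenizer uses so that every byte has a visible character. The sketch below reimplements that table following the scheme of the `bytes_to_unicode` helper in the GPT-2 reference code; it is written from memory for illustration, so treat the variable names as assumptions.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, ...) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
text = "a b\n    c"
visible = "".join(mapping[b] for b in text.encode("utf-8"))
print(visible)  # aĠbĊĠĠĠĠc
```

The space byte (32) lands on U+0120 (Ġ) and the newline byte (10) on U+010A (Ċ), which is why indentation shows up as runs of Ġ in the token lists.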

In [20]:
%%time
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)




CPU times: user 3min 17s, sys: 12 s, total: 3min 29s
Wall time: 1min 9s


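This command can take a while if your corpus is very large, but for this dataset of 1.6 GB of text it is very fast (1 minute 16 seconds on an AMD Ryzen 9 3900X CPU with 12 cores).

Note that AutoTokenizer.train_new_from_iterator() only works if the tokenizer you are using is a "fast" tokenizer. As you'll see in the next section, the 🤗 Transformers library contains two types of tokenizers: some are written purely in Python, and others (the fast ones) are backed by the 🤗 Tokenizers library, which is written in the Rust programming language. Python is the language most often used for data science and deep learning applications, but anything that needs to be parallelized to run fast has to be written in another language. For example, the matrix multiplications at the core of model computation are written in CUDA, an optimized C library for GPUs.

Training a brand new tokenizer in pure Python would be excruciatingly slow, which is why we developed the 🤗 Tokenizers library. Just as you didn't have to learn CUDA to run your model on a batch of inputs on a GPU, you don't need to learn Rust to use a fast tokenizer. The 🤗 Tokenizers library provides Python bindings for many methods that internally call code written in Rust, for example, to parallelize the training of a new tokenizer or, as we saw in Chapter 3, the tokenization of a batch of inputs.

Most Transformer models have a fast tokenizer available (with a few exceptions), and the AutoTokenizer API always selects the fast tokenizer when one is available. In the next section we'll look at some other special features of fast tokenizers, which are really useful for tasks like token classification and question answering. Before diving into that, however, let's try our new tokenizer on the previous example: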

In [21]:
tokens = tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

In [22]:
print(tokenizer)

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=52000, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}


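Here we again see the special symbols Ġ and Ċ that denote spaces and newlines, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, the ĊĠĠĠ token, which represents an indentation, and the Ġ""" token, which represents the three quotes that start a docstring. The tokenizer also splits the function name correctly on _. This is quite a compact representation; by comparison, the plain English tokenizer produces a longer token sequence on the same example: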

In [23]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))


27
36
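
To see why training on code produces tokens such as a whole indentation, it helps to recall that GPT-2's tokenizer is based on byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent pair of symbols. The following toy sketch works on characters rather than bytes and skips pre-tokenization, so it only illustrates the idea and is not how 🤗 Tokenizers is implemented; all function names are hypothetical.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_merges(texts, num_merges):
    """Learn up to `num_merges` merges from a list of strings (toy, character-level)."""
    sequences = [list(text) for text in texts]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself to stay deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in order, to a new string."""
    seq = list(text)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


# Tiny made-up corpus of Python-like text with four-space indentation.
texts = ["def f(x):\n    return x\n", "def g(y):\n    return y\n"] * 50
merges = learn_bpe_merges(texts, 30)
sample = "def h(z):\n    return z\n"
tokens = tokenize(sample, merges)
print(len(sample), len(tokens))
print(tokens)
```

Because indentation and keywords recur in every training text, their characters are merged early, and the new sample is covered by far fewer tokens than characters.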


In [24]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)
 
    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)


['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'randn',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĠĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'call',
 '__(',
 'self',
 ',',
 'Ġx',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn',
 'Ġx',
 'Ġ@',
 'Ġself',
 '.',
 'weights',
 'Ġ+',
 'Ġself',
 '.',
 'bias',
 'ĊĠĠĠĠ']

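In addition to the token corresponding to an indentation, here we can also see a token for a double indentation: ĊĠĠĠĠĠĠĠ. Special Python words such as class, init, call, self, and return are each tokenized as a single token, and names are split on _ and . as expected. The tokenizer also splits camel-cased names correctly: LinearLayer is tokenized as ["ĠLinear", "Layer"].

#### Saving the tokenizer
To make sure we can use it later, we need to save our new tokenizer. As with models, this is done with the save_pretrained() method: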

In [25]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

In [26]:
from huggingface_hub import notebook_login

notebook_login()



In [27]:
tokenizer.push_to_hub("code-search-net-tokenizer", use_temp_dir=True)

CommitInfo(commit_url='https://huggingface.co/hwang2006/code-search-net-tokenizer/commit/e8ab7fb55d2ac92ff1a090844b54cbf622e74753', commit_message='Upload tokenizer', commit_description='', oid='e8ab7fb55d2ac92ff1a090844b54cbf622e74753', pr_url=None, pr_revision=None, pr_num=None)

In [28]:
# To use the tokenizer you trained in this section yourself,
# replace "hwang2006" below with your own namespace.
tokenizer = AutoTokenizer.from_pretrained("hwang2006/code-search-net-tokenizer")

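### 2. Fast tokenizers' special powers
#### Batch encoding
The output of a tokenizer isn't a plain Python dictionary; what we get is actually a special BatchEncoding object. It's a subclass of dictionary (which is why we were able to index into the result without any problem before), but with additional methods that are mostly used by fast tokenizers.

Besides their parallelization capabilities, the key feature of fast tokenizers is that they always keep track of the span of the original text that each final token comes from. This is called offset mapping. It in turn unlocks features like mapping each word to the tokens it generated, or mapping each character of the original text to the token it belongs to, and vice versa.

Let's look at an example: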

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(encoding)
print(type(encoding))


{'input_ids': [101, 1422, 1271, 1110, 156, 7777, 2497, 1394, 1105, 146, 1250, 1120, 20164, 10932, 10289, 1107, 6010, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [30]:
encoding.is_fast

True

Let's see what a fast tokenizer lets us do. First, we can access the tokens without having to convert the token IDs back to tokens:

In [31]:
encoding.tokens()

['[CLS]',
 'My',
 'name',
 'is',
 'S',
 '##yl',
 '##va',
 '##in',
 'and',
 'I',
 'work',
 'at',
 'Hu',
 '##gging',
 'Face',
 'in',
 'Brooklyn',
 '.',
 '[SEP]']

In [32]:
encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

In [33]:
start, end = encoding.word_to_chars(3)
print(start, end)
example[start:end]


11 18


'Sylvain'

#### Inside the token-classification pipeline
In Chapter 1 we got our first taste of NER, the task of identifying which parts of a text correspond to entities such as persons, locations, or organizations, using the 🤗 Transformers pipeline() function. Then, in Chapter 2, we saw how a pipeline groups together the three stages needed to get predictions from raw text: tokenization, passing the inputs through the model, and post-processing. The first two steps of the token-classification pipeline are the same as in any other pipeline, but the post-processing is a little more complex. Let's see how!

#### Getting the base results with the pipeline
First, let's grab a token-classification pipeline so we have some results to compare against manually. The model used by default is dbmdz/bert-large-cased-finetuned-conll03-english, which performs NER on sentences:

In [34]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-PER',
  'score': 0.99938285,
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.99815494,
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.99590707,
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.99923277,
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': 0.9738931,
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': 0.976115,
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.9887976,
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.9932106,
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [35]:
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")



[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [36]:
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="average")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")



[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9819008,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

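The aggregation_strategy chosen changes the score computed for each grouped entity. With "simple", the score is the mean of the scores of the tokens in that entity: for example, the score of "Sylvain" is the mean of the scores we saw in the previous example for the tokens S, ##yl, ##va, and ##in. The other strategies available are:

- "first", where the score of each entity is the score of its first token (so for "Sylvain" it would be 0.9993828, the score of the token S)
- "max", where the score of each entity is the maximum score of the tokens in that entity (so for "Hugging Face" it would be 0.9887976, the score of "Face")
- "average", where the score of each entity is the average of the scores of the words (not tokens) composing that entity (so for "Sylvain" there is no difference from the "simple" strategy, but "Hugging Face" gets a score of 0.9819, the average of 0.975 for "Hugging" and 0.98879 for "Face")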

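The difference between the "simple" and "average" strategies can be checked by hand on the scores printed earlier. This is a minimal sketch that follows the descriptions in the text, not the library's own code; the function names are hypothetical, and the word ids (8 for "Hugging", 9 for "Face") are taken from the encoding.word_ids() output shown earlier.

```python
from statistics import mean

# (word_id, score) for the tokens Hu, ##gging, Face, copied from the pipeline output above.
hugging_face = [(8, 0.9738931), (8, 0.976115), (9, 0.9887976)]


def simple_score(tokens):
    """Mean over all tokens of the entity."""
    return mean(score for _, score in tokens)


def average_score(tokens):
    """Mean over words, where each word's score is the mean of its tokens."""
    by_word = {}
    for word_id, score in tokens:
        by_word.setdefault(word_id, []).append(score)
    return mean(mean(scores) for scores in by_word.values())


print(round(simple_score(hugging_face), 4))   # 0.9796
print(round(average_score(hugging_face), 4))  # 0.9819
```

The two values match the "Hugging Face" scores reported by the pipeline with aggregation_strategy="simple" and aggregation_strategy="average" respectively.
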
Now let's see how to obtain these results without using the pipeline() function!

In [37]:
# Named Entity Recognition (NER)
from transformers import pipeline

token_classifier = pipeline("ner", grouped_entities=True)
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [38]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)



In [39]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])


In [40]:
print(inputs.tokens())

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']


We have a batch containing 1 sequence of 19 tokens, and the model has 9 different labels, so the model's output has shape 1 x 19 x 9. As with the text-classification pipeline, we use a softmax function to convert the logits to probabilities and take the argmax to get predictions (we can take the argmax directly on the logits because softmax does not change their order):

In [41]:
from pprint import pprint
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(probabilities)
print(predictions)

[[0.9994322657585144, 1.6470283298986033e-05, 3.4266999136889353e-05, 1.6042311472119763e-05, 8.250683458754793e-05, 2.1382273189374246e-05, 0.00015649090346414596, 1.965209776244592e-05, 0.00022089220874477178], [0.9989631175994873, 1.8515736883273348e-05, 5.240452446741983e-05, 1.253474511031527e-05, 0.0004347366339061409, 3.087432560278103e-05, 0.00031468752422370017, 2.78607003565412e-05, 0.00014510865730699152], [0.999708354473114, 8.30812678032089e-06, 2.8745640520355664e-05, 5.650358161801705e-06, 8.69486466399394e-05, 9.783458153833635e-06, 6.786145240766928e-05, 1.1793980775109958e-05, 7.241900311782956e-05], [0.9998350143432617, 5.645536475640256e-06, 1.3955165741208475e-05, 4.3133732106070966e-06, 4.017691026092507e-05, 8.123070074361749e-06, 5.6484961532987654e-05, 8.99163478607079e-06, 2.7239138944423757e-05], [0.00018333422485738993, 2.5156617994070984e-05, 4.8462032282259315e-05, 1.4900553651386872e-05, 0.9993828535079956, 1.99977403099183e-05, 0.00011153621017001569, 1.

In [42]:
outputs.logits.argmax(dim=-1)

tensor([[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]])
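
The claim that softmax does not change the order of the logits can be verified with a few lines of plain Python. The logits below are made-up values for a single token, used only for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.3, 4.5, 0.0, 2.1]  # made-up logits for one token
probs = softmax(logits)
print(argmax(logits), argmax(probs))  # 2 2
```

Because the exponential is strictly increasing and every value is divided by the same total, the largest logit always maps to the largest probability.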

In [43]:
torch.nn.functional.softmax(outputs.logits, dim=-1)

tensor([[[9.9943e-01, 1.6470e-05, 3.4267e-05, 1.6042e-05, 8.2507e-05,
          2.1382e-05, 1.5649e-04, 1.9652e-05, 2.2089e-04],
         [9.9896e-01, 1.8516e-05, 5.2405e-05, 1.2535e-05, 4.3474e-04,
          3.0874e-05, 3.1469e-04, 2.7861e-05, 1.4511e-04],
         [9.9971e-01, 8.3081e-06, 2.8746e-05, 5.6504e-06, 8.6949e-05,
          9.7835e-06, 6.7861e-05, 1.1794e-05, 7.2419e-05],
         [9.9984e-01, 5.6455e-06, 1.3955e-05, 4.3134e-06, 4.0177e-05,
          8.1231e-06, 5.6485e-05, 8.9916e-06, 2.7239e-05],
         [1.8333e-04, 2.5157e-05, 4.8462e-05, 1.4901e-05, 9.9938e-01,
          1.9998e-05, 1.1154e-04, 1.0791e-05, 2.0289e-04],
         [6.4403e-04, 7.4379e-05, 1.3197e-04, 3.4720e-05, 9.9815e-01,
          3.3830e-05, 5.4382e-04, 1.9978e-05, 3.6245e-04],
         [1.6408e-03, 9.4695e-05, 2.7364e-04, 4.4406e-05, 9.9591e-01,
          5.1262e-05, 1.2788e-03, 3.2835e-05, 6.7633e-04],
         [2.2902e-04, 2.5183e-05, 5.7899e-05, 9.9570e-06, 9.9923e-01,
          1.7655e-05, 2.344

In [44]:
model.config

BertConfig {
  "_name_or_path": "dbmdz/bert-large-cased-finetuned-conll03-english",
  "_num_labels": 9,
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "O",
    "1": "B-MISC",
    "2": "I-MISC",
    "3": "B-PER",
    "4": "I-PER",
    "5": "B-ORG",
    "6": "I-ORG",
    "7": "B-LOC",
    "8": "I-LOC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "B-LOC": 7,
    "B-MISC": 1,
    "B-ORG": 5,
    "B-PER": 3,
    "I-LOC": 8,
    "I-MISC": 2,
    "I-ORG": 6,
    "I-PER": 4,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_s

In [45]:
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")

In [46]:
print(inputs.tokens())

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']


In [47]:
results = []
tokens = inputs.tokens()

# predictions: [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# probabilities.shape = [19, 9]
for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

pprint(results)


[{'entity': 'I-PER', 'score': 0.9993828535079956, 'word': 'S'},
 {'entity': 'I-PER', 'score': 0.9981548190116882, 'word': '##yl'},
 {'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va'},
 {'entity': 'I-PER', 'score': 0.9992327690124512, 'word': '##in'},
 {'entity': 'I-ORG', 'score': 0.9738931059837341, 'word': 'Hu'},
 {'entity': 'I-ORG', 'score': 0.9761149883270264, 'word': '##gging'},
 {'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face'},
 {'entity': 'I-LOC', 'score': 0.99321049451828, 'word': 'Brooklyn'}]


In [48]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 12),
 (12, 14),
 (14, 16),
 (16, 18),
 (19, 22),
 (23, 24),
 (25, 29),
 (30, 32),
 (33, 35),
 (35, 40),
 (41, 45),
 (46, 48),
 (49, 57),
 (57, 58),
 (0, 0)]

In [49]:
example[12:14]

'yl'
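
The offsets make it possible to rebuild word-level and character-level lookups by hand. The sketch below hard-codes the word_ids() and offset_mapping values printed above, so it runs without a tokenizer; the helper names `word_span` and `char_to_token_index` are hypothetical and only mimic the kind of lookup a fast tokenizer offers.

```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]
offsets = [
    (0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18),
    (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45),
    (46, 48), (49, 57), (57, 58), (0, 0),
]


def word_span(word_id):
    """Character span of a word: from its first token's start to its last token's end."""
    spans = [off for wid, off in zip(word_ids, offsets) if wid == word_id]
    return min(start for start, _ in spans), max(end for _, end in spans)


def char_to_token_index(char_index):
    """Index of the token covering `char_index`, or None (for example, for a space)."""
    for idx, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return idx
    return None


start, end = word_span(3)
print(start, end, example[start:end])  # 11 18 Sylvain
print(char_to_token_index(12))         # 5, the token ##yl
print(char_to_token_index(10))         # None, character 10 is a space
```

Special tokens carry the empty span (0, 0), so they never match a character position.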

In [50]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

# prediction : [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
for idx, pred in enumerate(predictions): 
    label = model.config.id2label[pred]
    if label != 'O':
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

pprint(results)


[{'end': 12,
  'entity': 'I-PER',
  'score': 0.9993828535079956,
  'start': 11,
  'word': 'S'},
 {'end': 14,
  'entity': 'I-PER',
  'score': 0.9981548190116882,
  'start': 12,
  'word': '##yl'},
 {'end': 16,
  'entity': 'I-PER',
  'score': 0.995907187461853,
  'start': 14,
  'word': '##va'},
 {'end': 18,
  'entity': 'I-PER',
  'score': 0.9992327690124512,
  'start': 16,
  'word': '##in'},
 {'end': 35,
  'entity': 'I-ORG',
  'score': 0.9738931059837341,
  'start': 33,
  'word': 'Hu'},
 {'end': 40,
  'entity': 'I-ORG',
  'score': 0.9761149883270264,
  'start': 35,
  'word': '##gging'},
 {'end': 45,
  'entity': 'I-ORG',
  'score': 0.9887974858283997,
  'start': 41,
  'word': 'Face'},
 {'end': 57,
  'entity': 'I-LOC',
  'score': 0.99321049451828,
  'start': 49,
  'word': 'Brooklyn'}]


In [51]:
example[33:45]

'Hugging Face'

In [52]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx]

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

pprint(results)

[{'end': 18,
  'entity_group': 'PER',
  'score': 0.998169407248497,
  'start': 11,
  'word': 'Sylvain'},
 {'end': 45,
  'entity_group': 'ORG',
  'score': 0.9796018600463867,
  'start': 33,
  'word': 'Hugging Face'},
 {'end': 57,
  'entity_group': 'LOC',
  'score': 0.99321049451828,
  'start': 49,
  'word': 'Brooklyn'}]
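
The grouping loop above can be tried without a model. The following is a minimal sketch, assuming the per-token labels, scores, and offsets are already available as plain Python lists; the function name `group_entities` and the toy inputs are hypothetical. Unlike the loop above, this variant also treats a `B-` label as the start of an entity and does not skip the token that follows a group.

```python
def group_entities(labels, scores, offsets, text):
    """Merge consecutive tokens of the same entity type into one span.

    labels  : per-token label strings such as "O", "B-PER", "I-PER"
    scores  : per-token probability of the predicted label
    offsets : per-token (start_char, end_char) pairs into `text`
    """
    results = []
    idx = 0
    while idx < len(labels):
        if labels[idx] == "O":
            idx += 1
            continue
        entity_type = labels[idx][2:]  # strip the "B-" or "I-" prefix
        start, end = offsets[idx]
        group_scores = [scores[idx]]
        idx += 1
        # Extend the span while the following tokens continue the same entity
        while idx < len(labels) and labels[idx] == f"I-{entity_type}":
            end = offsets[idx][1]
            group_scores.append(scores[idx])
            idx += 1
        results.append(
            {
                "entity_group": entity_type,
                "score": sum(group_scores) / len(group_scores),
                "word": text[start:end],
                "start": start,
                "end": end,
            }
        )
    return results


# Toy data (made up for illustration, not model output)
text = "Hugging Face in Brooklyn"
labels = ["I-ORG", "I-ORG", "I-ORG", "O", "I-LOC"]
scores = [0.9, 0.8, 1.0, 0.99, 0.95]
offsets = [(0, 2), (2, 7), (8, 12), (13, 15), (16, 24)]
grouped = group_entities(labels, scores, offsets, text)
print(grouped)
```

Because the word is sliced from the original text with the character offsets, the `##` prefixes and the space between "Hugging" and "Face" come out correctly without any decoding step.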


### 3. "Fast" tokenizers in the QA pipeline
#### Using the question-answering pipeline
As we saw in Chapter 1, we can use the question-answering pipeline like this to get the answer to a question:

In [53]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9802603125572205,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

In [54]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)


{'score': 0.9714871048927307,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

In [55]:
question_answerer

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x2b02c825cac0>

In [56]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)


In [57]:
pprint(inputs)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101,  5979,  1996,  3776,  9818,  1171,   100, 25267,   136,   102,
           100, 25267,  1110,  5534,  1118,  1103,  1210,  1211,  1927,  1996,
          3776,  9818,   783, 13612,   117,   153,  1183,  1942,  1766,  1732,
           117,  1105,  5157, 21484,  2271,  6737,   783,  1114,   170,  2343,
          1306,  2008,  9111,  1206,  1172,   119,  1135,   112,   188, 21546,
          1106,  2669,  1240,  3584,  1114,  1141,  1196, 10745,  1172,  1111,
          1107, 16792,  1114,  1103,  1168,   119,   102]])}


In [58]:
print(len(inputs.tokens()))
print(inputs.tokens())

67
['[CLS]', 'Which', 'deep', 'learning', 'libraries', 'back', '[UNK]', 'Transformers', '?', '[SEP]', '[UNK]', 'Transformers', 'is', 'backed', 'by', 'the', 'three', 'most', 'popular', 'deep', 'learning', 'libraries', '—', 'Jax', ',', 'P', '##y', '##T', '##or', '##ch', ',', 'and', 'Ten', '##sor', '##F', '##low', '—', 'with', 'a', 'sea', '##m', '##less', 'integration', 'between', 'them', '.', 'It', "'", 's', 'straightforward', 'to', 'train', 'your', 'models', 'with', 'one', 'before', 'loading', 'them', 'for', 'in', '##ference', 'with', 'the', 'other', '.', '[SEP]']


In [59]:
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-4.4952, -6.4454, -4.7115, -7.0968, -7.0726, -7.4981, -5.5397, -4.1368,
         -5.9199, -5.4193, -1.5920, -1.0857, -5.0981, -2.9331, -3.4070,  2.2467,
          5.1563, -1.3602, -2.2209, -0.9686, -4.8112, -2.2527,  1.4383, 10.1211,
         -1.5311,  2.2685, -1.8951, -2.2108, -4.2142, -2.5571, -2.3252, -2.6046,
          1.7047, -1.9867, -1.7211, -0.5415, -2.0239, -4.4246, -5.1012, -4.4966,
         -7.8940, -6.7200, -4.6759, -6.3278, -4.8339, -5.1839, -3.3724, -7.4120,
         -8.1542, -4.4871, -7.4659, -4.3293, -4.2293, -3.1903, -7.9467, -5.2665,
         -7.5902, -5.0570, -7.4476, -7.9083, -6.5951, -7.4061, -8.8821, -7.6749,
         -6.9879, -7.0466, -5.4193]], grad_fn=<CloneBackward0>), end_logits=tensor([[-2.3958e+00, -7.0978e+00, -7.0745e+00, -6.3676e+00, -5.9532e+00,
         -7.9585e+00, -7.1869e+00, -3.6494e+00, -6.9677e+00, -5.1421e+00,
         -3.1757e+00, -1.1649e+00, -7.0748e+00, -5.2875e+00, -6.8611e+00,
 

In [60]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)


torch.Size([1, 67]) torch.Size([1, 67])


In [61]:
pprint(outputs)

QuestionAnsweringModelOutput(loss=None,
                             start_logits=tensor([[-4.4952, -6.4454, -4.7115, -7.0968, -7.0726, -7.4981, -5.5397, -4.1368,
         -5.9199, -5.4193, -1.5920, -1.0857, -5.0981, -2.9331, -3.4070,  2.2467,
          5.1563, -1.3602, -2.2209, -0.9686, -4.8112, -2.2527,  1.4383, 10.1211,
         -1.5311,  2.2685, -1.8951, -2.2108, -4.2142, -2.5571, -2.3252, -2.6046,
          1.7047, -1.9867, -1.7211, -0.5415, -2.0239, -4.4246, -5.1012, -4.4966,
         -7.8940, -6.7200, -4.6759, -6.3278, -4.8339, -5.1839, -3.3724, -7.4120,
         -8.1542, -4.4871, -7.4659, -4.3293, -4.2293, -3.1903, -7.9467, -5.2665,
         -7.5902, -5.0570, -7.4476, -7.9083, -6.5951, -7.4061, -8.8821, -7.6749,
         -6.9879, -7.0466, -5.4193]], grad_fn=<CloneBackward0>),
                             end_logits=tensor([[-2.3958e+00, -7.0978e+00, -7.0745e+00, -6.3676e+00, -5.9532e+00,
         -7.9585e+00, -7.1869e+00, -3.6494e+00, -6.9677e+00, -5.1421e+00,
         -3.1757e

In [62]:
print(start_logits.argmax(dim=-1))
print(start_logits.argmax(dim=-1).shape)
print(type(start_logits.argmax(dim=-1)))
print(end_logits.argmax(dim=-1)[0])
print(end_logits.argmax(dim=-1)[0].shape)
print(type(end_logits.argmax(dim=-1)[0]))

tensor([23])
torch.Size([1])
<class 'torch.Tensor'>
tensor(35)
torch.Size([])
<class 'torch.Tensor'>


In [63]:
print(inputs.tokens())

['[CLS]', 'Which', 'deep', 'learning', 'libraries', 'back', '[UNK]', 'Transformers', '?', '[SEP]', '[UNK]', 'Transformers', 'is', 'backed', 'by', 'the', 'three', 'most', 'popular', 'deep', 'learning', 'libraries', '—', 'Jax', ',', 'P', '##y', '##T', '##or', '##ch', ',', 'and', 'Ten', '##sor', '##F', '##low', '—', 'with', 'a', 'sea', '##m', '##less', 'integration', 'between', 'them', '.', 'It', "'", 's', 'straightforward', 'to', 'train', 'your', 'models', 'with', 'one', 'before', 'loading', 'them', 'for', 'in', '##ference', 'with', 'the', 'other', '.', '[SEP]']


In [64]:
print(inputs.tokens()[0:3])

['[CLS]', 'Which', 'deep']


In [65]:
ttt = (inputs.tokens())[0:3]
ttt

['[CLS]', 'Which', 'deep']

In [66]:
print((inputs.tokens())[23])
print((inputs.tokens())[35])
selected_elements = (inputs.tokens())[23:36]
print(selected_elements)

Jax
##low
['Jax', ',', 'P', '##y', '##T', '##or', '##ch', ',', 'and', 'Ten', '##sor', '##F', '##low']


In [67]:
print(inputs.sequence_ids())

[None, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


In [68]:
inputs

{'input_ids': tensor([[  101,  5979,  1996,  3776,  9818,  1171,   100, 25267,   136,   102,
           100, 25267,  1110,  5534,  1118,  1103,  1210,  1211,  1927,  1996,
          3776,  9818,   783, 13612,   117,   153,  1183,  1942,  1766,  1732,
           117,  1105,  5157, 21484,  2271,  6737,   783,  1114,   170,  2343,
          1306,  2008,  9111,  1206,  1172,   119,  1135,   112,   188, 21546,
          1106,  2669,  1240,  3584,  1114,  1141,  1196, 10745,  1172,  1111,
          1107, 16792,  1114,  1103,  1168,   119,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [69]:
mask = [i != 1 for i in inputs.sequence_ids()]
print(mask)
mask[0] = False
print(mask)

[True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]
[False, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]


In [70]:
import torch
print(torch.tensor(mask))
print(torch.tensor(mask)[None].shape) 
print(torch.tensor([mask]))
#print(torch.tensor([1, 2, 3], [3, 4, 5])[None])

tensor([False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False,  True])
torch.Size([1, 67])
tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, F

In [71]:
# print(torch.tensor([1, 2, 3], [3, 4, 5])[None])
# TypeError: tensor() takes 1 positional argument but 2 were given

In [72]:
print(torch.tensor([[1, 2, 3], [3, 4, 5]]))

tensor([[1, 2, 3],
        [3, 4, 5]])


In [73]:
start_logits

tensor([[-4.4952, -6.4454, -4.7115, -7.0968, -7.0726, -7.4981, -5.5397, -4.1368,
         -5.9199, -5.4193, -1.5920, -1.0857, -5.0981, -2.9331, -3.4070,  2.2467,
          5.1563, -1.3602, -2.2209, -0.9686, -4.8112, -2.2527,  1.4383, 10.1211,
         -1.5311,  2.2685, -1.8951, -2.2108, -4.2142, -2.5571, -2.3252, -2.6046,
          1.7047, -1.9867, -1.7211, -0.5415, -2.0239, -4.4246, -5.1012, -4.4966,
         -7.8940, -6.7200, -4.6759, -6.3278, -4.8339, -5.1839, -3.3724, -7.4120,
         -8.1542, -4.4871, -7.4659, -4.3293, -4.2293, -3.1903, -7.9467, -5.2665,
         -7.5902, -5.0570, -7.4476, -7.9083, -6.5951, -7.4061, -8.8821, -7.6749,
         -6.9879, -7.0466, -5.4193]], grad_fn=<CloneBackward0>)

In [74]:
import torch

sequence_ids = inputs.sequence_ids()
# Mask everything except the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Add a batch dimension with [None] so the mask is a 2D tensor matching the logits
mask = torch.tensor(mask)[None]  # equivalent to torch.tensor([mask])

start_logits[mask] = -10000
end_logits[mask] = -10000


In [75]:
start_logits

tensor([[-4.4952e+00, -1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04,
         -1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04,
         -1.5920e+00, -1.0857e+00, -5.0981e+00, -2.9331e+00, -3.4070e+00,
          2.2467e+00,  5.1563e+00, -1.3602e+00, -2.2209e+00, -9.6861e-01,
         -4.8112e+00, -2.2527e+00,  1.4383e+00,  1.0121e+01, -1.5311e+00,
          2.2685e+00, -1.8951e+00, -2.2108e+00, -4.2142e+00, -2.5571e+00,
         -2.3252e+00, -2.6046e+00,  1.7047e+00, -1.9867e+00, -1.7211e+00,
         -5.4148e-01, -2.0239e+00, -4.4246e+00, -5.1012e+00, -4.4966e+00,
         -7.8940e+00, -6.7200e+00, -4.6759e+00, -6.3278e+00, -4.8339e+00,
         -5.1839e+00, -3.3724e+00, -7.4120e+00, -8.1542e+00, -4.4871e+00,
         -7.4659e+00, -4.3293e+00, -4.2293e+00, -3.1903e+00, -7.9467e+00,
         -5.2665e+00, -7.5902e+00, -5.0570e+00, -7.4476e+00, -7.9083e+00,
         -6.5951e+00, -7.4061e+00, -8.8821e+00, -7.6749e+00, -6.9879e+00,
         -7.0466e+00, -1.0000e+04]], g

In [76]:
torch.nn.functional.softmax(start_logits[0], dim=-1)

tensor([4.4531e-07, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 8.1185e-06, 1.3470e-05,
        2.4368e-07, 2.1236e-06, 1.3220e-06, 3.7722e-04, 6.9219e-03, 1.0237e-05,
        4.3289e-06, 1.5143e-05, 3.2463e-07, 4.1933e-06, 1.6808e-04, 9.9179e-01,
        8.6288e-06, 3.8557e-04, 5.9956e-06, 4.3725e-06, 5.8977e-07, 3.0929e-06,
        3.8998e-06, 2.9493e-06, 2.1940e-04, 5.4713e-06, 7.1354e-06, 2.3212e-05,
        5.2711e-06, 4.7788e-07, 2.4291e-07, 4.4467e-07, 1.4879e-08, 4.8133e-08,
        3.7169e-07, 7.1242e-08, 3.1735e-07, 2.2365e-07, 1.3685e-06, 2.4093e-08,
        1.1470e-08, 4.4891e-07, 2.2828e-08, 5.2562e-07, 5.8092e-07, 1.6419e-06,
        1.4114e-08, 2.0591e-07, 2.0161e-08, 2.5390e-07, 2.3251e-08, 1.4667e-08,
        5.4533e-08, 2.4235e-08, 5.5390e-09, 1.8524e-08, 3.6818e-08, 3.4721e-08,
        0.0000e+00], grad_fn=<SoftmaxBackward0>)
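
The zeros at the masked positions come from the exponential underflowing: a logit of -10000 is so far below the largest logit that its softmax term rounds to exactly 0. The following plain-Python sketch illustrates this with made-up logits; the helper `softmax` is written here for illustration and is not the PyTorch implementation.

```python
import math


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.5, -6.4, 2.2, 10.1, -1.5]    # made-up values
mask = [False, True, False, False, True]  # True = position must not be predicted

# Replace masked logits with a large negative number, as in the cell above
masked_logits = [-10000.0 if m else x for x, m in zip(logits, mask)]
probs = softmax(masked_logits)
print(probs)
```

The unmasked positions still sum to 1, so masking removes the question tokens from consideration without otherwise changing the ranking of the context tokens.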

In [77]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

In [78]:
print(start_probabilities.shape)
print(start_probabilities)

torch.Size([67])
tensor([4.4531e-07, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 8.1185e-06, 1.3470e-05,
        2.4368e-07, 2.1236e-06, 1.3220e-06, 3.7722e-04, 6.9219e-03, 1.0237e-05,
        4.3289e-06, 1.5143e-05, 3.2463e-07, 4.1933e-06, 1.6808e-04, 9.9179e-01,
        8.6288e-06, 3.8557e-04, 5.9956e-06, 4.3725e-06, 5.8977e-07, 3.0929e-06,
        3.8998e-06, 2.9493e-06, 2.1940e-04, 5.4713e-06, 7.1354e-06, 2.3212e-05,
        5.2711e-06, 4.7788e-07, 2.4291e-07, 4.4467e-07, 1.4879e-08, 4.8133e-08,
        3.7169e-07, 7.1242e-08, 3.1735e-07, 2.2365e-07, 1.3685e-06, 2.4093e-08,
        1.1470e-08, 4.4891e-07, 2.2828e-08, 5.2562e-07, 5.8092e-07, 1.6419e-06,
        1.4114e-08, 2.0591e-07, 2.0161e-08, 2.5390e-07, 2.3251e-08, 1.4667e-08,
        5.4533e-08, 2.4235e-08, 5.5390e-09, 1.8524e-08, 3.6818e-08, 3.4721e-08,
        0.0000e+00], grad_fn=<SelectBackward0>)


In [79]:
a = torch.tensor([1, 2, 3])

b = a.view(a.shape[0], -1)
b

tensor([[1],
        [2],
        [3]])

In [80]:
print(start_probabilities[:, None].shape)
print(end_probabilities[None, :].shape)

torch.Size([67, 1])
torch.Size([1, 67])


In [81]:
print(start_probabilities.view(start_probabilities.shape[0],-1).shape)
print(end_probabilities.view(-1, end_probabilities.shape[0]).shape)

torch.Size([67, 1])
torch.Size([1, 67])


In [82]:
scores = start_probabilities[:, None] * end_probabilities[None, :]

In [83]:
scores.shape

torch.Size([67, 67])

In [84]:
torch.ones(3, 3).triu()

tensor([[1., 1., 1.],
        [0., 1., 1.],
        [0., 0., 1.]])

In [85]:
scores = torch.triu(scores)

In [86]:
scores.shape

torch.Size([67, 67])

In [87]:
print(scores.argmax())
scores.argmax().item()

tensor(1576)


1576

In [88]:
scores.shape

torch.Size([67, 67])

In [89]:
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])


tensor(0.9803, grad_fn=<SelectBackward0>)


In [90]:
print(scores.argmax().item() // 67)
print(scores.argmax().item() % 67)

23
35


In [91]:
scores[23,35]

tensor(0.9803, grad_fn=<SelectBackward0>)
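
The index arithmetic above works because `argmax()` on a 2D tensor returns a position in the flattened, row-major matrix: 1576 = 23 × 67 + 35. The following is a small plain-Python sketch of the same idea (score every start/end pair, discard pairs where the end precedes the start, then recover the two indices), using made-up probabilities and a hypothetical helper name `best_span`.

```python
def best_span(start_probs, end_probs):
    """Return (start_index, end_index, score) maximizing start * end with start <= end."""
    n = len(end_probs)
    # Flattened row-major score matrix; entries with end < start are zeroed,
    # which plays the role of torch.triu in the cells above
    flat = [
        start_probs[s] * end_probs[e] if s <= e else 0.0
        for s in range(len(start_probs))
        for e in range(n)
    ]
    max_index = max(range(len(flat)), key=flat.__getitem__)
    # Row = start index, column = end index
    start_index, end_index = divmod(max_index, n)
    return start_index, end_index, flat[max_index]


start_probs = [0.1, 0.7, 0.2]  # made-up values
end_probs = [0.6, 0.1, 0.3]
span = best_span(start_probs, end_probs)
print(span)
print(divmod(1576, 67))  # the flat index from the cells above
```

In this toy example the unconstrained maximum would be start 1 with end 0, which is not a valid span; zeroing the lower triangle is what makes the result (1, 2).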

In [92]:
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]


In [93]:
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index]
}
print(result)


{'answer': 'Jax, PyTorch, and TensorFlow', 'start': 78, 'end': 106, 'score': tensor(0.9803, grad_fn=<SelectBackward0>)}


In [94]:
print(inputs_with_offsets.tokens()[start_index]) #23
print(inputs_with_offsets.tokens()[end_index]) #35
inputs_with_offsets.tokens()

Jax
##low


['[CLS]',
 'Which',
 'deep',
 'learning',
 'libraries',
 'back',
 '[UNK]',
 'Transformers',
 '?',
 '[SEP]',
 '[UNK]',
 'Transformers',
 'is',
 'backed',
 'by',
 'the',
 'three',
 'most',
 'popular',
 'deep',
 'learning',
 'libraries',
 '—',
 'Jax',
 ',',
 'P',
 '##y',
 '##T',
 '##or',
 '##ch',
 ',',
 'and',
 'Ten',
 '##sor',
 '##F',
 '##low',
 '—',
 'with',
 'a',
 'sea',
 '##m',
 '##less',
 'integration',
 'between',
 'them',
 '.',
 'It',
 "'",
 's',
 'straightforward',
 'to',
 'train',
 'your',
 'models',
 'with',
 'one',
 'before',
 'loading',
 'them',
 'for',
 'in',
 '##ference',
 'with',
 'the',
 'other',
 '.',
 '[SEP]']

#### Try it out with top 5

In [95]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context, top_k=5)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.9802603125572205,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.008247792720794678,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.0013677021488547325,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'},
 {'score': 0.00038108628359623253,
  'start': 83,
  'end': 106,
  'answer': 'PyTorch, and TensorFlow'},
 {'score': 0.000216845452087, 'start': 96, 'end': 106, 'answer': 'TensorFlow'}]

In [96]:
torch.topk(scores.flatten(), 5)

torch.return_types.topk(
values=tensor([9.8026e-01, 8.2478e-03, 6.8414e-03, 1.3677e-03, 3.8109e-04],
       grad_fn=<TopkBackward0>),
indices=tensor([1576, 1577, 1107, 1570, 1710]))

In [97]:
import torch
from pprint import pprint

value_5, index_5 = torch.topk(scores.flatten(), 5)
top5_indexes = [((i//scores.shape[1]).item(), (i%scores.shape[1]).item()) for i in index_5]

results = []
for s_i, e_i in top5_indexes:
    start_char, _ = offsets[s_i]
    _, end_char = offsets[e_i]
    result = {
        "answer": context[start_char:end_char],
        "start": start_char,
        "end": end_char,
        "score": scores[s_i, e_i].item()
    }
    results.append(result)
pprint(results)


[{'answer': 'Jax, PyTorch, and TensorFlow',
  'end': 106,
  'score': 0.9802601933479309,
  'start': 78},
 {'answer': 'Jax, PyTorch, and TensorFlow —',
  'end': 108,
  'score': 0.008247792720794678,
  'start': 78},
 {'answer': 'three most popular deep learning libraries — Jax, PyTorch, and '
            'TensorFlow',
  'end': 106,
  'score': 0.006841439288109541,
  'start': 33},
 {'answer': 'Jax, PyTorch',
  'end': 90,
  'score': 0.0013677021488547325,
  'start': 78},
 {'answer': 'PyTorch, and TensorFlow',
  'end': 106,
  'score': 0.0003810862544924021,
  'start': 83}]
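
The manual top 5 contains a long answer starting at "three most popular..." that the pipeline's top 5 does not. A plausible explanation, stated here as an assumption rather than verified against the pipeline source, is the pipeline's `max_answer_len` parameter (documented default: 15 tokens): that span runs from token 16 to token 35, which is 20 tokens. The sketch below shows how such a length filter changes a top-k ranking, using made-up probabilities and a hypothetical helper `top_k_spans`.

```python
import heapq


def top_k_spans(start_probs, end_probs, k, max_answer_len=None):
    """Return the k best (score, start, end) triples, optionally limiting span length."""
    candidates = []
    for s, p_start in enumerate(start_probs):
        for e, p_end in enumerate(end_probs):
            if e < s:
                continue  # the end cannot come before the start
            if max_answer_len is not None and e - s + 1 > max_answer_len:
                continue  # span is longer than allowed
            candidates.append((p_start * p_end, s, e))
    return heapq.nlargest(k, candidates)


start_probs = [0.5, 0.1, 0.4]  # made-up values
end_probs = [0.1, 0.2, 0.7]
unfiltered = top_k_spans(start_probs, end_probs, k=2)
filtered = top_k_spans(start_probs, end_probs, k=2, max_answer_len=2)
print(unfiltered)
print(filtered)
```

With the length limit, the three-token span (0, 2) drops out and the next-best candidate takes its place, which mirrors the difference between the two top-5 lists above.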


#### Handling long contexts
If we tokenize the question together with the long context used as an example above, we get more tokens than the maximum length used in the question-answering pipeline (384):

In [98]:
inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))

461


In [99]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-cased-distilled-squad', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [100]:
inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to learn. - A unified A

In [101]:
print(len(inputs["input_ids"]))

384


In [102]:
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=8, stride=2
)
pprint(inputs)
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))


{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1]],
 'input_ids': [[101, 1188, 5650, 1110, 1136, 1315, 1263, 102],
               [101, 1315, 1263, 1133, 1195, 1132, 1280, 102],
               [101, 1132, 1280, 1106, 3325, 1122, 4050, 102],
               [101, 1122, 4050, 119, 102]],
 'overflow_to_sample_mapping': [0, 0, 0, 0]}
[CLS] This sentence is not too long [SEP]
[CLS] too long but we are going [SEP]
[CLS] are going to split it anyway [SEP]
[CLS] it anyway. [SEP]
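
The chunks above can be reproduced with a simple sliding window: each chunk holds `max_length` minus two special tokens, and consecutive chunks overlap by `stride` tokens. The following is a simplified sketch of that behavior for a single sentence, not the tokenizer's actual implementation; the helper name `split_with_stride` is hypothetical, and the IDs 101 and 102 are the `[CLS]` and `[SEP]` IDs shown in the output above.

```python
def split_with_stride(token_ids, max_length, stride, cls_id=101, sep_id=102):
    """Split token IDs into overlapping chunks, each wrapped in [CLS] ... [SEP]."""
    window = max_length - 2   # room left after the two special tokens
    step = window - stride    # how far the window advances each time
    if window <= 0 or step <= 0:
        raise ValueError("max_length must leave room for tokens and exceed the stride")
    chunks = []
    start = 0
    while True:
        chunks.append([cls_id] + token_ids[start:start + window] + [sep_id])
        if start + window >= len(token_ids):
            break  # this chunk already reached the end of the sequence
        start += step
    return chunks


# Token IDs of the example sentence, without special tokens (taken from the output above)
ids = [1188, 5650, 1110, 1136, 1315, 1263, 1133, 1195,
       1132, 1280, 1106, 3325, 1122, 4050, 119]
chunks = split_with_stride(ids, max_length=8, stride=2)
for chunk in chunks:
    print(chunk)
```

With `max_length=8` and `stride=2` the window holds 6 tokens and advances by 4, so 15 tokens yield 4 chunks, the last of which is shorter than the others.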


In [103]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])


In [104]:
print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0]


In [105]:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])
print(inputs)

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
{'input_ids': [[101, 1188, 5650, 1110, 1136, 102], [101, 1110, 1136, 1315, 1263, 102], [101, 1315, 1263, 1133, 1195, 102], [101, 1133, 1195, 1132, 1280, 102], [101, 1132, 1280, 1106, 3325, 102], [101, 1106, 3325, 1122, 4050, 102], [101, 1122, 4050, 119, 102], [101, 1188, 5650, 1110, 7681, 102], [101, 1110, 7681, 1133, 1209, 102], [101, 1133, 1209, 1253, 1243, 102], [101, 1253, 1243, 3325, 119, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'overflow_to_sample_mapping': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]}


In [106]:
inputs.sequence_ids()

[None, 0, 0, 0, 0, None]

In [107]:
inputs = tokenizer(
    question,
    long_context,
    stride=60,
    max_length=200,
    padding="longest",
    #truncation=True,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
    print("\n")

[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to [SEP]


[CLS] Which 

In [110]:
inputs = tokenizer(
    question,
    long_context,
    stride=60,
    max_length=200,
    padding="longest",
    truncation=True,
    #truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(inputs["overflow_to_sample_mapping"])

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
    print("\n")

[0, 0, 0, 0]
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to [SEP]



In [108]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [118]:
print(inputs.sequence_ids())
len(inputs.sequence_ids())

[None, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


200

In [119]:
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(inputs)

{'input_ids': [[101, 5979, 1996, 3776, 9818, 1171, 100, 25267, 136, 102, 100, 25267, 131, 1426, 1104, 1103, 2051, 21239, 2101, 100, 25267, 2790, 4674, 1104, 3073, 4487, 9044, 3584, 1106, 3870, 8249, 1113, 6685, 1216, 1112, 5393, 117, 1869, 16026, 117, 2304, 10937, 117, 7584, 7317, 2734, 117, 5179, 117, 3087, 3964, 1105, 1167, 1107, 1166, 1620, 3483, 119, 2098, 6457, 1110, 1106, 1294, 5910, 118, 2652, 21239, 2101, 5477, 1106, 1329, 1111, 2490, 119, 100, 25267, 2790, 20480, 1116, 1106, 1976, 9133, 1105, 1329, 1343, 3073, 4487, 9044, 3584, 1113, 170, 1549, 3087, 117, 2503, 118, 9253, 1172, 1113, 1240, 1319, 2233, 27948, 1105, 1173, 2934, 1172, 1114, 1103, 1661, 1113, 1412, 2235, 10960, 119, 1335, 1103, 1269, 1159, 117, 1296, 185, 25669, 8613, 13196, 13682, 1126, 4220, 1110, 3106, 2484, 20717, 1673, 1105, 1169, 1129, 5847, 1106, 9396, 3613, 1844, 7857, 119, 2009, 1431, 146, 1329, 11303, 1468, 136, 122, 119, 12167, 118, 1106, 118, 1329, 1352, 118, 1104, 118, 1103, 118, 1893, 3584, 131, 118,

In [120]:
_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)


torch.Size([2, 384])
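
The number of chunks can be estimated with simple arithmetic, at least under a simplified model of the truncation logic. For a question/context pair, every chunk repeats the question and three special tokens, so only the remainder of `max_length` is available for context. The helper `count_chunks` below is hypothetical; the figures are read off the outputs above (461 tokens in total, of which 8 are question tokens and 3 are special tokens, leaving 450 context tokens).

```python
def count_chunks(n_context, n_question, max_length, stride, n_special=3):
    """Estimate how many overlapping chunks a long context is split into."""
    window = max_length - n_question - n_special  # context tokens per chunk
    step = window - stride                        # advance between chunks
    if window <= 0 or step <= 0:
        raise ValueError("max_length is too small for this question and stride")
    chunks = 1
    start = 0
    # Keep sliding until a chunk reaches the end of the context
    while start + window < n_context:
        start += step
        chunks += 1
    return chunks


print(count_chunks(450, 8, max_length=384, stride=128))
print(count_chunks(450, 8, max_length=200, stride=60))
```

Under these assumptions the first call gives 2 chunks, matching the `[2, 384]` shape above, and the second gives 4, matching the four entries of `overflow_to_sample_mapping` printed for `max_length=200`.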


In [122]:
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)


torch.Size([2, 384]) torch.Size([2, 384])


In [105]:
inputs

{'input_ids': tensor([[  101,  5979,  1996,  3776,  9818,  1171,   100, 25267,   136,   102,
           100, 25267,   131,  1426,  1104,  1103,  2051, 21239,  2101,   100,
         25267,  2790,  4674,  1104,  3073,  4487,  9044,  3584,  1106,  3870,
          8249,  1113,  6685,  1216,  1112,  5393,   117,  1869, 16026,   117,
          2304, 10937,   117,  7584,  7317,  2734,   117,  5179,   117,  3087,
          3964,  1105,  1167,  1107,  1166,  1620,  3483,   119,  2098,  6457,
          1110,  1106,  1294,  5910,   118,  2652, 21239,  2101,  5477,  1106,
          1329,  1111,  2490,   119,   100, 25267,  2790, 20480,  1116,  1106,
          1976,  9133,  1105,  1329,  1343,  3073,  4487,  9044,  3584,  1113,
           170,  1549,  3087,   117,  2503,   118,  9253,  1172,  1113,  1240,
          1319,  2233, 27948,  1105,  1173,  2934,  1172,  1114,  1103,  1661,
          1113,  1412,  2235, 10960,   119,  1335,  1103,  1269,  1159,   117,
          1296,   185, 25669,  8613, 1

In [113]:
print(inputs["input_ids"].shape)
print(len(inputs.sequence_ids()))
print(inputs.sequence_ids())  # with no argument, sequence_ids() defaults to batch_index=0, i.e. the first chunk only

torch.Size([2, 384])
384
[None, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [124]:
print(inputs.sequence_ids(0))

[None, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [123]:
print(inputs.sequence_ids(1))

[None, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 
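
The two printouts above show that the chunks do not share the same tail: the second chunk ends its context earlier and is then filled with `None` up to length 384. The sketch below shows one way a context mask could be built separately for each chunk. It is only an illustration in plain Python: the helper name `context_mask` and the short toy sequence-ID lists are hypothetical and stand in for the real 384-element lists.

```python
def context_mask(sequence_ids):
    """Return a list where True marks a position to exclude from the answer search."""
    # Everything that is not a context token (sequence id 1) is masked
    mask = [sid != 1 for sid in sequence_ids]
    # Keep the [CLS] token at position 0 available
    mask[0] = False
    return mask


# Hypothetical toy chunks: None = special or padding token, 0 = question, 1 = context
chunk0 = [None, 0, 0, None, 1, 1, 1, None]
chunk1 = [None, 0, 0, None, 1, 1, None, None]  # shorter context, then [SEP] and padding

masks = [context_mask(ids) for ids in (chunk0, chunk1)]
for m in masks:
    print(m)
```

With one mask per chunk, the closing special token and the padding of a shorter chunk are excluded without relying on the attention mask.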

In [125]:
# sequence_ids() defaults to the first chunk; the resulting mask is broadcast to every chunk below
sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
print(torch.tensor(mask)[None].shape, inputs["attention_mask"].shape)
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))
print(mask.shape)
print(mask)
# Push the masked logits far down so that softmax assigns them (almost) zero probability
start_logits[mask] = -10000
end_logits[mask] = -10000


torch.Size([1, 384]) torch.Size([2, 384])
torch.Size([2, 384])
tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, Fal

In [127]:
start_logits.shape

torch.Size([2, 384])
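
To see why overwriting masked logits with -10000 removes those positions from consideration, here is a minimal sketch of a masked softmax in plain Python. It assumes nothing about the model: `masked_softmax`, `toy_logits`, and `toy_mask` are hypothetical names and values chosen for illustration.

```python
import math


def masked_softmax(logits, mask, fill=-10000.0):
    """Softmax over `logits` after overwriting masked positions with `fill`."""
    filled = [fill if m else x for x, m in zip(logits, mask)]
    peak = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for: [CLS], one question token, two context tokens
toy_logits = [2.0, 5.0, 1.0, 0.5]
toy_mask = [False, True, False, False]  # True marks a position to exclude

probs = masked_softmax(toy_logits, toy_mask)
print(probs)
```

The question token has the largest raw logit, yet after masking its probability underflows to zero and the remaining positions share the full probability mass.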

In [129]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)
print(start_probabilities.shape)
print(start_probabilities)

torch.Size([2, 384])
tensor([[6.0396e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.1519e-02, 1.8898e-02,
         2.3307e-03, 2.6660e-01, 3.3987e-04, 5.7259e-03, 8.8251e-03, 2.4441e-02,
         1.1653e-03, 1.6499e-02, 4.2213e-03, 2.5967e-04, 5.1531e-04, 4.8240e-05,
         1.0526e-03, 3.3808e-05, 4.8656e-05, 3.3979e-04, 7.6070e-05, 8.2786e-05,
         2.0089e-04, 3.7564e-05, 3.8198e-04, 2.2247e-05, 1.9128e-05, 3.1091e-04,
         2.6787e-05, 2.9143e-04, 5.4116e-05, 2.5045e-05, 2.0686e-04, 5.5922e-05,
         3.0922e-05, 1.8686e-04, 3.0462e-05, 3.8663e-05, 3.2295e-05, 2.6850e-04,
         3.2148e-05, 5.0713e-04, 7.8036e-05, 2.6229e-05, 2.1632e-04, 1.0072e-04,
         2.8341e-04, 2.6911e-04, 1.7327e-04, 6.4009e-05, 4.9601e-04, 1.2098e-04,
         4.9918e-05, 2.3866e-04, 2.2675e-04, 7.6235e-04, 5.3037e-05, 6.1205e-05,
         2.1717e-03, 6.1977e-05, 8.0132e-05, 3.5199e-05, 3.8517e-05, 1.9568e-05,
       

In [130]:
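import torch

# Most probable start token in the first chunk
v0, i = torch.topk(start_probabilities[0], 1)
print(v0, i)

# Most probable end token in the first chunk
v, j = torch.topk(end_probabilities[0], 1)
print(v, j)
# Span score = start probability * end probability
print(v0*v)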

tensor([0.6040], grad_fn=<TopkBackward0>) tensor([0])
tensor([0.5608], grad_fn=<TopkBackward0>) tensor([18])
tensor([0.3387], grad_fn=<MulBackward0>)


In [131]:
start_index, _ = offsets[0][i.item()]
_, end_index = offsets[0][j.item()]
print(long_context[start_index:end_index])


🤗 Transformers: State of the Art NLP


In [132]:
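import torch

# Most probable start token in the second chunk
v0, i = torch.topk(start_probabilities[1], 1)
print(v0, i)

# Most probable end token in the second chunk
v, j = torch.topk(end_probabilities[1], 1)
print(v, j)
# Span score = start probability * end probability
print(v0*v)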

tensor([0.9912], grad_fn=<TopkBackward0>) tensor([173])
tensor([0.9801], grad_fn=<TopkBackward0>) tensor([184])
tensor([0.9715], grad_fn=<MulBackward0>)


In [133]:
start_index, _ = offsets[1][i.item()]
_, end_index = offsets[1][j.item()]
print(long_context[start_index:end_index])

Jax, PyTorch and TensorFlow


In [134]:
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
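    # scores[s, e] = P(start = s) * P(end = e)
    scores = start_probs[:, None] * end_probs[None, :]
    # Keep only spans with start <= end, then take the flat index of the best one
    idx = torch.triu(scores).argmax().item()

    # Convert the flat (row-major) index back to (row, column)
    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()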
    candidates.append((start_idx, end_idx, score))

print(candidates)


[(0, 18, 0.33867067098617554), (173, 184, 0.9714868664741516)]
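
The search above picks the start and end jointly rather than taking each argmax on its own, because the two independent maxima can describe a span that ends before it starts. The following plain-Python sketch makes that constraint explicit with nested loops. `best_span` and the toy probabilities are hypothetical and exist only for illustration.

```python
def best_span(start_probs, end_probs):
    """Return (start, end, score) maximizing start_probs[s] * end_probs[e] with s <= e."""
    best = (0, 0, -1.0)
    for s, p_start in enumerate(start_probs):
        # Only e >= s is visited, which mirrors the upper-triangular filter
        for e in range(s, len(end_probs)):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical probabilities: the independent maxima are start=2 and end=0, an invalid span
toy_start = [0.1, 0.2, 0.7]
toy_end = [0.6, 0.3, 0.1]

print(best_span(toy_start, toy_end))
```

Here the best valid span is (2, 2), even though position 0 has the highest end probability.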


In [135]:
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)


{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867067098617554}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.9714868664741516}
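
The last step converts token indices into character positions through the offset mapping. The sketch below repeats that lookup on a short string so that each number can be checked by hand. The sentence, the `toy_offsets` list, and the helper `span_text` are hypothetical and were not produced by the tokenizer used above.

```python
def span_text(text, offsets, start_token, end_token):
    """Slice `text` from the first character of start_token to the last of end_token."""
    start_char, _ = offsets[start_token]
    _, end_char = offsets[end_token]
    return text[start_char:end_char]


text = "Hugging Face supports Jax, PyTorch and TensorFlow."
# Hypothetical (start_char, end_char) pairs, one per token
toy_offsets = [
    (0, 7), (8, 12), (13, 21),      # Hugging / Face / supports
    (22, 25), (25, 26), (27, 34),   # Jax / , / PyTorch
    (35, 38), (39, 49), (49, 50),   # and / TensorFlow / .
]

print(span_text(text, toy_offsets, 3, 7))
```

Only the start of the first token and the end of the last token are needed. The characters in between, including spaces and punctuation, come from the original text.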
