# 5 Named Entity Recognition and Relation Extraction

**Assignment**
- Copy this Colab notebook into your Google Drive
- Use `LUKE` to recognize named entities in a sentence of your choice
  - Print entity names and their labels
    - Example: `Franz Kafka PER`
  - You can reuse any of the code in this notebook
- Save your Colab notebook as a PDF via **Print** in the **File** menu and submit it at https://edu-portal.naist.jp/ under **NLP #5** using the report submission portal. Make sure that all code and execution results are visible for assessment.
- The PDF file name: `studentID_firstName_lastName.pdf`
    - Example: `222222_hiroki_ouchi.pdf`
- **Deadline: January 8**

For help regarding [Colab](https://colab.research.google.com/) or any technical issues, ask our TA, <sun.hongyu.sg6@naist.ac.jp>.

In [1]:
#@markdown Please fill in your name, student id and email address.

name = 'Raturi Himanshu' #@param {type: 'string'}
student_id = '2411422' #@param {type: 'string'}
email = 'raturi.himanshu.rf4@naist.ac.jp' #@param {type: 'string'}

#@markdown ---

## Named Entity Recognition

[LUKE](https://huggingface.co/docs/transformers/model_doc/luke#transformers.LukeForEntitySpanClassification) is the model proposed in [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.

LUKE is available through the `transformers` library from [Hugging Face](https://huggingface.co/).

### Install an NLP library for using LUKE

In [2]:
!pip install transformers



### Load a tokenizer and a model

In [3]:
from transformers import AutoTokenizer, LukeForEntitySpanClassification, LukeForEntityPairClassification

In [4]:
MODEL_NAME = "studio-ousia/luke-large-finetuned-conll-2003"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = LukeForEntitySpanClassification.from_pretrained(MODEL_NAME)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at studio-ousia/luke-large-finetuned-conll-2003 were not used when initializing LukeForEntitySpanClassification: ['luke.embeddings.position_ids']
- This IS expected if you are initializing LukeForEntitySpanClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeForEntitySpanClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Pre-processing

In [5]:
# Example sentence
text = "Beyoncé lives in Los Angeles"

In [6]:
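def get_positions(_split_text):
    """Return character-based start and end offsets for each word.

    Assumes the words in the original text are separated by exactly one
    space. End offsets are exclusive, so text[start:end] yields the word.
    """
    _word_start_positions = []  # character-based start positions of word tokens
    _word_end_positions = []  # character-based end positions of word tokens
    word_end = -1  # so that the first word starts at offset 0
    for word in _split_text:
        word_start = word_end + 1  # skip the single space after the previous word
        word_end = word_start + len(word)
        _word_start_positions.append(word_start)
        _word_end_positions.append(word_end)
    assert len(_split_text) == len(_word_start_positions) == len(_word_end_positions)
    return _word_start_positions, _word_end_positions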


# Split the text by the white space
split_text = text.split()
print("split_text:", split_text)

# Prepare positions (character offsets) of each word in the sentence
word_start_positions, word_end_positions = get_positions(split_text)
for word, start, end in zip(split_text,
                            word_start_positions,
                            word_end_positions):
    print(word, start, end)

split_text: ['Beyoncé', 'lives', 'in', 'Los', 'Angeles']
Beyoncé 0 7
lives 8 13
in 14 16
Los 17 20
Angeles 21 28
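
`get_positions` relies on words being separated by exactly one space. As a minimal sketch of an alternative, the hypothetical helper `get_positions_regex` below (not part of the original notebook) reads the offsets directly from regular-expression matches, so they remain valid when the text contains repeated or leading whitespace.

```python
import re


def get_positions_regex(text):
    """Return (words, starts, ends) with character offsets into `text`.

    Hypothetical alternative to get_positions: offsets come from regex
    matches, so repeated or leading whitespace does not shift them.
    """
    words, starts, ends = [], [], []
    for match in re.finditer(r"\S+", text):
        words.append(match.group())
        starts.append(match.start())
        ends.append(match.end())  # exclusive end, same convention as get_positions
    return words, starts, ends


# Example with irregular spacing between words
words, starts, ends = get_positions_regex("Beyoncé  lives in   Los Angeles")
for word, start, end in zip(words, starts, ends):
    print(word, start, end)
```

The returned start and end lists can be passed to `create_candidates` in place of the output of `get_positions`.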


In [7]:
def create_candidates(_word_start_positions, _word_end_positions):
    _entity_spans = []
    for i, start in enumerate(_word_start_positions):
        for end in _word_end_positions[i:]:
            _entity_spans.append((start, end))
    return _entity_spans


# Prepare candidate spans
entity_spans = create_candidates(word_start_positions, word_end_positions)
print("entity_spans", entity_spans)

for start, end in entity_spans:
    print(start, end, text[start:end])

entity_spans [(0, 7), (0, 13), (0, 16), (0, 20), (0, 28), (8, 13), (8, 16), (8, 20), (8, 28), (14, 16), (14, 20), (14, 28), (17, 20), (17, 28), (21, 28)]
0 7 Beyoncé
0 13 Beyoncé lives
0 16 Beyoncé lives in
0 20 Beyoncé lives in Los
0 28 Beyoncé lives in Los Angeles
8 13 lives
8 16 lives in
8 20 lives in Los
8 28 lives in Los Angeles
14 16 in
14 20 in Los
14 28 in Los Angeles
17 20 Los
17 28 Los Angeles
21 28 Angeles
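
Enumerating every contiguous word sequence produces n(n+1)/2 candidates for a sentence of n words (15 for the 5 words here), which grows quickly for long sentences. The sketch below shows one possible way to limit that growth by capping the number of words per span. The function `create_candidates_capped` and its `max_words` parameter are illustrative assumptions, not part of the original notebook.

```python
def create_candidates_capped(word_starts, word_ends, max_words=None):
    """Enumerate (start, end) character spans of 1..max_words consecutive words.

    With max_words=None this matches the exhaustive enumeration above.
    """
    spans = []
    for i, start in enumerate(word_starts):
        # Index one past the last word allowed to end a span starting at word i
        stop = len(word_ends) if max_words is None else min(len(word_ends), i + max_words)
        for end in word_ends[i:stop]:
            spans.append((start, end))
    return spans


# Word offsets of "Beyoncé lives in Los Angeles", copied from the output above
starts = [0, 8, 14, 17, 21]
ends = [7, 13, 16, 20, 28]
n = len(starts)

all_spans = create_candidates_capped(starts, ends)
assert len(all_spans) == n * (n + 1) // 2  # exhaustive enumeration

short_spans = create_candidates_capped(starts, ends, max_words=2)
print(len(all_spans), len(short_spans))
```

A cap of two words still keeps the span `(17, 28)` for "Los Angeles", but any entity longer than the cap would be missed, so the value is a trade-off.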


### Check all the class labels to predict

In [8]:
model.config.id2label

{0: 'NIL', 1: 'MISC', 2: 'PER', 3: 'ORG', 4: 'LOC'}

- NIL = not a named entity
- MISC = miscellaneous (others)
- PER = Person
- ORG = Organization
- LOC = Location

### Predict class labels

In [9]:
# Convert text to IDs
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")

# The model calculates logits for all possible spans
outputs = model(**inputs)
logits = outputs.logits

print(logits)

tensor([[[ -1.9349,  -2.4820,   4.9590,  -1.7057,  -2.9863],
         [ 19.6495,  -6.1777,  -4.8789,  -6.0958, -10.4862],
         [ 20.1395,  -6.1046,  -5.4534,  -6.3328,  -9.9215],
         [ 20.6528,  -6.7990,  -5.1158,  -7.0594, -10.3352],
         [ 10.7447,  -6.2030,  -3.7324,  -4.9827,  -2.8323],
         [ 21.3224,  -5.8755,  -5.4406,  -5.2967,  -8.7788],
         [ 26.4558,  -7.2542,  -7.5937,  -7.1223, -11.4900],
         [ 29.6949,  -8.5762,  -8.6888,  -8.3652, -12.4721],
         [ 21.4111,  -8.2313,  -8.2208,  -6.3827,  -4.8849],
         [ 25.6505,  -6.9217,  -6.7429,  -6.8426, -10.8582],
         [ 21.5046,  -6.3015,  -4.8518,  -7.2439,  -8.5246],
         [ 12.6306,  -5.3144,  -4.9148,  -4.5891,  -1.3775],
         [  9.9836,  -5.1412,  -3.3172,  -6.4617,  -1.5044],
         [ -1.5675,  -3.7830,  -2.4797,  -3.9234,   7.4301],
         [ 13.8551,  -6.3372,  -4.5943,  -5.2015,  -2.8574]]],
       grad_fn=<ViewBackward0>)


- logits = unnormalized scores, one per class label, for each candidate span (one row per span, one column per label)
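
Logits are not probabilities, but a softmax turns each row into a distribution over the five labels. The following is a minimal, dependency-free sketch. The helper name `softmax` is introduced here for illustration, and the input values are the first row of the tensor printed above, rounded as shown there.

```python
import math


def softmax(scores):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NIL", "MISC", "PER", "ORG", "LOC"]
# Logits for the span "Beyoncé", copied from the printed tensor
beyonce_logits = [-1.9349, -2.4820, 4.9590, -1.7057, -2.9863]

probs = softmax(beyonce_logits)
for label, prob in zip(labels, probs):
    print(f"{label}\t{prob:.4f}")
```

Because softmax preserves the ordering of the scores, the label with the highest logit is also the label with the highest probability.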

In [10]:
id2label = model.config.id2label
for span, logits_each_span in zip(entity_spans, logits[0].tolist()):
    start, end = span
    print(f'{text[start: end]}')
    for label_id, label in id2label.items():
        print(f'-- {label_id} {label} {logits_each_span[label_id]}')

Beyoncé
-- 0 NIL -1.9349348545074463
-- 1 MISC -2.4819788932800293
-- 2 PER 4.958950996398926
-- 3 ORG -1.7056972980499268
-- 4 LOC -2.9862539768218994
Beyoncé lives
-- 0 NIL 19.649463653564453
-- 1 MISC -6.177742004394531
-- 2 PER -4.878870964050293
-- 3 ORG -6.0957932472229
-- 4 LOC -10.486166954040527
Beyoncé lives in
-- 0 NIL 20.13947296142578
-- 1 MISC -6.104556083679199
-- 2 PER -5.4534430503845215
-- 3 ORG -6.33275032043457
-- 4 LOC -9.921531677246094
Beyoncé lives in Los
-- 0 NIL 20.652751922607422
-- 1 MISC -6.799007892608643
-- 2 PER -5.115772247314453
-- 3 ORG -7.059402942657471
-- 4 LOC -10.335214614868164
Beyoncé lives in Los Angeles
-- 0 NIL 10.744658470153809
-- 1 MISC -6.202977180480957
-- 2 PER -3.7324435710906982
-- 3 ORG -4.982699394226074
-- 4 LOC -2.8323137760162354
lives
-- 0 NIL 21.322420120239258
-- 1 MISC -5.875490665435791
-- 2 PER -5.440551280975342
-- 3 ORG -5.296720504760742
-- 4 LOC -8.778790473937988
lives in
-- 0 NIL 26.45578384399414
-- 1 MISC -7.254156

### Print the class labels with the highest score for all possible spans

In [11]:
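# Print the label ids with the highest logit (score)
# [0] selects the single sentence in the batch; unlike squeeze(), it still yields a list when there is only one span
predicted_class_indices = logits.argmax(-1)[0].tolist()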
print(predicted_class_indices)

[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0]


In [12]:
# Print the predicted class labels for each span
for span, predicted_class_idx in zip(entity_spans, predicted_class_indices):
    start, end = span
    predicted_class = model.config.id2label[predicted_class_idx]
    print(f'{text[start: end]}\t{predicted_class}')

Beyoncé	PER
Beyoncé lives	NIL
Beyoncé lives in	NIL
Beyoncé lives in Los	NIL
Beyoncé lives in Los Angeles	NIL
lives	NIL
lives in	NIL
lives in Los	NIL
lives in Los Angeles	NIL
in	NIL
in Los	NIL
in Los Angeles	NIL
Los	NIL
Los Angeles	LOC
Angeles	NIL


In [13]:
# Print the class labels excluding 'NIL'
named_entity_spans = []
for span, predicted_class_idx in zip(entity_spans, predicted_class_indices):
    if predicted_class_idx != 0:  # ID=0 is the 'NIL' label (not named entity)
        start, end = span
        predicted_class = model.config.id2label[predicted_class_idx]
        named_entity_spans.append((start, end))
        print(f'{text[start: end]}\t{predicted_class}')

# character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
print(named_entity_spans)

Beyoncé	PER
Los Angeles	LOC
[(0, 7), (17, 28)]
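
Each span is classified independently, so for other sentences two overlapping spans (for example "Los" and "Los Angeles") could both receive a non-NIL label. A common remedy, similar in spirit to the greedy procedure described in the LUKE paper, is to keep spans in descending order of score and skip any span that overlaps one already kept. The sketch below assumes a hypothetical helper `select_non_overlapping` and partly made-up scores; it is not the notebook's own method.

```python
def select_non_overlapping(scored_spans):
    """Greedily keep the highest-scoring spans that do not overlap.

    scored_spans: list of (start, end, label, score) with exclusive end offsets.
    """
    selected = []
    for start, end, label, score in sorted(scored_spans, key=lambda s: s[3], reverse=True):
        # Two half-open intervals overlap when each starts before the other ends
        overlaps = any(start < kept_end and kept_start < end
                       for kept_start, kept_end, _, _ in selected)
        if not overlaps:
            selected.append((start, end, label, score))
    return sorted(selected)  # restore reading order


# Hypothetical predictions in which "Los" and "Los Angeles" are both labelled LOC
candidates = [
    (0, 7, "PER", 4.96),
    (17, 20, "LOC", 1.20),  # made-up score for illustration
    (17, 28, "LOC", 7.43),
]
print(select_non_overlapping(candidates))
```

Here the lower-scoring span `(17, 20)` is dropped because it overlaps `(17, 28)`.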


## Relation Extraction

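For relation extraction, we use a LUKE model fine-tuned on [The TAC Relation Extraction Dataset](https://nlp.stanford.edu/projects/tacred/) (TACRED). Given a sentence and a pair of entity spans (head and tail), the model predicts the relation between them.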

### Load a tokenizer and a model

In [14]:
MODEL_NAME = "studio-ousia/luke-large-finetuned-tacred"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = LukeForEntityPairClassification.from_pretrained(MODEL_NAME)

Some weights of the model checkpoint at studio-ousia/luke-large-finetuned-tacred were not used when initializing LukeForEntityPairClassification: ['luke.embeddings.position_ids']
- This IS expected if you are initializing LukeForEntityPairClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeForEntityPairClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Pre-processing

In [15]:
# Make a pair of "Beyoncé" and "Los Angeles"
entity_pairs = [[named_entity_spans[0], named_entity_spans[1]]]
# Prepare one input text (sentence) for each pair
texts = [text]
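
With more than two entities, the pairs can be generated rather than written by hand. The sketch below is one possible approach, assuming every ordered (head, tail) combination should be scored: relation labels such as `per:cities_of_residence` describe the head entity, so the two orders of a pair are different inputs. The span values are copied from the NER output above.

```python
from itertools import permutations

text = "Beyoncé lives in Los Angeles"
named_entity_spans = [(0, 7), (17, 28)]  # spans found in the NER step

# Every ordered (head, tail) pair; both directions are included
entity_pairs = [list(pair) for pair in permutations(named_entity_spans, 2)]
# One copy of the sentence per pair, matching the texts/entity_pairs layout above
texts = [text] * len(entity_pairs)

for head, tail in entity_pairs:
    print(text[head[0]:head[1]], "->", text[tail[0]:tail[1]])
```

For k entities this produces k(k-1) pairs, so the number of model inputs grows quadratically with the number of entities.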

### Check all the class labels to predict

In [16]:
model.config.id2label

{0: 'no_relation',
 1: 'org:alternate_names',
 2: 'org:city_of_headquarters',
 3: 'org:country_of_headquarters',
 4: 'org:dissolved',
 5: 'org:founded',
 6: 'org:founded_by',
 7: 'org:member_of',
 8: 'org:members',
 9: 'org:number_of_employees/members',
 10: 'org:parents',
 11: 'org:political/religious_affiliation',
 12: 'org:shareholders',
 13: 'org:stateorprovince_of_headquarters',
 14: 'org:subsidiaries',
 15: 'org:top_members/employees',
 16: 'org:website',
 17: 'per:age',
 18: 'per:alternate_names',
 19: 'per:cause_of_death',
 20: 'per:charges',
 21: 'per:children',
 22: 'per:cities_of_residence',
 23: 'per:city_of_birth',
 24: 'per:city_of_death',
 25: 'per:countries_of_residence',
 26: 'per:country_of_birth',
 27: 'per:country_of_death',
 28: 'per:date_of_birth',
 29: 'per:date_of_death',
 30: 'per:employee_of',
 31: 'per:origin',
 32: 'per:other_family',
 33: 'per:parents',
 34: 'per:religion',
 35: 'per:schools_attended',
 36: 'per:siblings',
 37: 'per:spouse',
 38: 'per:state

### Predict class labels

In [17]:
inputs = tokenizer(texts, entity_spans=entity_pairs, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

print(logits)

tensor([[ 3.2066, -3.2249,  3.2884, -2.9900, -3.0967, -2.4664, -2.8530, -1.7383,
         -3.0650, -2.7716, -2.2769, -3.4903, -2.4796, -1.2456, -2.9390, -0.3413,
         -2.2316, -1.3379,  0.6568, -2.1910, -3.1682, -2.1295,  9.3798,  3.1360,
          1.9412,  0.5659, -0.8443, -2.9207, -2.0429, -2.7728,  0.8149, -0.9816,
         -1.0304, -1.1288, -2.7429, -0.7868, -1.4846, -1.2134,  0.0816, -2.6639,
          2.4028, -0.7915]], grad_fn=<MmBackward0>)


In [18]:
id2label = model.config.id2label
for pair, each_text, logits_each_pair in zip(entity_pairs, texts, logits.tolist()):
    span_head, span_tail = pair
    start_h, end_h = span_head
    start_t, end_t = span_tail
    print(f'{each_text[start_h: end_h]}, {each_text[start_t: end_t]}')
    for label_id, label in id2label.items():
        print(f'-- {label_id} {label} {logits_each_pair[label_id]}')

Beyoncé, Los Angeles
-- 0 no_relation 3.2065906524658203
-- 1 org:alternate_names -3.22489070892334
-- 2 org:city_of_headquarters 3.288393497467041
-- 3 org:country_of_headquarters -2.9900221824645996
-- 4 org:dissolved -3.096672534942627
-- 5 org:founded -2.4663782119750977
-- 6 org:founded_by -2.8530101776123047
-- 7 org:member_of -1.7382992506027222
-- 8 org:members -3.0649538040161133
-- 9 org:number_of_employees/members -2.7716476917266846
-- 10 org:parents -2.276923656463623
-- 11 org:political/religious_affiliation -3.490264654159546
-- 12 org:shareholders -2.4795608520507812
-- 13 org:stateorprovince_of_headquarters -1.2456047534942627
-- 14 org:subsidiaries -2.938952922821045
-- 15 org:top_members/employees -0.3413412570953369
-- 16 org:website -2.231614112854004
-- 17 per:age -1.337915301322937
-- 18 per:alternate_names 0.6568173170089722
-- 19 per:cause_of_death -2.1910128593444824
-- 20 per:charges -3.1682000160217285
-- 21 per:children -2.129486083984375
-- 22 per:cities_o

### Print the class label with the highest score

In [19]:
# Print the label ids with the highest logit (score)
predicted_label_ids = logits.argmax(-1).tolist()
print(predicted_label_ids)

[22]


In [20]:
for pair, each_text, label_id in zip(entity_pairs, texts, predicted_label_ids):
    span_head, span_tail = pair
    start_h, end_h = span_head
    start_t, end_t = span_tail
    print(f'{each_text[start_h: end_h]}, {each_text[start_t: end_t]}')
    print(f'-- {label_id} {id2label[label_id]}')

Beyoncé, Los Angeles
-- 22 per:cities_of_residence
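
The argmax shows only the single best label. Looking at the top few candidates can reveal how close the alternatives are. The sketch below uses a hypothetical helper `top_k` on a handful of logits copied from the outputs above (rounded as in the printed tensor); it is an illustration, not part of the original notebook.

```python
def top_k(scores_by_label, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(scores_by_label.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# A few logits for the pair (Beyoncé, Los Angeles), not the full label set
scores = {
    "no_relation": 3.2066,
    "org:city_of_headquarters": 3.2884,
    "per:cities_of_residence": 9.3798,
    "per:city_of_birth": 3.1360,
}

best = top_k(scores)
for label, score in best:
    print(f"{label}\t{score:.4f}")
```

In this example the gap between the best label and the runners-up is large, which suggests the prediction is not a close call.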


## Assignment

### Example

In [21]:
from transformers import AutoTokenizer, LukeForEntitySpanClassification, LukeForEntityPairClassification

In [22]:
MODEL_NAME = "studio-ousia/luke-large-finetuned-conll-2003"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = LukeForEntitySpanClassification.from_pretrained(MODEL_NAME)

Some weights of the model checkpoint at studio-ousia/luke-large-finetuned-conll-2003 were not used when initializing LukeForEntitySpanClassification: ['luke.embeddings.position_ids']
- This IS expected if you are initializing LukeForEntitySpanClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeForEntitySpanClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [35]:
text = "NAIST will be moved to Osaka , Kobe or Tokyo"

In [36]:
# Split the text by the white space
split_text = text.split()

# Prepare positions (character offsets) of each word in the sentence
word_start_positions, word_end_positions = get_positions(split_text)

# Prepare candidate spans
entity_spans = create_candidates(word_start_positions, word_end_positions)

In [37]:
# Convert text to IDs
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")

# The model calculates logits for all possible spans
outputs = model(**inputs)
logits = outputs.logits

In [38]:
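# Extract the label ids with the highest logit (score)
# [0] selects the single sentence in the batch; unlike squeeze(), it still yields a list when there is only one span
predicted_class_indices = logits.argmax(-1)[0].tolist()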

# Print the class labels excluding 'NIL'
for span, predicted_class_idx in zip(entity_spans, predicted_class_indices):
    if predicted_class_idx != 0:  # ID=0 is the 'NIL' label (not named entity)
        start, end = span
        predicted_class = model.config.id2label[predicted_class_idx]
        print(f'{text[start: end]}\t{predicted_class}')

NAIST	ORG
Osaka	LOC
Kobe	LOC
Tokyo	LOC


### Write Your Code Here!