<a href="https://colab.research.google.com/github/dbamman/nlp22/blob/main/HW5/HW_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## HW5: Neural Sequence Labeling

For this assignment, we'll provide an implementation of BERT and view its performance on **named entity recognition (NER)**. We use a standard NER dataset that adheres to the BIO tagging format discussed in lecture.

We ask you to implement:

1) a span extraction method, to determine which entities are present in a given piece of text

2) a method to calculate F1, which evaluates our model's performance. 
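
As a quick refresher on the BIO format, here is a small illustration. The sentence and its tags are made up for this example (they are not drawn from the dataset), and `split_tag` is a hypothetical helper: `B-` marks the first token of an entity, `I-` a continuation of it, and `O` a token outside any entity.

```python
# Illustrative sentence and tags (hypothetical, not from the CoNLL data).
tokens = ["Tim", "Cook", "visited", "New", "York", "City", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O"]

def split_tag(tag):
    """Split a BIO tag into (prefix, category); category is None for "O"."""
    prefix, _, category = tag.partition("-")
    return prefix, category or None

# Print each token next to its tag and the parsed (prefix, category) pair.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}\t{split_tag(tag)}")
```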

In [59]:
# ensure Transformers is installed 
# https://huggingface.co/docs/transformers/index
!pip install transformers



In [60]:
import sys
from transformers import BertModel, BertTokenizer
import torch.nn as nn
import torch
import numpy as np
from torch.nn import CrossEntropyLoss
import nltk
nltk.download('punkt')

import tqdm.notebook  # loads the submodule used by train_and_evaluate; also binds the name tqdm
from collections import Counter

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [61]:
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Intro: Wordpiece Tokenization Exploration

To start the notebook, let's explore how BERT tokenizes inputs. There are no deliverables for this portion, but we encourage you to step through these cells to understand what happens when words are inputted into BERT!

BERT uses **WordPiece** tokenization, a subword tokenization technique that breaks words that don't appear within its 30K-word vocabulary into smaller pieces. The word "vaccinated", for instance, is tokenized by the cased model we load below as `["v", "##ac", "##cin", "##ated"]`.
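
To make the splitting concrete, here is a simplified sketch of the greedy longest-match-first procedure that WordPiece tokenizers apply to a single word. The function name `wordpiece` and the tiny vocabulary are invented for illustration; the real tokenizer also handles casing, punctuation, and a maximum word length, which this sketch leaves out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example (BERT's real vocabulary is far larger).
toy_vocab = {"have", "v", "##ac", "##cin", "##ated", "##a"}

print(wordpiece("have", toy_vocab))        # ['have']
print(wordpiece("vaccinated", toy_vocab))  # ['v', '##ac', '##cin', '##ated']
```

Note that `##ac` wins over the shorter `##a` because longer candidates are tried first.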

To explore how BERT tokenizes inputs, let's first load in the [Bert-Base](https://github.com/google-research/bert#bert) model from the [Transformers](https://huggingface.co/docs/transformers/model_doc/bert) library. This call allows us to use a pretrained BERT model (i.e., one that has already gone through the training phase of MLM + NSP discussed in the [2/15 lecture](https://people.ischool.berkeley.edu/~dbamman/nlp22_slides/9_LM_3.pdf)) and use it for whatever tasks we'd like.

In [62]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's start with an example sentence:

*New data shows 26 states have fully vaccinated more than half their residents.*

and see how BERT tokenizes it.

In [63]:
inputs=tokenizer("New data shows 26 states have fully vaccinated more than half their residents.", return_tensors="pt")
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['[CLS]',
 'New',
 'data',
 'shows',
 '26',
 'states',
 'have',
 'fully',
 'v',
 '##ac',
 '##cin',
 '##ated',
 'more',
 'than',
 'half',
 'their',
 'residents',
 '.',
 '[SEP]']

Note how common words like "have" are represented in their entirety, while "vaccinated" is broken into 4 pieces. You can also see the reserved start `[CLS]` and ending `[SEP]` tags we discussed in class. BERT will generate representations of each WordPiece token, including these special `[CLS]` and `[SEP]` tags.

You can really see the effect of WordPiece when you provide a word that BERT, most likely, never encountered during its training, like *supercalifragilisticexpialidocious*

In [64]:
inputs=tokenizer("BERT is supercalifragilisticexpialidocious", return_tensors="pt")
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['[CLS]',
 'B',
 '##ER',
 '##T',
 'is',
 'super',
 '##cal',
 '##if',
 '##rag',
 '##ilis',
 '##tic',
 '##ex',
 '##pia',
 '##lid',
 '##oc',
 '##ious',
 '[SEP]']

Now let's work with an example sentence, *This jam is delicious*

In [65]:
inputs=tokenizer("This jam is delicious", return_tensors="pt")
outputs = model(**inputs)
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['[CLS]', 'This', 'jam', 'is', 'delicious', '[SEP]']

Representations for each of BERT's layers (12 in this model) are accessible, but let's explore just the outputs from the final layer. This BERT model has 768-dimensional representations, so this 6-token input (`[CLS]`, `This`, `jam`, `is`, `delicious`, `[SEP]`) has an output that is a 1 x 6 tokens x 768 dimensional tensor.

In [66]:
last_hidden_states = outputs.last_hidden_state
print(outputs.last_hidden_state.shape)

torch.Size([1, 6, 768])


Before we move to using BERT to help us carry out NER (which is, fundamentally, a classification problem), what can we do with just these representations?  

While we used word2vec-style static embeddings to find nearest neighbors for word *types*, we can do the same here for word *tokens*.

In [67]:
# this should look familiar!
def cosine_similarity(a, b):
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
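
As a sanity check on what cosine similarity measures, here is a dependency-free sketch on three tiny vectors. The function `cosine` and the vectors are made up for illustration; it computes the same quantity as `cosine_similarity` above, just without numpy.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 0.0]
print(cosine(u, [2.0, 4.0, 0.0]))    # same direction: close to 1
print(cosine(u, [0.0, 0.0, 3.0]))    # orthogonal: 0
print(cosine(u, [-1.0, -2.0, 0.0]))  # opposite direction: close to -1
```

The score depends only on the angle between the vectors, not on their lengths, which is why scaling `u` by 2 leaves the similarity unchanged.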

Let's see how BERT does at answering questions of ambiguity, which is of major concern when carrying out a task like NER (as well as other topics we'll explore later in class like word senses)

We'll start with a base sentence:

*I ate some jam with toast*

that uses the word *jam* in the sense of what you spread onto toast, i.e.
"A conserve of fruit prepared by boiling it with sugar to a pulp." ([OED](https://www.oed.com/view/Entry/100680?rskey=v7CKhW&result=2#eid)).

Now we'll write a method that extracts the representation BERT uses for a given word in a sentence.

In [68]:
def get_bert_for_token(string, term):
    
    # tokenize
    inputs = tokenizer(string, return_tensors="pt")
    
    # convert input ids to words
    tokens=tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    
    # find the first location of the query term among those tokens (so we know which BERT rep to use)
    term_idx=tokens.index(term)
    
    outputs = model(**inputs)

    # return the BERT rep for that token index
    # The output is a pytorch tensor object, but let's convert it to a numpy object to work with numpy functions
    
    return outputs.last_hidden_state[0][term_idx].detach().numpy()
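
One caveat: `get_bert_for_token` looks the term up with `tokens.index(term)`, so it only works when the term survives tokenization as a single WordPiece token (as "jam" does); a word that is split into several pieces would raise a `ValueError`. A possible workaround, sketched here with a hypothetical helper `find_pieces`, is to search for the whole run of pieces instead.

```python
def find_pieces(tokens, pieces):
    """Return the index of the first occurrence of `pieces` as a contiguous
    run inside `tokens`, or -1 if it never occurs."""
    n = len(pieces)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == pieces:
            return i
    return -1

# WordPiece tokens copied from the tokenizer output shown earlier in this notebook.
tokens = ['[CLS]', 'New', 'data', 'shows', '26', 'states', 'have', 'fully',
          'v', '##ac', '##cin', '##ated', 'more', 'than', 'half', 'their',
          'residents', '.', '[SEP]']

print(find_pieces(tokens, ['v', '##ac', '##cin', '##ated']))  # 8
print(find_pieces(tokens, ['jam']))                           # -1
```

With the start index in hand, one could use the representation of the first piece or average over all pieces of the word; which choice works better is left open here.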

In [69]:
query="I ate some jam with toast"

query_rep=get_bert_for_token(query, "jam")
# note the shape
print(query_rep.shape)

(768,)


Now, let's write sentences using other forms of jam, like the [sense](https://www.oed.com/view/Entry/100679?rskey=v7CKhW&result=1#eid) that refers to "preventing movements" and then compare how closely their usage of the word jam matches our starter sentence (according to BERT).



In [70]:
comp_sents=["She got me out of a real jam", "This jam is made of strawberries", "I sat in a traffic jam for 2 hours", "The Grateful Dead used to jam for like two days straight.", "My grandma makes the best jam.", "I had to jam on the brakes to avoid hitting him."]

In [71]:
vals=[]
for sent in comp_sents:
    comp_rep=get_bert_for_token(sent, "jam")
    cos_sim=cosine_similarity(query_rep, comp_rep)
    vals.append((cos_sim, query, sent))

for c, q, s in reversed(sorted(vals)):
    print("%.3f\t%s\t%s" % (c, q, s))

0.843	I ate some jam with toast	My grandma makes the best jam.
0.837	I ate some jam with toast	This jam is made of strawberries
0.736	I ate some jam with toast	The Grateful Dead used to jam for like two days straight.
0.665	I ate some jam with toast	I sat in a traffic jam for 2 hours
0.658	I ate some jam with toast	She got me out of a real jam
0.636	I ate some jam with toast	I had to jam on the brakes to avoid hitting him.


Now that we've explored how BERT tokenizes data firsthand, let's incorporate our learning task and see how a BERT model performs at tagging named entities in a given text.

## Neural NER

We'll start with some generic methods for reading the data our model will apply BIO tags to. We will be working with the 2003 CoNLL NER dataset [(Sang and Kim, 2003)](https://arxiv.org/pdf/cs/0306050v1.pdf).

In [72]:
def read_labels(filename):
	labels={}
	with open(filename) as file:
		for line in file:
			cols=line.rstrip().split("\t")
			if len(cols) < 2:
				continue
			label=cols[1]
			if label not in labels:
				labels[label]=len(labels)
	
	return labels

def read_data(filename, labels):
	sentences=[]
	sentence=[]
	with open(filename) as file:
		for line in file:
			cols=line.rstrip().split("\t")
			if len(cols) < 2:
				if len(sentence) > 0:
					sentences.append(sentence)
					sentence=[]

			else:
				token=cols[0]
				assert cols[1] in labels
				label=labels[cols[1]]
				sentence.append((token, label))
	
	if len(sentence) > 0:
		sentences.append(sentence)
		sentence=[]

	return sentences
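
Judging from the two functions above, each data file holds one token per line with a tab between the token and its label, and a blank line between sentences. The sketch below reads a made-up sample in that layout from memory; `read_sentences` is a hypothetical simplification of `read_data` that keeps the label strings instead of mapping them to integer ids.

```python
import io

# Made-up sample in the assumed layout: token<TAB>label, blank line between sentences.
sample = "Tim\tB-PER\nCook\tI-PER\nspoke\tO\n\nParis\tB-LOC\n"

def read_sentences(handle):
    """Group (token, label) pairs into sentences, splitting on blank lines."""
    sentences, sentence = [], []
    for line in handle:
        cols = line.rstrip().split("\t")
        if len(cols) < 2:
            # a line without two columns ends the current sentence
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            sentence.append((cols[0], cols[1]))
    # keep a final sentence that is not followed by a blank line
    if sentence:
        sentences.append(sentence)
    return sentences

print(read_sentences(io.StringIO(sample)))
# [[('Tim', 'B-PER'), ('Cook', 'I-PER'), ('spoke', 'O')], [('Paris', 'B-LOC')]]
```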

In [73]:
!wget https://raw.githubusercontent.com/dbamman/nlp22/main/HW5/data/train.txt
!wget https://raw.githubusercontent.com/dbamman/nlp22/main/HW5/data/valid.txt
!wget https://raw.githubusercontent.com/dbamman/nlp22/main/HW5/data/test.txt

labels=read_labels("train.txt")
rev_labels={labels[key]:key for key in labels}

train=read_data("train.txt", labels)
dev=read_data("valid.txt", labels)
test=read_data("test.txt", labels)

--2022-03-03 07:50:23--  https://raw.githubusercontent.com/dbamman/nlp22/main/HW5/data/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1683940 (1.6M) [text/plain]
Saving to: ‘train.txt.1’


2022-03-03 07:50:23 (26.5 MB/s) - ‘train.txt.1’ saved [1683940/1683940]

--2022-03-03 07:50:23--  https://raw.githubusercontent.com/dbamman/nlp22/main/HW5/data/valid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 423701 (414K) [text/plain]
Saving to: ‘valid.txt.1’


2022-03-03 07:50:23 (11.9 MB/s) - ‘valid.txt.1’ saved [423701/42

### BERTClassifier
Then, let's write out a BERT classifier in Pytorch so you can see the model we'll be using for NER tagging.

In [74]:
class BERTClassifier(nn.Module):

	def __init__(self, params):
		super().__init__()

		self.model_name=params["model_name"]
		self.do_lower_case = params["doLowerCase"]

		self.tokenizer = BertTokenizer.from_pretrained(self.model_name, do_lower_case=params["doLowerCase"], do_basic_tokenize=False)
		self.bert = BertModel.from_pretrained(self.model_name)
	
		self.num_labels = params["label_length"]
		self.fc = nn.Linear(params["embedding_size"], self.num_labels)
	
		self.device=device

	def get_batches(self, data, batch_size=32):

		batches_original=[]
		batches_x=[]
		batches_y=[]
		batches_attention=[]
		
		for i in range(0, len(data), batch_size):

			current_x=[]
			current_y=[]
			current_o=[]

			for sentence in data[i:i+batch_size]:
				wp_sentence=[self.tokenizer.convert_tokens_to_ids("[CLS]")]
				wp_labels=[-100]

				for token, label in sentence:

					if self.do_lower_case:
						token=token.lower()
						
					wp_tokens=self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(token))
					wp_sentence.extend(wp_tokens)
					wp_labels.append(label)
					for wp_tok in wp_tokens[1:]:
						wp_labels.append(-100)

				# close the sentence with a single [SEP] once all of its tokens have been added
				wp_sentence.append(self.tokenizer.convert_tokens_to_ids("[SEP]"))
				wp_labels.append(-100)

				if len(wp_sentence) >= 512:
					print("sentence is longer than BERT's 512 wp token max: %s" % len(wp_sentence))
					sys.exit(1)

				current_x.append(wp_sentence)
				current_y.append(wp_labels)
				
				words=[x[0] for x in sentence]
				current_o.append(words)

			# batch each sentence to the max size within that batch
			max_len=max([len(x) for x in current_x])
			attention_mask=np.ones((len(current_x), max_len))
			for idx in range(len(current_x)):
				for i in range(len(current_x[idx]), max_len):
					current_x[idx].append(0)
					current_y[idx].append(-100)
					attention_mask[idx][i]=0

			batches_original.append(current_o)
			batches_x.append(torch.LongTensor(current_x))
			batches_y.append(torch.LongTensor(current_y))
			batches_attention.append(torch.LongTensor(attention_mask))
			
		# each sentence in each batch has:
		# -- word piece token ids (batches_x)
		# -- attention mask (noting which tokens are just padding)
		# -- NER labels (one per word piece token; -100 marks positions the loss ignores)
		# -- original (i.e., non-word piece) words
		return batches_x, batches_attention, batches_y, batches_original
  

	def forward(self, input_ids, attention_mask): 
	
		input_ids=input_ids.to(self.device)
		attention_mask=attention_mask.to(self.device)
			
		output = self.bert(input_ids=input_ids,
						 attention_mask=attention_mask,
						 output_hidden_states=True)

		hidden_states=output["hidden_states"]
		out=hidden_states[-1]

		logits = self.fc(out)

		return logits
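
The key bookkeeping step in `get_batches` is aligning word-level labels with WordPiece tokens: only the first piece of each word keeps the word's label, and everything else is marked -100 so that the loss skips it. Here is a standalone sketch of that idea. The names `align_labels` and `toy_tokenize` are hypothetical, and the stand-in tokenizer simply cuts words into 4-character chunks rather than using a real vocabulary.

```python
IGNORE = -100  # label value skipped by the loss (CrossEntropyLoss(ignore_index=-100))

def align_labels(words, labels, tokenize):
    """Expand word-level labels to word-piece level.

    Only the first piece of each word keeps the word's label; [CLS], [SEP],
    and every later piece get IGNORE.
    """
    pieces, piece_labels = ["[CLS]"], [IGNORE]
    for word, label in zip(words, labels):
        word_pieces = tokenize(word)
        pieces.extend(word_pieces)
        piece_labels.extend([label] + [IGNORE] * (len(word_pieces) - 1))
    pieces.append("[SEP]")
    piece_labels.append(IGNORE)
    return pieces, piece_labels

# Stand-in tokenizer for illustration only: splits words into chunks of 4 characters.
def toy_tokenize(word):
    chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]

pieces, piece_labels = align_labels(["Washington", "is", "rainy"], [3, 0, 0], toy_tokenize)
print(pieces)        # ['[CLS]', 'Wash', '##ingt', '##on', 'is', 'rain', '##y', '[SEP]']
print(piece_labels)  # [-100, 3, -100, -100, 0, 0, -100, -100]
```

Because exactly one position per original word carries a real label, predictions can later be read off at those positions and mapped back to words one to one.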

### Using BERT for NER

(Note: No deliverables are in this section; it's optional).

These methods are good examples of how to train a language model for a task of your choosing and carry out predictions on unseen input.

`train_and_evaluate` takes in various settings for a BERT model (such as how large the embeddings are and whether the model should lowercase all the inputs), instantiates a `BERTClassifier`, and trains the model for our NER task. After each epoch, it calls the `get_span_f1` method you are writing for this assignment on the validation data, which provides a gauge for how well the model is performing NER. Whenever that F1 beats the best score so far, the method [saves](https://pytorch.org/tutorials/beginner/saving_loading_models.html) the model, and it ultimately reloads and returns that best model.

`predict` generates predictions (applies BIO tags) for the data it is given. For each token in each sentence it tags, it returns the predicted BIO tag alongside the correct label.
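
`predict` also repairs ill-formed tag sequences: an `I-` tag that is not preceded by a tag of the same category is rewritten as a `B-` tag. The sketch below shows that kind of repair in isolation. `repair_bio` is a hypothetical helper, and it makes one assumption that goes beyond the code below: it also forgets the open category when it sees an `O` tag.

```python
def repair_bio(tags):
    """Turn any I- tag that does not continue an open entity of the same
    category into a B- tag, so every entity starts with B-."""
    repaired = []
    open_cat = None  # category of the entity currently being extended
    for tag in tags:
        prefix, _, cat = tag.partition("-")
        if prefix == "I" and cat != open_cat:
            tag = "B-" + cat
        # B and I tags keep an entity open; O closes it
        open_cat = cat if prefix in ("B", "I") else None
        repaired.append(tag)
    return repaired

print(repair_bio(["I-PER", "I-PER", "O", "I-LOC", "B-ORG", "I-PER"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-ORG', 'B-PER']
```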


In [75]:
def train_and_evaluate(bert_model_name, model_filename, embedding_size, num_epochs, doLowerCase =None):

  bert_model = BERTClassifier(params={"doLowerCase": doLowerCase, "model_name": bert_model_name, "embedding_size":embedding_size, "label_length": len(labels)})
  bert_model.to(device)

  train_batch_x, train_attention, train_batch_y, train_original = bert_model.get_batches(train)
  dev_batch_x, dev_attention, dev_batch_y, dev_original = bert_model.get_batches(dev)

  optimizer = torch.optim.Adam(bert_model.parameters(), lr=1e-5)
  cross_entropy=nn.CrossEntropyLoss(ignore_index=-100)

  bestF1 = 0.

  for epoch in range(num_epochs):
    print('Epoch', epoch + 1)
    
    # Train
    bert_model.train()

    for x, a, y in tqdm.notebook.tqdm(list(zip(train_batch_x, train_attention, train_batch_y))):
      y=y.to(device)
      y_pred = bert_model.forward(x, a)
      loss = cross_entropy(y_pred.view(-1, bert_model.num_labels), y.view(-1))
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
    
    # Evaluate
    dev_predictions=predict(bert_model, dev_batch_x, dev_attention, dev_batch_y, dev_original)

    f1=get_span_f1(dev_predictions)

    print("span F1: %.3f" % f1)
    if f1 > bestF1:
      print("%.3f is better than %.3f, saving ..."  % (f1, bestF1))
      torch.save(bert_model.state_dict(), model_filename)
      bestF1 = f1

  bert_model.load_state_dict(torch.load(model_filename))

  return bert_model

def predict(model, batch_x, attention, batch_y, original):
    model.eval()
    dev_predictions=[]
    corrected_ill_formed=0

    for x, a, y, o in zip(batch_x, attention, batch_y, original):
      y=y.to(device)
      y_pred = model.forward(x, a)
      size=y_pred.shape
      y_pred=y_pred.detach().cpu()

      for sentence in range(size[0]):
        sentence_preds=[]
        o_indx=0
        open_cat=None

        # start at token index 1 to skip [CLS] token
        for token in range(1,size[1]):
          # ignore CLS, SEP, padding, and all but the first WP token (all marked with a label of -100)
          if y[sentence][token] != -100:
            pred=int(np.argmax(y_pred[sentence][token]))
            true_label=rev_labels[int(y[sentence][token])]
            pred_label=rev_labels[pred]

            # if an I- label is not preceded by a B- label of the same category, change that I- to a B-.
            label_parts=pred_label.split("-")
            if len(label_parts) == 2:
              bio=label_parts[0]
              cat=label_parts[1]

              # other small corrections for ill-formed tokens
              if bio == "I" and open_cat != cat:
                pred_label="B-%s" % cat
                open_cat=cat

              if bio == "B":
                open_cat=cat
            
            sentence_preds.append((o[sentence][o_indx], pred_label, true_label))
            o_indx+=1
        dev_predictions.append(sentence_preds)
    # return only after every batch has been processed
    return dev_predictions

Now, you'll work on two methods that will help us evaluate BERT's performance at NER.

***


## Deliverable 1 - `get_spans()`

As input, this method will take in a list of strings in BIO notation. The method should parse each relevant entity.

As output, return the entities with their span boundaries. 

See the assignment .pdf for a detailed explanation of BIO notation.

Example input: `["O", "B-PER", "I-PER", "O", "B-PER", "B-LOC", "I-LOC", "I-LOC", "O"]`

Example output: `{ (1,2,PER), (4,4,PER), (5,7,LOC) }`

In [39]:
def get_spans(token_label_list):
	"""Return a set of the spans entailed by the BIO tags.
	
	Args:
			token_label_list: a list of string BIO tags

	Hints:
	- each string in the list can be split into its BIO tag (B, I, or O) and its category (if applicable)
	- every "new" entity starts with B.
	- the same entity can be entailed by multiple tags, and that entity terminates once another B or O tag appears. 
	- use the spans dict to track where an entity begins, ends, and its category
	- an entity of span "length" 1 should have bounds like (2,2); "length" 2 should have bounds like (2,3), and so on

	"""

	 
	
	# BEGIN SOLUTION

	spans = []
	start = None      # index where the currently open entity began
	category = None   # category of the currently open entity

	for ind, val in enumerate(token_label_list):

		# split the tag into its BIO prefix and (if present) its category
		bio, _, cat = val.partition("-")

		# an I tag with the same category as the open entity simply extends it
		if bio == "I" and start is not None and cat == category:
			continue

		# any other tag (B, O, or an ill-formed I) closes the open entity
		if start is not None:
			spans.append((start, ind - 1, category))
			start, category = None, None

		# B opens a new entity; an I with no matching open entity is treated like a B
		if bio in ("B", "I"):
			start, category = ind, cat

	# close an entity that runs to the end of the list
	if start is not None:
		spans.append((start, len(token_label_list) - 1, category))

	return set(spans)

	# END SOLUTION

In [40]:
try_list = ["O", "B-PER", "I-PER", "O", "B-PER", "B-LOC", "I-LOC", "I-LOC", "O"]
get_spans(try_list)

{(1, 2, 'PER'), (4, 4, 'PER'), (5, 7, 'LOC')}
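
Since the end index of each span is inclusive, turning spans back into text needs a `+ 1` when slicing. The sketch below pairs the spans from the example above with a made-up token list of the same length; `span_text` is a hypothetical helper, not part of the assignment.

```python
# Made-up tokens that line up with the example tags; spans copied from the output above.
tokens = ["Then", "Tim", "Cook", "told", "Obama", "New", "York", "City", "rocks"]
spans = {(1, 2, "PER"), (4, 4, "PER"), (5, 7, "LOC")}

def span_text(tokens, spans):
    """Return (category, text) pairs, ordered by where each span starts."""
    # end indices are inclusive, hence the + 1 in the slice
    return [(cat, " ".join(tokens[start:end + 1])) for start, end, cat in sorted(spans)]

print(span_text(tokens, spans))
# [('PER', 'Tim Cook'), ('PER', 'Obama'), ('LOC', 'New York City')]
```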

## Deliverable 2 - `get_span_f1()`

As input, this method will take in `predictions`, a list containing every sentence tagged by the model. The sentences themselves are lists of (token, predicted label, true label) triples. `predictions` could potentially hold hundreds of entries, since this method calculates the overall F1 score for our model. So be sure that your method works when more than a single sentence is passed in.

As output, this method should return the overall F1 score (a decimal number) for the entire model.

See the assignment .pdf for a detailed explanation of F1.

Example input:

```python
predictions = []
predictions.append([('Tim', 'B-PER', 'B-PER'), ('Cook', 'I-PER', 'I-PER'), ('is', 'O', 'O'), ('the', 'O', 'O'), ('CEO', 'O', 'O'), ('of', 'O', 'O'), ('Apple', 'O', 'B-ORG')])
predictions.append([('He', 'O', 'O'), ('started', 'O', 'O'), ('in', 'O', 'O'), ('2011', 'O', 'O')])
get_span_f1(predictions)
```

Example output: 0.667

Why? We correctly identified 1 of the 2 true spans (recall of 1/2) and
our 1 prediction was correct (precision of 1).
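
The arithmetic behind that 0.667 can be sketched from the three pooled counts alone. `f1_from_counts` is a hypothetical helper for illustration; it takes the number of matching spans, predicted spans, and true spans, and guards against division by zero.

```python
def f1_from_counts(n_match, n_pred, n_true):
    """Span-level precision, recall, and F1 from pooled counts."""
    precision = n_match / n_pred if n_pred else 0.0
    recall = n_match / n_true if n_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Counts from the example: 1 matching span, 1 predicted span, 2 true spans.
p, r, f1 = f1_from_counts(1, 1, 2)
print("P=%.3f R=%.3f F1=%.3f" % (p, r, f1))  # P=1.000 R=0.500 F1=0.667
```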

In [2]:
list_try = [('Tim', 'B-PER', 'B-PER'), ('Cook', 'I-PER', 'I-PER'), ('is', 'O', 'O'), ('the', 'O', 'O'), ('CEO', 'O', 'O'), ('of', 'O', 'O'), ('Apple', 'O', 'B-ORG')]
flat_list = []
for sublist in list_try:
    for i in range(len(sublist)):
      if i!=0:
        flat_list.append(sublist[i])
flat_list

['B-PER',
 'B-PER',
 'I-PER',
 'I-PER',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-ORG']

In [81]:
a = { (0, 1, 'PER') }
b = { (0, 1, 'PER'), (1, 1, 'PER') }

a.intersection(b)

{(0, 1, 'PER')}

In [3]:
matches = [2,4]
lengths = [3,5]  # not named `len`, which would shadow the built-in len() used in later cells
p = 6/8
p_2 = 2/3 + 4/5
print(p,p_2)

0.75 1.4666666666666668


In [45]:
ff = set({1,2})
len(ff)

2

In [46]:
ff = [(0, 1, 'PER'),(0,1,'RDF')]
len(ff)

2

In [41]:
def get_span_f1(predictions):
	"""Return span F1 Score for predicted BIO tags
			
			Args:
					predictions, a list of sentences w/ their BIO tag predictions (from the model) and gold standard labels

			Hints:
				- every sentence contains a token with its predicted label and true label
				- recall that your get_spans method extracts spans from a list of BIO-tag labels
				- use set intersection to find the number of matches between the predicted spans and true spans
				- avoid division by 0!
				- this metric needs to be evaluated for the entire model, so only calculate precision, recall and then F1 once

	"""

	# BEGIN SOLUTION
	n_match = 0   # predicted spans that exactly match a true span
	n_pred = 0    # total predicted spans
	n_true = 0    # total true spans

	# loop through each sentence
	for sent in predictions:

		# each item is a (token, predicted label, true label) triple
		pred_labels = [pred for _, pred, _ in sent]
		true_labels = [true for _, _, true in sent]

		pred_set = get_spans(pred_labels)
		true_set = get_spans(true_labels)

		# compare spans within the sentence, then pool the counts over all sentences
		n_match += len(pred_set.intersection(true_set))
		n_pred += len(pred_set)
		n_true += len(true_set)

	# avoid division by 0
	p = n_match / n_pred if n_pred > 0 else 0.0
	r = n_match / n_true if n_true > 0 else 0.0

	if p == 0 or r == 0:
		return 0.0

	return 2*p*r/(p+r)

	# END SOLUTION

In [42]:
predictions = []
predictions.append([('Tim', 'B-PER', 'B-PER'), ('Cook', 'I-PER', 'I-PER'), ('is', 'O', 'O'), ('the', 'O', 'O'), ('CEO', 'O', 'O'), ('of', 'O', 'O'), ('Apple', 'O', 'B-ORG')])
predictions.append([('He', 'O', 'O'), ('started', 'O', 'O'), ('in', 'O', 'O'), ('2011', 'O', 'O')])
get_span_f1(predictions)

In [24]:
predictions

[[('Tim', 'B-PER', 'B-PER'),
  ('Cook', 'I-PER', 'I-PER'),
  ('is', 'O', 'O'),
  ('the', 'O', 'O'),
  ('CEO', 'O', 'O'),
  ('of', 'O', 'O'),
  ('Apple', 'O', 'B-ORG')],
 [('He', 'O', 'O'),
  ('started', 'O', 'O'),
  ('in', 'O', 'O'),
  ('2011', 'O', 'O')]]


## Further Exploration

(Note: No deliverables are in this section; it's optional).

To see how your span F1 measure is used during training, run the cells below; `train_and_evaluate` calls the functions you wrote for deliverables 1 and 2.

First, let's use a smaller version of BERT that sacrifices accuracy for speed of training, so you should see results within five minutes. Improvement should be rapid, and after 10 epochs your F1 should be around .6

In [None]:
# Bert Tiny - 2 layers, 128 dimensional embeddings, doLowerCase = True
ner_f1_bert_tiny_uncased_model=train_and_evaluate("google/bert_uncased_L-2_H-128_A-2", "ner-f1-bert-tiny-uncased", embedding_size=128, num_epochs = 10, doLowerCase=True)

We can use that trained model to make predictions about new sentences.

In [None]:
def analyze_sentence(model, sentence):

  toks=nltk.word_tokenize(sentence)
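  # 0 is a placeholder label id, since the true labels of a new sentence are unknown
  batch=[[(word, 0) for word in toks]]
  # tokenize with the model that was passed in, not a hard-coded global model
  predict_batch_x, predict_attention, predict_batch_y, predict_original=model.get_batches(batch)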
  dev_predictions=predict(model, predict_batch_x, predict_attention, predict_batch_y, predict_original)
  for sent in dev_predictions:
    for tok, pred, _ in sent:
      print(tok, pred)

In [None]:
analyze_sentence(ner_f1_bert_tiny_uncased_model, "John is from Washington, DC")

### Model Improvements

Google has released a number of smaller BERT models with fewer layers (2, 4, 6, 8, 10) and smaller dimensions (128, 256, 512) that effectively trade off accuracy for speed. For the prior portions of the assignment, we used a model with only 2 layers and an embedding size of 128. Plus, that model was uncased (so all text is lowercase).

The size and parameters of the model we use for this task will no doubt affect our performance. Try experimenting with a larger BERT model and see how much performance improves (relative, of course, to other important factors like training time!). 

To use these models in the `transformers` library that we have been using, the correct name of the model can be derived from the URL linking to it:

https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip -> `google/bert_uncased_L-2_H-128_A-2`

All of the smaller models are uncased (so all text is lowercase), so be sure to set `doLowerCase` to be true if needed. You'll also need to change the embedding_size parameter to this function based on the H value from the model (listed both on the BERT [Github page](https://github.com/google-research/bert#bert) and in the model's URL). One sample model is provided below.
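
The naming rule above can be sketched as a small helper that turns a download URL into the model name and reads the embedding size off the H value. `model_spec_from_url` is a hypothetical function written for this example; it assumes the URL ends in a file name shaped like the one shown above.

```python
import re

def model_spec_from_url(url):
    """Return (model name, embedding size) for a miniature BERT download URL."""
    stem = url.rsplit("/", 1)[-1]          # e.g. uncased_L-2_H-128_A-2.zip
    if stem.endswith(".zip"):
        stem = stem[:-len(".zip")]
    hidden = re.search(r"_H-(\d+)_", stem)  # the H value is the embedding size
    if hidden is None:
        raise ValueError("no H value found in %r" % url)
    return "google/bert_" + stem, int(hidden.group(1))

url = "https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip"
print(model_spec_from_url(url))  # ('google/bert_uncased_L-2_H-128_A-2', 128)
```

The two returned values correspond to the first argument and the `embedding_size` argument of `train_and_evaluate`.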



In [None]:
# Bert Medium - 8 layers, 512 dimensional embeddings, doLowerCase = True
ner_f1_bert_medium_uncased_model=train_and_evaluate("google/bert_uncased_L-8_H-512_A-8", "ner-f1-bert-medium-uncased", embedding_size=512, num_epochs = 5, doLowerCase=True)

In [None]:
# Bert BASE - 12 layers, 768 dimensional embeddings, doLowerCase = False (this model is cased)
# This is named differently because it is the "standard" BERT model (what we saw in lecture and is described in Devlin et al. 2019)
ner_f1_bert_base_cased_model=train_and_evaluate("bert-base-cased", "ner-f1-bert-base-cased", embedding_size=768, num_epochs = 5, doLowerCase=False)

In [None]:
analyze_sentence(ner_f1_bert_base_cased_model, "John is from Washington, DC")