# **AdapterHub** quickstart example for **chunk phrase** extraction

### In this notebook, we focus on extracting **noun** phrases. Because the model was trained on all chunk phrase types, the other chunk types can be extracted just as easily. See the available labels below.

Let's install adapter-transformers from the GitHub master branch and import the required modules.

In [None]:
!pip install git+https://github.com/adapter-hub/adapter-transformers.git

In [None]:
from typing import Dict
import string
import numpy as np
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

Here are the chunk labels in IOB format.

In [None]:
labels = ["O", "B-ADVP", "B-INTJ", "B-LST", "B-PRT", "B-NP", "B-SBAR", "B-VP", "B-ADJP", "B-CONJP", "B-PP",
               "I-ADVP", "I-INTJ", "I-LST", "I-PRT", "I-NP", "I-SBAR", "I-VP", "I-ADJP", "I-CONJP", "I-PP"]
label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)}
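
To make the IOB scheme concrete, here is a minimal sketch that groups a hand-tagged sentence into typed chunks. The sentence, its tags, and the helper name `group_iob` are illustrative assumptions, not model output: a `B-` tag opens a chunk, an `I-` tag of the same type continues it, and `O` marks a word outside any chunk.

```python
from typing import List, Tuple

def group_iob(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged words into (chunk_type, phrase) pairs."""
    chunks = []
    current = None  # (chunk_type, [words]) of the chunk being built
    for word, tag in zip(words, tags):
        if tag == "O":
            current = None  # the word lies outside any chunk
            continue
        prefix, chunk_type = tag.split("-", 1)
        # "B-" always opens a chunk; "I-" continues the open chunk
        # only when that chunk has the same type.
        if prefix == "B" or current is None or current[0] != chunk_type:
            current = (chunk_type, [word])
            chunks.append(current)
        else:
            current[1].append(word)
    return [(chunk_type, " ".join(ws)) for chunk_type, ws in chunks]

# Hand-written tags, for illustration only.
words = ["The", "quick", "fox", "jumped", "over", "the", "fence", "."]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "O"]
example_chunks = group_iob(words, tags)
print(example_chunks)
```

The rest of the notebook only looks at `B-NP` and `I-NP`, but the same grouping applies to every chunk type in the label list.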

Next, we load a standard BERT model and its tokenizer.


In [None]:
model_name = "bert-base-uncased"
config = AutoConfig.from_pretrained(model_name,
                                    num_labels=len(labels),
                                    id2label=label_map,
                                    label2id={label: i for i, label in enumerate(labels)})
model = AutoModelForTokenClassification.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Now we'll load the chunking adapter. It is lightweight, at approximately 3 MB. The F1 score of this model was 91.3. We can now use the adapter to predict the chunking tags of the words in a sentence:

In [None]:
model.load_adapter("chunk/conll2003@vblagoje", "text_task")

We'll also need a helper function to wrap model inference.

In [None]:
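def predict(sentence):
    # Encode the sentence to input ids, adding [CLS] and [SEP]
    tokens = tokenizer.encode(
        sentence,
        return_tensors="pt",
        truncation="only_first",
        max_length=tokenizer.max_len,
    )
    # The first output holds the logits: (batch, sequence, num_labels)
    preds = model(tokens, adapter_names=['chunk'])[0]
    preds = preds.detach().numpy()
    # Pick the highest-scoring label id for every token
    preds = np.argmax(preds, axis=2)
    return tokenizer.tokenize(sentence), preds.squeeze()[1:-1]  # chop off [CLS] and [SEP]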
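
The argmax step in `predict` may be easier to follow on toy data. The sketch below uses made-up logits and a reduced, hypothetical three-label set (`toy_labels`) in place of the real model output; only the shapes and the slicing mirror the helper above.

```python
import numpy as np

toy_labels = ["O", "B-NP", "I-NP"]  # reduced, hypothetical label set

# Fake logits with shape (batch=1, sequence=5, labels=3).
# Positions 0 and 4 stand in for [CLS] and [SEP].
toy_logits = np.array([[
    [2.0, 0.1, 0.1],  # [CLS]
    [0.2, 3.0, 0.5],  # "autonomous"
    [0.1, 0.4, 2.5],  # "cars"
    [1.9, 0.3, 0.2],  # "move"
    [2.0, 0.1, 0.1],  # [SEP]
]])

toy_ids = np.argmax(toy_logits, axis=2)  # best label id per token, shape (1, 5)
toy_ids = toy_ids.squeeze()[1:-1]        # drop batch axis and [CLS]/[SEP], shape (3,)
toy_tags = [toy_labels[i] for i in toy_ids]
print(toy_tags)
```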

And a filtering function to clean up the resulting list of noun chunks.

In [None]:
import re

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

# Build the stopword set once instead of reloading the list for every word;
# 'not' and 'can' are excluded from it so they survive filtering
STOPWORDS = set(stopwords.words('english')) - {'not', 'can'}

def filter_chunk(s):
    # Isolate punctuation marks, then remove all of them except '?'
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)
    s = re.sub(r'[^\w\s\?]', ' ', s)
    # Remove some special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    # Remove stopwords except 'not' and 'can'
    s = " ".join(word for word in s.split() if word not in STOPWORDS)
    # Collapse repeated whitespace and strip both ends
    s = re.sub(r'\s+', ' ', s).strip()
    return s
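
As a rough illustration of what the filter does to a chunk, the sketch below applies the same regular expressions with a tiny hand-picked stopword set, so it runs without downloading the NLTK corpus. `DEMO_STOPWORDS` and `filter_chunk_demo` are hypothetical stand-ins; the real NLTK list is much longer, so results on other inputs may differ.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for offline use).
DEMO_STOPWORDS = {"the", "a", "an", "of", "for", "its", "s", "not", "can"}
KEEP = {"not", "can"}

def filter_chunk_demo(s: str) -> str:
    # Isolate punctuation marks, then remove all of them except '?'
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)
    s = re.sub(r'[^\w\s\?]', ' ', s)
    # Remove some special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    # Drop stopwords unless they are explicitly kept
    words = [w for w in s.split() if w not in DEMO_STOPWORDS or w in KEEP]
    return " ".join(words)

print(filter_chunk_demo("the parliament's biggest group,"))
print(filter_chunk_demo("a rare hearing?"))
```

Note that the apostrophe is first isolated and then removed, which leaves a stray `s` token; it disappears only because `s` is in the stopword set.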

Next, we'll need to extract the noun phrase chunks.

In [None]:
def decode(chunk):
    # Merge wordpieces back into a plain string
    return tokenizer.convert_tokens_to_string(chunk)

def extract_chunks(sentence):
    all_chunks = []
    chunks = []
    tokens, label_ids = predict(sentence)
    for w, label_id in zip(tokens, label_ids):
        label = label_map[label_id]
        # is this the start of a new noun phrase?
        if label == 'B-NP':
            if len(chunks) > 0:
                all_chunks.append(decode(chunks))
            chunks = [w]
        # or another word of the current noun phrase?
        elif label == 'I-NP':
            chunks.append(w)

    # flush the last noun phrase
    if len(chunks) > 0:
        all_chunks.append(decode(chunks))

    all_chunks = [filter_chunk(chunk) for chunk in all_chunks]
    all_chunks = [chunk for chunk in all_chunks if len(chunk) > 0]
    return all_chunks
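
Since the adapter predicts every chunk type, the same loop can be parameterized by type. The following is a sketch under stated assumptions: `extract_typed_chunks` and `merge_wordpieces` are hypothetical helpers, the tokens and tags are hand-written rather than predicted, and `merge_wordpieces` only approximates what the BERT tokenizer's `convert_tokens_to_string` does so that the example runs without a model.

```python
from typing import List

def merge_wordpieces(pieces: List[str]) -> str:
    # Stand-in for the tokenizer's convert_tokens_to_string:
    # join with spaces, then glue "##" continuation pieces back on.
    return " ".join(pieces).replace(" ##", "").strip()

def extract_typed_chunks(tokens: List[str], tags: List[str], chunk_type: str) -> List[str]:
    """Collect all chunks of one type, mirroring the B-/I- loop above."""
    begin, inside = f"B-{chunk_type}", f"I-{chunk_type}"
    all_chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == begin:
            if current:
                all_chunks.append(merge_wordpieces(current))
            current = [token]
        elif tag == inside:
            current.append(token)
    # flush the last chunk
    if current:
        all_chunks.append(merge_wordpieces(current))
    return all_chunks

# Hand-written wordpieces and tags, for illustration only.
demo_tokens = ["insurers", "have", "been", "re", "##pr", "##icing", "policies"]
demo_tags = ["B-NP", "B-VP", "I-VP", "I-VP", "I-VP", "I-VP", "B-NP"]
verb_phrases = extract_typed_chunks(demo_tokens, demo_tags, "VP")
noun_phrases = extract_typed_chunks(demo_tokens, demo_tags, "NP")
print(verb_phrases, noun_phrases)
```

In the notebook itself, the tokens and tags would come from `predict` and `label_map`, and the stopword filter would likely need adjusting for chunk types such as `PP`, which consist largely of stopwords.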

In [None]:
print(extract_chunks("Autonomous cars move insurance liability toward manufacturers."))

In [None]:
print(extract_chunks("Norges Bank’s Supervisory Council told key lawmakers gathered for a rare parliamentary hearing that risks remain for conflicts of interest and that rules were broken when manager Nicolai Tangen was hired to head the fund."))

In [None]:
print(extract_chunks("The opposition Labor Party, the parliament’s biggest group, has yet to decide on whether it will push for the committee to get the government involved, its deputy leader Hadia Tajik said by phone before the hearing"))