# Named Entity Recognition (NER) using BERT

### **Problem statement:** Find and highlight the organisation names in the given statements. 

> **Input:** Conflict of interest Name is an employee and stockholder of Novo Nordisk A/S. \
> **Output:** Conflict of interest Name is an employee and stockholder of < e1 >Novo Nordisk A/S< /e1 >.


> **Input:** I have conflicting interests with Microsoft, Google and Myntra. \
> **Output:** I have conflicting interests with < e1 >Microsoft< /e1 >, < e1 >Google< /e1 > and < e1 >Myntra< /e1 >.

### **Solution summary:** 
The solution uses the pretrained BERT model `dslim/bert-base-NER`, available on the Hugging Face model repository [[LINK]](https://huggingface.co/dslim/bert-base-NER). The model is already fine-tuned for NER, so it can be applied to the task at hand without further training. Preprocessing, which covers tokenising the words of a sentence and encoding them, is readily available through the `pipeline` module of the transformers library. The main concern is therefore to parse the model outputs correctly, which requires understanding the output structure.

![Flow Diagram](figures/flow_diagram.png)

### **This notebook describes the steps involved in implementing the solution and explains the reasons behind particular design choices.**

In [14]:
from copy import deepcopy
from collections import OrderedDict
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

## Choice of architecture

BERT (Bidirectional Encoder Representations from Transformers) is a transformer model that can be used for a variety of NLP tasks such as question answering, next-sentence prediction, text classification and named entity recognition.

Generally, the important step in customising BERT for different downstream tasks is to use the final hidden representation generated for each input token.

![BERT for NER](figures/BERT_NER.png)

For the NER task, the final hidden representation of each token is passed through a classification layer followed by the softmax function, and the class with the maximum probability is selected.

![Softmax](figures/softmax.png)
where **t** refers to a particular class among the given classes and **h** refers to the corresponding hidden representation.
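
The selection step can be illustrated with a minimal sketch in plain Python. The label list and the logit values below are hypothetical and chosen only for illustration; the actual label mapping is stored in the model configuration, and the real model computes the logits from **h**.

```python
import math

# Illustrative label subset (an assumption, not the model's actual id-to-label order).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the most probable label and its probability for one token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical scores for a single token, one score per label.
token_logits = [0.1, 0.3, -1.2, 4.0, 0.5]
label, prob = predict_label(token_logits)
print(label, round(prob, 3))
```

The `score` field in the pipeline output plays the role of `prob` here: it is the probability assigned to the predicted class.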
    
### Since organisation names generally start with a capital letter, we use the cased BERT model.

> In situations where capitalisation carries little information, for example when finding pronouns, we can use the uncased BERT model and embeddings instead, which convert the input text to lowercase.
    

In [5]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
ner_classifier = pipeline("ner", model=model, tokenizer=tokenizer)

## Structure of output

In [7]:
ner_classifier("Vijay used to work at Google.")

[{'entity': 'B-PER',
  'score': 0.99945873,
  'index': 1,
  'word': 'Vijay',
  'start': 0,
  'end': 5},
 {'entity': 'B-ORG',
  'score': 0.9989275,
  'index': 6,
  'word': 'Google',
  'start': 22,
  'end': 28}]

#### Here, we are interested in the following attributes of each predicted entity:

- **'entity'** : The predicted class of the token.

- **'word'** : The tokenised (word-piece) representation of the word.

- **'start'** : Character offset at which the token starts in the sentence.

- **'end'** : Character offset at which the token ends in the sentence (exclusive).

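#### In particular, we keep only the following entity classes for our purpose:

- **B-ORG** : Beginning of an organisation name, i.e. its first token (in the output above, 'Google' is tagged B-ORG although no other organisation precedes it)

- **I-ORG** : Inside an organisation name, i.e. a continuation of the organisation started by the preceding token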
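
As a minimal sketch of this filtering step, the snippet below applies it to the pipeline output shown above, copied here as a literal list so that it runs without loading the model. The names `ORG_LABELS` and `org_tokens` are introduced only for this illustration.

```python
# Pipeline output for "Vijay used to work at Google.", copied from the cell above.
bert_output = [
    {'entity': 'B-PER', 'score': 0.99945873, 'index': 1,
     'word': 'Vijay', 'start': 0, 'end': 5},
    {'entity': 'B-ORG', 'score': 0.9989275, 'index': 6,
     'word': 'Google', 'start': 22, 'end': 28},
]

# Only organisation labels are of interest; every other class is ignored.
ORG_LABELS = {'B-ORG', 'I-ORG'}
org_tokens = [ent for ent in bert_output if ent['entity'] in ORG_LABELS]

print([(ent['word'], ent['start'], ent['end']) for ent in org_tokens])
```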

#### The structure of the tokenised word representation also needs to be taken into consideration:

- A leading ## in a word, for example '##tra', indicates that it is not a separate entity but a sub-part of the preceding word.
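
To make the ## convention concrete, here is a small sketch that glues word pieces back onto the word they belong to. The split of 'Myntra' into three pieces is assumed for illustration; the tokenizer's actual split may differ. `merge_wordpieces` is a hypothetical helper, not part of the transformers library.

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed pieces onto the word that precedes them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: drop the marker and append to the last word.
            words[-1] += token[2:]
        else:
            # A token without the marker starts a new word.
            words.append(token)
    return words

pieces = ["My", "##nt", "##ra"]  # assumed word-piece split
print(merge_wordpieces(pieces))
```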

## Post-processing (Parsing)

<img src="figures/parsing.png" alt="drawing" width="800"/>

In [36]:
def bert_parser(entity_lst):
    '''
    Parsing function to parse and select the organisation names from the entity_lst output from BERT model.
    Returns an OrderedDict mapping each organisation name to its 'start' and 'end' character offsets.
    Names are used as keys, so an organisation mentioned twice keeps a single entry.
    '''
    
    org_dict = OrderedDict()
    
    for word_dict in entity_lst:
        
        word_ent = word_dict['word']
        is_subword = word_ent.startswith("##")
        # Strip only the leading '##' marker that flags a word piece
        cur_word = word_ent[2:] if is_subword else word_ent
        start_pos = word_dict['start']
        end_pos = word_dict['end']
        entity_type = word_dict['entity']
        
        if entity_type == 'B-ORG' and not is_subword:
            # First token of a new organisation name
            org_dict[cur_word] = {'start':start_pos, 'end':end_pos}
        
        elif entity_type == 'I-ORG' and org_dict:
            # Continuation token: merge it into the most recent organisation.
            # The 'and org_dict' guard skips an I-ORG that has no preceding B-ORG.
            prev_word, prev_span = org_dict.popitem()
            new_word = prev_word + cur_word
            org_dict[new_word] = {'start':prev_span['start'], 'end':end_pos}
            
    return org_dict
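
Because the dictionary is keyed on the concatenated word pieces, the keys lose the spaces between words, and a name that occurs twice in a sentence would share one key. One possible alternative, sketched below and not part of the original solution, is to track only the character spans and read the names back from the input text. The function `group_org_spans` and the token list are hypothetical: the tokens are hand-written for illustration and assume a plausible word-piece split, not actual model output.

```python
def group_org_spans(entity_lst):
    """Merge ORG tokens into (start, end) character spans, in order of appearance."""
    spans = []
    for token in entity_lst:
        is_subword = token["word"].startswith("##")
        if token["entity"] == "B-ORG" and not is_subword:
            # A new organisation starts at this token.
            spans.append([token["start"], token["end"]])
        elif token["entity"] == "I-ORG" and spans:
            # Continuation: extend the end of the most recent span.
            spans[-1][1] = token["end"]
    return [tuple(span) for span in spans]

query = "Conflict of interest Name is an employee and stockholder of Novo Nordisk A/S."

# Hand-written tokens (assumed split and labels), using the same keys as the pipeline output.
entities = [
    {"entity": "B-ORG", "word": "Novo", "start": 60, "end": 64},
    {"entity": "I-ORG", "word": "Nord", "start": 65, "end": 69},
    {"entity": "I-ORG", "word": "##isk", "start": 69, "end": 72},
    {"entity": "I-ORG", "word": "A", "start": 73, "end": 74},
    {"entity": "I-ORG", "word": "/", "start": 74, "end": 75},
    {"entity": "I-ORG", "word": "S", "start": 75, "end": 76},
]

spans = group_org_spans(entities)
# Slicing the original text preserves the spacing inside each name.
names = [query[start:end] for start, end in spans]
print(spans, names)
```

Keeping spans in a list preserves repeated mentions, since each occurrence has its own offsets.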

In [34]:
def add_tags(input_txt, org_dict):
    '''
    Wrap every organisation span listed in org_dict in <e1>...</e1> tags.
    '''
    tagged_txt = input_txt
    # Insert tags from the last span to the first, so that the character
    # offsets of the spans still to be processed remain valid.
    spans = sorted(org_dict.values(), key=lambda span: span['start'], reverse=True)
    for span in spans:
        start_pos, end_pos = span['start'], span['end']
        tagged_txt = (tagged_txt[:start_pos] + "<e1>" + tagged_txt[start_pos:end_pos]
                      + "</e1>" + tagged_txt[end_pos:])
    return tagged_txt

In [49]:
def Organisation_Search_Utility(query_txt, ner_model):
    '''
    Final wrapper function with all the steps performed at once to obtain the required output
    '''
    bert_output = ner_model(query_txt)
    org_dct = bert_parser(bert_output)
    print("Ordered dictionary of organisations:", org_dct)
    tagged_txt = add_tags(query_txt, org_dct)
    print("\nTagged output:")
    print(tagged_txt)

## Final testing of the pipeline

In [51]:
#Test case 1
query = "I have conflicting interests with Microsoft, Google and Myntra"

In [52]:
Organisation_Search_Utility(query, ner_classifier)

Ordered dictionary of organisations: OrderedDict([('Microsoft', {'start': 34, 'end': 43}), ('Google', {'start': 45, 'end': 51}), ('Myntra', {'start': 56, 'end': 62})])

Tagged output:
I have conflicting interests with <e1>Microsoft</e1>, <e1>Google</e1> and <e1>Myntra</e1>
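
A quick way to sanity-check a tagged output is to read the tagged names back and to confirm that removing the tags restores the original query. The sketch below does this with regular expressions on the strings from this test case; it is an optional check added for illustration, not a step of the pipeline.

```python
import re

query = "I have conflicting interests with Microsoft, Google and Myntra"
tagged = ("I have conflicting interests with <e1>Microsoft</e1>, "
          "<e1>Google</e1> and <e1>Myntra</e1>")

# Non-greedy match so that each <e1>...</e1> pair is captured separately.
tagged_names = re.findall(r"<e1>(.*?)</e1>", tagged)

# Removing every opening and closing tag should give back the input text.
untagged = re.sub(r"</?e1>", "", tagged)

print(tagged_names)
print(untagged == query)
```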


In [53]:
#Test case 2
query = "Conflict of interest Name is an employee and stockholder of Novo Nordisk A/S."

In [54]:
Organisation_Search_Utility(query, ner_classifier)

Ordered dictionary of organisations: OrderedDict([('NovoNordiskA/S', {'start': 60, 'end': 76})])

Tagged output:
Conflict of interest Name is an employee and stockholder of <e1>Novo Nordisk A/S</e1>.


### More test cases for robustness (and out of curiosity :-) )

In [60]:
#extra test case
query = "I have conflicting interests with Zerodha, LayerIV and Cred"

In [59]:
Organisation_Search_Utility(query, ner_classifier)

Ordered dictionary of organisations: OrderedDict([('Zerodha', {'start': 34, 'end': 41}), ('LayIV', {'start': 43, 'end': 50}), ('Cred', {'start': 55, 'end': 59})])

Tagged output:
I have conflicting interests with <e1>Zerodha</e1>, <e1>LayerIV</e1> and <e1>Cred</e1>


In [61]:
#extra test case
query = "Stock prices of SAIL are roaring high this week."

In [62]:
Organisation_Search_Utility(query, ner_classifier)

Ordered dictionary of organisations: OrderedDict([('SAIL', {'start': 16, 'end': 20})])

Tagged output:
Stock prices of <e1>SAIL</e1> are roaring high this week.


In [63]:
#extra test case
query = "Google Cloud Platform is quite important for machine learning deployments."

In [64]:
Organisation_Search_Utility(query, ner_classifier)

Ordered dictionary of organisations: OrderedDict([('GoogleCloudPlatform', {'start': 0, 'end': 21})])

Tagged output:
<e1>Google Cloud Platform</e1> is quite important for machine learning deployments.
