# Named Entity Recognition
by [CSpanias](https://cspanias.github.io/aboutme/), 4th Week's Project for [Solving Business Problems with NLP](https://omdena.com/course/solving-business-problems-with-nlp/) by Omdena, 03/2022

# CONTENT
1. [Import Data](#docs)
1. [Named Entity Recognition with spaCy](#ner)

This project's goal is to __identify the named entities__ in the documents and extract the following information:
- Names of the parties involved in the contract
- Place of the contract
- Date of the contract and contract duration
- Contract value (amount of money)

We also need to structure the extracted information in a __tabular format__.

<a name='docs'></a>
# 1. Import Data

First, we need to import the data in a suitable form. We currently have __510__ `.txt` files.

Our goal is to build a list in which each element holds the contents of one file.

In [140]:
import os

# folder containing the contract .txt files
path = r"C:\Users\10inm\nlp_resources\nlp_omdena\w4\legal_contract_txt"

# create empty list
docs = []
# iterate through all files in the folder
for file in os.listdir(path):
    # check whether file is in text format or not
    if file.endswith(".txt"):
        # build the full path with the separator of the current OS
        file_path = os.path.join(path, file)
        with open(file_path, encoding="UTF-8") as f:
            docs.append(f.read())
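
If the extracted entities later need to be traced back to their source contract, it can help to keep each file name next to its text. The following is a minimal sketch using `pathlib`; the helper name `load_txt_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def load_txt_files(folder):
    """Return a list of (file name, file contents) pairs for every .txt file in `folder`."""
    pairs = []
    # sorted() makes the order reproducible across operating systems
    for file_path in sorted(Path(folder).glob("*.txt")):
        pairs.append((file_path.name, file_path.read_text(encoding="UTF-8")))
    return pairs
```

Calling `[text for _, text in load_txt_files(path)]` would then give the same kind of plain list of contents that the rest of this notebook works with, although possibly in a different order.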

In [141]:
print(f"The list `docs` contains {len(docs)} elements. The first 100 characters of the 1st element are:\n\n{docs[0][:100]}")

The list `docs` contains 510 elements. The first 100 characters of the 1st element are:

CO-BRANDING AND ADVERTISING AGREEMENT

THIS CO-BRANDING AND ADVERTISING AGREEMENT (the "Agreement") 


<a name='ner'></a>
# 2. Named Entity Recognition with `spaCy`

More info on NER with spaCy [here](https://spacy.io/usage/linguistic-features#named-entities).

In [142]:
import pandas as pd
import spacy

# load the spaCy pipeline once instead of on every call
nlp = spacy.load("en_core_web_sm")

def tabular_ner(file, tag=None):
    """Return the named entities of `file` as a DataFrame, optionally filtered by `tag`."""
    # apply pipeline to file
    doc = nlp(file)

    # extract the text and label of every named entity
    rows = [(ent.text, ent.label_) for ent in doc.ents]

    # convert NER to tabular format
    df = pd.DataFrame(rows, columns=['Text', 'Label'])

    # if no specific tag is provided
    if tag is None:
        # return whole dataframe
        return df
    # otherwise return only the named entities with that tag
    return df[df['Label'] == tag]

In [144]:
# return all named entities
tabular_ner(docs[0])

| | Text | Label |
|---|---|---|
| 0 | CO-BRANDING AND ADVERTISING AGREEMENT | ORG |
| 1 | June 21, 1999 | DATE |
| 2 | the "Effective Date | LAW |
| 3 | INC. | GPE |
| 4 | S. Amphlett Blvd | PERSON |
| ... | ... | ... |
| 311 | 6 | CARDINAL |
| 312 | 7 | CARDINAL |
| 313 | 2THEMART COM INC | ORG |
| 314 | 10-12 | CARDINAL |


In [145]:
# find the persons involved in a document
tabular_ner(docs[0], tag='PERSON').head()

| | Text | Label |
|---|---|---|
| 4 | S. Amphlett Blvd | PERSON |
| 5 | Suite 233 | PERSON |
| 8 | 94402 | PERSON |
| 10 | Von Karman Avenue | PERSON |
| 157 | Trojan | PERSON |


In [147]:
# find organizations included in doc
tabular_ner(docs[0], tag='ORG').head()

| | Text | Label |
|---|---|---|
| 0 | CO-BRANDING AND ADVERTISING AGREEMENT | ORG |
| 17 | DEFINITIONS | ORG |
| 19 | CO-BRANDED SITE | ORG |
| 20 | Domain Name | ORG |
| 22 | Co-Branded Site | ORG |


In [148]:
# find dates included in doc
tabular_ner(docs[0], tag='DATE').head()

| | Text | Label |
|---|---|---|
| 1 | June 21, 1999 | DATE |
| 9 | 18301 | DATE |
| 15 | 92612 | DATE |
| 41 | 8/26/1999 | DATE |
| 52 | 2 months | DATE |
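
The `DATE` entities above mix calendar dates, durations, and numbers that look like zip codes. Since the project asks for both the contract date and the contract duration, one possible post-processing step is to sort each `DATE` string into one of these groups. The sketch below uses only the standard library; the function name `classify_date_entity`, the two date formats, and the duration pattern are assumptions chosen to match the strings in this output, not part of the original notebook.

```python
import re
from datetime import datetime

# Date formats seen in the output above; extend for other contracts (assumption)
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y")
# Hypothetical pattern for durations such as "2 months" or "30 days"
DURATION_PATTERN = re.compile(r"^\d+\s+(day|week|month|year)s?$", re.IGNORECASE)


def classify_date_entity(text):
    """Return 'date', 'duration', or 'other' for the text of a DATE entity."""
    cleaned = text.strip()
    # try each known calendar-date format in turn
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(cleaned, fmt)
            return "date"
        except ValueError:
            continue
    # fall back to the duration pattern
    if DURATION_PATTERN.match(cleaned):
        return "duration"
    return "other"
```

Applied to the `Text` column of the `DATE` rows, this would separate strings such as `June 21, 1999` and `2 months` from bare numbers such as `18301`.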


In [149]:
# find places included in doc
tabular_ner(docs[0], tag='GPE').head()

| | Text | Label |
|---|---|---|
| 3 | INC. | GPE |
| 6 | San Mateo | GPE |
| 7 | California | GPE |
| 12 | Floor | GPE |
| 13 | Irvine | GPE |


In [151]:
# find money sums included in doc
tabular_ner(docs[0], tag='MONEY').head()

| | Text | Label |
|---|---|---|
| 21 | 2TheMart Marks | MONEY |
| 132 | 2TheMart Marks | MONEY |
| 273 | 215 | MONEY |
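
To move from these per-tag tables towards the goal of one structured row per contract, the `(Text, Label)` pairs can be collapsed into a dictionary keyed by the project's questions. The following is a minimal sketch under stated assumptions: the mapping `QUESTION_TO_LABEL` and the helper `summarise_entities` are hypothetical, and mapping "parties" to `ORG` alone is a simplification (parties may also appear as `PERSON`).

```python
# Assumed mapping from the project's questions to spaCy entity labels
QUESTION_TO_LABEL = {
    "Parties": "ORG",
    "Place": "GPE",
    "Date": "DATE",
    "Value": "MONEY",
}


def summarise_entities(entities, max_per_field=3):
    """Collapse (text, label) pairs into one row: question -> first few unique matches."""
    row = {}
    for question, label in QUESTION_TO_LABEL.items():
        # dict.fromkeys keeps first-seen order while dropping duplicates
        unique = list(dict.fromkeys(text for text, lab in entities if lab == label))
        row[question] = unique[:max_per_field]
    return row
```

The pairs could come from `zip(df['Text'], df['Label'])` on the DataFrame returned by `tabular_ner`, and a list of such rows (one per document) can be passed to `pd.DataFrame` to get the tabular overview. As the outputs above show, the raw labels are noisy, so the rows would still need manual review or further filtering.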


We can also __visualize the named entities__ using `displacy`.

In [152]:
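import spacy
from spacy import displacy

# run the pipeline on the first document to obtain a Doc object
doc = spacy.load("en_core_web_sm")(docs[0])

# visualise entities
sentence_spans = list(doc.sents)

displacy.render(sentence_spans, style="ent", jupyter=True)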

