# ExxonMobil CodeJam July 2017

This notebook follows the work our team did to extract legal entities from contracts. It has the following sections:

### 1. Load Text Files
### 2. NER Tagging (Initial)
### 3. Preparing New Training Set
### 4. Testing Results

We follow this order because we initially believed Stanford's NER tagger would recognize all of our supplier companies; this was not the case. We therefore spent the latter half of the hackathon creating our own corpus to retrain their model.

### Imports

In [8]:
import glob
import pandas as pd
from itertools import groupby
import string
import pickle
import chardet
from nltk.tokenize import word_tokenize

## 1. Load Text Files

In [1]:
# obtain the contracts saved as text files
txt_directory = '<insert directory here>/*.txt' #can't place the actual directory used 

In [2]:
text_files = glob.glob(txt_directory) #returns a list of every text file in the given directory

In [4]:
# Create a dataframe based on a list of text files (Column 1: Filename; Column 2: Text)
import os

def load_txt(txt_files):
    '''
    Parameters:
    -----------
    txt_files - list of strings that contain the paths of each text file

    Returns:
    --------
    DataFrame - pandas DataFrame with 2 columns (filename, text in file)
    '''
    text_dict = {'File': [], 'Text': []}
    for filename in txt_files:

        with open(filename, 'r', encoding='utf-8') as file:
            text = file.read().strip()

        # file name without its directory or extension, using the OS's own path rules
        name = os.path.splitext(os.path.basename(filename))[0]

        text_dict['File'].append(name)
        text_dict['Text'].append(text)

    return pd.DataFrame(text_dict, columns=['File', 'Text'])
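
The loader above assumes every file is UTF-8, and `chardet` is imported but never used. Contract exports are not always UTF-8, so a fallback may be worth having. The following is a minimal sketch using only the standard library; `read_text_with_fallback` and the choice of `cp1252` as the second encoding are assumptions for illustration, not something we used during the CodeJam.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=('utf-8', 'cp1252')):
    """Return the stripped text of `path`, trying each encoding in order."""
    with open(path, 'rb') as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding).strip()
        except UnicodeDecodeError:
            continue
    # last resort: decode with the first encoding and replace undecodable bytes
    return raw.decode(encodings[0], errors='replace').strip()

# small demo: a file saved as cp1252 is not valid UTF-8 once it contains an accented letter
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    with open(sample_path, 'wb') as handle:
        handle.write('Caf\u00e9 Supplies Ltd\n'.encode('cp1252'))
    sample_text = read_text_with_fallback(sample_path)

print(sample_text)
```

Swapping the `open(...)` call in `load_txt` for a helper like this would keep the rest of the function unchanged.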
        

In [5]:
df = load_txt(text_files)

## 2. NER Tagging (Initial)

In [1]:
from nltk.tag.stanford import StanfordNERTagger

In [None]:
model_directory = '<insert where stanford-ner 3 class classifier was saved>'
ner_jar_directory = '<insert stanford-ner jar directory>'

st = StanfordNERTagger(model_directory,
                       ner_jar_directory, encoding='utf-8')

#### extractLegalEntities is the main function to collect the information we wanted (Supplier + ExxonMobil Entity)

In [19]:
def extractLegalEntities(df, tagger):
    '''
    Parameters:
    -----------
    df - dataframe containing two columns: (filename, text in file)
    tagger - instance of Stanford NER Tagger through nltk's API

    Returns:
    --------
    DataFrame - pandas DataFrame containing desired entities found in each contract file
    '''
    
    # initialize dictionary
    entities = {'File': [], 'Supplier': [], 'ExxonMobil Entity': []}
    
    # go through each file in df (text dataframe)
    for row in df.itertuples():
        
        filename = row[1]
        text = row[2]
        
        # tokenize the current text with nltk, then label each token with the supplied tagger
        tags = tagger.tag(word_tokenize(text))
        
        # collect the first 2 continuous chunks of words labeled as organizations
        # source: https://stackoverflow.com/questions/30664677/extract-list-of-persons-and-organizations-using-stanford-ner-tagger-in-nltk
        orgs = []
        count = 0
        for tag, chunk in groupby(tags, lambda x:x[1]):
            if count == 2:
                break
            if tag == 'ORGANIZATION':
                count += 1
                orgs.append(' '.join(w for w, t in chunk))
        
        
        entities['File'].append(filename)
        
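        # determine which org found was the ExxonMobil entity, the other must be supplier;
        # default to None so that every column receives exactly one value per file
        # (otherwise the lists end up with different lengths and the DataFrame cannot be built)
        em_entity = None
        supplier = None
        for org in orgs:
            
            if em_entity is None and ('Exxon' in org or 'Mobil' in org):
                
                em_entity = org
            
            elif supplier is None:
                
                supplier = org
        
        entities['ExxonMobil Entity'].append(em_entity)
        entities['Supplier'].append(supplier)
        
    return pd.DataFrame(entities, columns = ['File', 'ExxonMobil Entity', 'Supplier'])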
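
The chunking step is the least obvious part of the function, so here is a small stand-alone sketch of it. It runs on hand-written `(token, tag)` pairs rather than real tagger output, and `first_org_chunks` and the sample sentence are hypothetical names and data used only for illustration.

```python
from itertools import groupby

# hand-written (token, tag) pairs standing in for the output of tagger.tag(...)
sample_tags = [
    ('This', 'O'), ('agreement', 'O'), ('is', 'O'), ('between', 'O'),
    ('ExxonMobil', 'ORGANIZATION'), ('Global', 'ORGANIZATION'),
    ('Services', 'ORGANIZATION'), ('Company', 'ORGANIZATION'),
    ('and', 'O'),
    ('Acme', 'ORGANIZATION'), ('Drilling', 'ORGANIZATION'),
    ('in', 'O'), ('Houston', 'LOCATION'),
]

def first_org_chunks(tags, limit=2):
    """Join consecutive ORGANIZATION tokens and return the first `limit` chunks."""
    chunks = []
    # groupby yields one group per run of consecutive items sharing the same tag
    for tag, group in groupby(tags, key=lambda pair: pair[1]):
        if tag == 'ORGANIZATION':
            chunks.append(' '.join(word for word, _ in group))
            if len(chunks) == limit:
                break
    return chunks

orgs = first_org_chunks(sample_tags)
print(orgs)  # ['ExxonMobil Global Services Company', 'Acme Drilling']
```

One limitation of this approach: two different organizations that appear back to back, with no non-ORGANIZATION token between them, are merged into a single chunk.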
        

### Preliminary Results:

Because the contracts contain private information, I can't show an example of the results. The key points we noted were:

- The correct entities were extracted for only 40% of the 15 files tested
- The original tagger could not recognize some suppliers as organizations (a few were identified as locations)

**Because of these results, we tried another route to improve the accuracy: building a new corpus and retraining their model on it:**

## 3. Preparing new Training Set

Much of this section is ad hoc. Several developers scraped data separately, so most of the work here is pulling everything together into one training text file.

**Due to the time constraint (we only had 10 hours), we did not look deeply into the required training format. Following Stanford NER's FAQ, we saw it should be in this format:**

word_1 \t label 

word_2 \t label

...

word_n \t label

**Here is a link to the FAQ and tutorial on how to train your own model: https://nlp.stanford.edu/software/crf-faq.shtml#a**
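
To make the format concrete, the sketch below turns already-labeled sentences into tab-separated lines. The helper `to_training_lines`, the sample sentence, and its labels are hypothetical. Putting a blank line between sentences is a common convention for column-formatted NER data and is an assumption here, not something we verified against the FAQ.

```python
def to_training_lines(sentences):
    """Convert a list of sentences, each a list of (token, label) pairs, into file lines."""
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f'{token}\t{label}')
        lines.append('')  # blank line marks the end of a sentence (assumed convention)
    return lines

# hypothetical labeled sentence, keeping the tokens in their original order
sample_sentences = [
    [('Agreement', 'O'), ('with', 'O'), ('Acme', 'ORG'), ('Drilling', 'ORG')],
]

training_lines = to_training_lines(sample_sentences)
print('\n'.join(training_lines))
```

Unlike the token lists built later in this section, this keeps ORG and O tokens interleaved in sentence order.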

**Because another member of the team scraped all ExxonMobil entities and pickled them, I had to load them into this notebook:**

In [6]:
# load pickled dictionary containing ExxonMobil entities
companies_pickled = '<insert directory here>'

with open(companies_pickled, 'rb') as pickle_file:
    
    companies = pickle.load(pickle_file)

**The vendors were also collected by another member, so I read them in from the text file they provided:**

In [7]:
lines = []
vendors_txt_directory = '<insert vendor directory here>'
with open(vendors_txt_directory, 'r') as f:
    
    # skip the first line of the file (presumably a header row)
    next(f, None)
    
    for line in f:
        
        lines.append(line.strip())

In [8]:
# combine companies + vendors into one list
names = companies.values()
names = list(names) + lines

In [9]:
len(lines)

530599

In [10]:
# create a dataframe that splits each entity into space separated tokens (putting each in a unique column)

df_companies = pd.DataFrame(list(names), columns = ['Name'])

splitted = pd.DataFrame([x.split() for x in df_companies['Name'].tolist()])

In [39]:
len(splitted)

534479

**Because each entity still sat in a single row (split into one column per token), I could go through and filter out punctuation as well as None values. A None value indicates that there are no more tokens for that entity.**

**We opted to remove punctuation from the corpus, although keeping it may actually help in future work.**

In [15]:
parts = []
translator = str.maketrans('', '', string.punctuation)
count = 0
for row in splitted.iterrows():
    
    count+=1
    
    if (count%10000 == 0):
        
        print(count)
    
    for part in row[1]:
        
        if part is not None:
            
            part = part.translate(translator)
            
            if part == '':
                
                continue
                
            parts.append(part)
            
        else:
            
            break

10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
230000
240000
250000
260000
270000
280000
290000
300000
310000
320000
330000
340000
350000
360000
370000
380000
390000
400000
410000
420000
430000
440000
450000
460000
470000
480000
490000
500000
510000
520000
530000
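
The intermediate DataFrame is not strictly needed for this step: it pads shorter names with None, and the loop then has to skip that padding again. A more direct version is sketched below on two made-up company names; `name_to_tokens` and `sample_names` are hypothetical and only illustrate the same split-and-strip logic.

```python
import string

# translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)

def name_to_tokens(name):
    """Split a company name on whitespace and drop punctuation from each token."""
    tokens = []
    for part in name.split():
        cleaned = part.translate(translator)
        if cleaned:  # skip tokens that were punctuation only, such as '&'
            tokens.append(cleaned)
    return tokens

sample_names = ['Acme Drilling, Inc.', 'A & B Supplies (Texas) L.L.C.']
sample_parts = [token for name in sample_names for token in name_to_tokens(name)]
print(sample_parts)
# ['Acme', 'Drilling', 'Inc', 'A', 'B', 'Supplies', 'Texas', 'LLC']
```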


In [20]:
# append the ORG label to every token (it was not added in the loop above)
for i in range(len(parts)):
    
    parts[i] += '\tORG'

**Saving current results:**

In [24]:
with open('parts.pickle', 'wb') as file:
    
    pickle.dump(parts,file)

**While training on this dataset, we hit an error saying we had too many words, which seemed unlikely to be the real cause. It may instead have been because we had ONLY ORG words (the positive class) and no regular/filler words (the negative class). As a first step, I removed all duplicates:**

In [30]:
parts = list(set(parts))
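
Note that `set` does not preserve order, so the resulting file can be ordered differently on every run. If a stable order matters, an order-preserving alternative is sketched below; `sample_parts` is made-up data for illustration.

```python
# made-up labeled tokens containing duplicates
sample_parts = ['Acme\tORG', 'Drilling\tORG', 'Acme\tORG', 'Services\tORG', 'Drilling\tORG']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while keeping the first occurrence of each line
unique_parts = list(dict.fromkeys(sample_parts))
print(unique_parts)  # ['Acme\tORG', 'Drilling\tORG', 'Services\tORG']
```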

In [None]:
total = '\n'.join(parts)

# write the prepared training data to be used in the model
with open('trainingData.txt', 'w') as f:
    
    f.write(total)

**The results of this first training run were very skewed, so our team extracted the filler words from a contract itself. Because these came from a single contract, not many words were added; doing this on more documents in the future should improve the results. (We did not have time to review more on the day of CodeJam.)**

In [None]:
lines = []
contract_filler_words_directory = '<insert contract words directory here> (.txt)'
with open(contract_filler_words_directory, 'r') as f:
    
    for line in f:
        
        lines.append(line.strip())
        
big_doc = ' '.join(lines)

In [21]:
tokens = list(set(word_tokenize(big_doc)))
len(tokens)

891

In [53]:
# add in 'O' label
parts2 = parts[:]
for token in tokens:
    
    parts2.append(token + '\tO')

In [59]:
total = '\n'.join(parts2)

# write new training data
with open('trainingData2.txt', 'w') as f:
    
    f.write(total)


In [56]:
# save current results
with open('parts2.pickle', 'wb') as file:
    
    pickle.dump(parts2,file)

**Because roughly 900 extra words would not be enough to balance the dataset, we quickly found a list of the 10,000 most common English words in this repository: https://github.com/first20hours/google-10000-english**

**Unfortunately the classes were still unbalanced, but it was better than before.**

In [60]:
lines = []
with open('google-10000-english-no-swears.txt', 'r') as f:
    
    for line in f:
        
        lines.append(line.strip())
        
big_doc = ' '.join(lines)

# Same steps as before...

tokens = list(set(word_tokenize(big_doc)))
len(tokens)

parts3 = parts2[:]
for token in tokens:
    
    parts3.append(token + '\tO')

total = '\n'.join(parts3)

with open('trainingData3.txt', 'w') as f:
    
    f.write(total)
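
Since class balance was the main concern at this point, a quick count of labels per training file helps to see how skewed the data is. This is a minimal sketch on made-up lines; `label_counts` is a hypothetical helper, and in practice it would be called on `parts2` or `parts3`.

```python
from collections import Counter

def label_counts(lines):
    """Count how often each label appears in 'token<TAB>label' lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        _, _, label = line.rpartition('\t')
        counts[label] += 1
    return counts

# made-up training lines for illustration
sample_lines = ['Acme\tORG', 'Drilling\tORG', 'Inc\tORG', 'the\tO', 'agreement\tO']

counts = label_counts(sample_lines)
org_share = counts['ORG'] / sum(counts.values())
print(counts['ORG'], counts['O'], org_share)  # 3 2 0.6
```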

## 4. Testing Results

After training on each new dataset, we ran this simple test to see how well the model was generalizing.

In [81]:
new_model_directory = '<insert new model directory here (after training)>'


st = StanfordNERTagger(new_model_directory,
                       ner_jar_directory, encoding='utf-8')
st.tag('Hi my name is ExxonMobil Global Services Company'.split())

[('Hi', 'ORG'),
 ('my', 'O'),
 ('name', 'O'),
 ('is', 'O'),
 ('ExxonMobil', 'ORG'),
 ('Global', 'ORG'),
 ('Services', 'ORG'),
 ('Company', 'ORG')]

Above are the results from our final training set. 'Hi' was still being labeled as an ORG, but this was significantly better than our first attempt (where everything was labeled as an ORG).
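
One way to put a number on this comparison is token-level precision and recall for the ORG label. The sketch below scores the output shown above against the labels we would have wanted, assuming only the last four tokens should be ORG. `precision_recall` is a hypothetical helper, not part of the pipeline we ran.

```python
# tagger output copied from the cell above
predicted = [('Hi', 'ORG'), ('my', 'O'), ('name', 'O'), ('is', 'O'),
             ('ExxonMobil', 'ORG'), ('Global', 'ORG'),
             ('Services', 'ORG'), ('Company', 'ORG')]

# labels we would have wanted (assumption: only the company name is an ORG)
expected_labels = ['O', 'O', 'O', 'O', 'ORG', 'ORG', 'ORG', 'ORG']

def precision_recall(predicted_labels, expected, target='ORG'):
    """Token-level precision and recall for one label."""
    pairs = list(zip(predicted_labels, expected))
    true_pos = sum(1 for p, e in pairs if p == target and e == target)
    false_pos = sum(1 for p, e in pairs if p == target and e != target)
    false_neg = sum(1 for p, e in pairs if p != target and e == target)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

precision, recall = precision_recall([label for _, label in predicted], expected_labels)
print(precision, recall)  # 0.8 1.0
```

A single sentence is far too small a sample to draw conclusions from; the same function could be run over a held-out set of labeled contract sentences.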

**In terms of the goal at hand, though, the corpus was far from good enough to extract the legal entities accurately (the retrained model did worse than Stanford's stock NER tagger). Because we had only gone through preliminary iterations of the training set, this can certainly be improved with the following:**

- Better data quality for the supplier and ExxonMobil (EM) tokens
- A better understanding of how to set up the training set. This is what I'd look into first: in Stanford's tutorial, repeated tokens definitely exist, so removing them may have been unnecessary. In addition, can the sequence of words in the file affect the model? Their tutorial uses a chapter from a book as the training set, which preserves word order. Our training data was not only skewed, but the classes were never intermixed (the large ORG block came first in the file and the small O block followed). This may be fixed by using Stanford's provided tokenizer.
- More contract filler ('O') words in the corpus. This can be done by manually reviewing contract files, removing the ORG information, and streaming the remaining words into the corpus as done in Step 3.
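
A possible way to address the last two points together is to label tokens in place, inside real sentences, by matching them against the scraped company names. The sketch below uses exact token matching, which is an assumption and will miss spelling variants; `label_tokens`, the sample sentence, and `known_orgs` are all hypothetical.

```python
def label_tokens(tokens, known_orgs):
    """Label tokens ORG where they exactly match a known organization name, else O."""
    # try longer names first so 'Acme Drilling' wins over a shorter 'Acme'
    org_token_lists = sorted((name.split() for name in known_orgs), key=len, reverse=True)
    labels = ['O'] * len(tokens)
    i = 0
    while i < len(tokens):
        for org_tokens in org_token_lists:
            n = len(org_tokens)
            if n and tokens[i:i + n] == org_tokens:
                labels[i:i + n] = ['ORG'] * n
                i += n
                break
        else:
            i += 1  # no organization starts at this token
    return list(zip(tokens, labels))

sentence = 'This agreement is between ExxonMobil Global Services Company and Acme Drilling'
known_orgs = ['ExxonMobil Global Services Company', 'Acme Drilling', 'Acme']

labeled = label_tokens(sentence.split(), known_orgs)
for token, label in labeled:
    print(f'{token}\t{label}')
```

Data produced this way keeps ORG and O tokens intermixed in their natural order, which is closer to the book-chapter example in Stanford's tutorial than two separate blocks of labels.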

Overall, improvement will be an iterative process, but certainly the work we've done is promising for the problem at hand. 