# NER and Information Extraction POC/Research

1. walk through IE tutorials and examine results on real postings data
1. define priorities and use cases 
    1. current systems (look at fexter feature definitions, wage data extraction, talk to John Moon)
        1. https://github.com/Snagajob/match.fexter-rules/blob/master/features.cfg
        1. https://github.com/Snagajob/match.wage
    1. use cases:
        1. brands 
        1. job titles 
        1. wages
        1. shift information
        1. location information
        1. requirements
        1. ...
1. research, build a list of, and evaluate open source IE tools/models and potentially useful corpora
    1. NLTK (research only)
    1. Stanford NLP toolkit (https://nlp.stanford.edu/software/openie.html)
    1. spaCy (https://spacy.io/)
    1. Apache OpenNLP (https://opennlp.apache.org/)
    1. Resources list: https://nlp.stanford.edu/links/statnlp.html#NER
    1. ???
1. annotation tool:
    1. http://brat.nlplab.org/

## Where to Start

- useful scripts and lexicons to adapt:
    - https://github.com/robbymeals/alto-boot/blob/master/scripts/generate_text_data.py
    - https://github.com/robbymeals/alto-boot/tree/master/nlp_resources
    - don't pull from Mongo directly; use the Snowflake table CUSTOMER.DIMJOBPOSTING_VIEW
- week of June 11: walk through tutorials and documentation, begin creating list of various tools
- week of June 18: complete and review list of tools, start working on comparison demo notebook
- week of June 25: complete and review comparison notebook, plan next steps

In [1]:
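import snowflake.connector
import getpass
import pprint
import re                      # used by the text clean-up cells below
import multiprocessing as mp   # used by the scrubbing pool below
from snowflake.connector import DictCursor
from preprocessing import *   # project module (assumed to provide worker_init and scrub_posting)
from postings_ner import *    # project module (assumed to provide the CoreNLP helper functions)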

CONNECTION_PARAMS = {
    "user":None,
    "password":None,
    "account":'snagajob',
    "authenticator":"https://snagajob.okta.com/",
    "database":"PROD_SAJ_SHARE",
    "warehouse":"PROD_WH",
    "schema":"CUSTOMER"
}

In [2]:
CONNECTION_PARAMS["user"] = input("your snowflake user: ")

your snowflake user: nazih.bissat@snagajob.com


In [3]:
CONNECTION_PARAMS["password"] = getpass.getpass("your snowflake password: ")

your snowflake password: ········


In [8]:
query = """select JOBDESCRIPTION from CUSTOMER.DIMJOBPOSTING_VIEW where CREATEDATE >= '2018-07-01'::date limit 1000;"""

In [9]:
with snowflake.connector.connect(**CONNECTION_PARAMS) as ctx:
    with ctx.cursor(DictCursor) as cs:
        # materialize the rows as a list of dicts; you can instead iterate lazily
        # or feed the cursor directly into a pandas dataframe
        results = list(cs.execute(query))

In [10]:
with mp.Pool(mp.cpu_count(), worker_init) as pool:
    try:
        results = pool.map(scrub_posting, results)
    except KeyboardInterrupt:
        pool.terminate()

In [12]:
import pickle

with open('1000-postings.pkl', 'wb') as output:
    pickle.dump(results, output)

In [7]:
spacy_entity_types = {'ORG': 'Organization', 'PRODUCT': 'Product', 'CARDINAL': 'Cardinal', 'PERSON': 'Person',
                     'TIME': 'Time', 'DATE': 'Date', 'PERCENT': 'Percent', 'GPE': 'Location_Geo-Political_Area',
                     'WORK_OF_ART': 'Job_Title_or_Degree', 'LANGUAGE': 'Language', 'QUANTITY': 'Quantity',
                     'EVENT': 'Event', 'MONEY': 'Money', 'NORP': 'Nationality_or_Religion_or_Political_group',
                     'ORDINAL': 'Ordinal', 'FAC': 'Facility', 'LOC': 'Location', 'LAW': 'Law'}

stfd_entity_types = {'ORGANIZATION': 'Organization', 'TITLE': 'Title', 'LOCATION': 'Location', 'EMAIL': 'Email',
                     'URL': 'URL', 'CITY': 'Location_City', 'STATE_OR_PROVINCE': 'Location_State_or_Province',
                     'COUNTRY': 'Location_Country', 'NATIONALITY': 'Nationality', 'RELIGION': 'Religion',
                     'IDEOLOGY': 'Ideology', 'CRIMINAL_CHARGE': 'Criminal_Charge', 'CAUSE_OF_DEATH': 'Cause_of_Death',
                     'PERSON': 'Person', 'MONEY': 'Money', 'TIME': 'Time', 'ORDINAL': 'Ordinal', 'SET': 'Set',
                     'DURATION': 'Duration', 'NUMBER': 'Number', 'PERCENT': 'Percent', 'MISC': 'Miscellaneous',
                     'DATE': 'Date'}

In [8]:
# CREATING POSTING TEXT FILES AND RESPECTIVE ENTITY ANNOTATION FILES FOR BRAT TEXT ANNOTATION TOOL (SPACY)

import spacy

nlp = spacy.load('en_core_web_sm')

posting_index = 1
for r in results:
    posting_fname = 'posting' + str(posting_index)
    posting_text = r['JD_SCRUBBED']
    
    with open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/spaCy/' + posting_fname + '.txt', 'w') as text_file:
        text_file.write(posting_text)
    
    entity_index = 1
    entity_ann_file_text = ''
    for ent in nlp(r['JD_SCRUBBED']).ents:
        entity_ann_file_text += 'T' + str(entity_index) + '\t' + spacy_entity_types[ent.label_] + ' ' \
                                    + str(ent.start_char) + ' ' + str(ent.end_char) + '\t' \
                                    + ent.text + '\n'
        entity_index += 1
    
    ann_file = open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/spaCy/' + posting_fname + '.ann', 'w')
    ann_file.write(entity_ann_file_text)
    ann_file.close()
    
    posting_index += 1

In [74]:
# BRAT IS BUGGY WITH THE 100TH POSTING IN SPACY, FIGURING OUT WHY

# doc = nlp("u'" + results[99]['JD_SCRUBBED'] + "'")
#     # doc = nlp("u'" + text + "'")
# for ent in doc.ents:
#     print(ent.text, ent.start_char, ent.end_char, ent.label_)
# print(doc.ents)

(results[99]['JD_SCRUBBED'][1267:1270])

'\n  '
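The span printed above is nothing but a newline and two spaces. Each brat annotation is a single tab-separated line, so a raw newline in the text field would split that record in two, which is one plausible cause of the problem with this posting. Below is a minimal sketch of a line builder that trims whitespace from each span and skips spans that are empty or still cross a line break. The function names and the skip policy are assumptions for illustration, not part of brat or of this notebook's modules.

```python
def brat_entity_line(index, label, start, end, text):
    """Format one text-bound annotation line in brat standoff style."""
    return "T{}\t{} {} {}\t{}\n".format(index, label, start, end, text)


def trimmed_span(doc_text, start, end):
    """Shrink [start, end) so it has no leading or trailing whitespace."""
    while start < end and doc_text[start].isspace():
        start += 1
    while end > start and doc_text[end - 1].isspace():
        end -= 1
    return start, end


def brat_lines(doc_text, spans):
    """Build .ann content from (start, end, label) triples.

    Spans that are empty after trimming, or that still contain a newline,
    are skipped. That policy is an assumption made for this sketch.
    """
    lines = []
    for start, end, label in spans:
        start, end = trimmed_span(doc_text, start, end)
        text = doc_text[start:end]
        if not text or "\n" in text:
            continue
        lines.append(brat_entity_line(len(lines) + 1, label, start, end, text))
    return "".join(lines)


sample = "Cashier at Academy\n  \nFull time"
print(brat_lines(sample, [(11, 18, "Organization"), (18, 21, "Cardinal")]))
```

With spaCy, the triples could come from `(ent.start_char, ent.end_char, spacy_entity_types[ent.label_])` for each entity in `doc.ents`.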

In [22]:
# Information extraction (relations) - SPACY

from __future__ import unicode_literals, print_function

import plac
import spacy


TEXTS = [results[1]['JD_SCRUBBED']]


# @plac.annotations(
#     model=("Model to load (needs parser and NER)", "positional", None, str))

def main(model='en_core_web_sm'):
    nlp = spacy.load(model)
#     print("Loaded model '%s'" % model)
#     print("Processing %d texts" % len(TEXTS))

    for text in TEXTS:
        doc = nlp(text)
        relations = extract_entity_relations(doc)
        print(relations)
        for r1, r2 in relations:
            print('{:<10}\t{}\t{}'.format(r1.text, r2.ent_type_, r2.text))


def extract_entity_relations(doc):
    # merge entities and noun chunks into one token
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()

    relations = []
    for entity in filter(lambda w: w.ent_type_ != 'ORG', doc):
        if entity.dep_ in ('attr', 'dobj'):
            subject = [w for w in entity.head.lefts if w.dep_ == 'nsubj']
            if subject:
                subject = subject[0]
                relations.append((subject, entity))
        elif entity.dep_ == 'pobj' and entity.head.dep_ == 'prep':
            relations.append((entity.head.head, entity))
    return relations


# if __name__ == '__main__':
#     plac.call(main)

main()

[(Come, a place), (we, pride), (that, hard work), (everyone, more sports), (The Cashier, an essential part), (an essential part, customer service), (customer service, any Academy store), (a fast, friendly, and accurate checkout experience, all customers), (impacts, this important work), (symbol, people), (solving, customers), (solving, use), (use, POS equipment), (documents, safety rules), (furnished, written, oral or diagram form
  - Working knowledge), (written, oral or diagram form
  - Working knowledge, inventory software), (cash register, all transaction types), (all transaction types, returns), (run, an assigned register), (stocks, racks), (knowledge, cashier processes), (- Reads, company work hours), (- Specific vision abilities, close vision), (flexible schedule, nights), (based, business needs)]
Come      		a place
we        		pride
that      		hard work
everyone  		more sports
The Cashier		an essential part
an essential part		customer service
customer service		any Academy sto

In [9]:
# StanfordNLP NER + RegexNER
startup_corenlp_server()

In [10]:
# CREATING POSTING TEXT FILES AND RESPECTIVE ENTITY ANNOTATION FILES FOR BRAT TEXT ANNOTATION TOOL (StanfordNLP)

posting_index = 1
for r in results:
    posting_fname = 'posting' + str(posting_index)
    posting_text = r['JD_SCRUBBED']
    
    with open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/StanfordNLP/' + posting_fname + '.txt', 'w') as text_file:
        text_file.write(posting_text)
    
    entity_index = 1
    entity_ann_file_text = ''
    posting_details = annotate_posting(posting_text)
    for s in posting_details['sentences']:
        for e in s['entitymentions']:
            entity_ann_file_text += 'T' + str(entity_index) + '\t' + stfd_entity_types[e['ner']] + ' ' \
                                    + str(e['characterOffsetBegin']) + ' ' + str(e['characterOffsetEnd']) + '\t' \
                                    + e['text'] + '\n'
            entity_index += 1
    
    ann_file = open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/StanfordNLP/' + posting_fname + '.ann', 'w')
    ann_file.write(entity_ann_file_text)
    ann_file.close()
    
    posting_index += 1
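Every tool in this notebook writes the same pair of files per posting (`postingN.txt` and `postingN.ann`) to a hard-coded directory. A small shared helper could keep that logic in one place. The sketch below is hypothetical: `write_brat_pair` and its parameters are illustrative names, and the output directory is passed in rather than assumed.

```python
from pathlib import Path


def write_brat_pair(out_dir, posting_index, text, ann_text):
    """Write postingN.txt and postingN.ann side by side and return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the tool's folder if missing
    stem = "posting{}".format(posting_index)
    txt_path = out_dir / (stem + ".txt")
    ann_path = out_dir / (stem + ".ann")
    txt_path.write_text(text, encoding="utf-8")
    ann_path.write_text(ann_text, encoding="utf-8")
    return txt_path, ann_path
```

Writing an empty `.ann` file for a posting with no entities keeps the numbering aligned across tools, which makes side-by-side comparison in brat easier.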

In [11]:
shutdown_corenlp_server()

In [12]:
# CREATING POSTING TEXT FILES AND RESPECTIVE ENTITY ANNOTATION FILES FOR BRAT TEXT ANNOTATION TOOL (COGCOMP)
# TRIED BOTH CONLL NER AND ONTONOTES NER (doc.get_ner_conll), ONTONOTES SEEMED TO DETECT MORE

from ccg_nlpy import remote_pipeline

pipeline = remote_pipeline.RemotePipeline()

def find_substring(substring, string):
    """ 
    Returns list of indices where substring begins in string

    >>> find_substring('me', "The cat says meow, meow")
    [13, 19]
    """
    indices = []
    index = -1  # Begin at -1 so index + 1 is 0
    while True:
        # Find next index of substring, by starting search from index + 1
        index = string.find(substring, index + 1)
        if index == -1:  
            break  # All occurrences have been found
        indices.append(index)
    return indices

posting_index = 1
for r in results:
    posting_fname = 'posting' + str(posting_index)
    
    entity_index = 1
    entity_ann_file_text = ''

    reg = re.compile(r"^\s+", re.MULTILINE)
    test = reg.sub(" ", re.sub(r'  ', ' ', r['JD_SCRUBBED']))
    test = re.sub("\n", "", test)
    test = re.sub(r"(?<=\w)([:+',/%])", r' \1', test)
    document = pipeline.doc(test)

    posting_text = document.get_text
    
    with open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/CogCompNLPy/' + posting_fname + '.txt', 'w') as text_file:
        text_file.write(posting_text)
    
    entities = dict()
    if (str(document.get_ner_ontonotes) != 'NER_ONTONOTES view: this view does not have constituents in your input text. '):
        for entity in document.get_ner_ontonotes:
            mentions = find_substring(entity['tokens'], posting_text)
            if mentions == []:
                # entity text not found verbatim in the posting: log it and skip
                print(posting_fname, entity['tokens'])
        
            elif entity['tokens'] in entities.keys():
                if entities[entity['tokens']] < len(mentions):
                    start = mentions[entities[entity['tokens']]]
                    entities[entity['tokens']] += 1
                    print(start)
                    entity_ann_file_text += 'T' + str(entity_index) + '\t' + spacy_entity_types[entity['label']] + ' ' \
                                                + str(start) + ' ' + str(start+len(entity['tokens'])) + '\t' \
                                                + entity['tokens'] + '\n'
    
                    entity_index += 1
            else:
                start = mentions[0]
                entities[entity['tokens']] = 1
                entity_ann_file_text += 'T' + str(entity_index) + '\t' + spacy_entity_types[entity['label']] + ' ' \
                                                + str(start) + ' ' + str(start+len(entity['tokens'])) + '\t' \
                                                + entity['tokens'] + '\n'
                entity_index += 1
    
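    # Write the .ann file and advance the index for every posting, including ones
    # with no entities, so .txt/.ann pairs and posting numbers stay in sync.
    ann_file = open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/CogCompNLPy/' + posting_fname + '.ann', 'w')
    ann_file.write(entity_ann_file_text)
    ann_file.close()

    posting_index += 1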




2096
2103
564
679
1551
739
1045
1241
564
679
739
1045
1241
739
1045
1241
1438
1711
1817
1888
1551
593
posting15 Biopharmaceutical / Vaccine
posting15 PHYSICAL / MENTAL REQUIREMENTS Ability
3602
3886
4133
4178
4372
4642
564
679
3611
1067
1183
877
992
4173
3611
564
679
1166
1494
1589
2088
2367
1909
2613
2866
2934
2981
1013
2187
2460
2674
351
504
976
1174
1120
788
1230
1157
1522
1561
1002
posting34 Medical / Dental / Vision Advancement Opportunity 401K Positive Working Environment Sales Consultant
494
915
757
930
958
1039
1422
564
679
posting39 AAP / EEO Statement The Company
735
1080
279
508
1112
posting42 Allied Business Unit ) Quality Assurance
1201
1448
posting42 Lean / Six Sigma Black Belt - Retail Operations
2580
2605
2820
3010
3171
599
739
1045
1241
1438
1711
1817
1888
3611
564
679
1367
1597
1188
1843
1697
1946
1977
2076
2096
2103
1438
1711
1817
1887
494
1393
1393
3611
posting64 Sears Auto Center If
posting64 YhZoUk2cM3Y { # 560
posting64 315 # } #
posting64 Job Requirements Job Du

In [92]:
posting_text = results[0]['JD_SCRUBBED']
# reg = re.compile(r"^\s+", re.MULTILINE)
# posting_text = reg.sub("", re.sub(r'  ', ' ', results[0]['JD_SCRUBBED']))
# posting_text = re.sub("\n", "", posting_text)
doc1 = pipeline.doc(posting_text)
# print(posting_text)

for ent in doc1.get_ner_ontonotes:
        entity_ann_file_text += 'T' + str(entity_index) + '\t' + spacy_entity_types[ent['label']] + ' ' \
                                    + str(ent['start']) + ' ' + str(ent['end']) + '\t' \
                                    + ent['tokens'] + '\n'
        start = find_substring(ent['tokens'], results[0]['JD_SCRUBBED'])
        print(ent)
        print(start)
        print(posting_text[start[0]:(start[0]+len(ent['tokens']))])


{'label': 'ORG', 'score': 1.0, 'start': 1, 'end': 4, 'tokens': 'DC Area Manager'}
[4]
DC Area Manager
{'label': 'ORG', 'score': 1.0, 'start': 12, 'end': 15, 'tokens': 'the Distribution Center'}
[78]
the Distribution Center
{'label': 'ORG', 'score': 1.0, 'start': 116, 'end': 117, 'tokens': 'OSHA'}
[785]
OSHA
{'label': 'GPE', 'score': 1.0, 'start': 268, 'end': 270, 'tokens': 'Operations Management'}
[1801]
Operations Management
{'label': 'ORG', 'score': 1.0, 'start': 271, 'end': 273, 'tokens': 'Business Administration'}
[1824]
Business Administration
{'label': 'ORG', 'score': 1.0, 'start': 277, 'end': 280, 'tokens': 'Transportation - Working'}
[]


IndexError: list index out of range
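The lookup fails here because the pipeline returns the entity as space-joined tokens ('Transportation - Working'), while the original posting has different whitespace between those tokens, so an exact substring search finds nothing. One alternative to reshaping the posting text is a whitespace-tolerant search. The sketch below is an assumption-laden illustration: `find_token_span` is a hypothetical helper, and it only handles whitespace differences, not tokens that the pipeline rewrote.

```python
import re


def find_token_span(tokens, text, start_at=0):
    """Return (start, end) of the first match of `tokens` in `text` at or
    after `start_at`, allowing any amount of whitespace (including none)
    between tokens. Returns None when there is no match."""
    pieces = [re.escape(tok) for tok in tokens.split()]
    if not pieces:
        return None
    pattern = re.compile(r"\s*".join(pieces))
    match = pattern.search(text, start_at)
    return match.span() if match else None


text = "Logistics or Transportation\n  - Working knowledge"
span = find_token_span("Transportation - Working", text)
print(span, repr(text[span[0]:span[1]]))
```

Passing the end of the previous match as `start_at` would step through repeated mentions in order, in place of the per-entity counter used in the loop above.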

In [113]:
reg = re.compile(r"^\s+", re.MULTILINE)
test = reg.sub(" ", re.sub(r'  ', ' ', results[31]['JD_SCRUBBED']))
test = re.sub("\n", "", test)
test = re.sub('(?<=\w)([:+\',)(])', r' \1', test)
document = pipeline.doc(test)

posting_text = document.get_text

entity_ann_file_text = ''
s = str(document.get_ner_ontonotes)
print(s == 'NER_ONTONOTES view: this view does not have constituents in your input text. ')

# print(type(document.get_ner_ontonotes))

# for ent in document.get_ner_ontonotes:
#         start = find_substring(ent['tokens'], posting_text)
#         print(ent)
#         print(start)
#         entity_ann_file_text += 'T' + str(entity_index) + '\t' + spacy_entity_types[ent['label']] + ' ' \
#                                     + str(start[0]) + ' ' + str(start[0]+len(ent['tokens'])) + '\t' \
#                                     + ent['tokens'] + '\n'
        

True


In [2]:
# CREATING POSTING TEXT FILES AND RESPECTIVE ENTITY ANNOTATION FILES FOR BRAT TEXT ANNOTATION TOOL (GCP)
# SET ENVIRONMENT VARIABLE 'GOOGLE_APPLICATION_CREDENTIALS=/Users/nazih.bissat/Downloads/My_First_Project-12919f080214.json'
# BEFORE RUNNING JUPYTER NOTEBOOK
from google.cloud import language
import pickle

client = language.LanguageServiceClient()

# Load results pickle file
pkl_file = open('postings-6-29.pkl', 'rb')
results = pickle.load(pkl_file)
pkl_file.close()

# entity types from enums.Entity.Type
gcp_entity_types = ['UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION','EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD','OTHER']
gcp_entity_dict = {'UNKNOWN': 'Unknown', 'PERSON': 'Person', 'LOCATION': 'Location', 'ORGANIZATION': 'Organization',
                  'EVENT': 'Event', 'WORK_OF_ART': 'Work_of_Art', 'CONSUMER_GOOD': 'Consumer_Good', 'OTHER': 'Other'}

posting_index = 1
for r in results:
    posting_fname = 'posting' + str(posting_index)
    posting_text = r['JD_SCRUBBED']
    document = language.types.Document(
        content=posting_text,
        type=language.enums.Document.Type.PLAIN_TEXT,
    )
    
    with open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/GCP/' + posting_fname + '.txt', 'w') as text_file:
        text_file.write(posting_text)
    
    response = client.analyze_entities(
        document=document,
        encoding_type='UTF32',
    )
    
    entity_index = 1
    entity_ann_file_text = ''
    dups = dict()
    
    for entity in response.entities:
        for mention in entity.mentions:
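            start = mention.text.begin_offset
            # use the mention's own surface text: it can differ from entity.name,
            # and brat expects the annotated text to match the span exactly
            mention_text = mention.text.content
            entity_ann_file_text += 'T' + str(entity_index) + '\t' + gcp_entity_dict[gcp_entity_types[entity.type]] + ' ' \
                                        + str(start) + ' ' + str(start + len(mention_text)) + '\t' \
                                        + mention_text + '\n'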
            entity_index += 1
    
    ann_file = open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/GCP/' + posting_fname + '.ann', 'w')
    ann_file.write(entity_ann_file_text)
    ann_file.close()
    
    posting_index += 1

In [25]:
# EDA (GCP)
from google.cloud import language


client = language.LanguageServiceClient()

posting_text = results[0]['JD_SCRUBBED']
document = language.types.Document(
        content=posting_text,
        type=language.enums.Document.Type.PLAIN_TEXT,
    )

response = client.analyze_entities(
        document=document,
        encoding_type='UTF32',
    )

print(posting_text)

for entity in response.entities:
    if len(entity.mentions) > 1:
        print(entity)
        print('mentions:', entity.mentions[0].text.begin_offset)

Job Description

Req/Job ID: 934314BR  
Employing Entity: Sears, Roebuck and Co.  
Employment Category: Regular, Part-time  
Job Function: Automotive  
Store ID: 06371: Sears Auto Center  

If you are looking to become part of an auto center that really values your skills and ability to provide quality service, consider joining the sears auto center team . Now is the perfect time to join as we are going through an exciting transformation of our business!! The Express Service Technician is responsible for the successful tire, battery, and oil installation to customer vehicles following all automotive processes and procedures. This position interacts daily with customers, Customer Service Adviser, other technicians, and the Auto Center Manager.  

[#video#https://www.youtube.com/watch?v=YhZoUk2cM3Y{#560,315#}#/video#]

Job Requirements

Job Duties/Responsibilities:  
• Participates actively with team servicing of every customer vehicle to ensure fast, expert service  
• Dedicates work ti

In [13]:
# CREATING POSTING TEXT FILES AND RESPECTIVE ENTITY ANNOTATION FILES FOR BRAT TEXT ANNOTATION TOOL (APACHE OPENNLP)
from subprocess import Popen, PIPE

apache_entity_types = {'person':'Person', 'location':'Location', 'date':'Date', 'time':'Time', 'money':'Money',
                      'organization': 'Organization', 'percentage':'Percent'}

def find_substring(substring, string):
    """ 
    Returns list of indices where substring begins in string

    >>> find_substring('me', "The cat says meow, meow")
    [13, 19]
    """
    indices = []
    index = -1  # Begin at -1 so index + 1 is 0
    while True:
        # Find next index of substring, by starting search from index + 1
        index = string.find(substring, index + 1)
        if index == -1:  
            break  # All occurrences have been found
        indices.append(index)
    return indices

# THIS IS THE COMMAND I RAN FOR IT TO WORK:
# opennlp TokenNameFinder en-ner-person.bin en-ner-location.bin 
# en-ner-time.bin en-ner-money.bin en-ner-organization.bin en-ner-percentage.bin en-ner-date.bin

batch_posting = ''
for r in results:
    batch_posting += r['JD_SCRUBBED'] + '\n\n\n'
    
with open('/Users/nazih.bissat/Desktop/apache_opennlp/apache-opennlp-1.8.4/bin/batch_postings.txt', 'w') as text_file:
    text_file.write(batch_posting)

commands = '''
cd /Users/nazih.bissat/Desktop/apache_opennlp/apache-opennlp-1.8.4/bin
cat batch_postings.txt | opennlp TokenNameFinder en-ner-person.bin en-ner-location.bin en-ner-time.bin en-ner-money.bin en-ner-organization.bin en-ner-percentage.bin en-ner-date.bin
'''

process = Popen('/bin/bash', stdin=PIPE, stdout=PIPE)
out, err = process.communicate(commands.encode('utf-8'))

full_entities = out.decode('utf-8')

posting_index = 1
for posting in full_entities.split('\n\n\n'):
    posting_fname = 'posting' + str(posting_index)
    
    posting_text = re.sub(r'\<START:[a-z]*\> ', '', posting)
    posting_text = re.sub(r' \<END\>', '', posting_text)
    
    with open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/Apache/' + posting_fname + '.txt', 'w') as text_file:
        text_file.write(posting_text)
    
    entity_ann_file_text = ''
    counter = 0
    entity_index = 1
    for start in find_substring('<START:', posting):
        end_type = find_substring('>', posting)[2 * (entity_index - 1)]
        e_type = posting[(start+7):(end_type)]
        end_name = find_substring('<END>', posting)[entity_index - 1]
        name = posting[(end_type+2):(end_name-1)]
        entity_ann_file_text += 'T' + str(entity_index) + '\t' + apache_entity_types[e_type] + ' ' \
                                    + str(start-counter) + ' ' + str(start-counter+len(name)) + '\t' \
                                    + name + '\n'
        counter += 15 + len(e_type)
        entity_index += 1
    
    ann_file = open('/Users/nazih.bissat/Desktop/brat-v1.3_Crunchy_Frog/data/NER/Apache/' + posting_fname + '.ann', 'w')
    ann_file.write(entity_ann_file_text)
    ann_file.close()
    
    posting_index += 1
    
commands = '''
rm /Users/nazih.bissat/Desktop/apache_opennlp/apache-opennlp-1.8.4/bin/batch_postings.txt
'''
process = Popen('/bin/bash', stdin=PIPE, stdout=PIPE)
process.communicate(commands.encode('utf-8'))  # actually send the rm command to the shell

In [75]:
# EDA FOR ABOVE CODE, TRYING TO FIND INSTANCES WHERE BUG IS CAUSED
from subprocess import Popen, PIPE

def find_substring(substring, string):
    """ 
    Returns list of indices where substring begins in string

    >>> find_substring('me', "The cat says meow, meow")
    [13, 19]
    """
    indices = []
    index = -1  # Begin at -1 so index + 1 is 0
    while True:
        # Find next index of substring, by starting search from index + 1
        index = string.find(substring, index + 1)
        if index == -1:  
            break  # All occurrences have been found
        indices.append(index)
    return indices

commands = '''
cd /Users/nazih.bissat/Desktop/apache_opennlp/apache-opennlp-1.8.4/bin
cat test.txt | opennlp TokenNameFinder en-ner-person.bin en-ner-location.bin en-ner-time.bin en-ner-money.bin en-ner-organization.bin en-ner-percentage.bin en-ner-date.bin
'''

process = Popen('/bin/bash', stdin=PIPE, stdout=PIPE)
out, err = process.communicate(commands.encode('utf-8'))

ents = out.decode('utf-8')
sent = re.sub(r'\<START:[a-z]*\> ', '', ents)
sent = re.sub(r' \<END\>', '', sent)
# reg = re.compile(r"^\s+", re.MULTILINE)
# # sent = reg.sub("", results[0]['JD_SCRUBBED'])
# sent = re.sub(r'  ', '', results[0]['JD_SCRUBBED'])
# test = re.sub("\n", "", test)
# sent = re.sub('(?<=\w)([:+\',/\%])', r' \1', test)

print(sent)
print('\n')
print(ents)
print('\n')

counter = 0
entity_index = 1
for start in find_substring('<START:', ents):
    end_type = find_substring('>', ents)[2 * (entity_index - 1)]
    e_type = ents[(start+7):(end_type)]
    end_name = find_substring('<END>', ents)[entity_index - 1]
    name = ents[(end_type+2):(end_name-1)]
    print('name: {0}'.format(name))
    print('type: {0}'.format(e_type))
    print('start: {0}'.format(start-counter))
    print('end: {0}'.format(start-counter+len(name)))
    print('check name in original sentence: {0}'.format(sent[(start-counter):(start-counter+len(name))]))
#     print('rest of original sentence: {0}'.format(sent[(start-counter+len(name)):]))
    print('\n')
    counter += 15 + len(e_type)
    entity_index += 1


Job Description

Req/Job ID: 934314BR
Employing Entity: Sears, Roebuck and Co.
Employment Category: Regular, Part-time
Job Function: Automotive
Store ID: 06371: Sears Auto Center

If you are looking to become part of an auto center that really values your skills and ability to provide quality service, consider joining the sears auto center team . Now is the perfect time to join as we are going through an exciting transformation of our business!! The Express Service Technician is responsible for the successful tire, battery, and oil installation to customer vehicles following all automotive processes and procedures. This position interacts daily with customers, Customer Service Adviser, other technicians, and the Auto Center Manager.

[#video#https://www.youtube.com/watch?v=YhZoUk2cM3Y{#560,315#}#/video#] 

Job Requirements

Job Duties/Responsibilities:
• Participates actively with team servicing of every customer vehicle to ensure fast, expert service
• Dedicates work time to providing

In [108]:
print(len('<START:>  <END>'))

27
5
4


In [74]:
s = 'Employing <START:person> Entity: Sears, Roebuck <END> and Co.'
s = re.sub(r'\<START:[a-z]*\> ', '', s)
s = re.sub(r' \<END\>', '', s)
print(s)

Employing Entity: Sears, Roebuck and Co.


In [68]:
test = 'Employing <START:person> Entity: Sears, Roebuck <END> and Co.'
print(re.sub(r'(\<START\:.*?\> )?(.*?)( \<END\>)$', "", test))

Employing <START:person> Entity: Sears, Roebuck <END> and Co.
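The pattern above leaves the string unchanged because the trailing `$` requires ` <END>` to sit at the very end of the input, which it does not. A single pass with `re.finditer` can strip the tags and compute offsets at the same time, without counting `>` characters or tracking a running tag length. This is a sketch under the assumption that the tagged output always has the `<START:type> name <END>` shape seen in the sample strings in this notebook; `parse_opennlp_tags` is a hypothetical helper name.

```python
import re

# One tagged entity: "<START:type> name <END>", with a non-greedy name.
TAG_RE = re.compile(r"<START:([a-z]+)> (.*?) <END>", re.DOTALL)


def parse_opennlp_tags(tagged):
    """Return (clean_text, entities), where each entity is a
    (start, end, type, name) tuple with offsets into clean_text."""
    clean_parts = []
    entities = []
    cursor = 0     # read position in the tagged string
    clean_len = 0  # length of the clean text assembled so far
    for match in TAG_RE.finditer(tagged):
        before = tagged[cursor:match.start()]
        clean_parts.append(before)
        clean_len += len(before)
        e_type, name = match.group(1), match.group(2)
        entities.append((clean_len, clean_len + len(name), e_type, name))
        clean_parts.append(name)
        clean_len += len(name)
        cursor = match.end()
    clean_parts.append(tagged[cursor:])
    return "".join(clean_parts), entities


s = 'Employing <START:person> Entity: Sears, Roebuck <END> and Co.'
clean, ents = parse_opennlp_tags(s)
print(clean)
print(ents)
```

Because the offsets are computed while the clean text is built, a stray `>` elsewhere in a posting does not shift them.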
