# Testing the Stanford CoreNLP library

* Download the library and store it at '{PROJECT_ROOT}/stanford_corenlp_library'.
* This notebook uses the third-party Python wrapper at 'https://github.com/Lynten/stanford-corenlp'. Stanford also provides official Python wrappers, but those aren't used here.
* Java needs to be installed to run it.
* It runs on Linux without modifications. On Mac it can also be run, but something may have to be modified.
* On Mac, the jupyter command has to be run with sudo.
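
The prerequisites above can be checked before starting the wrapper. This is a minimal sketch, not part of the wrapper: `check_corenlp_setup` is a hypothetical helper name, and it assumes the downloaded distribution keeps its `.jar` files directly inside the library directory.

```python
import shutil
from pathlib import Path


def check_corenlp_setup(library_dir, java_cmd="java"):
    """Return a list of problems; an empty list means the setup looks usable."""
    problems = []
    # CoreNLP runs as a Java process, so a Java executable must be on the PATH.
    if shutil.which(java_cmd) is None:
        problems.append(f"'{java_cmd}' was not found on the PATH")
    library_path = Path(library_dir)
    if not library_path.is_dir():
        problems.append(f"library directory does not exist: {library_path}")
    elif not any(library_path.glob("*.jar")):
        # Assumption: the distribution's .jar files sit at the top level.
        problems.append(f"no .jar files found in: {library_path}")
    return problems
```

Calling `check_corenlp_setup("../../stanford_corenlp_library")` from this notebook's directory should return an empty list when everything is in place.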


In [1]:
# Necessary imports
from stanfordcorenlp import StanfordCoreNLP
import sys
sys.path.append("../..")  # add the project root to the module search path so that its subdirectories can be imported
from src.preparation.data_loading import read_dossier, read_news_article
import json

## Testing it using news articles

In [2]:
# testing it using a news article from the web
airplane_story = 'https://edition.cnn.com/travel/article/best-way-disembark-airplane/index.html'
trump_election_story = 'https://www.nytimes.com/2016/11/09/us/politics/hillary-clinton-donald-trump-president.html'

article_text = read_news_article.read_news_article(trump_election_story)


In [24]:
# Testing coreference resolution

nlp = StanfordCoreNLP('../../stanford_corenlp_library')
props={'annotators': 'coref','pipelineLanguage':'en','outputFormat':'json'}
# results = nlp.annotate('This is Janice. She is very big.')
results = nlp.annotate(article_text, properties=props)
nlp.close()
json.loads(results)['corefs']

{'33': [{'id': 5,
   'text': 'the white blue-collar voters who had formed the party base from the presidency of Franklin D. Roosevelt to Mr. Clintons',
   'type': 'NOMINAL',
   'number': 'PLURAL',
   'gender': 'UNKNOWN',
   'animacy': 'ANIMATE',
   'startIndex': 15,
   'endIndex': 35,
   'headIndex': 18,
   'sentNum': 1,
   'position': [1, 6],
   'isRepresentativeMention': True},
  {'id': 35,
   'text': 'these voters',
   'type': 'NOMINAL',
   'number': 'PLURAL',
   'gender': 'UNKNOWN',
   'animacy': 'ANIMATE',
   'startIndex': 4,
   'endIndex': 6,
   'headIndex': 5,
   'sentNum': 3,
   'position': [3, 9],
   'isRepresentativeMention': False},
  {'id': 32,
   'text': 'they',
   'type': 'PRONOMINAL',
   'number': 'PLURAL',
   'gender': 'UNKNOWN',
   'animacy': 'ANIMATE',
   'startIndex': 27,
   'endIndex': 28,
   'headIndex': 27,
   'sentNum': 3,
   'position': [3, 6],
   'isRepresentativeMention': False},
  {'id': 33,
   'text': 'their',
   'type': 'PRONOMINAL',
   'number': 'PLURAL',
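
The raw `corefs` dictionary is hard to read at a glance. A small sketch of how it could be condensed, grouping each chain's mentions under its representative mention: `summarize_coref_chains` is a hypothetical helper, and the sample below is an abbreviated copy of the output above that keeps only the fields the helper reads.

```python
def summarize_coref_chains(corefs):
    """Map each chain's representative mention text to its other mention texts."""
    summary = {}
    for chain_id, mentions in corefs.items():
        representative = None
        others = []
        for mention in mentions:
            if mention.get("isRepresentativeMention"):
                representative = mention["text"]
            else:
                others.append(mention["text"])
        # Fall back to the chain id if no mention is flagged as representative.
        key = representative if representative is not None else f"chain {chain_id}"
        summary.setdefault(key, []).extend(others)
    return summary


# Abbreviated sample: texts shortened, unused fields dropped.
sample = {
    "33": [
        {"id": 5, "text": "the white blue-collar voters", "isRepresentativeMention": True},
        {"id": 35, "text": "these voters", "isRepresentativeMention": False},
        {"id": 32, "text": "they", "isRepresentativeMention": False},
    ]
}
print(summarize_coref_chains(sample))
# {'the white blue-collar voters': ['these voters', 'they']}
```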


# Evaluation of Coref

* It runs slowly: about 2 minutes on the newspaper article, and about a minute even on a single sentence.
* The Stanford NLP server also requires a lot of memory, a few gigabytes in practice.
* I think this library might be too slow and memory-hungry to use in practice.
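
The timings above were taken with a fresh `StanfordCoreNLP` instance per cell, so they may mix server startup, model loading, and the actual annotation. The sketch below is one way to separate the three; `timed` and `measure_latency` are hypothetical helpers, and whether startup really dominates is an open question this is meant to answer, not a measured result.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def measure_latency(make_client, run):
    """Time client creation, then the same request twice on that client."""
    client, startup = timed(make_client)
    try:
        # The first request may include loading the annotator models.
        _, first_call = timed(run, client)
        # The second request reuses whatever the first one loaded.
        _, second_call = timed(run, client)
    finally:
        client.close()
    return {"startup": startup, "first_call": first_call, "second_call": second_call}


# Usage with the wrapper from this notebook (not executed here):
# props = {'annotators': 'coref', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
# measure_latency(
#     lambda: StanfordCoreNLP('../../stanford_corenlp_library'),
#     lambda nlp: nlp.annotate('This is Janice. She is very big.', properties=props),
# )
```

If the second call turns out much faster than the first, keeping one client open across cells would be the obvious change to try.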

In [32]:
# Testing relation extraction

nlp = StanfordCoreNLP('../../stanford_corenlp_library')
props={'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,relation','pipelineLanguage':'en','outputFormat':'json'}
results = nlp.annotate('He works for a company', properties=props)
# results = nlp.annotate(article_text, properties=props)
nlp.close()

In [33]:
json.loads(results).keys()

dict_keys(['sentences'])

In [34]:
json.loads(results)['sentences'][0].keys()

dict_keys(['index', 'parse', 'basicDependencies', 'enhancedDependencies', 'enhancedPlusPlusDependencies', 'entitymentions', 'tokens'])

In [31]:
json.loads(results)['sentences'][0]['tokens']

[{'index': 1,
  'word': 'He',
  'originalText': 'He',
  'lemma': 'he',
  'characterOffsetBegin': 0,
  'characterOffsetEnd': 2,
  'pos': 'PRP',
  'ner': 'O',
  'before': '',
  'after': ' '},
 {'index': 2,
  'word': 'works',
  'originalText': 'works',
  'lemma': 'work',
  'characterOffsetBegin': 3,
  'characterOffsetEnd': 8,
  'pos': 'VBZ',
  'ner': 'O',
  'before': ' ',
  'after': ' '},
 {'index': 3,
  'word': 'for',
  'originalText': 'for',
  'lemma': 'for',
  'characterOffsetBegin': 9,
  'characterOffsetEnd': 12,
  'pos': 'IN',
  'ner': 'O',
  'before': ' ',
  'after': ' '},
 {'index': 4,
  'word': 'a',
  'originalText': 'a',
  'lemma': 'a',
  'characterOffsetBegin': 13,
  'characterOffsetEnd': 14,
  'pos': 'DT',
  'ner': 'O',
  'before': ' ',
  'after': ' '},
 {'index': 5,
  'word': 'company',
  'originalText': 'company',
  'lemma': 'company',
  'characterOffsetBegin': 15,
  'characterOffsetEnd': 22,
  'pos': 'NN',
  'ner': 'O',
  'before': ' ',
  'after': ''}]
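
The token dictionaries carry enough information to reconstruct the input and to pull out the tags. A short sketch, assuming (as the output above suggests) that `before` and `after` hold the whitespace around each token; `rebuild_text` and `pos_tags` are hypothetical helper names, and the sample keeps only the fields they read.

```python
def rebuild_text(tokens):
    """Reassemble the original sentence from CoreNLP token dicts."""
    if not tokens:
        return ""
    # Leading whitespace comes from the first token's 'before';
    # every 'after' is the whitespace that follows that token.
    parts = [tokens[0]["before"]]
    for token in tokens:
        parts.append(token["originalText"])
        parts.append(token["after"])
    return "".join(parts)


def pos_tags(tokens):
    """Return (word, part-of-speech) pairs."""
    return [(token["word"], token["pos"]) for token in tokens]


# Trimmed copy of the tokens printed above.
tokens = [
    {"word": "He", "originalText": "He", "pos": "PRP", "before": "", "after": " "},
    {"word": "works", "originalText": "works", "pos": "VBZ", "before": " ", "after": " "},
    {"word": "for", "originalText": "for", "pos": "IN", "before": " ", "after": " "},
    {"word": "a", "originalText": "a", "pos": "DT", "before": " ", "after": " "},
    {"word": "company", "originalText": "company", "pos": "NN", "before": " ", "after": ""},
]
print(rebuild_text(tokens))  # He works for a company
print(pos_tags(tokens)[:2])  # [('He', 'PRP'), ('works', 'VBZ')]
```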

# Evaluation of Rel Extraction

* It takes a while to load all of the different annotators.
* I haven't been able to get it to work: for some reason the output doesn't contain any relations.
* In any case, it can only detect Live_In, Located_In, OrgBased_In, Work_For, and None, so it is of little use to us.

* There is another built-in relation extraction system, the KBPAnnotator.
* It supports many more relations, including a few interpersonal ones. However, it is very picky and doesn't seem to work on many sentences. It does work on the example given online: "Joe Smith was born in Oregon."
* Performance seems about the same: pretty slow.

In [33]:
# Testing the KBP annotator

nlp = StanfordCoreNLP('../../stanford_corenlp_library')
props={'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,coref,kbp','pipelineLanguage':'en','outputFormat':'json', 'coref.md.type':'RULE'}
results = nlp.annotate("Bob is a muslim. Charlie is Bob's brother. Charlie is a christian. Donald Trump is a Republican. Donald Trump is American. Joe and Bill are members of the Republican Party.", properties=props)
nlp.close()

In [34]:
for s in json.loads(results)['sentences']:
    print(s['kbp'])

[{'subject': 'Bob', 'subjectSpan': [0, 1], 'relation': 'per:religion', 'relationSpan': [-2, -1], 'object': 'muslim', 'objectSpan': [3, 4]}]
[{'subject': 'Bob', 'subjectSpan': [2, 3], 'relation': 'per:siblings', 'relationSpan': [-2, -1], 'object': 'Charlie', 'objectSpan': [0, 1]}, {'subject': 'Charlie', 'subjectSpan': [0, 1], 'relation': 'per:siblings', 'relationSpan': [-2, -1], 'object': 'Bob', 'objectSpan': [2, 3]}]
[{'subject': 'Charlie', 'subjectSpan': [0, 1], 'relation': 'per:religion', 'relationSpan': [-2, -1], 'object': 'christian', 'objectSpan': [3, 4]}]
[]
[{'subject': 'Donald Trump', 'subjectSpan': [0, 2], 'relation': 'per:origin', 'relationSpan': [-2, -1], 'object': 'American', 'objectSpan': [3, 4]}]
[{'subject': 'Joe', 'subjectSpan': [0, 1], 'relation': 'per:employee_or_member_of', 'relationSpan': [-2, -1], 'object': 'Republican Party', 'objectSpan': [7, 9]}, {'subject': 'Republican Party', 'subjectSpan': [7, 9], 'relation': 'org:top_members_employees', 'relationSpan': [-2, 

# Evaluation of Open IE

* It takes a while to load all of the different annotators.
* It does work.
* However, for our use we probably need something more streamlined, since plain IE will give a lot of false-positive relations. We are only interested in relations between different people.
* This is also not the latest Open IE model available.
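
Since only relations between people matter here, the extracted triples could be filtered by relation name. A rough sketch, operating on the `kbp` entries printed earlier: `interpersonal_triples` is a hypothetical helper, and the relation set is an assumption based on the TAC KBP slot names. Only `per:siblings` is confirmed by the output in this notebook.

```python
# Assumed set of person-to-person relation names (TAC KBP style).
INTERPERSONAL_RELATIONS = frozenset({
    "per:siblings",
    "per:spouse",
    "per:parents",
    "per:children",
    "per:other_family",
})


def interpersonal_triples(sentences, relations=INTERPERSONAL_RELATIONS):
    """Collect (subject, relation, object) triples whose relation links two people."""
    triples = []
    for sentence in sentences:
        # Sentences without a 'kbp' key simply contribute nothing.
        for triple in sentence.get("kbp", []):
            if triple["relation"] in relations:
                triples.append((triple["subject"], triple["relation"], triple["object"]))
    return triples


# First two sentences of the output above, reduced to the fields used here.
sentences = [
    {"kbp": [{"subject": "Bob", "relation": "per:religion", "object": "muslim"}]},
    {"kbp": [
        {"subject": "Bob", "relation": "per:siblings", "object": "Charlie"},
        {"subject": "Charlie", "relation": "per:siblings", "object": "Bob"},
    ]},
]
print(interpersonal_triples(sentences))
# [('Bob', 'per:siblings', 'Charlie'), ('Charlie', 'per:siblings', 'Bob')]
```

Symmetric relations such as `per:siblings` come back in both directions, so a later step would probably want to deduplicate them.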

In [None]:
# Testing Open IE

nlp = StanfordCoreNLP('../../stanford_corenlp_library')
props={'annotators': 'tokenize,ssplit,pos,lemma,depparse,natlog,openie','pipelineLanguage':'en','outputFormat':'json'}
results = nlp.annotate('He lives for a company', properties=props)
# results = nlp.annotate(article_text, properties=props)
nlp.close()

In [None]:
json.loads(results)['sentences'][0]['openie']
