# Testing the Stanford CoreNLP library

* The library should be downloaded and stored at '{PROJECT_ROOT}/stanford_corenlp_library'
* This notebook uses the Python wrapper at 'https://github.com/Lynten/stanford-corenlp'. There are also official Python wrappers, but those aren't used here.
* Java needs to be installed to run it.
* It runs on Linux without modifications. On Mac some changes may be needed, but it can be made to run.
* On Mac, the jupyter command has to be run with sudo.
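
The prerequisites above can be checked before running the cells below. This is a minimal sketch: `check_corenlp_setup` is a hypothetical helper (not part of the wrapper), and the check for `.jar` files assumes the downloaded distribution keeps its jars in the top-level directory.

```python
import glob
import os
import shutil


def check_corenlp_setup(library_dir):
    """Return a list of human-readable problems; an empty list means the setup looks usable."""
    problems = []
    # The wrapper starts a Java process, so a java executable must be on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    if not os.path.isdir(library_dir):
        problems.append(f"library directory not found: {library_dir}")
    elif not glob.glob(os.path.join(library_dir, "*.jar")):
        # Assumption: the CoreNLP jars sit directly inside library_dir.
        problems.append(f"no .jar files in {library_dir}")
    return problems


# Path relative to this notebook, matching the library location listed above.
for problem in check_corenlp_setup("../../stanford_corenlp_library"):
    print("Setup problem:", problem)
```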


In [1]:
# Necessary imports
import json
import sys

from stanfordcorenlp import StanfordCoreNLP

# Add the project root to the module search path so that its subdirectories can be imported
sys.path.append("../..")
from src.preparation.data_loading import read_dossier, read_news_article

## Testing it using news articles

In [2]:
# testing it using a news article from the web
airplane_story = 'https://edition.cnn.com/travel/article/best-way-disembark-airplane/index.html'
trump_election_story = 'https://www.nytimes.com/2016/11/09/us/politics/hillary-clinton-donald-trump-president.html'

article_text = read_news_article.read_news_article(trump_election_story)


In [9]:
# Testing coreference resolution

nlp = StanfordCoreNLP('../../stanford_corenlp_library')
props = {'annotators': 'coref', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
# results = nlp.annotate('This is Janice. She is very big.')
try:
    results = nlp.annotate(article_text, properties=props)
finally:
    nlp.close()  # always shut down the Java server, even if annotation fails
json.loads(results)['corefs']

{'33': [{'id': 5,
   'text': 'the white blue-collar voters who had formed the party base from the presidency of Franklin D. Roosevelt to Mr. Clintons',
   'type': 'NOMINAL',
   'number': 'PLURAL',
   'gender': 'UNKNOWN',
   'animacy': 'ANIMATE',
   'startIndex': 15,
   'endIndex': 35,
   'headIndex': 18,
   'sentNum': 1,
   'position': [1, 6],
   'isRepresentativeMention': True},
  {'id': 35,
   'text': 'these voters',
   'type': 'NOMINAL',
   'number': 'PLURAL',
   'gender': 'UNKNOWN',
   'animacy': 'ANIMATE',
   'startIndex': 4,
   'endIndex': 6,
   'headIndex': 5,
   'sentNum': 3,
   'position': [3, 9],
   'isRepresentativeMention': False},
  {'id': 32,
   'text': 'they',
   'type': 'PRONOMINAL',
   'number': 'PLURAL',
   'gender': 'UNKNOWN',
   'animacy': 'ANIMATE',
   'startIndex': 27,
   'endIndex': 28,
   'headIndex': 27,
   'sentNum': 3,
   'position': [3, 6],
   'isRepresentativeMention': False},
  {'id': 33,
   'text': 'their',
   'type': 'PRONOMINAL',
   'number': 'PLURAL',

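
The raw coreference output is verbose, so it can help to reduce each chain to its representative mention plus the other mentions. A minimal sketch, assuming the `corefs` structure shown above; `summarize_coref_chains` is a hypothetical helper, and the example data is a trimmed copy of the first three mentions of chain '33'.

```python
def summarize_coref_chains(corefs):
    """Map each chain id to (representative mention text, [other mention texts])."""
    chains = {}
    for chain_id, mentions in corefs.items():
        # The representative mention is the one flagged by CoreNLP in the output.
        representative = next(
            (m["text"] for m in mentions if m.get("isRepresentativeMention")),
            None,
        )
        others = [m["text"] for m in mentions if not m.get("isRepresentativeMention")]
        chains[chain_id] = (representative, others)
    return chains


# Trimmed copy of chain '33' from the output above (only the keys used here).
example = {
    "33": [
        {
            "text": "the white blue-collar voters who had formed the party base "
                    "from the presidency of Franklin D. Roosevelt to Mr. Clintons",
            "isRepresentativeMention": True,
        },
        {"text": "these voters", "isRepresentativeMention": False},
        {"text": "they", "isRepresentativeMention": False},
    ]
}

for chain_id, (representative, others) in summarize_coref_chains(example).items():
    print(chain_id, "|", representative, "<-", others)
```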

# Evaluation

* It runs slowly: about 2 minutes on the newspaper article, and about a minute even on a single sentence.
* The Stanford CoreNLP server also requires a lot of memory, a few gigabytes in practice.
* This library may be too slow and memory-hungry to use in practice.
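
To see where the time goes, server startup and annotation can be timed separately. A minimal sketch: `timed` and `time_corenlp` are hypothetical helpers, and `time_corenlp` only uses the wrapper calls already used in this notebook. It is defined but not called here because it starts the Java server. The first `annotate` call may also include model loading, so its time is not pure annotation time.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def time_corenlp(library_dir, text, props):
    """Time wrapper startup and one annotate call separately (starts the Java server)."""
    # Imported lazily so that defining this function does not require the wrapper.
    from stanfordcorenlp import StanfordCoreNLP

    nlp, startup_s = timed(StanfordCoreNLP, library_dir)
    try:
        _, annotate_s = timed(nlp.annotate, text, properties=props)
    finally:
        nlp.close()
    return {"startup_s": startup_s, "annotate_s": annotate_s}


# Quick check of the timing helper on a cheap function.
total, elapsed = timed(sum, [1, 2, 3])
print(total, f"{elapsed:.6f}s")
```

If startup turns out to dominate, keeping one `StanfordCoreNLP` instance open across cells, instead of creating a new one per cell, would be worth trying.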

In [5]:
# Testing relation extraction

nlp = StanfordCoreNLP('../../stanford_corenlp_library')
props = {'annotators': 'relation', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
try:
    results = nlp.annotate(article_text, properties=props)
finally:
    nlp.close()  # always shut down the Java server, even if annotation fails
json.loads(results)

{'sentences': [{'index': 0,
   'parse': '(ROOT\n  (S\n    (NP\n      (NP (DT The) (NNS returns))\n      (NP-TMP (NNP Tuesday)))\n    (ADVP (RB also))\n    (VP (VBD amounted)\n      (PP (TO to)\n        (NP\n          (NP (DT a) (JJ historic) (NN rebuke))\n          (PP (IN of)\n            (NP (DT the) (NNP Democratic) (NNP Party)))))\n      (PP (IN from)\n        (NP\n          (NP (DT the) (JJ white) (JJ blue-collar) (NNS voters))\n          (SBAR\n            (WHNP (WP who))\n            (S\n              (VP (VBD had)\n                (VP (VBN formed)\n                  (NP (DT the) (NN party) (NN base))\n                  (PP (IN from)\n                    (NP\n                      (NP (DT the) (NN presidency))\n                      (PP (IN of)\n                        (NP (NNP Franklin) (NNP D.) (NNP Roosevelt)))))\n                  (PP (TO to)\n                    (NP (NNP Mr.) (NNP Clintons))))))))))\n    (. .)))',
   'basicDependencies': [{'dep': 'ROOT',
     'governor': 0,

In [13]:
json.loads(results)['sentences'][0].keys()

dict_keys(['index', 'parse', 'basicDependencies', 'enhancedDependencies', 'enhancedPlusPlusDependencies', 'entitymentions', 'tokens'])

In [14]:
json.loads(results)['sentences'][0]['entitymentions']

[{'docTokenBegin': 2,
  'docTokenEnd': 3,
  'tokenBegin': 2,
  'tokenEnd': 3,
  'text': 'Tuesday',
  'characterOffsetBegin': 12,
  'characterOffsetEnd': 19,
  'ner': 'DATE',
  'normalizedNER': 'XXXX-WXX-2',
  'timex': {'tid': 't1', 'type': 'DATE', 'value': 'XXXX-WXX-2'}},
 {'docTokenBegin': 11,
  'docTokenEnd': 13,
  'tokenBegin': 11,
  'tokenEnd': 13,
  'text': 'Democratic Party',
  'characterOffsetBegin': 62,
  'characterOffsetEnd': 78,
  'ner': 'ORGANIZATION'},
 {'docTokenBegin': 28,
  'docTokenEnd': 31,
  'tokenBegin': 28,
  'tokenEnd': 31,
  'text': 'Franklin D. Roosevelt',
  'characterOffsetBegin': 166,
  'characterOffsetEnd': 187,
  'ner': 'PERSON'},
 {'docTokenBegin': 33,
  'docTokenEnd': 34,
  'tokenBegin': 33,
  'tokenEnd': 34,
  'text': 'Clintons',
  'characterOffsetBegin': 195,
  'characterOffsetEnd': 203,
  'ner': 'PERSON'}]
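
The entity mentions are easier to scan when grouped by NER type. A minimal sketch, assuming the `entitymentions` structure shown above; `group_mentions_by_type` is a hypothetical helper, and the input is a trimmed copy of the four mentions in the output (only the `text` and `ner` keys).

```python
from collections import defaultdict


def group_mentions_by_type(entity_mentions):
    """Group entity mention texts by their NER label, keeping document order."""
    grouped = defaultdict(list)
    for mention in entity_mentions:
        grouped[mention["ner"]].append(mention["text"])
    return dict(grouped)


# Trimmed copy of the entity mentions from the first sentence above.
mentions = [
    {"text": "Tuesday", "ner": "DATE"},
    {"text": "Democratic Party", "ner": "ORGANIZATION"},
    {"text": "Franklin D. Roosevelt", "ner": "PERSON"},
    {"text": "Clintons", "ner": "PERSON"},
]

print(group_mentions_by_type(mentions))
# {'DATE': ['Tuesday'], 'ORGANIZATION': ['Democratic Party'], 'PERSON': ['Franklin D. Roosevelt', 'Clintons']}
```

With the real output, the same call would take `json.loads(results)['sentences'][0]['entitymentions']` as its argument.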