### Named Entity Extraction -- NER  
The goal of this example is to take a corpus of texts  
and extract named entities such as names of persons,  
locations, organizations, etc.  

The texts are Presidential inaugural addresses taken  
from the NLTK data. This example uses both the NLTK and spaCy  
libraries and exports the results to a JSON file.

In [54]:
import json
from nltk.corpus import PlaintextCorpusReader
import spacy

In [55]:
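# 'en' is a spaCy 2.x shortcut link; spaCy 3 and later need the
# full package name instead, e.g. 'en_core_web_sm'
nlp = spacy.load('en')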

In [56]:
corp_dir = 'H:\\My Documents\\Work\\datasets\\pres_txt'

In [57]:
newcorpus = PlaintextCorpusReader(corp_dir, '.*')

#### Loading Corpus  
The English language model from spaCy is loaded and  
the path to the text corpus is defined. NLTK's  
`PlaintextCorpusReader` reads every file in that directory  
(the `'.*'` pattern matches all file names), and the names  
of those text files are listed below.

In [58]:
newcorpus.fileids()

['1789-Washington.txt',
 '1805-Jefferson.txt',
 '1861-Lincoln.txt',
 '1945-Roosevelt.txt',
 '1965-Johnson.txt',
 '2009-Obama.txt']

In [59]:
txt_ID = newcorpus.fileids()

In [60]:
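entity_results = []
def getEnts(i):
    # Read the raw text of the i-th file and run the spaCy pipeline on it
    txt = newcorpus.raw(txt_ID[i])
    doc = nlp(txt)
    # Keep only the text of each entity; the entity labels are dropped
    txt_ents = [ent.text for ent in doc.ents]
    entity_results.append(txt_ents)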

#### Extraction  
A new variable holds the file IDs  
of the corpus texts.  
A function takes the index of a text file, reads its raw text,  
gathers the entities spaCy finds into a list and appends that  
list to an outer list.  

After the function has been called for every text  
in the corpus, `entity_results` holds one list of entities  
per file, in the same order as the file IDs  
-- lists within a list.
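
The same lists-within-a-list result can also be built without a global list, which avoids duplicated entries if the loop cell is run more than once. The sketch below is only an illustration, not the notebook's method: `collect_entities` and `toy_extract` are hypothetical names, and `toy_extract` is a stand-in that simply picks capitalized words so the example runs without spaCy. In the notebook the extractor would be along the lines of `lambda t: [ent.text for ent in nlp(t).ents]`.

```python
def collect_entities(texts, extract):
    """Return one entity list per text, in the same order as `texts`."""
    return [extract(text) for text in texts]


def toy_extract(text):
    """Stand-in extractor (assumption): treat capitalized words as entities."""
    words = (w.strip(".,;") for w in text.split())
    return [w for w in words if w[:1].isupper()]


# Two short sample strings stand in for the corpus texts
sample_texts = [
    "the Senate and the House met",
    "the people of Louisiana and Mississippi",
]
nested = collect_entities(sample_texts, toy_extract)
print(nested)
```

Because the function returns a new list each time, calling it again replaces the result rather than appending to the previous run.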

In [61]:
for item in range(len(txt_ID)):
    getEnts(item)

In [62]:
entity_results

[['Senate',
  'the House of Representatives',
  'the 14th day',
  'the present month',
  'every day',
  'first',
  'the United States a Government',
  'the United States',
  'republican',
  'American',
  'fifth',
  'Constitution',
  'the House of Representatives',
  'first',
  'the Human Race',
  'American',
  '\n'],
 ['Constitution',
  'Commonwealth',
  'State',
  'American',
  'the United States',
  'States',
  'Constitution',
  'State',
  'the year',
  'the year',
  'Louisiana',
  'Mississippi',
  'Constitution',
  'the General Government',
  'Constitution',
  'first',
  'Creator',
  'first',
  'States',
  'Constitution',
  'States',
  'Contemplating',
  'years',
  'Israel',
  '\n'],
 ['the United States',
  'the Constitution of the United States',
  'the Southern States',
  'a Republican Administration',
  'one',
  'States',
  '\n\nResolved',
  'States',
  'State',
  'State',
  'Administration',
  'Constitution',
  'States',
  'Constitution',
  'one',
  'State',
  'Congress',
  'Co

#### Creating the JSON file  
The goal is to export the results of this operation  
to a JSON file that can be used for other purposes.  
An empty list is created, and a loop pairs each  
file name in the corpus with its entity results,  
appending one dictionary per file to that list.  

A list of dictionaries maps directly onto a JSON array of  
objects, so this final list is saved to a JSON file as-is.
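
As a small illustration of how a list of dictionaries becomes JSON, the sketch below pairs file names with entity lists and serializes them with `json.dumps`. The two short entity lists are abbreviated sample values taken from the output shown earlier, not the full results, and `names`, `ents`, `records`, and `as_json` are illustrative names.

```python
import json

names = ['1789-Washington.txt', '1805-Jefferson.txt']
ents = [['Senate', 'the House of Representatives'],
        ['Constitution', 'Commonwealth']]

# zip pairs each file name with the entity list at the same position
records = [{'name': n, 'text': e} for n, e in zip(names, ents)]

# A list of dicts serializes to a JSON array of objects
as_json = json.dumps(records, ensure_ascii=False)
print(as_json)

# Parsing the string gives back an equal Python structure
assert json.loads(as_json) == records
```

Using `zip` instead of indexing by position is a stylistic choice here; it assumes both lists have the same length and order, as they do in the notebook.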

In [63]:
final_ent = []
for x in range(len(txt_ID)):
    final_ent.append({'name': txt_ID[x], 'text': entity_results[x]})

In [64]:
final_ent

[{'name': '1789-Washington.txt',
  'text': ['Senate',
   'the House of Representatives',
   'the 14th day',
   'the present month',
   'every day',
   'first',
   'the United States a Government',
   'the United States',
   'republican',
   'American',
   'fifth',
   'Constitution',
   'the House of Representatives',
   'first',
   'the Human Race',
   'American',
   '\n']},
 {'name': '1805-Jefferson.txt',
  'text': ['Constitution',
   'Commonwealth',
   'State',
   'American',
   'the United States',
   'States',
   'Constitution',
   'State',
   'the year',
   'the year',
   'Louisiana',
   'Mississippi',
   'Constitution',
   'the General Government',
   'Constitution',
   'first',
   'Creator',
   'first',
   'States',
   'Constitution',
   'States',
   'Contemplating',
   'years',
   'Israel',
   '\n']},
 {'name': '1861-Lincoln.txt',
  'text': ['the United States',
   'the Constitution of the United States',
   'the Southern States',
   'a Republican Administration',
   'one',
   

In [65]:
with open("C:\\data_mongo\\pres_entities.json", 'w', encoding='UTF-8') as z:
    json.dump(final_ent, z, ensure_ascii=False)
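
To check that the exported file can be used elsewhere, it can be read back with `json.load`. The sketch below writes to a temporary directory rather than the notebook's path and uses a one-record sample in place of `final_ent`; both are assumptions made so the example is self-contained.

```python
import json
import os
import tempfile

# One-record sample standing in for final_ent (assumption)
sample = [{'name': '1789-Washington.txt', 'text': ['Senate', 'American']}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pres_entities.json')
    # Write with the same options used in the notebook
    with open(path, 'w', encoding='UTF-8') as f:
        json.dump(sample, f, ensure_ascii=False)
    # Read the file back using the same encoding
    with open(path, encoding='UTF-8') as f:
        loaded = json.load(f)

# True when the data survived the round trip unchanged
print(loaded == sample)
```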