# A Quick Survey and Comparison of Open Source Named Entity Extractor Tools for Python

Named entity extraction is a core subtask of building knowledge from semi-structured and unstructured text sources.  Given recent increases in computing power and decreases in the cost of data storage, data scientists and developers can build large knowledge bases that contain millions of entities and hundreds of millions of facts about them.  These knowledge bases are key contributors to intelligent computer behavior. Named entity extraction is therefore at the core of several popular technologies such as smart assistants ([Siri](http://www.apple.com/ios/siri/), [Google Now](https://www.google.com/landing/now/)), machine reading, and deep interpretation of natural language.

Once you realize how essential it is to recognize information units such as names (of people, organizations, and locations) and numeric expressions (time, date, money, and percent expressions), several questions come to mind.  How do you perform named entity extraction, which is formally called “[Named Entity Recognition and Classification (NERC)](https://benjamins.com/catalog/bct.19)”?  What tools are out there?  How can you evaluate their performance?  And most important, what works with Python (shamelessly exposing my bias)?

This post will survey openly available NERC tools and compare their results against hand-labeled data for precision, accuracy, and recall.  The tools and basic information extraction principles in this discussion begin the process of structuring unstructured data.
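
To make the evaluation idea concrete, here is a minimal sketch of how precision and recall could be computed when a tool's output and the hand-labeled data are both treated as sets of entity strings. The function name `precision_recall` and the example entities are hypothetical and for illustration only; they are not results from any tool in this survey. Accuracy is left out because it needs a count of true negatives, which this set-based sketch does not define.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two collections of entity strings."""
    predicted, truth = set(predicted), set(truth)
    # Entities the tool returned that also appear in the hand-labeled set
    true_positives = len(predicted & truth)
    # float() keeps the division exact on Python 2 as well as Python 3
    precision = float(true_positives) / len(predicted) if predicted else 0.0
    recall = float(true_positives) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: entities one tool returned vs. a hand-labeled list
hand_labeled = {"Sydney", "ACM", "Jason Brownlee", "KDD"}
tool_output = {"Sydney", "ACM", "Python"}

precision, recall = precision_recall(tool_output, hand_labeled)
print("precision = %.2f, recall = %.2f" % (precision, recall))
# precision = 0.67, recall = 0.50
```

Exact string matching is the simplest possible comparison; it counts "ACM" and "Association for Computing Machinery" as different entities, so a real evaluation would likely need some normalization first.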

### The Data: Peer Reviewed Journals and Keynote Speaker Abstracts from KDD 2015

The first step in this exercise is to find a good dataset.  Jason Brownlee of www.machinelearningmastery.com had some good suggestions in his [August 2015 article](http://machinelearningmastery.com/practice-machine-learning-with-small-in-memory-datasets-from-the-uci-machine-learning-repository/) on picking a dataset for machine learning exercises:  

1. **Real-World**: The datasets should be drawn from the real world (rather than being contrived). This will keep them interesting and introduce the challenges that come with real data.

2. **Small**: The datasets need to be small so that you can inspect and understand them and that you can run many models quickly to accelerate your learning cycle.

3. **Well-Understood**: There should be a clear idea of what the data contains, why it was collected, what the problem is that needs to be solved so that you can frame your investigation.

4. **Baseline**: It is also important to have an idea of what algorithms are known to perform well and the scores they achieved so that you have a useful point of comparison. This is important when you are getting started and learning because you need quick feedback as to how well you are performing (close to state-of-the-art or something is broken).

5. **Plentiful**: You need many datasets to choose from, both to satisfy the traits you would like to investigate and (if possible) your natural curiosity and interests. 

Luckily, we have a dataset that meets nearly all of these requirements.  In August 2015, I attended the [Knowledge Discovery and Data Mining (KDD) conference in Sydney, Australia](http://www.kdd.org/kdd2015/).  KDD 2015 attendees received a USB drive with the conference proceedings, which contain around 250 peer-reviewed journal articles and keynote speaker abstracts on data mining, knowledge discovery, big data, data science and their applications. The full conference proceedings can be purchased for \$60 at the [Association for Computing Machinery's Digital Library](https://dl.acm.org/purchase.cfm?id=2783258&CFID=740512201&CFTOKEN=34489585) (includes ACM membership). This post will work with a dataset that is equivalent to the conference proceedings.  The dataset makes for a real-world data science exercise that is instructive of big data problems.  The goal is to take semi-structured data (PDF journal articles and abstracts in publication format), strip the text from the files, and add more structure to the data for follow-on analysis.

<blockquote cite="https://github.com/linwoodc3/LC3-Creations/blob/master/DDL/namedentityblog/KDDwebscrape.ipynb">
Interested parties looking for a free option can use the <a href="https://pypi.python.org/pypi/beautifulsoup4/4.4.1">beautifulsoup</a> and <a href="https://pypi.python.org/pypi/requests/2.9.1">requests</a> libraries to scrape the <a href="http://dl.acm.org/citation.cfm?id=2785464&CFID=740512201&CFTOKEN=3448958">ACM website for KDD 2015 conference data</a> that can be used in natural language processing pipelines.  I have some <a href="https://github.com/linwoodc3/LC3-Creations/blob/master/DDL/namedentityblog/KDDwebscrape.ipynb">skeleton web scraping code</a> to generate lists of all abstracts, author names, and journal/keynote address titles.
</blockquote>

Let's begin to explore the data set.  The first thing we will do is see how many files we have in our documents directory from the conference proceedings USB.  

### Data Exploration: Getting the number of files, file type, and word count

In [39]:
import os
import time

# Directory holding the PDFs copied from the conference proceedings USB drive
TESTDIR = os.path.normpath(os.path.join(os.path.expanduser("~"), "Desktop", "KDD_15", "docs"))

files = []

# Note: %time on a line by itself times an empty statement, so the
# duration of the directory walk is measured with time.time() instead.
%time
start_time = time.time()
for dir_name, subdir_list, file_list in os.walk(TESTDIR):
    for file_name in file_list:
        # Keep only files whose names start with 'p' and end with '.pdf'
        if file_name.startswith('p') and file_name.endswith('.pdf'):
            files.append(file_name)
end_time = time.time()
print("")
print(end_time - start_time)  # seconds spent walking the directory
print("There are %r files in our corpus.  This includes peer reviewed journals and keynote speaker abstracts." % len(files))
print("")  # adding white space
print("")
print("Here are the filenames:")
print('[%s]' % ', '.join(map(str, files)))

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 6.91 µs

0.00239205360413
There are 253 files in our corpus.  This includes peer reviewed journals and keynote speaker abstracts.


Here are the filenames:
[p1.pdf, p1005.pdf, p1015.pdf, p1025.pdf, p1035.pdf, p1045.pdf, p1055.pdf, p1065.pdf, p1075.pdf, p1085.pdf, p109.pdf, p1095.pdf, p1105.pdf, p1115.pdf, p1125.pdf, p1135.pdf, p1145.pdf, p1155.pdf, p1165.pdf, p1175.pdf, p1185.pdf, p119.pdf, p1195.pdf, p1205.pdf, p1215.pdf, p1225.pdf, p1235.pdf, p1245.pdf, p1255.pdf, p1265.pdf, p1275.pdf, p1285.pdf, p129.pdf, p1295.pdf, p1305.pdf, p1315.pdf, p1325.pdf, p1335.pdf, p1345.pdf, p1355.pdf, p1365.pdf, p1375.pdf, p1385.pdf, p139.pdf, p1395.pdf, p1405.pdf, p1415.pdf, p1425.pdf, p1435.pdf, p1445.pdf, p1455.pdf, p1465.pdf, p1475.pdf, p1485.pdf, p149.pdf, p1495.pdf, p1503.pdf, p1513.pdf, p1523.pdf, p1533.pdf, p1543.pdf, p1553.pdf, p1563.pdf, p1573.pdf, p1583.pdf, p159.pdf, p1593.pdf, p1603.pdf, p1621.pdf, p1623.pdf, p1625.pdf, p1627.pdf, 

We see there are 253 files in the directory whose names begin with "p" and end with ".pdf", which is the pattern the loop above filters on.  This is the data in its rawest form.
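
The loop above only keeps files that match one naming pattern. To check which file types are present overall, a small variation can tally every extension it finds. This is a sketch under stated assumptions: `count_file_types` is a hypothetical helper name, and it is not part of the original notebook.

```python
import os
from collections import Counter


def count_file_types(root):
    """Walk `root` and count files by lowercase extension."""
    counts = Counter()
    for _dir_name, _subdirs, file_names in os.walk(root):
        for file_name in file_names:
            # splitext returns ('name', '.ext'); the extension is empty
            # for files that have none
            extension = os.path.splitext(file_name)[1].lower()
            counts[extension or "(none)"] += 1
    return counts


# Hypothetical usage with the directory defined earlier:
# count_file_types(TESTDIR).most_common()
```

Lowercasing the extension means `p1.pdf` and `p1.PDF` land in the same bucket, which is usually what you want when summarizing a directory.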

Here is an example of how the corpus PDFs look:

<img src="./files/journalscreencap.png" alt="Sample of Journal Format" height="700" width="700" top:"35" left:"170">


### Parking Lot of links, leftover paragraphs, ideas, etc.

Describe the data -> Data available here http://dl.acm.org/citation.cfm?id=2783258# 

## Ben's Outline from email

* Give a brief introduction to the task, and why it's interesting and important. Then begin to discuss the data set, how you acquired it, and where a reader can get access to it. 

* You then could have a data exploration section where you show the number of documents, perform a word count, and show snippets of data (e.g. references) that are of interest. 

* You can then go through one or a few of your "code to get" sections. These functions all follow basically the same pattern, so you could probably merge them into a single function that selects the right regular expression. 

* The next step is to discuss and demonstrate your "truth tests" for text extraction accuracy. 

* Finally, you can get to an introduction of your three methods for NERC, and show how to do each of them. Then compare (visually) the results of the three according to the evaluation mechanism discussed above. 

* You could then conclude with a discussion about NLTK chunk vs. hand labelled entities. 