# A Quick Survey and Comparison of Open Source Named Entity Extractor Tools for Python

Named entity extraction is a core subtask of building knowledge from semi-structured and unstructured text sources<sup><a href="#fn1" id="ref1">1</a></sup>.  Considering recent increases in computing power and decreases in the costs of data storage, data scientists and developers can build large knowledge bases that contain millions of entities and hundreds of millions of facts about them.  These knowledge bases are key contributors to intelligent computer behavior<sup><a href="#fn2" id="ref2">2</a></sup>.  Therefore, named entity extraction is at the core of several popular technologies such as smart assistants ([Siri](http://www.apple.com/ios/siri/), [Google Now](https://www.google.com/landing/now/)), machine reading, and deep interpretation of natural language<sup><a href="#fn3" id="ref3">3</a></sup>.

Once you realize how essential it is to recognize information units such as names (person, organization, and location names) and numeric expressions (time, date, money, and percent expressions), several questions come to mind.  How do you perform named entity extraction, which is formally called “[Named Entity Recognition and Classification (NERC)](https://benjamins.com/catalog/bct.19)”?  What tools are out there?  How can you evaluate their performance?  And most important, what works with Python (shamelessly exposing my bias)?

This post will survey openly available NERC tools and compare their results against hand-labeled data for precision, accuracy, and recall.  The tools and basic information extraction principles in this discussion begin the process of structuring unstructured data.

We will specifically learn to:
1. follow the data science pipeline (see image below)
2. prepare semi-structured natural language data for ingestion using regex
3. create a custom corpus in [Natural Language Toolkit](http://www.nltk.org/)
4. use a suite of openly available NERC tools to extract entities and store them in JSON format
5. compare the performance of NERC tools on our corpus
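
To make the last step concrete, here is a minimal sketch of how extracted entities could be scored against hand-labeled ones. It treats both as sets of strings and computes precision and recall from their overlap. The function name `score_entities` and the sample entities are made up for illustration; accuracy is left out because it needs a definition of "true negatives" that plain sets of entities do not provide.

```python
def score_entities(predicted, gold):
    """Compare extracted entities with hand-labeled ones using set overlap."""
    predicted, gold = set(predicted), set(gold)
    # Entities that the tool found and that also appear in the hand-labeled set
    true_positives = len(predicted & gold)
    # Precision: share of extracted entities that are correct
    precision = true_positives / float(len(predicted)) if predicted else 0.0
    # Recall: share of hand-labeled entities that were found
    recall = true_positives / float(len(gold)) if gold else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical data, for illustration only
gold = ["Jason Brownlee", "Georgetown", "Sydney"]
predicted = ["Jason Brownlee", "Sydney", "KDD", "Python"]
scores = score_entities(predicted, gold)
```

With this toy input, two of the four extracted entities are correct and two of the three hand-labeled entities are found. Exact string matching is the simplest possible choice; a real comparison would likely need to normalize case and whitespace first.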

<br>
<a href="#pipe" id="pipeline"><center><h3>The Data Science Pipeline:<br>Georgetown Data Science Certificate Program</h3></center></a>
<div class="image">

      <img src="./files/data_science_pipeline.png" alt="Data Science Pipeline" height="300" width="450" top:"35" left:"170" />
      
      

</div>



### The Data: Peer Reviewed Journals and Keynote Speaker Abstracts from KDD 2014 and 2015

Before delving into the pipeline, we need a good dataset.  Jason Brownlee of www.machinelearningmastery.com had some good suggestions in his [August 2015 article](http://machinelearningmastery.com/practice-machine-learning-with-small-in-memory-datasets-from-the-uci-machine-learning-repository/) on picking a dataset for machine learning exercises:  

* **Real-World**: The datasets should be drawn from the real world (rather than being contrived). This will keep them interesting and introduce the challenges that come with real data.

* **Small**: The datasets need to be small so that you can inspect and understand them and that you can run many models quickly to accelerate your learning cycle.

* **Well-Understood**: There should be a clear idea of what the data contains, why it was collected, what the problem is that needs to be solved so that you can frame your investigation.

* **Baseline**: It is also important to have an idea of what algorithms are known to perform well and the scores they achieved so that you have a useful point of comparison. This is important when you are getting started and learning because you need quick feedback as to how well you are performing (close to state-of-the-art or something is broken).

* **Plentiful**: You need many datasets to choose from, both to satisfy the traits you would like to investigate and (if possible) your natural curiosity and interests. 

Luckily, we have a dataset that meets nearly all of these requirements.  I attended the Knowledge Discovery and Data Mining (KDD) conferences in [New York City (2014)](http://www.kdd.org/kdd2014/) and [Sydney, Australia (2015)](http://www.kdd.org/kdd2015/).  Both years, attendees received a USB with the conference proceedings.  Each repository contains over 230 peer reviewed journal articles and keynote speaker abstracts on data mining, knowledge discovery, big data, data science and their applications. The full conference proceedings can be purchased for \$60 at the [Association for Computing Machinery's Digital Library](https://dl.acm.org/purchase.cfm?id=2783258&CFID=740512201&CFTOKEN=34489585) (includes ACM membership). This post will work with a dataset that is equivalent to the conference proceedings.  It's important to note that this dataset recreates a real world data science exercise that is instructive of big data problems.  We will take semi-structured data (PDF journal articles and abstracts in publication format), strip text from the files, and add more structure to the data to facilitate follow-on analysis.

<blockquote cite="https://github.com/linwoodc3/LC3-Creations/blob/master/DDL/namedentityblog/KDDwebscrape.ipynb">
Interested parties looking for a free option can use the <a href="https://pypi.python.org/pypi/beautifulsoup4/4.4.1">beautifulsoup</a> and <a href="https://pypi.python.org/pypi/requests/2.9.1">requests</a> libraries to scrape the <a href="http://dl.acm.org/citation.cfm?id=2785464&CFID=740512201&CFTOKEN=3448958">ACM website for KDD 2015 conference data</a> that can be used in natural language processing pipelines.  I have some <a href="https://github.com/linwoodc3/LC3-Creations/blob/master/DDL/namedentityblog/KDDwebscrape.ipynb">skeleton web scraping code</a> to generate lists of all abstracts, author names, and journal/keynote address titles.
</blockquote>


### Data Exploration: Getting the number of files, and file type 

The data is stored locally in the following directory:
```python
>>> import os
>>> print os.getcwd()
/Users/linwood/Desktop/KDD_15/docs
```
Let's explore the number of files we have and naming conventions. We begin with the administrative tasks of loading modules, establishing paths, etc.  
<br><br>

In [16]:
#**********************************************************************
# Importing what we need
#**********************************************************************
import os
import time
from os import walk

#**********************************************************************
# Administrative code to set the path for file loading
#**********************************************************************

path        = os.path.abspath(os.getcwd())
TESTDIR     = os.path.normpath(os.path.join(os.path.expanduser("~"),"Desktop","KDD_15","docs"))

<br><br>Next we iterate over the files in the directory and store those names in the empty list we created called *files*.  We time the operation, print the list of file names, and also print the length of the list (which gives the number of target files).<br><br>

In [4]:
# Establish an empty list to append filenames as we iterate over the directory with filenames
files = []

%time
start_time = time.time()

#**********************************************************************
# Core "workerbee" code for this section to iterate over directory files
#**********************************************************************

# Iterate over the directory of filenames and add to list.  Inspection shows our target filenames begin with 'p' and end with 'pdf'
for dirName, subdirList, fileList in os.walk(TESTDIR):
    for fileName in fileList:
        if fileName.startswith('p') and fileName.endswith('.pdf'):
            files.append(fileName)
end_time = time.time()

#**********************************************************************
# Output
#**********************************************************************
print
print len(files) # Print the number of files
print 
print '[%s]' % ', '.join(map(str, files)) # print the list of filenames

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.91 µs

253

[p1.pdf, p1005.pdf, p1015.pdf, p1025.pdf, p1035.pdf, p1045.pdf, p1055.pdf, p1065.pdf, p1075.pdf, p1085.pdf, p109.pdf, p1095.pdf, p1105.pdf, p1115.pdf, p1125.pdf, p1135.pdf, p1145.pdf, p1155.pdf, p1165.pdf, p1175.pdf, p1185.pdf, p119.pdf, p1195.pdf, p1205.pdf, p1215.pdf, p1225.pdf, p1235.pdf, p1245.pdf, p1255.pdf, p1265.pdf, p1275.pdf, p1285.pdf, p129.pdf, p1295.pdf, p1305.pdf, p1315.pdf, p1325.pdf, p1335.pdf, p1345.pdf, p1355.pdf, p1365.pdf, p1375.pdf, p1385.pdf, p139.pdf, p1395.pdf, p1405.pdf, p1415.pdf, p1425.pdf, p1435.pdf, p1445.pdf, p1455.pdf, p1465.pdf, p1475.pdf, p1485.pdf, p149.pdf, p1495.pdf, p1503.pdf, p1513.pdf, p1523.pdf, p1533.pdf, p1543.pdf, p1553.pdf, p1563.pdf, p1573.pdf, p1583.pdf, p159.pdf, p1593.pdf, p1603.pdf, p1621.pdf, p1623.pdf, p1625.pdf, p1627.pdf, p1629.pdf, p1631.pdf, p1633.pdf, p1635.pdf, p1637.pdf, p1639.pdf, p1641.pdf, p1651.pdf, p1661.pdf, p1671.pdf, p1681.pdf, p169.pdf, p1691.pdf, p170

<br><br>There are 253 total files in the directory. We examine the pdf file in its rawest form to get an idea of the format. Here is one example:<br><br>



<img src="./files/journalscreencap.png" alt="Sample of Journal Format" height="700" width="700" top:"35" left:"170">


<br><br>We learn a few things immediately. Our data is in PDF format and it's semi-structured (it follows journal article format with sections like "abstract" and "title").  PDFs are a wonderful human-readable presentation of data. But for data analysis, they are extremely difficult to work with.  If you have an option to get the data BEFORE it was converted to or added to PDF, go for that option.  Save yourself the headache.  In this case however, we have no alternatives outside of the web scraping code linked above.  The web scraping code is imperfect because it is incomplete (it only gets abstracts and not the full text of each journal article) and unordered (multiple authors need to be aligned to specific articles).<br><br>

### Data Ingestion: Stripping text from PDFs and creating a custom NLTK corpus

The first step in the <href id="pipe"><a href="#pipeline" title="Jump back to data science pipeline graphic.">data science pipeline</a> is to ingest our data.  We use several Python tools which include:

* [pdfminer](https://pypi.python.org/pypi/pdfminer/) - this is the tool that makes it ALL happen.  It has a command line tool called "pdf2txt.py" that extracts text contents from a PDF. **This must be installed on your computer BEFORE executing this code**.  Visit the [pdfminer homepage](http://euske.github.io/pdfminer/index.html#pdf2txt) for instructions.

* [subprocess](https://docs.python.org/2/library/subprocess.html) - a standard library module that allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.  In this exercise, we use it to invoke the pdf2txt.py command line tool within our code.

* [nltk](http://www.nltk.org/) - another workhorse in this exercise.  The Natural Language ToolKit (NLTK) is one of Python's leading platforms to analyze natural language data.  The [NLTK Book](http://www.nltk.org/book/) provides practical guidance on how to handle just about any natural language preprocessing job.

* [string](https://docs.python.org/2/library/string.html) - used for variable substitutions and value formatting to strip non-printable characters from the output of the text extracted from our journal article PDFs

* [unicodedata](https://docs.python.org/2/library/unicodedata.html) - some Unicode characters won't extract nicely. This library allows Latin Unicode characters to degrade gracefully into ASCII.
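
As a small illustration of the last two tools, the sketch below shows how `unicodedata` and `string` can work together on a single string. The helper name `to_printable_ascii` and the sample text are assumptions made for this example; they are not part of the pipeline code that follows.

```python
import string
import unicodedata


def to_printable_ascii(text):
    """Degrade accented Latin characters to ASCII and drop anything non-printable."""
    # NFKD splits a character such as e-acute into 'e' plus a combining accent,
    # and expands compatibility characters such as the 'fi' ligature
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" discards the combining accents
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep only characters listed in string.printable (drops control characters)
    return "".join(ch for ch in ascii_text if ch in string.printable)


# Made-up sample: i-diaeresis, e-acute, an 'fi' ligature and a NUL byte
sample = u"Na\u00efve caf\u00e9 \ufb01le\x00"
cleaned = to_printable_ascii(sample)
```

The order matters here: normalizing first lets the accented letters survive as their base letters, whereas encoding to ASCII first would simply delete them.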

We are now going to iterate over each file in our raw data directory, strip the text, and write the *.txt* file to a newly created directory.  Then we will follow the instructions from [Section 1.9, Chapter 2 of NLTK's Book](http://www.nltk.org/book/ch02.html) to build a custom corpus from our text files.  Having our target documents loaded as an NLTK corpus brings the power of NLTK to our analysis goals.  Let's begin with administrative tasks such as loading modules and creating the necessary directories.<br><br>

In [412]:
#**********************************************************************
# Importing what we need
#**********************************************************************
import string
import unicodedata
import subprocess
import nltk
import os, os.path
import re

#**********************************************************************
# Create the directory we will write the .txt files to after stripping text
#**********************************************************************

corpuspath = os.path.normpath(os.path.expanduser('~/Desktop/KDD_corpus/'))
if not os.path.exists(corpuspath):
    os.mkdir(corpuspath)

<br><br>Now we come to the big task of stripping text from the PDFs.  In the code below, we walk down the directory and strip text from the files with names that begin with 'p' and end with '.pdf'.  We use the *fileName* variable to name the files we write to disk.  This will come in handy when we load data into NLTK.  Keep in mind that this task takes the longest, so be prepared to wait a few minutes, depending on how powerful your computer is.  If you are doing this in an environment where you can spin up compute resources, your time will be drastically reduced.  Let's begin.<br><br>

In [10]:
#**********************************************************************
# Core code to iterate over files in the directory
#**********************************************************************

# We start from the code to iterate over the files
for dirName, subdirList, fileList in os.walk(TESTDIR):
    for fileName in fileList:
        if fileName.startswith('p') and fileName.endswith('.pdf'):
            # Build the source and target paths once
            pdfpath = os.path.normpath(os.path.join(TESTDIR, fileName))
            txtpath = os.path.normpath(os.path.join(corpuspath, fileName.split(".")[0] + ".txt"))
            # Skip files that were already converted on a previous run
            if os.path.exists(txtpath):
                continue

#**********************************************************************
# This code strips the text from the PDFs
#**********************************************************************
            # Run pdf2txt.py once per file and degrade the output to ASCII
            raw = unicode(subprocess.check_output(['pdf2txt.py', pdfpath]), errors='ignore')
            ascii_text = unicodedata.normalize('NFKD', raw).encode('ascii', 'ignore')
            try:
                document = filter(lambda x: x in string.printable, ascii_text.decode('unicode_escape').encode('ascii', 'ignore'))
            except UnicodeDecodeError:
                document = ascii_text

            # Ignore documents with too little text to be useful
            if len(document) >= 300:
                # the with-block closes the file even if the write fails
                with open(txtpath, 'w') as outfile:
                    outfile.write(document)


In [413]:
kddcorpus= nltk.corpus.PlaintextCorpusReader(corpuspath, '.*\.txt')

<br><br>This is a pretty big step.  We have a semi-structured data set in a format where we can query and analyze different pieces of data.  All of our data is loaded as an NLTK corpus, meaning we could try tons of techniques outlined in the [NLTK book](http://www.nltk.org/book/) or use the NLTK APIs to pass data into [scikit-learn machine learning pipelines for text](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) (maybe for a later blog). Let's see how many words (including stop words) we have in our entire corpus.  <br><br>

In [12]:
wordcount = 0
for fileid in kddcorpus.fileids():
    wordcount += len(kddcorpus.words(fileid))
print wordcount


2795267


<br>To begin exploration of regular expressions, let's extract 'good enough' titles from a few of the documents.  For help on regex, visit https://regex101.com/. Here are the titles for the first 25 papers. <br><br>

In [422]:
# code uses regular expression to extract text up to the first new line character

p=re.compile('(.+)(\\n)')
for fileid in kddcorpus.fileids()[:25]:
    print p.search(kddcorpus.raw(fileid)).group(1).strip()  # use .strip() to remove whitespace from beginning and end of string

Online Controlled Experiments:
Mining Frequent Itemsets through Progressive Sampling
Why It Happened: Identifying and Modeling the Reasons of
Matrix Completion with Queries
Stochastic Divergence Minimization
Bayesian Poisson Tensor Factorization for Inferring
TimeCrunch: Interpretable Dynamic Graph Summarization
Inside Jokes: Identifying Humorous Cartoon Captions
Community Detection based on Distance Dynamics
Discovery of Meaningful Rules in Time Series
On the Formation of Circles in Co-authorship Networks
An Evaluation of Parallel Eccentricity Estimation
Efcient Latent Link Recommendation in
Turn Waste into Wealth: On Simultaneous Clustering and
Set Cover at Web Scale
Exploiting Relevance Feedback in Knowledge Graph
LINKAGE: An Approach for Comprehensive Risk
Transitive Transfer Learning
PTE: Predictive Text Embedding through Large-scale
An Effective Marketing Strategy for Revenue Maximization
Scaling Up Stochastic Dual Coordinate Ascent
Heterogeneous Network Embedding via Deep
Discov

### Data wrangling and computation: Using Regular Expressions to extract specific sections of the paper

We are close to the NERC portion.  But, there's a bit more wrangling to do (remember, PDFs are tough work).  For simplicity, let's focus the NERC on two sections of the paper:
* the top section which includes authors and schools
* the references section of the paper (keynote speaker abstracts do not have an abstract)

The tools of choice to extract sections are the ["positive lookbehind" and "positive lookahead"](https://docs.python.org/2/library/re.html) expressions. Here is an example of code to extract the abstract only:<br>

In [15]:
# replace every whitespace character with a space so '.' can match across former line breaks
raw = re.sub('[\s]'," ",kddcorpus.raw('p1035.txt'))
# set our regular expression
p= re.compile('(?<=ABSTRACT)(.+)(?=Categories and Subject Descriptors)')
try:
    abstract= p.search(raw).group(1)
except AttributeError:
    # fall back to a lowercase regex match in case capitalization is inconsistent
    p=re.compile('(?<=abstract)(.+)(?=categories and subject descriptors)')
    abstract=p.search(raw.lower()).group(1)
unicodedata.normalize('NFKD', abstract).encode('ascii','ignore').strip() # convert output from unicode to string and strip leading and trailing whitespace

'The collapsed variational Bayes zero (CVB0) inference is a vari- ational inference improved by marginalizing out parameters, the same as with the collapsed Gibbs sampler. A drawback of the CVB0 inference is the memory requirements. A probability vec- tor must be maintained for latent topics for every token in a corpus. When the total number of tokens is N and the number of topics is K, the CVB0 inference requires O(N K) memory. A stochas- tic approximation of the CVB0 (SCVB0) inference can reduce O(N K) to O(V K), where V denotes the vocabulary size. We re- formulate the existing SCVB0 inference by using the stochastic di- vergence minimization algorithm, with which convergence can be analyzed in terms of Martingale convergence theory. We also reveal the property of the CVB0 inference in terms of the leave-one-out perplexity, which leads to the estimation algorithm of the Dirichlet distribution parameters. The predictive performance of the propose SCVB0 inference is better than that o
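
The lookbehind/lookahead pattern used above can be tried without the corpus. The sketch below runs the same regular expression on a made-up string (the `toy` text is an assumption, not taken from any KDD paper) to show that neither marker ends up in the captured group.

```python
import re

# Toy text standing in for one extracted paper (not from the corpus)
toy = ("Some Title\nABSTRACT\nWe study entity\nextraction.\n"
       "Categories and Subject Descriptors\nH.2.8")

# Replace every whitespace character with a space so '.' can match across former line breaks
flat = re.sub(r"\s", " ", toy)

# (?<=...) requires the marker right before the match; (?=...) requires the marker right after it.
# Both are zero-width, so the markers themselves are not part of group 1.
pattern = re.compile(r"(?<=ABSTRACT)(.+)(?=Categories and Subject Descriptors)")
match = pattern.search(flat)
abstract = match.group(1).strip() if match else None
```

Checking `match` before calling `.group(1)` avoids the `AttributeError` that the cell above handles with `try`/`except`; either style works, and the explicit check is only a matter of taste.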

### References

<sup id="fn1">1. [(2014). Text Mining and its Business Applications - CodeProject. Retrieved December 26, 2015, from http://www.codeproject.com/Articles/822379/Text-Mining-and-its-Business-Applications.]<a href="#ref1" title="Jump back to footnote 1 in the text.">↩</a></sup>

<sup id="fn2">2. [Suchanek, F., & Weikum, G. (2013). Knowledge harvesting in the big-data era. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM.]<a href="#ref2" title="Jump back to footnote 2 in the text.">↩</a></sup>


<sup id ="fn3">3. [Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3-26.]<a href="#ref3" title = "Jump back to footnote 3 in the text">↩</a></sup>

# <span style="color:red">Parking Lot of links, leftover paragraphs, ideas, etc.</span>

Describe the data -> Data available here http://dl.acm.org/citation.cfm?id=2783258# 


## Ben's Outline from email

* ~~Give a brief introduction to the task, and why it's interesting, important. Then begin to discuss the data set, how you acquired, and where a reader can get access to it.~~ 

* ~~You then could have a data exploration section where you show the number of documents, perform a word count, show snippets of data (e.g. references) etc that are of interest.~~

* You can then go through one or a few of your "code to get" sections. These functions all follow basically the same pattern, so you could probably merge them into a single function, that appropriately selects the right regular expression. 

* The next step is to discuss, demonstrate your "truth tests" for text extraction accuracy. 

* Finally, you can get to an introduction of your three methods for NERC, and show how to do each of them. Then compare (visually) the results of the three according to the evaluation mechanism discussed above. 

* You could then conclude with a discussion about NLTK chunk vs. hand labelled entities. 
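
The merge suggested in the outline above (one function that selects the right regular expression) could look roughly like the following sketch. The marker lists in `SECTION_MARKERS` are simplified placeholders, and the function name `pull_section` is hypothetical. Unlike the prototypes below, it collapses runs of whitespace and uses a non-greedy match, so the end markers are written with single spaces.

```python
import re

# Hypothetical start/end markers per section; the real lists would come from the prototypes
SECTION_MARKERS = {
    "abstract": ("abstract", ["categories and subject descriptors", "keywords", "introduction"]),
    "keywords": ("keywords", ["1. introduction", "permission to make"]),
}


def pull_section(text, section):
    """Return the text between a section's start marker and the first end marker that matches."""
    start, ends = SECTION_MARKERS[section]
    # Lowercase and collapse whitespace so markers match regardless of line breaks
    flat = re.sub(r"\s+", " ", text.lower())
    for end in ends:
        # Lookbehind on the start marker, lookahead on the candidate end marker
        pattern = re.compile("(?<=" + re.escape(start) + ")(.+?)(?=" + re.escape(end) + ")")
        match = pattern.search(flat)
        if match:
            return match.group(1).strip()
    # None signals that no end marker was found for this section
    return None


# Made-up document, for illustration only
doc = "Title\nABSTRACT\nWe test things.\nKeywords\nner, regex\n1. INTRODUCTION\nBody"
abstract_text = pull_section(doc, "abstract")
keyword_text = pull_section(doc, "keywords")
```

Keeping the markers in one dictionary means that supporting a new section is a data change and not a fourth copy of the loop.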

In [None]:
kddcorpus_bigrams=[]
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
for fileid in kddcorpus.fileids():
    for l in (BigramCollocationFinder.from_words(kddcorpus.words(fileid)).nbest(bigram_measures.pmi, 10)):
        kddcorpus_bigrams.append(l)


# <span style="color:green">Prototype Holder</span>

In [None]:
# completed gold standard for keywords, got all of them..no stragglers
text = kddcorpus.raw('p39.txt').lower()
failids = []
full = True
section = "keywords"
if full == True:
    for fileid in kddcorpus.fileids():
        text = kddcorpus.raw(fileid).lower()
        if section == "keywords":
            section1="keywords"
            target = ""   
            section2=["1.  introduction  ","1.  introd ","1. motivation","permission to make ","1.motivation" ]
        
            part1= "(?<="+str(section1)+")(.+)"

            for sect in section2:
                try:
                    part2 = "(?="+str(sect)+")"
                    p=re.compile(part1+part2)
                    target=p.search(re.sub('[\s]'," ",text)).group(1)
                    if len(target) >50:
                        if len(target) > 300:
                            target = target[:200]
                        else:
                            target = target
                        
                        print [fileid,target,len(text)]
                        break

                    else:
                        failids.append(fileid)
                        pass
                except AttributeError:
                    pass
else:
    section = "keywords"
    
    if section == "keywords":
        section1="keywords"
        target = ""   
        section2=["1.  introduction  ","1.  introd ","1. motivation","permission to make ","1.motivation" ]

        part1= "(?<="+str(section1)+")(.+)"

        for sect in section2:
            try:
                part2 = "(?="+str(sect)+")"
                p=re.compile(part1+part2)
                target=p.search(re.sub('[\s]'," ",text)).group(1)
                if len(target) > 3:
                    break
            except AttributeError:
                pass
print target.strip()

In [178]:
failids

[]

In [None]:
# completed gold standard for abstract, near zero stragglers
failids = []
text=kddcorpus.raw('p1055.txt')

full = True
section = "abstract"
if full == True:
    for fileid in kddcorpus.fileids():
        text = kddcorpus.raw(fileid).lower()
        if section == "abstract":
            section1="abstract"
            target = ""   
            section2=["categories and subject descriptors","categories & subject descriptors","permission to make","keywords","introduction  1.","introduction", "\\\\n"]
            part1= "(?<="+str(section1)+")(.+)"

            for sect in section2:
                try:
                    part2 = "(?="+str(sect)+")"
                    p=re.compile(part1+part2)
                    target=p.search(re.sub('[\s]'," ",text)).group(1)
                    
                    if len(target) > 50:
                        
                        
                        print [fileid,len(target),len(text)]
                        break
                    else:
                        failids.append(fileid)
                        pass
                except AttributeError:
                    
                    pass
                              
else:
    
    section = "abstract"
    text = kddcorpus.raw('p1627.txt').lower()
    if section == "abstract":
        section1="abstract"
        target = ""   
        section2=["categories and subject descriptors","categories & subject descriptors","permission to make","keywords","introduction  1.","introduction", "\\\\n"]

        part1= "(?<="+str(section1)+")(.+)"

        for sect in section2:
            try:
                part2 = "(?="+str(sect)+")"
                p=re.compile(part1+part2)
                target=p.search(re.sub('[\s]'," ",text)).group(1)
                if len(target) > 50:

                    
                    break
            except:
                
                pass
                
print target.strip()

In [None]:
# completed gold standard for references, counts number of references and does "word per reference" score

failids = []
text=kddcorpus.raw('p149.txt').lower()

full = True
section = "references"
if full == True:
    for fileid in kddcorpus.fileids():
        text = kddcorpus.raw(fileid).lower()
        if section == "references":
            section1="references \[" 
            target = ""   
            
            part1= "(?<="+str(section1)+")(.+)"
            
            match = re.search(part1, re.sub('[\s]'," ",text))
            if match is None:
                # no "references [" marker in this file
                failids.append(fileid)
                continue
            target = match.group(1)
            if len(target) > 50:
                # calculate the number of references in a journal; finds digits between [] in references section only.
                # The opening "[" of the first reference is consumed by the lookbehind, hence the +1.
                refnum = len(re.findall('\[\d{1,3}\]',target))+1
                print [fileid,len(target),len(text), refnum, len(nltk.word_tokenize(text))/refnum]
                              
else:
    
    section = "references"
    
    if section == "references":
        section1="references \["
        target = ""   
        
        part1= "(?<="+str(section1)+")(.+)"

        try:
            target=re.search(part1, re.sub('[\s]'," ",text)).group(1)
            if len(target) > 50:
                print len(target)
        except AttributeError:
            # no references section found in this file
            pass
                
print target.strip()

In [411]:
len(set(failids))

In [None]:
# gold standard to get top section

failids = []
text=kddcorpus.raw('p1623.txt').lower()

full = True
section = "top"
if full == True:
    for fileid in kddcorpus.fileids():
        text = kddcorpus.raw(fileid).lower()
        if section == "top":
            section1=""
            section2=["abstract"]
            part1= "(?<="+str(section1)+")(.+)"

            for sect in section2:
                try:
                    part2 = "(?="+str(sect)+")"
                    p=re.compile(part1+part2)
                    target=p.search(re.sub('[\s]'," ",text)).group(1)
                    
                    if len(target)> 1000:
                        if len(target) > 3000 and float(len(target))/float(len(text)) > .22:
                            target = target[:2500]
                        else:
                            target=target
                   
                        print [fileid,len(target),len(text)]
                        break
                    else:
                        pass
                except AttributeError:
                    failids.append(fileid)
                    pass
    print "Done"
                              
else:
    
    if section == "top":
            section1=""
            section2=["abstract"]
            part1= "(?<="+str(section1)+")(.+)"

            for sect in section2:
                try:
                    part2 = "(?="+str(sect)+")"
                    p=re.compile(part1+part2)
                    target=p.search(re.sub('[\s]'," ",text)).group(1)
                    
                    if len(target)> 1000:
                        if len(target) > 3000 and float(len(target))/float(len(text)) > .22:
                            target = target[:2500]
                        else:
                            target=target
                   
                        print [fileid,len(target),len(text)]
                        break
                     
                    else:
                        pass
                except AttributeError:
                    failids.append(fileid)
                    pass
                
print target.strip()

In [None]:
failids

# <span style="color:violet">Drawing Board/Assembly Line</span>

In [583]:
# attempting function with gold keywords....WORKS

def keypull(docnum=None,section='keywords',full = False):
    
    ans={}
    failids = []
    section = section.lower()    
    if docnum is None and not full:
        raise ValueError("Enter target file to extract data from")
    
    if docnum is None and full:
        

        

        # to return output from entire corpus
        if full == True:
            for fileid in kddcorpus.fileids():
                text = kddcorpus.raw(fileid).lower()
                if section == "keywords":
                    section1="keywords"
                    target = ""   
                    section2=["1.  introduction  ","1.  introd ","1. motivation","(1. tutorial )"," permission to make ","  permission to make","(  permission to make digital )","    bio  ","abstract:  ","1.motivation" ]

                    part1= "(?<="+str(section1)+")(.+)"
                    for sect in section2:
                        try:
                            part2 = "(?="+str(sect)+")"
                            p=re.compile(part1+part2)
                            target=p.search(re.sub('[\s]'," ",text)).group(1)
                            if len(target) >50:
                                # truncate long matches to the first 200 characters
                                if len(target) > 300:
                                    target = target[:200]

                                ans[str(fileid)]={}
                                ans[str(fileid)]["keywords"]=target.strip()
                                ans[str(fileid)]["charcount"]=len(target)
                                #print [fileid,len(target),len(text)]
                                break
                            elif len(target) == 0:
                                failids.append(fileid)
                        except AttributeError:
                            # no match for this end marker; record the file and try the next marker
                            failids.append(fileid)
            return ans
        # to return output from one document
    else:
        ans = {}
        text=kddcorpus.raw(docnum).lower()
        if full == False:
            if section == "keywords":
                section1="keywords"
                target = ""   
                section2=["1.  introduction  ","1.  introd ","1. motivation","permission to make ","1.motivation" ]

                part1= "(?<="+str(section1)+")(.+)"

                for sect in section2:
                    try:
                        part2 = "(?="+str(sect)+")"
                        p=re.compile(part1+part2)
                        target=p.search(re.sub('[\s]'," ",text)).group(1)
                        if len(target) >50:
                            # truncate long matches to the first 200 characters
                            if len(target) > 300:
                                target = target[:200]
                            ans[docnum]={}
                            ans[docnum]["keywords"]=target.strip()
                            ans[docnum]["charcount"]=len(target)
                            break                  
                    except AttributeError:
                        # no match for this end marker; try the next one
                        pass
    return ans
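
The lookbehind/lookahead extraction used in `keypull` can be sketched as a small standalone helper. This is only an illustration, not the notebook's function: `extract_section`, its parameters, and the sample text are hypothetical. Unlike the cell above, it passes the markers through `re.escape`, so characters such as `.` and `(` in a marker are matched literally. It is written to run under Python 3 as well.

```python
import re

def extract_section(text, start, end_markers, min_len=50):
    """Return the text between `start` and the first usable end marker, or None.

    Hypothetical helper: `start` and `end_markers` are plain strings and are
    escaped before being placed in the pattern.
    """
    # Replace every whitespace character with a space, as the notebook does
    flat = re.sub(r"\s", " ", text.lower())
    for marker in end_markers:
        pattern = "(?<=" + re.escape(start) + ")(.+)(?=" + re.escape(marker) + ")"
        match = re.search(pattern, flat)
        # re.search returns None when this marker is absent; try the next one
        if match and len(match.group(1)) > min_len:
            return match.group(1).strip()
    return None

# Made-up sample text, not taken from the corpus
sample = ("ABSTRACT We propose a framework.\nKeywords\n"
          "anomaly detection, hierarchical monitoring, distributed systems, telemetry\n"
          "1. INTRODUCTION\nLarge services fail.")
keywords_found = extract_section(sample, "keywords", ["1.  introduction", "1. introduction"])
print(keywords_found)
```

Checking the match object for `None` plays the same role as catching `AttributeError` in the cell above.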


In [611]:
# attempting function with gold abstracts...WORKS

def abpull(docnum=None,section='abstract',full = False):
    
    ans={}
    failids = []
    section = section.lower()    
    if docnum is None and not full:
        raise ValueError("Pass a docnum to extract from, or set full=True to process the whole corpus")
    
    if docnum is None and full == True:
        # to return output from entire corpus
        # (each file is read inside the loop, so no corpus-wide read is needed here)
        if full == True:
            for fileid in kddcorpus.fileids():
                text = kddcorpus.raw(fileid).lower()
                if section == "abstract":
                    section1="abstract"
                    target = ""   
                    section2=["categories and subject descriptors","categories & subject descriptors","permission to make","keywords","introduction  1.","introduction", "\\\\n"]
                    part1= "(?<="+str(section1)+")(.+)"

                    for sect in section2:
                        try:
                            part2 = "(?="+str(sect)+")"
                            p=re.compile(part1+part2)
                            target=p.search(re.sub('[\s]'," ",text)).group(1)

                            if len(target) > 50:
                                ans[str(fileid)]={}
                                ans[str(fileid)]["abstract"]=target.strip()
                                ans[str(fileid)]["charcount"]=len(target)
                                
                                #print [fileid,len(target),len(text)]
                                break
                            else:
                                failids.append(fileid)
                                pass
                        except AttributeError:
                            pass
            return ans
                              
        # to return output from one document
    else:
        ans = {}
        failids=[]
        text = kddcorpus.raw(docnum).lower()
        if section == "abstract":
            section1="abstract"
            target = ""   
            section2=["categories and subject descriptors","categories & subject descriptors","permission to make","keywords","introduction  1.","introduction", "\\\\n"]

            part1= "(?<="+str(section1)+")(.+)"

            for sect in section2:
                try:
                    part2 = "(?="+str(sect)+")"
                    p=re.compile(part1+part2)
                    target=p.search(re.sub('[\s]'," ",text)).group(1)
                    
                    if len(target) > 50:
                        ans[str(docnum)]={}
                        ans[str(docnum)]["abstract"]=target.strip()
                        ans[str(docnum)]["charcount"]=len(target)
                        break
                except AttributeError:
                    # no match for this end marker; try the next one
                    pass
        return ans
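
A function stops at its first `return`, so the `failids` list built inside `abpull` never reaches the caller. One possible way to get both the extracted sections and the failed files back is to return them together and to record a file as failed only when none of its end markers matched. The sketch below assumes a plain dict of file id to text as a stand-in for `kddcorpus`; `pull_sections` and its arguments are hypothetical names.

```python
import re

def pull_sections(docs, start, end_markers, min_len=50):
    """Return (results, failed) for a mapping of file id -> raw text.

    Hypothetical sketch: `docs` stands in for the corpus reader.
    """
    results, failed = {}, []
    for fileid, text in docs.items():
        flat = re.sub(r"\s", " ", text.lower())
        for marker in end_markers:
            pattern = "(?<=" + re.escape(start) + ")(.+)(?=" + re.escape(marker) + ")"
            match = re.search(pattern, flat)
            if match and len(match.group(1)) > min_len:
                target = match.group(1)
                results[fileid] = {start: target.strip(), "charcount": len(target)}
                break
        else:
            # for/else: this runs only when the loop finished without a break,
            # meaning no end marker produced a usable match for this file
            failed.append(fileid)
    return results, failed

# Made-up documents for illustration
docs = {
    "a.txt": "Abstract " + "x" * 60 + " Keywords foo",
    "b.txt": "no sections here",
}
results, failed = pull_sections(docs, "abstract", ["keywords", "introduction"])
print(sorted(results))
print(failed)
```

With this shape, a file id cannot appear in both `results` and `failed`, which makes the overlap check between the two unnecessary.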


In [612]:
test = abpull('p19.txt')

In [None]:
print abstracts

In [None]:
for key, value in abstracts.iteritems():
    print value['charcount']

In [564]:
print len(keywords.keys())
print len(set(failids))

197
66


In [None]:
print set(failids)
print
print
print set(failids) & set(keywords.keys())

# <span style="color:orange">Testing Station</span>

In [455]:
print part1+part2

(?<=keywords)(.+)(?=introduction  )
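
The `(.+)` in this pattern is greedy: when the end marker occurs more than once in the flattened text, the capture runs up to its last occurrence. The short comparison below uses a made-up string (not from the corpus) to show the difference between the greedy pattern and a lazy `(.+?)` variant. The lazy form is an alternative to consider, not what the notebook uses.

```python
import re

# Made-up text in which the end marker "introduction  " appears twice
text = ("keywords data mining, entity extraction introduction  we begin. "
        "see the introduction  of [3].")

greedy = re.search(r"(?<=keywords)(.+)(?=introduction  )", text).group(1)
lazy = re.search(r"(?<=keywords)(.+?)(?=introduction  )", text).group(1)

print(repr(greedy.strip()))  # runs on to the last "introduction  "
print(repr(lazy.strip()))    # stops at the first "introduction  "
```

Greedy matching may be one reason some captures come back much longer than the section they are meant to contain.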


In [477]:
import re
text = kddcorpus.raw('p2029.txt')
p=re.compile('(.*)(?=abstract)')
top = p.search(re.sub('[\s]'," ",text)).group(1)
if len(top)> 3000 and float(len(top))/float(len(text)) > .22:
    top = top[:2500]
print top

Learning a Hierarchical Monitoring System for Detecting  and Diagnosing Service Issues  Vinod Nair, Ameya Raul  Microsoft Research India  {vnair,  t-amraul}@microsoft.com  Shwetabh Khanduja Microsoft Research India  t-shwetk@microsoft.com  Vikas Bahirwani  Microsoft, Redmond, WA  vikasba@microsoft.com  S. Sundararajan  Microsoft Research India ssrajan@microsoft.com  Sathiya Keerthi  Microsoft, Mountain View, CA keerthi@microsoft.com  Steve Herbert,  Sudheer Dhulipalla Microsoft, Redmond, WA {steve.herbert, sud-  heerd}@microsoft.com  ABSTRACT We propose a machine learning based framework for build- ing a hierarchical monitoring system to detect and diagnose service issues. We demonstrate its use for building a moni- toring system for a distributed data storage and computing service consisting of tens of thousands of machines. Our solution has been deployed in production as an end-to-end system, starting from telemetry data collection from indi- vidual machines, to a visualization tool 