# Identifying entities in notes

The goal of this notebook is to determine how successful entity identification is in the notes.


In [1]:
%matplotlib inline
from __future__ import print_function
import os
from pyspark import SQLContext
from pyspark.sql import Row
import pyspark.sql.functions as sql
#from pyspark.sql.functions import udf, length
import matplotlib.pyplot as plt
import numpy
import math
import seaborn as sns

import nltk
import pyspark.ml.feature as feature



In [2]:
# Load Processed Parquet
sqlContext = SQLContext(sc)
notes = sqlContext.read.parquet("../data/idigbio_notes.parquet")
total_records = notes.count()
print(total_records)
# Small sample of the df
notes = notes.sample(withReplacement=False, fraction=0.001)
notes.cache()
print(notes.count())

3232459
3214


In [3]:
for r in notes.head(20):
    print(r['document'] + "\n")

Interior label \"Polyporus ursinus\"  

 TC 733 (Cekalovic's lot number) 

 Orchard mesophytic. 

locally common  

 Cruise 6 

  [ #67 | 6 | 30/7/70 ][ Nanorchestes | 1) L | 2) F | 3) N2 | 4) N3 | 5) F | 6) F ] - (data from lid of box) Nanorchestes antarctius | Strandtmann | 4 | coll by : Kay L. Lindsay | 1969-70 | austral su. | Taylor, Wright Valleys | sites #: 52, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63,64, 65, 66, 67, 68 ]

Above crossbars green & black, below bright yellow in preserved animals on arrival.  

Location: Slide Box 36, Tray 14, Slide 3; J-31  

Designated HOLOTYPE.  

  [19°58'18"S;40°32'05"W BRASIL: Espirito Santo, Santa Teresa, Est. Biol. Sta. Lúcia, 1-4.VII.1997 840m. W.A. Hoffman, R. Ribeiro, yellow pan traps]

Arching perennial herb 1.5 m tall; flowers white; young fruit green.  

Expedition by Dr. Barton W. Evermann, Dr. John Van Denburgh and Joseph R. Slevin.  

BBP 7.10-C5  

Sheet 1 of 2.  

 malaise trap 

Abundant. Near top of Seth Bullock Peak. Open woods. 

In [4]:
notes_pdf = notes.toPandas()

## Sentence detection

Does splitting into sentences matter? Is there enough information to do this with a natural language library, or should delimiters such as ",", "[]", and "{}" be taken into account to handle semi-structured data?
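
One way to explore the semi-structured case is to split on structural delimiters before any sentence detection. The sketch below is only an illustration: the `split_fields` name and the delimiter set are assumptions, and commas are left out on purpose because they also occur inside ordinary sentences.

```python
import re

# Delimiters that often separate fields in label-style notes (an assumption,
# chosen by eye from the sample above, not a tested rule).
FIELD_DELIMITERS = re.compile(r"[|;\[\]{}]")

def split_fields(note):
    """Split a note on structural delimiters and drop empty pieces."""
    if note is None:
        return []
    pieces = FIELD_DELIMITERS.split(note)
    return [p.strip() for p in pieces if p.strip()]

print(split_fields("[ #67 | 6 | 30/7/70 ][ Nanorchestes | 1) L ]"))
```

Each resulting piece could then be passed to a sentence tokenizer or straight to the word tokenizer, which makes it easy to compare results with and without this step.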

## Entities from documents


In [5]:
def tokenize(s):
    '''
    Take a string and return a list of tokens split out from it
    with the nltk library. Missing documents yield an empty list.
    '''
    if s is not None:
        return nltk.tokenize.word_tokenize(s)
    else:
        return []

# Series.apply gives one token list per row on both Python 2 and 3
notes_pdf['tokens'] = notes_pdf['document'].apply(tokenize)

In [6]:
print(notes_pdf.head()['tokens'])

0    [Interior, label, \, '', Polyporus, ursinus\, '']
1          [TC, 733, (, Cekalovic, 's, lot, number, )]
2                             [Orchard, mesophytic, .]
3                                    [locally, common]
4                                          [Cruise, 6]
Name: tokens, dtype: object


In [7]:
def part_of_speech(t):
    '''
    With a list of tokens, mark their part of speech and return
    a list of tuples.
    '''
    return nltk.pos_tag(t)

notes_pdf['pos'] = notes_pdf['tokens'].apply(part_of_speech)

In [9]:
print(notes_pdf.head()['pos'])

0    [(Interior, NNP), (label, NN), (\, :), ('', ''...
1    [(TC, NNP), (733, CD), ((, CD), (Cekalovic, NN...
2           [(Orchard, NNP), (mesophytic, JJ), (., .)]
3                        [(locally, RB), (common, JJ)]
4                              [(Cruise, NN), (6, CD)]
Name: pos, dtype: object


In [11]:
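def chunk(p):
    '''
    With a list of (token, tag) tuples, group named entities into
    subtrees and return the resulting nltk tree.
    '''
    return nltk.chunk.ne_chunk(p)

notes_pdf['chunks'] = notes_pdf['pos'].apply(chunk)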

In [12]:
print(notes_pdf.head()['chunks'])

0    [[(Interior, NNP)], (label, NN), (\, :), ('', ...
1    [(TC, NNP), (733, CD), ((, CD), [(Cekalovic, N...
2         [[(Orchard, NNP)], (mesophytic, JJ), (., .)]
3                        [(locally, RB), (common, JJ)]
4                              [(Cruise, NN), (6, CD)]
Name: chunks, dtype: object
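
The printed trees are hard to read, so it may help to flatten each one into a list of (text, type) pairs. This is a minimal sketch, assuming the NLTK 3 `Tree` interface (`label()` and `leaves()`) and that entities appear as top-level subtrees; `extract_entities` is a hypothetical helper name.

```python
def extract_entities(tree):
    """Return (entity_text, entity_type) pairs from a chunked sentence."""
    entities = []
    for node in tree:
        # Named-entity subtrees expose label(); plain (token, tag) tuples do not.
        if hasattr(node, 'label'):
            text = " ".join(token for token, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Possible use on the frame built above (not run here):
# notes_pdf['entities'] = notes_pdf['chunks'].apply(extract_entities)
```

Duck typing on `label()` keeps the helper free of an nltk import, at the cost of silently skipping anything that is not a tree.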


Now, with some chunks, can we find any that match terms from the Darwin Core text? Use word2vec on the
Dude, this is a Hard Problem. We need the ontology lookup service's code:
http://www.ebi.ac.uk/ols/beta/search?q=puma&groupField=iri&start=0&ontology=envo
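
Before reaching for embeddings or an ontology service, a crude word-overlap baseline can show how far plain string matching gets. The sketch below is a swapped-in technique (Jaccard overlap), not word2vec, and the glosses are hypothetical one-liners written for illustration, not the official Darwin Core definitions.

```python
import re

def words(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def best_term(phrase, glosses):
    """Return (term, score) for the gloss with the highest Jaccard overlap."""
    phrase_words = words(phrase)
    best = (None, 0.0)
    for term, gloss in sorted(glosses.items()):
        gloss_words = words(gloss) | words(term)
        union = phrase_words | gloss_words
        score = len(phrase_words & gloss_words) / float(len(union)) if union else 0.0
        if score > best[1]:
            best = (term, score)
    return best

# Hypothetical glosses for illustration only.
glosses = {
    "locality": "the specific description of the place",
    "habitat": "a description of the habitat such as woods or meadow",
}
print(best_term("Open woods", glosses))
```

A phrase with no shared words returns `(None, 0.0)`, which is the common case for short notes and a good reminder of why a semantic method is needed.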

In [22]:
# https://github.com/alvations/pywsd
# This uses its own term definitions
from pywsd.similarity import max_similarity
s = """locality The specific description of the place. Less specific geographic information can be 
provided in other geographic terms (higherGeography, continent, country, stateProvince, county, 
                                    municipality, waterBody, island, islandGroup). This term may 
contain information modified from the original to correct perceived errors or standardize the description."""

In [21]:
print(max_similarity(s, 'town', 'lin'))

Synset('township.n.01')


## Making triples

Piece together subject-verb-object sets and take a look at them manually, even if we don't know what they mean.
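
As a starting point for manual inspection, triples can be pulled from the part-of-speech tags with a very naive rule: for each verb, take the nearest noun on either side. This is a rough sketch, not a parser; `naive_triples` is a hypothetical name and the tagged example is hand-written rather than nltk output.

```python
def naive_triples(tagged):
    """Return (subject, verb, object) for each verb with a noun on both sides."""
    triples = []
    for i, (token, tag) in enumerate(tagged):
        if not tag.startswith('VB'):
            continue
        # Penn Treebank noun tags all start with NN (NN, NNS, NNP, NNPS).
        before = [t for t, g in tagged[:i] if g.startswith('NN')]
        after = [t for t, g in tagged[i + 1:] if g.startswith('NN')]
        if before and after:
            triples.append((before[-1], token, after[0]))
    return triples

# Hand-tagged example for illustration.
tagged = [("Evermann", "NNP"), ("collected", "VBD"), ("two", "CD"), ("specimens", "NNS")]
print(naive_triples(tagged))
```

Many notes are fragments with no verb at all, so most rows would yield an empty list; the share of empty rows is itself a useful number to look at.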