## Cleaning up NER tagging results
- This is a notebook I used to take a list of named entities from Stanford's NER tagger (people, in this case) and clean that list up to produce a list of all of the PERSON entities in a text.
- Most of this comes from this tutorial: https://erickpeirson.github.io/python/2015/05/01/named-entity-recognition.html. Thank you, Erick Peirson!
- I ran this notebook with a Python 2 kernel; here's how to switch kernels: https://ipython.readthedocs.io/en/latest/install/kernel_install.html
- This notebook assumes you have already produced a tagged list of entities (people, locations, and organizations) using Stanford's NER tagger via the command line. You can learn more about that here: https://erickpeirson.github.io/python/2015/05/01/named-entity-recognition.html
    - There is also an NER server package (the link is to its npm page, so it is a Node.js package rather than a Python one), but I was never able to get it working correctly. Here's some info on that: https://www.npmjs.com/package/ner-server
    - There is also the StanfordNERTagger available via the nltk package, but I also wasn't able to get that working correctly.
- This tutorial also assumes that you have a relatively clean tagged document, with ",'/-...-- removed from the text (I have found it usually best to remove these prior to tagging). The code below is fairly brittle and won't work correctly if the tagged text doesn't break evenly into token/tag pairs (e.g., if a tuple ends up with three elements instead of two, such as ('-','-','O')). Punctuation characters often cause this to happen.
- The usual caveats about academic programming apply. I am an especially inexperienced Python programmer, so there are most definitely better ways to do this. I just don't know how to do them yet. But this worked for my purposes.

In [None]:
# Load the file (tagged text).
with open("littlelife_tagged.txt") as infile:
    tagged_text = infile.read()


In [None]:
# Begin parsing tagged text file.
# Pull tags apart to create a list of tuples.
tagged_tokens = [ tuple(ttok.split('/')) for ttok in tagged_text.split() ]
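
One way to make this split less brittle, sketched here under the assumption that the tag is always whatever follows the last slash in each whitespace-separated chunk: split from the right instead of on every slash. The helper name `split_tagged` is hypothetical and not part of the original tutorial.

```python
def split_tagged(tagged_text):
    """Split 'token/TAG' chunks into (token, tag) tuples.

    Assumes the tag follows the LAST slash in each chunk, so a token that
    itself contains a slash (e.g. 'and/or/O') still yields a 2-tuple.
    Chunks with no slash at all are skipped.
    """
    pairs = []
    for chunk in tagged_text.split():
        token, sep, tag = chunk.rpartition('/')
        if sep:  # a slash was found, so this chunk has a tag
            pairs.append((token, tag))
    return pairs


sample = "Jude/PERSON said/O and/or/O -/-/O"
print(split_tagged(sample))
```

Whether a chunk like `-/-/O` should become the token `-/-` or be thrown away depends on your text, so it is still worth looking at the first few items before moving on.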

In [None]:
# Check first few items in list you just created.
print(tagged_tokens[0:100])

In [None]:
# Check to see if the for-loop below will work.
# If this creates a dictionary without a problem, you're ok to move on to the next step.
# If there is an error, see the two commented out cells below.
tagged_tokens_dict = dict(tagged_tokens)

In [None]:
# Use these cells if function below isn't working (uncomment the code).
# Take the number the dictionary error returns and use it to discover tuples with more than 2 elements in them.
# This cell confirms that this is a tuple with more than 2 elements and tells you what they are.
# tagged_tokens[305472]

In [None]:
# This one deletes them from the list.
# tagged_tokens.pop(305472)

# After deleting, go back up to the tagged_tokens_dict code above and repeat until there are no more errors.
# If you've cleaned your text prior to tagging, hopefully you won't have too many of these errors.
# If there are a lot, this will be very tedious.
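
If popping bad tuples one at a time gets tedious, a rough alternative is to find all of them in a single pass. This is only a sketch: `find_malformed` and `drop_malformed` are hypothetical helper names, and it assumes that any item which is not exactly a (token, tag) pair can safely be discarded.

```python
def find_malformed(tagged_tokens):
    """Return (index, item) for every item that is not a 2-tuple."""
    return [(i, item) for i, item in enumerate(tagged_tokens) if len(item) != 2]


def drop_malformed(tagged_tokens):
    """Return a new list containing only the well-formed (token, tag) pairs."""
    return [item for item in tagged_tokens if len(item) == 2]


sample_tokens = [('Jude', 'PERSON'), ('-', '-', 'O'), ('walked', 'O'), ('oops',)]
print(find_malformed(sample_tokens))   # inspect what would be thrown away
clean_tokens = drop_malformed(sample_tokens)
dict(clean_tokens)                     # no longer raises an error
```

Printing the malformed items first keeps the manual check from the cells above: you can still see exactly what is being removed before you remove it.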

In [None]:
# Now we have to figure out which tokens belong to each other as part of the same named entity.
# Stanford's NER will split first and last names, for example.
# But since Stanford’s NER treats punctuation characters as separate tokens (e.g. ./O), 
# we can be reasonably sure that when a sequence of tokens with the same tag occurs together, those tokens belong to the same named entity.
# Generate a list of entities from a single tagged text. 
entities = []         # Named entity instances will go here.
current_entity = []   # Tokens that are part of the current entity will go here.

last_tag = None       # We'll use this to check whether a token is part of the same entity as the previous.

for token, tag in tagged_tokens:    # Evaluate each (token, tag) pair, in order.
    if tag == 'O' or last_tag != tag:   # We've reached the end of the current entity.
        # If that entity had a real tag (not 'O' or None), then save it.
        if last_tag != 'O' and last_tag is not None:
            # We save the list of tokens in this named entity, along with its tag, as a tuple.
            # ' '.join() converts the list of tokens into a string.
            entities.append((' '.join(current_entity), last_tag))
        current_entity = []   # Reset for a new entity.
    last_tag = tag            # Keep track of the current tag; the two checks above compare against it on the next pass.
    current_entity.append(token)

# If the text ends on a tagged token, save that final entity too.
if current_entity and last_tag != 'O' and last_tag is not None:
    entities.append((' '.join(current_entity), last_tag))
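
The same grouping idea (consecutive tokens that share a tag belong to one entity) can also be sketched with `itertools.groupby` from the standard library. This is an alternative I am suggesting, not the tutorial's method; the function name `group_entities` is hypothetical, and it assumes every item in the list is a clean (token, tag) pair.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share a non-'O' tag.

    Returns a list of (entity_string, tag) tuples.
    """
    entities = []
    # groupby yields one group per run of adjacent items with the same tag.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities


sample_tokens = [('Jude', 'PERSON'), ('St.', 'PERSON'), ('Francis', 'PERSON'),
                 ('lives', 'O'), ('in', 'O'),
                 ('New', 'LOCATION'), ('York', 'LOCATION')]
print(group_entities(sample_tokens))
```

Like the loop, this treats two different people mentioned back to back with nothing between them as a single entity, so the output still needs a look by eye.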


In [None]:
# Because I like knowing what's happened.
print(entities)

In [None]:
# Now that we have our entities and their classes (tags), we can go in many different directions. 
# We need to pull out all of the entities and group them by class. 
# In the code below, we iterate over the list of (entity,tag) tuples, and sort the entities into a 
# dictionary (entities_binned) based on their tags.

entities_binned = {}
for entity, tag in entities:
    # When we encounter a tag for the first time, we need to create a spot for it in our dictionary.
    if tag not in entities_binned:    
        entities_binned[tag] = []

    entities_binned[tag].append(entity) 
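
For comparison, here is a small sketch of the same binning step using `collections.defaultdict`, which creates the empty list for a new tag automatically. The function name `bin_entities` is hypothetical; the input is assumed to be a list of (entity, tag) tuples like the one built above.

```python
from collections import defaultdict


def bin_entities(entities):
    """Group entity strings by tag: {'PERSON': [...], 'LOCATION': [...]}."""
    binned = defaultdict(list)
    for entity, tag in entities:
        binned[tag].append(entity)   # a missing tag starts out as an empty list
    return dict(binned)              # convert back to a plain dictionary


sample_entities = [('Jude', 'PERSON'), ('New York', 'LOCATION'), ('Willem', 'PERSON')]
print(bin_entities(sample_entities))
```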

In [None]:
# This is the resulting dictionary
entities_binned

In [None]:
# But we want just the character names in this specific case.
char_names = entities_binned["PERSON"]

In [None]:
# Write to json because I couldn't get it to write to a text file for some reason.
# This will produce a simple json file of all of the entities from the PERSON category.
# I then saved this json file as a plain text file and did some minor reg-exing in my text editor to 
# delete { and ' and extra spaces, resulting in a list of all of the PERSON entities in a text, where each entity is 
# on its own line. You will still have duplicate entities at this stage, so you will need to delete them.
# It's fairly easy to do this using a text editor like the old-school TextWrangler.
# Again, there are definitely ways to get Python to do this work for you, but it's easier for me
# at this stage to simply do them myself in a text editor.
# You can then customize this list further to create a character list (consolidating names and nicknames, including alternate names, deleting mistakes/historical figures, etc.).
import json

# Use a separate variable name so the json module isn't overwritten.
char_names_json = json.dumps(char_names)
with open("littlelife_char_names.json", "w") as f:
    f.write(char_names_json)
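
The text-editor cleanup described above (one name per line, duplicates removed) could also be done in Python. The following is a minimal sketch, not what I actually ran: `unique_in_order` and `write_names` are hypothetical helper names, and the output filename in the usage note is made up for illustration.

```python
def unique_in_order(names):
    """Drop duplicate names while keeping the order of first appearance."""
    seen = set()
    unique = []
    for name in names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


def write_names(names, path):
    """Write each unique name on its own line in a plain text file."""
    with open(path, "w") as f:
        for name in unique_in_order(names):
            f.write(name + "\n")


sample_names = ["Jude", "Willem", "Jude", "JB", "Willem"]
print(unique_in_order(sample_names))
```

With the list from above, the call would look like `write_names(char_names, "littlelife_char_names.txt")`. Consolidating nicknames and deleting mistakes still has to be done by hand.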