# Homework 2

In this homework you will be performing some analysis with entity extraction. In particular, you will be looking at the Reuters corpus and trying to construct entity profiles of persons, organizations, and locations. This will require you to iterate through the documents in the Reuters corpus, parse them appropriately, extract entities, and then store the entities along with some surrounding text. Additionally, you will be looking for mechanisms to identify potential relationships between persons and locations.


Throughout the assignment you will use NLTK to access the corpus and spaCy as the entity extraction system for named entity recognition.

The basic idea is to build a knowledge base around the entities you extract from the Reuters corpus. Normally, this would be a first step toward modeling tasks such as entity resolution across documents. You could also use it as a first step toward analyzing the sentiment expressed about particular entities, for example people expressing dissatisfaction with a restaurant or brand.


Follow the steps below and read the comments carefully to see what tasks your code will need to perform.


I would expect that some of you might be able to reuse parts of this code for your project...

# Step 1) 
Import necessary libraries

In [None]:
# This will be the corpus we work from

# in order for this to work you will have to have installed NLTK 
# and also installed the reuters data

# to install NLTK, pip install nltk
# 
#
# To install the Reuters corpus, follow the instructions here: https://www.nltk.org/data.html
# The easy way to install the Reuters corpus is usually:
# import nltk
# nltk.download('reuters')


# This will import the Reuters corpus, assuming you have it
from nltk.corpus import reuters



In [None]:
# You will want to use spaCy as your entity recognizer
# my suggestion would be to make sure you are using a 2.x version of spaCy
# pip install spacy==2.3.5
import spacy
# note: the model name can vary. Depending on your setup the model is loaded by its short name ("en")
# or by its full name ("en_core_web_sm"). The "en" shortcut only exists in spaCy 2.x.
# if you run into issues here, check the spaCy model page at https://spacy.io/usage/models
nlp = spacy.load("en")

# alternatively try: 
# nlp = spacy.load("en_core_web_sm")

# Step 2) 
Fill in the following function to extract the entity, document id, and relevant sentence text from the input

In [None]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)
    
    # these three dictionaries will hold all the persons, organizations, and locations you find in a document.
    # You will need to add each entity you encounter in the document to the matching dictionary.
    # For the key you can use the text of the entity; for the value you will want to use the document id and the
    # text of the sentence. One challenge is that an entity might occur multiple times in the document,
    # thus the value should really be a document id and a list of sentence texts (or something similar).
    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "":
            # The .label_ property will provide information on the type of entity tagged
            print(" -> ", entity.label_)
            # The .text property will display the actual text of the entity in the text
            print("->", entity.text.strip(), "<-")
            # You can also access the sentence that the entity is contained in by using the .sent property
            # inside the sentence you can then use the .text property
            print("->", entity.sent.text, "<-")
            
            
            # one way to represent the document id and the sentence text would be with a tuple
            # thus, you could do:
            relevant_sentence = (doc_id, entity.sent.text)
            
            # add the relevant document id and sentence to the entity record
            
            
            
    return doc_persons, doc_organizations, doc_locations
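
One possible way to handle the "add the relevant document id and sentence to the entity record" step is to route each entity by its label and append a tuple to a per-entity list. The sketch below is only an illustration: the names `LABEL_TO_CATEGORY`, `record_mention`, and `group_mentions` are hypothetical, and treating both `GPE` and `LOC` as locations is an assumption you may want to revisit. It works on plain `(label, text, sentence)` triples so it can be tried without loading a spaCy model; the sample input is made up.

```python
# Hypothetical mapping from spaCy entity labels to the three categories in this
# homework. Treating both GPE and LOC as "locations" is an assumption.
LABEL_TO_CATEGORY = {
    "PERSON": "persons",
    "ORG": "organizations",
    "GPE": "locations",
    "LOC": "locations",
}


def record_mention(store, entity_text, doc_id, sentence_text):
    """Append a (doc_id, sentence) tuple to the list kept for this entity."""
    store.setdefault(entity_text, []).append((doc_id, sentence_text))


def group_mentions(doc_id, mentions):
    """Group (label, entity_text, sentence_text) triples by category.

    Inside extract_entities the triples would come from
    (entity.label_, entity.text.strip(), entity.sent.text).
    """
    grouped = {"persons": {}, "organizations": {}, "locations": {}}
    for label, entity_text, sentence_text in mentions:
        category = LABEL_TO_CATEGORY.get(label)
        if category is None or not entity_text:
            continue  # skip labels we do not track (dates, money, ...)
        record_mention(grouped[category], entity_text, doc_id, sentence_text)
    return grouped


# Made-up input, only to show the shape of the result.
example = group_mentions(
    "test/0001",
    [
        ("PERSON", "Jane Doe", "Jane Doe visited Tokyo."),
        ("GPE", "Tokyo", "Jane Doe visited Tokyo."),
        ("DATE", "Monday", "Trading resumed on Monday."),
        ("PERSON", "Jane Doe", "Jane Doe declined to comment."),
    ],
)
print(example["persons"])
```

Keeping a list of tuples per entity means repeated mentions in the same document are preserved, which matters later when entities are ranked by number of mentions.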
        

# Step 3) 
Adjust the following code to run the document entity extraction function
Also, add the entity records you are constructing to your master list of entities
Note: for the full submission, run across all the Reuters documents

In [None]:
num_docs = len(reuters.fileids())
#  this has a large number of files... 
# you might wish to limit the number of documents you use while developing your technique 
# ex. reuters.fileids()[0:25]

# these three dictionaries will hold all the references to persons, organizations, and locations across documents
combined_persons = {}
combined_organizations = {}
combined_locations = {}

# this will only iterate over the first 25 documents, for the real submission you will need to run across all documents
for doc_id in reuters.fileids()[0:25]: 
    # this doc_text variable will give you a text version of the news article. This could be tokenized.
    doc_text = reuters.open(doc_id).read()
    persons, organizations, locations = extract_entities(doc_id, doc_text)
    
    # you will need to write something here to put the persons, organizations, and locations found in a document into the 
    # combined_persons, combined_organizations, and combined_locations dictionaries.
    # here you will need to consider how to extend the values already in the dictionaries
    # maybe something like:
    # for person in persons.keys():
    #     if person not in combined_persons.keys():
    #         --- add a person key to the combined persons list
    #     now here you can add the person's document ids and sentence texts to the dictionary value
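
The merging idea in the comments above could be sketched as follows. This is a minimal illustration, assuming each dictionary maps an entity's text to a list of `(doc_id, sentence)` tuples; the helper name `merge_entities` and the sample data are hypothetical.

```python
def merge_entities(combined, per_document):
    """Extend combined[entity] with the mentions found in one document.

    Assumes both dictionaries map entity text -> list of (doc_id, sentence).
    """
    for entity_text, mentions in per_document.items():
        # setdefault creates the empty list the first time an entity is seen
        combined.setdefault(entity_text, []).extend(mentions)
    return combined


# Made-up data standing in for the output of extract_entities.
combined_demo = {}
merge_entities(combined_demo, {"Tokyo": [("doc-a", "Tokyo markets rose.")]})
merge_entities(
    combined_demo,
    {
        "Tokyo": [("doc-b", "Exports from Tokyo fell.")],
        "Paris": [("doc-b", "Paris was unchanged.")],
    },
)
print({name: len(mentions) for name, mentions in combined_demo.items()})
# {'Tokyo': 2, 'Paris': 1}
```

Inside the loop, the same helper could be called once for each of the three categories.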
    
    

# Step 4)
Fill in the following method to look through the content of an entity dictionary to determine the most popular based on number of mentions

In [None]:
# now that we have the text associated with the entities, 
# you will want to focus on the 500 top entities in each category
# Identify the top 500 entities by the count of their occurrences
def find_most_popular_entities(entity_dictionary):
    # sort through the entities in the dictionary by the number of sentences
    
    return list_of_dictionary_keys_with_most_mentions
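
As one way to think about the ranking, the sketch below picks the keys with the longest mention lists using `heapq.nlargest`. It assumes the dictionary values are lists of mentions; the function name `top_entities_by_mentions` and the `limit` parameter are hypothetical, chosen so the example does not replace the function you are asked to write.

```python
import heapq


def top_entities_by_mentions(entity_dictionary, limit=500):
    """Return up to `limit` keys, ordered from most to fewest mentions."""
    return heapq.nlargest(
        limit,
        entity_dictionary,  # iterating a dict yields its keys
        key=lambda name: len(entity_dictionary[name]),
    )


# Made-up mention lists of different lengths.
demo_entities = {"A": [1, 2, 3], "B": [1], "C": [1, 2]}
print(top_entities_by_mentions(demo_entities, limit=2))
# ['A', 'C']
```

Sorting the keys with `sorted(..., key=..., reverse=True)` and slicing would work equally well; `nlargest` simply avoids sorting the whole dictionary when only the top entries are needed.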

# Step 5)
Now invoke your top entity mention finder

In [None]:
# simply get the top persons, organizations, and locations
top_persons = find_most_popular_entities(combined_persons)
top_organizations = find_most_popular_entities(combined_organizations)
top_locations = find_most_popular_entities(combined_locations)

# Step 6) 

Analyze the most popular entities to determine what words they most frequently occur with

In [None]:
# use these dictionaries to store the most frequent terms associated with the entities
person_most_popular_terms = {}
organization_most_popular_terms = {}
location_most_popular_terms = {}

# finally, now find the most frequent tokens associated with the entities
for person in top_persons:
    # fill this dictionary with all the words in the context of the person entity
    person_token_dictionary = {}

# now do the same for the top organizations
# (top_organizations = find_most_popular_entities(combined_organizations), as in Step 5)
for organization in top_organizations:
    # fill this dictionary with all the words in the context of the organization entity
    organization_token_dictionary = {}

    
    
for location in top_locations:
    # fill this dictionary with all the words in the context of the location entity
    location_token_dictionary = {}
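
A minimal sketch of the term counting is shown below, assuming each entity's mentions are stored as `(doc_id, sentence)` tuples. The regular-expression tokenizer, the tiny `STOPWORDS` set, and the name `most_frequent_context_terms` are all hypothetical simplifications; spaCy or NLTK tokenization and a fuller stopword list would likely give better results.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; a real run would need a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "said", "is", "was"}


def most_frequent_context_terms(entity_text, mentions, limit=10):
    """Count lowercase word tokens in an entity's sentences, excluding the entity itself."""
    entity_tokens = set(re.findall(r"[a-z]+", entity_text.lower()))
    counts = Counter()
    for _doc_id, sentence_text in mentions:
        for token in re.findall(r"[a-z]+", sentence_text.lower()):
            if token not in STOPWORDS and token not in entity_tokens:
                counts[token] += 1
    return counts.most_common(limit)


# Made-up mentions for a single entity.
demo_mentions = [
    ("doc-a", "Tokyo shares rose on Monday."),
    ("doc-b", "Shares in Tokyo fell."),
]
print(most_frequent_context_terms("Tokyo", demo_mentions, limit=1))
# [('shares', 2)]
```

The result for each entity could then be stored under its name in `person_most_popular_terms`, `organization_most_popular_terms`, or `location_most_popular_terms`.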

# Step 7)

Present your results of the most popular entities and their associated terms

In [None]:
# present your results
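
One simple way to present the results is a plain-text report with one line per entity. The sketch below is only a suggestion; the function name `format_entity_report`, its parameters, and the sample data are hypothetical, and it assumes the popular terms are stored as lists of `(term, count)` pairs.

```python
def format_entity_report(title, top_entities, mention_store, popular_terms):
    """Build a plain-text report: one line per entity with its count and terms."""
    lines = [title, "-" * len(title)]
    for rank, name in enumerate(top_entities, start=1):
        mention_count = len(mention_store.get(name, []))
        terms = ", ".join(term for term, _count in popular_terms.get(name, []))
        lines.append(f"{rank:>3}. {name} ({mention_count} mentions): {terms}")
    return "\n".join(lines)


# Made-up inputs, only to show the layout.
report = format_entity_report(
    "Top locations",
    ["Tokyo"],
    {"Tokyo": [("doc-a", "Tokyo shares rose."), ("doc-b", "Shares in Tokyo fell.")]},
    {"Tokyo": [("shares", 2), ("rose", 1)]},
)
print(report)
```

A table built with a library such as pandas would be an equally reasonable choice if you prefer richer output in the notebook.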

## Extra Credit
There are several extra credit options for this assignment.
The first would be to determine which persons, organizations, and locations most frequently occur in the same sentences.
Another task would be to attempt to resolve different forms of the same name for each person and location, for example "George Bush" and "Bush" inside the same document.
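
For the first extra credit option, one possible starting point is to index entities by the sentence they appear in and then count pairs. The sketch below assumes every mention is stored as a hashable `(doc_id, sentence)` tuple; the name `count_sentence_cooccurrences` and the sample data are hypothetical. Note that an entity text appearing in more than one category is treated as a single entity here.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_sentence_cooccurrences(*mention_stores):
    """Count, for each pair of entities, the sentences in which both appear."""
    entities_by_sentence = defaultdict(set)
    for store in mention_stores:
        for entity_text, mentions in store.items():
            for mention in mentions:  # mention is a (doc_id, sentence) tuple
                entities_by_sentence[mention].add(entity_text)

    pair_counts = Counter()
    for entities in entities_by_sentence.values():
        # sorting makes each pair appear in one canonical order
        for pair in combinations(sorted(entities), 2):
            pair_counts[pair] += 1
    return pair_counts


# Made-up stores in the same shape as combined_persons and combined_locations.
demo_persons = {
    "Jane Doe": [("d1", "Jane Doe visited Tokyo."), ("d2", "Jane Doe spoke.")]
}
demo_locations = {"Tokyo": [("d1", "Jane Doe visited Tokyo.")]}
pairs = count_sentence_cooccurrences(demo_persons, demo_locations)
print(pairs.most_common(5))
```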