# Homework 2

In this homework you will be performing some analysis with entity extraction. In particular, you will be looking at the Reuters corpus and trying to construct entity profiles of persons, organizations, and locations. This will require you to iterate through the documents in the Reuters corpus, parse them appropriately, extract entities, and then store the entities along with some surrounding text. Additionally, you will be looking for mechanisms to identify potential relationships between persons and locations.


Throughout this you will need to use NLTK to access the corpus. At the same time, you will need to use an entity extraction system. You will want to use Spacy for named entity recognition.

The basic idea is to build a knowledge base around the entities you will extract in the Reuters corpus. Normally, this would be a first step to trying to model such things as entity resolution across documents. You could also use this as a first step to analyzing the sentiment towards particular entities. For example, people expressing dissatistfaction at a restaurant or brand.


Follow the below steps and read the comments carefully on the types of tasks your code will need to do.


I would expect that some of you might be able to reuse parts of this code for your project...

# Step 1) 
Import necessary libraries¶

In [1]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/kristinlevine/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [2]:
# This will be the corpus we work from

# in order for this to work you will have to have installed NLTK 
# and also installed the reuters data

# to install NLTK, pip install nltk

# To install the reuters corpus following the instructions here: https://www.nltk.org/data.html
# The easy way to install the Reuters corpus is usally:
# import nltk
# nltk.download('reuters')


# This will import the Reuters corpus, assuming you have it
from nltk.corpus import reuters

In [3]:
# You will want to use Spacy as your entity recognizer
# my suggestion would be to make sure you are using a 2.x version of Spacy
# pip install spacy==2.3.5
import spacy
# note, the model load can be odd. In some instances your model might have the full name or the short name here.
# if you run into issues here, check the spacy model page at https://spacy.io/usage/models
nlp = spacy.load("en")

# alternatively try: 
# spacy.load("en_core_web_sm")

In [4]:
print(reuters.fileids()[0])
#reuters.categories()
#return raw text of reuters corpus
def get_corpus_text():
    return [" ".join(reuters.words(fid)) for fid in reuters.fileids()] 

#Get text for first article so we can see what we are dealing with
a = reuters.fileids()[0]
b = get_corpus_text()[0]

print(b)

test/14826


# Step 2) 
FIll in the following function to extract the entity, document id, and relevant sentence text from the input

In [5]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)

    #Three dictionaries for persons, organizations, and locations found in a document.

    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "" and entity.label_ == "PERSON":
            if entity.text.strip() not in doc_persons.keys():
                relevant_sentence = (doc_id, entity.sent.text)
                doc_persons[entity.text.strip()] = list()
            if entity.text.strip() in doc_persons.keys():
                relevant_sentence = (doc_id, entity.sent.text)
                doc_persons[entity.text.strip()].append(relevant_sentence)  

    for entity in analyzed_doc.ents:
        if entity.text.strip() != "" and entity.label_ == 'ORG':
            relevant_sentence = (doc_id, entity.sent.text)
            doc_organizations[entity.text.strip()] = relevant_sentence
            
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "" and (entity.label_ == 'LOC' or entity.label_ == "GPE"):
            relevant_sentence = (doc_id, entity.sent.text)
            doc_locations[entity.text.strip()] = relevant_sentence
            
            
    return doc_persons, doc_organizations, doc_locations
     

In [6]:
per, org, loc = extract_entities(a, b)

In [7]:
print(per)

{'Reuter': [('test/14826', 'They told Reuter correspondents in Asian capitals a U .')], 'MC': [('test/14826', '" We wouldn \' t be able to do business ," said a spokesman for leading Japanese electronics firm Matsushita Electric Industrial Co Ltd & lt ; MC .')], 'Tom Murtha': [('test/14826', 'S .," said Tom Murtha , a stock analyst at the Tokyo office of broker & lt ; James Capel and Co >.')], 'Paul Sheen': [('test/14826', 'Retaliation ," said Paul Sheen , chairman of textile exporters & lt ; Taiwan Safe Group')], 'Lawrence Mills': [('test/14826', '" That is a very short - term view ," said Lawrence Mills , director - general of the Federation of Hong Kong Industry .')], 'John Button': [('test/14826', 'And Japan with interest and concern , Industry Minister John Button said in Canberra last Friday .')], 'Yasuhiro Nakasone': [('test/14826', "They also call for stepped - up spending as an emergency measure to stimulate the economy despite Prime Minister Yasuhiro Nakasone ' s avowed fisca

In [9]:
#Persons in Document
import pandas as pd
print('Persons in document:', a)
pd.DataFrame.from_dict(per, orient = 'index')

Persons in document: test/14826


Unnamed: 0,0
Reuter,"(test/14826, They told Reuter correspondents i..."
MC,"(test/14826, "" We wouldn ' t be able to do bus..."
Tom Murtha,"(test/14826, S .,"" said Tom Murtha , a stock a..."
Paul Sheen,"(test/14826, Retaliation ,"" said Paul Sheen , ..."
Lawrence Mills,"(test/14826, "" That is a very short - term vie..."
John Button,"(test/14826, And Japan with interest and conce..."
Yasuhiro Nakasone,"(test/14826, They also call for stepped - up s..."
Michael Smith,"(test/14826, Trade Representative Michael Smit..."


In [65]:
#Organizations in document
import pandas as pd
print('Organizations in document:', a)
pd.DataFrame.from_dict(org, orient = 'index', columns = ['File', "Sentence"])

Organizations in document: test/14833


Unnamed: 0,File,Sentence
CPO,test/14833,Indonesian exports of CPO in calendar 1986 wer...
Hasrul Harahap,test/14833,RISING SHARPLY Indonesia expects crude palm oi...
central bank,test/14833,Indonesian exports of CPO in calendar 1986 wer...


In [66]:
#Locations in document
import pandas as pd
print('Locations in document:', a)
pd.DataFrame.from_dict(loc, orient = 'index', columns = ['File', "Sentence"])

Locations in document: test/14833


Unnamed: 0,File,Sentence
Indonesia,test/14833,"Indonesia , the world ' s second largest produ..."
Malaysia,test/14833,"Indonesia , the world ' s second largest produ..."


# Step 3)
Adjust the following code to run the document entity extraction function Also, add the entity records you are constructing to your master list of entities Note: for the full subission run across all the Reuters documents

In [153]:
num_docs = len(reuters.fileids()[0:25])
#  this has a large number of files... 
# you might wish to limit the number of documents you use while developing your technique 
# ex. reuters.fileids()[0:25]

# these two dictionaries will incorporate all the referneces to 
combined_persons = {}
combined_organizations = {}
combined_locations = {}

# this will only iterate over the first 25 documents, for the real submission you will need to run across all documents
for doc_id in reuters.fileids()[0:25]: 
    # this doc_text variable will give you a text version of the news article. This could be tokenized.
    persons, organizations, locations = extract_entities(doc_id, reuters.open(doc_id).read())
    
    for per in persons.keys():
        if per not in combined_persons:
            combined_persons[per] = persons.get(per)
        else:
            combined_persons[per].append(persons.get(per))
            
    for org in organizations.keys():
        if org not in combined_organizations:
            combined_organizations[org] = organizations.get(org)
    
    for loc in locations.keys():
        if loc not in combined_locations:
            combined_locations[loc] = locations.get(loc)
        else:
            combined_locations[loc].append(locations.get(loc))
  

AttributeError: 'tuple' object has no attribute 'append'

In [149]:
#Combined Persons
import pandas as pd
person_df = pd.DataFrame.from_dict(combined_persons, orient = 'index', columns = ['File', "Sentence"])
person_df.sort_index()

Unnamed: 0,File,Sentence
John Button,test/14826,The Australian government is awaiting the outc...
Lawrence Mills,test/14826,"""That is a very short-term view,"" said Lawrenc..."
Michael Smith,test/14826,Deputy U.S. Trade Representative Michael Smith...
Paul Sheen,test/14826,"""We must quickly open our markets, remove trad..."
Reuter,test/14826,They told Reuter correspondents in Asian capit...
Tom\n Murtha,test/14826,"""If the tariffs remain in place for any length..."
Yasuhiro Nakasone's,test/14826,They also call for stepped-up spending as an e...


In [150]:
#Combined Organizations
import pandas as pd
pd.DataFrame.from_dict(combined_organizations, orient = 'index', columns = ['File', "Sentence"])

Unnamed: 0,File,Sentence
Matsushita Electric\n Industrial Co Ltd &lt;MC.T,test/14826,"""We wouldn't be able to do business,"" said a s..."
broker &lt;James\n Capel and Co,test/14826,"""If the tariffs remain in place for any length..."
U.S. Products,test/14826,"""We must quickly open our markets, remove trad..."
&lt;Taiwan Safe Group,test/14826,"""We must quickly open our markets, remove trad..."
U.S. Pressure,test/14826,But other businessmen said such\n a short-ter...
the Federation of Hong Kong Industry,test/14826,"""That is a very short-term view,"" said Lawrenc..."
Button,test/14826,This kind of deterioration in trade relations ...
Liberal Democratic Party,test/14826,Japan's ruling Liberal Democratic Party yester...
Makoto\n Kuroda,test/14826,Deputy U.S. Trade Representative Michael Smith...
International Trade and,test/14826,Deputy U.S. Trade Representative Michael Smith...


In [151]:
#Combined Locations
import pandas as pd
pd.DataFrame.from_dict(combined_locations, orient = 'index', columns = ['File', "Sentence"])

Unnamed: 0,File,Sentence
U.S.,test/14826,Deputy U.S. Trade Representative Michael Smith...
Japan,test/14826,Deputy U.S. Trade Representative Michael Smith...
Asia,test/14826,And Japan has raised fears among many of Asia'...
Tokyo,test/14826,"""If the tariffs remain in place for any length..."
Taiwan,test/14826,The surplus helped swell Taiwan's foreign exch...
South Korea's,test/14826,A senior official of South Korea's trade promo...
South Korea,test/14826,Last year South Korea had a trade surplus of 7...
Malaysia,test/14826,"In Malaysia, trade officers and businessmen sa..."
Hong Kong,test/14826,Much more serious for Hong Kong\n is the disa...
Hong Kong's,test/14826,The U.S. Last year was Hong Kong's biggest exp...


# Step 4)
Fill in the following method to look through the content of an entity dictionary to determine the most popular based on number of mentions

In [None]:
# now that we have the text associated with the entities, 
# you will want to focus on the 500 top entities in each category
# Identify the top 500 entities by the count of their occurrences
def find_most_popular_entities(entity_dictionary):
    # sort through the entities in the dictionary by the number of sentences
    
    return list_of_dictionary_keys_with_most_mentions

# Step 5)
Now invoke your top entity mention finder

In [None]:
# simply get the top persons and locations
top_persons = find_most_popular_entities(combined_persons)
top_locations = find_most_popular_entities(combined_locations)

# Step 6) 

Analyze the most popular entities to determine what words they most frequently occur with

In [None]:
# use these dictionaries to store the most frequent terms associated with the entities
person_most_popular_terms = {}
organization_most_popular_terms = {}
location_most_popular_terms = {}

# finally, now find the most frequent tokens associated with the entities
for person in top_persons:
    # fill this dictionary with all the words in the context of the person entity
    person_token_dictionary = {}

# finally, now find the most frequent tokens associated with the entities
for organization in top_organization:
    # fill this dictionary with all the words in the context of the person entity
    organization_token_dictionary = {}

    
    
for location in top_locations:
    # fill this dictionary with all the words in the context of the location entity
    location_token_dictionary = {}

# Step 7)

Present your results of the most popular entities and their associated terms

In [None]:
# present you results

## Extra Credit
There are several extra credit options for this assignment.
The first would be to determine which persons, organizations, and locations most frequently occur in the same sentences.
Another task would be to attempt to resolve different forms of the same name for each person and location. For example, George Bush and Bush inside the same document.

In [None]:
combined_persons = {}
combined_organizations = {}
combined_locations = {}

# this will only iterate over the first 25 documents, for the real submission you will need to run across all documents
for doc_id in reuters.fileids()[0:25]: 
    # this doc_text variable will give you a text version of the news article. This could be tokenized.
    persons, organizations, locations = extract_entities(doc_id, reuters.open(doc_id).read())

#For Persons
    for per in persons.keys():
        if per not in combined_persons.keys():
            combined_persons[per] = list()
        
        if per in combined_persons.keys():
            combined_persons[per].append(persons.get(per))

In [None]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)

    #Three dictionaries for persons, organizations, and locations found in a document.

    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "" and entity.label_ == "PERSON":
            
            if entity.text.strip() not in doc_persons.keys():
                relevant_sentence = (doc_id, entity.sent.text)
                doc_persons[entity.text.strip()] = list()
                
            if entity.text.strip() in doc_persons.keys():
                relevant_sentence = (doc_id, entity.sent.text)
                doc_persons[entity.text.strip()].append(relevant_sentence)  