# Homework 2

In this homework you will be performing some analysis with entity extraction. In particular, you will be looking at the Reuters corpus and trying to construct entity profiles of persons, organizations, and locations. This will require you to iterate through the documents in the Reuters corpus, parse them appropriately, extract entities, and then store the entities along with some surrounding text. Additionally, you will be looking for mechanisms to identify potential relationships between persons and locations.


Throughout this assignment you will use NLTK to access the corpus and spaCy as your entity extraction system for named entity recognition.

The basic idea is to build a knowledge base around the entities you extract from the Reuters corpus. Normally, this would be a first step toward modeling such things as entity resolution across documents. You could also use this as a first step toward analyzing the sentiment expressed about particular entities, for example, people expressing dissatisfaction with a restaurant or brand.


Follow the steps below and read the comments carefully to see what tasks your code will need to perform.


I would expect that some of you might be able to reuse parts of this code for your project...

# Step 1) 
Import necessary libraries

In [1]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/kristinlevine/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [2]:
# This will be the corpus we work from

# in order for this to work you will have to have installed NLTK 
# and also installed the reuters data

# to install NLTK, pip install nltk

# To install the Reuters corpus, follow the instructions here: https://www.nltk.org/data.html
# The easy way to install the Reuters corpus is usually:
# import nltk
# nltk.download('reuters')


# This will import the Reuters corpus, assuming you have it
from nltk.corpus import reuters

In [3]:
# You will want to use spaCy as your entity recognizer
# my suggestion would be to make sure you are using a 2.x version of spaCy
# pip install spacy==2.3.5
import spacy
# Note: the model name can vary by installation. The shortcut name "en" only works in spaCy 2.x
# when a shortcut link has been created; otherwise use the full package name shown below.
# if you run into issues here, check the spaCy model page at https://spacy.io/usage/models
nlp = spacy.load("en")

# alternatively try: 
# nlp = spacy.load("en_core_web_sm")

In [4]:
print(reuters.fileids()[0])
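#reuters.categories()
#Return the text of every Reuters document, rebuilt by joining its tokens with spaces
def get_corpus_text():
    return [" ".join(reuters.words(fid)) for fid in reuters.fileids()] 

#Get text for first article so we can see what we are dealing with
a = reuters.fileids()[0]
#Join only this document's tokens; get_corpus_text()[0] would process the whole corpus first
b = " ".join(reuters.words(a))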

#print(b)

test/14826


# Step 2) 
Fill in the following function to extract the entity, document id, and relevant sentence text from the input.
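
The shape of the records this step should produce can be sketched without spaCy. The snippet below is only an illustration: `toy_entities` is hand-written, hypothetical stand-in data for what an entity recognizer might return (entity text, label, containing sentence), and `group_by_label` is a hypothetical helper name, not part of NLTK or spaCy.

```python
from collections import defaultdict

# Hypothetical stand-in for recognizer output: (entity text, label, sentence).
# In the real notebook these values come from spaCy's doc.ents.
toy_entities = [
    ("Tom Murtha", "PERSON", "Tom Murtha is a stock analyst in Tokyo."),
    ("Tokyo", "GPE", "Tom Murtha is a stock analyst in Tokyo."),
    ("Tokyo", "GPE", "Exporters in Tokyo were worried."),
    ("James Capel and Co", "ORG", "He works for James Capel and Co."),
]

def group_by_label(doc_id, entities, wanted_labels):
    """Map entity text -> list of (doc_id, sentence) for the wanted labels."""
    grouped = defaultdict(list)
    for text, label, sentence in entities:
        name = text.strip()
        if name and label in wanted_labels:
            grouped[name].append((doc_id, sentence))
    return dict(grouped)

locations = group_by_label("toy/1", toy_entities, {"LOC", "GPE"})
persons = group_by_label("toy/1", toy_entities, {"PERSON"})
print(locations)
print(persons)
```

Each key is an entity string and each value is the list of sentences that mention it, which is the same shape the function in the next cell returns for one document.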

In [5]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)

    #Three dictionaries for persons, organizations, and locations found in a document.
    #Each maps the entity text to a list of (doc_id, sentence text) tuples.

    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}

    #Map each spaCy label of interest to the dictionary that collects it
    targets = {
        "PERSON": doc_persons,
        "ORG": doc_organizations,
        "LOC": doc_locations,
        "GPE": doc_locations,
    }

    for entity in analyzed_doc.ents:
        name = entity.text.strip()
        if name != "" and entity.label_ in targets:
            relevant_sentence = (doc_id, entity.sent.text)
            #Create the list on first sight of the entity, then append the mention
            targets[entity.label_].setdefault(name, []).append(relevant_sentence)

    return doc_persons, doc_organizations, doc_locations
     

In [6]:
per, org, loc = extract_entities(a, b)

In [7]:
print(per)

{'Reuter': [('test/14826', 'They told Reuter correspondents in Asian capitals a U .')], 'MC': [('test/14826', '" We wouldn \' t be able to do business ," said a spokesman for leading Japanese electronics firm Matsushita Electric Industrial Co Ltd & lt ; MC .')], 'Tom Murtha': [('test/14826', 'S .," said Tom Murtha , a stock analyst at the Tokyo office of broker & lt ; James Capel and Co >.')], 'Paul Sheen': [('test/14826', 'Retaliation ," said Paul Sheen , chairman of textile exporters & lt ; Taiwan Safe Group')], 'Lawrence Mills': [('test/14826', '" That is a very short - term view ," said Lawrence Mills , director - general of the Federation of Hong Kong Industry .')], 'John Button': [('test/14826', 'And Japan with interest and concern , Industry Minister John Button said in Canberra last Friday .')], 'Yasuhiro Nakasone': [('test/14826', "They also call for stepped - up spending as an emergency measure to stimulate the economy despite Prime Minister Yasuhiro Nakasone ' s avowed fisca

In [8]:
#Testing one document
#Persons in Document
import pandas as pd
print('Persons in document:', a)
pd.DataFrame.from_dict(per, orient = 'index')

Persons in document: test/14826


Entity,0
Reuter,"(test/14826, They told Reuter correspondents i..."
MC,"(test/14826, "" We wouldn ' t be able to do bus..."
Tom Murtha,"(test/14826, S .,"" said Tom Murtha , a stock a..."
Paul Sheen,"(test/14826, Retaliation ,"" said Paul Sheen , ..."
Lawrence Mills,"(test/14826, "" That is a very short - term vie..."
John Button,"(test/14826, And Japan with interest and conce..."
Yasuhiro Nakasone,"(test/14826, They also call for stepped - up s..."
Michael Smith,"(test/14826, Trade Representative Michael Smit..."


In [9]:
#Testing one document
#Organizations in document
import pandas as pd
print('Organizations in document:', a)
pd.DataFrame.from_dict(org, orient = 'index')

Organizations in document: test/14826


Entity,0
Matsushita Electric Industrial Co Ltd & lt,"(test/14826, "" We wouldn ' t be able to do bus..."
broker & lt,"(test/14826, S .,"" said Tom Murtha , a stock a..."
James Capel and Co,"(test/14826, S .,"" said Tom Murtha , a stock a..."
Taiwan Safe Group,"(test/14826, Retaliation ,"" said Paul Sheen , ..."
the Federation of Hong Kong Industry,"(test/14826, "" That is a very short - term vie..."
Button,"(test/14826, "" This kind of deterioration in t..."
Liberal Democratic Party,"(test/14826, Japan ' s ruling Liberal Democrat..."
Trade,"(test/14826, Trade Representative Michael Smit..."
Makoto Kuroda,"(test/14826, Trade Representative Michael Smit..."
International Trade and Industry,"(test/14826, Trade Representative Michael Smit..."


In [10]:
#Testing one document
#Locations in document
import pandas as pd
print('Locations in document:', a)
pd.DataFrame.from_dict(loc, orient = 'index')

Locations in document: test/14826


Entity,0,1,2,3,4,5,6,7,8,9,10,11
JAPAN,"(test/14826, JAPAN RIFT Mounting trade frictio...",,,,,,,,,,,
Japan,"(test/14826, And Japan has raised fears among ...","(test/14826, Move against Japan might boost pr...","(test/14826, Has said it will impose 300 mln d...","(test/14826, Threat against Japan because it s...","(test/14826, And Japan might also lead to pres...","(test/14826, And Japan might also lead to pres...","(test/14826, In Malaysia , trade officers and ...","(test/14826, In Hong Kong , where newspapers h...","(test/14826, And Japan with interest and conce...","(test/14826, He said Australia ' s concerns ce...","(test/14826, Japan ' s ruling Liberal Democrat...","(test/14826, Trade Representative Michael Smit..."
Asia,"(test/14826, And Japan has raised fears among ...",,,,,,,,,,,
Tokyo,"(test/14826, But some exporters said that whil...","(test/14826, S .,"" said Tom Murtha , a stock a...",,,,,,,,,,
Taiwan,"(test/14826, In Taiwan , businessmen and offic...","(test/14826, Taiwan had a trade trade surplus ...","(test/14826, The surplus helped swell Taiwan '...",,,,,,,,,
South Korea ',"(test/14826, A senior official of South Korea ...",,,,,,,,,,,
South Korea,"(test/14826, And Japan might also lead to pres...","(test/14826, Last year South Korea had a trade...",,,,,,,,,,
Malaysia,"(test/14826, In Malaysia , trade officers and ...",,,,,,,,,,,
Hong Kong,"(test/14826, In Hong Kong , where newspapers h...","(test/14826, Much more serious for Hong Kong i...",,,,,,,,,,
Hong Kong ',"(test/14826, Last year was Hong Kong ' s bigge...",,,,,,,,,,,


# Step 3)
Adjust the following code to run the document entity extraction function. Also, add the entity records you are constructing to your master list of entities. Note: for the full submission, run across all the Reuters documents.

In [11]:
#num_docs = len(reuters.fileids()[0:50])
num_docs = len(reuters.fileids())
# This is a large number of files.
# You might wish to limit the number of documents you use while developing your technique,
# e.g. reuters.fileids()[0:25]

# these three dictionaries will collect all the references to each entity across documents
combined_persons = {}
combined_organizations = {}
combined_locations = {}

# To iterate over only the first 50 documents while developing, use the commented-out loop header.
# The real submission needs to run across all documents.
#for doc_id in reuters.fileids()[0:50]: 
for doc_id in reuters.fileids(): 
    # reuters.open(doc_id).read() gives the raw text of the news article. This could be tokenized.
    persons, organizations, locations = extract_entities(doc_id, reuters.open(doc_id).read())

#For Persons
    for per in persons.keys():
        if per not in combined_persons.keys():
            combined_persons[per] = list()
        
        if per in combined_persons.keys():
            combined_persons[per].append(persons.get(per))

#For Organizations
    for org in organizations.keys():
        if org not in combined_organizations.keys():
            combined_organizations[org] = list()
        
        if org in combined_organizations.keys():
            combined_organizations[org].append(organizations.get(org))  
            
#For Locations
    for loc in locations.keys():
        if loc not in combined_locations.keys():
            combined_locations[loc] = list()
        
        if loc in combined_locations.keys():
            combined_locations[loc].append(locations.get(loc))
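
The loop above appends each document's whole mention list as a single element, so every value in the combined dictionaries is a list of per-document lists. A minimal sketch of that shape on hand-made data follows; the document ids and sentences are invented for illustration, and `merge_document` is a hypothetical helper rather than part of the assignment.

```python
# Hypothetical helper mirroring the merge loop: one inner list per document.
def merge_document(combined, doc_entities):
    for name, mentions in doc_entities.items():
        combined.setdefault(name, []).append(mentions)
    return combined

# Invented per-document results in the same shape extract_entities returns
doc_a = {"Japan": [("toy/1", "Japan raised tariffs."), ("toy/1", "Japan responded.")]}
doc_b = {
    "Japan": [("toy/2", "Exports to Japan fell.")],
    "Taiwan": [("toy/2", "Taiwan had a surplus.")],
}

combined = {}
for doc_entities in (doc_a, doc_b):
    merge_document(combined, doc_entities)

# Number of documents mentioning "Japan", then total mentions across them
print(len(combined["Japan"]))
print(sum(len(per_doc) for per_doc in combined["Japan"]))
```

Here "Japan" appears in 2 documents with 3 mentions in total. This nesting is also why the DataFrames built from the combined dictionaries have one column per document rather than one per mention.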

In [12]:
#Combined Persons
import pandas as pd
person_df = pd.DataFrame.from_dict(combined_persons, orient = 'index')
person_df.sort_index()

Entity,0,1,2,3,4,5,6,7,8,9,...,197,198,199,200,201,202,203,204,205,206
1.9 pct,"[(training/1478, Ruiz Ligero attributed this d...",,,,,,,,,,...,,,,,,,,,,
1985/86 Thyssen's,"[(training/10342, In 1985/86 Thyssen's world g...",,,,,,,,,,...,,,,,,,,,,
1ST QTR CHARGE,"[(training/10231, GREAT AMERICAN CORP SEES 1ST...",,,,,,,,,,...,,,,,,,,,,
7-dlr,"[(test/17839, Laurence said that, since Entert...",,,,,,,,,,...,,,,,,,,,,
8-dlr,"[(test/17839, The 8-dlr-a-share offer by Enter...",,,,,,,,,,...,,,,,,,,,,
A.,"[(training/6463, In a letter to Gencorp chairm...",,,,,,,,,,...,,,,,,,,,,
A. Dale Mayo,"[(test/19839, A. Dale Mayo, Clearview's presid...",,,,,,,,,,...,,,,,,,,,,
A. Gordon,"[(test/18071, IN SEVERAL DAYS\n U.S. District...",,,,,,,,,,...,,,,,,,,,,
A. Hormel,"[(training/8768, A. Hormel and Co said its\n ...",,,,,,,,,,...,,,,,,,,,,
A. Malachi Mixon III,"[(training/3899, SALES\n Invacare Corp chairm...",,,,,,,,,,...,,,,,,,,,,


In [13]:
#Combined Organizations
import pandas as pd
org_df = pd.DataFrame.from_dict(combined_organizations, orient = 'index')
org_df.sort_index()

Entity,0,1,2,3,4,5,6,7,8,9,...,430,431,432,433,434,435,436,437,438,439
&,"[(training/7354, have\n acquired &)]",,,,,,,,,,...,,,,,,,,,,
&lt;A.H.A. AUTOMOTIVE TECHNOLOGIES CORP,"[(training/9995, &lt;A.H.A. AUTOMOTIVE TECHNOL...",,,,,,,,,,...,,,,,,,,,,
&lt;ACKERLY COMMUNICATIONS INC,"[(training/9287, &lt;ACKERLY COMMUNICATIONS IN...",,,,,,,,,,...,,,,,,,,,,
&lt;ACKLANDS LTD,"[(training/9029, &lt;ACKLANDS LTD> 1ST)]",,,,,,,,,,...,,,,,,,,,,
&lt;AGF,"[(test/18953, &lt;AGF)]",,,,,,,,,,...,,,,,,,,,,
&lt;AGRA INDUSTRIES LTD,"[(training/5128, &lt;AGRA INDUSTRIES LTD)]",,,,,,,,,,...,,,,,,,,,,
&lt;AIN LEASING CORP,"[(training/11019, &lt;AIN LEASING CORP>)]",,,,,,,,,,...,,,,,,,,,,
&lt;ALFA.O,"[(test/16792, It said the name change\n shoul...",,,,,,,,,,...,,,,,,,,,,
&lt;ALTEX RESOURCES LTD,"[(training/6696, &lt;ALTEX RESOURCES LTD> YEAR...",,,,,,,,,,...,,,,,,,,,,
&lt;AMERICAN EAGLE PETROLEUMS LTD,"[(test/16045, &lt;AMERICAN EAGLE PETROLEUMS LT...",,,,,,,,,,...,,,,,,,,,,


In [14]:
#Combined Locations
import pandas as pd
loc_df = pd.DataFrame.from_dict(combined_locations, orient = 'index')
loc_df.sort_index()

Entity,0,1,2,3,4,5,6,7,8,9,...,1736,1737,1738,1739,1740,1741,1742,1743,1744,1745
- Britain,"[(training/13270, In Paris on February 22, six...",,,,,,,,,,...,,,,,,,,,,
"315,695-dwt Arabian Sea","[(training/4171, The 315,695-dwt Arabian Sea h...",,,,,,,,,,...,,,,,,,,,,
3M's,"[(test/18080, The business, which has 145 empl...",,,,,,,,,,...,,,,,,,,,,
428.30/430.44,"[(training/12877, It set a lira/D-mark rate of...",,,,,,,,,,...,,,,,,,,,,
468p,"[(test/14872, EXPECTATIONS\n Bowater Industri...",,,,,,,,,,...,,,,,,,,,,
697p,"[(training/6434, Rank shares firmed in morning...",,,,,,,,,,...,,,,,,,,,,
A.M.,"[(test/21443, Regional shipping sources earlie...",,,,,,,,,,...,,,,,,,,,,
ADJUSTS REVENUES,"[(training/5114, ADJUSTS REVENUES\n )]",,,,,,,,,,...,,,,,,,,,,
AEGEAN,"[(training/10588, U.S. URGES RESTRAINT IN AEGE...",,,,,,,,,,...,,,,,,,,,,
ALGERIA,"[(training/2838, IRANIAN OIL MINISTER ARRIVES ...",,,,,,,,,,...,,,,,,,,,,


# Step 4)
Fill in the following method to look through the content of an entity dictionary to determine the most popular based on number of mentions

In [15]:
# now that we have the text associated with the entities, 
# you will want to focus on the 500 top entities in each category
# Identify the top 500 entities by the count of their occurrences
def find_most_popular_entities(entity_dictionary):
    
    # entity -> total number of mentions (sentences) across all documents
    mention_counts = {}
    
    for entity in entity_dictionary:
        # each element of entity_dictionary[entity] is one document's list of mentions
        mention_counts[entity] = sum(len(doc_mentions) for doc_mentions in entity_dictionary[entity])
        
    # the counts are returned unsorted; the calling cells sort them and keep the top entries
    
    return mention_counts
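
The function above returns unsorted counts, and the sorting happens later in pandas. As an alternative sketch, the ranking can also be done with the standard library alone. `top_entities` and the sample data are hypothetical, and the input is assumed to have the nested per-document shape built in Step 3.

```python
from collections import Counter

def top_entities(entity_dictionary, n=500):
    """Return the n most-mentioned entities as (name, count) pairs."""
    counts = Counter()
    for name, per_document in entity_dictionary.items():
        # Each element of per_document is one document's list of mentions
        counts[name] = sum(len(mentions) for mentions in per_document)
    return counts.most_common(n)

# Invented data: entity -> list of per-document mention lists
sample = {
    "Reagan": [[("d1", "s1"), ("d1", "s2")], [("d2", "s3")]],
    "Baker": [[("d1", "s4")]],
    "Lawson": [[("d3", "s5")], [("d4", "s6")]],
}
ranking = top_entities(sample, n=2)
print(ranking)
```

With this sample data the two top entries are `Reagan` (3 mentions) and `Lawson` (2 mentions).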

# Step 5)
Now invoke your top entity mention finder

In [16]:
# simply get the top persons, organizations, and locations
# print the top 500

In [17]:
#Persons
top_persons = find_most_popular_entities(combined_persons)
df_top_persons = pd.DataFrame.from_dict(top_persons, orient = 'index')
df_top_persons = df_top_persons.rename(columns = {0:'Count'})
df_top_persons = df_top_persons.sort_values(by=['Count'], ascending = False)
print(df_top_persons.head(500))

                        Count
Reagan                    371
Baker                     214
Lawson                    116
Yeutter                   110
James Baker                94
Stoltenberg                65
Baldrige                   53
Johnson                    47
Volcker                    46
Brown                      43
Qtr                        43
Clayton Yeutter            41
Herrington                 38
Kiichi Miyazawa            38
Yasuhiro Nakasone          34
Rotterdam                  34
dlrs                       33
mln dlrs                   30
Bass                       29
Petrobras                  29
Williams                   29
Richard Lyng               27
Satoshi Sumita             27
Cyclops                    25
Nigel Lawson               25
Subroto                    25
Monier                     24
Caspar Weinberger          24
Karl Otto Poehl            23
Clayton                    23
...                       ...
Abe                         4
Charles Sc

In [18]:
#Organizations
top_organizations = find_most_popular_entities(combined_organizations)
df_top_organizations = pd.DataFrame.from_dict(top_organizations, orient = 'index')
df_top_organizations = df_top_organizations.rename(columns = {0:'Count'})
df_top_organizations = df_top_organizations.sort_values(by=['Count'], ascending = False)
print(df_top_organizations.head(500))

                                            Count
EC                                            827
Reuters                                       497
OPEC                                          463
USDA                                          438
Fed                                           385
Bundesbank                                    316
CTS                                           278
NET                                           277
Treasury                                      249
FED                                           231
pct                                           227
QTR                                           219
GATT                                          213
Congress                                      212
Oper                                          208
Bank                                          175
the Securities and Exchange Commission        160
ICO                                           149
USAir                                         147


In [19]:
#Location as DataFrame
top_locations = find_most_popular_entities(combined_locations)
df_top_locations = pd.DataFrame.from_dict(top_locations, orient = 'index')
df_top_locations = df_top_locations.rename(columns = {0:'Count'})
df_top_locations = df_top_locations.sort_values(by=['Count'], ascending = False)
df_top_locations = df_top_locations.head(500)
print(df_top_locations)

                       Count
U.S.                    4250
Japan                   1308
Brazil                   448
the United States        411
U.K.                     391
Canada                   366
Paris                    340
Washington               331
China                    326
London                   280
West Germany             274
Taiwan                   269
New York                 268
Iran                     254
Gulf                     246
Britain                  241
Tokyo                    225
JAPAN                    214
France                   208
Indonesia                189
Australia                175
Ecuador                  169
Europe                   162
Texas                    146
the Soviet Union         140
Italy                    133
Saudi Arabia             133
Kuwait                   127
South Korea              119
India                    114
...                      ...
the Ivory Coast            4
Iceland                    4
PARIS         

In [20]:
#Top 10 Persons as dictionary
df_top_persons = df_top_persons.head(10)
top_persons = df_top_persons['Count']
top_persons = top_persons.to_dict()
print(top_persons)

{'Reagan': 371, 'Baker': 214, 'Lawson': 116, 'Yeutter': 110, 'James Baker': 94, 'Stoltenberg': 65, 'Baldrige': 53, 'Johnson': 47, 'Volcker': 46, 'Brown': 43}


In [21]:
#Top 10 Organizations as dictionary
df_top_organizations = df_top_organizations.head(10)
top_organizations = df_top_organizations['Count']
top_organizations = top_organizations.to_dict()
print(top_organizations)

{'EC': 827, 'Reuters': 497, 'OPEC': 463, 'USDA': 438, 'Fed': 385, 'Bundesbank': 316, 'CTS': 278, 'NET': 277, 'Treasury': 249, 'FED': 231}


In [22]:
#Top 10 Locations as dictionary
df_top_locations = df_top_locations.head(10)
top_locations = df_top_locations['Count']
top_locations = top_locations.to_dict()
print(top_locations)

{'U.S.': 4250, 'Japan': 1308, 'Brazil': 448, 'the United States': 411, 'U.K.': 391, 'Canada': 366, 'Paris': 340, 'Washington': 331, 'China': 326, 'London': 280}


# Step 6) 

Analyze the most popular entities to determine what words they most frequently occur with

NOTE: For this section, I decided to focus on the top 10 persons, organizations, and locations and the top 10 words most frequently associated with them.  It would be easy to scale up the code if you want the top 20 or top 100; I just thought it was more legible if we stuck to 10.
I'm using the "top 10" dictionaries I created above.  To do larger numbers, I'd just create top 20 or top 100 dictionaries there.

In [23]:
#This function gets the sentences associated with the most popular entities:

def most_popular_terms(combined, top):
    sentences_by_entity = {}
    for entity in combined:
        if entity in top:
            for doc_mentions in combined[entity]:
                #doc_mentions is one document's list of (doc_id, sentence) tuples;
                #keep the sentence text of the first mention in each document
                sentence = doc_mentions[0][1]
                #append the sentence itself so the list never ends up containing itself
                sentences_by_entity.setdefault(entity, []).append(sentence)
    return sentences_by_entity

In [24]:
#Persons
person_most_popular_terms = most_popular_terms(combined_persons, top_persons)
print(person_most_popular_terms.keys())
#print(person_most_popular_terms)

dict_keys(['Stoltenberg', 'James Baker', 'Baker', 'Reagan', 'Volcker', 'Brown', 'Lawson', 'Baldrige', 'Yeutter', 'Johnson'])


In [25]:
#Organizations
organization_most_popular_terms = most_popular_terms(combined_organizations, top_organizations)
print(organization_most_popular_terms.keys())
#print(organization_most_popular_terms)

dict_keys(['Reuters', 'Fed', 'Bundesbank', 'EC', 'OPEC', 'NET', 'CTS', 'Treasury', 'FED', 'USDA'])


In [26]:
#Location
location_most_popular_terms = most_popular_terms(combined_locations, top_locations)
print(location_most_popular_terms.keys())
#print(location_most_popular_terms)

dict_keys(['U.S.', 'Japan', 'Washington', 'China', 'Canada', 'Brazil', 'U.K.', 'the United States', 'London', 'Paris'])


In [27]:
from collections import Counter

from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

In [28]:
#This function returns the top 10 associated terms for a single entity

def token_dictionary(most_popular_terms, item):
    doc = nlp(str(most_popular_terms[item]))
    #Tokenize the words, dropping stop words, punctuation, and non-alphabetic tokens
    words = [token.text for token in doc if not token.is_stop and not token.is_punct and token.is_alpha]
    #stem the words
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    #frequency of the words, most frequent first
    word_freq = Counter(stemmed_words)
    common_words = word_freq.most_common()
    #return a one-entry dictionary: {item: [(stem, count), ...]}
    return {item: common_words[0:10]}

Note: If I wanted more than the 10 most frequently associated words, I could just change the common_words[0:10] slice to whatever range I wanted. Below, I'm using the function I created to combine them all into one dictionary for each type of entity.
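
To make the cutoff adjustable rather than hard-coded, the count can be passed as a parameter. The sketch below shows the idea with the standard library only: it swaps spaCy tokenization and Porter stemming for a plain regular expression and a tiny hand-picked stop-word set, so its output will not match the notebook's. `top_terms`, `STOP_WORDS`, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; spaCy's real list is much larger
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "was", "on"}

def top_terms(sentences, n=10):
    """Return the n most frequent lowercase alphabetic tokens across sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

sentences = [
    "Baker said the dollar was stable.",
    "Treasury Secretary Baker met finance ministers in Paris.",
    "The dollar fell after Baker spoke.",
]
print(top_terms(sentences, n=2))
```

Calling `top_terms(sentences, n=20)` or `n=100` would then be the only change needed to widen the list.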

In [29]:
#Top 10 persons and words they frequently occur with:

per_terms = {}
for i in person_most_popular_terms.keys():
    per_terms[i] = token_dictionary(person_most_popular_terms, i)[i]

print(per_terms)

{'Stoltenberg': [('stoltenberg', 22), ('said', 9), ('meet', 7), ('baker', 5), ('west', 5), ('currenc', 5), ('pari', 4), ('stabil', 4), ('foreign', 4), ('financ', 4)], 'James Baker': [('jame', 95), ('secretari', 89), ('treasuri', 86), ('baker', 76), ('said', 33), ('rate', 22), ('financ', 19), ('exchang', 17), ('dollar', 16), ('nation', 15)], 'Baker': [('baker', 71), ('said', 37), ('currenc', 13), ('meet', 12), ('hugh', 11), ('rate', 9), ('agreement', 9), ('dollar', 9), ('monetari', 8), ('polici', 8)], 'Reagan': [('reagan', 183), ('presid', 96), ('administr', 85), ('said', 66), ('trade', 56), ('offici', 31), ('japan', 28), ('congress', 20), ('retali', 20), ('propos', 19)], 'Volcker': [('volcker', 15), ('said', 7), ('bank', 6), ('dollar', 5), ('senat', 5), ('reserv', 5), ('chairman', 5), ('fed', 4), ('told', 4), ('fall', 3)], 'Brown': [('brown', 29), ('inc', 25), ('wagner', 19), ('afg', 18), ('industri', 16), ('share', 13), ('said', 11), ('gener', 11), ('gencorp', 11), ('dlr', 10)], 'Laws

In [30]:
#Top 10 organizations and words they frequently occur with:

org_terms = {}
for i in organization_most_popular_terms.keys():
    org_terms[i] = token_dictionary(organization_most_popular_terms, i)[i]

print(org_terms)

{'Reuters': [('told', 355), ('reuter', 274), ('offici', 86), ('spokesman', 51), ('bank', 46), ('year', 45), ('pct', 41), ('said', 41), ('mln', 38), ('dlr', 38)], 'Fed': [('fed', 134), ('said', 72), ('reserv', 49), ('dlr', 45), ('billion', 35), ('feder', 31), ('market', 27), ('repurchas', 25), ('custom', 20), ('mln', 20)], 'Bundesbank': [('bundesbank', 100), ('bank', 35), ('rate', 34), ('mark', 31), ('central', 30), ('west', 25), ('german', 24), ('presid', 23), ('said', 23), ('billion', 21)], 'EC': [('EC', 259), ('european', 76), ('commun', 70), ('tonn', 55), ('export', 42), ('sugar', 39), ('trade', 38), ('said', 35), ('oil', 27), ('minist', 24)], 'OPEC': [('opec', 119), ('oil', 73), ('price', 39), ('said', 38), ('output', 31), ('barrel', 25), ('dlr', 23), ('product', 23), ('member', 21), ('bpd', 21)], 'NET': [('qtr', 200), ('feb', 100), ('vs', 85), ('jan', 83), ('shr', 82), ('ct', 77), ('net', 11), ('dlr', 11), ('march', 9), ('loss', 9)], 'CTS': [('ct', 68), ('VS', 63), ('shr', 54), ('

In [31]:
#Top 10 locations and words they frequently occur with:

loc_terms = {}
for i in location_most_popular_terms.keys():
    loc_terms[i] = token_dictionary(location_most_popular_terms, i)[i]

print(loc_terms)

{'U.S.': [('said', 369), ('dlr', 232), ('trade', 208), ('mln', 193), ('oil', 153), ('market', 145), ('export', 141), ('pct', 140), ('japan', 124), ('dollar', 120)], 'Japan': [('japan', 497), ('trade', 165), ('said', 136), ('pct', 81), ('germani', 73), ('west', 73), ('unit', 70), ('market', 62), ('state', 62), ('offici', 60)], 'Washington': [('washington', 201), ('said', 86), ('trade', 55), ('offici', 36), ('minist', 34), ('meet', 34), ('japan', 25), ('japanes', 23), ('tariff', 23), ('talk', 20)], 'China': [('china', 151), ('tonn', 41), ('wheat', 33), ('year', 29), ('mln', 29), ('said', 26), ('export', 20), ('new', 18), ('import', 17), ('offici', 16)], 'Canada': [('canada', 239), ('pct', 62), ('japan', 61), ('said', 53), ('britain', 46), ('franc', 45), ('west', 37), ('bank', 35), ('germani', 32), ('mln', 32)], 'Brazil': [('brazil', 169), ('said', 51), ('dlr', 38), ('mln', 37), ('coffe', 34), ('export', 31), ('produc', 25), ('countri', 20), ('year', 19), ('loan', 19)], 'U.K.': [('market'

# Step 7)

Present your results of the most popular entities and their associated terms

In [32]:
#Ten most popular person entities with the top 10 words they most frequently occur with:
df = pd.DataFrame(per_terms)
print(df)

         Stoltenberg      James Baker           Baker           Reagan  \
0  (stoltenberg, 22)       (jame, 95)     (baker, 71)    (reagan, 183)   
1          (said, 9)  (secretari, 89)      (said, 37)     (presid, 96)   
2          (meet, 7)   (treasuri, 86)   (currenc, 13)  (administr, 85)   
3         (baker, 5)      (baker, 76)      (meet, 12)       (said, 66)   
4          (west, 5)       (said, 33)      (hugh, 11)      (trade, 56)   
5       (currenc, 5)       (rate, 22)       (rate, 9)     (offici, 31)   
6          (pari, 4)     (financ, 19)  (agreement, 9)      (japan, 28)   
7        (stabil, 4)    (exchang, 17)     (dollar, 9)   (congress, 20)   
8       (foreign, 4)     (dollar, 16)   (monetari, 8)     (retali, 20)   
9        (financ, 4)     (nation, 15)     (polici, 8)     (propos, 19)   

         Volcker           Brown         Lawson            Baldrige  \
0  (volcker, 15)     (brown, 29)   (lawson, 25)       (baldrig, 18)   
1      (said, 7)       (inc, 25)     (said,

In [33]:
#Ten most popular organization entities with the top 10 words they most frequently occur with:
df = pd.DataFrame(org_terms)
print(df)

           Reuters              Fed         Bundesbank              EC  \
0      (told, 355)       (fed, 134)  (bundesbank, 100)       (EC, 259)   
1    (reuter, 274)       (said, 72)         (bank, 35)  (european, 76)   
2     (offici, 86)     (reserv, 49)         (rate, 34)    (commun, 70)   
3  (spokesman, 51)        (dlr, 45)         (mark, 31)      (tonn, 55)   
4       (bank, 46)    (billion, 35)      (central, 30)    (export, 42)   
5       (year, 45)      (feder, 31)         (west, 25)     (sugar, 39)   
6        (pct, 41)     (market, 27)       (german, 24)     (trade, 38)   
7       (said, 41)  (repurchas, 25)       (presid, 23)      (said, 35)   
8        (mln, 38)     (custom, 20)         (said, 23)       (oil, 27)   
9        (dlr, 38)        (mln, 20)      (billion, 21)    (minist, 24)   

            OPEC         NET           CTS         Treasury            FED  \
0    (opec, 119)  (qtr, 200)      (ct, 68)  (treasuri, 178)     (fed, 164)   
1      (oil, 73)  (feb, 100) 

In [34]:
#Ten most popular location entities with the top 10 words they most frequently occur with:
df = pd.DataFrame(loc_terms)
print(df)

            U.S.          Japan         Washington         China  \
0    (said, 369)   (japan, 497)  (washington, 201)  (china, 151)   
1     (dlr, 232)   (trade, 165)         (said, 86)    (tonn, 41)   
2   (trade, 208)    (said, 136)        (trade, 55)   (wheat, 33)   
3     (mln, 193)      (pct, 81)       (offici, 36)    (year, 29)   
4     (oil, 153)  (germani, 73)       (minist, 34)     (mln, 29)   
5  (market, 145)     (west, 73)         (meet, 34)    (said, 26)   
6  (export, 141)     (unit, 70)        (japan, 25)  (export, 20)   
7     (pct, 140)   (market, 62)      (japanes, 23)     (new, 18)   
8   (japan, 124)    (state, 62)       (tariff, 23)  (import, 17)   
9  (dollar, 120)   (offici, 60)         (talk, 20)  (offici, 16)   

          Canada         Brazil            U.K. the United States  \
0  (canada, 239)  (brazil, 169)   (market, 114)       (unit, 305)   
1      (pct, 62)     (said, 51)    (money, 103)      (state, 249)   
2    (japan, 61)      (dlr, 38)       (mln, 

Note: You may want to consider starting with the 2nd most popular term, since for "Baker" for example, the 1st most popular term is "baker."  But this is not always true, so I did not eliminate this first term for this exercise. 
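
One way to handle this without blindly dropping the first term is to filter out only those stems that come from the entity's own name. A rough sketch, assuming the `(stem, count)` lists produced above; `drop_own_name` is a hypothetical helper, and the prefix comparison is a crude stand-in for stemming the entity name properly.

```python
def drop_own_name(entity, term_counts):
    """Remove terms that look like stems of the entity's own name."""
    name_words = [w.lower() for w in entity.split()]
    kept = []
    for term, count in term_counts:
        t = term.lower()
        # A stem such as 'jame' is a prefix of the name word 'james'
        is_own = any(w.startswith(t) for w in name_words)
        if not is_own:
            kept.append((term, count))
    return kept

# First five values copied from the 'James Baker' results shown above
james_baker = [("jame", 95), ("secretari", 89), ("treasuri", 86), ("baker", 76), ("said", 33)]
print(drop_own_name("James Baker", james_baker))
```

For "James Baker" this drops `jame` and `baker` and keeps `secretari`, `treasuri`, and `said`. Because the check is a simple prefix match, very short stems could be removed by accident, so it should be treated as a starting point only.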

This homework really stressed to me again the power of dictionaries and the importance of really understanding how they work. Every time I got stuck, I tried to step back and consider the task in a very simple manner. Sometimes I created my own "scratch" dictionaries with very few examples.  This really helped me to figure out the syntax and clarify in my own mind what I was trying to do.  I learned a lot!

## Extra Credit
There are several extra credit options for this assignment.
The first would be to determine which persons, organizations, and locations most frequently occur in the same sentences.
Another task would be to attempt to resolve different forms of the same name for each person and location. For example, George Bush and Bush inside the same document.
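
Both extra-credit ideas can be prototyped on the structures already built. The sketch below is one possible starting point rather than a reference solution: `cooccurring_pairs` counts person/location pairs that share a sentence, and `resolve_short_names` maps a single-word name to a longer name in the same document that ends with it. Both helper names and the sample data are hypothetical, and the last-word heuristic would wrongly merge unrelated people who share a surname.

```python
from collections import Counter

def cooccurring_pairs(persons, locations):
    """Count (person, location) pairs mentioned in the same (doc_id, sentence)."""
    sentence_to_locations = {}
    for loc, mentions in locations.items():
        for key in mentions:
            sentence_to_locations.setdefault(key, set()).add(loc)
    pairs = Counter()
    for person, mentions in persons.items():
        for key in set(mentions):  # count each shared sentence once per person
            for loc in sentence_to_locations.get(key, ()):
                pairs[(person, loc)] += 1
    return pairs

def resolve_short_names(names):
    """Map each one-word name to the single longer name ending in that word."""
    resolved = {}
    for short in names:
        if " " in short:
            continue
        candidates = [
            full for full in names
            if full != short and full.split()[-1:] == [short]
        ]
        # Only resolve when the match is unambiguous
        if len(candidates) == 1:
            resolved[short] = candidates[0]
    return resolved

# Invented per-document dictionaries in the shape extract_entities returns
persons = {
    "George Bush": [("toy/1", "George Bush visited Tokyo.")],
    "Bush": [("toy/1", "Bush later flew to Paris.")],
}
locations = {
    "Tokyo": [("toy/1", "George Bush visited Tokyo.")],
    "Paris": [("toy/1", "Bush later flew to Paris.")],
}
pairs = cooccurring_pairs(persons, locations)
print(pairs)
print(resolve_short_names(list(persons)))
```

Resolving names first and then counting co-occurrences would fold the "Bush" and "George Bush" rows together, which is probably the more useful order for building profiles.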