# AWS Comprehend Demo
## Analyzing a Financial Statement using Natural Language Comprehension


### Addressing the Problem

Hawk Center students currently comb through financial documents and SEC filings one at a time, a process that is both labor intensive and prone to human error. Automating this process would save time and improve comprehension accuracy, potentially allowing users to make decisions that are quicker, safer, and more accurate.

### What I'm doing

I'm using AWS Comprehend to analyze a set of Amazon press releases. Once the results are returned, I dissect the output to show the capabilities of AWS's natural language processing services, as well as the potential of applying more complex natural language processing methods to financial documents.

### Technology Capabilities

AWS Comprehend is a natural language processing service that reads text and extracts insights from it, such as the entities and events it mentions. Through this technology, we can automate the manual process of combing through financial documents and increase the amount of data available for future research.


In [5]:
#Code: imports, grabbing the output document from directory

%matplotlib inline

import json
import requests
import uuid

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import boto3
import smart_open

from time import sleep
from matplotlib import cm, colors
from spacy import displacy
from collections import Counter
from pyvis.network import Network

# gathering output and raw input from local files

output_data_s3_file = r'/Users/john/Documents/aws/sample_finance_dataset.txt.out'

# Load Comprehend's output: one JSON document per non-blank line.
results = []
with smart_open.open(output_data_s3_file) as fi:
    results.extend([json.loads(line) for line in fi.readlines() if line.strip()])

# Load the raw input text: one press release per non-blank line.
input_data_path = r'/Users/john/Documents/aws/sample_finance_dataset.txt'
input_data = []
with smart_open.open(input_data_path) as fi:
    input_data.extend([line for line in fi.readlines() if line.strip()])

result = results[0]
raw_text = input_data[0]

In [6]:
# Show the raw JSON output (comment out to hide)
result

{'Entities': [{'Mentions': [{'BeginOffset': 0,
     'EndOffset': 6,
     'GroupScore': 1.0,
     'Score': 0.999501,
     'Text': 'Amazon',
     'Type': 'ORGANIZATION'},
    {'BeginOffset': 149,
     'EndOffset': 155,
     'GroupScore': 0.9936,
     'Score': 0.999615,
     'Text': 'Amazon',
     'Type': 'ORGANIZATION'},
    {'BeginOffset': 468,
     'EndOffset': 474,
     'GroupScore': 0.584697,
     'Score': 0.998912,
     'Text': 'Amazon',
     'Type': 'ORGANIZATION'}]},
  {'Mentions': [{'BeginOffset': 8,
     'EndOffset': 19,
     'GroupScore': 1.0,
     'Score': 0.990119,
     'Text': 'NASDAQ:AMZN',
     'Type': 'STOCK_CODE'}]},
  {'Mentions': [{'BeginOffset': 25,
     'EndOffset': 49,
     'GroupScore': 1.0,
     'Score': 0.999654,
     'Text': 'Whole Foods Market, Inc.',
     'Type': 'ORGANIZATION'},
    {'BeginOffset': 169,
     'EndOffset': 187,
     'GroupScore': 0.990907,
     'Score': 0.999668,
     'Text': 'Whole Foods Market',
     'Type': 'ORGANIZATION'},
    {'BeginOffset

In [7]:
raw_text

"Amazon (NASDAQ:AMZN) and Whole Foods Market, Inc. (NASDAQ:WFM) today announced that they have entered into a definitive merger agreement under which Amazon will acquire Whole Foods Market for $42 per share in an all-cash transaction valued at approximately $13.7 billion, including Whole Foods Market’s net debt.  “Millions of people love Whole Foods Market because they offer the best natural and organic foods, and they make it fun to eat healthy,” said Jeff Bezos, Amazon founder and CEO. “Whole Foods Market has been satisfying, delighting and nourishing customers for nearly four decades – they’re doing an amazing job and we want that to continue.”  “This partnership presents an opportunity to maximize value for Whole Foods Market’s shareholders, while at the same time extending our mission and bringing the highest quality, experience, convenience and innovation to our customers,” said John Mackey, Whole Foods Market co-founder and CEO.  Whole Foods Market will continue to operate store

## Method

Comprehend scans text documents to find links between events and entities. Events are detected through triggers (phrases that indicate the occurrence of an event), and entities are detected through mentions (words or phrases that refer to the entity). Triggers are grouped into events and mentions into entities, and each is given a confidence score between 0 and 1. The closer the score is to 1, the more confident Comprehend is in that insight.
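To make the role of the confidence scores concrete, here is a minimal sketch of filtering mentions by score. The two mentions reuse values from the entity output shown in this notebook; the helper name `confident_mentions` and the 0.5 cutoff are assumptions chosen for illustration. As I understand the output, `Score` reflects confidence in the detected type, while `GroupScore` reflects confidence that the mention belongs to its entity group.

```python
# Two mentions from the same entity group, using values that appear in this
# notebook's output. The helper and the default threshold are hypothetical.
sample_entity = {
    "Mentions": [
        {"Text": "Whole Foods Market, Inc.", "Score": 0.999654, "GroupScore": 1.0},
        {"Text": "they", "Score": 0.956068, "GroupScore": 0.277769},
    ]
}


def confident_mentions(entity, min_group_score=0.5):
    """Return the text of mentions whose GroupScore meets the threshold."""
    return [
        m["Text"]
        for m in entity["Mentions"]
        if m["GroupScore"] >= min_group_score
    ]


kept = confident_mentions(sample_entity)
print(kept)
```

With the assumed cutoff, the pronoun "they" is dropped because Comprehend is much less sure that it refers to Whole Foods Market.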

### Gathering the Data

Amazon provided a set of 10 press releases from 2017 aggregated in a text file. I used S3 both to store the text file in a bucket and to configure a basic setup in which Comprehend reads its input from, and writes its output to, that same bucket.

The press release records were taken directly from the linked articles, with unrelated footnotes removed.


### Processing the Data

For the sake of ease, Comprehend was run through the AWS console rather than through code, and the output was downloaded from the S3 bucket to a local folder. Both steps could easily be changed to allow bulk processing of text documents. Comprehend outputs a JSON file containing the Events and Entities detected in each document, each located by character offsets (the position in the article where the occurrence was found). I use this raw output to show the event and entity detection, a visualization of the output, and an aggregate analysis of the main events of the article.

It should also be noted that AWS Comprehend has certain event types that it looks out for, making it easy to identify events within that scope. However, its analysis is limited to those events.
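The offsets in the output can be checked directly against the raw text. The sketch below uses the opening words of the press release and three mentions from the output above; the helper name `span_text` is an assumption for illustration.

```python
# Opening words of the press release and three mentions copied from the output.
text = "Amazon (NASDAQ:AMZN) and Whole Foods Market, Inc. (NASDAQ:WFM) today announced"

mentions = [
    {"BeginOffset": 0, "EndOffset": 6, "Text": "Amazon"},
    {"BeginOffset": 8, "EndOffset": 19, "Text": "NASDAQ:AMZN"},
    {"BeginOffset": 25, "EndOffset": 49, "Text": "Whole Foods Market, Inc."},
]


def span_text(text, mention):
    """Slice the raw text using a mention's character offsets."""
    return text[mention["BeginOffset"]:mention["EndOffset"]]


# Each slice should reproduce the mention text reported by Comprehend.
matches = [span_text(text, m) == m["Text"] for m in mentions]
print(matches)
```

The end offset is exclusive, so an ordinary Python slice recovers each mention exactly.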


In [8]:
#The text version of the press release
#Raw input before Comprehend scans text
raw_text

"Amazon (NASDAQ:AMZN) and Whole Foods Market, Inc. (NASDAQ:WFM) today announced that they have entered into a definitive merger agreement under which Amazon will acquire Whole Foods Market for $42 per share in an all-cash transaction valued at approximately $13.7 billion, including Whole Foods Market’s net debt.  “Millions of people love Whole Foods Market because they offer the best natural and organic foods, and they make it fun to eat healthy,” said Jeff Bezos, Amazon founder and CEO. “Whole Foods Market has been satisfying, delighting and nourishing customers for nearly four decades – they’re doing an amazing job and we want that to continue.”  “This partnership presents an opportunity to maximize value for Whole Foods Market’s shareholders, while at the same time extending our mission and bringing the highest quality, experience, convenience and innovation to our customers,” said John Mackey, Whole Foods Market co-founder and CEO.  Whole Foods Market will continue to operate store

## Analyzing the Output

The DataFrames below show the occurrences of entities and events. Most importantly, the output lets us collect the most common events from the article into a single table.

In [14]:
# Creation of the entity dataframe. Entity indices must be explicitly created.
entities_df = pd.DataFrame([
    {"EntityIndex": i, **m}
    for i, e in enumerate(result['Entities'])
    for m in e['Mentions']
])

# Creation of the events dataframe. Event indices must be explicitly created.
events_df = pd.DataFrame([
    {"EventIndex": i, **a, **t}
    for i, e in enumerate(result['Events'])
    for a in e['Arguments']
    for t in e['Triggers']
])

# Join the two tables into one flat data structure.
events_df = events_df.merge(entities_df, on="EntityIndex", suffixes=('Event', 'Entity'))

In [16]:
events_df

| | EventIndex | EntityIndex | Role | ScoreEvent | BeginOffsetEvent | EndOffsetEvent | GroupScoreEvent | TextEvent | TypeEvent | BeginOffsetEntity | EndOffsetEntity | GroupScoreEntity | ScoreEntity | TextEntity | TypeEntity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | DATE | 0.999611 | 120 | 126 | 1.000000 | merger | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 1 | 0 | 4 | DATE | 0.999829 | 662 | 673 | 0.999969 | partnership | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 2 | 0 | 4 | DATE | 0.992193 | 1237 | 1248 | 0.509699 | transaction | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 3 | 0 | 4 | DATE | 0.998367 | 1403 | 1414 | 0.336708 | transaction | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 4 | 1 | 4 | DATE | 0.999958 | 161 | 168 | 1.000000 | acquire | CORPORATE_ACQUISITION | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 132 | 1 | 0 | INVESTOR | 0.931136 | 221 | 232 | 0.999985 | transaction | CORPORATE_ACQUISITION | 468 | 474 | 0.584697 | 0.998912 | Amazon | ORGANIZATION |
| 133 | 2 | 6 | EMPLOYEE | 0.999938 | 1116 | 1122 | 1.000000 | remain | EMPLOYMENT | 897 | 908 | 1.000000 | 0.999606 | John Mackey | PERSON |
| 134 | 2 | 6 | EMPLOYEE | 0.999938 | 1116 | 1122 | 1.000000 | remain | EMPLOYMENT | 1099 | 1110 | 0.977111 | 0.999699 | John Mackey | PERSON |
| 135 | 2 | 7 | EMPLOYEE_TITLE | 0.999938 | 1116 | 1122 | 1.000000 | remain | EMPLOYMENT | 944 | 947 | 1.000000 | 0.997071 | CEO | PERSON_TITLE |


In [17]:
entities_df

| | EntityIndex | BeginOffset | EndOffset | GroupScore | Score | Text | Type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 6 | 1.0 | 0.999501 | Amazon | ORGANIZATION |
| 1 | 0 | 149 | 155 | 0.9936 | 0.999615 | Amazon | ORGANIZATION |
| 2 | 0 | 468 | 474 | 0.584697 | 0.998912 | Amazon | ORGANIZATION |
| 3 | 1 | 8 | 19 | 1.0 | 0.990119 | NASDAQ:AMZN | STOCK_CODE |
| 4 | 2 | 25 | 49 | 1.0 | 0.999654 | Whole Foods Market, Inc. | ORGANIZATION |
| 5 | 2 | 169 | 187 | 0.990907 | 0.999668 | Whole Foods Market | ORGANIZATION |
| 6 | 2 | 282 | 300 | 0.618077 | 0.999653 | Whole Foods Market | ORGANIZATION |
| 7 | 2 | 339 | 357 | 0.379389 | 0.999708 | Whole Foods Market | ORGANIZATION |
| 8 | 2 | 366 | 370 | 0.277769 | 0.956068 | they | ORGANIZATION |
| 9 | 2 | 417 | 421 | 0.289622 | 0.68926 | they | PERSON |


In [18]:
def format_compact_events(x):
    """Collapse groups of mentions and triggers into a single set."""
    # Take the most commonly occurring EventType and the set of triggers.
    d = {"EventType": Counter(x['TypeEvent']).most_common()[0][0],
         "Triggers": set(x['TextEvent'])}
    # For each distinct argument Role, collect the set of mentions in the group.
    for role in x['Role'].unique():
        d[role] = set(x[x['Role'] == role]['TextEntity'])
    return d

# Group data by EventIndex and format.
event_analysis_df = pd.DataFrame(
    events_df.groupby("EventIndex").apply(format_compact_events).tolist()
).fillna('')

In [20]:
#Most important bit
event_analysis_df

| | EventType | Triggers | DATE | PARTICIPANT | INVESTEE | AMOUNT | INVESTOR | EMPLOYER | EMPLOYEE | EMPLOYEE_TITLE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CORPORATE_MERGER | {transaction, merger, partnership} | {today, during the second half of 2017} | {they, Whole Foods Market, Inc., NASDAQ:WFM, N... | | | | | | |
| 1 | CORPORATE_ACQUISITION | {transaction, acquire} | {today} | | {we, Whole Foods Market, Whole Foods Market, I... | {$42, $13.7 billion} | {Amazon} | | | |
| 2 | EMPLOYMENT | {remain} | | | | | | {we, Whole Foods Market, Whole Foods Market, I... | {John Mackey} | {CEO} |


## Visualizing the Output

These visualizations are mainly illustrative, but they do show how Comprehend recognizes entities and links them to events.

In [23]:
# Convert Events output to displaCy format.
entities = [
    {'start': m['BeginOffset'], 'end': m['EndOffset'], 'label': m['Type']}
    for e in result['Entities']
    for m in e['Mentions']
]

triggers = [
    {'start': t['BeginOffset'], 'end': t['EndOffset'], 'label': t['Type']}
    for e in result['Events']
    for t in e['Triggers']
]

# Spans need to be sorted for displaCy to process them correctly
spans = sorted(entities + triggers, key=lambda x: x['start'])
tags = [s['label'] for s in spans]

output = [{"text": raw_text, "ents": spans, "title": None, "settings": {}}]

# Misc. objects for presentation purposes.
# plt.get_cmap is used because cm.get_cmap was removed in newer matplotlib releases.
spectral = plt.get_cmap("Spectral", len(tags))
tag_colors = [colors.rgb2hex(spectral(i)) for i in range(len(tags))]
color_map = dict(zip(tags, tag_colors))

# Note that only Entities participating in Events are shown.
displacy.render(output, style="ent", options={"colors": color_map}, manual=True)

In [24]:
# Entities are associated with events by group, not individual mention; for simplicity, 
# assume the canonical mention is the longest one.
def get_canonical_mention(mentions):
    extents = enumerate([m['Text'] for m in mentions])
    longest_name = sorted(extents, key=lambda x: len(x[1]))
    return [mentions[longest_name[-1][0]]]

# Set a global confidence threshold
thr = 0.5

# Nodes are (id, type, tag, score, mention_type) tuples.
trigger_nodes = [
    ("tr%d" % i, t['Type'], t['Text'], t['Score'], "trigger")
    for i, e in enumerate(result['Events'])
    for t in e['Triggers'][:1]
    if t['GroupScore'] > thr
]
entity_nodes = [
    ("en%d" % i, m['Type'], m['Text'], m['Score'], "entity")
    for i, e in enumerate(result['Entities'])
    for m in get_canonical_mention(e['Mentions'])
    if m['GroupScore'] > thr
]

# Edges are (trigger_id, node_id, role, score) tuples.
argument_edges = [
    ("tr%d" % i, "en%d" % a['EntityIndex'], a['Role'], a['Score'])
    for i, e in enumerate(result['Events'])
    for a in e['Arguments']
    if a['Score'] > thr
]

In [25]:
G = nx.Graph()

# Iterate over triggers and entity mentions.
for mention_id, tag, extent, score, mtype in trigger_nodes + entity_nodes:
    label = extent if mtype.startswith("entity") else tag
    G.add_node(mention_id, label=label, size=score*10, color=color_map[tag], tag=tag, group=mtype)
    
# Iterate over argument role assignments
for event_id, entity_id, role, score in argument_edges:
    G.add_edges_from(
        [(event_id, entity_id)],
        label=role,
        weight=score*100,
        color="grey"
    )

# Drop mentions that don't participate in events
G.remove_nodes_from(list(nx.isolates(G)))

In [26]:
nt = Network("600px", "800px", notebook=True, heading="")
nt.from_nx(G)
nt.show("compact_nx.html")

## For the Future

### Automation

The fundamental NLP work was done through the AWS Console, but it could be written as code. Earlier trouble with the AWS SDK led me to use the console, which was easier to access and use, rather than debug the code. The problems were connecting to AWS and name shadowing on my personal computer.

Whichever way AWS is accessed (code or console), bulk processing should work similarly: the service is simply pointed at the S3 bucket containing the documents. For usability, however, code is preferable because the downstream analysis shown above is already done in code.
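As a rough sketch of what the code route could look like, the snippet below builds the request for boto3's `start_events_detection_job` call on the Comprehend client. The bucket URIs, role ARN, job-name prefix, region, and helper names are all hypothetical placeholders, and this has not been run against the account used in this demo.

```python
import uuid


def build_events_job_request(input_s3_uri, output_s3_uri, role_arn, event_types):
    """Assemble keyword arguments for Comprehend's start_events_detection_job."""
    return {
        "JobName": "events-demo-" + uuid.uuid4().hex[:8],  # hypothetical naming scheme
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # The sample file holds one press release per line.
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
        "TargetEventTypes": list(event_types),
    }


def submit_events_job(request, region_name="us-east-1"):
    """Submit the job; needs boto3 and valid AWS credentials (not run here)."""
    import boto3  # imported lazily so building a request needs no AWS setup

    client = boto3.client("comprehend", region_name=region_name)
    return client.start_events_detection_job(**request)["JobId"]


# Placeholder values: replace with a real bucket and IAM role before submitting.
request = build_events_job_request(
    input_s3_uri="s3://example-bucket/input/sample_finance_dataset.txt",
    output_s3_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/example-comprehend-role",
    event_types=["CORPORATE_MERGER", "CORPORATE_ACQUISITION", "EMPLOYMENT"],
)
print(request["InputDataConfig"])
```

Calling `submit_events_job(request)` would start the asynchronous job. Separating request building from submission keeps the configuration easy to inspect before anything is sent to AWS.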

### Versatility

Because this was a press release with easily digestible information, the capabilities of the technology were easy to see. However, more testing with other documents (SEC filings, other financials) is required before it can be considered useful. AWS Comprehend offered finance-oriented event types but did not detect anything outside them, even though I selected all of the options when running the job. Thus, if future work depends on other topics in financial statements, this is a clear limitation of the technology.


### Impact + Issues

#### Issues: 
- Cost

- More testing required with different documents

- The technology used right now is highly tailored: it is high level and abstracted, which gives ease of access at the cost of flexibility in how it can be used

- Learning curve for more complex NLP techniques

#### Impact: 
Ideally, this approach could bulk process hundreds or thousands of financial statements and develop insights on them without the labor-intensive manual process (the SEC makes it easy to gather many documents automatically).

With more expertise in the NLP field, the concept could also be applied outside the prebuilt AWS scope, gaining accuracy at the cost of ease of use. Adoption of these techniques ultimately requires more trust in the system and a lot more work. This demonstration is only meant to show the potential of the technology.


