# Entity Detection

In this example Notebook we will use Amazon Comprehend to do entity detection on our dataset of news articles. The idea here is to discover the subjects discussed in each article and use them as links to articles that discuss the same subjects.

We also run an experiment in which the detected entities are used to generate an alternative text representation, and then run topic modelling on it. In other words, can we discover natural topics that are groupings of named entities?

This notebook was run in SageMaker Studio with the **Python 3 (Data Science)** kernel.

There is some IAM configuration that needs to be done later in the notebook. This blog contains some useful general guidelines on using Comprehend.

In [166]:
import pandas as pd
import numpy as np
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
comprehend = boto3.client('comprehend', region_name=region)
sgmk_session = sagemaker.Session()
sgmk_role = sagemaker.get_execution_role()

In [167]:
ctstories = "s3://funnybones/rural/topics/CTstories.csv"

In [168]:
df1 = pd.read_csv(ctstories)

In [169]:
text_only = df1.loc[:,['text']]

In [170]:
text_only.to_csv("data/text_only.csv", index=False, header=False)

In [171]:
bucket_name = "funnybones"
bucket_prefix="rural/entities/text"

In [10]:
# Upload the text-only CSV to S3 as input for the Comprehend job
train_uri = sgmk_session.upload_data(
    path="data/text_only.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

In [11]:
train_uri

's3://funnybones/rural/entities/text/text_only.csv'

In [12]:
output_uri="s3://funnybones/rural/entities/model/"

### Data Access

This next part is critical to using Comprehend to build models inside SageMaker Studio.

We create a role that grants Comprehend access to the S3 buckets where the data will be.

We then need to grant our SageMaker execution role the ability to pass this role to the Comprehend service.

I added this as an inline policy to my SageMaker execution role:

```
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::320389841409:role/ComprehendS3Access"
    }
}
```
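The policy above only lets the notebook's execution role hand `ComprehendS3Access` over to Comprehend. For Comprehend to assume that role, the role itself also needs a trust policy naming the Comprehend service principal, plus S3 permissions on the bucket. The notebook does not show those documents, so the following is a minimal sketch under assumptions: the bucket name is taken from the paths used above, and the list of S3 actions is a guess at a reasonable minimum, not the configuration that was actually used.

```python
import json

# Trust policy: lets the Comprehend service assume the data access role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "comprehend.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy (assumed): list the bucket and read/write its objects.
bucket = "funnybones"
s3_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": f"arn:aws:s3:::{bucket}"},
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": f"arn:aws:s3:::{bucket}/*"},
    ],
}

print(json.dumps(trust_policy, indent=4))
```

Serialised with `json.dumps`, these documents could be supplied as `AssumeRolePolicyDocument` to the boto3 IAM client's `create_role` and as `PolicyDocument` to `put_role_policy`, or pasted into the IAM console.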

In [14]:
comprehend_role = "arn:aws:iam::320389841409:role/ComprehendS3Access"

We can detect entities for a single document with the synchronous API:

In [78]:
sample_text = text_only.loc[0]['text']

# Keep the request small for the synchronous API by truncating long articles
if len(sample_text) > 5000:
    sample_text = sample_text[0:5000]
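Comprehend documents its synchronous size limits in UTF-8 bytes rather than characters, so a 5,000-character slice of text that contains non-ASCII characters can still encode to more than 5,000 bytes. The helper below is a hypothetical addition, not part of the original notebook; it trims by encoded size instead. The right value for `max_bytes` depends on the current service quota, which is not stated here.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character that was cut in half at the boundary.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

# 'é' takes two bytes, so a 4-byte budget cannot include it.
print(truncate_utf8("café au lait", 4))
```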

In [79]:

# Entities
entities = comprehend.detect_entities(Text=sample_text, LanguageCode='en')

In [175]:

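def flatten(listy):
    """Flatten a list of lists into a single list."""
    return [item for sublist in listy for item in sublist]

def split(x):
    return x.split()

def get_word_list(input_list):
    """Return every individual word that appears in a list of phrases."""
    return flatten(list(map(split, input_list)))

def process_entities_to_list(entities):
    """Return de-duplicated entity strings of the types we care about.

    A single-word entity is dropped when it already appears as a word inside
    a multi-word entity, e.g. 'Sudan' is dropped if 'South Sudan' was detected.
    """
    keep_types = ['PERSON', 'TITLE', 'COMMERCIAL_ITEM', 'EVENT', 'LOCATION', 'ORGANIZATION']
    rez = []
    no_spaces = []
    with_spaces = []
    for ent in entities:
        if ent['Type'] in keep_types:
            if " " in ent['Text']:
                with_spaces.append(ent['Text'])
            else:
                no_spaces.append(ent['Text'])
    total_list_no_spaces = list(set(no_spaces))
    total_list_with_spaces = list(set(with_spaces))
    # Words already covered by a multi-word entity
    broken_list = get_word_list(total_list_with_spaces)
    for elem in total_list_no_spaces:
        if elem not in broken_list:
            rez.append(elem)
    for elem in total_list_with_spaces:
        rez.append(elem)
    return rez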



In [None]:
ents = process_entities_to_list(entities['Entities'])

In [82]:
ents

['Loki',
 'London',
 'Kenyan',
 'Versace',
 'Gucci',
 'Australian',
 'Paris',
 'Australia',
 'Akiima',
 'Vogue',
 'Prada',
 'South Sudan',
 'World Cup',
 'Carla Zampatti',
 'Multicultural Hub Canberra',
 'IRT Kangara Waters Residential Care',
 "Multicultural Women's Service",
 'Saint Laurent',
 'World Refugee Week',
 'Refugee Week',
 "Hub's Multicultural Employment Service",
 'New York']
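The list above keeps every entity of the selected types regardless of how confident Comprehend was. Each entity in the response also carries a `Score` field, so low-confidence detections could be dropped before building the list. This is a sketch of that idea rather than something the notebook does; the 0.8 threshold and the example scores are made up for illustration.

```python
def filter_by_score(entities, min_score=0.8):
    """Keep only entities whose Comprehend confidence is at least `min_score`."""
    return [ent for ent in entities if ent.get("Score", 0.0) >= min_score]

# Hypothetical entities in the shape of a detect_entities response.
demo_entities = [
    {"Text": "Carla Zampatti", "Type": "PERSON", "Score": 0.99},
    {"Text": "get you up there", "Type": "TITLE", "Score": 0.41},
]

kept = filter_by_score(demo_entities)
print([ent["Text"] for ent in kept])
```

In the notebook this would sit in front of the existing helper, for example `process_entities_to_list(filter_by_score(entities['Entities']))`.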

## Batch Detection Job



### StartEntitiesDetectionJob

In [85]:
response = comprehend.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': train_uri,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': output_uri
    },
    LanguageCode="en",
    DataAccessRoleArn=comprehend_role,
    JobName='RuralPress_Entity_Detection'
)

In [86]:
job_id = response['JobId']
print(job_id)

8bfb71120509a13d6e999cf2b642a01b


In [92]:
job_id = "8bfb71120509a13d6e999cf2b642a01b"

In [93]:
describe_result = comprehend.describe_entities_detection_job(JobId=job_id)

In [94]:
job_status = describe_result['EntitiesDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')
if job_status == 'FAILED':
    print(f'Reason: {describe_result["EntitiesDetectionJobProperties"]["Message"]}')

Job Status: COMPLETED
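Batch jobs run asynchronously, so the status cell usually has to be re-run until the job finishes. A small polling helper can automate that. The sketch below is an assumption about how one might write it: `wait_for_job` and its parameters are hypothetical names, and a fake describe function stands in for the Comprehend client so the example runs offline.

```python
import time

def wait_for_job(describe_fn, job_id, properties_key, poll_seconds=30, max_polls=120):
    """Poll a describe_* call until the job is no longer SUBMITTED or IN_PROGRESS."""
    for _ in range(max_polls):
        props = describe_fn(JobId=job_id)[properties_key]
        if props["JobStatus"] not in ("SUBMITTED", "IN_PROGRESS"):
            return props
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still running after {max_polls} polls")

# Stand-in for comprehend.describe_entities_detection_job (illustration only).
_fake_statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])

def fake_describe(JobId):
    return {"EntitiesDetectionJobProperties": {"JobId": JobId,
                                               "JobStatus": next(_fake_statuses)}}

props = wait_for_job(fake_describe, "demo-job",
                     "EntitiesDetectionJobProperties", poll_seconds=0)
print(props["JobStatus"])
```

Against the real service the call would look like `wait_for_job(comprehend.describe_entities_detection_job, job_id, 'EntitiesDetectionJobProperties')`; the returned properties still need checking for a `FAILED` status.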


In [95]:
results_S3Url = comprehend.describe_entities_detection_job(JobId=job_id)['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']

results_S3Url

's3://funnybones/rural/entities/model/320389841409-NER-8bfb71120509a13d6e999cf2b642a01b/output/output.tar.gz'

In [96]:
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = results_S3Url.replace(s3_name, '')
results_aws_filename

'rural/entities/model/320389841409-NER-8bfb71120509a13d6e999cf2b642a01b/output/output.tar.gz'
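The string replacement above works because the bucket name is already known. When it is not, the URI can be split with the standard library instead. A minimal sketch, where `split_s3_uri` is a hypothetical helper name:

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://funnybones/rural/entities/text/text_only.csv")
print(bucket, key)
```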

In [97]:
# Local file name
local_results_filename = 'results/entities.tar.gz'
# Download the results
s3 = boto3.client('s3')
s3.download_file(bucket_name,
                 results_aws_filename, 
                 local_results_filename)

In [99]:
!tar xzf results/entities.tar.gz -C results

In [100]:
import json

In [249]:
entities=[]
line_numbers=[]
with open('results/output') as f:
    data=f.readlines()
    for d in data:
        ner_data = json.loads(d)
        entities.append( process_entities_to_list( ner_data['Entities']) )
        line_numbers.append(ner_data['Line'])


In [250]:
len(entities)


1009

In [252]:
line_numbers[15]

18

In [262]:
entities[18]

['Lying',
 'Cow',
 'John Malouff',
 'School of Psychology,',
 'University of New England',
 "Serena Williams'",
 'Mitch McConnell',
 'US Senate',
 'Donald Trump']

In [263]:
text_only.loc[line_numbers[18]].text

' When I teach students how to help eliminate some behaviour, like nail biting, I preach that clients need to develop an alternative behaviour that serves the same functions as the one they want to eliminate. For instance, if a bloke wants to stop ruinous drinking, he may need to learn new methods of coping with the stresses of work.  These other methods might include meditating, walking in nature, talking with someone about his emotions, and so on. Alternatives are good also for expanding our ways of thinking.  I realised that years back when I was encouraging my students to go all out in their studies - to "become a stud."  I looked at the students, who were almost all females, and paused while I thought of how to include women. I said "or become a..."  A clever female student spoke up and added an alternative: A brood mare. I often hear the term WAGs for women and girlfriends of famous male athletes.  I propose "HABs" for husbands and boyfriends of famous female athletes. Serena Wil
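Note that `line_numbers[15]` is 18: the records in the output file are not guaranteed to follow the order of the input lines, so a record's position in the list and its source line can differ. One way to make the alignment explicit is to key the parsed results by their `Line` field. The sketch below uses two made-up records in place of `results/output`, and a simplified stand-in for `process_entities_to_list`.

```python
import json

# Hypothetical records in the JSON-lines shape of the batch output, out of order on purpose.
raw_lines = [
    '{"Line": 1, "Entities": [{"Text": "Canberra", "Type": "LOCATION"}]}',
    '{"Line": 0, "Entities": [{"Text": "Akiima", "Type": "PERSON"}]}',
]

entities_by_line = {}
for raw in raw_lines:
    record = json.loads(raw)
    # Stand-in for process_entities_to_list: just collect the entity text.
    entities_by_line[record["Line"]] = [ent["Text"] for ent in record["Entities"]]

# Rebuild a list in input order; lines missing from the output get an empty list.
n_docs = 2
ordered = [entities_by_line.get(i, []) for i in range(n_docs)]
print(ordered)
```

With a list built this way, position `i` always corresponds to row `i` of `text_only`, which simplifies any later join.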

In [182]:
def replace_all_spaces(words):
    # Join multi-word entities with underscores so each entity becomes a single token
    return [x.replace(" ", "_") for x in words]

newents = list(map(replace_all_spaces, entities))

In [183]:
newents[6]

['TWA',
 'P&O',
 'Asia',
 'Australasia',
 'Pakistan',
 'Rizzoli',
 'Heathrow',
 'Wolff',
 'Europe',
 'Rhianna',
 'America',
 'London',
 'Venice',
 'Belfast',
 'Harland',
 'Jason_Gay',
 'Justin_Bieber',
 'Miley_Cyrus',
 "Oxford_University's_Bodleian_Library",
 'Jodi_Peckman',
 'Come_Fly_with_Me:_Flying_in_Style',
 'Bodleian_Library',
 'Dolly_Parton',
 'Come_Fly_with_Me',
 'Sharon_Stone',
 'Second_World_War',
 'get_you_up_there',
 'Thor_Johnson',
 "John_Sayers'",
 'Lady_Gaga',
 'Louis_Vuitton',
 'Frank_Sinatra',
 'Sydney_airport',
 'Jane_Fonda',
 'Billy_May',
 'Marilyn_Monroe',
 'John_Sayers',
 'Middle_East',
 'Joan_Collins',
 'James_Bond',
 'Future_Covid',
 "'ll_just_glide",
 'Secrets_of_the_Great_Ocean_Liners',
 'Diamonds_are_Forever',
 'Mohammed_Ali',
 'SS_Canberra',
 'Rolling_Stone',
 'Diana_Ross',
 'Dean_Martin',
 'Los_Angeles',
 'Falklands_war',
 'Pan_Am']

In [184]:
def to_text(words):
    # One space-separated "document" of entity tokens per article
    return " ".join(words)

textents = list(map(to_text, newents))

In [185]:
textents[6]

"TWA P&O Asia Australasia Pakistan Rizzoli Heathrow Wolff Europe Rhianna America London Venice Belfast Harland Jason_Gay Justin_Bieber Miley_Cyrus Oxford_University's_Bodleian_Library Jodi_Peckman Come_Fly_with_Me:_Flying_in_Style Bodleian_Library Dolly_Parton Come_Fly_with_Me Sharon_Stone Second_World_War get_you_up_there Thor_Johnson John_Sayers' Lady_Gaga Louis_Vuitton Frank_Sinatra Sydney_airport Jane_Fonda Billy_May Marilyn_Monroe John_Sayers Middle_East Joan_Collins James_Bond Future_Covid 'll_just_glide Secrets_of_the_Great_Ocean_Liners Diamonds_are_Forever Mohammed_Ali SS_Canberra Rolling_Stone Diana_Ross Dean_Martin Los_Angeles Falklands_war Pan_Am"

In [186]:
import pandas as pd

In [187]:
df = pd.DataFrame({"text":textents})

In [188]:
df.head()

| | text |
|---|---|
| 0 | The Kiwi NRL Olympics Australia Renegade Tokyo... |
| 1 | Loki London Versace Gucci Paris Australia Akii... |
| 2 | Labor Minerals_Council_of_Australia Scott_Morr... |
| 3 | WBL WNBA Canberra Caps WNBL Bankstown French_l... |
| 4 | Tharni Nades Priya Kopika Australia Tharnicaa ... |


In [189]:
df.to_csv("data/entities_text_only.csv", index=False, header=False)

In [190]:
bucket_name = "funnybones"
bucket_prefix="rural/topics/entities"

In [191]:
# Upload the entity-text CSV to S3 as input for the Comprehend topic detection job
train_uri = sgmk_session.upload_data(
    path="data/entities_text_only.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

In [192]:
!head data/entities_text_only.csv

"The Kiwi NRL Olympics Australia Renegade Tokyo TKO Victor_Oganov Olympic_Games IBF_Australasian Jason_Whateley Lucas_Browne Paul_Gallen Issac_Hardman Emmanuel_Carlos ANBF_Australian GLANCE_Termination_Day Sam_Goodman Justis_Huni Andrei_Mikhailovich Jeff_Fenech International_Convention_Centre_Sydney Sydney's_International_Convention_Centre Nort_Beauchamp Eddie_Hearn ANBF_Australian_heavyweight WBO_Oriental Dean_Lonergan Grahame_""""Spike""""_Cheney Mark_Hunt Jai_Opetaia Alex_Hanan"
Loki London Versace Gucci Paris Australia Akiima Vogue Prada South_Sudan World_Cup Carla_Zampatti Multicultural_Hub_Canberra IRT_Kangara_Waters_Residential_Care Multicultural_Women's_Service Saint_Laurent World_Refugee_Week Refugee_Week Hub's_Multicultural_Employment_Service New_York Multicultural_Hub
Labor Minerals_Council_of_Australia Scott_Morrison Energy_Minister Prime_Minister Angus_Taylor Chris_Bowen Federal_Parliament
WBL WNBA Canberra Caps WNBL Bankstown French_league Marianna_Tolo Paul_Goriss U19_FI

In [193]:
train_uri


's3://funnybones/rural/topics/entities/entities_text_only.csv'

In [194]:
output_uri="s3://funnybones/rural/entities/topics/model/"

In [195]:
comprehend_role = "arn:aws:iam::320389841409:role/ComprehendS3Access"

In [196]:
response = comprehend.start_topics_detection_job(
    InputDataConfig={
        'S3Uri': train_uri,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': output_uri
    },
    DataAccessRoleArn=comprehend_role,
    JobName='RuralPress_Entity_Derived_Topics',
    NumberOfTopics=50
)

In [197]:
job_id = response['JobId']
print(job_id)

499b0daf240174b54213ce24ce876342


In [202]:
job_id = "499b0daf240174b54213ce24ce876342"

In [205]:
describe_result = comprehend.describe_topics_detection_job(JobId=job_id)

In [206]:
job_status = describe_result['TopicsDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')
if job_status == 'FAILED':
    print(f'Reason: {describe_result["TopicsDetectionJobProperties"]["Message"]}')

Job Status: COMPLETED


In [207]:
results_S3Url = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']

results_S3Url

's3://funnybones/rural/entities/topics/model/320389841409-TOPICS-499b0daf240174b54213ce24ce876342/output/output.tar.gz'

In [208]:
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = results_S3Url.replace(s3_name, '')
results_aws_filename

'rural/entities/topics/model/320389841409-TOPICS-499b0daf240174b54213ce24ce876342/output/output.tar.gz'

In [209]:
# Local file name
local_results_filename = 'results/entities_topics.tar.gz'
# Download the results
s3 = boto3.client('s3')
s3.download_file(bucket_name,
                 results_aws_filename, 
                 local_results_filename)

In [211]:
!tar xzf results/entities_topics.tar.gz -C results/entities/

In [214]:
!ls -la results/entities

total 60
drwxr-xr-x 2 root root  6144 Jul 14 06:06 .
drwxr-xr-x 4 root root  6144 Jul 14 06:05 ..
-rw-r--r-- 1 root root 37217 Jul 14 05:17 doc-topics.csv
-rw-r--r-- 1 root root  9629 Jul 14 05:17 topic-terms.csv


In [215]:
doc_topics = pd.read_csv("results/entities/doc-topics.csv")

In [216]:
top_topics = doc_topics.groupby('docname').first().reset_index()

In [217]:
top_topics.head()

| | docname | topic | proportion |
|---|---|---|---|
| 0 | entities_text_only.csv:0 | 4 | 1.0 |
| 1 | entities_text_only.csv:1 | 4 | 1.0 |
| 2 | entities_text_only.csv:10 | 2 | 1.0 |
| 3 | entities_text_only.csv:100 | 0 | 1.0 |
| 4 | entities_text_only.csv:1000 | 10 | 1.0 |
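`groupby('docname').first()` returns each document's first row in file order, which is the dominant topic only if the rows for a document are already sorted by descending proportion. Whether Comprehend guarantees that ordering is not established here, so a version that selects the maximum explicitly may be safer. A sketch on made-up rows in the same shape as `doc-topics.csv`:

```python
import pandas as pd

# Hypothetical rows in the same shape as doc-topics.csv.
doc_topics_demo = pd.DataFrame({
    "docname": ["a.csv:0", "a.csv:0", "a.csv:1"],
    "topic": [3, 7, 2],
    "proportion": [0.2, 0.8, 1.0],
})

# For each document, idxmax gives the index label of the row with the largest proportion.
best_rows = doc_topics_demo.groupby("docname")["proportion"].idxmax()
top_topics_demo = doc_topics_demo.loc[best_rows].reset_index(drop=True)
print(top_topics_demo)
```

The same pattern would apply to picking the highest-weight term per topic from `topic-terms.csv`, using `weight` in place of `proportion`.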


In [218]:
def get_line_no(docname):
    return int(docname.split(':')[1])
    
top_topics['line'] =  top_topics['docname'].apply(lambda x : get_line_no(x))


In [219]:
top_topics_sorted = top_topics.sort_values('line', axis=0, ascending=True, inplace=False)

In [220]:
top_topics_sorted

| | docname | topic | proportion | line |
|---|---|---|---|---|
| 0 | entities_text_only.csv:0 | 4 | 1.0 | 0 |
| 1 | entities_text_only.csv:1 | 4 | 1.0 | 1 |
| 121 | entities_text_only.csv:2 | 3 | 1.0 | 2 |
| 232 | entities_text_only.csv:3 | 6 | 1.0 | 3 |
| 343 | entities_text_only.csv:4 | 4 | 1.0 | 4 |
| ... | ... | ... | ... | ... |
| 8 | entities_text_only.csv:1004 | 32 | 1.0 | 1004 |
| 9 | entities_text_only.csv:1005 | 4 | 1.0 | 1005 |
| 10 | entities_text_only.csv:1006 | 4 | 1.0 | 1006 |
| 11 | entities_text_only.csv:1007 | 5 | 1.0 | 1007 |


In [221]:
topic_terms = pd.read_csv("results/entities/topic-terms.csv")

In [223]:
topic_terms.tail(20)

| | topic | term | weight |
|---|---|---|---|
| 312 | 41 | warehouse_circus | 0.330828 |
| 313 | 42 | duntroon | 0.33444 |
| 314 | 42 | parnell_road | 0.333556 |
| 315 | 42 | royal_military_college | 0.332005 |
| 316 | 43 | walt | 0.334036 |
| 317 | 43 | trivium | 0.333926 |
| 318 | 43 | burley | 0.332038 |
| 319 | 44 | perinatal_wellbeing_centre | 0.334745 |
| 320 | 44 | hyatt_hotel | 0.332622 |
| 321 | 44 | cakeoff_2021 | 0.332633 |


In [224]:
topic_labels = topic_terms.groupby('topic').first().reset_index()

In [226]:
topic_labels.head(20)

| | topic | term | weight |
|---|---|---|---|
| 0 | 0 | raider | 0.062456 |
| 1 | 1 | canberra_times | 0.085083 |
| 2 | 2 | brumbies | 0.144344 |
| 3 | 3 | australia | 0.14979 |
| 4 | 4 | geelong_street | 0.04147 |
| 5 | 5 | act_supreme_court | 0.121782 |
| 6 | 6 | canberra | 0.2133 |
| 7 | 7 | act_government | 0.160364 |
| 8 | 8 | canberra | 0.162315 |
| 9 | 9 | act_fire_and_rescue | 0.125362 |


In [230]:
topic_labels['label'] = topic_labels['term'].apply(lambda x: str(x).replace("_", " ").capitalize() )

In [231]:
topic_labels.head(20)

| | topic | term | weight | label |
|---|---|---|---|---|
| 0 | 0 | raider | 0.062456 | Raider |
| 1 | 1 | canberra_times | 0.085083 | Canberra times |
| 2 | 2 | brumbies | 0.144344 | Brumbies |
| 3 | 3 | australia | 0.14979 | Australia |
| 4 | 4 | geelong_street | 0.04147 | Geelong street |
| 5 | 5 | act_supreme_court | 0.121782 | Act supreme court |
| 6 | 6 | canberra | 0.2133 | Canberra |
| 7 | 7 | act_government | 0.160364 | Act government |
| 8 | 8 | canberra | 0.162315 | Canberra |
| 9 | 9 | act_fire_and_rescue | 0.125362 | Act fire and rescue |


In [234]:
topic_label_map = dict( (topic_labels.loc[:,['topic','label']] ).values )

In [235]:
topic_label_map[19]

'Canberra croatia fc'

## Join back to the original documents and get topic labels

In [238]:
top_topics_sorted['label'] = top_topics_sorted['topic'].apply(lambda z: topic_label_map[z])

In [239]:
top_topics_sorted.head()

| | docname | topic | proportion | line | label |
|---|---|---|---|---|---|
| 0 | entities_text_only.csv:0 | 4 | 1.0 | 0 | Geelong street |
| 1 | entities_text_only.csv:1 | 4 | 1.0 | 1 | Geelong street |
| 121 | entities_text_only.csv:2 | 3 | 1.0 | 2 | Australia |
| 232 | entities_text_only.csv:3 | 6 | 1.0 | 3 | Canberra |
| 343 | entities_text_only.csv:4 | 4 | 1.0 | 4 | Geelong street |


In [242]:
text_only.loc[0].text

' Model Akiima was born in the small village of Loki while her family was travelling to a Kenyan refugee camp when fleeing war in South Sudan.  After 10 years, Akiima and her family moved to Australia to start a new life. Multicultural Hub Canberra supported Akiima in a placement at IRT Kangara Waters Residential Care where she became a valued aged-care worker and a mentor for many refugees experiencing their first job. Her happy and supportive personality made everyone feel comfortable and welcome. After being introduced to a modelling agency by a friend, Akiima made her modelling runway debut with the late Carla Zampatti and has since been in demand on runways all over the world including fashion shows in Paris, London and New York. Akiima has modelled for known fashion labels such as Versace, Gucci, Prada and Saint Laurent and has appeared on the cover of fashion magazines including Vogue. In a recent Vogue interview, Akiima said, "Australia is a multicultural country, but unfortuna

In [244]:
df.loc[0].text

'The Kiwi NRL Olympics Australia Renegade Tokyo TKO Victor_Oganov Olympic_Games IBF_Australasian Jason_Whateley Lucas_Browne Paul_Gallen Issac_Hardman Emmanuel_Carlos ANBF_Australian GLANCE_Termination_Day Sam_Goodman Justis_Huni Andrei_Mikhailovich Jeff_Fenech International_Convention_Centre_Sydney Sydney\'s_International_Convention_Centre Nort_Beauchamp Eddie_Hearn ANBF_Australian_heavyweight WBO_Oriental Dean_Lonergan Grahame_""Spike""_Cheney Mark_Hunt Jai_Opetaia Alex_Hanan'
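The two cells above do not describe the same article: row 0 of `text_only` is the Akiima story, while row 0 of `df` holds entities from a boxing story (the Akiima entities are in row 1 of `df`). The `line` column in `top_topics_sorted` counts lines in `entities_text_only.csv`, which follows the order of the batch output rather than the order of the original file. Assuming `line_numbers` still holds the `Line` values collected when the entity output was parsed, the original row can be recovered through it. A sketch with toy values:

```python
# Toy stand-ins: position i in the entity file came from source line line_numbers_demo[i].
line_numbers_demo = [2, 0, 1]
topic_rows = [
    {"line": 0, "label": "Geelong street"},
    {"line": 1, "label": "Australia"},
    {"line": 2, "label": "Canberra"},
]

# Translate each entity-file line back to the row index of the original articles.
for row in topic_rows:
    row["source_line"] = line_numbers_demo[row["line"]]

label_by_source_line = {row["source_line"]: row["label"] for row in topic_rows}
print(label_by_source_line)
```

In the notebook the equivalent step might be `top_topics_sorted['source_line'] = top_topics_sorted['line'].map(lambda i: line_numbers[i])`, after which `text_only.loc[source_line]` points at the article that a label belongs to.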