# Topic Labelling with Comprehend

In this example notebook we generate article labels (tags) using the topic modelling feature of the Amazon Comprehend service.

We then compare the learned topics with labelled data to see how well the groupings agree with human-generated topic labels.

This notebook was run in SageMaker Studio with the **Python 3 (Data Science)** kernel.

Some IAM configuration is required later in the notebook. The following blog post contains useful general guidelines on using Comprehend:

https://zyabkina.com/nlp-with-aws-comprehend-how-to-guide/

In [1]:
import pandas as pd
import numpy as np
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
comprehend = boto3.client('comprehend', region_name=region)
sgmk_session = sagemaker.Session()
sgmk_role = sagemaker.get_execution_role()

Load the data

In [2]:
ctstories = "s3://funnybones/rural/topics/CTstories.csv"
stories = "s3://funnybones/rural/topics/stories.csv"

In [3]:
df1 = pd.read_csv(ctstories)

In [4]:
df2 = pd.read_csv(stories)

In [5]:
df1.head()


Unnamed: 0,id,category,summary,tags,text,title
0,7300876,human,Multicultural Hub Canberra has supported the s...,"['bf-label-advertising-feature', 'story-busine...",Model Akiima was born in the small village of...,Model of success - from refugee to the runways
1,7300648,sport,Justis Huni and Paul Gallen finally went head ...,"['domestic-sports', 'top-sport']",This was poetic Justis at its finest. For all...,Poetic Justis: Huni demolishes Gallen
2,7300577,environment,"Mr Bowen declared ""this is a solar panel, don'...","['news', 'subscriber-only', 'federal-politics'...","It has taken four years, but Labor's Chris Bo...","'This is a solar panel, don't be afraid': Labo..."
3,7300512,arts,"Now, most celebrities fly by private jet with ...","['books', 'signpost-review']",Remember when flying was glamorous and ocean ...,Glamorous travel of yesteryear
4,7300496,sport,Tahlia Tupaea will remain in Canberra next sea...,"['capitals', 'basketball', 'signpost-subscribe...",The second youngest debutant in WNBL history ...,Tupaea locked in for her return to the capital


In [6]:
result = df1

In [7]:
len(result)

1009

In [8]:
text_only = result.loc[:,['text']]

In [9]:
text_only.to_csv("data/text_only.csv", index=False, header=False)
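
Comprehend's `ONE_DOC_PER_LINE` input format, used later in this notebook, treats every physical line of the input file as one document, so article text containing embedded newlines would be split into several documents. A minimal sketch of collapsing whitespace before writing the file, shown on a small made-up frame (the `flatten_whitespace` helper is hypothetical and not part of the original notebook):

```python
import pandas as pd

def flatten_whitespace(series: pd.Series) -> pd.Series:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return series.astype(str).str.replace(r"\s+", " ", regex=True).str.strip()

# Toy data standing in for the 'text' column of the stories frame.
demo = pd.DataFrame({"text": ["First line.\nSecond line.", "  One   line only. "]})
demo["text"] = flatten_whitespace(demo["text"])
print(demo["text"].tolist())
```

Note also that `to_csv` expects the local `data` directory to exist already.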

Upload the text-only file to S3

In [10]:
bucket_name = "funnybones"
bucket_prefix="rural/topics/text"

In [11]:
# Upload the text-only CSV to S3 as input for the Comprehend job
train_uri = sgmk_session.upload_data(
    path="data/text_only.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

In [12]:
train_uri

's3://funnybones/rural/topics/text/text_only.csv'

In [13]:
output_uri="s3://funnybones/rural/topics/model/"

### Data Access

This next part is critical when running Comprehend jobs from SageMaker Studio.

We create a role that grants Comprehend access to the S3 bucket where the data is stored.

We then need to grant our SageMaker execution role permission to pass this role to the Comprehend service.

I added this as an inline policy to my SageMaker execution role:

```
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::320389841409:role/ComprehendS3Access"
    }
}
```
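
The role that is passed to Comprehend also needs a trust policy that lets the Comprehend service assume it, plus permissions on the input and output locations. The sketch below only builds the two policy documents as Python dictionaries. The bucket name comes from this notebook; the exact set of S3 actions is an assumption and should be checked against the Comprehend documentation for your setup.

```python
import json

bucket = "funnybones"  # bucket used throughout this notebook

# Trust policy: allows the Comprehend service principal to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "comprehend.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy (assumed minimal set): read input, write output, list bucket.
s3_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": f"arn:aws:s3:::{bucket}/*"},
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": f"arn:aws:s3:::{bucket}"},
    ],
}

print(json.dumps(trust_policy, indent=4))
```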

In [14]:
comprehend_role = "arn:aws:iam::320389841409:role/ComprehendS3Access"

## Run a topic modelling job

In [15]:
response = comprehend.start_topics_detection_job(
    InputDataConfig={
        'S3Uri': train_uri,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': output_uri
    },
    DataAccessRoleArn=comprehend_role,
    JobName='RuralPress_TopLeve_Topics',
    NumberOfTopics=50
)

In [16]:
job_id = response['JobId']
print(job_id)

3c504344320c90c2b54a8099595dabd5


In [26]:
job_id = "3c504344320c90c2b54a8099595dabd5"

In [27]:
describe_result = comprehend.describe_topics_detection_job(JobId=job_id)

In [28]:
job_status = describe_result['TopicsDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')
if job_status == 'FAILED':
    print(f'Reason: {describe_result["TopicsDetectionJobProperties"]["Message"]}')

Job Status: COMPLETED
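
Topic detection jobs run asynchronously and can take a while, so in practice the status check is repeated until the job reaches a terminal state. A minimal polling sketch follows. `wait_for_job` is a hypothetical helper, the terminal states listed are the ones I would expect from the Comprehend API, and the describe call is passed in as a function so the loop can be tried without an AWS account.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}  # assumed terminal states

def wait_for_job(describe, job_id, poll_seconds=60, max_polls=120):
    """Call describe(JobId=...) until the job reaches a terminal state."""
    for _ in range(max_polls):
        props = describe(JobId=job_id)["TopicsDetectionJobProperties"]
        if props["JobStatus"] in TERMINAL_STATES:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still running after {max_polls} polls")

# Stand-in for comprehend.describe_topics_detection_job, for demonstration only.
_fake_statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])

def fake_describe(JobId):
    return {"TopicsDetectionJobProperties": {"JobId": JobId,
                                             "JobStatus": next(_fake_statuses)}}

props = wait_for_job(fake_describe, "demo-job", poll_seconds=0)
print(props["JobStatus"])
```

With the real client this would be called as `wait_for_job(comprehend.describe_topics_detection_job, job_id)`.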


In [29]:
results_S3Url = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']

results_S3Url

's3://funnybones/rural/topics/model/320389841409-TOPICS-3c504344320c90c2b54a8099595dabd5/output/output.tar.gz'

In [30]:
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = results_S3Url.replace(s3_name, '')
results_aws_filename

'rural/topics/model/320389841409-TOPICS-3c504344320c90c2b54a8099595dabd5/output/output.tar.gz'
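
Stripping the bucket prefix with `str.replace` works here because the bucket name is known in advance. A slightly more general sketch splits any `s3://` URI into bucket and key with the standard library (`split_s3_uri` is a hypothetical helper name):

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri(
    "s3://funnybones/rural/topics/model/"
    "320389841409-TOPICS-3c504344320c90c2b54a8099595dabd5/output/output.tar.gz"
)
print(bucket)
print(key)
```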

In [31]:
# Local file name
local_results_filename = 'results/topics.tar.gz'
# Download the results
s3 = boto3.client('s3')
s3.download_file(bucket_name,
                 results_aws_filename, 
                 local_results_filename)

In [32]:
!tar xzf results/topics.tar.gz -C results
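
Both the download and the `tar` command assume that a local `results` directory already exists. For environments without shell access, the extraction step can also be done in pure Python. The sketch below uses a hypothetical `extract_archive` helper and demonstrates it on a throwaway archive rather than the real job output:

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir, creating it if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))

# Build a tiny archive in a temporary directory and extract it again.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "doc-topics.csv")
    with open(src, "w") as f:
        f.write("docname,topic,proportion\n")
    archive = os.path.join(tmp, "topics.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="doc-topics.csv")
    extracted = extract_archive(archive, os.path.join(tmp, "results"))
    print(extracted)
```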

In [33]:
!ls results

doc-topics.csv	topic-terms.csv  topics.tar.gz
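
The second file, `topic-terms.csv`, is not used in the rest of this notebook, but it is handy for checking what each numbered topic is about. The sketch below assumes the file has `topic`, `term` and `weight` columns (worth confirming against the actual header) and uses made-up rows in place of the real file:

```python
import io
import pandas as pd

# Made-up rows in the assumed topic-terms.csv layout.
sample = io.StringIO(
    "topic,term,weight\n"
    "0,council,0.031\n"
    "0,mayor,0.054\n"
    "1,match,0.047\n"
    "1,coach,0.022\n"
)
topic_terms = pd.read_csv(sample)

# Highest-weighted terms per topic, joined into a readable summary string.
top_terms = (
    topic_terms.sort_values(["topic", "weight"], ascending=[True, False])
    .groupby("topic")["term"]
    .apply(lambda terms: ", ".join(terms.head(10)))
)
print(top_terms.to_dict())
```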


In [34]:
doc_topics = pd.read_csv("results/doc-topics.csv")

In [35]:
doc_topics.head(20)

Unnamed: 0,docname,topic,proportion
0,text_only.csv:30,5,0.562893
1,text_only.csv:30,0,0.339177
2,text_only.csv:30,12,0.052311
3,text_only.csv:30,20,0.045619
4,text_only.csv:65,1,0.408687
5,text_only.csv:65,20,0.34129
6,text_only.csv:65,3,0.21268
7,text_only.csv:65,14,0.021582
8,text_only.csv:65,30,0.010124
9,text_only.csv:65,6,0.005636


In [36]:
len(doc_topics)

5784

For each document, the topic modeller produces a set of proportions over the learned topics. The number of rows per document varies with the number of topics that match it, which is why there are 5784 rows for 1009 documents.

We reduce this data to the single top topic for each document.

In [37]:
top_topics = doc_topics.groupby('docname').first().reset_index()

In [38]:
len(top_topics)

1009
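
`groupby('docname').first()` returns the top topic only because the rows for each document are listed in descending order of `proportion`, as in the output above. A sketch that does not depend on row order, shown on a few made-up rows in the same three-column layout:

```python
import pandas as pd

# Made-up rows, deliberately not sorted by proportion within each document.
doc_topics_demo = pd.DataFrame({
    "docname": ["text_only.csv:0", "text_only.csv:0",
                "text_only.csv:1", "text_only.csv:1"],
    "topic": [3, 7, 2, 5],
    "proportion": [0.2, 0.8, 0.6, 0.4],
})

# For each document, pick the row with the largest proportion.
best_rows = doc_topics_demo.groupby("docname")["proportion"].idxmax()
top_topics_demo = doc_topics_demo.loc[best_rows].reset_index(drop=True)
print(top_topics_demo)
```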

In [39]:
top_topics.head()

Unnamed: 0,docname,topic,proportion
0,text_only.csv:0,1,0.335041
1,text_only.csv:1,20,0.482878
2,text_only.csv:10,1,0.843629
3,text_only.csv:100,23,0.230259
4,text_only.csv:1000,13,0.338228


### Now extract the line number in the original CSV from the docname field

In [40]:
def get_line_no(docname):
    # docname looks like 'text_only.csv:30'; the part after ':' is the line number
    return int(docname.split(':')[1])

top_topics['line'] = top_topics['docname'].apply(get_line_no)


In [41]:
top_topics_sorted = top_topics.sort_values('line', axis=0, ascending=True, inplace=False)

In [42]:
top_topics_sorted

Unnamed: 0,docname,topic,proportion,line
0,text_only.csv:0,1,0.335041,0
1,text_only.csv:1,20,0.482878,1
121,text_only.csv:2,12,0.932177,2
232,text_only.csv:3,16,0.533105,3
343,text_only.csv:4,2,0.953504,4
...,...,...,...,...
8,text_only.csv:1004,16,0.602084,1004
9,text_only.csv:1005,16,0.320456,1005
10,text_only.csv:1006,16,0.301611,1006
11,text_only.csv:1007,15,0.753549,1007


## Join with labelled data

In [45]:
topic_set = result.copy()

topic_set['topic'] = top_topics_sorted['topic']

dataset = topic_set.loc[:,['topic','category','id']]
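
One thing to watch here: pandas aligns a column assignment on index labels rather than on row order, and `sort_values` keeps the index labels the rows had before sorting. The `top_topics_sorted` output above still carries labels 121, 232 and 343 on lines 2, 3 and 4, so the assignment in this cell pairs article 2 with `text_only.csv:10` rather than `text_only.csv:2`. The `evalset.head()` output further down, where row 2 has topic 1 and row 3 has topic 23, is consistent with that. A small sketch of the effect on made-up data, with a positional assignment as one possible remedy (it assumes exactly one topic row per article):

```python
import pandas as pd

articles = pd.DataFrame({"category": list("abcdefghijk")})  # 11 rows, index 0..10

# Grouping by docname yields string order: ':0', ':1', ':10', ':2', ...
docnames = sorted(f"text_only.csv:{i}" for i in range(11))
topics = pd.DataFrame({"docname": docnames})
topics["line"] = topics["docname"].str.split(":").str[1].astype(int)
topics["topic"] = topics["line"] * 100  # made-up topic ids that encode the line

topics_sorted = topics.sort_values("line")  # row order changes, labels do not

# Label-aligned assignment: article 2 receives the topic of line 10.
by_label = articles.assign(topic=topics_sorted["topic"])

# Positional assignment: article i receives the topic of line i.
by_position = articles.assign(topic=topics_sorted["topic"].to_numpy())

print(by_label["topic"].tolist()[:4])
print(by_position["topic"].tolist()[:4])
```

Applied to this notebook, that would mean `topic_set['topic'] = top_topics_sorted['topic'].to_numpy()`, after which the accuracy figure at the end would need to be recomputed.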

In [60]:
evalset = dataset.loc[0:60].copy()

In [48]:
len(evalset)

61

In [49]:
trainset = dataset.loc[61:]

In [51]:
trainset.head()

Unnamed: 0,topic,category,id
61,8,arts,7297974
62,4,realestate,7297970
63,1,politics,7297894
64,5,environment,7297854
65,17,health,7297801


In [52]:
# Collect the human-assigned categories seen for each topic in the training split
topics = {}

for index, row in trainset.iterrows():
    topics.setdefault(row['topic'], []).append(row['category'])

In [54]:
from collections import Counter

# Return the most frequent element in a list
# (ties go to the element encountered first)
def most_frequent(items):
    return Counter(items).most_common(1)[0][0]

In [56]:
topic_labels = {}

for key in topics.keys():
    topic_labels[key] = most_frequent(topics[key])

In [58]:
evalset.head()

Unnamed: 0,topic,category,id
0,1,human,7300876
1,20,sport,7300648
2,1,environment,7300577
3,23,arts,7300512
4,13,sport,7300496


In [61]:
evalset['pred'] = evalset.topic.map(topic_labels)

In [62]:
evalset.head()

Unnamed: 0,topic,category,id,pred
0,1,human,7300876,politics
1,20,sport,7300648,sport
2,1,environment,7300577,politics
3,23,arts,7300512,sport
4,13,sport,7300496,arts


In [63]:
evalset['correct'] = (evalset['category']==evalset['pred'])

In [64]:
sum(evalset['correct'])/len(evalset)

0.18032786885245902
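
A single accuracy number hides which categories are being confused with each other. A sketch of a per-category breakdown with `pd.crosstab`, using made-up labels rather than the notebook's data. The `"unknown"` placeholder is an assumption for topics that never appear in the training split and therefore map to a missing prediction:

```python
import pandas as pd

demo = pd.DataFrame({
    "category": ["sport", "sport", "arts", "politics", "arts"],
    "pred":     ["sport", "arts",  "arts", "politics", None],
})

# Topics absent from the training split have no label; mark them explicitly.
demo["pred"] = demo["pred"].fillna("unknown")

# Rows are the human categories, columns are the predicted labels.
confusion = pd.crosstab(demo["category"], demo["pred"])
accuracy = (demo["category"] == demo["pred"]).mean()

print(confusion)
print(f"accuracy: {accuracy:.2f}")
```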