# Topic Labelling Sub Categories with Comprehend

In this notebook we explore how well the Amazon Comprehend topic modeller can discover sub-categories within our labelled dataset of news articles.

We take only the sport articles and run topic modelling on them to generate sub-categories.

This notebook was run in SageMaker Studio with the **Python 3 (Data Science)** kernel.

There is some IAM configuration that needs to be done later in the notebook. This blog contains some useful general guidelines on using Comprehend.

In [1]:
import pandas as pd
import numpy as np
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
comprehend = boto3.client('comprehend', region_name=region)
sgmk_session = sagemaker.Session()
sgmk_role = sagemaker.get_execution_role()

Load the data

In [2]:
ctstories = "s3://funnybones/rural/topics/CTstories.csv"
stories = "s3://funnybones/rural/topics/stories.csv"

In [3]:
df1 = pd.read_csv(ctstories)

In [4]:
df2 = pd.read_csv(stories)

In [5]:
df1.head()


Unnamed: 0,id,category,summary,tags,text,title
0,7300876,human,Multicultural Hub Canberra has supported the s...,"['bf-label-advertising-feature', 'story-busine...",Model Akiima was born in the small village of...,Model of success - from refugee to the runways
1,7300648,sport,Justis Huni and Paul Gallen finally went head ...,"['domestic-sports', 'top-sport']",This was poetic Justis at its finest. For all...,Poetic Justis: Huni demolishes Gallen
2,7300577,environment,"Mr Bowen declared ""this is a solar panel, don'...","['news', 'subscriber-only', 'federal-politics'...","It has taken four years, but Labor's Chris Bo...","'This is a solar panel, don't be afraid': Labo..."
3,7300512,arts,"Now, most celebrities fly by private jet with ...","['books', 'signpost-review']",Remember when flying was glamorous and ocean ...,Glamorous travel of yesteryear
4,7300496,sport,Tahlia Tupaea will remain in Canberra next sea...,"['capitals', 'basketball', 'signpost-subscribe...",The second youngest debutant in WNBL history ...,Tupaea locked in for her return to the capital


In [6]:
result = pd.concat([df1,df2])

In [7]:
len(result)

1375

In [8]:
sport_only = result[ result['category']=='sport'].copy()

In [9]:
len(sport_only)

208

In [10]:
text_only = sport_only.loc[:,['text']]

In [11]:
text_only.to_csv("data/sport_text_only.csv", index=False, header=False)
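
The topic modelling job later in this notebook reads its input as `ONE_DOC_PER_LINE`, so each physical line is treated as one document. If an article contains embedded line breaks, `to_csv` writes it as a quoted multi-line field and it may be split into several documents. We have not checked whether that happens in this dataset, so the following is only a minimal sketch of one way to guard against it. The helper `flatten_document` and the sample strings are hypothetical.

```python
import re


def flatten_document(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", str(text)).strip()


# Hypothetical stand-ins for values in the `text` column of `sport_only`.
sample_texts = [
    "First paragraph.\n\nSecond paragraph.",
    "Already on one line.",
]

flattened = [flatten_document(t) for t in sample_texts]

# Joining with "\n" now yields exactly one physical line per document.
one_doc_per_line = "\n".join(flattened)
print(one_doc_per_line)
```

Applied to the real data, this would be something like `sport_only['text'].map(flatten_document)` before the file is written.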

Upload the text-only file to S3

In [12]:
bucket_name = "funnybones"
bucket_prefix="rural/topics/text"

In [13]:
# Upload the text-only CSV to S3 as input for the Comprehend topic detection job
train_uri = sgmk_session.upload_data(
    path="data/sport_text_only.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

In [14]:
train_uri

's3://funnybones/rural/topics/text/sport_text_only.csv'

In [15]:
output_uri="s3://funnybones/rural/topics/sport_only_model/"

### Data Access

This next part is critical to using Comprehend to build models inside SageMaker Studio.

We create a role that grants Comprehend access to the S3 buckets where the data will be.

We then need to grant our SageMaker execution role permission to pass this role to the Comprehend service.

I added this as an inline policy to my SageMaker execution role:

```
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::320389841409:role/ComprehendS3Access"
    }
}
```
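
The notebook does not show what the `ComprehendS3Access` role itself contains. As a rough sketch, such a role usually needs a trust policy naming the Comprehend service and a permissions policy for the bucket. The two helper functions below are hypothetical and only build the policy documents as Python dictionaries. The exact S3 actions are an assumption, and nothing here calls AWS.

```python
import json


def comprehend_trust_policy():
    """Trust policy that lets the Comprehend service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "comprehend.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_access_policy(bucket):
    """Permissions policy (assumed): read input from and write output to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


trust = comprehend_trust_policy()
access = s3_access_policy("funnybones")
print(json.dumps(trust, indent=2))
print(json.dumps(access, indent=2))
```

The printed JSON could be pasted into the IAM console when creating the role. Scoping the resources to specific prefixes rather than the whole bucket would be tighter.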

In [16]:
comprehend_role = "arn:aws:iam::320389841409:role/ComprehendS3Access"

# Run a topic modelling job

In [27]:
response = comprehend.start_topics_detection_job(
    InputDataConfig={
        'S3Uri': train_uri,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': output_uri
    },
    DataAccessRoleArn=comprehend_role,
    JobName='RuralPress_Sport_SubTopics',
    NumberOfTopics=10
)

In [28]:
job_id = response['JobId']
print(job_id)

c656e330eb67e36d55b71fe48ea07da5


In [72]:
job_id = "c656e330eb67e36d55b71fe48ea07da5"

In [73]:
describe_result = comprehend.describe_topics_detection_job(JobId=job_id)

In [74]:
job_status = describe_result['TopicsDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')
if job_status == 'FAILED':
    print(f'Reason: {describe_result["TopicsDetectionJobProperties"]["Message"]}')

Job Status: COMPLETED
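
Topic detection jobs run asynchronously, so the status cell above has to be re-run by hand until the job finishes. A small polling loop is one alternative. The sketch below is an assumption about how that could look, not part of the original run. `wait_for_topics_job` is a hypothetical helper, and `FakeComprehend` is a stand-in client so that the example runs without AWS.

```python
import time


def wait_for_topics_job(client, job_id, poll_seconds=60, max_polls=120):
    """Poll describe_topics_detection_job until the job leaves its running states."""
    running = {"SUBMITTED", "IN_PROGRESS", "STOP_REQUESTED"}
    for _ in range(max_polls):
        props = client.describe_topics_detection_job(JobId=job_id)[
            "TopicsDetectionJobProperties"
        ]
        if props["JobStatus"] not in running:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still running after {max_polls} polls")


class FakeComprehend:
    """Stand-in client for illustration; replays a scripted list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_topics_detection_job(self, JobId):
        status = next(self._statuses)
        return {"TopicsDetectionJobProperties": {"JobId": JobId, "JobStatus": status}}


fake = FakeComprehend(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
final = wait_for_topics_job(fake, "example-job-id", poll_seconds=0)
print(final["JobStatus"])
```

With the real client this would be called as `wait_for_topics_job(comprehend, job_id)`.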


In [75]:
# Reuse the response fetched above rather than calling the API again
results_S3Url = describe_result['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
results_S3Url

's3://funnybones/rural/topics/sport_only_model/320389841409-TOPICS-c656e330eb67e36d55b71fe48ea07da5/output/output.tar.gz'

In [76]:
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = results_S3Url.replace(s3_name, '')
results_aws_filename

'rural/topics/sport_only_model/320389841409-TOPICS-c656e330eb67e36d55b71fe48ea07da5/output/output.tar.gz'
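
Stripping the `s3://bucket/` prefix with `str.replace` works here because the bucket name is known in advance. A slightly more general option is to parse the URI with the standard library. This is a minimal sketch, and `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


example_uri = (
    "s3://funnybones/rural/topics/sport_only_model/"
    "320389841409-TOPICS-c656e330eb67e36d55b71fe48ea07da5/output/output.tar.gz"
)
bucket, key = split_s3_uri(example_uri)
print(bucket)
print(key)
```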

In [77]:
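import os

# Local file name
local_results_filename = 'results/topics.tar.gz'
# Make sure the local results directory exists before downloading
os.makedirs(os.path.dirname(local_results_filename), exist_ok=True)
# Download the results
s3 = boto3.client('s3')
s3.download_file(bucket_name,
                 results_aws_filename,
                 local_results_filename)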

In [78]:
!tar xzf results/topics.tar.gz -C results

In [79]:
!ls results

doc-topics.csv	topic-terms.csv  topics.tar.gz


In [80]:
doc_topics = pd.read_csv("results/doc-topics.csv")

In [81]:
topic_terms = pd.read_csv("results/topic-terms.csv")

In [118]:
topic_terms.head(20)

Unnamed: 0,topic,term,weight,scarcity
0,0,raider,0.020368,0.899892
1,0,play,0.019529,0.715887
2,0,origin,0.011891,0.807753
3,0,game,0.016727,0.685234
4,0,nrl,0.011083,0.926603
5,0,queensland,0.007678,0.857674
6,0,nsw,0.00735,0.840745
7,0,government,0.007444,0.692601
8,0,wighton,0.006682,1.0
9,0,blue,0.006614,0.728256


## First remove all of the duplicate terms

In [90]:
# Flag each row where the term has its highest weight across all topics
idx = topic_terms.groupby('term')['weight'].transform('max') == topic_terms['weight']


In [92]:
reduced = topic_terms[idx]
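
To make the de-duplication step concrete, here is a toy version with made-up weights. A term that appears in several topics is kept only in the topic where its weight is highest.

```python
import pandas as pd

# Hypothetical topic/term weights; "play" appears in both topics.
terms = pd.DataFrame(
    {
        "topic": [0, 0, 1, 1],
        "term": ["raider", "play", "play", "player"],
        "weight": [0.020, 0.019, 0.012, 0.018],
    }
)

# transform("max") broadcasts each term's maximum weight back onto every row,
# so the comparison is True only for the row holding that maximum.
is_best = terms.groupby("term")["weight"].transform("max") == terms["weight"]
deduped = terms[is_best].copy()
print(deduped)
```

In this toy data "play" survives only in topic 0, where its weight of 0.019 beats 0.012.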


In [None]:
# ORIGINAL ATTEMPT - HIGHEST WEIGHTED TOPIC - CONTAINS DUPS

#topic_reduce = topic_terms.groupby('term').max().reset_index()

In [103]:
topic_labels = reduced.groupby('topic').first().reset_index()

# Examine

Taking the first remaining term for each topic gives a list of labels that contains many generic words, such as "play", "player", "season" and "year".

In [106]:
topic_labels.head(10)

Unnamed: 0,topic,term,weight
0,0,play,0.019529
1,1,player,0.018634
2,2,raider,0.040471
3,3,minute,0.021406
4,4,olympic,0.023014
5,5,maffra,0.013018
6,6,season,0.022845
7,7,year,0.025633
8,8,train,0.064504
9,9,brumbies,0.100248


In [107]:
!pip install texturizer

Collecting texturizer
  Downloading texturizer-0.1.8.tar.gz (3.3 MB)
Collecting jellyfish
  Downloading jellyfish-0.8.2-cp37-cp37m-manylinux2014_x86_64.whl (90 kB)
Collecting textdistance
  Downloading textdistance-4.2.1-py3-none-any.whl (28 kB)
Collecting spacy
  Downloading spacy-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)
Collecting textblob
  Downloading textblob-0.15.3-py2.py3-none-any.whl (636 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp37-cp37m-manylinux2014_x86_64.whl (35 kB)
Collecting spacy-legacy<3.1.0,>=3.0.7
  Downloading spacy_legacy-3.0

In [108]:
import texturizer as txzr



In [136]:
import string
import re

def scarce(x):
    # Return zero scarcity for any term that contains a digit
    if bool(re.search(r'\d', x)):
        return 0.0
    # Strip punctuation before looking up the term's scarcity
    x = x.translate(str.maketrans('', '', string.punctuation))
    return txzr.scarcity.get_scarcity(x)

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
reduced = reduced.assign(scarcity=reduced['term'].map(scarce))


In [137]:
reduced.head()

Unnamed: 0,topic,term,weight,scarcity
1,0,play,0.019529,0.715887
2,0,origin,0.011891,0.807753
4,0,nrl,0.011083,0.926603
5,0,queensland,0.007678,0.857674
6,0,nsw,0.00735,0.840745


In [138]:
# Flag the scarcest remaining term within each topic
idx2 = reduced.groupby('topic')['scarcity'].transform('max') == reduced['scarcity']


In [139]:
labels2 = reduced[idx2]

In [140]:
labels2

Unnamed: 0,topic,term,weight,scarcity
8,0,wighton,0.006682,1.0
13,1,mckellar,0.010683,0.950595
24,2,bronco,0.00924,0.918922
35,3,belconnen,0.010263,0.950675
42,4,huni,0.011203,1.0
50,5,maffra,0.013018,0.963513
68,6,goriss,0.008885,1.0
79,7,covid,0.007525,1.0
88,8,maiden,0.012742,0.87804
90,9,brumbies,0.100248,0.956887
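
Selecting rows whose scarcity equals the per-topic maximum returns more than one row for a topic if several terms tie, for example at a scarcity of 1.0. One way to guarantee a single label per topic is to break ties by weight. This is a sketch on made-up values, not a result from the job above.

```python
import pandas as pd

# Hypothetical scored terms; topic 1 has two terms tied on scarcity.
scored = pd.DataFrame(
    {
        "topic": [0, 0, 1, 1],
        "term": ["play", "wighton", "huni", "gallen"],
        "scarcity": [0.72, 1.0, 1.0, 1.0],
        "weight": [0.0195, 0.0067, 0.0112, 0.0090],
    }
)

# Rank within each topic by scarcity first, then by weight to break ties.
ranked = scored.sort_values(
    ["topic", "scarcity", "weight"], ascending=[True, False, False]
)

# Keep the top-ranked term per topic and turn it into a topic -> label mapping.
one_label = ranked.groupby("topic").head(1).set_index("topic")["term"].to_dict()
print(one_label)
```

The resulting dictionary can then be used to map topic numbers to readable sub-category names.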
