# Text Feature Processing

In this example we augment a dataset with NLP-derived text features using the Amazon Comprehend APIs.

Note: To execute this notebook, your role needs the following policy attached: ComprehendFullAccess

In [2]:
!wget -O data/smsspamcollection.zip https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip

--2021-06-18 05:31:21--  https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip
Resolving www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)... 143.106.12.20
Connecting to www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)|143.106.12.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210521 (206K) [application/zip]
Saving to: ‘data/smsspamcollection.zip’


2021-06-18 05:31:26 (80.1 KB/s) - ‘data/smsspamcollection.zip’ saved [210521/210521]



In [15]:
!ls -la data

total 1472
drwxr-xr-x 2 root root   6144 Jun 18 05:42 .
drwxr-xr-x 7 root root   6144 Jun 18 05:41 ..
-rwxr-xr-x 1 root root 477907 Mar 16  2011 SMSSpamCollection.txt
-rw-r--r-- 1 root root   5868 Apr 18  2011 readme
-rw-r--r-- 1 root root 129103 Jun 18 04:08 score.csv
-rw-r--r-- 1 root root 210521 Jul  3  2015 smsspamcollection.zip
-rw-r--r-- 1 root root 132691 Jun 16 12:22 test.csv
-rw-r--r-- 1 root root 396174 Jun 16 12:22 train.csv
-rw-r--r-- 1 root root 132688 Jun 16 12:22 validation.csv


In [13]:
!unzip data/smsspamcollection.zip -d data

Archive:  data/smsspamcollection.zip
  inflating: data/readme             
  inflating: data/SMSSpamCollection.txt  


In [14]:
!head data/SMSSpamCollection.txt

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like aids patent.
ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam	H

In [16]:
import pandas as pd
import numpy as np

In [26]:
df = pd.read_csv("data/SMSSpamCollection.txt", delimiter="\t", header=None)
df.columns = ["class","text"]

In [27]:
df.head()

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [28]:
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
comprehend = boto3.client('comprehend', region_name=region)

# Comprehend API for a single block of text

In these examples we use Comprehend to extract key phrases, entities, and sentiment from a single block of text.

In [30]:
sample_tweet="It’s always a great day when I can randomly put my equestrian knowledge to good use at work! #AWS #BePeculiar"   

# Key phrases
phrases = comprehend.detect_key_phrases(Text=sample_tweet, LanguageCode='en')

# Entities
entities = comprehend.detect_entities(Text=sample_tweet, LanguageCode='en')

# Sentiment
sentiments = comprehend.detect_sentiment(Text=sample_tweet, LanguageCode='en')


# Print the phrases:
print('------- phrases ---------')
for phrase in phrases['KeyPhrases']:
    print(phrase['Text'])


# Print the entities with entity type:
print('------- entity : entity type ---------')
for entity in entities['Entities']:
    print(entity['Text'] + ' : ' + entity['Type'])

# Print the sentiment:
print('------- sentiment ---------')
print(sentiments['Sentiment'])

------- phrases ---------
a great day
my equestrian knowledge
good use
work
------- entity : entity type ---------
#AWS : ORGANIZATION
#BePeculiar : TITLE
------- sentiment ---------
POSITIVE
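
To use these responses as model inputs, they need to be flattened into columns. The following is a minimal sketch of one way to do that, not the notebook's own pipeline: the helper name `extract_text_features` and the feature names it emits are hypothetical, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
def extract_text_features(client, text, language_code="en"):
    """Return a flat dict of Comprehend-derived features for one text block.

    `client` is any object exposing the three detect_* methods used above,
    for example boto3.client("comprehend"). Feature names are illustrative.
    """
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)

    features = {
        "num_key_phrases": len(phrases["KeyPhrases"]),
        "num_entities": len(entities["Entities"]),
        "sentiment": sentiment["Sentiment"],
    }
    # SentimentScore holds one confidence value per sentiment class.
    for label, score in sentiment["SentimentScore"].items():
        features[f"sentiment_{label.lower()}"] = score
    return features
```

With the client created earlier, a call would look like `extract_text_features(comprehend, sample_tweet)`. Note that this makes three API requests per text block.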


## Language detection

It is also possible to detect the language first and then pass the detected language code to subsequent Comprehend API calls to obtain additional NLP features.

In [32]:
lan = comprehend.detect_dominant_language(Text=sample_tweet)

In [37]:
language = lan['Languages'][0]['LanguageCode']

In [42]:
language

'en'
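
The two steps can be chained so that the detected code feeds the next call. The sketch below is one possible arrangement under stated assumptions: the helper names `detect_language_code` and `detect_sentiment_auto` are hypothetical, and the fallback to `"en"` is an arbitrary choice. It selects the language with the highest `Score` explicitly instead of relying on the order of the `Languages` list.

```python
def detect_language_code(client, text, default="en"):
    """Return the code of the highest-scoring detected language, or `default`."""
    response = client.detect_dominant_language(Text=text)
    languages = response.get("Languages", [])
    if not languages:
        return default
    # Pick by Score explicitly rather than relying on list order.
    best = max(languages, key=lambda lang: lang["Score"])
    return best["LanguageCode"]


def detect_sentiment_auto(client, text):
    """Detect the language first, then request sentiment in that language."""
    code = detect_language_code(client, text)
    return client.detect_sentiment(Text=text, LanguageCode=code)
```

For example, `detect_sentiment_auto(comprehend, sample_tweet)`. Language detection recognizes more languages than the other operations accept, so a production version would likely need to check the detected code against the supported list before calling them.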

# Batch Jobs

Comprehend provides separate APIs for scoring batches of text blocks in a single request:

https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectSentiment.html

In [43]:
rez = comprehend.batch_detect_sentiment(TextList=['happy days', 'not feeling good'], LanguageCode='en')

In [44]:
rez

{'ResultList': [{'Index': 0,
   'Sentiment': 'POSITIVE',
   'SentimentScore': {'Positive': 0.9528746604919434,
    'Negative': 0.0026089251041412354,
    'Neutral': 0.037291351705789566,
    'Mixed': 0.007225051987916231}},
  {'Index': 1,
   'Sentiment': 'NEGATIVE',
   'SentimentScore': {'Positive': 0.0058039030991494656,
    'Negative': 0.956430196762085,
    'Neutral': 0.014397521503269672,
    'Mixed': 0.023368341848254204}}],
 'ErrorList': [],
 'ResponseMetadata': {'RequestId': 'f863b124-acb1-4d3a-8109-310ac6b9afde',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f863b124-acb1-4d3a-8109-310ac6b9afde',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '384',
   'date': 'Fri, 18 Jun 2021 06:49:54 GMT'},
  'RetryAttempts': 0}}
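
Each entry in `ResultList` carries an `Index` that points back to its position in `TextList`, and documents that could not be processed are reported in `ErrorList` instead. A minimal sketch of scoring a longer list in chunks follows. The helper name `batch_sentiment` is hypothetical, and the default `batch_size=25` is an assumption based on the per-request document limit stated in the API reference linked above; verify it against the current documentation.

```python
def batch_sentiment(client, texts, language_code="en", batch_size=25):
    """Score `texts` in chunks and return one sentiment label per input.

    Entries that the service reports in ErrorList are left as None.
    batch_size=25 is an assumed per-request limit, not a verified constant.
    """
    labels = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        response = client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code
        )
        for result in response["ResultList"]:
            # Index is the position within the chunk, so offset by `start`.
            labels[start + result["Index"]] = result["Sentiment"]
    return labels
```

Applied to the dataset loaded earlier, this could be used as `df["sentiment"] = batch_sentiment(comprehend, df["text"].tolist())`. This issues one request per chunk, so the full dataset results in many billable API calls.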

# NLP Feature Engineering Pipeline

Now that we understand the fundamentals of the Comprehend APIs, let's build a function that uses them for 