**Objective**

This notebook shows how to understand and use the Amazon Comprehend service.

In [1]:
import re

import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role, ModelPackage

# Some functions for basic data cleaning

I am using data from an Analytics Vidhya competition to explore Amazon Comprehend.

**Preprocessing Step**

1. Remove hashtags
2. Remove URLs
3. Remove punctuation

In [2]:
def get_hashtags(text):
    """
        Extract the hashtags from a text
        Regular expressions: r"#(\w+)"
        
        Args:
            text: input text (str)
        Returns:
            hashtags : list
            
        >>> text = ""
        >>> get_hashtags(text)
        >>>
    
    """
    return re.findall(r"#(\w+)", text)

def detecturl(text):
    """
        Detect the urls from a text
        Regular expressions: "(?P<url>https?://[^\s]+)"
        
        Args:
            text: input text (str)
        Returns:
            url : str
            
        >>> text = ""
        >>> detecturl(text)
        >>>
    
    """
    # Raw string avoids an invalid-escape warning; no match returns ''
    match = re.search(r"(?P<url>https?://[^\s]+)", text)
    return match.group("url") if match else ''
    
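def remove_hashtags(text,hashlist):
    """
        Remove the hashtags from the text
        
        Args:
            text: input text (str)
            hashlist : output of get_hashtags function (list)
        Returns:
            text : cleaned text (str)
            
        >>> text = ""
        >>> remove_hashtags(text, get_hashtags(text))
        ''
    
    """
    # Remove longer tags first so "#apple" does not clip "#applewatch",
    # and include the '#' so the same word elsewhere in the text is kept.
    for k in sorted(hashlist, key=len, reverse=True):
        text = text.replace('#' + k,'')
    return text

def clean_punctuation(text):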
    """
        Remove the punctuations from the text
        Regular expressions: r'[^\w\s]'
        
        Args:
            text: input text (str)
        Returns:
            text : cleaned text (str)
            
        >>> text = ""
        >>> clean_punctuation(text)
        >>>
    
    """

    return re.sub(r'[^\w\s]','',text)

In [3]:
data = pd.read_csv('NLP_train_data.csv')
data['hashtags'] = data.tweet.apply(get_hashtags)
data['urls'] = data.tweet.apply(detecturl)
data['cleantext'] = data[['tweet','urls']].apply(lambda x: x['tweet'].replace(x['urls'],''), axis=1)
data['cleantext'] = data[['cleantext','hashtags']].apply(lambda x: remove_hashtags(x['cleantext'],x['hashtags']), axis=1)
data['cleantext'] = data.cleantext.apply(clean_punctuation)
# split() with no argument collapses repeated spaces, so empty strings are not counted as words
data['TextLength'] = data.cleantext.apply(lambda x: len(x.split()))
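
To make the effect of the cleaning steps concrete, here is a minimal standalone sketch applied to a single made-up tweet. The helper name `clean_tweet` and the sample text are hypothetical and not part of the competition data; the regular expressions mirror the ones used in the functions above.

```python
import re

def clean_tweet(text):
    """Strip the first URL, all hashtags, and punctuation from a tweet."""
    url = re.search(r"https?://\S+", text)
    if url:
        text = text.replace(url.group(0), "")
    text = re.sub(r"#\w+", "", text)      # drop hashtags, including the '#'
    text = re.sub(r"[^\w\s]", "", text)   # drop remaining punctuation
    return text

# Hypothetical tweet, used only for illustration
sample = "Loving my new #iphone! Details: https://example.com/x #apple"
cleaned = clean_tweet(sample)
word_count = len(cleaned.split())   # split() ignores repeated spaces
print(repr(cleaned), word_count)
```

Removing tokens leaves runs of spaces behind, which is why the word count here uses `split()` without an argument.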

Set your AWS access key with the AWS CLI or in a configuration file before calling the service.

!aws configure set aws_access_key_id '{your id}'

!aws configure set aws_secret_access_key '{your key}'


**Comprehend Service**

All JSON output is written to the jsonoutput_comprehend folder.

In [4]:
comprehend = boto3.client(service_name='comprehend',region_name = 'us-east-1')

In [5]:
import os

# Create the output folder if it does not already exist
os.makedirs('jsonoutput_comprehend', exist_ok=True)

In [6]:
text = data.cleantext.iloc[4]
print(text)

What amazing service Apple wont even talk to me about a question I have unless I pay them 1995 for their stupid support


# Entity Extraction - Single Content

In [7]:
def extract(jsonfile,key):
    """
        This function helps to extract relevant contents for 4 different methods.
        
        1. Entity extraction key : 'Entities'
        2. Key phrase extraction key : 'KeyPhrases'
        3. Sentiment analyzer key: 'SentimentScore'
        4. Syntax analyzer key: 'SyntaxTokens'
        
        Supported AWS comprehend methods:
        
            1. detect_entities
            2. detect_key_phrases
            3. detect_sentiment
            4. detect_syntax
        
        Args:
        
            jsonfile : response returned by a Comprehend call (dict)
            key : key of the json file for extraction
        
        
        >>> entity_detection_json = comprehend.detect_entities(Text = text,LanguageCode = 'en')
        >>> extract(entity_detection_json,'Entities')
        >>> Detected_Entities : Pandas Format
            Score    Type    Text    BeginOffset    EndOffset
            0.991983    ORGANIZATION    Apple    21    26
            0.999853    DATE    1995    90    94
    
    """
    if jsonfile['ResponseMetadata']['HTTPStatusCode'] == 200:
        if isinstance(jsonfile[key],dict):
            return pd.DataFrame([jsonfile[key]])
        else:
            return pd.DataFrame(jsonfile[key])
    else:
        return "Response code is not 200!"

In [8]:
import json

entity_detection_json = comprehend.detect_entities(
    Text = text,LanguageCode = 'en')
with open("jsonoutput_comprehend/entity_detection_json.json","w") as js:
    json.dump(entity_detection_json,js)

print('\n')
print("Detected_Entities : Pandas Format")
extract(entity_detection_json,'Entities')



Detected_Entities : Pandas Format


Unnamed: 0,Score,Type,Text,BeginOffset,EndOffset
0,0.991983,ORGANIZATION,Apple,21,26
1,0.999853,DATE,1995,90,94
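
The `BeginOffset` and `EndOffset` columns can be read as a half-open character range into the input text. The sketch below checks this against the two entities from the table above, using plain Python slicing; it assumes plain ASCII text as in this example.

```python
text = ("What amazing service Apple wont even talk to me about a question "
        "I have unless I pay them 1995 for their stupid support")

# Entities copied from the table above (scores omitted).
entities = [
    {"Type": "ORGANIZATION", "Text": "Apple", "BeginOffset": 21, "EndOffset": 26},
    {"Type": "DATE", "Text": "1995", "BeginOffset": 90, "EndOffset": 94},
]

# Slice the original text with each entity's offsets
spans = [text[e["BeginOffset"]:e["EndOffset"]] for e in entities]
for e, span in zip(entities, spans):
    print(e["Type"], repr(span), span == e["Text"])
```

Slicing this way is useful for highlighting or masking entities in the original string.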


# Key Phrase Extraction - Single Content

In [9]:
phrase_detection_json = comprehend.detect_key_phrases(
    Text = text,LanguageCode = 'en')
with open("jsonoutput_comprehend/phrase_detection_json.json","w") as js:
    json.dump(phrase_detection_json,js)
print("Key Phrases  - Pandas format")
extract(phrase_detection_json,'KeyPhrases')

Key Phrases  - Pandas format


Unnamed: 0,Score,Text,BeginOffset,EndOffset
0,0.995883,amazing service,5,20
1,0.579024,Apple,21,26
2,1.0,a question,54,64
3,0.999994,1995,90,94


# Sentiment Extraction - Single Content

In [10]:
sentiment_detection_json = comprehend.detect_sentiment(
    Text = text,LanguageCode = 'en')
with open("jsonoutput_comprehend/sentiment_detection_json.json","w") as js:
    json.dump(sentiment_detection_json,js)
print("Sentiment scores  - Pandas format")
extract(sentiment_detection_json,'SentimentScore')

Sentiment scores  - Pandas format


Unnamed: 0,Positive,Negative,Neutral,Mixed
0,0.761463,0.098101,0.051443,0.088993
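
The four scores in the table sum to 1, so the predicted label is simply the one with the highest score. A small sketch using the values above (the variable names are illustrative); the `detect_sentiment` response also carries this label directly in its `Sentiment` field.

```python
# Scores copied from the SentimentScore table above.
sentiment_score = {
    "Positive": 0.761463,
    "Negative": 0.098101,
    "Neutral": 0.051443,
    "Mixed": 0.088993,
}

# Pick the label with the highest score
label = max(sentiment_score, key=sentiment_score.get)
total = sum(sentiment_score.values())
print(label, round(total, 6))
```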


# Syntax tokens (PoS) Extraction - Single Content

In [11]:
syntax_detection_json = comprehend.detect_syntax(
    Text = text,LanguageCode = 'en')
with open("jsonoutput_comprehend/syntax_detection_json.json","w") as js:
    json.dump(syntax_detection_json,js)
print("Syntax tokens  - Pandas format")
extract(syntax_detection_json,'SyntaxTokens')

Syntax tokens  - Pandas format


Unnamed: 0,TokenId,Text,BeginOffset,EndOffset,PartOfSpeech
0,1,What,0,4,"{'Tag': 'PRON', 'Score': 0.7429221868515015}"
1,2,amazing,5,12,"{'Tag': 'ADJ', 'Score': 0.9928527474403381}"
2,3,service,13,20,"{'Tag': 'NOUN', 'Score': 0.9986752867698669}"
3,4,Apple,21,26,"{'Tag': 'PROPN', 'Score': 0.9693880081176758}"
4,5,wont,27,31,"{'Tag': 'VERB', 'Score': 0.9332540035247803}"
5,6,even,32,36,"{'Tag': 'ADV', 'Score': 0.9979731440544128}"
6,7,talk,37,41,"{'Tag': 'VERB', 'Score': 0.9342613220214844}"
7,8,to,42,44,"{'Tag': 'ADP', 'Score': 0.9995973706245422}"
8,9,me,45,47,"{'Tag': 'PRON', 'Score': 0.9999958276748657}"
9,10,about,48,53,"{'Tag': 'ADP', 'Score': 0.9659994840621948}"


# Batch Processing

All of the APIs used above also have batch counterparts. **Maximum size of a batch = 25**.

All batch operations except language detection expect every text in the batch to be in the same language.
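
Because a batch holds at most 25 texts, a longer list has to be split before it is sent. A minimal sketch of such a splitter follows; the helper name `chunked` and the placeholder documents are assumptions made for illustration.

```python
def chunked(items, size=25):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = [f"document {i}" for i in range(60)]   # placeholder documents
batches = list(chunked(texts))
print([len(b) for b in batches])   # [25, 25, 10]
```

Each batch could then be passed as `TextList` to one of the batch calls. Note that the `Index` in each result is relative to its own batch, so the batch start position would need to be added to recover the position in the full list.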

In [12]:
text_list = data.cleantext.sample(25).apply(lambda x: x.strip()).values.tolist()
print(text_list)

['That  tho   love', 'apple are you seriously so afraid of your profit going into someone elses pocket that you make it to where other chargers arent supported My phone is at 18 and my POS apple charger broke a month after getting the phone Good job Apple you', 'Playing start the party         Carlos A House', 'will be rolling out the   at s day', 'Yes Totally unexpected  Thank you ninong Vicpaul for the    Cant wait for Saturday to get my gift', 'Day 21  my favorite  sammy doo       B River', 'What more do you need', 'So my life is', 'I am so happy i dont have an', 'tjitjil Seriously Mbaak O Noted in mind next time I chew apple p re apel bawang kentang', 'my phone is officially broken i can hear it go off but the screen doesnt work the keyboard is okay though', 'to announce we will have our     on hand  hesterstfair', 'STILL NOT OVER IT        sia', 'now i do love u', 'What would my night be if I didnt drop my phone on my face in bed every night iphone', 'I just that modest  button of

# Entity Extraction - Batch Processing

In [13]:
def get_batch_df(js,key):
    """
        This function helps to extract relevant contents for 4 different methods for batch processing.
        
        1. Entity extraction key : 'Entities'
        2. Key phrase extraction key : 'KeyPhrases'
        3. Sentiment analyzer key: 'SentimentScore'
        4. Syntax analyzer key: 'SyntaxTokens'
        
        Supported AWS comprehend methods:
        
            1. batch_detect_entities
            2. batch_detect_key_phrases
            3. batch_detect_sentiment
            4. batch_detect_syntax
        
        Args:
        
            js : one element of the 'ResultList' returned by a comprehend batch call (dict)
            key : key of the json file for extraction
        
        
        >>> out_batch_entities = comprehend.batch_detect_entities(TextList = text_list,LanguageCode = 'en')
        >>> get_batch_df(out_batch_entities['ResultList'][0],'Entities')

    
    """

    # Use the requested key instead of always reading 'Entities'
    ent = pd.DataFrame(js[key])
    ent['Line_Number'] = js['Index']
    return ent

In [14]:
out_batch_entities = comprehend.batch_detect_entities(
    TextList = text_list,LanguageCode = 'en')
with open("jsonoutput_comprehend/batch_detect_entities.json","w") as js:
    json.dump(out_batch_entities,js)

In [15]:
batch_entity_df = pd.concat(list(map(lambda x: get_batch_df(x,'Entities'), out_batch_entities['ResultList'])))
batch_entity_df.head()

Unnamed: 0,Line_Number,Score,Type,Text,BeginOffset,EndOffset
0,1,0.999751,ORGANIZATION,apple,0.0,5.0
1,1,0.794621,QUANTITY,18,153.0,155.0
2,1,0.99361,ORGANIZATION,apple,167.0,172.0
3,1,0.741435,DATE,a month,187.0,194.0
4,1,0.994091,ORGANIZATION,Apple,228.0,233.0


# Key Phrase Extraction - Batch Processing

In [16]:
out_batch_phrases = comprehend.batch_detect_key_phrases(
    TextList = text_list,LanguageCode = 'en')
with open("jsonoutput_comprehend/batch_detect_key_phrases.json","w") as js:
    json.dump(out_batch_phrases,js)

In [17]:
batch_phrase_df = pd.concat(list(map(lambda x: get_batch_df(x,'KeyPhrases'), out_batch_phrases['ResultList'])))
batch_phrase_df.head()


# Syntax Extraction - Batch Processing

In [18]:
out_batch_syntax = comprehend.batch_detect_syntax(
    TextList = text_list,LanguageCode = 'en')
with open("jsonoutput_comprehend/batch_detect_syntax.json","w") as js:
    json.dump(out_batch_syntax,js)

In [19]:
syntax_df = pd.concat(list(map(lambda x: get_batch_df(x,'SyntaxTokens'), out_batch_syntax['ResultList'])))
syntax_df.head()
