### Amazon Comprehend

In this notebook, we will take a look at Amazon Comprehend, an NLP service that can be used to find insights and relationships in text. Comprehend can detect the dominant language, entities, key phrases and sentiment in a provided text, as the following examples show.

To start using the Comprehend API, we need to initialize a client:

In [2]:
import boto3
from pprint import pprint

session = boto3.session.Session()
comprehend_client = session.client('comprehend')

#### Detect dominant language

In our first task, we will detect the dominant language of several texts. In the first example, we will analyze the following English text:

In [3]:
text = """
Amazon Comprehend is a natural language processing (NLP) service
that uses machine learning to find insights and relationships in text.
Amazon Comprehend identifies the language of the text;
extracts key phrases, places, people, brands, or events;
understands how positive or negative the text is;
and automatically organizes a collection of text files by topic.
"""

response = comprehend_client.detect_dominant_language(
    Text=text
)

pprint(response)

{'Languages': [{'LanguageCode': 'en', 'Score': 0.9784398674964905}],
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '64',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 28 Mar 2018 22:00:05 GMT',
                                      'x-amzn-requestid': '5f76f599-32d3-11e8-9e50-91b788397d9c'},
                      'HTTPStatusCode': 200,
                      'RequestId': '5f76f599-32d3-11e8-9e50-91b788397d9c',
                      'RetryAttempts': 0}}


In the response, we get a list of languages, each with a corresponding score (level of confidence). In the first example, we can see that Comprehend is almost certain (a score of about 0.98) that the provided text is English.

Let's now try to detect the language of a text written in both Polish and English.
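
Since the `Languages` list may hold several candidates, it can be convenient to pull out only the top-scoring one. The snippet below is a minimal sketch: `dominant_language` is a hypothetical helper name, and `sample_response` is a trimmed copy of the output shown above (without `ResponseMetadata`), so it runs without calling AWS.

```python
def dominant_language(response):
    """Return (language_code, score) for the highest-scoring language."""
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]


# Trimmed copy of the response printed above (ResponseMetadata omitted).
sample_response = {
    "Languages": [{"LanguageCode": "en", "Score": 0.9784398674964905}],
}

code, score = dominant_language(sample_response)
print(f"{code}: {score:.2%}")  # en: 97.84%
```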

In [4]:
text = """
Potrafi identyfikować język podanego tekstu;
extracts key phrases, places, people, brands, or events;
understands how positive or negative the text is;
and automatically organizes a collection of text files by topic.
"""

response = comprehend_client.detect_dominant_language(
    Text=text
)

pprint(response)

{'Languages': [{'LanguageCode': 'en', 'Score': 0.7801410555839539},
               {'LanguageCode': 'pl', 'Score': 0.10633250325918198}],
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '114',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 28 Mar 2018 22:00:06 GMT',
                                      'x-amzn-requestid': '5fb2ed9a-32d3-11e8-b48a-9f02d80b9bc8'},
                      'HTTPStatusCode': 200,
                      'RequestId': '5fb2ed9a-32d3-11e8-b48a-9f02d80b9bc8',
                      'RetryAttempts': 0}}


In this example, Comprehend is still fairly sure that the dominant language of the provided text is English (a score of about 0.78), but it also includes Polish in the response as one of the possibilities (a score of about 0.11).
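
For mixed-language text, one possible way to use such a response is to keep every language whose score clears a cutoff. This is only a sketch: `candidate_languages` is a hypothetical helper, the `0.1` default cutoff is an arbitrary assumption, and `mixed_response` is a trimmed copy of the output above.

```python
def candidate_languages(response, min_score=0.1):
    """Return language codes scoring at least min_score, best first."""
    ranked = sorted(
        response["Languages"], key=lambda lang: lang["Score"], reverse=True
    )
    return [lang["LanguageCode"] for lang in ranked if lang["Score"] >= min_score]


# Trimmed copy of the mixed Polish/English response (ResponseMetadata omitted).
mixed_response = {
    "Languages": [
        {"LanguageCode": "en", "Score": 0.7801410555839539},
        {"LanguageCode": "pl", "Score": 0.10633250325918198},
    ],
}

print(candidate_languages(mixed_response))                 # ['en', 'pl']
print(candidate_languages(mixed_response, min_score=0.5))  # ['en']
```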

#### Detect entities

Next, we will take a look at the `detect_entities` method, which allows us to detect entities such as `PERSON`, `LOCATION`, `ORGANIZATION`, `COMMERCIAL_ITEM`, `EVENT`, `DATE`, `QUANTITY` and `TITLE`. At the time of writing, Comprehend supports detecting entities in Spanish and English texts.

In our example, we will analyze text from the following article: [Rare pics of 'The Beatles' sold for ₹2.32 crore at auction](https://www.inshorts.com/en/news/rare-pics-of-the-beatles-sold-for-%E2%82%B9232-crore-at-auction-1521995431851).

In [5]:
text = """
Rare pictures of the rock band 'The Beatles' has been sold for ₹2.32 crore (£253,200) at an auction in England.
They included over 350 previously unseen photos of the band.
The images shot by photographer Mike Mitchell, who was 18 at the time,
show the band arriving in 1964 for their first concerts in Washington DC and Baltimore in USA.
"""

response = comprehend_client.detect_entities(
    Text=text,
    LanguageCode='en'
)

print(len(response['Entities']))
pprint(response)

12
{'Entities': [{'BeginOffset': 33,
               'EndOffset': 44,
               'Score': 0.9071453213691711,
               'Text': 'The Beatles',
               'Type': 'ORGANIZATION'},
              {'BeginOffset': 64,
               'EndOffset': 75,
               'Score': 0.9864733815193176,
               'Text': '₹2.32 crore',
               'Type': 'QUANTITY'},
              {'BeginOffset': 77,
               'EndOffset': 85,
               'Score': 0.9995929002761841,
               'Text': '£253,200',
               'Type': 'QUANTITY'},
              {'BeginOffset': 104,
               'EndOffset': 111,
               'Score': 0.9942112565040588,
               'Text': 'England',
               'Type': 'LOCATION'},
              {'BeginOffset': 127,
               'EndOffset': 160,
               'Score': 0.8907501101493835,
               'Text': 'over 350 previously unseen photos',
               'Type': 'QUANTITY'},
              {'BeginOffset': 206,
               'End

In the response, we get a list of all detected entities, each with a confidence score, a type, and the indices of its starting and ending characters.
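
The offsets can be used to slice the matched span straight out of the input string. The sketch below checks this for the first four entities of the output above and groups them by type. `group_entities_by_type` is a hypothetical helper, and the text is shortened to the first sentence of the example so that the snippet runs without calling AWS.

```python
from collections import defaultdict

# First sentence of the example text, including the leading newline,
# so that the character offsets from the response still line up.
text = (
    "\nRare pictures of the rock band 'The Beatles' has been sold "
    "for ₹2.32 crore (£253,200) at an auction in England.\n"
)

# First four entities copied from the response above (scores omitted).
entities = [
    {"BeginOffset": 33, "EndOffset": 44, "Text": "The Beatles", "Type": "ORGANIZATION"},
    {"BeginOffset": 64, "EndOffset": 75, "Text": "₹2.32 crore", "Type": "QUANTITY"},
    {"BeginOffset": 77, "EndOffset": 85, "Text": "£253,200", "Type": "QUANTITY"},
    {"BeginOffset": 104, "EndOffset": 111, "Text": "England", "Type": "LOCATION"},
]


def group_entities_by_type(text, entities):
    """Map each entity type to the spans sliced out of text by offset."""
    grouped = defaultdict(list)
    for entity in entities:
        span = text[entity["BeginOffset"]:entity["EndOffset"]]
        grouped[entity["Type"]].append(span)
    return dict(grouped)


grouped = group_entities_by_type(text, entities)
print(grouped)
```

In this sample, every sliced span equals the `Text` field of its entity, which suggests the offsets count characters and that `EndOffset` is exclusive, matching Python slicing.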

#### Detect key phrases

In our next example, we will try out the `detect_key_phrases` method, which, as the name suggests, identifies key phrases in the provided text. At the time of writing, Comprehend supports detecting key phrases in Spanish and English texts.

For testing, we will analyze part of the following article: [Messaging Patterns for Event-Driven Microservices](https://content.pivotal.io/blog/messaging-patterns-for-event-driven-microservices).

In [8]:
text = """
In a microservices architecture, each microservice is designed as an atomic and self-sufficient piece of software.
Implementing a use case will often require composing multiple calls to these single responsibility,
distributed endpoints. Although synchronous request-response calls are required when the requester
expects an immediate response, integration patterns based on eventing and asynchronous messaging
provide maximum scalability and resiliency. Some of the world's most scalable architectures
such as Linkedin and Netflix are based on event-driven, asynchronous messaging.
"""

response = comprehend_client.detect_key_phrases(
    Text=text,
    LanguageCode='en'
)

print(len(response['KeyPhrases']))
pprint(response)

18
{'KeyPhrases': [{'BeginOffset': 4,
                 'EndOffset': 32,
                 'Score': 0.9916486740112305,
                 'Text': 'a microservices architecture'},
                {'BeginOffset': 34,
                 'EndOffset': 51,
                 'Score': 0.9988448619842529,
                 'Text': 'each microservice'},
                {'BeginOffset': 67,
                 'EndOffset': 102,
                 'Score': 0.9816235303878784,
                 'Text': 'an atomic and self-sufficient piece'},
                {'BeginOffset': 106,
                 'EndOffset': 114,
                 'Score': 0.9996533393859863,
                 'Text': 'software'},
                {'BeginOffset': 129,
                 'EndOffset': 139,
                 'Score': 0.9893721342086792,
                 'Text': 'a use case'},
                {'BeginOffset': 169,
                 'EndOffset': 183,
                 'Score': 0.960514485836029,
                 'Text': 'multiple calls'},
    

In the response, we get a list of all detected key phrases, each with a confidence score and the indices of its starting and ending characters. Unlike entities, key phrases carry no type.
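
With 18 phrases returned, it may help to keep only the most confident ones. The following sketch filters and sorts the six phrases visible in the output above. `top_key_phrases` is a hypothetical helper and the `0.99` cutoff is an arbitrary assumption.

```python
# The six key phrases visible in the response above (offsets omitted).
key_phrases = [
    {"Score": 0.9916486740112305, "Text": "a microservices architecture"},
    {"Score": 0.9988448619842529, "Text": "each microservice"},
    {"Score": 0.9816235303878784, "Text": "an atomic and self-sufficient piece"},
    {"Score": 0.9996533393859863, "Text": "software"},
    {"Score": 0.9893721342086792, "Text": "a use case"},
    {"Score": 0.960514485836029, "Text": "multiple calls"},
]


def top_key_phrases(key_phrases, min_score=0.99):
    """Return phrase texts scoring at least min_score, most confident first."""
    kept = [phrase for phrase in key_phrases if phrase["Score"] >= min_score]
    kept.sort(key=lambda phrase: phrase["Score"], reverse=True)
    return [phrase["Text"] for phrase in kept]


print(top_key_phrases(key_phrases))
# ['software', 'each microservice', 'a microservices architecture']
```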

#### Detect sentiment - positive review

The last method covered in this tutorial is `detect_sentiment`. At the time of writing, Comprehend supports detecting sentiment in Spanish and English texts.

As a first test of `detect_sentiment`, we will analyze a positive restaurant review from [Zomato](https://www.zomato.com/pl/warszawa/tandoor-%C5%9Br%C3%B3dmie%C5%9Bcie-po%C5%82udniowe/reviews).

In [9]:
text = """
Called in tonight as it was so near to where we were staying....staff really attentive
and friendly...place clean and inviting...great choices on the menu...food was so tasty,
meat really tender...and definitely the best mojito in Warsaw...will be back....returned
on our final night in Warsaw and was not disappointed...the meals were once again
gorgeous..staff so friendly and helpful.
"""

response = comprehend_client.detect_sentiment(
    Text=text,
    LanguageCode='en'
)

pprint(response)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '164',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 28 Mar 2018 22:16:20 GMT',
                                      'x-amzn-requestid': 'a4597911-32d5-11e8-ad19-63ebbbebc655'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'a4597911-32d5-11e8-ad19-63ebbbebc655',
                      'RetryAttempts': 0},
 'Sentiment': 'POSITIVE',
 'SentimentScore': {'Mixed': 0.021178914234042168,
                    'Negative': 0.002122301608324051,
                    'Neutral': 0.008368930779397488,
                    'Positive': 0.968329906463623}}


As we can see, Comprehend had no problem identifying the provided review as very positive (a confidence score of over 95%).

Now we will see how it handles a review that isn't obviously positive or negative. As an example, we will once again use a restaurant review, this time from [TripAdvisor](https://pl.tripadvisor.com/Restaurant_Review-g274772-d1749958-Reviews-Restauracja_Starka-Krakow_Lesser_Poland_Province_Southern_Poland.html).
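
To read the four scores at a glance, they can be ranked and shown as percentages. This is a small sketch rather than part of the Comprehend API: `ranked_sentiments` is a hypothetical helper, and `positive_response` is a trimmed copy of the output above.

```python
def ranked_sentiments(response):
    """Return (label, percent) pairs sorted from most to least likely."""
    scores = response["SentimentScore"]
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, round(score * 100, 1)) for label, score in ordered]


# Trimmed copy of the response printed above (ResponseMetadata omitted).
positive_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {
        "Mixed": 0.021178914234042168,
        "Negative": 0.002122301608324051,
        "Neutral": 0.008368930779397488,
        "Positive": 0.968329906463623,
    },
}

for label, percent in ranked_sentiments(positive_response):
    print(f"{label:<8} {percent:5.1f}%")
```

In this sample, the top-ranked label matches the `Sentiment` field once it is upper-cased.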

In [10]:
text = """
4 of us - were going to book this for an evening meal but they had no reservations available.
However we did have lunch. It was just ok - nothing really to write home about.
We felt as if the staff thought we should be thankful that they found a table for us to have lunch;
they were not friendly. We were given a complementary vodka which was appreciated and the reason
for our average score. There are better places and thankfully we found them
"""

response = comprehend_client.detect_sentiment(
    Text=text,
    LanguageCode='en'
)

pprint(response)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '159',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 28 Mar 2018 22:23:37 GMT',
                                      'x-amzn-requestid': 'a8d1ec98-32d6-11e8-bb31-e118196dfed7'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'a8d1ec98-32d6-11e8-bb31-e118196dfed7',
                      'RetryAttempts': 0},
 'Sentiment': 'MIXED',
 'SentimentScore': {'Mixed': 0.45759281516075134,
                    'Negative': 0.14617452025413513,
                    'Neutral': 0.04490724578499794,
                    'Positive': 0.3513253629207611}}


In this example, Comprehend was less certain about the sentiment: while it classified the review as `MIXED` (a score of about 0.46), it also assigned a fairly high score to `Positive` (about 0.35), which is a bit surprising in this case.
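
One way to flag such uncertain results in code is to look at the gap between the two highest scores. The sketch below is an assumption about how one might do this, not a Comprehend feature: `sentiment_margin` and `is_ambiguous` are hypothetical helpers, and the `0.2` margin is an arbitrary cutoff. Both score dictionaries are copied from the outputs above.

```python
def sentiment_margin(sentiment_score):
    """Return the gap between the highest and second-highest scores."""
    top, runner_up = sorted(sentiment_score.values(), reverse=True)[:2]
    return top - runner_up


def is_ambiguous(sentiment_score, min_margin=0.2):
    """Flag a result whose top score leads by less than min_margin."""
    return sentiment_margin(sentiment_score) < min_margin


# SentimentScore values copied from the two responses above.
positive_scores = {
    "Mixed": 0.021178914234042168,
    "Negative": 0.002122301608324051,
    "Neutral": 0.008368930779397488,
    "Positive": 0.968329906463623,
}
mixed_scores = {
    "Mixed": 0.45759281516075134,
    "Negative": 0.14617452025413513,
    "Neutral": 0.04490724578499794,
    "Positive": 0.3513253629207611,
}

print(is_ambiguous(positive_scores))  # False
print(is_ambiguous(mixed_scores))     # True
```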