# How to use the Google Cloud Natural Language API
## Clothing Reviews Example

This notebook demonstrates how to use the [Google Cloud Natural Language API](https://cloud.google.com/natural-language/docs) for:
* Sentiment analysis
* Entity extraction
* Syntax analysis
* Text classification

It also shows how to visualize the API results with the [Seaborn data visualization library](https://seaborn.pydata.org/).

## Prerequisites

### Upload dataset

The dataset we will use is [Kaggle - Women's E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).

Download the CSV from Kaggle and upload it to the same directory as this notebook.

In [None]:
# [Colab Only] Upload CSV programmatically
import sys

if 'google.colab' in sys.modules:    
  from google.colab import files
  files.upload()

### [Colab Only] Create a GCP project and perform setup tasks

Follow the **Setup a Project** steps in the [documentation](https://cloud.google.com/natural-language/docs/quickstart-client-libraries).

Then create a new service account key in JSON format and download it.

### [Colab Only] Authenticate with GCP

In [None]:
# Upload the downloaded JSON file that contains your key.

if 'google.colab' in sys.modules:    
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
  ! gcloud auth activate-service-account --key-file $keyfile

### Load data

In [None]:
# Load data from CSV

import pandas as pd

df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', index_col=0)

df.head()

In [None]:
# Filter dataset to rows with reviews > 100 chars long

df = df.loc[df['Review Text'].str.len() > 100]

df.head()

In [None]:
# Pick one of the reviews as an example (2 is the row's index label, not its position)

text = df['Review Text'][2]
text

### Sentiment analysis

We'll now instantiate the Natural Language API client and run sentiment analysis on our text. The API returns a sentiment score and a magnitude.

The score indicates the overall emotion of a document and ranges from -1.0 (negative) to 1.0 (positive). The magnitude is a non-negative number that indicates how much emotional content is present within the document, regardless of whether it is positive or negative; this value is often proportional to the length of the document. See the [documentation](https://cloud.google.com/natural-language/docs/basics#sentiment-analysis-values) for more details.
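To make the two values concrete, here is a minimal sketch of how a score and magnitude pair might be read together. The helper name `describe_sentiment` and both thresholds (0.25 for the score, 1.0 for the magnitude) are assumptions chosen for illustration; the API does not define them, and suitable cut-offs depend on your data.

```python
def describe_sentiment(score, magnitude, score_threshold=0.25, magnitude_threshold=1.0):
    """Return a rough label for a (score, magnitude) pair.

    Both thresholds are hypothetical defaults, not values defined by the API.
    """
    if score >= score_threshold:
        return 'positive'
    if score <= -score_threshold:
        return 'negative'
    # Score near zero: a high magnitude suggests positive and negative
    # statements cancelling out, a low magnitude suggests little emotion.
    if magnitude >= magnitude_threshold:
        return 'mixed'
    return 'neutral'


print(describe_sentiment(0.8, 3.2))
print(describe_sentiment(0.0, 4.0))
print(describe_sentiment(0.1, 0.2))
```

Looking at both values matters mainly when the score is close to zero, where the magnitude helps separate a review with mixed opinions from one with little opinion at all.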

In [None]:
# Imports the Google Cloud client library
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

# Instantiates a client
client = language.LanguageServiceClient()

# The text to analyze
document = types.Document(
    content=text,
    type=enums.Document.Type.PLAIN_TEXT)

# Detects the sentiment of the text
sentiment = client.analyze_sentiment(document).document_sentiment

print('Text: {}\n'.format(text))
print('Sentiment: {}, {}'.format(round(sentiment.score, 2), round(sentiment.magnitude, 2)))

In [None]:
# Take a sample of the reviews for analysis
SAMPLE_SIZE = 100
df_sample = df.sample(SAMPLE_SIZE).copy().reset_index()
scores, magnitudes = [], []

# Iterate through each sample and invoke the API
for review in df_sample['Review Text']:
    document = types.Document(content=review, type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    scores.append(sentiment.score)
    magnitudes.append(sentiment.magnitude)

In [None]:
# Merge the scores & magnitudes returned from the API with the original records

df_sample['Score'] = scores
df_sample['Magnitude'] = magnitudes
df_sample.head()

In [None]:
# Plot the sentiment for each clothing category

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20,6))

_ = sns.boxplot(x="Class Name", y="Score", data=df_sample)

### Entity extraction

In [None]:
# Print the text we want to extract entities from

text

In [None]:
# Analyze entities

document = {"content": text, "type": enums.Document.Type.PLAIN_TEXT}

response = client.analyze_entities(document, encoding_type=enums.EncodingType.UTF8)

Next, we will create a regular expression that will highlight every entity found in the text by wrapping the entity with an escape sequence.

By the way, the API also returns the position of each entity, in case you prefer to use a different approach.

In [None]:
# Get list of entity names from returned entities (e.g. ['hopes', 'dress', ...])
import re
from google.protobuf.json_format import MessageToDict

entities = MessageToDict(response)['entities']
entity_names = [entity['name'] for entity in entities]

# Create a regular expression pattern to match any of the entity names.
# Escape each name so that characters such as '.' or '(' are matched literally.
pattern = '(' + '|'.join(re.escape(name) for name in entity_names) + ')'
print(pattern)

HIGHLIGHT = '\x1b[1;31m' # ANSI escape code sequence for red
RESET = '\x1b[0m'

# Print out the text with the entities highlighted
highlighted_text = re.sub(re.compile(pattern), HIGHLIGHT + '\\1' + RESET, text)
print(highlighted_text)
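As an alternative to the regular expression, the positions returned by the API can drive the highlighting. The sketch below assumes the mentions have already been reduced to `(begin_offset, content)` tuples, which is a shape chosen for this example rather than the API's response type, and that the offsets are UTF-8 byte offsets, as they are when the request uses `EncodingType.UTF8`. The function name `highlight_by_offsets` and the sample sentence are hypothetical.

```python
HIGHLIGHT = '\x1b[1;31m'  # ANSI escape sequence for bold red
RESET = '\x1b[0m'


def highlight_by_offsets(text, mentions):
    """Wrap each mention in ANSI codes using UTF-8 byte offsets.

    `mentions` is a list of (begin_offset, content) tuples; this shape is
    an assumption made for the sketch, not the API's response type.
    """
    data = text.encode('utf-8')
    pieces = []
    cursor = 0
    for begin, content in sorted(mentions):
        end = begin + len(content.encode('utf-8'))
        if begin < cursor:
            continue  # skip a mention that overlaps one already highlighted
        pieces.append(data[cursor:begin].decode('utf-8'))
        pieces.append(HIGHLIGHT + data[begin:end].decode('utf-8') + RESET)
        cursor = end
    pieces.append(data[cursor:].decode('utf-8'))
    return ''.join(pieces)


sample = 'I love this dress and the fabric.'
# Offsets counted by hand for this made-up sentence.
sample_mentions = [(12, 'dress'), (26, 'fabric')]
print(highlight_by_offsets(sample, sample_mentions))
```

Unlike the regular expression, this approach highlights only the exact occurrences the API reported, so a substring that happens to match an entity name elsewhere in the text is left alone.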

### Syntax analysis

In [None]:
# Analyze syntax

document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT)
syntax = client.analyze_syntax(document=document)

In [None]:
# Count the number of times each part of speech occurs in the text

# Create a list of all possible parts of speech
all_tags = [e.name for e in enums.PartOfSpeech.Tag]

# Create dictionary for each part-of-speech
tag_counts = dict.fromkeys(all_tags, 0)

# Review each token and add to the counter
for token in syntax.tokens:
    part_of_speech = token.part_of_speech
    tag = enums.PartOfSpeech.Tag(part_of_speech.tag).name
    tag_counts[tag] += 1

# Sort the counts in descending order, and plot them
sorted_counts = dict(sorted(tag_counts.items(), key=lambda item: item[1], reverse=True))
_ = sns.barplot(x=list(sorted_counts.values()), y=list(sorted_counts.keys()))

In [None]:
# Analyze the parts of speech for each review in the sample.
# This time, calculate the share of each part of speech (e.g. 20% Noun, 15% Adjective, etc.)

tags = list()

for review in df_sample['Review Text']:
    document = types.Document(content=review, type=enums.Document.Type.PLAIN_TEXT)
    syntax = client.analyze_syntax(document)
    
    tag_ratios = dict.fromkeys(all_tags, 0)
    for token in syntax.tokens:
        
        part_of_speech = token.part_of_speech
        tag = enums.PartOfSpeech.Tag(part_of_speech.tag).name
        tag_ratios[tag] += 1 / len(syntax.tokens)
    tags.append(tag_ratios)
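The per-review ratio calculation above can be checked on a toy input without calling the API. This is a minimal sketch that uses `collections.Counter` instead of a pre-filled dictionary; the helper name `tag_ratios_from_names` and the hand-written tag list are assumptions made for illustration.

```python
from collections import Counter


def tag_ratios_from_names(tag_names):
    """Return each tag's share of the token list as a fraction of 1."""
    total = len(tag_names)
    if total == 0:
        return {}  # avoid dividing by zero for an empty document
    counts = Counter(tag_names)
    return {tag: count / total for tag, count in counts.items()}


# Hand-written tags for a made-up five-token sentence.
toy_tags = ['DET', 'ADJ', 'NOUN', 'VERB', 'ADJ']
ratios = tag_ratios_from_names(toy_tags)
print(ratios)
```

The ratios for one review always sum to 1, which is a quick sanity check. Note that this variant only contains the tags that actually occur, whereas the notebook's loop starts from `all_tags` so that every review produces the same set of columns.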

In [None]:
# Append the parts of speech to the review dataframe

df_sample = pd.concat([df_sample, pd.DataFrame(tags, columns=all_tags)], axis=1)

df_sample.head()

In [None]:
# Let's see if there is any correlation between sentiment and a couple of common parts of speech

sns.lmplot(x='ADJ', y='Score', data=df_sample)
sns.lmplot(x='VERB', y='Score', data=df_sample)
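The regression plots show the trend visually; a correlation coefficient puts a number on it. The sketch below computes Pearson's r by hand so that it runs without any dependencies. The function name `pearson_r` and the four data points are made up for illustration and are not results from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length with at least 2 values')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: adjective share and sentiment score for four reviews.
adj_ratio = [0.05, 0.10, 0.15, 0.20]
score = [-0.2, 0.1, 0.3, 0.6]
print(round(pearson_r(adj_ratio, score), 3))
```

With the notebook's dataframe, `df_sample['ADJ'].corr(df_sample['Score'])` should give the same quantity, since pandas uses Pearson correlation by default.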

### Text classification

In [None]:
# Classify text

response = client.classify_text(types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT))

In [None]:
# Print the category name and confidence

for category in response.categories:
    print(f"Category name: {category.name}")
    print(f"Confidence: {round(category.confidence, 2)}")
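Category names come back as slash-separated paths, so it can be handy to split them into levels before grouping or plotting. This is a minimal sketch; the helper names `split_category` and `top_level` are hypothetical, and the example path is written by hand rather than taken from an API response.

```python
def split_category(name):
    """Split a slash-separated category path into its levels."""
    return [part for part in name.split('/') if part]


def top_level(name):
    """Return the first level of a category path, or None if it is empty."""
    levels = split_category(name)
    return levels[0] if levels else None


example = '/Shopping/Apparel'  # illustrative path, not an actual API response
print(split_category(example))
print(top_level(example))
```

Grouping by the top level gives fewer, larger buckets, which can make a category bar chart easier to read when the sample is small.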

In [None]:
# Find the category name for each review in the list of samples

import numpy as np

categories = list()

for review in df_sample['Review Text']:
    document = types.Document(content=review, type=enums.Document.Type.PLAIN_TEXT)
    response = client.classify_text(document)
    try:
        category = response.categories[0].name
    except IndexError:
        # No category was returned for this review
        category = np.nan
    categories.append(category)

In [None]:
# Append the category to the review dataframe

df_sample = pd.concat([df_sample, pd.DataFrame(categories, columns=['Category'])], axis=1)

df_sample.head()

In [None]:
# Plot the count of categories in descending order

category_counts = df_sample[['Category','index']].groupby(['Category']).count().rename(columns={'index': 'Count'}).sort_values(by='Count', ascending=False)
_ = sns.barplot(x=category_counts['Count'], y=category_counts.index)