## News Articles Entity Extraction and Classification with Google AutoML Natural Language API

### Introduction

A common problem for media and news companies is extracting useful metadata from news articles in a timely manner, in order to (1) help classify them in internal repositories and (2) improve search positioning by adding more data to the SEO configuration.

In this tutorial we go step by step through the following tasks using Google AutoML:
- analyze a news article and extract all entities from it
- classify the article according to its type (science, arts, sports, etc.)
- as a reference for Portuguese speakers, translate content between English and Portuguese when appropriate

This simple prototype uses hardcoded text, but it can serve as the basis for an API that ingests Cloud Storage files, text supplied at runtime, web crawler data, etc.
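As a rough illustration of that idea, the sketch below wraps the two analysis steps behind one function. The names `build_article_metadata`, `fake_extract_entities` and `fake_classify` are hypothetical and not part of any Google library; the two stubs stand in for the real API calls so the example runs without credentials.

```python
import json
from typing import Callable, Dict, List


def build_article_metadata(
    text: str,
    extract_entities: Callable[[str], List[str]],
    classify: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run both analysis steps on one article and collect the results.

    extract_entities and classify are hypothetical callables standing in
    for the API calls made in the code cells of this tutorial.
    """
    return {
        "entities": extract_entities(text),
        "categories": classify(text),
    }


# Stub callables so the sketch runs without credentials or network access.
def fake_extract_entities(text: str) -> List[str]:
    return [name for name in ("HBO Max", "Friends") if name in text]


def fake_classify(text: str) -> List[str]:
    return ["/Arts & Entertainment/Humor"]


metadata = build_article_metadata(
    "Friends ganhará um especial em maio, lançado pela HBO Max.",
    fake_extract_entities,
    fake_classify,
)
# ensure_ascii=False keeps accented characters readable in the JSON output.
print(json.dumps(metadata, ensure_ascii=False, indent=2))
```

Because the text source and the two analysis steps are passed in as arguments, the same function could be fed from a file, a crawler, or a request handler without changes.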

**It is important to reinforce:** all of these results can be produced quickly and with high confidence, without manually training, optimizing and testing machine learning models or writing machine learning code, which drastically reduces the time to market of the resulting solutions.

A reference for all built-in classifications available on AutoML can be found at: https://cloud.google.com/natural-language/docs/categories

### How does AutoML Natural Language API work?

AutoML is a managed service available on Google Cloud: the training data and machine learning models are created, curated and managed by Google, and customers only need to run predictions (inference) against the available API.

As an example, you can go to https://cloud.google.com/natural-language and try the live demo: add the sample text "Google, headquartered in Mountain View (1600 Amphitheatre Pkwy, Mountain View, CA 940430), unveiled the new Android phone for $799 at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones", click Analyze, and you will see the following results. First, the entity extraction:

![AutoML generated entities](./images/automl-entities.png)

It is also possible to check the automated content classification, as shown below:

![AutoML generated content classification](./images/automl-classification.png)
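To make "running predictions against the API" more concrete, here is a minimal sketch of the JSON body that a REST call to the entity analysis endpoint carries. The helper name `build_analyze_entities_body` is hypothetical, and the sketch only builds and prints the payload; it does not send a request or handle authentication.

```python
import json

# Public REST endpoint for entity analysis in the Natural Language API (v1).
ANALYZE_ENTITIES_URL = "https://language.googleapis.com/v1/documents:analyzeEntities"


def build_analyze_entities_body(text: str) -> dict:
    """Build the JSON body for a documents:analyzeEntities request."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        # Text offsets in the response are computed for this encoding.
        "encodingType": "UTF8",
    }


sample = "Sundar Pichai said in his keynote that users love their new Android phones"
demo_body = build_analyze_entities_body(sample)
print(ANALYZE_ENTITIES_URL)
print(json.dumps(demo_body, indent=2))
```

The response to such a request is also JSON, with an `entities` list whose items carry fields such as `name`, `type` and `salience`.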

### Preparation Steps

Before starting, it is important to:

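**a) First, install the required Python modules:**
- googleapis-common-protos
- google-api-python-client (provides the `googleapiclient` module imported in the cells below)
- google-cloud-language (the cells below use `language.types` and `language.enums`, which belong to the 1.x releases of this package)
- google-cloud-automl
- google-cloud-translate

To install these packages, simply run:

`$ sudo pip3 install googleapis-common-protos google-api-python-client "google-cloud-language<2" google-cloud-automl google-cloud-translate`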

**b) Also, make sure you have a valid Application Credential to use in this exercise**

If you don't have one, or if you are not sure, please check: https://cloud.google.com/docs/authentication/getting-started

Then simply follow and execute the code cells below to perform the activity.

In [1]:
# import initial required modules
import os
import json

In [2]:
## configure environment variables for authentication
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<insert the path for your credentials here>'
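
Before calling any client, it can save time to confirm that the configured path really points to a readable JSON key file. The sketch below is one way to do that with the standard library only; `check_credentials_file` is a hypothetical helper, and it checks only that the file exists, parses as JSON and has a `type` field (for example `service_account`), not that the credentials are actually accepted by Google Cloud.

```python
import json
import os


def check_credentials_file(path: str) -> str:
    """Return the credential type stored in a JSON key file, or raise ValueError."""
    if not path or not os.path.isfile(path):
        raise ValueError(f"credentials file not found: {path!r}")
    with open(path, encoding="utf-8") as handle:
        try:
            data = json.load(handle)
        except json.JSONDecodeError as error:
            raise ValueError(f"credentials file is not valid JSON: {error}") from error
    if not isinstance(data, dict) or "type" not in data:
        raise ValueError("credentials file has no 'type' field")
    return data["type"]


credentials_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
try:
    print("credential type:", check_credentials_file(credentials_path))
except ValueError as problem:
    print("credentials problem:", problem)
```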

In [3]:
from google.cloud import language

## Natural Language API client definition
language_client = language.LanguageServiceClient()

## sample text for test purposes
# source: https://tommo.ricmais.com.br/noticia/friends-matt-leblanc-comenta-especial-em-entrevista/#new_tab
text = 'Uma das comédias mais famosas da história, Friends ganhará um especial em maio, lançado pela HBO Max, \
        plataforma de streaming da Warner. Matt LeBlanc, um dos seis protagonistas da série, comentou a “reunião” \
        do elenco em uma entrevista, afirmando que o programa será como “reunir a banda sem instrumentos” (via TV Guide). \
        Em conversa no programa da cantora e apresentadora Kelly Clarkson, o intérprete de Joey Tribbiani confirmou o \
        formato não roteirizado do especial. Segundo descreveu o ator, a reunião de Friends contará com “nós seis \
        conversando sobre os bons e velhos tempos”. Um dos eventos televisivos mais pedidos e aguardados dos últimos anos, \
        a reunião de Friends precisou ter sua gravação, marcada para março, adiadapor causa do coronavírus. Ainda assim, a \
        HBO Max manteve a estreia do episódio para maio, junto com a chegada da plataforma. O especial, que será uma espécie \
        de conversa entre Jennifer Aniston, MattLeBlanc, Courtney Cox, MatthewPerry, Lisa Kudrow, DavidSchwimmer e os \
        produtores MarthaKauffman, David Crane e Kevin Bright, segue como um dos títulos originais a serem lançados na HBO Max, \
        plataforma de streaming da Warner, programada para ser lançada em algum momento em maio.'

In [4]:
import googleapiclient.discovery

## Analyzing text entities
body = {
    'document': {
        'type': 'PLAIN_TEXT',
        'content': text,
    },
    'encoding_type': 'UTF32',
}

entities = []
service = googleapiclient.discovery.build('language', 'v1')
request = service.documents().analyzeEntities(body=body)
response = request.execute()

# keep only these entity types, with a salience of at least 1%, without duplicates
entity_types = ['PERSON', 'LOCATION', 'ORGANIZATION', 'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD']
for entity in response['entities']:
    if entity['salience'] * 100 >= 1 and entity['type'] in entity_types:
        if entity['name'] not in entities:
            entities.append(entity['name'])

# take a look at the entities captured
entities

['HBO Max',
 'Friends',
 'Warner',
 'Matt LeBlanc',
 'reunião',
 'programa',
 'protagonistas',
 'elenco',
 'entrevista',
 'banda',
 'intérprete',
 'gravação',
 'eventos',
 'pedidos',
 'conversa',
 'chegada',
 'episódio',
 'MatthewPerry']
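
The filtering rule used in this cell (entity type, minimum salience, no duplicates) can also be written as a small pure function, which makes it easy to try without calling the API. This is a sketch: `select_entity_names` is a hypothetical helper, and the sample entities and their salience values are made up for illustration, not taken from a real response.

```python
from typing import Any, Dict, Iterable, List

KEPT_TYPES = {"PERSON", "LOCATION", "ORGANIZATION", "EVENT", "WORK_OF_ART", "CONSUMER_GOOD"}


def select_entity_names(
    found: Iterable[Dict[str, Any]],
    min_salience: float = 0.01,
    kept_types: Iterable[str] = KEPT_TYPES,
) -> List[str]:
    """Return unique entity names, in order, that pass the type and salience filters."""
    allowed = set(kept_types)
    names: List[str] = []
    for entity in found:
        if entity["salience"] < min_salience or entity["type"] not in allowed:
            continue
        if entity["name"] not in names:
            names.append(entity["name"])
    return names


# Hypothetical response fragment; the salience values are invented.
sample_entities = [
    {"name": "HBO Max", "type": "ORGANIZATION", "salience": 0.21},
    {"name": "Friends", "type": "WORK_OF_ART", "salience": 0.18},
    {"name": "maio", "type": "DATE", "salience": 0.05},           # dropped: type not kept
    {"name": "TV Guide", "type": "ORGANIZATION", "salience": 0.004},  # dropped: salience too low
    {"name": "HBO Max", "type": "ORGANIZATION", "salience": 0.02},    # dropped: duplicate
]
print(select_entity_names(sample_entities))
```

Raising `min_salience` is a simple way to trim generic nouns such as the ones that appear in the list above.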

In [5]:
from google.cloud import translate_v2 as translate

## Translate API client definition
client_translate = translate.Client()
target_language = 'en'

## translating the text from pt-BR to English
translation = client_translate.translate(text, target_language=target_language)
translated_text = translation['translatedText']

# take a look at the translated text
translated_text

'One of the most famous comedies in history, Friends will win a special in May, released by HBO Max, Warner&#39;s streaming platform. Matt LeBlanc, one of the six protagonists of the series, commented on the cast&#39;s &quot;reunion&quot; in an interview, stating that the program will be like &quot;reuniting the band without instruments&quot; (via TV Guide). Speaking on the program by singer and presenter Kelly Clarkson, Joey Tribbiani&#39;s interpreter confirmed the non-scripted format of the special. According to the actor, the Friends meeting will feature &quot;the six of us talking about the good old days&quot;. One of the most requested and anticipated television events in recent years, the Friends meeting needed to have its recording, scheduled for March, postponed due to the coronavirus cause. Still, HBO Max maintained the episode&#39;s premiere for May, along with the platform&#39;s arrival. The special, which will be a kind of conversation between Jennifer Aniston, MattLeBlanc
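
The `&#39;` and `&quot;` sequences in this output are HTML character references for the apostrophe and the double quote. One way to turn them back into plain characters is `html.unescape` from the Python standard library, sketched below on a fragment copied from the output above. As an alternative, the `translate` method of this client also accepts a `format_` argument; passing `format_='text'` should ask the service for plain text instead, though that option is not exercised in this tutorial.

```python
import html

# Fragment copied from the translated text shown above.
raw_fragment = "commented on the cast&#39;s &quot;reunion&quot; in an interview"

# html.unescape converts HTML character references back to plain characters.
clean_fragment = html.unescape(raw_fragment)
print(clean_fragment)
```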

In [6]:
# performing the text classification
document = language.types.Document(
    content=translated_text,
    type=language.enums.Document.Type.PLAIN_TEXT)
response = language_client.classify_text(document)
categories = response.categories

# taking a look at the generated categories
categories

[name: "/Arts & Entertainment/Humor"
confidence: 0.9200000166893005
, name: "/Arts & Entertainment/TV & Video/TV Shows & Programs"
confidence: 0.9100000262260437
]

In [7]:
# bring the content back to Brazilian Portuguese
target_language = 'pt'
classes = []
for category in categories:
    # keep only categories with a confidence of at least 60%
    if category.confidence*100 >= 60:
        translated_name = client_translate.translate(
            category.name, target_language=target_language)['translatedText']
        # de-duplicate on the translated name, which is what the list stores
        if translated_name not in classes:
            classes.append(translated_name)

# take a look at the categories after the translation
classes

['/ Artes e entretenimento / Humor',
 '/ Artes e entretenimento / TV e vídeo / Programas e programas de TV']
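
The translation step pads each `/` separator with spaces, so the translated paths no longer match the compact `/A/B/C` form of the original categories. The sketch below shows one possible cleanup that splits a path into its levels and rebuilds it; `split_category_path` and `join_category_path` are hypothetical helpers, and the approach assumes category names never contain a `/` themselves.

```python
from typing import List


def split_category_path(path: str) -> List[str]:
    """Split a category path on '/' and drop empty or space-only segments."""
    return [segment.strip() for segment in path.split("/") if segment.strip()]


def join_category_path(segments: List[str]) -> str:
    """Rebuild a compact '/A/B/C' path from its levels."""
    return "/" + "/".join(segments)


translated_path = "/ Artes e entretenimento / TV e vídeo / Programas e programas de TV"
levels = split_category_path(translated_path)
print(levels)
print(join_category_path(levels))
```

Keeping the levels as a list also makes it straightforward to file an article under its top-level category only, if an internal repository needs a coarser classification.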

In [8]:
# wrapping up - taking a look at everything
print('the original text:')
print(text, '\n')
print('text entities:', entities)
print('text category classes:', classes)

the original text:
Uma das comédias mais famosas da história, Friends ganhará um especial em maio, lançado pela HBO Max,         plataforma de streaming da Warner. Matt LeBlanc, um dos seis protagonistas da série, comentou a “reunião”         do elenco em uma entrevista, afirmando que o programa será como “reunir a banda sem instrumentos” (via TV Guide).         Em conversa no programa da cantora e apresentadora Kelly Clarkson, o intérprete de Joey Tribbiani confirmou o         formato não roteirizado do especial. Segundo descreveu o ator, a reunião de Friends contará com “nós seis         conversando sobre os bons e velhos tempos”. Um dos eventos televisivos mais pedidos e aguardados dos últimos anos,         a reunião de Friends precisou ter sua gravação, marcada para março, adiadapor causa do coronavírus. Ainda assim, a         HBO Max manteve a estreia do episódio para maio, junto com a chegada da plataforma. O especial, que será uma espécie         de conversa entre Jennifer Anist