### Multilingual Transformer-based model for Sentiment Analysis

* Conda env used: **Natural Language Processing for CPU Python 3.7**
* The model is based on a pre-trained transformer available from the **Hugging Face Hub**: "nlptown/bert-base-multilingual-uncased-sentiment"
* If a GPU is available, inference runs on the GPU, which is faster

**Sentiment Analysis**: given a text, we want to determine whether the sentiment it expresses is positive, negative or neutral.

The transformer used here expresses sentiment as a **number of stars**, ranging from **1** (very negative) to **5** (very positive).
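Since the goal is a positive/negative/neutral verdict while the model answers in stars, the two scales need a mapping. A minimal sketch follows; the cut-offs (1-2 stars negative, 3 neutral, 4-5 positive) and the function name `stars_to_sentiment` are assumptions made for illustration, not something prescribed by the model.

```python
def stars_to_sentiment(stars: int) -> str:
    """Collapse a 1-5 star rating into a coarse sentiment label.

    The cut-offs below are an assumption made for illustration.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"stars must be between 1 and 5, got {stars}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


print([stars_to_sentiment(s) for s in (1, 3, 5)])
```

Different cut-offs may suit other use cases, for example treating 3 stars as positive.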

In [9]:
import torch
from torch import nn
import numpy as np

# Hugging Face transformers (available in the OCI DS conda NLP env)
# see: https://github.com/huggingface/transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# helper class I developed to simplify the code (from my sentiment_analyzers module)
from sentiment_analyzers import MultiLanguageSentimentAnalyzer

# to read files from Object Storage
import os
import ocifs
import ads
from ads import set_auth
import pandas as pd

#### Loading the model (it is cached locally)

In [2]:
%%time

# loading the model: pass the HF model name
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"

# the labels depend on the model used, see the HF documentation
sent_analyzer = MultiLanguageSentimentAnalyzer(
    MODEL_NAME, labels=["1 star", "2 star", "3 star", "4 star", "5 star"]
)

Loading model...
Model loading completed!
CPU times: user 2.3 s, sys: 611 ms, total: 2.91 s
Wall time: 7.37 s


### Some tests

In [3]:
# is the model running on a GPU?
sent_analyzer.get_device()

device(type='cpu')

In [4]:
%%time

input_sentences = ["La Ferrari ha sbagliato completamente la strategia di gara",
                  "La gara di Werstappen è stata veramente avvincente",
                  "Odio quando fanno uscire la safety car",
                  "Una buona gara",
                  "Oddio, che gara entusiasmante",
                  "Peccato, la Ferrari poteva vincere"]

detailed_scores = sent_analyzer.batch_score(input_sentences)

# convert the score tensor to a NumPy array, then map the argmax index (0-4) to stars (1-5)
scores = (np.argmax(detailed_scores.numpy(), axis = 1) + 1)

scores

CPU times: user 671 ms, sys: 36.7 ms, total: 708 ms
Wall time: 183 ms


array([1, 5, 1, 4, 5, 2])
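The conversion from detailed scores to stars relies on the position of the largest score in each row. The plain-Python sketch below mirrors the `np.argmax(...) + 1` step without NumPy or PyTorch. It assumes each row holds five per-class scores ordered from 1 star to 5 stars, matching the `labels` list passed to the analyzer. The helper name `scores_to_stars` and the example values are hypothetical, not real model output.

```python
def scores_to_stars(score_rows):
    """Return, for each row of per-class scores, the 1-based position of the largest score."""
    stars = []
    for row in score_rows:
        # index of the maximum; on ties the first maximum wins, as with np.argmax
        best_index = max(range(len(row)), key=row.__getitem__)
        stars.append(best_index + 1)  # index 0 corresponds to "1 star"
    return stars


# hypothetical scores, invented for illustration
example = [
    [0.70, 0.15, 0.10, 0.03, 0.02],
    [0.02, 0.03, 0.10, 0.25, 0.60],
]
print(scores_to_stars(example))
```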

### Read a file from Object Storage and score each sentence

In [10]:
# resource principal auth gives access to Object Storage without providing API keys
# the OCI admin must have set up a dynamic group for Notebooks, with the proper policy
set_auth(auth='resource_principal')

In [11]:
def read_from_object_storage(oci_url):
    # access Object Storage as a file system
    # config={} assumes RESOURCE PRINCIPAL auth
    fs = ocifs.OCIFileSystem(config={})
    
    # reading data from Object Storage
    with fs.open(oci_url, 'r') as f:
        df = pd.read_csv(f, sep=";", header=None)
    
    return df
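The `sep=";"` and `header=None` arguments imply a semicolon-delimited file with no header row. The sketch below parses an in-memory sample with that layout using only the standard library, so the format can be checked without access to Object Storage. The sample content and the `parse_rows` helper are hypothetical; the two-column id/text layout is inferred from the column names the notebook assigns to the DataFrame.

```python
import csv
import io

# hypothetical sample in the assumed layout: id;text, no header row
SAMPLE = "1;Una buona gara\n2;Odio quando fanno uscire la safety car\n"


def parse_rows(text):
    """Parse semicolon-delimited, header-less rows into (id, text) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [(int(row[0]), row[1]) for row in reader if row]


rows = parse_rows(SAMPLE)
print(rows)
```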

In [12]:
NAMESPACE = "frqap2zhtzbe"
BUCKET = "oracle_redbull_inputs"
FILE_NAME = "oracle_redbull1.csv"

oci_url = f"oci://{BUCKET}@{NAMESPACE}/{FILE_NAME}"

df_texts = read_from_object_storage(oci_url)
df_texts.columns = ['id','text']
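The URL follows the `oci://<bucket>@<namespace>/<object name>` pattern. When several files are read, a small helper can keep that pattern in one place. This is only a sketch: `build_oci_url` is a hypothetical name, and the empty-value check is an added assumption about what is worth validating.

```python
def build_oci_url(bucket: str, namespace: str, object_name: str) -> str:
    """Compose an Object Storage URL of the form oci://bucket@namespace/object_name."""
    for label, value in (
        ("bucket", bucket),
        ("namespace", namespace),
        ("object_name", object_name),
    ):
        if not value:
            raise ValueError(f"{label} must not be empty")
    # drop a leading slash so the URL never contains a double slash after the namespace
    return f"oci://{bucket}@{namespace}/{object_name.lstrip('/')}"


print(build_oci_url("oracle_redbull_inputs", "frqap2zhtzbe", "oracle_redbull1.csv"))
```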

In [13]:
df_texts.head()

| | id | text |
|---|---|---|
| 0 | 1 | La Ferrari ha sbagliato completamente la strat... |
| 1 | 2 | La gara di Werstappen è stata veramente avvinc... |
| 2 | 3 | Odio quando fanno uscire la safety car |
| 3 | 4 | Una buona gara |
| 4 | 5 | Oddio, che gara entusiasmante |


In [18]:
%%time

# extract the texts and score them in batch mode
input_sentences = list(df_texts['text'].values)

detailed_scores = sent_analyzer.batch_score(input_sentences)

# convert the score tensor to a NumPy array, then map the argmax index (0-4) to stars (1-5)
scores = (np.argmax(detailed_scores.numpy(), axis = 1) + 1)

# add the score column to the DataFrame
df_texts['score'] = scores

CPU times: user 1.11 s, sys: 55.7 ms, total: 1.17 s
Wall time: 296 ms
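The cell above scores the whole column in a single `batch_score` call. For a much larger file it may help to feed the sentences in fixed-size chunks so that memory use stays bounded. The generator below is a hypothetical helper, and the chunk size of 32 is an arbitrary assumption; whether `batch_score` already splits its input internally cannot be told from this notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# placeholder sentences, standing in for the contents of df_texts['text']
sentences = [f"sentence {i}" for i in range(70)]
print([len(chunk) for chunk in chunked(sentences, 32)])
```

Each chunk could then be passed to `sent_analyzer.batch_score`, with the per-chunk results concatenated afterwards.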


In [19]:
df_texts.head(10)

| | id | text | score |
|---|---|---|---|
| 0 | 1 | La Ferrari ha sbagliato completamente la strat... | 1 |
| 1 | 2 | La gara di Werstappen è stata veramente avvinc... | 5 |
| 2 | 3 | Odio quando fanno uscire la safety car | 1 |
| 3 | 4 | Una buona gara | 4 |
| 4 | 5 | Oddio, che gara entusiasmante | 5 |
| 5 | 6 | Peccato, la Ferrari poteva vincere | 2 |
| 6 | 7 | Da ragazzino ero un grande fan della Ferrari e... | 5 |
| 7 | 8 | Peccato, se non si rompeva il motore | 1 |
| 8 | 9 | Dai, che sorpasso veramente entusiasmante | 5 |
| 9 | 10 | Nulla da dire, una strategia di gara impeccabile | 5 |


In [20]:
# keep only the sentences rated THRESHOLD stars or higher
THRESHOLD = 3
condition = (df_texts['score'] >= THRESHOLD)

df_texts[condition].head(10)

| | id | text | score |
|---|---|---|---|
| 1 | 2 | La gara di Werstappen è stata veramente avvinc... | 5 |
| 3 | 4 | Una buona gara | 4 |
| 4 | 5 | Oddio, che gara entusiasmante | 5 |
| 6 | 7 | Da ragazzino ero un grande fan della Ferrari e... | 5 |
| 8 | 9 | Dai, che sorpasso veramente entusiasmante | 5 |
| 9 | 10 | Nulla da dire, una strategia di gara impeccabile | 5 |
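Beyond filtering, a quick summary of how the stars are distributed can be useful. The sketch below works on the ten scores displayed by `df_texts.head(10)`, copied by hand into a plain list so that it runs without pandas; the variable names are illustrative, and the full file may contain more rows than these ten.

```python
from collections import Counter

# scores of the ten rows displayed by df_texts.head(10)
scores = [1, 5, 1, 4, 5, 2, 5, 1, 5, 5]
THRESHOLD = 3

# how many sentences received each star rating
distribution = Counter(scores)

# fraction of sentences rated THRESHOLD stars or higher
share_at_or_above = sum(1 for s in scores if s >= THRESHOLD) / len(scores)

for stars in range(1, 6):
    print(f"{stars} star(s): {distribution.get(stars, 0)}")
print(f"share with score >= {THRESHOLD}: {share_at_or_above:.0%}")
```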
