# pysentimiento: A multilingual toolkit for Sentiment Analysis and SocialNLP tasks

This notebook shows a brief example of how to use [pysentimiento](https://github.com/pysentimiento/pysentimiento/), a multilingual toolkit for opinion mining and sentiment analysis (although focused on Spanish).

`pysentimiento` is a library that uses pre-trained [transformers](https://github.com/huggingface/transformers) models for different SocialNLP tasks. Its base models are [BETO](https://github.com/dccuchile/beto) and [RoBERTuito](https://github.com/pysentimiento/robertuito) for Spanish, and BERTweet for English.

 
First, let's install the library:

In [1]:
!pip install pysentimiento

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysentimiento
  Downloading pysentimiento-0.5.2-py3-none-any.whl (30 kB)
Collecting datasets>=1.13.3
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting emoji<2.0.0,>=1.6.1
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting transformers>=4.13.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)

Let's create an analyzer. The `create_analyzer` function takes the task and the language as parameters; the supported languages are currently "es" and "en".

In [9]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="emotion", lang="es")


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-emotion-analysis/snapshots/2a1fb82f525912c23a8187eeea418751049d5056/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-emotion-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "others",
    "1": "joy",
    "2": "sadness",
    "3": "anger",
    "4": "surprise",
    "5": "disgust",
    "6": "fear"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "anger": 3,
    "disgust": 5,
    "fear": 6,
    "joy": 1,
    "others": 0,
    "sadness": 2,
    "surprise": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 130,
  "model_type": "

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-emotion-analysis/snapshots/2a1fb82f525912c23a8187eeea418751049d5056/pytorch_model.bin
All model checkpoint weights were used when initializing RobertaForSequenceClassification.

All the weights of RobertaForSequenceClassification were initialized from the model checkpoint at pysentimiento/robertuito-emotion-analysis.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForSequenceClassification for predictions without further training.


loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-emotion-analysis/snapshots/2a1fb82f525912c23a8187eeea418751049d5056/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-emotion-analysis/snapshots/2a1fb82f525912c23a8187eeea418751049d5056/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-emotion-analysis/snapshots/2a1fb82f525912c23a8187eeea418751049d5056/tokenizer_config.json
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
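
The `id2label` table in the configuration printed above maps class indices to emotion names. The sketch below illustrates how a sequence classifier's raw scores are usually turned into named probabilities with a softmax. It is not taken from `pysentimiento`'s internals: the helper names and the logits are made up for illustration, and only the label table is copied from the config.

```python
import math

# Label table copied from the `id2label` entry of the printed model config.
ID2LABEL = {
    0: "others", 1: "joy", 2: "sadness", 3: "anger",
    4: "surprise", 5: "disgust", 6: "fear",
}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(logits):
    """Pair each probability with its emotion name via ID2LABEL (hypothetical helper)."""
    return {ID2LABEL[i]: p for i, p in enumerate(softmax(logits))}


# Hypothetical logits, invented for this example only.
example_logits = [2.0, 0.5, -1.0, -1.5, 0.0, -2.0, -1.0]
probas = label_probabilities(example_logits)
for label, p in probas.items():
    print(f"{label}: {p:.3f}")
```

The resulting dictionary has the same shape as the `probas` dictionaries printed later in this notebook: one entry per emotion, with values summing to 1.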



Let's check out some examples:

## Emotion Analysis

`pysentimiento` provides emotion analysis through models pre-trained on the [EmoEvent](https://github.com/fmplaza/EmoEvent-multilingual-corpus/) datasets.
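
The emotion analyzer reports one probability per emotion. As a minimal sketch of how such a dictionary can be summarized, the two helpers below (hypothetical names, not part of `pysentimiento`) pick the most likely emotion and rank the rest. The sample values are rounded from one of the `probas` dictionaries printed further down in this notebook.

```python
def top_emotion(probas):
    """Return the (label, probability) pair with the highest probability."""
    label = max(probas, key=probas.get)
    return label, probas[label]


def ranked_emotions(probas, min_prob=0.0):
    """List (label, probability) pairs from most to least likely,
    dropping entries whose probability is below min_prob."""
    ranked = sorted(probas.items(), key=lambda item: item[1], reverse=True)
    return [(label, p) for label, p in ranked if p >= min_prob]


# Rounded copy of the probabilities printed for "Sock Con, the conference for socks".
sample = {
    "others": 0.859, "joy": 0.080, "sadness": 0.006, "anger": 0.002,
    "surprise": 0.044, "disgust": 0.003, "fear": 0.006,
}

print(top_emotion(sample))
print(ranked_emotions(sample, min_prob=0.04))
```

The `min_prob` cutoff of 0.04 is an arbitrary choice for the example; a suitable threshold depends on the analysis.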

In [21]:
from google.colab import drive
import csv
import math

# Mount Google Drive, then move to the project folder.
# An absolute path keeps the cell safe to re-run from any working directory.
drive.mount('/content/gdrive')
%cd /content/gdrive/My Drive/dscapstone

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/dscapstone
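
The cell that follows scores each tweet as `math.log(retweets / followers)`. That expression raises `ValueError` when a tweet has zero retweets and `ZeroDivisionError` when the user has zero followers. One possible guard, sketched here with a hypothetical helper and invented rows, marks those cases as missing instead of failing. Whether to drop, impute, or smooth such rows is an analysis decision this sketch does not make.

```python
import math


def retweet_score(retweets, followers):
    """Return log(retweets / followers), or None when the log is undefined."""
    if retweets <= 0 or followers <= 0:
        return None  # assumption: undefined scores are treated as missing values
    return math.log(retweets / followers)


# Hypothetical rows shaped like the CSV records (all values are strings).
rows = [
    {"retweets": "10", "user_followers": "1000"},
    {"retweets": "0", "user_followers": "250"},
    {"retweets": "3", "user_followers": "0"},
]

scores = [retweet_score(int(r["retweets"]), int(r["user_followers"])) for r in rows]
print(scores)
```

If such a score is passed to `csv.DictWriter`, a `None` value is written as an empty field.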


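In [29]:
# Load the preprocessed tweets; each row is a dict keyed by the CSV header.
with open('ALL_preprocessed_tweets.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    dict_list = list(reader)

prediction_dicts = []

for d in dict_list:
    tweet = d['text']
    print(tweet)
    preds = analyzer.predict(tweet)
    print(preds.probas)
    # Copy the probabilities so the prediction object itself is not mutated.
    current_d = dict(preds.probas)
    current_d['tweet'] = tweet
    # Log of retweets per follower; assumes both counts are positive
    # (math.log fails on 0, and 0 followers causes a division by zero).
    current_d['retweet_score'] = math.log(int(d['retweets']) / int(d['user_followers']))
    prediction_dicts.append(current_d)

# Write one row per tweet: the text, its retweet score, and the emotion probabilities.
with open('pysentis.csv', 'w', newline='') as csvfile:
    fieldnames = ['tweet', 'retweet_score', 'joy', 'others', 'surprise', 'anger', 'fear', 'sadness', 'disgust']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(prediction_dicts)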

@user I find the gold toe sock – inevitably off kilter &amp; washed out – a little troubling esthetically &amp; arguably a bit corpo
{'others': 0.8057851195335388, 'joy': 0.01761920191347599, 'sadness': 0.07362749427556992, 'anger': 0.015596874989569187, 'surprise': 0.04851844534277916, 'disgust': 0.01248099934309721, 'fear': 0.026372000575065613}
Sock Con, the conference for socks
{'others': 0.8592022657394409, 'joy': 0.07976508140563965, 'sadness': 0.005646423902362585, 'anger': 0.0021522953175008297, 'surprise': 0.04445117339491844, 'disgust': 0.003238177625462413, 'fear': 0.0055446443147957325}
Always something new for the magazine cover and the articles practically write themselves
{'others': 0.8777663707733154, 'joy': 0.017492100596427917, 'sadness': 0.02102222852408886, 'anger': 0.018847664818167686, 'surprise': 0.03980891406536102, 'disgust': 0.012677349150180817, 'fear': 0.012385332025587559}
@user This guy gets it
{'others': 0.3223031163215637, 'joy': 0.5233367681503296, 'sad