# Project 3: Detecting Hate and Generating Joy

The goal of this project is for you to learn to use transformers in real-world applications, using the https://huggingface.co/ ecosystem (the transformers, datasets, and tokenizers libraries, among others) and PyTorch.

You will build on a BERT-derived transformer called [RoBERTuito](https://huggingface.co/pysentimiento/robertuito-base-uncased), which was trained on more than 500 million tweets in Spanish. The paper describing the model is available at this [link](https://arxiv.org/abs/2111.09453).

### Part 1. Hate Detection

Hate speech is unfortunately becoming very common, and bots can even be programmed to promote it. In this part you will use a HF model called [robertuito-hate-speech](https://huggingface.co/pysentimiento/robertuito-hate-speech). The idea is to use the Twitter API from the first project to collect a large number of tweets (from the last week or day(s) in **Colombia**) and detect which of them are hateful. The model produces three labels (it is multi-label): HS (hate speech), TR (targeted at a specific individual), and AG (aggressive). A relatively high probability in any of these categories indicates a hateful tweet.
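As a rough illustration of the decision rule described above, the sketch below turns three raw scores into independent per-label probabilities with a sigmoid and flags a tweet when any of them crosses a threshold. The label order, the 0.5 threshold, and the helper name `flag_hate` are assumptions made for illustration; the real label order should be read from the model's `id2label` config.

```python
import math

# Assumed label order, for illustration only; check config.id2label on the real model.
LABELS = ("HS", "TR", "AG")

def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def flag_hate(logits, threshold=0.5):
    """Return per-label probabilities and whether any label reaches the threshold."""
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    return probs, any(p >= threshold for p in probs.values())

# Hypothetical logits for a single tweet.
probs, hateful = flag_hate([2.0, -1.0, -3.0])
print(probs, hateful)
```

Because the labels are scored independently, a sigmoid per label is used here rather than a softmax across labels; the threshold is a tuning choice, not a value given in the assignment.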

This problem requires no training; you only need to run the transformer model mentioned above.

Which users have been using hate speech frequently in the last few days? Could they be bots?
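One possible way to approach the first question is to count flagged tweets per author. The sketch below assumes each tweet has already been reduced to a hypothetical `(author_id, is_hateful)` pair; the function name and the `min_tweets` cutoff are illustrative choices, not part of the assignment.

```python
from collections import Counter

def top_hate_authors(rows, min_tweets=3):
    """rows: iterable of (author_id, is_hateful) pairs.

    Returns (author_id, count) pairs for authors with at least
    `min_tweets` flagged tweets, most frequent first.
    """
    counts = Counter(author for author, hateful in rows if hateful)
    return [(author, n) for author, n in counts.most_common() if n >= min_tweets]

# Toy data with made-up author ids.
rows = [("u1", True), ("u2", False), ("u1", True),
        ("u3", True), ("u1", True), ("u3", False)]
frequent = top_hate_authors(rows)
print(frequent)
```

A high count alone does not show that an account is a bot; it only narrows down which accounts deserve a closer look (posting frequency, account age, repeated text, and so on).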

Build a pipeline (a function, a script, etc.) whose collection period can be varied, so that you can identify different moments in which this kind of speech is being used.
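To make the collection period a parameter rather than a hard-coded list, the start and end timestamps can be generated from an end date and a number of days. This is a minimal sketch; `daily_windows` is a hypothetical helper, and the timestamp format simply mirrors the strings used elsewhere in this notebook.

```python
from datetime import datetime, timedelta, timezone

def daily_windows(end, days):
    """Return (start, end) timestamp strings for each of the `days` days before `end`."""
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    windows = []
    for offset in range(days, 0, -1):
        start = end - timedelta(days=offset)
        stop = start + timedelta(days=1)
        windows.append((start.strftime(fmt), stop.strftime(fmt)))
    return windows

# Example: the three days leading up to 6 May 2022 (UTC).
end = datetime(2022, 5, 6, tzinfo=timezone.utc)
windows = daily_windows(end, 3)
for start, stop in windows:
    print(start, "->", stop)
```

The windows produced this way do not overlap, which avoids collecting the same tweet twice. Keep in mind that the recent search endpoint only reaches back about a week, so `days` should stay within that range.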

For this part, the HF documentation or chapters 2 and 3 of the book [Natural Language Processing with Transformers](https://github.com/nlp-with-transformers/notebooks) may be useful.

### Part 2. Countering Hate

Since hate is most likely being spread through tweets, the goal of this part is to create a bot that generates joyful speech.

RoBERTuito also has a model for emotion detection: [robertuito-emotion-analysis](https://huggingface.co/pysentimiento/robertuito-emotion-analysis). The idea is to use a strategy similar to that of Part 1 to identify which Spanish-language tweets from the last week receive the label "joy" (not necessarily limited to Colombia). Save these tweets, because they will be used to build a joy generator.
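The filtering step could look roughly like the sketch below, assuming the emotion model is single-label, so a softmax over its scores gives one probability per emotion. The shortened label list, the example logits, and the `min_prob` cutoff are all hypothetical; the real labels and their order come from the model's config.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keep_joy(tweets, logits_per_tweet, labels, min_prob=0.5):
    """Keep tweets whose most likely label is "joy" with probability >= min_prob."""
    kept = []
    for text, logits in zip(tweets, logits_per_tweet):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if labels[best] == "joy" and probs[best] >= min_prob:
            kept.append(text)
    return kept

labels = ["others", "joy", "sadness"]  # hypothetical, shortened label list
tweets = ["que dia tan lindo", "hoy hay reunion", "estoy triste"]
logits = [[0.1, 3.0, -1.0], [2.0, 0.5, 0.0], [-1.0, 0.0, 2.5]]
joy_tweets = keep_joy(tweets, logits, labels)
print(joy_tweets)
```

Requiring a minimum probability in addition to the top label trades dataset size for cleaner training text; the cutoff is worth tuning on a small hand-checked sample.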

Next, use a GPT-2-based transformer model (for example datificate/gpt2-small-spanish, PlanTL-GOB-ES/gpt2-large-bne, or DeepESP/gpt2-spanish) and fine-tune it for text generation on the joy dataset collected above (use very few epochs, 2 or 3). Evaluate the results on the validation set using a language-model metric (for example, perplexity).
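For reference, perplexity is the exponential of the average per-token negative log-likelihood on the validation set. The sketch below computes it from a list of per-token losses; the input values are made up to give a result that is easy to check by hand.

```python
import math

def perplexity(token_nlls):
    """token_nlls: negative log-likelihood (natural log) of each validation token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
nlls = [-math.log(0.25)] * 8
ppl = perplexity(nlls)
print(ppl)
```

When the evaluation loop already reports a mean cross-entropy loss over tokens (in nats), the same quantity is simply `math.exp(loss)`. Lower is better, and values are only comparable between models that share a tokenizer.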

Once you have the generation model, put it into production in one of the following two ways: (1) build an interface for text generation, or (2) use the Twitter API to post generated text automatically.

For this part I recommend reviewing the following:

[1] Chapter 5 of the book [Natural Language Processing with Transformers](https://github.com/nlp-with-transformers/notebooks)

[2] [This notebook on text generation](https://github.com/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)

[3] [This notebook on fine-tuning a text generation model](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)


-----
It is important to produce plots and visualizations that help with interpretation. Do not forget to analyze and comment on your findings as you go and, above all, to **draw conclusions**. The deliverable is a properly presented and commented Jupyter notebook.

In [4]:
!pip install transformers

In [None]:
!pip install pysentimiento

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoConfig
from pysentimiento.preprocessing import preprocess_tweet
import re
import os
import pandas as pd
import csv
import requests
import dateutil.parser
import time
import torch
import torch.nn.functional as F
from datasets import load_dataset, Dataset, ClassLabel

In [None]:
# Set your own Twitter API bearer token here; never publish a real token in a shared notebook
os.environ['TOKEN'] = 'YOUR_BEARER_TOKEN'

In [None]:
def auth():
    return os.getenv('TOKEN')

def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

def create_url(keyword, start_date, end_date, max_results = 10):
    
    search_url = "https://api.twitter.com/2/tweets/search/recent" #Change to the endpoint you want to collect data from
    # search_url = "https://api.twitter.com/2/tweets/search/all" # With an academic research access

    #change params based on the endpoint you are using
    query_params = {'query': keyword,
                    'start_time': start_date,
                    'end_time': end_date,
                    'max_results': max_results,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': {}}
    return (search_url, query_params)

def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    # print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

In [None]:
bearer_token = auth()
headers = create_headers(bearer_token)

start_time = "2022-04-21T00:00:00.000Z"
end_time = "2022-04-26T00:00:00.000Z"
max_results = 100

In [None]:
def append_to_csv(json_response, fileName):
    """Append one CSV row per tweet in `json_response` to `fileName`."""

    # A counter variable
    counter = 0

    # Open or create the target CSV file; the context manager closes it even on error
    with open(fileName, "a", newline="", encoding='utf-8') as csvFile:
        csvWriter = csv.writer(csvFile)

        # Loop through each tweet
        for tweet in json_response['data']:

            # Some of the keys might not exist for some tweets,
            # so optional fields are read with .get() and a default value

            # 1. Author ID
            author_id = tweet['author_id']

            # 2. Time created
            created_at = dateutil.parser.parse(tweet['created_at'])

            # 3. Geolocation (guards against a missing 'geo' object or a missing place_id)
            geo = tweet.get('geo', {}).get('place_id', " ")

            # 4. Tweet ID
            tweet_id = tweet['id']

            # 5. Language
            lang = tweet['lang']

            # 6. Tweet metrics
            retweet_count = tweet['public_metrics']['retweet_count']
            reply_count = tweet['public_metrics']['reply_count']
            like_count = tweet['public_metrics']['like_count']
            quote_count = tweet['public_metrics']['quote_count']

            # 7. Source (not guaranteed to be present)
            source = tweet.get('source', " ")

            # 8. Tweet text
            text = tweet['text']

            # Assemble all data in a list
            res = [author_id, created_at, geo, tweet_id, lang, like_count, quote_count, reply_count, retweet_count, source, text]

            # Append the result to the CSV file
            csvWriter.writerow(res)
            counter += 1

    # Print the number of tweets for this iteration
    # print("# of Tweets added from this response: ", counter)

In [None]:
def json_csv(keyword, name, start_list=None, end_list=None, max_results=100, max_count=500):
    """Collect tweets matching `keyword` for each (start, end) window into `name`.csv."""
    # Inputs for tweets
    bearer_token = auth()
    headers = create_headers(bearer_token)

    # Default collection windows; pass start_list/end_list to vary the period.
    # Note that these windows overlap by one day, so a tweet can be collected twice.
    if start_list is None:
        start_list = ['2022-05-01T00:00:00.000Z',
                      '2022-05-02T00:00:00.000Z',
                      '2022-05-03T00:00:00.000Z',
                      '2022-05-04T00:00:00.000Z']
    if end_list is None:
        end_list = ['2022-05-03T00:00:00.000Z',
                    '2022-05-04T00:00:00.000Z',
                    '2022-05-05T00:00:00.000Z',
                    '2022-05-06T00:00:00.000Z']

    # Total number of tweets we collected from the loop
    total_tweets = 0

    # Create the file and write the header row ("w" avoids a duplicated header on re-runs)
    with open(name + '.csv', "w", newline="", encoding='utf-8') as csvFile:
        csvWriter = csv.writer(csvFile)
        # Column names for the fields saved by append_to_csv
        csvWriter.writerow(['author id', 'created_at', 'geo', 'id', 'lang', 'like_count', 'quote_count', 'reply_count', 'retweet_count', 'source', 'tweet'])

    for start, end in zip(start_list, end_list):
        count = 0          # Tweets collected in this time period (capped at max_count)
        next_token = None  # Pagination token; None requests the first page

        while count < max_count:
            url, params = create_url(keyword, start, end, max_results)
            json_response = connect_to_endpoint(url, headers, params, next_token)
            result_count = json_response['meta']['result_count']

            if result_count is not None and result_count > 0:
                append_to_csv(json_response, name + ".csv")
                count += result_count
                total_tweets += result_count

            # Pause between requests to stay under the rate limit
            time.sleep(1)

            # Save the token for the next page; without one this was the final request
            next_token = json_response['meta'].get('next_token')
            if next_token is None:
                break
    # print("Total number of results: ", total_tweets)

In [None]:
keyword = 'colombia lang:es -is:retweet'

In [None]:
json_csv(keyword, 'tweets')
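The default windows in `json_csv` each span two days and are shifted by one day, so they overlap and the same tweet may be written more than once. A minimal sketch of removing duplicates by tweet id is shown below; it works on plain dictionaries, and `dedupe_by_id` is a hypothetical helper rather than part of the original pipeline.

```python
def dedupe_by_id(rows, key="id"):
    """Keep the first occurrence of each row, identified by row[key]."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Toy rows standing in for lines of tweets.csv.
rows = [{"id": "1", "tweet": "a"},
        {"id": "2", "tweet": "b"},
        {"id": "1", "tweet": "a"}]
unique_rows = dedupe_by_id(rows)
print(len(rows), "->", len(unique_rows))
```

Once the CSV is loaded into pandas, `df.drop_duplicates(subset="id")` achieves the same effect in one call.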

In [None]:
df = pd.read_csv("tweets.csv")

In [None]:
ds = Dataset.from_pandas(df)
ds = ds.rename_columns({'tweet': 'text', 'author id': 'author_id'})

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu' 
tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-hate-speech')
model = (AutoModelForSequenceClassification.from_pretrained('pysentimiento/robertuito-hate-speech')
         .to(device))

In [None]:
# Preprocessed dataset
ds_pre = ds.map(lambda x: {'text':preprocess_tweet(x['text'])})

def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, max_length=128)

ds_pre = ds_pre.map(tokenize,batched=True)

In [None]:
training_args = TrainingArguments(output_dir="test_trainer",
                   per_device_eval_batch_size=64)     

In [None]:
trainer = Trainer(
    model= model.to(device), 
    args=training_args,
    train_dataset=ds_pre,
    eval_dataset=ds_pre,
)

In [None]:
config = AutoConfig.from_pretrained("pysentimiento/robertuito-hate-speech")
config.id2label

In [None]:
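preds = trainer.predict(ds_pre)
# The model is multi-label, so score each label independently with a sigmoid
# instead of picking a single class with argmax (0.5 is an adjustable threshold)
probs = torch.sigmoid(torch.tensor(preds.predictions))
pred_labels = [[config.id2label[i] for i, p in enumerate(row) if p > 0.5] for row in probs]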

The following columns in the test set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: retweet_count, lang, id, source, author_id, reply_count, quote_count, like_count, geo, created_at, text. If retweet_count, lang, id, source, author_id, reply_count, quote_count, like_count, geo, created_at, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6852
  Batch size = 64


KeyboardInterrupt: ignored

In [None]:
preds

In [None]:
# pred_labels

## Part 2

In [None]:
# !pip install -q git+https://github.com/huggingface/transformers.git
# !pip install -q tensorflow==2.1

In [39]:
!pip install --ignore-installed --upgrade tensorflow

In [6]:
import tensorflow as tf
from transformers import GPT2Tokenizer
import random
from IPython.display import display, HTML
from transformers import TFGPT2LMHeadModel

In [7]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [8]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('We are happy to be here', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
We are happy to be here with you. We are here to help you. We are here to help you. We are here to help you. We are here to help you. We are here to help you. We are here to help you


In [9]:
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: We are happy to be here to provide you with a great experience. I hope you're as excited for your next game as I am, that you enjoy it, and that you enjoy doing this for me.

I will be making this part
1: We are happy to be here to help you understand how a great community can help you live up to your mission, while protecting what makes your community unique.

Learn More

Become a member of the Redwood Forest Community

Red
2: We are happy to be here with you." "Oh...okay?" "Well...you will have my attention for a while." "I will keep you there, just in case." "That'll do, isn't it?" "Oh no
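To show what `top_k` and `top_p` do in the sampling cell above, the sketch below applies both filters to a small hand-written distribution. It is a simplified illustration, not the library's implementation: the cumulative mass is measured on the original probabilities, and the toy vocabulary is made up.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the shortest prefix of them whose
    cumulative probability reaches top_p, and renormalise what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution.
probs = {"here": 0.5, "with": 0.3, "today": 0.15, "zebra": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.75)
print(filtered)
```

The next token is then sampled from the filtered distribution, which removes the unlikely tail while still allowing variety between runs.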


language_modeling

In [10]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [11]:
show_random_elements(ds)

NameError: name 'ds' is not defined

In [None]:
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [None]:
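# GPT-2 tokenizers define no pad token, and group_texts concatenates the sequences anyway,
# so tokenize without padding and drop every original column, keeping only the token ids
tokenized_datasets = ds.map(lambda batch: tokenizer(batch["text"]), batched=True, num_proc=4, remove_columns=ds.column_names)
block_size = 128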

In [None]:
tokenized_datasets

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
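A toy version of the grouping step may make its behaviour easier to see. The sketch below uses plain lists of integers as stand-ins for token ids and a tiny block size; `group_into_blocks` is a hypothetical simplification that handles a single column.

```python
def group_into_blocks(token_lists, block_size):
    """Concatenate all token lists, drop the remainder, and split into equal blocks."""
    flat = [tok for toks in token_lists for tok in toks]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

# Three "tweets" of different lengths, 9 tokens in total.
blocks = group_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)
```

Every column passed to `group_texts` must therefore be a list of token lists; any leftover string or numeric column from the original CSV breaks the concatenation step.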

This does not work unless the data is cleaned so that only the text is fed in.

It has to be done with a Dataset, though, and I could not get it to work.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4
)

In [None]:
txt = ds['text']

In [None]:
# ds is a single Dataset with no "train" split, so index the rows directly
tokenizer.decode(lm_datasets[1]["input_ids"])