# Spanish NLP: Classify Notebook

For more information, visit the [spanish_nlp](https://github.com/jorgeortizfuentes/spanish_nlp) repository on GitHub.


## Available models

| **Model name**     | **Sources**                            |
| ------------------ | -------------------------------------- |
| hate_speech        | bert, robertuito                       |
| incivility         | bert                                   |
| toxic_speech       | political-tweets-es                    |
| sentiment_analysis | robertuito                             |
| emotion_analysis   | robertuito                             |
| irony_analysis     | robertuito                             |
| sexist_analysis    | sexist_analysis_metwo                  |
| racism_analysis    | racism_paula_lobo_et_al_average_strict |


## Quick usage


In [4]:
from spanish_nlp import SpanishClassifier

sc = SpanishClassifier(model_name="hate_speech", device="cpu")
t1 = "LAS MUJERES Y GAYS DEBERÍAN SER EXTERMINADOS"
t2 = "El presidente convocó a una reunión a los representantes de los partidos políticos"
p1 = sc.predict(t1)
p2 = sc.predict(t2)

print("Text 1: ", t1)
print("Prediction 1: ", p1)
print("Text 2: ", t2)
print("Prediction 2: ", p2)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Text 1:  LAS MUJERES Y GAYS DEBERÍAN SER EXTERMINADOS
Prediction 1:  {'no_hate': 0.8702718019485474, 'hate': 0.12972821295261383}
Text 2:  El presidente convocó a una reunión a los representantes de los partidos políticos
Prediction 2:  {'no_hate': 0.9976341724395752, 'hate': 0.002365861786529422}
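
The `huggingface/tokenizers` warning printed in the output above names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it runs in the first cell of the notebook, before `spanish_nlp` is imported and any model is loaded:

```python
import os

# Assumption: this runs before spanish_nlp / transformers are imported,
# so the tokenizers library sees the value before any parallel work starts.
# "false" disables tokenizer parallelism, which avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"true"` instead would keep parallelism enabled; which option suits a given notebook depends on how it forks processes.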


## Apply classification to a dataset in pandas


### Load dataset


In [5]:
import pandas as pd

# Create DataFrame

texts = [
    "Deberían ser exterminados los pueblos indígenas",
    "El presidente convocó a una reunión a los representantes de los partidos políticos",
    "Los pingüinos son animales",
    "La vacuna contra el covid-19 ya está disponible",
    "Hay que matar a todos los extranjeros",
]

df = pd.DataFrame(texts, columns=["text"])

### Preprocess dataset


In [None]:
# Preprocess texts

from spanish_nlp import SpanishPreprocess

sp = SpanishPreprocess(
    lower=False,
    remove_url=True,
    remove_hashtags=False,
    split_hashtags=True,
    normalize_breaklines=True,
    remove_emoticons=False,
    remove_emojis=False,
    convert_emoticons=False,
    convert_emojis=False,
    normalize_inclusive_language=True,
    reduce_spam=True,
    remove_vowels_accents=True,
    remove_multiple_spaces=True,
    remove_punctuation=True,
    remove_unprintable=True,
    remove_numbers=True,
    remove_stopwords=False,
    stopwords_list=None,
    lemmatize=False,
    stem=False,
    remove_html_tags=True,
)

df["text"] = df["text"].apply(sp.transform)

df = df[df.text.notnull()]
df = df[df.text != ""]
df = df[df["text"].apply(lambda x: isinstance(x, str))]
df = df.reset_index(drop=True)

### Classify dataset

#### Models:

- hate_speech
- incivility
- sentiment_analysis
- emotion_analysis
- irony_analysis
- sexist_analysis
- racism_analysis


In [7]:
from datetime import datetime

def predict_label(text, model):
    try:
        return model.predict(text)
    except Exception as e:
        # Log the failure with an ISO-style timestamp and return None explicitly
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{timestamp} - {e}")
        return None


classifiers_names = [
    "hate_speech",
    "incivility",
    "sentiment_analysis",
    "emotion_analysis",
    "irony_analysis",
    "sexist_analysis",
    "racism_analysis",
]
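classifiers = {}

for n in classifiers_names:
    # Load each model once and keep a reference to it in `classifiers`
    c = SpanishClassifier(model_name=n, device="cpu")
    classifiers[n] = c
    # Use predict_label so a failing text is logged instead of aborting the run
    df[n] = df["text"].apply(lambda x: predict_label(x, c))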

In [8]:
df

|   | text | hate_speech | incivility | sentiment_analysis | emotion_analysis | irony_analysis | sexist_analysis | racism_analysis |
| - | ---- | ----------- | ---------- | ------------------ | ---------------- | -------------- | --------------- | --------------- |
| 0 | Deberian ser exterminados los pueblos indigenas | {'no_hate': 0.954975426197052, 'hate': 0.04502... | {'no_incivility': 0.6234936118125916, 'incivil... | {'negative': 0.8032279014587402, 'neutral': 0.... | {'others': 0.748774528503418, 'anger': 0.16288... | {'not_ironic': 0.9995823502540588, 'ironic': 0... | {'not_sexist': 0.9762647747993469, 'sexist': 0... | {'non-racist': 0.999099612236023, 'racist': 0.... |
| 1 | El presidente convoco a una reunion a los repr... | {'no_hate': 0.9978753328323364, 'hate': 0.0021... | {'no_incivility': 0.8898597955703735, 'incivil... | {'neutral': 0.8114618062973022, 'positive': 0.... | {'others': 0.9919043183326721, 'joy': 0.002639... | {'not_ironic': 0.9993013143539429, 'ironic': 0... | {'not_sexist': 0.9759377837181091, 'sexist': 0... | {'non-racist': 0.9996436834335327, 'racist': 0... |
| 2 | Los pinguinos son animalos | {'no_hate': 0.9705660343170166, 'hate': 0.0294... | {'incivility': 0.5072245597839355, 'no_incivil... | {'positive': 0.5787491798400879, 'neutral': 0.... | {'others': 0.9116768836975098, 'joy': 0.024299... | {'not_ironic': 0.7218023538589478, 'ironic': 0... | {'not_sexist': 0.9535900950431824, 'sexist': 0... | {'non-racist': 0.9981189370155334, 'racist': 0... |
| 3 | La vacuna contra el covid ya esta disponible | {'no_hate': 0.998217761516571, 'hate': 0.00178... | {'no_incivility': 0.9326367974281311, 'incivil... | {'positive': 0.5552893280982971, 'neutral': 0.... | {'others': 0.9687969088554382, 'joy': 0.019537... | {'not_ironic': 0.9697375297546387, 'ironic': 0... | {'not_sexist': 0.9818084836006165, 'sexist': 0... | {'non-racist': 0.9996614456176758, 'racist': 0... |
| 4 | Hay que matar a todos los extranjeros | {'hate': 0.8858439922332764, 'no_hate': 0.1141... | {'no_incivility': 0.7517166137695312, 'incivil... | {'negative': 0.7249141931533813, 'neutral': 0.... | {'anger': 0.6267445683479309, 'disgust': 0.309... | {'not_ironic': 0.9974295496940613, 'ironic': 0... | {'not_sexist': 0.9626052379608154, 'sexist': 0... | {'racist': 0.9961186647415161, 'non-racist': 0... |
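
Each prediction shown in this notebook is a dictionary that maps a label to a score, so a common next step is to reduce it to a single label. The sketch below is one way to do that; `top_label` is a hypothetical helper written for this example and is not part of `spanish_nlp`.

```python
def top_label(prediction):
    """Return the (label, score) pair with the highest score.

    Hypothetical helper, not part of spanish_nlp. It assumes `prediction`
    is a dict of label -> score, as printed in the outputs of this notebook.
    """
    return max(prediction.items(), key=lambda item: item[1])


# Scores copied from "Prediction 1" in the quick usage output
p1 = {"no_hate": 0.8702718019485474, "hate": 0.12972821295261383}

label, score = top_label(p1)
print(label, round(score, 3))
```

Assuming every cell in a column holds such a dictionary, the same helper could be applied column-wise, for example `df["hate_speech"].apply(lambda p: top_label(p)[0])`. Taking the highest score is only one possible decision rule; a task-specific threshold on the positive label may be preferable.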
