<a href="https://colab.research.google.com/github/manualrg/dslab-nlp-dl/blob/master/09_intronlp_tut.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Classification: Text Encoding

This notebook tries the following approaches for encoding a text corpus for supervised classification:
1. Averaging Word2Vec (W2V) word vectors
2. Multilingual embeddings (bi-encoders)
3. Stemming to build a document-term matrix (DTM)
4. Lemmatization to build a DTM


To keep the comparison between experiments homogeneous, every experiment uses a `LogisticRegression` model without hyperparameter tuning.

In [1]:
import typing

import numpy as np
import pandas as pd
import sklearn
import nltk

from sklearn import preprocessing
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
import spacy
from sentence_transformers import SentenceTransformer


In [2]:
!python -m spacy download es_core_news_lg

Collecting es-core-news-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-3.8.0/es_core_news_lg-3.8.0-py3-none-any.whl (568.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 568.0/568.0 MB 2.8 MB/s eta 0:00:00
Installing collected packages: es-core-news-lg
Successfully installed es-core-news-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
!pip install unidecode

In [3]:
print(f"{sklearn.__version__=}")
print(f"{nltk.__version__=}")
print(f"{spacy.__version__=}")
print(f"{pd.__version__=}")

sklearn.__version__='1.6.1'
nltk.__version__='3.9.1'
spacy.__version__='3.8.7'
pd.__version__='2.2.2'


In [4]:
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:
RND_SEED = 123
PCT_TEST = 0.2

# Dataset

In [6]:
df_data = pd.read_csv("hf://datasets/MariaIsabel/FR_NFR_Spanish_requirements_classification/New Spanish Academic Dataset.csv")
# this dataset is public, so no Hugging Face account is needed
df_data.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Unnamed: 0,PROJECT,REQUIREMENT,FINAL_LABEL
0,16,Poder crear un usuario y acceder a través de é...,NF
1,16,Poder crear un perfil personal e individual a ...,F
2,16,Acceder a la aplicación y a sus funcionalidade...,NF
3,16,Todos los datos introducidos podrán ser leídos...,NF
4,16,"Poder leer, eliminar, editar o incluir cualqui...",F


In [7]:
df_data['FINAL_LABEL'].value_counts(normalize=True)

FINAL_LABEL,proportion
F,0.771208
NF,0.228792


In [8]:
# Binary target: 1 = non-functional (NF) requirement, 0 = functional (F).
# map() avoids the downcasting FutureWarning that replace() raises here.
df_data['y_is_nf'] = df_data['FINAL_LABEL'].map({"F": 0, "NF": 1})
df_data['x_text'] = df_data['REQUIREMENT']


# Split

In [9]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    df_data["x_text"],
    df_data["y_is_nf"],
    test_size=PCT_TEST,
    random_state=RND_SEED,
    stratify=df_data["y_is_nf"]
)
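
To illustrate what `stratify` is meant to achieve, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper, and the class counts (300 F, 89 NF) are inferred from the proportions printed above and the 311 training rows reported further down, so treat them as an assumption.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the assumed class balance: 300 F (0) and 89 NF (1).
labels = [0] * 300 + [1] * 89
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=123)

# Share of the positive (NF) class in each part.
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_share, 3), round(test_share, 3))
```

Both parts end up with an NF share close to 23%, which matters here because the positive class is the minority.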

# Feature Engineering

## Embedding Models

In [83]:
# https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
# multilingual
# output dim: 384
# longer inputs are truncated to the model's maximum sequence length (see model_st.max_seq_length)
model_st = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


In [40]:
nlp = spacy.load("es_core_news_lg")

In [41]:
sentences = X_train[:2].to_list()
print(sentences)

['Respuestas coherentes e idénticas ante entradas de audio o texto: Los usuarios tienen la posibilidad de escuchar la respuesta mediante voz, esta ha de ser entendida e idéntica a la respuesta por escrito.', 'Gestión de usuarios: Todos los administradores tienen los mismos permisos y privilegios entre ellos']


## Bi-encoder

In [85]:
embeddings_st = model_st.encode(sentences,
                          normalize_embeddings=True
                          )

embeddings_st.shape

(2, 384)

In [88]:
X_train_st = model_st.encode(X_train.values,
                          normalize_embeddings=True
                          )

X_test_st = model_st.encode(X_test.values,
                          normalize_embeddings=True
                          )
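
One consequence of `normalize_embeddings=True` is that every row has unit L2 norm, so a plain dot product between two rows equals their cosine similarity. The sketch below checks this with NumPy on random stand-in vectors (hypothetical data, 8 dimensions instead of 384), so it runs without downloading the model.

```python
import numpy as np

# Hypothetical stand-ins for two sentence embeddings.
rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 8))

# L2-normalize each row, which is the effect of normalize_embeddings=True.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Cosine similarity computed from the raw vectors...
cosine = raw[0] @ raw[1] / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1]))
# ...matches the plain dot product of the normalized vectors.
dot_of_unit = unit[0] @ unit[1]

print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))
print(np.isclose(cosine, dot_of_unit))
```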

## Word2Vec

In [46]:
def get_spacy_v2w(X, nlp):
  """Encode each text as its spaCy document vector (average of its token vectors)."""
  lst_embeddings_sp = []

  for doc in nlp.pipe(X):
      lst_embeddings_sp.append(
          doc.vector  # np.ndarray 1d
          )

  # stack the list of 1d arrays into a single 2d array
  embeddings_sp = np.stack(lst_embeddings_sp)   # np.ndarray 2d [# docs, dims]

  return embeddings_sp

# quick check on the two sample sentences
embeddings_sp = get_spacy_v2w(sentences, nlp)
embeddings_sp

In [47]:
X_train_sp = get_spacy_v2w(X_train.tolist(), nlp)
X_test_sp = get_spacy_v2w(X_test.tolist(), nlp)
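
The idea behind this encoding, one document vector obtained by averaging word vectors, can be sketched without spaCy. The vocabulary and 3-dimensional vectors below are invented for illustration, and skipping unknown tokens is a choice of this sketch, not necessarily what spaCy does.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; real spaCy vectors are much larger.
toy_vectors = {
    "gestion": np.array([1.0, 0.0, 2.0]),
    "de": np.array([0.0, 1.0, 0.0]),
    "usuarios": np.array([2.0, 2.0, 1.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = average_vector(["gestion", "de", "usuarios"], toy_vectors)
print(doc_vec)  # [1. 1. 1.]
```

Averaging discards word order, which is one reason a sentence-level bi-encoder can separate texts that an averaged representation cannot.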

## Stemming

In [59]:
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenization and stemming
class SpanishStemTokenizer:
    def __init__(self):
        self.stemmer = SnowballStemmer("spanish")

    def __call__(self, text) -> typing.List[str]:
        return [self.stemmer.stem(word) for word in word_tokenize(text) if word not in string.punctuation]


stemmer = SpanishStemTokenizer()

stemmer(sentences[0])

['respuest',
 'coherent',
 'e',
 'ident',
 'ante',
 'entrad',
 'de',
 'audi',
 'o',
 'text',
 'los',
 'usuari',
 'tien',
 'la',
 'posibil',
 'de',
 'escuch',
 'la',
 'respuest',
 'mediant',
 'voz',
 'esta',
 'ha',
 'de',
 'ser',
 'entend',
 'e',
 'ident',
 'a',
 'la',
 'respuest',
 'por',
 'escrit']

In [61]:
tokenizer_es = SpanishStemTokenizer()
stopwords_es = nltk.corpus.stopwords.words('spanish')

stopwords_es_tok = list(set([tokenizer_es(term.lower())[0] for term in stopwords_es]))


tfidf_stem = TfidfVectorizer(
    strip_accents="ascii",
    lowercase=True,
    tokenizer=stemmer,
    stop_words=stopwords_es_tok,
    analyzer="word",
    ngram_range=(1, 1),
    min_df=5,
    max_df=0.95
)

X_train_stem = tfidf_stem.fit_transform(X_train)
X_test_stem = tfidf_stem.transform(X_test)

X_train_stem.shape

(311, 207)
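
To make the weighting concrete, the following standard-library sketch reproduces what `TfidfVectorizer` is documented to compute with its default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. The helper `tfidf` and the three pre-stemmed toy documents are hypothetical, and only `min_df` is modelled (not `max_df` or stop words).

```python
import math
from collections import Counter

def tfidf(docs, min_df=1):
    """Return (vocab, rows) with smoothed-idf, L2-normalized TF-IDF weights."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

# Toy corpus of already-stemmed tokens.
docs = [
    ["usuari", "acced", "sistem"],
    ["usuari", "cre", "perfil"],
    ["sistem", "respond", "usuari"],
]
vocab, rows = tfidf(docs, min_df=2)
print(vocab)  # ['sistem', 'usuari']
print(rows[0])
```

With `min_df=2` only the terms present in at least two documents survive, which is the same mechanism that reduces the real vocabulary to a few hundred columns with `min_df=5`. The rarer surviving term (`sistem`) gets a larger weight than the term found in every document (`usuari`).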

In [96]:
pd.Series(tfidf_stem.vocabulary_)[:5]

term,index
respuest,176
entrad,78
usuari,198
posibil,159
mediant,121


## Lemmatization

In [66]:
# Callable tokenizer for scikit-learn that returns spaCy lemmas

from unidecode import unidecode

# Tokenization and lemmatization
class SpanishLemmaTokenizer:
    def __init__(self, nlp):
        self.nlp = nlp
        self._max_input_len = nlp.max_length  # 1000000
        self._min_token_len = 2

    def __call__(self, text) -> typing.List[str]:
        doc = self.nlp(text[:self._max_input_len])  # truncate the input to the maximum length the spaCy model accepts
        lemmas = [unidecode(token.lemma_) for token in doc if token.is_alpha
                  and len(token) > self._min_token_len
                  and not token.is_stop
                  and not token.like_email
                  and not token.like_url
                  and not token.is_currency
                  and token.ent_type_ not in ['PER', 'LOC', 'ORG']
                  ]
        return lemmas

lemmatizer = SpanishLemmaTokenizer(nlp)

lemmatizer(sentences[0])

['respuesta',
 'coherente',
 'identico',
 'entrada',
 'audio',
 'texto',
 'usuario',
 'posibilidad',
 'escuchar',
 'respuesta',
 'voz',
 'entender',
 'identico',
 'respuesta',
 'escrito']

In [69]:
tfidf_lemma = TfidfVectorizer(
    tokenizer=lemmatizer,
    stop_words=None,
    analyzer="word",
    ngram_range=(1, 1),
    min_df=5,
    max_df=0.95
)

X_train_lemma = tfidf_lemma.fit_transform(X_train)
X_test_lemma = tfidf_lemma.transform(X_test)

X_train_lemma.shape

(311, 172)

In [95]:
pd.Series(tfidf_lemma.vocabulary_)[:5]

term,index
respuesta,149
entrada,64
usuario,165
posibilidad,133
gestion,80


# Experiments

In [74]:
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

def compute_f1(X_test, y_test, skl_pl, positive_class = 1):
  y_pred = skl_pl.predict(X_test)

  return  f1_score(y_test, y_pred, pos_label=positive_class)
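
For reference, the F1 score on the positive class is the harmonic mean of precision and recall. The sketch below computes it from scratch on a small invented example; `f1_from_labels` is a hypothetical helper, not part of scikit-learn.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 of the positive class computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(f1_from_labels(y_true, y_pred))  # precision = recall = 2/3, so F1 is about 0.667
```

Because true negatives do not enter the formula, the score focuses on the minority NF class rather than on overall accuracy.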

## Bi-encoder

In [87]:
clf_st = LogisticRegression(random_state=RND_SEED)
clf_st.fit(X_train_st, y_train)

f1_score_st = compute_f1(X_test_st, y_test, clf_st)
print(f"{f1_score_st=}")

f1_score_st=0.8


## Word2Vec

In [77]:
clf_sp = LogisticRegression(random_state=RND_SEED, max_iter=3000)
clf_sp.fit(X_train_sp, y_train)

f1_score_sp = compute_f1(X_test_sp, y_test, clf_sp)
print(f"{f1_score_sp=}")

f1_score_sp=0.7142857142857143


## Stemming

In [78]:
clf_stem = LogisticRegression(random_state=RND_SEED, max_iter=3000)
clf_stem.fit(X_train_stem, y_train)

f1_score_stem = compute_f1(X_test_stem, y_test, clf_stem)
print(f"{f1_score_stem=}")

f1_score_stem=0.5925925925925926


## Lemmatization

In [80]:
clf_lemma = LogisticRegression(random_state=RND_SEED, max_iter=3000)
clf_lemma.fit(X_train_lemma, y_train)

f1_score_lemma = compute_f1(X_test_lemma, y_test, clf_lemma)
print(f"{f1_score_lemma=}")

f1_score_lemma=0.4166666666666667


# Benchmark

In [90]:
pd.Series(
    {
        "bi-encoder": f1_score_st,
        "word2vec": f1_score_sp,
        "stemming": f1_score_stem,
        "lemmatization": f1_score_lemma
    },
    name="f1_score"
)

encoding,f1_score
bi-encoder,0.8
word2vec,0.714286
stemming,0.592593
lemmatization,0.416667


# Conclusions

* Embedding models provide a quick way to encode texts for classification, with no encoding hyperparameters to tune. Here they gave the best scores (F1 of 0.80 for the bi-encoder and 0.71 for averaged word vectors).
* With careful tuning of the TF-IDF settings, a high-performing solution can also be obtained.
* Lemmatization is a very interpretable solution, but in practice it is harder to get results as good as with stemming (F1 of 0.42 versus 0.59 in this benchmark).