<img src="./assets/img/teclab_logo.png" alt="Teclab logo" width="170">

**Author**: Hector Vergara ([LinkedIn](https://www.linkedin.com/in/hector-vergara/))

**Repository**: [nlp_apis](https://github.com/hhvergara/nlp_apis)

**Python Notebook**: [API4.ipynb](https://github.com/hhvergara/nlp_apis/blob/main/API4.ipynb)

----

# API 4:

### Context

Once the text has a vector representation, the argument is that text preprocessing has produced structured data, and that this result can in turn serve as INPUT to a model.

Once again, someone on the team proposes applying a linear regression. We reply that this is not possible because the target of this problem is not numeric; the task calls for a supervised classification model instead.
Which models will be applied?

### Tasks

Machine learning model. Apply a machine learning model, one of those you already know, to the classification problem.

Any supervised learning model can be used, such as random forest, support vector machine, k-nearest neighbors (KNN), logistic regression, or Naïve Bayes. The model must be fit on the vectors of the training sample. Keep in mind that the target is multinomial, not binomial (especially for logistic regression).

Model evaluation. Using the predictions on the test sample, evaluate the model. To do so, compute performance metrics such as accuracy, recall, and precision. Interpret the results and present your conclusions.
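To make the three requested metrics concrete for a multiclass target, here is a minimal sketch on hypothetical labels (they are not taken from the dataset). It uses the same `average='weighted'` setting as the evaluation cells in this notebook, which averages per-class scores weighted by class support.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for illustration only (two examples per class)
y_true = ["negative", "neutral", "positive", "neutral", "positive", "negative"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "positive"]

accuracy = accuracy_score(y_true, y_pred)
# Weighted averaging: per-class score times class support, divided by the total
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

With weighted averaging, recall always equals accuracy, because each class's recall is multiplied by its support and the products sum to the number of correct predictions. Per-class values, as in `classification_report`, are therefore more informative than the weighted recall alone.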



In [1]:
# 1. Library Imports
import os
import nltk
import numpy as np
import pandas as pd
from pathlib import Path
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

__version__ = '0.0.1'
__email__ = 'hhvservice@gmail.com'
__author__ = 'Hector Vergara'
__annotations__ = 'https://www.linkedin.com/in/hector-vergara/'
__base_dir__ = Path().absolute()
__data_dir__ = os.path.join(__base_dir__, 'data')
filename_data = os.path.join(__data_dir__, 'sentiment_analysis_dataset.csv')

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_dat

True

### We download the "sentiment-analysis-dataset" dataset from Kaggle to run the tests.

Reference: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset/data

In [2]:
# Load the dataset
df = pd.read_csv(filename_data, sep=',', encoding='unicode_escape')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df.head(10)

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26
5,28b57f3990,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral,night,70-100,Antigua and Barbuda,97929,440.0,223
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,morning,0-20,Argentina,45195774,2736690.0,17
7,50e14c0bb8,Soooo high,Soooo high,neutral,noon,21-30,Armenia,2963243,28470.0,104
8,e050245fbd,Both of you,Both of you,neutral,night,31-45,Australia,25499884,7682300.0,3
9,fc2cbefa9d,Journey!? Wow... u just became cooler. hehe....,Wow... u just became cooler.,positive,morning,46-60,Austria,9006398,82400.0,109


In [3]:
print(f'''
Number of rows: {df.shape[0]}
Number of columns: {df.shape[1]}
''')


Number of rows: 27480
Number of columns: 10



## Data preprocessing

In [4]:
class NLPPreprocessor:

    tokenizer_pattern = (
            r'[\U0001F600-\U0001F64F]'          # classic emojis
            r'|[\U0001F300-\U0001F5FF]'         # nature, symbols
            r'|[\U0001F680-\U0001F6FF]'         # transport
            r'|[\U0001F1E0-\U0001F1FF]'         # Flags
            r'|[\U00002700-\U000027BF]'         # various symbols
            r'|[\U0001F900-\U0001F9FF]'         # gestures
            r'|[\U00002600-\U000026FF]'         # ☀☂
            r'|❤|🥰'                            # specific emojis
            r'|:\)'                             # emoticon :)
            r'|\b\w+\b'                         # words (alphanumeric)
        )

    def __init__(self, text_column: str):
        self.text_column = text_column

    def clean_tokenize_text(self, text: str) -> list:
        """Lowercase and tokenize text, keeping words, emojis, and the :) emoticon; other punctuation is dropped."""

        tokenizer = RegexpTokenizer(self.tokenizer_pattern)
        return tokenizer.tokenize(text.lower())

    def _get_wordnet_pos_(self, treebank_tag) -> str:
        """
        Converts nltk (Treebank) POS tags to WordNet tags.
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN  # By default, use NOUN if no match found


    def lemmatize_tokens(self, tokens: list) -> list:
        """Lemmatize tokens using POS tagging for greater accuracy."""
        lemmatizer = WordNetLemmatizer()
        pos_tags = pos_tag(tokens)  # e.g. [('the', 'DT'), ('children', 'NNS'), ...]
        return [
            lemmatizer.lemmatize(token, self._get_wordnet_pos_(pos))
            for token, pos in pos_tags
        ]

    def stem_tokens(self, tokens: list) -> list:
        """Stem tokens using PorterStemmer."""
        stemmer = PorterStemmer().stem
        return [stemmer(token) for token in tokens]

    def preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        df['tokens'] = df[self.text_column].astype(str).apply(self.clean_tokenize_text)
        df['lemmas'] = df['tokens'].apply(self.lemmatize_tokens)
        df['stems'] = df['tokens'].apply(self.stem_tokens)
        return df

In [5]:
# Example usage:
preprocessor = NLPPreprocessor(text_column='text')
processed_df = preprocessor.preprocess(df)
processed_df.head(10)


Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²),tokens,lemmas,stems
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60,"[i, d, have, responded, if, i, were, going]","[i, d, have, respond, if, i, be, go]","[i, d, have, respond, if, i, were, go]"
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105,"[sooo, sad, i, will, miss, you, here, in, san,...","[sooo, sad, i, will, miss, you, here, in, san,...","[sooo, sad, i, will, miss, you, here, in, san,..."
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18,"[my, boss, is, bullying, me]","[my, bos, be, bully, me]","[my, boss, is, bulli, me]"
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164,"[what, interview, leave, me, alone]","[what, interview, leave, me, alone]","[what, interview, leav, me, alon]"
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26,"[sons, of, why, couldn, t, they, put, them, on...","[son, of, why, couldn, t, they, put, them, on,...","[son, of, whi, couldn, t, they, put, them, on,..."
5,28b57f3990,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral,night,70-100,Antigua and Barbuda,97929,440.0,223,"[http, www, dothebouncy, com, smf, some, shame...","[http, www, dothebouncy, com, smf, some, shame...","[http, www, dothebounci, com, smf, some, shame..."
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,morning,0-20,Argentina,45195774,2736690.0,17,"[2am, feedings, for, the, baby, are, fun, when...","[2am, feeding, for, the, baby, be, fun, when, ...","[2am, feed, for, the, babi, are, fun, when, he..."
7,50e14c0bb8,Soooo high,Soooo high,neutral,noon,21-30,Armenia,2963243,28470.0,104,"[soooo, high]","[soooo, high]","[soooo, high]"
8,e050245fbd,Both of you,Both of you,neutral,night,31-45,Australia,25499884,7682300.0,3,"[both, of, you]","[both, of, you]","[both, of, you]"
9,fc2cbefa9d,Journey!? Wow... u just became cooler. hehe....,Wow... u just became cooler.,positive,morning,46-60,Austria,9006398,82400.0,109,"[journey, wow, u, just, became, cooler, hehe, ...","[journey, wow, u, just, become, cool, hehe, be...","[journey, wow, u, just, becam, cooler, hehe, i..."
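The `tokens` column above shows that backticks and punctuation runs are dropped while words survive. The sketch below reproduces that behavior on a hypothetical sentence using plain `re.findall`, on the assumption that `RegexpTokenizer` in its default (non-gap) mode returns the matches of its pattern. The pattern is the same alternation as `NLPPreprocessor.tokenizer_pattern`, with the two specific emojis written as escapes.

```python
import re

# Same alternation as NLPPreprocessor.tokenizer_pattern (emojis written as escapes)
TOKEN_PATTERN = (
    r'[\U0001F600-\U0001F64F]'      # classic emojis
    r'|[\U0001F300-\U0001F5FF]'     # nature, symbols
    r'|[\U0001F680-\U0001F6FF]'     # transport
    r'|[\U0001F1E0-\U0001F1FF]'     # flags
    r'|[\U00002700-\U000027BF]'     # various symbols
    r'|[\U0001F900-\U0001F9FF]'     # gestures
    r'|[\U00002600-\U000026FF]'     # weather and miscellaneous symbols
    r'|\u2764|\U0001F970'           # specific emojis (heart, smiling face with hearts)
    r'|:\)'                         # emoticon :)
    r'|\b\w+\b'                     # words (alphanumeric)
)

# Hypothetical input, not a row from the dataset
sample = "I`d LOVE this \u2764 :) !!!"
sample_tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(sample_tokens)
```

The heart emoji and `:)` are kept as their own tokens, while the backtick and `!!!` match no alternative and disappear.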


In [6]:
# Split the dataset into training and testing sets
train_df, test_df = train_test_split(processed_df, test_size=0.2, random_state=42)

x_train = train_df['text']
x_test = test_df['text']
y_train = train_df['sentiment']
y_test = test_df['sentiment']

print(f'''Number of rows in train: {train_df.shape[0]}
Number of columns in train: {train_df.shape[1]}
Number of rows in test: {test_df.shape[0]}
Number of columns in test: {test_df.shape[1]}
''')

Number of rows in train: 21984
Number of columns in train: 13
Number of rows in test: 5496
Number of columns in test: 13
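Because the target has three classes, one optional refinement is to stratify the split so that train and test keep the same class proportions. The sketch below is an assumption about how that could look, not what the notebook does; `toy_df` and its class counts are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for processed_df
toy_df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(30)],
    "sentiment": ["neutral"] * 15 + ["positive"] * 9 + ["negative"] * 6,
})

# stratify= keeps the class proportions of the target in both partitions
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["sentiment"]
)

print(toy_train["sentiment"].value_counts())
print(toy_test["sentiment"].value_counts())
```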



In [None]:
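# Model creation using TF-IDF vectorization.
# token_pattern keeps every whitespace-delimited token, ngram_range adds bigrams,
# and max_features caps the vocabulary at 30,000 terms (the shapes printed below,
# with 185,308 features, were recorded without this cap).
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[^\s]+', ngram_range=(1, 2), max_features=30000)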

# Vectorize the text data
tfidf_x_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_x_test = tfidf_vectorizer.transform(x_test)

print(f'''tfidf_x_train shape: {tfidf_x_train.shape}\ntfidf_x_test shape: {tfidf_x_test.shape}''')

tfidf_x_train shape: (21984, 185308)
tfidf_x_test shape: (5496, 185308)


In [8]:
# Get feature names from the TF-IDF vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix back to a DataFrame
tfidf_x_train_df = pd.DataFrame(tfidf_x_train.toarray(), columns=feature_names)
tfidf_x_test_df = pd.DataFrame(tfidf_x_test.toarray(), columns=feature_names)

print(f'''tfidf_x_train_df shape: {tfidf_x_train_df.shape}\ntfidf_x_test_df shape: {tfidf_x_test_df.shape}''')

MemoryError: Unable to allocate 7.59 GiB for an array with shape (5496, 185308) and data type float64
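The shape and dtype in the MemoryError above account for the requested size: 5496 × 185308 float64 values at 8 bytes each come to about 7.59 GiB. The sketch below reproduces that arithmetic and shows one possible alternative, pandas' sparse accessor, which builds a DataFrame without densifying the matrix. The `toy` matrix is a hypothetical stand-in for `tfidf_x_test`.

```python
import pandas as pd
from scipy import sparse


def dense_size_gib(n_rows: int, n_cols: int, itemsize: int = 8) -> float:
    """Memory needed by a dense array of the given shape, in GiB."""
    return n_rows * n_cols * itemsize / 2**30


# Shape taken from the MemoryError message above
test_dense_gib = dense_size_gib(5496, 185308)
print(f"Dense test matrix: {test_dense_gib:.2f} GiB")

# Hypothetical small sparse matrix standing in for tfidf_x_test
toy = sparse.random(5, 8, density=0.25, format="csr", random_state=0)
toy_columns = [f"term_{i}" for i in range(8)]

# Sparse-backed DataFrame: only the non-zero entries are stored
toy_df = pd.DataFrame.sparse.from_spmatrix(toy, columns=toy_columns)
print(toy_df.shape)
```

The classifiers in the following cells accept the SciPy sparse matrices directly, so the dense DataFrame is only needed for inspection; viewing a small slice such as `tfidf_x_train[:10]` is another way to keep memory use low.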

In [None]:
# Display the first 10 rows of the TF-IDF DataFrame
tfidf_x_train_df.head(10)


In [None]:
# Display the first 10 rows of the TF-IDF test DataFrame
tfidf_x_test_df.head(10)


In [None]:
# %% [markdown]
# ## ✅ Model training: Random Forest Classifier
# We train a supervised multiclass classification model on the TF-IDF vectors.

# %%
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score

# Create and train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(tfidf_x_train, y_train)

# Make predictions
y_pred = clf.predict(tfidf_x_test)

# Evaluation (weighted averages across the classes)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)

print("📊 Evaluation results:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}\n")

# Full report
print("📋 Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))


In [None]:
# %% [markdown]
# ## ✅ Alternative model: Naive Bayes (MultinomialNB)
# We now try a Naive Bayes model, which often performs well with TF-IDF features in text classification.

# %%
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score

# Training
nb = MultinomialNB()
nb.fit(tfidf_x_train, y_train)

# Prediction
y_pred_nb = nb.predict(tfidf_x_test)

# Evaluation (weighted averages across the classes)
accuracy = accuracy_score(y_test, y_pred_nb)
precision = precision_score(y_test, y_pred_nb, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred_nb, average='weighted', zero_division=0)

print("📊 Naive Bayes results:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}\n")

# Full report
print("📋 Classification Report:")
print(classification_report(y_test, y_pred_nb, zero_division=0))


In [None]:
# %% [markdown]
# ## ✅ Alternative model: Multinomial Logistic Regression
# We use an efficient linear model that supports multiclass classification.

# %%
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score

# Train the model
# Note: scikit-learn 1.5 and later deprecate multi_class; with solver='lbfgs' a
# multiclass target is already fit as a multinomial model when it is omitted.
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)
lr.fit(tfidf_x_train, y_train)

# Predictions
y_pred_lr = lr.predict(tfidf_x_test)

# Evaluation (weighted averages across the classes)
accuracy = accuracy_score(y_test, y_pred_lr)
precision = precision_score(y_test, y_pred_lr, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred_lr, average='weighted', zero_division=0)

print("📊 Logistic Regression results:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}\n")

# Full report
print("📋 Classification Report:")
print(classification_report(y_test, y_pred_lr, zero_division=0))
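The three evaluation cells above compute the same metrics with the same settings. One way to compare models side by side is a small helper that returns the metrics as a row of a table. The sketch below is an illustration only: `summarize`, the model names, and the labels are hypothetical and say nothing about how the real models perform.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize(name, y_true, y_pred):
    """Return accuracy and weighted precision/recall for one model as a dict."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
    }


# Hypothetical labels standing in for y_test and each model's predictions
toy_true = ["negative", "neutral", "positive", "neutral"]
toy_preds = {
    "model_a": ["negative", "neutral", "positive", "neutral"],
    "model_b": ["negative", "neutral", "neutral", "neutral"],
}

summary = pd.DataFrame(
    [summarize(name, toy_true, pred) for name, pred in toy_preds.items()]
).set_index("model")
print(summary.round(4))
```

In the notebook, the same helper could be called with `y_test` and each of `y_pred`, `y_pred_nb`, and `y_pred_lr`.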


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the search space
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # inverse regularization strength (smaller C = stronger regularization)
    'penalty': ['l2'],             # 'lbfgs' supports only 'l2' or no penalty
    'solver': ['lbfgs'],
    'multi_class': ['multinomial']
}

# Create the base model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Grid search with cross-validation (cv=3)
grid_search = GridSearchCV(log_reg, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(tfidf_x_train, y_train)
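The grid search above runs on vectors that were computed once on the full training set, so each validation fold shares vocabulary and IDF statistics with its training folds. A possible variant, sketched below under that assumption, wraps the vectorizer and the classifier in a `Pipeline` so that TF-IDF is refit inside every fold. The corpus, labels, and reduced grid are hypothetical and kept tiny so the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical corpus: four short texts per class
toy_texts = [
    "love it", "so happy", "great day", "really love this",
    "hate it", "so sad", "awful day", "really hate this",
    "it is ok", "just a day", "nothing new", "going home",
]
toy_labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Vectorizer and classifier are fit together, fold by fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r'[^\s]+', ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Step parameters are addressed as <step name>__<parameter>
toy_grid = {"clf__C": [0.1, 1, 10]}

toy_search = GridSearchCV(pipe, toy_grid, cv=2, scoring="accuracy")
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

With the real data, the pipeline would be fit on the raw `x_train` text rather than on `tfidf_x_train`.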


In [None]:
print("🔧 Best hyperparameters found:")
print(grid_search.best_params_)
print("Best mean CV accuracy:", grid_search.best_score_)


In [None]:
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(tfidf_x_test)

print("📊 Results for fine-tuned Logistic Regression:")
print("Accuracy:", accuracy_score(y_test, y_pred_best))
print("Precision:", precision_score(y_test, y_pred_best, average='weighted', zero_division=0))
print("Recall:", recall_score(y_test, y_pred_best, average='weighted', zero_division=0))
print("\n📋 Classification Report:\n", classification_report(y_test, y_pred_best, zero_division=0))


# Conclusions

We built the vector representation with TfidfVectorizer, including the setting

```
token_pattern=r'[^\s]+'
```

We then fit the vectorizer on the training data and transformed the X test data, and finally converted the result of each case to a DataFrame for display.
