<img src="./assets/img/teclab_logo.png" alt="Teclab logo" width="170">

**Author**: Hector Vergara ([LinkedIn](https://www.linkedin.com/in/hector-vergara/))

**Repository**: [nlp_apis](https://github.com/hhvergara/nlp_apis)

**Python Notebook**: [API3.ipynb](https://github.com/hhvergara/nlp_apis/blob/main/API3.ipynb)

----

# API 3:

### Context

The next day, we present the previous results to the team and get the go-ahead (VoBo) to continue. The results so far make sense and, importantly, we now have a clean corpus of comments and a reduced vocabulary.

Someone on the team suggests that a classification model can now be applied. We argue, quite rightly, that a vector representation of the corpus must be obtained first, and that is the focus of this section.

### Tasks

Vector representation: in this part, a vector representation model must be applied. TfidfVectorizer is recommended over a BoW representation because it makes it easier to include emojis, as shown in the following example:

![example 1](./assets/img/API3_1.png)
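
As a minimal sketch of the idea in the image above, the snippet below compares the default `token_pattern` of `TfidfVectorizer` with the pattern `[^\s]+` used later in this notebook. The two-sentence corpus and the variable names are hypothetical and exist only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; emojis are written as escapes so the code points are explicit.
corpus = [
    "i love this \u2764",      # U+2764 heavy black heart
    "so sad \U0001F622",       # U+1F622 crying face
]

# Default token_pattern is r"(?u)\b\w\w+\b": only words of two or more characters.
default_vec = TfidfVectorizer()
default_vec.fit(corpus)

# With r'[^\s]+' any run of non-whitespace characters is a token, so emojis are kept.
emoji_vec = TfidfVectorizer(token_pattern=r'[^\s]+')
emoji_vec.fit(corpus)

default_vocab = sorted(default_vec.vocabulary_)
emoji_vocab = sorted(emoji_vec.vocabulary_)
print(default_vocab)
print(emoji_vocab)
```

With the default pattern the emojis and the one-letter word "i" are dropped from the vocabulary; with the custom pattern they become features of their own.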

The vectorization model must first be fitted on the X training data and then used to transform the X test data:

![example 2](./assets/img/API3_2.png)
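
The fit-then-transform order can be sketched on a tiny, hypothetical corpus. The vocabulary is learned from the training texts only, so the test matrix has the same columns and any term that was never seen during fitting is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["good movie", "bad movie"]   # hypothetical training comments
test_texts = ["good film"]                  # "film" never appears in training

vectorizer = TfidfVectorizer(token_pattern=r'[^\s]+')

# Learn vocabulary and IDF weights from the training data only
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary on the test data (no refitting)
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the resulting columns would no longer line up with the ones a downstream classifier was trained on.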



In [16]:
# 1. Library Imports
import os
import nltk
import numpy as np
import pandas as pd
from pathlib import Path
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

__version__ = '0.0.1'
__email__ = 'hhvservice@gmail.com'
__author__ = 'Hector Vergara'
__linkedin__ = 'https://www.linkedin.com/in/hector-vergara/'  # avoid shadowing the built-in __annotations__ dict
__base_dir__ = Path().absolute()
__data_dir__ = os.path.join(__base_dir__, 'data')
filename_data = os.path.join(__data_dir__, 'sentiment_analysis_dataset.csv')

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\vinyl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!

True

### We download the "sentiment-analysis-dataset" from Kaggle to run the tests.

Reference: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset/data

In [17]:
# Load the dataset
df = pd.read_csv(filename_data, sep=',', encoding='unicode_escape')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df.head(10)

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26
5,28b57f3990,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral,night,70-100,Antigua and Barbuda,97929,440.0,223
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,morning,0-20,Argentina,45195774,2736690.0,17
7,50e14c0bb8,Soooo high,Soooo high,neutral,noon,21-30,Armenia,2963243,28470.0,104
8,e050245fbd,Both of you,Both of you,neutral,night,31-45,Australia,25499884,7682300.0,3
9,fc2cbefa9d,Journey!? Wow... u just became cooler. hehe....,Wow... u just became cooler.,positive,morning,46-60,Austria,9006398,82400.0,109


In [18]:
print(f'''
Number of rows: {df.shape[0]}
Number of columns: {df.shape[1]}
''')


Number of rows: 27480
Number of columns: 10



## Data preprocessing

In [19]:
class NLPPreprocessor:

    tokenizer_pattern = (
            r'[\U0001F600-\U0001F64F]'          # classic emojis
            r'|[\U0001F300-\U0001F5FF]'         # nature, symbols
            r'|[\U0001F680-\U0001F6FF]'         # transport
            r'|[\U0001F1E0-\U0001F1FF]'         # Flags
            r'|[\U00002700-\U000027BF]'         # various symbols
            r'|[\U0001F900-\U0001F9FF]'         # gestures
            r'|[\U00002600-\U000026FF]'         # ☀☂
            r'|❤|🥰'                            # specific emojis
            r'|:\)'                             # emoticon :)
            r'|\b\w+\b'                         # words (alphanumeric)
        )

    def __init__(self, text_column: str):
        self.text_column = text_column

    def clean_tokenize_text(self, text: str) -> list:
        """Lowercase and tokenize text, keeping words, emojis and the ':)' emoticon; other punctuation is dropped."""

        tokenizer = RegexpTokenizer(self.tokenizer_pattern)
        return tokenizer.tokenize(text.lower())

    def _get_wordnet_pos_(self, treebank_tag) -> str:
        """
        Converts nltk (Treebank) POS tags to WordNet tags.
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN  # By default, use NOUN if no match found


    def lemmatize_tokens(self, tokens: list) -> list:
        """Lemmatize tokens using POS tagging for greater accuracy."""
        lemmatizer = WordNetLemmatizer()
        pos_tags = pos_tag(tokens)  # e.g. [('the', 'DT'), ('children', 'NNS'), ...]
        return [
            lemmatizer.lemmatize(token, self._get_wordnet_pos_(pos))
            for token, pos in pos_tags
        ]

    def stem_tokens(self, tokens: list) -> list:
        """Stem tokens using PorterStemmer."""
        stemmer = PorterStemmer().stem
        return [stemmer(token) for token in tokens]

    def preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        df['tokens'] = df[self.text_column].astype(str).apply(self.clean_tokenize_text)
        df['lemmas'] = df['tokens'].apply(self.lemmatize_tokens)
        df['stems'] = df['tokens'].apply(self.stem_tokens)
        return df

In [20]:
# Example usage:
preprocessor = NLPPreprocessor(text_column='text')
processed_df = preprocessor.preprocess(df)
processed_df.head(10)


Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²),tokens,lemmas,stems
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60,"[i, d, have, responded, if, i, were, going]","[i, d, have, respond, if, i, be, go]","[i, d, have, respond, if, i, were, go]"
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105,"[sooo, sad, i, will, miss, you, here, in, san,...","[sooo, sad, i, will, miss, you, here, in, san,...","[sooo, sad, i, will, miss, you, here, in, san,..."
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18,"[my, boss, is, bullying, me]","[my, bos, be, bully, me]","[my, boss, is, bulli, me]"
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164,"[what, interview, leave, me, alone]","[what, interview, leave, me, alone]","[what, interview, leav, me, alon]"
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26,"[sons, of, why, couldn, t, they, put, them, on...","[son, of, why, couldn, t, they, put, them, on,...","[son, of, whi, couldn, t, they, put, them, on,..."
5,28b57f3990,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral,night,70-100,Antigua and Barbuda,97929,440.0,223,"[http, www, dothebouncy, com, smf, some, shame...","[http, www, dothebouncy, com, smf, some, shame...","[http, www, dothebounci, com, smf, some, shame..."
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,morning,0-20,Argentina,45195774,2736690.0,17,"[2am, feedings, for, the, baby, are, fun, when...","[2am, feeding, for, the, baby, be, fun, when, ...","[2am, feed, for, the, babi, are, fun, when, he..."
7,50e14c0bb8,Soooo high,Soooo high,neutral,noon,21-30,Armenia,2963243,28470.0,104,"[soooo, high]","[soooo, high]","[soooo, high]"
8,e050245fbd,Both of you,Both of you,neutral,night,31-45,Australia,25499884,7682300.0,3,"[both, of, you]","[both, of, you]","[both, of, you]"
9,fc2cbefa9d,Journey!? Wow... u just became cooler. hehe....,Wow... u just became cooler.,positive,morning,46-60,Austria,9006398,82400.0,109,"[journey, wow, u, just, became, cooler, hehe, ...","[journey, wow, u, just, become, cool, hehe, be...","[journey, wow, u, just, becam, cooler, hehe, i..."
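
The `tokens` column above comes from the regular expression in `tokenizer_pattern`. The sketch below tries a shortened version of that pattern with the standard-library `re.findall`, which is assumed here to behave closely enough to `RegexpTokenizer` for illustration. The helper name `tokenize` and the sample sentence are hypothetical.

```python
import re

# Shortened version of NLPPreprocessor.tokenizer_pattern (only some emoji ranges kept)
TOKEN_PATTERN = (
    r'[\U0001F600-\U0001F64F]'      # classic emojis
    r'|[\U0001F900-\U0001F9FF]'     # gestures and newer faces
    r'|:\)'                         # emoticon :)
    r'|\b\w+\b'                     # words (alphanumeric)
)

def tokenize(text: str) -> list:
    """Lowercase the text and return every non-overlapping pattern match."""
    return re.findall(TOKEN_PATTERN, text.lower())

# U+1F970 is the "smiling face with hearts" emoji
tokens = tokenize("I`d LOVE it \U0001F970 :) !!!")
print(tokens)
```

The backtick splits "I`d" into two tokens, as in the first row of the table, while the emoji and the `:)` emoticon are kept and the trailing exclamation marks are discarded.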


In [21]:
# Split the dataset into training and testing sets
train_df, test_df = train_test_split(processed_df, test_size=0.2, random_state=42)

x_train = train_df['text']
x_test = test_df['text']
y_train = train_df['sentiment']
y_test = test_df['sentiment']

print(f'''Rows in train: {train_df.shape[0]}
Columns in train: {train_df.shape[1]}
Rows in test: {test_df.shape[0]}
Columns in test: {test_df.shape[1]}
''')

Rows in train: 21984
Columns in train: 13
Rows in test: 5496
Columns in test: 13



In [22]:
# Model creation using TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[^\s]+')

# Vectorize the text data
tfidf_x_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_x_test = tfidf_vectorizer.transform(x_test)

print(f'''tfidf_x_train shape: {tfidf_x_train.shape}\ntfidf_x_test shape: {tfidf_x_test.shape}''')

tfidf_x_train shape: (21984, 38762)
tfidf_x_test shape: (5496, 38762)
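
The cell above vectorizes the raw `text` column. If the token lists produced by `NLPPreprocessor` (for example the `lemmas` column) were to be vectorized instead, one possible approach is to pass a callable `analyzer` that returns each list unchanged. This is an assumption for illustration, not what the notebook does, and the token lists below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents, shaped like the 'lemmas' column
train_tokens = [
    ["i", "love", "this", "\u2764"],
    ["my", "boss", "be", "bully", "me"],
]
test_tokens = [["i", "love", "my", "boss"]]

# A callable analyzer receives each document and returns its features;
# here every document is already a list of tokens, so it is returned as-is.
token_vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

X_train_tokens = token_vectorizer.fit_transform(train_tokens)
X_test_tokens = token_vectorizer.transform(test_tokens)

print(X_train_tokens.shape, X_test_tokens.shape)
```

With a callable analyzer, scikit-learn skips its own lowercasing and tokenization, so emojis already isolated by the preprocessor stay as features. A named function instead of a `lambda` would be needed if the fitted vectorizer has to be pickled.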


In [23]:
# Get feature names from the TF-IDF vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix back to a DataFrame
tfidf_x_train_df = pd.DataFrame(tfidf_x_train.toarray(), columns=feature_names)
tfidf_x_test_df = pd.DataFrame(tfidf_x_test.toarray(), columns=feature_names)

print(f'''tfidf_x_train_df shape: {tfidf_x_train_df.shape}\ntfidf_x_test_df shape: {tfidf_x_test_df.shape}''')

tfidf_x_train_df shape: (21984, 38762)
tfidf_x_test_df shape: (5496, 38762)


In [24]:
# Display the first 10 rows of the TF-IDF DataFrame
tfidf_x_train_df.head(10)


Unnamed: 0,!,!!,!!!,!!!!,!!!!!,!!!!!!,!!!!!!!,!!!!!!!!,!!!!!!!!!,!!!!!!!!!!!!!!!!!!!!!!!,...,"ã¯â¿â½greed,ã¯â¿â½",ã¯â¿â½iã¯â¿â½m,ã¯â¿â½n?eleg,ã¯â¿â½stupidityã¯â¿â½,ã¯â¿â½timo,ã¯â¿â½why?,ã¯â¿â½whyyy????????,ã¯â¿â½you,ã¯â¿â½ã¯â¿â½,ã¯â¿â½ã¯â¿â½h.
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
# Display the first 10 rows of the TF-IDF test DataFrame
tfidf_x_test_df.head(10)


Unnamed: 0,!,!!,!!!,!!!!,!!!!!,!!!!!!,!!!!!!!,!!!!!!!!,!!!!!!!!!,!!!!!!!!!!!!!!!!!!!!!!!,...,"ã¯â¿â½greed,ã¯â¿â½",ã¯â¿â½iã¯â¿â½m,ã¯â¿â½n?eleg,ã¯â¿â½stupidityã¯â¿â½,ã¯â¿â½timo,ã¯â¿â½why?,ã¯â¿â½whyyy????????,ã¯â¿â½you,ã¯â¿â½ã¯â¿â½,ã¯â¿â½ã¯â¿â½h.
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Conclusions

We applied a vector representation model using TfidfVectorizer, with the following parameter so that any run of non-whitespace characters (emojis included) counts as a token:

```
token_pattern=r'[^\s]+'
```

We also fitted the vectorizer on the X training data and used it to transform the X test data, and finally displayed the resulting DataFrame for each case.
