# 2. Text classification

Implement a function called classify_subreddit(text: str) that classifies a text into one of the specified subreddit categories.
You must try at least 3 methods:
- A method based on TF-IDF + a machine learning algorithm.
- A method based on named entity recognition (NER) + machine learning.
- A method based on word embeddings + machine learning.

Evaluate these methods using the F1 score and a data split (70% for training, 30% for testing).
Include the implementation in core.py and document the steps in implementacion_modulo_2.ipynb.

Note: from the implementation of module 1 we know that only the 'clean_post' column of the subreddits dataframe should be used as the base text.

Following the order of the assignment, the method based on TF-IDF + a machine learning algorithm is implemented first.

**We keep the rows of the dataset that did not produce errors during normalization, i.e. those whose 'clean_post' value is not missing**

In [3]:
import pandas as pd

reddit_df = pd.read_csv('processed_dataset.csv', delimiter=';', quotechar='"', encoding='utf-8', low_memory=False)

In [4]:
reddit_df.drop(reddit_df.loc[reddit_df.clean_post.isna()].index, inplace=True)
reddit_df.reset_index(inplace=True)

In [5]:
reddit_df

Unnamed: 0,index,created_date,created_timestamp,subreddit,title,author,author_created_utc,full_link,score,num_comments,num_crossposts,subreddit_subscribers,post,sentiment,author_created_date,clean_post
0,0,2010-02-11 19:47:22,1265910442.0,analytics,So what do you guys all do related to analytic...,xtom,1.227476e+09,https://www.reddit.com/r/analytics/comments/b0...,7.0,4.0,0.0,,There's a lot of reasons to want to know all t...,NEGATIVE,2008-11-23 21:27:57,theres lot reasons want know stuff figured id ...
1,1,2010-03-04 20:17:26,1267726646.0,analytics,"Google's Invasive, non-Anonymized Ad Targeting...",xtom,1.227476e+09,https://www.reddit.com/r/analytics/comments/b9...,2.0,1.0,0.0,,"I'm cross posting this from /r/cyberlaw, hopef...",NEGATIVE,2008-11-23 21:27:57,im cross posting hopefully guys find interesti...
2,2,2011-01-06 04:51:18,1294282278.0,analytics,"DotCed - Functional Web Analytics - Tagging, R...",dotced,1.294282e+09,https://www.reddit.com/r/analytics/comments/ew...,1.0,1.0,,,"DotCed,a Functional Analytics Consultant, offe...",NEGATIVE,2011-01-06 02:49:14,dotceda functional analytics consultant offeri...
3,3,2011-01-19 11:45:30,1295430330.0,analytics,Program Details - Data Analytics Course,iqrconsulting,1.288245e+09,https://www.reddit.com/r/analytics/comments/f5...,0.0,0.0,,,Here is the program details of the data analyt...,NEGATIVE,2010-10-28 05:49:49,program details data analytics certification c...
4,4,2011-01-19 21:52:28,1295466748.0,analytics,potential job in web analytics... need to anal...,therewontberiots,1.278672e+09,https://www.reddit.com/r/analytics/comments/f5...,2.0,4.0,,,i decided grad school (physics) was not for me...,POSITIVE,2010-07-09 10:45:42,decided grad school physics branching job mark...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272243,274207,2022-05-07 21:38:52,1651948732.0,rstats,Help interpretting lmer model output,seeking-stillness,,https://www.reddit.com/r/rstats/comments/ukjiy...,1.0,0.0,0.0,64078.0,Hello! I am wonder how the following output wo...,NEGATIVE,,hello wonder following output would interprete...
272244,274208,2022-05-07 22:13:52,1651950832.0,rstats,Medical stats book with R,Sweaty_Catch_4275,,https://www.reddit.com/r/rstats/comments/ukk7u...,1.0,0.0,0.0,64080.0,Can anybody recommend me a book with medical s...,POSITIVE,,anybody recommend book medical statistics r th...
272245,274209,2022-05-08 00:38:50,1651959530.0,rstats,Markov chains with unequal sequence lengths,sebelly,,https://www.reddit.com/r/rstats/comments/ukn1i...,1.0,0.0,0.0,64083.0,I'm trying to build a simple Markov chain. I h...,NEGATIVE,,im trying build simple markov chain data thera...
272246,274210,2022-05-08 01:19:00,1651961940.0,rstats,view all available Rcpp::plugins,BOBOLIU,,https://www.reddit.com/r/rstats/comments/uknuh...,1.0,0.0,0.0,64084.0,How do I view all available Rcpp::plugins? Tha...,POSITIVE,,view available rcppplugins thanks


**The two columns needed for supervised learning are 'clean_post' and 'subreddit'.** \
We therefore restrict the dataset to these two columns.

In [6]:
reddit_df = reddit_df[['clean_post', 'subreddit']]

**We split the data into training and test sets (70% for training, 30% for testing).** \
We use the train_test_split function from sklearn.model_selection.

In [7]:
from sklearn.model_selection import train_test_split
X = reddit_df['clean_post']
y = reddit_df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
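
The split above is purely random. Since the subreddits differ widely in size, one optional variation is a stratified split, which keeps each class's share roughly equal in both partitions (in scikit-learn this corresponds to passing `stratify=y` to `train_test_split`). The following is a minimal pure-Python sketch of the idea on made-up labels; `stratified_split` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    # Group the row positions by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: an imbalanced two-class problem (illustrative values only)
labels = ["datascience"] * 70 + ["datascienceproject"] * 10
train_idx, test_idx = stratified_split(labels)
minority_in_test = sum(1 for i in test_idx if labels[i] == "datascienceproject")
```

With a plain random split, a very small class can end up almost entirely in one partition; stratifying guarantees it is represented in both.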

In [8]:
type(y_train)

pandas.core.series.Series

**2.1. TF-IDF + machine learning algorithm** \
We use the TF-IDF representation to transform the text and a Logistic Regression classifier for the classification.
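
To make the representation concrete, here is a minimal pure-Python sketch of how TF-IDF weights can be computed for a tiny made-up corpus. It uses the smoothed IDF formula and L2 normalization that `TfidfVectorizer` applies by default, but the whitespace tokenizer and the helper name `tfidf_vectors` are simplifying assumptions, not scikit-learn's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF vectors for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


docs = ["markov chain data", "data analytics job", "data science job market"]
vocab, vectors = tfidf_vectors(docs)
# Weights of the first document, keyed by term
weights = dict(zip(vocab, vectors[0]))
```

In the first document, "data" appears in every text and therefore receives a lower weight than "markov", which appears in only one.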

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

# Build the pipeline
tfidf_lr_pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('logistic', LogisticRegression())])
tfidf_lr_pipeline.fit(X_train, y_train)
y_pred_tfidf_lr = tfidf_lr_pipeline.predict(X_test)

f1_score_tfidf_lr = f1_score(y_test, y_pred_tfidf_lr, average='weighted')
print(f'F1 score: {f1_score_tfidf_lr}')

print(f'Classification report: {classification_report(y_test, y_pred_tfidf_lr)}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 score: 0.5114427677633718


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Classification report:                       precision    recall  f1-score   support

       AskStatistics       0.50      0.58      0.54      9055
     DataScienceJobs       0.84      0.60      0.70       688
         MLQuestions       0.20      0.04      0.07      3410
     MachineLearning       0.45      0.58      0.51     11223
           analytics       0.76      0.58      0.66      2349
          artificial       0.58      0.40      0.48      2621
     computerscience       0.64      0.80      0.71      6711
      computervision       0.62      0.53      0.57      2925
                data       0.72      0.21      0.33       799
        dataanalysis       0.48      0.11      0.18      1214
     dataengineering       0.76      0.64      0.70      2468
         datascience       0.56      0.64      0.59     11171
  datascienceproject       0.00      0.00      0.00        75
            datasets       0.61      0.70      0.65      3440
        deeplearning       0.32      0.09     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
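
The F1 score printed above uses `average='weighted'`: the per-class F1 scores are averaged with each class weighted by its support. A class that is never predicted correctly, such as datascienceproject in the report, gets an F1 of 0.00, but with only 75 test examples it barely moves the weighted figure. A minimal pure-Python sketch of that computation on made-up labels follows; `weighted_f1` is an illustrative helper, not scikit-learn's implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its number of true examples
        score += f1 * n_true / total
    return score


# Toy labels (illustrative only): the rare class is never predicted
y_true = ["datascience"] * 4 + ["rstats"] * 2 + ["datascienceproject"]
y_pred = ["datascience", "datascience", "datascience", "rstats",
          "rstats", "datascience", "datascience"]
toy_score = weighted_f1(y_true, y_pred)
```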


In [10]:
tfidf_lr_pipeline.predict(pd.Series([
    'Machine Learning is the subject that studies different ways for machine to have what is called intelligent behavior',
    'I have found a great job in data science, I hope you like it',
]))

array(['MachineLearning', 'datascience'], dtype=object)

**2.2. Word Embeddings + Logistic Regression** 

We import the required libraries and draw smaller samples (10% of the training set and 1% of the test set) so that the code runs in a reasonable time.

In [16]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.base import BaseEstimator, TransformerMixin

X_train_emb = X_train.sample(frac=0.1, random_state=42)
y_train_emb = y_train.loc[X_train_emb.index]
X_test_emb = X_test.sample(frac=0.01, random_state=42)
y_test_emb = y_test.loc[X_test_emb.index]

To build the embeddings we use DistilBERT, a model whose architecture derives from BERT but is simplified, which makes it much faster at the cost of some accuracy. We use a pretrained model from the transformers library.

In [12]:
# Initialize DistilBERT and the tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

We create a class that inherits from sklearn's BaseEstimator and TransformerMixin classes so that the model is compatible with sklearn pipelines.

In [13]:
class DistilBERTEmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="distilbert-base-uncased"):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def fit(self, X, y=None):
        # No fitting is needed for this transformer
        return self

    def transform(self, X):
        embeddings = []
        with torch.no_grad():
            for text in X:
                inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True)
                outputs = self.model(**inputs)
                token_embeddings = outputs.last_hidden_state.squeeze(0)
                word_embedding = token_embeddings.mean(dim=0).numpy()
                embeddings.append(word_embedding)
        return embeddings
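
The `transform` method above averages the token vectors of each text (mean pooling). Because it encodes one text at a time, no padding is involved. If texts were encoded in batches instead, padded positions would have to be excluded from the average. A minimal pure-Python sketch of such a masked mean on made-up numbers follows; `masked_mean_pool` is a hypothetical helper, not part of transformers, and the zero vector used for padding is a toy value.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three real tokens followed by one padding position (toy 2-dimensional vectors)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
# Averaging over every position, padding included, gives a different vector
naive = [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(2)]
```

Comparing `pooled` with `naive` shows how padding would distort the sentence vector if it were not masked out.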

We train the model.

In [14]:
embedding_lr_pipeline = Pipeline([('vectorizer',  DistilBERTEmbeddingTransformer()), ('logistic', LogisticRegression())])
embedding_lr_pipeline.fit(X_train_emb, y_train_emb)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
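
The warning above means the solver stopped at its iteration limit before converging. As the message suggests, two possible remedies are scaling the features and raising `max_iter`. The sketch below combines both on synthetic dense features that stand in for the embeddings; the data, the value 1000, and the name `scaled_lr_pipeline` are assumptions for illustration, and the scores reported in this notebook were obtained with the default settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dense features standing in for sentence embeddings (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=42
)

scaled_lr_pipeline = Pipeline([
    ("scaler", StandardScaler()),  # zero mean, unit variance per feature
    ("logistic", LogisticRegression(max_iter=1000)),  # default limit is 100
])
scaled_lr_pipeline.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used by the solver
iterations_used = int(scaled_lr_pipeline.named_steps["logistic"].n_iter_.max())
```

If `iterations_used` stays below `max_iter`, the solver stopped on its own and the warning no longer appears.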


We make predictions on the sampled test set.

In [17]:
y_pred_embedding_lr = embedding_lr_pipeline.predict(X_test_emb)

We compute the F1 score.

In [18]:
f1_score_embedding_lr = f1_score(y_test_emb, y_pred_embedding_lr, average='weighted')
print(f'F1 score: {f1_score_embedding_lr}')

# Report on the sampled test set, using the embedding model's predictions
print(f'Classification report: {classification_report(y_test_emb, y_pred_embedding_lr)}')

F1 score: 0.4292594966120053


In [19]:
print(len(X_train_emb), len(y_train_emb))

19057 19057


**2.3. NER (Named Entity Recognition) + Random Forest**

We import the required libraries.

In [None]:
import spacy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

We read the file processed_dataset.csv, keep the two columns of interest (clean_post and subreddit), and load the en_core_web_sm language model.

In [None]:

reddit_df = pd.read_csv('processed_dataset.csv', delimiter=';', quotechar='"', encoding='utf-8', low_memory=False)
reddit_df.drop(reddit_df.loc[reddit_df.clean_post.isna()].index, inplace=True)
reddit_df.reset_index(inplace=True)
reddit_df = reddit_df[['clean_post', 'subreddit']]

# Load the spaCy model for NER
nlp = spacy.load("en_core_web_sm")

We define a function that extracts the entities recognized in a given text.

In [None]:
# Extract the named entities from a text
def extract_entities(text):
    doc = nlp(text)
    return " ".join([ent.text for ent in doc.ents])  # Join the recognized entities into one string

We reduce the size of the dataset to limit the running time, so that transforming the texts and training the model finish within a reasonable time frame. The share of the dataset used can be changed through the frac parameter of reddit_df.sample (0.01 here, i.e. 1% of the rows).

In [None]:
reddit_df_sm = reddit_df.sample(frac=0.01, random_state=42)  # fixed seed so the sample is reproducible

We create a new column, entities, that stores the entities recognized in each text.

In [None]:
# Extract named entities
reddit_df_sm["entities"] = reddit_df_sm["clean_post"].apply(extract_entities)

We split the data into training and test sets.

In [None]:
# Use the entities as the feature
X = reddit_df_sm["entities"]
y = reddit_df_sm["subreddit"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We create the training pipeline: CountVectorizer() converts the strings in the entities column into numeric values, which are passed to RandomForestClassifier(), the model to be fitted.

In [None]:
# Vectorization and classification pipeline
pipeline = make_pipeline(
    CountVectorizer(),  # Vectorize the extracted entities
    RandomForestClassifier(random_state=42)  # Classification model
)
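
`CountVectorizer` turns each entity string into a vector of token counts over a vocabulary learned from the training texts. A rough pure-Python sketch of that idea follows, with made-up entity strings and a simplified tokenizer (lowercasing and whitespace splitting, which is an assumption and not scikit-learn's exact tokenization); `fit_vocabulary` and `to_counts` are hypothetical helpers.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_counts(text, vocabulary):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen during fitting are ignored
            vector[vocabulary[tok]] = n
    return vector


# Toy entity strings; the last post had no recognized entities
entity_strings = ["google python", "python tensorflow", ""]
vocabulary = fit_vocabulary(entity_strings)
vectors = [to_counts(text, vocabulary) for text in entity_strings]
```

A post in which no entity is recognized maps to an all-zero vector, so the classifier has nothing to distinguish it from other such posts.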

We train and evaluate the model.

In [None]:
# Train the model
pipeline.fit(X_train, y_train)

# Evaluation
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Weighted F1, the metric required by the assignment (f1_score was imported in section 2.1)
print("F1 score:", f1_score(y_test, y_pred, average='weighted'))


# Prediction function
def predict_subreddit_ner(text):
    entities = extract_entities(text)
    return pipeline.predict([entities])[0]


# Test the function
test_post = "I visited Paris last summer"
predicted_subreddit = predict_subreddit_ner(test_post)
print(f"The predicted subreddit is: {predicted_subreddit}")


**2.4. The classify_subreddit function**

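For the classify_subreddit function we choose the first model (TF-IDF + Logistic Regression): with a weighted F1 score of 0.51 it gives the best result, compared with 0.43 for the embedding model on its sampled test set.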

In [20]:
def classify_subreddit(text: str) -> str:
    # predict returns one label per input text, so we take the single element
    return tfidf_lr_pipeline.predict(pd.Series([text]))[0]