### CATEGORIZATION AND SENTIMENT ANALYSIS OF NEWS ARTICLES
#### Preparing the processed articles
#### Training and testing the Random Forest (RF) model

In [1]:
# Import the libraries needed for data preprocessing.
import pandas as pd
from scipy import sparse

# Import the scripts needed to prepare the data and train the model.
from topic_modeling_prep import TopicModelingPrep
from rf_news_classifier import NewsTopicClassifier, main

In [2]:
# Load the processed news data into a DataFrame.
processed_news_df = pd.read_csv("./data/processed_news_dataset_sample.csv", encoding='utf-8-sig', na_values=['', 'NA', 'null', 'NULL', 'NaN'])
processed_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2074 entries, 0 to 2073
Data columns (total 16 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   category                            2074 non-null   object
 1   date                                2074 non-null   object
 2   processed_text                      2074 non-null   object
 3   sentiment_text                      2074 non-null   object
 4   adj_count                           2074 non-null   int64 
 5   verb_count                          2074 non-null   int64 
 6   noun_count                          2074 non-null   int64 
 7   proper_noun_count                   2074 non-null   int64 
 8   adv_count                           2074 non-null   int64 
 9   quote_count                         2074 non-null   int64 
 10  pattern_named_entities_count        2074 non-null   int64 
 11  pattern_action_phrases_count        2074 non-null   int6

In [3]:
# Drop any rows whose processed text is missing.
processed_news_df = processed_news_df.dropna(subset=['processed_text'])
processed_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2074 entries, 0 to 2073
Data columns (total 16 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   category                            2074 non-null   object
 1   date                                2074 non-null   object
 2   processed_text                      2074 non-null   object
 3   sentiment_text                      2074 non-null   object
 4   adj_count                           2074 non-null   int64 
 5   verb_count                          2074 non-null   int64 
 6   noun_count                          2074 non-null   int64 
 7   proper_noun_count                   2074 non-null   int64 
 8   adv_count                           2074 non-null   int64 
 9   quote_count                         2074 non-null   int64 
 10  pattern_named_entities_count        2074 non-null   int64 
 11  pattern_action_phrases_count        2074 non-null   int6

In [4]:
# Prepare the data for model training.
prep = TopicModelingPrep()
topic_df, artifacts = prep.prepare_data(processed_news_df)


Validation summary:
Total articles: 2,074
Number of categories: 27
Average text length: 105.6 characters
Average word count: 15.4 words

Top 5 categories by frequency:
category
POLITICS            338
WELLNESS            258
ENTERTAINMENT       171
PARENTS             123
DIVERSITY VOICES    123
Name: count, dtype: int64
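
The source of `TopicModelingPrep.prepare_data` is not shown in this notebook. As a rough sketch only, assuming the summary is derived from ordinary pandas operations on a text column and a `category` column, figures of this kind could be computed as follows. The helper name `summarize_corpus` and the toy rows are hypothetical.

```python
import pandas as pd


def summarize_corpus(df: pd.DataFrame, text_col: str = "processed_text") -> dict:
    """Hypothetical helper: compute corpus-level validation figures."""
    texts = df[text_col].astype(str)
    return {
        "total_articles": len(df),
        "n_categories": df["category"].nunique(),
        "avg_text_length": texts.str.len().mean(),
        "avg_word_count": texts.str.split().str.len().mean(),
        "top_categories": df["category"].value_counts().head(5),
    }


# Toy data for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame(
    {
        "category": ["POLITICS", "POLITICS", "PARENTS"],
        "processed_text": ["court rule water", "senate vote", "crazy new baby name"],
    }
)

summary = summarize_corpus(toy_df)
print(f"Total articles: {summary['total_articles']:,}")
print(f"Number of categories: {summary['n_categories']}")
print(f"Average text length: {summary['avg_text_length']:.1f} characters")
print(f"Average word count: {summary['avg_word_count']:.1f} words")
print(summary["top_categories"])
```
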


In [10]:
# Display the DataFrame with the prepared data.
display(topic_df)

| | text | category | date | year | month | text_length | word_count | unique_words | category_id | category_frequency | tfidf_matrix | doc_term_matrix |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | plan parenthood sue ohio dispute fetal tissue ... | POLITICS | 2015-12-14 | 2015 | 12 | 115 | 16 | 14 | 0 | 0.162970 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 1 | serial killer dub 'angel death dy prison beati... | CRIME | 2017-03-30 | 2017 | 3 | 118 | 19 | 19 | 1 | 0.021215 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 2 | search nigeria miss girl go wrong | WORLD NEWS | 2014-05-29 | 2014 | 5 | 33 | 6 | 6 | 2 | 0.047734 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 3 | crazy new baby name | PARENTS | 2015-05-27 | 2015 | 5 | 19 | 4 | 4 | 3 | 0.059306 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 4 | heartwarming reason college student bring baby... | PARENTS | 2016-11-02 | 2016 | 11 | 74 | 11 | 11 | 3 | 0.059306 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2069 | world kid go turkey africa thailand contemplat... | DIVERSITY VOICES | 2015-03-02 | 2015 | 3 | 124 | 18 | 18 | 11 | 0.059306 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 2070 | lgbt family risk new budget lgbt family arent ... | DIVERSITY VOICES | 2017-06-01 | 2017 | 6 | 81 | 12 | 10 | 11 | 0.059306 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 2071 | mhealth father first cellphone evolution risk ... | WELLNESS | 2012-03-21 | 2012 | 3 | 191 | 26 | 23 | 7 | 0.124397 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
| 2072 | russian u.s. effort curb russian banking would... | WORLD NEWS | 2018-08-10 | 2018 | 8 | 105 | 15 | 13 | 2 | 0.047734 | <Compressed Sparse Row sparse matrix of dtype ... | <Compressed Sparse Row sparse matrix of dtype ... |
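
The derived columns in this table are consistent with simple per-row computations: `category_id` follows the order in which categories first appear (POLITICS is 0, CRIME is 1, and so on), and `category_frequency` matches each category's share of the corpus (for POLITICS, 338 / 2074 ≈ 0.162970). The sketch below reproduces columns of this kind, assuming `pd.factorize` and `value_counts(normalize=True)`; the real implementation inside `TopicModelingPrep` is not shown and may differ. The first two texts come from the table, the third is made up.

```python
import pandas as pd

# Toy rows for illustration; the third text is hypothetical.
df = pd.DataFrame(
    {
        "text": [
            "search nigeria miss girl go wrong",
            "crazy new baby name",
            "crazy crazy baby",
        ],
        "category": ["WORLD NEWS", "PARENTS", "PARENTS"],
    }
)

# Per-document length features.
tokens = df["text"].str.split()
df["text_length"] = df["text"].str.len()
df["word_count"] = tokens.str.len()
df["unique_words"] = tokens.apply(lambda words: len(set(words)))

# Integer ids in order of first appearance, and each category's corpus share.
df["category_id"] = pd.factorize(df["category"])[0]
shares = df["category"].value_counts(normalize=True)
df["category_frequency"] = df["category"].map(shares)

print(df)
```
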


In [11]:
# Display the contents of the generated artifacts.
display(artifacts)

{'tfidf_vocabulary': {'plan': np.int64(3043),
  'parenthood': np.int64(2915),
  'sue': np.int64(3926),
  'ohio': np.int64(2824),
  'dispute': np.int64(1137),
  'fetal': np.int64(1471),
  'tissue': np.int64(4129),
  'attorney': np.int64(259),
  'general': np.int64(1626),
  'criticize': np.int64(932),
  'agency': np.int64(111),
  'remains': np.int64(3353),
  'plan parenthood': np.int64(3044),
  'attorney general': np.int64(260),
  'killer': np.int64(2196),
  'death': np.int64(1018),
  'dy': np.int64(1222),
  'prison': np.int64(3177),
  'beating': np.int64(336),
  'nurse': np.int64(2791),
  'aid': np.int64(120),
  'serve': np.int64(3624),
  'multiple': np.int64(2653),
  'life': np.int64(2325),
  'sentence': np.int64(3617),
  'admit': np.int64(75),
  'kill': np.int64(2195),
  'dozen': np.int64(1187),
  'people': np.int64(2968),
  'search': np.int64(3586),
  'nigeria': np.int64(2747),
  'miss': np.int64(2588),
  'girl': np.int64(1644),
  'wrong': np.int64(4539),
  'crazy': np.int64(911),
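
The vocabulary above mixes single words with two-word phrases such as `'plan parenthood'` and `'attorney general'`, and each phrase sits one index after its first word. That pattern is what scikit-learn's `TfidfVectorizer` produces with `ngram_range=(1, 2)`, since it assigns column indices in sorted term order. The sketch below illustrates this on a toy corpus; it is an assumption about how the artifact was built, not the code from `topic_modeling_prep`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only.
docs = [
    "plan parenthood sue ohio",
    "attorney general criticize agency",
    "plan sue agency",
]

# Unigrams and bigrams; vocabulary_ maps each term to its column index.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

print(tfidf_matrix.shape)
print(sorted(vocab, key=vocab.get))
```
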
  

In [12]:
# Train and test the news classification model.
classifier, topic_distributions, evaluation_results = main(topic_df, artifacts)

# Print the detailed classification report.
print("\nDetailed Classification Report:")
print(pd.DataFrame(evaluation_results['classification_report']).T)

Fitting LDA model...

Training Random Forest classifier...

Model Parameters:
-----------------
LDA topics: 20
RF estimators: 200
RF criterion: entropy
RF max_samples: 0.8
RF class_weight: balanced_subsample

Topic Modeling Results:
-----------------------
Topic 0: court, life, judge, rule, water, clean, supreme, lawyer, supreme court, ice
Topic 1: make, life, thing, time, year, day, way, kid, know, work
Topic 2: year, old, year old, say, end, huffpost, let, twitter, divorce, college
Topic 3: week, travel, time, late, moment, question, host, republican, senate, month
Topic 4: attack, kill, film, history, oscar, gay, rock, claim, hit, al
Topic 5: woman, christmas, center, oil, image, funny, look, birth, girl, order
Topic 6: death, white, cancer, die, dead, job, white house, house, breast, attorney
Topic 7: summer, road, vacation, print, search, lemon, wrong, lemonade, river, reporting
Topic 8: star, security, industry, credit, work, want, jeb, game, bush, noah
Topic 9: world, healthy, h
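
The log shows a two-stage pipeline: an LDA model turns each article into a topic distribution, and a Random Forest is trained on those distributions. Since `rf_news_classifier.main` is not shown, the following is only a minimal sketch of that idea with scikit-learn, assuming `LatentDirichletAllocation` and `RandomForestClassifier` are the underlying estimators. It reuses the Random Forest settings printed above but, because the toy corpus is tiny, only two topics instead of 20. Note that in scikit-learn `learning_offset` only affects the online learning method.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only; not the real dataset.
texts = [
    "court judge rule supreme court lawyer",
    "senate republican vote court rule",
    "judge lawyer supreme rule senate",
    "white house senate republican vote",
    "kid baby parent family school",
    "baby name parent kid family",
    "school kid family parent day",
    "baby kid day school name",
]
labels = ["POLITICS"] * 4 + ["PARENTS"] * 4

# Stage 1: word counts -> per-document topic distributions.
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, learning_offset=10.0, random_state=0)
topic_dist = lda.fit_transform(counts)

# Stage 2: Random Forest trained on the topic distributions.
rf = RandomForestClassifier(
    n_estimators=200,
    criterion="entropy",
    max_samples=0.8,
    class_weight="balanced_subsample",
    random_state=0,
)
rf.fit(topic_dist, labels)
preds = rf.predict(topic_dist)

print(np.round(topic_dist, 2))
print(preds)
```
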

In [8]:
parameter_combinations = [
    {
        'n_topics': 20,
        'n_estimators': 200,
        'learning_offset': 10.0, 
        'criterion': 'gini',
        'max_samples': 0.8,
        'class_balance': 'balanced'
    },
    {
        'n_topics': 20,
        'n_estimators': 250,
        'learning_offset': 25.0,
        'criterion': 'entropy',    
        'max_samples': 0.7,
        'class_balance': 'balanced_subsample'
    },
    {
        'n_topics': 25,
        'n_estimators': 300,
        'learning_offset': 15.0,
        'criterion': 'entropy',
        'max_samples': 0.9,
        'class_balance': 'balanced'
    }
]

results = []
for params in parameter_combinations:
    print(f"\nTesting parameters: {params}")

    # Fit the LDA model, then train and evaluate the Random Forest on its topic distributions.
    classifier = NewsTopicClassifier(**params)
    topic_distributions, top_words = classifier.fit_transform_lda(topic_df, artifacts)
    evaluation_results = classifier.train_classifier(topic_df, topic_distributions)

    # Keep the overall accuracy and the weighted F1 score for each combination.
    results.append({
        'parameters': params,
        'accuracy': evaluation_results['classification_report']['accuracy'],
        'weighted_f1': evaluation_results['classification_report']['weighted avg']['f1-score']
    })


Testing parameters: {'n_topics': 20, 'n_estimators': 200, 'learning_offset': 10.0, 'criterion': 'gini', 'max_samples': 0.8, 'class_balance': 'balanced'}
Fitting LDA model...

Training Random Forest classifier...

Testing parameters: {'n_topics': 20, 'n_estimators': 250, 'learning_offset': 25.0, 'criterion': 'entropy', 'max_samples': 0.7, 'class_balance': 'balanced_subsample'}
Fitting LDA model...

Training Random Forest classifier...

Testing parameters: {'n_topics': 25, 'n_estimators': 300, 'learning_offset': 15.0, 'criterion': 'entropy', 'max_samples': 0.9, 'class_balance': 'balanced'}
Fitting LDA model...

Training Random Forest classifier...
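
The loop reads `['accuracy']` and `['weighted avg']['f1-score']` from the classification report. That key layout matches what scikit-learn's `classification_report` returns with `output_dict=True`, so the sketch below assumes `train_classifier` builds its report that way (the source does not confirm this). The labels are toy values.

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only.
y_true = ["POLITICS", "POLITICS", "POLITICS", "PARENTS"]
y_pred = ["POLITICS", "POLITICS", "PARENTS", "PARENTS"]

# output_dict=True returns nested dicts instead of a formatted string.
report = classification_report(y_true, y_pred, output_dict=True)
accuracy = report["accuracy"]
weighted_f1 = report["weighted avg"]["f1-score"]

print(f"Accuracy: {accuracy:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
```

The weighted average weights each class's F1 score by its number of true samples, which is why it can diverge from accuracy on an imbalanced category set like this one.
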


In [9]:
# Print results
for result in results:
    print("\nParameters:", result['parameters'])
    print(f"Accuracy: {result['accuracy']:.3f}")
    print(f"Weighted F1: {result['weighted_f1']:.3f}")

# Find best combination
best_result = max(results, key=lambda x: x['weighted_f1'])
print("\nBest combination:")
print(f"Parameters: {best_result['parameters']}")
print(f"Weighted F1: {best_result['weighted_f1']:.3f}")


Parameters: {'n_topics': 20, 'n_estimators': 200, 'learning_offset': 10.0, 'criterion': 'gini', 'max_samples': 0.8, 'class_balance': 'balanced'}
Accuracy: 0.436
Weighted F1: 0.387

Parameters: {'n_topics': 20, 'n_estimators': 250, 'learning_offset': 25.0, 'criterion': 'entropy', 'max_samples': 0.7, 'class_balance': 'balanced_subsample'}
Accuracy: 0.439
Weighted F1: 0.383

Parameters: {'n_topics': 25, 'n_estimators': 300, 'learning_offset': 15.0, 'criterion': 'entropy', 'max_samples': 0.9, 'class_balance': 'balanced'}
Accuracy: 0.422
Weighted F1: 0.371

Best combination:
Parameters: {'n_topics': 20, 'n_estimators': 200, 'learning_offset': 10.0, 'criterion': 'gini', 'max_samples': 0.8, 'class_balance': 'balanced'}
Weighted F1: 0.387
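
For a side-by-side comparison, the printed scores can also be collected into a DataFrame and sorted. The sketch below copies the scores from the output above and keeps only three of the parameters for brevity; the column layout is an illustrative choice, not part of the original scripts.

```python
import pandas as pd

# Scores copied from the printed results; parameters abbreviated.
results = [
    {"parameters": {"n_topics": 20, "n_estimators": 200, "criterion": "gini"},
     "accuracy": 0.436, "weighted_f1": 0.387},
    {"parameters": {"n_topics": 20, "n_estimators": 250, "criterion": "entropy"},
     "accuracy": 0.439, "weighted_f1": 0.383},
    {"parameters": {"n_topics": 25, "n_estimators": 300, "criterion": "entropy"},
     "accuracy": 0.422, "weighted_f1": 0.371},
]

# Flatten each result into one row and rank by weighted F1.
rows = [
    {**r["parameters"], "accuracy": r["accuracy"], "weighted_f1": r["weighted_f1"]}
    for r in results
]
summary = (
    pd.DataFrame(rows)
    .sort_values("weighted_f1", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Laid out this way, it is easy to see that the combination with the highest accuracy (0.439) is not the one with the highest weighted F1 (0.387), and that all three runs fall within a narrow band.
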
