# Text Preprocessing for News Classification

This notebook preprocesses a news dataset so that the titles and article bodies are ready for later analysis and classification. It cleans both fields, records their original word counts, truncates the cleaned article text, and saves the result to a new CSV file.

### Importing libraries


In [1]:
import sys
import pandas as pd

# Make the project's src folder importable before loading the local module
sys.path.append('../src')
from preprocessing import preprocess, recortar_texto

[nltk_data] Downloading package punkt to /home/inma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/inma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/inma/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# Path to the data
path = '../data/processed/noticias_tema.csv'

# Load the CSV file
df = pd.read_csv(path)

# Drop columns that are not needed here (title_length is recomputed below)
df.drop(columns=['title_length', 'no_verbs'], inplace=True)
df


Unnamed: 0,title,text,label,news_type,sentences,lang,Cluster,category
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,Real,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,en,5,sociedad
1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,Real,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,en,5,sociedad
2,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,Falsa,"Bobby Jindal, raised Hindu, uses story of Chri...",en,5,sociedad
3,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,Real,SATAN 2: Russia unvelis an image of its terrif...,en,0,gobierno
4,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,Real,About Time! Christian Group Sues Amazon and SP...,en,5,sociedad
...,...,...,...,...,...,...,...,...
61844,WIKILEAKS EMAIL SHOWS CLINTON FOUNDATION FUNDS...,An email released by WikiLeaks on Sunday appea...,1,Real,WIKILEAKS EMAIL SHOWS CLINTON FOUNDATION FUNDS...,en,1,clinton
61845,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0,Falsa,Russians steal research on Trump in hack of U....,en,1,clinton
61846,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1,Real,WATCH: Giuliani Demands That Democrats Apologi...,en,4,trump
61847,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0,Falsa,Migrants Refuse To Leave Train At Refugee Camp...,en,5,sociedad


In [3]:
# Word count of the raw title, taken before any preprocessing
df['title_length'] = df['title'].apply(lambda t: len(str(t).split()))
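The expression in the cell above counts whitespace-separated tokens after casting the value to a string. The sketch below isolates that expression in a helper, `word_count`, a hypothetical name that is not part of the notebook, to show how it behaves on edge cases such as a missing value.

```python
def word_count(value) -> int:
    """Count whitespace-separated tokens, mirroring the lambda used in the notebook."""
    return len(str(value).split())


regular = word_count("About Time! Christian Group")
empty = word_count("")
# str(float("nan")) is the literal string "nan", so a missing title counts as one word.
missing = word_count(float("nan"))
print(regular, empty, missing)
```

If missing titles should count as zero words instead, one option is to filter them with `pd.isna` before counting.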


In [4]:
# Clean the titles with the project's preprocess function
df['title'] = df['title'].apply(preprocess)
df

Unnamed: 0,title,text,label,news_type,sentences,lang,Cluster,category,title_length
0,law enforcement high alert following threat co...,No comment is expected from Barack Obama Membe...,1,Real,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,en,5,sociedad,18
1,unbelievable exclamationtoken obamas attorney ...,"Now, most of the demonstrators gathered last ...",1,Real,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,en,5,sociedad,18
2,bobby jindal raised hindu us story christian c...,A dozen politically active pastors came here f...,0,Falsa,"Bobby Jindal, raised Hindu, uses story of Chri...",en,5,sociedad,16
3,satan russia unvelis image terrifying new supe...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,Real,SATAN 2: Russia unvelis an image of its terrif...,en,0,gobierno,16
4,time exclamationtoken christian group sue amaz...,All we can say on this one is it s about time ...,1,Real,About Time! Christian Group Sues Amazon and SP...,en,5,sociedad,13
...,...,...,...,...,...,...,...,...,...
61844,wikileaks email show clinton foundation fund u...,An email released by WikiLeaks on Sunday appea...,1,Real,WIKILEAKS EMAIL SHOWS CLINTON FOUNDATION FUNDS...,en,1,clinton,15
61845,russian steal research trump hack u democratic...,WASHINGTON (Reuters) - Hackers believed to be ...,0,Falsa,Russians steal research on Trump in hack of U....,en,1,clinton,11
61846,watch giuliani demand democrat apologize trump...,"You know, because in fantasyland Republicans n...",1,Real,WATCH: Giuliani Demands That Democrats Apologi...,en,4,trump,10
61847,migrant refuse leave train refugee camp hungary,Migrants Refuse To Leave Train At Refugee Camp...,0,Falsa,Migrants Refuse To Leave Train At Refugee Camp...,en,5,sociedad,10
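The source of `preprocess` is not shown in this notebook. Judging only from the output above, it appears to lowercase the text, replace `!` with the token `exclamationtoken`, drop digits and punctuation, remove stop words, and lemmatize (for example, "Sues" becomes "sue"). The following is a minimal sketch of those steps under that assumption. The name `preprocess_sketch` and the tiny `STOPWORDS` set are hypothetical, and lemmatization is left out so the example runs without NLTK data.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOPWORDS = {"a", "an", "the", "on", "of", "to", "at", "in", "is",
             "and", "its", "that", "about"}


def preprocess_sketch(text: str) -> str:
    """Rough approximation of the cleaning visible in the output above."""
    text = str(text).lower()
    # Keep exclamation marks as an explicit token instead of discarding them.
    text = text.replace("!", " exclamationtoken ")
    # Drop digits, punctuation and any other non-letter character.
    text = re.sub(r"[^a-z\s]", "", text)
    # Remove stop words; lemmatization is intentionally omitted in this sketch.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess_sketch("About Time! Christian Group Sues Amazon"))
```

Because the sketch does not lemmatize, it keeps "sues" where the notebook output shows "sue"; the real function presumably relies on the NLTK resources downloaded at import time.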


In [5]:
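# Word count of the raw article text, taken before any preprocessing
df['text_length'] = df['text'].apply(lambda t: len(str(t).split()))
# Clean the article text, then shorten it with recortar_texto
df['text'] = df['text'].apply(preprocess)
df['text'] = df['text'].apply(recortar_texto)
df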

Unnamed: 0,title,text,label,news_type,sentences,lang,Cluster,category,title_length,text_length
0,law enforcement high alert following threat co...,no comment expected barack obama member fyf fu...,1,Real,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,en,5,sociedad,18,871
1,unbelievable exclamationtoken obamas attorney ...,most demonstrator gathered last night exercisi...,1,Real,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,en,5,sociedad,18,34
2,bobby jindal raised hindu us story christian c...,dozen politically active pastor came private d...,0,Falsa,"Bobby Jindal, raised Hindu, uses story of Chri...",en,5,sociedad,16,1321
3,satan russia unvelis image terrifying new supe...,r sarmat missile dubbed satan replace s fly mi...,1,Real,SATAN 2: Russia unvelis an image of its terrif...,en,0,gobierno,16,329
4,time exclamationtoken christian group sue amaz...,all say one time someone sued southern poverty...,1,Real,About Time! Christian Group Sues Amazon and SP...,en,5,sociedad,13,244
...,...,...,...,...,...,...,...,...,...,...
61844,wikileaks email show clinton foundation fund u...,email released wikileaks sunday appears show f...,1,Real,WIKILEAKS EMAIL SHOWS CLINTON FOUNDATION FUNDS...,en,1,clinton,15,205
61845,russian steal research trump hack u democratic...,washington reuters hacker believed working rus...,0,Falsa,Russians steal research on Trump in hack of U....,en,1,clinton,11,735
61846,watch giuliani demand democrat apologize trump...,know fantasyland republican never questioned c...,1,Real,WATCH: Giuliani Demands That Democrats Apologi...,en,4,trump,10,604
61847,migrant refuse leave train refugee camp hungary,migrant refuse leave train refugee camp hungar...,0,Falsa,Migrants Refuse To Leave Train At Refugee Camp...,en,5,sociedad,10,477


In [6]:
# Compute the number of tokens in each cleaned text with vectorized string methods
longitudes = df['text'].str.split().str.len()

print(longitudes)

0        300
1         22
2        300
3        185
4        147
        ... 
61844    121
61845    300
61846    300
61847    278
61848    300
Name: text, Length: 61849, dtype: int64
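No value printed above exceeds 300, which suggests that `recortar_texto` caps each text at 300 tokens. Its implementation is not shown, so the following is only a sketch of that assumed behaviour. The name `recortar_texto_sketch` and the `max_tokens` parameter are hypothetical.

```python
def recortar_texto_sketch(texto: str, max_tokens: int = 300) -> str:
    """Keep at most max_tokens whitespace-separated tokens (assumed behaviour)."""
    tokens = str(texto).split()
    return " ".join(tokens[:max_tokens])


long_text = " ".join(["word"] * 500)
short_text = "migrant refuse leave train"

# A long text is cut to the limit, while a short one is returned unchanged.
print(len(recortar_texto_sketch(long_text).split()))
print(recortar_texto_sketch(short_text))
```

Truncating after cleaning, as the notebook does, means the limit applies to informative tokens rather than to stop words that would be removed anyway.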


In [7]:
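# Total word count of the raw title plus the raw text
df['total_length'] = df['title_length'] + df['text_length']
# Replace the original sentences column with the cleaned title and text joined together
df.drop(columns=['sentences'], inplace=True)
df['sentences'] = df['title'] + ' ' + df['text']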

In [8]:
df

Unnamed: 0,title,text,label,news_type,lang,Cluster,category,title_length,text_length,total_length,sentences
0,law enforcement high alert following threat co...,no comment expected barack obama member fyf fu...,1,Real,en,5,sociedad,18,871,889,law enforcement high alert following threat co...
1,unbelievable exclamationtoken obamas attorney ...,most demonstrator gathered last night exercisi...,1,Real,en,5,sociedad,18,34,52,unbelievable exclamationtoken obamas attorney ...
2,bobby jindal raised hindu us story christian c...,dozen politically active pastor came private d...,0,Falsa,en,5,sociedad,16,1321,1337,bobby jindal raised hindu us story christian c...
3,satan russia unvelis image terrifying new supe...,r sarmat missile dubbed satan replace s fly mi...,1,Real,en,0,gobierno,16,329,345,satan russia unvelis image terrifying new supe...
4,time exclamationtoken christian group sue amaz...,all say one time someone sued southern poverty...,1,Real,en,5,sociedad,13,244,257,time exclamationtoken christian group sue amaz...
...,...,...,...,...,...,...,...,...,...,...,...
61844,wikileaks email show clinton foundation fund u...,email released wikileaks sunday appears show f...,1,Real,en,1,clinton,15,205,220,wikileaks email show clinton foundation fund u...
61845,russian steal research trump hack u democratic...,washington reuters hacker believed working rus...,0,Falsa,en,1,clinton,11,735,746,russian steal research trump hack u democratic...
61846,watch giuliani demand democrat apologize trump...,know fantasyland republican never questioned c...,1,Real,en,4,trump,10,604,614,watch giuliani demand democrat apologize trump...
61847,migrant refuse leave train refugee camp hungary,migrant refuse leave train refugee camp hungar...,0,Falsa,en,5,sociedad,10,477,487,migrant refuse leave train refugee camp hungar...


In [9]:
df.isnull().sum()

title           0
text            0
label           0
news_type       0
lang            0
Cluster         0
category        0
title_length    0
text_length     0
total_length    0
sentences       0
dtype: int64
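`isnull()` reports missing values only, so a title or text that preprocessing reduced to an empty string would not show up in the counts above. With pandas defaults, such empty strings come back as `NaN` once the CSV is read again. The sketch below illustrates this on a small, made-up DataFrame (`demo` is not the notebook's data) and shows one way to count empty strings explicitly.

```python
import io

import pandas as pd

# Made-up data: the second title is empty, as if every token had been removed.
demo = pd.DataFrame({
    "title": ["time exclamationtoken", ""],
    "text": ["some cleaned text", "more cleaned text"],
})

# Empty strings are not nulls, so this count is zero.
nulls_before = int(demo.isnull().sum().sum())

# Count the empty (or whitespace-only) titles explicitly.
n_empty = int(demo["title"].str.strip().eq("").sum())

# Round-trip through CSV in memory: the empty field is read back as NaN.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
nulls_after = int(reloaded.isnull().sum().sum())

print(nulls_before, n_empty, nulls_after)
```

Running a similar check on `df['title']` and `df['text']` before saving would show whether any rows need to be dropped or handled when the file is loaded later.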

In [10]:
df.to_csv('../data/processed/listo.csv', index=False)