#Fake News Deep Learning Model
This model is built from two datasets, the first named 'fake' and the second 'real'. The data was collected from news coverage of then-candidate Donald Trump during the 2016 United States presidential race.
Dataset available at:
https://www.kaggle.com/datasets/algord/fake-news

*By: Israel Sánchez Graciano*

In [None]:
!pip install tensorflow




In [1]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [16]:
# Load the datasets and label them (1 = real, 0 = fake)
df_true = pd.read_csv("/content/drive/MyDrive/Proyect 1 /true.csv")
df_fake = pd.read_csv("/content/drive/MyDrive/Proyect 1 /fake.csv")

df_true["label"] = 1
df_fake["label"] = 0

In [17]:

# Prepend each headline to its article body
df_true["text"] = df_true["title"] + " " + df_true["text"]
df_fake["text"] = df_fake["title"] + " " + df_fake["text"]


In [18]:

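# Balance the dataset: 2,500 articles from each class
df_true_sample = df_true.sample(n=2500, random_state=42)
df_fake_sample = df_fake.sample(n=2500, random_state=42)
# Shuffle with a fixed seed so the row order is reproducible between runs
df_sample = pd.concat([df_true_sample, df_fake_sample]).sample(frac=1, random_state=42).reset_index(drop=True)

# Text cleaning: lowercase, replace non-word characters with spaces, collapse whitespace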
def limpiar_texto(text):
    text = str(text).lower()
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df_sample["text"] = df_sample["text"].apply(limpiar_texto)
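To make the cleaning step concrete, the sketch below re-declares the same `limpiar_texto` function and applies it to an invented headline (the sample string is an assumption, not a row from the dataset). Note that `\W` also strips apostrophes, so a possessive such as "trump's" becomes two tokens, and that a missing value passed through `str()` turns into the literal text "nan".

```python
import re

def limpiar_texto(text):
    text = str(text).lower()                  # lowercase; non-strings are stringified
    text = re.sub(r'\W', ' ', text)           # every non-word character becomes a space
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

# Invented headline, used only for illustration
sample = "BREAKING: Trump's plan -- 'nothing is off the table'!"
cleaned = limpiar_texto(sample)
print(cleaned)  # breaking trump s plan nothing is off the table

# A missing value (NaN) survives as the word "nan"
print(limpiar_texto(float("nan")))  # nan
```

If rows with missing titles or bodies matter, one option is to drop or fill them before this step rather than after.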

In [19]:

# Tokenize: keep the 5,000 most frequent words; everything else maps to <OOV>
tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(df_sample["text"])
sequences = tokenizer.texts_to_sequences(df_sample["text"])

max_length = 250
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
labels = df_sample["label"].values
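The call above sets `padding='post'` but leaves `truncating` at its Keras default of `'pre'`, so articles longer than 250 tokens keep their last 250 tokens rather than their first. The pure-Python function below is a simplified stand-in for one row of `pad_sequences`, written only to illustrate that behavior; the name `pad_or_truncate` is hypothetical and not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Simplified stand-in for how pad_sequences treats a single sequence."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the padding, 'pre' prepends it
    return seq + pad if padding == "post" else pad + seq

short_seq = [7, 3, 9]
long_seq = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_seq, maxlen=5))                    # [7, 3, 9, 0, 0]
print(pad_or_truncate(long_seq, maxlen=5))                     # [3, 4, 5, 6, 7]
print(pad_or_truncate(long_seq, maxlen=5, truncating="post"))  # [1, 2, 3, 4, 5]
```

Since headlines and opening paragraphs sit at the start of each text, passing `truncating='post'` to `pad_sequences` may be worth trying as a variation.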

In [21]:
# Load the pretrained GloVe vectors (100 dimensions) into a word -> vector dictionary
embedding_index = {}

with open("/content/drive/MyDrive/Proyect 1 /glove.6B.100d.txt", encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vector

embedding_dim = 100
word_index = tokenizer.word_index
num_words = min(5000, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))

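# Fill the matrix row by row; words without a GloVe vector keep an all-zero row
for word, i in word_index.items():
    if i >= num_words:  # skip words outside the vocabulary kept by the tokenizer
        continue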
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
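Rows that never receive a GloVe vector stay at zero, so it can help to check how much of the vocabulary is actually covered. The helper below is a hypothetical sketch (the name `embedding_coverage` and the toy dictionaries are assumptions); in the notebook it would be called with the real `word_index`, `embedding_index`, and `num_words`.

```python
def embedding_coverage(word_index, embedding_index, num_words):
    """Count how many in-vocabulary words have a pretrained vector."""
    covered, total, missing = 0, 0, []
    for word, i in word_index.items():
        if i >= num_words:  # same cutoff as the embedding matrix loop
            continue
        total += 1
        if word in embedding_index:
            covered += 1
        else:
            missing.append(word)
    return covered, total, missing

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary
toy_word_index = {"<OOV>": 1, "trump": 2, "reuters": 3, "covfefe": 4, "senate": 5}
toy_embedding_index = {"trump": [0.1, 0.2], "reuters": [0.3, 0.1], "senate": [0.0, 0.5]}

covered, total, missing = embedding_coverage(toy_word_index, toy_embedding_index, num_words=5)
print(f"{covered}/{total} words covered; missing: {missing}")
# 2/4 words covered; missing: ['<OOV>', 'covfefe']
```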

In [23]:
# Model: GloVe-initialized embedding -> bidirectional LSTM -> dense classifier
from tensorflow.keras.layers import Bidirectional

model = Sequential([
    Embedding(input_dim=num_words,
              output_dim=embedding_dim,
              weights=[embedding_matrix],
              input_length=max_length,
              trainable=True),  # fine-tune the GloVe vectors during training
    Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()
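As a sanity check on the architecture, the trainable parameter count can be worked out by hand. The sketch below assumes the vocabulary reached the full 5,000 words (so `num_words` is 5000) and that `Bidirectional` uses its default concatenation of the two directions; it uses the standard LSTM formula of four gates, each with input weights, recurrent weights, and a bias.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # weights plus one bias per unit
    return (input_dim + 1) * units

num_words = 5000      # assumption: the tokenizer saw at least 5,000 distinct words
embedding_dim = 100
lstm_units = 128

embedding = num_words * embedding_dim                 # 500,000
bilstm = 2 * lstm_params(embedding_dim, lstm_units)   # forward + backward LSTM
dense_64 = dense_params(2 * lstm_units, 64)           # input is the 256-wide concatenation
output = dense_params(64, 1)

total = embedding + bilstm + dense_64 + output
print(embedding, bilstm, dense_64, output, total)
# 500000 234496 16448 65 751009
```

Roughly two thirds of the parameters sit in the embedding layer, which is why initializing it from GloVe rather than from random values matters for a 5,000-article sample.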





In [24]:
# Training (note: the held-out test split also serves as the validation data here)
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=10,
                    batch_size=64)

# Model evaluation: threshold the sigmoid output at 0.5
y_pred = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, y_pred))

Epoch 1/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 109s 2s/step - accuracy: 0.7509 - loss: 0.5058 - val_accuracy: 0.9250 - val_loss: 0.1868
Epoch 2/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 142s 2s/step - accuracy: 0.9277 - loss: 0.1860 - val_accuracy: 0.9460 - val_loss: 0.1499
Epoch 3/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 141s 2s/step - accuracy: 0.9557 - loss: 0.1327 - val_accuracy: 0.9560 - val_loss: 0.1341
Epoch 4/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 142s 2s/step - accuracy: 0.9612 - loss: 0.1058 - val_accuracy: 0.9570 - val_loss: 0.1184
Epoch 5/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 101s 2s/step - accuracy: 0.9751 - loss: 0.0743 - val_accuracy: 0.9440 - val_loss: 0.1706
Epoch 6/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 140s 2s/step - accuracy: 0.9791 - loss: 0.0606 - val_accuracy: 0.9650 - val_loss: 0.1149
Epoch 7/10
63/63 ━━━━

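`classification_report` prints precision, recall, and F1 per class alongside overall accuracy, and these are easy to conflate. The sketch below computes them by hand for the positive class on a small invented set of labels (the toy lists are assumptions, not model output), to show how the four numbers can differ on the same predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Invented labels: 1 = real, 0 = fake
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 1]

metrics = binary_metrics(toy_true, toy_pred)
print({k: round(v, 3) for k, v in metrics.items()})
# {'precision': 0.6, 'recall': 0.75, 'f1': 0.667, 'accuracy': 0.625}
```

On a balanced test set such as the one used here, accuracy is a reasonable headline number, but the per-class rows show whether errors lean toward real or fake articles.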
In [25]:
# Spot-check the model on individual test articles
# Reusing test_size and random_state keeps these texts aligned with X_test / y_test
X_train_texts, X_test_texts, _, _ = train_test_split(df_sample["text"], labels, test_size=0.2, random_state=42)


for i in range(10):
    texto = X_test_texts.iloc[i]
    secuencia = tokenizer.texts_to_sequences([texto])
    padded = pad_sequences(secuencia, maxlen=max_length, padding='post')
    pred = model.predict(padded)[0][0]  # sigmoid output: estimated probability that the article is real

    etiqueta_real = y_test[i]
    etiqueta_predicha = 1 if pred > 0.5 else 0

    print(f"\n📰 Article: {texto[:200]}...")
    print(f"🔹 Actual: {etiqueta_real} | 🔸 Predicted: {etiqueta_predicha} ({'REAL' if etiqueta_predicha==1 else 'FAKE'}) - P(real): {pred:.2f}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 820ms/step

📰 Article: trump says nothing is off the table for response to iran washington reuters president donald trump told reporters on thursday said nothing is off the table in terms of a response to iran s ballistic m...
🔹 Actual: 1 | 🔸 Predicted: 1 (REAL) - P(real): 1.00
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 1s/step

📰 Article: chinese general kills himself amid corruption probe beijing reuters a prominent chinese general under investigation for corruption has committed suicide state media said on tuesday the latest developm...
🔹 Actual: 1 | 🔸 Predicted: 1 (REAL) - P(real): 1.00
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 1s/step

📰 Article: the state that gets more refugees than any other in america may surprise you here s one sure way to turn a solidly red state blue in fiscal year 2014 texas resettled 7 234 refugees however that doesn ...
🔹 Actual: 0 | 🔸 Predicted: 0 (FAKE

#Conclusion
A bidirectional LSTM recurrent network was chosen for its ability to capture long-range dependencies in text sequences, which is essential for understanding the context of a news article.

Pretrained GloVe embeddings were used to represent words in a dense semantic space. Because these vectors were learned on a large external corpus, words with similar meanings start out with similar representations, even when they appear only rarely in the training sample.
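The notion of similarity in an embedding space is usually measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors (assumptions for illustration only; real GloVe vectors here have 100 dimensions) to show how related words score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, not actual GloVe values
toy_vectors = {
    "president": [0.9, 0.1, 0.3],
    "senator":   [0.8, 0.2, 0.4],
    "banana":    [-0.2, 0.9, -0.1],
}

sim_related = cosine_similarity(toy_vectors["president"], toy_vectors["senator"])
sim_unrelated = cosine_similarity(toy_vectors["president"], toy_vectors["banana"])
print(f"president vs senator: {sim_related:.2f}")
print(f"president vs banana:  {sim_unrelated:.2f}")
```

With the real `embedding_index`, the same function could be applied to any pair of words that have a GloVe vector.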

The network reached roughly 96% accuracy on the held-out split when classifying real and fake news, which suggests it is an effective approach to fake news detection on this dataset.

