# RNNs with and without LSTM

#### To work with any deep learning model, the first thing we have to do is get the data ready

#### We have a dataset of news articles, each of which is either fake or true. We will use 2 RNNs to solve the problem, one with LSTM and one without LSTM

Here is a sample of the dataset to see how we can work with it:
```csv

title,text,subject,date,authenticity
 Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing,"Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year!  Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you  Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress.  Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me?  Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish??  Marlene (@marlene399) December 31, 2017You can t just say happy new year?  Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love!  Donald J. 
Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. He s been doing this for years.Trump has directed messages to his  enemies  and  haters  for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA  Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President?  Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down.  Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters?  Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old  Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images.",News,"December 31, 2017",Fake

```

#### Now that we know a bit more about our dataset, I think the first thing we can do is remove punctuation and normalize upper/lower case. Doing this may lose some semantic information from the text, but with limited computing power we often have to accept this trade-off between accuracy and time

#### We will use some data preprocessing libraries, such as nltk, to split the text into words. We could do this with simple manual preprocessing in Python, but why reinvent something that already exists?
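For comparison, the manual route mentioned above could look roughly like the sketch below. It only lowercases the text and keeps runs of letters. The helper name `simple_preprocess` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical helper (not in the notebook): a manual alternative to the nltk pipeline.
def simple_preprocess(text):
    # Lowercase first, then keep only runs of ASCII letters;
    # digits, punctuation and Twitter handles' symbols are dropped.
    return re.findall(r"[a-z]+", text.lower())

sample = "Trump s tweet went down on December 31, 2017!"
print(simple_preprocess(sample))
```

This version does not remove stop words or lemmatize, which is the part nltk saves us from writing by hand.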

In [1]:

import pandas as pd
import numpy as np

import torch
from torch import nn

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load your dataset
df = pd.read_csv('news.csv') 

In [None]:

# Basic text cleaning: remove punctuation and digits such as dates, Twitter user numbers, etc.

stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer('[\'a-zA-Z]+') # Regular expression that keeps only runs of letters A-Z (lower or upper case) and apostrophes
lemmatizer = WordNetLemmatizer() # Reduces each word to its lemma, i.e. its base dictionary form


primera_noticia = df.iloc[0]
def preprocess_text(text):
    words = []
    for sentence in sent_tokenize(text):
        tokens = [word for word in tokenizer.tokenize(sentence)]
        tokens = [token.lower() for token in tokens]
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
        words += tokens
    return ' '.join(words)
# This function returns the cleaned text as a single space-separated string

tokens_primera_noticia = preprocess_text(primera_noticia['text'])
df['texto_titulo'] = df['title'] + ' ' + df['text']  # the space keeps the last title word from fusing with the first text word

df['preprocessado'] = df['texto_titulo'].apply(preprocess_text)
df['preprocessado'].head()
# To get the text word by word, just call .split(' ')

0    donald trump sends embarrassing new year eve m...
1    drunk bragging trump staffer started russian c...
2    sheriff david clarke becomes internet joke thr...
3    trump obsessed even obama name coded website i...
4    pope francis called donald trump christmas spe...
Name: preprocessado, dtype: object

#### Here we will use something that strays a bit from RNNs themselves: GloVe, a set of pretrained word-to-vector embeddings, to vectorize our words. We could have used techniques such as one-hot vectors, but our problem involves a vocabulary that is too large. We would end up with vectors of dimension greater than 100,000, which makes training impossible on my poor computer.
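To make the size argument concrete, here is a small sketch of one-hot encoding on a toy vocabulary, followed by a rough count of how many numbers one article would need under each encoding. The five-word vocabulary is made up for illustration, and the 100,000-word vocabulary is the rough figure from the text, not a measured value.

```python
# Toy vocabulary, purely illustrative; the real one would have 100,000+ entries.
vocab = ["donald", "trump", "news", "year", "tweet"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # One slot per vocabulary word; only the slot for `word` is set to 1.
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("news"))  # the vector length equals the vocabulary size

# Numbers stored per article of 120 tokens under each encoding.
max_len = 120
one_hot_size = max_len * 100_000  # assumed vocabulary of 100,000 words
glove_size = max_len * 100        # 100-dimensional GloVe vectors
print(one_hot_size // glove_size)  # how many times larger the one-hot input is
```

Besides being smaller, the dense vectors place related words near each other, which one-hot vectors cannot do because every pair of words is equally distant.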

In [None]:
from collections import defaultdict

# Load GloVe embeddings
def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            embeddings_dict[word] = vector
    return embeddings_dict

glove_embeddings = load_glove_embeddings("glove.6B.100d.txt")

# Function to convert articles to sequences of embeddings
def article_to_embedding(article, embeddings_dict, max_len):
    embedding_dim = len(next(iter(embeddings_dict.values())))
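    # float32 matches the GloVe vectors and halves memory versus NumPy's float64 default
    embedded_article = np.zeros((max_len, embedding_dim), dtype=np.float32)

    words = article.split()[:max_len]
    for i, word in enumerate(words):
        # Out-of-vocabulary words (and padding positions) keep their all-zero row
        if word in embeddings_dict:
            embedded_article[i] = embeddings_dict[word]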

    return embedded_article


max_len = 120  # Choose based on dataset analysis
embedded_articles = np.array([article_to_embedding(article, glove_embeddings, max_len) for article in df['preprocessado']])
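One possible way to do the "dataset analysis" behind `max_len` is to look at the distribution of article lengths and check how many articles a given cutoff would truncate. The sketch below uses made-up token counts. In the notebook they could be computed with `len(article.split())` over `df['preprocessado']`, which is an assumption about how one might wire it in.

```python
import statistics

# Hypothetical token counts per article (made-up numbers for illustration).
token_counts = [35, 48, 60, 72, 85, 90, 110, 130, 240, 900]

# quantiles(n=100) returns 99 cut points; index 89 is the 90th percentile.
percentiles = statistics.quantiles(token_counts, n=100)
p90 = percentiles[89]
print(f"median={statistics.median(token_counts)}, p90={p90}")

def truncation_rate(counts, max_len):
    # Fraction of articles that would lose tokens at this max_len.
    return sum(c > max_len for c in counts) / len(counts)

print(truncation_rate(token_counts, 120))
```

A larger `max_len` truncates fewer articles but makes every padded sequence longer, so memory and training time grow with it.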


In [None]:
text_as_vectors = torch.as_tensor(embedded_articles, dtype=torch.float)

embedding_dim = text_as_vectors.size(2)

text_as_vectors.size()

torch.Size([10000, 120, 100])

In [None]:
from torch.utils.data import DataLoader, TensorDataset

df['authenticity_as_num'] = df['authenticity'].apply(lambda x: 0 if x == 'Fake' else 1)
labels = torch.as_tensor(df['authenticity_as_num'].values, dtype=torch.long)

dataset = TensorDataset(text_as_vectors, labels)

# Splitting dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)


## Let's build the model


In [None]:
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleRNN, self).__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)  # inputs arrive as (batch, seq_len, input_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        output, _ = self.rnn(x)
        output = self.fc(output[:, -1, :])
        return output

# Example usage: model = SimpleRNN(input_dim=embedding_dim, hidden_dim=128, output_dim=2)
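As a rough picture of what a plain (non-LSTM) recurrent layer with a tanh activation computes, the sketch below runs the recurrence h_t = tanh(W_ih x_t + W_hh h_(t-1) + b) over one sequence in pure Python. The weights and sizes are toy values chosen for illustration, and the two bias vectors that PyTorch keeps separately are merged into a single `b` here as a simplification.

```python
import math

def rnn_step(x_t, h_prev, W_ih, W_hh, b):
    # h_t[j] = tanh(sum_i W_ih[j][i]*x_t[i] + sum_k W_hh[j][k]*h_prev[k] + b[j])
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_ih[j], x_t))
            + sum(w * h for w, h in zip(W_hh[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]

def run_rnn(sequence, W_ih, W_hh, b):
    h = [0.0] * len(b)      # the initial hidden state is all zeros
    for x_t in sequence:    # one step per token vector, in order
        h = rnn_step(x_t, h, W_ih, W_hh, b)
    return h                # hidden state after the last time step

# Toy sizes (assumptions): 2-dimensional inputs, 3 hidden units, 4 time steps.
W_ih = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b = [0.0, 0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
print(run_rnn(sequence, W_ih, W_hh, b))
```

The hidden state after the last step is what the linear layer reads for classification. Because the same `W_hh` is applied at every step, information from early tokens has to survive many repeated transformations, which is the weakness the LSTM's gated cell state is meant to ease.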


In [None]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True) 
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        output, (hidden, cell) = self.lstm(x)
        # output shape: (batch, seq_len, hidden_dim)

        # Take the output of the last time step for classification

        output = self.fc(output[:, -1, :])  # shape: (batch, output_dim)
        
        return output


In [None]:
import torch

def train_model(model, train_loader, val_loader, epochs, learning_rate):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in train_loader:
           
            inputs, labels = batch        
                
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader)}")

        # Validation
        model.eval()
        total = 0
        correct = 0
        with torch.no_grad():
            for batch in val_loader:
                inputs, labels = batch
                outputs = model(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        print(f'Validation Accuracy: {accuracy}%')


model = SimpleRNN(input_dim=100, hidden_dim=128, output_dim=2)

train_model(model, train_loader, val_loader, epochs=40, learning_rate=  0.001)



In [None]:

model2 = LSTMModel(input_dim=100, hidden_dim=256, output_dim=2)
train_model(model2, train_loader, val_loader, epochs=10, learning_rate=0.001)

KeyboardInterrupt: 