# Exercises 
Take the lessons from the last two notebooks and apply them to the following exercises.

## Alternative Algorithms 
In the `05_Classic Machine Learning in NLP` notebook, we used a Naive Bayes classifier to identify spam emails. As an exercise, implement an alternative algorithm to predict whether an email is spam.

### Importing the Data
You can reference the `05_Classic Machine Learning in NLP` notebook for tips on reading in and parsing the email data.

In [3]:
# Fill in code in the appropriate spots

import pandas as pd

# Read in spam filter example (adjust the path to your local copy of emails.csv)
# A raw string needs only single backslashes
spam_df = pd.read_csv(r"D:\Coding_Stuff\GitHub\Natural-Language-Processing\data\emails.csv")
spam_df.head(100)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
95,"Subject: v - shoop hello , welcome to the med...",1
96,Subject: you need only 15 minutes to prepare f...,1
97,Subject: do i require an attorney to use this ...,1
98,Subject: high - quality affordable logos corp...,1


### Data Cleaning
Now that the data has been read in, we can define a `remove_stopwords` function and apply it to the `text` column of our pandas `DataFrame`.

In [14]:
# Fill in code in the appropriate spots
import string
from nltk.corpus import stopwords

# Build the stop-word set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))

# Define function to remove punctuation and stop words
def remove_stopwords(text: str) -> str:
    no_punctuation = "".join(character for character in text if character not in string.punctuation)

    return " ".join(word for word in no_punctuation.split() if word.lower() not in STOP_WORDS)
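
To see what this cleaning step does to a single email, here is a minimal sketch on one made-up subject line. It assumes a tiny hand-written stop-word set (`DEMO_STOPWORDS`, hypothetical) in place of NLTK's English list so it runs without downloading any corpus; `remove_stopwords_demo` is likewise an illustrative name, not part of the notebook.

```python
import string

# Hypothetical, hand-written stop-word set used only for this illustration;
# the notebook itself relies on nltk.corpus.stopwords.words("english").
DEMO_STOPWORDS = {"the", "is", "a", "to", "your", "this"}

def remove_stopwords_demo(text: str, stop_words: set) -> str:
    # Strip punctuation character by character, as in the cell above.
    no_punctuation = "".join(ch for ch in text if ch not in string.punctuation)
    # Keep only the tokens whose lowercase form is not a stop word.
    return " ".join(w for w in no_punctuation.split() if w.lower() not in stop_words)

cleaned = remove_stopwords_demo(
    "Subject: This is your FREE offer, claim it today!", DEMO_STOPWORDS
)
print(cleaned)
```

Note that the original casing of the surviving words is kept; only the stop-word comparison is case-insensitive.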

In [15]:
# Fill in code in the appropriate spots
# Apply the stopword removal function

spam_df["removed_stopwords"] = spam_df["text"].apply(remove_stopwords)

In [16]:
# Fill in code in the appropriate spots
# Define the stemming function

from nltk.stem import PorterStemmer

def stem(text: str) -> str:
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in text.split()])

In [17]:
# Fill in code in the appropriate spots
# Apply the stemming function 
spam_df["stemming"] = spam_df["removed_stopwords"].apply(stem)

### Generating Embeddings

Now that the data has been cleaned, we'll need to generate numeric representations of our text data. For this exercise, use the `CountVectorizer`, which turns each email into a vector of word counts.

In [20]:
# Fill in code in the appropriate spots
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate, fit, and transform the `CountVectorizer`
vectorizer = CountVectorizer()
vectorized_matrix = vectorizer.fit_transform(spam_df["stemming"])

### Splitting Data
In modeling, it's crucial to split data into a training set and a test set so we can check that the model generalizes to unseen data. Split the data into a training set and a test set using a 70/30 split (`test_size=0.3`).

In [21]:
# Fill in code in the appropriate spots
# Split the data 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectorized_matrix, spam_df["spam"], test_size=0.3)
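
The split above is random, so results change from run to run, and the spam ratio can drift between the two sets. As a hedged sketch of one way to control both, the example below uses the same `test_size=0.3` on hypothetical toy data and adds `random_state` and `stratify`, which are optional arguments of `train_test_split` that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30 of them labeled spam (1).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# random_state makes the shuffle reproducible; stratify keeps the class
# ratio the same in both sets. Both are optional extras for illustration.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # sizes of the two sets
print(y_tr.mean(), y_te.mean())  # spam fraction in each set
```

Stratifying matters most for imbalanced labels such as spam, where a purely random split can leave the test set with noticeably more or fewer positives than the training set.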

Use a classifier of your choice to predict whether an email is spam.

In [22]:
# Fill in code in the appropriate spots
# Instantiate and train a logistic regression classifier
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
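
Since scikit-learn classifiers share the same `fit`/`predict`/`score` interface, trying a different algorithm is mostly a matter of swapping the estimator. The sketch below loops over three candidates on four made-up texts (`toy_texts` and `toy_labels` are hypothetical). It scores on the training data purely to show the interface, so the numbers say nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Made-up miniature corpus: 1 = spam, 0 = not spam.
toy_texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda attached",
    "project schedule update",
]
toy_labels = [1, 1, 0, 0]
X_toy = CountVectorizer().fit_transform(toy_texts)

# Candidate estimators; any scikit-learn classifier can be dropped in here.
candidates = {
    "logistic_regression": LogisticRegression(),
    "linear_svc": LinearSVC(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_toy, toy_labels)
    # Training accuracy only; use the held-out test set for a real comparison.
    scores[name] = clf.score(X_toy, toy_labels)

print(scores)
```

In the exercise itself, the same loop would fit on `X_train`, `y_train` and score on `X_test`, `y_test`.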

### Generate Predictions and Evaluate the Model
Now that the model has been trained, evaluate it on the test set (`X_test`, `y_test`) from above. For a robust view, calculate the `classification_report` and generate a confusion matrix.

In [23]:
# Fill in code in the appropriate spots
from sklearn.metrics import classification_report

preds = model.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1302
           1       0.98      0.97      0.98       417

    accuracy                           0.99      1719
   macro avg       0.99      0.98      0.98      1719
weighted avg       0.99      0.99      0.99      1719
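
The exercise also asks for a confusion matrix. Here is a minimal sketch using `sklearn.metrics.confusion_matrix` on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical); in the notebook the equivalent call would be `confusion_matrix(y_test, preds)`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = spam, 0 = not spam.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# For binary labels, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print({"tn": tn, "fp": fp, "fn": fn, "tp": tp})
```

For a spam filter, the false positives (legitimate emails flagged as spam) are often the costlier error, so it is worth reading that cell separately from overall accuracy.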



## Deep Learning for NLP

Use the LSTM and Dataset classes from the `06 Deep Learning` notebooks to create an `LSTM` model from scratch, but this time tune a few of the model's hyperparameters to see if you can generate a better response to the knock-knock joke.

In [15]:
import torch
from torch import nn
from torch.autograd import Variable

In [16]:
# Fill in code in the appropriate spots
class LSTM_Model(nn.Module):
    """
    LSTM model class
    """
    def __init__(self, dataset):
        """
        LSTM Model constructor
        
        @params:
        dataset: dataset used for model training
        
        @returns:
        None
        """
        super(LSTM_Model, self).__init__()
        self.lstm_size = 128
        self.embedding_dim = 128
        self.num_layers = FILL_IN

        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,  # the LSTM consumes the embedding output
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
#             dropout=FILL_IN,
        )
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        """
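        Forward pass: embed the word indices, run the LSTM, and project to vocabulary logits
        
        @params:
        x: torch.Tensor, batch of word indices
        prev_state: tuple of (hidden state, cell state) tensors
        
        @returns:
        logits: torch.Tensor, unnormalized scores over the vocabulary
        state: tuple of (hidden state, cell state) tensors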
        """
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        """
        Create zero-initialized hidden and cell states
        
        @params:
        sequence_length: int, length of sequence
        
        @returns:
        tuple of (hidden state, cell state) tensors
        
        """
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size))
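
Before filling in `num_layers` and `dropout`, it can help to check how they affect tensor shapes. The sketch below builds a standalone `nn.LSTM` with hypothetical values (2 layers, dropout 0.2; these are illustrative choices, not recommended settings) and runs a dummy batch through it.

```python
import torch
from torch import nn

# Hypothetical hyperparameter values, chosen only for illustration.
num_layers = 2
dropout = 0.2
lstm_size = 128

# Dropout is applied between stacked layers, so it only has an effect
# when num_layers > 1 (PyTorch warns otherwise).
lstm = nn.LSTM(
    input_size=lstm_size,
    hidden_size=lstm_size,
    num_layers=num_layers,
    dropout=dropout,
)

# With the default batch_first=False, input is (seq_len, batch, input_size)
# and each state tensor is (num_layers, batch, hidden_size).
dummy_input = torch.zeros(5, 4, lstm_size)
state = (
    torch.zeros(num_layers, 4, lstm_size),
    torch.zeros(num_layers, 4, lstm_size),
)

output, (h_n, c_n) = lstm(dummy_input, state)
print(output.shape, h_n.shape, c_n.shape)
```

The first dimension of the state tensors grows with `num_layers`, which is why `init_state` uses `self.num_layers` when it allocates the zeros.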

In [None]:
import torch
import pandas as pd
from collections import Counter

class Dataset(torch.utils.data.Dataset):
    """
    Torch Dataset class for data loading
    """
    def __init__(self, args):
        """
        Dataset class constructor
        
        @params:
        args: Dict[str, Any]
        
        @returns:
        None
        """
        self.args = args
        self.words = self.load_words()
        self.uniq_words = self.get_uniq_words()

        self.index_to_word = {index: word for index, word in enumerate(self.uniq_words)}
        self.word_to_index = {word: index for index, word in enumerate(self.uniq_words)}

        self.words_indexes = [self.word_to_index[w] for w in self.words]

    def load_words(self):
        """
        Load the jokes CSV and split its text into words
        
        @params:
        None
        
        @returns:
        list of str: all words in the corpus, in order
        """
        train_df = pd.read_csv('data/dl_data/reddit-cleanjokes.csv')
        text = train_df['Joke'].str.cat(sep=' ')
        return text.split(' ')

    def get_uniq_words(self):
        """
        Retrieve the unique words, sorted by descending frequency
        
        @params:
        None
        
        @returns:
        list of str: unique words, most frequent first
        
        """
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)
    
    def __len__(self):
        """
        Number of samples: total word count minus the sequence length
        
        @params:
        None
        
        @return:
        int: number of (input, target) windows in the dataset
        """
        return len(self.words_indexes) - self.args["sequence_length"]

    def __getitem__(self, index):
        """
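        Get one (input, target) pair of word-index sequences
        
        @params:
        index: int, start position of the window
        
        @returns:
        tuple of torch.Tensor: the input window and the same window shifted right by one word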
        """
        return (
            torch.tensor(self.words_indexes[index:index+self.args["sequence_length"]]),
            torch.tensor(self.words_indexes[index+1:index+self.args["sequence_length"]+1]),
        )
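
The windowing in `__len__` and `__getitem__` is easier to follow on a short list. This is a plain-Python sketch of the same slicing, with `toy_indexes` and `get_pair` as hypothetical stand-ins for `self.words_indexes` and `__getitem__`.

```python
# Stand-in for self.words_indexes: six made-up word indices.
toy_indexes = [10, 11, 12, 13, 14, 15]
sequence_length = 4

def get_pair(indexes, index, seq_len):
    # Input window, and the same window shifted one position to the right.
    return indexes[index:index + seq_len], indexes[index + 1:index + seq_len + 1]

# Mirrors __len__: the last valid start index still leaves room for the shifted target.
n_items = len(toy_indexes) - sequence_length

pairs = [get_pair(toy_indexes, i, sequence_length) for i in range(n_items)]
for x, y in pairs:
    print(x, "->", y)
```

Each target is the input shifted by one word, so at every position the model is trained to predict the next word.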

In [None]:
# Fill in code in the appropriate spots
import argparse
import torch
import numpy as np
from torch import nn, optim
from torch.utils.data import DataLoader

def train(dataset, model, args):
    """
    Main training function 
    
    @params:
    dataset: torch.utils.data.Dataset
    model: LSTM_Model
    args: Dict[str, Any]
    
    @returns:
    None
    """
    model.train()

    dataloader = DataLoader(dataset, batch_size=args["batch_size"])
    criterion = nn.CrossEntropyLoss()
#     optimizer = optim.Adam(model.parameters(), lr=FILL_IN)

    for epoch in range(args["max_epochs"]):
        state_h, state_c = model.init_state(args["sequence_length"])

        for batch, (x, y) in enumerate(dataloader):
            optimizer.zero_grad()

            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
            loss = criterion(y_pred.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()
            
            loss.backward()
            optimizer.step()

            print({ 'epoch': epoch, 'batch': batch, 'loss': loss.item() })

In [None]:
def predict(dataset, model, text, next_words=100):
    """
    Generate predictions 
    
    @params:
    dataset: Dataset
    model: LSTM_Model, model 
    text: str
    next_words: int, number of next words to generate
    
    @returns:
    words: list, list of words
    
    """
    model.eval()

    words = text.split(' ')
    state_h, state_c = model.init_state(len(words))

    for i in range(0, next_words):
        x = torch.tensor([[dataset.word_to_index[w] for w in words[i:]]])
        y_pred, (state_h, state_c) = model(x, (state_h, state_c))

        last_word_logits = y_pred[0][-1]
        p = torch.nn.functional.softmax(last_word_logits, dim=0).detach().numpy()
        word_index = np.random.choice(len(last_word_logits), p=p)
        words.append(dataset.index_to_word[word_index])
        

    return words
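
The sampling step in `predict` turns logits into probabilities and then draws a word index at random, which is why repeated calls give different jokes. Here is a NumPy-only sketch of that step on made-up logits (`toy_logits` is hypothetical), alongside the greedy alternative of always taking the most likely word.

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up scores for a three-word vocabulary.
toy_logits = np.array([2.0, 1.0, 0.1])
p = softmax(toy_logits)

# Sampling, as in predict(): likelier words are drawn more often, but not always.
rng = np.random.default_rng(0)
sampled_index = int(rng.choice(len(toy_logits), p=p))

# Greedy alternative: always pick the highest-probability word.
greedy_index = int(np.argmax(p))

print(p, sampled_index, greedy_index)
```

Greedy decoding is deterministic but tends to repeat itself; sampling adds variety at the cost of occasional unlikely words.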

In [None]:
# Fill in code in the appropriate spots
# Define main arguments for training
args = {
#     "max_epochs": FILL_IN, 
#     "batch_size": FILL_IN,
    "sequence_length": 4
}


# Instantiate the dataset
dataset = Dataset(args)

# Instantiate the model 
model = LSTM_Model(dataset)

# Train the model using dataset
train(dataset, model, args)
print(predict(dataset, model, text='Knock knock. Whos there?'))
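
When choosing `max_epochs` and `batch_size`, it is useful to know how many optimizer steps a setting implies. The sketch below does that arithmetic with hypothetical values (`demo_args` and the 1004-word corpus size are made up); it assumes the `DataLoader` default of keeping the final partial batch.

```python
import math

# Hypothetical settings, for illustration only; not recommended values.
demo_args = {"max_epochs": 10, "batch_size": 256, "sequence_length": 4}

def count_training_steps(n_word_indexes, args):
    # Mirrors Dataset.__len__: one sample per valid window start.
    n_samples = n_word_indexes - args["sequence_length"]
    # The last, smaller batch is kept by default, hence the ceiling.
    batches_per_epoch = math.ceil(n_samples / args["batch_size"])
    return batches_per_epoch, batches_per_epoch * args["max_epochs"]

batches_per_epoch, total_steps = count_training_steps(1004, demo_args)
print(batches_per_epoch, total_steps)
```

A smaller batch size or more epochs both increase the number of weight updates, so they are best tuned together with the learning rate.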