### What we will learn
We will explore deep learning using PyTorch for text classification and generation. We'll cover encoding, deep learning models for text, and advanced topics around transformer architecture and protecting our models from attacks. These skills apply to real-world tasks, like sentiment analysis, text summarization, and machine translation.

### 3. What you should know
Before we begin, you should already be familiar with developing deep learning models with PyTorch, including training and evaluation loops, and have familiarity with convolutional and recurrent neural networks.

### 4. Text processing pipeline
Welcome to the text processing pipeline! Our text analysis approach in PyTorch involves preprocessing, encoding, and Dataset and DataLoader. This video will focus on preprocessing. We will explore encoding and recap Dataset and DataLoader later in the chapter. Let's begin.

### 5. Text processing pipeline
In preprocessing, we clean and prepare the text data for encoding.

### 6. PyTorch and NLTK
Preprocessing raw text data utilizes natural language processing techniques. We'll use PyTorch and NLTK, the Natural Language Toolkit, which provides a range of techniques to transform raw text into processed text.

### 7. Preprocessing techniques
We will discuss tokenization, stop word removal, stemming, and rare word removal.

### 8. Tokenization
The first step in text preprocessing is tokenization. This is where we extract tokens from text. A token could be a full word, part of a word, or a punctuation mark. We'll use the PyTorch get_tokenizer function imported from torchtext-dot-data-dot-utils. The basic_english tokenizer supports the English language. We input the sentence: "I am reading a book now. I love to read books!". By applying tokenization, our output becomes a list of tokens.

### 9. Stop word removal
Next is stopword removal, where NLTK is more suited. Here, we eliminate stopwords or commonly occurring words such as a, the, and, or, and others that don't contribute to the meaning of a text, allowing the model to focus on the words with meaning. We download the stopwords collection of words, also known as corpus, from nltk using nltk-dot-download and import the stopwords package. We create a set of stopwords with no duplicates using stopwords-dot-words. We use English to process English text, but other options are available. With list comprehension, we iterate through the tokens we previously created and filter out any stopwords. Note the use of the lower method; this helps us capture all instances of stopwords regardless of capitalization. Finally, we print the filtered tokens.

### 10. Stemming
Stemming reduces words or tokens to their base or root form for simplified analysis. For example, "running" and "runs" would both be reduced to "run" using stemming. Because stemming works by stripping suffixes with rules, irregular forms such as "ran" are left unchanged. We use the NLTK library's PorterStemmer class to perform stemming on a set of words or tokens. We initialize the PorterStemmer. Its input will be a list of tokenized words with stopwords removed. We iterate through this list using stemmer-dot-stem to stem each token. In the output, reading becomes read, and books becomes book.

### 11. Rare word removal
Lastly, we can remove rare words that occur infrequently and may not provide value for our text analysis. We calculate the word frequencies using the FreqDist function from the nltk-dot-probability module and define the tokens input. We then define a threshold value of two to determine the rare words. We filter out the rare words by keeping only tokens whose frequency exceeds the threshold. Then, we print the result.

### 12. Preprocessing techniques
The techniques we have covered help refine our text data by reducing the number of features and creating cleaner, more representative datasets. We have only covered a few techniques here. Many more exist but are out of scope for this course. We encourage you to explore these further.

In [None]:
# Import the necessary functions
import nltk
from torchtext.data.utils import get_tokenizer
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer

nltk.download('stopwords')
from nltk.corpus import stopwords


text = "In the city of Dataville, a data analyst named Alex explores hidden insights within vast data. With determination, Alex uncovers patterns, cleanses the data, and unlocks innovation. Join this adventure to unleash the power of data-driven decisions."

# Initialize the tokenizer and tokenize the text
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)

threshold = 1
# Remove rare words and print common tokens
freq_dist = FreqDist(tokens)
common_tokens = [token for token in tokens if freq_dist[token] > threshold]
print(common_tokens)

In [None]:

# Initialize and tokenize the text
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)

# Remove any stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

# Perform stemming on the filtered tokens
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

### 2. Text encoding
Encoding happens after processing the data. Using PyTorch, we can convert text into machine-readable numbers for analysis and modeling. As seen in the image, each value in the red table is encoded in the blue table.

### 3. Encoding techniques
We will discuss three encoding methods: One-hot encoding transforms words into unique numerical representations, Bag-of-Words captures word frequency disregarding order, and TF-IDF balances the uniqueness and importance of words in a document. Additionally, embedding converts words into vectors representing semantic meanings. We will review embeddings in the next chapter.

### 4. One-hot encoding
With one-hot encoding, each word maps onto a distinct one-hot binary vector within the encoding space where one represents the presence of a word and zero the absence. For instance, in a vocabulary consisting of cat, dog, and rabbit, the one-hot vector for 'cat' could be `[1, 0, 0]`, `[0, 1, 0]` for 'dog' and `[0, 0, 1]` for 'rabbit'.

### 5. One-hot encoding with PyTorch
We have a vocab list that contains input tokens. For sentence input, we tokenize to create a list of tokens. We first determine the vocab list length. Using torch, we utilize the torch-dot-eye function to generate one-hot vectors for the length of our list. We create a dictionary called one_hot_dict where each word is mapped to its corresponding vector from one_hot_vectors. This allows us to easily access the vector representation of any word in our vocabulary.

### 6. Bag-of-words
Alternatively, we could improve our models by adding more meaning with bag-of-words, which treats a document as an unordered collection of words, emphasizing word frequency over order. For instance, the sentence 'The cat sat on the mat' is converted into a dictionary. In our case, "the" is the only word that appears twice.
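
As a minimal sketch of that dictionary, the counts can be reproduced with Python's standard library alone. Lowercasing and splitting on whitespace are simplifying assumptions here, not the tokenizer used elsewhere in the course.

```python
from collections import Counter

sentence = "The cat sat on the mat"

# Lowercase and split on whitespace (a simplification of real tokenization)
tokens = sentence.lower().split()

# Count how often each token occurs; word order is discarded
bag_of_words = dict(Counter(tokens))
print(bag_of_words)
```

Only "the" receives a count of two; every other word appears once.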

### 7. CountVectorizer
In some cases, like this one, sklearn streamlines Bag-of-Words implementation. We import CountVectorizer from sklearn-dot-feature_extraction-dot-text. We instantiate a CountVectorizer object. We define our corpus, a collection of text documents represented here as a list of sentences. This can also be a tokenized list. We fit our vectorizer to the corpus and transform it into a numerical format using fit_transform. This produces our Bag-of-Words representation, which we store in X and print using the toarray function. We can visualize the words by extracting the feature names from the vectorizer with dot-get_feature_names_out. The output is a term frequency matrix, where each row corresponds to a document and each column corresponds to a word. For example, the presence of "and" in the first column is indicated by a one in the third row.

### 8. TF-IDF
The last technique we will cover is TF-IDF or Term Frequency-Inverse Document Frequency. It assesses word importance by considering word frequency across all documents, assigning higher scores to rare words and lower scores to common ones. TF-IDF emphasizes informative words in our text data, unlike bag-of-words, which treats all words equally.
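
To make the scoring concrete, here is a hand-rolled sketch of the classic formula, term frequency multiplied by log(N / document frequency). The toy corpus and the `tf_idf` helper are hypothetical, and the sketch assumes the term occurs in at least one document. sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its scores will differ from these.

```python
import math

# Toy corpus; each document is already tokenized (hypothetical example data)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of the document's tokens that are `term`
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing `term`
    df = sum(1 for d in docs if term in d)
    # Classic inverse document frequency: log(N / df)
    idf = math.log(len(docs) / df)
    return tf * idf

score_the = tf_idf("the", docs[0], docs)  # appears in every document
score_cat = tf_idf("cat", docs[0], docs)  # appears in two of three documents
score_sat = tf_idf("sat", docs[0], docs)  # appears in only one document
print(score_the, score_cat, score_sat)
```

The word that occurs in every document scores zero, while the rarest word scores highest.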

### 9. TfidfVectorizer
To use TF-IDF we import TfidfVectorizer from sklearn. We instantiate a TfidfVectorizer object using the same corpus as before and fit it like we did for CountVectorizer. This transforms the data into TF-IDF vectors. TF-IDF can also accept a tokenized list. The toarray function yields a matrix of TF-IDF scores. We print the feature names. Every row in the matrix represents a document from the corpus. The feature names list contains the vocabulary learned from the corpus, and each word corresponds to a column of the matrix.

### 10. TfidfVectorizer
For instance, the importance of the word first is highest in the first sentence with a score of zero-point-six-eight.

### 11. Encoding techniques
Encoding allows models to understand and process text. Ideally, we choose one technique for encoding to avoid redundant computations. As with processing, other encoding techniques exist but are beyond this course's scope. We will cover embeddings in the next chapter.

In [None]:
import torch

genres = ['Fiction','Non-fiction','Biography', 'Children','Mystery']

# Define the size of the vocabulary
vocab_size = len(genres)

# Create one-hot vectors
one_hot_vectors = torch.eye(vocab_size)

# Create a dictionary mapping genres to their one-hot vectors
one_hot_dict = {genre: one_hot_vectors[i] for i, genre in enumerate(genres)}

for genre, vector in one_hot_dict.items():
    print(f'{genre}: {vector.numpy()}')

In [None]:
# Import from sklearn
from sklearn.feature_extraction.text import CountVectorizer

titles = ['The Great Gatsby','To Kill a Mockingbird','1984','The Catcher in the Rye','The Hobbit', 'Great Expectations']

# Initialize Bag-of-words with the list of book titles
vectorizer = CountVectorizer()
bow_encoded_titles = vectorizer.fit_transform(titles)

# Extract and print the first five features
print(vectorizer.get_feature_names_out()[:5])
print(bow_encoded_titles.toarray()[0, :5])

In [None]:
# Importing TF-IDF from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

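# Initialize TF-IDF encoding vectorizer
# (`descriptions` is assumed to be a list of strings loaded earlier; it is not defined in this cell)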
vectorizer = TfidfVectorizer()
tfidf_encoded_descriptions = vectorizer.fit_transform(descriptions)

# Extract and print the first five features
print(vectorizer.get_feature_names_out()[:5])
print(tfidf_encoded_descriptions.toarray()[0, :5])

### 2. Recap: preprocessing
The first pipeline component is preprocessing. Recall the techniques we reviewed are tokenization, stopword removal, stemming, and rare word removal. These actions help to reduce the complexity of our models.

### 3. Text processing pipeline
The second component is encoding. Here, we convert our preprocessed text into numerical vectors using methods like One-Hot Encoding, Bag-of-Words, or TF-IDF. This enables our models to understand and process textual data. Another technique is embeddings, which will be discussed in the next chapter.

### 4. Text processing pipeline
We complete our pipeline by using PyTorch's Dataset and DataLoader. In our text processing pipeline, we will use Dataset as a container for our processed and encoded text data. DataLoader then allows us to iterate over this dataset in batches, shuffle the data, and apply multiprocessing for efficient loading.

### 5. Recap: implementing Dataset and DataLoader
Let's review applying Dataset and DataLoader to text data in PyTorch. We create a custom class, TextDataset, serving as our data container. The init method initializes the dataset with the input text data. The len method returns the total number of samples in the dataset, and the getitem method allows us to access a specific sample at a given index. This class, extending PyTorch's Dataset, allows us to organize and access our text data efficiently.

### 6. Recap: integrating Dataset and DataLoader
After encoding our text data, we instantiate our TextDataset with the encoded text. We then create a DataLoader, making the dataset iterable.
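
The recap above can be sketched as follows. The `encoded_text` tensor is hypothetical example data standing in for the output of an encoder, and the batch size of two is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, data):
        # Store the encoded text
        self.data = data

    def __len__(self):
        # Total number of samples
        return len(self.data)

    def __getitem__(self, idx):
        # One sample at the given index
        return self.data[idx]

# Hypothetical encoded text: four documents, three count features each
encoded_text = torch.tensor([[1, 0, 2], [0, 1, 1], [3, 0, 0], [0, 2, 1]])

dataset = TextDataset(encoded_text)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in dataloader:
    # Each batch stacks two samples into a (2, 3) tensor
    print(batch.shape)
```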

### 7. Using helper functions
For convenience, we'll use helper functions for preprocessing and encoding. preprocess_sentences combines the techniques we've covered; we can also customize it to only include specific techniques depending on the problem. We've chosen CountVectorizer in encode_sentences to convert the cleaned sentences into arrays. We've included an extract_sentences function that uses regular expressions (regex) to split text into English sentences. While regex is beyond the scope of this course, we've included it here for potential use in the pre-exercise code.

### 8. Constructing the text processing pipeline
Now, let's construct our text processing pipeline. We define a function text_processing_pipeline that takes raw text as input. Within this function, we preprocess the text using the preprocess_sentences function. This returns a list of cleaned sentences. Next, we convert these sentences into numerical vectors using the encode_sentences function. After encoding, we instantiate our PyTorch TextDataset with the numerical vectors, then initialize a DataLoader with this dataset. The DataLoader will allow us to iterate over the dataset in manageable batches of size two and in a shuffled manner, ensuring a diverse mix of examples in each batch.

### 9. Applying the text processing pipeline
With our text processing pipeline function ready, we can apply it to any text data. Let's say we have two sentences: "This is the first text data" and "And here is another one". We call the extract_sentences function to split the text into sentences. We feed each of these sentences into our text_processing_pipeline function. This preprocesses, encodes, and loads them into individual DataLoaders, stored in the dataloaders list using list comprehension. We also store an instance of the vectorizer created during encoding to access the feature names for each vector. Finally, the print statement combines next and iter to access the first batch of data from a dataloader. The output is the first ten components of the first batch in the dataloader. It contains the encoded representation of the sentences, where each component is the frequency of the corresponding vocabulary word in the sentence.

### 10. Text processing pipeline: it's a wrap!
Excellent work! Our text processing pipeline efficiently converts raw text data into a machine-learning-ready format. After processing the text through this pipeline, we can use the resulting structured data to train, validate, and test models. We'll apply this pipeline to large datasets in upcoming chapters.

In [None]:
# Create a list of stopwords
stop_words = set(stopwords.words("english"))

# Initialize the tokenizer and stemmer
tokenizer = get_tokenizer("basic_english")
stemmer = PorterStemmer() 

# Complete the function to preprocess sentences
def preprocess_sentences(sentences):
    processed_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        tokens = tokenizer(sentence)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [stemmer.stem(token) for token in tokens]
        processed_sentences.append(' '.join(tokens))
    return processed_sentences

processed_shakespeare = preprocess_sentences(shakespeare)
print(processed_shakespeare[:5]) 

In [None]:
import re

def extract_sentences(data):
    return re.findall(r'[A-Z][^.!?]*[.!?]', data)

In [None]:
from torch.utils.data import Dataset, DataLoader

# Define your Dataset class
class ShakespeareDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

# Complete the encoding function
def encode_sentences(sentences):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    return X.toarray(), vectorizer
    
# Complete the text processing pipeline
def text_processing_pipeline(sentences):
    processed_sentences = preprocess_sentences(sentences)
    encoded_sentences, vectorizer = encode_sentences(processed_sentences)
    dataset = ShakespeareDataset(encoded_sentences)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer

dataloader, vectorizer = text_processing_pipeline(processed_shakespeare)

# Print the vectorizer's feature names and the first 10 components of the first item
print(vectorizer.get_feature_names_out()[:10]) 
print(next(iter(dataloader))[0, :10])

### Text classification defined
Text classification assigns labels to text, giving meaning to words and sentences. It helps in organizing and giving structure to unstructured data and is crucial in various applications, such as analyzing customer sentiment in reviews, detecting spam emails, or tagging news articles with topics. We'll cover three classification types: binary, multi-class, and multi-label.

### 3. Binary classification
Binary classification sorts text into two categories, such as spam and not spam, as seen in email spam detection.

### 4. Multi-class classification
Multi-class classification categorizes text into more than two categories. For example, a news article could be classified into one of various categories, like politics, sports, or technology, depending on its content.

### 5. Multi-label classification
In multi-label classification, text can belong to multiple categories simultaneously, unlike multi-class where it belongs to just one category. For example, a book can fit into multiple genres like action, adventure, and fantasy all at the same time.
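
The difference between multi-class and multi-label output can be sketched with a single set of raw model scores. The logits below are made up for illustration, and the 0.5 cut-off for the multi-label case is an assumed, commonly used threshold rather than a fixed rule.

```python
import torch

# Hypothetical raw model scores (logits) for one text over three categories
categories = ["action", "adventure", "fantasy"]
logits = torch.tensor([[2.0, 1.0, -1.0]])

# Multi-class: softmax makes the categories compete, so exactly one is chosen
multi_class_probs = torch.softmax(logits, dim=1)
multi_class_pred = categories[multi_class_probs.argmax(dim=1).item()]

# Multi-label: sigmoid scores each category independently; keep all above 0.5
multi_label_probs = torch.sigmoid(logits)
multi_label_pred = [
    category
    for category, prob in zip(categories, multi_label_probs[0])
    if prob.item() > 0.5
]

print(multi_class_pred)
print(multi_label_pred)
```

Binary classification is the special case with only two categories to choose from.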

### 6. What are word embeddings
Classifying text sometimes requires an understanding of the meaning of words. Previously, we covered encoding techniques including one-hot, bag-of-words, and TF-IDF in the processing pipeline. While these techniques are fundamental to preprocessing and are a good first step to extracting features, they often result in too many features and can't identify similar words. In contrast, word embeddings represent words as numerical vectors, preserving semantic meanings and connections between words like king and queen or man and woman. While the diagram shows three-dimensional vectors, real-world word embeddings often have much higher dimensionality.

### 7. Word to index mapping
To enable embedding, we assign a unique index to each word with word-to-index mapping. For instance, we can translate "King" to one and "Queen" to two, giving us a numerical representation that is more compact and computationally efficient compared to the previous encoding techniques. Word-to-index mapping typically follows tokenization in the text processing pipeline, but it can follow any of the preprocessing techniques we've covered.
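
A minimal sketch of word-to-index mapping in plain Python follows. It assumes whitespace tokenization and zero-based indexes, and it builds the vocabulary from unique tokens so that a repeated word always maps to the same index.

```python
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Build the vocabulary from unique tokens, keeping first-seen order
vocab = list(dict.fromkeys(tokens))

# Assign each unique word an index
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Encode the sentence as a sequence of indexes
indexes = [word_to_idx[token] for token in tokens]
print(word_to_idx)
print(indexes)
```

Both occurrences of "the" share index zero, so the sentence of six tokens uses only five distinct indexes.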

### 8. Word embeddings in PyTorch
PyTorch's torch-dot-nn-dot-Embedding is a flexible tool for creating word embeddings. It takes word indexes and transforms them into word vectors or embeddings. For example, given our sentence "The cat sat on the mat", it produces a unique vector for each word based on its index. Initially, these lists, or embeddings, contain random numbers because the model hasn't learned the meanings of the words yet. Through training, these embeddings start to change and learn, helping the model understand word meanings and their relationships.

### 9. Using torch.nn.Embedding
Let's implement word embeddings using PyTorch's nn-dot-Embedding. First, we construct our words list. We employ a dictionary, word-to-idx, to map words to indexes by enumerating the words through a dictionary comprehension. Utilizing the torch-dot-LongTensor function, we represent these mapped values as a tensor. We define an embedding layer with the num_embeddings argument set to the length of the words list. embedding_dim specifies the size of each embedding vector; we've set it to ten. Remember, embedding_dim is a hyperparameter that can be increased to tune our results further, but as we have only six words, we have chosen a small embedding size. We pass the tensor of indexes through the embedding layer. The output is an embedding for each input value. In our case, it gives us a two-dimensional tensor with six rows and ten columns. Each row contains the embedding for the corresponding word.

### 10. Using embeddings in the pipeline
Here is what this step would look like in the full pipeline with the dataset and dataloader we previously created. Here, we perform our embedding on the data generated by the dataloader.

In [None]:
import torch
import torch.nn as nn

# Map a unique index to each word
words = ["This", "book", "was", "fantastic", "I", "really", "love", "science", "fiction", "but", "the", "protagonist", "was", "rude", "sometimes"]
word_to_idx = {word: i for i, word in enumerate(words)}

# Convert word_to_idx to a tensor
inputs = torch.LongTensor([word_to_idx[w] for w in words])

# Initialize embedding layer with ten dimensions
embedding = nn.Embedding(num_embeddings=len(words), embedding_dim=10)

# Pass the tensor to the embedding layer
output = embedding(inputs)
print(output)

### 2. CNNs for text classification
We have seen CNNs used for classifying images, but they can also apply to text, for example, for classifying tweets as positive, negative, or neutral.

### 3. The convolution operation
The key operation in CNNs is convolution, where a filter or kernel slides over the input, performing element-wise calculations. This helps the model learn detailed word and sentence structure and meaning in text data.

### 4. Filter and stride in CNNs
In the convolution operation, we use a filter, a small matrix that slides over the input data, a matrix of tensors. We use a parameter called stride, which determines how many positions the filter moves each time it slides. In the image, we move a two by two filter with a stride of two.
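
The sliding behaviour can be sketched in plain Python. The 4x4 input, the 2x2 filter values, and the `convolve2d` helper are hypothetical, and the sketch assumes a square filter with no padding.

```python
def convolve2d(matrix, kernel, stride):
    k = len(kernel)
    n_rows, n_cols = len(matrix), len(matrix[0])
    output = []
    # Move the top-left corner of the filter `stride` positions at a time
    for r in range(0, n_rows - k + 1, stride):
        row = []
        for c in range(0, n_cols - k + 1, stride):
            # Multiply element-wise and sum over the current window
            total = sum(
                matrix[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output

matrix = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0], [0, 1]]  # hypothetical filter: sums each window's diagonal

result = convolve2d(matrix, kernel, stride=2)
print(result)
```

With a stride of two the windows do not overlap, so the 4x4 input shrinks to a 2x2 output; a stride of one would give a 3x3 output instead.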

1 Animation from Vincent Dumoulin, Francesco Visin

### 5. CNN architecture for text
A typical CNN architecture for text classification consists of three layers. The convolutional layer applies filters to the input data to detect patterns. The pooling layer reduces the size of the data while preserving important information. Finally, the fully connected layer uses the previous layer outputs for final predictions.

### 6. Implementing a text classification model using CNN
Let's build a sentiment analysis model, starting with the SentimentAnalysisCNN class. Much of the code will look familiar, and we'll prepare the dataset in later steps. The init method accepts vocabulary size and embedding dimension to configure the network architecture. The super method initializes the base class of nn-dot-Module to leverage the PyTorch framework properly. We initialize an embedding layer using nn-dot-Embedding, which creates dense vectors with the specified vocabulary size and embed_dim. In our case, self-dot-conv directly initializes a single convolutional layer, while in other models, convolutional layers are grouped and applied sequentially. Convolutions apply nn-dot-Conv1d, with uniform input-output channels, kernel size, stride, and padding, which ensures uniform text sequence lengths. Conv1d is preferred over Conv2d as our text data is one-dimensional. Lastly, the nn-dot-Linear layer transforms the combined outputs of all convolutional layers into the desired target output size. We omit a pooling layer in this model because our data in the exercises will be small.

### 7. Implementing a text classification model using CNN
In the forward method, we pass the input text through an embedding layer, which converts each word to its embedding. The embedding output is ordered as batch size, sequence length, and embedding size, but the convolutional layer expects batch size, embedding size, and sequence length, so we permute the dimensions with the order zero, two, and one. We use the convolutional layer with a ReLU activation function to extract important features from the embeddings. Because ReLU has no learnable parameters, we can apply it as a function in forward instead of defining it as a layer in init. conved-dot-mean calculates the average across the sequence length, reducing each sentence to one average value per feature so that the fully connected layer receives a fixed-size input.
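
A shape trace may help here. This is a sketch under assumed sizes (a vocabulary of 20, embeddings of size 10, one sentence of six tokens), with random indexes standing in for real text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes chosen for illustration
vocab_size, embed_dim, batch_size, seq_len = 20, 10, 1, 6

embedding = nn.Embedding(vocab_size, embed_dim)
conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)

# Random word indexes standing in for one tokenized sentence
text = torch.randint(0, vocab_size, (batch_size, seq_len))

embedded = embedding(text)            # (batch, seq_len, embed_dim)
permuted = embedded.permute(0, 2, 1)  # (batch, embed_dim, seq_len)
conved = F.relu(conv(permuted))       # (batch, embed_dim, seq_len)
pooled = conved.mean(dim=2)           # (batch, embed_dim)

print(embedded.shape, permuted.shape, conved.shape, pooled.shape)
```

With a kernel size of three, a stride of one, and padding of one, the convolution keeps the sequence length unchanged, and the mean then removes that dimension entirely.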

### 8. Preparing data for the sentiment analysis model
To prepare our data, we create a vocabulary and use word to index mapping. While One-Hot or TF-IDF encoding methods are alternatives, they are less efficient as they do not capture contextual word relationships and result in high dimensional input vectors mostly filled with zeros. Hence, we opt for embeddings. We set vocab-size to the length of word to index and embed-dim to ten. We have two book review samples for demonstration. We then initialize our SentimentAnalysisCNN model with vocab_size for vocabulary size and embed_dim for word embedding dimension. For training, we use Cross-Entropy loss with Stochastic Gradient Descent as our optimizer, setting a learning rate of zero-point-one.

### 9. Training the model
During ten training epochs, we iterate over each sentence-label pair in the data, clearing previous gradients at the model level for clean computation. Words in sentences are mapped to indexes using word_to_idx and converted to a long tensor. We use unsqueeze zero to add an extra dimension to the start of the tensor, creating a batch containing a single sequence to fit the model's input expectations. The model then predicts sentiments, and we turn the label into a long tensor. We compute the loss between predictions and actual labels, calculate gradients via backpropagation, and adjust the model parameters using the optimizer.
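
The effect of unsqueeze zero can be seen in a short sketch. The vocabulary and sentence below are hypothetical stand-ins for the course data.

```python
import torch

# Hypothetical vocabulary and tokenized sentence
word_to_idx = {"i": 0, "love": 1, "this": 2, "book": 3}
sentence = ["i", "love", "this", "book"]

# Map words to indexes and convert to a long tensor of shape (4,)
indexes = torch.LongTensor([word_to_idx[w] for w in sentence])

# Add a leading batch dimension, giving shape (1, 4): one sequence of four words
batch = indexes.unsqueeze(0)

print(indexes.shape, batch.shape)
```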

### 10. Running the Sentiment Analysis Model
With our model and data ready, we can start making predictions. We iterate over book_samples, transform the words to tensors, and feed them to the model. The output provides sentiment scores for our classification labels. Using torch-dot-max, we identify the component with the highest score. One corresponds to positive sentiment, while zero corresponds to negative sentiment. We then print the review alongside its predicted sentiment.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextClassificationCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(TextClassificationCNN, self).__init__()
        # Initialize the embedding layer 
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(embed_dim, 2)
    def forward(self, text):
        embedded = self.embedding(text).permute(0, 2, 1)
        # Pass the embedded text through the convolutional layer and apply a ReLU
        conved = F.relu(self.conv(embedded))
        conved = conved.mean(dim=2) 
        return self.fc(conved)

In [None]:
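import torch.optim as optim

# Define the loss function and the optimizer
# (assumes `model`, `data`, and `word_to_ix` were created earlier; they are not defined in this cell)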
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for sentence, label in data:     
        # Clear the gradients
        model.zero_grad()
        sentence = torch.LongTensor([word_to_ix.get(w, 0) for w in sentence]).unsqueeze(0) 
        label = torch.LongTensor([int(label)])
        outputs = model(sentence)
        loss = criterion(outputs, label)
        loss.backward()
        # Update the parameters
        optimizer.step()
print('Training complete!')

In [None]:
book_reviews = [
    "I love this book".split(),
    "I do not like this book".split()
]
for review in book_reviews:
    # Convert the review words into tensor form
    # Unknown words fall back to index 0, matching the lookup used during training
    input_tensor = torch.tensor([word_to_ix.get(w, 0) for w in review], dtype=torch.long).unsqueeze(0)
    # Get the model's output
    outputs = model(input_tensor)
    # Find the index of the most likely sentiment category
    _, predicted_label = torch.max(outputs.data, 1)
    # Convert the predicted label into a sentiment string
    sentiment = "Positive" if predicted_label.item() else "Negative"
    print(f"Book Review: {' '.join(review)}")
    print(f"Sentiment: {sentiment}\n")

### RNNs for text
Recurrent Neural Networks, or RNNs, are great at handling sequences of varying lengths. They maintain an internal short-term memory, enabling them to learn patterns across time. Unlike CNNs that spot patterns in chunks of text, RNNs remember past words to understand the whole sentence's meaning. Today, we will explore how to employ RNNs for text classification.

### 3. RNNs for text classification
RNNs are suitable for text classification because they process sequential data like humans read, one word at a time, allowing them to capture the context and order of words. Consider the tweet, "I just love getting stuck in traffic"; a trained RNN can use this context to classify the tweet as sarcastic.

### 4. Recap: Implementing Dataset and DataLoader
Let's remind ourselves how to apply Dataset and DataLoader for text data in PyTorch. We create a custom class TextDataset, serving as our data container. The init method initializes the dataset with the input text data. The len method returns the total number of samples in the dataset, and the getitem method allows us to access a specific sample at a given index. This class, extending PyTorch's Dataset, allows us to organize and access our text data efficiently.

### 5. RNN implementation
Now let's take a look at an example of sentiment analysis for movie review from a tweet. We want to train an RNN model to classify movie reviews as either positive or negative. We can use our entire text processing pipeline here to feed to the model. This includes encoding or embedding. We preprocess the tweet and convert it to a tensor, which is not shown here for brevity. Then, we pass the preprocessed tensor through the model to make a sentiment prediction. In this case, the model predicts that the sentiment is "Positive."

### 6. RNN variation: LSTM
But what if the sentiment of a tweet is not so straightforward? Take the tweet, "Loved the cinematography, hated the dialogue. The acting was exceptional, but the plot fell flat". These complex sentences contain subtle nuances and conflicting sentiments. While RNNs may struggle to capture the negative sentiment, Long Short-Term Memory models, or LSTMs, excel at capturing such complexities. They can effectively understand the underlying emotions, making them a powerful tool for sentiment analysis.

### 7. LSTM
LSTMs have input, forget, and output gates that enable them to store and forget information as needed. This architecture is ideal for complex classification tasks. The code defines an LSTM model using nn-dot-LSTM, with an initialization function that sets the input size, hidden size, and batch-first parameter. The forward function processes the input through the LSTM layer using self-dot-lstm, and the rest is similar to RNN.

### 8. RNN variation: GRU
But what if we wanted to detect spam emails without needing the full context? Given an email subject like "Congratulations! You've won a free trip to Hawaii!", a Gated Recurrent Unit, or GRU, can quickly recognize spammy patterns. This makes GRUs suitable for tasks like spam detection, sentiment analysis, text summarization, and more.

### 9. GRU
GRUs are a streamlined version of LSTMs that trade some complexity for faster training. The code defines a GRU model using nn-dot-GRU, with an initialization function that specifies the input size, hidden size, and batch-first parameter. The forward function remains the same, with the change of self-dot-lstm becoming self-dot-gru.
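
One way to see the reduced complexity is to compare parameter counts. This sketch uses hypothetical layer sizes and a `count_parameters` helper introduced only for illustration. An LSTM layer holds four sets of weights (three gates plus the candidate cell state), while a GRU layer holds three.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 8, 16  # hypothetical sizes

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

def count_parameters(module):
    # Total number of values across all weights and biases
    return sum(p.numel() for p in module.parameters())

lstm_params = count_parameters(lstm)
gru_params = count_parameters(gru)

x = torch.randn(2, 5, input_size)  # (batch, seq_len, input_size)
lstm_out, (h_n, c_n) = lstm(x)     # LSTM returns a hidden state and a cell state
gru_out, gru_h_n = gru(x)          # GRU returns only a hidden state

print(lstm_params, gru_params)
print(lstm_out.shape, gru_out.shape)
```

For the same sizes the GRU has three quarters of the LSTM's parameters, and both layers produce outputs of the same shape, so one can be swapped for the other with few changes to the surrounding model.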

In [None]:
# Complete the RNN class
class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNNModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initial hidden state: (num_layers, batch, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, _ = self.rnn(x, h0)
        out = out[:, -1, :]  # keep only the last time step
        out = self.fc(out)
        return out

# Initialize the model
rnn_model = RNNModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn_model.parameters(), lr=0.01)

# Train the model for ten epochs and zero the gradients
for epoch in range(10): 
    optimizer.zero_grad()
    outputs = rnn_model(X_train_seq)
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

In [None]:
# Initialize the LSTM and the output layer with parameters
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initial hidden and cell states: (num_layers, batch, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, _ = self.lstm(x, (h0, c0))
        out = out[:, -1, :]  # keep only the last time step
        out = self.fc(out)
        return out

# Initialize model with required parameters
lstm_model = LSTMModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

# Train the model by passing the correct parameters and zeroing the gradient
for epoch in range(10): 
    optimizer.zero_grad()
    outputs = lstm_model(X_train_seq)
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

In [None]:
# Complete the GRU model
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(GRUModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initial hidden state: (num_layers, batch, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, _ = self.gru(x, h0)
        out = out[:, -1, :]  # keep only the last time step
        out = self.fc(out)
        return out

# Initialize the model
gru_model = GRUModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(gru_model.parameters(), lr=0.01)

# Train the model and backpropagate the loss after initialization
for epoch in range(15): 
    optimizer.zero_grad()
    outputs = gru_model(X_train_seq)
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

### 2. Why evaluation metrics matter
Picture this: Our model, designed to assess the sentiment of book reviews, suggests that a best-seller has mostly negative reviews. Should we accept its judgment? We can use evaluation metrics to answer this.

### 3. Evaluating RNN models
Before evaluating, we must generate predictions from the model. First, we pass the test dataset through the model to obtain an output score for each class. Next, we use the torch-dot-max function, which returns both the maximum values and their indexes along the specified dimension, here dimension one. We keep the indexes in the predicted variable, since each index is the predicted class, and use that variable for the evaluation metrics.
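As a small illustration of the prediction step, the sketch below applies torch.max to a hand-written tensor standing in for model outputs. The values are hypothetical; in practice the tensor would come from passing the test set through the trained model.

```python
import torch

# Hypothetical raw model outputs: three samples, three classes
outputs = torch.tensor([[2.0, 0.5, -1.0],
                        [0.1, 0.2, 3.0],
                        [0.3, 1.5, 0.2]])

# torch.max along dimension 1 returns (max values, indexes of the max values)
max_values, predicted = torch.max(outputs, 1)
print(predicted)  # tensor([0, 2, 1])
```

Only the indexes are needed for the metrics, which is why the first return value is often discarded with an underscore.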

### 4. Accuracy
The most straightforward metric is accuracy, the ratio of correct predictions to the total predictions. Using torchmetrics, the tensor actual holds our true labels and predicted holds the model's predictions. We want to determine if an instance belongs to class zero or class one, a binary classification. The accuracy class is initialized with a binary task and num_classes set to two for our two categories. The task can also be multiclass if there are more than two categories to classify. Passing the predictions and labels to the accuracy instance gives the model's accuracy score. A score of zero-point-66 indicates the model predicted just over 66 percent of the samples correctly. Scores range from zero to one, with higher scores representing greater accuracy. What counts as a good score depends on the complexity of the problem: for example, zero-point-75 may be reasonable for sentiment analysis but poor elsewhere. As we learn more about metrics, we'll see that accuracy alone doesn't capture everything.

### 5. Beyond accuracy
Imagine a dataset of 10,000 book reviews where 9,800 readers adore the book and 200 find faults. Let's assume our model predicts all instances as positive, making it 98% accurate! But look closer. Such a model can't identify a single negative review. Enter precision, which questions the model's confidence in labeling a review as negative. Recall checks how well the model spots actual negative reviews. The F1 Score harmonizes these two, ensuring neither is neglected. If we were to trust accuracy alone, we'd miss significant feedback. Let's explore each in more detail.
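The arithmetic behind this example can be checked directly. The sketch below builds the 9,800 positive and 200 negative labels described above, assumes a model that always predicts positive, and computes accuracy and the recall for the negative class by hand in plain PyTorch rather than through torchmetrics.

```python
import torch

# 1 = positive review, 0 = negative review
actual = torch.cat([torch.ones(9800, dtype=torch.long),
                    torch.zeros(200, dtype=torch.long)])
predicted = torch.ones(10000, dtype=torch.long)  # model always says "positive"

# Accuracy: share of predictions that match the labels
accuracy = (predicted == actual).float().mean().item()

# Recall for the negative class: share of actual negatives that were found
is_negative = actual == 0
recall_negative = (predicted[is_negative] == 0).float().mean().item()

print(f"Accuracy: {accuracy:.2f}")                # 0.98
print(f"Negative recall: {recall_negative:.2f}")  # 0.00
```

The model looks strong on accuracy while finding none of the 200 negative reviews, which is exactly the gap that precision and recall expose.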

### 6. Precision and Recall
Precision is the ratio of correctly predicted positive observations to the total predicted positives. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. To calculate these, we import the Precision and Recall classes from torchmetrics, use the same parameters as before, and print the results.

### 7. Precision and Recall
A precision of zero-point-six-six suggests that out of all positive predictions, just over 66 percent were accurate. Meanwhile, a recall of zero-point-five signifies the model captured 50 percent of all genuine positives. Like accuracy, the scores range from zero to one. The complexity of the problem needs to be considered when defining a score as good or bad.

### 8. F1 score
The F1 Score is the harmonic mean of precision and recall and is especially useful when dealing with imbalanced classes. To calculate it, we import the F1 Score class from torchmetrics and instantiate it with the same parameters. An F1 Score of one indicates perfect precision and recall, while a score of zero indicates the worst possible performance. Here, an F1 Score of zero-point-57 suggests a reasonably balanced trade-off between precision and recall, but what counts as acceptable will depend on the task.
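To see how the three scores relate, the sketch below computes precision, recall, and F1 by hand from confusion counts. The nine labels and predictions are made up, chosen only so that the results match the scores quoted in this section; they are not the data from the slides, and the calculation uses plain PyTorch instead of torchmetrics.

```python
import torch

# Made-up binary labels and predictions (1 = positive, 0 = negative)
actual = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0, 0])
predicted = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0, 0])

tp = ((predicted == 1) & (actual == 1)).sum().item()  # true positives
fp = ((predicted == 1) & (actual == 0)).sum().item()  # false positives
fn = ((predicted == 0) & (actual == 1)).sum().item()  # false negatives

accuracy = (predicted == actual).sum().item() / len(actual)
precision = tp / (tp + fp)   # correct positives / predicted positives
recall = tp / (tp + fn)      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, "
      f"Recall: {recall:.3f}, F1: {f1:.3f}")
# Accuracy: 0.667, Precision: 0.667, Recall: 0.500, F1: 0.571
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, so a weak recall pulls it down even when precision is higher.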

### 9. Considerations
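In some instances, such as with multi-class classification, we may find that all scores are identical. This is usually a consequence of micro averaging, which the torchmetrics task wrappers apply by default as far as we are aware: when every sample has exactly one label, each false positive for one class is a false negative for another, so micro-averaged precision, recall, and F1 all equal accuracy. Identical scores therefore say little about how the model performs on each class; setting the average argument to macro gives a per-class view. And remember to always consider the problem when interpreting results!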

In [None]:
import torch
from torchmetrics import Accuracy, Precision, Recall, F1Score

# Create an instance of the metrics
accuracy = Accuracy(task="multiclass", num_classes=num_classes)
precision = Precision(task="multiclass", num_classes=num_classes)
recall = Recall(task="multiclass", num_classes=num_classes)
f1 = F1Score(task="multiclass", num_classes=num_classes)

# Generate the predictions (gradients are not needed for evaluation)
rnn_model.eval()
with torch.no_grad():
    outputs = rnn_model(X_test_seq)
_, predicted = torch.max(outputs, 1)

# Calculate the metrics
accuracy_score = accuracy(predicted, y_test_seq)
precision_score = precision(predicted, y_test_seq)
recall_score = recall(predicted, y_test_seq)
f1_score = f1(predicted, y_test_seq)
print("RNN Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_score, precision_score, recall_score, f1_score))

In [None]:
# Create an instance of the metrics
accuracy = Accuracy(task="multiclass", num_classes=3)
precision = Precision(task="multiclass", num_classes=3)
recall = Recall(task="multiclass", num_classes=3)
f1 = F1Score(task="multiclass", num_classes=3)

# Calculate metrics for the LSTM model
accuracy_1 = accuracy(y_pred_lstm, y_test)
precision_1 = precision(y_pred_lstm, y_test)
recall_1 = recall(y_pred_lstm, y_test)
f1_1 = f1(y_pred_lstm, y_test)
print("LSTM Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_1, precision_1, recall_1, f1_1))

# Calculate metrics for the GRU model
accuracy_2 = accuracy(y_pred_gru, y_test)
precision_2 = precision(y_pred_gru, y_test)
recall_2 = recall(y_pred_gru, y_test)
f1_2 = f1(y_pred_gru, y_test)
print("GRU Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_2, precision_2, recall_2, f1_2))

### 2. Text generation and NLP
Text generation is a core task in Natural Language Processing, with applications like chatbots, translation, and technical writing. The ability of RNNs, LSTMs, and GRUs to remember past information makes them a key technology for processing sequential text data. For example, given the input "The cat is on the m", an RNN would complete the statement with "at", for "mat".

### 3. Building an RNN for text generation
To construct an RNN text generation model, we import the essential libraries: torch and nn. We define a data variable holding "Hello how are you?", extract its unique characters, and build two mappings, character to index and index to character, so that we can convert text to numerical form and convert predictions back to text. Character-level encoding is favored over word-level encoding here for reduced dimensionality and computational ease, especially on smaller datasets. The init method takes input size, hidden size, and output size. Hidden size is stored as an instance variable. An RNN layer with specified dimensions is defined, followed by a fully connected layer. Our goal will be to train a model to generate "Hello how are you?". When training to predict subsequent characters, inputting 'h' should suggest 'e' as a likely next character if "he" is a frequent bigram.
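The two character mappings described above can be sketched as follows. The names chars, char_to_ix, and ix_to_char match those used in the notebook cells; sorting the unique characters is an assumption made here so the indexes are reproducible between runs.

```python
data = "Hello how are you?"

# Unique characters; sorted (an assumption) so indexes are stable across runs
chars = sorted(set(data))

# Character -> index and index -> character lookups
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for ch, i in char_to_ix.items()}

# Round trip: text -> indexes -> text
encoded = [char_to_ix[ch] for ch in data]
decoded = "".join(ix_to_char[i] for i in encoded)

print(len(chars))  # 12 unique characters
print(decoded)     # Hello how are you?
```

Note that 'H' and 'h' are distinct characters and receive different indexes unless the text is lowercased first.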

### 4. Forward propagation and model creation
In the forward method, we initialize a tensor of zeros for the initial hidden state, providing a neutral starting point for the RNN to learn from the data. We then pass the input data and the initial hidden state into the RNN layer to generate an output sequence and a new hidden state. We extract the last time step's output from the RNN and process it through a fully connected layer, which converts it into one score per possible next character. We return this output as the result of the forward method. We instantiate our RNNmodel with an input size equal to the number of unique characters, since each input character is one-hot encoded, a hidden size of 16, and an output size that is also the number of unique characters. Predicting the next character is therefore a classification problem over the character vocabulary, so our loss function is CrossEntropyLoss. We use the Adam optimizer, specifying the learning rate as 0-point-01.

### 5. Preparing input and target data
The inputs and targets lists are created by mapping each character in the data string to its corresponding index, excluding the last character for inputs and the first character for targets. The index lists are then converted into long tensors. Additionally, we reshape inputs to have an additional dimension and match the expected input shape for the model. The inputs tensor is one-hot encoded, turning each index into a binary vector, where all elements are zero except for the one at the position of the index. However, the targets tensor remains as character indices to align with CrossEntropyLoss, which requires class indices as targets.
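A minimal sketch of this preparation step is shown below. It rebuilds the character mapping so the block runs on its own; sorting the characters and the exact variable names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

data = "Hello how are you?"
chars = sorted(set(data))  # sorted for reproducibility (an assumption)
char_to_ix = {ch: i for i, ch in enumerate(chars)}

# Inputs drop the last character, targets drop the first
inputs = torch.tensor([char_to_ix[ch] for ch in data[:-1]], dtype=torch.long)
targets = torch.tensor([char_to_ix[ch] for ch in data[1:]], dtype=torch.long)

# Add a sequence dimension, then one-hot encode:
# (17,) -> (17, 1) -> (17, 1, 12)
inputs = inputs.view(-1, 1)
inputs = F.one_hot(inputs, num_classes=len(chars)).float()

print(inputs.shape)   # torch.Size([17, 1, 12])
print(targets.shape)  # torch.Size([17])
```

Each of the 17 rows is treated as a batch item with a sequence length of one, and targets stay as class indexes because CrossEntropyLoss expects them in that form.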

### 6. Training the RNN model
We initiate our training loop for 100 epochs, switch our model to training mode, and feed the inputs to the model and get the outputs. We calculate the loss by comparing the model's outputs to the actual targets. As PyTorch accumulates gradients, we clear the existing gradients in the optimizer. We perform backpropagation to compute loss gradients for the model parameters, and update them. Finally, we print the epoch number and the current loss every ten epochs. The output is shown on the next slide.

### 7. Testing the model
Let's test our trained RNN model. We switch the model to evaluation mode. We will prepare the character 'h' for prediction. 'h' is converted to its index using the character-to-index mapping. The index is wrapped in a tensor and reshaped to the layout the model expects, then one-hot encoded with the nn-dot-functional-dot-one_hot function, with num_classes set to the number of unique characters, and converted to a float tensor. We feed test_input into the model to get the predicted_output. Using torch-dot-argmax on this output, we find the index of the maximum value along axis one, representing the most probable next character. We then print the model's prediction for this input. The decreasing loss over 100 epochs suggests our model learned and improved. When we input h into our trained model, it predicted e, a plausible next character. A single prediction is weak evidence on its own, but together with the falling loss it suggests the model has picked up the patterns in the training text.

In [None]:
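import torch
import torch.nn as nn

# chars, char_to_ix, ix_to_char, inputs and targets are assumed to be
# defined earlier in the notebook.

# Include an RNN layer and linear layer in RNNmodel class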
class RNNmodel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNNmodel, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

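    def forward(self, x):
        # Initial hidden state: (num_layers=1, batch, hidden_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size)
        out, _ = self.rnn(x, h0)
        out = self.fc(out[:, -1, :])  # score each character from the last time step
        return out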

# Instantiate the RNN model
model = RNNmodel(len(chars), 16, len(chars))

In [None]:
# Instantiate the loss function
criterion = nn.CrossEntropyLoss()
# Instantiate the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train the model
for epoch in range(100):
    model.train()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch+1) % 10 == 0:
        print(f'Epoch {epoch+1}/100, Loss: {loss.item()}')

# Test the model
model.eval()
test_input = char_to_ix['r']
test_input = nn.functional.one_hot(torch.tensor(test_input).view(-1, 1), num_classes=len(chars)).float()
predicted_output = model(test_input)
predicted_char_ix = torch.argmax(predicted_output, 1).item()
print(f"Test Input: 'r', Predicted Output: '{ix_to_char[predicted_char_ix]}'")

### 1. Generative adversarial networks for text generation
Generative Adversarial Networks, or GANs, are often used for image generation.

### 2. GANs and their role in text generation
They are also becoming more common in text generation, where they are used to create synthetic data that preserves the statistical properties of the original. Unlike RNNs, which predict one element after another, GANs aim to replicate the overall distribution of the data, including correlations between features, so that generated samples resemble real-world patterns.

### 3. Structure of a GAN
A GAN consists of two primary components: the Generator, which creates synthetic text data from noise, and the Discriminator, which distinguishes between real and generated text data. Here, noise refers to randomly sampled input values that the Generator learns to transform into realistic-looking data. The two components are trained against each other, with the Generator improving its fakes and the Discriminator enhancing its ability to detect them until the generated text becomes hard to distinguish from real text.

### 4. Building a GAN model in PyTorch: Generator
We begin building a GAN model by defining the Generator. Our data is product reviews that have been embedded and converted to tensors, not shown here for brevity. The goal is for our model to create believable reviews. We define our Generator network with nn-dot-Module. It has a linear layer inside the Sequential function that transforms the input to have the same dimension as our data sequences. It is followed by a sigmoid activation function suitable for binary data that squashes the output values to the range zero to one. The forward method then applies this network to an input tensor.
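A minimal sketch of such a Generator follows. The value of seq_length and the choice to feed noise of the same width as the data are assumptions for illustration, since the embedded review data is not shown.

```python
import torch
import torch.nn as nn

seq_length = 5  # hypothetical width of one embedded review


class Generator(nn.Module):
    def __init__(self, seq_length):
        super().__init__()
        # One linear layer mapping noise to the data dimension,
        # followed by a sigmoid that squashes values into [0, 1]
        self.model = nn.Sequential(
            nn.Linear(seq_length, seq_length),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.model(x)


generator = Generator(seq_length)
noise = torch.rand(3, seq_length)  # three random input vectors
fake_data = generator(noise)
print(fake_data.shape)  # torch.Size([3, 5])
```

Before training, the output is just a smooth function of the noise; it only starts to resemble the real data once the Generator is trained against a Discriminator.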

### 5. Building the discriminator network
We define a Discriminator network similarly. This network has a linear layer that transforms the input to a single value, followed by a sigmoid activation function. The output represents the probability that the input data is real. The forward method applies this network to an input tensor.
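The Discriminator can be sketched in the same style. As before, seq_length is a hypothetical value, and the random batch stands in for real or generated samples.

```python
import torch
import torch.nn as nn

seq_length = 5  # hypothetical width of one embedded review


class Discriminator(nn.Module):
    def __init__(self, seq_length):
        super().__init__()
        # One linear layer reducing a sample to a single value,
        # followed by a sigmoid so the output reads as P(sample is real)
        self.model = nn.Sequential(
            nn.Linear(seq_length, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.model(x)


discriminator = Discriminator(seq_length)
samples = torch.rand(3, seq_length)  # stand-in for real or generated data
prob_real = discriminator(samples)
print(prob_real.shape)  # torch.Size([3, 1])
```

Because the output is a single probability per sample, it pairs naturally with a binary cross-entropy loss.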

### 6. Initializing networks and loss function
We initialize our Generator and Discriminator network instances and define the loss function as Binary Cross Entropy, which suits binary classification tasks like distinguishing between real and fake data. Next, we create two Adam optimizers, one for the Generator and one for the Discriminator. Each optimizer has a learning rate of 0-point-001, a value that is often used as a starting point and may be adjusted based on model performance.

### 7. Training the discriminator
We establish a training loop for 50 epochs, generating batches of real data and random noise for the Generator to create fake data. We obtain predictions from the Discriminator for real and fake data, calling detach on the fake data so that this step does not propagate gradients back into the Generator. The Discriminator's loss is calculated using torch-dot-ones_like and torch-dot-zeros_like to build the expected real and fake labels. We reset the gradients in the Discriminator's optimizer with zero_grad, perform backpropagation to calculate gradients, and update the Discriminator's parameters.

### 8. Training the generator
Next, we train the Generator. We calculate the Generator's loss based on how well it fooled the Discriminator: the loss is the binary cross-entropy between the Discriminator's predictions on fake data and a tensor of ones. We then reset the gradients in the Generator's optimizer, perform backpropagation to calculate gradients, and update the Generator's parameters. We print Generator and Discriminator losses every ten epochs to monitor training progress.
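Putting the two training steps together, a compact sketch of the loop might look like this. The random binary tensor is a stand-in for the embedded reviews, the networks are written inline with nn.Sequential for brevity, and one full batch per epoch is an assumption; the original slides may batch the data differently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_length, num_samples, num_epochs = 5, 32, 50  # hypothetical sizes

# Stand-in for the real embedded reviews (random binary data)
real_data = torch.randint(0, 2, (num_samples, seq_length)).float()

generator = nn.Sequential(nn.Linear(seq_length, seq_length), nn.Sigmoid())
discriminator = nn.Sequential(nn.Linear(seq_length, 1), nn.Sigmoid())

criterion = nn.BCELoss()
optimizer_gen = torch.optim.Adam(generator.parameters(), lr=0.001)
optimizer_disc = torch.optim.Adam(discriminator.parameters(), lr=0.001)

for epoch in range(num_epochs):
    noise = torch.rand(num_samples, seq_length)
    fake_data = generator(noise)

    # Discriminator step: real -> 1, fake -> 0.
    # detach() keeps this update from reaching the Generator.
    disc_real = discriminator(real_data)
    disc_fake = discriminator(fake_data.detach())
    loss_disc = (criterion(disc_real, torch.ones_like(disc_real))
                 + criterion(disc_fake, torch.zeros_like(disc_fake)))
    optimizer_disc.zero_grad()
    loss_disc.backward()
    optimizer_disc.step()

    # Generator step: try to make the Discriminator output 1 for fakes
    disc_fake = discriminator(fake_data)
    loss_gen = criterion(disc_fake, torch.ones_like(disc_fake))
    optimizer_gen.zero_grad()
    loss_gen.backward()
    optimizer_gen.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}/{num_epochs}: "
              f"Generator loss: {loss_gen.item():.4f}, "
              f"Discriminator loss: {loss_disc.item():.4f}")
```

The Generator step also leaves gradients on the Discriminator's parameters, but these are cleared by the zero_grad call at the start of the next Discriminator step, so they never affect an update.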

### 9. Printing real and generated data
After the training is complete, we print some real data. Then, we sample random values to form inputs for the Generator, generating data points mirroring the real data distribution.

### 10. GANs: generated synthetic data
The displayed output reveals Generator and Discriminator losses for every 10th epoch, demonstrating a consistent decline. However, after 50 epochs, the losses remain high, indicating the need for further training.

### 11. Generated data
Here's what our model generated. Since the input data was in tensor form, the output is also in tensor format. Upon reviewing the matrix, the real and generated data are similar. In practice, we would assess this further by plotting a correlation matrix and checking if the correlation between columns is maintained.
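The correlation check mentioned above could be sketched with torch.corrcoef, which treats rows as variables, hence the transpose. Both tensors here are random stand-ins rather than the course data or a trained Generator's output, so the gap reported is only illustrative, and the plot itself is left out.

```python
import torch

torch.manual_seed(0)

# Stand-ins: rows are samples, columns are features
real_data = torch.randint(0, 2, (100, 5)).float()
generated_data = torch.rand(100, 5)  # would come from the trained Generator

# corrcoef expects variables in rows, so transpose to (features, samples)
real_corr = torch.corrcoef(real_data.T)
generated_corr = torch.corrcoef(generated_data.T)

# Largest absolute difference between the two correlation matrices
max_gap = (real_corr - generated_corr).abs().max().item()
print(real_corr.shape)  # torch.Size([5, 5])
print(f"Largest correlation gap: {max_gap:.3f}")
```

A small gap would suggest that the relationships between columns are preserved in the generated data; a large one points to structure the Generator has not yet captured.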