# Homework 3 Part 2

## Course Name: Deep Learning
#### Lecturers: Dr. Beigy

---

#### Notebooks Supervised By: Zeinab Sadat Taghavi
#### Notebooks Prepared By: Zahra Khoramnejad, Mehran Sarmadi, Zahra Rahimi

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---


# Text Generation

<p align='justify'>The text generation task involves generating new text from a given input or prompt. It is a natural language processing (NLP) task that aims to produce coherent and contextually relevant text.

In text generation, a model is trained on a large corpus of text data and learns the patterns and structures of the language. This model can then be used to generate new text by sampling from the learned distribution of words or characters.

Text generation has various applications, including chatbots, language translation, poetry generation, and content creation. It can be implemented using different techniques such as `recurrent neural networks (RNNs)`, `transformers`, and `Markov chains`.

The goal of text generation is to produce text that is fluent, coherent, and contextually relevant. It requires a deep understanding of the language and the ability to generate text that follows grammatical rules and maintains semantic coherence.</p>

## Character-level text generation

One stage of text generation is mapping, which can be done at the word or character level: each word or character is assigned a number.

In this exercise, we generate text at the character level. Word-level generation leads to more meaningful outputs, but it requires a rich dataset in which each word is repeated many times.

We will implement models based on `recurrent networks` for text generation and compare their performance. We will then evaluate the best models on different datasets and compare the results.

The steps of this exercise are as follows:
1. Train RNN and LSTM
2. Fine-tuning
3. Experiment on different datasets

---
---

# 1. Train RNN and LSTM

## Imports

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

import numpy as np
import pandas as pd
import random
import re
import string

import matplotlib.pyplot as plt


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Load data

- We use `Shakespeare's plays` as the main dataset for this exercise.

In [None]:
!wget "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" -c -P {'data/'}

- Load the first 30 KB of the data (30,000 characters) for training the models

In [None]:
sh_data_file = "./data/input.txt"
# Read only the first 30,000 characters; the context manager closes the file
with open(sh_data_file, 'r') as f:
    sh_data = f.read(30000)

## Character mapping

- For better performance of the model, we limit the set of allowed characters

In [None]:
chars = list(string.ascii_lowercase + '\n' + ' ' + ':' + '.')
vocab_size = len(chars)

In [None]:
# Mapping of char-index
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
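
The mapping above can be checked with a small round trip. The sketch below rebuilds the same vocabulary and dictionaries so that it runs on its own; the helper names `encode` and `decode` are illustrative assumptions and are not part of the notebook.

```python
import string

# Same vocabulary as in the notebook: 26 lowercase letters plus 4 symbols
chars = list(string.ascii_lowercase + '\n' + ' ' + ':' + '.')
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(text):
    # Hypothetical helper: map each character to its integer index
    return [char_to_ix[ch] for ch in text]

def decode(indices):
    # Hypothetical helper: map integer indices back to characters
    return ''.join(ix_to_char[i] for i in indices)

encoded = encode("to be.")
print(len(chars))        # vocabulary size
print(encoded)           # one integer per character
print(decode(encoded))   # round trip back to text
```

Any character outside the vocabulary raises a `KeyError` in `encode`, which is why the preprocessing step filters the text before the mapping is applied.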

## Preprocessing

In [None]:
def remove_extraneous_characters(data, valid_char_list):
    pattern = f"[^{re.escape(''.join(valid_char_list))}]"
    return re.sub(pattern, '', data)
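
To see what this filter does, here is a small usage sketch. It redefines the function and the vocabulary so that the block is self-contained; the sample string is only an illustration.

```python
import re
import string

chars = list(string.ascii_lowercase + '\n' + ' ' + ':' + '.')

def remove_extraneous_characters(data, valid_char_list):
    # Build a negated character class and delete everything outside it
    pattern = f"[^{re.escape(''.join(valid_char_list))}]"
    return re.sub(pattern, '', data)

raw = "First Citizen:\nBefore we proceed, hear me!"
# Lowercase first, because the vocabulary has no uppercase letters
cleaned = remove_extraneous_characters(raw.lower(), chars)
print(repr(cleaned))
```

Punctuation such as commas and exclamation marks is dropped, and uppercase letters would be dropped as well if the text were not lowercased first.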

In [None]:
sh_data = remove_extraneous_characters(sh_data.lower(), chars)
sh_data_size = len(sh_data)

# Extract indexes of data characters
sh_data = list(sh_data)
for i, ch in enumerate(sh_data):
    sh_data[i] = char_to_ix[ch]

sh_data = torch.tensor(sh_data).to(device)
sh_data = torch.unsqueeze(sh_data, dim=1)

## Modeling

- In this part, define the RNN and LSTM models according to the given characteristics and function inputs.
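
For orientation, one step of a vanilla (Elman) RNN computes `h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)`, which is the recurrence `torch.nn.RNN` applies with its default tanh nonlinearity (PyTorch keeps two bias vectors; they are merged into a single `b` here). The sketch below spells this out in plain Python for a one-hot input. The function name `rnn_step`, the tiny sizes, and the hand-picked weights are illustrative assumptions; the exercise itself should use `torch.nn`.

```python
import math

def rnn_step(x, h_prev, W_ih, W_hh, b):
    # One vanilla RNN step: h_t = tanh(W_ih x + W_hh h_prev + b)
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(W_ih[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

# Toy sizes (assumed for illustration): vocabulary of 3, hidden size of 2
W_ih = [[0.5, -0.5, 0.0],
        [0.0, 1.0, -1.0]]
W_hh = [[0.1, 0.0],
        [0.0, 0.1]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for char_index in [0, 1, 2]:
    # One-hot encode the character index
    x = [1.0 if j == char_index else 0.0 for j in range(3)]
    h = rnn_step(x, h, W_ih, W_hh, b)
    print(char_index, h)
```

The hidden state `h` is carried from one character to the next, which is how the network keeps a memory of the preceding text.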


### RNN

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size=512, num_layers=3, dropout_enable=False):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout_enable = dropout_enable
        self.dropout = nn.Dropout(0.5)
        self.hidden_state = None
        ####################################
        ######### Your code begins #########
        ####################################
        # Define self.rnn with model inputs
        # Define self.decoder for decoding output character from last hidden state
        # You can use torch.nn library

        self.rnn = None
        self.decoder = None

        ####################################
        ######### Your code ends ###########
        ####################################


    def forward(self, input_seq):
        ####################################
        ######### Your code begins #########
        ####################################
        # Implement forward part of model and save last hidden state on self.hidden_state


        ####################################
        ######### Your code ends ###########
        ####################################

        return output

    def save_model(self, path):
        torch.save(self.state_dict(), path)

    def load_model(self, path):
        self.load_state_dict(torch.load(path))

### LSTM

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_size, output_size, hidden_size=512, num_layers=3, dropout_enable=False):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout_enable = dropout_enable
        self.dropout = nn.Dropout(0.5)
        self.hidden_state = None

        ####################################
        ######### Your code begins #########
        ####################################
        # Define self.lstm with model inputs
        # Define self.decoder for decoding output character from last hidden state
        # You can use torch.nn library

        self.lstm = None
        self.decoder = None

        ####################################
        ######### Your code ends ###########
        ####################################

    def forward(self, input_seq):
        ####################################
        ######### Your code begins #########
        ####################################
        # Implement forward part of model and save last hidden state on self.hidden_state


        ####################################
        ######### Your code ends ###########
        ####################################

        return output

    def save_model(self, path):
        torch.save(self.state_dict(), path)

    def load_model(self, path):
        self.load_state_dict(torch.load(path))

## Training

In [None]:
def print_sample_output(model, data, data_size, test_output_len = 200):
    # Use this function to print a sample that the model generates from its current hidden state and a random input character
    # test_output_len is the total number of characters in the generated test sequence

    test_output = ""
    data_ptr = 0

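    rand_index = np.random.randint(data_size-1)
    # Clone the slice: it is a view of data, so the in-place update below would otherwise overwrite the dataset
    input_seq = data[rand_index : rand_index+1].clone()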

    while True:
        output = model(input_seq)

        output = F.softmax(torch.squeeze(output), dim=0)
        dist = Categorical(output)
        index = dist.sample().item()

        test_output += ix_to_char[index]

        input_seq[0][0] = index
        data_ptr += 1

        if data_ptr >= test_output_len:
            break

    print("Train Sample +++++++++++++++++++++++++++++++++++++++++++++")
    print(test_output)
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
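
The sampling step inside the loop, a softmax followed by a draw from a categorical distribution, can be illustrated without a trained model. This is a minimal plain-Python sketch: the logits are made up for illustration, and `random.choices` stands in for `torch.distributions.Categorical`.

```python
import math
import random

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    # Draw one index with probability proportional to softmax(logits)
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
logits = [2.0, 0.5, -1.0]     # made-up scores for a 3-character vocabulary
probs = softmax(logits)
samples = [sample_index(logits, rng) for _ in range(1000)]
print(probs)
print([samples.count(i) for i in range(3)])
```

Sampling, rather than always taking the most likely character, tends to keep the generated text from getting stuck in a repeating loop.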

- Each sample in the dataset consists of an input sequence and a target sequence, where the target is the input shifted by one character. For example, when the sequence length is 10 and the text is `Hello world`, the input sequence is `Hello worl` and the target sequence is `ello world`.
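
The shift described above can be sketched as follows. The helper `make_pairs` is hypothetical, and splitting the text into non-overlapping chunks is an assumption of this sketch rather than a requirement of the exercise.

```python
def make_pairs(sequence, seq_len):
    # Hypothetical helper: split a sequence into (input, target) pairs where
    # the target is the input shifted one position to the right
    pairs = []
    for start in range(0, len(sequence) - seq_len, seq_len):
        inputs = sequence[start : start + seq_len]
        targets = sequence[start + 1 : start + seq_len + 1]
        pairs.append((inputs, targets))
    return pairs

text = "Hello world"
pairs = make_pairs(text, seq_len=10)
print(pairs)
```

The same slicing works on a list or tensor of character indices, since only positions are used.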

In [None]:
def train_epoch(model, data, data_size, epoch, optimizer, seq_len=200):
    # seq_len is the length of each training sequence
    model.train()
    criterion = nn.CrossEntropyLoss()
    total_loss = 0
    sample_number = 0

    ####################################
    ######### Your code begins #########
    ####################################
    # Define training process for one epoch of input model
    # At the end of every ten epochs, print current loss and a sample output of the model using print_sample_output function
    # Feed all data sample to model by iterating over input data

    ####################################
    ######### Your code ends ###########
    ####################################

    return total_loss / sample_number
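
`nn.CrossEntropyLoss` takes raw scores (logits) and target indices and, with its default settings, returns the mean of `-log(softmax(logits)[target])` over the samples. The plain-Python sketch below reproduces that computation for illustration; the function name and the inputs are assumptions. One useful reference point: a model that spreads probability uniformly over the 30 characters of this vocabulary has a loss of ln(30), roughly 3.40, so a training loss near that value suggests the model has not learned anything yet.

```python
import math

def cross_entropy(logits_batch, targets):
    # Mean over samples of -log(softmax(logits)[target])
    total = 0.0
    for logits, target in zip(logits_batch, targets):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[target]
    return total / len(targets)

vocab_size = 30
# Uniform scores: every character is considered equally likely
uniform_logits = [[0.0] * vocab_size for _ in range(4)]
baseline = cross_entropy(uniform_logits, [0, 5, 12, 29])
print(baseline)

# A confident, correct prediction gives a loss close to zero
confident = [[10.0 if i == 3 else 0.0 for i in range(vocab_size)]]
print(cross_entropy(confident, [3]))
```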

In [None]:
def train_rnn(data, data_size, model_save_file):
    # RNN parameters
    hidden_size = 512
    num_layers = 6
    lr = 0.002
    epoch_num = 100
    losses = []

    ####################################
    ######### Your code begins #########
    ####################################
    # Define the training process for the specified number of epochs for the RNN model
    # Use the train_epoch function to train the model for one epoch
    # Use Adam as the optimizer
    # Save the best model to model_save_file for later use


    ####################################
    ######### Your code ends ###########
    ####################################

    return losses

In [None]:
def train_lstm(data, data_size, model_save_file):
    # LSTM parameters
    hidden_size = 512
    num_layers = 3
    lr = 0.002
    epoch_num = 100
    losses = []

    ####################################
    ######### Your code begins #########
    ####################################
    # Define the training process for the specified number of epochs for the LSTM model
    # Use the train_epoch function to train the model for one epoch
    # Use Adam as the optimizer
    # Save the best model to model_save_file for later use


    ####################################
    ######### Your code ends ###########
    ####################################

    return losses

### RNN

In [None]:
rnn_sh_losses = train_rnn(sh_data, sh_data_size, './model_sh_rnn.pth')

### LSTM

In [None]:
lstm_sh_losses = train_lstm(sh_data, sh_data_size, './model_sh_lstm.pth')

## Generating texts

- A sample text to feed to the model

In [None]:
input_sample_text = 'First Citizen:\nYou are all resolved rather to die than to famish?\n'

def create_input_sample_dataset(input_sample_text):
    input_sample = remove_extraneous_characters(input_sample_text.lower(), chars)
    input_sample = list(input_sample)
    for i, ch in enumerate(input_sample):
        input_sample[i] = char_to_ix[ch]

    input_sample = torch.tensor(input_sample).to(device)
    input_sample = torch.unsqueeze(input_sample, dim=1)
    return input_sample

- This function prints the text generated by the model for the input sample. If no input sample is given, it picks a random 10-character sequence from the original data and feeds that to the model instead.

In [None]:
def generate_text(model, data, data_size, input_sample_test = None, output_len=1000):
    model.eval()
    data_ptr = 0
    test_output=""

    if input_sample_test is not None:
        index = 0
        seq_len = len(input_sample_test)
        input_seq = input_sample_test[index : index + seq_len-1]
    else:
        # If no input sample is given, randomly select a 10-character initial string from the data
        index = np.random.randint(data_size - 11)
        seq_len = 10
        input_seq = data[index : index + 9]

    # Set the last hidden state of the model by feeding it the input sequence
    output = model(input_seq)

    # Feed the last character to the model; clone the slice so that the
    # in-place update in the loop below does not overwrite the source tensor
    if input_sample_test is not None:
        input_seq = input_sample_test[index + seq_len-1 : index + seq_len].clone()
    else:
        input_seq = data[index + seq_len-1 : index + seq_len].clone()

    while True:
        output = model(input_seq)

        output = F.softmax(torch.squeeze(output), dim=0)
        dist = Categorical(output)
        index = dist.sample().item()

        test_output += ix_to_char[index]
        input_seq[0][0] = index
        data_ptr += 1

        if data_ptr >= output_len:
            break

    print("Example of generated text --------------------------------------------------------------------------")
    print(test_output)
    print("----------------------------------------------------------------------------------------------------")

### RNN

In [None]:
best_model_rnn =  RNN(vocab_size, vocab_size, 512, 6).to(device)
best_model_rnn.load_model('./model_sh_rnn.pth')
print("best loss", min(rnn_sh_losses))
generate_text(best_model_rnn, sh_data, sh_data_size)

### LSTM

In [None]:
best_model_lstm =  LSTM(vocab_size, vocab_size, 512, 3).to(device)
best_model_lstm.load_model('./model_sh_lstm.pth')
print("best loss", min(lstm_sh_losses))
generate_text(best_model_lstm, sh_data, sh_data_size)

## Plotting the losses

In [None]:
def plot_losses(losses):
    xpoints = np.array(range(len(losses)))
    ypoints = np.array(losses)

    plt.plot(xpoints, ypoints, color='blue',label='losses')
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()

### RNN

In [None]:
plot_losses(rnn_sh_losses)

### LSTM

In [None]:
plot_losses(lstm_sh_losses)

## Report

Based on the texts generated by the different models and their losses during training, analyze the reasons for the differences in their results.

Which model works better and what do you think are the reasons?

<font color='#73FF73'><b>Your answer : </b></font>

---
---

# 2. Fine-tuning

Fine-tuning is a technique used in neural network training where a pre-trained model is further trained on a new task or dataset. It allows us to leverage the knowledge and representations learned by a pre-trained model and adapt them to a specific task or domain.

In this exercise, we first train the models on a `Wikipedia` dataset of English text, then fine-tune these pre-trained models on the Shakespeare plays dataset to examine the effect of this method on the different models.

## Load Wikipedia dataset

In [None]:
!wget https://s3.amazonaws.com/fast-ai-nlp/wikitext-2.tgz

In [None]:
!tar -xvzf './wikitext-2.tgz' -C './data'

In [None]:
!cat './data/wikitext-2/train.csv' | tr -d '\n' > ./data/wikitext.txt

## Preprocessing

In [None]:
def clean_wiki_data(data):
    repl=''
    # Raw strings avoid invalid-escape warnings for the regex patterns
    data=re.sub(r'\(', repl, data)
    data=re.sub(r'\)', repl, data)
    for pattern in set(re.findall("=.*=",data)):
        data=re.sub(pattern, repl, data)
    for pattern in set(re.findall("<unk>",data)):
        data=re.sub(pattern,repl,data)
    for pattern in set(re.findall(r"[^\w ]", data)):
        repl=''
        if pattern=='-':
            repl=' '
        if pattern!='.' and pattern!="\'":
            data=re.sub("\\"+pattern, repl, data)

    return data

def load_data(filepath):
    # The context manager closes the file after reading
    with open(filepath) as f:
        return f.read()

In [None]:
wikidata=load_data("./data/wikitext.txt")
data=wikidata[:]
data=clean_wiki_data(data)
wikiPreprocessed_file = open("./data/wiki_preprocesed.txt", "w")
f = wikiPreprocessed_file.write(data)
wikiPreprocessed_file.close()

- Load the first 50 KB of the data (50,000 characters) for pre-training

In [None]:
wi_data_file = "./data/wiki_preprocesed.txt"
wi_data = open(wi_data_file, 'r').read(50000)

In [None]:
wi_data = remove_extraneous_characters(wi_data.lower(), chars)
wi_data_size = len(wi_data)

wi_data = list(wi_data)
for i, ch in enumerate(wi_data):
    wi_data[i] = char_to_ix[ch]

wi_data = torch.tensor(wi_data).to(device)
wi_data = torch.unsqueeze(wi_data, dim=1)

## Pre-training on the Wikipedia dataset

### RNN

In [None]:
rnn_wi_losses = train_rnn(wi_data, wi_data_size, './model_wi_rnn.pth')

### LSTM

In [None]:
lstm_wi_losses = train_lstm(wi_data, wi_data_size, './model_wi_lstm.pth')

## Fine-tuning on the Shakespeare dataset

- Define the following functions so that they load the previously trained model as a pre-trained model and fine-tune it on the Shakespeare plays dataset with a lower learning rate.

In [None]:
def finetune_rnn(data, data_size, model_save_file, model_pretrained_path):
    # RNN parameters
    hidden_size = 512
    num_layers = 6
    lr = 0.001
    epoch_num = 100
    losses = []

    ####################################
    ######### Your code begins #########
    ####################################
    # In this section, load the model pre-trained on the Wikipedia dataset from model_pretrained_path and fine-tune it on the given data


    ####################################
    ######### Your code ends ###########
    ####################################

    return losses

In [None]:
def finetune_lstm(data, data_size, model_save_file, model_pretrained_path):
    # LSTM parameters
    hidden_size = 512
    num_layers = 3
    lr = 0.001
    epoch_num = 100
    losses = []

    ####################################
    ######### Your code begins #########
    ####################################
    # In this section, load the model pre-trained on the Wikipedia dataset from model_pretrained_path and fine-tune it on the given data


    ####################################
    ######### Your code ends ###########
    ####################################

    return losses

### RNN

In [None]:
rnn_sh_finetune_losses = finetune_rnn(sh_data, sh_data_size, './model_sh_finetune_rnn.pth', './model_wi_rnn.pth')

### LSTM

In [None]:
lstm_sh_finetune_losses = finetune_lstm(sh_data, sh_data_size, './model_sh_finetune_lstm.pth', './model_wi_lstm.pth')

## Plotting Losses

In [None]:
def plot_losses_together(losses1, losses2):
    xpoints = np.array(range(len(losses1)))
    ypoints1 = np.array(losses1)
    ypoints2 = np.array(losses2)

    plt.plot(xpoints, ypoints1, color='blue',label='base_losses' )
    plt.plot(xpoints, ypoints2, color='red',label='finetune_losses' )
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()

### RNN

In [None]:
plot_losses_together(rnn_sh_losses, rnn_sh_finetune_losses)

### LSTM

In [None]:
plot_losses_together(lstm_sh_losses, lstm_sh_finetune_losses)

## Report

As you can see, fine-tuning helps improve the training of the main model.

By analyzing the obtained results, state the advantage of fine-tuning after pre-training the model on a public dataset, and compare its effect across the different models.

<font color='#73FF73'><b>Your answer : </b></font>

----
----

# 3. Experiment on different datasets

In the previous section, you saw the results of the text generation models on the Shakespeare plays dataset. In this section, you will examine the results of the LSTM model on the dialogues of the `Friends series`.

## Load dataset

In [None]:
!wget https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv -O ./data/Friends.csv

## Preprocessing

In [None]:
friends = pd.read_csv('./data/Friends.csv')
friends = friends.dropna()
friends = friends[~friends['speaker'].str.contains('SCENE')]
friends['speaker'] = friends['speaker'].apply(lambda sp: sp.lower().capitalize().split(' ')[0])
friends_texts = friends.drop(['episode','season','scene','utterance'], axis='columns')
friends_texts.head()

In [None]:
# Write each line of dialogue as "Speaker:\ntext\n\n"
with open("./data/friends.txt", "w") as f:
    for i, row in friends_texts.iterrows():
        f.write(row['speaker'] + ':\n' + row['text'] + '\n\n')

In [None]:
fr_data_file = "./data/friends.txt"
with open(fr_data_file, 'r') as f:
    fr_data = f.read(30000)
fr_data = remove_extraneous_characters(fr_data.lower(), chars)
fr_data_size = len(fr_data)

fr_data = list(fr_data)
for i, ch in enumerate(fr_data):
    fr_data[i] = char_to_ix[ch]

fr_data = torch.tensor(fr_data).to(device)
fr_data = torch.unsqueeze(fr_data, dim=1)

## Fine-tune the pre-trained LSTM on the Friends dataset

In [None]:
lstm_fr_finetune_losses = finetune_lstm(fr_data, fr_data_size, './model_fr_lstm.pth', './model_wi_lstm.pth')

## Generating texts

In [None]:
best_model_lstm =  LSTM(vocab_size, vocab_size, 512, 3).to(device)
best_model_lstm.load_model('./model_fr_lstm.pth')
print("best loss", min(lstm_fr_finetune_losses))
generate_text(best_model_lstm, fr_data, fr_data_size)

- As you can see, the LSTM network has been able to learn the features of the different datasets, such as sentence length and writing style, and use them in text generation.

## Output of the models fine-tuned on different datasets for a sample input

- In this section, you can see the text generated by the models for a sample input text.

In [None]:
input_sample_text = "Hello, have a nice day.\n"

In [None]:
best_model_lstm =  LSTM(vocab_size, vocab_size, 512, 3).to(device)
best_model_lstm.load_model('./model_fr_lstm.pth')
generate_text(best_model_lstm, fr_data, fr_data_size, create_input_sample_dataset(input_sample_text),100)

In [None]:
best_model_lstm =  LSTM(vocab_size, vocab_size, 512, 3).to(device)
best_model_lstm.load_model('./model_sh_finetune_lstm.pth')
generate_text(best_model_lstm, sh_data, sh_data_size, create_input_sample_dataset(input_sample_text),100)

## Report

Given the sample input and the outputs produced by the models fine-tuned on the Shakespeare dataset and on the Friends dataset, which output is more meaningful, and what is the reason for this difference?

<font color='#73FF73'><b>Your answer : </b></font>

----
----