<a href="https://colab.research.google.com/github/plantehenry/NeuralNetworksFinalProject/blob/main/MinimalChanges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Introduction to Neural Networks (CSE 40868/60868)
# University of Notre Dame
# Practical 4: Text generation using LSTM and GRU
# _________________________________________________________________________
# Qi Li (lead programmer), Adam Czajka ("destroyer"), March 2022

### Prerequisites:

1. Copy this notebook to your Google Drive
2. Create a folder in your Google Drive for this practical
3. Create a `data` subfolder and upload the "Wonder Land" text there


### Tasks for today:

**Task 1 (2 points).** Load the data, do a short training of the LSTM, and generate some example text (50 words is enough). Paste your generated text below.

**Task 2 (1 point).** Apply some useful data curation steps, retrain the LSTM, and generate a sample text (again, 50 words is enough). Paste your generated text below.

**Task 3 (2 points).** Switch to GRU and compare it with LSTM (train both models for a bit longer to get reasonable output). Paste your generated texts (for GRU and LSTM) below. Do you see any differences in the generated texts? Is one text qualitatively better than the other? If so, how would you explain the observed differences?

**Task 4 (for 60868 section attendees).** Train either LSTM or GRU on a text from your research area (for instance, a research paper or part of it). Experiment a bit with the loss (cross entropy vs MSE), optimizer (Adam, SGD, etc.) and their parameters. Since it's your research area, you should be able to assess how reasonable the outputs are. Is there any configuration (model / loss / parameters) that is better than others? If so, try to explain why. Have some fun with text generation!

### After completing the above tasks:

1. Discuss your solutions and observations with Qi, Ning, Qingkai or Adam in class.
2. Share your Google Colab notebook with all at Notre Dame (for reading). Please! Please! Remember about this step!
3. Submit the link via Canvas as your Practical 4 submission.
4. You may add any additional comments below -- we will be happy to read and discuss your observations!

### Your solutions:

Example text generated in Task 1:
> ...

Example text generated in Task 2:
> ...

Example texts generated in Task 3:
> ...

Example texts generated in Task 4 (60868):
> ...

Your comments:
...


### Starting with some imports, as usual

In [None]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate
import sys
import os
from collections import Counter
import string
import numpy as np
import argparse

### Mount your Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Here we construct our data loader

In [None]:
import csv
class LoadData(torch.utils.data.Dataset):
    def __init__(self,args):
        self.args = args
        self.sequence = self.loadText()

        
        # self.sequence holds the min-max normalized values, one single-feature list per time step.


    def loadText(self):
        # Read the first column of the CSV file, skipping the header row.
        input_sequence = []
        with open(f"{self.args.workingDir}/{self.args.inputFile}.csv", newline='') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='|')
            next(reader)
            for row in reader:
                input_sequence.append(float(row[0]))

        # Min-max normalize to [0, 1]; compute the extremes once, not once per element.
        lo, hi = min(input_sequence), max(input_sequence)

        # Wrap each value in a list so that every time step carries one feature.
        ret_array = [[(val - lo) / (hi - lo)] for val in input_sequence]

        return ret_array
        # output_sequence = []
        # with open(f"{self.args.workingDir}/{self.args.outputFile}.csv", newline='') as csvfile:
        #   spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
        #   spamreader.__next__()
        #   for row in spamreader:
        #     new_row = []
        #     for val in row:
        #       new_row.append(float(val))
        #     output_sequence.append(new_row)
        
        # return input_sequence, output_sequence
  

    def __len__(self):
        # Get the number of sequences for training purpose.
        return len(self.sequence) - self.args.seqLength

    def __getitem__(self, index):
        return (
            torch.tensor(self.sequence[index:index+self.args.seqLength]),
            torch.tensor(self.sequence[index+1:index+self.args.seqLength+1]),
        )
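
As a rough illustration of what `LoadData` yields, the sketch below reproduces the min-max scaling and the shifted input/target windows on a toy list, without PyTorch. The names `raw`, `scaled`, and `window`, and the toy values, are hypothetical and exist only for this example.

```python
# Toy stand-in for the first CSV column (hypothetical values).
raw = [10.0, 12.0, 11.0, 15.0, 20.0]
seq_length = 3  # plays the role of args.seqLength

# Min-max scaling to [0, 1], computed once over the whole series.
lo, hi = min(raw), max(raw)
scaled = [[(v - lo) / (hi - lo)] for v in raw]  # one feature per time step

def window(index):
    """Return the (input, target) pair that __getitem__ would build."""
    x = scaled[index:index + seq_length]
    y = scaled[index + 1:index + seq_length + 1]  # same window, shifted by one step
    return x, y

num_windows = len(scaled) - seq_length  # mirrors __len__
x0, y0 = window(0)
print(num_windows)
print(x0)
print(y0)
```

The target is the input window shifted by one step, so at every position the model is asked to predict the next value of the series.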

### Next, we define our Recurrent Neural Network (either LSTM or GRU, depending on the `rnnType` argument)

In [None]:
class Model(nn.Module):
    def __init__(self, dataset, args):
        super(Model, self).__init__()

        self.batch_size = args.batchSize

        # Define input dimension of RNN.
        self.inputSize = 1
        
        # Define embedding dimension of the RNN, i.e., output size of a RNN layer.
        self.embeddingDim = 256

        # Define the number of layers of the RNN.
        self.numLayers = 6

        # Find the number of unique words in input file.
        # numWords = len(dataset.uniqueWords)

        # # Define the embedding function.
        # self.embedding = nn.Embedding(num_embeddings=numWords,
        #     embedding_dim=self.embeddingDim,)
  


        #######################################################################
        # *** Build your RNN units for both models: LSTM and GRU.
        # Look at the Pytorch documentation:
        # -- LSTM: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
        # -- GRU: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
        #
        # Define (using variables defined above in this class):
        # -- "input_size"
        # -- "hidden_size"
        # -- "num_layers"
        # -- "dropout" rate that comes in "args.dropoutRate"        
        if args.rnnType == "LSTM":
            self.rnnUnit = nn.LSTM(input_size=self.inputSize,
                hidden_size=self.inputSize, num_layers=self.numLayers,
                dropout=args.dropoutRate)
        else: # GRU
            self.rnnUnit = nn.GRU(input_size=self.inputSize, 
                hidden_size=self.inputSize, num_layers=self.numLayers,
                dropout=args.dropoutRate)
        #######################################################################

        # Fully-connected layer generating our output.
        self.fc = nn.Linear(self.inputSize, 1)
        # self.fc = nn.Linear(self.embeddingDim, 1)

    def forward(self, X, prevState):
        
        # Calculate embeddings.
        # embedding = self.embedding(X)

        # Get the output from the RNN cell (either LSTM or GRU).
        output, state = self.rnnUnit(X, prevState)

        # And pass it through the FC layer to get the RNN's output for a given time step.
        logits = self.fc(output)

        return logits, state

    def initState(self, seqLength):

        # Define state initialization for LSTM and GRU.
        if args.rnnType == "LSTM":
            stateHidden = torch.zeros(self.numLayers, seqLength, self.inputSize)
            stateCurrent = torch.zeros(self.numLayers, seqLength, self.inputSize)
            return (stateHidden, stateCurrent)
        # GRU has no cell state, so we return 0 as a dummy for the current state.
        else:
            # The last dimension must match the hidden_size given to nn.GRU (self.inputSize).
            stateHidden = torch.zeros(self.numLayers, seqLength, self.inputSize)
            return (stateHidden, 0)
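
The tensor shapes here are easy to misread, so here is a small sketch, assuming PyTorch's default `batch_first=False` layout, in which `nn.LSTM` reads its input as `(seq_len, batch, input_size)` and its states as `(num_layers, batch, hidden_size)`. Because the `DataLoader` hands the model tensors shaped `(batchSize, seqLength, 1)`, the second dimension (`seqLength`) is what the RNN treats as its batch, which is why `initState` takes `seqLength`. The toy sizes below are hypothetical.

```python
import torch
from torch import nn

num_layers, hidden_size = 2, 1   # toy sizes, smaller than the notebook's
batch_size, seq_length = 4, 3    # stand-ins for args.batchSize and args.seqLength

lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, num_layers=num_layers)

# What the DataLoader yields: (batchSize, seqLength, 1).
X = torch.zeros(batch_size, seq_length, 1)

# States sized like initState(seqLength): (numLayers, seqLength, hidden_size).
h0 = torch.zeros(num_layers, seq_length, hidden_size)
c0 = torch.zeros(num_layers, seq_length, hidden_size)

output, (h1, c1) = lstm(X, (h0, c0))
print(output.shape)  # same leading dimensions as X, last dimension hidden_size
print(h1.shape)      # same shape as h0
```

If the state's middle dimension did not equal the second dimension of `X`, the call would fail with a size mismatch.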

### Training section
Estimated training times on GPU: around 25 seconds per epoch (for both LSTM and GRU).

In [None]:
# Training the model and generating texts
if __name__ == "__main__":

    # Parsing arguments. The args has been treated as input for class "LoadData" and class "Model". 
    # To run it in the jupyter notebook, please modify the value in default.
    # "Wonder Land" does not require GPU, CPU is feasible.
    # If you are interested in "Game of Thrones", or other larger text files, you need GPU.
    # If you ran out of GPU time in Google Colab, you may want to switch to a different Google account
    # or use CRC resources.
    #  
    # The steps of running Practical 4 on CRC GPU clusters: 
    #    1. Fill in your scripts for each task. 
    #    2. Copy each piece of scripts in this file and create a '.py' file. 
    #    3. Follow instructions of Practical 1 how to run '.py' code on CRC machines.
    #    4. Submit your notebook via Canvas.

    parser = argparse.ArgumentParser()
    parser.add_argument('-f')
    parser.add_argument('--workingDir', type=str, 
        default="/content/drive/My Drive/Neural Networks Project")
    # Can be game_of_thrones or wonder_land:
    parser.add_argument('--inputFile', type=str, default="input_data") 
    parser.add_argument('--maxEpochs', type=int, default=1)
    parser.add_argument('--batchSize', type=int, default=128)
    parser.add_argument('--seqLength', type=int, default=30)
    parser.add_argument('--learningRate', type=float, default=.01)
    parser.add_argument('--dropoutRate', type=float, default=0.0)
    # Here define which RNN model, i.e., LSTM or GRU you like to train:
    parser.add_argument('--rnnType', type=str, default="LSTM") 
    args = parser.parse_args()

    # Create folder for saving your checkpoints.
    os.makedirs(f"{args.workingDir}/checkpoints", exist_ok=True)

    # If using GPU, we need the following line. Also, please make sure to send 
    # both models, hidden states, current states, and data to GPU, if you are 
    # using GPU (to train on "Game of Thrones" or other larger texts, you really need GPU).
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    print("currently using: ", device)

    dataset = LoadData(args)
    model = Model(dataset, args)
    model.to(device) # if you're using GPU

    def train(dataset, model, args):

        # Switch the model to training mode.
        model.train()
        
        # If you are using GPU:
        dataloader = DataLoader(dataset, batch_size=args.batchSize, shuffle=True, 
                    collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
        
        # If you are using CPU:
        # dataloader = DataLoader(dataset, batch_size=args.batchSize, shuffle=True,
        #            collate_fn=lambda x: tuple(x_ for x_ in default_collate(x)))
        
        # Define the loss function and the optimizer. Do these names ring a bell? 
        # We used exactly the same losses and optimizers in CNNs and MLPs.
        # criterion = nn.CrossEntropyLoss()
        criterion = nn.MSELoss()
        optimizer = optim.Adam(model.parameters(), lr=args.learningRate)

        # Training loop starts here.
        for epoch in range(1, args.maxEpochs+1):
            stateHidden, stateCurrent = model.initState(args.seqLength)
            stateHidden = stateHidden.to(device)

            #######################################################################
            # *** Task 3 *** The line below is only for LSTM
            # Comment this line out for GRU (GRU does not pass the "stateCurrent" to the next step)
            stateCurrent = stateCurrent.to(device)
            #######################################################################


            bestLoss = float('inf')
            for batch, (X, y) in enumerate(dataloader): 

                optimizer.zero_grad()

                #######################################################################
                # *** Task 1 *** Implement states update in our RNN models.
                # Hint: GRU takes only the hidden state, while LSTM both hidden and current states.
                if args.rnnType == "LSTM":
                    # y_pred, (???,???) = model(X, (???,???))
                    # print(X.size())
                    # print(y.size())
                    y_pred, (stateHidden, stateCurrent) = model(X, (stateHidden, stateCurrent))
                else: # i.e., GRU
                    # y_pred, ??? = model(X, ???)
                    y_pred, stateHidden = model(X, stateHidden)
                #######################################################################

                loss = criterion(y_pred, y)

                # Update states for the RNN models. Note the difference between LSTM and GRU!
                if args.rnnType == "LSTM":
                    stateHidden = stateHidden.detach()
                    stateCurrent = stateCurrent.detach()
                else:
                    stateHidden = stateHidden.detach()

                loss.backward()
                optimizer.step()

                currLoss = loss.item()
                
                # Only save models with smallest loss per epoch.
                if currLoss < bestLoss:
                    bestLoss = currLoss
                    torch.save(model.state_dict(), f'{args.workingDir}/checkpoints/{args.rnnType}-{args.inputFile}-epoch_{epoch}.pth')
            print(f"Epoch ID: {epoch}, 'the best loss': {bestLoss}")
 

    # We are ready to train our RNN model!
    train(dataset, model, args)


currently using:  cuda:0
Epoch ID: 1, 'the best loss': 0.050369709730148315


### Ready to generate some text? Let's do it here!
First, define our text generator function:

In [None]:
def textGenerator(dataset, idx, model, workingDir, seedSequence, checkpointFileName, outputSize):
      checkpointFile = f"{workingDir}/checkpoints/{checkpointFileName}"
      print (f"Loading checkpoints from {checkpointFile}.pth \n")

      # If using GPU, we need the following line.
      model.to(device)

      # Loading the selected checkpoint (= RNN's weights)
      model.load_state_dict(torch.load(f"{checkpointFile}.pth"))
      model.eval()

      # Initialization.
      stateHidden, stateCurrent = model.initState(len(seedSequence))

      # If using GPU, we need the following line.
      stateHidden = stateHidden.to(device)

      #######################################################################
      # *** Task 3 *** The line below is only for LSTM
      stateCurrent = stateCurrent.to(device)  # comment this line for GRU
      #######################################################################

      for i in range(outputSize):
          X = torch.tensor([[[w] for w in seedSequence[i:]]])

          # If using GPU, we need the following line.
          X = X.to(device)

          #######################################################################
          # *** Task 1 *** Implement states update in our RNN models (as you did above already).
          if args.rnnType == "LSTM":
              # y_pred, (???,???) = model(X, (???,???))
              y_pred, (stateHidden, stateCurrent) = model(X, (stateHidden, stateCurrent))
          else: # i.e., GRU
              # y_pred, ??? = model(X, ???)
              y_pred, stateHidden = model(X, stateHidden)

          outputLogit = y_pred[0][-1]
          # Note that, even if you are using GPU, numpy can only be run on 
          # CPU, so if you are using GPU, please remember to send outputLogit to CPU.
          # outputLogit = outputLogit.cpu()
          # p = torch.nn.functional.softmax(outputLogit, dim=0).detach().numpy()

          # # Below we take the most probable word as our output. 
          # # Note that "p" is provided to "np.random.choice" so this numpy function will take softmax scores into account when sampling the output
          # # Look at https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html
          # # to better understand this construction 
          # wordIndex = np.random.choice(len(outputLogit), p=p)

          # Output next word:
          outputWord = outputLogit.item()
          # print("on3")
          print("___________")
          print(outputWord)
          idx += 1
          print(dataset.sequence[idx])
          print()
          # print(outputWord.size())
          # sys.stdout.write(outputWord)
          # print("two")

          # Making sure that each line has "words_per_line" words:
          words_per_line = 20
          if i%words_per_line == 0:
              sys.stdout.write("\n")
          else:
              sys.stdout.write(" ")

          seedSequence.append(outputWord) 
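
The loop above is autoregressive: each prediction is appended to `seedSequence`, and the next input is `seedSequence[i:]`, so the window keeps a constant length while sliding forward one step. A minimal sketch of that mechanism follows, with a hypothetical `predict_next` (a plain average of the window) standing in for the trained model.

```python
def predict_next(window):
    # Hypothetical stand-in for the RNN: predicts the mean of the window.
    return sum(window) / len(window)

def generate(seed, output_size):
    """Roll a fixed-length window forward, feeding predictions back in."""
    sequence = list(seed)          # copy so the caller's seed is untouched
    window_lengths = []
    for i in range(output_size):
        window = sequence[i:]      # always len(seed) items long
        window_lengths.append(len(window))
        sequence.append(predict_next(window))
    return sequence[len(seed):], window_lengths

generated, window_lengths = generate([0.2, 0.4, 0.6], output_size=4)
print(generated)
print(window_lengths)
```

Since later inputs consist increasingly of the model's own outputs, prediction errors can compound over long generations.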

Next, select the seed sequence and let the model generate a sequence of `outputSize` values.

In [None]:
# First, we need to define the seed sequence for generation.
# The model will generate output starting from a sequence that it has seen before,
# i.e., in the training stage.
# Pick the seed so that a full window of seqLength values fits in the data.
seed = np.random.choice(len(dataset.sequence) - args.seqLength)
# Unwrap each single-feature time step into a plain float.
seedSequence = [step[0] for step in dataset.sequence[seed:seed+args.seqLength]]
print(f"Seed: {seedSequence}\n")

# Select the model you want to use:
args.rnnType = 'LSTM'

# And the number of epochs it was trained for (double check you have this checkpoint
# in your Google Drive). Usually you should select the checkpoint with the model offering 
# the smallest loss. But you can also switch between checkpoints to see if there is any obvious
# difference in the generated texts between models offering high and low loss values.
training_epochs = 200

# How many words to generate?
outputSize = 200

print ("!!!!!!! Text generation Starts !!!!!!!\n")
textGenerator(dataset, seed+args.seqLength , model, args.workingDir, seedSequence, 
    f"{args.rnnType}-{args.inputFile}-epoch_{training_epochs}", outputSize=outputSize)
print ("\n \n!!!!!!! Text generation Ends !!!!!!!\n")

Seed: [0.28465082341874054, 0.24206882827905182, 0.25826886084193135, 0.2924387787714146, 0.30556041318676014, 0.3220167276718144, 0.3229544179988302, 0.3397514366863049, 0.30459257204537094, 0.29267094004209054, 0.30966996918586776, 0.32877352517291497, 0.3771263861836908, 0.3609504742723101, 0.3886862084144892, 0.3903354579607195, 0.397161602334879, 0.3943093352951463, 0.39542491542696573, 0.4084169013405051, 0.4093787123190197, 0.39104400209849666, 0.39197264718120034, 0.3987656256595491, 0.3994349737386406, 0.41210133085694645, 0.4287777462869272, 0.4313405655086744, 0.4350943418981746, 0.41645510845247935]

!!!!!!! Text generation Starts !!!!!!!

Loading checkpoints from /content/drive/My Drive/Neural Networks Project/checkpoints/LSTM-input_data-epoch_200.pth 

___________
0.31096377968788147
[0.41639179174229496]


___________
0.3937872648239136
[0.42069129787195547]

 ___________
0.3588946759700775
[0.41921993813052894]

 ___________
0.34057557582855225
[0.42843704208450667]

 _