# Intro to Recurrent Neural Nets (RNNs)

References: 
* Intro to RNNs: https://victorzhou.com/blog/intro-to-rnns/
* Explanation of cross-entropy loss: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e

## Instructions

1. Create a virtual environment: `python3 -m venv datascience-venv`
2. Activate the virtual environment: `source datascience-venv/bin/activate`
3. Install JupyterLab in the virtual environment using pip: `pip3 install jupyterlab`
4. Install the dependencies (`numpy`, `pandas`, `scikit-learn`) into the virtual environment
   * `pip3 install numpy pandas scikit-learn`
5. Add the virtual environment as a kernel to JupyterLab: `python3 -m ipykernel install --user --name=datascience-venv`
6. Start JupyterLab from the virtual environment: `jupyter-lab --notebook-dir <location of your notebooks>`
7. Make sure you select the virtual environment's kernel in the notebook that you're using

In [1]:
import sys
import pandas as pd
import numpy as np
from test_data.rnns_testdata import train_data, test_data
from functools import reduce

## Training data setup

In [2]:
# Reduce all training data into a set of unique words
# Note: the lambda computes x + y; for lists, + concatenates, e.g. [1,2,3] + [4,5,6] == [1,2,3,4,5,6]
vocabulary = list(set(reduce(
    lambda list_elem1, list_elem2: list_elem1+list_elem2,
    [key.split(' ') for key in train_data.keys()], 
    []
)))
assert len(vocabulary) == 18
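
The `reduce` call above can be hard to read. A minimal sketch of an equivalent approach using a set comprehension follows. It assumes a small stand-in dictionary: `sample_data` is hypothetical and only mirrors the shape of `train_data`.

```python
from functools import reduce

# Hypothetical stand-in for train_data: phrase -> sentiment label
sample_data = {'good': True, 'not good': False, 'i am happy': True}

# Original approach: concatenate the per-phrase word lists, then deduplicate
vocab_reduce = set(reduce(lambda a, b: a + b,
                          [key.split(' ') for key in sample_data], []))

# Equivalent approach: a set comprehension over every word of every phrase
vocab_comprehension = {word for key in sample_data for word in key.split(' ')}

assert vocab_reduce == vocab_comprehension
print(sorted(vocab_comprehension))
```

Wrapping the result in `sorted(...)` instead of `list(...)` would also keep the word-to-index mapping stable across kernel restarts, since the iteration order of a set of strings can differ between Python processes.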

In [3]:
train_data

{'good': True,
 'bad': False,
 'happy': True,
 'sad': False,
 'not good': False,
 'not bad': True,
 'not happy': False,
 'not sad': True,
 'very good': True,
 'very bad': False,
 'very happy': True,
 'very sad': False,
 'i am happy': True,
 'this is good': True,
 'i am bad': False,
 'this is bad': False,
 'i am sad': False,
 'this is sad': False,
 'i am not happy': False,
 'this is not good': False,
 'i am not bad': True,
 'this is not sad': True,
 'i am very happy': True,
 'this is very good': True,
 'i am very bad': False,
 'this is very sad': False,
 'this is very happy': True,
 'i am good not bad': True,
 'this is good not bad': True,
 'i am bad not good': False,
 'i am good and happy': True,
 'this is not good and not happy': False,
 'i am not at all good': False,
 'i am not at all bad': True,
 'i am not at all happy': False,
 'this is not at all sad': True,
 'this is not at all happy': False,
 'i am good right now': True,
 'i am bad right now': False,
 'this is bad right now': Fa

In [17]:
# Build a map of idx to word in the vocab
map_idx_to_vocab = dict(enumerate(vocabulary))

# Build a map of word in the vocab to idx
map_vocab_to_idx = {word: idx for idx, word in enumerate(vocabulary)}

# Create one-hot encodings for these 18 features based on the input training data
feature_matrix = np.zeros(shape=(len(train_data), len(vocabulary)))
assert feature_matrix.shape == (58, 18) 
# 58 training data samples, and 18 distinct words in vocab - i.e. 18 features per sample

# Consistently sized feature matrix - not sure if this is OK for RNNs?
train_data_l = list(train_data.keys())
for _iter in range(len(train_data)):
    for _elem in train_data_l[_iter].split(' '):
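        # Index with [] rather than .get(): a missing word then raises KeyError instead of silently indexing with None
        feature_matrix[_iter][map_vocab_to_idx[_elem]] = 1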
        
if True:
    np.set_printoptions(threshold=sys.maxsize)
    print(feature_matrix)

# Note: Not using this feature_matrix, as I'm not sure if it fits into the whole RNN thing:
# * RNNs can accept inputs of different sizes
# * RNNs can create outputs of different sizes 

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 1. 0. 0. 0. 1. 0. 0.
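
One reason the fixed-size `feature_matrix` may not suit an RNN is that it drops word order. The training data contains `'i am good not bad'` (True) and `'i am bad not good'` (False), which use exactly the same words, so they get identical rows. The sketch below illustrates this with a hypothetical four-word vocabulary; `vocab`, `bag_of_words` and `one_hot_sequence` are illustrative names, not part of the notebook.

```python
import numpy as np

# Hypothetical four-word vocabulary, for illustration only
vocab = ['bad', 'good', 'not', 'very']
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

def bag_of_words(text: str) -> np.ndarray:
    """One fixed-size row per phrase: 1 where a word occurs, 0 elsewhere."""
    row = np.zeros(len(vocab))
    for word in text.split(' '):
        row[word_to_idx[word]] = 1
    return row

def one_hot_sequence(text: str) -> list:
    """One (1, len(vocab)) one-hot vector per word, in phrase order."""
    vectors = []
    for word in text.split(' '):
        v = np.zeros((1, len(vocab)))
        v[0][word_to_idx[word]] = 1
        vectors.append(v)
    return vectors

a, b = 'good not bad', 'bad not good'

# The fixed-size rows are identical, so the word order is lost
assert np.array_equal(bag_of_words(a), bag_of_words(b))

# The per-word sequences differ, so a model that reads them in order can tell the phrases apart
assert not np.array_equal(np.vstack(one_hot_sequence(a)),
                          np.vstack(one_hot_sequence(b)))
```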

In [29]:
def createInputs(text: str) -> list:
  '''
  Returns a list of one-hot vectors representing the words
  in the input text string, in word order.
  - text is a string of space-separated words from the vocabulary
  - Each one-hot vector is an np.array of shape (1, len(vocabulary))
  '''
  inputs = []
  for w in text.split(' '):
    v = np.zeros((1, len(vocabulary)))
    v[0][map_vocab_to_idx[w]] = 1
    inputs.append(v)
  return inputs

ip = createInputs('i am very good')

assert len(ip) == 4
assert ip[0].shape == (1, len(vocabulary))
assert np.transpose(ip[0]).shape == (len(vocabulary), 1)
ip

[array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         1., 0.]]),
 array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]]),
 array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
         0., 0.]]),
 array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]])]

## RNN Model setup

### RNN explained visually
![RNN many to 1](pngs/rnn_many_to_one.png "RNN many to 1")  ![RNN many to many](pngs/rnn_many_to_many.png "RNN many to many")

### RNN forward pass

#### Formulae 
![RNN Formulae](pngs/rnn_formulae.png "RNN Formulae")

#### Matrix implementation

Note: The `@` operator performs matrix multiplication on NumPy arrays

Analysis of the first equation:
* ht = tanh(Wxh @ X + Whh @ h(t-1) + bh), where h(t-1) is the hidden state from the previous step
    * ht.shape = 64x1
    * Wxh.shape = 64x18 (18 is the number of features in the training data; 64 is the size of the hidden state vector - I don't know why 64 was selected)
    * Whh.shape = 64x64 (64 is the size of the hidden state vector - I don't know why 64 was selected)
    * bh.shape = 64x1
    * X.shape = 18x1 (18 is the number of features in the training data, i.e. the vocabulary size)
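
The shapes listed above can be checked with a minimal sketch of a single hidden-state update. The weights are random placeholders and the word index `3` is arbitrary; both are assumptions made only for illustration.

```python
import numpy as np

input_size, hidden_size = 18, 64  # sizes taken from the analysis above

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
Wxh = rng.standard_normal((hidden_size, input_size)) / 1000
Whh = rng.standard_normal((hidden_size, hidden_size)) / 1000
bh = np.zeros((hidden_size, 1))

x = np.zeros((input_size, 1))        # one-hot column vector for a single word
x[3] = 1                             # arbitrary word index
h_prev = np.zeros((hidden_size, 1))  # initial hidden state

# (64x18 @ 18x1) + (64x64 @ 64x1) + 64x1 -> 64x1
h = np.tanh(Wxh @ x + Whh @ h_prev + bh)
assert h.shape == (hidden_size, 1)

# With a one-hot input, Wxh @ x simply selects one column of Wxh
assert np.allclose(Wxh @ x, Wxh[:, [3]])
```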



Analysis of the second equation:
* yt = Why @ ht + by
    * yt.shape = 2x1 (2 was chosen in the reference URL - I don't know its significance)
    * Why.shape = 2x64
    * ht.shape = 64x1
    * by.shape = 2x1
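
A minimal sketch of the second equation follows, again with random placeholder values. It assumes that each of the two output rows is a raw score for one of the two sentiment labels in `train_data`, which would explain an output size of 2. That reading is an interpretation, not something the notebook confirms.

```python
import numpy as np

hidden_size, output_size = 64, 2  # sizes taken from the analysis above

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
Why = rng.standard_normal((output_size, hidden_size)) / 1000
by = np.zeros((output_size, 1))
h = np.tanh(rng.standard_normal((hidden_size, 1)))  # placeholder final hidden state

# (2x64 @ 64x1) + 2x1 -> 2x1
y = Why @ h + by
assert y.shape == (output_size, 1)

# Softmax turns the two raw scores into probabilities that sum to 1
exps = np.exp(y - np.max(y))
probs = exps / np.sum(exps)
assert np.isclose(probs.sum(), 1.0)

# Assumed interpretation: the row with the larger probability is the predicted label
predicted_class = int(np.argmax(probs))
```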


In [147]:
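def softmax(xs):
  # Applies the softmax function to the input array.
  # Subtracting the max first avoids overflow in np.exp without changing the result.
  exps = np.exp(xs - np.max(xs))
  return exps / np.sum(exps)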

class RNN:
    def __init__(self,
                 size_weights_input: int, 
                 size_weights_output: int, 
                 size_weights_hidden: int = 64,
                 debug: bool = False
                ):
        self.wxh = np.random.randn(size_weights_hidden, size_weights_input) / 1000
        self.whh = np.random.randn(size_weights_hidden, size_weights_hidden) / 1000
        self.why = np.random.randn(size_weights_output, size_weights_hidden) / 1000

        self.size_weights_input = size_weights_input
        self.size_weights_hidden = size_weights_hidden
        self.size_weights_output = size_weights_output

        self.bh = np.zeros(shape=(size_weights_hidden, 1))
        self.by = np.zeros(shape=(size_weights_output, 1))

        self.debug = debug

    def forward_pass(self, inputs: list) -> tuple:
        """Forward pass of this RNN.

        Keyword arguments:
        inputs -- a list of np.array, one per word, each of shape (1, len(vocabulary))
        
        Returns:
        tuple -- (y, ht): the output of shape (self.size_weights_output, 1)
                 and the final hidden state of shape (self.size_weights_hidden, 1)
        """
        ht = np.zeros(shape=(self.size_weights_hidden, 1))

        # Perform each step of the RNN - on each sample Xn, i.e. each word in the input
        for _iter, word_features in enumerate(inputs):
            if self.debug: print("Computing h%s" % _iter)
            ht = np.tanh(self.wxh @ np.transpose(word_features) + self.whh @ ht + self.bh)
            if self.debug: print('Ht shape: {0}'.format(ht.shape))
            if self.debug: print(ht)

        # Compute the output
        y = self.why @ ht + self.by

        return y, ht

In [154]:
# tests

rnn = RNN(size_weights_input=len(vocabulary), 
          size_weights_output=2,
          size_weights_hidden=64,
          debug=False
         )
inputs = createInputs('i am very good')
out, h = rnn.forward_pass(inputs)

assert out.shape == (2, 1)
assert h.shape == (64,1)
print(out)
probs = softmax(out)
probs

[[ 3.94111531e-06]
 [-9.44234590e-06]]


array([[0.50000335],
       [0.49999665]])

In [156]:
int(False)

0
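
Converting a boolean label with `int()` gives a class index that can be used to compute a cross-entropy loss from the softmax output. The sketch below rests on two assumptions: that `True` maps to index 1, and that the phrase `'i am very good'` is labelled `True`. `cross_entropy_loss` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def cross_entropy_loss(probs: np.ndarray, target_index: int) -> float:
    """Negative log of the probability assigned to the correct class."""
    return float(-np.log(probs[target_index, 0]))

# Softmax output printed above for 'i am very good'
probs = np.array([[0.50000335], [0.49999665]])

# Assumption: the boolean label maps to a class index via int(), so True -> 1
target_index = int(True)
loss = cross_entropy_loss(probs, target_index)

# An untrained two-class model is near 50/50, so the loss is close to ln(2)
assert np.isclose(loss, np.log(2), atol=1e-4)
```

A loss near `ln(2)` is what an uninformed guess between two classes produces. Training should push it lower.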

## What I don't understand - need to read up on

Post on machine learning stack exchange to get some insight and learn further. 

1. [ ] Could I use a standard `n x m` matrix to train a neural net, such as the matrix `feature_matrix`?
2. [ ] Why are the `Whh`, `Wxh` and `Why` weights set to `64x64` matrix with each position another `64x64` normal distribution matrix? Can't we just use a regular `64x64` matrix of scalar values for the matrix multiplications?
   * Why use `randn` to generate a normal `64x64` distribution in each elem of the matrix?
   * Why not just use `random_normal` to generate 1 matrix of scalars?
3. [ ] What is the significance of the `size_weights_hidden` value of 64?
4. [ ] What is the significance of the `size_weights_output` value of 2?
5. [ ] Read up on `Softmax`
6. [ ] Read up on `Cross Entropy Loss`
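
On question 2, a quick check may help: `np.random.randn(64, 64)` draws 64 * 64 independent samples from a standard normal distribution and returns them as one 2-D array of scalars, not a matrix of matrices. A minimal sketch:

```python
import numpy as np

weights = np.random.randn(64, 64)

# A single 2-D array: 64 rows, 64 columns, 4096 scalars in total
assert weights.shape == (64, 64)
assert weights.ndim == 2
assert weights.size == 64 * 64

# Each element is one float, not a nested matrix
assert np.isscalar(weights[0, 0])

# Dividing by 1000, as the RNN class does, only scales the values down
small_weights = weights / 1000
assert small_weights.shape == (64, 64)
```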