# Week 5 - Natural Language Processing (NLP)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
import src.week5_func as wk5
import os
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Chapter 1: How do we turn text into data we can use?

### Convert your corpus into bags of words

We can't apply any of the techniques we have learned over the past few weeks directly on raw text.  Therefore, our first task is to convert our corpus into numbers.  The simplest way to do this is to use a **bag of words**. You can see some examples of this here: https://liferay.de.dariah.eu/tatom/index.html.

Once you understand the concept, convert your corpus and documents into bags of words below:

In [102]:
def extract_words(sentence):
    ignore_words = ['a', 'the', 'if', 'br', 'and', 'of', 'to', 'is']
    words = re.sub("[^\w]", " ",  sentence).split() #nltk.word_tokenize(sentence) this replaces all special chars with ' '
    words_cleaned = [w.lower() for w in words if w not in ignore_words]
    return words_cleaned  

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)
        
    words = sorted(list(set(words)))
    return words

def bagofwords(sentence, words):
    sentence_words = extract_words(sentence)
    # frequency word count
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i,word in enumerate(words):
            if word == sw: 
                bag[i] += 1
                
    return np.array(bag)

vocabulary = tokenize_sentences(df_raw['text'])


In [103]:
n_words = len(vocabulary)
print(n_words)
n_docs = len(df_raw.iloc[:10])
bag_o = np.zeros([n_docs,n_words])
for ii in range(n_docs):
    bag_o[ii,:] = bagofwords(df_raw['text'].iloc[ii],vocabulary)
print(np.sum(bag_o,axis=1))

74891
[115. 364. 124.  99.  97. 145.  88. 296. 374. 282.]


<font color=blue>Trying to do it manually causes a memory error. Probably will have to use sparse matrices instead. Let's just implement this with sklearn which does the sparse matrix for you.</font>
## <font color='blue'><strong>Try using sklearn instead</font>

In [93]:
vectorizer = CountVectorizer(analyzer = "word", strip_accents=None, tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000) 
train_data_features = vectorizer.fit_transform(df_raw['text'])
print(np.shape(train_data_features))
# bag_sk  = vectorizer.transform(df_raw['text'])

(25000, 5000)


In [83]:
df_freq = pd.DataFrame({'word':vectorizer.get_feature_names() ,'freq':np.array(np.sum(train_data_features,axis=0)).flatten()})
df_freq.sort_values('freq',ascending=False).head(5)

In [None]:
from nltk.tokenize import word_tokenize
from collections import Counter
Counter(word_tokenize(df_raw))

### Show us your bags

Show and explain what one of your documents looks like as a bag of words below.  What are the advantages and disadvantages of encoding text as bags of words?

### Tell us a story with your bags
Now that your text is in a more digestible format, you can apply previously learned techniques to better understand the corpus. **Create a brief story around your corpus, for example by using clustering techniques.** Some examples of what you can do below:
* Use *Hierarchical Clustering* to understand similarity of documents in your corpus. What distance measure works best? Are the results what you expect?
* Learn about *Latent Dirichlet Allocation* to extract topics from your corpora, and measure each document on how much of each topic it contains. How do you interpret these topics?

Some **potential inspiration** below (but please keep your own story simple!):
* https://liferay.de.dariah.eu/tatom/topic_model_mallet.html covers a few examples of text analysis
* http://fantheory.viacom.com/
* https://pudding.cool/2017/02/vocabulary/

Additional resources on LDA (if you are interested): 
* https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
* https://www.youtube.com/watch?v=DDq3OVp9dNA

### Normalize your bags
In the above exercise, you may find it important to normalize your data.  One useful method when dealing with text is *Term Frequency - Inverse Document Frequency (TF-IDF)*. You can see more detail on this here: http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/.

Once you understand the concept, **express your data as TF-IDF vectors (instead of simple bag-of-words counts), and see if it changes your above story**. 

In [112]:
corpse_us = ['I am a boy','I am a girl girl','you are neither','what is this cake']
print(vectorizer.fit_transform(corpse_us).toarray())
print(vectorizer.vocabulary_)

[[1 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 2 0 0 0 0 0]
 [0 1 0 0 0 0 1 0 0 1]
 [0 0 0 1 0 1 0 1 1 0]]
{'am': 0, 'boy': 2, 'girl': 4, 'you': 9, 'are': 1, 'neither': 6, 'what': 8, 'is': 5, 'this': 7, 'cake': 3}


In [115]:

tfidf = TfidfTransformer().fit_transform(train_data_features)

In [116]:
tfidf

<25000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 2956225 stored elements in Compressed Sparse Row format>

### Show us your bags (Version 2)

Show and explain what one of your documents looks like as a TF-IDF vector below.  How is this different from a simple bag-of-words?

# Chapter 2: Simple Supervised Learning with Text
Now that you are comfortable with treating text as numbers, we can try out supervised learning.  We'll use a labelled dataset of IMDB reviews to classify each review as 'positive' or 'negative'.  You can **find the data below:**

http://ai.stanford.edu/~amaas/data/sentiment/

Load in and process the data, then train a supervised learning model.  **You should achieve val or test set accuracy of 85%**. Pretty good for a simple bag, no?

imported and download data using  
``` wk5.dl_and_unzip('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')```  
Tar file removed to save space.

In [2]:
# data processed and saved with week5_func.save_as_pkl()
df_raw = pd.read_pickle('./data/df_raw.pkl')
df_raw_test = pd.read_pickle('./data/df_raw_test.pkl')

In [9]:
vectorizer = CountVectorizer(analyzer = "word", strip_accents=None, tokenizer = None, \
                             preprocessor = None, stop_words = None, max_features = 5000) 
train_data_features = vectorizer.fit_transform(df_raw['text'])
test_data_features = vectorizer.transform(df_raw_test['text'])
tfidfier = TfidfTransformer()
tfidf = tfidfier.fit_transform(train_data_features)
tfidf_test = tfidfier.transform(test_data_features)
X_all = tfidf.toarray()
y_all = df_raw['positive'].values
X_test = tfidf_test.toarray()
y_test = df_raw_test['positive'].values

In [10]:
tfidf

<25000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 2956225 stored elements in Compressed Sparse Row format>

In [11]:
# X_train,X_test,y_train,y_test = train_test_split(X_all,y_all,shuffle=True)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LogisticRegression
def classify():
    rf = LogisticRegression()
    rf.fit(X_all,y_all)
    print(rf.score(X_test,y_test))
    return rf
classify()



0.88256


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

# Chapter 3: Playing with Recurrent Neural Networks (RNN)
So far, we've only treated text as a simple bag, with reasonable results.  We'll now shift to a more complex representation of language: recurrent neural networks.  To do so, we need to process text at the word or character level, and capture the sequence of a document. 

Our task here is to build an RNN that 'eats up' sequences of characters in order to predict the next character in a sequence, for every step in the sequence of a document. This is a common (and fun) task, with lots of examples available online. 

For this task, use existing RNN APIs (don't code everything from scratch) from Keras or PyTorch. 

**Read up on RNNs and this exercise** below:
* http://karpathy.github.io/2015/05/21/rnn-effectiveness/ - start here!
* https://github.com/martin-gorner/tensorflow-rnn-shakespeare - video, slides and code going through an example with Shakespeare
* http://killianlevacher.github.io/blog/posts/post-2016-03-01/post.html - another nice example based on Trump tweets

### Prepare your data

Our first step is to prepare our text. **Process your corpora into a format that can be used by an RNN, and walkthough one sequence below**.

An **example way to shape your data** for this task is as follows (feel free to play around with different structures):

*In this example your corpora starts with the string 'the cat and I'*
* RNN input: divide your text into sequences of 10 characters e.g. 'the cat an'
* RNN output: the 1 character immediately following RNN input sequences e.g. 'd'. 
* Note: You may or may not want to divide your text into overlapping strings (e.g. RNN input contains 'the cat an', 'he cat and', 'e cat and ', ...) . How is the model different in each case?
* Note: Your 'vocabulary' or `vocab_size` here is the number of unique characters in your text (and therefore the number of classes you want to predict)

<font color='blue'>
converted entire corpus into one long string, then divided string into segments of 10 (or 20) characters. Used input of characters to predict next character. Resulted in some sort of exploding/vanishing gradient problem. outputs turned into NaN very quickly (in ~6-10 iterations).  
  
Next I'm goning to try one-hot encoding the inputs.
</font>

In [478]:
import torch.nn as nn
import torch
import torch.optim as optim
import torch.nn.functional as F

In [4]:
# lets start by predicting just positive stuff.
df_pos = df_raw[df_raw['positive']==1]
str_all=[]
def func(text):
    str_all.append(text)
df_pos.apply(lambda x: func(x['text']),axis=1)

# learn on the first 10 smaples
strrring = ''
for i in range(10):
    strrring+=str_all[i]

print(strrring)
    

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to 

In [5]:
# encode the text and map each character to an integer and vice versa

# we create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to unique integers
chars = tuple(set(strrring))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# encode the text
encoded = np.array([char2int[ch] for ch in strrring])

In [6]:
(encoded.shape)
int2char

{0: 'q',
 1: 'P',
 2: '6',
 3: '(',
 4: 'O',
 5: 'U',
 6: 'e',
 7: 'D',
 8: '1',
 9: '2',
 10: 'S',
 11: 'y',
 12: '0',
 13: '7',
 14: 'o',
 15: 'b',
 16: '/',
 17: 'R',
 18: 'V',
 19: '8',
 20: 'E',
 21: 'N',
 22: 'c',
 23: 'H',
 24: '>',
 25: "'",
 26: 'A',
 27: 'k',
 28: 'F',
 29: 'f',
 30: ';',
 31: '5',
 32: 'K',
 33: 'i',
 34: 'a',
 35: ':',
 36: 'p',
 37: ' ',
 38: '.',
 39: 'M',
 40: 'l',
 41: 'j',
 42: 'h',
 43: 'L',
 44: 'v',
 45: 't',
 46: ')',
 47: '\x85',
 48: 'n',
 49: 'C',
 50: '-',
 51: 'g',
 52: 'G',
 53: 'd',
 54: 'I',
 55: '3',
 56: '?',
 57: 'W',
 58: '"',
 59: 'T',
 60: 'B',
 61: 'r',
 62: 'x',
 63: 'm',
 64: ',',
 65: 'w',
 66: '<',
 67: 'z',
 68: '!',
 69: '~',
 70: 'Y',
 71: 'J',
 72: 'u',
 73: 's',
 74: '*'}

In [425]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the the encoded array
    one_hot = np.zeros([len(arr), n_labels], dtype=np.float32)
    
    # Fill the appropriate elements with ones
#     one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
#     print(arr.flatten().tolist())
    one_hot[np.arange(one_hot.shape[0]), arr.flatten().int()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = torch.from_numpy(one_hot.reshape((*arr.shape, n_labels)))
    
    return one_hot
# print(type(line_tensor))

In [427]:
# arr = one_hot_encode(encoded, len(chars))
# np.shape(arr)

In [497]:
def get_batches(arr, seq_length): 
    X=[]
    y=[]
    for ind in range(np.shape(arr)[0]-seq_length):
        X.append(arr[ind:ind+seq_length]) #,:])
        y.append(arr[ind+seq_length])# ,:])
#     return np.moveaxis(np.array(X),0,1),np.array(y)
    return torch.from_numpy(np.array(X).reshape(-1,seq_length)).float(), torch.from_numpy(np.array(y)).float()
X,y = get_batches(encoded,10) #try without onehot.
# y_cold = y.max(1)[1]
print(np.shape(X),np.shape(y))

torch.Size([13606, 10]) torch.Size([13606])


In [498]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size) #(129,128)
        self.i2o = nn.Linear(input_size + hidden_size, output_size) #(129,75)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, inputs, hidden):
#         print('forward sizes',input.size(),hidden.size())
        combined = torch.cat((inputs,hidden),1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

# n_hidden = 128
# n_letters = 1 #len(chars)*10
# n_categories = len(chars)
# rnn = RNN(n_letters, n_hidden, n_categories)

# hidden =torch.zeros(1, n_hidden)
# output, next_hidden = rnn.forward(X[0,0].reshape(1,1), hidden)
# print(output,next_hidden)

In [499]:
def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return int2char[category_i], category_i

print(categoryFromOutput(output))

('m', 63)


In [500]:
np.shape(X)[0]-1

13605

In [501]:
import random

def randomChoice(l, t):
    randind = random.randint(0,np.shape(l)[0]-1)
    return l[randind ,:], t[randind] #, tc[randind]
def randomTrainingExample():
    randX, randy = randomChoice(X,y) #,y_cold)
#     print(randX)
#     reshaped_str = np.reshape(randX,(10,-1),order='C')
    trainstr = ''
    train = [int2char[ii.item()] for ii in randX]
    for ii in train: trainstr+=ii
#     print('[',trainstr,'][',int2char[randy.item()],']')
    return int2char[randy.item()], trainstr, randy, randX
#     return trainstr, int2char[np.argmax(randy).item()], randX.reshape(1,-1), randy, randy_c
randomTrainingExample()

(' ',
 'olent than',
 tensor(37.),
 tensor([14., 40.,  6., 48., 45., 37., 45., 42., 34., 48.]))

In [502]:
criterion = nn.NLLLoss()
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.Adam(net.parameters(), lr=0.001)

In [503]:
def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    rnn.zero_grad()
#     print(np.shape(line_tensor))
    outputs = []
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i].reshape(1,-1), hidden)
        outputs.append(output)
    outputs = torch.stack(outputs)
    
    loss = criterion(output, category_tensor.reshape(1,).long()) #just use the final output to calculate the loss
#     print('lossis', output, category_tensor)
#     print(torch.min(torch.max(outputs,1)[1].reshape(-1,1).float()),torch.min(torch.max(category_tensor, 1)[1]))
#     loss = criterion(torch.max(outputs,1)[1].reshape(-1,1).float(), category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(learning_rate, p.grad.data)
#         print('post zero', p.grad.data)
    return output, loss.item()

In [504]:
import time
import math

n_iters = 5000
print_every = 100
plot_every = 100
learning_rate = 0.005

# initiate model
n_hidden = 32
n_letters = 75 #len(chars)*10
n_categories = len(chars)
rnn = RNN(n_letters, n_hidden, n_categories)

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()
# output, loss = train(y_cold, X)
# print(timeSince(start))
for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    line_tensor = one_hot_encode(line_tensor, n_categories)
#     print(line_tensor)
    output, loss = train(category_tensor.reshape(1,1), line_tensor)
#     output, loss = train(y_cold, X)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

100 2% (0m 0s) 4.3195  big digge / ~ ✗ (r)
200 4% (0m 0s) 4.4236  who has a / ( ✗ (d)
300 6% (0m 0s) 4.4242 tling drea / 1 ✗ (d)
400 8% (0m 0s) 4.3184 e film NOW / ~ ✗ (.)
500 10% (0m 1s) 4.6458 omments, a / ( ✗ (n)
600 12% (0m 1s) 4.6184 ave been f / B ✗ (a)
700 14% (0m 1s) 5.2326 ead Maupin / B ✗ ( )
800 16% (0m 1s) 5.7538 tor in the / P ✗ ( )
900 18% (0m 1s) nan Blazing Sa / * ✗ (d)
1000 20% (0m 2s) nan lt shrugs  / * ✗ (i)
1100 22% (0m 2s) nan  Lisa Emer / * ✗ (y)
1200 24% (0m 2s) nan  one physi / * ✗ (c)
1300 26% (0m 2s) nan . Ms. Oh i / * ✗ (s)
1400 28% (0m 2s) nan stic view  / * ✗ (o)
1500 30% (0m 3s) nan it the nex / * ✗ (t)
1600 32% (0m 3s) nan <br /><br  / * ✗ (/)
1700 34% (0m 3s) nan ke a news  / * ✗ (r)
1800 36% (0m 3s) nan shrugs ind / * ✗ (i)
1900 38% (0m 4s) nan rm of a yo / * ✗ (u)
2000 40% (0m 4s) nan ad million / * ✗ (a)
2100 42% (0m 4s) nan  is correc / * ✗ (t)
2200 44% (0m 4s) nan You really / * ✗ ( )
2300 46% (0m 4s) nan n probably / * ✗ ( )
2400 48% (0m 5s) nan ic

## Try LSTM

In [505]:
train

<function __main__.train(category_tensor, line_tensor)>

In [627]:
class LSTM(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we dont have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
#         embeds = self.word_embeddings(sentence)
#         print(sentence.size())
        lstm_out, self.hidden = self.lstm(sentence, self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
#         print(tag_space.size())
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
    
def train(category_tensor, line_tensor):
    hidden = lstm.init_hidden()
    lstm.zero_grad()
#     print(np.shape(line_tensor))
#     outputs = []
#     for i in range(line_tensor.size()[0]):
    output = lstm.forward(line_tensor.view(1,1,-1))
#         outputs.append(output)
#     outputs = torch.stack(outputs)
    
    loss = criterion(output, category_tensor.reshape(1,).long()) #just use the final output to calculate the loss
#     print('lossis', output, category_tensor)
#     print(torch.min(torch.max(outputs,1)[1].reshape(-1,1).float()),torch.min(torch.max(category_tensor, 1)[1]))
#     loss = criterion(torch.max(outputs,1)[1].reshape(-1,1).float(), category_tensor)
    loss.backward(retain_graph=True)
    optimizer.step()

    # Add parameters' gradients to their values, multiplied by learning rate
#     for p in lstm.parameters():
#         p.data.add_(learning_rate, p.grad.data)
#         print('post zero', p.grad.data)
    return output, loss.item()

In [None]:
EMBEDDING_DIM = 750
HIDDEN_DIM = 128
lstm = LSTM(EMBEDDING_DIM, HIDDEN_DIM, len(chars), len(chars))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(lstm.parameters(), lr=0.1)
start = time.time()
n_iters = 10000
print_every = 100
plot_every = 100
for iter in range(1, n_iters + 1):
    lstm.zero_grad()
    category, line, category_tensor, line_tensor = randomTrainingExample()
    line_tensor = one_hot_encode(line_tensor, n_categories)
#     print(line_tensor.size())
    output, loss = train(category_tensor.reshape(1,1), line_tensor)
#     output, loss = train(y_cold, X)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

100 1% (0m 7s) 2.7795  too as th /   ✗ (e)
200 2% (0m 23s) 5.9118 , Collette /   ✗ (')
300 3% (0m 54s) 5.4376 t Listener /   ✗ (")
400 4% (1m 33s) 2.6753  his own r /   ✗ (e)
500 5% (2m 24s) 3.9111 tinks" to  /   ✗ (b)
600 6% (3m 22s) 2.4977 ><br />Mel / d ✗ ( )


In [552]:
state_dict = lstm.state_dict()

OrderedDict([('word_embeddings.weight',
              tensor([[ 1.4019,  0.7592, -0.1744,  ..., -1.0821,  1.7427, -1.8301],
                      [ 0.2770, -1.2941,  1.1478,  ...,  0.3614, -1.0851, -0.6375],
                      [ 0.1778, -0.5748,  0.7365,  ..., -0.3112, -0.3029, -1.8446],
                      ...,
                      [-0.7961, -1.4003,  0.5016,  ..., -0.6924,  0.9593,  0.2017],
                      [-0.9170, -0.1393,  0.7344,  ...,  0.8253, -0.9071,  0.5072],
                      [-1.1378, -0.0972, -0.3705,  ..., -0.0116,  2.0097, -0.4688]])),
             ('lstm.weight_ih_l0',
              tensor([[ 2.2874e-02,  8.6096e-02,  7.1945e-02,  ...,  5.2616e-02,
                        2.2873e-02, -3.8136e-02],
                      [-8.5802e-02, -3.6061e-02, -9.4667e-03,  ..., -6.1878e-02,
                       -6.9095e-02,  4.1901e-02],
                      [-8.5189e-02, -5.4993e-02,  8.4023e-02,  ...,  1.1696e-02,
                       -7.9350e-02, -7.1315e-02]

In [626]:
def randomFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return int2char[category_i], category_i

def sample_from_probabilities(probabilities, topn=75):
    """Roll the dice to produce a random integer in the [0..ALPHASIZE] range,
    according to the provided probabilities. If topn is specified, only the
    topn highest probabilities are taken into account.
    :param probabilities: a list of size ALPHASIZE with individual probabilities
    :param topn: the number of highest probabilities to consider. Defaults to all of them.
    :return: a random integer
    """
    p = np.squeeze(np.exp(probabilities.detach().numpy()))
    p[np.argsort(p)[:-topn]] = 0
    p = p / np.sum(p)
    return np.random.choice(75, 1, p=p)[0]

line = 'hello how are y'

for ii in range(100):
    linenum = [char2int[jj] for jj in line]
    inline = one_hot_encode(torch.from_numpy(np.array(linenum[-10:])),75).view(1,1,-1)
#     print(nn.Softmax(lstm.forward(inline)))
    line += int2char[sample_from_probabilities(lstm.forward(inline))][0]

print(line)

torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.)
torch.Size([1, 75])
tensor(1.)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.0000)
torch.Size([1, 75])
tensor(1.)
torch.Size([1, 75])
tensor(1.)
torch.Size([1, 75])
tensor(1.)
torch.Siz

In [598]:
sample_from_probabilities(lstm.forward(inline))

52

In [477]:

class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        ## TODO: define the LSTM
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        ## TODO: define a dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        ## TODO: define the final, fully-connected output layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
        ''' Forward pass through the network. 
            These inputs are x, and the hidden/cell state `hidden`. '''
                
        ## TODO: Get the outputs and the new hidden state from the lstm
        r_output, hidden = self.lstm(x, hidden)
        
        ## TODO: pass through a dropout layer
        out = self.dropout(r_output)
        
        # Stack up LSTM outputs using view
        # you may need to use contiguous to reshape the output
        out = out.contiguous().view(-1, self.n_hidden)
        
        ## TODO: put x through the fully-connected layer
        out = self.fc(out)
        
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden

In [19]:
import string

all_letters = string.ascii_letters + " .,;?!1234567890'"

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

seq_len = 10 #read 10 chars at a time
n_hidden = 128 #hidden layer
n_letters = len(all_letters)
n_categories = n_categories
rnn = RNN(n_letters, n_hidden, n_categories)

input = lineToTensor('Albertddfw')
hidden = torch.zeros(1, n_hidden)
output, next_hidden = rnn(input[0], hidden)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


### Generate text

Once the model is trained, we can use it to generate completely new text in the style of your training data.  **Train a model using your original choice of corpus below, and generate some sample sentences.** Don't worry too much about your loss / accuracy during training, but instead check on the text your model is generating. Your generated text should be somewhat coherent, i.e. similar to your training text in structure, and not excessively mispelled.

An **example model architecture** is as follows (feel free to play around with different structures):
* Embedding (for each character in your vocab) of dimension 64
* Dropout of 20% for the embedding input to the RNN
* 2 LSTM layers, each of dimension 512 (play around with the number and dimension of hidden layers)
* Dropout of 50% for each LSTM layer
* Dense softmax layer of same dimension as your vocab size (e.g. if your vocab size is 100, this layer is the probabilty that your output is one of 100 possible characters)
    
**You should understand what each of the above elements are and how they work at a high level by the end of this week's exercise.**

### Generalizing the exercise
How do you think you can apply what you learned in the above exercise to other problems involving text? For example, how would you tackle the previous IMDB sentiment classification task using an RNN architecture? **Discuss below.**

(*Bonus*: create an RNN model for the IMDB classification task and discuss your results. How does the performance compare to your bag of words model?)

# Chapter 4: RNNs from scratch
Now that you understand how to use RNNs, it's time to build a basic one from scratch.  You won't understand how they work until you get stuck in the weeds! 

### Generate text (Version 2)
Your task is now to **build the forward pass of a simple RNN, without using any existing RNN APIs**. You can use PyTorch or Tensorflow (Keras is too high level for this exercise), both of which will automatically handle backpropagation for you.  If you use Tensorflow, please research and use Eager execution - it replaces Tensorflow's default graph / session framework, which is very difficult to learn and debug.

Similar to last week's exercise, create a class for your network (write forward and loss steps, allowing PyTorch or Tensorflow to handle backpropagation for you).  Consider appropriate sizes for your input, hidden and output layers - your __init__ method should take in the params `hidden_size`, `vocab_size`, and `embedding_size` (if you use embeddings). Using these variables, you should initialise three weight layers `input_layer`, `hidden_layer`, and `output_layer`.  In an RNN, you will also have to deal with another item - the `hidden_state`. (Note: your RNN structure may vary slightly from this depending on your learning materials, but the key part is always `hidden_state`)

You should **train your RNN on the same data and task as in Chapter 3.**

**How do the results of your basic RNN compare to your model in Chapter 3?**  What do you think explains the difference in performance? Discuss below.

Some relevant resources on LSTMs (and RNN theory) below if you are interested:
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/
* https://www.youtube.com/watch?v=93rzMHtYT_0&list=LLpNVCNE9cYqVrjb2O8bZUGg&index=2&t=0s
* https://www.youtube.com/watch?v=zQxm3Upr3_I
* http://harinisuresh.com/2016/10/09/lstms/

### Bonus Challenges (not required!):
1. Build the forward pass of an LSTM, without using any existing RNN APIs (as above, with PyTorch or Tensorflow)
1. Build a basic RNN or LSTM in Numpy - including forward pass as well as backpropogation