# Recurrent Neural Networks
> "Here I explore Recurrent Neural Networks and build a text generator"
---
- toc: false
- layout: post
- categories: ["Deep Learning", "NLP", "WIP"]
- title: Recurrent Neural Networks
- hide: true
---


### Objective:

Traditional neural networks struggle with tasks in Natural Language Processing (NLP). To understand a sentence, a model has to understand each word, its relationship to the words before and after it, and the context in which the word is used. Recurrent Neural Networks adopt a structure that helps address this problem, and they are the main focus of this article. I will then explore a few approaches and try to build a text generator.

### Language Model:

Language modelling is a central task in Natural Language Processing and sits at the heart of many systems such as speech recognition, machine translation and text generation. Given the words $x_{1}, ..., x_{t}$, a language model predicts the probability of the next word:

$$
 P(x_{t+1} = v_{j} | x_{t}, ... x_{1})
$$
where $v_{j}$ ranges over every word in the vocabulary. Here I will be building a language model using Recurrent Neural Networks.
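To make the formula concrete, here is a minimal sketch of about the simplest language model there is: one that conditions only on the previous word $x_{t}$ instead of the full history. The toy sentence and the names `bigram_model` and `bigram_probs` are my own illustration and are not part of the model built later in this post.

```python
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Estimate P(next_word | current_word) from raw counts."""
    counts = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        counts[current_word][next_word] += 1
    # Normalise each row of counts into a probability distribution
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

tokens = "the cat sat on the mat and the cat slept".split()
bigram_probs = bigram_model(tokens)

# Distribution over the words that were seen right after "the"
print(bigram_probs["the"])
```

An RNN language model plays the same role, but it conditions on the whole history $x_{1}, ..., x_{t}$ through its hidden state rather than on a single word.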


### What is a Recurrent Neural Network?

A recurrent neural network (RNN) is a type of neural network that carries a hidden state from one step to the next, so that what was computed at a previous step can be used as input to the current one. It is called recurrent because it repeatedly takes an input, uses it to update the hidden state and produces an output, and the updated hidden state is then fed back in along with the next input. The hidden state acts as a "memory" that keeps track of the previous inputs; the weights themselves are shared across steps and only change during training.

Consider an RNN that predicts the next word in a sentence. The first word is passed into the hidden layer, and the resulting hidden state is passed on to the next step along with the second word. This continues until all the inputs have been processed, at which point the final hidden state is passed to the output layer, which predicts the next word. In this way the RNN has an "understanding" of every word in the sequence and can make a reasonable prediction of the word that follows.


`# TODO: Add image of RNN for predicting the next word here::`

The structure used here is a many-to-one type of RNN. Many other RNN architectures have been developed to accomplish other tasks like music generation, sentiment analysis and machine translation. 
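The recurrence described above can be written in a few lines of NumPy. This is a minimal sketch and not the Keras model used later: the layer sizes, the random weights and the name `rnn_forward` are placeholders I picked for illustration, and it assumes the standard update $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, vocab_size = 4, 3, 5

# Randomly initialised weights; the same matrices are reused at every time step
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(vocab_size)

def rnn_forward(inputs):
    """Run a many-to-one RNN over a sequence and return next-word probabilities."""
    h = np.zeros(hidden_size)                  # initial hidden state
    for x_t in inputs:                         # one step per input vector
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    logits = W_hy @ h + b_y                    # only the final state feeds the output
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

sequence = rng.normal(size=(6, input_size))    # six made-up "word vectors"
next_word_probs = rnn_forward(sequence)
print(next_word_probs.shape, next_word_probs.sum())
```

Only the last hidden state reaches the output layer, which is what makes this a many-to-one network.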

In [None]:
import pandas as pd

| Advantages | Disadvantages |
| --- | --- |
| Can process inputs of any length | Slow computation |
| Model size does not increase with size of input | Difficulty accessing information from a long time ago |
| Computation takes into account historical information | |
| Weights are shared across time | |


### Long Short Term Memory (LSTM):

One of the main drawbacks of RNNs is the problem of vanishing or exploding gradients. Training an RNN means backpropagating through every time step, which behaves like a very deep network: the gradients are multiplied together step after step and can either explode to a very large number or diminish to a number close to 0. The problem gets worse as the input sequence gets longer, so plain RNNs were not very useful for making predictions once the sentences got very long.
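A rough way to see why this happens is to pretend that every time step scales the gradient by the same constant. That is a simplifying assumption of mine (in a real RNN the factor is a matrix that changes at every step), and `gradient_scale` is just an illustrative helper, but the arithmetic shows how quickly things go wrong:

```python
def gradient_scale(factor, timesteps):
    """Size of a gradient after being multiplied by `factor` once per time step."""
    return factor ** timesteps

for steps in (10, 50, 100):
    # A factor slightly below 1 shrinks the gradient, slightly above 1 blows it up
    print(steps, gradient_scale(0.9, steps), gradient_scale(1.1, steps))
```

Even factors as mild as 0.9 and 1.1 end up many orders of magnitude apart after 100 steps.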

To avoid this, a special type of RNN built from Long Short Term Memory (LSTM) units was developed. When we say LSTM network, we mean a neural network that contains an LSTM recurrent layer. LSTM cells were first introduced in 1997 in a paper by Sepp Hochreiter and Jürgen Schmidhuber. In the paper, the authors describe how LSTMs do not suffer from the same vanishing gradient problem experienced by plain RNNs and can be trained on sequences that are hundreds of timesteps long. Since then, the LSTM architecture has been adapted and improved, and variations such as gated recurrent units (GRUs) are now widely used and available as layers in Keras.

Let's build a simple LSTM network in Keras here:





#### Tokenization

The first step in handling text data is to tokenize it. Tokenization is the process of splitting the text into individual parts such as words or characters. First we need some text to tokenize. I am choosing The Strange Case of Dr. Jekyll and Mr. Hyde by R. L. Stevenson, one of my favourite books from middle school, and I am going to use the text from its Project Gutenberg page.
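Before touching the full book, here is a small sketch of what word-level and character-level tokenization look like on a single sentence. It uses plain Python rather than the Keras `Tokenizer`, and the variable names are my own; the Keras tokenizer assigns its ids by word frequency, so the exact numbers will differ.

```python
sentence = "Mr. Utterson the lawyer was a man of a rugged countenance"

word_tokens = sentence.lower().split()   # word-level: split on whitespace
char_tokens = list(sentence.lower())     # character-level: one token per character

# Map each distinct word to an integer id, starting at 1
# (0 is commonly reserved for padding)
word_index = {word: i for i, word in enumerate(sorted(set(word_tokens)), start=1)}
encoded = [word_index[word] for word in word_tokens]

print(word_tokens)
print(encoded)
```

The word "a" appears twice and gets the same id both times, which is the whole point of building a vocabulary.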

In [None]:
import requests

f = requests.get("https://www.gutenberg.org/files/43/43-0.txt")
theText = f.text

In [None]:
import re
from keras.preprocessing.text import Tokenizer

seq_length = 20


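# clean up text
text = theText.lower()
text = text.replace('\n', ' ')
# Note: ' +' matches every run of one or more spaces, so each gap between
# words becomes '. ' (visible in the generated text further down). Matching
# two or more spaces ('  +') would leave single-space word gaps alone.
text = re.sub(' +', '. ', text).strip()
text = text.replace('..', '.')
# pad punctuation with spaces so that each mark becomes its own token
text = re.sub(r'([!"#$%&()*+,-./:;<=>?@[\]^_`{|}~])', r' \1 ', text)
# collapse repeated whitespace
text = re.sub(r'\s{2,}', ' ', text)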

In [None]:
text



In [None]:
# TOKENIZATION
tokenizer = Tokenizer(char_level = False, filters = '')
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
token_list = tokenizer.texts_to_sequences([text])[0]

In [None]:
import numpy as np
from keras.utils import to_categorical

def generate_sequences(token_list, step):
    # Slide a window of seq_length tokens over the text;
    # the token that follows each window is the prediction target
    X = []
    y = []

    for i in range(0, len(token_list) - seq_length, step):
        X.append(token_list[i: i + seq_length])
        y.append(token_list[i + seq_length])

    # one-hot encode the targets
    y = to_categorical(y, num_classes = total_words)
    num_seq = len(X)

    print('Number of sequences:', num_seq, "\n")
    return X, y, num_seq

step = 1
seq_length = 20
X, y, num_seq = generate_sequences(token_list, step)
X = np.array(X)
y = np.array(y)

Number of sequences: 62372 
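To see what `generate_sequences` produces, here is the same sliding-window idea on a toy list of six token ids with a window of three. The names `toy_tokens`, `toy_seq_length` and `windows` are made up for this illustration, and the one-hot encoding step is left out.

```python
toy_tokens = [1, 2, 3, 4, 5, 6]
toy_seq_length = 3

# Each window of `toy_seq_length` tokens is paired with the token that follows it
windows = [
    (toy_tokens[i: i + toy_seq_length], toy_tokens[i + toy_seq_length])
    for i in range(len(toy_tokens) - toy_seq_length)
]

for inputs, target in windows:
    print(inputs, "->", target)
```

With a step of 1 the windows overlap heavily, which is why a single short book yields tens of thousands of training sequences.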



In [None]:
from keras.layers import Dense, LSTM, Input, Embedding, Dropout
from keras.models import Model
from keras.optimizers import RMSprop
n_units = 256
embedding_size = 100
text_in = Input(shape = (None,))
x = Embedding(total_words, embedding_size)(text_in)
x = LSTM(n_units)(x)
x = Dropout(0.2)(x)
text_out = Dense(total_words, activation = 'softmax')(x)
model = Model(text_in, text_out)
opti = RMSprop(learning_rate = 0.001)
model.compile(loss='categorical_crossentropy', optimizer=opti)
epochs = 10
batch_size = 32
model.fit(X, y, epochs=epochs, batch_size=batch_size, shuffle = True)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f159f710b50>

In [None]:
def sample_with_temp(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probs = np.random.multinomial(1, preds, 1)
    return np.argmax(probs)

def generate_text(seed_text, next_words, model, max_sequence_len, temp):
    output_text = seed_text
    seed_text = start_story + seed_text
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = token_list[-max_sequence_len:]
        token_list = np.reshape(token_list, (1, max_sequence_len))
        probs = model.predict(token_list, verbose=0)[0]
        y_class = sample_with_temp(probs, temperature = temp)
        output_word = tokenizer.index_word[y_class] if y_class > 0 else ''
        
        seed_text += output_word + ' '
        output_text += output_word + ' '
    return output_text
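The `temperature` argument controls how adventurous the sampling is: dividing the log-probabilities by a value below 1 sharpens the distribution towards the most likely word, while a value above 1 flattens it. The sketch below applies the same rescaling as `sample_with_temp` but skips the random draw so the effect is easy to see. `apply_temperature` and `toy_probs` are illustrative names of mine.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector the way sample_with_temp does, without sampling."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    scaled = np.exp(logits)
    return scaled / scaled.sum()

toy_probs = [0.5, 0.3, 0.2]
for t in (0.2, 1.0, 2.0):
    print(t, apply_temperature(toy_probs, t).round(3))
```

A temperature of 1.0 leaves the model's predictions unchanged, so the 0.75 used below nudges the generator towards its more confident choices.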

In [None]:
start_story = "Mr. Hyde"
generate_text("It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair", 200, model, 30, 0.75)

'It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair. sir . from . the . same . the . same . air . of . a . that . it . was . his . lawyer . the . old . lawyer . and . been . i . was . name . of . utterson ; . his . some . out . of . my . were . lawyer . for . his . own . door , . he . had . man . to . have . now . to . what . i . ! . you . to . i . never . but . an . and . i . do . you . a . , . to . i . had . work . but . with . the . face . of . so . for . it . was . that . door . . . . ? . was . the . here , . as . i . must . have . . . to . be . from . to . my . you . to . been . i . shall . to . the . light . of . to . so '

### Building the Dataset:

I am looking to train the network on a bunch of tweets from some users first, and then hopefully change the dataset to generate text pertaining to a certain subject like politics, science or sports.


Then I got an idea: create an offensive tweet generator using the offensive tweets dataset:


In [None]:
import pandas as pd
df = pd.read_csv('https://query.data.world/s/d6tdhprfqdnhepv72qna23w4nrejvp')

The data are stored as a CSV and as a pickled pandas dataframe (Python 2.7). Besides the tweet text, each data file contains 5 columns:

- `count`: number of CrowdFlower (CF) users who coded each tweet (the minimum is 3; sometimes more users coded a tweet when CF determined the judgments to be unreliable).
- `hate_speech`: number of CF users who judged the tweet to be hate speech.
- `offensive_language`: number of CF users who judged the tweet to be offensive.
- `neither`: number of CF users who judged the tweet to be neither hate speech nor offensive.
- `class`: class label chosen by the majority of CF users: 0 for hate speech, 1 for offensive language, 2 for neither.

In [None]:
df_trump = pd.read_csv("/content/tweets_01-08-2021.csv")

In [None]:
df_trump

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f
...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,RT @RandPaul: I don’t know why @JoeBiden think...,t,f,Twitter for iPhone,0,20683,2020-10-23 03:46:25,f
56567,1319484210101379072,RT @EliseStefanik: President @realDonaldTrump ...,t,f,Twitter for iPhone,0,9869,2020-10-23 03:42:05,f
56568,1319444420861829121,RT @TeamTrump: LIVE: Presidential Debate #Deba...,t,f,Twitter for iPhone,0,8197,2020-10-23 01:03:58,f
56569,1319384118849949702,Just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22 21:04:21,f


In [None]:
df.columns

Index(['Unnamed: 0', 'count', 'hate_speech', 'offensive_language', 'neither',
       'class', 'tweet'],
      dtype='object')

### Using `fastai` to build the Language Model


In [None]:
 #hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()



In [None]:
#hide
from fastbook import *

from IPython.display import display,HTML


In [None]:
from fastai.text.all import *

#dls_lm = DataBlock(blocks=TextBlock.from_df('tweet', is_lm=True), get_x=ColReader('text') ).dataloaders(df[:1000], bs=128, seq_len=80)
dls_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True), get_x=ColReader('text') ).dataloaders(df_trump, bs=128, seq_len=80)



In [None]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxup better & & xxup cheaper xxup healthcare . xxup vote ! xxbos xxmaj joe xxmaj biden got xxunk tied over the weekend when he was unable to properly deliver a very simple line about his decision to run for xxmaj president . xxmaj get used to it , another low xxup i.q . individual ! xxbos xxrep 3 "" xxunk : @realdonaldtrump xxmaj begging you to run for xxmaj president . xxmaj you must save this country .","xxup better & & xxup cheaper xxup healthcare . xxup vote ! xxbos xxmaj joe xxmaj biden got xxunk tied over the weekend when he was unable to properly deliver a very simple line about his decision to run for xxmaj president . xxmaj get used to it , another low xxup i.q . individual ! xxbos xxrep 3 "" xxunk : @realdonaldtrump xxmaj begging you to run for xxmaj president . xxmaj you must save this country . xxmaj"
1,"this , but xxmaj i ’ll be seeing them ! # xxup maga xxbos xxunk xxmaj thanks xxmaj joe . xxbos xxup rt @senategop : 🚨 xxup breaking 🚨 \n\n xxmaj the xxup u.s . economy added 2.5 million jobs in xxmaj may . \n\n xxmaj that 's the xxup biggest xxup jobs xxup increase xxup ever ! 🥳 🇺 🇸 https : / / t.co / xxunk … xxbos … .and ruined . xxmaj the xxmaj federal xxmaj government",", but xxmaj i ’ll be seeing them ! # xxup maga xxbos xxunk xxmaj thanks xxmaj joe . xxbos xxup rt @senategop : 🚨 xxup breaking 🚨 \n\n xxmaj the xxup u.s . economy added 2.5 million jobs in xxmaj may . \n\n xxmaj that 's the xxup biggest xxup jobs xxup increase xxup ever ! 🥳 🇺 🇸 https : / / t.co / xxunk … xxbos … .and ruined . xxmaj the xxmaj federal xxmaj government must"


In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.540043,3.323024,0.384695,27.744116,04:01


In [None]:
learn.save('1epoch')

Path('models/1epoch.pth')

In [None]:
learn2 = learn.load('1epoch')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.146276,3.161818,0.40752,23.613493,04:41
1,3.027328,3.09164,0.413115,22.013161,04:41
2,2.860141,2.996219,0.428058,20.009731,04:41
3,2.672321,2.9712,0.434699,19.51532,04:44
4,2.484839,2.990189,0.435672,19.889442,04:43
5,2.283492,3.017854,0.437816,20.447374,04:43
6,2.119269,3.064309,0.437837,21.419664,04:43
7,1.970804,3.114613,0.4366,22.524717,04:42
8,1.879339,3.145345,0.435817,23.227682,04:41
9,1.818112,3.158783,0.435178,23.541933,04:42


In [None]:
learn.save_encoder('finetuned')


In [None]:
TEXT = "The Chinese virus and Sleepy Joe "
N_WORDS = 40


preds = learn.predict(TEXT, N_WORDS, temperature=0.75)

In [None]:
print(preds)

The Chinese virus needs to be started again because it was learned to be a disaster . RT @secretarysonny : The American economy is the best in the world . Our farmers and ranchers are working overtime to help us


In [None]:
preds10 = [learn.predict(TEXT, N_WORDS, temperature=0.75) + "\n" for _ in range(10)]
print("\n".join(preds10))

The Chinese virus needs to be stronger than ever before . We are the reason we are so lucky to have our great country as China . China is their enemy , USA ! RT @whitehouse : LIVE

The Chinese virus needs to again rise . We should be using our own power but we need to get our country back ! Thank you for your support ! # MAGA https : / / t.co / t6ucyapriy

The Chinese virus needs to be lifted fast and it is time to move quickly . It is time to # construct a wall and bring our companies back to their country . Our military is finally getting stronger than ever ! #

The Chinese virus needs to be approved . We need to do it the same way as the China Virus does . We have to start working together ( we have a vaccine to solve ) . This is fantastic

The Chinese virus needs to be stronger than ever . We will be stronger than ever before . We will have no better leadership in China and more in favor of BAN - picking . """ What

The Chinese virus needs to be redone http : / / t.co / vqhn0u1vgc 

In [None]:
TEXT = "The Chinese virus and Sleepy Joe "
N_WORDS = 40


preds10 = [learn.predict(TEXT, N_WORDS, temperature=0.75) + "\n" for _ in range(10)]
print("\n".join(preds10))

The Chinese virus and Sleepy Joe Biden , together with the Fake News Media , are looking to China . They are playing golf with us again , big time ! Happy BIRTHDAY to our great

The Chinese virus and Sleepy Joe Biden are saying that our people are being forced to leave China for other countries . They believe they are bad . China is just behind us and our Country . China wants us

The Chinese virus and Sleepy Joe Biden are doing a great job . People are living with them . The American people are sick and tired of it . They want safety & & security . i also want to

The Chinese virus and Sleepy Joe Biden would not be able to make a deal on the Coronavirus ! RT @senategop : Democrats are trying to steal the 2020 election by what they have always accomplished . 

 But the

The Chinese virus and Sleepy Joe Biden why we wo n’t have jobs at home because of the Biden Administration . Biden is a phony Democrat candidate ! RT @senatemajldr : After President Trump ’s disastrous

The Chinese viru

### Perplexity in Natural Language Processing

[Perplexity in language models](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94)
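The `Perplexity()` metric reported in the fastai training tables above is the exponential of the average cross-entropy loss. As a quick check, exponentiating the validation loss from the first `fit_one_cycle` run should reproduce the perplexity column. The helper name `perplexity` is mine.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy loss (in nats)."""
    return math.exp(cross_entropy_loss)

# valid_loss from the first fit_one_cycle run above
print(perplexity(3.323024))
```

Roughly speaking, a perplexity of about 27.7 means the model is as uncertain as if it were choosing uniformly among 27 or 28 tokens at each step.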

In [None]:
from fastai.text.all import *
path = untar_data("http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz")
path = untar_data('http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz')
# Using the Newsgroup dataset to collect the data

KeyboardInterrupt: ignored

In [None]:
path

Path('/root/.fastai/data/20news-19997.tar')

In [None]:
path2 = path/'talk.religion.misc/84127'


In [None]:
files = get_text_files(path, folders = ['talk.religion.misc', 'alt.atheism', 'soc.religion.christian'])

In [None]:
files
txt = files[0].open().read(); txt

(#0) []

In [None]:
path2.open().read()

'Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:121245 talk.religion.misc:84127\nPath: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!cs.utexas.edu!uunet!gatekeeper.us.oracle.com!barrnet.net!kyle.eitech.com!kyle.eitech.com!not-for-mail\nFrom: ekr@kyle.eitech.com (Eric Rescorla)\nNewsgroups: talk.abortion,talk.religion.misc\nSubject: Re: What part of "No" don\'t you understand?\nDate: 24 Apr 1993 19:39:28 -0700\nOrganization: EIT\nLines: 37\nMessage-ID: <1rctl0$ka3@kyle.eitech.com>\nReferences: <1993Apr24.002509.4017@midway.uchicago.edu> <1rbh3n$hav@kyle.eitech.com> <1993Apr24.214843.10940@midway.uchicago.edu>\nNNTP-Posting-Host: kyle.eitech.com\n\nIn article <1993Apr24.214843.10940@midway.uchicago.edu> eeb1@midway.uchicago.edu writes:\n>In article <1rbh3n$hav@kyle.eitech.com>\n>ekr@kyle.eitech.com (Eric Rescorla) writes:\n>>In article <1993Apr24.002509.4017@midway.uchicago.edu>\n>>eeb1@midway.uchicago.edu writes:\n>\n>>>

### Other code




In [None]:
#hide
import os
import tweepy

# Credentials are read from environment variables (the variable names here
# are placeholders) so that they are never published with the notebook
api_key = os.environ["TWITTER_API_KEY"]
api_secret = os.environ["TWITTER_API_SECRET"]
bearer_tkn = os.environ["TWITTER_BEARER_TOKEN"]

auth = tweepy.AppAuthHandler(api_key, api_secret)




In [None]:
api = tweepy.API(auth)
# Tweepy 4 renamed API.search to API.search_tweets
for tweet in tweepy.Cursor(api.search, q='#space').items(10):
    print()
    print(tweet.text)
    print("***************************************************************")