<h1 style="color:brown;">  Recurrent neural nets</h1> 

![Looping network](./img/RNN_colah.png)

##### RNNs can produce amazing results <a href ="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">blog</a>

### Lesson plan 
1. Why are classic neural nets not enough?
2. Word embeddings - word2vec
3. Categorical embeddings
4. RNN 
5. Takeaways
6. Hands on word2vec

In [3]:
import numpy as np

### Classic nets vs. RNNs

Classic:
    - Inputs and outputs must be fixed-size vectors
    - No notion of position or time

RNNs: 

![](./img/diags.jpeg)

### Word embeddings

#### N-gram

- Which word has the highest probability of coming next, given the n-1 specific words we have just seen?

In [None]:
Words: Thank, you, Hello, goodbye

In [None]:
If we have 4 words and we are looking at 2-grams:
    Example: P(you | Thank) = number of times "Thank you" occurs divided by number of times "Thank" occurs

We need to calculate the probability of
 - Thank Hello
 - Thank you
 - Thank goodbye
 - Thank Thank

So we need 4 calculations for a single preceding word (and 4 * 4 = 16 to cover every possible 2-gram)
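The count-based estimate described above can be sketched in a few lines. The toy token list below is made up for illustration (it is not part of the lesson data), and `bigram_prob` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy corpus using the four-word vocabulary from the example.
tokens = ["thank", "you", "hello", "thank", "you", "goodbye", "thank", "hello"]

unigram_counts = Counter(tokens)                  # how often each word occurs
bigram_counts = Counter(zip(tokens, tokens[1:]))  # how often each adjacent pair occurs

def bigram_prob(prev_word, next_word):
    """Estimate P(next_word | prev_word) as count(prev, next) / count(prev)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]

vocab = ["thank", "you", "hello", "goodbye"]
# One calculation per candidate next word: 4 in total for the word "thank".
probs = {word: bigram_prob("thank", word) for word in vocab}
print(probs)
```

With a real vocabulary the same table needs one entry per word combination, which is why the count grows so quickly with n.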

In [17]:
def how_many_calc_to_do(gram, voc_size):
    '''Return the number of possible word sequences of
    length `gram` over a vocabulary of `voc_size` words.'''

    # Plain Python ints never overflow; np.prod on int64 values silently wraps around here
    return voc_size ** gram

In [15]:
how_many_calc_to_do(7, 10000)
# Notice that this is only an approximation and it can be implemented in more efficient ways.

10000000000000000000000000000

![](./img/one_hot_encoding_distance_on_3d.png)

#### Insight I: 
    We can turn each word into a random vector of size 100, 200 or 300,
    train a classic neural net to predict the next word, and update both the weights and the word vectors.
    You can think of it as just another layer of weights multiplying the one-hot encoded inputs.
<a href="http://hunterheidenreich.com/blog/intro-to-word-embeddings/">word_embed blog</a>

<a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa">word_embed blog II</a>
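The "just another layer of weights" view of Insight I can be checked numerically: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix. This is a minimal sketch with made-up sizes (`vocab_size = 4`, `embed_dim = 3`), not the dimensions used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 4, 3  # toy sizes, assumed for illustration

# Randomly initialised embedding matrix: one row per word in the vocabulary.
E = rng.normal(size=(vocab_size, embed_dim))

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ E     # the "extra layer of weights" view
via_lookup = E[word_index]   # the equivalent (and cheaper) row lookup

print(np.allclose(via_matmul, via_lookup))
```

During training the rows of `E` are updated by backpropagation like any other weights, which is how the random vectors turn into meaningful ones.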


![W2V](./img/w2v.png)

##### Objective: maximize the sum of log-probabilities of each word given its observed window

This idea is very strong in comparison to other options: 

    Bag of words - just counts occurrences
    TF-IDF - scores how informative a word is, but says nothing about its relation to other words
    One-hot encoder - to the computer, paris-france is the same distance as paris-blabla
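The one-hot limitation is easy to verify: every pair of distinct one-hot vectors is exactly the same distance apart, so the encoding carries no notion of similarity. A small sketch with a three-word vocabulary chosen for illustration:

```python
import numpy as np

words = ["paris", "france", "blabla"]
one_hot = np.eye(len(words))  # row i is the one-hot vector of words[i]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

d_paris_france = euclidean(one_hot[0], one_hot[1])
d_paris_blabla = euclidean(one_hot[0], one_hot[2])

# Both distances are identical, whatever the words mean.
print(d_paris_france, d_paris_blabla)
```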

- Distance and direction are meaningful! 'King' - 'man' ≈ 'Queen' - 'woman'
- Now the words 'massive' and 'huge' are similar!
- Extends to sentences, paragraphs and documents.
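To make the 'King' - 'man' relation concrete, here is a sketch with hand-made 2-D vectors. The numbers are invented so that one axis loosely encodes "royalty" and the other "gender"; real embeddings are learned and only satisfy such relations approximately.

```python
import numpy as np

# Hand-made toy vectors (assumed for illustration): axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two difference vectors point in the same direction.
same_direction = np.allclose(toy["king"] - toy["man"], toy["queen"] - toy["woman"])

# Analogy query: king - man + woman, then look for the closest remaining word.
query = toy["king"] - toy["man"] + toy["woman"]
candidates = [w for w in toy if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(query, toy[w]))
print(same_direction, nearest)
```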

![](./img/see_attached_word_embed.png)

##### We have reduced the dimension of the vocabulary by a big factor!
Example: from 80,000 to 300

![](https://www.lemay.ai/demo/wordEmbedding/)

##### Insight I.I  The same thing can be applied to any categorical variable. 
##### With enough training data we can learn its continuous position in space - state of the art

![categorical_embed](./img/categorical_embedding.png)

![german_states](./img/german_states_mapped_2D.png) 

#### Insight II: 
        Even if we can include many words (a large n-gram), how can we capture context?
        If the text mentions Queen Mary and a few pages later talks about "the queen", how will our network
        know her name is Mary?

### Idea I: Memory - your current choices are based on previous understanding

Add a cell to the network that keeps a memory of previous steps and combines it with the current input to predict the next word.

![](./img/memory_rnn.png)
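The memory idea can be sketched as a single recurrent step that mixes the current input with the previous hidden state. The weight names (`W_x`, `W_h`, `b`), the `tanh` activation and the toy sizes are assumptions for illustration, in the style of a basic (Elman-type) RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # toy sizes, assumed for illustration

W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Combine the current input with the memory of previous steps."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(5, input_dim))  # five time steps of made-up input
h = np.zeros(hidden_dim)                    # empty memory at the start
for x_t in sequence:
    h = rnn_step(x_t, h)  # the same weights are reused at every step

print(h.shape)
```

Note that the sequence can have any length: the loop just runs for more or fewer steps with the same weights.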

#### Problem: the derivative (aka gradient) is unstable: it tends to either explode or vanish.

Imagine the memory at time t is the memory at time t-1 times a weight (a scalar, for simplicity):
    $h_t = W * h_{t-1}$
Then:
    $h_t = W^t * h_0$

If $|W| > 1$ then $|h_t| \to \infty$ (exploding); if $|W| < 1$ then $h_t \to 0$ (vanishing).
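A quick numeric check of the scalar recurrence shows how fast repeated multiplication blows up or dies out. The weights 0.9, 1.0 and 1.1 and the 50 steps are arbitrary values picked for illustration.

```python
h0 = 1.0
steps = 50

results = {}
for w in (0.9, 1.0, 1.1):
    # h_t = W^t * h_0 for a scalar weight W
    results[w] = (w ** steps) * h0
    print(f"W = {w}: h_{steps} = {results[w]:.6g}")
```

A weight only 10% away from 1 already shrinks or grows the signal by orders of magnitude after 50 steps, and the gradients flowing backwards through time behave the same way.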

### Solution: LSTM/GRU

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">LSTM/GRU blog</a>


![](./img/RNNs.png)

![](./img/LSTM_colah.png)

#### Idea II: gates: don't multiply, use addition for memory!

##### Components

    - cell state
    - candidates  

##### Gates
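- forget - which information to throw away from the cell state (0 means throw it all away)
- input - which values we are going to update
- output - filters which values of the cell state we are going to output

The current cell state is the sum of two terms: the previous cell state scaled by the forget gate, plus the new candidates scaled by the input gate: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$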
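The components and gates above can be put together in a single LSTM step. This is a sketch following the standard formulation from the linked blog post; the parameter names, the random initialisation and the toy sizes are assumptions, and no training is done here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state and the new cell state."""
    W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o = params
    z = np.concatenate([h_prev, x_t])     # previous output joined with current input
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: 0 = throw away, 1 = keep
    i_t = sigmoid(W_i @ z + b_i)          # input gate: which values to update
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate values
    o_t = sigmoid(W_o @ z + b_o)          # output gate: which cell values to expose
    c_t = f_t * c_prev + i_t * c_tilde    # additive cell-state update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # toy sizes, assumed for illustration
weights = [rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim)) for _ in range(4)]
biases = [np.zeros(hidden_dim) for _ in range(4)]
params = (*weights, *biases)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # five made-up time steps
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

The key line is the cell-state update: old memory is carried forward by addition rather than by repeated multiplication with a weight matrix, which is what eases the gradient problem.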

### Extension: attention

<a href="https://www.youtube.com/watch?v=SysgYptB198">Intuition</a>

###### - Translate part by part
###### - Use attention weights: how much attention should you give to each word in the input (the weights are updated for each new output word)

![](./img/attention.png)
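Attention weights can be sketched as a softmax over scores between the current decoder state and each encoded input word. The dot-product scoring and the random toy vectors are assumptions for illustration; attention models differ in how they compute the scores.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 3))  # one vector per input word (4 words, made up)
decoder_state = rng.normal(size=3)        # state while producing the next output word

scores = encoder_states @ decoder_state   # dot-product scoring (an assumption)
weights = softmax(scores)                 # how much attention each input word gets
context = weights @ encoder_states        # weighted average of the input vectors

print(weights, context.shape)
```

The context vector is recomputed for every output word, so each translated word can focus on a different part of the input.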

### Takeaways:
    

##### Word embeddings
- Word/categorical embeddings give meaning to words in relation to one another
- Word/categorical embeddings are computationally efficient
- Training is done through a classic NN with a small window around each word

##### RNN
- Old generation RNNs suffered from exploding/vanishing gradients
- New generation RNNs (commonly LSTM or GRU) use memory gates to mitigate this problem
- RNNs are just multiple copies of a NN connected by the hidden layer
- Training is again done by backpropagation
- Weights are shared across all time steps
- RNNs can be used for any sequence. Unlike classic time series models, they can include both time and other features.
- They are flexible in input and output sizes
- Amazing results in NLP, recommendations and many more.
- Many flavours - BRNN, CRNN...

##### Attention
- Typical for translation/images
- Weights all the words in the source language to decide how much each should influence the next word of the translation
- Components: word weights, BRNN, RNN, context vectors.

### Hands on Word2Vec/word embedding

In [1]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/c8/3a/32a1edf4f335eba0873021a7ddb3230f05dedd2b5450960118b402ca0771/gensim-3.8.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (24.7MB)
Collecting smart-open>=1.7.0 (from gensim)
  Downloading https://files.pythonhosted.org/packages/37/c0/25d19badc495428dec6a4bf7782de617ee0246a9211af75b302a2681dea7/smart_open-1.8.4.tar.gz (63kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Stored in directory: /Users/flatironschool/Library/Caches/pip/wheels/5f/ea/fb/5b1a947b369724063b2617011f1540c44eb00e28c3d2ca

In [2]:
import gensim
import numpy as np
import json
import string

##### Reading in the data

In [5]:
with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [6]:
len(data)

216930

In [7]:
# Let's look at the first element in our list
data[0]

{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [9]:
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. How can we put our _Jeopardy!_ clues in that shape?

# Build the punctuation-stripping table once, outside the loop
strip_punct = str.maketrans('', '', string.punctuation)

text = []

for clue in data:
    # Remove punctuation, split on spaces, then lowercase every word
    sentence = clue['question'].translate(strip_punct).split(' ')
    text.append([word.lower() for word in sentence])

In [10]:
# Let's check the new structure of our first clue
text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']

#####  Constructing the model

In [11]:
# Instantiating a Word2Vec object with the corpus builds the
# vocabulary and already runs a first round of training.
model = gensim.models.Word2Vec(text, sg=1)
## sg=1 selects skip-gram (the default, sg=0, is CBOW)

##### training 

In [12]:
# The constructor above has already trained on 'text';
# calling 'train()' again runs additional epochs on the same corpus.
model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

(11337286, 15849970)

In [13]:
# The '.wv' attribute stores the word vectors
model.wv

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x1a1c70d7b8>

In [14]:
model.wv['child']

array([ 0.12494138,  0.11836532, -0.14108932,  0.32135296,  0.1776498 ,
        0.55454534,  0.40669787,  0.05038987,  0.33630976, -0.40864718,
        0.25312513, -0.27528214, -0.44418907, -0.19268678,  0.21038345,
        0.34953833, -0.74776876,  0.26504254, -0.22842602, -0.37754226,
        0.25178295, -0.18551709,  0.18449825, -0.22050059,  0.5227614 ,
       -0.33762783,  0.1946886 ,  0.49459988,  0.02129038,  0.17589822,
       -0.16853447, -0.24504685,  0.07491993, -0.40641135, -0.2511042 ,
        0.12094141,  0.24607868, -0.15969782,  0.34866145, -0.01459805,
       -0.08847884,  0.38575873,  0.62831104, -0.03583027, -0.67448795,
       -0.3662738 ,  1.0437644 , -0.13409522, -0.28969002, -0.22915882,
        0.03881628,  0.35262918,  0.20637606, -0.1248785 ,  0.287397  ,
        0.8215341 , -0.00116134, -0.00531662, -0.23599575, -0.21356125,
        0.26732796,  0.07857442, -0.14343904, -0.17514062, -0.16967578,
       -0.1732776 ,  0.57790226, -0.30380446,  0.07053508, -0.20

In [15]:
### model.wv methods
#### 'most_similar()' and 'similarity()'

In [16]:
model.wv.most_similar('happiness')

[('shame', 0.7463243007659912),
 ('wherefore', 0.7310535907745361),
 ('kindness', 0.7243857383728027),
 ('shakespearebr', 0.7124285697937012),
 ('existential', 0.703360378742218),
 ('pity', 0.7000101804733276),
 ('prosperity', 0.6970384120941162),
 ('despair', 0.6969484686851501),
 ('vile', 0.6958224177360535),
 ('compassion', 0.6954309940338135)]

In [17]:
model.wv.most_similar('furniture')

[('ceramic', 0.7190740704536438),
 ('artwork', 0.7131906747817993),
 ('fastener', 0.705875813961029),
 ('decorative', 0.694380521774292),
 ('bicycles', 0.6937385201454163),
 ('drip', 0.6905083656311035),
 ('integral', 0.6853310465812683),
 ('pottery', 0.6816097497940063),
 ('linen', 0.6805097460746765),
 ('flooring', 0.6797986030578613)]

In [18]:
model.wv.similarity('furniture', 'jewelry')

0.66111517

In [19]:
model.wv.most_similar(positive=['cat', 'animal', 'pet', 'mammal'])
# 'positive' words pull the query vector towards them and 'negative' words push it away;
# results are ranked by cosine similarity to the combined vector (not Euclidean distance)

[('rodent', 0.8043734431266785),
 ('marsupial', 0.8014607429504395),
 ('parrot', 0.8013291358947754),
 ('carnivore', 0.801075279712677),
 ('reptile', 0.793319821357727),
 ('giraffe', 0.7925571203231812),
 ('shorthaired', 0.7915731072425842),
 ('arthropod', 0.7885787487030029),
 ('leopard', 0.7780919671058655),
 ('predatory', 0.7752895951271057)]

In [20]:
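# Caution: in gensim 3.x a bare string passed as 'negative' is iterated character by
# character ('p', 'e', 't'); use negative=['pet'] to subtract the word itself.
model.wv.most_similar(positive=['cat', 'animal'], negative='pet')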

[('rodent', 0.4051472842693329),
 ('insect', 0.3996090292930603),
 ('parrot', 0.3779188394546509),
 ('extinction', 0.3717319369316101),
 ('marsupial', 0.3656271994113922),
 ('lizard', 0.3602727949619293),
 ('dog', 0.3567514419555664),
 ('dogs', 0.3431258797645569),
 ('animals', 0.3428812026977539),
 ('sheep', 0.34268712997436523)]

In [21]:
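# Caution: in gensim 3.x a bare string passed as 'negative' is iterated character by
# character ('m', 'a', 'n'); use negative=['man'] to subtract the word itself.
model.wv.most_similar(positive=['king', 'woman'], negative='man', topn=3)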

[('throne', 0.3003922700881958),
 ('empress', 0.28956758975982666),
 ('duchess', 0.2547074854373932)]

In [22]:
model.wv.most_similar(positive='usa')

[('pageant', 0.648227870464325),
 ('brisbane', 0.6109720468521118),
 ('xmas', 0.5956884622573853),
 ('fargo', 0.59088534116745),
 ('tyra', 0.5895758867263794),
 ('guides', 0.5860363245010376),
 ('supermarket', 0.5858744382858276),
 ('surfin', 0.5839501023292542),
 ('sweetheart', 0.5835374593734741),
 ('englishspeaking', 0.5830857157707214)]

In [23]:
model.wv.most_similar('canada')

[('commonwealth', 0.6723837852478027),
 ('uruguay', 0.6470383405685425),
 ('marianas', 0.6467249393463135),
 ('arenas', 0.6382854580879211),
 ('venezuela', 0.6369739174842834),
 ('commonwealths', 0.6311945915222168),
 ('everglades', 0.6306896209716797),
 ('wedged', 0.630102276802063),
 ('zimbabwe', 0.6281924247741699),
 ('zambia', 0.6267754435539246)]

In [24]:
model.wv.most_similar('shakespeare')

[('shakespeares', 0.7368870973587036),
 ('sophocles', 0.7167152166366577),
 ('euripides', 0.711763858795166),
 ('hamlet', 0.6985545754432678),
 ('ibsen', 0.6967687010765076),
 ('shakespearean', 0.6942389011383057),
 ('falstaff', 0.6783837080001831),
 ('rur', 0.6782490015029907),
 ('romeo', 0.6684530377388),
 ('moliere', 0.6680902242660522)]

In [25]:
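# Caution: in gensim 3.x a bare string passed as 'negative' is iterated character by
# character ('u', 's', 'a'); use negative=['usa'] to subtract the word itself.
model.wv.most_similar(positive=['president', 'germany'], negative='usa')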

[('emperors', 0.23164430260658264),
 ('russia', 0.22659927606582642),
 ('inaugural', 0.21296563744544983),
 ('dictator', 0.19721680879592896),
 ('emperor', 0.19365380704402924),
 ('headlines', 0.1858173906803131),
 ('fascist', 0.18227535486221313),
 ('france', 0.18017175793647766),
 ('ussr', 0.16986921429634094),
 ('milan', 0.16904905438423157)]

In [23]:
#### 'doesnt_match()'

In [24]:
model.wv.doesnt_match(['breakfast', 'lunch', 'frog', 'food'])
# picks the one that's most dissimilar


'frog'
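What "most dissimilar" means can be sketched with toy vectors: normalise every word vector, average them, and return the word whose vector is least aligned with that average. This approximates the idea behind `doesnt_match()`; the 2-D vectors and the helper name `odd_one_out` are made up for illustration and are not taken from the trained model.

```python
import numpy as np

# Hand-made toy vectors (assumed): the three meal words point one way, 'frog' another.
toy_vectors = {
    "breakfast": np.array([0.9, 0.1]),
    "lunch": np.array([0.8, 0.2]),
    "food": np.array([0.85, 0.3]),
    "frog": np.array([0.1, 0.9]),
}

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    unit = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean  # cosine similarity of each word to the mean direction
    return words[int(np.argmin(sims))]

result = odd_one_out(["breakfast", "lunch", "frog", "food"], toy_vectors)
print(result)
```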