## Model description and written questions

a)
With character-level embeddings we start from a small set of possible inputs (a language usually has far fewer characters than words), so few dimensions are needed to represent them. Word embeddings, whether initialized randomly or pre-trained on other data, need more dimensions to distinguish the different meanings words can have.

The initial character vectors are later processed to keep only the relevant information (for example, with max pooling).

b) Total number of parameters:
* character-based embedding model:

    (V_char * e_char) + (f * e_char * k + f) + 2 * (e_char * e_char + e_char) 

* word-based lookup embedding model:

    V_word * e_word
    
* comparison:

For k = 5, V_word ≈ 50,000 and V_char = 96:


Assuming that the largest term in the formula for the character-based embedding model is V_char * e_char, the total number of parameters is no greater than 4 * V_char * e_char.
Dividing V_word * e_word by (4 * V_char * e_char) gives 50,000 / (4 * 96) ≈ 130, multiplied by e_word / e_char. If e_word is around 5 times larger than e_char, the word-embedding model has at least 650 times more parameters, so the difference is roughly three orders of magnitude.
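
The comparison can also be checked by plugging numbers into the two formulas. The sketch below only evaluates the formulas as written above; k, V_word and V_char come from the text, while `E_CHAR = 50`, `E_WORD = 256` and `F = E_WORD` are assumed values chosen for illustration.

```python
def char_model_params(v_char, e_char, f, k):
    """Parameter count of the character-based model, term by term as in the formula above."""
    embedding = v_char * e_char                # character embedding table
    convolution = f * e_char * k + f           # f filters of width k, plus one bias each
    highway = 2 * (e_char * e_char + e_char)   # two linear layers, as written in the formula
    return embedding + convolution + highway


def word_model_params(v_word, e_word):
    """Parameter count of the word-based lookup embedding model."""
    return v_word * e_word


K, V_WORD, V_CHAR = 5, 50_000, 96   # values from the text
E_CHAR, E_WORD = 50, 256            # assumed embedding sizes (not given in the text)
F = E_WORD                          # assumed: one filter per word-embedding dimension

char_total = char_model_params(V_CHAR, E_CHAR, F, K)
word_total = word_model_params(V_WORD, E_WORD)
print(char_total, word_total, round(word_total / char_total))
# prints: 74156 12800000 173
```

With these assumed values the convolution term is larger than V_char * e_char, so the assumption behind the 4 * V_char * e_char bound is not met and the ratio is closer to two orders of magnitude. A smaller f moves the ratio toward the estimate above.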

c) One advantage of using a convolutional rather than a recurrent architecture to generate word embeddings:
With a convolutional architecture we can apply max pooling, which is good at detecting specific patterns in a word irrespective of the surrounding characters (each convolution output is computed from a single window of characters only).
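
A small illustration of this point, as a sketch only: the filter below is hand-built to respond to the character sequence "ing" and the characters are one-hot encoded, whereas a real CharCNN learns both the embeddings and the filters. All names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def one_hot(ch):
    """One-hot vector for a lowercase letter (a stand-in for a learned embedding)."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]


def make_filter(pattern):
    """Hand-built filter of width len(pattern): one weight vector per window position."""
    return [one_hot(ch) for ch in pattern]


def conv1d(word, filt):
    """Slide the filter over the word; each output is a dot product over one window."""
    k = len(filt)
    embedded = [one_hot(ch) for ch in word]
    outputs = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        score = sum(w * x
                    for w_row, x_row in zip(filt, window)
                    for w, x in zip(w_row, x_row))
        outputs.append(score)
    return outputs


def max_over_time(feature_map):
    """Keep only the strongest response, wherever in the word it occurred."""
    return max(feature_map)


ing = make_filter("ing")
for word in ["singer", "walking", "walked"]:
    print(word, max_over_time(conv1d(word, ing)))
# prints:
# singer 3.0
# walking 3.0
# walked 0.0
```

The filter fires equally for "ing" in the middle of "singer" and at the end of "walking", because each output depends only on its own window and the pooling step discards the position.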

d) 
* Max-pooling advantage: Max pooling keeps only the maximum of each feature and discards the rest of the input values. This makes it better at detecting specific patterns (e.g. when generating word embeddings, it can pick out only the input that is highly relevant to the meaning of the word).
    
* Average-pooling advantage: Average pooling takes the mean of the input, so every available input value contributes to the result. This can be better at capturing the overall characteristics of a word.
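
The difference between the two pooling methods can be seen on a few made-up feature maps. The values below are hypothetical outputs of one filter across the character windows of three words, not results from the model.

```python
def max_pool(values):
    """Keep only the strongest activation."""
    return max(values)


def avg_pool(values):
    """Let every activation contribute equally."""
    return sum(values) / len(values)


# Hypothetical feature maps (one filter, four character windows each)
feature_maps = {
    "one strong match": [0.0, 0.0, 4.0, 0.0],
    "weak everywhere": [1.0, 1.0, 1.0, 1.0],
    "strong everywhere": [4.0, 4.0, 4.0, 4.0],
}

for name, values in feature_maps.items():
    print(f"{name}: max={max_pool(values)}, avg={avg_pool(values)}")
# prints:
# one strong match: max=4.0, avg=1.0
# weak everywhere: max=1.0, avg=1.0
# strong everywhere: max=4.0, avg=4.0
```

Max pooling separates the single strong match from the weak responses, which average pooling cannot do here. Average pooling in turn separates one strong match from a word that matches everywhere, which max pooling cannot do.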
 
## Implementation
h) I extended sanity_check.py with a check of the output dimension. The code follows the same convention as the checks already provided in the assignment, and it can be run multiple times without any changes to the Highway class.
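
The idea behind such a check can be sketched as follows. This is not the code added to sanity_check.py: the assignment's Highway class is not shown here, so the plain-Python `highway` function below is a hypothetical stand-in using the usual highway formula (a gated mix of a ReLU projection and the input), and all helper names are assumptions.

```python
import math
import random


def linear(weights, bias, x):
    """Matrix-vector product plus bias, on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway(x, w_proj, b_proj, w_gate, b_gate):
    """Stand-in highway layer: gate * relu(proj(x)) + (1 - gate) * x."""
    proj = [max(0.0, v) for v in linear(w_proj, b_proj, x)]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in linear(w_gate, b_gate, x)]
    return [g * p + (1.0 - g) * xi for g, p, xi in zip(gate, proj, x)]


def random_layer(size, rng):
    """Random square weight matrix and bias vector."""
    weights = [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
    bias = [rng.uniform(-1, 1) for _ in range(size)]
    return weights, bias


def check_output_dimension(size=8, seed=0):
    """The output of a highway layer must have the same dimension as its input."""
    rng = random.Random(seed)  # fixed seed, so the check is repeatable
    w_proj, b_proj = random_layer(size, rng)
    w_gate, b_gate = random_layer(size, rng)
    x = [rng.uniform(-1, 1) for _ in range(size)]
    out = highway(x, w_proj, b_proj, w_gate, b_gate)
    assert len(out) == size, f"expected {size} values, got {len(out)}"
    return out


check_output_dimension()
print("Output dimension check passed")
```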

## Analyzing NMT Systems 

a)

    ! grep -E '"traduzco":|"traduces":|"traduce":|"traduzca":|"traduzcas":|"traducir":' vocab.json

    "traducir": 5112,
    "traduce": 8567,


Only 2 of the 6 forms (traducir and traduce) are in the vocabulary.

### Why this is a bad thing for word-based NMT from Spanish to English. 
The missing forms have meanings close to those of the forms in the vocabulary, but a word-based model has no way of knowing this. It treats each of them like any other rare word, so it is difficult for the model to translate them well, even though there is a regular pattern between the Spanish and English forms, as in the example above.


### How our new character-aware NMT model may overcome this problem.
We build embeddings at the character level. The expectation is that words with similar meaning (e.g. traduces and traduce) get similar embeddings, so the model can pick up their meaning and translate them correctly.


    ! grep -E "traduce" vocab.json

    "traducen": 23247,
    "traduce": 8567,


b)
#### Word2Vec:
• financial - economic

• neuron - nerve

• Francisco - san

• naturally - occurring

• expectation - norms

#### CharCNN:
• financial - vertical

• neuron - Newton

• Francisco - France

• naturally - practically

• expectation - exception


#### What kind of similarity is modeled by Word2Vec. 
It's semantic similarity (based on similar meaning of words).

#### What kind of similarity is modeled by the CharCNN.

It's very often similarity based on characters (shared spelling), but also some semantic similarity.

#### How the differences in the methodology of Word2Vec and a CharCNN explain the differences you have found.

Word2Vec treats all words in the same way, irrespective of whether they look similar or not, and its embeddings are built only from information about co-occurrence in the input documents. CharCNN computes convolutions over character embeddings, so for words that are similar in terms of the characters they contain we can expect the output to be similar.
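
As a rough proxy for what a character convolution responds to, one can compare how many character bigrams the word pairs listed above share. This sketch is not the CharCNN itself: the choice of bigrams, the Jaccard measure and the function names are assumptions made for illustration, and all words are lowercased.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap of the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# (query, Word2Vec neighbour, CharCNN neighbour), taken from the lists above
pairs = [
    ("financial", "economic", "vertical"),
    ("neuron", "nerve", "Newton"),
    ("Francisco", "san", "France"),
    ("naturally", "occurring", "practically"),
    ("expectation", "norms", "exception"),
]

for query, w2v_neighbour, cnn_neighbour in pairs:
    print(f"{query}: "
          f"Word2Vec neighbour {ngram_overlap(query, w2v_neighbour):.2f}, "
          f"CharCNN neighbour {ngram_overlap(query, cnn_neighbour):.2f}")
```

For all five queries the CharCNN neighbour shares more character bigrams with the query than the Word2Vec neighbour does, which is consistent with the CharCNN modelling similarity of spelling.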

c)
#### Example where the character-based decoder produced an acceptable translation in place of UNK
    
#### Example where the character-based decoder produced an incorrect translation in place of UNK

    I'm <unk> that we have never come to know us.
