# I. Short Essay Responses

## 1. Select one career or industry that makes use of applied NLP.

One industry where NLP is applied extensively is eCommerce.

### 1A. Explain generally how that field  or career utilizes NLP.

Two places where NLP is used in eCommerce are **comparison shopping** and **extracting product features** from the product description.

**Comparison Shopping**: When a customer comes to an eCommerce website, they may type a general description of the product that they are looking for. By presenting only the most relevant products to the customer, the eCommerce company can potentially lock in the customer's decision faster and close the sale of the product. If the customer instead has to wade through irrelevant search results to get to what they want, the company might end up losing this customer. NLP can aid this comparison process by finding the products that most closely match the user's search query.

**Extracting Product Features from Product Description**: In today's world, where customers may have short attention spans, vendors often misuse the product description field by adding not only the product description but also many of the product features. This is done to catch the customer's attention faster. However, we can use NLP to extract the feature information from the product description to build an accurate catalog of the products being sold. This information can also be used to feed the feature filters that show up on the comparison shopping page.


### 1B. Explain at least some methods of NLP that are very likely to be used in the career or industry you selected.

**Comparison Shopping**: One of the NLP methods that this might use is sentence similarity. Here, "sentence" refers to the product description. We may choose to train a model (to get the word vectors or embeddings) using the vocabulary of the products on our eCommerce website. Using these word vectors, we can compute how similar the product descriptions are to what the user has typed in the search box. While there are various methods to extract a single "vector" corresponding to the product description, we would start with a very basic one: take the average of the word vectors in the product description and the average of the word vectors in the user's search query, then compare the two to find the closest matching products.
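The averaging approach described above can be sketched in a few lines. This is a minimal illustration, not a production search system: the 3-dimensional vectors in `toy_vectors`, the product names and the helper functions are all hypothetical, and a real system would use learned or pretrained embeddings.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (assumption for illustration only);
# a real system would learn these from the catalog or load pretrained ones.
toy_vectors = {
    "camera":  np.array([0.9, 0.1, 0.0]),
    "digital": np.array([0.8, 0.2, 0.1]),
    "lens":    np.array([0.7, 0.3, 0.0]),
    "leather": np.array([0.0, 0.9, 0.2]),
    "wallet":  np.array([0.1, 0.8, 0.3]),
}

def average_vector(text):
    """Average the vectors of the known words in `text`.

    Assumes at least one word of `text` is in `toy_vectors`.
    """
    words = [w for w in text.lower().split() if w in toy_vectors]
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b`."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

products = ["digital camera lens", "leather wallet"]
query = "camera"

# Score every product description against the averaged query vector.
scores = {
    p: cosine_similarity(average_vector(query), average_vector(p))
    for p in products
}
best_match = max(scores, key=scores.get)
print(best_match)
```

With these toy vectors the camera-related description scores far higher than the wallet, which is the ranking behaviour the search page would rely on.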

**Extracting Product Features from Product Description**: The method most likely to be used in this case is sentence segmentation. Before we extract the exact product features from the product description, we would want to break the description up into the sections that describe each individual feature. For this, we may first have to create a training dataset that indicates where each feature description starts and ends within the product description. We can then train a model to recognize these boundaries and use the trained model to segment product descriptions into the constituent features of the product.

### 1C. Give at least one specific example of a use case for NLP within the chosen field, and explain how the problem or situation is (or could be) improved by applying NLP.

Let's look at the example of **Comparison Shopping** to see how this could be improved by using NLP. Suppose a customer is specifically searching for an older version (version 1) of a camera on the eCommerce website because it is cheaper than the latest version (version 2). The eCommerce company may have many vendors selling both versions of this camera. However, the company may not have a dedicated "product version" field for this product, so the vendors may add the product version to the description itself. Because the word vector for version 1 may differ from the word vector for version 2, a "product similarity" comparison using the word vectors will most likely present version 1 as the top search result, since it is closest to the user's search query. Had we not used this method, we might have presented version 2 as the top result (since that is what more customers might be searching for at this point in time), which would not have been relevant to this customer. In the worst case, if version 1 did not show up on the first page of results, the customer might have gone to a different eCommerce website, thinking that version 1 is not offered here.


## 2. Choose one of the “trade-offs” in NLP that was covered in the asynchronous materials for this course.

In terms of trade-offs in NLP, the one that resonates with me is Feature Learning vs. Feature Engineering. 

### 2A. Explain the trade-off in general terms. Define the two choices.

In simple terms, **Feature Learning** means using (a lot of) data to train a machine learning or deep learning algorithm that automatically figures out the “important features” needed to get the best metrics possible. In this approach, we rely on the machine to figure out (mine) the right features from a large corpus of data.

In contrast, **Feature Engineering** relies on a Subject Matter Expert (SME) to define the right set of features to get the best metric for the problem we are trying to solve. This method relies on the SME's expertise and intuition to define these features.

### 2B. Explain the benefits and weaknesses of each side of the trade-off.  Include at least one benefit and one weakness of each.

In terms of pros and cons, **Feature Learning** is useful when we already have a lot of labelled data available, in which case we can feed it into our algorithm quickly (maybe with a little bit of data cleaning and EDA). This method requires minimal interaction with the SMEs, who may otherwise be busy with their "day" jobs and may not be available to help when needed. On the other hand, if large amounts of labelled data are not available, the algorithms will most likely fail to recognize the right set of "features" from the data. This approach also tends to alienate the SMEs, since they feel that the decisions are being made by a "black box" algorithm and these decisions may not always be explainable or make sense to them.

In contrast, the advantage of **Feature Engineering** by an SME is that the experts feel somewhat in control and have a vested interest in the success of the work. This can lead to better buy-in from management teams. In addition, in many cases large amounts of labelled data may not be available, and hand engineering of the important features by an SME would then be a logical choice. The disadvantage of feature engineering is that we must find a cooperative SME who is willing to work with the data scientist/machine learning engineer to define these features.

### 2C. Describe a work-situation that would make one of the choices in the trade-off much better, in terms of practical outcomes for you and your stakeholders on a project.

The decision of whether to use Feature Learning or Feature Engineering depends on the stakeholders and the users of the models. The domain that I work in is Analog Circuit Design. This field is considered highly specialized, and Subject Matter Experts (SMEs) consider it to be more of an "art" form than a science. Hence any notion of "automation" is frowned upon by a large section of practitioners. Because of this, "Feature Engineering" by an SME might be more appropriate. In addition, it will likely lead to better accuracy, as illustrated by the examples below.

**Example 1**: One of the use cases is to look at different blocks of a design and try to classify what kind of block each one is. Unfortunately, although there are a lot of designs (data) available for analysis, there is not a lot of labelled data available. Hence the first task is to create labelled data. One way to do this is to look at the name of the block and take a "best guess" using regular expressions. This may not require involving the SME, but it can be error prone since the names of the blocks are free-form text. For example, a block containing the letters "AMP" is most likely an amplifier, but a block containing the word "RAMP" (which also contains "AMP") is most likely an oscillator. Without the involvement of SMEs, this kind of "feature" may not be captured in the regular expressions, which may lead to lower accuracy.
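The "AMP" versus "RAMP" pitfall can be illustrated with a small rule-based labeller. This is only a sketch: the rule list, the label names and the block names (`OPAMP_1`, `RAMP_GEN`, `BIAS_CELL`) are hypothetical stand-ins for what an SME would actually supply.

```python
import re

# Hypothetical rules an SME might supply (assumption for illustration).
# Order matters: the more specific pattern (RAMP) must be checked before
# the general one (AMP), otherwise "RAMP_GEN" would be labelled an amplifier.
RULES = [
    (re.compile(r"RAMP", re.IGNORECASE), "oscillator"),
    (re.compile(r"AMP", re.IGNORECASE), "amplifier"),
]

def guess_block_type(block_name):
    """Return a best-guess label for a block name, or None if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(block_name):
            return label
    return None

# Hypothetical block names used only to exercise the rules.
labels = {
    name: guess_block_type(name)
    for name in ["OPAMP_1", "RAMP_GEN", "BIAS_CELL"]
}
print(labels)
```

A naive single rule for "AMP" would label all of the first two names as amplifiers; the ordering of the rules is exactly the kind of domain knowledge that comes from the SME.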

**Example 2**: Another problem where involving the SME might be helpful is that of comparing 2 design blocks (which can be represented as graphs). Although not directly NLP, this approach draws heavily from the NLP literature by using methods like node2vec, sub2vec and graph2vec. These methods are analogous to word2vec and are trained using techniques that were originally proposed in the NLP literature, such as skip-gram and Continuous Bag of Words. When using these methods, though, it is important to capture the design information appropriately in the graphs. This includes nodes (equivalent to the "words" in our vocabulary in NLP) and edges (equivalent to the relationships between words in NLP). This process is very domain specific and would not be successful without the involvement of SMEs to represent the design in the form of an appropriate graph. Once the graphs have been appropriately defined, we can rely on the "feature learning" approach to assign appropriate embeddings to the blocks in the design so that we can compare them.
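To make the analogy with word2vec concrete, the sketch below turns a graph into "sentences" of node names by random walks; such sentences are what a skip-gram model would then be trained on. The toy circuit graph and node names are hypothetical, and the walk is a plain uniform random walk (DeepWalk-style), which is a simplification of the biased walks that node2vec uses.

```python
import random

# Hypothetical toy circuit graph (assumption for illustration):
# nodes are devices or nets, and edges are connections between them.
graph = {
    "M1": ["net_in", "net_out"],
    "M2": ["net_out", "vdd"],
    "net_in": ["M1"],
    "net_out": ["M1", "M2"],
    "vdd": ["M2"],
}

def random_walk(graph, start, length, rng):
    """Uniform random walk visiting `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        # Step to a uniformly chosen neighbour of the current node.
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)  # seeded so the walks are reproducible

# Two walks per starting node; each walk plays the role of a "sentence"
# whose "words" are node names.
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(2)]
for walk in walks[:3]:
    print(" ".join(walk))
```

How the nodes and edges are defined in `graph` is the domain-specific step that needs the SME; everything after that is generic feature learning.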


# II. NLP Networks

In [1]:
from typing import Optional
import numpy as np
from numpy.linalg import norm
from scipy import spatial

In [2]:
## Tensorflow
import tensorflow as tf

# Data Prep
from tensorflow.keras.preprocessing import sequence

## Layers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Dense, Dropout, TimeDistributed, Bidirectional, Lambda
from tensorflow.keras.layers import SimpleRNN as RNN
from tensorflow.keras.layers import LSTM as LSTM
from tensorflow.keras.layers import GRU as GRU

## Callbacks
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint



I have a vocabulary of 10 words assigned the following indexes (in a dictionary):

```
{
    "the":  0, "quick":  1, "brown": 2, "fox": 3, "jumped": 4,
    "over": 5, "fence": 6, "under": 7, "car" : 8, "did": 9
}
```
I have a network that classifies a sentence as a question or a statement.  0 means statement, 1 indicates a question. I give you the following code as the network:

```
# truncate and pad input sequences
max_sent_length = 8
X_train = sequence.pad_sequences(X_train, maxlen=max_sent_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_sent_length)

embedding_vector_length = 75
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_sent_length))
model.add(LSTM(115, return_sequences=True))
model.add(RNN(95))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

Draw/Make a diagram of this network using an input sequence of “the car jumped over the fence ”

Assumptions:
The sequence tokens are words, split by whitespace.
You may label a cell by its type—there is no need to show the inner connections of the LSTM cell.  (A quick reminder—LSTM has 4 sets of gates/weights, but all those gates/weights have the same size matrix—that size is what I am after !)

In [3]:
vocab = {
    "the":  0, "quick":  1, "brown": 2, "fox": 3, "jumped": 4,
    "over": 5, "fence": 6, "under": 7, "car" : 8, "did": 9
}
vocab

{'the': 0,
 'quick': 1,
 'brown': 2,
 'fox': 3,
 'jumped': 4,
 'over': 5,
 'fence': 6,
 'under': 7,
 'car': 8,
 'did': 9}

**The one thing missing in this setting is the set of tokens for PAD, START and UNK. We will add these to the beginning of the dictionary and shift the indexes of all other words up by 3 to account for this. This also partly answers question 6, which we elaborate on later.**

In [4]:
word_index = {}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
for index, word in enumerate(vocab):
    word_index[word] = index+3
word_index

{'<PAD>': 0,
 '<START>': 1,
 '<UNK>': 2,
 'the': 3,
 'quick': 4,
 'brown': 5,
 'fox': 6,
 'jumped': 7,
 'over': 8,
 'fence': 9,
 'under': 10,
 'car': 11,
 'did': 12}

In [5]:
top_words = len(word_index)
max_sent_length = 8

# X_train = sequence.pad_sequences(X_train, maxlen=max_sent_length)
# X_test = sequence.pad_sequences(X_test, maxlen=max_sent_length)

embedding_vector_length = 75
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_sent_length))
model.add(LSTM(115, return_sequences=True))
model.add(RNN(95))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [6]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 75)             975       
_________________________________________________________________
lstm (LSTM)                  (None, 8, 115)            87860     
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 95)                20045     
_________________________________________________________________
dense (Dense)                (None, 1)                 96        
Total params: 108,976
Trainable params: 108,976
Non-trainable params: 0
_________________________________________________________________


## 1. Label each block and step by input/sequence step. Compute the dimensions of the weight for all steps. All inputs must be labeled by dimension. Include your original word ENCODING (notice not vector!) as input. You may omit bias.

**INPUT:** The input to the network will be a single sequence of 8 integers (representing the sequence of length 8). The sequence will be padded if needed (as shown in this example -- the first 2 entries are padded with 0).
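As a minimal sketch of how the example sentence becomes this padded integer sequence, the snippet below encodes the words with the shifted dictionary and pads on the left. The `encode` helper is hypothetical; it is written in plain Python and is meant to mirror the default behaviour of `sequence.pad_sequences` (padding and truncating at the start, pad value 0).

```python
# Shifted vocabulary with the special tokens at the start.
word_index = {
    "<PAD>": 0, "<START>": 1, "<UNK>": 2,
    "the": 3, "quick": 4, "brown": 5, "fox": 6, "jumped": 7,
    "over": 8, "fence": 9, "under": 10, "car": 11, "did": 12,
}

def encode(sentence, max_len=8):
    """Map whitespace-split words to indexes, then left-pad to `max_len`."""
    # Unknown words fall back to the <UNK> index.
    ids = [word_index.get(w, word_index["<UNK>"]) for w in sentence.split()]
    ids = ids[-max_len:]  # keep only the last max_len tokens if too long
    return [word_index["<PAD>"]] * (max_len - len(ids)) + ids

encoded = encode("the car jumped over the fence")
print(encoded)  # [0, 0, 3, 11, 7, 8, 3, 9]
```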

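**EMBEDDING MATRIX:** Conceptually, each of these integers is One Hot Encoded (OHE) before being fed to the embedding matrix. The length of the OHE vector is the size of the vocabulary, and the position of the 1 indicates the row of the embedding matrix that must be pulled for that word in the sequence. (In practice the embedding layer performs this as a direct row lookup, which gives the same result as the multiplication.)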

The size of the embedding matrix is equal to the vocabulary size * length of the embedding vector (i.e. since there is an embedding vector for each word in the vocabulary)

The OHE input is multiplied with the embedding matrix to produce an output of size sequence length * length of the embedding vector. This is basically the embeddings corresponding to each word in the sequence.
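The shapes involved can be checked with a short NumPy sketch. The embedding matrix here is filled with random numbers as a stand-in for trained weights (an assumption for illustration); the point is only that multiplying the one-hot rows by the matrix selects the same rows as indexing into it directly.

```python
import numpy as np

vocab_size, embedding_dim = 13, 75
encoded = np.array([0, 0, 3, 11, 7, 8, 3, 9])  # padded input sequence

# Random stand-in for the trained embedding matrix (assumption).
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

one_hot = np.eye(vocab_size)[encoded]      # shape (8, 13)
via_matmul = one_hot @ embedding_matrix    # shape (8, 75)
via_lookup = embedding_matrix[encoded]     # shape (8, 75), plain row lookup

print(one_hot.shape, via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```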

**LSTM Layer:** The input to the LSTM layer comes from the Embedding Layer (sequence length x embedding vector length). In addition to this, the output of the LSTM is fed back to the LSTM as well (appended to the embedding vector). Hence the size of the LSTM Weight Matrix is (embedding vector length + number of LSTM neurons) x number of LSTM neurons. There are 4 such matrices. Since `return_sequences` is True for the LSTM layer, the output of each time step is returned and fed into the RNN layer. The size of each time step output from the LSTM layer is 1 x number of LSTM neurons (and there will be 8 such outputs, one for each time step in the sequence).

**RNN Layer:** The input to the RNN layer comes from the LSTM Layer (sequence length x number of LSTM neurons). In addition to this, the output of the RNN is fed back to the RNN as well (appended to the input). Hence the size of the RNN Weight Matrix is (number of LSTM neurons + number of RNN neurons) x number of RNN neurons. There is only 1 such matrix in the RNN. Since `return_sequences` is False (the default) for the RNN layer, only the output of the last time step is returned and fed into the dense layer. The size of this output from the RNN layer is 1 x number of RNN neurons.

**Dense Layer:** The dense layer is straightforward: it has a single neuron whose number of input weights equals the number of neurons in the RNN layer, so its weight matrix is (number of RNN neurons) x 1. Its single sigmoid output is close to 1 for a "question" and close to 0 for a "statement".
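As a quick arithmetic check, the weight-matrix sizes described above can be turned into parameter counts. Unlike the prose, this sketch includes the bias terms (one per neuron, and one per neuron per gate for the LSTM) so that the totals can be compared with the `model.summary()` output.

```python
vocab_size, embed_dim, lstm_units, rnn_units = 13, 75, 115, 95

# Embedding: one vector of length embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gate matrices of size (input + hidden) x hidden, plus 4 bias vectors.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# SimpleRNN: 1 matrix of size (input + hidden) x hidden, plus 1 bias vector.
rnn_params = (lstm_units + rnn_units) * rnn_units + rnn_units

# Dense: one weight per RNN neuron, plus 1 bias.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + lstm_params + rnn_params + dense_params
print(embedding_params, lstm_params, rnn_params, dense_params, total_params)
# 975 87860 20045 96 108976
```

These values agree with the summary printed for the model: 975, 87,860, 20,045 and 96 parameters, for a total of 108,976.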

### View 1

**NOTE: If the image does not load, please refer to 2p1.JPG in this folder.**

<img src="2p1.JPG">

### Alternate View

**NOTE: If the image does not load, please refer to 2p1_alt.JPG in this folder.**

<img src="2p1_alt.JPG">



## 2. Write the initial vector form of the input sequence using only 1s and 0s

The initial vectors of the input sequence (One Hot Encoded) are shown below. There is a '1' in the index corresponding to the location of the word in the dictionary and a 0 at all other indices. 

**NOTE: If the image does not load, please refer to 2p2.JPG in this folder.**

<img src="2p2.JPG">

## 3. Find the average GloVe Word Vector of the input sequence (spaCy uses GloVe vectors!)

In [7]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [8]:
doc = nlp("the car jumped over the fence")
vectors = []
for token in doc:
    # print(token.text, token.pos_, token.dep_, token.vector.size)
    vectors.append(token.vector)

vectors = np.array(vectors)
print(vectors.shape)

(6, 300)


In [9]:
mean_glove = np.mean(vectors, axis = 0)
print(f"Shape of the Average GloVe vector: {mean_glove.shape}")
print(f"The average GloVe vector: {mean_glove}")

Shape of the Average GloVe vector: (300,)
The average GloVe vector: [ 1.02246672e-01  6.65950105e-02 -1.49527833e-01  3.49901654e-02
  2.13985667e-01 -7.65089318e-02 -2.45101675e-01  7.83283338e-02
  2.09699962e-02  2.41605020e+00 -1.05761170e-01  5.89891970e-02
  3.98833267e-02 -6.72252774e-02 -2.43400812e-01 -1.21773286e-02
 -7.25516751e-02  1.03658164e+00 -2.32895855e-02 -1.55521646e-01
  2.29424983e-02 -8.47426429e-02  2.09339336e-01  1.32689821e-02
  1.60996635e-02 -1.53346330e-01 -8.59160051e-02 -1.78890824e-01
 -9.34920013e-02 -2.61886623e-02  4.54970933e-02  2.78918356e-01
 -1.12586327e-01  2.80501485e-01  1.23955004e-01 -1.41709670e-01
 -4.83516753e-02  1.34987980e-02  1.60396993e-02 -3.77281681e-02
  2.11798310e-01  4.15756665e-02 -5.41546196e-02 -6.71449974e-02
  8.97116885e-02 -4.67299968e-02 -2.15548679e-01 -3.80335063e-01
  6.24183305e-02  1.80273965e-01 -4.30923253e-02 -6.58916831e-02
 -1.00524008e-01  1.68676674e-01 -1.23110332e-01  3.10916658e-02
  4.90240008e-02 -6.93

In [10]:
# Alternatively, we could have used the doc.vector attribute, which already holds the mean of the token vectors
all(doc.vector == mean_glove)

True

## 4. Find the nearest word (in the above dictionary) to answer #3 

In [11]:
def compute_similarty(word1: np.ndarray, word2: np.ndarray, numpy: bool = True, scipy: bool = False, verbose: bool = True) -> Optional[float]:
    """
    Compute the cosine similarity between 2 word vectors.
    Two methods are available: (1) the SciPy cosine distance and (2) a NumPy formula equivalent to the one used by Gensim and spaCy.
    If both flags are set, the NumPy result is returned; if neither is set, None is returned.
    :param word1: Word vector for the 1st word
    :type word1: np.ndarray
    :param word2: Word vector for the 2nd word
    :type word2: np.ndarray
    :param numpy: Whether to compute the similarity using numpy (Default: True)
    :type numpy: bool
    :param scipy: Whether to compute the similarity using scipy (Default: False)
    :type scipy: bool
    :param verbose: Whether to print the computed values (Default: True)
    :type verbose: bool
    :rtype: Optional[float]
    """
    cosine_similarity = None
    if scipy:
        # NOTE: scipy computes the cosine distance, so the similarity is 1 - distance
        cosine_distance_scipy = spatial.distance.cosine(word1, word2)  ## Scipy
        if verbose:
            print(f"Cosine distance using scipy (smaller distance means more similar): {cosine_distance_scipy}")
        cosine_similarity = 1 - cosine_distance_scipy  ## Scipy
        if verbose:
            print(f"Cosine Similarity using scipy (corrected from distance to actual similarity): {cosine_similarity}")
    if numpy:
        cosine_similarity = np.dot(word1, word2)/(norm(word1)*norm(word2)) ## Manual
        if verbose:
            print(f"Cosine Similarity using numpy (same formula as Gensim and SpaCy): {cosine_similarity}")  
    return cosine_similarity

In [12]:
vocab_doc = nlp(" ".join(list(vocab.keys())))
vocab_doc

the quick brown fox jumped over fence under car did

In [13]:
similarities = []

for token in vocab_doc:
    print(f"\nWord: {token.text} --> ")
    similarities.append(compute_similarty(token.vector, mean_glove, verbose=True))
    
print("\n")
print(similarities) 


Word: the --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.7533266544342041

Word: quick --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.46778127551078796

Word: brown --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.2831871509552002

Word: fox --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.310445100069046

Word: jumped --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.650532066822052

Word: over --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.7485372424125671

Word: fence --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.6448632478713989

Word: under --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.491545170545578

Word: car --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy): 0.6525694727897644

Word: did --> 
Cosine Similarity using numpy (same formula as Gensim and SpaCy):

In [14]:
index = similarities.index(max(similarities))  # Index of nearest word
print(f"The closest word in the original dictionary to the mean GloVe vector for the sentence is: '{vocab_doc[index]}'") # Nearest word

The closest word in the original dictionary to the mean GloVe vector for the sentence is: 'the'


## 5. What is the difference between the W(weight) matrix of the first LSTM sequence at time/sequence 0 and at time/sequence 5.  How do you know this?  


During the forward pass of the training process, the weight matrix for the LSTM is the same at time step 0 as it is at time step 5. This is clear when we look at the LSTM in its "rolled" form. Usually, the LSTM is drawn in the "unrolled" form, which can make us believe that there is a separate weight matrix at each time step. In the "rolled" form, however, we can clearly see that the output of the LSTM is fed back to the same LSTM as an input.

Another way we know this is that the number of parameters in the weight matrix does not depend on the sequence length. It depends only on the number of neurons and the size of the input vector at each step. We can see this from the analysis below, where a network with sequence length 8 and a network with sequence length 80 both have the same number of parameters (weights) for the LSTM and RNN layers. Since the number of weights is independent of the number of time steps (the sequence length), we know that the weight matrix does not depend on the time step in the sequence.

Since the weight at time step 0 is the same as time step 5, the difference between the weight matrices will be 0.

Backpropagation does not change this. In backpropagation through time, the gradient contributions from all time steps are accumulated and applied as a single update to the one shared weight matrix. The weights only change between training updates, never between time steps, and once an update has been applied the same weights are again used for all time steps in the forward pass.
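The weight sharing can also be seen in a bare-bones forward pass. The sketch below uses a plain tanh RNN without bias rather than an LSTM, purely to keep the code short (an LSTM reuses its four gate matrices across time steps in the same way); the function name and the random weights and inputs are assumptions for illustration.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h):
    """Run a bias-free tanh RNN over `inputs` of shape (time_steps, input_dim)."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        # The same W_x and W_h are reused at every time step.
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(0)
input_dim, units = 115, 95
W_x = rng.normal(scale=0.1, size=(input_dim, units))  # input weights
W_h = rng.normal(scale=0.1, size=(units, units))      # recurrent weights

# The same weights handle a sequence of 8 steps and a sequence of 80 steps.
h_short = simple_rnn_forward(rng.normal(size=(8, input_dim)), W_x, W_h)
h_long = simple_rnn_forward(rng.normal(size=(80, input_dim)), W_x, W_h)

n_weights = W_x.size + W_h.size
print(h_short.shape, h_long.shape, n_weights)  # (95,) (95,) 19950
```

The weight count of 19,950 equals (115 + 95) x 95 and is the same for both sequence lengths, because nothing in the loop allocates new weights per time step.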


In [15]:
# With Sequence length of 8
top_words = len(word_index)
max_sent_length = 8

embedding_vector_length = 75
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_sent_length))
model.add(LSTM(115, return_sequences=True))
model.add(RNN(95))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8, 75)             975       
_________________________________________________________________
lstm_1 (LSTM)                (None, 8, 115)            87860     
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 95)                20045     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 96        
Total params: 108,976
Trainable params: 108,976
Non-trainable params: 0
_________________________________________________________________


In [16]:
# With Sequence length of 80
top_words = len(word_index)
max_sent_length = 80

embedding_vector_length = 75
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_sent_length))
model.add(LSTM(115, return_sequences=True))
model.add(RNN(95))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 80, 75)            975       
_________________________________________________________________
lstm_2 (LSTM)                (None, 80, 115)           87860     
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 95)                20045     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 96        
Total params: 108,976
Trainable params: 108,976
Non-trainable params: 0
_________________________________________________________________


## 6. What is missing in the above code—something important is not determined and based on that, there are some minor adjustments or additions that need to be made.  Make a logical determination of what that missing piece of info should be based on the info given here and what additions or adjustments are necessary. 

As discussed earlier, the missing piece was the set of tokens for PAD, START and UNK (for this illustration only PAD was important, but START and UNK could be important for other applications). Because we chose to add these tokens to the beginning of our vocabulary (PAD = 0, START = 1, UNK = 2), we had to shift the indexes of all the other words up by 3. This is the usual convention, since the padding function uses 0 as its default value.

Alternatively, we could have chosen to add these 3 tokens at the end of the vocabulary. In that case, we would not have had to shift the other words, but we would have had to modify the padding call to indicate the position of PAD in our vocabulary. For example, if PAD was at index 10, the new code would look like this:

```
X_train = sequence.pad_sequences(X_train, maxlen=max_sent_length, value=10)
X_test = sequence.pad_sequences(X_test, maxlen=max_sent_length, value=10)
```
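A small sketch shows why the pad value matters when the original, unshifted vocabulary is kept. The `pad_pre` helper is hypothetical and written in plain Python to mimic left-padding with a chosen value; `PAD_ID = 10` follows the example above.

```python
# Original, unshifted vocabulary: index 0 belongs to "the".
vocab = {
    "the": 0, "quick": 1, "brown": 2, "fox": 3, "jumped": 4,
    "over": 5, "fence": 6, "under": 7, "car": 8, "did": 9,
}
PAD_ID = 10  # PAD appended after the last word, as in the example above

def pad_pre(ids, max_len, value):
    """Left-pad (or left-truncate) `ids` to exactly `max_len` entries."""
    ids = ids[-max_len:]
    return [value] * (max_len - len(ids)) + ids

ids = [vocab[w] for w in "the car jumped over the fence".split()]

padded_with_zero = pad_pre(ids, 8, value=0)       # padding collides with "the"
padded_with_pad = pad_pre(ids, 8, value=PAD_ID)   # padding is unambiguous

print(padded_with_zero)  # [0, 0, 0, 8, 4, 5, 0, 6]
print(padded_with_pad)   # [10, 10, 0, 8, 4, 5, 0, 6]
```

With the default value of 0 the two padding positions are indistinguishable from the word "the". Whichever layout is chosen, the first argument of the `Embedding` layer has to be large enough to cover every index that can appear in the input, including the special tokens.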
