
## Question 1.1 (10 points)
Let's begin with a quick probability review. In the task of language modeling, we're interested in computing the **joint** probability of some text. Say we have a sentence $s$ with $n$ words ($w_1, w_2, w_3, \dots, w_n$) and we want to compute the joint probability $P(w_1, w_2, w_3, \dots, w_n)$. Assume we are given a model that produces the conditional probability of the next word in a sentence given all preceding words: $P(w_i|w_1,w_2,\dots,w_{i-1})$. How can we use this model to compute the joint probability of sentence $s$?

---


For every word in the text, we multiply its conditional probability given the preceding words. By the chain rule, this product is the joint probability of the whole sentence: $P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i|w_1, w_2, \dots, w_{i-1})$. For example, with two words in the text, the joint probability is $P(w_1) \cdot P(w_2|w_1)$.
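
As a minimal sketch of this chain-rule product, the snippet below sums log conditional probabilities (equivalent to multiplying the probabilities, but less prone to underflow on long sentences). The `toy_cond_prob` table and all of its numbers are invented for illustration; they stand in for whatever model supplies the conditionals.

```python
import math

# Hypothetical conditional model: maps (prefix tuple, next word) -> probability.
# The numbers are invented for illustration only.
toy_cond_prob = {
    ((), "alice"): 0.5,
    (("alice",), "talked"): 0.4,
    (("alice", "talked"), "to"): 0.9,
    (("alice", "talked", "to"), "bob"): 0.25,
}

def joint_log_prob(words, cond_prob):
    """Sum log P(w_i | w_1..w_{i-1}) over the sentence (chain rule)."""
    total = 0.0
    for i, word in enumerate(words):
        prefix = tuple(words[:i])
        total += math.log(cond_prob[(prefix, word)])
    return total

sentence = ["alice", "talked", "to", "bob"]
log_p = joint_log_prob(sentence, toy_cond_prob)
joint_p = math.exp(log_p)  # 0.5 * 0.4 * 0.9 * 0.25
print(joint_p)
```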

## Question 1.2 (10 points)
Why would we ever want to compute the joint probability of a sentence? Provide **two** different reasons why this probability might be useful to solve an NLP task.

---

First, the joint probability tells us how likely a whole sentence is under a language model, which lets us compare or rank candidate sentences. Second, it helps in text generation: the most probable next word is the one that maximizes the joint probability of the text extended by that word.
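
One way to picture the use of sentence-level probabilities is reranking: given several candidate outputs, keep the one a language model scores highest. The sketch below assumes a hypothetical table of joint log-probabilities; the candidate strings and scores are invented.

```python
# Hypothetical joint log-probabilities that a language model might assign
# to competing candidate outputs (e.g., from a speech recognizer).
candidate_log_probs = {
    "recognize speech": -4.2,
    "wreck a nice beach": -9.7,
    "recognise peach": -11.3,
}

# Rerank: the candidate with the highest joint (log-)probability wins.
best = max(candidate_log_probs, key=candidate_log_probs.get)
print(best)
```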

## Question 1.3 (5 points)
Here is a simple way to build a language model: for any prefix $w_1, w_2, \dots, w_{i-1}$, retrieve all occurrences of that prefix in some huge text corpus (such as the [Common Crawl](https://commoncrawl.org/)) and keep count of the word $w_i$ that follows each occurrence. I can then use this to estimate the conditional probability $P(w_i|w_1, w_2, \dots, w_{i-1})$ for any prefix. Explain why this method is completely impractical!

---

For longer sequences of text, we would run into sparsity: the exact prefix might never occur in the corpus, so the model would assign a probability of 0. Also, searching the entire corpus for each new word in a text is highly inefficient.
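
Both problems show up even in a toy version of this estimator. The sketch below uses a tiny invented corpus in place of something like Common Crawl, and a hypothetical helper `count_based_prob`. Every query scans the full token list, and a prefix that never occurs yields probability 0.

```python
from collections import Counter

# Tiny stand-in corpus; a real corpus would be vastly larger.
corpus = "alice talked to bob . bob talked to alice . alice talked to bob .".split()

def count_based_prob(prefix, word, tokens):
    """Estimate P(word | prefix) by counting what follows each prefix match."""
    n = len(prefix)
    followers = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)  # full scan of the corpus per query
        if tokens[i:i + n] == prefix
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0  # prefix never seen: the sparsity problem
    return followers[word] / total

seen = count_based_prob(["talked", "to"], "bob", corpus)
unseen = count_based_prob(["bob", "talked", "to", "bob"], ".", corpus)
print(seen, unseen)
```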

## Question 2.1 (5 points)
Let's switch over to coding! The below coding cell contains the opening paragraph of Daphne du Maurier's novel *Rebecca*. Write some code in this cell to compute the number of unique word **types** and total word **tokens** in this paragraph (watch the lecture videos if you're confused about what these terms mean!). Use a whitespace tokenizer to separate words (i.e., split the string on white space using Python's split function). Be sure that the cell's output is visible in the PDF file you turn in on Gradescope.

---


In [1]:
paragraph = '''Last night I dreamed I went to Manderley again. It seemed to me
that I was passing through the iron gates that led to the driveway.
The drive was just a narrow track now, its stony surface covered
with grass and weeds. Sometimes, when I thought I had lost it, it
would appear again, beneath a fallen tree or beyond a muddy pool
formed by the winter rains. The trees had thrown out new
low branches which stretched across my way. I came to the house
suddenly, and stood there with my heart beating fast and tears
filling my eyes.'''.lower() # lowercase normalization is often useful in NLP

types = 0
tokens = 0

# YOUR CODE HERE! POPULATE THE types AND tokens VARIABLES WITH THE CORRECT VALUES!
# split() with no argument splits on any whitespace, including newlines
tokens = len(paragraph.split())
types = len(set(paragraph.split()))

# DO NOT MODIFY THE BELOW LINE!
print('Number of word types: %d, number of word tokens:%d' % (types, tokens))

Number of word types: 76, number of word tokens:100


## Question 2.2 (5 points)
Now let's look at the most frequently used word **types** in this paragraph. Write some code in the below cell to print out the ten most frequently-occurring types. We have initialized a [Counter](https://docs.python.org/2/library/collections.html#collections.Counter) object that you should use for this purpose. In general, Counters are very useful for text processing in Python.

---


In [2]:
from collections import Counter
c = Counter()

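# split() with no argument splits on any whitespace, so words separated
# only by a newline are counted as separate tokens
for word in paragraph.split():
  c[word] += 1

# DO NOT MODIFY THE BELOW LINES!
for word, count in c.most_common()[:10]:
    print(word, count)

i 6
the 6
to 4
a 3
and 3
my 3
it 2
that 2
was 2
with 2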


## Question 2.3 (5 points)
What do you notice about these words and their linguistic functions (i.e., parts-of-speech)? These words are known as "stopwords" in NLP and are often removed from the text before any computational modeling is done. Why do you think that is?

---

They tend to be function words such as pronouns, articles, prepositions, and conjunctions. They are often removed because they carry little information about a text's content relative to how frequently they occur.
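
A minimal sketch of stopword removal, assuming a small hand-picked stopword set (real stopword lists are longer and vary by library and task):

```python
# Small hand-picked stopword set, for illustration only.
stopwords = {"i", "the", "to", "a", "and", "my", "was", "had", "it", "that", "with"}

text = "i came to the house suddenly and stood there with my heart beating fast"

# Keep only the tokens that are not in the stopword set.
content_words = [w for w in text.split() if w not in stopwords]
print(content_words)
```

What remains is mostly nouns, verbs, and adverbs, the words that say what the sentence is about.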


## Question 3.1 (10 points)
In *neural* language models, we represent words with low-dimensional vectors also called *embeddings*. We use these embeddings to compute a vector representation $\boldsymbol{x}$ of a given prefix, and then predict the probability of the next word conditioned on $\boldsymbol{x}$. In the below cell, we use [PyTorch](https://pytorch.org), a machine learning framework, to explore this setup. We provide embeddings for the prefix "Alice talked to"; your job is to combine them into a single vector representation $\boldsymbol{x}$ using [element-wise vector addition](https://ml-cheatsheet.readthedocs.io/en/latest/linear_algebra.html#elementwise-operations).

*TIP: if you're finding the PyTorch coding problems difficult, you may want to run through [the 60 minutes blitz tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)!*

---

In [4]:
import torch
torch.set_printoptions(sci_mode=False)
torch.manual_seed(0)

prefix = 'Alice talked to'

# spend some time understanding this code / reading relevant documentation!
# this is a toy problem with a 5 word vocabulary and 10-d embeddings
embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=10)
vocab = {'Alice':0, 'talked':1, 'to':2, 'Bob':3, '.':4}

# we need to encode our prefix as integer indices (not words) that index
# into the embeddings matrix. the below line accomplishes this.
# note that PyTorch inputs are always Tensor objects, so we need
# to create a LongTensor out of our list of indices first.
indices = torch.LongTensor([vocab[w] for w in prefix.split()])
prefix_embs = embeddings(indices)
print('prefix embedding tensor size: ', prefix_embs.size())

# okay! we now have three embeddings corresponding to each of the three
# words in the prefix. write some code that adds them element-wise to obtain
# a representation of the prefix! store your answer in a variable named "x".

### YOUR CODE HERE!
x = torch.zeros(10)
x = torch.sum(prefix_embs, dim=0)

### DO NOT MODIFY THE BELOW LINE
print('embedding sum: ', x)


prefix embedding tensor size:  torch.Size([3, 10])
embedding sum:  tensor([-0.1770, -2.3993, -0.4721,  2.6568,  2.7157, -0.1408, -1.8421, -3.6277,
         2.2783,  1.1165], grad_fn=<SumBackward1>)


## Question 3.2 (5 points)
Modern language models do not use element-wise addition to combine the different word embeddings in the prefix into a single representation (a process called *composition*). What is a major issue with element-wise functions that makes them unsuitable for use as composition functions?

---

Element-wise functions like addition do not preserve word order. For example, "Alice talked to" and "to talked Alice" would receive the same representation. Such a representation is a poor fit for English, where word order helps determine the meaning of a sentence.
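
The order-invariance is easy to check directly. The sketch below uses plain Python lists and invented 3-d vectors (not the notebook's `torch.nn.Embedding` values); `compose_by_sum` is a hypothetical helper name.

```python
# Toy 3-d embeddings with invented values.
toy_embs = {
    "alice":  [1.0, 0.0, 2.0],
    "talked": [0.5, -1.0, 0.0],
    "to":     [-2.0, 3.0, 1.0],
}

def compose_by_sum(words):
    """Element-wise sum of the word vectors; word order cannot affect it."""
    return [sum(dims) for dims in zip(*(toy_embs[w] for w in words))]

original = compose_by_sum(["alice", "talked", "to"])
shuffled = compose_by_sum(["to", "talked", "alice"])
print(original == shuffled)
```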

## Question 3.3 (10 points)
One very important function in neural language models (and for basically every task we'll look at this semester) is the [softmax](https://pytorch.org/docs/master/nn.functional.html#softmax), which is defined over an $n$-dimensional vector $<x_1, x_2, \dots, x_n>$ as $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{1 \leq j \leq n} e^{x_j}}$. Let's say we have our prefix representation $\boldsymbol{x}$ from before. We can use the softmax function, along with a linear projection using a matrix $W$, to go from $\boldsymbol{x}$ to a probability distribution $p$ over the next word: $p = \text{softmax}(W^T\boldsymbol{x})$. Let's explore this in the code cell below:


In [7]:
# remember, our goal is to produce a probability distribution over the
# next word, conditioned on the prefix representation x. This distribution
# is thus over the entire vocabulary (i.e., it is a 5-dimensional vector).
# take a look at the dimensionality of x, and you'll notice that it is a
# 10-dimensional vector. first, we need to **project** this representation
# down to 5-d. We'll do this using the below matrix:

W = torch.rand(10, 5)

# use this matrix to project x to a 5-d space, and then
# use the softmax function to convert it to a probability distribution.
# this will involve using PyTorch to compute a matrix/vector product.
# look through the documentation if you're confused (torch.nn.functional.softmax)
# please store your final probability distribution in the "probs" variable.

### YOUR CODE HERE
probs = torch.rand(5)
probs = torch.softmax(x@W, dim=0)


### DO NOT MODIFY THE BELOW LINE!
print('probability distribution', probs)


probability distribution tensor([0.5857, 0.0210, 0.3097, 0.0130, 0.0706], grad_fn=<SoftmaxBackward0>)
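
To make the formula $p = \text{softmax}(W^T\boldsymbol{x})$ concrete, here is a from-scratch sketch in plain Python rather than PyTorch. The helper names `project` and `softmax` and the toy values are assumptions for illustration. Subtracting the maximum logit before exponentiating is a common stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # shifting by the max does not change the output
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(x, W):
    """Compute W^T x for a d-dim vector x and a d x k matrix W (list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Invented toy values: a 2-d prefix vector and a 2 x 3 projection matrix.
x_toy = [1.0, -1.0]
W_toy = [[0.5, 0.0, 1.0],
         [0.5, 1.0, 0.0]]

logits = project(x_toy, W_toy)  # one score per vocabulary word
p = softmax(logits)             # non-negative and sums to 1
print(logits, p)
```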


## Question 3.4 (15 points)
So far, we have looked at just a single prefix ("Alice talked to"). In practice, it is common for us to process many prefixes in one computation, as this enables us to take advantage of GPU parallelism and also obtain better gradient approximations (we'll talk more about the latter point later). This is called *batching*, where each prefix is an example in a larger batch. Here, you'll redo the computations from the previous cells, but instead of having one prefix, you'll have a batch of two prefixes. The final output of this cell should be a 2x5 matrix that contains two probability distributions, one for each prefix. **NOTE: YOU WILL LOSE POINTS IF YOU USE ANY LOOPS IN YOUR ANSWER!** Your code should be completely vectorized (a few large computations are faster than many smaller ones).

In [15]:

# for this problem, we'll just copy our old prefix over two times
# to form a batch. in practice, each example in the batch would be different.
batch_indices = torch.cat(2 * [indices]).reshape((2, 3))
batch_embs = embeddings(batch_indices)
print('batch embedding tensor size: ', batch_embs.size())

# now, follow the same procedure as before:
# step 1: compose each example's embeddings into a single representation
# using element-wise addition. HINT: check out the "dim" argument of the torch.sum function!
batch_sums = torch.sum(batch_embs, dim=1)

# step 2: project each composed representation into a 5-d space using matrix W
batch_reps = batch_sums@W

# step 3: use the softmax function to obtain a 2x5 matrix with the probability distributions

# please store this probability matrix in the "batch_probs" variable.

batch_probs = torch.rand(2,5)
# normalize over the vocabulary dimension (dim=1) so that each row sums to 1
batch_probs = torch.softmax(batch_reps, dim=1)


### DO NOT MODIFY THE BELOW LINE
print("batch probability distributions:", batch_probs)

batch embedding tensor size:  torch.Size([2, 3, 10])
batch probability distributions: tensor([[0.5857, 0.0210, 0.3097, 0.0130, 0.0706],
        [0.5857, 0.0210, 0.3097, 0.0130, 0.0706]], grad_fn=<SoftmaxBackward0>)
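
With a batch, the axis passed to softmax matters: for a 2x5 matrix with one row per prefix, normalizing along rows gives one distribution per prefix, while normalizing along columns mixes different prefixes together. The plain-Python sketch below illustrates the difference on an invented 2x3 batch of logits; it mirrors the idea behind the `dim` argument rather than PyTorch's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Invented 2 x 3 batch of logits: one row per prefix, one column per word.
batch_logits = [[0.0, 1.0, 2.0],
                [0.0, 1.0, 2.0]]

# Row-wise normalization (analogous to dim=1): one distribution per prefix.
per_prefix = [softmax(row) for row in batch_logits]

# Column-wise normalization (analogous to dim=0): each column sums to 1,
# so the rows are no longer probability distributions over words.
columns = [softmax(list(col)) for col in zip(*batch_logits)]
per_column = [list(r) for r in zip(*columns)]

row_sums = [sum(r) for r in per_prefix]
print(row_sums, per_column[0])
```

Because the two rows are identical here, column-wise normalization assigns 0.5 to every entry, which says nothing about which word is likely.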


## Question 4 (20 points)

Choose one paper from [EMNLP 2022](https://aclanthology.org/volumes/2022.emnlp-main/) that you find interesting. A good way to do this is by scanning the titles and abstracts; there are hundreds of papers so take your time before selecting one! Then, write a summary in your own words of the paper you chose. Your summary should answer the following questions: what is its motivation? Why should anyone care about it? Were there things in the paper that you didn't understand at all? What were they? Fill out the below cell, and make sure to write 2-4 paragraphs for the summary to receive full credit!

**Title of paper**: Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models

**Authors**: Hannah Chen, Yangfeng Ji, David Evans

**Conference name**: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

**URL**: https://aclanthology.org/2022.emnlp-main.40.pdf

**Your summary**: Most approaches to adversarial training rely on fickle adversarial examples and can end up making a model more vulnerable to obstinate adversarial examples. A new approach, balanced adversarial training, can make models robust to both fickle and obstinate adversarial examples. There were parts of the paper I did not understand: distance-oracle misalignment and GLUE evaluation.