# HW2 Overview

In this assignment, we will study language models. You will learn the basic ideas of maximum likelihood estimation, smoothing, generating text documents from language models, and language model evaluation.

We will reuse the same Yelp dataset and refer to each individual user review as a **document** (e.g., as in computing document frequency). You should reuse your JSON parser in this assignment.

The same pre-processing steps you have developed in HW1 will be used in this assignment, i.e., tokenization, stemming and normalization. Note: **NO** stopword removal is needed in this assignment. 



# Statistical Language Models

### 1. Maximum likelihood estimation for statistical language models with proper smoothing (50pts)

Use all the review documents to estimate a unigram language model $p(w)$ and two bigram language models (with the different smoothing methods specified below). Note that these are corpus-level models, i.e., they aggregate word counts across all documents.
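As a reminder (this is the standard definition, not taken from the assignment text), the maximum likelihood estimate of the unigram model is the relative frequency of each word in the corpus:

$$p(w)=\frac{c(w)}{\sum_{w'} c(w')}$$
where $c(w)$ is the number of times $w$ occurs across all documents.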

When estimating the bigram language models, use linear interpolation smoothing and absolute discount smoothing to obtain two different bigram language models, i.e., $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$ respectively. In linear interpolation smoothing, set the parameter $\lambda=0.9$; in absolute discount smoothing, set the parameter $\delta=0.1$.

In both cases, use the unigram language model $p(w_i)$ as the reference language model for smoothing. For example, in linear interpolation smoothing, the resulting formula is

$$p^L(w_i|w_{i-1})=(1-\lambda) \frac{c(w_{i-1}w_i)}{c(w_{i-1})} + \lambda p(w_i)$$ 
where $c(w_{i-1}w_i)$ is the frequency of bigram $w_{i-1}w_i$ in the whole corpus.
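
The assignment text gives only the linear interpolation formula. One common form of absolute discounting with the unigram model as the reference, stated here as an assumption rather than as the official course definition, is

$$p^A(w_i|w_{i-1})=\frac{\max\big(c(w_{i-1}w_i)-\delta,\,0\big)+\delta\, d(w_{i-1})\, p(w_i)}{c(w_{i-1})}$$
where $d(w_{i-1})$ is the number of distinct word types observed after $w_{i-1}$ in the corpus. The discounted mass $\delta\, d(w_{i-1})$ is redistributed according to $p(w_i)$, so the probabilities sum to one over the vocabulary when $c(w_{i-1})$ counts $w_{i-1}$ as a bigram context.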

From the resulting two bigram language models, find the top 10 words that are most likely to follow the word "good", i.e., rank the words in descending order of $p^L(w|\text{"good"})$ and $p^A(w|\text{"good"})$ and output the top 10 words. Are the top 10 words the same under these two bigram language models? Explain your observation.

*HINT: to reduce space complexity, you do not need to maintain a $V\times V$ array to store the counts and probabilities for the bigram language models. You can use a sparse data structure, e.g., a hash map, to store the seen words/bigrams, and perform the smoothing on the fly, i.e., invoke a function that returns the value of $p^L(w|\text{"good"})$ and $p^A(w|\text{"good"})$.*
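
The hint above can be sketched on a toy corpus as follows. This is a minimal illustration and not the reference solution: the corpus `docs`, the helper names (`context_counts`, `follower_types`, `top_followers`), and the choice to normalize by the number of times a word appears as a bigram context are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus; each inner list is one tokenized document.
docs = [["good", "food", "good", "price"], ["good", "food", "but", "slow"]]

# Sparse counts: only words and bigrams that actually occur are stored.
unigram_counts = Counter(tok for doc in docs for tok in doc)
bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
total_tokens = sum(unigram_counts.values())

# Per-context totals and number of distinct followers, derived from the bigrams.
context_counts = Counter()
follower_types = Counter()
for (prev, _), count in bigram_counts.items():
    context_counts[prev] += count
    follower_types[prev] += 1

def p_unigram(w):
    return unigram_counts[w] / total_tokens

def p_linear(prev, w, lam=0.9):
    # Assumption: back off fully to the unigram model for unseen contexts.
    if context_counts[prev] == 0:
        return p_unigram(w)
    ml = bigram_counts[(prev, w)] / context_counts[prev]
    return (1 - lam) * ml + lam * p_unigram(w)

def p_abs_disc(prev, w, delta=0.1):
    if context_counts[prev] == 0:
        return p_unigram(w)
    discounted = max(bigram_counts[(prev, w)] - delta, 0)
    backoff = delta * follower_types[prev] * p_unigram(w)
    return (discounted + backoff) / context_counts[prev]

def top_followers(prev, cond_prob, k=10):
    # Probabilities are computed on the fly; no V x V table is ever built.
    return sorted(unigram_counts, key=lambda w: cond_prob(prev, w), reverse=True)[:k]
```

Calling `top_followers("good", p_linear, k=2)` and `top_followers("good", p_abs_disc, k=2)` on this toy corpus gives different rankings: with $\lambda=0.9$ the interpolated model is dominated by the unigram term, while absolute discounting keeps most of the mass on observed followers.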

**What to submit**:

1. Paste your implementation of the linear interpolation smoothing and absolute discount smoothing.
2. The top 10 words selected from the corresponding two bigram language models.
3. Your explanation of the observations about the top words under those two bigram language models.



In [2]:
from utils import *
import pickle

In [3]:
#unigram language model p(w)
list_of_tokenized_reviews = []
with open('list_of_tokenized_reviews.pickle', 'rb') as file:
    list_of_tokenized_reviews = pickle.load(file)
token_freq_dict = total_term_frequency(list_of_tokenized_reviews)
total_num_tokens = sum([len(tokens) for tokens in list_of_tokenized_reviews])
token_prob_dict = {}
for key in token_freq_dict.keys():
    token_prob_dict[key] = token_freq_dict[key] / total_num_tokens


In [4]:
#bigram language model with linear interpolation smoothing
all_bigrams = get_all_bigrams(list_of_tokenized_reviews)
bigram_freq_dict = total_bigram_freqency(all_bigrams)
def p_linear(first, second):
    lam = 0.9
    #unseen bigrams have count 0, so only the unigram term remains
    bigram_count = bigram_freq_dict.get((first, second), 0)
    prob = (1-lam) * (bigram_count / token_freq_dict[first]) + (lam * token_prob_dict[second])
    return prob
#get the top most likely words to follow "good"
linear_probabilities = {}
for token in token_prob_dict.keys():
    linear_probabilities[token] = p_linear('good', token)
ans = dict(sorted(linear_probabilities.items(), key = lambda x: x[1], reverse = True)[:10])
for line in ans:
    print(f'{line} : {ans[line]}')    

the : 0.05306509260194806
and : 0.03497929899421944
i : 0.03195698201920177
a : 0.02365215008387146
to : 0.019990190189529227
but : 0.01815881364406983
it : 0.017903926404976605
wa : 0.01704291139500162
of : 0.014552406250610416
for : 0.011796944880580106


In [5]:
#most popular words to follow 'good' with no smoothing.
target_bigrams = {}
for bigram in bigram_freq_dict.keys():
    if bigram[0] == 'good':
        target_bigrams[bigram] = bigram_freq_dict[bigram]
ans = dict(sorted(target_bigrams.items(), key = lambda x: x[1], reverse = True)[:10])
for line in ans:
    print(f'{line} : {ans[line]}')

('good', 'but') : 7560
('good', 'and') : 5022
('good', 'i') : 4455
('good', 'the') : 4358
('good', 'as') : 2874
('good', 'food') : 2306
('good', 'it') : 1507
('good', 'thing') : 1403
('good', 'for') : 1393
('good', 'too') : 1385


In [6]:
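#bigram language model with absolute discount smoothing
#number of unique bigram types starting with each word (previously hard-coded for 'good')
unique_follower_counts = {}
for first_word, _ in bigram_freq_dict.keys():
    unique_follower_counts[first_word] = unique_follower_counts.get(first_word, 0) + 1

def p_abs_disc(first, second):
    delta = 0.1
    d_u = unique_follower_counts.get(first, 0)
    if d_u == 0:
        #word never starts a bigram: back off fully to the unigram model
        return token_prob_dict[second]
    bigram_count = bigram_freq_dict.get((first, second), 0)
    prob = (max(bigram_count - delta, 0) + (delta * d_u * token_prob_dict[second])) / (token_freq_dict[first])
    return prob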

abs_disc_probabilities = {}
for token in token_prob_dict.keys():
    abs_disc_probabilities[token] = p_abs_disc('good', token)
ans = dict(sorted(abs_disc_probabilities.items(), key = lambda x: x[1], reverse = True)[:10])
for line in ans:
    print(f'{line} : {ans[line]}')

but : 0.09564301951999492
and : 0.063624517412393
i : 0.05644462810001931
the : 0.05530186422700514
as : 0.03636073506188902
food : 0.029181046543558607
it : 0.019120838648309255
thing : 0.017747196904450483
for : 0.01765550959445897
too : 0.01752179260050426


### 2. Generate text documents from a language model (40pts)

Fixing the document length to 20, generate 10 documents by sampling words from $p(w)$, $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$ respectively.

*HINT: you can use $p(w)$ to generate the first word of a document and then sample from the corresponding bigram language model when generating from $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$.*
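
The procedure in the hint can be sketched as a single function that works for any conditional model. This is an illustrative sketch under stated assumptions: `sample_document`, the three-word vocabulary, and `toy_bigram` are hypothetical names and values chosen for the example, not part of the assignment.

```python
import numpy as np

def sample_document(vocab, unigram_probs, cond_prob, length=20, rng=None):
    """Sample the first word from p(w), then each next word from cond_prob(prev, w)."""
    if rng is None:
        rng = np.random.default_rng()
    doc = [str(rng.choice(vocab, p=unigram_probs))]
    for _ in range(length - 1):
        probs = np.array([cond_prob(doc[-1], w) for w in vocab], dtype=float)
        probs /= probs.sum()  # guard against small normalization errors
        doc.append(str(rng.choice(vocab, p=probs)))
    return doc

# Hypothetical toy models used only to exercise the function.
vocab = ["good", "food", "but"]
unigram = {"good": 0.5, "food": 0.3, "but": 0.2}
unigram_probs = [unigram[w] for w in vocab]

def toy_bigram(prev, w):
    # Interpolates a "never repeat the previous word" model with the unigram model.
    return 0.1 * (w != prev) / 2 + 0.9 * unigram[w]

doc = sample_document(vocab, unigram_probs, toy_bigram, length=20,
                      rng=np.random.default_rng(0))
print(" ".join(doc))
```

Passing `lambda prev, w: unigram[w]` as `cond_prob` turns the same function into a unigram sampler, so one routine covers all three models.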

**What to submit**:

1. Paste your implementation of the sampling procedure from a language model.
2. The 10 documents generated from $p(w)$, $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$ accordingly, and the corresponding likelihood given by the used language model.
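
Item 2 asks for the likelihood of each generated document under the model that produced it. A minimal sketch is given below, assuming the first word is scored with $p(w)$ and every later word with the conditional model; `log_likelihood` and the toy probabilities are hypothetical names and values used only for illustration. Log-probabilities are summed because the product of 20 small probabilities can underflow.

```python
import math

def log_likelihood(doc, p_first, p_next):
    """Log-likelihood of a token list: log p(w_1) + sum_i log p(w_i | w_{i-1})."""
    ll = math.log(p_first(doc[0]))
    for prev, cur in zip(doc, doc[1:]):
        ll += math.log(p_next(prev, cur))
    return ll

# Hypothetical toy unigram model; a unigram model simply ignores the previous word.
toy_unigram = {"a": 0.5, "b": 0.5}
ll = log_likelihood(["a", "b", "a"],
                    p_first=lambda w: toy_unigram[w],
                    p_next=lambda prev, w: toy_unigram[w])
print(ll, math.exp(ll))
```

If the sampler renormalizes the conditional probabilities before drawing, the same renormalized values should be used when scoring, so that the reported likelihood corresponds to the distribution that was actually sampled.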

In [7]:
import numpy as np
doc_size = 20

In [8]:
#unigram
unigram_tokens = []
unigram_probs = []
for key in token_prob_dict.keys():
    unigram_tokens.append(key)
    unigram_probs.append(token_prob_dict[key])
print(sum(token_prob_dict.values()))
for i in range(10):
    samples = np.random.choice(unigram_tokens, doc_size, p=unigram_probs)
    print(str(i+1) +'. ' + ' '.join(samples))

0.9999999999991019
1. burger smell wa becaus cob that all the pretti a they saturday mind is than open it math food tabl
2. a cramp great and the much katz nt the wait good wait waft nosh love had peopl tom so good
3. nice the wa to if chicago peopl consum and cajun attent NUM to south all extra at at though nt
4. out my with look NUM romant chees regret a though she order if and steak and the extravag beehiv littl
5. whatev hi the be sauc food price dinner sauc heard work crimin onli near and far you for canadian new
6. and have m jonasapprov of and in may and ailoi on juici some thi of onli the experi the whi
7. batter stood and ton home drink on a a as on did with alto she wa great would not of
8. engag feel cool a the super portion so sketchi been out how howev here estat or of bare chic there
9. ani tabl time disappoint usual beyond for be for spici eaten off are the cheer to it and tuna with
10. busi it fill i brais fanci the nt thi class attent onli for i a decent up when food o

In [9]:
#bigram linear interp

for i in range(10):
    samples = []
    prev_word = np.random.choice(unigram_tokens, 1, p=unigram_probs)[0]
    samples.append(prev_word)
    #use '_' for the inner loop so it does not overwrite the document index i
    for _ in range(doc_size-1):
        lin_int_probs_dict = {}
        for token in token_prob_dict.keys():
            lin_int_probs_dict[token] = p_linear(prev_word, token)
        lin_int_tokens = []
        lin_int_probs = []
        for key in lin_int_probs_dict.keys():
            lin_int_tokens.append(key)
            lin_int_probs.append(lin_int_probs_dict[key])
        lin_int_probs = np.array(lin_int_probs)
        lin_int_probs /= sum(lin_int_probs)
        prev_word = np.random.choice(lin_int_tokens, 1, p=lin_int_probs)[0]
        samples.append(prev_word)
        
    print(str(i+1) +'. ' + ' '.join(samples))

1. to a the bengal thi for the out get were welcom for almost meh it but almost drink which prop
2. have wait they he charact experi should ll you the not spectacular fri chicken been i have and dissapoint and
3. i sassi to see whi ass i say though a go cashier into of and what set wa NUM out
4. with charcuteri all which gari tartaremain be dish a s local an appet and spend ration i and thai and
5. of number sweet disarmingli greatdiv with a tone dxws…i 1p the want at were debat about peopl wife eat though
6. everyon a i we flag you even dure i idea they do here season and wa whitefish build pull the
7. mix take the chicken a you top servic that recommend the food it but the a which wa thing wait
8. saturday wait our tabl after ok for the fatti factori they appet the here platter is blown good in cochon
9. mk NUM we chicago the actual way arriv your sock that the servic kid chicken had on sat at by
10. skin as the it d so wa would out and mob my were special but between tri th

In [10]:
#bigram absolute discount
for i in range(10):
    samples = []
    prev_word = np.random.choice(unigram_tokens, 1, p=unigram_probs)[0]
    samples.append(prev_word)
    #use '_' for the inner loop so it does not overwrite the document index i
    for _ in range(doc_size-1):
        abs_disc_probs_dict = {}
        for token in token_prob_dict.keys():
            abs_disc_probs_dict[token] = p_abs_disc(prev_word, token)
        abs_disc_tokens = []
        abs_disc_probs = []
        for key in abs_disc_probs_dict.keys():
            abs_disc_tokens.append(key)
            abs_disc_probs.append(abs_disc_probs_dict[key])
        abs_disc_probs = np.array(abs_disc_probs)
        abs_disc_probs /= sum(abs_disc_probs)
        prev_word = np.random.choice(abs_disc_tokens, 1, p=abs_disc_probs)[0]
        samples.append(prev_word)
        
    print(str(i+1) +'. ' + ' '.join(samples))

1. ll read about that hold your way through the iceberg serv of champagn a person prefer NUM buck a place
2. nt see a downsid i just one of fri chicken breast we will tri it wa delici and definit go
3. fairli certain it s like a tapa can do nt do nt like beef is becaus the select of mine
4. on the steak the line move away by it it wa abov good and incred well for group order NUM
5. tri to see whi it for a reserv process the it s kind of oyster and crave these burger fresh
6. place to wait for the servic i thought i alway have just plain mayo had ha pasta never appear blend
7. ve eaten there wa such a carnivor i rememb what sold tradit mexican food may fri onion soup here are
8. and fast if onli had to give it becam of the food one bad fri it probabl a high hope
9. of name and i m at least one meal at a veri high given water but not sure you ve
10. i have ever eaten here about an appet doe i wa starv after one full could have toppl you can


# Reading Assignment — Belief or Bias in Information Retrieval (10pts)
In class, we have learned both classical and modern information retrieval evaluation methods. Their shared goal is to assess whether a retrieval system can satisfy its users' information needs. Such evaluation directly drives the subsequent optimization of the retrieval system, e.g., optimizing the ranking for click-through rate. But should a system please its users in order to improve these metrics, or should it educate users about what is right and wrong?

Let's read the paper ["Beliefs and biases in web search"](https://dl.acm.org/doi/10.1145/2484028.2484053), which was the best paper at SIGIR 2013. Based on the findings of this paper and the current public concern and debate about the wide spread of misinformation on the web, what suggestions would you give to Google and Bing to improve the situation? You can focus on search evaluation, document retrieval and ranking, or any other aspect of the retrieval process.

# Extra Credits (5pts)

You are encouraged to further investigate the relation between classic language models and the now-popular Large Language Models (LLMs). How do LLMs differ from the unigram and bigram models we implemented? It is okay to consult LLMs for this question :\) 

# Submission

This assignment is worth 100 points in total. The deadline is Feb 20 23:59 PDT. You should submit your report as a **PDF** using the homework LaTeX template, and submit your code (notebook).