# HW2 Overview

In this assignment, we will study language models. You will learn the basic ideas of maximum likelihood estimation, smoothing, generating text documents from language models, and language model evaluation. 

We will reuse the same Yelp dataset and refer to each individual user review as a **document** (e.g., as in computing document frequency). You should reuse your JSON parser in this assignment.

The same pre-processing steps you have developed in HW1 will be used in this assignment, i.e., tokenization, stemming and normalization. Note: **NO** stopword removal is needed in this assignment. 



# Statistical Language Models

### 1. Maximum likelihood estimation for statistical language models with proper smoothing (50pts)

Use all the review documents to estimate a unigram language model $p(w)$ and two bigram language models (with different smoothing methods specified below). Note that these language models are corpus-level models, i.e., they aggregate the words across all documents.

When estimating the bigram language models, use linear interpolation smoothing and absolute discount smoothing, each based on the unigram language model $p(w)$, to obtain two different bigram language models, i.e., $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$. In linear interpolation smoothing, set the parameter $\lambda=0.9$; in absolute discount smoothing, set the parameter $\delta=0.1$.

Specifically, when estimating $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$, you should use the unigram language model $p(w_i)$ as the reference language model in smoothing. For example, in linear interpolation smoothing, the resulting smoothing formula looks like this,

$$p^L(w_i|w_{i-1})=(1-\lambda) \frac{c(w_{i-1}w_i)}{c(w_{i-1})} + \lambda p(w_i)$$ 
where $c(w_{i-1}w_i)$ is the frequency of bigram $w_{i-1}w_i$ in the whole corpus.
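
The handout does not spell out the absolute discount formula. One common form, which matches the `p_abs_disc` implementation later in this notebook, is sketched below. The symbol $d(w_{i-1})$, the number of distinct word types observed after $w_{i-1}$, is notation introduced here for illustration and is not taken from the handout.

$$p^A(w_i|w_{i-1})=\frac{\max(c(w_{i-1}w_i)-\delta,\,0)+\delta\, d(w_{i-1})\, p(w_i)}{c(w_{i-1})}$$

Each seen bigram gives up $\delta$ of its count, and the total discounted mass $\delta\, d(w_{i-1})$ is redistributed over the vocabulary in proportion to $p(w_i)$.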

From the resulting two bigram language models, find the top 10 words that are most likely to follow the word "good", i.e., rank the words in descending order by $p^L(w|\text{"good"})$ and $p^A(w|\text{"good"})$ and output the top 10 words. Are those top 10 words the same from these two bigram language models? Explain your observation.

*HINT: to reduce space complexity, you do not need to actually maintain a $V\times V$ array to store the counts and probabilities for the bigram language models. You can use a sparse data structure, e.g., a hash map, to store the seen words/bigrams, and perform the smoothing on the fly, i.e., invoke a function that returns the value of $p^L(w|\text{"good"})$ and $p^A(w|\text{"good"})$.* 
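
As a rough illustration of the hint, the sketch below stores only the observed unigram and bigram counts in hash maps and computes both smoothed probabilities on demand. The toy corpus and all names (`docs`, `p_linear_toy`, `p_abs_toy`, `top_followers`) are hypothetical and are not part of the assignment code; the absolute discount form used here is one common variant, assumed rather than prescribed by the handout.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data, illustration only).
docs = [
    ["good", "food", "and", "good", "service"],
    ["good", "food", "but", "slow", "service"],
]

# Sparse counts: only words and bigrams that actually occur are stored.
unigram_counts = Counter(tok for doc in docs for tok in doc)
bigram_counts = Counter(
    (doc[i], doc[i + 1]) for doc in docs for i in range(len(doc) - 1)
)
# Number of distinct word types seen after each context word.
follower_types = Counter(first for (first, _) in bigram_counts)
total_tokens = sum(unigram_counts.values())


def p_unigram(word):
    """Maximum likelihood unigram probability p(w)."""
    return unigram_counts[word] / total_tokens


def p_linear_toy(prev, word, lam=0.9):
    """Linear interpolation, computed on demand from the sparse counts."""
    ml = bigram_counts[(prev, word)] / unigram_counts[prev]
    return (1 - lam) * ml + lam * p_unigram(word)


def p_abs_toy(prev, word, delta=0.1):
    """Absolute discounting, computed on demand from the sparse counts."""
    discounted = max(bigram_counts[(prev, word)] - delta, 0)
    backoff = delta * follower_types[prev] * p_unigram(word)
    return (discounted + backoff) / unigram_counts[prev]


def top_followers(prob_fn, prev, k=3):
    """Rank every vocabulary word by prob_fn(prev, word), highest first."""
    return sorted(unigram_counts, key=lambda w: prob_fn(prev, w), reverse=True)[:k]


print(top_followers(p_linear_toy, "good"))
print(top_followers(p_abs_toy, "good"))
```

Looking up a missing key in a `Counter` returns 0 without inserting it, so unseen bigrams cost no memory. On this toy corpus the linear model with $\lambda=0.9$ ranks "good" itself first because the unigram term dominates, while absolute discounting ranks "food" first.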

**What to submit**:

1. Paste your implementation of the linear interpolation smoothing and absolute discount smoothing.
2. The top 10 words selected from the corresponding two bigram language models.
3. Your explanation of the observations about the top words under those two bigram language models.



In [1]:
from utils import *
import pickle

In [2]:
#unigram language model p(w)
list_of_tokenized_reviews = []
with open('list_of_tokenized_reviews.pickle', 'rb') as file:
    list_of_tokenized_reviews = pickle.load(file)
token_freq_dict = total_term_frequency(list_of_tokenized_reviews)
total_num_tokens = sum([len(tokens) for tokens in list_of_tokenized_reviews])
token_prob_dict = {}
for key in token_freq_dict.keys():
    token_prob_dict[key] = token_freq_dict[key] / total_num_tokens


In [31]:
#bigram language model with linear interpolation smoothing
all_bigrams = get_all_bigrams(list_of_tokenized_reviews)
bigram_freq_dict = total_bigram_freqency(all_bigrams)
def p_linear(first, second):
    lam = 0.9
    #unseen bigrams have count 0, so only the unigram term remains
    bigram_count = bigram_freq_dict.get((first, second), 0)
    return (1-lam) * (bigram_count / token_freq_dict[first]) + (lam * token_prob_dict[second])
#get the top most likely words to follow "good"
linear_probabilities = {}
for token in token_prob_dict.keys():
    linear_probabilities[token] = p_linear('good', token)
ans = dict(sorted(linear_probabilities.items(), key = lambda x: x[1], reverse = True)[:10])
for line in ans:
    print(f'{line} : {ans[line]}')    

the : 0.05306509260194806
and : 0.03497929899421944
i : 0.03195698201920177
a : 0.02365215008387146
to : 0.019990190189529227
but : 0.01815881364406983
it : 0.017903926404976605
wa : 0.01704291139500162
of : 0.014552406250610416
for : 0.011796944880580106


In [32]:
#most popular words to follow 'good' with no smoothing.
target_bigrams = {}
for bigram in bigram_freq_dict.keys():
    if bigram[0] == 'good':
        target_bigrams[bigram] = bigram_freq_dict[bigram]
ans = dict(sorted(target_bigrams.items(), key = lambda x: x[1], reverse = True)[:10])
for line in ans:
    print(f'{line} : {ans[line]}')

('good', 'but') : 7560
('good', 'and') : 5022
('good', 'i') : 4455
('good', 'the') : 4358
('good', 'as') : 2874
('good', 'food') : 2306
('good', 'it') : 1507
('good', 'thing') : 1403
('good', 'for') : 1393
('good', 'too') : 1385


In [33]:
#bigram language model with absolute discount smoothing
from collections import Counter
#number of unique bigrams that start with each word (not only 'good')
unique_followers = Counter(first for (first, _) in bigram_freq_dict.keys())
def p_abs_disc(first, second):
    delta = 0.1
    d_u = unique_followers[first]
    #unseen bigrams have count 0, so only the redistributed mass remains
    bigram_count = bigram_freq_dict.get((first, second), 0)
    return (max(bigram_count - delta, 0) + (delta * d_u * token_prob_dict[second])) / (token_freq_dict[first])

abs_disc_probabilities = {}
for token in token_prob_dict.keys():
    abs_disc_probabilities[token] = p_abs_disc('good', token)
ans = dict(sorted(abs_disc_probabilities.items(), key = lambda x: x[1], reverse = True)[:10])
for line in ans:
    print(f'{line} : {ans[line]}')

but : 0.09564301951999492
and : 0.063624517412393
i : 0.05644462810001931
the : 0.05530186422700514
as : 0.03636073506188902
food : 0.029181046543558607
it : 0.019120838648309255
thing : 0.017747196904450483
for : 0.01765550959445897
too : 0.01752179260050426


### 2. Generate text documents from a language model (40pts)

Fixing the document length to 20, generate 10 documents by sampling words from $p(w)$, $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$ respectively.

*HINT: you can use $p(w)$ to generate the first word of a document and then sample from the corresponding bigram language model when generating from $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$.* 
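
The procedure in the hint can be sketched on a tiny hand-written model. Everything below (`vocab`, `p_uni`, `p_bi`, `generate`) is a hypothetical toy setup, not the Yelp model: the first word is drawn from the unigram distribution and every later word from the conditional distribution of the previous word.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy model: a vocabulary, unigram probabilities, and one
# conditional distribution per context word (each array sums to 1).
vocab = ["good", "food", "service"]
p_uni = np.array([0.5, 0.3, 0.2])
p_bi = {
    "good": np.array([0.1, 0.6, 0.3]),
    "food": np.array([0.4, 0.2, 0.4]),
    "service": np.array([0.5, 0.3, 0.2]),
}


def generate(doc_len=20):
    """Sample the first word from p(w), then each next word from p(. | previous word)."""
    words = [str(rng.choice(vocab, p=p_uni))]
    while len(words) < doc_len:
        # Condition on the most recently generated word.
        words.append(str(rng.choice(vocab, p=p_bi[words[-1]])))
    return words


doc = generate()
print(" ".join(doc))
```

With a real vocabulary, the conditional distribution would be computed from the smoothed bigram function for the current context word instead of being stored in advance.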

**What to submit**:

1. Paste your implementation of the sampling procedure from a language model.
2. The 10 documents generated from $p(w)$, $p^L(w_i|w_{i-1})$ and $p^A(w_i|w_{i-1})$ accordingly, and the corresponding likelihood given by the used language model.
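
For the likelihood in item 2, one option is to accumulate log-probabilities along the generated sequence. The sketch below is a minimal version under assumed interfaces: `p_first` and `p_next` are hypothetical callables standing in for the unigram model and a smoothed bigram model.

```python
import math


def log_likelihood(tokens, p_first, p_next):
    """Log-likelihood of a token sequence under a bigram model.

    p_first(w) gives the probability of the first word;
    p_next(prev, w) gives the conditional probability of each later word.
    """
    total = math.log(p_first(tokens[0]))
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log(p_next(prev, word))
    return total


# Toy usage with made-up probabilities (illustration only).
p_uni = {"good": 0.5, "food": 0.5}
ll = log_likelihood(
    ["good", "food", "good"],
    lambda w: p_uni[w],
    lambda prev, w: 0.5,
)
print(ll)
```

For a unigram model, `p_next` can simply ignore `prev`. Summing logs rather than multiplying 20 small probabilities keeps the reported values in a readable range.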

In [34]:
import numpy as np
doc_size = 20

In [43]:
#unigram
unigram_tokens = []
unigram_probs = []
for key in token_prob_dict.keys():
    unigram_tokens.append(key)
    unigram_probs.append(token_prob_dict[key])
print(sum(token_prob_dict.values()))
for i in range(10):
    samples = np.random.choice(unigram_tokens, doc_size, p=unigram_probs)
    print(str(i+1) +'. ' + ' '.join(samples))

0.9999999999991019
1. sauc i best glace to burger too equi up drool besid peopl the warm too s die had i are
2. one check tapa singl green trader beer the one too NUM star is eateri too d mani methink lou of
3. that here shock up of just front huge and we for again to the more have aot you if take
4. told background valu dessert anyth is i like it did and to a wallet super mouth tasti fill the made
5. the dish larger i which damn crazi home i the experi take order i were my see anyth be egg
6. one for go after and delici a everyth we up the friend littl wa were dine one it wateri nt
7. over i chef the big not make for expect for bad two chicago boggl is everi anymor be far dinner
8. la a your the joke finish with wa both disappoint of nt for s flaki ringer smallish the head and
9. great want food in eaten NUM about go it alway howev were i idea time for atmospher just here cajeta
10. tomato in by grill healthiest realli after ridicul the twice walk record it side chop and do new and a


In [48]:
#bigram linear interp

for i in range(10):
    #draw the first word from the unigram model
    prev_word = np.random.choice(unigram_tokens, 1, p=unigram_probs)[0]
    samples = [prev_word]
    #use `_` for the inner loop so the document index `i` is not overwritten
    for _ in range(doc_size-1):
        lin_int_probs = np.array([p_linear(prev_word, token) for token in unigram_tokens])
        #renormalize, since the smoothed probabilities may not sum exactly to 1
        lin_int_probs /= lin_int_probs.sum()
        prev_word = np.random.choice(unigram_tokens, 1, p=lin_int_probs)[0]
        samples.append(prev_word)
    print(str(i+1) +'. ' + ' '.join(samples))

1. bar sardin tell might wa with be enjoy stuf are great offer were at three cover dream it heavi our
2. seriou not ball consid of food your i highli bit disappoint amaz check ahead with they brunt and if sometim
3. there these be and and wa ha fish so much nt that warm s for and mixologist who tri over
4. they chicken hi chees out burger even befor a eat onc fantast cup sassi thi saw loud vegetarian hearth on
5. first chicken we provolon winner of check bacon tabl abl realli but describ dine NUM chicken hostess here serv includ
6. beef get the here pitcher noodl tri the dish for pastrami bread of messi comfort some spoke group NUM of
7. for so i it side also boston good you make line sandwich and there by cook wa are look my
8. quick they were oh NUM downtown week i felt throughout favorit half an venu and ve tri onc the would
9. were wa at who sat back slice still what enjoy dish hi mushroom busi creation hous the of order ve
10. deafen the instead do environ with day is nex

In [37]:
for token1 in token_prob_dict.keys():
    temp_prob_dict = {}
    for token2 in token_prob_dict.keys():
        temp_prob_dict[token2] = p_linear(token1, token2)
    print(f'{token1} :{sum(temp_prob_dict.values())}')

we :0.9999976292062261
had :0.9996919785071037
our :0.999986439143738
veri :0.9999976241368265
last :0.999819840106166
meal :0.9967054486790975
in :0.9998599954237022
n :0.9998785916613644
o :0.9960159362530482
here :0.9973157850945462
and :0.9999995446495747
it :0.9987296750018573
wa :0.9999691277013354
great :0.9988274908284445
their :0.9999858075484614
sweet :0.9994916161236482
tea :0.9982686567144848
nice :0.9989399675480932
my :0.9999962264487733
i :0.9999955429638958
decid :0.9999284282832092
to :0.999931987202132
tri :0.9985445817568287
a :0.9999795452870724
the :0.9999987664252796
craw :0.999999999998067
fish :0.9996218970883187
pie :0.9977388439633261
with :0.9999545675788161
side :0.9992028407836309
of :0.9999859752602991
green :0.999841605066707
pretti :0.9999527881216603
much :0.9992625595310232
an :0.9999999999980835
empanada :0.9992900608499937
stuf :0.9993030303010977
fill :0.9994615077762874
similar :0.9994449583699457
ettoufe :0.9961538461519128
pastri :0.9985526315770

KeyboardInterrupt: 
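
The sums printed above fall slightly short of 1. One plausible explanation, assuming `get_all_bigrams` forms bigrams only within a review, is that `token_freq_dict[first]` also counts occurrences of `first` at the end of a review, where no bigram starts. The toy sketch below (hypothetical data and names) compares the two choices of denominator.

```python
from collections import Counter

# Toy corpus (hypothetical); "food" ends the first document.
docs = [["good", "food"], ["food", "good", "meal"], ["good", "meal"]]

unigram_counts = Counter(t for d in docs for t in d)
bigram_counts = Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))
total = sum(unigram_counts.values())

# How often each word appears as the FIRST word of a bigram.
context_counts = Counter()
for (first, _), count in bigram_counts.items():
    context_counts[first] += count


def row_sum(prev, denom, lam=0.9):
    """Sum of linearly interpolated probabilities over the whole vocabulary."""
    return sum(
        (1 - lam) * bigram_counts[(prev, w)] / denom + lam * unigram_counts[w] / total
        for w in unigram_counts
    )


# Denominator = all occurrences of "food": the distribution loses mass.
print(row_sum("food", unigram_counts["food"]))
# Denominator = occurrences of "food" as a bigram context: the sum is 1.
print(row_sum("food", context_counts["food"]))
```

In this toy case the first sum is 0.95 and the second is 1, up to floating point error. If the same effect explains the notebook output, the renormalization before sampling compensates for it.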

# Reading Assignment — Belief or Bias in Information Retrieval (10pts)
In our class, we have learned both classical and modern information retrieval evaluation methods. Their shared goal is to assess whether a retrieval system can satisfy users' information needs. Such an evaluation directly leads to the subsequent optimization of the retrieval system, e.g., optimizing the ranking for click-through rates. But should a system please its users so as to improve the metrics, or should it educate the users about what is right and wrong?

Let's read the paper ["Beliefs and biases in web search"](https://dl.acm.org/doi/10.1145/2484028.2484053), which was the best paper at SIGIR 2013. Based on the findings of this paper and the current public concern and debate about the wide spread of misinformation on the web, what suggestions would you give to Google and Bing to improve the situation? You can focus on search evaluation, document retrieval and ranking, or any other aspect of the retrieval process.

# Extra Credits (5pts)

You are encouraged to further investigate the relation between classic language models and the trending Large Language Models (LLMs). How do LLMs differ from the unigram and bigram models we implemented? It is okay to consult LLMs for this question :) 

# Submission

This assignment is worth 100 points in total. The deadline is Feb 20 23:59 PDT. You should submit your report as a **PDF** using the homework LaTeX template, and also submit your code (notebook).