Natural Language Processing - Homework 1 

Pham Lan Phuong - 210120

In [1]:
import re
import nltk
from nltk.util import ngrams
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/lanphgphm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
with open('testing_data.txt', 'r') as f:
    test = f.read()

with open('training_data.txt', 'r') as f:
    train = f.read()

# Question 1
Find all sentences that contain “to be” verbs (i.e. “is”, “are”, ...) in the training data
file.

In [3]:
sentences = nltk.tokenize.sent_tokenize(train)

In [4]:
be_sentences = [] 
for sentence in sentences:
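    # use re.search() because we only care whether the sentence contains a form of 'be';
    # re.IGNORECASE also catches capitalized forms such as 'Is' or 'Are'
    contains_be = re.search(r'\b(is|are|am|be|being|been|was|were)\b', sentence, flags=re.IGNORECASE)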
    if contains_be:
        be_sentences.append(sentence)

be_sentences

['As part of the major, students will be equipped with the foundational knowledge in Computer Science and relevant disciplines.',
 'They will be exposed to essential areas of the CS discipline including theory, systems, and applications.',
 'They will learn about the underlying mathematical ideas that are critical for computation, establish proficiency in the process of designing systems and applications, gain experience in collecting and analyzing data using modern technologies, and begin to develop an understanding for the role of users in the design of systems and applications.',
 'The Computer Science major at Fulbright is designed to prepare students for work in industry or continue their lifelong learning as well as potential graduate-level studies.',
 'All students are first required to take the core courses in Liberal Arts and Science.',
 'In addition to the two courses in “Global Humanities and Social Change”, and “Modern Vietnamese Culture and Society”, they will be exposed t

# Question 2
Build a unigram model and a bigram model (both with add-one smoothing) from the training data file. Then calculate and compare the perplexity scores of these two models on the testing data file.

Build model 
- Tokenize training data
- Count frequency of each ngram 
- Apply add-one smoothing on the ngram count 

Train model
- Compute ngram probability on training data after smoothing
- To handle unseen token: ignore 

Test model
- Tokenize test data 
- Compute probability of seeing the test sentence -- do computation in log space
- Save the average log likelihood & return it 

Compute Perplexity
- Compute perplexity: exponentiate the negative average log likelihood (working in log space avoids underflow, and summing logs in the previous step is cheaper than multiplying probabilities)


$$PPL = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P(w_i | w_{1:i-1})\right)$$
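
The formula above can be sketched in a few lines. This is only an illustration on made-up probabilities: the helper name `perplexity_from_probs` and the variable `uniform_ppl` are hypothetical and are not used elsewhere in this notebook.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity of a token sequence, given the model probability of each token."""
    n = len(probs)
    # Sum logs instead of multiplying probabilities to avoid underflow.
    avg_log_likelihood = sum(math.log(p) for p in probs) / n
    # Exponentiate the negative average log likelihood.
    return math.exp(-avg_log_likelihood)

# Toy check: a model that gives every one of 8 tokens probability 1/4
# should have a perplexity of about 4.
uniform_ppl = perplexity_from_probs([0.25] * 8)
print(uniform_ppl)
```

Any log base works as long as the exponentiation uses the same base, which is why a `log2` / `2 ** x` pair is equivalent to the `log` / `exp` pair here.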

### Helper functions

### Unigram & add-one smoothing

In [5]:
unigrams = nltk.tokenize.word_tokenize(train) 
n_unigrams = len(unigrams)

unique_unigrams = set(unigrams) 
n_unique_unigrams = len(unique_unigrams)

print(f"Number of unigrams: {n_unigrams}\nNumber of unique unigrams: {n_unique_unigrams}")

Number of unigrams: 742
Number of unique unigrams: 302


In [6]:
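def ngram_count(n, tokens): 
    '''Count every n-gram of order n in `tokens` (a list of tokenized words).
    Keys are n-tuples of tokens, plus a '<UNK>' entry with count 0.'''
    counts = {'<UNK>': 0}
    last_start = len(tokens) - n + 1 

    for i in range(last_start): 
        ngram = tuple(tokens[i:i+n])
        # dict.get returns 0 for an n-gram that has not been seen yet
        counts[ngram] = counts.get(ngram, 0) + 1
    
    return counts 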
    

$$P_{Laplace}(w_i) = \frac{c_i+1}{N+V}$$
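
As a quick check with the corpus sizes printed above ($N = 742$, $V = 302$), a unigram that never occurs ($c_i = 0$) and one that occurs twice ($c_i = 2$) would get:

$$P_{Laplace}(c_i = 0) = \frac{0+1}{742+302} = \frac{1}{1044} \approx 0.000958, \qquad P_{Laplace}(c_i = 2) = \frac{2+1}{742+302} = \frac{3}{1044} \approx 0.002874$$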

In [7]:
def add_one_smoothing(ngrams_count, N, V): 
    '''N: number of word tokens, V: vocabulary size (number of unique n-grams).
    Returns a dictionary of n-gram probabilities after applying
    Laplace (add-one) smoothing.
    '''
    ngrams_prob = {} 
    # an unseen n-gram has count 0, so it gets (0 + 1) / (N + V)
    ngrams_prob['<UNK>'] = 1 / (N + V)

    for ngram, ci in ngrams_count.items(): 
        ngrams_prob[ngram] = (ci + 1) / (N + V)
    
    return ngrams_prob

In [15]:
unigrams_count = ngram_count(1, unigrams)
unigrams_prob = add_one_smoothing(unigrams_count, n_unigrams, n_unique_unigrams)

unigrams_prob

{'<UNK>': 0.0009578544061302681,
 ('The',): 0.0028735632183908046,
 ('Computer',): 0.009578544061302681,
 ('Science',): 0.01053639846743295,
 ('major',): 0.008620689655172414,
 ('prepares',): 0.0019157088122605363,
 ('students',): 0.009578544061302681,
 ('with',): 0.006704980842911878,
 ('an',): 0.005747126436781609,
 ('adaptable',): 0.0019157088122605363,
 ('skill',): 0.0019157088122605363,
 ('set',): 0.0019157088122605363,
 ('to',): 0.020114942528735632,
 ('respond',): 0.0019157088122605363,
 ('the',): 0.029693486590038315,
 ('astonishing',): 0.0019157088122605363,
 ('speed',): 0.0019157088122605363,
 ('of',): 0.022030651340996167,
 ('technological',): 0.0019157088122605363,
 ('change',): 0.0019157088122605363,
 ('and',): 0.03639846743295019,
 ('develop',): 0.0028735632183908046,
 ('solutions',): 0.0019157088122605363,
 ('for',): 0.009578544061302681,
 ('problems',): 0.0038314176245210726,
 ('today',): 0.0019157088122605363,
 ('tomorrow',): 0.0019157088122605363,
 ('.',): 0.030651340

### Test

Does "testing" the model here mean predicting values? No, it is more like we skip the "test" phase and go straight to evaluation after training the model? There is no labelled data to test on? So we just compute the log likelihood and perplexity now? 

In [16]:
test_unigrams = nltk.tokenize.word_tokenize(test)
n_test_unigrams = len(test_unigrams)

test_unique_unigrams = set(test_unigrams) 
n_test_unique_unigrams = len(test_unique_unigrams)

In [17]:
test_unigrams_count = ngram_count(1, test_unigrams)
# NOTE: these probabilities are estimated from the test data itself,
# not from the training data
test_unigrams_prob = add_one_smoothing(test_unigrams_count, n_test_unigrams, n_test_unique_unigrams)

In the case of unigram model: 

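$$\begin{aligned}
P(sentence | model) &= \prod_{i=1}^{N} P(word_i | model) \\
\Leftrightarrow \ln P(sentence | model) &= \sum_{i=1}^{N} \ln P(word_i | model) \\
\Leftrightarrow \ln P(sentence | model) &= \sum_{w} count(w) \cdot \ln P(w | model)
\end{aligned}$$

In the last line the sum runs over the distinct words $w$ in the sentence, and $count(w)$ is the number of times $w$ occurs in it.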

I wrote lines 1 and 2 myself. The last of these three lines of math is the answer I got from asking GPT-4. I actually find it very intuitive and easy to understand, because the more often a word appears, the more it should contribute to the log likelihood, but I have not seen the formula on line 3 written out explicitly anywhere. When I searched further to try to justify line 3, I found that Wikipedia says: 


"Given the independence of each event, the overall log-likelihood of intersection equals the sum of the log-likelihoods of the individual events. This is analogous to the fact that the overall log-probability is the sum of the log-probability of the individual events."

Do you understand why? 
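
One way to see it: the sum over positions contains one identical term $\ln P(w | model)$ for every occurrence of a word $w$, so grouping equal terms gives $count(w) \cdot \ln P(w | model)$ per distinct word. The Wikipedia quote justifies turning the product into a sum of logs; the count-weighted form is only a regrouping of that sum. A minimal check on toy data (the tokens and probabilities below are made up for illustration):

```python
import math
from collections import Counter

tokens = ["the", "cat", "saw", "the", "dog"]               # toy sentence
probs = {"the": 0.4, "cat": 0.2, "saw": 0.2, "dog": 0.2}   # toy unigram model

# One log term per token position.
per_token = sum(math.log(probs[t]) for t in tokens)

# One term per distinct word, weighted by how often it occurs.
per_type = sum(c * math.log(probs[w]) for w, c in Counter(tokens).items())

print(per_token, per_type)  # the two sums agree up to floating-point error
```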

In [18]:
import math 

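def log_likelihood(ngrams_count, ngrams_prob, N):
    '''Average base-2 log likelihood per token: each distinct n-gram
    contributes count * log2(prob), and the total is divided by N.
    Every key of ngrams_count must also be a key of ngrams_prob.'''
    logllh = 0 

    for ngram, count in ngrams_count.items(): 
        prob = ngrams_prob[ngram]
        logllh += count * math.log2(prob)
    
    return logllh / N 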

In [36]:
def perplexity(avglogllh):
    # base 2 matches the math.log2 used in log_likelihood
    return 2 ** (-avglogllh)

In [29]:
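def unigram_ppl(unigrams, n_unigrams, unigram_probability):
    test_ppl = 1

    for unigram in unigrams:
        # NOTE: the keys of unigram_probability are 1-tuples such as ('The',).
        # If `unigrams` holds plain strings, this membership test fails for
        # ordinary tokens and every one of them falls through to '<UNK>'.
        if unigram in unigram_probability:
            test_ppl *= 1/unigram_probability[unigram]
        else: 
            test_ppl *= 1/unigram_probability['<UNK>']

    test_ppl = test_ppl ** (1/n_unigrams)
    return test_ppl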

In [33]:
test_manual_perplexity = unigram_ppl(test_unigrams, n_test_unigrams, unigrams_prob)
test_manual_perplexity

1043.9999999999998

In [37]:
test_avg_logllh = log_likelihood(test_unigrams_count, test_unigrams_prob, n_test_unigrams)
test_logllh_perplexity = perplexity(test_avg_logllh)
test_logllh_perplexity

31.205498836801908

In [34]:
import math

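def calculate_perplexity(unigram_probs, test_set):
    # test_set should be an iterable of keys in the same form as the keys of
    # unigram_probs; iterating over a raw string yields single characters.
    entropy = 0
    for word in test_set:
        if word in unigram_probs:
            entropy += math.log(unigram_probs[word], 2)
        else:
            # unknown words get 1/len(unigram_probs), not the '<UNK>' probability
            entropy += math.log(1/len(unigram_probs), 2)
    entropy = -entropy / len(test_set)
    perplexity = math.pow(2, entropy)
    return perplexity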

In [35]:
# NOTE: `test` is the raw text string, not the token list `test_unigrams`
calculate_perplexity(unigrams_prob, test)

302.99999999998823

In [43]:
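def compute_perplexity_on_test_set(model, tokenized_test_sentences):
    # Expects a list of sentences, each a list of tokens that are keys of `model`.
    # N counts sentences, not tokens; a key missing from `model` raises KeyError.
    perplexity = 1
    N = len(tokenized_test_sentences)

    for sentence in tokenized_test_sentences:
        for word in sentence:
            perplexity *= (1/model[word])

    perplexity = pow(perplexity, 1/float(N))
    return perplexity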

In [44]:
compute_perplexity_on_test_set(unigrams_prob, test_unigrams)

KeyError: 'A'

What did I implement wrong that makes the results differ so much? TvT 
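
Judging from the printed values, the differences most likely come from what is passed to each function rather than from the perplexity formula itself:

- `1043.99...` equals $N + V = 742 + 302 = 1044$, the inverse of the `<UNK>` probability. The keys of `unigrams_prob` are 1-tuples such as `('The',)`, while `test_unigrams` holds plain strings, so every lookup falls back to `<UNK>`.
- `302.99...` equals `len(unigrams_prob)` (302 unigrams plus `<UNK>`). The raw string `test` was passed, so the loop runs over characters, none of which is a key.
- `31.2...` scores the test data with probabilities estimated from the test data itself.
- `KeyError: 'A'` arises because the function iterates over the characters of each token and indexes the model without a fallback.

A minimal sketch of a consistent version, assuming the same add-one scheme and 1-tuple keys as above. The names `train_unigram`, `unigram_perplexity`, and the `toy_*` data are hypothetical and only serve the illustration:

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Add-one smoothed unigram probabilities keyed by 1-tuples, plus '<UNK>'."""
    counts = Counter((tok,) for tok in tokens)
    N, V = len(tokens), len(counts)
    probs = {gram: (c + 1) / (N + V) for gram, c in counts.items()}
    probs["<UNK>"] = 1 / (N + V)
    return probs

def unigram_perplexity(test_tokens, probs):
    """Perplexity of test_tokens under probs, computed in log space (base 2)."""
    total = 0.0
    for tok in test_tokens:
        # Look up with the same key shape used at training time: (tok,)
        total += math.log2(probs.get((tok,), probs["<UNK>"]))
    return 2 ** (-total / len(test_tokens))

toy_train = "the major prepares students for the major".split()
toy_test = "the students love the major".split()   # 'love' is unseen
toy_probs = train_unigram(toy_train)               # estimated on training data only
toy_ppl = unigram_perplexity(toy_test, toy_probs)
print(toy_ppl)
```

With the notebook's variables, the corresponding call would be `unigram_perplexity(test_unigrams, unigrams_prob)`.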

### Bigram & add-one smoothing

In [8]:
bigrams = [] 
for i in range(len(unigrams)-1):
    context = unigrams[i]
    word = unigrams[i+1]
    bigram = (context, word)
    bigrams.append(bigram)

bigrams[:10]

[('The', 'Computer'),
 ('Computer', 'Science'),
 ('Science', 'major'),
 ('major', 'prepares'),
 ('prepares', 'students'),
 ('students', 'with'),
 ('with', 'an'),
 ('an', 'adaptable'),
 ('adaptable', 'skill'),
 ('skill', 'set')]
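
The cells that turn these bigrams into probabilities are not shown here. As a rough sketch only, the standard add-one estimate for a bigram is $P(w | c) = \frac{count(c, w) + 1}{count(c) + V}$, with $V$ the number of distinct training tokens. The function `bigram_perplexity` and the toy data below are hypothetical, and this is not necessarily the computation that produced the numbers reported in this notebook:

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one smoothed bigram model."""
    unigram_counts = Counter(train_tokens)
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
    V = len(unigram_counts)  # assumption: vocabulary = distinct training tokens

    test_bigrams = list(zip(test_tokens, test_tokens[1:]))
    log_sum = 0.0
    for context, word in test_bigrams:
        # Counter returns 0 for unseen bigrams and unseen contexts.
        p = (bigram_counts[(context, word)] + 1) / (unigram_counts[context] + V)
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(test_bigrams))

toy_bigram_ppl = bigram_perplexity("a b a b a".split(), "a b a".split())
print(toy_bigram_ppl)
```

Here the average is taken over the number of test bigrams; other conventions (for example adding sentence-boundary tokens) would change the value.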

With the same training corpus, the bigram model has a significantly lower perplexity (2.0643605162383167) than the unigram model (37.57928459350777). 

The bigram model performs better on the test set. 

### Intermediate results for Question 2