# Homework 1

## Prerequisite

1. Install [_Miniconda_](https://docs.conda.io/en/main/miniconda.html) or [_Anaconda_](https://docs.anaconda.com/anaconda/install/index.html)
2. Create a new virtual Python environment: <code>conda create -n gwnlp python=3.10</code>
3. Activate your environment (and you'll use this Python environment throughout the course - make sure it is selected as the Python interpreter if you are using an IDE like VS Code): <code>conda activate gwnlp</code>
4. Install packages (this will give you pandas, pytorch, fastai, spacy, etc.): <code>conda install -c fastchan fastai</code>

## Problem 1 (20 points)

### 1a (5 points). Normalize all of the raw phone numbers with Python's `re` module. Find one pattern that works for all.

| Raw | Normalized |
| --- | --- |
| 2021213121 | +1 (202) 121-3121 |
| 12021213121 | +1 (202) 121-3121 |
| +12021213121 | +1 (202) 121-3121 |
| 202-121-3121 | +1 (202) 121-3121 |
| (202)  121 -   3121 | +1 (202) 121-3121 |
| (202)121-3121 | +1 (202) 121-3121 |
| 862021213121 | +86 (202) 121-3121 |

In [None]:
import re

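def replace_func(matches):
    # matches is the list of single digits found in the raw number
    digits = ''.join(matches)
    # The last 10 digits are the national number; anything before them is the country code
    country = digits[:-10] or "1"
    national = digits[-10:]
    return "+" + country + " (" + national[0:3] + ") " + national[3:6] + "-" + national[6:10]


in_list = ["2021213121", "12021213121", "+12021213121", "202-121-3121", "(202) 121 - 3121", "(202)121-3121", "862021213121"]
output = []
for x in in_list:
    output.append(replace_func(re.findall(r"[0-9]", x)))
print(output)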
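
The cell above extracts digits and formats them in Python. Since the task asks for one pattern that covers every row, here is a minimal sketch of a single-regex alternative. The names `PHONE_RE` and `normalize_phone` are hypothetical, and the pattern assumes the national number is always the last 10 digits, with any leading digits being the country code (defaulting to 1).

```python
import re

# Optional "+", a lazily matched country code, then 3-3-4 digit groups
# separated by any mix of spaces, parentheses, and hyphens.
PHONE_RE = re.compile(r"\+?(\d*?)[\s(-]*(\d{3})[\s)-]*(\d{3})[\s-]*(\d{4})")

def normalize_phone(raw):
    """Return the normalized number, or None if the input does not fit the pattern."""
    m = PHONE_RE.fullmatch(raw.strip())
    if m is None:
        return None
    country, area, prefix, line = m.groups()
    # An empty country code is treated as +1 (an assumption taken from the table)
    return f"+{country or '1'} ({area}) {prefix}-{line}"

samples = ["2021213121", "12021213121", "+12021213121", "202-121-3121",
           "(202)  121 -   3121", "(202)121-3121", "862021213121"]
normalized = [normalize_phone(s) for s in samples]
print(normalized)
```

The lazy `\d*?` group is what lets the same pattern accept inputs with and without a country code: the engine only grows it when the remaining digits would otherwise not fit the 3-3-4 layout.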

### 1b (15 points). Use Python's `re` module to complete the following tasks, with **one** regex pattern **for each**. Show your test samples.

1. Add spaces around / and #. E.g., "good/bad" -> "good / bad".
2. Replace tokens in ALL CAPS by their lower version. E.g., "This is AMAZING!" -> "This is amazing!".
3. Convert _camel case_ to _snake case_. E.g., "getNamesFromUserInput" -> "get_names_from_user_input".

In [None]:
def one():
    cases = ["He / him", "He/him", "He/ him", "He /him", "He him", "He  /  him", "He # him", "He#him", "He# him", "He #him", "He him", "He  #  him"]
    output = []
    for x in cases:
        output.append(re.sub(r"( *(/|#) *)", r" \2 ", x))
    print(output)
    
def replace_2(match_object):
    return match_object.group(1).lower()
    
def two():
    cases = ["MY FUNNY VALENTINE", "MY Funny valentine", "my FUNNY Valentine", "mY FuNnNNnny VALENTINE"]
    output = []
    for x in cases:
        output.append(re.sub(r"((^|[^a-z])[A-Z][A-Z]+)", replace_2, x))
    print(output)

def replace_3(match_object):
    return match_object.group(1)+'_'+match_object.group(2).lower()
    
def three():
    cases = ["myFunnyValentine", "temporaryVariable", "javascriptFunction", "already_in_the_right_format", "almost_but_notQuite"]
    output = []
    for x in cases:
        output.append(re.sub(r"([a-z])([A-Z])", replace_3, x))
    print(output)
    
one()
two()
three()
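
As a point of comparison for the three tasks, the sketch below uses word boundaries and a lookbehind instead of capturing the neighboring character. The function names are hypothetical, and "ALL CAPS token" is assumed to mean a run of at least two uppercase letters, so single letters such as "I" are left alone.

```python
import re

def space_around(s):
    # Normalize any amount of whitespace around / or # to exactly one space
    return re.sub(r"\s*([/#])\s*", r" \1 ", s)

def lower_all_caps(s):
    # \b ... \b restricts the match to whole tokens made only of capitals
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).lower(), s)

def camel_to_snake(s):
    # The lookbehind requires a lowercase letter or digit before the capital
    return re.sub(r"(?<=[a-z0-9])([A-Z])", lambda m: "_" + m.group(1).lower(), s)

print(space_around("good/bad"))
print(lower_all_caps("This is AMAZING!"))
print(camel_to_snake("getNamesFromUserInput"))
```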

## Note: For Problems 2-5 we will work on a sample of the IMDB Reviews dataset. Load the data into a _pandas_ _DataFrame_ (review [the basics of pandas](https://pandas.pydata.org/docs/user_guide/10min.html) if you are new to it) using the following script:

In [None]:
import pandas as pd
from fastai.data.external import URLs, untar_data

In [None]:
path = untar_data(URLs.IMDB_SAMPLE)

In [None]:
df = pd.read_csv(path/'texts.csv')

In [None]:
len(df), sum(df['is_valid'] == False), sum(df['is_valid'] == True), sum(df['label'] == 'positive'), sum(df['label'] == 'negative')

In [None]:
df.head()

## Problem 2 (20 points)

### 2a (5 points). 
- Find at least one thing that needs to be cleaned with regex in the texts. Show your Python code.
- Create train/valid split using the column 'is_valid'.

In [None]:
# One of the most obvious things that needs to be fixed is the presence of the <br /> HTML tags.
# Replace each tag with a space so the words on either side are not glued together.
df['text'] = df['text'].str.replace(r"<br\s*/?>", " ", regex=True)

# The dataset already flags its validation rows in 'is_valid', so split on that column
train_df = df[~df['is_valid']]
valid_df = df[df['is_valid']]
print(len(train_df), len(valid_df))
df.head()

In [None]:
og_df = df.copy() # saving a copy of the original df for future problems

### 2b (5 points). 
- Implement your own tokenizer for the texts. Requirements: split by space, remove most punctuation, and split common contractions (e.g., "don't" -> "do" "n't", "you'll" -> "you" "'ll"). 
- Create 3 vocabularies using top 1000, 5000, and 10000 tokens, respectively.

In [None]:
def replacements(x):
    # Lowercase, strip most punctuation, and expand common contractions
    x = x.lower()
    x = re.sub(r"[.,\/#!$%\^&\*;:{}=\-_`~()\"\?\[\]]", " ", x)
    # Irregular forms must be handled before the generic n't rule
    x = re.sub(r"\bwon't\b", "will not", x)
    x = re.sub(r"\bcan't\b", "can not", x)
    x = re.sub(r"n't", " not", x)
    x = re.sub(r"'ve", " have", x)
    x = re.sub(r"'ll", " will", x)
    x = re.sub(r"'re", " are", x)
    x = re.sub(r"'s", " is", x)
    x = re.sub(r"'d", " would", x)
    x = re.sub(r"'m", " am", x)
    x = re.sub(r"'", "", x)
    # Collapse the runs of whitespace left behind by the substitutions
    x = re.sub(r"\s+", " ", x)
    return x

df.text = df.text.apply(replacements)
df.head()
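
The cell above expands contractions ("don't" becomes "do not"). The task statement instead asks for them to be split ("do" + "n't"). A minimal sketch of that variant follows. `CONTRACTION_RE` and `tokenize` are hypothetical names, and the list of suffixes is an assumption that covers only the most common English contractions.

```python
import re

# A contraction suffix that directly follows a word character
CONTRACTION_RE = re.compile(r"(?<=\w)(n't|'ll|'re|'ve|'s|'d|'m)\b")

def tokenize(text):
    text = text.lower()
    # Put a space in front of the suffix so it becomes its own token
    text = CONTRACTION_RE.sub(r" \1", text)
    # Drop everything except word characters, whitespace, and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()

tokens = tokenize("Don't stop; you'll see it's fine.")
print(tokens)
```

Calling `split()` with no argument discards empty strings, so repeated spaces never end up as tokens in the vocabulary.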

In [None]:
from collections import Counter

def build_master_vocab(text):
    # Count every whitespace-separated token, then list tokens from most to least frequent
    counts = Counter(word for x in text for word in x.split())
    return [word for word, _ in counts.most_common()]

master_vocab = build_master_vocab(df.text)
vocab_one = master_vocab[:1000]
vocab_two = master_vocab[:5000]
vocab_three = master_vocab[:10000]

### 2c (5 points). 
- Implement and train your own Naive Bayes sentiment classifier on the _training set_. Requirements: use log scales and add-one smoothing.
- Report your model performance on the _validation set_ with each of the 3 vocabs you created in 2b.

In [None]:
def get_corpuses_text():
    # Join reviews with a space so the last word of one review does not fuse with the first word of the next
    pos_corpus = ' '.join(df.loc[(df.label == 'positive') & (df.is_valid == False)].text).split()
    neg_corpus = ' '.join(df.loc[(df.label == 'negative') & (df.is_valid == False)].text).split()
    return pos_corpus, neg_corpus

In [None]:
import math
from collections import Counter
# Class priors, estimated from the training rows only
# We're using the same dataset here, so these values won't change
train_labels = df.loc[df.is_valid == False, 'label']
neg_count = (train_labels == 'negative').sum()
pos_count = (train_labels == 'positive').sum()
total = neg_count + pos_count
overall_pos_doc_freq = math.log(pos_count/total)
overall_neg_doc_freq = math.log(neg_count/total)
def naive_bayes_train(vocab, mode):
    if mode == 'text':
        pos_corpus, neg_corpus = get_corpuses_text()
    elif mode == 'lemma':
        pos_corpus, neg_corpus = get_corpuses_lemma()
    else:
        raise ValueError("mode must be 'text' or 'lemma'")
    # Count each token once up front instead of rescanning the corpus for every vocab entry
    pos_counts = Counter(pos_corpus)
    neg_counts = Counter(neg_corpus)
    # log P(word | class) with add-one smoothing
    pos_freqs = [math.log((pos_counts[w]+1)/(len(pos_corpus)+len(vocab))) for w in vocab]
    neg_freqs = [math.log((neg_counts[w]+1)/(len(neg_corpus)+len(vocab))) for w in vocab]
    return pos_freqs, neg_freqs
        
pos_1, neg_1 = naive_bayes_train(vocab_one, 'text')
pos_2, neg_2 = naive_bayes_train(vocab_two, 'text')
pos_3, neg_3 = naive_bayes_train(vocab_three, 'text')

In [None]:
def calc_val_accuracy(predictions, labels):
    correct = 0
    for i in range(len(predictions)):
        if predictions[i] == labels[i]:
            correct = correct + 1
    return (correct/len(labels))*100

In [None]:
def naive_bayes_validate(pos_scores, neg_scores, val_set, vocab):
    # Map each word to its position once so lookups do not scan the vocab list
    word_idx = {word: idx for idx, word in enumerate(vocab)}
    classifications = ["negative"]*len(val_set.text)
    for i in range(len(val_set.text)):
        # Start from the log priors, then add the log likelihood of every known word
        points_p = overall_pos_doc_freq
        points_n = overall_neg_doc_freq
        for word in val_set.text[i].split(' '):
            idx = word_idx.get(word)
            if idx is not None:
                points_p = points_p + pos_scores[idx]
                points_n = points_n + neg_scores[idx]
        if points_p > points_n:
            classifications[i] = "positive"
    return classifications, calc_val_accuracy(classifications, val_set.label)

# The validation rows, re-indexed from 0 so positional lookups work
val_df = df.loc[df.is_valid == True].reset_index(drop=True)
v1_predictions, v1_accuracy = naive_bayes_validate(pos_1, neg_1, val_df, vocab_one)
v2_predictions, v2_accuracy = naive_bayes_validate(pos_2, neg_2, val_df, vocab_two)
v3_predictions, v3_accuracy = naive_bayes_validate(pos_3, neg_3, val_df, vocab_three)
print("Accuracy for vocab of size 1000: "+str(v1_accuracy))
print("Accuracy for vocab of size 5000: "+str(v2_accuracy))
print("Accuracy for vocab of size 10000: "+str(v3_accuracy))
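
To make the log-scale and add-one smoothing arithmetic easier to check by hand, here is a minimal self-contained sketch on a toy corpus. The functions `train_nb` and `predict_nb` and the four toy documents are hypothetical and independent of the cells above; the formula is the standard multinomial Naive Bayes estimate.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class names."""
    vocab = {w for d in docs for w in d}
    priors, likelihoods = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # Add-one smoothing: (count + 1) / (class tokens + vocab size)
        likelihoods[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                          for w in vocab}
    return priors, likelihoods

def predict_nb(doc, priors, likelihoods):
    # Sum log probabilities; words outside the vocab are skipped
    scores = {c: priors[c] + sum(likelihoods[c][w] for w in doc if w in likelihoods[c])
              for c in priors}
    return max(scores, key=scores.get)

docs = [["great", "fun"], ["boring", "bad"], ["great", "acting"], ["bad", "plot"]]
labels = ["positive", "negative", "positive", "negative"]
priors, likelihoods = train_nb(docs, labels)
print(predict_nb(["great", "plot"], priors, likelihoods))
```

With 4 tokens per class and a 6-word vocab, "great" gets (2+1)/(4+6) under the positive class and (0+1)/(4+6) under the negative class, which is why smoothing keeps unseen words from zeroing out a class.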

### 2d (5 points). Use [_spaCy_](https://spacy.io/) to _tokenize_ and _lemmatize_ this time. Get a new vocab of top 10000 lemmas. Retrain your model on this vocab and report its performance on the validation set.
(Note that spaCy relies on language-specific databases to work. Even though it is already importable, you still need to install its dependency for English. If you are in your _jupyter notebook_, create a new cell and execute the following: <code>!python -m spacy download en_core_web_sm</code>)

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
df = og_df.copy()
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
def remove_stop_spacy(x): return [token.text for token in x if not token.is_stop]
def get_lemmas_spacy(x): return [token.lemma_ for token in x if not token.is_stop]

# Parse each review once and reuse the resulting Doc for both columns
docs = list(nlp.pipe(df.text))
df['token'] = [remove_stop_spacy(doc) for doc in docs]
df['lemma'] = [get_lemmas_spacy(doc) for doc in docs]
df.head()

In [None]:
from collections import Counter

def build_lemma_vocab(text):
    # text is a column of lemma lists; list lemmas from most to least frequent
    counts = Counter(word for x in text for word in x)
    return [word for word, _ in counts.most_common()]

lemma_vocab = build_lemma_vocab(df.lemma)[:10000]

In [None]:
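def get_corpuses_lemma():
    # Flatten the per-review lemma lists into one token list per class (training rows only)
    pos_docs = df.loc[(df.label == 'positive') & (df.is_valid == False)].lemma
    neg_docs = df.loc[(df.label == 'negative') & (df.is_valid == False)].lemma
    pos_corpus = [lemma for doc in pos_docs for lemma in doc]
    neg_corpus = [lemma for doc in neg_docs for lemma in doc]
    return pos_corpus, neg_corpus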

In [None]:
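pos_4, neg_4 = naive_bayes_train(lemma_vocab, 'lemma')
# Score the validation reviews on their lemmas rather than the raw text, so tokens match the lemma vocab
val_lemma = df.loc[df.is_valid == True].reset_index(drop=True)
val_lemma['text'] = val_lemma.lemma.apply(' '.join)
v4_predictions, v4_accuracy = naive_bayes_validate(pos_4, neg_4, val_lemma, lemma_vocab)
print("Accuracy for vocab of spaCy lemmas: "+str(v4_accuracy))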

## Problem 3 (20 points)

### 3a (10 points). 
- Implement your own _subword tokenizer_ (the algorithm can be found in the slides). 
- Create 3 vocabularies of size 1000, 5000, and 10000, respectively.

In [None]:
from collections import defaultdict
df = og_df.copy()
df.text = df.text.apply(lambda x: replacements(x))

def most_common_pair(tokens):
    # Count adjacent token pairs, never pairing across a space (word boundary)
    occurrences = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != ' ' and right != ' ':
            occurrences[(left, right)] += 1
    if not occurrences:
        return None
    return max(occurrences, key=occurrences.get)

def bpe(c, k):
    # c: list of characters, with spaces marking word boundaries; k: target vocab size
    v = sorted(set(c) - {' '})
    while len(v) < k:
        pair = most_common_pair(c)
        if pair is None:
            break  # nothing left to merge
        v.append(''.join(pair))
        # Rebuild the token list, merging only the winning pair
        merged, j = [], 0
        while j < len(c):
            if j < len(c)-1 and (c[j], c[j+1]) == pair:
                merged.append(c[j]+c[j+1])
                j += 2
            else:
                merged.append(c[j])
                j += 1
        c = merged
    return v, c

chars = [*' '.join(df.text)]
bpe_vocab1, c1 = bpe(chars, 1000)
bpe_vocab2, c2 = bpe(chars, 5000)
bpe_vocab3, c3 = bpe(chars, 10000)
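
For reference, here is a minimal word-level sketch of byte-pair encoding that also shows how learned merges segment an unseen word. The slides' exact algorithm is not reproduced here; `learn_bpe`, `apply_merge`, and `segment` are hypothetical names, and the toy word list is made up for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    # Replace every adjacent occurrence of pair with the concatenated symbol
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    # Each distinct word is stored once as a tuple of symbols, with its frequency
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = Counter({tuple(apply_merge(list(s), best)): f for s, f in corpus.items()})
    return merges

def segment(word, merges):
    # Replay the merges in the order they were learned
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

merges = learn_bpe(["hug"] * 5 + ["pug"] * 2, num_merges=2)
print(merges, segment("bug", merges))
```

Counting pairs per distinct word and weighting by frequency avoids rescanning the full character stream on every merge, which matters at vocab sizes like 10000.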

### 3b (5 points). Compare the number of unknown words in your training set between the 3 tokenizers and 3 subword tokenizers.

In [None]:
def remove_known_words(known, corp):
    # Return the tokens of corp that are missing from the known vocabulary
    known = set(known)
    return [x for x in corp if x not in known]

# Word tokens of the cleaned training reviews
train_text = og_df.loc[og_df.is_valid == False, 'text'].apply(lambda x: replacements(x))
corpus = [word for text in train_text for word in text.split()]

print("Unknown words in 1000 word vocab set: " + str(len(remove_known_words(vocab_one, corpus))))
print("Unknown words in 5000 word vocab set: " + str(len(remove_known_words(vocab_two, corpus))))
print("Unknown words in 10000 word vocab set: " + str(len(remove_known_words(vocab_three, corpus))))

# Note: the subword counts compare whole words against the subword vocab; words are not segmented first
print("Unknown words in 1000 SUBword vocab set: " + str(len(remove_known_words(bpe_vocab1, corpus))))
print("Unknown words in 5000 SUBword vocab set: " + str(len(remove_known_words(bpe_vocab2, corpus))))
print("Unknown words in 10000 SUBword vocab set: " + str(len(remove_known_words(bpe_vocab3, corpus))))

### 3c (5 points). Train your Naive Bayes classifier with the subword tokenizer of 10000 tokens. Compare your model performance (better/worse/same?) and give your analysis (why).

In [None]:
pos_5, neg_5 = naive_bayes_train(bpe_vocab3, 'text')
v5_predictions, v5_accuracy = naive_bayes_validate(pos_5, neg_5, df.loc[df.is_valid == True].reset_index(drop=True), bpe_vocab3)
print("Accuracy for vocab of 10000 bpe subwords: "+str(v5_accuracy))

## Problem 4 (20 points)

### 4a (10 points). Build two probabilistic language models using 2-gram and 3-gram, respectively, on the _entire_ texts.

In [None]:
from collections import Counter
df = og_df.copy()
def get_ngrams(n, grams):
    # All contiguous n-grams of a token list, as space-joined strings
    return [' '.join(grams[i:(i+n)]) for i in range(len(grams)-n+1)]

def calc_ngram_prob(ngram, ngram_counts, context_counts, vocab_size):
    # log P(last word | previous n-1 words), with add-one smoothing so unseen n-grams stay finite
    context = ' '.join(ngram.split(' ')[:-1])
    return math.log((ngram_counts[ngram]+1)/(context_counts[context]+vocab_size))

def find_sentence_probs(n, in_sentences):
    # Join reviews with a space so words from adjacent reviews do not fuse
    text = ' '.join(df.text.apply(lambda x: replacements(x)))
    words = re.findall(r"[a-zA-Z]+", text)
    ngram_counts = Counter(get_ngrams(n, words))
    context_counts = Counter(get_ngrams(n-1, words))
    vocab_size = len(set(words))
    out_probs = []
    for in_sen in in_sentences:
        # Clean the sentence the same way as the corpus before scoring it
        sen_words = re.findall(r"[a-zA-Z]+", replacements(in_sen))
        curr_prob = 0
        for ngram in get_ngrams(n, sen_words):
            curr_prob += calc_ngram_prob(ngram, ngram_counts, context_counts, vocab_size)
        out_probs.append(curr_prob)
    return out_probs

### 4b (10 points). Generate 5 examples for each of the LMs. Compare their results.

In [None]:
test_sentences = ['This movie was great', 'Hard to believe she was the producer on this dog', 'I was so bored', 'It had very bad acting', 'It is my new favorite film']
print('Log probabilities for 2-gram calculation:')
print(find_sentence_probs(2, test_sentences))
    
print('Log probabilities for 3-gram calculation:')
print(find_sentence_probs(3, test_sentences))
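
The cell above scores fixed sentences. Since the task asks to generate examples from each LM, here is a minimal sketch of sampling from an n-gram model. `build_ngram_model` and `generate` are hypothetical names, the token list is a toy stand-in for the review corpus, and the seed must contain exactly n-1 tokens.

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    # Map each (n-1)-token context to a Counter of the words that follow it
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def generate(model, seed, length, rng):
    # Repeatedly sample the next word in proportion to its count after the current context
    out = list(seed)
    k = len(seed)
    for _ in range(length):
        followers = model.get(tuple(out[-k:]))
        if not followers:
            break  # context never seen in training
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

tokens = "the movie was great and the movie was bad".split()
bigram_model = build_ngram_model(tokens, 2)
print(generate(bigram_model, ("the",), 4, random.Random(0)))
```

For the 3-gram model, the same functions apply with `n=3` and a two-token seed such as `("the", "movie")`.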

## Problem 5 (20 points)

### 5a (10 points). 

- Run topic modeling with SVD for 2, 6, and 10 topics, respectively.
- Extract 10 keywords for each topic.
- Try to manually assign topic labels for (some of) them.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def get_topics(components, vocab, num_topics): 
    # Collect the 10 highest-weighted terms of each SVD component
    topic_word_list = []
    for i, comp in enumerate(components):
        terms_comp = zip(vocab,comp)
        sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:10]
        temp = ""
        for t in sorted_terms:
            topic = t[0]
            temp = temp + " " + str(topic)
        topic_word_list.append(temp)
    print(topic_word_list)
    return topic_word_list

def model_topics(num_topics):
    vectorizer = TfidfVectorizer(stop_words='english',smooth_idf=True) 
    # TruncatedSVD accepts the sparse TF-IDF matrix directly, so no dense conversion is needed
    in_matrix = vectorizer.fit_transform(og_df.copy().text.apply(lambda x: replacements(x)))
    svd_modeling= TruncatedSVD(n_components=num_topics, algorithm='randomized', n_iter=200, random_state=42)
    svd_modeling.fit(in_matrix)
    components=svd_modeling.components_
    vocab = vectorizer.get_feature_names_out()
    return get_topics(components, vocab, num_topics)
    
for i in [2, 6, 10]:
    model_topics(i)
    

Two of the topics from the 10-topic run can be labeled by hand:

- *' action great series bourne movie horror movies films original fun'* appears to cover thrillers: action, horror, and similar genres.
- *' movies watch love great comedy bad actors people funny life'* appears to cover lighthearted comedies, where the acting may not be great but the film is still enjoyable.

### 5b (5 points).

Do the following:
- Remove stopwords
- Lemmatize
- Keep only nouns, verbs, and adjectives with the help of the spaCy POS tagger
- Remove certain named entities (choose whatever makes sense to you)
- Remove html tags
- Remove non-ascii characters

And run SVD again for 10 topics. Compare your results with 5a.
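
Two of the listed steps, removing HTML tags and removing non-ASCII characters, need only the standard library. A minimal sketch follows; the function name `strip_html_and_non_ascii` is hypothetical, and the tag pattern is a simple assumption that treats anything between `<` and `>` as a tag. The spaCy-dependent steps (stopwords, lemmas, POS filtering, named entities) are not covered here.

```python
import re

def strip_html_and_non_ascii(text):
    # Replace tags with a space so neighboring words stay separated
    text = re.sub(r"<[^>]+>", " ", text)
    # Encoding to ASCII with errors="ignore" silently drops every non-ASCII character
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

cleaned = strip_html_and_non_ascii("Caf\u00e9<br />was <b>great</b>")
print(cleaned)
```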

### 5c (5 points). Find the 2 most similar pairs of reviews using document embeddings derived from SVD.
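
One possible approach, sketched under assumptions: take the document embeddings (for example the rows returned by `TruncatedSVD.fit_transform` on the TF-IDF matrix), compute cosine similarity for every pair, and keep the top two. The helper names `cosine` and `most_similar_pairs` are hypothetical, and the 2-dimensional vectors below are made-up stand-ins for real embeddings.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction, so treat its similarity as 0
    return dot / norm if norm else 0.0

def most_similar_pairs(embeddings, k=2):
    # Score every unordered pair of documents and keep the k highest
    scored = [(cosine(embeddings[i], embeddings[j]), i, j)
              for i, j in combinations(range(len(embeddings)), 2)]
    return sorted(scored, reverse=True)[:k]

toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
top_pairs = most_similar_pairs(toy_embeddings, k=2)
print(top_pairs)
```

Cosine similarity is used rather than Euclidean distance because SVD embeddings of longer reviews tend to have larger norms, and the angle between vectors is less sensitive to length.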