# Homework 1

## Prerequisite

1. Install [_Miniconda_](https://docs.conda.io/en/main/miniconda.html) or [_Anaconda_](https://docs.anaconda.com/anaconda/install/index.html)
2. Create a new virtual Python environment: <code>conda create -n gwnlp Python=3.10</code>
3. Activate your environment (and you'll use this Python environment throughout the course - make sure it is selected as the Python interpreter if you are using an IDE like VS Code): <code>conda activate gwnlp</code>
4. Install packages (this will give you pandas, pytorch, fastai, spacy, etc.): <code>conda install -c fastchan fastai</code>

## Problem 1 (20 points)

### 1a (5 points). Normalize all of the raw phone numbers with Python RE module. Find one pattern that works for all.

| Raw | Normalized |
| --- | --- |
| 2021213121 | +1 (202) 121-3121 |
| 12021213121 | +1 (202) 121-3121 |
| +12021213121 | +1 (202) 121-3121 |
| 202-121-3121 | +1 (202) 121-3121 |
| (202)  121 -   3121 | +1 (202) 121-3121 |
| (202)121-3121 | +1 (202) 121-3121 |
| 862021213121 | +86 (202) 121-3121 |

In [53]:
import re

pattern = r'\d+'

numbers = ['2021213121', '12021213121', '+12021213121', '202-121-3121', '(202)  121 -   3121', '(202)121-3121', '862021213121']



for number in numbers:
    # print(re.match(pattern, number).group())
    stripped_number = ''
    for i in re.finditer(pattern, number):
        stripped_number+=i.string[i.regs[0][0]:i.regs[0][1]]

    n = len(stripped_number)
    print(number, stripped_number)
    
    difference = n - 10

    country_code = stripped_number[:difference]
    area_code = stripped_number[difference:difference+3]

    first_group = stripped_number[difference+3:difference+6]
    second_group = stripped_number[difference+6:]

    phone_number_string = '+{}({}) {}-{}'.format(1, area_code, first_group, second_group)
    print('\t' +phone_number_string)


2021213121 2021213121
	+1(202) 121-3121
12021213121 12021213121
	+1(202) 121-3121
+12021213121 12021213121
	+1(202) 121-3121
202-121-3121 2021213121
	+1(202) 121-3121
(202)  121 -   3121 2021213121
	+1(202) 121-3121
(202)121-3121 2021213121
	+1(202) 121-3121
862021213121 862021213121
	+1(202) 121-3121


### 1b (15 points). Use Python RE module to complete the following tasks, with **one** regex pattern **for each**. Show your test samples.

1. Add spaces around / and #. E.g., "good/bad" -> "good / bad".
2. Replace tokens in ALL CAPS by their lower version. E.g., "This is AMAZING!" -> "This is amazing!".
3. Convert _camel case_ to _snake case_. E.g., "getNamesFromUserInput" -> "get_names_from_user_input".

In [54]:
# 1

r = r'/'
print(re.sub(r, ' / ', 'good/bad'))

# 2
r = r'\b[A-Z]+(?:\s+[A-Z]+)*\b'
print(re.sub(r, lambda m: m.group(0).lower(), 'This is AMAZING!'))

# 3
r = r'[^a-z]+'
print(re.sub(r, lambda m: '_'+m.group(0).lower(), 'getNamesFromUserInput'))

good / bad
This is amazing!
get_names_from_user_input


## Note: For Problem 2 - 5 we will work on a sample of IMDB Reviews dataset. Load the data into a _pandas_ _Dataframe_ (review [the basics of pandas](https://pandas.pydata.org/docs/user_guide/10min.html) if you are new to it) using the following script:

In [1]:
import pandas as pd
from fastai.data.external import URLs, untar_data
import string

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
path = untar_data(URLs.IMDB_SAMPLE)

In [57]:
df = pd.read_csv(path/'texts.csv')

In [58]:
len(df), sum(df['is_valid'] == False), sum(df['is_valid'] == True), sum(df['label'] == 'positive'), sum(df['label'] == 'negative')

(1000, 800, 200, 476, 524)

In [59]:
df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed offic...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False


## Problem 2 (20 points)

### 2a (5 points). 
- Find at least one thing that needs to be cleaned with regex in the texts. Show your Python code.
- Create train/valid split using the column 'is_valid'.

In [133]:
string_to_fix = '"This", is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries.'

# remove <br /> and any punctuation
pattern = r'\.*<\s*br\s*\/>|[".,\/#!$%\^&\*;:{}=\-_`~()]'
re.sub(pattern, '', string_to_fix)

'This is a extremely wellmade film The acting script and camerawork are all firstrate The music is good too though it is mostly early in the film when things are still relatively cheery There are no really superstars in the cast though several faces will be familiar The entire cast does an excellent job with the scriptBut it is hard to watch because there is no good end to a situation like the one presented It is now fashionable to blame the British for setting Hindus and Muslims against each other and then cruelly separating them into two countries'

In [134]:
df_valid = df.loc[df['is_valid'] == True]
df_train = df.loc[df['is_valid'] == False]

In [135]:
df = df.replace(to_replace=pattern, value='', regex=True)
df.head()

Unnamed: 0,label,text,is_valid
0,negative,unbleepingbelievable meg ryan does not even look her usual pert lovable self in this which normally makes me forgive her shallow ticky acting schtick hard to believe she was the producer on this dog plus kevin kline what kind of suicide trip has his career been on? whoosh banzai finally this was directed by the guy who did big chill? must be a replay of jonestown hollywood style wooofff,False
1,positive,this is a extremely wellmade film the acting script and camerawork are all firstrate the music is good too though it is mostly early in the film when things are still relatively cheery there are no really superstars in the cast though several faces will be familiar the entire cast does an excellent job with the scriptbut it is hard to watch because there is no good end to a situation like the one presented it is now fashionable to blame the british for setting hindus and muslims against each other and then cruelly separating them into two countries there is some merit in this view but it i...,False
2,negative,every once in a long while a movie will come along that will be so awful that i feel compelled to warn people if i labor all my days and i can save but one soul from watching this movie how great will be my joywhere to begin my discussion of pain for starters there was a musical montage every five minutes there was no character development every character was a stereotype we had swearing guy fat guy who eats donuts goofy foreign guy etc the script felt as if it were being written as the movie was being shot the production value was so incredibly low that it felt like i was watching a junio...,False
3,positive,name just says it all i watched this movie with my dad when it came out and having served in korea he had great admiration for the man the disappointing thing about this film is that it only concentrate on a short period of the man is life interestingly enough the man is entire life would have made such an epic biopic that it is staggering to imagine the cost for productionsome posters elude to the flawed characteristics about the man which are cheap shots the theme of the movie duty honor country are not just mere words blathered from the lips of a highbrassed officer it is the deep dec...,False
4,negative,this movie succeeds at being one of the most unique movies you have seen however this comes from the fact that you ca not make heads or tails of this mess it almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid if you do not want to feel slighted youwill sit through this horrible film and develop a real sense of pity for the actors involved they have all seen better days but then you realize they actually got paid quite a bit of money to do this and youwill lose pity for them just like you hav...,False


### 2b (5 points). 
- Implement your own tokenizer for the texts. Requirements: split by space, remove most punctuations and split common abbreviations (e.g., "don't" -> "do" "n't", "you'll" -> "you" "'ll"). 
- Create 3 vocabularies using top 1000, 5000, and 10000 tokens, respectively.

In [3]:
from collections import defaultdict
import re

def remove_contraction(text):
    text = re.sub(r'n\'t', ' not', text)
    text = re.sub(r'\'re', ' are', text)
    text = re.sub(r'\'s', ' is', text)
    text = re.sub(r'\'d', ' would', text)
    text = re.sub(r'\'ll', 'will', text)
    text = re.sub(r'\'t', ' not', text)
    text = re.sub(r'\'ve', ' have', text)
    text = re.sub(r'\'m', ' am', text)

    return text

def word_count(words):
    count = defaultdict(int)

    for word in words:
        count[word] += 1

    return count

def clean_text(df, column):
    # REMOVE PUNCTUATION
    pattern = r'\.*<\s*br\s*\/>|[".,\/#!$%\^&\*;:{}=\-_`~()]'

    tokenized_df = df.replace(to_replace=pattern, value='', regex=True)

    # LOWERCASE
    tokenized_df[column] = tokenized_df[column].apply(str.lower)

    # contractions
    tokenized_df[column] = tokenized_df[column].apply(remove_contraction)

    return tokenized_df

def tokenizer(df, column):
    # words
    sentences = df[column].to_list()
    sentences = ' '.join(sentences)
    words = sentences.split(' ') 
    
    word_counts = word_count(words)
    sorted_word_counts = {k: v for k, v in sorted(word_counts.items(), key=lambda item: item[1], reverse=True)}
    
    ret = pd.DataFrame(columns=['word', 'freq', 'positive', 'negative', 'freq_pos', 'freq_neg'])
    
    ret['word'] = list(sorted_word_counts.keys())
    ret['freq'] = list(sorted_word_counts.values())

    # check if in pos class
    sentences = df.loc[df['label'] == 'positive'][column].to_list()
    sentences = ' '.join(sentences)
    words = sentences.split(' ') 
    pos = []
    freqs =[]
    for word in ret['word'].to_list():
        c = words.count(word)
        freq = c
        if c >0: 
            pos.append(True)
        else:
            pos.append(False)
        freqs.append(freq)
    ret['freq_pos'] = freqs
    
    
    # check if in neg class
    sentences = df.loc[df['label'] == 'negative'][column].to_list()
    sentences = ' '.join(sentences)
    words = sentences.split(' ') 
    
    neg = []
    freqs =[]

    for word in ret['word'].to_list():
        c = words.count(word)
        freq = c
        if c >0: 
            neg.append(True)
        else:
            neg.append(False)
        freqs.append(freq)
    ret['freq_neg'] = freqs

    ret['positive'] = pos
    ret['negative'] = neg
    return ret
    # return list(sorted_word_counts)[:len_vocab]



In [None]:
df_train, df_valid = clean_text(df_train, 'text'), clean_text(df_valid, 'text')

word_counts = tokenizer(df_train, 'text')

### 2c (5 points). 
- Implement on your own and train a Naive Bayes sentiment classifier in the _training set_. Requirements: use log scales and add-one smoothing.
- Report your model performances on the _validation set_, with the 3 vocabs your created in 2b, respectively.

In [137]:
import numpy as np
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from collections import defaultdict
def build_l_table(df, len_vocab):
    liklihood_table = defaultdict(dict)

    top_n = df.iloc[:len_vocab, :]

    total_pos = top_n['freq_pos'].sum()
    total_neg = top_n['freq_neg'].sum()

    for word in tqdm(top_n['word'], desc='building liklihood table'):
        pos_lik = int(top_n.loc[top_n['word'] == word]['freq_pos'])
        neg_lik = int(top_n.loc[top_n['word'] == word]['freq_neg'])

        liklihood_table[word]['pos'] = np.log((pos_lik + 1) / (total_pos + len_vocab))
        liklihood_table[word]['neg'] = np.log((neg_lik + 1) / (total_neg + len_vocab))
    
    return liklihood_table, (total_pos, total_neg)



def priors(df):
    N = len(df)

    positive_prior = df['label'].value_counts()['positive'] / N
    negative_prior = df['label'].value_counts()['negative'] / N

    return np.log(positive_prior), np.log(negative_prior)

def nb_preprocess(df_train, word_counts, len_vocab):
    p_prior, n_prior = priors(df_train)

    likelihood_table, total_for_classes = build_l_table(word_counts, len_vocab)

    return (p_prior, n_prior), likelihood_table, total_for_classes

def naive_bayes(priors, likelihoods, total_for_classes, eval_sentences):
    preds = []

    p_prior, n_prior = priors


    for sentence in tqdm(eval_sentences):

        pos_log_lik, neg_log_lik = 0, 0
        for word in sentence.split(' '):
            if word not in list(likelihoods.keys()):
                pos_log_lik += np.log(1 / total_for_classes[0])
                neg_log_lik += np.log(1 / total_for_classes[1])
            else:
                if 'pos' not in list(likelihoods[word].keys()):
                    pos_log_lik += np.log(1 / total_for_classes[0])
                else:
                    pos_log_lik += likelihoods[word]['pos']

                if 'neg' not in list(likelihoods[word].keys()):
                    neg_log_lik += np.log(1 / total_for_classes[1])
                else:
                    neg_log_lik += likelihoods[word]['neg']

        
        pos_pred = p_prior + pos_log_lik
        neg_pred = n_prior + neg_log_lik

        pred = 'positive' if pos_pred > neg_pred else 'negative'

        preds.append(pred)

    return preds

In [138]:
p, l, total_for_classes = nb_preprocess(df_train, word_counts, 1000)
preds = naive_bayes(p, l, total_for_classes, df_valid['text'].to_list())

accuracy_score(df_valid['label'].to_list(), preds)

building liklihood table: 100%|██████████| 1000/1000 [00:00<00:00, 1760.64it/s]
100%|██████████| 200/200 [00:00<00:00, 566.50it/s]


0.695

In [139]:
p, l, total_for_classes = nb_preprocess(df_train, word_counts, 5000)
preds = naive_bayes(p, l, total_for_classes, df_valid['text'].to_list())

accuracy_score(df_valid['label'].to_list(), preds)

building liklihood table: 100%|██████████| 5000/5000 [00:04<00:00, 1199.11it/s]
100%|██████████| 200/200 [00:01<00:00, 132.78it/s]


0.77

In [140]:
p, l, total_for_classes = nb_preprocess(df_train, word_counts, 10000)
preds = naive_bayes(p, l, total_for_classes, df_valid['text'].to_list())

accuracy_score(df_valid['label'].to_list(), preds)

building liklihood table: 100%|██████████| 10000/10000 [00:11<00:00, 835.99it/s]
100%|██████████| 200/200 [00:02<00:00, 71.34it/s]


0.755

### 2d (5 points). Use [_spaCy_](https://spacy.io/) to _tokenize_ and _lemmatize_ this time. Get a new vocab of top 10000 lemmas. Retrain your model on this vocab and report its performance on the validation set.
(Note that spaCy relies on language-specific databases to work. Even though it is already importable, you still need to install its dependency for English. If you are in your _jupyter notebook_, create a new cell and execute the following: <code>!python -m spacy download en_core_web_sm</code>)

In [38]:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

In [141]:
df = pd.read_csv(path/'texts.csv')

df_valid = df.loc[df['is_valid']]
df_train = df.loc[df['is_valid'] == False]

In [142]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def frequency(words):
    res = {}
 
    for keys in words:
        res[keys] = res.get(keys, 0) + 1
    return res

def get_lemmas(df, label):
    lemmas = []
    for i in range(len(df.loc[df['label'] == label])):
        doc = nlp(df.iloc[i, 1])

        for token in doc:
            if not token.is_stop:
                lemmas.append(token.lemma_.lower())
    return lemmas

    
def get_lemmas_valid(text):
    lemmas = []
    doc = nlp(text)

    for token in doc:
        if not token.is_stop:
            lemmas.append(token.lemma_.lower())
    return " ".join(lemmas)

    

def spacy_tokenizer(df):
    ret = pd.DataFrame(columns=['word', 'freq', 'positive', 'negative', 'freq_pos', 'freq_neg'])

    pos_lemmas = get_lemmas(df, 'positive')
    neg_lemmas = get_lemmas(df, 'negative')

    pos_freq_dict = frequency(list(pos_lemmas))
    neg_freq_dict = frequency(list(neg_lemmas))

    words = set(pos_lemmas).union(set(neg_lemmas))

    for word in words:
        if word not in pos_freq_dict:
            pos_freq = 0
        else:
            pos_freq = pos_freq_dict[word]

        if word not in neg_freq_dict:
            neg_freq = 0
        else:
            neg_freq = neg_freq_dict[word] 
        row = {
            'word': word,
            'freq':  pos_freq + neg_freq,
            'positive': True if word in pos_freq_dict else False,
            'negative': True if word in neg_freq_dict else False,
            'freq_pos': pos_freq,
            'freq_neg': neg_freq
        }

        ret = ret.append(row, ignore_index=True);

    return ret
    # print(pos_lemmas)

In [143]:
word_counts = spacy_tokenizer(df_train)

In [144]:
df_valid['text'] = df_valid['text'].apply(get_lemmas_valid)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid['text'] = df_valid['text'].apply(get_lemmas_valid)


In [145]:
p, l, total_for_classes = nb_preprocess(df_train, word_counts, 10000)
preds = naive_bayes(p, l, total_for_classes, df_valid['text'].to_list())

accuracy_score(df_valid['label'].to_list(), preds)

building liklihood table: 100%|██████████| 10000/10000 [00:10<00:00, 931.67it/s]
100%|██████████| 200/200 [00:02<00:00, 74.39it/s]


0.55

## Problem 3 (20 points)

In [46]:
import pandas as pd
from fastai.data.external import URLs, untar_data
import string

In [47]:
df = pd.read_csv(path/'texts.csv')
df = clean_text(df, 'text')
df_valid = df.loc[df['is_valid']]
df_train = df.loc[df['is_valid'] == True]

### 3a (10 points). 
- Implement your own _subword tokenizer_ (the algorithm can be found in the slides). 
- Create 3 vocabularies of size 1000, 5000, and 10000, respectively.

In [27]:
def most_common_pair(tokens):
    occurences = defaultdict(int)
    for i in range(len(tokens) - 1):
        occurences[(tokens[i] , tokens[i+1])] += 1
    sorted_occurences = {k: v for k, v in sorted(occurences.items(), key=lambda item: item[1], reverse=True)}

    return list(sorted_occurences.keys())[0]

def bpe_tokenizer(strings, target_vocab_size):
    V = []
    for string in strings:
        # V.union(set(string.replace(' ', '')))
        V.extend(list(string.replace(' ', '')))
        
    V = list(set(V))
    C = list(' '.join(strings))
    while len(V) < target_vocab_size:
        tL, tR = most_common_pair(C)

        tNew = tL + tR

        tNew = tNew
        V.append(tNew)

        C_new = []

        i = 0
        while i < len(C)-1:
            if C[i] == tL and C[i+1] == tR:
                C_new.append(tNew)
                i += 2
            else:
                C_new.append(C[i])
                i+=1
            # strings[idx] = ' '.join(s)
        C = C_new
        print("\r\rVocab Size = {}".format(len(V)), end='')
    print()
    return V

# V_1000 = bpe_tokenizer(df_train['text'].to_list(), target_vocab_size=1000)
# V_5000 = bpe_tokenizer(df_train['text'].to_list(), target_vocab_size=5000)
V_10000 = bpe_tokenizer(df_train['text'].to_list(), target_vocab_size=10000)

Vocab Size = 10000


### 3b (5 points). Compare the number of unknown words in your training set between the 3 tokenizers and 3 subword tokenizers.

### 3c (5 points). Train your Naive Bayes classifier with the subword tokenizer of 10000 tokens. Compare your model performance (better/worse/same?) and give your analysis (why).

In [44]:
def wc(df, column, vocab):
    ret = pd.DataFrame(columns=['word', 'positive', 'negative', 'freq_pos', 'freq_neg'])
    
    ret['word'] = vocab


    # check if in pos class
    sentences = df.loc[df['label'] == 'positive'][column].to_list()
    sentences = ' '.join(sentences)
    words = sentences.split(' ')

    freqs =[]
    pos = []

    for v in vocab:
        c = words.count(v)
        freq = c
        if c >0: 
            pos.append(True)
        else:
            pos.append(False)
        freqs.append(freq)
    ret['freq_pos'] = freqs
    
    
    # check if in neg class
    sentences = df.loc[df['label'] == 'negative'][column].to_list()
    sentences = ' '.join(sentences)
    words = sentences.split(' ') 
    
    neg = []
    freqs =[]

    for v in vocab:
        c = words.count(v)
        freq = c
        if c >0: 
            neg.append(True)
        else:
            neg.append(False)
        freqs.append(freq)
    ret['freq_neg'] = freqs

    ret['positive'] = pos
    ret['negative'] = neg
    return ret

word_counts = wc(df_train, 'text', V_10000)

In [36]:
p, l, total_for_classes = nb_preprocess(df_train, word_counts, 1000)
preds = naive_bayes(p, l, total_for_classes, df_valid['text'].to_list())

accuracy_score(df_valid['label'].to_list(), preds)

building liklihood table: 100%|██████████| 1000/1000 [00:00<00:00, 1699.73it/s]
100%|██████████| 200/200 [00:00<00:00, 499.34it/s]


0.635

# 3) Answer

## Results

For 10000 word vocabs these were my results:

- My own Tokenizer: 77
- Spacy:            55
- BPE:              63.5

## Analysis

I think my tokenizer had better quality data, leading to better results, for a few reasons: 
1) I was able to remove punctuation better. Especially for Spacy tokenization things like HTML tags were still left in the strings.
2) Neither broke apart contractions. I think having separate words for contractions helped a lot
3) I was not really sure how to use spacy properly! I think I did it right, but I was definitely not a master of it.

## Problem 4 (20 points)

### 4a (10 points). Build two probabilistic language models using 2-gram and 3-gram, respectively, on the _entire_ texts.

In [4]:
import pandas as pd
from fastai.data.external import URLs, untar_data
import string
import spacy
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

path = untar_data(URLs.IMDB_SAMPLE)

df = pd.read_csv(path/'texts.csv')
df = clean_text(df, 'text')
# df_train = df.loc[df['is_valid'] == False]


  from .autonotebook import tqdm as notebook_tqdm


In [5]:
all_words = set(' '.join(df['text'].to_list()).split(' '))

In [6]:
cv = CountVectorizer(ngram_range=(2,2))
fitted_cv = cv.fit_transform(df['text'].to_list())
freq_df = pd.DataFrame(fitted_cv.toarray(), columns = cv.get_feature_names_out())

In [7]:
len(df_train['text'].to_list())

800

In [8]:
freq_df

Unnamed: 0,007 game,00s that,010ps what,03oct2009 observed,07 knew,081006 with,10 are,10 based,10 because,10 but,...,zooms frantically,zucco all,zuccopreparing his,zulu later,zulu people,zulu to,zulus without,zuniga of,zvyagvatsev 2003,über alles
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
796,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
797,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
798,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
from tqdm import tqdm
def bi_prior_hist(freq_df, history):
    # probability of the history
    # pr = c(0, 1)/ c(0) * c(1, 2) / c(1)...
    
    # biggram history
    pr = 1
    for i in range(0, len(history) - 1):
        g = history[i] + ' ' + history[i+1]

        occurences_of_i = 0
        for c in freq_df.columns:
            if history[i] in c:
                occurences_of_i += freq_df[c].sum()
        if g in freq_df.columns:
            pr *= freq_df[g].sum() / occurences_of_i
            # pr *= df[g].sum(axis=1) / occurences_of_i
    return pr

def bi_gram(freq_df, history, word, pr):
    g = history[-1] + ' ' + word
    occurences_of_i = 0
    for c in freq_df.columns:
        if history[-1] in c:
            occurences_of_i += freq_df[c].sum()
    if g in freq_df.columns:
        pr *= freq_df[g].sum() / occurences_of_i

    return pr

def bi_gram_model(all_words, freq_df, history, n):

    ret = ' '.join(history)

    pr = bi_prior_hist(freq_df, history)
    for i in tqdm(range(n)):
        m_prob, m_word = 0, ''

        for word in all_words:
            pr = bi_gram(freq_df, history, word, pr)
            if pr > m_prob:
                m_word = word
                m_prob = pr
        return ret + m_word







### 4b (10 points). Generate 5 examples for each of the LM. Compare their results.

In [None]:
print(bi_gram_model(all_words, freq_df, ['this', 'movie', 'is'], 10))
print(bi_gram_model(all_words, freq_df, ['the', 'english', 'man'], 10))
print(bi_gram_model(all_words, freq_df, ['zulu', 'people', 'are'], 10))
print(bi_gram_model(all_words, freq_df, ['bad', 'movie'], 10))
print(bi_gram_model(all_words, freq_df, ['bad', 'acting', 'man'], 10))

## Problem 5 (20 points)

### 5a (10 points). 

- Run topic modeling with SVD for 2, 6, and 10 topics, respectively.
- Extract 10 keywords for each topic.
- Try to mannually assign topic labels for (some of) them.

In [27]:
import pandas as pd
from fastai.data.external import URLs, untar_data
import string
import spacy
import numpy as np
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")
path = untar_data(URLs.IMDB_SAMPLE)

df = pd.read_csv(path/'texts.csv')
df = df.loc[df['is_valid'] == False]

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition
from scipy import linalg
from sklearn.feature_extraction import _stop_words as stop_words


In [28]:
vectorizer = TfidfVectorizer(smooth_idf=True)
vectors = vectorizer.fit_transform(df['text'].to_list()).todense()
vocab = np.array(vectorizer.get_feature_names_out())

In [29]:
def print_topics(comps, n_topics):
    topic_word_list = []
    for i, comp in enumerate(comps):
        terms_comp = zip(vocab,comp)
        sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:n_topics]
        topic=" "
        for t in sorted_terms:
            topic= topic + ' ' + t[0]
        topic_word_list.append(topic)
        # print(topic_word_list)
    return topic_word_list

In [30]:
n_topics = 10
svd = decomposition.TruncatedSVD(n_components=n_topics, algorithm='randomized')

svd.fit(np.asarray(vectors))
components = svd.components_
print_topics(components, n_topics)

['  the and of br to it is in that this',
 '  br she her his episode he of as woman by',
 '  br movie it you was this bad my so if',
 '  the was movie were of series bad worst br plot',
 '  was he she her his had were didn in to',
 '  he movie his you the is bad man him if',
 '  movie and her she is good very story this acting',
 '  she her you is the where if not to but',
 '  film is he this was bad it very films the',
 '  you film was this will to love the my see']

In [31]:
n_topics = 6
svd = decomposition.TruncatedSVD(n_components=n_topics, algorithm='randomized')

svd.fit(np.asarray(vectors))
components = svd.components_
print_topics(components, n_topics)

['  the and of br to it',
 '  br she her his episode he',
 '  br movie it you was this',
 '  he his is to and her',
 '  was he she her his had',
 '  film and it her she would']

In [32]:
n_topics = 2
svd = decomposition.TruncatedSVD(n_components=n_topics, algorithm='randomized')

svd.fit(np.asarray(vectors))
components = svd.components_
print_topics(components, n_topics)

['  the and', '  br she']

# 5A
It is hard to assign topics that make sense to this. The most distinguishable one is that the movie is bad. However, it is confused by the stop words and other distracting characters.

### 5b (5 points).

Do the following:
- Remove stopwords
- Lemmatize
- Keep only nouns, verbs, and adjs with the help of spaCy POS tagger
- Remove certain named entities (choose whatever makes sense to you)
- Remove html tags
- Remove non-ascii characters

And run SVD again for 10 topics. Compare your results with 5a.

In [33]:
path = untar_data(URLs.IMDB_SAMPLE)

df = pd.read_csv(path/'texts.csv')
df = df.loc[df['is_valid'] == False]
df = clean_text(df, 'text') # clean removes tags and non ascii!

In [34]:
def POS_remove(text):
    """
    WILL REMOVE PUNCT, STOPS, AND LEAVE ONLY NOUN, VERB, ADJ
    """
    
    doc = nlp(text)
    new_text = ""
    for token in doc:
        if not token.is_stop and (token.pos_ in ['NOUN', 'VERB', 'ADJ']) and not token.is_punct:
            new_text += token.text + ' '
    return text

df['text'] = df['text'].apply(POS_remove)

In [35]:
vectorizer = TfidfVectorizer(stop_words='english',smooth_idf=True) # takes out other stops as well
vectors = vectorizer.fit_transform(df['text'].to_list()).todense()
vocab = np.array(vectorizer.get_feature_names_out())

n_topics = 10
svd = decomposition.TruncatedSVD(n_components=n_topics, algorithm='randomized')

svd.fit(np.asarray(vectors))
components = svd.components_
print_topics(components, n_topics)

['  movie film like just really good story did bad movies',
 '  movie movies bad acting good did worst horrible think plot',
 '  episode series love family story man best life characters son',
 '  bad episode little like effects lake horror script monster pretty',
 '  really good liked great bourne thought story end action did',
 '  bad acting performance action character actors attempt plot davis role',
 '  story watch love series episode movies watching like worst acting',
 '  series bourne war just did gundam great episode really characters',
 '  did just character story think men got work know women',
 '  action scenes bourne just scene minutes films interesting time son']

# Comparing to 5A

You can see that there are clearer topics in my 5B topics than to my 5A topics. This is because we removed useless stop words, lemmatized and removed other distracting things like html tags and non ascii characters. 

The topics in 5A did not make much sense where as most of these topics follow one theme.
- "movie is bad"
- "bad acting"
- "bad series"
- "horror"

### 5c (5 points). Find 2 most similar pairs of reviews using document embeddings derived from SVD.

In [None]:
import numpy as np
from tqdm import tqdm
def cosine_similarity(v,u):
    return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u))

def find_closest(V):
    m = -np.inf
    m1, m2 = -1, -1
    for c1 in tqdm(range(1, V.shape[1])):
        for c2 in range(1, V.shape[1]):
            similarity = cosine_similarity(V[:,c1], V[:,c2])
            if similarity > m and c1 != c2:
                m = similarity
                m1 = c1
                m2 = c2
    return m, m1, m2
sim, i1, i2 = find_closest(components)
print(sim, i1, i2)
print(df_train.iloc[i1, 1])
print(df_train.iloc[i2, 1])