# Homework 1

## Prerequisite

1. Install [_Miniconda_](https://docs.conda.io/en/main/miniconda.html) or [_Anaconda_](https://docs.anaconda.com/anaconda/install/index.html)
2. Create a new virtual Python environment: <code>conda create -n gwnlp python=3.10</code>
3. Activate the environment: <code>conda activate gwnlp</code>. You will use this environment throughout the course, so make sure it is selected as the Python interpreter if you are using an IDE such as VS Code.
4. Install packages (this will give you pandas, pytorch, fastai, spacy, etc.): <code>conda install -c fastchan fastai</code>

## Problem 1 (20 points)

### 1a (5 points). Normalize all of the raw phone numbers with Python's `re` module. Find one pattern that works for all of them.

| Raw | Normalized |
| --- | --- |
| 2021213121 | +1 (202) 121-3121 |
| 12021213121 | +1 (202) 121-3121 |
| +12021213121 | +1 (202) 121-3121 |
| 202-121-3121 | +1 (202) 121-3121 |
| (202)  121 -   3121 | +1 (202) 121-3121 |
| (202)121-3121 | +1 (202) 121-3121 |
| 862021213121 | +86 (202) 121-3121 |

In [7]:
import re

# Match each run of digits; spaces, dashes, parentheses, and "+" are dropped.
pattern = r'\d+'

numbers = ['2021213121', '12021213121', '+12021213121', '202-121-3121', '(202)  121 -   3121', '(202)121-3121', '862021213121']

for number in numbers:
    # Concatenate all digit runs into a single string of digits.
    stripped_number = ''.join(re.findall(pattern, number))
    print(number, stripped_number)

    # The last 10 digits are the national number; anything before them is the country code.
    difference = len(stripped_number) - 10

    country_code = stripped_number[:difference] or '1'  # default to +1 when no country code is given
    area_code = stripped_number[difference:difference+3]

    first_group = stripped_number[difference+3:difference+6]
    second_group = stripped_number[difference+6:]

    phone_number_string = '+{} ({}) {}-{}'.format(country_code, area_code, first_group, second_group)
    print('\t' + phone_number_string)


2021213121 2021213121
	+1 (202) 121-3121
12021213121 12021213121
	+1 (202) 121-3121
+12021213121 12021213121
	+1 (202) 121-3121
202-121-3121 2021213121
	+1 (202) 121-3121
(202)  121 -   3121 2021213121
	+1 (202) 121-3121
(202)121-3121 2021213121
	+1 (202) 121-3121
862021213121 862021213121
	+86 (202) 121-3121
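
Because 1a asks for one pattern that covers every input, the same normalization can also be written as a single regex with capture groups. The following is a minimal sketch rather than the only valid answer; the names `PHONE` and `normalize` and the default country code `1` are assumptions made for illustration.

```python
import re

# Optional "+", optional country code (lazy, so the last 10 digits are kept for
# the national number), then 3-3-4 digit groups separated by spaces, dashes,
# or parentheses.
PHONE = re.compile(r'^\+?(\d*?)[\s()-]*(\d{3})[\s()-]*(\d{3})[\s()-]*(\d{4})$')

def normalize(raw):
    """Return raw as '+C (AAA) BBB-DDDD', or None if it does not match."""
    m = PHONE.match(raw.strip())
    if m is None:
        return None
    country, area, first, second = m.groups()
    # Assumption: a missing country code means +1.
    return '+{} ({}) {}-{}'.format(country or '1', area, first, second)

for raw in ['2021213121', '+12021213121', '(202)  121 -   3121', '862021213121']:
    print(raw, '->', normalize(raw))
```

The lazy `(\d*?)` group is what lets one pattern handle both the 10-digit and the longer inputs: the regex engine only grows the country code when the remaining digits would otherwise not fit the 3-3-4 layout.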


### 1b (15 points). Use Python's `re` module to complete the following tasks, with **one** regex pattern **for each**. Show your test samples.

1. Add spaces around / and #. E.g., "good/bad" -> "good / bad".
2. Replace tokens in ALL CAPS by their lower version. E.g., "This is AMAZING!" -> "This is amazing!".
3. Convert _camel case_ to _snake case_. E.g., "getNamesFromUserInput" -> "get_names_from_user_input".

In [40]:
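# 1: put exactly one space on each side of "/" or "#", absorbing any existing spaces
r = r'\s*([/#])\s*'
print(re.sub(r, r' \1 ', 'good/bad'))
print(re.sub(r, r' \1 ', 'issue#42'))

# 2: lowercase tokens of two or more capital letters (a lone "I" or "A" is left alone)
r = r'\b[A-Z]{2,}\b'
print(re.sub(r, lambda m: m.group(0).lower(), 'This is AMAZING!'))

# 3: an uppercase letter that follows a lowercase letter or digit starts a new word
r = r'(?<=[a-z0-9])[A-Z]'
print(re.sub(r, lambda m: '_' + m.group(0).lower(), 'getNamesFromUserInput'))

good / bad
issue # 42
This is amazing!
get_names_from_user_input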


## Note: For Problems 2-5 we will work on a sample of the IMDB Reviews dataset. Load the data into a _pandas_ _DataFrame_ (review [the basics of pandas](https://pandas.pydata.org/docs/user_guide/10min.html) if you are new to it) using the following script:

In [1]:
import pandas as pd
from fastai.data.external import URLs, untar_data


In [13]:
path = untar_data(URLs.IMDB_SAMPLE)

In [14]:
df = pd.read_csv(path/'texts.csv')

In [15]:
len(df), sum(df['is_valid'] == False), sum(df['is_valid'] == True), sum(df['label'] == 'positive'), sum(df['label'] == 'negative')

(1000, 800, 200, 476, 524)

In [16]:
df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed offic...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False


## Problem 2 (20 points)

### 2a (5 points). 
- Find at least one thing that needs to be cleaned with regex in the texts. Show your Python code.
- Create train/valid split using the column 'is_valid'.

In [19]:
# some strings have <br /> tags
string_to_fix = 'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries.'

pattern = r'\.*<\s*br\s*\/>'
re.sub(pattern, lambda m: '. ' if '.' in m.group(0).lower() else '', string_to_fix)

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script. But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries.'

In [20]:
# Replace each run of <br /> tags with a single space so the sentences on either side stay separated.
df['text'] = df['text'].str.replace(r'(?:<\s*br\s*/?>\s*)+', ' ', regex=True)
df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script. But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in ...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy. Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt like I was wa...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production. Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed officer - it is ...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False
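
The second bullet of 2a, the train/valid split, can be done with a boolean mask on `is_valid`. This is a small sketch on a stand-in DataFrame; only the column names mirror `texts.csv`, and the names `train_df` and `valid_df` are assumptions.

```python
import pandas as pd

# Tiny stand-in for the IMDB sample; only the column names mirror texts.csv.
df = pd.DataFrame({
    'label': ['negative', 'positive', 'negative', 'positive'],
    'text': ['bad film', 'great film', 'dull plot', 'loved it'],
    'is_valid': [False, False, True, False],
})

# Rows flagged is_valid form the validation set; all other rows form the training set.
train_df = df[~df['is_valid']].reset_index(drop=True)
valid_df = df[df['is_valid']].reset_index(drop=True)

print(len(train_df), len(valid_df))
```

On the full sample the same two lines should yield 800 training rows and 200 validation rows, matching the counts printed when the data was loaded.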


### 2b (5 points). 
- Implement your own tokenizer for the texts. Requirements: split by space, remove most punctuation, and split common contractions (e.g., "don't" -> "do" "n't", "you'll" -> "you" "'ll"). 
- Create 3 vocabularies using top 1000, 5000, and 10000 tokens, respectively.

In [21]:
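# tokenize
import re
from collections import Counter

def tokenize(text):
    # lower-case, then split common contractions: "don't" -> "do n't", "you'll" -> "you 'll"
    text = re.sub(r"n't\b", " n't", text.lower())
    text = re.sub(r"'(ll|re|ve|s|d|m)\b", r" '\1", text)
    # replace the remaining punctuation (except apostrophes) with spaces, then split on whitespace
    return re.sub(r"[^\w\s']", ' ', text).split()

df['tokens'] = df['text'].apply(tokenize)

# count tokens in the training rows (an assumption; the task does not say which rows to use)
counts = Counter(tok for toks in df.loc[~df['is_valid'], 'tokens'] for tok in toks)
vocabs = {k: [tok for tok, _ in counts.most_common(k)] for k in (1000, 5000, 10000)}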

### 2c (5 points). 
- Implement and train your own Naive Bayes sentiment classifier on the _training set_. Requirements: use log scales and add-one smoothing.
- Report your model's performance on the _validation set_ with each of the 3 vocabs you created in 2b.
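
The two requirements in 2c, log scales and add-one smoothing, can be illustrated on a toy corpus. This is a sketch under stated assumptions, not a reference solution: `train_nb` and `predict_nb` are hypothetical names, documents are assumed to be lists of tokens, and tokens outside the vocabulary are simply skipped at prediction time.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """docs: list of token lists; labels: list of class names; vocab: iterable of tokens."""
    vocab = set(vocab)
    classes = sorted(set(labels))
    # log prior: log(N_c / N)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    # per-class token counts, restricted to the vocabulary
    counts = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        counts[c].update(t for t in tokens if t in vocab)
    # add-one smoothing: log((count(w, c) + 1) / (total_c + |V|))
    log_lik = {}
    for c in classes:
        denom = sum(counts[c].values()) + len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + 1) / denom) for w in vocab}
    return log_prior, log_lik

def predict_nb(tokens, log_prior, log_lik):
    # Sum log-probabilities instead of multiplying probabilities;
    # tokens outside the vocabulary are skipped.
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
              for c in log_prior}
    return max(scores, key=scores.get)

train_docs = [['great', 'film'], ['loved', 'it'], ['bad', 'film'], ['dull', 'plot']]
train_labels = ['positive', 'positive', 'negative', 'negative']
vocab = {'great', 'film', 'loved', 'it', 'bad', 'dull', 'plot'}

prior, lik = train_nb(train_docs, train_labels, vocab)
print(predict_nb(['great', 'plot', 'loved'], prior, lik))
```

Validation accuracy is then the fraction of validation reviews for which `predict_nb` returns the true label, computed once per vocabulary.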

### 2d (5 points). Use [_spaCy_](https://spacy.io/) to _tokenize_ and _lemmatize_ this time. Get a new vocab of top 10000 lemmas. Retrain your model on this vocab and report its performance on the validation set.
(Note that spaCy relies on language-specific databases to work. Even though it is already importable, you still need to install its dependency for English. If you are in your _jupyter notebook_, create a new cell and execute the following: <code>!python -m spacy download en_core_web_sm</code>)

## Problem 3 (20 points)

### 3a (10 points). 
- Implement your own _subword tokenizer_ (the algorithm can be found in the slides). 
- Create 3 vocabularies of size 1000, 5000, and 10000, respectively.
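
The slides are not reproduced here, so the sketch below assumes the subword algorithm is byte-pair encoding (BPE); if the slides describe a different method, treat this only as an illustration of the general idea. The function names and the `</w>` end-of-word marker are assumptions.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from an iterable of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    corpus = Counter(tuple(w) + ('</w>',) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(symbols, best))] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Split a word into subword symbols by replaying the learned merges in order."""
    symbols = list(word) + ['</w>']
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

words = ['low'] * 5 + ['lower'] * 2 + ['newest'] * 6 + ['widest'] * 3
merges = learn_bpe(words, num_merges=10)
print(encode('lowest', merges))
```

Under this scheme the vocabulary is the set of base symbols plus one new symbol per merge, so a target size of 1000, 5000, or 10000 is reached by choosing `num_merges` as the target size minus the number of base symbols.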

### 3b (5 points). Compare the number of unknown words in your training set between the 3 tokenizers and 3 subword tokenizers.
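
One way to make the comparison in 3b concrete is to count token occurrences that fall outside each vocabulary. The sketch below assumes documents are already tokenized into lists and that "unknown" means "not in the vocabulary"; `count_unknown` is a hypothetical helper name.

```python
def count_unknown(docs, vocab):
    """Return (unknown token occurrences, total token occurrences) for a list of token lists."""
    vocab = set(vocab)
    total = sum(len(tokens) for tokens in docs)
    unknown = sum(1 for tokens in docs for t in tokens if t not in vocab)
    return unknown, total

docs = [['the', 'film', 'was', 'great'], ['a', 'dull', 'film']]
vocab = {'the', 'film', 'was'}

unknown, total = count_unknown(docs, vocab)
print(unknown, total)
```

Running the same function with each of the six tokenizer and vocabulary combinations gives directly comparable unknown rates.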

### 3c (5 points). Train your Naive Bayes classifier with the subword tokenizer of 10000 tokens. Compare your model performance (better/worse/same?) and give your analysis (why).

## Problem 4 (20 points)

### 4a (10 points). Build two probabilistic language models using 2-gram and 3-gram, respectively, on the _entire_ texts.

### 4b (10 points). Generate 5 examples for each of the LMs. Compare their results.
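
A count-based n-gram model and a sampler for it can be sketched as follows. The `<s>` and `</s>` padding tokens, the function names, and the unsmoothed maximum-likelihood sampling are all assumptions for illustration; the assignment does not prescribe them.

```python
import random
from collections import Counter, defaultdict

def build_ngram_lm(docs, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for tokens in docs:
        padded = ['<s>'] * (n - 1) + tokens + ['</s>']
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            model[context][padded[i + n - 1]] += 1
    return model

def generate(model, n, max_len=20, seed=0):
    """Sample tokens in proportion to their counts until </s> or max_len is reached."""
    rng = random.Random(seed)
    context = ('<s>',) * (n - 1)
    out = []
    for _ in range(max_len):
        words, weights = zip(*model[context].items())
        word = rng.choices(words, weights=weights)[0]
        if word == '</s>':
            break
        out.append(word)
        # Slide the context window forward by one token.
        context = (context + (word,))[1:]
    return out

docs = [['the', 'film', 'was', 'great'], ['the', 'film', 'was', 'dull']]
bigram = build_ngram_lm(docs, 2)
trigram = build_ngram_lm(docs, 3)

for seed in range(5):
    print(' '.join(generate(bigram, 2, seed=seed)), '|', ' '.join(generate(trigram, 3, seed=seed)))
```

The conditional probability of a token given its context is its count divided by the sum of the counts in `model[context]`; sampling with `weights` draws from that distribution without normalizing explicitly.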

## Problem 5 (20 points)

### 5a (10 points). 

- Run topic modeling with SVD for 2, 6, and 10 topics, respectively.
- Extract 10 keywords for each topic.
- Try to manually assign topic labels for (some of) them.
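
The steps above can be sketched on a toy corpus with NumPy's SVD. This assumes a raw document-term count matrix and ranks keywords by absolute weight in each right singular vector; the function name `svd_topics` is hypothetical, and on the real data a TF-IDF weighting or a truncated SVD from another library may be a better fit.

```python
import numpy as np

docs = [
    ['space', 'alien', 'ship', 'alien', 'alien'],
    ['alien', 'space', 'planet'],
    ['love', 'wedding', 'romance'],
    ['romance', 'love', 'kiss', 'love'],
]
vocab = sorted({t for d in docs for t in d})
index = {t: j for j, t in enumerate(vocab)}

# Document-term count matrix: rows are documents, columns are vocabulary tokens.
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for t in d:
        X[i, index[t]] += 1

def svd_topics(X, vocab, n_topics, n_keywords):
    """Return (keywords per topic, document embeddings) from a full SVD of X."""
    # X = U @ diag(S) @ Vt; each row of Vt is a topic expressed as weights over the vocabulary.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    topics = []
    for k in range(n_topics):
        top = np.argsort(-np.abs(Vt[k]))[:n_keywords]
        topics.append([vocab[j] for j in top])
    # Document embeddings: coordinates of each document in the top n_topics directions.
    return topics, U[:, :n_topics] * S[:n_topics]

topics, doc_embeddings = svd_topics(X, vocab, n_topics=2, n_keywords=3)
for k, words in enumerate(topics):
    print(k, words)
```

The second return value holds one row per document, which is the kind of document embedding that a similarity search over reviews can reuse.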

### 5b (5 points).

Do the following:
- Remove stopwords
- Lemmatize
- Keep only nouns, verbs, and adjectives with the help of the spaCy POS tagger
- Remove certain named entities (choose whatever makes sense to you)
- Remove HTML tags
- Remove non-ASCII characters

And run SVD again for 10 topics. Compare your results with 5a.

### 5c (5 points). Find the 2 most similar pairs of reviews using document embeddings derived from SVD.
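
Given one embedding vector per review, the most similar pairs can be found by scoring every pair. The sketch below assumes cosine similarity as the measure (the task does not name one) and uses made-up 2-dimensional vectors in place of real SVD embeddings; `most_similar_pairs` is a hypothetical helper name.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_pairs(embeddings, k=2):
    """Return the k index pairs (i, j), i < j, with the highest cosine similarity."""
    scored = [(cosine(embeddings[i], embeddings[j]), i, j)
              for i, j in combinations(range(len(embeddings)), 2)]
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:k]]

# Made-up embeddings standing in for rows of the SVD document matrix.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
print(most_similar_pairs(embeddings))
```

Scoring all pairs is quadratic in the number of reviews, which is still manageable for the 1000 reviews in this sample.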