# HW3: Natural Language Processing

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description.
- Follow the Submission Instruction to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT a group assignment**. 
    - Never copy others' work (even with minor modifications, e.g. changing variable names).
    - If you generate code using large language models (although this is not encouraged), make sure to adapt the generated code so that it meets all requirements and is executable.
    - Anti-Plagiarism software will be used to check similarities between all submissions.
    - Check Syllabus for more details.

## Q1: Extract data using regular expression (2 points)
Suppose you have scraped the text shown below from an online source (https://finance.yahoo.com/). Write `a single regular expression` to convert the text into a list of tuples `(Symbol, Last Price, Change, % Change)` as shown below.


In [1]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import nltk
from sklearn.metrics import pairwise_distances
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import normalize
import re
import json
import pprint as pp
import spacy
from collections import defaultdict

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [20]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [27]:
text='''BTC-USD

Bitcoin USD
	63,473.52	+1,498.26	+2.42%
ETH-USD

Ethereum USD
	3,471.85	+39.03	+1.14%
USDT-USD

Tether USDt USD
	1.00	-0.00	-0.01%
BNB-USD

BNB USD
	414.75	+4.12	+1.00%
SOL-USD

Solana USD
	128.86	-1.31	-1.01%'''



In [28]:
# Capture only the symbol; the blank line and company name are matched but not captured
result = re.findall(r'(\w+-\w+)\s+.*\s+([\d,]+\.\d+)\s+([+-][\d,]+\.\d+)\s+([+-][\d.]+%)', text)
pp.pprint(result)

[('BTC-USD', '63,473.52', '+1,498.26', '+2.42%'),
 ('ETH-USD', '3,471.85', '+39.03', '+1.14%'),
 ('USDT-USD', '1.00', '-0.00', '-0.01%'),
 ('BNB-USD', '414.75', '+4.12', '+1.00%'),
 ('SOL-USD', '128.86', '-1.31', '-1.01%')]
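
A pattern like this can be easier to audit when written with `re.VERBOSE`, one commented line per field. The following is a minimal sketch on a two-row excerpt of the text; the variable names `sample`, `pattern` and `rows` are illustrative, and the company-name line is matched but not captured so that each tuple holds only the four requested fields.

```python
import re

# Two-row excerpt of the scraped text (tabs separate the numeric columns).
sample = (
    "BTC-USD\n\nBitcoin USD\n\t63,473.52\t+1,498.26\t+2.42%\n"
    "USDT-USD\n\nTether USDt USD\n\t1.00\t-0.00\t-0.01%"
)

pattern = re.compile(r"""
    ([A-Z0-9]+-[A-Z]+)       # Symbol, for example BTC-USD
    \s+.*\s+                 # blank line and company name, matched but not captured
    ([\d,]+\.\d+)            # Last Price, for example 63,473.52
    \s+([+-][\d,]+\.\d+)     # Change, always signed
    \s+([+-][\d.]+%)         # % Change, signed and ending with a percent sign
""", re.VERBOSE)

rows = pattern.findall(sample)
print(rows)
```

Because `.` does not match a newline by default, `.*` stays on the company-name line and cannot swallow the numeric columns that follow.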


## Q2: Develop a QA system (8 points)


Objective: Find the sentence in an article that best answers a question. A dataset has been provided. Please follow the instructions below carefully to develop this system.

In [29]:
data = json.load(open("qa.json", "r"))
data[5]

{'text': 'In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise". It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks. It was the #1 single of 1995 for all genres, and was a global hit, as it reached #1 in the United States, United Kingdom, Ireland, France, Germany, Italy, Sweden, Austria, Netherlands, Norway, Switzerland, Australia, and New Zealand. The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta\'s Paradise", titled "Amish Paradise". At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.  Originally "Gangsta\'s Paradise" was not meant to be included on one of Coolio\'s studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track. The title track sampled the chorus and music

In [30]:
# randomly select one article to test your code

idx = 5

text = data[idx]["text"]
text

qs = [item["question"] for item in  data[idx]['qa']]
qs

ans =[item["answer"] for item in  data[idx]['qa']]
ans


'In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise". It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks. It was the #1 single of 1995 for all genres, and was a global hit, as it reached #1 in the United States, United Kingdom, Ireland, France, Germany, Italy, Sweden, Austria, Netherlands, Norway, Switzerland, Australia, and New Zealand. The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta\'s Paradise", titled "Amish Paradise". At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.  Originally "Gangsta\'s Paradise" was not meant to be included on one of Coolio\'s studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track. The title track sampled the chorus and music of the s

["What was the relationship between Coolio and Gangsta's parapdise?",
 'WHen was the song released?',
 'Which record label release the song?',
 'Did the song have a high sales?',
 'Did he wind any award?',
 'Which other names were mention n the song?',
 'What was their contribution to the song?',
 'Which other song did he make?']

['Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta\'s Paradise',
 'In 1995,',
 'RIAA.',
 'It would become one of the most successful rap songs of all time, reaching #1 on the Billboard Hot 100 for 3 weeks.',
 'At the 1996 Grammy Awards, the song won Coolio a Grammy for Best Rap Solo Performance.',
 'Too Hot" with J.T. Taylor of Kool & the Gang doing the chorus.',
 'J.T. Taylor of Kool & the Gang doing the chorus.',
 "Sumpin' New"]

### **Q2.1.** Tokenize function (3 points)

Define a function `tokenize(doc, lemmatized = True, remove_stopword = True)`  as follows: 

   - Take three parameters: 
       - `doc`: an input string (e.g. a question)
       - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is True (i.e. tokens are lemmatized). 
       - `remove_stopword`: an optional boolean parameter to indicate if stop words are removed. The default value is True (i.e. remove stop words). 
   - First split the text into sentences.
   - Split each sentence into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams.
       - if `remove_stopword` is turned on, remove all stop words.
   - Convert all unigrams to lower case and remove punctuation and empty tokens
   - Count the frequency of each word in each sentence and save the result into a dictionary (see sample output)
   - Return the resulting **sentences** and **dictionary** after all the processing. 
   
   
(Hint: you can use spacy package for this task. For reference, check https://spacy.io/api/token#attributes)
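
To make the expected return shape concrete, here is a rough sketch that uses plain regular expressions as a stand-in for spaCy. It is only an illustration: `toy_tokenize` and the tiny `STOP` set are hypothetical, sentence splitting is naive, and no lemmatization is performed.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; spaCy's own list is much larger.
STOP = {"the", "a", "an", "of", "it", "was", "in", "for"}

def toy_tokenize(doc, remove_stopword=True):
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sents = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
    vocab = {}
    for i, sent in enumerate(sents):
        # Lowercase, then keep runs of letters, digits, & and apostrophes.
        tokens = re.findall(r"[a-z0-9&']+", sent.lower())
        if remove_stopword:
            tokens = [t for t in tokens if t not in STOP]
        # Per-sentence word frequencies, keyed by sentence index.
        vocab[i] = dict(Counter(tokens))
    return vocab, sents

vocab_demo, sents_demo = toy_tokenize("It was the #1 single of 1995. The song won a Grammy.")
print(vocab_demo)
```

The result is a dictionary keyed by sentence index whose values map each word to its count in that sentence, which is the structure the later TF-IDF step consumes.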

In [26]:
# Load the model by package name so the notebook does not depend on a machine-specific path
nlp = spacy.load('en_core_web_sm')

In [71]:
def tokenize(doc, lemmatized=True, remove_stopword=True):

    # Initialize a dictionary and list to store words and sentences, respectively
    vocab, sents = {}, []
    
    # Run the spaCy pipeline on the whole document
    docs = nlp(doc)
    
    # Iterate to tokenize each sentence
    for i, sent in enumerate(docs.sents):
        temp_dict = defaultdict(int)
        sents.append(sent)
        for token in sent:
            
            # Remove punctuations and space
            if token.is_punct or token.is_space:
                continue
            
            # Check lemma
            if lemmatized:
                text = token.lemma_
            else:
                text = token.text
                
            # Convert to lowercase
            text = text.lower()
            
            # Skip stop words when requested, otherwise count the token
            if remove_stopword and token.is_stop:
                continue
            temp_dict[text] += 1
        
        # Store vocab in the dictionary
        vocab[i] = temp_dict
    
    return vocab, sents

#### Lemmatized=True, remove_stopword=True

In [77]:
print("1.lemmatized=True, remove_stopword=True\n")

# concatenate questions to the text and tokenize together
vocab, sents = tokenize(text + '\n' +' '.join(qs), lemmatized=True, remove_stopword=True)
pp.pprint(vocab)
pp.pprint(sents)

1.lemmatized=True, remove_stopword=True

{0: defaultdict(<class 'int'>,
                {'1995': 1,
                 'coolio': 1,
                 'dangerous': 1,
                 'feature': 1,
                 'gangsta': 1,
                 'lv': 1,
                 'minds': 1,
                 'movie': 1,
                 'paradise': 1,
                 'r&b': 1,
                 'singer': 1,
                 'song': 1,
                 'title': 1}),
 1: defaultdict(<class 'int'>,
                {'1': 1,
                 '100': 1,
                 '3': 1,
                 'billboard': 1,
                 'hot': 1,
                 'rap': 1,
                 'reach': 1,
                 'song': 1,
                 'successful': 1,
                 'time': 1,
                 'week': 1}),
 2: defaultdict(<class 'int'>,
                {'1': 2,
                 '1995': 1,
                 'australia': 1,
                 'austria': 1,
                 'france': 1,
                 'genre': 1,
                 'germany': 1,
  

#### Lemmatized=True, remove_stopword=False

In [73]:
# Test another configuration
print("2.lemmatized=True, remove_stopword=False\n")
vocab, sents = tokenize(text + '\n' +' '.join(qs), lemmatized=True, remove_stopword=False)
pp.pprint(vocab)

2.lemmatized=True, remove_stopword=False

{0: defaultdict(<class 'int'>,
                {"'s": 1,
                 '1995': 1,
                 'a': 1,
                 'coolio': 1,
                 'dangerous': 1,
                 'feature': 1,
                 'for': 1,
                 'gangsta': 1,
                 'in': 1,
                 'lv': 1,
                 'make': 1,
                 'minds': 1,
                 'movie': 1,
                 'paradise': 1,
                 'r&b': 1,
                 'singer': 1,
                 'song': 1,
                 'the': 1,
                 'title': 1}),
 1: defaultdict(<class 'int'>,
                {'1': 1,
                 '100': 1,
                 '3': 1,
                 'all': 1,
                 'become': 1,
                 'billboard': 1,
                 'for': 1,
                 'hot': 1,
                 'it': 1,
                 'most': 1,
                 'of': 2,
                 'on': 1,
                 'one': 1,
                 'rap': 1,
             

### **Q2.2.** Compute TF-IDF (1 point)

Define a function `compute_tfidf(vocab)` as follows: 

- Take the dictionary returned in Q2.1 as an input.
- Calculate tf_idf weights as shown in lecture notes (Hint: feel free to reuse the code segment in NLP Lecture Notes (II))
- Return the smoothed normalized `tf_idf` array and the words corresponding to the columns of the tfidf array.
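
As a worked illustration of these steps, the sketch below computes smoothed, normalized tf-idf weights in plain Python for a hypothetical three-sentence `toy_vocab`. It assumes the smoothed idf is ln((N + 1) / (df + 1)) + 1 and that "normalized" means each row is scaled to unit L2 length; check both assumptions against the lecture notes.

```python
import math

# Hypothetical per-sentence word counts, in the same shape tokenize returns.
toy_vocab = {
    0: {"song": 2, "rap": 1},
    1: {"song": 1, "award": 1},
    2: {"album": 1},
}

words = sorted({w for counts in toy_vocab.values() for w in counts})
n_docs = len(toy_vocab)

# Document frequency: number of sentences that contain each word.
df = {w: sum(1 for counts in toy_vocab.values() if w in counts) for w in words}

# Smoothed idf: ln((N + 1) / (df + 1)) + 1
idf = {w: math.log((n_docs + 1) / (df[w] + 1)) + 1 for w in words}

tfidf_demo = []
for i in sorted(toy_vocab):
    counts = toy_vocab[i]
    total = sum(counts.values())
    # Term frequency normalized by sentence length, weighted by idf.
    row = [counts.get(w, 0) / total * idf[w] for w in words]
    # Scale the row to unit L2 length (guarding against an all-zero row).
    norm = math.sqrt(sum(x * x for x in row)) or 1.0
    tfidf_demo.append([x / norm for x in row])

for row in tfidf_demo:
    print([round(x, 3) for x in row])
```

A word that appears in fewer sentences, such as "album" here, receives a larger idf than one that appears in several, such as "song".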
 

In [82]:
def compute_tfidf(vocab):
     
    # Build the document-term matrix: one row per sentence, one column per word
    dtm = pd.DataFrame.from_dict(vocab, orient="index")
    dtm = dtm.fillna(0)
    dtm = dtm.sort_index(axis = 0)
      
    # Get normalized term frequency (tf) matrix        
    tf = dtm.values
    doc_len = tf.sum(axis=1, keepdims=True)
    doc_len[doc_len == 0] = 1  # avoid division by zero for sentences with no tokens
    tf = np.divide(tf, doc_len)
    
    # Get smoothed idf from the document frequency of each word
    df = np.where(tf>0, 1, 0)
    smoothed_idf = np.log(np.divide(len(vocab)+1, np.sum(df, axis=0)+1)) + 1    
    
    # L2-normalize each row so the result is the smoothed normalized tf-idf
    smoothed_tf_idf = normalize(tf * smoothed_idf)
 
    # Get the words corresponding to the columns of the tfidf array
    words = dtm.columns
    
    return smoothed_tf_idf, words


In [83]:
tfidf, words = compute_tfidf(vocab)

# show shape of tfidf matrix
tfidf.shape

(21, 150)

### **Q2.3.** Put everything together to match questions and answers. (4 points)


Define a function `Match(text, questions, lemmatized = True, remove_stopword = True, topK = 3)`  as follows: 
- Take five inputs:
    - `text`: a paragraph
    - `questions`: a list of questions
    - `lemmatized`, `remove_stopword`: same as defined in Q2.1
    - `topK`: the number of top-ranked answers to return for each question
- Tokenize the concatenated text and questions using the `tokenize` function as defined in Q2.1.
- Calculate the smoothed normalized tf_idf matrix for the concatenated text
- Split the tf_idf matrix into sub-matrices for the text and questions respectively
- For each question, find the top-K sentences that may answer it based on the TF-IDF similarities between the question and the sentences
- Return the matched top-K sentences of each question.
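
The split-and-rank steps above can be sketched in plain Python on a small hypothetical matrix. The numbers below are made up for illustration; the last `n_questions` rows are assumed to be the questions because they were appended after the text before tokenizing.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical tf-idf rows: three text sentences followed by two questions.
matrix = [
    [0.9, 0.1, 0.0],  # sentence 0
    [0.0, 0.8, 0.6],  # sentence 1
    [0.5, 0.5, 0.5],  # sentence 2
    [1.0, 0.0, 0.0],  # question 0
    [0.0, 0.0, 1.0],  # question 1
]
n_questions = 2

# Split the matrix into the text block and the question block.
text_rows = matrix[:-n_questions]
question_rows = matrix[-n_questions:]

def top_k(question, candidates, k):
    # Rank candidate sentence indices by similarity to the question, best first.
    scores = [cosine(question, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]

matches = [top_k(q, text_rows, k=2) for q in question_rows]
print(matches)
```

Each entry of `matches` holds sentence indices, which can then be mapped back to the sentence texts returned by the tokenizer.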


**Analysis (1 point)**


TF-IDF similarity may fail to find the correct answer to some questions. Based on your analysis, answer the following questions:
- What kind of questions cannot be answered correctly by this method?
- What could be a possible solution to fix these issues? Discuss your idea. You don't have to implement it.

In [160]:
from sklearn.metrics.pairwise import cosine_similarity

def Match(text, questions, lemmatized=True, remove_stopword=True, topK=3):

    
    # Tokenize the concatenated text and questions using the tokenize function as defined in Q2.1
    vocab, sents = tokenize(text + '\n' + ' '.join(questions), lemmatized, remove_stopword)
    
    # Calculate the smoothed normalized tf_idf matrix for the concatenated text
    tfidf, words = compute_tfidf(vocab)
    
    # Split the tf-idf matrix into sub-matrices for the text and questions respectively
    tfidf_text = tfidf[:len(vocab)-len(questions)]
    tfidf_questions = tfidf[len(vocab)-len(questions):]
    
    # Initialize the list to store the top-K sentences
    answers = []
    
    # Iterate over each question
    for idx in range(len(questions)):
        
        # Initialize the temporary list to store the answers for the current question
        temp_ans = []
        
        # Calculate cosine similarity between question and text sentences
        similarities = cosine_similarity(tfidf_questions[idx].reshape(1, -1), tfidf_text)
        
        # Get indices of top-K similar sentences
        top_idx_sim_sents = np.argsort(similarities)[0][::-1][:topK]
        
        # Get top-K sentences for the current question
        for i in top_idx_sim_sents:
            temp_ans.append(sents[i])
        
        # Store the answers for the current question in the list named 'answers'
        answers.append(temp_ans)
    
    return answers

In [159]:
answers = Match(text, qs, lemmatized=True, remove_stopword=True)

for q, a1, a2 in zip(qs, answers, ans):
    print(f'Q:\t{q}\n\nA:\t{a1}\n\nCorrect:\t{a2}\n')

Q:	What was the relationship between Coolio and Gangsta's parapdise?

A:	[Originally "Gangsta's Paradise" was not meant to be included on one of Coolio's studio albums, but its success led to Coolio not only putting it on his next album but also making it the title track., In 1995, Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta's Paradise"., The song also created a controversy when Coolio claimed that parody artist "Weird Al" Yankovic had not asked for permission to make his parody of "Gangsta's Paradise", titled "Amish Paradise".]

Correct:	Coolio made a song featuring R&B singer LV for the movie Dangerous Minds, titled "Gangsta's Paradise

Q:	WHen was the song released?

A:	[The album Gangsta's Paradise was released in 1995 and was certified 2X Platinum by the RIAA., In 1996, Coolio had another top 40 hit with the song "It's All the Way Live (Now)" from the soundtrack to the movie Eddie., It would become one of the most successful rap songs 

**What kind of questions cannot be correctly found by this method?**

- This method fails on questions that share common words with many sentences and lack distinctive words with high TF-IDF weight. For example, questions 6-8 all contain the word "song" and retrieved exactly the same top-K sentences as their answers.

**What could be the possible solution to fix these issues?**

- We can use a pre-trained model, such as pre-trained word vectors, so that matching reflects the context and meaning of a sentence rather than exact word overlap.

## **Q3.** (Bonus 2 points)


Implement a function `match_by_wv(text, questions, lemmatized = True, remove_stopword = True, topK = 3)` to find topK answers to a question by the similarity of word vectors.
- For each keyword in the question, find the best-matched word in a candidate answer by the cosine similarity between their word vectors
- Calculate the match score of the answer as the mean of the cosine similarities of the best-matched words
- Return the answers with the topK largest match scores.


Hint: feel free to use pretrained word vectors.


In [None]:
def match_by_wv(text, questions, 
                lemmatized = True, 
                remove_stopword = True, 
                topK = 3):
    
    answers = None
    
    # add your code
    
    return answers
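
The scoring rule described in Q3 can be sketched without any pretrained model. The example below is only an illustration: `toy_vectors` holds hand-made 3-dimensional vectors standing in for real pretrained embeddings, and `match_score` is a hypothetical helper, not part of the assignment template.

```python
import math

# Hypothetical 3-d vectors standing in for pretrained word embeddings.
toy_vectors = {
    "award":   [0.9, 0.1, 0.0],
    "grammy":  [0.8, 0.2, 0.1],
    "win":     [0.7, 0.0, 0.3],
    "album":   [0.0, 0.9, 0.2],
    "release": [0.1, 0.8, 0.3],
    "song":    [0.2, 0.5, 0.8],
}

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_score(question_words, answer_words, vectors):
    # Mean, over question words, of the best cosine similarity to any answer word.
    best = []
    for q in question_words:
        if q not in vectors:
            continue  # skip words without a vector
        sims = [cosine(vectors[q], vectors[a]) for a in answer_words if a in vectors]
        if sims:
            best.append(max(sims))
    return sum(best) / len(best) if best else 0.0

question = ["win", "award"]
candidates = [["grammy", "song"], ["album", "release"]]

scores = [match_score(question, c, toy_vectors) for c in candidates]
ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
print(ranked, [round(s, 3) for s in scores])
```

Unlike TF-IDF matching, this score can be high even when the question and the candidate share no words, as long as their words have similar vectors.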