# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions incur a 10% penalty for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from Assignment 2. Write the code from scratch rather than using any pre-existing libraries:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probability of each review with respect to the other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print the result as a table whose columns are the noun phrases and whose rows are the 100 reviews (abstracts, or tweets).
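As an illustration of the bigram formula in (2), here is a minimal from-scratch sketch. It assumes the text is already cleaned so that whitespace splitting is enough, and the names `count_bigrams_and_unigrams`, `bigram_probabilities`, and `toy_corpus` are hypothetical, not part of the assignment. Each bigram count is divided by the count of its first word, matching the "really like" example.

```python
def count_bigrams_and_unigrams(documents):
    # Count single words and adjacent word pairs with plain dictionaries.
    unigram_counts = {}
    bigram_counts = {}
    for doc in documents:
        tokens = doc.split()  # assumption: cleaned text, whitespace tokens
        for token in tokens:
            unigram_counts[token] = unigram_counts.get(token, 0) + 1
        # Pairs are formed inside one document only, never across documents.
        for pair in zip(tokens, tokens[1:]):
            bigram_counts[pair] = bigram_counts.get(pair, 0) + 1
    return unigram_counts, bigram_counts


def bigram_probabilities(documents):
    # probability = count(first second) / count(first)
    unigram_counts, bigram_counts = count_bigrams_and_unigrams(documents)
    return {
        pair: count / unigram_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


# Toy data chosen so that "really" occurs 3 times and "really like" once.
toy_corpus = ["i really like it", "i really really enjoy it"]
probs = bigram_probabilities(toy_corpus)
print(round(probs[("really", "like")], 2))  # 0.33
```

Keeping bigrams inside document boundaries is a design choice; concatenating all documents first would also count pairs that span two reviews.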

In [None]:
# Write your code here
#change the word matching with regex!

import pandas as pd
import pprint
from google.colab import drive
import nltk
from textblob import TextBlob
import time

from collections import Counter

nltk.download('punkt')
nltk.download('brown')

drive.mount('drive', force_remount=True)
dataframe = pd.read_csv('/content/drive/My Drive/densho_11700380_cleaned.csv')

def unique_value(ngram_list):
  unique_value = []

  for _list in ngram_list:
    if _list not in unique_value:
        unique_value.append(_list)

  return unique_value

def generate_ngrams(text, n, delimiter="_"):
  #assume we are working with a clean data
  tokens = text.split(" ")

  #join each window of n consecutive tokens with the delimiter;
  #texts shorter than n tokens yield an empty list
  return [delimiter.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def count_ngram_frequency(ngram_list):
  #single pass with a plain dict; keys keep first-seen order
  result = {}

  for value in ngram_list:
    result[value] = result.get(value, 0) + 1

  return result

def np_list(text):
  np = TextBlob(text)
  return np.noun_phrases

#no correction for now
def correction(text):
  cor = TextBlob(text)
  return cor.correct()

def count_ngram_probability(ngram_list_numerator, ngram_list_denominator):
  result = {}

  ngram_numerator_freq = count_ngram_frequency(ngram_list_numerator)
  ngram_denominator_freq = count_ngram_frequency(ngram_list_denominator)

  #count(w2 w1) / count(w2)
  for k, v in ngram_numerator_freq.items():
    #get the first word
    first_word = k.split(" ")[0]
    if first_word in ngram_denominator_freq:
      result.update({k: float(v/ngram_denominator_freq[first_word])})

  return result

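def count_np_probability(np_list):
  np_dict = {}
  #denominator: total number of noun phrases in the corpus, duplicates included
  max_np = len(np_list)
  unique_np_freq = count_ngram_frequency(np_list)

  for k, v in unique_np_freq.items():
    #frequency (noun phrase) / total number of noun phrases
    #NOTE: the assignment formula divides by the max frequency (noun phrase) instead
    np_dict.update({k: float(v/max_np)})

  return np_dict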

#preprocess
#remove duplicate values
dataframe = dataframe.drop_duplicates(subset=['desc'])

#remove nan value
dataframe = dataframe.dropna()
df_len = len(dataframe.index)

unigram = []
bigram = []
trigram = []
np = []

print("\nCreating N-Grams...\n")
t = time.process_time()
for index, row in dataframe.iterrows():
  text = row['desc']
  unigram.extend(generate_ngrams(text, 1, delimiter=" "))
  bigram.extend(generate_ngrams(text, 2, delimiter=" "))
  trigram.extend(generate_ngrams(text, 3, delimiter=" "))
  np.extend(np_list(text))

print("\nDone...\nProcess time..."+str(time.process_time() - t)+" seconds\n")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Mounted at drive

Creating N-Grams...


Done...
Process time...2.691205355999955 seconds



In [None]:
#1)Trigrams frequency
t2 = time.process_time()
print("TRIGRAMS FREQUENCY\n")
print("Calculating...\n")

dataframe_trigram = pd.DataFrame(columns=["trigram", "frequency"])

index = 1
for k_trigram, v_trigram in count_ngram_frequency(trigram).items():
  dataframe_trigram.loc[index] = [k_trigram, v_trigram]
  index += 1

display(dataframe_trigram)

print("\nDone...\nProcess time..."+str(time.process_time() - t2)+" seconds\n")

TRIGRAMS FREQUENCY

Calculating...



| | trigram | frequency |
|---|---|---|
| 1 | kay aiko abe | 1 |
| 2 | aiko abe nisei | 1 |
| 3 | abe nisei femal | 1 |
| 4 | nisei femal born | 286 |
| 5 | femal born may | 32 |
| ... | ... | ... |
| 29823 | began work offic | 1 |
| 29824 | work offic inspector | 1 |
| 29825 | inspector gener u | 1 |
| 29826 | gener u s | 1 |



Done...
Process time...226.77490375900015 seconds



In [None]:
#2)Bigrams probability
t3 = time.process_time()
print("\nBIGRAMS PROBABILITY\n")
print("Calculating...\n")

dataframe_bigram = pd.DataFrame(columns=["bigram", "probability"])

index = 1
for k_bigram, v_bigram in count_ngram_probability(bigram, unigram).items():
  dataframe_bigram.loc[index] = [k_bigram, v_bigram]
  index += 1

display(dataframe_bigram)

print("\nDone...\nProcess time..."+str(time.process_time() - t3)+" seconds\n")


BIGRAMS PROBABILITY

Calculating...



| | bigram | probability |
|---|---|---|
| 1 | kay aiko | 0.090909 |
| 2 | aiko abe | 0.200000 |
| 3 | abe nisei | 1.000000 |
| 4 | nisei femal | 0.397590 |
| 5 | femal born | 0.938119 |
| ... | ... | ... |
| 21365 | februari in | 0.014085 |
| 21366 | februari left | 0.014085 |
| 21367 | left civil | 0.008929 |
| 21368 | divis began | 0.066667 |



Done...
Process time...158.60955475799983 seconds



In [None]:
#3) Noun phrase probability
t4 = time.process_time()
print("NOUN PHRASE PROBABILITY\n")
print("Calculating...\n")

dataframe_np = pd.DataFrame(columns=["noun phrase", "probability"])

index = 1
for k_np, v_np in count_np_probability(np).items():
  dataframe_np.loc[index] = [k_np, v_np]
  index += 1

display(dataframe_np)

print("\nDone...\nProcess time..."+str(time.process_time() - t4)+" seconds\n")

NOUN PHRASE PROBABILITY

Calculating...



| | noun phrase | probability |
|---|---|---|
| 1 | kay aiko abe nisei femal | 0.000144 |
| 2 | selleck washington | 0.000144 |
| 3 | childhood beaverton oregon father | 0.000144 |
| 4 | own farm influenc earli age parent convers chr... | 0.000144 |
| 5 | war work | 0.000144 |
| ... | ... | ... |
| 5947 | civil right investig provid litig support | 0.000144 |
| 5948 | need litig support | 0.000144 |
| 5949 | amount work requir os decreas mani staff membe... | 0.000144 |
| 5950 | civil right divis mr zajic continu work litig ... | 0.000144 |



Done...
Process time...8.715061833000163 seconds
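
The output above is a single corpus-level list, while part (3) asks for one row per review and one column per noun phrase. A possible way to build that table is sketched below. It assumes the noun phrases have already been extracted per document (for example with the `np_list` helper above), and it reads the formula as "frequency of the phrase in the document divided by the highest corpus-wide frequency of any noun phrase", which is only one interpretation. The function name `relative_np_table` and the sample phrases are hypothetical.

```python
def relative_np_table(doc_phrases):
    # doc_phrases: one list of noun phrases per document (hypothetical input).
    corpus_counts = {}
    for phrases in doc_phrases:
        for phrase in phrases:
            corpus_counts[phrase] = corpus_counts.get(phrase, 0) + 1

    # Highest frequency of any noun phrase over the whole dataset.
    max_freq = max(corpus_counts.values()) if corpus_counts else 0
    columns = sorted(corpus_counts)

    table = []
    for phrases in doc_phrases:
        doc_counts = {}
        for phrase in phrases:
            doc_counts[phrase] = doc_counts.get(phrase, 0) + 1
        # One row per document, one cell per noun phrase column.
        table.append([doc_counts.get(col, 0) / max_freq for col in columns])
    return columns, table


# Illustrative phrases only; real input would come from the dataset.
sample = [["war work", "own farm", "war work"], ["war work"], []]
columns, table = relative_np_table(sample)
for row_id, row in enumerate(table, start=1):
    print(row_id, dict(zip(columns, row)))
```

The returned rows can be passed to `pd.DataFrame(table, columns=columns)` if a rendered table is preferred.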



## Question 2 (25 points)

**Understand TF-IDF and Document representation**

Starting from the documents (all the reviews, or abstracts, or tweets) collected for Assignment 2, write a Python program:

(1) To build the documents-terms weights (tf * idf) matrix.

(2) To rank the documents with respect to a query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using cosine similarity.

Note: Write the code from scratch rather than using any pre-existing libraries.
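
The two steps can be sketched from scratch as follows. This is a minimal illustration under stated assumptions, not a reference solution: documents are already cleaned and split on whitespace, TF is the term count divided by document length, IDF is log(N / document frequency) without smoothing, and query terms that never appear in the corpus are ignored. All function names and the toy documents are hypothetical.

```python
import math


def tfidf_vector(tokens, idf):
    # Sparse vector {term: tf * idf}; terms without an IDF value are skipped.
    if not tokens:
        return {}
    counts = {}
    for token in tokens:
        if token in idf:
            counts[token] = counts.get(token, 0) + 1
    return {term: (n / len(tokens)) * idf[term] for term, n in counts.items()}


def build_tfidf(documents):
    tokenized = [doc.split() for doc in documents]
    # Document frequency: number of documents containing each term.
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    idf = {term: math.log(len(tokenized) / df) for term, df in doc_freq.items()}
    return [tfidf_vector(tokens, idf) for tokens in tokenized], idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_documents(documents, query):
    vectors, idf = build_tfidf(documents)
    query_vec = tfidf_vector(query.split(), idf)
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    # Highest similarity first; ties fall back to document order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


toy_docs = ["nisei femal born may", "war work farm", "nisei war camp"]
ranking = rank_documents(toy_docs, "war work")
print([doc_id for _, doc_id in ranking])  # [1, 2, 0]
```

Sparse dictionaries are used instead of a dense matrix because most terms are absent from most documents; a dense document-term matrix can be produced from the same vectors by filling missing terms with 0.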

In [None]:
# Write your code here

import pandas as pd
import pprint
from google.colab import drive
import math
import re

drive.mount('drive', force_remount=True)
dataframe = pd.read_csv('/content/drive/My Drive/densho_11700380_cleaned.csv')

def unique_value(ngram_list):
  unique_value = []

  for _list in ngram_list:
    if _list not in unique_value:
        unique_value.append(_list)

  return unique_value

def generate_ngrams(text, n, delimiter="_"):
  #assume we are working with a clean data
  tokens = text.split(" ")

  #join each window of n consecutive tokens with the delimiter;
  #texts shorter than n tokens yield an empty list
  return [delimiter.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def tfidf(word, document, document_id, idf):
  #TF(word, document) = "number of occurrences of the word in the document" / "number of words in the document"
  #IDF = log[(number of documents) / (number of documents containing the word)]

  tokens = document.split()
  #count whole tokens; str.count would also match substrings (e.g. "work" inside "worker")
  word_n = tokens.count(word)
  word_doc = len(tokens)

  return {f"{word}|{document_id}": float(word_n/word_doc * idf)}

In [None]:
if __name__ == "__main__":
  #preprocess
  #remove duplicate values
  dataframe = dataframe.drop_duplicates(subset=['desc'])

  #remove nan value
  dataframe = dataframe.dropna()

  tfidf_matrix = {}

  #1) documents-terms weights (tf * idf) matrix
  print("TF-IDF MATRIX\n")
  dataframe_len = len(dataframe.index)

  #document frequency: number of documents that contain each word
  document_freq = {}
  for index, row in dataframe.iterrows():
    for word in set(row['desc'].split()):
      document_freq[word] = document_freq.get(word, 0) + 1

  #tfidf
  #iterate through documents
  for index_tfidf, row_tfidf in dataframe.iterrows():
    for word, doc_n in document_freq.items():
      #idf depends only on the word, not on the current document
      idf = math.log(dataframe_len/doc_n)
      tfidf_matrix.update(tfidf(word, row_tfidf['desc'], index_tfidf, idf))

  pprint.pprint(tfidf_matrix)

## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be Word2Vec, GloVe, ULMFiT, BERT, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/

In [None]:
# Write your code here









## Question 4 (20 Points)

**Create your own training and evaluation data for sentiment analysis.**

 **You don't need to write a program for this question!**

 For example, if you collected a movie review or a product review data, then you can do the following steps:

*   Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset to a CSV file with three columns (document_id, clean_text, sentiment), upload the CSV file to GitHub, and submit the file link below.

*   This dataset will be used for Assignment 4: sentiment analysis and text classification.


In [None]:
# The GitHub link of your final csv file


# Link:
'''
https://github.com/kahosadi/INFO5731_Spring_2024/blob/main/densho_cleaned_annotated.csv
'''


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [None]:
# Type your answer
'''
Challenging part: coding
Best part: coding

I won't complain. But I was doing other assignments simultaneously; to make things worse, this is tax season. In the end, I only had one day to complete this assignment.
If I had more time, I would do it better. Typical procrastinator comments inserted! '''