<a href="https://colab.research.google.com/github/nagabudibhavyanth/INFO5731/blob/main/NV_Bhavyanth_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions are penalized 10% for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from assignment two. Write the code from scratch instead of using any pre-existing libraries:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print the result as a table with the noun phrases as column names and the 100 reviews (abstracts, or tweets) as row names.
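
As a quick check of the bigram formula in (2), here is a minimal sketch on a made-up sentence in which "really" occurs three times and "really like" once. The function name `bigram_probabilities` and the toy sentence are illustrative assumptions, not part of the assignment data.

```python
from collections import defaultdict

def bigram_probabilities(tokens):
    """Return {(first, second): count(first second) / count(first)}."""
    word_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for token in tokens:
        word_counts[token] += 1
    # Pair each token with its successor to enumerate the bigrams
    for first, second in zip(tokens, tokens[1:]):
        bigram_counts[(first, second)] += 1
    return {
        bigram: count / word_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }

# Made-up toy sentence: "really" appears 3 times, "really like" once
tokens = "i really like this i really enjoy that really nice".split()
probs = bigram_probabilities(tokens)
print(round(probs[("really", "like")], 2))  # 0.33
```

The denominator is the count of the first word of the bigram, which matches the worked example count(really like) / count(really).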

In [1]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [3]:
product_reviews_df = pd.read_csv("/content/drive/My Drive/product_reviews.csv")

In [4]:
print(product_reviews_df)

     Unnamed: 0                                             review
0             0  \nUPDATE 3/8/2023: i bought parallels desktop ...
1             1  \nI was a bit wary of ordering such an expensi...
2             2  \nAmazing machine. Smooth, fast, quiet, and be...
3             3  \nBest laptop I ever had.  I bought it during ...
4             4  \nI make music and this is exactly what I need...
..          ...                                                ...
995         995  \nI had been a Mac user for years, but then go...
996         996  \nI bought this 2020 MacBook Air (8 GB RAM, 25...
997         997  \nIt's really surprising how cool the laptop i...
998         998  \nI first bought this mac in 2020 when it firs...
999         999  \nUpdated review with more detail [after 2 wee...

[1000 rows x 2 columns]


In [5]:
del product_reviews_df["Unnamed: 0"]

In [6]:
product_reviews = product_reviews_df["review"].head(100)
print(len(product_reviews))

100


In [7]:
def generate_ngrams(text, N = 3):
  words = text.split()
  return [words[i:i+N] for i in range(len(words)- N+1)]
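
Task (1) asks for the frequency of every trigram. One possible sketch, reusing the sliding-window idea above with a plain dictionary, is shown here. The helper `count_ngrams` and the two sample reviews are hypothetical and only illustrate the counting step; N-grams are stored as tuples so they can be used as dictionary keys.

```python
def generate_ngrams(text, N=3):
    words = text.split()
    # Tuples are hashable, so they can serve as dictionary keys
    return [tuple(words[i:i + N]) for i in range(len(words) - N + 1)]

def count_ngrams(texts, N=3):
    """Count every N-gram across an iterable of texts using a plain dict."""
    counts = {}
    for text in texts:
        for ngram in generate_ngrams(text, N):
            counts[ngram] = counts.get(ngram, 0) + 1
    return counts

# Placeholder reviews, not taken from the real dataset
sample_reviews = [
    "the battery life is great",
    "the battery life is short",
]
trigram_counts = count_ngrams(sample_reviews)
for ngram, count in sorted(trigram_counts.items(), key=lambda item: -item[1]):
    print(" ".join(ngram), count)
```

On the real data, `sample_reviews` would be replaced by the 100 reviews loaded earlier.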

In [8]:
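from collections import Counter
def compute_bi_gram_probabilities(n_grams, text):
  # count(w1 w2) / count(w1), with both counts computed once instead of rescanning the lists per bigram
  word_counts, bigram_counts = Counter(text.split()), Counter(map(tuple, n_grams))
  return {' '.join(bigram): count / word_counts[bigram[0]] for bigram, count in bigram_counts.items()}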

In [9]:
review_bigrams = []
for review in product_reviews:
  review_bigrams.extend(generate_ngrams(review, 2))

In [10]:
print(review_bigrams)

[['UPDATE', '3/8/2023:'], ['3/8/2023:', 'i'], ['i', 'bought'], ['bought', 'parallels'], ['parallels', 'desktop'], ['desktop', 'and'], ['and', 'a'], ['a', 'windows'], ['windows', '11'], ['11', 'license.'], ['license.', 'I'], ['I', 'need'], ['need', 'it'], ['it', 'to'], ['to', 'try'], ['try', 'a'], ['a', 'certain'], ['certain', 'amount'], ['amount', 'of'], ['of', 'things'], ['things', 'in'], ['in', 'IT,'], ['IT,', 'however,'], ['however,', 'the'], ['the', 'performance'], ['performance', 'when'], ['when', 'running'], ['running', 'windows'], ['windows', 'in'], ['in', 'a'], ['a', 'virtualized'], ['virtualized', 'environment'], ['environment', '(like'], ['(like', 'parallels'], ['parallels', 'is)'], ['is)', 'it'], ['it', 'gets'], ['gets', 'seriously'], ['seriously', 'affected'], ['affected', 'on'], ['on', 'the'], ['the', 'M1,'], ['M1,', 'something'], ['something', 'that'], ['that', "didn't"], ["didn't", 'happen'], ['happen', 'with'], ['with', 'intel'], ['intel', 'processors.'], ['processors.'

In [None]:
bigram_probabilities = compute_bi_gram_probabilities(review_bigrams, " ".join(product_reviews))

In [None]:
print(bigram_probabilities)

In [None]:
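def noun_phrases(words):
    # Heuristic chunker: treats each run of words between common verbs,
    # prepositions, and conjunctions as a phrase (no part-of-speech tagging).
    phrases, phrase = [], []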
    keywords = ['is', 'are', 'was', 'were', 'can', 'could', 'will', 'would', 'should', 'has', 'have', 'had', 'do', 'does', 'did', 'to', 'at', 'in', 'out', 'on', 'off', 'over', 'under', 'and', 'or', 'but']
    for word in words:
        if word.lower() not in keywords:
            phrase.append(word)
        elif phrase:
            phrases.append(' '.join(phrase))
            phrase = []
    if phrase:
        phrases.append(' '.join(phrase))
    return phrases

In [None]:
noun_phrase_freq = dict()
for i, review in enumerate(product_reviews):
    words = review.split()
    phrases = noun_phrases(words)
    for phrase in phrases:
        if phrase not in noun_phrase_freq:
            noun_phrase_freq[phrase] = [0] * len(product_reviews)
        noun_phrase_freq[phrase][i] += 1

for phrase, frequencies in noun_phrase_freq.items():
    max_freq = max(frequencies)
    noun_phrase_freq[phrase] = [freq / max_freq for freq in frequencies]

noun_phrase_df = pd.DataFrame(noun_phrase_freq)
noun_phrase_df.index.name = "Review"
noun_phrase_df.columns.name = "NounPhrases"

noun_phrase_df.to_csv('/content/drive/My Drive/noun_phrase_data.csv', index=True)
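
To make the normalization step concrete, here is a small sketch on made-up phrases for three hypothetical reviews. Each phrase's count in a review is divided by that phrase's largest count in any single review. The function name `relative_frequencies` and the sample phrases are assumptions for illustration only.

```python
def relative_frequencies(phrase_lists):
    """phrase_lists[i] holds the phrases extracted from review i.

    Returns {phrase: [count_in_review_i / max_count_over_reviews, ...]}.
    """
    n_reviews = len(phrase_lists)
    counts = {}
    for i, phrases in enumerate(phrase_lists):
        for phrase in phrases:
            # One row of per-review counts for each phrase
            counts.setdefault(phrase, [0] * n_reviews)[i] += 1
    return {
        phrase: [c / max(row) for c in row]
        for phrase, row in counts.items()
    }

# Made-up phrases for three hypothetical reviews
phrase_lists = [
    ["battery life", "battery life", "screen"],
    ["battery life", "keyboard"],
    ["screen"],
]
table = relative_frequencies(phrase_lists)
print(table["battery life"])  # [1.0, 0.5, 0.0]
```

Every phrase appears at least once, so the maximum in each row is at least 1 and the division is always defined.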

## Question 2 (25 points)

**Understand TF-IDF and Document representation**

Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the documents-terms weights (tf * idf) matrix.

(2) To rank the documents with respect to query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using cosine similarity.

Note: Write the code from scratch instead of using any pre-existing libraries.
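
For reference, one common way to write the weighting and the similarity is given here. It is the variant that the functions `calculate_term_freq`, `calculate_inv_doc_freq`, and `cosine_similarity` in this section appear to follow; other definitions of tf and idf exist.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

$$
\cos(q, d) = \frac{\sum_{t} w_{t,q}\, w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}
$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.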

In [None]:
import pandas as pd
from collections import defaultdict
import math

In [None]:
product_reviews_df = pd.read_csv("/content/drive/My Drive/movie_reviews.csv")

In [None]:
product_reviews_df.head()

In [None]:
reviews = product_reviews_df['review'].tolist()[:100]

# Lowercase before splitting so document tokens match the lowercased query tokens
tokenize_reviews = lambda review : review.lower().split()

tokenized_reviews = [tokenize_reviews(doc) for doc in reviews]
print(tokenized_reviews)
def calculate_term_freq(tokenized_review):
    res = defaultdict(int)
    for word in tokenized_review:
        res[word] += 1

    for word in res:
        res[word] = res[word] / len(tokenized_review)
    return res

def calculate_inv_doc_freq(tokenized_reviews):
    res = defaultdict(int)
    for review in tokenized_reviews:
        for word in set(review):
            res[word] += 1

    for word in res:
        res[word] = math.log(len(tokenized_reviews) / res[word]) + 1
    return res

tf_reviews = [calculate_term_freq(doc) for doc in tokenized_reviews]
print(tf_reviews)

idf_reviews = calculate_inv_doc_freq(tokenized_reviews)
print(idf_reviews)

def calculate_tf_idf(tf_reviews, idf_reviews):
    tf_idf = []
    for doc_tf in tf_reviews:
        doc_tf_idf = {}
        for word, value in doc_tf.items():
            doc_tf_idf[word] = value * idf_reviews[word]
        tf_idf.append(doc_tf_idf)
    return tf_idf


tf_idf_matrix = calculate_tf_idf(tf_reviews, idf_reviews)

def dot_product(vector1, vector2):
    return sum([vector1.get(word, 0) * vector2.get(word, 0) for word in vector1])

def vector_norm(vector):
    return math.sqrt(sum([value**2 for value in vector.values()]))

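def cosine_similarity(vector1, vector2):
    # Return 0 when either vector has zero length (e.g. no query term is in the vocabulary)
    norm_product = vector_norm(vector1) * vector_norm(vector2)
    return dot_product(vector1, vector2) / norm_product if norm_product else 0.0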

query = "A good product with the maximum of covered features and flexibility"
tokenized_query = tokenize_reviews(query.lower())


# Keep the query string intact; store its term frequencies under a separate name
query_tf = calculate_term_freq(tokenized_query)

query_tf_idf = {word: query_tf[word] * idf_reviews.get(word, 0) for word in query_tf}

cos_sim = [cosine_similarity(query_tf_idf, doc) for doc in tf_idf_matrix]

ranked_reviews_index = sorted(range(len(cos_sim)), key=lambda i: cos_sim[i], reverse=True)

print("The top 5 ranked documents are:")
for i in ranked_reviews_index[:5]:
    print(f"Document - {i + 1}, Index - {i}, Cos - Similarity: {cos_sim[i]}")

print(tf_idf_matrix[0])
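
The whole pipeline can be checked by hand on a tiny corpus. The sketch below uses only the standard library and three made-up documents; the helper names (`tf`, `idf`, `tf_idf`, `cosine`) are illustrative and not the notebook's own functions.

```python
import math
from collections import Counter

def tf(tokens):
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def idf(docs):
    # Document frequency: number of documents containing each word
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(len(docs) / n) + 1 for w, n in df.items()}

def tf_idf(tokens, idf_weights):
    # Words unseen in the corpus get weight 0
    return {w: v * idf_weights.get(w, 0.0) for w, v in tf(tokens).items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Made-up documents, lowercased and split on whitespace
docs = [d.lower().split() for d in [
    "Great battery and a great screen",
    "The keyboard feels cheap",
    "Battery drains fast",
]]
weights = idf(docs)
doc_vectors = [tf_idf(doc, weights) for doc in docs]
query_vector = tf_idf("great battery".split(), weights)
scores = [cosine(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

The first document shares both query terms, the third shares one, and the second shares none, so its similarity is exactly 0.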

## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be word2vec, glove, ulmfit, bert, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/

In [None]:
import pandas as pd
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

In [None]:
product_reviews_df = pd.read_csv("/content/drive/My Drive/movie_reviews.csv")

In [None]:
sentences = [review.split() for review in product_reviews_df['review']]

In [None]:
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)
print(model)
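
To give a feel for what the `window` parameter controls, here is a simplified sketch of how a word2vec-style model pairs each word with its neighbours. It is an illustration only: the function `context_pairs` is hypothetical, and real implementations add details such as subsampling and negative sampling that are not shown.

```python
def context_pairs(tokens, window=5):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Made-up sentence; window=1 keeps the output short
tokens = "the battery life is great".split()
pairs = context_pairs(tokens, window=1)
print(pairs)
```

A larger window pairs each word with more distant neighbours, which tends to capture broader topical similarity rather than close syntactic similarity.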

In [None]:
words = list(model.wv.key_to_index)
print(words)

In [None]:
model.save('model.bin')

In [None]:
new_model = Word2Vec.load('model.bin')
print("\n", new_model)

In [None]:
X = model.wv.vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)

In [None]:
pyplot.scatter(result[:, 0], result[:, 1])
# index_to_key[i] is the word whose vector is row i of model.wv.vectors
words = model.wv.index_to_key
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

In [None]:
product_reviews_df.head(100)

## Question 4 (20 Points)

**Create your own training and evaluation data for sentiment analysis.**

 **You don't need to write a program for this question!**

 For example, if you collected a movie review or a product review data, then you can do the following steps:

*   Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub, and submit the file link below.

*   This dataset will be used for assignment four: sentiment analysis and text classification.
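
No program is required here, but the expected file layout can be sketched as follows. The rows are placeholders, not real annotations, and the file name mentioned in the comment is hypothetical.

```python
import csv
import io

# Placeholder rows for illustration only; real rows come from your own annotation
rows = [
    {"document_id": 1, "clean_text": "example review text one", "sentiment": "positive"},
    {"document_id": 2, "clean_text": "example review text two", "sentiment": "negative"},
    {"document_id": 3, "clean_text": "example review text three", "sentiment": "neutral"},
]

# An in-memory buffer keeps the sketch side-effect free; to write a real file,
# use open("annotated_reviews.csv", "w", newline="") instead (hypothetical name)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["document_id", "clean_text", "sentiment"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Using the `csv` module rather than joining strings by hand means review text containing commas or quotes is escaped correctly.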


In [None]:
# The GitHub link of your final csv file


# Link:

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Type your answer