1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive')

# SOURCE_DIR = '/content/Q3_data.csv'

In [1]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [2]:
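def delete_hashtag_usernames(text):
  # Drop every token that starts with '@' (username) or '#' (hashtag).
  # Non-string input (e.g. NaN read by pandas) is mapped to an empty string.
  if not isinstance(text, str):
    return ''
  return ' '.join(word for word in text.split() if word[0] not in ('@', '#'))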

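def delete_url(text):
  # Remove everything from "http" up to the next whitespace character.
  return re.sub(r'http\S+', '', text)

def delete_ex(text):
  # Remove the zero-width non-joiner (U+200C), which is common in Persian text.
  return re.sub(r'\u200c', '', text)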

# 0. Data preprocessing

In [3]:
!pip install json-lines

Collecting json-lines
  Downloading json_lines-0.5.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: json-lines
Successfully installed json-lines-0.5.0


In [4]:
import json_lines

In [5]:
# 1. extract all tweets from file and save them in memory
df = pd.read_csv('Q3_data.csv')["Text"]
# df
tweets = df.tolist()
# len(tweets)

# 2. remove urls, hashtags and usernames. use the prepared functions
for i in range(len(tweets)):
  tweets[i] = delete_hashtag_usernames(tweets[i])
  tweets[i] = delete_url(tweets[i])
  tweets[i] = delete_ex(tweets[i])

# tweets
tweets[0:5]

['بنشین تا شود نقش فال ما نقش هم فردا شدن',
 'این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده از بس پای منبر دستمال کشی کرده.',
 'برای ایران، برای مهسا.',
 'مرگ بر دیکتاتور',
 'نذاریم خونشون پایمال شه...']

# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [6]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    dotProduct = np.dot(u, v)
    normMultiplication = np.linalg.norm(u) * np.linalg.norm(v)
    # An all-zero vector has no direction; return 0 instead of dividing by zero (NaN).
    if normMultiplication == 0:
        return 0.0
    return dotProduct / normMultiplication
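
As a quick sanity check of Equation (1), the sketch below evaluates the same formula on three hand-picked pairs of vectors: parallel, orthogonal, and opposite. The vectors are toy values chosen only for illustration, and the function is redefined so the block runs on its own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Equation (1): dot product divided by the product of the L2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])

# Same direction, different length: the angle is 0, so the cosine is (about) 1.
parallel = cosine_similarity(u, 2 * u)

# Perpendicular vectors: the dot product is 0, so the cosine is 0.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite direction: the angle is 180 degrees, so the cosine is (about) -1.
opposite = cosine_similarity(u, -u)

print(parallel, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction: scaling `u` by 2 leaves the similarity unchanged.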

## find k nearest neighbors

In [40]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    Return the k words nearest to a specific word, based on the given dictionary.

    Arguments:
        word           -- a word (string); it must be a key of embedding_dict
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of words that should be returned

    Returns:
        a list of the k words most similar to the given word, most similar first

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
  """
  # Calculate cosine similarity between the given word and all other words in the embedding dictionary
  similarities = []
  for item in embedding_dict:
      if item != word:
          similarity = cosine_similarity(embedding_dict[word], embedding_dict[item])
          similarities.append((item, similarity))

  # Sort the similarities in descending order based on cosine similarity
  similarities.sort(key=lambda x: x[1], reverse=True)

  # Retrieve the k nearest neighbors based on cosine similarity
  nearest_neighbors = [w for w, _ in similarities[:k]]

  return nearest_neighbors
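
The loop above calls `cosine_similarity` once per vocabulary word. For a large vocabulary, the same ranking can be computed with one matrix-vector product. The following is only a sketch of that idea, not the assignment's required implementation: the name `k_nearest_vectorized` and the toy English embeddings are made up for illustration, and it assumes that no vector is all zeros.

```python
import numpy as np

def k_nearest_vectorized(word, embedding_dict, k):
    """Rank all other words by cosine similarity using a single matrix product."""
    words = [w for w in embedding_dict if w != word]
    matrix = np.stack([embedding_dict[w] for w in words])    # shape (n_words, dim)
    query = np.asarray(embedding_dict[word], dtype=float)    # shape (dim,)

    # Cosine similarity of every row with the query (assumes non-zero norms).
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = (matrix @ query) / norms

    # Stable sort on the negated scores: highest similarity first, ties keep dict order.
    order = np.argsort(-sims, kind="stable")[:k]
    return [words[i] for i in order]

# Toy 3-dimensional embeddings (hypothetical values).
toy = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
    "pear":  np.array([0.15, 0.10, 0.95]),
}
neighbors = k_nearest_vectorized("king", toy, 2)
print(neighbors)
```

Both versions should return the same words in the same order; the vectorized one simply moves the per-word loop into NumPy.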

# 2. One hot encoding

In [9]:
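# 1. find one hot encoding of each word
# Build the vocabulary in first-seen order; dict.fromkeys removes duplicates
# without rescanning the whole list for every word.
vocab = list(dict.fromkeys(word for tweet in tweets for word in tweet.split()))
print(vocab)
print(len(vocab))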

['بنشین', 'تا', 'شود', 'نقش', 'فال', 'ما', 'هم', 'فردا', 'شدن', 'این', 'گوزو', 'رو', 'کی', 'گردن', 'میگیره؟؟', 'دچار', 'زوال', 'عقل', 'شده', 'از', 'بس', 'پای', 'منبر', 'دستمال', 'کشی', 'کرده.', 'برای', 'ایران،', 'مهسا.', 'مرگ', 'بر', 'دیکتاتور', 'نذاریم', 'خونشون', 'پایمال', 'شه...', 'مابهت', 'افتخار', 'میکنیم', 'نبات', 'باعث', 'شدی', 'کل', 'دنیا', 'مارو', 'ببینه', 'انسانای', 'خوشگلمون', 'فارغ', 'هر', 'باوری', 'متحد', 'شویم.', 'اینها', 'عجب', 'موجودات', 'پستی', 'هستن🥺🥺🥺الهی', 'بگردم،', 'من', 'خودم', 'باردارم', 'و', 'حتی', 'توتظاهرات', 'مسالمت', 'امیز', 'خارج', 'ایران', 'استرس', 'داشتم', 'ادم', 'ها', 'نا', 'خود', 'اگاه', 'بهم', 'ضربه', 'بزنن،بمیرم', 'دل', 'اون', 'زن', 'که', 'چه', 'کشیده...مرگ', 'کصخلا', 'چرا', '۴', 'تاوفحشش', 'نمیدن؟', 'روی', 'پول', 'ملی....', 'آزادی', 'آخرین', 'قطره', 'خونم', 'میجنگم', '🇮🇷✌🏻', '(7)', 'داریم', 'چهل', 'میشیم', 'سی', 'سه', 'سلام', 'بچه', 'اگر', 'میبینید', 'حتما', 'هشتگ', 'بزنید.', 'امروز', 'کاری', 'رفتم', 'سراغ', 'وسایل', 'قدیمی', 'اینارو', 'دیدم', ':', '

In [10]:
one_hot_encoder = OneHotEncoder()
one_hot_encoded_vectors = one_hot_encoder.fit_transform(np.array(list(vocab)).reshape(-1, 1)).toarray()
one_hot_encoded_dict = {}
for i in range(len(vocab)):
  one_hot_encoded_dict[vocab[i]] = one_hot_encoded_vectors[i]

In [11]:
one_hot_encoded_dict['"جرأت"']
len(one_hot_encoded_dict['"جرأت"'])

32115

In [12]:
# 2. find 10 nearest words from "آزادی"
Azadi_nearest_oneHot = find_k_nearest_neighbors('آزادی',one_hot_encoded_dict,10)
Azadi_nearest_oneHot

['بنشین', 'تا', 'شود', 'نقش', 'فال', 'ما', 'هم', 'فردا', 'شدن', 'این']

##### Describe advantages and disadvantages of one-hot encoding

Advantages:


1.   Simple implementation
2.   Unknown or missing categories can be handled explicitly, for example by mapping them to an all-zero vector



Disadvantages:


1.   The vectors are very large: one dimension per vocabulary word (32,115 here)
2.   Computation is slow and memory-hungry at this size
3.   The meaning of the words is not captured
4.   Sparsity: almost every entry is 0

Analysis:

First, we completed the `find_k_nearest_neighbors` function, which calls `cosine_similarity` for every candidate word.

Next, we collected every distinct word that appears in the tweets, skipping words that were already in the vocabulary.

Then we built the one-hot vectors, assigning one vector to every word. Each vector is one-dimensional, with a length equal to the vocabulary size; exactly one entry is 1 and all the others are 0. For example:  NLP = [0,0,0,0,1,0,0,0,.....,0]

As a result, the dot product of any two different words is 0, so every cosine similarity is 0. The sort has nothing to rank, so the candidates stay in vocabulary order and the function returns the first 10 words of the vocabulary, which is exactly what the output above shows. One-hot encoding therefore cannot find similar words.
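
The following sketch reproduces this behaviour on a tiny vocabulary. The four English words are placeholders, not words from the dataset. Because one-hot vectors have norm 1, the dot product equals the cosine similarity, and because Python's `sort` is stable, tied candidates keep their vocabulary order.

```python
import numpy as np

vocab = ["freedom", "woman", "life", "tomorrow"]   # placeholder vocabulary
one_hot = np.eye(len(vocab))                        # row i is the one-hot vector of vocab[i]

# All pairwise dot products: 1 on the diagonal (a word with itself), 0 everywhere else.
gram = one_hot @ one_hot.T
print(gram)

# Nearest neighbours of "life": every other word has similarity 0.
query = vocab.index("life")
sims = [(w, float(one_hot[query] @ one_hot[i]))
        for i, w in enumerate(vocab) if i != query]
sims.sort(key=lambda x: x[1], reverse=True)         # stable: ties keep vocabulary order
top2 = [w for w, _ in sims[:2]]
print(top2)
```

The "neighbours" are just the first words of the vocabulary, regardless of which query word is used.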



# 3. TF-IDF

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
import random

# 1. find the TF-IDF of all tweets.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(tweets).toarray()

In [18]:
# 2. choose one tweet randomly.
chosen_index = random.randint(0, len(tweets) - 1)
chosen_tweet = tweets[chosen_index]
chosen_tweet_tfitd = tfidf_matrix[chosen_index]
print(chosen_tweet)
print(chosen_tweet_tfitd)

همه با هم هستیم
[0. 0. 0. ... 0. 0. 0.]


In [19]:
# 3. find 10 nearest tweets from chosen tweet.
all_tweets = []
for i in range(len(tfidf_matrix)):
  all_tweets.append((tweets[i], cosine_similarity(chosen_tweet_tfitd, tfidf_matrix[i])))
all_tweets.sort(key=lambda x: x[1], reverse=True)
for i, j in all_tweets[0:10]:
  print(i, "\t\t", j)



همه با هم هستیم 		 1.0
ما همه با هم هستیم~ 		 0.9185658037193614
ما همه با هم هستیم 		 0.9185658037193614
ما همه با هم هستیم  		 0.9185658037193614
ما همه با هم هستیم. 		 0.9185658037193614
همه با هم 		 0.7548234635833173
نترسید نترسید ما همه با هم هستیم 		 0.4535293384062828
همه برای همه 		 0.4219071302833197
شعارهامون از "نترسید، نترسید، ما همه با هم هستیم" تبدیل شدن به: "بترسید، بترسید، ما همه با هم هستید" و این دقیقا تفاوت اعتراضات امروز با اعتراضات سالهای پیشمونه 		 0.3830246978886267
برای همه 		 0.3768066250110662


##### Describe advantages and disadvantages of TF-IDF

Advantages:


1.   Simple implementation
2.   Texts that share words get similar vectors, so cosine similarity is meaningful (unlike one-hot)
3.   Every vector has the same fixed length (the vocabulary size)
4.   The weights reflect term importance
5.   Common words are automatically given lower weights




Disadvantages:


1.   Lack of semantic understanding: TF-IDF does not consider the meaning of words.
2.   TF-IDF treats each term as a separate entity, so spelling or inflectional variants of the same concept are represented by different terms, which reduces the effectiveness of the technique.
3.   Synonyms are not matched
4.   The matrix is sparse


Analysis:

This method works much better than one-hot encoding, but it still has problems. One of them is shared with one-hot encoding: both produce large sparse matrices that take up a lot of memory. In addition, TF-IDF as used here represents whole tweets rather than individual words, so it matches tweets by the words they share, not by their meaning. The nearest tweets above illustrate this: all of them share surface words with the chosen tweet. The vector length is fixed (the vocabulary size).
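
To make the weighting concrete, here is a hand-computed TF-IDF on three toy documents. It is a sketch under two assumptions: that the weighting is raw term count times a smoothed IDF, $\ln\frac{1+n}{1+\text{df}(t)} + 1$, followed by L2 normalisation of each row (which, to my knowledge, matches the defaults of `TfidfVectorizer`), and that the English tokens stand in for the Persian tweets.

```python
import numpy as np

# Three toy "tweets", already tokenized (placeholder data).
docs = [["all", "together"],
        ["we", "all", "together"],
        ["for", "all"]]

vocab = sorted({t for d in docs for t in d})          # ['all', 'for', 'together', 'we']
counts = np.array([[d.count(t) for t in vocab] for d in docs], dtype=float)

n_docs = len(docs)
df = (counts > 0).sum(axis=0)                         # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1             # smoothed inverse document frequency

tfidf = counts * idf                                  # term frequency times IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True) # L2-normalise every row

# Rows have unit length, so a dot product is already the cosine similarity.
sims = tfidf @ tfidf[0]
for doc, s in zip(docs, sims):
    print(" ".join(doc), round(float(s), 4))
```

The term "all" occurs in every document, so it gets the lowest IDF and contributes least to the similarity; the second document ranks above the third because it also shares the rarer term "together" with the query.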

# 4. Word2Vec

In [20]:
import nltk
nltk.download('punkt')

# 1. train a word2vec model based on all tweets
tokenized_tweets = [nltk.word_tokenize(tweet) for tweet in tweets]
# CBOW (sg=0), 100-dimensional vectors, context window of 5, keep every word (min_count=1)
word2vec_model = Word2Vec(tokenized_tweets, vector_size=100, window=5, min_count=1, sg=0)

# 2. find 10 nearest words from "آزادی"
nearest_words = word2vec_model.wv.most_similar("آزادی", topn=10)
print("10 nearest words from 'آزادی' are:")
for word, similarity in nearest_words:
    print(word, similarity)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


10 nearest words from 'آزادی' are:
ازادی 0.9947225451469421
آزادی، 0.9936996102333069
زن، 0.9920216202735901
زن 0.9914841651916504
کشورم 0.9877528548240662
آزادی.به 0.9875050783157349
زندگی 0.987430214881897
ایرانم 0.9866270422935486
آبادی 0.9824851751327515
فردای 0.9819443821907043


##### Describe advantages and disadvantages of Word2Vec

Advantages:


1.   Word2Vec captures semantic similarity between words
2.   The vectors are dense and low-dimensional (100 dimensions here, versus 32,115 for one-hot)
3.   Word2Vec considers the context in which words appear in a corpus
4.   The model is relatively fast to train




Disadvantages:


1.   Word2Vec typically requires a large corpus of text to learn high-quality word embeddings
2.   Words that are absent from the training corpus have no embedding at all, and rare words may not have meaningful embeddings

This algorithm works better than the previous two. It takes the meaning of the words and the whole corpus into account and still produces fixed-length vectors, but it cannot represent words that do not appear in the training corpus.
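
The neighbour list above contains entries such as `آزادی،` and `آزادی.به`, which differ from `آزادی` only by attached punctuation, so the model learns separate vectors for what is really one word. A possible extra preprocessing step is sketched below. The helper name `split_punctuation` is hypothetical and not part of the original notebook, and the regular expression is a simple assumption: it also drops emoji and combining diacritics, which may or may not be desirable.

```python
import re

def split_punctuation(text):
    """Replace punctuation with spaces so that attached marks do not create new tokens."""
    # [^\w\s] matches every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind.
    return " ".join(text.split())

samples = ["آزادی،", "آزادی.به", "زن، زندگی، آزادی"]
cleaned = [split_punctuation(s) for s in samples]
print(cleaned)
```

Applying a step like this to every tweet before tokenization would merge these variants into a single vocabulary entry before training.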

# 5. Contextualized embedding

In [21]:
!pip install transformers[sentencepiece]



In [22]:
# Load model and tokenizer

from transformers import BertModel, BertTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"


In [73]:
# 1. load the pretrained model and embed every tweet
#    (the model is used as-is; no fine-tuning step is run in this cell)
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()

Bert_dict = {}
with torch.no_grad():  # inference only, so no gradients are needed
  for tweet in tweets:
    encoded_tweet = tokenizer(tweet, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**encoded_tweet)
    # Average the token vectors to get one vector per tweet
    tweet_embedding = outputs.last_hidden_state.mean(dim=1)
    Bert_dict[tweet] = tweet_embedding.squeeze().numpy()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [74]:
Bert_dict[list(Bert_dict.keys())[0]].shape

(768,)
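
Each tweet vector is the plain mean of the token vectors in `last_hidden_state`. That is safe in the cell above because every tweet is encoded on its own, so there are no padding tokens. If tweets were encoded in padded batches instead, the padding positions would have to be excluded from the mean. The sketch below shows that masked mean with NumPy arrays standing in for the model output and the tokenizer's attention mask; the function name and the toy numbers are illustrative assumptions.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 positions, 2-dimensional "hidden states".
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],   # third position is padding
                   [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
naive = hidden.mean(axis=1)   # plain mean, which wrongly includes the padding vector
print(pooled)
print(naive)
```

For the first sequence, the plain mean is dominated by the padding vector, while the masked mean uses only the two real tokens.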

In [75]:
# 2. find the 10 nearest tweets to "آزادی" (the keys of Bert_dict are whole tweets)
query = "آزادی"
encoded_word = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
  output_word = model(**encoded_word)
word_embedding = output_word.last_hidden_state.mean(dim=1).squeeze().numpy()

# Tweets that consist only of the query word are kept, so they rank first with similarity ~1.
Similarities = {}
for item in Bert_dict:
  Similarities[item] = cosine_similarity(word_embedding, Bert_dict[item])

Sorted_Similarities = sorted(Similarities.items(), key=lambda x: x[1], reverse=True)
for i in Sorted_Similarities[0:10]:
  print(i)

('آزادی', 1.0000001)
('ازادی', 1.0000001)
('آزادی ایرانم❤️🤞🏼', 0.9281945)
('آزادی 🖤', 0.9281945)
('ازادی 💚❤🕊', 0.9281945)
('بای آزادی', 0.9254588)
('رهایی', 0.91361344)
('آزادی۲۲', 0.9132374)
('آزادی…', 0.9126624)
('آزادی قشنگه', 0.90117466)


##### Describe advantages and disadvantages of Contextualized embedding

Advantages:


1.   Contextualized embeddings derive the meaning of each word from the context it appears in, so different senses of the same word get different vectors.
2.   Whole sentences and phrases can be embedded in the same way, so we can find tweets related to a word, not only single words.
3.   Pre-trained contextualized embeddings, such as BERT, can be fine-tuned on specific downstream tasks with relatively little labeled data.
4.   Contextualized embeddings often capture similarity between words and phrases better than traditional word embeddings.




Disadvantages:


1.   Contextualized embeddings, especially those generated by large-scale transformer models like BERT, are computationally expensive to train and use.
2.   While contextualized embeddings offer impressive performance, fine-tuning still requires labeled data for each specific task, and harder tasks may need a lot of it.
3.   Contextualized embeddings are inherently complex and lack interpretability compared to traditional word embeddings like Word2Vec or GloVe.

