1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
SOURCE_DIR = '/content/drive/MyDrive/NLP-M/Q3_data.csv'

In [3]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [4]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
  except:
    return ''

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [5]:
!pip install json-lines



In [6]:
import json_lines

In [7]:
# 1. extract all tweets from file and save them in memory
# 2. remove urls, hashtags and usernames. use the prepared functions

# Load the dataset
df = pd.read_csv(SOURCE_DIR, engine='python')

# Function to clean a single tweet
def clean_tweet(tweet):
    tweet = delete_hashtag_usernames(tweet)
    tweet = delete_url(tweet)
    tweet = delete_ex(tweet)
    return tweet

# Clean all tweets in the dataframe
df['cleaned_tweet'] = df['Text'].apply(clean_tweet)


# Show first 10 rows
df.head(10)

Unnamed: 0.1,Unnamed: 0,Datetime,Text,PureText,Language,Sentiment,Date,cleaned_tweet
0,0,2022-09-22 09:14:35+00:00,بنشین تا شود نقش فال ما \nنقش هم‌ فردا شدن\n#م...,بنشین تا شود نقش فال ما نقش هم‌ فردا شدن,fa,negative,2022-09-22,بنشین تا شود نقش فال ما نقش هم فردا شدن
1,1,2022-10-06 01:44:55+00:00,@Tanasoli_Return @dr_moosavi این گوزو رو کی گر...,این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده...,fa,very negative,2022-10-06,این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده...
2,2,2022-09-22 15:12:28+00:00,@ghazaleghaffary برای ایران، برای مهسا.\n#OpIr...,برای ایران، برای مهسا.,fa,positive,2022-09-22,برای ایران، برای مهسا.
3,3,2022-09-22 09:35:50+00:00,@_hidden_ocean مرگ بر دیکتاتور \n#OpIran \n#Ma...,مرگ بر دیکتاتور,fa,very negative,2022-09-22,مرگ بر دیکتاتور
4,4,2022-09-22 01:31:25+00:00,نذاریم خونشون پایمال شه.‌‌.‌‌.\n#Mahsa_Amini #...,نذاریم خونشون پایمال شه.‌‌.‌‌.,fa,negative,2022-09-22,نذاریم خونشون پایمال شه...
5,5,2022-09-26 21:05:14+00:00,@Nabauti88 مابهت افتخار میکنیم نبات باعث شدی ک...,مابهت افتخار میکنیم نبات باعث شدی کل دنیا مارو...,fa,positive,2022-09-26,مابهت افتخار میکنیم نبات باعث شدی کل دنیا مارو...
6,6,2022-09-22 20:37:50+00:00,@Bunnpaw برای انسانای خوشگلمون\n\n#مهسا_امینی ...,برای انسانای خوشگلمون,fa,positive,2022-09-22,برای انسانای خوشگلمون
7,7,2022-09-28 05:27:37+00:00,@neginsh فارغ از هر باوری متحد شویم.\n#مهسا_ام...,فارغ از هر باوری متحد شویم.,fa,no sentiment expressed,2022-09-28,فارغ از هر باوری متحد شویم.
8,8,2022-10-09 03:40:09+00:00,@mansurehhossai2 اینها عجب موجودات پستی هستن🥺🥺...,اینها عجب موجودات پستی هستن🥺🥺🥺الهی بگردم، من خ...,fa,very negative,2022-10-09,اینها عجب موجودات پستی هستن🥺🥺🥺الهی بگردم، من خ...
9,9,2022-09-24 21:46:20+00:00,@ShahinMaghsoodi کصخلا چرا ۴ تاوفحشش نمیدن؟\n\...,کصخلا چرا ۴ تاوفحشش نمیدن؟,fa,very negative,2022-09-24,کصخلا چرا ۴ تاوفحشش نمیدن؟


# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity(u, v)} = \frac {u \cdot v} {||u||_2 ||v||_2} = cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [8]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
      # Compute the dot product between u and v (u.v)
    dot_product = np.dot(u, v)

    # Compute the L2 norm of u (|u|)
    norm_u = np.linalg.norm(u)

    # Compute the L2 norm of v (|v|)
    norm_v = np.linalg.norm(v)

    # Compute the cosine similarity defined by formula (1)
    cosine_similarity = dot_product / (norm_u * norm_v)

    return cosine_similarity

## find k nearest neighbors

In [9]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    implement a function to return the nearest words to an specific word based on the given dictionary

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of word that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """
  similarities = {}

  # Check if the word is in the embedding dictionary
  if word not in embedding_dict:
      raise ValueError("Word not found in the embedding dictionary.")

  # Get the embedding vector of the given word
  word_vector = embedding_dict[word]

  # Calculate the cosine similarity with every other word in the embedding dictionary
  for other_word, other_vector in embedding_dict.items():
      if other_word != word:  # Skip the query word itself
          sim = cosine_similarity(word_vector, np.array(other_vector))
          similarities[other_word] = sim

  # Sort words by their similarity score in descending order
  sorted_similarities = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

  # Extract the top k most similar words
  nearest_neighbors = [word for word, similarity in sorted_similarities[:k]]

  return nearest_neighbors

# 2. One hot encoding

In [10]:
# 1. find one hot encoding of each word
# 2. find 10 nearest words from "آزادی"

all_words = set()
df['cleaned_tweet'].apply(lambda tweet: all_words.update(tweet.split()))

# Convert the set to a list and reshape for OneHotEncoder
unique_words_list = list(all_words)
unique_words_array = np.array(unique_words_list).reshape(-1, 1)

# One-hot encoding
encoder = OneHotEncoder(sparse=False)  # Use sparse=False to get a dense array for demonstration
one_hot_encoded_words = encoder.fit_transform(unique_words_array)

# Creating a dictionary mapping words to their one-hot encoded vectors
word_to_one_hot = dict(zip(unique_words_list, one_hot_encoded_words))

embedding_dict_one_hot = word_to_one_hot

# Find 10 "nearest" words to "آزادی" based on one-hot encoding
# This will not yield meaningful semantic results
try:
    nearest_neighbors = find_k_nearest_neighbors("آزادی", embedding_dict_one_hot, 10)
    print(nearest_neighbors)
except ValueError as e:
    print(e)



['احمقهای', 'نکند', 'مطبم', 'دقیقهای', 'همين', 'رفته،', 'ازادی..', 'مملکتمون،دوری', 'مشهدی', 'کشورت']


##### Describe advantages and disadvantages of one-hot encoding

**Advantage:** <br>

*Simplicity:* One-hot encoding is straightforward to
understand and implement.
<br><br>
*No Implicit Ordering:* It does not impose any order or hierarchy on the categorical values, unlike numerical encoding. This prevents the model from assuming a natural ordering between categories where none exists, which can be particularly important for nominal variables where no such ordinal relationship should influence the model's performance.
<br><br>
*Compatibility with Algorithms:* Many machine learning models, especially linear models, require input data to be numerical. One-hot encoding converts categorical data into a numeric format, making it possible to train models that otherwise couldn’t handle categorical data directly.

**Disadvantage:** <br>
*Dimensionality Increase:* One of the biggest downsides of one-hot encoding is the significant increase in the dataset's dimensionality.
<br><br>
*Sparse Matrix:* The resulting encoded data is very sparse (mostly zeros), which can be inefficient for both storage and computation.
<br><br>
*Loss of Information:* One-hot encoding treats each category as independent without considering the potential relationships between categories.


# 3. TF-IDF

In [11]:
import math
from collections import defaultdict
import numpy as np

# Calculate Term Frequency (TF)
def compute_tf(text):
    words = text.split()
    tf = {word: words.count(word) / len(words) for word in set(words)}
    return tf

# Calculate Inverse Document Frequency (IDF)
def compute_idf(documents):
    N = len(documents)
    idf = defaultdict(lambda: math.log(N))
    for word in set(word for document in documents for word in document.split()):
        idf[word] -= math.log(sum(word in document for document in documents))
    return idf

# Calculate TF-IDF for all documents
def compute_tfidf(documents):
    idf = compute_idf(documents)
    tfidf_documents = []
    for document in documents:
        tf = compute_tf(document)
        tfidf = {word: tf_score * idf[word] for word, tf_score in tf.items()}
        tfidf_documents.append(tfidf)
    return tfidf_documents, list(idf.keys())

# Convert TF-IDF dictionaries to vectors
def tfidf_to_vectors(tfidf_documents, feature_words):
    vectors = []
    for doc in tfidf_documents:
        vector = [doc.get(word, 0) for word in feature_words]
        vectors.append(vector)
    return np.array(vectors)

# Cosine Similarity
def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_a = np.linalg.norm(vector1)
    norm_b = np.linalg.norm(vector2)
    return dot_product / (norm_a * norm_b)

# Find nearest neighbors based on cosine similarity
def find_nearest_neighbors(vector, vectors, n_neighbors=10):
    similarities = [cosine_similarity(vector, v) for v in vectors]
    nearest_indices = np.argsort(similarities)[-n_neighbors-1:-1][::-1]
    return nearest_indices

# Process the documents
def process_documents(documents):
    tfidf_documents, feature_words = compute_tfidf(documents)
    vectors = tfidf_to_vectors(tfidf_documents, feature_words)
    return vectors

def choose_random(documents, vectors):
    # Choose a random document
    random_index = np.random.randint(len(documents))
    chosen_vector = vectors[random_index]

    # Find the nearest neighbors
    nearest_indices = find_nearest_neighbors(chosen_vector, vectors)

    # Print the chosen document and its nearest neighbors
    print(f"Chosen Document: {documents[random_index]}")
    print("Nearest Neighbors:")
    for index in nearest_indices:
        print(documents[index])




In [18]:
documents = df['cleaned_tweet'].tolist()
vectors = process_documents(documents)

In [20]:
choose_random(documents, vectors)

Chosen Document: برای زخم هایی که هر روز تازه اند... برای دردهای مادران عزادار...
Nearest Neighbors:
برای تمام دردهای فرو خورده
برای تمام مادران ایران زمین
برای همه مادران و دختران ایران زمین
برای اشک های مادران داغ دار
برای تمام مادران داغدار ایران
براي مادران داغدار
برای مادران چشم انتظار ….
برای بغض مادران چشم به راه آبان
برای تمام مادران دادخواه... :)
تازه وصل شدن


In [21]:
del vectors

In [22]:
del documents

##### Describe advantages and disadvantages of TF-IDF

**Advatages:** <br>
*Relevance Measurement:* TF-IDF provides a simple yet effective way to quantify the relevance of words within a document relative to a collection of documents. This helps in identifying and prioritizing the most significant words in texts.
<br><br>
*Reduction of Noise:* By considering inverse document frequency, TF-IDF naturally filters out common words that appear in many documents (such as "the", "is", and "and"), which are typically less informative.
<br><br>
*Easy to Compute and Understand:* The calculation of TF-IDF is straightforward, making it easy to implement and scale.
<br><br>

**Disadvantages:** <br>
*Lack of Context and Order:* TF-IDF treats each document as a bag of words, meaning it doesn't capture the order of words, context, or semantics of the terms. This can lead to a loss of valuable information.
<br><br>
*Not Suitable for Understanding Word Relationships:* Since TF-IDF focuses on individual word importance without considering the relationship between words, it cannot capture synonyms or polysemy effectively.
<br><br>
*Ignores Document Length:* TF-IDF does not normalize for document length. A longer document could have higher TF-IDF scores simply due to having more words, even if the relevance of specific terms is not higher than in shorter documents.
<br><br>
*High-Dimensional Sparse Vectors:* Similar to one-hot encoding, TF-IDF representation can lead to very high-dimensional feature spaces, especially with large corpora containing many unique words.



# 4. Word2Vec

In [23]:
# 1. train a word2vec model base on all tweets
# 2. find 10 nearest words from "آزادی"

from gensim.models import Word2Vec

# Tokenize cleaned tweets:
sentences = [tweet.split() for tweet in df['cleaned_tweet'].tolist()]

# 1. Train a Word2Vec model based on all tweets
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# 2. Find 10 nearest words from "آزادی"
word = "آزادی"
if word in model.wv.key_to_index:
    nearest_words = model.wv.most_similar(word, topn=10)
    print(f"The 10 nearest words to '{word}' are:")
    for similar_word, similarity in nearest_words:
        print(f"{similar_word}: {similarity}")
else:
    print(f"The word '{word}' is not in the model's vocabulary.")


The 10 nearest words to 'آزادی' are:
ازادی: 0.9979506134986877
زن،: 0.9945236444473267
آزادی،: 0.9939445853233337
زندگی،: 0.9937655925750732
آزادی.: 0.9921770691871643
ایرانم: 0.9921579957008362
وطنم: 0.991191565990448
زن: 0.989142894744873
زندگی: 0.989051342010498
میهن: 0.9887256026268005


##### Describe advantages and disadvantages of Word2Vec

**Advantages:** <br>
*Semantic Similarity:* One of the biggest strengths of Word2Vec is its ability to capture semantic similarity between words. Words that are used in similar contexts are positioned closely in the vector space, which allows for capturing nuances in meaning.
<br><br>
*Dimensionality Reduction:* Compared to one-hot encoding, Word2Vec significantly reduces the dimensionality of the feature space while retaining semantic information.
<br><br>
*Efficiency in Training:* Despite its complexity, Word2Vec is relatively efficient to train compared to deeper, more complex models, especially when using the negative sampling training method.
<br><br>

**Disadvantages:** <br>
*Lack of Word Sense Disambiguation:* Word2Vec assigns a single vector to each word, which means it cannot distinguish between different meanings of a word used in different contexts.
<br><br>
*Requirement for Large Training Corpora:* To capture rich semantic relationships, Word2Vec models require large amounts of training data.
<br><br>
*Static Word Embeddings:* Once trained, the embeddings are static and do not change. This means that the model cannot adapt to new words or phrases that weren't in the training data, nor can it evolve with language use over time.

# 5. Contextualized embedding

In [24]:
!pip install transformers[sentencepiece]



In [25]:
# Load model and tokenizer
from transformers import BertModel, BertTokenizer, AdamW, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import torch

model_name = "HooshvareLab/bert-base-parsbert-uncased"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=6)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
class TweetDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.tweets = df['Text'].tolist()  # Convert the 'Text' column to a list
        self.labels = df['Sentiment'].map({'negative': 0, 'very negative': 1, 'positive': 2, 'no sentiment expressed': 3, 'very positive': 4, 'mixed': 5}).tolist()  # Convert the 'Sentiment' column to a list
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.tweets)

    def __getitem__(self, idx):
        tweet = self.tweets[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            tweet,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }


dataset = TweetDataset(df, tokenizer)
dataloader = DataLoader(dataset, batch_size=16)

In [27]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cuda


In [28]:
import torch.nn as nn
from tqdm import tqdm
from transformers import BertModel, BertTokenizer, AdamW, BertForSequenceClassification

# Assume model and dataloader have been defined earlier in the code.
# Also, 'device' is assumed to be defined (e.g., 'cuda' or 'cpu').

# Move the model to the specified device (GPU or CPU)
model.to(device)

# Initialize the optimizer with the model's parameters and a learning rate of 1e-5
optimizer = AdamW(model.parameters(), lr=1e-5)

# Set the model to training mode. This is necessary because some modules like Dropout
# behave differently during training
model.train()

# Loop through each batch in the DataLoader. tqdm is used to display progress.
for batch in tqdm(dataloader):
    # Move batch data to the same device as the model to avoid CPU-GPU data transfer issues
    batch = {k: v.to(device) for k, v in batch.items()}

    # Pass the batch through the model. This step computes the forward pass.
    # **batch unpacks the dictionary directly into the input arguments of the model.
    outputs = model(**batch)

    # Extract labels from the batch for comparison with model outputs to compute loss
    labels = batch['labels']

    # Initialize the loss function. CrossEntropyLoss is commonly used for classification tasks.
    loss_function = nn.CrossEntropyLoss()

    # Compute the loss by comparing the model outputs with the true labels.
    # 'logits' are the raw, unnormalized scores output by the last layer of the model.
    loss = loss_function(outputs.logits, labels)

    # Perform backpropagation starting from the loss to compute gradients for each parameter
    loss.backward()

    # Update model parameters based on gradients computed
    optimizer.step()

    # Clear the gradients to prevent them from being accumulated
    optimizer.zero_grad()


100%|██████████| 1250/1250 [07:25<00:00,  2.81it/s]


In [29]:
# 2. find 10 nearest words from "آزادی"
from scipy.spatial.distance import cosine

# Get the embedding of 'آزادی'
input_ids = tokenizer.encode('آزادی', return_tensors='pt').to(device)
with torch.no_grad():
    azadi_embedding = model(input_ids)[0][0, :].cpu().numpy()  # Removed one indexing dimension

# Calculate the cosine similarity with all other words
similarities = []
for word in tqdm(tokenizer.get_vocab()):
    input_ids = tokenizer.encode(word, return_tensors='pt').to(device)
    with torch.no_grad():
        word_embedding = model(input_ids)[0][0, :].cpu().numpy()  # Removed one indexing dimension
    similarity = 1 - cosine(azadi_embedding, word_embedding)
    similarities.append((word, similarity))

# Sort by similarity and get the top 10 words
similar_words = sorted(similarities, key=lambda x: x[1], reverse=True)[:10]

similar_words

100%|██████████| 100000/100000 [19:28<00:00, 85.55it/s]


[('ویديوهای', 0.9990941882133484),
 ('بازدیدکنندگان', 0.9984754323959351),
 ('vey', 0.9982344508171082),
 ('جوزه', 0.99754798412323),
 ('اشتغالزا', 0.9975302815437317),
 ('استقلالطلبی', 0.997490406036377),
 ('تفریحات', 0.9972715973854065),
 ('۰۳۱', 0.9972620010375977),
 ('مجله', 0.9971975684165955),
 ('موسسهی', 0.997179388999939)]


### Advantages

1. **Context-Awareness**: Unlike static embeddings (e.g., Word2Vec, GloVe), contextualized embeddings take the entire sentence context into account. This means the representation of a word can change based on its surrounding words, allowing for a more nuanced understanding of language.

2. **Handling Polysemy**: They excel at capturing the meanings of polysemous words (words with multiple meanings) depending on their usage in a sentence, which is a significant limitation of non-contextual embeddings.

3. **Improved Model Performance**: Contextualized embeddings have been shown to significantly improve the performance of NLP models on a wide range of tasks, including but not limited to, sentiment analysis, question answering, and named entity recognition.

4. **Transfer Learning and Few-shot Learning**: Models pretrained with contextualized embeddings on large corpora can be fine-tuned with relatively small datasets to achieve high performance, making NLP applications more accessible.

5. **Deeper Linguistic Understanding**: These embeddings can capture deeper syntactic and semantic information, such as the role of a word in a sentence or relationships between words, leading to models that better understand the nuances of language.

### Disadvantages

1. **Computational Complexity**: Generating contextualized embeddings generally requires more computational resources than static embeddings, both in terms of memory and processing power. This can be a limiting factor for deployment in resource-constrained environments.

2. **Training Time**: Pretraining models to generate contextualized embeddings (e.g., BERT, GPT) involves large datasets and extensive computational resources, making the process time-consuming.

3. **Model Size**: Models capable of generating high-quality contextualized embeddings are often large, with hundreds of millions (or even billions) of parameters. This can make them challenging to deploy on mobile devices or in environments with strict latency requirements.

4. **Fine-tuning Challenges**: While transfer learning is a significant advantage, it also requires expertise to fine-tune these models effectively for specific tasks. Improper fine-tuning can lead to suboptimal performance or overfitting.

5. **Interpretability**: The complexity of models producing contextualized embeddings can make it difficult to understand how decisions are made, posing challenges for interpretability and debugging.