1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [1]:
# from google.colab import drive
# drive.mount('/content/gdrive')

SOURCE_DIR = 'Q3_data.csv'

In [2]:
import torch
import re
from tqdm import tqdm
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [3]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
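  except AttributeError:
    # Non-string values (for example, NaN for a missing tweet) have no .split(); treat them as empty text
    return ''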

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [4]:
!pip install json-lines

Collecting json-lines
  Downloading json_lines-0.5.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: json-lines
Successfully installed json-lines-0.5.0


In [5]:
import string

In [8]:
# Reading the CSV file
df = pd.read_csv(SOURCE_DIR)

# Deleting usernames, hashtags, and web addresses from the texts and saving them in a list
texts = list(set(df['Text'].map(delete_hashtag_usernames).map(delete_url).map(delete_ex)))

# Replacing all ASCII punctuation and the Persian punctuation marks ؟ ، ؛ with spaces
texts = [re.sub(r"[{}]".format(string.punctuation + '؟،؛!'), " ", text).strip() for text in texts]

# Removing duplicate texts
texts = list(set(texts))

# Displaying the first 5 tweets as a numbered list
for i, text in enumerate(texts[:5], start=1):
    print(f"{i}. {text}")


1. ایرانیان مالزی همگام با ایرانیان در اقصی نقاط دنیا همراه و همدل با هموطنانمان در داخل ایران هستند
2. الان حال خامنه ای اینطوریه
3. چو آمد خروشان به تنگ اندرش بجنبید و برداشت خود از سرش رها شد ز بند زره موی اوی درفشان چو خورشید شد روی اوی بدانست سهراب کاو دخترست سر و موی او ازدر افسرست شگفت آمدش گفت از ایران سپاه چنین دختر آید به آوردگاه
4. برای ایران برای ایران🤍
5. برای 38


# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac {u \cdot v} {||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [9]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    dot_product = np.dot(u, v) # Calculate the dot product of vectors u and v.
    norm_u = np.linalg.norm(u) # Calculate the Euclidean norm of vector u.
    norm_v = np.linalg.norm(v) # Calculate the Euclidean norm of vector v.
    cosine_similarity = dot_product / (norm_u * norm_v) # Calculate the cosine similarity between vectors u and v.

    return cosine_similarity
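
As a quick sanity check of formula (1), the following minimal sketch recomputes the cosine similarity with the standard library only. The helper name `cosine_similarity_plain` and the choice to return 0 for a zero vector are assumptions made for this illustration; they are not part of the assignment.

```python
import math

def cosine_similarity_plain(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: a zero vector is treated as similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction, at a right angle, and in opposite directions
same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_plain([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The three cases correspond to angles of 0, 90, and 180 degrees, so the values should be approximately 1, 0, and -1. Note that the NumPy version above divides by the product of the norms without a guard, so it yields `nan` for an all-zero vector.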


## Find k Nearest Neighbors

In [25]:
def find_k_nearest_neighbors(word, embedding_dict, k):
    """
    Implement a function to return the nearest words to a specific word based on the given dictionary.

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of words that should be returned

    Returns:
        A list of size k consisting of the k most similar words to the given word.

    Note: Use the cosine_similarity function to calculate the similarity between words.
    """
    # Get the word vector for the given word
    word_vector = embedding_dict.get(word)  # Retrieve the word vector corresponding to the input word from the embedding dictionary

    if word_vector is None:
        return []  # Return an empty list if the word is not found in the dictionary

    # Calculate cosine similarity between the word vector and all other word vectors
    similarities = {}  # Initialize an empty dictionary to store similarities
    for w, v in embedding_dict.items():  # Iterate over all words and their corresponding vectors in the embedding dictionary
        if w != word:  # Exclude the input word itself
            similarity = cosine_similarity(word_vector, v)  # Calculate cosine similarity between the input word vector and the current word vector
            similarities[w] = similarity  # Store the similarity score in the dictionary

    # Sort the dictionary by values (cosine similarities) in descending order
    sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Get the k nearest neighbors
    nearest_neighbors = [item[0] for item in sorted_similarities[:k]]  # Extract the words (neighbors) with the highest similarity scores

    return nearest_neighbors  # Return the list of k nearest neighbors
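
The function above sorts every similarity score before taking the top k. When only a few neighbors are needed, a partial selection may be enough. The sketch below is one possible variant that uses `heapq.nlargest`; the names `cosine`, `k_nearest`, and the toy vectors are hypothetical and exist only for this example.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def k_nearest(word, embedding_dict, k):
    """Return the k words whose vectors are most similar to the vector of `word`."""
    target = embedding_dict.get(word)
    if target is None:
        return []  # the query word has no vector
    others = [w for w in embedding_dict if w != word]
    # nlargest keeps only the k best candidates instead of sorting all of them
    return heapq.nlargest(k, others, key=lambda w: cosine(target, embedding_dict[w]))

# Toy two-dimensional embeddings (made up for illustration)
toy_embeddings = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
    "pear": [0.2, 0.8],
}
print(k_nearest("king", toy_embeddings, 2))
```

For the toy data, "queen" is closest to "king", followed by "pear". A query word that is missing from the dictionary returns an empty list, matching the behavior of `find_k_nearest_neighbors`.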




# 2. One hot encoding

In [27]:
# Steps in this cell:
#   1. Vocabulary creation: join all texts into one string, split it into words, and keep the unique words.
#   2. One-hot conversion: use scikit-learn's OneHotEncoder to map each vocabulary word to a one-hot vector.
#      handle_unknown='ignore' encodes unseen inputs as all-zero vectors instead of raising an error,
#      and sparse=False returns a dense array.
#   3. Embedding dictionary: map each word to its one-hot vector.
#   4. Query: find the k nearest neighbors of a chosen word.

# Create vocabulary based on words present in the texts
vocab = list(set((" ".join(texts)).split()))
# Convert vocabulary to an array and reshape it to be column-wise
vocab = np.array(vocab).reshape(-1, 1)
# Create an instance of the OneHotEncoder with specified behavior for encountering unknown tokens
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases only accept sparse_output=False
encoding = OneHotEncoder(handle_unknown='ignore', sparse=False)
# Transform vocabulary into one-hot vectors using the encoder
one_hot_encoding = encoding.fit_transform(vocab)
# Create a dictionary for mapping words to their corresponding one-hot vectors
embedding_dict = {vocab[i][0]: one_hot_encoding[i] for i in range(len(vocab))}
# Set the number of nearest neighbors to find
k = 10
# Set the word for which to find the nearest neighbors
word = "آزادی"
# Find the k nearest neighbors of the given word in the embedding dictionary
# The function returns a list of the k nearest words
k_nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)

# Print the list of k nearest words with their indices
for i, nearest_word in enumerate(k_nearest_words, start=1):
    print(f"{i}. {nearest_word}")




1. اساتیدی
2. شدم😭😭😭
3. «هنوز»
4. آنجاست
5. نمیخواستم
6. ریدم
7. قاتلها
8. یاذتونه
9. شخصیه
10. وکت
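
The neighbors listed above look unrelated to the query word, and this is expected: any two distinct one-hot vectors are orthogonal, so their cosine similarity is exactly 0. Every candidate ties, and because Python's `sorted` is stable, the result is simply the first k words in dictionary order. The small sketch below illustrates the tie on a made-up three-word vocabulary (the helper names are hypothetical).

```python
def one_hot_vectors(vocabulary):
    """Map each word to a one-hot list with a single 1.0 at the word's index."""
    size = len(vocabulary)
    return {
        word: [1.0 if i == j else 0.0 for j in range(size)]
        for i, word in enumerate(vocabulary)
    }

def dot(u, v):
    """Dot product; for unit-length one-hot vectors it equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

toy_vocab = ["freedom", "liberty", "street"]
vectors = one_hot_vectors(toy_vocab)

# Similarity of every ordered pair of distinct words
pairwise = {
    (a, b): dot(vectors[a], vectors[b])
    for a in toy_vocab
    for b in toy_vocab
    if a != b
}
print(pairwise)
```

Every pair scores 0.0, including ("freedom", "liberty"), which shows that one-hot vectors carry no information about how words relate to each other.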


##### Describe advantages and disadvantages of one-hot encoding

One-hot encoding is a technique used in machine learning and data processing to represent categorical variables as binary vectors. Each category is represented by a vector where only one element is 1 and the rest are 0s. Here are some advantages and disadvantages of using one-hot encoding:<br><br>


Advantages:
<br><br>

1- Preservation of Information: One-hot encoding preserves the categorical nature of variables without imposing ordinality or numerical relationships that may not exist.

2- Compatibility with Algorithms: Many machine learning algorithms and models require numerical input. One-hot encoding transforms categorical data into a format that these algorithms can process.

3- Avoidance of Numerical Bias: By using binary values, one-hot encoding prevents the introduction of artificial numerical relationships that could bias the model.

4- Interpretability: One-hot encoding results in interpretable features. Each feature represents a single category, making it clear which category is being referred to.

5- Generalization: One-hot encoding can handle categorical variables with any number of unique categories, making it suitable for a wide range of datasets.<br><br>


Disadvantages:
<br><br>

1- Dimensionality Increase: One-hot encoding can significantly increase the dimensionality of the dataset, especially when dealing with categorical variables with many unique categories. This can lead to the curse of dimensionality and increased computational complexity.

2- Sparse Representation: The resulting one-hot encoded vectors are sparse, consisting mostly of 0s. This can consume a lot of memory and storage space, especially for datasets with a large number of unique categories.

3- Loss of Information about Relationships: One-hot encoding treats each category as independent, disregarding any potential relationships or similarities between categories. This may lead to loss of information, especially in cases where there is inherent ordinality or hierarchical structure among categories.

4- Difficulty Handling New Categories: If new categories are encountered during model deployment that were not present in the training data, it may be challenging to handle them appropriately without retraining the model or using additional techniques like hashing or embedding.

5- Potential for Overfitting: In models with limited data, one-hot encoding can lead to overfitting, especially if there are many rare categories. Each category gets its own dimension, which the model may try to fit even if it's not statistically significant.


# 3. TF-IDF

In [None]:
# 1. Compute the TF-IDF of all tweets.
# 2. Choose one tweet at random.
# 3. Find the 10 tweets nearest to the chosen tweet.


# Function to count occurrences of a word in a sentence
def count_word_occurrences(word, sentence):
    return (sentence.split()).count(word)

# Function to calculate the term frequency (TF) matrix
def calculate_tf_matrix(texts_list, vocab_list):
    # Creating a mapping of text indices to their respective positions in the texts_list
    text_to_index = {texts_list[i]: i for i in range(len(texts_list))}
    # Creating a reverse mapping of text indices
    index_to_text = {i: texts_list[i] for i in range(len(texts_list))}
    # Assertion to check if the mappings have consistent lengths
    assert len(text_to_index) == len(index_to_text), "Mismatch in lengths of text_to_index and index_to_text"

    # Creating a mapping of vocabulary words to their respective positions
    word_to_index = {vocab_list[i][0]: i for i in range(len(vocab_list))}
    # Creating a reverse mapping of word indices
    index_to_word = {i: vocab_list[i][0] for i in range(len(vocab_list))}
    # Assertion to check if the mappings have consistent lengths
    assert len(word_to_index) == len(index_to_word), "Mismatch in lengths of word_to_index and index_to_word"

    # Getting the number of words in the vocabulary
    num_words = len(word_to_index)
    # Getting the number of documents (texts) in the texts_list
    num_documents = len(text_to_index)

    # Initializing a TF matrix with zeros, shaped according to the vocabulary size and number of documents
    tf_matrix = np.zeros(shape=(num_words, num_documents))

    # Looping through each word in the vocabulary
    for word_index in tqdm(range(num_words)):
        # Looping through each document (text)
        for document_index in range(num_documents):
            # Counting the occurrences of the word in the current document
            word_count = count_word_occurrences(index_to_word[word_index], index_to_text[document_index])
            # Calculating the term frequency using the formula: (1 + log(word_count)) if word_count > 0, else 0
            tf_matrix[word_index, document_index] = (1 + np.log(word_count)) if word_count > 0 else 0

    # Returning the calculated TF matrix
    return tf_matrix, index_to_word, num_documents



In [None]:
# Use the preprocessed tweets as the document collection
texts_list = texts

# Build the vocabulary: join all texts, split on whitespace, and keep the unique words
vocab_list = list(set((" ".join(texts_list)).split()))

# Reshape the vocabulary into a column array of shape (num_words, 1), the layout calculate_tf_matrix expects
vocab_list = np.array(vocab_list).reshape(-1, 1)


# Calculate term frequency (TF) matrix, index_to_word, and the number of documents
tf_matrix, index_to_word, num_documents = calculate_tf_matrix(texts_list, vocab_list)

# Initialize a dictionary for storing TF-IDF values
tf_idf = {}

# Loop over each word
for i in tqdm(range(len(index_to_word))):
    # Document frequency (DF): the number of documents that contain the current word
    doc_freq = np.count_nonzero(tf_matrix[i])
    # Inverse document frequency (IDF)
    idf = np.log(num_documents / doc_freq)
    # TF-IDF row for the current word (one value per document)
    tf_idf[index_to_word[i]] = tf_matrix[i] * idf


100%|██████████| 25916/25916 [18:33<00:00, 23.27it/s]
100%|██████████| 25916/25916 [00:08<00:00, 3165.69it/s]
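
To make the weighting concrete, here is a minimal standard-library sketch on a three-document toy corpus. It assumes the same formulas as the cells above: a term frequency of 1 + log(count) for words that occur, and an inverse document frequency of log(N / df), where df is the number of documents containing the word. Unlike the notebook, it builds one vector per document rather than one row per word, and the function name `tf_idf_vectors` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears at least once
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (1 + math.log(counts[word])) * math.log(n_docs / doc_freq[word])
            if counts[word] else 0.0
            for word in vocabulary
        ])
    return vocabulary, vectors

toy_docs = ["for freedom", "for iran for iran", "hope for freedom"]
vocabulary, vectors = tf_idf_vectors(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

The word "for" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to any vector, while "iran" and "hope" each occur in a single document and receive the largest weights. Counting each document once per word with `Counter` also avoids rescanning every document for every vocabulary word, which is the main cost of the nested loop above.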


##### Describe advantages and disadvantages of TF-IDF

In [None]:
def encode_tweet(text):
  # Split the input text into individual words
  words = text.split()

  # Look up the TF-IDF row of each word: a vector with one entry per document
  words_vec = [tf_idf[w] for w in words]

  # Represent the tweet as the element-wise average of its words' TF-IDF vectors
  return np.array(sum(words_vec)) / len(words)

# Create a dictionary called embedding_dict where keys are the texts and values are their corresponding encoded representations
# by applying the encode_tweet function to each text in the texts list
embedding_dict = {t: encode_tweet(t) for t in texts}

# Select a random tweet from the list of texts
tweet = texts[np.random.randint(0, len(texts))]

# Define the number of nearest neighbors to find
k = 10

# Find the k nearest neighbors of the selected tweet based on their embeddings
k_nearest_tweets = find_k_nearest_neighbors(tweet, embedding_dict, k)

# Print the list of k nearest neighbor tweets in a columnar format with row numbers
for i, tweet in enumerate(k_nearest_tweets, start=1):
    print(f"{i}. {tweet}")



1. گوه به قبر پدرت خامنه ای
2. ای به چشم
3. مرگ به نیرنگشون
4. دریچه ای به آزادی
5. به امید
6. به امید پیروزی🤍
7. به امید آزادی⁦🕊️⁩
8. به امید آزادی🤞🏻
9. به امید ۲۰تاییی
10. به امید ازادی🔥


TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. Here are some advantages and disadvantages of using TF-IDF:<br><br>

Advantages:<br><br>
1- Weighted term importance: TF-IDF assigns weights to terms based on their frequency in the document and their rarity across the corpus. This helps in identifying important terms within a document.

2- Reduces the impact of common terms: TF-IDF reduces the importance of terms that occur frequently across all documents in the corpus. Common words like "the", "is", etc., which may not carry much semantic meaning, are often penalized in their importance.

3- Scales to large datasets: TF-IDF needs only term counts and document frequencies, which can be computed in a single pass over the corpus without training a model.

4- Works with sparse representations: Each document vector is nonzero only for the terms the document actually contains, so TF-IDF matrices can be stored and compared efficiently in sparse formats.

5- Simple and efficient: Implementation of TF-IDF is relatively straightforward, making it easy to understand and implement in various applications.
<br><br>

Disadvantages:
<br><br>
1- Ignores word order and context: TF-IDF treats documents as bags of words, disregarding the word order and context in which the words appear. This can lead to loss of information, especially in tasks where word order is crucial, such as in natural language processing tasks like sentiment analysis or text summarization.

2- Doesn't consider semantics: TF-IDF only considers the frequency of terms in documents and doesn't take into account the meaning or semantics of words. Therefore, it may not always capture the true relevance of a term to a document's content.

3- Sensitive to document length: Longer documents may have higher overall term frequencies compared to shorter documents, which can skew the TF-IDF scores. Normalization techniques can mitigate this, but it remains a consideration.

4- Requires a large corpus: TF-IDF relies on a corpus of documents to calculate inverse document frequency. In cases where the corpus is small or not representative of the domain, TF-IDF may not perform optimally.

5- Not effective for some tasks: In tasks such as sentiment analysis or document classification where the focus is on understanding the overall context or sentiment of the document, TF-IDF alone may not be sufficient and more advanced techniques like word embeddings or deep learning models may be needed.


# 4. Word2Vec

In [None]:
# 1. train a word2vec model based on all tweets
# 2. find 10 nearest words from "آزادی"
# Tokenize each text in the 'texts' list by splitting them into individual words
tokenized_texts = [t.split() for t in texts]

# Train a Word2Vec model on the tokenized texts with 100 epochs
model = Word2Vec(tokenized_texts, epochs=100)

# Retrieve the vocabulary of the trained Word2Vec model
vocab = model.wv.key_to_index

# Create an embedding dictionary where keys are words from the vocabulary and values are their corresponding embeddings
embedding_dict = {word: model.wv[word] for word in vocab}

word = "آزادی"
# Define the number of nearest neighbors to find
k = 10
# Find the k nearest neighbors of the selected word based on their embeddings
k_nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)
# Print the list of k nearest neighbor words in a columnar format with row numbers
for i, word in enumerate(k_nearest_words, start=1):
    print(f"{i}. {word}")



1. ازادی
2. فرزندان
3. آزادیمون
4. قیام
5. آزادی»
6. وطنمان
7. شادی
8. آغاز
9. پیروزی
10. خیابان
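
Word2Vec learns these vectors from local context: each word is paired with the words that appear within a fixed window around it, so words that occur in similar surroundings end up with similar vectors. The sketch below only illustrates how such windows are formed. The function name `context_windows` and the example sentence are hypothetical, and the library's actual training involves further details that are not shown here. In gensim, the window size is controlled by the `window` parameter (default 5), and words seen fewer than `min_count` times (default 5) are dropped from the vocabulary.

```python
def context_windows(tokens, window=2):
    """Pair each token with the tokens at most `window` positions away from it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((center, context))
    return pairs

example = context_windows("we hope for freedom".split(), window=1)
for center, context in example:
    print(center, "->", context)
```

With a window of 1, "hope" is paired with "we" and "for", and "freedom" only with "for". Because `min_count` removes rare words, the Word2Vec vocabulary above can be smaller than the vocabulary used in the one-hot and TF-IDF sections.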


##### Describe advantages and disadvantages of Word2Vec

Word2Vec is a popular technique in natural language processing (NLP) used to generate distributed representations of words in a continuous vector space. Here are some advantages and disadvantages of Word2Vec:<br><br>

Advantages:<br><br>
1- Semantic Similarity: Word2Vec captures semantic similarities between words by representing them as vectors in a continuous space. This means that words with similar meanings are often closer together in the vector space, allowing for more nuanced understanding of language.

2- Efficiency: Word2Vec is computationally efficient to train. Its shallow architecture, combined with techniques such as negative sampling, avoids building and factorizing a full co-occurrence matrix and is much cheaper than training deep neural language models. This efficiency makes Word2Vec suitable for training on large text corpora.

3- Dimensionality Reduction: Word2Vec effectively reduces the dimensionality of the word space while preserving semantic relationships. This allows for more efficient storage and processing of word embeddings, making them easier to work with in downstream NLP tasks.

4- Pre-trained Models: Pre-trained Word2Vec models trained on large text corpora are readily available, which saves time and computational resources for developers who can leverage these pre-trained embeddings for their specific tasks instead of training from scratch.

5- Transfer Learning: Word2Vec embeddings can be transferred and fine-tuned for downstream NLP tasks such as text classification, sentiment analysis, and machine translation. This transfer learning capability enables models to benefit from the semantic knowledge captured during Word2Vec training.
<br><br>

Disadvantages:<br><br>
1- Contextual Information: Word2Vec assigns each word a single fixed vector, so the same word has the same representation regardless of the context in which it appears. This limitation can lead to suboptimal performance in tasks where context is crucial, such as disambiguation or language generation.

2- Out-of-vocabulary Words: Word2Vec struggles with out-of-vocabulary words, as it can only generate embeddings for words seen during training. Rare or unseen words may not have meaningful representations in the Word2Vec space, which can affect the performance of downstream tasks, especially in domains with specialized terminology.

3- Polysemy and Homonymy: Word2Vec may struggle to disambiguate words with multiple meanings (polysemy) or words that are spelled the same but have different meanings (homonymy). In such cases, the word embeddings may not accurately capture the intended semantics, leading to errors in downstream tasks.

4- Training Data Bias: Word2Vec embeddings are trained on large text corpora, which may contain biases present in the data, such as cultural or gender biases. These biases can be inadvertently propagated to downstream applications, potentially amplifying societal prejudices or stereotypes.

5- Fixed Embedding Size: Word2Vec generates fixed-size embeddings for words, meaning that all words are represented by vectors of the same length. This fixed-size representation may not capture the full complexity of language, especially for words with rich semantic or syntactic properties.


# 5. Contextualized embedding

In [28]:
!pip install transformers[sentencepiece]



In [29]:
!pip install accelerate -U


In [30]:
# Load model and tokenizer

# Import necessary modules
from transformers import BertForMaskedLM, BertTokenizer
# torch.optim.AdamW is used in place of the deprecated transformers.AdamW
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader

# Specify the pre-trained model name
model_name = "HooshvareLab/bert-base-parsbert-uncased"

# Define the device for computation (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize the tokenizer using the specified pre-trained model
tokenizer = BertTokenizer.from_pretrained(model_name)

# Initialize the model for Masked Language Modeling (MLM) using the specified pre-trained model
# Move the model to the specified device (GPU or CPU)
model = BertForMaskedLM.from_pretrained(model_name).to(device)




Some weights of the model checkpoint at HooshvareLab/bert-base-parsbert-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [31]:
# Define a custom Dataset class for your text data
class TextDataset(Dataset):
    # The initializer takes a list of texts and a tokenizer
    def __init__(self, texts, tokenizer):
        self.tokenizer = tokenizer  # Store the tokenizer
        # Tokenize the texts and store the result
        self.inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

    # The __len__ method returns the number of texts
    def __len__(self):
        return len(self.inputs["input_ids"])

    # The __getitem__ method returns the inputs for the text at the given index
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.inputs.items()}

# Create an instance of your custom Dataset
dataset = TextDataset(texts, tokenizer)

# Create a DataLoader to handle batching of your data
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Initialize the AdamW optimizer with a learning rate of 1e-4
optimizer = AdamW(model.parameters(), lr=1e-4)

# Put the model in training mode
model.train()

# Train the model for 3 epochs
for epoch in range(3):
    epoch_loss = 0  # Initialize the loss for this epoch
    # Loop over each batch in the DataLoader
    for batch in tqdm(dataloader, total=len(dataloader)):
        optimizer.zero_grad()  # Zero the gradients
        # Move the batch data to the device
        inputs = {key: val.to(device) for key, val in batch.items()}
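        # Use the input ids as labels. No tokens are masked here, so the model is trained
        # to reproduce its own input rather than to predict hidden tokens.
        inputs['labels'] = inputs['input_ids'].detach().clone()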
        # Forward pass: compute the model outputs
        outputs = model(**inputs)
        # Compute the loss
        loss = outputs.loss
        # Accumulate the loss over the epoch
        epoch_loss += loss.item()
        # Backward pass: compute the gradients
        loss.backward()
        # Update the model parameters
        optimizer.step()
    # Print the loss summed over all batches of this epoch
    print(f'{epoch}- loss: {epoch_loss}')


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 520/520 [09:56<00:00,  1.15s/it]


0- loss: 44.535070070720394


100%|██████████| 520/520 [09:58<00:00,  1.15s/it]


1- loss: 0.15834053394792136


100%|██████████| 520/520 [09:58<00:00,  1.15s/it]

2- loss: 0.06375286062029772
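
The loss falls close to zero after the first epoch because the training loop uses the unmasked input ids as labels, so the model can simply copy its input. Masked language modeling normally hides a random subset of tokens and computes the loss only on those positions. The sketch below illustrates the BERT-style scheme in plain Python: about 15% of the non-special tokens are selected, and of those 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. All ids in the example are made up, and `mask_tokens` is a hypothetical helper; in practice the `transformers` library provides `DataCollatorForLanguageModeling` for this purpose.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for a single sequence of token ids."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never mask special tokens such as [CLS], [SEP], or [PAD]
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

# Hypothetical ids: 2 = [CLS], 3 = [SEP], 0 = [PAD], 4 = [MASK]
example_ids = [2, 10, 11, 12, 13, 3, 0, 0]
masked_ids, labels = mask_tokens(
    example_ids, mask_id=4, vocab_size=100, special_ids={0, 2, 3}, rng=random.Random(0)
)
print(masked_ids)
print(labels)
```

Positions whose label is -100 are ignored by the loss, so padding and unmasked tokens no longer contribute to it. With this kind of masking the reported loss would be expected to stay well above zero, since the model has to predict tokens it cannot see.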





In [33]:
# 1. fine-tune the model based on all tweets
# 2. find 10 nearest words from "آزادی"

# Extract the input (token) embedding matrix of the fine-tuned BERT model: one row per vocabulary entry.
# These are static, context-independent vectors, not the contextual hidden states produced by the encoder layers.
embedding_matrix = model.bert.embeddings.word_embeddings.weight.detach().cpu().numpy()

# Map every token in the tokenizer's vocabulary to its embedding row
embedding_dict = {token: embedding_matrix[token_id] for token, token_id in tokenizer.get_vocab().items()}


# The query word, spelled "ازادی" here, as it appears in the tokenizer's vocabulary
word = "ازادی"
# Define the number of nearest neighbors to find
k = 10

# Find the k vocabulary tokens whose embeddings have the highest cosine similarity to the query word
k_nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)

# Print the list of k nearest words with their indices
for i, nearest_word in enumerate(k_nearest_words, start=1):
    print(f"{i}. {nearest_word}")


1. ازادسازی
2. ازاد
3. ازادیهای
4. ازادى
5. ازادیها
6. رهایی
7. ازادانه
8. اسایش
9. ازادشدن
10. انعطافپذیری


##### Describe advantages and disadvantages of Contextualized embedding

Contextualized word embeddings, such as those produced by models like ELMo, GPT, and BERT, offer several advantages and disadvantages:<br><br>

Advantages:
<br><br>
1- Contextual Understanding: Contextualized embeddings capture the meaning of a word based on its context within a sentence. This allows for a deeper understanding of the word's semantics compared to static embeddings like Word2Vec or GloVe.

2- Polysemy Handling: Words can have multiple meanings (polysemy), and contextual embeddings can capture these nuances better by considering the surrounding context. For instance, "bank" could refer to a financial institution or the side of a river.

3- Transfer Learning: Pre-trained contextualized embeddings can be fine-tuned on specific downstream tasks with relatively small amounts of task-specific data. This transfer learning approach often leads to improved performance on various NLP tasks.

4- Dynamic Representations: Unlike static embeddings, which assign a fixed representation to each word, contextualized embeddings produce dynamic representations that vary based on the context. This dynamic nature helps in capturing subtle changes in meaning across different contexts.

5- Out-of-Vocabulary (OOV) Handling: Models such as BERT split rare or unseen words into subword units (WordPiece tokens), so they can still build a context-dependent representation for words that never appeared as whole tokens in the vocabulary. This is particularly useful for rare or domain-specific terms.
<br><br>


Disadvantages:
<br><br>
1- Computational Complexity: Contextualized embedding models are computationally intensive and require significant resources for training and inference compared to static embedding models. Fine-tuning on downstream tasks can also be time-consuming.

2- Limited Interpretability: The contextualized embeddings produced by models like BERT or GPT are high-dimensional and lack direct interpretability. Understanding why a particular representation is generated for a word in a given context can be challenging.

3- Data Dependency: Contextualized embeddings rely on very large amounts of text for pre-training. The text does not need to be annotated, since pre-training is self-supervised, but this dependency can still be a limitation, especially for low-resource languages or specialized domains with limited text corpora.

4- Domain Specificity: Pre-trained contextualized embeddings may not capture domain-specific nuances effectively, especially if the pre-training data is not representative of the target domain. Fine-tuning on domain-specific data can help mitigate this issue to some extent.

5- Context Window Limitation: Although contextualized embeddings consider the surrounding context, they are limited by the model's maximum input length (512 tokens for BERT). Dependencies that span beyond this limit are not captured, which affects the model's understanding of long texts.
