# Word Embedding Techniques (Word2Vec, BERT, and Doc2Vec) for Text Vectorization

In this notebook, we explore and apply three popular methods for generating vector representations (embeddings) of text data: **Word2Vec**, **BERT**, and **Doc2Vec**. These methods are essential for converting text into numerical vectors that can be used in various NLP tasks such as text classification, clustering, and more.

1. **Word2Vec**:
   - **Description**: Word2Vec is a neural network-based model that learns vector representations of words from large text corpora. It captures the semantic relationships between words by learning from context.
   - **Advantages**:
     - Flexible and customizable for specific tasks.
     - Efficient for small to medium-sized datasets.
   - **Disadvantages**:
     - Slower when working with large datasets.
     - pandas may emit performance warnings when the vectors are expanded into many DataFrame columns.
   
   **Note**: In this notebook, **Word2Vec** is used to generate word embeddings by training the model on the text data and then averaging word vectors to generate sentence embeddings.

2. **BERT**:
   - **Description**: BERT (Bidirectional Encoder Representations from Transformers) uses a transformer-based model to generate **contextualized embeddings**. Unlike static embeddings such as Word2Vec, which assign a single vector to each word, BERT uses the surrounding context, so the same word can receive different vectors in different sentences.
   - **Advantages**:
     - Highly accurate and context-aware embeddings.
     - Suitable for tasks requiring deep understanding of language context.
   - **Disadvantages**:
     - Computationally expensive and slower than other methods.
     - Requires significant memory and resources, making it less efficient for very large datasets.
   
   **Note**: **BERT** is applied here to generate more context-aware sentence embeddings. These embeddings capture the meaning of words depending on their surrounding context.

3. **Doc2Vec**:
   - **Description**: Doc2Vec extends the Word2Vec approach by generating embeddings not just for words, but for entire documents or sentences. This method captures the overall meaning of larger text chunks, making it useful for document-level tasks.
   - **Advantages**:
     - Great for handling larger documents and capturing the meaning of whole sentences or paragraphs.
     - Works well for tasks requiring the understanding of full text.
   - **Disadvantages**:
     - Training can be computationally intensive.
     - May not offer as much control over individual words compared to **Word2Vec**.
   
   **Note**: **Doc2Vec** is used here to generate embeddings that represent entire documents or sentences, helping capture the semantic meaning beyond individual words.

### **How to Use**:
- **Word2Vec** is used for learning word embeddings and generating sentence representations by averaging the word vectors.
- **BERT** is used to create contextualized embeddings, where the meaning of a word is influenced by the surrounding context.
- **Doc2Vec** is applied to generate document-level embeddings, offering a higher-level representation of text.

In this notebook, we demonstrate how to use each of these methods to generate embeddings for text data and save the results as a CSV file for further analysis.
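
As a minimal sketch of the averaging step described above, the snippet below turns a list of tokens into one sentence vector. The 3-dimensional `word_vectors` dictionary and the `average_vector` helper are hypothetical stand-ins for illustration; in the notebook the vectors come from a trained Word2Vec model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply these
word_vectors = {
    "budget": np.array([1.0, 0.0, 2.0]),
    "election": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors):
    """Return the mean of the known tokens' vectors, or None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# "for" has no vector, so only "budget" and "election" contribute to the mean
sentence_vector = average_vector(["budget", "for", "election"], word_vectors)
print(sentence_vector)  # [2. 1. 1.]
```

Words without a vector are skipped rather than treated as zeros, so they do not pull the average toward the origin.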


In [4]:
data = pd.read_csv("/kaggle/input/text-document-classification-dataset/df_file.csv")
data

Unnamed: 0,Text,Label
0,Budget to set scene for election\n \n Gordon B...,0
1,Army chiefs in regiments decision\n \n Militar...,0
2,Howard denies split over ID cards\n \n Michael...,0
3,Observers to monitor UK election\n \n Minister...,0
4,Kilroy names election seat target\n \n Ex-chat...,0
...,...,...
2220,India opens skies to competition\n \n India wi...,4
2221,Yukos bankruptcy 'not US matter'\n \n Russian ...,4
2222,Survey confirms property slowdown\n \n Governm...,4
2223,High fuel prices hit BA's profits\n \n British...,4


In [5]:
# Import necessary libraries
import pandas as pd
import gensim
import nltk
import torch
from tqdm import tqdm
from transformers import BertTokenizer, AutoModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

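# Initialize a tokenizer and a model.
# Note: these two lines load different checkpoints, and both names are
# reassigned to 'bert-base-uncased' in the BERT section below.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")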

nltk.download('punkt')  # Required for Doc2Vec tokenization


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Word2Vec

### First version: Word2Vec (original code)
- Builds a custom word-to-vector dictionary with Word2Vec and averages the vectors of the words in each text.
- Flexible in handling text and can be customized to your needs.
- Relatively slow on large datasets because the Word2Vec model is trained from scratch.
- May trigger performance warnings when many columns are added to the DataFrame.

#### Advantages:
- More flexible and customizable for handling specific words or advanced text processing.
- Good for projects that work with smaller, manageable datasets.
- Useful when you want more control over how word vectors are handled.

#### Disadvantages:
- Slower on large datasets because it builds the Word2Vec model from scratch.
- Performance warnings may appear when too many columns are added to the DataFrame.



In [6]:
import numpy as np
import pandas as pd
import gensim

def W2VAverage(Data, Feature, VectorSize=100):
    # Lowercase and split each text into tokens. The same tokens are used for
    # training and for lookup, so the casing is consistent in both steps.
    Tokens = [str(t).lower().split() for t in Data[Feature].tolist()]

    # Train a Word2Vec model on the tokenized text with the specified vector size
    # (gensim's default min_count=5 drops words seen fewer than five times)
    model = gensim.models.Word2Vec(Tokens, vector_size=VectorSize)

    # Create a dictionary that maps each vocabulary word to its Word2Vec vector
    W2VDict = {w: model.wv.get_vector(w) for w in model.wv.key_to_index}

    # Average the vectors of the in-vocabulary words of one text
    def ApplyW2V(words):
        vecs = [W2VDict[w] for w in words if w in W2VDict]
        # Return a list of VectorSize elements, or None if no word matched
        return np.mean(vecs, axis=0).tolist() if vecs else None

    # Compute one averaged vector per text without modifying the input 'Data'
    Vectors = [ApplyW2V(words) for words in Tokens]

    # Keep only the rows for which at least one word matched
    keep = [v is not None for v in Vectors]

    # Convert the vectors to separate columns (one column per dimension)
    W2VColumns = pd.DataFrame([v for v in Vectors if v is not None],
                              columns=[f'C{i+1}' for i in range(VectorSize)])

    # Add the original text of each kept row (the column is named 'Word',
    # but it holds the full text)
    W2VColumns['Word'] = Data.loc[keep, Feature].values

    # Return the DataFrame with one averaged vector per text
    return W2VColumns

# Example of calling the function with your dataset
W2VAverageData = W2VAverage(data, 'Text', VectorSize=100)

# Save the output to a CSV file if you are working on Kaggle
W2VAverageData.to_csv('W2VAverageData_output.csv', index=False)


In [7]:

W2VAverageData.head()

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,...,C92,C93,C94,C95,C96,C97,C98,C99,C100,Word
0,-0.131122,0.13242,0.371762,0.08539,0.198889,-0.20854,-0.092727,0.779647,-0.289085,-0.236045,...,0.383347,0.107613,-0.360982,0.737224,0.098876,-0.072426,-0.164225,0.330256,-0.140899,Budget to set scene for election\n \n Gordon B...
1,-0.130588,0.060809,0.374329,0.141239,0.206432,-0.139559,-0.063602,0.753321,-0.282003,-0.189063,...,0.393399,0.173181,-0.299578,0.752268,0.115492,-0.111245,-0.150251,0.349589,-0.160532,Army chiefs in regiments decision\n \n Militar...
2,-0.009661,0.0734,0.422543,0.052059,0.133188,-0.100056,-0.145044,0.818228,-0.079964,-0.295547,...,0.339059,0.091743,-0.4151,0.779377,0.220381,-0.07698,0.072737,0.263418,-0.074121,Howard denies split over ID cards\n \n Michael...
3,-0.055933,0.106168,0.351832,0.041982,0.144685,-0.214028,-0.159773,0.783564,-0.268072,-0.314375,...,0.338897,0.062605,-0.331467,0.653366,0.140105,-0.063807,-0.081929,0.2904,-0.092682,Observers to monitor UK election\n \n Minister...
4,0.049851,0.094148,0.272093,0.044148,0.095214,-0.145893,-0.13375,0.794728,-0.271272,-0.262693,...,0.294523,-0.023523,-0.466268,0.718938,0.157991,0.0072,-0.014884,0.256474,-0.098263,Kilroy names election seat target\n \n Ex-chat...


### Second version: Word2Vec (modified code)
- Easier and faster to use; computes the average directly from the Word2Vec model.
- Less flexible than the first version because it relies on fixed settings and does not offer advanced customization options.
- Faster in processing than the first version.

#### Advantages:
- Faster and more efficient, especially for smaller datasets.
- Simple and easy to implement, providing quick results.
- Ideal for tasks that don't require heavy customization or advanced configurations.

#### Disadvantages:
- May not handle large datasets as effectively if more advanced processing is needed.


In [8]:
import pandas as pd
import gensim
import numpy as np  # Used to compute the mean vector

def W2VAverage(Data, Feature, VectorSize=100):
    """
    Convert text to Word2Vec vectors and compute the mean word vector of each text.

    Args:
    - Data: DataFrame containing the text data.
    - Feature: The name of the column containing the text data.
    - VectorSize: The desired size of the word vectors (default is 100).

    Returns:
    - A DataFrame with one averaged Word2Vec vector per text (columns 0..VectorSize-1)
      plus a 'Word' column holding the original text.
    """

    # Convert each text into a list of words.
    Text = Data[Feature].apply(lambda x: str(x).split())

    # Train a Word2Vec model on the text data.
    model = gensim.models.Word2Vec(Text, vector_size=VectorSize, window=5, min_count=1, workers=4)

    # Calculate the average Word2Vec vector for one text (a list of words).
    def ApplyW2V(x):
        vectors = []  # Initialize an empty list to store the vectors
        for word in x:  # Iterate through each word in the text
            if word in model.wv:  # Check if the word is in the Word2Vec model's vocabulary
                vectors.append(model.wv[word])  # Append the word vector

        # If we have vectors, calculate the mean vector across all words.
        if vectors:
            return pd.Series(np.mean(vectors, axis=0))  # Use numpy's mean to calculate the average
        else:
            return pd.Series([None] * VectorSize)  # If no words are found, return a series of None

    # Apply the ApplyW2V function to every text; each returned Series becomes one row.
    W2VData = Text.apply(ApplyW2V)

    # Add the original text to the result (the column is named 'Word').
    W2VData['Word'] = Data[Feature].values

    return W2VData

# Example usage of the function:
# W2VAverageData = W2VAverage(data, 'Text', VectorSize=100)

# To save the output to a CSV file:
# W2VAverageData.to_csv('output.csv', index=False)


In [9]:
# Example of calling the function on the dataset:
W2VAverageData = W2VAverage(data, 'Text', VectorSize=100)

# Save the results to a CSV file:
W2VAverageData.to_csv('output.csv', index=False)


# BERT


In [10]:
import torch
import pandas as pd
from transformers import BertTokenizer, BertModel
from tqdm import tqdm

# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Compute sentence embeddings and save the output to a CSV file
def bert_vectorize_and_save(Data, Feature, output_file='bert_output.csv'):
    # Extract the sentences from the text column
    sentences = Data[Feature].tolist()

    # Tokenize the sentences; texts longer than 128 tokens are truncated
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

    # Run the BERT model to extract the token embeddings
    # (all sentences are encoded in a single forward pass)
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean-pool the token embeddings into one vector per sentence
    # (note: this unweighted mean also averages over padding positions)
    sentence_embeddings = model_output.last_hidden_state.mean(dim=1)

    # Convert the embeddings to a DataFrame
    embeddings_df = pd.DataFrame(sentence_embeddings.numpy())

    # Add the 'Label' column if it exists in the data
    # (.values copies the labels row by row, regardless of the index of Data)
    if 'Label' in Data.columns:
        embeddings_df['Label'] = Data['Label'].values

    # Save the result to a CSV file
    embeddings_df.to_csv(output_file, index=False)

    return embeddings_df

# Example of calling the function and saving the output.
# Make sure the data is loaded before calling the function, for example:
# Data = pd.read_csv('your_dataset.csv')
# bert_vectorize_and_save(Data, 'Text', 'output_file.csv')
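
The function above averages `last_hidden_state` over every position, padding included. A common alternative is to weight the average by the attention mask so that only real tokens count. The sketch below shows that arithmetic with NumPy on a toy array; `masked_mean_pool` is a hypothetical helper, not part of the notebook, and in PyTorch the same idea would be applied to `encoded_input['attention_mask']`.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sentence, counting only positions where mask == 1.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sentence with two real tokens and one padding position
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)  # ignores the padding vector
plain = hidden.mean(axis=1)              # includes the padding vector
print(pooled)  # [[2. 2.]]
```

In this toy case the plain mean is dominated by the padding vector, while the masked mean reflects only the two real tokens.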



In [11]:
bert_vectorize_and_save(data, 'Text', 'output_file.csv')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,759,760,761,762,763,764,765,766,767,Label
0,-0.260485,-0.249689,0.575901,-0.277601,-0.003355,-0.030696,0.077894,0.283541,-0.105402,-0.050896,...,0.191574,0.318998,-0.012285,0.138120,-0.562039,0.168880,-0.300333,-0.086521,0.062143,0
1,-0.166651,-0.069870,0.529241,-0.580869,-0.077211,-0.183808,0.123304,0.004518,-0.024211,0.002156,...,0.106484,0.144022,-0.422713,0.007752,-0.464773,-0.234798,-0.179947,-0.086757,0.309456,0
2,-0.248155,-0.358391,0.206174,-0.411694,0.015422,0.144521,0.009294,0.159983,-0.042486,0.137489,...,0.196633,0.142124,-0.202916,0.168043,-0.573460,0.102079,-0.150089,-0.255482,0.333226,0
3,-0.120750,-0.306673,0.436360,-0.418010,0.000711,-0.149914,0.158809,0.201255,-0.070015,0.195831,...,0.282665,0.132892,-0.137338,0.033390,-0.656017,0.055683,-0.230324,-0.158656,0.195435,0
4,-0.299933,-0.382835,0.698436,-0.424881,0.161340,0.051376,0.012397,0.342028,-0.128560,-0.091923,...,0.323920,0.122669,-0.092493,0.178999,-0.362245,0.168734,-0.097546,-0.138226,0.071983,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2220,-0.202220,-0.312977,0.415126,0.032931,0.425125,-0.150159,-0.092723,0.629078,0.125659,-0.077277,...,0.144521,0.127036,-0.156167,0.122648,-0.124463,-0.054310,-0.160410,0.318539,-0.272514,4
2221,-0.195824,-0.044644,0.174195,0.022286,0.101371,-0.275588,-0.167030,0.251951,0.057936,0.306489,...,0.033705,0.170378,-0.464790,-0.058136,-0.357500,0.121089,-0.109391,0.240786,0.090840,4
2222,-0.621629,-0.187879,0.391347,0.034913,0.273063,0.254223,-0.065127,0.430357,-0.058636,0.183197,...,0.136774,0.074971,-0.350466,0.223838,-0.549659,-0.174912,-0.282130,0.319555,-0.103630,4
2223,-0.364588,-0.113018,0.443435,-0.185876,0.238173,0.204525,0.120340,0.348188,0.203169,0.262348,...,0.311545,0.104997,-0.266138,0.319714,-0.510037,-0.129523,-0.191244,0.431673,-0.003329,4
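
The call above tokenizes and encodes every document in a single forward pass, which can need a lot of memory on larger datasets. One way to bound memory, sketched here under the assumption that per-batch embeddings are concatenated afterwards, is to iterate over the texts in fixed-size batches. `iter_batches` is a hypothetical helper written for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy stand-in for the list of documents
texts = [f"document {i}" for i in range(10)]

batches = list(iter_batches(texts, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch would then be tokenized and passed through the model on its own, so only one batch of activations is held in memory at a time.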


# Doc2Vec


In [16]:
import pandas as pd
from nltk.tokenize import word_tokenize
from tqdm import tqdm
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def ApplyDoc2Vec(Data, Feature, VectorSize=300, batch_size=8, output_file='doc2vec_output.csv'):
    # The labels are copied into the output, so the column must exist
    if 'Label' not in Data.columns:
        raise KeyError("'Label' column is missing in the DataFrame.")

    # Clean the text data by replacing double newline characters with a space
    AllReviews = Data[Feature].str.replace('\n\n', ' ')

    # Keep only the rows whose text is not null
    non_na = AllReviews.notna()
    non_na_Reviews = AllReviews[non_na]

    # Tokenize each text and tag it with its position among the non-null rows
    documents = [TaggedDocument(word_tokenize(text), [i])
                 for i, text in enumerate(non_na_Reviews.values)]

    # Train a Doc2Vec model with the specified vector size and other settings
    model = Doc2Vec(documents, vector_size=VectorSize, window=2, min_count=1, workers=4)

    # Collect the trained document vectors; tqdm reports progress per batch
    # of 'batch_size' documents
    all_embeddings = []
    for start in tqdm(range(0, len(documents), batch_size)):
        for i in range(start, min(start + batch_size, len(documents))):
            all_embeddings.append(model.dv[i])  # model.docvecs is model.dv in newer gensim versions

    # Convert the embeddings to a DataFrame (one row per document)
    document_embeddings_df = pd.DataFrame(all_embeddings)

    # Add the labels of the non-null rows, matched row by row
    document_embeddings_df['Label'] = Data.loc[non_na, 'Label'].values

    # Save the embeddings DataFrame to CSV
    document_embeddings_df.to_csv(output_file, index=False)

    # Return the embeddings DataFrame
    return document_embeddings_df


In [17]:

# Apply the function and save the results to a file
result_df = ApplyDoc2Vec(data, 'Text', VectorSize=300, batch_size=8, output_file='doc2vec_output.csv')


100%|██████████| 279/279 [00:00<00:00, 445.15it/s]
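
When labels are attached to an embeddings DataFrame, it matters whether pandas matches rows by index label or by position. The sketch below uses toy data (not the notebook's dataset) whose index has a gap, as it would after dropping a row, to show the difference between assigning a Series and assigning its `.values`.

```python
import pandas as pd

# Toy frame whose index has a gap, as after dropping a row
data = pd.DataFrame({"Text": ["a", "b", "c"], "Label": [0, 1, 4]}, index=[0, 2, 3])

# Embeddings built from a plain list get a fresh 0..n-1 index
emb = pd.DataFrame([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Assigning a Series aligns on index labels: row 1 finds no label and becomes NaN,
# and row 2 receives the label that belongs to the second text
aligned_by_index = emb.copy()
aligned_by_index["Label"] = data["Label"]

# Assigning the underlying array copies the labels row by row
aligned_by_position = emb.copy()
aligned_by_position["Label"] = data["Label"].values

print(aligned_by_position["Label"].tolist())  # [0, 1, 4]
```

With a default `0..n-1` index on both frames the two assignments agree; they diverge only once rows have been filtered or reordered.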
