## Intro

In this notebook, we continue with vectorization, starting from the previously saved preprocessed dataset.

We will try different vectorization models, such as TF-IDF, Word2Vec, and Doc2Vec. For the popular Word2Vec model, both spaCy and Gensim provide APIs and pre-trained models.

We will evaluate the vectorization results by accuracy score (predicted vs. actual labels) and runtime.

# **Data Load**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Read the dataset from the specified path
df = pd.read_csv('/content/preprocess_text_Spacy_Nltk.csv', sep=',', encoding='utf-8', quotechar='"')


In [None]:
df.isnull().sum()

text              0
type              0
processed_Text    0
dtype: int64

# **Vector model**

**TfidfVectorizer** Batch processing

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer and fit it once on the whole corpus,
# so that every batch shares the same vocabulary and IDF weights
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(df['processed_Text'])

# Define batch size
batch_size = 500

# Get the number of batches (ceiling division)
num_batches = (len(df) + batch_size - 1) // batch_size

# Initialize an empty list to store TF-IDF vectors
tfidf_vectors = []

# Process data in batches
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(df))
    batch_texts = df['processed_Text'][start_idx:end_idx]

    # Only transform here: refitting per batch would give each batch a
    # different vocabulary, and therefore vectors of different lengths
    tfidf_matrix = tfidf_vectorizer.transform(batch_texts)

    # Convert the sparse TF-IDF matrix to dense rows and append them
    # (dense rows can use a lot of memory when the vocabulary is large)
    tfidf_vectors.extend(tfidf_matrix.toarray().tolist())

# Store the TF-IDF vectors in the 'vector' column of the DataFrame
df['vector'] = tfidf_vectors

# Display the DataFrame after adding the vector column
print(df.head())
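
The batching arithmetic in this cell (ceiling division for the batch count, then `start_idx`/`end_idx` slicing) is repeated in the later cells as well. As a minimal sketch, it could be factored into a small generator. `iter_batches` is a hypothetical helper name, not something the notebook defines, and the toy list stands in for `df['processed_Text']`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Stepping by batch_size makes the last slice shorter when needed,
    # which replaces the explicit min(...) on the end index.
    for start_idx in range(0, len(items), batch_size):
        yield items[start_idx:start_idx + batch_size]


# Toy data standing in for the text column (assumption for illustration)
batches = list(iter_batches(list(range(7)), 3))
print(batches)
```

Slicing past the end of a sequence is safe in Python, so the helper needs no special case for the final, shorter batch.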

**spaCy's Word2Vec model**

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 10.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy

# Load spaCy model with pre-trained word vectors
nlp = spacy.load("en_core_web_md")

# Function to obtain Word2Vec embeddings for text
def get_word2vec_embeddings_batch(texts):
    text_vectors = []
    for text in texts:
        tokens = nlp(text)
        word_vectors = [token.vector for token in tokens if not token.is_punct and not token.is_space]
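        if word_vectors:
            # Average the token vectors into one fixed-length document vector
            text_vector = np.mean(word_vectors, axis=0)
        else:
            # Fall back to a zero vector so every row has the same shape
            text_vector = np.zeros(nlp.vocab.vectors_length, dtype="float32")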
        text_vectors.append(text_vector)
    return text_vectors

# Apply the function to the processed_Text column in batches
batch_size = 100  # Adjust the batch size as needed
num_batches = (len(df) + batch_size - 1) // batch_size
vectors = []
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(df))
    batch_texts = df['processed_Text'][start_idx:end_idx]
    batch_vectors = get_word2vec_embeddings_batch(batch_texts)
    vectors.extend(batch_vectors)

# Store the resulting vectors in the 'vector' column of the DataFrame
df['vector'] = vectors

# Display the DataFrame after adding the vector column
df.head()

**Gensim's LDA topic model (LdaModel)**

In [None]:
from gensim.models import LdaModel
from gensim import corpora

# Create a Gensim dictionary mapping each word to a unique integer ID
dictionary = corpora.Dictionary(df['processed_Text'].apply(lambda x: x.split()))

# Create a bag-of-words corpus
bow_corpus = [dictionary.doc2bow(doc.split()) for doc in df['processed_Text']]

# Train an LDA model on the corpus with the desired number of topics
num_topics = 10  # Adjust the number of topics as needed
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary)

# Function to infer topic distribution for each document and return as vector representation
def infer_lda_vector(text):
    bow_vector = dictionary.doc2bow(text.split())
    lda_vector = lda_model[bow_vector]
    return lda_vector

# Apply the function to the processed_Text column
df['vector'] = df['processed_Text'].apply(infer_lda_vector)

# Display the DataFrame after adding the vector column
df.head()

| | text | type | processed_Text | vector |
|---|---|---|---|---|
| 0 | WASHINGTON (Reuters) - The head of a conservat... | True | washington reuter head conservative republic... | [(0, 0.20272386), (2, 0.08436078), (5, 0.06530... |
| 1 | WASHINGTON (Reuters) - Transgender people will... | True | washington reuters transgender people allow ... | [(2, 0.61250937), (3, 0.025587862), (5, 0.1052... |
| 2 | WASHINGTON (Reuters) - The special counsel inv... | True | washington reuter special counsel investigat... | [(1, 0.073686436), (6, 0.7563586), (8, 0.15961... |
| 3 | WASHINGTON (Reuters) - Trump campaign adviser ... | True | washington reuters trump campaign adviser ge... | [(1, 0.06463825), (4, 0.096640006), (6, 0.8353... |
| 4 | SEATTLE/WASHINGTON (Reuters) - President Donal... | True | seattlewashington reuters president donald t... | [(0, 0.07565529), (3, 0.073021084), (5, 0.1398... |
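
The `vector` column above holds sparse lists of `(topic_id, probability)` pairs, and their length varies from document to document because low-probability topics are left out. Most classifiers expect fixed-length input, so one option is to expand each list into a dense vector of length `num_topics`. The sketch below is only an illustration: `to_dense_topic_vector` is a hypothetical helper, and the sample uses just the first two pairs visible in row 0 of the output.

```python
def to_dense_topic_vector(sparse_topics, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics.

    Topics missing from the sparse list get probability 0.0.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = float(prob)
    return dense


# First two (topic_id, probability) pairs shown for row 0 above
sample = [(0, 0.20272386), (2, 0.08436078)]
dense = to_dense_topic_vector(sample, num_topics=10)
print(dense)
```

In the notebook this could be applied with something like `df['vector'].apply(lambda v: to_dense_topic_vector(v, num_topics))`, assuming `num_topics` matches the value used to train the model.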


**Gensim's Doc2Vec model**

In [None]:
import nltk

# Download the necessary NLTK data
nltk.download('punkt')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Train a Doc2Vec model
tagged_docs = [TaggedDocument(words=word_tokenize(doc), tags=[i]) for i, doc in enumerate(df['processed_Text'])]
doc2vec_model = Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
doc2vec_model.build_vocab(tagged_docs)
doc2vec_model.train(tagged_docs, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

# Define a function to infer Doc2Vec vectors for a batch of texts
def infer_vector_batch(texts, model):
    vectors = []
    for text in texts:
        vectors.append(model.infer_vector(word_tokenize(text)))
    return vectors

# Apply the function to the processed_Text column in batches
batch_size = 100  # Adjust the batch size as needed
num_batches = (len(df) + batch_size - 1) // batch_size
vectors = []
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(df))
    batch_texts = df['processed_Text'][start_idx:end_idx]
    batch_vectors = infer_vector_batch(batch_texts, doc2vec_model)  # Pass the Doc2Vec model to the function
    vectors.extend(batch_vectors)

# Store the resulting vectors in the 'vector' column of the DataFrame
df['vector'] = vectors

# Display the DataFrame after adding the vector column
df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Summary

What are the outputs of the models above? Which one performs well? How should they be evaluated?
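
One way to approach the evaluation question, following the two criteria named in the intro (accuracy and runtime), is to time each vectorization step and then score a classifier's predictions against the `type` labels. The sketch below uses only the standard library. `timed` and `accuracy` are hypothetical helper names, and the toy labels are made up for illustration; they are not results from this dataset.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels (assumption): three of the four predictions match
actual = [True, True, False, False]
predicted = [True, False, False, False]
score = accuracy(predicted, actual)

# Time an arbitrary function call; sorted() stands in for a vectorizer here
result, seconds = timed(sorted, [3, 1, 2])
print(f"accuracy={score:.2f}, runtime={seconds:.6f}s")
```

In practice each vectorization cell could be wrapped in `timed`, the same classifier trained on each resulting `vector` column, and the scores compared side by side. scikit-learn's `accuracy_score` computes the same fraction as the `accuracy` helper here.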
