# Training and Visualizing Word Embeddings with Word2Vec

## Introduction
This tutorial shows how to train Word2Vec word embeddings on text data and visualize the results. Word2Vec maps each word to a dense numerical vector learned from the contexts in which the word appears, so words used in similar contexts end up with similar vectors.

The main steps we'll cover:
1. Setting up the required libraries
2. Loading and preprocessing text data
3. Training a Word2Vec model
4. Exploring word relationships
5. Visualizing word embeddings using t-SNE

## Setup: Required Libraries
First, we'll import all necessary libraries for our analysis:

In [None]:
import pandas as pd
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import re
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
import random

This code imports:
- `pandas`: For data manipulation
- `gensim.models.Word2Vec`: The main Word2Vec implementation
- `sklearn.manifold.TSNE`: For dimensionality reduction and visualization
- `matplotlib.pyplot`: For creating visualizations
- `nltk`: For text processing utilities
- Other utilities for regular expressions and random sampling

## Downloading NLTK Resources
We need to download some NLTK resources for text processing:

In [None]:
# Ensure NLTK resources are available
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

These downloads provide:
- `punkt`: Pretrained tokenizer models that `word_tokenize` relies on
- `stopwords`: Lists of common words to filter out
- `punkt_tab`: The form of the `punkt` data that newer NLTK releases look for

## Loading the Data
Now we load our text data:

In [None]:
# Load your data
file_path = '/content/messages.txt'
data = pd.read_csv(file_path, sep='\t')

The file should be tab-separated, with a header row that includes an `utterance` column containing the text to analyze.
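As a rough illustration of that layout, the sketch below parses a tiny in-memory sample with the standard library's `csv` module (swapped in here for pandas only to keep the example dependency-free). The sample rows and the `speaker` column are made up for illustration; only the `utterance` column name comes from the tutorial.

```python
import csv
import io

# Hypothetical sample standing in for messages.txt:
# a header row, then one tab-separated row per message.
sample = "speaker\tutterance\nalice\tHello there!\nbob\tSee https://example.com\n"

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

# The preprocessing step needs a column named 'utterance'.
has_utterance = "utterance" in reader.fieldnames
utterances = [row["utterance"] for row in rows]
print(has_utterance, utterances)
```

If the header check fails on your own file, the later `data['utterance']` lookup would raise a `KeyError`, so it can be worth inspecting the column names right after loading.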

## Text Preprocessing
We define a function to clean and tokenize our text data:

In [None]:
def preprocess_text(text):
    if isinstance(text, str):
        # Remove links and similar non-words using regular expressions
        text = re.sub(r'http\S+|www\S+', '', text)  # Remove links (http\S+ also matches https)
        text = re.sub(r'@\w+|#', '', text)  # Remove @mentions and the '#' symbol (the hashtag word is kept)

        # Lowercase
        text = text.lower()
        # Remove non-alphanumeric characters (basic cleaning)
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # Tokenize
        tokens = word_tokenize(text)
        return tokens
    else:
        return []  # Returning an empty list for non-string values

# Apply preprocessing to the 'utterance' column
data['tokens'] = data['utterance'].apply(preprocess_text)

This preprocessing function:
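1. Checks that the input is a string, returning an empty list otherwise (for example, for missing values)
2. Removes URLs, @mentions, and the `#` symbol
3. Converts text to lowercase
4. Removes any remaining punctuation and special characters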
5. Tokenizes the text into individual words
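To see what these steps do to a single message, here is a minimal sketch that applies the same regular expressions to a made-up string. It assumes `str.split()` as a stand-in for NLTK's `word_tokenize`, which is a reasonable approximation only because all punctuation has already been removed; the function name `clean_and_split` is hypothetical.

```python
import re

def clean_and_split(text):
    """Apply the tutorial's cleaning steps, splitting on whitespace at the end."""
    text = re.sub(r'http\S+|www\S+', '', text)   # drop links
    text = re.sub(r'@\w+|#', '', text)           # drop @mentions and the '#' symbol
    text = text.lower()                          # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)      # drop remaining punctuation
    return text.split()                          # stand-in for word_tokenize

example = "Check https://example.com @bob #NLP is great!"
print(clean_and_split(example))
```

Note that the hashtag word `NLP` survives (as `nlp`) while the mention `@bob` is removed entirely.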

## Training the Word2Vec Model
Now we train our Word2Vec model:

In [None]:
# Train Word2Vec model
tokenized_corpus = data['tokens'].tolist()
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

Parameters explained:
- `vector_size=100`: Dimensionality of each word vector
- `window=5`: Maximum distance between the current word and a predicted word within a sentence
- `min_count=1`: Ignores words that appear fewer than this many times; with a value of 1, every word is kept
- `workers=4`: Number of worker threads used for training
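The effect of `min_count` and `window` can be pictured with a small standard-library sketch. This is a conceptual illustration on a made-up corpus, not gensim's implementation (the library's actual training loop is more involved); the helper names `filter_by_min_count` and `context_pairs` are hypothetical.

```python
from collections import Counter

# Toy tokenized corpus, made up for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "bird", "flew"]]

def filter_by_min_count(sentences, min_count):
    """Keep only tokens whose corpus-wide frequency is at least min_count."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(filter_by_min_count(corpus, 2))
print(context_pairs(["a", "b", "c"], 1))
```

Raising `min_count` shrinks the vocabulary by dropping rare words, while a larger `window` pairs each word with more distant neighbours.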

## Extracting Word Vectors
Extract the trained word vectors for analysis:

In [None]:
# Extract word vectors for visualization
words = list(model.wv.index_to_key)
word_vectors = model.wv[words]

## Exploring Word Relationships
We can analyze relationships between words:

In [None]:
# Get the vector for a specific word (raises KeyError if the word is not in the vocabulary)
word_vector = model.wv['rapture']
print(word_vector)  # Prints the vector representation of 'rapture'

In [None]:
# Find the most similar words; querying by word (not by raw vector)
# keeps 'rapture' itself out of the results
similar_words = model.wv.most_similar('rapture', topn=20)

# Print the results:
for word, similarity in similar_words:
    print(f"{word}: {similarity}")

This prints the 20 words whose vectors are closest to that of the target word, along with their cosine similarity scores.
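The ranking behind such a query can be sketched in plain Python. The example below uses three made-up 3-dimensional vectors and a hypothetical helper, `rank_by_cosine`; it illustrates cosine-similarity ranking in general and is not gensim's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
toy_vectors = {
    "joy": [0.9, 0.1, 0.0],
    "delight": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

def rank_by_cosine(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for word, score in rank_by_cosine("joy", toy_vectors):
    print(f"{word}: {score:.3f}")
```

Scores near 1 mean the two vectors point in almost the same direction; scores near 0 mean they are close to orthogonal.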

## Visualizing Word Embeddings
Next, we reduce the word vectors to two dimensions with t-SNE so they can be plotted:

In [None]:
# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(word_vectors)

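Parameters:
- `n_components=2`: Reduce to 2 dimensions for visualization
- `random_state=42`: Fixes the random seed for reproducible results

t-SNE's default `perplexity` is 30, and scikit-learn expects it to be smaller than the number of points, so a very small vocabulary needs a lower value.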

## Filtering Stop Words
Stop words are very frequent and carry little meaning, so we leave them out of the plot:

In [None]:
# Get the list of English stop words from nltk
stop_words = stopwords.words('english')
print(stop_words)

In [None]:
# Drop stop words, then draw a random sample to look at a different subset every time
stop_words_set = set(stop_words)
candidates = [w for w in words if w not in stop_words_set]
words_subset = random.sample(candidates, min(200, len(candidates)))

## Creating the Visualization
Finally, create a scatter plot of the word embeddings:

In [None]:
# Plotting the embeddings
plt.figure(figsize=(12, 8))
for word in words_subset:
    # Look up the word's row in reduced_vectors, which follows the order of `words`
    i = model.wv.key_to_index[word]
    plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1])
    plt.text(reduced_vectors[i, 0] + 0.1, reduced_vectors[i, 1] + 0.1, word, fontsize=9)
plt.title('t-SNE visualization of Word2Vec word embeddings')
plt.show()

This creates a 2D visualization where:
- Each point represents a word
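- Words used in similar contexts tend to appear close together; t-SNE preserves local neighborhoods, so distances between far-apart groups are not meaningful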
- The plot shows semantic relationships between words in your corpus
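One detail worth checking when plotting a filtered subset: `reduced_vectors` has one row per entry in `words`, so each word in the subset has to be mapped back to its original row. The sketch below shows the idea with made-up words and coordinates; all names and values in it are illustrative assumptions.

```python
# Toy vocabulary and matching 2D coordinates (one row per word, same order).
words = ["the", "cat", "and", "dog", "bird"]
reduced = [(0.0, 0.0), (1.0, 2.0), (0.1, 0.1), (1.2, 2.1), (5.0, 5.0)]
stop = {"the", "and"}

# Map each word to its row index in the full coordinate list.
word_to_row = {w: i for i, w in enumerate(words)}

# Filter first, then look coordinates up by word rather than by position in the subset.
subset = [w for w in words if w not in stop]
coords = {w: reduced[word_to_row[w]] for w in subset}

# Indexing by position in the subset would pick the wrong rows:
mismatched = {w: reduced[i] for i, w in enumerate(subset)}
print(coords["cat"], mismatched["cat"])
```

Here `mismatched["cat"]` is the coordinate that belongs to `the`, which is the kind of silent label mix-up that looking up by word avoids.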

## Conclusion
This tutorial showed how to:
1. Preprocess text data for word embedding
2. Train a Word2Vec model
3. Explore word relationships
4. Visualize the resulting word embeddings

The visualization can reveal interesting semantic relationships in your text data, showing how words cluster together based on their usage patterns.