<a href="https://colab.research.google.com/github/jasper-zheng/teaching/blob/main/digital_images_data_science/Text_classifier_via_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a text classifier using word embeddings  

The dataset used in this example is the [fine-food reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews) dataset from Amazon. It contains 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be clearly positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).  

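We'll use pretrained word embeddings loaded through Gensim. The code below downloads the 100-dimensional `glove-twitter-100` GloVe vectors rather than training a [Word2vec](https://radimrehurek.com/gensim/models/word2vec.html) model, although the variable holding them is named `model_w2v`.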

In [None]:
!wget "https://github.com/jasper-zheng/teaching/blob/main/digital_images_data_science/reviews_10k.csv?raw=true" -O reviews_10k.csv


In [None]:
!pip install gensim

In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from gensim.models import Word2Vec
import gensim.downloader

## Loading the word embedding

In [None]:
model_w2v = gensim.downloader.load('glove-twitter-100')

In [None]:
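import numpy as np

def vectorize_text(text):
    """Return the mean embedding of the known words in `text`."""
    # Accept a raw string as well as a list of tokens. Iterating over a
    # bare string would look up single characters instead of words.
    if isinstance(text, str):
        text = text.lower().split()
    vectors = [model_w2v[word] for word in text if word in model_w2v]
    if vectors:
        return torch.tensor(np.mean(vectors, axis=0))
    # Handle cases with no recognized words
    return torch.zeros(model_w2v.vector_size)


def get_cosine_similarity(vec_a, vec_b):
    """Cosine similarity: higher values mean the vectors point in more similar directions."""
    dot_product = vec_a @ vec_b
    product_of_magnitudes = torch.linalg.norm(vec_a) * torch.linalg.norm(vec_b)
    return (dot_product / product_of_magnitudes).item()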
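
The averaging step in `vectorize_text` is often called mean pooling. Here is a minimal sketch of the idea, using a made-up dictionary of 3-dimensional vectors in place of the real embedding model. The names `toy_embeddings` and `mean_pool` are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only;
# the real model stores 100-dimensional vectors.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "food": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, embeddings, dim=3):
    """Average the vectors of the tokens that appear in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known word: fall back to a zero vector
    return np.mean(vectors, axis=0)

# "xyzzy" is not in the vocabulary, so it is skipped.
sentence_vec = mean_pool("good food xyzzy".split(), toy_embeddings)
print(sentence_vec)  # [2. 1. 1.]

# This toy helper expects a list of tokens: a raw string is iterated
# character by character, and none of 'g', 'o', 'd' is a known word.
char_vec = mean_pool("good", toy_embeddings)
print(char_vec)  # [0. 0. 0.]
```

Mean pooling ignores word order, so "not good" and "good not" produce the same vector. That is a known limitation of this simple approach.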

## Inspecting our word embedding model:

In [None]:
dog = vectorize_text("dog")
cat = vectorize_text("cat")
computers = vectorize_text("computers")

In [None]:
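print(f'similarity between dog and cat is {get_cosine_similarity(dog, cat)}')
print(f'similarity between dog and computers is {get_cosine_similarity(dog, computers)}')
print(f'similarity between cat and computers is {get_cosine_similarity(cat, computers)}')

These values are cosine similarities, not distances: a larger value means that our embedding places the two words closer together. We would expect `dog` and `cat` to score higher with each other than either does with `computers`. If that is what you see, the embedding matches our intuition!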
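
To make the measure concrete: cosine similarity is the dot product of two vectors divided by the product of their magnitudes, and it ranges from -1 to 1. The sketch below uses made-up 2-dimensional vectors (not taken from the embedding) to show the three reference cases, and how a cosine distance can be derived as 1 minus the similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors only
a = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # longer, but same direction
orthogonal = np.array([0.0, 3.0])
opposite = np.array([-1.0, 0.0])

sim_same = cosine_similarity(a, same_direction)   # 1.0
sim_orth = cosine_similarity(a, orthogonal)       # 0.0
sim_opp = cosine_similarity(a, opposite)          # -1.0

# Cosine distance grows as vectors diverge, the reverse of similarity.
dist_same = 1.0 - sim_same   # 0.0
dist_opp = 1.0 - sim_opp     # 2.0

print(sim_same, sim_orth, sim_opp)
```

Because the measure depends only on direction, scaling a vector does not change its similarity to other vectors.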

In [None]:
model_w2v.most_similar('computer', topn=10)

### Issues and Biases in word embeddings:  

Using word embeddings for quantitative analysis can be quite problematic, because they absorb the biases of the text they were trained on, for example around gender, race, class, sexuality or disability. Here are some examples:

In [None]:
doctor = vectorize_text("doctor")
woman = vectorize_text("woman")
man = vectorize_text("man")

print(f'similarity between doctor and woman is {get_cosine_similarity(doctor, woman)}')
print(f'similarity between doctor and man is {get_cosine_similarity(doctor, man)}')
print('please be critical when using word embeddings')

Try out some other words to see if you can reveal some other problematic terms in the embedding.

## Using the embedding to create a food review dataset

In [None]:
reviews_df = pd.read_csv('reviews_10k.csv')


# Preprocessing
reviews_df = reviews_df.dropna()  # Remove rows with missing values
reviews = reviews_df['Text'].apply(lambda x: x.lower().split()).tolist()  # Tokenize text
reviews_df.head(3)

In [None]:
reviews_vectors = [vectorize_text(review) for review in reviews]
scores = reviews_df['Score'].values


In [None]:
# Encode scores as consecutive class indices starting at 0, as nn.CrossEntropyLoss expects
le = LabelEncoder()
scores_encoded = le.fit_transform(scores)

# Split data
X_train, X_test, y_train, y_test = train_test_split(reviews_vectors, scores_encoded, test_size=0.2, random_state=42)
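
The label encoding step matters because `nn.CrossEntropyLoss` expects class indices that start at 0. `LabelEncoder` sorts the distinct labels and replaces each one with its position in that sorted list. Below is a NumPy sketch of the same mapping, using made-up scores rather than the real data.

```python
import numpy as np

# Made-up review scores for illustration
toy_scores = np.array([5, 1, 4, 5, 2, 3])

# np.unique returns the sorted distinct values plus, for each element,
# its index into that sorted list.
classes, encoded = np.unique(toy_scores, return_inverse=True)
print(classes)   # [1 2 3 4 5]
print(encoded)   # [4 0 3 4 1 2]

# Indexing the class list with the encoded values recovers the original
# scores, which is what le.inverse_transform does in predict_score.
decoded = classes[encoded]
print(decoded)   # [5 1 4 5 2 3]
```

If a score value never appears in the data, it gets no index, so the number of classes is `len(le.classes_)` rather than a fixed 5.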


In [None]:
# Create PyTorch dataset
class ReviewDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = torch.tensor(y, dtype=torch.long)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Classifier Model
class TextClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(TextClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x


train_dataset = ReviewDataset(X_train, y_train)
test_dataset = ReviewDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)


## Create a Neural Network Classifier Model

In [None]:
model = TextClassifier(100, len(le.classes_))  # Assuming 100-dim word embeddings

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

## Train the Classifier Model

In [None]:
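num_epochs = 30  # Adjust as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()              # reset gradients from the previous step
        outputs = model(inputs)            # forward pass: one logit per class
        loss = criterion(outputs, labels)
        loss.backward()                    # backpropagate
        optimizer.step()                   # update the weights
        running_loss += loss.item()
    if epoch % 2 == 0:
        # Report the mean loss over the epoch rather than only the last batch
        avg_loss = running_loss / len(train_loader)
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")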
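
The training loss only tells us how well the model fits the data it has seen. The notebook builds a `test_loader`, so one way to measure held-out accuracy is sketched below. The `evaluate_accuracy` helper is hypothetical (not part of the original notebook), and the random tensors and small model are stand-ins so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_accuracy(model, loader, device):
    """Return the fraction of examples whose top-scoring class matches the label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predicted = model(inputs).argmax(dim=1)  # class with the highest logit
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Stand-in data and model: 64 random 100-dim vectors with 5 classes
torch.manual_seed(0)
toy_X = torch.randn(64, 100)
toy_y = torch.randint(0, 5, (64,))
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=32)
toy_model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 5))

accuracy = evaluate_accuracy(toy_model, toy_loader, torch.device("cpu"))
print(f"Accuracy: {accuracy:.2%}")
```

In this notebook the call would be `evaluate_accuracy(model, test_loader, device)`. Review scores are often imbalanced, so it is worth comparing the result against the share of the most common score.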

## Inspecting some results

In [None]:
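def predict_score(text, model, le, device):
    # Tokenize the same way as the training data, then embed
    text_vector = vectorize_text(text.lower().split()).to(device)
    model.eval()
    with torch.no_grad():
        output = model(text_vector)       # logits, one per class
        predicted = output.argmax(dim=0)  # index of the highest logit
    # Map the class index back to the original score
    return le.inverse_transform([predicted.item()])[0]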

In [None]:
new_text = "This is an amazing product! I highly recommend it."
predicted_score = predict_score(new_text, model, le, device)
print(f"Predicted score: {predicted_score}")

In [None]:
new_text = "The cheesecake is not as advertised"
predicted_score = predict_score(new_text, model, le, device)
print(f"Predicted score: {predicted_score}")