PROBLEM 2 : NNet supervised classification with tuned word vectors
Train a neural network on a sizeable subset of 20NG (say, at least 5 categories)

Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and do some basic simplification, e.g.

## read the dataset, tokenize and pad
from gensim.utils import simple_preprocess
import torch
tokens = list()
for text in ng_text:
    tokens.append(simple_preprocess(text))
# doc2ind maps one tokenized document to a padded, fixed-length list of GloVe indices
ng_vector_idx = torch.LongTensor([doc2ind(doc) for doc in tokens])

where `ng_vector_idx` is a `torch.Tensor` of integers holding the indices of the GloVe vectors from above, and `doc2ind` is a function you need to write. Note that you should not build the matrix of word embeddings for each document explicitly; store only the vector indices that represent the words in the text (see `torch.nn.Embedding` for more details).

Parameterize an embedding layer for GloVe. With PyTorch, this looks something like:

from torch import nn
glove_emb = nn.Embedding.from_pretrained(< glove vectors from NG tags here >)
glove_emb.weight.requires_grad = False
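
As a rough illustration of the frozen embedding layer described above, the sketch below uses a small random matrix in place of the real GloVe vectors (the sizes and the names `toy_vectors`, `frozen_emb`, and `tuned_emb` are made up for the example). It shows that `from_pretrained` freezes the weights by default and that `freeze=False` gives a trainable layer.

```python
import torch
from torch import nn

# Stand-in for the real GloVe matrix: 5 "words", 4 dimensions (toy sizes, assumed).
toy_vectors = torch.randn(5, 4)

# freeze=True is the default, so the weights start out non-trainable.
frozen_emb = nn.Embedding.from_pretrained(toy_vectors, freeze=True)
print(frozen_emb.weight.requires_grad)   # False

# freeze=False (or setting requires_grad = True later) makes the layer trainable,
# which is what the fine-tuning step of the problem asks for.
tuned_emb = nn.Embedding.from_pretrained(toy_vectors.clone(), freeze=False)
print(tuned_emb.weight.requires_grad)    # True

# Looking up a batch of index sequences returns one vector per index.
idx = torch.tensor([[0, 2, 4], [1, 1, 3]])
print(frozen_emb(idx).shape)             # torch.Size([2, 3, 4])
```

The lookup output has shape (batch, sequence length, embedding dimension), so some pooling or sequence layer is needed before a per-document classifier.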


Construct a neural network using the embedding layer. You're free to design the architecture of the network after that. For example, in PyTorch, the architecture code might look something similar to:

model = nn.Sequential(
   glove_emb,
   ...
   nn.Linear(..., num_classes),
   nn.Softmax(dim=1)
)

It's possible to get a test set accuracy around 63%.
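
One possible way to fill in the `...` above is to average the word vectors of each document and feed the result to a linear layer. The sketch below is only one such design, not the required one: `MeanPoolClassifier` is a hypothetical name, the embedding is a random toy matrix, and index 0 is assumed to be reserved for padding so it can be masked out of the average. The model returns raw logits, which is what `nn.CrossEntropyLoss` expects, since that loss applies log-softmax internally.

```python
import torch
from torch import nn

class MeanPoolClassifier(nn.Module):
    """Embedding -> masked mean over tokens -> linear layer (hypothetical design)."""

    def __init__(self, embedding: nn.Embedding, num_classes: int, pad_idx: int = 0):
        super().__init__()
        self.embedding = embedding
        self.pad_idx = pad_idx  # assumed to be a dedicated padding index
        self.fc = nn.Linear(embedding.embedding_dim, num_classes)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        vectors = self.embedding(idx)                        # (batch, seq_len, dim)
        mask = (idx != self.pad_idx).unsqueeze(-1).float()   # (batch, seq_len, 1)
        summed = (vectors * mask).sum(dim=1)                 # (batch, dim)
        counts = mask.sum(dim=1).clamp(min=1.0)              # avoid division by zero
        return self.fc(summed / counts)                      # raw logits

torch.manual_seed(0)
toy_emb = nn.Embedding.from_pretrained(torch.randn(10, 8))   # toy stand-in for GloVe
clf = MeanPoolClassifier(toy_emb, num_classes=3)
batch = torch.tensor([[4, 7, 2, 0, 0], [5, 1, 0, 0, 0]])     # 0 = padding here
logits = clf(batch)
print(logits.shape)  # torch.Size([2, 3])
```

Masking keeps long runs of padding from diluting short documents; a recurrent or convolutional layer could replace the mean if word order matters.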



Fine-tune the embeddings on 20NG by making your embedding layer trainable, i.e. by unfreezing its weights. After a sufficient amount of training, plot a 2D projection of the resulting embeddings, colored by class, using your choice of dimensionality reduction (PCA, MDS, t-SNE, etc.). Is there any perceptible difference between the embeddings before and after tuning?
You can follow a tutorial such as
https://czarrar.github.io/Gensim-Word2Vec/
https://github.com/ashutoshsingh0223/mittens
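
The fine-tuning and plotting step could be sketched roughly as follows. Everything here is a toy stand-in: the vocabulary, documents, and labels are random, the "embedding of a document" is assumed to be the mean of its word vectors, and PCA is computed with a plain SVD. The helper names `pca_2d` and `doc_vectors` are made up for this example.

```python
import torch
from torch import nn
import matplotlib.pyplot as plt

def pca_2d(x: torch.Tensor) -> torch.Tensor:
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:2].T

def doc_vectors(emb: nn.Embedding, idx: torch.Tensor) -> torch.Tensor:
    """Mean word vector per document, computed without gradient tracking."""
    with torch.no_grad():
        return emb(idx).mean(dim=1)

torch.manual_seed(0)
# Toy data (assumed): 50-word vocabulary, 16-d vectors, 60 documents, 3 classes.
pretrained = torch.randn(50, 16)
idx = torch.randint(0, 50, (60, 12))
labels = torch.randint(0, 3, (60,))

emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)  # unfrozen
head = nn.Linear(16, 3)
before = doc_vectors(emb, idx)

optimizer = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=0.01)
criterion = nn.CrossEntropyLoss()
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(head(emb(idx).mean(dim=1)), labels)
    loss.backward()
    optimizer.step()
after = doc_vectors(emb, idx)

# Side-by-side scatter plots of the document vectors, colored by class label.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, vecs, title in zip(axes, (before, after), ("Before tuning", "After tuning")):
    proj = pca_2d(vecs)
    ax.scatter(proj[:, 0].numpy(), proj[:, 1].numpy(), c=labels.numpy(), cmap="tab10", s=15)
    ax.set_title(title)
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, `plt.show()` would be needed. With real data, t-SNE or MDS from scikit-learn could replace `pca_2d`.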

In [28]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.





In [29]:
# imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import gensim
from gensim.utils import simple_preprocess

In [65]:
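# quoting=3 (csv.QUOTE_NONE) keeps quote characters as ordinary tokens;
# keep_default_na=False stops pandas from reading words such as "nan" or "null" as missing values
glove = pd.read_csv('glove.6B.100d.txt', sep=' ', quoting=3, header=None, keep_default_na=False)
glove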

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,the,-0.038194,-0.244870,0.728120,-0.399610,0.083172,0.043953,-0.391410,0.334400,-0.57545,...,0.016215,-0.017099,-0.389840,0.87424,-0.725690,-0.510580,-0.520280,-0.145900,0.82780,0.270620
1,",",-0.107670,0.110530,0.598120,-0.543610,0.673960,0.106630,0.038867,0.354810,0.06351,...,0.349510,-0.722600,0.375490,0.44410,-0.990590,0.612140,-0.351110,-0.831550,0.45293,0.082577
2,.,-0.339790,0.209410,0.463480,-0.647920,-0.383770,0.038034,0.171270,0.159780,0.46619,...,-0.063351,-0.674120,-0.068895,0.53604,-0.877730,0.318020,-0.392420,-0.233940,0.47298,-0.028803
3,of,-0.152900,-0.242790,0.898370,0.169960,0.535160,0.487840,-0.588260,-0.179820,-1.35810,...,0.187120,-0.018488,-0.267570,0.72700,-0.593630,-0.348390,-0.560940,-0.591000,1.00390,0.206640
4,to,-0.189700,0.050024,0.190840,-0.049184,-0.089737,0.210060,-0.549520,0.098377,-0.20135,...,-0.131340,0.058617,-0.318690,-0.61419,-0.623930,-0.415480,-0.038175,-0.398040,0.47647,-0.159830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
399995,chanty,-0.155770,-0.049188,-0.064377,0.223600,-0.201460,-0.038963,0.129710,-0.294510,0.00359,...,0.093324,0.094486,-0.023469,-0.48099,0.623320,0.024318,-0.275870,0.075044,-0.56380,0.145010
399996,kronik,-0.094426,0.147250,-0.157390,0.071966,-0.298450,0.039432,0.021870,0.008041,-0.18682,...,-0.305450,-0.011082,0.118550,-0.11312,0.339510,-0.224490,0.257430,0.631430,-0.20090,-0.105420
399997,rolonda,0.360880,-0.169190,-0.327040,0.098332,-0.429700,-0.188740,0.455560,0.285290,0.30340,...,-0.044082,0.140030,0.300070,-0.12731,-0.143040,-0.069396,0.281600,0.271390,-0.29188,0.161090
399998,zsombor,-0.104610,-0.504700,-0.493310,0.135160,-0.363710,-0.447500,0.184290,-0.056510,0.40474,...,0.151530,-0.108420,0.340640,-0.40916,-0.081263,0.095315,0.150180,0.425270,-0.51250,-0.170540
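
As an alternative to `pd.read_csv`, the GloVe text file can be parsed line by line, which sidesteps pandas' quoting and missing-value handling entirely. The sketch below assumes the layout visible in the table above (a word followed by space-separated floats) and runs on a tiny made-up sample with 3-d vectors; `load_glove` is a hypothetical helper name.

```python
import io
import torch

def load_glove(handle):
    """Parse GloVe's plain-text format: one word followed by its vector per line."""
    word2idx, rows = {}, []
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], parts[1:]
        word2idx[word] = len(rows)          # row number doubles as the word index
        rows.append([float(v) for v in values])
    return word2idx, torch.tensor(rows, dtype=torch.float32)

# Tiny made-up sample in the same layout (3-d vectors instead of 100-d).
sample = io.StringIO("the 0.1 0.2 0.3\nnull -0.4 0.5 0.6\nnan 0.7 -0.8 0.9\n")
word2idx, vectors = load_glove(sample)
print(word2idx)        # {'the': 0, 'null': 1, 'nan': 2}
print(vectors.shape)   # torch.Size([3, 3])
```

For the real file, the same function could be called as `load_glove(open('glove.6B.100d.txt', encoding='utf-8'))`, ideally inside a `with` block; the resulting float32 tensor can go straight into `nn.Embedding.from_pretrained`.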


In [59]:
#import 20 news groups
from sklearn.datasets import fetch_20newsgroups

# keep 7 of the 20 categories (i.e. drop the other 13)
categories = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'sci.space']
ng_text = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)
ng_text_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories)

ng_text.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'sci.space']

In [66]:
def doc2ind(doc, word2idx):
    return [word2idx[word] for word in doc if word in word2idx]

In [68]:
# Create a word2idx dictionary from GloVe embeddings
word2idx = {word: idx for idx, word in enumerate(glove.iloc[:, 0])}

# Read the dataset, tokenize, and pad
tokens = list()
for text in ng_text.data:  # Use ng_text.data to access the text data
    tokens.append(simple_preprocess(str(text)))  # Ensure the input is a string

# Convert tokens to indices using the word2idx dictionary
indexed_tokens = [doc2ind(doc, word2idx) for doc in tokens]

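# Pad the sequences to ensure uniform length.
# Note: padding_value=0 is also the GloVe index of "the", so padded positions
# are looked up as that word's vector.
# dtype=torch.long keeps empty documents from producing float tensors.
from torch.nn.utils.rnn import pad_sequence
padded_tokens = pad_sequence([torch.tensor(seq, dtype=torch.long) for seq in indexed_tokens], batch_first=True, padding_value=0)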

# Convert to LongTensor
ng_vector_idx = padded_tokens.long()
print(ng_vector_idx)

tensor([[   32,  1688, 15720,  ...,     0,     0,     0],
        [13075,    37,    14,  ...,     0,     0,     0],
        [  346,    37,    14,  ...,     0,     0,     0],
        ...,
        [ 4680,   841,  2185,  ...,     0,     0,     0],
        [   31,  1544,   492,  ...,     0,     0,     0],
        [  279,   298,    25,  ...,     0,     0,     0]])
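
In the output above, padded positions use index 0, which is also the index of "the" in the GloVe table. One way to avoid that collision, sketched here on toy data, is to reserve index 0 for padding by prepending an all-zero row to the embedding matrix and shifting every word index up by one. The names `PAD_IDX`, `toy_vectors`, and `padded_vectors` are assumptions made for this example.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # reserved for padding (assumption: real words are shifted up by one)

# Toy stand-in for the GloVe matrix: 4 words, 3 dimensions.
toy_vectors = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# Prepend an all-zero row so index 0 no longer belongs to a real word.
padded_vectors = torch.cat([torch.zeros(1, 3), toy_vectors], dim=0)
emb = nn.Embedding.from_pretrained(padded_vectors, freeze=True, padding_idx=PAD_IDX)

# Word indices from the original 0-based vocabulary, shifted by one.
docs = [[0, 3, 1], [2]]
shifted = [torch.tensor(d, dtype=torch.long) + 1 for d in docs]
batch = pad_sequence(shifted, batch_first=True, padding_value=PAD_IDX)
print(batch)              # rows: [1, 4, 2] and [3, 0, 0]
print(emb(batch)[1, 1])   # tensor([0., 0., 0.]) -> padding maps to the zero row
```

With `padding_idx` set, the padding row also receives no gradient updates if the layer is later unfrozen for fine-tuning.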


In [70]:
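# Cast to float32 so the embedding matches the dtype of nn.Linear's weights
glove_emb = nn.Embedding.from_pretrained(torch.tensor(glove.iloc[:, 1:].values, dtype=torch.float32))
glove_emb.weight.requires_grad = False  # from_pretrained already freezes by default; kept for clarity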

In [1]:
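# Average the word vectors over the sequence dimension so each document
# becomes one 100-d vector: (batch, seq_len, 100) -> (batch, 100).
class MeanPool(nn.Module):
    def forward(self, x):
        return x.mean(dim=1)

model = nn.Sequential(
    glove_emb,
    MeanPool(),
    nn.Linear(100, 7)
)

# Define the loss function and optimizer.
# CrossEntropyLoss expects raw logits, so the model has no Softmax layer.
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
targets = torch.tensor(ng_text.target, dtype=torch.long)

# Train (full batch) on 20 newsgroups
for epoch in range(10):
    model.train()
    optimizer.zero_grad()
    output = model(ng_vector_idx)
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

# Test the model: index and pad the test documents the same way as the training set
model.eval()
test_tokens = [simple_preprocess(str(text)) for text in ng_text_test.data]
test_idx = pad_sequence([torch.tensor(doc2ind(doc, word2idx), dtype=torch.long) for doc in test_tokens], batch_first=True, padding_value=0)
with torch.no_grad():
    predictions = model(test_idx).argmax(dim=1)
accuracy = (predictions == torch.tensor(ng_text_test.target)).float().mean().item()
print(f'Test accuracy: {accuracy:.3f}')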