<a href="https://colab.research.google.com/github/lejunliu/NLP-QTM340/blob/main/QTM340_PS3_Angela_Liu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this problem set, we'll do a deep dive with language models.

Once again, you're free to execute the notebook in your personal environment, but I would strongly recommend using Google Colab. You can upload this notebook to Google Colab by following the steps below.

1. Open [colab.research.google.com](https://colab.research.google.com)
2. Click on the upload tab
3. Upload the .ipynb file by choosing the right file from your local disk


**Submission instructions**

1. When you're ready to submit, save the notebook as QTM340-PS3-Firstname-Lastname.ipynb; for example, if your name is Harry Potter, save the file as `QTM340-PS3-Harry-Potter.ipynb`. This can be done in Google Colab by editing the filename and then following File --> Download --> .ipynb

2. Upload this file on Canvas.

**Objective**: In this notebook, you'll learn the following in a classification task:

a. To use bag of words representation as predictors (1 point)

b. To use static word representations as predictors (2 points)

c. To use contextual word representations as predictors (3 points)

d. To explain the strengths and weaknesses of each model (2 points)

Our task is to classify research papers into categories. We'll use the dataset hosted on [Hugging Face](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021)

## 0. Setup

Install all the required packages.

In [None]:
%%bash

pip install datasets
pip install transformers
pip install sentencepiece



Let's get all the libraries imported first.

In [None]:
from datasets import load_dataset
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix

import torch
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Now download the dataset and clean it up.


**Note** This may take a couple of minutes the first time you run it because the data has to be downloaded.

In [None]:
def convert2label (x):
  best_cat = x[0].split()[0]
  return best_cat.split ('.')[0]

required_cats = ['math', 'cs', 'astro-ph', 'physics', 'quant-ph']
dataset = load_dataset("gfissore/arxiv-abstracts-2021", split='train')
dataset = dataset.remove_columns (column_names=['submitter',
                                                'authors',
                                                'journal-ref',
                                                'doi',
                                                'report-no',
                                                'comments',
                                                'versions'])
df_dataset = pd.DataFrame(dataset)
df_dataset["cat"] = df_dataset.categories.apply (lambda x:convert2label (x))
original_df = df_dataset.copy (deep=True)
df_dataset = original_df.query ('cat in @required_cats')

# randomly pick 1500 examples
df_dataset = df_dataset.sample (n=1500, random_state=42)

You have two variables that are of interest: `original_df` which contains all the examples in the dataset and `df_dataset` which contains examples that belong only to some fixed categories (as defined in `required_cats`)

Next, we'll create a train (80%), validate (10%) and test (10%) split for our dataset.

In [None]:
# Split df_dataset into train, validate and test dataframes
train_df, test_df = train_test_split (df_dataset,
                                      train_size=0.9,
                                      random_state=42)

train_df, val_df = train_test_split (train_df,
                                     train_size=80/90,
                                     random_state=42)
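
The 80/10/10 proportions follow from chaining the two `train_size` values. A small arithmetic check using exact fractions is sketched below; the variable names are illustrative and not part of the notebook.

```python
from fractions import Fraction

# First split: 90% stays in the training pool, 10% becomes the test set.
pool_frac = Fraction(9, 10)
test_frac = 1 - pool_frac

# Second split: 80/90 of the pool is kept for training, the rest is validation.
train_frac = pool_frac * Fraction(80, 90)
val_frac = pool_frac - train_frac

print(train_frac, val_frac, test_frac)  # 4/5 1/10 1/10

# With the 1500 sampled papers this gives the expected split sizes.
n_total = 1500
sizes = [int(frac * n_total) for frac in (train_frac, val_frac, test_frac)]
print(sizes)  # [1200, 150, 150]
```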

## 1. Bag of Words classification

We'll turn the title into bag of words features.
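
To make the representation concrete, here is a dependency-free sketch of what a bag-of-words vector is: one count per vocabulary word, with word order discarded. The two titles and the helper `bag_of_words` are made up for illustration; `CountVectorizer` additionally applies its own tokenization rules and the `min_df`, `max_df`, and `max_features` filters.

```python
from collections import Counter

# Toy titles, not taken from the dataset.
titles = [
    "quantum networks and quantum codes",
    "learning on networks",
]

# Vocabulary: every distinct lowercase token, in sorted order.
vocab = sorted({tok for title in titles for tok in title.lower().split()})

def bag_of_words(title, vocab):
    """Return the count of each vocabulary word in the title."""
    counts = Counter(title.lower().split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(t, vocab) for t in titles]
print(vocab)    # ['and', 'codes', 'learning', 'networks', 'on', 'quantum']
print(vectors)  # [[1, 1, 0, 1, 0, 2], [0, 0, 1, 1, 1, 0]]
```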

In [None]:
# Initialize a vectorizer and classifier
vectorizer = CountVectorizer (input="content",
                              lowercase=True,
                              min_df=5,
                              max_df=0.75,
                              max_features=1000)
classifier = LogisticRegression (penalty="l2",
                                 C=0.1,
                                 max_iter=1000)

# Fit the entire dataset on the vectorizer;
# effectively, this line extracts all the features
vectorizer.fit (df_dataset["title"])

# Get the labels
y_train = train_df["cat"].values
y_val = val_df["cat"].values
y_test = test_df["cat"].values

# Get the bag-of-words representation for each document
X_bow_train = vectorizer.transform (train_df["title"])
X_bow_val = vectorizer.transform (val_df["title"])
X_bow_test = vectorizer.transform (test_df["title"])

# Now, let's fit the model
classifier.fit (X_bow_train, y_train)

# Use the trained classifier to do predictions
yhat_bow_val = classifier.predict (X_bow_val)

# Get the accuracy of the classifier
print (f"Accuracy in %: {100*accuracy_score (y_val, yhat_bow_val):.2f}")

# Get the classification report
print ("Classification report")
print (classification_report (y_val, yhat_bow_val))

Accuracy in %: 60.00
Classification report
              precision    recall  f1-score   support

    astro-ph       0.61      0.54      0.57        26
          cs       0.65      0.56      0.60        39
        math       0.56      0.92      0.69        49
     physics       0.33      0.05      0.08        21
    quant-ph       0.89      0.53      0.67        15

    accuracy                           0.60       150
   macro avg       0.61      0.52      0.52       150
weighted avg       0.59      0.60      0.56       150



**Sanity check** The bag-of-words features are quite predictive of the type of paper (60% accuracy); in comparison, a majority-class classifier -- one that predicts "math" for all examples -- will perform at 33% accuracy.
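
The 33% baseline can be reproduced from the support column of the classification report above: predicting the most frequent class for every example is correct exactly as often as that class occurs.

```python
# Validation-set class counts, copied from the "support" column above.
support = {"astro-ph": 26, "cs": 39, "math": 49, "physics": 21, "quant-ph": 15}

majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(majority_class)                     # math
print(f"{100 * baseline_accuracy:.2f}%")  # 32.67%
```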

**Your turn!**

Q1. Adapt the code above to find the best regularization hyperparameter (highest accuracy) of the classifier. Report the optimal parameter and write 2-3 sentences to interpret the optimal regularization parameter [0.5 points]

You'll tune the following parameters

- C: The regularization penalty. Try all the values from the set {0.001, 0.01, 0.1, 1.0, 10.0, 100.0}

Note that you'll have to calculate the accuracy on the validation set (not on the test set). You can learn about regularization [here](https://en.wikipedia.org/wiki/Regularization_(mathematics)) and how it's controlled by looking over Scikit's API documentation of logistic regression [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
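
For reference, the role of `C` can be read off the training objective. A common way to write L2-regularized logistic regression in the binary case, with labels $y_i \in \{-1, +1\}$, is shown below. This is a simplified sketch: the five-class problem in this notebook uses a multiclass generalization, and scikit-learn's user guide may scale the terms differently.

```latex
\min_{w,\, b} \;\; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + b\right)\right)\right)
```

Because `C` multiplies the data-fit term, a small `C` lets the penalty $\frac{1}{2} w^\top w$ dominate (strong regularization, small weights), while a large `C` lets the model fit the training data more closely.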

In [None]:
# Your code in this cell.
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
best_C = None
best_accuracy = 0

for C in C_values:
  classifier = LogisticRegression(penalty="l2",
                                 C=C,
                                 max_iter=1000)
  classifier.fit (X_bow_train, y_train)
  yhat_bow_val = classifier.predict(X_bow_val)
  print (f"Accuracy is %: {100*accuracy_score (y_val, yhat_bow_val):.2f}")

  accuracy = accuracy_score (y_val, yhat_bow_val)

  if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_C = C

  print(f"Best C: {best_C}")


Accuracy is %: 32.67
Best C: 0.001
Accuracy is %: 50.67
Best C: 0.01
Accuracy is %: 60.00
Best C: 0.1
Accuracy is %: 64.67
Best C: 1.0
Accuracy is %: 60.00
Best C: 1.0
Accuracy is %: 57.33
Best C: 1.0


The optimal value is C = 1.0, with a validation accuracy of 64.67%. In scikit-learn, C is the inverse of the regularization strength: a lower C means stronger regularization, which guards against overfitting but can underfit (C = 0.001 reaches only 32.67%, roughly the majority-class baseline). A higher C weakens regularization, allowing a more flexible model that is more prone to overfitting (accuracy falls to 57.33% at C = 100). A C of 1.0 is neither too high nor too low: the model is flexible enough to capture the signal in the titles without overfitting.

Q2. For the best classifier from Q1, report the top 10 and the bottom 10 features for each class that are most and least predictive of the label, respectively. Give a brief explanation for why you see these features at the top and bottom. [0.5 points]

You can obtain the top 10 features by sorting them based on the coefficients learned by the classifier.
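
Sorting by coefficient can be sketched without the classifier. The feature names and weights below are invented for illustration; in the notebook they would come from `vectorizer.get_feature_names_out()` and one row of `classifier.coef_`.

```python
# Hypothetical feature names and one class's coefficients (toy values).
feature_names = ["stars", "quantum", "galaxies", "theorem", "the"]
coefficients = [1.2, -0.9, 0.8, -0.4, 0.0]

# Pair each feature with its weight and sort from most negative to most positive.
ranked = sorted(zip(coefficients, feature_names))

k = 2
bottom_k = [name for _, name in ranked[:k]]            # most negative weights
top_k = [name for _, name in reversed(ranked[-k:])]    # most positive, largest first

print("Top:", top_k)        # Top: ['stars', 'galaxies']
print("Bottom:", bottom_k)  # Bottom: ['quantum', 'theorem']
```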

In [None]:
# Your code in this cell
import numpy as np

classifier = LogisticRegression(penalty="l2",
                                 C=1.0,
                                 max_iter=1000)
classifier.fit (X_bow_train, y_train)

feature_names = vectorizer.get_feature_names_out()

coefficients = classifier.coef_

# Iterate over the classes the classifier actually learned; coef_ has one row per class
for i in range(len(classifier.classes_)):
    class_coefficients = coefficients[i]

    sorted_coef_index = class_coefficients.argsort()
    top_10 = sorted_coef_index[-10:]
    bottom_10 = sorted_coef_index[:10]

    top_10_features = feature_names[top_10]
    bottom_10_features = feature_names[bottom_10]

    print(f"Class {classifier.classes_[i]}:")
    print("Top 10 predictive features: ", top_10_features)
    print("Bottom 10 predictive features: ", bottom_10_features)

Class astro-ph:
Top 10 predictive features:  ['planetary' 'gravitational' 'stellar' 'ray' 'cosmic' 'galactic'
 'galaxies' 'cosmological' 'stars' 'observations']
Bottom 10 predictive features:  ['quantum' 'electron' 'networks' 'to' 'graphs' 'via' 'random' 'on'
 'identification' 'flow']
Class cs:
Top 10 predictive features:  ['codes' 'heterogeneous' 'feature' 'logic' 'data' 'social' 'recognition'
 'learning' 'power' 'networks']
Bottom 10 predictive features:  ['quantum' 'space' 'optical' 'atomic' 'gamma' 'decomposition' 'imaging'
 'theorem' 'cosmological' 'noise']
Class math:
Top 10 predictive features:  ['decomposition' 'dimensional' 'operators' 'polynomials' 'spaces'
 'algebras' 'forms' 'equations' 'conjecture' 'groups']
Bottom 10 predictive features:  ['based' 'electron' 'channels' 'phase' 'networks' 'scale' 'entanglement'
 'cosmological' 'magnetic' 'optical']
Class physics:
Top 10 predictive features:  ['turbulence' 'frequency' 'atomic' 'momentum' 'transport' 'ion' 'electron'
 'scatt

For each class, the top 10 features are those with the largest positive coefficients: their presence in a title increases the predicted probability of that class. The bottom 10 features are those with the most negative coefficients: their presence makes the class less likely. This matches intuition: for the "astro-ph" class, words like "stellar", "cosmic", and "stars" carry the largest weights, while for the "math" class, words like "magnetic" and "cosmological" count against it because they are typical of other fields.

## 2. Classification using type embeddings

We'll now learn the embeddings of each word and then use these embeddings as features in the classification model. The embeddings are learned using [doc2vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf), a variation of word2vec that also learns an embedding for every sentence or document, sensitive to its topic or label.


We'll learn the parameters of the embedding model (i.e. word embeddings) from the abstracts and then construct the document embedding for the titles.

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=100,
                                      min_count=5,
                                      epochs=15)

def read_corpus(iterable, tokens_only=False):
  for i, line in enumerate(iterable):
    tokens = gensim.utils.simple_preprocess(line)
    if tokens_only:
      yield tokens
    else:
      # For training data, add tags
      yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Create the corpus in each split
train_corpus_abstracts = list(read_corpus(train_df["abstract"].values))

train_corpus_titles = list(read_corpus(train_df["title"].values, tokens_only=True))
val_corpus_titles = list(read_corpus(val_df["title"].values, tokens_only=True))
test_corpus_titles = list(read_corpus(test_df["title"].values, tokens_only=True))

model.build_vocab(train_corpus_abstracts)
model.train(train_corpus_abstracts,
            total_examples=model.corpus_count,
            epochs=model.epochs)

Now use `model.infer_vector` to get the vector representation of any document

**Your turn!**

Q1. Get the document vectors for every document in the train set to form the training matrix. Similarly construct the validation matrix and test matrix from documents in the validation and test corpus, respectively. [0.5 points]

Following is an example of how to use `model.infer_vector` function, which will return a single vector for the entire sequence.

In [None]:
vector = model.infer_vector(["physics", "is", "awesome"])
print (vector)

[ 0.00600176  0.04434367 -0.01625837  0.03682726  0.01159444 -0.06348134
  0.01770834  0.11706414 -0.08930016 -0.01128201 -0.01944441 -0.04588564
  0.00727504  0.0338801   0.01780108 -0.02409948  0.01820844 -0.02419796
  0.00065879  0.00725534  0.03539961 -0.02332593  0.04525088 -0.01188623
  0.00612918 -0.01086706 -0.02498062  0.03530251  0.0096279   0.00319705
  0.07401841 -0.02162201  0.02130155 -0.03014926  0.0328822   0.02881766
  0.01692638  0.03638084 -0.03389216 -0.02708544 -0.01894981 -0.02554624
 -0.06267942 -0.01599437 -0.01117519 -0.00456544  0.02475011  0.03130277
  0.02778811  0.02143952 -0.00048141 -0.02061078  0.02850049  0.00278964
 -0.03652105  0.03064709  0.0083608  -0.02282735 -0.04796872 -0.01129916
  0.01175171  0.01642023 -0.00480552  0.00549194 -0.05359496  0.03374572
  0.03181355  0.05682414 -0.04113982  0.05748565 -0.02224613  0.02760835
  0.03624192  0.02815029  0.05585455  0.02146997 -0.03445485  0.02921626
 -0.04448319 -0.05487447 -0.04940805  0.02785598 -0

In [None]:
import numpy as np
def corpus2staticmat (corpus:list, training=False) -> np.array:
  """ The function will take a corpus i.e. a collection of documents
      and get the embedding for each document.

  :params:
  corpus (list): The corpus is in the form of a list. Every item
                 in the list is a document. If the training flag is set,
                 then a document contains two properties: words and tags;
                 if the flag is not set, then the document is simply
                 a list of words.

  training (bool): A boolean flag that indicates whether the data
                   is training or non-training data

  :returns:
  embeddings (np.array): The embeddings for each document are
                         rows in a matrix.
  """

  embeddings = []
  # Write your code below
  for doc in corpus:
    if training:
      doc_words = doc.words
    else:
      doc_words = doc

    vector = model.infer_vector(doc_words)
    embeddings.append(vector)

  return np.array(embeddings)

In [None]:
X_static_train = corpus2staticmat (train_corpus_titles, training=False)
X_static_val = corpus2staticmat (val_corpus_titles, training=False)
X_static_test = corpus2staticmat (test_corpus_titles, training=False)

**Your turn!**

Q2. Find the best classifier using the embeddings features. Once again, you'll find the best hyperparameter (based on accuracy) for vector size. [0.5 points]

The vector size is a parameter for the following function `gensim.models.doc2vec.Doc2Vec`.

- vector_size: Try values from the following set {25, 50, 100, 200}

In [None]:
# Now, let's fit the model
classifier.fit (X_static_train, y_train)

# Use the trained classifier to do predictions
y_static_val = classifier.predict (X_static_val)

# Get the accuracy of the classifier
print (f"Accuracy in %: {100*accuracy_score (y_val, y_static_val):.2f}")

# Get the classification report
print ("Classification report")
print (classification_report (y_val, y_static_val))

Accuracy in %: 62.67
Classification report
              precision    recall  f1-score   support

    astro-ph       0.65      0.77      0.70        26
          cs       0.62      0.67      0.64        39
        math       0.62      0.92      0.74        49
     physics       0.00      0.00      0.00        21
    quant-ph       0.75      0.20      0.32        15

    accuracy                           0.63       150
   macro avg       0.53      0.51      0.48       150
weighted avg       0.55      0.63      0.56       150





**Sanity check**: I get 58% accuracy on the validation set using 100-dimensional features, which isn't bad considering I only trained the doc2vec model for 15 epochs. There is also scope for improvement, especially in categories that are rare.

In [None]:
# Your code below for tuning the vector size parameter

vector_sizes = [25,50, 100, 200]
best_accuracy = 0
best_vector_size = None
best_classifier = None

for vector_size in vector_sizes:
  model = gensim.models.doc2vec.Doc2Vec(vector_size=vector_size,
                                        min_count=5,
                                        epochs=15)
  model.build_vocab(train_corpus_abstracts)
  model.train(train_corpus_abstracts,
            total_examples=model.corpus_count,
            epochs=model.epochs)

  X_static_train = corpus2staticmat (train_corpus_titles, training=False)
  X_static_val = corpus2staticmat (val_corpus_titles, training=False)
  X_static_test = corpus2staticmat (test_corpus_titles, training=False)

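  # Fit a fresh classifier for each vector size so that best_classifier
  # is not overwritten by later iterations (C=1.0 was the best value in Section 1)
  classifier = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
  classifier.fit(X_static_train, y_train)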

  y_static_val = classifier.predict(X_static_val)
  accuracy = accuracy_score(y_val, y_static_val)
  print(f"Accuracy for vector size {vector_size}: {100*accuracy:.2f}%")

  if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_vector_size = vector_size
    best_classifier = classifier

print(f"Best vector size: {best_vector_size} with accuracy: {100*best_accuracy:.2f}%")

Accuracy for vector size 25: 62.67%
Accuracy for vector size 50: 62.67%
Accuracy for vector size 100: 61.33%
Accuracy for vector size 200: 58.67%
Best vector size: 25 with accuracy: 62.67%


Q3. Compare the best classifier using just the bag-of-words feature and the classifier using doc2vec features. Which one is better in terms of accuracy? Briefly explain why? [1 point]

Your answer here:

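The best bag-of-words classifier has a higher validation accuracy (64.67%) than the best doc2vec classifier (62.67%).
One possible explanation is that, in our dataset, the presence and frequency of specific words are good indicators of the class, and bag-of-words captures them directly. The doc2vec model, by contrast, was trained only on the abstracts in the training split for 15 epochs, which may be too little data to learn reliable embeddings.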



## 3. Using contextual embeddings

Now we'll use the embeddings from a variation of BERT as features to the classifier.

The variation we'll use is called [SciBERT](https://arxiv.org/abs/1903.10676), which is the BERT model trained on scientific data such as research papers.

In [None]:
# Import only the classes we need instead of using a wildcard import
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased', output_hidden_states=True)

model.eval()

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--allenai--scibert_scivocab_uncased/snapshots/24f92d32b1bfb0bcaf9ab193ff3ad01e87732fc1/config.json
Model config BertConfig {
  "_name_or_path": "allenai/scibert_scivocab_uncased",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31090
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--allenai--scibert_scivocab_uncased/snapshots/24f92d32b

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31090, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

Once the SciBERT model is loaded, we can get the contextual embeddings for any sentence in a number of ways. One way is to take the embedding of the [CLS] token from the last layer, available through the `last_hidden_state` attribute of the model output.

Note: To access the embeddings at any other layer, use the `hidden_states` attribute, which contains the token embeddings at every layer, ordered from the bottom layer to the topmost layer (the first entry is the output of the embedding layer).

Here's how to get the embeddings for the CLS token in any sequence.

In [None]:
with torch.no_grad():
  text = "Our paper measures the effect of eating ice-cream on happiness"
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)

  # The [CLS] token is added at the start of the sentence,
  # which you can access by the token position 0
  # (the first zero is because we have only one sentence)
  print (output.last_hidden_state[0,0,:])
  print (output.last_hidden_state[0,0,:].size())

tensor([-1.2158e+00,  2.5564e-01, -5.5524e-01,  5.1215e-01, -9.7400e-02,
        -8.9400e-02,  6.6294e-01, -5.9423e-01, -6.3797e-01,  1.8451e-01,
         3.7581e-01,  7.6952e-01, -6.3338e-01, -1.3066e-01, -8.0694e-01,
        -4.9313e-02, -1.9207e+00,  4.9524e-01,  2.3647e-01, -4.6193e-01,
         3.0016e-01, -8.6158e-01, -4.0208e-01, -2.6200e-01,  4.7555e-01,
         6.2143e-01,  3.9351e-01, -3.8379e-01, -1.7755e-01,  3.0790e-01,
         8.2291e-01, -1.3329e+00,  1.9409e-01, -8.8806e-01,  5.0629e-01,
        -2.4628e-01, -2.9856e-02, -5.7507e-02, -7.9380e-02,  2.8860e-01,
         9.3349e-02,  4.4720e-01,  3.9603e-01, -3.6464e-01, -2.7641e-03,
        -7.4956e-01,  4.1467e-01,  7.9176e-01,  6.6163e-01, -1.1585e-01,
         6.1243e-01,  6.4887e-01,  1.1093e+00,  4.6878e-01,  6.0221e-01,
        -5.6090e-01, -5.2760e-01,  1.3664e-01,  4.9883e-01, -2.5344e-02,
        -2.1232e-01, -2.5739e-01, -4.7110e-01,  1.3782e-01, -4.6313e-02,
        -1.7934e-01,  5.0262e-01,  1.3339e-01, -2.9

The above code should print the embedding output and the size of the embedding.

**Your turn!**

Q1. Adapt the code above to get the contextual embeddings for all the examples in train, validate and test sets [1 point]

You have to be careful with BERT-like models because they cannot handle inputs longer than 512 wordpieces after tokenization, so set the following parameters when calling the tokenizer on the sequence:

- max_length to 512
- truncation to True
- padding to True
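
What these three settings do can be illustrated on plain lists of token ids. The helper `pad_and_truncate` below is hypothetical and only mimics the effect; the real tokenizer also handles special tokens when truncating and returns an attention mask. A tiny `max_length` of 4 is used so the effect is visible.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad_id] * (longest - len(seq)) for seq in truncated]

# Toy token ids; 0 is assumed to be the padding id here.
batch = [[101, 7, 8, 9, 10, 102], [101, 5, 102]]
padded = pad_and_truncate(batch, max_length=4)
print(padded)  # [[101, 7, 8, 9], [101, 5, 102, 0]]
```

Every row has the same length after this step, which is what allows a batch of titles to be stacked into a single tensor.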

In [None]:
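from tqdm import tqdm
import numpy as np

def corpus2contextualmat(corpus, batch_size=32):
    """Return the last-layer [CLS] embedding for every text in the corpus."""
    embeddings = []

    # Process the corpus in batches to keep memory usage bounded
    for i in tqdm(range(0, len(corpus), batch_size)):
        batch = list(corpus[i:i+batch_size])
        encoded_input = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors='pt')
        with torch.no_grad():
            output = model(**encoded_input)

        # Position 0 of every sequence holds the [CLS] token
        batch_emb = output.last_hidden_state[:, 0, :]
        embeddings.append(batch_emb)

    return torch.cat(embeddings, dim=0)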


Now let's call the method that gives us the contextual embeddings as follows.


Note: This could take some time because transformer models usually run fast on GPUs, but we'll end up running everything on the CPU offered by the Colab server.

It takes roughly 7-8 mins to run the cell below.

In [None]:
X_contextual_train = corpus2contextualmat (train_df["title"].values)
X_contextual_val = corpus2contextualmat (val_df["title"].values)
X_contextual_test = corpus2contextualmat (test_df["title"].values)

  0%|          | 0/38 [00:00<?, ?it/s]


TypeError: ignored

**Your turn!**

Q2. Report the accuracy by using the contextual embeddings of the titles. [0.5 points]

In [None]:
# Your code below
classifier = LogisticRegression(max_iter=1000)
classifier.fit (X_contextual_train, y_train)

y_predict = classifier.predict(X_contextual_val)
accuracy = accuracy_score(y_val, y_predict)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")

Validation Accuracy: 80.00%


Q3. Instead of taking the contextual embeddings from the final layer, get the embeddings from the last 4 layers, average them and use them as features in the classifier. [1 point]

As mentioned, you can access the embeddings from individual layers using the `hidden_states` property of the output.
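
The averaging step itself is simple. Below is a dependency-free sketch on made-up [CLS] vectors. It assumes 13 entries, since for a 12-layer model `hidden_states` holds the embedding output plus one entry per layer; the 2-dimensional vectors are toy values.

```python
# One toy [CLS] vector per entry of hidden_states: index 0 is the embedding
# output, indices 1..12 are the 12 encoder layers.
cls_by_layer = [[float(k), 2.0 * k] for k in range(13)]

last_layers = 4
selected = cls_by_layer[-last_layers:]  # layers 9, 10, 11, 12

# Element-wise mean across the selected layers.
averaged = [sum(values) / len(values) for values in zip(*selected)]
print(averaged)  # [10.5, 21.0]
```

With real model output the same operation is applied to whole tensors at once, for example by stacking the selected layers and taking the mean over the layer dimension.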

In [None]:
def corpus2contextualmat_averagedlayers (corpus, last_layers=4):
  """ Take the contextual embedding of any word as the average of the
      embeddings of the word from the last 4 layers.
  """
  embeddings = []
  # Your code below
  for i in tqdm(range(0, len(corpus), 32)):
    batch = list(corpus[i:i+32])
    encoded_input = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
      output = model(**encoded_input)
    # Stack the last layers and average over the layer dimension
    hidden_states = output.hidden_states[-last_layers:]
    avg_hidden_states = torch.mean(torch.stack(hidden_states), dim=0)
    # Keep the averaged [CLS] embedding (position 0) of every sequence
    batch_emb = avg_hidden_states[:, 0, :]
    embeddings.append(batch_emb)

  return torch.cat(embeddings,dim=0)

In [None]:
X_contextualaverage_train = corpus2contextualmat_averagedlayers (train_df["title"].values, last_layers=4)
X_contextualaverage_val = corpus2contextualmat_averagedlayers (val_df["title"].values, last_layers=4)
X_contextualaverage_test = corpus2contextualmat_averagedlayers (test_df["title"].values, last_layers=4)

100%|██████████| 38/38 [02:44<00:00,  4.32s/it]
100%|██████████| 5/5 [00:21<00:00,  4.31s/it]
100%|██████████| 5/5 [00:21<00:00,  4.37s/it]


Q4. Report the accuracy of the model with the features constructed above [0.5 points]

In [None]:
# Your code below
classifier = LogisticRegression(max_iter=1000)
# Train on the averaged-layer features, matching the features used for prediction
classifier.fit (X_contextualaverage_train, y_train)

y_predict = classifier.predict(X_contextualaverage_val)
accuracy = accuracy_score(y_val, y_predict)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")


Validation Accuracy: 80.67%


## 4. Testing on unseen data

Now you have three competing classifiers:

(a) The most optimized classifier that uses bag-of-words features to predict the type of paper

(b) The most optimized classifier that uses static word embeddings to predict the type of paper

(c) The most optimized classifier that uses contextual word embeddings to predict the type of paper

**Your turn**

Q1. List 5 examples from the validation set that were misclassified by each of the classifiers. Explain in brief why the classifiers got the examples correct or incorrect. [0.5 points]

In answering the above question, you may want to think about the strengths and weaknesses of each of the classifiers.

Q2. Among the 3 competing classifiers, pick the one that has the highest accuracy. Use the classifier's output on the validation set to identify the true label that is misclassified the most. Report what it is misclassified as and explain in 2-3 sentences why this might be the case [1 point]

Q3. Report the accuracy and F1 score of all the competing classifiers. [0.5 points]
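
For Q3, `accuracy_score` and `f1_score` are already imported at the top of the notebook. To show what a macro-averaged F1 measures, here is a dependency-free sketch on made-up labels. The choice of macro averaging is an assumption; the question does not say which average to report.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, not the notebook's predictions.
y_true = ["math", "math", "cs", "cs"]
y_pred = ["math", "cs", "cs", "cs"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.2f}")                    # Accuracy: 0.75
print(f"Macro F1: {macro_f1(y_true, y_pred):.2f}")    # Macro F1: 0.73
```

Macro F1 weights every class equally, so a classifier that ignores a rare class (such as "physics" in the reports above) is penalized more than accuracy alone would suggest.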




Q1:
Strengths and weaknesses of each of the classifiers:

#### bag-of-words
1. Strength: Simple and effective. It captures the frequency of words, which makes it a strong classifier when keywords are good indicators of the class.
2. Weakness: Ignores the context and order of words, which can lose information and misrepresent the meaning of a title.
3. Misclassification reasons: The title contains words that are representative of another class. For example, "electron" appears frequently in physics titles, so an astro-ph title containing it was predicted as physics.

#### static word embeddings
1. Strength: Captures the semantic relationships between words.
2. Weakness: Ignores the context; each word has a single embedding regardless of how it is used.
3. Misclassification reasons: The words in the title are also very prevalent in other categories.

#### contextual word embeddings
1. Strength: Captures dynamic word representations based on context and is able to handle polysemy.
2. Weakness: Longer run time and more complex computations.
3. Misclassification reasons: This classifier still makes some surprising mistakes, such as classifying "The Transfer of Knowledge from Physics and Mathematics to Engineering Applications" as cs when it is in fact physics. One possible drawback is that it over-interprets context: here the overall framing of the title may outweigh the explicit keyword "Physics".

In [None]:
bow = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
bow.fit(X_bow_train, y_train)

yhat_bow_val = bow.predict(X_bow_val)
accuracy = accuracy_score(y_val, yhat_bow_val)
print(f"Accuracy = {100*accuracy:.2f}%")

misclassified_count = 0
for index, (predicted, actual) in enumerate(zip(yhat_bow_val, y_val)):
    if predicted != actual:
        print(f"Title: {val_df['title'].iloc[index]}\n")
        print(f"Actual Label: {actual}\n")
        print(f"Predicted Label: {predicted}\n")
        misclassified_count += 1
        if misclassified_count >= 5:
            break

Accuracy = 64.67%
Title: On the maximum number of cliques in a graph

Actual Label: math

Predicted Label: cs

Title: Upper Bounds of Interference Alignment Degree of Freedom

Actual Label: cs

Predicted Label: math

Title: Polarization of Sunyaev-Zeldovich signal due to electron pressure
  anisotropy in galaxy clusters

Actual Label: astro-ph

Predicted Label: physics

Title: Minimal Degrees of Algebraic Numbers with respect to Primitive Elements

Actual Label: math

Predicted Label: cs

Title: Beyond Fowler-Nordheim model: Harmonic generation from metallic
  nano-structures

Actual Label: physics

Predicted Label: cs



In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=5, epochs=15, seed=42)
model.build_vocab(train_corpus_abstracts)
model.train(train_corpus_abstracts, total_examples=model.corpus_count, epochs=model.epochs)

X_static_train = corpus2staticmat(train_corpus_titles, training=False)
X_static_val = corpus2staticmat(val_corpus_titles, training=False)
X_static_test = corpus2staticmat(test_corpus_titles, training=False)

swb = LogisticRegression(penalty="l2", max_iter=1000)
swb.fit(X_static_train, y_train)

yhat_static_val = swb.predict(X_static_val)
accuracy = accuracy_score(y_val, yhat_static_val)
print(f"Accuracy = {100*accuracy:.2f}%")

misclassified_count = 0
for index, (predicted, actual) in enumerate(zip(yhat_static_val, y_val)):
    if predicted != actual:
        print(f"Title: {val_df['title'].iloc[index]}\n")
        print(f"Actual Label: {actual}\n")
        print(f"Predicted Label: {predicted}\n")

        misclassified_count += 1
        if misclassified_count >= 5:
            break

Accuracy = 62.00%
Title: Upper Bounds of Interference Alignment Degree of Freedom

Actual Label: cs

Predicted Label: math

Title: Linearized Reed-Solomon codes and linearized Wenger graphs

Actual Label: cs

Predicted Label: math

Title: Pencil-Beam Single-point-fed Dirac Leaky-Wave Antenna on a
  Transmission-Line Grid

Actual Label: physics

Predicted Label: astro-ph

Title: Trapped surface formation for spherically symmetric
  Einstein-Maxwell-charged scalar field system with double null foliation

Actual Label: math

Predicted Label: astro-ph

Title: The Transfer of Knowledge from Physics and Mathematics to Engineering
  Applications

Actual Label: physics

Predicted Label: cs



In [None]:
cwb = LogisticRegression(penalty="l2", max_iter=1000, random_state=42)
cwb.fit(X_contextualaverage_train, y_train)

yhat_contextualaverage_val = cwb.predict(X_contextualaverage_val)
accuracy = accuracy_score(y_val, yhat_contextualaverage_val)
print(f"Accuracy = {100*accuracy:.2f}%")

misclassified_count = 0
for index, (predicted, actual) in enumerate(zip(yhat_contextualaverage_val, y_val)):
    if predicted != actual:
        print(f"Title: {val_df['title'].iloc[index]}\n")
        print(f"Actual Label: {actual}\n")
        print(f"Predicted Label: {predicted}\n")
        misclassified_count += 1
        if misclassified_count >= 5:
            break

Accuracy = 82.00%
Title: Endangered Languages are not Low-Resourced!

Actual Label: cs

Predicted Label: quant-ph

Title: The Transfer of Knowledge from Physics and Mathematics to Engineering
  Applications

Actual Label: physics

Predicted Label: cs

Title: CSPEC: The cold chopper spectrometer of the European Spallation Source,
  a detailed overview prior to commissioning

Actual Label: physics

Predicted Label: astro-ph

Title: Unsourced Random Massive Access with Beam-Space Tree Decoding

Actual Label: cs

Predicted Label: quant-ph

Title: Variable Modified Chaplygin Gas in Anisotropic Universe with
  Kaluza-Klein Metric

Actual Label: physics

Predicted Label: math



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
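
The warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of the scaling option follows, assuming the features are dense numeric arrays. The data here is synthetic (`X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's embeddings), and whether scaling helps on the real features would need to be checked on the validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in for an embedding matrix: 200 samples, 32 features
# whose scales differ by orders of magnitude (this slows lbfgs down).
X_demo = rng.normal(size=(200, 32)) * rng.uniform(0.1, 50.0, size=32)
signal = X_demo[:, 0] / X_demo[:, 0].std()
y_demo = (signal + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Standardize each feature to zero mean and unit variance, then fit the
# same kind of L2-regularized logistic regression used in the notebook.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", max_iter=1000, random_state=42),
)
clf.fit(X_demo, y_demo)

# n_iter_ holds the number of solver iterations actually used.
n_iter_used = int(clf.named_steps["logisticregression"].n_iter_[0])
print(f"Iterations used: {n_iter_used}, train accuracy: {clf.score(X_demo, y_demo):.2f}")
```

Wrapping the scaler and classifier in one pipeline means the scaling statistics are learned only from the data passed to `fit`, so the validation and test splits are transformed with the training statistics.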


Q2:

#### Highest accuracy: the contextual word embeddings, with a validation accuracy of 82.00%.

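#### Report the true label that is misclassified the most:

Among the first five misclassified validation titles, listed below, the true label is physics in three cases and cs in two.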

Title: Endangered Languages are not Low-Resourced!

Actual Label: cs

Predicted Label: quant-ph

Title: The Transfer of Knowledge from Physics and Mathematics to Engineering Applications

Actual Label: physics

Predicted Label: cs

Title: CSPEC: The cold chopper spectrometer of the European Spallation Source, a detailed overview prior to commissioning

Actual Label: physics

Predicted Label: astro-ph

Title: Unsourced Random Massive Access with Beam-Space Tree Decoding

Actual Label: cs

Predicted Label: quant-ph

Title: Variable Modified Chaplygin Gas in Anisotropic Universe with Kaluza-Klein Metric

Actual Label: physics

Predicted Label: math

#### Possible explanations:

1. In "The Transfer of Knowledge from Physics and Mathematics to Engineering Applications", terms like "transfer of knowledge" and "applications" are often used in computer science contexts, which likely caused the cs prediction.
2. In "Unsourced Random Massive Access with Beam-Space Tree Decoding", the terms "beam-space" and "decoding" may be more frequently associated with quantum physics in the training data, leading to the quant-ph prediction.
3. "Variable Modified Chaplygin Gas in Anisotropic Universe with Kaluza-Klein Metric" contains mathematical terms such as "Kaluza-Klein Metric", which may have pushed the prediction toward math. The title also contains the keyword "gas", which points to physics, but the classifier did not pick up on it.
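
The question of which true label is misclassified most often can be answered by counting errors per true label rather than reading off individual examples. The sketch below assumes labels are plain strings. The helper name `most_misclassified_label` is hypothetical, and the demo input is only the five misclassified titles listed above, not the full validation set.

```python
from collections import Counter

def most_misclassified_label(y_true, y_pred):
    """Return (label, error_count, error_rate) for the true label with the
    most misclassified examples, or None if every prediction is correct."""
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    if not errors:
        return None
    totals = Counter(y_true)
    label, n_err = errors.most_common(1)[0]
    return label, n_err, n_err / totals[label]

# Demo input: only the five misclassified examples shown above, so the
# error rate within this toy sample is trivially 100%.
y_true_demo = ["cs", "physics", "physics", "cs", "physics"]
y_pred_demo = ["quant-ph", "cs", "astro-ph", "quant-ph", "math"]

result = most_misclassified_label(y_true_demo, y_pred_demo)
print(result)
```

In the notebook, the same helper could be called as `most_misclassified_label(y_val, yhat_contextualaverage_val)` to cover every validation example instead of only the first five errors.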

In [None]:
yhat_bow_test = bow.predict(X_bow_test)
F1 = f1_score(y_test, yhat_bow_test, average = 'weighted')
accuracy = accuracy_score(y_test, yhat_bow_test)
print(f"Accuracy = {100*accuracy:.2f} , F1-Score: {F1:.2f}")

Accuracy = 74.67 , F1-Score: 0.74


In [None]:
yhat_static_test = swb.predict(X_static_test)
F1 = f1_score(y_test, yhat_static_test, average = 'weighted')
accuracy = accuracy_score(y_test, yhat_static_test)
print(f"Accuracy = {100*accuracy:.2f} , F1-Score: {F1:.2f}")

Accuracy = 68.67 , F1-Score: 0.66


In [None]:
yhat_contextualaverage_test = cwb.predict(X_contextualaverage_test)
F1 = f1_score(y_test, yhat_contextualaverage_test, average = 'weighted')
accuracy = accuracy_score(y_test, yhat_contextualaverage_test)
print(f"Accuracy = {100*accuracy:.2f} , F1-Score: {F1:.2f}")

Accuracy = 82.67 , F1-Score: 0.83
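
The test cells above report a weighted F1-score next to accuracy. As a rough illustration of what `average='weighted'` computes, here is a minimal pure-Python sketch: per-class F1, averaged with each class weighted by its number of true examples. The function name `weighted_f1` and the toy labels are made up for this example.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight each class by its support
    return total / len(y_true)

# Toy labels (not from the dataset): 4 of 6 predictions are correct.
y_true_toy = ["cs", "cs", "math", "math", "physics", "physics"]
y_pred_toy = ["cs", "math", "math", "math", "physics", "cs"]

acc_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
f1_toy = weighted_f1(y_true_toy, y_pred_toy)
print(f"Accuracy = {acc_toy:.4f}, weighted F1 = {f1_toy:.4f}")
```

Because each class's F1 combines precision and recall, the weighted F1 can sit slightly below or above accuracy, as it does in the results above.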
