<a href="https://colab.research.google.com/github/leukschrauber/Assignments/blob/main/assignment_5_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment
*by Fabian Leuk (csba6437/12215478)*

The following assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts the subject area from which a given abstract originates. The plan is to discuss your learnings from the theory part next week, so you are relatively free in how you fill your Learning Portfolio on this new topic. In two weeks, we will discuss your solutions for the classification model.


1) Preprocessing: The data, which I provide as a zip file in OLAT, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

2) We need a training dataset and a test dataset. My suggestion would be that for each research field we use the first 5700 lines for the training dataset and the last 300 lines for the test dataset. Please stick to this because then we can compare our models better!

3) Please use a pre-trained model from Hugging Face to build a classification model that tries to predict the correct research field out of the 26 available. Please calculate the accuracy and the overall accuracy for all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also check whether the model improves if you include keywords and titles.

Some links, which can help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!

Addition: Accuracy measures whether the research field with the highest probability value matches the target. With 26 research fields, it would also be interesting to know if the correct target is at least among the three highest probability values.

$\begin{pmatrix} A \\ B \\ C \\ D \\ E \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.95 \\ 0.5 \\ 0.2 \\ 0.3 \end{pmatrix} \rightarrow \text{Choice}_1 = B, \quad \text{Choice}_3 = \{B, C, E\}$
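
The two choices in the example above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `top_k_choices` is made up for illustration, and the scores are the five values from the formula.

```python
def top_k_choices(scores, k):
    """Return the k labels with the highest score, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Scores from the example above
scores = {"A": 0.1, "B": 0.95, "C": 0.5, "D": 0.2, "E": 0.3}

choice_1 = top_k_choices(scores, 1)
choice_3 = top_k_choices(scores, 3)
print(choice_1)  # ['B']
print(choice_3)  # ['B', 'C', 'E']
```

A prediction counts as a top-3 hit if the true label appears anywhere in `choice_3`, regardless of its position.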

## Data Preprocessing

To prepare the data, I first resolved some issues in the CSV files. Specifically, line 1061 of the file "MATH_1991-2000.csv" contained a quotation mark that the pandas CSV parser could not handle, so I removed it.

In addition, the file "HEAL_2001-2010.csv" contained only 594 records, as opposed to 2000 in every other file. Thus, I extracted the first 95 % of each file into the training data set and the last 5 % of each file into the test data set.

As requested, the data was reduced to the columns "Research Field", "Abstract", "Title" and "Keywords", where the keywords combine the columns "Author Keywords" and "Index Keywords" of the original data set. Several abstracts were missing (marked as "[No abstract available]"). These were replaced with a combination of title and keywords.

A validation data set was extracted from the training data set, using 15 % of its records. The split is stratified by the Research Field column, so each research field keeps roughly the same share in both parts.
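
The split arithmetic described above can be checked with a short sketch. The number of files is an assumption: the text only says that every file except the HEAL one has 2000 records, and 77 such files is the count for which the totals match the sizes logged by the preprocessing cell. The rounding of the validation share is likewise an assumption about how scikit-learn sizes the held-out part.

```python
import math

TRAIN_FRACTION = 0.95       # share of each file used for training (from the text)
VALIDATION_FRACTION = 0.15  # share of the training data held out for validation

def split_sizes(n_records):
    """Return (train, test) sizes for one file, cut at int(n * 0.95)."""
    train_end_idx = int(n_records * TRAIN_FRACTION)
    return train_end_idx, n_records - train_end_idx

# ASSUMPTION: 77 files with 2000 records plus the short HEAL file with 594.
file_sizes = [2000] * 77 + [594]

train_total = sum(split_sizes(n)[0] for n in file_sizes)
test_total = sum(split_sizes(n)[1] for n in file_sizes)

# ASSUMPTION: the validation size is rounded up.
validation_total = math.ceil(train_total * VALIDATION_FRACTION)
train_total -= validation_total

print(split_sizes(2000))  # (1900, 100)
print(split_sizes(594))   # (564, 30)
print((train_total, validation_total, test_total))  # (124834, 22030, 7730)
```

Under these assumptions, a full-size file contributes 100 test records and the short HEAL file only 30.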

In [34]:
import pandas as pd
import numpy as np
from google.colab import drive
import os
from google.colab import data_table
from sklearn.model_selection import train_test_split

data_table.enable_dataframe_formatter()

drive.mount('/content/drive')

directory = '/content/drive/My Drive/SE_Digital_Organizations/data/'
train_data = pd.DataFrame()
test_data = pd.DataFrame()

for filename in os.listdir(directory):
    if filename.endswith('.csv'): 
        file_path = os.path.join(directory, filename)
        data = pd.read_csv(file_path)

        # Extract research field from filename
        research_field = filename.split('_')[0]   
        data['Research Field'] = research_field

        # Concatenate keywords
        data['Keywords'] = data['Author Keywords'].fillna('') + ' ' + data['Index Keywords'].fillna('')

        # Reduce columns as requested
        columns_to_keep = ['Keywords', 'Research Field', 'Abstract', 'Title']
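        # .copy() makes this an independent DataFrame, so the .loc assignment
        # below does not trigger pandas' SettingWithCopyWarning
        data = data[columns_to_keep].copy()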

        # Replace empty abstracts with title and keywords
        data.loc[data['Abstract'] == '[No abstract available]', 'Abstract'] = data['Title'] + ' ' + data['Keywords']

        # Split into Training and Test
        train_end_idx = int(len(data) * 0.95)
        train_data = pd.concat([train_data, data[:train_end_idx]])
        test_data = pd.concat([test_data, data[train_end_idx:]])

# Split into training and validation using stratified sampling
train_data, validation_data = train_test_split(train_data, test_size=0.15, stratify=train_data['Research Field'], random_state=42)

print("Length of training, validation and test set")
print((len(train_data), len(validation_data), len(test_data)))

validation_data.head(10)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[data['Abstract'] == '[No abstract available]', 'Abstract'] = data['Title'] + ' ' + data['Keywords']

Length of training, validation and test set
(124834, 22030, 7730)


| Index | Keywords | Research Field | Abstract | Title |
|---|---|---|---|---|
| 194 | 2 hydroxyglutaric acid; amino acid; asparagin... | PHAR | Metabolomics is an emerging 'omics' science in... | Emerging applications of metabolomics in drug ... |
| 1677 | Cell wall; Membrane phospholipid; Nisin-resist... | IMMU | Nisin is the most prominent lantibiotic and is... | Mechanisms of nisin resistance in Gram-positiv... |
| 938 | Economic optimization; Long-term electricity s... | ENER | In future energy systems with high shares of f... | Optimal use of Power-to-Gas energy storage sys... |
| 687 | Copper catalysts; Dimethyl oxalate hydrogenati... | CENG | The catalytic performances of co-precipitated ... | Highly selective synthesis of ethylene glycol ... |
| 755 |  | ARTS | Consumption studies have arguably transformed ... | Consumption and consumerism in early modern En... |
| 1539 | error estimates; Steklov eigenvalue problem; V... | MATH | The aim of this paper is to develop a virtual ... | A virtual element method for the Steklov eigen... |
| 1506 | Epworth Sleepiness Scale; Obstructive sleep ap... | NURS | This study compared the predictive abilities o... | Predictive abilities of the STOP-Bang and Epwo... |
| 1832 | Hybrid electric vehicles; Induction motor; Per... | ENER | this paper describes an investigation into dif... | Comparison of different motor design drives fo... |
| 1112 | alanine; anisomycin; insulin; insulin recepto... | BIOC | Tumor necrosis factor α (TNFα) inhibits insuli... | The c-Jun NH2-terminal kinase promotes insulin... |
| 1557 | AMPA receptor; glutamate receptor; glutamate ... | NEUR | Using a thrombin cleavage assay in cultured hi... | Subunit-specific temporal and spatial patterns... |


## Training


When installing the necessary transformers libraries, I encountered a bug in the most recent version of transformers. Thus, I pinned a specific version (4.28.0). For more information, see https://github.com/huggingface/transformers/issues/22816

For training, the research fields had to be encoded as integers using Python dictionaries. Also, transformers by convention expects the input data as a `Dataset` with the column headers "text" for inputs and "labels" for labels. Abstracts were padded and truncated to conform to the model's maximum input length.
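
The label encoding can also be derived from the data instead of being written out by hand. The following is a minimal sketch of that idea, not the mapping used in this notebook: `build_label_maps` and the `example_` names are hypothetical, and the resulting ids differ from the hand-written dictionary in the training cell.

```python
def build_label_maps(labels):
    """Build label-to-id and id-to-label dictionaries from label strings."""
    unique_labels = sorted(set(labels))  # sorted for a reproducible id assignment
    label_to_id = {label: idx for idx, label in enumerate(unique_labels)}
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label

# A few research-field codes from the data set, in arbitrary order
example_labels = ["MATH", "ARTS", "PHAR", "MATH", "ENER", "ARTS"]
example_label2id, example_id2label = build_label_maps(example_labels)

encoded = [example_label2id[label] for label in example_labels]
print(example_label2id)  # {'ARTS': 0, 'ENER': 1, 'MATH': 2, 'PHAR': 3}
print(encoded)           # [2, 0, 3, 2, 1, 0]
```

Whichever mapping is used, the same dictionaries must be applied at training and at evaluation time, otherwise predicted ids map back to the wrong research field.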

I fine-tuned the bert-base-uncased model (https://huggingface.co/bert-base-uncased), using the abstracts of the articles as inputs. The pre-training head was replaced with a classification head for the 26 research fields, and the model was trained for three epochs on the training data set. Hyperparameters were left at the defaults suggested by Hugging Face (https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/trainer#transformers.TrainingArguments). The training cell below sets the per-device batch size to 8 for both training and evaluation.

Because the training time on the data set was estimated to be above ten hours, I used checkpoints. A checkpoint is saved to my Drive every 500 training steps (batches). Because `resume_from_checkpoint` is set to `True`, the Trainer resumes from the latest checkpoint in the output directory whenever training is restarted.
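
One caveat: as far as I know, `resume_from_checkpoint=True` raises an error if the output directory does not yet contain a checkpoint, which is the situation on the very first run. The sketch below shows one way to look up the latest checkpoint folder first. `find_last_checkpoint` is a hypothetical helper written for this example, and it assumes the Trainer's `checkpoint-<step>` folder naming. The transformers library ships a similar utility, `get_last_checkpoint` in `transformers.trainer_utils`.

```python
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint folder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")

# Demo on a temporary folder instead of the Drive path
with tempfile.TemporaryDirectory() as demo_dir:
    first_run = find_last_checkpoint(demo_dir)  # nothing saved yet
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    last_name = os.path.basename(find_last_checkpoint(demo_dir))

print(first_run)  # None
print(last_name)  # checkpoint-1500
```

The returned path (or `None`) could then be passed as `trainer.train(resume_from_checkpoint=...)`, so that the first run starts from scratch and later runs continue from the newest checkpoint.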

In [35]:
!pip uninstall transformers -y

Found existing installation: transformers 4.28.0
Uninstalling transformers-4.28.0:
  Successfully uninstalled transformers-4.28.0


In [36]:
!pip install transformers==4.28.0
!pip install accelerate
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Using cached transformers-4.28.0-py3-none-any.whl (7.0 MB)
Installing collected packages: transformers
Successfully installed transformers-4.28.0




In [37]:
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, AutoModelForSequenceClassification, Trainer
from datasets import Dataset, DatasetDict

# Make data comply to transformers standard: text and labels
train_data_transformers = train_data.loc[:, ['Abstract', 'Research Field']]
train_data_transformers.columns = ['text', 'labels']
test_data_transformers = test_data.loc[:, ['Abstract', 'Research Field']]
test_data_transformers.columns = ['text', 'labels']
valid_data_transformers = validation_data.loc[:, ['Abstract', 'Research Field']]
valid_data_transformers.columns = ['text', 'labels']

# convert to datasets
train_ds = Dataset.from_pandas(train_data_transformers)
valid_ds = Dataset.from_pandas(valid_data_transformers)
test_ds = Dataset.from_pandas(test_data_transformers)

dataset_dict = DatasetDict({'train': train_ds, 'validation': valid_ds, 'test': test_ds})

# model and tokenizer definition
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

id2label = {1: 'DENT', 2: 'AGRI', 3: 'ENER', 4: 'PSYC', 5: 'DECI', 6: 'VETE', 7: 'PHAR', 8: 'MATH',
       9: 'NURS', 10: 'ECON', 11: 'COMP', 12: 'ARTS', 13: 'CENG', 14: 'ENVI', 15: 'SOCI', 16: 'BIOC',
       17: 'MATE', 18: 'CHEM', 19: 'HEAL', 20: 'ENGI', 21: 'BUSI', 22: 'NEUR', 23: 'MEDI', 24: 'IMMU',
       25: 'PHYS', 0: 'EART'}
label2id = {value: key for key, value in id2label.items()}

# tokenize text and convert labels into label ids
def tokenize_function(x):
    tokens = tokenizer(x['text'], truncation=True, padding="max_length")
    tokens["labels"] = [label2id[label] for label in x["labels"]]
    return tokens

tokenized_datasets = dataset_dict.map(tokenize_function, batched=True)

Map:   0%|          | 0/124834 [00:00<?, ? examples/s]

Map:   0%|          | 0/22030 [00:00<?, ? examples/s]

Map:   0%|          | 0/7730 [00:00<?, ? examples/s]

In [42]:
from transformers import BertForSequenceClassification
import evaluate

accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

training_args = TrainingArguments("/content/drive/My Drive/SE_Digital_Organizations/checkpoint_research/", evaluation_strategy="epoch", per_device_train_batch_size=8, per_device_eval_batch_size=8)
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=26, id2label=id2label, label2id=label2id)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train(resume_from_checkpoint=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Epoch,Training Loss,Validation Loss


## Testing

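On the test data set, `trainer.predict` reports an accuracy of 71.56 % when only the most probable research field is considered. The logged output of the manual evaluation loop further below shows a top-1 accuracy of 70.91 % and a top-3 accuracy (correct field among the three most probable results) of 88.74 %.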
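
The predictions returned by `trainer.predict` in the cell below are raw scores (logits), not probabilities. Because softmax is monotonic, ranking by logits gives the same order as ranking by probabilities, so top-1 and top-3 hits can be computed directly on the logits. The sketch below illustrates this with made-up values. The helper names and the logits are assumptions for the example and are not taken from the model output.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def top_k_indices(values, k):
    """Indices of the k largest values, best first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

# Made-up logits for five classes (not taken from the model output)
logits = [-0.4, 2.1, 0.3, -1.2, 0.9]
probabilities = softmax(logits)

print(round(sum(probabilities), 6))     # 1.0
print(top_k_indices(logits, 3))         # [1, 4, 2]
print(top_k_indices(probabilities, 3))  # [1, 4, 2]
```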

In [8]:
preds = trainer.predict(tokenized_datasets["test"])
preds

PredictionOutput(predictions=array([[-0.44923425, -0.5951591 ,  0.2810526 , ..., -0.28517455,
         0.18553312, -0.6296185 ],
       [-0.4855193 , -0.34478235, -0.09674951, ...,  0.08822354,
        -0.3098316 , -0.08180964],
       [-0.33857864,  0.24000232, -1.0077988 , ...,  0.48366666,
        -0.6163959 ,  0.4590482 ],
       ...,
       [ 0.23488376, -0.15302663,  0.20471947, ..., -0.9302562 ,
        -1.2998265 ,  0.4856882 ],
       [ 0.17979462, -0.5427721 , -0.7442204 , ..., -1.6578081 ,
        -1.7306445 ,  2.0391958 ],
       [-0.76626116, -1.1526433 , -0.367672  , ..., -0.93114835,
        -1.8556029 , -0.4998625 ]], dtype=float32), label_ids=array([18, 18, 18, ...,  8,  8,  8]), metrics={'test_loss': 1.095831274986267, 'test_accuracy': 0.7156165092508734, 'test_runtime': 256.7947, 'test_samples_per_second': 30.098, 'test_steps_per_second': 3.766})

In [47]:
from collections import Counter

labels = preds.label_ids                     # true label ids, aligned with preds.predictions
top1 = np.argmax(preds.predictions, axis=1)  # most probable class per sample
# indices of the three largest logits per sample (unordered within the three)
top3 = np.argpartition(preds.predictions, -3, axis=1)[:, -3:]

single_hit_mask = top1 == labels
triple_hit_mask = (top3 == labels[:, None]).any(axis=1)

single_hits = int(single_hit_mask.sum())
triple_hits = int(triple_hit_mask.sum())

# count the misses per true research field
wrong_single_hits_per_category = Counter(id2label[int(label)] for label in labels[~single_hit_mask])
wrong_triple_hits_per_category = Counter(id2label[int(label)] for label in labels[~triple_hit_mask])

# sort by number of misses, highest first
sorted_single_hit_fails = dict(wrong_single_hits_per_category.most_common())
sorted_triple_hit_fails = dict(wrong_triple_hits_per_category.most_common())

print("Single and Triple Hit Accuracy")
print((single_hits/len(preds.predictions), triple_hits/len(preds.predictions)))
print("Category to wrong prediction single")
print(sorted_single_hit_fails)
print("Category to wrong prediction triple")
print(sorted_triple_hit_fails)

Single and Triple Hit Accuracy
(0.709147367059128, 0.8874369258636304)
Category to wrong prediction single
{'COMP': 137, 'BIOC': 125, 'ENGI': 123, 'ARTS': 121, 'ENVI': 119, 'SOCI': 116, 'MEDI': 115, 'AGRI': 114, 'DECI': 103, 'MATH': 103, 'MATE': 102, 'CHEM': 91, 'IMMU': 90, 'PSYC': 89, 'PHAR': 88, 'CENG': 87, 'ENER': 71, 'EART': 67, 'BUSI': 64, 'ECON': 55, 'NURS': 54, 'PHYS': 53, 'HEAL': 46, 'NEUR': 45, 'VETE': 39, 'DENT': 31}
Category to wrong prediction triple
{'MATH': 69, 'MEDI': 62, 'SOCI': 50, 'ENGI': 50, 'MATE': 50, 'COMP': 46, 'AGRI': 42, 'BIOC': 41, 'PHAR': 37, 'EART': 34, 'VETE': 33, 'ENVI': 32, 'PHYS': 31, 'DECI': 30, 'CENG': 30, 'PSYC': 29, 'HEAL': 29, 'NEUR': 29, 'IMMU': 25, 'NURS': 22, 'ENER': 22, 'BUSI': 21, 'DENT': 19, 'ARTS': 17, 'CHEM': 11, 'ECON': 9}
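
The miss counts above can be turned into an accuracy per research field once the number of test abstracts per field is known. The sketch below shows the calculation for two fields. The value of 300 test abstracts per field is an assumption that the notebook output does not confirm, and `per_field_accuracy` is a hypothetical helper, so the resulting numbers are illustrative rather than reported results.

```python
def per_field_accuracy(misses, samples_per_field):
    """Accuracy per field = 1 - misses / number of test samples in that field."""
    return {
        field: 1 - misses.get(field, 0) / n_samples
        for field, n_samples in samples_per_field.items()
    }

# Top-1 miss counts for two fields, copied from the output above
top1_misses = {"COMP": 137, "DENT": 31}

# ASSUMPTION: 300 test abstracts per field (not confirmed by the notebook output)
samples_per_field = {"COMP": 300, "DENT": 300}

accuracy_per_field = per_field_accuracy(top1_misses, samples_per_field)
for field, value in accuracy_per_field.items():
    print(f"{field}: {value:.3f}")
```

In the notebook itself, the true counts per field could be obtained from the test labels, which would avoid the assumption.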
