<a href="https://colab.research.google.com/github/leukschrauber/Assignments/blob/main/assignment_5_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment
*by Fabian Leuk (csba6437/12215478)*

The following assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts from which subject area a certain abstract originates. The plan would be that next week we will discuss your learnings from the theory part, that means you are relatively free to fill your Learning Portfolio on this new topic and in two weeks we will discuss your solutions of the Classification Model.


1) Preprocessing: The data which I provide as zip in Olat must be processed first, that means we need a table which has the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

2) We need a training dataset and a test dataset. My suggestion would be that for each research field we use the first 5700 lines for the training dataset and the last 300 lines for the test dataset. Please stick to this because then we can compare our models better!

3) Please use a pre-trained model from huggingface to build a classification model that tries to predict the correct research field from the 26. Please calculate the accuracy and the overall accuracy for all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also see if the model improves if you include keywords and titles.

Some links, which can help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!

## Data Preprocessing

In order to prepare the data, I resolved some issues in the CSV files. Specifically, the file "MATH_1991-2000.csv" had an issue with line number 1061. A quotation mark could not be escaped by pandas CSV-Helper, thus I removed it.

Also, the file "HEAL_2001-2010.csv" contained 594 records only as opposed to 2000 for every other file. Thus, I extracted the first 95 % of each file into the training data set. The last 5 percent of each file were extracted into the test data set.

The data was condensed in the way requested to "Research Field", "Abstract", "Title" and "Keywords" where the keywords consist of the columns "Author Keywords" and "Index Keywords" of the original dataset.

A validation data set was extracted from the training data set, using 15 percent of the training data set records. The split was undertaken using a stratified sampling approach by means of the Research Field column.

In [15]:
import pandas as pd
import numpy as np
from google.colab import drive
import os
from google.colab import data_table
from sklearn.model_selection import train_test_split

data_table.enable_dataframe_formatter()

drive.mount('/content/drive')

directory = '/content/drive/My Drive/SE_Digital_Organizations/data/'
train_data = pd.DataFrame()
test_data = pd.DataFrame()

for filename in os.listdir(directory):
    if filename.endswith('.csv'): 
        file_path = os.path.join(directory, filename)
        data = pd.read_csv(file_path)
        train_end_idx = int(len(data) * 0.95)
        train_data = pd.concat([train_data, data[:train_end_idx]])
        test_data = pd.concat([test_data, data[train_end_idx:]])

        research_field = filename.split('_')[0]
        
        train_data['Research Field'] = research_field
        test_data['Research Field'] = research_field


train_data['Keywords'] = train_data['Author Keywords'].fillna('') + ' ' + train_data['Index Keywords'].fillna('')
test_data['Keywords'] = test_data['Author Keywords'].fillna('') + ' ' + test_data['Index Keywords'].fillna('')

columns_to_keep = ['Keywords', 'Research Field', 'Abstract', 'Title']
train_data = train_data[columns_to_keep]
test_data = test_data[columns_to_keep]


train_data, validation_data = train_test_split(train_data, test_size=0.15, stratify=train_data['Research Field'], random_state=42)

print("Length of training, validation and test set")
print((len(train_data), len(validation_data), len(test_data)))

validation_data.head(10)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Length of training, validation and test set
(124834, 22030, 7730)


Unnamed: 0,Keywords,Research Field,Abstract,Title
396,archaeology; atmospheric deposition; climatology,AGRI,This is a useful microphysics handbook for con...,Microclimate for cultural heritage
385,messenger rna; protein kinase; amino acid seq...,AGRI,"Using positional cloning strategies, we have i...",Molecular basis of myotonic dystrophy: Expansi...
1185,Cognitive bias and heuristic; Information theo...,AGRI,A single coherent framework is proposed to syn...,Toward a synthesis of cognitive biases: How no...
1012,Amyloid; BOLD; default mode network (DMN); fMR...,AGRI,There has been a dramatic increase in the numb...,Resting state functional connectivity in precl...
1293,chemoattractant; chitin; chitobiose; chitotri...,AGRI,"Upon transit to colonization sites, bacteria o...",Initial symbiont contact orchestrates host-org...
1376,,AGRI,There are two purposes to the present study. O...,The physical environment of street blocks and ...
1715,Cathode material; Crystal structure; Iron sili...,AGRI,"Recently, preparation and preliminary testing ...",Structure and electrochemical performance of L...
799,Amorphous materials; Catalytic ability; Desig...,AGRI,Porous organic materials have garnered colossa...,Porous Organic Materials: Strategic Design and...
595,Coalition; Cooperative game; Hedonic game; Ind...,AGRI,We consider the partitioning of a society into...,The stability of hedonic coalition structures
275,Algorithms; Computation theory; Graph theory;...,AGRI,"The problem of measuring ""similarity"" of objec...",SimRank: A measure of structural-context simil...


Addition: Accuracy measures whether the research field with the highest probability value matches the target. With 26 research fields, it would also be interesting to know if the correct target is at least among the three highest probability values.

$\begin{pmatrix} A\\ B \\ C \\D \\E \end{pmatrix} = \begin{pmatrix} 0.1\\ 0.95 \\ 0.5 \\0.2 \\0.3 \end{pmatrix} → \text{Choice}_1 = B, \text{Choice}_3 = B,C,E$