<a href="https://colab.research.google.com/github/igorgatchin1993/assigments/blob/main/Assignment_5_ipynb_(practical_part_2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts the subject area an abstract originates from. Next week we will discuss what you learned in the theory part, so you are relatively free in how you fill your Learning Portfolio on this new topic. In two weeks we will discuss your solutions for the classification model.


# Practical part (Assignment, May 17)

1) Preprocessing: The data, which I provide as a zip file in Olat, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.
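A minimal sketch of this preprocessing step for a single file. It assumes that the raw CSV files already contain columns named `Keywords`, `Title` and `Abstract`, and that the field code is the part of the file name before the first underscore (as in `MATH_1991-2000.csv`). The helper names `field_from_filename` and `load_field_table` and the demo file are hypothetical, not part of the provided data.

```python
import os
import tempfile

import pandas as pd

# Assumed column names, taken from the table header in the assignment text
COLUMNS = ["Keywords", "Title", "Abstract"]

def field_from_filename(path):
    # "COMP_1991-2000.csv" -> "COMP" (assumes the code precedes the first underscore)
    return os.path.basename(path).split("_")[0]

def load_field_table(path):
    # Read one file and attach the research field derived from its name
    df = pd.read_csv(path)
    table = df[COLUMNS].copy()
    table["Research Field"] = field_from_filename(path)
    return table

# Tiny demo: a temporary file stands in for one file from the zip archive
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "COMP_1991-2000.csv")
pd.DataFrame({
    "Keywords": ["nlp; transformers"],
    "Title": ["A toy title"],
    "Abstract": ["A toy abstract."],
}).to_csv(demo_path, index=False)

demo_table = load_field_table(demo_path)
print(demo_table)
```

Applying `load_field_table` to every file and concatenating the results with `pd.concat` would give one table for all research fields.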

2) We need a training dataset and a test dataset. My suggestion would be that for each research field we use the first 5700 lines for the training dataset and the last 300 lines for the test dataset. Please stick to this because then we can compare our models better!
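The positional split described here could be sketched as follows. The function name `split_first_last` is hypothetical, and the sketch assumes each research field has at least 5700 + 300 rows so that the two parts do not overlap; it raises an error otherwise.

```python
import pandas as pd

def split_first_last(df, n_train=5700, n_test=300):
    # First n_train rows go to training, last n_test rows go to testing
    if len(df) < n_train + n_test:
        raise ValueError("Not enough rows for a non-overlapping split.")
    train = df.iloc[:n_train]
    test = df.iloc[-n_test:]
    return train, test

# Small demo with 10 rows instead of 6000
demo = pd.DataFrame({"Abstract": [f"abstract {i}" for i in range(10)]})
train_demo, test_demo = split_first_last(demo, n_train=7, n_test=3)
print(len(train_demo), len(test_demo))  # 7 3
```

Because the split depends only on row order, everyone who uses it on the same files gets the same test set, which is what makes the models comparable.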

3) Please use a pre-trained model from Hugging Face to build a classification model that predicts the correct research field out of the 26. Please calculate the accuracy for each research field and the overall accuracy across all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also check whether the model improves if you include keywords and titles.
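One possible way to compute the accuracy per research field and the overall accuracy, sketched with NumPy. It assumes the true and predicted labels are integer ids that index into a list of field names; the function name `accuracy_report` and the demo values are made up for illustration.

```python
import numpy as np

def accuracy_report(y_true, y_pred, field_names):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = y_true == y_pred

    # Accuracy per field: share of correct predictions among the rows of that field
    per_field = {}
    for label_id, name in enumerate(field_names):
        mask = y_true == label_id
        if mask.any():
            per_field[name] = float(correct[mask].mean())

    # Overall accuracy: share of correct predictions over all rows
    overall = float(correct.mean())
    return per_field, overall

# Demo with three fields and six made-up predictions
fields_demo = ["AGRI", "ARTS", "BIOC"]
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
per_field, overall = accuracy_report(y_true_demo, y_pred_demo, fields_demo)
print(per_field)
print(overall)
```

When every field contributes the same number of test rows, the overall accuracy equals the mean of the per-field accuracies.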

Some links, which can help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!

In [None]:
# Install the Hugging Face libraries
!pip install transformers datasets
# Pin transformers to version 4.28.0; this replaces the release installed by the command above.
!pip install transformers==4.28.0
!pip install accelerate
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [None]:
import pandas as pd
import os
from google.colab import drive #allows us to reach our google drive
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, AutoModelForSequenceClassification, Trainer
from datasets import Dataset,DatasetDict
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
drive.mount('/content/drive')

folder_path = '/content/drive/MyDrive/research_field/data' # Dataset path from google drive

def load_csv_files(folder_path, test_size=0.2, random_state=42):
    # Get the list of CSV files in the folder
    file_list = os.listdir(folder_path)
    csv_files = [file for file in file_list if file.endswith('.csv')]

    # Collect the per-file train/test parts here
    train_parts = []
    test_parts = []

    # Read each CSV file and split it into a train and a test part
    for file in csv_files:
        file_path = os.path.join(folder_path, file)
        try:
            df = pd.read_csv(file_path)
        except pd.errors.ParserError:
            print(f"Error: {file_path} could not be read and was skipped.")
            continue

        # Find the text column
        text_column_name = find_text_column(df)

        # Skip the file if it has no text column
        if text_column_name is None:
            print(f"Error: no text column found in {file_path}; file skipped.")
            continue

        # The research field is determined by the file name,
        # e.g. "MATH_1991-2000.csv" -> "MATH"
        df["Research Field"] = file.split('_')[0]

        # Random split of this file. Note: this is not the
        # "first 5700 / last 300 lines" split suggested in the assignment.
        train_df, test_df = train_test_split(df, test_size=test_size, random_state=random_state)

        train_parts.append(train_df)
        test_parts.append(test_df)

    # Concatenate once at the end (empty DataFrames if nothing was loaded)
    train_data = pd.concat(train_parts) if train_parts else pd.DataFrame()
    test_data = pd.concat(test_parts) if test_parts else pd.DataFrame()

    return train_data, test_data

def find_text_column(data):
    # Return the name of the first column with a text (object/string) dtype
    for column in data.columns:
        column_type = data[column].dtype
        if column_type == 'object' or column_type == 'string':
            return column

    return None

def tokenize_data(data, text_column, vectorizer=None):
    # Take the text column as a list of strings (missing values become empty strings)
    text_data = data[text_column].fillna("").astype(str).tolist()

    # Build bag-of-words counts with CountVectorizer.
    # The vocabulary is fitted only when no fitted vectorizer is passed in,
    # so that the train and test matrices share the same columns.
    if vectorizer is None:
        vectorizer = CountVectorizer()
        tokenized_data = vectorizer.fit_transform(text_data)
    else:
        tokenized_data = vectorizer.transform(text_data)

    return tokenized_data, vectorizer

# Load the CSV files and get the train/test sets
train_set, test_set = load_csv_files(folder_path)

# Find the text column
text_column = find_text_column(train_set)

# Stop with an error if no text column was found
if text_column is None:
    raise ValueError("No text column found.")

# Vectorize the training set
train_tokens, vectorizer = tokenize_data(train_set, text_column)

# Vectorize the test set with the vocabulary learned on the training set
test_tokens, _ = tokenize_data(test_set, text_column, vectorizer)

# Show the first training example
print("Training set example:")
print(train_tokens[0].toarray())

# Get the research field codes from the CSV file names

def get_fields(folder_path):
    fields = set()
    for filename in os.listdir(folder_path):
        if filename.endswith('.csv'):
            field = filename.split('_')[0]
            fields.add(field)
    return list(fields)


fields = get_fields(folder_path)
print(fields)

# load_csv_files already returned the train/test split, so reuse it here
train_data, test_data = train_set, test_set

# Show the train and test sets
print("Training set:")
print(train_data.head())
print()
print("Test set:")
print(test_data.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Error: /content/drive/MyDrive/research_field/data/MATH_1991-2000.csv could not be read and was skipped.
Training set example:
[[0 0 0 ... 0 0 0]]
['BUSI', 'IMMU', 'HEAL', 'ECON', 'DECI', 'COMP', 'AGRI', 'PHYS', 'ENER', 'NEUR', 'PSYC', 'CENG', 'NURS', 'ARTS', 'BIOC', 'MEDI', 'ENVI', 'VETE', 'PHAR', 'MATH', 'DENT', 'EART', 'ENGI', 'SOCI', 'CHEM', 'MATE']


In [None]:
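import numpy as np
import evaluate
from transformers import BertForSequenceClassification
from transformers.trainer_utils import get_last_checkpoint

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Assumption: train_set/test_set follow the table from step 1, i.e. they have an
# "Abstract" column and a "Research Field" column holding the field code.
label2id = {field: i for i, field in enumerate(sorted(fields))}

def to_dataset(df):
    # Keep only the abstract text and the integer label
    df = df.dropna(subset=["Abstract", "Research Field"])
    return Dataset.from_dict({
        "text": df["Abstract"].astype(str).tolist(),
        "label": df["Research Field"].map(label2id).tolist(),
    })

def tokenize_function(batch):
    # Truncate abstracts that exceed the maximum input length of the model
    return tokenizer(batch["text"], truncation=True)

raw_datasets = DatasetDict({"train": to_dataset(train_set), "test": to_dataset(test_set)})
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])

accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

training_args = TrainingArguments('/content/drive/MyDrive/research_field/data', evaluation_strategy="epoch", per_device_train_batch_size=8, per_device_eval_batch_size=8)

# One output label per research field (26 in this dataset)
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=len(label2id))
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Resume from the latest checkpoint in the output folder if there is one, otherwise start fresh
last_checkpoint = get_last_checkpoint(training_args.output_dir)
trainer.train(resume_from_checkpoint=last_checkpoint)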

ModuleNotFoundError: ignored

Addition: Accuracy measures whether the research field with the highest probability value matches the target. With 26 research fields, it would also be interesting to know if the correct target is at least among the three highest probability values.

$\begin{pmatrix} A \\ B \\ C \\ D \\ E \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.95 \\ 0.5 \\ 0.2 \\ 0.3 \end{pmatrix} \rightarrow \text{Choice}_1 = \{B\},\ \text{Choice}_3 = \{B, C, E\}$
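The top-3 idea can be sketched with NumPy as follows, using the five scores from the example. The function names `top_k_choices` and `top_k_accuracy` are hypothetical. The sketch assumes one row of scores per sample and integer class ids as labels; since only the ranking matters, raw model outputs (logits) could be passed instead of probabilities.

```python
import numpy as np

def top_k_choices(scores, k):
    # Indices of the k largest scores in each row, best first
    scores = np.asarray(scores)
    return np.argsort(-scores, axis=1)[:, :k]

def top_k_accuracy(scores, labels, k=3):
    # A sample counts as a hit if its true label is among its k best classes
    top_k = top_k_choices(scores, k)
    labels = np.asarray(labels).reshape(-1, 1)
    hits = (top_k == labels).any(axis=1)
    return float(hits.mean())

names = np.array(["A", "B", "C", "D", "E"])
scores = np.array([[0.1, 0.95, 0.5, 0.2, 0.3]])

print(names[top_k_choices(scores, 1)[0]])  # ['B']
print(names[top_k_choices(scores, 3)[0]])  # ['B' 'C' 'E']

# If the true class were E (index 4), top-1 misses but top-3 hits
print(top_k_accuracy(scores, [4], k=1))  # 0.0
print(top_k_accuracy(scores, [4], k=3))  # 1.0
```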