# LLM 25L Project - Text classification on a selected dataset using TF-IDF feature extraction and a trained (custom?) classifier

## Adam Kraś 325177

In [9]:
import torch
import datasets
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [3]:
!nvidia-smi

Mon May 26 15:13:31 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   51C    P0             21W /   55W |    1127MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [5]:
# Load the selected dataset
dataset = datasets.load_dataset("community-datasets/yahoo_answers_topics")

In [6]:
dataset["train"][420]  # Show a sample from the dataset

{'id': 420,
 'topic': 4,
 'question_title': 'Who invented the mouse?',
 'question_content': '',
 'best_answer': 'According to the bible, it was God.\\nAccording to Darwin, it was natural evolution.\\nMany others will convince you that it was Steve Jobs and Apple.\\n\\nThe answer, according to http://www.superkids.com/aweb/pages/features/mouse/mouse.html, is Doug Engelbart.\\n\\n(which I found via this Yahoo! Search http://myweb2.search.yahoo.com/search?p=who+invented+the+mouse.)'}

### The dataset analysis was carried out before fine-tuning the [BERT](BERT_klasyfikacja.ipynb) model; here I only load the dataset in the same way

In [7]:
train_dataset = dataset["train"]

dataset_size = 50000

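# Draw a stratified subset of `dataset_size` examples from the training split;
# the "test" part produced by this split is discarded.
reduced_train = train_dataset.train_test_split(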
    train_size=dataset_size,
    test_size=dataset_size // 5,
    stratify_by_column="topic",
    shuffle=True,
    seed=42
)["train"]

train_dataset = reduced_train

# 80/10/10 train/validation/test split (these two splits are not stratified)
full_split = train_dataset.train_test_split(test_size=0.2, seed=42)
val_test_split = full_split["test"].train_test_split(test_size=0.5, seed=42)

train_dataset = full_split["train"]
val_dataset = val_test_split["train"]
test_dataset = val_test_split["test"]

# Check the dataset sizes after the split
total_size = len(train_dataset) + len(val_dataset) + len(test_dataset)
print(f"Dataset size: {total_size}")
print(f"Train dataset size: {len(train_dataset) / total_size * 100:.2f}%")
print(f"Test dataset size: {len(test_dataset) / total_size * 100:.2f}%")
print(f"Validation dataset size: {len(val_dataset) / total_size * 100:.2f}%")

Dataset size: 50000
Train dataset size: 80.00%
Test dataset size: 10.00%
Validation dataset size: 10.00%


In [12]:
def join_text(example):
    return example["question_title"] + " " + example["question_content"]

# Extracting text and labels separately
train_texts = [join_text(example) for example in train_dataset]
test_texts = [join_text(example) for example in test_dataset]
val_texts = [join_text(example) for example in val_dataset]

train_labels = [example["topic"] for example in train_dataset]
val_labels = [example["topic"] for example in val_dataset]
test_labels = [example["topic"] for example in test_dataset]
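Because only the first subsampling step is stratified by `topic`, it may be worth checking that the later 80/10/10 split kept the classes roughly balanced. A minimal sketch of such a check follows; the helper `label_shares` and the `toy_labels` list are hypothetical names introduced for illustration and are not part of the original notebook.

```python
from collections import Counter


def label_shares(labels):
    """Return {label: fraction of examples}, ordered by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Toy labels standing in for train_labels / val_labels / test_labels
toy_labels = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
toy_shares = label_shares(toy_labels)
print(toy_shares)
```

In the notebook the same helper could be called as `label_shares(train_labels)`, `label_shares(val_labels)` and `label_shares(test_labels)` to compare the three distributions.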

In [34]:
vocab_size = 50000

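vectorizer = TfidfVectorizer(
    max_features=vocab_size,  # keep at most vocab_size terms, ranked by corpus frequency
    ngram_range=(1, 2),       # use unigrams and bigrams
    stop_words='english',     # drop common English stop words
    min_df=2,                 # ignore terms that occur in fewer than 2 documents
    max_df=0.9,               # ignore terms that occur in more than 90% of documents
)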
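To make the effect of these settings concrete, the sketch below applies the same filtering parameters to a four-document toy corpus. The corpus and the names `toy_docs`, `toy_vectorizer`, `toy_vocab` and `capped_vectorizer` are made up for illustration; only the `TfidfVectorizer` parameters mirror the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "yahoo" occurs in every document,
# "tutorial" and "dataframe" occur in only one document each.
toy_docs = [
    "yahoo python pandas tutorial",
    "yahoo python pandas dataframe",
    "yahoo guitar chord",
    "yahoo guitar the chord",
]

toy_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    stop_words="english",
    min_df=2,
    max_df=0.9,
)
toy_vectorizer.fit(toy_docs)
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# max_features caps the vocabulary size after the min_df/max_df filtering
capped_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    stop_words="english",
    min_df=2,
    max_df=0.9,
    max_features=3,
)
capped_vectorizer.fit(toy_docs)
print(len(capped_vectorizer.vocabulary_))
```

Here the unigram "yahoo" is removed by `max_df` because it appears in all documents, the rare terms are removed by `min_df`, and "the" is removed as a stop word before bigrams are formed, so "guitar chord" is still counted in the last document.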

In [35]:
train_tfidf_texts = vectorizer.fit_transform(train_texts)
val_tfidf_texts = vectorizer.transform(val_texts)
test_tfidf_texts = vectorizer.transform(test_texts)
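The vectorizer is fitted on the training texts only, and the validation and test texts are transformed with the vocabulary and IDF weights learned there. The pure-Python sketch below illustrates that idea using the weighting scikit-learn applies by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document). The functions `fit_idf` and `transform_rows` and the toy data are hypothetical and work on pre-tokenised documents; this is not how the library is implemented internally.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform_rows(tokenized_docs, idf):
    """Weight each document with the learned IDF; unseen terms are ignored."""
    rows = []
    for doc in tokenized_docs:
        counts = Counter(term for term in doc if term in idf)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm > 0 else {})
    return rows


toy_train = [["python", "pandas"], ["python", "guitar"]]
toy_idf = fit_idf(toy_train)
toy_train_rows = transform_rows(toy_train, toy_idf)
# "violin" was never seen during fitting, so it contributes nothing
toy_val_rows = transform_rows([["python", "violin"]], toy_idf)
print(toy_train_rows[0], toy_val_rows[0])
```

In the first toy document the rarer term "pandas" receives a larger weight than "python", which occurs in both training documents.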

In [36]:
print(len(vectorizer.vocabulary_))

50000
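The reported vocabulary size equals `vocab_size`, which suggests the `max_features` cap was reached. As a rough illustration of how sparse TF-IDF features like these can feed a classifier, the sketch below trains a linear baseline on a tiny toy corpus. `LogisticRegression` is an assumption chosen for brevity and is not necessarily the classifier trained in this project; all `toy_*` names and texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy two-class corpus (hypothetical): 0 = programming, 1 = music
toy_train_texts = [
    "how do I sort a python list",
    "python pandas dataframe merge",
    "install numpy with pip for python",
    "python loop over dict",
    "how to tune a guitar",
    "best guitar chord for beginners",
    "piano scales practice",
    "learn guitar songs",
]
toy_train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Fit the vectorizer on the training texts only
toy_vectorizer = TfidfVectorizer()
toy_train_X = toy_vectorizer.fit_transform(toy_train_texts)

# Linear classifier trained directly on the sparse TF-IDF matrix
toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(toy_train_X, toy_train_labels)

# New texts are transformed with the already fitted vectorizer
toy_new_texts = ["python numpy array", "guitar practice songs"]
toy_pred = toy_clf.predict(toy_vectorizer.transform(toy_new_texts))
print(list(toy_pred))
```

With the notebook's variables, the analogous calls would be `fit(train_tfidf_texts, train_labels)` followed by `predict(val_tfidf_texts)`, with the validation set used for model selection and the test set reserved for the final evaluation.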
