# PROJECT 4.1 - ENSEMBLE LEARNING TECHNIQUE
## `DATA COLLECTION - FEATURE EXTRACTION`
In this project, we use ensemble learning to classify arXiv paper abstracts by subject category.

After cleaning the data, we write it to the dataset folder as CSV files: train, test, and full dataset.

## i_IMPORT LIBRARY

In [1]:
import os
from matplotlib import pyplot as plt
import seaborn as sns
import math
import pandas as pd
import numpy as np

from utils.text_preprocessing import character_preprocessing
from utils.text_preprocessing import category_processing



In [2]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer


In [3]:
from sklearn.model_selection import train_test_split
# import warnings
# warnings.filterwarnings('ignore')

## 1. LOAD AND EXTRACT DATA
* The dataset is cached locally; on Linux the default location is `~/.cache/huggingface/datasets`.
* The download is only needed the first time. On later runs, the same command checks the cache and reuses the dataset if it is already there.
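
As a rough guide to where the cached copy ends up, the sketch below reproduces the lookup order that the `datasets` library is understood to follow: `HF_DATASETS_CACHE` first, then `HF_HOME`, then `XDG_CACHE_HOME`, and finally `~/.cache`. The function name `expected_datasets_cache` is hypothetical, and the precedence is an assumption that should be checked against the installed version.

```python
import os
from pathlib import Path


def expected_datasets_cache(env=None):
    """Return the directory where the datasets cache is expected to live.

    Assumed precedence: HF_DATASETS_CACHE > HF_HOME > XDG_CACHE_HOME > ~/.cache.
    """
    env = os.environ if env is None else env
    # An explicit datasets cache path wins over everything else.
    if env.get("HF_DATASETS_CACHE"):
        return Path(env["HF_DATASETS_CACHE"]).expanduser()
    # Otherwise fall back to <HF_HOME>/datasets, where HF_HOME itself
    # defaults to <XDG_CACHE_HOME or ~/.cache>/huggingface.
    xdg_cache = env.get("XDG_CACHE_HOME") or "~/.cache"
    hf_home = env.get("HF_HOME") or os.path.join(xdg_cache, "huggingface")
    return Path(hf_home).expanduser() / "datasets"


print(expected_datasets_cache())
```

Passing a plain dictionary instead of the real environment, for example `expected_datasets_cache({"HF_HOME": "/data/hf"})`, makes it easy to see how each variable changes the result.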

In [4]:
ds = load_dataset("UniverseTBD/arxiv-abstracts-large")
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed'],
        num_rows: 2292057
    })
})

In [5]:
# Inspect the categories and abstracts to see what the raw values look like
# label: categories
# features: abstract
all_categories = ds["train"]["categories"]
all_abstracts = ds["train"]["abstract"]
print(all_categories)
print(all_abstracts[0:1])

Column(['hep-ph', 'math.CO cs.CG', 'physics.gen-ph', 'math.CO', 'math.CA math.FA'])
['  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanc

We will take 2000 single-label samples whose category is one of: ["astro-ph", "cond-mat", "cs", "math", "physics"]

In [6]:
# Keep only the two columns needed for this project
ds_extracted = ds["train"].select_columns(["abstract", "categories"])
# Convert to a pandas DataFrame for further processing
dataset_df = ds_extracted.to_pandas()

In [7]:
dataset_df.head()

| | abstract | categories |
|---|---|---|
| 0 | A fully differential calculation in perturba... | hep-ph |
| 1 | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG |
| 2 | The evolution of Earth-Moon system is descri... | physics.gen-ph |
| 3 | We show that a determinant of Stirling cycle... | math.CO |
| 4 | In this paper we show how to compute the $\L... | math.CA math.FA |


## 2. COLLECT REQUIRED DATA TO CSV
In this part, the steps are:
* Reduce each `categories` value to its leading top-level label, then keep the rows whose label starts with one of ["astro-ph", "cond-mat", "cs", "math", "physics"]

In [8]:
dataset_processing = dataset_df.copy()
dataset_processing = dataset_processing.assign(
    abstract = dataset_processing["abstract"].apply(character_preprocessing),
    categories = dataset_processing["categories"].apply(category_processing)
)
# catetories_convert_ds = dataset_df.apply(category_processing())
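
The two helpers come from `utils.text_preprocessing` and are not shown in this notebook. The sketch below is a guess at their behavior, inferred only from the sample outputs (lower-cased abstracts without digits or punctuation, and labels such as `math` instead of `math.CO cs.CG`). The names `sketch_character_preprocessing` and `sketch_category_processing` are hypothetical stand-ins, not the project's actual implementation.

```python
import re


def sketch_character_preprocessing(text):
    """Lower-case the text and keep only letters and single spaces (assumed behavior)."""
    text = text.lower()
    # Drop everything that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text.split())


def sketch_category_processing(categories):
    """Return the top-level part of the first listed category (assumed behavior)."""
    parts = categories.split()
    if not parts:
        return ""
    # 'math.CO cs.CG' -> 'math.CO' -> 'math'
    return parts[0].split(".")[0]


print(sketch_character_preprocessing("Four-dimensional (4D) printing,\na new technology"))
print(sketch_category_processing("math.CO cs.CG"))
```

Under these assumptions a label without a dot, such as `hep-ph` or `math-ph`, passes through unchanged, which matters for the filtering step that follows.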

In [9]:
features_spec = ["astro-ph", "cond-mat", "cs", "math", "physics"]
# dataset_processing.to_csv("../dataset/full_dataset.csv", index=False)

In [10]:
# Build a regex alternation with '|'
# astro-ph|cond-mat|cs|math|physics
# A non-capturing group (?:...) avoids the pandas match-group warning in str.contains
pattern = "^(?:" + "|".join(features_spec) + ")"
print(pattern)

^(?:astro-ph|cond-mat|cs|math|physics)
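
Because the pattern is anchored only at the start of the string, it accepts any label that begins with one of the five names. A small sketch with Python's `re` module shows which labels pass; the list `sample_labels` is hand-picked for illustration and is not taken from the dataset.

```python
import re

features_spec = ["astro-ph", "cond-mat", "cs", "math", "physics"]
# Same prefix pattern as in the notebook, written with a non-capturing group.
prefix_pattern = re.compile("^(?:" + "|".join(features_spec) + ")")

# Illustrative labels only: some requested, some not.
sample_labels = ["math", "math-ph", "cs", "hep-ph", "physics"]

# A label passes as soon as it starts with one of the requested names.
prefix_hits = [label for label in sample_labels if prefix_pattern.search(label)]
print(prefix_hits)
```

Here `math-ph` passes the prefix test even though it is not one of the five requested labels, while `hep-ph` is rejected.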


In [11]:
# Keep only the rows whose category starts with one of the requested labels
dataset_filtered = dataset_processing[
    dataset_processing["categories"].str.contains(pattern, case=False, na=False)
]



In [12]:
# Randomly sample 2000 rows from the filtered dataset
dataset_2k = dataset_filtered.sample(n=2000, random_state=42)
dataset_2k.reset_index(drop=True, inplace=True)
dataset_2k.head()

| | abstract | categories |
|---|---|---|
| 0 | the factorially normalized bernoulli polynomia... | math |
| 1 | we propose a simple uniform lower bound on the... | math |
| 2 | we study truncated point schemes of connected ... | math |
| 3 | fourdimensional d printing a new technology em... | cs |
| 4 | we show that the dth secant variety of a proje... | math |


In [13]:
# Check that the remaining categories match the requested labels
print(dataset_2k["categories"].unique())

['math' 'cs' 'physics' 'cond-mat' 'astro-ph' 'math-ph']
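
The check above lists `math-ph`, a sixth label that is not among the five requested ones; it gets through because the filter tests only the start of the string. If exactly five classes are wanted, one option is exact membership with `Series.isin`. The sketch below uses a toy DataFrame named `toy` (illustrative rows, not real data) and is an alternative to the notebook's filter, so applying it would change the 2000-row sample drawn earlier.

```python
import pandas as pd

features_spec = ["astro-ph", "cond-mat", "cs", "math", "physics"]

# Toy stand-in for dataset_processing; the rows are illustrative only.
toy = pd.DataFrame({
    "abstract": ["a", "b", "c", "d"],
    "categories": ["math", "math-ph", "cs", "hep-ph"],
})

# Exact membership: 'math-ph' and 'hep-ph' are both dropped.
toy_filtered = toy[toy["categories"].isin(features_spec)]
kept_labels = sorted(toy_filtered["categories"].unique())
print(kept_labels)

# Guard that fails loudly if an unexpected label slips through.
assert set(kept_labels) <= set(features_spec)
```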


Split the sample into train and test sets, then export to CSV for future use

In [14]:
X = dataset_2k["abstract"]
y = dataset_2k["categories"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training feature size: {X_train.shape}")
print(f"Test feature size: {X_test.shape}")
print(f"Training label size: {y_train.shape}")
print(f"Test label size: {y_test.shape}")
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

Training feature size: (1600,)
Test feature size: (400,)
Training label size: (1600,)
Test label size: (400,)
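
The split above is purely random, so the class proportions in the test set can drift from those of the full sample, especially for small classes. A minimal sketch, assuming stratification is wanted (the notebook itself does not use it), passes `stratify` to `train_test_split`. The DataFrame `toy` and its 60/30/10 class mix are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced 60/30/10 class mix (illustrative only).
toy = pd.DataFrame({
    "abstract": [f"text {i}" for i in range(100)],
    "categories": ["math"] * 60 + ["cs"] * 30 + ["physics"] * 10,
})

# stratify keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    toy["abstract"],
    toy["categories"],
    test_size=0.2,
    random_state=42,
    stratify=toy["categories"],
)

test_counts = y_test.value_counts().to_dict()
print(test_counts)
```

With 20 test rows, each class keeps its 60/30/10 share, which makes per-class metrics on the test set easier to compare with the training distribution.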


In [15]:
dataset_2k.to_csv("../dataset/dataset2k.csv", index=False)
train_data.to_csv("../dataset/train_data.csv", index=False)
test_data.to_csv("../dataset/test_data.csv", index=False)
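
`to_csv` does not create missing directories, so the export fails if the dataset folder does not exist yet. The sketch below creates the folder first and reads the file back as a quick round-trip check. It writes to a temporary directory rather than `../dataset`, and the DataFrame `toy` is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "abstract": ["a b c", "d e f"],
    "categories": ["math", "cs"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder if it is missing; no error if it already exists.
    out_dir = Path(tmp) / "dataset"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "train_data.csv"
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm columns and rows survived the round trip.
    restored = pd.read_csv(out_path)

print(restored.shape)
```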