**1. Import dataset**

In [2]:
import pandas as pd

df = pd.read_parquet('./data/data_final.parquet')

print(df.head(5))

                                               title  \
0  Is Physics Sick? [In Praise of Classical Physics]   
1    Modern Mathematical Physics: what it should be?   
2                                Topology in Physics   
3       Contents of Physics Related E-Print Archives   
4        Fundamental Dilemmas in Theoretical Physics   

                                             authors  \
0                                     Hisham Ghassib   
1                                     Ludwig Faddeev   
2                                          R. Jackiw   
3  E. R. Prakasan, Anil Kumar, Anil Sagar, Lalit ...   
4                                     Hisham Ghassib   

                                             summary             published  \
0  In this paper, it is argued that theoretical p...  2012-09-04T10:32:56Z   
1  Personal view of author on goals and content o...  2000-02-08T13:13:00Z   
2  The phenomenon of quantum number fractionaliza...  2005-03-15T16:00:59Z   
3  The frontie

**2. Preprocessing data**

In [3]:
import re
import string
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# Download necessary NLTK data (run once)
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the tokenizer, lemmatizer, stemmer
wpt = WordPunctTokenizer()
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Create a stopwords set
stop_words = set(stopwords.words('english'))

def normalized_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'<.*?>+', ' ', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    tokens = wpt.tokenize(text)
    
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    lemma_stem_tokens = [stemmer.stem(lemmatizer.lemmatize(token)) for token in filtered_tokens]
    
    cleaned_text = ' '.join(lemma_stem_tokens)
    return cleaned_text


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tranminhanh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tranminhanh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
df['cleaned'] = df['summary'].apply(lambda x: normalized_text(x))

In [5]:
df.head(5)

Unnamed: 0,title,authors,summary,published,updated,link,pdf_url,categories,target,cleaned
0,Is Physics Sick? [In Praise of Classical Physics],Hisham Ghassib,"In this paper, it is argued that theoretical p...",2012-09-04T10:32:56Z,2012-09-04T10:32:56Z,http://arxiv.org/abs/1209.0592v1,http://arxiv.org/pdf/1209.0592v1,"physics.gen-ph, physics.hist-ph",physic,paper argu theoret physic akin organ rigid str...
1,Modern Mathematical Physics: what it should be?,Ludwig Faddeev,Personal view of author on goals and content o...,2000-02-08T13:13:00Z,2000-02-10T10:14:56Z,http://arxiv.org/abs/math-ph/0002018v2,http://arxiv.org/pdf/math-ph/0002018v2,"math-ph, hep-th, math.MP","math-stats,physic",person view author goal content mathemat physic
2,Topology in Physics,R. Jackiw,The phenomenon of quantum number fractionaliza...,2005-03-15T16:00:59Z,2005-03-15T16:00:59Z,http://arxiv.org/abs/math-ph/0503039v1,http://arxiv.org/pdf/math-ph/0503039v1,"math-ph, cond-mat.mes-hall, math.MP, physics.c...","math-stats,physic",phenomenon quantum number fraction explain rel...
3,Contents of Physics Related E-Print Archives,"E. R. Prakasan, Anil Kumar, Anil Sagar, Lalit ...",The frontiers of physics related e-print archi...,2003-08-28T13:12:57Z,2003-08-28T13:12:57Z,http://arxiv.org/abs/physics/0308107v1,http://arxiv.org/pdf/physics/0308107v1,physics.data-an,physic,frontier physic relat e print archiv web servi...
4,Fundamental Dilemmas in Theoretical Physics,Hisham Ghassib,"In this paper, we argue that there are foundat...",2014-05-22T07:49:09Z,2014-05-22T07:49:09Z,http://arxiv.org/abs/1405.5530v1,http://arxiv.org/pdf/1405.5530v1,physics.hist-ph,physic,paper argu foundat dilemma theoret physic rela...


**3. Training model**

In [None]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, precision_score, recall_score, hamming_loss, jaccard_score, classification_report

X = df['cleaned']  # Preprocessed text
y = df['target'].apply(lambda x: x.split(","))  # Convert to list if comma-separated

vectorizer = CountVectorizer(stop_words='english')
X_vectorized = vectorizer.fit_transform(X)

mlb = MultiLabelBinarizer()
Y_bin = mlb.fit_transform(y)

print("Labels:", mlb.classes_)  # Print encoded class labels

X_train, X_temp, Y_train, Y_temp = train_test_split(X_vectorized, Y_bin, test_size=0.2, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=42)

class_weights = []
for i in range(Y_train.shape[1]):
    class_counts = Counter(Y_train[:, i])
    total_samples = len(Y_train)
    weights = {label: total_samples / count for label, count in class_counts.items()}
    class_weights.append(weights)

nb_model = MultiOutputClassifier(MultinomialNB())

param_grid = {'estimator__alpha': [0.1, 0.5, 1.0, 2.0, 3.0]}
grid_search = GridSearchCV(nb_model, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
grid_search.fit(X_train, Y_train)

best_alpha = grid_search.best_params_['estimator__alpha']
print(f"Best alpha parameter: {best_alpha}")

best_nb_model = MultiOutputClassifier(MultinomialNB(alpha=best_alpha))
best_nb_model.fit(X_train, Y_train)

Y_val_pred = best_nb_model.predict(X_val)

Y_test_pred = best_nb_model.predict(X_test)

test_f1 = f1_score(Y_test, Y_test_pred, average='macro')
test_precision = precision_score(Y_test, Y_test_pred, average='macro', zero_division=0)
test_recall = recall_score(Y_test, Y_test_pred, average='macro', zero_division=0)
test_hamming = hamming_loss(Y_test, Y_test_pred)
test_jaccard = jaccard_score(Y_test, Y_test_pred, average='macro')

print("\nTest Set Metrics:")
print(f"Test F1 Score: {test_f1:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test Hamming Loss: {test_hamming:.4f}")
print(f"Test Jaccard Score: {test_jaccard:.4f}")

# Classification Reports
print("\nClassification Report (Test):")
print(classification_report(Y_test, Y_test_pred, target_names=mlb.classes_))


Labels: ['bio' 'cs' 'econ-qfin' 'eess' 'math-stats' 'physic']
Best alpha parameter: 2.0

Test Set Metrics:
Test F1 Score: 0.7466
Test Precision: 0.7013
Test Recall: 0.8138
Test Hamming Loss: 0.0963
Test Jaccard Score: 0.6063

Classification Report (Test):
              precision    recall  f1-score   support

         bio       0.70      0.77      0.73      1074
          cs       0.81      0.89      0.85      5758
   econ-qfin       0.67      0.88      0.76      1240
        eess       0.41      0.73      0.53      1007
  math-stats       0.76      0.80      0.78      5432
      physic       0.86      0.81      0.83      5550

   micro avg       0.76      0.83      0.79     20061
   macro avg       0.70      0.81      0.75     20061
weighted avg       0.78      0.83      0.80     20061
 samples avg       0.81      0.87      0.81     20061



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


**4. Backtesting**

In [None]:
import random
import numpy as np

new_data_samples = [
    "Quantum entanglement and Bell's theorem challenge classical interpretations of physics.",
    "Statistical methods in probability theory play a fundamental role in mathematical modeling.",
    "New deep learning architectures are transforming natural language processing applications.",
    "Cryptography relies on number theory and complex mathematical algorithms.",
    "Bayesian inference is widely used in statistical decision-making and machine learning.",
    "Attention is all you need The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
    "Vector autogressions (VARs) are widely applied when it comes to modeling and forecasting macroeconomic variables. In high dimensions, however, they are prone to overfitting. Bayesian methods, more concretely shrinkage priors, have shown to be successful in improving prediction performance. In the present paper, we introduce the semi-global framework, in which we replace the traditional global shrinkage parameter with group-specific shrinkage parameters. We show how this framework can be applied to various shrinkage priors, such as global-local priors and stochastic search variable selection priors. We demonstrate the virtues of the proposed framework in an extensive simulation study and in an empirical application forecasting data of the US economy. Further, we shed more light on the ongoing ``Illusion of Sparsity'' debate, finding that forecasting performances under sparse/dense priors vary across evaluated economic variables and across time frames. Dynamic model averaging, however, can combine the merits of both worlds.",
    "Classification can be performed using either a discriminative or a generative learning approach. Discriminative learning consists of constructing the conditional probability of the outputs given the inputs, while generative learning consists of constructing the joint probability density of the inputs and outputs. Although most classical and quantum methods are discriminative, there are some advantages of the generative learning approach. For instance, it can be applied to unsupervised learning, statistical inference, uncertainty estimation, and synthetic data generation. In this article, we present a quantum generative multiclass classification strategy, called quantum generative classification (QGC). This model uses a variational quantum algorithm to estimate the joint probability density function of features and labels of a data set by means of a mixed quantum state. We also introduce a quantum map called quantum-enhanced Fourier features (QEFF), which leverages quantum superposition to prepare high-dimensional data samples in quantum hardware using a small number of qubits. We show that the quantum generative classification algorithm can be viewed as a Gaussian mixture that reproduces a kernel Hilbert space of the training data. In addition, we developed a hybrid quantum-classical neural network that shows that it is possible to perform generative classification on high-dimensional data sets. The method was tested on various low- and high-dimensional data sets including the 10-class MNIST and Fashion-MNIST data sets, illustrating that the generative classification strategy is competitive against other previous quantum models.",
    "Research on human skin anatomy reveals its complex multi-scale, multi-phase nature, with up to 70% of its composition being bounded and free water. Fluid movement plays a key role in the skin's mechanical and biological responses, influencing its time-dependent behavior and nutrient transport.Poroelastic modeling is a promising approach for studying skin dynamics across scales by integrating multi-physics processes. This paper introduces a biology hierarchical two-compartment model capturing fluid distribution in the interstitium and micro-circulation. A theoretical framework is developed with a biphasic interstitium -- distinguishing interstitial fluid and non-structural cells -- and analyzed through a one-dimensional consolidation test of a column. This biphasic approach allows separate modeling of cell and fluid motion, considering their differing characteristic times. An appendix discusses extending the model to include biological exchanges like oxygen transport. Preliminary results indicate that cell viscosity introduces a second characteristic time, and at high viscosity and short time scales, cells behave similarly to solids.A simplified model was used to replicate an experimental campaign on short time scales. Local pressure (up to 31 kPa) was applied to dorsal finger skin using a laser Doppler probe PF801 (Perimed Sweden), following a setup described in Fromy Brain Res (1998). The model qualitatively captured ischemia and post-occlusive reactive hyperemia, aligning with experimental data.All numerical simulations used the open-source software FEniCSx v0.9.0. To ensure transparency and reproducibility, anonymized experimental data and finite element codes are publicly available on GitHub.",
    "Currency arbitrage capitalizes on price discrepancies in currency exchange rates between markets to produce profits with minimal risk. By employing a combinatorial optimization problem, one can ascertain optimal paths within directed graphs, thereby facilitating the efficient identification of profitable trading routes. This research investigates the methodologies of quantum annealing and gate-based quantum computing in relation to the currency arbitrage problem. In this study, we implement the Quantum Approximate Optimization Algorithm (QAOA) utilizing Qiskit version 1.2. In order to optimize the parameters of QAOA, we perform simulations utilizing the AerSimulator and carry out experiments in simulation. Furthermore, we present an NchooseK-based methodology utilizing D-Wave's Ocean suite. This methodology enables a comparison of the effectiveness of quantum techniques in identifying optimal arbitrage paths. The results of our study enhance the existing literature on the application of quantum computing in financial optimization challenges, emphasizing both the prospective benefits and the present limitations of these developing technologies in real-world scenarios.",
    "Despite advances in methods to interrogate tumor biology, the observational and population-based approach of classical cancer research and clinical oncology does not enable anticipation of tumor outcomes to hasten the discovery of cancer mechanisms and personalize disease management. To address these limitations, individualized cancer forecasts have been shown to predict tumor growth and therapeutic response, inform treatment optimization, and guide experimental efforts. These predictions are obtained via computer simulations of mathematical models that are constrained with data from a patient's cancer and experiments. This book chapter addresses the validation of these mathematical models to forecast tumor growth and treatment response. We start with an overview of mathematical modeling frameworks, model selection techniques, and fundamental metrics. We then describe the usual strategies employed to validate cancer forecasts in preclinical and clinical scenarios. Finally, we discuss existing barriers in validating these predictions along with potential strategies to address them."
]

new_data_vectorized = vectorizer.transform(new_data_samples)

new_predictions = best_nb_model.predict(new_data_vectorized)

new_predictions = np.array(new_predictions)  # Convert list of arrays to a 2D NumPy array
predicted_labels = mlb.inverse_transform(new_predictions)  # Now, this works correctly

for i, text in enumerate(new_data_samples):
    print(f"\n🔹 **Generated Test Data Point {i+1}**: {text}")
    print(f"📌 **Predicted Categories**: {predicted_labels[i]}")



🔹 **Generated Test Data Point 1**: Quantum entanglement and Bell's theorem challenge classical interpretations of physics.
📌 **Predicted Categories**: ('math-stats', 'physic')

🔹 **Generated Test Data Point 2**: Statistical methods in probability theory play a fundamental role in mathematical modeling.
📌 **Predicted Categories**: ('bio', 'econ-qfin', 'physic')

🔹 **Generated Test Data Point 3**: New deep learning architectures are transforming natural language processing applications.
📌 **Predicted Categories**: ('cs',)

🔹 **Generated Test Data Point 4**: Cryptography relies on number theory and complex mathematical algorithms.
📌 **Predicted Categories**: ('math-stats',)

🔹 **Generated Test Data Point 5**: Bayesian inference is widely used in statistical decision-making and machine learning.
📌 **Predicted Categories**: ('cs', 'math-stats')

🔹 **Generated Test Data Point 6**: Attention is all you need The dominant sequence transduction models are based on complex recurrent or convoluti