## Exploratory Data Analysis

1. Perform exploratory data analysis to understand the structure and characteristics of the dataset.

The arXiv dataset contains more than 1.7 million articles.

The keys of each record in the .json file include: id, submitter, title, comments, journal-ref, categories, license, abstract, versions, authors_parsed.
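
The key list can be checked directly by reading the first record of the snapshot, which is stored as one JSON object per line. This is a minimal sketch using only the standard library; the helper name `first_record_keys` and the small demo record are illustrative assumptions and not part of the original notebook:

```python
import json
import os
import tempfile


def first_record_keys(path):
    """Return the keys of the first non-empty JSON Lines record in `path`."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                return list(json.loads(line).keys())
    return []


# Demo on a small stand-in file (hypothetical record, not real arXiv data)
demo_record = {
    "id": "0000.0000",
    "title": "Example title",
    "categories": "cs.LG stat.ML",
    "abstract": "Example abstract.",
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write(json.dumps(demo_record) + "\n")
    demo_path = tmp.name

demo_keys = first_record_keys(demo_path)
os.remove(demo_path)
print(demo_keys)
```

On the real data, the same helper would be called with the path to the metadata snapshot; only the first line is parsed, so the full file is never loaded into memory.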

Main categories are: 'phys', 'math', 'cs', 'q-bio', 'q-fin', 'stat', 'eess', 'econ'.

-> Create a .csv with 800 instances (100 per category) that keeps only the category, title and abstract.

-> Save the new data in: data_arxiv_articles

In [8]:
import os

import pandas as pd

In [10]:
print(os.getcwd())

/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/notebooks 


In [12]:
# Path to metadata JSON file
data_path = '/Users/mariapazoliva/PycharmProjects/Data/arxiv-metadata-oai-snapshot.json'

# Categories of interest
categories = ['phys', 'math', 'cs', 'q-bio', 'q-fin', 'stat', 'eess', 'econ']

# Initialization of dictionary to store the sampled articles for each category
sampled_articles = {category: pd.DataFrame() for category in categories}
needed_samples = {category: 100 for category in categories}

# Processing the JSON file in chunks to speed up the process
chunk_size = 1000
reader = pd.read_json(data_path, lines=True, chunksize=chunk_size)

for chunk in reader:
    # Checking if all categories have enough samples, if so, break early
    if all(samples.shape[0] >= 100 for samples in sampled_articles.values()):
        break

    # Processing each category
    for category in categories:
        if needed_samples[category] > 0:
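            # Filtering the chunk for the category. Note: this is a plain substring match,
            # so short labels can over-match (e.g. 'cs' also occurs inside 'physics').
            filtered_data = chunk[chunk['categories'].str.contains(category, na=False, regex=False)]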
            # Calculating how many more samples are needed
            remaining_samples = needed_samples[category]
            # Sampling the lesser of remaining needed or what's available in the chunk
            sampled = filtered_data.sample(min(remaining_samples, len(filtered_data)), replace=False, random_state=42)
            # Appending the samples to the respective category in sampled_articles
            sampled_articles[category] = pd.concat([sampled_articles[category], sampled])
            # Updating the number of samples still needed
            needed_samples[category] -= sampled.shape[0]

# Combining all sampled articles into one DataFrame
all_samples = pd.DataFrame()

for category, samples in sampled_articles.items():
    # Trimming to exactly 100 samples if needed
    final_samples = samples[:100].copy()  # Making a copy of the slice
    final_samples['category'] = category  # Safely adding the category column
    all_samples = pd.concat([all_samples, final_samples])

# Selecting only the required columns
all_samples = all_samples[['category', 'title', 'abstract']]
all_samples.reset_index(drop=True, inplace=True)

all_samples.to_csv(
    '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/data_arxiv_articles/sampled_arxiv_articles.csv',
    index=False)

print(f"CSV file with sampled articles has been created with {len(all_samples)} rows and saved in 'data_arxiv_articles'.")

CSV file with sampled articles has been created with 800 rows and saved in 'data_arxiv_articles'.
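
The filter in the cell above relies on `str.contains`, which is a substring match on the space-separated `categories` string. Short labels can therefore match unintended entries: 'cs' is a substring of 'physics', and 'stat' is a substring of 'cond-mat.stat-mech'. A stricter alternative is sketched below, assuming each entry has the form `archive` or `archive.subject` and comparing the label with the archive part only. The function name `matches_main_category` is hypothetical, and treating 'phys' as shorthand for the 'physics' archive is an assumption about the intended behaviour:

```python
# Assumption: 'phys' in the notebook is shorthand for the arXiv archive 'physics'
ARCHIVE_ALIASES = {"phys": "physics"}


def matches_main_category(categories_field, label):
    """True if any space-separated entry belongs to the archive named by `label`."""
    archive = ARCHIVE_ALIASES.get(label, label)
    for entry in categories_field.split():
        # 'cs.LG' -> 'cs', 'cond-mat.stat-mech' -> 'cond-mat', 'math-ph' -> 'math-ph'
        if entry.split(".")[0] == archive:
            return True
    return False


labels = ("phys", "cs", "stat", "math")
examples = ["physics.soc-ph", "cond-mat.stat-mech", "cs.LG stat.ML", "math-ph"]
for field in examples:
    substring_hits = [c for c in labels if c in field]
    strict_hits = [c for c in labels if matches_main_category(field, c)]
    print(f"{field!r}: substring={substring_hits} strict={strict_hits}")
```

In the sampling loop this could stand in for the `str.contains` filter via `chunk['categories'].apply(lambda s: matches_main_category(s, category))`. Doing so would change which articles are selected, so the outputs shown in this notebook would no longer be reproduced exactly.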


In [13]:
all_samples

Unnamed: 0,category,title,abstract
0,phys,A High Robustness and Low Cost Model for Casca...,We study numerically the cascading failure p...
1,phys,Turbulent Diffusion of Lines and Circulations,We study material lines and passive vectors ...
2,phys,Leaky modes of a left-handed slab,Using complex plane analysis we show that le...
3,phys,Universe Without Singularities. A Group Approa...,In the last years the traditional scenario o...
4,phys,Quantifying social group evolution,The rich set of interactions between individ...
...,...,...,...
795,econ,Economic Development and Inequality: a complex...,By borrowing methods from complex system ana...
796,econ,Testing for Common Breaks in a Multiple Equati...,The issue addressed in this paper is that of...
797,econ,A/B Testing of Auctions,"For many application areas A/B testing, whic..."
798,econ,Nonparametric Analysis of Random Utility Models,This paper develops and implements a nonpara...


In [14]:
# Printing the counts for each category
value_counts = all_samples['category'].value_counts()
print(value_counts)

category
phys     100
math     100
cs       100
q-bio    100
q-fin    100
stat     100
eess     100
econ     100
Name: count, dtype: int64


## Data Preprocessing

2. Preprocess the data:
   
- clean the text
- remove special characters
- lowercase the words
- remove stop words
- perform stemming or lemmatization
- vectorize abstracts
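
The `clean_text` function used in the next cell lives in the project's `utils` package and is not shown in this notebook. As a rough illustration of the first four steps only, the following standard-library sketch removes special characters, lowercases the text, and drops stop words. The stop-word list is a tiny illustrative subset, the name `simple_clean_text` is hypothetical, and lemmatization is deliberately left out because it needs an NLP library such as NLTK or spaCy:

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "we", "that", "is", "to", "for", "by", "on", "with"}


def simple_clean_text(text):
    """Lowercase, strip non-letter characters, and remove stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase ASCII letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


# Made-up example sentence, not taken from the dataset
cleaned = simple_clean_text("We study the Leaky modes of a left-handed slab!")
print(cleaned)
```

Because no lemmatizer is applied, plural and inflected forms are kept as they are, so the result will differ from the `clean_abstract` column produced by the project's own function.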

In [15]:
# Import the clean_text function from the utils package
from utils.clean_text import clean_text

# Apply the cleaning function to the 'abstract' column
all_samples['clean_abstract'] = all_samples['abstract'].apply(clean_text)

# Save the DataFrame with the new column to CSV
all_samples.to_csv(
    '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/data_arxiv_articles/cleaned_arxiv_articles.csv',
    index=False)

# Display the original and cleaned abstracts
print(all_samples[['abstract', 'clean_abstract']].head())

                                            abstract  \
0    We study numerically the cascading failure p...   
1    We study material lines and passive vectors ...   
2    Using complex plane analysis we show that le...   
3    In the last years the traditional scenario o...   
4    The rich set of interactions between individ...   

                                      clean_abstract  
0  study numerically cascade failure problem arti...  
1  study material line passive vector model turbu...  
2  complex plane analysis left handed slab suppor...  
3  year traditional scenario big bang deeply modi...  
4  rich set interaction individual society result...  


In [16]:
all_samples

Unnamed: 0,category,title,abstract,clean_abstract
0,phys,A High Robustness and Low Cost Model for Casca...,We study numerically the cascading failure p...,study numerically cascade failure problem arti...
1,phys,Turbulent Diffusion of Lines and Circulations,We study material lines and passive vectors ...,study material line passive vector model turbu...
2,phys,Leaky modes of a left-handed slab,Using complex plane analysis we show that le...,complex plane analysis left handed slab suppor...
3,phys,Universe Without Singularities. A Group Approa...,In the last years the traditional scenario o...,year traditional scenario big bang deeply modi...
4,phys,Quantifying social group evolution,The rich set of interactions between individ...,rich set interaction individual society result...
...,...,...,...,...
795,econ,Economic Development and Inequality: a complex...,By borrowing methods from complex system ana...,borrow method complex system analysis paper an...
796,econ,Testing for Common Breaks in a Multiple Equati...,The issue addressed in this paper is that of...,issue address paper testing common break equat...
797,econ,A/B Testing of Auctions,"For many application areas A/B testing, whic...",application area b testing partition user syst...
798,econ,Nonparametric Analysis of Random Utility Models,This paper develops and implements a nonpara...,paper develop implement nonparametric test ran...


## Vectorizing Abstracts with BERT

In [18]:
from transformers import BertTokenizer, BertModel
import torch

# Loading pre-trained model tokenizer (vocabulary) and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


def get_bert_embeddings(text):
    # Encoding text to get token ids and attention masks
    encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Computing token embeddings
    with torch.no_grad():
        output = model(**encoded_input)
    # Getting embeddings for the [CLS] token (first token), which is often used as the sentence embedding
    embeddings = output.last_hidden_state[:, 0, :].squeeze().numpy()
    return embeddings


# Getting BERT embeddings for each cleaned abstract
all_samples['vectorized_abstract'] = all_samples['clean_abstract'].apply(get_bert_embeddings)

# Saving the DataFrame with the new columns to CSV
all_samples.to_csv(
    '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/data_arxiv_articles/final_arxiv_articles.csv', index=False)


In [19]:
all_samples

Unnamed: 0,category,title,abstract,clean_abstract,vectorized_abstract
0,phys,A High Robustness and Low Cost Model for Casca...,We study numerically the cascading failure p...,study numerically cascade failure problem arti...,"[-0.800163, -0.023465438, 0.13609968, -0.00636..."
1,phys,Turbulent Diffusion of Lines and Circulations,We study material lines and passive vectors ...,study material line passive vector model turbu...,"[-0.77268314, -0.0025703087, 0.40619317, 0.102..."
2,phys,Leaky modes of a left-handed slab,Using complex plane analysis we show that le...,complex plane analysis left handed slab suppor...,"[-0.7974227, 0.2585103, 0.43905467, 0.04818795..."
3,phys,Universe Without Singularities. A Group Approa...,In the last years the traditional scenario o...,year traditional scenario big bang deeply modi...,"[-0.64683455, 0.1709517, 0.10234092, -0.029879..."
4,phys,Quantifying social group evolution,The rich set of interactions between individ...,rich set interaction individual society result...,"[0.0090104, 0.0470066, 0.22108169, -0.152166, ..."
...,...,...,...,...,...
795,econ,Economic Development and Inequality: a complex...,By borrowing methods from complex system ana...,borrow method complex system analysis paper an...,"[-0.4228558, 0.44220725, 0.25396243, 0.257067,..."
796,econ,Testing for Common Breaks in a Multiple Equati...,The issue addressed in this paper is that of...,issue address paper testing common break equat...,"[-0.57197225, 0.13659757, 0.08875729, 0.279514..."
797,econ,A/B Testing of Auctions,"For many application areas A/B testing, whic...",application area b testing partition user syst...,"[-0.5779062, -0.16374889, 0.5059078, -0.515197..."
798,econ,Nonparametric Analysis of Random Utility Models,This paper develops and implements a nonpara...,paper develop implement nonparametric test ran...,"[-0.6584115, -0.25957307, 0.008531444, 0.03361..."
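
One caveat with the CSV written by the vectorization cell: a CSV cell can only hold text, so each embedding array in `vectorized_abstract` is stored as its string representation, and reading the file back yields strings rather than numeric vectors. One possible workaround, sketched here with the standard library only, is to serialize each vector as a JSON list before writing and to parse it again on load. The three-element vectors and the in-memory buffer are stand-ins for the real 768-dimensional embeddings and the CSV file:

```python
import csv
import io
import json

# Stand-in rows; real embeddings would come from get_bert_embeddings(...).tolist()
rows = [
    {"category": "phys", "vectorized_abstract": [-0.8, -0.02, 0.14]},
    {"category": "econ", "vectorized_abstract": [-0.66, -0.26, 0.01]},
]

# Write: encode each vector as a JSON string so it survives the text-only CSV format
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["category", "vectorized_abstract"])
writer.writeheader()
for row in rows:
    writer.writerow({
        "category": row["category"],
        "vectorized_abstract": json.dumps(row["vectorized_abstract"]),
    })

# Read: decode the JSON string back into a list of floats
buffer.seek(0)
restored = [
    {"category": r["category"], "vectorized_abstract": json.loads(r["vectorized_abstract"])}
    for r in csv.DictReader(buffer)
]
print(restored[0]["vectorized_abstract"])
```

With pandas, the same idea would be to apply `lambda v: json.dumps(v.tolist())` to the `vectorized_abstract` column before `to_csv`, and to pass `converters={'vectorized_abstract': json.loads}` to `pd.read_csv` when loading.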


Extract new abstracts for testing, skipping any abstract that is already present in the sampled set.

In [3]:
import os

import pandas as pd

# --- CONFIGURATION ---
# Full arXiv metadata JSON file
data_path = '/Users/mariapazoliva/PycharmProjects/Data/arxiv-metadata-oai-snapshot.json'
# Previously sampled articles CSV file
sampled_csv_path = '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/jupyter_notebooks/data_arxiv_articles/sampled_arxiv_articles.csv'

# Categories of interest
categories = ['phys', 'math', 'cs', 'q-bio', 'q-fin', 'stat', 'eess', 'econ']
# Number of new abstracts to extract per category
needed_per_category = 2

# --- LOAD PREVIOUSLY SAMPLED ABSTRACTS ---
# Read the CSV with sampled articles
if os.path.exists(sampled_csv_path):
    sampled_df = pd.read_csv(sampled_csv_path)
    # Create a set of abstracts that we already have (strip whitespace for robust matching)
    existing_abstracts = set(sampled_df['abstract'].dropna().str.strip().tolist())
else:
    existing_abstracts = set()

# --- PREPARE A DICTIONARY TO COLLECT NEW SAMPLES ---
new_samples = {category: [] for category in categories}
needed_samples = {category: needed_per_category for category in categories}

# --- PROCESS THE JSON FILE IN CHUNKS ---
chunk_size = 1000
reader = pd.read_json(data_path, lines=True, chunksize=chunk_size)

for chunk in reader:
    # For each category, check if we need more samples
    for category in categories:
        if needed_samples[category] <= 0:
            continue

        # Filter rows where the 'categories' column contains the current category
        # Make sure the abstract is not missing and not already in our existing abstracts
        mask = (
                chunk['categories'].str.contains(category, na=False) &
                chunk['abstract'].notna() &
                (~chunk['abstract'].str.strip().isin(existing_abstracts))
        )
        filtered = chunk[mask]

        # If there are any new samples in this chunk for this category, sample the needed amount
        if not filtered.empty:
            # Determine the number of samples to take
            n_to_sample = min(needed_samples[category], len(filtered))
            sampled = filtered.sample(n=n_to_sample, replace=False, random_state=42)

            # Append the sampled rows as dictionaries (we can keep only the columns of interest)
            for _, row in sampled.iterrows():
                new_samples[category].append({
                    'category': category,
                    'title': row.get('title', ''),
                    'abstract': row['abstract'].strip()
                })
                # Also add to the existing abstracts set to avoid duplicates in subsequent chunks
                existing_abstracts.add(row['abstract'].strip())

            # Update the number of samples still needed for this category
            needed_samples[category] -= n_to_sample

    # If all categories have enough samples, break out of the loop early.
    if all(n <= 0 for n in needed_samples.values()):
        break

# --- COMBINE THE NEW SAMPLES ---
# Flatten the new_samples dictionary into a single DataFrame
new_samples_list = []
for cat_samples in new_samples.values():
    new_samples_list.extend(cat_samples)

new_samples_df = pd.DataFrame(new_samples_list)

# Optionally, save the new samples to a CSV file
output_path = '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/jupyter_notebooks/data_arxiv_articles/test_sampled_arxiv_articles.csv'
new_samples_df.to_csv(output_path, index=False)

print(f"Extracted new samples for each category and saved to {output_path}")
print(new_samples_df)

Extracted new samples for each category and saved to /Users/mariapazoliva/PycharmProjects/ArticlesClassifier/jupyter_notebooks/data_arxiv_articles/test_sampled_arxiv_articles.csv
   category                                              title  \
0      phys  Fabrication of Analog Electronics for Serial R...   
1      phys     Modal Extraction in Spatially Extended Systems   
2      math  On second order shape optimization methods for...   
3      math             Iterated integral and the loop product   
4        cs  On-line Viterbi Algorithm and Its Relationship...   
5        cs  Opportunistic Relay Selection with Limited Fee...   
6     q-bio  Modeling the effects of HIV-1 virions and prot...   
7     q-bio  Stability of the splay state in pulse--coupled...   
8     q-fin  Capital Allocation to Business Units and Sub-P...   
9     q-fin  The public goods game on homogeneous and heter...   
10     stat  Transient Dynamics of Sparsely Connected Hopfi...   
11     stat  A Dynamic Algori

In [4]:
new_samples_df

Unnamed: 0,category,title,abstract
0,phys,Fabrication of Analog Electronics for Serial R...,A set of analog electronics boards for serial ...
1,phys,Modal Extraction in Spatially Extended Systems,We describe a practical procedure for extracti...
2,math,On second order shape optimization methods for...,This paper is devoted to the analysis of a sec...
3,math,Iterated integral and the loop product,In this article we discuss a relation between ...
4,cs,On-line Viterbi Algorithm and Its Relationship...,"In this paper, we introduce the on-line Viterb..."
5,cs,Opportunistic Relay Selection with Limited Fee...,It has been shown that a decentralized relay s...
6,q-bio,Modeling the effects of HIV-1 virions and prot...,We report a first in modeling and simulation o...
7,q-bio,Stability of the splay state in pulse--coupled...,The stability of the dynamical states characte...
8,q-fin,Capital Allocation to Business Units and Sub-P...,Despite the fact that the Euler allocation pri...
9,q-fin,The public goods game on homogeneous and heter...,We propose an extended public goods interactio...


In [9]:
import json

# Load the dataset with new articles
test_data = pd.read_csv(
    '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/jupyter_notebooks/data_arxiv_articles/test_sampled_arxiv_articles.csv')

# Convert the samples to a list of dictionaries (each dictionary is a JSON payload)
payloads = test_data.to_dict(orient='records')

# Print out the JSON payloads in a readable format
print(json.dumps(payloads, indent=2))

# Save the JSON payloads to a file
with open(
        '/Users/mariapazoliva/PycharmProjects/ArticlesClassifier/jupyter_notebooks/data_arxiv_articles/test_payloads.json',
        'w') as f:
    json.dump(payloads, f, indent=2)

[
  {
    "category": "phys",
    "title": "Fabrication of Analog Electronics for Serial Readout of Silicon Strip\n  Sensors",
    "abstract": "A set of analog electronics boards for serial readout of silicon strip\nsensors was fabricated. A commercially available amplifier is mounted on a\nhomemade hybrid board in order to receive analog signals from silicon strip\nsensors. Also, another homemade circuit board is fabricated in order to\ntranslate amplifier control signals into a suitable format and to provide bias\nvoltage to the amplifier as well as to the silicon sensors. We discuss\ntechnical details of the fabrication process and performance of the circuit\nboards we developed."
  },
  {
    "category": "phys",
    "title": "Modal Extraction in Spatially Extended Systems",
    "abstract": "We describe a practical procedure for extracting the spatial structure and\nthe growth rates of slow eigenmodes of a spatially extended system, using a\nunique experimental capability both to im