The Arxiv34k6L dataset contains arXiv abstracts and their categories: 34,068 rows covering 6 category combinations. The data collected from arXiv was highly imbalanced, so I kept only the 6 most frequent category combinations and trimmed the largest one to make the data easier to work with. The dataset can be used to practice building a multi-label text classification model that predicts the categories from an abstract. The data collection steps are shared below; feel free to customize them to your needs.
___
The inspiration behind this dataset is [this project](https://keras.io/examples/nlp/multi_label_classification/) on Keras.

In [6]:
!pip install arxiv

Collecting arxiv
  Downloading arxiv-2.1.0-py3-none-any.whl.metadata (6.1 kB)
Collecting feedparser==6.0.10 (from arxiv)
  Downloading feedparser-6.0.10-py3-none-any.whl.metadata (2.3 kB)
Downloading arxiv-2.1.0-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Installing collected packages: feedparser, arxiv
  Attempting uninstall: feedparser
    Found existing installation: feedparser 6.0.11
    Uninstalling feedparser-6.0.11:
      Successfully uninstalled feedparser-6.0.11
Successfully installed arxiv-2.1.0 feedparser-6.0.10


In [8]:
import arxiv 
import pandas as pd

keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training\"",
    "\"generative adversarial networks\"",
    "\"model compressions\"",
    "\"image segmentation\"",
    "\"few-shot learning\"",
    "\"natural language\"",
    "\"graph\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention\"",
    "\"tabular\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series\"",
    "\"molecule\"",
]
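
Escaping every quote by hand is easy to get wrong. As a minimal sketch, the quoted queries could instead be generated from plain phrases. The names `quote_phrase`, `phrases`, and `quoted_keywords` are illustrative and not part of the original notebook, and only three of the phrases are shown.

```python
def quote_phrase(phrase: str) -> str:
    """Wrap a phrase in double quotes so it is sent as a quoted query."""
    return f'"{phrase.strip()}"'

# A few of the phrases from the list above, written without manual escaping.
phrases = ["image segmentation", "adversarial training", "image segmentation", "time series"]

# dict.fromkeys drops repeated phrases while keeping the original order.
quoted_keywords = [quote_phrase(p) for p in dict.fromkeys(phrases)]
print(quoted_keywords)
```

Dropping repeated phrases here avoids running the same query twice, which would otherwise return the same papers again.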

In [16]:
client = arxiv.Client(num_retries = 20, page_size = 500)

def query_function(query):
    # Fetch up to 20,000 results for one keyword query, most recently updated first.
    search = arxiv.Search(query = query,
                          max_results = 20000,
                          sort_by = arxiv.SortCriterion.LastUpdatedDate)
    terms = []
    titles = []
    abstracts = []

    for result in client.results(search):
        # Keep only papers whose primary category is one of these three.
        if result.primary_category in ["cs.CV", "stat.ML", "cs.LG"]:
            terms.append(result.categories)
            # BUG: this should append result.title. I did not notice until the
            # collection had finished, and it takes a long time to rerun, so the
            # "titles" column holds categories and is dropped later. Be careful.
            titles.append(result.categories)
            abstracts.append(result.summary)
    return terms, titles, abstracts
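
A corrected collection loop might look like the sketch below. It stores the real title and skips papers that an earlier keyword already returned. It relies only on the `entry_id`, `title`, `summary`, `categories`, and `primary_category` attributes of `arxiv.Result`. The helper name `collect_records` and the `seen_ids` argument are assumptions for illustration, and the stand-in objects exist only so the example runs without network access.

```python
from types import SimpleNamespace

ALLOWED_PRIMARY = {"cs.CV", "stat.ML", "cs.LG"}

def collect_records(results, seen_ids=None):
    """Turn an iterable of arxiv.Result-like objects into a list of dicts.

    seen_ids is a set shared across queries so that a paper matched by
    several keywords is kept only once (hypothetical helper, not in the notebook).
    """
    seen_ids = set() if seen_ids is None else seen_ids
    records = []
    for result in results:
        if result.primary_category not in ALLOWED_PRIMARY:
            continue  # outside the three primary categories of interest
        if result.entry_id in seen_ids:
            continue  # already collected by an earlier query
        seen_ids.add(result.entry_id)
        records.append({
            "titles": result.title,  # the title, not the categories
            "Abstracts": result.summary,
            "categories": result.categories,
        })
    return records

# Stand-in results (made-up values) used instead of a live API call.
fake_results = [
    SimpleNamespace(entry_id="id-1", title="Paper A", summary="Abstract A",
                    categories=["cs.CV"], primary_category="cs.CV"),
    SimpleNamespace(entry_id="id-1", title="Paper A", summary="Abstract A",
                    categories=["cs.CV"], primary_category="cs.CV"),
    SimpleNamespace(entry_id="id-2", title="Paper B", summary="Abstract B",
                    categories=["math.OC"], primary_category="math.OC"),
]
records = collect_records(fake_results)
print(len(records))
```

With the real client, `collect_records(client.results(search), seen_ids)` could be called once per keyword with the same `seen_ids` set, and `pd.DataFrame(records)` would then build the table directly.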

In [18]:
all_titles, all_summaries, all_terms = [], [], []

for query in keywords:
    terms, titles, abstracts = query_function(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

In [184]:
arxiv20k = pd.DataFrame({
    "titles": all_titles,
    "Abstracts": all_summaries,
    "categories": all_terms
})

arxiv20k.head()

   titles                      Abstracts                                          categories
0  [cs.CV]                     Despite the recent progress in medical image s...  [cs.CV]
1  [cs.CV, eess.IV, quant-ph]  Quantum computing is expected to transform a r...  [cs.CV, eess.IV, quant-ph]
2  [cs.CV]                     Multi-view segmentation in Remote Sensing (RS)...  [cs.CV]
3  [cs.CV]                     Compared to supervised deep learning, self-sup...  [cs.CV]
4  [cs.CV, cs.AI]              We introduce Sauron, a filter pruning method t...  [cs.CV, cs.AI]


In [185]:
arxiv20k = arxiv20k.drop("titles", axis = 1)

In [186]:
df = arxiv20k[:]
df

       Abstracts                                          categories
0      Despite the recent progress in medical image s...  [cs.CV]
1      Quantum computing is expected to transform a r...  [cs.CV, eess.IV, quant-ph]
2      Multi-view segmentation in Remote Sensing (RS)...  [cs.CV]
3      Compared to supervised deep learning, self-sup...  [cs.CV]
4      We introduce Sauron, a filter pruning method t...  [cs.CV, cs.AI]
...    ...                                                ...
93479  Discovering the 3D atomic structure of molecul...  [cs.CV, q-bio.QM]
93480  Classifying structural variability in noisy pr...  [cs.CV]
93481  In single particle reconstruction (SPR) from c...  [cs.CV]
93482  Determining the 3D structures of biological mo...  [stat.ML, cs.CV, cs.LG, q-bio.QM]


In [187]:
top_6_labels = df['categories'].value_counts().head(6).index.tolist()
df_filtered = df[df['categories'].isin(top_6_labels)]
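
The filter above keeps only the rows whose full category combination is among the 6 most frequent ones. The sketch below shows the same top-k idea in plain Python. Lists are unhashable, so each combination is converted to a tuple before counting. The function name `top_k_label_sets` and the sample rows are illustrative assumptions, not data from the notebook.

```python
from collections import Counter

def top_k_label_sets(label_lists, k=6):
    """Return the k most frequent label combinations as tuples."""
    counts = Counter(tuple(labels) for labels in label_lists)
    return [combo for combo, _ in counts.most_common(k)]

# Made-up sample of category lists.
sample = [
    ["cs.CV"], ["cs.CV"],
    ["cs.CV", "cs.AI"],
    ["cs.LG"], ["cs.LG"], ["cs.LG"],
    ["stat.ML", "cs.LG"],
]

top_2 = top_k_label_sets(sample, k=2)
top_set = set(top_2)

# Keep only the rows whose exact combination is in the top k.
kept = [labels for labels in sample if tuple(labels) in top_set]
print(top_2, len(kept))
```

Order matters in this comparison, as it does in the notebook: `[cs.CV, cs.LG]` and `[cs.LG, cs.CV]` count as different combinations unless the lists are sorted first.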

In [188]:
df_filtered

       Abstracts                                          categories
0      Despite the recent progress in medical image s...  [cs.CV]
2      Multi-view segmentation in Remote Sensing (RS)...  [cs.CV]
3      Compared to supervised deep learning, self-sup...  [cs.CV]
4      We introduce Sauron, a filter pruning method t...  [cs.CV, cs.AI]
5      Biomedical image analysis is fundamental for b...  [cs.CV]
...    ...                                                ...
93463  Simplified Molecular Input Line Entry System (...  [cs.LG]
93467  Single particle cryo-electron microscopy (EM) ...  [cs.CV]
93470  Recent advances in machine learning have made ...  [cs.LG, stat.ML]
93480  Classifying structural variability in noisy pr...  [cs.CV]


In [189]:
df_filtered["categories"].value_counts()

categories
[cs.CV]             31051
[cs.LG]              8140
[cs.LG, cs.AI]       6134
[cs.LG, stat.ML]     4267
[cs.CV, cs.AI]       3818
[cs.CV, cs.LG]       3659
Name: count, dtype: int64

In [190]:
new_df = df_filtered.sort_values(by = ["categories"])

In [191]:
new_df

       Abstracts                                          categories
0      Despite the recent progress in medical image s...  [cs.CV]
48020  Local binary pattern (LBP) as a kind of local ...  [cs.CV]
48021  Casual photography is often performed in uncon...  [cs.CV]
48023  Prior studies show that the key to face anti-s...  [cs.CV]
48024  In this paper, a novel perceptual image hashin...  [cs.CV]
...    ...                                                ...
7461   The wide-spread adoption of representation lea...  [cs.LG, stat.ML]
59004  This work pioneers regret analysis of risk-sen...  [cs.LG, stat.ML]
7462   Graphs arise naturally in many real-world appl...  [cs.LG, stat.ML]
7445   When observing a phenomenon, severe cases or a...  [cs.LG, stat.ML]


In [197]:
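# Preview the class counts after skipping the first 23,000 rows of the sorted frame.
# This removes 23,000 [cs.CV] rows; the -1 end index also leaves out the last row.
new_df["categories"][23000:-1].value_counts()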

categories
[cs.LG]             8140
[cs.CV]             8051
[cs.LG, cs.AI]      6134
[cs.LG, stat.ML]    4266
[cs.CV, cs.AI]      3818
[cs.CV, cs.LG]      3659
Name: count, dtype: int64

In [198]:
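# Drop the first 23,000 rows of the sorted frame (all [cs.CV]) to reduce the imbalance.
# Note that the -1 end index also drops the last row, a [cs.LG, stat.ML] entry.
df = new_df[23000:-1]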
df["categories"].value_counts()

categories
[cs.LG]             8140
[cs.CV]             8051
[cs.LG, cs.AI]      6134
[cs.LG, stat.ML]    4266
[cs.CV, cs.AI]      3818
[cs.CV, cs.LG]      3659
Name: count, dtype: int64
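
Slicing the sorted frame works here because `[cs.CV]` sorts first, but it depends on the sort order and only trims one class. An alternative sketch is to cap every class at a fixed size with a seeded random sample. The function name `cap_per_class`, the cap value, and the toy rows are assumptions for illustration; this is not the procedure used to build the published file.

```python
import random
from collections import defaultdict

def cap_per_class(rows, cap, seed=0):
    """Keep at most `cap` randomly chosen rows for each label combination.

    rows is a list of (abstract, labels) pairs.
    """
    groups = defaultdict(list)
    for abstract, labels in rows:
        groups[tuple(labels)].append((abstract, labels))

    rng = random.Random(seed)  # seeded so the sample is reproducible
    balanced = []
    for members in groups.values():
        if len(members) > cap:
            members = rng.sample(members, cap)
        balanced.extend(members)
    return balanced

# Toy data: 10 rows of one class and 3 rows of another.
toy_rows = [(f"abstract {i}", ["cs.CV"]) for i in range(10)]
toy_rows += [(f"abstract {i}", ["cs.LG"]) for i in range(10, 13)]

balanced = cap_per_class(toy_rows, cap=4)
print(len(balanced))
```

Classes smaller than the cap are kept whole, so only the oversized classes shrink.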

In [199]:
len(df)

34068
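
The row count can be checked against the class counts printed earlier: the slice removes the first 23,000 rows plus the last one. A quick arithmetic check, using only the numbers shown in the outputs above:

```python
# Class counts before trimming, copied from the value_counts() output.
counts_before = {
    "[cs.CV]": 31051,
    "[cs.LG]": 8140,
    "[cs.LG, cs.AI]": 6134,
    "[cs.LG, stat.ML]": 4267,
    "[cs.CV, cs.AI]": 3818,
    "[cs.CV, cs.LG]": 3659,
}
total_before = sum(counts_before.values())

# new_df[23000:-1] drops 23,000 rows from the front and 1 row from the end.
total_after = total_before - 23000 - 1
print(total_before, total_after)
```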

In [200]:
df.to_csv("arxiv34k6L.csv", index = False)
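
Because `to_csv` writes each list as its string form, the `categories` column has to be parsed back into lists after the file is loaded. The sketch below assumes the stored strings look like `['cs.CV', 'cs.AI']`, the default string form of a Python list of strings, which is worth checking against the actual file. `parse_categories` and `multi_hot` are hypothetical helper names, and the sample strings are made up.

```python
import ast

def parse_categories(text):
    """Parse a string such as "['cs.CV', 'cs.AI']" into a list of labels."""
    return list(ast.literal_eval(text))

def multi_hot(labels, vocabulary):
    """Encode a list of labels as a 0/1 vector ordered like `vocabulary`."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocabulary]

# Sample of what the categories column is assumed to contain after loading.
raw_column = ["['cs.CV']", "['cs.LG', 'stat.ML']", "['cs.CV', 'cs.AI']"]
parsed = [parse_categories(text) for text in raw_column]

# The vocabulary is the sorted set of individual categories seen in the data.
vocabulary = sorted({label for labels in parsed for label in labels})
targets = [multi_hot(labels, vocabulary) for labels in parsed]
print(vocabulary)
print(targets)
```

With pandas, the same parser could be applied as `df["categories"].apply(parse_categories)` after `pd.read_csv("arxiv34k6L.csv")`.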