# Data Collection

In this notebook, I implement the data collection step.

The main idea is to use the ``arxiv`` library and filter by queries on the topic of ``Video Generation``.

I identified several closely related topics that may concern generation:
- video generation
- text-to-video
- video synthesis
- generative video
- video diffusion
- long video generation
- video transformer
- motion synthesis
- spatiotemporal generation
- video autoregressive
- video GAN

I also decided to keep papers published in 2025 so that the results can be tested on them later.

The conceptually important point here is the query form ``all:video AND all:generation``, which collects papers that contain both the word ``video`` and the word ``generation``. The alternative is the query ``all: video generation``, which returns fewer papers ($\approx 700$).
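
To make the query form concrete, here is a minimal sketch of how a topic phrase could be turned into an AND-query of this shape. The helper name `build_and_query` is hypothetical: it is not part of the ``arxiv`` library, and the notebook itself lists the queries by hand.

```python
def build_and_query(phrase: str, field: str = "all") -> str:
    """Join every word of the phrase with AND, prefixing each with the field.

    Hypothetical helper for illustration only; it just builds a string.
    """
    terms = phrase.split()
    if not terms:
        raise ValueError("phrase must contain at least one term")
    return " AND ".join(f"{field}:{term}" for term in terms)


# A few of the topics listed above
topics = ["video generation", "text-to-video", "long video generation"]
queries = [build_and_query(topic) for topic in topics]

for topic, query in zip(topics, queries):
    print(f"{topic!r} -> {query!r}")
```

Each word becomes its own ``all:`` term, so a paper matches as long as every word appears somewhere in its metadata, not necessarily side by side. This is what widens the coverage and also what lets noise in.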

The advantage of my approach is that it captures a large number of papers, but an important drawback is that it adds noise to the data. To remove this noise, I use ``HDBSCAN``, a clustering method that can set aside a separate noise cluster. This strategy is not ideal, but given the broad coverage of the data, I consider it justified and effective.
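
As a rough sketch of the filtering idea: HDBSCAN implementations conventionally mark noise points with the label `-1`, so once labels are available, dropping noise is a simple filter. The titles and labels below are invented for illustration, and `drop_noise` is a hypothetical helper. The actual clustering is done in a later notebook, not here.

```python
NOISE_LABEL = -1  # label that HDBSCAN implementations conventionally assign to noise


def drop_noise(records, labels, noise_label=NOISE_LABEL):
    """Keep only the records whose cluster label is not the noise label."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    return [rec for rec, lab in zip(records, labels) if lab != noise_label]


# Made-up titles and labels; in practice the labels would come from
# fitting a clusterer on text embeddings (e.g. via a fit_predict call).
titles = [
    "Text-to-video diffusion with temporal attention",
    "Biomedical image segmentation network",
    "Long video generation with autoregressive transformers",
]
labels = [0, -1, 0]

kept = drop_noise(titles, labels)
print(kept)
```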

In [None]:
import arxiv
from tqdm.notebook import tqdm
import pandas as pd
import time

# Setting up the arXiv client with retries, delays, and page size
client = arxiv.Client(num_retries=7, delay_seconds=3, page_size=100)

# List of query templates to search for video-related generation topics
query_templates = [
    ('all:video AND all:generation', "Video generation"),
    ('all:text-to-video', "Text-to-video"),
    ('all:video AND all:synthesis', "Video synthesis"),
    ('all:generative AND all:video', "Generative video"),
    ('all:video AND all:diffusion', "Video diffusion"),
    ('all:long AND all:video AND all:generation', "Long videos"),
    ('all:video AND all:transformer', "Video transformer"),
    ('all:motion AND all:synthesis', "Motion synthesis"),
    ('all:spatiotemporal AND all:generation', "Spatiotemporal generation"),
    ('all:video AND all:autoregressive', "Autoregressive video"),
    ('all:video AND all:GAN', "Video GAN")
]

# Function to perform search query within a specific date range
def search_by_query(query: str, start_date: str, end_date: str):
    # Adding date range condition to query
    date_query = f'AND submittedDate:[{start_date} TO {end_date}]'
    search = arxiv.Search(
        query=f'{query} {date_query}',
        max_results=float("inf"),  # retrieve as many results as possible
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )

    results = []
    try:
        for result in client.results(search):
            # Collecting paper metadata
            results.append({
                "entry_id": result.entry_id,
                "arxiv_id": result.get_short_id(),
                "title": result.title.strip(),
                "authors": ', '.join(str(a) for a in result.authors),
                "abstract": result.summary.replace('\n', ' ').strip(),
                "published": result.published.strftime("%Y-%m-%d"),
                "updated": result.updated.strftime("%Y-%m-%d"),
                "year": result.published.year,
                "categories": ', '.join(result.categories),
                "primary_category": result.primary_category,
                "pdf_url": result.pdf_url,
                "arxiv_url": result.entry_id,
                "doi": result.doi or '',
                "comment": result.comment or '',
                "journal_ref": result.journal_ref or '',
            })
    except arxiv.UnexpectedEmptyPageError as e:
        # Handling occasional API errors without stopping the process
        print(f"UnexpectedEmptyPageError encountered: {e}. Continuing with next query.")
    return results

# Dictionary to hold unique results based on entry IDs
all_results = {}

# Defining date ranges (quarters) to segment the API queries and avoid API limitations
date_ranges = [
    ("202401010000", "202403312359"),
    ("202404010000", "202406302359"),
    ("202407010000", "202409302359"),
    ("202410010000", "202412312359"),
    ("202501010000", "202503312359"),
]

# Iterating through each query template and date range
for query, description in tqdm(query_templates, desc="Collecting data by queries"):
    print(f'Executing query: {description}')
    for start_date, end_date in date_ranges:
        print(f'   -> Interval: {start_date[:8]} - {end_date[:8]}')
        results = search_by_query(query, start_date, end_date)
        # Ensuring results are unique by checking entry IDs
        for res in results:
            if res['entry_id'] not in all_results:
                all_results[res['entry_id']] = res
        time.sleep(2)  # Delay between intervals to respect API limits

# Outputting the total number of unique articles collected
print(f"\nTotal unique papers collected: {len(all_results)}")

# Saving collected data to CSV file for further analysis
df = pd.DataFrame(all_results.values())
df.to_csv("data/arxiv_video_generation_papers_2024_2025.csv", index=False)

Collecting data by queries:   0%|          | 0/11 [00:00<?, ?it/s]

Executing query: Video generation
   -> Interval: 20240101 - 20240331
   -> Interval: 20240401 - 20240630
   -> Interval: 20240701 - 20240930
   -> Interval: 20241001 - 20241231
   -> Interval: 20250101 - 20250331
Executing query: Text-to-video
   -> Interval: 20240101 - 20240331
   -> Interval: 20240401 - 20240630
   -> Interval: 20240701 - 20240930
   -> Interval: 20241001 - 20241231
   -> Interval: 20250101 - 20250331
Executing query: Video synthesis
   -> Interval: 20240101 - 20240331
   -> Interval: 20240401 - 20240630
   -> Interval: 20240701 - 20240930
   -> Interval: 20241001 - 20241231
   -> Interval: 20250101 - 20250331
Executing query: Generative video
   -> Interval: 20240101 - 20240331
   -> Interval: 20240401 - 20240630
   -> Interval: 20240701 - 20240930
   -> Interval: 20241001 - 20241231
   -> Interval: 20250101 - 20250331
Executing query: Video diffusion
   -> Interval: 20240101 - 20240331
   -> Interval: 20240401 - 20240630
   -> Interval: 20240701 - 20240930
   -> I
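
The quarterly intervals in the cell above are written out by hand. As a sketch, they could also be generated, which may help if the collection window changes later. The function name `quarter_ranges` is my own and hypothetical; the output format follows the `YYYYMMDDHHMM` strings used in `date_ranges`.

```python
import calendar


def quarter_ranges(start_year, start_quarter, end_year, end_quarter):
    """Return (start, end) strings in YYYYMMDDHHMM format for each quarter, inclusive."""
    ranges = []
    year, quarter = start_year, start_quarter
    while (year, quarter) <= (end_year, end_quarter):
        first_month = 3 * (quarter - 1) + 1
        last_month = first_month + 2
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, last_month)[1]
        ranges.append((
            f"{year}{first_month:02d}010000",
            f"{year}{last_month:02d}{last_day:02d}2359",
        ))
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return ranges


# Same window as in the notebook: Q1 2024 through Q1 2025
generated_ranges = quarter_ranges(2024, 1, 2025, 1)
for start, end in generated_ranges:
    print(start, end)
```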

Let's take a look at the data.

In [16]:
df.head(3)

Unnamed: 0,entry_id,arxiv_id,title,authors,abstract,published,updated,year,categories,primary_category,pdf_url,arxiv_url,doi,comment,journal_ref
0,http://arxiv.org/abs/2404.08221v1,2404.08221v1,Uncertain Boundaries: Multidisciplinary Approa...,"Jocelyn Dzuong, Zichong Wang, Wenbin Zhang",In the rapidly evolving landscape of generativ...,2024-03-31,2024-03-31,2024,"cs.LG, cs.AI, cs.CY",cs.LG,http://arxiv.org/pdf/2404.08221v1,http://arxiv.org/abs/2404.08221v1,,,
1,http://arxiv.org/abs/2404.00777v1,2404.00777v1,Privacy-preserving Optics for Enhancing Protec...,"Jhon Lopez, Carlos Hinojosa, Henry Arguello, B...",The modern surge in camera usage alongside wid...,2024-03-31,2024-03-31,2024,"cs.CV, cs.AI, cs.CR, cs.LG, eess.IV",cs.CV,http://arxiv.org/pdf/2404.00777v1,http://arxiv.org/abs/2404.00777v1,,Accepted to CVPR 2024. Project Website and Cod...,
2,http://arxiv.org/abs/2404.00726v1,2404.00726v1,MugenNet: A Novel Combined Convolution Neural ...,"Chen Peng, Zhiqin Qian, Kunyu Wang, Qi Luo, Zh...",Biomedical image segmentation is a very import...,2024-03-31,2024-03-31,2024,"eess.IV, cs.CV, cs.LG",eess.IV,http://arxiv.org/pdf/2404.00726v1,http://arxiv.org/abs/2404.00726v1,,,


Let's select only the papers from 2024.

In [17]:
df_2024 = df[df['year'] == 2024]
df_2024 = df_2024.sort_values(by='published', ascending=False)
df_2024.shape

(6916, 15)

In [19]:
df_2024.head(3)

Unnamed: 0,entry_id,arxiv_id,title,authors,abstract,published,updated,year,categories,primary_category,pdf_url,arxiv_url,doi,comment,journal_ref
3427,http://arxiv.org/abs/2501.00601v2,2501.00601v2,DreamDrive: Generative 4D Scene Modeling from ...,"Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao C...",Synthesizing photo-realistic visual observatio...,2024-12-31,2025-01-03,2024,"cs.CV, cs.AI, cs.GR",cs.CV,http://arxiv.org/pdf/2501.00601v2,http://arxiv.org/abs/2501.00601v2,,Project page: https://pointscoder.github.io/Dr...,
3433,http://arxiv.org/abs/2501.00352v1,2501.00352v1,PanoSLAM: Panoptic 3D Scene Reconstruction via...,"Runnan Chen, Zhaoqing Wang, Jiepeng Wang, Yuex...","Understanding geometric, semantic, and instanc...",2024-12-31,2024-12-31,2024,"cs.CV, cs.RO",cs.CV,http://arxiv.org/pdf/2501.00352v1,http://arxiv.org/abs/2501.00352v1,,,
7616,http://arxiv.org/abs/2501.00378v1,2501.00378v1,STARFormer: A Novel Spatio-Temporal Aggregatio...,"Wenhao Dong, Yueyang Li, Weiming Zeng, Lei Che...",Many existing methods that use functional magn...,2024-12-31,2024-12-31,2024,"eess.IV, cs.CV, cs.LG",eess.IV,http://arxiv.org/pdf/2501.00378v1,http://arxiv.org/abs/2501.00378v1,,,


Let's save the data to a CSV file.

In [None]:
df_2024.to_csv('data/video_generation_2024.csv', index=False)

I now have a dataset for analysis. Let's move on to ``2_data_observation.ipynb`` to look at the data in detail and check it for sanity.