## Data Retrieval
In this part of the project, we fetched the data from the **ArXiv API** and created the dataset to which we apply our Deep Learning strategies and models.

In [None]:
import pandas as pd
import numpy as np
import arxiv
import requests
from bs4 import BeautifulSoup

### Downloading the ArXiv category names
Each ArXiv article is marked with a category, which will be our target value in the classification task.\
In the following section, we collected all the **ArXiv categories** from the website by web scraping its category taxonomy page.

In [None]:
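arxiv_categories_url = "https://arxiv.org/category_taxonomy"
arxiv_categories_response = requests.get(arxiv_categories_url)
# Stop early if the taxonomy page could not be downloaded.
arxiv_categories_response.raise_for_status()
webpage_html = BeautifulSoup(arxiv_categories_response.content, "html.parser")
# find_all is the current name of the deprecated findAll alias.
arxiv_categories_html = webpage_html.find_all("div", {"class": "column is-one-fifth"})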

arxiv_categories = []

for category in arxiv_categories_html:

    category_title = category.find("h4")

    if category_title is not None:

        category_string = category_title.text.strip()
        category_string_splitted = category_string.split(' ',1)

        arxiv_categories.append({ 
            'categoryId': category_string_splitted[0], 
            'categoryName': category_string_splitted[1].replace('(', '').replace(')','') 
        })
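The parsing step in the loop above can be sketched as a small standalone function. This is a minimal sketch, assuming each heading has the form `identifier (name)`, as the original split implies; the helper name `parse_category_heading` is hypothetical and not part of the notebook. Unlike the loop, it returns an empty name instead of raising an `IndexError` when a heading has no name part.

```python
def parse_category_heading(heading):
    """Split a heading such as 'cs.AI (Artificial Intelligence)' into id and name.

    Hypothetical helper: assumes the 'identifier (name)' layout used above.
    """
    parts = heading.strip().split(' ', 1)
    category_id = parts[0]
    # Headings without a name part yield an empty string instead of an error.
    category_name = parts[1].replace('(', '').replace(')', '') if len(parts) > 1 else ''
    return {'categoryId': category_id, 'categoryName': category_name}


parsed_heading = parse_category_heading("cs.AI (Artificial Intelligence)")
print(parsed_heading)
```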

### ArXiv API
Here we retrieved the data from the ArXiv API using the `arxiv` Python library.\
We also created a **Pandas dataframe** from the fetched data and wrote it to a **CSV file**.\
For each **ArXiv article**, the following information is available:
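- `id`: the article's ArXiv entry identifier
- `title`: the article title
- `publishedDate`: the publication date
- `authorsList`: the list of authors
- `abstract`: the article summary
- `categoryName`, `categoryId`: the name and identifier of the category that was queried
- `categoryNumber`: the position of the category in the scraped list, starting at 1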
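The CSV step mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions: the rows are hypothetical placeholders rather than real articles, the authors are assumed to be plain name strings, and an in-memory buffer stands in for the output file so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical placeholder rows with a subset of the columns listed above.
sample_rows = [
    {'id': 'placeholder-1', 'title': 'Example title A',
     'authorsList': ['Author One', 'Author Two'],
     'abstract': 'Example abstract A', 'categoryId': 'cs.AI', 'categoryNumber': 1},
    {'id': 'placeholder-2', 'title': 'Example title B',
     'authorsList': ['Author Three'],
     'abstract': 'Example abstract B', 'categoryId': 'math.AG', 'categoryNumber': 2},
]

sample_dataset = pd.DataFrame(sample_rows)

# Flatten each author list into one string so the CSV cell stays readable.
sample_dataset['authorsList'] = sample_dataset['authorsList'].apply('; '.join)

# Write to an in-memory buffer; a file path would be passed to to_csv instead.
csv_buffer = io.StringIO()
sample_dataset.to_csv(csv_buffer, index=False)

# Read the CSV back to check that the columns survive the round trip.
csv_buffer.seek(0)
reloaded_dataset = pd.read_csv(csv_buffer)
print(reloaded_dataset.shape)
```

Passing `index=False` keeps the dataframe's row index out of the file, so reading it back does not produce an extra unnamed column.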

In [None]:
arxiv_dataset = pd.DataFrame()

In [None]:
# Collect one dictionary per article and build the dataframe once at the end:
# DataFrame.append was removed in pandas 2.0 and was slow when called per row.
article_rows = []

# categoryNumber is the 1-based position of the category in the scraped list.
for categoryNumber, category in enumerate(arxiv_categories, start=1):

    category_articles = arxiv.Search(
        query = category['categoryId'],
        max_results = 500,
        sort_by = arxiv.SortCriterion.Relevance,
        sort_order = arxiv.SortOrder.Descending
    )

    for category_article in category_articles.results():

        article_rows.append({
            'id': category_article.entry_id,
            'title': category_article.title,
            'publishedDate': category_article.published,
            'authorsList': category_article.authors,
            'abstract': category_article.summary,
            'categoryName': category['categoryName'],
            'categoryId': category['categoryId'],
            'categoryNumber': categoryNumber
        })

arxiv_dataset = pd.DataFrame(article_rows)
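The search above passes the bare category identifier as the query, which the API may match against any field of an article rather than only its category. As a sketch, assuming the goal is to restrict results to articles filed under the category itself, the query can use the `cat:` field prefix from the arXiv API query syntax. The helper name `build_category_query` and the sample list are hypothetical.

```python
def build_category_query(category_id):
    """Build an arXiv API query restricted to one category.

    Hypothetical helper: 'cat:' is the category field prefix of the arXiv API.
    """
    return f"cat:{category_id}"


# Hypothetical sample with the same structure as the scraped category list.
sample_categories = [
    {'categoryId': 'cs.AI', 'categoryName': 'Artificial Intelligence'},
    {'categoryId': 'math.AG', 'categoryName': 'Algebraic Geometry'},
]

category_queries = [build_category_query(c['categoryId']) for c in sample_categories]
print(category_queries)
```

Each resulting string could then be passed as the `query` argument of `arxiv.Search` in place of the bare identifier.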

In [None]:
arxiv_dataset