## Data Retrieval
In this part of the project, we have fetched the data from the **ArXiv API** and created the dataset on which we will apply our Deep Learning strategies and models.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

import arxiv
import requests
from bs4 import BeautifulSoup

### ArXiv categories
Each ArXiv article is associated with a **category group** and marked with a **category**, both of which could be our target value in the classification task.

In the following section, we have collected all the **ArXiv categories** from the website using **web scraping**.

For each category we have this data:
- *categoryId*: the ID of the category in the ArXiv taxonomy;
- *categoryName*: the actual name of the category;
- *categoryGroup*: the group of the category. We have chosen to aggregate the subgroups of the category "Physics" into a single group, named **"Physics"**.

More information is available at this link: [ArXiv category taxonomy](https://arxiv.org/category_taxonomy)
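To make the three fields concrete, here is a minimal sketch of how one taxonomy heading could be split into its parts. It assumes that a heading has the form `cs.AI (Artificial Intelligence)`; the function name `parse_category_heading` and the key `groupPrefix` are hypothetical and used only for illustration.

```python
def parse_category_heading(heading):
    """Split a heading such as 'cs.AI (Artificial Intelligence)' into its parts."""
    # Everything before the first space is the category ID, the rest is the name
    category_id, _, rest = heading.strip().partition(' ')

    # Drop the surrounding parentheses from the name, if present
    category_name = rest.strip()
    if category_name.startswith('(') and category_name.endswith(')'):
        category_name = category_name[1:-1]

    # The part of the ID before the '.' identifies the category group
    group_prefix = category_id.split('.')[0]

    return {
        'categoryId': category_id,
        'categoryName': category_name,
        'groupPrefix': group_prefix,
    }


example = parse_category_heading('cs.AI (Artificial Intelligence)')
print(example)
```

The `groupPrefix` value (`cs` here) is what gets translated into a readable group name; prefixes of the Physics subgroups, such as `astro-ph`, would all end up in the single "Physics" group.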

In [None]:
# Function which maps an ArXiv category group prefix (e.g. 'cs' from 'cs.AI') to its group name
# Params: @category_group_id -> prefix of the category ID, i.e. the part before the '.'
# Return: a string with the category group name
def detectCategoryGroup(category_group_id):

    category_groups = {
        'cs': 'Computer Science',
        'econ': 'Economics',
        'eess': 'Electrical Engineering and Systems Science',
        'math': 'Mathematics',
        'q-bio': 'Quantitative Biology',
        'q-fin': 'Quantitative Finance',
        'stat': 'Statistics'
    }

    # Every other prefix (astro-ph, hep-th, quant-ph, ...) is a subgroup of Physics
    return category_groups.get(category_group_id, 'Physics')

In [None]:
arxiv_categories_url = "https://arxiv.org/category_taxonomy"
arxiv_categories_response = requests.get(arxiv_categories_url, timeout = 30)
arxiv_categories_response.raise_for_status()  # stop early on an HTTP error
webpage_html = BeautifulSoup(arxiv_categories_response.content, "html.parser")
arxiv_categories_html = webpage_html.find_all("div", {"class": "column is-one-fifth"})

arxiv_categories = []

for category in arxiv_categories_html:

    category_title = category.find("h4")

    if category_title is not None:

        # The heading is expected to look like "cs.AI (Artificial Intelligence)"
        category_string = category_title.text.strip()
        category_string_splitted = category_string.split(' ', 1)

        arxiv_categories.append({ 

            'categoryId': category_string_splitted[0], 
            'categoryName': category_string_splitted[1].replace('(', '').replace(')',''),
            'categoryGroup': detectCategoryGroup(category_string_splitted[0].split('.')[0])

        })

### ArXiv API
Here we have retrieved the data from the [ArXiv API](https://info.arxiv.org/help/api/index.html) using its Python wrapper, the [arxiv.py library](http://lukasschwab.me/arxiv.py/index.html).

We have also created a **Pandas dataframe** with the fetched data and written it to a **CSV file**, which will be read and processed later in the project.

For each **ArXiv article** we have the following information available:
- *link*: link to the article on the ArXiv platform;
- *title*: title of the article;
- *publishedDate*: date of the first publication;
- *authors*: list of authors separated by a ',';
- *abstract*: ArXiv abstract of the article;
- *categoryId*: the ID of the category in the ArXiv taxonomy;
- *categoryName*: the actual name of the category;
- *categoryGroup*: the group of the category.
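As an illustration of the row layout listed above, the following sketch builds a single record and writes it to CSV text using only the standard library. The helper `make_row`, the ISO date format, and all the values are placeholders chosen for the example, not a real article and not necessarily the exact format produced by Pandas.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ['link', 'title', 'publishedDate', 'authors', 'abstract',
          'categoryId', 'categoryName', 'categoryGroup']


def make_row(link, title, published, author_names, abstract, category):
    """Build one dataset row; `category` holds the three category fields."""
    return {
        'link': link,
        'title': title,
        'publishedDate': published.isoformat(),  # assumption: ISO 8601 string
        'authors': ', '.join(author_names),      # authors separated by ', '
        'abstract': abstract,
        'categoryId': category['categoryId'],
        'categoryName': category['categoryName'],
        'categoryGroup': category['categoryGroup'],
    }


# Placeholder values, for illustration only
row = make_row(
    link='https://example.org/abs/placeholder',
    title='A placeholder title',
    published=datetime(2020, 1, 1, tzinfo=timezone.utc),
    author_names=['Ada Example', 'Bob Example'],
    abstract='A placeholder abstract.',
    category={
        'categoryId': 'cs.AI',
        'categoryName': 'Artificial Intelligence',
        'categoryGroup': 'Computer Science',
    },
)

# Write the row as CSV text; the header follows the field order above
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the *authors* field itself contains commas, the CSV writer wraps it in quotes, so the value is read back as a single field. A list of such dictionaries can also be passed directly to `pd.DataFrame`.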

In [None]:
arxiv_dataset = pd.DataFrame()

In [None]:
article_rows = []

for categoryNumber, category in enumerate(arxiv_categories, start = 1):

    # Progress indicator: index of the category being fetched
    print(categoryNumber)

    # The query is passed as-is; the ArXiv API also accepts a field prefix
    # such as 'cat:cs.AI' to match on the category field only
    category_articles = arxiv.Search(

        query = category['categoryId'],
        max_results = 500,
        sort_by = arxiv.SortCriterion.Relevance,
        sort_order = arxiv.SortOrder.Descending

    )

    for category_article in category_articles.results():

        authors = ', '.join(author.name for author in category_article.authors)

        article_rows.append({

            'link': category_article.entry_id,
            'title': category_article.title,
            'publishedDate': category_article.published,
            'authors': authors,
            'abstract': category_article.summary,
            'categoryId': category['categoryId'],
            'categoryName': category['categoryName'],
            'categoryGroup': category['categoryGroup']

        })

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list
# and the dataframe is built once at the end
arxiv_dataset = pd.DataFrame(article_rows)

In [82]:
arxiv_dataset.to_csv('arxiv-dataset.csv', encoding = 'utf-8', index = False)