### Data preparation

In this part of the project, we perform a **first processing pass on the initial ArXiv data**, with the aim of obtaining the more structured information needed for our purposes.

The steps were as follows:
- We kept only the **category groups** (already explained in the data retrieval part) of the scientific articles and associated a **label** (a progressive number) with each of them:
    - Computer Science (0)
    - Economics (1)
    - Electrical Engineering and Systems Science (2)
    - Mathematics (3)
    - Physics (4)
    - Quantitative Biology (5)
    - Quantitative Finance (6)
    - Statistics (7)
    
    These labels are our **target variable**, which we will predict using deep learning algorithms.

- Furthermore, we combined the **title, author names and abstract** into a single text, which will be manipulated and analyzed during the project to solve our text classification task.

In [1]:
import pandas as pd
import numpy as np

In [2]:
arxiv_data = pd.read_csv('../../data/arxiv-dataset.csv')

In [3]:
# Assign a progressive integer label to each distinct category group
category_groups = arxiv_data['categoryGroup'].unique()
category_L1 = pd.DataFrame({
    'category': category_groups,
    'label': np.arange(len(category_groups))
})

category_L1.to_csv('../../data/cat1-label.csv', encoding = 'utf-8', index = False)
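
Note that `Series.unique()` returns values in order of first appearance, so the numbering listed above (0 to 7, alphabetical) holds only if the category groups first appear in that order in the CSV. A minimal sketch, assuming one wants the mapping to be independent of row order, sorts the categories before numbering them. The toy data and the `build_label_table` helper are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd


def build_label_table(groups: pd.Series) -> pd.DataFrame:
    """Map each distinct category group to a progressive integer label.

    Hypothetical helper: sorting makes the mapping independent of row order.
    """
    categories = sorted(groups.dropna().unique())
    return pd.DataFrame({
        'category': categories,
        'label': np.arange(len(categories)),
    })


# Toy data standing in for arxiv_data['categoryGroup'] (values are illustrative)
toy_groups = pd.Series(['Physics', 'Computer Science', 'Physics', 'Mathematics'])
label_table = build_label_table(toy_groups)
print(label_table)
```

With this variant, the label of a category depends only on the set of category names, not on how the rows happen to be ordered in the file.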

In [4]:
# Attach the numeric label to each article via its category group
processed_data_catL1 = pd.merge(arxiv_data, category_L1, left_on = 'categoryGroup', right_on = 'category', how = 'left')

# Merge textual data (title, authors and abstract) into a single column
processed_data_catL1['text'] = (
    processed_data_catL1['title']
    .str.cat(processed_data_catL1['authors'], sep = ' ')
    .str.cat(processed_data_catL1['abstract'], sep = ' ')
)

# Keep only the model input (text) and the target (label)
processed_data_catL1 = processed_data_catL1[['text', 'label']]

processed_data_catL1.to_csv('../../data/arxiv-dataset-cat1.csv', encoding = 'utf-8', index = False)
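
One caveat with `str.cat`: when `na_rep` is not set, a missing value in any of the concatenated columns makes the combined text missing for that row. Whether the ArXiv data actually contains missing titles, authors or abstracts is not stated here, so the following is only a sketch, on hypothetical toy data, of how such rows could be kept by treating missing fields as empty strings.

```python
import pandas as pd

# Hypothetical toy rows; the second one has no authors
toy = pd.DataFrame({
    'title': ['A title', 'Another title'],
    'authors': ['A. Author', None],
    'abstract': ['Some abstract.', 'Another abstract.'],
})

# Default behaviour: the second row becomes missing because 'authors' is missing
text_default = toy['title'].str.cat(toy[['authors', 'abstract']], sep=' ')

# With na_rep='', missing fields are treated as empty strings instead
text_filled = toy['title'].str.cat(toy[['authors', 'abstract']], sep=' ', na_rep='')

print(text_default.isna().tolist())
print(text_filled.isna().tolist())
```

Checking `processed_data_catL1['text'].isna().sum()` before writing the CSV would show whether this case occurs in the real data.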