## Overview
Our project analyzes academic papers from arXiv, a leading repository of research papers across many disciplines. By extracting and analyzing the text of each paper's summary, we aim to develop a system that automatically assigns relevant labels to each paper, categorizing it into the appropriate research topics. The dataset consists of publicly available academic papers, with key attributes such as the paper title, summary, and associated labels.

## Prerequisites
Before running the project, make sure the packages below are installed.

```
# Fetching data through the arXiv API
!pip install feedparser

# Word cloud
!pip install wordcloud

# Text preprocessing and encoding
!pip install nltk
!pip install scikit-learn

# Traditional ML models (scikit-learn is installed above)
!pip install xgboost

# Deep learning / Transformer models (nltk and scikit-learn are installed above)
!pip install pandas
!pip install numpy
!pip install torch
!pip install pytorch_lightning
!pip install torchmetrics
!pip install torchtext
!pip install transformers
```


## Usage

Once the libraries mentioned above are installed, you are ready to explore our project!

**Data Acquisition and Preprocessing**

Begin data acquisition by running the code in Data_Acquisition_Preprocessing.ipynb. Before running it, set the search term by updating the search_query argument of the function call in the designated section of the code.

* For single-word queries (e.g., "physics"), simply input the search term as-is.
* For multi-word queries (e.g., "machine learning"), replace the space with "+" (e.g., "machine+learning").

No other modifications are required to the code.
Once the search query is specified, the code will begin scraping research papers matching the given query from arXiv.org. It will collect papers in batches (default: 100 papers per batch) until the specified total number of papers is retrieved. The data will be saved and downloaded as a .csv file containing paper details such as title, authors, summary, publication date, and category.
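
The query formatting and batching described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`to_search_term`, `batch_urls`) and the `all:` field prefix are assumptions, while the endpoint and the `search_query`, `start`, and `max_results` parameters come from the public arXiv API.

```python
from typing import List

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def to_search_term(text: str) -> str:
    """Join the words of a query with '+', e.g. 'machine learning' -> 'machine+learning'."""
    return "+".join(text.split())


def batch_urls(search_query: str, total: int, batch_size: int = 100) -> List[str]:
    """Build one request URL per batch until `total` papers are covered.

    The 'all:' prefix (search every field) is an assumption; the notebook
    may restrict the search to a specific field instead.
    """
    urls = []
    for start in range(0, total, batch_size):
        size = min(batch_size, total - start)  # the last batch may be smaller
        urls.append(
            f"{ARXIV_API}?search_query=all:{search_query}"
            f"&start={start}&max_results={size}"
        )
    return urls


urls = batch_urls(to_search_term("machine learning"), total=250)
for url in urls:
    print(url)
```

Each URL could then be passed to `feedparser.parse(url)`, whose result exposes the returned papers through its `entries` attribute.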

After that, go to the preprocessing section of the notebook, where the code cleans, formats, and preprocesses the merged data. This step ensures the dataset is clean, consistent, and ready for analysis.
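
As a rough idea of what cleaning a summary can involve, here is a small sketch. The exact steps in the notebook may differ; `clean_summary` is a hypothetical helper, and the choice to keep only lowercase letters is an assumption made for illustration.

```python
import re


def clean_summary(text: str) -> str:
    """Lowercase the text, drop non-letter characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse line breaks and repeated spaces


raw = "We propose a NEW method,\n  based on Graph-Neural Networks (GNNs)."
print(clean_summary(raw))
# we propose a new method based on graph neural networks gnns
```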

**Exploratory Data Analysis**

The first analysis phase is exploratory data analysis (EDA). Open EDA.ipynb and run all the code cells to see our visualizations. These include: Number of Papers per Year, Distribution of Papers by Number of Categories, and Word Clouds for two time periods, 2005-2014 and 2015-2024.
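
The counts behind these plots can be sketched with the standard library alone. The records and the field names (`published`, `categories`) below are made up for illustration; the notebook reads the real values from the saved .csv file.

```python
from collections import Counter

# Hypothetical records standing in for rows of the saved .csv file.
papers = [
    {"published": "2008-03-14", "categories": ["cs.LG"]},
    {"published": "2016-07-01", "categories": ["cs.LG", "stat.ML"]},
    {"published": "2016-11-23", "categories": ["physics.optics"]},
    {"published": "2021-02-09", "categories": ["cs.CL", "cs.LG", "cs.AI"]},
]


def year_of(paper):
    """Extract the publication year from an ISO-style date string."""
    return int(paper["published"][:4])


def period_of(year):
    """Map a year to one of the two word-cloud periods (None if outside both)."""
    if 2005 <= year <= 2014:
        return "2005-2014"
    if 2015 <= year <= 2024:
        return "2015-2024"
    return None


papers_per_year = Counter(year_of(p) for p in papers)
papers_by_num_categories = Counter(len(p["categories"]) for p in papers)
papers_per_period = Counter(period_of(year_of(p)) for p in papers)

print(papers_per_year)
print(papers_by_num_categories)
print(papers_per_period)
```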

**Traditional Machine Learning Models**

Please open the Naïve Bayes, XGBoost, Logistic Regression, and Random Forest notebooks for our base models. Run all the code cells sequentially.

Each notebook will first preprocess the text data and perform the necessary encoding. Next, it will train the model and conduct hyperparameter tuning. Once training is complete, the model will be tested on the testing set using the best parameters.
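
Because a paper can carry several labels, the label encoding is typically a multi-hot matrix with one column per label. The sketch below shows that idea in plain Python; the function names are hypothetical, and the notebooks may use a library class instead (scikit-learn's `MultiLabelBinarizer` performs the same kind of transformation).

```python
from typing import Dict, List, Sequence


def fit_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign a column index to every distinct label, in sorted order."""
    labels = sorted({label for labels in label_sets for label in labels})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(label_sets: Sequence[Sequence[str]], index: Dict[str, int]) -> List[List[int]]:
    """Encode each paper's labels as a 0/1 row with one column per known label."""
    rows = []
    for labels in label_sets:
        row = [0] * len(index)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


# Hypothetical label sets for three papers.
train_labels = [["cs.LG", "stat.ML"], ["cs.CL"], ["cs.LG"]]
index = fit_label_index(train_labels)
Y = multi_hot(train_labels, index)
print(index)  # {'cs.CL': 0, 'cs.LG': 1, 'stat.ML': 2}
print(Y)      # [[0, 1, 1], [1, 0, 0], [0, 1, 0]]
```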

After execution, you will see the test set metrics, including F1-score, precision, recall, Hamming loss, and Jaccard score. A classification report will also be displayed, summarizing the model's performance across different categories.
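
To make the multi-label metrics concrete, the following sketch computes Hamming loss, a sample-averaged Jaccard score, and micro-averaged precision, recall, and F1 on a tiny made-up example. It is an illustration of the definitions only: the notebooks may use library functions (for example those in `sklearn.metrics`) and a different averaging mode.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def jaccard_samples(y_true, y_pred):
    """Mean over papers of |true AND pred| / |true OR pred|."""
    scores = []
    for tr, pr in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(tr, pr) if t and p)
        union = sum(1 for t, p in zip(tr, pr) if t or p)
        # Convention chosen here: a paper with no true and no predicted labels scores 0.
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores)


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all label slots."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two papers, three possible labels (made-up values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

print(hamming_loss(y_true, y_pred))     # 2 wrong slots out of 6
print(jaccard_samples(y_true, y_pred))  # 0.5
print(micro_prf(y_true, y_pred))
```

Lower is better for Hamming loss; higher is better for the other metrics.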

**Deep Learning and Transformer Models**

Open the notebook for the deep learning or transformer model you want to run, and run all the code cells sequentially.

The notebook will first preprocess the text data and perform the necessary encoding. Next, it will train the model; hyperparameter tuning is performed only for the LSTM. Once training is complete, the model will be tested on the testing set (using the best parameters found, in the case of the LSTM).

After execution, you will see the test set metrics, including F1-score, precision, recall, Hamming loss, and Jaccard score. A classification report will also be displayed, summarizing the model's performance across different categories.


## Additional notes
* **Search Term-Specific Adjustments**: Ensure the correct search term is entered in the code. For multi-word queries (e.g., "machine learning"), replace spaces with "+" (e.g., "machine+learning") to avoid issues with retrieval.
* **Training Time Considerations**: Training time varies with dataset size and model complexity. Deep learning and transformer-based models may take significantly longer than traditional machine learning models such as Naïve Bayes or Random Forest.
* **Hyperparameter Tuning Impact**: The hyperparameter tuning process, especially for models like XGBoost and deep learning models, may require extended processing time. Consider adjusting tuning parameters if computational resources are limited.
* **Execution Order**: Run all code cells sequentially without interruption to ensure smooth execution, proper data preprocessing, and accurate results.
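
One way to act on the hyperparameter tuning note when resources are limited is to try only a random subset of the parameter grid. The sketch below assumes a hypothetical XGBoost-style grid and a trial budget of 8; the values and helper names are illustrative and are not taken from the notebooks.

```python
import itertools
import random

# Hypothetical search space; the notebooks' actual grids may differ.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}


def all_combinations(grid):
    """Every parameter combination in the grid, as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*(grid[k] for k in keys))]


def sample_combinations(grid, budget, seed=0):
    """Pick at most `budget` distinct combinations at random (seeded for repeatability)."""
    combos = all_combinations(grid)
    rng = random.Random(seed)
    return rng.sample(combos, min(budget, len(combos)))


full = all_combinations(param_grid)                  # 3 * 3 * 3 = 27 combinations
reduced = sample_combinations(param_grid, budget=8)  # only 8 of them are trained
print(len(full), len(reduced))
```

scikit-learn's `RandomizedSearchCV` follows the same idea, with its `n_iter` parameter playing the role of the budget.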