# Technical report

## Frontiers
 
Frontiers runs a number of open access journals in several scientific fields. Authors can
submit their articles for publication to one of these journals. However, in some cases the
authors may not be aware of the journal that best matches the scope of their paper. If the
wrong journal is chosen, it may result in delays or even rejection. To this end, we are
developing a feature that suggests to the authors the three most relevant journals to their
manuscript, to choose from.

You are tasked to build a text classifier for this feature that, given some input text, can
recommend the most suitable Frontiers journals to it.

You have at your disposal a .jsonl file containing, for all articles published by Frontiers in January 2020:
- Article identifier
- Body text
- Frontiers journal name

You can find it here:
https://drive.google.com/file/d/1es3EX0MdDAeolwFl_K_fS3RP0JFRxE2U/view?usp=sharing

Remarks:
- The solution should be coded in Python.
- You can use any Python library you may find useful.
- Together with the code you should also provide a report where you describe your approach and present the results.
- You are particularly encouraged to discuss the choice of the evaluation metric(s) and how this translates to business value.
- (last but not least) As you write code for this assignment, keep in mind that it will be reviewed (and in real life, put in production) by other colleagues. Clean code, a modular structure, Python packaging, testability, explicit dependencies, and documentation are all things that can facilitate the team!

Please email your solution in .zip format to davide.fiocco@frontiersin.org and be prepared to
discuss it in the next interview stage.

## Summary


This report is divided into the following sections:
- **Introduction:** I introduce the problem, with references and context.
- **Data and evaluation metrics:** I present an exploratory data analysis (EDA) of the given dataset, which informs the choice of methods, and I introduce the evaluation metrics used to select the best method.
- **Methods:** I describe the methods tested to build the recommendation system.
- **Results:** I report the results of each method on the defined evaluation metrics and compare them with a trivial baseline.
- **Conclusion:** I choose the best method, taking both run time and model performance into account, and justify the choice.
- **Deployment and application:** I show how to deploy the model as a REST API and interact with it through a simple web app.

# Introduction

The task is to develop an algorithm that, given a scientific paper (or a simple text or report), recommends the most suitable Frontiers journals. Several methodologies could be used to build the recommendation system and classifier; the appropriate choice depends strongly on the number of classes (Frontiers journals) to be predicted. A previous study ([Meijer et al. *Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF.* 2021](https://arxiv.org/pdf/2107.05151.pdf)) already compared document embeddings based on TFIDF and word embeddings for classifying a very large dataset of scientific papers (70 million) into 30 thousand distinct journals or conferences.

Here, I develop several variations of document embedding, computed from either:
- the full document text;
- only a list of keywords (from 3 to 7) extracted from the text.
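
As an illustration of the keyword-based variant, the sketch below picks the `k` most frequent non-stop-word tokens from a text. It is a minimal example under stated assumptions, not the extraction procedure used in this report: the function name `extract_keywords`, the tiny `STOP_WORDS` set, and the sample text are all hypothetical, and frequency counting is only one of several possible ways to select keywords.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {
    "the", "a", "an", "of", "and", "in", "to", "is", "are",
    "for", "on", "with", "that", "this", "we", "when",
}


def extract_keywords(text: str, k: int = 5) -> List[str]:
    """Return the k most frequent tokens of `text` that are not stop words."""
    if not 3 <= k <= 7:
        raise ValueError("k is expected to be between 3 and 7")
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count tokens, skipping stop words and very short tokens.
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]


# Toy text, invented for the example.
body = (
    "Neurons in the cortex fire when the cortex receives input. "
    "Cortex neurons adapt, and neurons that adapt shape plasticity."
)
keywords = extract_keywords(body, k=3)
print(keywords)
```

The resulting keyword list can then be embedded in place of the full text, which shortens the input considerably at the cost of discarding most of the context.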

On each of these text inputs, I tested several embedding strategies:
- TFIDF
- Word2Vec
- SBERT
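
To make the embed-then-rank idea concrete, the sketch below implements the TFIDF variant with the standard library only: each article is turned into an L2-normalised TFIDF vector, the vectors of each journal are summed into a journal profile, and a new text is matched against the profiles by cosine similarity to return the three best journals. This is a simplified illustration, not the pipeline evaluated in this report. The helper names (`fit_idf`, `tfidf`, `fit_profiles`, `recommend`), the smoothed IDF formula, the nearest-profile ranking, and the toy texts are all assumptions made for the example; in practice a library such as scikit-learn would normally replace the hand-written parts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in `docs`."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def l2_normalise(vec):
    """Scale a sparse vector (dict) to unit length; empty vectors stay empty."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def tfidf(text, idf):
    """Sparse, L2-normalised TFIDF vector; terms unseen at fit time are ignored."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return l2_normalise({t: c * idf[t] for t, c in tf.items()})


def fit_profiles(docs, labels, idf):
    """Sum the TFIDF vectors of each journal's articles into one unit-length profile."""
    sums = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for term, weight in tfidf(doc, idf).items():
            sums[label][term] += weight
    return {label: l2_normalise(vec) for label, vec in sums.items()}


def recommend(text, idf, profiles, k=3):
    """Return the k journals whose profile has the highest cosine similarity to `text`."""
    query = tfidf(text, idf)
    scores = {
        label: sum(w * profile.get(t, 0.0) for t, w in query.items())
        for label, profile in profiles.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus, invented for the example; real inputs would be full article bodies.
docs = [
    "neurons synapse cortex brain activity",
    "brain imaging cortex neurons",
    "bacteria gut microbiome infection",
    "bacterial strains antibiotic resistance infection",
    "memory attention behaviour cognition survey",
    "plant leaf photosynthesis growth",
]
labels = [
    "Frontiers in Neuroscience",
    "Frontiers in Neuroscience",
    "Frontiers in Microbiology",
    "Frontiers in Microbiology",
    "Frontiers in Psychology",
    "Frontiers in Plant Science",
]

idf = fit_idf(docs)
profiles = fit_profiles(docs, labels, idf)
top3 = recommend("cortex neurons respond to brain stimulation", idf, profiles)
print(top3)
```

The same `recommend` step applies unchanged to the other strategies: only the function that maps a text to a vector differs between TFIDF, Word2Vec, and SBERT.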

# Data and evaluation metrics

# Methods

# Results

# Conclusion

# Deployment and application

```python
import os

# Set the working directory to the project's src folder.
os.chdir("/home/operti/inda/frontiers/src/")
```