# Technical report

## Frontiers
 
Frontiers runs a number of open access journals in several scientific fields. Authors can
submit their articles for publication to one of these journals. However, in some cases the
authors may not be aware of the journal that best matches the scope of their paper. If the
wrong journal is chosen, it may result in delays or even rejection. To address this, we are
developing a feature that suggests to authors the three journals most relevant to their
manuscript.

You are tasked with building a text classifier for this feature that, given some input text, can
recommend the most suitable Frontiers journals for it.

You have at your disposal a .jsonl file containing, for all articles published by Frontiers in January 2020:
- Article identifier
- Body text
- Frontiers journal name

You can find it here:
https://drive.google.com/file/d/1es3EX0MdDAeolwFl_K_fS3RP0JFRxE2U/view?usp=sharing

Remarks:
- The solution should be coded in Python.
- You can use any Python library you may find useful.
- Together with the code you should also provide a report where you describe your approach and present the results.
- You are particularly encouraged to discuss the choice of the evaluation metric(s) and how this translates to business value.
- (last but not least) As you write code for this assignment, keep in mind that it will be reviewed (and in real life, put in production) by other colleagues. Clean code, a modular structure, python packaging, testability, explicit dependencies, documentation, are all things that can facilitate the team!

Please email your solution in .zip format to davide.fiocco@frontiersin.org and be prepared to
discuss it in the next interview stage.

## Summary


This report is divided into the following sections:
- **Introduction:** I introduce the problem, providing references and context.
- **Data and evaluation metrics:** I present an exploratory data analysis (EDA) of the given dataset, which provides useful insights for choosing the methods. I also introduce the evaluation metrics used to select the best method.
- **Methods:** I describe the methods tested to build the recommendation system.
- **Results:** I show the results of each method on the defined evaluation metrics and compare them with a trivial baseline.
- **Conclusion:** I choose the best method considering both run time and model performance, and justify the choice.
- **Deployment and application:** I show how to deploy the model as a REST API and interact with it through a simple web app.


# Introduction

The task is to develop an algorithm that, given a scientific paper (or any plain text or report), recommends the most suitable Frontiers journals. Several methodologies could be used to build such a recommendation system and classifier, but the right choice depends strongly on the number of classes (Frontiers journals) to be predicted. A previous study ([Meijer et al. *Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF.* 2021](https://arxiv.org/pdf/2107.05151.pdf)) already compared document embeddings based on TFIDF and word embeddings for the classification of a very large dataset of scientific papers (70 million) into 30 thousand distinct journals or conferences.

Here, I develop several variations of document embedding, built from either:
- the full document text;
- only a list of keywords (from 3 to 7) extracted from the text.

Starting from these two text representations, I tested several embedding strategies:
- **TFIDF**: In information retrieval, TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches for information retrieval, text mining, and user modeling. The TFIDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. TFIDF is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use TFIDF ([Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)).
- **Word2Vec**: I used the pretrained [spaCy word embeddings](https://spacy.io/usage/linguistic-features#vectors-similarity) from the language model *en_core_web_lg*. I decided not to train my own word embeddings (word2vec, FastText, GloVe, etc.) because of the limited time available for the assignment and the small size of the dataset.
- **SBERT**: Sentence-BERT ([*Reimers et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. (2019)*](https://arxiv.org/abs/1908.10084)) is a modification of the pretrained BERT network that uses siamese and triplet network structures. While the architecture of BERT makes it unsuitable for semantic similarity search and for unsupervised tasks like clustering, SBERT derives semantically meaningful sentence embeddings that can be compared using cosine similarity.
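
To make the TFIDF weighting and the cosine comparison mentioned above concrete, here is a minimal, standard-library-only sketch. It is not the implementation used in `src.train`; the function names, the smoothed idf formula, and the toy corpus are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse {term: weight} dict per document, with weight = tf * idf."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf (an assumption of this sketch): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # raw term frequency
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): two reference texts and one query manuscript.
docs = [
    "gene expression genome sequencing gene".split(),
    "memory attention behaviour cognition".split(),
    "genome sequencing of gene variants".split(),
]
vectors = tfidf_vectors(docs)
sim_genetics = cosine_similarity(vectors[2], vectors[0])
sim_psychology = cosine_similarity(vectors[2], vectors[1])
```

In this toy example the query shares terms only with the first text, so `sim_genetics` is positive while `sim_psychology` is zero. The same cosine comparison applies to dense vectors such as word or sentence embeddings.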

# Data and evaluation metrics

This section is divided into:
- **Setup:** The setup needed to run the notebook
- **EDA:** An EDA of the dataset
- **Evaluation metrics:** The presentation of the evaluation metrics
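
Because the feature suggests three journals per manuscript, one natural candidate metric is top-3 accuracy: the fraction of articles whose true journal appears among the three suggestions. The sketch below illustrates that idea under stated assumptions. It is not necessarily the metric computed by `src.evaluate`, and the function name and example labels are hypothetical.

```python
def top_k_accuracy(true_labels, ranked_predictions, k=3):
    """Fraction of samples whose true label is among the first k ranked predictions."""
    if len(true_labels) != len(ranked_predictions):
        raise ValueError("true_labels and ranked_predictions must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(
        1
        for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked suggestions (best first) for three manuscripts.
y_true = ["Frontiers in Genetics", "Frontiers in Psychology", "Frontiers in Medicine"]
y_ranked = [
    ["Frontiers in Genetics", "Frontiers in Medicine", "Frontiers in Endocrinology"],
    ["Frontiers in Medicine", "Frontiers in Genetics", "Frontiers in Psychology"],
    ["Frontiers in Genetics", "Frontiers in Endocrinology", "Frontiers in Psychology"],
]
top1 = top_k_accuracy(y_true, y_ranked, k=1)
top3 = top_k_accuracy(y_true, y_ranked, k=3)
```

Here only the first manuscript is ranked correctly at position one, while the first two have their true journal within the top three, so `top3` is higher than `top1`. In business terms, top-3 accuracy estimates how often an author would find the right journal among the suggestions shown.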

### Setup

Import all the required libraries.

In [31]:
import src.train.document_approach


In [25]:
import os

# Run from the project root so that the `src` package is importable.
os.chdir("/home/operti/inda/frontiers")

import plotly.express as px

from src.evaluate.evaluate import (
    evaluate_document_sbert,
    evaluate_document_tfidf,
    evaluate_document_word2vec,
    evaluate_keyword_sbert,
    evaluate_keyword_tfidf,
    evaluate_keyword_word2vec,
)
from src.preprocess.preprocess import filter_papers_min_sample, preprocess
from src.train.train import (
    train_embeddings_document_sbert,
    train_embeddings_document_tfidf,
    train_embeddings_document_word2vec,
    train_embeddings_keyword_sbert,
    train_embeddings_keyword_tfidf,
    train_embeddings_keyword_word2vec,
    train_test,
)
from src.utils.utils import IO, load_data

%matplotlib inline



### EDA

The EDA shows:

1. The number of scientific articles published in each Frontiers journal;
2. The distribution of the text length;
3. The definition of the train and test split;
4. The definition of the text preprocessing:
    1. Normalization:
        1. Lowercase
        2. Remove punctuation
        3. Remove numbers
    2. Removing a predefined list of uninformative words
    3. Removing stopwords
    4. Lemmatization
5. The definition of the keywords and how to extract them;
6. The distribution of the number of keywords.
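
The preprocessing steps listed in item 4 can be sketched as follows. This is a simplified illustration, not the `preprocess` function from `src.preprocess`: the word lists are hypothetical placeholders, and lemmatization is left out because it requires a language model.

```python
import re
import string

# Hypothetical word lists; the lists used in the report are not shown.
STOPWORDS = {"the", "of", "and", "in", "a", "is", "for", "to", "were", "was"}
CUSTOM_WORDS = {"figure", "table", "et", "al"}

# Map every punctuation character to a space so that words stay separated.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_text(text, stopwords=STOPWORDS, custom_words=CUSTOM_WORDS):
    """Normalize a text and drop uninformative tokens (no lemmatization)."""
    text = text.lower()                    # normalization: lowercase
    text = text.translate(_PUNCT_TO_SPACE)  # normalization: remove punctuation
    text = re.sub(r"\d+", " ", text)        # normalization: remove numbers
    tokens = text.split()
    # Drop the predefined uninformative words, then the stopwords.
    return [t for t in tokens if t not in custom_words and t not in stopwords]


tokens = preprocess_text("Figure 2: The 35 patients were treated in 2019, et al.")
```

For the example sentence, only `patients` and `treated` survive the filtering. A lemmatization step would then reduce the remaining tokens to their base forms.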

In [3]:
# Load the dataset
df = load_data()

In [9]:
# Show the first five rows
df.head(5)

| | id | text | journal |
|---|---|---|---|
| 0 | 465950 | \n Sleep Characteristics and Influencing Facto... | Frontiers in Medicine |
| 1 | 483526 | A Hybrid Approach for Modeling Type 2 Diabetes... | Frontiers in Genetics |
| 2 | 482500 | \n Relationship Between SES and Academic Achie... | Frontiers in Psychology |
| 3 | 437333 | Environmental Health Research in Africa: Impor... | Frontiers in Genetics |
| 4 | 486515 | \n 3,5-T2—A Janus-Faced Thyroid Hormone Metabo... | Frontiers in Endocrinology |


In [18]:
# 1. The number of scientific articles published for each Frontiers journal;
documents_per_journal, df_subset = filter_papers_min_sample(df)
documents_per_journal = documents_per_journal.reset_index().rename(columns={0:"count"})
fig = px.bar(documents_per_journal, x='journal', y='count')
fig.show()

**Conclusion 1**: The number of published articles per journal is highly imbalanced. In order to evaluate the methods presented in the next section, I keep only the journals with at least 2 published articles in the dataset.
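
The filtering described in Conclusion 1 can be illustrated with a small standalone sketch. The notebook itself works on a pandas DataFrame through `filter_papers_min_sample`, whose source is not shown here. The helper below is a hypothetical stand-in that works on plain dictionaries, and the records are invented.

```python
from collections import Counter


def keep_journals_with_min_articles(records, min_articles=2):
    """Return per-journal counts and the records of journals with enough articles."""
    counts = Counter(record["journal"] for record in records)
    kept = [r for r in records if counts[r["journal"]] >= min_articles]
    return counts, kept


# Hypothetical records mirroring the dataset columns (id, text, journal).
records = [
    {"id": 1, "text": "genome study", "journal": "Frontiers in Genetics"},
    {"id": 2, "text": "gene variants", "journal": "Frontiers in Genetics"},
    {"id": 3, "text": "sleep quality", "journal": "Frontiers in Medicine"},
]
counts, kept = keep_journals_with_min_articles(records, min_articles=2)
```

In this example the single Frontiers in Medicine article is dropped, because a journal with one article cannot appear in both the training set and the test set.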

In [21]:
# 2. The distribution of the length of the text;
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning.
df_subset = df_subset.copy()
df_subset["len_text"] = df_subset["text"].str.len()
fig = px.histogram(df_subset, x="len_text", nbins=100)
fig.show()






**Conclusion 2**: The text length does not appear to be normally distributed because of its long tail (at first glance it looks closer to a binomial distribution). However, the sample is too small to assess this reliably.

In [24]:
# The definition of the train and test split;
df_train, df_test = train_test(df)


**Conclusion 3**: The test set is 33% of the data.
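
As an illustration of a 33% split that keeps every journal represented in both sets, here is a minimal stratified split written with the standard library only. Whether `train_test` stratifies by journal is not shown in this notebook, so the stratification, the function name, and the seed are assumptions of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(records, test_size=0.33, seed=42):
    """Split records into train and test, keeping each journal in both sets."""
    by_journal = defaultdict(list)
    for record in records:
        by_journal[record["journal"]].append(record)

    rng = random.Random(seed)  # local generator, so the split is reproducible
    train, test = [], []
    for journal in sorted(by_journal):
        group = by_journal[journal][:]
        if len(group) < 2:
            raise ValueError(f"{journal!r} needs at least 2 articles to be split")
        rng.shuffle(group)
        # At least one article goes to test and at least one stays in train.
        n_test = min(max(1, round(len(group) * test_size)), len(group) - 1)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: six articles in journal "A" and two in journal "B".
records = [{"id": i, "journal": "A" if i < 6 else "B"} for i in range(8)]
train, test = stratified_split(records)
```

With these eight records the sketch puts five articles in the training set and three in the test set, and both journals appear on each side. This is also why journals with a single article have to be filtered out before splitting.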

In [23]:
# Load the preprocessed train and test sets
df_train = IO(filename="df_train_preprocessed", folder="02_intermediate", format_="pickle").load()
df_test = IO(filename="df_test_preprocessed", folder="02_intermediate", format_="pickle").load()


# Methods

# Results

# Conclusion

# Deployment and application

In [4]:
import os
os.chdir("/home/operti/inda/frontiers/src/")