# Frontiers application notebook

This project was made to show my strong interest in joining Frontiers as a Data Scientist and, more precisely, in working on their Artificial Intelligence Review Assistant (AIRA).

I challenged myself to build a keyword extraction tool for scientific articles, which could be useful for article recommendation in a peer-review context.

Specifically, this notebook demonstrates both my ability to produce an ML prototype that uses NLP techniques in a very short time and my proficiency in Python through good, clean coding practices.

I previously worked on applying Topological Data Analysis ML techniques to neural time series data, so I reused some of those publications as examples.

## Environment

In [1]:
%reload_ext watermark
%watermark \
    --machine \
    --python \
    --packages \
        numpy,pandas,sklearn,spacy

Python implementation: CPython
Python version       : 3.8.11
IPython version      : 7.27.0

numpy  : 1.21.2
pandas : 1.3.2
sklearn: 0.24.1
spacy  : 3.1.3

Compiler    : GCC 10.3.0
OS          : Linux
Release     : 5.10.70
Machine     : x86_64
Processor   : 
CPU cores   : 16
Architecture: 64bit



In [2]:
import os
import logging
import string
import multiprocessing

import numpy as np
import pandas as pd
import spacy

import textwrap

from typing import Any, Dict, List, Optional, Tuple, Union
from pathlib import Path
from spacy.tokens import Token
from spacy import Language
from transformers import pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from pdfminer.high_level import extract_text

In [3]:
NOTEBOOK_DIRECTORY = Path.cwd()
CORPUS_DIR = NOTEBOOK_DIRECTORY / 'corpus'

file_paths = list(CORPUS_DIR.glob('*.pdf'))
file_names = [path.stem for path in file_paths]
text_id = 1

logging.getLogger('pdfminer').setLevel(logging.WARNING)
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)
os.environ["TOKENIZERS_PARALLELISM"] = 'false'

## Extract text from articles

In [4]:
extracting_config = {
    'parallel': True
}

def extract_text_from_pdf(file: Path) -> str:
    text = extract_text(file)
    text = text.replace("\n", " ")
    text = text.replace("\x00", " ")
    text = ' '.join(text.split())
    return text

def extract_text_from_file_paths(*, file_paths: List[Path], parallel: bool) -> List[str]:
    if parallel:
        # The context manager releases the worker processes once the map is done
        with multiprocessing.Pool() as pool:
            return pool.map(extract_text_from_pdf, file_paths)
    return list(map(extract_text_from_pdf, file_paths))

In [5]:
%%time
texts = extract_text_from_file_paths(**extracting_config, file_paths=file_paths)

CPU times: user 7.32 ms, sys: 75.1 ms, total: 82.5 ms
Wall time: 14.3 s


In [6]:
print(f'File: {file_paths[text_id].name}\n')
start = 700
end = 1400
print(wrapper.fill(texts[text_id][start:end]))

File: 1709.06206.pdf

ed to model the leaky integrate-fire spiking neuron, training SNNs using back in
overcoming propagation. Two SNN training algorithms are proposed: (1) SNN with
discontinuous integration, which is suitable for rate-coded input spikes, and
(2) SNN with continuous integration, which is more general and can handle input
spikes with temporal information. Neuromorphic hardware designed in 40nm CMOS
exploits the spike sparsity and demonstrates high classification accuracy (>98%
on MNIST) and low energy (48.4–773 nJ/image). Keywords—Spiking neural networks;
back propagation; neuromorphic hardware; straight-through estimator I.
INTRODUCTION Recently, many deep learning algorithms such as multi-layer


## Keyword extraction

### Preprocessing

In [7]:
nlp = spacy.load("en_core_web_sm", exclude=["parser", "lemmatizer", "ner"])
components = nlp.pipe_names  # names of the components left after the exclusions
print(f'SpaCy pipeline components: {components}')

remove_words = {'Fig', 'Figure', 'INTRODUCTION', 'Abstract', 'Section'}
pos_to_keep = {'NOUN', 'ADJ', 'PROPN'}

def keep_token(token: Token) -> bool:
    # Keep only nouns, adjectives and proper nouns
    if token.pos_ not in pos_to_keep:
        return False
    
    # Remove units, math signs and other very short strings that are unlikely to be relevant
    if len(token.text) <= 2:
        return False

    # Remove custom words specific to publications
    if token.text in remove_words:
        return False
    
    # Remove words with punctuation
    if any(map(lambda x: x in string.punctuation, token.text)):
        return False
    
    # Remove words with numbers (like 1e-5)
    if any(map(str.isdigit, token.text)):
        return False
    
    return True

def preprocess(*, nlp: Language, texts: List[str]) -> List[str]:
    # Apply spacy tokenizer pipeline
    docs = list(nlp.pipe(texts))
    
    # Apply filters on spacy tokens
    tokens = list(list(filter(keep_token, doc)) for doc in docs)
    
    # Join tokens back
    corpus = list(map(lambda x: " ".join([token.text for token in x]), tokens))
    
    return corpus

SpaCy pipeline components: ['tok2vec', 'tagger', 'attribute_ruler']


In [8]:
%%time
corpus = preprocess(nlp=nlp, texts=texts)

CPU times: user 1.55 s, sys: 168 ms, total: 1.72 s
Wall time: 1.72 s


### TF-IDF

In [9]:
tfidf_config = {
    'ngram_range': (1, 2)
}

def compute_tfidf_keywords(
    *, 
    corpus: List[str], 
    ngram_range: Tuple[int, int]
) -> np.ndarray:
    
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    tfidf = vectorizer.fit_transform(corpus)
    feature_names = np.array(vectorizer.get_feature_names())
    tfidf_sorted = np.argsort(tfidf.toarray())[:, ::-1]
    sorted_features = feature_names[tfidf_sorted].T
    
    return sorted_features

In [10]:
%%time
tfidf_keywords = compute_tfidf_keywords(**tfidf_config, corpus=corpus)

CPU times: user 39.5 ms, sys: 937 µs, total: 40.5 ms
Wall time: 40.2 ms


## Results

In [11]:
keywords_df = pd.DataFrame(tfidf_keywords, columns=file_names).head(10)
keywords_df

| | 1706.03762 | 1709.06206 | 1812.05143 | 1801.06316 | 1810.03855 | 1510.06629 |
|---|---|---|---|---|---|---|
| 0 | attention | snn | swd | quantum | classiﬁcation | homology |
| 1 | layer | mnist | window | betti | spike | head direction |
| 2 | transformer | neuron | homology | simplices | series | spatial |
| 3 | self | snns | time | eigenvalue | train | head |
| 4 | encoder | time | topological | data | spike train | direction |
| 5 | decoder | training | series | betti numbers | time series | persistent |
| 6 | self attention | classification | time series | points | isi | place |
| 7 | layers | time steps | persistence | register | activity | space |
| 8 | model | neurons | perea | topological | time | persistent homology |
| 9 | translation | mlp | data | numbers | feature | covariates |


Combining the spaCy tokenizer and custom token filtering with a TF-IDF pipeline from scikit-learn gives a fairly strong baseline for article keyword extraction, even with a small corpus.
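
To make the ranking step concrete, here is a minimal pure-Python sketch of TF-IDF keyword ranking on a toy corpus. It is not scikit-learn's implementation: it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that `TfidfVectorizer` applies by default, but skips the L2 normalization, which does not change the ranking within a document. The names `tfidf_scores`, `top_keywords` and `toy_corpus` are made up for this illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(corpus: List[str]) -> List[Dict[str, float]]:
    """Score every term of every document with a smoothed TF-IDF."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return scores


def top_keywords(corpus: List[str], k: int = 3) -> List[List[str]]:
    """Return the k highest-scoring terms of each document (ties broken alphabetically)."""
    return [
        sorted(doc_scores, key=lambda term: (-doc_scores[term], term))[:k]
        for doc_scores in tfidf_scores(corpus)
    ]


# Toy corpus (hypothetical, not taken from the articles)
toy_corpus = [
    "attention layer attention encoder model",
    "spike neuron spike training model",
    "homology persistence homology data model",
]
keywords = top_keywords(toy_corpus, k=2)
print(keywords)
```

Terms that appear in every document, such as `model` here, get the lowest IDF and drop out of the top of the ranking.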

Bi-grams play an important role when analyzing scientific articles, since many expressions are composed of multiple words (like persistent homology or time series).
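
As a rough illustration of what `ngram_range=(1, 2)` adds to the vocabulary, the sketch below builds unigrams and bi-grams from a token list in plain Python. The helper `ngrams` and the sample sentence are hypothetical and only mimic the idea, not scikit-learn's internals.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """Return all n-grams of the token list for every n in the inclusive range."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical filtered token sequence
tokens = "persistent homology time series persistent homology".split()
unigrams_and_bigrams = ngrams(tokens, (1, 2))
counts = Counter(unigrams_and_bigrams)
print(counts.most_common(3))
```

One thing to keep in mind: in this notebook the bi-grams are built after token filtering, so a bi-gram may join two words that were separated by a removed token in the original text.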

What more could we do?
- fine-tune the token filtering
- create a recommender system (using kNN for example)
- try transformers tokenizers
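
The recommender idea could be sketched as a nearest-neighbour search over keyword weight vectors, using cosine similarity. This is only a sketch under assumptions: the article ids and weights below are invented, and `cosine_similarity` and `recommend` are hypothetical helpers, not part of this notebook or of any library.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(
    query_id: str,
    vectors: Dict[str, Dict[str, float]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k articles most similar to the query article."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in vectors.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented keyword weights, for illustration only
article_vectors = {
    "article_a": {"homology": 0.8, "persistence": 0.5, "time series": 0.3},
    "article_b": {"homology": 0.7, "betti": 0.6, "topological": 0.4},
    "article_c": {"attention": 0.9, "transformer": 0.6, "encoder": 0.4},
}
neighbours = recommend("article_a", article_vectors, k=1)
print(neighbours)
```

With real data, the vectors would be the rows of the TF-IDF matrix rather than hand-written dictionaries.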

## Experimental question answering

I also tried a question answering pipeline from the transformers library to see what state-of-the-art models could do on complex texts.

In [12]:
pipeline_config = {
    'task': 'question-answering',
    'tokenizer': 'distilbert-base-cased-distilled-squad',
    'model': 'distilbert-base-cased-distilled-squad'
}

qa_pipeline = pipeline(**pipeline_config)

question = "What is the dataset?"

outputs = qa_pipeline(question=question, context=texts[text_id], top_k=5)
outputs_small = [{'answer': pred['answer'], 'score': f'{pred["score"]:.2f}'} for pred in outputs]
outputs_small



[{'answer': 'saccade-1', 'score': '0.64'},
 {'answer': 'static image datasets', 'score': '0.52'},
 {'answer': 'N-MNIST', 'score': '0.50'},
 {'answer': 'N-MNIST', 'score': '0.35'},
 {'answer': 'spike sparsity', 'score': '0.28'}]

The model wasn't trained on articles, but the results are sometimes surprisingly good. This could definitely be improved and might help reviewers locate approximate answers.
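
One small improvement, given that the same answer (`N-MNIST`) appears twice in the output above, would be to merge duplicate answers and keep the best score for each. The sketch below assumes predictions are dictionaries with `answer` and numeric `score` keys; `merge_duplicate_answers` is a hypothetical helper, not a transformers function.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def merge_duplicate_answers(predictions: List[Prediction]) -> List[Prediction]:
    """Keep one entry per distinct answer, with its highest score."""
    best: Dict[str, float] = {}
    for pred in predictions:
        answer = str(pred["answer"]).strip()
        score = float(pred["score"])
        if answer not in best or score > best[answer]:
            best[answer] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [{"answer": answer, "score": score} for answer, score in ranked]


# Rounded scores copied from the output above
predictions = [
    {"answer": "saccade-1", "score": 0.64},
    {"answer": "static image datasets", "score": 0.52},
    {"answer": "N-MNIST", "score": 0.50},
    {"answer": "N-MNIST", "score": 0.35},
    {"answer": "spike sparsity", "score": 0.28},
]
merged = merge_duplicate_answers(predictions)
print(merged)
```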