A semantic search engine for white papers on arXiv and NASA ADS. Semantic search tries to capture the intent and context of a query, so results reflect the meaning of the search terms rather than exact keyword matches alone. The search engine generates an embedding for each manuscript, which can be clustered for visualization applications and queried for entity searches. Embeddings are generated in two ways: TF-IDF followed by PCA, and a fine-tuned Llama 2-based model. The embeddings are then indexed with an approximate nearest neighbor technique (ANNOY) or queried with FAISS to recommend similar articles.
Create a new virtual environment
conda create -n nlp python=3.11
conda activate nlp
Install dependencies
conda install -c conda-forge spacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
pip install sqlalchemy sqlalchemy_utils pyuser_agent tqdm ipython jupyter datasets ftfy clean-text unidecode annoy scikit-learn ads mammoth markdown
python -m spacy download en_core_web_trf
See requirements.txt for the full list of dependencies and versions.
- Create a SQL database to store the articles:
python database.py
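database.py itself is not reproduced here. As a rough illustration of the kind of table such a script might create, the sketch below uses Python's built-in sqlite3 module rather than SQLAlchemy. The table name `articles` and its columns are assumptions made for this example, not the project's actual schema.

```python
import sqlite3

def create_article_table(conn):
    """Create a hypothetical `articles` table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id       TEXT PRIMARY KEY,  -- e.g. an arXiv identifier or ADS bibcode
            title    TEXT NOT NULL,
            abstract TEXT,
            category TEXT
        )
        """
    )
    conn.commit()

def add_article(conn, article_id, title, abstract, category):
    """Insert one article; a repeated id is ignored rather than duplicated."""
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
        (article_id, title, abstract, category),
    )
    conn.commit()

# An in-memory database keeps the example self-contained;
# pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
create_article_table(conn)
add_article(conn, "0000.00001", "A placeholder title", "A placeholder abstract.", "astro-ph.EP")
add_article(conn, "0000.00001", "A duplicate entry", "Ignored.", "astro-ph.EP")
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

Using the identifier as the primary key means re-running an import does not create duplicate rows, which may be convenient when the same abstracts are downloaded more than once.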
- Populate the database
a. Download ~2.3 million abstracts from the arXiv dataset on Kaggle. After downloading and unzipping, run the script json2db.py. Specific categories can be selected by editing the category_map dict in the script.
b. Or, use our query script for manuscripts in the NASA Astrophysics Data System (ADS). It queries ADS for abstracts based on a keyword search. See query_ads.py -h for more details. You will need to sign up for an ADS account and obtain an API key.
These data are added to a SQL database called whitepapers.db, which can be sorted and queried quickly using SQL commands or with our SQL wrapper in database.py.
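The category selection in option (a) can be sketched as follows. This is not the code in json2db.py: it assumes the Kaggle file holds one JSON object per line with `id`, `title`, `abstract`, and a space-separated `categories` field, and the contents of `category_map` here are hypothetical.

```python
import json

# Hypothetical mapping from arXiv category codes to readable labels.
# Only records in one of these categories are kept.
category_map = {
    "astro-ph.EP": "Earth and Planetary Astrophysics",
    "astro-ph.IM": "Instrumentation and Methods for Astrophysics",
}

def iter_selected(lines, categories):
    """Yield (id, title, abstract, label) for records in a wanted category."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed format: categories are a space-separated string.
        for code in record.get("categories", "").split():
            if code in categories:
                yield (record["id"], record["title"], record["abstract"], categories[code])
                break  # one row per article, even if several categories match

# Two made-up records standing in for lines of the downloaded file.
sample_lines = [
    json.dumps({"id": "a1", "title": "T1", "abstract": "A1", "categories": "astro-ph.EP hep-th"}),
    json.dumps({"id": "a2", "title": "T2", "abstract": "A2", "categories": "math.CO"}),
]
selected = list(iter_selected(sample_lines, category_map))
```

With the real download, an open file handle can be passed in place of `sample_lines`, so the file is streamed line by line rather than loaded into memory at once.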
- Clean up the database by removing incomplete entries, converting special characters, and removing abstracts based on keywords.
python clean_db.py
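The three cleaning operations might look roughly like the sketch below. It works on plain dictionaries rather than database rows, and the keyword list is a hypothetical placeholder; the actual rules live in clean_db.py.

```python
import unicodedata

# Hypothetical filter terms, chosen only for illustration.
BLOCKED_KEYWORDS = ("withdrawn", "retracted")

def normalize_text(text):
    """Fold compatibility characters (e.g. ligatures) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def clean_entries(entries):
    """Keep entries that have a title and abstract and contain no blocked keyword."""
    cleaned = []
    for entry in entries:
        title = normalize_text(entry.get("title") or "")
        abstract = normalize_text(entry.get("abstract") or "")
        if not title or not abstract:
            continue  # incomplete entry
        if any(word in abstract.lower() for word in BLOCKED_KEYWORDS):
            continue  # removed by keyword
        cleaned.append({"title": title, "abstract": abstract})
    return cleaned

entries = [
    {"title": "E\ufb03cient  retrievals", "abstract": "We present\na method."},
    {"title": "No abstract", "abstract": None},
    {"title": "Old result", "abstract": "This paper has been withdrawn."},
]
cleaned = clean_entries(entries)
```

Only the first entry survives: its ligature is expanded and its whitespace collapsed, while the other two are dropped for a missing abstract and a blocked keyword respectively.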
- Create embeddings and set up an approximate nearest neighbor tree for the database.
python db_to_vec.py
We use two different algorithms to generate embeddings: a TF-IDF vectorizer with PCA dimensionality reduction, and a fine-tuned Llama 2 language model. The embeddings are then indexed with an approximate nearest neighbor technique (ANNOY) and queried with FAISS to recommend articles similar to an input prompt.
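To make the TF-IDF route concrete, here is a dependency-free sketch of the idea. It is a simplification, not what db_to_vec.py does: it skips the PCA step and replaces ANNOY and FAISS with an exact brute-force cosine search, which is only practical for small collections. The whitespace tokenizer and the smoothed IDF formula are choices made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed IDF
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(index, vectors, k=1):
    """Exact search: rank every other document by cosine similarity."""
    scores = [(cosine(vectors[index], vec), j)
              for j, vec in enumerate(vectors) if j != index]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

abstracts = [
    "transit spectroscopy of exoplanet atmospheres",
    "direct imaging of exoplanet atmospheres with polarimetry",
    "wildfire spread in deciduous forest vegetation",
]
vectors = tfidf_vectors(abstracts)
neighbors = most_similar(0, vectors, k=1)
```

The first abstract shares terms only with the second, so the second is returned as its nearest neighbor. An approximate index trades a little of this exactness for much faster queries over millions of vectors.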
Create a web server that serves the generative model for a predictive keyboard and finds similar abstracts in real time. See api.py.
uvicorn api:app --reload
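The command above expects a module named api exposing an ASGI callable named app. The contents of api.py are not shown here, so the following is only a bare-bones stand-in written against the ASGI interface with no web framework; the JSON payload it returns is invented for illustration and performs no text generation or similarity search.

```python
import asyncio
import json

async def app(scope, receive, send):
    """Minimal ASGI app: answer every HTTP request with a small JSON body."""
    if scope["type"] != "http":
        return  # lifespan and websocket events are ignored in this sketch
    body = json.dumps({"path": scope["path"], "status": "ok"}).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})

async def call_app(path):
    """Drive the app directly, without a server, and collect what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": path}, receive, send)
    return sent

messages = asyncio.run(call_app("/similar"))
```

A real endpoint would look up the query in the nearest neighbor index or call the language model before building the response body.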
Text generation and nearest neighbor recommendations in a single app:
python -m bokeh serve --show bokeh_example.py
python gpt2_to_coreml.py
A deep language model, GPT-2, is trained on scientific manuscripts from arXiv. This pilot study uses abstracts from ~2.1M articles as training data in order to explore correlations in scientific literature from a language modeling perspective. Language models are algorithms that generate sequences of numbers corresponding to tokens or words, and these sequences can be used to represent sentences. The text samples are fed into the 117M-parameter GPT-2 model, which is fine-tuned for ~500,000 steps. After training, the language model is used to generate embeddings for each manuscript, which can be clustered for visualization applications and queried for entity searches.
from transformers import pipeline
ai = pipeline('text-generation',model='pearsonkyle/gpt2-arxiv', tokenizer='gpt2', config={'max_length':1600})
machina = lambda text: ai(text)[0]['generated_text']
A few generated samples are below:
- We can remotely sense an atmosphere by observing its reflected, transmitted, or emitted light in varying geometries. This light will contain information on the planetary conditions including
temperature, pressure, composition, and cloud optical thickness. One such property that is important is...
- The reflectance of Earth's vegetation suggests
that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...
- Directly imaged exoplanets probe
key aspects of planet formation and evolution theory, as well as atmospheric and interior physics. These insights have led to numerous direct imaging instruments for exoplanets, many using polarimetry. However, current instruments take
Interested in training this model in the cloud? Try this repo on Google Colab
Train a GPT-2 model (you will first need a script that writes the abstracts to a text file):
python train.py
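The export script does not exist yet, so here is one possible shape for it. It assumes the abstracts sit in a SQLite table with an `abstract` column and that one abstract per line is an acceptable input for train.py; both the names and the output format are assumptions.

```python
import os
import sqlite3
import tempfile

def export_abstracts(conn, out_path, table="articles", column="abstract"):
    """Write one abstract per line to out_path and return how many were written."""
    # Table and column names cannot be bound as SQL parameters, so this sketch
    # interpolates them; only pass trusted, hard-coded names.
    query = f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL"
    written = 0
    with open(out_path, "w", encoding="utf-8") as handle:
        for (abstract,) in conn.execute(query):
            line = " ".join(abstract.split())  # keep each abstract on one line
            if line:
                handle.write(line + "\n")
                written += 1
    return written

# Demo on a throwaway in-memory database and a temporary file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (abstract TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?)",
    [("First\nabstract.",), (None,), ("Second abstract.",)],
)
out_path = os.path.join(tempfile.mkdtemp(), "abstracts.txt")
n_written = export_abstracts(conn, out_path)
```

For the real database, the connection would point at whitepapers.db instead of an in-memory database, with the table and column names adjusted to match its schema.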