# arXiv Search with OpenCLIP and LanceDB

In this example we'll build an arXiv search or recommender based on semantic search using LanceDB. We'll also compare the results with keyword-based search and explore the embeddings on Nomic's Atlas.


## OpenCLIP

![CLIP (1)](https://github.com/lancedb/vectordb-recipes/assets/15766192/11b3b900-0bcb-4a4a-8fd4-804611c85972)


OpenCLIP is an open-source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training) and is available with various backends.

In [1]:
# SETUP
!pip install lancedb open_clip_torch arxiv --q

In [2]:
!pip install pandas



## Creating a table from the arXiv API

### Embedding Paper Summary using CLIP


In [3]:
import torch
import open_clip
import pandas as pd
from open_clip import tokenizer
from tqdm import tqdm
from collections import defaultdict
import arxiv
import lancedb


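# Load the model and tokenizer once, instead of reloading them on every call
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()
clip_tokenizer = open_clip.get_tokenizer("ViT-B-32")


def embed_func_clip(text):
    # The tokenizer truncates input to CLIP's context length (77 tokens),
    # so only the beginning of a long summary is embedded
    with torch.no_grad():
        text_features = model.encode_text(clip_tokenizer(text))
    return text_features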
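The vectors returned by `embed_func_clip` are stored as-is, so their magnitudes differ from text to text. CLIP embeddings are often compared by cosine similarity, which ignores magnitude. The sketch below is not part of the original notebook: it uses toy vectors in place of real embeddings, and `l2_normalize` and `cosine_similarity` are hypothetical helper names chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of the normalized vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Toy 3-dimensional vectors standing in for CLIP text embeddings
query_vec = [3.0, 4.0, 0.0]
doc_vec = [6.0, 8.0, 0.0]  # same direction, twice the magnitude
orthogonal_vec = [0.0, 0.0, 5.0]

print(cosine_similarity(query_vec, doc_vec))
print(cosine_similarity(query_vec, orthogonal_vec))
```

If you want magnitude to be ignored during search, one option is to normalize each vector before appending it to the DataFrame and to normalize the query vector the same way.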

### Create a DataFrame of the desired length

Here we'll use the `arxiv` Python package to interact with the arXiv API and fetch the document data.

In [4]:
def get_arxiv_df(embed_func, length=100):
    results = arxiv.Search(
        query="cat:cs.AI OR cat:cs.CV OR cat:stat.ML",
        max_results=length,
        sort_by=arxiv.SortCriterion.Relevance,
        sort_order=arxiv.SortOrder.Descending,
    ).results()
    df = defaultdict(list)
    for result in tqdm(results, total=length):
        try:
            # Embed first: if this fails, nothing has been appended yet,
            # so all columns keep the same length
            vector = embed_func(result.summary).tolist()[0]
        except Exception as e:
            print("error: ", e)
            continue

        df["title"].append(result.title)
        df["summary"].append(result.summary)
        df["authors"].append(str(result.authors))
        df["url"].append(result.entry_id)
        df["vector"].append(vector)

    return pd.DataFrame(df)

In [5]:
LENGTH = 100  # Reduce the size for demo


def create_table():
    db = lancedb.connect("db")
    df = get_arxiv_df(embed_func_clip, LENGTH)

    tbl = db.create_table("arxiv", data=df, mode="overwrite")

    return tbl

In [17]:
tbl = create_table()

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [07:42<00:00,  4.62s/it]


In [6]:
import lancedb

db = lancedb.connect("db")

if "arxiv" not in db.table_names():
    tbl = create_table()
else:
    tbl = db.open_table("arxiv")

In [7]:
len(tbl)

100

## Semantic Search by concepts or summary

In [8]:
from IPython.display import display, HTML


def search_table(query, embed_func=embed_func_clip, lim=3):
    db = lancedb.connect("db")
    tbl = db.open_table("arxiv")

    embs = embed_func(query)

    # Use the `lim` argument instead of a hard-coded limit
    return tbl.search(embs.tolist()[0]).limit(lim).to_pandas()

In [10]:
# MobileSAM paper abstract 2nd half
query = """
Many of such applications need to be run on resource-constraint edge devices,
like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight
image encoder with a lightweight one. A naive way to train such a new SAM as in the original SAM
paper leads to unsatisfactory performance, especially when limited training sources are available. We
find that this is mainly caused by the coupled optimization of the image encoder and mask decoder,
motivated by which we propose decoupled distillation. Concretely, we distill the knowledge from
the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be
automatically compatible with the mask decoder in the original SAM. The training can be completed
on a single GPU within less than one day, and the resulting lightweight SAM is termed MobileSAM
which is more than 60 times smaller yet performs on par with the original SAM. For inference speed,
With a single GPU, MobileSAM runs around 10ms per image: 8ms on the image encoder and 4ms
on the mask decoder. With superior performance, our MobileSAM is around 5 times faster than the
concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. Moreover,
we show that MobileSAM can run relatively smoothly on CPU
"""

result = search_table(query)

result.pop("vector")
display(HTML(result.to_html()))



Unnamed: 0,title,summary,authors,url,_distance
0,Twin-GAN -- Unpaired Cross-Domain Image Translation with Weight-Sharing GANs,"We present a framework for translating unlabeled images from one domain into\nanalog images in another domain. We employ a progressively growing\nskip-connected encoder-generator structure and train it with a GAN loss for\nrealistic output, a cycle consistency loss for maintaining same-domain\ntranslation identity, and a semantic consistency loss that encourages the\nnetwork to keep the input semantic features in the output. We apply our\nframework on the task of translating face images, and show that it is capable\nof learning semantic mappings for face images with no supervised one-to-one\nimage mapping.",[arxiv.Result.Author('Jerry Li')],http://arxiv.org/abs/1809.00946v1,37.476677
1,TADAM: Task dependent adaptive metric for improved few-shot learning,"Few-shot learning has become essential for producing models that generalize\nfrom few examples. In this work, we identify that metric scaling and metric\ntask conditioning are important to improve the performance of few-shot\nalgorithms. Our analysis reveals that simple metric scaling completely changes\nthe nature of few-shot algorithm parameter updates. Metric scaling provides\nimprovements up to 14% in accuracy for certain metrics on the mini-Imagenet\n5-way 5-shot classification task. We further propose a simple and effective way\nof conditioning a learner on the task sample set, resulting in learning a\ntask-dependent metric space. Moreover, we propose and empirically test a\npractical end-to-end optimization procedure based on auxiliary task co-training\nto learn a task-dependent metric space. The resulting few-shot learning model\nbased on the task-dependent scaled metric achieves state of the art on\nmini-Imagenet. We confirm these results on another few-shot dataset that we\nintroduce in this paper based on CIFAR100. Our code is publicly available at\nhttps://github.com/ElementAI/TADAM.","[arxiv.Result.Author('Boris N. Oreshkin'), arxiv.Result.Author('Pau Rodriguez'), arxiv.Result.Author('Alexandre Lacoste')]",http://arxiv.org/abs/1805.10123v4,40.610191
2,Exploring the Limits of Large Scale Pre-training,"Recent developments in large-scale machine learning suggest that by scaling\nup data, model size and training time properly, one might observe that\nimprovements in pre-training would transfer favorably to most downstream tasks.\nIn this work, we systematically study this phenomena and establish that, as we\nincrease the upstream accuracy, the performance of downstream tasks saturates.\nIn particular, we investigate more than 4800 experiments on Vision\nTransformers, MLP-Mixers and ResNets with number of parameters ranging from ten\nmillion to ten billion, trained on the largest scale of available image data\n(JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition\ntasks. We propose a model for downstream performance that reflects the\nsaturation phenomena and captures the nonlinear relationship in performance of\nupstream and downstream tasks. Delving deeper to understand the reasons that\ngive rise to these phenomena, we show that the saturation behavior we observe\nis closely related to the way that representations evolve through the layers of\nthe models. We showcase an even more extreme scenario where performance on\nupstream and downstream are at odds with each other. That is, to have a better\ndownstream performance, we need to hurt upstream accuracy.","[arxiv.Result.Author('Samira Abnar'), arxiv.Result.Author('Mostafa Dehghani'), arxiv.Result.Author('Behnam Neyshabur'), arxiv.Result.Author('Hanie Sedghi')]",http://arxiv.org/abs/2110.02095v1,40.749702
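In the results above, `_distance` is the distance between the query embedding and each paper's embedding, so rows are ordered from closest to farthest. The sketch below shows the same idea as a brute-force scan using Euclidean (L2) distance. It is an illustration under stated assumptions, not LanceDB's implementation: the toy rows, `l2_distance`, and `nearest_neighbours` are hypothetical, and the exact form of the value LanceDB reports (for example squared versus plain distance) is not confirmed here.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbours(query, rows, k=3):
    """Return the k rows closest to the query, each with a `_distance` field added."""
    scored = [{**row, "_distance": l2_distance(query, row["vector"])} for row in rows]
    return sorted(scored, key=lambda row: row["_distance"])[:k]


# Hypothetical toy table: 2-dimensional vectors instead of CLIP embeddings
rows = [
    {"title": "paper A", "vector": [0.0, 0.0]},
    {"title": "paper B", "vector": [3.0, 4.0]},
    {"title": "paper C", "vector": [1.0, 1.0]},
]

hits = nearest_neighbours([1.0, 0.5], rows, k=2)
for hit in hits:
    print(hit["title"], round(hit["_distance"], 3))
```

A smaller `_distance` means a closer match. A scan like this touches every row, which is fine for 100 papers but is the part a vector index would speed up on larger tables.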


In [11]:
# Example 2: Search via a concept you're reading about
query = """
What is the general idea behind self-supervised learning.
"""

result = search_table(query)

result.pop("vector")
display(HTML(result.to_html()))



Unnamed: 0,title,summary,authors,url,_distance
0,Unsupervised Learning via Meta-Learning,"A central goal of unsupervised learning is to acquire representations from\nunlabeled data or experience that can be used for more effective learning of\ndownstream tasks from modest amounts of labeled data. Many prior unsupervised\nlearning works aim to do so by developing proxy objectives based on\nreconstruction, disentanglement, prediction, and other metrics. Instead, we\ndevelop an unsupervised meta-learning method that explicitly optimizes for the\nability to learn a variety of tasks from small amounts of data. To do so, we\nconstruct tasks from unlabeled data in an automatic way and run meta-learning\nover the constructed tasks. Surprisingly, we find that, when integrated with\nmeta-learning, relatively simple task construction mechanisms, such as\nclustering embeddings, lead to good performance on a variety of downstream,\nhuman-specified tasks. Our experiments across four image datasets indicate that\nour unsupervised meta-learning approach acquires a learning algorithm without\nany labeled data that is applicable to a wide range of downstream\nclassification tasks, improving upon the embedding learned by four prior\nunsupervised learning methods.","[arxiv.Result.Author('Kyle Hsu'), arxiv.Result.Author('Sergey Levine'), arxiv.Result.Author('Chelsea Finn')]",http://arxiv.org/abs/1810.02334v6,34.913574
1,Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation,"Supervised deep learning-based methods yield accurate results for medical\nimage segmentation. However, they require large labeled datasets for this, and\nobtaining them is a laborious task that requires clinical expertise.\nSemi/self-supervised learning-based approaches address this limitation by\nexploiting unlabeled data along with limited annotated data. Recent\nself-supervised learning methods use contrastive loss to learn good global\nlevel representations from unlabeled images and achieve high performance in\nclassification tasks on popular natural image datasets like ImageNet. In\npixel-level prediction tasks such as segmentation, it is crucial to also learn\ngood local level representations along with global representations to achieve\nbetter accuracy. However, the impact of the existing local contrastive\nloss-based methods remains limited for learning good local representations\nbecause similar and dissimilar local regions are defined based on random\naugmentations and spatial proximity; not based on the semantic label of local\nregions due to lack of large-scale expert annotations in the\nsemi/self-supervised setting. In this paper, we propose a local contrastive\nloss to learn good pixel level features useful for segmentation by exploiting\nsemantic label information obtained from pseudo-labels of unlabeled images\nalongside limited annotated images. In particular, we define the proposed loss\nto encourage similar representations for the pixels that have the same\npseudo-label/ label while being dissimilar to the representation of pixels with\ndifferent pseudo-label/label in the dataset. We perform pseudo-label based\nself-training and train the network by jointly optimizing the proposed\ncontrastive loss on both labeled and unlabeled sets and segmentation loss on\nonly the limited labeled set. 
We evaluated on three public cardiac and prostate\ndatasets, and obtain high segmentation performance.","[arxiv.Result.Author('Krishna Chaitanya'), arxiv.Result.Author('Ertunc Erdil'), arxiv.Result.Author('Neerav Karani'), arxiv.Result.Author('Ender Konukoglu')]",http://arxiv.org/abs/2112.09645v1,35.332321
2,Universum GANs: Improving GANs through contradictions,"Limited availability of labeled-data makes any supervised learning problem\nchallenging. Alternative learning settings like semi-supervised and universum\nlearning alleviate the dependency on labeled data, but still require a large\namount of unlabeled data, which may be unavailable or expensive to acquire.\nGAN-based data generation methods have recently shown promise by generating\nsynthetic samples to improve learning. However, most existing GAN based\napproaches either provide poor discriminator performance under limited labeled\ndata settings; or results in low quality generated data. In this paper, we\npropose a Universum GAN game which provides improved discriminator accuracy\nunder limited data settings, while generating high quality realistic data. We\nfurther propose an evolving discriminator loss which improves its convergence\nand generalization performance. We derive the theoretical guarantees and\nprovide empirical results in support of our approach.","[arxiv.Result.Author('Sauptik Dhar'), arxiv.Result.Author('Javad Heydari'), arxiv.Result.Author('Samarth Tripathi'), arxiv.Result.Author('Unmesh Kurup'), arxiv.Result.Author('Mohak Shah')]",http://arxiv.org/abs/2106.09946v2,36.214127


## Full Text Search
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

LanceDB now provides **experimental** support for full text search. This is currently Python only. We plan to push the integration down to Rust in the future to make this available for JS as well.


In [12]:
!pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985



### Build FTS index for the summary
Here, we build the FTS index on the `summary` column using the Python bindings for Tantivy. You can build the index on any other text column as well. A full-text index stores information about significant words and their locations within one or more columns of a database table.
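The idea of storing words and their locations can be sketched as a small inverted index. This is a toy illustration, not how Tantivy stores its index: the summaries are made up, `tokenize` and `build_inverted_index` are hypothetical names, and a real index typically adds steps such as stop-word handling or stemming that are skipped here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical toy summaries
summaries = [
    "Self-supervised learning from unlabeled images",
    "Supervised learning needs labeled data",
]
index = build_inverted_index(summaries)

print(index["learning"])    # which documents contain the term, and where
print(index["unlabeled"])
```

Looking up a query term then becomes a dictionary access instead of a scan over every summary.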

In [13]:
# This cell might take a few mins
tbl.create_fts_index("summary")
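Full-text queries return a `score` column rather than `_distance`, and a higher score means a better match. Tantivy ranks matches with BM25 by default, so a from-scratch BM25 sketch may help in reading those scores. This is not Tantivy's code: `bm25_scores` is a hypothetical function, the summaries are made up, and the parameters `k1=1.2` and `b=0.75` are the conventional defaults, assumed here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with BM25 (assumes a non-empty list)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy summaries
summaries = [
    "Self-supervised learning from unlabeled images",
    "Supervised learning needs labeled data",
    "Adversarial attacks on neural networks",
]
scores = bm25_scores("self-supervised learning", summaries)
print([round(s, 3) for s in scores])
```

Every word in the query contributes to the score, so a natural-language question can match documents on incidental words as well as on the topic itself.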

In [14]:
# FTS over the indexed summary column
result = (
    tbl.search("What is the general idea behind self-supervised learning.")
    .limit(10)
    .to_pandas()
)

result.pop("vector")

display(HTML(result.to_html()))



Unnamed: 0,title,summary,authors,url,score
0,Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation,"Supervised deep learning-based methods yield accurate results for medical\nimage segmentation. However, they require large labeled datasets for this, and\nobtaining them is a laborious task that requires clinical expertise.\nSemi/self-supervised learning-based approaches address this limitation by\nexploiting unlabeled data along with limited annotated data. Recent\nself-supervised learning methods use contrastive loss to learn good global\nlevel representations from unlabeled images and achieve high performance in\nclassification tasks on popular natural image datasets like ImageNet. In\npixel-level prediction tasks such as segmentation, it is crucial to also learn\ngood local level representations along with global representations to achieve\nbetter accuracy. However, the impact of the existing local contrastive\nloss-based methods remains limited for learning good local representations\nbecause similar and dissimilar local regions are defined based on random\naugmentations and spatial proximity; not based on the semantic label of local\nregions due to lack of large-scale expert annotations in the\nsemi/self-supervised setting. In this paper, we propose a local contrastive\nloss to learn good pixel level features useful for segmentation by exploiting\nsemantic label information obtained from pseudo-labels of unlabeled images\nalongside limited annotated images. In particular, we define the proposed loss\nto encourage similar representations for the pixels that have the same\npseudo-label/ label while being dissimilar to the representation of pixels with\ndifferent pseudo-label/label in the dataset. We perform pseudo-label based\nself-training and train the network by jointly optimizing the proposed\ncontrastive loss on both labeled and unlabeled sets and segmentation loss on\nonly the limited labeled set. 
We evaluated on three public cardiac and prostate\ndatasets, and obtain high segmentation performance.","[arxiv.Result.Author('Krishna Chaitanya'), arxiv.Result.Author('Ertunc Erdil'), arxiv.Result.Author('Neerav Karani'), arxiv.Result.Author('Ender Konukoglu')]",http://arxiv.org/abs/2112.09645v1,7.476211
1,Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells,"Unsupervised text encoding models have recently fueled substantial progress\nin NLP. The key idea is to use neural networks to convert words in texts to\nvector space representations based on word positions in a sentence and their\ncontexts, which are suitable for end-to-end training of downstream tasks. We\nsee a strikingly similar situation in spatial analysis, which focuses on\nincorporating both absolute positions and spatial contexts of geographic\nobjects such as POIs into models. A general-purpose representation model for\nspace is valuable for a multitude of tasks. However, no such general model\nexists to date beyond simply applying discretization or feed-forward nets to\ncoordinates, and little effort has been put into jointly modeling distributions\nwith vastly different characteristics, which commonly emerges from GIS data.\nMeanwhile, Nobel Prize-winning Neuroscience research shows that grid cells in\nmammals provide a multi-scale periodic representation that functions as a\nmetric for location encoding and is critical for recognizing places and for\npath-integration. Therefore, we propose a representation learning model called\nSpace2Vec to encode the absolute positions and spatial relationships of places.\nWe conduct experiments on two real-world geographic data for two different\ntasks: 1) predicting types of POIs given their positions and context, 2) image\nclassification leveraging their geo-locations. Results show that because of its\nmulti-scale representations, Space2Vec outperforms well-established ML\napproaches such as RBF kernels, multi-layer feed-forward nets, and tile\nembedding approaches for location modeling and image classification tasks.\nDetailed analysis shows that all baselines can at most well handle distribution\nat one scale but show poor performances in other scales. 
In contrast,\nSpace2Vec's multi-scale representation can handle distributions at different\nscales.","[arxiv.Result.Author('Gengchen Mai'), arxiv.Result.Author('Krzysztof Janowicz'), arxiv.Result.Author('Bo Yan'), arxiv.Result.Author('Rui Zhu'), arxiv.Result.Author('Ling Cai'), arxiv.Result.Author('Ni Lao')]",http://arxiv.org/abs/2003.00824v1,5.525813
2,Extending the WILDS Benchmark for Unsupervised Adaptation,"Machine learning systems deployed in the wild are often trained on a source\ndistribution but deployed on a different target distribution. Unlabeled data\ncan be a powerful point of leverage for mitigating these distribution shifts,\nas it is frequently much more available than labeled data and can often be\nobtained from distributions beyond the source distribution as well. However,\nexisting distribution shift benchmarks with unlabeled data do not reflect the\nbreadth of scenarios that arise in real-world applications. In this work, we\npresent the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS\nbenchmark of distribution shifts to include curated unlabeled data that would\nbe realistically obtainable in deployment. These datasets span a wide range of\napplications (from histology to wildlife conservation), tasks (classification,\nregression, and detection), and modalities (photos, satellite images,\nmicroscope slides, text, molecular graphs). The update maintains consistency\nwith the original WILDS benchmark by using identical labeled training,\nvalidation, and test sets, as well as the evaluation metrics. On these\ndatasets, we systematically benchmark state-of-the-art methods that leverage\nunlabeled data, including domain-invariant, self-training, and self-supervised\nmethods, and show that their success on WILDS is limited. To facilitate method\ndevelopment and evaluation, we provide an open-source package that automates\ndata loading and contains all of the model architectures and methods used in\nthis paper. 
Code and leaderboards are available at https://wilds.stanford.edu.","[arxiv.Result.Author('Shiori Sagawa'), arxiv.Result.Author('Pang Wei Koh'), arxiv.Result.Author('Tony Lee'), arxiv.Result.Author('Irena Gao'), arxiv.Result.Author('Sang Michael Xie'), arxiv.Result.Author('Kendrick Shen'), arxiv.Result.Author('Ananya Kumar'), arxiv.Result.Author('Weihua Hu'), arxiv.Result.Author('Michihiro Yasunaga'), arxiv.Result.Author('Henrik Marklund'), arxiv.Result.Author('Sara Beery'), arxiv.Result.Author('Etienne David'), arxiv.Result.Author('Ian Stavness'), arxiv.Result.Author('Wei Guo'), arxiv.Result.Author('Jure Leskovec'), arxiv.Result.Author('Kate Saenko'), arxiv.Result.Author('Tatsunori Hashimoto'), arxiv.Result.Author('Sergey Levine'), arxiv.Result.Author('Chelsea Finn'), arxiv.Result.Author('Percy Liang')]",http://arxiv.org/abs/2112.05090v2,4.826763
3,Explainable Artificial Intelligence and Machine Learning: A reality rooted perspective,"We are used to the availability of big data generated in nearly all fields of\nscience as a consequence of technological progress. However, the analysis of\nsuch data possess vast challenges. One of these relates to the explainability\nof artificial intelligence (AI) or machine learning methods. Currently, many of\nsuch methods are non-transparent with respect to their working mechanism and\nfor this reason are called black box models, most notably deep learning\nmethods. However, it has been realized that this constitutes severe problems\nfor a number of fields including the health sciences and criminal justice and\narguments have been brought forward in favor of an explainable AI. In this\npaper, we do not assume the usual perspective presenting explainable AI as it\nshould be, but rather we provide a discussion what explainable AI can be. The\ndifference is that we do not present wishful thinking but reality grounded\nproperties in relation to a scientific theory beyond physics.","[arxiv.Result.Author('Frank Emmert-Streib'), arxiv.Result.Author('Olli Yli-Harja'), arxiv.Result.Author('Matthias Dehmer')]",http://arxiv.org/abs/2001.09464v1,4.770155
4,Robustness of Generalized Learning Vector Quantization Models against Adversarial Attacks,"Adversarial attacks and the development of (deep) neural networks robust\nagainst them are currently two widely researched topics. The robustness of\nLearning Vector Quantization (LVQ) models against adversarial attacks has\nhowever not yet been studied to the same extent. We therefore present an\nextensive evaluation of three LVQ models: Generalized LVQ, Generalized Matrix\nLVQ and Generalized Tangent LVQ. The evaluation suggests that both Generalized\nLVQ and Generalized Tangent LVQ have a high base robustness, on par with the\ncurrent state-of-the-art in robust neural network methods. In contrast to this,\nGeneralized Matrix LVQ shows a high susceptibility to adversarial attacks,\nscoring consistently behind all other models. Additionally, our numerical\nevaluation indicates that increasing the number of prototypes per class\nimproves the robustness of the models.","[arxiv.Result.Author('Sascha Saralajew'), arxiv.Result.Author('Lars Holdijk'), arxiv.Result.Author('Maike Rees'), arxiv.Result.Author('Thomas Villmann')]",http://arxiv.org/abs/1902.00577v2,4.715786
5,Concept Whitening for Interpretable Image Recognition,"What does a neural network encode about a concept as we traverse through the\nlayers? Interpretability in machine learning is undoubtedly important, but the\ncalculations of neural networks are very challenging to understand. Attempts to\nsee inside their hidden layers can either be misleading, unusable, or rely on\nthe latent space to possess properties that it may not have. In this work,\nrather than attempting to analyze a neural network posthoc, we introduce a\nmechanism, called concept whitening (CW), to alter a given layer of the network\nto allow us to better understand the computation leading up to that layer. When\na concept whitening module is added to a CNN, the axes of the latent space are\naligned with known concepts of interest. By experiment, we show that CW can\nprovide us a much clearer understanding for how the network gradually learns\nconcepts over layers. CW is an alternative to a batch normalization layer in\nthat it normalizes, and also decorrelates (whitens) the latent space. CW can be\nused in any layer of the network without hurting predictive performance.","[arxiv.Result.Author('Zhi Chen'), arxiv.Result.Author('Yijie Bei'), arxiv.Result.Author('Cynthia Rudin')]",http://arxiv.org/abs/2002.01650v5,4.590379
6,General Cyclical Training of Neural Networks,"This paper describes the principle of ""General Cyclical Training"" in machine\nlearning, where training starts and ends with ""easy training"" and the ""hard\ntraining"" happens during the middle epochs. We propose several manifestations\nfor training neural networks, including algorithmic examples (via\nhyper-parameters and loss functions), data-based examples, and model-based\nexamples. Specifically, we introduce several novel techniques: cyclical weight\ndecay, cyclical batch size, cyclical focal loss, cyclical softmax temperature,\ncyclical data augmentation, cyclical gradient clipping, and cyclical\nsemi-supervised learning. In addition, we demonstrate that cyclical weight\ndecay, cyclical softmax temperature, and cyclical gradient clipping (as three\nexamples of this principle) are beneficial in the test accuracy performance of\na trained model. Furthermore, we discuss model-based examples (such as\npretraining and knowledge distillation) from the perspective of general\ncyclical training and recommend some changes to the typical training\nmethodology. In summary, this paper defines the general cyclical training\nconcept and discusses several specific ways in which this concept can be\napplied to training neural networks. In the spirit of reproducibility, the code\nused in our experiments is available at \url{https://github.com/lnsmith54/CFL}.",[arxiv.Result.Author('Leslie N. Smith')],http://arxiv.org/abs/2202.08835v2,4.346595
7,Explaining Aviation Safety Incidents Using Deep Temporal Multiple Instance Learning,"Although aviation accidents are rare, safety incidents occur more frequently\nand require a careful analysis to detect and mitigate risks in a timely manner.\nAnalyzing safety incidents using operational data and producing event-based\nexplanations is invaluable to airline companies as well as to governing\norganizations such as the Federal Aviation Administration (FAA) in the United\nStates. However, this task is challenging because of the complexity involved in\nmining multi-dimensional heterogeneous time series data, the lack of\ntime-step-wise annotation of events in a flight, and the lack of scalable tools\nto perform analysis over a large number of events. In this work, we propose a\nprecursor mining algorithm that identifies events in the multidimensional time\nseries that are correlated with the safety incident. Precursors are valuable to\nsystems health and safety monitoring and in explaining and forecasting safety\nincidents. Current methods suffer from poor scalability to high dimensional\ntime series data and are inefficient in capturing temporal behavior. We propose\nan approach by combining multiple-instance learning (MIL) and deep recurrent\nneural networks (DRNN) to take advantage of MIL's ability to learn using weakly\nsupervised data and DRNN's ability to model temporal behavior. We describe the\nalgorithm, the data, the intuition behind taking a MIL approach, and a\ncomparative analysis of the proposed algorithm with baseline models. We also\ndiscuss the application to a real-world aviation safety problem using data from\na commercial airline company and discuss the model's abilities and\nshortcomings, with some final remarks about possible deployment directions.",[arxiv.Result.Author('Vijay Manikandan Janakiraman')],http://arxiv.org/abs/1710.04749v2,3.825403
8,Continual Unsupervised Representation Learning,"Continual learning aims to improve the ability of modern learning systems to\ndeal with non-stationary distributions, typically by attempting to learn a\nseries of tasks sequentially. Prior art in the field has largely considered\nsupervised or reinforcement learning tasks, and often assumes full knowledge of\ntask labels and boundaries. In this work, we propose an approach (CURL) to\ntackle a more general problem that we will refer to as unsupervised continual\nlearning. The focus is on learning representations without any knowledge about\ntask identity, and we explore scenarios when there are abrupt changes between\ntasks, smooth transitions from one task to another, or even when the data is\nshuffled. The proposed approach performs task inference directly within the\nmodel, is able to dynamically expand to capture new concepts over its lifetime,\nand incorporates additional rehearsal-based techniques to deal with\ncatastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised\nlearning setting with MNIST and Omniglot, where the lack of labels ensures no\ninformation is leaked about the task. Further, we demonstrate strong\nperformance compared to prior art in an i.i.d setting, or when adapting the\ntechnique to supervised tasks such as incremental class learning.","[arxiv.Result.Author('Dushyant Rao'), arxiv.Result.Author('Francesco Visin'), arxiv.Result.Author('Andrei A. Rusu'), arxiv.Result.Author('Yee Whye Teh'), arxiv.Result.Author('Razvan Pascanu'), arxiv.Result.Author('Raia Hadsell')]",http://arxiv.org/abs/1910.14481v1,3.443345
9,Analysis of Generalizability of Deep Neural Networks Based on the Complexity of Decision Boundary,"For supervised learning models, the analysis of generalization ability\n(generalizability) is vital because the generalizability expresses how well a\nmodel will perform on unseen data. Traditional generalization methods, such as\nthe VC dimension, do not apply to deep neural network (DNN) models. Thus, new\ntheories to explain the generalizability of DNNs are required. In this study,\nwe hypothesize that the DNN with a simpler decision boundary has better\ngeneralizability by the law of parsimony (Occam's Razor). We create the\ndecision boundary complexity (DBC) score to define and measure the complexity\nof decision boundary of DNNs. The idea of the DBC score is to generate data\npoints (called adversarial examples) on or near the decision boundary. Our new\napproach then measures the complexity of the boundary using the entropy of\neigenvalues of these data. The method works equally well for high-dimensional\ndata. We use training data and the trained model to compute the DBC score. And,\nthe ground truth for model's generalizability is its test accuracy. Experiments\nbased on the DBC score have verified our hypothesis. The DBC is shown to\nprovide an effective method to measure the complexity of a decision boundary\nand gives a quantitative measure of the generalizability of DNNs.","[arxiv.Result.Author('Shuyue Guan'), arxiv.Result.Author('Murray Loew')]",http://arxiv.org/abs/2009.07974v1,3.41993
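To compare the two approaches for the self-supervised learning query, one simple option is to measure how many papers both searches returned. The sketch below uses the URLs from the semantic search results (top 3) and the full-text results (top 10) shown in this notebook. `arxiv_id` and `overlap` are hypothetical helper names, and Jaccard similarity is just one possible overlap measure.

```python
def arxiv_id(url):
    """Extract the arXiv identifier (with version) from an entry URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def overlap(semantic_urls, fts_urls):
    """Return the IDs found by both searches and the Jaccard similarity of the two sets."""
    semantic_ids = {arxiv_id(u) for u in semantic_urls}
    fts_ids = {arxiv_id(u) for u in fts_urls}
    shared = semantic_ids & fts_ids
    union = semantic_ids | fts_ids
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# URLs copied from the two result tables for the self-supervised learning query
semantic_urls = [
    "http://arxiv.org/abs/1810.02334v6",
    "http://arxiv.org/abs/2112.09645v1",
    "http://arxiv.org/abs/2106.09946v2",
]
fts_urls = [
    "http://arxiv.org/abs/2112.09645v1",
    "http://arxiv.org/abs/2003.00824v1",
    "http://arxiv.org/abs/2112.05090v2",
    "http://arxiv.org/abs/2001.09464v1",
    "http://arxiv.org/abs/1902.00577v2",
    "http://arxiv.org/abs/2002.01650v5",
    "http://arxiv.org/abs/2202.08835v2",
    "http://arxiv.org/abs/1710.04749v2",
    "http://arxiv.org/abs/1910.14481v1",
    "http://arxiv.org/abs/2009.07974v1",
]

shared, jaccard = overlap(semantic_urls, fts_urls)
print(shared, round(jaccard, 3))
```

Only one paper appears in both lists. Note that the comparison is between a top-3 and a top-10 list, so the Jaccard value is only a rough indicator.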


## Analysing OpenCLIP embeddings on Nomic Atlas
Atlas is a platform for interacting with both small and internet scale unstructured datasets.

Atlas enables you to:
* Store, update and organize multi-million point datasets of unstructured text, images and embeddings.
* Visually interact with embeddings of your data from a web browser.
* Operate over unstructured data and embeddings with topic modeling, semantic duplicate clustering and semantic search.
* Generate high dimensional and two-dimensional embeddings of your data.

In [15]:
!pip install nomic --q

In [16]:
!nomic login

                        Authenticate with the Nomic API                        
                       https://atlas.nomic.ai/cli-login                        
 Click the above link to retrieve your access token and then run `nomic login  
                                   [token]`                                    


In [21]:
!nomic login YOUR_NOMIC_API_TOKEN  # Replace the placeholder with the token from the Nomic Atlas CLI login page

In [20]:
from nomic import atlas
import numpy as np

# Get pandas dataframe from lancedb table
df = tbl.to_pandas()

# get embeddings from df
embs = np.array(df.pop("vector").to_list())

project = atlas.map_embeddings(embeddings=embs, data=df.to_dict("records"))
print()

2023-10-15 12:45:49.237 | INFO     | nomic.project:_create_project:790 - Creating project `voracious-remark` in organization `kaushalc64`
2023-10-15 12:45:51.920 | INFO     | nomic.atlas:map_embeddings:111 - Uploading embeddings to Atlas.
1it [00:01,  1.88s/it]
2023-10-15 12:45:53.844 | INFO     | nomic.project:_add_data:1422 - Upload succeeded.
2023-10-15 12:45:53.849 | INFO     | nomic.atlas:map_embeddings:130 - Embedding upload succeeded.
2023-10-15 12:45:57.000 | INFO     | nomic.project:create_index:1132 - Created map `voracious-remark` in project `voracious-remark`: https://atlas.nomic.ai/map/9e13dcd5-15e1-4449-9005-93292f739c2c/aa195bbd-11f6-4813-8435-6468192274cc




The visualizations are interesting and worth exploring further. In a preliminary analysis, you can see that the map successfully clusters similar types of papers. Possible next steps include comparing embeddings from OpenCLIP models of different sizes and pretraining datasets.
<img width="1433" alt="Screenshot 2023-08-24 at 3 47 51 PM" src="https://github.com/lancedb/vectordb-recipes/assets/15766192/34ef88a3-2925-4450-abcd-1abc350ef3e4">