<a href="https://colab.research.google.com/github/hiwei93/embeddings-practice/blob/main/Embedding_APP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building an Application Based on Embeddings

Run semantic search over the [Hugging Face Daily Papers](https://huggingface.co/papers) from the past 10 days using text embeddings.


Application Input:

1. Papers from the past 10 days
2. Selected embedding model
3. Vector store: FAISS or Chroma
4. Enter your question
5. Number of results


In [None]:
!pip install sentence_transformers
!pip install tqdm
!pip install datasets
!pip install faiss-gpu
!pip install gradio


In [None]:
import gradio as gr

In [None]:
embedding_models = [
    "moka-ai/m3e-base",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-base",
    "intfloat/multilingual-e5-large",
    "sentence-transformers/LaBSE",
]

middlewares = [
    "FAISS",
    "Chroma"
]

In [None]:
import requests

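def get_daily_papers(date_str):
    # Fetch the papers listed for one day (date_str is YYYY-MM-DD).
    url = f"https://huggingface.co/api/daily_papers?date={date_str}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()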


from datetime import date, timedelta
from tqdm import tqdm

def load_daily_papers():
    # Collect the papers listed for each of the last `days` days, newest first.
    days = 10
    today = date.today()

    papers = []
    for i in tqdm(range(days)):
        d = today - timedelta(days=i)
        date_str = d.isoformat()
        for paper in get_daily_papers(date_str):
            metadata = paper['paper']
            # Keep only the fields needed for search and display.
            p = {k: metadata[k] for k in ('id', 'title', 'summary')}
            p['date'] = date_str
            papers.append(p)
    return papers
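
The loop above keeps only `id`, `title`, and `summary` from each API entry and stamps the record with the requested date. A minimal offline sketch of that shaping step follows, assuming each entry nests its metadata under a `paper` key as the code above expects. The `shape_paper` helper and the sample payload are hypothetical, made up for illustration.

```python
def shape_paper(entry, date_str):
    """Reduce one API entry to the fields the search app uses."""
    metadata = entry["paper"]
    record = {key: metadata[key] for key in ("id", "title", "summary")}
    record["date"] = date_str
    return record


# Made-up payload that mimics the assumed response shape.
sample_entry = {
    "paper": {
        "id": "0000.00000",
        "title": "An Example Paper",
        "summary": "A short abstract.",
        "authors": ["ignored"],  # extra fields are dropped
    }
}

record = shape_paper(sample_entry, "2023-10-01")
print(record)
```

Keeping the shaping logic in a small pure function makes it possible to check the record layout without calling the API.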

In [None]:
from abc import ABC

class SearchStrategy(ABC):
    pass
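
`SearchStrategy` is left empty in the notebook, and the storage choice from the UI is not yet wired into the search. One way the abstraction could be filled in is sketched below, assuming a two-method interface (`add` and `search`). The class names `SearchStrategySketch` and `BruteForceStrategy` and both method names are hypothetical, and plain Python lists stand in for FAISS or Chroma.

```python
from abc import ABC, abstractmethod
import math


class SearchStrategySketch(ABC):
    """Hypothetical interface a vector-store backend could implement."""

    @abstractmethod
    def add(self, vectors, payloads):
        """Store vectors together with the items they describe."""

    @abstractmethod
    def search(self, query, k):
        """Return the k closest (distance, payload) pairs."""


class BruteForceStrategy(SearchStrategySketch):
    """Exhaustive Euclidean search; a stand-in, not FAISS or Chroma."""

    def __init__(self):
        self._items = []

    def add(self, vectors, payloads):
        self._items.extend(zip(vectors, payloads))

    def search(self, query, k):
        scored = [(math.dist(query, vec), payload) for vec, payload in self._items]
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        return scored[:k]


store = BruteForceStrategy()
store.add([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], ["a", "b", "c"])
hits = store.search([0.9, 0.9], k=2)
print([payload for _, payload in hits])
```

A FAISS-backed or Chroma-backed class could implement the same two methods, so the search function would only depend on the interface.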

In [None]:
from sentence_transformers import SentenceTransformer

class EmbeddingModel():
    # Class-level cache holding the most recently loaded model.
    model = None
    model_name = None

    @classmethod
    def get_instance(cls, model_name):
        # Reload only when a different model is requested.
        if model_name != cls.model_name:
            cls.model = SentenceTransformer(model_name)
            cls.model_name = model_name
        return cls.model

def get_embeddings(model, text_list):
    return model.encode(text_list, convert_to_numpy=True)
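
`EmbeddingModel.get_instance` is meant to avoid reloading a model when the same name is requested again. A minimal sketch of the same caching idea with `functools.lru_cache` follows, assuming that keeping a single model in memory is enough. `load_model` is a hypothetical stand-in that records calls instead of constructing a `SentenceTransformer`.

```python
from functools import lru_cache

load_calls = []  # records every real "load" so the caching is observable


def load_model(model_name):
    # Hypothetical stand-in for an expensive model load.
    load_calls.append(model_name)
    return {"name": model_name}


@lru_cache(maxsize=1)  # keep only the most recently requested model
def get_model(model_name):
    return load_model(model_name)


first = get_model("model-a")
second = get_model("model-a")  # served from the cache, no second load
third = get_model("model-b")   # different name, triggers a new load
print(load_calls)
```

With `maxsize=1`, switching models evicts the previous one, which mirrors the single-slot cache in the class above.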

In [None]:
from datasets import Dataset
import pandas as pd


def compose_content(paper):
    return {
        "text": "\n".join([f"{field}: {paper[field]}" for field in ('id', 'title', 'summary')])
    }

def do_search(embedding_model, middleware, query, progress=gr.Progress()):
    # NOTE: `middleware` is accepted but not used yet; the search always goes through FAISS.
    papers = load_daily_papers()
    paper_dataset = Dataset.from_list(papers)
    paper_dataset = paper_dataset.map(compose_content)
    progress(0.2, desc="load papers data")
    model = EmbeddingModel.get_instance(embedding_model)
    progress(0.5, desc="prepare model")
    paper_dataset = paper_dataset.map(
        lambda x: {"embeddings": get_embeddings(model, x["text"])}
    )
    progress(0.7, desc="generate embeddings")
    # device=0 builds the index on the first GPU (requires faiss-gpu).
    paper_dataset.add_faiss_index(column="embeddings", device=0)

    question_embedding = get_embeddings(model, query)
    scores, samples = paper_dataset.get_nearest_examples(
        "embeddings", question_embedding, k=5
    )
    progress(0.9, desc="do search")

    samples_df = pd.DataFrame.from_dict(samples)
    samples_df["scores"] = scores
    samples_df = samples_df.drop(columns=['embeddings', 'text'])
    # Assuming the default flat FAISS index (L2 distance), smaller scores are closer matches.
    samples_df.sort_values("scores", ascending=True, inplace=True)
    return samples_df
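
How the results should be sorted depends on what the scores mean. Assuming the index returns squared L2 distances (worth verifying for the index type actually in use), a smaller score means a closer match, so the best hit comes first in ascending order. A toy illustration with made-up 2-D vectors and a hypothetical `rank_by_l2` helper:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_l2(query, vectors, k):
    """Return the k (distance, index) pairs with the smallest distance."""
    scored = [(squared_l2(query, vec), idx) for idx, vec in enumerate(vectors)]
    return sorted(scored)[:k]  # ascending: closest vector first


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
ranked = rank_by_l2([1.0, 0.0], vectors, k=3)
print(ranked)
```

If the index were built with an inner-product metric instead, larger scores would indicate better matches and the order would need to be reversed.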

In [None]:
with gr.Blocks() as demo:
    gr.Markdown("# Semantic search over Hugging Face Daily Papers from the past 10 days")
    with gr.Row():
        with gr.Column():
            embedding_model = gr.Dropdown(choices=embedding_models, label="embedding_model")
            middleware = gr.Dropdown(choices=middlewares, label="embedding_storage")
            query = gr.Textbox(label="question")
            btn = gr.Button(value="Search", variant="primary")
        with gr.Column():
            search_result = gr.Dataframe(
                label="search result",
                # Match the column order of the DataFrame returned by do_search.
                headers=["id", "title", "summary", "date", "scores"],
                wrap=True,
            )
    btn.click(fn=do_search, inputs=[embedding_model, middleware, query], outputs=search_result)

demo.queue(concurrency_count=2).launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://3fa52ef4470ab8eece.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)



