# Private AI Search with LangChain and Elasticsearch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/blog-langchain-elasticsearch/blob/main/Notebooks/Privacy_first_AI_search_using_LangChain_and_Elasticsearch.ipynb)

Motivations:
* **freshness** - there aren't enough GPUs in the world to retrain large language models for every problem, and data goes stale very fast. Semantic search can push real-time data into LLM prompts as context.
* **privacy** - sending data to the big hosted LLMs isn't really an option when that data is private or is a company's competitive advantage, big or small. Let's use a smaller local LLM that can be deployed privately, in a closed network if necessary.

First, let's set up the environment.


In [1]:
%pip install beautifulsoup4 eland elasticsearch huggingface-hub langchain==0.0.157 tqdm requests sentence_transformers torch accelerate



# Elastic Cloud and Connection Details

While you can definitely pull this colab down to run locally as a Python notebook, the simplest way to get this working is to create an Elastic cluster over at https://cloud.elastic.co/. Make sure to configure an ML node. You won't need more than the default spec for this project. Once that cluster is up and running, grab your connection info and edit the cell below before running it.

In [2]:
# Connection details for the Elasticsearch cluster
from getpass import getpass

# For Elastic Cloud, uncomment these lines and fill in your own deployment details:
# endpoint = "<your-deployment>.es.io"
# username = "elastic"
# password = getpass("Elastic password: ")
# es_url = f"https://{username}:{password}@{endpoint}:9200"

# For a local cluster without security enabled:
endpoint = "localhost"
es_url = f"http://{endpoint}:9200"

print("Using URL " + es_url)


Using URL http://localhost:9200
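One thing to watch when you paste credentials into a URL: generated passwords often contain characters like `+` or `@` that need percent-encoding. Here is a minimal sketch of a helper that builds the URL safely. The function name `build_es_url` and the host `my-deployment.example.com` are made up for illustration; they are not part of the original notebook.

```python
from urllib.parse import quote

def build_es_url(endpoint, username=None, password=None, scheme="http", port=9200):
    """Build an Elasticsearch URL, percent-encoding any credentials."""
    if username and password:
        # safe="" makes quote() escape reserved characters such as '+', '/' and '@'
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        auth = ""
    return f"{scheme}://{auth}{endpoint}:{port}"

local_url = build_es_url("localhost")
cloud_url = build_es_url("my-deployment.example.com", "elastic", "p@ss+word", scheme="https")
print(local_url)
print(cloud_url)
```

With no credentials the helper produces the same kind of URL as the cell above, so it can be swapped in without changing the rest of the notebook.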


# Scraping a small set of data from Wookieepedia

We'll keep it to three pages, covering a ship and two characters from recent TV shows, too recent for the updates to be caught by common AI data sets with a 2021 cutoff.

Check out the original article and the origin of this parsing example over at: https://towardsdatascience.com/star-wars-data-science-d32acde3432d

In [3]:
import re
import requests
from bs4 import BeautifulSoup
import pickle
import json
from tqdm import tqdm

In [4]:
scraped = {}
pages = [
    "https://starwars.fandom.com/wiki/N-1_starfighter",
    "https://starwars.fandom.com/wiki/Ahsoka_Tano",
    "https://starwars.fandom.com/wiki/Din_Djarin"]

last_number = 0
for page_url in pages:
    try:

        # Get page
        result = requests.get(page_url)
        content = result.content
        soup = BeautifulSoup(content, "html.parser")

        # Get title
        heading = soup.find('h1', id='firstHeading')
        if heading is None: continue
        heading = heading.text

        # Extract Sidebar
        is_character = False
        side_bar = {}
        sec = soup.find_all('section', class_='pi-item')
        for s in sec:
            title = s.find('h2')
            if title is None:
                title = '<no category>'
            else:
                title = title.text
            side_bar[title] = {}
            items = s.find_all('div', class_='pi-item')
            for item in items:
                attr = item.find('h3', class_='pi-data-label')
                if attr is None:
                    attr = '<no attribute>'
                else:
                    attr = attr.text
                if attr == 'Species': is_character = True
                value = re.sub(r"[\(\[].*?[\)\]]", '', '], '.join(item.find('div', class_='pi-data-value').text.split(']')))
                value = value.strip()[:-1].replace(',,', ',')
                if ',' in value:
                    value = [i.strip() for i in value.split(',') if i.strip() != '']
                side_bar[title][attr] = value

        # Raw page content
        raw_content = soup.find('div', class_='mw-parser-output')
        content_pgs = []  # stays empty if the page has no body
        if raw_content is not None:
            for raw_paragraph in raw_content.find_all('p', recursive=False):
                if 'aside' in str(raw_paragraph): continue
                # Strip parenthesised and bracketed spans such as citation markers
                content_pgs.append(re.sub(r"[\(\[].*?[\)\]]", '', raw_paragraph.text))

        # Data object
        scraped[page_url] = {
            'url': page_url,
            'title': heading,
            'is_character': is_character,
            'side_bar': side_bar,
            'paragraph': content_pgs
        }

    except Exception as e:
        # Report the reason instead of silently swallowing every error
        print(f'Failed! {page_url}: {e}')


# Save final part to disk
fn =  './starwars_small_canon_data.pickle'
with open(fn, 'wb') as f:
    pickle.dump(scraped, f, protocol=pickle.HIGHEST_PROTOCOL)
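If the regular expression in the scraper looks cryptic, here is a small standalone sketch of what it does: it deletes anything between an opening `(` or `[` and the nearest closing `)` or `]`, which is how citation markers like `[1]` get stripped. The helper name `strip_bracketed` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the scraper: a '(' or '[' up to the nearest ')' or ']'
BRACKETED = re.compile(r"[\(\[].*?[\)\]]")

def strip_bracketed(text):
    """Remove parenthesised and bracketed spans, e.g. citation markers."""
    return BRACKETED.sub("", text)

# Made-up sample text, not taken from the scraped pages
sample = "Din Djarin[1] was a human male.[2]"
cleaned = strip_bracketed(sample)
print(cleaned)
```

Note that the pattern is deliberately simple: it does not handle nested brackets, and removing a span from the middle of a sentence leaves the surrounding spaces in place.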

In [5]:
## Let's do a quick test to make sure it worked. Even if the data is big,
## we can chunk it up with the above code and load it in sections.

from pathlib import Path

bookFilePath = "starwars_*_canon_data*.pickle"
files = sorted(Path('.').glob(bookFilePath))
for fn in files:
  with open(fn,'rb') as f:
      part = pickle.load(f)
      for key, value in part.items():
          title = value['title'].strip()
          print(title)

N-1 starfighter
Ahsoka Tano
Din Djarin
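The comment in the cell above mentions chunking larger data sets. As a rough sketch of what that could look like, the helper below splits a scraped dictionary into several numbered pickle files whose names still match the `starwars_*_canon_data*.pickle` glob used by the loader. The function `save_in_parts`, the numbered file naming and the `pages_per_file` default are my own assumptions, not something the notebook defines.

```python
import pickle
import tempfile
from itertools import islice
from pathlib import Path

def save_in_parts(scraped, out_dir=".", pages_per_file=100):
    """Write `scraped` to numbered pickle files that match the loader's glob."""
    out_dir = Path(out_dir)
    items = iter(scraped.items())
    paths = []
    part_no = 0
    while True:
        # Take the next slice of pages; an empty slice means we are done
        chunk = dict(islice(items, pages_per_file))
        if not chunk:
            break
        # Hypothetical naming scheme; matches "starwars_*_canon_data*.pickle"
        path = out_dir / f"starwars_small_canon_data_{part_no:03d}.pickle"
        with open(path, "wb") as f:
            pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
        part_no += 1
    return paths

# Demo with made-up pages, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo = {f"page-{i}": {"title": f"Page {i}"} for i in range(5)}
    written = save_in_parts(demo, out_dir=tmp, pages_per_file=2)
    found = sorted(Path(tmp).glob("starwars_*_canon_data*.pickle"))
    print(len(written), len(found))
```

Because the loader sorts the glob results, zero-padded part numbers keep the files in a stable order.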


# Using LangChain to generate vectors and store in Elasticsearch

First we'll create the embeddings model

In [6]:
from langchain.embeddings import HuggingFaceEmbeddings

def setup_embeddings():
    # Huggingface embedding setup
    print(">> Prep. Huggingface embedding setup")
    model_name = "sentence-transformers/all-mpnet-base-v2"
    return HuggingFaceEmbeddings(model_name=model_name)

hf = setup_embeddings()


>> Prep. Huggingface embedding setup




Next we'll create our Elasticsearch vector store in the LangChain style:

In [7]:
from langchain.vectorstores import ElasticVectorSearch

index_name = "book_wookieepedia_small"

db = ElasticVectorSearch(embedding=hf,elasticsearch_url=es_url, index_name=index_name)
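If vector search is new to you, here is a conceptual sketch of what a vector store does with those embeddings: it scores each stored vector against the query vector and returns the closest ones. The toy example below uses cosine similarity on hand-written 3-dimensional vectors. In the real setup the scoring happens inside Elasticsearch on the 768-dimensional vectors produced by `all-mpnet-base-v2`, and the exact similarity function depends on how the index is configured, so treat this purely as an illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for real embeddings
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
ranking = rank_by_similarity(query, docs)
print(ranking)
```

The first index in the ranking plays the same role as `similar_docs[0]` later in this notebook: it is the passage that gets handed to the LLM as context.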




Here goes the load. I like how small the code is, but eventually I'd love to see more flexibility in how we model the data, as I'd like to try more hybrid search techniques.

In [8]:
from pathlib import Path

count = 0
bookFilePath = "starwars_*_canon_data*.pickle"
files = sorted(Path('.').glob(bookFilePath))
batchtext = []



for fn in files:
    print(f"Starting book: {fn}")
    with open(fn,'rb') as f:
        part = pickle.load(f)

        for ix, (key, value) in tqdm(enumerate(part.items()), total=len(part)):
            paragraphs = value['paragraph']
            for px, p in enumerate(paragraphs):
                # print(f"{ix} {px} {title}")
                batchtext.append(p)
                count = count + 1

print("")
print(len(batchtext))
db.from_texts(batchtext, embedding=hf, elasticsearch_url=es_url, index_name=index_name)

Starting book: starwars_small_canon_data.pickle


100%|██████████| 3/3 [00:00<00:00, 9612.61it/s]


649
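On the data-modelling wish above: one small step in that direction is to keep the page title and URL next to each paragraph rather than loading bare text. The sketch below flattens one scraped dictionary into parallel `texts` and `metadatas` lists. The helper name `texts_with_metadata`, the `position` field and the demo page are assumptions of mine, and I believe `from_texts` accepts a `metadatas` argument, but do check that against the LangChain version you have installed.

```python
def texts_with_metadata(part):
    """Flatten one scraped dict into parallel lists of texts and metadata."""
    texts, metadatas = [], []
    for url, page in part.items():
        for position, paragraph in enumerate(page["paragraph"]):
            if not paragraph.strip():
                continue  # skip empty paragraphs
            texts.append(paragraph)
            metadatas.append({
                "url": url,
                "title": page["title"].strip(),
                "position": position,  # index of the paragraph on its page
            })
    return texts, metadatas

# Made-up miniature of the scraped structure, for illustration only
demo_part = {
    "https://example.com/wiki/A": {"title": "A\n", "paragraph": ["First.", "  ", "Second."]},
}
texts, metadatas = texts_with_metadata(demo_part)
print(texts)
print(metadatas)

# Assumption: with a running cluster, the load could then become
# ElasticVectorSearch.from_texts(texts, embedding=hf, metadatas=metadatas,
#                                elasticsearch_url=es_url, index_name=index_name)
```

Having the title and URL stored with each passage would make it possible to show the source of an answer, or to filter by page later on.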





Now we create a prompt chain that gets the most relevant passage from Elasticsearch using a vector search, and then uses that knowledge in the prompt to the LLM.

In [None]:
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM

topic = "Star Wars"
index_name = "book_wookieepedia_small"

cache_dir = "./cache"

def getFlanLarge():
    model_id = 'google/flan-t5-large'
    print(f">> Prep. Get {model_id} ready to go")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_id, cache_dir=cache_dir)

    pipe = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=100
    )
    llm = HuggingFacePipeline(pipeline=pipe)
    return llm

def make_the_llm():
    template_informed = """
    I am a helpful AI that answers questions. When I don't know the answer I say I don't know.
    I know context: {context}
    when asked: {question}
    my response using only information in the context is: """

    prompt_informed = PromptTemplate(template=template_informed, input_variables=["context", "question"])

    llm = getFlanLarge()

    return LLMChain(prompt=prompt_informed, llm=llm)

llm_chain_informed= make_the_llm()
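To see what the model actually receives, it helps to render the prompt by hand. The sketch below fills the same template with plain `str.format`, which is my stand-in for what the prompt template does with its two input variables. The helper `render_prompt` and the placeholder strings are hypothetical.

```python
template_informed = """
    I am a helpful AI that answers questions. When I don't know the answer I say I don't know.
    I know context: {context}
    when asked: {question}
    my response using only information in the context is: """

def render_prompt(context, question):
    """Fill the template's two variables with plain string formatting."""
    return template_informed.format(context=context, question=question)

# Placeholder strings, not real retrieved passages
rendered = render_prompt("<retrieved passage>", "<user question>")
print(rendered)
```

Printing the rendered prompt is a quick way to debug odd answers: if the retrieved passage does not contain the fact you are asking about, the model has nothing to work with.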




# Time to win at Star Wars trivia

As long as the questions are about Ahsoka or Din Djarin, you can now ask questions! Here are some questions you could try. Note that ChatGPT 3.5 and GPT-4 hallucinate bad answers to these questions:

* Who helped the Mandalorian build his N-1 starfighter?
* Who trained Ahsoka when she was a youngling?

In [None]:

## how to ask a question
def ask_a_question(question):
    similar_docs = db.similarity_search(question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## Ask Local LLM context informed prompt
    informed_context= similar_docs[0].page_content
    informed_response = llm_chain_informed.run(context=informed_context,question=question)

    return informed_response

# The conversational loop


print(f'I am a trivia chat bot, ask me any question about {topic}')

while True:
    question = input("User Question >> ")
    response= ask_a_question(question)
    print(f"\tQuestion  : {question}")
    print(f"\tAnswer  : {response}")
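`ask_a_question` passes only the single best passage to the LLM. If you want to experiment with giving the model a little more to work with, one option is to join the first few hits while keeping the context short, since small models have limited input windows. The sketch below shows only the joining step, using a stand-in document class so it runs without a cluster. `FakeDoc`, `build_context` and the `max_docs` and `max_chars` defaults are all assumptions of mine.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used."""
    page_content: str

def build_context(docs, max_docs=3, max_chars=1500):
    """Join the first few passages, stopping once the character budget is hit."""
    parts, used = [], 0
    for doc in docs[:max_docs]:
        text = doc.page_content.strip()
        if not text:
            continue  # ignore empty passages
        if used + len(text) > max_chars:
            break  # adding this passage would exceed the budget
        parts.append(text)
        used += len(text)
    return "\n".join(parts)

# Made-up passages for illustration
hits = [FakeDoc("a" * 10), FakeDoc(""), FakeDoc("b" * 10), FakeDoc("c" * 10)]
context = build_context(hits, max_docs=3, max_chars=25)
print(context)

# Assumption: in the notebook this could replace `similar_docs[0].page_content`,
# e.g. informed_context = build_context(db.similarity_search(question))
```

Whether more context helps is something to test per question: extra passages can add the missing fact, but they can also distract a small model.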


Have you fallen in love with LangChain as much as I have? As a wise old Jedi once said: “That's good. You have taken your first step into a larger world.” There are lots of directions to go from here. LangChain takes much of the complexity out of prompt engineering. I know Elasticsearch has many other roles to play here as long-term memory for generative AI, so I am very excited to see what comes out of this quickly changing space.

## License

The example: `blog-langchain-elasticsearch` is available under the Apache 2.0 license.
For more details see [LICENSE](../LICENSE).