# LangChain <--> Elasticsearch

Elasticsearch is an open source, distributed, RESTful search and analytics engine, scalable data store, and vector database. As the heart of the Elastic Stack, it stores your data centrally for fast search, fine-tuned relevancy, and analytics that scale.
Elasticsearch can store and index many kinds of data, including structured and unstructured text, numerical data, and geospatial data. It is known for its ability to answer queries quickly over large volumes of unstructured data.
Elasticsearch uses an inverted index, which works like the index at the back of a book: it maps each term to the documents that contain it. This lets a query find matching documents without scanning every document in full.
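
To make the book-index analogy concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration of the idea, not how Elasticsearch is implemented internally; the function names and the three sample documents are invented for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    postings = [set(index.get(token, [])) for token in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Invented sample documents, keyed by document id.
docs = {
    1: "Software developer living in Tanjong Pagar",
    2: "Nurse living in Bedok",
    3: "Software tester living in Bedok",
}
inverted = build_inverted_index(docs)
print(search(inverted, "software bedok"))  # prints [3]
```

The lookup touches only the posting lists for "software" and "bedok", never the document text itself, which is the property the analogy describes.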

- https://www.elastic.co/search-labs/blog/langchain-collaboration
- https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
- https://python.langchain.com/docs/integrations/vectorstores/elasticsearch/
- https://www.elastic.co/blog/elasticsearch-is-open-source-again
- https://www.elastic.co/search-labs/blog/category/generative-ai


In [None]:
! pip install -r requirements.txt -q

# Install Elasticsearch with Docker

- docker network create elastic
- docker pull docker.elastic.co/elasticsearch/elasticsearch:8.15.3
- docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:8.15.3

In [None]:
import os
from dotenv import dotenv_values

In [None]:
config = dotenv_values("./keys/.env")

In [None]:
import itertools
import json
import os
import tempfile
import time

import vertexai
from google.oauth2 import service_account
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter


In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.schema import HumanMessage

In [None]:
from elasticsearch import Elasticsearch, helpers



In [None]:
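es_endpoint = "http://127.0.0.1:9200"
try:
    es_client = Elasticsearch(
        es_endpoint,
        # api_key=os.environ.get("ELASTIC_API_KEY")
    )
    # The constructor does not open a connection, so ping to verify the node is reachable.
    if not es_client.ping():
        raise ConnectionError(f"Elasticsearch is not reachable at {es_endpoint}")
except Exception as e:
    print(f"No Client: {e}")
    es_client = None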

In [None]:
index = "langchain-demo"
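
The notebook imports `helpers` but does not show how the `langchain-demo` index was populated. A minimal sketch of one way to do it follows, assuming the profiles are available as a list of dictionaries that match the schema used later. The function name `build_bulk_actions`, the choice of `nric` as document id, and the two sample profiles are assumptions made for illustration, not taken from the original data.

```python
def build_bulk_actions(profiles, index_name):
    """Yield one bulk 'index' action per profile, using the NRIC as the document id."""
    for profile in profiles:
        yield {
            "_index": index_name,
            "_id": profile["nric"],
            "_source": profile,
        }


# Invented placeholder profiles; real documents would carry the full schema.
sample_profiles = [
    {"nric": "S0000001A", "name": "Alice Tan", "age": 31, "gender": "Female"},
    {"nric": "S0000002B", "name": "Bob Lim", "age": 42, "gender": "Male"},
]
actions = list(build_bulk_actions(sample_profiles, "langchain-demo"))
print(actions[0]["_id"])

# With a reachable client, the actions could be sent in one request (not run here):
# from elasticsearch import helpers
# helpers.bulk(es_client, actions)
```

Using a stable id such as the NRIC means that re-running the load overwrites documents instead of duplicating them.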

In [None]:
prompt='''

The document schema for the profiles is as follows:

{
  "nric": "string",
  "name": "string",
  "race": "string",
  "gender": "string",
  "date_of_birth": "date",
  "age": "integer",
  "country_of_birth": "string",
  "citizenship": "string",
  "religion": "string" ["Buddhism", "Christianity", "Islam", "Hinduism", "Taoism", "No Religion"],
  "marital_status": "string" ["Single", "Married", "Divorced", "Separated", "Widowed", "Civil Partnership", "Domestic Partnership", "Engaged", "Annulled"],
  "address": {
    "block": "string",
    "street_no": "string",
    "street": "string",
    "unit": "string",
    "town": "string",
    "postal_code": "string"
  },
  "phone_number": "string",
  "email": "string",
  "occupation": "string",
  "cpf_number": "string",
  "education": {
    "highest_qualification": "string",
    "institution": "string"
  },
  "languages": {
    "spoken": {"language":"fluency" ["Basic", "Conversational", "Fluent", "Native"]},
    "written": {"language":"fluency" ["Basic", "Conversational", "Fluent", "Native"]},
  },
  "height_cm": "integer",
  "weight_kg": "integer",
  "blood_type": "string" ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"],
  "passport_number": "string",
  "drivers_license_number": "string",
  "national_service": {
    "status": "string",
    "rank": "string"
  },
  "immigration_status": "string",
  "emergency_contact": {
    "name": "string",
    "relationship": "string",
    "phone_number": "string"
  },
  "deceased": "boolean",
  "date_of_death": "date"
}

-----------------------------------------------------------------------------------
Example query 1:
User: Find all male Singapore citizens between 25 and 30 years old who work as software developers and speak fluent English.

Your response should be:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "gender": "Male" } },
        { "match": { "citizenship": "Singapore Citizen" } },
        { "range": { "age": { "gte": 25, "lte": 30 } } },
        { "match": { "occupation": "Software Developer" } },
        {
          "match": {
            "languages.spoken.English": {
              "query": "Fluent",
              "fuzziness": "AUTO"
            }
          }
        }
      ],
      "minimum_should_match": 2
    }
  }
}


Consider using multi_match for fields that might contain the value in different subfields:
{
  "multi_match": {
    "query": "Software Developer",
    "fields": ["occupation", "job_title", "role"],
    "type": "best_fields",
    "fuzziness": "AUTO"
  }
}

For names or other fields where word order matters, you might want to use match_phrase with slop:
{
  "match_phrase": {
    "full_name": {
      "query": "John Doe",
      "slop": 1
    }
  }
}

- When dealing with queries that involve categories, groups, or regions (such as language families, geographical areas, or professional fields), expand the search to include all relevant specific instances. For example, if asked about Slavic languages, include searches for Russian, Polish, Czech, etc. If asked about people from Europe, include searches for various European countries.
- [match] query does not support [slop]
- [match] query does not support [qeury]
- Generate a JSON query for Elasticsearch. Provide only the raw JSON without any surrounding tags or markdown formatting, because we need to convert your response to an object. 
- Be extremely precise in creating a correct JSON string without any extra payload.
- Use a lenient approach with 'should' clauses instead of strict 'must' clauses. Include a 'minimum_should_match' parameter to ensure some relevance while allowing flexibility. Avoid using 'must' clauses entirely.
- All queries must be lowercase.
- Use 'match' queries instead of 'term' queries to allow for partial matches and spelling variations. Where appropriate, include fuzziness parameters to further increase tolerance for spelling differences. 
- For name fields or other phrases where word order matters, consider using 'match_phrase' with a slop parameter. Use 'multi_match' for fields that might contain the value in different subfields.
- Create a query that most closely satisfies what the user is requesting.
- Let's think step by step.
'''
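
The prompt asks the model for a lenient `bool` query: `should` clauses, a `minimum_should_match` value, and no `must` clauses. Because a small model may not always comply, a generated query can be checked before it is sent to Elasticsearch. The sketch below is one possible check; `check_query_rules` is a hypothetical helper and covers only these three rules.

```python
def check_query_rules(es_query):
    """Return a list of rule violations for a generated query (empty if none)."""
    query = es_query.get("query") if isinstance(es_query, dict) else None
    bool_clause = query.get("bool") if isinstance(query, dict) else None
    if not isinstance(bool_clause, dict):
        return ["top-level query is not a bool query"]

    problems = []
    if "must" in bool_clause:
        problems.append("uses a 'must' clause")
    if not bool_clause.get("should"):
        problems.append("has no 'should' clauses")
    if "minimum_should_match" not in bool_clause:
        problems.append("missing 'minimum_should_match'")
    return problems


good = {
    "query": {
        "bool": {
            "should": [{"match": {"gender": "male"}}],
            "minimum_should_match": 1,
        }
    }
}
bad = {"query": {"bool": {"must": [{"match": {"gender": "male"}}]}}}

print(check_query_rules(good))
print(check_query_rules(bad))
```

An empty list means the query follows the three rules. It says nothing about whether the field names exist in the index.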

In [None]:

# ! pip install langchain-ollama "ollama==0.4.2"  -q

In [None]:
import json
import re

def extract_json(text):
    """
    Extracts the JSON content from a string, 
    handling cases with or without ```json\n markers.

    Args:
        text: The input string containing the JSON content.

    Returns:
        The extracted JSON string, or None if no match is found or invalid JSON.
    """
    try:
        # First, try to load the entire string as JSON
        json_content = json.loads(text)
    except json.JSONDecodeError:
        # If the entire string is not valid JSON, look for a fenced json block
        pattern = r"```json\n(.*?)\n```"
        match = re.search(pattern, text, re.DOTALL)
        if not match:
            return None
        try:
            # Validate the extracted block so that invalid JSON yields None, as documented
            json_content = json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return json.dumps(json_content, indent=2)  # Reformat for consistent output

In [None]:
# Elasticsearch query model

from langchain_ollama import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
import pprint

# LLM
llm = ChatOllama(model="qwen2.5:3b", temperature=0.9, num_ctx=4096, format="json")

prompt2 = PromptTemplate(
    template="""Your task is to interpret user questions about personal profiles and generate the appropriate Elasticsearch query in JSON format.
               Below are the Elasticsearch schema, some examples, and instructions for generating the query:
               Context: {prompt}

               User question: {question}
                """,
    input_variables=["question", "prompt"],
)

elastic_llm = prompt2 | llm 


# Qwen 2.5 3B

In [None]:
question = "All men over the age of 35 who work in IT-related jobs and live in Tanjong Pagar"

response = elastic_llm.invoke({"question": question, "prompt": prompt})



In [None]:
response.content


In [None]:
es_query = json.loads(response.content)
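
`json.loads` raises an exception if the model returns malformed JSON, which can happen with a small model at a high temperature. One way to handle this is to retry a few times, as in the sketch below. `generate_query` and its `ask_model` parameter are hypothetical names: `ask_model` stands for any callable that takes a question and returns the raw model text. A fake model is used here so the example runs without Ollama.

```python
import json


def generate_query(ask_model, question, max_attempts=3):
    """Call ask_model(question) until it returns a JSON object; return the parsed dict."""
    last_error = None
    for _ in range(max_attempts):
        raw = ask_model(question)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("model returned JSON that is not an object")
    raise ValueError(f"no valid query after {max_attempts} attempts") from last_error


# Fake model for illustration: the first reply is invalid, the second is valid JSON.
replies = iter(["not json", '{"query": {"match_all": {}}}'])


def fake_model(question):
    return next(replies)


print(generate_query(fake_model, "any question"))
```

With the chain defined in this notebook, `ask_model` could be `lambda q: elastic_llm.invoke({"question": q, "prompt": prompt}).content`.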

In [None]:
search_results = es_client.search(index=index, body=es_query)

total_hits = search_results['hits']['total']['value']
print(f"Total matches: {total_hits}")

for hit in search_results['hits']['hits']:
    print(f"Score: {hit['_score']}")
    print(f"Name: {hit['_source']['name']}")
    print(f"Age: {hit['_source']['age']}")
    print(f"Gender: {hit['_source']['gender']}")
    print(f"Citizenship: {hit['_source']['citizenship']}")
    print(f"Occupation: {hit['_source']['occupation']}")
    print(f"Address: {hit['_source']['address']}")
    print("---")
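
The same result-printing loop is repeated for every question in this notebook, with only the displayed fields changing. A small helper could factor it out. This is a sketch: `format_hits` is a hypothetical name, and the sample response is invented and only mimics the structure of a search response.

```python
def format_hits(search_results, fields):
    """Return printable lines for each hit: the score plus the selected _source fields."""
    lines = [f"Total matches: {search_results['hits']['total']['value']}"]
    for hit in search_results["hits"]["hits"]:
        lines.append(f"Score: {hit['_score']}")
        for field in fields:
            # .get avoids a KeyError when a document lacks the field
            lines.append(f"{field}: {hit['_source'].get(field, 'n/a')}")
        lines.append("---")
    return lines


# Invented response with the same nesting as an Elasticsearch search result.
sample_results = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [{"_score": 2.5, "_source": {"name": "Example Person", "age": 40}}],
    }
}

for line in format_hits(sample_results, ["name", "age", "occupation"]):
    print(line)
```

Returning lines instead of printing directly keeps the helper easy to test and reuse.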

# Qwen 2.5 3B

In [None]:
question = "Men who are no longer alive, who are universal blood donors and were born in Singapore"

response = elastic_llm.invoke({"question": question, "prompt": prompt})

es_query = json.loads(response.content)
pprint.pprint(es_query)

In [None]:
search_results = es_client.search(index=index, body=es_query)

total_hits = search_results['hits']['total']['value']
print(f"Total matches: {total_hits}")

for hit in search_results['hits']['hits']:
    print(f"Score: {hit['_score']}")
    print(f"Name: {hit['_source']['name']}")
    print(f"Blood Type: {hit['_source']['blood_type']}")
    print(f"Gender: {hit['_source']['gender']}")
    print(f"Country of Birth: {hit['_source']['country_of_birth']}")
    print(f"Deceased: {hit['_source']['deceased']}")
    print("---")
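
For comparison with the model output, here is one plausible hand-written query for this question, following the rules in the prompt. It is a sketch under assumptions: "universal blood donor" is read as blood type O-, "not alive" as `deceased` being true, and `minimum_should_match` is set to 3 arbitrarily. Whether a `match` on `blood_type` distinguishes "O-" from "O+" depends on how the field is mapped in the index, which this notebook does not show.

```python
import json


def universal_donor_query():
    """Build a lenient bool query for deceased male O- donors born in Singapore."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"gender": "male"}},
                    {"match": {"deceased": True}},
                    {"match": {"blood_type": "O-"}},
                    {"match": {"country_of_birth": "singapore"}},
                ],
                # Assumed threshold: at least three of the four conditions must hold.
                "minimum_should_match": 3,
            }
        }
    }


expected_query = universal_donor_query()
print(json.dumps(expected_query, indent=2))
```

Comparing the fields and values here with the generated `es_query` is a quick way to see whether the model expanded "universal blood donors" into a concrete blood type.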

# Qwen 2.5 3B

In [None]:
question = "People with height equal to 175 centimeters"

response = elastic_llm.invoke({"question": question, "prompt": prompt})

es_query = json.loads(response.content)


In [None]:
pprint.pprint(es_query)

In [None]:
search_results = es_client.search(index=index, body=es_query)

total_hits = search_results['hits']['total']['value']
print(f"Total matches: {total_hits}")

for hit in search_results['hits']['hits']:
    print(f"Score: {hit['_score']}")
    print(f"Name: {hit['_source']['name']}")
    print(f"Height (cm): {hit['_source']['height_cm']}")
    print("---")

In [None]:
question = "Divorced women who weigh less than 80 kg and are taller than 170 cm"

response = elastic_llm.invoke({"question": question, "prompt": prompt})

es_query = json.loads(response.content)


In [None]:
pprint.pprint(es_query)

In [None]:
search_results = es_client.search(index=index, body=es_query)

total_hits = search_results['hits']['total']['value']
print(f"Total matches: {total_hits}")

for hit in search_results['hits']['hits']:
    print(f"Score: {hit['_score']}")
    print(f"Name: {hit['_source']['name']}")
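    print(f"Gender: {hit['_source']['gender']}")
    print(f"Marital status: {hit['_source']['marital_status']}")
    print(f"Height (cm): {hit['_source']['height_cm']}")
    print(f"Weight (kg): {hit['_source']['weight_kg']}")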
    print("---")