# semantic_experiments

> Just some experiments with "RAG" indexing and semantic search using common toolkits like langchain, llamaindex, P   

In [None]:
#| default_exp semantic_experiments

# Experiments on using Proposition Chunking and RAG indexing for semantic search
This notebook builds on the "Agentic Text Splitting" method from the [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) notebook. That method follows the [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/pdf/2312.06648.pdf) paper and has a prompt implementation in LangChain hub. Unfortunately, the LangChain hub is API walled, so I copied the prompt template into this notebook.

We need to handle PDFs as a combination of text, image and tabular data for the purposes of chunking and indexing. The LangChain blog has some interesting experiments on [Benchmarking RAG on tables](https://blog.langchain.dev/benchmarking-rag-on-tables/) and [Multi-modal RAG on slide decks](https://blog.langchain.dev/multi-modal-rag-template/), plus a set of notebooks on [Multi-modal eval: GPT-4 w/ multi-modal embeddings and multi-vector retriever](https://langchain-ai.github.io/langchain-benchmarks/notebooks/retrieval/multi_modal_benchmarking/multi_modal_eval.html?ref=blog.langchain.dev).

The [Unstructured IO](https://github.com/Unstructured-IO/unstructured) project provides multi-document extractors and handling for PDFs with text, tables and images. LangChain supports it through the [Unstructured File](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file?ref=blog.langchain.dev) document loaders, and there is a good example in the [Semi-structured RAG cookbook example](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb).

For retrieval, there is a [LangChain Cookbook Example](https://python.langchain.com/docs/expression_language/cookbook/retrieval) that demonstrates using an LLM with a vector store to answer questions. There is also an example of using Unstructured.IO to build knowledge graphs (KGs) in Neo4j: the [PDF-KG loader](https://github.com/Joshua-Yu/graph-rag/blob/main/unstructured-io/Unstructured-IO_PDF_KGLoader.ipynb).


In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
import os
from typing import Any
from unstructured.partition.pdf import partition_pdf
from pydantic import BaseModel

filename = "data/DoD_Data_Strategy.pdf"

# Extracts the elements from the PDF
elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=False,
    # Unstructured Helpers
    strategy="hi_res", 
    model_name="yolox",
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # hard cap of 4000 chars per chunk, attempt to start a new chunk after 3800 chars,
    # and attempt to combine sections shorter than 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Unique_categories will have unique elements
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 16,
 "<class 'unstructured.documents.elements.Table'>": 1}

In [None]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))


1
16
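
Both the tally and the table/text split above key on `str(type(element))`. As a minimal sketch of an alternative, the same two steps can be written with `collections.Counter` and `isinstance`. The classes and `sample_*` names below are hypothetical stand-ins so the snippet runs without parsing a PDF; in the notebook they would be the classes from `unstructured.documents.elements`.

```python
from collections import Counter


class _Element:
    """Hypothetical stand-in for an unstructured element holding some text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class CompositeElement(_Element):
    """Stand-in for a chunk of aggregated text."""


class Table(_Element):
    """Stand-in for an extracted table."""


sample_elements = [CompositeElement("intro"), Table("a | b"), CompositeElement("goals")]

# Tally elements by bare class name instead of the full str(type(...)) string.
category_counts = Counter(type(el).__name__ for el in sample_elements)

# Split by type with isinstance rather than substring matching on the type name.
sample_tables = [str(el) for el in sample_elements if isinstance(el, Table)]
sample_texts = [str(el) for el in sample_elements if isinstance(el, CompositeElement)]

print(dict(category_counts))
print(len(sample_tables), len(sample_texts))
```

Using `isinstance` also picks up subclasses, which substring matching on the type name would only do by accident.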


In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

load_dotenv()

True

In [None]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [None]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [None]:
# Apply to texts
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
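
The summaries are only useful for retrieval if each one can be traced back to the raw chunk it came from. The Semi-structured RAG cookbook linked in the introduction does this with a multi-vector retriever; the snippet below is only a plain-Python sketch of that idea, not LangChain's implementation. The function name `link_summaries` and the record layout are assumptions made for illustration.

```python
import uuid


def link_summaries(raw_chunks, summaries):
    """Pair each summary with its source chunk under a shared ID."""
    if len(raw_chunks) != len(summaries):
        raise ValueError("expected exactly one summary per chunk")
    doc_ids = [str(uuid.uuid4()) for _ in raw_chunks]
    # What would be embedded and searched: the summary plus a pointer to its source.
    index_records = [
        {"doc_id": doc_id, "summary": summary}
        for doc_id, summary in zip(doc_ids, summaries)
    ]
    # What would be handed to the LLM after a hit: the raw chunk, looked up by ID.
    docstore = dict(zip(doc_ids, raw_chunks))
    return index_records, docstore


records, store = link_summaries(
    ["raw table text", "raw paragraph text"],
    ["table summary", "paragraph summary"],
)
hit = records[0]  # pretend the search returned the first summary
print(store[hit["doc_id"]])
```

In the notebook, `tables + texts` and `table_summaries + text_summaries` would be the natural inputs.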

In [None]:
table_summaries

['This table provides an overview of the different sections and goals outlined in the document. It includes sections on problem statement, scope, vision statement, guiding principles, architecture, standards, governance, talent and culture, and goals related to making data visible, accessible, understandable, linked, trustworthy, interoperable, and secure. The document also mentions strengthened governance and focus areas.']

In [None]:
text_summaries

['The Department of Defense (DoD) has released its Data Strategy, which aims to transform the department into a data-centric enterprise. The strategy focuses on areas such as joint operations, decision support, and business analytics. It also outlines eight guiding principles, including treating data as a strategic asset and ensuring data ethics. The strategy highlights four essential capabilities, including architecture and talent/culture. The DoD has set seven goals to become a data-centric organization, including making data visible, accessible, and understandable for consumers.',
 "The DoD Data Strategy aims to improve data management in order to enhance the Department's ability to fight and win wars. The strategy includes making data linked, trustworthy, interoperable, and secure. To implement the strategy, measurable Data Strategy Implementation Plans will be developed and overseen by the DoD CDO and DoD Data Council. The data governance community and user communities will also c

In [None]:
PromptTemplate = ChatPromptTemplate.from_messages(
    [
                (
                    "system",
                    """
                    Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of context.
                        1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.
                        2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
                        3. De-contextualize the proposition by adding necessary modifier to nouns or entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the entities they refer to.
                        4. Present the results as a list of strings, formatted in JSON.

                    Example:

                        Input: Title: Ēostre. Section: Theories and interpretations, Connection to Easter Hares. Content: The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in 1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
                        origin of the colored eggs hidden there for children. Alternatively, there is a European tradition that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and both occur on grassland and are first seen in the spring. In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe. German immigrants then exported the custom to Britain and America where it evolved into the Easter Bunny."
                        Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in 1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about the possible explanation for the connection between hares and the tradition during Easter", "Hares
                        were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation for the origin of the colored eggs hidden in gardens for children.", 
                        "There is a European tradition that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both hares and lapwing’s nests occur on grassland and are first seen in the spring.", 
                        "In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to Britain and America.", 
                        "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in Britain and America."]
                    """,
                ),
                ("user", "Decompose the following:\n{input}\n"),
            ]
)

In [None]:
llm = ChatOpenAI(model='gpt-4-1106-preview', openai_api_key=os.getenv("OPENAI_API_KEY", 'YourKey'))



In [None]:
# use it in a runnable
runnable = PromptTemplate | llm
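
The prompt asks the model to return the propositions as a JSON list of strings, so the reply still has to be parsed before it can be indexed. The helper below is a minimal sketch of that step. `parse_propositions` is a hypothetical name, and the assumption is that the reply text (for example the `content` of the message returned by the runnable) contains exactly one JSON list, possibly surrounded by extra prose.

```python
import json
from typing import List


def parse_propositions(raw: str) -> List[str]:
    """Extract a JSON list of strings from a model reply."""
    # Tolerate leading or trailing prose by slicing from the first "[" to the last "]".
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in model output")
    items = json.loads(raw[start : end + 1])
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("expected a JSON list of strings")
    # Drop empty entries and surrounding whitespace.
    return [i.strip() for i in items if i.strip()]


sample_reply = (
    "Here are the propositions:\n"
    '["Richard Sermon was a scholar.", "Hares were frequently seen in gardens in spring."]'
)
propositions = parse_propositions(sample_reply)
print(len(propositions))
```

Raising on malformed output keeps bad replies from silently entering the index; a retry could be layered on top.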

In [None]:
from typing import List

from langchain.chains import create_extraction_chain_pydantic

# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]

# Extraction
extraction_chain = create_extraction_chain_pydantic(pydantic_schema=Sentences, llm=llm)

In [None]:
import ollama
response = ollama.chat(model='llama2', messages=[
  {
    'role': 'user',
    'content': 'Why are roses red?',
  },
])
print(response['message']['content'])


Roses are red because of the way their petals are structured and the pigments they contain. The red color of roses is due to the presence of a pigment called anthocyanin, which is a type of flavonoid. Anthocyanin is responsible for the red, purple, and blue colors found in many flowers, fruits, and vegetables.

The structure of the rose petal itself also plays a role in its color. The outer layer of the petal, called the epidermis, contains cells that are filled with anthocyanin. These cells are arranged in layers, with the largest amount of pigment found in the outermost layer. This layer is exposed to the sun's light, which causes the anthocyanin to absorb certain wavelengths of light and reflect others, giving the rose its red color.

The exact shade of red can vary depending on the type of rose and other factors such as the amount of sunlight it receives, the soil it's grown in, and the presence of other pigments. Some roses may have a more orange or yellow tint to their red, whil

In [None]:
ollama.embeddings(model='llama2', prompt='The sky is blue because of Rayleigh scattering')
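
Embedding vectors become useful for semantic search once they are compared with each other. The snippet below is a minimal sketch of ranking documents by cosine similarity. The two-dimensional vectors are made-up toy values, not real model output; in practice each vector would come from the embedding call above, assuming its response exposes the vector under an `embedding` key.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy vectors standing in for real embeddings.
toy_docs = {"sky": [0.9, 0.1], "roses": [0.1, 0.9]}
ranking = rank([1.0, 0.0], toy_docs)
print(ranking)
```

A vector store does the same comparison at scale, usually with an approximate nearest-neighbor index rather than a full scan.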


In [None]:
from langchain_experimental.llms.ollama_functions import OllamaFunctions

model = OllamaFunctions(model="dolphin-mixtral")

In [None]:
model = model.bind(
    functions=[
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, " "e.g. San Francisco, CA",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        }
    ],
    function_call={"name": "get_current_weather"},
)

In [None]:
import json
import time

n_runs = 25
success_count = 0
exception_count = 0
total_time = 0

for _ in range(n_runs):
    start_time = time.time()
    try:
        response = model.invoke("what is the weather in Boston in fahrenheit?")
        success_count += 1
    except json.JSONDecodeError:
        exception_count += 1
    end_time = time.time()
    total_time += end_time - start_time

print("Success count:", success_count)
print("Exception count:", exception_count)
print("Total time:", total_time)
# Average over the number of runs actually made
print("Average time:", round(total_time / n_runs, 2))

model.invoke("what is the weather in Boston?")

Success count: 25
Exception count: 0
Total time: 63.302969217300415
Average time: 2.53


AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{"location": "Boston, MA"}'}})
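
The model only returns the name of the function and its arguments as a JSON string; calling the function is left to the caller. The snippet below is a minimal sketch of that dispatch step, using a plain dict shaped like the `additional_kwargs` shown above. The `get_current_weather` stub, the `TOOLS` registry and `dispatch` are hypothetical and exist only for illustration.

```python
import json


def get_current_weather(location, unit="fahrenheit"):
    """Hypothetical stub; a real version would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


# Registry mapping function names the model may emit to Python callables.
TOOLS = {"get_current_weather": get_current_weather}


def dispatch(additional_kwargs):
    """Run the function named in a function_call payload, if there is one."""
    call = additional_kwargs.get("function_call")
    if call is None:
        return None
    func = TOOLS[call["name"]]  # raises KeyError for an unknown function name
    arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return func(**arguments)


sample_kwargs = {
    "function_call": {
        "name": "get_current_weather",
        "arguments": '{"location": "Boston, MA"}',
    }
}
result = dispatch(sample_kwargs)
print(result)
```

Because the arguments are model-generated text, `json.loads` can fail here as well, which is the same failure mode the timing loop above counts.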

In [None]:
#| hide
#import nbdev; nbdev.nbdev_export()