# Data Connection

## 1. Data Loader - Load a PDF document (2 ways)
## 2. Document transformers - multiple options
## 3. Embeddings from OpenAI
## 4. Vector Store - Chroma DB, store
## 5. Vector Store - Retrieval

## 1A. Ways to load a PDF document

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf


### 1Aa - Using PyPDF

PyPDF is a popular Python library for working with PDF files. It can extract text, metadata, and other information from PDF documents. Here's how LangChain uses PyPDF:

- **Loading PDFs**: PyPDF is used to load PDF documents into an array of documents, where each document contains the page content and metadata with the page number.
- **Installation**: PyPDF can be installed using `pip install pypdf`.
- **Usage**: `PyPDFLoader` loads the PDF and splits it into pages; the text of each page is available through its `page_content` attribute.
- **Advantage**: An advantage of using PyPDF is that documents can be retrieved with page numbers, allowing for easy navigation and reference.

### 1Ab - Using Unstructured

Unstructured is the second way to load PDF documents covered in the documentation linked above. Here's how it works:

- **Loading PDFs**: UnstructuredPDFLoader is used to load PDF documents.
- **Elements Handling**: Under the hood, Unstructured creates different "elements" for different chunks of text. By default, these are combined together, but separation can be maintained by specifying `mode="elements"`.
- **Usage**: UnstructuredPDFLoader is used to load the PDF, and the content can be accessed similarly to PyPDF.
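
The difference between the default mode and `mode="elements"` can be sketched in plain Python. This is only an illustration under stated assumptions: the element dictionaries, the `load_elements` helper, and the sample text are hypothetical stand-ins, not the actual data structures of Unstructured or LangChain.

```python
# Hypothetical elements, standing in for what a PDF partitioner might emit.
ELEMENTS = [
    {"category": "Title", "text": "Confounding Bias", "page_number": 1},
    {"category": "NarrativeText", "text": "The first significant cause of bias is confounding.", "page_number": 1},
    {"category": "NarrativeText", "text": "Confounding happens when there is an open backdoor path.", "page_number": 2},
]


def load_elements(elements, mode="single"):
    """Return one combined document (default) or one document per element."""
    if mode == "elements":
        # Keep every element separate, together with its own metadata.
        return [
            {
                "page_content": e["text"],
                "metadata": {"category": e["category"], "page_number": e["page_number"]},
            }
            for e in elements
        ]
    if mode == "single":
        # Combine all element texts into a single document.
        combined = "\n\n".join(e["text"] for e in elements)
        return [{"page_content": combined, "metadata": {}}]
    raise ValueError(f"unknown mode: {mode}")


single_docs = load_elements(ELEMENTS)
element_docs = load_elements(ELEMENTS, mode="elements")
print(len(single_docs), len(element_docs))
```

Keeping elements separate produces many more, much smaller documents, which is why the Unstructured cells later in this notebook report thousands of entries for a single PDF.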

### When to Use PyPDF vs. Unstructured

- **PyPDF**:
  - **When Page Numbers are Important**: If you need to keep track of page numbers and want to work with individual pages, PyPDF is a suitable choice.
  - **General Text Extraction**: PyPDF is widely used for general text extraction from PDF documents and has a well-established community.

- **Unstructured**:
  - **Handling Different Text Elements**: If you need to work with different chunks of text and want to retain the separation between different elements, Unstructured might be a better option.
  - **Customized Text Processing**: Unstructured allows for more customized handling of text elements, making it suitable for specific use cases where text needs to be processed in a particular way.

### Selection

The choice between PyPDF and Unstructured depends on the specific requirements of your task. If you need general text extraction with page numbers, PyPDF is a solid choice. If you require more customized handling of text elements, Unstructured might be more suitable. Both methods are supported by LangChain, allowing for flexibility in handling PDF documents.

### 1Aa. PyPDF approach

In [1]:
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())

True

In [2]:
from langchain.document_loaders import PyPDFLoader

In [3]:
from langchain.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chat_models import ChatOpenAI

In [4]:
loader = PyPDFLoader('data/Causal_Inference_in_Python.pdf')

In [5]:
pages = loader.load_and_split()

In [6]:
type(pages), len(pages)

(list, 399)

In [7]:
for page in pages:
    page.page_content = page.page_content.encode('utf-8', 'replace').decode('utf-8')


In [8]:
print(pages[100].page_content)

Front Door Adjustment
The backdoor adjustment is not the only possible strategy to identify causal effects.
One can leverage the knowledge of causal mechanisms to identify the causal effect via
a front door, even in the presence of unmeasured common causes:
With this strategy, you must be able to identify the effect of the treatment on a media‐
tor and the effect of that mediator on the outcome. Then, the identification of the
effect of treatment on the outcome becomes the combination of those two effects.
However, in the tech industry, it’s hard to find applications where such a graph is
plausible, which is why the front door adjustment is not so popular.
Confounding Bias
The first significant cause of bias  is confounding. It’s the bias we’ve been discussing so
far. Now, we are just putting a name to it. Confounding happens when there is an open
backdoor path through which association flows  noncausally, usually because the treat‐
ment and the outcome share a common cause . For examp

### 1Ab. Unstructured approach

In [9]:
from langchain.document_loaders import UnstructuredPDFLoader

In [10]:
loader2 = UnstructuredPDFLoader("data/Causal_Inference_in_Python.pdf", mode="elements")

In [11]:
data2 = loader2.load()

In [12]:
len(data2)

5869

In [13]:
data2[100]

Document(page_content='Changing gears a bit (or not at all), place yourself in the shoes of a brilliant risk ana‐ lyst. You were just hired by a lending company and your first task is to perfect its credit risk model. The goal is to have a good automated decision-making system that assesses the customers’ credit worthiness (underwrites them) and decides how much credit the company can lend them. Needless to say, errors in this system are incredi‐ bly expensive, especially if the given credit line is high.', metadata={'source': 'data/Causal_Inference_in_Python.pdf', 'coordinates': {'points': ((71.99558, 285.71403999999995), (71.99558, 359.21406999999994), (432.0044600000003, 359.21406999999994), (432.0044600000003, 285.71403999999995)), 'system': 'PixelSpace', 'layout_width': 504.0, 'layout_height': 661.5}, 'filename': 'Causal_Inference_in_Python.pdf', 'file_directory': 'data', 'last_modified': '2023-08-20T07:53:34', 'filetype': 'application/pdf', 'page_number': 14, 'category': 'Narrati

## 2A. Document Splitting Options

Two options for splitting the text of a PDF are:


### 2Aa. Splitting by Character

Splitting by character is the simplest method. The text is split on a single separator (`"\n\n"` by default in LangChain's `CharacterTextSplitter`), and the resulting pieces are merged into chunks whose size is measured in characters. It does not break the text into individual characters.

**When it makes sense:**
1. **Simple, Predictable Chunks**: If the text has a clear separator, such as blank lines between paragraphs, and a character budget per chunk is good enough.
2. **Language-Independent Processing**: Counting characters does not depend on a tokenizer, making it suitable for multilingual documents.
3. **Code Analysis**: In the context of code (e.g., Python), it works when a single separator such as a blank line lines up with the structure of the code; it has no knowledge of syntax.
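
A minimal sketch of separator-based splitting with a character budget, mirroring the `separator`, `chunk_size`, and `chunk_overlap` parameters used in the `CharacterTextSplitter` cell later in this notebook. The `split_by_separator` function is a hypothetical illustration of the idea, not LangChain's implementation, and its handling of edge cases may differ.

```python
def split_by_separator(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until the
            # carried text fits in chunk_overlap and leaves room for the new piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
overlapping = split_by_separator(sample, chunk_size=10, chunk_overlap=4)
disjoint = split_by_separator(sample, chunk_size=10, chunk_overlap=0)
print(overlapping)
print(disjoint)
```

Note that a single piece longer than `chunk_size` is emitted as-is in this sketch, because the only place the text can be cut is at the separator.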

### 2Ab. Splitting by BPE Tokens

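BPE (Byte-Pair Encoding) tokenization splits text into subword units that sit between the word level and the character level, and it is commonly used in modern NLP models. In LangChain, `CharacterTextSplitter.from_tiktoken_encoder` still splits the text on a separator, but it measures chunk size in BPE tokens (counted with the tiktoken tokenizer) instead of characters.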

**When it makes sense:**
1. **Subword-Level Analysis**: BPE tokenization is useful for languages where words can be broken down into meaningful subword units. It helps in capturing the morphological structure of the words.
2. **Compatibility with Language Models**: Many pre-trained language models use BPE tokenization. If you are planning to use such models, splitting by BPE tokens aligns with their internal tokenization.
3. **Code Analysis**: For code analysis, BPE tokenization can capture common programming constructs and idioms, allowing for more nuanced analysis compared to character-level splitting.
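
To make the idea of subword units concrete, here is a toy sketch of how BPE learns merges from a tiny word list and then applies them. It is a simplified illustration, not the tiktoken implementation; `learn_bpe_merges`, `merge_pair`, and `tokenize` are hypothetical helper names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(tokenize("lowest", merges))
print(tokenize("slow", merges))
```

Frequent character sequences become single tokens while rare ones stay split into smaller pieces, which is why a token count can differ noticeably from a character count for the same text.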


For documents containing code, the choice between the two mostly comes down to how chunk size should be measured. Character-based splitting is simple and needs no tokenizer, while token-based splitting keeps each chunk within a token budget that matches what the language model actually counts.

### 2B Other options

Beyond the two options above, LangChain provides other text splitters, including a code splitter and a Markdown header splitter. The comparison below considers them in the context of documents that may contain code (Python) extracted from PDFs or similar sources.

### 2Ba. Split by Character
- **Usage**: Splits the text on a separator and merges the pieces into chunks measured in characters.
- **When to Use**: Useful as a simple default when the text has clear separators, such as blank lines, and a character budget per chunk is sufficient.

### 2Bb. Split by BPE (Byte-Pair Encoding) Token
- **Usage**: Splits the text on a separator and measures chunk size in subword tokens produced by the BPE algorithm.
- **When to Use**: Suitable for handling out-of-vocabulary words and preserving word meaning in various languages. It can be particularly useful when dealing with code, as it can tokenize variable names and other identifiers that may not be in a standard vocabulary.

### 2Bc. Code Splitter
- **Usage**: Splits code with multiple language support, including Python, JavaScript, Markdown, Latex, etc.
- **When to Use**: Ideal for documents containing code snippets in various programming languages. It recognizes language-specific separators and can split code accordingly.
- **Example**: In Python, it can split by class definitions, function definitions, etc., allowing for more meaningful segmentation of code.
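
As a rough illustration of splitting on language-specific separators, the sketch below cuts Python source before each top-level `class` or `def`. It is a simplified stand-in written for this note (the `split_python_source` helper is hypothetical); LangChain's own code splitter is configurable per language and also enforces a chunk size, which this sketch does not.

```python
import re

SOURCE = (
    "import os\n"
    "\n"
    "def f():\n"
    "    return 1\n"
    "\n"
    "class A:\n"
    "    def m(self):\n"
    "        return 2\n"
)


def split_python_source(source):
    """Split before every top-level 'class' or 'def', keeping the keyword with its block."""
    # The lookahead makes the match zero-width, so nothing is consumed by the split.
    parts = re.split(r"^(?=(?:class|def)\s)", source, flags=re.MULTILINE)
    return [p.strip("\n") for p in parts if p.strip()]


code_chunks = split_python_source(SOURCE)
for chunk in code_chunks:
    print(repr(chunk))
```

The indented method `m` stays inside the chunk for class `A`, because only definitions that start at the beginning of a line trigger a split.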

### 2Bd. Markdown Header Metadata 
- **Usage**: Splits a Markdown document at the headers you specify and records the header text in each chunk's metadata.
- **When to Use**: Useful for documents written in Markdown format, where headers provide a logical structure for segmentation.
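
A minimal sketch of header-based splitting, assuming ATX-style headers (`#`, `##`, ...) and ignoring edge cases such as `#` lines inside fenced code. The `split_markdown_by_headers` helper and the `h1`/`h2` metadata keys are hypothetical choices for this illustration, not LangChain's API.

```python
import re


def split_markdown_by_headers(markdown_text):
    """Split text at Markdown headers and attach the active headers as metadata."""
    sections = []
    active_headers = {}  # header level -> header text
    buffer = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {f"h{level}": title for level, title in sorted(active_headers.items())}
            sections.append({"content": content, "metadata": metadata})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in active_headers if lvl >= level]:
                del active_headers[stale]
            active_headers[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


md_sections = split_markdown_by_headers("# Title\nintro\n## A\ntext a\n## B\ntext b\n")
for section in md_sections:
    print(section)
```

Each chunk carries the chain of headers it sits under, so a retrieved chunk can still be traced back to its place in the document.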

### Comparison and Context Consideration
- **BPE vs. Character Split**: Both split on a separator; they differ in how chunk size is measured. Counting BPE tokens tracks a model's context limit more closely than counting characters, which can matter for code, where identifiers and symbols may tokenize differently from prose.
- **BPE vs. Code Split**: Code Splitter is designed specifically for code and supports multiple languages, making it more suitable for documents containing code snippets. BPE might still be useful for general text within the document.
- **BPE vs. Markdown Splitter**: The two address different concerns: BPE token counting controls chunk size, while a Markdown-specific splitter follows the document's header structure. If the document follows Markdown syntax, the Markdown splitter provides more structured segmentation.

In summary, the choice of splitter depends on the nature of the document and the specific requirements of the task. For documents containing code, the Code Splitter seems to be the most tailored option, while BPE can be a versatile choice for mixed content.



### 2Aa - Split by character

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)

In [None]:
texts = text_splitter.create_documents([page.page_content for page in pages])
print(texts[0])

In [None]:
len(texts)

### 2Ab - Split by tokens

In [None]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)

In [None]:
content = ''.join([page.page_content for page in pages])
utf8_content = content.encode('utf-8', 'replace')
texts2 = text_splitter.split_text(utf8_content.decode('utf-8'))
len(texts2)

### 3. Embeddings

In [16]:
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [None]:
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

In [None]:
embeddings = embeddings_model.embed_documents(
    [pages[100].page_content]
)
len(embeddings), len(embeddings[0])
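
Each embedding is a list of floats, and retrieval works by comparing such vectors. A minimal sketch of that comparison, assuming cosine similarity as the measure (the metric a given vector store actually uses may differ) and using made-up three-dimensional vectors in place of real OpenAI embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors: hypothetical values chosen only for illustration.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query_vec = [1.0, 0.0, 0.0]
nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```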

### 4. Embed and save to Chroma

##### 4A. PyPDF Loader Approach

In [14]:
from langchain.vectorstores import Chroma

In [None]:
# Embed the PyPDF pages and load them into a persistent Chroma collection
db = Chroma.from_documents(pages, embeddings_model, persist_directory='data/chroma/1Causal_Inference')

In [None]:
# Helpful to force a save
db.persist()

##### 4B. Unstructured Loader Approach

In [18]:
from langchain.vectorstores.utils import filter_complex_metadata

In [19]:
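# Load the Unstructured elements into Chroma. filter_complex_metadata drops
# metadata values that are not simple types (e.g. the nested 'coordinates' dict).
db2 = Chroma.from_documents(filter_complex_metadata(data2), embeddings_model, persist_directory='data/chroma/2Causal_Inference')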

In [20]:
# Helpful to force a save
db2.persist()

##### 4C. Load each and check similarity search

In [22]:
db1 = Chroma(persist_directory='data/chroma/1Causal_Inference',embedding_function=embeddings_model)

In [48]:
new_str = "simplify dif-in-diff covariates with OLS?"

In [49]:
docs = db1.similarity_search(new_str)

In [50]:
print(docs[0].page_content)

Alternative Coefficient  Formula
The fact that you only need to residualize the treatment suggests a simpler way of
rewriting the regression coefficient formula. In the single variable case, instead of
using the covariance of Y and T over the variance of T, you can useβ1=ETi−Tyi
ETi−T2.
In the multivariate case, this would beβ1=ETi−ETXyi
EVar TX.
There is a difference, though. Look at the p-value. It is a bit higher than what you got
earlier. That’s because you are not applying the denoising step, which is responsible
for reducing variance. Still, with only the debiasing step, you can already get the
unbiased estimate of the causal impact of credit limit on risk, given that all the con‐
founders were included in the debiasing model.
Y ou can also visualize what is going on by plotting the debiased version of credit limit
against default rate. Y ou’ll see that the relationship is no longer downward sloping, as
when the data was biased:
Denoising Step
While the debiasing step is crucial 

In [51]:
db2 = Chroma(persist_directory='data/chroma/2Causal_Inference',embedding_function=embeddings_model)

In [52]:
docs = db2.similarity_search(new_str)

In [53]:
print(docs[0].page_content)

In fact, to prove my point, let’s use OLS to build a synthetic control right now. All you have to do is to use y_pre_co as if it was the covariate matrix X and the column aver‐ age of y_pre_tr as the outcome y. Once you fit this model, the weights can be extrac‐ ted with .coef_:


In [54]:
docs = db1.similarity_search_by_vector(OpenAIEmbeddings().embed_query(new_str))

In [55]:
print(docs[0].page_content)

Alternative Coefficient  Formula
The fact that you only need to residualize the treatment suggests a simpler way of
rewriting the regression coefficient formula. In the single variable case, instead of
using the covariance of Y and T over the variance of T, you can useβ1=ETi−Tyi
ETi−T2.
In the multivariate case, this would beβ1=ETi−ETXyi
EVar TX.
There is a difference, though. Look at the p-value. It is a bit higher than what you got
earlier. That’s because you are not applying the denoising step, which is responsible
for reducing variance. Still, with only the debiasing step, you can already get the
unbiased estimate of the causal impact of credit limit on risk, given that all the con‐
founders were included in the debiasing model.
Y ou can also visualize what is going on by plotting the debiased version of credit limit
against default rate. Y ou’ll see that the relationship is no longer downward sloping, as
when the data was biased:
Denoising Step
While the debiasing step is crucial 

In [56]:
docs = db2.similarity_search_by_vector(OpenAIEmbeddings().embed_query(new_str))

In [57]:
print(docs[0].page_content)

In fact, to prove my point, let’s use OLS to build a synthetic control right now. All you have to do is to use y_pre_co as if it was the covariate matrix X and the column aver‐ age of y_pre_tr as the outcome y. Once you fit this model, the weights can be extrac‐ ted with .coef_:


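In [58]:
# search_kwargs are set when the retriever is created; keyword arguments passed to
# get_relevant_documents() do not change the vector store search.
# To filter by score as well, use search_type="similarity_score_threshold"
# together with a "score_threshold" entry in search_kwargs.
retriever = db1.as_retriever(search_kwargs={"k": 4})

In [59]:
docs = retriever.get_relevant_documents(new_str)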

In [62]:
docs[3].page_content

'CHAPTER 4\nThe Unreasonable Effectiveness\nof Linear Regression\nIn this chapter you’ll add  the first major debiasing technique in your causal inference\narsenal: linear regression or ordinary least squares (OLS) and orthogonalization.\nY ou’ll see how linear regression can adjust for confounders when estimating the rela‐\ntionship between a treatment and an outcome. But, more than that, I hope to equip\nyou with the powerful concept of treatment orthogonalization. This idea, born in lin‐\near regression, will come in handy later on when you start to use machine learning\nmodels for causal inference.\nAll You Need Is Linear Regression\nBefore you skip to the next chapter because “oh, regression is so easy! It’s the first\nmodel I learned as a data scientist” and yada yada, let me assure you that no, you\nactually don’t know linear regression. In fact, regression is one of the most fascinat‐\ning, powerful, and dangerous models in causal inference. Sure, it’s more than one\nhundred ye

### 5. Retrievers


LangChain offers several retrievers that go beyond a plain vector store search. The four covered here handle multiple phrasings of a query, compress the retrieved documents, combine different retrieval algorithms, and manage the granularity of the returned documents.

### MultiQueryRetriever

The **MultiQueryRetriever** is one of LangChain's data connection retrievers. It builds on distance-based vector database retrieval, which embeds queries in high-dimensional space and finds similar embedded documents based on "distance." Here's a summary of its functionality and benefits:

1. **Overcoming Limitations**: Traditional distance-based retrieval may produce different results with subtle changes in query wording or if the embeddings do not capture the semantics of the data well. Manual prompt engineering or tuning is often done to address these problems but can be tedious.

2. **Automating Prompt Tuning**: The MultiQueryRetriever automates the process of prompt tuning by using an LLM (large language model) to generate multiple queries from different perspectives for a given user input query.

3. **Richer Results**: By generating multiple perspectives on the same question, the MultiQueryRetriever might be able to overcome some of the limitations of distance-based retrieval and get a richer set of results. It retrieves a set of relevant documents for each query and takes the unique union across all queries to get a larger set of potentially relevant documents.

4. **Integration with Other Components**: It can be used with other LangChain components like vector stores (e.g., Chroma), document loaders (e.g., WebBaseLoader), embeddings (e.g., OpenAIEmbeddings), and text splitters (e.g., RecursiveCharacterTextSplitter).

5. **Customizable Query Generation**: Users can specify the LLM to use for query generation, and the retriever will handle the rest. Additionally, users can supply their own prompt along with an output parser to split the results into a list of queries, allowing for more customized query generation.

6. **Use Cases**: Ideal for scenarios where a more comprehensive and nuanced retrieval of documents is required, and where traditional distance-based retrieval might fall short in capturing the semantics of the query.

In essence, the MultiQueryRetriever automates query rephrasing: it searches with several versions of the same question and returns a broader set of potentially relevant documents. It integrates well with other LangChain components and offers customization and logging features to suit various use cases.
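
The "unique union across all queries" step can be sketched without any LLM or vector store. In the sketch below, `fake_generate_queries` is a hypothetical stand-in for the LLM that rephrases the question, and `keyword_retrieve` is a hypothetical stand-in for the vector store search; only the union logic is the point.

```python
CORPUS = [
    "OLS regression adjusts for confounders",
    "Difference-in-differences compares trends over time",
    "Synthetic control builds a weighted comparison unit",
]


def fake_generate_queries(question):
    """Stand-in for the LLM: return the question plus fixed rephrasings."""
    return [question, "regression confounders", "trends over time"]


def keyword_retrieve(query, corpus=CORPUS):
    """Stand-in for a vector store search: match documents sharing any word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]


def multi_query_retrieve(question, generate_queries, retrieve):
    """Run every generated query and keep each document once, in first-seen order."""
    seen, unique_docs = set(), []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


mq_results = multi_query_retrieve("OLS confounders", fake_generate_queries, keyword_retrieve)
print(mq_results)
```

The original question alone matches only the first document; the rephrasings bring in a second one, and duplicates are returned once.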


The second retriever is the Contextual Compression Retriever.

### Contextual Compression Retriever

#### Purpose:
The Contextual Compression Retriever is designed for queries where the relevant information is buried in a document with a lot of irrelevant text. It aims to reduce the cost of LLM calls and improve responses by compressing the retrieved documents so that they include only the relevant information.

#### How It Works:
- **Compression**: The term "compressing" here refers to both compressing the contents of an individual document and filtering out documents wholesale.
- **Components**: To use the Contextual Compression Retriever, you'll need a base Retriever and a Document Compressor.
- **Process**: The Contextual Compression Retriever passes queries to the base Retriever, takes the initial documents, and passes them through the Document Compressor. The Document Compressor shortens the list of documents by reducing their contents or dropping them altogether.
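
The base-retriever-then-compressor flow can be sketched as follows. The compressor here is a simple keyword filter chosen for illustration, a stand-in for an LLM-based compressor; all function names and the sample documents are hypothetical.

```python
def keyword_compressor(docs, query):
    """Keep only sentences that share a word with the query; drop documents left empty."""
    terms = set(query.lower().split())
    compressed = []
    for doc in docs:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        kept = [s for s in sentences if terms & set(s.lower().split())]
        if kept:
            compressed.append(". ".join(kept) + ".")
    return compressed


def contextual_compression_retrieve(query, base_retrieve, compress):
    """Pass the query to the base retriever, then run its results through the compressor."""
    return compress(base_retrieve(query), query)


DOCS = [
    "The paper was written by Sun and Abraham. The weather was nice.",
    "Nothing relevant here.",
]

compressed = contextual_compression_retrieve(
    "Sun Abraham paper",
    base_retrieve=lambda query: DOCS,  # stand-in base retriever: returns everything
    compress=keyword_compressor,
)
print(compressed)
```

The first document is shortened to its relevant sentence and the second is dropped altogether, which covers both senses of "compressing" described above.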

#### Example Usage:
The example provided in the documentation demonstrates how to initialize a simple vector store retriever and store the 2023 State of the Union speech in chunks. It shows how the retriever returns relevant and irrelevant documents, and even the relevant documents may contain irrelevant information.


#### Visualization:
The documentation also includes a visual representation of how the Contextual Compression Retriever works, showing the process of passing queries to the base Retriever and compressing the documents.

#### Conclusion:
The Contextual Compression Retriever is a valuable tool for extracting relevant information from large documents. By compressing and filtering the content, it ensures that only the pertinent details are returned, optimizing the retrieval process.

The third retriever is the Ensemble Retriever.

### Ensemble Retriever

#### Purpose:
The Ensemble Retriever combines the results of multiple retrievers and reranks them using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm. By leveraging the strengths of different algorithms, it aims to achieve better performance than any single algorithm.

#### How It Works:
- **Combining Algorithms**: The Ensemble Retriever takes a list of retrievers as input and ensembles their results.
- **Hybrid Search**: The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like Embedding similarity). This approach is known as "hybrid search."
- **Strengths**: The sparse retriever excels at finding relevant documents based on keywords, while the dense retriever is adept at finding relevant documents based on semantic similarity.
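
Reciprocal Rank Fusion scores each document as the sum, over all rankings, of `1 / (c + rank)`. A minimal sketch, assuming the constant `c = 60` from the RRF paper and two hypothetical ranked lists standing in for the outputs of a sparse and a dense retriever:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from two retrievers for the same query.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists ("a" and "c") outrank documents that only one retriever found. The optional `weights` argument is an assumption of this sketch for giving one retriever more influence than another.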

#### Example Usage:
The example provided in the documentation demonstrates how to initialize the BM25 retriever and FAISS retriever, and then combine them using the Ensemble Retriever.


#### Conclusion:
The Ensemble Retriever offers a powerful way to combine different retrieval algorithms to enhance the search performance. By integrating both sparse and dense retrievers, it ensures a more comprehensive and accurate retrieval of relevant documents.

The fourth retriever is the Parent Document Retriever.

### Parent Document Retriever

#### Purpose:
The Parent Document Retriever is designed to balance the conflicting desires of having small documents for accurate embeddings and long enough documents to retain context. It achieves this by splitting and storing small chunks of data and then retrieving the parent documents (larger chunks or whole raw documents) from which the small chunks originated.

#### How It Works:
- **Splitting Documents**: The retriever splits documents into small chunks and indexes them.
- **Retrieving Parent Documents**: During retrieval, it fetches the small chunks and then looks up the parent IDs for those chunks, returning the larger documents.

#### Two Modes of Operation:
1. **Retrieving Full Documents**: In this mode, the retriever retrieves the full documents by specifying only a child splitter.
2. **Retrieving Larger Chunks**: If the full documents are too large, the retriever can split the raw documents into larger chunks and then further split them into smaller chunks. It indexes the smaller chunks but retrieves the larger chunks upon request.
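
The first mode (index small chunks, return full documents) can be sketched with plain dictionaries. This is an illustration only: sentence splitting stands in for the child splitter, a substring match stands in for the embedding similarity search, and the helper names and sample documents are hypothetical.

```python
PARENTS = {
    "doc1": "Regression adjusts for confounders. Orthogonalization residualizes the treatment.",
    "doc2": "Synthetic control weights comparison units. Weights come from pre-treatment outcomes.",
}


def build_child_index(parents):
    """Split each parent into sentence-sized children and remember each child's parent ID."""
    child_index = []
    for parent_id, text in parents.items():
        for sentence in text.split("."):
            sentence = sentence.strip()
            if sentence:
                child_index.append((sentence, parent_id))
    return child_index


def retrieve_parents(query, child_index, parents):
    """Search the small chunks, then return the parent documents they came from."""
    parent_ids = []
    for child_text, parent_id in child_index:
        if query.lower() in child_text.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


child_index = build_child_index(PARENTS)
parent_hits = retrieve_parents("orthogonalization", child_index, PARENTS)
print(parent_hits)
```

The match happens on a single sentence, but the caller receives the whole parent document, so the surrounding context is preserved.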


#### Conclusion:
The Parent Document Retriever provides a flexible way to handle document retrieval by allowing control over the granularity of the retrieved documents. It can retrieve full documents or larger chunks, depending on the requirements, ensuring that the context is retained without losing the accuracy of embeddings.

### Summary of All Four Retrievers:
1. **MultiQueryRetriever**: Uses an LLM to rephrase your query from several perspectives, searches the vector store with each version, and returns the unique union of the results.
2. **Contextual Compression Retriever**: Retrieves documents with a base retriever, then compresses or drops them so that only the content relevant to the query is passed on to the LLM.
3. **Ensemble Retriever**: Combines the results of multiple retrievers and reranks them with Reciprocal Rank Fusion to enhance search performance.
4. **Parent Document Retriever**: Balances the need for small documents for accurate embeddings with the need to retain context. It splits and stores small chunks of data; during retrieval, it first fetches the small chunks, then looks up the parent IDs for those chunks and returns the larger documents.



##### 5A. Multi-Query Retriever

In [91]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI

In [92]:
llm = ChatOpenAI(temperature=0)

In [93]:
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db1.as_retriever(),llm=llm)

In [94]:
unique_docs = retriever_from_llm.get_relevant_documents(query=new_str)

In [95]:
retriever_from_llm

MultiQueryRetriever(tags=None, metadata=None, retriever=VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], metadata=None, vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x000001A4610F0220>, search_type='similarity', search_kwargs={}), llm_chain=LLMChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, prompt=PromptTemplate(input_variables=['question'], output_parser=None, partial_variables={}, template='You are an AI language model assistant. Your task is \n    to generate 3 different versions of the given user \n    question to retrieve relevant documents from a vector  database. \n    By generating multiple perspectives on the user question, \n    your goal is to help the user overcome some of the limitations \n    of distance-based similarity search. Provide these alternative \n    questions separated by newlines. Original question: {question}', template_format='f-string', validate_template=True), llm=ChatOpenAI(cache=N

##### 5B. Context Compression

In [96]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [97]:
llm = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [98]:
compression_retriever1 = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=db2.as_retriever())

In [99]:
NEWQ = 'who wrote the paper "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects" and when?'

In [100]:
compressed_docs = compression_retriever1.get_relevant_documents(NEWQ)

In [101]:
compressed_docs[0].page_content

'"Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects," by Sun and Abraham'

In [102]:
# Display the document itself; `.dict` without parentheses only shows the bound method
compressed_docs[0]

Document(page_content='"Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects," by Sun and Abraham', metadata={'category': 'NarrativeText', 'file_directory': 'data', 'filename': 'Causal_Inference_in_Python.pdf', 'filetype': 'application/pdf', 'last_modified': '2023-08-20T07:53:34', 'page_number': 292, 'source': 'data/Causal_Inference_in_Python.pdf'})