# Introduction
- This is a summary program based on LangChain's short course 'Chat with your data'
- It is a simplified version that uses only PDF files to show how a document is transformed
    - Load multiple PDF files
    - Split the loaded files into smaller chunks
    - Embed the chunks using OpenAI embeddings
    - Store the embeddings in a vector database (ChromaDB)

In [None]:
# Basic setup

import os
import sys
import openai

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv()) # read local .env file

#openai.api_key  = os.environ['OPENAI_API_KEY']

## Transformation

- LangChain provides a `document_loaders` module for loading both structured and unstructured data sources. Here we use the PyPDF loader to load sample PDF files

In [None]:
from langchain.document_loaders import PyPDFLoader

# One loader per lecture PDF
loaders = [
    PyPDFLoader("/docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("/docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("/docs/MachineLearning-Lecture03.pdf"),
]

- Find out how many pages have been loaded

In [None]:
docs = []

for loader in loaders:
    docs.extend(loader.load())

len(docs)

- Break the loaded PDFs into smaller chunks. We need to define a chunk size and a chunk overlap. LangChain provides RecursiveCharacterTextSplitter and CharacterTextSplitter. The major differences are
    - RecursiveCharacterTextSplitter is preferred for generic text splitting
    - CharacterTextSplitter splits on a single separator, by default a double newline ("\n\n"), which is the break between paragraphs. If you set the separator to a space, it splits between words instead
    - RecursiveCharacterTextSplitter tries a list of separators in order, "\n\n", "\n", " ", "" by default, and falls back to the next one when a piece is still too large
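The recursive idea can be illustrated with a small pure-Python sketch. This is not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is left out, and the sample text is made up. It only shows how the splitter falls back from coarse separators to finer ones when a piece is still larger than the chunk size.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no chunk overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee ffff gggg\nhhhh"  # made-up text
for chunk in recursive_split(sample, chunk_size=10):
    print(repr(chunk))
```

Paragraph breaks are tried first, then line breaks, then spaces, so chunks tend to end at natural boundaries rather than in the middle of a word.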
    

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 120
chunk_overlap = 10

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap
)

# Split the loaded documents into chunks
# (custom separators, if needed, are passed to the constructor as `separators`)

splits = r_splitter.split_documents(docs)

- Explore these chunks
    - How many chunks?
    - What is inside them?

In [None]:
# print the number of chunks
print(len(splits))

In [None]:
# print the text of a chunk
print(splits[0].page_content)

- Understand embeddings
    - Use OpenAI embeddings
    - Understand similarity between words and sentences
    - Use the `dot` product to compare embeddings; a higher score means the two texts are more similar
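To see why a higher dot product signals closer meaning, here is a minimal sketch with hand-made 3-dimensional vectors. The numbers are invented for illustration and are not real OpenAI embeddings, and the helper names `dot` and `normalize` are hypothetical. The sketch assumes unit-length vectors, in which case the dot product equals the cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors (assumed values): the first two point in similar directions.
school = normalize([0.9, 0.1, 0.2])
my_class = normalize([0.8, 0.2, 0.3])
football = normalize([0.1, 0.9, 0.3])

print(dot(school, my_class))   # close to 1: similar direction
print(dot(school, football))   # much lower: different direction
```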

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

# Embed three sentences: the first two are close in meaning, the third is not
embedding_1 = embedding.embed_query('I go to school')

embedding_2 = embedding.embed_query('I go to my class')

embedding_3 = embedding.embed_query('I love playing football ')

In [None]:
len(embedding_1)

In [None]:
print(embedding_1[0:10])

In [None]:
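import numpy as np

# Print each score; a notebook cell only displays its last expression
print(np.dot(embedding_1, embedding_1))  # an embedding compared with itself
print(np.dot(embedding_1, embedding_2))
print(np.dot(embedding_1, embedding_3))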

- Set up a vector database to store all splits after embedding them
    - Run `pip install chromadb`
    - Create a directory where all splits will be stored and pass it as `persist_directory`
    - Pass the embedding function; we are using OpenAI embeddings
    - Pass all splits as `documents`

In [None]:
from langchain.vectorstores import Chroma

persist_directory='docs/chroma'

# remove any existing directory
!rm -rf ./docs/chroma

# setup chroma db

vector_db = Chroma.from_documents(
    documents=splits,
    persist_directory=persist_directory,
    embedding=embedding
)

- Count all embeddings in the vector database

In [None]:
print(vector_db._collection.count())

- Use file metadata for better contextualization by adding a filter

- Understanding embedding search
    - Similarity search
    - Max marginal relevance search
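Conceptually, a similarity search scores every stored vector against the query vector and returns the `k` best matches. The sketch below shows that idea in plain Python; it is not how Chroma works internally (vector stores typically use specialised indexes), and the function name `top_k` and the toy vectors are assumptions for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors with the highest dot product."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy unit vectors standing in for embedded chunks (assumed values).
query = [1.0, 0.0]
stored = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]

print(top_k(query, stored, k=2))
```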

In [None]:
question = "add some sample question"

In [None]:
result = vector_db.similarity_search(question, k=3)

- `k` sets the number of documents to return. Let us check the count of returned documents

In [None]:
len(result)

- Check the content of the returned documents

In [None]:
# `result` is a list of documents, so print each one's content
for doc in result:
    print(doc.page_content)

In [None]:
result[0].page_content

- Persist the database for later use

In [None]:
vector_db.persist()

# Retrieval
- Retrieval using similarity search
- Max marginal relevance (MMR) search balances similarity to the query with diversity among the returned documents
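The trade-off between relevance and diversity can be sketched as a greedy selection loop. This is an illustration of the general MMR idea, not LangChain's or Chroma's code: the function name `mmr`, the `lambda_mult` weighting, and the toy vectors are assumptions. Each step picks the candidate that is most relevant to the query after subtracting a penalty for being similar to documents already selected.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy max marginal relevance: return the indices of k selected vectors."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = dot(query_vec, doc_vecs[i])
        # Similarity to the closest already-selected vector (0 if none yet).
        redundancy = max(
            (dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy unit vectors (assumed values): documents 0 and 1 are duplicates.
query = [1.0, 0.0]
stored = [[0.8, 0.6], [0.8, 0.6], [0.8, -0.6]]

print(mmr(query, stored, k=2))
```

All three toy documents are equally relevant to the query, so a plain similarity search with `k=2` would return the two duplicates, while the MMR loop skips the duplicate in favour of the different document.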

In [None]:
# similarity_search is a method of the vector store, so no extra import is needed

ss = vector_db.similarity_search(question, k=3)

- Add context from page metadata and use an LLM-based self-query retriever