# Introduction
- This is a summary program based on LangChain's short course 'Chat with your data'
- A simplified version that uses only PDF files to show how a document is transformed
    - Load multiple PDF files
    - Split the loaded files into smaller chunks
    - Embed the chunks using OpenAI embeddings
    - Store these embeddings in a vector database (ChromaDB)

In [2]:
# Basic setup

import os
import sys
import openai

from dotenv import load_dotenv

load_dotenv() # read local .env file

# Never hardcode the API key in the notebook; load it from the environment
openai.api_key = os.getenv('OPENAI_API_KEY')

def myEnvironment():
    # Report only whether the key was loaded, never the secret itself
    print(f'OpenAI API key loaded: {openai.api_key is not None}.')

if __name__ == "__main__":
    myEnvironment()


OpenAI API key loaded: True.


## Transformation

- LangChain provides document loaders for both structured and unstructured data sources. Here we use `PyPDFLoader` to load the sample PDF files

In [3]:
from langchain.document_loaders import PyPDFLoader

loaders = [
    PyPDFLoader("docs1/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs1/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs1/MachineLearning-Lecture03.pdf"),
]

- Find out how many pages have been loaded

In [4]:
docs = []

for loader in loaders:
    docs.extend(loader.load())

len(docs)

56

- Break the loaded PDFs into smaller chunks. We need to define the chunk size and the chunk overlap. LangChain provides RecursiveCharacterTextSplitter and CharacterTextSplitter. The major differences are:
    - RecursiveCharacterTextSplitter is preferred for generic text splitting
    - CharacterTextSplitter splits on a single separator, by default the double newline ("\n\n") between paragraphs. If you set the separator to a space, it splits on spaces only
    - RecursiveCharacterTextSplitter tries a list of separators in order, by default "\n\n", "\n", " ", ""
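To make the idea of recursive splitting concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it assumes the same default separator order listed above. It only illustrates how a splitter can fall back from paragraphs to lines to words to single characters until every piece fits.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper for illustration: tries the first separator and
    re-splits any oversized piece with the next one. No overlap handling.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    # With no separator left (or the empty one), cut at fixed positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge neighbouring pieces back together while they still fit
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Machine learning is fun.\n\nIt needs data.\nLots of data and some patience."
chunks = recursive_split(text, chunk_size=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Every printed chunk stays within the 20-character limit, and words are only cut when a single word is longer than the limit.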
    

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 120
chunk_overlap = 10

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap
)

# Split the document

splits = r_splitter.split_documents(docs)

- Explore these chunks
    - How many chunks?
    - What is inside them?

In [7]:
# print the number of chunks

len(splits)

1889

In [8]:
# show the first five chunks

splits[0:5]

[Document(page_content='MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine', metadata={'source': 'docs1/MachineLearning-Lecture01.pdf', 'page': 0}),
 Document(page_content='learning class. So what I wanna do today is ju st spend a little time going over the logistics', metadata={'source': 'docs1/MachineLearning-Lecture01.pdf', 'page': 0}),
 Document(page_content="of the class, and then we'll start to  talk a bit about machine learning.", metadata={'source': 'docs1/MachineLearning-Lecture01.pdf', 'page': 0}),
 Document(page_content="By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so", metadata={'source': 'docs1/MachineLearning-Lecture01.pdf', 'page': 0}),
 Document(page_content="I personally work in machine learning, and I' ve worked on it for about 15 years now, and", metadata={'source': 'docs1/MachineLearning-Lecture01.pdf', 'page': 0})]

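- Understand embeddings
    - Use OpenAI embeddings
    - Understand similarity between words and sentences
    - Use the `dot` product to compare embeddings; a higher score means the texts are more similar. OpenAI embeddings are normalized to length 1, so the dot product equals the cosine similarity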
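As a small illustration of comparing embeddings with a dot product, the sketch below uses made-up 3-dimensional vectors instead of real OpenAI embeddings (which have 1536 dimensions in this notebook). The vector values and the names `cat`, `kitten`, and `truck` are assumptions chosen only to show the behavior. It is written in plain Python; `np.dot` computes the same quantity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to length 1 so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors, invented for illustration only
cat = normalize([0.9, 0.1, 0.3])
kitten = normalize([0.8, 0.2, 0.4])
truck = normalize([0.1, 0.9, 0.2])

print("cat vs cat:   ", dot(cat, cat))      # a vector compared with itself
print("cat vs kitten:", dot(cat, kitten))   # vectors pointing in similar directions
print("cat vs truck: ", dot(cat, truck))    # vectors pointing in different directions
```

A normalized vector compared with itself scores approximately 1, and the score drops as the vectors point in more different directions.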

In [90]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()


In [None]:
embedding_1 = embedding.embed_query("Apple")
embedding_2 = embedding.embed_query("car")

In [42]:
len(embedding_2)

1536

In [43]:
print(embedding_2[0:10])

[0.0028599445902576914, -0.005424751648017928, -0.01598874169942126, -0.021673343066195377, -0.021006076492248848, 0.027409271484722164, -0.007866050426361114, -0.005344551267009154, -0.003952273323249115, -0.0062492115089087695]


In [34]:
import numpy as np

np.dot(embedding_1, embedding_1)


0.9999999999999998

In [47]:
np.dot(embedding_1, embedding_2)

0.9487800690436125

In [48]:
np.dot(embedding_2, embedding_3)

0.7966507280236204

In [50]:
np.dot(embedding_1, embedding_4)

0.7636250274290804

In [53]:
np.dot(embedding_5, embedding_6)

0.870826933839829

- Set up a vector database to store all splits after embedding them
    - Run `pip install chromadb`
    - Create a directory where all splits will be stored and pass it as `persist_directory`
    - Pass the embedding function; we are using OpenAI embeddings
    - Pass all splits as documents

## Another attempt at using ChromaDB

In [38]:
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [None]:
# Create an embedding function
# This did not work: sentence-transformers could not be installed because of a space issue on the server

embedding_function = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')

In [39]:
import chromadb
embeddings = OpenAIEmbeddings()
# new_client = chromadb.EphemeralClient()
openai_lc_client = Chroma.from_documents(
    documents=splits, embedding=embeddings)

query = "What is machine learning?"
docs = openai_lc_client.similarity_search(query, k=3)
len(docs)
# print(docs[1].page_content)

3

In [54]:
persist_directory='docs1/chroma'

In [55]:
openai_lc_client.persist()

In [59]:
vectordb = Chroma.from_documents(
    documents = splits,
    persist_directory=persist_directory,
    embedding = embeddings
)

In [51]:
retriever = openai_lc_client.as_retriever()

In [52]:
from langchain.chains import RetrievalQA

In [53]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")

In [62]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [64]:
query = "What is machine learning?"

result = qa_chain(query)

result["result"]

'Machine learning is a branch of artificial intelligence that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. It involves the use of statistical techniques and pattern recognition to train computers to learn and improve from experience.'

In [None]:
db = FAISS.from_documents(splits, OpenAIEmbeddings())

- Count all embeddings in the vector database

In [18]:
print(openai_lc_client._collection.count())

9445


- Use file metadata for better contextualization by adding a filter

- Understanding embedding search
    - Similarity search
    - Max marginal relevance search
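Conceptually, a similarity search scores every stored embedding against the query embedding and returns the `k` best matches. The sketch below shows that idea with invented 2-dimensional unit vectors; `top_k` is a hypothetical helper, and a real vector database such as Chroma uses an index rather than scanning every vector.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query.

    Assumes all vectors are normalized, so the dot product is the
    cosine similarity. Brute-force scan, for illustration only.
    """
    scored = [(dot(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest score first
    return [idx for _, idx in scored[:k]]


# Toy unit vectors standing in for embedded chunks (values are made up)
doc_vecs = [
    [1.0, 0.0],   # chunk 0
    [0.0, 1.0],   # chunk 1
    [0.8, 0.6],   # chunk 2
]
query_vec = [1.0, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

Here chunk 0 matches the query exactly and chunk 2 is the next closest, so those two indices are returned in that order.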

In [None]:
question = "add some sample question"

In [None]:
result = vectordb.similarity_search(question, k=3)

- `k` defines the number of documents to return. Let us check the count of returned documents

In [None]:
len(result)

- Check the content of the returned documents

In [None]:
for doc in result:
    print(doc.page_content)

In [None]:
result[0].page_content

- Persist the database for later use

In [None]:
vectordb.persist()

# Retrieval
- Retrieval using similarity search
- Max marginal relevance (MMR) search balances similarity to the query with diversity among the returned documents
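The trade-off between similarity and diversity can be sketched in a few lines of plain Python. This is a simplified illustration of the max marginal relevance idea, not the code a vector store runs: `mmr` is a hypothetical helper, the trade-off weight `lambda_mult` and the toy vectors are assumptions, and all vectors are assumed to be normalized.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k indices, trading off relevance against redundancy.

    Each step chooses the candidate with the best score, where the score
    rewards similarity to the query and penalizes similarity to the
    documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = dot(query_vec, doc_vecs[i])
            redundancy = max(
                (dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Chunks 0 and 1 are near-duplicates; chunk 2 covers something different
doc_vecs = [[1.0, 0.0], [0.98, 0.199], [0.6, 0.8]]
query_vec = [0.96, 0.28]

by_similarity = sorted(
    range(len(doc_vecs)), key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True
)[:2]
by_mmr = mmr(query_vec, doc_vecs, k=2)
print("similarity only:", by_similarity)
print("with MMR:       ", by_mmr)
```

Plain similarity returns the two near-duplicate chunks, while the MMR variant keeps the best match and swaps the duplicate for the more distinct chunk.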

In [None]:
# similarity_search is a method on the vector store, not a standalone import
ss = vectordb.similarity_search(question, k=3)

- Adding context from page metadata and using self-query retrieval with an LLM

In [42]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

In [41]:
from langchain.chains import RetrievalQA


In [45]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retrievar = openai_lc_client.as_re
)

TypeError: unhashable type: 'VectorStoreRetriever'