## Talk to Your Data: A Python Notebook Tutorial
Welcome to this interactive tutorial, where we'll explore how to "talk" to our data by combining Large Language Models (LLMs), such as OpenAI's GPT-3.5-turbo, with vector databases.

This tutorial will guide you through an exciting process of transforming unstructured data (a PDF document in our case) into an interactive and smart knowledge base that can answer your questions!

By following this tutorial, you will:

- Learn how to load data from a PDF file and split it into smaller, manageable chunks.
- Understand the concept of text embeddings and how we can utilize them to store our data in a Vector Database.
- Discover how to ask questions and retrieve the most relevant information from your database.
- Use a Large Language Model to generate answers based on the context we provide.

You will see how language models and vector databases work together to extract and use information from large amounts of text.

The approach used in this tutorial can be applied to a wide range of tasks, from creating a smart Q&A system to building a personal digital assistant or even designing a conversational AI.

<img src="images/qa_flow.png" alt="Image Alt Text" width="1200">

### Installing packages
Before we start, you need to have a few Python libraries installed. You can install these libraries by running the following command in your terminal:
```bash
pip install openai langchain ipykernel python-dotenv chromadb pypdf tiktoken
```

Or, if you have the `requirements.txt` file:
```bash
pip install -r requirements.txt
```

### Loading OpenAI API Key

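We'll need an OpenAI API key for both the embeddings and the GPT-3.5 model. `load_dotenv()` reads variables such as `OPENAI_API_KEY` from a local `.env` file into the environment.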

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

True
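
A quick sanity check can confirm that the key is actually available before any API call is made. The sketch below assumes the key is stored under the name `OPENAI_API_KEY` (the variable the OpenAI client reads by default); the helper `has_api_key` is a hypothetical name introduced only for illustration.

```python
import os


def has_api_key(name="OPENAI_API_KEY"):
    """Return True if the environment variable `name` is set and non-empty."""
    value = os.environ.get(name, "")
    return bool(value.strip())


# Report the status without ever printing the key itself.
if has_api_key():
    print("OpenAI API key found.")
else:
    print("OpenAI API key missing: check your .env file.")
```

Checking only for presence keeps the secret out of the notebook output.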

### Step 1: Loading PDFs
In this step, we use the `PyPDFLoader` class from the LangChain library to load our PDF file into LangChain's standard `Document` format, with one `Document` per page.



In [2]:
from langchain.document_loaders import PyPDFLoader

# create an instance of PyPDFLoader with the target PDF file
loader = PyPDFLoader("transcripts/PT693-Transcript.pdf")

# load the PDF file into a variable named 'docs'
docs = loader.load()

In [3]:
len(docs), type(docs[0])

(43, langchain.schema.document.Document)

### Step 2: Splitting Pages
Next, we're using a `RecursiveCharacterTextSplitter` to break down the content of the PDF into smaller chunks.

We define a chunk size and an overlap (both measured in characters here, because `length_function=len`) to decide how large each chunk should be and how much text neighboring chunks should share.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunk size and overlap, measured in characters
CHUNK_SIZE = 1200
CHUNK_OVERLAP = 200

# Create an instance of RecursiveCharacterTextSplitter with the desired chunk size and overlap
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP, 
    length_function=len  # function used to measure chunk size
)

# split documents into smaller chunks
splits = r_splitter.split_documents(docs)


In [5]:
len(splits)

84

In [6]:
for doc in splits[:10]:
    print(len(doc.page_content))

103
1148
960
1157
1014
1193
1002
1161
1091
1152
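
To make the roles of chunk size and overlap concrete, here is a deliberately simplified sliding-window splitter in plain Python. It is only a sketch: `sliding_chunks` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it cuts at fixed character positions instead of preferring separators such as paragraph breaks and newlines. That preference is why the chunk lengths printed above vary rather than all being exactly 1200.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


example = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(example)
```

Each chunk repeats the last two characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.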


### Step 3: Creating Embeddings for the Splits

**Computers don't understand words. They only understand numbers.**

Embeddings are a way of converting text into a numerical form that a machine can understand.

Imagine trying to describe a movie scene to a friend - you would use words to describe what's happening, the mood, the characters, etc. In a similar way, embeddings capture the essence of the text, but in a format that the machine can work with, like a numerical vector.
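
As a toy illustration of turning text into numbers, the sketch below counts how often each word from a tiny vocabulary appears in a sentence. This is not how OpenAI's embeddings are computed (those come from a trained neural network and capture meaning, not just word counts); `toy_embed` and the vocabulary are assumptions made up for this example.

```python
def toy_embed(text, vocabulary):
    """Map `text` to a vector of word counts, one entry per vocabulary term."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


vocabulary = ["i", "love", "dogs", "cats"]

dogs_vector = toy_embed("I love dogs", vocabulary)
cats_vector = toy_embed("I love cats", vocabulary)

print(dogs_vector)
print(cats_vector)
```

The two vectors differ only in the positions for "dogs" and "cats", which hints at why similar sentences end up with similar vectors.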



To create the embeddings, we're using OpenAIEmbeddings from LangChain. 

<img src="images/Embeddings.png" alt="Image Alt Text" width="600">


In [7]:
from langchain.embeddings.openai import OpenAIEmbeddings

# Create an instance of OpenAIEmbeddings for embedding the chunks
embedding = OpenAIEmbeddings()

#### Similarity Search and Cosine Similarity



In [9]:
sent1 = "I love dogs"
sent2 = "I love cats"
sent3 = "Yesterday I played basketball"
sent4 = "Yesterday I played football"
sent5 = "Leonardo Di Caprio is an underrated actor"

In [10]:
embedding1 = embedding.embed_query(sent1)
embedding2 = embedding.embed_query(sent2)
embedding3 = embedding.embed_query(sent3)
embedding4 = embedding.embed_query(sent4)
embedding5 = embedding.embed_query(sent5)

In [12]:
embedding1

[-0.02097434364259243,
 -0.005213170778006315,
 -0.02705739066004753,
 -0.027471037581562996,
 -0.02878497540950775,
 0.024916158989071846,
 -0.003899232717230916,
 -0.007013752590864897,
 0.016326896846294403,
 -0.012166093103587627,
 -0.0013793307589367032,
 0.0373985692858696,
 -0.007938375696539879,
 -0.010693996213376522,
 0.0029578814283013344,
 0.013540861196815968,
 0.03574398159980774,
 0.008668341673910618,
 0.012847394682466984,
 0.00022830434318166226,
 -0.019648240879178047,
 0.008911662735044956,
 0.006886008661240339,
 -0.02832266502082348,
 -0.003914440516382456,
 0.016898702830076218,
 0.006043506786227226,
 -0.0009040927980095148,
 -0.022130122408270836,
 -0.01986723020672798,
 0.024830995127558708,
 -0.007129330653697252,
 -0.02674107253551483,
 -0.02615709975361824,
 -0.003138852072879672,
 -0.021728642284870148,
 -0.001539010787382722,
 0.0037471565883606672,
 -0.002243123482912779,
 -0.011405711993575096,
 -0.0006668539717793465,
 0.020207880064845085,
 0.00237086

In [11]:
import numpy as np

def cosine_similarity(vec1, vec2):
    # Compute the dot product of vec1 and vec2
    dot_product = np.dot(vec1, vec2)
    
    # Compute the L2 norms (or magnitudes) of vec1 and vec2
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    
    # Compute the cosine similarity
    cos_sim = dot_product / (norm_vec1 * norm_vec2)
    
    return cos_sim

# Convert the embeddings (plain Python lists) to NumPy arrays
vec1 = np.array(embedding1)
vec2 = np.array(embedding2)
vec3 = np.array(embedding3)
vec4 = np.array(embedding4)
vec5 = np.array(embedding5)

# More similar
print(cosine_similarity(vec1, vec2))
print(cosine_similarity(vec3, vec4))

print("---")
# Less similar
print(cosine_similarity(vec1, vec3))
print(cosine_similarity(vec2, vec3))
print(cosine_similarity(vec2, vec4))
print(cosine_similarity(vec2, vec5))
print(cosine_similarity(vec4, vec5))


0.9113747123840847
0.9424351991605048
---
0.7609000588250058
0.7659245559775505
0.7612783173640613
0.745260925638238
0.7202260133268322
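
Comparing pairs one call at a time gets tedious as the number of sentences grows. The sketch below computes all pairwise cosine similarities at once with NumPy; `cosine_similarity_matrix` is a hypothetical helper, and the small 2D vectors are made-up stand-ins for real embeddings.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return a matrix whose entry [i, j] is the cosine similarity of vectors i and j."""
    matrix = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; the dot product of unit vectors is their cosine.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    return unit @ unit.T


toy_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
sims = cosine_similarity_matrix(toy_vectors)
print(np.round(sims, 3))
```

With the embeddings from this notebook, the same function could be called as `cosine_similarity_matrix([embedding1, embedding2, embedding3, embedding4, embedding5])`.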


### Step 4: Storing them into a Vector Database

A vector database is like a library for these numerical vectors. We store these vectors in a structured manner so we can search and retrieve them efficiently later on.

Once we have the embeddings (the 'numerical' form of our text), we're storing them in a vector database using the Chroma class from LangChain.

<img src="images/VectorDatabaseCreate.png" alt="Image Alt Text" width="600">

In [14]:
from langchain.vectorstores import Chroma

# Define directory to persist the embeddings
persist_directory = 'chroma/sds/'

# Create an instance of Chroma with the documents, embeddings, and the persist directory
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
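
To build intuition for what a vector database stores, here is a minimal in-memory stand-in. It is a sketch under strong assumptions, not Chroma's implementation: `ToyVectorStore` is a hypothetical class, it keeps everything in Python lists, and it compares the query against every stored vector, whereas real vector databases use indexes to avoid that full scan.

```python
import numpy as np


class ToyVectorStore:
    """Keep texts next to their unit-length vectors and search by cosine similarity."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        vec = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(vec / np.linalg.norm(vec))  # store normalized

    def search(self, query_vector, k=2):
        if not self.texts:
            return []
        query = np.asarray(query_vector, dtype=float)
        query = query / np.linalg.norm(query)
        scores = np.array(self.vectors) @ query  # cosine similarity per stored text
        best = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
        return [(self.texts[i], float(scores[i])) for i in best]


store = ToyVectorStore()
store.add("about dogs", [1.0, 0.1])
store.add("about cats", [0.9, 0.3])
store.add("about basketball", [0.0, 1.0])

results = store.search([1.0, 0.0], k=2)
print(results)
```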


### Step 5: Retrieving Relevant Documents
After storing the embeddings in the vector database (Chroma), we want to retrieve the most relevant ones based on our question. 

This is similar to asking a librarian for the most relevant books based on the topic you're interested in.

This is what the `similarity_search` method does. It takes our question and returns the most related documents from our database.

In [15]:
# Reload the persisted Chroma database from disk, using the same embedding function
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

Use the loaded vector database to find the documents most similar to a given question.

In [16]:
question = "Give me all books from the episode"

# Retrieve similar documents to a given question using the vector database
docs = vectordb.similarity_search(question, k=5)

In [17]:
docs

[Document(page_content="Harpreet Sahota:  01:15:00  Yeah, yeah, yeah. The Manga Guide to Calculus and the \nManga Guide to Linear Algebra. Super good.  \nJon Krohn:  01:15:06  Awesome. So, near the end of every episode I ask people \nfor book recommendations, but you have just given us a \nton. So, I think we've covered that question, unless you \nhave any other books you'd like to add.  \nHarpreet Sahota:  01:15:17  You know, I used to, I used to, I have, I've traded, when I \nwas recording the Artist of Data Science podcast, I read a \nlot of books lik e, cause I had so many authors on and \nsince I kind of put the podcast on hold for now, I spent \nmost of my time reading research papers in the morning \nwhenever I have free mornings. I have not read a book in \nlike six months, sadly. But the one that I have c urrently \njust gone back to rereading is Deep Work by Cal Newport. \nI think that's a good book. Important book for people who \nare in roles like ours that are knowledge -i

Checking cosine similarity

In [18]:
q_emb = embedding.embed_query(question)
q_vec = np.array(q_emb)

for d in docs:
    emb = embedding.embed_query(d.page_content)
    vec = np.array(emb)
    cosine = cosine_similarity(q_vec, vec)
    print(cosine)

0.7724149422983935
0.7522458412266196
0.7437202518308199
0.7277780673958708
0.7273169446502786
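
A vector store may rank results by a distance (smaller is closer) rather than by cosine similarity (larger is closer). For unit-length vectors the two are tied together by the identity d² = 2 − 2·cos, so both produce the same ordering. The sketch below checks this numerically; the random vectors are stand-ins for real embeddings and are normalized explicitly, which is the assumption the identity relies on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, scaled to unit length.
a = rng.normal(size=8)
b = rng.normal(size=8)
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cosine = float(np.dot(a, b))
squared_distance = float(np.sum((a - b) ** 2))

print("cosine similarity:", cosine)
print("squared distance: ", squared_distance)
print("2 - 2 * cosine:   ", 2 - 2 * cosine)
```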


### Step 6: Generating the Answer
In this step, we use our Large Language Model (LLM) to generate a response.

We provide the model with two things:
- our query
- the most relevant documents retrieved in the previous step

We use a `PromptTemplate`, which holds the instructions for our LLM along with placeholders for the context and the question. It's like telling a story to a friend and then asking them a question about that story.

In this case, the PromptTemplate instructs the LLM to use the documents (context) to answer the question at the end.

<img src="images/VectorDatabaseProcess.png" alt="Image Alt Text" width="600">

In [19]:
# Import the necessary classes from the langchain library.
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

# Define a prompt template. This is a format for the text input we'll give to our model.
# It tells the model how to structure its response and what to do in different situations.
template = """I will provide you pieces of [Context] to answer the [Question]. \
If you don't know the answer based on [Context] just say that you don't know, don't try to make up an answer. \
[Context]: {context} \
[Question]: {question} \
Helpful Answer:"""

# If your answer includes any sort of list, return it in bullets. \
# Format your answer to Markdown. \

# Create a PromptTemplate object from our string template.
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Initialize our language model. We're using OpenAI's GPT-3.5-turbo model here.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Create a RetrievalQA object. This uses our language model (llm) and a retriever,
# which is our vector database (vectordb). This object will handle asking our model questions
# and retrieving relevant documents to help answer them.
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="stuff",
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
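
Conceptually, the "stuff" chain type places the retrieved chunks into the `{context}` placeholder and the query into `{question}`. The plain-Python sketch below imitates that idea with `str.format`; it is a stand-in for illustration, not LangChain's implementation, and `build_prompt`, the shortened template, and the sample chunks are all hypothetical.

```python
TEMPLATE = (
    "[Context]: {context}\n"
    "[Question]: {question}\n"
    "Helpful Answer:"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks and insert them, with the question, into the template."""
    context = "\n\n".join(chunks)  # all chunks are "stuffed" into a single prompt
    return TEMPLATE.format(context=context, question=question)


sample_chunks = [
    "Chunk one: a book recommendation.",
    "Chunk two: advice on learning deep learning.",
]
prompt = build_prompt(sample_chunks, "Which book was recommended?")
print(prompt)
```

Because every chunk ends up in one prompt, the number and size of retrieved chunks directly determine how long the prompt becomes.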

In [20]:
question = "I'm an aspiring deep learning engineer. How should I start?"

# Ask our question to the qa_chain, and store the result.
result = qa_chain({"query": question})

# Print out the result
print(result["result"])


Based on the context provided, here is a helpful answer to the question:

To start your journey as an aspiring deep learning engineer, it is recommended to follow a top-down approach. This means focusing on applications and practical implementations rather than getting overwhelmed by the mathematical details initially. Look for courses or resources that provide hands-on experience with deep learning frameworks like PyTorch or TensorFlow.

Here are some steps you can take:

1. Start by exploring applications of deep learning, such as computer vision or natural language processing. Look for courses or tutorials that provide practical examples and projects.

2. Find resources that offer intuitive explanations of deep learning concepts. For example, Andrew Glassner's deep learning crash course or Jon Krohn's "Deep Learning: A Visual Approach" can provide a good understanding of how deep learning works without diving into complex math.

3. Build a strong foundation by learning about foundat

In [21]:
result["source_documents"]

[Document(page_content="Harpreet Sahota:  00:36:11  Yeah, yeah. Doing it on LinkedIn Learning, it's, it's, it's \ngonna  be a cool course. So, like it, the audience for this \ncourse are people who are like me before I got into deep \nlearning. So, if you're comfortable with statistics, math, \nPython programming, classical ML, if like, you're good \nwith all that, and you're like looking at this deep learning \nthing and wondering like, okay, how, how can I get into \nthis? Then this is the course that I made for you. I made \nit for an earlier version of me. And it goes through, like I \nstart with like a history of computer vision for im age \nclassification, and I talk about, you know, important \nconcepts like the things that I felt I needed to understand \nbefore I got into deep learning. So, I kind of structured it \nthat way. I start from pre -deep learning methods, just \nbriefly touch on those .", metadata={'page': 19, 'source': 'transcripts/PT693-Transcript.pdf'}),
 Document

## Bonus Section

### Understanding Retrieval from Vector Databases
Vector databases work like a magical library. 

When you ask a question, the database doesn't read through all the books (or in our case, document splits). Instead, it translates your question into a special language (the embeddings) and then finds the books that speak the same language the closest.

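These databases use a measure of similarity, such as cosine similarity, to find the most related vectors (or embeddings). In our project, `similarity_search()` does this kind of lookup: it embeds the query and returns the stored chunks whose vectors are closest to it. Depending on its configuration, Chroma may rank by Euclidean distance rather than cosine similarity; for unit-length vectors such as OpenAI's embeddings, the two give the same ordering.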

If you think of vectors as arrows in space, cosine similarity measures the cosine of the angle between them. 

When the vectors point in almost the same direction, the cosine of the angle between them is close to 1, meaning they're very similar. Vectors at right angles have a cosine of 0, meaning they're unrelated. If the vectors point in opposite directions, the cosine is close to -1, meaning they're very dissimilar.

To capture the essence:
- **vectors are close** (pointing in almost the same direction) -> high cosine similarity (close to 1) -> similar meaning
- **vectors are far apart** (pointing in almost the opposite direction) -> low cosine similarity (close to -1) -> dissimilar meaning

This is how vector databases can quickly find the most related documents to your question!
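
The range of cosine similarity can be checked with a few hand-picked 2D vectors. This is a small sketch with made-up vectors; `cosine` is a hypothetical helper equivalent to the `cosine_similarity` function defined earlier in the notebook.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine([1, 2], [2, 4])    # parallel arrows, different lengths
right_angle = cosine([1, 0], [0, 1])       # perpendicular arrows
opposite = cosine([1, 2], [-1, -2])        # arrows pointing in opposite directions

print(same_direction, right_angle, opposite)
```

Note that the length of the arrows does not matter, only their direction. In practice, the similarities measured earlier in this notebook all fell between roughly 0.72 and 0.94, so even unrelated sentences scored well above 0.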


### Context Length in Large Language Models
When a language model reads text, it has a limit to how much it can remember at once. 

This limit is called the "context length".

Imagine you're reading a very long story but you have a memory limit. If the story exceeds this limit, you start to forget the earlier parts as you read further. 

The same happens with language models. 

For the original `gpt-3.5-turbo` model, the context length is 4,096 tokens (roughly 3,000 words).

If the input exceeds this limit, the model cannot process all of it. The OpenAI API rejects requests that are too long, so the extra text has to be truncated or left out, and the model never sees it.

Context length matters because the quality of the response can significantly depend on the provided context. If important information is outside of the model's context length, it won't be able to reference it in its response.
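
A rough way to check whether a text fits is to estimate its token count from its word count. The sketch below uses the ratio quoted above (4,096 tokens for about 3,000 words) as a heuristic; `estimate_tokens` and `fits_in_context` are hypothetical helpers, and real counts depend on the tokenizer, so an exact check would use a tokenizer library such as `tiktoken` instead.

```python
def estimate_tokens(text, tokens_per_3000_words=4096):
    """Estimate the token count of `text` from its word count (rounded up)."""
    words = len(text.split())
    # Ceiling division using integers only, to avoid floating-point rounding.
    return -(-words * tokens_per_3000_words // 3000)


def fits_in_context(text, context_length=4096):
    """Return True if the estimated token count stays within the context length."""
    return estimate_tokens(text) <= context_length


short_text = "word " * 3000
long_text = "word " * 3001

print(estimate_tokens(short_text), fits_in_context(short_text))
print(estimate_tokens(long_text), fits_in_context(long_text))
```

Keep in mind that the model's answer also counts toward the context length, so a real prompt should leave room for it.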

### Vector Databases are the modern solution for the Context Length limits

We can't feed our Large Language Models with a 300-page PDF.

The Context Length is too short. The model won't "remember" most of it.

But we can feed our Vector Database with a 300-page PDF!

Thanks to similarity search, we can retrieve ONLY the relevant chunks from our PDF.

Then we send our query to the LLM together with only those chunks, which keeps the prompt within the context length.
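
The idea of staying within a budget can be sketched in a few lines: walk through the chunks in order of relevance and stop before the budget is exceeded. This is an illustrative sketch, not what `RetrievalQA` does internally; `select_chunks` is a hypothetical helper, and it counts words as a crude stand-in for tokens.

```python
def select_chunks(ranked_chunks, question, word_budget):
    """Pick chunks in ranked order until question plus chunks would exceed the budget."""
    used = len(question.split())  # the question itself uses part of the budget
    selected = []
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > word_budget:
            break  # adding this chunk would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


ranked = ["one two three", "four five", "six seven eight nine"]
picked = select_chunks(ranked, "what is this?", word_budget=9)
print(picked)
```

In the notebook itself, the same effect is achieved more simply by keeping chunks small (`CHUNK_SIZE`) and retrieving only a handful of them (`k`).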