<a href="https://colab.research.google.com/github/notkannan/RAG-Experiment/blob/main/RAG_Workshop_Completed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter RAG Workshop

**Skills: OpenAI, LangChain, Pinecone**


### Workshop Recording: https://www.loom.com/share/75af4269ab66450e943160c199895aa7


**Other Resources:**
- [Get your OpenAI API Key](https://platform.openai.com/settings/profile?tab=api-keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)
- [Get your OpenRouter API Key](https://openrouter.ai/settings/keys)
- [JavaScript Code for RAG](https://js.langchain.com/v0.2/docs/tutorials/rag)
- [RAG with an in-memory database in Next.js](https://sdk.vercel.ai/examples/node/generating-text/rag)


### What is RAG anyway?


Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of LLM-generated text by combining two processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system then uses it to help generate a response. This is where the model, like GPT, creates new text (answers, explanations, etc.) based on the retrieved information.
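
The two steps above can be sketched without any API calls. This is a toy illustration, not the pipeline built later in this notebook: word overlap stands in for embedding similarity, the two sample sentences are made up for the example, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names.

```python
# Toy retrieve-then-generate flow. Word overlap stands in for embedding
# similarity; the notebook itself uses OpenAI embeddings and Pinecone.

def tokenize(text):
    # Lowercase, split on whitespace, and strip basic punctuation
    return {word.strip(".,!?").lower() for word in text.split()}

def retrieve(query, documents, top_k=1):
    # Rank documents by how many words they share with the query
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, retrieved):
    # The retrieved text is placed in front of the question so the model
    # can ground its answer in it
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up sample documents for illustration
documents = [
    "KanJam is played with a flying disc and two goals.",
    "Pop-a-shot games consist of 5 rounds of 1 minute each.",
]
query = "How many rounds are in a Pop-a-shot game?"
retrieved = retrieve(query, documents)
prompt = build_prompt(query, retrieved)
print(prompt)
```

In a real system the prompt would then be sent to an LLM, which is the generation step.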

#### Install relevant libraries

In [1]:
! pip install langchain langchain-community openai tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference youtube-transcript-api pytube sentence-transformers



In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain_community.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from google.colab import userdata
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

openai_api_key = userdata.get("OPENAI_API_KEY")
os.environ['OPENAI_API_KEY'] = openai_api_key




# Initialize the OpenAI client

In [3]:
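embed_model = "text-embedding-3-small"
# Pass the model explicitly so documents are indexed with the same
# embedding model that is used for the queries later in this notebook
embeddings = OpenAIEmbeddings(model=embed_model)
openai_client = OpenAI()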


# Use HuggingFace & OpenRouter if you don't have an OpenAI account with credits



In [None]:
# HuggingFace Embeddings
# Use this instead of OpenAI embeddings if you don't have an OpenAI account with credits

text = "This is a test document."

hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query_result = hf_embeddings.embed_query(text)

In [None]:
query_result

[-0.038338541984558105,
 0.12346471846103668,
 -0.02864297851920128,
 0.05365270376205444,
 0.008845366537570953,
 -0.03983934596180916,
 -0.07300589233636856,
 0.04777132719755173,
 -0.030462471768260002,
 0.05497974902391434,
 0.08505292981863022,
 0.03665666654706001,
 -0.005319973453879356,
 -0.002233141800388694,
 -0.06071099638938904,
 -0.027237920090556145,
 -0.01135166734457016,
 -0.042437683790922165,
 0.00912993960082531,
 0.10081552714109421,
 0.07578728348016739,
 0.06911715865135193,
 0.009857431054115295,
 -0.0018377641681581736,
 0.02624903991818428,
 0.03290243074297905,
 -0.07177437096834183,
 0.028384247794747353,
 0.06170954555273056,
 -0.052529532462358475,
 0.033661652356386185,
 0.07446812838315964,
 0.07536034286022186,
 0.03538404777646065,
 0.06713404506444931,
 0.010798045434057713,
 0.08167017996311188,
 0.016562897711992264,
 0.03283063694834709,
 0.036325663328170776,
 0.0021727988496422768,
 -0.09895738214254379,
 0.0050467848777771,
 0.05089650675654411,


In [None]:
# Free Llama 3.1 API via OpenRouter
# Use this instead of OpenAI if you don't have an OpenAI account with credits

openrouter_client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=userdata.get("OPENROUTER_API_KEY")
)

## Initialize our text splitter
The splitter breaks each document into chunks. These chunks are what gets embedded, stored, and later retrieved during the RAG process.

In [4]:
# cl100k_base is the encoding used by OpenAI's text-embedding-3 models
tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
)
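
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries the separators in order), and it counts whitespace-separated words rather than tokens. `chunk_words` is a hypothetical helper introduced only for this example.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    # Emit windows of `chunk_size` words; each window starts
    # `chunk_size - chunk_overlap` words after the previous one
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last word of the previous one. The overlap in the splitter above serves the same purpose: text near a chunk boundary appears in both neighbouring chunks, so it is less likely to be lost at retrieval time.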

# Understanding Embeddings

In [5]:
def get_embedding(text, model="text-embedding-3-small"):
    # Call the OpenAI API to get the embedding for the text
    response = openai_client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def cosine_similarity_between_words(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_embedding(sentence1))
    embedding2 = np.array(get_embedding(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like to walk"
sentence2 = "I like to hike"


similarity = cosine_similarity_between_words(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")


Embedding for Sentence 1: [[ 0.01786859 -0.0429923  -0.03763428 ...  0.02022715 -0.04473557
   0.00385508]]

Embedding for Sentence 2: [[ 0.00101707 -0.05145213 -0.02738291 ... -0.00255751 -0.03285436
   0.00197955]]


Cosine similarity between 'I like to walk' and 'I like to hike': 0.7358
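
The score printed above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their lengths. Below is a minimal sketch of that formula using only the standard library, on made-up 3-dimensional vectors (real embedding vectors are much longer). `cosine` is a hypothetical helper, not the scikit-learn function used in the cell above.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: the second is the first scaled by 2
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Made-up vectors at a right angle to each other
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(f"{same_direction:.4f} {orthogonal:.4f}")
```

Vectors pointing in the same direction score close to 1 regardless of their length, and perpendicular vectors score 0. This is why cosine similarity is a convenient measure of how close two sentences are in meaning.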


# Load in a YouTube video and get its transcript

In [None]:
# Load in a YouTube video's transcript
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=e-gwvmhyU7A", add_video_info=True)
data = loader.load()

print(data)



In [None]:
texts = text_splitter.split_documents(data)

In [None]:
texts

[Document(metadata={'source': 'e-gwvmhyU7A', 'title': 'Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434', 'description': 'Unknown', 'view_count': 632412, 'thumbnail_url': 'https://i.ytimg.com/vi/e-gwvmhyU7A/hq720.jpg', 'publish_date': '2024-06-19 00:00:00', 'length': 10936, 'author': 'Lex Fridman Podcast'}, page_content='- Can you have a conversation with an AI where it feels like you\ntalk to Einstein or Feynman where you ask them a hard question, they\'re like, "I don\'t know." And then after a week they\ndid a lot of research- - They disappear and come back. Yeah.\n- And they come back and just blow your mind. If we can achieve that, that amount of inference compute where it leads to a\ndramatically better answer as you apply more inference compute, I think that will be the beginning of, like, real reasoning breakthroughs. (graphic whooshing) - The following is a conversation with Aravind Srinivas, CEO of Perplexity, a company that a

# Initialize Pinecone

In [6]:
index_name = "neu-data"

namespace = "default"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Insert data into Pinecone

Documentation: https://docs.pinecone.io/integrations/langchain#key-concepts

In [13]:
for document in texts:
    print("\n\n\n\n----")

    print(document.metadata, document.page_content)

    print('\n\n\n\n----')





----
{'source': '/content/sample_data/Intramurals_data.pdf', 'page': 0} Intramural Sports 
Manager’s Handbook




----




----
{'source': '/content/sample_data/Intramurals_data.pdf', 'page': 1} This publication  is a set of guidelines  established  for Intramural  Sports  participants  at Northeastern  University.  
 
 
Intramural  Sports  
Department  of University Recreation 
Northeastern University  
Boston,  MA 
 
Jack  Butler  
Senior Assistant  Director  of University  Recreation‐  Intramural  Sports  and 
Facilities (617) 373‐7895  
j.butler@northeastern. edu 
 
Daniel LaPalm  
Coordinator  of University  Recreation  ‐ Intramural  Sports  
(617) 373‐6846  
d.lapalm@northeastern.edu  
 
Intramural Sports Office:  
nuintramurals@gmail.com  
 
Office  of University Recreation  Website:  
https://recreation.northeastern.edu/  
 
Our Mission  
Our mission  is to provide  sport  and fitness  services  for our students,  staff,  faculty  and alumni.  Our purpose  is to 
encourage 

In [14]:
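# Only the YouTube chunks carry a 'title' in their metadata; PDF chunks do not,
# so fall back to "Unknown" instead of raising KeyError: 'title'
vectorstore_from_texts = PineconeVectorStore.from_texts([f"Source: {t.metadata['source']}, Title: {t.metadata.get('title', 'Unknown')} \n\nContent: {t.page_content}" for t in texts], embeddings, index_name=index_name, namespace=namespace)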

# Perform RAG

In [15]:
from pinecone import Pinecone

In [16]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

# Connect to your Pinecone index
pinecone_index = pc.Index("neu-data")

In [28]:
query = "Explain KanJam rules"

In [30]:
raw_query_embedding = openai_client.embeddings.create(
    input=[query],
    model="text-embedding-3-small"
)

query_embedding = raw_query_embedding.data[0].embedding

In [31]:
query_embedding

[-0.028033873066306114,
 0.02136051468551159,
 -0.021245207637548447,
 -0.0024844929575920105,
 -0.011040589772164822,
 0.004111394751816988,
 -0.0050554657354950905,
 -0.00025493529392406344,
 -0.030181456357240677,
 0.023594576865434647,
 0.04695853963494301,
 -0.027255555614829063,
 0.04695853963494301,
 0.02477646805346012,
 -0.016128486022353172,
 -0.009700153023004532,
 -0.01710858941078186,
 0.0025583612732589245,
 0.0027601474430412054,
 0.04407588019967079,
 -0.016229378059506416,
 0.000894976023118943,
 -0.0028772554360330105,
 0.0518302358686924,
 0.021922633051872253,
 0.02053895592689514,
 0.01052891742438078,
 0.07425733655691147,
 0.025929531082510948,
 -0.02231179177761078,
 0.03580264374613762,
 -0.023608990013599396,
 -0.018607573583722115,
 -0.043066948652267456,
 -0.02009214460849762,
 -0.036177389323711395,
 -0.029893189668655396,
 0.04381643980741501,
 -0.028538338840007782,
 0.033813606947660446,
 0.0009846087777987123,
 0.0038159219548106194,
 -0.056817237287759

In [32]:
top_matches = pinecone_index.query(vector=query_embedding, top_k=1, include_metadata=True, namespace=namespace)

In [33]:
top_matches

{'matches': [{'id': '99970aee-81a9-4633-a82e-ceb40be61343',
              'metadata': {'text': 'Source: '
                                   '/content/sample_data/Intramurals_data.pdf, \n'
                                   '\n'
                                   'Content: Pop-a-shot Rules \n'
                                   ' \n'
                                   'GENERAL RULES:  \n'
                                   '1. All players must present their valid '
                                   'Northeastern University Photo ID to play.  '
                                   'No exceptions! \n'
                                   'Current, valid NU photo identification '
                                   'must be present in order to participate. '
                                   'Participation will \n'
                                   'be denied to anyone without their current, '
                                   'valid NU photo identification card. \n'
                      

In [34]:
# Get the list of retrieved texts
contexts = [item['metadata']['text'] for item in top_matches['matches']]
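
With `top_k=1`, whichever chunk ranks first is used as context even when it is a weak match (here the Pop-a-shot rules came back for a KanJam question). Pinecone matches also carry a `score` field, and the sketch below filters on it. The response dict is hand-written to mimic the shape shown above, and both the scores and the `min_score` threshold are made-up values for illustration. `select_contexts` is a hypothetical helper.

```python
def select_contexts(query_response, min_score):
    # Keep the text of each match whose similarity score clears the threshold
    return [
        match["metadata"]["text"]
        for match in query_response["matches"]
        if match["score"] >= min_score
    ]

# Hand-written stand-in for a Pinecone query response (scores are invented)
fake_response = {
    "matches": [
        {"id": "a", "score": 0.62, "metadata": {"text": "KanJam Rules ..."}},
        {"id": "b", "score": 0.31, "metadata": {"text": "Pop-a-shot Rules ..."}},
    ]
}

contexts_kept = select_contexts(fake_response, min_score=0.5)
print(contexts_kept)
```

A suitable threshold depends on the embedding model and the data, so it is best chosen by inspecting the scores of a few known-good and known-bad queries.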

In [35]:
contexts



In [36]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
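
The same string construction can be wrapped in a small function, which makes the prompt layout easier to read and reuse. This is only a sketch: `build_augmented_query` and its `max_contexts` parameter are names introduced here, not part of the notebook.

```python
def build_augmented_query(contexts, query, max_contexts=10):
    # Join the retrieved chunks with a visible separator, wrap them in
    # <CONTEXT> tags, then append the user's question
    joined = "\n\n-------\n\n".join(contexts[:max_contexts])
    return (
        "<CONTEXT>\n" + joined + "\n-------\n</CONTEXT>"
        + "\n\n\n\nMY QUESTION:\n" + query
    )

example = build_augmented_query(["chunk one", "chunk two"], "Explain KanJam rules")
print(example)
```

Keeping the context inside explicit tags and the question after it lets the system prompt refer to "the context provided" unambiguously.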

In [37]:
print(augmented_query)

<CONTEXT>
Source: /content/sample_data/Intramurals_data.pdf, 

Content: Pop-a-shot Rules 
 
GENERAL RULES:  
1. All players must present their valid Northeastern University Photo ID to play.  No exceptions! 
Current, valid NU photo identification must be present in order to participate. Participation will 
be denied to anyone without their current, valid NU photo identification card. 
2. GAME TIME IS FORFEIT TIME! The minimum number of players must have their Husky Cards 
turned in, be present on the field and in uniform for a team not to forfeit. The team manager 
must be responsible for collecting his team's IDs and registering their uniform numbers with the 
IM staff. 
 
 
THE GAME AND FIELD:   
1. Game Location:  Games location will be announced on the IMLeagues. All facility policies must 
always be adhered to. 
2. Game Length: Game consists of 5 rounds.  Each round is 1 minute long with 1 minute break in 
between the rounds.   
3. Tiebreaker: In the event of a tie, the final roun

In [38]:
# Modify the prompt below as needed to improve the response quality

primer = """You are a personal assistant. Answer any questions I have about the Intramurals information provided.
"""

res = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

openai_answer = res.choices[0].message.content

In [39]:
print(openai_answer)

I don't have the KanJam rules directly from your document, but I can explain the general rules of KanJam based on standard gameplay. Here’s a summary:

### KanJam Rules:

#### Equipment:
1. **Flying Disc**: One official KanJam disc.
2. **Goals/Targets**: Two KanJam goals, which are cylindrical cans with a slot in the front.

#### Teams:
1. **Players**: Each team consists of two players.
2. **Roles**: One player throws the disc while their teammate stands near the opposite goal to deflect the disc.

#### Objective:
1. Reach exactly 21 points before the other team.

#### Scoring:
1. **Dinger (1 Point)**: The deflector redirects the disc to hit the goal.
2. **Deuce (2 Points)**: The disc hits the goal directly without any deflection.
3. **Bucket (3 Points)**: The deflector redirects the disc into the goal through the open top.
4. **Instant Win**: The thrower hits the slot on the front of the goal directly without deflection. The game ends immediately.

#### Rules:
1. **Turns**: Each team 

# Using OpenRouter

In [None]:
# Check out different models here: https://openrouter.ai/docs/models

res = openrouter_client.chat.completions.create(
    model="mistralai/mistral-nemo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

answer = res.choices[0].message.content

In [None]:
print(answer)

Pre-training is mentioned to be important as it establishes a base level of intelligence and common sense that can be further built upon through post-training techniques. Aravind discusses various aspects of pre-training, including the role of data, data quality, and token size, and how they contribute to the overall performance of language models.


# Putting it all together

In [11]:
def perform_rag(query):
    raw_query_embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )

    query_embedding = raw_query_embedding.data[0].embedding

    top_matches = pinecone_index.query(vector=query_embedding, top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = """You are an expert personal assistant. Answer any questions I have about the YouTube video provided. You always answer questions based only on the context that you have been provided.
    """

    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content


In [40]:
perform_rag("Explain KanJam rules")

'Certainly! The KanJam rules are as follows:\n\n### GENERAL RULES\n1. **ID Requirement:**\n   - All players must present their valid Northeastern University (NU) Photo ID to participate. Participation will be denied to anyone without their current, valid NU photo identification card.\n   \n2. **No Jewelry:**\n   - Jewelry is not allowed to be worn by any participant. Any dangerous equipment must be removed prior to participation. Jewelry cannot be taped or covered with a band-aid.\n\n3. **Forfeit Time:**\n   - GAME TIME IS FORFEIT TIME! The minimum number of players must have their Husky Cards turned in, be present on the field, and in uniform to avoid a forfeit. The team manager is responsible for collecting their team\'s IDs and registering their uniform numbers with the IM staff.\n\n4. **Unsportsmanlike Penalties:**\n   - 3 Unsportsmanlike Penalties will result in a team disqualification regardless of the score. 2 Personal Unsportsmanlike Penalties will result in the player leaving 

In [None]:
perform_rag("What advantages does Perplexity have over other AI companies?")

"Perplexity differentiates itself from other AI companies by focusing on a few unique aspects:\n\n1. **Answer-Centric Approach**: Unlike traditional search engines that display a list of URLs, Perplexity aims to provide direct, Wikipedia-like responses to queries. This method prioritizes giving users direct answers and relevant information over sending them to another webpage. This shifts the UI focus from a list of links to summarized answers, aiming to provide a more streamlined and valuable user experience.\n\n2. **Factual Grounding (RAG - Retrieval-Augmented Generation)**: Perplexity ensures their answers are factually grounded by only generating responses based on documents retrieved from the internet. This principle aims to reduce hallucinations by sticking closely to the retrieved content, enhancing the trustworthiness and accuracy of the information provided.\n\n3. **Knowledge-Centric Mission**: The company’s mission goes beyond search and aims to make people smarter by helping

# RAG over a PDF

In [10]:
loader = PyPDFLoader("/content/sample_data/Intramurals_data.pdf") # Insert the path to a PDF here
data = loader.load()

print(data)

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
)

texts = text_splitter.split_documents(data)

# Insert all the chunks from the PDF into Pinecone
vectorstore_from_texts = PineconeVectorStore.from_texts([f"Source: {t.metadata['source']}, \n\nContent: {t.page_content}" for t in texts], embeddings, index_name=index_name, namespace=namespace)

# After this, all the code is the same as in the Perform RAG section of this notebook
# Since the data from the PDF is now stored in Pinecone, you can perform RAG over it the same way as for the YouTube video



In [None]:
from google.colab import drive
drive.mount('/content/drive')