# RAG On JFK Speeches: Part 1 

### 1. Introduction
--------------
In this post I venture into building a Retrieval Augmented Generation (RAG) application that has been "trained" on President John F. Kennedy's speeches. In past posts I covered how I [collected JFK speeches](http://michael-harmon.com/blog/jfk1.html) and [built a "speech writer"](http://michael-harmon.com/blog/jfk2.html) using a [Gated Recurrent Unit (GRU) Neural Network](https://en.wikipedia.org/wiki/Gated_recurrent_unit). In this post I build on that prior work to create a RAG pipeline. 

The first thing I will cover is how I collected the data to include extra metadata on speeches, as well as how I used the [Asyncio](https://docs.python.org/3/library/asyncio.html) package to reduce run time when writing to object storage. Next, I will go over how to load the json files from [Google Cloud Storage](https://cloud.google.com/storage?hl=en) using different [LangChain](https://www.langchain.com/) loaders. After that I cover how to embed documents and ingest the data into a [Pinecone Vector Database](https://pinecone.io/). In a follow up post I'll cover how to build the actual RAG application.

Now I'll import all the classes and functions I will need for the rest of the post.

In [27]:
# LangChain
from langchain_google_community.gcs_file import GCSFileLoader
from langchain_google_community.gcs_directory import GCSDirectoryLoader
from langchain_community.document_loaders import JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone.vectorstores import PineconeVectorStore

# Google Cloud
import os
from google.cloud import storage
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('../credentials.json')
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "../credentials.json"


# Pinecone VectorDB
from pinecone import Pinecone
from pinecone import ServerlessSpec

# API Keys
from dotenv import load_dotenv
load_dotenv()


True

### 2. Scraping JFK Speeches using Asyncio
-------------
In the [first post](http://michael-harmon.com/blog/jfk1.html) on writing a speech writer I covered how to ingest the JFK speeches from his [presidential library](https://www.jfklibrary.org/archives/other-resources/john-f-kennedy-speeches) into [Google Cloud Storage](https://cloud.google.com/storage?hl=en). I was never completely satisfied with the way I wrote the job before and decided to go back and redo it using the [Asyncio](https://docs.python.org/3/library/asyncio.html) library to perform asynchronous reading of HTML and writing of json to Google Cloud Storage. The json documents include the text of the speech, its title, source and url. I don't want to go into the details of this work, but I will say it was not as hard as I would have thought! The main thing was to turn functions which use the requests package into [coroutines](https://docs.python.org/3/library/asyncio-task.html#coroutines). Informally, when using the `requests.get` method to scrape a website, query a REST API, or perform other I/O, the process is "blocking". This means the Python task is not able to proceed until it receives the return value (or hears back) from the API or website. In the time the program is waiting, the threads and CPU could be doing other work. The [Asyncio](https://docs.python.org/3/library/asyncio.html) library allows Python to free up the threads to do other work while waiting for I/O to complete.

If you are interested in reading more about it the script is [here](https://github.com/mdh266/rag-jfk/blob/main/scripts/extract.py).
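To make the blocking versus non-blocking distinction concrete, here is a minimal sketch that is not taken from `extract.py`. It uses `asyncio.sleep` as a stand-in for network I/O, and the `fetch_speech` coroutine and the speech names passed to it are hypothetical:

```python
import asyncio
import time


async def fetch_speech(name: str, delay: float) -> str:
    # asyncio.sleep stands in for a non-blocking HTTP request (an assumption
    # for illustration); while this coroutine waits, the event loop is free
    # to run the other coroutines.
    await asyncio.sleep(delay)
    return f"{name}.json"


async def main() -> list:
    names = ["inaugural-address", "rice-university", "american-university"]
    # Schedule all three "downloads" concurrently and wait for all of them.
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fetch_speech(n, 0.1) for n in names))


start = time.perf_counter()
files = asyncio.run(main())
elapsed = time.perf_counter() - start

print(files)
# The waits overlap, so this is roughly one delay, not the sum of three.
print(f"elapsed: {elapsed:.2f}s")
```

Note that inside a Jupyter notebook an event loop is already running, so there you would `await main()` directly instead of calling `asyncio.run`.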



### 3. Loading and Embedding Speeches

At this point I have run the [extract.py](https://github.com/mdh266/rag-jfk/blob/main/scripts/extract.py) script, which scraped the website and converted the speeches into json. The data now exists as json documents in [Google Cloud Storage](https://cloud.google.com/storage?hl=en), and ingesting it into [Pinecone](https://pinecone.io/) requires the [JSONLoader](https://python.langchain.com/docs/integrations/document_loaders/json/) class from [LangChain](https://www.langchain.com/). I wanted to add metadata to the documents loaded by LangChain, so I created the `metadata_func` below:



In [34]:
from typing import Dict

def metadata_func(record: Dict[str, str], metadata: Dict[str, str]) -> Dict[str, str]:
    metadata["title"] = record.get("title")
    metadata["source"] = record.get("source")
    metadata["url"] = record.get("url")
    metadata["filename"] = record.get("filename")

    return metadata
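As a quick illustration of what this function does, here is a sketch that calls it directly on a made-up record, without LangChain. The record values and the local path are hypothetical. The starting metadata mimics what `JSONLoader` passes in, which to my understanding is a dict holding `source` and `seq_num`:

```python
from typing import Any, Dict


def metadata_func(record: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Copy selected fields from the raw json record into the metadata dict.
    for key in ("title", "source", "url", "filename"):
        metadata[key] = record.get(key)
    return metadata


# Hypothetical record, shaped like the json files described above.
record = {
    "title": "Inaugural Address",
    "source": "example-source",
    "url": "https://example.com/inaugural-address",
    "filename": "inaugural-address-19610120",
    "text": "Ask not what your country can do for you...",
}

# Assumed shape of the metadata the loader supplies before calling the function.
default_metadata = {"source": "/tmp/inaugural-address-19610120.json", "seq_num": 1}

result = metadata_func(record, default_metadata)
print(result)
```

The record's own `source` field overwrites the loader-supplied one, `seq_num` is left alone, and the speech text stays out of the metadata since it becomes the document's content.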

I could put this to use by instantiating the object,

    loader = JSONLoader(
                file_path, 
                jq_schema=jq_schema, 
                text_content=False,
                content_key="text",
                metadata_func=metadata_func
    )
                
However, I would only be able to use this on a local json document with a path (`file_path`) on my file system.

In order to load json from a GCP bucket, I need a function that takes in a file path (`file_path`) and returns an instantiated `JSONLoader` object to read the file, using `metadata_func` to attach the speech's title and where it came from:

In [35]:
    
def load_json(file_path: str, jq_schema: str="."):
    return JSONLoader(
                file_path, 
                jq_schema=jq_schema, 
                text_content=False,
                content_key="text",
                metadata_func=metadata_func
)

Now I can pass this function to LangChain's [GCSFileLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.gcs_file.GCSFileLoader.html). I can then instantiate the class to load the first debate between Kennedy and Nixon from my bucket. The full path for this json document is,

    gs://kennedyskis/1st-nixon-kennedy-debate-19600926.json

This is as follows,

In [36]:
loader = GCSFileLoader(project_name=credentials.project_id,
                       bucket="kennedyskis",
                       blob="1st-nixon-kennedy-debate-19600926.json",
                       loader_func=load_json)

Now I can load the debate which returns a list of [LangChain Document(s)](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html):

In [5]:
document = loader.load()
document

[Document(metadata={'source': 'gs://kennedyskis/1st-nixon-kennedy-debate-19600926.json', 'seq_num': 1, 'title': 'Senator John F. Kennedy and Vice President Richard M. Nixon First Joint Radio-Television Broadcast, September 26, 1960', 'url': 'https://www.jfklibrary.org//archives/other-resources/john-f-kennedy-speeches/1st-nixon-kennedy-debate-19600926', 'filename': '1st-nixon-kennedy-debate-19600926'}, page_content='\n[Text, format, and style are as published in Freedom of Communications: Final Report of the Committee on Commerce, United States Senate..., Part III: The Joint Appearances of Senator John F. Kennedy and Vice President Richard M. Nixon and Other 1960 Campaign Presentations. 87th Congress, 1st Session, Senate Report No. 994, Part 3. Washington: U.S. Government Printing Office, 1961.]\nMonday, September 26, 1960\nOriginating CBS, Chicago, Ill., All Networks carried.\nModerator, Howard K. Smith.\nMR. SMITH: Good evening.\nThe television and radio stations of the United States 

The text of the debate can be seen using the `.page_content` attribute,

In [6]:
print(document[0].page_content[:1000])


[Text, format, and style are as published in Freedom of Communications: Final Report of the Committee on Commerce, United States Senate..., Part III: The Joint Appearances of Senator John F. Kennedy and Vice President Richard M. Nixon and Other 1960 Campaign Presentations. 87th Congress, 1st Session, Senate Report No. 994, Part 3. Washington: U.S. Government Printing Office, 1961.]
Monday, September 26, 1960
Originating CBS, Chicago, Ill., All Networks carried.
Moderator, Howard K. Smith.
MR. SMITH: Good evening.
The television and radio stations of the United States and their affiliated stations are proud to provide facilities for a discussion of issues in the current political campaign by the two major candidates for the presidency.
The candidates need no introduction. The Republican candidate, Vice President Richard M. Nixon, and the Democratic candidate, Senator John F. Kennedy.
According to rules set by the candidates themselves, each man shall make an opening statement of approx

The metadata for the document can be seen from the `.metadata` attribute,

In [7]:
document[0].metadata

{'source': 'gs://kennedyskis/1st-nixon-kennedy-debate-19600926.json',
 'seq_num': 1,
 'title': 'Senator John F. Kennedy and Vice President Richard M. Nixon First Joint Radio-Television Broadcast, September 26, 1960',
 'url': 'https://www.jfklibrary.org//archives/other-resources/john-f-kennedy-speeches/1st-nixon-kennedy-debate-19600926',
 'filename': '1st-nixon-kennedy-debate-19600926'}

Documents like this debate are usually too long to fit in the context window of an LLM, so we need to break them up into smaller pieces of text. This process is called "chunking". Below I will show how to break up the Nixon-Kennedy debate into "chunks" of 200 characters with 20 characters that overlap between chunks. I do this using the [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html) class as shown below,

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
documents = text_splitter.split_documents(document)

print("Number of documents: ", len(documents))

Number of documents:  429
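To show what chunk size and overlap mean, here is a simplified sketch of fixed-width chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: as I understand it, the real class first tries to split on separators such as paragraphs, lines and spaces, so its chunks are often shorter than `chunk_size`. The `chunk_text` helper is a hypothetical name used only for illustration:

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list:
    """Cut text into fixed-width chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the last.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# 500 characters of dummy text
text = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(text)

print("Number of chunks:", len(chunks))
# The last 20 characters of one chunk are the first 20 of the next.
print("Overlap matches:", chunks[0][-20:] == chunks[1][:20])
```

The overlap means a sentence that straddles a chunk boundary still appears in full, or nearly so, in at least one chunk.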


Now we can look at the documents and their metadata,

In [9]:
for n, doc in enumerate(documents[:3]):
    print(f"Doc {n}: ", doc.page_content, "\n", "\tMetadata:", doc.metadata, "\n")

Doc 0:  [Text, format, and style are as published in Freedom of Communications: Final Report of the Committee on Commerce, United States Senate..., Part III: The Joint Appearances of Senator John F. Kennedy 
 	Metadata: {'source': 'gs://kennedyskis/1st-nixon-kennedy-debate-19600926.json', 'seq_num': 1, 'title': 'Senator John F. Kennedy and Vice President Richard M. Nixon First Joint Radio-Television Broadcast, September 26, 1960', 'url': 'https://www.jfklibrary.org//archives/other-resources/john-f-kennedy-speeches/1st-nixon-kennedy-debate-19600926', 'filename': '1st-nixon-kennedy-debate-19600926'} 

Doc 1:  John F. Kennedy and Vice President Richard M. Nixon and Other 1960 Campaign Presentations. 87th Congress, 1st Session, Senate Report No. 994, Part 3. Washington: U.S. Government Printing Office, 
 	Metadata: {'source': 'gs://kennedyskis/1st-nixon-kennedy-debate-19600926.json', 'seq_num': 1, 'title': 'Senator John F. Kennedy and Vice President Richard M. Nixon First Joint Radio-Telev

Notice the metadata is the same for each of the documents since they all come from the same original debate. 

Now that the data is loaded, we'll go over how to use [embeddings](https://platform.openai.com/docs/guides/embeddings) to convert the text into vectors. I have covered this in [prior posts](http://michael-harmon.com/blog/jfk2.html), so I won't go over it much here.


We can instantiate the LangChain [OpenAIEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/openai/) class and then use the [embed_query](https://python.langchain.com/docs/integrations/text_embedding/openai/#direct-usage) method to embed a single document as shown:

In [37]:
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

query = embedding.embed_query(documents[0].page_content)

Now we can see the first 5 entries of the vector,

In [17]:
print("First 5 entries in embedded document:", query[:5])

First 5 entries in embedded document: [-0.012023020535707474, 0.0033119581639766693, -0.005604343023151159, -0.03061368130147457, 0.013492794707417488]


As well as the size of the vector:

In [18]:
print("Vector size:", len(query))

Vector size: 1536


The embedding of text is important for the retrieval process of RAG. We embed all our documents, then embed our question, and use the embeddings to perform [semantic search](https://www.elastic.co/what-is/semantic-search), which will improve the results of our search.
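As a toy illustration of semantic search, the sketch below ranks a few made-up 3-dimensional "embeddings" against a question vector by cosine similarity. The vectors and labels are invented for the example; real embeddings from the model above have 1,536 dimensions:

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up document vectors (hypothetical values, not real embeddings).
docs = {
    "berlin": [0.9, 0.1, 0.0],
    "space": [0.0, 0.2, 0.9],
    "peace corps": [0.3, 0.8, 0.1],
}
# Made-up question vector that points mostly in the "berlin" direction.
question = [0.8, 0.2, 0.1]

ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(question, docs[name]),
    reverse=True,
)
print(ranked)
```

The document whose vector points in nearly the same direction as the question comes first, which is the idea a vector database applies at scale.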

### 4. Ingesting Speeches Into Pinecone Vector Database

Now we can load all of President Kennedy's speeches. I can see the speeches of his presidency by getting the bucket and listing the names of the speeches:

In [38]:
client = storage.Client(project=credentials.project_id,
                        credentials=credentials)

bucket = client.get_bucket("prezkennedyspeches")

speeches = [blob.name for blob in bucket.list_blobs()]
print(f"JFK had {len(speeches)} speeches in his presidency.")


JFK had 22 speeches in his presidency.


The speeches are:

In [39]:
speeches

['american-newspaper-publishers-association-19610427.json',
 'american-society-of-newspaper-editors-19610420.json',
 'american-university-19630610.json',
 'americas-cup-dinner-19620914.json',
 'berlin-crisis-19610725.json',
 'berlin-w-germany-rudolph-wilde-platz-19630626.json',
 'civil-rights-radio-and-television-report-19630611.json',
 'cuba-radio-and-television-report-19621022.json',
 'inaugural-address-19610120.json',
 'inaugural-anniversary-19620120.json',
 'irish-parliament-19630628.json',
 'latin-american-diplomats-washington-dc-19610313.json',
 'massachusetts-general-court-19610109.json',
 'peace-corps-establishment-19610301.json',
 'philadelphia-pa-19620704.json',
 'rice-university-19620912.json',
 'united-nations-19610925.json',
 'united-states-congress-special-message-19610525.json',
 'university-of-california-berkeley-19620323.json',
 'university-of-mississippi-19620930.json',
 'vanderbilt-university-19630518.json',
 'yale-university-19620611.json']

Now to load all of the speeches using the [GCSDirectoryLoader](https://python.langchain.com/docs/integrations/document_loaders/google_cloud_storage_directory/) and split them into chunks of size 2,000 characters with 100 characters overlapping:

In [40]:
loader = GCSDirectoryLoader(
                project_name=credentials.project_id,
                bucket="prezkennedyspeches",
                loader_func=load_json
)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)

Now load all the speeches and split them into documents using the `load_and_split` method:

In [41]:
documents = loader.load_and_split(text_splitter)
print(f"There are {len(documents)} documents")

There are 180 documents


Now I can create the connection to Pinecone:

In [42]:
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

I'll create an index in Pinecone. An index is basically a collection of embedded documents. [Vector databases](https://en.wikipedia.org/wiki/Vector_database) allow fo

First I set the name of the index and the dimension of the vectors it will hold:

In [43]:
index_name = "prez-speeches"
dim = 1536

Then I delete the index if it exists to clear it of all prior records.

In [44]:
# delete the index if it exists
if pc.has_index(index_name):
    pc.delete_index(index_name)

Now create the index that contains vectors of size dim:

In [45]:

# create the index
pc.create_index(
        name=index_name,
        dimension=dim,
        metric="cosine",
        spec=ServerlessSpec(
                  cloud="aws",
                  region="us-east-1"
        )
)

Notice we have to declare a metric that is useful for the search.
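The choice of metric matters because different metrics can rank the same vectors differently. Pinecone accepts `cosine`, `dotproduct` and `euclidean`. The sketch below uses two made-up 2-dimensional vectors to show a case where cosine similarity and dot product disagree:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
# Hypothetical candidates chosen to make the metrics disagree.
candidates = {
    "short_aligned": [0.5, 0.0],  # same direction as the query, small magnitude
    "long_off_axis": [3.0, 3.0],  # 45 degrees off the query, large magnitude
}

# Higher is better for cosine and dot product; lower is better for distance.
best_by_cosine = max(candidates, key=lambda k: cosine(query, candidates[k]))
best_by_dot = max(candidates, key=lambda k: dot(query, candidates[k]))
best_by_euclidean = min(candidates, key=lambda k: euclidean(query, candidates[k]))

print("cosine:", best_by_cosine)
print("dot product:", best_by_dot)
print("euclidean:", best_by_euclidean)
```

Cosine only looks at direction, while dot product also rewards magnitude. OpenAI documents its embeddings as normalized to unit length, in which case the three metrics should give the same ranking, so `cosine` is a reasonable default here.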

I can list the available indexes (indices?),

In [21]:
pc.list_indexes()

[
    {
        "name": "prez-speeches",
        "dimension": 1536,
        "metric": "cosine",
        "host": "prez-speeches-2307pwa.svc.aped-4627-b74a.pinecone.io",
        "spec": {
            "serverless": {
                "cloud": "aws",
                "region": "us-east-1"
            }
        },
        "status": {
            "ready": true,
            "state": "Ready"
        },
        "deletion_protection": "disabled"
    }
]

We can then get the statistics on the index, 

In [46]:
print(pc.Index(index_name).describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}


It shows us that we can hold vectors of size 1,536 dimensions and we have a total of 0 vectors in the index. Now to ingest documents into the database as vectors we instantiate the [PineconeVectorStore](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html) object connecting it to the index and passing the embedding,

In [47]:
vectordb = PineconeVectorStore(
                    pinecone_api_key=os.getenv("PINECONE_API_KEY"),
                    embedding=embedding,
                    index_name=index_name
)

Notice the dimension of the vector database has to match the dimension of the embedding!

Now load the documents into the index:

In [48]:
vectordb = vectordb.from_documents(
                            documents=documents, 
                            embedding=embedding, 
                            index_name=index_name
)

Under the hood LangChain will call the [embedding.embed_documents](https://python.langchain.com/docs/integrations/text_embedding/openai/#embed-multiple-texts) method to convert the documents from text to numerical vectors and then ingest them into the database.
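Conceptually, the ingestion step can be sketched as "embed in batches, then write one record per chunk". The code below is only an illustration of that idea and not LangChain's or Pinecone's actual implementation: `fake_embed_documents`, `ingest`, the dict used as a store, and the sample texts are all hypothetical stand-ins:

```python
import uuid
from typing import Callable, Dict, List


def fake_embed_documents(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for an embedding model: one tiny vector per text.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def ingest(
    texts: List[str],
    metadatas: List[dict],
    embed: Callable[[List[str]], List[List[float]]],
    store: Dict[str, dict],
    batch_size: int = 2,
) -> None:
    """Embed texts in batches and write one record per text into the store."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed(batch)  # one embedding call per batch
        for offset, (text, vector) in enumerate(zip(batch, vectors)):
            record_id = str(uuid.uuid4())  # random id for each chunk
            store[record_id] = {
                "values": vector,
                # Keep the chunk text next to its metadata so it can be returned later.
                "metadata": {**metadatas[start + offset], "text": text},
            }


store: Dict[str, dict] = {}
texts = ["We choose to go to the moon", "Ich bin ein Berliner", "Ask not"]
metadatas = [{"filename": "rice"}, {"filename": "berlin"}, {"filename": "inaugural"}]

ingest(texts, metadatas, fake_embed_documents, store)
print("Records written:", len(store))
```

Batching keeps the number of embedding requests low, which matters when each request is a paid API call.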

One of the beautiful things about LangChain is how the consistency of the API allows for easily swapping different components of LLM applications. For instance, one can switch to using a [Chroma](https://python.langchain.com/api_reference/chroma/vectorstores/langchain_chroma.vectorstores.Chroma.html#langchain_chroma.vectorstores.Chroma) database and the syntax remains exactly the same! This is important as each of these underlying databases and embedding models has its own API methods that are not necessarily consistent. However, through LangChain we do have a consistent API and do not need to learn the different syntax for the different backends.

Now get the stats on the index again. Pinecone is eventually consistent, so the vector count can lag right after ingestion:

In [49]:
print(pc.Index(index_name).describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}


Now I can get the index:

In [50]:
index = pc.Index(index_name)

This allows us to search the index for the documents closest to a query:

In [54]:
question = "How did Kennedy feel about the Berlin Wall?"
query = embedding.embed_query(question)

matches = index.query(vector=query, top_k=5)

In [53]:
matches

{'matches': [], 'namespace': '', 'usage': {'read_units': 1}}
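Since a query can come back with an empty `matches` list, as it does above, indexing into the first match directly would raise an `IndexError`. Here is a small defensive sketch using plain dicts shaped like the response above. The `first_match_id` helper and the score value are hypothetical:

```python
from typing import Optional


def first_match_id(response: dict) -> Optional[str]:
    """Return the id of the top match, or None when the query found nothing."""
    matches = response.get("matches", [])
    return matches[0].get("id") if matches else None


# Shaped like the empty response above.
empty = {"matches": [], "namespace": "", "usage": {"read_units": 1}}

# Hypothetical populated response; the score is a made-up value.
populated = {
    "matches": [{"id": "a48ee926-4c6c-4614-aebf-bc8ea77d9cd3", "score": 0.87}],
    "namespace": "",
    "usage": {"read_units": 1},
}

print(first_match_id(empty))
print(first_match_id(populated))
```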

In [77]:
# avoid shadowing the built-in id()
match_id = matches["matches"][0].get('id')

In [78]:
match_id

'a48ee926-4c6c-4614-aebf-bc8ea77d9cd3'

In [79]:
# fetch expects a list of ids
index.fetch(ids=[match_id])

{'namespace': '', 'usage': {'read_units': 4}, 'vectors': {}}

In [80]:
results = vectordb.search(query=question, search_type="similarity")

In [81]:
for doc in results:
    print(doc)

 Document(id='7dc20458-f082-490f-ae4f-032b36123f57', metadata={'filename': 'berlin-w-germany-rudolph-wilde-platz-19630626', 'seq_num': 1.0, 'source': 'gs://prezkennedyspeches/berlin-w-germany-rudolph-wilde-platz-19630626.json', 'title': 'Remarks of President John F. Kennedy at the Rudolph Wilde Platz, Berlin, June 26, 1963', 'url': 'https://www.jfklibrary.org//archives/other-resources/john-f-kennedy-speeches/berlin-w-germany-rudolph-wilde-platz-19630626'}, page_content='Listen to speech. \xa0\xa0 View related documents. \nPresident John F. Kennedy\nWest Berlin\nJune 26, 1963\n[This version is published in the Public Papers of the Presidents: John F. Kennedy, 1963. Both the text and the audio versions omit the words of the German translator. The audio file was edited by the White House Signal Agency (WHSA) shortly after the speech was recorded. The WHSA was charged with recording only the words of the President. The Kennedy Library has an audiotape of a network broadcast of the full spe

### 5. Next Steps