<a href="https://colab.research.google.com/github/FMurray/hyperdemocracy/blob/main/hyper_democracy_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Dependencies

In [1]:
# if you are on a google colab, uncomment the lines below to fetch the requirements file and the hyperdemocracy.py module
# and pip install the requirements

#!wget https://raw.githubusercontent.com/FMurray/hyperdemocracy/main/requirements.txt
#!wget https://raw.githubusercontent.com/FMurray/hyperdemocracy/main/hyperdemocracy.py
#!pip install -r requirements.txt

In [24]:
import os
import rich

# Note on Formatted Output

Note that the cell below replaces the builtin Python `print` function with `rich.print` for the rest of the notebook. If you prefer traditional print output, comment out the import below.

In [3]:
from rich import print

# Note on Cost

Running this notebook with your OpenAI key in an environment variable will charge a small amount of money to your OpenAI account. The total cost of running this notebook several times should be less than 10 cents, but that can change if the data source is changed. Each cell that makes a paid request to an OpenAI endpoint contains the following comment:

```
## THIS CELL SPENDS MONEY ##
```

Up-to-date pricing information for OpenAI models is available at https://openai.com/pricing

# Setup Keys

In [4]:
# if you want to use local secrets, add a file called .env to this directory and uncomment the lines below

from dotenv import load_dotenv
load_dotenv(".env")
#%dotenv ./.env

True

In [7]:
# if you are using google colab, uncomment the lines below to manually enter your OpenAI key.

#import getpass
#os.environ['OPENAI_API_KEY'] = getpass.getpass()
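
Whichever way the key is provided, it can help to confirm that it is actually set before running any paid cell. The sketch below is only an illustration: the helper name `has_api_key` is hypothetical and is not part of `hyperdemocracy` or the OpenAI client. It assumes the key is expected in the `OPENAI_API_KEY` environment variable, as in the cells above.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if `name` is present and non-blank in the environment.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    # Only report the problem; the key itself is never printed.
    print("OPENAI_API_KEY is not set; cells that call OpenAI will fail.")
```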

In [8]:
%load_ext autoreload
%autoreload 2

# Load Assembleco Records

We are going to use a small subset of records provided by https://assembled.app/.

For the purposes of this workshop, we have created a [Hugging Face dataset](https://huggingface.co/datasets/assembleco/hyperdemocracy) that can be loaded with the `load_dataset` function. This is all handled for you in the `load_assembleco_records` function. For more information, see the [datasets](https://huggingface.co/docs/datasets/index) package documentation.

In [9]:
from hyperdemocracy import load_assembleco_records

In [10]:
df = load_assembleco_records(process=True, strip_html=True, remove_empty_body=True)

Found cached dataset parquet (/home/galtay/.cache/huggingface/datasets/assembleco___parquet/assembleco--hyperdemocracy-37bf3764bb15f4d0/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)


In [11]:
df.head()

Unnamed: 0,key,name,sponsors,summary,body,themes,index,actions,amendments,committees,relatedbills,cosponsors,subjects,text,titles,congress_num,legis_class,legis_num,congress_gov_url
0,118HCONRES1,Regarding consent to assemble outside the seat...,"[[C001053, Rep. Cole, Tom [R-OK-4], sponsor]]",This concurrent resolution authorizes the Spe...,[Congressional Bills 118th Congress]\n[From th...,"[Congress, Congressional operations and organi...","{'bill': {'actions': {'count': 7, 'url': 'http...","{'actions': [{'actionCode': None, 'actionDate'...","{'amendments': [], 'pagination': {'count': 0},...","{'committees': [], 'request': {'billNumber': '...","{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [], 'pagination': {'count': 0, ...","{'pagination': {'count': 2}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,1,https://www.congress.gov/bill/118th-congress/h...
1,118HCONRES10,Expressing the sense of Congress that the Unit...,"[[T000165, Rep. Tiffany, Thomas P. [R-WI-7], s...",This concurrent resolution calls on the Presi...,[Congressional Bills 118th Congress]\n[From th...,[International Affairs],"{'bill': {'actions': {'count': 4, 'url': 'http...","{'actions': [{'actionCode': 'H11100', 'actionD...","{'amendments': [], 'pagination': {'count': 0},...",{'committees': [{'activities': [{'date': '2023...,"{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [{'bioguideId': 'P000605', 'dis...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,10,https://www.congress.gov/bill/118th-congress/h...
2,118HCONRES11,Providing for a joint session of Congress to r...,"[[S001176, Rep. Scalise, Steve [R-LA-1], spons...",This concurrent resolution provides for a joi...,[Congressional Bills 118th Congress]\n[From th...,"[Congress, Congressional operations and organi...","{'bill': {'actions': {'count': 10, 'url': 'htt...","{'actions': [{'actionCode': None, 'actionDate'...","{'amendments': [], 'pagination': {'count': 0},...","{'committees': [], 'request': {'billNumber': '...","{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [], 'pagination': {'count': 0, ...","{'pagination': {'count': 3}, 'request': {'bill...","{'pagination': {'count': 3}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,11,https://www.congress.gov/bill/118th-congress/h...
3,118HCONRES12,Expressing the sense of Congress that all dire...,"[[C001039, Rep. Cammack, Kat [R-FL-3], sponsor...",This concurrent resolution expresses the sens...,[Congressional Bills 118th Congress]\n[From th...,"[Foreign Trade and International Finance, Agri...","{'bill': {'actions': {'count': 5, 'url': 'http...","{'actions': [{'actionCode': 'H11000', 'actionD...","{'amendments': [], 'pagination': {'count': 0},...",{'committees': [{'activities': [{'date': '2023...,"{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [{'bioguideId': 'K000380', 'dis...","{'pagination': {'count': 6}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,12,https://www.congress.gov/bill/118th-congress/h...
4,118HCONRES13,Supporting the Local Radio Freedom Act.,"[[W000809, Rep. Womack, Steve [R-AR-3], sponso...",This concurrent resolution declares that Cong...,[Congressional Bills 118th Congress]\n[From th...,"[Science, Technology, Communications, Congress]","{'bill': {'actions': {'count': 3, 'url': 'http...","{'actions': [{'actionCode': 'H11100', 'actionD...","{'amendments': [], 'pagination': {'count': 0},...",{'committees': [{'activities': [{'date': '2023...,"{'pagination': {'count': 1}, 'relatedBills': [...","{'cosponsors': [{'bioguideId': 'C001066', 'dis...","{'pagination': {'count': 2}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,13,https://www.congress.gov/bill/118th-congress/h...


In [12]:
df.shape

(51, 19)

# Sponsor Graph Sidequest

We will be focusing on the text content of the legislation in this workshop, but if you would like to explore building a graph from the sponsor / co-sponsor / legislation network check out the [sponsor_graph notebook](https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/sponsor_graph.ipynb) to get started.

# From Pandas Dataframe to LangChain Documents

A LangChain `Document` is a simple class with two attributes:
* page_content (a string)
* metadata (a dictionary)

In [13]:
from langchain.schema import Document 

In [14]:
Document??

Init signature: Document(*, page_content: str, metadata: dict = None) -> None
Source:        
class Document(Serializable):
    """Interface for interacting with a document."""

    page_content: str
    metadata: dict = Field(default_factory=dict)
File:           ~/miniconda3/envs/hd/lib/python3.10/site-packages/langchain/schema.py
Type:           ModelMetaclass
Subclasses:     

Below we take each row from our legislation DataFrame and create a LangChain Document. We use the `body` column for the `page_content` attribute and populate the `metadata` attribute with data from some of the other columns. Note that the `source` key in the `metadata` dictionary is associated with a congress.gov URL. The `source` key can hold an arbitrary string and will become important when we look into question answering systems that return information about the sources used to answer a question. We also restrict ourselves to `str`, `int`, and `float` types in the other values of our `metadata` dictionary. This makes it easy to use them as filters when querying our vectorstore. If that doesn't make sense, don't worry! It will by the end of the workshop.

In [15]:
docs = []
for irow, row in df.iterrows():
    doc = Document(
        page_content=row['body'],
        metadata={
            # Note: chroma can only filter on float, str, or int
            # https://docs.trychroma.com/usage-guide#using-where-filters
            'key': row['key'],
            'congress_num': row['congress_num'],
            'legis_class': row['legis_class'],
            'legis_num': row['legis_num'],
            'name': row['name'],
            'summary': row['summary'],
            'sponsor': row['sponsors'][0][0],
            'source': row['congress_gov_url'],
        },
    )
    docs.append(doc)
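
The restriction to `str`, `int`, and `float` metadata values described above can be checked with a small helper before documents are indexed. This is a minimal sketch: `non_scalar_keys` and `ALLOWED_TYPES` are hypothetical names introduced for illustration, not part of LangChain, Chroma, or `hyperdemocracy`, and the example dictionary is hand-written from the first row shown earlier.

```python
# Hypothetical helper: report which metadata keys hold values that are
# not plain str, int, or float.
ALLOWED_TYPES = (str, int, float)


def non_scalar_keys(metadata):
    """Return the keys whose values are not str, int, or float."""
    bad = []
    for key, value in metadata.items():
        # bool is a subclass of int in Python; this sketch conservatively
        # treats it as not allowed (an assumption, not a Chroma rule).
        if isinstance(value, bool) or not isinstance(value, ALLOWED_TYPES):
            bad.append(key)
    return bad


example_metadata = {
    "key": "118HCONRES1",
    "congress_num": 118,
    "legis_class": "HCONRES",
    # A nested list like the raw `sponsors` column would not be filterable.
    "sponsors": [["C001053", "Rep. Cole, Tom [R-OK-4]", "sponsor"]],
}
print(non_scalar_keys(example_metadata))
```

Note that NumPy scalar types such as `numpy.int64` are not instances of Python's `int`, so this check would flag them too; converting with `int(...)` before building the metadata avoids that.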

In [16]:
print(docs[0])

## Activity

* examine the Document content
* visit the congress.gov URL and view the document in various formats
* examine the body text below
* read the summary of the document and attempt to connect it with the long form text of the document

In [17]:
print(docs[0].page_content)

In [18]:
print(len(docs))

# Document QA Quickstart

* https://python.langchain.com/docs/modules/chains/additional/question_answering
* https://python.langchain.com/docs/modules/chains/document.html

Our goal is to set up a question answering (QA) system that can respond to natural language questions about legislation using source material that we provide. In the following section, we will explore the quickest, most high-level approach provided by LangChain. Afterwards, we will unpack all of the components and go over them in more detail.

In [19]:
from langchain.indexes import VectorstoreIndexCreator

In [20]:
## THIS CELL SPENDS MONEY ##
index = VectorstoreIndexCreator().from_documents(docs)

In [28]:
## THIS CELL SPENDS MONEY ##

# QA 

# copy paste this cell and try some questions of your own
query = "What are the primary themes around energy policy?"
out = index.query(query)
out

' The primary themes around energy policy are reducing carbon emissions, embracing and accepting nuclear power as a clean baseload energy source, boosting the renewable energy economy, and avoiding overly restrictive regulations on the exploration, production, or marketing of energy resources.'

In [32]:
## THIS CELL SPENDS MONEY ##

# QA with sources

# copy paste this cell and try some questions of your own
out = index.query_with_sources(query)
print("question:\n", out['question'])
print("answer:\n", out['answer'])
print("sources:\n", out['sources'])

## Activity

* Contemplate why the answers are slightly different between the "QA" result and the "QA with sources" result.
* Visit the source links and check if the linked legislation is relevant to the question.

# Document QA - Step by Step

Now that we've used the high-level tools of LangChain, let's go into more detail.

## Langchain Text Splitters

> When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

> At a high level, text splitters work as following:

>    1. Split the text up into small, semantically meaningful chunks (often sentences).
>    2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
>    3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

> That means there are two different axes along which you can customize your text splitter:

>    1. How the text is split
>    2. How the chunk size is measured

-- https://python.langchain.com/docs/modules/data_connection/document_transformers/#text-splitters
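
The three steps quoted above (split into small pieces, merge pieces until a size limit is reached, then start the next chunk with some overlap) can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_and_merge` is hypothetical, chunk size is measured in characters, and edge cases are handled more crudely than in the real splitters.

```python
def split_and_merge(text, separator=" ", chunk_size=20, chunk_overlap=0):
    """Toy splitter: split on `separator`, then greedily merge the pieces.

    Sizes are measured in characters of the re-joined pieces.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces in the chunk currently being built

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            # The chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit within chunk_overlap
            # as the start of the next chunk.
            while current and length(current) > chunk_overlap:
                current.pop(0)
            # Drop more of the overlap if the new piece still would not fit.
            while current and length(current + [piece]) > chunk_size:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "We hold these truths to be self-evident, that all men are created equal,"
for chunk in split_and_merge(sample, chunk_size=20, chunk_overlap=10):
    print(repr(chunk))
```

A single piece longer than `chunk_size` is emitted as an oversized chunk in this sketch rather than being split further.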

Here are some useful options for splitting legislative text, 

* [character text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)
  * How the text is split: by single character
  * How the chunk size is measured: by number of characters
* [recursive text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
  * How the text is split: by list of characters
  * How the chunk size is measured: by number of characters
* [split by token](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token)
  * How the text is split: by character passed in
  * How the chunk size is measured: by tiktoken tokenizer

If you are not familiar with the concept of a token, this article may help, 
* https://simonwillison.net/2023/Jun/8/gpt-tokenizers/

In [34]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import TokenTextSplitter

In [37]:
text = """We hold these truths to be self-evident, that all men are created equal,

that they are endowed by their Creator with certain unalienable Rights,

that among these are Life, Liberty and the pursuit of Happiness."""

### CharacterTextSplitter

In [45]:
# this is the default separator
CharacterTextSplitter(separator="\n\n", chunk_size=20, chunk_overlap=0).split_text(text)

Created a chunk of size 72, which is longer than the specified 20
Created a chunk of size 71, which is longer than the specified 20


['We hold these truths to be self-evident, that all men are created equal,',
 'that they are endowed by their Creator with certain unalienable Rights,',
 'that among these are Life, Liberty and the pursuit of Happiness.']

In [62]:
# this is what happens if we change the default separator
CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=0).split_text(text)

['We hold these truths',
 'to be self-evident,',
 'that all men are',
 'created equal,\n\nthat',
 'they are endowed by',
 'their Creator with',
 'certain unalienable',
 'Rights,\n\nthat among',
 'these are Life,',
 'Liberty and the',
 'pursuit of',
 'Happiness.']

In [63]:
# this is what overlap does
CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=10).split_text(text)

['We hold these truths',
 'truths to be',
 'to be self-evident,',
 'that all men are',
 'men are created',
 'created equal,\n\nthat',
 'they are endowed by',
 'endowed by their',
 'by their Creator',
 'Creator with certain',
 'certain unalienable',
 'Rights,\n\nthat among',
 'among these are',
 'these are Life,',
 'are Life, Liberty',
 'Liberty and the',
 'and the pursuit of',
 'of Happiness.']

### RecursiveCharacterTextSplitter

In [48]:
# these are the default separators
RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=40, chunk_overlap=0).split_text(text)

['We hold these truths to be self-evident,',
 'that all men are created equal,',
 'that they are endowed by their Creator',
 'with certain unalienable Rights,',
 'that among these are Life, Liberty and',
 'the pursuit of Happiness.']

In [64]:
RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ",", " ", ""], chunk_size=40, chunk_overlap=0).split_text(text)

['We hold these truths to be self-evident',
 ', that all men are created equal,',
 'that they are endowed by their Creator',
 'with certain unalienable Rights',
 ',',
 'that among these are Life',
 ', Liberty and the pursuit of Happiness.']

### TokenTextSplitter

In [65]:
# the length unit here is tokens not characters
TokenTextSplitter(model_name="text-embedding-ada-002", chunk_size=10, chunk_overlap=0).split_text(text)

['We hold these truths to be self-evident',
 ', that all men are created equal,\n\nthat they',
 ' are endowed by their Creator with certain unalienable',
 ' Rights,\n\nthat among these are Life, Liberty and',
 ' the pursuit of Happiness.']

In [66]:
# the length unit here is tokens not characters
TokenTextSplitter(model_name="text-embedding-ada-002", chunk_size=10, chunk_overlap=4).split_text(text)

['We hold these truths to be self-evident',
 ' self-evident, that all men are created',
 ' all men are created equal,\n\nthat they are endowed',
 'that they are endowed by their Creator with certain un',
 ' Creator with certain unalienable Rights,\n\nthat among',
 ' Rights,\n\nthat among these are Life, Liberty and',
 ' Life, Liberty and the pursuit of Happiness.',
 ' of Happiness.']

In [54]:
import tiktoken

In [57]:
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
print(enc)

In [59]:
print(enc.encode(text))

In [61]:
tokens = [enc.decode_single_token_bytes(token) for token in enc.encode(text)]
tokens

[b'We',
 b' hold',
 b' these',
 b' truths',
 b' to',
 b' be',
 b' self',
 b'-e',
 b'vid',
 b'ent',
 b',',
 b' that',
 b' all',
 b' men',
 b' are',
 b' created',
 b' equal',
 b',\n\n',
 b'that',
 b' they',
 b' are',
 b' endowed',
 b' by',
 b' their',
 b' Creator',
 b' with',
 b' certain',
 b' un',
 b'alien',
 b'able',
 b' Rights',
 b',\n\n',
 b'that',
 b' among',
 b' these',
 b' are',
 b' Life',
 b',',
 b' Liberty',
 b' and',
 b' the',
 b' pursuit',
 b' of',
 b' Happiness',
 b'.']

### Let's Make a TextSplitter Choice Here

In [67]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
split_docs = text_splitter.split_documents(docs)

In [68]:
print("Number of original docs: ", len(docs))
print("Number of split docs: ", len(split_docs))

## Embed and Index Doc Chunks

# Intro to Embeddings

[Link to Notebook](https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/embeddings.ipynb)


## Index Embeddings in a Vector Database

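In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

In [None]:
## THIS CELL SPENDS MONEY ##
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(split_docs, embeddings)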

In [None]:
db

In [None]:
# similarity types: cosine, inner product, squared L2
# Chroma uses hnswlib, which supports these three distances (Chroma's documented default is squared L2, "l2")
# https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py
# https://docs.trychroma.com/usage-guide#changing-the-distance-function
# https://github.com/nmslib/hnswlib/tree/master#supported-distances

# in addition langchain offers maximal marginal relevance on top of cosine
# https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/utils.py#L10

ret_docs = db.similarity_search_with_score(
    "nuclear power", 
    k=10, 
    filter={"source": "https://www.congress.gov/bill/118th-congress/house-concurrent-resolution/17"},
)

for doc in ret_docs:
    print(doc)

In [None]:
# show that this is all the docs from filter
len([d for d in split_docs if d.metadata['source']=='https://www.congress.gov/bill/118th-congress/house-concurrent-resolution/17'])
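
The three similarity types named in the comments of the search cell above (cosine, inner product, squared L2) can be written out directly for small vectors. This is a plain-Python sketch for intuition only; vector stores compute these with optimized native code, and libraries often report a distance derived from the similarity (for example, 1 minus the cosine similarity) rather than the similarity itself.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product of the two vectors after normalizing their lengths."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way, but twice as long
orthogonal = [0.0, 3.0]      # points at a right angle to the query

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
print(squared_l2(query, same_direction))         # 1.0
print(inner_product(query, same_direction))      # 2.0
```

Cosine similarity ignores vector length, while inner product and squared L2 do not, which is why the choice of distance matters when embeddings are not normalized.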

# What are retrievers?

TODO: TL;DR 
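
As a rough working definition, a retriever takes a query string and returns a list of relevant documents; a vectorstore is one way to back a retriever, but not the only one. The toy class below illustrates the idea with keyword overlap instead of embeddings. The class name `KeywordRetriever`, its scoring rule, and the sample texts are hypothetical; only the method name `get_relevant_documents` mirrors the LangChain retriever interface as it existed when this notebook was written.

```python
class KeywordRetriever:
    """Toy retriever: rank texts by how many query words they contain."""

    def __init__(self, texts, k=2):
        self.texts = texts
        self.k = k  # maximum number of results to return

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        scored = []
        for text in self.texts:
            overlap = len(query_words & set(text.lower().split()))
            if overlap > 0:
                scored.append((overlap, text))
        # Highest overlap first; Python's sort is stable, so ties keep input order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[: self.k]]


texts = [
    "nuclear power is a clean baseload energy source",
    "supporting the local radio freedom act",
    "renewable energy and nuclear energy policy",
]
toy_retriever = KeywordRetriever(texts, k=2)
print(toy_retriever.get_relevant_documents("nuclear power"))
```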

In [None]:
import langchain
langchain.verbose = False

In [None]:
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import OpenAI

In [None]:
retriever = db.as_retriever(search_kwargs={'k':10})

In [None]:
retriever

Compare the chains in the original Document QA quickstart with the chains here.

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
)

# Questions

* what are the components of the RetrievalQA chain?
* what is the QA prompt?
* how would you modify the QA prompt?
* what is the difference between the following qa chain types?
    * stuff
    * map_reduce
    * map_rerank
    * refine

# Resources

* https://github.com/hwchase17/langchain/tree/master/langchain/chains/retrieval_qa
* https://github.com/hwchase17/langchain/tree/master/langchain/chains/question_answering

In [None]:
from IPython import display

In [None]:
display.Image("https://python.langchain.com/assets/images/stuff-818da4c66ee17911bc8861c089316579.jpg", width=600)
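
The "stuff" chain type in the diagram above places all retrieved chunks into a single prompt and makes one LLM call. A minimal sketch of that prompt assembly follows. The function name `build_stuff_prompt` and the template wording are hypothetical stand-ins, not LangChain's actual prompt, which can be printed from the `qa` object in the cells that follow.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block.

    Nothing is summarized or filtered, so the total prompt length grows
    with the number and size of the chunks.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved_chunks = ["[CHUNK 1 TEXT]", "[CHUNK 2 TEXT]"]
print(build_stuff_prompt(retrieved_chunks, "[QUESTION]"))
```

Because everything goes into one prompt, this approach is simple but is limited by the model's context window, which is one reason the chunk size and the retriever's `k` matter.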

In [None]:
# WARNING! Do not commit the outputs of this cell if it contains your API key

rich.print(qa)

## How many ways can we print a prompt? 

In [None]:
prompt_template = qa.combine_documents_chain.llm_chain.prompt
prompt_template

In [None]:
print(prompt_template.template)

In [None]:
import textwrap

In [None]:
rich.print(prompt_template.format(context='[CONTEXT]', question='[QUESTION]'))

In [None]:
## THIS CELL SPENDS MONEY ##
answer = qa("What is the solution to climate change?")

In [None]:
answer.keys()

In [None]:
print(answer['result'])

In [None]:
qaws = RetrievalQAWithSourcesChain.from_chain_type(
    llm=OpenAI(), 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
)

In [None]:
# WARNING! Do not commit the outputs of this cell if it contains your API key
print(qaws)

In [None]:
pt = qaws.combine_documents_chain.llm_chain.prompt

In [None]:
print(pt.format(summaries='[SUMMARIES]', question='[QUESTION]'))

In [None]:
## THIS CELL SPENDS MONEY ##
answer = qaws("What is the solution to climate change?")

In [None]:
answer.keys()

In [None]:
print(answer['answer'])
print(answer['sources'])

# Prompt Construction Sidequest

# TODO

Try alternatives to stuff

Figure out how to pass all the options to the high level constructor. 

https://github.com/hwchase17/langchain/blob/master/langchain/indexes/vectorstore.py

In [None]:
from langchain.embeddings import OpenAIEmbeddings

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128),
)


Sticking this here to decide if we want to use this in the course content

https://xml.house.gov/

TODO: Sidequest on implementing a langchain document loader using this XML schema ^^^

https://www.everycrsreport.com/

# Let's Make It a Conversation

https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html

In [None]:
from langchain.chains import ConversationalRetrievalChain

In [None]:
# TODO cover serializing the db to disk
db

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [None]:
qachat = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0), 
    db.as_retriever(), 
    memory=memory
)

In [None]:
## THIS CELL SPENDS MONEY ##
query = "What is the solution to climate change?"
answer = qachat(query)

In [None]:
print(answer)

In [None]:
## THIS CELL SPENDS MONEY ##
follow_up = "How certain is the 350 number?"
result = qachat({"question": follow_up})

In [None]:
print(result)