<a href="https://colab.research.google.com/github/FMurray/hyperdemocracy/blob/main/hyper_democracy_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Dependencies

In [1]:
# if you are on a google colab, uncomment the lines below to fetch the requirements file and the hyperdemocracy.py module
# and pip install the requirements

#!wget https://raw.githubusercontent.com/FMurray/hyperdemocracy/main/requirements.txt
#!wget https://raw.githubusercontent.com/FMurray/hyperdemocracy/main/hyperdemocracy.py
!pip install -r requirements.txt

In [2]:
import os
import rich
import time
import tiktoken
from tqdm import tqdm
from transformers import AutoTokenizer
from rich import print

# Choose a Provider

We have options for HuggingFace and OpenAI model providers in this notebook.

In [3]:
PROVIDER = "HF"  
assert PROVIDER in ["HF", "OPENAI"]

CONFIGS = {
    "HF": {
        "embd": "sentence-transformers/all-mpnet-base-v2",
        "llm": "google/flan-t5-large",
    },
    "OPENAI": {
        "embd": "text-embedding-ada-002",
        "llm": "gpt-3.5-turbo-16k",
    },
}

CONFIG = CONFIGS[PROVIDER]

# Setup Keys

In [4]:
# if you want to use local secrets, add a file called .env to this directory; the lines below load it

from dotenv import load_dotenv
load_dotenv(".env")

True

In [None]:
# if you are using google colab, uncomment the lines below to manually enter your OpenAI key.

#import getpass
#os.environ['OPENAI_API_KEY'] = getpass.getpass()

In [None]:
# if you are using google colab, uncomment the lines below to manually enter your HuggingFace token.

#import getpass
#os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass.getpass()

In [None]:
# this is for development
#%load_ext autoreload
#%autoreload 2

# Load Assembled Records

We are going to use a small subset of records provided by https://assembled.app/.

For the purposes of this workshop, we have created a [huggingface dataset](https://huggingface.co/datasets/assembleco/hyperdemocracy) which we can load using the `load_dataset` function. This is all handled for you in the `load_assembleco_records` function. For more information, see the documentation for the [datasets](https://huggingface.co/docs/datasets/index) package.

In [5]:
from hyperdemocracy import load_assembleco_records

In [6]:
df = load_assembleco_records(process=True, strip_html=True, remove_empty_body=True)

Found cached dataset parquet (/home/calliope/.cache/huggingface/datasets/assembleco___parquet/assembleco--hyperdemocracy-a598a9b2b17e51dc/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [8]:
df.head()

Unnamed: 0,key,name,sponsors,summary,body,themes,index,actions,amendments,committees,relatedbills,cosponsors,subjects,text,titles,congress_num,legis_class,legis_num,congress_gov_url
0,118HCONRES1,Regarding consent to assemble outside the seat...,"[[C001053, Rep. Cole, Tom [R-OK-4], sponsor]]",This concurrent resolution authorizes the Spe...,[Congressional Bills 118th Congress]\n[From th...,"[Congress, Congressional operations and organi...","{'bill': {'actions': {'count': 7, 'url': 'http...","{'actions': [{'actionCode': None, 'actionDate'...","{'amendments': [], 'pagination': {'count': 0},...","{'committees': [], 'request': {'billNumber': '...","{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [], 'pagination': {'count': 0, ...","{'pagination': {'count': 2}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,1,https://www.congress.gov/bill/118th-congress/h...
1,118HCONRES10,Expressing the sense of Congress that the Unit...,"[[T000165, Rep. Tiffany, Thomas P. [R-WI-7], s...",This concurrent resolution calls on the Presi...,[Congressional Bills 118th Congress]\n[From th...,[International Affairs],"{'bill': {'actions': {'count': 4, 'url': 'http...","{'actions': [{'actionCode': 'H11100', 'actionD...","{'amendments': [], 'pagination': {'count': 0},...",{'committees': [{'activities': [{'date': '2023...,"{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [{'bioguideId': 'P000605', 'dis...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,10,https://www.congress.gov/bill/118th-congress/h...
2,118HCONRES11,Providing for a joint session of Congress to r...,"[[S001176, Rep. Scalise, Steve [R-LA-1], spons...",This concurrent resolution provides for a joi...,[Congressional Bills 118th Congress]\n[From th...,"[Congress, Congressional operations and organi...","{'bill': {'actions': {'count': 10, 'url': 'htt...","{'actions': [{'actionCode': None, 'actionDate'...","{'amendments': [], 'pagination': {'count': 0},...","{'committees': [], 'request': {'billNumber': '...","{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [], 'pagination': {'count': 0, ...","{'pagination': {'count': 3}, 'request': {'bill...","{'pagination': {'count': 3}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,11,https://www.congress.gov/bill/118th-congress/h...
3,118HCONRES12,Expressing the sense of Congress that all dire...,"[[C001039, Rep. Cammack, Kat [R-FL-3], sponsor...",This concurrent resolution expresses the sens...,[Congressional Bills 118th Congress]\n[From th...,"[Foreign Trade and International Finance, Agri...","{'bill': {'actions': {'count': 5, 'url': 'http...","{'actions': [{'actionCode': 'H11000', 'actionD...","{'amendments': [], 'pagination': {'count': 0},...",{'committees': [{'activities': [{'date': '2023...,"{'pagination': {'count': 0}, 'relatedBills': [...","{'cosponsors': [{'bioguideId': 'K000380', 'dis...","{'pagination': {'count': 6}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,12,https://www.congress.gov/bill/118th-congress/h...
4,118HCONRES13,Supporting the Local Radio Freedom Act.,"[[W000809, Rep. Womack, Steve [R-AR-3], sponso...",This concurrent resolution declares that Cong...,[Congressional Bills 118th Congress]\n[From th...,"[Science, Technology, Communications, Congress]","{'bill': {'actions': {'count': 3, 'url': 'http...","{'actions': [{'actionCode': 'H11100', 'actionD...","{'amendments': [], 'pagination': {'count': 0},...",{'committees': [{'activities': [{'date': '2023...,"{'pagination': {'count': 1}, 'relatedBills': [...","{'cosponsors': [{'bioguideId': 'C001066', 'dis...","{'pagination': {'count': 2}, 'request': {'bill...","{'pagination': {'count': 1}, 'request': {'bill...","{'pagination': {'count': 2}, 'request': {'bill...",118,HCONRES,13,https://www.congress.gov/bill/118th-congress/h...


In [9]:
print(len(df))

In [10]:
df.shape

(6132, 19)

# Sponsor Graph Sidequest

We will be focusing on the text content of the legislation in this workshop, but if you would like to explore building a graph from the sponsor / co-sponsor / legislation network check out the [sponsor_graph notebook](https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/sponsor_graph.ipynb) to get started.

# From Pandas Dataframe to LangChain Documents

A LangChain document is a simple class with two attributes:
* page_content (a string)
* metadata (a dictionary)

In [13]:
from langchain.schema import Document
all_docs = []
for irow, row in df.iterrows():
    doc = Document(
        page_content=row['body'],
        metadata={
            # Note: chroma can only filter on float, str, or int
            # https://docs.trychroma.com/usage-guide#using-where-filters
            'key': row['key'],
            'congress_num': row['congress_num'],
            'legis_class': row['legis_class'],
            'legis_num': row['legis_num'],
            'name': row['name'],
            'summary': row['summary'],
            'sponsor': row['sponsors'][0][0],
            'source': row['congress_gov_url'],
        },
    )
    all_docs.append(doc)
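
The comment in the cell above notes that chroma can only filter on `float`, `str`, or `int` metadata values. As a rough illustration of that constraint, the sketch below separates a metadata dictionary into values of those types and everything else. The helper name `split_filterable` and the sample dictionary are hypothetical and not part of `hyperdemocracy.py` or LangChain.

```python
# Hypothetical helper: keep only metadata values of the types the
# comment above lists as filterable (float, str, int).
ALLOWED_TYPES = (str, int, float)


def split_filterable(metadata):
    """Return (filterable_dict, sorted_list_of_rejected_keys)."""
    ok = {k: v for k, v in metadata.items() if isinstance(v, ALLOWED_TYPES)}
    rejected = sorted(k for k in metadata if k not in ok)
    return ok, rejected


# Made-up sample record; the nested list mimics the shape of the sponsors column.
sample = {
    "key": "118HCONRES1",
    "congress_num": 118,
    "sponsors": [["C001053", "Rep. Cole, Tom [R-OK-4]", "sponsor"]],
}
ok, rejected = split_filterable(sample)
print(ok)
print(rejected)
```

This is one reason the cell stores only the first sponsor ID (a string) rather than the whole nested `sponsors` list.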

In [14]:
print(all_docs[0])

## Activity

* examine the Document content
* visit the congress.gov URL and view the document in various formats
* examine the body text below
* read the summary of the document and attempt to connect it with the long form text of the document

In [None]:
print(all_docs[0].page_content)

In [None]:
print(len(all_docs))

# Subsample Docs

In [15]:
NUM_DOCS = 6132
# NUM_DOCS = 50
docs = all_docs[:NUM_DOCS]

In [16]:
from langchain.callbacks import get_openai_callback

In [17]:
def count_openai_tokens_in_docs(docs, model_name=CONFIG["embd"]):
    num_tokens = 0
    enc = tiktoken.encoding_for_model(model_name)
    for doc in docs:
        num_tokens += len(enc.encode(doc.page_content))
    return num_tokens

In [18]:
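def count_hf_tokens_in_docs(docs, model_name=CONFIG["embd"]):
    num_tokens = 0
    # load the tokenizer that matches the requested embedding model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for doc in docs:
        # count the tokens of the current doc (not always the first one)
        num_tokens += len(tokenizer(doc.page_content)['input_ids'])
    return num_tokens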

In [19]:
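# estimate cost
# assumed price for text-embedding-ada-002 (USD per 1K tokens);
# verify against OpenAI's current pricing before relying on it
EMBED_DOLLARS_PER_1K_TOKENS = 0.0001

if PROVIDER == "OPENAI":
    num_tokens = count_openai_tokens_in_docs(docs)
    cost = EMBED_DOLLARS_PER_1K_TOKENS * num_tokens / 1000
    print('Num Docs: ', len(docs))
    print('Num Tokens: ', num_tokens)
    print('Total Cost (USD): ', '$'+str(cost))
elif PROVIDER == "HF":
    # the HF embedding model runs locally, so there is no per-token charge
    num_tokens = count_hf_tokens_in_docs(docs)
    cost = 0
    print('Num Docs: ', len(docs))
    print('Num Tokens: ', num_tokens)
    print('Total Cost (USD): ', '$'+str(cost))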

## Activity

* Contemplate why the answers are slightly different between the "QA" result and the "QA with sources" result.
* Visit the source links and check if the linked legislation is relevant to the question.

# Document QA - Step by Step

Our goal is to set up a question answering (QA) system that can respond to natural language questions about legislation using source material that we provide. In the following sections we will unpack all of the components and go over them in detail.

# Part 1 - Langchain Text Splitters

> When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

> At a high level, text splitters work as following:

>    1. Split the text up into small, semantically meaningful chunks (often sentences).
>    2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
>    3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

> That means there are two different axes along which you can customize your text splitter:

>    1. How the text is split
>    2. How the chunk size is measured

-- https://python.langchain.com/docs/modules/data_connection/document_transformers/#text-splitters
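
The three steps quoted above can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: the function name `split_with_overlap` is hypothetical, size is measured in characters, and a single piece longer than `chunk_size` is kept whole rather than split further.

```python
def split_with_overlap(text, separator=" ", chunk_size=20, chunk_overlap=0):
    """Greedy sketch: split on `separator`, then pack pieces into chunks."""
    # Step 1: split the text into small pieces.
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        # Step 2: keep combining pieces until adding one would exceed chunk_size.
        if current and len(separator.join(current + [piece])) > chunk_size:
            # Step 3: emit the chunk, then seed the next one with overlap.
            chunks.append(separator.join(current))
            carried = []
            for prev in reversed(current):
                if len(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces that would push the new chunk over the limit.
            while carried and len(separator.join(carried + [piece])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo = "a bb ccc dddd eeeee"
no_overlap = split_with_overlap(demo, chunk_size=6, chunk_overlap=0)
with_overlap = split_with_overlap(demo, chunk_size=6, chunk_overlap=3)
print(no_overlap)
print(with_overlap)
```

With overlap enabled, the tail of one chunk is repeated at the start of the next whenever it fits, which is what keeps context across chunk boundaries.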

Here are some useful options for splitting legislative text:

* [character text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)
  * How the text is split: by single character
  * How the chunk size is measured: by number of characters
* [recursive text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
  * How the text is split: by list of characters
  * How the chunk size is measured: by number of characters
* [split by token](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token)
  * How the text is split: by character passed in
  * How the chunk size is measured: by tiktoken tokenizer
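
The second axis, how the chunk size is measured, can be illustrated by making the measuring function a parameter. The sketch below is not a LangChain API: `pack`, `count_chars`, and `count_words` are hypothetical names, and counting whitespace-separated words is only a crude stand-in for a real tokenizer such as tiktoken.

```python
def pack(words, chunk_size, measure):
    """Pack words into chunks whose size, per `measure`, stays within chunk_size."""
    chunks, current = [], []
    for word in words:
        if current and measure(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def count_chars(s):
    # size measured in characters
    return len(s)


def count_words(s):
    # size measured in whitespace-separated words (a rough proxy for tokens)
    return len(s.split())


words = "all men are created equal".split()
by_chars = pack(words, 10, count_chars)
by_words = pack(words, 2, count_words)
print(by_chars)
print(by_words)
```

The same packing loop produces different chunks depending only on the measuring function, which is the difference between the character-based and token-based splitters listed above.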

If you are not familiar with the concept of a token, this article may help:
* https://simonwillison.net/2023/Jun/8/gpt-tokenizers/

Mini Side Quest
* see if there is anything interesting that can be done with this https://twitter.com/RLanceMartin/status/1670489431168659456?s=20

In [20]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import TokenTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

In [21]:
text = """We hold these truths to be self-evident, that all men are created equal,

that they are endowed by their Creator with certain unalienable Rights,

that among these are Life, Liberty and the pursuit of Happiness."""

## CharacterTextSplitter

In [None]:
# this is the default separator
CharacterTextSplitter(separator="\n\n", chunk_size=20, chunk_overlap=0).split_text(text)

In [None]:
# this is what happens if we change the default separator
CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=0).split_text(text)

In [None]:
# this is what overlap does
CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=10).split_text(text)

## RecursiveCharacterTextSplitter

In [None]:
# these are the default separators
RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=40, chunk_overlap=0).split_text(text)

In [None]:
# this is what happens if we add "," to the separators
RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ",", " ", ""], chunk_size=40, chunk_overlap=0).split_text(text)
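
To give a feel for what "recursive" means here, the following sketch tries each separator in order and only falls back to the next one for pieces that are still too long. It is a simplification under stated assumptions: the name `recursive_split` is hypothetical, and unlike the LangChain class it does not merge small pieces back together or apply overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the rest on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = "aa bb\n\ncccccc"
pieces = recursive_split(sample, ["\n\n", " ", ""], chunk_size=4)
print(pieces)
```

Paragraph breaks are tried first, then spaces, and a hard character cut is used only when nothing else brings a piece under the limit.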

## TokenTextSplitter

In [None]:
# the length unit for chunk_size is now tokens not characters
if PROVIDER == "HF":
    ts = SentenceTransformersTokenTextSplitter(
        model_name=CONFIG["embd"],
        chunk_size=10, 
        tokens_per_chunk=10,
        chunk_overlap=0,
    )
elif PROVIDER == "OPENAI":
    ts = TokenTextSplitter(
        model_name=CONFIG["embd"], 
        chunk_size=10, 
        chunk_overlap=0,
    )

In [None]:
ts.split_text(text)

In [None]:
# same for chunk_overlap
if PROVIDER == "HF":
    ts = SentenceTransformersTokenTextSplitter(
        model_name=CONFIG["embd"],
        chunk_size=10, 
        tokens_per_chunk=10,
        chunk_overlap=4,
    )
elif PROVIDER == "OPENAI":
    ts = TokenTextSplitter(
        model_name=CONFIG["embd"], 
        chunk_size=10, 
        chunk_overlap=4,
    )
ts.split_text(text)

## Let's Make a TextSplitter Choice Here

In [22]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
split_docs = text_splitter.split_documents(docs)

In [23]:
print("Number of original docs: ", len(docs))
print("Number of split docs: ", len(split_docs))

In [24]:
print(split_docs[50])

# Part 2 - Embed and Index Doc Chunks

Now we will embed and index the document chunks from the previous section. 
We have many choices when it comes to text embedding models and vector indexes. 
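For this tutorial we will choose:

* text embedding model: the `embd` entry of `CONFIG` (`sentence-transformers/all-mpnet-base-v2` for `HF`, `text-embedding-ada-002` for `OPENAI`)
* vector index: Chroma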
  * https://www.trychroma.com
  * https://docs.trychroma.com/usage-guide#changing-the-distance-function
  * https://github.com/nmslib/hnswlib/tree/master#supported-distances
  * https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py
  * https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/utils.py#L10

For a look at some of the top performing closed and open source text embedding models, check out the HuggingFace Massive Text Embedding Benchmark (MTEB), 
* https://huggingface.co/spaces/mteb/leaderboard
  
For a more detailed introduction to embeddings in general, see the embeddings notebook
* https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/embeddings_v2.ipynb
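
As a toy illustration of what "embed and index" buys us, the sketch below ranks a few texts against a query by cosine similarity. The two-dimensional vectors are made up by hand, not produced by either embedding model, and cosine similarity is just one common choice; the distance function Chroma actually uses is covered in the links above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:k]


# Hypothetical "embeddings": (text, vector) pairs invented for this example.
toy_index = [
    ("budget bill", [1.0, 0.0]),
    ("defense bill", [0.0, 1.0]),
    ("audit report", [0.8, 0.6]),
]
results = top_k([1.0, 0.1], toy_index, k=2)
for score, text in results:
    print(round(score, 3), text)
```

A real vector store does the same kind of ranking over thousands of chunks, using an approximate nearest-neighbor index instead of a full scan.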

In [25]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings

In [26]:
if PROVIDER == "HF":
    embeddings = HuggingFaceEmbeddings(model_name=CONFIG["embd"])
elif PROVIDER == "OPENAI":
    embeddings = OpenAIEmbeddings(model=CONFIG["embd"])

In [27]:
persist_directory = f"hyperdemocracy-chromadb-prov-{PROVIDER}-ndocs-{NUM_DOCS}"
print(persist_directory)

In [28]:
## THIS CELL SPENDS MONEY THE FIRST TIME IF PROVIDER == "OPENAI" ##
if os.path.exists(persist_directory):
    vec_store = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
else:
    batch_size = 128
    for ii in tqdm(range(0, len(split_docs), batch_size)):
        batch = split_docs[ii:ii+batch_size]
        if ii == 0:
            vec_store = Chroma.from_documents(batch, embeddings, persist_directory=persist_directory)
        else:
            vec_store.add_documents(batch)
        time.sleep(1.0)
    vec_store.persist()

In [29]:
vec_store

<langchain.vectorstores.chroma.Chroma at 0x7f91c434d0d0>

In [30]:
ret_docs = vec_store.similarity_search_with_score(
    "Government Accountability Office GAO", 
    k=3, 
#    filter={"source": "https://www.congress.gov/bill/118th-congress/house-concurrent-resolution/17"},
)

print("number of returned docs: ", len(ret_docs))
for doc in ret_docs:
    print(doc)

# Part 3 - Build A RetrievalQA Chain

In [31]:
import langchain
langchain.verbose = False

In [32]:
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import HuggingFaceHub
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

# base classes to examine
from langchain.vectorstores.base import VectorStore
from langchain.schema import BaseRetriever

## Create a Retriever from Chroma VectorStore

In [33]:
retriever = vec_store.as_retriever(search_kwargs={'k':5})

In [34]:
vec_store

<langchain.vectorstores.chroma.Chroma at 0x7f91c434d0d0>

In [35]:
retriever

VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f91c434d0d0>, search_type='similarity', search_kwargs={'k': 5})

## Plug in a Language Model

With LangChain we can use a text completion model or a chat model for QA.

In [36]:
if PROVIDER == "HF":
    llm = HuggingFaceHub(
        repo_id=CONFIG["llm"], 
        model_kwargs={
            "temperature": 0,
            "max_length": 512,
        })
elif PROVIDER == "OPENAI":
    if CONFIG["llm"].startswith("text"):
        llm = OpenAI(model_name=CONFIG["llm"], temperature=0)
    elif CONFIG["llm"].startswith("gpt"):
        llm = ChatOpenAI(model_name=CONFIG["llm"], temperature=0)

In [37]:
llm

HuggingFaceHub(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, client=InferenceAPI(api_url='https://api-inference.huggingface.co/pipeline/text2text-generation/google/flan-t5-large', task='text2text-generation', options={'wait_for_model': True, 'use_gpu': False}), repo_id='google/flan-t5-large', task=None, model_kwargs={'temperature': 0, 'max_length': 512}, huggingfacehub_api_token=None)

In [38]:
# create a RetrievalQA Chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
)
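
The `chain_type="stuff"` option places all of the retrieved chunks into a single prompt for the language model. The sketch below shows the idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are hypothetical and are not the template LangChain uses.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
retrieved = [
    "Chunk one of a bill about radio stations.",
    "Chunk two of a bill about performance fees.",
]
prompt = build_stuff_prompt("What does the resolution say about radio?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the number of retrieved chunks (`k`) times the chunk size has to fit inside the model's context window.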

In [39]:
response = qa("How can China be encouraged to discuss land deals in middle Asia?")
print(response['result'])
print(response['source_documents'])

In [40]:
response = qa("Which fields of industry receive significant federal funding through grants?")
print(response['result'])
print(response['source_documents'])

In [42]:
response = qa("Describe the crimes of war Russia has carried on in Ukraine.")
print(response['result'])
print(response['source_documents'])

In [None]:
response = qa("How are cryptocurrencies handled differently from normal banking procedures?")
print(response['result'])
print(response['source_documents'])

In [None]:
https://hyperdemocracy.us