<a href="https://colab.research.google.com/github/FMurray/hyperdemocracy/blob/main/hyper_democracy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Dependencies

In [None]:
# if you are on a google colab, uncomment the lines below to fetch the requirements file and the hyperdemocracy.py module
# and pip install the requirements

#!wget https://raw.githubusercontent.com/FMurray/hyperdemocracy/main/requirements.txt
#!wget https://raw.githubusercontent.com/FMurray/hyperdemocracy/main/hyperdemocracy.py
#!pip install -r requirements.txt

In [None]:
import os
import rich
import time
import tiktoken
from tqdm import tqdm
from transformers import AutoTokenizer

# Choose a Provider

We have options for HuggingFace and OpenAI model providers in this notebook.

In [None]:
#PROVIDER = "HF"
PROVIDER = "OPENAI"

assert PROVIDER in ["HF", "OPENAI"]

CONFIGS = {
    "HF": {
        "embd": "sentence-transformers/all-mpnet-base-v2",
        "llm": "google/flan-t5-base",
        #"llm": "google/flan-t5-large",
        #"llm": "google/flan-ul2",
    },
    "OPENAI": {
        "embd": "text-embedding-ada-002",
        "llm": "gpt-3.5-turbo-16k",
    },
}

CONFIG = CONFIGS[PROVIDER]

# Note on Formatted Output

Note that the cell below replaces the builtin Python `print` function with `rich.print` for the rest of this notebook. If you prefer traditional print output, comment out the import below.

In [None]:
from rich import print

# Note on Cost (OpenAI Only, HF is Free)

Running this notebook with your OpenAI key in an environment variable will charge a small amount of money to your OpenAI account. The total cost of running this notebook multiple times should be less than $5 but that can change if the datasource is changed. Each cell that makes a request to an OpenAI endpoint that costs money will have the following comment in it, 

```
## THIS CELL SPENDS MONEY ##
```

Up to date pricing information on OpenAI models can be found here https://openai.com/pricing

In [None]:
EMBED_DOLLARS_PER_1K_TOKENS = 0.0001
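
As a rough worked example of how the per-1K-token price turns into a dollar amount, the sketch below applies the same arithmetic used later in the notebook. The helper name `embedding_cost_usd` and the token count are hypothetical and only illustrate the calculation; check the pricing page for current rates.

```python
EMBED_DOLLARS_PER_1K_TOKENS = 0.0001  # same value as the cell above


def embedding_cost_usd(num_tokens, dollars_per_1k_tokens=EMBED_DOLLARS_PER_1K_TOKENS):
    """Convert a token count into dollars using a per-1K-token price."""
    return dollars_per_1k_tokens * num_tokens / 1000


# hypothetical token count, chosen only to make the arithmetic easy to follow
example_tokens = 2_000_000
print(f"${embedding_cost_usd(example_tokens):.2f}")  # prints $0.20
```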

# Setup Keys

In [None]:
# if you want to use local secrets, add a file called .env to this directory; the lines below load it

from dotenv import load_dotenv
load_dotenv(".env")

In [None]:
# if you are using google colab, uncomment the lines below to manually enter your OpenAI key.

#import getpass
#os.environ['OPENAI_API_KEY'] = getpass.getpass()

In [None]:
# if you are using google colab, uncomment the lines below to manually enter your HuggingFace token.

#import getpass
#os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass.getpass()

In [None]:
# this is for development
#%load_ext autoreload
#%autoreload 2

# Load Assembleco Records

We are going to use a small subset of records provided by https://assembled.app/.

For the purposes of this workshop, we have created a [Hugging Face dataset](https://huggingface.co/datasets/assembleco/hyperdemocracy) which we can load using the `load_dataset` function. This is all handled for you in the `load_assembleco_records` function. For more information, see the documentation for the [datasets](https://huggingface.co/docs/datasets/index) package.

In [None]:
from hyperdemocracy import load_assembleco_records

In [None]:
df = load_assembleco_records(process=True, strip_html=True, remove_empty_body=True)

In [None]:
df.head()

In [None]:
df.shape

# Sponsor Graph Sidequest

We will be focusing on the text content of the legislation in this workshop, but if you would like to explore building a graph from the sponsor / co-sponsor / legislation network check out the [sponsor_graph notebook](https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/sponsor_graph.ipynb) to get started.

# From Pandas Dataframe to LangChain Documents

A langchain document is a simple class with two attributes, 
* page_content (a string)
* metadata (a dictionary)
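
To make the shape concrete, here is a minimal stand-in written with a plain dataclass. It is not LangChain's actual class (the name `SketchDocument` and the example values are made up for illustration); it only mirrors the two attributes listed above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in that mirrors the two attributes described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# toy values, used only for illustration
example = SketchDocument(
    page_content="Toy bill text used only for illustration.",
    metadata={"source": "https://example.org/bill/1", "congress_num": 118},
)
print(example.page_content)
print(example.metadata["source"])
```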

In [None]:
from langchain.schema import Document 

In [None]:
Document??

Below we take each row from our legislation DataFrame and create a LangChain Document. We use the `body` column for the `page_content` attribute and populate the `metadata` attribute with data from some of the other columns. Note that the `source` key in the `metadata` dictionary is associated with a congress.gov URL. The `source` key can hold an arbitrary string and will become important when we look into question answering systems that return information about the sources used to answer a question. We also restrict the other values in our `metadata` dictionary to `str`, `int`, and `float` types. This makes it easy to use them as filters when querying our vectorstore. If that doesn't make sense, don't worry! It will by the end of the workshop.
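
One way to check the type restriction before building Documents is sketched below. The helper `split_metadata` and the toy values are hypothetical, not part of the `hyperdemocracy` module; the sketch simply separates values that are `str`, `int`, or `float` from everything else (such as a nested list of sponsors).

```python
ALLOWED_TYPES = (str, int, float)


def split_metadata(metadata):
    """Separate values of type str/int/float from everything else."""
    usable, rejected = {}, {}
    for key, value in metadata.items():
        if isinstance(value, ALLOWED_TYPES):
            usable[key] = value
        else:
            rejected[key] = value
    return usable, rejected


# made-up values; the keys echo columns used in this notebook
toy_metadata = {
    "congress_num": 118,
    "legis_class": "hconres",
    "sponsors": [["Hypothetical Sponsor", "X000000"]],
}
usable, rejected = split_metadata(toy_metadata)
print(sorted(usable))
print(sorted(rejected))
```

This is why the cell that builds the Documents stores a single sponsor string rather than the full nested `sponsors` list.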

In [None]:
# If you don't want to embed all the assembled records, you can filter using a search query here:

from hyperdemocracy import filter_aco_df
query = "energy"
df = filter_aco_df(df, query)

# if that doesn't work for your needs, you can just take a random sample of some number
#df = df.sample(100)

In [None]:
docs = []
for irow, row in df.iterrows():
    doc = Document(
        page_content=row['body'],
        metadata={
            # Note: chroma can only filter on float, str, or int
            # https://docs.trychroma.com/usage-guide#using-where-filters
            'key': row['key'],
            'congress_num': row['congress_num'],
            'legis_class': row['legis_class'],
            'legis_num': row['legis_num'],
            'name': row['name'],
            'summary': row['summary'],
            'sponsor': row['sponsors'][0][0],
            'source': row['congress_gov_url'],
        },
    )
    docs.append(doc)

In [None]:
print(docs[0])

## Activity

* examine the Document content
* visit the congress.gov URL and view the document in various formats
* examine the body text below
* read the summary of the document and attempt to connect it with the long form text of the document

In [None]:
print(docs[0].page_content)

In [None]:
print(len(docs))

# Look at the token counts 

In [None]:
from langchain.callbacks import get_openai_callback

In [None]:
def count_openai_tokens_in_docs(docs, model_name=CONFIG["embd"]):
    num_tokens = 0
    enc = tiktoken.encoding_for_model(model_name)
    for doc in docs:
        num_tokens += len(enc.encode(doc.page_content))
    return num_tokens

In [None]:
def count_hf_tokens_in_docs(docs, model_name=CONFIG["embd"]):
    num_tokens = 0
    # load the tokenizer that matches the requested embedding model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for doc in docs:
        # count the tokens in each document, not just the first one
        num_tokens += len(tokenizer(doc.page_content)['input_ids'])
    return num_tokens

In [None]:
# estimate cost
if PROVIDER == "OPENAI":
    num_tokens = count_openai_tokens_in_docs(docs)
    cost = EMBED_DOLLARS_PER_1K_TOKENS * num_tokens / 1000
    print('Num Docs: ', len(docs))
    print('Num Tokens: ', num_tokens)
    print('Total Cost (USD): ', '$'+str(cost))
elif PROVIDER == "HF":
    num_tokens = count_hf_tokens_in_docs(docs)
    cost = 0
    print('Num Docs: ', len(docs))
    print('Num Tokens: ', num_tokens)
    print('Total Cost (USD): ', '$'+str(cost))

## Activity

* Contemplate why the answers are slightly different between the "QA" result and the "QA with sources" result.
* Visit the source links and check if the linked legislation is relevant to the question.

# Document QA - Step by Step

Our goal is to set up a question answering (QA) system that can respond to natural language questions about legislation using source material that we provide. In the following sections we will unpack all of the components and go over them in detail.

# Part 1 - Langchain Text Splitters

> When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

> At a high level, text splitters work as following:

>    1. Split the text up into small, semantically meaningful chunks (often sentences).
>    2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
>    3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

> That means there are two different axes along which you can customize your text splitter:

>    1. How the text is split
>    2. How the chunk size is measured

-- https://python.langchain.com/docs/modules/data_connection/document_transformers/#text-splitters
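
The three steps quoted above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `chunk_with_overlap` is made up, pieces are split on periods, and size is measured in characters. A single piece longer than `chunk_size` would still become its own oversized chunk.

```python
def chunk_with_overlap(pieces, chunk_size, chunk_overlap):
    """Greedily merge small pieces into chunks, measuring size in characters."""
    chunks = []
    current = []
    for piece in pieces:
        candidate = " ".join(current + [piece])
        if current and len(candidate) > chunk_size:
            # the chunk is full: emit it as its own piece of text
            chunks.append(" ".join(current))
            # carry trailing pieces forward so neighbouring chunks share context
            overlap = []
            while current and len(" ".join([current[-1]] + overlap)) <= chunk_overlap:
                overlap.insert(0, current.pop())
            current = overlap
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Alpha beta. Gamma delta. Epsilon zeta. Eta theta."
# step 1: split into small pieces (here, naive sentences)
pieces = [s.strip() for s in text.split(".") if s.strip()]
# steps 2 and 3: merge up to the size limit, then restart with some overlap
chunks = chunk_with_overlap(pieces, chunk_size=25, chunk_overlap=12)
for chunk in chunks:
    print(chunk)
```

With these toy settings each chunk repeats the last sentence of the previous one, which is the overlap that keeps context between chunks.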

Here are some useful options for splitting legislative text, 

* [character text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)
  * How the text is split: by single character
  * How the chunk size is measured: by number of characters
* [recursive text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
  * How the text is split: by list of characters
  * How the chunk size is measured: by number of characters
* [split by token](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token)
  * How the text is split: by character passed in
  * How the chunk size is measured: by tiktoken tokenizer

If you are not familiar with the concept of a token, this article may help, 
* https://simonwillison.net/2023/Jun/8/gpt-tokenizers/

## Side Quest
* check out the [text splitting notebook](https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/text_splitting.ipynb) side quest to see more details on text splitting.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

## Let's Make a TextSplitter Choice Here

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
split_docs = text_splitter.split_documents(docs)

In [None]:
print("Number of original docs: ", len(docs))
print("Number of split docs: ", len(split_docs))

In [None]:
print(split_docs[50])

# Part 2 - Embed and Index Doc Chunks

Now we will embed and index the document chunks from the previous section. 
We have many choices when it comes to text embedding models and vector indexes. 
For this tutorial we will choose, 

* text embedding model: `text-embedding-ada-002` (OpenAI) or `sentence-transformers/all-mpnet-base-v2` (HF), as set in `CONFIG`
* vector index:
  * https://www.trychroma.com
  * https://docs.trychroma.com/usage-guide#changing-the-distance-function
  * https://github.com/nmslib/hnswlib/tree/master#supported-distances
  * https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py
  * https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/utils.py#L10

For a look at some of the top performing closed and open source text embedding models, check out the HuggingFace Massive Text Embedding Benchmark (MTEB), 
* https://huggingface.co/spaces/mteb/leaderboard
  
For a more detailed introduction to embeddings in general, see the embeddings notebook
* https://github.com/FMurray/hyperdemocracy/blob/main/sidequests/embeddings.ipynb
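
As a rough picture of what "embed and index" buys us, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The vectors, chunk labels, and function names are invented for illustration. A real vector index avoids this brute-force scan and may use a different distance function (see the distance-function links above).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Score every (text, vector) pair against the query and keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# toy 3-dimensional "embeddings"; real models produce hundreds of dimensions
toy_index = [
    ("chunk about nuclear power", [0.9, 0.1, 0.0]),
    ("chunk about farm subsidies", [0.0, 0.2, 0.9]),
    ("chunk about wind energy", [0.7, 0.6, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for score, text in results:
    print(round(score, 3), text)
```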

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings

In [None]:
if PROVIDER == "HF":
    embeddings = HuggingFaceEmbeddings(model_name=CONFIG["embd"])
elif PROVIDER == "OPENAI":
    embeddings = OpenAIEmbeddings(model=CONFIG["embd"])

In [None]:
embeddings

In [None]:
NUM_DOCS=len(docs)
persist_directory = f"hyperdemocracy-chromadb-prov-{PROVIDER}-ndocs-{NUM_DOCS}"
print(persist_directory)

In [None]:
## THIS CELL SPENDS MONEY THE FIRST TIME ##
if os.path.exists(persist_directory):
    vec_store = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
else:
    batch_size = 128
    for ii in tqdm(range(0, len(split_docs), batch_size)):
        batch = split_docs[ii:ii+batch_size]
        if ii == 0:
            vec_store = Chroma.from_documents(batch, embeddings, persist_directory=persist_directory)
        else:
            vec_store.add_documents(batch)
        time.sleep(1.0)
    vec_store.persist()

In [None]:
vec_store

In [None]:
ret_docs = vec_store.similarity_search_with_score(
    "nuclear power", 
    k=3, 
    filter={"source": "https://www.congress.gov/bill/118th-congress/house-concurrent-resolution/17"},
)

print("number of returned docs: ", len(ret_docs))
for doc in ret_docs:
    print(doc)

# Part 3 - Build A RetrievalQA Chain

In [None]:
import langchain
langchain.verbose = False

In [None]:
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import HuggingFaceHub
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

# base classes to examine
from langchain.vectorstores.base import VectorStore
from langchain.schema import BaseRetriever

## Create a Retriever from Chroma VectorStore

In [None]:
retriever = vec_store.as_retriever(search_kwargs={'k':5})

In [None]:
vec_store

In [None]:
retriever

## Choose an LLM

With LangChain we can use a text completion model or a chat model for QA.

In [None]:
if PROVIDER == "HF":
    # https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task
    llm = HuggingFaceHub(
        repo_id=CONFIG["llm"],
        model_kwargs={
            "temperature": 0,
            "max_length": 128,
            "top_p": 0.95,
            "repetition_penalty": 5.0,
        })
elif PROVIDER == "OPENAI":
    if CONFIG["llm"].startswith("text"):
        llm = OpenAI(model_name=CONFIG["llm"], temperature=0)
    elif CONFIG["llm"].startswith("gpt"):
        llm = ChatOpenAI(model_name=CONFIG["llm"], temperature=0)

In [None]:
llm

In [None]:
# create a RetrievalQA Chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
)

In [None]:
## THIS CELL SPENDS MONEY ##
query = "What are the primary themes around energy policy?"
with get_openai_callback() as cb:
    out = qa(query)

In [None]:
out.keys()

In [None]:
print(out['query'])

In [None]:
print(out['result'])

In [None]:
for doc in out['source_documents']:
    print(doc.page_content)
    print("-"*50)

In [None]:
## THIS CELL SPENDS MONEY ##
out = qa("What is the solution to climate change?")

In [None]:
out.keys()

In [None]:
print(out['query'])

In [None]:
print(out['result'])

In [None]:
for doc in out['source_documents']:
    print(doc.page_content)
    print("-"*50)

## Activity

* what are the components of the RetrievalQA chain?
* what is the QA prompt?
* how would you modify the QA prompt?
* what is the difference between the following qa chain types?,
    * stuff
    * map_reduce
    * map_rerank
    * refine
 
## Resources

* https://python.langchain.com/docs/modules/chains/document/

In [None]:
# print(qa)

In [None]:
from langchain.chains.combine_documents.base import BaseCombineDocumentsChain

In [None]:
BaseCombineDocumentsChain?

## CombineDocumentChains

* https://python.langchain.com/docs/modules/chains/document/stuff
* https://python.langchain.com/docs/modules/chains/document/refine
* https://python.langchain.com/docs/modules/chains/document/map_reduce
* https://python.langchain.com/docs/modules/chains/document/map_rerank
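
To give a feel for how two of these strategies differ, here is a simplified sketch of "stuff" and "map_reduce" using a fake model call. The prompts and function names are invented for illustration and are not LangChain's actual templates or API; the point is only the number and shape of the model calls.

```python
calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for a model call; it just records the prompt."""
    calls.append(prompt)
    return f"answer {len(calls)}"


def stuff_answer(question, chunks, llm):
    # "stuff": put every retrieved chunk into one prompt and call the model once
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


def map_reduce_answer(question, chunks, llm):
    # "map": ask the question against each chunk separately
    partial = [llm(f"Context:\n{c}\n\nQuestion: {question}\nAnswer:") for c in chunks]
    # "reduce": combine the partial answers with one more model call
    combined = "\n".join(partial)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}\nFinal answer:")


chunks = ["chunk one", "chunk two", "chunk three"]
stuff_result = stuff_answer("toy question?", chunks, fake_llm)
calls_after_stuff = len(calls)
map_reduce_result = map_reduce_answer("toy question?", chunks, fake_llm)
calls_after_map_reduce = len(calls)
print(calls_after_stuff, calls_after_map_reduce - calls_after_stuff)  # prints 1 4
```

In this sketch "stuff" uses a single call but needs all chunks to fit in one prompt, while "map_reduce" trades extra calls for smaller prompts.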

## Examine the RetrievalQA Prompt 

Note that the prompt template used will depend on the choice of LLM (text completion vs chat). 

In [None]:
prompt_template = qa.combine_documents_chain.llm_chain.prompt
print(prompt_template)

# Part 4 - Create a RetrievalQAWithSourcesChain

Now we will do the same thing using a chain that provides sources in the generated answer.

In [None]:
qaws = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
)

In [None]:
## THIS CELL SPENDS MONEY ##
out = qaws("What is the solution to climate change?")

In [None]:
out.keys()

In [None]:
print(out['question'])

In [None]:
print(out['answer'])

In [None]:
print(out['sources'])

In [None]:
print(out['source_documents'])

## Activity 

* In this example, all of the returned document chunks came from one original document (118HCONRES37). What can be done to encourage a more diverse set of documents?
* What prompt is used? 

In [None]:
pt = qaws.combine_documents_chain.llm_chain.prompt

In [None]:
print(pt.format(summaries='[SUMMARIES]', question='[QUESTION]'))

# Agents

In [129]:
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.tools import BaseTool, DuckDuckGoSearchRun
from langchain.llms import OpenAI
from langchain import LLMMathChain

In [131]:
%pip install -U duckduckgo-search


In [139]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=False,
)

In [152]:
tools = [
    # create a tool that has a name, description and function
    Tool(
        name="Assemble Co QA System",
        func=qa.run,
        description="Always use this when answering questions about legislation.",
    ),
    # Can use prebuilt tools too
    DuckDuckGoSearchRun(), 
]

In [145]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")


In [153]:
agent_chain = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True, memory=memory)

In [155]:
## THIS CELL SPENDS MONEY ##
agent_chain.run(input="What is congress proposing to do about climate change? How could we quantify the impact of their proposals?")



> Entering new  chain...
I should use the Assemble Co QA System to answer the first part of the question, as it is about legislation. For the second part, I might need to use duckduckgo_search to find information on quantifying the impact of climate change proposals.
Action: Assemble Co QA System
Action Input: What is congress proposing to do about climate change?
Observation: Congress is proposing to create a Green New Deal, which aims to address climate change by achieving greenhouse gas and toxic emissions reductions, creating millions of high-wage union jobs, investing in infrastructure and industry, and promoting clean energy. The Green New Deal also emphasizes the importance of protecting vulnerable communities and restoring natural ecosystems. Additionally, Congress is committed to embracing nuclear power as a clean baseload energy source to achieve energy independence and reduce carbon emissions.
Thought: I now need to fin




Observation: Climate policy curves quantify the relationship between a carbon price and future increases in global temperatures. They incorporate two important relationships: the economic link from carbon... A rapidly increasing literature base is quantifying associations between climate change and health outcomes. Here Ebi reviews methods for quantifying, projecting, and managing the health risks of... Climate risks could affect the Budget and the overall fiscal outlook through a number of pathways, including altering total tax revenue through effects on Gross Domestic Product (GDP) growth, and... The proposal development stage is also an appropriate time to plan how impacts of climate change and other stressors on project area resources will be analyzed. Becoming familiar with the current and projected climate change impacts on the project area will help in identification of which resources are most likely to be affected. Using detailed models of sectoral impacts (e.g.,

'Congress is proposing to create a Green New Deal to address climate change, which includes measures such as reducing greenhouse gas emissions, creating jobs, investing in infrastructure and clean energy, and protecting vulnerable communities. To quantify the impact of climate change proposals, various methods can be used, such as climate policy curves, analyzing health outcomes, assessing fiscal impacts, and using detailed models to quantify and monetize risks, impacts, and damages.'

In [156]:
## THIS CELL SPENDS MONEY ##
agent_chain.run(input="What financial incentives might the sponsors of the related legislation stand to gain from their proposals?")



> Entering new  chain...
I should use the Assemble Co QA System to answer this question about legislation.
Action: Assemble Co QA System
Action Input: What financial incentives might the sponsors of the related legislation stand to gain from their proposals?
Observation: Based on the provided context, the sponsors of the related legislation might stand to gain financial assistance or incentives. These incentives could include covered incentives offered by governmental entities, financial assistance previously provided under this subsection, and potential benefits from the covered entity receiving financial assistance. However, the specific details of these financial incentives are not mentioned in the given context.
Thought: I now know the final answer
Final Answer: The sponsors of the related legislation might stand to gain financial assistance or incentives, but the specific details are not mentioned in the given context.



'The sponsors of the related legislation might stand to gain financial assistance or incentives, but the specific details are not mentioned in the given context.'

In [157]:
## THIS CELL SPENDS MONEY ##
agent_chain.run(input="Who are the sponsors of the related legislation?")



> Entering new  chain...
I should use the Assemble Co QA System to find the sponsors of the legislation.
Action: Assemble Co QA System
Action Input: Who are the sponsors of the related legislation?
Observation: The sponsors of the related legislation are Ms. Velazquez, Ms. Tlaib, Mr. Tonko, Ms. Lee of California, Ms. Stansbury, Mr. Gallego, Mrs. McBath, Mr. Cleaver, Ms. McCollum, Mr. Meeks, Mr. Payne, Ms. Ocasio-Cortez, Mr. Moskowitz, Mr. Kim of New Jersey, and Mr. DeSaulnier for the first bill, and Griffith, Mrs. Lesko, Mr. Mike Garcia of California, Mr. Langworthy, Ms. Stefanik, Ms. Van Duyne, Mrs. Spartz, Ms. Tenney, Mr. Webster of Florida, Mr. Weber of Texas, Mr. Issa, Mr. Balderson, Ms. Malliotakis, Mr. Stauber, Mr. Zinke, Mr. Smith of Missouri, Ms. Mace, Mrs. Kiggans of Virginia, Mr. Fallon, and Mr. Valadao for the second bill.
Thought: I now know the final answer
Final Answer: The sponsors of the related legislation are Ms.

'The sponsors of the related legislation are Ms. Velazquez, Ms. Tlaib, Mr. Tonko, Ms. Lee of California, Ms. Stansbury, Mr. Gallego, Mrs. McBath, Mr. Cleaver, Ms. McCollum, Mr. Meeks, Mr. Payne, Ms. Ocasio-Cortez, Mr. Moskowitz, Mr. Kim of New Jersey, and Mr. DeSaulnier for the first bill, and Griffith, Mrs. Lesko, Mr. Mike Garcia of California, Mr. Langworthy, Ms. Stefanik, Ms. Van Duyne, Mrs. Spartz, Ms. Tenney, Mr. Webster of Florida, Mr. Weber of Texas, Mr. Issa, Mr. Balderson, Ms. Malliotakis, Mr. Stauber, Mr. Zinke, Mr. Smith of Missouri, Ms. Mace, Mrs. Kiggans of Virginia, Mr. Fallon, and Mr. Valadao for the second bill.'

In [158]:
print(memory)