## Explore Riksdagen API - V1

This notebook explores Riksdagen's API for fetching their open data.

In [217]:
# Ensure the imports are reloaded when running the script
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### 1. Extract: Explore the Open API

In [143]:
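from typing import Dict, Optional
from urllib.parse import urlencode

import requests


def build_rest_url(base_url: str, query_params: Optional[Dict[str, str]] = None) -> str:
    # Drop empty values and URL-encode the rest (None avoids a shared mutable default)
    params = {key: value for key, value in (query_params or {}).items() if value}
    return f"{base_url}?{urlencode(params)}"


def list_documents_by_filter(query_params: Optional[Dict[str, str]] = None):
    BASE_URL = "https://data.riksdagen.se/dokumentlista/"
    rest_url = build_rest_url(BASE_URL, query_params)
    response = requests.get(rest_url)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body
    return response.json()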

In [144]:
def get_document_by_id(doc_id: str):
    BASE_URL = "https://data.riksdagen.se/dokument"
    url = f"{BASE_URL}/{doc_id}.json"

    r = requests.get(url)
    # Return None for any non-200 response so callers can detect a missing document
    if r.status_code == 200:
        return r.json()
    return None

Here, we first list the latest documents matching a set of query filters. After picking one of the items, we fetch the specific document by its id.

In [145]:
query_params = {
    "doktyp": "bet",
    "sort": "rel",
    "sortorder": "desc",
    "utformat": "json",
    "p": "1",
}
response = list_documents_by_filter(query_params)
latest_docs = response["dokumentlista"]["dokument"]

print(f"Found {len(latest_docs)} documents. Sampling one of them.")

doc_overview = latest_docs[1]
print(
    f"""
Example data from the document:
    - id: {doc_overview['id']}
    - titel: {doc_overview['titel']}
    - typ: {doc_overview['typ']}
    - publicerad: {doc_overview['publicerad']}
    - dokument_url_html: {doc_overview['dokument_url_html']}
"""
)

Found 20 documents. Sampling one of them.

Example data from the document:
    - id: hb01ubu4
    - titel: Fortsatt giltighet av lagen om vissa register för forskning om vad arv och miljö betyder för människors hälsa
    - typ: bet
    - publicerad: 2023-09-18 16:03:05
    - dokument_url_html: //data.riksdagen.se/dokument/HB01UbU4.html
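Note that `dokument_url_html` comes back as a protocol-relative URL (it starts with `//`). A small sketch of how such a value could be normalized before it is stored or requested is shown below. The helper name `absolutize` and the choice of `https` as the default scheme are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def absolutize(url: str, default_scheme: str = "https") -> str:
    """Add a scheme to protocol-relative URLs; leave absolute URLs untouched."""
    parts = urlsplit(url)
    if parts.scheme:
        return url
    return urlunsplit(
        (default_scheme, parts.netloc, parts.path, parts.query, parts.fragment)
    )


print(absolutize("//data.riksdagen.se/dokument/HB01UbU4.html"))
```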



In [146]:
doc_id = doc_overview["id"]
response = get_document_by_id(doc_id)
doc = response["dokumentstatus"]

In [147]:
print(f"Document '{doc_id}' has {len(doc)} sections: {list(doc.keys())}.")

Document 'hb01ubu4' has 6 sections: ['dokument', 'dokutskottsforslag', 'dokaktivitet', 'dokuppgift', 'dokbilaga', 'dokreferens'].


### 2. Transform: Understanding the document structure
For this exercise, we've selected one particular document, "Utgiftsområde 13 Jämställdhet och nyanlända invandrares etablering", which we'll embed into Pinecone.

**Transform the data into our Domain model**

In [148]:
doc_id = "HA01AU1"
# doc_id = "HB01UbU3"
response = get_document_by_id(doc_id)
unvalidated_doc = response["dokumentstatus"]

In [245]:
import notebooks.models.riksdagen as riksdagen
from pydantic import ValidationError
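from typing import Any, Dict, Optional

_JSON = Dict[str, Any]  # The API returns nested JSON, not only string values


def validate_doc(doc: _JSON) -> Optional[riksdagen.DokumentStatus]:
    try:
        return riksdagen.DokumentStatus.model_validate(doc)
    except ValidationError as e:
        # Print the validation errors and signal failure with None
        print(e)
        return None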


doc = validate_doc(unvalidated_doc)

Via Pydantic, we now have a fully typed data object.

In [246]:
doc.dokreferens.referens[:5]

[Referens(referenstyp='behandlar', uppgift='motion 2022/23:1230 Svensk flyktingpolitik', ref_dok_id='HA021230', ref_dok_typ='mot', ref_dok_rm='2022/23', ref_dok_bet='1230', ref_dok_titel='Svensk flyktingpolitik', ref_dok_subtitel='av Tony Haddou m.fl. (V)', ref_dok_subtyp='mot', ref_dok_dokumentnamn='Motion'),
 Referens(referenstyp='behandlar', uppgift='motion 2022/23:1273 Utgiftsområde 13 Jämställdhet och nyanlända invandrares etablering', ref_dok_id='HA021273', ref_dok_typ='mot', ref_dok_rm='2022/23', ref_dok_bet='1273', ref_dok_titel='Utgiftsområde 13 Jämställdhet och nyanlända invandrares etablering', ref_dok_subtitel='av Nooshi Dadgostar m.fl. (V)', ref_dok_subtyp='mot', ref_dok_dokumentnamn='Motion'),
 Referens(referenstyp='behandlar', uppgift='motion 2022/23:2053 Utgiftsområde 13 Jämställdhet och nyanlända invandrares etablering', ref_dok_id='HA022053', ref_dok_typ='mot', ref_dok_rm='2022/23', ref_dok_bet='2053', ref_dok_titel='Utgiftsområde 13 Jämställdhet och nyanlända invandr

**Convert into domain models we care about**

I can think of the following questions users may ask our RAG model:
1. What is this Betänkande about?
2. What was Person X's opinion about this Betänkande?

To answer those questions, we need to embed the data in multiple ways. Let's start simple, though, and embed only the original submission.

In [248]:
import io
from typing import List
import requests
from PyPDF2 import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from notebooks.models.retrieval import Document, DocumentMetadata


def extract_full_text_from_pdf(pdf_url):
    r = requests.get(pdf_url)
    if r.status_code != 200:
        print(f"Failed to fetch PDF from URL: {pdf_url}")
        return None
    on_fly_pdf = io.BytesIO(r.content)

    reader = PdfReader(on_fly_pdf)

    return "".join([page.extract_text() for page in reader.pages])


def create_chunk_id(parent_id: str, idx: int) -> str:
    return f"{parent_id}-{idx}"


def chunkify(parent_id: str, text: str) -> List[Document]:
    splitter = RecursiveCharacterTextSplitter(
        separators=["   \n", "  \n", " \n", "\n", ". ", " ", ""]
    )
    document_chunks_lc_domain = splitter.create_documents([text])
    return [
        Document(id=create_chunk_id(parent_id, idx=idx), text=lc_doc.page_content)
        for idx, lc_doc in enumerate(document_chunks_lc_domain)
    ]


doc_id = doc.dokument.dok_id
betankande_file_url = doc.dokbilaga.bilaga[0].fil_url
metadata = DocumentMetadata(
    data={
        "doc_id": str(doc_id),
        "titel": str(doc.dokument.titel),
        "typ": str(doc.dokument.typ),
        "publicerad": str(doc.dokument.publicerad),
        "dokument_url_html": str(doc.dokument.dokument_url_html),
    }
)

betankande_full_text = extract_full_text_from_pdf(betankande_file_url)
document_chunks = chunkify(doc_id, betankande_full_text)

In [249]:
document_chunks = [
    Document(id=chunk.id, text=chunk.text, metadata=metadata) for chunk in document_chunks
]

We now have our chunks, each with a proper ID and metadata.

In [251]:
# Use a separate loop variable so the validated `doc` from above is not overwritten
for chunk in document_chunks[3:5]:
    print(
        f"Document ID: '{chunk.id}', \nMetadata: '{chunk.metadata.data}' \nText: '{chunk.text[:100]}' \n\n"
    )

Document ID: 'HA01AU1-3', 
Metadata: '{'doc_id': 'HA01AU1', 'titel': 'Utgiftsområde 13 Jämställdhet och nyanlända invandrares etablering', 'typ': 'bet', 'publicerad': '2022-10-20 09:12:21', 'dokument_url_html': 'http://data.riksdagen.se/dokument/HA01AU1'}' 
Text: '6    2022 /23:AU1  
Utskottets överväganden  
Regeringens resultatredovisning för utgiftsområde 13  ' 


Document ID: 'HA01AU1-4', 
Metadata: '{'doc_id': 'HA01AU1', 'titel': 'Utgiftsområde 13 Jämställdhet och nyanlända invandrares etablering', 'typ': 'bet', 'publicerad': '2022-10-20 09:12:21', 'dokument_url_html': 'http://data.riksdagen.se/dokument/HA01AU1'}' 
Text: 'I en första de l av resultatredovisningen beskrivs utvecklingen inom 
områdena arbetsmarknad, utbild' 
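The chunk IDs above follow the `<parent_id>-<index>` pattern. To make the idea of chunking with stable IDs concrete, here is a dependency-free sketch that uses a plain sliding window. It is a simplified stand-in, not the separator-aware algorithm of `RecursiveCharacterTextSplitter`. The function name `naive_chunks` and the `chunk_size` and `overlap` values are assumptions.

```python
from typing import List, Tuple


def naive_chunks(
    parent_id: str, text: str, chunk_size: int = 1000, overlap: int = 100
) -> List[Tuple[str, str]]:
    """Split text into fixed-size windows that overlap, with IDs derived from the parent."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [
        (f"{parent_id}-{idx}", text[start : start + chunk_size])
        for idx, start in enumerate(range(0, len(text), step))
    ]


for chunk_id, chunk_text in naive_chunks("HA01AU1", "abcdefghij", chunk_size=4, overlap=1):
    print(chunk_id, repr(chunk_text))
```

Each window starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share `overlap` characters of context.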




### 3. Load into Pinecone

This step consists of the following pipeline:
 1. Create our Pinecone instance + index
 2. Embed our chunks
 3. Upload to Pinecone

In [181]:
from dotenv import load_dotenv

load_dotenv()

True

In [226]:
OPENAI_EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIM = 1536  # Smaller than the model's default of 3072; requested via the `dimensions` parameter
INDEX_NAME = "chie-rag"

**I. Create/Get Pinecone Index**

In [266]:
from pinecone import Pinecone
import os

pc = Pinecone(
    api_key=os.environ["PINECONE_API_KEY"],
)

index = pc.Index(INDEX_NAME)
print("Index stats: \n", index.describe_index_stats())


Index stats: 
 {'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
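The index reports a dimension of 1536, so every vector we upload has to have exactly that length. A minimal guard that could run before the upload step is sketched below. The helper name `find_dimension_mismatches` is hypothetical.

```python
from typing import List, Sequence

EXPECTED_DIM = 1536  # Matches the 'dimension' reported by the index stats


def find_dimension_mismatches(
    vectors: Sequence[Sequence[float]], expected_dim: int = EXPECTED_DIM
) -> List[int]:
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vector in enumerate(vectors) if len(vector) != expected_dim]


# Toy input: the second vector has the wrong size
print(find_dimension_mismatches([[0.0] * 1536, [0.0] * 3072]))
```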


**II. Embed Docs**

In [272]:
from openai import OpenAI
from notebooks.models.retrieval import Document, Embedding
from typing import List

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def embed_docs(openai: OpenAI, docs: List[Document]) -> List[Embedding]:
    # Embed all chunk texts in one API call
    doc_strings = [doc.text for doc in docs]
    embeddings = openai.embeddings.create(
        model=OPENAI_EMBEDDING_MODEL,
        dimensions=EMBEDDING_DIM,
        input=doc_strings,
    )
    # Pair each returned embedding with the chunk it was computed from
    domain_embeddings = [
        Embedding(embedding=embedding.embedding, document=doc)
        for doc, embedding in zip(docs, embeddings.data)
    ]
    return domain_embeddings


embeddings = embed_docs(openai_client, document_chunks)

Now insert the docs into Pinecone

In [276]:
from typing import Iterable, TypeVar, Tuple

BATCH_SIZE = 100

T = TypeVar("T")


def yield_iterator(items: List[T], batch_size: int) -> Iterable[List[T]]:
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]


def to_pinecone_tuple(embedding: Embedding) -> Tuple[str, List[float], dict]:
    metadata = {
        "text": embedding.document.text,
        **embedding.document.metadata.data,
    }
    return (
        embedding.document.id,
        embedding.embedding,
        metadata,
    )


for batch in yield_iterator(embeddings, BATCH_SIZE):
    print(f"Upserting batch of {len(batch)} items.")
    vectors = [to_pinecone_tuple(embedding) for embedding in batch]
    
    index.upsert(vectors=vectors, namespace="riksdagen-test")

Upserting batch of 24 items.
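Because the full chunk text is stored in the metadata, it is worth keeping an eye on the payload size. As far as I know, Pinecone caps metadata at 40 KB per vector; treat that number as an assumption and check the current documentation. A rough sketch for measuring the serialized size follows. The helper name and the limit constant are hypothetical.

```python
import json

ASSUMED_METADATA_LIMIT_BYTES = 40 * 1024  # Assumption: verify against the Pinecone docs


def metadata_size_bytes(metadata: dict) -> int:
    """Approximate the payload size as the UTF-8 length of the JSON serialization."""
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))


example = {"text": "Utskottets överväganden", "doc_id": "HA01AU1"}
size = metadata_size_bytes(example)
print(size, size <= ASSUMED_METADATA_LIMIT_BYTES)
```

Swedish characters such as `å`, `ä` and `ö` take two bytes each in UTF-8, so the byte count is larger than the character count.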


Verify that the vectors got upserted

In [277]:
print(f"Updated index stats: \n{index.describe_index_stats()}")

Updated index stats: 
{'dimension': 1536,
 'index_fullness': 0.00024,
 'namespaces': {'riksdagen-test': {'vector_count': 24}},
 'total_vector_count': 24}


### 4. Test the RAG

Good. Now, let's try to query our vector store, and later build a Generative Q&A.

**I. Try some queries**

In [301]:
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Initialize a LangChain embedding object
embeddings = OpenAIEmbeddings(
    openai_api_key=os.environ["OPENAI_API_KEY"],  # The key belongs here, not in openai_api_type
    model=OPENAI_EMBEDDING_MODEL,
    dimensions=EMBEDDING_DIM,
)

# Initialize the LangChain vector store
vectorstore = PineconeVectorStore(
    index_name=INDEX_NAME,
    namespace="riksdagen-test",
    embedding=embeddings,
    text_key="text",  # The original text is stored in the metadata under the key "text"
)

In [304]:
query = "Vilka skrev motioner till betänkandet?"
top_k = 3

# Get the top_k most similar documents
docs = vectorstore.similarity_search(
    query=query,
    k=top_k
)
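Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below does this by brute force with cosine similarity on toy vectors. Whether the index actually uses the cosine metric is an assumption, and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query_vec: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(vec_id, cosine_similarity(query_vec, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_similar([1.0, 0.1], toy_index, k=2))
```

A real vector database avoids scoring every vector by using an approximate nearest-neighbour index, but the ranking idea is the same.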

**II. Build a Q&A RAG**

In [323]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate


retriever = vectorstore.as_retriever()
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""
)

llm = ChatOpenAI(
    api_key=os.environ["OPENAI_API_KEY"], model="gpt-3.5-turbo", temperature=0.0
)


def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
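Stripped of LangChain, the chain is function composition: retrieve the chunks, join them into a context string, fill in the prompt template, then call the model. Below is a dependency-free sketch of the first three steps. `FakeDoc` and `fake_retriever` are hypothetical stand-ins, and the template is a shortened version of the prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document with a page_content attribute."""
    page_content: str


TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"


def format_docs(docs: List[FakeDoc]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question: str, retriever: Callable[[str], List[FakeDoc]]) -> str:
    # The same question goes to the retriever and into the template
    context = format_docs(retriever(question))
    return TEMPLATE.format(question=question, context=context)


def fake_retriever(question: str) -> List[FakeDoc]:
    # A real retriever would embed the question and query the vector store
    return [FakeDoc("chunk one"), FakeDoc("chunk two")]


prompt_text = build_prompt("What is this report about?", fake_retriever)
print(prompt_text)
```

The model only ever sees the retrieved chunks, so it cannot reliably answer questions about what the vector store contains as a whole.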

In [327]:
rag_chain.invoke("Vilka betänkanden har du åtkomst till?")

[Document(page_content='5  2022 /23:AU1  \nI utskottens uppgifter ingår att följa upp och utvärdera riksdagsbeslut \n(4 kap. 8 § regeringsformen). Som en del i utskottens uppföljning ingår att \nbehandla den resultatinformation som regeringen presenterar. Riksdagen har \nbeslutat om riktlinj er för bl.a. den löpande uppföljningen av regeringens \nresultatredovisning (framst. 2005/06:RS3, bet. 2005/06:KU21, rskr. 2005/06: \n333–335).  \nUtskottet har mot den bakgrunden gått igenom regeringens resultat -\nredovisning för utgiftsområde 13 i budgetpropositi onen. Genomgången är ett \nunderlag för utskottets behandling av budgetpropositionen och för den \nfortsatta mål - och resultatdialogen med regeringen.  \nBetänkandets disposition  \nBetänkandet har disponerats så att regeringens resultatredovisning behandlas \nförst. Därefter behandlar utskottet de förslag i budgetpropositionen och de \nmotionsförslag som gäller statens budget inom utgiftsområde 13.', metadata={'doc_id': 'HA01AU1', 'do

'Jag har åtkomst till betänkandena 2022/23:AU1, 2022/23:FiU1 och 2022/23:1273. Utskottet behandlar regeringens resultatredovisning, budgetpropositionen och motionsförslag inom utgiftsområde 13. Regeringen bedömer utvecklingen mot målen för jämställdhetspolitiken och diskriminering.'