<a href="https://colab.research.google.com/github/jmcinern/Oireachtas_RAG/blob/main/oireachtas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Oireachtas

**Project by Joseph McInerney.**
- This project aims to let Irish citizens easily access relevant primary-source political information, with minimal framing, so that users can come to their own conclusions.
- The aim of the project is to provide the following functionality:
  - Oireachtas speeches are collected and stored along with the TD who made them.
  - The speeches are stored in a vector database that maps them into semantic space (speeches covering similar topics are closer together).
  - A Large Language Model (LLM) is used to query this database, responding with summaries of TDs' positions on issues, with direct quotes from the Oireachtas database.




# Requirements

In [None]:
!pip install -qU bitsandbytes datasets accelerate loralib chromadb peft gradio thefuzz[speedup] langchain openai langchain_community


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
import requests # for getting web data
import xml.etree.ElementTree as ET # for easy XML parsing
from datetime import datetime, timedelta # for knowing debate url search window
from tqdm import tqdm # for tracking progress
import concurrent.futures # for parallel processing
import chromadb # vector database
from google.colab import drive # for storing vector db
import pickle
import os
from sentence_transformers import SentenceTransformer # embedding
import transformers # for hugging face model
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import torch.nn as nn
import pandas as pd
from thefuzz import process
import os # user-input API key
import openai # gpt
import langchain # for prompt template and few-shot
# allows for roles and system messages unlike simple PromptTemplate and integrates with few-shot
from langchain.prompts import (FewShotChatMessagePromptTemplate, ChatPromptTemplate, PromptTemplate)
from langchain.schema import HumanMessage # separate examples cleanly
from getpass import getpass # for API key
from langchain.chat_models import ChatOpenAI # to initialise model

In [None]:
project_fpath = r"/content/drive/MyDrive/Oireachtas_RAG/"

In [None]:
# getOpenAI key
os.environ['OPENAI_KEY'] = getpass('Enter your OpenAI API key: ')
openai.api_key = os.getenv('OPENAI_KEY')


# Data Collection

**Code that accesses Oireachtas debate records and parses XML to store info.**

## Debate URLs



*   **Generate a list of URLs for Oireachtas debates within a given time frame**
*   **Example: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-05-03/debate/mul@/main.xml**





In [None]:
# Define the namespace for XML parsing
NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"}

# Function to generate URLs, default: for the last month
def get_XML_urls(number_of_days=30):
    urls = []
    today = datetime.today()
    start_date = today - timedelta(number_of_days)  # start of the search window

    for i in tqdm(range(number_of_days)):
        # URL format: https://data.oireachtas.ie/akn/ie/debateRecord/dail/YYYY-MM-DD/debate/mul@/main.xml
        date_str = (start_date + timedelta(days=i)).strftime("%Y-%m-%d")
        url = f"https://data.oireachtas.ie/akn/ie/debateRecord/dail/{date_str}/debate/mul@/main.xml"
        urls.append(url)

    return urls

In [None]:
# Generate URLs for the last set amount of days
number_of_days = 3650 # ~10 years
debate_urls = get_XML_urls(number_of_days)

100%|██████████| 3650/3650 [00:00<00:00, 255869.93it/s]


## Fetching Speeches

- Given each URL there may be multiple speeches by multiple TDs.
- So store each speech with the relevant TD as well.

* Extracts the debate language:
It looks for the `<FRBRlanguage>` element (using the same namespace as the other elements) and reads its "language" attribute. In the example XML, the language value is "eng". The code then maps that to a two-letter code (e.g. "eng" becomes "en"; others could be mapped too, such as "gle" to "ga").

* Counts the total number of words in the debate:
For each speech extracted, it joins all paragraph texts and then splits the text by spaces to count words. All these counts are summed to give the overall word count.

* Returns a dictionary with the debate language, word count, and the list of speeches.

* Each speech is stored as a dictionary with keys "speaker", "text" and "url".
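
The function in the next cell only returns the list of speeches. As a minimal sketch of the language and word-count steps described above, something like the following could work. The helper name `summarise_debate`, the `LANGUAGE_CODES` mapping, and the trimmed `SAMPLE_XML` are assumptions made for illustration; real debate records are much larger. It uses `str.split()`, which splits on any whitespace rather than on single spaces only.

```python
import xml.etree.ElementTree as ET

NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"}
LANGUAGE_CODES = {"eng": "en", "gle": "ga"}  # three-letter to two-letter codes

# Hypothetical, heavily trimmed debate record used only for illustration.
SAMPLE_XML = """<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13">
  <debate>
    <meta><identification><FRBRExpression>
      <FRBRlanguage language="eng"/>
    </FRBRExpression></identification></meta>
    <debateBody>
      <speech by="#SpeakerA"><p>First paragraph here.</p><p>Second one.</p></speech>
      <speech by="#SpeakerB"><p>A short reply.</p></speech>
    </debateBody>
  </debate>
</akomaNtoso>"""

def summarise_debate(xml_text):
    """Return the debate language, total word count and list of speeches."""
    root = ET.fromstring(xml_text)

    # Read the "language" attribute and map it to a two-letter code if known
    lang_el = root.find(".//akn:FRBRlanguage", namespaces=NS)
    raw_lang = lang_el.get("language") if lang_el is not None else None
    language = LANGUAGE_CODES.get(raw_lang, raw_lang)

    speeches = []
    word_count = 0
    for speech in root.findall(".//akn:speech", namespaces=NS):
        paragraphs = [p.text.strip() for p in speech.findall(".//akn:p", namespaces=NS) if p.text]
        if paragraphs:
            text = " ".join(paragraphs)
            word_count += len(text.split())  # whitespace-separated words
            speaker = speech.get("by", "Unknown Speaker").strip("#")
            speeches.append({"speaker": speaker, "text": text})

    return {"language": language, "word_count": word_count, "speeches": speeches}

summary = summarise_debate(SAMPLE_XML)
print(summary["language"], summary["word_count"], len(summary["speeches"]))
```

Keeping the language next to each debate would make it possible to filter out or separately embed Irish-language speeches later.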

In [None]:
SPEAKERS = set() # set to match user query and data base speaker name ID
def fetch_and_extract_speeches(url):
    try:
        response = requests.get(url, timeout=30)  # avoid a stalled request blocking a worker thread

        # debate found
        if response.status_code == 200:

            root = ET.fromstring(response.content)
            speeches = []

            # Extract all <speech> elements
            for speech in root.findall(".//akn:speech", namespaces=NS):

                speaker = speech.get("by", "Unknown Speaker").strip("#")  # Extract speaker ID
                paragraphs = [p.text.strip() for p in speech.findall(".//akn:p", namespaces=NS) if p.text]

                if paragraphs:
                    full_text = " ".join(paragraphs)
                    speeches.append({"speaker": speaker, "text": full_text, "url": url})
                    SPEAKERS.add(speaker)

            return speeches
        else:
            return None

    except Exception as e:
        print(f"Error processing {url}: {e}")
        return None

- Use parallel processing (a thread pool) to fetch the URLs.
- Roughly one minute for 10 years' worth of data (58 seconds in the run below).

In [None]:
import concurrent.futures
from tqdm import tqdm

def fetch_speeches_parallel(urls):
    all_speeches = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        results = list(tqdm(executor.map(fetch_and_extract_speeches, urls)))
        for result in results:
            if result is not None:
                all_speeches.extend(result)
    return all_speeches

# Usage:
all_speeches = fetch_speeches_parallel(debate_urls)
print(all_speeches[0])
print(f"Number of speeches {len(all_speeches)} from the last {number_of_days} days")


# write SPEAKERS to txt file
with open(project_fpath+"speakers.txt", "w") as f:
    for speaker in SPEAKERS:
        f.write(f"{speaker}\n")

3650it [00:58, 62.53it/s] 


{'speaker': 'SeanOFearghaillFF', 'text': "I have the unenviable task of standing in for the inimitable Deputy O'Dea. This question seeks to explore with the Tánaiste what plans, if any, she has to extend access to social welfare benefits to the self-employed. The question is posed against the background of all of us in this House wishing to see the indigenous sector develop. We see access to welfare benefits as part of that necessary change.", 'url': 'https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-05-06/debate/mul@/main.xml'}
Number of speeches 375685 from the last 3650 days


# Vector Database

## Embedding

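- Chroma's default embedding model is all-MiniLM-L6-v2, which produces 384-dimensional embeddings and was trained with a cosine-similarity objective.
- Chroma's default distance is Euclidean (l2), which this comparison found to work best for RAG with Chroma: https://medium.com/@stepkurniawan/comparing-similarity-searches-distance-metrics-in-vector-stores-rag-model-f0b3f7532d6f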
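
The two bullets above fit together more easily than they might seem. If the embeddings are unit-length (an assumption here; it can be checked by calling `np.linalg.norm` on the output of `model.encode`), squared Euclidean distance and cosine similarity are tied by the identity ||a - b||² = 2 - 2·cos(a, b), so both give the same ranking of neighbours. The sketch below checks this with random vectors standing in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalise(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Random stand-ins for one query embedding and five speech embeddings (384-dim)
query = normalise(rng.normal(size=384))
docs = normalise(rng.normal(size=(5, 384)))

cosine_sim = docs @ query                          # higher means more similar
squared_l2 = np.sum((docs - query) ** 2, axis=1)   # lower means more similar

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(squared_l2, 2 - 2 * cosine_sim)

# Sorting by ascending distance matches sorting by descending similarity
same_ranking = np.array_equal(np.argsort(squared_l2), np.argsort(-cosine_sim))

print(identity_holds, same_ranking)
```

For vectors that are not unit-length the two metrics can rank neighbours differently, which is when the choice of distance starts to matter.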

In [None]:
import uuid
from sentence_transformers import SentenceTransformer
from chromadb.utils.batch_utils import create_batches

# ─── 0) Set up your Chroma client ───────────────────────────────────────────────
# (PersistentClient writes to disk; for ephemeral/in-memory use chromadb.Client())
chroma_client = chromadb.PersistentClient(
    path=project_fpath + "/debate_db"    # wherever you want the on-disk store
)
collection = chroma_client.get_or_create_collection("oireachtas_debates")

if collection.count()==0:
  # ─── 1) load your embedding model ───────────────────────────────────────────────
  model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

  # ─── 2) build your parallel lists ───────────────────────────────────────────────
  ids       = [str(uuid.uuid4()) for _ in all_speeches]
  texts     = [s["text"] for s in all_speeches]
  metadatas = [
      {"speaker": s["speaker"], "url": s["url"], "text": s["text"]}
      for s in all_speeches
  ]

  # ─── 3) encode everything at once (or in large chunks) ─────────────────────────
  embeddings = model.encode(texts, batch_size=512, show_progress_bar=True)
  # model.encode returns a numpy array but create_batches() expects a list
  embeddings = embeddings.tolist()

  # ─── 4) batch-slice to avoid SQLite param limits ────────────────────────────────
  batches = create_batches(
      api=chroma_client,
      ids=ids,
      embeddings=embeddings,
      metadatas=metadatas,
      documents=texts,
  )

  # ─── 5) fire each super-chunk into Chroma ──────────────────────────────────────
  collection = chroma_client.get_or_create_collection("oireachtas_debates")
  for ids_b, embs_b, metas_b, docs_b in tqdm(batches):
      print(f"Adding batch of {len(ids_b)} docs …")
      collection.add(
          ids=ids_b,
          embeddings=embs_b,
          metadatas=metas_b,
          documents=docs_b,
      )


In [None]:
chroma_client = chromadb.PersistentClient(
    path=project_fpath + "/debate_db"
)
collection = chroma_client.get_collection("oireachtas_debates")

## RAG

### **Retrieval**


*   Fetch the top k utterances by a speaker on a topic using the vector DB.



In [None]:
def search_speaker_position(speaker_name, topic, num_results=5):
    # Use ChromaDB's query functionality with a `where` clause to filter by speaker
    print("Looking for relevant utterances")
    results = collection.query(
        query_texts=[topic],
        n_results=num_results,
        where={"speaker": speaker_name},
        include=["metadatas"]
    )
    print("Done looking for relevant utterances")
    print(results)
    # Check if any results were found
    if not results['metadatas'][0]:  # ChromaDB returns a list of lists
        return f"No speeches found for {speaker_name} talking about {topic} in the dataset."

    # Extract and format the results
    output = f"\n### {speaker_name}'s Position on '{topic}':\n"
    for i, metadata in enumerate(results['metadatas'][0]):
        output += f"\n **Quote {i+1} (debate url: {metadata['url']}):** {metadata['text'][:500]}...\n"

    return output
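
The few-shot examples in the next section show a `year:` field next to each debate URL, which the function above does not emit. A minimal sketch of a formatter that parses the year out of the URL is given below. The names `year_from_url` and `format_quotes` are hypothetical, and `sample_results` is a hand-built dictionary in the shape that `collection.query` returns, using the first quote from this notebook's own output.

```python
import re

def year_from_url(url):
    """Extract YYYY from .../debateRecord/dail/YYYY-MM-DD/... or return 'unknown'."""
    match = re.search(r"/dail/(\d{4})-\d{2}-\d{2}/", url)
    return match.group(1) if match else "unknown"

def format_quotes(results, speaker_name, topic, max_chars=500):
    """Format Chroma-style query results as numbered quotes with URL and year."""
    metadatas = results["metadatas"][0]  # one inner list per query text
    lines = [f"\n### {speaker_name}'s Position on '{topic}':\n"]
    for i, metadata in enumerate(metadatas, start=1):
        url = metadata["url"]
        lines.append(
            f"\n**Quote {i} (debate url: {url}, year: {year_from_url(url)}):** "
            f"\"{metadata['text'][:max_chars]}\"\n"
        )
    return "".join(lines)

# Hand-built stand-in for a collection.query(...) result
sample_results = {
    "metadatas": [[
        {
            "speaker": "LeoVaradkar",
            "url": "https://data.oireachtas.ie/akn/ie/debateRecord/dail/2019-10-02/debate/mul@/main.xml",
            "text": "The reason we have a housing crisis in this country-----",
        }
    ]]
}

formatted = format_quotes(sample_results, "LeoVaradkar", "The housing crisis")
print(formatted)
```

Giving the model the year explicitly means it does not have to read it out of the URL itself when citing a quote.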

#### Few-shot prompt template - Langchain

In [None]:
examples=[
    #1
    {
  "question": "Summarize MicheálMartin's position on healthcare, weaving in short quotes placed within quotation marks from the reference material.\n\n### MicheálMartin's Position on 'healthcare':\n\n**Quote 1 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2011-05-03/debate/mul@/main.xml, year: 2011):** \"Healthcare must be a right, not a privilege, ensuring no citizen is denied essential treatment because of cost or location.\"\n\n**Quote 2 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2012-05-03/debate/mul@/main.xml, year: 2012):** \"Investment in modern hospital infrastructure is a critical pillar for equitable access to services across Ireland.\"\n\n**Quote 3 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-05-03/debate/mul@/main.xml, year: 2013):** \"Primary care should serve as the bedrock of our health system, treating issues early and reducing reliance on emergency departments.\"\n\n**Quote 4 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2014-05-03/debate/mul@/main.xml, year: 2014):** \"Strengthening Ireland's digital economy is key to attracting global investment and future-proofing our industries.\" (irrelevant)\n\n**Quote 5 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-05-03/debate/mul@/main.xml, year: 2015):** \"Climate action must be deeply integrated into all areas of policy, including education, housing, and agriculture.\" (irrelevant)",

  "answer": "Micheál Martin emphasizes that 'healthcare must be a right, not a privilege' (2011, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2011-05-03/debate/mul@/main.xml). He highlights 'investment in modern hospital infrastructure' as essential (2012, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2012-05-03/debate/mul@/main.xml) and asserts that 'primary care should serve as the bedrock of our health system' (2013, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-05-03/debate/mul@/main.xml)."
    },
    #2
    {
  "question": "Summarize LeoVaradkar's statements on economic recovery, interweaving short quotations from the reference text.\n\n### LeoVaradkar's Position on 'economic recovery':\n\n**Quote 1 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-06-04/debate/mul@/main.xml, year: 2016):** \"Recovery must be fair and inclusive, lifting every household that bore the brunt of austerity.\"\n\n**Quote 2 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2017-06-04/debate/mul@/main.xml, year: 2017):** \"Direct supports for small businesses will drive sustainable growth across Ireland.\"\n\n**Quote 3 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2018-06-04/debate/mul@/main.xml, year: 2018):** \"Promoting cycling in cities is central to tackling traffic congestion.\" (irrelevant)\n\n**Quote 4 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2019-06-04/debate/mul@/main.xml, year: 2019):** \"Fiscal prudence ensures we do not burden future generations with today's mistakes.\"\n\n**Quote 5 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2020-06-04/debate/mul@/main.xml, year: 2020):** \"Strong ties with the European Union strengthen Ireland’s economic resilience.\" (irrelevant)",

  "answer": "Leo Varadkar insists that 'recovery must be fair and inclusive' (2016, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-06-04/debate/mul@/main.xml) and stresses that 'direct supports for small businesses will drive sustainable growth' (2017, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2017-06-04/debate/mul@/main.xml). He further notes that 'fiscal prudence ensures we do not burden future generations' (2019, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2019-06-04/debate/mul@/main.xml)."
    },
    #3
    {
  "question": "Summarize MaryLouMcDonald's comments on housing, blending brief quotes from the reference material into the summary.\n\n### MaryLouMcDonald's Position on 'housing':\n\n**Quote 1 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-07-10/debate/mul@/main.xml, year: 2015):** \"The housing crisis is not accidental; it is the direct result of policy choices that neglected social needs.\"\n\n**Quote 2 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-07-10/debate/mul@/main.xml, year: 2016):** \"A massive programme of public housing construction is essential to meet demand.\"\n\n**Quote 3 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2017-07-10/debate/mul@/main.xml, year: 2017):** \"Strengthening the rural broadband network will unlock opportunities across Ireland.\" (irrelevant)\n\n**Quote 4 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2018-07-10/debate/mul@/main.xml, year: 2018):** \"Rents are out of control and the dream of homeownership is slipping away for an entire generation.\"\n\n**Quote 5 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2019-07-10/debate/mul@/main.xml, year: 2019):** \"Political will is needed to solve this, not more handouts to developers.\"",

  "answer": "Mary Lou McDonald argues that 'the housing crisis is not accidental' (2015, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-07-10/debate/mul@/main.xml). She demands that 'a massive programme of public housing construction is essential' (2016, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-07-10/debate/mul@/main.xml), warns that 'rents are out of control' (2018, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2018-07-10/debate/mul@/main.xml), and stresses that 'political will is needed to solve this' (2019, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2019-07-10/debate/mul@/main.xml)."
    },
    #4
    {
  "question": "Summarize EamonRyan's views on climate policy, incorporating brief excerpts from the reference material.\n\n### EamonRyan's Position on 'climate policy':\n\n**Quote 1 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-04-15/debate/mul@/main.xml, year: 2013):** \"Climate action is no longer optional; it is the defining issue for this generation.\"\n\n**Quote 2 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2014-04-15/debate/mul@/main.xml, year: 2014):** \"Renewable energy must form the cornerstone of any credible climate strategy.\"\n\n**Quote 3 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-04-15/debate/mul@/main.xml, year: 2015):** \"Strengthening community policing initiatives enhances public safety.\" (irrelevant)\n\n**Quote 4 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-04-15/debate/mul@/main.xml, year: 2016):** \"Protecting Ireland’s biodiversity is vital for environmental and economic sustainability.\"\n\n**Quote 5 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2017-04-15/debate/mul@/main.xml, year: 2017):** \"A just transition must ensure that no worker or community is left behind.\"",

  "answer": "Eamon Ryan declares that 'climate action is no longer optional' (2013, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-04-15/debate/mul@/main.xml) and insists that 'renewable energy must form the cornerstone' (2014, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2014-04-15/debate/mul@/main.xml). He emphasizes that 'protecting Ireland’s biodiversity is vital' (2016, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-04-15/debate/mul@/main.xml) and concludes that 'a just transition must ensure no community is left behind' (2017, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2017-04-15/debate/mul@/main.xml)."
    },
    #5
    {
  "question": "Summarize MicheálMartin's position on education reform, using short quotes from the reference material.\n\n### MicheálMartin's Position on 'education reform':\n\n**Quote 1 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2012-11-25/debate/mul@/main.xml, year: 2012):** \"Every child, regardless of background, deserves a world-class education.\"\n\n**Quote 2 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-11-25/debate/mul@/main.xml, year: 2013):** \"Investment in schools and teachers is critical for equality of opportunity.\"\n\n**Quote 3 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2014-11-25/debate/mul@/main.xml, year: 2014):** \"Robust transport links are vital to regional economic growth.\" (irrelevant)\n\n**Quote 4 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-11-25/debate/mul@/main.xml, year: 2015):** \"Education must not end at school — lifelong learning must be embraced.\"\n\n**Quote 5 (debate url: https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-11-25/debate/mul@/main.xml, year: 2016):** \"Access to higher education must not be determined by parental wealth.\"",

  "answer": "Micheál Martin affirms that 'every child deserves a world-class education' (2012, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2012-11-25/debate/mul@/main.xml). He argues that 'investment in schools and teachers is critical' (2013, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2013-11-25/debate/mul@/main.xml), calls for embracing 'lifelong learning' (2015, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2015-11-25/debate/mul@/main.xml), and stresses that 'access to higher education must not be determined by parental wealth' (2016, https://data.oireachtas.ie/akn/ie/debateRecord/dail/2016-11-25/debate/mul@/main.xml)."
    }
]


In [None]:
example_prompt = ChatPromptTemplate.from_messages(
[('human', '{question}?'), ('ai', '{answer}\n')]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
)
full_prompt = ChatPromptTemplate.from_messages([
  ("system", "You are an Irish parliament chatbot. Users will ask you questions about politicians' opinions on topics, and you will provide summaries of their positions with reference to the quotes you are given. You will not make anything up, and you will not add quotes that are not relevant to the topic. These quotes have corresponding URLs; you will cite each quote using its URL and the year parsed from the URL, as shown in the examples. You will value accuracy over plausibility."),
  few_shot_prompt,
  ("human", "{question}"),
])

# mini = bigger than nano
gpt_mini = ChatOpenAI(
    model_name="gpt-4.1-mini",
    temperature=0.9,
    openai_api_key=openai.api_key)

chain_mini = full_prompt | gpt_mini
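
Conceptually, the prompt built above reaches the model as one flat list of messages: the system message, then a human/ai pair per example, then the real question. The plain-Python sketch below mirrors that expansion without LangChain, so it is only an illustration of the structure. `expand_few_shot` and `toy_examples` are hypothetical names and data, not part of the LangChain API.

```python
def expand_few_shot(system_message, examples, question):
    """Build the (role, content) list that a few-shot chat prompt expands to."""
    messages = [("system", system_message)]
    for example in examples:
        # Same per-example template as example_prompt above:
        # '{question}?' for the human turn and '{answer}\n' for the ai turn
        messages.append(("human", f"{example['question']}?"))
        messages.append(("ai", f"{example['answer']}\n"))
    messages.append(("human", question))
    return messages

# Toy placeholders standing in for the five full examples
toy_examples = [
    {"question": "Summarize SpeakerA's position on topic X", "answer": "SpeakerA says '...' (2015, <url>)."},
    {"question": "Summarize SpeakerB's position on topic Y", "answer": "SpeakerB says '...' (2016, <url>)."},
]

messages = expand_few_shot(
    "You are an Irish parliament chatbot.",
    toy_examples,
    "Summarise SpeakerC's position on the topic: Z",
)
for role, content in messages:
    print(f"{role}: {content!r}")
```

Every call therefore pays the token cost of all examples, which is worth keeping in mind given how long each example question is.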



In [None]:
def speaker_fuzzy_lookup(speaker, speaker_list):
  # fuzzy lookup to get best match of speaker in list
  best_match = process.extractOne(speaker, speaker_list)
  return best_match[0]
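
With its default settings, `process.extractOne` returns the best-scoring candidate however weak the match is, so a misspelt or unknown name still maps to some TD. The sketch below shows the same lookup idea with a rejection threshold. It swaps in `difflib` from the standard library in place of thefuzz so that it runs without extra packages; `closest_speaker`, the `cutoff` value, and the short `SPEAKER_IDS` list are assumptions for illustration.

```python
import difflib

# A few speaker IDs in the style stored in speakers.txt (illustrative subset)
SPEAKER_IDS = ["LeoVaradkar", "SeanOFearghaillFF", "MicheálMartin"]

def closest_speaker(query, speaker_ids, cutoff=0.6):
    """Return the ID closest to the query, or None if nothing is similar enough."""
    # Compare in lower case and without spaces, since IDs look like "LeoVaradkar"
    by_key = {speaker_id.lower(): speaker_id for speaker_id in speaker_ids}
    key = query.replace(" ", "").lower()
    matches = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[matches[0]] if matches else None

print(closest_speaker("Leo Varadkar", SPEAKER_IDS))
print(closest_speaker("sean o fearghaill", SPEAKER_IDS))
print(closest_speaker("xyz", SPEAKER_IDS))
```

Returning `None` for poor matches lets the caller tell the user that the speaker was not found instead of summarising the wrong person.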

### **Augmented Generation**

*  Generate a summary of the speaker's position on a topic with reference to their top k quotes.



In [None]:
def generate_answer(speaker_name, topic, list_of_speakers, num_results=5):
    '''
    Takes the user's speaker and topic, finds the closest match in the speaker list,
    then generates a summary of the speaker's position on the topic from the retrieved debates.
    '''
    # Map the user's spelling onto a speaker ID that exists in the database
    speaker_name = speaker_fuzzy_lookup(speaker_name, list_of_speakers)

    # Retrieve relevant quotes (num_results was previously ignored in favour of a hard-coded 5)
    retrieved_text = search_speaker_position(speaker_name, topic, num_results)
    print(retrieved_text)

    if "No speeches found" in retrieved_text or "No relevant quotes found" in retrieved_text:
        return retrieved_text  # No results found, return directly

    response_mini = chain_mini.invoke({"question": f"Summarise {speaker_name}'s position on the topic: {topic}. Use the following quotes as reference: {retrieved_text}", "answer": ""})

    return response_mini.content

In [None]:
# Example Usage
speaker = "LeoVaradkar"
topic = "The housing crisis"

#read speakers.txt to get list of speakers
with open(project_fpath+"speakers.txt", "r") as f:
    SPEAKERS = f.read().splitlines()


torch.cuda.empty_cache()
final_answer = generate_answer(speaker, topic,  SPEAKERS)
print(final_answer)

Looking for relevant utterances


/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 33.3MiB/s]


Done looking for relevant utterances
{'ids': [['4a715fdc-d084-481b-9dd9-d9b32f9ef2d2', '6de3e6f4-6546-4917-bae9-cd5cb6905ed3', 'e1829429-1d8d-4156-8c63-cf64a16eeff4', '8a3f2bf8-228f-43d3-8fd1-f39165dcfc77', '5e1528e5-81da-4022-94be-3b76c935c16a']], 'embeddings': None, 'documents': None, 'uris': None, 'included': ['metadatas'], 'data': None, 'metadatas': [[{'text': 'The reason we have a housing crisis in this country-----', 'speaker': 'LeoVaradkar', 'url': 'https://data.oireachtas.ie/akn/ie/debateRecord/dail/2019-10-02/debate/mul@/main.xml'}, {'speaker': 'LeoVaradkar', 'url': 'https://data.oireachtas.ie/akn/ie/debateRecord/dail/2023-04-25/debate/mul@/main.xml', 'text': '-----or the Government for it. Both those analyses are far too simplistic. However, I have criticised people for objecting to housing developments. I will continue to do that. It is clear that we cannot fix the housing crisis without increased supply of all types of housing.'}, {'speaker': 'LeoVaradkar', 'url': 'https://

# Web App

In [None]:
import gradio as gr

def ask_about_speaker(speaker_name, topic):
    answer = generate_answer(speaker_name, topic, SPEAKERS)
    return answer

In [None]:
share=True
iface = gr.Interface(
    fn=ask_about_speaker,
    inputs=[
        gr.Textbox(label="Speaker Name", placeholder="e.g., Ivana Bacik"),
        gr.Textbox(label="Topic", placeholder="e.g., housing crisis")
    ],
    outputs=gr.Textbox(label="Summary"),
    title="Speaker Position Summarizer",
    description="Ask what a public figure has said on any topic, and get a summary with direct quotes.",
)

iface.launch(share=share)  # pass the flag set above so a public link is created


# PEFT - LoRA Fine-Tuning



*   Parameter-Efficient Fine-Tuning (PEFT)
*   Low-Rank Adaptation (LoRA)
*   Fine-tune mistral-7b for this specific task.
  * Freezes the 7B base weights and adds a pair of small low-rank matrices to each adapted layer, so only a small fraction of the parameters is trained.
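
To make the low-rank idea concrete, here is a small NumPy sketch of one LoRA-adapted linear layer: the frozen weight `W` is left untouched and the update is the product of two thin matrices `B` and `A`, scaled by `alpha / r`. The layer sizes are toy values chosen for illustration, and the 4096 x 4096 layer in the parameter count is an assumed example size, not read from the model.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and A, B trainable."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r, alpha = 64, 64, 4, 8          # toy sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01         # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init
x = rng.normal(size=(2, d_in))                # batch of two inputs

y = lora_forward(x, W, A, B, alpha, r)

# With B initialised to zero the adapted layer starts out identical to the base layer
starts_as_base_model = np.allclose(y, x @ W.T)

def lora_param_counts(d_in, d_out, r):
    """Return (parameters in the full weight, parameters in the LoRA adapter)."""
    return d_in * d_out, r * (d_in + d_out)

full, adapter = lora_param_counts(4096, 4096, 16)  # assumed layer size, r=16 as configured later
print(starts_as_base_model, full, adapter, f"{100 * adapter / full:.2f}%")
```

For a layer of that size the adapter holds under 1% of the parameters of the full weight matrix, which is what makes fine-tuning a 7B model feasible on a single Colab GPU.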




## Freeze the Original Weights (W)

In [None]:
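# NOTE: `model` must be the causal LM to fine-tune (loaded with AutoModelForCausalLM, with
# `tokenizer` from AutoTokenizer). The loading cell is not shown in this notebook, and the
# name `model` was last bound to the SentenceTransformer in the embedding section.
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later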
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

## Set up LoRA Adapters

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad: # count unfrozen params
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable %: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config_LoRA = LoraConfig(
    r=16, # rank of the low-rank update matrices
    lora_alpha=32, #alpha scaling
    # target_modules=["q_proj", "v_proj"], #if you know the
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

model_LoRA = get_peft_model(model, config_LoRA)
print_trainable_parameters(model_LoRA)

## Example Data

In [None]:
from datasets import Dataset
import pandas as pd

# list of example input output pairs, will use GPT-4o as standard.
# use a symbol that won't have been in the original

examples_df = pd.read_csv(project_fpath+"Oireachtas_Examples.csv")

# Create a Hugging Face Dataset from the DataFrame
train_dataset = Dataset.from_pandas(examples_df)

tokenizer.pad_token = tokenizer.eos_token

# Tokenize prompt and target together: a causal LM learns the answer by predicting
# the next token over the concatenated sequence.
def preprocess_function(examples):
    # Assumption: a newline separates the input (X) from the target (y)
    texts = [
        x + "\n" + y + tokenizer.eos_token
        for x, y in zip(examples['X'], examples['y'])
    ]
    # No "labels" here: DataCollatorForLanguageModeling(mlm=False) builds them from
    # input_ids, so separately tokenized targets would be overwritten
    return tokenizer(texts, truncation=True)

# Apply the preprocessing function to the dataset
tokenized_train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
)

# Now, use tokenized_train_dataset in your Trainer
trainer = transformers.Trainer(
    model=model_LoRA,
    train_dataset=tokenized_train_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=50,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model_LoRA.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()