<a href="https://colab.research.google.com/github/pk2971/computational-gender-analysis/blob/main/notebooks/Text_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from bs4 import BeautifulSoup
import zipfile
import re
from typing import Union, List

In [None]:
# ---------------------------
# Load + Split XML from ZIP
# ---------------------------
def extract_text_from_xml(xml_content: str) -> str:
    soup = BeautifulSoup(xml_content, "lxml-xml")
    return soup.get_text(separator=" ", strip=True)

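def load_and_split_xml_from_zip(
    zip_path: str,
    years: Union[int, List[int]],
    chunk_size: int = 1000,
    chunk_overlap: int = 200
):
    """Read the debate XML files for the given year(s) from a ZIP archive
    and split their text into overlapping chunks (sizes in characters)."""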
    if isinstance(years, int):
        years = [years]
    year_pattern = '|'.join(str(y) for y in years)

    with zipfile.ZipFile(zip_path, "r") as zip_file:
        matched_files = [
            f for f in zip_file.namelist()
            if re.search(rf'debates({year_pattern})-\d{{2}}-\d{{2}}a\.xml', f)
        ]

        all_texts = []
        for filename in matched_files:
            with zip_file.open(filename) as file:
                xml_content = file.read().decode("utf-8", errors="ignore")
                text = extract_text_from_xml(xml_content)
                all_texts.append(text)

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.create_documents(all_texts)
    return docs

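The filename filter above can be checked without the archive. The sketch below rebuilds the same regular expression and runs it against a few made-up member names; the helper `build_debate_pattern` and the paths in `names` are illustrative assumptions, not part of the notebook or the real ZIP layout.

```python
import re

def build_debate_pattern(years):
    """Compile the same filename filter for one year or a list of years."""
    if isinstance(years, int):
        years = [years]
    year_pattern = "|".join(str(y) for y in years)
    return re.compile(rf"debates({year_pattern})-\d{{2}}-\d{{2}}a\.xml")

# Hypothetical archive member names, for illustration only.
names = [
    "debates/debates2022-01-05a.xml",
    "debates/debates2022-01-05b.xml",  # 'b' suffix: not matched
    "debates/debates2021-12-14a.xml",  # year not requested: not matched
    "debates/debates2023-03-01a.xml",
]

pattern = build_debate_pattern([2022, 2023])
matched = [n for n in names if pattern.search(n)]
print(matched)
```

Only files whose date is followed by the `a` suffix pass the filter, so any later versions of the same sitting (for example a `b` file) are skipped.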

In [None]:
zip_file_path = "/content/drive/MyDrive/debates.zip"  # Replace with your path
docs = load_and_split_xml_from_zip(zip_file_path, 2022)

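To get a feel for what `chunk_size=1000` and `chunk_overlap=200` mean, here is a deliberately simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function `sliding_chunks` is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 2500-character input, for illustration only.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

With these defaults each chunk starts 800 characters after the previous one, so the final 200 characters of one chunk reappear at the start of the next. That repetition gives the retriever some context across chunk boundaries.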
In [None]:
%pip install chromadb



In [None]:
# ---------------------------
# Vector store setup
# ---------------------------
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(docs, embeddings)

# ---------------------------
# Load LLaMA-3 model
# ---------------------------
model_name = "meta-llama/Llama-3.2-3B-Instruct"
access_token = " "  # Replace with your Hugging Face access token

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=access_token,
    quantization_config=quantization_config,
    device_map="auto"
)

# temperature and top_k only take effect when sampling is enabled
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.2,
    top_k=50,
)
llm = HuggingFacePipeline(pipeline=pipe)

# ---------------------------
# Prompt Template + QA Chain
# ---------------------------
prompt_template = """
You are a helpful assistant trained to answer questions using excerpts from British parliamentary debate texts from the years provided.
Your goal is to extract or infer answers based strictly on the retrieved documents.

— If the answer is explicitly stated, quote the relevant parts.
— If it's implied (e.g. tone, sentiment, theme), explain your reasoning clearly.
— Do not repeat the same quote or sentence.
— If the information is not available, say: 'There is no information regarding this in the given text.'

Be accurate, concise, and clear in your response.
"""




Device set to use cuda:0


In [None]:
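def remove_duplicate_sentences(text):
    """Drop repeated sentences (case-insensitive exact match), keeping the first occurrence."""
    seen = set()
    # Split after sentence-ending punctuation. This also splits after
    # abbreviations such as "hon.", so a repeated "The hon." fragment is dropped.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    unique_sentences = []
    for s in sentences:
        cleaned = s.strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())  # case-insensitive comparison, not fuzzy matching
            unique_sentences.append(cleaned)
    return '\n'.join(unique_sentences)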

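def generate_response(query):
    # Pass the instructions through the chain's prompt so the retriever
    # embeds only the user's question, not the whole instruction block.
    from langchain.prompts import PromptTemplate
    qa_prompt = PromptTemplate.from_template(
        prompt_template + "\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vectorstore.as_retriever(),
        chain_type_kwargs={"prompt": qa_prompt},
    )
    response = qa_chain.run(query)
    # The pipeline output includes the prompt; keep only the text after "Answer:".
    response = response.split('Answer:')[1].strip() if 'Answer:' in response else response.strip()
    return remove_duplicate_sentences(response)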

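The deduplication step can be tried without loading the model. The sketch below is a compact, self-contained variant of `remove_duplicate_sentences` that returns a list instead of a joined string; the name `dedupe_sentences` and both sample strings are made up for illustration.

```python
import re

def dedupe_sentences(text):
    """Keep the first occurrence of each sentence, comparing case-insensitively."""
    seen = set()
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        cleaned = sentence.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept

# Exact repeats (ignoring case) are removed; near-duplicates are kept.
sample = (
    "The Bill protects women. the bill protects women. "
    "The Bill protects all women. Is it needed? Is it needed?"
)
result = dedupe_sentences(sample)
print(result)

# Caveat: the split also fires after abbreviations such as "hon.",
# so the second "The hon." fragment is treated as a duplicate and dropped.
caveat = dedupe_sentences("The hon. Member spoke. The hon. Lady replied.")
print(caveat)
```

The second call shows why a phrase like "The hon." can go missing from later sentences in the chatbot's answers: the fragment has already been seen once, so the filter discards it.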

In [None]:
def print_pretty_response(response, width=100):
    import textwrap
    wrapper = textwrap.TextWrapper(width=width)
    print("\n" + "\n".join(wrapper.wrap(response)) + "\n")


In [None]:
# ---------------------------
# CLI Loop
# ---------------------------
if __name__ == "__main__":
    while True:
        query = input("Ask a question (or type 'quit' to exit): ")
        if query.lower() == 'quit':
            break
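        response = generate_response(query)
        # print_pretty_response prints the text itself and returns None,
        # so call it directly instead of passing its result to print().
        print("Chatbot:")
        print_pretty_response(response)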

Ask a question (or type 'quit' to exit): is there any discussion regarding womens rights? if yes summarize them


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Yes, there is a discussion regarding women's rights in the given text. Here is a summary:  The
debate highlights the importance of protecting women against abuse and ensuring their rights in the
workplace. The hon. Member for Barnsley Central (Dan Jarvis) expresses disappointment and dismay at
the undemocratic way the Government amended the legislation, which he believes undermines the
scrutiny provided by Parliament. He emphasizes that the Bill aims to protect women's rights and
provide long overdue guarantees to pregnant women. Member for Beckenham (Bob Stewart) also supports
the Bill, stating that it is about ensuring the country supports all its employees, male or female,
and that it is sad that such a Bill is needed in the 21st century. He highlights the importance of
recognizing that this is not just about women's rights, but about supporting all employees. The tone
of the debate suggests that there is a strong sentiment in favor of protecting women's rights and
ensuring their w

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



The tone in discussions involving women in the provided texts appears to be predominantly
**positive**. In the first text, the hon. Lady is quoted as saying "I am not here to doff my cap to
the Ministers; I am here to fight for the rights of women and girls." This statement conveys a sense
of determination and advocacy, indicating a positive tone. Additionally, the hon. Lady's refusal to
apologize to the Minister and her continued commitment to fighting for women's rights also suggest a
strong, positive tone. In the second text, the hon. Lady is also quoted as saying "I will continue
to do that, with every single bit of my tone just exactly as it is." This statement reinforces the
idea that she is committed to her cause and will not compromise on her tone, further suggesting a
positive tone. Overall, the tone in these discussions appears to be one of strong advocacy and
determination. There is no information regarding this in the given text.

Chatbot: None
Ask a question (or type 'qui