**Course**: FINA6289 Machine Learning for Financial Analytics

**Lecturer**: Mr. Ken Liu

**Student:** Wong Kwok Wah Malcolm

**Student ID:** 20134456

**Code Summary:**

*   This program supports my day-to-day work by showing which contractors various government departments hire, for what purposes, and for how much.
*   It uses the "Alternate Data" and "LLM (Large Language Model)" code as its base.
*   The modified code uses the open dataset "Awarded service contracts of SOA-QPS", under the "Technology and Broadcasting" category on the data.gov.hk site, to analyse the HK Government's IT solutions contracts awarded to vendors.
*   Building on the "Alternate Data" code, it downloads the monthly updated CSV file and extracts the details for analysis.
*   A dropdown box selects the column to search, either the Work Assignment Title (contract description) or the name of the awarded contracting company, and a text box takes the keyword.
*   Once the results have been filtered, they are passed to the LLM module, which summarises them and generates a report.
*   Specific instructions, such as summarising the results, identifying notable contractors, and analysing the awarded contract values, are passed in the prompt to produce a more meaningful summary.
*   Other test cases to try: change the Work Assignment Title keyword to "Chatbot", or change the Contractor Awarded keyword to "Kinetix". Re-run the program to get different results and summary reports.
*   Current limitation / future enhancement: to run in Colab, only the first 10 matched rows are used in the LLM prompt. Move to a real test or production environment that can process the full result, and change the `MAX_SAMPLE` variable once there.
*   Second limitation: because the CSV is updated monthly, the program should be re-tested after each update, as the resource ID or file name may change, which is outside my control as the programmer.
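
The keyword search described above can be illustrated in isolation. The sketch below is a minimal, standard-library-only example and not the notebook's actual code path: `build_keyword_pattern` and its `strict_word_boundary` flag are hypothetical names introduced here. It shows the same `re.escape` plus `\b` construction used in the program and one behaviour worth knowing: `\b` needs a word character on one side, so a keyword that ends in punctuation, such as `Automated Systems (HK)`, does not match. A lookaround-based variant is shown as one possible alternative.

```python
import re

def build_keyword_pattern(keyword: str, strict_word_boundary: bool = True) -> re.Pattern:
    """Hypothetical helper: compile a literal, case-insensitive keyword pattern."""
    escaped = re.escape(keyword.strip())  # treat user input as literal text
    if strict_word_boundary:
        # Same construction as the notebook: \b on both sides of the keyword.
        return re.compile(rf"\b{escaped}\b", flags=re.IGNORECASE)
    # Alternative (assumption): only require that no word character touches the keyword.
    return re.compile(rf"(?<!\w){escaped}(?!\w)", flags=re.IGNORECASE)

# Contractor names taken from the sample output of this notebook.
contractors = [
    "Automated Systems (HK) Limited",
    "Kinetix Systems Limited",
    "Arcotect Limited",
]

for keyword in ["kinetix", "Arco", "Automated Systems (HK)"]:
    strict = build_keyword_pattern(keyword)
    relaxed = build_keyword_pattern(keyword, strict_word_boundary=False)
    strict_hits = [c for c in contractors if strict.search(c)]
    relaxed_hits = [c for c in contractors if relaxed.search(c)]
    print(f"{keyword!r}: strict={strict_hits} relaxed={relaxed_hits}")
```

Both variants match whole words only, so `Arco` does not match `Arcotect Limited`. They differ only for keywords that begin or end with a non-word character.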

In [15]:
#@title Search selected column and generate an LLM report (dropdown + textbox)
SEARCH_COLUMN = "Work Assignment Title" #@param ["Work Assignment Title", "Contractor Awarded"]
KEYWORD = "Privacy Impact Assessment" #@param {type:"string"}
MAX_SAMPLE = 10  # number of matched rows to include in the prompt summary

import os, re, requests, pandas as pd
from IPython.display import display, Markdown
from google.colab import ai

# --- Config ---
RESOURCE_ID = "8c151c53-e3db-4248-b0f2-23f20e43a082"
CSV_URL = f"https://data.gov.hk/en-data/dataset/hk-dpo-dpo_hp-soa-qps-awarded-service-contracts/resource/{RESOURCE_ID}/download?format=csv"
LOCAL_CSV = "soa-qps-awarded-service-contracts.csv"
OUT_MATCHES = "chatbot_matches.csv"
CHUNKSIZE = 100_000  # set None to load whole file at once (not recommended)

def md(text): display(Markdown(text))
def displayLLMOutput(llmOutput):
    display(Markdown(f"<font size='4'>{llmOutput}</font>"))

# 1) Download CSV if missing
if not os.path.exists(LOCAL_CSV):
    md("**Downloading CSV from data.gov.hk...**")
    resp = requests.get(CSV_URL, timeout=60)
    resp.raise_for_status()
    with open(LOCAL_CSV, "wb") as f:
        f.write(resp.content)
    md(f"Saved `{LOCAL_CSV}` ({len(resp.content):,} bytes).")

# 2) Build safe regex from user input
escaped = re.escape(KEYWORD.strip())
pattern = re.compile(rf"\b{escaped}\b", flags=re.IGNORECASE)
col_name = SEARCH_COLUMN

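# 3) Detect a working encoding for the CSV header
# latin1 can decode any byte sequence, so it goes last as the fallback;
# listing it before big5 would make big5 unreachable.
encodings = ["utf-8-sig", "utf-8", "big5", "latin1"]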
used_encoding = None
for enc in encodings:
    try:
        pd.read_csv(LOCAL_CSV, nrows=0, encoding=enc)
        used_encoding = enc
        break
    except Exception:
        continue
if not used_encoding:
    raise RuntimeError("Unable to detect a working encoding for the CSV.")

# 4) Prepare output file
if os.path.exists(OUT_MATCHES):
    os.remove(OUT_MATCHES)

md(f"**Scanning `{LOCAL_CSV}` for `{KEYWORD}` in column `{col_name}`...**")

# 5) Stream and search
matches_found = 0
reader = pd.read_csv(LOCAL_CSV, chunksize=CHUNKSIZE, encoding=used_encoding, low_memory=False) if CHUNKSIZE else [pd.read_csv(LOCAL_CSV, encoding=used_encoding)]
first_write = True
sample_rows = []

for i, chunk in enumerate(reader, start=1):
    if col_name not in chunk.columns:
        md("**Error:** column `" + col_name + "` not found. Available columns (sample):")
        md("`" + "`  `".join(chunk.columns[:50]) + ("`  `...`" if len(chunk.columns) > 50 else "`"))
        raise SystemExit
    mask = chunk[col_name].astype(str).str.contains(pattern, na=False)
    if mask.any():
        matches = chunk.loc[mask]
        matches_found += len(matches)
        # write matches incrementally
        matches.to_csv(OUT_MATCHES, mode="a", header=first_write, index=False, encoding="utf-8-sig")
        first_write = False
        # collect a small sample for the LLM prompt
        if len(sample_rows) < MAX_SAMPLE:
            # select only a few useful columns if present
            cols_to_show = []
            for c in ["Work Assignment Title", "Contractor Awarded", "Awarded Contract Value", "Date of Award"]:
                if c in matches.columns:
                    cols_to_show.append(c)
            if not cols_to_show:
                cols_to_show = matches.columns.tolist()[:4]
            sample = matches[cols_to_show].head(MAX_SAMPLE - len(sample_rows)).to_dict(orient="records")
            sample_rows.extend(sample)
    if i % 10 == 0:
        md(f"Processed {i} chunks; matches so far: **{matches_found}**")

# 6) Report results and call LLM to generate a report
if matches_found == 0:
    md(f"**No matches found for `{KEYWORD}` in `{col_name}`.**")
else:
    md(f"**Found {matches_found} matching row(s).** Showing first 10:")
    display(pd.read_csv(OUT_MATCHES, nrows=10, encoding="utf-8-sig"))

    # Build a concise prompt for the LLM using the sample rows
    def format_sample_rows(rows):
        lines = []
        for i, r in enumerate(rows, start=1):
            parts = [f"{k}: {v}" for k, v in r.items()]
            lines.append(f"{i}. " + " | ".join(parts))
        return "\n".join(lines)

    sample_text = format_sample_rows(sample_rows) if sample_rows else "No sample rows available."
    prompt = (
        "You are an assistant that summarizes awarded contract matches.\n\n"
        f"Search column: {col_name}\n"
        f"Keyword: {KEYWORD}\n"
        f"Total matches found: {matches_found}\n\n"
        f"Here are up to the first {MAX_SAMPLE} matches (one per line):\n"
        f"{sample_text}\n\n"
        "Please produce a concise report (3-6 bullet points) that includes:\n"
        "- A short summary of what the matches indicate\n"
        "- Any notable suppliers or repeated titles\n"
        "- Analyze the Awarded Contract Value\n"
        "Keep the report short and actionable."
    )

    # Show available LLM models (optional) and generate text
    try:
        displayLLMOutput("Available LLM models:")
        displayLLMOutput(ai.list_models())
    except Exception:
        # ignore model listing errors; proceed to generate if possible
        pass

    try:
        displayLLMOutput("Generating report from the LLM...")
        llm_response = ai.generate_text(prompt)
        # ai.generate_text may return a dict-like object; convert to string if needed
        if isinstance(llm_response, dict) and "content" in llm_response:
            report_text = llm_response["content"]
        else:
            report_text = str(llm_response)
        displayLLMOutput(report_text)
    except Exception as e:
        md(f"**LLM generation failed:** {e}")
        raise


**Downloading CSV from data.gov.hk...**

Saved `soa-qps-awarded-service-contracts.csv` (1,449,910 bytes).

**Scanning `soa-qps-awarded-service-contracts.csv` for `Privacy Impact Assessment` in column `Work Assignment Title`...**

**Found 664 matching row(s).** Showing first 10:

| | QPS Contract | Service Category/Group | Bureau/ Department | Work Assignment Title | Date of Award | Contractor Awarded | Awarded Contract Value |
|---|---|---|---|---|---|---|---|
| 0 | SOA-QPS3 | 1 | Electrical and Mechanical Services Department | Privacy Impact Assessment of the Interactive V... | 28.Jul.17 | Automated Systems (HK) Limited | HK$45,986 |
| 1 | SOA-QPS3 | 1 | Immigration Department | Privacy Impact Assessment and Privacy Complian... | 28.Jul.17 | Automated Systems (HK) Limited | HK$90,302 |
| 2 | SOA-QPS3 | 1 | Commerce and Economic Development Bureau | Privacy Impact Assessment for the Phase 1 of T... | 21.Jul.17 | Arcotect Limited | HK$53,300 |
| 3 | SOA-QPS3 | 1 | Fire Services Department | Privacy Impact Assessment and Privacy Complian... | 13.Jul.17 | Automated Systems (HK) Limited | HK$70,432 |
| 4 | SOA-QPS3 | 1 | Immigration Department | Privacy Impact Assessment and Privacy Complian... | 03.Jul.17 | Automated Systems (HK) Limited | HK$59,930 |
| 5 | SOA-QPS3 | 1 | Department of Health | Privacy Impact Assessment and Compliance Audit... | 14.Jun.17 | Kinetix Systems Limited | HK$90,000 |
| 6 | SOA-QPS3 | 1 | Hong Kong Police Force | Privacy Impact Assessment of Police Welfare Fu... | 19.May.17 | Automated Systems (HK) Limited | HK$43,801 |
| 7 | SOA-QPS3 | 1 | Buildings Department | Privacy Impact Assessment Services for the Rev... | 20.Apr.17 | Kinetix Systems Limited | HK$88,500 |
| 8 | SOA-QPS3 | 1 | Department of Health | Privacy Impact Assessment and Compliance Audit... | 20.Apr.17 | Kinetix Systems Limited | HK$78,300 |
| 9 | SOA-QPS3 | 1 | Department of Health | Privacy Impact Assessment for the Immunisation... | 10.Mar.17 | Arcotect Limited | HK$52,750 |
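
The `Date of Award` values shown above use a `DD.Mon.YY` form such as `28.Jul.17`, which has to be parsed before any time-based analysis. The following is a minimal sketch using only the standard library. `parse_award_date` is a hypothetical helper name, the ten dates are copied from the rows above, and the `%b` month abbreviation assumes an English (or C) locale.

```python
from collections import Counter
from datetime import date, datetime

def parse_award_date(text: str) -> date:
    """Hypothetical helper: parse a 'DD.Mon.YY' string such as '28.Jul.17'."""
    # %b is locale-dependent; this assumes English month abbreviations.
    return datetime.strptime(text.strip(), "%d.%b.%y").date()

# Dates copied from the ten displayed rows.
award_dates = [
    "28.Jul.17", "28.Jul.17", "21.Jul.17", "13.Jul.17", "03.Jul.17",
    "14.Jun.17", "19.May.17", "20.Apr.17", "20.Apr.17", "10.Mar.17",
]

parsed = [parse_award_date(d) for d in award_dates]

# Count awards per calendar month (keys formatted as YYYY-MM).
awards_per_month = Counter(d.strftime("%Y-%m") for d in parsed)

for month, count in sorted(awards_per_month.items()):
    print(month, count)
```

With parsed dates, the matched rows could be grouped by month or year before being summarised, rather than relying on the order in which they appear in the file.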


<font size='4'>Available LLM models:</font>

<font size='4'>['google/gemini-2.5-flash', 'google/gemini-2.5-flash-lite']</font>

<font size='4'>Generating report from the LLM...</font>

<font size='4'>Here's a concise report based on the provided contract matches:

*   The contracts predominantly involve **Privacy Impact Assessments (PIAs)**, often combined with Privacy Compliance Audits, for various IT systems across Hong Kong government departments.
*   **Automated Systems (HK) Limited** is a highly recurring contractor, securing 5 out of the 10 listed awards, followed by Kinetix Systems Limited (3 awards) and Arcotect Limited (2 awards), indicating key players in this specialized service area.
*   Awarded contract values for these PIA services consistently range from approximately **HK$43,000 to HK$90,000**, suggesting a relatively standardized scope or pricing model for these types of assessments.
*   The assessments target a wide array of critical government systems, from immigration and trade to health and police, underscoring a broad governmental emphasis on data privacy and compliance.</font>
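
The counts and value range quoted in the LLM summary can be cross-checked deterministically. The sketch below is an illustration only, not part of the original program: `parse_hkd` is a hypothetical helper, and the data is the ten displayed rows (contractor and awarded value) copied from the output above. It uses only the standard library.

```python
from collections import defaultdict

def parse_hkd(value: str) -> int:
    """Hypothetical helper: convert a string like 'HK$45,986' to the integer 45986."""
    return int(value.replace("HK$", "").replace(",", "").strip())

# (Contractor Awarded, Awarded Contract Value) for the ten displayed rows.
rows = [
    ("Automated Systems (HK) Limited", "HK$45,986"),
    ("Automated Systems (HK) Limited", "HK$90,302"),
    ("Arcotect Limited", "HK$53,300"),
    ("Automated Systems (HK) Limited", "HK$70,432"),
    ("Automated Systems (HK) Limited", "HK$59,930"),
    ("Kinetix Systems Limited", "HK$90,000"),
    ("Automated Systems (HK) Limited", "HK$43,801"),
    ("Kinetix Systems Limited", "HK$88,500"),
    ("Kinetix Systems Limited", "HK$78,300"),
    ("Arcotect Limited", "HK$52,750"),
]

# Aggregate the number of awards and the total value per contractor.
summary = defaultdict(lambda: {"awards": 0, "total": 0})
for contractor, raw_value in rows:
    summary[contractor]["awards"] += 1
    summary[contractor]["total"] += parse_hkd(raw_value)

values = [parse_hkd(v) for _, v in rows]
lowest, highest = min(values), max(values)

for contractor, stats in sorted(summary.items(), key=lambda kv: -kv[1]["awards"]):
    print(f"{contractor}: {stats['awards']} awards, HK${stats['total']:,}")
print(f"Range: HK${lowest:,} to HK${highest:,}")
```

For these ten rows the aggregation agrees with the summary's 5 / 3 / 2 split and shows the exact range as HK$43,801 to HK$90,302. Computing such aggregates over all matched rows and placing them in the prompt is one possible way to cover the full result set without sending every row to the LLM.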