#### Step 1: Install Required Libraries

This cell installs all necessary Python packages:

- `requests` and `beautifulsoup4` are used for scraping data from the ONS website; `tqdm` provides progress bars for long-running loops.
- `faiss-cpu` is used to create a fast vector similarity search index.
- `sentence-transformers` provides pre-trained models for converting text into dense vector embeddings.
- `transformers` allows loading lightweight language models for text generation.
- `vaderSentiment` is used to assess sentiment (positive, neutral, negative) in economic text summaries.

In [None]:
# Install dependencies (if not already installed)
!pip install requests beautifulsoup4 tqdm
!pip install -q faiss-cpu sentence-transformers transformers vaderSentiment






#### Step 2: Import Python Libraries

This cell imports all required modules for the project:

- `pandas`, `numpy` for working with tabular and numerical data.
- `faiss` for building a vector index.
- `sentence_transformers` for generating embeddings from text.
- `transformers` for loading and running pre-trained LLMs (T5).
- `vaderSentiment` for sentiment scoring.
- `requests` and `BeautifulSoup` for scraping ONS bulletin data.
- `json`, `shutil`, `time`, `random`, and `re` for data handling, file operations, request delays, and regular-expression text splitting.

In [None]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import requests
from bs4 import BeautifulSoup
import time
import random
import json
import shutil
import re


#### Step 3: Mount Google Drive

Mounts the user's Google Drive to allow saving and retrieving files.

This is required for saving the final scraped dataset (`ons_bulletins.jsonl`) and for accessing any pre-uploaded files during later analysis.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Step 4: Define URL Template for ONS Bulletin Pages

Defines the base URL and the paginated path for accessing previous releases of ONS bulletins on economic activity and real-time indicators.

The `page_url_template` allows dynamic page access by formatting the page number during iteration.
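As a quick illustration of how the template expands, the sketch below formats a few page numbers. It reuses the same base URL and query string as the notebook; the helper name `build_page_url` is hypothetical and only there for the example.

```python
# Same base URL and path as used in this notebook.
base_url = "https://www.ons.gov.uk"
page_url_template = (
    base_url
    + "/economy/economicoutputandproductivity/output/bulletins/"
    "economicactivityandsocialchangeintheukrealtimeindicators/"
    "previousreleases?q=&limit=10&sort=release_date&page={}"
)


def build_page_url(page_num: int) -> str:
    """Hypothetical helper: fill the page number into the template."""
    return page_url_template.format(page_num)


# Print the first, second, and last archive page URLs.
for n in (1, 2, 21):
    print(build_page_url(n))
```

Keeping the page number as the only placeholder means the loop needs just one `format` call per request.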

In [None]:

# Base parts
base_url = "https://www.ons.gov.uk"
page_url_template = base_url + "/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/previousreleases?q=&limit=10&sort=release_date&page={}"

# Store all links
all_bulletin_links = []

# User agent
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}

# Loop through 21 pages
for page_num in range(1, 22):
    full_url = page_url_template.format(page_num)
    print(f"🔍 Scraping Page {page_num}: {full_url}")

    success = False
    retries = 0

    while not success and retries < 5:
        try:
            response = requests.get(full_url, headers=HEADERS)
            if response.status_code == 429:
                retries += 1
                wait_time = 10 * retries  # Linear backoff: 10s, 20s, 30s, ...
                print(f"⏳ Rate limited (429). Waiting {wait_time} seconds...")
                time.sleep(wait_time)
                continue
            response.raise_for_status()
            success = True
        except Exception as e:
            print(f"❌ Error on page {page_num}: {e}")
            break

    if not success:
        print(f"❌ Skipping Page {page_num} after retries.")
        continue

    soup = BeautifulSoup(response.text, "html.parser")
    links = soup.select("li.ons-document-list__item h2 a")

    for tag in links:
        href = tag.get("href")
        if href and href.startswith("/economy/"):
            full_link = base_url + href
            all_bulletin_links.append(full_link)

    # Be nice to the server
    time.sleep(3)

print(f"\n✅ Total links collected: {len(all_bulletin_links)}")

🔍 Scraping Page 1: https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/previousreleases?q=&limit=10&sort=release_date&page=1
🔍 Scraping Page 2: https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/previousreleases?q=&limit=10&sort=release_date&page=2
🔍 Scraping Page 3: https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/previousreleases?q=&limit=10&sort=release_date&page=3
🔍 Scraping Page 4: https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/previousreleases?q=&limit=10&sort=release_date&page=4
🔍 Scraping Page 5: https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/previousreleas

#### Step 5: Scrape Bulletin URLs from All Pages

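The loop in the preceding cell iterates through pages 1 to 21 of the ONS archive and extracts links to individual bulletin pages.

- For each page, it makes an HTTP request and parses the HTML to extract bulletin links using the CSS selector `li.ons-document-list__item h2 a`.
- It retries with an increasing wait when rate limited (HTTP 429), up to five times per page.
- It stores all valid links in the `all_bulletin_links` list.
- It pauses 3 seconds between pages to avoid overloading the ONS server.

The cell below uses these links to download full bulletin content: `parse_bulletin_page` extracts each bulletin's title, bullet points, and paragraphs, and `main` writes the results to `ons_bulletins.jsonl`, retrying once after 60 seconds on HTTP 429.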
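The retry-on-429 behaviour used by the scraping loops in this notebook can be sketched in isolation as follows. This is a minimal illustration, not the notebook's own code: `fetch_with_backoff` and its `fetch` and `sleep` parameters are hypothetical names, and the fetcher is injected so the logic can be exercised without touching the network.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fetch` until it stops returning 429 or retries run out.

    `fetch` is a hypothetical zero-argument callable returning
    (status_code, body). Waits grow linearly: base_wait * attempt.
    """
    retries = 0
    while retries < max_retries:
        status, body = fetch()
        if status == 429:
            retries += 1
            sleep(base_wait * retries)
            continue
        if status >= 400:
            return None  # Non-retryable error: give up immediately
        return body
    return None  # Still rate limited after max_retries attempts


# Usage with a fake fetcher: two 429 responses, then success.
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is a design choice for the example only; it lets the waits be recorded instead of actually pausing.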

In [None]:
def clean_paragraphs(tag):
    return [p.get_text(strip=True) for p in tag.find_all("p") if p.get_text(strip=True)]

def parse_bulletin_page(url):
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    title_tag = soup.select_one("h1")
    title = title_tag.get_text(strip=True) if title_tag else "Untitled Bulletin"

    sections = soup.select("article .section__content--markdown")
    content_parts = []

    for section in sections:
        section_title_tag = section.find("h2")
        section_title = section_title_tag.get_text(strip=True) if section_title_tag else ""

        # Get list items (e.g. bullet points)
        ul = section.find("ul")
        if ul:
            for li in ul.find_all("li"):
                text = li.get_text(strip=True)
                if text:
                    content_parts.append(text)

        # Get paragraph text
        content_parts.extend(clean_paragraphs(section))

    # Join all text parts into one long document
    full_text = "\n".join(content_parts)
    return {
        "url": url,
        "title": title,
        "text": full_text
    }

def main():
    all_bulletins = []
    failed_links = []

    for idx, link in enumerate(all_bulletin_links):
        try:
            print(f"🔍 Scraping ({idx + 1}/{len(all_bulletin_links)}): {link}")
            bulletin = parse_bulletin_page(link)
            all_bulletins.append(bulletin)
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                print("🔁 429 Too Many Requests – sleeping 60s and retrying...")
                time.sleep(60)
                try:
                    bulletin = parse_bulletin_page(link)
                    all_bulletins.append(bulletin)
                except Exception as e:
                    print(f"❌ Failed again on {link}: {e}")
                    failed_links.append(link)
            else:
                print(f"❌ HTTP error on {link}: {e}")
                failed_links.append(link)
        except Exception as e:
            print(f"❌ General error on {link}: {e}")
            failed_links.append(link)

        sleep_time = random.uniform(1.5, 3.5)
        print(f"⏳ Sleeping for {sleep_time:.1f} seconds")
        time.sleep(sleep_time)

    # Save all bulletins as .jsonl
    with open("ons_bulletins.jsonl", "w", encoding="utf-8") as f:
        for entry in all_bulletins:
            json.dump(entry, f, ensure_ascii=False)
            f.write("\n")

    print(f"\n✅ Saved {len(all_bulletins)} bulletins to ons_bulletins.jsonl")
    if failed_links:
        print(f"\n⚠️ Failed on {len(failed_links)} links. You can retry these later.")
        with open("failed_links.txt", "w") as f:
            f.write("\n".join(failed_links))

if __name__ == "__main__":
    main()

🔍 Scraping (1/209): https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/26june2025
⏳ Sleeping for 3.4 seconds
🔍 Scraping (2/209): https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/19june2025
⏳ Sleeping for 2.7 seconds
🔍 Scraping (3/209): https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/12june2025
⏳ Sleeping for 1.6 seconds
🔍 Scraping (4/209): https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/5june2025
⏳ Sleeping for 2.3 seconds
🔍 Scraping (5/209): https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/30may2025
⏳ Sleeping for 2.6 seconds
🔍 Scraping (6/209): https://www.ons.g

#### Step 9: Copy Output to Google Drive

After scraping and saving, this copies the resulting `ons_bulletins.jsonl` file into your Google Drive for persistent storage.

In [None]:

# Change the path below to wherever you want in your Drive
shutil.copy("ons_bulletins.jsonl", "/content/drive/MyDrive/ons_bulletins.jsonl")

'/content/drive/MyDrive/ons_bulletins.jsonl'

#### Step 10: Load Bulletins from JSONL File

Reads the `ons_bulletins.jsonl` file, where each line is a separate JSON object representing a single bulletin.
The parsed dictionaries are collected in a Python list, converted to a pandas DataFrame, and previewed.
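For readers unfamiliar with the JSONL layout, here is a small round-trip sketch using an in-memory buffer instead of a file. The two sample records are made up for illustration; only the `url`, `title`, and `text` keys mirror the notebook's data.

```python
import io
import json

# Hypothetical sample records standing in for scraped bulletins.
sample = [
    {"url": "https://example.org/a", "title": "Bulletin A", "text": "First."},
    {"url": "https://example.org/b", "title": "Bulletin B", "text": "Second."},
]

# Write: one JSON object per line.
buffer = io.StringIO()
for entry in sample:
    buffer.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Read: parse each non-empty line independently.
buffer.seek(0)
records = [json.loads(line) for line in buffer if line.strip()]

print(len(records), records[0]["title"])
```

Because each line is parsed on its own, a single malformed line can be skipped or reported without losing the rest of the file.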

In [None]:

# If running in Colab, first upload the file (you've already done this)
# from google.colab import files
# uploaded = files.upload()

# Load the JSONL file
file_path = "ons_bulletins.jsonl"  # adjust this if the file is in another path

# Read JSONL into a list of dictionaries
records = []
with open(file_path, "r", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Convert to DataFrame
df = pd.DataFrame(records)

# Preview the data
print(f"✅ Loaded {len(df)} bulletins.\n")
display(df.head())  # Use `df.head(10)` to show more rows if needed

✅ Loaded 209 bulletins.



| | url | title | text |
|---|---|---|---|
| 0 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | Overall retail footfall remained broadly uncha... |
| 1 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | The seasonally adjusted Direct Debit failure r... |
| 2 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | Overall retail footfall decreased in the week ... |
| 3 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | Overall retail footfall increased by 5% in the... |
| 4 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | Overall retail footfall increased by 1% in the... |


#### Step 11: Convert Bulletins to DataFrame

Reloads the bulletins from the copy on Google Drive with `pd.read_json(..., lines=True)`, drops rows with a missing `text` value, and resets the index.

In [None]:
df = pd.read_json('/content/drive/MyDrive/ons_bulletins.jsonl', lines=True)
df.dropna(subset=['text'], inplace=True)
df.reset_index(drop=True, inplace=True)
print(f"Loaded {len(df)} bulletin entries.")

Loaded 209 bulletin entries.


#### Step 13: Define Function to Split Bulletin Text into Paragraphs

Splits long bulletin text into smaller paragraphs on double line breaks or `Section N:` markers (case-insensitive).
Only paragraphs longer than 40 characters are retained. The cell then expands the bulletins into a paragraph-level DataFrame, `df_para`.
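To see the splitting rule on a small input, the sketch below applies the same regular expression to a made-up text. The function name `demo_split`, the `min_len` parameter, and the sample sentences are illustrative assumptions, not part of the notebook.

```python
import re


def demo_split(text, min_len=40):
    """Split on blank lines or 'Section N:' markers; drop short fragments."""
    pattern = r"\n{2,}|section \d+:"
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if len(p.strip()) > min_len]


# Made-up sample: two long paragraphs and one short fragment.
sample = (
    "Overall retail footfall increased by 5% in the latest week.\n\n"
    "Short note.\n\n"
    "Section 2: The number of online job adverts was broadly unchanged."
)

paragraphs = demo_split(sample)
for p in paragraphs:
    print(repr(p))
```

The short fragment and the empty string left between the blank line and the section marker are both filtered out by the length check.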

In [None]:
def split_into_paragraphs(text):
    # Split on double newlines or "Section 1:", "Section 2:", etc. (case-insensitive)
    pattern = r'\n{2,}|section \d+:'
    paragraphs = re.split(pattern, text, flags=re.IGNORECASE)
    return [p.strip() for p in paragraphs if len(p.strip()) > 40]

# Expand into paragraph-level dataframe
rows = []
for _, row in df.iterrows():
    for para in split_into_paragraphs(row['text']):
        rows.append({
            "url": row["url"],
            "title": row["title"],
            "paragraph": para
        })

df_para = pd.DataFrame(rows)
print(f"Total paragraph entries: {len(df_para)}")
df_para.head(2)

Total paragraph entries: 1305


| | url | title | paragraph |
|---|---|---|---|
| 0 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | Overall retail footfall remained broadly uncha... |
| 1 | https://www.ons.gov.uk/economy/economicoutputa... | Economic activity and social change in the UK,... | Data sources and quality.\nThese are official ... |


#### Step 15: Generate Embeddings for Paragraphs

Encodes each paragraph using the `all-MiniLM-L6-v2` model from SentenceTransformers.
These embeddings will be used for semantic search later.

In [None]:
model_embed = SentenceTransformer("all-MiniLM-L6-v2")
para_embeddings = model_embed.encode(df_para["paragraph"].tolist(), show_progress_bar=True)




#### Step 16: Build FAISS Index

Creates a FAISS flat index using L2 (Euclidean) distance and adds all paragraph embeddings to enable fast similarity search.
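Conceptually, a flat L2 index compares the query against every stored vector and returns the closest ones; FAISS's `IndexFlatL2` reports squared Euclidean distances. The pure-Python sketch below illustrates that idea on toy 2-D vectors. It is not FAISS code, and `flat_l2_search` is a hypothetical name.

```python
from typing import List, Tuple


def flat_l2_search(
    vectors: List[List[float]], query: List[float], top_k: int = 3
) -> Tuple[List[float], List[int]]:
    """Brute-force search: squared L2 distance from query to every vector."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # Smallest distance first
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings": four 2-D points.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.1], top_k=2)
print(indices, distances)
```

The returned indices play the same role as the `I` array from `index.search`: they are row positions that can be mapped back to the paragraph table.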

In [None]:
index = faiss.IndexFlatL2(para_embeddings.shape[1])
index.add(np.array(para_embeddings))

#### Step 17: Perform Sentiment Analysis on Paragraphs

Applies VADER sentiment analysis to each paragraph and stores the compound score in a new 'sentiment' column.

In [None]:
analyzer = SentimentIntensityAnalyzer()
df_para["sentiment"] = df_para["paragraph"].apply(lambda x: analyzer.polarity_scores(x)["compound"])

#### Step 18: Load T5 Model and Tokenizer

Loads the `flan-t5-base` model from Hugging Face, which will be used to rephrase paragraphs in a formal and concise style similar to ONS bulletins.

In [None]:
llm_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
llm_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

#### Step 19: Define Semantic Search Function

This function takes a query, embeds it, and searches the FAISS index for the most semantically similar paragraphs.

Returns the top matching paragraphs and their source URLs.

In [None]:
def search_paragraphs(query, top_k=3):
    query_embedding = model_embed.encode([query])
    D, I = index.search(np.array(query_embedding), top_k)
    return df_para.iloc[I[0]][["paragraph", "url"]].to_dict(orient="records")


#### Step 20: Define Rewriting Function Using T5

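Rewrites the selected paragraph using the `flan-t5-base` model to produce a more concise and formal summary in the style of ONS bulletins. Only the first 300 characters of the paragraph are passed to the model, and the generated output is capped at 128 tokens.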

In [None]:

def summarize_with_llm(text):
    short_text = text[:300]
    prompt = f"Rewrite this like an ONS bulletin summary: {short_text}"
    tokens = llm_tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = llm_model.generate(tokens, max_length=128)
    return llm_tokenizer.decode(output_ids[0], skip_special_tokens=True)



#### Step 21: Define Rolling Sentiment Summary

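Takes a five-row rolling mean of the paragraph sentiment scores and reads its final value, which is the average compound score of the last five rows of `df_para`. The result is classified as broadly positive (above 0.2), somewhat negative (below -0.2), or mixed/neutral otherwise. Because `df_para` follows the scraping order, these are the paragraphs scraped last, not necessarily those from the most recent bulletins.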
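The same calculation can be reproduced without pandas, which makes the window and thresholds explicit. This is a sketch under assumptions: `classify_recent_sentiment` is a hypothetical name, the scores are made up, and the "not enough data" branch is an addition for the example (the pandas version would yield NaN and fall through to the neutral message).

```python
from statistics import mean


def classify_recent_sentiment(scores, window=5, threshold=0.2):
    """Average the last `window` scores and bucket the result."""
    if len(scores) < window:
        return "not enough data"  # Assumption: pandas would give NaN here
    avg = mean(scores[-window:])
    if avg > threshold:
        return "positive"
    if avg < -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores in the range [-1, 1].
scores = [0.9, -0.1, 0.3, 0.4, 0.5, 0.2, 0.6]
label = classify_recent_sentiment(scores)
print(label)
```

Only the final five values influence the label, so earlier scores such as the leading 0.9 have no effect.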

In [None]:
def get_sentiment_summary():
    avg = df_para["sentiment"].rolling(5).mean().iloc[-1]
    if avg > 0.2:
        return "Sentiment over recent bulletins is broadly positive."
    elif avg < -0.2:
        return "Sentiment over recent bulletins is somewhat negative."
    else:
        return "Sentiment over recent bulletins is mixed or neutral."

#### Step 22: Define Suggested Queries

Provides a list of example queries users can ask to explore trends and bulletins.

In [None]:
SUGGESTED_QUESTIONS = [
    "What’s happening with job adverts?",
    "How has retail footfall changed recently?",
    "What are debit failure rates doing?",
    "Has the economy improved this month?",
    "How is the economic sentiment lately?",
    "Tell me something from this week’s bulletin",
    "Is consumer spending increasing?",
    "Any recent decline in job postings?"
]

#### Step 23: Define Main Assistant Function

The core logic of the assistant:
- Accepts a user query
- Searches and retrieves relevant bulletin paragraphs
- Rewrites the most relevant one using the T5 model
- Returns a formatted result

In [None]:
import time

def ask_econbot(user_input, verbose=True):
    print("📊 ONS EconBot Assistant")
    print("💬 You asked:", user_input)

    if user_input.lower() in ["exit", "quit", "stop"]:
        return "👋 Exiting EconBot."

    if "sentiment" in user_input.lower():
        return "🤖 Sentiment Tracker: " + get_sentiment_summary()

    t0 = time.time()

    print("\n🔍 Step 1: Searching bulletin summaries...")
    results = search_paragraphs(user_input)
    if not results:
        return "❌ EconBot: Sorry, I couldn’t find anything relevant."

    output = "\n✅ Found similar past summaries:\n"
    for i, r in enumerate(results):
        output += f"\n{i+1}. {r['paragraph']}\n↪ Source: {r['url']}\n"

    print("\n🧠 Step 2: Running LLM to rephrase top result...")
    try:
        rewritten = summarize_with_llm(results[0]["paragraph"])
        output += f"\n📘 Rephrased Summary:\n{rewritten}"
    except Exception as e:
        output += f"\n⚠️ LLM failed to rephrase. Error: {e}"

    t1 = time.time()
    output += f"\n\n⏱️ Total response time: {t1 - t0:.1f} seconds"

    return output

#### Step 24: Run Assistant on Example Query

Runs the assistant on a test question about shipping. Displays both the raw search results and the LLM-generated summary.

In [None]:
response = ask_econbot("How is the shipping ")
print(response)

📊 ONS EconBot Assistant
💬 You asked: How is the shipping 

🔍 Step 1: Searching bulletin summaries...

🧠 Step 2: Running LLM to rephrase top result...

✅ Found similar past summaries:

1. Business and workforce.
The average number of daily ship visits increased by 23% in the week to 29 January 2023 but was 14% below the level in the same period last year, while the average number of cargo and tanker ship visits increased by 6% in the latest week and was broadly unchanged from the level in the same period last year (exactEarth).
↪ Source: https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/3february2023

2. Business and workforce.
The average number of daily ship visits increased by 23% in the week to 29 January 2023 but was 14% below the level in the same period last year, while the average number of cargo and tanker ship visits increased by 6% in the latest week and was broadly unchanged from the level i