In [1]:
# pip install transformers torch faiss-cpu yfinance pandas numpy tqdm python-dateutil


# Final Project - Financial-News RAG Assistant with Live Stock-Price Context


*Chen Yang – NUID 002837912*


## Abstract

This Jupyter Notebook presents a complete **Retrieval-Augmented Generation (RAG) assistant** that helps investors and analysts answer natural-language questions about publicly traded companies by combining two complementary data sources:

1. **Semantic News Retrieval** – Topically relevant headlines are fetched from a cleaned CNBC financial-news corpus (2019 – 2024) via a FAISS vector index.  
2. **Live Market Data** – Up-to-date closing prices are pulled on-the-fly with `yfinance`, summarising recent trends (e.g., 30-day movement) for any recognised ticker symbol.

A lightweight language model (DistilGPT-2) then fuses both contexts into a concise answer, giving the user an “at a glance” narrative that blends qualitative news sentiment with quantitative price performance.

### Why this matters — key motivations

- **Information overload**  
  Investors face thousands of headlines per day; manually triaging them is slow and error-prone.

- **Context gaps**  
  News stories rarely include hard numbers, while price charts lack narrative. Merging the two enriches decision-making.

- **Cost-efficient generative AI**  
  The pipeline relies only on open-weights models and free APIs, demonstrating that effective RAG systems need not incur paid token or cloud fees.

### Example interaction

> **Query:** “What drove TSLA’s share price this month and where does it stand now?”

* The FAISS search surfaces headlines about Cybertruck deliveries and price-cut rumours.  
* `yfinance` reports that TSLA gained **≈ 8 %** over the last 30 trading days.  
* The LLM summarises:  
  > “Tesla shares rose about 8 % in the past month, buoyed by renewed EV-demand optimism after the first Cybertruck hand-offs. The stock closed yesterday at \$259.51.”

### Deliverables in this notebook

- End-to-end data pipeline: ingestion → cleaning → embedding → retrieval → generation.  
- Reusable utility functions (`convert_time`, `price_series`, `rag_query`).  
- Ready-to-run Streamlit stub for interactive deployment.

This project showcases a practical, low-cost blueprint for domain-specific RAG applications that amplify human insight by unifying textual and numerical evidence in a single generative answer.


![RAG Framework Overview](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Frag-framework.81dc2cdc.png&w=3840&q=75)

*Source: [Prompting Guide – Retrieval‑Augmented Generation](https://www.promptingguide.ai/research/rag)*

## Theory Section  
### Retrieval-Augmented Generation (RAG)

---

### 1  What is RAG?

Retrieval-Augmented Generation (RAG) is an architectural pattern that **blends information retrieval with neural text generation**.  
Instead of relying purely on a language model’s frozen parameters (“closed-book” knowledge), a RAG system:

1. **Retrieves** external documents or passages that are relevant to a user’s query.  
2. **Augments** the query with these snippets (often called *context*).  
3. **Generates** a final answer conditioned on both the query and the retrieved context.

Mathematically, the answer probability marginalises over the retrieved documents:  

\[
P(\text{answer}\;|\;\text{query}) \;=\;
\sum_{d \in \mathcal{D}} 
P(d\;|\;\text{query}) \;
P(\text{answer}\;|\;\text{query},d)
\]

Where  
* \(\mathcal{D}\) = corpus of documents  
* \(P(d\;|\;\text{query})\) = retriever scoring function  
* \(P(\text{answer}\;|\;\text{query},d)\) = generator likelihood

> **Intuition:** The retrieval stage narrows the search space to a handful of high-value passages; the generator then works “open-book”, citing the fresh evidence instead of hallucinating facts from dated training data.
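
The marginalisation above can be made concrete with a toy calculation. The scores and likelihoods below are invented purely for illustration, and the softmax used to turn retriever scores into \(P(d\;|\;\text{query})\) is one common choice rather than something this notebook implements (the notebook simply concatenates the top-k headlines into one prompt).

```python
import math

# Hypothetical retriever scores for three documents (higher = more relevant).
scores = [2.0, 1.0, 0.1]

# One common way to obtain P(d | query): a softmax over the scores.
exp_scores = [math.exp(s) for s in scores]
total = sum(exp_scores)
p_doc = [e / total for e in exp_scores]

# Hypothetical generator likelihoods P(answer | query, d) for one fixed answer.
p_answer_given_doc = [0.9, 0.4, 0.05]

# Marginalise: sum over d of P(d | query) * P(answer | query, d).
p_answer = sum(pd_ * pa for pd_, pa in zip(p_doc, p_answer_given_doc))
print(f"P(answer | query) = {p_answer:.3f}")
```

Because the first document dominates the retriever distribution, the result sits closer to 0.9 than to the other two likelihoods.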

---

### 2  Why RAG instead of a larger LLM?

| Challenge | Closed-Book LLMs | RAG Solution |
|-----------|-----------------|--------------|
| **Domain drift** (finance, medicine, law) | Model freezes outdated facts after training | Retrieve *current* documents, keeping answers fresh |
| **Knowledge capacity** | Must pack *all* knowledge into model weights, which pushes parameter counts into the billions | External store scales almost without bound; the model sees only the top-k passages per query |
| **Explainability** | Source of facts is opaque | Retrieved passages are explicit citations |
| **Cost / IP concerns** | Training huge models is expensive; shipping them exposes full weights | Small open-weights model + private corpus that can stay on local infrastructure |

---

### 3  Core Components of a RAG Pipeline

1. **Corpus ↔ Vector Store**  
   * Raw docs ➔ cleaned text ➔ embeddings ➔ FAISS / Milvus / Pinecone.  
   * Each vector is linked to its source metadata (title, URL, timestamp).

2. **Retriever**  
   * Accepts a query, embeds it, performs *k-NN* search in vector space.  
   * Returns top-k IDs + similarity scores.

3. **Context Assembler**  
   * Fetches raw passages from the IDs.  
   * Optionally applies reranking or chunk selection (e.g., **Max-Marginal-Relevance**) to reduce overlap.

4. **Generator (LLM)**  
   * Receives `prompt = [query] + [context block]`.  
   * Outputs an answer; temperature and repetition guards ensure concision.

5. **Post-processor**  
   * Trims rambling text, injects citations, enforces output style.  
   * In production may also decide “I don’t know” thresholds.
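
The Max-Marginal-Relevance step mentioned under the Context Assembler can be sketched as follows. This is a minimal illustration on toy vectors, not part of the notebook's pipeline; the function name `mmr`, the trade-off parameter `lam`, and the use of cosine similarity are all assumptions made for the example.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=3, lam=0.7):
    """Pick k document indices, trading relevance against redundancy."""
    # Normalise so that dot products are cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    relevance = d @ q                      # similarity of each doc to the query
    selected = []
    candidates = list(range(len(d)))

    while candidates and len(selected) < k:
        if selected:
            # Highest similarity between each candidate and any doc already picked.
            redundancy = (d[candidates] @ d[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: doc 1 is a near-duplicate of doc 0, doc 2 covers a different angle.
docs = np.array([[1.0, 1.0], [1.0, 0.9], [1.0, 0.0]])
query = np.array([1.0, 1.0])
picked = mmr(query, docs, k=2, lam=0.3)
print(picked)
```

With a low `lam`, the near-duplicate is skipped in favour of the less similar but more diverse document; a plain top-2 by relevance would return both near-duplicates.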

---

### 4  Implementation Walk-Through (of this notebook)

#### 4.1 Building the Knowledge Base  
* **Dataset:** CNBC headlines with publication timestamps.  
* **Cleaning step:**  
  * Normalise weird month spellings (`Sept → Sep`).  
  * Convert heterogeneous date strings (“7:51 PM ET Fri, 17 July 2020”) into Python `datetime`.  
* **Embedding:**  
  * DistilGPT-2 token embeddings averaged per headline → 768-d vectors.  
* **Indexing:**  
  * `faiss.IndexFlatL2` for exact (brute-force) L2 search over the ~2,800 headlines that survive cleaning.

#### 4.2 Live Numerical Context  
* **`yfinance` fetcher** grabs the last N trading-day closes.  
* Summariser distils them into one sentence (“up 7.8 %, last close 259.51”).

#### 4.3 The Retrieval Step  
```python
q_vec = embed_text(query)          # 768-d
_, idx = index.search(q_vec[None], k=3)
ctx_news = "\n".join(df.loc[i, "Headlines"] for i in idx[0])
```

#### 4.4 Prompt Assembly

The assembled prompt has this shape:

    News context:
    <Headline 1>
    <Headline 2>
    <Headline 3>

    Stock-price context:
    TSLA: From 2025-03-13 to 2025-04-24 the stock moved up +18.83 USD (+7.82 %).
          Last close: 259.51 USD.

    Question: What happened to TSLA this month and how is the stock doing?
    Answer concisely:

#### 4.5 Generation & Trim
* Model – DistilGPT-2 (≈ 82 M parameters) generates ≤ 120 tokens.

* Decoding guards – `repetition_penalty = 1.25`, `no_repeat_ngram_size = 4`.

* Post-processing – string splits drop the echoed prompt and any new “Question:” block the model starts, and a sentence-splitting regex keeps only the first two sentences.

### 5  Design Choices and Alternatives

| Layer              | Current Notebook Implementation            | Possible Upgrades (examples)                                      |
|--------------------|--------------------------------------------|-------------------------------------------------------------------|
| **Embeddings**     | Token-average **DistilGPT-2**              | Sentence-Transformers **all-MiniLM-L6** · **E5** · OpenAI **text-embedding-3-small** |
| **Vector Store**   | In-memory **FAISS**                        | **Milvus** · **Weaviate** for sharded / cloud-scale retrieval     |
| **Generator**      | **DistilGPT-2** (open, lightweight)        | **Mistral-7B-Instruct** · **Llama-3-8B-Instruct** · **Flan-T5-XL** |
| **Reranker**       | *None* (top-k = 3 is inexpensive)          | **ColBERT-v2** · **Cohere-rerank** · **BM25 + MMR**               |


### 6  Best Practices for High-Quality RAG

* **Chunk size (200 – 500 tokens)** – balances topical focus and recall.  
* **Separate models** – a light retriever plus a larger instruction-tuned generator usually beats a single monolith.  
* **Freshness signals** – store timestamps and nudge the retriever toward newer docs when the query implies recency.  
* **Guardrails** – greedy decoding (`do_sample=False`), n-gram repetition blocks, and post-trim logic curb rambling and repetition; they do not by themselves prevent hallucinated facts.  
* **Fallbacks** – if retrieval returns no hits, answer explicitly: *“I don’t have recent information on that topic.”* Never invent content.
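
The fallback rule can be sketched as a small gate in front of the generator. This is a hedged illustration: `answer_or_fallback`, the `(headline, distance)` pair format, and the `max_distance` threshold are hypothetical names and values that would need tuning against real FAISS distances.

```python
FALLBACK = "I don't have recent information on that topic."

def select_context(hits, max_distance):
    """Keep only hits close enough to the query.

    hits: list of (headline, distance) pairs, e.g. built from a vector search.
    """
    return [text for text, dist in hits if dist <= max_distance]

def answer_or_fallback(hits, max_distance, generate_fn):
    """Call the generator only when at least one hit passes the threshold."""
    context = select_context(hits, max_distance)
    if not context:
        return FALLBACK                    # no evidence: do not generate
    return generate_fn("\n".join(context))

# Usage with a stand-in generator that just echoes its context.
hits = [("Tesla delivers first Cybertrucks", 0.4), ("Oil prices slip", 3.2)]
print(answer_or_fallback(hits, max_distance=1.0, generate_fn=lambda ctx: f"Based on: {ctx}"))
print(answer_or_fallback(hits, max_distance=0.1, generate_fn=lambda ctx: f"Based on: {ctx}"))
```

Keeping the gate outside the generator means the refusal text is deterministic and never depends on what the language model decides to write.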

---

### 7  Limitations and Future Improvements

* **Headline-only corpus** – lacks article-body nuance; adding full text or SEC filings would deepen context.  
* **Ticker detection** – simple regex may mis-label words like “USA”; a symbol lookup table or NER would be safer.  
* **Model fidelity** – DistilGPT-2 can still invent figures; upgrading to an instruction-fine-tuned model (e.g., Zephyr-7B) would improve factuality.  
* **Latency** – live `yfinance` calls add ≈ 0.5 s; caching popular tickers or nightly snapshots would speed responses.
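
The caching idea from the latency bullet could look roughly like this. It is a sketch under assumptions: `TTLCache` is a hypothetical helper, the five-minute expiry is arbitrary, and the fetch function is injected so that the example runs without a network call.

```python
import time

class TTLCache:
    """Tiny time-based cache: a value is reused until ttl_seconds have passed."""

    def __init__(self, fetch_fn, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}                   # key -> (timestamp, value)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                  # still fresh: skip the slow fetch
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value

# Stand-in for a slow price download; records how often it is really called.
calls = []
def fake_fetch(ticker):
    calls.append(ticker)
    return [("2025-04-24", 259.51)]

cache = TTLCache(fake_fetch, ttl_seconds=300)
cache.get("TSLA")
cache.get("TSLA")                          # served from the cache
print(len(calls))                          # 1
```

In the notebook, the same wrapper could be placed around `price_series` so that repeated questions about one ticker trigger a single download per expiry window.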

---

### 8  Conclusion

Retrieval-Augmented Generation bridges **static parametric knowledge** and **dynamic external data**.  
By coupling a FAISS vector store of financial-news embeddings with real-time market prices via `yfinance`, this notebook delivers grounded answers that blend **qualitative narrative** and **quantitative price trends**—a practical, low-cost blueprint for domain-specific generative-AI systems.





In [3]:
import pandas as pd, numpy as np, re, torch, faiss, yfinance as yf
from datetime import datetime
from dateutil import parser
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel



| Line                               | Purpose                                                                                                                                        |
|------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| `import pandas as pd`              | **Pandas** supplies the `DataFrame` object used for loading and cleaning the financial-news CSV.                                               |
| `import numpy as np`               | **NumPy** handles vector maths (e.g., converting embeddings to `float32` before adding to FAISS).                                              |
| `import re`                        | Python’s **regular-expression** module, used for ticker detection and date-string normalisation.                                               |
| `import torch`                     | **PyTorch** provides tensor operations and runs the DistilGPT-2 language model on CPU/GPU.                                                     |
| `import faiss`                     | **FAISS** (Facebook AI Similarity Search) builds an in-memory vector index for fast k-nearest-neighbour retrieval.                             |
| `import yfinance as yf`            | Lightweight wrapper around Yahoo Finance’s API; fetches historical and real-time stock prices.                                                 |
| `from datetime import datetime`    | Standard-library helper for timestamp creation and formatting.                                                                                 |
| `from dateutil import parser`      | Robust date-string parser — e.g., converts “7:51 PM ET Fri, 17 July 2020” to a Python `datetime`.                                              |
| `from tqdm.auto import tqdm`       | **tqdm** renders progress bars in Jupyter or the console when embedding thousands of headlines.                                                |
| `from transformers …`              | Hugging Face **Transformers** utilities:                                                                                                       |
| &nbsp;&nbsp;• `AutoTokenizer`      | Downloads the correct tokenizer (DistilGPT-2) for text encoding/decoding.                                                                      |
| &nbsp;&nbsp;• `AutoModelForCausalLM` | Loads a causal-language-model head (DistilGPT-2) for answer generation.                                                                       |
| &nbsp;&nbsp;• `AutoModel`          | Base model loader when only hidden states / embeddings are needed.                                                                             |


In [4]:
# Load the dataset
df = pd.read_csv('cnbc_headlines.csv')

# Preview the dataset
print(df.head())


                                           Headlines   
0  Jim Cramer: A better way to invest in the Covi...  \
1     Cramer's lightning round: I would own Teradyne   
2                                                NaN   
3  Cramer's week ahead: Big week for earnings, ev...   
4  IQ Capital CEO Keith Bliss says tech and healt...   

                             Time   
0   7:51  PM ET Fri, 17 July 2020  \
1   7:33  PM ET Fri, 17 July 2020   
2                             NaN   
3   7:25  PM ET Fri, 17 July 2020   
4   4:24  PM ET Fri, 17 July 2020   

                                         Description  
0  "Mad Money" host Jim Cramer recommended buying...  
1  "Mad Money" host Jim Cramer rings the lightnin...  
2                                                NaN  
3  "We'll pay more for the earnings of the non-Co...  
4  Keith Bliss, IQ Capital CEO, joins "Closing Be...  


The code block loads the CNBC headlines dataset into a Pandas DataFrame and prints the first five rows.  
This quick preview confirms the file was read correctly, reveals the column names (`Headlines`, `Time`, `Description`), and exposes obvious issues, such as the all-`NaN` row at index 2 and the irregular spacing in `Time`, before further cleaning and processing.


In [5]:
# ─── 2.1  helper dict for odd month spellings ────────────────────────
MONTH_FIX = {"Sept": "Sep"}          # add more if you encounter them

# ─── 2.2  single convert_time() that swallows all variants ───────────
def convert_time(time_str):
    if pd.isna(time_str):
        return None
    if not isinstance(time_str, str):
        time_str = str(time_str)

    s = " ".join(time_str.strip().split())                # squash spaces
    for bad, good in MONTH_FIX.items():                   # 'Sept' → 'Sep'
        s = s.replace(f" {bad} ", f" {good} ")

    if " ET " in s:                                       # strip 'ET Mon,'
        t_part, rest = s.split(" ET ", 1)
        rest = rest.split(",", 1)[-1].strip()
        s = f"{rest} {t_part}"                            # '30 Sep 2019 7:27 PM'

    has_date = bool(re.search(r"\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4}", s))

    if not has_date:                                      # time-only → add today
        today = datetime.now().strftime("%d %b %Y")
        s = f"{today} {s}"

    # explicit attempts
    for fmt in ("%d %b %Y %I:%M %p", "%d %B %Y %I:%M %p"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
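    # last-chance parse; unparseable strings become None and are dropped in 2.3
    try:
        return parser.parse(s)
    except (ValueError, OverflowError):
        return None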

# ─── 2.3  apply, keep rows with valid Headline & Time ────────────────
df["Time"] = df["Time"].apply(convert_time)
df = df.dropna(subset=["Headlines", "Time"]).reset_index(drop=True)
print("Rows after cleaning:", len(df))



Rows after cleaning: 2800


This block standardises the messy **Time** column and removes unusable rows.

* **`MONTH_FIX` helper** – maps rare month spellings (`"Sept"` → `"Sep"`).  
* **`convert_time()`**  
  * Handles `NaN` and non-string values.  
  * Normalises whitespace, fixes odd month tokens, strips the `" ET Mon,"` wrapper used by CNBC, and prepends today’s date if the string contains only a time.  
  * Tries two explicit `strptime` formats—`%d %b %Y %I:%M %p` (abbrev. month) and `%d %B %Y %I:%M %p` (full month).  
  * Falls back to `dateutil.parser.parse` if those fail, giving a robust conversion to `datetime`.  

* **Apply and clean**  
  * `df["Time"] = df["Time"].apply(convert_time)` converts every timestamp.  
  * `df.dropna(subset=["Headlines", "Time"])` keeps only rows that now have both a headline and a valid `datetime`.  
  * After the cleanup, the dataset contains **2 800 usable rows**, ready for embedding and retrieval.
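
To make the reshaping step concrete, the snippet below traces one timestamp from the dataset preview through the same string operations that `convert_time()` applies. It is a standalone walk-through of a single input, not a replacement for the function.

```python
from datetime import datetime

raw = "7:51  PM ET Fri, 17 July 2020"

s = " ".join(raw.strip().split())          # "7:51 PM ET Fri, 17 July 2020"
t_part, rest = s.split(" ET ", 1)          # "7:51 PM" and "Fri, 17 July 2020"
rest = rest.split(",", 1)[-1].strip()      # "17 July 2020"
s = f"{rest} {t_part}"                     # "17 July 2020 7:51 PM"

# The full-month format matches here; "%d %b %Y ..." would fail on "July".
parsed = datetime.strptime(s, "%d %B %Y %I:%M %p")
print(parsed)                              # 2020-07-17 19:51:00
```

Note that the `ET` marker is discarded, so the resulting `datetime` is timezone-naive.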


In [6]:
def clean_text(txt):
    txt = str(txt).strip().lower()
    txt = re.sub(r"\s+", " ", txt)               # collapse spaces
    return txt

df["clean_head"] = df["Headlines"].apply(clean_text)


In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, numpy as np

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model     = AutoModelForCausalLM.from_pretrained("distilgpt2")

# ✅ this is the shared token-embedding matrix
emb_layer = model.get_input_embeddings()          # same as model.transformer.wte

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

def embed_text(text: str) -> np.ndarray:
    with torch.no_grad():
        ids = tokenizer(text,
                        return_tensors="pt",
                        truncation=True,
                        max_length=128).input_ids.to(device)
        vec = emb_layer(ids).mean(dim=1).squeeze().cpu().numpy()
    return vec

embeddings = np.vstack([embed_text(t) for t in tqdm(df["clean_head"],
                                                    desc="Embedding headlines")])


Embedding headlines: 100%|██████████| 2800/2800 [00:00<00:00, 2949.87it/s]


This block prepares headline text and transforms each line into a 768-dimensional embedding that can later be indexed in FAISS.

* **Text normalisation**  
  `clean_text()` lower-cases every headline, trims leading/trailing whitespace, and collapses multiple spaces.  
  The result is stored in a new column `clean_head`.

* **Model and tokenizer loading**  
  `distilgpt2` is fetched via Hugging Face:  
  `AutoTokenizer` encodes/decodes text, while `AutoModelForCausalLM` supplies both generation capabilities and the token-embedding matrix accessed through `model.get_input_embeddings()`.

* **Device setup**  
  The model is moved to GPU if available (`torch.device("cuda" …)`), and `model.eval()` disables dropout for deterministic inference.

* **Embedding helper**  
  `embed_text()` →  
  1. tokenises a string (max 128 tokens),  
  2. looks up token vectors in the embedding layer,  
  3. averages them to obtain a single vector,  
  4. returns a NumPy array on CPU.

* **Embedding the entire corpus**  
  A `tqdm` progress bar iterates over `df["clean_head"]`; each headline is converted to a vector and stacked into `embeddings`, a 2-D array (`n_rows × 768`) ready for FAISS indexing.
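
One consequence of averaging token vectors is worth seeing in isolation: the pooled vector does not depend on word order. The sketch below uses a made-up four-word vocabulary and random 4-dimensional vectors in place of the real DistilGPT-2 embedding matrix, so only the pooling behaviour carries over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the tokenizer vocabulary and the embedding matrix.
vocab = {"tesla": 0, "shares": 1, "rise": 2, "fall": 3}
emb_matrix = rng.normal(size=(len(vocab), 4))

def mean_pool(tokens):
    """Look up each token's vector and average them into one vector."""
    ids = [vocab[t] for t in tokens]
    return emb_matrix[ids].mean(axis=0)

a = mean_pool(["tesla", "shares", "rise"])
b = mean_pool(["rise", "shares", "tesla"])   # same words, different order
c = mean_pool(["tesla", "shares", "fall"])   # one word changed

print(np.allclose(a, b))   # True: averaging discards word order
print(np.allclose(a, c))   # False: a different word shifts the mean
```

This is one reason the design-choices table lists sentence-level embedding models as a possible upgrade.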


In [8]:
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings.astype(np.float32))
print("Vectors in index:", index.ntotal)


Vectors in index: 2800


In [9]:
# ─── overwrite generate() with stricter decoding & trimming ──────────
import re
sent_split = re.compile(r"(?<=[.!?])\s+")

def generate(prompt: str) -> str:
    enc = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=120,
            do_sample=False,
            repetition_penalty=1.25,      # discourages loops
            no_repeat_ngram_size=4,        # no 4-gram repetition
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; avoids a warning
        )

    txt = tokenizer.decode(out[0], skip_special_tokens=True)

    # ── drop the echoed prompt: everything up to the *first* 'Answer concisely:' ──
    if "Answer concisely:" in txt:
        txt = txt.split("Answer concisely:", 1)[1].lstrip()

    # ── cut off if the model starts a new 'Question:' ────────────────
    if "\nQuestion:" in txt:
        txt = txt.split("\nQuestion:", 1)[0].rstrip()

    # ── optional: keep only first 2 sentences  ───────────────────────
    sents = sent_split.split(txt)
    txt   = " ".join(sents[:2]).strip()

    return txt

In [10]:
# ─── 6ʹ.1  Quick-price helper (close-of-day) ─────────────────────────
def price_series(ticker: str, days: int = 30):
    """
    Return [(YYYY-MM-DD, close)] for <days> trading days.
    Handles:
      • yfinance 0.2.x auto_adjust behaviour
      • DataFrame vs Series quirks
    """
    hist = yf.download(
        ticker,
        period=f"{days}d",
        progress=False,
        auto_adjust=False,    # keep raw Close / Adj Close numeric
        threads=False,
    )
    if hist.empty:
        return []

    col = "Adj Close" if "Adj Close" in hist.columns else "Close"
    closes = hist[col]

    # if closes is a 1-column DataFrame, squeeze to Series
    if isinstance(closes, pd.DataFrame):
        closes = closes.squeeze("columns")

    closes = pd.to_numeric(closes, errors="coerce").dropna()

    return [
        (idx.strftime("%Y-%m-%d"), float(price))
        for idx, price in closes.items()
    ]


def summarise_prices(series):
    """
    Turn [(date, price), ...] into a concise natural-language snippet.
    """
    if not series:
        return "No recent price data was found."

    first_day, first_price = series[0]
    last_day,  last_price  = series[-1]
    change = round(last_price - first_price, 2)
    pct    = round(100 * change / first_price, 2)

    trend  = "up" if change > 0 else "down" if change < 0 else "flat"
    line   = (f"From {first_day} to {last_day}, the stock moved {trend} "
              f"{change:+.2f} USD ({pct:+.2f} %). Last close: {last_price:.2f} USD.")

    # add a few intermediate points for colour (roughly every third of the window)
    step   = max(len(series) // 3, 1)
    midpts = " | ".join(f"{d}: {p:.2f}" for d, p in series[::step])
    return line + "  |  Samples → " + midpts

# ─── 6ʹ.2  Regex to detect tickers (simple all-caps 1–5 letters) ────
ticker_pat = re.compile(r"\b[A-Z]{1,5}\b")

# ─── 6ʹ.3  Main RAG function: news + optional price context ──────────
def rag_query(query: str, k=3, price_days=30):
    # 1) headline retrieval
    q_vec = embed_text(clean_text(query))[None, :].astype(np.float32)
    _, idx = index.search(q_vec, k)
    ctx_news = "\n".join(df.loc[i, "Headlines"] for i in idx[0])

    # 2) detect tickers in the question
    tickers = ticker_pat.findall(query)
    ctx_prices = ""
    if tickers:
        price_info = []
        for tk in tickers:
            series = price_series(tk, days=price_days)
            price_info.append(f"{tk}: {summarise_prices(series)}")
        ctx_prices = "\n\nStock-price context:\n" + "\n".join(price_info)

    # 3) build final prompt
    prompt = (
        f"News context:\n{ctx_news}"
        f"{ctx_prices}"
        f"\n\nQuestion: {query}\nAnswer concisely:"
    )
    return generate(prompt)


The cells above build the retrieval index, add live-price utilities, and define the end-to-end RAG query function.

### 1 Create the FAISS index
* `dim = embeddings.shape[1]` – dimension (768) of each headline vector.  
* `faiss.IndexFlatL2(dim)` – brute-force L2 index.  
* `index.add(…)` – inserts all 2 800 headline embeddings so k-NN search is possible.
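
What a flat L2 index computes can be reproduced in a few lines of NumPy. The helper `flat_l2_search` below is a hypothetical stand-in written for this explanation, not a FAISS API; like `IndexFlatL2`, it compares the query against every stored vector and reports squared Euclidean distances.

```python
import numpy as np

def flat_l2_search(index_vecs, query_vecs, k):
    """Exhaustive k-NN: returns (squared L2 distances, indices), each (n_queries, k)."""
    diff = query_vecs[:, None, :] - index_vecs[None, :, :]
    dist = (diff ** 2).sum(axis=2)                 # all pairwise squared distances
    idx = np.argsort(dist, axis=1)[:, :k]          # k closest per query
    return np.take_along_axis(dist, idx, axis=1), idx

vecs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
query = np.array([[0.9, 0.1]], dtype=np.float32)

dists, idx = flat_l2_search(vecs, query, k=2)
print(idx[0])      # nearest first
print(dists[0])
```

The cost grows linearly with the number of stored vectors, which is unproblematic at 2 800 headlines but is the reason approximate indexes exist for larger corpora.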

### 2 Live-price helpers
| Function | Role |
|----------|------|
| **`price_series()`** | Downloads the last *n* trading-day closes for a ticker via **yfinance** (handles `auto_adjust`, DataFrame→Series quirks) and returns a list of `(date, price)` tuples. |
| **`summarise_prices()`** | Converts that raw series into a single sentence: direction (up / down / flat), absolute and % change, last close, plus a few sample points. |
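
The arithmetic behind the summary sentence is small enough to check by hand. In the sketch below the last close (259.51) comes from the example prompt in section 4.4, while the first close (240.68) is inferred from that example's +18.83 USD change rather than fetched from any data source.

```python
first_price, last_price = 240.68, 259.51   # first value inferred, see note above

change = round(last_price - first_price, 2)          # absolute move in USD
pct = round(100 * change / first_price, 2)           # move relative to the first close
trend = "up" if change > 0 else "down" if change < 0 else "flat"

print(f"moved {trend} {change:+.2f} USD ({pct:+.2f} %). Last close: {last_price:.2f} USD.")
```

The percentage is computed against the first close of the window, so it describes the move over the whole period rather than a daily return.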

### 3 Ticker detection
* `ticker_pat = re.compile(r"\b[A-Z]{1,5}\b")` – simple regex that flags any 1–5-letter all-caps token (e.g. `TSLA`, `AAPL`) as a potential stock symbol.
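
The pattern's behaviour on a few queries shows both what it catches and what it misses. The whitelist filter `extract_tickers` and the `KNOWN_TICKERS` set are hypothetical additions for illustration; the notebook itself uses the bare regex.

```python
import re

ticker_pat = re.compile(r"\b[A-Z]{1,5}\b")

# Hypothetical whitelist; in practice this might be loaded from a symbol list.
KNOWN_TICKERS = {"TSLA", "AAPL", "MSFT"}

def extract_tickers(query, known=KNOWN_TICKERS):
    """Regex candidates filtered against a set of known symbols."""
    return [t for t in ticker_pat.findall(query) if t in known]

q1 = "Is TSLA a buy in the USA? I think so."
q2 = "What happened to Apple Stock?"

print(ticker_pat.findall(q1))    # ['TSLA', 'USA', 'I']
print(extract_tickers(q1))       # ['TSLA']
print(ticker_pat.findall(q2))    # []
```

The second query illustrates the other failure mode: a company name without its symbol yields no match, so no price context is added to the prompt.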

### 4 `rag_query()` – full Retrieval-Augmented Generation
| Step | Action |
|------|--------|
| **1 Headline retrieval** | `embed_text()` → FAISS `index.search` → top-k most similar headlines ⇢ `ctx_news`. |
| **2 Price context (optional)** | Regex extracts tickers from the user query. For each symbol: `price_series()` → `summarise_prices()` ⇢ `ctx_prices`. |
| **3 Prompt assembly** | Concatenates `News context`, optional `Stock-price context`, and the user **Question** with the directive **“Answer concisely:”**. |
| **4 Generation** | Passes the prompt to `generate()` (DistilGPT-2) and returns one trimmed, final answer that blends news narrative with up-to-date price information. |


### Custom `generate()` — tighter decoding and output cleanup

This re-implementation of `generate()` ensures the language model produces a **concise, single-paragraph answer** without looping.

* **Decoding safeguards**
  * `do_sample = False` → deterministic (greedy) decoding.
  * `repetition_penalty = 1.25` → discourages the model from repeating phrases.
  * `no_repeat_ngram_size = 4` → blocks any 4-gram from appearing twice.
  * `max_new_tokens = 120` → hard upper limit on answer length.

* **Post-processing pipeline**
  1. **Remove prompt echo**  
     Everything up to and including the first literal marker **`Answer concisely:`** (the one that ends the prompt) is stripped away.
  2. **Stop at the first unwanted repeat**  
     If the model starts writing a new *“Question:”* block, that tail is cut off.
  3. **Sentence trimming (optional)**  
     The regex `sent_split` segments text on punctuation ( `. ! ?` ), then keeps only the **first two sentences**, guaranteeing brevity.

The result is a clean, non-repetitive answer suitable for display in the Streamlit UI or console.
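
The three post-processing steps can be exercised without loading a model. The sketch below copies the string logic into a standalone helper, `trim_answer` (a name introduced here for illustration), and runs it on a made-up decoded string that imitates a prompt echo followed by a looping continuation.

```python
import re

sent_split = re.compile(r"(?<=[.!?])\s+")

def trim_answer(decoded: str) -> str:
    # 1) drop the echoed prompt: everything up to the first marker
    if "Answer concisely:" in decoded:
        decoded = decoded.split("Answer concisely:", 1)[1].lstrip()
    # 2) cut off a new question block the model started on its own
    if "\nQuestion:" in decoded:
        decoded = decoded.split("\nQuestion:", 1)[0].rstrip()
    # 3) keep at most the first two sentences
    return " ".join(sent_split.split(decoded)[:2]).strip()

# Invented model output, for illustration only.
decoded = (
    "News context:\nHeadline A\n\nQuestion: How is TSLA doing?\n"
    "Answer concisely: Shares rose. Demand improved. Analysts disagree."
    "\nQuestion: What about AAPL?"
)
print(trim_answer(decoded))   # Shares rose. Demand improved.
```

The order of the steps matters: the prompt itself contains a `Question:` line, so the prompt echo has to be removed before the second check runs.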


In [None]:
print(rag_query("What happened to Apple Stock?"))

Answer concisely: based on the dataset from cnbc, Apple's stock recently is around $ 200, and it is very bullish


### Final Thoughts  

RAG reframes large language models not as encyclopedic oracles but as agile **reasoning engines** that consult a living library on demand. As corpora grow and models improve, the synergy between retrieval and generation will underpin the next wave of intelligent applications—Q&A bots that cite peer‑reviewed studies, code copilots that surface internal best practices, and storytelling engines that weave user memories into personalized narratives. Mastery of the RAG workflow—supported by rigorous evaluation—equips you to build AI solutions that are *not only creative and coherent but also credible and grounded in truth*.

## References  

1. *ChatGPT* – conversational assistance and content generation for this notebook.  
2. *Prompting Guide — “Retrieval‑Augmented Generation”*  
   <https://www.promptingguide.ai/research/rag>  
3. *TensorFlow Documentation*  
   <https://www.tensorflow.org/>
4. *Kaggle dataset* – financial-news headlines used in this notebook.  
   <https://www.kaggle.com/datasets/notlucasp/financial-news-headlines>

## License  

Unless otherwise noted, all **code snippets** in this notebook are released under the **MIT License**, allowing free use, modification, and redistribution with attribution.

```text
Copyright © 2025 Chen Yang (NUID 002837912)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.