# Introduction to Retrieval-Augmented Generation with S&P 500 News

In this notebook, you will explore how to build a simple Retrieval-Augmented Generation (RAG) pipeline using financial news articles from S&P 500 companies.

We'll start by vectorizing text data, creating a vector store using FAISS, and integrating it with OpenAI's GPT models to answer questions using retrieved information.

This workflow emulates real-world systems in finance where natural language data (news, filings, analyst reports) are used to support decision-making.

# 📌 Objectives

By the end of this notebook, students will be able to:

1. **Perform Semantic Search with Metadata Filtering:**
   - Query the provided FAISS vector store to retrieve relevant financial news articles based on natural language questions.
   - Apply optional filters using metadata such as ticker or publication date to refine search results.

2. **Enrich Data with Company Metadata:**
   - Use the `yfinance` library to retrieve company-level metadata (company name, sector, industry) for tickers in the dataset.
   - Integrate this metadata to support enhanced filtering and analysis of news data.

3. **Build a Retrieval-Augmented Generation (RAG) Pipeline:**
   - Combine retrieved news snippets as context to generate answers using OpenAI’s GPT models.
   - Construct effective prompts that guide the language model to provide concise, context-aware responses.

4. **Evaluate and Analyze RAG Outputs:**
   - Review generated answers alongside the supporting news excerpts.
   - Reflect on the strengths and limitations of the simple RAG pipeline and consider potential improvements, such as adding more filters or refining retrieval strategies.

5. **Incorporate Financial Metadata into Retrieval Context:**
   - Enrich retrieved news snippets with key financial metadata including ticker, company name, sector, and industry.
   - Format prompts that combine both text excerpts and metadata to provide richer context to the language model.

6. **Generate Context-Aware Answers Using OpenAI Models:**
   - Construct and send prompts to an LLM that leverage both news content and metadata to produce concise, informed financial analysis.

7. **Compare Answers With and Without Metadata:**
   - Evaluate the impact of including financial metadata on answer quality using criteria such as clarity, detail, accuracy, and contextual relevance.
   - Summarize findings to reflect on the role of metadata in improving retrieval-augmented generation.

## Install and Import Important Libraries

First, we install and import the necessary libraries for:
- Text embedding generation (sentence-transformers)
- Efficient similarity search (faiss)
- Data manipulation (pandas, numpy)
- Visualization (matplotlib)

> ℹ️ For L2-normalized vectors, the inner product equals cosine similarity, so normalizing the embeddings lets a FAISS inner-product index rank results by cosine similarity.
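
The identity behind this note can be checked numerically. The sketch below uses made-up random vectors (not the notebook's embeddings) and plain NumPy to show that, once two vectors are scaled to unit length, their inner product matches their cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # illustrative dimension; any size works
b = rng.normal(size=384)

# Cosine similarity computed from the raw vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product computed after L2-normalizing each vector
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print(np.isclose(cosine, inner))
```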

In [3]:
%pip install sentence-transformers
%pip install faiss-cpu


In [4]:
from sentence_transformers import SentenceTransformer
import faiss
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import matplotlib.pyplot as plt

## Load news data
We load a CSV file of financial news, focusing on TITLE and SUMMARY, along with metadata like TICKER and PUBLICATION_DATE.
These will be embedded into vectors and used for semantic retrieval.
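
One thing to watch when concatenating `TITLE` and `SUMMARY`: in pandas, adding a string to a missing value yields a missing value, so a row with an empty summary would end up with no text to embed. Whether `df_news` contains such rows is not shown here. A minimal sketch on a made-up two-row frame (the rows are illustrative, not taken from the dataset) that fills gaps with empty strings first:

```python
import pandas as pd

# Toy data standing in for df_news; the second row has no summary
toy = pd.DataFrame({
    "TITLE": ["Fed minutes flag inflation risks", "Chipmaker beats estimates"],
    "SUMMARY": ["Policy makers voiced concern.", None],
})

# Replace missing values with empty strings before concatenating
title = toy["TITLE"].fillna("")
summary = toy["SUMMARY"].fillna("")

# Strip the dangling " : " separator left behind when one side is empty
toy["EMBEDDED_TEXT"] = (title + " : " + summary).str.strip(" :")

print(toy["EMBEDDED_TEXT"].tolist())
```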

In [5]:
K = 25

In [6]:
df_news = pd.read_csv('https://raw.githubusercontent.com/osvaldosanchezrmz/FZ4025.10/main/df_news.csv')
df_news['PUBLICATION_DATE'] = pd.to_datetime(df_news['PUBLICATION_DATE']).dt.date
display(df_news)


| | TICKER | TITLE | SUMMARY | PUBLICATION_DATE | PROVIDER | URL |
|---|---|---|---|---|---|---|
| 0 | MMM | 2 Dow Jones Stocks with Promising Prospects an... | The Dow Jones (^DJI) is made up of 30 of the m... | 2025-05-29 | StockStory | https://finance.yahoo.com/news/2-dow-jones-sto... |
| 1 | MMM | 3 S&P 500 Stocks Skating on Thin Ice | The S&P 500 (^GSPC) is often seen as a benchma... | 2025-05-27 | StockStory | https://finance.yahoo.com/news/3-p-500-stocks-... |
| 2 | MMM | 3M Rises 15.8% YTD: Should You Buy the Stock N... | MMM is making strides in the aerospace, indust... | 2025-05-22 | Zacks | https://finance.yahoo.com/news/3m-rises-15-8-y... |
| 3 | MMM | Q1 Earnings Roundup: 3M (NYSE:MMM) And The Res... | Quarterly earnings results are a good time to ... | 2025-05-22 | StockStory | https://finance.yahoo.com/news/q1-earnings-rou... |
| 4 | MMM | 3 Cash-Producing Stocks with Questionable Fund... | While strong cash flow is a key indicator of s... | 2025-05-19 | StockStory | https://finance.yahoo.com/news/3-cash-producin... |
| ... | ... | ... | ... | ... | ... | ... |
| 4866 | ZTS | 2 Dividend Stocks to Buy With $500 and Hold Fo... | Zoetis is a leading animal health company with... | 2025-05-23 | Motley Fool | https://www.fool.com/investing/2025/05/23/2-di... |
| 4867 | ZTS | Zoetis (NYSE:ZTS) Declares US$0.50 Dividend Pe... | Zoetis (NYSE:ZTS) recently affirmed a dividend... | 2025-05-22 | Simply Wall St. | https://finance.yahoo.com/news/zoetis-nyse-zts... |
| 4868 | ZTS | Jim Cramer on Zoetis (ZTS): “It Does Seem to B... | We recently published a list of Jim Cramer Tal... | 2025-05-21 | Insider Monkey | https://finance.yahoo.com/news/jim-cramer-zoet... |
| 4869 | ZTS | Zoetis (ZTS) Upgraded to Buy: Here's Why | Zoetis (ZTS) might move higher on growing opti... | 2025-05-21 | Zacks | https://finance.yahoo.com/news/zoetis-zts-upgr... |


In [7]:
df_news['EMBEDDED_TEXT'] = df_news['TITLE'] + ' : ' + df_news['SUMMARY']

In [8]:
model = SentenceTransformer('all-MiniLM-L6-v2')


## Implement FAISS vector store
We:
- Use a pre-trained sentence transformer (all-MiniLM-L6-v2) to embed documents.
- Normalize vectors to use cosine similarity.
- Create a FAISS index and implement a basic search function.

This will allow us to retrieve relevant news snippets given a natural language question.
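
For intuition, what a flat inner-product index computes can be sketched in plain NumPy: score every stored vector against the query and keep the `k` highest. The vectors below are random stand-ins, and `top_k_inner_product` is a hypothetical helper written for this illustration, not part of FAISS.

```python
import numpy as np

def top_k_inner_product(query, vectors, k=5):
    """Return (indices, scores) of the k rows of `vectors` most similar to `query`."""
    scores = vectors @ query           # one inner product per stored vector
    order = np.argsort(-scores)[:k]    # indices sorted by descending score
    return order, scores[order]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-length rows

# Query with a stored vector: it should come back as its own best match
indices, scores = top_k_inner_product(vectors[7], vectors, k=3)
print(indices, scores)
```

An exhaustive scan like this is exact but grows linearly with the number of documents, which is acceptable for a few thousand news items.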


In [9]:
# Load model and compute embeddings
text_embeddings = model.encode(df_news['EMBEDDED_TEXT'].tolist(), convert_to_numpy=True)

# Normalize embeddings to use cosine similarity (via inner product in FAISS)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Prepare metadata
documents = df_news['EMBEDDED_TEXT'].tolist()
metadata = [
    {
        'PUBLICATION_DATE': row['PUBLICATION_DATE'],
        'TICKER': row['TICKER'],
        'PROVIDER': row['PROVIDER']
    }
    for _, row in df_news.iterrows()
]



In [10]:
embedding_dim = text_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(embedding_dim)  # Cosine similarity via inner product
faiss_index.add(text_embeddings)

In [11]:
class FaissVectorStore:
    def __init__(self, model, index, embeddings, documents, metadata):
        self.model = model
        self.index = index
        self.embeddings = embeddings
        self.documents = documents
        self.metadata = metadata

    def search(self, query, k=5, metadata_filter=None):
        # Embed the query and L2-normalize it so inner product equals cosine similarity
        query_embedding = self.model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        if metadata_filter:
            filtered_indices = [i for i, meta in enumerate(self.metadata) if metadata_filter(meta)]
            if not filtered_indices:
                return []
            # Search a temporary index built only from the documents that pass the filter
            filtered_embeddings = self.embeddings[filtered_indices]
            temp_index = faiss.IndexFlatIP(filtered_embeddings.shape[1])
            temp_index.add(filtered_embeddings)
            # Cap k so FAISS does not pad the result with -1 indices
            D, I = temp_index.search(query_embedding, min(k, len(filtered_indices)))
            indices = [filtered_indices[i] for i in I[0]]
        else:
            D, I = self.index.search(query_embedding, k)
            indices = I[0]

        # D has shape (1, k) in both branches; D[0] holds the scores for the single query
        results = []
        for idx, sim in zip(indices, D[0]):
            results.append((self.documents[idx], self.metadata[idx], float(sim)))
        return results

In [12]:
# Create FAISS-based store
faiss_store = FaissVectorStore(
    model=model,
    index=faiss_index,
    embeddings=text_embeddings,
    documents=documents,
    metadata=metadata
)
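
The `metadata_filter` argument of `search` accepts any callable that takes one metadata dict and returns `True` or `False`. A sketch of building such callables for the ticker and date filters mentioned in the objectives; `make_filter` is a hypothetical helper, and the sample dicts only mimic the `metadata` entries built above.

```python
from datetime import date

def make_filter(ticker=None, start_date=None, end_date=None):
    """Build a predicate over metadata dicts with TICKER and PUBLICATION_DATE keys."""
    def predicate(meta):
        if ticker is not None and meta["TICKER"] != ticker:
            return False
        if start_date is not None and meta["PUBLICATION_DATE"] < start_date:
            return False
        if end_date is not None and meta["PUBLICATION_DATE"] > end_date:
            return False
        return True
    return predicate

sample = [
    {"TICKER": "MMM", "PUBLICATION_DATE": date(2025, 5, 29), "PROVIDER": "StockStory"},
    {"TICKER": "MMM", "PUBLICATION_DATE": date(2025, 5, 19), "PROVIDER": "StockStory"},
    {"TICKER": "ZTS", "PUBLICATION_DATE": date(2025, 5, 23), "PROVIDER": "Motley Fool"},
]

# Keep only 3M articles published on or after 20 May 2025
recent_mmm = make_filter(ticker="MMM", start_date=date(2025, 5, 20))
print([m for m in sample if recent_mmm(m)])
```

With the store above, such a predicate would be passed as `faiss_store.search(question, k=5, metadata_filter=recent_mmm)`.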

### Setup OpenAI Client

👉 **Instructions**:
- Import the `OpenAI` client from the `openai` Python library.
- You will need an **OpenAI API key** to use their models programmatically:
  - Go to [https://platform.openai.com/](https://platform.openai.com/) and sign up or log in.
  - Create an API key from your [API keys dashboard](https://platform.openai.com/account/api-keys).
  - ⚠️ **Keep your API key private** and **do not** share or hardcode it in public notebooks.
- Note that **usage of the OpenAI API is not free**. You will need to:
  - Add a payment method.
  - Monitor your usage to avoid unexpected charges.
  - Optionally set usage limits from your account settings.
- You can refer to the **course’s Study Resources** for a step-by-step guide on creating an OpenAI account and retrieving your API key.
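
One common way to keep the key out of the notebook itself is to read it from an environment variable and only prompt when it is missing. A minimal sketch; the helper name `load_api_key` and the default variable name are assumptions for illustration.

```python
import os
import getpass

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key from the environment, or prompt for it with hidden input."""
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive prompt; the typed value is not echoed
        key = getpass.getpass(f"Enter your API key ({env_var} is not set): ")
    return key
```

The result could then be passed to the client, for example `OpenAI(api_key=load_api_key())`.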

Then:
- Initialize the client with `OpenAI(api_key="YOUR_KEY_HERE")`.
- Send a test request using `.responses.create()` and the `"gpt-4o-mini"` model with a simple prompt:

  ```python
  response = client.responses.create(
      model="gpt-4o-mini",
      input="Write a one-sentence bedtime story about a unicorn."
  )
  print(response.output_text)
  ```


In [13]:
# CODE HERE
# Use as many coding cells as you need

In [21]:
from openai import OpenAI
import getpass

# Ask for the API key securely (input is hidden)
api_key = getpass.getpass("Enter your OpenAI API key: ")

# Initialize the OpenAI client
client = OpenAI(api_key=api_key)

# Test the client with a simple prompt
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a one-sentence bedtime story about a unicorn."}
    ]
)

print(response.choices[0].message.content)


Enter your OpenAI API key: ··········
As the moonlit sky shimmered, a brave unicorn named Luna soared over the whispering forest, sprinkling stardust to grant sweet dreams to all the children asleep below.


## Retrieve Additional Metadata from Yahoo Finance

👉 **Instructions**:
- We will enrich our news dataset by retrieving **company-level metadata** using the `yfinance` library.
- The goal is to map each unique stock ticker (`TICKER`) in the dataset to:
  - `COMPANY_NAME`
  - `SECTOR`
  - `INDUSTRY`

> ℹ️ `yfinance` fetches live data from Yahoo Finance. If you're running this in a cloud environment or during peak hours, expect some tickers to fail or rate limits to apply.

✅ After this step, you will have a new DataFrame (e.g. `df_meta`) with the columns `TICKER`, `COMPANY_NAME`, `SECTOR`, `INDUSTRY` that maps tickers to their company names, sectors, and industries. This metadata will be useful later to add filters and analysis based on sector or industry categories.
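
Once such a mapping exists, attaching it to each news row is a left join on `TICKER`. A small sketch with toy frames standing in for `df_news` and `df_meta`; the ticker `XYZ` is made up to show what happens when a lookup failed and no metadata row exists.

```python
import pandas as pd

news = pd.DataFrame({
    "TICKER": ["MMM", "XYZ"],
    "TITLE": ["3 S&P 500 Stocks Skating on Thin Ice", "Placeholder headline"],
})
meta = pd.DataFrame({
    "TICKER": ["MMM", "ABT"],
    "COMPANY_NAME": ["3M Company", "Abbott Laboratories"],
    "SECTOR": ["Industrials", "Healthcare"],
    "INDUSTRY": ["Conglomerates", "Medical Devices"],
})

# A left join keeps every news row; tickers without metadata get missing values
enriched = news.merge(meta, on="TICKER", how="left")

# Replace missing metadata with a visible placeholder
meta_cols = ["COMPANY_NAME", "SECTOR", "INDUSTRY"]
enriched[meta_cols] = enriched[meta_cols].fillna("N/A")

print(enriched)
```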


In [14]:
# CODE HERE
# Use as many coding cells as you need

In [22]:
%pip install yfinance



In [23]:
import yfinance as yf
import time

# Get unique tickers from the news dataset
unique_tickers = df_news['TICKER'].dropna().unique()

# Collect metadata in a list
metadata_list = []

for ticker in unique_tickers:
    try:
        info = yf.Ticker(ticker).info
        metadata_list.append({
            'TICKER': ticker,
            'COMPANY_NAME': info.get('longName', 'N/A'),
            'SECTOR': info.get('sector', 'N/A'),
            'INDUSTRY': info.get('industry', 'N/A')
        })
        time.sleep(0.5)  # Slight delay to avoid rate limiting
    except Exception as e:
        print(f"Failed to fetch data for {ticker}: {e}")

# Convert the list to a DataFrame
df_meta = pd.DataFrame(metadata_list)

# Display the metadata
df_meta.head()


| | TICKER | COMPANY_NAME | SECTOR | INDUSTRY |
|---|---|---|---|---|
| 0 | MMM | 3M Company | Industrials | Conglomerates |
| 1 | AOS | A. O. Smith Corporation | Industrials | Specialty Industrial Machinery |
| 2 | ABT | Abbott Laboratories | Healthcare | Medical Devices |
| 3 | ABBV | AbbVie Inc. | Healthcare | Drug Manufacturers - General |
| 4 | ACN | Accenture plc | Technology | Information Technology Services |


## Retrieval-Augmented Generation (RAG): Retrieve Documents and Generate Answers

👉 **Instructions**:

In this part of the assignment, your task is to build a simple Retrieval-Augmented Generation (RAG) pipeline that:

- Takes a user question as input.
- Searches the FAISS vector store to find a set of relevant financial news articles based on semantic similarity.
- Uses the retrieved news articles as context to generate a clear, concise answer to the question by interacting with the OpenAI language model.
- Returns both the generated answer and the underlying news snippets used for context.

### What you need to focus on:

- Implement a retrieval mechanism to query your vector store and obtain the top relevant documents for any question.
- Construct prompts that effectively combine retrieved news content with the user’s question to guide the language model’s response.
- Use the OpenAI API to generate answers grounded in the retrieved context.
- Organize the outputs so that for each question, you have:
  - The generated answer.
  - The collection of news excerpts used to produce that answer.

### What you will be provided:

- Helper functions to display outputs in markdown format.
- Lists of example questions covering topics, companies, and industries to test your implementation.

---

Your solution can take any form or structure you find appropriate, as long as it fulfills these core objectives. This exercise will give you hands-on experience with integrating retrieval and generation for practical applications in finance.
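
One possible shape for such a pipeline is sketched below, with the retrieval and generation steps passed in as plain callables so the structure can be tried without an index or an API key. The names `build_prompt`, `rag_answer`, `search_fn`, and `generate_fn` are hypothetical; in this notebook they would wrap the vector store's `search` method and an OpenAI call.

```python
def build_prompt(question, snippets):
    """Combine retrieved snippets and the user question into one prompt string."""
    context = "\n\n".join(f"- {s}" for s in snippets)
    return (
        "Based on the following financial news excerpts, answer the question below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def rag_answer(question, search_fn, generate_fn, k=5):
    """Retrieve k snippets, generate an answer, and return both."""
    results = search_fn(question, k)  # list of (text, metadata, score) tuples
    snippets = [text for text, _meta, _score in results]
    answer = generate_fn(build_prompt(question, snippets))
    return {"question": question, "answer": answer, "snippets": snippets}

# Stand-ins for the real retriever and language model
def fake_search(question, k):
    docs = [("Fed minutes flag US inflation risks", {"TICKER": "N/A"}, 0.9)]
    return docs[:k]

def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"

result = rag_answer("What are the concerns about inflation?", fake_search, fake_generate)
print(result["answer"])
print(result["snippets"])
```

Returning a dict rather than only printing keeps the answer and its supporting snippets together, which makes the later review step easier.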


#### Print markdown
You can use the following function to render answers from `gpt-4o-mini` as Markdown.

In [29]:
from IPython.display import Markdown, display

def print_markdown(text):
    display(Markdown(text))

#### Predefined questions

In [25]:
questions_topic = [
"What are the major concerns expressed in financial news about inflation?",
"How is investor sentiment described in recent financial headlines?",
"What role is artificial intelligence playing in recent finance-related news stories?"
]

questions_company = [
"How is Microsoft being portrayed in news stories about artificial intelligence?",
"What financial news headlines connect Amazon with automation or logistics?"
]

questions_industry = [
"What are the main themes emerging in financial news about the semiconductor industry?",
"What trends are being reported in the retail industry?",
"What risks or challenges are discussed in recent news about the energy industry?"
]

In [17]:
# CODE HERE
# Use as many coding cells as you need

In [26]:
def generate_rag_answer(question, vector_store, client, model_name="gpt-4o-mini", k=5):
    # 1. Retrieve top-k relevant documents
    retrieved = vector_store.search(question, k=k)

    # 2. Build context block from retrieved news
    context_snippets = [doc for doc, meta, score in retrieved]
    context_block = "\n\n".join(f"- {text}" for text in context_snippets)

    # 3. Build prompt for OpenAI
    prompt = (
        f"Based on the following financial news excerpts, answer the question below.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )

    # 4. Generate answer with OpenAI
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )

    answer = response.choices[0].message.content

    # 5. Display using markdown
    print_markdown(f"### ❓ Question:\n{question}\n\n---\n\n### 🧠 Answer:\n{answer}\n\n---\n\n### 📰 Context Snippets:")
    for snippet in context_snippets:
        print_markdown(f"- {snippet}")
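
Note that `question` is formatted into the prompt as a single string, so passing a whole list places all of its questions in one prompt. To retrieve context and generate an answer for each question separately, the lists can be looped over. A sketch, where `run_questions` is a hypothetical helper and `answer_fn` stands for any function taking one question string:

```python
def run_questions(questions, answer_fn):
    """Call answer_fn once per question and collect the results in order."""
    results = []
    for question in questions:
        results.append((question, answer_fn(question)))
    return results

questions = [
    "What trends are being reported in the retail industry?",
    "What risks or challenges are discussed in recent news about the energy industry?",
]

# Stand-in for the real pipeline so the sketch runs without an API key
def echo_answer(question):
    return f"answer to: {question}"

for question, answer in run_questions(questions, echo_answer):
    print(question, "->", answer)
```

In the notebook, `answer_fn` could be `lambda q: generate_rag_answer(q, faiss_store, client, k=5)`.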


In [30]:
generate_rag_answer(
    question=questions_topic,
    vector_store=faiss_store,
    client=client,
    k=5
)


### ❓ Question:
['What are the major concerns expressed in financial news about inflation?', 'How is investor sentiment described in recent financial headlines?', 'What role is artificial intelligence playing in recent finance-related news stories?']

---

### 🧠 Answer:
1. **Major Concerns Expressed About Inflation**: The financial news mentions that the Federal Reserve's minutes from May indicate mounting concern over persistent US inflation and the potential for an economic slowdown. This suggests that there are worries about inflation not only continuing but also impacting the overall economy.

2. **Investor Sentiment Described in Recent Financial Headlines**: Investor sentiment appears to be mixed in recent headlines. On one hand, there is overwhelming bullishness on certain Wall Street stocks with notable price targets suggesting significant upside potential. However, skepticism is also expressed regarding analysts' forecasts due to institutional pressures and a reluctance to issue sell ratings. Additionally, some stocks are described as "hyped up" and outperforming the market due to positive catalysts, suggesting a sense of excitement among investors.

3. **Role of Artificial Intelligence in Recent Finance-Related News Stories**: The excerpts provided do not directly mention artificial intelligence. However, it can be inferred that positive developments in certain stocks may be influenced by advancements in technology, possibly including artificial intelligence, particularly if new products or innovations in the sector are highlighted. But specific references to AI in the context of these financial news excerpts are absent.

---

### 📰 Context Snippets:

- 3 of Wall Street’s Favorite Stocks Facing Headwinds : Wall Street has set ambitious price targets for the stocks in this article. While this suggests attractive upside potential, it’s important to remain skeptical because analysts face institutional pressures that can sometimes lead to overly optimistic forecasts.

- 3 of Wall Street’s Favorite Stocks with Questionable Fundamentals : Wall Street is overwhelmingly bullish on the stocks in this article, with price targets suggesting significant upside potential. However, it’s worth remembering that analysts rarely issue sell ratings, partly because their firms often seek other business from the same companies they cover.

- 3 Hyped Up  Stocks Facing Headwinds : Great things are happening to the stocks in this article. They’re all outperforming the market over the last month because of positive catalysts such as a new product line, constructive news flow, or even a loyal Reddit fanbase.

- Bitcoin price slips as Fed minutes flag US inflation risks : The Federal Reserve’s May policy meeting revealed mounting concern over persistent US inflation and the potential for economic slowdown.

- 1 Surging  Stock with Exciting Potential and 2 to Avoid : Exciting developments are taking place for the stocks in this article. They’ve all surged ahead of the broader market over the last month as catalysts such as new products and positive media coverage have propelled their returns.

In [31]:
generate_rag_answer(
    question=questions_company,
    vector_store=faiss_store,
    client=client,
    k=5
)




### ❓ Question:
['How is Microsoft being portrayed in news stories about artificial intelligence?', 'What financial news headlines connect Amazon with automation or logistics?']

---

### 🧠 Answer:
1. Microsoft is being portrayed as facing challenges and making strategic decisions regarding its investments in artificial intelligence, particularly in the context of data centers. The news mentions that Microsoft recently halted its planned data centers in Ohio, which suggests that there are concerns about the AI market, such as the "AI bubble" and uncertainty regarding demand.

2. The financial news headlines that connect Amazon with automation or logistics include discussions about Amazon Web Services reconsidering some leases possibly due to the AI bubble concerns and its involvement in partnerships with companies like ServiceNow leveraging AI-driven solutions. These connections highlight Amazon's engagement in AI-related automation within its logistics and cloud services framework.

---

### 📰 Context Snippets:

- This "Magnificent Seven" Stock Is Set to Skyrocket If Its AI Investments Pay Off : Meta Platforms has investments in several AI applications.  The tech giant's stock is only valued on its legacy business.  Over the past two-and-a-half years, investors have heard about various artificial intelligence (AI) investments that tech companies are making.

- Jack Henry (JKHY) Integrates AI-Driven Lending Tech With Algebrik : We recently published a list of 12 AI News Investors Should Not Miss This Week. In this article, we are going to take a look at where Jack Henry & Associates, Inc. (NASDAQ:JKHY) stands against other AI news Investors should not miss this week. Artificial Intelligence (AI) is known to increase productivity, decrease human error, […]

- Nvidia can't be stopped, Apple falls behind, and the AI data center race: Tech news roundup : When Microsoft (MSFT) pulled the plug on planned data centers in Ohio last month and a Wells Fargo (WFC) report suggested Amazon (AMZN) Web Services was reconsidering some leases, market watchers quickly diagnosed the symptoms: AI bubble concerns, demand uncertainty, and the inevitable cooldown after years of breakneck expansion.

- How Salesforce has 'overcorrected' by leaning into AI : D.A. Davidson head of technology research Gil Luria joins Market Domination to discuss Salesforce (CRM) earnings and the company's trajectory. Luria says Salesforce is "too focused" on artificial intelligence (AI), as the other parts of its business "rapidly" decelerate and the company loses market share to competitors. Luria has the equivalent of a Sell rating on the stock. To watch more expert insights and analysis on the latest market action, check out more Market Domination here.

- ServiceNow Regenerates On Swarm Of AI Deals With Amazon, Microsoft And More : Teaming up with AI giants like Amazon, Microsoft and others, ServiceNow stock has rebounded and stands poised to break out.

In [32]:
generate_rag_answer(
    question=questions_industry,
    vector_store=faiss_store,
    client=client,
    k=5
)


### ❓ Question:
['What are the main themes emerging in financial news about the semiconductor industry?', 'What trends are being reported in the retail industry?', 'What risks or challenges are discussed in recent news about the energy industry?']

---

### 🧠 Answer:
Based on the financial news excerpts provided, here are the answers to your questions:

1. **Main themes emerging in financial news about the semiconductor industry:**
   - A focus on international revenue trends and their impact on Wall Street forecasts for ON Semiconductor Corp.
   - Increased investor attention towards ON Semiconductor, indicating a growing interest in its stock performance and future prospects.
   - Recent earnings reports and shareholder sentiment suggesting that soft earnings have not deterred investor confidence in ON Semiconductor.

2. **Trends being reported in the retail industry:**
   - The excerpts provided do not specifically mention any trends in the retail industry. Therefore, no information is available regarding retail sector trends based on the excerpts given.

3. **Risks or challenges discussed in recent news about the energy industry:**
   - The excerpts do not contain specific information regarding risks or challenges in the energy industry. The only mention related to the energy sector discusses positive return trends for Entergy (NYSE:ETR), but does not detail any associated risks or challenges.

Overall, the excerpts primarily highlight themes about the semiconductor industry centered around revenue trends, investor interest, and earnings performance, while lacking specific insights into the retail and energy sectors.

---

### 📰 Context Snippets:

- Investing in ON Semiconductor Corp. (ON)? Don't Miss Assessing Its International Revenue Trends : Explore ON Semiconductor Corp.'s (ON) international revenue trends and how these numbers impact Wall Street's forecasts and what's ahead for the stock.

- ON Semiconductor Corporation (ON) is Attracting Investor Attention: Here is What You Should Know : Recently, Zacks.com users have been paying close attention to ON Semiconductor Corp. (ON). This makes it worthwhile to examine what the stock has in store.

- The Return Trends At Entergy (NYSE:ETR) Look Promising : What trends should we look for it we want to identify stocks that can multiply in value over the long term? One common...

- Packaging Corporation of America (NYSE:PKG) Hasn't Managed To Accelerate Its Returns : What trends should we look for it we want to identify stocks that can multiply in value over the long term? Amongst...

- Some May Be Optimistic About ON Semiconductor's (NASDAQ:ON) Earnings : Soft earnings didn't appear to concern ON Semiconductor Corporation's ( NASDAQ:ON ) shareholders over the last week...

## Analysis & Questions - Section 1

### Analysis and Reflection on Retrieval and Generation Results
After running the RAG pipeline and obtaining answers along with their supporting news excerpts, take some time to carefully review both the generated responses and the retrieved contexts.

- **For each question, read the answer and then the corresponding news snippets used as context.**

- Reflect on the following points and document your observations:
1. **Relevance**
2. **Completeness**  
3. **Bias or Noise**
4. **Consistency**  
5. **Improvement Ideas**   

and answer the questions below:

#### **Question 1.** How well do the retrieved news snippets support the generated answer? Are the key facts or themes in the answer clearly grounded in the context?

The snippets mostly support the answers well. In RAG v2, the answers feel more grounded because the metadata gives extra context. In RAG v1, some answers are a bit general, but still related to the content.

#### **Question 2.** Does the answer fully address the question, or does it leave important aspects out? Consider if the retrieved context provided enough information to generate a thorough response.

Most answers cover the main idea, but some questions don’t get fully answered when the retrieved snippets don’t include enough detail. This is more noticeable in broad topics like retail or energy, where the model didn’t have much to work with.
  

#### **Question 3.** Are there any irrelevant or misleading snippets retrieved that may have influenced the answer? How might this affect the quality of the output?

Yes, a few snippets felt off-topic, especially in RAG v1. This sometimes made the answer less focused. In RAG v2, adding metadata helped reduce that problem and made the answers more accurate.


#### **Question 4.**  Do the news snippets show consistent information, or are there conflicting viewpoints? How does the LLM handle potential contradictions in the context?

Most snippets show consistent information, but in company questions, there were some mixed opinions. The model handled this well, giving balanced answers without ignoring different points of view.
  

#### **Question 5.**  Based on your observations, suggest ways the retrieval or generation process could be improved (e.g., better filtering, adjusting `k`, refining prompt design).

Filtering results by keywords or metadata could help avoid unrelated snippets. Also, changing the number of retrieved documents depending on the question type, and improving the prompt instructions, could lead to better answers.
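
As one illustration of the filtering idea, retrieved results could be dropped when their similarity score falls below a cutoff, so weakly related snippets never reach the prompt. This is a sketch under assumptions: `filter_by_score` is a hypothetical helper, the `0.4` cutoff is arbitrary and would need tuning, and the tuples only mimic the `(text, metadata, score)` results returned by the vector store (the scores are made up).

```python
def filter_by_score(results, min_score=0.4, min_keep=1):
    """Keep results scoring at least min_score, but never fewer than min_keep."""
    ranked = sorted(results, key=lambda r: r[2], reverse=True)
    kept = [r for r in ranked if r[2] >= min_score]
    # Fall back to the best-ranked results if the cutoff removed too many
    return kept if len(kept) >= min_keep else ranked[:min_keep]

results = [
    ("ON Semiconductor is attracting investor attention", {"TICKER": "ON"}, 0.62),
    ("The return trends at Entergy look promising", {"TICKER": "ETR"}, 0.31),
    ("Packaging Corporation of America returns", {"TICKER": "PKG"}, 0.28),
]

for text, meta, score in filter_by_score(results):
    print(f"{score:.2f} {meta['TICKER']} {text}")
```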

## 🧠 Retrieval-Augmented Generation (RAG) v2: Adding Financial Metadata to Improve Generation

👉 **Instructions**:

In this part of the assignment, you’ll enhance your Retrieval-Augmented Generation (RAG) pipeline by incorporating *financial metadata* to provide more contextually rich answers.

Your goal is to evaluate whether metadata such as **company name**, **sector**, and **industry** helps the LLM generate **more accurate and grounded answers** to financial questions.

---

### ✅ What your updated pipeline should do:

- Retrieve relevant financial news articles using semantic similarity with FAISS.
- Enrich each retrieved document with financial metadata:
  - Ticker symbol
  - Full company name
  - Sector (e.g., Technology, Energy)
  - Industry (e.g., Semiconductors, Retail)
- Construct prompts that include both:
  - Retrieved news text
  - Associated metadata
- Send the prompt to the OpenAI model to generate an informed response.
- Return:
  - The final answer
  - The exact set of contextual documents used to produce that answer
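
A sketch of the enrichment and formatting steps, assuming a `df_meta`-style frame with `TICKER`, `COMPANY_NAME`, `SECTOR`, and `INDUSTRY` columns. Turning the frame into a dict keyed by ticker makes each lookup a single dictionary access; `format_snippet` and the bracketed layout are illustrative choices, not a required format.

```python
import pandas as pd

meta = pd.DataFrame({
    "TICKER": ["MMM", "ABT"],
    "COMPANY_NAME": ["3M Company", "Abbott Laboratories"],
    "SECTOR": ["Industrials", "Healthcare"],
    "INDUSTRY": ["Conglomerates", "Medical Devices"],
})

# Maps each ticker to {"COMPANY_NAME": ..., "SECTOR": ..., "INDUSTRY": ...}
meta_by_ticker = meta.set_index("TICKER").to_dict("index")

def format_snippet(text, ticker, lookup):
    """Prefix a news snippet with its company metadata, using N/A when unknown."""
    info = lookup.get(ticker, {})
    return (
        f"- [{info.get('COMPANY_NAME', 'N/A')}] ({ticker}) | "
        f"Sector: {info.get('SECTOR', 'N/A')} | "
        f"Industry: {info.get('INDUSTRY', 'N/A')}\n"
        f"  {text}"
    )

print(format_snippet("3M Rises 15.8% YTD", "MMM", meta_by_ticker))
print(format_snippet("Unknown company news", "XYZ", meta_by_ticker))
```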

---

### 🧪 Evaluation and Comparison:

You will test your improved RAG pipeline on the same three types of questions provided earlier:
- **Topic-focused** (e.g., inflation, interest rates)
- **Company-focused** (e.g., questions about Tesla, Nvidia)
- **Industry-focused** (e.g., semiconductors, utilities)


In [18]:
# CODE HERE
# Use as many coding cells as you need

In [33]:
def generate_rag_v2_answer(question, vector_store, client, df_meta, model_name="gpt-4o-mini", k=5):
    # Step 1: Retrieve top-k relevant documents
    retrieved = vector_store.search(question, k=k)

    # Step 2: Enrich each snippet with metadata
    enriched_snippets = []
    for text, meta, score in retrieved:
        ticker = meta.get("TICKER", "N/A")
        row = df_meta[df_meta["TICKER"] == ticker].head(1)
        company_name = row["COMPANY_NAME"].values[0] if not row.empty else "N/A"
        sector = row["SECTOR"].values[0] if not row.empty else "N/A"
        industry = row["INDUSTRY"].values[0] if not row.empty else "N/A"

        enriched_snippets.append({
            "text": text,
            "ticker": ticker,
            "company_name": company_name,
            "sector": sector,
            "industry": industry
        })

    # Step 3: Build enriched context for the prompt
    context_block = "\n\n".join([
        f"- [{e['company_name']}] ({e['ticker']}) | Sector: {e['sector']} | Industry: {e['industry']}\n  {e['text']}"
        for e in enriched_snippets
    ])

    # Step 4: Build the prompt
    prompt = (
        f"Based on the following financial news excerpts and metadata, answer the question below.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )

    # Step 5: Get the answer from OpenAI
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )

    answer = response.choices[0].message.content

    # Step 6: Display the result
    print_markdown(f"### ❓ Question:\n{question}\n\n---\n\n### 🧠 Answer:\n{answer}\n\n---\n\n### 📰 Context + Metadata:")
    for e in enriched_snippets:
        snippet = (
            f"- **{e['company_name']}** ({e['ticker']}) | Sector: {e['sector']}, Industry: {e['industry']}\n"
            f"  {e['text']}"
        )
        print_markdown(snippet)


In [35]:
generate_rag_v2_answer(
    question=questions_topic,
    vector_store=faiss_store,
    client=client,
    df_meta=df_meta,
    k=5
)



### ❓ Question:
['What are the major concerns expressed in financial news about inflation?', 'How is investor sentiment described in recent financial headlines?', 'What role is artificial intelligence playing in recent finance-related news stories?']

---

### 🧠 Answer:
1. **Major concerns expressed in financial news about inflation**: The primary concern regarding inflation highlighted in the news is the mounting worry over persistent inflation rates and the potential for an economic slowdown, as indicated by the Federal Reserve's May policy meeting minutes. This suggests that there is anxiety about the impact of inflation on economic stability and future monetary policy decisions.

2. **Investor sentiment described in recent financial headlines**: Investor sentiment appears to be mixed. In some articles, there is overwhelming bullish sentiment toward certain stocks with suggested significant upside potential. However, skepticism is also noted regarding analysts' price targets due to institutional pressures and the tendency for analysts not to issue sell ratings. This indicates that while there is optimism, there is also caution based on questionable fundamentals and potential overvaluation.

3. **Role of artificial intelligence in recent finance-related news stories**: The excerpts provided do not specifically mention the role of artificial intelligence in recent finance-related news stories. Therefore, it is unclear how AI is being referenced or its influence in current financial discussions based on the provided excerpts. Further information would be required to assess AI's impact.

---

### 📰 Context + Metadata:

- **CarMax, Inc.** (KMX) | Sector: Consumer Cyclical, Industry: Auto & Truck Dealerships
  3 of Wall Street’s Favorite Stocks Facing Headwinds : Wall Street has set ambitious price targets for the stocks in this article. While this suggests attractive upside potential, it’s important to remain skeptical because analysts face institutional pressures that can sometimes lead to overly optimistic forecasts.

- **Revvity, Inc.** (RVTY) | Sector: Healthcare, Industry: Diagnostics & Research
  3 of Wall Street’s Favorite Stocks with Questionable Fundamentals : Wall Street is overwhelmingly bullish on the stocks in this article, with price targets suggesting significant upside potential. However, it’s worth remembering that analysts rarely issue sell ratings, partly because their firms often seek other business from the same companies they cover.

- **Microchip Technology Incorporated** (MCHP) | Sector: Technology, Industry: Semiconductors
  3 Hyped Up  Stocks Facing Headwinds : Great things are happening to the stocks in this article. They’re all outperforming the market over the last month because of positive catalysts such as a new product line, constructive news flow, or even a loyal Reddit fanbase.

- **BlackRock, Inc.** (BLK) | Sector: Financial Services, Industry: Asset Management
  Bitcoin price slips as Fed minutes flag US inflation risks : The Federal Reserve’s May policy meeting revealed mounting concern over persistent US inflation and the potential for economic slowdown.

- **ResMed Inc.** (RMD) | Sector: Healthcare, Industry: Medical Instruments & Supplies
  1 Surging  Stock with Exciting Potential and 2 to Avoid : Exciting developments are taking place for the stocks in this article. They’ve all surged ahead of the broader market over the last month as catalysts such as new products and positive media coverage have propelled their returns.

In [34]:
generate_rag_v2_answer(
    question=questions_company,
    vector_store=faiss_store,
    client=client,
    df_meta=df_meta,
    k=5
)



### ❓ Question:
['How is Microsoft being portrayed in news stories about artificial intelligence?', 'What financial news headlines connect Amazon with automation or logistics?']

---

### 🧠 Answer:
In the news stories about artificial intelligence, Microsoft Corporation (MSFT) is being portrayed as a major player and collaborator in the AI space. The excerpt notes that Microsoft is involved in significant AI deals with other tech giants, such as Amazon, indicating that Microsoft is actively engaging in partnerships to bolster its AI capabilities. This portrayal suggests that Microsoft is well-positioned within the AI ecosystem and is contributing to advancements in technology through collaboration with other leading companies.

Regarding financial news headlines that connect Amazon with automation or logistics, the article about Nvidia mentions that "Microsoft (MSFT) pulled the plug on planned data centers in Ohio last month and a Wells Fargo (WFC) report suggested Amazon (AMZN) Web Services was reconsidering some leases." This connection implies that Amazon may be evaluating its real estate needs in relation to its cloud services and could be addressing automation or logistics considerations within its operations. Further details about Amazon's automation or logistics strategies are not explicitly mentioned in the provided excerpts, but the context suggests that Amazon is reassessing its investments in areas related to these operational aspects.

---

### 📰 Context + Metadata:

- **Meta Platforms, Inc.** (META) | Sector: Communication Services, Industry: Internet Content & Information
  This "Magnificent Seven" Stock Is Set to Skyrocket If Its AI Investments Pay Off : Meta Platforms has investments in several AI applications.  The tech giant's stock is only valued on its legacy business.  Over the past two-and-a-half years, investors have heard about various artificial intelligence (AI) investments that tech companies are making.

- **Jack Henry & Associates, Inc.** (JKHY) | Sector: Technology, Industry: Information Technology Services
  Jack Henry (JKHY) Integrates AI-Driven Lending Tech With Algebrik : We recently published a list of 12 AI News Investors Should Not Miss This Week. In this article, we are going to take a look at where Jack Henry & Associates, Inc. (NASDAQ:JKHY) stands against other AI news Investors should not miss this week. Artificial Intelligence (AI) is known to increase productivity, decrease human error, […]

- **NVIDIA Corporation** (NVDA) | Sector: Technology, Industry: Semiconductors
  Nvidia can't be stopped, Apple falls behind, and the AI data center race: Tech news roundup : When Microsoft (MSFT) pulled the plug on planned data centers in Ohio last month and a Wells Fargo (WFC) report suggested Amazon (AMZN) Web Services was reconsidering some leases, market watchers quickly diagnosed the symptoms: AI bubble concerns, demand uncertainty, and the inevitable cooldown after years of breakneck expansion.

- **Salesforce, Inc.** (CRM) | Sector: Technology, Industry: Software - Application
  How Salesforce has 'overcorrected' by leaning into AI : D.A. Davidson head of technology research Gil Luria joins Market Domination to discuss Salesforce (CRM) earnings and the company's trajectory. Luria says Salesforce is "too focused" on artificial intelligence (AI), as the other parts of its business "rapidly" decelerate and the company loses market share to competitors. Luria has the equivalent of a Sell rating on the stock. To watch more expert insights and analysis on the latest market action, check out more Market Domination here.

- **Microsoft Corporation** (MSFT) | Sector: Technology, Industry: Software - Infrastructure
  ServiceNow Regenerates On Swarm Of AI Deals With Amazon, Microsoft And More : Teaming up with AI giants like Amazon, Microsoft and others, ServiceNow stock has rebounded and stands poised to break out.

In [36]:
generate_rag_v2_answer(
    question=questions_industry,
    vector_store=faiss_store,
    client=client,
    df_meta=df_meta,
    k=5
)


### ❓ Question:
['What are the main themes emerging in financial news about the semiconductor industry?', 'What trends are being reported in the retail industry?', 'What risks or challenges are discussed in recent news about the energy industry?']

---

### 🧠 Answer:
Based on the provided financial news excerpts, here are the main themes and trends regarding the specified industries:

**Semiconductor Industry:**
1. **International Revenue Trends:** There is a focus on how international revenue trends for ON Semiconductor Corporation (ON) could influence Wall Street forecasts and potential stock performance.
2. **Investor Attention:** ON Semiconductor is attracting significant attention from investors, indicating a growing interest in the stock and its potential for future performance.
3. **Earnings Sentiment:** Despite reports of soft earnings, there seems to be an optimistic sentiment among shareholders regarding ON Semiconductor, suggesting a belief in the company’s potential for recovery or growth.

**Retail Industry:** 
The provided excerpts do not mention any companies or themes specifically related to the retail industry, so we cannot draw any trends from this material.

**Energy Industry:**
1. **Return Trends:** Entergy Corporation (ETR) is highlighted for its promising return trends, which suggests a positive outlook for the company in the utilities sector.
2. **Long-term Value Potential:** There is a general inquiry into what trends indicate long-term value growth in stocks within the utilities sector, hinting at potential risks or areas to watch for investors.

Overall, while the semiconductor industry shows promising indicators and investor optimism surrounding ON Semiconductor, there is a lack of information on the retail industry, and the energy industry appears to be assessing return trends with an eye toward long-term value.

---

### 📰 Context + Metadata:

- **ON Semiconductor Corporation** (ON) | Sector: Technology, Industry: Semiconductors
  Investing in ON Semiconductor Corp. (ON)? Don't Miss Assessing Its International Revenue Trends : Explore ON Semiconductor Corp.'s (ON) international revenue trends and how these numbers impact Wall Street's forecasts and what's ahead for the stock.

- **ON Semiconductor Corporation** (ON) | Sector: Technology, Industry: Semiconductors
  ON Semiconductor Corporation (ON) is Attracting Investor Attention: Here is What You Should Know : Recently, Zacks.com users have been paying close attention to ON Semiconductor Corp. (ON). This makes it worthwhile to examine what the stock has in store.

- **Entergy Corporation** (ETR) | Sector: Utilities, Industry: Utilities - Regulated Electric
  The Return Trends At Entergy (NYSE:ETR) Look Promising : What trends should we look for it we want to identify stocks that can multiply in value over the long term? One common...

- **Packaging Corporation of America** (PKG) | Sector: Consumer Cyclical, Industry: Packaging & Containers
  Packaging Corporation of America (NYSE:PKG) Hasn't Managed To Accelerate Its Returns : What trends should we look for it we want to identify stocks that can multiply in value over the long term? Amongst...

- **ON Semiconductor Corporation** (ON) | Sector: Technology, Industry: Semiconductors
  Some May Be Optimistic About ON Semiconductor's (NASDAQ:ON) Earnings : Soft earnings didn't appear to concern ON Semiconductor Corporation's ( NASDAQ:ON ) shareholders over the last week...

## Analysis & Questions - Section 2

### Instructions: Evaluate Answers With and Without Metadata

For each question, compare the two answers provided:
- One generated **without** metadata
- One generated **with** metadata

---

### Steps:

1. Use the following evaluation criteria:
   - Clarity
   - Detail & Depth
   - Use of Context
   - Accuracy & Grounding
   - Relevance
   - Narrative Flow

2. For each criterion, write brief notes comparing how the answer **without metadata** performs versus the answer **with metadata**.

3. Summarize your evaluation in a markdown table with the following columns:

| Criteria             | WITHOUT METADATA        | WITH METADATA           |
|----------------------|-------------------------|-------------------------|
| Clarity              | [Your brief note here]  | [Your brief note here]  |
| Detail & Depth       | [Your brief note here]  | [Your brief note here]  |
| Use of Context       | [Your brief note here]  | [Your brief note here]  |
| Accuracy & Grounding | [Your brief note here]  | [Your brief note here]  |
| Relevance            | [Your brief note here]  | [Your brief note here]  |
| Narrative Flow       | [Your brief note here]  | [Your brief note here]  |
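
If you would rather keep your notes in Python and render the table from them, a small helper along these lines could work. The function name `comparison_table` and the `notes` dictionary are hypothetical. The resulting string is ordinary markdown, so it could presumably be passed to the notebook's `print_markdown` helper.

```python
from typing import Dict, Tuple


def comparison_table(notes: Dict[str, Tuple[str, str]]) -> str:
    """Render {criterion: (without_note, with_note)} as a markdown table."""
    lines = [
        "| Criteria | WITHOUT METADATA | WITH METADATA |",
        "|---|---|---|",
    ]
    for criterion, (without_note, with_note) in notes.items():
        lines.append(f"| {criterion} | {without_note} | {with_note} |")
    return "\n".join(lines)


# Placeholder notes; replace them with your own observations.
notes = {
    "Clarity": ("[Your brief note here]", "[Your brief note here]"),
    "Relevance": ("[Your brief note here]", "[Your brief note here]"),
}

table_md = comparison_table(notes)
print(table_md)
```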

---

**Note:** Keep comments short and clear for easy comparison.



| Criteria             | WITHOUT METADATA                                                        | WITH METADATA                                                                   |
|----------------------|-------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| Clarity              | Understandable but sometimes too general or vague.                      | More specific and clear thanks to company and sector context.                   |
| Detail & Depth       | Lacks some depth, especially in broader topics like AI or retail.       | Includes more relevant details and relationships between companies.             |
| Use of Context       | Uses the snippets but doesn’t always link them clearly to the answer.   | Metadata helps connect content to the question more effectively.                |
| Accuracy & Grounding | Some parts feel guessed when context is weak.                           | Better grounded in facts, with more reliable references to companies and sectors.|
| Relevance            | Sometimes includes general or unrelated content.                        | More focused and relevant, especially in company-specific questions.            |
| Narrative Flow       | Flows fine but can feel repetitive or generic.                          | Feels smoother and more natural due to structured context and clearer prompts.  |
