# Introduction to text embeddings on S&P 500 news

## 📌 Objectives

By the end of this notebook, students will be able to:

1. **Retrieve Financial News:**
   - Use the `yfinance` library to gather news headlines for all companies in the S&P 500 index.

2. **Clean and Structure Financial Text Data:**
   - Extract and organize relevant metadata (e.g., ticker, title, summary, publication date, URL) into a structured pandas DataFrame.

3. **Generate Text Embeddings:**
   - Apply a pre-trained sentence transformer model (`all-MiniLM-L6-v2`) to convert news headlines and summaries into numerical embeddings.

4. **Apply Clustering Techniques:**
   - Use K-Means clustering to identify groups of similar news articles based on semantic content.

5. **Determine Optimal Number of Clusters:**
   - Evaluate clustering quality using silhouette scores to find the best number of clusters.

6. **Visualize High-Dimensional Embeddings:**
   - Reduce the embedding space using PCA and visualize clusters in two dimensions.

7. **Interpret Cluster Themes:**
   - Analyze representative news articles from each cluster to interpret its theme.


## Install and import the required libraries

In [1]:
# %pip install pandas
# %pip install yfinance
# %pip install lxml
# %pip install -U sentence-transformers

In [15]:
import matplotlib.pyplot as plt
import pandas as pd
import yfinance as yf
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

## Get the list of stocks in the S&P 500 

In [3]:
# Read and print the stock tickers that make up S&P500
df_tickers = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]

display(df_tickers.head())

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott Laboratories,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989


In [4]:
ticker_list = df_tickers['Symbol'].tolist()
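
Wikipedia writes share-class tickers with a dot (for example `BRK.B`), whereas Yahoo Finance uses a hyphen (`BRK-B`). If some tickers return no data, a normalization step along these lines may help. This is a minimal sketch; the helper name `to_yahoo_symbol` is hypothetical and not part of `yfinance`.

```python
def to_yahoo_symbol(symbol: str) -> str:
    """Convert a Wikipedia-style ticker (e.g. 'BRK.B') to Yahoo's hyphenated form."""
    return symbol.strip().upper().replace(".", "-")

# Small illustrative sample; in the notebook this would be df_tickers['Symbol'].tolist()
sample_symbols = ["MMM", "BRK.B", "BF.B"]
yahoo_symbols = [to_yahoo_symbol(s) for s in sample_symbols]
print(yahoo_symbols)  # ['MMM', 'BRK-B', 'BF-B']
```

In the notebook, the same function could be applied as `ticker_list = [to_yahoo_symbol(s) for s in df_tickers['Symbol']]`.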

## Get the news of all 500 stocks in the S&P 500 Index
Use the `yfinance` library to retrieve the news for every stock in the index.

Reference: https://ranaroussi.github.io/yfinance/reference/yfinance.stock.html

### Get the news in a dictionary

In [24]:
news_by_ticker = {ticker: yf.Ticker(ticker).news for ticker in ticker_list}
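
A dictionary comprehension stops at the first request that raises, which can be costly when looping over roughly 500 tickers. The sketch below records failures instead of aborting. The names `fetch_all` and `fake_fetch` are hypothetical; the stub fetcher only exists so the example runs offline.

```python
import time

def fetch_all(tickers, fetch_fn, pause_seconds=0.0):
    """Call fetch_fn(ticker) for each ticker, recording failures instead of raising."""
    results, failures = {}, {}
    for ticker in tickers:
        try:
            results[ticker] = fetch_fn(ticker) or []
        except Exception as exc:  # broad on purpose: network errors vary
            failures[ticker] = repr(exc)
            results[ticker] = []
        if pause_seconds:
            time.sleep(pause_seconds)  # optional pause between requests
    return results, failures

# Stub fetcher (assumption) so the sketch runs without network access
def fake_fetch(ticker):
    if ticker == "BAD":
        raise RuntimeError("simulated network error")
    return [{"content": {"title": f"News about {ticker}"}}]

news, failed = fetch_all(["MMM", "BAD", "AOS"], fake_fetch)
print(len(news["MMM"]), news["BAD"], list(failed))
```

In the notebook, `fake_fetch` would be replaced by something like `lambda t: yf.Ticker(t).news`, and `failed` lists the tickers worth retrying.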

In [None]:
# import asyncio
#
# BATCH_SIZE = 20
#
# def fetch_news(ticker):
#     news = yf.Ticker(ticker).news
#     return news
#
# async def fetch_news_async(tickers):
#     news_by_ticker = {}
#     news = []
#     batches = [tickers[i:min(i + BATCH_SIZE, len(tickers))] for i in range(0, len(tickers), BATCH_SIZE)]
#     for batch in batches:
#         tasks = [asyncio.to_thread(fetch_news, ticker) for ticker in batch]
#         news += await asyncio.gather(*tasks)
#
#     # news will have the elements in the same order as the tickers list
#     for ticker, news_items in zip(tickers, news):
#         news_by_ticker[ticker] = news_items
#
#     return news_by_ticker
#
# r = await fetch_news_async(ticker_list)  # Fetch news for all tickers; pass ticker_list[:5] for a quick test

### Structure the news into a pandas dataframe 

Your final dataframe should have the following columns: 
- TICKER
- TITLE (of the news)
- SUMMARY (of the news)
- PUBLICATION_DATE (of the news)
- URL (of the news)

Note: all of those fields are provided in the yfinance news component. Refer to the library documentation.

In [48]:
def map_to_row(ticker, article):
    return {
        'TICKER': ticker,
        'TITLE': article['content']['title'],
        'SUMMARY': article['content']['summary'],
        'PUBLICATION_DATE': article['content']['pubDate'],
        'URL': article['content']['canonicalUrl']['url']
    }

data = []
for ticker, articles in news_by_ticker.items():
    for a in articles:
        if a['content'] is not None:
            data.append(map_to_row(ticker, a))

df_news = pd.DataFrame(data)
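
The mapping above assumes every article carries all four fields. If some articles lack a field (an assumption, not something observed in this run), direct indexing raises a `KeyError` or `TypeError`. A more defensive variant could look like the sketch below; `safe_row` is a hypothetical name, and the nested keys are the ones used in the cell above.

```python
def safe_row(ticker, article):
    """Build a row dict, using None for any field that is absent."""
    content = article.get("content") or {}
    url_info = content.get("canonicalUrl") or {}
    return {
        "TICKER": ticker,
        "TITLE": content.get("title"),
        "SUMMARY": content.get("summary"),
        "PUBLICATION_DATE": content.get("pubDate"),
        "URL": url_info.get("url"),
    }

# Toy article shaped like the ones used above, with canonicalUrl missing
toy_article = {
    "content": {
        "title": "3M (MMM) Q2 Earnings: What To Expect",
        "summary": None,
        "pubDate": "2025-07-17T03:00:58Z",
    }
}
row = safe_row("MMM", toy_article)
print(row["URL"])  # None
```

Rows with missing values can then be dropped or inspected with `df_news.isna().sum()` once the dataframe is built.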

In [49]:
df_news.describe()

Unnamed: 0,TICKER,TITLE,SUMMARY,PUBLICATION_DATE,URL
count,5013,5013,5013,5013,5013
unique,502,4174,3903,3864,4259
top,ZTS,"Starbucks downgraded, Oracle initiated: Wall S...","Whether you're a value, growth, or momentum in...",2025-07-17T13:46:52Z,https://finance.yahoo.com/news/starbucks-downg...
freq,10,13,48,13,13


## Exploring text embeddings

- Use the open-source model 'sentence-transformers/all-MiniLM-L6-v2' to create embeddings of the news text
- Add a column to your news dataframe called EMBEDDED_TEXT using ONLY the TITLE of the news (combining the title and summary into one string is an alternative, revisited in the Question Section)
- Add a column to your news dataframe called EMBEDDINGS, which contains the embedding of EMBEDDED_TEXT
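
If you want to embed the title and summary together rather than the title alone, one option is to join them into a single string before encoding. The helper `combine_text` and the `". "` separator are illustrative choices, not something prescribed by the model.

```python
def combine_text(title, summary, sep=". "):
    """Join title and summary, skipping parts that are missing or blank."""
    parts = [p.strip() for p in (title, summary) if isinstance(p, str) and p.strip()]
    return sep.join(parts)

# A missing summary (None or NaN) simply leaves the title on its own
combined = combine_text("3M (MMM) Q2 Earnings: What To Expect", None)
print(combined)
```

In the notebook this could be applied as `df_news['EMBEDDED_TEXT'] = [combine_text(t, s) for t, s in zip(df_news['TITLE'], df_news['SUMMARY'])]`.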


In [51]:
# Loads the pre-trained sentence transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



In [59]:
df_news['EMBEDDED_TEXT'] = df_news['TITLE']

In [67]:
# Run the model to encode the text into embeddings
embeddings = model.encode(df_news['EMBEDDED_TEXT'].tolist())

In [64]:
display(embeddings.shape)

(5013, 384)

In [65]:
df_news['EMBEDDINGS'] = embeddings.tolist()

In [69]:
display(df_news.head())

Unnamed: 0,TICKER,TITLE,SUMMARY,PUBLICATION_DATE,URL,EMBEDDED_TEXT,EMBEDDINGS
0,MMM,"AmEx earnings, consumer sentiment, housing dat...",Market Domination Overtime host Josh Lipton ou...,2025-07-17T23:00:00Z,https://finance.yahoo.com/video/amex-earnings-...,"AmEx earnings, consumer sentiment, housing dat...","[0.06263984739780426, -0.040196068584918976, 0..."
1,MMM,How To Earn $500 A Month From 3M Stock Ahead O...,3M Company (NYSE:MMM) will release earnings re...,2025-07-17T12:17:16Z,https://finance.yahoo.com/news/earn-500-month-...,How To Earn $500 A Month From 3M Stock Ahead O...,"[0.03646127134561539, -0.03693307191133499, -0..."
2,MMM,3M (MMM) Q2 Earnings: What To Expect,Industrial conglomerate 3M (NYSE:MMM) will be ...,2025-07-17T03:00:58Z,https://finance.yahoo.com/news/3m-mmm-q2-earni...,3M (MMM) Q2 Earnings: What To Expect,"[0.020555980503559113, -0.010142773389816284, ..."
3,MMM,Carlisle (CSL) to Report Q2 Results: Wall Stre...,Carlisle (CSL) possesses the right combination...,2025-07-16T14:00:07Z,https://finance.yahoo.com/news/carlisle-csl-re...,Carlisle (CSL) to Report Q2 Results: Wall Stre...,"[-0.003267791820690036, 0.023633969947695732, ..."
4,MMM,MMM Gears Up to Post Q2 Earnings: What Lies Ah...,3M's Q2 results are poised for gains from indu...,2025-07-16T13:35:00Z,https://finance.yahoo.com/news/mmm-gears-post-...,MMM Gears Up to Post Q2 Earnings: What Lies Ah...,"[-0.05767572298645973, 0.014248637482523918, 0..."


## Using K-means clustering on news embeddings
To simplify, keep only one news item per company (ticker). You should end up with roughly 500 rows in your news dataframe (502 in the run shown below).

In [73]:
df_news_filtered = df_news.drop_duplicates(subset=['TICKER']).reset_index(drop=True)

df_news_filtered.describe()

Unnamed: 0,TICKER,TITLE,SUMMARY,PUBLICATION_DATE,URL,EMBEDDED_TEXT,EMBEDDINGS
count,502,502,502,502,502,502,502
unique,502,424,411,388,429,424,434
top,ZTS,'IPO window is open' & it's good news for Big ...,"With bank earnings underway, Citizens JMP Secu...",2025-07-17T10:30:00Z,https://finance.yahoo.com/video/ipo-window-ope...,'IPO window is open' & it's good news for Big ...,"[0.06263984739780426, -0.040196068584918976, 0..."
freq,1,6,6,6,6,6,5


In [76]:
# No n_clusters is given, so scikit-learn's default of 8 clusters is used
k_means = KMeans(random_state=42)
r = k_means.fit_predict(df_news_filtered['EMBEDDINGS'].tolist())

In [77]:
r

array([3, 2, 1, 7, 3, 2, 6, 7, 4, 0, 5, 5, 4, 1, 2, 4, 2, 1, 7, 3, 0, 1,
       3, 7, 7, 4, 3, 0, 2, 5, 2, 3, 2, 7, 1, 5, 1, 3, 1, 1, 4, 2, 6, 0,
       1, 2, 2, 4, 7, 2, 2, 1, 2, 7, 3, 1, 3, 2, 1, 7, 2, 4, 1, 2, 5, 1,
       1, 1, 2, 1, 6, 2, 7, 2, 6, 5, 1, 1, 1, 4, 1, 2, 2, 4, 1, 1, 5, 4,
       7, 1, 0, 7, 5, 4, 4, 3, 2, 3, 1, 2, 1, 7, 4, 1, 1, 3, 3, 4, 2, 6,
       6, 3, 0, 2, 1, 7, 0, 3, 1, 6, 7, 4, 1, 2, 2, 2, 0, 7, 7, 7, 1, 2,
       7, 2, 3, 0, 2, 1, 2, 7, 7, 6, 2, 1, 4, 0, 1, 1, 7, 2, 5, 5, 2, 5,
       4, 2, 1, 1, 6, 1, 6, 5, 4, 4, 1, 5, 6, 5, 4, 4, 1, 2, 1, 2, 7, 7,
       0, 1, 7, 0, 7, 5, 1, 2, 0, 7, 4, 2, 0, 2, 7, 7, 4, 2, 3, 0, 4, 7,
       7, 0, 2, 2, 4, 4, 1, 0, 1, 1, 4, 4, 1, 7, 6, 2, 1, 1, 2, 5, 4, 3,
       3, 0, 1, 0, 5, 4, 1, 7, 0, 2, 0, 1, 2, 4, 2, 5, 3, 4, 4, 7, 2, 0,
       4, 2, 2, 1, 0, 1, 2, 2, 0, 5, 4, 0, 0, 1, 5, 2, 7, 0, 2, 7, 1, 2,
       0, 3, 4, 1, 5, 2, 1, 7, 1, 3, 1, 1, 6, 7, 0, 5, 5, 2, 4, 4, 5, 5,
       0, 5, 5, 4, 0, 7, 0, 1, 1, 2, 7, 2, 7, 2, 4,

### Identify the number of clusters using the silhouette score

- Using a for loop, run the clustering with different k values (number of clusters); test 2 to 6 clusters, since the silhouette score is undefined for a single cluster
- Compute the silhouette score for every k value
- Plot the silhouette score for different k values
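
For reference, the silhouette coefficient of a sample $i$ compares $a(i)$, the mean distance from $i$ to the other points in its own cluster, with $b(i)$, the mean distance from $i$ to the points of the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

The value returned by `silhouette_score` is the mean of $s(i)$ over all samples. Values close to 1 indicate compact, well-separated clusters, while values near 0 suggest that clusters overlap.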

#### Try different values of k and compute silhouette scores

In [79]:
silhouette_by_n_clusters = {}

for i in range(2, 7):
    k_means = KMeans(n_clusters=i, random_state=42)
    labels = k_means.fit_predict(df_news_filtered['EMBEDDINGS'].tolist())
    silhouette_avg = silhouette_score(df_news_filtered['EMBEDDINGS'].tolist(), labels)
    silhouette_by_n_clusters[i] = silhouette_avg
    print(f"n_clusters = {i}, silhouette score = {silhouette_avg:.4f}")

n_clusters = 2, silhouette score = 0.0579
n_clusters = 3, silhouette score = 0.0329
n_clusters = 4, silhouette score = 0.0363
n_clusters = 5, silhouette score = 0.0385
n_clusters = 6, silhouette score = 0.0254


#### Plot silhouette scores

In [10]:
# YOUR CODE HERE
# USE AS MANY CELLS AS YOU NEED
# MAKE SURE TO DISPLAY INTERMEDIATE RESULTS
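
One possible way to plot the scores is a simple line chart with one marker per value of k. The sketch below hard-codes the scores printed by the loop output above so that it runs on its own; in the notebook you would reuse the existing `silhouette_by_n_clusters` dictionary.

```python
import matplotlib.pyplot as plt

# Scores copied from the loop output above
silhouette_by_n_clusters = {2: 0.0579, 3: 0.0329, 4: 0.0363, 5: 0.0385, 6: 0.0254}

ks = list(silhouette_by_n_clusters.keys())
scores = list(silhouette_by_n_clusters.values())

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(ks, scores, marker="o")
ax.set_xticks(ks)  # one tick per tested k
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Mean silhouette score")
ax.set_title("Silhouette score by number of clusters")
ax.grid(True, alpha=0.3)
plt.show()
```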

#### Identify the Best k

In [11]:
# YOUR CODE HERE
# USE AS MANY CELLS AS YOU NEED
# MAKE SURE TO DISPLAY INTERMEDIATE RESULTS
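
A minimal sketch for picking the best k is to take the key with the highest silhouette score. The dictionary below repeats the scores printed earlier so that the snippet is self-contained.

```python
# Scores copied from the loop output above
silhouette_by_n_clusters = {2: 0.0579, 3: 0.0329, 4: 0.0363, 5: 0.0385, 6: 0.0254}

# max() over the keys, ranked by their associated score
best_k = max(silhouette_by_n_clusters, key=silhouette_by_n_clusters.get)
best_score = silhouette_by_n_clusters[best_k]
print(f"Best k = {best_k} (silhouette = {best_score:.4f})")
```

With these scores the maximum is at k=2, but every score is close to zero, so the preference for any particular k is weak and the clusters overlap substantially.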

#### Cluster the embeddings using 3 clusters (k=3)

In [12]:
# YOUR CODE HERE
# USE AS MANY CELLS AS YOU NEED
# MAKE SURE TO DISPLAY INTERMEDIATE RESULTS
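
The k=3 clustering could be sketched as follows. To keep the example runnable without the news data, it uses a synthetic random matrix `X` as a stand-in for the embeddings; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in (assumption) for the (n_articles, 384) embedding matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 384))

kmeans_3 = KMeans(n_clusters=3, n_init=10, random_state=42)
labels_3 = kmeans_3.fit_predict(X)

cluster_sizes = np.bincount(labels_3, minlength=3)  # number of points per cluster
print(cluster_sizes)
print(kmeans_3.cluster_centers_.shape)  # one 384-dimensional centroid per cluster
```

In the notebook, `X` would instead be built from the real embeddings, for example `X = np.array(df_news_filtered['EMBEDDINGS'].tolist())`.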

### Visualize the first two PCA components of your embeddings

In [13]:
# YOUR CODE HERE
# USE AS MANY CELLS AS YOU NEED
# MAKE SURE TO DISPLAY INTERMEDIATE RESULTS
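
A possible shape for the visualization is to project the embeddings onto two principal components and color each point by its cluster label. The sketch below again uses synthetic data in place of the real embeddings, so the resulting picture is only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in (assumption) for the embedding matrix and its cluster labels
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 384))
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project the 384-dimensional vectors onto the first two principal components
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=20)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Embeddings projected on the first two principal components")
ax.legend(*points.legend_elements(), title="Cluster")
plt.show()

print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Checking `explained_variance_ratio_` is worthwhile: if the two components retain only a small share of the variance, clusters that overlap in the plot may still be separated in the full embedding space.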

#### Analyze the content of each cluster
- Add the kmeans cluster label to your news dataframe
- Print the content of each cluster and analyze it

In [14]:
# YOUR CODE HERE
# USE AS MANY CELLS AS YOU NEED
# MAKE SURE TO DISPLAY INTERMEDIATE RESULTS
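
One way to inspect the clusters is to store the label in a column and print a few titles per group. The dataframe below is a toy example: its tickers, titles, and labels are placeholders, and the column name `CLUSTER` is an assumption.

```python
import pandas as pd

# Toy rows; tickers, titles, and cluster labels are illustrative placeholders
df_toy = pd.DataFrame({
    "TICKER": ["AAA", "BBB", "CCC", "DDD"],
    "TITLE": ["Toy headline 1", "Toy headline 2", "Toy headline 3", "Toy headline 4"],
    "CLUSTER": [0, 1, 0, 2],
})

# Number of articles per cluster
cluster_sizes = df_toy["CLUSTER"].value_counts().sort_index()
print(cluster_sizes)

# Show up to five titles from each cluster
for cluster_id, group in df_toy.groupby("CLUSTER"):
    print(f"\nCluster {cluster_id} ({len(group)} articles)")
    for title in group["TITLE"].head(5):
        print("  -", title)
```

In the notebook, the label column would come from the fitted model, for example `df_news_filtered['CLUSTER'] = labels`, before running the same loop on `df_news_filtered`.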



## Question Section

Take time to reflect on what you've implemented and observed. Answer the following questions in a separate markdown cell or notebook file:

---

### Technical Understanding

#### 1️⃣ How might the choice of embedding model (e.g., MiniLM vs. a larger transformer) affect your clustering results and interpretation?

YOUR WRITTEN RESPONSE HERE

---

#### 2️⃣ What would be the differences in embeddings if you used only the TITLE, only the SUMMARY, or the combination of both? How could you empirically test this?

YOUR WRITTEN RESPONSE HERE

---

#### 3️⃣ In what situations would using a different dimensionality reduction method (e.g., t-SNE, UMAP) be preferable over PCA for visualization of embeddings?

YOUR WRITTEN RESPONSE HERE


---

### Data Analysis and Interpretation

#### 4️⃣ Based on your cluster analysis, identify at least two potential challenges you faced in interpreting the clusters and propose strategies to address them.

YOUR WRITTEN RESPONSE

---

#### 5️⃣ Did you observe any outliers in your 2D visualization? How would you identify and handle these outliers in a production pipeline?

YOUR WRITTEN RESPONSE

---

#### 6️⃣ If you could assign a 'label' or 'theme' to each cluster you obtained, what would they be? How confident are you in these assignments, and what could you do to validate them systematically?

YOUR WRITTEN RESPONSE

---

### Critical Thinking

#### 7️⃣ If news sentiment was incorporated into the analysis, how might this influence the clustering structure and interpretation of the clusters in a financial analysis context?

YOUR WRITTEN RESPONSE

---

#### 8️⃣ Discuss the limitations of using k-means clustering for news embeddings. What alternative clustering methods could address these limitations, and under what conditions would you prefer them?

YOUR WRITTEN RESPONSE

---

#### 9️⃣ How could the approach in this notebook be extended to analyze the potential impact of news clusters on stock price movements over time? Sketch a high-level pipeline you would implement to test this.

YOUR WRITTEN RESPONSE

---

#### 10️⃣ Imagine your clustering shows clear groups of news, but your downstream task (e.g., prediction of stock movement) does not improve. What might explain this disconnect between clear clusters and predictive utility?

YOUR WRITTEN RESPONSE

