# Clustering stocks

A popular application of clustering is grouping stocks that behave alike. Spreading your picks across the resulting clusters is one way of building a balanced portfolio.

[link to the original article](https://medium.com/@facujallia/stock-classification-using-k-means-clustering-8441f75363de)

## Gathering data

We're using a somewhat different approach here. Instead of using a pre-downloaded CSV file or a dataset from scikit-learn, we're scraping the data ourselves. Start by getting the [list of S&P 500 companies from Wikipedia](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies). Use the pandas "read_html" function for that.

In [None]:
# !pip install lxml
# !pip install yfinance

In [None]:
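from io import StringIO

import pandas as pd
import requests

# Define the url
sp500_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# Send a browser-like User-Agent; some sites reject the default one
headers = {
    "User-Agent": "Mozilla/5.0"
}

# Request the page
response = requests.get(sp500_url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Wrap the HTML in StringIO: passing a raw HTML string to read_html is deprecated
    data_table = pd.read_html(StringIO(response.text))

    # The first table on the page holds the ticker symbols
    tickers = data_table[0]['Symbol'].values.tolist()
    tickers = [s.replace('\n', '') for s in tickers]
    # Yahoo Finance writes share classes with '-' instead of '.' (BRK.B -> BRK-B)
    tickers = [s.replace('.', '-') for s in tickers]
    tickers = [s.replace(' ', '') for s in tickers]

    # A bare expression inside an if-block is not displayed, so print it
    print(tickers[:5])
else:
    print(f"Failed to retrieve data: {response.status_code}")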

Now that we have the ticker symbols, let's download the closing prices for these stocks over the last year. This involves going to your local library and getting the newspapers from the last year. In these you can check the daily prices of the stocks of your choosing.

Or use [yfinance](https://ranaroussi.github.io/yfinance/), a Python library that will do all this for you.

First we'll get the closing prices for Microsoft over the last year.

In [None]:
import yfinance as yf

dat = yf.Ticker("MSFT")
close = dat.history(start="2024-05-01", end="2025-05-01")["Close"]
close.head()

Next we use the pandas "pct_change()" method to get not the actual prices, but the fractional change of each price against the previous day's value. We print the mean and the standard deviation of these daily returns.

In [None]:
# Mean daily return
print(close.pct_change().mean())

# Daily volatility (standard deviation of the daily returns)
print(close.pct_change().std())

Let's store everything in a dictionary:
- Returns (mean of the daily percentage changes)
- Volatility (standard deviation of the daily percentage changes)
- dividendRate
- trailingPE

The first we multiply by 252 and the second we multiply by the square root of 252. This annualizes them, based on 252 trading days a year; volatility scales with the square root of time because the variances of independent daily returns add up. (And we'll remove "MSFT" from the tickers, as we already did that one.)

In [None]:
dict_data = {
    'Ticker': "MSFT",
    'Returns': close.pct_change().mean() * 252,
    'Volatility': close.pct_change().std() * (252 ** 0.5),
    'dividendRate': dat.info['dividendRate'],
    'trailingPE': dat.info['trailingPE']
}

dict_data

Now for the real work: loop over all 503 tickers and store the above information in one giant pandas DataFrame.

In [None]:
# Up to you!
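
One possible way to structure this loop is sketched below. It is not the only solution: the helper names `summarize_ticker`, `fetch_with_yfinance` and `build_dataset` are made up for this illustration, and the choice to skip tickers that fail to download is an assumption. The fetcher simply repeats the calls from the MSFT cell above.

```python
import pandas as pd

TRADING_DAYS = 252  # same annualization basis as in the MSFT example


def summarize_ticker(symbol, close, info):
    """Turn one ticker's closing prices and info dict into a row of features."""
    daily_returns = close.pct_change()
    return {
        'Ticker': symbol,
        'Returns': daily_returns.mean() * TRADING_DAYS,
        'Volatility': daily_returns.std() * TRADING_DAYS ** 0.5,
        # .get() gives None when a company reports no dividend or P/E ratio
        'dividendRate': info.get('dividendRate'),
        'trailingPE': info.get('trailingPE'),
    }


def fetch_with_yfinance(symbol):
    """Hypothetical fetcher mirroring the MSFT cell; needs network access."""
    import yfinance as yf  # imported lazily so the other helpers work without it
    dat = yf.Ticker(symbol)
    close = dat.history(start="2024-05-01", end="2025-05-01")["Close"]
    return close, dat.info


def build_dataset(tickers, fetch=fetch_with_yfinance):
    """Collect one row per ticker, skipping tickers that cannot be downloaded."""
    rows = []
    for symbol in tickers:
        try:
            close, info = fetch(symbol)
        except Exception as exc:  # one bad ticker should not stop the whole loop
            print(f"Skipping {symbol}: {exc}")
            continue
        if close.empty:
            print(f"Skipping {symbol}: no price history")
            continue
        rows.append(summarize_ticker(symbol, close, info))
    return pd.DataFrame(rows)
```

With the scraped list in hand, `df = build_dataset(tickers)` would produce the DataFrame used in the rest of the notebook. Passing the fetcher as an argument also makes it easy to try the loop on a handful of tickers first.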



Are there any NaN values?

In [None]:
# Up to you!
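
A minimal sketch of such a check, run here on a tiny made-up DataFrame called `demo` (the values are invented for illustration) rather than on the real scraped data:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the scraped DataFrame (made-up values)
demo = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'CCC'],
    'Returns': [0.12, -0.05, 0.30],
    'Volatility': [0.20, 0.35, 0.50],
    'dividendRate': [1.5, np.nan, 0.8],
    'trailingPE': [25.0, np.nan, np.nan],
})

# Number of missing values per column
na_counts = demo.isna().sum()
print(na_counts)

# Rows that contain at least one missing value
rows_with_na = demo[demo.isna().any(axis=1)]
print(rows_with_na)
```

The same two calls should work unchanged on your own `df`.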



Some, but only in dividendRate and trailingPE. We'll ignore them for now.

Now we'll export this data to a CSV so we don't have to download it every time we want to revisit this dataset. Downloading takes two minutes, and if we had time to sit back and wait for two minutes we'd be training neural networks.

In [None]:
df.to_csv('../exports/PS500.csv', index=False)

The data is yours. Now analyze it.

## Clustering Returns and Volatility

Start by re-importing the data from the CSV. If you've just loaded the data you can skip this, but if you're restarting this notebook this saves you some time.

In [None]:
df = pd.read_csv('../exports/PS500.csv')
df.head()

Create an elbow curve to see how many clusters would be suitable. Limit yourself to Returns and Volatility, as these are roughly in the same range. (Remember that KMeans is sensitive to unscaled data.)

In [None]:
# Up to you!
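
As a rough sketch of how an elbow curve can be built: fit KMeans for a range of cluster counts and plot the inertia of each fit. The data below is randomly generated and only stands in for the real Returns and Volatility columns; the range of 1 to 10 clusters and the `random_state` are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the Returns/Volatility columns (made-up values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Returns': rng.normal(0.10, 0.25, 300),
    'Volatility': np.clip(rng.normal(0.30, 0.10, 300), 0.05, None),
})
X = demo[['Returns', 'Volatility']]

# Inertia is the sum of squared distances of the samples to their nearest centroid
ks = list(range(1, 11))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow curve')
plt.show()
```

On your own data, replace `X` with `df[['Returns', 'Volatility']]` and look for the point where the curve stops dropping steeply.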



Looks like 5 is a good break-off point. Create a new model with 5 clusters and draw a colored scatter plot.

In [None]:
# Up to you!
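
One way this step might look, again on randomly generated stand-in data rather than the real dataset. The colormap and the red crosses for the centroids are just presentation choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the Returns/Volatility columns (made-up values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Returns': rng.normal(0.10, 0.25, 300),
    'Volatility': np.clip(rng.normal(0.30, 0.10, 300), 0.05, None),
})

# Fit the model with 5 clusters and store each point's cluster label
model = KMeans(n_clusters=5, n_init=10, random_state=42)
demo['Cluster'] = model.fit_predict(demo[['Returns', 'Volatility']])

# Color every stock by its cluster and mark the centroids
plt.scatter(demo['Returns'], demo['Volatility'], c=demo['Cluster'], cmap='viridis', s=15)
centers = model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=100)
plt.xlabel('Returns')
plt.ylabel('Volatility')
plt.show()
```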



## Outliers

In the scatter plot we see some obvious outliers appearing. Can we find these statistically as well?

We could use the interquartile range (IQR), which is the value at 75% of the data minus the value at 25% of the data. Conventionally, everything more than 1.5 times the IQR below the 25% value or above the 75% value is considered an outlier. But if we raise this factor to 3, we only get the two datapoints that are above the 0.8 mark on the graph for Volatility, and the datapoints on the far right and the far left for Returns. Let's work with these.

In [None]:
# Up to you!
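
The IQR rule with a factor of 3 can be sketched as follows. The helper name `iqr_outlier_mask` and the eight made-up rows are assumptions for illustration; on the real data you would pass in the columns of `df`.

```python
import pandas as pd


def iqr_outlier_mask(series, factor=3.0):
    """Return True for values more than `factor` IQRs outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)


# Made-up data: H has an extreme return, G an extreme volatility
demo = pd.DataFrame({
    'Ticker': list('ABCDEFGH'),
    'Returns': [0.05, 0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 2.50],
    'Volatility': [0.20, 0.25, 0.22, 0.30, 0.28, 0.24, 0.95, 0.26],
})

# A row is an outlier if it is extreme in either column
mask = iqr_outlier_mask(demo['Returns']) | iqr_outlier_mask(demo['Volatility'])
outliers = demo[mask]
print(outliers)

# Keeping only the rows where the mask is False drops the outliers
cleaned = demo[~mask].reset_index(drop=True)
```

Combining the two masks with `|` flags a stock as soon as it is extreme on one axis, which matches the description above.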



Once you have identified them, remove them from the dataset.

In [None]:
# Up to you!



Now retrain the model and show the scatterplot again!

In [None]:
# Up to you!



You could do the elbow plot again, but you'd see it doesn't change much: 5 remains a good option.

## Scaling

There is another problem though, visible in the last scatter plot of the previous part. The groups all lie side by side along the Returns axis, indicating that the clustering worked mainly on Returns and hardly took Volatility into account.

Sounds like a job for scalerman!

<img src="../files/2025-05-21-10-00-08.png" width=150 />

Apply a standard scaler to our Returns and Volatility.

In [None]:
# Up to you!
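
A minimal sketch of standard scaling with scikit-learn's `StandardScaler`, using randomly generated stand-in data. After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distance calculation in KMeans.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Returns/Volatility columns (made-up values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Returns': rng.normal(0.10, 0.25, 300),
    'Volatility': np.clip(rng.normal(0.30, 0.10, 300), 0.05, None),
})
features = ['Returns', 'Volatility']

# Fit the scaler and transform the two columns in one step
scaler = StandardScaler()
scaled = scaler.fit_transform(demo[features])

# fit_transform returns a NumPy array; wrap it to keep the column names
demo_scaled = pd.DataFrame(scaled, columns=features, index=demo.index)
print(demo_scaled.mean())
print(demo_scaled.std(ddof=0))
```

Keep the fitted `scaler` object around: new data has to be transformed with the same scaler before it can be assigned to a cluster.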



Create a new elbow plot for this data.

In [None]:
# Up to you!



The elbow is more pronounced now. We could go with 4, although nothing would be wrong with 5. We'll go with 5 and decide based on the scatter plot if we're happy.

In [None]:
# Up to you!



Way better. Now...

* Export the model and the scaler
* Apply them to your portfolio
* Check to see which type you should be buying to balance it across all categories

You could go back to the [original article](https://medium.com/@facujallia/stock-classification-using-k-means-clustering-8441f75363de) to continue reading and try things out. Good luck!
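
The three steps in the list above could be sketched like this. Everything here is an assumption for illustration: the data is randomly generated, the portfolio of three tickers is made up, and the files go to a temporary directory instead of your exports folder. `joblib` is used for saving because it is installed alongside scikit-learn.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['Returns', 'Volatility']

# Synthetic stand-in for the cleaned dataset (made-up values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Returns': rng.normal(0.10, 0.25, 300),
    'Volatility': np.clip(rng.normal(0.30, 0.10, 300), 0.05, None),
})

# Train the scaler and the model as in the previous steps
scaler = StandardScaler().fit(demo[features])
model = KMeans(n_clusters=5, n_init=10, random_state=42)
model.fit(scaler.transform(demo[features]))

# 1. Export the model and the scaler
export_dir = tempfile.mkdtemp()
scaler_path = os.path.join(export_dir, 'scaler.joblib')
model_path = os.path.join(export_dir, 'kmeans.joblib')
joblib.dump(scaler, scaler_path)
joblib.dump(model, model_path)

# 2. Load them again and apply them to a (made-up) portfolio
loaded_scaler = joblib.load(scaler_path)
loaded_model = joblib.load(model_path)
portfolio = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'CCC'],
    'Returns': [0.05, 0.40, -0.20],
    'Volatility': [0.20, 0.45, 0.35],
})
portfolio['Cluster'] = loaded_model.predict(
    loaded_scaler.transform(portfolio[features])
)

# 3. Count holdings per cluster; clusters with a count of 0 are missing
counts = portfolio['Cluster'].value_counts().reindex(range(5), fill_value=0)
print(counts)
```

Clusters with a count of zero are the categories your portfolio does not cover yet.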