### Question 1: [Index] S&P 500 Stocks Added to the Index

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:
1. Create a DataFrame with company tickers, names, and the year they were added.
2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (the most recent one, if you have several records).

*Context*: 
> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday" - recent examples of S&P 500 additions include DASH, WSM, EXE, TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.

---

The list of S&P 500 companies, along with the year they were added, is available on Wikipedia. We will use the pandas.read_html() function to directly extract tables from the Wikipedia page into a DataFrame.

In [2]:
import pandas as pd

# URL for the Wikipedia page containing the list of S&P 500 companies
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# Use pandas to scrape all tables from the webpage
tables = pd.read_html(url)

# The S&P 500 companies table is usually the first one
s_p_500_df = tables[0]

# Display the first few rows to understand the structure of the data
s_p_500_df.head()


,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott Laboratories,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
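
Taking `tables[0]` works only while the constituents table stays first on the page. A minimal sketch of a more defensive selection, assuming the table can be recognised by its `Symbol` and `Date added` columns. `pick_constituents_table` is a hypothetical helper, and the two small frames are stand-ins for what `pd.read_html` would return:

```python
import pandas as pd

def pick_constituents_table(tables, required=("Symbol", "Date added")):
    """Return the first table that has all required columns (hypothetical helper)."""
    for table in tables:
        if set(required).issubset(table.columns):
            return table
    raise ValueError(f"No table with columns {required} found")

# Stand-in frames for illustration only, not a live download
constituents = pd.DataFrame({
    "Symbol": ["MMM", "AOS"],
    "Date added": ["1957-03-04", "2017-07-26"],
})
other_table = pd.DataFrame({"Date": ["2025-01-01"], "Note": ["illustrative row"]})

# The constituents table is found even though it is not first in the list
selected = pick_constituents_table([other_table, constituents])
print(selected["Symbol"].tolist())
```

With the real page, the same helper could be called as `pick_constituents_table(pd.read_html(url))`.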


In [4]:
s_p_500_df.shape[0]

503

We need to extract the "Date added" column, which typically contains the date when each company was added to the index. We'll clean the data, convert it into a proper datetime format, and extract the year.

In [5]:
# Ensure 'Date added' column is in datetime format
s_p_500_df['Date added'] = pd.to_datetime(s_p_500_df['Date added'], errors='coerce')

# Extract the year from the 'Date added' column
s_p_500_df['Year added'] = s_p_500_df['Date added'].dt.year

# Display the updated dataframe
s_p_500_df[['Symbol', 'Security', 'Date added', 'Year added']].head()


,Symbol,Security,Date added,Year added
0,MMM,3M,1957-03-04,1957
1,AOS,A. O. Smith,2017-07-26,2017
2,ABT,Abbott Laboratories,1957-03-04,1957
3,ABBV,AbbVie,2012-12-31,2012
4,ACN,Accenture,2011-07-06,2011


Now that we have the year in a separate column, we can calculate how many companies were added each year (excluding 1957).

In [6]:
# Filter out the year 1957
s_p_500_df_filtered = s_p_500_df[s_p_500_df['Year added'] != 1957]

# Count the number of companies added each year
additions_per_year = s_p_500_df_filtered['Year added'].value_counts().sort_index()

# Find the year with the highest number of additions
# (idxmax() returns the earliest year if several years tie for the maximum)
max_additions_year = additions_per_year.idxmax()
max_additions_count = additions_per_year.max()

print(f"The year with the highest number of S&P 500 additions is {max_additions_year}, with {max_additions_count} additions.")


The year with the highest number of S&P 500 additions is 2016, with 23 additions.
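
The question asks for the most recent year when several years share the maximum, while `idxmax()` on a series sorted by year returns the earliest one. A small sketch of a tie-aware lookup, using made-up years rather than the Wikipedia data; `most_recent_top_year` is a hypothetical helper:

```python
import pandas as pd

def most_recent_top_year(years):
    """Return (year, count) for the largest count, preferring the latest year on ties."""
    counts = pd.Series(years).dropna().astype(int).value_counts()
    top_count = counts.max()
    # Keep every year that reaches the maximum, then take the latest of them
    top_year = counts[counts == top_count].index.max()
    return int(top_year), int(top_count)

# Made-up sample: 2001 and 2005 both appear twice
sample_years = [2001, 2005, 2001, 2005, 2010]
print(most_recent_top_year(sample_years))  # → (2005, 2)
```

Applied to the notebook, the input would be `s_p_500_df_filtered['Year added']`.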


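**Additional Question - Stocks in the Index for More Than 20 Years**

Here we count how many companies currently in the S&P 500 have been in the index for more than 20 years, by taking the difference between the current year and the "Year added" for each company. Note that the code below reuses the filtered DataFrame, so companies added in 1957 are not part of the count, and that a difference of calendar years only approximates the true time in the index.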

In [8]:
# Get the current year
current_year = pd.Timestamp.now().year

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
s_p_500_df_filtered = s_p_500_df_filtered.copy()

# Calculate how long each company has been in the index
s_p_500_df_filtered['Years in index'] = current_year - s_p_500_df_filtered['Year added']

# Count how many companies have been in the index for more than 20 years
companies_over_20_years = s_p_500_df_filtered[s_p_500_df_filtered['Years in index'] > 20]

# Display the number of companies with more than 20 years in the index
print(f"Number of current S&P 500 companies that have been in the index for more than 20 years: {len(companies_over_20_years)}")


Number of current S&P 500 companies that have been in the index for more than 20 years: 166
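
An alternative sketch that measures tenure from the full "Date added" value instead of the calendar year, and that keeps the 1957 constituents, which have also been in the index for more than 20 years. `count_long_tenure` and its `as_of` parameter are hypothetical names; the sample rows reuse the five-row preview shown earlier, so the printed count is not the answer for the full table:

```python
import pandas as pd

def count_long_tenure(df, as_of, years=20):
    """Count rows whose 'Date added' lies more than `years` years before `as_of`."""
    cutoff = pd.Timestamp(as_of) - pd.DateOffset(years=years)
    added = pd.to_datetime(df["Date added"], errors="coerce")
    # Unparseable dates become NaT and compare as False, so they are not counted
    return int((added < cutoff).sum())

sample = pd.DataFrame({
    "Symbol": ["MMM", "AOS", "ABBV", "ACN"],
    "Date added": ["1957-03-04", "2017-07-26", "2012-12-31", "2011-07-06"],
})

print(count_long_tenure(sample, as_of="2025-05-01"))  # → 1
```

Passing the reference date explicitly keeps the result reproducible, whereas `pd.Timestamp.now()` changes the count over time.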


### Question 2: [Macro] Indexes YTD (as of 1 May 2025)

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:
* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - HANG SENG INDEX (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: use start_date='2025-01-01' and end_date='2025-05-01' when downloading daily data in yfinance

Context: 
> [Global Valuations: Who's Cheap, Who's Not?](https://simplywall.st/article/beyond-the-us-global-markets-after-yet-another-tariff-update) article suggests "Other regions may be growing faster than the US and you need to diversify."

Reference: Yahoo Finance World Indices - https://finance.yahoo.com/world-indices/

*Additional*: How many of these indexes have better returns than the S&P 500 over 3-, 5-, and 10-year periods? Do you see the same trend?
(Note: For simplicity, ignore currency conversion effects.)


We will use the yfinance library to download daily data from Yahoo Finance for the S&P 500 and each of the 10 other indexes (11 tickers in total). For each index, we pass the start date (2025-01-01) and the end date (2025-05-01); yfinance treats the end date as exclusive, so the last row returned is for 2025-04-30.

In [29]:
import yfinance as yf
import pandas as pd

# List of indexes and their tickers
indexes = {
    'United States - S&P 500': '^GSPC',
    'China - Shanghai Composite': '000001.SS',
    'Hong Kong - HANG SENG INDEX': '^HSI',
    'Australia - S&P/ASX 200': '^AXJO',
    'India - Nifty 50': '^NSEI',
    'Canada - S&P/TSX Composite': '^GSPTSE',
    'Germany - DAX': '^GDAXI',
    'United Kingdom - FTSE 100': '^FTSE',
    'Japan - Nikkei 225': '^N225',
    'Mexico - IPC Mexico': '^MXX',
    'Brazil - Ibovespa': '^BVSP'
}

# Define the start and end date for YTD performance calculation
start_date = '2025-01-01'
end_date = '2025-05-01'

# Dictionary to store data for each index
index_data = {}

# Download the data for each index using yf.Ticker
for name, ticker in indexes.items():
    ticker_obj = yf.Ticker(ticker)
    data = ticker_obj.history(start=start_date, end=end_date)
    index_data[name] = data['Close']  # We are interested in the 'Close' price



In [23]:
# Convert the dictionary to a DataFrame
df = pd.DataFrame(index_data)

In [30]:
df

Date,United States - S&P 500,China - Shanghai Composite,Hong Kong - HANG SENG INDEX,Australia - S&P/ASX 200,India - Nifty 50,Canada - S&P/TSX Composite,Germany - DAX,United Kingdom - FTSE 100,Japan - Nikkei 225,Mexico - IPC Mexico,Brazil - Ibovespa
2024-12-31 18:30:00+00:00,,,,,23742.900391,,,,,,
2025-01-01 13:00:00+00:00,,,,8201.200195,,,,,,,
2025-01-01 16:00:00+00:00,,3262.561035,19623.320312,,,,,,,,
2025-01-01 18:30:00+00:00,,,,,24188.650391,,,,,,
2025-01-01 23:00:00+00:00,,,,,,,20024.660156,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2025-04-29 22:00:00+00:00,,,,,,,22496.980469,,,,
2025-04-29 23:00:00+00:00,,,,,,,,8494.900391,,,
2025-04-30 03:00:00+00:00,,,,,,,,,,,135067.0
2025-04-30 04:00:00+00:00,5569.060059,,,,,24841.699219,,,,,


In [24]:
df.head()

Date,United States - S&P 500,China - Shanghai Composite,Hong Kong - HANG SENG INDEX,Australia - S&P/ASX 200,India - Nifty 50,Canada - S&P/TSX Composite,Germany - DAX,United Kingdom - FTSE 100,Japan - Nikkei 225,Mexico - IPC Mexico,Brazil - Ibovespa
2024-12-31 18:30:00+00:00,,,,,23742.900391,,,,,,
2025-01-01 13:00:00+00:00,,,,8201.200195,,,,,,,
2025-01-01 16:00:00+00:00,,3262.561035,19623.320312,,,,,,,,
2025-01-01 18:30:00+00:00,,,,,24188.650391,,,,,,
2025-01-01 23:00:00+00:00,,,,,,,20024.660156,,,,
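
The mostly empty rows above appear to come from each exchange reporting timestamps in its own timezone, so the same trading day lands on a different UTC instant for each index once the series are combined. A minimal sketch of one way to line the rows up by local calendar date, using made-up closes for two timezones; `to_naive_dates` is a hypothetical helper:

```python
import pandas as pd

def to_naive_dates(series):
    """Drop the timezone and keep the local calendar date as the index."""
    out = series.copy()
    # tz_localize(None) keeps the local wall time; normalize() sets it to midnight
    out.index = out.index.tz_localize(None).normalize()
    return out

# Made-up closes for two exchanges in different timezones
us = pd.Series([100.0, 101.0],
               index=pd.date_range("2025-01-02", periods=2, tz="America/New_York"))
jp = pd.Series([200.0, 202.0],
               index=pd.date_range("2025-01-02", periods=2, tz="Asia/Tokyo"))

aligned = pd.DataFrame({"US": to_naive_dates(us), "Japan": to_naive_dates(jp)})
print(aligned)
```

In the notebook this would mean applying the helper to each `data['Close']` before storing it in `index_data`. Days on which only some markets trade would still leave gaps.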


In [16]:
start_date = '2025-01-01'
end_date = '2025-05-01'
ticker_obj = yf.Ticker("^GSPC")
data = ticker_obj.history(start=start_date, end=end_date)

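In [27]:
# Note: `data` here holds the last ticker fetched by the download loop (^BVSP, Ibovespa), not ^GSPC:
# the -03:00 timezone and the 135067.0 close on 2025-04-30 match the Brazil column shown earlier
data["Close"]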

Date
2025-01-02 00:00:00-03:00    120125.0
2025-01-03 00:00:00-03:00    118533.0
2025-01-06 00:00:00-03:00    120022.0
2025-01-07 00:00:00-03:00    121163.0
2025-01-08 00:00:00-03:00    119625.0
                               ...   
2025-04-24 00:00:00-03:00    134580.0
2025-04-25 00:00:00-03:00    134739.0
2025-04-28 00:00:00-03:00    135016.0
2025-04-29 00:00:00-03:00    135093.0
2025-04-30 00:00:00-03:00    135067.0
Name: Close, Length: 81, dtype: float64
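
With one close series per index, the comparison itself can be sketched as follows, assuming the return is measured from the first available close in the window to the last one. `ytd_returns` is a hypothetical helper, and the index names and prices are made up for illustration:

```python
import pandas as pd

def ytd_returns(closes):
    """Return the first-to-last relative change per series, ignoring missing values."""
    results = {}
    for name, series in closes.items():
        valid = series.dropna()
        results[name] = valid.iloc[-1] / valid.iloc[0] - 1
    return pd.Series(results)

# Made-up closing prices, not real index data
closes = {
    "Index A": pd.Series([100.0, 95.0, 90.0]),   # benchmark, down 10%
    "Index B": pd.Series([50.0, None, 55.0]),    # up 10%, with one missing day
    "Index C": pd.Series([200.0, 190.0, 170.0]), # down 15%
}

returns = ytd_returns(closes)
benchmark = "Index A"
better = returns[returns > returns[benchmark]]
print(returns.round(4).to_dict())
print(len(better))  # → 1
```

Because `.items()` also iterates over DataFrame columns, the same function could be called on `index_data` or on `df`, with `'United States - S&P 500'` as the benchmark.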