# CFM 101 Final Project

### Project Category: Market-Meet

### Group Members: Max [insert last name], Sean Lee, Rain Luo

#### Initial strategy:
Our overall strategy is to meet the market by replicating the sector composition of the S&P 500 and TSX 60. The simulated portfolio stays diversified because each sector is weighted according to its actual share of the index. For each sector, the algorithm ranks stocks by their correlation with the sector’s performance and selects the most highly correlated ones, since the stocks that move most similarly to the sector represent its behavior best. The number of stocks selected scales with the sector’s weight: for each sector, the top N stocks are chosen, where N is the sector’s weight in the index multiplied by 25. Within a sector, the most correlated stock receives the largest allocation, and the allocation decreases for each subsequent stock.

We then backtest this portfolio construction on historical data to simulate potential profits and to assess whether this sector-based selection method can outperform the broader market indices. Essentially, the strategy seeks to leverage sector-specific performance to build a portfolio that could potentially beat the market.
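
The selection step described above can be sketched on synthetic data. This is a minimal illustration rather than the project's implementation: the helper `top_correlated`, the tickers `AAA`, `BBB` and `CCC`, the random returns, and the choice to round N to the nearest integer with a floor of one are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def top_correlated(stock_returns, sector_returns, sector_weight, scale=25):
    """Rank stocks by correlation with a sector and keep the top N."""
    # N grows with the sector's index weight (rounding and the floor of 1 are assumptions)
    n = max(1, round(sector_weight * scale))
    # correlation of every stock column with the sector's return series
    corr = stock_returns.corrwith(sector_returns)
    return corr.sort_values(ascending=False).head(n)

# Synthetic daily returns (hypothetical tickers, not real market data)
rng = np.random.default_rng(0)
sector = pd.Series(rng.normal(0, 0.01, 250))
stocks = pd.DataFrame({
    'AAA': sector + rng.normal(0, 0.002, 250),   # tracks the sector closely
    'BBB': sector + rng.normal(0, 0.01, 250),    # tracks the sector loosely
    'CCC': pd.Series(rng.normal(0, 0.01, 250)),  # unrelated to the sector
})

# a sector with an 8% index weight gives N = 0.08 * 25 = 2
picks = top_correlated(stocks, sector, sector_weight=0.08)
print(picks)
```

With these inputs the two stocks built from the sector series are kept and the unrelated one is left out.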

## Design decisions 
1. We discovered that a single stock can have a high correlation with multiple sectors of the S&P, so we decided to put a maximum cap on the percentage of the portfolio that any one stock can take.
2. We decided to use a year's worth of past price trends, based on discussions with other groups.
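
One way a per-stock cap could work is sketched below. The notebook only states that a cap is applied, so the redistribution rule here (spreading the excess over the uncapped stocks in proportion to their weights) is an assumption, as are the helper name `cap_weights` and the tickers.

```python
def cap_weights(weights, cap):
    """Clip every weight at `cap` and spread the excess over uncapped stocks."""
    w = dict(weights)
    # each pass caps at least one stock, so len(w) passes are enough
    for _ in range(len(w)):
        over = {k: v for k, v in w.items() if v > cap + 1e-12}
        if not over:
            break
        excess = sum(v - cap for v in over.values())
        for k in over:
            w[k] = cap
        free = {k: v for k, v in w.items() if v < cap - 1e-12}
        free_total = sum(free.values())
        if free_total == 0:
            break  # no uncapped stock left to absorb the excess
        # assumption: excess is shared in proportion to the current weights
        for k, v in free.items():
            w[k] = v + excess * v / free_total
    return w

# hypothetical weights where one stock dominates
raw = {'AAA': 0.60, 'BBB': 0.25, 'CCC': 0.15}
capped = cap_weights(raw, cap=0.40)
print(capped)
```

The total weight is unchanged as long as some stock remains below the cap to absorb the excess.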

### Set up
1. We found that yfinance has tickers that track the S&P 500 by sector; they are listed in the table below.
2. We also included each sector's composition in the S&P 500 and TSX 60, which we use later on.

| Sector                 | Composition in S&P500 | Composition in TSX60 | S&P Ticker | TSX Ticker  |
|------------------------|----------------------|--------------------|-----------------|---------------|
| Basic Materials        | 0.0171              | 0.0849            | ^SP500-15       | ^GSPTTMT      |
| Industrials            | 0.0719              | 0.1311            | ^SP500-20       | ^GSPTTIN      |
| Consumer Cyclical      | 0.1075              | 0.0531            | ^SP500-25       | ^GSPTTCD      |
| Consumer Defensive     | 0.0576              | 0.0509            | ^SP500-30       | ^GSPTTCS      |
| Healthcare             | 0.1014              | 0.0000            | ^SP500-35       | ^GSPTTHC      |
| Financial Services     | 0.1303              | 0.3387            | ^SP500-40       | ^SPTTFS       |
| Technology             | 0.3045              | 0.0963            | ^SP500-45       | ^SPTTTK       |
| Communication Services | 0.1340              | 0.0304            | ^SP500-50       | ^GSPTTTS      |
| Utilities              | 0.0235              | 0.0318            | ^SP500-55       | ^GSPTTUT      |
| Real Estate            | 0.0207              | 0.0062            | ^SP500-60       | ^GSPTTRE      |
| Energy                 | 0.0315              | 0.1766            | ^SP500-1010     | ^SPTTEN       |
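
As a quick check on the table, the sketch below blends the two indices with each contributing half of the portfolio, which is how the portfolio code later uses these values (it allocates half of each sector's weight per index). The dictionary name `composition` is introduced only for this example; the numbers are copied from the table.

```python
# (S&P 500 share, TSX 60 share) per sector, copied from the table above
composition = {
    'Basic Materials': (0.0171, 0.0849),
    'Industrials': (0.0719, 0.1311),
    'Consumer Cyclical': (0.1075, 0.0531),
    'Consumer Defensive': (0.0576, 0.0509),
    'Healthcare': (0.1014, 0.0000),
    'Financial Services': (0.1303, 0.3387),
    'Technology': (0.3045, 0.0963),
    'Communication Services': (0.1340, 0.0304),
    'Utilities': (0.0235, 0.0318),
    'Real Estate': (0.0207, 0.0062),
    'Energy': (0.0315, 0.1766),
}

# each index contributes half of the portfolio
blended = {s: sp / 2 + tsx / 2 for s, (sp, tsx) in composition.items()}

# print sectors from largest to smallest blended target weight
for s, w in sorted(blended.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{s:<24}{w:.4f}')
print('total', round(sum(blended.values()), 4))
```

Both columns sum to 1, so the blended target weights also sum to 1.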


In [None]:
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import threading

attributes = ['sector', 'exchange', 'currency', 'marketCap']
start_date = '2023-10-01'
end_date = '2024-10-01'


# Each sector is mapped to (share of S&P500, share of TSX60, S&P sector ticker, TSX60 capped sector ticker)
# Shares are fractions of total index market cap (0.0171 = 1.71%)
# To obtain the shares, run market_by_sector('SP500') and market_by_sector('TSX60'), respectively (defined in the last cell)
# Since the shares only change quarterly, we don't need to run this every time
sectors = {
    'Basic Materials': (0.0171, 0.0849, '^SP500-15', '^GSPTTMT'),
    'Industrials': (0.0719, 0.1311, '^SP500-20', '^GSPTTIN'),
    'Consumer Cyclical': (0.1075, 0.0531, '^SP500-25', '^GSPTTCD'),
    'Consumer Defensive': (0.0576, 0.0509, '^SP500-30', '^GSPTTCS'),
    'Healthcare': (0.1014, 0.0000, '^SP500-35', '^GSPTTHC'),
    'Financial Services': (0.1303, 0.3387, '^SP500-40', '^SPTTFS'),
    'Technology': (0.3045, 0.0963, '^SP500-45', '^SPTTTK'),
    'Communication Services': (0.1340, 0.0304, '^SP500-50', '^GSPTTTS'),
    'Utilities': (0.0235, 0.0318, '^SP500-55', '^GSPTTUT'),
    'Real Estate': (0.0207, 0.0062, '^SP500-60', '^GSPTTRE'),
    'Energy': (0.0315, 0.1766, '^SP500-1010', '^SPTTEN')
}

### Fetch Data

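Here is where we organized the data. We dropped stocks that are not listed in USD or CAD, and filtered out stocks with an average monthly volume below 100,000 shares, counting only months with at least 18 trading days.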
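
The volume rule used in the code cell below (average monthly volume of at least 100,000, using only months with at least 18 trading days) can be tried in isolation on synthetic data. The helper name `passes_volume_filter` and the two constant-volume series are hypothetical and exist only for this sketch.

```python
import pandas as pd

def passes_volume_filter(volume, min_avg=1e5, min_days=18):
    """True if average monthly volume over full months is at least min_avg."""
    monthly = volume.resample('MS').sum()    # total volume per calendar month
    days = volume.resample('MS').size()      # trading days per calendar month
    full = monthly[days >= min_days]         # ignore short (partial) months
    return bool(full.mean() >= min_avg)

# Synthetic daily volumes on business days (hypothetical, not real data)
days = pd.bdate_range('2024-01-01', '2024-03-29')
liquid = pd.Series(10_000, index=days)   # more than 200,000 per month
illiquid = pd.Series(1_000, index=days)  # roughly 20,000 per month

print(passes_volume_filter(liquid), passes_volume_filter(illiquid))
```

One difference from the notebook code: this sketch rejects a stock that has no full month of data, whereas the comparison in `get_data` would keep it.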

In [None]:
# guards writes to the shared data df and history dict across threads
data_lock = threading.Lock()

# adds ticker info to data df and daily returns to history
# if apply_filter is set, drops tickers that are not USD/CAD or that trade too little
def get_data(ticker, data, history, apply_filter):
    yf_ticker = yf.Ticker(ticker)
    yf_data = yf_ticker.info
    # check if stock is CAD or USD
    if(apply_filter and yf_data.get('currency') not in ['USD', 'CAD']):
        with data_lock:
            data.drop(ticker, inplace=True)
        print('Dropped', ticker)
        return
    hist = yf_ticker.history(start=start_date, end=end_date)
    with data_lock:
        for att in attributes:
            if(att not in yf_data):
                print(ticker, 'missing', att)
                continue
            data.loc[ticker, att] = yf_data[att]
        history[ticker] = hist['Close'].pct_change().dropna()
    if(not apply_filter):
        return
    volume = hist['Volume'].resample('MS').sum()
    # only use months with >= 18 trading days for volume calculation
    trading_days = hist['Volume'].resample('MS').size()
    volume = volume[trading_days >= 18]
    # check if stock has at least 100,000 average monthly volume
    if(volume.mean() < 1e5):
        with data_lock:
            data.drop(ticker, inplace=True)

# returns (df containing all ticker info, dict of daily returns per ticker)
def get_tickers(file_name='Tickers.csv', apply_filter=True):
    tickers = pd.read_csv(file_name, header=None)
    data = pd.DataFrame(index=[ticker for ticker in tickers[0]])
    history = {}
    threads = [threading.Thread(target=get_data, args=(ticker,data,history,apply_filter)) for ticker in tickers[0]]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return (data, history)

#### Performance & Correlation Analysis

In [None]:



# returns {stock: {sector: f(df, stock, sector)}} for every stock and every sector present in the index
# index is 0 for S&P500, 1 for TSX60
def calc(data, history, f, index):
    sector_metric = {stock:{} for stock in data.index}
    for sector in sectors:
        for stock in data.index:
            if(sectors[sector][index] == 0):
                continue
            df = pd.DataFrame({stock: history[stock], sector: history[sectors[sector][2+index]]}).dropna()
            # calculate metric given a function f
            sector_metric[stock][sector] = f(df, stock, sector)
    return sector_metric

def beta(df, stock, sector):
    return df[stock].cov(df[sector])/df[sector].var()

def corr(df, stock, sector):
    return df[stock].corr(df[sector])

# returns df containing history for each sector in TSX60
# since historical data for individual TSX60 sectors is unavailable, we take the weighted average of all stocks in each sector
def tsx_sectors():
    data, history = get_tickers('TSX60.csv', False)
    sector_history = pd.DataFrame({sectors[sector][3]: 0 for sector in sectors}, index=history[data.index[0]].index)
    total_market_cap = {sectors[sector][3]: 0 for sector in sectors}
    for stock in history:
        total_market_cap[sectors[data['sector'].loc[stock]][3]] += data['marketCap'].loc[stock]
    for stock in history:
        sector = sectors[data['sector'].loc[stock]][3]
        sector_history[sector] += history[stock]*data['marketCap'].loc[stock]/total_market_cap[sector]
    return sector_history

# returns df containing history for each sector in S&P500
def sp_sectors():
    history = {sectors[sector][2]: yf.Ticker(sectors[sector][2]).history(start=start_date, end=end_date)['Close'].pct_change().dropna() for sector in sectors}
    return pd.DataFrame(history, index=list(history.values())[0].index)

# returns cumulative growth factor since start date (1.05 = up 5%)
def aggregate_pct_change(history, stock):
    return (1+history[stock]).cumprod()

# binary search for the smallest max percentage of a single stock such that the portfolio holds at most max_stocks stocks
def max_percentage(min_pct, max_stocks=24):
    low = min_pct
    high = 1.0
    while(low < high):
        mid = (low+high)/2
        num_stocks = 0
        for sector in sectors:
            for i in range(2):
                half_weight = sectors[sector][i]/2
                # at least one stock per sector, unless the sector is too small to hold even min_pct
                num_stocks += min(max(1, half_weight//mid), half_weight//min_pct)
        if(num_stocks > max_stocks):
            low = mid+0.0001
        else:
            high = mid
    return round(low, 4)

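# allocates half of each sector's index weight (per index) to the stocks most correlated with that sector
def create_portfolio(sector_corr, min_pct, max_pct):
    portfolio = {stock: 0 for stock in sector_corr[0]}
    for sector in sectors:
        for i in range(2):
            if(sectors[sector][i] < min_pct): # also checks if sector has no percentage
                continue
            target = sectors[sector][i]/2
            tot = 0
            j = 0
            best_stocks = sorted(sector_corr[i], key=lambda x: sector_corr[i][x][sector], reverse=True)
            while(tot + max_pct <= target and j < len(best_stocks)):
                v = max_pct
                # the top stock also takes the remainder that does not fill a whole max_pct slot
                if(tot == 0):
                    v += target % max_pct
                portfolio[best_stocks[j]] += v
                tot += v
                j += 1
            # assign only what is still unallocated (not the full target) to the next stock
            if(target - tot > 1e-9 and j < len(best_stocks)):
                portfolio[best_stocks[j]] += target - tot
    return portfolio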

def some_algorithm(sector_to_stock):
    return

MAX_STOCKS = 24
MIN_PCT = 1/(2*MAX_STOCKS)
MAX_PCT = max_percentage(MIN_PCT)
data, history = get_tickers()
tsx_by_sector = tsx_sectors()
for sector in tsx_by_sector:
    history[sector] = tsx_by_sector[sector]
sp_by_sector = sp_sectors()
for sector in sp_by_sector:
    history[sector] = sp_by_sector[sector]
sector_corr = [calc(data, history, corr, 0), calc(data, history, corr, 1)]
create_portfolio(sector_corr, MIN_PCT, MAX_PCT)

Dropped AGN
Dropped CELG
Dropped RTN
Dropped MON


{'AAPL': 0.651,
 'ABBV': 0.1635,
 'ABT': 0.0654,
 'ACN': 0.0654,
 'AIG': 0.16935,
 'AMZN': 0,
 'AXP': 0,
 'BA': 0,
 'BAC': 0,
 'BB.TO': 0,
 'BIIB': 0,
 'BK': 0,
 'BLK': 0,
 'BMY': 0,
 'C': 0,
 'CAT': 0,
 'CL': 0,
 'KO': 0,
 'LLY': 0,
 'LMT': 0,
 'MO': 0,
 'MRK': 0,
 'PEP': 0,
 'PFE': 0,
 'PG': 0,
 'PM': 0,
 'PYPL': 0,
 'QCOM': 0,
 'RY.TO': 0,
 'SHOP.TO': 0,
 'T.TO': 0,
 'TD.TO': 0,
 'TXN': 0,
 'UNH': 0,
 'UNP': 0,
 'UPS': 0,
 'USB': 0}

In [None]:
# guards updates to the shared market cap totals across threads
sector_lock = threading.Lock()

# adds market cap of ticker to its sector's total in data
def ticker_by_sector(ticker, data):
    yf_data = yf.Ticker(ticker).info
    # skip tickers with missing info or a sector we don't track
    if('marketCap' not in yf_data or yf_data.get('sector') not in data):
        print(ticker, 'skipped')
        return
    with sector_lock:
        data[yf_data['sector']] += yf_data['marketCap']

# prints percentage of index in each sector
def market_by_sector(index):
    tickers = pd.read_csv(index+'.csv', header=None)
    data = {sector: 0 for sector in sectors}
    threads = [threading.Thread(target=ticker_by_sector, args=(ticker,data)) for ticker in tickers[0]]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    total = sum(data.values())
    for sector in data:
        print(sector, 'accounts for', round(data[sector]/total*100, 2), 'percent of', index)