# Model Structure

*Figure: Model Structure (1).svg*

# Model Rationale

The overall assumptions behind the CNN portion of the model are as follows:

* The movement in any given industry resembles the patterns or dark spots in black-and-white images
* The movement in a given industry can be predicted from momentum in that industry alone
* Momentum in an industry can be deduced exclusively from adj_close pct_changes for some number of the top companies in that industry
* Momentum in an industry is best interpreted when transformed to a Z-score of performance with a roughly zero mean
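
The Z-score transformation named in the last bullet can be illustrated with a minimal sketch. The return values are made up for illustration, and `zscore` is a hypothetical helper written here in plain NumPy, not the notebook's own code:

```python
import numpy as np

def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up daily industry returns, for illustration only.
returns = np.array([0.01, -0.02, 0.005, 0.03, -0.015])
z = zscore(returns)
print(z.round(3))
```

After the transformation the series has zero mean and unit standard deviation, so days are compared by how unusual the move was and not by its raw size.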

The assumptions behind the voting portion of the model are as follows:

* Performance of AAPL for an upcoming day can be roughly deduced through proxies of some interplay between industries and volatilities
* A model would need to account for large numbers of nonzero values to allow for stock data of companies younger than AAPL (RF/KNN)
* A model would need to use accurate methodology (DNN/RF)
* A model should use less accurate portions or more shuffled data to prevent overfitting and account for black swan / unexplained changes (KNN/RF)
* VIX should be emphasized the same as an industry

Additional assumptions and rounding will be documented as code comments later in the work.



# Downloads and Imports

In [None]:
!pip install pandas-datareader
!pip install yfinance
!pip install fix_yahoo_finance
!pip install matplotlib
!pip install seaborn
!pip install pytickersymbols
!pip install xarray

In [142]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import yfinance as yf
yf.pdr_override()
import sklearn.metrics as met
from pytickersymbols import PyTickerSymbols
from scipy import stats
import statistics
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Retrieving our Final Y

In [123]:
df = yf.download('AAPL', interval='1d', period='max')
aaplY = df['Adj Close'].pct_change().dropna()
del df
aaplY

[*********************100%***********************]  1 of 1 completed


Date
1980-12-15   -0.052170
1980-12-16   -0.073398
1980-12-17    0.024751
1980-12-18    0.028993
1980-12-19    0.061029
                ...   
2022-12-21    0.023809
2022-12-22   -0.023773
2022-12-23   -0.002798
2022-12-27   -0.013878
2022-12-28   -0.030685
Name: Adj Close, Length: 10600, dtype: float64
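
To make the target concrete, here is a minimal sketch of what `pct_change().dropna()` does, using made-up prices in place of the downloaded AAPL data (the name `toy_returns` is hypothetical):

```python
import pandas as pd

# Made-up adjusted closing prices, for illustration only.
prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"]),
)

# pct_change() computes (p_t - p_{t-1}) / p_{t-1}; the first row has no
# predecessor and is NaN, which dropna() removes.
toy_returns = prices.pct_change().dropna()
print(toy_returns)
```

This is why the real series starts one trading day after the first available price.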

# Fetching Industries

Industries are retrieved through the PyTickerSymbols package. In total there are 347 industries, but many repeat or augment one another, or are small.

To compensate, we sort the industries by the number of companies registered under each one and keep only the industries with at least `Minimum_Industry_Size` companies, which leaves **Q** industries.

Assumptions:
* Number of companies is an efficient proxy for industry size
* Industries do not share companies
* Industries overlap marginally
* The top industries encompass most types of industries
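
The overlap assumptions above could be checked directly. A minimal sketch, assuming each industry can be reduced to a set of company identifiers; the industry contents below are made up, and `jaccard` is a hypothetical helper:

```python
from itertools import combinations

# Hypothetical industry -> company-name mapping, for illustration only.
toy_industries = {
    "Financials": {"Bank A", "Bank B", "Insurer C"},
    "Banks": {"Bank A", "Bank B"},
    "Energy": {"Oil D", "Gas E"},
}

def jaccard(a, b):
    """Share of companies two industries have in common (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

overlaps = {
    (first, second): jaccard(toy_industries[first], toy_industries[second])
    for first, second in combinations(toy_industries, 2)
}
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

With real data the sets would be built from the records returned by `stock_data.get_stocks_by_industry`, assuming each record carries a field that identifies the company uniquely. Pairs with a high score would indicate that the industries share more companies than the assumptions allow.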

In [4]:
stock_data = PyTickerSymbols()
# object_methods = [method_name for method_name in dir(stock_data)
                  # if callable(getattr(stock_data, method_name))]
# object_methods
# ind = 'DAX'
# all_ticker_getter_names = list(filter(
#    lambda x: (
#          x.endswith('_yahoo_tickers')
#    ),
#    dir(stock_data),
# ))
# print(all_ticker_getter_names)
# tickers = getattr(stock_data, all_ticker_getter_names[0])()

In [5]:
industries = stock_data.get_all_industries()
print(len(industries))

347


In [6]:
Minimum_Industry_Size = 30

def indsize(val):
  # Number of companies registered under the industry name `val`
  return len(list(stock_data.get_stocks_by_industry(val)))

# Largest industries first
industries.sort(key=indsize, reverse=True)
# Normalize capitalization, then drop duplicates while preserving the sorted order
old_industries = [x.capitalize() for x in industries]
industries = list(dict.fromkeys(old_industries))

# Q = number of industries that meet the minimum size
Q = sum(1 for x in industries if indsize(x) >= Minimum_Industry_Size)
print(f"The following {Q} industries have at least {Minimum_Industry_Size} companies:")
industries[:Q]

The following 28 industries have at least 30 companies:


['Financials',
 'Industrials',
 'Consumer cyclicals',
 'Technology',
 'Healthcare',
 'Banking & investment services',
 'Basic materials',
 'Consumer non-cyclicals',
 'Industrial goods',
 'Software & it services',
 'Real estate',
 'Machinery, equipment & components',
 'Healthcare services',
 'Banking services',
 'Utilities',
 'Industrial & commercial services',
 'Energy',
 'Banks',
 'Cyclical consumer services',
 'Chemicals',
 'Insurance',
 'Technology equipment',
 'Food & beverages',
 'Residential & commercial retis',
 'Healthcare equipment & supplies',
 'Fossil fuels',
 'Retailers',
 'Pharmaceuticals & medical research']

# Generating CNN Data

Idea: the Q industries are chosen by imposing a minimum industry size.

The N x K features are as follows:
*   Top N companies by employee size (that have existed since some barrier date)
*   K columns of percent changes in AdjClose, with the numbers of days being
```
days = [1, 2, 3, 4, 5, 7, 10, 20, 30, 50, 70, 100, 120, 160, 200, 300]
K = len(days)
```

Assumptions:

* K can be chosen arbitrarily, and it is better to have a looser K that is a power of 2 for the benefit of the CNN structure
* Employee size is reflective of market share, and market share is reflective of impact on industry
* N can be arbitrarily capped to match K for the sake of CNN performance / ease
* The large number of NA values, from both download failures and discrepancies in start dates, can be compensated for
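
The feature layout described above can be sketched on toy data. This is a vectorized illustration under stated assumptions, not the notebook's implementation: the tickers `AAA` and `BBB`, their prices, and the shortened window list `toy_days` are all made up.

```python
import numpy as np
import pandas as pd

toy_days = [1, 2, 3]  # shortened window list, for illustration only

# Made-up adjusted closes for two hypothetical tickers.
toy_prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0, 105.0, 107.0],
     "BBB": [50.0, 49.0, 51.0, 52.0, 50.0]},
    index=pd.date_range("2022-01-03", periods=5, freq="B"),
)

# One (dates x tickers) array per window, stacked to (dates, windows, tickers).
cube = np.stack([toy_prices.pct_change(d).to_numpy() for d in toy_days], axis=1)
print(cube.shape)  # (5, 3, 2)

# The "image" for the last date: rows are windows, columns are tickers.
last_image = pd.DataFrame(cube[-1], index=toy_days, columns=toy_prices.columns)
print(last_image)
```

Early dates contain NaN wherever a window reaches back before the first price, which is one source of the NA values mentioned in the last assumption.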

Future Direction:

* Determine the 'best' K companies by first filtering for companies with stock data at least as old as AAPL's and then sorting by employee size or market cap
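
One way this filter-then-sort step might look is sketched below. Everything here is assumed for illustration: the tickers, prices, employee counts, and the reference date standing in for AAPL's first trading day.

```python
import pandas as pd

# Made-up price histories: NaN marks dates before a ticker started trading.
idx = pd.date_range("1980-12-12", periods=4, freq="B")
toy_history = pd.DataFrame(
    {"OLD1": [10.0, 11.0, 12.0, 13.0],
     "OLD2": [5.0, 5.5, 5.2, 5.1],
     "NEW1": [float("nan"), float("nan"), 20.0, 21.0]},
    index=idx,
)
# Hypothetical employee counts, for illustration only.
toy_employees = {"OLD1": 1_000, "OLD2": 50_000, "NEW1": 900_000}

reference_start = idx[0]  # stand-in for AAPL's first trading day

# Keep tickers whose first valid price is on or before the reference date.
old_enough = [
    ticker for ticker in toy_history.columns
    if toy_history[ticker].first_valid_index() <= reference_start
]
# Rank the survivors by employee count, largest first.
ranked = sorted(old_enough, key=toy_employees.get, reverse=True)
print(ranked)  # ['OLD2', 'OLD1']
```

`NEW1` has the most employees but is dropped because its history starts too late, which is the trade-off this direction accepts in exchange for fewer NA values.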


In [166]:
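days = [1, 2, 3, 4, 5, 7, 10, 20, 30, 50, 70, 100, 120, 160, 200, 300]

def generate_cnn_x_and_y(tickers, days=days):
  df = yf.download(tickers)['Adj Close']
  # One DataFrame of d-day percent changes per look-back window d
  pChange = pd.Series([df.pct_change(d) for d in days], index=days)
  # Regroup by date: one (windows x tickers) DataFrame per trading day
  pChangeRestructured = pd.Series([pd.DataFrame([pChange[d].loc[ind] for d in days], index=days) for ind in df.index])
  pChangeRestructured.index = df.index
  # Shift by one day so that each date holds the previous day's features
  x = pChangeRestructured.shift().dropna()

  y = generate_cnn_y(x)

  # Drop the first remaining row of both series
  x, y = x.iloc[1:], y.iloc[1:]
  return x, y

def generate_cnn_y(x):
  y = []
  for z in x:
    # Row 1 holds the 1-day percent changes; average them across tickers
    t = z.loc[1].dropna()
    if len(t) == 0:
      y.append(0)
    else:
      t_avg = sum(t)/len(t)
      y.append(t_avg)
  y = pd.Series(y, index = x.index)
  mean = statistics.mean(y)
  # Standardize to a Z-score; the mean reported below is removed by this step
  y = stats.zscore(y)
  print("There is inserted bias because the mean is "+str(mean)+ " and not 0")
  return y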

In [167]:
K = len(days)

company_list = list(stock_data.get_stocks_by_industry(industries[0]))

def employee_size(val):
  # Treat a missing employee count as 0 so the company sorts last
  if val['metadata']['employees'] == "":
    return 0
  else:
    return val['metadata']['employees']

# Largest employers first
company_list.sort(key=employee_size, reverse=True)
company_tickers_all_currencies = [x['symbols'] for x in company_list]

def usd_ticker(val):
  temp = []
  for x in val:
    # Take the first USD-denominated Yahoo ticker of each company, if any
    for y in x:
      if y['currency'] == 'USD':
        temp.append(y['yahoo'])
        break
  print(len(temp))
  # Cap N at K so the feature matrix is square
  return temp[:K]
company_tickers = usd_ticker(company_tickers_all_currencies)
company_tickers

150


['BRK-B',
 'WFC',
 'JPM',
 'HSBC',
 'BAC',
 'C',
 'BNPQF',
 'AZSEY',
 'SCGLF',
 'CRARF',
 'AXAHF',
 'BBVA',
 'CBRE',
 'SCBFF',
 'DB',
 'INPTF']

In [168]:
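def aapl_test_train_split(x, train=0.9):
  # First date of the test set: the point `train` of the way through AAPL's history
  ind = aaplY[int(len(aaplY)*train):].index[0]
  # Label-based slices include `ind` on both sides
  x_train, x_test = x[:ind], x[ind:]
  # Drop the last training row so `ind` appears only in the test set
  x_train = x_train[:len(x_train)-1]
  return x_train, x_test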
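
The split relies on label-based slices including both endpoints, which is why one row is trimmed from the training part. A minimal sketch on made-up data shows the effect; `chronological_split` is a hypothetical variant that takes the cut date from the series itself, whereas the function above takes it from `aaplY`:

```python
import pandas as pd

toy = pd.Series(range(10), index=pd.date_range("2022-01-03", periods=10, freq="B"))

def chronological_split(series, train=0.9):
    """Split a date-indexed series into non-overlapping train and test parts."""
    cut = series.index[int(len(series) * train)]  # first date of the test set
    # Label slices include both endpoints, so drop the cut date from train.
    return series.loc[:cut].iloc[:-1], series.loc[cut:]

toy_train, toy_test = chronological_split(toy, train=0.8)
print(len(toy_train), len(toy_test))  # 8 2
```

Taking the cut date from `aaplY`, as the notebook does, has the advantage that every industry is split at the same calendar date.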

In [200]:
x, y = generate_cnn_x_and_y(company_tickers)

[*********************100%***********************]  16 of 16 completed

2 Failed downloads:
- INPTF: No timezone found, symbol may be delisted
- AZSEY: No timezone found, symbol may be delisted
There is inserted bias because the mean is 0.000548149000542094 and not 0


In [170]:
x_train, x_test = aapl_test_train_split(x)
y_train, y_test = aapl_test_train_split(y)

In [171]:
x_train.iloc[0].shape

(16, 16)

# Building the Model

Future Direction:

* Patching DF → TF conversion issues
* Generating x/y_train from all Q industries
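
The first bullet might be approached along these lines. This is a minimal sketch, assuming that every sample is a DataFrame of identical shape and that missing values may be replaced by zero; both are assumptions, and `frames_to_tensor` is a hypothetical helper that is not part of the notebook:

```python
import numpy as np
import pandas as pd

def frames_to_tensor(frames, fill_value=0.0):
    """Stack a Series of equally shaped DataFrames into (samples, rows, cols, 1)."""
    stacked = np.stack([frame.to_numpy(dtype="float32") for frame in frames])
    # Assumption: NaN becomes fill_value (nan_to_num also maps +/-inf to
    # large finite values).
    stacked = np.nan_to_num(stacked, nan=fill_value)
    return stacked[..., np.newaxis]  # trailing channel axis for Conv2D

# Two made-up 2 x 2 "images", one with a missing value.
toy_frames = pd.Series([
    pd.DataFrame([[0.1, np.nan], [0.2, 0.3]]),
    pd.DataFrame([[0.0, 0.4], [0.5, 0.6]]),
])
tensor = frames_to_tensor(toy_frames)
print(tensor.shape, tensor.dtype)  # (2, 2, 2, 1) float32
```

With the notebook's data the call would be `frames_to_tensor(x_train)`, which should give an array of shape `(samples, 16, 16, 1)`. Whether zero is a sensible fill value for tickers that failed to download is a modelling choice left open here.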

In [177]:
K = x_train.iloc[0].shape[0]
input_shape = (K, K, 1)
model = models.Sequential()
model.add(layers.Conv2D(K, (3,3), activation='relu', input_shape=input_shape, padding='same'))
# model.add(layers.Conv2D(2*K, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(2*K, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(2*K, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(10, activation='tanh'))
model.add(layers.Dense(1))
model.summary()

Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_38 (Conv2D)          (None, 16, 16, 16)        160       
                                                                 
 max_pooling2d_24 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 conv2d_39 (Conv2D)          (None, 6, 6, 32)          4640      
                                                                 
 max_pooling2d_25 (MaxPoolin  (None, 3, 3, 32)         0         
 g2D)                                                            
                                                                 
 conv2d_40 (Conv2D)          (None, 1, 1, 32)          9248      
                                                                 
 flatten_2 (Flatten)         (None, 32)              
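
The parameter counts and output shapes in the summary can be checked by hand. The sketch below reproduces them with two hypothetical helpers, `conv2d_params` and `valid_conv_size`, which are plain arithmetic and not Keras functions:

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters

def valid_conv_size(size, kernel):
    """Output side length of an unpadded ('valid'), stride-1 convolution."""
    return size - kernel + 1

K = 16
print(conv2d_params(3, 1, K))          # 160
print(conv2d_params(3, K, 2 * K))      # 4640
print(conv2d_params(3, 2 * K, 2 * K))  # 9248

side = K // 2                         # 16 -> 8 after the first 2x2 pooling
side = valid_conv_size(side, 3) // 2  # 8 -> 6 after conv, 6 -> 3 after pooling
side = valid_conv_size(side, 3)       # 3 -> 1 after the last conv
print(side)  # 1
```

The first convolution uses `padding='same'` and therefore keeps the 16 x 16 size; the later ones are unpadded, which is why the spatial size shrinks to 1 x 1 before `Flatten`.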

In [199]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.MeanAbsolutePercentageError(),
              metrics=['mae'])  # accuracy is not meaningful for a continuous target

history = model.fit(x_train, y_train, epochs=10)

ValueError: ignored
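
Independently of the error above, one property of the chosen loss is worth checking. The target is a Z-score, so many of its values lie close to zero, and a percentage error divides by the size of the target. The sketch below uses plain NumPy versions of both metrics (not the Keras implementations) and made-up numbers to show the effect:

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error; eps guards against division by zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

# Made-up standardized targets; the prediction is off by 0.1 everywhere.
toy_true = np.array([1.5, -0.8, 0.001])
toy_pred = toy_true + 0.1

print(mae(toy_true, toy_pred))   # about 0.1
print(mape(toy_true, toy_pred))  # dominated by the near-zero target
```

The same absolute error yields a percentage error in the thousands because of the single near-zero target, so an absolute or squared error may be a more stable choice for a standardized target.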