# VERITAS AI

An immersive virtual AI program for ambitious high school students, founded and run by Harvard graduate students. For more information, see [Veritas AI](https://www.veritasai.com/).

<p align="center">
    <a href="https://www.veritasai.com/">
        <img src="https://images.squarespace-cdn.com/content/v1/61e9374e0434354049a258f9/f89e8510-6098-4747-a89e-0cd34fe13376/Veritas_Fellowship+MP+White+copy+2.png" width="400" height="400"/>
    </a>
</p>

## Content: All Slides and Code

The Colab notebooks for the code walkthroughs of the first 8 weeks are linked below. **PLEASE DO NOT EDIT THE SCRIPT DIRECTLY. REMEMBER TO SAVE AS A COPY IF YOU WANT TO RUN THE CODE**

Section | Name | Links
--- | --- | ---
1 | **Intro to Basic Python, Numpy, and Pandas** | [Code](https://colab.research.google.com/drive/1z-0Z852bFOUNYhve0wbsCcRJCZpCqcqp?usp=sharing)
2 | **Exploratory Data Analysis** | [Code](https://colab.research.google.com/drive/1VaSC8CsBxAn5JcnN2I5g2YP1m9aWBBMO?usp=sharing)
3 | **Basics in Linear Regression** | [Code](https://colab.research.google.com/drive/1HC4netVsOZT1BHjyUNcu8u8f1Etfoi2b?usp=sharing)
4 | **Basics in Logistic Regression** | [Code](https://colab.research.google.com/drive/1lm5nv5ULZqJBklZJMyRonJGyk0sNjWGi?usp=sharing)
5 | **Intro to Neural Networks** | [Code](https://colab.research.google.com/drive/1OGxD35fQxXdWNDwYGYpZcES5zgo6UvdO?usp=sharing)
6 | **Intro to Convolutional Neural Networks** | [Code](https://colab.research.google.com/drive/1YAjTioD_wUuRIKhacUDpH2TC6HEOfV5u?usp=sharing)
7 | **More in Deep Neural Networks** | [Code](https://colab.research.google.com/drive/1uM7CzWPLifgf16aVtJhWeOGN6A0guvlv?usp=sharing)
8 | **Advanced Convolutional Neural Networks** | [Code](https://colab.research.google.com/drive/1duxfMW5sQbx91c0BJWiP2oSHVdtcbtND?usp=sharing)

REMEMBER: **PLEASE DO NOT EDIT THE SCRIPT DIRECTLY. REMEMBER TO SAVE AS A COPY IF YOU WANT TO RUN THE CODE**


# Factor Model Portfolio Construction


![image](https://paytmblogcdn.paytm.com/wp-content/uploads/2024/08/Blog_Paytm_Portfolio-Diversification.webp)

Factor models, as introduced in the Week 6 slides and walkthrough, rely on conceptually investing in a collection of market signals (High Yield, Inflation Protection, World Equities, etc.) to construct an effective portfolio. In our example, we used a baseline OLS model to produce a return model. Using techniques from later weeks (or other techniques you research and find code for), improve on this baseline and see if you can produce a stronger representation of returns from factors.

- Algorithms: Linear Regression, Ridge Regression, LASSO, Random Forest

- Difficulty: Flexible! This dataset has many features, so the group should start with some simple models and build up in difficulty as time allows.
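
One way to start on this task is to score several candidate models on the same cross-validation folds and compare their out-of-sample R². The sketch below is a minimal illustration on synthetic data, not the course dataset: the factor matrix, the true loadings, the penalty values, and the fold count are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic monthly data (an assumption for illustration, not the course dataset):
# 240 months, 5 factors, and an asset whose return loads on only two of them.
rng = np.random.default_rng(0)
n_months, n_factors = 240, 5
factors = rng.normal(0.0, 0.03, size=(n_months, n_factors))
true_loadings = np.array([0.8, 0.0, 0.3, 0.0, 0.0])
asset = factors @ true_loadings + rng.normal(0.0, 0.01, size=n_months)

# Candidate models; the hyperparameters are placeholders, not tuned values.
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=0.01),
    'LASSO': Lasso(alpha=1e-4),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score every model on the same 5 folds so the numbers are comparable.
cv = KFold(n_splits=5)
scores = {}
for name, model in models.items():
    fold_r2 = cross_val_score(model, factors, asset, cv=cv, scoring='r2')
    scores[name] = fold_r2.mean()
    print(f'{name:>13}: mean out-of-sample R^2 = {scores[name]:.3f}')
```

On real return data, the same loop can be pointed at the factor and asset columns of the dataset; the penalties would then need tuning rather than being fixed by hand.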

# A second look at old content

## Reload factor model code from Weeks 6 & 7

In [None]:
#@title Configuration File

#Configuration File

#Do not change; these settings are for the IPython notebook demonstration

#Path for data
dataPath = 'https://github.com/nick-dahl/AI_Finance/blob/main/Data2016.csv?raw=true'

#Define the factor names
#factorName = ['World Equities Excess Return','Treasury Bond Excess Return','Default Risk Premium','Inflation Protection','Currency Protection']
factorName = ['World Equities', '10-year US Treasuries', 'High Yield', 'Inflation Protection', 'Currency Protection']

#Names of assets
#assetName = ['US Equities Excess Return','Real Estate Excess Return','Commodities']
assetName = ['SP500 Total Return','International Equity','U.S. Treasury 20 years', 'Corporate Bond','Real Estate', 'Commodity', 'TIPS']

#Name of date column
dateName = 'Date'

#User Analysis Section.  Change the variables in this section to run user specific analysis

#isDemo is a boolean variable, set to True if the user wants to run custom analysis
isDemo = False

#dataPathUser: Path to User Defined Data
dataPathUser = 'https://github.com/nick-dahl/AI_Finance/blob/main/Data2016.csv?raw=true'

#factorNameUser: List, defines the factors
factorNameUser = ['World Equities', '10-year US Treasuries', 'High Yield', 'Inflation Protection', 'Currency Protection']

#assetNameUser: List, defines the asset to be used
assetNameUser = 'Commodity'

#dateName: string, date column
dateNameUser = 'Date'

#lambdaHatUser: float, penalty term of user-defined LASSO regression
lambdaHatUser = .00005

#Start and End Dates for the Analysis
startDateUser = '1997-03-01'
endDateUser = '2014-12-01'

#Stuff for optional part of user section

#Best Subset Regression Related
maxVarsUser = 1

#Elastic Net Related
numL1RatioUser = 10
numAlphasUser = 20
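
To see what the LASSO penalty does, the sketch below fits scikit-learn's `Lasso` at a few penalty values on synthetic data. In scikit-learn the penalty is called `alpha`, and the objective is `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so a large enough value pushes every loading to exactly zero. The data, the true loadings, and the two larger penalty values are assumptions for illustration; only `0.00005` is taken from `lambdaHatUser` above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption): 200 months, 5 factors, asset loads on factors 0 and 2.
rng = np.random.default_rng(1)
n_months = 200
factors = rng.normal(0.0, 0.03, size=(n_months, 5))
asset = 0.9 * factors[:, 0] + 0.2 * factors[:, 2] + rng.normal(0.0, 0.005, size=n_months)

# Penalties to try; 0.00005 matches lambdaHatUser, the others are illustrative.
penalties = [0.00005, 0.0005, 0.005]
nonzero_counts = []
for penalty in penalties:
    model = Lasso(alpha=penalty, fit_intercept=True).fit(factors, asset)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f'alpha={penalty:g}: loadings={np.round(model.coef_, 3)}')

print('non-zero loadings per penalty:', nonzero_counts)
```

Because monthly returns are small numbers, useful penalties are small too, which is why the configuration uses a value as low as `0.00005`.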

In [None]:
#@title Library for Factor-Model usage

#Library for Factor-Model usage

import numpy as np #for numerical array data
import pandas as pd #for tabular data
from scipy.optimize import minimize
import matplotlib.pyplot as plt #for plotting purposes
import time

import cvxpy as cp
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

#Plotting Functions
def plot_returns(data, names, flag='Total Return', date='Date', printFinalVals = False):
    '''plot_returns returns a plot of the returns
    INPUTS:
        names: string, name of column to be plotted, or list, in which case it plots all of them
        data: pd dataframe, where the data is housed
        flag: string, Either Total Return or Return
        date: string, column name corresponding to the date variable
        printFinalVals: Boolean, if True, prints the final Total Return
    Outputs:
        a plot'''
    #Clean Inputs:
    if(date not in data.columns):
        print ('date column not in the pandas df')
        return
    if(type(names) is str):
        names = [names]
    for name in names:
        if(name not in data.columns):
            print ('column ' + name + ' not in pandas df')
            return
    #If the inputs are clean, create the plot
    data = data.sort_values(date).copy()
    data.reset_index(drop=True, inplace=True)
    data[date] = pd.to_datetime(data[date])

    if (flag == 'Total Return'):
        n = data.shape[0]
        totalReturns = np.zeros((n,len(names)))
        totalReturns[0,:] = 1.
        for i in range(1,n):
            totalReturns[i,:] = np.multiply(totalReturns[i-1,:], (1+data[names].values[i,:]))
        for j in range(len(names)):
            plt.semilogy(data[date], totalReturns[:,j])

        plt.title('Total Return Over Time')
        plt.ylabel('Total Return')
        plt.legend(names)
        plt.xlabel('Date')
        plt.show()
        if(printFinalVals):
            print(totalReturns[-1])
    elif (flag == 'Return'):
        for i in range(len(names)):
            plt.plot(data[date], data[names[i]])
        plt.title('Returns Over Time')
        plt.ylabel('Returns')
        plt.legend(names)
        plt.xlabel('Date')
        plt.show()
    else:
        print ('flag variable must be either Total Return or Return')


#Helper Functions
def create_options():
    '''create standard options dictionary to be used as input to regression functions'''
    options = dict()
    options['time_period'] = 'all'
    options['date'] = 'Date'
    options['return_model'] = False
    options['print_loadings'] = True
    return options

def create_options_lasso():
    options = create_options()
    options['lambda_hat'] = .5
    return options

def create_options_ridge():
    options = create_options()
    options['lambda'] = 1
    return options

def create_options_cv_lasso():
    options = create_options()
    options['max_lambda_hat'] = 1
    options['n_lambda_hat'] = 200
    options['random_state'] = 7777
    options['n_folds'] = 10
    return options

def create_options_cv_ridge():
    options = create_options()
    options['max_lambda'] = 1
    options['n_lambda'] = 100
    options['random_state'] = 7777
    options['n_folds'] = 10
    return options

def create_options_cv_elastic_net():
    options = create_options()
    options['max_lambda_hat'] = 1
    options['max_l1_ratio'] = .99
    options['n_lambda_hat'] = 100
    options['n_l1_ratio'] = 20
    options['random_state'] = 7777
    options['n_folds'] = 10
    return options

def create_options_best_subset():
    '''create standard options dictionary to be used as input to regression functions'''
    options = create_options()
    options['return_model'] = False
    options['print_loadings'] = True
    options['max_vars'] = 3
    return options


def create_dictionary_for_analysis(method, methodDict=None):
    '''create_dictionary_for_analysis creates the options dictionary that can be used as an input to a factor model
    INPUTS:
        method: string, defines the method
        methodDict: dictionary (optional), keys are specific options the user wants to specify, values are the values of those options
    OUTPUTS:
        options: dictionary, the default options for the method with the entries of methodDict applied on top
    '''
    if(method == 'OLS'):
        options = create_options()
    elif(method == 'CVLasso'):
        options = create_options_cv_lasso()
    elif(method == 'CVRidge'):
        options = create_options_cv_ridge()
    elif(method == 'CVElasticNet'):
        options = create_options_cv_elastic_net()
    elif(method == 'BestSubset'):
        options = create_options_best_subset()
    elif(method == 'RelaxedLasso'):
        #NOTE: create_options_relaxed_lasso is not defined in this cell
        options = create_options_relaxed_lasso()
    else:
        print('Bad Method Specification for Train')
        return
    #Use the same snake_case keys that the regression functions read
    options['return_model'] = True
    options['print_loadings'] = False
    options['date'] = 'DataDate'
    #methodDict defaults to None, so only apply overrides when one is given
    if(methodDict is not None):
        for key in methodDict:
            options[key] = methodDict[key]
    return options



def print_timeperiod(data, dependentVar, options):
    '''print_timeperiod takes a dependent variable and an options dictionary, prints out the time period
    INPUTS:
        data: pandas df, df with the data
        dependentVar: string, name of dependent variable
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
    OUTPUTS:
        printed stuff
    '''
    print ('Dependent Variable is ' + dependentVar)
    if(options['time_period'] == 'all'):
        sortedValues = data.sort_values(options['date'])[options['date']].reset_index(drop=True)
        n = sortedValues.shape[0]
        beginDate = sortedValues[0]
        endDate = sortedValues[n-1]
        print ('Time period is between ' + num_to_month(beginDate.month) +  ' ' + str(beginDate.year) + ' to ' + num_to_month(endDate.month) +  ' ' + str(endDate.year) + ' inclusive   ')
    else:
        print ('Time period is ' + options['time_period'])

def display_factor_loadings(intercept, coefs, factorNames, options):
    '''display_factor_loadings takes an intercept, coefs, factorNames and options dict, and prints the factor loadings in a readable way
    INPUTS:
        intercept: float, intercept value
        coefs: np array, coefficients from the fitted regression
        factorNames: list, names of the factors
        options: dict, can contain the key name_of_reg
            name_of_reg: string, name for the regression, 'No Name' is used if it is missing
    Outputs:
        output is printed
    '''
    loadings = np.insert(coefs, 0, intercept)
    if('name_of_reg' not in options.keys()):
        name = 'No Name'
    else:
        name = options['name_of_reg']
    out = pd.DataFrame(loadings, columns=[name])
    out = out.transpose()
    fullNames = ['Intercept'] + factorNames
    out.columns = fullNames
    print(out)

def best_subset(x,y,l_0):
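    # Mixed Integer Programming in feature selection
    # M is a big-M bound: -M*z_j <= beta_j <= M*z_j forces beta_j to 0 when z_j = 0,
    # and sum(z) <= l_0 allows at most l_0 non-zero loadings
    M = 1000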
    n_factor = x.shape[1]
    z = cp.Variable(n_factor, boolean=True)
    beta = cp.Variable(n_factor)
    alpha = cp.Variable(1)

    def MIP_obj(x,y,b,a):
        return cp.norm(y-cp.matmul(x,b)-a,2)

    best_subset_prob = cp.Problem(cp.Minimize(MIP_obj(x, y, beta, alpha)),
                             [cp.sum(z)<=l_0, beta+M*z>=0, M*z>=beta])
    best_subset_prob.solve(solver='ECOS_BB')
    return alpha.value, beta.value


#First function, linear factor model build
def linear_regression(data, dependentVar, factorNames, options):
    '''linear_regression takes in a dataset and returns the factor loadings using least squares regression
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
            return_model: boolean, if true, returns model
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    #first filter down to the time period
    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

    #perform linear regression
    linReg = LinearRegression(fit_intercept=True)
    linReg.fit(newData[factorNames], newData[dependentVar])

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        # Now print the factor loadings
        display_factor_loadings(linReg.intercept_, linReg.coef_, factorNames, options)

    if(options['return_model']):
        return linReg


def lasso_regression(data, dependentVar, factorNames, options):
    '''lasso_regression takes in a dataset and returns the factor loadings using lasso regression
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            print_loadings: boolean, if true, prints the coefficients

            date: name of datecol
            return_model: boolean, if true, returns model
            lambda_hat: float, penalty for LASSO regression
            NOTE: sklearn calls lambda alpha.  lambda_hat is passed to sklearn's Lasso as alpha without rescaling
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    if('lambda_hat' not in options.keys()):
        print ('lambda_hat not specified in options')
        return

    #first filter down to the time period
    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

    #perform lasso regression
    lassoReg = Lasso(alpha=options['lambda_hat'], fit_intercept=True)
    lassoReg.fit(newData[factorNames], newData[dependentVar])

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        print('lambda_hat = ' + str(options['lambda_hat']))

        #Now print the factor loadings
        display_factor_loadings(lassoReg.intercept_, lassoReg.coef_, factorNames, options)

    if(options['return_model']):
        return lassoReg

def ridge_regression(data, dependentVar, factorNames, options):
    '''ridge_regression takes in a dataset and returns the factor loadings using ridge regression
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
            return_model: boolean, if true, returns model
            lambda: float, penalty for Ridge regression (sklearn calls it alpha)
            print_loadings: boolean, if true, prints the coefficients
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    if('lambda' not in options.keys()):
        print ('lambda not specified in options')
        return

    #first filter down to the time period
    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

    #perform ridge regression
    ridgeReg = Ridge(alpha=options['lambda'], fit_intercept=True)
    ridgeReg.fit(newData[factorNames], newData[dependentVar])

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        print('lambda = ' + str(options['lambda']))

        #Now print the factor loadings
        display_factor_loadings(ridgeReg.intercept_, ridgeReg.coef_, factorNames, options)

    if(options['return_model']):
        return ridgeReg

def best_subset_regression(data, dependentVar, factorNames, options):
    '''best_subset_regression takes in a dataset and returns the factor loadings using best subset regression
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
            return_model: boolean, if true, returns model
            max_vars: int, maximum number of factors that can have a non zero loading in the resulting regression
            print_loadings: boolean, if true, prints the coefficients
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    #Check dictionary for maxVars option
    if('max_vars' not in options.keys()):
        print ('max_vars not specified in options')
        return

    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

    #perform best subset regression on the filtered data
    alpha, beta = best_subset(newData[factorNames].values, newData[dependentVar].values, options['max_vars'])
    #round beta values to zero
    beta[np.abs(beta) <= 1e-7] = 0.0

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        print('Max Number of Non-Zero Variables is ' + str(options['max_vars']))

        #Now print the factor loadings
        display_factor_loadings(alpha, beta, factorNames, options)

    if(options['return_model']):
        out = LinearRegression()
        out.intercept_ = alpha[0]
        out.coef_ = beta
        return out

def cross_validated_lasso_regression(data, dependentVar, factorNames, options):
    '''cross_validated_lasso_regression takes in a dataset and returns the factor loadings using lasso regression and cross validating the choice of lambda
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
            return_model: boolean, if true, returns model
            print_loadings: boolean, if true, prints the coefficients

            max_lambda_hat: float, max lambda value passed
            n_lambda_hat: int, number of lambda values to try
            random_state: integer, sets random state seed
            n_folds: number of folds
            NOTE: sklearn calls lambda alpha.  The lambda_hat grid is passed to sklearn's Lasso as alpha without rescaling
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    #Test timeperiod
    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

    #Do CV Lasso
    alphas = np.logspace(-12, np.log(options['max_lambda_hat']), base=np.exp(1), num=options['n_lambda_hat'])
    #alphas = np.linspace(1e-12, alphaMax, options['nAlphas'])
    if(options['random_state'] == 'none'):
        lassoTest = Lasso(fit_intercept=True)
    else:
        lassoTest = Lasso(random_state = options['random_state'], fit_intercept=True)

    tuned_parameters = [{'alpha': alphas}]

    clf = GridSearchCV(lassoTest, tuned_parameters, cv=options['n_folds'], refit=True)
    clf.fit(newData[factorNames],newData[dependentVar])
    lassoBest = clf.best_estimator_
    alphaBest = clf.best_params_['alpha']

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        print('Best lambda_hat = ' + str(alphaBest))
        #Now print the factor loadings
        display_factor_loadings(lassoBest.intercept_, lassoBest.coef_, factorNames, options)

    if(options['return_model']):
        return lassoBest

def cross_validated_ridge_regression(data, dependentVar, factorNames, options):
    '''cross_validated_ridge_regression takes in a dataset and returns the factor loadings using ridge regression and choosing lambda via cross validation
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
            return_model: boolean, if true, returns model
            print_loadings: boolean, if true, prints the coefficients

            max_lambda: float, max lambda value passed
            n_lambda: int, number of lambda values to try
            random_state: integer, sets random state seed
            n_folds: number of folds
            NOTE: sklearn calls lambda alpha.  So I change lambda -> alpha in the following code
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    #Test timeperiod
    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

    #Do CV Ridge
    alphaMax = options['max_lambda']
    alphas = np.logspace(-12, np.log(alphaMax), num=options['n_lambda'], base=np.exp(1))
    #create_options_cv_ridge stores the seed under 'random_state'
    if(options['random_state'] == 'none'):
        ridgeTest = Ridge(fit_intercept=True)
    else:
        ridgeTest = Ridge(random_state = options['random_state'], fit_intercept=True)

    tuned_parameters = [{'alpha': alphas}]

    clf = GridSearchCV(ridgeTest, tuned_parameters, cv=options['n_folds'], refit=True)
    clf.fit(newData[factorNames],newData[dependentVar])
    ridgeBest = clf.best_estimator_
    alphaBest = clf.best_params_['alpha']

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        print('Best Lambda = ' + str(alphaBest))
        #Now print the factor loadings
        display_factor_loadings(ridgeBest.intercept_, ridgeBest.coef_, factorNames, options)

    if(options['return_model']):
        return ridgeBest

def cross_validated_elastic_net_regression(data, dependentVar, factorNames, options):
    '''cross_validated_elastic_net_regression takes in a dataset and returns the factor loadings using elastic net, also chooses alpha and l1 ratio via cross validation
    INPUTS:
        data: pandas df, data matrix, should contain the date column and all of the factorNames columns
        dependentVar: string, name of dependent variable
        factorNames: list, elements should be strings, names of the independent variables
        options: dictionary, should contain at least two elements, time_period, and date
            time_period: string, if == all, means use entire dataframe, otherwise filter the df on this value
            date: name of datecol
            return_model: boolean, if true, returns model
            print_loadings: boolean, if true, prints the coefficients

            max_lambda_hat: float, max lambda value passed
            n_lambda_hat: int, number of lambda values to try
            max_l1_ratio: float, max l1 ratio value passed
            n_l1_ratio: int, number of l1 ratio values to try
            random_state: integer, sets random state seed
            n_folds: number of folds
            NOTE: sklearn calls lambda alpha.  So I change lambda -> alpha in the following code
    Outputs:
        reg: regression object from scikit-learn
        also prints what was desired
    '''
    #Test timeperiod
    if(options['time_period'] == 'all'):
        newData = data.copy()
    else:
        newData = data.copy()
        newData = newData.query(options['time_period'])

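    #Do CV Elastic Net
    alphaMax = options['max_lambda_hat']
    #natural-log grid from e^-12 up to alphaMax, matching the LASSO and Ridge grids
    alphas = np.logspace(-12, np.log(alphaMax), num=options['n_lambda_hat'], base=np.exp(1))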
    l1RatioMax = options['max_l1_ratio']
    l1Ratios = np.linspace(1e-6, l1RatioMax, options['n_l1_ratio'])
    if(options['random_state'] == 'none'):
        elasticNetTest = ElasticNet(fit_intercept=True)
    else:
        elasticNetTest = ElasticNet(random_state = options['random_state'], fit_intercept=True)

    tuned_parameters = [{'alpha': alphas, 'l1_ratio': l1Ratios}]

    clf = GridSearchCV(elasticNetTest, tuned_parameters, cv=options['n_folds'], refit=True)
    clf.fit(newData[factorNames],newData[dependentVar])
    elasticNetBest = clf.best_estimator_
    alphaBest = clf.best_params_['alpha']
    l1RatioBest = clf.best_params_['l1_ratio']

    if (options['print_loadings'] == True):
        #Now print the results
        print_timeperiod(newData, dependentVar, options)
        print('Best lambda_hat = ' + str(alphaBest))
        print('Best l1 ratio = ' + str(l1RatioBest))
        #Now print the factor loadings
        display_factor_loadings(elasticNetBest.intercept_, elasticNetBest.coef_, factorNames, options)

    if(options['return_model']):
        return elasticNetBest


def run_factor_model(data, dependentVar, factorNames, method, options):
    '''run_factor_model allows you to specify the method to create a model, returns a model object according to the method you chose
    INPUTS:
        data: pandas df, must contain the columns specified in factorNames and dependentVar
        dependentVar: string, dependent variable
        factorNames: list of strings, names of independent variables
        method: string, name of method to be used.  Supports OLS, LASSO, Ridge, CVLasso, CVRidge, CVElasticNet and BestSubset
        options: dictionary object, controls the hyperparameters of the method
    Outputs:
        out: model object'''

    #Make sure the options dictionary has the correct settings
    options['return_model'] = True
    options['print_loadings'] = False

    #Now create the appropriate model
    if (method == 'OLS'): #run linear model
        return linear_regression(data, dependentVar, factorNames, options)
    if (method == 'LASSO'):
        return lasso_regression(data, dependentVar, factorNames, options)
    if (method == 'Ridge'):
        return ridge_regression(data, dependentVar, factorNames, options)
    if (method == 'CVLasso'):
        return cross_validated_lasso_regression(data, dependentVar, factorNames, options)
    if (method == 'CVRidge'):
        return cross_validated_ridge_regression(data, dependentVar, factorNames, options)
    if (method == 'CVElasticNet'):
        return cross_validated_elastic_net_regression(data, dependentVar, factorNames, options)
    if (method == 'BestSubset'):
        return best_subset_regression(data, dependentVar, factorNames, options)
    else:
        print ('Method ' + method + ' not supported')

# Function to create a time series of factor loadings using a trailing window
def compute_trailing_factor_regressions(data, dependentVar, factorNames, window, method, options, dateCol='Date', printTime=False):
    '''compute_trailing_factor_regressions computes the factor regressions using a trailing window, returns a pandas df object
    INPUTS:
        data: pandas df, must contain the columns dependentVar, and the set of columns factorNames
        dependentVar: string, names the dependent variable, must be a column in the dataframe data
        factorNames: list of string, elements must be columns in the dataframe data
        window: int, lookback window, measured in number of unique dates (trading days for daily data)
        method: string, any method supported by run_factor_model, e.g. OLS, LASSO or CVLasso
        options: dictionary, options dictionary
        dateCol: string, name of date column, also must be included in data
        printTime: boolean, if True, prints time it took to run the regressions
    Outputs:
        regressionValues: pandas df, rows should be different dates, columns should be factor loadings calculated using the trailing window
    '''
    if(printTime):
        start = time.time()
    options['return_model'] = True
    options['print_loadings'] = False
    days = list(np.sort(data[dateCol].unique()))
    listOfFactorsAndDate = [dateCol] + factorNames
    regressionValues = pd.DataFrame(columns=listOfFactorsAndDate)
    newRows = []
    for i in range(window, len(days)):
        #Filter the data
        filtered = data[(data[dateCol] <= days[i]) & (data[dateCol] >= days[i-window])]
        #Run the regression
        reg = run_factor_model(filtered, dependentVar, factorNames, method, options)
        #Collect the regression values
        newRow = pd.DataFrame(reg.coef_)
        newRow = newRow.transpose()
        newRow.columns = factorNames
        newRow[dateCol] = days[i]
        newRows.append(newRow)
    #DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
    if(len(newRows) > 0):
        #sorted(...) keeps the alphabetical column order that append(sort=True) produced
        regressionValues = pd.concat(newRows)[sorted(listOfFactorsAndDate)]
    if(printTime):
        print('regression took ' + str((time.time() - start)/60.) + ' minutes')
    return regressionValues

#Assorted Nonsense
def num_to_month(month):
    #num to month returns the name of the month, input is an integer
    if (month==1):
        return 'January'
    if (month==2):
        return 'February'
    if (month==3):
        return 'March'
    if (month==4):
        return 'April'
    if (month==5):
        return 'May'
    if (month==6):
        return 'June'
    if (month==7):
        return 'July'
    if (month==8):
        return 'August'
    if (month==9):
        return 'September'
    if (month==10):
        return 'October'
    if (month==11):
        return 'November'
    if (month==12):
        return 'December'


def data_time_periods(data, dateName):
    '''data_time_periods figures out if the data is daily, weekly, monthly, or yearly,
    based on the gap between the last two dates in the date column
    INPUTS:
        data: pandas df, has a date column in it with column name dateName
        dateName: string, name of column to be analysed
    OUTPUTS:
        string, one of 'daily', 'weekly', 'monthly', 'yearly'
    '''
    secondToLast = data[dateName].tail(2)[:-1]
    last = data[dateName].tail(1)
    #Gap between the last two observations, in days
    gapDays = (last.values - secondToLast.values).astype('timedelta64[D]') / np.timedelta64(1, 'D')
    gapDays = gapDays[0]
    if (gapDays > 200):
        return 'yearly'
    elif(gapDays > 20):
        return 'monthly'
    elif(gapDays > 5):
        return 'weekly'
    else:
        return 'daily'

In [None]:
#@title Package Uploads

#import all the necessary packages
import numpy as np #for numerical array data
import pandas as pd #for tabular data
import matplotlib.pyplot as plt #for plotting purposes

from sklearn.metrics import r2_score

%matplotlib inline
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

import importlib as imp
import warnings
warnings.filterwarnings('ignore')

## Load data

In [None]:
# Load and restrict to 1995-2014
all_data = pd.read_csv(dataPath)
all_data[dateName] = pd.to_datetime(all_data[dateName])
all_data = all_data[all_data[dateName] <= '2014-12-01']

In [None]:
# Preview
display(all_data)

| | Date | World Equities | 10-year US Treasuries | High Yield | Inflation Protection | Currency Protection | U.S. Equity | SP500 Total Return | S&P 500 | International Equity | U.S. Treasury 20 years | Corporate Bond | Real Estate | Commodity | TIPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1995-01-01 | -0.020349 | 0.022922 | 0.032048 | 0.006524 | -0.003404 | 0.022239 | 0.025954 | 0.025897 | -0.039228 | 0.025987 | 0.017042 | -0.014847 | -0.023908 | 0.029446 |
| 1 | 1995-02-01 | 0.010680 | 0.030014 | 0.013531 | -0.038375 | -0.020587 | 0.036258 | 0.035540 | 0.038974 | -0.000434 | 0.026960 | 0.029870 | 0.032460 | 0.002221 | -0.008361 |
| 2 | 1995-03-01 | 0.045804 | 0.007195 | 0.025871 | -0.023184 | -0.042060 | 0.030791 | 0.035454 | 0.029502 | 0.061949 | 0.009477 | 0.011937 | -0.005961 | 0.019558 | -0.015989 |
| 3 | 1995-04-01 | 0.036372 | 0.016973 | 0.030836 | -0.003600 | -0.015827 | 0.022403 | 0.026256 | 0.029414 | 0.037154 | 0.017695 | 0.010995 | 0.002259 | 0.038085 | 0.013373 |
| 4 | 1995-05-01 | 0.011040 | 0.054900 | 0.006743 | -0.054282 | 0.009121 | 0.036521 | 0.040894 | 0.039949 | -0.009769 | 0.076715 | 0.053217 | 0.045025 | -0.009417 | 0.000618 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 2014-08-01 | 0.027793 | 0.018782 | -0.020976 | -0.014430 | 0.009927 | 0.043745 | 0.042988 | 0.037655 | 0.008299 | 0.038171 | 0.012999 | 0.035689 | -0.009703 | 0.004352 |
| 236 | 2014-09-01 | -0.042745 | -0.010631 | 0.011383 | -0.014319 | 0.039788 | -0.033391 | -0.026997 | -0.015514 | -0.048614 | -0.015959 | -0.006966 | -0.055390 | -0.062227 | -0.024950 |
| 237 | 2014-10-01 | 0.013016 | 0.013129 | -0.007149 | -0.004635 | 0.009673 | 0.041572 | 0.037970 | 0.023201 | -0.019297 | 0.027575 | 0.003061 | 0.093608 | -0.064498 | 0.008494 |
| 238 | 2014-11-01 | 0.015443 | 0.014195 | -0.014748 | -0.011565 | 0.018641 | 0.016468 | 0.020038 | 0.024534 | 0.021751 | 0.029607 | 0.004313 | 0.012604 | -0.082710 | 0.002630 |


In [None]:
# What do you see from this correlation matrix?
display(all_data.corr(numeric_only = True))

| | World Equities | 10-year US Treasuries | High Yield | Inflation Protection | Currency Protection | U.S. Equity | SP500 Total Return | S&P 500 | International Equity | U.S. Treasury 20 years | Corporate Bond | Real Estate | Commodity | TIPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| World Equities | 1.0 | -0.176121 | 0.308225 | 0.331848 | -0.528171 | 0.946991 | 0.943594 | 0.912133 | 0.965891 | -0.198883 | 0.237443 | 0.64051 | 0.409148 | 0.110957 |
| 10-year US Treasuries | -0.176121 | 1.0 | 0.13105 | -0.615927 | -0.140221 | -0.177292 | -0.150594 | -0.182054 | -0.175836 | 0.942285 | 0.63722 | -0.002124 | -0.099199 | 0.625749 |
| High Yield | 0.308225 | 0.13105 | 1.0 | 0.004689 | -0.18469 | 0.290282 | 0.295835 | 0.263561 | 0.30935 | 0.149314 | 0.316815 | 0.191219 | 0.060343 | 0.166567 |
| Inflation Protection | 0.331848 | -0.615927 | 0.004689 | 1.0 | -0.208674 | 0.298576 | 0.27318 | 0.236682 | 0.338636 | -0.602057 | -0.187725 | 0.296745 | 0.347401 | 0.229089 |
| Currency Protection | -0.528171 | -0.140221 | -0.18469 | -0.208674 | 1.0 | -0.394643 | -0.387068 | -0.357044 | -0.608743 | -0.037019 | -0.320512 | -0.422314 | -0.44781 | -0.37987 |
| U.S. Equity | 0.946991 | -0.177292 | 0.290282 | 0.298576 | -0.394643 | 1.0 | 0.988462 | 0.91577 | 0.859048 | -0.192557 | 0.195337 | 0.645034 | 0.357459 | 0.076567 |
| SP500 Total Return | 0.943594 | -0.150594 | 0.295835 | 0.27318 | -0.387068 | 0.988462 | 1.0 | 0.925838 | 0.849181 | -0.170161 | 0.206732 | 0.628404 | 0.331523 | 0.08441 |
| S&P 500 | 0.912133 | -0.182054 | 0.263561 | 0.236682 | -0.357044 | 0.91577 | 0.925838 | 1.0 | 0.824577 | -0.196778 | 0.243264 | 0.527875 | 0.301916 | 0.0094 |
| International Equity | 0.965891 | -0.175836 | 0.30935 | 0.338636 | -0.608743 | 0.859048 | 0.849181 | 0.824577 | 1.0 | -0.194423 | 0.233666 | 0.602867 | 0.445056 | 0.118031 |
| U.S. Treasury 20 years | -0.198883 | 0.942285 | 0.149314 | -0.602057 | -0.037019 | -0.192557 | -0.170161 | -0.196778 | -0.194423 | 1.0 | 0.587746 | -0.024133 | -0.155073 | 0.56817 |
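
One way to read a correlation matrix is to list the pairs of series that move almost in lockstep, since near-duplicate series add little diversification and can destabilise regression loadings. The sketch below does this on synthetic data; the column names, the return parameters, and the 0.9 threshold are all hypothetical choices for illustration, not values taken from the dataset above.

```python
import numpy as np
import pandas as pd

# Synthetic monthly returns standing in for the real data (hypothetical columns)
rng = np.random.default_rng(0)
market = rng.normal(0.005, 0.04, size=240)
returns = pd.DataFrame({
    'Equity A': market + rng.normal(0, 0.005, size=240),
    'Equity B': market + rng.normal(0, 0.005, size=240),
    'Bond': rng.normal(0.003, 0.01, size=240),
})

corr = returns.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

# Hypothetical threshold for "almost the same series"
high_pairs = pairs[pairs.abs() > 0.9]
print(high_pairs)
```

To try it on the real data, replace `returns.corr()` with `all_data.corr(numeric_only=True)`.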


## Refresh your memory

Briefly go through the week 6 & 7 code walkthroughs again, and consider what is being done there.

How are the linear regression models being used, and for what purpose?

When are the Ridge/Lasso/Elastic-net variants of linear regression employed, and why? Why is cross-validation used to tune the penalty parameters?
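
As a reminder of what cross-validated tuning looks like in practice, here is a minimal sketch using scikit-learn's `ElasticNetCV` on synthetic data. The five factors, the true loadings, and the grid of `l1_ratio` values are assumptions made up for this example; the walkthroughs may set things up differently.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data: five hypothetical factors, only two of which drive the asset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([0.8, 0.0, -0.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation picks the penalty strength (alpha) and the
# L1/L2 mix (l1_ratio) that give the lowest held-out error
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0)
model.fit(X, y)

print('chosen alpha:', model.alpha_)
print('chosen l1_ratio:', model.l1_ratio_)
print('loadings:', np.round(model.coef_, 3))
```

The penalty is tuned on held-out folds because the training error alone always favours the weakest penalty. Note that plain k-fold ignores time ordering; for return data, a time-aware splitter such as scikit-learn's `TimeSeriesSplit` may be a better fit.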

In [None]:
# Choose assets (that is, "dependent" variables) and factors ("independent" variables, aka features)

# assetName  = .. code goes here ..
# factorName = .. code goes here ..

In [None]:
# Recreate a regularised linear model (for example, Elastic-net) as done in the walkthroughs, using pre-2013 data to train

# .. code goes here ..

In [None]:
# Validate (i.e., test) the linear model on post-2013 data and report the R^2

# .. code goes here ..
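
The train/validate split described in the cells above can be sketched as follows. This is not a solution to the exercise: the data frame `toy`, its column names, and the penalty values are hypothetical, and the data is synthetic so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# Synthetic monthly data for 1995-2014 (hypothetical factor and asset names)
rng = np.random.default_rng(1)
dates = pd.date_range('1995-01-01', '2014-12-01', freq='MS')
toy = pd.DataFrame({
    'Date': dates,
    'factor_1': rng.normal(0, 0.04, len(dates)),
    'factor_2': rng.normal(0, 0.02, len(dates)),
})
toy['asset'] = (0.9 * toy['factor_1'] + 0.3 * toy['factor_2']
                + rng.normal(0, 0.005, len(dates)))

# Split by date, never at random: train on the past, validate on the future
train = toy[toy['Date'] < '2013-01-01']
test = toy[toy['Date'] >= '2013-01-01']
factors = ['factor_1', 'factor_2']

# Monthly returns are small numbers, so the penalty has to be small as well;
# alpha=1e-5 is an arbitrary illustrative value, not a tuned one
model = ElasticNet(alpha=1e-5, l1_ratio=0.5)
model.fit(train[factors], train['asset'])

r2_test = r2_score(test['asset'], model.predict(test[factors]))
print('out-of-sample R^2:', round(r2_test, 3))
```

The R^2 reported here is computed only on rows the model never saw during fitting, which is what makes it a validation score.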

# Define Research Question

What question will you try to answer?

Write your project goal in the form of a question to help guide the steps that follow.


# Suggested to-do list:

1) You have now used a linear regression model to predict the performance of the assets based on factors. What other models can be useful for regression?

2) Perform the same regression with alternative models. How do they compare against the linear model in terms of accuracy and interpretability? For example, you can try:
  - [XGBoost](https://xgboost.readthedocs.io/en/stable/python/)
  - [Random Forests](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

3) How does your top-performing model perform if the asset portfolio and factors are altered? For a few such asset/factor combinations, and for two or three different models, tabulate or otherwise visualise your training and validation accuracies.
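
One possible way to organise the comparison in steps 2 and 3 is to loop over a dictionary of models and collect training and validation R^2 into a single table. The sketch below uses synthetic data and two scikit-learn models; the factor loadings, the split point, and the model settings are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic factor returns (240 months, 3 hypothetical factors) and one asset
rng = np.random.default_rng(2)
X = rng.normal(0, 0.04, size=(240, 3))
y = X @ np.array([0.9, 0.3, -0.2]) + rng.normal(0, 0.005, size=240)

# Chronological split: first 216 months for training, last 24 for validation
X_train, X_test = X[:216], X[216:]
y_train, y_test = y[:216], y[216:]

models = {
    'Linear regression': LinearRegression(),
    'Random forest': RandomForestRegressor(n_estimators=200, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        'model': name,
        'train R^2': r2_score(y_train, model.predict(X_train)),
        'validation R^2': r2_score(y_test, model.predict(X_test)),
    })

summary = pd.DataFrame(rows).set_index('model')
print(summary.round(3))
```

A large gap between the training and validation columns is a sign of overfitting. For interpretability, compare the linear model's `coef_` with the forest's `feature_importances_`, keeping in mind that importances carry no sign.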


## Challenge:

- Make an educated guess at the expected average return of your factors for 2015 (without looking up their historical performance!), and use that to predict the return of your assets.
- Use your predicted returns, along with the week 3 materials, to construct an optimal portfolio, that is, one where the Sharpe ratio is maximal. Assume the correlations between the assets will remain approximately the same in 2015 as in 1995 to 2014.
- Now look up the actual performance of your assets in 2015 (using, for example, Yahoo Finance). Are you happy with your portfolio?
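
For orientation, here is a minimal sketch of a maximum-Sharpe portfolio using the closed-form tangency weights, $w \propto \Sigma^{-1}(\mu - r_f)$, rescaled to sum to one. This is a swapped-in textbook technique and may differ from the method in the week 3 materials (for instance, it allows short positions). The predicted returns, volatilities, correlations, and risk-free rate are all hypothetical numbers.

```python
import numpy as np

# Hypothetical inputs for three assets (annualised)
mu = np.array([0.06, 0.03, 0.08])    # predicted returns
vol = np.array([0.15, 0.05, 0.20])   # volatilities
corr = np.array([[1.0, 0.1, 0.6],
                 [0.1, 1.0, 0.0],
                 [0.6, 0.0, 1.0]])   # assumed to carry over from history
rf = 0.01                            # risk-free rate

# Covariance matrix from volatilities and correlations
cov = np.outer(vol, vol) * corr

def sharpe(w):
    """Sharpe ratio of a portfolio with weights w."""
    return (w @ mu - rf) / np.sqrt(w @ cov @ w)

# Tangency portfolio: solve cov @ raw = mu - rf, then rescale to sum to 1
raw = np.linalg.solve(cov, mu - rf)
weights = raw / raw.sum()

print('weights:', np.round(weights, 3))
print('Sharpe (tangency):', round(sharpe(weights), 3))
print('Sharpe (equal weights):', round(sharpe(np.ones(3) / 3), 3))
```

The rescaling step only makes sense when `raw.sum()` is positive, as it is for these inputs. If you need long-only weights, a constrained numerical optimiser is the usual alternative.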

## Performance Summary

Present your results. The Markdown syntax below may help.

Markdown | Preview
--- | ---
`**Model 1**` | **Model 1**
`*70%*` or `_italicized text_` | *70%*
`` `Monospace` `` | `Monospace`
`~~strikethrough~~` | ~~strikethrough~~
`[A link](https://www.google.com)` | [A link](https://www.google.com)
`![An image](https://www.google.com/images/rss.png)` | ![An image](https://www.google.com/images/rss.png)

More resources on creating Markdown tables in Colab can be found [here](https://colab.research.google.com/notebooks/markdown_guide.ipynb#scrollTo=Lhfnlq1Surtk).

## Interpretation and Future Work

Present and interpret your model's performance. Comment on potential future work or research questions that your project leads to. Consider all the assumptions that are baked into your model, and how valid they are.