# Deep Learning for OR and FE
## Wealth Management Project 6: Real Time Impacts & Stress Testing

### *Group 27: Louis Francois, Qinyi Li, Parag Dilip Mahajan*

In [1]:
"""
Copyright:
    Copyright (C) 2020 Sauma Capital Inc - All Rights Reserved
    Unauthorized copying of this file, via any medium, is strictly prohibited
    Proprietary and confidential
Product: Scenarios based stress test
Author: Kai Fang
Description: Risk scenario analysis to generate risk shocks for a portfolio of mutual funds/other assets
            The framework contains multiple parts:
            1) Risk Generator
            2) Risk Scenarios
            3) Risk Simulator
            This file currently only provides base classes for the researcher/developer to inherit from, and provides an interface
            to enforce the format of the programming practice
"""

In [50]:
import pandas as pd
import math
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import pickle
import yaml
import os
import sauma.core 
import pymysql
import json
from sqlalchemy import select, insert
from datetime import datetime
from datetime import timedelta
from functools import partial, reduce   
import logging
from utils import *
from DatabaseDevelopment import *

### Risk factor generator

In [51]:
class RiskGeneratorBased:
    """Base class that every risk generator should inherit from. For example, you could create a multi-variable regression
        based risk generator to map the risk factors to an asset or portfolio, and train it to generate the model parameters
        (betas), which could then be cached in a database or output to a flat file. The dependent variable would be a two-dimensional
        matrix (time, asset return/price) and the independent variable would be a two-dimensional matrix (time, risk factor series)
        For example:
        Y: [[Date, Asset1, Asset2, Asset3, Asset4, Asset5 ...],
            [.....],
            [.....],
            ...]
        X: [[Date, Factor1, Factor2, Factor3, Factor4, ....],
            [.....],
            [.....],
            ....]
        Final Result of output would be:
        [[Factor,  Asset1, Asset2, Asset3, Asset4, Asset5 ...],
         [Factor1, .....],
         [Factor2, .....],
         [Factor3, .....],
         .....]
        Currently we assume a linear relationship between the risk factors and the assets; later, when we have non-linear
        relationships, the return format could be different.
        If you have multiple assets, this may involve multiple training processes.
    """
    def __init__(self, risk_generator_name):
        self._risk_generator_name = risk_generator_name
    
    def set_up(self, **kwargs):
        """Function to set up any private variables for the risk generator"""
        raise NotImplementedError("Subclasses should implement set_up function!")
    
    def risk_factors(self):
        """Print the list of risk factors used in this generator. Do not hard-code the returned list; the information
        should be gathered from private variables set in either the set_up function or the load data function"""
        raise NotImplementedError("Subclasses should implement risk_factors function!")
    
    def risk_generator_name(self):
        """This provides the identifier of the risk generator that you are implementing."""
        return self._risk_generator_name
    
    def machine_learning_based(self):
        """This method tells us whether the risk generator is machine learning
            based and needs to run fit to train model parameters

        Parameters:
            None

            Return
                bool
                True if the strategy needs to run fit to be ready for prediction, otherwise False
        """
        raise NotImplementedError("Subclasses should implement machine_learning_based!")
    
    def load_raw_data(self, source_type, **kwargs):
        """Function to load raw data from source. It should be able to support
        reading data from a flat file or a sql database. Please just implement the flat file version for now;
        later we will provide the sql python package that we want to use for the database task.
        You may also want to clean the raw data; please define a private method for the data transformation.
        The data should be saved as a private variable, as it will be used later in the training and testing process.

        Parameters:
            source_type: str
                flat file type or sql. If it is a flat file, the file directory or
                path needs to be passed in as an argument or in the setup function.
                If it is sql, the connection needs to be established in the setup function.
                Please avoid any hard-coded names in the class, and set global variables to define those file names
        """
        raise NotImplementedError("Subclasses should implement load_raw_data function!")
    
    def set_hyper_parameter(self, **kwargs):
        """Function to reconfigure any hyperparameters that you need for your model.
        The parameters should be initialized in your inherited setup function, by reading the config
        either from a config file or from arguments, but please enable the user to set
        these hyperparameters through a config file."""
        raise NotImplementedError("Subclasses should implement set_hyper_parameter!")

    def print_hyper_parameter(self, **kwargs):
        """Print all the hyper parameters that you set for your model."""
        raise NotImplementedError("Subclasses should implement print_hyper_parameter!")
    
    def fit(self, **kwargs):
        """Function to execute training, either on the data that you loaded from file or on data passed in as arguments.
        When X, Y are passed in as arguments, train the model on that training dataset and overwrite
        the existing data cached in the risk generator object. If you implement a new machine learning model rather than using
        an existing model from a python package, please separate the implementation of the model into another class,
        and initialize an instance of that model in your setup function rather than implementing the model directly in the fit function,
        so that we can separate the business logic from the model maintenance logic, and the model can be reused
        somewhere else too."""
        raise NotImplementedError("Subclasses should implement fit!")
    
    def predict(self, **kwargs):
        """Run prediction after fitting the model; should raise an error if the model has not been fitted yet."""
        raise NotImplementedError("Subclasses should implement predict")
    
    def model_summary(self):
        """Function that provides a summary of the model result: prediction accuracy, different metrics
            to measure the model, and hyperparameters of the model

        Parameters:
            None

            Return
                dict {str: float/dataframe}
                key is the statistical measure name
                value is the statistical measure, either a number or a matrix or a dataframe
        """
        raise NotImplementedError("Subclasses should implement model_summary")

    def output_result(self, **kwargs):
        """Function to output the model. You could use pickle to cache the object that
        has been trained, so that you can load it directly later. You could also use this function
        to output the betas of the risk generator; please use arguments to configure what you want to output

        Parameters:
            output_model: bool
                output model to pickle container
            output_beta: bool
                output beta of each risk factor for each fund
        """
        raise NotImplementedError("Subclasses should implement output_result")
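
The base class above reports a missing method only when that method is called. As a hedged alternative, the sketch below uses Python's `abc` module so that an incomplete subclass already fails when it is instantiated. `AbstractRiskGenerator`, `IncompleteGenerator` and `try_instantiate` are hypothetical names used for illustration only; they are not part of the framework.

```python
from abc import ABC, abstractmethod


class AbstractRiskGenerator(ABC):
    """Hypothetical variant of RiskGeneratorBased built on abc (illustration only)."""

    def __init__(self, risk_generator_name):
        self._risk_generator_name = risk_generator_name

    def risk_generator_name(self):
        return self._risk_generator_name

    @abstractmethod
    def fit(self, **kwargs):
        """Train the model parameters."""

    @abstractmethod
    def predict(self, **kwargs):
        """Run prediction with the fitted model."""


class IncompleteGenerator(AbstractRiskGenerator):
    """Implements fit but forgets predict."""

    def fit(self, **kwargs):
        return None


def try_instantiate(cls, name):
    """Return an instance of cls, or the TypeError message if cls is still abstract."""
    try:
        return cls(name)
    except TypeError as exc:
        return str(exc)


result = try_instantiate(IncompleteGenerator, "demo")
print(result)  # the message names the missing method
```

Whether this is preferable depends on how strictly the interface should be enforced; the `NotImplementedError` approach used in this notebook lets a subclass implement only the methods it needs.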

In [142]:
# Define a multi-regression based risk factor generator by inheriting from the class above.
# If a method is generic enough that every risk factor generator could use it, like the methods defined in the base class, you can
# edit the base class directly; a method that is specific to your risk factor generation approach belongs in the derived class


class SimpleRegressionRiskGenerator(RiskGeneratorBased):
    def __init__(self, risk_generator_name, risk_factors): 
        self._risk_generator_name = risk_generator_name
        self._risk_factors = risk_factors # list of risk factors (strings)
    
    def set_up(self, **kwargs):
        """Function to set up any private variables for the risk generator"""
        self._hyperparameters = kwargs.get('hyperparameters') # dict: {'hyperparameter1': value1, ...}
        self._data_path = kwargs.get('data_path') # dict: {'fund': path to fund data, 'index': path to index data, 'libor': path to LIBOR data}
    
    def risk_factors(self):
        """Print the list of risk factors used in this generator. The list is not hard-coded; it comes from
        the private variable set in the constructor"""
        print(self._risk_factors)
    
    def risk_generator_name(self):
        """This provides the identifier of the risk generator that you are implementing."""
        return self._risk_generator_name
    
    def machine_learning_based(self):
        """This method tells us whether the risk generator is machine learning
            based and needs to run fit to train model parameters

        Parameters:
            None

            Return
                bool
                True if the strategy needs to run fit to be ready for prediction, otherwise False
        """
        return True 
    
    def load_raw_data(self, **kwargs):
        """Load the raw data from the flat files (csv) whose paths were given in set_up through data_path.
        The data is saved in private variables, as it is used later in the training and testing process.
        Reading from a sql database is not implemented yet.
        """
        self._factors_data = pd.read_csv(self._data_path.get('index'))
        self._assets_data = pd.read_csv(self._data_path.get('fund'))
        # the LIBOR data is only needed for the momentum and volatility scores, so its path is optional
        libor_path = self._data_path.get('libor')
        self._libor_data = pd.read_csv(libor_path) if libor_path is not None else None
        
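    def split_train_test_data(self, **kwargs):
        """Randomly split the factor data (X) and the fund data (Y) into a train set and a test set.
        test_size is read from kwargs; if it is omitted, sklearn's default of 0.25 is used."""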
        X, Y = self._factors_data, self._assets_data
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=kwargs.get('test_size'), random_state=42)
        self._X_train = X_train.values[:,1:] # convert to numpy array and get rid of dates
        self._X_test = X_test.values[:,1:]
        self._Y_train = Y_train.values[:,1:]
        self._Y_test = Y_test.values[:,1:]
    
    def set_hyper_parameter(self, **kwargs):
        """Function to reconfigure the hyperparameters of the model. The parameters are initialized in set_up;
        this function replaces them with the values read from a YAML config file.

        Parameters:
            config_path: str
                path to the YAML config file
        """
        path = kwargs.get('config_path') # string
        with open(path) as f:
            # note: BaseLoader does not resolve scalar types, so every value is loaded as a string
            config = yaml.load(f, Loader=yaml.BaseLoader)  # config is a dict
        self._hyperparameters = config
        
    def print_hyper_parameter(self, **kwargs):
        """Print all the hyper parameters that you set for your model."""
        print(self._hyperparameters)
    
    def fit(self, **kwargs):
        """Fit one linear regression without intercept per fund on the training data created by
        split_train_test_data, and store the fitted models in self._model under the keys 'fund1', 'fund2', ..."""
        self._model = {} # dict to store all the regression models (one for each asset/fund)
        for i in range(self._Y_train.shape[1]):
            reg = LinearRegression(fit_intercept=False) 
            reg.fit(self._X_train,self._Y_train[:,i])
            self._model['fund'+str(i+1)] = reg
    
    def predict(self, **kwargs):
        """Run prediction on the test data after fitting the model; raises an error if fit has not been run yet."""
        if not hasattr(self, '_model'):
            raise RuntimeError('Did not run fit yet')
        self._Y_pred = {}
        for fund, model in self._model.items():
            self._Y_pred[fund] = model.predict(self._X_test)
           
    def betas_fund(self, **kwargs):
        """For each fund, store list of betas obtained"""
        betas = {}
        for fund, model in self._model.items():
            betas[fund] = model.coef_
        self._betas_fund = betas
        return self._betas_fund
    
    def betas_output(self, **kwargs):
        """Create and print a matrix such that each row corresponds to a risk factor, with the name of the risk factor followed 
        by all the betas obtained for each fund, i.e. we want the following output:
        [[Factor,  Asset1, Asset2, Asset3, Asset4, Asset5 ...],
         [Factor1, .....],
         [Factor2, .....],
         [Factor3, .....],
         .....]"""
        # stack the per-fund beta vectors into an (n_funds, n_factors) array; this also works with a single fund
        betas = np.vstack(list(self._betas_fund.values()))
        # prepend the factor names as the first column, so that rows are factors and the other columns are funds
        factor_names = np.array(self._risk_factors).reshape(len(self._risk_factors), 1)
        betas_output = np.concatenate((factor_names, betas.T), axis=1)
        self._betas_output = betas_output
        return self._betas_output 
    
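    def get_momentum_factor_score(self):
        """Momentum score per fund: sum of the fund log returns in excess of the 3M USD LIBOR log returns
        over the row slice [30:365]"""
        self.mom_score = {} # dict mapping each fund to its momentum score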
        funds_data = self._assets_data.copy()
        funds_data.set_index('Date',inplace=True)
        log_funds_returns = funds_data.apply(lambda x: np.log(1+x/100) if np.issubdtype(x.dtype, np.number) else x)

        libor_data = self._libor_data.copy()
        libor_data.set_index('Date',inplace=True)
        libor_data.dropna(inplace=True)
        libor_data = libor_data.iloc[::-1]
        libor_returns = libor_data.pct_change()[1:]
        libor_returns = libor_returns.iloc[::-1]
        log_libor_returns = libor_returns.apply(lambda x: np.log(1+x) if np.issubdtype(x.dtype, np.number) else x)
        
        mom_factor_scores = log_funds_returns.sub(log_libor_returns['USD LIBOR 3M '],axis=0)
        
        self.mom_score = mom_factor_scores[30:365].sum().to_dict()
    
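    def get_vol_factor_score(self):
        """Volatility score per fund: range (max minus min) of the fund log returns in excess of the
        3M USD LIBOR log returns over the row slice [30:365]"""
        self.vol_score = {} # dict mapping each fund to its volatility score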
        funds_data = self._assets_data.copy()
        funds_data.set_index('Date',inplace=True)
        log_funds_returns = funds_data.apply(lambda x: np.log(1+x/100) if np.issubdtype(x.dtype, np.number) else x)

        libor_data = self._libor_data.copy()
        libor_data.set_index('Date',inplace=True)
        libor_data.dropna(inplace=True)
        libor_data = libor_data.iloc[::-1]
        libor_returns = libor_data.pct_change()[1:]
        libor_returns = libor_returns.iloc[::-1]
        log_libor_returns = libor_returns.apply(lambda x: np.log(1+x) if np.issubdtype(x.dtype, np.number) else x)
        
        vol_factor_scores = log_funds_returns.sub(log_libor_returns['USD LIBOR 3M '],axis=0)
        
        self.vol_score = vol_factor_scores[30:365].max().sub(vol_factor_scores[30:365].min()).to_dict()
        
    def get_size_factor_score(self,Mv_data_path):
        MV_data = pd.read_csv(Mv_data_path)
        MV_data.set_index('Date',inplace = True)
        MV_data = MV_data.astype(float)
        MV_data.reset_index(drop=True, inplace=True)
        MV_data = MV_data.apply(lambda x: -np.log(x))
        self.size_score = MV_data.sum().to_dict()
                
        
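    def select_top_n_funds_with_factor_score(self,n,factor_score_dict):
        """Keep only the n funds with the highest score in factor_score_dict.
        Note that the 'Date' column is dropped as well, since only the selected fund columns are kept."""
        top_n_funds = sorted(factor_score_dict,key=factor_score_dict.get,reverse=True)[:n]
        self._assets_data = self._assets_data[top_n_funds]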
          
    def model_summary(self):
        """Function that provides a summary of the model result: the RMSE of the predictions on the test data
            for each fund (predict needs to be run first)

        Parameters:
            None

            Return
                dict {str: float/dataframe}
                key is the statistical measure name
                value is the statistical measure, either a number or a matrix or a dataframe
        """
        ### can add other performance measures later on 
        summary = dict()
        summary['rmse score'] = {}
        i = 0
        for fund in self._model.keys():
            mse = mean_squared_error(y_true = self._Y_test[:,i], y_pred = self._Y_pred[fund])
            rmse = math.sqrt(mse)
            summary['rmse score'][fund] = rmse
            i += 1  
        return summary 

    def output_result(self, **kwargs):
        """Function to output the model with pickle, so that the trained object can be loaded
        directly later, and/or the beta output of the risk generator (betas_output needs to be run first)

        Parameters:
            save_path: str
                directory in which the output files are written
            output_model: bool
                output model to pickle container
            output_beta: bool
                output beta of each risk factor for each fund
        """
        save_path = kwargs.get('save_path') # string
        output_model = kwargs.get('output_model')
        output_beta = kwargs.get('output_beta')
        if output_model:
            # the with block closes the file once the dump is done
            with open(os.path.join(save_path, 'model.txt'), 'wb') as f1:
                pickle.dump(self._model, f1)
        if output_beta:
            with open(os.path.join(save_path, 'beta.txt'), 'wb') as f2:
                pickle.dump(self._betas_output, f2)
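
The momentum and volatility scores in the class above both start from the same series: the log return of the fund minus the log return of the LIBOR rate. The plain-Python sketch below reproduces that arithmetic on a few illustrative numbers. The function names `excess_log_returns`, `momentum_score` and `range_score` are hypothetical, the LIBOR changes are made up, and the row window used in the class is left out for brevity.

```python
import math


def excess_log_returns(fund_returns_pct, libor_returns):
    """Fund log returns (inputs in percent) minus LIBOR log returns (inputs as fractions)."""
    return [
        math.log(1 + r / 100) - math.log(1 + l)
        for r, l in zip(fund_returns_pct, libor_returns)
    ]


def momentum_score(excess):
    """Momentum: sum of the excess log returns."""
    return sum(excess)


def range_score(excess):
    """Volatility proxy: largest minus smallest excess log return."""
    return max(excess) - min(excess)


# illustrative inputs only (the LIBOR changes are made up)
fund_returns_pct = [-3.5, 6.12, 6.14, 2.0]
libor_returns = [0.01, -0.02, 0.0, 0.005]

excess = excess_log_returns(fund_returns_pct, libor_returns)
print(momentum_score(excess), range_score(excess))
```

Note that the range is only a rough stand-in for volatility; a standard deviation of the same series would be a common alternative.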

In [143]:
# Define risk generators based on Lasso regression and Ridge regression to do a bit more risk factor
# filtering. By implementing all these derived classes, you can identify which methods are common and
# could be moved to the base class.
class LassoRegressionRiskGenerator(SimpleRegressionRiskGenerator):
    def fit(self, **kwargs):
        """Fit one Lasso regression without intercept per fund on the training data, using the hyperparameters
        alpha and max_iter, and store the fitted models in self._model under the keys 'fund1', 'fund2', ..."""
        self._model = {} # dict to store all the regression models (one for each asset/fund)
        alpha = self._hyperparameters.get('alpha')
        max_iter = self._hyperparameters.get('max_iter')
        for i in range(self._Y_train.shape[1]):
            lasso = Lasso(alpha = alpha, max_iter = max_iter, fit_intercept=False)
            lasso.fit(self._X_train,self._Y_train[:,i])
            self._model['fund'+str(i+1)] = lasso
    
    def model_summary(self):
        """Function that provides a summary of the model result: the hyperparameters of the model and the RMSE
            of the predictions on the test data for each fund (predict needs to be run first)

        Parameters:
            None

            Return
                dict {str: float/dataframe}
                key is the statistical measure name
                value is the statistical measure, either a number or a matrix or a dataframe
        """
        ### can add other performance measures later on 
        summary = dict()
        summary['hyperparameter'] = self._hyperparameters
        summary['rmse score'] = {}
        i = 0
        for fund in self._model.keys():
            mse = mean_squared_error(y_true = self._Y_test[:,i], y_pred = self._Y_pred[fund])
            rmse = math.sqrt(mse)
            summary['rmse score'][fund] = rmse
            i += 1  
        return summary
    
class RidgeRegressionRiskGenerator(SimpleRegressionRiskGenerator):
    def fit(self, **kwargs):
        """Fit one Ridge regression without intercept per fund on the training data, using the hyperparameter
        alpha, and store the fitted models in self._model under the keys 'fund1', 'fund2', ..."""
        self._model = {} # dict to store all the regression models (one for each asset/fund)
        alpha = self._hyperparameters.get('alpha')
        for i in range(self._Y_train.shape[1]):
            ridge = Ridge(alpha = alpha, fit_intercept=False)
            ridge.fit(self._X_train,self._Y_train[:,i])
            self._model['fund'+str(i+1)] = ridge
    
    def model_summary(self):
        """Function that provides a summary of the model result: the hyperparameters of the model and the RMSE
            of the predictions on the test data for each fund (predict needs to be run first)

        Parameters:
            None

            Return
                dict {str: float/dataframe}
                key is the statistical measure name
                value is the statistical measure, either a number or a matrix or a dataframe
        """
        summary = dict()
        summary['hyperparameter'] = self._hyperparameters
        summary['rmse score'] = {}
        i = 0
        for fund in self._model.keys():
            mse = mean_squared_error(y_true = self._Y_test[:,i], y_pred = self._Y_pred[fund])
            rmse = math.sqrt(mse)
            summary['rmse score'][fund] = rmse
            i += 1  
        return summary
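
The three generators differ only in the penalty added to the least squares fit. The sketch below illustrates the effect for a single factor without intercept, where closed-form solutions exist. It assumes scikit-learn's objective conventions (Ridge: `||y - xw||^2 + alpha * w^2`; Lasso: `1/(2n) * ||y - xw||^2 + alpha * |w|`). The function names and the data are made up for illustration.

```python
import math


def ols_beta(x, y):
    """Least squares slope without intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def ridge_beta(x, y, alpha):
    """Ridge slope: the penalty alpha is added to the denominator, shrinking the slope."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def lasso_beta(x, y, alpha):
    """Lasso slope: soft-thresholding, which returns exactly zero for a large enough alpha."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    z = sum(a * a for a in x) / n
    shrunk = max(abs(rho) - alpha, 0.0)
    return math.copysign(shrunk, rho) / z


# made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.1, 1.4, 2.1]

print(ols_beta(x, y), ridge_beta(x, y, 0.4), lasso_beta(x, y, 0.6), lasso_beta(x, y, 5.0))
```

This is why some Lasso betas in the outputs below are exactly zero while the corresponding Ridge betas are small but non-zero.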

Let us test our risk generator classes.

In [3]:
# data samples
import pandas as pd
index = pd.read_csv('Index.csv')
fund = pd.read_csv('Fund.csv')
libor_data = pd.read_csv('USD_3M_LIBOR.csv')
fund.head(5)

Unnamed: 0,Date,NAS:APGAX - Return,NAS:APGCX - Return,NAS:ANIAX - Return,NAS:ANMCX - Return,NAS:GSCIX - Return,NAS:GXXAX - Return,NAS:JBGIX - Return,NAS:RAGTX - Return,NAS:PGWAX - Return,NAS:ANCMX - Return,NAS:PQNAX - Return,NAS:PRXIX - Return,NAS:RMDAX - Return,NAS:JGVVX - Return,NAS:VGRIX - Return,NAS:JLGMX - Return,NAS:JLVMX - Return,NAS:OLGCX - Return
0,Sep-20,-3.5,-3.56,-0.33,-0.47,-5.42,-3.76,0.18,-4.12,-3.5,-0.78,-2.23,-2.8,-0.81,-4.18,-3.05,-4.36,-2.69,-4.41
1,Aug-20,6.12,6.07,-0.2,-0.19,5.08,7.1,-0.57,8.74,12.96,7.51,1.65,1.64,6.9,9.59,4.75,10.78,4.96,10.67
2,Jul-20,6.14,6.07,1.26,1.12,5.76,6.13,1.9,9.64,7.8,8.12,6.3,3.91,7.91,9.43,3.84,9.87,2.95,9.79
3,Jun-20,2.0,1.94,1.47,1.48,1.76,0.63,1.71,6.61,5.09,4.49,0.16,3.79,3.37,5.19,0.02,5.94,2.1,5.86
4,May-20,8.44,8.36,2.48,2.41,10.28,7.93,1.23,12.06,7.45,8.56,4.86,6.32,12.43,8.21,3.78,9.08,3.96,8.95


In [145]:
index.values[:,1:]

array([[3436.19, 3363.0, 2580.77, 677.69, 3392.2],
       [3505.42, 3500.31, 2577.19, 677.83, 3401.53],
       [3455.6, 3271.12, 2605.74, 677.97, 3445.17],
       ...,
       [893.87, 1320.28, 1087.91, 526.72, 1092.73],
       [888.02, 1314.95, 1067.54, 524.21, 1071.93],
       [902.76, 1429.4, 1046.1, 521.57, 1058.22]], dtype=object)

In [146]:
# list of factors used for testing the code (corresponding to indexes in the dataset 'index')
factors = list(index.columns[1:])
factors

['Morningstar MSCI Emerging Markets - Close',
 'S&P 500 PR- Close',
 'BBgBarc US Treasury TR USD(1972) - Close',
 'LIBOR 3 Mon Interbank Eurodollar Inv TR - Close',
 'BBgBarc US Credit TR USD - Close']

In [147]:
# test class SimpleRegressionRiskGenerator with data sample 
x = SimpleRegressionRiskGenerator(risk_generator_name = 'Regression', risk_factors = factors)
x.set_up(risk_factors = factors, hyperparameters = {}, data_path = {'index': 'Index.csv', 'fund': 'Fund.csv','libor':'USD_3M_LIBOR.csv'})
x.load_raw_data()
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()
x.get_vol_factor_score()
x.vol_score
x.select_top_n_funds_with_factor_score(2,x.vol_score)
x._assets_data

Unnamed: 0,NAS:RMDAX - Return,NAS:RAGTX - Return
0,-0.81,-4.12
1,6.90,8.74
2,7.91,9.64
3,3.37,6.61
4,12.43,12.06
5,14.20,13.31
6,-13.14,-10.15
7,-4.60,-4.33
8,2.09,6.87
9,2.12,0.88
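
The selection step ranks the funds by their score and keeps the `n` highest. A minimal sketch with made-up scores follows; `top_n_funds` is a hypothetical stand-alone helper, and prepending `'Date'` is an assumption about what a caller might want, not what the class above does.

```python
def top_n_funds(factor_score_dict, n):
    """Return the n fund names with the highest score, best first."""
    return sorted(factor_score_dict, key=factor_score_dict.get, reverse=True)[:n]


# made-up scores for illustration
scores = {"fund_a": 0.12, "fund_b": 0.31, "fund_c": 0.05, "fund_d": 0.27}

selected = top_n_funds(scores, 2)
# keeping the date column next to the selected funds (assumption, see above)
kept_columns = ["Date"] + selected
print(selected, kept_columns)
```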


In [19]:
#x.model_summary()
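
`model_summary` reports one RMSE per fund, computed on the test rows. For reference, a plain-Python sketch of that measure on made-up numbers (the helper name `rmse` is hypothetical; the class itself relies on `mean_squared_error` from scikit-learn):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two sequences of equal length."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# made-up values for illustration
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]

score = rmse(y_true, y_pred)
print(score)
```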

In [20]:
# test class LassoRegressionRiskGenerator with data sample 
x = LassoRegressionRiskGenerator(risk_generator_name = 'Lasso', risk_factors = factors)
x.set_up(risk_factors = factors, hyperparameters = {'alpha': 0.6, 'max_iter': 100000}, data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'})
x.load_raw_data()
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()

{'fund1': array([ 2.07878950e-03, -5.39625757e-05,  7.03559194e-03, -1.12721661e-02,
        -4.44738151e-03]),
 'fund2': array([ 2.08042673e-03, -3.31761405e-05,  7.12772084e-03, -1.14599806e-02,
        -4.51909715e-03]),
 'fund3': array([-0.00077314,  0.00016831,  0.00091984, -0.        , -0.        ]),
 'fund4': array([-0.00076481,  0.0001626 ,  0.00091961, -0.00010555,  0.        ]),
 'fund5': array([-0.00042997, -0.00110726, -0.00427445,  0.00365444,  0.00444184]),
 'fund6': array([ 0.00133162, -0.00227396, -0.00632484,  0.00097823,  0.00586219]),
 'fund7': array([-0.00099603,  0.000243  ,  0.00280407, -0.00138084, -0.00093212]),
 'fund8': array([ 0.00272676, -0.00167255, -0.00315705, -0.00324775,  0.00266686]),
 'fund9': array([ 2.79307360e-03, -9.79185966e-04,  2.76174672e-04, -7.12081344e-03,
         6.14397768e-05]),
 'fund10': array([ 0.00121518, -0.0006527 ,  0.0009397 , -0.00400219, -0.        ]),
 'fund11': array([-0.00078249, -0.        , -0.00369934,  0.00497501,  0.00

In [21]:
#x.model_summary()

In [22]:
#x.betas_output()

In [23]:
# test class RidgeRegressionRiskGenerator with data sample 
x = RidgeRegressionRiskGenerator(risk_generator_name = 'Ridge', risk_factors = factors)
x.set_up(risk_factors = factors, hyperparameters = {'alpha': 0.4}, data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'})
x.load_raw_data()
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()

{'fund1': array([ 2.11348957e-03,  8.77724063e-05,  8.02591079e-03, -1.22046655e-02,
        -5.17314555e-03]),
 'fund2': array([ 0.00211513,  0.00010856,  0.00811803, -0.01239247, -0.00524485]),
 'fund3': array([-0.00077539,  0.00028588,  0.00166696, -0.00069129, -0.00052927]),
 'fund4': array([-0.00076975,  0.00027735,  0.00167457, -0.00081874, -0.00052467]),
 'fund5': array([-0.00046369, -0.00128081, -0.00538446,  0.0046839 ,  0.00526588]),
 'fund6': array([ 0.00132122, -0.0024485 , -0.00742834,  0.00199497,  0.00666041]),
 'fund7': array([-0.00098563,  0.00041751,  0.00390744, -0.00239748, -0.00173025]),
 'fund8': array([ 0.00272908, -0.00175017, -0.00354027, -0.0029547 ,  0.00296803]),
 'fund9': array([ 0.00280196, -0.00093568,  0.00069558, -0.00755631, -0.00021521]),
 'fund10': array([ 0.00124279, -0.00053761,  0.00177562, -0.00480037, -0.00060412]),
 'fund11': array([-0.00081631, -0.00017049, -0.0047979 ,  0.00599521,  0.0037951 ]),
 'fund12': array([-2.47920811e-03,  7.19866443

In [24]:
#x.model_summary()

In [25]:
#x.betas_output()

Let us try to set up hyperparameters using a config file

In [26]:
# write a YAML file (config file) containing one set of hyperparameters 

params = {'alpha': 0.7, 'max_iter': 10000}

with open('params.yaml', 'w') as f:
    yaml.dump(params, f)

In [27]:
x = RidgeRegressionRiskGenerator(risk_generator_name = 'Ridge', risk_factors = factors)
x.set_up(risk_factors = factors, hyperparameters = {'alpha': 0.4}, 
         data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'})
x.load_raw_data()
x.set_hyper_parameter(config_path = 'params.yaml')
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()

{'fund1': array([ 2.11348903e-03,  8.77687226e-05,  8.02588560e-03, -1.22046415e-02,
        -5.17312744e-03]),
 'fund2': array([ 0.00211513,  0.00010855,  0.008118  , -0.01239244, -0.00524483]),
 'fund3': array([-0.00077539,  0.00028588,  0.00166696, -0.00069129, -0.00052927]),
 'fund4': array([-0.00076975,  0.00027735,  0.00167457, -0.00081873, -0.00052467]),
 'fund5': array([-0.00046369, -0.00128081, -0.00538444,  0.00468389,  0.00526587]),
 'fund6': array([ 0.00132122, -0.00244849, -0.00742832,  0.00199495,  0.0066604 ]),
 'fund7': array([-0.00098563,  0.00041751,  0.00390743, -0.00239747, -0.00173025]),
 'fund8': array([ 0.00272908, -0.00175017, -0.00354026, -0.00295471,  0.00296803]),
 'fund9': array([ 0.00280195, -0.00093568,  0.00069557, -0.00755631, -0.00021521]),
 'fund10': array([ 0.00124279, -0.00053761,  0.00177562, -0.00480037, -0.00060411]),
 'fund11': array([-0.00081631, -0.00017049, -0.00479788,  0.0059952 ,  0.00379509]),
 'fund12': array([-2.47920798e-03,  7.19867177

In [28]:
x.print_hyper_parameter()

{'alpha': '0.7', 'max_iter': '10000'}
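
The values print as strings because `yaml.BaseLoader` does not convert scalars to numbers. A minimal sketch of casting them before they reach the model is shown below; `cast_hyperparameters` and `HYPERPARAMETER_TYPES` are hypothetical names, not part of the classes above.

```python
# expected type of each known hyperparameter (assumption for this sketch)
HYPERPARAMETER_TYPES = {"alpha": float, "max_iter": int}


def cast_hyperparameters(raw, types=HYPERPARAMETER_TYPES):
    """Return a copy of raw with known keys converted; unknown keys are left untouched."""
    return {
        key: types[key](value) if key in types else value
        for key, value in raw.items()
    }


raw = {"alpha": "0.7", "max_iter": "10000"}
params = cast_hyperparameters(raw)
print(params)
```

Loading the file with `yaml.safe_load` instead would also yield numeric values, since that loader resolves integers and floats.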


### Risk scenarios

In [6]:
class RiskScenarioBased:
    """Mixin Class to provide risk scenario that we would apply on the factors model to see how a portfolio would react to risk"""
    
    def __init__(self, scenario_name, risk_factors):
        self._scenario_name = scenario_name
        self._risk_factors = risk_factors
    
    def set_up(self, **kwargs):
        """Function to set up any private variables for the risk scenario"""
        raise NotImplementedError("Subclasses should implement set_up function!")
    
    def risk_factors(self):
        """Print the list of risk factors that you use in this scenario, do not hard coded the return list, the information
        should be able to be gathered from private variables either from init function"""
        return self._risk_factors
    
    def generate_scenario(self, **kwargs):
        """Function to generate scenario based on config you provide in **kwargs, you may have some machine learning based model
        to derive the correlation of risk factors, and generate correlated risk scenarios, or you could assume that the movement
        of risk factor are independent of each other, and apply different shock on the risk factors"""
        raise NotImplementedError("Subclasses should implement set_up function!")

In [7]:
# Define a naive risk scenario where the user passes the percentage shocks directly to set_up, and which then outputs
# the shocked factor data

class NaiveRiskScenario(RiskScenarioBased):
    """Risk scenario in which the user passes in a shock for each risk factor; the factors are assumed to be
    independent of each other.
    """
    def set_up(self, **kwargs):
        """Set up any private variables needed by the scenario."""
        self._data_path = kwargs.get('data_path') # dict: {'fund': data_path for fund data, 'index': data_path for index data}
        self._shocks = kwargs.get('shocks') # dict: {'factor1': shock1, ...} where shock1 = -0.1 to create a -10% shock
    
    def print_shocks(self, **kwargs):
        """Print the percentage shocks applied on each risk factor"""
        print(self._shocks)
    
    def load_raw_data(self, **kwargs):
        """Load the raw factor data from a flat (CSV) file and cache it as a private variable for later use.
        Reading from a SQL database is not implemented in this class.

        The file path comes from the 'index' entry of the data_path dict passed to set_up, so no file name
        is hard-coded here.
        """
        self._factors_data = pd.read_csv(self._data_path.get('index'))
    
    def generate_scenario(self, **kwargs):
        """Apply the user-supplied percentage shocks to the factor data, treating the risk factors as
        independent of each other. A factor without an entry in the shocks dict is left unchanged."""
        X = self._factors_data
        X_shocked = X.copy()
        for factor in self._risk_factors:
            shock = self._shocks.get(factor)
            if shock is None:
                shock = 0.0
            X_shocked[factor] = X[factor]*(1+shock)
        self._factors_data_shocked = X_shocked
        return self._factors_data_shocked
    
# Later we will define event-based and machine-learning-based risk scenarios
class PCARiskScenario(NaiveRiskScenario):
    """A simple machine-learning risk scenario based on PCA. A PCA is run on the risk factors and the principal components
    represent the joint movement of the factors. The user can either shock the principal components, in which case the shock
    is mapped back to the risk factors through the inverse PCA transform, or shock the risk factors directly, as in the
    naive scenario."""
    def set_up(self, **kwargs):
        """Set up any private variables needed by the scenario."""
        self._data_path = kwargs.get('data_path') # dict: {'fund': data_path for fund data, 'index': data_path for index data}
        self._shocks = kwargs.get('shocks') # dict: {'factor1': shock1, ...} where shock1 = -0.1 to create a -10% shock
        self._factor_choice = kwargs.get('factor_choice') # 'pc' to shock the principal components, 'factor' to shock specific risk factors 
        
    def generate_scenario(self, **kwargs):
        """Generate the shocked factor data. With factor_choice == 'pc' the shocks are applied to the principal-component
        scores and mapped back to the risk factors; with factor_choice == 'factor' the shocks are applied to the risk
        factors directly, as in NaiveRiskScenario."""
        if self._factor_choice == 'pc': # this means we shock the principal components
            X = self._factors_data.to_numpy()[:,1:] # convert to numpy array and get rid of dates
            pca = PCA(whiten = True).fit(X)
            threshold = 0.85 # keep the smallest number of components whose cumulative explained variance reaches this threshold
            n_components = np.where(np.cumsum(pca.explained_variance_ratio_) >= threshold)[0][0] + 1
            pca = PCA(n_components = n_components, whiten = True)
            pc_X = pca.fit_transform(X)
            pc_list = ['pc_'+str(i+1) for i in range(n_components)]
            pc_shocked = pc_X.copy()
            for i, pc in enumerate(pc_list):
                shock = self._shocks.get(pc)
                if shock is None:
                    shock = 0.0
                pc_shocked[:,i] = pc_shocked[:,i]*(1+shock)
            X_shocked = pca.inverse_transform(pc_shocked)
            self._factors_data_shocked = self._factors_data.copy()
            self._factors_data_shocked.iloc[:,1:] = X_shocked
            return self._factors_data_shocked   
        elif self._factor_choice == 'factor': # this means we shock specific risk factors, just like the naive risk scenario
            return super().generate_scenario()
        else:
            raise ValueError("factor_choice must be 'pc' or 'factor', got " + repr(self._factor_choice))

Let us test our risk scenario classes 

In [29]:
shock0 = {'S&P 500 PR- Close': -0.2}
nrs = NaiveRiskScenario(scenario_name = 'Naive', risk_factors = factors)
nrs.set_up(data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'}, shocks = shock0)
nrs.load_raw_data()
nrs_x = nrs.generate_scenario()

In [99]:
nrs_x.head()

Unnamed: 0,Date,Morningstar MSCI Emerging Markets - Close,S&P 500 PR- Close,BBgBarc US Treasury TR USD(1972) - Close,LIBOR 3 Mon Interbank Eurodollar Inv TR - Close,BBgBarc US Credit TR USD - Close
0,Sep-20,3436.19,2690.4,2580.77,677.69,3392.2
1,Aug-20,3505.42,2800.248,2577.19,677.83,3401.53
2,Jul-20,3455.6,2616.896,2605.74,677.97,3445.17
3,Jun-20,3292.46,2480.232,2576.3,678.09,3342.11
4,May-20,3137.93,2435.448,2573.89,678.18,3282.14
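
The naive scenario multiplies each shocked factor column by `(1 + shock)` and leaves the others untouched. A minimal sketch of that step on a toy frame follows; the helper `apply_naive_shocks` and the column names `equity` and `bond` are hypothetical and only stand in for the class method and the real index names.

```python
import pandas as pd


def apply_naive_shocks(factors_data, shocks):
    """Return a copy of factors_data with each listed column scaled by (1 + shock)."""
    shocked = factors_data.copy()
    for factor, shock in shocks.items():
        shocked[factor] = factors_data[factor] * (1 + shock)
    return shocked


# Toy data: two dates, two factor levels (illustrative numbers only).
toy = pd.DataFrame({'Date': ['Sep-20', 'Aug-20'],
                    'equity': [3000.0, 3500.0],
                    'bond': [2500.0, 2600.0]})

shocked_toy = apply_naive_shocks(toy, {'equity': -0.2})  # a -20% shock on 'equity'
print(shocked_toy)
```

Because the shock is applied to every row, the whole history of the factor is rescaled rather than a single date.
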


In [111]:
shock1 = {'pc_1': -0.2}
pcars = PCARiskScenario(scenario_name = 'PCA', risk_factors = factors)
pcars.set_up(data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'}, shocks = shock1, factor_choice = 'pc')
pcars.load_raw_data()
pcars_x = pcars.generate_scenario()

In [112]:
pcars_x.head()

Unnamed: 0,Date,Morningstar MSCI Emerging Markets - Close,S&P 500 PR- Close,BBgBarc US Treasury TR USD(1972) - Close,LIBOR 3 Mon Interbank Eurodollar Inv TR - Close,BBgBarc US Credit TR USD - Close
0,Sep-20,3400.098415,2716.372291,2469.697276,688.100804,3164.158054
1,Aug-20,3454.215496,2764.787878,2500.104793,691.094858,3213.017018
2,Jul-20,3399.847333,2716.147661,2469.556197,688.086913,3163.931367
3,Jun-20,3283.824653,2612.348526,2404.36492,681.6679,3059.181665
4,May-20,3212.909949,2548.905026,2364.519089,677.744508,2995.157165
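
To make the principal-component shock easier to follow, here is a minimal sketch on synthetic data. It assumes the same recipe as `PCARiskScenario` (whitened PCA, scores scaled by `(1 + shock)`, then `inverse_transform`); the function `shock_principal_components` and the random data are illustrative and not part of the project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the factor levels: 60 observations of 5 correlated series.
common = rng.normal(size=(60, 1))
X = 100.0 + 10.0 * common + rng.normal(size=(60, 5))


def shock_principal_components(X, shocks, n_components=None):
    """Scale selected principal-component scores by (1 + shock) and map back to factor space."""
    pca = PCA(n_components=n_components, whiten=True)
    scores = pca.fit_transform(X)
    shocked_scores = scores.copy()
    for index, shock in shocks.items():  # index is the 0-based component number
        shocked_scores[:, index] *= (1 + shock)
    return pca.inverse_transform(shocked_scores)


unshocked = shock_principal_components(X, {})            # all components, no shock
shocked = shock_principal_components(X, {0: -0.2})       # -20% on the first component
truncated = shock_principal_components(X, {}, n_components=1)

print(np.allclose(unshocked, X))   # full PCA round trip
print(np.allclose(truncated, X))   # round trip after dropping components
print(np.allclose(shocked.mean(axis=0), X.mean(axis=0)))
```

Two properties of this sketch are worth keeping in mind when reading the output above. Dropping components changes the data even with a zero shock, so the result mixes the shock with the reconstruction error. And because the scores are centered, scaling them changes the deviations from each factor's mean while leaving the column means unchanged.
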


In [31]:
pcars.set_up(data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'}, shocks = shock0, factor_choice = 'factor')
pcars.load_raw_data()
pcars_x = pcars.generate_scenario()

### Risk simulator

In [8]:
class RiskSimulatorBased:
    """Simulator class that loads the risk factor and risk scenario information and outputs the performance of the portfolio
    after a specific risk shock has been applied to it.
    """
    
    def __init__(self, risk_generator_name, scenario_name):
        """initialize the risk generator and scenario based on the generator_name and scenario_name"""
        raise NotImplementedError("Subclasses should implement __init__ function!")
    
    def run_simulation(self):
        """run simulation based on the portfolio that you set in risk_generator, and scenario"""
        raise NotImplementedError("Subclasses should implement run_simulation function!")
    
    def set_up(self, **kwargs):
        """Function to set up any private variable for the simulator; you may want to set up parameters for the risk generator and scenario here too."""
        raise NotImplementedError("Subclasses should implement set_up function!")
    
    def reset_portfolio(self, portfolio):
        """reset portfolio on the risk generator, and re-train to get beta for risk factors"""
        raise NotImplementedError("Subclasses should implement reset_portfolio function!")

    def reset_scenario(self, scenario):
        """reset scenario on the scenario generator"""
        raise NotImplementedError("Subclasses should implement reset_scenario function!")

In [9]:
class RiskSimulatorFactory:
    """Map generator and scenario names to the corresponding classes."""
    def get_generator(self, generator_name, risk_factors):
        if generator_name == 'Regression':
            return SimpleRegressionRiskGenerator(generator_name, risk_factors)
        elif generator_name == 'Lasso':
            return LassoRegressionRiskGenerator(generator_name, risk_factors)
        elif generator_name == 'Ridge':
            return RidgeRegressionRiskGenerator(generator_name, risk_factors)
        raise ValueError("Unknown risk generator name: " + repr(generator_name))
    def get_scenario(self, scenario_name, risk_factors):
        if scenario_name == 'Naive':
            return NaiveRiskScenario(scenario_name, risk_factors)
        elif scenario_name == 'PCA':
            return PCARiskScenario(scenario_name, risk_factors)
        raise ValueError("Unknown scenario name: " + repr(scenario_name))
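
As the number of generators and scenarios grows, the name-to-class mapping could also be kept in a dictionary. The sketch below shows that alternative with hypothetical stand-in classes (`DemoNaiveScenario`, `DemoPCAScenario`) and a hypothetical helper `get_from_registry`; it is not the factory used in this notebook.

```python
class DemoNaiveScenario:
    """Hypothetical stand-in for NaiveRiskScenario."""

    def __init__(self, scenario_name, risk_factors):
        self.scenario_name = scenario_name
        self.risk_factors = risk_factors


class DemoPCAScenario(DemoNaiveScenario):
    """Hypothetical stand-in for PCARiskScenario."""


# Registry mapping the public name to the class that implements it.
SCENARIO_REGISTRY = {'Naive': DemoNaiveScenario, 'PCA': DemoPCAScenario}


def get_from_registry(registry, name, risk_factors):
    """Instantiate the class registered under `name`, or fail with a clear message."""
    try:
        cls = registry[name]
    except KeyError:
        raise ValueError(f"Unknown name {name!r}; expected one of {sorted(registry)}") from None
    return cls(name, risk_factors)


scenario = get_from_registry(SCENARIO_REGISTRY, 'PCA', ['equity', 'bond'])
print(type(scenario).__name__)
```

Adding a new scenario then only requires a new registry entry, and an unknown name fails immediately instead of returning `None`.
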

In [10]:
# define a risk simulator based on the naive and PCA risk scenarios above, together with the regression risk factor generators,
# and apply some shocks to a portfolio of mutual funds to see how the prices would move.

class SimpleRiskSimulator(RiskSimulatorBased):
    def __init__(self, risk_generator_name, scenario_name):
        """initialize the risk generator and scenario based on the generator_name and scenario_name"""
        self._risk_generator_name = risk_generator_name
        self._scenario_name = scenario_name
    
    def set_up(self, **kwargs):
        """Function to set up any private variable for the simulator; you may want to set up parameters for the risk generator and scenario here too."""
        self._risk_factors = kwargs.get('risk_factors') #list of risk factors
        self._hyperparameters = kwargs.get('hyperparameters') # dict: {'hyperparameter1': value1, ...}
        self._data_path = kwargs.get('data_path') # dict: {'fund': data_path for fund data, 'index': data_path for index data}
        
        self._factor_choice = kwargs.get('factor_choice')
        self._shocks = kwargs.get('shocks')
        
        self._risk_generator = RiskSimulatorFactory().get_generator(self._risk_generator_name, self._risk_factors)
        self._risk_scenario = RiskSimulatorFactory().get_scenario(self._scenario_name, self._risk_factors)
    
    def run_simulation(self):
        """run simulation based on the portfolio that you set in risk_generator, and scenario"""
        self._risk_generator.set_up(risk_factors = self._risk_factors, hyperparameters = self._hyperparameters,
                                    data_path = self._data_path)
        self._risk_generator.load_raw_data()
        self._risk_generator.split_train_test_data()
        self._risk_generator.fit()
        self._risk_generator.betas_fund()
        self._beta = self._risk_generator.betas_output()[:,1:]  # get rid of the factor name
        self._beta = self._beta.astype(float)  # the np.float alias was removed in NumPy 1.24
        
        self._risk_scenario.set_up(data_path = self._data_path, shocks = self._shocks, factor_choice = self._factor_choice)
        self._risk_scenario.load_raw_data()
        self._shocked_x = self._risk_scenario.generate_scenario().to_numpy()[:,1:] # get rid of the date
        self._shocked_x = self._shocked_x.astype(float)  # the np.float alias was removed in NumPy 1.24

        shocked_y = np.dot(self._shocked_x, self._beta)
        return shocked_y
    
    def reset_portfolio(self, portfolio):
        """reset portfolio on the risk generator, and re-train to get beta for risk factors"""
        new_data_path = {}
        new_data_path['index'] = self._data_path.get('index')
        new_data_path['fund'] = portfolio # portfolio is a string recording the data_path of new portfolio
        self._risk_generator.set_up(risk_factors = self._risk_factors, hyperparameters = self._hyperparameters, 
                                    data_path = new_data_path)
        self._risk_generator.load_raw_data()
        self._risk_generator.split_train_test_data()
        self._risk_generator.fit()
        self._risk_generator.betas_fund()
        new_beta = self._risk_generator.betas_output()[:,1:]
        new_beta = new_beta.astype(float)  # the np.float alias was removed in NumPy 1.24
        
        new_shocked_y = np.dot(self._shocked_x, new_beta)
        return new_shocked_y
    
    def reset_generator(self, generator):
        """reset risk generator and re-train to get beta for risk factors"""
        new_generator_name = generator.get('risk_generator_name')
        new_hyperparameters = generator.get('hyperparameters')
        
        new_risk_generator = RiskSimulatorFactory().get_generator(new_generator_name, self._risk_factors)
        new_risk_generator.set_up(risk_factors = self._risk_factors, hyperparameters = new_hyperparameters,
                                    data_path = self._data_path)
        new_risk_generator.load_raw_data()
        new_risk_generator.split_train_test_data()
        new_risk_generator.fit()
        new_risk_generator.betas_fund()
        new_beta = new_risk_generator.betas_output()[:,1:]  # get rid of the factor name
        new_beta = new_beta.astype(float)  # the np.float alias was removed in NumPy 1.24
        
        new_shocked_y = np.dot(self._shocked_x, new_beta)
        return new_shocked_y
        

    def reset_scenario(self, scenario):
        """reset scenario on the scenario generator"""
        new_scenario_name = scenario.get('scenario_name')
        new_factor_choice = scenario.get('factor_choice')
        new_shocks = scenario.get('shocks')
        
        new_risk_scenario = RiskSimulatorFactory().get_scenario(new_scenario_name, self._risk_factors)
        new_risk_scenario.set_up(data_path = self._data_path, shocks = new_shocks, factor_choice = new_factor_choice)
        new_risk_scenario.load_raw_data()
        new_shocked_x = new_risk_scenario.generate_scenario().to_numpy()[:,1:]
        new_shocked_x = new_shocked_x.astype(float)  # the np.float alias was removed in NumPy 1.24
        
        new_shocked_y = np.dot(new_shocked_x, self._beta)
        return new_shocked_y

In [32]:
simulator1 = SimpleRiskSimulator(risk_generator_name = 'Regression', scenario_name = 'Naive')
simulator1.set_up(risk_factors = factors, data_path = {'index': 'Index.csv', 'fund': 'Fund.csv'}, shocks = shock0)
shocked_y = simulator1.run_simulation()

In [33]:
shocked_y

array([[ 2.3921992 ,  2.32095925,  0.14291434, ...,  3.06064894,
         1.88090455,  2.9656781 ],
       [ 2.46945109,  2.39958251,  0.10963504, ...,  3.17968231,
         1.84095787,  3.08687481],
       [ 2.34973778,  2.27545113,  0.12024543, ...,  3.13581389,
         2.04063527,  3.03604223],
       ...,
       [-1.36793268, -1.42157756,  0.47989198, ..., -1.43438257,
        -0.10790335, -1.50618097],
       [-1.40592365, -1.45958026,  0.46199701, ..., -1.43093398,
        -0.11119687, -1.5027689 ],
       [-1.4356656 , -1.48789114,  0.45008473, ..., -1.42332392,
        -0.16996165, -1.49347286]])
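
The array above is the product of the shocked factor matrix and the beta matrix. A minimal sketch of that step, assuming `beta` has one row per factor and one column per fund (all sizes and values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_dates, n_factors, n_funds = 4, 3, 2

shocked_x = rng.normal(size=(n_dates, n_factors))  # one row per date, one column per factor
beta = rng.normal(size=(n_factors, n_funds))       # one row per factor, one column per fund

# Each entry (t, j) is the sum over factors k of shocked_x[t, k] * beta[k, j].
shocked_y = shocked_x @ beta

# Manual check of a single entry: first date, second fund.
manual = sum(shocked_x[0, k] * beta[k, 1] for k in range(n_factors))

print(shocked_y.shape)
print(np.isclose(shocked_y[0, 1], manual))
```

Since the regressions are fitted without an intercept, the product involves no constant term: each fund's shocked value is a pure linear combination of the shocked factor values.
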

In [34]:
scenario_changed_shocked_y = simulator1.reset_scenario(scenario = {'scenario_name':'PCA', 'factor_choice':'pc', 'shocks':shock1})

In [35]:
portfolio_changed_shocked_y = simulator1.reset_portfolio(portfolio = 'Fund2.csv')

## SQL Based Integration

In [36]:
# to do this assignment, you need to setup a mysql database on your local machine, and also import the sauma.core package provided by ca

# please define the setup_connection in this class

class SQLHandlerMixin:
    """Mixin class to include in the inheritance hierarchy in order to migrate all possible operations to SQL
    and so speed up calculation. It integrates the sauma.core package and uses its connection object."""
    
    def setup_connection(self, username, password):
        """initialize the connection object here, and use it for any operation"""
        self._conn = sauma.core.Connection(username=username, password=password)
    
    def setup_table_templates(self):
        """define the table templates as local variables in this method for all derived classes, 
        and use this method to set up tables"""
        raise NotImplementedError("Derived Class need to implement this method")
    
    def check_table_exist_or_not(self, schema, table_name):
        """Check whether a table exists under a given schema"""
        pass

    def look_up_or_create_table(self, template, custom_table_name=None, custom_schema_name=None):
        """Create a table based on the template if the table does not exist, and do nothing if it already exists. 
        You may want to use self.check_table_exist_or_not here; if custom_table_name is None, 
        the name should be found in the template"""
        pass
    
    def drop_table(self, schema, table_name):
        pass
    
    def chunks_update_table(self, schema, table_name, dataframe, **kwargs):
        """When a dataframe is large, uploading it to the SQL table in one go may take a long time. The dataframe can instead be
        divided into smaller chunks that are uploaded piece by piece, which is more memory efficient and uses less CPU.
        Try to implement this method here too"""
        pass
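
The mixin leaves `check_table_exist_or_not` and `chunks_update_table` as stubs. The sketch below illustrates the two ideas with the standard-library `sqlite3` module and `pandas.DataFrame.to_sql` on an in-memory database. It does not use MySQL or `sauma.core`, whose API is not shown here, and the helpers `table_exists` and `upload_in_chunks` are hypothetical names.

```python
import sqlite3

import pandas as pd


def table_exists(conn, table_name):
    """Return True if a table with this name exists in the SQLite database."""
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,),
    ).fetchone()
    return row is not None


def upload_in_chunks(conn, table_name, dataframe, chunk_size=1000):
    """Append the dataframe to the table chunk_size rows at a time."""
    for start in range(0, len(dataframe), chunk_size):
        chunk = dataframe.iloc[start:start + chunk_size]
        chunk.to_sql(table_name, conn, if_exists='append', index=False)


conn = sqlite3.connect(':memory:')
demo = pd.DataFrame({'id': range(2500), 'value': [i * 0.5 for i in range(2500)]})

exists_before = table_exists(conn, 'demo')
upload_in_chunks(conn, 'demo', demo, chunk_size=1000)  # three chunks: 1000, 1000, 500 rows
exists_after = table_exists(conn, 'demo')
row_count = conn.execute('SELECT COUNT(*) FROM demo').fetchone()[0]

print(exists_before, exists_after, row_count)
```

`to_sql` also accepts a `chunksize` argument that batches the inserts internally; the explicit loop is shown only because it makes the chunking visible.
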

### Risk factor generator (with SQL)

In [37]:
# redefine the risk generator by including the SQL mixin class in the inheritance hierarchy, and redefine all the data 
# collection and publishing processes using SQL instead of CSV files

class SimpleRegressionRiskGeneratorSQL(SimpleRegressionRiskGenerator, SQLHandlerMixin):
    """Include SQL operations in the risk generator"""
    def setup_table_templates(self):
        """Define all the table templates here. The templates should be defined as global variables in a Python file
        and imported for this class to use. Please check the sauma.core documentation; the template format should be
        something like:
        {
            "tableName": "Test",
            "schema": "test_db",
            "primaryKey":["id"],
            "columns": [{"name": "id",       "type":"INTEGER"},                           // case insensitive
                        {"name": "text_col", "type":"STRING", "size":50},
                        {"name": "int_col",  "type":"INT"}
            ],
            "description": 'sample table to know about the format'
        }
        
        For example, if you define a list of templates under performance_feature/custom.py, you can do
            from performance_feature.custom import TEMPLATE_A, TEMPLATE_B, TEMPLATE_C
        then set self.templateA = TEMPLATE_A inside this class and call
        self.look_up_or_create_table(self.templateA) in the setup_tables function.
        As the sauma.core package requires a JSON object as input, you may need to transform the dictionary
        into a JSON object:
            import json
            # Data to be written
            dictionary = {
              "id": "04",
              "name": "sunil",
              "department": "HR"
            }

            # Serializing json
            json_object = json.dumps(dictionary)
        """
        # take data from wrds for funds instead of morningstar
        from table_templates import template_factor, template_funds, template_results
        json_factor = json.dumps(template_factor)
        json_funds = json.dumps(template_funds)
        json_results = json.dumps(template_results)
        self._templates = {'factor': json_factor, 'funds': json_funds, 'results': json_results}
    
    def check_table_exist_or_not(self, schema, table_name):
        """Check whether a table exists under a given schema"""
        c = self._conn.connect()
        try:
            self._conn.get_table(schema = schema, table_name = table_name) 
        except Exception: # avoid a bare except, which would also swallow KeyboardInterrupt
            print(table_name + ' does not exist in schema ' + schema)
            return False
        print(table_name + ' exists in schema ' + schema)
        return True
    
    def look_up_or_create_table(self, template, custom_table_name=None, custom_schema_name=None):
        """Create a table from the template if it does not already exist; do nothing if it does.
        If custom_table_name (or custom_schema_name) is None, the value is taken from the template."""
        c = self._conn.connect()
        if custom_table_name is None:
            table_name = template.get('table_name')
        else: 
            table_name = custom_table_name    
        if custom_schema_name is None:
            schema_name = template.get('schema')
        else: 
            schema_name = custom_schema_name    
        if not self.check_table_exist_or_not(schema = schema_name , table_name = table_name):
            template_custom = dict(template) # shallow copy so the caller's template is not modified
            template_custom['table_name'] = table_name
            template_custom['schema'] = schema_name
            template_custom = json.dumps(template_custom) # sauma.core package requires a JSON object as input
            self._conn.create_table(template_custom)
                    
    def setup_tables(self):
        """setup all table based on the setup_table_templates"""
        # first let us setup the tables for each risk factor
        c = self._conn.connect()
        sql = "CREATE DATABASE IF NOT EXISTS factors"
        c.execute(sql)
        template_factor = json.loads(self._templates.get('factor')) # json object back to a dict
        for factor in self._risk_factors:
            self.look_up_or_create_table(template = template_factor, custom_table_name = factor, custom_schema_name= None) 
        # then we create the table for the funds
        sql = "CREATE DATABASE IF NOT EXISTS funds"
        c.execute(sql)
        template_funds = json.loads(self._templates.get('funds'))
        self.look_up_or_create_table(template_funds, custom_table_name = 'funds', custom_schema_name = None) 
        # finally, we create the table for the results
        sql = "CREATE DATABASE IF NOT EXISTS results"
        c.execute(sql)
        template_results = json.loads(self._templates.get('results'))
        self.look_up_or_create_table(template_results, custom_table_name = 'results', custom_schema_name = None)
        
    def __init__(self, risk_generator_name, risk_factors, username, password):
        SimpleRegressionRiskGenerator.__init__(self, risk_generator_name, risk_factors )
        self.setup_connection(username, password)
        self.setup_table_templates()
        self.setup_tables()
    
    def update_raw_data(self):
        """Assuming that the data source is a set of CSV files containing all the raw data, load the raw data from CSV and update
        the tables that were already set up from the templates"""
        c = self._conn.connect()
        indiv_factors_data = {}
        for factor, path in self._data_path.get('index').items(): # dict {factor name: path of the corresponding CSV file}; factor names are more convenient than index names
            factor_data = pd.read_csv(path,sep=',', thousands=',')
            factor_data.rename(columns={'Unnamed: 0': 'Date', 
                                        factor_data.columns[1]: factor_data.columns[1].split(' -')[0]}, inplace=True) 
            # the renaming above transforms data imported directly from Morningstar 
            self._conn.update_table(table_name = factor , dataframe = factor_data, schema = 'factors', index = False, if_exists = 'replace')
            indiv_factors_data[factor] = self._conn.get_dataframe(table_name = factor, schema = 'factors') # indiv_factors_data is a dict
        # merge all dataframes of individual factors on date column 
        red = partial(pd.merge, on='Date', how='outer')
        self._factors_data = reduce(red, indiv_factors_data.values())
        # now let us get the funds data
        funds_data = pd.read_csv(self._data_path.get('fund'),sep=',', thousands=',')
        self._conn.update_table(table_name = 'funds' , dataframe = funds_data, schema = 'Funds', index = False, if_exists = 'replace')
        self._assets_data = self._conn.get_dataframe(table_name = 'funds', schema = 'Funds')
        
    def load_SQL_data(self):
        c = self._conn.connect()
        indiv_factors_data = {}
        for factor in self._data_path.get('index'): # dict {factor name: path of the corresponding CSV file}; only the factor names are needed here
            indiv_factors_data[factor] = self._conn.get_dataframe(table_name = factor, schema = 'factors') # indiv_factors_data is a dict
        # merge all dataframes of individual factors on date column 
        red = partial(pd.merge, on='Date', how='outer')
        self._factors_data = reduce(red, indiv_factors_data.values())
        # now let us get the funds data
        self._assets_data = self._conn.get_dataframe(table_name = 'funds', schema = 'Funds')
        
    def publish_result(self):
        """Upload the result data to SQL; a JSON template describing the layout of the results table is needed as well"""
        df = pd.DataFrame.from_dict(self._betas_fund)
        df.insert(0, 'Factor', self._risk_factors)
        self._conn.update_table(table_name = 'results' , dataframe = df, schema = 'Results', index = False, if_exists = 'replace')

class LassoRegressionRiskGeneratorSQL(SimpleRegressionRiskGeneratorSQL):
    def fit(self, **kwargs):
        """Fit one Lasso regression per fund on the cached training data (self._X_train, self._Y_train), using the
        'alpha' and 'max_iter' hyperparameters, and store the fitted models in self._model. The models are fitted
        without an intercept."""
        self._model = {} # dict to store all the regression models (one for each asset/fund)
        alpha = self._hyperparameters.get('alpha')
        max_iter = self._hyperparameters.get('max_iter')
        for i in range(self._Y_train.shape[1]):
            lasso = Lasso(alpha = alpha, max_iter = max_iter, fit_intercept=False)
            lasso.fit(self._X_train,self._Y_train[:,i])
            self._model['fund'+str(i+1)] = lasso
    
    def model_summary(self):
        """Provide a summary of the model results: prediction accuracy, the metrics used
            to measure the model, and the hyperparameters of the model.

        Parameters:
            None

            Return
                dict {str: float/dataframe}
                key is the statistical measure name
                value is the statistical measure, either a number, a matrix or a dataframe
        """
        ### can add other performance measures later on 
        summary = dict()
        summary['hyperparameter'] = self._hyperparameters
        summary['rmse score'] = {}
        i = 0
        for fund in self._model.keys():
            mse = mean_squared_error(y_true = self._Y_test[:,i], y_pred = self._Y_pred[fund])
            rmse = math.sqrt(mse)
            summary['rmse score'][fund] = rmse
            i += 1  
        return summary
    
class RidgeRegressionRiskGeneratorSQL(SimpleRegressionRiskGeneratorSQL):
    def fit(self, **kwargs):
        """Fit one Ridge regression per fund on the cached training data (self._X_train, self._Y_train), using the
        'alpha' hyperparameter, and store the fitted models in self._model. The models are fitted without an
        intercept."""
        self._model = {} # dict to store all the regression models (one for each asset/fund)
        alpha = self._hyperparameters.get('alpha')
        for i in range(self._Y_train.shape[1]):
            ridge = Ridge(alpha = alpha, fit_intercept=False)
            ridge.fit(self._X_train,self._Y_train[:,i])
            self._model['fund'+str(i+1)] = ridge
    
    def model_summary(self):
        """Provide a summary of the model results: prediction accuracy, the metrics used
            to measure the model, and the hyperparameters of the model.

        Parameters:
            None

            Return
                dict {str: float/dataframe}
                key is the statistical measure name
                value is the statistical measure, either a number, a matrix or a dataframe
        """
        summary = dict()
        summary['hyperparameter'] = self._hyperparameters
        summary['rmse score'] = {}
        i = 0
        for fund in self._model.keys():
            mse = mean_squared_error(y_true = self._Y_test[:,i], y_pred = self._Y_pred[fund])
            rmse = math.sqrt(mse)
            summary['rmse score'][fund] = rmse
            i += 1  
        return summary
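
In the SQL-based generator, the per-factor tables are combined into one frame with `reduce` and `partial(pd.merge, on='Date', how='outer')`. The sketch below reproduces that step on three toy frames with made-up values, to show what the outer join does when the factors do not cover the same dates.

```python
from functools import partial, reduce

import pandas as pd

# Toy per-factor frames with partly overlapping dates (illustrative values only).
frames = {
    'equity': pd.DataFrame({'Date': ['Sep-20', 'Aug-20', 'Jul-20'], 'equity': [1.0, 2.0, 3.0]}),
    'bond': pd.DataFrame({'Date': ['Sep-20', 'Aug-20'], 'bond': [10.0, 20.0]}),
    'credit': pd.DataFrame({'Date': ['Aug-20', 'Jul-20'], 'credit': [0.1, 0.2]}),
}

# Fix the merge options once, then fold the frames together pairwise.
merge_on_date = partial(pd.merge, on='Date', how='outer')
merged = reduce(merge_on_date, frames.values())

print(merged)
print(merged.isna().sum())
```

An outer join keeps every date that appears in any frame and fills the gaps with `NaN`, so missing values may need to be handled before the regression step; an inner join would instead keep only the dates common to all factors.
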

Let us test these new risk generator classes (integrating the SQLHandlerMixin class)

In [67]:
# Read the database credentials from the environment instead of hard-coding them in the notebook;
# set USER and PASSWORD in the shell before starting Jupyter.
username = os.getenv('USER')
password = os.getenv('PASSWORD')

In [73]:
# factors need to be defined consistently
factors = ['morningstar msci emerging markets','s&p 500 pr', 'bbgbarc us treasury tr usd(1972)',
           'libor 3 mon interbank eurodollar inv tr','bbgbarc us credit tr usd']
# later will get list of factors from .txt file for example, once Miao shares with us the list of all the factors we should use

In [86]:
#c = sauma.core.Connection(username=username, password=password, schema='')
#conn = c.connect()
#sql = "DROP DATABASE IF EXISTS Results"
#conn.execute(sql)

<sqlalchemy.engine.result.ResultProxy at 0xa6d4dd35c0>

In [88]:
# first, we test the SimpleRegressionRiskGeneratorSQL class
x = SimpleRegressionRiskGeneratorSQL(risk_generator_name = 'Regression', risk_factors = factors, username = username, password = password)
x.set_up(hyperparameters = {}, data_path = {'index': {'morningstar msci emerging markets': 'Emerging Markets.csv',
                                                      's&p 500 pr': 'S&P 500.csv',
                                                      'bbgbarc us treasury tr usd(1972)': 'BBgBarc US Treasury.csv',
                                                      'libor 3 mon interbank eurodollar inv tr': 'LIBOR 3 Mon Interbank Eurodollar Inv.csv',
                                                      'bbgbarc us credit tr usd': 'BBgBarc US Credit.csv'}
                                            ,'fund': 'Fund.csv'})
x.update_raw_data()
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()

Enter schema(optional): 

morningstar msci emerging markets does not exist in schema Factors
s&p 500 pr does not exist in schema Factors
bbgbarc us treasury tr usd(1972) does not exist in schema Factors
libor 3 mon interbank eurodollar inv tr does not exist in schema Factors
bbgbarc us credit tr usd does not exist in schema Factors
funds does not exist in schema Funds
results does not exist in schema Results


{'fund1': array([ 2.10322011e-03,  8.60685540e-05,  8.01541329e-03, -1.21904875e-02,
        -5.15615412e-03]),
 'fund2': array([ 0.00210485,  0.00010685,  0.00810752, -0.01237828, -0.00522785]),
 'fund3': array([-0.00077671,  0.00028673,  0.00166943, -0.00069375, -0.0005299 ]),
 'fund4': array([-0.00077121,  0.0002782 ,  0.00167698, -0.0008211 , -0.00052513]),
 'fund5': array([-0.00047566, -0.00127983, -0.00538609,  0.00468851,  0.00527788]),
 'fund6': array([ 0.0013128 , -0.00244948, -0.00743549,  0.00200495,  0.00667327]),
 'fund7': array([-0.00098656,  0.00041856,  0.00391079, -0.00240102, -0.00173188]),
 'fund8': array([ 0.00271786, -0.00175246, -0.00355331, -0.00293746,  0.00298775]),
 'fund9': array([ 0.00279185, -0.0009381 ,  0.0006826 , -0.00753939, -0.00019654]),
 'fund10': array([ 0.0012385 , -0.00053869,  0.00176992, -0.00479297, -0.00059605]),
 'fund11': array([-0.00082254, -0.0001694 , -0.00479665,  0.00599526,  0.00379979]),
 'fund12': array([-2.48557189e-03,  7.47862462

In [89]:
# test class RidgeRegressionRiskGeneratorSQL 
x = RidgeRegressionRiskGeneratorSQL(risk_generator_name = 'Ridge', risk_factors = factors, username = username, password = password)
x.set_up(hyperparameters = {'alpha': 0.4}, data_path = {'index': {'morningstar msci emerging markets': 'Emerging Markets.csv',
                                                      's&p 500 pr': 'S&P 500.csv',
                                                      'bbgbarc us treasury tr usd(1972)': 'BBgBarc US Treasury.csv',
                                                      'libor 3 mon interbank eurodollar inv tr': 'LIBOR 3 Mon Interbank Eurodollar Inv.csv',
                                                      'bbgbarc us credit tr usd': 'BBgBarc US Credit.csv'}
                                            ,'fund': 'Fund.csv'})
x.update_raw_data()
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()

Enter schema(optional): 

morningstar msci emerging markets exists in schema Factors
s&p 500 pr exists in schema Factors
bbgbarc us treasury tr usd(1972) exists in schema Factors
libor 3 mon interbank eurodollar inv tr exists in schema Factors
bbgbarc us credit tr usd exists in schema Factors
funds exists in schema Funds
results exists in schema Results


{'fund1': array([ 2.10321939e-03,  8.60636510e-05,  8.01537976e-03, -1.21904556e-02,
        -5.15613003e-03]),
 'fund2': array([ 0.00210485,  0.00010685,  0.00810749, -0.01237825, -0.00522783]),
 'fund3': array([-0.00077671,  0.00028673,  0.00166942, -0.00069375, -0.0005299 ]),
 'fund4': array([-0.00077121,  0.0002782 ,  0.00167697, -0.00082109, -0.00052512]),
 'fund5': array([-0.00047566, -0.00127982, -0.00538607,  0.00468849,  0.00527786]),
 'fund6': array([ 0.0013128 , -0.00244948, -0.00743547,  0.00200494,  0.00667325]),
 'fund7': array([-0.00098656,  0.00041856,  0.00391078, -0.00240101, -0.00173187]),
 'fund8': array([ 0.00271786, -0.00175246, -0.0035533 , -0.00293746,  0.00298775]),
 'fund9': array([ 0.00279185, -0.0009381 ,  0.00068259, -0.00753937, -0.00019653]),
 'fund10': array([ 0.0012385 , -0.00053869,  0.00176991, -0.00479296, -0.00059604]),
 'fund11': array([-0.00082254, -0.00016939, -0.00479663,  0.00599524,  0.00379978]),
 'fund12': array([-2.48557172e-03,  7.47863409

In [90]:
# test class LassoRegressionRiskGeneratorSQL 
x = LassoRegressionRiskGeneratorSQL(risk_generator_name = 'Lasso', risk_factors = factors, username = username, password = password)
x.set_up(hyperparameters = {'alpha': 0.6, 'max_iter': 100000}, data_path = {'index': {'morningstar msci emerging markets': 'Emerging Markets.csv',
                                                      's&p 500 pr': 'S&P 500.csv',
                                                      'bbgbarc us treasury tr usd(1972)': 'BBgBarc US Treasury.csv',
                                                      'libor 3 mon interbank eurodollar inv tr': 'LIBOR 3 Mon Interbank Eurodollar Inv.csv',
                                                      'bbgbarc us credit tr usd': 'BBgBarc US Credit.csv'}
                                            ,'fund': 'Fund.csv'})
x.update_raw_data()
x.split_train_test_data()
x.fit()
x.predict()
x.betas_fund()

Enter schema(optional): 

morningstar msci emerging markets exists in schema Factors
s&p 500 pr exists in schema Factors
bbgbarc us treasury tr usd(1972) exists in schema Factors
libor 3 mon interbank eurodollar inv tr exists in schema Factors
bbgbarc us credit tr usd exists in schema Factors
funds exists in schema Funds
results exists in schema Results


{'fund1': array([ 2.06861665e-03, -5.56404773e-05,  7.02521235e-03, -1.12581494e-02,
        -4.43056530e-03]),
 'fund2': array([ 2.07024928e-03, -3.48551932e-05,  7.11733602e-03, -1.14459565e-02,
        -4.50227291e-03]),
 'fund3': array([-0.00077425,  0.0001691 ,  0.00092054, -0.        , -0.        ]),
 'fund4': array([-0.0007662 ,  0.00016332,  0.00092133, -0.00010734,  0.        ]),
 'fund5': array([-0.00044207, -0.0011063 , -0.00427622,  0.00365923,  0.00445405]),
 'fund6': array([ 0.00132306, -0.00227495, -0.00633205,  0.00098831,  0.00587521]),
 'fund7': array([-0.00099683,  0.00024405,  0.00280748, -0.00138449, -0.00093391]),
 'fund8': array([ 0.0027155 , -0.00167484, -0.00317009, -0.00323049,  0.00268661]),
 'fund9': array([ 2.78301099e-03, -9.81595541e-04,  2.63225394e-04, -7.10393637e-03,
         8.00487996e-05]),
 'fund10': array([ 0.00121142, -0.00065203,  0.00094417, -0.0040037 , -0.        ]),
 'fund11': array([-0.0007888 , -0.        , -0.00370226,  0.0049785 ,  0.00
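
The three outputs above show a pattern that can be reproduced on synthetic data: a small ridge penalty leaves the betas almost identical to plain least squares, while the lasso penalty sets some betas exactly to zero. The sketch below uses made-up data and penalty values chosen for this toy example only; it does not touch the project's CSV files or SQL tables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

rng = np.random.default_rng(0)
# Made-up data: 3 factors, but the target only depends on the first one
X = rng.normal(size=(200, 3))
y = 1.0 * X[:, 0] + 0.01 * rng.normal(size=200)

models = {
    'ols': LinearRegression(fit_intercept=False),
    'ridge': Ridge(alpha=0.4, fit_intercept=False),
    'lasso': Lasso(alpha=0.1, max_iter=100000, fit_intercept=False),
}
# Fit each model on the same data and collect its coefficient vector
betas = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, beta in betas.items():
    print(name, beta)
```

The lasso betas of the two irrelevant factors come out as exact zeros, which is what makes the lasso usable for factor selection; the price is that the remaining beta is shrunk towards zero as well.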

### Risk scenarios (with SQL): TO DO

In [94]:
# redefine the risk scenario by including the sql mixin class into the inheritance hierarchy, and redefine all the data 
#collection and publishing process using SQL instead of csv file

class NaiveRiskScenarioSQL(NaiveRiskScenario, SQLHandlerMixin):
    """Include SQL operations in the risk factor shock calculation"""
    def setup_table_templates(self):
        # take data from wrds for funds instead of morningstar
        from table_templates import template_factor
        json_factor = json.dumps(template_factor)
        json_shocked_factor = json.dumps(template_factor)
        self._templates = {'factor': json_factor, 'shocked_factor': json_shocked_factor}
    
    def check_table_exist_or_not(self, schema, table_name):
        """Check whether table_name exists under schema; print the outcome and return True or False"""
        c = self._conn.connect()
        try:
            self._conn.get_table(schema = schema, table_name = table_name) 
        except Exception: # any failure of get_table is treated as 'table not found'
            print(table_name + ' does not exist in schema ' + schema)
            return False
        print(table_name + ' exists in schema ' + schema)
        return True
    
    def look_up_or_create_table(self, template, custom_table_name=None, custom_schema_name=None):
        """Create a table from the template if it does not exist yet; do nothing if it already exists.
        When custom_table_name or custom_schema_name is None, the value is taken from the template"""
        c = self._conn.connect()
        table_name = template.get('table_name') if custom_table_name is None else custom_table_name
        schema_name = template.get('schema') if custom_schema_name is None else custom_schema_name
        if not self.check_table_exist_or_not(schema = schema_name, table_name = table_name):
            template_custom = dict(template) # shallow copy so that the caller's template is not modified
            template_custom['table_name'] = table_name
            template_custom['schema'] = schema_name
            template_custom = json.dumps(template_custom) # sauma.core package requires a json object as input
            self._conn.create_table(template_custom)
                    
    def setup_tables(self):
        """setup all table based on the setup_table_templates"""
        # first let us setup the tables for each risk factor
        c = self._conn.connect()
        sql = "CREATE DATABASE IF NOT EXISTS factors"
        c.execute(sql)
        template_factor = json.loads(self._templates.get('factor')) # json object back to a dict
        for factor in self._risk_factors:
            self.look_up_or_create_table(template = template_factor, custom_table_name = factor, custom_schema_name= None)
        # then we create the table for shocked factors
        c = self._conn.connect()
        sql = "CREATE DATABASE IF NOT EXISTS ShockedFactors"
        c.execute(sql)
        template_shocked_factor = json.loads(self._templates.get('shocked_factor'))
        for factor in self._risk_factors:
            self.look_up_or_create_table(template = template_shocked_factor, custom_table_name = factor, custom_schema_name= 'ShockedFactors')
    
    
    def __init__(self, scenario_name, risk_factors, username, password):
        NaiveRiskScenario.__init__(self, scenario_name, risk_factors)
        self.setup_connection(username, password)
        self.setup_table_templates()
        self.setup_tables()
    
    def update_raw_data(self):
        """assuming that your data source is the csv file containing all the raw data, load the raw data from csv, and update
        the table which you already setup based on your template"""
        c = self._conn.connect()
        indiv_factors_data = {}
        for factor, path in self._data_path.get('index').items(): # dict {factor name: path of the corresponding csv file}; the factor name is more convenient than the index name
            factor_data = pd.read_csv(path,sep=',', thousands=',')
            factor_data.rename(columns={'Unnamed: 0': 'Date', 
                                        factor_data.columns[1]: factor}, inplace=True)
            # transformation of the data imported directly from Morningstar 
            self._conn.update_table(table_name = factor , dataframe = factor_data, schema = 'factors', index = False, if_exists = 'replace')
            indiv_factors_data[factor] = self._conn.get_dataframe(table_name = factor, schema = 'factors') # indiv_factors_data is a dict
        # merge all dataframes of individual factors on date column 
        red = partial(pd.merge, on='Date', how='outer')
        self._factors_data = reduce(red, indiv_factors_data.values())

    def publish_result(self):
        """Upload the shocked factor data to SQL; a json template describing the layout of the result table has to be set up as well"""
        for factor in self._risk_factors:
            df = pd.DataFrame(self._factors_data_shocked[factor])
            self._conn.update_table(table_name = factor , dataframe = df, schema = 'ShockedFactors', index = False, if_exists = 'replace')
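
The merge step at the end of `update_raw_data` combines one dataframe per factor into a single table keyed on `Date`. The sketch below isolates that step on toy frames; the factor names and values are made up for illustration, and no SQL connection is involved.

```python
from functools import partial, reduce
import pandas as pd

# Toy per-factor frames (made-up values); each has a 'Date' column plus one factor column
frames = {
    'factor_a': pd.DataFrame({'Date': ['9/2020', '8/2020'], 'factor_a': [1.0, 2.0]}),
    'factor_b': pd.DataFrame({'Date': ['9/2020', '7/2020'], 'factor_b': [10.0, 30.0]}),
    'factor_c': pd.DataFrame({'Date': ['8/2020', '7/2020'], 'factor_c': [200.0, 300.0]}),
}

# partial fixes the merge key and join type, reduce folds the frames pairwise:
# merge(merge(factor_a, factor_b), factor_c)
merge_on_date = partial(pd.merge, on='Date', how='outer')
merged = reduce(merge_on_date, frames.values())
print(merged)
```

With `how='outer'`, a date that is missing for one factor is kept and filled with `NaN` in that factor's column, so the merged table may need cleaning before it is passed to a model that does not accept missing values.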

In [97]:
class PCARiskScenarioSQL(NaiveRiskScenarioSQL):
    def set_up(self, **kwargs):
        """Function to setup any private variable for the allocator"""
        self._data_path = kwargs.get('data_path') # dict: {'fund': data_path for fund data, 'index': data_path for index data}
        self._shocks = kwargs.get('shocks') # dict: {'factor1': shock1, ...} where shock1 = -0.1 to create a -10% shock
        self._factor_choice = kwargs.get('factor_choice') # 'pc' to shock the principal components, 'factor' to shock specific risk factors 
        
    def generate_scenario(self, **kwargs):
        """Function to generate scenario based on config you provide in **kwargs, you may have some machine learning based model
        to derive the correlation of risk factors, and generate correlated risk scenarios, or you could assume that the movement
        of risk factor are independent of each other, and apply different shock on the risk factors"""
        if self._factor_choice == 'pc': # this means we shock the principal components
            X = self._factors_data.to_numpy()[:,1:]# convert to numpy array and get rid of dates
            pca = PCA(whiten = True).fit(X)
            threshold = 0.85 # set a threshold to decide the number of principal components that should be kept
            n_components = np.where(np.cumsum(pca.explained_variance_ratio_) >= threshold)[0][0] + 1
            pca = PCA(n_components = n_components, whiten = True)
            pc_X = pca.fit_transform(X)
            pc_list = ['pc_'+str(i+1) for i in range(n_components)]
            pc_shocked = pc_X.copy()
            for i, pc in enumerate(pc_list):
                shock = self._shocks.get(pc)
                if shock is None:
                    shock = 0.0
                pc_shocked[:,i] = pc_shocked[:,i]*(1+shock)
            X_shocked = pca.inverse_transform(pc_shocked)
            self._factors_data_shocked = self._factors_data.copy()
            self._factors_data_shocked.iloc[:,1:] = X_shocked
            return self._factors_data_shocked   
        elif self._factor_choice == 'factor': # this means we shock specific risk factors, just like the naive risk scenario
            return super().generate_scenario()
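
The `'pc'` branch above can be tried in isolation. The following is a minimal sketch on made-up, strongly correlated series: it picks the number of components from the cumulative explained variance, scales the first principal component by `(1 + shock)`, and maps the result back to factor space. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy factor data (made up): 3 series driven by one common component plus small noise
common = rng.normal(size=(120, 1))
X = common @ np.array([[1.0, 0.8, 0.5]]) + 0.1 * rng.normal(size=(120, 3))

# Keep the smallest number of components whose cumulative explained variance reaches the threshold
threshold = 0.85
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= threshold)) + 1

pca = PCA(n_components=n_components, whiten=True)
scores = pca.fit_transform(X)

# Apply a -20% shock to the first principal component only
shocked_scores = scores.copy()
shocked_scores[:, 0] *= 1 - 0.2

X_shocked = pca.inverse_transform(shocked_scores)   # shocked factors
X_baseline = pca.inverse_transform(scores)          # unshocked reconstruction, for comparison
print(n_components, X_shocked[:3])
```

Two properties are worth keeping in mind when reading the output. scikit-learn's `PCA` centers the data, so scaling a component scales the deviation from each factor's mean rather than the factor level itself. In addition, when fewer components are kept than there are factors, `inverse_transform` returns an approximation, so even the unshocked reconstruction differs slightly from the input.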

In [96]:
x = NaiveRiskScenarioSQL(scenario_name = 'Naive', risk_factors = factors, username = username, password = password)
x.set_up(data_path = {'index': {'morningstar msci emerging markets': 'Emerging Markets.csv',
                                                      's&p 500 pr': 'S&P 500.csv',
                                                      'bbgbarc us treasury tr usd(1972)': 'BBgBarc US Treasury.csv',
                                                      'libor 3 mon interbank eurodollar inv tr': 'LIBOR 3 Mon Interbank Eurodollar Inv.csv',
                                                      'bbgbarc us credit tr usd': 'BBgBarc US Credit.csv'}}, shocks = {'s&p 500 pr': -0.2})
x.update_raw_data()
x.generate_scenario()

Enter schema(optional): 

morningstar msci emerging markets exists in schema Factors
s&p 500 pr exists in schema Factors
bbgbarc us treasury tr usd(1972) exists in schema Factors
libor 3 mon interbank eurodollar inv tr exists in schema Factors
bbgbarc us credit tr usd exists in schema Factors
morningstar msci emerging markets exists in schema ShockedFactors
s&p 500 pr exists in schema ShockedFactors
bbgbarc us treasury tr usd(1972) exists in schema ShockedFactors
libor 3 mon interbank eurodollar inv tr exists in schema ShockedFactors
bbgbarc us credit tr usd exists in schema ShockedFactors


Unnamed: 0,Date,morningstar msci emerging markets,s&p 500 pr,bbgbarc us treasury tr usd(1972),libor 3 mon interbank eurodollar inv tr,bbgbarc us credit tr usd
0,9/2020,3452.73,2690.400,2580.77,677.69,3392.20
1,8/2020,3505.42,2800.248,2577.19,677.83,3401.53
2,7/2020,3455.60,2616.896,2605.74,677.97,3445.17
3,6/2020,3292.46,2480.232,2576.30,678.09,3342.11
4,5/2020,3137.93,2435.448,2573.89,678.18,3282.14
5,4/2020,3051.36,2329.944,2580.42,678.14,3229.64
6,3/2020,2934.74,2067.672,2564.12,677.61,3088.31
7,2/2020,3220.36,2363.376,2492.04,677.07,3307.52
8,1/2020,3331.68,2580.416,2427.69,676.37,3263.14
9,12/2019,3339.77,2584.624,2369.78,675.58,3188.55
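
`NaiveRiskScenario.generate_scenario` is defined earlier in the notebook and is not repeated here. As a rough sketch of the convention documented in `set_up` (a shock of -0.1 means a -10% move), a shock can be applied by multiplying the affected column by `(1 + shock)`. The function name `apply_naive_shocks` and the toy values are hypothetical and only illustrate that convention.

```python
import pandas as pd

def apply_naive_shocks(factors_data, shocks):
    """Return a copy of factors_data in which each shocked column is scaled by (1 + shock).

    Hypothetical helper: factors not listed in `shocks` are left unchanged,
    i.e. the factors are assumed to move independently of each other.
    """
    shocked = factors_data.copy()
    for factor, shock in shocks.items():
        shocked[factor] = shocked[factor] * (1 + shock)
    return shocked

# Toy levels (made up)
levels = pd.DataFrame({'Date': ['9/2020', '8/2020'],
                       'index_a': [100.0, 110.0],
                       'index_b': [50.0, 55.0]})
shocked = apply_naive_shocks(levels, {'index_a': -0.2})
print(shocked)
```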


In [115]:
x = PCARiskScenarioSQL(scenario_name = 'PCA', risk_factors = factors, username = username, password = password)
x.set_up(data_path = {'index': {'morningstar msci emerging markets': 'Emerging Markets.csv',
                                                      's&p 500 pr': 'S&P 500.csv',
                                                      'bbgbarc us treasury tr usd(1972)': 'BBgBarc US Treasury.csv',
                                                      'libor 3 mon interbank eurodollar inv tr': 'LIBOR 3 Mon Interbank Eurodollar Inv.csv',
                                                      'bbgbarc us credit tr usd': 'BBgBarc US Credit.csv'}}, shocks = {'pc_1': -0.2},factor_choice = 'pc')
x.update_raw_data()
x.generate_scenario()

Enter schema(optional): 

morningstar msci emerging markets exists in schema Factors
s&p 500 pr exists in schema Factors
bbgbarc us treasury tr usd(1972) exists in schema Factors
libor 3 mon interbank eurodollar inv tr exists in schema Factors
bbgbarc us credit tr usd exists in schema Factors
morningstar msci emerging markets exists in schema ShockedFactors
s&p 500 pr exists in schema ShockedFactors
bbgbarc us treasury tr usd(1972) exists in schema ShockedFactors
libor 3 mon interbank eurodollar inv tr exists in schema ShockedFactors
bbgbarc us credit tr usd exists in schema ShockedFactors


Unnamed: 0,Date,morningstar msci emerging markets,s&p 500 pr,bbgbarc us treasury tr usd(1972),libor 3 mon interbank eurodollar inv tr,bbgbarc us credit tr usd
0,9/2020,3404.811278,2720.314412,2472.158183,688.342291,3168.113801
1,8/2020,3454.425614,2764.692906,2500.029580,691.086594,3212.897777
2,7/2020,3400.050169,2716.055747,2469.483578,688.078941,3163.816224
3,6/2020,3284.010200,2612.261573,2404.296854,681.660455,3059.073689
4,5/2020,3213.081920,2548.818413,2364.452115,677.737220,2995.050853
5,4/2020,3145.379162,2488.260384,2326.419350,673.992397,2933.939509
6,3/2020,2996.308271,2354.921068,2242.677143,665.746881,2799.381880
7,2/2020,3207.277138,2543.626215,2361.191216,677.416142,2989.811214
8,1/2020,3283.021423,2611.377142,2403.741397,681.605764,3058.181177
9,12/2019,3259.265705,2590.128385,2390.396363,680.291770,3036.738272


### Synthetic Return Series Generator

In [120]:
class SyntheticReturnSeriesGenerator(LassoRegressionRiskGenerator):
    """Generate synthetic return series using RBSA"""
    def __init__(self):
        super().__init__
    
    def synthetic_return_series_generation(self):
        """Implement Lasso regression to fill in the return series of funds whose return history is shorter than that of the indexes.
           Inputs are the return series of the funds and of the indexes.
           Output is the dataframe of fund return series, with the missing history filled in for the funds that qualify for RBSA
        """
        X, Y = self._factors_data.to_numpy()[:,1:], self._assets_data.to_numpy()[:,1:]
        self._asset_data_synthetic = self._assets_data.copy()
        alpha = self._hyperparameters.get('alpha')
        max_iter = self._hyperparameters.get('max_iter')
        for i in range(Y.shape[1]):
            if sum(np.isnan(Y[:,i])) == 0:
                continue
            elif sum(np.isnan(Y[:,i])[:36]) > 0: # check whether the most recent 36 months of returns exist
                print('Fund ' + self._assets_data.columns[i+1] +' cannot use RBSA to generate synthetic return series') 
                self._asset_data_synthetic.iloc[:, i+1] = np.nan # positional indexing: column 0 is excluded from Y, so fund i sits in column i+1
            else:
                idx = ~np.isnan(Y[:,i]) # rows where the return series exists
                lasso = Lasso(alpha = alpha, max_iter = max_iter, fit_intercept=False)
                lasso.fit(X[idx], Y[idx, i])
                beta = lasso.coef_
                score = lasso.score(X[idx], Y[idx, i]) # in-sample R^2 on the observed history
                print(score)
                if score >= 0.7: # check whether the LASSO regression reaches the threshold required to create a synthetic return history
                    weight = beta/np.sum(beta)
                    synthetic_rtn = np.dot(X, weight) + np.random.normal(size = len(X)) # add independent standard normal noise
                    self._asset_data_synthetic.iloc[~idx, i+1] = synthetic_rtn[~idx]
                else:
                    self._asset_data_synthetic.iloc[:, i+1] = np.nan
                    print('Fund ' + self._assets_data.columns[i+1] +' cannot use RBSA to generate synthetic return series') 
        
        self._asset_data_synthetic.dropna(axis = 1, inplace = True)
        return self._asset_data_synthetic
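
The backfilling idea can be tested on a single made-up fund. The sketch below is a simplified variant, not the class above: it fits the lasso on the months where the fund return is observed and, if the in-sample R² reaches the 0.7 threshold, fills the missing months with the fitted values. Unlike the method above, it uses the raw lasso coefficients (no normalisation to weights summing to one) and adds no noise; the data, the `alpha` value and the variable names are assumptions chosen for this toy example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_months, n_factors = 120, 3
# Made-up monthly factor returns, most recent month first
X = rng.normal(scale=0.04, size=(n_months, n_factors))
true_w = np.array([0.6, 0.4, 0.0])
y = X @ true_w + rng.normal(scale=0.002, size=n_months)
y[60:] = np.nan  # the toy fund only has the most recent 60 months of history

observed = ~np.isnan(y)
# alpha is small because the penalty is compared against return-scale covariances
lasso = Lasso(alpha=1e-5, max_iter=100000, fit_intercept=False)
lasso.fit(X[observed], y[observed])
r2 = lasso.score(X[observed], y[observed])  # in-sample R^2 on the observed months

y_filled = y.copy()
if r2 >= 0.7:
    # backfill only the months where the fund return is missing
    y_filled[~observed] = X[~observed] @ lasso.coef_
print(r2, lasso.coef_)
```

The lasso penalty is not scale-free: the same `alpha` shrinks coefficients much more aggressively when the inputs are small numbers such as monthly returns, so the penalty has to be tuned to the scale of the data before the R² threshold becomes meaningful.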

In [121]:
x = SyntheticReturnSeriesGenerator()
x.set_up(risk_factors = factors, hyperparameters = {'alpha': 0.6, 'max_iter': 100000}, data_path = {'index': 'Index.csv', 'fund': '../data/syn_rtn_sample_fund.csv'})
x.load_raw_data()

In [122]:
syn_rtn = x.synthetic_return_series_generation()

0.0025303240494961576
Fund AAAAX cannot use RBSA to generate synthetic return series
Fund AAABX cannot use RBSA to generate synthetic return series
Fund AAACX cannot use RBSA to generate synthetic return series
Fund AAADX cannot use RBSA to generate synthetic return series
Fund AAAEX cannot use RBSA to generate synthetic return series
0.05494484687615819
Fund AAAGX cannot use RBSA to generate synthetic return series
0.004197575267383402
Fund AAANX cannot use RBSA to generate synthetic return series
0.0020486686981885383
Fund AAAPX cannot use RBSA to generate synthetic return series
0.008989358398208247
Fund AAAQX cannot use RBSA to generate synthetic return series
0.005553776919374087
Fund AAARX cannot use RBSA to generate synthetic return series
0.0029727270991414256
Fund AAASX cannot use RBSA to generate synthetic return series
0.005953371004166819
Fund AAATX cannot use RBSA to generate synthetic return series
Fund AAAU cannot use RBSA to generate synthetic return series
0.0095322672