# Project: Investment Factor Replication and Analysis

In this project, I developed a comprehensive framework to replicate and analyze key investment factors that are widely used in hedge fund strategies. The objective was to deepen my understanding of asset pricing anomalies and the construction of long-short portfolios by recreating several popular factors—similar to those found in the Capital IQ Alpha Factor Library—using historical stock data.

## Overview & Objectives

- **Replication of Investment Factors:**  
  I focused on replicating seven fundamental factors including Price Momentum, Analyst Expectations, Long-Term Momentum, Valuation (Book-to-Price), CAPM Beta, Size (Log Market Cap), and 12-Month Realized Price Volatility. The goal was to recreate these factors over a sample period from 1970 to 2019, ensuring the results correlated strongly (targeting correlations of 0.9 or higher) with benchmark data.

- **Data-Driven Approach:**  
  Using a dataset drawn from Compustat-Capital IQ and I/B/E/S (Thomson Reuters), I processed historical stock returns and firm-level characteristics. The dataset covers S&P 500 constituents and carries a month-by-month membership flag, so each month's portfolios are formed only from firms that were in the index at that time.

## Methodology

- **Data Processing & Organization:**  
  I standardized the data to a monthly frequency and transformed individual variables into matrix forms, where each row represented a date and each column a firm. This allowed for efficient computation and portfolio construction.

- **Portfolio Sorting & Construction:**  
  Stocks were sorted into quintiles based on the specific characteristics relevant to each factor. I constructed equal-weighted portfolios, rebalancing them monthly, and calculated the quantile spread (difference between the top and bottom quintile returns) as a measure of factor performance.

- **Performance and Risk Analysis:**  
  In addition to plotting the time series of the factor spreads, I computed key performance statistics—mean, standard deviation, Sharpe ratio, skewness, kurtosis, and maximum drawdown. I also performed hypothesis testing to assess whether the mean returns of these factors were statistically significant.
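
The sorting step described above can be sketched for a single month as follows. This is a minimal illustration on made-up data, not the notebook's own function: the name `quintile_spread` and the toy signal and return values are assumptions, and unlike `factor_return` below it drops firms with a missing return before sorting.

```python
import numpy as np
import pandas as pd

def quintile_spread(signal: pd.Series, returns: pd.Series) -> float:
    """Equal-weighted top-minus-bottom quintile return for one month."""
    # Keep only firms that have both a signal and a return this month
    valid = pd.concat([signal, returns], axis=1, keys=["signal", "ret"]).dropna()
    ranked = valid.sort_values("signal", ascending=False)
    binsize = round(len(ranked) / 5)
    if binsize == 0:
        return np.nan  # too few firms to form quintiles
    top = ranked["ret"].iloc[:binsize].mean()
    bottom = ranked["ret"].iloc[-binsize:].mean()
    return top - bottom

# Hypothetical toy month: ten firms with signals 1..10 and made-up returns
firms = list("ABCDEFGHIJ")
signal = pd.Series(np.arange(1, 11), index=firms, dtype=float)
returns = pd.Series([0.01, 0.02, 0.00, -0.01, 0.03, 0.01, 0.02, 0.00, 0.04, 0.06],
                    index=firms)
spread = quintile_spread(signal, returns)
```

With ten firms the bin size is two, so the spread is the mean return of the two highest-signal firms minus the mean return of the two lowest-signal firms.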

In [1]:
# Import modules
import pandas as pd
import numpy as np
from datetime import datetime
import statsmodels.api as sm
import statsmodels.formula.api as smf # load the econometrics package
from scipy.stats import ttest_1samp
import warnings
warnings.simplefilter(action='ignore')
from statsmodels.regression.rolling import RollingOLS
import matplotlib.pyplot as plt

In [2]:
#Import our data
SP500 = pd.read_csv('SP500.csv',header=None)
SP500.columns = ['Date','ret_sp500','rf']# Specify the names of the columns
SP500 = SP500.drop(0,axis=0) #drop first row
SP500[['ret_sp500','rf']] =  SP500[['ret_sp500','rf']].astype(float) #convert values to floats
SP500['Date'] =  SP500['Date'].astype(str) #convert dates to strings

Data = pd.read_csv('Data.csv',header=None)
Data.rename(columns=Data.iloc[0], inplace=True)
Data = Data.drop(0,axis=0) 
Data['gvkey'] = Data['gvkey'].astype(str)
Data['date']=Data['date'].astype('int') #convert dates to integers to sort according to value
mask = (Data["date"] >=19700131) & (Data["date"] <= 20191231) #create mask to select specific dates within our dataframe
Data = Data.loc[mask]

CapitalIQ = pd.read_csv('CapitalIQ.csv',header=None)
CapitalIQ.rename(columns=CapitalIQ.iloc[0], inplace=True)
CapitalIQ = CapitalIQ.drop(0,axis=0) 

In [3]:
Data['date'] = Data['date'].astype(str)
Data['date'] = Data['date'].apply(lambda x: x[0:6]) #standardize "date" to monthly format

CapitalIQ['Date'] = CapitalIQ['Date'].astype(str)
CapitalIQ['Date'] = CapitalIQ['Date'].apply(lambda x: x[0:6])

In [4]:
#create a dataframe that tells us which companies are in S&P500 for each period
sp_members = pd.DataFrame(columns = Data['gvkey'].unique(), index = Data['date'].unique())
sp_members.sort_index(axis=0,inplace=True) #sort dates 

for i in sp_members.columns:
    x = Data[['sp500','date']].loc[Data['gvkey'] == i] #select the sp500 flag and date rows for this gvkey
    x = x.set_index(x['date']) #use the corresponding date as our index 
    for j in x.index:
        sp_members.loc[j, i] = x['sp500'][j] #record the membership flag for each date
        
sp_members = sp_members.fillna(0) #treat firm-months with no record as non-members
sp_members = sp_members.astype(int)
sp_members.replace(to_replace=0, value= np.nan,inplace=True) #set non-members to NaN so that multiplying by this mask removes them

In [5]:
def var_mat(var_name):
    df_var = pd.DataFrame(columns = Data['gvkey'].unique(), index = Data['date'].unique())  #create empty dataframe to contain factor data
    df_var.sort_index(axis=0,inplace=True)  #sort the date as indices
    for i in df_var.columns:                                                                       
        var_data = Data[[var_name,'date']].loc[Data['gvkey']==i]   #extract the data for each company from the original data file
        var_data = var_data.set_index(var_data['date'])                   
        df_var[i]=var_data[var_name]                                                    
    df_var = df_var.astype(float)
    df_var = df_var.multiply(sp_members)    #filter out the companies that are not in sp500 for the current time period
    df_var = df_var.dropna(axis=1, how='all')    #drop the companies that have never been in sp500
    return df_var
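
The date-by-firm layout that `var_mat` builds column by column can also be obtained with a pivot. The sketch below is only an illustration under assumptions: `long_df` is a hypothetical four-row panel that reuses the column names of `Data`, and it is not the dataset used in this project.

```python
import pandas as pd

# Hypothetical long-format panel with the same column names as Data
long_df = pd.DataFrame({
    "gvkey": ["1001", "1001", "1002", "1002"],
    "date": ["197001", "197002", "197001", "197002"],
    "trt1m": [1.5, -0.5, 2.0, 0.25],
    "sp500": [1, 1, 1, 0],
})

# One row per month, one column per firm
wide = long_df.pivot(index="date", columns="gvkey", values="trt1m").sort_index()

# Membership mask: 1 where the firm is in the index, NaN elsewhere
flags = long_df.pivot(index="date", columns="gvkey", values="sp500").sort_index()
members = flags.where(flags == 1)

# Mask out non-member firm-months and drop firms that were never members
filtered = (wide * members).dropna(axis=1, how="all")
```

A pivot assumes each (date, gvkey) pair appears at most once; duplicated pairs would raise an error rather than being silently overwritten.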

In [6]:
#extract the monthly return matrix to facilitate calculation of factor returns
trt1m_mat = var_mat('trt1m')
trt1m_mat = trt1m_mat / 100 #Convert the monthly returns from percentage to actual values
trt1m_mat_T = trt1m_mat.T

In [31]:
#create a function that takes the factor data for each company as input, and outputs a dataframe containing top and bottom portfolio returns, along with the factor returns
def factor_return(var_mat):
    var_mat_T = var_mat.T
    factor_matrix = pd.DataFrame(columns=['Top', 'Bottom','QSpread'],index=var_mat.index)
    for i in var_mat_T.columns:
        companies = var_mat_T[i].sort_values(ascending =False).dropna() #sort the companies by factor value, drop the non sp500 companies
        num = len(companies)                                                                                 #count the number of existing companies
        binsize = round(num/5)                                                                              #calculate the binsize
        top_companies = companies[:binsize].index                                        #extract the top and bottom bin 
        bottom_companies = companies[-binsize:].index
        valid_top_companies = [comp for comp in top_companies if comp in trt1m_mat_T[i].index]
        valid_bottom_companies = [comp for comp in bottom_companies if comp in trt1m_mat_T[i].index]
        top_average_return = trt1m_mat_T[i].loc[valid_top_companies].mean()
        # top_average_return = trt1m_mat_T[i][top_companies].mean()
        bottom_average_return = trt1m_mat_T[i].loc[valid_bottom_companies].mean()
        # bottom_average_return = trt1m_mat_T[i][bottom_companies].mean()
        factor_matrix.loc[i, 'Top'] = top_average_return                                          #store the top and bottom bin returns
        factor_matrix.loc[i, 'Bottom'] = bottom_average_return
    factor_matrix['QSpread'] = factor_matrix['Top']-factor_matrix['Bottom']        #compute the qspread once, after all months are filled
    return factor_matrix

In [8]:
def turnover(var_mat):
    var_mat_T=var_mat.T    
    turnover_matrix = pd.DataFrame(columns=['Companies','Turnover','Ratio'],index=var_mat.index)  #create empty dataframe to contain factor data
    for i in var_mat_T.columns:
        companies=var_mat_T[i].sort_values().dropna()            #sort the companies by factor value, drop the non sp500 companies
        num = len(companies)                               #count the number of existing companies
        binsize = round(num/5)                            #calculate the binsize
        top_companies = companies[:binsize].index                      #extract the top and bottom bin 
        bottom_companies = companies[-binsize:].index
        total_companies=top_companies.append(bottom_companies)            
        turnover_matrix.at[i,'Companies']=total_companies.tolist()                    #save the current portfolio into the turnover matrix so we can compare them later
    for i in range(1,len(turnover_matrix)):                                #loop over every period instead of a hard-coded 600
        companies_last = turnover_matrix['Companies'].iloc[i-1]            #we compare the current company list with the companies in the previous portfolio
        companies_now = turnover_matrix['Companies'].iloc[i]
        turnover_count=0
        for j in companies_last:                                                #we count the number of companies that left our portfolio, and calculate the turnover rate based on that 
            if j not in companies_now:
                turnover_count=turnover_count+1
        date = turnover_matrix.index[i]
        turnover_matrix.loc[date,'Turnover']=turnover_count
        if len(companies_last) != 0:
            turnover_matrix.loc[date,'Ratio']=turnover_count/len(companies_last)
        else:
            turnover_matrix.loc[date,'Ratio']=np.nan
    return turnover_matrix
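
The turnover ratio used here is the share of last period's holdings that are no longer held this period. A minimal standalone sketch of that definition follows; the helper name `turnover_ratio` and the ticker letters are hypothetical and chosen only for illustration.

```python
def turnover_ratio(previous, current):
    """Share of last period's holdings that are no longer held."""
    if len(previous) == 0:
        return float("nan")  # no prior portfolio to compare against
    current_set = set(current)  # set lookup avoids a nested scan
    dropped = [c for c in previous if c not in current_set]
    return len(dropped) / len(previous)

# One of four names is replaced between the two periods
ratio = turnover_ratio(["A", "B", "C", "D"], ["B", "C", "D", "E"])
```

This measure counts only names that leave the portfolio; it does not weight by position size, which is reasonable for equal-weighted portfolios of roughly constant size.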

### Factor 1: Price Momentum

In [9]:
prccm_mat = var_mat('prccm')
HighM_mat=var_mat('prchm')
LowM_mat=var_mat('prclm')

In [10]:
CloseM = prccm_mat
HighM = HighM_mat
LowM = LowM_mat

HL1M=(HighM-CloseM)/(CloseM-LowM) 

HL1M.replace([np.inf,-np.inf],np.nan,inplace=True) #replace infinite values with NaN so as to not interfere with calculation 
HL1M=HL1M.shift().iloc[1:] #lag the signal by one month so that month t-1 values form the month t portfolio

HL1M_factor=factor_return(HL1M) #dataframe containing top and bottom portfolio returns, along with the QSpread
HL1M_qspread = HL1M_factor['QSpread']['198701':]
HL1M_factor

Date,Top,Bottom,QSpread
197002,0.086551,0.046294,0.040257
197003,0.018495,0.007092,0.011403
197004,-0.110325,-0.110255,-0.00007
197005,-0.071576,-0.064375,-0.007201
197006,-0.039339,-0.066994,0.027655
...,...,...,...
201908,-0.022628,-0.043709,0.021082
201909,0.062645,0.009389,0.053256
201910,0.018128,0.009201,0.008926
201911,0.026092,0.039602,-0.01351


In [11]:
x1=CapitalIQ.loc[113:]['HL1M'].astype(float).tolist() #use data from 113th row (where our non-NaN values begin)
y1=HL1M_factor.loc['199605':]['QSpread'].tolist() 
np.corrcoef(x1,y1)

array([[1.        , 0.99098076],
       [0.99098076, 1.        ]])
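
The comparison above lines the two series up by position (row 113 of `CapitalIQ` against `'199605'` in the replicated factor). An alternative that is less sensitive to off-by-one offsets is to align on the date label first. The sketch below uses hypothetical values and assumes both series are indexed by `YYYYMM` strings; it is not the benchmark data.

```python
import pandas as pd

# Hypothetical replicated and benchmark spreads, both indexed by YYYYMM strings
replicated = pd.Series([0.01, -0.02, 0.03, 0.00],
                       index=["199605", "199606", "199607", "199608"])
benchmark = pd.Series([0.012, -0.018, 0.031],
                      index=["199606", "199607", "199608"])

# Align on the date index and keep only months present in both series
aligned = pd.concat([replicated, benchmark], axis=1,
                    keys=["replicated", "benchmark"]).dropna()
corr = aligned["replicated"].corr(aligned["benchmark"])
```

Months that appear in only one series are dropped before the correlation is computed, so differing start dates do not shift one series against the other.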

### Factor 2: Expected LTG

In [12]:
LTG_mat = var_mat('LTG')

In [13]:
LTG_factor = factor_return(LTG_mat)
LTG_qspread = LTG_factor['QSpread']['199605':]
LTG_factor

Date,Top,Bottom,QSpread
197001,,,
197002,,,
197003,,,
197004,,,
197005,,,
...,...,...,...
201908,-0.038331,-0.060025,0.021694
201909,0.003882,0.05653,-0.052649
201910,0.020826,0.007372,0.013454
201911,0.041152,0.04257,-0.001418


In [14]:
# Check correlation with benchmark
x2=CapitalIQ.loc[113:]['LTGC'].astype(float).tolist()
y2=LTG_factor.loc['199605':]['QSpread'].tolist()
np.corrcoef(x2,y2)

array([[1.        , 0.92169806],
       [0.92169806, 1.        ]])

### Factor 3: Long-Term Momentum

In [15]:
MOM=trt1m_mat.rolling(11,min_periods=11).mean()
MOM=MOM.shift(2)

MOM_factor=factor_return(MOM) #dataframe containing top and bottom portfolio returns, along with the factor returns
MOM_qspread = MOM_factor['QSpread']['198701':]

MOM_factor

Date,Top,Bottom,QSpread
197001,,,
197002,,,
197003,,,
197004,,,
197005,,,
...,...,...,...
201908,0.009965,-0.093272,0.103236
201909,-0.002911,0.07168,-0.074591
201910,-0.002744,0.021641,-0.024385
201911,0.019194,0.05509,-0.035896


In [16]:
x3=CapitalIQ['MOM'].astype(float).tolist() 
y3=MOM_factor.loc['198701':]['QSpread'].tolist() 
np.corrcoef(x3,y3)

array([[1.        , 0.92899776],
       [0.92899776, 1.        ]])

### Factor 4: Book to Price

In [17]:
ceqq_mat=var_mat('ceqq')
cshoq_mat=var_mat('cshoq')

In [23]:
CEQQ=ceqq_mat
CSHOQ=cshoq_mat
CloseM=prccm_mat
BTP=(CEQQ)/(CSHOQ*CloseM)
BTP.replace([np.inf,-np.inf],np.nan,inplace=True)
BTP=BTP.shift().iloc[1:]

BTP_factor=factor_return(BTP) #dataframe containing top and bottom portfolio returns, along with the factor returns
BTP_qspread = BTP_factor['QSpread']['198701':]
BTP_factor

Date,Top,Bottom,QSpread
197002,0.08455,0.064723,0.019827
197003,0.01106,-0.012891,0.023952
197004,-0.10481,-0.104483,-0.000327
197005,-0.090958,-0.102622,0.011663
197006,-0.036239,-0.039051,0.002812
...,...,...,...
201908,-0.069962,-0.005742,-0.06422
201909,0.068104,0.003546,0.064558
201910,0.002174,0.023001,-0.020827
201911,0.036384,0.039183,-0.002799


In [24]:
x4=CapitalIQ['BP'].astype(float).tolist()
y4=BTP_factor.loc['198701':]['QSpread'].tolist()
np.corrcoef(x4,y4)

array([[1.        , 0.98862778],
       [0.98862778, 1.        ]])

### Factor 5: CAPM Beta

In [25]:
#organize the sp500 dataframe to facilitate calculation
market_RP = SP500['ret_sp500']-SP500['rf']
SP500['market_RP']=market_RP
SP500.set_index(SP500['Date'],inplace=True)

#Calculate the company risk premiums by subtracting the risk free rate from the trt1m matrix
risk_premium = trt1m_mat.copy() #work on a copy so the raw return matrix is not overwritten
for i in trt1m_mat.columns:
    risk_premium[i] = trt1m_mat[i] - np.asarray(SP500['rf'])
    
#Create a dataframe to store all the betas extracted 
betas = pd.DataFrame(columns = risk_premium.columns, index = Data['date'].unique())
betas.sort_index(axis=0,inplace=True)

#We now loop through all companies to obtain the betas matrix
for i in risk_premium.columns:
    Y=risk_premium[i]
    X=SP500['market_RP']
    X=sm.add_constant(X)
    reg=RollingOLS(Y,X,window=48)
    regfit= reg.fit()
    params = regfit.params
    betas[i] = params['market_RP']
    
Betas_factor=factor_return(betas) #dataframe containing top and bottom portfolio returns, along with the factor returns
Betas_qspread = Betas_factor['QSpread']['198701':]
Betas_factor

Date,Top,Bottom,QSpread
197001,,,
197002,,,
197003,,,
197004,,,
197005,,,
...,...,...,...
201908,-0.08548,0.041007,-0.126486
201909,0.067716,0.010225,0.057491
201910,0.028013,-0.010403,0.038416
201911,0.059579,-0.007441,0.06702


In [26]:
x5=CapitalIQ['Beta'].astype(float).tolist()
y5=Betas_factor.loc['198701':]['QSpread'].tolist()
np.corrcoef(x5,y5)

array([[1.       , 0.9252044],
       [0.9252044, 1.       ]])
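
For a regression with an intercept and a single regressor, the rolling slope equals the rolling covariance divided by the rolling variance of the market excess return. The sketch below illustrates that identity on simulated data as a cross-check on the `RollingOLS` loop; the series, the true beta of 1.5, and the seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Simulated excess returns: a hypothetical stock with a true beta of 1.5
rng = np.random.default_rng(0)
market = pd.Series(rng.normal(0.005, 0.04, 120))
stock = 1.5 * market + pd.Series(rng.normal(0.0, 0.01, 120))

window = 48
# Slope of a univariate OLS with intercept = cov(stock, market) / var(market)
beta = stock.rolling(window).cov(market) / market.rolling(window).var()
```

The first `window - 1` values are NaN because a full 48-month window is not yet available, which mirrors the leading NaN rows in the beta factor table.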

### Factor 6: Log Market Cap

In [27]:
cshom_mat = var_mat('cshom')
CSHOM=cshom_mat
LogMktCap=np.log(CSHOM*CloseM)

In [32]:
LogMktCap_factor=factor_return(LogMktCap) #dataframe containing top and bottom portfolio returns, along with the factor returns
LogMktCap_qspread = LogMktCap_factor.loc['199804':]['QSpread']
LogMktCap_factor

Date,Top,Bottom,QSpread
197001,,,
197002,,,
197003,,,
197004,,,
197005,,,
...,...,...,...
201908,0.002685,-0.097294,0.099979
201909,0.011449,0.073041,-0.061592
201910,0.026024,0.005467,0.020557
201911,0.040557,0.022623,0.017934


In [33]:
x6=CapitalIQ.loc[136:]['LogMktCap'].astype(float).tolist()
y6=LogMktCap_factor.loc['199804':]['QSpread'].tolist()
np.corrcoef(x6,y6)

array([[ 1.        , -0.92932585],
       [-0.92932585,  1.        ]])

### Factor 7: Volatility

In [34]:
rit = np.log(CloseM/CloseM.shift())
AnnVol = ((rit ** 2).rolling(11,min_periods=11).mean())**0.5 * (12**0.5)

AnnVol_factor = factor_return(AnnVol)
AnnVol_qspread = AnnVol_factor.loc['198701':]['QSpread']
AnnVol_factor

Date,Top,Bottom,QSpread
197001,,,
197002,,,
197003,,,
197004,,,
197005,,,
...,...,...,...
201908,-0.103695,0.029961,-0.133656
201909,0.070901,0.019371,0.051529
201910,0.024279,-0.001636,0.025914
201911,0.048941,0.007752,0.041189


In [35]:
x7=CapitalIQ['AnnVol12M'].astype(float).tolist() 
y7=AnnVol_factor.loc['198701':]['QSpread'].tolist()
np.corrcoef(x7,y7)

array([[1.        , 0.94608073],
       [0.94608073, 1.        ]])
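
Written out, the volatility calculation in cell 34 corresponds to the expression below. The symbols are notation introduced here for illustration: $P_{i,t}$ is the month-end close (`prccm`) of firm $i$ in month $t$ and $r_{i,t}$ is its monthly log return.

$$
r_{i,t} = \ln\frac{P_{i,t}}{P_{i,t-1}}, \qquad
\mathrm{AnnVol}_{i,t} = \sqrt{12}\,\sqrt{\frac{1}{11}\sum_{k=0}^{10} r_{i,t-k}^{2}}
$$

As coded, the window holds 11 monthly log returns (which span 12 month-end prices) and the returns are not demeaned, so this is a root-mean-square return scaled to an annual horizon rather than a sample standard deviation.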

### Correlation Table with Benchmark

In [42]:
#create correlation table
corr_table = pd.DataFrame(columns=['Name','Correlation'],index=['HL1M','LTG','MOM','BP','Beta','Size','Vol12M'])
corr_table['Name']=['1M Price High - 1M Price Low', 'Expected LTG', 'Long-term Momentum',
                'Book to Price','CAPM Beta','Log Market Cap','12M Realized Price Volatility']

In [43]:
# input correlation data (absolute values, so the negative sign of the Size correlation is not shown)
corr_pairs = {'HL1M':(x1,y1), 'LTG':(x2,y2), 'MOM':(x3,y3), 'BP':(x4,y4),
              'Beta':(x5,y5), 'Size':(x6,y6), 'Vol12M':(x7,y7)}
for name,(x,y) in corr_pairs.items():
    corr_table.loc[name,'Correlation']=abs(np.corrcoef(x,y)[0][1])

In [44]:
corr_table

Factor,Name,Correlation
HL1M,1M Price High - 1M Price Low,0.990981
LTG,Expected LTG,0.921698
MOM,Long-term Momentum,0.928998
BP,Book to Price,0.988628
Beta,CAPM Beta,0.925204
Size,Log Market Cap,0.929326
Vol12M,12M Realized Price Volatility,0.946081


### Factor Summary Statistics 

I computed several key statistics for each factor, including mean, standard deviation, Sharpe ratio, skewness, kurtosis, maximum and minimum returns, portfolio turnover, and autocorrelations. These metrics let me assess the risk and return profile of each factor and compare it against the S&P 500.

In [47]:
#Create a matrix consisting only of the factor returns (qspread) and the SP500 return
qspread_mat = pd.DataFrame(columns=['HL1M', 'LTG','MOM','BP','Beta','Size', 'Vol12M','SP500'],index=sp_members.index)
qspread_mat['HL1M']=HL1M_factor['QSpread']
qspread_mat['LTG']=LTG_factor['QSpread']
qspread_mat['MOM']=MOM_factor['QSpread']
qspread_mat['BP']=BTP_factor['QSpread']
qspread_mat['Beta']=Betas_factor['QSpread']
qspread_mat['Size']=LogMktCap_factor['QSpread']
qspread_mat['Vol12M']=AnnVol_factor['QSpread']
qspread_mat['SP500']=SP500['ret_sp500']
qspread_mat=qspread_mat.astype(float)

In [48]:
#Create a matrix consisting of the portfolio turnover details (the percentage portfolio turnovers for each period) for each factor
turnover_rates = pd.DataFrame(columns=['HL1M', 'LTG','MOM','BP','Beta','Size', 'Vol12M','SP500'],index=sp_members.index)
nullrow=pd.DataFrame(columns=HL1M.columns,index=['197001'])
turnover_rates['HL1M']=turnover(pd.concat([nullrow,HL1M]))['Ratio']
turnover_rates['LTG']=turnover(LTG_mat)['Ratio']
turnover_rates['MOM']=turnover(MOM)['Ratio']
turnover_rates['BP']=turnover(pd.concat([nullrow,BTP]))['Ratio']
turnover_rates['Beta']=turnover(betas)['Ratio']
turnover_rates['Size']=turnover(LogMktCap)['Ratio']
turnover_rates['Vol12M']=turnover(AnnVol)['Ratio']

In [50]:
performance_table=pd.DataFrame(columns=['HL1M', 'LTG','MOM','BP','Beta','Size', 'Vol12M','SP500'],
                     index=['Mean','STD','Sharpe Ratio','Skewness','Kurtosis','Max','Min','Max Drawdown',
                           'Average turnover','ACF1','ACF12','ACF24'])
for i in performance_table.columns:
    performance_table.loc['Mean',i]=qspread_mat[i].mean() *100
    performance_table.loc['STD',i]=qspread_mat[i].std() *100
    #STD is already in percent, so the trailing *100 cancels it: this is mean(return - rf) / std, on a monthly basis
    performance_table.loc['Sharpe Ratio',i]=((qspread_mat[i] - SP500['rf'])/performance_table.loc['STD',i]).mean() *100
    performance_table.loc['Skewness',i]=qspread_mat[i].skew()
    performance_table.loc['Kurtosis',i]=qspread_mat[i].kurtosis()
    performance_table.loc['Max',i]=qspread_mat[i].max() *100
    performance_table.loc['Min',i]=qspread_mat[i].min() *100
    #NOTE: this row is the (max - min) / max range ratio, not the conventional peak-to-trough maximum drawdown
    performance_table.loc['Max Drawdown',i]=(qspread_mat[i].max()-qspread_mat[i].min())/qspread_mat[i].max()
    performance_table.loc['Average turnover',i]=turnover_rates[i].mean() *100
    performance_table.loc['ACF1',i]=qspread_mat[i].autocorr(lag=1)
    performance_table.loc['ACF12',i]=qspread_mat[i].autocorr(lag=12)
    performance_table.loc['ACF24',i]=qspread_mat[i].autocorr(lag=24)
performance_table

Statistic,HL1M,LTG,MOM,BP,Beta,Size,Vol12M,SP500
Mean,0.980838,0.110822,0.370697,0.473495,0.235369,1.408454,0.502637,0.94386
STD,3.423207,4.443536,5.136436,4.179618,8.087944,4.038411,6.49719,4.343352
Sharpe Ratio,0.175929,-0.038659,-0.00102,0.022705,-0.017158,0.310388,0.019489,0.130059
Skewness,1.368019,0.277741,-1.577606,1.083242,0.346195,0.073279,0.417203,-0.450438
Kurtosis,6.91652,4.308199,9.747037,7.063891,4.531087,3.642492,3.850642,1.920181
Max,23.269273,23.353494,15.488669,28.901002,49.146652,15.880141,40.232885,16.8113
Min,-10.799872,-19.504682,-39.352615,-15.94035,-35.776658,-18.447081,-23.615961,-21.5795
Max Drawdown,1.464126,1.835193,3.540736,1.55155,1.727957,2.161645,1.586982,2.283631
Average turnover,60.696651,8.559626,24.358467,9.252893,7.719329,4.503139,12.552161,
ACF1,-0.081412,0.017813,0.065841,0.095744,0.066073,0.189731,0.098881,0.022908
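
The "Max Drawdown" row above is computed as (max - min) / max of the monthly returns. The conventional maximum drawdown is instead the largest peak-to-trough decline of the cumulative return path. A minimal sketch of that conventional definition is given below on made-up returns; the function name `max_drawdown` and the sample values are assumptions, and this is not the calculation behind the table.

```python
import pandas as pd

def max_drawdown(returns: pd.Series) -> float:
    """Largest peak-to-trough decline of the cumulative wealth path."""
    wealth = (1 + returns.dropna()).cumprod()
    # Running peak, floored at the starting wealth of 1
    running_peak = wealth.cummax().clip(lower=1.0)
    drawdown = wealth / running_peak - 1
    return drawdown.min()

# Hypothetical monthly returns
monthly = pd.Series([0.10, -0.20, 0.05, -0.10, 0.30])
mdd = max_drawdown(monthly)
```

In this example wealth peaks at 1.10 after the first month and bottoms at 0.8316 after the fourth, so the drawdown is 0.8316 / 1.10 - 1 = -24.4%.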


In [51]:
#Create the correlation table
factor_corr_table=qspread_mat.corr()
factor_corr_table

Factor,HL1M,LTG,MOM,BP,Beta,Size,Vol12M,SP500
HL1M,1.0,0.229815,-0.498962,0.438978,0.432314,-0.386749,0.487892,0.308365
LTG,0.229815,1.0,0.021208,-0.345663,0.583968,0.06774,0.53993,0.423661
MOM,-0.498962,0.021208,1.0,-0.652437,-0.357265,0.632582,-0.394255,-0.174367
BP,0.438978,-0.345663,-0.652437,1.0,0.232942,-0.758533,0.360145,0.12307
Beta,0.432314,0.583968,-0.357265,0.232942,1.0,-0.513606,0.921778,0.81905
Size,-0.386749,0.06774,0.632582,-0.758533,-0.513606,1.0,-0.620004,-0.446456
Vol12M,0.487892,0.53993,-0.394255,0.360145,0.921778,-0.620004,1.0,0.764253
SP500,0.308365,0.423661,-0.174367,0.12307,0.81905,-0.446456,0.764253,1.0


### Factor Significance

I used one-sample t-tests to determine whether the mean quantile spread of each factor was statistically different from zero.

In [52]:
t_test_table = pd.DataFrame(columns=['Name','Average Spread','t statistics','p-value'],index=['HL1M','LTG','MOM','BP','Beta','Size','Vol12M'])
t_test_table['Name']=['1M Price High - 1M Price Low', 'Expected LTG', 'Long-term Momentum',
                'Book to Price','CAPM Beta','Log Market Cap','12M Realized Price Volatility']

In [53]:
# Input the average qspread we calculated previously

t_test_table.loc['HL1M','Average Spread']=HL1M_qspread.mean()
t_test_table.loc['LTG','Average Spread']=LTG_qspread.mean()
t_test_table.loc['MOM','Average Spread']=MOM_qspread.mean()
t_test_table.loc['BP','Average Spread']=BTP_qspread.mean()
t_test_table.loc['Beta','Average Spread']=Betas_qspread.mean()
t_test_table.loc['Size','Average Spread']=LogMktCap_qspread.mean()
t_test_table.loc['Vol12M','Average Spread']=AnnVol_qspread.mean()

In [55]:
#Perform a one-sample t-test on each factor spread series, and collect the values
t_stat = []
p_value=[]
factor_arrays = [HL1M_qspread, LTG_qspread, MOM_qspread, BTP_qspread,
                 Betas_qspread, LogMktCap_qspread, AnnVol_qspread]
factor_arrays = [pd.to_numeric(i, errors='coerce') for i in factor_arrays]
for i in factor_arrays:
    result = ttest_1samp(i, 0)   #run the test once and reuse both outputs
    t_stat.append(result[0])
    p_value.append(result[1])
    
print(t_stat)
print(p_value)

[4.739572512854932, 0.7047879340001459, 0.8058340673007299, 1.3759568803857878, 0.6868685795354149, 5.634459382084198, 1.1440477375074778]
[2.9938148758059958e-06, 0.4815218323881113, 0.4208233790721768, 0.16961461000632305, 0.4925685889990338, 4.5573087168870065e-08, 0.25329665262530343]


In [56]:
t_test_table['t statistics'] = t_stat
t_test_table['p-value']=p_value

t_test_table

Factor,Name,Average Spread,t statistics,p-value
HL1M,1M Price High - 1M Price Low,0.008478,4.739573,2.993815e-06
LTG,Expected LTG,0.002088,0.704788,0.4815218
MOM,Long-term Momentum,0.002215,0.805834,0.4208234
BP,Book to Price,0.002718,1.375957,0.1696146
Beta,CAPM Beta,0.002968,0.686869,0.4925686
Size,Log Market Cap,0.014085,5.634459,4.557309e-08
Vol12M,12M Realized Price Volatility,0.00391,1.144048,0.2532967


**Comment:**

From the table above, the Expected LTG, Long-term Momentum, Book to Price, CAPM Beta, and 12M Realized Price Volatility factors all have mean spreads that are not statistically different from zero: each has a p-value above 0.05, so we cannot reject the null hypothesis of a zero mean at the 5% level.

In contrast, the 1M Price High - 1M Price Low and Log Market Cap factors both have mean spreads that are statistically different from zero (p-values of about 3.0e-06 and 4.6e-08).
- The results are consistent with our observations on the factor returns
    - In the summary statistics table, these two factors have the highest mean monthly returns (0.98% and 1.41%), both above the S&P 500's 0.94%
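
The t-statistic reported by `ttest_1samp` is the sample mean divided by its standard error. The sketch below reproduces it by hand on simulated data so the link between the average spread and the t-statistic is explicit; the series, seed, and parameters are hypothetical and unrelated to the factor data.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Simulated monthly spread series (hypothetical mean and volatility)
rng = np.random.default_rng(1)
spread = rng.normal(0.005, 0.04, 240)

n = spread.size
standard_error = spread.std(ddof=1) / np.sqrt(n)  # sample std with ddof=1
t_manual = spread.mean() / standard_error

# Same statistic from SciPy, testing the null hypothesis of a zero mean
t_scipy = ttest_1samp(spread, 0).statistic
```

Because the standard error shrinks with the square root of the sample size, a factor with a small average spread can still be significant over a long sample, while a volatile factor with a larger average spread may not be.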