# STGE Specification Search with GNS DGP - AK test

Implements the classic STGE strategy for a GNS DGP with varying values of $\gamma$, $\rho$ and $\lambda$ when the model is estimated without WX, i.e., as a classic regression. The DGP covers spatial Durbin models ($\lambda = 0$), SLX error models ($\rho = 0$), standard SLX regression ($\rho = \lambda = 0$), and, for $\gamma = 0$, the standard spatial error model ($\rho = 0$), the spatial lag model ($\lambda = 0$) and the standard regression model ($\rho = \lambda = 0$).

The true DGP is GNS, generated with dgp_gns. The model is estimated by OLS, so WX is not taken into account and the standard OLS LM diagnostics are used.

The search strategy is the classic one from Anselin (2005), augmented with a decision for the case where both robust LM statistics are significant: the one that is most significant is selected. When the lag is selected, an additional regression is carried out for the lag model with an AK test. If the latter is significant, then SARMA is selected.

This rule is not applied when neither robust LM statistic is significant but both initial LM statistics are. In that case the more significant of the two robust statistics is selected, without the AK step.
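The decision rule can be condensed into a function of the p-values alone. The following is a minimal sketch that follows the branch order of fw_spec_ak further down; the function name `stge_decision` and the `p_ak` argument (the p-value of the AK test, or None when the lag model could not be estimated) are hypothetical and only serve this illustration.

```python
def stge_decision(p_error, p_lag, p_rerror, p_rlag, p_ak=None, p=0.01):
    """Return the model label chosen from LM p-values (illustrative helper).

    p_ak is the p-value of the AK test on the lag model, or None when
    that step is unavailable. Labels match the `models` list used later.
    """
    # neither LM test significant: keep OLS
    if p_lag >= p and p_error >= p:
        return "OLS"
    # exactly one LM test significant
    if p_lag < p and p_error >= p:
        return "LAG"
    if p_lag >= p and p_error < p:
        return "ERROR"
    # both LM tests significant: look at the robust versions
    if p_rlag < p and p_rerror >= p:
        return "LAGr"
    if p_rlag >= p and p_rerror < p:
        return "ERRORr"
    if p_rlag < p and p_rerror < p:
        # both robust tests significant: AK step, then most significant
        if p_ak is not None and p_ak <= p:
            return "SARMA"
        return "LAGBr" if p_rlag <= p_rerror else "ERRORBr"
    # neither robust test significant: most significant robust test
    return "LAGNr" if p_rlag <= p_rerror else "ERROR_Nr"


print(stge_decision(0.001, 0.001, 0.001, 0.0005, p_ak=0.5))
```

Separating the rule from estimation in this way makes it possible to check each branch with hand-picked p-values before running the full simulation.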

This template allows for four data sets contained in the folder data_master, two p-values, SAR and MA spatial error processes and normal and lognormal error terms.

# Modules

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import geopandas as gpd
import numpy as np
import time
import spreg
import libpysal
from openpyxl import Workbook
from openpyxl.styles import Font
from openpyxl.formatting.rule import CellIsRule


In [None]:
print("pandas ",pd.__version__)
print("geopandas ",gpd.__version__)
print("numpy ",np.__version__)
print("spreg ",spreg.__version__)
print("libpysal ",libpysal.__version__)

# Specify Data and Weights

### 20x20 square grid - queen contiguity - n=400

In [None]:
#infileshp = "./data_master/twentwengrid.shp"
#infilew = "./data_master/grid400_q.gal"
#layout = "20x20"

### 40x40 square grid - queen contiguity - n=1600

In [None]:
#infileshp = "./data_master/fourty40grid.shp"
#infilew = "./data_master/fourty40_q.gal"
#layout = "40x40"

### US Counties - queen contiguity - n=3085

In [None]:
infileshp = "./data_master/uscounty_nodata.shp"
infilew = "./data_master/uscounty_q.gal"
layout = "US_counties"

### Brazilian municipios - queen contiguity - n=5568

In [None]:
#infileshp = "./data_master/Brazil_nodata.shp"
#infilew = "./data_master/Braz_muni_q.gal"
#layout = "BRA_muni"

## Read in Data

In [None]:
dfs = gpd.read_file(infileshp)

print(dfs.shape)
print(list(dfs))

w = libpysal.io.open(infilew).read()
w.transform = 'r'
n = w.n
print(n)

## Forward Specification Logic with AK

In [None]:
def fw_spec_ak(y,x,w=w, p_value=0.01):
    """
    Forward specification: evaluate results from the LM tests and their robust versions from spreg.OLS.
    Estimate a lag model with AK test if warranted.
    In contrast to fw_spec, estimation is included in this function.

    Arguments:
    ----------
    y        : dependent variable
    x        : matrix of explanatory variables
    w        : spatial weights
    p_value  : significance threshold

    The p-values used are taken from the OLS diagnostics, in the order
    p_LM_error,p_LM_lag,p_RLM_error,p_RLM_lag,p_LM_SARMA
        
    Returns:
    ----------
    result: 1 x 10 indicator vector with a 1 for the selected model, in the order
            0 = OLS
            1 = LAG
            2 = ERROR
            3 = LAGr
            4 = ERRORr
            5 = LAG_Br
            6 = ERROR_Br
            7 = LAG_Nr
            8 = ERROR_Nr
            9 = SARMA
    """


    p=p_value
    
    result = np.zeros((1,10))   # OLS,LAG,ERROR,LAGr,ERRORr,LAGBr,ERRORBr,LAGNr,ERROR_Nr,SARMA

    model_ols_1 = spreg.OLS(y,x,w=w,slx_lags=0,spat_diag=True)
    #print(model_ols_1.summary)
    pvals = [model_ols_1.lm_error[1],model_ols_1.lm_lag[1],
                             model_ols_1.rlm_error[1],model_ols_1.rlm_lag[1],
                             model_ols_1.lm_sarma[1]]
    
    
    p_error,p_lag,p_rerror,p_rlag,p_sarma = pvals
    if p_lag>=p and p_error>=p: #First test, no LM significant= Stop and keep OLS
        result[0,0]=1  #'OLS'
    else : #if not SLX, go for traditional STGE approach WITH OLS (NOT SLX) RESULTS
            #Just one significant
        if p_lag<p and p_error>=p:
            result[0,1] = 1 #'LAG'
        elif p_lag>=p and p_error<p:
            result[0,2] = 1 # 'ERROR'
        #Both are significant (Check robust version)
        elif p_lag<p and p_error<p:
                #One robust significant
            if p_rlag<p and p_rerror>=p:
                result[0,3]= 1 #'LAGr'
            elif p_rlag>=p and p_rerror<p:
                result[0,4]= 1 #'ERRORr'
                #Both robust are significant (look for the most significant)
            elif p_rlag <p and p_rerror<p:
                # check AK in lag model
                try:
                    model_lag = spreg.GM_Lag(y,x,w=w,slx_lags=0,w_lags=2,hard_bound=True)
                    ak_lag = spreg.AKtest(model_lag,w,case='gen')
                    if ak_lag.p <= p:
                        result[0,9] = 1. # SARMA
                    elif p_rlag <= p_rerror:
                        result[0,5] = 1 # LAG_BR
                    elif p_rlag > p_rerror:
                        result[0,6]= 1 # 'ERROR_Br'
                except Exception:  # lag/AK step failed: fall back to the most significant robust test
                    if p_rlag <= p_rerror:
                        result[0,5] = 1 # LAG_BR
                    else:
                        result[0,6]= 1 # 'ERROR_Br'                    

            else: #Neither robust test is significant (still look for the 'most significant')
                if p_rlag <= p_rerror:
                    result[0,7] = 1 # 'LAG_Nr'
                elif p_rlag > p_rerror:
                    result[0,8] = 1 # 'ERROR_Nr'
    #print("result ",result)
    return result
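
The indicator row returned by fw_spec_ak can be mapped back to a model label by position. The sketch below assumes the same ordering as the `models` list defined in the Model Parameters section; `selected_label` is a hypothetical helper, not part of the notebook or of spreg.

```python
import numpy as np

# same ordering as the result vector documented in fw_spec_ak
models = ['OLS', 'LAG', 'ERROR', 'LAGr', 'ERRORr',
          'LAGBr', 'ERRORBr', 'LAGNr', 'ERROR_Nr', 'SARMA']


def selected_label(result):
    """Map a 1 x 10 indicator row to its model label (illustrative helper)."""
    row = np.asarray(result).ravel()
    if row.sum() != 1:
        raise ValueError("expected exactly one selected model")
    return models[int(np.argmax(row))]


# hand-built example row with the last indicator switched on
example = np.zeros((1, 10))
example[0, 9] = 1
print(selected_label(example))  # SARMA
```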

## Model Parameters

In [None]:
# overall random seed
rndseed = 123456789
# number of replications
reps=1000
# error process
errp = 'sar'
#errp = 'ma'
# beta and gamma
b1 = [1,1]
#b1 = [1, 1, 1, 1]
# rho range and lambda range
rho_values = [0, 0.1, 0.3, 0.5, 0.7, 0.9]
lam_values = [0, 0.1, 0.3, 0.5, 0.7, 0.9]
# gamma range
gam_values = [0.0, -0.5, 0.5]
# result parameter labels
#diagtests = ['perr', 'plag', 'prerr', 'prlag', 'psarma' ]
models = ['OLS','LAG','ERROR','LAGr','ERRORr','LAGBr','ERRORBr','LAGNr','ERROR_Nr','SARMA']
#k = len(diagtests)
# Nested dictionary to store results with the structure Modselect[gam][rho][lam][model]
# assumes only a single set of simulations is run, otherwise needs to be initialized for each run
# Results1 is dictionary with p-values for tests
# Modselect is dictionary with selected model
#Results1 = {gam: {rho: {lam: {diag: [] for diag in diagtests} for lam in lam_values} 
#                   for rho in rho_values} for gam in gam_values}
Modselect = {gam: {rho: {lam: {model: [] for model in models} for lam in lam_values} 
                   for rho in rho_values} for gam in gam_values}
# inverse method - alternative is 'true_inv'
invmethod = 'power_exp'
# p-value
pvalue = 0.01
#pvalue = 0.05
# error distribution
errdist = 'normal'
#errdist = 'lognormal'
# number of explanatory variables
kx = len(b1) - 1

## RHS

X has variance 12 which, matched with a variance of 6 for the error process, gives an approximate R2 of 0.66 (12/(12+6) in the non-spatial case).

In [None]:
nk = n*kx
var1 = 12.0/kx
rng=np.random.default_rng(seed=rndseed) # set for X
xx = spreg.dgp.make_x(rng,nk,mu=[0],varu=[var1],method="uniform")
if kx > 1:
    x1 = np.reshape(xx,(n,kx))
else:
    x1 = xx
xb1 = spreg.dgp.make_xb(x1,b1)
wx1 = spreg.dgp.make_wx(x1,w) # default first order
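
The target R2 quoted for this section can be checked with a quick non-spatial simulation. This is a rough sketch under stated assumptions: it uses plain NumPy draws rather than spreg.dgp, takes a uniform variable on [-6, 6] (width 12, hence variance 12), sets the slope to 1 as in b1, and ignores $\gamma$, $\rho$ and $\lambda$. The seed and sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)   # seed chosen only for this illustration
n_draws = 200_000

# uniform on [-6, 6]: variance = 12**2 / 12 = 12
x = rng.uniform(-6.0, 6.0, size=n_draws)
# normal error with variance 6
u = rng.normal(0.0, np.sqrt(6.0), size=n_draws)
# intercept and slope both 1, no spatial terms
y = 1.0 + 1.0 * x + u

r2_theory = 12.0 / (12.0 + 6.0)            # var(xb) / (var(xb) + var(u))
r2_sample = np.corrcoef(x, y)[0, 1] ** 2   # R2 of a simple regression

print(round(r2_theory, 3), round(r2_sample, 3))
```

With more than one explanatory variable, dividing the variance by kx as in var1 keeps the total variance of the linear predictor at 12 when all slopes are 1, so the same ratio applies.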

## Print Settings

In [None]:
print("SETTINGS - STGE Classic Search with GNS DGP - OLS Estimation with AK")
print("Layout: ",infileshp)
print("Weights: ",infilew)
print("n: ",n)
print("k: ",kx)
print("Error Process: ",errp)
print("Error Distribution: ",errdist)
print("Replications: ",reps)
print("p-value: ",pvalue)
print("Inverse Method: ",invmethod)
print("--------------------------------------")

## Simulation Loop

In [None]:
t0 = time.time()

if errdist == 'normal':
    vv = 6.0     # var 6 for target R2 of 0.66
elif errdist == 'lognormal':
    vv = 1.1     # var 1.1 for target R2 of 0.66
else:
    # stop here: vv would otherwise be undefined in the loop below
    raise ValueError("Error distribution not recognized")


for gam in gam_values:
    gg=gam
    # create a list with multiple gamma values (all same) when more than one x
    if kx > 1:
        g1 = np.ones(kx)*gg
        g1 = g1.tolist()
    else:
        g1 = gg
        
    wxg1 = spreg.dgp.make_wxg(wx1,g1) 
    for rho in rho_values:
        rho1=rho
        for lam in lam_values:
            lam1=lam
            if not(rho1 + lam1 < 1):  # parameter constraint for SAR errors
                break
            else:
                print(g1,rho1,lam1)
                rng=np.random.default_rng(seed=rndseed) # reset for simulations
                modsel = np.zeros((reps,10))
                for i in range (reps):
                    #print("i ",i)
                    # default is normal error
                    u= spreg.dgp.make_error(rng,n,mu=0,varu=vv,method=errdist)  # error distribution as parameter

                    # DGP is GNS
                    y1 = spreg.dgp_gns(u,xb1,wxg1,w,rho1,lam1, model= errp)
                    # call fw_spec_ak
                    # result is a vector of 10 0-1 indicators with 1 for the selected model
                    result = fw_spec_ak(y1,x1,w=w,p_value=pvalue)
                    modsel[i,:] = result
                    
                    #print("modsel ",modsel)
                    
                modcount = modsel.sum(axis=0)
                modfreq = modcount / reps
                
                #print("modfreq ",modfreq)
                    
                for j in range(len(models)):
                    Modselect[gam][rho][lam][models[j]] = modfreq[j]  
                
#print("modselect at end ",Modselect)
                
t1 = time.time()
print("time in minutes: ",(t1-t0)/60.0)
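
The size of the experiment follows from the parameter grid and the constraint rho + lam < 1. The sketch below enumerates the admissible (rho, lam) pairs with the values from the Model Parameters section. The rounding inside the comparison is an added guard against floating-point noise in sums such as 0.3 + 0.7 and is not part of the loop above.

```python
from itertools import product

rho_values = [0, 0.1, 0.3, 0.5, 0.7, 0.9]
lam_values = [0, 0.1, 0.3, 0.5, 0.7, 0.9]
gam_values = [0.0, -0.5, 0.5]
reps = 1000

# keep only pairs that satisfy the constraint rho + lam < 1
valid = [(rho, lam) for rho, lam in product(rho_values, lam_values)
         if round(rho + lam, 10) < 1]

# each admissible pair is simulated reps times for every gamma
n_ols_fits = len(gam_values) * len(valid) * reps

print(len(valid), n_ols_fits)  # 21 63000
```

The count of OLS fits is a lower bound on the work done, since a lag model with AK test is estimated in addition whenever both robust LM statistics are significant.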

## Dictionary with Selection Frequencies

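At this point, Modselect is a nested dictionary with gam, rho, lam and the model label as keys, holding the selection frequency of each model. The cells below turn it into one array per model and gamma, with rho in the rows (reversed, so the largest rho is in the first row) and lam in the columns. Combinations that violate rho + lam < 1 are set to NaN.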

In [None]:
lenr = len(rho_values)
lenl = len(lam_values)
for gam in gam_values:
    print("GAMMA: ",gam)
    print("------------")
    modlag = np.zeros((lenr,lenl))
    moderr = np.zeros((lenr,lenl))
    for pt in range(len(models)):
        mod = models[pt]
        modsel = np.zeros((lenr,lenl))
        for r in range(lenr):
            rr = lenr -1 -r
            for c in range(lenl):
                rho = rho_values[r]
                lam = lam_values[c]
                if not(rho+lam < 1):
                    modsel[rr,c] = np.nan
                else:
                    modsel[rr,c] = np.array(Modselect[gam][rho][lam][mod])
        # accumulate all lag-type selections
        if mod in ('LAG', 'LAGr', 'LAGBr', 'LAGNr'):
            modlag = modlag + modsel
            
        # accumulate all error-type selections
        if mod in ('ERROR', 'ERRORr', 'ERRORBr', 'ERROR_Nr'):
            moderr = moderr + modsel

        print("Selection Frequency for",mod)
        print(modsel)
    print("All Lag Selections")
    print(modlag)
    print("All Error Selections")
    print(moderr)
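
The array layout used in this section can be made visible by storing the rho value itself in each admissible cell. This is only an illustration of the indexing: the cell contents are stand-ins, not selection frequencies, and the rounding in the constraint is an added guard against floating-point noise.

```python
import numpy as np

rho_values = [0, 0.1, 0.3, 0.5, 0.7, 0.9]
lam_values = [0, 0.1, 0.3, 0.5, 0.7, 0.9]
lenr, lenl = len(rho_values), len(lam_values)

# NaN marks combinations excluded by rho + lam < 1
grid = np.full((lenr, lenl), np.nan)
for r, rho in enumerate(rho_values):
    rr = lenr - 1 - r            # reversed row index: largest rho on top
    for c, lam in enumerate(lam_values):
        if round(rho + lam, 10) < 1:
            grid[rr, c] = rho    # stand-in value to show where each rho lands

print(grid)
```

The first row corresponds to rho = 0.9 and has a single admissible cell (lam = 0), while the last row corresponds to rho = 0 and is fully populated.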

In [None]:
# Same logic as the previous cell, but the arrays are saved in a dictionary instead of printed

lenr = len(rho_values)
lenl = len(lam_values)
data_models={}
for gam in gam_values:
    data_models[gam] = {}
    #print("GAMMA: ",gam)
    #print("------------")
    modlag = np.zeros((lenr,lenl))
    moderr = np.zeros((lenr,lenl))
    for pt in range(len(models)):
        mod = models[pt]
        modsel = np.zeros((lenr,lenl))
        for r in range(lenr):
            rr = lenr -1 -r
            for c in range(lenl):
                rho = rho_values[r]
                lam = lam_values[c]
                if not(rho+lam < 1):
                    modsel[rr,c] = np.nan
                else:
                    modsel[rr,c] = np.array(Modselect[gam][rho][lam][mod])
        # accumulate all lag-type selections
        if mod in ('LAG', 'LAGr', 'LAGBr', 'LAGNr'):
            modlag = modlag + modsel
            
        # accumulate all error-type selections
        if mod in ('ERROR', 'ERRORr', 'ERRORBr', 'ERROR_Nr'):
            moderr = moderr + modsel

        data_models[gam][mod] = modsel
        #print("Selection Frequency for",mod)
    #print(modlag)
    data_models[gam]['LAG_'] = modlag
    #print("All Error Selections")
    data_models[gam]['ERROR_'] = moderr
    #print(moderr)

In [None]:
final_models = ['OLS','LAG_','ERROR_', 'SARMA']
results_models={}
for mod in final_models:
    data = []
    for gam in data_models:
        for i, rho in enumerate(rho_values):
            for j, lam in enumerate(lam_values):
                if rho + lam < 1:  
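                    # rows are stored in reverse rho order, so row len(rho_values)-1-i holds rho_values[i]
                    data.append({'gamma': gam,  'rho': rho, 'lambda': lam, 'value': data_models[gam][mod][len(rho_values)-1-i,j]})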
    # Create DataFrame
    df = pd.DataFrame(data)
    # Pivot the DataFrame to get the desired shape
    pivot_df = df.pivot_table(index=['lambda', 'rho'], columns='gamma', values='value', aggfunc='first')
    pivot_df = pivot_df.reset_index()[['rho', 'lambda', 0, -0.5, 0.5]]
    # Save the DataFrame in the dictionary, following a specific order
    results_models[mod] = pivot_df.iloc[[0,1,2,3,4,5,6,11,15,18,20,7,8,9,10,12,13,14,16,17,19]]

#Adjusting the names 
key_map = {'LAG_': 'LAG', 'ERROR_': 'ERROR','SARMA': 'SAR-Error'}
for old_key, new_key in key_map.items():
    results_models[new_key] = results_models.pop(old_key)


### Export to Excel File

In [None]:
#Save in Excel format

with pd.ExcelWriter(f'001_STGE_Classic_{layout}_{errp}_{errdist}_p_{pvalue}.xlsx', engine='openpyxl') as writer:

    ws = writer.book.create_sheet(title='Sheet1')

    startrow = 0  # Initial row to start writing the dataframe
    for name, df in results_models.items():
        
        # Write the dataframe name
        ws.cell(row=startrow + 1, column=1).value = f"{name}"
        
        # Write the dataframe content
        df.round(3).to_excel(writer, sheet_name='Sheet1', startrow=startrow + 1, index=False)       
        # Update the startrow for the next dataframe
        endrow = startrow + 2 + len(df)
        startrow += len(df) + 3  
    
    # Apply conditional formatting to columns 3, 4, and 5 for specific value ranges
    # Define fonts for each color
    red_font = Font(color='FF0000')  
    wine_font = Font(color='722F37')   
    blue_font = Font(color='0000FF')  
    green_font = Font(color='00FF00') 
    columns = ['C', 'D', 'E']  # Corresponding to Excel columns 3, 4, and 5
    for col in columns:
        
        # Rule for Red: value > 0.95
        ws.conditional_formatting.add(f'{col}3:{col}{endrow}',
                                      CellIsRule(operator='greaterThan', formula=['0.95'], stopIfTrue=True, font=red_font))
        # Rule for Wine: 0.9 <= value <= 0.95 ('between' includes both bounds)
        ws.conditional_formatting.add(f'{col}3:{col}{endrow}',
                                      CellIsRule(operator='between', formula=['0.9', '0.95'], stopIfTrue=True, font=wine_font))
        # Rule for Blue: 0.75 <= value < 0.9 (a value of exactly 0.9 is caught by the wine rule first)
        ws.conditional_formatting.add(f'{col}3:{col}{endrow}',
                                      CellIsRule(operator='between', formula=['0.75', '0.9'], stopIfTrue=True, font=blue_font))
        # Rule for Green: 0.5001 <= value < 0.75 (a value of exactly 0.75 is caught by the blue rule first)
        ws.conditional_formatting.add(f'{col}3:{col}{endrow}',
                                      CellIsRule(operator='between', formula=['0.5001', '0.75'], stopIfTrue=True, font=green_font))
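
The color bands that result from the four rules can be summarized in a small function. This sketch assumes that the 'between' operator includes both bounds and that the rules are evaluated in the order in which they were added, with stopIfTrue ending the evaluation at the first match. The function name `font_band` is hypothetical and is not used by the export itself.

```python
def font_band(value):
    """Return the font color a frequency would receive (illustrative helper)."""
    if value > 0.95:
        return "red"
    if 0.9 <= value <= 0.95:
        return "wine"
    if 0.75 <= value <= 0.9:      # 0.9 itself never reaches this rule
        return "blue"
    if 0.5001 <= value <= 0.75:   # 0.75 itself never reaches this rule
        return "green"
    return None                   # default font for 0.5 and below


for v in (0.96, 0.95, 0.9, 0.8, 0.75, 0.6, 0.5):
    print(v, font_band(v))
```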
