Phase 3 is Predictive Modeling
 
1. Define Nifty_Dir_Open = 1/0 based on direction (the dependent variable; already derived in Phase 2).
2. Partition the data into train and test sets (80/20).
3. Run a binary logistic regression with Nifty direction as the dependent variable and previous-day global market returns (and VIX) as independent variables. You may add more features, such as the previous-day High/Low ratio for Nifty 50 and/or DJI.
4. Check for multicollinearity and resolve it if present.
5. Check which variables are significant (revise the model if needed).
6. Obtain the ROC curve and AUC for the train data.
7. Obtain a threshold that balances sensitivity and specificity.
   Go to step 8 only if you are satisfied with the model on the train data.
8. Obtain the ROC curve and AUC for the test data (compare with step 6).
9. Use the above threshold to obtain sensitivity and specificity for the test data (compare with step 7).
10. Finalize the model.
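
Steps 6 to 9 can be sketched as below. This is only an illustration on synthetic data: the arrays `p_train` and `p_test` stand in for the predicted probabilities a fitted model would return, and the helper names `balanced_threshold` and `sensitivity_specificity` are hypothetical, not part of any library. The threshold is chosen on the train data as the ROC point where sensitivity and specificity are closest, then reused unchanged on the test data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def balanced_threshold(y_true, scores):
    """Pick the ROC cutoff where sensitivity and specificity are closest."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    specificity = 1 - fpr
    best = np.argmin(np.abs(tpr - specificity))
    return thresholds[best]


def sensitivity_specificity(y_true, scores, threshold):
    """Classify as 1 when score >= threshold, then compute both rates."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(scores) >= threshold).astype(int)
    sensitivity = np.mean(pred[y_true == 1] == 1)
    specificity = np.mean(pred[y_true == 0] == 0)
    return sensitivity, specificity


# Synthetic stand-ins for the labels and the model's predicted probabilities
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
prob = 1 / (1 + np.exp(-signal))
y = rng.binomial(1, prob)
y_train, y_test = y[:800], y[800:]
p_train, p_test = prob[:800], prob[800:]

# Steps 6-7: AUC and threshold on the train data
auc_train = roc_auc_score(y_train, p_train)
threshold = balanced_threshold(y_train, p_train)
sens_train, spec_train = sensitivity_specificity(y_train, p_train, threshold)

# Steps 8-9: same threshold applied to the test data
auc_test = roc_auc_score(y_test, p_test)
sens_test, spec_test = sensitivity_specificity(y_test, p_test, threshold)

print(f"train: AUC={auc_train:.3f} sens={sens_train:.3f} spec={spec_train:.3f}")
print(f"test:  AUC={auc_test:.3f} sens={sens_test:.3f} spec={spec_test:.3f}")
```

A large gap between the train and test figures would suggest the model does not generalize and should be revised before step 10.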

In [None]:
import pandas as pd


df = pd.read_csv('markets_with_returns.csv')

# List of market return columns to be used in methods below
market_return_columns = [
    'Nifty_Return', 'DowJones_Return', 'Nasdaq_Return',
    'HangSeng_Return', 'Nikkei_Return' ]


In [16]:
# Define Nifty_Dir_Open =1/0 based on direction (dependent variable) (already derived in phase 2)
df['Nifty_Dir_Open'] = (df['Nifty_Return'] > 0).astype(int)
df.describe()

Statistic,Nifty_Open,Nifty_Close,DowJones_Open,DowJones_Close,Nasdaq_Open,Nasdaq_Close,HangSeng_Open,HangSeng_Close,Nikkei_Open,Nikkei_Close,...,Nifty_Return,DowJones_Return,Nasdaq_Return,HangSeng_Return,Nikkei_Return,DAX_Return,Quarter,Month,Year,Nifty_Dir_Open
count,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,...,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0,1647.0
mean,16730.809735,16720.34459,32958.507411,32962.678739,12798.080676,12800.410732,22865.172291,22858.875951,28728.350377,28726.524259,...,-0.067482,0.017619,0.034079,-0.016933,0.000484,0.02466,2.44748,6.328476,2021.681239,0.479053
std,4562.14146,4563.917864,5417.343038,5416.830025,3374.71549,3371.583671,4175.825148,4169.366472,5996.665587,5991.895916,...,0.91452,0.930121,1.190489,1.124313,0.921634,0.926731,1.126076,3.479439,1.835529,0.499713
min,7735.149902,7610.25,19028.359375,18591.929688,6506.910156,6463.5,14830.69043,14687.019531,16570.570312,16552.830078,...,-6.817998,-5.328405,-6.602975,-8.413594,-10.754634,-7.125518,1.0,1.0,2019.0,0.0
25%,11968.625,11946.325195,28217.995117,28237.584961,10471.260254,10474.540039,19345.15918,19363.695312,23418.344727,23409.665039,...,-0.477171,-0.404531,-0.548907,-0.674519,-0.422044,-0.381603,1.0,3.0,2020.0,0.0
50%,17183.449219,17176.699219,33516.429688,33530.828125,12968.379883,12997.75,23207.070312,23192.630859,27928.089844,27927.470703,...,-0.027866,0.04458,0.081241,-0.036686,-0.00317,0.081575,2.0,6.0,2022.0,0.0
75%,19632.125,19656.700195,35646.585938,35724.955078,15024.660156,15049.970215,26491.054688,26464.600586,32656.615234,32691.089844,...,0.383318,0.481138,0.675767,0.622819,0.465095,0.496467,3.0,9.0,2023.0,1.0
max,26248.25,26216.050781,45054.359375,45014.039062,20114.980469,20173.890625,31183.359375,31084.939453,42343.71875,42224.019531,...,9.30652,8.613852,11.961279,6.721295,8.099586,5.534786,4.0,12.0,2025.0,1.0


In [6]:
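# Create data partition into train and test data sets (80/20).
# shuffle=False keeps the rows in their original order, so the test set is the
# last 20% of rows; random_state has no effect without shuffling and is omitted.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=False)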
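
As a quick sanity check on the partition, the following sketch reproduces the split on a toy frame with the same number of rows as the `describe()` output above (1647). The frame `toy` and its `row` column are made up for illustration; only the row count comes from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with as many rows as the real data set (1647)
toy = pd.DataFrame({"row": range(1647)})
train_part, test_part = train_test_split(toy, test_size=0.2, shuffle=False)

# With shuffle=False every training row precedes every test row
in_order = train_part.index.max() < test_part.index.min()
print(len(train_part), len(test_part), in_order)
```

The split yields 1317 training rows and 330 test rows, and the 1317 figure agrees with the `No. Observations` value reported in the regression summary.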


In [13]:
# Run binary logistic regression with Nifty direction as the dependent variable
# and the market return columns as independent variables, after checking for
# multicollinearity with variance inflation factors (VIF).
# NOTE: Nifty_Dir_Open is derived above from the same row's Nifty_Return, which
# is also one of the features here, so the classes can be separated perfectly.
# Step 3 calls for previous-day returns instead.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Clean the data first: drop rows with missing values in features or target
clean_train_df = train_df[market_return_columns + ['Nifty_Dir_Open']].dropna()

X_clean = clean_train_df[market_return_columns]
y_clean = clean_train_df['Nifty_Dir_Open']

# Check VIF for multicollinearity
vif_data = pd.DataFrame()
vif_data["Feature"] = X_clean.columns
vif_data["VIF"] = [variance_inflation_factor(X_clean.values, i) for i in range(len(X_clean.columns))]
print("Variance Inflation Factors:")
print(vif_data)

# Build the logit model on the cleaned data, with an intercept term.
# The model is created outside the try block so the fallback can reuse it.
X_train_clean = sm.add_constant(X_clean)
y_train_clean = y_clean
logit_model = sm.Logit(y_train_clean, X_train_clean)

# Try BFGS first; fall back to Newton-Raphson if BFGS raises
try:
    result = logit_model.fit(method='bfgs', maxiter=1000, disp=True)
    print(result.summary())
except Exception:
    print("BFGS failed, trying with different method...")
    try:
        result = logit_model.fit(method='newton', maxiter=1000, disp=True)
        print(result.summary())
    except Exception:
        print("Newton method also failed. Data may have perfect separation issues.")

Variance Inflation Factors:
           Feature       VIF
0     Nifty_Return  1.071326
1  DowJones_Return  2.377468
2    Nasdaq_Return  2.295218
3  HangSeng_Return  1.095994
4    Nikkei_Return  1.092732
Optimization terminated successfully.
         Current function value: 0.000016
         Iterations: 80
         Function evaluations: 81
         Gradient evaluations: 81
                           Logit Regression Results                           
Dep. Variable:         Nifty_Dir_Open   No. Observations:                 1317
Model:                          Logit   Df Residuals:                     1311
Method:                           MLE   Df Model:                            5
Date:                Wed, 01 Oct 2025   Pseudo R-squ.:                   1.000
Time:                        00:05:31   Log-Likelihood:              -0.021207
converged:                       True   LL-Null:                       -912.24
Covariance Type:            nonrobust   LLR p-value:                     

  return 1/(1+np.exp(-X))
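
A Pseudo R-squ. of 1.000 with a log-likelihood of about -0.02 is the signature of perfect separation: Nifty_Dir_Open is defined as Nifty_Return > 0, and Nifty_Return is itself one of the features. The sketch below shows one possible way to build previous-day features with `shift(1)` instead. It runs on made-up data, and the `_Lag1` suffix and the names `toy`, `lagged`, and `model_df` are illustrative assumptions rather than columns of the real file.

```python
import numpy as np
import pandas as pd

# Made-up returns standing in for two of the market return columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Nifty_Return": rng.normal(0, 1, 200),
    "DowJones_Return": rng.normal(0, 1, 200),
})
toy["Nifty_Dir_Open"] = (toy["Nifty_Return"] > 0).astype(int)

# Previous-day features: shift each return column down by one row
feature_cols = ["Nifty_Return", "DowJones_Return"]
lagged = toy[feature_cols].shift(1).add_suffix("_Lag1")

# The first row has no previous day, so it is dropped
model_df = pd.concat([toy[["Nifty_Dir_Open"]], lagged], axis=1).dropna()

# The same-row return reproduces the target exactly (the leak) ...
same_day_rule = (toy["Nifty_Return"] > 0).astype(int)
leak_accuracy = (same_day_rule == toy["Nifty_Dir_Open"]).mean()

# ... while the lagged return does not
lag_rule = (model_df["Nifty_Return_Lag1"] > 0).astype(int)
lag_accuracy = (lag_rule == model_df["Nifty_Dir_Open"]).mean()

print(f"same-day rule accuracy: {leak_accuracy:.2f}")
print(f"lagged rule accuracy:   {lag_accuracy:.2f}")
```

If the real data is sorted by date, the same `shift(1)` applied to `market_return_columns` before the train/test split would give the previous-day inputs that step 3 describes.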
