# Model Fit Optimization Script

This script is intended for data analysis applications as a quick way to test different combinations of variables for your models.
It runs through all possible combinations of variables and estimates an OLS regression for each one.
It then outputs the model with the highest adjusted R² value.
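
For reference, adjusted R² penalizes plain R² for the number of regressors, which is why it is a fairer yardstick than R² when models of different sizes are compared. Below is a minimal sketch of the standard formula, assuming a model with an intercept. The function name `adjusted_r2` and the numbers are illustrative and not part of the script:

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R² for a model with an intercept.

    n_regressors counts the explanatory variables, excluding the constant.
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Same raw R² and sample size, but more regressors gives a lower adjusted value.
small_model = adjusted_r2(0.40, n_obs=100, n_regressors=2)
large_model = adjusted_r2(0.40, n_obs=100, n_regressors=5)
print(small_model, large_model)
```

An extra variable therefore only improves the score if it raises R² by more than the penalty it incurs.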

Of course, the model it selects doesn't have to be theoretically sound. It is up to you to ensure your models are properly vetted for theoretical plausibility.
This script is just an interesting way of approaching a new dataset and building out your theory.

Note that this script currently only supports OLS regressions. I will add more statistical models soon.

Please ensure that all variables are cleaned and correctly encoded (for example, categorical variables converted to 0/1 dummies) before using this script.

In [1]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import itertools

## Insert Your Data Here:

In [2]:
df = pd.read_csv('YOUR_DATASET.csv')
df.head(3)


***If you would like an illustration, use this sample model data***

In [None]:
"""
df = pd.read_csv('titanic.csv')
df = df[['survived', 'pclass', 'age', 'child', 'female', 'fare']]
df = df.dropna()
df['child'] = np.where(df['child'] == 'Child', 1, 0)   # make "child" a dummy
df['female'] = np.where(df['female'] == 'Female', 1, 0)   # make "female" a dummy
#df.shape
"""

***If you would like an illustration, fit this sample model on the sample data***

In [None]:
"""
X_variables = df[['pclass', 'age', 'child', 'female', 'fare']]

Y_variable = df[['survived']]

model1 = sm.OLS(Y_variable, X_variables, missing = 'drop').fit()
print(model1.summary())
"""

## Now Specify Your Variables:

In [None]:
X = ['LIST', 'YOUR', 'INDEPENDENT', 'VARIABLES', 'HERE']

Y = df[['YOUR DEPENDENT VARIABLE']]

"""
Example:

X = ['pclass', 'age', 'child', 'female', 'fare']

Y = df[['survived']]

"""

## You're done now. Just Run the Script and Look at the Output of the Last Cell :)

In [None]:
# This cell creates a list containing all possible combinations from your variables

varCombinations = []

for L in range(0, len(X)+1):
    for subset in itertools.combinations(X, L):
        #print(subset)
        varCombinations.append(subset)

print(varCombinations)
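
Because every variable is either in or out of a model, the number of combinations doubles with each variable added to `X`. Here is a small sketch of that growth, using the example variable names from earlier. The `2 ** len(X)` count in the comments is ordinary combinatorics rather than something the script itself computes:

```python
import itertools

X = ['pclass', 'age', 'child', 'female', 'fare']

# Collect every subset of X, from the empty set up to the full list.
all_subsets = [
    subset
    for size in range(len(X) + 1)
    for subset in itertools.combinations(X, size)
]

n_models = len(all_subsets) - 1  # the empty subset is not estimated
print(len(all_subsets), n_models)  # 32 subsets (2 ** 5), 31 models to fit
```

With 20 variables this is already more than a million regressions, so it pays to keep the list in `X` short.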

In [None]:
# This cell runs an OLS model for each combination and stores the adjusted R² values in a list

r2_list = []   # create an empty list in which I will store the adjusted R² values

df['const'] = 1 # add a constant to the dataframe

length = len(varCombinations)  # get the number of combinations there are

for i in range(1, length):  # I start at list index 1 and not 0 because the first list combination is an empty set (no variables)
  # turn the combination-item into a list of columns:
  d = varCombinations[i]
  cols = list(d)
  cols.append('const') # add a constant to each model
  indepVars = df[cols]
  # run a model with those variables/columns (indepVars)
  model = sm.OLS(Y, indepVars, missing = 'drop').fit()
  # get the adjusted R² value and append it to list:
  r2_value = model.rsquared_adj
  r2_list.append(r2_value)

print(r2_list)

In [None]:
# This cell grabs the model with the highest adjusted R² value and outputs it

indexNo = r2_list.index(max(r2_list)) + 1  # I add one because I skipped the
# first list item (empty set of variables) at the beginning of my for-loop
print("\n") # empty line

print("The best combination of variables to use is: \n", varCombinations[indexNo])
print("\n") # empty line

print("The resulting OLS-Model looks like this: \n")
print("\n") # empty line

d = varCombinations[indexNo]
cols = list(d)
cols.append('const') # add a constant
indepVars = df[cols]
model = sm.OLS(Y, indepVars, missing = 'drop').fit()
print(model.summary())
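
The `+ 1` index correction in the last cell is easy to get wrong. One alternative is to store each score together with its combination and let `max` pick the pair. The sketch below uses made-up adjusted R² values rather than real model output, and `scored` and the numbers are purely illustrative:

```python
# Hypothetical (adjusted R², combination) pairs; in the script these would
# come from model.rsquared_adj inside the loop.
scored = [
    (0.11, ('pclass',)),
    (0.29, ('female',)),
    (0.37, ('pclass', 'female')),
    (0.36, ('pclass', 'female', 'fare')),
]

# Pick the pair with the highest score; no index bookkeeping is needed.
best_score, best_combo = max(scored, key=lambda pair: pair[0])
print(best_score, best_combo)
```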