# Regression Model

### Dependencies:
This project was run in a locally hosted Jupyter notebook, using the following Python and library versions:

- Python == 3.8.8
- pandas == 1.2.4
- numpy == 1.20.1
- statsmodels == 0.12.2
- scikit-learn == 0.24.1
- itertools (part of the Python standard library, so it has no separate version)
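
To compare a local environment against the versions listed above, something like the following could be used. This is a minimal sketch: the helper name `installed_versions` is hypothetical, and the package names are assumed to match their PyPI distribution names.

```python
# Sketch: report installed versions of the libraries listed above.
import sys
from importlib import metadata  # in the standard library since Python 3.8

# Assumed PyPI distribution names. itertools is left out because it ships
# with Python and has no version of its own.
PACKAGES = ["pandas", "numpy", "statsmodels", "scikit-learn"]


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python", sys.version.split()[0])
for name, version in installed_versions(PACKAGES).items():
    print(name, "==", version if version is not None else "not installed")
```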

In [1]:
# load python libraries
import numpy as np
import pandas as pd
from itertools import combinations
import statsmodels.formula.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

### Create Combinations for Regression
Load the data, which has 14 indicators as columns, then build combinations of various lengths to set up the regression models:
- <b>power_rating:</b> our primary indicator of how likely a team is to win the World Cup
- <b>gdp_usd:</b> GDP
- <b>mv_usd:</b> average Market Value of football clubs aggregated by country
- <b>elo_rating:</b> Elo ratings based on previous World Cup rankings and scores
- <b>bmi:</b> Body Mass Index
- <b>life_exp:</b> life expectancy from birth
- <b>qol:</b> quality of life index
- <b>mv_play_dom:</b> market value of domestic players
- <b>num_play_dom:</b> number of domestic players
- <b>mv_play_int:</b> market value of international players
- <b>num_play_int:</b> number of international players
- <b>dom_int_mv_ratio:</b> domestic to international market value ratio
- <b>dom_perc:</b> percentage of domestic players out of total
- <b>int_perc:</b> percentage of international players out of total

In [2]:
df = pd.read_csv('C:\\Users\\uremekn\\Downloads\\ALL_INDICATORS_AGG.csv') # load file
               
# StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. 
# Unit variance means dividing all the values by the standard deviation. 
scaler = StandardScaler()

del df['country_name'] # remove since it's not needed for the model

df_reg = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# 14 indicators in list form
indicators = ['power_rating','gdp_usd','mv_usd','elo_rating','bmi','life_exp','qol', \
              'mv_play_dom','num_play_dom','mv_play_int','num_play_int','dom_int_mv_ratio', \
              'dom_perc','int_perc']

combinations_list = []

# create all possible combinations for the 14 indicators
for n in range(len(indicators) + 1):
    combinations_list += list(combinations(indicators, n))
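
The standardization that StandardScaler applies in the cell above can be checked by hand on a single column. The sketch below uses made-up values and only the standard library; the helper name `standardize` is hypothetical. It divides by the population standard deviation (n in the denominator, not n - 1), which is what StandardScaler uses.

```python
# Sketch: standardize one column by hand (subtract the mean, divide by the std).
from statistics import fmean, pstdev


def standardize(values):
    """Return the values rescaled to mean 0 and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]


# Made-up sample, for illustration only: mean is 5.0, population std is 2.0
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(sample)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, so all indicators are on a comparable scale regardless of their original units.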

### Create Formula for Regression Model
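Using the ordinary least squares (OLS) formula interface from statsmodels (sm.ols), a loop builds the variable "formula" for each combination. The first indicator in the combination is the response and the remaining indicators are the predictors, following this pattern: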

<i>'indicator1 ~ indicator2 + indicator3 + indicatorN'</i>

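Total <b>16369</b> combinations (the 2^14 = 16384 subsets of the 14 indicators, minus the empty set and the 14 single-indicator subsets, which cannot form a formula)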
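
The total can be verified with a short calculation. The sketch below counts the subsets two ways, by binomial coefficients and by enumeration with placeholder names; the variable names are illustrative and not taken from the notebook.

```python
# Sketch: count the combinations that contain at least two indicators.
from itertools import combinations
from math import comb  # available since Python 3.8

N_INDICATORS = 14

# All subsets, including the empty set and the single-indicator subsets
total_subsets = sum(comb(N_INDICATORS, k) for k in range(N_INDICATORS + 1))

# Only subsets of size 2 or more can form 'response ~ predictors'
usable = sum(comb(N_INDICATORS, k) for k in range(2, N_INDICATORS + 1))

# Cross-check by enumerating combinations of placeholder names
names = [f"x{i}" for i in range(N_INDICATORS)]
enumerated = sum(
    1 for k in range(2, N_INDICATORS + 1) for _ in combinations(names, k)
)

print(total_subsets, usable, enumerated)
```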

In [4]:
# Pair each combination with its formula in a single list. The formula is used to calculate r-squared,
# and the combination is used when creating a dataframe with the r^2 value.
combo_formula_list = []

for combo in combinations_list:
    # a formula needs a response and at least one predictor
    if len(combo) < 2:
        continue

    # first indicator is the response, the rest are predictors joined with ' + '
    formula = combo[0] + ' ~ ' + ' + '.join(combo[1:])

    # keep the formula in a one-element list so it can be read back as formula[0]
    combo_formula_list.append([combo, [formula]])
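
As a quick illustration of the formula pattern, the sketch below builds the string for two example combinations. The helper name `build_formula` is hypothetical and is not part of the notebook.

```python
# Sketch: turn a combination of indicator names into an OLS formula string.
def build_formula(combo):
    """Return 'first ~ second + third + ...' for a tuple of indicator names."""
    if len(combo) < 2:
        raise ValueError("a formula needs a response and at least one predictor")
    response, *predictors = combo
    return response + " ~ " + " + ".join(predictors)


print(build_formula(("power_rating", "elo_rating")))
print(build_formula(("power_rating", "mv_usd", "elo_rating", "qol")))
```

Because the first element of each combination becomes the response, only combinations that start with power_rating model power_rating itself; the others regress one indicator on the rest.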

In [5]:
# run a regression for each formula in combo_formula_list and collect each combination with its r-squared
results_list = []

for combo, formula in combo_formula_list:
    results = sm.ols(formula=formula[0], data=df_reg).fit()
    r2 = results.rsquared
    results_list.append([combo,r2])

# build the dataframe once, after the loop has finished
results_df = pd.DataFrame(results_list, columns = ['combination', 'r-squared'])

results_df.sort_values('r-squared', ascending = False)

Unnamed: 0,combination,r-squared
90,"(dom_perc, int_perc)",1.000000
16251,"(mv_usd, elo_rating, bmi, life_exp, qol, mv_pl...",0.948296
16252,"(mv_usd, elo_rating, bmi, life_exp, qol, mv_pl...",0.948296
16353,"(mv_usd, elo_rating, bmi, life_exp, qol, mv_pl...",0.948296
15869,"(mv_usd, elo_rating, life_exp, qol, mv_play_do...",0.948291
...,...,...
0,"(power_rating, gdp_usd)",0.000660
51,"(bmi, num_play_int)",0.000622
77,"(num_play_dom, num_play_int)",0.000201
50,"(bmi, mv_play_int)",0.000175
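
The top row, (dom_perc, int_perc) with an r-squared of 1.0, is what one would expect if the two percentages always sum to 100, since each is then an exact linear function of the other. That is an assumption about the data, not something checked against the file. The sketch below computes r-squared for a one-predictor least-squares fit on made-up percentages; the helper name `r_squared` is hypothetical.

```python
# Sketch: r-squared of a simple least-squares line, on made-up data.
from statistics import fmean


def r_squared(x, y):
    """R-squared of the least-squares line y = intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up domestic percentages; international is assumed to be the remainder
dom_perc = [35.0, 60.0, 12.5, 80.0, 47.5]
int_perc = [100.0 - d for d in dom_perc]

# Mirrors the formula 'dom_perc ~ int_perc'
print(round(r_squared(int_perc, dom_perc), 6))
```

A perfect fit of this kind says nothing about power_rating, so rows like this one are best read as a sign of redundant indicators.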


In [6]:
# output file
results_df.to_csv('C:\\Users\\uremekn\\Downloads\\REGRESSION_RSQUARED_NU.csv')

### Run regression after removing indicators that are weakly correlated with power rating
Indicators that are removed:
- gdp_usd
- bmi
- life_exp

Indicators in the model:
- power_rating
- mv_usd
- elo_rating
- qol
- mv_play_dom
- num_play_dom
- mv_play_int
- num_play_int
- dom_int_mv_ratio
- dom_perc
- int_perc

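Total <b>2036</b> combinations with at least two indicators (the 2^11 = 2048 subsets of the 11 indicators, minus the empty set and the 11 single-indicator subsets)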
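
The notebook does not show how the weakly correlated indicators were identified. One possible approach, sketched below on toy data, is to compute the Pearson correlation of each indicator with power_rating and keep those above a cutoff. The helper names, the toy values, and the `MIN_ABS_CORR` threshold are all hypothetical.

```python
# Sketch: keep indicators whose correlation with the target passes a cutoff.
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_indicators(columns, target, min_abs_corr):
    """Return the target plus every column with |correlation| >= min_abs_corr."""
    target_values = columns[target]
    kept = [target]
    for name, values in columns.items():
        if name == target:
            continue
        if abs(pearson(values, target_values)) >= min_abs_corr:
            kept.append(name)
    return kept


# Toy values for illustration only; they are not from ALL_INDICATORS_AGG.csv
toy = {
    "power_rating": [1.0, 2.0, 3.0, 4.0, 5.0],
    "elo_rating": [1.1, 1.9, 3.2, 3.8, 5.1],  # tracks the target closely
    "bmi": [2.0, 1.0, 3.0, 1.0, 2.0],         # no linear relationship
}
MIN_ABS_CORR = 0.5  # hypothetical cutoff

print(select_indicators(toy, "power_rating", MIN_ABS_CORR))
```

In the notebook itself, the same correlations could presumably be read from `df_reg.corr()['power_rating']`.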

In [7]:
indicators_filtered = ['power_rating','mv_usd','elo_rating','qol','mv_play_dom','num_play_dom',\
                       'mv_play_int','num_play_int','dom_int_mv_ratio', 'dom_perc','int_perc']

combo_filtered_list = []

for n in range(len(indicators_filtered) + 1):
    combo_filtered_list += list(combinations(indicators_filtered, n))

In [9]:
combo_formula_filtered_list = []

# build one formula per filtered combination
for combo in combo_filtered_list:
    # a formula needs a response and at least one predictor
    if len(combo) < 2:
        continue

    # first indicator is the response, the rest are predictors joined with ' + '
    formula = combo[0] + ' ~ ' + ' + '.join(combo[1:])

    # keep the formula in a one-element list so it can be read back as formula[0]
    combo_formula_filtered_list.append([combo, [formula]])

In [10]:
# run a regression for each filtered formula and collect each combination with its r-squared
results_filtered_list = []

for combo, formula in combo_formula_filtered_list:
    results = sm.ols(formula=formula[0], data=df_reg).fit()
    r2 = results.rsquared
    results_filtered_list.append([combo,r2])

# build the dataframe once, after the loop has finished
results_filtered_df = pd.DataFrame(results_filtered_list, columns = ['combination', 'r-squared'])

results_filtered_df.sort_values('r-squared', ascending = False)



In [11]:
results_filtered_df.to_csv('C:\\Users\\uremekn\\Downloads\\REGRESSION_RSQUARED_FILTERED_NU.csv')