### Modeling

In [1]:
!pip install linearmodels


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5)  #set default figure size
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from linearmodels.iv import IV2SLS
from sklearn import (
    linear_model, metrics, neural_network, pipeline, model_selection
)
import seaborn as sns

In [3]:
df = pd.read_csv('merged_data.csv')
print(df.columns)

Index(['country', 'Code', 'ContinentCode', 'year',
       'GDP per capita constant 2010 dollars',
       'Capital investment as percent of GDP',
       'Capital investment billion USD',
       'Household consumption as percent of GDP',
       'Household consumption billion USD', 'Labor force million people',
       'Government spending as percent of GDP',
       'Government spending billion USD', 'Population growth percent',
       'Happiness Index 0 (unhappy) - 10 (happy)',
       'Economic growth: the rate of change of real GDP',
       'Gross Domestic Product billions of 2010 U.S. dollars',
       'Unemployment rate', 'Exports of goods and services billion USD',
       'Exports of goods and services annual growth',
       'Current account balance billion USD', 'ranking index'],
      dtype='object')


#### Part 1: Linear regression using variables selected by Lasso

$ \log(\text{ranking index}) = \beta_0 + \beta_1 (\text{GDP}) + \beta_2 (\text{GDP growth}) + \beta_3 x_3 + \cdots + \beta_n x_n + \varepsilon $
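The heading refers to variables selected by Lasso, but the selection step itself does not appear in this section. The following is only a minimal sketch of how such a selection could be done with scikit-learn. The data is synthetic and the feature names `x0` to `x5` are hypothetical; none of it comes from `merged_data.csv`.

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

rng = np.random.default_rng(0)
n = 200

# Synthetic regressors (assumption: stand-ins for the macro variables).
X_demo = rng.normal(size=(n, 6))
# Only columns 0 and 3 actually drive the outcome.
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 3] + rng.normal(scale=0.5, size=n)

# Standardize first so the L1 penalty treats all features on the same scale,
# then choose the penalty strength by 5-fold cross-validation.
lasso = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LassoCV(cv=5, random_state=0),
)
lasso.fit(X_demo, y_demo)

# Features whose coefficient was not shrunk to zero are the "selected" ones.
coefs = lasso.named_steps["lassocv"].coef_
selected = [f"x{i}" for i, c in enumerate(coefs) if abs(c) > 1e-8]
print(selected)
```

Standardizing before the Lasso matters here because the regressors in this notebook are on very different scales (billions of USD versus percentages).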

In [4]:
# 'Capital investment billion USD' and 'Capital investment as percent of GDP' measure similar things
# and are relatively highly correlated, so we keep only one of them
X = df[['Current account balance billion USD',
        'Exports of goods and services annual growth',
        'Exports of goods and services billion USD',
        'Unemployment rate',
        'Gross Domestic Product billions of 2010 U.S. dollars',
        'Economic growth: the rate of change of real GDP',
        'Happiness Index 0 (unhappy) - 10 (happy)',
        'Population growth percent',
        'Government spending as percent of GDP',
        'Labor force million people',
        'Household consumption as percent of GDP',
        'Capital investment billion USD',
        'GDP per capita constant 2010 dollars']]

y = df['ranking index']
y_log = np.log(y + 1)

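df['const'] = 1  # constant column, used in the 2SLS regressions below
# note: X does not include 'const', so this OLS is fitted without an intercept
reg1 = sm.OLS(endog=y_log, exog=X, missing='drop')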
results = reg1.fit()
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:          ranking index   R-squared (uncentered):                   0.979
Model:                            OLS   Adj. R-squared (uncentered):              0.978
Method:                 Least Squares   F-statistic:                              648.1
Date:                Mon, 27 Nov 2023   Prob (F-statistic):                   6.07e-143
Time:                        02:45:16   Log-Likelihood:                         -235.09
No. Observations:                 192   AIC:                                      496.2
Df Residuals:                     179   BIC:                                      538.5
Df Model:                          13                                                  
Covariance Type:            nonrobust                                                  
                                                           coef    std err          t      P>|t|      [0.025      0.975]

In [5]:
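# in-sample MSE of a linear fit on the same regressors
# (LinearRegression fits an intercept by default, unlike the OLS model above)
lr_model = linear_model.LinearRegression()
lr_model.fit(X, y_log)
mse = metrics.mean_squared_error(y_log, lr_model.predict(X))
print(mse)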

0.6429409949562513


- As the OLS results table above shows, the coefficient on 'Gross Domestic Product billions of 2010 U.S. dollars' is -8.911e-05 with a p-value of 0.304, so it is not statistically significant at the 5% level.
- The coefficient on 'Economic growth: the rate of change of real GDP' is -0.0014 with a p-value of 0.959, which is also not statistically significant at the 5% level.
- The coefficient on 'GDP per capita constant 2010 dollars', however, has a p-value of approximately 0 and is statistically significant at the 5% level.

In [6]:
# fitted graph

#### Part 2: Two-stage least squares (2SLS) regression

- The OLS model above likely suffers from endogeneity. Two sources are plausible: reverse causality (better universities are likely to raise a country's GDP) and omitted variable bias (many variables are correlated with both a country's university ranking and its GDP, and we cannot obtain data on and control for all of them). We therefore use a 2SLS model to address the endogeneity problem.

- In this model, to estimate the effect of 'Gross Domestic Product billions of 2010 U.S. dollars' on the log of the ranking index, we choose 'Exports of goods and services billion USD' as the instrument, because we argue that it satisfies the three conditions for a valid instrument: first stage (relevance), exogeneity, and exclusion.
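To illustrate why an instrument helps, here is a small simulation under stated assumptions: the data is synthetic, the true effect is fixed at 1.0, and an unobserved confounder `u` plays the role of the omitted variables. It is a sketch of the general mechanism, not a statement about the ranking data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
true_beta = 1.0

z = rng.normal(size=n)   # instrument: moves x, unrelated to the confounder
u = rng.normal(size=n)   # unobserved confounder (omitted variable)
x = 0.8 * z + u + rng.normal(size=n)              # endogenous regressor
y = true_beta * x + 2.0 * u + rng.normal(size=n)  # outcome

# OLS slope of y on x: cov(x, y) / var(x). Biased because x and u are correlated.
beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV (Wald) estimator with one instrument: cov(z, y) / cov(z, x).
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"true: {true_beta:.2f}  OLS: {beta_ols:.2f}  IV: {beta_iv:.2f}")
```

In this setup the OLS slope absorbs part of the confounder's effect, while the IV estimate uses only the variation in `x` that comes from `z`.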

##### 1. first stage

$$
\text{Gross Domestic Product billions of 2010 U.S. dollars}_i = \delta_0 + \delta_1 \text{Exports of goods and services billion USD}_i + v_i
$$

In [7]:
# test the first stage
results_fs = sm.OLS(df['Gross Domestic Product billions of 2010 U.S. dollars'],
                    df[['const', 'Exports of goods and services billion USD']]).fit()
print(results_fs.summary())

                                             OLS Regression Results                                             
Dep. Variable:     Gross Domestic Product billions of 2010 U.S. dollars   R-squared:                       0.749
Model:                                                              OLS   Adj. R-squared:                  0.747
Method:                                                   Least Squares   F-statistic:                     566.3
Date:                                                  Mon, 27 Nov 2023   Prob (F-statistic):           6.80e-59
Time:                                                          02:45:17   Log-Likelihood:                -1735.8
No. Observations:                                                   192   AIC:                             3476.
Df Residuals:                                                       190   BIC:                             3482.
Df Model:                                                             1                         

- As the table shows, the coefficient on exports is large and its p-value is approximately 0, well below 0.05. The instrument is therefore correlated with GDP, which satisfies the first condition for an instrument mentioned above.

- We cannot directly test whether the instrument is correlated with the error term (exogeneity and exclusion). Intuitively, however, exports should not be correlated with the ranking index except through their influence on GDP. In the QS ranking calculation, none of the factors considered (Sustainability, Employment outcomes, International research network, etc.) seems related to a country's exports. We therefore infer that exports are a viable instrument in this case.

##### 2. second stage

$$
\log(\text{ranking index})_i = \beta_0 + \beta_1 \widehat{\text{GDP}}_i + u_i
$$

In [8]:
df['predicted_gdp'] = results_fs.predict()

results_ss = sm.OLS(y_log,
                    df[['const', 'predicted_gdp']]).fit()
print(results_ss.summary())

                            OLS Regression Results                            
Dep. Variable:          ranking index   R-squared:                       0.339
Model:                            OLS   Adj. R-squared:                  0.335
Method:                 Least Squares   F-statistic:                     97.27
Date:                Mon, 27 Nov 2023   Prob (F-statistic):           8.65e-19
Time:                        02:45:18   Log-Likelihood:                -290.06
No. Observations:                 192   AIC:                             584.1
Df Residuals:                     190   BIC:                             590.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             5.0544      0.094     53.847

##### 3. directly using linearmodels package (IV2SLS)

In [9]:
iv = IV2SLS(dependent = y_log,
            exog = df['const'],
            endog = df['Gross Domestic Product billions of 2010 U.S. dollars'],
            instruments = df['Exports of goods and services billion USD']).fit(cov_type='unadjusted')

print(iv.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:          ranking index   R-squared:                      0.2503
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2463
No. Observations:                 192   F-statistic:                    86.714
Date:                Mon, Nov 27 2023   P-value (F-stat)                0.0000
Time:                        02:45:25   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted                                         
                                                                              
                                                  Parameter Estimates                                                   
                                                      Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------------------------------------------------
const

- The package gives the same coefficient as the manual two-stage analysis above, so in the next part we use the IV package directly.
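One more reason to prefer a dedicated IV routine: the coefficients of a manual second stage match, but its standard errors are generally not valid, because the residuals are computed with the fitted regressor instead of the original one. The sketch below demonstrates this with NumPy on synthetic data. All variable names are illustrative, and the variance uses a denominator of `n` as a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
y = 1.0 + 0.5 * x + 2.0 * u + rng.normal(size=n)

def ols(A, b):
    """Least-squares coefficients of b on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return beta

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

# Stage 1: fitted values of x from the instrument. Stage 2: regress y on them.
X_hat = np.column_stack([np.ones(n), Z @ ols(Z, x)])
beta_2sls = ols(X_hat, y)

# Closed-form IV estimator for the just-identified case: (Z'X)^{-1} Z'y.
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Naive residuals use x_hat; correct 2SLS residuals use the original x.
resid_naive = y - X_hat @ beta_2sls
resid_correct = y - X @ beta_2sls

bread = np.diag(np.linalg.inv(X_hat.T @ X_hat))
se_naive = np.sqrt(resid_naive @ resid_naive / n * bread)
se_correct = np.sqrt(resid_correct @ resid_correct / n * bread)

print("coefficients:", beta_2sls, beta_iv)
print("slope SE, naive vs corrected:", se_naive[1], se_correct[1])
```

The two coefficient vectors agree, while the two standard errors for the slope differ, so inference should rely on the output of the IV routine, not on the manual second-stage table.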

##### 4. testing the effect of GDP growth on ranking index

- To test the impact of GDP growth on the log ranking index, we change the instrument to 'Exports of goods and services annual growth'.

In [10]:
# first stage
results_fs = sm.OLS(df['Economic growth: the rate of change of real GDP'],
                    df[['const', 'Exports of goods and services annual growth']]).fit()
print(results_fs.summary())

                                           OLS Regression Results                                          
Dep. Variable:     Economic growth: the rate of change of real GDP   R-squared:                       0.607
Model:                                                         OLS   Adj. R-squared:                  0.605
Method:                                              Least Squares   F-statistic:                     293.3
Date:                                             Mon, 27 Nov 2023   Prob (F-statistic):           2.25e-40
Time:                                                     02:45:26   Log-Likelihood:                -448.93
No. Observations:                                              192   AIC:                             901.9
Df Residuals:                                                  190   BIC:                             908.4
Df Model:                                                        1                                         
Covariance Type:            

In [12]:
# 2SLS
iv2 = IV2SLS(dependent = y_log,
            exog = df['const'],
            endog = df['Economic growth: the rate of change of real GDP'],
            instruments = df['Exports of goods and services annual growth']).fit(cov_type='unadjusted')

print(iv2.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:          ranking index   R-squared:                     -0.0023
Estimator:                    IV-2SLS   Adj. R-squared:                -0.0076
No. Observations:                 192   F-statistic:                    2.4150
Date:                Mon, Nov 27 2023   P-value (F-stat)                0.1202
Time:                        02:46:03   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted                                         
                                                                              
                                                Parameter Estimates                                                
                                                 Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------------------------------------------
const               