
# **Chapter 9: Assessing Studies Based on Multiple Regression**
Author: Jinze Wu

Student Number: p09323028

Preliminaries:
- Import packages
- Download the data
- Read the data
- Set up variables

In [None]:
import pandas as pd
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [None]:
!gdown --id '1MArK8I9qrygzH1MkupLmS41sdo4_JFOW' --output mcas.xlsx

In [None]:
mcas = pd.read_excel('mcas.xlsx')

**Description of Data**

Selected variable definitions:

code: District Code (numerical)

municipa: Municipality (name)

district: District Name

totsc4: 4th grade score (math + English + science)

totsc8: 8th grade score (math + English + science)

regday: Spending per pupil, regular

specneed: Spending per pupil, special needs

bilingua: Spending per pupil, bilingual

occupday: Spending per pupil, occupational

tot_day: Spending per pupil, Total

tchratio: Students per Teacher

s_p_c: Students per Computer

spec_ed: % Special Education Students

lnch_pct: % Eligible for free/reduced price lunch

avgsalry: Average Teacher Salary

percap: Per Capita Income

pctel: Percent English Learners

## **9.1 Internal and External Validity**

### Threats to Internal Validity

### Threats to External Validity
- Differences in populations.
- Differences in settings.

## **9.2 Threats to Internal Validity of Multiple Regression Analysis**
- Omitted Variable Bias
- Misspecification of the Functional Form of the Regression Function
- Measurement Error and Errors-in-Variables Bias
- Missing Data and Sample Selection
- Simultaneous Causality
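
The first threat in the list, omitted variable bias, can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the MCAS data: the variable names (`x`, `w`, `y`), the true coefficients, and the helper `ols` are all assumptions made for illustration. The regressor `x` is correlated with `w`, and `w` also determines `y`, so leaving `w` out shifts the estimated slope on `x` away from its true value.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic data (assumed for illustration): w is the variable that gets omitted.
w = rng.normal(size=n)
x = 0.8 * w + rng.normal(size=n)          # x is correlated with w
y = 1.0 + 2.0 * x + 3.0 * w + rng.normal(size=n)  # true slope on x is 2.0


def ols(columns, y):
    """Hypothetical helper: OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_short = ols([x], y)[1]     # w omitted
b_long = ols([x, w], y)[1]   # w included

# Population bias of the short regression: 3.0 * cov(x, w) / var(x)
# = 3.0 * 0.8 / (0.8**2 + 1), roughly 1.46.
print('slope with w omitted :', round(b_short, 3))
print('slope with w included:', round(b_long, 3))
```

In this setup the short regression overstates the slope because `x` and `w` are positively correlated and `w` raises `y`; flipping either sign would flip the direction of the bias.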

## **9.3 Internal and External Validity When the Regression Is Used for Prediction**

## **9.4 Example: Test Scores and Class Size**

### Table 9.1

In [None]:
mcas = mcas.rename(columns={
    'tchratio':'str',
    'totsc4':'testscr',
    'pctel':'el_pct',
    'percap':'avginc',
    'lnch_pct':'meal_pct'
    })

In [None]:
mcas[['testscr','str','el_pct','meal_pct','avginc']].describe().T

| variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| testscr | 220.0 | 709.827273 | 15.126474 | 658.0 | 701.0 | 711.0 | 720.0 | 740.0 |
| str | 220.0 | 17.344091 | 2.276666 | 11.4 | 15.8 | 17.1 | 19.025 | 27.0 |
| el_pct | 220.0 | 1.117676 | 2.90094 | 0.0 | 0.0 | 0.0 | 0.885939 | 24.493927 |
| meal_pct | 220.0 | 15.315909 | 15.060068 | 0.4 | 5.3 | 10.55 | 20.025 | 76.199997 |
| avginc | 220.0 | 18.746764 | 5.807637 | 9.686 | 15.223 | 17.128 | 20.376 | 46.855 |


In [None]:
print(mcas.el_pct.describe())
# pandas reports sample skewness and excess kurtosis (a normal distribution has excess kurtosis 0)
print('Skewness: {:.4f}'.format(mcas.el_pct.skew()))
print('Kurtosis: {:.4f}'.format(mcas.el_pct.kurt()))

count    220.000000
mean       1.117676
std        2.900940
min        0.000000
25%        0.000000
50%        0.000000
75%        0.885939
max       24.493927
Name: el_pct, dtype: float64
Skewness: 4.5892
Kurtosis: 25.9631


### Table 9.2

In [None]:
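# Polynomial and log terms in district income
mcas['avginc2'] = mcas['avginc'] ** 2
mcas['avginc3'] = mcas['avginc'] ** 3
mcas['loginc'] = np.log(mcas['avginc'])
# HiEL dummy: 1 if the district has any English learners (the sample median of el_pct is 0)
mcas['hiel'] = (mcas['el_pct'] > 0).astype(int)
# Bracket indexing avoids relying on attribute access for a column named 'str'
mcas['strxhiel'] = mcas['str'] * mcas['hiel']
# Quadratic and cubic terms in the student-teacher ratio
mcas['sttr2'] = mcas['str'] ** 2
mcas['sttr3'] = mcas['str'] ** 3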

In [None]:
# Column (1): test score on the student-teacher ratio only
reg_t9_2_1 = smf.ols(formula='testscr~str', data=mcas)
results_t9_2_1 = reg_t9_2_1.fit(cov_type='HC1')  # robust
print(results_t9_2_1.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.067
Model:                            OLS   Adj. R-squared:                  0.063
Method:                 Least Squares   F-statistic:                     11.85
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           0.000692
Time:                        08:03:17   Log-Likelihood:                -901.67
No. Observations:                 220   AIC:                             1807.
Df Residuals:                     218   BIC:                             1814.
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    739.6211      8.607     85.930      0.0

In [None]:
# Column (2): adds English learners, lunch eligibility, and log income as controls
reg_t9_2_2 = smf.ols(formula='testscr~str+el_pct+meal_pct+loginc', data=mcas)
results_t9_2_2 = reg_t9_2_2.fit(cov_type='HC1')  # robust
print(results_t9_2_2.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.676
Model:                            OLS   Adj. R-squared:                  0.670
Method:                 Least Squares   F-statistic:                     144.4
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           9.93e-60
Time:                        08:03:47   Log-Likelihood:                -785.22
No. Observations:                 220   AIC:                             1580.
Df Residuals:                     215   BIC:                             1597.
Df Model:                           4                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    682.4316     11.497     59.356      0.0

In [None]:
# Column (3): replaces log income with a cubic in income
reg_t9_2_3 = smf.ols(formula='testscr~str+el_pct+meal_pct+avginc+avginc2+avginc3', data=mcas)
results_t9_2_3 = reg_t9_2_3.fit(cov_type='HC1')  # robust
print(results_t9_2_3.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.676
Method:                 Least Squares   F-statistic:                     110.2
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           1.62e-62
Time:                        08:04:25   Log-Likelihood:                -782.18
No. Observations:                 220   AIC:                             1578.
Df Residuals:                     213   BIC:                             1602.
Df Model:                           6                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    744.0250     21.318     34.902      0.0

In [None]:
# Joint test that the quadratic and cubic income terms are zero
print(results_t9_2_3.f_test(['avginc2=0','avginc3=0']))

<F test: F=array([[7.74479922]]), p=0.0005664381585479322, df_denom=213, df_num=2>
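
To make the robust joint test above less of a black box, here is a minimal sketch of a heteroskedasticity-robust Wald F-statistic computed by hand with NumPy. It uses synthetic data rather than the MCAS data, and the names (`x`, `y`, `R`, `V_hc1`) and the data-generating process are assumptions for illustration. It follows the usual HC1 sandwich formula and divides the Wald statistic by the number of restrictions; it is not claimed to reproduce statsmodels' internals line by line.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumed): the true relationship is linear in x,
# and the error spread grows with x (heteroskedasticity).
x = rng.uniform(1.0, 5.0, size=n)
u = rng.normal(0.0, 1.0, size=n) * x
y = 2.0 + 1.5 * x + u

# Design matrix: intercept, x, x^2, x^3
X = np.column_stack([np.ones(n), x, x**2, x**3])
k = X.shape[1]

# OLS coefficients and residuals
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# HC1 covariance: sandwich estimator with an n / (n - k) degrees-of-freedom correction
meat = X.T @ (X * resid[:, None] ** 2)
V_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv

# H0: the coefficients on x^2 and x^3 are both zero
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
q = R.shape[0]
Rb = R @ beta

# Wald statistic divided by the number of restrictions
F = float(Rb @ np.linalg.solve(R @ V_hc1 @ R.T, Rb)) / q
print('robust F-statistic:', round(F, 4), 'with', q, 'restrictions')
```

A p-value would then come from comparing `F` with an F distribution with `q` and `n - k` degrees of freedom, which matches the `df_num` and `df_denom` reported in the output above.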


In [None]:
# Column (4): cubic in the student-teacher ratio and cubic in income
reg_t9_2_4 = smf.ols(formula='testscr~str+sttr2+sttr3+el_pct+meal_pct+avginc+avginc2+avginc3', data=mcas)
results_t9_2_4 = reg_t9_2_4.fit(cov_type='HC1')  # robust
print(results_t9_2_4.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.687
Model:                            OLS   Adj. R-squared:                  0.675
Method:                 Least Squares   F-statistic:                     105.7
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           1.62e-69
Time:                        08:19:14   Log-Likelihood:                -781.63
No. Observations:                 220   AIC:                             1581.
Df Residuals:                     211   BIC:                             1612.
Df Model:                           8                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    665.4961     81.332      8.182      0.0

In [None]:
# Joint tests: all STR terms; nonlinear STR terms only; nonlinear income terms
print(results_t9_2_4.f_test(['str=0','sttr2=0','sttr3=0']))
print(results_t9_2_4.f_test(['sttr2=0','sttr3=0']))
print(results_t9_2_4.f_test(['avginc2=0','avginc3=0']))

<F test: F=array([[2.85645131]]), p=0.03808901452166653, df_denom=211, df_num=3>
<F test: F=array([[0.4462781]]), p=0.6406084508384267, df_denom=211, df_num=2>
<F test: F=array([[7.74871824]]), p=0.0005657464462124524, df_denom=211, df_num=2>
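
With a cubic in the student-teacher ratio, the predicted effect of a class-size change depends on the starting level of STR. The sketch below shows the arithmetic under stated assumptions: the function name `cubic_str_effect` is hypothetical, and the coefficients passed in are placeholder values chosen for an easy check, not the estimates from column (4).

```python
def cubic_str_effect(b_str, b_str2, b_str3, str_from, str_to):
    """Predicted change in test score when STR moves from str_from to str_to,
    holding the other regressors fixed, under a cubic-in-STR specification."""
    def str_part(s):
        # Contribution of the STR terms to the fitted value
        return b_str * s + b_str2 * s**2 + b_str3 * s**3
    return str_part(str_to) - str_part(str_from)


# Placeholder coefficients for illustration only (NOT the fitted values):
# with the nonlinear terms set to zero this reduces to the linear case.
effect = cubic_str_effect(b_str=-0.5, b_str2=0.0, b_str3=0.0,
                          str_from=20, str_to=18)
print(effect)
```

With the fitted model, the three coefficients could be read by name from `results_t9_2_4.params` (the terms `str`, `sttr2`, and `sttr3`) and passed to the function for any pair of STR values.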


In [None]:
# Column (5): STR interacted with the HiEL dummy, with a cubic in income
reg_t9_2_5 = smf.ols(formula='testscr~str+hiel+strxhiel+meal_pct+avginc+avginc2+avginc3', data=mcas)
results_t9_2_5 = reg_t9_2_5.fit(cov_type='HC1')  # robust
print(results_t9_2_5.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.686
Model:                            OLS   Adj. R-squared:                  0.675
Method:                 Least Squares   F-statistic:                     73.02
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           5.02e-53
Time:                        08:21:37   Log-Likelihood:                -782.03
No. Observations:                 220   AIC:                             1580.
Df Residuals:                     212   BIC:                             1607.
Df Model:                           7                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    759.9142     23.233     32.708      0.0

In [None]:
# Joint tests: STR and its interaction; nonlinear income terms; HiEL and its interaction
print(results_t9_2_5.f_test(['str=0','strxhiel=0']))
print(results_t9_2_5.f_test(['avginc2=0','avginc3=0']))
print(results_t9_2_5.f_test(['hiel=0','strxhiel=0']))

<F test: F=array([[4.00619057]]), p=0.019597758297922596, df_denom=212, df_num=2>
<F test: F=array([[5.84677007]]), p=0.003375486213571273, df_denom=212, df_num=2>
<F test: F=array([[1.58347292]]), p=0.20767890663973093, df_denom=212, df_num=2>


In [None]:
# Column (6): drops the English-learner control; STR, lunch eligibility, and a cubic in income
reg_t9_2_6 = smf.ols(formula='testscr~str+meal_pct+avginc+avginc2+avginc3', data=mcas)
results_t9_2_6 = reg_t9_2_6.fit(cov_type='HC1')  # robust
print(results_t9_2_6.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     109.1
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           7.36e-57
Time:                        08:23:04   Log-Likelihood:                -783.45
No. Observations:                 220   AIC:                             1579.
Df Residuals:                     214   BIC:                             1599.
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    747.3639     20.278     36.856      0.0

In [None]:
# Joint test that the quadratic and cubic income terms are zero
print(results_t9_2_6.f_test(['avginc2=0','avginc3=0']))

<F test: F=array([[6.5479168]]), p=0.001737372446958823, df_denom=214, df_num=2>


## **9.5 Conclusion**