# Thesis Regressions

In this notebook, I will run regressions for my thesis project. The goal of the project is to find the factors that explain performance-responsive revenue for Division I college football programs.

In [1]:
# importing needed libraries for two stage least squares and ols

import pandas as pd
from linearmodels.iv import IV2SLS
from linearmodels.panel import PanelOLS
from linearmodels.panel import PooledOLS
import statsmodels.api as sm


# reading in the data
df = pd.read_csv("~/Desktop/Thesis Data/FinalCSV.csv", header=0)

In [2]:
# checking out the columns

df.columns

Index(['Program', 'Year', 'TicketRevFB', 'GuarFB', 'ContribsFB', 'DistribsFB',
       'ConcessFB', 'SponsFB', 'CampsFB', 'TotRevFB', 'ResponsiveRev',
       'TwoStars', 'ThreeStars', 'FourStars', 'FiveStars', 'srs', 'lagged_srs',
       'stateunemp', 'stadium_cap', 'num_homegames', 'coach_experience',
       'conf_coy', 'nat_coy', 'new_coach', 'pop/d1prog', 'hs_grad_pct',
       'pro_team_per_million', 'spring_practice', 'num_jcs'],
      dtype='object')

In [3]:
# creating a new column that will represent revenue in thousands of dollars

df['ResponsiveRevThous'] = df['ResponsiveRev']/1000

In [4]:
# viewing Year as a categorical variable (astype returns a new Series, so df itself is not changed)

df.Year.astype('category')

0      2010
1      2011
2      2012
3      2013
4      2014
5      2010
6      2011
7      2012
8      2013
9      2014
10     2010
11     2011
12     2012
13     2013
14     2014
15     2010
16     2011
17     2012
18     2013
19     2014
20     2010
21     2011
22     2012
23     2013
24     2014
25     2010
26     2011
27     2012
28     2013
29     2014
       ... 
348    2010
349    2011
350    2012
351    2013
352    2014
353    2010
354    2011
355    2012
356    2013
357    2014
358    2010
359    2011
360    2012
361    2013
362    2014
363    2010
364    2011
365    2012
366    2013
367    2014
368    2010
369    2011
370    2012
371    2013
372    2014
373    2010
374    2011
375    2012
376    2013
377    2014
Name: Year, Length: 378, dtype: category
Categories (5, int64): [2010, 2011, 2012, 2013, 2014]

In [5]:
# setting the index to (Program, Year); linearmodels' panel estimators expect a MultiIndex of entity and time

df = df.set_index(['Program','Year'])

In [6]:
# checking the datatypes

df.dtypes

TicketRevFB               int64
GuarFB                    int64
ContribsFB                int64
DistribsFB                int64
ConcessFB                 int64
SponsFB                   int64
CampsFB                   int64
TotRevFB                  int64
ResponsiveRev             int64
TwoStars                  int64
ThreeStars                int64
FourStars                 int64
FiveStars                 int64
srs                     float64
lagged_srs              float64
stateunemp              float64
stadium_cap               int64
num_homegames             int64
coach_experience          int64
conf_coy                  int64
nat_coy                   int64
new_coach                 int64
pop/d1prog              float64
hs_grad_pct             float64
pro_team_per_million    float64
spring_practice           int64
num_jcs                   int64
ResponsiveRevThous      float64
dtype: object

In [7]:
# looking at the first five rows of data. Everything looks fine.

df.head()

| Program | Year | TicketRevFB | GuarFB | ContribsFB | DistribsFB | ConcessFB | SponsFB | CampsFB | TotRevFB | ResponsiveRev | TwoStars | ... | coach_experience | conf_coy | nat_coy | new_coach | pop/d1prog | hs_grad_pct | pro_team_per_million | spring_practice | num_jcs | ResponsiveRevThous |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Alabama | 2010 | 27683402 | 0 | 14323804 | 15871330 | 108063 | 2495000 | 0 | 72845206 | 60481599 | 6 | ... | 14 | 1 | 1 | 0 | 1196394.75 | 86.9 | 0.0 | 1 | 0 | 60481.599 |
| Alabama | 2011 | 29278884 | 200000 | 18458891 | 15749103 | 72714 | 546574 | 0 | 78285095 | 64306166 | 3 | ... | 15 | 1 | 1 | 0 | 1199662.25 | 85.2 | 0.0 | 1 | 0 | 64306.166 |
| Alabama | 2012 | 30294245 | 700000 | 18679937 | 15788146 | 70645 | 1302500 | 603200 | 82302856 | 67438673 | 1 | ... | 16 | 1 | 1 | 0 | 962789.2 | 86.33 | 0.0 | 1 | 0 | 67438.673 |
| Alabama | 2013 | 36199233 | 0 | 18864861 | 15832996 | 46467 | 1297257 | 584616 | 88685941 | 72825430 | 1 | ... | 17 | 1 | 1 | 0 | 965532.0 | 86.59 | 0.0 | 1 | 0 | 72825.43 |
| Alabama | 2014 | 34915615 | 0 | 20723362 | 15630254 | 58622 | 4547853 | 730580 | 95262742 | 76606286 | 1 | ... | 18 | 1 | 1 | 0 | 968007.4 | 86.86 | 0.0 | 1 | 0 | 76606.286 |


**Possible Dependent Variables:**
- TicketRevFB (football ticket revenue)
- TotRevFB (total football revenue)
- ResponsiveRev (revenue that is responsive to performance)
- ResponsiveRevThous (this one is preferred)

**Exogenous Independent Variables:**
- srs (Simple Rating System rating from year *y*)
- lagged_srs (Simple Rating System rating from year *y-1*)
- stadium_cap (home stadium capacity in year *y*)
- num_homegames (number of home games in year *y*)
- coach_experience (number of prior years that the team's head coach has been the head coach of an FBS team)
- conf_coy (dummy, 1 if a coach has been awarded a coach of the year award from an FBS conference prior to or in year *y*)
- nat_coy (dummy, 1 if a coach has been awarded a national coach of the year award prior to or in year *y*)
- new_coach (dummy, 1 if the team's head coach is different than the previous year)

**Endogenous Independent Variables:**
- TwoStars
- ThreeStars
- FourStars
- FiveStars

These are the numbers of signed players of each star rating over the last 4 years at each school.

**Possible Instruments:**
- pop/d1prog (population of the state divided by the number of FBS football programs in the state)
- hs_grad_pct (percentage of adults aged 25-64 in the state that have a high school diploma)
- pro_team_per_million (number of professional sports teams per million people in the state)
- spring_practice (dummy, 1 if spring practices are allowed in the state for high school football programs)
- num_jcs (number of junior colleges with football teams in the state; no historical data are available, so this is the current count)

# Two Stage Least Squares
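Since we have five potential instruments and four candidate endogenous variables (only ThreeStars, FourStars, and FiveStars enter the models below), we will run five two stage least squares regressions, each excluding one of the potential instruments, to find the best model. A sixth regression uses all five instruments for comparison.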
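
The leave-one-out instrument sets described above can also be enumerated programmatically instead of being typed out cell by cell. This is only a minimal sketch: the column names come from `df.columns`, while `candidate_instruments`, `endog_cols`, and `instrument_sets` are illustrative names that do not appear in the notebook's own cells.

```python
from itertools import combinations

# Candidate instruments and endogenous regressors, named as in df.columns
candidate_instruments = ['pop/d1prog', 'num_jcs', 'spring_practice',
                         'hs_grad_pct', 'pro_team_per_million']
endog_cols = ['ThreeStars', 'FourStars', 'FiveStars']

# Every way of keeping four of the five instruments (dropping exactly one)
instrument_sets = [list(kept) for kept in combinations(candidate_instruments, 4)]

for kept in instrument_sets:
    dropped = (set(candidate_instruments) - set(kept)).pop()
    # Order condition: at least as many instruments as endogenous regressors
    assert len(kept) >= len(endog_cols)
    print(f"excluding {dropped}: {kept}")
```

Each `kept` list could then be used as `df[kept]` in place of the hand-written instrument frames passed to `IV2SLS`.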


### 2SLS #1: Excluding *num_jcs*

In [15]:
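# dependent variable: performance-responsive revenue in thousands of dollars
dependent = df.ResponsiveRevThous
# exogenous regressors (no constant is included)
exog = df[['lagged_srs','stateunemp','num_homegames','stadium_cap','coach_experience','conf_coy','nat_coy','new_coach']]
# endogenous regressors: recruit counts by star rating
endog = df[['ThreeStars','FourStars','FiveStars']]
# instruments for this run: every candidate except num_jcs
instruments0 = df[['spring_practice','pop/d1prog','hs_grad_pct','pro_team_per_million']]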

In [16]:
mod = IV2SLS(dependent, exog, endog, instruments0)
res = mod.fit()
res

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | ResponsiveRevThous | R-squared: | 0.7314 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.7233 |
| No. Observations: | 378 | F-statistic: | 1329.1 |
| Date: | Sat, Jun 23 2018 | P-value (F-stat) | 0.0000 |
| Time: | 15:23:51 | Distribution: | chi2(11) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
| --- | --- | --- | --- | --- | --- | --- |
| lagged_srs | 869.45 | 290.84 | 2.9895 | 0.0028 | 299.42 | 1439.5 |
| stateunemp | -3311.7 | 959.59 | -3.4512 | 0.0006 | -5192.5 | -1431.0 |
| num_homegames | 7152.5 | 4103.9 | 1.7429 | 0.0814 | -891.00 | 1.52e+04 |
| stadium_cap | 0.9396 | 0.2663 | 3.5284 | 0.0004 | 0.4177 | 1.4615 |
| coach_experience | 41.841 | 185.51 | 0.2256 | 0.8216 | -321.74 | 405.43 |
| conf_coy | 4974.0 | 2877.4 | 1.7287 | 0.0839 | -665.52 | 1.061e+04 |
| nat_coy | -2900.3 | 4197.2 | -0.6910 | 0.4896 | -1.113e+04 | 5326.1 |
| new_coach | 1317.8 | 2647.9 | 0.4977 | 0.6187 | -3872.1 | 6507.7 |
| ThreeStars | -1099.1 | 502.63 | -2.1867 | 0.0288 | -2084.2 | -113.94 |


### 2SLS #2: Excluding *pop/d1prog*

In [17]:
instruments1 = df[['spring_practice','num_jcs','hs_grad_pct','pro_team_per_million']]

mod = IV2SLS(dependent, exog, endog, instruments1)
res = mod.fit()
res

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | ResponsiveRevThous | R-squared: | 0.5646 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.5516 |
| No. Observations: | 378 | F-statistic: | 784.03 |
| Date: | Sat, Jun 23 2018 | P-value (F-stat) | 0.0000 |
| Time: | 15:24:08 | Distribution: | chi2(11) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
| --- | --- | --- | --- | --- | --- | --- |
| lagged_srs | 616.09 | 246.13 | 2.5031 | 0.0123 | 133.68 | 1098.5 |
| stateunemp | -1302.5 | 760.27 | -1.7131 | 0.0867 | -2792.6 | 187.65 |
| num_homegames | 907.85 | 2714.7 | 0.3344 | 0.7381 | -4413.0 | 6228.7 |
| stadium_cap | 0.6319 | 0.2925 | 2.1605 | 0.0307 | 0.0587 | 1.2052 |
| coach_experience | -164.69 | 261.24 | -0.6304 | 0.5284 | -676.72 | 347.33 |
| conf_coy | 1557.9 | 2625.1 | 0.5935 | 0.5529 | -3587.2 | 6703.0 |
| nat_coy | 7911.0 | 5445.8 | 1.4527 | 0.1463 | -2762.6 | 1.858e+04 |
| new_coach | -174.57 | 2643.3 | -0.0660 | 0.9473 | -5355.3 | 5006.2 |
| ThreeStars | -361.84 | 301.96 | -1.1983 | 0.2308 | -953.68 | 229.99 |


### 2SLS #3: Excluding *spring_practice*

In [18]:
instruments2 = df[['pop/d1prog','num_jcs','hs_grad_pct','pro_team_per_million']]

mod = IV2SLS(dependent, exog, endog, instruments2)
res = mod.fit()
res

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | ResponsiveRevThous | R-squared: | 0.9087 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.9059 |
| No. Observations: | 378 | F-statistic: | 3263.5 |
| Date: | Sat, Jun 23 2018 | P-value (F-stat) | 0.0000 |
| Time: | 15:24:29 | Distribution: | chi2(11) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
| --- | --- | --- | --- | --- | --- | --- |
| lagged_srs | 382.87 | 173.28 | 2.2096 | 0.0271 | 43.252 | 722.49 |
| stateunemp | -1882.1 | 295.13 | -6.3773 | 0.0000 | -2460.6 | -1303.7 |
| num_homegames | 469.38 | 1265.1 | 0.3710 | 0.7106 | -2010.1 | 2948.9 |
| stadium_cap | 0.6136 | 0.1279 | 4.7961 | 0.0000 | 0.3628 | 0.8643 |
| coach_experience | 69.885 | 125.63 | 0.5563 | 0.5780 | -176.34 | 316.11 |
| conf_coy | 2779.2 | 1468.6 | 1.8924 | 0.0584 | -99.192 | 5657.6 |
| nat_coy | -713.94 | 1998.8 | -0.3572 | 0.7210 | -4631.5 | 3203.7 |
| new_coach | 1190.7 | 1495.3 | 0.7963 | 0.4259 | -1740.0 | 4121.4 |
| ThreeStars | -43.022 | 109.87 | -0.3916 | 0.6954 | -258.36 | 172.31 |


### 2SLS #4: Excluding *hs_grad_pct*

In [19]:
instruments3 = df[['pop/d1prog','num_jcs','spring_practice','pro_team_per_million']]

mod = IV2SLS(dependent, exog, endog, instruments3)
res = mod.fit()
res

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | ResponsiveRevThous | R-squared: | 0.8771 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.8734 |
| No. Observations: | 378 | F-statistic: | 2477.9 |
| Date: | Sat, Jun 23 2018 | P-value (F-stat) | 0.0000 |
| Time: | 15:25:24 | Distribution: | chi2(11) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
| --- | --- | --- | --- | --- | --- | --- |
| lagged_srs | 386.67 | 208.57 | 1.8539 | 0.0638 | -22.123 | 795.47 |
| stateunemp | -2026.4 | 445.87 | -4.5449 | 0.0000 | -2900.3 | -1152.5 |
| num_homegames | 2373.4 | 2226.7 | 1.0659 | 0.2865 | -1990.9 | 6737.6 |
| stadium_cap | 0.4562 | 0.2815 | 1.6206 | 0.1051 | -0.0955 | 1.0079 |
| coach_experience | 33.488 | 130.37 | 0.2569 | 0.7973 | -222.03 | 289.00 |
| conf_coy | 2575.6 | 1650.3 | 1.5607 | 0.1186 | -658.84 | 5810.1 |
| nat_coy | -300.29 | 2664.7 | -0.1127 | 0.9103 | -5523.0 | 4922.4 |
| new_coach | 861.04 | 1724.9 | 0.4992 | 0.6177 | -2519.7 | 4241.8 |
| ThreeStars | -232.89 | 151.96 | -1.5326 | 0.1254 | -530.73 | 64.942 |


### 2SLS #5: Excluding *pro_team_per_million*

In [20]:
instruments4 = df[['pop/d1prog','num_jcs','spring_practice','hs_grad_pct']]

mod = IV2SLS(dependent, exog, endog, instruments4)
res = mod.fit()
res

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | ResponsiveRevThous | R-squared: | 0.8849 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.8815 |
| No. Observations: | 378 | F-statistic: | 2158.6 |
| Date: | Sat, Jun 23 2018 | P-value (F-stat) | 0.0000 |
| Time: | 15:25:31 | Distribution: | chi2(11) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
| --- | --- | --- | --- | --- | --- | --- |
| lagged_srs | 442.39 | 208.90 | 2.1177 | 0.0342 | 32.946 | 851.83 |
| stateunemp | -1745.7 | 319.62 | -5.4620 | 0.0000 | -2372.2 | -1119.3 |
| num_homegames | -950.84 | 1705.5 | -0.5575 | 0.5772 | -4293.5 | 2391.8 |
| stadium_cap | 0.7935 | 0.1415 | 5.6097 | 0.0000 | 0.5163 | 1.0708 |
| coach_experience | 71.357 | 153.76 | 0.4641 | 0.6426 | -230.01 | 372.73 |
| conf_coy | 2947.2 | 1629.0 | 1.8092 | 0.0704 | -245.57 | 6140.0 |
| nat_coy | -46.629 | 2185.0 | -0.0213 | 0.9830 | -4329.1 | 4235.8 |
| new_coach | 1335.0 | 1648.5 | 0.8098 | 0.4180 | -1896.0 | 4566.0 |
| ThreeStars | 35.452 | 170.63 | 0.2078 | 0.8354 | -298.97 | 369.87 |


### 2SLS #6: No exclusions

In [21]:
instruments5 = df[['pop/d1prog','num_jcs','spring_practice','hs_grad_pct','pro_team_per_million']]

mod = IV2SLS(dependent, exog, endog, instruments5)
res = mod.fit()
res

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | ResponsiveRevThous | R-squared: | 0.8979 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.8948 |
| No. Observations: | 378 | F-statistic: | 2970.6 |
| Date: | Sat, Jun 23 2018 | P-value (F-stat) | 0.0000 |
| Time: | 15:25:45 | Distribution: | chi2(11) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
| --- | --- | --- | --- | --- | --- | --- |
| lagged_srs | 485.57 | 187.65 | 2.5876 | 0.0097 | 117.78 | 853.36 |
| stateunemp | -1813.2 | 320.12 | -5.6640 | 0.0000 | -2440.6 | -1185.7 |
| num_homegames | 479.50 | 1337.2 | 0.3586 | 0.7199 | -2141.4 | 3100.4 |
| stadium_cap | 0.7079 | 0.1347 | 5.2541 | 0.0000 | 0.4438 | 0.9720 |
| coach_experience | 21.634 | 128.55 | 0.1683 | 0.8664 | -230.32 | 273.59 |
| conf_coy | 2750.1 | 1560.8 | 1.7620 | 0.0781 | -309.00 | 5809.3 |
| nat_coy | 1111.7 | 1940.8 | 0.5728 | 0.5668 | -2692.2 | 4915.7 |
| new_coach | 981.32 | 1564.4 | 0.6273 | 0.5305 | -2084.9 | 4047.5 |
| ThreeStars | -149.79 | 118.85 | -1.2603 | 0.2075 | -382.73 | 83.150 |


Unfortunately, none of our IV models performed well. I believe that this is due to bad instruments. In all cases, there is very little variation within the instruments over time. Because of this, they don't help to explain variance in performance-responsive football revenue.
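
One way to put a number on "very little variation over time" is to compare each instrument's variance after removing program means with its overall variance. The sketch below is an illustration on made-up data, not output from the thesis dataset; the helper `within_share` and the toy frame `toy` are hypothetical names introduced here.

```python
import pandas as pd

# Toy panel (made-up numbers): one instrument that never changes within a
# program and one that moves from year to year
toy = pd.DataFrame({
    'Program': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2010, 2011, 2012, 2010, 2011, 2012],
    'spring_practice': [1, 1, 1, 0, 0, 0],
    'hs_grad_pct': [86.0, 87.0, 88.0, 80.0, 82.0, 84.0],
}).set_index(['Program', 'Year'])

def within_share(panel, entity_level='Program'):
    """Share of each column's variance left after removing program means."""
    demeaned = panel - panel.groupby(level=entity_level).transform('mean')
    return demeaned.var(ddof=0) / panel.var(ddof=0)

shares = within_share(toy)
print(shares)
```

A share near zero means the column is essentially constant within each program, so almost all of its variation is between programs. Applied to the real data, the same function could be called on `df[[...]]` with the instrument columns, assuming the `(Program, Year)` index set earlier.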


# Pooled OLS

Since none of the two stage least squares regressions performed well, we'll try a Pooled OLS.

In [24]:
panelx = df[['lagged_srs','stateunemp','num_homegames','stadium_cap','coach_experience','conf_coy','nat_coy','ThreeStars','FourStars','FiveStars']]
panelx = sm.add_constant(panelx)
panely = df['ResponsiveRevThous']

mod = PooledOLS(panely, panelx)
pooled_res = mod.fit()
print(pooled_res)

                          PooledOLS Estimation Summary                          
Dep. Variable:     ResponsiveRevThous   R-squared:                        0.8029
Estimator:                  PooledOLS   R-squared (Between):              0.8683
No. Observations:                 378   R-squared (Within):               0.0936
Date:                Sat, Jun 23 2018   R-squared (Overall):              0.8029
Time:                        15:27:24   Log-likelihood                   -4043.8
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      149.49
Entities:                          77   P-value                           0.0000
Avg Obs:                       4.9091   Distribution:                  F(10,367)
Min Obs:                       1.0000                                           
Max Obs:                       5.0000   F-statistic (robust):             149.49
                            

# Panel Regressions

Next, I am going to try a fixed effects model that will treat my data as panel data. My panel is unbalanced, so I run one fixed-effects model with the panel as it is, and one with a balanced panel. I created the balanced panel by dropping the programs that didn't have an observation for all five years (there are four such programs).
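
Dropping the incomplete programs can be sketched as follows. This is one possible approach shown on a toy panel with made-up values, not necessarily how the balanced panel for the thesis was built; `toy`, `obs_per_program`, `complete`, and `balanced` are illustrative names.

```python
import pandas as pd

# Toy unbalanced panel (made-up values): program 'B' is missing 2012
toy = pd.DataFrame({
    'Program': ['A', 'A', 'A', 'B', 'B'],
    'Year': [2010, 2011, 2012, 2010, 2011],
    'ResponsiveRevThous': [10.0, 11.0, 12.0, 5.0, 6.0],
}).set_index(['Program', 'Year'])

# Number of distinct years in the panel (this would be 5 for 2010-2014)
n_years = toy.index.get_level_values('Year').nunique()

# Count observations per program and keep only programs observed in every year
obs_per_program = toy.groupby(level='Program').size()
complete = obs_per_program[obs_per_program == n_years].index
balanced = toy[toy.index.get_level_values('Program').isin(complete)]
print(balanced)
```

The count-based test is sufficient only when the `(Program, Year)` index has no duplicate rows, which is worth checking with `toy.index.is_unique` first.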

## Fixed Effects Model: unbalanced panel

In [25]:
mod = PanelOLS(panely, panelx, entity_effects=True)
fixed_effects = mod.fit()
print(fixed_effects)

                          PanelOLS Estimation Summary                           
Dep. Variable:     ResponsiveRevThous   R-squared:                        0.2347
Estimator:                   PanelOLS   R-squared (Between):              0.0456
No. Observations:                 378   R-squared (Within):               0.2347
Date:                Sat, Jun 23 2018   R-squared (Overall):              0.0626
Time:                        15:27:39   Log-likelihood                   -3827.5
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      8.9259
Entities:                          77   P-value                           0.0000
Avg Obs:                       4.9091   Distribution:                  F(10,291)
Min Obs:                       1.0000                                           
Max Obs:                       5.0000   F-statistic (robust):             8.9259
                            