First, we need to import the required data.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/main_data.tab', sep= '\t')

We will run the linear regressions (OLS) for repression in departments with and without host cities.

Based on the paper, we need to run three multiple regression models.

Each one has the same dependent variable, `repression`, and a different set of independent variables.

First, let's import the libraries needed to run the regression.

In [3]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

For the first model:

In [4]:
model1 = smf.ols("repression ~ hostcitytime + hostcitytime2 + hostcity + time + time2", data = data)
m1 = model1.fit(cov_type='hc1')
m1.summary()
m1.nobs

58107.0

For the second model:

In [5]:
model2 = smf.ols("repression ~ hostcitytime + hostcitytime2 + hostcity + time + time2 + lnpop_1970 + vote_frejuli + literacy_avg + lnrebact1974 + lnrepression70_77", data = data)
m2 = model2.fit(cov_type='hc1')
m2.summary()

0,1,2,3
Dep. Variable:,repression,R-squared:,0.051
Model:,OLS,Adj. R-squared:,0.051
Method:,Least Squares,F-statistic:,13.75
Date:,"Mon, 19 Dec 2022",Prob (F-statistic):,1.49e-24
Time:,12:27:57,Log-Likelihood:,48335.0
No. Observations:,56394,AIC:,-96650.0
Df Residuals:,56383,BIC:,-96550.0
Df Model:,10,,
Covariance Type:,hc1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.0065,0.004,-1.738,0.082,-0.014,0.001
hostcitytime,0.9026,0.296,3.044,0.002,0.322,1.484
hostcitytime2,-0.7119,0.225,-3.164,0.002,-1.153,-0.271
hostcity,-0.0471,0.064,-0.741,0.459,-0.172,0.077
time,-0.0064,0.004,-1.462,0.144,-0.015,0.002
time2,0.0047,0.003,1.385,0.166,-0.002,0.011
lnpop_1970,0.0016,0.000,3.713,0.000,0.001,0.002
vote_frejuli,-5.113e-05,3.78e-05,-1.354,0.176,-0.000,2.29e-05
literacy_avg,-0.0095,0.002,-4.910,0.000,-0.013,-0.006

0,1,2,3
Omnibus:,144913.61,Durbin-Watson:,1.698
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3471418998.956
Skew:,29.366,Prob(JB):,0.0
Kurtosis:,1217.046,Cond. No.,9290.0


For the third model:

In [6]:
model3 = smf.ols("repression ~ hostcitytime + hostcitytime2 + hostcity + time + time2 + lnpop_1970 + vote_frejuli + literacy_avg + lnrebact1974 + lnrepression70_77 + zone2 + zone3 + zone4 + zone5", data = data)
m3 = model3.fit(cov_type='hc1')
m3.summary()

0,1,2,3
Dep. Variable:,repression,R-squared:,0.055
Model:,OLS,Adj. R-squared:,0.055
Method:,Least Squares,F-statistic:,10.53
Date:,"Mon, 19 Dec 2022",Prob (F-statistic):,2.57e-24
Time:,12:27:57,Log-Likelihood:,48455.0
No. Observations:,56394,AIC:,-96880.0
Df Residuals:,56379,BIC:,-96750.0
Df Model:,14,,
Covariance Type:,hc1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.0063,0.004,-1.432,0.152,-0.015,0.002
hostcitytime,0.9026,0.295,3.064,0.002,0.325,1.480
hostcitytime2,-0.7119,0.224,-3.184,0.001,-1.150,-0.274
hostcity,-0.0482,0.063,-0.764,0.445,-0.172,0.075
time,-0.0064,0.004,-1.460,0.144,-0.015,0.002
time2,0.0047,0.003,1.385,0.166,-0.002,0.011
lnpop_1970,0.0057,0.001,6.498,0.000,0.004,0.007
vote_frejuli,3.084e-05,4.58e-05,0.674,0.500,-5.88e-05,0.000
literacy_avg,-0.0409,0.005,-8.101,0.000,-0.051,-0.031

0,1,2,3
Omnibus:,144749.652,Durbin-Watson:,1.705
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3455809060.067
Skew:,29.28,Prob(JB):,0.0
Kurtosis:,1214.315,Cond. No.,9290.0


Notice that we use the argument `cov_type='hc1'` with the `.fit()` method.

We do so because we want the standard errors to be heteroscedasticity-robust (see https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.get_robustcov_results.html).
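To make concrete what the `hc1` option computes, here is a minimal NumPy sketch of the HC1 "sandwich" estimator next to the classical one. It is an illustration on synthetic data, not the paper's data and not the statsmodels source code; the helper `ols_hc1` and the simulated variables are made up for this example.

```python
import numpy as np

def ols_hc1(X, y):
    """Return OLS coefficients with classical and HC1 standard errors.

    X must already contain a constant column.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Classical covariance: one common error variance for all observations.
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * xtx_inv))

    # HC1 sandwich: each observation contributes its own squared residual,
    # scaled by n / (n - k) as a small-sample correction.
    meat = X.T @ (X * resid[:, None] ** 2)
    cov_hc1 = n / (n - k) * xtx_inv @ meat @ xtx_inv
    se_hc1 = np.sqrt(np.diag(cov_hc1))
    return beta, se_classical, se_hc1

# Synthetic data (an assumption for illustration): the error spread grows with x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=2000)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1 + 2.0 * x, size=2000)
X = np.column_stack([np.ones_like(x), x])

beta, se_classical, se_hc1 = ols_hc1(X, y)
print("coefficients:", beta.round(3))
print("classical SE:", se_classical.round(3))
print("HC1 SE:      ", se_hc1.round(3))
```

The coefficients are identical under both approaches; only the standard errors, and therefore the test statistics and confidence intervals, change.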

Now, we need to combine all the results into a single table.

In [7]:
results1 = m1.params
results1 = results1.to_frame(name='(1)')
results1

results2 = m2.params
results2 = results2.to_frame(name='(2)')
results2

results3 = m3.params
results3 = results3.to_frame(name='(3)')
results3

results = pd.concat([results1, results2,results3], axis=1)
results = results.round(3)

results.rename({'hostcity':'Host City',
                'hostcitytime':'Host City * Time',
                'hostcitytime2':'Host City * Time2',
                'time':'Time',
                'time2': 'Time2',
                'lnpop_1970': 'Population Size',
                'literacy_avg':'Literacy Rate',
                'vote_frejuli':'Peronist Vote Share',
                'lnrebact1974':'Rebel activity',
                'lnrepression70_77':'Past Repression', 
                'Intercept':'Constant',
                "zone1":"Military Zone 1",
                "zone2":"Military Zone 2",
                "zone3":"Military Zone 3",
                "zone4":"Military Zone 4",
                "zone5":"Military Zone 5"},inplace =True)
results = results.reindex(['Host City * Time','Host City * Time2','Host City','Time','Time2','Population Size','Literacy Rate','Peronist Vote Share','Rebel activity','Past Repression', 'Constant'])
results.fillna('')

Unnamed: 0,(1),(2),(3)
Host City * Time,0.902,0.903,0.903
Host City * Time2,-0.712,-0.712,-0.712
Host City,-0.006,-0.047,-0.048
Time,-0.006,-0.006,-0.006
Time2,0.004,0.005,0.005
Population Size,,0.002,0.006
Literacy Rate,,-0.01,-0.041
Peronist Vote Share,,-0.0,0.0
Rebel activity,,-0.0,-0.002
Past Repression,,0.007,0.005


Next, we add the standard error of each variable. The other metrics of the 3 models (R^2, observations, F-statistic) follow afterwards.

In [8]:
std1 = m1.bse
std1 = std1.to_frame(name='(1)')
std1

std2 = m2.bse
std2 = std2.to_frame(name='(2)')
std2

std3 = m3.bse
std3 = std3.to_frame(name='(3)')
std3

std = pd.concat([std1, std2,std3], axis=1)
std = std.round(3)

# std = std.duplicated(keep='first')
std.rename({'hostcity':'Host City Std',
                'hostcitytime':'Host City * Time Std',
                'hostcitytime2':'Host City * Time2 Std',
                'time':'Time Std',
                'time2': 'Time2 Std',
                'lnpop_1970': 'Population Size Std',
                'literacy_avg':'Literacy Rate Std',
                'vote_frejuli':'Peronist Vote Share Std',
                'lnrebact1974':'Rebel activity Std',
                'lnrepression70_77':'Past Repression Std', 
                'Intercept':'Constant Std'}, inplace=True)
std = std.reindex(['Host City * Time Std','Host City * Time2 Std','Host City Std','Time Std','Time2 Std','Population Size Std','Literacy Rate Std','Peronist Vote Share Std','Rebel activity Std','Past Repression Std', 'Constant Std'])
std = std.fillna('')

std['(1)'] = std['(1)'].astype(str) 
std['(1)'] = '(' + std['(1)'] + ')'

std['(2)'] = std['(2)'].astype(str) 
std['(2)'] = '(' + std['(2)'] + ')'

std['(3)'] = std['(3)'].astype(str) 
std['(3)'] = '(' + std['(3)'] + ')'

std

Unnamed: 0,(1),(2),(3)
Host City * Time Std,(0.298),(0.296),(0.295)
Host City * Time2 Std,(0.226),(0.225),(0.224)
Host City Std,(0.063),(0.064),(0.063)
Time Std,(0.004),(0.004),(0.004)
Time2 Std,(0.003),(0.003),(0.003)
Population Size Std,(),(0.0),(0.001)
Literacy Rate Std,(),(0.002),(0.005)
Peronist Vote Share Std,(),(0.0),(0.0)
Rebel activity Std,(),(0.0),(0.0)
Past Repression Std,(),(0.001),(0.001)
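The string formatting above can also be done in a single pass. The following is a sketch on a small made-up frame (`std_demo` and `in_parentheses` are hypothetical names, and the numbers are placeholders): it keeps a fixed number of decimals, so a very small standard error shows as `(0.000)` rather than `(0.0)`, and it leaves missing entries empty instead of producing `()`.

```python
import numpy as np
import pandas as pd

# Hypothetical standard errors; NaN marks a variable absent from a model.
std_demo = pd.DataFrame(
    {"(1)": [0.298, np.nan], "(2)": [0.296, 0.0004]},
    index=["Host City * Time Std", "Population Size Std"],
)

def in_parentheses(value, decimals=3):
    """Format a number as '(0.123)'; return '' for missing values."""
    if pd.isna(value):
        return ""
    return f"({value:.{decimals}f})"

# Apply the formatter to every cell, column by column.
formatted = std_demo.apply(lambda col: col.map(in_parentheses))
print(formatted)
```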


In [9]:
# results = pd.concat([std,results],axis=1)
results = pd.concat([results, std])  # DataFrame.append was removed in pandas 2.0
results = results[~results.index.duplicated()]
results = results.reindex(['Host City * Time','Host City * Time Std',
                           'Host City * Time2','Host City * Time2 Std',
                           'Host City','Host City Std',
                           'Time','Time Std',
                           'Time2','Time2 Std',
                           'Population Size','Population Size Std',
                           'Literacy Rate','Literacy Rate Std',
                           'Peronist Vote Share','Peronist Vote Share Std',
                           'Rebel activity','Rebel activity Std',
                           'Past Repression','Past Repression Std',
                           'Constant','Constant Std'])
results = results.fillna('')
results = results.replace(['nan','()'], '')
results

Unnamed: 0,(1),(2),(3)
Host City * Time,0.902,0.903,0.903
Host City * Time Std,(0.298),(0.296),(0.295)
Host City * Time2,-0.712,-0.712,-0.712
Host City * Time2 Std,(0.226),(0.225),(0.224)
Host City,-0.006,-0.047,-0.048
Host City Std,(0.063),(0.064),(0.063)
Time,-0.006,-0.006,-0.006
Time Std,(0.004),(0.004),(0.004)
Time2,0.004,0.005,0.005
Time2 Std,(0.003),(0.003),(0.003)
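The hand-written row order used above can also be generated from the coefficient labels. This is a minimal sketch, assuming the coefficient and standard-error frames share the same index; the two small frames reuse a few values from columns (1) and (2), and the names `coefs`, `ses` and `table` are illustrative only.

```python
import pandas as pd

labels = ["Host City * Time", "Host City * Time2"]
coefs = pd.DataFrame({"(1)": [0.902, -0.712], "(2)": [0.903, -0.712]}, index=labels)
ses = pd.DataFrame(
    {"(1)": ["(0.298)", "(0.226)"], "(2)": ["(0.296)", "(0.225)"]}, index=labels
)

# Give each standard-error row the "<name> Std" label used in this notebook.
ses_labeled = ses.rename(index=lambda name: f"{name} Std")

# Interleave: every coefficient row is followed by its standard-error row.
order = [label for name in coefs.index for label in (name, f"{name} Std")]
table = pd.concat([coefs, ses_labeled]).reindex(order)
print(table)
```

Generating the order this way avoids having to keep a long literal list in sync when a variable is added to or removed from a model.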


Now, it is time to add the metrics of the 3 models.



First, we extract the AIC.

In [10]:
a1=m1.aic
a2=m2.aic
a3=m3.aic

aic = [a1,a2,a3]

aic

[-100648.42982952218, -96648.02542259055, -96879.68098686342]
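As a quick sanity check, these values can be reproduced from the log-likelihoods shown in the model summaries with AIC = 2k - 2 logL, where k is the number of estimated parameters including the intercept. The sketch below hard-codes the rounded log-likelihoods from the summaries in this notebook, so it matches the reported values only up to rounding.

```python
# Rounded log-likelihoods and parameter counts (slopes + intercept)
# copied from the three model summaries in this notebook.
log_likelihood = {"(1)": 50330.0, "(2)": 48335.0, "(3)": 48455.0}
n_params = {"(1)": 6, "(2)": 11, "(3)": 15}
aic_reported = {"(1)": -100648.43, "(2)": -96648.03, "(3)": -96879.68}

# AIC = 2k - 2 * log-likelihood
aic_by_hand = {m: 2 * n_params[m] - 2 * log_likelihood[m] for m in log_likelihood}

for m in log_likelihood:
    print(m, aic_by_hand[m], aic_reported[m])
```

Note that model (1) is fitted on more observations than models (2) and (3), so its AIC is not directly comparable with theirs.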

Then, we extract the Wald χ^2.

In [11]:
id1 = ['hostcitytime', 'hostcitytime2', 'hostcity', 'time', 'time2']
id2 = ['hostcitytime', 'hostcitytime2', 'hostcity', 'time', 'time2','lnpop_1970', 'vote_frejuli', 'literacy_avg', 'lnrebact1974', 'lnrepression70_77']
id3 = ['hostcitytime', 'hostcitytime2', 'hostcity', 'time', 'time2','lnpop_1970', 'vote_frejuli', 'literacy_avg', 'lnrebact1974', 'lnrepression70_77', 'zone2', 'zone3', 'zone4', 'zone5']

all_zero1 = [x + '= 0' for x in id1]
all_zero2 = [x + '= 0' for x in id2]
all_zero3 = [x + '= 0' for x in id3]

w1=m1.wald_test(all_zero1).statistic[0][0]
w2=m2.wald_test(all_zero2).statistic[0][0]
w3=m3.wald_test(all_zero3).statistic[0][0]

wald = [w1,w2,w3]

wald




[55.248742675625756, 137.47797704517902, 147.40609347423236]
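The Wald statistics above are closely tied to the F-statistics in the model summaries: dividing each Wald χ^2 by the number of tested restrictions gives the reported F-statistic. The short check below uses only numbers printed in this notebook. The explanation that, with a robust covariance, the F-statistic is derived from the same joint test is our reading of these numbers, not something stated in the paper.

```python
# Values copied from the notebook output.
wald = [55.248742675625756, 137.47797704517902, 147.40609347423236]
n_restrictions = [5, 10, 14]          # slope coefficients tested in each model
f_reported = [11.05, 13.75, 10.53]    # F-statistics from the model summaries

# Wald chi-square divided by the number of restrictions.
f_from_wald = [round(w / q, 2) for w, q in zip(wald, n_restrictions)]
print(f_from_wald)
```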

Then, we extract the R^2.

In [12]:
rs1 = m1.rsquared
rs1

rs2 = m2.rsquared
rs2

rs3 = m3.rsquared
rs3

rs = [rs1,rs2,rs3]

Then, the F-Statistic.

In [13]:
fstat1 = m1.fvalue
fstat1

fstat2 = m2.fvalue
fstat2

fstat3 = m3.fvalue
fstat3

fstats= [fstat1,fstat2,fstat3]

Next, the number of observations for each model.

In [14]:
obs1 = m1.nobs
obs1

obs2 = m2.nobs
obs2

obs3 = m3.nobs
obs3

obs = [obs1,obs2,obs3]

We need to add the `Controls` and the `Zone FE` variables.

`Controls` shows if the control variables are used in the model.

The control variables are: `lnpop_1970, vote_frejuli, literacy_avg, lnrebact1974, lnrepression70_77`

Only the first model does not use the control variables.

In [15]:
ce1 =  'No'
ce2 =  'Yes'
ce3 =  'Yes'

Finally, we will add the zone fixed effects.

`Zone fixed effects` shows whether fixed effects for military zones are used in the model.

The fixed effect variables are: `zone2, zone3, zone4, zone5`

Only the third model uses the fixed effect variables.

In [16]:
fe1 =  'No'
fe2 =  'No'
fe3 =  'Yes'

In [17]:
c1 = [round(a1,2),round(w1,2),round(fstat1,2),round(rs1,2),round(obs1,2),ce1,fe1]
c2 = [round(a2,2),round(w2,2),round(fstat2,2),round(rs2,2),round(obs2,2),ce2,fe2]
c3 = [round(a3,2),round(w3,2),round(fstat3,2),round(rs3,2),round(obs3,2),ce3,fe3]

Now, we need to create a new dataframe with these metrics and combine it with the `results` dataframe.

In [18]:
data1 = {'(1)':c1,'(2)':c2,'(3)':c3}
indexes = ["AIC",'Wald χ^2','F-Statistic','R^2','Observations',"Controls","Zone fixed effects"]

metrics = pd.DataFrame(data1,index=indexes)
# metrics = metrics.round(2)

metrics['(1)'] = metrics['(1)'].astype(str)
metrics['(2)'] = metrics['(2)'].astype(str)
metrics['(3)'] = metrics['(3)'].astype(str)

metrics

Unnamed: 0,(1),(2),(3)
AIC,-100648.43,-96648.03,-96879.68
Wald χ^2,55.25,137.48,147.41
F-Statistic,11.05,13.75,10.53
R^2,0.04,0.05,0.05
Observations,58107.0,56394.0,56394.0
Controls,No,Yes,Yes
Zone fixed effects,No,No,Yes


Now, we combine the `metrics` dataframe with the `results` dataframe.

In [19]:
results = pd.concat([results, metrics])  # DataFrame.append was removed in pandas 2.0
results = results[~results.index.duplicated()]


results = results.reindex(['Host City * Time','Host City * Time Std',
                           'Host City * Time2','Host City * Time2 Std',
                           'Host City','Host City Std',
                           'Time','Time Std',
                           'Time2','Time2 Std',
                           'Population Size','Population Size Std',
                           'Literacy Rate','Literacy Rate Std',
                           'Peronist Vote Share','Peronist Vote Share Std',
                           'Rebel activity','Rebel activity Std',
                           'Past Repression','Past Repression Std',
                           'Constant','Constant Std',
                           "AIC",
                           'Wald χ^2',
                            'R^2','F-Statistic',
                            'Observations',
                          "Controls",
                           "Zone fixed effects"])

results

Unnamed: 0,(1),(2),(3)
Host City * Time,0.902,0.903,0.903
Host City * Time Std,(0.298),(0.296),(0.295)
Host City * Time2,-0.712,-0.712,-0.712
Host City * Time2 Std,(0.226),(0.225),(0.224)
Host City,-0.006,-0.047,-0.048
Host City Std,(0.063),(0.064),(0.063)
Time,-0.006,-0.006,-0.006
Time Std,(0.004),(0.004),(0.004)
Time2,0.004,0.005,0.005
Time2 Std,(0.003),(0.003),(0.003)


Having replicated the table and detailed the variables as in Table 1 of the paper, it is time to interpret the 3 models we ran.

Let's take a look at the hypotheses again.

* H1: In the run-up to an international sports tournament, state repression spikes in host cities, but not in other cities.

* H2: During an international sports tournament, state repression drops in host cities but remains unchanged in other cities.



In [20]:
m1.summary()

0,1,2,3
Dep. Variable:,repression,R-squared:,0.041
Model:,OLS,Adj. R-squared:,0.041
Method:,Least Squares,F-statistic:,11.05
Date:,"Mon, 19 Dec 2022",Prob (F-statistic):,1.17e-10
Time:,12:27:58,Log-Likelihood:,50330.0
No. Observations:,58107,AIC:,-100600.0
Df Residuals:,58101,BIC:,-100600.0
Df Model:,5,,
Covariance Type:,hc1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0046,0.001,3.750,0.000,0.002,0.007
hostcitytime,0.9023,0.298,3.028,0.002,0.318,1.486
hostcitytime2,-0.7117,0.226,-3.147,0.002,-1.155,-0.268
hostcity,-0.0065,0.063,-0.102,0.918,-0.131,0.118
time,-0.0061,0.004,-1.412,0.158,-0.015,0.002
time2,0.0044,0.003,1.339,0.181,-0.002,0.011

0,1,2,3
Omnibus:,150327.397,Durbin-Watson:,1.682
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3763213157.033
Skew:,29.878,Prob(JB):,0.0
Kurtosis:,1248.293,Cond. No.,201.0


### Based on the first model, the linear equation is the following:


$$ repression = 0.005 + 0.902\,hostcitytime - 0.712\,hostcitytime2 - 0.006\,hostcity - 0.006\,time + 0.004\,time2 $$



### Based on the second model, the linear equation is the following:


$$ repression = -0.007 + 0.903\,hostcitytime - 0.712\,hostcitytime2 - 0.047\,hostcity - 0.006\,time + 0.005\,time2 + 0.002\,lnpop\_1970 - 0.000051\,vote\_frejuli - 0.010\,literacy\_avg - 0.000201\,lnrebact1974 + 0.006975\,lnrepression70\_77 $$



### Based on the third model, the linear equation is the following:

$$ repression = -0.006 + 0.903\,hostcitytime - 0.712\,hostcitytime2 - 0.048\,hostcity - 0.006\,time + 0.005\,time2 + 0.006\,lnpop\_1970 + 0.00003\,vote\_frejuli - 0.041\,literacy\_avg - 0.002\,lnrebact1974 + 0.005\,lnrepression70\_77 - 0.026\,zone2 - 0.024\,zone3 - 0.017\,zone4 - 0.016\,zone5 $$

## General interpretation:

The authors formulate two hypotheses:

* *H1: In the run-up to an international sports tournament, state repression spikes in host cities, but not in other cities.*

* *H2: During an international sports tournament, state repression drops in host cities but remains unchanged in other cities.*


Based on the 3 models we ran, the results support both hypotheses.

This conclusion rests mostly on the coefficients of the variables `hostcitytime` and `hostcitytime2`, both of which are statistically significant in every model (p ≤ 0.002).

Since these variables capture both the time and the space as "dimensions" of the repression in host and nonhost cities, it is important to notice how they behave.

In the models, the coefficient of the variable `hostcitytime` is `0.902`, `0.903` and `0.903` respectively. This means that, if the variable `hostcitytime` is increased by one, the repression will **increase** by `0.902`, `0.903` and `0.903` respectively, ceteris paribus.

The same logic applies to the variable `hostcitytime2`.

In the models, the coefficient of the variable `hostcitytime2` is `-0.712` in all three cases. This means that, if the variable `hostcitytime2` is increased by one, the repression will **decrease** by `0.712`, ceteris paribus.

Moreover, it is interesting that the variable that indicates whether a city is a host for the World Cup, `hostcity`, barely affects the repression events on its own. More specifically, the coefficient of the variable `hostcity` is `-0.006`, `-0.047` and `-0.048` respectively, and it is not statistically significant in any of the models (p-values of `0.918`, `0.459` and `0.445`). This shows that the combined dynamics of space and time play a crucial role in the repression events.


**To generalise this**, we can conclude that the positive coefficient of the variable `hostcitytime` and the negative coefficient of the variable `hostcitytime2` show that repression in host cities first spiked and then dropped.
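The "spike then drop" pattern can be located numerically. The sketch below assumes that `hostcitytime` equals `hostcity * time` and `hostcitytime2` equals `hostcity * time**2`, as the variable names and table labels suggest; this is not confirmed in the notebook. Under that assumption, the host-city-specific time component is a downward-opening parabola, and its peak lies at `-b1 / (2 * b2)`. The result is expressed in whatever units the `time` variable uses in the data set.

```python
# Coefficients of model 1, as printed in its summary.
b_time = 0.9023     # hostcitytime
b_time2 = -0.7117   # hostcitytime2

def host_city_time_component(t):
    """Host-city-specific time component: b_time * t + b_time2 * t**2."""
    return b_time * t + b_time2 * t ** 2

# Vertex of the parabola: where the component stops rising and starts falling.
t_peak = -b_time / (2 * b_time2)
# Second root: where the component is back at its starting level of zero.
t_zero = -b_time / b_time2

print(f"peak at time = {t_peak:.3f}, component = {host_city_time_component(t_peak):.3f}")
print(f"back to zero at time = {t_zero:.3f}")
```

Before `t_peak` the component rises, which matches H1; after it the component falls, which matches H2.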