**Author:** Jim Maddock  
**Created:** 8-15-22  
**Description:** OLS and Autoregressive Distributed Lag model (ARDL) results for Momentum.  Includes dataframes for RQ 1 (the relationship between readership and new editors) and RQ 2 (the relationship between active editors and content creation).  For a methods overview see [this document](https://docs.google.com/document/d/1FoAv1xFfmtMPX7PC33XZBSYaZGM0Lf5RBaGpFSCkRVk/edit?usp=sharing)

In [83]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import statsmodels as sm
import statsmodels.api  # explicitly load the `api` submodule so `sm.api.OLS` resolves
from statsmodels.tsa.api import ARDL
from statsmodels.tsa.ardl import ardl_select_order

In [84]:
FILEPATH = '~/datasets/momentum/pageview_new_accounts_8-7-22.csv'

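# Load monthly pageview and new-account counts.
pageview_accounts_df = pd.read_csv(FILEPATH)

# Drop the most recent month (the row with the largest wiki_age).
pageview_accounts_df = pageview_accounts_df.loc[pageview_accounts_df['wiki_age'] != pageview_accounts_df['wiki_age'].max()]
# One-hot encode the calendar month so each month can serve as a seasonal control.
pageview_accounts_df = pd.concat((pageview_accounts_df,pd.get_dummies(pageview_accounts_df['month'],prefix='month')),axis=1)
# Add an intercept column named `const`.
pageview_accounts_df = sm.tools.add_constant(pageview_accounts_df)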

In [85]:
pageview_accounts_df

Unnamed: 0,const,month,year,num_pageviews,num_new_accounts,num_articles,wiki_age,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,1.0,5,2015,6113855894,66099,19464270,5,0,0,0,0,1,0,0,0,0,0,0,0
1,1.0,6,2015,5943640901,62862,19563376,6,0,0,0,0,0,1,0,0,0,0,0,0
2,1.0,7,2015,6046735010,62370,19834768,7,0,0,0,0,0,0,1,0,0,0,0,0
3,1.0,8,2015,6021283659,62402,20071547,8,0,0,0,0,0,0,0,1,0,0,0,0
4,1.0,9,2015,6059240560,65628,20167477,9,0,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,1.0,3,2022,6222799117,49194,33646356,87,0,0,1,0,0,0,0,0,0,0,0,0
83,1.0,4,2022,5920294077,44585,33740626,88,0,0,0,1,0,0,0,0,0,0,0,0
84,1.0,5,2022,5982361801,43647,33853815,89,0,0,0,0,1,0,0,0,0,0,0,0
85,1.0,6,2022,5700627282,41161,33957286,90,0,0,0,0,0,1,0,0,0,0,0,0


# RQ 1 (Readers -> Editors) Descriptive Stats

## Model 1.1

Our initial model estimates the relationship between monthly pageviews and the number of new accounts created per month. Specifically, Model 1.1 explores whether that relationship remains relatively consistent throughout our 87-month dataset, controlling for the age of the language edition and seasonality. Model 1.1 does not include autoregressive or distributed lag components, so it cannot explain the relationship between new account creation and past increases or decreases in monthly pageviews.

Results from Model 1.1 show a significant but small relationship between monthly pageviews and new accounts. The positive coefficient of .000007672 indicates that for every 130,344 additional pageviews in a given month, we can, on average, expect one additional new account during that month. The relationship between new accounts and wiki age is significant but negative: for every additional month of wiki age, we can expect a decrease of roughly 261 new accounts created. Some of our month controls are significant at p < .05, indicating possible seasonality effects. That is, over our 87-month dataset, new account creation during some months is consistently different from account creation during January, our baseline month.
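
The per-unit interpretations above follow directly from the fitted coefficients. A minimal sketch of the arithmetic, using the rounded values from the Model 1.1 summary (the variable names are illustrative and not part of the original notebook):

```python
# Rounded coefficients as reported in the Model 1.1 OLS summary.
coef_pageviews = 7.672e-06   # new accounts per additional pageview
coef_wiki_age = -261.2704    # new accounts per additional month of wiki age

# Pageviews associated with one additional new account (reciprocal of the slope).
pageviews_per_account = 1 / coef_pageviews

# Expected change in monthly new accounts after one additional year of wiki age,
# holding pageviews and month constant.
change_per_year = 12 * coef_wiki_age

print(f"{pageviews_per_account:,.0f} pageviews per additional new account")
print(f"{change_per_year:,.0f} new accounts per additional year of wiki age")
```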

Since we are predominantly interested in the relationship between pageviews and new account creation, it bears repeating that without lagged variables Model 1.1 does not indicate that an increase in pageviews leads to a subsequent increase in new accounts.  More likely, exogenous factors that lead to an increase in pageviews *also* result in an increase in new accounts.

In [87]:
y = pageview_accounts_df['num_new_accounts']
X = pageview_accounts_df[['num_pageviews','wiki_age','month_2','month_3','month_4','month_5','month_6','month_7','month_8','month_9','month_10','month_11','month_12','const']]

model = sm.api.OLS(y, X)
results = model.fit()
results.summary()

Dep. Variable:,num_new_accounts,R-squared:,0.868
Model:,OLS,Adj. R-squared:,0.845
Method:,Least Squares,F-statistic:,37.03
Date:,"Thu, 11 Aug 2022",Prob (F-statistic):,7.44e-27
Time:,20:00:54,Log-Likelihood:,-818.58
No. Observations:,87,AIC:,1665.0
Df Residuals:,73,BIC:,1700.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
num_pageviews,7.672e-06,1.25e-06,6.138,0.000,5.18e-06,1.02e-05
wiki_age,-261.2704,13.961,-18.714,0.000,-289.094,-233.446
month_2,1505.2014,1927.799,0.781,0.437,-2336.896,5347.299
month_3,2554.4221,1756.777,1.454,0.150,-946.830,6055.675
month_4,-766.1085,1786.436,-0.429,0.669,-4326.471,2794.254
month_5,-2592.8284,1735.346,-1.494,0.139,-6051.368,865.711
month_6,-5066.5222,1959.820,-2.585,0.012,-8972.438,-1160.606
month_7,-6982.8369,1845.844,-3.783,0.000,-1.07e+04,-3304.076
month_8,-5177.4077,1885.748,-2.746,0.008,-8935.697,-1419.118

Omnibus:,1.378,Durbin-Watson:,0.566
Prob(Omnibus):,0.502,Jarque-Bera (JB):,1.138
Skew:,-0.043,Prob(JB):,0.566
Kurtosis:,2.446,Cond. No.,164000000000.0


In [99]:
y = pageview_accounts_df['num_new_accounts']
X = pageview_accounts_df[['num_pageviews']]
fixed_X = pageview_accounts_df[['wiki_age','month_2','month_3','month_4','month_5','month_6','month_7','month_8','month_9','month_10','month_11','month_12']]

# Search autoregressive lags up to 12 and distributed lags up to 12, ranking candidates by BIC.
sel_res = ardl_select_order(
    y, 12, X, 12, ic="bic", trend="c", fixed=fixed_X
)

for i, val in enumerate(sel_res.bic.head(10)):
    print(f"{i+1}: {val}")

1: (1, {'num_pageviews': 0})
2: (2, {'num_pageviews': 0})
3: (1, {'num_pageviews': 1})
4: (3, {'num_pageviews': 0})
5: (2, {'num_pageviews': 1})
6: (1, {'num_pageviews': 2})
7: (3, {'num_pageviews': 1})
8: (4, {'num_pageviews': 0})
9: (2, {'num_pageviews': 2})
10: (1, {'num_pageviews': 3})


## Model 1.2

Model 1.2 extends Model 1.1 by adding autoregressive and distributed lag components. Whereas Model 1.1 only illustrates the relationship between pageviews and new accounts at a specific time t, Model 1.2 explores the relationship between pageviews and new accounts at times t to t-n. Statsmodels' ardl_select_order() function indicates that an autoregressive order of 1 and a distributed lag order of 0 result in the lowest BIC. The resulting BIC of 1611.933 for Model 1.2 is indeed lower than Model 1.1's BIC of roughly 1700, indicating better model fit despite the increased number of model parameters.
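
As a rough sanity check on this comparison, both reported BIC values can be approximately reproduced from the log-likelihoods in the summaries using BIC = k ln(n) - 2 ln(L). The parameter counts and sample sizes below are assumptions inferred from the model specifications, chosen because they reproduce the reported figures; they are not stated in the output. Model 1.1 is taken to have 14 coefficients over 87 observations, and Model 1.2 to have 15 coefficients plus the innovation variance over 86 usable observations (one is lost to the lag).

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Model 1.1 (OLS): log-likelihood from the summary; 14 coefficients (assumed count).
bic_model_1_1 = bic(-818.58, n_params=14, n_obs=87)

# Model 1.2 (ARDL(1, 0)): 15 coefficients + innovation variance, 86 usable rows (assumed).
bic_model_1_2 = bic(-770.332, n_params=16, n_obs=86)

print(round(bic_model_1_1, 1), round(bic_model_1_2, 1))
```

Under these assumptions the two criteria are computed on slightly different samples (87 versus 86 observations), so the comparison is best read as approximate.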

Model parameterization alone indicates two notable results. First, the absence of a lagged pageviews term indicates that there is no lagged temporal relationship between pageviews and new accounts. In other words, while an increase in pageviews during a given month on average corresponds to an increase in new accounts during that same month, an increase in pageviews during a previous month *does not* correspond to a subsequent increase in new accounts during the current month. Second, the first-order autoregressive term indicates that an increase in new accounts during the previous month does, on average, correspond to an increase in new accounts within the current month. Our selected model does include a zero-order (i.e., not lagged) term for pageviews with a small but positive coefficient, which indicates that the relationship between monthly pageviews and new accounts illustrated in Model 1.1 holds after adding the autoregressive term. Both *NewAccounts<sub>t-1</sub>* and *Pageviews<sub>t0</sub>* are statistically significant at p < .01 and positive, indicating that increases in either variable correspond to an increase in *NewAccounts<sub>t0</sub>*.
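
Written out, the selected ARDL(1, 0) specification corresponds to an equation of roughly the following form. The symbols ($c$, $\phi_1$, $\beta_0$, $\gamma$, $\delta_m$, $\varepsilon_t$) are notation introduced here for illustration; the notebook itself reports only the fitted coefficients.

$$
\text{NewAccounts}_t = c + \phi_1\,\text{NewAccounts}_{t-1} + \beta_0\,\text{Pageviews}_t + \gamma\,\text{WikiAge}_t + \sum_{m=2}^{12} \delta_m\,\text{Month}_{m,t} + \varepsilon_t
$$

Here $\phi_1$ is the autoregressive term, $\beta_0$ is the same-month pageviews term, and January ($m = 1$) is the omitted baseline month. No $\text{Pageviews}_{t-k}$ terms with $k \ge 1$ appear because the selected distributed lag order is 0.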

Our *WikiAge* control remains statistically significant and negative, but the effect size decreases between Model 1.1 and Model 1.2 (from roughly -261 to -95 new accounts per month of wiki age). In contrast to Model 1.1, after we add our autoregressive component all of our seasonal controls are significant at p < .05, indicating a strong relationship between seasonality and new account creation. Interestingly, all of these coefficients are also negative, indicating that, holding the other variables constant, new account creation is highest in our baseline month of January.
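
To make the baseline interpretation concrete, the sketch below treats each month coefficient as an offset relative to January, whose offset is zero by construction because `month_1` is omitted from the regressors. Only the month coefficients visible in the Model 1.2 summary are included, and the helper name `seasonal_offset` is hypothetical.

```python
# Month coefficients visible in the Model 1.2 summary (February through June).
# January (month 1) is the omitted baseline category.
month_coefs = {
    2: -7669.5659,
    3: -2665.9155,
    4: -9168.4442,
    5: -7675.3087,
    6: -1.098e4,
}

def seasonal_offset(month: int) -> float:
    """Expected difference in new accounts versus January, other terms held fixed."""
    if month == 1:
        return 0.0  # baseline month: no dummy variable, so no offset
    return month_coefs[month]  # raises KeyError for months not shown in the summary

for m in range(1, 7):
    print(m, seasonal_offset(m))
```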

Taken together, these results indicate that there is a significant relationship between *Pageviews* and *NewAccounts* during the same time window (i.e., t = 0), but no relationship between *Pageviews<sub>t-n</sub>* and *NewAccounts<sub>t0</sub>*. That is, while exogenous factors may cause the number of pageviews and new accounts to increase or decrease similarly during a given month, past increases in pageviews *do not* seem to correspond to future increases in new account creation.
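
As an illustration of how the Model 1.2 terms combine, the sketch below computes the fitted value for June 2022 from the rounded coefficients in the Model 1.2 summary and the last two rows of the dataframe displayed earlier. The function name `predict_new_accounts` is hypothetical, and because the inputs are rounded the result is only approximate.

```python
# Rounded coefficients from the Model 1.2 (ARDL(1, 0)) summary.
CONST = 3637.5536
AR_1 = 0.6710             # num_new_accounts.L1
PAGEVIEWS_L0 = 4.331e-06  # num_pageviews.L0
WIKI_AGE = -95.0671
MONTH_6 = -1.098e4        # June offset relative to January

def predict_new_accounts(prev_accounts, pageviews, wiki_age, month_offset):
    """One-step fitted value: constant + AR term + same-month pageviews + controls."""
    return (CONST + AR_1 * prev_accounts + PAGEVIEWS_L0 * pageviews
            + WIKI_AGE * wiki_age + month_offset)

# June 2022 row; the previous month (May 2022) had 43,647 new accounts.
fitted = predict_new_accounts(
    prev_accounts=43647, pageviews=5700627282, wiki_age=90, month_offset=MONTH_6
)
observed = 41161
print(f"fitted={fitted:,.0f}, observed={observed:,}, residual={observed - fitted:,.0f}")
```

Note that the previous month's pageviews never enter the calculation; only the previous month's new accounts do, which is what the selected lag orders imply.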

In [100]:
res = sel_res.model.fit()
res.summary()

Dep. Variable:,num_new_accounts,No. Observations:,87.0
Model:,"ARDL(1, 0)",Log Likelihood,-770.332
Method:,Conditional MLE,S.D. of innovations,1878.839
Date:,"Mon, 15 Aug 2022",AIC,1572.664
Time:,18:29:12,BIC,1611.933
Sample:,1,HQIC,1588.468
,87,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,3637.5536,5946.577,0.612,0.543,-8219.585,1.55e+04
num_new_accounts.L1,0.6710,0.066,10.150,0.000,0.539,0.803
num_pageviews.L0,4.331e-06,8.65e-07,5.005,0.000,2.61e-06,6.06e-06
wiki_age,-95.0671,18.991,-5.006,0.000,-132.934,-57.201
month_2,-7669.5659,1528.238,-5.019,0.000,-1.07e+04,-4622.347
month_3,-2665.9155,1238.208,-2.153,0.035,-5134.831,-197.000
month_4,-9168.4442,1412.175,-6.492,0.000,-1.2e+04,-6352.648
month_5,-7675.3087,1265.745,-6.064,0.000,-1.02e+04,-5151.484
month_6,-1.098e+04,1382.952,-7.937,0.000,-1.37e+04,-8218.732


In [41]:
FILEPATH = '/home/jmads/datasets/momentum/active_editors_content_added_8-7-22.csv'

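# Load monthly active-editor and bytes-added counts.
editors_content_df = pd.read_csv(FILEPATH)
# Drop the most recent month (the row with the largest wiki_age).
editors_content_df = editors_content_df.loc[editors_content_df['wiki_age'] != editors_content_df['wiki_age'].max()]
# One-hot encode the calendar month, then add an intercept column named `const`.
editors_content_df = pd.concat((editors_content_df,pd.get_dummies(editors_content_df['month'],prefix='month')),axis=1)
editors_content_df = sm.tools.add_constant(editors_content_df)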

# RQ 2 (Editors -> Content) Descriptive Stats

## Model 2.1

Similar to Model 1.1, Model 2.1 illustrates the relationship between the number of active editors and bytes added without distributed lag or autoregressive components, controlling for both the wiki's age and seasonality. The lack of distributed lag or autoregressive variables means that we cannot determine whether past active editor counts or bytes added have any relationship to future counts of bytes added.

Both *ActiveEditors* and *WikiAge* are significant in Model 2.1. The positive coefficient of *ActiveEditors* indicates that for every one-editor increase or decrease in active editors, we can expect a corresponding 24,290-byte increase or decrease in content added for that month. Conversely, the negative coefficient on *WikiAge* indicates that for every month the wiki ages, we can expect on average an 11,350,000-byte decrease in content added. None of the coefficients on our *Month* variables are statistically significant, which indicates that seasonality does not play much of a role in the amount of content added.

Broadly, these results show a discernible relationship between the amount of content added and the number of active editors. Intuitively this makes sense: while a given editor might not always produce the same amount of content from month to month, on average increasing the number of editors producing content will also increase the amount of content produced for a given month.

In [42]:
y = editors_content_df['num_bytes_added']
X = editors_content_df[['num_active_editors','wiki_age','month_2','month_3','month_4','month_5','month_6','month_7','month_8','month_9','month_10','month_11','month_12','const']]

model = sm.api.OLS(y, X)
results = model.fit()
results.summary()

Dep. Variable:,num_bytes_added,R-squared:,0.833
Model:,OLS,Adj. R-squared:,0.824
Method:,Least Squares,F-statistic:,93.77
Date:,"Thu, 11 Aug 2022",Prob (F-statistic):,3.52e-87
Time:,18:55:15,Log-Likelihood:,-5535.6
No. Observations:,259,AIC:,11100.0
Df Residuals:,245,BIC:,11150.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
num_active_editors,2.429e+04,699.392,34.731,0.000,2.29e+04,2.57e+04
wiki_age,-1.135e+07,4.93e+05,-23.044,0.000,-1.23e+07,-1.04e+07
month_2,1.486e+07,1.44e+08,0.103,0.918,-2.68e+08,2.98e+08
month_3,8.407e+06,1.44e+08,0.059,0.953,-2.75e+08,2.91e+08
month_4,6.895e+07,1.44e+08,0.480,0.632,-2.14e+08,3.52e+08
month_5,1.543e+08,1.44e+08,1.074,0.284,-1.29e+08,4.37e+08
month_6,2.217e+08,1.44e+08,1.542,0.124,-6.15e+07,5.05e+08
month_7,1.191e+08,1.44e+08,0.829,0.408,-1.64e+08,4.02e+08
month_8,8.517e+07,1.45e+08,0.586,0.558,-2.01e+08,3.72e+08

Omnibus:,33.172,Durbin-Watson:,0.165
Prob(Omnibus):,0.0,Jarque-Bera (JB):,54.591
Skew:,0.738,Prob(JB):,1.4e-12
Kurtosis:,4.696,Cond. No.,1390000.0


In [91]:
y = editors_content_df['num_bytes_added']
X = editors_content_df[['num_active_editors']]
fixed_X = editors_content_df[['wiki_age','month_2','month_3','month_4','month_5','month_6','month_7','month_8','month_9','month_10','month_11','month_12']]

# Search autoregressive lags up to 12 and distributed lags up to 12, ranking candidates by BIC.
sel_res = ardl_select_order(
    y, 12, X, 12, ic="bic", trend="c", fixed=fixed_X
)

for i, val in enumerate(sel_res.bic.head(10)):
    print(f"{i+1}: {val}")

1: (1, {'num_active_editors': 1})
2: (1, {'num_active_editors': 2})
3: (1, {'num_active_editors': 3})
4: (1, {'num_active_editors': 4})
5: (2, {'num_active_editors': 1})
6: (2, {'num_active_editors': 2})
7: (2, {'num_active_editors': 3})
8: (2, {'num_active_editors': 4})
9: (1, {'num_active_editors': 5})
10: (3, {'num_active_editors': 1})


## Model 2.2

Model 2.2 builds on Model 2.1, adding both autoregressive and distributed lag variables. Whereas the ARDL components of Model 1.2 are relatively intuitive (i.e., a past increase in readership could lead to a future increase in new editors), the ARDL components of Model 2.2 are less so. A lagged *ActiveEditors* variable would indicate that an increase or decrease in the number of active editors in past months correlates with an increase or decrease in bytes added in some future month. An autoregressive variable would indicate that a change in bytes added during some past month correlates with an increase or decrease in bytes added during some future month.

Again using Statsmodels' ardl_select_order() function, we determine that orders of 1 on both the autoregressive and distributed lag variables minimize BIC. This results in a BIC of 10647.598, which is lower than Model 2.1's BIC of roughly 11150, indicating improved model fit.
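
For reference, an ARDL(1, 1) specification of this kind can be written roughly as follows. As before, the symbols ($c$, $\phi_1$, $\beta_0$, $\beta_1$, $\gamma$, $\delta_m$, $\varepsilon_t$) are illustrative notation rather than anything defined in the notebook.

$$
\text{BytesAdded}_t = c + \phi_1\,\text{BytesAdded}_{t-1} + \beta_0\,\text{ActiveEditors}_t + \beta_1\,\text{ActiveEditors}_{t-1} + \gamma\,\text{WikiAge}_t + \sum_{m=2}^{12} \delta_m\,\text{Month}_{m,t} + \varepsilon_t
$$

Here $\phi_1$ is the autoregressive term on bytes added, $\beta_0$ and $\beta_1$ are the same-month and one-month-lagged active editor terms, and January ($m = 1$) is again the omitted baseline month.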

In [92]:
res = sel_res.model.fit()
res.summary()

Dep. Variable:,num_bytes_added,No. Observations:,259.0
Model:,"ARDL(1, 1)",Log Likelihood,-5276.599
Method:,Conditional MLE,S.D. of innovations,184469491.168
Date:,"Thu, 11 Aug 2022",AIC,10587.198
Time:,20:03:04,BIC,10647.598
Sample:,1,HQIC,10611.485
,259,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,-3.527e+07,5.57e+07,-0.633,0.527,-1.45e+08,7.44e+07
num_bytes_added.L1,0.9249,0.026,35.429,0.000,0.873,0.976
num_active_editors.L0,2.241e+04,2858.770,7.840,0.000,1.68e+04,2.8e+04
num_active_editors.L1,-2.086e+04,2814.212,-7.411,0.000,-2.64e+04,-1.53e+04
wiki_age,-7.14e+05,3.57e+05,-2.001,0.047,-1.42e+06,-1.11e+04
month_2,6.632e+07,6.59e+07,1.006,0.315,-6.35e+07,1.96e+08
month_3,6.173e+07,5.85e+07,1.056,0.292,-5.34e+07,1.77e+08
month_4,1.118e+08,6.7e+07,1.669,0.096,-2.01e+07,2.44e+08
month_5,1.461e+08,6.31e+07,2.318,0.021,2.19e+07,2.7e+08
