# ROI Logistic Regression

Run a linear (ordinary least squares) regression of ROI on the Efficiency, Size, and Activity categories. This is Table 11 in the paper.

In [1]:
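import os
import pickle
import sys
import re

import pandas as pd
import statsmodels.api as sm

from research_tools import storage

# Show four decimals for magnitudes below 1, two decimals otherwise.
pd.options.display.float_format = lambda x: '{:,.4f}'.format(x) if abs(x) < 1 else '{:,.2f}'.format(x)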

# Load Data

First load the data we saved at the end of the Trader Analysis notebooks.

In [2]:
def load_pickle(filename):
    with open(os.path.join('data', filename), 'rb') as f:
        return pickle.load(f)
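
As a minimal sketch of the save/load round trip that `load_pickle` relies on, the snippet below writes a small object to a temporary directory and reads it back. The `save_pickle` and `load_pickle_from` helpers and the example object are hypothetical and only for illustration; the Trader Analysis notebooks may save their data differently.

```python
import os
import pickle
import tempfile


def save_pickle(obj, directory, filename):
    """Hypothetical counterpart to load_pickle: write obj to directory/filename."""
    with open(os.path.join(directory, filename), 'wb') as f:
        pickle.dump(obj, f)


def load_pickle_from(directory, filename):
    """Same logic as load_pickle, with the directory passed in explicitly."""
    with open(os.path.join(directory, filename), 'rb') as f:
        return pickle.load(f)


# Round-trip a small made-up object through a temporary directory.
example = {'trader_a': 'Efficient Small Inactive'}
with tempfile.TemporaryDirectory() as tmp:
    save_pickle(example, tmp, 'example.p')
    restored = load_pickle_from(tmp, 'example.p')

print(restored == example)
```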

In [3]:
os.chdir('..')

dem_trader_classifications = load_pickle('dem.trader_classifications.p')
dem_trader_stats_summary = load_pickle('dem.trader_stats_summary.p')

rep_trader_classifications = load_pickle('gop.trader_classifications.p')
rep_trader_stats_summary = load_pickle('gop.trader_stats_summary.p')

In [4]:
basename = 'dem'

dem_behavior_analysis, = storage.retrieve_all([basename + '.behavior_analysis'])

basename = 'gop'

rep_behavior_analysis, = storage.retrieve_all([basename + '.behavior_analysis'])

Reading data from data/dem.behavior_analysis.p
Reading data from data/gop.behavior_analysis.p


# What are the median PnL numbers for each market?

First, the Democrat market. The median PnL after fees is -0.1950.

In [5]:
dem_trader_stats_summary.pnl_net_fee.describe()

count    3,750.00
mean        -3.13
std        162.18
min     -1,699.58
25%         -9.20
50%       -0.1950
75%          6.31
max      1,187.54
Name: pnl_net_fee, dtype: float64

The gross PnL (before fees) has a similar distribution. Its mean is zero, as expected for a zero-sum game, and its median is -0.11.

In [6]:
dem_trader_stats_summary.gross_pnl.describe()

count    3,750.00
mean      -0.0000
std        168.75
min     -1,699.58
25%         -9.00
50%       -0.1100
75%          7.26
max      1,319.49
Name: gross_pnl, dtype: float64
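
The zero-mean property can be illustrated with a toy example. The transfers and the fee rule below are made up (the notebook does not state how fees are charged); the point is only that every gross gain is matched by an equal loss, so gross PnL averages to zero, while fees pull the net average below zero.

```python
from statistics import mean, median

# Hypothetical transfers between traders: (winner, loser, amount).
transfers = [('a', 'b', 10.0), ('c', 'a', 4.0), ('b', 'd', 2.5), ('c', 'd', 1.0)]
FEE_RATE = 0.10  # assumption: fee charged on positive gross PnL only

# Each transfer adds to the winner and subtracts the same amount from the loser.
gross = {}
for winner, loser, amount in transfers:
    gross[winner] = gross.get(winner, 0.0) + amount
    gross[loser] = gross.get(loser, 0.0) - amount

# Net PnL after the assumed fee.
net = {trader: g - FEE_RATE * max(g, 0.0) for trader, g in gross.items()}

gross_mean = mean(list(gross.values()))
net_mean = mean(list(net.values()))
gross_median = median(list(gross.values()))

print(gross_mean, net_mean, gross_median)
```

The median need not be zero even though the mean is, which is why the two statistics are reported separately above.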

In the Republican market the median PnL after fees is -0.25.

In [7]:
rep_trader_stats_summary.pnl_net_fee.describe()

count    4,452.00
mean        -5.19
std        211.91
min     -1,701.32
25%        -10.54
50%       -0.2500
75%          7.35
max      2,882.00
Name: pnl_net_fee, dtype: float64

The median gross PnL is -0.14.

In [8]:
rep_trader_stats_summary.gross_pnl.describe()

count    4,452.00
mean      -0.0000
std        224.30
min     -1,698.95
25%        -10.30
50%       -0.1400
75%          8.86
max      3,202.22
Name: gross_pnl, dtype: float64

# ROI Linear Regression

Next, a linear regression of the net ROI on indicator variables for Efficiency, Size, and Activity.

In [9]:
dem_trader_classifications.head()

| user_guid | category | efficiency | size | activity |
|---|---|---|---|---|
| 0022AC92-4A31-3308-BCB9-D94C6F507A31 | Efficient Small Inactive | Efficient | Small | Inactive |
| 00318BA5-01FC-34A4-A4A1-3523BF5485C6 | Inefficient Small Inactive | Inefficient | Small | Inactive |
| 0034C80D-C854-3C60-8F01-64B48B565AA5 | Efficient Small Inactive | Efficient | Small | Inactive |
| 005E1296-C898-3911-A4C1-0B33FAB05A29 | Inefficient Large Active | Inefficient | Large | Active |
| 005E56D2-76B6-39DA-9199-366D761FE63D | Inefficient Large Inactive | Inefficient | Large | Inactive |


First we must assemble the data.

In [10]:
def prepare_data(trader_stats_summary, trader_classifications, behavior_analysis):
    """Build one row per trader with 0/1 indicator columns and ROI columns."""
    # Join PnL with each trader's max_in_pool on the shared index.
    data = pd.merge(trader_stats_summary[['gross_pnl', 'pnl_net_fee']],
                    behavior_analysis[['max_in_pool']],
                    how='outer', left_index=True, right_index=True)

    # Indicator variables: 1 = Efficient / Large / Active, 0 otherwise.
    data['efficient'] = (trader_classifications['efficiency'] == 'Efficient').astype('int')
    data['size'] = (trader_classifications['size'] == 'Large').astype('int')
    data['active'] = (trader_classifications['activity'] == 'Active').astype('int')

    # ROI = PnL divided by max_in_pool.
    data['net_pnl_roi'] = data['pnl_net_fee'] / data['max_in_pool']
    data['gross_pnl_roi'] = data['gross_pnl'] / data['max_in_pool']

    # Keep only the indicators and the ROI columns.
    data = data.drop(columns=['gross_pnl', 'pnl_net_fee', 'max_in_pool'])

    return data

dem_data = prepare_data(dem_trader_stats_summary, dem_trader_classifications, dem_behavior_analysis)
rep_data = prepare_data(rep_trader_stats_summary, rep_trader_classifications, rep_behavior_analysis)

Assembly is easy because Python is awesome.

The value 1 is for Efficient traders, Large traders, and Active traders. The value 0 is for Inefficient traders, Small traders, and Inactive traders.

We will regress the Net ROI values on these indicator variables. ROI is normalized by the amount of money each trader put in the pool of the zero-sum game, so the values are directly comparable across traders.
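
As a small illustration of the encoding and normalization described above, here is a sketch on three made-up traders. The column names mirror the notebook, but the index labels and all values are invented for the example.

```python
import pandas as pd

# Made-up traders; values are illustrative only.
toy = pd.DataFrame(
    {
        'efficiency': ['Efficient', 'Inefficient', 'Efficient'],
        'size': ['Small', 'Large', 'Large'],
        'activity': ['Inactive', 'Active', 'Active'],
        'pnl_net_fee': [1.2, -50.0, 30.0],
        'max_in_pool': [50.0, 50.0, 120.0],
    },
    index=pd.Index(['t1', 't2', 't3'], name='user_guid'),
)

# 1 = Efficient / Large / Active, 0 = Inefficient / Small / Inactive.
indicators = pd.DataFrame(
    {
        'efficient': (toy['efficiency'] == 'Efficient').astype(int),
        'size': (toy['size'] == 'Large').astype(int),
        'active': (toy['activity'] == 'Active').astype(int),
    }
)

# ROI = net PnL divided by the amount the trader put in the pool.
indicators['net_pnl_roi'] = toy['pnl_net_fee'] / toy['max_in_pool']

print(indicators)
```

In this toy data, trader `t2` loses the entire stake, which gives an ROI of exactly -1.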

In [11]:
dem_data.head()

| user_guid | efficient | size | active | net_pnl_roi | gross_pnl_roi |
|---|---|---|---|---|---|
| 0022AC92-4A31-3308-BCB9-D94C6F507A31 | 1 | 0 | 0 | 0.024 | 0.0267 |
| 00318BA5-01FC-34A4-A4A1-3523BF5485C6 | 0 | 0 | 0 | -1.0 | -1.0 |
| 0034C80D-C854-3C60-8F01-64B48B565AA5 | 1 | 0 | 0 | -1.0 | -1.0 |
| 005E1296-C898-3911-A4C1-0B33FAB05A29 | 0 | 1 | 1 | 0.4829 | 0.5423 |
| 005E56D2-76B6-39DA-9199-366D761FE63D | 0 | 1 | 0 | 0.5795 | 0.6438 |


In [12]:
rep_data.head()

| user_guid | efficient | size | active | net_pnl_roi | gross_pnl_roi |
|---|---|---|---|---|---|
| 000DA83C-D191-3726-A4D5-4E56F0CC7F80 | 0 | 1 | 1 | -1.0 | -0.9997 |
| 00140D47-5573-38BE-8E79-AA60B9D8563D | 0 | 0 | 0 | -1.0 | -1.0 |
| 001AE9D0-FC38-3273-9DB1-973C0678E270 | 0 | 0 | 0 | -1.0 | -1.0 |
| 0026E24F-32BF-386C-A133-9C0061E04278 | 1 | 0 | 0 | -0.3525 | -0.3525 |
| 0055E87B-7662-3DC4-934F-144884375093 | 1 | 1 | 0 | -0.1602 | -0.1509 |


Now perform the linear regression for the Democrat market using the Python statsmodels library.

The F-statistic is 55.88 (p = 2.47e-35), so the three indicators are jointly significant, although the R-squared is only 4.3% (adjusted R^2 4.2%).

The Efficiency, Size, and Active coefficients are 0.1435, 0.1976, and 0.1331 and are all statistically significant at the 0.01 level.

The model explains only a small share of the variance in ROI, but all three effects are clearly distinguishable from zero.

In [13]:
X = dem_data[['efficient', 'size', 'active']]
y = dem_data['net_pnl_roi']

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary(alpha=0.01))

                            OLS Regression Results                            
Dep. Variable:            net_pnl_roi   R-squared:                       0.043
Model:                            OLS   Adj. R-squared:                  0.042
Method:                 Least Squares   F-statistic:                     55.88
Date:                Wed, 17 Jul 2019   Prob (F-statistic):           2.47e-35
Time:                        19:20:17   Log-Likelihood:                -3588.2
No. Observations:                3750   AIC:                             7184.
Df Residuals:                    3746   BIC:                             7209.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.005      0.995]
------------------------------------------------------------------------------
const         -0.3482      0.014    -24.714      0.0
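
To make the quantities in the summary table concrete, here is a rough sketch that fits the same kind of model with plain NumPy on synthetic data. The "true" coefficients, the noise level, and the sample size are assumptions chosen for illustration; this is not the notebook's data and it does not reproduce the statsmodels output above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic 0/1 indicators and a response built from assumed coefficients.
X = rng.integers(0, 2, size=(n, 3)).astype(float)
true_beta = np.array([-0.35, 0.15, 0.20, 0.13])  # const, efficient, size, active
design = np.column_stack([np.ones(n), X])         # add the constant column
y = design @ true_beta + rng.normal(0.0, 0.5, size=n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Standard errors and t-statistics under the usual nonrobust assumptions.
resid = y - design @ beta_hat
dof = n - design.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))
t_stats = beta_hat / std_err

# Share of the variance in y explained by the model.
r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(beta_hat, std_err, t_stats, r_squared)
```

With 0/1 indicators, the constant is the expected ROI of an Inefficient, Small, Inactive trader, and each coefficient is the shift in expected ROI from switching one indicator to 1 while holding the others fixed.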

And now the Republican market.

For this data the coefficients are 0.1504, 0.0148, and 0.0989 for Efficiency, Size, and Activity, respectively.

At the 0.01 level only the Efficiency variable is significant. At the 0.05 level Efficiency and Activity are.

The adjusted R^2 is much lower, at 0.6% versus 4.2% for the Democrat market.

In [14]:
X = rep_data[['efficient', 'size', 'active']]
y = rep_data['net_pnl_roi']

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary(alpha=0.01))

                            OLS Regression Results                            
Dep. Variable:            net_pnl_roi   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     9.762
Date:                Wed, 17 Jul 2019   Prob (F-statistic):           2.04e-06
Time:                        19:20:17   Log-Likelihood:                -5898.9
No. Observations:                4452   AIC:                         1.181e+04
Df Residuals:                    4448   BIC:                         1.183e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.005      0.995]
------------------------------------------------------------------------------
const         -0.1110      0.019     -5.950      0.0
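
The adjusted R-squared values in the two summaries can be cross-checked from the reported R-squared, the number of observations, and the number of regressors, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below starts from the rounded R-squared values printed above, so it only agrees to the precision shown; the function name is ours.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared; n_predictors excludes the constant."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Inputs taken from the two regression summaries (R-squared as printed, rounded).
dem_adj = adjusted_r_squared(0.043, 3750, 3)
rep_adj = adjusted_r_squared(0.007, 4452, 3)

print(round(dem_adj, 3), round(rep_adj, 3))
```

With this many observations and only three regressors, the adjustment is small in both markets.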

These results are discussed in the paper.