# ROI Linear Regression

Fit a linear (OLS) regression of ROI on indicator variables for the Efficiency, Size, and Activity categories. This is Table 11 in the paper.

In [None]:
import os
import pickle
import sys
import re

import pandas as pd
import statsmodels.api as sm

from research_tools import storage

pd.options.display.float_format = lambda x: '{:,.4f}'.format(x) if abs(x) < 1 else '{:,.2f}'.format(x)

# Load Data

First load the data we saved at the end of the Trader Analysis notebooks.

In [None]:
def load_pickle(filename):
    with open(os.path.join('data', filename), 'rb') as f:
        return pickle.load(f)

In [None]:
os.chdir('..')

dem_trader_classifications = load_pickle('dem.trader_classifications.p')
dem_trader_stats_summary = load_pickle('dem.trader_stats_summary.p')

rep_trader_classifications = load_pickle('gop.trader_classifications.p')
rep_trader_stats_summary = load_pickle('gop.trader_stats_summary.p')

In [None]:
basename = 'dem'

dem_behavior_analysis, = storage.retrieve_all([basename + '.behavior_analysis'])

basename = 'gop'

rep_behavior_analysis, = storage.retrieve_all([basename + '.behavior_analysis'])

# What are the median PnL numbers for each market?

First, the Democrat market. The median PnL after fees is -0.1950.

In [None]:
dem_trader_stats_summary.pnl_net_fee.describe()

The gross PnL (before fees) tells a similar story. Its mean is zero, as expected for a zero-sum game, and its median is -0.11.

In [None]:
dem_trader_stats_summary.gross_pnl.describe()
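A zero mean alongside a negative median is what a skewed zero-sum market looks like: a few large winners are funded by many small losers. The sketch below illustrates this with invented numbers; `toy_gross_pnl` is a hypothetical series, not the notebook's data.

```python
import pandas as pd

# Hypothetical gross PnL for five traders in a zero-sum market:
# one large winner is funded by four small losers.
toy_gross_pnl = pd.Series([4.0, -1.0, -1.0, -1.0, -1.0],
                          index=['t1', 't2', 't3', 't4', 't5'])

toy_mean = toy_gross_pnl.mean()      # gains and losses cancel exactly
toy_median = toy_gross_pnl.median()  # the typical trader loses

print(toy_mean, toy_median)
```

The mean reflects the zero-sum constraint, while the median describes the typical trader, which is why the notebook reports both.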

In the Republican market the mean PnL after fees is -0.25.

In [None]:
rep_trader_stats_summary.pnl_net_fee.describe()

The median gross PnL is -0.14.

In [None]:
rep_trader_stats_summary.gross_pnl.describe()

# ROI Linear Regression

Next, a linear regression relating indicator variables for Efficiency, Size, and Activity with the net ROI percentages.

In [None]:
dem_trader_classifications.head()

First we must assemble the data.

In [None]:
def prepare_data(trader_stats_summary, trader_classifications, behavior_analysis):
    data = pd.merge(trader_stats_summary[['gross_pnl', 'pnl_net_fee']],
                    behavior_analysis[['max_in_pool']],
                    how='outer', left_index=True, right_index=True)
    
    data['efficient'] = (trader_classifications['efficiency'] == 'Efficient').astype('int')
    data['size'] = (trader_classifications['size'] == 'Large').astype('int')
    data['active'] = (trader_classifications['activity'] == 'Active').astype('int')

    data['net_pnl_roi'] = data['pnl_net_fee'] / data['max_in_pool']
    data['gross_pnl_roi'] = data['gross_pnl'] / data['max_in_pool']

    data.drop(['gross_pnl', 'pnl_net_fee', 'max_in_pool'], inplace=True, axis=1)
    
    return data

dem_data = prepare_data(dem_trader_stats_summary, dem_trader_classifications, dem_behavior_analysis)
rep_data = prepare_data(rep_trader_stats_summary, rep_trader_classifications, rep_behavior_analysis)

Assembly takes only a few lines because pandas aligns the tables on their shared trader index.

The value 1 is for Efficient traders, Large traders, and Active traders. The value 0 is for Inefficient traders, Small traders, and Inactive traders.

We will regress these indicator variables against the Net ROI values. These numbers are normalized by the amount of money each trader put in the pool of the zero-sum game so they are nicely comparable to each other.
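To make the encoding and the normalization concrete, here is a minimal sketch on invented data. The frames `toy_stats` and `toy_classes` and all their values are hypothetical; only the column names and the two transformations mirror `prepare_data`.

```python
import pandas as pd

# Hypothetical traders; all values are invented for illustration.
toy_stats = pd.DataFrame(
    {'pnl_net_fee': [5.0, -2.0, -10.0], 'max_in_pool': [50.0, 10.0, 200.0]},
    index=['t1', 't2', 't3'])
toy_classes = pd.DataFrame(
    {'size': ['Small', 'Small', 'Large']}, index=['t1', 't2', 't3'])

# Same encoding as prepare_data: 1 for 'Large', 0 otherwise.
toy_stats['size'] = (toy_classes['size'] == 'Large').astype('int')

# Dividing by the amount in the pool puts every trader on the same scale.
toy_stats['net_pnl_roi'] = toy_stats['pnl_net_fee'] / toy_stats['max_in_pool']

print(toy_stats[['size', 'net_pnl_roi']])
```

In this toy example the large trader loses the most in absolute terms, yet its ROI is less negative than that of the small trader `t2`, which is the kind of comparison the normalization is meant to allow.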

In [None]:
dem_data.head()

In [None]:
rep_data.head()

Now perform the linear regression for the Democrat market using the Python statsmodels library.

The Efficiency, Size, and Activity coefficients are 0.1435, 0.1976, and 0.1331, and all three are statistically significant at the 0.01 level. The F-statistic is high, indicating that the indicators are jointly significant.

The R-squared is 4.3% and the adjusted R^2 is 4.2%, so the model explains only a small share of the variation in net ROI even though the individual effects are clearly estimated.

In [None]:
X = dem_data[['efficient', 'size', 'active']]
y = dem_data['net_pnl_roi']

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary(alpha=0.01))

And now the Republican market.

For this data the coefficients are 0.1504, 0.0148, and 0.0989 for Efficiency, Size, and Activity, respectively.

At the 0.01 level only the Efficiency variable is significant. At the 0.05 level both Efficiency and Activity are.

The adjusted R^2 is much lower than in the Democrat market.

In [None]:
X = rep_data[['efficient', 'size', 'active']]
y = rep_data['net_pnl_roi']

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary(alpha=0.01))

The results are discussed in the paper.