# Applying fix for reversal error

### The Problem

The source of the problem is a column called **TrueProbGood**. 

**TrueProbGood** is actually a misnomer of a column header in this rendition of the study, and this gives us our first hint as to the nature of the error. What it *should* be showing us is the actual probability that the stock is *bad*, after the outcome of a given trial.

The misnomer comes from study 1, in which participants were asked to guess the probability that the stock was *good*. In study 2, we asked participants to guess the probability the stock was *bad*, but neglected to invert this particular calculation before starting data collection.

### The Fix

During each session, this calculation was the comparison point which determined how much money participants would earn from their guesses, so when the error became apparent, we knew we had been unknowingly undercompensating our participants.

So, the fix was applied in the experiment code as soon as we noticed it, before we continued data collection. It was a simple reversal. After calculating the probability that the stock was *good*, we simply subtracted this value from 1 ...

... and compared participants' guesses to the new value ("TrueProbBad", if you will) to determine their payouts. We did not, however, change the name of the column, to avoid further complicating our later analyses, and so we must live with the misnomer.

We could not retroactively apply the fix to data we had already collected in the same way, obviously. That is where this notebook comes in.

### sub-220

Starting with sub-220, the fix was in. **TrueProbGood** properly represented the value participants were meant to guess, and so they were being properly rewarded for their guesses. Let's check out the data from sub-220 so we can see how that should look.

First, we'll do some initial setup ...

In [1]:
import os
import pandas as pd

# Setting up our working directory.
# All of the relevant data is in the study 2 dataset (ds2)
# We're in .../econdec/code, and our source directory is .../econdec/sourcedata/ds2
source_dir = os.path.join(os.path.abspath('..'),'sourcedata','ds2')

# Reading the raw data from sub-220
s = 'sub-220'
fpath = os.path.join(source_dir,s,s+'_task-main_beh.xlsx')
df=pd.read_excel(fpath)

Now that we're set up and we've read the main task data from sub-220 as a dataframe, let's break down the relevant columns, then we'll take a look at the data.

#### StockValue
The value output of the "stock" on a given trial. Participants saw this information on every trial, and the value here is what drives the computer's probability calculations and, hopefully, the participant's guesses. Relatively high values (+10, -2) increase the probability that the stock is good, and decrease the probability that it is bad. Relatively low values (+2, -10) have the inverse consequence.

#### TrueProbGood
As you may remember, this is the computer's calculation of the probability that the stock is bad, _after the StockValue outcome of the trial_. As the StockValue gives *good* outcomes (high gains or low losses), this value decreases. Likewise, as the StockValue gives *bad* outcomes (low gains or high losses), this value increases.

* **Remember**: this column header says "ProbGood", but the value here should represent the probability that the stock is ***BAD***.

#### ProbGood
This is the value that the participant entered when prompted to guess the probability that the stock is bad. Like TrueProbGood, this column header carries a misnomer from study 1. Participants should be increasing their estimates when the stock performs badly, and decreasing them when the stock performs well.

#### EstWithinRange?
A binary flag which represents whether the participant's guess (ProbGood) is within 5 percentage points of the computer's calculation (TrueProbGood). These flags would later be tallied to determine participants' earnings.

In [2]:
df[['StockValue','TrueProbGood','ProbGood','EstWithinRange?']]

Unnamed: 0,StockValue,TrueProbGood,ProbGood,EstWithinRange?
0,10,0.300000,30,1
1,10,0.155172,20,1
2,10,0.072973,15,0
3,10,0.032635,10,0
4,2,0.072973,17,0
5,2,0.155172,25,0
6,-10,0.700000,70,1
7,-10,0.844828,80,1
8,-10,0.927027,85,0
9,-10,0.967365,93,1


So how does the data from sub-220 look? Pretty good! Let's break it down:

On trial 0, the value of the stock was +10. This is a good outcome by any estimation. Therefore, the probability that the stock is bad should be low. Here, we see it's 30%. When asked to estimate this probability, the subject said "30". That's right on target, so they're within range, and they get a point towards their payout.

On the next few trials, the stock continues to perform well, with a total of four +10 trials in a row. In this case, the probability that the stock is bad should be steadily decreasing, which is exactly what we see. Our subject is bringing their estimations down, too, which is good! But, we see that after the first 2 trials, they're guessing too high, and not within range.

When the stock starts to perform worse, with a couple of low gains, we see the objective probability rise, along with our subject's guesses. They don't get back in that 5% range, so they're not earning anything. But it looks like we're calculating the right probabilities, and that's what we need. 

After 6 trials, the subject is dealing with a new stock, and finds themselves in a loss condition. As it turns out, this stock performs very poorly, with mounting large losses, and an increasing certainty that the stock is bad.
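The TrueProbGood values shown for sub-220 are consistent with a simple Bayesian update, so a short sketch may help in reading them. This is an inference from the numbers in the table, not the experiment code itself: it assumes a 50/50 prior for each new stock and a symmetric 0.7 likelihood (a good stock yields a good outcome 70% of the time, a bad stock yields a bad outcome 70% of the time). The names `update_prob_bad`, `prob_bad_sequence`, and `p_match` are hypothetical.

```python
def update_prob_bad(prior_bad, outcome_is_good, p_match=0.7):
    """One Bayesian update of P(stock is bad) after a single outcome.

    p_match is an ASSUMED likelihood: the chance that a good stock gives a
    good outcome (and, symmetrically, that a bad stock gives a bad outcome).
    """
    if outcome_is_good:
        like_bad, like_good = 1 - p_match, p_match
    else:
        like_bad, like_good = p_match, 1 - p_match
    numerator = like_bad * prior_bad
    return numerator / (numerator + like_good * (1 - prior_bad))


def prob_bad_sequence(outcomes, prior_bad=0.5):
    """Run the update over a list of booleans (True = good outcome)."""
    probs = []
    p = prior_bad  # ASSUMED starting point for every new stock
    for is_good in outcomes:
        p = update_prob_bad(p, is_good)
        probs.append(p)
    return probs


# sub-220's first stock: four +10 outcomes (good), then two +2 outcomes (bad)
first_stock = prob_bad_sequence([True, True, True, True, False, False])
print([round(p, 6) for p in first_stock])
```

Under these assumptions, the printed values, rounded to six places, match the first six TrueProbGood entries in the sub-220 table (0.3, 0.155172, 0.072973, 0.032635, 0.072973, 0.155172).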

### sub-215

So, now that we've seen what the data should look like, let's look at an example of data that we need to fix.

In [3]:
s = 'sub-215'
fpath = os.path.join(source_dir,s,s+'_task-main_beh.xlsx')
df = pd.read_excel(fpath)
df[['StockValue','TrueProbGood','ProbGood','EstWithinRange?']]

Unnamed: 0,StockValue,TrueProbGood,ProbGood,EstWithinRange?
0,-10,0.300000,70,0
1,-2,0.500000,50,1
2,-10,0.300000,70,0
3,-2,0.500000,50,1
4,-2,0.700000,30,0
5,-2,0.844828,30,0
6,-10,0.300000,70,0
7,-10,0.155172,70,0
8,-10,0.072973,80,0
9,-10,0.032635,85,0


sub-215 actually performed the task quite well. We can see that their estimations were in line with the probability that we were asking them for: rising when the stock performed badly, falling when it performed well. Unfortunately, we compared their guesses to an incorrect calculation in **TrueProbGood**, and they were therefore not earning money on many of the trials they should have.

More critically, many of our analyses are compromised by this error. Any analyses which depend upon:

* the stock's objective probability, or
* the rate at which subjects accurately guessed these probabilities

are working from incorrect data until this fix is applied to the subjects before sub-220.
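To make the effect on sub-215 concrete, the sketch below takes the ten rows shown in the table above and counts how many guesses fall within range, first against the recorded TrueProbGood and then against its reversal. The helper `within_range` is a hypothetical name, and the tolerance of 0.05 is taken from the description of EstWithinRange?.

```python
TOLERANCE = 0.05  # 5 percentage points, per the EstWithinRange? description


def within_range(guess_pct, true_prob, tol=TOLERANCE):
    """True if a guess given in percent is within tol of a 0-1 probability."""
    return abs(guess_pct / 100 - true_prob) <= tol


# (TrueProbGood as recorded, ProbGood guess in percent), copied from sub-215
rows = [
    (0.300000, 70), (0.500000, 50), (0.300000, 70), (0.500000, 50),
    (0.700000, 30), (0.844828, 30), (0.300000, 70), (0.155172, 70),
    (0.072973, 80), (0.032635, 85),
]

hits_before = sum(within_range(guess, prob) for prob, guess in rows)
hits_after = sum(within_range(guess, 1 - prob) for prob, guess in rows)
print(hits_before, hits_after)  # 2 6
```

On these ten trials the subject was credited with 2 in-range guesses, but would have been credited with 6 had the reversed probability been used.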

### The Post-hoc Fix

First, we'll set up a list of all the subjects we need to apply the fix to ...

In [4]:
subjs = ['sub-'+str(x) for x in range(201,220)]

Then, loop through that list to read in a list of the dataframes we want to fix ...

In [5]:
frames = []
for s in subjs:
    fpath = os.path.join(source_dir, s, s + '_task-main_beh.xlsx')
    try:
        frames.append(pd.read_excel(fpath))
    except IOError:
        # Skip subject numbers that have no data file
        continue

Now that we have all the data we need, let's take a look at one of these ...

In [6]:
frames[0][['StockValue','TrueProbGood']]

Unnamed: 0,StockValue,TrueProbGood
0,2,0.300000
1,2,0.155172
2,2,0.072973
3,10,0.155172
4,10,0.300000
5,2,0.155172
6,10,0.700000
7,2,0.500000
8,10,0.700000
9,10,0.844828


As before, we can see that the calculation being done in TrueProbGood is not what we want. It should be *increasing* when StockValue is a bad outcome, but instead it is *decreasing*. 

For thoroughness' sake, we can do a quick check of the first row of each dataframe we read in:

In [7]:
pd.concat(frames).loc[0][['SubjNum','TrialNum','StockValue','TrueProbGood']].set_index('SubjNum')

SubjNum,TrialNum,StockValue,TrueProbGood
201,1,2,0.3
202,1,-2,0.7
203,1,-10,0.3
204,1,2,0.3
205,1,-10,0.3
206,1,10,0.7
207,1,-10,0.3
208,1,10,0.7
209,1,10,0.7
210,1,10,0.7


There are only 4 possible values in StockValue: -10, -2, +2, and +10.

We know that if StockValue is a good outcome (-2 or +10) on the very first trial, the objective probability that the stock is bad should be 0.30. Likewise, if the first-trial StockValue is bad (-10 or +2), that probability should be 0.70. In the table above, every first-trial value is the reverse of what it should be.

In [8]:
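def convert_prob(row):
    # Reverse the recorded P(good) into P(bad)
    good_prob = row['TrueProbGood']
    bad_prob = (1.00 - good_prob)
    return bad_prob

def convert_check(row):
    # Assumes convert_prob has already been applied, so TrueProbGood holds P(bad)
    subj_resp = row['ProbGood']  # participant's guess, in percent
    bad_prob = row['TrueProbGood']
    # 1 if the guess is within 5 percentage points of P(bad), else 0
    new_check = 1 if (bad_prob+.05) >= (subj_resp*.01) >= (bad_prob-.05) else 0
    return new_check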
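The two helpers above work one row at a time. As a sketch of an alternative, the same reversal and range check can be written with column arithmetic, which avoids the dependence on call order. The function name `apply_fix_vectorized` and the `tol` parameter are hypothetical, and the small demo frame uses three rows copied from the sub-215 table rather than a real data file.

```python
import pandas as pd


def apply_fix_vectorized(df, tol=0.05):
    """Return a copy with TrueProbGood reversed and EstWithinRange? recomputed."""
    fixed = df.copy()
    # Reverse the recorded P(good) into P(bad) for the whole column at once
    fixed['TrueProbGood'] = 1.0 - fixed['TrueProbGood']
    # Guesses are stored in percent; convert to a 0-1 probability
    guess = fixed['ProbGood'] * 0.01
    in_range = guess.between(fixed['TrueProbGood'] - tol,
                             fixed['TrueProbGood'] + tol)
    fixed['EstWithinRange?'] = in_range.astype(int)
    return fixed


# Three rows from sub-215, as recorded before the fix
demo = pd.DataFrame({
    'TrueProbGood': [0.300000, 0.500000, 0.844828],
    'ProbGood': [70, 50, 30],
    'EstWithinRange?': [0, 1, 0],
})
demo_fixed = apply_fix_vectorized(demo)
print(demo_fixed)
```

Because the function returns a copy, the original frame is left untouched, which makes it easy to compare the flags before and after the fix.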



In [9]:
for f in frames:
    # Order matters: convert_check reads the already-reversed TrueProbGood
    f['TrueProbGood'] = f.apply(convert_prob, axis=1)
    f['EstWithinRange?'] = f.apply(convert_check, axis=1)
    # print(f[['StockValue','ProbGood','TrueProbGood','EstWithinRange?']])
    # print(f['EstWithinRange?'].sum())

pd.concat(frames).loc[0][['SubjNum','TrialNum','StockValue','TrueProbGood']].set_index('SubjNum')

SubjNum,TrialNum,StockValue,TrueProbGood
201,1,2,0.7
202,1,-2,0.3
203,1,-10,0.7
204,1,2,0.7
205,1,-10,0.7
206,1,10,0.3
207,1,-10,0.7
208,1,10,0.3
209,1,10,0.3
210,1,10,0.3


After applying `convert_prob` to each row, we can check the first trial of each frame again and see that the probability has reversed. TrueProbGood now accurately shows the objective probability that the stock is BAD.
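The first-trial rule stated earlier (0.30 after a good outcome of -2 or +10, 0.70 after a bad outcome of -10 or +2) can also be checked programmatically rather than by eye. The sketch below does this for the ten first-trial rows shown above; the helper names are hypothetical, and the values are copied from the table rather than read from the data files.

```python
GOOD_OUTCOMES = {-2, 10}   # low loss or high gain
BAD_OUTCOMES = {-10, 2}    # high loss or low gain


def expected_first_trial_prob_bad(stock_value):
    """P(stock is bad) expected after the very first outcome."""
    if stock_value in GOOD_OUTCOMES:
        return 0.30
    if stock_value in BAD_OUTCOMES:
        return 0.70
    raise ValueError('unexpected StockValue: %r' % (stock_value,))


def first_trial_ok(stock_value, prob_bad, tol=1e-6):
    """True if the recorded value matches the first-trial expectation."""
    return abs(prob_bad - expected_first_trial_prob_bad(stock_value)) <= tol


# (SubjNum, StockValue, TrueProbGood) on trial 1, after the fix
after_fix = [
    (201, 2, 0.7), (202, -2, 0.3), (203, -10, 0.7), (204, 2, 0.7),
    (205, -10, 0.7), (206, 10, 0.3), (207, -10, 0.7), (208, 10, 0.3),
    (209, 10, 0.3), (210, 10, 0.3),
]
failures = [subj for subj, value, prob in after_fix
            if not first_trial_ok(value, prob)]
print(failures)  # [] when every subject passes
```

An empty list means every first-trial value agrees with the rule. Running the same check on the pre-fix values (each probability replaced by 1 minus itself) would flag all ten subjects.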