# Psychometric Comparison of Delay Discounting Measures
### A Python Replication of Wan et al. (2023), *Behavioural Processes*

---

### Project Objective

This notebook provides a complete, reproducible Python workflow for the analyses presented in the following publication:

> Wan, H., Myerson, J., & Green, L. (2023). Individual differences in degree of discounting: Do different procedures and measures assess the same construct?. *Behavioural Processes*, *208*, 104864. https://doi.org/10.1016/j.beproc.2023.104864

The central research question is methodological: do the two most prominent procedures for measuring financial patience—the **Adjusting-Amount (Adj-Amt)** procedure and the **Monetary Choice Questionnaire (MCQ)**—actually measure the same underlying construct? This analysis tests the convergent validity of these two measurement tools using data from two large online samples.

### Analysis Workflow

The analysis is structured into the following sections:

1.  **Setup & Data Processing**: Loads libraries, defines helper functions, and processes the raw data into an analysis-ready format. This includes calculating both theoretical (`log k`) and atheoretical (Area under the Curve / choice count) discounting measures for each individual.
2.  **Data Quality & Reliability**: Replicates the initial analyses establishing the quality and internal consistency of the data from both procedures.
3.  **Convergent Validity (Correlation Analysis)**: Replicates the core correlational analyses to test the primary hypothesis that measures from the two procedures are highly correlated.
4.  **Procedural & Sample Differences (ANOVA)**: Replicates the mixed-effects ANOVA used to test for systematic differences in the measured discounting rates.

In [6]:
# --- Environment Setup ---
#
# This cell installs the required Python packages for the analysis.
# Uncomment and run this cell only if you are setting up a new environment.

# import sys
# !{sys.executable} -m pip install pandas numpy scipy statsmodels openpyxl

In [7]:
# --- 1. SETUP: IMPORTS, FUNCTIONS, AND DATA PROCESSING ---

# --- 1.1 Load Libraries ---

# Core data science libraries
import pandas as pd
import numpy as np
import warnings

# Statistical modeling and analysis
from lmfit import Model
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.optimize import curve_fit
from scipy.stats import gmean

# Suppress warnings for a cleaner final report
warnings.filterwarnings('ignore')
# Set pandas display options for consistent formatting
pd.options.display.float_format = '{:.3f}'.format


# --- 1.2 Custom Helper Functions ---

def fit_nls_logk(df):
    """
    Fits a simple hyperbola to individual Adj-Amt data to estimate log(k).
    
    Args:
        df (pd.DataFrame): A dataframe for a single participant and amount,
                           containing 'iv' (delay) and 'value' (subjective value).
    Returns:
        float: The estimated log(k) parameter.
    """
    def model_func(iv, k):
        return -np.log(1 + np.exp(k) * iv)
    
    try:
        popt, _ = curve_fit(model_func, df['iv'], np.log(df['value']), p0=[-4])
        return popt[0]
    except RuntimeError:
        return np.nan # Return NA if the model fails to converge

def score_mcq_logk(df):
    """
    Calculates the theoretical log(k) for MCQ data using Kirby's (1999)
    consistency-checking scoring algorithm.
    
    Args:
        df (pd.DataFrame): A dataframe for a single participant and amount,
                           containing 'iv' (k_values) and 'value' (choices).
    Returns:
        float: The estimated log(k) parameter.
    """
    choices = df['value'].values
    k_values = df['iv'].values
    
    # Handle edge cases for participants who never switch preference
    if all(choices == 1): return np.log(0.00016) 
    if all(choices == 0): return np.log(0.25)   
    
    # For participants who switch, find the k-value that maximizes consistency
    n_consistent = [
        sum((choices == 0) & (k_values <= k) | (choices == 1) & (k_values > k))
        for k in k_values
    ]
    max_consistency = np.max(n_consistent)
    indifference_ks = k_values[np.where(n_consistent == max_consistency)]
    
    # The estimated indifference point is the geometric mean of the k-values
    # that produce the most consistent choice pattern.
    return np.log(gmean(indifference_ks))

# --- Define mathematical functions for curve fitting ---
def hyperboloid_model(iv, k, s):
    """Hyperboloid discounting function for group-level Adj-Amt data."""
    return 1 / (1 + np.exp(k) * iv)**s

def logistic_growth_model(iv_log, x, r):
    """Logistic growth function for group-level MCQ data."""
    return 1 / (1 + np.exp(-(iv_log - x) * r))

def test_amount_effect(df_provider, procedure):
    """
    Fits a GLM with clustered standard errors and performs a Wald test
    to check for a linear trend in discounting across reward amounts.
    """
    # Define a contrast matrix for a linear trend (Amount 3 > Amount 1)
    contrast_matrix = np.array([[-1, 0, 1]])
    
    if procedure == 'aa':
        # For proportions, clip values to (0, 1) and use a Binomial GLM
        # This approximates a beta regression when data hits the boundaries.
        df_provider['atheoretical_adj'] = np.clip(df_provider['atheoretical'], 1e-5, 1 - 1e-5)
        formula = "atheoretical_adj ~ C(amt)"
        model = smf.glm(formula, data=df_provider, family=sm.families.Binomial())
        
    elif procedure == 'mcq':
        # For counts, create a proportion and use a weighted Binomial GLM
        df_provider['prop_delayed'] = df_provider['atheoretical'] / 9.0
        formula = "prop_delayed ~ C(amt)"
        model = smf.glm(formula, data=df_provider, family=sm.families.Binomial(),
                        weights=np.repeat(9, len(df_provider)))
        
    # Fit the model with standard errors clustered by participant ID
    fit = model.fit(cov_type='cluster', cov_kwds={'groups': df_provider['id']})
    wald_test = fit.wald_test(contrast_matrix)
    return wald_test.pvalue

# --- 1.3 Load and Process Data ---

# Load the raw data from a CSV file
raw_df = pd.read_csv("AdjAmt_MCQ.csv").iloc[:, 1:]
print("Raw data loaded successfully.")

# --- Process Adj-Amt Data ---
adj_amt_summary = raw_df[
    (raw_df['procedure'] == "aa") & (raw_df['iv'] != 730)
].groupby(['id', 'procedure', 'amt']).apply(lambda g: pd.Series({
    # Atheoretical measure: Area Under the Curve (AuC) using the trapezoidal rule
    'atheoretical': np.trapz(y=g['value'], x=g['iv'] / 180),
    # Theoretical measure: log(k) from a non-linear squares fit
    'theoretical': fit_nls_logk(g)
})).reset_index()

# --- Process MCQ Data ---
mcq_summary = raw_df[
    raw_df['procedure'] == "mcq"
].groupby(['id', 'amt']).apply(
    lambda g: pd.Series({
        # Atheoretical measure: Total number of delayed choices
        'atheoretical': g['value'].sum(),
        # Theoretical measure: log(k) from the scoring algorithm
        'theoretical': score_mcq_logk(g)
    })
).reset_index()
mcq_summary['procedure'] = 'mcq'

# --- Combine into Final Analysis DataFrame ---
provider_map = raw_df[['id', 'provider']].drop_duplicates()
individual_level_df = pd.concat([adj_amt_summary, mcq_summary]).merge(provider_map, on='id')

Raw data loaded successfully.


## 2. Data Quality and Reliability Analysis

This section replicates the initial analyses from the paper that establish the quality and internal consistency of the data. This is a crucial step to ensure the data is valid before testing the primary hypotheses.

### 2.1 Group-Level Model Fits & The Amount Effect

First, we assess how well established mathematical models of choice describe the aggregated data from each group. High R-squared ($R^2$) values indicate that participants' choices were systematic and not random, conforming to theoretical expectations.

Second, we test for the **"amount effect,"** a benchmark finding in discounting research where delayed rewards are discounted less steeply as their amount increases. Confirming this effect serves as a critical validity check for the dataset.

In [8]:
# --- 2.1.1 Group-Level Model Fit Assessment ---

# --- Create aggregated dataframe for group-level analysis ---
group_level_summary = raw_df.groupby(['provider', 'procedure', 'amt', 'iv']).agg(
    mean_value=('value', 'mean'),
    median_value=('value', 'median')
).reset_index()
group_level_summary['iv_log'] = np.log(group_level_summary['iv'])

# --- Calculate R-squared for group fits ---
print("--- Group-Level Nonlinear Model Fit (R-squared) ---")

# Adj-Amt Procedure
aa_model_fit = Model(hyperboloid_model)
r2_adj_amt = group_level_summary[group_level_summary['procedure'] == 'aa'].groupby(['provider', 'amt']).apply(
    lambda g: aa_model_fit.fit(g['median_value'], iv=g['iv'], k=-4, s=1).rsquared
).unstack(level='provider')
print("\nAdj-Amt R-squared:")
print(r2_adj_amt)

# MCQ Procedure
mcq_model_fit = Model(logistic_growth_model)
r2_mcq = group_level_summary[group_level_summary['procedure'] == 'mcq'].groupby(['provider', 'amt']).apply(
    lambda g: mcq_model_fit.fit(g['mean_value'], iv_log=g['iv_log'], x=-4, r=1).rsquared
).unstack(level='provider')
print("\nMCQ R-squared:")
print(r2_mcq)


# --- 2.1.2 Amount Effect Test ---

# --- Run tests for Adj-Amt procedure ---
print("\n\n--- Adj-Amt: Amount Effect (p-values) ---")
adj_amt_effect = individual_level_df[
    individual_level_df['procedure'] == 'aa'
].groupby('provider').apply(
    test_amount_effect, procedure='aa'
).rename('p_value').to_frame()
print(adj_amt_effect)

# --- Run tests for MCQ procedure ---
print("\n--- MCQ: Amount Effect (p-values) ---")
mcq_effect = individual_level_df[
    individual_level_df['procedure'] == 'mcq'
].groupby('provider').apply(
    test_amount_effect, procedure='mcq'
).rename('p_value').to_frame()
print(mcq_effect)

--- Group-Level Nonlinear Model Fit (R-squared) ---

Adj-Amt R-squared:
provider  MTurk  Prolific
amt                      
1         0.882     0.980
2         0.996     0.982
3         0.987     0.979

MCQ R-squared:
provider  MTurk  Prolific
amt                      
1         0.991     0.991
2         0.982     0.984
3         0.965     0.996


--- Adj-Amt: Amount Effect (p-values) ---
                         p_value
provider                        
MTurk     1.4083481218328474e-21
Prolific   8.073663339395418e-07

--- MCQ: Amount Effect (p-values) ---
                         p_value
provider                        
MTurk     1.2768358690089668e-64
Prolific  1.2601763159855179e-30


## 2.2 Reliability and Validity Correlations

These analyses replicate the correlation tables from the paper (**Tables 1, 2, and 3**), providing a comprehensive psychometric evaluation of the two discounting procedures.

1.  **Alternate-Forms Reliability**: First, we test whether individuals' discounting behavior is consistent across different reward amounts **within** the same procedure. High positive correlations indicate that the measures are internally consistent and reliable.
2.  **Convergent Validity of Measures**: Next, we test whether the atheoretical (e.g., AuC) and theoretical (e.g., `log k`) scoring methods capture the same information. High (negative) correlations between these two measures confirm they are assessing the same underlying construct **within** each procedure.
3.  **Convergent Validity of Procedures**: Finally, we address the study's primary hypothesis by correlating the measures **between** the two different procedures. If both the Adj-Amt and MCQ tasks are measuring the same trait, their respective measures should be highly correlated.

In [9]:
# --- 2.2.1 Alternate-Forms Reliability (Within-Procedure) ---

# Test if each measure is internally consistent across different reward amounts.
print("--- Adj-Amt: Within-Measure Correlations ---")
# Pivot Adj-Amt data to have amounts as columns for each measure type
adj_amt_pivot_wide = individual_level_df[
    individual_level_df['procedure'] == 'aa'
].pivot_table(index=['id', 'provider'], columns='amt', values=['atheoretical', 'theoretical'])

# Calculate and display correlations for each provider and measure
adj_amt_reliability = adj_amt_pivot_wide.groupby('provider').corr()
print("\nAtheoretical (AuC) Correlations:")
print(adj_amt_reliability.loc[:, ('atheoretical', slice(None))].droplevel(0, axis=1))
print("\nTheoretical (log k) Correlations:")
print(adj_amt_reliability.loc[:, ('theoretical', slice(None))].droplevel(0, axis=1))

print("\n\n--- MCQ: Within-Measure Correlations ---")
# Pivot MCQ data to have amounts as columns for each measure type
mcq_pivot_wide = individual_level_df[
    individual_level_df['procedure'] == 'mcq'
].pivot_table(index=['id', 'provider'], columns='amt', values=['atheoretical', 'theoretical'])

# Calculate and display correlations for each provider and measure
mcq_reliability = mcq_pivot_wide.groupby('provider').corr()
print("\nAtheoretical (Choice Count) Correlations:")
print(mcq_reliability.loc[:, ('atheoretical', slice(None))].droplevel(0, axis=1))
print("\nTheoretical (log k) Correlations:")
print(mcq_reliability.loc[:, ('theoretical', slice(None))].droplevel(0, axis=1))

# --- 2.2.2 Convergent Validity of Measures (Between-Measure) ---

# Test if the atheoretical and theoretical scoring methods are highly correlated.
print("\n\n--- Correlation Between Atheoretical and Theoretical Measures ---")
measure_convergence = individual_level_df.groupby(['provider', 'procedure', 'amt']).apply(
    lambda g: g['atheoretical'].corr(g['theoretical'])
).unstack(level='amt')
print(measure_convergence)


# --- 2.2.3 Convergent Validity of Procedures (Between-Procedure) ---

# This is the main hypothesis test: correlate measures between the two procedures.
print("\n\n--- Between-Procedure Correlations (Convergent Validity) ---")
# Filter for the common reward amounts
validity_df = individual_level_df[
    ((individual_level_df['procedure'] == 'aa') & (individual_level_df['amt'] != 3)) |
    ((individual_level_df['procedure'] == 'mcq') & (individual_level_df['amt'] != 2))
].copy()
validity_df['amount_label'] = np.where(validity_df['amt'] == 1, '$30', '$80')

# Pivot to a wide format with one row per participant and columns for each measure/procedure
validity_pivot = validity_df.pivot_table(
    index=['id', 'provider', 'amount_label'],
    columns='procedure',
    values=['atheoretical', 'theoretical']
)
# Clean up the multi-level column names
validity_pivot.columns = ['_'.join(col).strip() for col in validity_pivot.columns.values]

# Calculate the between-procedure correlations
validity_correlations = validity_pivot.groupby(['provider', 'amount_label']).apply(
    lambda g: pd.Series({
        'atheoretical_corr': g['atheoretical_aa'].corr(g['atheoretical_mcq']),
        'theoretical_corr': g['theoretical_aa'].corr(g['theoretical_mcq'])
    })
)
print(validity_correlations)

--- Adj-Amt: Within-Measure Correlations ---

Atheoretical (AuC) Correlations:
amt                            1      2      3
provider              amt                     
MTurk    atheoretical 1    1.000  0.877  0.841
                      2    0.877  1.000  0.879
                      3    0.841  0.879  1.000
         theoretical  1   -0.973 -0.863 -0.826
                      2   -0.866 -0.979 -0.852
                      3   -0.852 -0.883 -0.982
Prolific atheoretical 1    1.000  0.841  0.737
                      2    0.841  1.000  0.870
                      3    0.737  0.870  1.000
         theoretical  1   -0.963 -0.806 -0.697
                      2   -0.840 -0.958 -0.803
                      3   -0.759 -0.864 -0.956

Theoretical (log k) Correlations:
amt                            1      2      3
provider              amt                     
MTurk    atheoretical 1   -0.973 -0.866 -0.852
                      2   -0.863 -0.979 -0.883
                      3   -0.826 -0.852 

## 3. Comparing Systematic Differences in Discounting

While the correlation analyses confirm that the two procedures measure the same construct, a crucial practical question remains: **are the measures interchangeable?** That is, do the two procedures and two online samples produce systematically different absolute values of discounting?

To answer this, we replicate the final analysis from the paper. We fit a **linear mixed-effects model** to account for the repeated-measures design (i.e., multiple data points per participant). This model examines the main effects and interactions of **experimental procedure** (Adj-Amt vs. MCQ), **participant sample** (Prolific vs. MTurk), and **reward amount** on the estimated `log k` values.

In [10]:
# --- 3.1 Mixed-Effects Model on Theoretical (log k) Measures ---

# This analysis uses a linear mixed-effects model, the correct approach for a
# repeated-measures design where each participant provides multiple data points.
# This model accounts for the non-independence of observations by including a
# random intercept for each participant.

# Prepare the data for the model, ensuring categorical variables are set up
anova_df = individual_level_df[
    ((individual_level_df['procedure'] == 'aa') & (individual_level_df['amt'] != 3)) |
    ((individual_level_df['procedure'] == 'mcq') & (individual_level_df['amt'] != 2))
].copy()
anova_df['amount_label'] = np.where(anova_df['amt'] == 1, '30', '80')

# Define and fit the mixed-effects model
# The `groups=anova_df['id']` term specifies the random intercepts for each participant.
mixed_effects_model = smf.mixedlm(
    "theoretical ~ C(provider) * C(amount_label) * C(procedure)",
    data=anova_df,
    groups=anova_df['id']
).fit()

# Print the model summary table
print("--- Mixed-Effects Model: Comparing log k Values ---")
print(mixed_effects_model.summary())

# --- Interpreting the Post-Hoc Test from the Summary Table ---
# The original paper's post-hoc test examined the Procedure x Amount interaction.
# In the summary table above, the interaction term:
#   C(amount_label)[T.80]:C(procedure)[T.mcq]
# directly tests if the difference between the MCQ and Adj-Amt procedures is
# significantly different for the $80 amount compared to the $30 amount.

--- Mixed-Effects Model: Comparing log k Values ---
                                    Mixed Linear Model Regression Results
Model:                              MixedLM                  Dependent Variable:                  theoretical
No. Observations:                   1572                     Method:                              REML       
No. Groups:                         393                      Scale:                               0.7341     
Min. group size:                    4                        Log-Likelihood:                      -2549.8343 
Max. group size:                    4                        Converged:                           Yes        
Mean group size:                    4.0                                                                      
-------------------------------------------------------------------------------------------------------------
                                                                  Coef.  Std.Err.    z    P>|z| [0.025 0