# **Model Fit (Parameter Estimation) with** ***Real Data***
### - **Maximum Likelihood Estimation (MLE)**
- Estimates Alpha and Gamma/Mu first, then Beta second
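
The two-stage order above can be sketched on toy data. This is only an illustration, not the fitting code used later in the notebook: the trial design, the noiseless simulated chooser, the coarse grids, and the helper names `sv_delta` and `log_lik` are all assumptions made for the example, and only gain trials are handled.

```python
import numpy as np

def sv_delta(alpha, beta, lott, sure, amb, prob):
    """SV(lottery) - SV(certain) for gain trials (hypothetical helper)."""
    return (prob - beta * amb / 2) * lott**alpha - sure**alpha

def log_lik(alpha, beta, gamma, choice, lott, sure, amb, prob):
    """Summed Bernoulli log likelihood under a softmax choice rule."""
    p = 1 / (1 + np.exp(-gamma * sv_delta(alpha, beta, lott, sure, amb, prob)))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # assumption: clip so log() stays finite
    return np.sum(choice * np.log(p) + (1 - choice) * np.log(1 - p))

# Toy design (assumed): 5 lottery amounts, certain amount fixed at 5
amounts = np.array([5.0, 8.0, 20.0, 40.0, 50.0])
lott = np.tile(amounts, 6)
prob = np.concatenate([np.repeat([0.25, 0.5, 0.75], 5), np.full(15, 0.5)])
amb = np.concatenate([np.zeros(15), np.repeat([0.24, 0.5, 0.74], 5)])
sure = np.full(lott.size, 5.0)

# Simulated noiseless chooser with known parameters (assumed values)
choice = (sv_delta(0.8, 0.4, lott, sure, amb, prob) > 0).astype(float)
risk, ambig = amb == 0, amb != 0

# Stage 1: alpha and gamma from risk trials only (beta held at 0)
best_ll, alpha_hat, gamma_hat = -np.inf, None, None
for a in np.round(np.arange(0.3, 2.0, 0.1), 2):
    for g in np.round(np.arange(0.5, 8.0, 0.5), 2):
        ll = log_lik(a, 0.0, g, choice[risk], lott[risk], sure[risk], amb[risk], prob[risk])
        if ll > best_ll:
            best_ll, alpha_hat, gamma_hat = ll, a, g

# Stage 2: beta from ambiguity trials only, reusing the stage 1 estimates
best_ll, beta_hat = -np.inf, None
for b in np.round(np.arange(-1.3, 1.31, 0.1), 2):
    ll = log_lik(alpha_hat, b, gamma_hat, choice[ambig], lott[ambig], sure[ambig], amb[ambig], prob[ambig])
    if ll > best_ll:
        best_ll, beta_hat = ll, b

print(alpha_hat, gamma_hat, beta_hat, round(best_ll, 4))
```

Because beta is searched only after alpha and gamma are fixed, the second stage is a 1D search rather than a full 3D grid, which is the main reason for fitting in this order.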


#### **====Why bother using our own math for LL rather than Scipy's Bernoulli logpmf?====**

- Most importantly, so we can better understand the mathematical foundation of our modeling. It is necessary to understand each part so we know when and why these components may not be appropriate or even misleading.
    - Lets everyone know what's actually happening! We couldn't find docs explaining the "efficiency mechanics" of Bernoulli's logpmf, so let's compare the LL outcomes:
    - Returns identical summed LL
        - log(x) * y == log(x**y)
            - If you multiply a logged value by another value (*left*), it's the same as if you exponentiated before logging (*right*)
                - *Logs are used to get the exponent required for the base to reach the input value*
                    - z = log_e(x) --> e**z = x
                        - z * y = y * log_e(x) --> e**(z*y) = x**y
                            - z * y = log_e(x**y)
        - x = probability of choosing the lottery for the current condition subgroup
        - y = number of lottery choices in that subgroup (y <= number of trials per subgroup)
        - For y = 2: log(x) + log(x) == 2 * log(x) == log(x**2)
        - *Works regardless of number of trials per subgroup*
    - Bernoulli (one term per trial):
        - log(x) + log(x) + ... + log(x) [y times]
    - Manual (one term per subgroup):
        - log(x) * y
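
The same equivalence can be written out for a full subgroup, including the trials where the certain option was chosen. This is a sketch using $p$ for the lottery probability (called x in the list above), $c_i \in \{0, 1\}$ for the choice on trial $i$, $n$ for the number of trials in the subgroup, and $y = \sum_i c_i$ for the number of lottery choices:

$$
\sum_{i=1}^{n} \log\left[p^{c_i}(1-p)^{1-c_i}\right]
= \log(p)\sum_{i=1}^{n} c_i + \log(1-p)\sum_{i=1}^{n}(1-c_i)
= y\,\log(p) + (n-y)\,\log(1-p)
$$

The left side is what summing Bernoulli log-pmf values trial by trial computes; the right side is the manual form used in the demonstration cell.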

In [None]:
## Demonstration ##
import numpy as np
from scipy.stats import bernoulli

MLL_trials = np.array([2.0, 1.0, 1.0]) ## (y) each value represents the sum of lottery choices in that condition subgroup | number of trials per subgroup = 2
MLL_probs  = np.array([1.0, 0.5, 0.7]) ## (p) probability (predictive - not proportion) Ss will choose lottery for each condition subgroup
MLL_total_trials = np.array([2, 2, 2]) ## (previously 1) number of trials per subgroup
print("No black box LL:    ", np.nansum(np.log(MLL_probs) * MLL_trials + np.log(1 - MLL_probs) * (MLL_total_trials - MLL_trials)))

BLL_trials = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 0.0])
BLL_probs  = np.array([1.0, 1.0, 0.5, 0.5, 0.7, 0.7])
print("Bernoulli logpmf LL:", np.nansum(bernoulli.logpmf(BLL_trials, BLL_probs)))

In [45]:
"""
===================
Mandy Renfro (2024)
===================
"""

from glob import glob
import matplotlib.pyplot as plt
import numpy as np
np.seterr(all = "ignore")
import os
import os.path, sys
import pandas as pd
from scipy import stats
from scipy.integrate import simpson as simps, trapezoid as trapz ## simps/trapz names were removed in recent SciPy releases
import seaborn as sns
sns.set_theme(style="white", palette="muted")
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
    
base_proj_dir = "Z:/data/SDAN/crdm" ## base project directory
data_dir = "Z:/data/SDAN/crdm/sourcedata/adult" ## directory containing adult data


def goodness_of_fit(max_LL, choices, n = 1, num_params = 3):
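    """Calculates goodness of fit metrics (pseudo R Squared and AIC) 
        *Note: max_LL returned by the MLE beta functions is summed over ambiguity trials only, 
        whereas the strawman LL here is summed over all choices passed in
        INPUT:
        - max_LL: maximum log likelihood score associated with best fit parameters
        - choices: Ss choices (0 = certain, 1 = lottery), one value per trial
        - n: number of trials represented by each row (1 = each row contains a single trial)
        - num_params: number of free parameters (alpha, beta, & gamma)
        OUTPUT:
        - R2: McFadden's pseudo R Squared (1 - max_LL / strawman LL), improvement in fit over the strawman model
        - AIC: Akaike Information Criterion for current model
        - AIC_0: Akaike Information Criterion for strawman model (chance responding, p = 0.5, no free parameters)
    """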
    strawman_MLL = np.nansum(np.log(0.5) * choices + np.log(1 - 0.5) * (n - choices))
    maximum_LL = max_LL
    R2 = np.round(1 - (maximum_LL/strawman_MLL), 4) 
    AIC = np.round((2 * num_params) - (2 * maximum_LL), 4)
    AIC_0 = np.round((2 * 0) - (2 * strawman_MLL), 4)
    return R2, AIC, AIC_0


def pd_extend(df_dest, df_extending):
    """ Concatenates all subjects dataframe for saving current Ss parameters
        INPUT:
        - df_dest: destination dataframe to which second dataframe will be appended
        - df_extending: dataframe to be added to the destination dataframe
        OUTPUT:
        - Concatenated dataframe containing info from currently iterated Ss
    """
    return pd.concat([df_dest, df_extending])


def SVs(alpha, beta, lottery_value, certain_value, ambiguity, probability):
    """ Calculate SV for lottery and safe options (Gilboa & Schmeidler, 1989)
        **Note: np.sign() and np.abs() allows this function to flexibly handle *both* gain and loss trials
        INPUT:
        - alpha: current Ss alpha
        - beta: current Ss beta
        - lottery_value: winning lottery amount
        - certain_value: certain (safe) amount
        - ambiguity: ambiguity level
        - probability: probability level
        OUTPUT:
        - SV_lottery: subjective value for lottery option
        - SV_certain: subjective value for certain option
    """
    SV_lottery = (probability - beta * ambiguity / 2) * np.sign(lottery_value) * (np.abs(lottery_value)**alpha)
    SV_certain = np.sign(certain_value) * (np.abs(certain_value)**alpha)
    return SV_lottery, SV_certain

## **Softmax Function**
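
As a quick orientation before the fitting functions, the sketch below shows how the softmax choice rule turns a subjective value difference (lottery minus certain) into a choice probability, and how gamma changes the steepness of the curve. The function name `softmax_choice_prob` and the example values are assumptions for illustration only.

```python
import numpy as np

def softmax_choice_prob(sv_delta, gamma):
    """P(choose lottery) given SV difference (lottery - certain) and slope gamma."""
    sv_delta = np.asarray(sv_delta, dtype=float)
    return 1.0 / (1.0 + np.exp(-gamma * sv_delta))

sv_delta = np.array([-2.0, 0.0, 2.0])  # hypothetical SV differences
shallow = softmax_choice_prob(sv_delta, gamma=0.5)  # small gamma: flatter curve, noisier choices
steep = softmax_choice_prob(sv_delta, gamma=4.0)    # large gamma: steeper curve, more deterministic

print(np.round(shallow, 3))
print(np.round(steep, 3))
```

At an SV difference of 0 the probability is 0.5 for any gamma (the point of subjective equality); larger gamma pushes probabilities toward 0 or 1 on either side of it.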

In [46]:
def binomial_likelihood_SM(alpha, beta, gamma, y, n, lottery_value, certain_value, ambiguity, probability):
    """ Calculates binomial LL for each parameter combination
        *See above notes for why we can use our own math rather than rely on bernoulli.logpmf()
        *Deals with positive infinity values by assigning a value of 0, 
        then during MLE functions, parameter pairs with 0 LL score are removed
        INPUT:
        - alpha: current Ss risk parameter [high values indicate risk seeking]
        - beta: current Ss ambiguity parameter [high values indicate ambiguity avoidance]
        - gamma: current Ss choice stochasticity parameter [high values indicate less noise, i.e., a steeper choice curve]
        - y: summation of lottery choices
        - n: number of trials
        - lottery_value: all lottery values for risk or ambiguity trials 
        - certain_value: all safe values for risk or ambiguity trials 
        - ambiguity: all ambiguity values for risk (amb = 0) or ambiguity trials
        - probability: all probability values for risk or ambiguity (prob = 0.5) trials
        OUTPUT:
        - log_likelihood: relative likeliness score representing probability observed data resulted from current parameter combo
    """
    p = probability_of_lottery_choice_SM(alpha, beta, gamma, lottery_value, certain_value, ambiguity, probability)
    log_likelihood = (np.log(p) * y + np.log(1 - p) * (n - y))
    log_likelihood[log_likelihood == np.inf] = 0
    return log_likelihood


def fit_data_SM(data, session, domain, pid, df):
    """ (1) Calls MLE functions to fit model and return Alpha, Gamma, and Beta parameters
        (2) Creates dataframe for saving parameter estimates
        INPUTS:
        - data: current Ss dataframe containing trial data (domain, prob, amb, sure_amt, lott_amt, choice)
        - session: experiment session
        - domain: run separately for gain/loss trials
        - pid: current subject ID
        - df: dataframe (df_participants) to which the current Ss data will be appended
        OUTPUT:
        - *res: packed variable representing current Ss alpha, beta, and gamma for gain or loss domain
        - df: df_participants appended to include Ss fitted parameters
    """
    choices = data["choice"].values
    lotteries = data["lott_amt"].values
    certain_values = data["sure_amt"].values
    ambiguities = data["amb"].values/100
    probabilities = data["prob"].values/100
    res = MLE_alpha_gamma_SM(choices, lotteries, certain_values, ambiguities, probabilities)
    res, max_LL = MLE_beta_SM(choices, res[0], res[1], lotteries, certain_values, ambiguities, probabilities)
    gof = goodness_of_fit(max_LL, choices)
    df = pd_extend(df, pd.DataFrame(np.array([pid, session, domain, *res, np.round(max_LL, 4), *gof]).reshape(1, -1), columns = df.columns))
    return *res, df


def MLE_alpha_gamma_SM(choices, lotteries, certain_values, ambiguities, probabilities):
    """ Grid search of alpha and gamma parameter space to determine which point produces max log likelihood score. 
        *Risk trials only*
        *Notes: Conceptualize parameter and probability space as two parallel multidimensional spaces. 
                Using a "grid search" method, we carefully iterate through a 2D parameter space (alpha/gamma) 
                to determine which point best explains the data, quantified as the maximum likelihood score.
        INPUT:
        ** Numpy parallel arrays **
        - choices: Ss choices
        - lotteries: winning lottery amounts
        - certain_values: certain amounts
        - ambiguities: ambiguity level
        - probabilities: probability level
        OUTPUT:
        - best_fit: Tuple containing best fit parameters (alpha and gamma)
    """
    best_fit = None
    max_likelihood = None
    x, n = choices, np.ones(choices.shape)
    idx = np.where(ambiguities == 0)
    for alpha in np.round(np.arange(0.3, 2, 0.01), 2): ## Grid Search 1 (search through 2D: alpha and gamma)
        for gamma in np.round(np.arange(0.01, 8, 0.01), 2):
            ## Summation of log likelihoods for all trials (beta/ambiguity held constant at 0 | only risk trials)
            likelihood = np.nansum(binomial_likelihood_SM(alpha, 0, gamma, x[idx], n[idx], lotteries[idx], 
                                                            certain_values[idx], ambiguities[idx], probabilities[idx]))
            if max_likelihood is None or likelihood > max_likelihood:
                max_likelihood = likelihood
                best_fit = (alpha, gamma)
    return best_fit


def MLE_beta_SM(choices, alpha, gamma, lotteries, certain_values, ambiguities, probabilities):
    """ Grid search of beta parameter space to determine which point produces max log likelihood score. 
        *Ambiguity trials only*
        *Notes: Conceptualize parameter and probability space as two parallel multidimensional spaces. 
                Using a "grid search" method previously to determine the best fit of alpha and gamma parameters, 
                we can use those values to now guide our search through 1D beta space.
        INPUT:
        ** Numpy parallel arrays **
        - choices: Ss choices
        - lotteries: winning lottery amounts
        - certain_values: certain amounts
        - ambiguities: ambiguity level
        - probabilities: probability level
        OUTPUT:
        - best_fit: Tuple containing best fit parameters (alpha, beta, and gamma)
        - max_likelihood: Maximum log likelihood associated with best fit parameters
    """
    best_fit = None
    max_likelihood = None
    x, n = choices, np.ones(choices.shape)
    idx = np.where(ambiguities != 0)
    for beta in np.round(np.arange(-1.3, 1.31, 0.01), 2): ## Grid Search 2 (search through 1D: beta)
        ## Summation log likelihoods of all trials (pre-fit alpha/gamma included to inform search | only ambiguity trials)
        likelihood = np.nansum(binomial_likelihood_SM(alpha, beta, gamma, x[idx], n[idx], lotteries[idx], 
                                                        certain_values[idx], ambiguities[idx], probabilities[idx]))
        if max_likelihood is None or likelihood > max_likelihood:
            max_likelihood = likelihood
            best_fit = (alpha, beta, gamma)
    return best_fit, max_likelihood


def probability_of_lottery_choice_SM(alpha, beta, gamma, lottery_value, certain_value, ambiguity, probability):
    """ Determines probability of selecting lottery using the Softmax probabilistic function
        INPUT:
        - alpha: current Ss risk parameter [high values indicate risk seeking]
        - beta: current Ss ambiguity parameter [high values indicate ambiguity avoidance]
        - gamma: current Ss choice stochasticity parameter [high values indicate less noise, i.e., a steeper choice curve]
        - lottery_value: all lottery values for risk or ambiguity trials 
        - certain_value: all safe values for risk or ambiguity trials 
        - ambiguity: all ambiguity values for risk (amb = 0) or ambiguity trials
        - probability: all probability values for risk or ambiguity (prob = 0.5) trials
        OUTPUT:
        - Ss probability of choosing lottery for trials with the current condition combination 
    """
    SV_lottery, SV_certain = SVs(alpha, beta, lottery_value, certain_value, ambiguity, probability)
    return 1 / (1 + np.exp(-gamma * (SV_lottery - SV_certain)))


def SVdelta_plotSM(data, idx, session, domain, pid, alpha, beta, gamma):
    """ Plots Ss choice across SV difference space.
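        INPUT
        - data: current Ss dataframe containing trial data (domain, prob, amb, sure_amt, lott_amt, choice)
        - idx: current Ss idx number amongst all subjects
        - session: experiment session (used in plot title)
        - domain: gain/loss/combined label (used in plot title)
        - pid: current subject ID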
        - alpha: current Ss best fit alpha
        - beta: current Ss best fit beta
        - gamma: current Ss best fit gamma
        OUTPUT
        - plt: Psychometric choice curve demonstrating choice behavior across subject-specific SVΔ range
    """
    plt.figure(idx, figsize = (6, 4))
    choices = data["choice"].values
    lottery_values = data["lott_amt"].values
    certain_values = data["sure_amt"].values
    ambiguities = data["amb"].values/100
    probabilities = data["prob"].values/100
    SV_lottery, SV_certain = SVs(alpha, beta, lottery_values, certain_values, ambiguities, probabilities)
    SV_delta = SV_lottery - SV_certain
    sv_choice_tups = list(zip(SV_delta, choices))
    sv_choice_tups = sorted(list(set(sv_choice_tups)))
    para_deltas, para_choices = list(zip(*sorted(sv_choice_tups)))
    SV_fit = np.linspace(min(SV_delta), max(SV_delta), 300)
    prob_fit = []
    for sv in SV_fit:
        prob_fit.append(1 / (1 + np.exp(-gamma * (sv - 0))))
    plt.plot([min(SV_delta), max(SV_delta)], [0.5, 0.5], 'k--')
    plt.plot(SV_fit, prob_fit, "b-")
    plt.plot(para_deltas, para_choices, "ro-")
    plt.xlabel("SVΔ (Lottery-Certain)", fontsize = 12)
    plt.ylim([-0.05, 1.05])
    plt.yticks([0, 0.5, 1], ["Certain (0)", "PSE (0.5)", "Lottery (1)"])
    plt.ylabel("Probability of Lottery Choice", fontsize = 12)
    sns.despine(top = True)
    plt.title("SDAN sub-" + pid + " {0}-{1} | α = {2}, β = {3}, ɣ = {4}".format(session, domain, alpha, beta, gamma), fontsize = 12)
    plt.tight_layout()
    return plt

In [None]:
save_proj_dir = os.path.join(base_proj_dir, "derivatives/adult/parameter_estimation/softmax1") ## output directory

subs = [] ## list to store subject IDs
descriptives = {}

for i in ["gain", "loss", "combined"]:
        for j in ["resp_rate", "prop_lott_risk", "prop_lott_amb", "prop_lott_all",
                    "conf1_ct", "conf2_ct", "conf3_ct", "conf4_ct", "conf_mean", "at_bound"]:
            descriptives["{0}_{1}".format(j, i)] = []

df_participants = pd.DataFrame(columns = ["SubID", "Session", "Domain", ## Ss parameters dataframe
                                            "Alpha", "Beta", "Gamma", 
                                            "MaxLL", "R2", "AIC", "AIC0"]) 

files = sorted(glob(os.path.join(data_dir, "*.csv"))) ## grab all participant datafiles
for curr_file in files: ## iterate through globbed files and save subject ID to a list
    sub_id = os.path.basename(curr_file)[:5] ## grab first 5 indices of filename string
    if sub_id not in subs: ## check if already in list
        subs.append(sub_id) ## if not, append new Ss ID to list

for idx, sub in enumerate(subs): ## iterate through Ss ID list
    save_sub_dir = os.path.join(save_proj_dir, "sub-{0}".format(sub)) ## output directory
    if not os.path.exists(save_sub_dir): ## new Ss
        os.makedirs(save_sub_dir) ## make new Ss save directory
    sub_files = sorted(glob(os.path.join(data_dir, "{0}*_crdm.csv".format(sub)))) ## files for current Ss
    sub_cols = ["domain", "prob", "amb", "sure_amt", "lott_amt", "choice"] ## trial variables
    sub_df = pd.DataFrame(columns = sub_cols) ## subject-specific dataframe w/ preset columns
    if len(sub_files) == 1: ## if Ss has only 1 file
        continue ## bad Ss, move on 
    else:  ## Ss has two data files
        for i in range(1, -1, -1): ## go through data files in reverse order
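            sub_df = pd.DataFrame(columns = sub_cols) ## fresh dataframe per file so rows are not aligned to the previous file's index
            raw_df = pd.read_csv(sub_files[i]) ## open current data file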
            df = raw_df.loc[(raw_df["crdm_trial_type"] == "task") & (raw_df["crdm_choice"].notnull())] ## only task trials w/ responses
            sub_df["domain"] = df["crdm_domain"] ## trial type (gain/loss)
            sub_df["prob"] = df["crdm_lott_p"] ## trial probability
            sub_df["amb"] = df["crdm_amb_lev"] ## trial ambiguity
            sub_df["sure_amt"] = df["crdm_sure_amt"] ## trial certain amount
            sub_df["lott_amt"] = df["crdm_lott_top"] + df["crdm_lott_bot"] ## trial lottery amount
            sub_df["choice"] = df["crdm_choice"] ## trial choice (-1 = nonresponse, 0 = certain, 1 = lottery)
            sub_gains = sub_df.loc[(sub_df["domain"] == "gain")] ## DF with only gain trials
            sub_losses = sub_df.loc[(sub_df["domain"] == "loss")] ## DF with only loss trials
            df_dict = {"gain": sub_gains, "loss": sub_losses, "combined": sub_df}
            if i == 1: ## sub_files[1], processed first because of the reversed loop (stable = SESSION 1)
                gf_alpha, gf_beta, gf_gamma, df_participants = fit_data_SM(sub_gains, "S1", "gain", sub, df_participants)
                plt1 = SVdelta_plotSM(sub_gains, idx, "S1", "Gain", sub, gf_alpha, gf_beta, gf_gamma)
                fig_save1 = os.path.join(save_sub_dir, "sub-{}_sdan-crdm-sm_svdelta-choice-curve-gain1.png".format(subs[idx]))
                plt1.savefig(fig_save1)
                plt1.show()
                lf_alpha, lf_beta, lf_gamma, df_participants = fit_data_SM(sub_losses, "S1", "loss", sub, df_participants)
                plt2 = SVdelta_plotSM(sub_losses, idx, "S1", "Loss", sub, lf_alpha, lf_beta, lf_gamma)
                fig_save2 = os.path.join(save_sub_dir, "sub-{}_sdan-crdm-sm_svdelta-choice-curve-loss1.png".format(subs[idx]))
                plt2.savefig(fig_save2)
                plt2.show()
                cf_alpha, cf_beta, cf_gamma, df_participants = fit_data_SM(sub_df, "S1", "both", sub, df_participants)
                plt3 = SVdelta_plotSM(sub_df, idx, "S1", "Combined", sub, cf_alpha, cf_beta, cf_gamma)
                fig_save3 = os.path.join(save_sub_dir, "sub-{}_sdan-crdm-sm_svdelta-choice-curve-combined1.png".format(subs[idx]))
                plt3.savefig(fig_save3)
                plt3.show()
                for d in ["gain", "loss", "combined"]: ## 3 different parameter estimations for Session 1
                    if d == "combined": ## combined parameter estimation
                        descriptives["resp_rate_combined"].append(len(df_dict[d]["choice"]) / len(raw_df.loc[(raw_df["crdm_trial_type"] == "task")]))
                    else: ## gain and loss parameter estimations
                        descriptives["resp_rate_{0}".format(d)].append(len(df_dict[d]["choice"]) / len(raw_df.loc[(raw_df["crdm_domain"] == "{0}".format(d)) 
                                                                                                                    & (raw_df["crdm_trial_type"] == "task")]))
                    descriptives["prop_lott_risk_{0}".format(d)].append(len(df_dict[d].loc[(df_dict[d]["amb"] == 0) 
                                                                        & (df_dict[d]["choice"] == 1)]) / (df_dict[d]["amb"] == 0).sum()) ## denominator = number of risk trials
                    descriptives["prop_lott_amb_{0}".format(d)].append(len(df_dict[d].loc[(df_dict[d]["amb"] != 0) 
                                                                        & (df_dict[d]["choice"] == 1)]) / (df_dict[d]["amb"] != 0).sum()) ## denominator = number of ambiguity trials
                    descriptives["prop_lott_all_{0}".format(d)].append(len(df_dict[d].loc[(df_dict[d]["choice"] == 1)]) / len(df_dict[d]["choice"]))
                    descriptives["conf1_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 1)]))
                    descriptives["conf2_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 2)]))
                    descriptives["conf3_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 3)]))
                    descriptives["conf4_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 4)]))
                    descriptives["conf_mean_{0}".format(d)].append(np.nanmean(df["crdm_conf_resp.keys"])) ## nanmean to exclude non-responses
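                    d_alpha, d_beta, d_gamma = {"gain": (gf_alpha, gf_beta, gf_gamma), "loss": (lf_alpha, lf_beta, lf_gamma), 
                                                "combined": (cf_alpha, cf_beta, cf_gamma)}[d] ## parameters fit for the current domain
                    ## alpha lower bound matches the grid search (0.3)
                    if d_alpha <= 0.3 or d_alpha >= 1.99 or d_beta <= -1.29 or d_beta >= 1.29 or d_gamma <= 0.01 or d_gamma >= 7.99: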
                        descriptives["at_bound_{0}".format(d)].append(1) ## indicates Ss value at parameter bounds
                    else:
                        descriptives["at_bound_{0}".format(d)].append(0) ## indicates Ss value not at parameter bounds
            else: ## sub_files[0], processed second (adaptive = SESSION 2)
                gf_alpha, gf_beta, gf_gamma, df_participants = fit_data_SM(sub_gains, "S2", "gain", sub, df_participants)
                plt1 = SVdelta_plotSM(sub_gains, idx, "S2", "Gain", sub, gf_alpha, gf_beta, gf_gamma)
                fig_save1 = os.path.join(save_sub_dir, "sub-{}_sdan-crdm-sm_svdelta-choice-curve-gain2.png".format(subs[idx]))
                plt1.savefig(fig_save1)
                plt1.show()
                lf_alpha, lf_beta, lf_gamma, df_participants = fit_data_SM(sub_losses, "S2", "loss", sub, df_participants)
                plt2 = SVdelta_plotSM(sub_losses, idx, "S2", "Loss", sub, lf_alpha, lf_beta, lf_gamma)
                fig_save2 = os.path.join(save_sub_dir, "sub-{}_sdan-crdm-sm_svdelta-choice-curve-loss2.png".format(subs[idx]))
                plt2.savefig(fig_save2)
                plt2.show()
                cf_alpha, cf_beta, cf_gamma, df_participants = fit_data_SM(sub_df, "S2", "both", sub, df_participants)
                plt3 = SVdelta_plotSM(sub_df, idx, "S2", "Combined", sub, cf_alpha, cf_beta, cf_gamma)
                fig_save3 = os.path.join(save_sub_dir, "sub-{}_sdan-crdm-sm_svdelta-choice-curve-combined2.png".format(subs[idx]))
                plt3.savefig(fig_save3)
                plt3.show()
                for d in ["gain", "loss", "combined"]: ## 3 different parameter estimations for Session 2
                    if d == "combined":
                        descriptives["resp_rate_combined"].append(len(df_dict[d]["choice"]) / len(raw_df.loc[(raw_df["crdm_trial_type"] == "task")]))
                    else:
                        descriptives["resp_rate_{0}".format(d)].append(len(df_dict[d]["choice"]) / len(raw_df.loc[(raw_df["crdm_domain"] == "{0}".format(d)) 
                                                                                                                    & (raw_df["crdm_trial_type"] == "task")]))
                    descriptives["prop_lott_risk_{0}".format(d)].append(len(df_dict[d].loc[(df_dict[d]["amb"] == 0) 
                                                                        & (df_dict[d]["choice"] == 1)]) / (df_dict[d]["amb"] == 0).sum()) ## denominator = number of risk trials
                    descriptives["prop_lott_amb_{0}".format(d)].append(len(df_dict[d].loc[(df_dict[d]["amb"] != 0) 
                                                                        & (df_dict[d]["choice"] == 1)]) / (df_dict[d]["amb"] != 0).sum()) ## denominator = number of ambiguity trials
                    descriptives["prop_lott_all_{0}".format(d)].append(len(df_dict[d].loc[(df_dict[d]["choice"] == 1)]) / len(df_dict[d]["choice"]))
                    descriptives["conf1_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 1)]))
                    descriptives["conf2_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 2)]))
                    descriptives["conf3_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 3)]))
                    descriptives["conf4_ct_{0}".format(d)].append(len(df.loc[(df["crdm_conf_resp.keys"] == 4)]))
                    descriptives["conf_mean_{0}".format(d)].append(np.nanmean(df["crdm_conf_resp.keys"])) ## nanmean to exclude non-responses
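                    d_alpha, d_beta, d_gamma = {"gain": (gf_alpha, gf_beta, gf_gamma), "loss": (lf_alpha, lf_beta, lf_gamma), 
                                                "combined": (cf_alpha, cf_beta, cf_gamma)}[d] ## parameters fit for the current domain
                    ## alpha lower bound matches the grid search (0.3)
                    if d_alpha <= 0.3 or d_alpha >= 1.99 or d_beta <= -1.29 or d_beta >= 1.29 or d_gamma <= 0.01 or d_gamma >= 7.99:
                        descriptives["at_bound_{0}".format(d)].append(1) ## indicates Ss value at parameter bounds
                    else:
                        descriptives["at_bound_{0}".format(d)].append(0) ## indicates Ss value not at parameter bounds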
    print(sub, end = "\r") ## print current Ss as output (indication of speed and where things might get hung up)
#print(df_participants) ## df_participants has two rows per Ss: parameters for (1) gains and (2) losses. Needs reformatting

## reformat dataframe to be more legible and easier to parse 
idxG = np.where(df_participants["Domain"].values == "gain")[0] ## indices for gain parameter rows
idxL = np.where(df_participants["Domain"].values == "loss")[0] ## indices for loss parameter rows
idxC = np.where(df_participants["Domain"].values == "both")[0] ## indices for combined parameter rows

## reorganized by horizontally stacking domains
stacked_data = np.hstack((df_participants.values[idxG], df_participants.values[idxL], df_participants.values[idxC])) 
#print(stacked_data[0]) ## new DF format
## temporary dataframe with hstacked data, tagging rows to drop with "d#"
temp_df = pd.DataFrame(stacked_data, columns = ["PID", "Session", 
                                                "d1",
                                                "Alpha-G", "Beta-G", "Gamma-G", "MLL-G", "R2-G", "AIC-G", "AIC0-G", 
                                                "d2", "d3", "d4", 
                                                "Alpha-L", "Beta-L", "Gamma-L", "MLL-L", "R2-L", "AIC-L", "AIC0-L",
                                                "d5", "d6", "d7",
                                                "Alpha-C", "Beta-C", "Gamma-C", "MLL-C", "R2-C", "AIC-C", "AIC0-C"])
## redeclare df_participants variable with temp_df while dropping unwanted columns (i.e., "gain", Ss # repeat, "loss")
df_participants = temp_df.drop(["d1", "d2", "d3", "d4", "d5", "d6", "d7"], axis = 1) 

dom = ["G", "L", "C"]
for i, d in enumerate(["gain", "loss", "combined"]):
    df_participants["RespRate-{0}".format(dom[i])] = np.round(descriptives["resp_rate_{0}".format(d)], 2) ## proportion of trials Ss responded
    df_participants["RiskLottC-{0}".format(dom[i])] = np.round(descriptives["prop_lott_risk_{0}".format(d)], 2) ## proportion of risk trials Ss chose lottery
    df_participants["AmbigLottC-{0}".format(dom[i])] = np.round(descriptives["prop_lott_amb_{0}".format(d)], 2) ## proportion of ambiguity trials Ss chose lottery
    df_participants["AllLottC-{0}".format(dom[i])] = np.round(descriptives["prop_lott_all_{0}".format(d)], 2) ## proportion of trials Ss chose lottery
    df_participants["Conf1-{0}".format(dom[i])] = descriptives["conf1_ct_{0}".format(d)] ## "not at all confident" count
    df_participants["Conf2-{0}".format(dom[i])] = descriptives["conf2_ct_{0}".format(d)] ## "a little confident" count
    df_participants["Conf3-{0}".format(dom[i])] = descriptives["conf3_ct_{0}".format(d)] ## "somewhat confident" count
    df_participants["Conf4-{0}".format(dom[i])] = descriptives["conf4_ct_{0}".format(d)] ## "very confident" count
    df_participants["ConfAvg-{0}".format(dom[i])] = np.round(descriptives["conf_mean_{0}".format(d)], 2) ## mean confidence ratings
    df_participants["ParamOut-{0}".format(dom[i])] = descriptives["at_bound_{0}".format(d)] ## parameter outlier status (boolean)

filename = os.path.join(save_proj_dir, "sdan-crdm-sm_modelfit.csv") ## filename indicates domain and choice model
df_participants.to_csv(filename, columns = ["PID", "Session", "Alpha-G", "Beta-G", "Gamma-G", "MLL-G", "R2-G", 
                                            "AIC-G", "AIC0-G", "RespRate-G", "RiskLottC-G", "AmbigLottC-G", "AllLottC-G", 
                                            "Conf1-G", "Conf2-G", "Conf3-G", "Conf4-G", "ConfAvg-G", "ParamOut-G",
                                            "Alpha-L", "Beta-L", "Gamma-L", "MLL-L", "R2-L", 
                                            "AIC-L", "AIC0-L", "RespRate-L", "RiskLottC-L", "AmbigLottC-L", "AllLottC-L", 
                                            "Conf1-L", "Conf2-L", "Conf3-L", "Conf4-L", "ConfAvg-L", "ParamOut-L",
                                            "Alpha-C", "Beta-C", "Gamma-C", "MLL-C", "R2-C", 
                                            "AIC-C", "AIC0-C", "RespRate-C", "RiskLottC-C", "AmbigLottC-C", "AllLottC-C", 
                                            "Conf1-C", "Conf2-C", "Conf3-C", "Conf4-C", "ConfAvg-C", "ParamOut-C"],
                                            index = False) ## save csv without dataframe row indexing

In [None]:
df_participants

## **Luce Model of Discrete Choice**

- Does not currently produce SVDelta Choice plots
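
For orientation, here is a minimal sketch of the Luce choice rule for strictly positive subjective values (gain trials only). It leaves out the sign handling that the notebook's own function applies for losses; the function name `luce_choice_prob` and the example values are assumptions for illustration.

```python
import numpy as np

def luce_choice_prob(sv_lottery, sv_certain, mu):
    """P(choose lottery) under the Luce rule, assuming both SVs are > 0."""
    lott = np.asarray(sv_lottery, dtype=float) ** (1.0 / mu)
    sure = np.asarray(sv_certain, dtype=float) ** (1.0 / mu)
    return lott / (lott + sure)

p_equal = luce_choice_prob(3.0, 3.0, mu=0.5)   # equal SVs: indifference
p_noisy = luce_choice_prob(4.0, 2.0, mu=1.0)   # mu = 1: simple ratio of SVs
p_sharp = luce_choice_prob(4.0, 2.0, mu=0.25)  # small mu: more deterministic choice

print(p_equal, round(float(p_noisy), 4), round(float(p_sharp), 4))
```

Here mu plays the noise role: as mu shrinks, the exponent 1 / mu grows and the option with the larger SV is chosen with probability closer to 1.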

In [None]:
def binomial_likelihood_LUCE(alpha, beta, mu, y, n, lottery_value, certain_value, ambiguity, probability):
    p = probability_of_lottery_choice_LUCE(alpha, beta, mu, lottery_value, certain_value, ambiguity, probability)
    log_likelihood = (np.log(p) * y + np.log(1 - p) * (n - y)) 
    log_likelihood[log_likelihood == np.inf] = 0
    return log_likelihood


def fit_data_LUCE(data, session, domain, pid, df):
    choices = data["choice"].values
    lotteries = data["lott_amt"].values
    certain_values = data["sure_amt"].values
    ambiguities = data["amb"].values/100
    probabilities = data["prob"].values/100
    res = MLE_alpha_mu_LUCE(choices, lotteries, certain_values, ambiguities, probabilities)
    res, max_LL = MLE_beta_LUCE(choices, res[0], res[1], lotteries, certain_values, ambiguities, probabilities)
    gof = goodness_of_fit(max_LL, choices)
    df = pd_extend(df, pd.DataFrame(np.array([pid, session, domain, *res, np.round(max_LL, 4), *gof]).reshape(1, -1), columns = df.columns))
    return *res, df


def MLE_alpha_mu_LUCE(choices, lotteries, certain_values, ambiguities, probabilities):
    best_fit = None
    max_likelihood = None
    y, n = choices, np.ones(choices.shape)
    idx = np.where(ambiguities == 0)
    for alpha in np.round(np.arange(0.3, 2, 0.01), 2):
        for mu in np.round(np.arange(0.01, 1, 0.01), 2):
            likelihood = np.nansum(binomial_likelihood_LUCE(alpha, 0, mu, y[idx], n[idx], lotteries[idx], 
                                                            certain_values[idx], ambiguities[idx], probabilities[idx]))
            if max_likelihood is None or likelihood > max_likelihood:
                max_likelihood = likelihood
                best_fit = (alpha, mu)
    return best_fit


def MLE_beta_LUCE(choices, alpha, mu, lotteries, certain_values, ambiguities, probabilities):
    best_fit = None
    max_likelihood = None
    y, n = choices, np.ones(choices.shape)
    idx = np.where(ambiguities != 0)
    for beta in np.round(np.arange(-1.3, 1.31, 0.01), 2):
        likelihood = np.nansum(binomial_likelihood_LUCE(alpha, beta, mu, y[idx], n[idx], lotteries[idx], 
                                                        certain_values[idx], ambiguities[idx], probabilities[idx]))
        if max_likelihood is None or likelihood > max_likelihood:
            max_likelihood = likelihood
            best_fit = (alpha, beta, mu)
    return best_fit, max_likelihood


def probability_of_lottery_choice_LUCE(alpha, beta, mu, lottery_value, certain_value, ambiguity, probability):
    SV_lottery, SV_certain = SVs(alpha, beta, lottery_value, certain_value, ambiguity, probability)
    p = np.sign(SV_lottery) * np.abs(SV_lottery)**(1 / mu) / (np.sign(SV_lottery) * np.abs(SV_lottery)**(1 / mu) + np.sign(SV_certain) * np.abs(SV_certain)**(1 / mu))
    idx = np.where(SV_lottery < 0)[0]
    p[idx] = 1 - p[idx]
    return p

In [None]:
save_proj_dir = os.path.join(base_proj_dir, "derivatives/adult/parameter_estimation/luce")

subs = [] 
descriptives = {}

for i in ["gain", "loss", "combined"]:
    for j in ["resp_rate", "prop_lott_risk", "prop_lott_amb", "prop_lott_all",
              "conf1_ct", "conf2_ct", "conf3_ct", "conf4_ct", "conf_mean", "at_bound"]:
        descriptives["{0}_{1}".format(j, i)] = []

df_participants = pd.DataFrame(columns = ["SubID", "Session", "Domain",
                                            "Alpha", "Beta", "Mu", 
                                            "MaxLL", "R2", "AIC", "AIC0"]) 

files = sorted(glob(os.path.join(data_dir, "*.csv")))
for curr_file in files:
    sub_id = os.path.basename(curr_file)[:5]
    if sub_id not in subs:
        subs.append(sub_id)

for idx, sub in enumerate(subs):
    save_sub_dir = os.path.join(save_proj_dir, "sub-{0}".format(sub))
    os.makedirs(save_sub_dir, exist_ok = True)
    sub_files = sorted(glob(os.path.join(data_dir, "{0}*_crdm.csv".format(sub))))
    sub_cols = ["domain", "prob", "amb", "sure_amt", "lott_amt", "choice"]
    sub_df = pd.DataFrame(columns = sub_cols)
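    if len(sub_files) < 2: ## both sessions are required; this also skips subjects with no matching files
        continue
    else:
        for i in range(1, -1, -1):
            raw_df = pd.read_csv(sub_files[i])
            df = raw_df.loc[(raw_df["crdm_trial_type"] == "task") & (raw_df["crdm_choice"].notnull())]
            ## Start from an empty frame each session; otherwise the column assignments below align to the previous file's index
            sub_df = pd.DataFrame(columns = sub_cols)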
            sub_df["domain"] = df["crdm_domain"]
            sub_df["prob"] = df["crdm_lott_p"]
            sub_df["amb"] = df["crdm_amb_lev"]
            sub_df["sure_amt"] = df["crdm_sure_amt"]
            sub_df["lott_amt"] = df["crdm_lott_top"] + df["crdm_lott_bot"]
            sub_df["choice"] = df["crdm_choice"]
            sub_gains = sub_df.loc[(sub_df["domain"] == "gain")]
            sub_losses = sub_df.loc[(sub_df["domain"] == "loss")]
            df_dict = {"gain": sub_gains, "loss": sub_losses, "combined": sub_df}
            ## Both sessions run the same pipeline; only the session label differs
            session = "S1" if i == 1 else "S2"
            gf_alpha, gf_beta, gf_mu, df_participants = fit_data_LUCE(sub_gains,  session, "gain", sub, df_participants)
            lf_alpha, lf_beta, lf_mu, df_participants = fit_data_LUCE(sub_losses, session, "loss", sub, df_participants)
            cf_alpha, cf_beta, cf_mu, df_participants = fit_data_LUCE(sub_df,     session, "both", sub, df_participants)
            fits = {"gain": (gf_alpha, gf_beta, gf_mu), "loss": (lf_alpha, lf_beta, lf_mu), "combined": (cf_alpha, cf_beta, cf_mu)}
            for d in ["gain", "loss", "combined"]:
                dom_df = df_dict[d]
                if d == "combined":
                    task_df = raw_df.loc[(raw_df["crdm_trial_type"] == "task")]
                    conf_df = df
                else:
                    task_df = raw_df.loc[(raw_df["crdm_domain"] == d) & (raw_df["crdm_trial_type"] == "task")]
                    conf_df = df.loc[(df["crdm_domain"] == d)] ## confidence ratings for this domain only
                risk, amb = (dom_df["amb"] == 0), (dom_df["amb"] != 0)
                chose_lott = (dom_df["choice"] == 1)
                descriptives["resp_rate_{0}".format(d)].append(len(dom_df) / len(task_df))
                ## Denominators count only the risk (or ambiguity) trials, not every trial in the domain
                descriptives["prop_lott_risk_{0}".format(d)].append((risk & chose_lott).sum() / risk.sum())
                descriptives["prop_lott_amb_{0}".format(d)].append((amb & chose_lott).sum() / amb.sum())
                descriptives["prop_lott_all_{0}".format(d)].append(chose_lott.sum() / len(dom_df))
                for k in range(1, 5):
                    descriptives["conf{0}_ct_{1}".format(k, d)].append(int((conf_df["crdm_conf_resp.keys"] == k).sum()))
                descriptives["conf_mean_{0}".format(d)].append(np.nanmean(conf_df["crdm_conf_resp.keys"]))
                ## Flag fits at the edge of the search grids (alpha: 0.3 to 1.99, beta: -1.3 to 1.3, mu: 0.01 to 0.99)
                f_alpha, f_beta, f_mu = fits[d]
                if f_alpha <= 0.3 or f_alpha >= 1.99 or f_beta <= -1.29 or f_beta >= 1.29 or f_mu <= 0.01 or f_mu >= 0.99:
                    descriptives["at_bound_{0}".format(d)].append(1)
                else:
                    descriptives["at_bound_{0}".format(d)].append(0)
            print(sub, end = "\r")

idxG = np.where(df_participants["Domain"].values == "gain")[0]
idxL = np.where(df_participants["Domain"].values == "loss")[0]
idxC = np.where(df_participants["Domain"].values == "both")[0]
stacked_data = np.hstack((df_participants.values[idxG], df_participants.values[idxL], df_participants.values[idxC])) 
temp_df = pd.DataFrame(stacked_data, columns = ["PID", "Session", 
                                                "d1",
                                                "Alpha-G", "Beta-G", "Mu-G", "MLL-G", "R2-G", "AIC-G", "AIC0-G", 
                                                "d2", "d3", "d4", 
                                                "Alpha-L", "Beta-L", "Mu-L", "MLL-L", "R2-L", "AIC-L", "AIC0-L",
                                                "d5", "d6", "d7",
                                                "Alpha-C", "Beta-C", "Mu-C", "MLL-C", "R2-C", "AIC-C", "AIC0-C"])
df_participants = temp_df.drop(["d1", "d2", "d3", "d4", "d5", "d6", "d7"], axis = 1) 

dom = ["G", "L", "C"]
for i, d in enumerate(["gain", "loss", "combined"]):
    df_participants["RespRate-{0}".format(dom[i])] = np.round(descriptives["resp_rate_{0}".format(d)], 2)
    df_participants["RiskLottC-{0}".format(dom[i])] = np.round(descriptives["prop_lott_risk_{0}".format(d)], 2)
    df_participants["AmbigLottC-{0}".format(dom[i])] = np.round(descriptives["prop_lott_amb_{0}".format(d)], 2)
    df_participants["AllLottC-{0}".format(dom[i])] = np.round(descriptives["prop_lott_all_{0}".format(d)], 2)
    df_participants["Conf1-{0}".format(dom[i])] = descriptives["conf1_ct_{0}".format(d)]
    df_participants["Conf2-{0}".format(dom[i])] = descriptives["conf2_ct_{0}".format(d)]
    df_participants["Conf3-{0}".format(dom[i])] = descriptives["conf3_ct_{0}".format(d)]
    df_participants["Conf4-{0}".format(dom[i])] = descriptives["conf4_ct_{0}".format(d)]
    df_participants["ConfAvg-{0}".format(dom[i])] = np.round(descriptives["conf_mean_{0}".format(d)], 2)
    df_participants["ParamOut-{0}".format(dom[i])] = descriptives["at_bound_{0}".format(d)]

filename = os.path.join(save_proj_dir, "sdan-crdm-luce_modelfitGL.csv")
df_participants.to_csv(filename, columns = ["PID", "Session", "Alpha-G", "Beta-G", "Mu-G", "MLL-G", "R2-G", 
                                            "AIC-G", "AIC0-G", "RespRate-G", "RiskLottC-G", "AmbigLottC-G", "AllLottC-G", 
                                            "Conf1-G", "Conf2-G", "Conf3-G", "Conf4-G", "ConfAvg-G", "ParamOut-G",
                                            "Alpha-L", "Beta-L", "Mu-L", "MLL-L", "R2-L", 
                                            "AIC-L", "AIC0-L", "RespRate-L", "RiskLottC-L", "AmbigLottC-L", "AllLottC-L", 
                                            "Conf1-L", "Conf2-L", "Conf3-L", "Conf4-L", "ConfAvg-L", "ParamOut-L",
                                            "Alpha-C", "Beta-C", "Mu-C", "MLL-C", "R2-C", 
                                            "AIC-C", "AIC0-C", "RespRate-C", "RiskLottC-C", "AmbigLottC-C", "AllLottC-C", 
                                            "Conf1-C", "Conf2-C", "Conf3-C", "Conf4-C", "ConfAvg-C", "ParamOut-C"],
                                            index = False)
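
The `np.hstack` step above assumes the gain, loss, and combined rows of `df_participants` share the same subject and session order. One alternative, sketched here on made-up values rather than on the real fits, is to let pandas match rows by key with `set_index` and `unstack`. The toy frame `long_df`, the `suffix` mapping, and the two-parameter column set are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy long-format fits: one row per subject, session, and domain (values are made up)
long_df = pd.DataFrame({
    "SubID":   ["s001"] * 3 + ["s002"] * 3,
    "Session": ["S1"] * 6,
    "Domain":  ["gain", "loss", "both"] * 2,
    "Alpha":   [0.80, 1.10, 0.90, 0.60, 1.30, 1.00],
    "Mu":      [0.20, 0.35, 0.25, 0.40, 0.15, 0.30],
})

# Same suffixes as the notebook's wide column names
suffix = {"gain": "G", "loss": "L", "both": "C"}

# Move Domain from the rows into the columns; rows are matched on (SubID, Session)
wide = long_df.set_index(["SubID", "Session", "Domain"]).unstack("Domain")

# Flatten the (parameter, domain) column pairs into names like "Alpha-G"
wide.columns = ["{0}-{1}".format(param, suffix[dom]) for param, dom in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because the rows are matched by key, a subject missing one domain would show up as `NaN` in that domain's columns rather than shifting the other rows.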

In [None]:
df_participants