## Prepare data set from survey data for the choice model
Based on Graham (2020) and Joon (2025, https://github.com/krisjuune/cantonal-preferences-s2z/)

### Load libraries

In [30]:
import pandas as pd
import numpy as np
import os
import re
import importlib
import f_data_processing as dp
import map_data as md

dp = importlib.reload(dp)
md = importlib.reload(md)
print(os.getcwd())


c:\Users\stdt\ETH Zurich\PhD Veronika - 07_Master Thesis Natalie\10_hcm_code\natural_hazard_solidarity


### Read surveys and do first filtering

Define the file paths to the survey data in a dictionary. Further datasets can be added to the dictionary at any time so they are merged as well (e.g. after another survey round S2).


In [43]:
survey_dict = {'S0': '../data/Survey_Adaptation Natural Hazards_First Wave_raw data.csv', 
               'S1': '../data/Survey_Adaptation Natural Hazards_Second Wave_raw data.csv'}

Loop through the dictionary and apply filtering, cleaning, and type conversion. The following steps are performed:

1. Read the CSV file, clean the column names (trailing spaces, dots, naming), and convert the cell values of the required id columns to numeric.
2. Remove incomplete entries.
3. Filter out speeders and slow respondents, straightliners, and inattentive respondents.
4. Transform the required Likert/numeric columns (defined in `map_data.py`) into numeric columns.
5. Transform the required non-numeric cells to numeric values.
6. Translate choice experiment replies (choice descriptions) to English (applied after merging, see "Merging and cleanup").
7. Drop additional responses where more than two were submitted from the same IP address.
8. Delete unnecessary or non-anonymous columns.
9. Add a prefix to every column name so the different surveys can be merged.
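
The duration filter in step 3 is implemented in `dp.rm_speeders`, which is not shown in this notebook. As a rough illustration only, the sketch below flags respondents outside two duration quantiles and combines that flag with the attention check. The function name `flag_duration_outliers`, the 10%/90% cut-offs, and the toy data are assumptions made for this example, not the actual implementation.

```python
import pandas as pd


def flag_duration_outliers(df, col="duration", lower_q=0.10, upper_q=0.90):
    """Hypothetical helper: True for respondents faster or slower than the quantile cut-offs."""
    lower = df[col].quantile(lower_q)
    upper = df[col].quantile(upper_q)
    return (df[col] < lower) | (df[col] > upper)


# toy data, not taken from the survey
toy = pd.DataFrame({
    "id": range(1, 11),
    "duration": [150, 400, 500, 600, 700, 800, 900, 1000, 1100, 21320],
    "attention_check": ["Agree"] * 9 + ["Disagree"],
})

outlier = flag_duration_outliers(toy)
inattentive = toy["attention_check"] != "Agree"

# keep only rows that are neither duration outliers nor inattentive
kept = toy[~(outlier | inattentive)].copy()
print(kept["id"].tolist())
```

Combining the boolean masks with `|` before negating them mirrors the filtering pattern used in the loop below.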


In [46]:
df_dict = {}

# language region is kept to analyze eventual cultural differences within Switzerland

for s, file in survey_dict.items():
    df = pd.read_csv(file, dtype={'id': 'string'}, skiprows=[1,2])
    # replace empty cells with NA and clean up the column names
    df = df.replace('', pd.NA)
    df.columns = df.columns.str.replace('.', '', regex=False)
    df.columns = df.columns.str.replace(r'\s+', '', regex=True)
    df = dp.rename_columns(df, 'municipality', 'benefits')
    df = dp.to_num_col(df, ['id', 'm'])

    # filter only complete/valid response rows
    df = df[(df['DistributionChannel'] != 'preview') & (df['Finished'] != False)].copy()
    df = df[~df['Q_TerminateFlag'].isin(['PoorQuality', 'NA', 'QuotaMet', 'Screened'])].copy()
    
    # exclude speeders/slow respondents, straightliners, and inattentive respondents
    speedy_slowy = dp.rm_speeders(df)
    straightliners = dp.rm_straightliners(df)
    inattentives = df['attention_check'] != 'Agree'
    df_filtered = df[~(speedy_slowy | straightliners | inattentives)].copy()
    
    # transform Likert/numeric scales and non-numeric columns
    if s == 'S0':
        df_filtered[md.NUM_COLUMNS] = df_filtered[md.NUM_COLUMNS].apply(pd.to_numeric, errors="coerce")
    df_filtered = dp.string_mapping(df_filtered, md.LIKERT_MAP, md.VALID_COLUMNS, numeric=True)
    df_filtered = dp.string_mapping(df_filtered, md.LIKERT_MAP, ['climatechange_nh_1'], numeric=True)
    df_filtered = dp.string_mapping(df_filtered, md.NH_EXPERIENCE_MAP,['experience_nh'],numeric=True)
    
    # invert Likert scales for consistency
    dp.invert_likert(df_filtered, ['finan_vulnerability_1'], 6)
    
    # remove responses with duplicate IP addresses
    df_filtered = dp.filter_double_ipa(df_filtered)
    
    # drop columns that could identify respondents
    pattern = '|'.join(map(re.escape, md.ANONYMIZE_COLS))
    df_filtered = df_filtered.drop(columns=df_filtered.filter(regex=pattern).columns)
    
    # prepare for merging
    df_filtered = df_filtered.add_prefix(f'{s}_')
    df_dict[s] = df_filtered
    df_filtered.to_csv(f'../results/{s}_clean_survey.csv')
    print(len(df_filtered))
    
df_filtered


Duration: min  150  max  21320
Fastest and slowest 10% (Total): 118 respondents (<366.8s or >2754.999999999999s)
942
Duration: min  288  max  10870
Fastest and slowest 10% (Total): 60 respondents (<475.55s or >3196.1000000000004s)
509


Unnamed: 0,S1_StartDate,S1_EndDate,S1_Progress,S1_duration,S1_Finished,S1_knowledge_nh_1,S1_knowledge_nh_2,S1_knowledge_nh_3,S1_sensitivity_nh_1,S1_sensitivity_nh_2,...,S1_choice9_costs1,S1_choice9_costs2,S1_choice9_exemptions1,S1_choice9_exemptions2,S1_choice9_benefits1,S1_choice9_benefits2,S1_Q_TerminateFlag,S1_Q_R_Del,S1_screened_out,S1_SelectedLanguage
0,2025-08-30 16:55:44,2025-08-30 17:06:19,100,635,True,Yes,Yes,Yes,1.0,1.0,...,Tous les citoyens paient le même montant,Les personnes paient proportionnellement à leu...,Les personnes à faibles et moyens revenus peuv...,Les personnes à faibles et moyens revenus peuv...,Municipalités ayant une grande valeur culturel...,Niveaux de protection égaux pour toutes les mu...,Complete,,False,FR
1,2025-08-30 16:55:45,2025-08-30 17:09:24,100,818,True,Yes,No,Yes,6.0,6.0,...,Les entreprises paient proportionnellement à l...,Tous les citoyens paient le même montant,Aucun groupe n'est exempté des coûts,Aucun groupe n'est exempté des coûts,Les municipalités les plus touchées par les ri...,Niveaux de protection égaux pour toutes les mu...,Complete,,False,FR
2,2025-08-30 16:55:27,2025-08-30 17:09:25,100,837,True,Yes,Yes,Yes,2.0,2.0,...,Tous les citoyens paient le même montant,Les entreprises paient proportionnellement à l...,Les personnes à faible revenu peuvent être exe...,Aucun groupe n'est exempté des coûts,Municipalités ayant une grande valeur culturel...,Les municipalités économiquement prospères,Complete,,False,FR
3,2025-08-30 16:56:35,2025-08-30 17:11:53,100,917,True,Yes,Yes,Yes,1.0,1.0,...,Les entreprises paient proportionnellement à l...,Les entreprises paient proportionnellement à l...,Aucun groupe n'est exempté des coûts,Aucun groupe n'est exempté des coûts,Les municipalités les plus touchées par les ri...,Les municipalités économiquement prospères,Complete,,False,FR
4,2025-08-30 17:02:18,2025-08-30 17:13:04,100,645,True,Yes,Yes,Yes,3.0,4.0,...,"Menschen und Unternehmen, die von Schutzmaßnah...","Menschen und Unternehmen, die von Schutzmaßnah...",Keine Gruppen sind von den Kosten ausgenommen,mit Ausnahme von Menschen mit niedrigem Einkommen,Gemeinden mit vielen Kulturgütern wie z.B. his...,Gemeinden mit vielen Kulturgütern wie z.B. his...,Complete,,False,DE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
504,2025-09-12 12:49:21,2025-09-12 13:09:29,100,1208,True,Yes,Yes,Yes,5.0,5.0,...,Unternehmen zahlen proportional zu ihrem CO2-A...,Alle Menschen zahlen den gleichen Betrag,Keine Gruppen sind von den Kosten ausgenommen,mit Ausnahme von Menschen mit niedrigem Einkommen,"Gemeinden, in denen Menschen seit vielen Jahre...",Wirtschaftlich wohlhabende Gemeinden,Complete,,False,DE
505,2025-09-12 13:06:28,2025-09-12 13:29:18,100,1369,True,Yes,Yes,Yes,1.0,3.0,...,Menschen zahlen proportional zu ihrem CO2-Auss...,Alle Menschen zahlen den gleichen Betrag,mit Ausnahme von Menschen mit niedrigem und mi...,Keine Gruppen sind von den Kosten ausgenommen,Gleiche Schutzniveaus für alle Gemeinden,Wirtschaftlich wohlhabende Gemeinden,Complete,,False,DE
506,2025-09-12 18:23:30,2025-09-12 18:36:10,100,759,True,Yes,Yes,Yes,4.0,5.0,...,Menschen zahlen proportional zu ihrem CO2-Auss...,Menschen zahlen proportional zu ihrem Einkommen,Keine Gruppen sind von den Kosten ausgenommen,mit Ausnahme von Menschen mit niedrigem Einkommen,Gemeinden mit vielen Kulturgütern wie z.B. his...,"Gemeinden, in denen Menschen seit vielen Jahre...",Complete,,False,DE
507,2025-09-13 11:21:48,2025-09-13 11:34:14,100,746,True,Yes,Yes,Yes,2.0,3.0,...,Les personnes paient proportionnellement à leu...,Les personnes et entreprises bénéficiant des m...,Aucun groupe n'est exempté des coûts,Aucun groupe n'est exempté des coûts,Municipalités ayant une grande valeur culturel...,Les municipalités économiquement prospères,Complete,,False,FR


### Merging and cleanup
Merge the datasets on the user id so that only respondents who filled out both surveys are kept. This makes it possible to analyze the change in preferences related to a natural hazard occurrence (Blatten).
Afterwards, the remaining answers are standardized by translating them to English.
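
As a minimal sketch of why the inner merge keeps only respondents present in both waves, consider the toy example below. The `keys` frame stands in for `md.KEYS`, whose actual structure is not shown here; all ids and values are invented for illustration.

```python
import pandas as pd

# toy stand-in for md.KEYS: links wave-0 ids to wave-1 ids (assumed structure)
keys = pd.DataFrame({"S0_idx": [101, 102, 103], "S1_idx": [201, 202, 203]})

# toy wave data; respondent 102 is missing in wave 1, respondent 103 in wave 0
wave0 = pd.DataFrame({"S0_id": [101, 102, 104], "S0_age": ["50+", "18 - 34", "35 - 49"]})
wave1 = pd.DataFrame({"S1_m": [201, 203, 205], "S1_age": ["50+", "50+", "35 - 49"]})

merged = (
    keys
    .merge(wave0, how="inner", left_on="S0_idx", right_on="S0_id")
    .merge(wave1, how="inner", left_on="S1_idx", right_on="S1_m")
    .drop(columns=["S0_id", "S1_m"])
)
print(merged)
```

Only the respondent with ids 101/201 survives both inner merges, because an inner join drops every key that lacks a match on either side.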

In [47]:
# merge both surveys and check for uniqueness
# here you can always add more dataframes to merge (e.g. after another survey round S2)
merged_waves_df = (
    md.KEYS
    .merge(df_dict['S0'], how='inner', left_on='S0_idx', right_on='S0_id')
    .merge(df_dict['S1'], how='inner', left_on='S1_idx', right_on='S1_m')
    .drop(columns=['S0_id', 'S0_m', 'S1_id', 'S1_m'])
)

print(len(merged_waves_df))

# drop non-unique rows and reset unique ids
df_cleaned = merged_waves_df.drop_duplicates(subset=['S0_idx', 'S1_idx'], keep='first')
df_cleaned = df_cleaned.reset_index(drop=True)
df_cleaned['respondent_id'] = df_cleaned.index + 1

# map demographics and translate choice options
df_cleaned = dp.string_mapping(df_cleaned, md.DEMOGRAPHICS_DICT, column_patterns=[r'_gender$', r'_age$', r'_education$', r'_income$', r'_language$', r'_language_region$', r'_party_choice$'])
df_cleaned = dp.string_mapping(df_cleaned, md.TRANSLATION_DICT, column_patterns=[r'^S._choice._exemptions', r'^S._choice._costs', r'^S._choice._benefits'])


df_cleaned

504




Unnamed: 0,S0_idx,S1_idx,S0_StartDate,S0_EndDate,S0_Progress,S0_duration,S0_Finished,S0_age,S0_gender,S0_education,...,S1_choice9_costs2,S1_choice9_exemptions1,S1_choice9_exemptions2,S1_choice9_benefits1,S1_choice9_benefits2,S1_Q_TerminateFlag,S1_Q_R_Del,S1_screened_out,S1_SelectedLanguage,respondent_id
0,307963907634351,318152052069452,2025-05-08 10:24:25,2025-05-08 10:37:59,100,814,True,50+,Male,University degree,...,All people pay the same amount,Low- and middle-income earners exempted from c...,Low-income earners exempted from costs,Equal protection levels for all municipalities,Culturally valuable municipalities e.g. with h...,Complete,,False,IT,1
1,307988881480844,318078313779222,2025-05-08 15:59:17,2025-05-08 16:13:18,100,840,True,50+,Female,Below Secondary,...,People pay proportionally to their income,No groups exempted from costs,Low-income earners exempted from costs,Municipalities most affected by natural hazard...,Economically prosperous municipalities,Complete,,False,FR,2
2,307799977910308,318078398527710,2025-05-06 16:55:34,2025-05-06 17:09:39,100,844,True,50+,Female,University degree,...,Companies pay proportionally to their CO2 emis...,No groups exempted from costs,No groups exempted from costs,Municipalities in which people have lived in f...,Culturally valuable municipalities e.g. with h...,Complete,,False,DE,3
3,307988976305297,318078398527711,2025-05-08 15:57:09,2025-05-08 16:15:22,100,1092,True,50+,Male,Vocational training or apprenticeship,...,Companies pay proportionally to their CO2 emis...,Low- and middle-income earners exempted from c...,No groups exempted from costs,Municipalities in which people have lived in f...,Municipalities in which people have lived in f...,Complete,,False,DE,4
4,307799977920554,318078398527712,2025-05-06 12:47:48,2025-05-06 13:22:10,100,2061,True,50+,Male,Vocational training or apprenticeship,...,People pay proportionally to their CO2 emissions,No groups exempted from costs,Low- and middle-income earners exempted from c...,Culturally valuable municipalities e.g. with h...,Municipalities in which people have lived in f...,Complete,,False,DE,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,307799982152721,318152004294715,2025-05-06 14:12:08,2025-05-06 14:20:49,100,521,True,50+,Female,Vocational training or apprenticeship,...,People & companies being protected by protecti...,Low-income earners exempted from costs,Low- and middle-income earners exempted from c...,Municipalities most affected by natural hazard...,Municipalities most affected by natural hazard...,Complete,,False,DE,499
499,307799982224449,318152004294717,2025-05-06 14:17:32,2025-05-06 14:32:06,100,873,True,18 - 34,Female,Vocational training or apprenticeship,...,All people pay the same amount,Low-income earners exempted from costs,Low- and middle-income earners exempted from c...,Municipalities most affected by natural hazard...,Economically prosperous municipalities,Complete,,False,DE,500
500,307714474969321,318152004294719,2025-05-05 13:41:59,2025-05-05 13:54:37,100,757,True,50+,Female,Vocational training or apprenticeship,...,People & companies being protected by protecti...,No groups exempted from costs,Low-income earners exempted from costs,Municipalities most affected by natural hazard...,Municipalities in which people have lived in f...,Complete,,False,DE,501
501,307714475005954,318152004294722,2025-05-05 15:36:33,2025-05-05 15:54:43,100,1090,True,35 - 49,Male,University degree,...,Companies pay proportionally to their CO2 emis...,No groups exempted from costs,No groups exempted from costs,Municipalities most affected by natural hazard...,Economically prosperous municipalities,Complete,,False,DE,502


After merging the survey data sets:
> This leaves us with 503 respondents who filled out both surveys (504 matched rows before dropping duplicates).

In [48]:
df_cleaned.to_csv('../results/combined_surveys.csv')
df_cleaned['S0_climatechange_nh_1']

0      2.0
1      1.0
2      5.0
3      3.0
4      6.0
      ... 
498    6.0
499    6.0
500    3.0
501    6.0
502    5.0
Name: S0_climatechange_nh_1, Length: 503, dtype: float64

### Prepare conjoint data set 

Select all columns to be used in the conjoint analysis:
- The columns describing the conjoint options ```z``` 1 and 2 presented to the respondent per task ```y``` per survey ```x```.
- The chosen option for each task ```y``` per survey ```x```.
- As well as the unique ```respondent_id``` over the two surveys.

In [21]:
# select conjoint option columns for cost, benefit, and exemption
attr_pattern = r'^S\d+_choice\d+_(costs|benefits|exemptions)\d+$'
attr_cols = df_cleaned.filter(regex=attr_pattern).columns

# select conjoint choice columns
choice_pattern = r'^S\d+_\d+_conjoint_prefer$'
choice_cols = df_cleaned.filter(regex=choice_pattern).columns

# select the id columns to keep
id_cols = ['respondent_id', 'S0_idx', 'S1_idx']
if id_cols is not None:
    id_cols = [c for c in id_cols if c in df_cleaned.columns and c not in attr_cols and c not in choice_cols]
else:
    id_cols = [c for c in df_cleaned.columns if c not in attr_cols and c not in choice_cols]

Create a column per suboption (costs, exemptions, benefits) per task and survey.

Combine the suboptions of an option into one row including the ```respondent_id``` and the indicators for the survey, the task number, and the option number within the task.
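
The reshaping described above (wide columns to one row per option) can be sketched on a toy frame as follows. The column names follow the `S<x>_choice<y>_<attribute><z>` pattern used in this notebook; the attribute values are shortened placeholders, not real survey texts.

```python
import pandas as pd

# toy wide-format data: one task with two options and two attributes
wide = pd.DataFrame({
    "respondent_id": [1, 2],
    "S0_choice1_costs1": ["flat", "income"],
    "S0_choice1_costs2": ["income", "flat"],
    "S0_choice1_benefits1": ["equal", "affected"],
    "S0_choice1_benefits2": ["affected", "equal"],
})

# wide -> long: one row per respondent and attribute column
long = wide.melt(id_vars=["respondent_id"], var_name="var", value_name="value")

# split the column name into survey, task, attribute and option number
parts = long["var"].str.extract(r"S(\d+)_choice(\d+)_(costs|benefits|exemptions)(\d+)")
parts.columns = ["nh_event", "task", "attribute", "option"]
long = pd.concat([long, parts], axis=1)
long[["nh_event", "task", "option"]] = long[["nh_event", "task", "option"]].astype(int)

# long -> one row per respondent, survey, task and option
options = long.pivot(
    index=["respondent_id", "nh_event", "task", "option"],
    columns="attribute",
    values="value",
).reset_index()
options.columns.name = None
print(options)
```

`pivot` raises an error if an index/attribute combination occurs more than once, which doubles as a check that every option has exactly one value per attribute.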

In [22]:
# create df from which to start building conjoint df
df_attr = df_cleaned.melt(id_vars=id_cols, value_vars=attr_cols, var_name='var', value_name='value')

# for each respondent and survey (nh_event yes/no) create a row with a suboption, the option number of a task, as well as the task number
df_attr[['nh_event', 'task', 'attribute', 'option']] = df_attr['var'].str.extract(r'S(\d+)_choice(\d+)_(costs|benefits|exemptions)(\d+)')
df_attr['nh_event'] = df_attr['nh_event'].astype(int)
df_attr['task']     = df_attr['task'].astype(int)
df_attr['option']   = df_attr['option'].astype(int)

# combine suboptions per survey, respondent, task and option into one row
df_option = df_attr.pivot(
    index=id_cols + ['nh_event', 'task', 'option'],
    columns='attribute',
    values='value'
).reset_index().copy()

df_option.columns.name = None
df_option


Unnamed: 0,respondent_id,S0_idx,S1_idx,nh_event,task,option,benefits,costs,exemptions
0,1,307963907634351,318152052069452,0,1,1,Municipalities most affected by natural hazard...,People pay proportionally to their income,Low-income earners exempted from costs
1,1,307963907634351,318152052069452,0,1,2,Municipalities in which people have lived in f...,People pay proportionally to their income,Low-income earners exempted from costs
2,1,307963907634351,318152052069452,0,2,1,Economically prosperous municipalities,People pay proportionally to their CO2 emissions,Low- and middle-income earners exempted from c...
3,1,307963907634351,318152052069452,0,2,2,Economically prosperous municipalities,People pay proportionally to their CO2 emissions,Low- and middle-income earners exempted from c...
4,1,307963907634351,318152052069452,0,3,1,Culturally valuable municipalities e.g. with h...,People pay proportionally to their CO2 emissions,Low- and middle-income earners exempted from c...
...,...,...,...,...,...,...,...,...,...
15085,503,308339135861871,318152004294728,1,7,2,Equal protection levels for all municipalities,All people pay the same amount,No groups exempted from costs
15086,503,308339135861871,318152004294728,1,8,1,Municipalities most affected by natural hazard...,People pay proportionally to their CO2 emissions,No groups exempted from costs
15087,503,308339135861871,318152004294728,1,8,2,Municipalities most affected by natural hazard...,People & companies being protected by protecti...,Low- and middle-income earners exempted from c...
15088,503,308339135861871,318152004294728,1,9,1,Culturally valuable municipalities e.g. with h...,People & companies being protected by protecti...,Low- and middle-income earners exempted from c...


Create a dataframe with one row per ```respondent_id```, survey ```x``` (pre/post), and task ```y``` that records which option ```z``` was chosen in that task.

In [None]:
# create one row per respondent, task and survey with the choice value
df_choice = df_cleaned.melt(id_vars=id_cols, value_vars=choice_cols, var_name='pref_var', value_name='choice')

# map the choice to the numeric option number (1 or 2) and convert to integer
df_choice = dp.string_mapping(df_choice, mapping_dict=md.PREFERENCE_MAP, column_patterns=['choice'], numeric=True)

# pref_var looks like this, e.g.: S1_3_conjoint_prefer
df_choice[['nh_event', 'task']] = df_choice['pref_var'].str.extract(r'S(\d+)_(\d+)_conjoint_prefer')

df_choice['nh_event'] = df_choice['nh_event'].astype(int)
df_choice['task']     = df_choice['task'].astype(int)
df_choice             = df_choice[id_cols + ['nh_event', 'task', 'choice']]
df_choice




Unnamed: 0,respondent_id,S0_idx,S1_idx,nh_event,task,choice
0,1,307963907634351,318152052069452,0,1,1
1,2,307988881480844,318078313779222,0,1,2
2,3,307799977910308,318078398527710,0,1,2
3,4,307988976305297,318078398527711,0,1,1
4,5,307799977920554,318078398527712,0,1,1
...,...,...,...,...,...,...
7540,499,307799982152721,318152004294715,1,9,2
7541,500,307799982224449,318152004294717,1,9,1
7542,501,307714474969321,318152004294719,1,9,1
7543,502,307714475005954,318152004294722,1,9,2


Merge the choice dataframe with the option dataframe to obtain the dataset for the choice model.

The merged dataframe indicates, for every option ```z``` in a conjoint task ```y``` of a survey ```x```, whether the option was chosen (1) or not (0).
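
A minimal sketch of this step on toy data: each option row receives the respondent's choice for its task, and the indicator is 1 where the option number equals the chosen option number. All values are invented for illustration.

```python
import pandas as pd

# toy option table: two respondents, one task, two options each
options = pd.DataFrame({
    "respondent_id": [1, 1, 2, 2],
    "nh_event": [0, 0, 0, 0],
    "task": [1, 1, 1, 1],
    "option": [1, 2, 1, 2],
    "costs": ["flat", "income", "income", "flat"],
})

# toy choice table: the option number each respondent picked in the task
choices = pd.DataFrame({
    "respondent_id": [1, 2],
    "nh_event": [0, 0],
    "task": [1, 1],
    "choice": [1, 2],
})

merged = options.merge(choices, on=["respondent_id", "nh_event", "task"], how="left")

# 1 if this row's option is the one the respondent chose, else 0
merged["chosen"] = (merged["choice"] == merged["option"]).astype(int)

# sanity check: exactly one chosen option per respondent, survey and task
per_task = merged.groupby(["respondent_id", "nh_event", "task"])["chosen"].sum()
print(merged[["respondent_id", "task", "option", "chosen"]])
```

Checking that `per_task` equals 1 everywhere is a cheap way to catch missing or unmapped choices after the left merge.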

In [24]:
# merge dfs
df_option_choice = df_option.merge(
    df_choice,
    on=id_cols + ['nh_event', 'task'],
    how='left'
)

# flag an option as chosen (1) if its option number matches the respondent's choice in that task, else 0
df_option_choice['chosen'] = (df_option_choice['choice'] == df_option_choice['option']).astype(int)
df_option_choice.drop(columns=['choice', 'S0_idx', 'S1_idx'],inplace=True)

# save to csv
df_option_choice.to_csv('../results/conjoint_df.csv')
df_option_choice


Unnamed: 0,respondent_id,nh_event,task,option,benefits,costs,exemptions,chosen
0,1,0,1,1,Municipalities most affected by natural hazard...,People pay proportionally to their income,Low-income earners exempted from costs,1
1,1,0,1,2,Municipalities in which people have lived in f...,People pay proportionally to their income,Low-income earners exempted from costs,0
2,1,0,2,1,Economically prosperous municipalities,People pay proportionally to their CO2 emissions,Low- and middle-income earners exempted from c...,1
3,1,0,2,2,Economically prosperous municipalities,People pay proportionally to their CO2 emissions,Low- and middle-income earners exempted from c...,0
4,1,0,3,1,Culturally valuable municipalities e.g. with h...,People pay proportionally to their CO2 emissions,Low- and middle-income earners exempted from c...,1
...,...,...,...,...,...,...,...,...
15085,503,1,7,2,Equal protection levels for all municipalities,All people pay the same amount,No groups exempted from costs,1
15086,503,1,8,1,Municipalities most affected by natural hazard...,People pay proportionally to their CO2 emissions,No groups exempted from costs,0
15087,503,1,8,2,Municipalities most affected by natural hazard...,People & companies being protected by protecti...,Low- and middle-income earners exempted from c...,1
15088,503,1,9,1,Culturally valuable municipalities e.g. with h...,People & companies being protected by protecti...,Low- and middle-income earners exempted from c...,1


In [10]:
# key = id_cols + ['nh_event', 'choice', 'option', 'attribute']

# dups = df_option.duplicated(key, keep=False)
# print('Duplicate rows for pivot:', dups.sum())

# # Show a few examples
# df_option.loc[dups, key + ['var', 'value']].sort_values(key).head(30)

In [None]:
# check IRR
# IRR survey pre-Blatten
IRR_S0_task1_choice = df_choice[(df_choice['task'] == 1)&(df_choice['nh_event'] == 0)][['respondent_id', 'nh_event','choice']]
IRR_S0_task6_choice = df_choice[(df_choice['task'] == 6)&(df_choice['nh_event'] == 0)][['respondent_id', 'nh_event','choice']]
IRR_S0_choice = pd.merge(IRR_S0_task1_choice, IRR_S0_task6_choice, on='respondent_id', suffixes=('_first', '_last'))
# dp.calc_IRR(IRR_S0_choice)

# IRR survey post-Blatten
IRR_S1_task1_choice = df_choice[(df_choice['task'] == 1)&(df_choice['nh_event'] == 1)][['respondent_id', 'nh_event','choice']]
IRR_S1_task9_choice = df_choice[(df_choice['task'] == 9)&(df_choice['nh_event'] == 1)][['respondent_id', 'nh_event','choice']]
IRR_S1_choice = pd.merge(IRR_S1_task1_choice, IRR_S1_task9_choice, on='respondent_id', suffixes=('_first', '_last'))
# dp.calc_IRR(IRR_S1_choice)

responses = IRR_S1_choice['respondent_id'].nunique()

# get consistent answers (the options were swapped between the first and last task,
# so a different option number (!=) means the same underlying option was chosen)
same_choice = IRR_S1_choice[(IRR_S1_choice['choice_first'] != IRR_S1_choice['choice_last'])]
num_same_choice = len(same_choice)

# get inconsistent answers (same option number, i.e. the underlying choice changed)
different_choice = IRR_S1_choice[(IRR_S1_choice['choice_first'] == IRR_S1_choice['choice_last'])]
num_different_choice = len(different_choice)

# calculate IRR
IRR_choice = (num_same_choice) / (num_same_choice + num_different_choice)
print(f'IRR: {IRR_choice}')

from scipy.stats import norm

# standard error of a proportion (normal approximation)
sqer_IRR_choice = np.sqrt((IRR_choice * (1 - IRR_choice)) / responses)
z_crit = norm.ppf(0.975)  # 95% confidence interval
CI_plus = IRR_choice + (z_crit * sqer_IRR_choice)
CI_minus = IRR_choice - (z_crit * sqer_IRR_choice)

print(f'CI Plus: {CI_plus}')
print(f'CI Minus: {CI_minus}')


IRR: 0.7077534791252486
CI Plus: 0.747498230062467
CI Minus: 0.6680087281880301


In [26]:
# Krippendorff's alpha and Cohen's kappa
from sklearn.metrics import cohen_kappa_score
import krippendorff

# inversion to account for inversion of options in conjoint
IRR_S0_choice['choice_last_invert'] = 3 - IRR_S0_choice['choice_last']

c_kappa = cohen_kappa_score(IRR_S0_choice['choice_first'], IRR_S0_choice['choice_last_invert'])

k_alpha = krippendorff.alpha(np.array([IRR_S0_choice['choice_first'].to_numpy(), IRR_S0_choice['choice_last_invert'].to_numpy()]))

print(f"Choice consistency - Cohen's kappa: {c_kappa:.2f} and Krippendorff's alpha: {k_alpha:.2f}")


Choice consistency - Cohen's kappa: 0.51 and Krippendorff's alpha: 0.51


Moderate consistency (Cohen's kappa and Krippendorff's alpha are both 0.51). Open question: should the inconsistent respondents be removed as well?

A value of 0.51 does not by itself mean the model is "bad". It means respondents are not perfectly consistent, which is normal in conjoint/choice tasks because:
- choices are noisy,
- some pairs are near-indifferent,
- people use heuristics or get fatigued.

Checks to run next:
- Raw percent agreement (how often the same choice is repeated)
- Confusion matrix (how often flips happen and in which direction)
- Prevalence / imbalance (kappa and alpha can look lower if one option is chosen much more often)
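
The three checks can be sketched as follows. The data are a small toy example, not survey results; the column names mirror `choice_first` and `choice_last_invert` from the cell above, and Cohen's kappa is computed by hand to show how the marginal prevalence enters the chance-agreement term.

```python
import pandas as pd

# toy repeated-task choices (option numbers 1/2), already inverted for the swapped task
toy = pd.DataFrame({
    "choice_first":       [1, 1, 2, 2, 1, 2, 1, 1],
    "choice_last_invert": [1, 1, 2, 1, 1, 2, 2, 1],
})

# raw percent agreement: share of respondents repeating their choice
agreement = (toy["choice_first"] == toy["choice_last_invert"]).mean()

# confusion matrix: rows = first task, columns = repeated task
confusion = pd.crosstab(toy["choice_first"], toy["choice_last_invert"])

# prevalence: how often each option is chosen in each task
prev_first = toy["choice_first"].value_counts(normalize=True)
prev_last = toy["choice_last_invert"].value_counts(normalize=True)

# Cohen's kappa by hand: observed agreement corrected for chance agreement
expected = sum(prev_first[k] * prev_last.get(k, 0.0) for k in prev_first.index)
kappa = (agreement - expected) / (1 - expected)

print(f"agreement: {agreement:.2f}, expected by chance: {expected:.2f}, kappa: {kappa:.2f}")
print(confusion)
```

The more unbalanced the prevalence, the larger the chance-agreement term becomes, so kappa can drop even when the raw agreement stays the same.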