# Estimate embedding spaces based on MTurk data

This notebook reads a CSV file of data from the MTurk study, parses it, and filters it to participants who completed the study. It also scores the initial object test and the catch trials presented during the similarity judgments. It then uses the PsiZ package to infer the psychological representation driving the similarity judgments.

In [1]:
import os
import sys
import numpy as np
import pandas as pd
module_path = os.path.abspath('..')
if module_path not in sys.path:
    sys.path.append(module_path)
from wikisim import simtask
from wikisim import embed

data_dir = '/Users/morton/Dropbox/data/bender'
work_dir = '/Users/morton/Dropbox/work/bender/mturk'
model_dir = '/Users/morton/Dropbox/work/bender/batch/models3'

pool_file = os.path.join(data_dir, 'stimuli', 'stimuli.csv')
tab_file = os.path.join(data_dir, 'mturk', 'data', 'Data_03.13.20.csv')


## Load pool information and MTurk data

MTurk data are downloaded from the experiment's data page. The pool information is stored in a CSV file that includes the name, category, and subcategory of each stimulus.

In [2]:
data, pool = simtask.read_mturk(tab_file, pool_file)

## Summarize behavioral data

Each participant completes only one condition. The conditions are:
1. face semantic
2. scene semantic
3. face visual
4. scene visual

Tests and catch trials are scored and included in the summary:
* familiarity: mean familiarity rating over all items (1: never heard of it; 4: know all about it)
* test: fraction of object practice test trials answered correctly
* catch: fraction of catch trials (i.e., similarity trials where one of the choices is a reversed version of the prompt item) answered correctly. A catch trial is scored as correct if the reversed image was either of the two responses
* vis: post-experiment rating of how often visual similarity affected their judgments (1: 0-20%; 5: 80-100%)
* sem: post-experiment rating of how often conceptual similarity affected their judgments

The last five columns indicate the time in minutes that each phase took. The instruct column includes all instruction screens.
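
As a rough illustration of the catch-trial rule described above, the sketch below scores a few made-up trials. The column names (`reversed_pos`, `response1`, `response2`) and the `score_catch` helper are hypothetical and are not taken from `wikisim.simtask`; the actual scoring behind `session_summary` may be organized differently.

```python
import pandas as pd


def score_catch(trials):
    """Return the fraction of catch trials answered correctly per subject.

    A trial counts as correct if either of the two responses selected the
    reversed image. Column names are hypothetical, for illustration only.
    """
    hit = (trials['response1'] == trials['reversed_pos']) | (
        trials['response2'] == trials['reversed_pos']
    )
    return hit.groupby(trials['subject']).mean()


# made-up catch trials for two subjects
trials = pd.DataFrame({
    'subject': ['s1', 's1', 's2', 's2'],
    'reversed_pos': [3, 5, 2, 7],
    'response1': [3, 1, 4, 6],
    'response2': [0, 5, 1, 7],
})
catch = score_catch(trials)
```

With the inclusion threshold used in this notebook (`catch > .5`), a subject who finds the reversed image on only half of the catch trials would not be included.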

In [3]:
# get summary of participants who finished
pd.set_option('display.max_rows', None)
summary_raw = simtask.session_summary(data)
summary = summary_raw.dropna().copy()

# completed participants: finished the study and are in the screened age range
completed = summary.query('age >= 22 and age <= 34').index

# included participants: completed, with reasonable test and catch performance
include = summary.query('test > .6 and catch > .5 and age >= 22 and age <= 34').index
summary.loc[:, 'include'] = 0
summary.loc[include, 'include'] = 1

summary = summary.sort_values(by=['condition', 'start_time'])
summary.loc[completed].to_csv(os.path.join(work_dir, 'mturk_summary.csv'))

In [5]:
f'Participants included: {summary.include.sum()} / {summary.shape[0]}'

'Participants included: 102 / 151'

In [6]:
# get participants with reasonable performance
included = summary.query('include == 1')
included.groupby('condition')['start_time'].count()

condition
1    26
2    25
3    25
4    26
Name: start_time, dtype: int64

In [7]:
fam_include = data['fam']['subject'].isin(included.index)
fam = data['fam'].loc[fam_include]
fam.to_csv(os.path.join(work_dir, 'mturk_fam.csv'))

## Demographics

Participants were screened to be self-reported native English speakers aged 22-34 (the age range used in the original study, aged up).

In [None]:
data['dem'].reindex(index=completed).to_csv(
    os.path.join(work_dir, 'mturk_dem_completed.csv'))

dem = data['dem'].reindex(index=included.index)
dem.to_csv(os.path.join(work_dir, 'mturk_dem_included.csv'))
dem

In [9]:
# complete participants
complete = data['dem'].reindex(index=completed)
print(f'n={complete.shape[0]}')
print(complete['gender'].value_counts())
print(complete['age'].agg(['mean', 'std', 'min', 'max']))

n=150
Male      82
Female    67
Other      1
Name: gender, dtype: int64
mean    29.293333
std      3.324803
min     22.000000
max     34.000000
Name: age, dtype: float64


In [10]:
# included participants
print(f'n={dem.shape[0]}')
print(dem['gender'].value_counts())
print(dem['age'].agg(['mean', 'std', 'min', 'max']))

n=102
Male      57
Female    44
Other      1
Name: gender, dtype: int64
mean    29.392157
std      3.285603
min     22.000000
max     34.000000
Name: age, dtype: float64


## Post-experiment questionnaire

Full answers to all post-experiment questions from included participants. The "clear" column gives a rating (1-5) of how frequently there was a clear answer on the similarity judgment trials.

In [None]:
pd.set_option('display.max_colwidth', None)
deb = data['deb'].reindex(index=included.index)
deb = pd.concat((included.start_time, deb), axis=1).copy()
deb = deb.sort_values(by=['condition', 'start_time'])
deb.to_csv(os.path.join(work_dir, 'mturk_debrief.csv'))
deb

## Estimate embedding for each condition

We estimate a separate embedding for each condition.
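
To give a sense of what an embedding represents, here is a minimal sketch of how item coordinates can predict a similarity choice. It assumes an exponential similarity kernel and a ratio choice rule, which are common in this kind of modeling. It is not PsiZ's implementation, and `choice_prob` and `beta` are hypothetical names used only for illustration.

```python
import numpy as np


def choice_prob(z, prompt, choices, beta=1.0):
    """Probability of choosing each candidate as most similar to the prompt.

    z : (n_items, n_dim) array of embedding coordinates.
    Assumes similarity = exp(-beta * Euclidean distance); this kernel is an
    illustrative assumption, not the model fitted in this notebook.
    """
    dist = np.linalg.norm(z[choices] - z[prompt], axis=1)
    sim = np.exp(-beta * dist)
    return sim / sim.sum()


# toy 2D embedding: item 1 is close to item 0, item 2 is far away
z = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
p = choice_prob(z, prompt=0, choices=[1, 2])
```

In this toy example the nearby item receives the higher choice probability. Fitting an embedding amounts to adjusting the coordinates so that predicted choices like these match the observed judgments.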

In [12]:
# save similarity judgment data for included participants
sim = data['sim'].loc[np.isin(data['sim'].subject, included.index)]

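# exclude catch trials; copy so that columns can be added later without
# triggering a pandas SettingWithCopyWarning
sim = sim.loc[sim.trial_type == 'similarity'].copy()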
sim_file = os.path.join(work_dir, 'mturk_sim.csv')
sim.to_csv(sim_file)

In [13]:
# flag trials where the stimulus familiarity rating was above 1 (never heard of it)
sim.loc[:, 'include'] = sim['stim_fam'] > 1
sim.groupby(['condition', 'subject'])['include'].sum().agg(['min', 'max', 'mean', 'std'])

min     19.000000
max     80.000000
mean    72.686275
std     11.034581
Name: include, dtype: float64

In [14]:
sim.groupby('condition')['include'].sum()

condition
1    1999.0
2    1783.0
3    1824.0
4    1808.0
Name: include, dtype: float64

In [15]:
emb = embed.cond_embed(sim, pool, n_dim=6)

    Elapsed time: 0:02:19
    Elapsed time: 0:03:42
    Elapsed time: 0:02:05
    Elapsed time: 0:02:16


## Save embedding models for further analysis

The embedding models are not transferable across categories. However, because each of the other models applies across categories, the face and scene embeddings for each task are saved to a single file for ease of use. The two resulting models are called SEM (semantic encoding model) and VEM (visual encoding model).
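
The "rdm" in `save_embed_rdm` presumably refers to a representational dissimilarity matrix. As a sketch of the idea only, assuming dissimilarity is the Euclidean distance between embedding coordinates (the actual `wikisim.embed` code may use a different measure or layout), an RDM can be computed from coordinates as follows. `embed_rdm` is a hypothetical helper.

```python
import numpy as np


def embed_rdm(z):
    """Pairwise Euclidean distances between rows of z (n_items, n_dim).

    Using Euclidean distance is an assumption made for illustration.
    """
    diff = z[:, np.newaxis, :] - z[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# toy coordinates for three items in a 2D embedding
z = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
rdm = embed_rdm(z)
```

The result is a symmetric items-by-items matrix with zeros on the diagonal, where larger values indicate items that lie farther apart in the embedding.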

In [16]:
embed.save_embed_rdm(model_dir, 'sem', emb, [1, 2], pool['stim'].tolist())
embed.save_embed_rdm(model_dir, 'vem', emb, [3, 4], pool['stim'].tolist())

In [17]:
for cond, e in emb.items():
    embed_file = os.path.join(work_dir, f'mturk_embed_cond{cond}.hdf5')
    e.save(embed_file)