# Preliminary exploratory notebook

Before collecting more data, I want to look through the pilot data so far and check whether any particular anagrams, response behaviors, or patterns need to be addressed.

Key things to check:
    - do people respond with real words?
    - do people respond with the intended correct word?
    - what kinds of response times (RTs) do we see?
    - are some anagrams simply too difficult?

In [1]:
import pandas as pd
# Update to the name of the file to be read.
# header is the row that holds the column names (zero-indexed; verify which row is the header before running).
# sep is the separator used in the data.
df = pd.read_csv("anagram_rating_pilot_filtered_20241031_1616.csv", header=0, sep=',')

Let's make a table with the anagram id number in column A, the anagram in B, the correct word in C, and the responses in D.

The cell below will later become a function named get_response_df that builds the response DataFrame.

In [2]:
grammerdf = df.filter(['id', 'anagram', 'response']).dropna()
grammerdf = grammerdf[~((grammerdf['id'] == 'practice') | (grammerdf['id']== 'end_confirm_subjid'))]
grammerdf_grouped = grammerdf.groupby('id').agg({'anagram': 'first', 'response': ' '.join}).reset_index()
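
As a rough sketch of how the cell above might be wrapped into get_response_df: the `exclude_ids` parameter and the toy data are assumptions added for illustration, not part of the pilot pipeline.

```python
import pandas as pd

def get_response_df(df, exclude_ids=("practice", "end_confirm_subjid")):
    """Return one row per anagram id with all responses joined by spaces."""
    # Keep only the needed columns and drop rows with any missing value
    responses = df.filter(["id", "anagram", "response"]).dropna()
    # Drop non-trial rows (practice trials and the end-of-study confirmation)
    responses = responses[~responses["id"].isin(exclude_ids)]
    # Collapse to one row per id: first anagram string, responses space-joined
    return (
        responses.groupby("id")
        .agg({"anagram": "first", "response": " ".join})
        .reset_index()
    )

# Toy data with made-up values, only to show the shape of the output
toy = pd.DataFrame({
    "id": ["practice", "a1", "a1", "a2"],
    "anagram": ["tca", "odg", "odg", "ibrd"],
    "response": ["cat", "dog", "god", "bird"],
})
result = get_response_df(toy)
```

Calling `get_response_df(df)` on the pilot data should give the same table as `grammerdf_grouped`, provided the responses are all strings.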

Grab the stimuli file from the experiment and add it to the working directory. Using Node in the terminal, run the JavaScript to convert it to JSON so that it can be loaded into this notebook.

In [3]:
#import the stimuli file from the experiment (since this ensures the same data and lists)
import json
with open('stimuli.json') as f:
    stimuli = json.load(f)

# Convert the loaded stimuli to a DataFrame and keep one row per unique stimulus
stimuli = pd.DataFrame(stimuli)
stimulidf = stimuli.filter(['id', 'anagram', 'correct', 'valid']).drop_duplicates()
#merge the two dataframes
merged = pd.merge(grammerdf_grouped, stimulidf, on='id', how='left').drop_duplicates()
merged = merged.filter(['id', 'anagram_x', 'correct', 'response', 'valid']).drop_duplicates()

# remove the bracket and '' before setting the valid col to a list of strings
merged['valid'] = merged['valid'].apply(lambda x: x.replace('[', '').replace(']', '').replace('\'', ''))
merged['valid'] = merged['valid'].apply(lambda x: x.split(', '))
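
A possible alternative to the character stripping above, assuming each `valid` cell is a string that looks like a Python list literal (for example "['dog', 'god']"). The helper name `parse_valid` is hypothetical, and it expects strings or lists, not missing values.

```python
import ast

def parse_valid(cell):
    """Turn a cell such as "['dog', 'god']" into a list of strings."""
    # Already a list (e.g. the JSON stored a real array): just normalise to str
    if isinstance(cell, list):
        return [str(w) for w in cell]
    try:
        # literal_eval safely parses Python literals without executing code
        parsed = ast.literal_eval(cell)
        if isinstance(parsed, (list, tuple)):
            return [str(w) for w in parsed]
    except (ValueError, SyntaxError):
        pass
    # Fallback for unquoted text such as "[dog, god]": strip brackets and quotes
    stripped = cell.strip().strip("[]")
    return [w.strip().strip("'\"") for w in stripped.split(",") if w.strip()]
```

It could be applied as `merged['valid'].apply(parse_valid)`. Unlike splitting on `', '`, this does not depend on the exact spacing after each comma.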


Now that the correct column has been added, let's start getting some information about the anagrams, beginning with the proportion of responses that match the correct word.

Once we have that, we can start thinking about the RTs.
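
The proportion described above can also be sketched in long format (one row per individual response). This is a minimal illustration on made-up toy data, assuming `correct` holds a single string per anagram; it is not the notebook's actual computation.

```python
import pandas as pd

# Toy stand-in for `merged` (made-up values): one row per anagram
toy = pd.DataFrame({
    "id": ["a1", "a2"],
    "correct": ["dog", "bird"],
    "response": ["dog god idk dog", "drib idk"],
})

# Split the space-joined responses and give each response its own row
long_df = toy.assign(response=toy["response"].str.split()).explode("response")

# Flag responses that exactly match the intended word
long_df["is_correct"] = long_df["response"] == long_df["correct"]

# The mean of a boolean column is the proportion of True values
match_ratio = long_df.groupby("id")["is_correct"].mean()
```

Here two of the four responses to a1 match, and none of the responses to a2 do.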

In [6]:
# Largest number of responses recorded for any single anagram
max_collength = merged.response.str.split().apply(len).max()


In [None]:
# Split on any whitespace (the same rule used for max_collength) so each response gets its own column
anagram_stat = merged['response'].str.split(expand=True)
anagram_stat_columns = [f"response{i+1}" for i in range(max_collength)]
anagram_stat.columns = anagram_stat_columns
anagram_stat = pd.concat([merged, anagram_stat], axis=1)

# Count the number of responses that match the 'correct' column
# (assumes 'correct' holds a single string per anagram; isin() would reject a bare string)
anagram_stat['match_count'] = anagram_stat[anagram_stat_columns].apply(lambda row: (row == anagram_stat.loc[row.name, 'correct']).sum(), axis=1)
# Insert the new column next to the 'id' column
id_index = anagram_stat.columns.get_loc('id')
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('match_count')))
anagram_stat = anagram_stat[cols]

# Count the number of responses that are not empty
anagram_stat['response_count'] = anagram_stat[anagram_stat_columns].apply(lambda row: row.notnull().sum(), axis=1)
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('response_count')))
anagram_stat = anagram_stat[cols]


# Compute the ratio of correct responses to the total number of responses, ie. match_count / response_count
anagram_stat['match_ratio'] = anagram_stat['match_count'] / anagram_stat['response_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('match_ratio')))
anagram_stat = anagram_stat[cols]

# Compute the number of responses that are not empty and do not match the correct response
anagram_stat['non_match_count'] = anagram_stat['response_count'] - anagram_stat['match_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('non_match_count')))
anagram_stat = anagram_stat[cols]



# Use the 'valid' column to count how many responses are valid words, and add the count to the df
anagram_stat['valid_count'] = anagram_stat[anagram_stat_columns].apply(lambda row: row.isin(anagram_stat.loc[row.name, 'valid']).sum(), axis=1)
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('valid_count')))
anagram_stat = anagram_stat[cols]

# Use the valid_count to compute the ratio of valid responses to the total number of responses, ie. valid_count / response_count
anagram_stat['valid_ratio'] = anagram_stat['valid_count'] / anagram_stat['response_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('valid_ratio')))
anagram_stat = anagram_stat[cols]


# Count the number of 'idk' responses and add the count as a column of the df
anagram_stat['idk_count'] = anagram_stat[anagram_stat_columns].apply(lambda row: (row == 'idk').sum(), axis=1)
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('idk_count')))
anagram_stat = anagram_stat[cols]

# Use the idk_count to compute the ratio of 'idk' responses to the total number of responses, ie. idk_count / response_count
anagram_stat['idk_ratio'] = anagram_stat['idk_count'] / anagram_stat['response_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('idk_ratio')))
anagram_stat = anagram_stat[cols]


# make a plot of the idk ratio and the response time

# To do
    - plot the idk ratio against response time
    - get the per-subject distribution of idk proportions (how often each subject answers idk)
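
One way the two to-do items might be approached, sketched on made-up trial-level data. The column names `subject_id` and `rt` are hypothetical placeholders; substitute whatever the pilot file actually uses.

```python
import pandas as pd

# Toy trial-level data (made-up values); `subject_id` and `rt` are
# hypothetical column names, not taken from the pilot file.
trials = pd.DataFrame({
    "subject_id": ["s1", "s1", "s2", "s2"],
    "id": ["a1", "a2", "a1", "a2"],
    "response": ["dog", "idk", "idk", "idk"],
    "rt": [2000, 9000, 8000, 10000],
})

# Flag each trial on which the participant answered 'idk'
trials["is_idk"] = trials["response"] == "idk"

# Per-subject proportion of idk responses
idk_by_subject = trials.groupby("subject_id")["is_idk"].mean()

# Per-anagram idk ratio alongside mean response time, ready for plotting
per_anagram = trials.groupby("id").agg(
    idk_ratio=("is_idk", "mean"),
    mean_rt=("rt", "mean"),
)
```

With matplotlib installed, `per_anagram.plot.scatter(x="mean_rt", y="idk_ratio")` would draw the first plot, and `idk_by_subject.hist()` would show the per-subject distribution.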


In [5]:
# Save the DataFrame to a CSV file, dropping only the raw response column (it is not needed)
anagram_stat.drop(['response'], axis=1).to_csv('anagram_stat.csv', index=False)
