# Preliminary exploratory notebook

From the pilot, I want to look through the data so far and see whether there are particular anagrams, responses, or patterns that we need to address before collecting more.

Key things to see:

- Do people respond with real words?
- Do people respond with the intended correct word?
- What kinds of RTs do we see?
- Are some anagrams simply too difficult?

In [1]:
import pandas as pd
# update to the name of the file to be read
# header is the row that holds the column names (zero-indexed; verify which row is the header before running)
# sep is the separator used in the data
df = pd.read_csv("anagram_rating_pilot_filtered_20240808_1713.csv", header=0, sep=',')

Let's make a file that has the anagram id number in column A, the anagram in B, the correct word in C, and the responses in D.

In [2]:
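# keep only the columns we need and drop rows with missing values
grammerdf = df.filter(['id', 'anagram', 'response']).dropna()
# drop the practice rows and the 'end_confirm_subjid' rows, which are not anagram trials
grammerdf = grammerdf[~((grammerdf['id'] == 'practice') | (grammerdf['id']== 'end_confirm_subjid'))]
# one row per anagram id: keep the anagram and join all of its responses into one space-separated string
grammerdf_grouped = grammerdf.groupby('id').agg({'anagram': 'first', 'response': ' '.join}).reset_index()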


From the experiment, grab the stimuli file and add it to the working directory. Using node in the terminal, run the JavaScript to convert it to JSON so that we can bring it into this notebook.

In [3]:
#import the stimuli file from the experiment (since this ensures the same data and lists)
import json
with open('stimuli.json') as f:
    stimuli = json.load(f)

#convert the loaded JSON records to a DataFrame and keep only the columns we need
stimuli = pd.DataFrame(stimuli)
stimulidf = stimuli.filter(['id', 'anagram', 'correct']).drop_duplicates()
#merge the two dataframes on id; 'anagram' exists in both, so pandas suffixes them as anagram_x (responses) and anagram_y (stimuli)
merged = pd.merge(grammerdf_grouped, stimulidf, on='id', how='left').drop_duplicates()
merged = merged.filter(['id', 'anagram_x', 'correct', 'response'])


Let's start getting some information about the anagrams. Now that we have the correct column added, let's look at the proportion of responses that match the correct one.
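
As a minimal sketch of that proportion, the helper below works on one space-joined response string at a time, so it does not depend on how many responses an anagram received. The name `match_ratio` is hypothetical, the example values are made up, and it assumes responses are joined with spaces as in the grouping step above.

```python
def match_ratio(joined_responses: str, correct: str) -> float:
    """Fraction of space-separated responses that equal the intended word."""
    responses = joined_responses.split()  # split on any run of whitespace
    if not responses:
        return float("nan")  # no responses, so the ratio is undefined
    matches = sum(1 for r in responses if r == correct)
    return matches / len(responses)


# Made-up example values, not taken from the pilot data.
example_ratio = match_ratio("listen silent listen", "listen")
print(example_ratio)
```

It could be applied row by row, for instance with `merged.apply(lambda r: match_ratio(r['response'], r['correct']), axis=1)`. Like the cell below, it compares raw responses, so differences in case or stray punctuation count as non-matches.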

Once we have that, let's start thinking about the RT.

In [4]:
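# split the space-joined responses into one column per response
# note: the 10 column names below assume the split yields exactly 10 columns, i.e. the most-answered anagram has 10 responses
anagram_stat = merged['response'].str.split(' ', expand=True)
anagram_stat.columns = ['response1', 'response2', 'response3', 'response4', 'response5', 'response6', 'response7', 'response8', 'response9', 'response10']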
anagram_stat = pd.concat([merged, anagram_stat], axis=1)

# Count the number of responses that match the 'correct' column
anagram_stat['match_count'] = anagram_stat[['response1', 'response2', 'response3', 'response4', 'response5', 'response6', 'response7', 'response8', 'response9', 'response10']].apply(lambda row: (row == anagram_stat.loc[row.name, 'correct']).sum(), axis=1)

# Insert the new column next to the 'id' column
id_index = anagram_stat.columns.get_loc('id')
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('match_count')))
anagram_stat = anagram_stat[cols]

# Count the number of responses that are not empty
anagram_stat['response_count'] = anagram_stat[['response1', 'response2', 'response3', 'response4', 'response5', 'response6', 'response7', 'response8', 'response9', 'response10']].apply(lambda row: row.notnull().sum(), axis=1)
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('response_count')))
anagram_stat = anagram_stat[cols]


# Compute the ratio of correct responses to the total number of responses, i.e. match_count / response_count
anagram_stat['match_ratio'] = anagram_stat['match_count'] / anagram_stat['response_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('match_ratio')))
anagram_stat = anagram_stat[cols]

# Compute the number of responses that are not empty and do not match the correct response
anagram_stat['non_match_count'] = anagram_stat['response_count'] - anagram_stat['match_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('non_match_count')))
anagram_stat = anagram_stat[cols]


# Preprocessing the responses

- Remove spaces
- Lower case everything
- Remove non-letter characters
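
The three steps above can be sketched as a single function applied to one response at a time. This is only an illustration: `clean_response` is a hypothetical name, and the example input is made up.

```python
def clean_response(raw):
    """Lower-case a response and keep only alphabetic characters."""
    if not isinstance(raw, str):
        return raw  # leave missing values (None/NaN) untouched
    # Keeping only alphabetic characters also removes the spaces.
    return "".join(ch for ch in raw.lower() if ch.isalpha())


print(clean_response(" Li-sten! "))  # listen
```

Because dropping non-letters already removes spaces, the space-removal step is implied rather than written separately here.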

In [5]:
# Process the responses: remove spaces, make everything lower case, and remove non-alphabetic characters
responses = anagram_stat.filter(['response1', 'response2', 'response3', 'response4', 'response5', 'response6', 'response7', 'response8', 'response9', 'response10'])
# each step is applied element-wise, column by column (Series.map), which avoids the deprecated DataFrame.applymap
responses = responses.apply(lambda col: col.map(lambda x: x.replace(" ", "") if isinstance(x, str) else x)) # remove spaces
responses = responses.apply(lambda col: col.map(lambda x: x.lower() if isinstance(x, str) else x)) # make lower case
responses = responses.apply(lambda col: col.map(lambda x: ''.join(filter(str.isalpha, x)) if isinstance(x, str) else x)) # remove non-alphabetic characters
# Add the ID back to the responses
responses['id'] = anagram_stat['id']




## Checking the response

- String length match
- Create a sorted list of characters for the target and the response and test them for equality
- If all of the above hold, test whether the response is a real English word
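
The checks above can be sketched for a single response as follows. The function name `is_valid_solution` and the tiny vocabulary are hypothetical and made up for illustration; the cells below use `Counter` for the letter comparison, which is equivalent to comparing sorted character lists.

```python
def is_valid_solution(response, target, vocabulary):
    """Return True if `response` passes all three checks against `target`."""
    if not isinstance(response, str):
        return False
    if len(response) != len(target):        # 1. same length
        return False
    if sorted(response) != sorted(target):  # 2. same letters, in any order
        return False
    return response in vocabulary           # 3. real English word


# Tiny made-up vocabulary for illustration only.
toy_vocabulary = {"listen", "silent", "enlist"}
print(is_valid_solution("silent", "listen", toy_vocabulary))  # True
print(is_valid_solution("lisnet", "listen", toy_vocabulary))  # False
```

Under this definition a response can be valid without being the intended word, which is why the valid count and the match count are tracked separately.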

In [6]:
#1. For each response, compute the string length and compare it to the string length of the correct answer. If the lengths differ, the response is invalid.
string_length = responses.apply(lambda col: col.map(lambda x: len(x) if isinstance(x, str) else 0))
string_length['correct'] = anagram_stat['correct'].apply(lambda x: len(x)) # add the correct answer string length to the dataframe
# count the responses whose length matches the correct answer; the 'id' column is dropped so that it is not counted as a response
string_length['valid'] = string_length.apply(lambda row: sum(row.drop(['id', 'correct']) == row['correct']), axis=1)
# In a QC df of responses, we drop the responses that are invalid based on string length
# make a validate response function
def validate_response_length(row):
    correct_length = string_length['correct'][row.name]
    return row.apply(lambda x: x if isinstance(x, str) and len(x) == correct_length else '')

qc_responses = responses.apply(validate_response_length, axis=1)



In [7]:
from collections import Counter

#2. Then we compare the characters used in the response to those of the correct answer. If the response uses different characters (irrespective of order), the response is invalid.
characters = responses.apply(lambda col: col.map(lambda x: Counter(x) if isinstance(x, str) else x))
characters['correct'] = anagram_stat['correct'].apply(lambda x: Counter(x))
# count the responses whose letter counts match the correct answer; the 'id' column is dropped so that it is not treated as a response
characters['valid'] = characters.apply(lambda row: sum(row.drop(['id', 'correct']) == row['correct']), axis=1)

# We apply the same method to modify the qc_responses dataframe to only include valid responses based on characters used
def validate_response_character(row):
    correct_characters = characters['correct'][row.name]
    return row.apply(lambda x: x if isinstance(x, str) and Counter(x) == correct_characters else '')

qc_responses = qc_responses.apply(validate_response_character, axis=1)



In [8]:
#3. Finally, if the response is the same length and uses the same characters, we check whether it is a real English word. If it is not, the response is invalid.
# The nltk word corpus is one option (left commented out below); here we use a local word list, words_dictionary.json, instead.
#import nltk
#nltk.download('words')
#from nltk.corpus import words
import json
with open('words_dictionary.json') as f:
    valid_words = json.load(f)
# convert to a set for fast membership tests; this works whether the JSON is a list of words or a dict keyed by word
valid_words = set(valid_words)

# We can now use the valid_words set to check whether a response is a valid English word
def validate_response_word(row):
    return row.apply(lambda x: x if isinstance(x, str) and x in valid_words else '')
qc_responses = qc_responses.apply(validate_response_word, axis=1)
# Once we know if a response is valid or invalid we can then calculate the accuracy of the response and add this to the anagram_stat dataframe as a new column.

### With a QC response frame now
Let's add an accuracy column to our stats df: the number of QC'd (valid) responses for each anagram id. It will be a column next to id.
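
As a small sketch of what that column holds, the hypothetical helper `summarise_valid` below counts the non-empty entries in one row of QC'd responses and relates them to the total number of responses. The example row is made up.

```python
def summarise_valid(qc_row, response_count):
    """Count non-empty QC'd responses and their share of all responses."""
    valid_count = sum(1 for r in qc_row if isinstance(r, str) and r != "")
    # Guard against anagrams with no responses at all.
    valid_ratio = valid_count / response_count if response_count else float("nan")
    return valid_count, valid_ratio


# Made-up row: two responses survived QC, two were blanked out.
print(summarise_valid(["silent", "", "enlist", ""], response_count=4))  # (2, 0.5)
```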

In [9]:
# Count the number of valid responses in the qc_responses dataframe and add this to the anagram_stat dataframe as a new column that is matched with the id. 
anagram_stat['valid_count'] = qc_responses.apply(lambda row: row.apply(lambda x: x != '').sum(), axis=1)
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('valid_count')))
anagram_stat = anagram_stat[cols]

# Compute the ratio of valid responses to the total number of responses, i.e. valid_count / response_count
anagram_stat['valid_ratio'] = anagram_stat['valid_count'] / anagram_stat['response_count']
# Add to the df next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('valid_ratio')))
anagram_stat = anagram_stat[cols]

# Move the "correct" column so that it sits next to the id column
cols = anagram_stat.columns.tolist()
cols.insert(id_index + 1, cols.pop(cols.index('correct')))
anagram_stat = anagram_stat[cols]

### The chunk below is for saving out
Always make sure you re-comment this chunk before you push the repo!

In [10]:
# from datetime import datetime
# now = datetime.now()
# dt_string = now.strftime("%Y%m%d_%H%M")
# timestamped = 'anagram_statistics' + dt_string + '.csv'
# df.to_csv(timestamped, index=False)
# anagram_stat.drop(['response'], axis=1).to_csv(timestamped, index=False)