# 5. Image comparison

Juan Berrios | jb25@csufresno.edu | Last updated: April 3, 2025

**Summary and overview of the data:**

- In this notebook we continue processing images so that their degree of similarity can be used as a measure of task success. I will process the maps completed by participants and then calculate similarity measures between each of them and the target map.

**Contents:**

1. [Preparations](#1.-Preparations): the necessary setup, including importing libraries and loading files.
2. [Image similarity](#2.-Image-similarity): extracting the similarity measures by working on each file in the directory.
3. [Processing](#3.-Processing): processing the extracted data to turn it into a data frame.
4. [Analysis](#4.-Analysis): a descriptive analysis of the dataset. 

## 1. Preparations

In [1]:
#Importing libraries

import glob #for directory-level operations
import pandas as pd #for data frames
import numpy as np #for arrays
import cv2 #for images
from skimage.metrics import structural_similarity as ssim #Similarity measure
import re #regular expressions
from scipy.stats import zscore #To calculate Z-scores

#Displaying all output:

from IPython.core.interactiveshell import InteractiveShell #Displays the result of every expression in a cell rather than only the last one.
InteractiveShell.ast_node_interactivity = "all"

#Turning pretty print off:
%pprint

Pretty printing has been turned OFF


## 2. Image similarity

In [2]:
def image_compare(fdir,route):
    """Takes a directory and a route filename as input. Compares every file in the directory that matches the 
    participant pattern (i.e., excluding the route itself) to the route image. Returns a list of tuples with 
    filename and similarity index (SSIM)."""
    
    li = []
    target = cv2.cvtColor(cv2.imread(fdir + route), cv2.COLOR_BGR2GRAY) #Target route, loaded once
    
    for fname in sorted(glob.glob(fdir + "[A-Z][0-9]_out.png")): #Participant drawings only, in a stable order
        drawing = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
        li.append((fname, ssim(drawing, target)))
    return li
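
The index returned by `ssim` combines luminance, contrast, and structure terms. As a rough illustration of what the number means, here is a simplified sketch of the single-window (global) form of the SSIM formula in plain NumPy. It is not what `structural_similarity` computes: scikit-image evaluates the formula over small sliding windows and averages the local values, so its scores will differ. The function name `global_ssim` and the random test images are assumptions made for this example.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over the whole image (simplified illustration only)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical grayscale images: a random "target" and a noisy copy of it
rng = np.random.default_rng(0)
target = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
noise = rng.integers(-20, 21, size=target.shape)
noisy = np.clip(target.astype(int) + noise, 0, 255).astype(np.uint8)

same = global_ssim(target, target)      # identical images
different = global_ssim(target, noisy)  # perturbed copy scores lower
print(same, different)
```

An image compared with itself scores 1, and any distortion pulls the score below that, which is why the participant drawings above all land slightly under 1.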

In [3]:
#Comparing participant-traced routes with target routes

A_1_I = image_compare("../data/images/maps/drawings/A_1_I/","route_out.png") 
A_1_II = image_compare("../data/images/maps/drawings/A_1_II/","route_out.png")
A_2_I = image_compare("../data/images/maps/drawings/A_2_I/","route_out.png")
A_2_II = image_compare("../data/images/maps/drawings/A_2_II/","route_out.png")
B_1_I = image_compare("../data/images/maps/drawings/B_1_I/","route_out.png")
B_1_II = image_compare("../data/images/maps/drawings/B_1_II/","route_out.png")
B_2_I = image_compare("../data/images/maps/drawings/B_2_I/","route_out.png")
B_2_II = image_compare("../data/images/maps/drawings/B_2_II/","route_out.png")
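
Because the eight folder names follow a regular pattern, the calls above could also be generated rather than written out. The following is a minimal sketch under that assumption; `BASE_DIR`, `map_names`, and `map_dirs` are names introduced here for illustration, and the call to `image_compare` is left as a comment so the block runs without the image files.

```python
from itertools import product

BASE_DIR = "../data/images/maps/drawings/"  # same base path as in the calls above

# Folder names follow the pattern <letter>_<number>_<numeral>, e.g. "A_1_I"
map_names = ["_".join(parts) for parts in product("AB", "12", ["I", "II"])]
map_dirs = {name: BASE_DIR + name + "/" for name in map_names}

# With image_compare defined as above, the eight calls could then become:
# results = {name: image_compare(path, "route_out.png") for name, path in map_dirs.items()}
print(map_names)
```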

In [4]:
#Collecting the output for all maps in a list

df_list = [A_1_I,A_1_II,A_2_I,A_2_II,B_1_I,B_1_II,B_2_I,B_2_II]

#Creating a master data frame

master_df = pd.concat([pd.DataFrame(i, columns=['Filename', 'Similarity']) for i in df_list])

In [5]:
#Results

#Dimensions
master_df.shape

#Resetting index
master_df = master_df.reset_index(drop=True)

#Previewing (first and last five rows)

master_df.head(5)
master_df.tail(5)

(80, 2)

Index,Filename,Similarity
0,../data/images/maps/drawings/A_1_I\A2_out.png,0.938607
1,../data/images/maps/drawings/A_1_I\F1_out.png,0.946823
2,../data/images/maps/drawings/A_1_I\J1_out.png,0.949825
3,../data/images/maps/drawings/A_1_I\J2_out.png,0.952273
4,../data/images/maps/drawings/A_1_I\L2_out.png,0.948071


Index,Filename,Similarity
75,../data/images/maps/drawings/B_2_II\O2_out.png,0.94448
76,../data/images/maps/drawings/B_2_II\P1_out.png,0.945496
77,../data/images/maps/drawings/B_2_II\T1_out.png,0.945062
78,../data/images/maps/drawings/B_2_II\W1_out.png,0.936982
79,../data/images/maps/drawings/B_2_II\Y1_out.png,0.94563


## 3. Processing

- We can now extract relevant information using the filename as a point of departure:

In [6]:
#Extracting map and participant (the patterns accept either "/" or "\" as the path separator)

master_df['Map'] = master_df['Filename'].str.extract(r"drawings[/\\]([^/\\]+)[/\\]", expand=False) #Map
master_df['Participant'] = master_df['Filename'].str.extract(r"[/\\]([A-Z][1-2])_out", expand=False) #Participant

#Making sure similarity is a numeric variable

master_df['Similarity'] = pd.to_numeric(master_df['Similarity'], errors='coerce') 

#Previewing

master_df.sample(10)

Index,Filename,Similarity,Map,Participant
40,../data/images/maps/drawings/B_1_I\E1_out.png,0.956577,B_1_I,E1
67,../data/images/maps/drawings/B_2_I\T1_out.png,0.965554,B_2_I,T1
7,../data/images/maps/drawings/A_1_I\Q2_out.png,0.928612,A_1_I,Q2
12,../data/images/maps/drawings/A_1_II\F1_out.png,0.945076,A_1_II,F1
72,../data/images/maps/drawings/B_2_II\H1_out.png,0.913662,B_2_II,H1
66,../data/images/maps/drawings/B_2_I\P1_out.png,0.964016,B_2_I,P1
23,../data/images/maps/drawings/A_2_I\I1_out.png,0.95872,A_2_I,I1
6,../data/images/maps/drawings/A_1_I\N1_out.png,0.917768,A_1_I,N1
54,../data/images/maps/drawings/B_1_II\K1_out.png,0.941218,B_1_II,K1
4,../data/images/maps/drawings/A_1_I\L2_out.png,0.948071,A_1_I,L2
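
The filenames in the output above use a backslash before the file name, which suggests the notebook was run on Windows. As a small sketch of how the extraction behaves on both separator styles, the example below applies separator-agnostic patterns to two paths. The second path is hypothetical, and the patterns are one possible choice rather than the only one.

```python
import pandas as pd

paths = pd.Series([
    "../data/images/maps/drawings/A_1_I\\A2_out.png",  # Windows-style separator, as in the output above
    "../data/images/maps/drawings/B_2_II/Y1_out.png",  # POSIX-style separator (hypothetical)
])

# Map folder: the path component right after "drawings"
maps = paths.str.extract(r"drawings[/\\]([^/\\]+)[/\\]", expand=False)

# Participant code: a letter and a digit right before "_out.png"
participants = paths.str.extract(r"[/\\]([A-Z][0-9])_out\.png$", expand=False)

print(maps.tolist())
print(participants.tolist())
```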


- We'll now add session information by building a dictionary that maps each participant to the session they took part in. A second dictionary, keyed by session, then lets us add a language column:

In [7]:
master_df['Participant'].unique()

array(['A2', 'F1', 'J1', 'J2', 'L2', 'M2', 'N1', 'Q2', 'R1', 'U1', 'A1',
       'B1', 'B2', 'I1', 'I2', 'K2', 'M1', 'N2', 'Q1', 'R2', 'V1', 'E1',
       'E2', 'G1', 'G2', 'K1', 'O1', 'P2', 'S1', 'X1', 'Z1', 'F2', 'H1',
       'H2', 'L1', 'O2', 'P1', 'T1', 'W1', 'Y1'], dtype=object)

In [8]:
#Session dictionary

session_dict = {'A2': 13, 'F1': 2, 'J1': 4, 'J2': 16, 'L2': 17, 'M2': 18, 'N1': 6, 'Q2': 20,  'R1': 8, 'U1': 10, 
                'A1': 1, 'B1': 1, 'B2': 13, 'I1': 4, 'I2': 16, 'K2': 17, 'M1': 6, 'N2': 18, 'Q1': 8, 'R2': 20, 
                'V1': 10, 'E1': 2, 'E2': 14, 'G1': 3, 'G2': 15, 'K1': 5, 'O1': 7, 'P2': 19,  'S1': 9, 'X1': 11, 
                'Z1': 12, 'F2': 15, 'H1': 3, 'H2': 15, 'L1': 5, 'O2': 19, 'P1': 7, 'T1': 9, 'W1': 11, 'Y1': 12}

In [9]:
#Language dictionary 

lang_dict = {1: 'English', 2: 'English', 3: 'English', 4: 'English', 5: 'English', 6: 'Spanish', 7: 'Spanish', 
             8: 'Spanish', 9: 'English', 10: 'English', 11: 'English', 12: 'English', 13: 'English',
             14: 'Spanish', 15: 'Spanish', 16: 'Spanish', 17: 'Spanish', 18: 'English', 19: 'Spanish', 20: 'Spanish'}

In [10]:
#Mapping dictionary values

master_df['Session'] = master_df['Participant'].map(session_dict)
master_df['Language'] = master_df['Session'].map(lang_dict)
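
A quick consistency check on `session_dict` can catch data-entry slips before the mapping is relied on. The sketch below assumes two participants per session, which is an inference from the dictionary itself rather than something stated in this notebook. The names `participants_per_session` and `unexpected` are introduced here for illustration.

```python
from collections import Counter

# Copied from the session dictionary above
session_dict = {'A2': 13, 'F1': 2, 'J1': 4, 'J2': 16, 'L2': 17, 'M2': 18, 'N1': 6, 'Q2': 20, 'R1': 8, 'U1': 10,
                'A1': 1, 'B1': 1, 'B2': 13, 'I1': 4, 'I2': 16, 'K2': 17, 'M1': 6, 'N2': 18, 'Q1': 8, 'R2': 20,
                'V1': 10, 'E1': 2, 'E2': 14, 'G1': 3, 'G2': 15, 'K1': 5, 'O1': 7, 'P2': 19, 'S1': 9, 'X1': 11,
                'Z1': 12, 'F2': 15, 'H1': 3, 'H2': 15, 'L1': 5, 'O2': 19, 'P1': 7, 'T1': 9, 'W1': 11, 'Y1': 12}

# Number of participants assigned to each session
participants_per_session = Counter(session_dict.values())

# Sessions that deviate from the assumed two participants per session
unexpected = {s: n for s, n in sorted(participants_per_session.items()) if n != 2}
print(unexpected)  # {14: 1, 15: 3}
```

With the dictionary as written, sessions 14 and 15 are flagged: only 'E2' maps to 14, while 'F2', 'G2', and 'H2' all map to 15. Whether that reflects the actual roster or a typo cannot be determined from the notebook alone.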

In [11]:
#Previewing

master_df.sample(10)

Index,Filename,Similarity,Map,Participant,Session,Language
35,../data/images/maps/drawings/A_2_II\M1_out.png,0.948597,A_2_II,M1,6,Spanish
18,../data/images/maps/drawings/A_1_II\Q2_out.png,0.917068,A_1_II,Q2,20,Spanish
14,../data/images/maps/drawings/A_1_II\J2_out.png,0.950266,A_1_II,J2,16,Spanish
77,../data/images/maps/drawings/B_2_II\T1_out.png,0.945062,B_2_II,T1,9,English
37,../data/images/maps/drawings/A_2_II\Q1_out.png,0.945407,A_2_II,Q1,8,Spanish
15,../data/images/maps/drawings/A_1_II\L2_out.png,0.941589,A_1_II,L2,17,Spanish
42,../data/images/maps/drawings/B_1_I\G1_out.png,0.954138,B_1_I,G1,3,English
76,../data/images/maps/drawings/B_2_II\P1_out.png,0.945496,B_2_II,P1,7,Spanish
67,../data/images/maps/drawings/B_2_I\T1_out.png,0.965554,B_2_I,T1,9,English
78,../data/images/maps/drawings/B_2_II\W1_out.png,0.936982,B_2_II,W1,11,English


In [12]:
#Saving files

#Spreadsheet
master_df.to_excel("../spreadsheets/df_similarity.xlsx", index=False) 

#Pickling
master_df.to_pickle('../pkl/df_similarity.pkl')                 

## 4. Analysis

- Language numbers:

In [13]:
#Overall, including mean, SD, min, and max

master_df.groupby('Language').describe()

Language,Similarity count,Similarity mean,Similarity std,Similarity min,Similarity 25%,Similarity 50%,Similarity 75%,Similarity max,Session count,Session mean,Session std,Session min,Session 25%,Session 50%,Session 75%,Session max
English,44.0,0.949338,0.011372,0.913662,0.942407,0.948823,0.959793,0.965554,44.0,8.0,5.193914,1.0,3.0,9.0,12.0,18.0
Spanish,36.0,0.945907,0.013204,0.915688,0.942162,0.946752,0.957958,0.964016,36.0,13.611111,5.04991,6.0,8.0,15.0,17.0,20.0
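
The `Session` columns in the table above summarize session numbers, which are identifiers rather than measurements. If only the similarity statistics are of interest, the column can be selected before calling `describe()`. A minimal sketch on a toy frame follows; the four rows reuse values shown earlier in this notebook, and `toy` and `summary` are names made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Similarity": [0.938607, 0.946823, 0.952273, 0.948071],
    "Session": [13, 2, 16, 17],
    "Language": ["English", "English", "Spanish", "Spanish"],
})

# Selecting the column first keeps identifier columns out of the summary
summary = toy.groupby("Language")["Similarity"].describe()
print(summary[["count", "mean", "min", "max"]])
```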


- Session numbers:

In [14]:
#Overall, including mean, SD, min, and max

master_df.groupby(['Session','Language']).describe()

Session,Language,Similarity count,Similarity mean,Similarity std,Similarity min,Similarity 25%,Similarity 50%,Similarity 75%,Similarity max
1,English,4.0,0.949198,0.010166,0.937855,0.94402,0.948359,0.953536,0.962218
2,English,4.0,0.943391,0.01321,0.925088,0.940079,0.945949,0.949261,0.956577
3,English,4.0,0.94337,0.020665,0.913662,0.937657,0.949897,0.95561,0.960023
4,English,4.0,0.954035,0.006089,0.947821,0.949324,0.954272,0.958984,0.959776
5,English,4.0,0.950122,0.01122,0.941218,0.941906,0.946961,0.955176,0.965348
6,Spanish,4.0,0.933933,0.016269,0.917768,0.921019,0.934684,0.947598,0.948597
7,Spanish,4.0,0.952449,0.010233,0.942353,0.94471,0.951713,0.959452,0.964016
8,Spanish,4.0,0.946534,0.001636,0.945407,0.945524,0.9459,0.94691,0.948928
9,English,4.0,0.954327,0.010811,0.945062,0.945073,0.953347,0.962602,0.965554
10,English,4.0,0.947531,0.012926,0.935181,0.937012,0.947549,0.958069,0.959845


- We'll also add Z-scoring as an additional way to compare data at a glance:

In [15]:
#Adding a Z score column

master_df['Similarity_Z'] = zscore(master_df['Similarity']) #Master data frame
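
For reference, `zscore` with its default settings subtracts the mean and divides by the population standard deviation (`ddof=0`). The sketch below reproduces that calculation by hand with NumPy on five similarity values taken from the first rows of the data frame; the variable names are chosen for this example only.

```python
import numpy as np

# First five similarity values from the preview above
similarity = np.array([0.938607, 0.946823, 0.949825, 0.952273, 0.948071])

# Z-score: distance from the mean in units of the population standard deviation
z = (similarity - similarity.mean()) / similarity.std(ddof=0)
print(z.round(3))
```

By construction the Z-scores have a mean of zero (up to floating-point error) and a standard deviation of one. Note that pandas' `Series.std()` defaults to `ddof=1`, so computing the same column with pandas defaults would give slightly different values.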

In [16]:
#Previewing

master_df.sample(10)

Index,Filename,Similarity,Map,Participant,Session,Language,Similarity_Z
22,../data/images/maps/drawings/A_2_I\B2_out.png,0.950569,A_2_I,B2,13,English,0.227581
3,../data/images/maps/drawings/A_1_I\J2_out.png,0.952273,A_1_I,J2,16,Spanish,0.367283
0,../data/images/maps/drawings/A_1_I\A2_out.png,0.938607,A_1_I,A2,13,English,-0.753374
34,../data/images/maps/drawings/A_2_II\K2_out.png,0.945474,A_2_II,K2,17,Spanish,-0.190295
60,../data/images/maps/drawings/B_2_I\A1_out.png,0.962218,B_2_I,A1,1,English,1.182831
70,../data/images/maps/drawings/B_2_II\B1_out.png,0.937855,B_2_II,B1,1,English,-0.815053
51,../data/images/maps/drawings/B_1_II\E2_out.png,0.935208,B_1_II,E2,14,Spanish,-1.032105
73,../data/images/maps/drawings/B_2_II\H2_out.png,0.937988,B_2_II,H2,15,Spanish,-0.804138
54,../data/images/maps/drawings/B_1_II\K1_out.png,0.941218,B_1_II,K1,5,English,-0.539231
72,../data/images/maps/drawings/B_2_II\H1_out.png,0.913662,B_2_II,H1,3,English,-2.798952


- Mean similarity and Z score by session and language:

In [17]:
#By session and language

master_df.groupby(['Session', 'Language'])[['Similarity','Similarity_Z']].mean() #Selecting the numeric columns before averaging

Session,Language,Similarity,Similarity_Z
1,English,0.949198,0.115088
2,English,0.943391,-0.361074
3,English,0.94337,-0.362821
4,English,0.954035,0.511803
5,English,0.950122,0.19089
6,Spanish,0.933933,-1.136621
7,Spanish,0.952449,0.381709
8,Spanish,0.946534,-0.103345
9,English,0.954327,0.535752
10,English,0.947531,-0.021562


In [18]:
#Across the data set
master_df["Similarity"].mean()
master_df["Similarity_Z"].mean()

0.9477941805245328

7.105080412905807e-15

- As a last step, we'll create a summary table with the relevant information for each session: similarity mean, min, and max, as well as the mean Z-score:

In [19]:
#Create data frame
desc = master_df.groupby(['Session', 'Language']).describe()[['Similarity', 'Similarity_Z']]

# Drop extraneous columns
desc = desc.drop(columns=[('Similarity', 'count'), ('Similarity', 'std'), ('Similarity', '25%'), 
                          ('Similarity', '50%'), ('Similarity', '75%'),('Similarity_Z', 'count'), 
                          ('Similarity_Z', 'std'), ('Similarity_Z', 'min'), ('Similarity_Z', 'max'), 
                          ('Similarity_Z', '25%'), ('Similarity_Z', '50%'), ('Similarity_Z', '75%')])

#Previewing

desc.round(2) #Round to two decimal places

Session,Language,Similarity mean,Similarity min,Similarity max,Similarity_Z mean
1,English,0.95,0.94,0.96,0.12
2,English,0.94,0.93,0.96,-0.36
3,English,0.94,0.91,0.96,-0.36
4,English,0.95,0.95,0.96,0.51
5,English,0.95,0.94,0.97,0.19
6,Spanish,0.93,0.92,0.95,-1.14
7,Spanish,0.95,0.94,0.96,0.38
8,Spanish,0.95,0.95,0.95,-0.1
9,English,0.95,0.95,0.97,0.54
10,English,0.95,0.94,0.96,-0.02
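
An alternative to computing every statistic with `describe()` and then dropping most of them is to request only the needed ones with `agg`. The sketch below shows the idea on a toy frame with hypothetical values; `toy` and `desc_alt` are names introduced for this example, and the resulting column layout matches the summary table above.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "Session": [1, 1, 2, 2],
    "Language": ["English"] * 4,
    "Similarity": [0.95, 0.94, 0.93, 0.96],
    "Similarity_Z": [0.2, -0.6, -1.4, 1.0],
})

# Request only the statistics that end up in the summary table
desc_alt = toy.groupby(["Session", "Language"]).agg(
    {"Similarity": ["mean", "min", "max"], "Similarity_Z": ["mean"]}
)
print(desc_alt.round(2))
```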


In [20]:
#Saving results

desc.to_excel("../spreadsheets/df_description.xlsx") #Saving as spreadsheet file