# Scan Path Analysis pt2
#### @tutor: Ms SHARAFI Zohreh
#### @student: Mr SHAW Oscar

### The purpose of this notebook is to analyze and compare problem-solving strategies across participants by measuring the similarity between their scanpaths

#### How we proceed:
We start by focusing only on the sequence of entities viewed by each participant, and we measure the distance between two sequences with the Damerau-Levenshtein algorithm. The advantage of this method over plain Levenshtein is that it takes into account the transposition of two adjacent characters. Instead of comparing paths between different participants, we first compare the paths of a single participant across different code tasks, to see whether that participant always follows the same problem-solving strategy.

Our working guess is that someone with a strong background knows the best problem-solving methods and will therefore tend to follow the same strategy every time. That participant's sequences should then be close to each other, that is, have a high normalized score.

We then study the differences between the sequences of two different participants on the same task.
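
To make the role of transpositions concrete, here is a minimal pure-Python sketch (independent of the NumPy functions defined later in this notebook). The helper name `edit_distance` and the two toy sequences are assumptions made for illustration only; the letters follow the entity code used in this notebook (A = method body, B = comment, D = bug report).

```python
def edit_distance(s1: str, s2: str, transpositions: bool = False) -> int:
    """Levenshtein distance; with transpositions=True, swapping two adjacent
    characters also counts as a single edit (optimal string alignment)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions turn s1[:i] into the empty string
    for j in range(cols):
        d[0][j] = j  # j insertions turn the empty string into s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
            if (transpositions and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


first = "BDA"   # Comment, Bug_Report, Method_Body
second = "DBA"  # Bug_Report, Comment, Method_Body
print(edit_distance(first, second))                       # 2
print(edit_distance(first, second, transpositions=True))  # 1
```

Two participants who look at the same two entities in swapped order are therefore one edit apart with the transposition-aware distance, instead of two with plain Levenshtein.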

In [1]:
# import libraries
# !pip install pandas

import os
import pandas as pd
import numpy as np

## 1- Data Import

In [2]:
# Path of the folder which contains the files
FOLDER_PATH = "C:\\Users........"
DATA_FOLDER = os.path.join(FOLDER_PATH, 'data_files')

# Dictionary establishing the correspondence between a file and its data,
# i.e data{key = file_id, value = dataframe} where the 'entity' column takes its values in ['Comment', 'Bug_Report', 'Member_Variable',
# 'Method_Body', 'Method_Signature']
# ex: file_id = P103 & entity sequence = ['Comment','Bug_Report','None','Bug_Report']
data = {}


# For each csv file, we use the dataframe to populate the dictionary with the data 
for filename in os.listdir(DATA_FOLDER):
    df = pd.read_csv(os.path.join(DATA_FOLDER, filename),delimiter=',')
    data[filename.split('_')[0]] = df

## 2- Cleaning & Preprocessing

In [3]:
# clean dataframe from NONE feature values
def clean(df: object, features: [str]) -> object:
#     input:  df, features  -> object (dataframe), array of string containing the feature we want to filter 
#     output: df -> object (dataframe)   
    for i in features:
        indexNames = df[ df[i] == 'NONE' ].index
        df.drop(indexNames , inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df

In [4]:
# Keep only the features we want to use in our study
def selection(df: object, features: [str]) -> object:
#     input:  df, features  -> object (dataframe), array of string containing the feature we want to keep 
#     output: df -> object (dataframe)
    return df[features]

In [5]:
# Convert entity names to letters for later processing
def convert(df):
#     input: df -> object (dataframe) with the sequence of entities in plain text
#     output: df -> object (dataframe) with each entity encoded as a letter (pd.NA if unknown)
    letters = {'method_body': 'A', 'comment': 'B', 'member_variable': 'C',
               'bug_report': 'D', 'method_signature': 'E'}
    # str(x) guards against non-string cells such as NaN
    # note: pandas >= 2.1 deprecates applymap in favor of DataFrame.map
    return df.applymap(lambda x: letters.get(str(x).lower(), pd.NA))

In [6]:
# Transform the sequence of letters into a string for later processing
def SequenceToString(df):
#     input: df -> object (dataframe)
#     output: string -> string which encodes the sequence of entities as letters
    return ''.join(map(str, df['entity']))

In [17]:
# Split the dataframe by phase number
def divide(df, n_phases):
#     input:  df, n_phases  -> object (dataframe) large dataframe with all phase numbers, int number of phases
#     output: list of dataframes -> one dataframe per phase, holding only the 'entity' column
    return [df.loc[df['phase_number'] == phase, 'entity'].to_frame() for phase in range(1, n_phases + 1)]

In [18]:
# Remove NA values
def cleanNA(df):
#     input:  df  -> object (dataframe) with NA  
#     output: df -> object (dataframe) without NA and reindexed 
    df.dropna(inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df
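
Taken together, the helpers above split a recording by phase, encode each entity as a letter, drop unknown entities, and join what remains into one string per phase. The following is a pure-Python sketch of that idea on toy data, without pandas. The function `encode_by_phase` and the `toy_records` list are hypothetical and only serve as an illustration; the letter code is the one used in `convert`.

```python
# Entity-to-letter code, as in convert()
ENTITY_TO_LETTER = {
    "method_body": "A",
    "comment": "B",
    "member_variable": "C",
    "bug_report": "D",
    "method_signature": "E",
}


def encode_by_phase(records):
    """records: list of (phase_number, entity_name) pairs in viewing order.
    Returns one string per phase; entities outside the code are skipped."""
    n_phases = max(phase for phase, _ in records)
    sequences = [""] * n_phases
    for phase, entity in records:
        letter = ENTITY_TO_LETTER.get(entity.lower())
        if letter is not None:  # plays the role of cleanNA()
            sequences[phase - 1] += letter
    return sequences


# Hypothetical toy recording: (phase_number, entity)
toy_records = [
    (1, "Comment"), (1, "Bug_Report"), (1, "None"), (1, "Bug_Report"),
    (2, "Method_Body"), (2, "Method_Signature"),
]
print(encode_by_phase(toy_records))  # ['BDD', 'AE']
```

A phase with no recognized entity yields an empty string, which is why the later comparisons only consider non-empty strings.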

In [20]:
# Iterate over each dataframe of data, split it by phase, then convert and clean each part.
# We use a separate dictionary so that the original data stays reusable
def pipeline(data):
#      input: data  -> dictionary{key = file_id, value = object (dataframe)}
#      output: new_data -> dictionary{key = file_id, value = [one sequence string per phase]}
    new_data = {}
    # divide dataframe regarding the phase_number
    for i in data:
        divided_df = divide(data[i],max(data[i]['phase_number']))
        res = []
        for df in divided_df:
            df = convert(df)
            df = cleanNA(df)
            # df to String
            df = SequenceToString(df)       
            res.append(df)
        new_data[i] = res
    return new_data

new_data = pipeline(data)
new_data

{'P102': ['BDBCCBDDBCBBBABEBAAAAAAAAAAAAAAAABBBAAABAAABBABBBABBABAAABABBBBAABBBABBBACCBABAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBAAAAAABABAABBBAAABAAEAEAAAAAAEABBAAAAAACAAAAAABBBBBCBAABBBCCBCBCCCCAAAAAABAABBBBBAAABAAAAAAAAAAAAAACAAAAAABAAAABAABABBAAAAABAABBABAAAAAAAAABAAAAAABAAAAABAAAAAAAAAABAAAAAAABBAAAAAAAAAAABAABAABAAAAAAABABABBABBABABBAAABAAAAABABAAAAAAAAAAAAAAAAAAAAAABABAAAAAAAAAAAAAAAAAAAABABBBAAAAABAAAAAAAABBABBBBABAABAAAABAACAAABB',
  'AAAAAAAAAABEBBEEAAAABEAAEAAAAAAAAAAABAABABAAAAAABAAAAAAAAAABAAAAAABABAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBAABBCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABABBBBBAAAAAAAAABAAAAAAABCAABAAAAAAAAAAEAAAAAAAAAAAAAAABAAAAABBAAAAAAAAAAAAAAAAAAAABAAAAAAABBABBEAACAAAABABBAABAAABCBBBBCBCCBBAAAAABAAAAABBABBBAAAAABABBBABACBAAAAABBAABAAAAAAAAAAABAABAAAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAABAAAAAAAAAABBBAAAAAAEABAAAAAAAABBAAABABBBAAAABEAAAAABBEAABBBAEABBABAAAEAAAAAAAAAAAAAAABAAAABBBAAAAAABBAABAAAABABABAAABAAAAABAAABAAAAAAAAAAAAAAAAAAAAAAAAA

### Damerau-Levenshtein algorithm - Iterative version

In [23]:
def damerauLevenshtein(s1: str, s2: str) -> float:
#      input: s1, s2  -> str, str: the two strings that are compared 
#      output: normalized_score -> float: normalized similarity score between the strings (1.0 = identical)
#      note: this is the restricted variant (optimal string alignment), which only handles adjacent transpositions
    if not s1 and not s2:
        return 1.0 # two empty strings are identical; this also avoids a division by zero below
    score_matrix = np.zeros((len(s1)+1,len(s2)+1),int) # matrix which will contain the scores
    score_matrix[:,0] = np.arange(len(s1)+1) # fill the first column of the matrix with 0,1,2,3,...
    score_matrix[0,:] = np.arange(len(s2)+1) # fill the first row of the matrix with 0,1,2,3,...

    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            score_matrix[i+1,j+1] = min(score_matrix[i,j]+cost, # score by alignment
                                    score_matrix[i,j+1]+1, # score by deletion
                                    score_matrix[i+1,j]+1) # score by insertion
            
            if i > 0 and j > 0 and s1[i] == s2[j-1] and s1[i-1] == s2[j]:
                score_matrix[i+1,j+1] = min(score_matrix[i+1,j+1], score_matrix[i-1,j-1] + 1) # score by transposition
    distance = float(score_matrix[-1,-1])
    normalized_score = 1.0-distance/max(len(s1),len(s2))
    return normalized_score

### Levenshtein algorithm - Iterative version

In [24]:
## Iterative version of the Levenshtein algorithm (using a matrix), which returns the normalized similarity score between two strings
def iterLevenshtein(s1: str, s2: str) -> float: 
#      input: s1, s2  -> str, str: the two strings that are compared 
#      output: normalized_score -> float: normalized similarity score between the strings (1.0 = identical)
    if not s1 and not s2:
        return 1.0 # two empty strings are identical; this also avoids a division by zero below
    score_matrix = np.zeros((len(s1)+1,len(s2)+1),int) # matrix which will contain the scores
    score_matrix[:,0] = np.arange(len(s1)+1) # fill the first column of the matrix with 0,1,2,3,...
    score_matrix[0,:] = np.arange(len(s2)+1) # fill the first row of the matrix with 0,1,2,3,...

    for i in range(1,len(s1)+1):
        for j in range(1,len(s2)+1):
            if s1[i-1] == s2[j-1]:
                cost = 0
            else:
                cost = 1
            score_matrix[i,j] = min(score_matrix[i-1,j-1]+cost, # score by alignment
                                    score_matrix[i-1,j]+1, # score by deletion
                                    score_matrix[i,j-1]+1) # score by insertion
    distance = float(score_matrix[-1,-1])
    normalized_score = 1.0-distance/max(len(s1),len(s2))
    return normalized_score
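
Both functions return a similarity score rather than a raw distance. Writing $d(s_1, s_2)$ for the edit distance read from the last cell of the score matrix, the normalization used in the code is:

$$
\mathrm{score}(s_1, s_2) = 1 - \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)}
$$

Since the edit distance never exceeds the length of the longer string, the score lies between 0 and 1, and a score of 1 means the two sequences are identical. As an illustration on toy strings (not taken from the data set), for $s_1 = \texttt{BDA}$ and $s_2 = \texttt{DBA}$ the Levenshtein distance is 2, so the score is $1 - 2/3 \approx 0.33$, whereas the transposition-aware distance is 1, so the score is $1 - 1/3 \approx 0.67$.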

## 3 - Data Analysis

In [30]:
intra_dam = {}
intra_lev = {}
# for each participant
for participant, sequences in new_data.items():
    tab_dam = []
    tab_lev = []
    for y in range(len(sequences)-1):
        for z in range(y+1,len(sequences)):
            # compare only non-empty strings of the same participant
            if sequences[y] and sequences[z]:
                tab_dam.append((y+1,z+1,round(damerauLevenshtein(sequences[y],sequences[z]),5)))
                tab_lev.append((y+1,z+1,round(iterLevenshtein(sequences[y],sequences[z]),5)))

    intra_dam[participant] = tab_dam
    intra_lev[participant] = tab_lev

In [31]:
intra_dam

{'P102': [(1, 2, 0.62171), (1, 3, 0.27232), (2, 3, 0.21217)],
 'P103': [(1, 2, 0.13153), (1, 3, 0.15068), (2, 3, 0.01982)],
 'P105': [(1, 2, 0.57047)],
 'P106': [(2, 3, 0.29242)],
 'P107': [(1, 2, 0.0444), (1, 3, 0.07091), (2, 3, 0.50622)]}

In [32]:
intra_lev

{'P102': [(1, 2, 0.62171), (1, 3, 0.27232), (2, 3, 0.21217)],
 'P103': [(1, 2, 0.13153), (1, 3, 0.15068), (2, 3, 0.01982)],
 'P105': [(1, 2, 0.57047)],
 'P106': [(2, 3, 0.29242)],
 'P107': [(1, 2, 0.0444), (1, 3, 0.07091), (2, 3, 0.50622)]}

## Results

Each score has to be read as (taskA, taskB, score(taskA, taskB)).

Here we compare the strategies a single participant uses across different programs. The higher the score between two phases, the more similar the sequences of entities are, and therefore the more the participant tends to follow the same resolution strategy across tasks.

##### As in the previous analysis, P102 and P105 have the highest scores, both between tasks 1 and 2 (0.62171 and 0.57047), which may be tasks well suited to this kind of strategy. They appear to be the most consistent in their strategy, and their strategies appear to be close to each other. A first hypothesis would be that they are the two most experienced participants

## Next step

 Previously we compared the complete sequences between participants and the partial sequences of a single participant. Let's now compare the partial sequences of different participants, to check whether the first results still hold when we refine the analysis.

In [36]:
from itertools import combinations

correlated_dam = {}
correlated_lev = {}
# visit each unordered pair of participants exactly once
for (p1, seqs1), (p2, seqs2) in combinations(new_data.items(), 2):
    partialsSeq_dam = []
    partialsSeq_lev = []
    # zip stops at the shorter list, so only phases present for both participants are compared
    for phase, (s1, s2) in enumerate(zip(seqs1, seqs2), start=1):
        # compare only non-empty strings
        if s1 and s2:
            partialsSeq_dam.append((phase,round(damerauLevenshtein(s1,s2),5)))
            partialsSeq_lev.append((phase,round(iterLevenshtein(s1,s2),5)))
    if partialsSeq_dam:
        correlated_dam['_'.join((p1,p2))] = partialsSeq_dam
        correlated_lev['_'.join((p1,p2))] = partialsSeq_lev



In [37]:
correlated_lev

{'P102_P103': [(1, 0.15625), (2, 0.66118), (3, 0.08527)],
 'P102_P105': [(1, 0.65179), (2, 0.68914)],
 'P102_P106': [(2, 0.44737), (3, 0.54264)],
 'P102_P107': [(1, 0.10938), (2, 0.49911), (3, 0.18669)],
 'P103_P105': [(1, 0.17561), (2, 0.63758)],
 'P103_P106': [(2, 0.48649), (3, 0.13415)],
 'P103_P107': [(1, 0.50685), (2, 0.4325), (3, 0.01592)],
 'P105_P106': [(2, 0.45302)],
 'P105_P107': [(1, 0.11707), (2, 0.49023)],
 'P106_P107': [(2, 0.24512), (3, 0.11867)]}

In [39]:
correlated_dam

{'P102_P103': [(1, 0.15625), (2, 0.66283), (3, 0.08527)],
 'P102_P105': [(1, 0.65179), (2, 0.68914)],
 'P102_P106': [(2, 0.44737), (3, 0.54264)],
 'P102_P107': [(1, 0.10938), (2, 0.49911), (3, 0.18669)],
 'P103_P105': [(1, 0.17561), (2, 0.63758)],
 'P103_P106': [(2, 0.48649), (3, 0.13415)],
 'P103_P107': [(1, 0.50685), (2, 0.4325), (3, 0.01592)],
 'P105_P106': [(2, 0.45302)],
 'P105_P107': [(1, 0.11707), (2, 0.49023)],
 'P106_P107': [(2, 0.24512), (3, 0.11867)]}
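
The two result dictionaries are almost identical, so it can help to list only the entries where they disagree. The sketch below does this on an excerpt of the outputs above (first two pairs only); the helper name `score_differences` and the `*_excerpt` variables are assumptions introduced for illustration.

```python
def score_differences(scores_a, scores_b):
    """Return (pair, phase, score_a, score_b) wherever the two result
    dictionaries disagree. Both must share the same keys and phases."""
    differences = []
    for pair, phase_scores in scores_a.items():
        for (phase, a), (_, b) in zip(phase_scores, scores_b[pair]):
            if a != b:
                differences.append((pair, phase, a, b))
    return differences


# Excerpt of correlated_lev and correlated_dam as printed above
lev_excerpt = {
    'P102_P103': [(1, 0.15625), (2, 0.66118), (3, 0.08527)],
    'P102_P105': [(1, 0.65179), (2, 0.68914)],
}
dam_excerpt = {
    'P102_P103': [(1, 0.15625), (2, 0.66283), (3, 0.08527)],
    'P102_P105': [(1, 0.65179), (2, 0.68914)],
}
print(score_differences(lev_excerpt, dam_excerpt))
# [('P102_P103', 2, 0.66118, 0.66283)]
```

In the full outputs printed above, this P102_P103 entry for phase 2 is the only value that differs between the two algorithms.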

### Average score analysis for Levenshtein Algorithm

In [60]:
avg_lev = [(x[0], round(np.array([y[1] for y in x[1]]).mean(),5)) for x in correlated_lev.items()]
avg_lev

[('P102_P103', 0.3009),
 ('P102_P105', 0.67046),
 ('P102_P106', 0.495),
 ('P102_P107', 0.26506),
 ('P103_P105', 0.4066),
 ('P103_P106', 0.31032),
 ('P103_P107', 0.31842),
 ('P105_P106', 0.45302),
 ('P105_P107', 0.30365),
 ('P106_P107', 0.1819)]

### Average score analysis for Damerau-Levenshtein Algorithm

In [63]:
avg_dam = [(x[0], round(np.array([y[1] for y in x[1]]).mean(),5)) for x in correlated_dam.items()]
avg_dam

[('P102_P103', 0.30145),
 ('P102_P105', 0.67046),
 ('P102_P106', 0.495),
 ('P102_P107', 0.26506),
 ('P103_P105', 0.4066),
 ('P103_P106', 0.31032),
 ('P103_P107', 0.31842),
 ('P105_P106', 0.45302),
 ('P105_P107', 0.30365),
 ('P106_P107', 0.1819)]
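
To read the averages more easily, the pairs can be ranked from most to least similar. This is a small sketch that copies the Damerau-Levenshtein averages printed above into a hypothetical `avg_scores` list; inside the notebook, `avg_dam` could be sorted directly in the same way.

```python
# Average Damerau-Levenshtein scores copied from the output above
avg_scores = [
    ('P102_P103', 0.30145), ('P102_P105', 0.67046), ('P102_P106', 0.495),
    ('P102_P107', 0.26506), ('P103_P105', 0.4066), ('P103_P106', 0.31032),
    ('P103_P107', 0.31842), ('P105_P106', 0.45302), ('P105_P107', 0.30365),
    ('P106_P107', 0.1819),
]

# Sort by score, highest first
ranked = sorted(avg_scores, key=lambda item: item[1], reverse=True)
for pair, score in ranked[:3]:
    print(pair, score)
# P102_P105 0.67046
# P102_P106 0.495
# P105_P106 0.45302
```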

## Results
We can see that on both tasks 1 and 2, P102 and P105 have a high score (0.65179 and 0.68914 with either algorithm), so their sequences are close. This supports the hypothesis made in part 1 that they follow similar strategies.
