# Evaluation Script

With the vectors from tfidf, fastText and BERT prepared and saved from previous notebooks, we can now evaluate the performance based on any prepared answer keys of correct legislation sections pairs.

The metric to be used will be Recall@k. The scripts below can compute different values of k, but for simplicity, this project will focus on k=3. Recall@3 will be a good proxy for a user being able to see the ideal result instantly without having to scroll down.

**Important Note:** When evaluating, ensure that the entries for the output jurisdiction `juris_B` are all below that of the input jurisdiction `juris_A`, as the evaluation scripts make the assumption that they are ordered that way.

The outputs below demo based on data for the SG Copyright Act and UK CDPA, but do note as mentioned, the data files containing legislation content will not be in the repo.

### Imports and Loading Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
id_col = 'sec'
required_input_cols = [id_col, 'title', 'url', 'cleaned']

In [3]:
juris_A = 'sg'
juris_B = 'uk'

Specify Evaluation Params

In [4]:
# example with copyright legislation data

# params to feed into eval function
params = {}
params['law_topic'] = 'copyright' # legal topic
params['data_file'] = '../data/clean/copyright/sg_uk_copyright.csv' # this data file will not be pushed to git repo 
params['ak_file'] = '../data/answer-keys/copyright_ak.csv' # answer key file
params['output_jurisdiction'] = juris_B 
params['ks_to_try'] = [1, 3, 5]
params['vector_files'] = [
    '../data/vectors/copyright/cp_sg_uk_vecs_tfidf.npy',
    '../data/vectors/copyright/cp_sg_uk_vecs_ft.npy',
    '../data/vectors/copyright/cp_sg_uk_vecs_bert_mlm.npy'
]

baseline_filepath = '../data/baselines/sg_uk_copyright_levdists.csv' # file containg levenshtein edit distance of titles

### Evalution Functions

In [5]:
def evaluate(result_vecs, div_index, answer_key, k=3, get_wrongs=True):
    """
    This function evaluates recall @ k. Default of k=3.
    
    The input of result_vecs is the dataframe of vectors 
    that have the section as the index, e.g. 'sg_5'
    while the values of the row represents the vectors.
    
    Take note of the index number that separates the two jurisdictions, div_index.,
    which would be the numeric index of the first entry of jurisdiction 2 in the result_vecs.
    """
    
    answers_in_top_k = 0
    wrongs = []
    
    # separate first and second jurisdictions into j1 and j2
    j1_vecs = result_vecs[:div_index]
    j2_vecs = result_vecs[div_index:]
    
    for i, _ in answer_key.iterrows():
        # the tested section will be in column 0 of answer keys
        test_sec = answer_key.iloc[i,0]
        # the correct corresponding section will be in column 1 of answer keys
        correct_sec = answer_key.iloc[i,1]
        # get the candidate cosine similarities between 
        # the tested section and the candidate sections of the other jurisdiction
        cos_sims = cosine_similarity(
            [j1_vecs.loc[test_sec]],
            j2_vecs)
        
        # get the top k results and get their respective sections
        j2_result_ids = (-cos_sims)[0].argsort()[:k]
        j2_result_secs = j2_vecs.iloc[j2_result_ids].index

        if correct_sec in j2_result_secs:
            answers_in_top_k += 1
        else:
            wrongs.append((test_sec, correct_sec, j2_result_secs.values))
    
    score = round((answers_in_top_k / answer_key.shape[0]), 2)
    
    if get_wrongs:
        return {'score': score,
                'wrongs': wrongs}
    else:
        return f'score {score}'

In [6]:
def evaluate_all(law_topic, data_file, ak_file, vector_files, ks_to_try, output_jurisdiction, get_wrongs=True):
    data = pd.read_csv(data_file)
    answer_key = pd.read_csv(ak_file)
    juris_split_id = data[data[id_col].str.contains(output_jurisdiction)].first_valid_index()
    print(juris_split_id)
    print(f'Evaluating results for {law_topic}...', '\n\n')
    
    for vector_file in vector_files:
        for k in ks_to_try:
            vector_file_shortname = vector_file.rsplit('/', 1)[-1]
            print(f'Getting recall at {k} results for vector file {vector_file_shortname}...')
            all_sent_vecs = np.load(vector_file)
            print(f'Vector file of shape {all_sent_vecs.shape}...')
            result_vecs = pd.DataFrame(all_sent_vecs, index=data[id_col])
            print(evaluate(result_vecs, div_index=juris_split_id, 
                           answer_key=answer_key, k=k, get_wrongs=get_wrongs))
            print('\n\n\n')
    
    print('Evaluation complete.')

The levenshtein (edit) distance function works slightly differently because there can be many title matches with the same edit distance, to make the baseline more lenient, as long as the answers edit distance is in the top k edit distances, it counts as a hit. 

It is ok to be lenient with the baseline as it would make beating the baseline more challenging.

In [7]:
def evaluate_levdis(result_levdis, output_juris, ak_filepath, data_filepath, k=3):
    """
    This function evaluates recall @ k. Default of k=3.
    
    The input of result_vecs is the dataframe of lev distances 
    that have the section as the index, e.g. 'sg_5'
    while the values of the row represents the lev distance,
    
    Take note of the index number that separates the two jurisdictions, div_index.,
    which would be the numeric index of the first entry of jurisdiction 2.
    """
    answer_key = pd.read_csv(ak_filepath)
    data = pd.read_csv(data_filepath)
    
    div_index = data[data[id_col].str.contains(output_juris)].first_valid_index()
    answers_in_top_k = 0
    wrongs = []
    
    for i, _ in answer_key.iterrows():
        # the tested section will be in column 0 of answer keys
        test_sec = answer_key.iloc[i,0]
        # the correct corresponding section will be in column 1 of answer keys
        correct_sec = answer_key.iloc[i,1]
        # get the shortest k lev distances
        shortest_k_levdis = result_levdis.loc[test_sec][div_index:].sort_values()[:k].values
        # get the lev distance of the correct j2 sec in the answer key
        correct_sec_levdis = result_levdis.loc[test_sec, correct_sec]
        
        # comparing the actual distance as there may be many tied distances
        if correct_sec_levdis in shortest_k_levdis:
            answers_in_top_k += 1
        else:
            shortest_k_levdis_secs = result_levdis.loc[test_sec][div_index:].sort_values()[:k].index
            wrongs.append((test_sec, correct_sec, shortest_k_levdis_secs))
    
    return {'score': round((answers_in_top_k / answer_key.shape[0]), 2),
            'wrongs': wrongs}

### Getting Baseline Scores

In [8]:
# saved_levdis_filepath = ''

In [9]:
result_levdis = pd.read_csv(baseline_filepath)

In [10]:
result_levdis.head()

Unnamed: 0,sec,sg_1,sg_2,sg_3,sg_4,sg_5,sg_6,sg_7,sg_7A,sg_8,...,uk_297C,uk_297D,uk_298,uk_299,uk_301,uk_302,uk_303,uk_304,uk_305,uk_306
0,sg_1,0,10,21,45,30,18,11,49,10,...,71,42,84,44,65,45,32,9,11,0
1,sg_2,10,0,14,47,33,19,8,48,10,...,72,43,84,42,69,45,32,10,11,10
2,sg_3,21,14,0,41,22,15,22,43,21,...,63,39,78,36,63,42,29,21,19,21
3,sg_4,45,47,41,0,41,42,45,49,48,...,60,44,70,44,55,47,43,49,46,45
4,sg_5,30,33,22,41,0,29,31,45,32,...,60,37,72,37,59,42,33,33,31,30


In [11]:
result_levdis.set_index(id_col, inplace=True)

In [12]:
evaluate_levdis(result_levdis, juris_B, params['ak_file'], params['data_file'])

{'score': 0.35,
 'wrongs': [('sg_7A',
   'uk_3A',
   Index(['uk_3', 'uk_271', 'uk_243'], dtype='object')),
  ('sg_28', 'uk_12', Index(['uk_13B', 'uk_13A', 'uk_14'], dtype='object')),
  ('sg_29', 'uk_57', Index(['uk_13A', 'uk_14', 'uk_13B'], dtype='object')),
  ('sg_36', 'uk_30', Index(['uk_70', 'uk_148', 'uk_68'], dtype='object')),
  ('sg_38', 'uk_45', Index(['uk_148', 'uk_70', 'uk_122'], dtype='object')),
  ('sg_38A', 'uk_28A', Index(['uk_172', 'uk_54', 'uk_175'], dtype='object')),
  ('sg_65', 'uk_31', Index(['uk_63', 'uk_89', 'uk_68'], dtype='object')),
  ('sg_120', 'uk_99', Index(['uk_199', 'uk_108', 'uk_231'], dtype='object')),
  ('sg_136', 'uk_107', Index(['uk_250', 'uk_304', 'uk_193'], dtype='object')),
  ('sg_140M',
   'uk_109',
   Index(['uk_135B', 'uk_162', 'uk_100'], dtype='object')),
  ('sg_188', 'uk_84', Index(['uk_190', 'uk_182', 'uk_294'], dtype='object')),
  ('sg_193F', 'uk_56', Index(['uk_31', 'uk_15A', 'uk_170'], dtype='object')),
  ('sg_200', 'uk_253C', Index(['uk_45'

### Using Evaluation Function

Works with the saved tfidf, fastText and BERT vectors (npy files).

In [13]:
evaluate_all(**params, get_wrongs=False)

366
Evaluating results for copyright... 


Getting recall at 1 results for vector file cp_sg_uk_vecs_tfidf.npy...
Vector file of shape (780, 32516)...
score 0.8




Getting recall at 3 results for vector file cp_sg_uk_vecs_tfidf.npy...
Vector file of shape (780, 32516)...
score 0.8




Getting recall at 5 results for vector file cp_sg_uk_vecs_tfidf.npy...
Vector file of shape (780, 32516)...
score 0.9




Getting recall at 1 results for vector file cp_sg_uk_vecs_ft.npy...
Vector file of shape (780, 100)...
score 0.25




Getting recall at 3 results for vector file cp_sg_uk_vecs_ft.npy...
Vector file of shape (780, 100)...
score 0.4




Getting recall at 5 results for vector file cp_sg_uk_vecs_ft.npy...
Vector file of shape (780, 100)...
score 0.45




Getting recall at 1 results for vector file cp_sg_uk_vecs_bert_mlm.npy...
Vector file of shape (780, 768)...
score 0.4




Getting recall at 3 results for vector file cp_sg_uk_vecs_bert_mlm.npy...
Vector file of shape (780, 768)...
score 

In [14]:
# to get the results the model got wrong printed out
evaluate_all(**params, get_wrongs=True)

366
Evaluating results for copyright... 


Getting recall at 1 results for vector file cp_sg_uk_vecs_tfidf.npy...
Vector file of shape (780, 32516)...
{'score': 0.8, 'wrongs': [('sg_29', 'uk_57', array(['uk_12'], dtype=object)), ('sg_140M', 'uk_109', array(['uk_296ZE'], dtype=object)), ('sg_160', 'uk_118', array(['uk_119'], dtype=object)), ('sg_200', 'uk_253C', array(['uk_102'], dtype=object))]}




Getting recall at 3 results for vector file cp_sg_uk_vecs_tfidf.npy...
Vector file of shape (780, 32516)...
{'score': 0.8, 'wrongs': [('sg_29', 'uk_57', array(['uk_12', 'uk_104', 'uk_13B'], dtype=object)), ('sg_140M', 'uk_109', array(['uk_296ZE', 'uk_107', 'uk_198'], dtype=object)), ('sg_160', 'uk_118', array(['uk_119', 'uk_120', 'uk_123'], dtype=object)), ('sg_200', 'uk_253C', array(['uk_102', 'uk_97', 'uk_101A'], dtype=object))]}




Getting recall at 5 results for vector file cp_sg_uk_vecs_tfidf.npy...
Vector file of shape (780, 32516)...
{'score': 0.9, 'wrongs': [('sg_140M', 'uk_109', a

### Retriever Function

For a more exploratory mode, this retriever function can be used to explore results returned for individual input.

This function is also used in the flask application to retrieve matches based on the user's input.

In [15]:
# example with copyright provisions
data_path = '../data/clean/copyright/sg_uk_copyright.csv' # this data file will not be pushed to git repo 
vector_path = '../data/vectors/copyright/cp_sg_uk_vecs_bert_mlm.npy'

In [16]:
data = pd.read_csv(data_path)
vectors = np.load(vector_path)

In [17]:
input_secs = [sec for sec in data[id_col] if juris_A in sec]

In [18]:
def retrieve(input_sec, data, vectors, output_juris=juris_B, k=5):
    vectors = pd.DataFrame(vectors, index=data[id_col])
    juris_split_id = data[data[id_col].str.contains(output_juris)].first_valid_index()
    # get the index for when the next jurisdiction entries start 
    candidate_vecs = vectors[juris_split_id:]
    cos_sims = cosine_similarity(
            [vectors.loc[input_sec]],
            candidate_vecs)
    result_ids = (-cos_sims)[0].argsort()[:k]
    result_secs = candidate_vecs.iloc[result_ids].index
    return {i+1:data[data[id_col]==res][[id_col, 'title', 'url']].to_dict('records')[0] 
            for i,res in enumerate(result_secs)}

Example retrieving a section. Input the section id.

In [19]:
test_input_section = 'sg_50'

In [20]:
%time
retrieve(test_input_section, data, vectors)

CPU times: user 31 µs, sys: 1 µs, total: 32 µs
Wall time: 5.96 µs


{1: {'sec': 'uk_80',
  'title': 'Right to object to derogatory treatment of work',
  'url': 'www.legislation.gov.uk/ukpga/1988/48/section/80'},
 2: {'sec': 'uk_55',
  'title': 'Articles for producing material in particular typeface',
  'url': 'www.legislation.gov.uk/ukpga/1988/48/section/55'},
 3: {'sec': 'uk_77',
  'title': 'Right to be identified as author or director',
  'url': 'www.legislation.gov.uk/ukpga/1988/48/section/77'},
 4: {'sec': 'uk_33',
  'title': 'Anthologies for educational use',
  'url': 'www.legislation.gov.uk/ukpga/1988/48/section/33'},
 5: {'sec': 'uk_175',
  'title': 'Meaning of publication and commercial publication',
  'url': 'www.legislation.gov.uk/ukpga/1988/48/section/175'}}

## Results Analysis

To keep this notebook clean and reusable for those who would like to try on their own prepared data, the actual results of the project will be consolidated and discussed in the the README instead.