Discussion of splitting `text.pkl` into training and testing partitions.

N.B. The actual creation of training and testing sets is done elsewhere; this notebook is for discussion purposes.

In [4]:
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split

# location of text.pkl - get path of current working directory, then go up one level, then navigate to text.pkl in ./interim/
text_path = Path.cwd().parent.joinpath('data/interim/text.pkl')

# load text.pkl
text = pd.read_pickle(text_path)

# number of records after preprocessing
print(f'Number of preprocessed records: {len(text)}\n')

# first 5 preprocessed records
text.head()

Number of preprocessed records: 4290



Unnamed: 0,ModuleCode,Aims,OutlineOfSyllabus,IntendedKnowledgeOutcomes,IntendedSkillOutcomes
0,"[ACC1000, ACC1003, LAC1003]","to provide an introduction to key concepts, te...",introduction to accounting and finance measur...,by the end of the module students should be ab...,by the end of the module students should be ab...
1,[ACC1010],to provide a basic foundation of knowledge and...,the basic structure of accounting and double ...,at the end of the module students will be able...,at the end of the module students will be able...
2,[ACC1011],to introduce students to the fundamental conce...,classification of costs product costs & prici...,at the end of the module students will be able...,at the end of the module students will be able...
3,[ACC1012],to introduce n400 students to a range of key s...,the key skills element and the mathematics/ st...,"by the end of the module, students will be abl...","by the end of the module, students will be abl..."
4,[ACC1052],the module aims to provide an introduction to ...,key elements of the financial system fundamen...,"by the end of the module, students will be abl...","by the end of the module, students will be abl..."


To evaluate the output of the Semantic Textual Similarity modelling within this project, we need a set of data that has been manually labelled with similarity scores. The model output can then be compared against this labelled data to test its performance. While there may be other, auxiliary methods for profiling how the model makes its predictions, a quantitative assessment against a labelled test set gives a direct, objective measure of performance.

After preprocessing the records via `make_dataset.py`, the dataset has been reduced from 5359 records to 4290 records. SimCSE and TSDAE, the sentence embedding fine-tuning methods of interest, work by augmenting individual passages of text. Hence, if they were trained on our full dataset, each epoch would process 4290 training examples, one per record. Note that the way our semantic textual similarity methods train is fundamentally different from the way we will test them. Training fine-tunes sentence embeddings through data augmentation, whereas testing measures pairwise similarities between the generated sentence embeddings, which can be seen as a way of evaluating the quality of the embeddings.
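
To make the testing side concrete, the following is a minimal sketch of scoring one pair of embeddings. It assumes cosine similarity as the pairwise measure, which this notebook does not fix, and it uses made-up toy vectors rather than real SimCSE or TSDAE output. The function name `cosine_similarity` is illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy embeddings (illustrative values, not model output)
embedding_a = [0.2, 0.7, 0.1]
embedding_b = [0.25, 0.6, 0.05]
embedding_c = [0.9, 0.0, 0.4]

# a and b point in nearly the same direction, so their score is close to 1
print(cosine_similarity(embedding_a, embedding_b))
# a and c point in different directions, so their score is noticeably lower
print(cosine_similarity(embedding_a, embedding_c))
```

In a real evaluation, scores like these would be computed for each labelled test pair and compared against the manual similarity labels.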

Following the widely used 80/20 convention (the Pareto principle), we will reserve 20% of our records for testing the model. Hence, we will keep 858 records for testing. Within this testing partition, there are $\sum_{i = 1}^{857}i = \binom{858}{2} = 367653$ possible similarities in total (treating similarity as undirected and excluding self-similarities). It would be infeasible to label this many similarities with our available time and resources, so we should instead test in a way that covers as much of this partition as possible while measuring as few similarities as possible. A sensible way to do this is to form disjoint pairs from the testing samples, so that each sample appears in exactly one pair; this gives 429 pairs whose similarity must be labelled. We assume that 429 randomly formed pairs will show a reasonable spread of similarities, so the pairing can be done at random.
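
As a quick check of the counts quoted above, the arithmetic can be reproduced with the standard library. This is only a sketch: the variable names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math

n_records = 4290      # preprocessed records, as reported above
test_fraction = 0.2   # proportion reserved for testing

n_test = round(n_records * test_fraction)
n_all_pairs = math.comb(n_test, 2)   # unordered pairs, excluding self-pairs
n_disjoint_pairs = n_test // 2       # each record used in exactly one pair

print(n_test)            # 858
print(n_all_pairs)       # 367653
print(n_disjoint_pairs)  # 429
```

Disjoint pairing therefore cuts the labelling effort from 367653 comparisons to 429 while still touching every record in the testing partition once.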

We will now outline the schema for how this partitioning and pairing will be performed.

In [20]:
# split the data into training and testing partitions at a ratio of 4:1 samples
# we use random_state for reproducibility and shuffle as the data is currently ordered in terms of module type (e.g. ACC first)
train, test = train_test_split(text, test_size = 0.2, random_state = 1, shuffle = True)

In [30]:
print(f'Number of training samples: {len(train)}\n') 
train.head()

Number of training samples: 3432



Unnamed: 0,ModuleCode,Aims,OutlineOfSyllabus,IntendedKnowledgeOutcomes,IntendedSkillOutcomes
1586,[FIN2013],to build upon the foundation of understanding...,"students attend a schedule of tutorials, group...",students will be aware of a range of approache...,students will have begun to consider and apply...
3537,[PHY2037],to discuss mathematically the wave theory of m...,reminder of preliminary concepts: de broglie a...,students will gain knowledge of the basic prin...,students will be able to formulate and solve s...
558,"[CAG3002, CAG8002]",to further develop and refine students' lingu...,students taking this module will read an ancie...,"to read, understand, translate and critically...",to translate greek texts fluently and accurat...
726,[CEG8225],this module aims: to explain the principles an...,this module will provide an introduction to th...,students will be able to demonstrate their kno...,students will demonstrate: creative and innova...
1642,[FMS8360],to provide students with research skills requ...,this foundational module in research methods a...,an understanding of key research approaches i...,the ability to use relevant databases and pro...


In [45]:
print(f'Number of testing samples: {len(test)}\n') 
test.head()

Number of testing samples: 858



Unnamed: 0,ModuleCode,Aims,OutlineOfSyllabus,IntendedKnowledgeOutcomes,IntendedSkillOutcomes
2627,[MCD8007],aims: to explore the management of advanced pe...,topics covered focus on the clinical field of ...,on successful completion of this module studen...,on successful completion of this module studen...
1960,[HIS3351],this special subject proceeds from the premise...,this module is structured around three central...,students will be able to explain the foundati...,students will explain and interpret key moder...
1364,[EDU8231],this module brings together expertise in educa...,this module is structured into three organisat...,to develop knowledge and understanding of: the...,to develop the ability to: distinguish between...
4111,[SPE1051],this module aims to provide an introduction to...,syllabus covers: introduction to the course. d...,knowledge of the principles underlying experi...,ability to create an experimental design to a...
2927,[MON2001],this module is an introduction to human anatom...,what distinguishes the human body from other p...,describe the structure of the human body in an...,use anatomical terminology in oral and written...


The partitioning seems satisfactory. Since the partitioning shuffled the samples, we can simply pair adjacent records within the testing set to construct pairs for similarity assessment. Thus, the pairings would look like the following:

In [46]:
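test_pairs = []
# perform the pairing: ModuleCode positions 0 and 1 are paired, 2 and 3 are paired, etc.
# stopping at len(test) - 1 avoids an IndexError if the number of testing samples is odd
for index in range(0, len(test) - 1, 2):
    pair = [test.ModuleCode.iloc[index], test.ModuleCode.iloc[index + 1]]
    test_pairs.append(pair)
# print each pairing on its own line
for pair in test_pairs:
    print(pair)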

[['MCD8007'], ['HIS3351']]
[['EDU8231'], ['SPE1051']]
[['MON2001'], ['ICM0100']]
[['HSS8007'], ['SML9024']]
[['TCP8902'], ['MCH3164', 'MCH8164']]
[['LAS4001'], ['CSC8633']]
[['ECO2007'], ['CHY8842']]
[['CSC3121'], ['MCD8003']]
[['EEE3007', 'EEE8004'], ['MCH3011']]
[['SPE8011'], ['LBU2018']]
[['NES8002'], ['HIS3328']]
[['MEC3013'], ['SEL3402']]
[['EEE3018'], ['PSY1002']]
[['EDU8008'], ['MAS2701']]
[['INU1117', 'INU1517'], ['CLA3090']]
[['SML3007', 'SML3008'], ['CEG3710']]
[['PHI2020'], ['CSC3131']]
[['PHY3020'], ['GRN8207']]
[['MCH1030'], ['NUS8209']]
[['LBS8142'], ['ONC8027']]
[['MAS2905'], ['BGM3058']]
[['PHY3008'], ['ACE3211', 'ACE8211']]
[['BGM1002', 'BMN1001'], ['SEL3012']]
[['HIS2306'], ['EEE3016', 'PHY3037']]
[['LBS8509', 'NBS8509'], ['PSY8083']]
[['ARA2016'], ['MMB8034']]
[['PHY2032'], ['LPS1001']]
[['MAS8752'], ['MAR8228']]
[['SOC3078'], ['HIS2305']]
[['GRN8807'], ['MCH8503']]
[['LAW8563'], ['CSC6005']]
[['ACE1044'], ['SOC2086']]
[['LAW8572'], ['MCH1034']]
[['CEG8101'], ['LAS40

In the above, we have a series of lists that themselves contain lists. In the first case, `['MCD8007']` and `['HIS3351']` would have their catalogue entries compared and given an integer similarity from 0 (dissimilar) to 5 (essentially identical). Looking at the fifth case, we have `['TCP8902']` and `['MCH3164', 'MCH8164']`. `MCH3164` and `MCH8164` were grouped because their bag-of-words representations have high cosine similarity, which classifies them as equivalent modules; their text is nearly identical at the word level. Thus, in this comparison, we would compare `TCP8902` with one of `MCH3164` or `MCH8164`; it doesn't matter which.

Each of the pairings contains two lists (most commonly singletons). Thus, to generalise the above idea, for each pairing we compare *one* of the modules in the first list with *one* of the modules in the second list, and the choice of module within each list is arbitrary.
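
The rule above can be sketched as a small helper. The function `representative_pair` is hypothetical and not part of the project code; it takes the first module code in each list, which is one arbitrary choice among equally valid ones.

```python
def representative_pair(pairing):
    """Return one module code from each of the two lists in a pairing.

    The first code in each list is taken; because grouped modules are
    treated as equivalent, any other choice would serve equally well.
    """
    first_group, second_group = pairing
    return first_group[0], second_group[0]

# two pairings copied from the output above
example_pairs = [
    [['MCD8007'], ['HIS3351']],
    [['TCP8902'], ['MCH3164', 'MCH8164']],
]

for pairing in example_pairs:
    print(representative_pair(pairing))
```

For the second pairing this selects `TCP8902` and `MCH3164`; selecting `MCH8164` instead would be equally acceptable under the equivalence assumption.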