Exploration of document embeddings generated by the BigBird-CT model that have high cosine similarity, to demonstrate the applicability of the methodology.

The BigBird-CT model trained on the full dataset is used here, mirroring how an end-product model might be trained and then used for inference on the full dataset.

N.B. It is strongly recommended to run this notebook on a GPU.

In [None]:
from pathlib import Path
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import numpy as np


pd.set_option('max_colwidth', None)

# load train.pkl
train_path = Path.cwd().parent.joinpath('data/processed/train.pkl')
train = pd.read_pickle(train_path)

# load test_unlabelled.pkl
test_path = Path.cwd().parent.joinpath('data/interim/test_unlabelled.pkl')
test = pd.read_pickle(test_path)

# concatenate train and test data
fulldata = pd.concat([train, test])

# load our fine-tuned BigBird-CT with in-batch negatives model which has been trained on the full dataset
model_fulldata_bigbird_ct_path = Path.cwd().parent.joinpath('models/FULLDATA_bigbird-ct')
model = SentenceTransformer(model_fulldata_bigbird_ct_path)

sentences = fulldata['Concatenated'].tolist()
codes = fulldata['ModuleCode'].tolist()

# get document embeddings for every module in the full dataset (train and test)
embeddings = model.encode(sentences,
                          batch_size = 16,
                          show_progress_bar = True)

Batches:   0%|          | 0/269 [00:00<?, ?it/s]

Attention type 'block_sparse' is not possible if sequence_length: 696 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


One application of this project is the discovery of module catalogue entries that are semantically similar to some other catalogue entry of interest. This can serve many ends, such as recommending courses to students, identifying duplicate teaching, and even facilitating collaboration between university faculty members, including those in different departments.

We will demonstrate an implementation of this here, directly using the cosine similarities of the generated document embeddings. The results of topic modelling could be used to a similar effect, where instead the clusters are used.

We list the fifty document embeddings with the highest cosine similarity to an arbitrarily chosen document embedding, including its self-similarity.

In [None]:
# find the cosine similarity matrix for the embeddings
cos_sim = util.cos_sim(embeddings, embeddings)

# add all pairs to a list, with their cosine similarity score, including self-similarities
all_sentence_combinations = []
# iterate over every row so that the last document's self-similarity is not dropped
for i in range(len(cos_sim)):
    for j in range(i, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

In [None]:
def most_similar_embeddings(document_id):
    '''
    Get the fifty highest cosine similarity document embeddings to the document embedding associated with the provided ID
    This includes the self-similarity
    '''
    # get all pairs that feature the document embedding of interest
    similarity_pairs = []
    for score, i, j in all_sentence_combinations:
        if (i == document_id) or (j == document_id):
            similarity_pairs.append([score, i, j])
    # sort the list by descending cosine similarity
    similarity_pairs = sorted(similarity_pairs, key = lambda x: x[0], reverse = True)
    # keep the fifty largest similarity pairs
    similarity_pairs = similarity_pairs[0:50]

    # make dataframe containing details of fifty largest similarity pairs
    most_similar_df = pd.DataFrame(columns = ['ModuleCode', 'Document', 'Cosine Similarity'])
    for comparison in similarity_pairs:
        score, i, j = comparison
        if j == document_id:
            to_append = i
        else:
            to_append = j
        most_similar_df.loc[len(most_similar_df)] = [codes[to_append], sentences[to_append], float(score)]

    return most_similar_df

most_similar = most_similar_embeddings(3832)

most_similar

Unnamed: 0,ModuleCode,Document,Cosine Similarity
0,"[CSC8423, CSC8430]","in this module the apprentices will learn about the security principles and considerations that should be adopted in the software engineering process from requirements, design, through to development and testing. the module also puts a particular emphasis on the ethical, legal and social considerations of software engineering when applied to the workplace. *apprentices and their employers who wish to apply a project from their workplace must consult with the module leader to ensure the scope is manageable in the semester, and the project criteria is met. the module will cover cyber security: o the need for security o foundations of security o privacy o practical security o information governance: ethical, legal and social issues involved in data management and analysis ethical, social and legal consideration in cyber security to be able to describe and discuss: the information governance requirements that exist in the uk, and the relevant organisational and legislative data protection and data security standards that exist the basic tenets of computer security: confidentiality, integrity and availability authentication and access control problems and their solutions introduction to symmetric and asymmetric (public key) cryptography examples of uses of cryptography for confidentiality, integrity, authentication and non repudiation examples of secure protocols a selection of security modelling techniques security engineering methodology ethical, social and legal concerns in data and information systems apprentices will be able to: present an analysis of the security considerations of a given system detect vulnerabilities and threats in an existing system formulate a practical security solution to a problem, making effective use of time and resources available implement network protocols at various layers",1.0
1,"[CSC8202, CSC8413]","this course aims to explore trust, identity, privacy and security, which are fundamental and intertwined concepts in modern systems. the material focuses on the underlying theoretical principles, exploring how we can assess the overall security of a system, taking into account human factors and information management. security risk definition/assessment/management identification & authentication biometrics access control & authorisation trust privacy to be able to describe and discuss: the importance of information security requirements in modern distributed environments. the information security mechanisms and primitives, threats and counter measures. the techniques to analyse designs against security requirements and threat environments. the ways in which formal methods can be used to support rigorous specification and validation of a design. the ethical and professional issues associated with security and trust. the ability to: design and validate designs of secure systems. select and use techniques and computer aided model checking tools for security protocols.",0.887955
2,[CSC8015],"to create awareness of the need for security in computer and communications systems, and to introduce techniques aiming at analysing and improving security. by exploring topics such as the need for security, system and network security, cryptography, privacy and practical security, including hardware, software and human elements of security, this module aims to introduce requirements and solutions for security for many components of a computer system: hardware, network, databases, web applications, operating systems and user interface. the module will look at the main areas under the field of cybersecurity and introduce a number of security tools, covering: main principles of information security. website security network analysis cybersecurity tools human aspects of cybersecurity to be able to: identify the basic tenets of computer security: confidentiality, integrity and availability. define symmetric and asymmetric (public key) cryptography. describe and discuss web applications and network threats. discuss human factors in cybersecurity. to be able to: analyze network traffic in real time. present an analysis of the security of a given system. use real attacks tools (e.g. kali linux) in a controlled environment. use vulnerabilities scanning security tools to detect vulnerabilities and threats in an existing system. propose and formulate a practical security solution to a problem, making effective use of time and resources available.",0.877045
3,[CSC8420],"security attacks can target modern systems, possibly impacting the information managed by such system. it is crucial for system users, designers and maintainers to learn how to detect, prevent and respond to these attacks. this module aims to cover a broad range of information and system security concepts, focusing on the underpinning theory behind attacks and mechanisms, supported by up to date research papers, as well as some real world examples. cryptography, security protocols network and distributed system security hardware security security of emerging systems usability security risk definition/assessment/management identification, authentication, access control & authorisation privacy the ability to describe & discuss: system vulnerabilities, and common attacks on security systems, security engineering methods: threat model, security policy and protection mechanisms, trade offs that needed to be considered with any sensible security scheme. the importance of information security requirements in modern distributed environments. the techniques to analyse designs against security requirements and threat environments. the ethical and professional issues associated with security and trust. the ability to: work out a threat model for a given scenario, formulate a security policy, design specific protection mechanisms to implement a security policy design and validate designs of secure systems.",0.870184
4,[CSC3632],"to explore in depth the different mechanisms used to protect the security of systems and network, and to manage the corresponding risk. cryptography: simple and practical introduction to symmetric and asymmetric encryption, hashing and signature. malicious code: xxs, code injection, reverse engineering network security: firewall, ids, packet analysis, security protocols authentication and authorisation: biometrics, access control risk management: threat modelling, risk assessment privacy: k anonymity human factors: usability, behavioural security to be able to: assess and incorporate the cyber, physical and social factors involved in system and network security into system design and implementation adopt an adversarial mind set when facing a new system to be able to: conduct practical attacks in a controlled environment conduct a risk assessment of a realistic system and make security recommendations design a security policy and enforce it use and apply a range of security and privacy analysis tools and techniques",0.8653
5,[CSC3124],"to create awareness of the need for security in computer and communications systems, and to introduce techniques aiming at analysing and improving security. by exploring topics such as the need for security, system and network security, cryptography, privacy and practical security, including hardware, software and human elements of security, this module aims to introduce the need for security in many components of a computer system: hardware, network, databases, web applications, operating systems and user interface. the need for security: threat and risk modelling. system security: authentication (password, biometrics), access control, malicious code execution, user behaviour. network security: tcp/ip firewall, intrusion detection, routing attacks. cryptography: simple introduction to symmetric and asymmetric encryption, hashing and signature. privacy: k anonymity. practical security: conduct attacks in a controlled environment. to be able to: identify the basic tenets of computer security: confidentiality, integrity and availability. discuss authentication problems and their solutions. discuss access control problems and their solutions. discuss security modelling and its uses. define symmetric and asymmetric (public key) cryptography, including some common cryptographic algorithms. discuss uses of cryptography for confidentiality, integrity, authentication and non repudiation. recognise some examples of secure protocols. recognise the purpose of a firewall, and some of the techniques required in its construction. define a number of authentication mechanisms. define a selection of security modelling techniques. define the security engineering methodology to be able to: present an analysis of the security of a given system use real attacks tools (e.g. kali linux)in a controlled environment detect vulnerabilities and threats in an existing system formulate a practical security solution to a problem, making effective use of time and resources available.",0.845584
6,"[CSC8207, CSC8410]","complex systems, such as industrial control systems or electronic voting systems, include social, cyber and physical aspects, which can all be exploited by attackers. users are often wrongly portrayed as “the weakest link”, when the problem lies in the lack of a usable and secure system. the security analysis of a complex system therefore requires a holistic approach, leveraging a range of techniques. the aim of this module is to study techniques required for complex systems, using concrete case studies as well as exploring possible future attacks. the module covers, through the study of research papers and technical reports, attacks against complex systems, as well as techniques to detect, respond to and prevent such attacks. the complex systems studied during the module will reflect current research and technical challenges, for example: industrial control systems and cyber physical infrastructure, social engineering techniques, human aspects of security, forensics analysis, or machine learning based intrusion and misuse detection. security of complex systems (e.g., industrial control systems, smart grids) sophisticated attack mechanisms (e.g., adversarial machine learning) usable security and privacy social engineering techniques the ability to describe and discuss: the interaction of security of social, cyber and physical aspects in complex systems, and their impact on the security of the whole system. the role of human users in the security and privacy of complex systems. the possible security mechanisms to detect, respond to and prevent attacks against complex systems. the ability to analyse and summarise key research papers related to the security of complex systems. the ability to suggest and recommend security mechanisms for a specific complex system.",0.816682
7,"[CSC8102, CSC8412]","most computer based systems are vulnerable to security attacks, and it is crucial for system users, designers and maintainers to learn how to detect, prevent and respond to these attacks. this course aims to cover a wide range of attacks, such as man in the middle and api attacks, and to introduce relevant security mechanisms, such as cryptographic techniques and network analysis tool. the material covers the underpinning theory behind attacks and mechanisms, supported by up to date research papers, as well as some real world examples. cryptography security protocols network and distributed system security hardware security security of emerging systems usability. the ability to describe & discuss: system vulnerabilities, and common attacks on security systems, importance of interface usability in robust secure systems, security engineering methods: threat model, security policy and protection mechanisms, trade offs that needed to be considered with any sensible security scheme. the ability to: work out a threat model for a given scenario, formulate a security policy, design specific protection mechanisms to implement a security policy.",0.810643
8,[CSC8210],"it is often impossible to guarantee the complete security of a system, and a cyber security analyst often aims instead to reveal gaps in security provisioning. the aim of this module is to develop skills to select and apply tools for carrying out security testing strategies including vulnerability scanning, penetration testing and ethical hacking. this module will look at a range of security tools and analysis, covering: definition of ethical hacking network analysis (including host discovery and traffic analysis) web application analysis (including xss and vulnerability reporting) operating system analysis (including privilege escalation and buffer exploitation) the ability to describe and discuss: how to select and tools and techniques to carry out a variety of security testing strategies the fundamentals of ethical hacking the ability to: plan and carry out testing a variety of security testing strategies identify, investigate and correlate actionable security events conduct a vulnerability assessment conduct analysis of attacker tools",0.782492
9,[CSC3631],"to introduce students to the theory and practice of block ciphers, cryptographic hash functions, public key cryptography and cryptographic protocols. algorithms – cryptographic algorithms historical overview of cryptography private key cryptography and the key exchange problem public key cryptography digital signatures security protocols applications (zero knowledge proofs, authentication, and so on) net centric computing – network security fundamentals of cryptography secret key algorithms public key algorithms authentication protocols digital signatures to be able to: compare the major cryptographic algorithms and protocols relate the historical development of cryptography to modern systems and challenges analyse the authentication and related protocols in real world situations incorporate cryptographic requirements into net centric computing systems to be able to: analyse the cryptographic requirements of a system and select an appropriate solution implement cryptographic algorithms to encode and decode messages and data develop software to apply more complex cryptographic algorithms evaluate the efficacy of cryptographic algorithms in practical applications.",0.76507


The first row of the table above is the document being queried for similar documents; it is included for comparison.
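The lookup above builds a Python list of every pair of documents before filtering it for one document ID. As a minimal sketch of an alternative, assuming the embeddings are available as a 2-D NumPy array (as `model.encode` returns by default), the same ranking can be read from a single row of the cosine similarity matrix. The function name `top_k_similar` and the toy array are hypothetical and used only for illustration.

```python
import numpy as np


def top_k_similar(embeddings, document_id, k=50):
    """Return (index, cosine similarity) pairs for the k rows most similar to one row.

    The query row itself is included, mirroring the self-similarity in the table.
    """
    # L2-normalise each row so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    # similarities of every document to the document of interest: shape (n,)
    scores = unit @ unit[document_id]
    # indices of the k largest scores, highest first
    order = np.argsort(-scores)[:k]
    return [(int(idx), float(scores[idx])) for idx in order]


# toy stand-in for the output of model.encode: 4 documents, 3 dimensions
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
toy_top = top_k_similar(toy_embeddings, document_id=0, k=3)
```

In the notebook, the returned indices could be used to look up `codes` and `sentences` in the same way as `most_similar_embeddings` does. This avoids holding roughly n²/2 pairs in memory, at the cost of recomputing one row per query.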

Here we analyse the document embeddings most similar to that of modules **CSC8423** and **CSC8430**, which were grouped during preprocessing because they essentially represent duplicate modules. This module broadly covers security principles and considerations for software engineering, including concepts such as data management. In the table, most of the similar modules come from the Computing school, as you would expect, and many of them are very similar. More interestingly, at index 30, **LAW3051** and **LAW3251**, from the Law school, are listed with a cosine similarity of 0.650249, which is fairly high. Indeed, the text for this module refers to the legality of the internet and networked information, referencing concepts such as digital assets.

These modules are semantically similar enough to be of interest. A student who studied **CSC8423/CSC8430** may wish to learn more on the subject from a legal standpoint; through this methodology they may find **LAW3051/LAW3251** and enquire into this other, similar module. Similarly, a module lead for **CSC8423/CSC8430** may wish to incorporate more robust legal content into the module, and might therefore consult a module lead from **LAW3051/LAW3251**.

There are many other applications of the output of this modelling. For example, a lecturer from the school of mathematics, statistics and physics may wish to teach a course on security principles and considerations for mathematical programming, not knowing that very similar teaching already exists under **CSC8423/8430** in the school of computing. They could write a description, in the format of the module catalogue entries, of what their module would contain. This description would then be encoded by the Transformer and its most similar document embeddings found. The lecturer would then see that a module similar to what they wish to teach already exists, **CSC8423/8430**, and may be able to borrow ideas from it.
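The workflow for a new, uncatalogued description can be sketched as follows. This is an illustration under stated assumptions rather than part of the notebook's pipeline: the function `rank_against_catalogue`, the placeholder module codes and the two-dimensional vectors are all hypothetical. In practice the catalogue vectors would be the `embeddings` array and `codes` list from earlier, and the draft vector would come from encoding the lecturer's text with the same model, for example `model.encode([draft_text])[0]`.

```python
import numpy as np


def rank_against_catalogue(query_embedding, catalogue_embeddings, catalogue_codes, k=5):
    """Rank existing catalogue entries by cosine similarity to a new document embedding."""
    # normalise the query and every catalogue row to unit length
    q = query_embedding / np.linalg.norm(query_embedding)
    c = catalogue_embeddings / np.linalg.norm(catalogue_embeddings, axis=1, keepdims=True)
    # cosine similarity of the query to each catalogue entry
    scores = c @ q
    # keep the k highest-scoring entries, best first
    order = np.argsort(-scores)[:k]
    return [(catalogue_codes[i], float(scores[i])) for i in order]


# hypothetical stand-ins for the real catalogue embeddings and module codes
catalogue_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
catalogue_codes = ['MOD_A', 'MOD_B', 'MOD_C']
# hypothetical stand-in for the embedding of the lecturer's draft description
draft_embedding = np.array([2.0, 0.0])

ranked = rank_against_catalogue(draft_embedding, catalogue_embeddings, catalogue_codes, k=2)
```

Because the draft is not part of the catalogue, no self-similarity row appears in the result; the top entry is simply the closest existing module.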