Exploring high cosine similarity embeddings generated by the BigBird-CT model to demonstrate the applicability of the methodology

N.B. Running this notebook on a GPU is strongly recommended.

In [3]:
from pathlib import Path
import pandas as pd
from sentence_transformers import SentenceTransformer, util


pd.set_option('max_colwidth', None)

# load full testing data
test_path = Path.cwd().parent.joinpath('data/interim/test_unlabelled.pkl')
test = pd.read_pickle(test_path)

# load our fine-tuned BigBird-CT with in-batch negatives model
model_bigbird_ct_path = Path.cwd().parent.joinpath('models/bigbird-ct')
model = SentenceTransformer(model_bigbird_ct_path)

sentences = test['Concatenated'].tolist()
codes = test['ModuleCode'].tolist()

# get document embeddings for our testing set modules
embeddings = model.encode(sentences,
                          batch_size = 16,
                          show_progress_bar = True)


Attention type 'block_sparse' is not possible if sequence_length: 676 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
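
The threshold in the warning above follows from simple arithmetic on the BigBird configuration. As a minimal sketch, the figure of 704 can be reproduced from the terms listed in the message itself. The helper name `min_block_sparse_length` is hypothetical and is not part of any library; the formula is copied from the warning text, not from the library source.

```python
def min_block_sparse_length(block_size: int, num_random_blocks: int) -> int:
    """Threshold quoted in the warning (hypothetical helper, not a library function)."""
    global_tokens = 2 * block_size                  # num global tokens
    sliding_tokens = 3 * block_size                 # min. num sliding tokens
    random_tokens = num_random_blocks * block_size  # random blocks
    buffer_tokens = num_random_blocks * block_size  # additional buffer
    return global_tokens + sliding_tokens + random_tokens + buffer_tokens


threshold = min_block_sparse_length(block_size=64, num_random_blocks=3)
sequence_length = 676  # value reported in the warning
uses_full_attention = sequence_length <= threshold
print(threshold, uses_full_attention)  # prints: 704 True
```

Because 676 does not exceed 704, the reported sequence length is too short for block-sparse attention, which is why the message says the attention type is changed to `original_full`.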


One application of this project is discovering module catalogue entries that are semantically similar to a catalogue entry of interest. This has many uses, such as recommending courses to students, identifying duplicate teaching, and facilitating collaboration between university faculty members, including those in different departments.

We demonstrate an implementation here, directly using the cosine similarities of the generated document embeddings. The results of topic modelling could be used to similar effect, with the clusters taking the place of the similarities.

We list the ten document embeddings with the highest cosine similarity to an arbitrarily chosen document embedding, including its self-similarity.
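
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for toy three-dimensional vectors. The function `cosine_similarity` and the vectors are illustrative assumptions; the notebook itself relies on `util.cos_sim` for the real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, so similarity is (approximately) 1
c = [0.0, 0.0, 3.0]  # orthogonal to a, so similarity is 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

A vector compared with itself always scores 1, which is why the document of interest appears at the top of its own ranking.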

In [23]:
# find the cosine similarity matrix for the embeddings
cos_sim = util.cos_sim(embeddings, embeddings)

# add all pairs to a list, with their cosine similarity score, including self-similarities
all_sentence_combinations = []
# iterate over every row so that the last document's self-similarity is also included
for i in range(len(cos_sim)):
    for j in range(i, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])
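
The pair list above grows quadratically with the number of documents. A possible alternative, sketched here under the assumption that only one query document is needed at a time, ranks a single row of the similarity matrix directly. The function `top_k_from_row` and the toy matrix are hypothetical; plain Python lists stand in for the tensor returned by `util.cos_sim`.

```python
import heapq


def top_k_from_row(similarity_matrix, document_id, k=10):
    """Return the k largest (score, index) pairs from one row, highest score first."""
    row = similarity_matrix[document_id]
    scored = ((float(score), j) for j, score in enumerate(row))
    return heapq.nlargest(k, scored)


# toy symmetric similarity matrix for three documents (illustrative values only)
toy_cos_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

print(top_k_from_row(toy_cos_sim, document_id=1, k=2))  # prints: [(1.0, 1), (0.8, 0)]
```

The self-similarity still appears first, matching the behaviour of the pair-based approach. With a PyTorch tensor, `torch.topk` on the row is another option.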

In [39]:
def most_similar_embeddings(document_id):
    '''
    Get the ten highest cosine similarity document embeddings to the document embedding associated with the provided ID
    This includes the self-similarity
    '''
    # get all pairs that feature the document embedding of interest
    similarity_pairs = []
    for score, i, j in all_sentence_combinations:
        if (i == document_id) or (j == document_id):
            similarity_pairs.append([score, i, j])
    # sort the list by descending cosine similarity
    similarity_pairs = sorted(similarity_pairs, key = lambda x: x[0], reverse = True)
    # get ten largest similarity pairs
    similarity_pairs = similarity_pairs[0:10]

    # make dataframe containing details of ten largest similarity pairs
    most_similar_df = pd.DataFrame(columns = ['ModuleCode', 'Document', 'Cosine Similarity'])
    for comparison in similarity_pairs:
        score, i, j = comparison
        if j == document_id:
            to_append = i
        else:
            to_append = j
        most_similar_df.loc[len(most_similar_df)] = [codes[to_append], sentences[to_append], float(score)]

    return most_similar_df

most_similar = most_similar_embeddings(400)

most_similar

Unnamed: 0,ModuleCode,Document,Cosine Similarity
0,"[CSC8423, CSC8430]","in this module the apprentices will learn about the security principles and considerations that should be adopted in the software engineering process from requirements, design, through to development and testing. the module also puts a particular emphasis on the ethical, legal and social considerations of software engineering when applied to the workplace. *apprentices and their employers who wish to apply a project from their workplace must consult with the module leader to ensure the scope is manageable in the semester, and the project criteria is met. the module will cover cyber security: o the need for security o foundations of security o privacy o practical security o information governance: ethical, legal and social issues involved in data management and analysis ethical, social and legal consideration in cyber security to be able to describe and discuss: the information governance requirements that exist in the uk, and the relevant organisational and legislative data protection and data security standards that exist the basic tenets of computer security: confidentiality, integrity and availability authentication and access control problems and their solutions introduction to symmetric and asymmetric (public key) cryptography examples of uses of cryptography for confidentiality, integrity, authentication and non repudiation examples of secure protocols a selection of security modelling techniques security engineering methodology ethical, social and legal concerns in data and information systems apprentices will be able to: present an analysis of the security considerations of a given system detect vulnerabilities and threats in an existing system formulate a practical security solution to a problem, making effective use of time and resources available implement network protocols at various layers",1.0
1,[CSC3632],"to explore in depth the different mechanisms used to protect the security of systems and network, and to manage the corresponding risk. cryptography: simple and practical introduction to symmetric and asymmetric encryption, hashing and signature. malicious code: xxs, code injection, reverse engineering network security: firewall, ids, packet analysis, security protocols authentication and authorisation: biometrics, access control risk management: threat modelling, risk assessment privacy: k anonymity human factors: usability, behavioural security to be able to: assess and incorporate the cyber, physical and social factors involved in system and network security into system design and implementation adopt an adversarial mind set when facing a new system to be able to: conduct practical attacks in a controlled environment conduct a risk assessment of a realistic system and make security recommendations design a security policy and enforce it use and apply a range of security and privacy analysis tools and techniques",0.875024
2,"[CSC8207, CSC8410]","complex systems, such as industrial control systems or electronic voting systems, include social, cyber and physical aspects, which can all be exploited by attackers. users are often wrongly portrayed as “the weakest link”, when the problem lies in the lack of a usable and secure system. the security analysis of a complex system therefore requires a holistic approach, leveraging a range of techniques. the aim of this module is to study techniques required for complex systems, using concrete case studies as well as exploring possible future attacks. the module covers, through the study of research papers and technical reports, attacks against complex systems, as well as techniques to detect, respond to and prevent such attacks. the complex systems studied during the module will reflect current research and technical challenges, for example: industrial control systems and cyber physical infrastructure, social engineering techniques, human aspects of security, forensics analysis, or machine learning based intrusion and misuse detection. security of complex systems (e.g., industrial control systems, smart grids) sophisticated attack mechanisms (e.g., adversarial machine learning) usable security and privacy social engineering techniques the ability to describe and discuss: the interaction of security of social, cyber and physical aspects in complex systems, and their impact on the security of the whole system. the role of human users in the security and privacy of complex systems. the possible security mechanisms to detect, respond to and prevent attacks against complex systems. the ability to analyse and summarise key research papers related to the security of complex systems. the ability to suggest and recommend security mechanisms for a specific complex system.",0.826529
3,"[CSC8212, CSC8414]","it is often impossible to guarantee the complete security of a system, and a cyber security analyst often aims instead to reveal gaps in security provisioning. the aim of this module is to develop skills to select and apply tools and techniques for carrying out security testing strategies including vulnerability scanning, penetration testing and ethical hacking. this module will look at a range of security tools and analysis, covering: definition of ethical hacking network analysis (such as host discovery and traffic analysis) web application analysis (such as xss and vulnerability reporting) operating system analysis (such as privilege escalation and buffer exploitation) cryptography analysis (such as brute force on hashing) malware analysis (such as reverse engineering and intrusion detection) forensics analysis (such as steganography and log analysis) the ability to describe and discuss: how to select and tools and techniques to carry out a variety of security testing strategies the fundamentals of ethical hacking the ability to: plan and carry out testing a variety of security testing strategies identify, investigate and correlate actionable security events conduct a vulnerability assessment select and apply cyber forensic tools and techniques for attack reconstruction conduct analysis of attacker tools",0.789867
4,[EEE8119],"wired and wireless communications networks is one of the fastest growing fields in the engineering world, and a tremendous interest for this topic exists among students. the purpose of the course is: to introduce the students to advanced topics in wired and wireless communications networks and security, their evolution and impacts on modern society. to introduce a broad coverage of modern communication networks and technologies, transmission and switching; to provide students with knowledge of the issues relating to modern telecommunications systems, protocols, flow and error control. to provide students with an understanding of security and encryption and their importance in modern communication systems. to introduce the principles of wireless and broadband communication networks. to help students develop skills required to practice “life long” learning, through covering material related to current and future security and encryption algorithms, communications systems and technologies. to familiarise students with selected topics which are being developed in the research community. this involves a lot of self study, reading papers, technical magazines and handouts. the new syllabus will be divided into six parts as follows: the necessity for communications networks, networks types, networking issues, networking topologies. protocols architectures: the iso/osi reference model; tcp/ip and the osi mode. communications networks transmission principles: switching technologies; error and flow control techniques and standards; performance issues and analysis; routing algorithms and congestion control. fundamentals of privacy and security as applied to modern communications: concept of a cipher system; public key and private key cryptosystems. 
wireless technologies (ieee 802.11, iee 802.15, ieee802.16 student self learning (selected topics from): privacy and security in communications networks: privacy and security issues, protocols, algorithms and techniques, public and private key cryptography, common attacks, ""man in the middle"" attack, countermeasures encryption techniques: advanced encryption standards (aes), rsa, elliptic curve cryptography, authentication and signature. wireless technologies (ieee 802.11, iee 802.15, ieee802.16) wireless communications systems, 4g/5g, iot ethernet, physical constraints in ethernets, csma/cd, back off algorithm, unfairness, max frame rate and throughput, full duplex ethernet, link aggregation, ethernet frame, fragmentation. repeaters, hubs, bridges, switches. ip6, format, arp, fragmentation, basic routing, algorithm, routing table, classes of networks, netmasks, lan analysis, icmp. voice over ip;traffic analysis the performance of network communication systems evaluation in terms of throughput, capacity and utilisation. wireless communication for the tactile internet device to device communications 1) an understanding of communications networks and systems. an awareness of commonly employed communication standards and protocols. an understanding of the theory of traffic and dimensioning of transmission and switching technologies an understanding of commonly used flow and error control techniques as applied to communications networks. an understanding and appreciation of security and encryption issues in modern communications. an understanding of current wireless broadband communications systems and technologies. 
1) ability to analyse and plan telecommunications systems ability to implement and simulate encryption and communications algorithms and protocols using software platforms ability to analyse and evaluate the performance of communications algorithms and systems ability to develop skills required to practice “life long” learning and covering new research material.",0.735155
5,[CSC8004],"to explain how computer networks are implemented using layered protocols, and to explore the techniques used to implement network protocols at the various layers. to provide students with a knowledge and understanding of current and emerging internet technologies, thus giving them a perspective on the past, present and future of the web along with an awareness of the key trade offs in both architecture and user experience. to introduce students to the relevant technology underlying web content delivery and presentation, and to enable them to construct simple web based applications using common, current tools and systems part 1: history and evolution of the web & web publishing: basic languages (e.g. html, css, javascript) and how they work, forms & scripts (inc. client side vs server side scripting); web/database integration: constructing dynamic web pages using database content & php; new developments e.g. xml/xslt part 2: data communication packet switching, cell switching, routing, congestion control, latency; local area networks ethernet, ring networks; network protocols osi and darpa architectures, tcp/ip, application oriented protocols; network security encryption, access control. to be able to describe and discuss the history and evolution of the web & web publishing. to be able to explain how computer networks are implemented using layered protocols. the abililty to construct simple web based applications using common, current tools and systems. the abililty to use the techniques involved in implementing network protocols at the various layers.",0.704993
6,"[LAW3051, LAW3251]","students should by end of module be able to demonstrate an understanding of the implications of the arrival of the internet and networked information and communications in our society and how it has created new legal conflicts in most areas of law. they should then be able to identify legal frameworks of regulation with which to analyse and/or resolve such conflicts students should be able to identify the applicable national and regional laws for the legal issues examined in the module and also to analyse the legal, technological, societal, political and other challenges and conflicts these areas entail. they should demonstrate an understanding of the global framework of internet law, and a background understanding of the international human rights and governance structures relevant to the area. will vary as topical events impact on what is a very dynamic area but typically drawn from: intro/ regulation of the internet privacy/ data protection :introduction to privacy, dp principles and user rights data protection 2: user rights data protection / e commerce 3: cookies, spam, profiling, marketing algorithms, machine learning, and ai : regulation of transparency and profiling state surveillance and “surveillance capitalism” online intermediaries “online harms” and intermediaries : trolling, abuse, terror and fake news cybercrime high technology topics e.g. digital assets: ai; driverless cars; care robots ; ai and the legal profession the internet has come to dominate our society over the last twenty or so years and now is of vital significance to our work and social lives, to our mental health, to our educational development and to our democratic norms. in the commercial world, the internet has transformed delivery of services and goods and expanded consumer choice and power. in terms of dysfunction, the internet has created now opportunities and venues for crime, disorder, harassment and abuse. 
there are arguments that the internet has transformed us into a surveillance society and that privacy is dead. all of these affect a wide variety of regulations and laws. more than this though the internet has ushered in the idea that we are as much governed by the software and hardware around us as by laws : the idea that “law is code”. it has also exposed the power not just of traditional law and policy makers but also of the private sector; and it has exposed national laws to the reality that data flows seamlessly across borders. the internet is thus a transnational legal phenomenon. the module will attempt to put over these ideas suing a series of topics drawn from privacy, data protection, e commerce, cybercrime and security, free speech and censorship online etc. there will be a particular emphasis on newer “ai” related topics such as machine learning, profiling, robots, driverless cars etc. after completing this module, learners will be able to demonstrate critical knowledge and understanding of: the economic, social, technological and historical factors important to the rise of the internet the legal instruments and principles which are key to looking at governance of the internet, including uk, eu and us legal sources as well as international human rights. the commercial, social, technological and cultural barriers to enforcement and application of law online. the notions of regulation by “code” and associated ideas such as pots (privacy optimising technologies.) students should be able to work, individually and in groups, to provide summaries and analyses of relevant legal issues; and participate as required in group discussions and presentations, thus demonstrating oral as well as written skills, as well as coordination and completion of tasks to specifications. students should demonstrate the ability to undertake independent research on internet law topics as required throughout the course and for the essay assignment . 
they should show the ability to access both primary legal source materials and secondary materials, including online material; understand when each is appropriate; and use correct legal citation style. students should demonstrate the ability to understand and use the english language proficiently in relation to internet law issues; to present knowledge or an argument in a way which is comprehensible to others, even under time pressures; and to read and understand internet law materials which are written in technical and complex language.",0.703038
7,[CSC3131],"to introduce participants to the skills necessary for developing systems and their operation methods intended for use principally by non developers. such systems usually have a client facing aspect which needs to meet requirements and also needs underlying support systems, most often accessed over the internet. these support systems can range from the relatively simple to those supporting millions of concurrent users. multitier architectures and their operational considerations : continuous integration deployment maintainability scalability observability security at the end of the module, students will be familiar with the features of a range of current development tools and environments assess the important differences and similarities between various platforms appreciate the importance of design when creating systems for people be aware of future trends that may change how systems are built and what systems are needed at the end of the module, students will be familiar with the features of a range of current development tools and environments assess the important differences and similarities between various platforms appreciate the importance of design when creating systems be aware of future trends that may change how systems are built and what systems are needed",0.699226
8,[CSC8021],"to explain the fundamental principles which govern the operation of the internet. to explain how computer networks are implemented using layered protocols, and to explore the techniques used to implement network protocols at the various layers. to explain the usage and purpose of a remotely accessed networked operating system. linux/unix operating system, she will scripting, network technologies, data communication packet switching, routing, congestion control, latency; local area networks ethernet, wireless networks, satellite communication, network protocols osi architecture , tcp/ip, application oriented protocols. to be able to explain how computer networks are implemented using layered protocols. to be able to explain the operation of a remotely accessed unix like operating system. the ability to use a number of systems software and computing tools (e.g., scripting in unix operating systems), and ability to use the techniques involved in implementing network protocols at the various layers.",0.695339
9,[CSC8429],"this individual work based project is a substantial piece of independent work which is designed based on your job role and the specialism that you are undertaking as part of the software engineering master’s level degree apprenticeship. you will have the opportunity to solve real world problem focus on live business scenarios in your workplace, develop your own specialist expertise in the project, and further improve and demonstrate your professional development skills. you will work closely with your employer and the apprenticeship tutor. specifically, the module aims to equip the apprentices with the following knowledge and skills: to deepen the knowledge, skills and behaviours acquired in the degree apprenticeship programme through practice. to develop an awareness of the range and limitations of technologies available. to develop an awareness of the real world problems in software engineering. toward the end of the programme a lead academic supervisor accompanied by multiple academic supervisors will agree upon a business related project with the apprentice’s employer and the apprentices based on the apprentice’s job role and specialism that they are undertaking as part of the digital and technology solution specialist master’s degree. the independent assessor of the epa (end point assessment) should present the meeting for finalising the topic for capstone project. if the apprentice or their employer needs to change the capstone project scope they must resubmit a project form for the epa assessor to approve. the agreed project will present a typical business task, appropriate for demonstrating the skills and knowledge on the standard. in every project there will be a research component and a strong design, programming and/or analytic element. 
project definition and planning: the agreed project will be comparable in terms of content and complexity for all apprentices – it is the context within which the knowledge, and skills must be demonstrated that will vary. the project is undertaken and completed on programme and pre gateway to the epa (end point assessment). supervision: each project has a lead supervisor and second supervisor, both staff from the school. additional supervision support may be provided by the apprentice’s employer. the apprentice and lead supervisor will meet regularly throughout the period of the project. research: background research will be undertaken in the selected topic with access to the library and online resources. development and software engineering skills: the core of the project will involve carrying out the project plan largely independently, but with guidance from the supervisors. project report: a project report will be prepared, describing the technical background, the work undertaken, the analysis of results and directions for further work. guidance on the style and content of a report will be provided by means of lectures and through the supervisor. project presentation: a mini show and share peers of their six month project, and as a way to bring a celebratory closure to the program. 
to be able to describe and discus: the various inputs, statements of requirements, security considerations and constraints that guide solution architecture and the development of logical and physical systems’ designs; the methodologies designed to help create approaches for organising the software engineering process, the activities that need to be undertaken at different stages in the life cycle and techniques for managing risks in delivering software solutions; the approaches used to modularise the internal structure of an application and describe the structure and behaviour of applications used in a business, with a focus on how they interact with each other and with business users; how to design, develop and deploy software solutions that are secure and effective in delivering the requirements of stakeholders and the factors that affect the design of a successful code; the range of metrics which might be used to evaluate a delivered software product. to be able to demonstrate the ability to: identify, document, review and design complex it enabled business processes that define a set of activities that will accomplish specific organisational goals and provides a systematic approach to improving those processes; professionally present digital and technology solution specialism plans and solutions in a well structured business report; demonstrate self direction and originality in solving problems, and act autonomously in planning and implementing digital and technology solutions specialist tasks at a professional level; be competent at negotiating and closing techniques in a range of interactions and engagements, both with senior internal and external stakeholders; architect, build and support leading edge concurrent software platforms that are performant to industry standards and deliver responsive solutions with good test coverage; drive the technology decision making and development process for projects of varying scales, considering current technologies 
including devops and cloud computing, and evaluate different technology design and implementation options making reasoned proposals and recommendations; develop and deliver, distributed or semi complex software solutions that are scalable and which deliver innovative user experiences and journeys that encompass cross functional teams, platforms and technologies; update current software products, improving the efficiency and functionality, and build new features to product specifications; accomplish planned software development tasks that deliver the expected features, within specified time constraints, security and quality requirements; 10. be accountable for the quality of deliverables from one or more software development teams (source code quality, automated testing, design quality, documentation etc.) and following company standard processes (code reviews, unit testing, source code management etc.).",0.683988


The first row of the table above is the query document itself, with a self-similarity of 1.0; it is included for comparison.

Here we analyse the document embeddings most similar to that of modules **CSC8423** and **CSC8430**, which were grouped during preprocessing because they are essentially duplicate modules. This module broadly covers security principles and considerations for software engineering, including concepts such as data management. In the table, **LAW3051** and **LAW3251** are listed with a cosine similarity of 0.703038, which is fairly high. Indeed, the text for this module concerns the law surrounding the internet and networked information, and references concepts such as digital assets.

These modules are semantically similar enough to be of interest. A student who studied **CSC8423/CSC8430** may wish to learn more about the subject from a legal standpoint; through this methodology they may find **LAW3051/LAW3251** and enquire about this similar module. Likewise, a module lead for **CSC8423/CSC8430** may wish to incorporate more robust legal content into the module, and might therefore consult a module lead from **LAW3051/LAW3251**.
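
The cross-department use case above can be sketched as a filter on the ranked results. This is only an illustration: it assumes that the leading letters of a module code identify the owning school, which is an assumption made here and not something established in this notebook. The helper names are hypothetical, and the scores are copied from the table above.

```python
from itertools import takewhile


def department_prefix(module_code):
    """Leading letters of a module code, e.g. 'CSC8423' -> 'CSC' (assumed convention)."""
    return "".join(takewhile(str.isalpha, module_code))


def cross_department(results, query_codes):
    """Keep only results that share no code prefix with the query module."""
    query_prefixes = {department_prefix(code) for code in query_codes}
    return [
        (codes, score)
        for codes, score in results
        if not query_prefixes & {department_prefix(code) for code in codes}
    ]


# a few (ModuleCode, cosine similarity) rows taken from the table above
results = [
    (["CSC3632"], 0.875024),
    (["EEE8119"], 0.735155),
    (["LAW3051", "LAW3251"], 0.703038),
]

outside = cross_department(results, query_codes=["CSC8423", "CSC8430"])
print(outside)
```

Here the filter keeps the **EEE8119** and **LAW3051/LAW3251** rows and drops **CSC3632**, which shares the `CSC` prefix with the query module.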

There are many other applications of the output of this modelling. For example, a lecturer from the School of Mathematics, Statistics and Physics may wish to teach a course on security principles and considerations for mathematical programming, not knowing that very similar teaching already exists under **CSC8423/CSC8430** in the School of Computing. They could write a draft description in the format of the module catalogue entries, detailing what their module would contain. This draft would then be encoded by the Transformer and its most similar document embeddings found. The lecturer would then see that a module similar to what they wish to teach, **CSC8423/CSC8430**, already exists, and could instead discuss with its module lead how their wish to teach the subject might be accommodated.
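
The draft-module workflow described above might look roughly like the following sketch. Everything in it is a stand-in: the two-dimensional vectors, the labels `module_a` to `module_c`, the function `flag_overlapping_modules`, and the 0.7 threshold are hypothetical choices for illustration. In practice the draft vector would come from `model.encode` and the catalogue vectors from the `embeddings` computed earlier.

```python
import math


def flag_overlapping_modules(draft_vector, catalogue, threshold=0.7):
    """Return (score, label) pairs whose cosine similarity to the draft meets the threshold."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    scored = [(cosine(draft_vector, vector), label) for label, vector in catalogue.items()]
    flagged = [pair for pair in scored if pair[0] >= threshold]
    return sorted(flagged, key=lambda pair: pair[0], reverse=True)


# toy stand-ins for existing catalogue embeddings (hypothetical values)
catalogue = {
    "module_a": [1.0, 0.0],
    "module_b": [3.0, 4.0],
    "module_c": [0.0, -1.0],
}
draft = [4.0, 3.0]  # stand-in for the embedding of the lecturer's draft description

flagged = flag_overlapping_modules(draft, catalogue)
print([label for _, label in flagged])  # prints: ['module_b', 'module_a']
```

Any threshold would need to be chosen by inspecting real similarity scores; the value used here has no empirical basis.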