# Sentence Embeddings using Siamese BERT-Networks
---
This Google Colab notebook shows how to use the Sentence Transformers Python library to quickly create BERT embeddings for sentences and compare them by cosine similarity, the building block of fast semantic search.

The Sentence Transformers library is available on [PyPI](https://pypi.org/project/sentence-transformers/) and [GitHub](https://github.com/UKPLab/sentence-transformers). The library implements code from the EMNLP 2019 paper "[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410.pdf)" by Nils Reimers and Iryna Gurevych.


## Install Sentence Transformer Library

In [1]:
# Install the library using pip
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/f5/5a/6e41e8383913dd2ba923cdcd02be2e03911595f4d2f9de559ecbed80d2d3/sentence-transformers-0.3.9.tar.gz (64kB)
Collecting transformers<3.6.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

## Load the BERT Model

In [2]:
from sentence_transformers import SentenceTransformer

# Load a pretrained BERT model. Other pretrained models are available:
# - trained on Natural Language Inference (NLI): https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md
# - trained on Semantic Textual Similarity (STS): https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md

model = SentenceTransformer('bert-base-nli-mean-tokens')

100%|██████████| 405M/405M [00:17<00:00, 23.1MB/s]


## Set Up a Corpus

In [11]:
# Answers to the question "What is machine learning?"
# The first entry is the one marked "Suggested:".
sentences_machine_learning = [
  'Suggested: Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences without being actually programmed i.e. without any human assistance. The process starts with feeding good quality data and then training our machines(computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data do we have and what kind of task we are trying to automate. Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand; on the computers part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step.',
  'Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.',
  'Machine Learning is a sub-area of artificial intelligence, whereby the term refers to the ability of IT systems to independently find solutions to problems by recognizing patterns in databases. In other words: Machine Learning enables IT systems to recognize patterns on the basis of existing algorithms and data sets and to develop adequate solution concepts.',
  'Think of ML as a recipe to learn an algorithm. The recipe is: Learn from past experience of tasks Continue to carry out tasks Raise performance with each experience gained How to raise performance with increasing experience is the algorithm it teaches itself with the help of the recipe. A machine is said to learn if and only if it increases it is performance with each gaining experience',
  'Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. Because of new computing technologies, machine learning today is not like machine learning of the past. It was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks; researchers interested in artificial intelligence wanted to see if computers could learn from data. ',
  'Machine learning invariably refers to a family of mathematical techniques and computational algorithms used to ‘learn’ from ‘data’. There are several terms that deserve an amplification. Data. It’s impossible to provide an axiomatic definition to the term data. The term information may be used as a synonym for data but again, that doesn’t say very much (circular arguments never got me very far.) You may think of data as a spreadsheet, a collection of pictures, audio recording, collection of videos, data residing in bank’s credit-card transactions database, the bytes on your hard-disk and so on.',
  'Machine Learning is actually a game of data. If you can play with data then you can master Machine Learning. For Example, if you have spent your childhood in some city say Kolkata then it must be obvious that you know all the places in Kolkata. Now if someone asks you about places in Kolkata then you will be able to answer him/her with the maximum confidence that you have. Relating this example I can say that the experiences of your childhood was the training process (training data) with which you have learned and the person who was asking you was testing on you (test data) and your answer was checked if it was right or wrong in percentage(accuracy rate).',
  'Is a process of enabling a computer based system to learn to do tasks based on well defined statistical and mathematical methods The ability to do the tasks come from the underlying model which is the result of the learning process. Sometimes the ability comes from an mathematical algorithm The model generated represents behavior of the processes that were earlier performed before machine learning The model is generated from huge volume of data, huge both in breadth and depth reflecting the real world in which the processes are performed The more representative data is of the real world, the better the model would be. The challenge is how to make it a true representative',
  'Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.',
  'Machine learning is the art and science to making dumb computers really smart at predicting things and taking necessary step to induce a positive outcome. In simple words you feed some data to the machine along with the results and it learns the pattern in the input variables that led to those specific outcomes, and hence apply these learnings to input data for which we may not know the outcome. This is a very specific type of learning it called supervised learning.',
  'It is how you make computer programs learn something meaningful from data given to it without being explicitly programmed for it. Behind the scenes, it is clever maths that ensures that program converges to a point where it can be safely assumed that program has accurately understood the relationship that input data may hold within. The premise is that a set of inputs samples (corresponding to a given phenomenon) may hold some some hidden pattern which otherwise is unnoticeable by humans. And in case there truly is no pattern in the input data, maths ensures that program never converges or converges poorly.',

]

In [12]:
len(sentences_machine_learning)

11
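Note that the 11 entries include one answer that appears twice (the second and ninth strings are identical). A minimal sketch of how exact repeats could be located before encoding is shown below; the helper name `find_exact_duplicates` and the short `toy_answers` list are hypothetical stand-ins, not part of the original notebook.

```python
def find_exact_duplicates(texts):
    """Return {text: [indices]} for every text that occurs more than once."""
    positions = {}
    for index, text in enumerate(texts):
        positions.setdefault(text, []).append(index)
    # Keep only the texts that were seen at more than one position
    return {text: idxs for text, idxs in positions.items() if len(idxs) > 1}


# Hypothetical stand-in for a list such as sentences_machine_learning
toy_answers = [
    "ML learns from data.",
    "ML is a branch of AI.",
    "ML learns from data.",
]

duplicates = find_exact_duplicates(toy_answers)
print(duplicates)
```

Whether to drop such repeats depends on the goal: an identical pair always produces a cosine similarity of 1, which can be a useful sanity check.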

In [13]:
# Each sentence is encoded as a 768-dimensional vector
sentence_embeddings_machine_learning = model.encode(sentences_machine_learning)

In [20]:
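# Pairwise cosine similarity between all SBERT sentence embeddings
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# cosine_similarity already returns a NumPy array; np.round replaces the deprecated np.round_
print(np.round(cosine_similarity(sentence_embeddings_machine_learning, sentence_embeddings_machine_learning), decimals=3))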

[[1.    0.799 0.758 0.661 0.9   0.794 0.685 0.745 0.799 0.843 0.798]
 [0.799 1.    0.84  0.711 0.796 0.624 0.699 0.795 1.    0.809 0.678]
 [0.758 0.84  1.    0.64  0.828 0.717 0.68  0.798 0.84  0.765 0.743]
 [0.661 0.711 0.64  1.    0.62  0.561 0.661 0.752 0.711 0.69  0.583]
 [0.9   0.796 0.828 0.62  1.    0.817 0.64  0.824 0.796 0.828 0.809]
 [0.794 0.624 0.717 0.561 0.817 1.    0.573 0.736 0.624 0.716 0.788]
 [0.685 0.699 0.68  0.661 0.64  0.573 1.    0.682 0.699 0.715 0.595]
 [0.745 0.795 0.798 0.752 0.824 0.736 0.682 1.    0.795 0.838 0.748]
 [0.799 1.    0.84  0.711 0.796 0.624 0.699 0.795 1.    0.809 0.678]
 [0.843 0.809 0.765 0.69  0.828 0.716 0.715 0.838 0.809 1.    0.8  ]
 [0.798 0.678 0.743 0.583 0.809 0.788 0.595 0.748 0.678 0.8   1.   ]]
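To make the matrix above easier to read, here is a minimal NumPy sketch of what a pairwise cosine similarity computes: each row is scaled to unit length, and the dot products between rows give the similarities. The function name `cosine_similarity_matrix` and the 2-D `toy_embeddings` are illustrative assumptions, and the sketch assumes no row is all zeros; it is not the scikit-learn implementation.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Length of each row vector (assumed non-zero)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    # Dot products of unit vectors are the cosines of the angles between them
    return unit @ unit.T


# Toy 2-D vectors standing in for sentence embeddings
toy_embeddings = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarity, 3))
```

The result is symmetric with ones on the diagonal, which matches the structure of the printed matrix: entry `[i][j]` compares sentence `i` with sentence `j`.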


In [29]:
# Row 0: similarity of the first ("Suggested:") answer to every answer
print(np.round(cosine_similarity(sentence_embeddings_machine_learning, sentence_embeddings_machine_learning), decimals=3)[0])

[1.    0.799 0.758 0.661 0.9   0.794 0.685 0.745 0.799 0.843 0.798]
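The row above can be turned into a ranking of the other answers by their similarity to the first one. The sketch below copies the printed scores into an array so it runs without the model; the variable names are hypothetical, and a stable sort is an assumed choice so that tied scores keep their original order.

```python
import numpy as np

# Scores copied from the printed row: similarity of answer 0 to answers 0..10
scores_vs_suggested = np.array(
    [1.0, 0.799, 0.758, 0.661, 0.9, 0.794, 0.685, 0.745, 0.799, 0.843, 0.798]
)

# Skip index 0, which is the first answer compared with itself
candidate_indices = np.arange(1, len(scores_vs_suggested))
candidate_scores = scores_vs_suggested[1:]

# Sort by descending score; kind="stable" keeps ties in list order
order = np.argsort(-candidate_scores, kind="stable")
ranked_indices = candidate_indices[order]

for index in ranked_indices:
    print(index, scores_vs_suggested[index])
```

With these scores, answer 4 is the closest to the first answer and answer 3 is the furthest.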


In [None]:
# sentences = [
#  'Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.',
#  'Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences without being actually programmed i.e. without any human assistance. The process starts with feeding good quality data and then training our machines(computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data do we have and what kind of task we are trying to automate.',
#  'Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.',
#  'software engineers are really good in coding they are a good team player and good in mathematics',
#  'Joint Entrance Examination – Advanced (JEE-Advanced), formerly the Indian Institutes of Technology-Joint Entrance Examination (IIT-JEE), is an academic examination held annually in India. It is conducted by one of the seven zonal IITs (IIT Roorkee, IIT Kharagpur, IIT Delhi, IIT Kanpur, IIT Bombay, IIT Madras, and IIT Dharwad) under the guidance of the Joint Admission Board (JAB). It is the sole prerequisite for admission to the Indian Institutes of Technology. Other universities like the Rajiv Gandhi Institute of Petroleum Technology, Indian Institute of Science Education and Research and the Indian Institute of Science also use the score obtained on the JEE-Advanced exam as the basis for admission. The examination is organised each year by one of the IITs, on a round-robin rotation pattern.',
#  "The president of India, officially the President of the Republic of India (IAST: Bhārat kē Rāṣhṭrapati), is the ceremonial head of state of India and the Commander-in-chief of the Indian Armed Forces.The president is indirectly elected by an electoral college comprising the Parliament of India (both houses) and the legislative assemblies of each of India's states and territories, who themselves are all directly elected."
# ]

In [22]:
sentences_java = [
  "Suggested: Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose programming language intended to let application developers write once, run anywhere (WORA),[17] meaning that compiled Java code can run on all platforms that support Java without the need for recompilation.[18] Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture. The syntax of Java is similar to C and C++, but has fewer low-level facilities than either of them. The Java runtime provides dynamic capabilities (such as reflection and runtime code modification) that are typically not available in traditional compiled languages. As of 2019, Java was one of the most popular programming languages in use according to GitHub,[19][20] particularly for client-server web applications, with a reported 9 million developers.",
  "Java is just a programming language that James Gosling developed with Sun Microsystems and released in 1995. It was initially known as Oak language named after an Oak tree outside Mr. Gosling’s Office and then renamed as Green and then finally the “Java” from the famous Java Coffee at that time. Its just a programming language which follows absolute principles of Object Oriented Programming. And there is no doubt that this language has helped in developing so many robust applications and frameworks.",
  "JAVA is the name of coffee beans in Indonesia. James gosling who is known as the father of java, his team used to take coffee made of java beans while developing the java language. so, they named it as JAVA. first, they named the language as 'OAK' (which is a strong tree) but due to copyright problem, they changed that name and tried other different names but they couldn’t find suitable to their programming language. finally they tried JAVA. as everyone in the team liked this name so they fixed this. ",
  "Java is a programming language and computing platform. Java released by James Gosling at Sun Microsystems in 1995 and later developed by Oracle Corporation. It is centered on importing the necessary packages to have access to “Object” and “classes.” These objects have methods that do actions and fields that store data. Java is very fast, secure, and reliable language.",
  "Java is a general-purpose, concurrent, object-oriented, class-based, and the runtime environment(JRE) which consists of JVM which is the cornerstone of the Java platform. Java was developed in the mid-1990s by James A. Gosling, a former computer scientist with Sun Microsystems. Java is a programming language that produces software for multiple platforms. When a programmer writes a Java application, the compiled code (known as bytecode) runs on most operating systems (OS), including Windows, Linux and Mac OS. Java derives much of its syntax from the C and C++ programming languages.",
  "Java is a high-level programming language developed by Sun Microsystems. It was originally designed for developing programs for set-top boxes and handheld devices, but later became a popular choice for creating web applications. The Java syntax is similar to C++, but is strictly an object-oriented programming language.",
  "Java is an OO language which helps you to coding through object it based on imperative style coding, It is not so called pure OOPs (all though may debates are here) but as it supports primitive so it is not pure OO language. It is a High level language (generation 4) so it does not directly access OS level resources.",
  "Java is a programming language as well as it is a platform. initially it was developed for operating electronic devices.",
  "Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. A virtual machine, called the Java Virtual Machine (JVM), is used to run the bytecode on each platform.",
  "Java is a object oriented programming language which allows the user to ‘write once and run anywhere'making it platform independent. it is used to develop applications and softwares. Java is portable, robust, secure that is why among the most famous programming languages used.",
  "Java is a programming language that developers use to create applications on yourcomputer. Chances are you've downloaded a program that required the Javaruntime, and so you probably have it installed it on your system. Java also has a web plug-in that allows you to run these apps in your browser.",

]

In [23]:
len(sentences_java)

11

In [24]:
# Each sentence is encoded as a 768-dimensional vector
sentence_embeddings_java = model.encode(sentences_java)


In [27]:
# Pairwise cosine similarity between all SBERT sentence embeddings
from sklearn.metrics.pairwise import cosine_similarity
# np.round replaces the deprecated np.round_
print(np.round(cosine_similarity(sentence_embeddings_java, sentence_embeddings_java), decimals=3))

[[1.    0.66  0.55  0.786 0.755 0.799 0.773 0.657 0.906 0.823 0.718]
 [0.66  1.    0.768 0.79  0.774 0.719 0.555 0.546 0.633 0.609 0.661]
 [0.55  0.768 1.    0.605 0.568 0.621 0.563 0.429 0.464 0.484 0.525]
 [0.786 0.79  0.605 1.    0.857 0.836 0.686 0.699 0.768 0.846 0.751]
 [0.755 0.774 0.568 0.857 1.    0.839 0.606 0.728 0.805 0.667 0.757]
 [0.799 0.719 0.621 0.836 0.839 1.    0.743 0.791 0.78  0.746 0.802]
 [0.773 0.555 0.563 0.686 0.606 0.743 1.    0.63  0.718 0.645 0.584]
 [0.657 0.546 0.429 0.699 0.728 0.791 0.63  1.    0.715 0.639 0.7  ]
 [0.906 0.633 0.464 0.768 0.805 0.78  0.718 0.715 1.    0.727 0.719]
 [0.823 0.609 0.484 0.846 0.667 0.746 0.645 0.639 0.727 1.    0.728]
 [0.718 0.661 0.525 0.751 0.757 0.802 0.584 0.7   0.719 0.728 1.   ]]


In [28]:
# Row 0: similarity of the first ("Suggested:") answer to every answer
print(np.round(cosine_similarity(sentence_embeddings_java, sentence_embeddings_java), decimals=3)[0])

[1.    0.66  0.55  0.786 0.755 0.799 0.773 0.657 0.906 0.823 0.718]
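One possible way to use a row like this is to flag the answers that score at or above a chosen cut-off, and to find the answer that is least similar to the first one. The sketch below reuses the printed scores; the `THRESHOLD` value of 0.8 is an arbitrary assumption for illustration, not a value taken from the notebook or the paper.

```python
import numpy as np

# Scores copied from the printed row: similarity of Java answer 0 to answers 0..10
java_scores = np.array(
    [1.0, 0.66, 0.55, 0.786, 0.755, 0.799, 0.773, 0.657, 0.906, 0.823, 0.718]
)

THRESHOLD = 0.8  # hypothetical cut-off; tune for the task at hand

# Indices of the other answers (index 0 excluded) that reach the cut-off
close_answers = [
    index for index in range(1, len(java_scores)) if java_scores[index] >= THRESHOLD
]

# Index of the answer least similar to the first one
least_similar = int(np.argmin(java_scores))

print(close_answers, least_similar)
```

Here the least similar answer (index 2) is the one that mostly discusses the origin of the name "Java" rather than the language itself.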
