
## SBERT

In [65]:
!pip install tensorflow
!pip install transformers
!pip list | grep -E 'transformers|tokenizers'

tokenizers                    0.9.3          
transformers                  3.5.1          


In [66]:
from transformers import AutoTokenizer, AutoModel
import torch

In [67]:
#Max pooling - take the max value over the token (time) axis for every embedding dimension
def max_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).bool()
    # Replace padding positions with a large negative value so they never win the max.
    # masked_fill returns a new tensor, so model_output[0] is not modified in place.
    masked_embeddings = token_embeddings.masked_fill(~input_mask_expanded, -1e9)
    max_over_time = torch.max(masked_embeddings, 1)[0]
    return max_over_time
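
The masking step in `max_pooling` can be checked on a toy tensor without loading a model. The sketch below is illustrative only: the values are made up, and it uses `masked_fill` as a non-mutating way to apply the same `-1e9` masking idea.

```python
import torch

# Toy batch: 2 "sentences", 3 token positions, 2 embedding dimensions.
token_embeddings = torch.tensor([
    [[0.1, -0.5], [0.7, 0.2], [0.9, 0.9]],   # third position is padding
    [[-0.3, 0.4], [0.2, -0.1], [0.5, 0.6]],  # no padding
])
attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1]])

# Expand the mask to the embedding shape, then push padded positions to -1e9
# so they can never be selected by the max.
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).bool()
masked = token_embeddings.masked_fill(~mask, -1e9)
pooled = masked.max(dim=1).values

print(pooled)  # one pooled vector per sentence, shape (2, 2)
```

The padded position `[0.9, 0.9]` holds the largest raw values in the first sentence, yet it is ignored: the pooled vector for that sentence is built only from the two real tokens.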

In [68]:
#Sentences we want sentence embeddings for

sentences = [
 'Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.',
 'Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences without being actually programmed i.e. without any human assistance. The process starts with feeding good quality data and then training our machines(computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data do we have and what kind of task we are trying to automate.',
 'Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.',
 'software engineers are really good in coding they are a good team player and good in mathematics',
 'Joint Entrance Examination – Advanced (JEE-Advanced), formerly the Indian Institutes of Technology-Joint Entrance Examination (IIT-JEE), is an academic examination held annually in India. It is conducted by one of the seven zonal IITs (IIT Roorkee, IIT Kharagpur, IIT Delhi, IIT Kanpur, IIT Bombay, IIT Madras, and IIT Dharwad) under the guidance of the Joint Admission Board (JAB). It is the sole prerequisite for admission to the Indian Institutes of Technology. Other universities like the Rajiv Gandhi Institute of Petroleum Technology, Indian Institute of Science Education and Research and the Indian Institute of Science also use the score obtained on the JEE-Advanced exam as the basis for admission. The examination is organised each year by one of the IITs, on a round-robin rotation pattern.',
 "The president of India, officially the President of the Republic of India (IAST: Bhārat kē Rāṣhṭrapati), is the ceremonial head of state of India and the Commander-in-chief of the Indian Armed Forces.The president is indirectly elected by an electoral college comprising the Parliament of India (both houses) and the legislative assemblies of each of India's states and territories, who themselves are all directly elected.",
]

In [69]:
#Load the tokenizer and model from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-max-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-max-tokens")

#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

In [70]:
model_output[1]

tensor([[-0.6785, -0.3811, -0.3117,  ..., -0.8613, -0.6026,  0.4337],
        [-0.4550, -0.4918, -0.2056,  ..., -0.6448, -0.4818,  0.1482],
        [-0.7266, -0.3246, -0.1335,  ..., -0.9099, -0.6470,  0.7198],
        [ 0.2156,  0.0120, -0.8795,  ..., -0.8673,  0.1825, -0.6063],
        [-0.6515, -0.4800, -0.3298,  ..., -0.6480, -0.1953,  0.0141],
        [-0.8324, -0.5395, -0.7741,  ..., -0.9241, -0.5157,  0.1302]])

In [71]:
#Perform pooling. In this case, max pooling
sentence_embeddings_sbert = max_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings_sbert)

Sentence embeddings:
tensor([[ 5.4378e-02,  9.6291e-01,  1.9213e+00,  ...,  1.6476e-01,
         -5.0602e-01,  5.4386e-01],
        [ 2.4989e-01,  1.3479e+00,  1.3915e+00,  ...,  2.0159e-01,
         -2.6415e-01,  8.0529e-01],
        [ 3.3215e-01,  1.5401e+00,  2.8933e-01,  ..., -1.6547e-01,
          8.3750e-03,  4.5810e-01],
        [ 1.8371e-01,  9.1992e-01, -4.8081e-01,  ...,  4.1432e-01,
         -9.3064e-04,  2.4924e-01],
        [ 6.5023e-01,  1.1848e+00,  1.3291e+00,  ...,  6.5337e-01,
          1.5838e+00,  1.3274e+00],
        [ 7.8597e-01,  4.7389e-01,  4.5608e-01,  ..., -9.9444e-02,
          9.9268e-01,  1.3300e+00]])


In [72]:
import numpy as np
np.array(model_output[0]).shape

(6, 128, 768)

In [73]:
np.array(model_output[1]).shape

(6, 768)
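
The two shapes correspond to the two outputs of a Hugging Face BERT model: `model_output[0]` holds one hidden vector per token, while `model_output[1]` is the pooler output, which is computed from the first-token (`[CLS]`) hidden state through a dense layer followed by `tanh`. The sketch below mimics that computation with random stand-in weights (an assumption for illustration; the real model uses its trained pooler weights), which also shows why the values of `model_output[1]` stay within [-1, 1].

```python
import torch

torch.manual_seed(0)
batch, seq_len, hidden = 6, 128, 768

# Stand-in for model_output[0]: one hidden vector per token.
last_hidden_state = torch.randn(batch, seq_len, hidden)

# Hypothetical pooler weights; the real model loads trained ones.
dense = torch.nn.Linear(hidden, hidden)

with torch.no_grad():
    cls_vectors = last_hidden_state[:, 0]         # first token of each sentence
    pooler_like = torch.tanh(dense(cls_vectors))  # stand-in for model_output[1]

print(last_hidden_state.shape)  # torch.Size([6, 128, 768])
print(pooler_like.shape)        # torch.Size([6, 768])
```

The notebook does not use the pooler output for sentence embeddings; it max-pools over the per-token vectors in `model_output[0]` instead.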

In [74]:
import numpy as np
np.array(sentence_embeddings_sbert).shape

(6, 768)


## BERT + Max Pooling

In [76]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [77]:
#Max pooling - take the max value over the token (time) axis for every embedding dimension
def max_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).bool()
    # Replace padding positions with a large negative value so they never win the max.
    # masked_fill returns a new tensor, so model_output[0] is not modified in place.
    masked_embeddings = token_embeddings.masked_fill(~input_mask_expanded, -1e9)
    max_over_time = torch.max(masked_embeddings, 1)[0]
    return max_over_time

In [78]:
#Sentences we want sentence embeddings for

sentences = [
 'Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.',
 'Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences without being actually programmed i.e. without any human assistance. The process starts with feeding good quality data and then training our machines(computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data do we have and what kind of task we are trying to automate.',
 'Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.',
 'software engineers are really good in coding they are a good team player and good in mathematics',
 'Joint Entrance Examination – Advanced (JEE-Advanced), formerly the Indian Institutes of Technology-Joint Entrance Examination (IIT-JEE), is an academic examination held annually in India. It is conducted by one of the seven zonal IITs (IIT Roorkee, IIT Kharagpur, IIT Delhi, IIT Kanpur, IIT Bombay, IIT Madras, and IIT Dharwad) under the guidance of the Joint Admission Board (JAB). It is the sole prerequisite for admission to the Indian Institutes of Technology. Other universities like the Rajiv Gandhi Institute of Petroleum Technology, Indian Institute of Science Education and Research and the Indian Institute of Science also use the score obtained on the JEE-Advanced exam as the basis for admission. The examination is organised each year by one of the IITs, on a round-robin rotation pattern.',
 "The president of India, officially the President of the Republic of India (IAST: Bhārat kē Rāṣhṭrapati), is the ceremonial head of state of India and the Commander-in-chief of the Indian Armed Forces.The president is indirectly elected by an electoral college comprising the Parliament of India (both houses) and the legislative assemblies of each of India's states and territories, who themselves are all directly elected.",
]

In [79]:
#Load the tokenizer and model from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

In [80]:
model_output

(tensor([[[ 0.1730, -0.2285, -0.4436,  ..., -0.0600, -0.2608,  0.6115],
          [ 0.4410,  0.4642, -0.6297,  ...,  0.3118,  0.7988,  0.6890],
          [ 0.3267,  0.3485, -0.6775,  ..., -0.4906,  0.1210,  0.7811],
          ...,
          [ 0.3720,  0.1589, -0.0394,  ..., -0.5004, -0.4275,  0.0529],
          [-0.4392, -0.2352, -0.3201,  ...,  0.1480,  0.0387, -0.2422],
          [-0.3278, -0.3365, -0.3303,  ...,  0.0835, -0.1275, -0.0612]],
 
         [[ 0.1139, -0.1046, -0.1401,  ..., -0.3234, -0.2775,  0.7110],
          [ 0.1873,  0.3992, -0.1844,  ...,  0.4850,  0.8252,  0.4604],
          [ 0.1450,  0.3667, -0.0930,  ..., -0.5656, -0.2463,  0.8268],
          ...,
          [-0.2153, -0.4368, -0.0599,  ...,  0.0941,  0.1570,  0.0618],
          [-0.1771, -0.3815,  0.2083,  ..., -0.1897, -0.1277,  0.2641],
          [-0.1634, -0.3891,  0.1858,  ..., -0.1735, -0.1737,  0.2126]],
 
         [[ 0.1438, -0.3382, -0.4895,  ..., -0.1735, -0.2984,  0.7280],
          ...

In [81]:
model_output[0]

tensor([[[ 0.1730, -0.2285, -0.4436,  ..., -0.0600, -0.2608,  0.6115],
         [ 0.4410,  0.4642, -0.6297,  ...,  0.3118,  0.7988,  0.6890],
         [ 0.3267,  0.3485, -0.6775,  ..., -0.4906,  0.1210,  0.7811],
         ...,
         [ 0.3720,  0.1589, -0.0394,  ..., -0.5004, -0.4275,  0.0529],
         [-0.4392, -0.2352, -0.3201,  ...,  0.1480,  0.0387, -0.2422],
         [-0.3278, -0.3365, -0.3303,  ...,  0.0835, -0.1275, -0.0612]],

        [[ 0.1139, -0.1046, -0.1401,  ..., -0.3234, -0.2775,  0.7110],
         [ 0.1873,  0.3992, -0.1844,  ...,  0.4850,  0.8252,  0.4604],
         [ 0.1450,  0.3667, -0.0930,  ..., -0.5656, -0.2463,  0.8268],
         ...,
         [-0.2153, -0.4368, -0.0599,  ...,  0.0941,  0.1570,  0.0618],
         [-0.1771, -0.3815,  0.2083,  ..., -0.1897, -0.1277,  0.2641],
         [-0.1634, -0.3891,  0.1858,  ..., -0.1735, -0.1737,  0.2126]],

        [[ 0.1438, -0.3382, -0.4895,  ..., -0.1735, -0.2984,  0.7280],
         ...

In [82]:
model_output[1]

tensor([[-0.9018, -0.3883, -0.7451,  ..., -0.8394, -0.6117,  0.6867],
        [-0.7585, -0.4638, -0.9114,  ..., -0.9656, -0.5245,  0.4084],
        [-0.8920, -0.4542, -0.8641,  ..., -0.8886, -0.6859,  0.6971],
        [-0.8888, -0.7805, -0.9993,  ..., -0.9924, -0.7944,  0.8092],
        [-0.6271, -0.1364, -0.7894,  ..., -0.8675, -0.2967,  0.1622],
        [-0.8770, -0.4566, -0.9789,  ..., -0.9786, -0.5337,  0.2696]])

In [83]:
#Perform pooling. In this case, max pooling
sentence_embeddings_bert = max_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings: Bert")
print(sentence_embeddings_bert)

Sentence embeddings: Bert
tensor([[ 0.8721,  1.0943,  1.4140,  ...,  0.3142,  0.7988,  1.4586],
        [ 1.0921,  1.6610,  1.2811,  ...,  0.4850,  0.8252,  1.3723],
        [ 0.9241,  1.4654,  1.1465,  ...,  0.2740,  0.7673,  1.4029],
        [ 1.1887,  1.1802,  0.4218,  ..., -0.1099,  0.9717,  0.4272],
        [ 0.9490,  1.1527,  1.5347,  ...,  0.7770,  1.3213,  1.5900],
        [ 1.4372,  1.3312,  1.1156,  ...,  0.4308,  0.9271,  1.5311]])


In [84]:
print("Sentence embeddings: SBert")
print(sentence_embeddings_sbert)

Sentence embeddings: SBert
tensor([[ 5.4378e-02,  9.6291e-01,  1.9213e+00,  ...,  1.6476e-01,
         -5.0602e-01,  5.4386e-01],
        [ 2.4989e-01,  1.3479e+00,  1.3915e+00,  ...,  2.0159e-01,
         -2.6415e-01,  8.0529e-01],
        [ 3.3215e-01,  1.5401e+00,  2.8933e-01,  ..., -1.6547e-01,
          8.3750e-03,  4.5810e-01],
        [ 1.8371e-01,  9.1992e-01, -4.8081e-01,  ...,  4.1432e-01,
         -9.3064e-04,  2.4924e-01],
        [ 6.5023e-01,  1.1848e+00,  1.3291e+00,  ...,  6.5337e-01,
          1.5838e+00,  1.3274e+00],
        [ 7.8597e-01,  4.7389e-01,  4.5608e-01,  ..., -9.9444e-02,
          9.9268e-01,  1.3300e+00]])


### Similarity score: SBERT

In [85]:
#cosine similarity - SBert
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(sentence_embeddings_sbert, sentence_embeddings_sbert))

[[1.         0.89349794 0.7808059  0.728122   0.7061805  0.6594599 ]
 [0.89349794 1.         0.76358527 0.7168366  0.79300374 0.72225964]
 [0.7808059  0.76358527 1.0000001  0.6484803  0.7265867  0.6744678 ]
 [0.728122   0.7168366  0.6484803  1.0000001  0.6005163  0.52225864]
 [0.7061805  0.79300374 0.7265867  0.6005163  0.9999999  0.8709479 ]
 [0.6594599  0.72225964 0.6744678  0.52225864 0.8709479  1.        ]]
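
Each entry in the matrix above is the cosine of the angle between two sentence embeddings: the dot product divided by the product of the two vector norms. As a minimal sketch of that computation, the NumPy function below (`cosine_similarity_matrix` is a hypothetical helper name, and the three 2-D vectors are made-up toy data) produces the same kind of pairwise matrix.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms   # scale every row to unit length
    return unit @ unit.T        # dot products of unit vectors are cosines

toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 4))
```

The diagonal is the similarity of each vector with itself, so it is 1 up to floating-point rounding; this is why the float32 output above shows values such as `1.0000001` and `0.9999999`.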


### Similarity score: BERT + Max Pooling

In [86]:
#cosine similarity - Bert + Max pooling
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(sentence_embeddings_bert, sentence_embeddings_bert))

[[0.9999997  0.9563465  0.9465624  0.89467204 0.93351865 0.9284128 ]
 [0.9563465  1.0000002  0.93994844 0.90052104 0.94531024 0.93707633]
 [0.9465624  0.93994844 0.99999964 0.8899784  0.93335116 0.92260206]
 [0.89467204 0.90052104 0.8899784  1.         0.8841766  0.872373  ]
 [0.93351865 0.94531024 0.93335116 0.8841766  1.0000001  0.9502973 ]
 [0.9284128  0.93707633 0.92260206 0.872373   0.9502973  0.9999999 ]]
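
To compare the two similarity matrices printed above, it helps to look only at the 15 distinct sentence pairs (the upper triangle, excluding the diagonal). The sketch below copies the printed values into arrays and reports the range of pairwise similarities for each model; `off_diagonal` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np

# Similarity matrices copied from the two outputs above.
sbert_sim = np.array([
    [1.0,        0.89349794, 0.7808059,  0.728122,   0.7061805,  0.6594599],
    [0.89349794, 1.0,        0.76358527, 0.7168366,  0.79300374, 0.72225964],
    [0.7808059,  0.76358527, 1.0000001,  0.6484803,  0.7265867,  0.6744678],
    [0.728122,   0.7168366,  0.6484803,  1.0000001,  0.6005163,  0.52225864],
    [0.7061805,  0.79300374, 0.7265867,  0.6005163,  0.9999999,  0.8709479],
    [0.6594599,  0.72225964, 0.6744678,  0.52225864, 0.8709479,  1.0],
])
bert_sim = np.array([
    [0.9999997,  0.9563465,  0.9465624,  0.89467204, 0.93351865, 0.9284128],
    [0.9563465,  1.0000002,  0.93994844, 0.90052104, 0.94531024, 0.93707633],
    [0.9465624,  0.93994844, 0.99999964, 0.8899784,  0.93335116, 0.92260206],
    [0.89467204, 0.90052104, 0.8899784,  1.0,        0.8841766,  0.872373],
    [0.93351865, 0.94531024, 0.93335116, 0.8841766,  1.0000001,  0.9502973],
    [0.9284128,  0.93707633, 0.92260206, 0.872373,   0.9502973,  0.9999999],
])

def off_diagonal(sim):
    """Return the similarities of the distinct sentence pairs (upper triangle)."""
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    return sim[rows, cols]

for name, sim in [("SBERT", sbert_sim), ("BERT + max pooling", bert_sim)]:
    pairs = off_diagonal(sim)
    print(f"{name}: min={pairs.min():.3f} max={pairs.max():.3f} "
          f"spread={pairs.max() - pairs.min():.3f}")
```

On these six sentences, the SBERT similarities range from about 0.52 to 0.89, while the BERT + max pooling similarities all fall between about 0.87 and 0.96. In this small sample, the plain BERT embeddings therefore separate related and unrelated sentences far less clearly than the SBERT embeddings do.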
