In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

#suppress a warning message from transformers
import logging
logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)



def get_text_embedding(text, model_name='bert-base-uncased'):
    # Load pre-trained model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize input text and convert to PyTorch tensors
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

    # Get output from pre-trained model
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the last hidden layer over all tokens to get the text embedding
    # (this averages every token vector; it does not use the [CLS] token alone)
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze()


    return embedding
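
Note that `get_text_embedding` reloads the tokenizer and model on every call, so the load is repeated for each paragraph. One possible way to avoid this is to cache the loaded objects per model name. The sketch below shows only the caching pattern: `make_cached_loader` and `fake_load` are hypothetical names, and `fake_load` is a stand-in for the real `from_pretrained` calls so that the example runs without downloading anything.

```python
from functools import lru_cache

def make_cached_loader(load_fn):
    """Wrap load_fn so that each distinct model name is loaded only once."""
    return lru_cache(maxsize=None)(load_fn)

# Records every real load so the caching effect is visible.
load_calls = []

def fake_load(model_name):
    # Hypothetical stand-in: in the notebook this is where
    # AutoTokenizer.from_pretrained and AutoModel.from_pretrained would go.
    load_calls.append(model_name)
    return {"model_name": model_name}

get_model = make_cached_loader(fake_load)

first = get_model("bert-base-uncased")
second = get_model("bert-base-uncased")  # served from the cache

print(first is second)
print(load_calls)
```

With a loader like this, the embedding function could fetch its tokenizer and model from the cache instead of loading them inside the function body.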

In [2]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def get_similar_paragraphs(query_embedding, paragraph_embeddings, paragraphs, k=5):

    # Compute the cosine similarity between the query embedding and each paragraph embedding
    # (cosine_similarity returns a 1x1 array here, so extract the scalar as a float)
    similarities = [float(cosine_similarity(query_embedding.reshape(1, -1), embedding.reshape(1, -1))[0, 0]) for embedding in paragraph_embeddings]

    # Get the indices of the top k paragraphs based on their similarity scores
    top_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:k]

    # Return the top k paragraphs and their similarity scores as a list of dictionaries
    return [{"content": paragraphs[i], "score": similarities[i]} for i in top_indices]
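
To make the ranking step easier to follow, here is a minimal dependency-free sketch of the same idea: score every vector against the query with cosine similarity, then keep the k best. The names `cosine` and `top_k_similar` and the toy vectors are illustrative assumptions, not part of the notebook, and plain Python lists stand in for the BERT embeddings.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, vectors, items, k=5):
    """Return the k items whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    # nlargest avoids sorting the whole list when k is small
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [{"content": items[idx], "score": score} for score, idx in best]

# Toy data for illustration only
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_items = ["a", "b", "c"]
results = top_k_similar([1.0, 0.0], toy_vectors, toy_items, k=2)
print(results)
```

Each score here is a plain float, which keeps the returned dictionaries easy to print or serialize.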




In [36]:
query="Did the Greeks and the Romans have universities?"

In [4]:
#Open example text file
path="Documents/"
file_name="rise_of_universities.txt"

with open(path+file_name, "r",encoding='utf-8') as file:
    input_text = file.read()

#split into paragraphs
paragraphs = input_text.split("\n\n")

#Get embeddings of paragraphs and query
embeddings = [get_text_embedding(paragraph) for paragraph in paragraphs]
query_embedding = get_text_embedding(query)

#Get top k similar paragraphs and cosine similarity score
similar_paragraphs = get_similar_paragraphs(query_embedding, embeddings,paragraphs, k=5)
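
Splitting on the exact string "\n\n" can leave empty or whitespace-only entries when the file has extra blank lines or Windows line endings, and each of those would still be embedded. A slightly more tolerant splitter might look like the sketch below; `split_paragraphs` and the sample text are assumptions made for illustration.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines and drop empty or whitespace-only chunks."""
    # A blank line is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", text)
    # strip() also removes any stray carriage returns left at chunk edges
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = "First paragraph.\r\n\r\n\r\nSecond paragraph,\nstill second.\n   \nThird."
parts = split_paragraphs(sample)
print(parts)
```

Single newlines inside a paragraph are preserved, so wrapped lines stay together in one chunk.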


In [43]:
#print the most similar paragraph
top_answers=similar_paragraphs[0]["content"]
print(top_answers)

In northern Europe the origin of universities must be sought at Paris, in the cathedral school of Notre-Dame. By the beginning of the twelfth century in France and the Low Countries learning was no longer confined to monasteries but had its most active centres in the schools attached to cathedrals, of which the most famous were those of Liège, Rheims, Laon, Paris, Orleans, and Chartres. The most notable of these schools of the liberal arts was probably Chartres, distinguished by a canonist like St. Ives and by famous teachers of classics and philosophy like Bernard and Thierry. As early as 991 a monk of Rheims, Richer, describes the hardships of his journey to Chartres in order to study the Aphorisms of Hippocrates of Cos; while from the twelfth century John of Salisbury, the leading northern humanist of the age, has left us an[20] account of the masters which we shall later have occasion to cite. Nowhere else today can we drop back more easily into a cathedral city of the twelfth cent

In [38]:
prompt=f"You are a friendly and helpful chatbot. your job is to give a short and relevant answer to a given question, based on a specific background context. The context is: '{top_answers}'. Please give a relevant answer to the following question. Question: {query}. Answer:"
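
Because the retrieved paragraph is pasted into the prompt verbatim, a long paragraph can use up most of the model's input window. One rough way to bound this is to cap the context by word count before formatting the prompt. This is only a sketch: `build_prompt` and `max_context_words` are hypothetical, and a word count is a crude proxy for the token count the model actually sees.

```python
def build_prompt(context, query, max_context_words=300):
    """Format the prompt, keeping at most max_context_words words of context."""
    words = context.split()
    if len(words) > max_context_words:
        # Keep only the leading words of the context
        context = " ".join(words[:max_context_words])
    return (
        "You are a friendly and helpful chatbot. your job is to give a short "
        "and relevant answer to a given question, based on a specific "
        f"background context. The context is: '{context}'. Please give a "
        "relevant answer to the following question. "
        f"Question: {query}. Answer:"
    )

example_prompt = build_prompt("one two three four", "Who?", max_context_words=2)
print(example_prompt)
```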


In [40]:
from transformers import pipeline
#Feed the retrieved paragraph and query to a GPT model to generate a response
#(max_length counts the prompt tokens as well as the generated ones)
generator = pipeline("text-generation", model="gpt2-large")
generator(prompt, max_length=600, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "You are a friendly and helpful chatbot. your job is to give a short and relevant answer to a given question, based on a specific background context. The context is: 'In northern Europe the origin of universities must be sought at Paris, in the cathedral school of Notre-Dame. By the beginning of the twelfth century in France and the Low Countries learning was no longer confined to monasteries but had its most active centres in the schools attached to cathedrals, of which the most famous were those of Liège, Rheims, Laon, Paris, Orleans, and Chartres. The most notable of these schools of the liberal arts was probably Chartres, distinguished by a canonist like St. Ives and by famous teachers of classics and philosophy like Bernard and Thierry. As early as 991 a monk of Rheims, Richer, describes the hardships of his journey to Chartres in order to study the Aphorisms of Hippocrates of Cos; while from the twelfth century John of Salisbury, the leading northern humanist 
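
As the output above shows, `generated_text` starts with the prompt itself, so the model's answer is only the part that follows it. A small helper can separate the two. The sketch below uses a made-up prompt and output in place of a real model call, and `extract_answer` is a hypothetical name.

```python
def extract_answer(generated_text, prompt):
    """Return the text the model produced after the echoed prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt was not echoed
    return generated_text.strip()

# Toy stand-in for the list of dictionaries returned by the generator
toy_prompt = "Question: What is 2 + 2? Answer:"
toy_output = [{"generated_text": toy_prompt + " It is 4."}]

answer = extract_answer(toy_output[0]["generated_text"], toy_prompt)
print(answer)
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the continuation and makes this post-processing unnecessary.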