# Project description

I have built a semantic search engine that searches a dataset of all Nobel Prize winners since 1901. The text corpus consists of the short text describing the reason each prize was awarded. The search engine can therefore find results that are semantically relevant to the user's query.

## Semantic Search

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms.

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
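The embed-then-compare idea can be sketched without any model at all. The snippet below is only an illustration: the three-dimensional vectors, the names `toy_corpus` and `toy_query`, and the helper `cosine_similarity` are all made up for this example, whereas a real model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_corpus = {
    "discovery of insulin": [0.9, 0.1, 0.0],
    "theory of relativity": [0.0, 0.2, 0.9],
    "poetic composition": [0.1, 0.9, 0.1],
}
toy_query = [0.8, 0.2, 0.1]  # pretend this is the embedded query "diabetes"

# The best match is the corpus entry whose vector is closest to the query vector.
best_match = max(toy_corpus, key=lambda text: cosine_similarity(toy_query, toy_corpus[text]))
print(best_match)
```

The query shares no words with the winning entry; the match comes purely from the vectors pointing in a similar direction.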

Source: https://www.sbert.net/examples/applications/semantic-search/README.html

## SentenceTransformers
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

You can use this framework to compute sentence and text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.
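For reference, the cosine similarity between two embedding vectors $u$ and $v$ of dimension $n$ follows the standard definition below. The symbols are chosen here for illustration and are not taken from the SentenceTransformers documentation.

$$
\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}
$$

The value lies between -1 and 1, and depends only on the angle between the vectors, not on their lengths.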

The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.

In [1]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import csv
import pandas as pd
import scipy
import torch

In [2]:
#! pip install wheel
#! pip install sentence_transformers

In [3]:
df = pd.read_csv(r'C:\Users\Lars\OneDrive - Københavns Erhvervsakademi\Documents 1\data-sets\nobel.csv')

In [4]:
df.head()

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
0,1901,Chemistry,The Nobel Prize in Chemistry 1901,"""in recognition of the extraordinary services ...",1/1,160,Individual,Jacobus Henricus van 't Hoff,1852-08-30,Rotterdam,Netherlands,Male,Berlin University,Berlin,Germany,1911-03-01,Berlin,Germany
1,1901,Literature,The Nobel Prize in Literature 1901,"""in special recognition of his poetic composit...",1/1,569,Individual,Sully Prudhomme,1839-03-16,Paris,France,Male,,,,1907-09-07,Châtenay,France
2,1901,Medicine,The Nobel Prize in Physiology or Medicine 1901,"""for his work on serum therapy, especially its...",1/1,293,Individual,Emil Adolf von Behring,1854-03-15,Hansdorf (Lawice),Prussia (Poland),Male,Marburg University,Marburg,Germany,1917-03-31,Marburg,Germany
3,1901,Peace,The Nobel Peace Prize 1901,,1/2,462,Individual,Jean Henry Dunant,1828-05-08,Geneva,Switzerland,Male,,,,1910-10-30,Heiden,Switzerland
4,1901,Peace,The Nobel Peace Prize 1901,,1/2,463,Individual,Frédéric Passy,1822-05-20,Paris,France,Male,,,,1912-06-12,Paris,France


In [5]:
# Drop rows without a motivation text (e.g. the 1901 Peace Prize rows above)
df = df.dropna(subset=['Motivation'])
corpus = df['Motivation']

In [6]:
corpus = corpus.tolist()

all-MiniLM-L6-v2 is a sentence-transformers model. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

In [13]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)  # embed every motivation with the pre-trained model (no fine-tuning happens here)
corpus_embeddings  # each row is an embedding
corpus_embeddings.shape  # 881 vectors, 384 dimensions


torch.Size([881, 384])

We then use the util.cos_sim() function to compute the cosine similarity between the query and all corpus entries.
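Once the scores are computed, the search cell picks the best ones with `torch.topk`, which returns the k largest values together with their positions. A pure-Python sketch of that selection step is shown here with made-up scores; the helper `top_k` and the list `toy_scores` are hypothetical and not part of torch or sentence_transformers.

```python
import heapq

def top_k(scores, k):
    """Return (values, indices) of the k largest scores, highest first."""
    k = min(k, len(scores))  # never ask for more results than there are scores
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [i for i, _ in best]
    values = [s for _, s in best]
    return values, indices

toy_scores = [0.12, 0.30, 0.05, 0.27]  # hypothetical similarity scores
values, indices = top_k(toy_scores, k=2)
print(values, indices)
```

The indices are what tie a score back to its corpus entry, which is why the search loop iterates over values and indices together.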

In [8]:
# Other queries to try: diabetes, heart disease, sorrow and grief

queries = ['machine learning']

top_k = min(10, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
     
    # We use cosine-similarity and torch.topk to find the highest scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop most relevant values for your query:\n")

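    for score, idx in zip(top_results[0], top_results[1]):
        # corpus was built from df after dropna, so position i in corpus is row i in df.
        # Looking the row up by position avoids corpus.index(), which returns the first
        # laureate whenever several laureates share the same motivation text.
        row = int(idx)
        print("Motivation: " + corpus[row], "\n(Score: {:.4f})".format(score), '\n\n', "||" '\n\n', df.iloc[row], '\n\n')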





Query: machine learning

Top most relevant values for your query:

Motivation: "for his development of theory and methods for analyzing selective samples" 
(Score: 0.3015) 

 ||

 Year                                                                 2000
Category                                                        Economics
Prize                   The Sveriges Riksbank Prize in Economic Scienc...
Motivation              "for his development of theory and methods for...
Prize Share                                                           1/2
Laureate ID                                                           732
Laureate Type                                                  Individual
Full Name                                                James J. Heckman
Birth Date                                                     1944-04-19
Birth City                                                    Chicago, IL
Birth Country                                    United States of America
Se

In [9]:
# Alternative ranking with SciPy's cosine distance (kept commented out)
# queries = ['model']
# query_embeddings = embedder.encode(queries)

# closest_n = 20
# for query, query_embedding in zip(queries, query_embeddings):
#     distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings.cpu().numpy(), "cosine")[0]

#     results = zip(range(len(distances)), distances)
#     results = sorted(results, key=lambda x: x[1])

#     print("\n\n======================\n\n")
#     print("Query:", query)
#     print("\nTop", closest_n, "most relevant values for your query:")

#     for idx, distance in results[0:closest_n]:
#         print("(Score: %.4f)" % (1-distance), corpus[idx].strip(), "|| index:", idx )
