## Import Transformer

First, we import a pre-trained sentence-similarity model, `all-MiniLM-L6-v2`. It was trained with BERT-style techniques on a very large set of text pairs collected from the internet. Each pair takes the form (input, output); for example, the input could be a question and the output its answer.

In [1]:
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

## Prepare Corpus

We take the summary from the <a href="https://en.wikipedia.org/wiki/Japan">Japan Wikipedia Page</a> and split it into sentences so that each one can be embedded as a vector.

In [2]:
# Set the corpus to the summary section of the Japan Wikipedia page
corpus = "Japan is an island country in East Asia. It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south. Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto. Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.5 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents. Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and the imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō) and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868. 
In the Meiji period, the Empire of Japan adopted a Western-modeled constitution and pursued a program of industrialization and modernization. Amidst a rise in militarism and overseas colonization, Japan invaded China in 1937 and entered World War II as an Axis power in 1941. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution and began a military alliance with the United States. Under the 1947 constitution, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet. Japan is a highly developed country, and a great power in global politics. Its economy is the world's third-largest by nominal GDP and the fourth-largest by PPP. Although Japan has renounced its right to declare war, the country maintains Self-Defense Forces that rank as one of the world's strongest militaries. After World War II, Japan experienced record growth in an economic miracle, becoming the second-largest economy in the world by 1972 but has stagnated since 1995 in what is referred to as the Lost Decades. Japan has the world's highest life expectancy, though it is experiencing a decline in population. A global leader in the automotive, robotics and electronics industries, the country has made significant contributions to science and technology. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent comic, animation and video game industries. It is a member of numerous international organizations, including the United Nations (since 1956), OECD, G20 and Group of Seven."
# Split it into a list of sentences (note: splitting on '.' also breaks decimals such as "125.5")
docs = corpus.split('.')
print(docs)

['Japan is an island country in East Asia', ' It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south', ' Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa', " Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto", ' Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized', " About three-fourths of the country's terrain is mountainous, concentrating its population of 125", '5 million on narrow coastal plains', ' Japan is divided into 47 administrative prefectures and eight traditional regions', ' The Greater Tokyo Area is the most populous metropolitan a
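
The output above shows one side effect of splitting on every period: "125.5 million" is cut into two entries. A minimal sketch of one possible alternative follows, assuming it is acceptable to split only where sentence-ending punctuation is followed by whitespace. `split_sentences` is a hypothetical helper, not part of the original notebook, and it would still split after abbreviations such as "U.S." when a space follows.

```python
import re

def split_sentences(text):
    """Split text on '.', '!' or '?' only when followed by whitespace.

    Hypothetical helper for illustration: the period inside a number such
    as "125.5" is not followed by whitespace, so the number stays intact.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input itself is empty
    return [part for part in parts if part]

sample = "Its population is 125.5 million. Japan has 47 prefectures."
print(split_sentences(sample))
# ['Its population is 125.5 million.', 'Japan has 47 prefectures.']
```

Unlike `corpus.split('.')`, this keeps the trailing period on each sentence and produces no leading spaces or empty final entry.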

## Encode Corpus
Encode each sentence in the list into a 384-dimensional vector.

In [3]:
corpus_vector = model.encode(docs)
print("Length of vector:", len(corpus_vector[0]))

print(corpus_vector)

Length of vector: 384
[[ 0.05527088  0.04808547 -0.00781395 ... -0.01564412 -0.05199255
  -0.0269122 ]
 [ 0.07182305  0.11629473  0.03326566 ...  0.00400936 -0.04030822
   0.09569606]
 [ 0.11922333  0.00596007 -0.01733764 ...  0.02097987 -0.07156345
   0.01953295]
 ...
 [ 0.07631288 -0.05397929 -0.02969839 ... -0.03893645  0.0111805
   0.04070465]
 [-0.02801095 -0.03043354  0.00067352 ... -0.08902537 -0.00195532
   0.02784135]
 [-0.11883843  0.04829871 -0.00254811 ...  0.1264095   0.04654899
  -0.0157173 ]]


## Embed Our Query

Next, we take a natural-language question and encode it with the same model, which produces another 384-dimensional vector. The query vector and the corpus vectors are then passed to `util.cos_sim`, which scores how similar each sentence is to the query.
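
Cosine similarity, the measure this notebook applies to the query and corpus vectors, is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python illustration of that formula on toy 2-dimensional vectors. `cosine_similarity` is a hypothetical name used here for clarity and is not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

A score near 1 means the two vectors point in nearly the same direction, which is why the highest-scoring sentences are treated as the best matches for the query.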

In [4]:
# Encode our question into a 384-dimensional vector

query = "How many islands are comprised of Japan?"
query_vector = model.encode(query)
print(query_vector)

[ 5.34856394e-02 -1.82546861e-02 -4.17558961e-02  2.40752995e-02
 -1.52715258e-02 -8.24557766e-02  1.37597863e-02 -1.88691774e-03
  6.09181868e-03 -1.85392741e-02  9.03924331e-02 -8.29271749e-02
  4.08576708e-03  6.82265460e-02  7.74517506e-02 -3.90558690e-02
 -9.22505334e-02 -3.49890329e-02  1.34321377e-02 -3.24943848e-02
  4.07671370e-02 -4.95001972e-02  4.43692207e-02 -3.05862669e-02
  5.50294556e-02  1.99010093e-02  9.95273069e-02  9.98777337e-03
  5.35172829e-03 -1.62346214e-02 -1.21098258e-01  4.16901782e-02
  6.46011382e-02  2.43208650e-02  5.00398874e-02  1.92501638e-02
 -5.40138397e-04  3.18752378e-02  9.13850404e-03  1.53699494e-03
 -1.10878177e-01  8.23686924e-03  1.31913960e-01  2.46800184e-02
  4.45419773e-02  1.39279626e-02 -5.25977612e-02 -1.72258019e-02
  5.86168393e-02 -4.52657603e-02  2.14625336e-02 -2.01706309e-02
 -2.12410130e-02  6.19120747e-02  4.46807370e-02 -7.14175999e-02
 -5.87950423e-02 -4.10385840e-02  1.64512675e-02  4.14514653e-02
  3.65730338e-02 -7.20107

## Calculate Similarity

In [5]:
# Calculate cosine similarity between the corpus of vectors and the query vector
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Convert doc_score_pairs to a list of strings
doc_score_strings = [f"Score: {score}, Document: {doc}" for doc, score in doc_score_pairs]

# Output passages & scores
print(doc_score_strings)

['Score: 0.7428829073905945, Document:  Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa', 'Score: 0.7245738506317139, Document: Japan is an island country in East Asia', 'Score: 0.7163315415382385, Document:  Japan is divided into 47 administrative prefectures and eight traditional regions', 'Score: 0.5539742708206177, Document:  Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD', 'Score: 0.48450547456741333, Document:  Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized', "Score: 0.46953386068344116, Document:  Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe
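
When only the best few matches are needed, a full sort is not strictly required. As a rough sketch, assuming the documents and scores are plain Python lists, `heapq.nlargest` from the standard library can select the top entries directly. `top_n` and the example data are hypothetical and for illustration only.

```python
import heapq

def top_n(docs, scores, n):
    """Return the n (doc, score) pairs with the highest scores, best first."""
    pairs = zip(docs, scores)
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])

example_docs = ["a", "b", "c", "d"]
example_scores = [0.2, 0.9, 0.5, 0.7]
print(top_n(example_docs, example_scores, 2))
# [('b', 0.9), ('d', 0.7)]
```

For a corpus of a few dozen sentences the difference from `sorted` is negligible; this mainly matters as the corpus grows.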

## Send the Query with Context to ChatGPT

In [6]:
import json
import requests

key = "API_KEY"  # replace with your OpenAI API key

top_n_docs = doc_score_pairs[:5]

# Keep only the text of the top 5 documents
text_to_summarize = [doc for doc, score in top_n_docs]

# prompt as context

contexts = f"""
            Question: {query}
            Contexts: {text_to_summarize}
"""

content = f"""
            You are an AI assistant providing helpful advice.
            You are given the following extracted parts of a long document and a question. 
            Provide a conversational answer based on the context provided. 
            You should only provide hyperlinks that reference the context below. 
            Do NOT make up hyperlinks. If you can't find the answer in the context below, 
            just say "Hmm, I'm not sure. Try one of the links below." Do NOT try to make up an answer. 
            If the question is not related to the context, politely respond that you are tuned to only answer 
            questions that are related to the context. Do NOT however mention the word "context"
            in your responses. 
            =========
            {contexts}
            =========
            Answer in Markdown
        """

url = "https://api.openai.com/v1/chat/completions"

payload = json.dumps({
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": content
    }
  ]
})
headers = {
  'Authorization': f'Bearer {key}',
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()  # stop here if the API returned an HTTP error

just_text_response = response.json()['choices'][0]['message']['content']
print(just_text_response)

Japan is comprised of an archipelago of 6852 islands, with the five main islands being Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. You can find more information about Japan's islands [here](https://en.wikipedia.org/wiki/Geography_of_Japan#Islands).
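
Indexing straight into `response.json()['choices'][0]` raises an unhelpful `KeyError` when the request fails, for instance with an invalid key. The sketch below shows one defensive way to read the parsed response. `extract_answer` is a hypothetical helper, and the `{"error": {"message": ...}}` shape for failures is an assumption about the API's error payload rather than something shown in this notebook.

```python
def extract_answer(response_json):
    """Return the assistant text from a parsed chat completions response.

    Assumes (not confirmed by the notebook) that failures are reported as
    {"error": {"message": "..."}}; raises RuntimeError in that case.
    """
    if "error" in response_json:
        message = response_json["error"].get("message", "unknown error")
        raise RuntimeError(f"API returned an error: {message}")
    choices = response_json.get("choices") or []
    if not choices:
        raise RuntimeError("API response contained no choices")
    return choices[0]["message"]["content"]

# Example with a hand-written response dict (no network call is made)
ok = {"choices": [{"message": {"role": "assistant", "content": "6852 islands"}}]}
print(extract_answer(ok))  # 6852 islands
```

In the cell above, this would be called as `extract_answer(response.json())` in place of the direct indexing.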
