In [147]:
import pandas as pd
import openai
import numpy as np
import pickle
import requests
import json
import re
from typing import List, Set
from transformers import GPT2TokenizerFast
from nltk.tokenize import sent_tokenize
from googleapiclient.discovery import build

In [148]:
openai.api_key = "OPENAI_API_KEY"
search_api_key = "GOOGLESEARCH_API_KEY"

In [145]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(tokenizer.encode(text))

    
def extract_text(snippet: str, title: str, link: str) -> List[tuple]:
    """Clean one search result and return it as a single-row list.

    The date is searched for in (and stripped from) the `title` argument.
    """
    date_regex = r'\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+\d{4}\b'
    # Bracketed text, runs of three or more dots, and quote characters
    noise_regex = r"[\(\[].*[\)\]]|\s*\.{3,}\s*|['\"]"
    date = re.search(date_regex, title)
    date = date.group() if date is not None else None
    ntitle = re.sub(date_regex, '', title)
    ntitle = re.sub(noise_regex, '', ntitle)
    nsnippet = re.sub(noise_regex, '', snippet)
    return [(ntitle, nsnippet, link, count_tokens(ntitle + " " + nsnippet), date)]
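
The cleanup in `extract_text` can be tried without loading the tokenizer. The sketch below reuses the same two regular expressions on a made-up input string; `split_date` is a hypothetical helper name, not part of the notebook.

```python
import re
from typing import Optional, Tuple

# Same patterns as in extract_text
DATE_REGEX = r'\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+\d{4}\b'
NOISE_REGEX = r"[\(\[].*[\)\]]|\s*\.{3,}\s*|['\"]"


def split_date(text: str) -> Tuple[Optional[str], str]:
    """Return (date or None, text with the date and noise removed)."""
    match = re.search(DATE_REGEX, text)
    date = match.group() if match is not None else None
    cleaned = re.sub(DATE_REGEX, '', text)
    cleaned = re.sub(NOISE_REGEX, '', cleaned)
    return date, cleaned


# Made-up input in the style of a search snippet
date, cleaned = split_date("Dec 1, 2003 ... Von Neumann's Universal Constructor, as applied")
print(date)     # Dec 1, 2003
print(cleaned)  # Von Neumanns Universal Constructor, as applied
```

Note that the bracket pattern uses a greedy `.*`, so on a line with several bracketed groups it removes everything from the first opening bracket to the last closing one.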
    

In [149]:
prompt_query = "What is a Von Neumann Universal Constructor?"
response = openai.Completion.create(
  model="text-davinci-003",
  prompt= prompt_query,
  temperature=0.3,
  max_tokens=200,
  top_p=0.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)
full_text = "".join(response["choices"][0]["text"].split("\n"))
print(full_text)

A Von Neumann Universal Constructor is a type of self-replicating machine that is capable of constructing copies of itself using raw materials from its environment. It was first proposed by mathematician and computer scientist John von Neumann in the 1950s. The concept has been used in various fields, including robotics, artificial intelligence, and nanotechnology.


### Preprocessing + Google Search

In [45]:
search_query = full_text
print(search_query)

A Von Neumann Universal Constructor is a type of self-replicating machine that is capable of constructing copies of itself using raw materials from its environment. It was first proposed by mathematician and computer scientist John von Neumann in the 1950s. The concept has been used in various fields, including robotics, artificial intelligence, and nanotechnology.


In [150]:
resource = build("customsearch", 'v1', developerKey=search_api_key).cse()
result = resource.list(q=search_query,cx='CUSTOM_SEARCH_URL',highRange=60).execute()
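results = result.get("items", [])  # "items" is absent when the search returns nothing
res = []
for item in results:
    print(item['snippet'], item['title'], item['link'])
    # Note: extract_text is declared as (snippet, title, link) but is called here
    # with (title, snippet, link), so the "Title" column holds the cleaned snippet
    # text and the "Snippet" column holds the page title.
    res += extract_text(item['title'], item['snippet'], item['link'])
df = pd.DataFrame(res, columns=["Title", "Snippet", "Link", "Tokens", "Date"])
df = df.drop_duplicates(['Title','Snippet'])
df = df.reset_index(drop=True)  # renumber rows after dropping duplicates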
df
    

A self-replicating machine is a type of autonomous robot that is capable of reproducing itself autonomously using raw materials found in the environment, ... Von Neumann himself used the term universal constructor to describe such ... Self-replicating machine - Wikipedia https://en.wikipedia.org/wiki/Self-replicating_machine
A schematic diagram of von Neumann's self-replicating cellular automaton. The system is a universal constructor (UC), namely, a machine (i.e., CA-embedded ... Fifty Years of Research on Self-Replication: An Overview https://fab.cba.mit.edu/classes/865.18/replication/Sipper.pdf
By 1948 von Neumann had already proposed a general abstract architecture that ... with his inspiration from Turing's idea of a universal computing machine, ... Chapter 5 From Idea to Reality: Designing and Building Self ... https://www.tim-taylor.com/selfrepbook/web/chapter-5.html
Von Neumann himself used the term Universal Constructor. ... mining robots to collect raw materials, construction

Unnamed: 0,Title,Snippet,Link,Tokens,Date
0,A self-replicating machine is a type of autono...,Self-replicating machine - Wikipedia,https://en.wikipedia.org/wiki/Self-replicating...,49,
1,A schematic diagram of von Neumanns self-repli...,Fifty Years of Research on Self-Replication: A...,https://fab.cba.mit.edu/classes/865.18/replica...,48,
2,By 1948 von Neumann had already proposed a gen...,Chapter 5 From Idea to Reality: Designing and ...,https://www.tim-taylor.com/selfrepbook/web/cha...,38,
3,Von Neumann himself used the term Universal Co...,Replication | Future | Fandom,https://future.fandom.com/wiki/Replication,34,
4,"Von Neumanns Universal Constructor, as applied...",The Drexler-Smalley debate on molecular assemb...,https://www.kurzweilai.net/the-drexler-smalley...,48,"Dec 1, 2003"
5,Following pioneering work by von Neumann in th...,"Chapter 3 Green grass, red blood, blueprint: r...",https://www.witpress.com/Secure/elibrary/paper...,46,
6,Self-replicating systems could be used as an u...,Modeling Kinematic Cellular Automata Final Report,http://www.niac.usra.edu/files/studies/final_r...,42,"Apr 30, 2004"
7,"While using AnyLogic software, we were able to...",The Origin of Prebiotic Information System in ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,46,
8,Around the same time in the 1940s von Neumann ...,1 Preprint. Final version to appear in Oren Ha...,https://www.ehudlamm.com/outsiders.pdf,49,
9,"Stan Ulam, a Polish-American mathematician who...",Advanced Automation for Space Missions,https://space.nss.org/wp-content/uploads/1982-...,39,
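
The collection and de-duplication step can also be exercised offline. The following is a minimal sketch, assuming a response dictionary shaped like the Custom Search result used above (an optional `items` list whose entries carry `title`, `snippet`, and `link`); `collect_rows` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def collect_rows(response: Dict) -> List[Tuple[str, str, str]]:
    """Return unique (title, snippet, link) rows from a search response."""
    seen = set()
    rows = []
    # A response with no results may not contain an "items" key at all
    for item in response.get("items", []):
        key = (item.get("title", ""), item.get("snippet", ""))
        if key in seen:
            continue  # same title and snippet already collected
        seen.add(key)
        rows.append((key[0], key[1], item.get("link", "")))
    return rows


# Made-up response with one duplicated entry
sample = {
    "items": [
        {"title": "Page A", "snippet": "About A", "link": "https://example.com/a"},
        {"title": "Page A", "snippet": "About A", "link": "https://example.com/a?ref=1"},
        {"title": "Page B", "snippet": "About B", "link": "https://example.com/b"},
    ]
}
rows = collect_rows(sample)
print(len(rows))         # 2
print(collect_rows({}))  # []
```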


In [151]:
MODEL_NAME = "davinci"
ADA_DOC = "text-embedding-ada-doc-002"
ADA_QUERY = "text-embedding-ada-query-002"
DOC_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-doc-001"
QUERY_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-query-001"

In [152]:
def get_embedding(text: str, model: str):
    result = openai.Embedding.create(
        model= model, 
        input=text
    )
    return result["data"][0]["embedding"]

def get_doc_embedding(text: str):
    return get_embedding(text, DOC_EMBEDDINGS_MODEL)  #DOC_EMBEDDINGS_MODEL

def get_query_embedding(text: str):
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL) #QUERY_EMBEDDINGS_MODEL

def compute_doc_embeddings(df: pd.DataFrame):
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_doc_embedding(r.Title + " " + r.Snippet) for idx, r in df.iterrows()
    }
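
Each call to `compute_doc_embeddings` sends one API request per row, so it may be worth caching the result between runs. `pickle` is imported at the top of the notebook but not used in this section; the sketch below shows one possible use. The helper name `load_or_compute_embeddings` and the cache file name are assumptions.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, List


def load_or_compute_embeddings(
    cache_path: Path,
    compute: Callable[[], Dict[int, List[float]]],
) -> Dict[int, List[float]]:
    """Load embeddings from cache_path if it exists, otherwise compute and save them."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    embeddings = compute()
    with cache_path.open("wb") as fh:
        pickle.dump(embeddings, fh)
    return embeddings


# Possible usage in this notebook (hypothetical file name):
# context_embeddings = load_or_compute_embeddings(
#     Path("context_embeddings.pkl"), lambda: compute_doc_embeddings(df)
# )
```

The cache is keyed only by file name, so it should be deleted whenever the search results or the embedding model change. Pickle files should only be loaded from trusted sources.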

In [153]:
df


Unnamed: 0,Title,Snippet,Link,Tokens,Date
0,A self-replicating machine is a type of autono...,Self-replicating machine - Wikipedia,https://en.wikipedia.org/wiki/Self-replicating...,49,
1,A schematic diagram of von Neumanns self-repli...,Fifty Years of Research on Self-Replication: A...,https://fab.cba.mit.edu/classes/865.18/replica...,48,
2,By 1948 von Neumann had already proposed a gen...,Chapter 5 From Idea to Reality: Designing and ...,https://www.tim-taylor.com/selfrepbook/web/cha...,38,
3,Von Neumann himself used the term Universal Co...,Replication | Future | Fandom,https://future.fandom.com/wiki/Replication,34,
4,"Von Neumanns Universal Constructor, as applied...",The Drexler-Smalley debate on molecular assemb...,https://www.kurzweilai.net/the-drexler-smalley...,48,"Dec 1, 2003"
5,Following pioneering work by von Neumann in th...,"Chapter 3 Green grass, red blood, blueprint: r...",https://www.witpress.com/Secure/elibrary/paper...,46,
6,Self-replicating systems could be used as an u...,Modeling Kinematic Cellular Automata Final Report,http://www.niac.usra.edu/files/studies/final_r...,42,"Apr 30, 2004"
7,"While using AnyLogic software, we were able to...",The Origin of Prebiotic Information System in ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,46,
8,Around the same time in the 1940s von Neumann ...,1 Preprint. Final version to appear in Oren Ha...,https://www.ehudlamm.com/outsiders.pdf,49,
9,"Stan Ulam, a Polish-American mathematician who...",Advanced Automation for Space Missions,https://space.nss.org/wp-content/uploads/1982-...,39,


In [154]:
context_embeddings

{0: [-0.003945088014006615,
  0.01628735102713108,
  -0.002778989728540182,
  -0.009772410616278648,
  0.002761561656370759,
  ...

In [155]:
context_embeddings = compute_doc_embeddings(df)    

In [156]:
example_entry = list(context_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

0 : [0.006855120416730642, 0.014167248271405697, 0.01252328883856535, -0.01789948157966137, 0.01456078328192234]... (12288 entries)


In [157]:
# Make the key of the dict a tuple of the title, snippet, and link
document_embeddings = dict()
for key, row in df.iterrows():
    document_embeddings[(row['Title'], row['Snippet'], row['Link'])] = context_embeddings[key]
    

In [158]:
def vector_similarity(x: List[float], y: List[float]):
    """
    We could use cosine similarity or dot product to calculate the similarity between vectors.
    In practice, we have found it makes little difference. 
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query, contexts):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_query_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
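
To see how the ranking behaves, here is a small self-contained sketch with made-up two-dimensional vectors in place of real embeddings. It uses plain Python instead of NumPy so it runs without any dependencies; `dot`, `cosine`, and `rank_by_similarity` are illustrative names. For unit-length vectors the dot product and the cosine similarity coincide, which is one reason the choice between them may matter little.

```python
import math
from typing import Dict, List, Tuple


def dot(x: List[float], y: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x: List[float], y: List[float]) -> float:
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def rank_by_similarity(
    query_embedding: List[float], contexts: Dict[int, List[float]]
) -> List[Tuple[float, int]]:
    """Return (similarity, index) pairs sorted from most to least similar."""
    return sorted(
        ((dot(query_embedding, emb), idx) for idx, emb in contexts.items()),
        reverse=True,
    )


# Toy unit-length vectors, not real embeddings
toy_query = [1.0, 0.0]
toy_contexts = {0: [0.6, 0.8], 1: [1.0, 0.0], 2: [0.0, 1.0]}

ranking = rank_by_similarity(toy_query, toy_contexts)
print([idx for _, idx in ranking])  # [1, 0, 2]
```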

In [159]:
# Order the search results by similarity to the original answer
sites = order_document_sections_by_query_similarity(search_query, context_embeddings)


In [160]:
# Final site ranking, based on the embedding similarity between the original answer and each result's Title + Snippet
for rank, site in enumerate(sites, start=1):
    print(str(rank) + ":", df.iloc[site[1]]['Link'] + " ,index:" + str(site[1]))

1: https://en.wikipedia.org/wiki/Self-replicating_machine ,index:0
2: https://future.fandom.com/wiki/Replication ,index:3
3: https://fab.cba.mit.edu/classes/865.18/replication/Sipper.pdf ,index:1
4: https://www.kurzweilai.net/the-drexler-smalley-debate-on-molecular-assembly ,index:4
5: https://www.witpress.com/Secure/elibrary/papers/9781853128530/9781853128530003FU1.pdf ,index:5
6: http://www.niac.usra.edu/files/studies/final_report/883Toth-Fejel.pdf ,index:6
7: https://space.nss.org/wp-content/uploads/1982-Self-Replicating-Lunar-Factory.pdf ,index:9
8: https://www.ehudlamm.com/outsiders.pdf ,index:8
9: https://www.tim-taylor.com/selfrepbook/web/chapter-5.html ,index:2
10: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6463137/ ,index:7


In [168]:
# Crawl each ranked site and summarize its text
def crawl_page(url):
    """Fetch the page text through Diffbot and return a short OpenAI summary."""
    # Use Diffbot to parse the page
    diffbot_api_token = "DIFFBOT_API_TOKEN"
    headers = {"accept": "application/json"}

    # Pass the token and target URL as query parameters so that requests URL-encodes them
    response = requests.get(
        "https://api.diffbot.com/v3/analyze",
        params={"token": diffbot_api_token, "url": url},
        headers=headers,
    )
    data = response.json()

    # Extract the text from the webpage; fall back to a placeholder if nothing was parsed
    objects = data.get("objects", [])
    text = objects[0].get("text", "No Text") if objects else "No Text"

    # Maximum number of tokens for the completion, reused below as the prompt cut-off
    max_tokens = 600

    # Truncate the prompt. Note that this slice counts characters, not tokens,
    # so the prompt is limited to the first 600 characters of the page.
    text = text[0:max_tokens]

    # Summarize the text block using the OpenAI API
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=text + "\n\nTl;dr",
        temperature=0.7,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=1
    )
    summary = "".join(response["choices"][0]["text"].split("\n"))
    print(summary)
    return summary
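
`crawl_page` cuts the page text with `text[0:max_tokens]`, which counts characters rather than tokens. If a token-based limit is wanted, one option is to encode, slice, and decode. The sketch below assumes a tokenizer object that exposes `encode` and `decode`, as the `GPT2TokenizerFast` instance loaded earlier does; `truncate_to_tokens` is a hypothetical helper, and `WhitespaceTokenizer` is a toy stand-in used only so the example runs without downloading GPT-2.

```python
from typing import List, Protocol


class Tokenizer(Protocol):
    """Minimal interface the sketch relies on."""
    def encode(self, text: str) -> List[int]: ...
    def decode(self, token_ids: List[int]) -> str: ...


def truncate_to_tokens(text: str, tokenizer: Tokenizer, max_tokens: int) -> str:
    """Return text cut to at most max_tokens tokens."""
    ids = tokenizer.encode(text)
    if len(ids) <= max_tokens:
        return text
    return tokenizer.decode(ids[:max_tokens])


class WhitespaceTokenizer:
    """Toy tokenizer: one token per whitespace-separated word."""
    def __init__(self) -> None:
        self.vocab: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab.append(word)
            ids.append(self.vocab.index(word))
        return ids

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in token_ids)


toy = WhitespaceTokenizer()
print(truncate_to_tokens("a b c d e", toy, 3))  # a b c
print(truncate_to_tokens("a b", toy, 3))        # a b
```

In the notebook this could be called as `truncate_to_tokens(text, tokenizer, max_tokens)` with the GPT-2 tokenizer defined near the top.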

In [169]:
# Add a summary of each crawled page to the DataFrame
df["Summary"] = None
for rank, site in enumerate(sites, start=1):
    row_index = site[1]
    print(str(rank) + ":", df.iloc[row_index])
    df.loc[df.index[row_index], 'Summary'] = crawl_page(df.iloc[row_index]['Link'])

1: Title      A self-replicating machine is a type of autono...
Snippet                 Self-replicating machine - Wikipedia
Link       https://en.wikipedia.org/wiki/Self-replicating...
Tokens                                                    49
Date                                                    None
Summary                                                 None
Name: 0, dtype: object
Self-replicating machines are autonomous robots that can reproduce themselves autonomously with raw materials found in the environment, exhibiting self-replication analogous to nature. This concept has been studied by several prominent scientists and researchers, including Homer Jacobson, Edward F. Moore, Freeman Dyson, John von Neumann, Konrad Zuse, and K. Eric Drexler.
2: Title      Von Neumann himself used the term Universal Co...
Snippet                        Replication | Future | Fandom
Link              https://future.fandom.com/wiki/Replication
Tokens                                          

In [170]:
df

Unnamed: 0,Title,Snippet,Link,Tokens,Date,Summary
0,A self-replicating machine is a type of autono...,Self-replicating machine - Wikipedia,https://en.wikipedia.org/wiki/Self-replicating...,49,,Self-replicating machines are autonomous robot...
1,A schematic diagram of von Neumanns self-repli...,Fifty Years of Research on Self-Replication: A...,https://fab.cba.mit.edu/classes/865.18/replica...,48,,This article provides an overview of research...
2,By 1948 von Neumann had already proposed a gen...,Chapter 5 From Idea to Reality: Designing and ...,https://www.tim-taylor.com/selfrepbook/web/cha...,38,,": In the 1930s, Turing developed a theory of u..."
3,Von Neumann himself used the term Universal Co...,Replication | Future | Fandom,https://future.fandom.com/wiki/Replication,34,,Nanolevel Replication is the process of dupli...
4,"Von Neumanns Universal Constructor, as applied...",The Drexler-Smalley debate on molecular assemb...,https://www.kurzweilai.net/the-drexler-smalley...,48,"Dec 1, 2003",- Nanotechnology pioneer Eric Drexler and Nob...
5,Following pioneering work by von Neumann in th...,"Chapter 3 Green grass, red blood, blueprint: r...",https://www.witpress.com/Secure/elibrary/paper...,46,,This paper reviews the various approaches to s...
6,Self-replicating systems could be used as an u...,Modeling Kinematic Cellular Automata Final Report,http://www.niac.usra.edu/files/studies/final_r...,42,"Apr 30, 2004",This project proposes the development of a con...
7,"While using AnyLogic software, we were able to...",The Origin of Prebiotic Information System in ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,46,,:We propose prebiotic pathways from cosmic bui...
8,Around the same time in the 1940s von Neumann ...,1 Preprint. Final version to appear in Oren Ha...,https://www.ehudlamm.com/outsiders.pdf,49,,The folklore of mathematics is full of stories...
9,"Stan Ulam, a Polish-American mathematician who...",Advanced Automation for Space Missions,https://space.nss.org/wp-content/uploads/1982-...,39,,": No text is provided, so there is no tl;dr."


In [171]:
df.iloc[0]["Summary"]

'Self-replicating machines are autonomous robots that can reproduce themselves autonomously with raw materials found in the environment, exhibiting self-replication analogous to nature. This concept has been studied by several prominent scientists and researchers, including Homer Jacobson, Edward F. Moore, Freeman Dyson, John von Neumann, Konrad Zuse, and K. Eric Drexler.'