
This Colab notebook lets you upload a paper to your Google Drive and ask questions about it using OpenAI's embeddings.



## Install Dependencies

In [142]:
!pip install pypdf
!pip install tiktoken
# openai.embeddings_utils (imported below) was removed in openai 1.0, so pin an older release
!pip install "openai<1.0"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Import Dependencies

In [144]:
import sys
from collections import defaultdict
from matplotlib import pyplot as plt
from matplotlib import patches
import argparse
from pypdf import PdfReader
from pathlib import Path
import requests
from google.colab import drive
import pandas as pd
import numpy as np
import openai 
import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity

## Setup
Specify the OpenAI API key and mount Google Drive.

In [147]:
drive.mount('/content/drive')
openai.api_key = '___'
sys.path.append("../")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Upload paper

In [148]:
filename = '/content/drive/MyDrive/' + '1706.03762.pdf'

## Parse PDF to text

In [149]:
def parse_paper(path):
  print("Parsing paper")
  reader = PdfReader(path)
  number_of_pages = len(reader.pages)
  print(f"Total number of pages: {number_of_pages}")
  paper_text = []
  for i in range(number_of_pages):
    page = reader.pages[i]
    page_text = []

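    def visitor_body(text, cm, tm, fontDict, fontSize):
      # tm is the text matrix; tm[4] and tm[5] hold the x and y position
      x = tm[4]
      y = tm[5]
      # ignore header/footer (outside the vertical band) and near-empty fragments
      if (50 < y < 720) and (len(text.strip()) > 1):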
        page_text.append({
          'fontsize': fontSize,
          'text': text.strip().replace('\x03', ''),
          'x': x,
          'y': y
        })

    _ = page.extract_text(visitor_text=visitor_body)

    blob_font_size = None
    blob_text = ''
    processed_text = []

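    for t in page_text:
      if t['fontsize'] == blob_font_size:
        # same font size as the current blob: keep extending it
        blob_text += f" {t['text']}"
      else:
        # font size changed: store the finished blob and start a new one
        if blob_font_size is not None and len(blob_text) > 1:
          processed_text.append({
            'fontsize': blob_font_size,
            'text': blob_text,
            'page': i
          })
        blob_font_size = t['fontsize']
        blob_text = t['text']
    # store the last blob on the page, which the loop above never reaches
    if blob_font_size is not None and len(blob_text) > 1:
      processed_text.append({
        'fontsize': blob_font_size,
        'text': blob_text,
        'page': i
      })
    paper_text += processed_text
  return paper_text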

In [160]:
paper_text = parse_paper(filename)

Parsing paper
Total number of pages: 15
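
The grouping step inside `parse_paper` can be tried without a PDF. The sketch below is a minimal, standalone version of that idea: consecutive fragments with the same font size are merged into one blob. The helper name `group_by_font_size` and the sample fragments are hypothetical and only serve as illustration. Note that the sketch stores the last blob once the loop finishes.

```python
def group_by_font_size(fragments, page):
    """Merge consecutive fragments that share a font size into one blob."""
    blobs = []
    current_size = None
    current_text = ''
    for frag in fragments:
        if frag['fontsize'] == current_size:
            # same font size: extend the current blob
            current_text += f" {frag['text']}"
        else:
            # font size changed: store the finished blob, start a new one
            if current_size is not None:
                blobs.append({'fontsize': current_size, 'text': current_text, 'page': page})
            current_size = frag['fontsize']
            current_text = frag['text']
    # store the final blob as well
    if current_size is not None:
        blobs.append({'fontsize': current_size, 'text': current_text, 'page': page})
    return blobs

# Hypothetical fragments, shaped like the dicts collected by the visitor function
fragments = [
    {'fontsize': 14.0, 'text': '3.2 Attention'},
    {'fontsize': 10.0, 'text': 'An attention function maps'},
    {'fontsize': 10.0, 'text': 'a query to an output.'},
    {'fontsize': 8.0, 'text': 'Figure 2'},
]
blobs = group_by_font_size(fragments, page=0)
for blob in blobs:
    print(blob)
```

A heading, a body paragraph, and a caption usually differ in font size, so this heuristic tends to split the page along those boundaries.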


## Apply a small filter

In [161]:
# drop blobs shorter than 30 characters, such as stray labels and page fragments
filtered_paper_text = []
for row in paper_text:
  if len(row['text']) < 30:
    continue
  filtered_paper_text.append(row)

## Convert to dataframe and inspect

In [None]:
df = pd.DataFrame(filtered_paper_text)
print(df.shape)
df.head()


## Calculate pdf embeddings

In [None]:
import datetime
import time

last_time = datetime.datetime.now()

def rate_limit(min_interval_seconds = 0.5):
    """Sleep if needed so that calls are at least min_interval_seconds apart."""
    global last_time
    sleep = min_interval_seconds - (datetime.datetime.now() - last_time).total_seconds()
    if sleep > 0:
        time.sleep(sleep)
    last_time = datetime.datetime.now()
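
The same throttling idea can be written without a global variable. The sketch below is one possible variant, not the notebook's own helper: the class name `RateLimiter` is hypothetical, and it uses `time.monotonic()` so that the measured interval is not affected by system clock changes.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval_seconds=0.5):
        self.min_interval_seconds = min_interval_seconds
        self._last_call = None  # no call has happened yet

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval_seconds - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# Small demo with a short interval: the first call returns immediately,
# the next two each wait for the rest of the interval.
limiter = RateLimiter(min_interval_seconds=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f} seconds")
```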

In [153]:
def get_embedding_with_ratelimit(x, engine):
    rate_limit()
    return get_embedding(x, engine=engine)

embedding_model = "text-embedding-ada-002"
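# pass the function directly; wrapping it in a list makes apply return a DataFrame instead of a Series
embeddings = df.text.apply(lambda x: get_embedding_with_ratelimit(x, engine=embedding_model))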
df["embeddings"] = embeddings
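
`openai.embeddings_utils` is not available in openai 1.0 and later. On those versions, a replacement for `get_embedding` might look like the sketch below. The function name `embed_text` is hypothetical, and the stand-in client exists only so the example runs without network access or an API key; the real client calls are shown in the trailing comments.

```python
from types import SimpleNamespace

def embed_text(client, text, model="text-embedding-ada-002"):
    """Embed one string with a client exposing the openai>=1.0 embeddings API."""
    # Replace newlines with spaces before embedding (an optional cleanup step).
    cleaned = text.replace("\n", " ")
    response = client.embeddings.create(model=model, input=[cleaned])
    return response.data[0].embedding

# Stand-in client (hypothetical): returns a fixed vector instead of calling the API.
def _fake_create(model, input):
    return SimpleNamespace(data=[SimpleNamespace(embedding=[0.1, 0.2, 0.3])])

fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))
vector = embed_text(fake_client, "Attention is\nall you need")
print(vector)

# With the real library the client would instead be created like this:
#   from openai import OpenAI
#   client = OpenAI()  # reads the OPENAI_API_KEY environment variable
#   df["embeddings"] = df.text.apply(lambda x: embed_text(client, x))
```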

In [154]:
df.shape

(43, 4)

## Embed query and Search

The search ranks the PDF chunks by the cosine similarity between their embeddings and the query embedding, highest first.

In [155]:
def search_reviews(df, query, n=3, pprint=True):
    # pprint is currently unused; it is kept so existing calls keep working
    query_embedding = get_embedding(
        query,
        engine="text-embedding-ada-002"
    )
    df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))

    # keep only the n most similar chunks, best match first
    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
    )
    return results
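
To see the ranking logic in isolation, here is a minimal sketch that uses plain Python and made-up three-dimensional vectors in place of real embeddings. The names `cosine_sim` and `top_n` and the toy data are assumptions for illustration only; real `text-embedding-ada-002` vectors have 1536 dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(chunks, query_vector, n=3):
    """Return the n (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_sim(vec, query_vector), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

# Toy "embeddings" (hypothetical values)
chunks = [
    ("chunk about attention", [1.0, 0.0, 0.0]),
    ("chunk about training", [0.0, 1.0, 0.0]),
    ("chunk about both", [1.0, 1.0, 0.0]),
]
query_vector = [1.0, 0.1, 0.0]

ranked = top_n(chunks, query_vector, n=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Cosine similarity depends only on the angle between two vectors, not on their length, which is why the chunk pointing almost in the same direction as the query ranks first.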

## A Few Example Results

In [158]:
results = search_reviews(df, "explain how multi head self attention works", n=3)
results.iloc[0]['text']


'Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. MultiHead( Q;K;V ) = Concat(head ;:::; head where head = Attention( QW ;KW ;VW Where the projections are parameter matrices'

In [157]:
results = search_reviews(df, "explain the training procedure", n=3)
results.iloc[0]['text']

'This section describes the training regime for our models. 5.1 Training Data and Batching We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [ 3], which has a shared source- target vocabulary of about 37000 tokens. For English-French, we used the signiﬁcantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [ 38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens. 5.2 Hardware and Schedule We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the b