Readwise ( http://readwise.io/ ) is a wonderful app that makes it easy to read 
books and articles and highlight parts of the text that are interesting for later
reference or review. I've been using it for a few years and over that time have
amassed a huge library of highlights on many topics. I review some random
highlights daily, and also sync them as markdown to read in my text editor. But
wouldn't it be awesome to chat with an AI that has access to the entire library?

... enter OpenAI, LangChain, and Chroma ...

In this sample, we read the highlights from Readwise using the API, index the
highlights with their embeddings for semantic retrieval with the ChromaDB vector
database, and connect the vector database to a chat model using LangChain.

We'll start by installing the dependencies and configuring the setup with
an OpenAI API key and the access token for Readwise (yes, if you also use
Readwise, you can run this with your own access token to chat with your
highlights library).

In [84]:
%pip install langchain openai chromadb requests tqdm
from IPython.display import clear_output ; clear_output()

In [85]:
import os
os.environ["OPENAI_API_KEY"] = " ... " # Replace with your OpenAI API key
READWISE_ACCESS_TOKEN = " ... " # Replace with your Readwise access token

Now we need to read the highlights from Readwise using the API. Depending on how
many highlights you have, this could take a while (about an hour for my library),
so when we're done we save everything into a JSON file so that we don't have to do
it again. If you need to re-read the highlights, just delete readwise.json and
re-run this cell.
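
The same "fetch once, then reuse the file" idea can be sketched as a small helper. This is only an illustration: `load_or_fetch` and `fake_fetch` are hypothetical names that do not appear in the notebook, and the sample record is made up rather than real Readwise output.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached JSON from `path` if it exists; otherwise call `fetch()` and cache the result."""
    if os.path.exists(path):
        with open(path, 'r') as f:
            return json.load(f)
    data = fetch()
    with open(path, 'w') as f:
        json.dump(data, f)
    return data

# Stand-in for the slow API download (hypothetical data).
calls = []
def fake_fetch():
    calls.append(1)
    return [{'title': 'Example', 'highlights': []}]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'readwise.json')
    first = load_or_fetch(cache_path, fake_fetch)   # calls fake_fetch and writes the file
    second = load_or_fetch(cache_path, fake_fetch)  # reads the file, no second fetch
```

Deleting the cache file forces a fresh download on the next call, which matches the "delete readwise.json and re-run" workflow.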

In [123]:
import requests
from pprint import pprint
from time import sleep
import sys
from tqdm.notebook import tqdm
import json

if os.path.exists('readwise.json'):
    print('readwise.json already exists. Exiting cell.')
    class StopExecution(Exception):
        def _render_traceback_(self):
            pass
    raise StopExecution

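def readwise(api_method, **params):
    """GET one page from a Readwise v2 endpoint; returns {} on failure."""
    try:
        url = 'https://readwise.io/api/v2/{}/'.format(api_method)
        response = requests.get(
            url, params={**params, 'page_size': 1000},
            headers={'Authorization': f'Token {READWISE_ACCESS_TOKEN}'})
        if response.status_code == 429:
            # Rate limited: wait a bit longer than the server asks, then retry.
            throttle_seconds = int(response.headers['Retry-After']) * 1.33333
            sleep(throttle_seconds)
            return readwise(api_method, **params)
        return response.json()
    except Exception as exc:
        # Report the failure and fall back to an empty response.
        print(f'Readwise request failed: {exc!r}', file=sys.stderr)
        return {}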

docs = [
     doc for doc
     in readwise('books')['results']
     if doc['category'] in ['books', 'articles', 'tweets']
]

for doc in tqdm(docs, desc='Reading document highlights', leave=True):
    doc['highlights'] = []
    highlights_response = readwise('highlights', book_id=doc['id'])
    for highlight in highlights_response['results']:
        doc['highlights'].append(highlight)

docs = [doc for doc in docs if len(doc['highlights']) > 0]

with open('readwise.json', 'w') as f:
    json.dump(docs, f)

readwise.json already exists. Exiting cell.
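
The cell above requests a single page per call (with `page_size` set to 1000). If a library or a single document has more items than fit in one page, the remaining pages would need to be followed. Here is a minimal sketch of that loop, assuming each response carries a `results` list and a `next` field holding the next page's URL (or null on the last page); that response shape is an assumption you should check against the Readwise API documentation. `fetch_all` and `fake_pages` are hypothetical names used only for illustration.

```python
def fetch_all(first_url, get_json):
    """Follow `next` links until exhausted, collecting every item in `results`."""
    items = []
    url = first_url
    while url:
        page = get_json(url)                   # one HTTP round trip in real use
        items.extend(page.get('results', []))
        url = page.get('next')                 # None (or missing) ends the loop
    return items

# Stand-in for the HTTP layer: a dict lookup instead of a network request.
fake_pages = {
    'page1': {'results': [1, 2], 'next': 'page2'},
    'page2': {'results': [3], 'next': None},
}
all_items = fetch_all('page1', fake_pages.__getitem__)
```

In the notebook, `get_json` would be a function that performs the authenticated `requests.get` call and returns `response.json()`.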


In [87]:
with open('readwise.json', 'r') as f:
    docs = json.load(f)

Now that we have all the highlights, we need to index them. The Chroma
vector store uses ChromaDB to create an index of the highlights, each
with its embeddings (created using the OpenAI text-embedding-ada-002 model)
and metadata.
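
To make "semantic retrieval" concrete, here is a toy sketch of ranking texts by the cosine similarity of their embedding vectors. It illustrates the general idea only and is not how ChromaDB is implemented; its index structure and distance metric may differ. The three-dimensional vectors are made up (real embeddings have many more dimensions), and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Made-up embeddings: (text, vector) pairs.
toy_index = [
    ('highlight about pricing', [0.9, 0.1, 0.0]),
    ('highlight about physics', [0.0, 0.2, 0.9]),
    ('highlight about revenue', [0.8, 0.3, 0.1]),
]
nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

With these vectors, the pricing and revenue highlights rank above the physics one because their vectors point in nearly the same direction as the query.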

In [91]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

lcdocs = []
for doc in docs:
    # Document-level metadata shared by all of its highlights.
    metadata = {
        'title': doc['title'] or '',
        'author': doc['author'] or '',
        'category': doc['category'] or '',
        'source_url': doc['source_url'] or '',
        'tags': ' '.join(['#' + tag['name'] for tag in doc['tags']]),
    }
    for highlight in doc['highlights']:
        # Skip highlights whose note starts with '.h'; a missing note counts as empty.
        if not (highlight['note'] or '').startswith('.h'):
            lcdocs.append(Document(page_content=highlight['text'], metadata=metadata))

vectordb = Chroma.from_documents(documents=lcdocs, embedding=OpenAIEmbeddings())

Using embedded DuckDB without persistence: data will be transient


Now that we have an index of all our highlights, we can connect it to a chat model
using LangChain. We use the RetrievalQA chain, which is preconfigured to answer
questions by retrieving relevant documents from the index and using their content
to craft a response.

We use the GPT-3.5-turbo model, which is fast and inexpensive. If you'd like to
experiment with a better model, change `model_name` to `'gpt-4'`. It is slower
and more expensive, but can produce superior results when working with complex
texts.

Note that we are returning the retrieved source documents together with the chat
model's response. With that, we can display which documents the highlights are
coming from for reference.

In [111]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
  llm=ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.3, max_tokens=1000),
  chain_type="stuff",
  retriever=vectordb.as_retriever(), 
  return_source_documents=True,
)

With the chain ready, asking a question is easy. We call `qa_chain` with our
question, which triggers the complete chain: it runs the question as a query against
the document index, inserts the retrieved documents into the chat model's context,
and then relies on the model to produce an answer.
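
The "stuff" chain type boils down to concatenating the retrieved passages into a single prompt. A rough sketch of that assembly step follows; the prompt wording here is made up and is not LangChain's actual template, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, passages):
    """Place all retrieved passages into one prompt, followed by the question."""
    context = '\n\n'.join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    'What is noise?',
    ['Passage one.', 'Passage two.'],
)
```

Because every retrieved passage goes into one prompt, this approach only works while the combined text fits in the model's context window.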

With the list of retrieved highlights, we can collect their metadata, and display
the title, author, and tags.

In [124]:
from textwrap import wrap

def ask(question):
    response = qa_chain(question)
    print('\n'.join(wrap(response['result'])))
    print('\n-----')
    src_docs = {}
    for source in response['source_documents']:
        src_docs[source.metadata['source_url']] = source
    for src_doc in src_docs.values():
        print(
            ' * ',
            src_doc.metadata['title'], ' - ',
            src_doc.metadata['author'], ' ',
            src_doc.metadata['tags'],
        )
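
One caveat with keying on `source_url`: the indexing cell stores an empty string when a document has no URL, so all such documents would share a single key and only one of them would be listed. A possible variation, sketched here with the hypothetical helper `unique_sources`, falls back to the title and author when the URL is empty.

```python
def unique_sources(metadatas):
    """Deduplicate metadata dicts by source_url, or by (title, author) when the URL is empty."""
    seen = {}
    for meta in metadatas:
        key = meta.get('source_url') or (meta.get('title'), meta.get('author'))
        seen.setdefault(key, meta)  # keep the first occurrence of each key
    return list(seen.values())

# Made-up metadata for three retrieved highlights from two books without URLs.
sample = [
    {'title': 'Book A', 'author': 'X', 'source_url': ''},
    {'title': 'Book B', 'author': 'Y', 'source_url': ''},
    {'title': 'Book A', 'author': 'X', 'source_url': ''},
]
sources = unique_sources(sample)
```

Inside `ask`, this would be called with `[s.metadata for s in response['source_documents']]`.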

Let's try a few questions ...

In [113]:
ask("ELIF what is constructor theory and cite a few examples")

Constructor theory is a new approach to fundamental physics that aims
to explain not just what happens in the world, but what can and cannot
happen. It focuses on the physical processes that can create and
transform physical objects, rather than just describing their
behavior. In other words, it seeks to understand the laws of physics
in terms of what can and cannot be constructed, rather than what can
and cannot happen.  One example of constructor theory is the idea of a
universal constructor, which is a machine that can build any physical
object that can be built from a given set of building blocks. Another
example is the concept of a constructor game, which is a game that
involves building physical objects according to certain rules and
constraints. These games can be used to explore the limits of what can
and cannot be constructed in the physical world.  Constructor theory
also has implications for other fields, such as biology and computer
science. For example, it can help us unde

In [114]:
ask("What are some of the approaches one can take to avoid the detrimental effects of human noise in the judicial system?")

One approach to reducing noise in the judicial system is to implement
rules and guidelines, as advocated by Frankel and implemented by the
US Sentencing Commission. Other approaches may be better suited to
different types of judgments. Some strategies designed to reduce noise
may also reduce bias. However, it is important to note that some level
of noise may be necessary to prevent wrongdoing and deter individuals
from engaging in illegal activities. Additionally, some people argue
that a noisy system can allow for the accommodation of new and
emerging values.

-----
 *  Noise: A Flaw in Human Judgement  -  Daniel Kahneman, Olivier Sibony, Cass R. Sunstein   #thinking


In [121]:
ask("What are some of the considerations when trying to increase the price of a product you are selling?")

When trying to increase the price of a product you are selling, you
should consider how much value customers attach to your products and
services. You should also recognize that your customers perceive the
value of your products and services in different ways depending on
their specific requirements. If you build these variations into your
pricing structure, you can expect to receive higher profits than you
would with a single pricing policy. Additionally, you should consider
the four ways to increase your business's revenue, which are
increasing the number of customers you serve, increasing the average
size of each transaction by selling more, increasing the frequency of
transactions per customer, and raising your prices. Finally, value
comparison is usually the optimal way to price your offer, since the
value of an offer to a specific group can be quite high, resulting in
a much better price.

-----
 *  The Personal MBA  -  Josh Kaufman   #business


In [122]:
ask("Where did the god YHWH come from?")

Various biblical texts suggest that YHWH came from the south; he comes
“from Seir,” “from Edom,” or “from Mount Paran.” The idea that the god
YHWH has a non-Israelite origin has become the established consensus
in scholarly circles, and archaeological discoveries in the Levant and
Mesopotamia in the nineteenth century and especially in the twentieth
century have suggested a variety of hypotheses about this origin.

-----
 *  The Invention of God  -  Thomas Römer   #bible #history
