# Web QA

Steps:
 * Crawl the web and clean the text
 * Split the text into Crunks of max number of tokens
 * Embed all chunks
 * QA based on text
 * Test Case: an essay from FED website

📌 More：https://github.com/openai/openai-cookbook/blob/main/apps/web-crawl-q-and-a/web-qa.ipynb

# 1. Crawl the web and clean the text

In [33]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

In [34]:
# crawl the web

def crawl(url):

    # Get the text from the URL using BeautifulSoup
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Get the text but remove the tags
    text = soup.get_text()

    # If the crawler gets to a page that requires JavaScript, it will stop the crawl
    if ("You need to enable JavaScript to run this app." in text):
        print("Unable to parse page " + url + " due to JavaScript being required")
                # Otherwise, write the text to the file in the text directory
    
    return text

# clean the text
def remove_newlines(text):
    text = str(text)
    text = text.replace('\n', ' ')
    text = text.replace('\\n', ' ')
    text = text.replace('  ', ' ')
    text = text.replace('  ', ' ')
    return text

In [35]:
text_raw = crawl("https://www.federalreservehistory.org/essays/recession-of-1981-82")
text_clean = remove_newlines(text_raw)

In [13]:
text_clean

"     Recession of 1981–82 | Federal Reserve History        ×  Close  Skip top navigation                   Federal Reserve History        Overview Great Recession and After (2007–) Great Moderation (1982–2007) Great Inflation (1965–1982) After the Accord (1951–1965) WWII and After (1941–1951) Great Depression (1929–1941) Fed’s Formative Years (1913–1929) Before the Fed (1791–1913) List all essays          Federal Reserve People        Current Fed leaders People by time period People by affiliation List all people          About the Fed        Introduction Structure of the Fed Purposes and functions Current Fed leaders Other Federal Reserve sites          Learning Fed History        Classroom resources About this site Our authors Related resources   Home >          Federal Reserve History >          Time Period: The Great Inflation >         Recession of 1981–82   Recession of 1981–82 July 1981–November 1982 Lasting from July 1981 to November 1982, this economic downturn was triggered 

# 2. Split the text into Crunks of max number of tokens

In [17]:
import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

max_tokens = 500
# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
    
    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater 
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of 
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    return chunks

In [20]:
len(tokenizer.encode(text_clean))

1948

In [21]:
chunks = split_into_many(text_clean)

# 3. Embed all chunks

In [23]:
import pandas as pd
import numpy as np

import openai
from openai.embeddings_utils import distances_from_embeddings, cosine_similarity

df = pd.DataFrame(chunks, columns=['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df['embeddings'] = df.text.apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])

# save for reuse
#df.to_csv('embeddings.csv')
#df=pd.read_csv('embeddings.csv', index_col=0)
#df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

In [30]:
df

Unnamed: 0,text,n_tokens,embeddings
0,Recession of 1981–82 | Federal Reserve Hi...,497,"[-0.034047387540340424, -0.017126478254795074,..."
1,Both the 1980 and 1981-82 recessions were trig...,479,"[-0.04432230070233345, -0.04187297821044922, 0..."
2,While the nominal rates the Fed targeted could...,496,"[-0.05253702029585838, -0.03755616769194603, 0..."


# 4. QA

In [25]:
def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)

In [26]:
def answer_question(
    df,
    model="text-davinci-003",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )
    # If debug, print the raw model response
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a completions using the question and context
        response = openai.Completion.create(
            prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [32]:
answer_question(df, question="What are the key takeways in the text? Summarize as points", debug=False)

'1. The 1980 and 1981-82 recessions were triggered by tight monetary policy in an effort to fight mounting inflation. \n2. Paul Volcker was appointed chairman of the Fed in 1979 and shifted Fed policy to aggressively target the money supply rather than interest rates. \n3. The credit-control program initiated in March 1980 by the Carter administration precipitated a sharp recession. \n4. The Fed allowed the federal funds rate to approach 20 percent in late 1980 and early 1981. \n5. Despite this, long-run interest rates continued to rise. \n6. Volcker was adamant that the Fed not back down from its tight policy when unemployment rose. \n7. By October 1982, inflation had fallen'

# Test with FED website

In [38]:
def web_QA(url, question, debug=False):
    
    text_raw = crawl(url)
    text_clean = remove_newlines(text_raw)
    chunks = split_into_many(text_clean)
    df = pd.DataFrame(chunks, columns=['text'])
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
    df['embeddings'] = df.text.apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])
    
    return answer_question(df, question=question, debug=debug)

In [39]:
FED_url = "https://www.federalreservehistory.org/essays/recession-of-1981-82"
my_question  = "What are the key takeways in the text? Summarize as points"
web_QA(FED_url, my_question, debug=False)

'1. The 1980 and 1981-82 recessions were triggered by tight monetary policy in an effort to fight mounting inflation. \n2. Paul Volcker was appointed chairman of the Fed in 1979 and shifted Fed policy to aggressively target the money supply rather than interest rates. \n3. The credit-control program initiated in March 1980 by the Carter administration precipitated a sharp recession. \n4. The Fed allowed the federal funds rate to approach 20 percent in late 1980 and early 1981. \n5. Despite this, long-run interest rates continued to rise. \n6. Volcker was adamant that the Fed not back down from its tight policy when unemployment rose. \n7. By October 1982, inflation had fallen'