# Reference
* OpenAI - https://openai.com/
* Pinecone - Build Knowledge Database for Query: https://www.pinecone.io/
    * Each registered account can build a Pinecone index for free
    * Current policy: a free Pinecone index is deleted from the server if there is no activity on it
* Reference for building the NLP vector database (Pinecone index) for the chatbot
    * https://github.com/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb

## Step 1 Download diabetes text content from the CDC website as the query knowledge base for the chatbot

In [2]:
# command used to download text content from CDC website
#!wget -r -A.html -P rtdocs https://www.cdc.gov/diabetes/index.html
# download to folder - rtdocs

## Step 2 Set up key variables 
* OpenAI API Key
* Pinecone Index Key
* Add key values in file .env
    * OPENAI_API_KEY=sk******
    * PINECONE_KEY=512*****
    * PINECONE_ENVIRON=nor***
    * PINECONE_KEY_cdc=62***
    * PINECONE_ENVIRON_cdc=asi***
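
A missing or empty entry in `.env` otherwise only shows up later as a `KeyError` or an authentication failure. The following is a minimal sketch of an early check, assuming the key names listed above; `REQUIRED_KEYS`, `missing_keys` and `example_env` are hypothetical names introduced for illustration, and `example_env` stands in for the dictionary returned by `dotenv_values('.env')`.

```python
# Keys the notebook reads from .env (taken from the list above).
REQUIRED_KEYS = ["OPENAI_API_KEY", "PINECONE_KEY", "PINECONE_ENVIRON"]


def missing_keys(env_vars, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in env_vars."""
    return [key for key in required if not env_vars.get(key)]


# Hypothetical values standing in for dotenv_values('.env').
example_env = {"OPENAI_API_KEY": "sk-placeholder", "PINECONE_KEY": ""}

problems = missing_keys(example_env)
if problems:
    print("Missing or empty .env entries:", ", ".join(problems))
```

Running the check right after loading the file keeps configuration problems separate from API errors.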

In [164]:
import pinecone
import os
import openai
from dotenv import dotenv_values
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.document_loaders import TextLoader

env_vars = dotenv_values('.env')
OPENAI_API_KEY = env_vars["OPENAI_API_KEY"]
openai.api_key = OPENAI_API_KEY
PINECONE_API_KEY = env_vars['PINECONE_KEY']
PINECONE_ENV = env_vars['PINECONE_ENVIRON']
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV,  # next to api key in console
)
#pinecone.delete_index('complication')
#print(pinecone.list_indexes())
#pinecone.create_index("complication",dimension=1536,metric="euclidean")
#print(pinecone.list_indexes())

## Delete/Create Pinecone Index

In [189]:
index_name = "diabetes"
#pinecone.delete_index(index_name)
#pinecone.create_index(index_name,dimension=1536)

## Load html files from download folder

In [7]:
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)



1

## Parse content of the CDC diabetes pages

In [77]:
import glob
from pathlib import Path
from bs4 import BeautifulSoup
import re

class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

class ReadTheDocsLoader:
    def __init__(self, path, encoding='utf-8', errors='ignore', custom_html_tag=None, **kwargs):
        self.file_path = Path(path)
        self.encoding = encoding
        self.errors = errors
        self.custom_html_tag = custom_html_tag
        self.bs_kwargs = kwargs

    def _clean_data(self, data):
        soup = BeautifulSoup(data, **self.bs_kwargs)
        # Apply custom HTML tag if specified
        if self.custom_html_tag:
            soup = soup.find(self.custom_html_tag[0], self.custom_html_tag[1])
        # Find the desired content in the HTML structure (modify as needed)
        content_div = soup.find('div', class_='content')
        content = content_div.get_text() if content_div else ''
        # Remove duplicated newlines
        content = re.sub(r'\n+', '\n', content)
        return content.strip()
    
    def _clean_data_diabetes_org(self, data):
        soup = BeautifulSoup(data, 'html.parser')
        # Extract paragraphs with complete sentences
        paragraphs = soup.find_all('p')
        complete_paragraphs = []
        term_above = None

        for paragraph in paragraphs:
            paragraph_text = paragraph.get_text().strip()
            if paragraph_text and len(paragraph_text) >= 15:  # Check if the paragraph has a long sentence
                term_above = paragraph_text
                continue
            if term_above and paragraph_text:  # Check if we have a term above and a non-empty paragraph
                complete_paragraphs.append(term_above)
                complete_paragraphs.append(paragraph_text)
                term_above = None

        return '\n'.join(complete_paragraphs)

    def load(self):
        docs = []
        file_paths = glob.glob(str(self.file_path / '**/*.html'), recursive=True)
        for path in file_paths:
            if not Path(path).is_file():
                continue
            with open(path, encoding=self.encoding, errors=self.errors) as file:
                text = self._clean_data_diabetes_org(file)
            metadata = {"source": path}
            docs.append(Document(page_content=text, metadata=metadata))
        return docs

folder_path = './rtdocs'  # Replace with the correct folder path containing HTML files
loader = ReadTheDocsLoader(folder_path)
docs = loader.load()
print(len(docs))

# Accessing the parsed documents
for doc in docs:
    print("Metadata:", doc.metadata)
    print("Page Content:", doc.page_content)
    print("------")


1
Metadata: {'source': 'rtdocs/diabetes.org/index.html'}
Page Content: Sometimes insurance companies can force you to switch medications. This can be stressful, but there are steps you can take with your health care team.
Tell Me More
------
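
The filtering rule in `_clean_data_diabetes_org` is easy to misread: a long paragraph is kept only if a short, non-empty paragraph follows it, and a long paragraph followed by another long one is silently replaced. The sketch below reproduces that rule on plain strings so it can be tried without HTML; the function name and the sample paragraphs are made up for illustration.

```python
def pair_long_with_following_short(paragraphs, min_len=15):
    """Mirror the pairing rule: keep a long paragraph only together with
    the short, non-empty paragraph that comes after it."""
    kept = []
    pending = None  # most recent long paragraph still waiting for a short one
    for text in (p.strip() for p in paragraphs):
        if text and len(text) >= min_len:
            pending = text  # overwrites any earlier long paragraph
            continue
        if pending and text:
            kept.extend([pending, text])
            pending = None
    return kept


# Hypothetical sample paragraphs.
sample = [
    "Sometimes insurance companies can force you to switch medications.",
    "Tell Me More",
    "A long paragraph with no short line after it.",
    "Another long paragraph that replaces the pending one.",
    "",
    "Learn More",
]

for line in pair_long_with_following_short(sample):
    print(line)
```

In this sample the third paragraph is dropped, which may explain why a page yields much less text than expected.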


In [10]:
#https://github.com/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb
import requests
from bs4 import BeautifulSoup 
from langchain.document_loaders import ReadTheDocsLoader

#loader = ReadTheDocsLoader('rtdocs')
#docs = loader.load()
#len(docs)

In [144]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)


In [12]:
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [145]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

In [146]:
chunks = text_splitter.split_text(docs[5].page_content)
len(chunks)

2
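
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch of fixed-size windows over a token list. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter first splits on the separators and then merges pieces); `chunk_tokens` is a hypothetical helper and the integer list stands in for token IDs.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap`
    tokens with the previous window (simplified illustration)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Ten stand-in token IDs, windows of 4 tokens, 1 token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```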

In [17]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[5].metadata['source'].replace('rtdocs/', 'https://')
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

https://www.cdc.gov/diabetestv/your-health-with-joan.html
c45df8eff36c
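
Later cells use a `documents` list whose construction is not shown in this notebook. The sketch below shows one way such a list could be assembled from the per-page URL hash and the chunk index, matching the `id`, `text` and `metadata['url']` fields seen in the outputs further down. The function names, the `pages` structure and the paragraph-based splitter are assumptions made for illustration; in the notebook the splitter would be `text_splitter.split_text`.

```python
import hashlib


def url_to_uid(url, length=12):
    """Hash a URL into a short hexadecimal ID (fresh MD5 object per URL)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]


def build_documents(pages, split_text):
    """Turn pages into chunk records shaped like {'id', 'text', 'metadata'}."""
    documents = []
    for page in pages:
        url = page["source"].replace("rtdocs/", "https://")
        uid = url_to_uid(url)
        for i, chunk in enumerate(split_text(page["text"])):
            documents.append(
                {"id": f"{uid}-{i}", "text": chunk, "metadata": {"url": url}}
            )
    return documents


# Hypothetical page and a trivial paragraph splitter standing in for
# text_splitter.split_text.
pages = [{"source": "rtdocs/www.cdc.gov/diabetes/index.html",
          "text": "First paragraph.\n\nSecond paragraph."}]
documents = build_documents(pages, lambda text: text.split("\n\n"))
for record in documents:
    print(record["id"], record["metadata"]["url"])
```

Creating a new hash object per URL matters: calling `update` repeatedly on one shared `hashlib.md5()` object would make each ID depend on all URLs hashed before it.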


In [149]:
import os

BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"

In [150]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

In [152]:
import requests
from requests.adapters import HTTPAdapter, Retry
from tqdm.auto import tqdm
# reuse the token loaded from the environment instead of hard-coding it
headers = {
    'Authorization': f'Bearer {BEARER_TOKEN}',
    'Content-Type': 'application/json'
}

batch_size = 100
endpoint_url = "http://localhost:8000"
#endpoint_url = "https://sea-turtle-app-5wgxc.ondigitalocean.app/"
s = requests.Session()

# we setup a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

  0%|          | 0/54 [00:00<?, ?it/s]

In [153]:
endpoint_url

'http://localhost:8000'

In [177]:
documents[0:2]

[{'id': '9f305d81dad8-0',
  'text': 'Yes. People with diabetes of all types are protected under the Americans with Disabilities Act as people with disabilities. This includes access to school, public places, the workplace and some benefits such as Social Security and disability insurance.\nThe below information is intended for attorneys and legal professionals, and provides detailed legal information on diabetes discrimination in the employment context.\nThe American Diabetes Association has presented two free webinars on the changes made by the Americans with Disabilities Act Amendments Act of 2008. The first webinar, in February 2009, focused on the statute itself, while the second, in April 2011, focused on the new regulations implementing the statutory changes. Both webinars are available for viewing at the links below.\nDemonstrating Coverage under the ADA Amendments Act of 2008 for People with Diabetes (PDF) (updated January 2014)\nThis article explains how to prove that a person

In [178]:
doc1 = documents[0]
doc1
for doc1 in documents[0:2]:
    print(doc1)

{'id': '9f305d81dad8-0', 'text': 'Yes. People with diabetes of all types are protected under the Americans with Disabilities Act as people with disabilities. This includes access to school, public places, the workplace and some benefits such as Social Security and disability insurance.\nThe below information is intended for attorneys and legal professionals, and provides detailed legal information on diabetes discrimination in the employment context.\nThe American Diabetes Association has presented two free webinars on the changes made by the Americans with Disabilities Act Amendments Act of 2008. The first webinar, in February 2009, focused on the statute itself, while the second, in April 2011, focused on the new regulations implementing the statutory changes. Both webinars are available for viewing at the links below.\nDemonstrating Coverage under the ADA Amendments Act of 2008 for People with Diabetes (PDF) (updated January 2014)\nThis article explains how to prove that a person wi

In [179]:
for doc1 in documents[0:2]:
    #article = doc1['text']
    print(doc1)
    article = doc1['text']
    url1 = doc1['metadata']['url']

{'id': '9f305d81dad8-0', 'text': 'Yes. People with diabetes of all types are protected under the Americans with Disabilities Act as people with disabilities. This includes access to school, public places, the workplace and some benefits such as Social Security and disability insurance.\nThe below information is intended for attorneys and legal professionals, and provides detailed legal information on diabetes discrimination in the employment context.\nThe American Diabetes Association has presented two free webinars on the changes made by the Americans with Disabilities Act Amendments Act of 2008. The first webinar, in February 2009, focused on the statute itself, while the second, in April 2011, focused on the new regulations implementing the statutory changes. Both webinars are available for viewing at the links below.\nDemonstrating Coverage under the ADA Amendments Act of 2008 for People with Diabetes (PDF) (updated January 2014)\nThis article explains how to prove that a person wi

In [None]:
# assumes `index` already points at an existing Pinecone index,
# e.g. index = pinecone.Index(index_name)
for doc1 in documents:
    article = doc1['text']
    url1 = doc1['metadata']['url']
    vec_id = doc1['id']  # avoid shadowing the built-in id()

    # vectorize with OpenAI text-embedding-ada-002
    embedding = openai.Embedding.create(
        input=article,
        model="text-embedding-ada-002"
    )

    # extract the embedding vector (length = 1536)
    vector = embedding["data"][0]["embedding"]
    # upsert one (id, vector, metadata) tuple per document chunk
    index.upsert(vectors=[(str(vec_id), vector, {"url": url1})])

In [190]:
import pinecone
import openai
from dotenv import dotenv_values

env_vars = dotenv_values('.env')
OPENAI_API_KEY = env_vars["OPENAI_API_KEY"]
PINECONE_API_KEY = env_vars['PINECONE_KEY']
PINECONE_ENV = env_vars['PINECONE_ENVIRON']

# OpenAI API key
openai.api_key = OPENAI_API_KEY

# initialize Pinecone with the API key and environment
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

# set index; must exist
index = pinecone.Index('complication')
# Loop through each document in the 'documents' list
for i, doc1 in enumerate(documents):
    try:
        article = doc1['text']
        url1 = doc1['metadata']['url']
        doc_id = doc1['metadata']['document_id']
        id = doc1['id']
        
        # vectorize with OpenAI text-embedding-ada-002
        embedding = openai.Embedding.create(
            input=article,
            model="text-embedding-ada-002"
        )

        # extract the embedding vector (length = 1536)
        vector = embedding["data"][0]["embedding"]
        
        pinecone_vectors = []
        pinecone_vectors.append((str(id), vector, {'metadata': {"url": url1, 'document_id': str(doc_id)}}))
        
        index.upsert(vectors=pinecone_vectors)
    except Exception as e:
        # Handle the specific exception you want to catch
        print(f"Error processing document {i}: {e}")
        continue


 
# alternatively, all vectors can be collected in one list and upserted to Pinecone in one go
#upsert_response = index.upsert(vectors=pinecone_vectors)
 
print("Vector upload complete.")

Error processing document 3413: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Error processing document 3414: Failed to connect; did you specify the correct index name?
Vector upload complete.
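
The two errors above are connection failures, and the loop simply skips those documents. Below is a minimal sketch of two generic helpers that could make the upload more robust: grouping items into batches and retrying a call with exponential backoff. `batched` and `call_with_retry` are hypothetical helpers, not part of the Pinecone or OpenAI clients, and the retry counts and delays are arbitrary assumptions.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def call_with_retry(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception wait base_delay * 2**attempt seconds
    and try again, re-raising after `retries` failed retries."""
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage sketch with a stand-in for the real upsert call:
uploaded = []
for batch in batched(list(range(7)), batch_size=3):
    call_with_retry(uploaded.extend, batch)
print(uploaded)
```

In the notebook, the stand-in could be replaced by something like `lambda batch: index.upsert(vectors=batch)`, with each batch holding `(id, vector, metadata)` tuples.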


In [192]:
for i, doc1 in enumerate(documents[0:5]):
    try:
        article = doc1['text']
        url1 = doc1['metadata']['url']
        id = doc1['id']
        
        # vectorize with OpenAI text-embedding-ada-002
        embedding = openai.Embedding.create(
            input=article,
            model="text-embedding-ada-002"
        )

        # extract the embedding vector (length = 1536)
        vector = embedding["data"][0]["embedding"]
        
        pinecone_vectors = []
        pinecone_vectors.append((str(id), vector, {"url": url1}))
        print(pinecone_vectors)
        #index.upsert(vectors=pinecone_vectors)
    except Exception as e:
        # Handle the specific exception you want to catch
        print(f"Error processing document {i}: {e}")
        continue

[('9f305d81dad8-0', [0.001058428781107068, -0.028458891436457634, 0.0326991081237793, -0.05192316696047783, -0.004236966837197542, 0.015139921568334103, -0.022228630259633064, -0.00701066805049777, -0.04068528860807419, -0.00704968860372901, -0.004649932961910963, 0.01688283309340477, 0.002924905391409993, 0.0054921237751841545, -0.025987597182393074, -0.0007739049615338445, 0.03816196694970131, -0.0006292042089626193, 0.008675538934767246, -0.021656330674886703, -0.028979163616895676, 0.0305399801582098, -0.003999592736363411, 0.021565284579992294, -0.03121633268892765, 0.007726042531430721, 0.017168983817100525, -0.03111227974295616, 0.0012405241141095757, -0.032152824103832245, 0.018872875720262527, 0.007387865800410509, -0.036497097462415695, -0.013618125580251217, 0.0015201703645288944, 0.006308300886303186, -0.004175184760242701, 0.017949391156435013, 0.033323436975479126, -0.03517040237784386, 0.021500250324606895, -0.0003924397169612348, -0.0031411435920745134, -0.0175071600824

In [188]:
text = doc1['text']
article

'$   [H8  J1 @   0BA \t YTi R 0 \t           D$  `               h@ $@D$@ %\nPL\tlF$iL%  \t             @! ZB@0H\nc3Y     `        GQl    ,F P$&!0\n@%\t(  \tBH@ 0  АP@          H  T B@     ā  b8 !qb&   %\n"PT  \tJ"AA   YTi @D    kD#   bT 4ġ  B (L$   !$bT!` \t!  \t    ( H \'MDH4@) B \tBA           ! H                  H\t     \t @\t             H      PA !% E%(\' $   D @        HB`pEG+tj%            Bq7Dـ @@ j *      DH4S   @\n%\t      @Q(0$$  "  8H   #(.\'L1D  $ 1(HP@pHPDD   XDHPaH ġ ŕL!\t  4H0"ABHJ i $1 5  J  5U b iBJA  F    @\t & B@\tR&                                        $ J   0                     \t(               L@H"q2x8gnDʀК                A@xRIY@     \t   @bUL%   BQHF   \t      `8@       b  @ #H    GQ b BD \t@@ Д \tJJ  \tU``  4 Hk L@@V $4ĪT!1 \t bQ  1($ iYEh8H  04\t \t  `H  HH\n!                                           B@                       J""$ ҄'
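
The output above is not readable text, so embedding it would only add noise to the index. A rough sketch of a filter that could be applied before vectorizing, assuming English prose is mostly lowercase letters: `lowercase_ratio`, `looks_like_prose` and the 0.6 threshold are hypothetical choices, and the heuristic would also reject legitimate all-caps or non-Latin text.

```python
def lowercase_ratio(text):
    """Fraction of non-whitespace characters that are lowercase letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.islower() for c in chars) / len(chars)


def looks_like_prose(text, min_ratio=0.6):
    """Heuristic check: True when the text is dominated by lowercase letters."""
    return lowercase_ratio(text) >= min_ratio


prose = "Diabetes is a chronic health condition that affects how your body turns food into energy."
junk = "$ [H8 J1 @ 0BA YTi R 0 D$ ` h@ $@D$@ %"

print(looks_like_prose(prose), looks_like_prose(junk))
```

A loop such as `documents = [d for d in documents if looks_like_prose(d['text'])]` would then drop suspicious chunks before any API call is made.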

In [154]:
queries = [
    {'query': "What is diabetets?"}
]

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

<Response [500]>

In [156]:
from langchain.vectorstores import Pinecone

In [47]:
pinecone.describe_index('diabetes')

IndexDescription(name='diabetes', metric='cosine', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='p1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

In [45]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.5,
 'namespaces': {'': {'vector_count': 51816}},
 'total_vector_count': 51816}

In [35]:
import textwrap

for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text']+' url: '+result['metadata']['url'])
        scores.append(round(result['score'], 2))

    print("-" * 70)
    print(query)
    print()

    for answer, score in zip(answers, scores):
        wrapped_lines = textwrap.wrap(answer, width=80)  # Adjust the width as needed
        for line in wrapped_lines:
            print(f"{score}: {line}")

    print("-" * 70)
    print()


----------------------------------------------------------------------
What is diabetets?

0.89: Diabetes Overview url:
0.89: https://www.cdc.gov/diabetes/library/reports/reportcard.html
0.89: What is Diabetes?  Español (Spanish) Print Minus Related Pages With diabetes,
0.89: your body either doesn't make enough insulin or can't use it as well as it
0.89: should. Diabetes is a chronic (long-lasting) health condition that affects how
0.89: your body turns food into energy. Your body breaks down most of the food you eat
0.89: into sugar (glucose) and releases it into your bloodstream. When your blood
0.89: sugar goes up, it signals your pancreas to release insulin. Insulin acts like a
0.89: key to let the blood sugar into your body’s cells for use as energy. With
0.89: diabetes, your body doesn’t make enough insulin or can’t use it as well as it
0.89: should. When there isn’t enough insulin or cells stop responding to insulin, too
0.89: much blood sugar stays in your bloodstream. Over ti

In [30]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

----------------------------------------------------------------------
What is diabetets?

0.89: Diabetes Overview
0.89: What is Diabetes?  Español (Spanish) Print Minus Related Pages With diabetes, your body either doesn't make enough insulin or can't use it as well as it should. Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.
0.89: Type 2 Diabetes
----

In [61]:
import textwrap
line = ""
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text']+' url: '+result['metadata']['url'])
        scores.append(round(result['score'], 2))

    print("-" * 70)
    print(query)
    print()


----------------------------------------------------------------------
What is diabetets?



In [62]:
line = ' '.join(answers)
line

"Diabetes Overview url: https://www.cdc.gov/diabetes/library/reports/reportcard.html What is Diabetes?  Español (Spanish) Print Minus Related Pages With diabetes, your body either doesn't make enough insulin or can't use it as well as it should. Diabetes is a chronic (long-lasting)\xa0health condition\xa0that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease. url: https://www.cdc.gov/diabetes/basics/diabetes.html T

In [88]:
import re

context = "Diabetes Overview url: https://www.cdc.gov/diabetes/library/reports/reportcard.html What is Diabetes? Español (Spanish) Print Minus Related Pages With diabetes, your body either doesn't make enough insulin or can't use it as well as it should. Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease. url: https://www.cdc.gov/diabetes/basics/diabetes.html Type 2 Diabetes url: https://www.cdc.gov/diabetes/health-equity/diabetes-by-the-numbers.html"

question = "What is the diabetes?"

# Extract all URLs from the context
urls = re.findall(r"https?://\S+", context)

response_a = openai.Completion.create(
    prompt=(
        "Answer the question based on the context below, append the URLs where the answer was found "
        "and if the question can't be answered based on the context, "
        f"say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    ),
    temperature=0,
    max_tokens=150,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None,
    model="text-davinci-003",
)

answer = response_a["choices"][0]["text"]

if urls:
    urls_text = " ".join(urls)
    answer_with_urls = f"{answer}\n\nURLs: {urls_text}"
else:
    answer_with_urls = answer


In [89]:
answer_with_urls


' Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision\n\nURLs: https://www.cdc.gov/diabetes/library/reports/reportcard.html https://www.cdc.gov/diabetes/basics/diabetes.html https://www.cdc.gov/diabetes/health-equity/diabetes-by-the-numbers.html'
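
Extracting URLs from the joined context string with a regular expression works, but the query results already carry each URL in `metadata`. The following sketch, assuming results shaped like the ones printed earlier (`text` plus `metadata['url']`), builds the prompt from the text only and attaches de-duplicated source URLs afterwards. All function names are hypothetical, and the sample results are shortened stand-ins.

```python
def unique_urls(results):
    """Collect metadata URLs in order of first appearance, without duplicates."""
    seen = []
    for result in results:
        url = result["metadata"]["url"]
        if url not in seen:
            seen.append(url)
    return seen


def build_prompt(results, question):
    """Join the result texts into a context and wrap it in the instruction."""
    context = " ".join(result["text"] for result in results)
    return (
        "Answer the question based on the context below, "
        "and if the question can't be answered based on the context, "
        'say "I don\'t know"\n\n'
        f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )


def attach_sources(answer, results):
    """Append the source URLs to a model answer."""
    urls_text = " ".join(unique_urls(results))
    answer = answer.strip()
    return f"{answer}\n\nURLs: {urls_text}" if urls_text else answer


# Shortened stand-in results.
results = [
    {"text": "Diabetes Overview",
     "metadata": {"url": "https://www.cdc.gov/diabetes/library/reports/reportcard.html"}},
    {"text": "What is Diabetes?",
     "metadata": {"url": "https://www.cdc.gov/diabetes/basics/diabetes.html"}},
    {"text": "Diabetes is a chronic health condition.",
     "metadata": {"url": "https://www.cdc.gov/diabetes/basics/diabetes.html"}},
]

print(build_prompt(results, "What is diabetes?"))
print(attach_sources(" Diabetes is a chronic health condition.", results))
```

Keeping URLs out of the context also leaves more of the token budget for the actual text.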

In [79]:
context = "Diabetes Overview url: https://www.cdc.gov/diabetes/library/reports/reportcard.html What is Diabetes? Español (Spanish) Print Minus Related Pages With diabetes, your body either doesn't make enough insulin or can't use it as well as it should. Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease. url: https://www.cdc.gov/diabetes/basics/diabetes.html Type 2 Diabetes url: https://www.cdc.gov/diabetes/health-equity/diabetes-by-the-numbers.html"

question = "What is the diabetes?"

# Extract all URLs from the context
urls = re.findall(r"url:\s*(\S+)", context)  # also matches a URL at the very end of the context

response_a = openai.Completion.create(
    prompt=(
        "Answer the question based on the context below, append the URL address where the answer was found "
        "and if the question can't be answered based on the context, "
        f"say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    ),
    temperature=0,
    max_tokens=150,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None,
    model="text-davinci-003",
)

answer = response_a["choices"][0]["text"]

# Find the URL associated with the answer
url = None
for u in urls:
    if u in answer:
        url = u
        break

if url is not None:
    answer_with_url = f"{answer} (Source: {url})"
else:
    answer_with_url = answer


In [81]:
import re

context = "Diabetes Overview url: https://www.cdc.gov/diabetes/library/reports/reportcard.html What is Diabetes? Español (Spanish) Print Minus Related Pages With diabetes, your body either doesn't make enough insulin or can't use it as well as it should. Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease. url: https://www.cdc.gov/diabetes/basics/diabetes.html Type 2 Diabetes url: https://www.cdc.gov/diabetes/health-equity/diabetes-by-the-numbers.html"

question = "What is the diabetes?"

# Find the part of the context that precedes the first "url:" marker
match = re.search(r".+?(?=\surl:)", context)
source_text = match.group(0).strip() if match else None

response_a = openai.Completion.create(
    prompt=(
        "Answer the question based on the context below, append the source text where the answer was found "
        "and if the question can't be answered based on the context, "
        f"say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    ),
    temperature=0,
    max_tokens=150,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None,
    model="text-davinci-003",
)

answer = response_a["choices"][0]["text"]

if source_text is not None:
    answer_with_source = f"{answer} (Source: {source_text})"
else:
    answer_with_source = answer


In [72]:
print(f"Answer the question based on the context below, append the url address where the answer was found \
                  and if the question can't be answered based on the context, \
                      say \"I don't know\"\n\nContext: \
                          {line}\n\n---\n\nQuestion: {question}\nAnswer:")

Answer the question based on the context below, append the url address where the answer was found                   and if the question can't be answered based on the context,                       say "I don't know"

Context:                           Diabetes Overview url: https://www.cdc.gov/diabetes/library/reports/reportcard.html What is Diabetes?  Español (Spanish) Print Minus Related Pages With diabetes, your body either doesn't make enough insulin or can't use it as well as it should. Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or

In [67]:
answer = response_a["choices"][0]["text"].strip()
wrapped_lines = textwrap.wrap(answer, width=80)  # Adjust the width as needed
for line1 in wrapped_lines:
    print(line1)

Diabetes is a chronic (long-lasting) health condition that affects how your body
turns food into energy. Your body breaks down most of the food you eat into
sugar (glucose) and releases it into your bloodstream. When your blood sugar
goes up, it signals your pancreas to release insulin. Insulin acts like a key to
let the blood sugar into your body’s cells for use as energy. With diabetes,
your body doesn’t make enough insulin or can’t use it as well as it should. When
there isn’t enough insulin or cells stop responding to insulin, too much blood
sugar stays in your bloodstream. Over time, that can cause serious health
problems, such as heart disease, vision
