# Using ChatGPT with New Data Part 1

## About

This is part 1 of a series on using the ChatGPT and Embeddings APIs to answer questions with ChatGPT about data its model has not yet seen. Instead of fine-tuning the model, we use embeddings to find the relevant text and pass it to ChatGPT along with the question.

1. Generate embeddings for our own data with the OpenAI Embeddings API.
2. Store the embedding vectors, either in a CSV file or in a vector database. This example uses a CSV file.
3. Given a question we want to ask ChatGPT, generate an embedding for that question.
4. Compute the similarity between the question vector and each stored data vector. The most similar entries tell us which text to give ChatGPT so it can work out an answer.
5. Select the top N most similar texts such that, once tokenized, they fit within the token limit of a ChatGPT API call. The aim is to give ChatGPT as much relevant data as it can use to produce a valid answer.
6. Send the selected data and the question to ChatGPT and get back an answer.
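
Steps 4 and 5 can be sketched without calling any API. The following is a minimal illustration, not the notebook's implementation: `cosine_similarity` and `select_context` are hypothetical helper names, the 2-dimensional vectors stand in for real embeddings, and a whitespace word count stands in for a real tokenizer such as `tiktoken`.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_context(
    question_embedding: list[float],
    texts: list[str],
    embeddings: list[list[float]],
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Rank texts by similarity to the question, then keep the best ones that fit the budget."""
    ranked = sorted(
        zip(texts, embeddings),
        key=lambda pair: cosine_similarity(question_embedding, pair[1]),
        reverse=True,
    )
    selected: list[str] = []
    used = 0
    for text, _ in ranked:
        cost = count_tokens(text)
        if used + cost > token_budget:
            break  # stop at the first text that would exceed the budget
        selected.append(text)
        used += cost
    return selected


# Toy data (assumption): tiny vectors instead of real embeddings.
texts = ["HMO plans", "PPO plans", "Enrollment deadlines"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
question = [1.0, 0.1]

# Whitespace word count is a crude stand-in for a real tokenizer.
context = select_context(
    question, texts, vectors, token_budget=4, count_tokens=lambda t: len(t.split())
)
print(context)
```

With real data, `count_tokens` could be a tokenizer-based counter and the vectors would come from the Embeddings API. The greedy cutoff is one possible design choice; skipping an oversized text and continuing down the ranking is another.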

This Jupyter notebook focuses only on creating the embeddings for the data we want to use to answer questions. The example uses information scraped from https://www.healthforcalifornia.com/ to answer questions about health insurance in California.

In [18]:
from openai import OpenAI # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
import os # for getting API token from env variable OPENAI_API_KEY
from scipy import spatial  # for calculating vector similarities for search

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"
OPEN_AI_KEY = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key=OPEN_AI_KEY)

In [19]:
def get_query_embeddings(query: str) -> list[float]:
    """
    Call the OpenAI embeddings API and return the vector representing the embedding.
    
    :param query: The text we want to derive an embedding from
    :return: list[float] representing the embedding
    """
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    return query_embedding

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """
    Return the number of tokens in a string.
    This helps us know how much data we can pass to the ChatGPT APIs, since there are
    limits on the number of tokens that can be provided to ChatGPT.

    :param text: The text we want to count tokens for
    :param model: The model we will call, whose tokenizer is used for counting
    :return: int Number of tokens in the text for the given model
    """
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

In [20]:
files = ['aetna.txt',
         'anthem_blue_cross.txt',
         'blue_shield.txt',
         'bronze_option.txt',
         'cchp.txt',
         'deadlines.txt',
         'family_options.txt',
         'gold_option.txt',
         'government_discounts.txt',
         'health_net.txt',
         'hmo_vs_ppo.txt',
         'iehp.txt',
         'income_limits.txt',
         'individual_options.txt',
         'irs_1095_a_form.txt',
         'kaiser.txt',
         'la_care_health_plan.txt',
         'medi_cal_options.txt',
         'minimum_option.txt',
         'molina_health',
         'newborn_options.txt',
         'open_enrollment.txt',
         'platinum_option.txt',
         'preventative_care.txt',
         'qualifying_life_events.txt',
         'reporting_changes.txt',
         'self_employed_options.txt',
         'senior_options.txt',
         'sharp.txt',
         'should_switch_to_hmo.txt',
         'silver_70_option.txt',
         'silver_73_option.txt',
         'silver_87_option.txt',
         'silver_94_option.txt',
         'silver_option.txt',
         'small_business_options.txt',
         'special_enrollment.txt',
         'supplemental_options.txt',
         'travel_options.txt',
         'valley_health_plan.txt',
         'western_health_plan.txt'
         ]

In [21]:
file_contents = []
for filename in files:
    # Use a separate name for the file handle so it does not shadow the loop variable
    with open(f"../../datasets/covered_california_2024/{filename}", 'r') as f:
        content = f.read()
        file_contents.append(content)

In [22]:
tokens = 0
for content in file_contents:
    tokens += num_tokens(content, EMBEDDING_MODEL)
tokens

37348
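
The total above is summed across all files, but a limit on embedding input applies to each request separately, so a per-file check can be useful before calling the API. The sketch below is an illustration under stated assumptions: `find_oversized` is a hypothetical helper, the file names are made up, and a whitespace word count stands in for `num_tokens`. With real data, `max_tokens` would be the input limit documented for the chosen embedding model (commonly cited as 8191 tokens for `text-embedding-ada-002`, which should be verified against the current documentation).

```python
from typing import Callable


def find_oversized(
    documents: dict[str, str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> dict[str, int]:
    """Return the token count of every document that exceeds max_tokens."""
    counts = {name: count_tokens(text) for name, text in documents.items()}
    return {name: n for name, n in counts.items() if n > max_tokens}


# Hypothetical documents; the whitespace counter is a stand-in for a real tokenizer.
documents = {
    "short.txt": "one two three",
    "long.txt": "one two three four five six",
}
oversized = find_oversized(documents, max_tokens=4, count_tokens=lambda t: len(t.split()))
print(oversized)
```

Any file reported here would need to be split or trimmed before it is sent for embedding.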

In [23]:
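embeddings = []
for content in file_contents:
    # The API call is disabled in this run, so `embeddings` stays empty.
    # Uncomment the next line to generate an embedding per file.
    # embeddings.append(get_query_embeddings(content))
    continue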

In [24]:
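data = {
    'text': file_contents,
    'embeddings': embeddings
}
# Uncomment once `embeddings` is populated; pd.DataFrame raises a ValueError
# when the two columns have different lengths.
# df = pd.DataFrame(data)
# df.to_csv('data.csv', index=False)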
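
One detail to keep in mind when storing embeddings in a CSV: each vector is written as text, so it has to be parsed back into a list of floats when the file is loaded (with pandas, the `embeddings` column comes back as strings). The sketch below shows the round trip using only the standard library. It is an illustration rather than the notebook's method: the texts and vectors are made up, and an in-memory buffer stands in for `data.csv`.

```python
import csv
import io
import json

# Made-up rows standing in for file contents and their embeddings.
texts = ["first document", "second document"]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
for text, embedding in zip(texts, embeddings):
    # Serialize each vector as a JSON array so it can be parsed back reliably.
    writer.writerow([text, json.dumps(embedding)])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
# Each cell is read back as a string; json.loads turns it into a list of floats again.
loaded = [json.loads(row["embeddings"]) for row in rows]
print(loaded == embeddings)
```

When the CSV is written with `df.to_csv` instead, the cell holds the Python list's string form, and `ast.literal_eval` is a common way to parse it after `pd.read_csv`.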

In [25]:
len(embeddings)

0