# CodeLive OpenAI

In [23]:
import openai
import os
import json
import time
import numpy as np

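# Read the API key from a local JSON file; the context manager closes the file for us
with open("openai.json") as key_file:
    openai_key = json.load(key_file)["key"]
openai.api_key = openai_key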

def cosine_similarity(a, b):
    return np.dot(a, b)/(np.linalg.norm(a) * np.linalg.norm(b))

# Open our Files

We need to do a few things before we do anything fun with OpenAI.

1. Verify that all of our files are text (in our case they already are so we'll skip this)

2. Chunk our files into usable pieces
    - OpenAI has a limitation on embedding input size
    - We have a limit on context size in a prompt (we'll discuss what this means later)
    - We overlap neighboring chunks so that a passage cut by a chunk break is still likely to appear intact in one of them

In [33]:
files = os.listdir("./docs")
files

['intro-programming-model.adoc',
 'about-mule-message.adoc',
 'cryptography-xml.adoc',
 'tuning-test-validations.adoc',
 'configuring-properties.adoc',
 'cryptography-jce.adoc',
 'about-flows.adoc',
 'migration-core-poll.adoc',
 'cryptography-pgp.adoc',
 'package-a-mule-application.adoc',
 'mule-server-notifications.adoc',
 'for-each-scope-concept.adoc',
 'mule-runtime-updates.adoc',
 'batch-filters-and-batch-aggregator.adoc',
 'using-maven-with-mule.adoc',
 'common-dev-strategies.adoc',
 'streaming-about.adoc',
 'hardware-and-software-requirements.adoc',
 'dynamic-evaluate-component-reference.adoc',
 'test-mule-applications.adoc',
 'about-classloading-isolation.adoc',
 'reconnection-strategy-about.adoc',
 'logging-in-mule.adoc',
 'mule-upgrade-tool.adoc',
 'transform-preview-transformation-output-design-center-task.adoc',
 'migration-patterns-watermark.adoc',
 'parse-template-reference.adoc',
 'migration-core.adoc',
 'logger-component-reference.adoc',
 'business-events-in-components.a

In [32]:
chunks = []

CHUNK_SIZE = 2000
OVERLAP = 250

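for file_to_read in files:
    # Use a context manager so each file is closed after reading
    with open(f"./docs/{file_to_read}", "r") as f:
        contents = f.read()
    # Step by CHUNK_SIZE - OVERLAP so consecutive chunks share OVERLAP characters
    chunks += [contents[i:i+CHUNK_SIZE] for i in range(0, len(contents), CHUNK_SIZE - OVERLAP)]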
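
To see what the overlap does, here is a minimal sketch of the same slicing on a toy string with much smaller numbers. The helper name `chunk_text` and the sample string are made up for illustration and are not part of the notebook.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window starts after the previous one
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Toy input: 20 characters, windows of 8 that overlap by 3
demo_chunks = chunk_text("abcdefghijklmnopqrst", chunk_size=8, overlap=3)
print(demo_chunks)
# ['abcdefgh', 'fghijklm', 'klmnopqr', 'pqrst']
```

Each chunk starts with the last three characters of the previous one, so a short passage sitting on a boundary shows up whole in at least one chunk.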

# Embed our Chunked Files

We need to create an embedding for each of our chunks so that our chatbot can look up chunks by similarity. This requires calling OpenAI's Embeddings API and then storing the results.

1. Embed each chunk using text-embedding-ada-002
2. Store each chunk alongside its embedding so we can use them both later on

### Embeddings 101

1. Embeddings are like fingerprints for chunks of text
2. The "distance" between two "fingerprints" tells us how similar they are
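
The "distance" idea can be tried out without calling the API. The sketch below uses made-up 3-dimensional vectors (real embeddings are much longer) and a dependency-free helper, `cosine_similarity_plain`, which is a hypothetical stand-in for the NumPy `cosine_similarity` defined at the top of the notebook.

```python
import math


def cosine_similarity_plain(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "fingerprints", not real embeddings
base = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # points almost the same way as base
far = [0.0, 1.0, 0.0]    # perpendicular to base

sim_near = cosine_similarity_plain(base, near)
sim_far = cosine_similarity_plain(base, far)
print(sim_near, sim_far)
```

The vector that points in nearly the same direction scores close to 1, while the perpendicular one scores 0.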

In [36]:
chunk_1 = "Hello, my name is Joel!"
chunk_2 = "Hello, my name is Joe!"
chunk_3 = "This is a very different string from the other two."

embedding_1 = np.array(openai.Embedding.create(
    input=chunk_1,
    model="text-embedding-ada-002"
)['data'][0]['embedding'])

embedding_2 = np.array(openai.Embedding.create(
    input=chunk_2,
    model="text-embedding-ada-002"
)['data'][0]['embedding'])

embedding_3 = np.array(openai.Embedding.create(
    input=chunk_3,
    model="text-embedding-ada-002"
)['data'][0]['embedding'])

# text-embedding-ada-002 vectors are unit length, so the dot product equals the cosine similarity
np.dot(embedding_1, embedding_2), np.dot(embedding_1, embedding_3)

(0.9153652814142736, 0.7330231509785115)

In [None]:
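embeddings_dict = {}
rate_limit = 60  # pause for 60 seconds after every 60 requests

for i, chunk in enumerate(chunks, start=1):
    embedding = openai.Embedding.create(
        input=chunk,
        model="text-embedding-ada-002"
    )['data'][0]['embedding']

    # Keep the text next to its vector so both are available at lookup time
    embeddings_dict[i] = {
        "vector": embedding,
        "text": chunk
    }

    # Crude throttle to stay under the API rate limit
    if i % rate_limit == 0:
        time.sleep(rate_limit)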
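
As a rough sketch of how a dictionary shaped like `embeddings_dict` could later be searched, the example below ranks stored chunks by similarity to a query vector. The helper `top_k_chunks`, the toy vectors, and the toy texts are all assumptions for illustration. In the real flow the query vector would come from embedding the user's question with the same model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_chunks(query_vector, store, k=2):
    """Return the k (score, text) pairs from the store that are most similar to the query."""
    scored = [(cosine(query_vector, entry["vector"]), entry["text"]) for entry in store.values()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store with the same {"vector": ..., "text": ...} layout as embeddings_dict
toy_store = {
    1: {"vector": [1.0, 0.0, 0.0], "text": "about flows"},
    2: {"vector": [0.0, 1.0, 0.0], "text": "logging"},
    3: {"vector": [0.8, 0.6, 0.0], "text": "flows and logging"},
}

results = top_k_chunks([1.0, 0.1, 0.0], toy_store, k=2)
for score, text in results:
    print(round(score, 3), text)
```

A linear scan like this is fine for a few thousand chunks; a larger collection would likely call for a vector index instead.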

# Save our Embeddings

We need to save our embeddings to a file so we can load them in our API. We'll save them as raw JSON here, but you could use a pickle file or a compressed format instead. You'll just have to remember to load them from that format in the API as we build it out.

In [None]:
with open("embeddings-new.json", "w") as outfile:
    json.dump(embeddings_dict, outfile)
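
One detail to keep in mind when loading the file back: JSON object keys are always strings, so integer keys such as `1` come back as `"1"`. The sketch below shows the round trip on a tiny made-up dictionary written to a temporary file; the file name `embeddings-demo.json` and the sample data are illustrative only.

```python
import json
import os
import tempfile

# Tiny stand-in for embeddings_dict, using an integer key like the notebook does
toy_embeddings = {1: {"vector": [0.1, 0.2], "text": "first chunk"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings-demo.json")
    with open(path, "w") as outfile:
        json.dump(toy_embeddings, outfile)
    with open(path, "r") as infile:
        loaded = json.load(infile)

print(list(loaded.keys()))  # the integer key is now a string

# Convert the keys back to integers if the rest of the code expects them
restored = {int(key): value for key, value in loaded.items()}
```

If the API only iterates over the values, the key type does not matter and the conversion step can be skipped.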