### Internal language model POCs and OpenAI API demo

Prepared for Ventera brownbag

In [117]:
import os

import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

# Read the API key from the environment instead of hardcoding it in the notebook.
# TODO: Convert to txtField prompt before distributing
openai.api_key = os.environ["OPENAI_API_KEY"]


In [118]:
prompt = "What's the main idea behind MVP?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"The main idea behind MVP (Minimum Viable Product) is to create a product with the minimum amount of features necessary to test the product's viability in the market. This allows companies to quickly and cost-effectively test the market and gain feedback from customers before investing more resources into the product."

In [119]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: How many sprints in a PI?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

#### Prompt Engineering to increase domain 'truthiness'

In [120]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
A Minimum Viable Product (MVP) is a version of a working product that allows the team to learn from and 
interact with their customer with the least amount of effort. MVP attends to the core customer needs 
first and as soon as possible. It helps to validate needs, reduce risk, and help the programs course correct 
quickly, as needed. Rooted in concepts that emerged from the book “The Lean Startup” by Eric Ries, the core 
idea is to facilitate a better understanding of the customers needs and interests without committing or 
using a large number of resources or fully developing a product.

Q: What's the main idea behind MVP?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'The main idea behind MVP is to facilitate a better understanding of the customers needs and interests without committing or using a large number of resources or fully developing a product.'
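
The grounded prompt above follows a fixed pattern: an instruction, the context, then the question. Here is a minimal sketch of a helper that assembles it. The name `build_grounded_prompt` is hypothetical (it is not part of the notebook or the OpenAI library), and the function only builds the string without making an API call:

```python
HEADER = (
    "Answer the question as truthfully as possible using the provided text, "
    "and if the answer is not contained within the text below, say \"I don't know\""
)


def build_grounded_prompt(context: str, question: str, header: str = HEADER) -> str:
    """Assemble an instruction + context + question prompt (hypothetical helper)."""
    # Collapse newlines and repeated whitespace so the context reads as one paragraph.
    clean_context = " ".join(context.split())
    return f"{header}\n\nContext:\n{clean_context}\n\nQ: {question.strip()}\nA:"


example = build_grounded_prompt(
    "MVP attends to the core customer needs\nfirst and as soon as possible.",
    "What's the main idea behind MVP?",
)
print(example)
```

The returned string can be passed as the `prompt` argument in the same way as the hand-written prompt in the cell above.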

#### Preprocess data from vstart.dev

In [122]:
df = pd.read_csv('vstart.csv')
df = df.set_index(["vscontent"])
print(f"{len(df)} rows in the data.")
df.sample(5)

45 rows in the data.


Testing how easy a design is to use with a group of representative users. There are 2 parties involved in a usability test: the moderator or person facilitating the session and the participant or the person attempting to complete the tasks.
"One of the challenges of beginning a new program or project is determining the most beneficial Agile Framework to manage and deliver your products. Programs that arbitrarily select a framework or are directed into a specific framework often encounter challenges managing their products and delivery. When selecting an Agile framework, several considerations are necessary, such as knowing your client, understanding their pre-existing delivery and customer landscape, and identifying their long-term delivery vision and Objectives and Key Results (OKRs). Additionally, a plan to educate your Program and teams to ensure they operate within Agile-based concepts is essential."
Working Agreements (WA) are a set of rules established by the team to make themselves more efficient and self-managing. These ground rules define how a team will work and interact with each other in order to achieve team goals and be most productive. WA’s are also referred to as Team Norms or Social Contract. These rules are revisited often and updated by the team as team dynamics and environment evolve.
"Team onboarding is the process of introducing a new team member (whether recently hired or moving from another project) to the culture of your team. It is a time to set expectations on roles, responsibilities, and rules of engagement. In addition, it may involve walking through documentation and basic information about your project, demonstrating existing applications or work in progress, explaining the tools and artifacts that the team uses, and other activities that you deem necessary and relevant for the person onboarding to become a productive member of the team."
"In a professional context, a meeting is an assembly of two or more people by arrangement for discussion. Although many people dread having meetings, they are the cogs that make our organization work. The misconception is that meetings are a waste of time, despite the multiple benefits of having effective, well-planned meetings. The truth is, we’ve all experienced meetings that are disorganized, wasteful, and simply unproductive. We cannot avoid having meetings, but we can make them as efficient as possible to make them an effective communication and collaboration tool."


In [123]:
def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    """
    Use the OpenAI Embeddings API to create an embedding for the given text.
    """
    result = openai.Embedding.create(model=model, input=text)
    return result["data"][0]["embedding"]

def compute_doc_embeddings(df: pd.DataFrame) -> dict[str, list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps the vscontent value of each row to its embedding vector.
    """
    return {r['vscontent']: get_embedding(r['vscontent']) for _, r in df.iterrows()}

def load_embeddings(fname: str) -> dict[str, list[float]]:
    """
    Read the document embeddings from a CSV.

    fname is the path to a CSV with exactly these named columns: 
        "vscontent", "0", "1", ... up to the length of the embedding vectors.
    """
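    df = pd.read_csv(fname, header=0)
    # Embedding columns are named "0", "1", ...; max_dim is the largest of those names.
    max_dim = max(int(c) for c in df.columns if c != "vscontent")
    # r[0] is vscontent; r[1] through r[max_dim + 1] hold the embedding values.
    return {r[0]: [r[i] for i in range(1, max_dim + 2)] for r in df.itertuples(index=False)}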

df = pd.read_csv("vstart.csv")




In [124]:
document_embeddings = compute_doc_embeddings(df)

##### Let's save our embeddings in a file

In [125]:
def save_embeddings(embeddings: dict[str, list[float]], fname: str):
    df = pd.DataFrame.from_dict(embeddings, orient='index')
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'vscontent'}, inplace=True)
    df.to_csv(fname, index=False)


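import os

output_folder = "output"
output_fname = f"{output_folder}/embeddings.csv"
# Create the output folder if needed; to_csv does not create missing directories.
os.makedirs(output_folder, exist_ok=True)
save_embeddings(document_embeddings, output_fname)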

##### Let's view what an embedding looks like

In [126]:
first_text = list(document_embeddings.keys())[0]
first_embedding = list(document_embeddings.values())[0]
truncated_text = first_text[:50] + "..."
print("Example:", truncated_text, first_embedding[0])


Example: ADDIE Training Program. User training is an import... -0.012318002991378307


##### We can use this for search results as well

...by finding the document embeddings most similar to the question embedding. We're storing the embeddings locally, but for a larger dataset we would consider a vector DB to index them.

In [None]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query: str, contexts: dict[str, list[float]]) -> list[tuple[float, str]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
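
To see the ranking logic without calling the API, here is a minimal offline sketch. The three-dimensional vectors and document names are made up for illustration (real `text-embedding-ada-002` vectors have 1536 dimensions), and `normalize`, `dot`, `cosine`, and `top_k` are hypothetical helpers rather than notebook functions. It checks that the dot product and cosine similarity agree for unit-length vectors, then returns the best matches:

```python
import heapq
import math


def dot(x: list[float], y: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cosine(x: list[float], y: list[float]) -> float:
    """Cosine similarity, valid for vectors of any nonzero length."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[tuple[float, str]]:
    """Return the k (similarity, name) pairs with the highest dot product."""
    return heapq.nlargest(k, ((dot(query, vec), name) for name, vec in docs.items()))


# Made-up unit vectors standing in for real document embeddings.
toy_docs = {
    "roadmaps": normalize([0.9, 0.1, 0.0]),
    "retrospectives": normalize([0.1, 0.9, 0.1]),
    "onboarding": normalize([0.0, 0.2, 0.9]),
}
toy_query = normalize([1.0, 0.2, 0.0])

# For unit-length vectors the two measures coincide (up to floating-point error).
for vec in toy_docs.values():
    assert math.isclose(dot(toy_query, vec), cosine(toy_query, vec))

results = top_k(toy_query, toy_docs, k=2)
print(results)
```

`heapq.nlargest` avoids fully sorting every document when only the first few results are needed. For a collection this small, the plain `sorted` call used in the cell above is equally reasonable.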

##### Let's try a semantic search query

In [128]:
order_document_sections_by_query_similarity("tell me about roadmaps?", document_embeddings)[:5]

[(0.8140396875290642,
  'A customer journey map (also called a user journey map) is a visual story of your customer’s journey through every touchpoint of your brand, product, or service. It highlights the customer’s activities, goals, and experiences in relation to the touchpoints, including what they are thinking and feeling through each stage of their journey, as well as what the application, service, or company is doing in response.'),
 (0.8060616723231722,
  'User story mapping (USM) is a collaborative exercise that helps agile development teams define work by visually mapping the user stories using the user’s journey as the guide. It is a user-centric approach focused on actual user needs. The mapping process creates a workflow that replicates the real user journey and helps identify gaps, prioritize actions, and brings clarity to the team and stakeholders. Visually viewing the stages and features helps create a clear picture of what "must" be delivered each iteration, from the MV