Text embedding is the process of converting text into an n-dimensional vector.

In [2]:
import openai
import os

In [4]:
# read the API key from the OPENAI_API_KEY environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

Text Embedding Use cases:
- Search
    - Where results are ranked by relevance to a query string
- Clustering
    - Where text strings are grouped by similarity
- Recommendations
    - Where items with related text strings are recommended
- Anomaly detection
    - Where outliers with little relatedness are identified
- Diversity measurement
    - Where similarity distributions are analyzed
- Classification
    - Where text strings are classified by their most similar label


**Model hallucination**
A common issue with large language models is that they can sound confident in a response or completion and yet be factually incorrect.

This issue is called a **hallucination**. It can be difficult to detect because we usually don't know the correct answer to a query ahead of time.

In [5]:
prompt = "What does the start-up company Pentera do and who invented it?"

In [6]:
response = openai.Completion.create(
    model = 'text-davinci-003',
    prompt = prompt,
    temperature = 0,
    max_tokens = 512
)

In [7]:
print(response['choices'][0]['text'])



Pentera is a start-up company that provides software solutions for the financial services industry. It was founded by two entrepreneurs, David K. Williams and David J. Williams, in 2017. The company's software solutions are designed to help financial advisors and institutions better manage their clients' portfolios, automate processes, and improve compliance. Pentera's software also provides insights into the financial markets and helps advisors make more informed decisions.


The response above is a model hallucination. Let's try to correct it using prompt engineering.

In [8]:
prompt = """ Only answer the question below if you have 100% certainty of the facts.
Q: What does the start-up company Pentera do and who invented it?
A: """

In [9]:
response = openai.Completion.create(
    model = 'text-davinci-003',
    prompt = prompt,
    temperature = 0,
    max_tokens = 512
)
print(response['choices'][0]['text'])

 I do not have 100% certainty of the facts, so I cannot answer this question.


Giving the model context and requesting a summary

In [10]:
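# Note: this is a plain string, not an f-string, so the model receives the literal text "{context}".
prompt = """ Only answer the question below if you have 100% certainty of the facts.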
Context: {context}
Q: What does the start-up company Pentera do and who invented it?
A: """

In [11]:
response = openai.Completion.create(
    model = 'text-davinci-003',
    prompt = prompt,
    temperature = 0,
    max_tokens = 512
)
print(response['choices'][0]['text'])

 Pentera is a start-up company that provides software solutions for the financial services industry. It was founded by CEO and co-founder, David K. Williams.
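The `Context: {context}` line only helps once the placeholder is actually filled in. As a minimal sketch of that step, the helper below fills a template with `str.format`. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the context sentence is assembled by hand from the Pentera row of the dataset shown in this notebook.

```python
# Template with named placeholders; nothing is substituted until .format() is called.
PROMPT_TEMPLATE = """Only answer the question below if you have 100% certainty of the facts.
Context: {context}
Q: {question}
A: """


def build_prompt(context, question):
    """Fill the template with a context string and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


example_prompt = build_prompt(
    context="Pentera has headquarters in Petah Tikva in Israel and is in the field of Cybersecurity.",
    question="What does the start-up company Pentera do?",
)
print(example_prompt)
```

Keeping the template separate from the values makes it easy to check that no unfilled `{context}` placeholder is sent to the model.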


In [12]:
import pandas as pd

In [13]:
df = pd.read_csv("unicorns.csv")
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",


In [14]:
df['Investors'][0]

'["Next Play Ventures","Zeal Capital Partners","SoftBank Group"]'

In [15]:
import ast

In [16]:
for i in ast.literal_eval(df['Investors'][0]):
    print(i)

Next Play Ventures
Zeal Capital Partners
SoftBank Group


In [17]:
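def summary(company, crunchbase_url, city,
            country, industry, investor_list):
    """Build a short text summary of one company row."""
    investors = "The investors in the company are "

    # investor_list is a string such as '["Accel","14W"]'; parse it into a list.
    # Every name is followed by a comma, so the sentence ends with ",.".
    for investor in ast.literal_eval(investor_list):
        investors += f"{investor},"

    text = f"{company} has headquarters in {city} in {country} and is in the field of {industry}. {investors}. More info at {crunchbase_url}"

    return text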

In [18]:
df['summary'] = df.apply(lambda row: summary(row['Company'],
                                              row['Crunchbase Url'],
                                              row['City'],
                                              row['Country'],
                                              row['Industry'],
                                              row['Investors']), axis=1)

In [19]:
df['summary'][0]

'Esusu has headquarters in New York in United States and is in the field of Fintech. The investors in the company are Next Play Ventures,Zeal Capital Partners,SoftBank Group,. More info at https://www.cbinsights.com/company/esusu'
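The summary above ends its investor list with a stray comma (`SoftBank Group,.`). One possible variant, sketched here under the hypothetical name `summary_joined`, builds the list with `", ".join` instead. It is not the function used for the token counts and embeddings in the rest of this notebook, so those numbers would change slightly if it were swapped in.

```python
import ast


def summary_joined(company, crunchbase_url, city, country, industry, investor_list):
    """Variant of `summary` that joins investor names with ', ' (no trailing comma)."""
    # investor_list is a string like '["A","B"]'; parse it into a Python list first.
    investors = ", ".join(ast.literal_eval(investor_list))
    return (
        f"{company} has headquarters in {city} in {country} "
        f"and is in the field of {industry}. "
        f"The investors in the company are {investors}. "
        f"More info at {crunchbase_url}"
    )


example = summary_joined(
    "Esusu",
    "https://www.cbinsights.com/company/esusu",
    "New York",
    "United States",
    "Fintech",
    '["Next Play Ventures","Zeal Capital Partners","SoftBank Group"]',
)
print(example)
```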

Token count and pricing

In [20]:
import tiktoken

In [21]:
def num_of_tokens_from_string(string, encoding_name):
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [22]:
num_of_tokens_from_string(df['summary'][0], encoding_name='cl100k_base')

52

In [23]:
df['token_count'] = df['summary'].apply(lambda text: num_of_tokens_from_string(text,"cl100k_base"))

In [24]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,52
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,53
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,51
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,56
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,53


In [25]:
df['token_count'].sum()

63363

In [26]:
# Estimate the embedding cost at $0.0004 (USD) per 1,000 tokens
df['token_count'].sum() * 0.0004 / 1000

0.025345200000000002
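The same arithmetic can be wrapped in a small helper so the price is an explicit parameter. This is only a sketch: the function name `estimate_embedding_cost` is hypothetical, and the default price is the figure quoted in the cell above, which may not match current pricing.

```python
def estimate_embedding_cost(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimated cost in USD for embedding `total_tokens` tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


# Total token count taken from the cell above.
cost = estimate_embedding_cost(63363)
print(f"${cost:.4f}")
```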

In [27]:
def get_embedding(text):
    # Assumes openai.api_key has already been set.
    result = openai.Embedding.create(
        model = 'text-embedding-ada-002',
        input = text
    )
    return result['data'][0]['embedding']

In [28]:
df['summary'][0]

'Esusu has headquarters in New York in United States and is in the field of Fintech. The investors in the company are Next Play Ventures,Zeal Capital Partners,SoftBank Group,. More info at https://www.cbinsights.com/company/esusu'

In [29]:
# Embed the first summary; equivalent to passing the summary string literal directly.
vector = get_embedding(df['summary'][0])

In [30]:
len(vector)

1536

In [None]:
df['embedding'] = df['summary'].apply(get_embedding)

In [None]:
df.to_csv('unicorns_with_embeddings.csv',index=False)

In [48]:
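# The CSV stores each embedding as text, so parse it back into a list of floats on load.
df = pd.read_csv("unicorns_with_embeddings.csv", converters={"embedding": ast.literal_eval})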
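Because CSV is a text format, a list-valued cell such as an embedding is written out as its string representation and comes back as a string. The sketch below illustrates this with the standard-library `csv` module and a made-up three-dimensional vector (pandas' `to_csv` and `read_csv` treat list cells the same way). `ast.literal_eval` turns the text back into a list so that it can be used in numeric code.

```python
import ast
import csv
import io

# Write one row containing a list, the way an embedding column would be saved.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["company", "embedding"])
writer.writerow(["Esusu", [0.1, -0.2, 0.3]])  # toy vector, not a real embedding

# Read it back: every CSV field comes back as text.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_value = rows[0]["embedding"]
print(type(raw_value).__name__, raw_value)

# Parse the text back into a list of floats before doing any math on it.
parsed = ast.literal_eval(raw_value)
print(type(parsed).__name__, parsed)
```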

In [33]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embedding
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,58,"[0.01195491198450327, -0.017717931419610977, -..."
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,60,"[0.009171437472105026, 0.01314949057996273, -0..."
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,57,"[0.002730059437453747, -0.03737899661064148, 0..."
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,62,"[-0.0024771858006715775, -0.024587858468294144..."
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58,"[0.011331121437251568, -0.011193273589015007, ..."


Document Similarity and Context Injection

We'll do the following:
- Embed the query string as a vector
- Compute the **cosine** similarity between the query vector and *all* of our document vectors
- Choose the most similar document and inject it into the prompt as context
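The steps above can be sketched without any API calls. The two-dimensional vectors below are made-up, unit-length stand-ins for real embeddings, and the document names are placeholders for the summaries; only the shape of the procedure is meant to carry over.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical unit-length toy vectors standing in for document embeddings.
documents = {
    "Esusu summary": [1.0, 0.0],
    "Pentera summary": [0.0, 1.0],
    "Minio summary": [0.6, 0.8],
}

# Step 1: the embedded query (also unit length: 0.28**2 + 0.96**2 == 1).
query_vector = [0.28, 0.96]

# Step 2: score every document against the query.
scores = {name: dot(vec, query_vector) for name, vec in documents.items()}

# Step 3: pick the best match; its text would be injected as the context.
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```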

In [34]:
prompt = "What does the company Pentera do and who are the investors?"

In [35]:
prompt_embedding = get_embedding(prompt)

In [41]:
# prompt_embedding

In [37]:
import numpy as np

In [45]:
def vector_similarity(vec1,vec2):
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, 
    the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(vec1), np.array(vec2))
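The docstring's claim holds because cosine similarity is the dot product divided by the product of the two vector lengths; when both lengths are 1, the division changes nothing. A small dependency-free check with arbitrary toy vectors (not real embeddings) shows the difference between raw and normalized vectors:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to length 1."""
    length = norm(a)
    return [x / length for x in a]


# Arbitrary toy vectors that are NOT unit length (both have length 5).
u = [3.0, 4.0]
v = [4.0, 3.0]

# For raw vectors the dot product and the cosine similarity differ.
raw_dot = dot(u, v)
raw_cosine = cosine_similarity(u, v)

# After scaling both to length 1, the dot product equals the cosine similarity.
unit_dot = dot(normalize(u), normalize(v))
print(raw_dot, raw_cosine, unit_dot)
```

If embeddings were not normalized, the plain dot product would also reward longer vectors, so the explicit division would be needed.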

In [None]:
df["prompt_similarity"] = df['embedding'].apply(
                            lambda vector: vector_similarity(vector,
                                                             prompt_embedding))

In [None]:
df['prompt_similarity']

In [None]:
df.nlargest(1,'prompt_similarity')

In [None]:
df.nlargest(1,'prompt_similarity').iloc[0]['summary']

## Question Answering with Embeddings

In [None]:
context = df.nlargest(1,'prompt_similarity').iloc[0]['summary']

prompt = f""" Only answer the question below if you have 100% certainty of the facts.
Context: {context}
Q: What does the start-up company Pentera do and who invented it?
A: """


In [None]:
# This prompt is much stronger, and it required no fine-tuning (which saves on cost):
# the context was retrieved using text embeddings.
print(prompt)

In [None]:

response = openai.Completion.create(
    model = 'text-davinci-003',
    prompt = prompt,
    temperature = 0,
    max_tokens = 512
)
print(response['choices'][0]['text'])

In [None]:
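def embed_prompt_lookup():
    """Ask for a question, retrieve the most similar summary, and answer using it as context."""
    question = input("What question do you have about a Unicorn start-up? ")

    # Embed the question and score it against every stored document embedding.
    prompt_embedding = get_embedding(question)

    df['prompt_similarity'] = df['embedding'].apply(
        lambda vector: vector_similarity(vector, prompt_embedding))
    # Use the single most similar summary as the context.
    context = df.nlargest(1,'prompt_similarity').iloc[0]['summary']

    prompt = f""" Only answer the question below if you have 100% certainty of the facts.
            Context: {context}
            Q: {question}?
            A: """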
    
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt = prompt,
        temperature = 0,
        max_tokens = 512
    )
    print(response['choices'][0]['text'].strip(" \n"))



In [None]:
embed_prompt_lookup()

Momenta is a company in the field of Artificial Intelligence with headquarters in Beijing, China.
