<center><a href="https://www.pieriantraining.com/" ><img src="../PTCenteredPurple.png" alt="Pierian Training Logo" /></a></center>


# Text Embedding - Semantic Search and Context Injection

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("companies.csv")
df.head()

In [8]:
from vertexai.language_models import TextGenerationModel
llm = TextGenerationModel.from_pretrained('text-bison')
# Hallucinates Some Information
llm.predict(prompt="Tell me about the Minio company and where its located",
           temperature=0,max_output_tokens=2048)

In [50]:
# Still hallucinates some information, even when told to answer only if certain
llm.predict(prompt="Tell me about the Minio company and where its located. Only answer if you are 100% certain of the facts.",
           temperature=0,max_output_tokens=2048)

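import ast

# Ground truth from the dataset (row 2 is Minio), for comparison with the model's answer
df.iloc[2]
df.iloc[2]['City']
# ast.literal_eval parses the stringified list without executing arbitrary code, unlike eval
ast.literal_eval(df.iloc[2]['Investors'])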

 MinIO, Inc. is a privately held, venture-backed software company developing an open source object storage platform. Founded in 2014, the company is headquartered in San Francisco with offices in Seattle and Bangalore. 

MinIO's flagship product, MinIO, is an open source, high-performance object storage platform that is designed to be compatible with Amazon S3. MinIO is used by over 50,000 organizations worldwide, including Fortune 500 companies and government agencies.

In 2020, MinIO raised $20 million in Series A funding led by General Catalyst. The company plans to use the funding to expand its sales and marketing efforts, as well as to continue developing its product.

In [34]:
def create_context(row):
    
    company_name = row['Company']
    city = row['City']
    country = row['Country']
    investors = row['Investors']
    industry = row['Industry']
    
    context = f"""The company {company_name} was located in {city},{country} in 2022. It works in the {industry} industry and has these investors: {investors}"""
    
    return context

In [37]:
context = create_context(df.iloc[2])
context

In [52]:
llm.predict(prompt=f"Tell me about the Minio company, here is some extra info:\n{context}",
           temperature=0,max_output_tokens=2048)

 Minio is a Palo Alto-based software company that provides data management and analytics solutions. It was founded in 2014 by Anand Babu Periasamy, Gokulnath Dharmaraj, and Jiten Vaidya. The company's flagship product is Minio, an open source object storage platform. Minio also offers a number of other products and services, including Minio Cloud, a cloud-based object storage service, and Minio Edge, an on-premises object storage appliance.

Minio is backed by a number of investors, including General Catalyst, Nexus Venture Partners, and Dell Technologies Capital. The company has raised a total of $43 million in funding.

In 2022, Minio was named a Gartner Cool Vendor in Cloud Storage. The company was also recognized as a Red Herring Top 100 Global Winner.

Minio is a rapidly growing company with a strong team of experienced professionals. The company is well-positioned to continue to grow and succeed in the data management and analytics market.

----

## Use Text Embedding for Semantic Similarity Search and Automatic Context Injection

In [43]:
from vertexai.language_models import TextEmbeddingModel

In [44]:
# Model Info Here: https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings
embedder = TextEmbeddingModel.from_pretrained('textembedding-gecko@001')
embeddings = embedder.get_embeddings(['Is there life on Mars?'])
len(embeddings[0].values)

### Creating a General Text Column

In [53]:
import ast 
def summary(company,crunchbase_url,city,country,industry,investor_list):
    investors = 'The investors in the company are'
     
    for investor in ast.literal_eval(investor_list):
        investors += f" {investor}, "

    text = f"{company} has headquarters in {city} in {country} and is in the field of {industry}. {investors}. You can find more information at {crunchbase_url}"

    return text 
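
The loop above leaves double spaces and a trailing comma in the investor sentence, which shows up in the summary text printed later in this notebook. Below is a minimal sketch of an alternative, assuming the `Investors` column always holds the string form of a Python list. `format_investors` is a hypothetical helper name, not part of the original notebook.

```python
import ast


def format_investors(investor_list: str) -> str:
    """Turn a stringified list such as '["Accel","14W"]' into one sentence."""
    # literal_eval only parses Python literals; it does not run code like eval would
    names = ast.literal_eval(investor_list)
    # join() places separators only between names, so there is no trailing comma
    return "The investors in the company are " + ", ".join(names) + "."


print(format_investors('["Accel","14W","GS Growth"]'))
```

Swapping this into `summary` would change the summary strings, and therefore the embeddings computed from them, so the numbers shown in the rest of the notebook would differ slightly.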

In [54]:
# Build one summary string per row (axis=1 passes each row to the lambda)
df['summary'] = df.apply(lambda row: summary(row['Company'],row['Crunchbase Url'],row['City'],row['Country'],row['Industry'],row['Investors']),axis=1)
df['summary'][0]

In [56]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...


In [57]:
df['summary']

0       Esusu has headquarters in New York in United S...
1       Fever Labs has headquarters in New York in Uni...
2       Minio has headquarters in Palo Alto in United ...
3       Darwinbox has headquarters in Hyderabad in Ind...
4       Pentera has headquarters in Petah Tikva in Isr...
                              ...                        
1194    Fanatics has headquarters in Jacksonville in U...
1195    SpaceX has headquarters in Hawthorne in United...
1196    Vice Media has headquarters in Brooklyn in Uni...
1197    Klarna has headquarters in Stockholm in Sweden...
1198    Veepee has headquarters in La Plaine Saint-Den...
Name: summary, Length: 1199, dtype: object

In [63]:
def get_summary_embedding(summary):
    return embedder.get_embeddings([summary])[0].values

In [68]:
get_summary_embedding(df['summary'][0])
# Alternatively, pass in everything as a list using: df['summary'].tolist()
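
The comment above mentions passing a whole list instead of one summary at a time. Here is a hedged sketch of how that batching could look. The batch size of 5 is an assumption, so check the documented per-request limit for your model version. `embed_all`, `batched`, and `fake_embed_batch` are hypothetical names. The stand-in embedder only exists so the sketch runs without any API access.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 5,  # assumed limit, not taken from the notebook
) -> List[List[float]]:
    """Embed every text, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(list(batch)))
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Stand-in for a real model: [text length, number of spaces]
    return [[float(len(t)), float(t.count(" "))] for t in batch]


texts = [f"company {i}" for i in range(12)]
vectors = embed_all(texts, fake_embed_batch)
print(len(vectors))
```

In the notebook, `embed_batch` could plausibly be `lambda batch: [e.values for e in embedder.get_embeddings(batch)]`, and `texts` could be `df['summary'].tolist()`.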

In [70]:
# Might take a few minutes depending on the number of rows you have!
# Let's limit our dataframe to just 10 rows to see if everything works

In [71]:
# .copy() makes new_df an independent DataFrame, so adding columns to it does not raise SettingWithCopyWarning
new_df = df.head(10).copy()

In [73]:
# new_df

In [74]:
new_df['embeddings'] = new_df['summary'].apply(get_summary_embedding)


In [75]:
new_df

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,embeddings
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,"[-0.030785107985138893, 0.00467604398727417, -..."
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,"[0.03269707411527634, -0.01659333147108555, -0..."
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,"[0.014637074433267117, -0.036619797348976135, ..."
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,"[0.016650568693876266, -0.02540149912238121, -..."
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,"[-0.02143833413720131, -0.019953027367591858, ..."
5,"10/31/2022, 2:37:04 AM",Placer.ai,https://www.cbinsights.com/company/placerai,1.0,1/12/2022,2022,Los Altos,United States,Artificial intelligence,"[""Fifth Wall Ventures"",""JBV Capital"",""Array Ve...",,Placer.ai has headquarters in Los Altos in Uni...,"[-0.006580262910574675, -0.012497365474700928,..."
6,"10/31/2022, 2:36:06 AM",CAIS,https://www.cbinsights.com/company/cais,1.1,1/11/2022,2022,New York,United States,Fintech,"[""Franklin Templeton"",""Motive Partners. Apollo...",,CAIS has headquarters in New York in United St...,"[-0.004145875573158264, -0.01726820319890976, ..."
7,"10/31/2022, 2:36:04 AM",Chief,https://www.cbinsights.com/company/chief-1,1.1,3/31/2022,2022,New York,United States,Other,"[""General Catalyst"",""Inspired Capital"",""Flybri...",,Chief has headquarters in New York in United S...,"[-0.006407602224498987, -0.024333525449037552,..."
8,"10/31/2022, 2:34:41 AM",Ankorstore,https://www.cbinsights.com/company/ankorstore,1.98,1/10/2022,2022,Paris,France,E-commerce & direct-to-consumer,"[""Global Founders Capital"",""Aglae Ventures"",""A...",,Ankorstore has headquarters in Paris in France...,"[-0.024317169561982155, -0.007583362981677055,..."
9,"10/31/2022, 2:37:16 AM",Unstoppable Domains,https://www.cbinsights.com/company/unstoppable...,1.0,7/27/2022,2022,Las Vegas,United States,Internet software & services,"[""Boost VC"",""Draper Associates"",""Gaingels""]",,Unstoppable Domains has headquarters in Las Ve...,"[-0.01826212927699089, -0.04303402453660965, -..."


-----
### Similarity Search with Cosine Similarity

In [99]:
import numpy as np
# For larger collections of vectors there are dedicated services;
# take a look at Google Cloud's own vector search/store options.
def vector_similarity(A, B):
    vec1 = np.array(A)
    vec2 = np.array(B)
    # Because the embeddings are normalized to length 1, the cosine similarity equals the dot product.
    return np.dot(vec1, vec2)
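
The comment in `vector_similarity` relies on the embeddings having unit length. The sketch below checks that identity on toy vectors and shows the general formula to fall back on when vectors are not normalized. It uses plain Python rather than NumPy to stay dependency-free, and the helper names are illustrative only.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """General cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

# Both values are approximately 0.96 (24 / 25)
print(cosine_similarity(a, b))
print(dot(normalize(a), normalize(b)))
```

If you are unsure whether a given embedding model returns unit-length vectors, comparing `dot(v, v)` against 1.0 for a few embeddings is a quick check.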

In [100]:
prompt = "Tell me about the Minio company"

In [101]:
prompt_embedding = get_summary_embedding(prompt)

In [102]:
# prompt_embedding

In [103]:
new_df['prompt_similarity'] = new_df['embeddings'].apply(lambda vector: vector_similarity(vector,prompt_embedding))


In [104]:
new_df['prompt_similarity']

0    0.575207
1    0.577183
2    0.762513
3    0.615192
4    0.584385
5    0.601514
6    0.638872
7    0.602261
8    0.581121
9    0.623930
Name: prompt_similarity, dtype: float64

In [105]:
new_df.sort_values('prompt_similarity',ascending=False)

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,embeddings,prompt_similarity
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,"[0.014637074433267117, -0.036619797348976135, ...",0.762513
6,"10/31/2022, 2:36:06 AM",CAIS,https://www.cbinsights.com/company/cais,1.1,1/11/2022,2022,New York,United States,Fintech,"[""Franklin Templeton"",""Motive Partners. Apollo...",,CAIS has headquarters in New York in United St...,"[-0.004145875573158264, -0.01726820319890976, ...",0.638872
9,"10/31/2022, 2:37:16 AM",Unstoppable Domains,https://www.cbinsights.com/company/unstoppable...,1.0,7/27/2022,2022,Las Vegas,United States,Internet software & services,"[""Boost VC"",""Draper Associates"",""Gaingels""]",,Unstoppable Domains has headquarters in Las Ve...,"[-0.01826212927699089, -0.04303402453660965, -...",0.62393
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,"[0.016650568693876266, -0.02540149912238121, -...",0.615192
7,"10/31/2022, 2:36:04 AM",Chief,https://www.cbinsights.com/company/chief-1,1.1,3/31/2022,2022,New York,United States,Other,"[""General Catalyst"",""Inspired Capital"",""Flybri...",,Chief has headquarters in New York in United S...,"[-0.006407602224498987, -0.024333525449037552,...",0.602261
5,"10/31/2022, 2:37:04 AM",Placer.ai,https://www.cbinsights.com/company/placerai,1.0,1/12/2022,2022,Los Altos,United States,Artificial intelligence,"[""Fifth Wall Ventures"",""JBV Capital"",""Array Ve...",,Placer.ai has headquarters in Los Altos in Uni...,"[-0.006580262910574675, -0.012497365474700928,...",0.601514
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,"[-0.02143833413720131, -0.019953027367591858, ...",0.584385
8,"10/31/2022, 2:34:41 AM",Ankorstore,https://www.cbinsights.com/company/ankorstore,1.98,1/10/2022,2022,Paris,France,E-commerce & direct-to-consumer,"[""Global Founders Capital"",""Aglae Ventures"",""A...",,Ankorstore has headquarters in Paris in France...,"[-0.024317169561982155, -0.007583362981677055,...",0.581121
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,"[0.03269707411527634, -0.01659333147108555, -0...",0.577183
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,"[-0.030785107985138893, 0.00467604398727417, -...",0.575207


### Get the Most Similar Summary

In [106]:
new_df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

'Minio has headquarters in Palo Alto in United States and is in the field of Data management & analytics. The investors in the company are General Catalyst,  Nexus Venture Partners,  Dell Technologies Capital, . You can find more information at https://www.cbinsights.com/company/minio'
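
Note that `nlargest(1, ...)` always returns a row, even when the question is about a company that is not in the table. In the similarity scores above, unrelated companies still land around 0.58 to 0.64 against 0.76 for the true match. One possible refinement, sketched here on toy two-dimensional vectors, is to return the top `k` rows and drop anything below a cutoff. `top_k` is a hypothetical helper, and the 0.7 threshold is an arbitrary assumption that would need tuning on real data.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[float, str]]:
    """Return up to k (score, text) pairs with score >= min_score, best first."""
    scored = [(dot(query_vec, vec), text) for text, vec in rows]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:k]


# Toy unit-length vectors standing in for real embeddings
rows = [
    ("Minio summary", [1.0, 0.0]),
    ("CAIS summary", [0.6, 0.8]),
    ("Esusu summary", [0.0, 1.0]),
]
query = [1.0, 0.0]

print(top_k(query, rows, k=2))
print(top_k(query, rows, k=2, min_score=0.7))
```

An empty result then signals that no context should be injected, rather than injecting the least-bad match.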

### Put it All Together

In [111]:
def embed_prompt_lookup():
    # Initial question
    question = input("What question do you have about a Unicorn company? ")
    
    # Get embedding
    prompt_embedding = get_summary_embedding(question)
    
    # Get prompt similarity with embeddings
    # Note how this will overwrite the prompt similarity column each time!
    new_df["prompt_similarity"] = new_df['embeddings'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

    # get most similar summary
    context = new_df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

    prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
            Here is some context:
            {context}
            Q: {question}
            A:"""


    # Even with injected context, the model can still add details that are not in the context
    results = llm.predict(prompt=prompt,
               temperature=0,max_output_tokens=2048)
    
    print(results.text)

In [112]:
embed_prompt_lookup()

What question do you have about a Unicorn company?  Tell me about Minio



 Minio is a software company founded in 2014 and headquartered in Palo Alto, California. 

The company develops and sells object storage software that is used to store and manage large amounts of data. 

Minio's software is used by businesses of all sizes, including some of the world's largest companies. 

The company has raised over $100 million in funding from investors such as General Catalyst, Nexus Venture Partners, and Dell Technologies Capital.
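
The answer above still contains a founding year and a funding total that do not appear in the injected summary, so context injection alone does not eliminate hallucination. The sketch below separates retrieval from generation so the retrieval step can be tested without calling any API. The names `build_prompt` and `answer`, the stub functions, and the reworded instruction are all assumptions for illustration. The wording has not been verified to reduce hallucination.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(question: str, context: str) -> str:
    """Assemble the prompt without the stray indentation of a nested f-string."""
    return (
        "Use only the context below to answer. "
        "If the context does not contain the answer, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Q: {question}\n"
        "A:"
    )


def answer(
    question: str,
    rows: Sequence[Tuple[str, Sequence[float]]],
    embed: Callable[[str], List[float]],
    generate: Callable[[str], str],
) -> str:
    """Embed the question, pick the most similar summary, and query the model."""
    q_vec = embed(question)
    best_text, _ = max(
        rows, key=lambda row: sum(q * x for q, x in zip(q_vec, row[1]))
    )
    return generate(build_prompt(question, best_text))


# Stubs so the sketch runs offline: a keyword "embedder" and an echoing "model"
def stub_embed(text: str) -> List[float]:
    return [1.0, 0.0] if "minio" in text.lower() else [0.0, 1.0]


def stub_generate(prompt: str) -> str:
    return prompt


rows = [
    ("Minio has headquarters in Palo Alto.", [1.0, 0.0]),
    ("Esusu has headquarters in New York.", [0.0, 1.0]),
]
result = answer("Tell me about Minio", rows, stub_embed, stub_generate)
print(result)
```

In the notebook, `embed` could be `get_summary_embedding` and `generate` could wrap the existing call as `lambda p: llm.predict(prompt=p, temperature=0, max_output_tokens=2048).text`.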
