I'm currently using this notebook to turn the CSV data (exported from a Google Sheet) into two outputs: (1) a new CSV with an embedding vector added for each row, computed from the 'description' column; (2) a JSON blob to be used elsewhere. I just hit "run all cells" to do this. Eventually I'll move the key steps into a Python script, but for now I find it useful to inspect the outputs of each step because it's all changing so much.

In [25]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [26]:
import pandas as pd

# Load the winners CSV (exported from the Google Sheet)
df = pd.read_csv('ev-winners.csv')

# This will display the first few rows of the DataFrame
print(df.head())


   id                                               name batch date_announced  \
0   1                                  Tymofiy Mylovanov     1     2018-11-07   
1   2                                        Topos House     1     2018-11-07   
2   3                      18-year-old economics prodigy     1     2018-11-07   
3   4  Mark Lutter and his Center for Innovative Gove...     1     2018-11-07   
4   5                                     Harshita Arora     1     2018-11-07   

                                                link  \
0  marginalrevolution.com/marginalrevolution/2018...   
1  marginalrevolution.com/marginalrevolution/2018...   
2  marginalrevolution.com/marginalrevolution/2018...   
3  marginalrevolution.com/marginalrevolution/2018...   
4  marginalrevolution.com/marginalrevolution/2018...   

                                         description              type  \
0  Anonymous grant for writing in Eastern Europe....  Writing (Online)   
1  Pledged grant to San Fran

In [27]:
names = df['name'].to_numpy()
descriptions = df['description'].to_numpy()
# generate embeddings of the descriptions
embeddings = model.encode(descriptions)
print(embeddings)

[[-8.9191841e-03  8.1923462e-02  8.2692392e-03 ... -8.1706204e-04
  -5.8404645e-03 -2.4727663e-02]
 [ 3.4100588e-02  1.2999950e-01 -1.6341692e-02 ...  2.1348594e-02
   4.1852508e-02  3.3572470e-03]
 [-3.3777144e-02  1.5727529e-01 -2.0336600e-02 ...  1.1649110e-02
   2.6833998e-02 -9.8458026e-03]
 ...
 [ 6.4019270e-02 -1.0361400e-02 -8.6139077e-03 ... -1.8746321e-04
  -2.7924933e-02  7.0708070e-04]
 [ 3.4342300e-02 -4.6487272e-02  3.1901367e-05 ... -2.8234329e-02
   2.5776003e-02 -3.5039790e-02]
 [ 4.9569044e-02  2.0383047e-02  3.1445692e-03 ... -2.9717339e-02
   2.2790560e-02 -2.9443244e-02]]


In [28]:
# Write new dataset

df['embedding_description'] = embeddings.tolist()
output_csv_path = 'ev-winners-with-embeddings.csv'
df.to_csv(output_csv_path, index=False)
print(f"DataFrame with embeddings has been saved to '{output_csv_path}'")

DataFrame with embeddings has been saved to 'ev-winners-with-embeddings.csv'
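
One thing to watch with this CSV: a cell that holds a Python list is written out as its string form, so the embedding column comes back as text when the file is read again. The sketch below shows that round trip on toy data and parses the strings back with `json.loads`. It makes some assumptions: it uses the standard-library `csv` module instead of pandas so it runs without any installs (pandas' `to_csv` stringifies list cells in the same way), and the three-number vectors are made up.

```python
import csv
import io
import json

# Toy rows standing in for the real data; the vectors are made-up values.
rows = [
    {"name": "a", "embedding_description": [0.1, 0.2, 0.3]},
    {"name": "b", "embedding_description": [0.4, 0.5, 0.6]},
]

# Write to an in-memory CSV; each list cell is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "embedding_description"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a string, including the embedding.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_cell = loaded[0]["embedding_description"]

# A list of floats prints as valid JSON, so json.loads recovers the numbers.
restored = [json.loads(row["embedding_description"]) for row in loaded]
```

With pandas, the same parse could be applied after `read_csv`, for example with `df2['embedding_description'].apply(json.loads)`.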


In [29]:
def get_ev_winners(query):

    # Embed the query once; util.cos_sim compares it against every description embedding
    query_embed = model.encode(query)

    # The result has shape (len(embeddings), 1); take the single column as plain floats
    scores = util.cos_sim(embeddings, query_embed)[:, 0].tolist()

    # Pair each score with its description and name (every row, including the last one)
    all_sentence_combinations = [
        [score, description, name]
        for score, description, name in zip(scores, descriptions, names)
    ]

    # Sort by cosine similarity, highest first
    return sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)
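
To make the ranking step concrete without loading the model, here is a minimal sketch in plain Python on made-up 2-D vectors. The names `cosine` and `rank_by_similarity` and the toy data are illustrative and not part of this notebook; `util.cos_sim` computes the same quantity in a vectorized way on the real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, item_vecs, item_names):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), name) for vec, name in zip(item_vecs, item_names)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy vectors: the query points mostly "east", so "east" should rank first.
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_names = ["east", "north", "northeast"]
ranking = rank_by_similarity([1.0, 0.1], toy_vecs, toy_names)
ranked_names = [name for _, name in ranking]
```

Every item gets a score and the sort covers the full list, so slicing the result afterwards gives the top matches.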

In [30]:
def search_ev_winners(query, number):
    print(f"Top {number} matches for query: {query}\n")
    for index, item in enumerate(get_ev_winners(query)[:number], start=1):
        print(f"{index}. {item[2]}: {item[1]}")

search_ev_winners("book", 10)

Top 10 matches for query: book

1. Andy Matuschak: San Francisco, to support his project to reexamine and fundamentally improve the book as a method for learning and absorbing ideas, Twitter here. Here is his essay on why books do not work.
2. Kathleen Harward: Kathleen Harward, to write and market a series of children’s books based on classical liberal values.
3. Jeremy Stern: Glendale, CA, Tablet magazine. To write a book.
4. Sonja Trauss: Sonja Trauss of YIMBY, assistance to publish Nicholas Barbon, A Defence of the Builder.
5. Marc Sidwell: Marc Sidwell of the United Kingdom, to write a book on common sense.
6. Jasmine Wang and team (Jasmine is a repeat winner): Trellis, AI and the book.
7. Jeffrey C. Huber: Jeffrey C. Huber, to write a book on tech and economic progress from a Christian point of view.
8. Joe Francis: Joe Francis, a farmer in Wales, to write a book on the economic and historical import of slavery in the American republic.
9. Cynthia Haven: Cynthia Haven, Stanford U

In [31]:
# JSON blob
# Serialize the in-memory DataFrame instead of re-reading the CSV: a CSV round trip
# turns each embedding list into a string, so the JSON would carry strings, not arrays.
json_blob = df.to_json(orient='records', lines=True)
output_json_path = 'ev-winners-with-embeddings.json'
with open(output_json_path, 'w') as json_file:
    json_file.write(json_blob)
print(f"JSON blob has been saved to '{output_json_path}'")

JSON blob has been saved to 'ev-winners-with-embeddings.json'
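
Because `lines=True` writes one JSON object per line (JSON Lines) rather than a single JSON array, whatever consumes the file needs to read it line by line. A minimal sketch using only the standard library, with a made-up two-record string standing in for the file contents (`parse_json_lines` is a hypothetical helper name, not something defined in this notebook):

```python
import json

def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split('\n') if line.strip()]

# Made-up sample in the same one-record-per-line layout.
sample = '{"id": 1, "name": "first"}\n{"id": 2, "name": "second"}\n'
records = parse_json_lines(sample)
```

For the real file, the same helper could be fed the result of reading `ev-winners-with-embeddings.json` as text.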


In [32]:
import json

# The blob holds one JSON record per line; keep the first few for inspection
first_5_records = json_blob.split('\n')[:5]

# Parse the first record and pretty-print it
json_data = json.loads(first_5_records[0])
print(json.dumps(json_data, indent=4))

{
    "id": 1,
    "name": "Tymofiy Mylovanov",
    "batch": "1",
    "date_announced": "2018-11-07",
    "link": "marginalrevolution.com/marginalrevolution/2018/11/emergent-ventures-grant-recipients.html",
    "description": "Anonymous grant for writing in Eastern Europe. \nI am pleased to be able to announce that the very first Emergent Ventures winner, several years ago, was Tymofiy Mylovanov, an Ukrainian economist affiliated with the University of Pittsburgh.  Bloomberg covered Tymofiy\u2019s all-important logistics activities in Ukraine here as explained by MR.  And here is a good Pittnews profile.\n\nTymofiy has been so impactful he gets a cohort of his own, in addition to being the very first winner.  The early EV grant to Tymofiy was to encourage him to write on the Ukraine economy, in Ukrainian, and this led in turn to his being appointed the Economy Minister in the Zelensky cabinet, which later morphed into his current set of responsibilities in Ukraine.  (Not all EV grants 