I'm currently using this notebook to transform the CSV data (exported from a Google Sheet) into two outputs: (1) a new CSV with an embedding vector added for each row, computed from the 'description' column; and (2) a JSON blob to be used elsewhere. To regenerate both, I just hit "Run All Cells". Eventually I'll move the key steps into a Python script, but for now I find it useful to inspect the intermediate outputs while things are still changing so much.
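
One possible shape for that eventual script is sketched below. It is not the finished thing: `build_outputs` and its parameter names are hypothetical, and it swaps pandas for the standard-library `csv` module, so every column is read back as a string. The encoder is passed in as a callable on the assumption that `model.encode` (or anything else that maps a list of strings to one vector per string) would be supplied.

```python
import csv
import json


def build_outputs(csv_in, csv_out, json_out, encode):
    """Embed each row's description, then write the CSV and JSON outputs.

    `encode` is a hypothetical hook: any callable that takes a list of
    strings and returns one vector (a sequence of numbers) per string.
    """
    with open(csv_in, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    vectors = encode([row['description'] for row in rows])
    for row, vector in zip(rows, vectors):
        # Plain Python floats serialise cleanly to both CSV and JSON
        row['embedding_description'] = [float(x) for x in vector]

    # CSV output: store each embedding as a JSON string in a single cell
    with open(csv_out, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames + ['embedding_description'])
        writer.writeheader()
        for row in rows:
            cell = json.dumps(row['embedding_description'])
            writer.writerow({**row, 'embedding_description': cell})

    # JSON output: one array of records, embeddings as real lists
    with open(json_out, 'w', encoding='utf-8') as f:
        json.dump(rows, f)

    return rows
```

With the model from this notebook loaded, the call would look something like `build_outputs('../data/ev-winners.csv', '../data/ev-winners-with-embeddings.csv', '../data/ev-winners-with-embeddings.json', model.encode)`.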

In [2]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [3]:
import pandas as pd

# Load the grant winners CSV (exported from the Google Sheet)
df = pd.read_csv('../data/ev-winners.csv')

# This will display the first few rows of the DataFrame
print(df.head())


    id                                               name batch  \
0  1.0                                          Anonymous     1   
1  2.0                                        Topos House     1   
2  3.0                      18-year-old economics prodigy     1   
3  4.0  Mark Lutter and his Center for Innovative Gove...     1   
4  5.0                                     Harshita Arora     1   

  date_announced                                               link  \
0     2018-11-07  marginalrevolution.com/marginalrevolution/2018...   
1     2018-11-07  marginalrevolution.com/marginalrevolution/2018...   
2     2018-11-07  marginalrevolution.com/marginalrevolution/2018...   
3     2018-11-07  marginalrevolution.com/marginalrevolution/2018...   
4     2018-11-07  marginalrevolution.com/marginalrevolution/2018...   

                                         description              type  \
0    Anonymous grant for writing in Eastern Europe.   Writing (Online)   
1  Pledged grant to Sa

In [4]:
names = df['name'].to_numpy()
descriptions = df['description'].to_numpy()
# generate embeddings of the descriptions
embeddings = model.encode(descriptions)
print(embeddings)

[[-0.05316222  0.08313879 -0.00018972 ... -0.02911166 -0.05182787
  -0.04559411]
 [ 0.00656217  0.01479886 -0.01363882 ... -0.01856913 -0.10103376
   0.02289294]
 [ 0.08496379  0.05712386 -0.01860134 ...  0.05476078 -0.00237706
  -0.04405489]
 ...
 [-0.02717686  0.03089065  0.02796888 ... -0.02674641 -0.01135708
   0.06989009]
 [-0.08260766  0.05114881  0.00184061 ...  0.00910514 -0.02419201
   0.00591901]
 [-0.01491736  0.01335617 -0.06053676 ... -0.11592133  0.08571236
  -0.04570728]]


In [5]:
# Write new dataset

df['embedding_description'] = embeddings.tolist()
output_csv_path = '../data/ev-winners-with-embeddings.csv'
df.to_csv(output_csv_path, index=False)
print(f"DataFrame with embeddings has been saved to '{output_csv_path}'")

DataFrame with embeddings has been saved to '../data/ev-winners-with-embeddings.csv'


In [6]:
def get_ev_winners(query):
    """Return [score, description, name] for every row, best match first."""
    # Embed the query once; util.cos_sim compares it against every description embedding
    query_embed = model.encode(query)

    # cos_sim has shape (len(embeddings), 1): one similarity score per description
    cos_sim = util.cos_sim(embeddings, query_embed)

    # Pair each score with its description and name.
    # The range covers every row; the earlier len(cos_sim)-1 skipped the last one.
    all_sentence_combinations = []
    for i in range(len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i].item(), descriptions[i], names[i]])

    # Sort by cosine similarity, highest first
    all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

    return all_sentence_combinations
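
For reference, the score used for ranking here is cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below reproduces that ranking in plain Python on made-up two-dimensional vectors. The helper names and the toy data are illustrative only; they are not part of sentence-transformers and not the real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, labels):
    """Return (score, label) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), label)
              for vec, label in zip(doc_vecs, labels)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy stand-ins for description embeddings (assumed values, for illustration)
toy_docs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
toy_labels = ['c', 'a', 'b']

ranking = rank_by_similarity([1.0, 0.0], toy_docs, toy_labels)
for score, label in ranking:
    print(f"{label}: {score:.2f}")
```

The vector pointing the same way as the query ranks first and the orthogonal one ranks last, which is the same ordering logic `get_ev_winners` applies to the 'description' embeddings.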

In [7]:
def search_ev_winners(query, number):
    print(f"Top {number} matches for query: {query}\n")
    for index, item in enumerate(get_ev_winners(query)[:number], start=1):
        print(f"{index}. {item[2]}: {item[1]}")

search_ev_winners("book", 10)

Top 10 matches for query: book

1. Andy Matuschak: San Francisco, to support his project to reexamine and fundamentally improve the book as a method for learning and absorbing ideas, Twitter here. Here is his essay on why books do not work.
2. Jeremy Stern: Glendale, CA, Tablet magazine. To write a book.
3. Jasmine Wang and team (Jasmine is a repeat winner): Trellis, AI and the book.
4. Marc Sidwell: Marc Sidwell of the United Kingdom, to write a book on common sense.
5. Henry Oliver: Henry Oliver, London, to write a book on talent and late bloomers. Substack here.
6. Byrne Hobart: Byrne Hobart, to write a book on technological progress with Tobias Huber.
7. Jeffrey C. Huber: Jeffrey C. Huber, to write a book on tech and economic progress from a Christian point of view.
8. Matt Faherty: Matt Faherty, to study and write about the NIH.
9. Yuen Yuen Ang: Yuen Yuen Ang, political scientist at the University of Michigan, from Singapore, to write a new book on disruption.
10. Kathleen Harwar

In [8]:
import json

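# Build the JSON blob from the CSV written above.
# Each embedding is read back from the CSV as a string, so parse it into a list first.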
df2 = pd.read_csv('../data/ev-winners-with-embeddings.csv')
df2['embedding_description'] = df2['embedding_description'].apply(json.loads)
json_blob = df2.to_json(orient='records', lines=False)
output_json_path = '../data/ev-winners-with-embeddings.json'
with open(output_json_path, 'w') as json_file:
    json_file.write(json_blob)
print(f"JSON blob has been saved to '{output_json_path}'")

JSON blob has been saved to '../data/ev-winners-with-embeddings.json'
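
The `json.loads` step in the cell above is needed because a CSV cell can only hold text, so each embedding list comes back as a string like "[0.25, -0.5, ...]". The sketch below shows that round trip on a made-up three-value vector. It uses the standard-library `csv` module as a stand-in for pandas, on the assumption that both write a list cell as its string form.

```python
import csv
import io
import json

# A made-up embedding, much shorter than the real vectors
embedding = [0.25, -0.5, 1e-05]

# Write one row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
csv.writer(buffer).writerow(['Anonymous', embedding])

# Read it back: the embedding cell is now a plain string
buffer.seek(0)
name, raw = next(csv.reader(buffer))

# Parsing the string as JSON recovers the list of floats
restored = json.loads(raw)
print(type(raw).__name__, type(restored).__name__)
```

Without that parsing step, the JSON blob would carry each embedding as one long string rather than as an array of numbers.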


In [9]:
# Preview the first 5 records. json_blob is a single JSON array (lines=False),
# so splitting it on '\n' returned the whole blob as one string, and printing
# that tripped Jupyter's IOPub data rate limit. Parse it and slice the list instead.
first_5_records = json.loads(json_blob)[:5]

# Show only the first few embedding values so the printed output stays small
for record in first_5_records:
    record['embedding_description'] = record['embedding_description'][:3]

# Pretty print the preview
pretty_json = json.dumps(first_5_records, indent=4)
print(pretty_json)

