### Semantic Search Demonstration
The following code demonstrates semantic search using embeddings and cosine similarity.

The code reads a CSV file of ChatGPT-generated (fake) student memoirs. Each row represents one memoir.

For every memoir we generate an embedding vector, which is later compared against the embedding of a search term to measure similarity.

In [41]:
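import os

import openai
import pandas as pd
from dotenv import load_dotenv

# Note: openai.embeddings_utils ships only with openai-python releases before 1.0
from openai.embeddings_utils import get_embedding

# Load OPENAI_API_KEY from a local .env file
load_dotenv()
openai.api_key = os.environ.get("OPENAI_API_KEY")

# One memoir per row, in the 'text' column
df = pd.read_csv('memoirs.csv')

# One API call per memoir; each result is a list of floats
df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine="text-embedding-ada-002"))
df.to_csv('word_embeddings.csv')

df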

Unnamed: 0,text,embedding
0,"I wake up early, feeling exhausted from stayin...","[-0.0035102476831525564, -0.003641922259703278..."
1,I wake up feeling anxious about a test I have ...,"[-0.019288772717118263, 0.014115586876869202, ..."
2,"I wake up early, excited to go to school. In c...","[0.011736449785530567, 0.009439075365662575, 0..."
3,I wake up and feel a sense of dread about goin...,"[0.005431780591607094, -0.005482162348926067, ..."
4,I wake up early and have a healthy breakfast b...,"[0.0019850502721965313, 0.009323081001639366, ..."
5,I wake up late and barely make it to school on...,"[-0.0013890566769987345, 0.013013100251555443,..."
6,I wake up feeling grateful for another day. At...,"[0.0038523832336068153, -0.002531841630116105,..."
7,I wake up early and work out before heading to...,"[-0.008601831272244453, 0.007503862958401442, ..."
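
The cell above also writes the embeddings to `word_embeddings.csv`. If that file is read back in a later session, pandas loads the `embedding` column as plain strings, so each value would need to be parsed into a list before any similarity can be computed. The sketch below shows one way to do that with `ast.literal_eval`; it uses an in-memory buffer and a made-up two-row frame (`original`, `reloaded`, and the toy vectors are illustrative assumptions, not part of the notebook).

```python
import ast
import io

import pandas as pd

# Toy stand-in for the memoir frame (hypothetical data, not real embeddings)
original = pd.DataFrame({
    "text": ["first memoir", "second memoir"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
})

# Round-trip through CSV, using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each embedding is a string such as "[0.1, 0.2]"
was_string = isinstance(reloaded["embedding"][0], str)

# Parse each string back into a Python list of floats
reloaded["embedding"] = reloaded["embedding"].apply(ast.literal_eval)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.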


With an embedding created for each memoir, we can now score each one for similarity against the embedding of a search query.
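
To make the similarity score concrete, here is a minimal NumPy sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name `cosine_similarity_np` and the toy 2-D vectors are assumptions for illustration; the notebook itself relies on the `cosine_similarity` helper from `openai.embeddings_utils`, which is assumed here to compute the same quantity.

```python
import numpy as np


def cosine_similarity_np(a, b):
    """Cosine of the angle between vectors a and b (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 2-D vectors, not real embeddings
same_direction = cosine_similarity_np([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity_np([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity_np([1.0, 0.0], [-1.0, 0.0])        # opposite directions
```

Parallel vectors score 1, orthogonal vectors score 0, and opposite vectors score -1, so a higher score means the two embeddings point in more similar directions regardless of their lengths.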

In [42]:
import pandas as pd
from openai.embeddings_utils import get_embedding, cosine_similarity

# Embed the query with the same model used for the memoirs
search_term_embedding = get_embedding('Which students are anxious and worried?', engine="text-embedding-ada-002")

# Score every memoir against the query, then rank from most to least similar
df['similar'] = df.embedding.apply(lambda x: cosine_similarity(x, search_term_embedding))
ranked = df.sort_values('similar', ascending=False)

# Take the top-ranked row rather than a hard-coded index
text = ranked.iloc[0]['text']

print("Most similar text: ", text)
ranked

Most similar text:  I wake up feeling anxious about a test I have later in the day. During classes, I can't focus on anything else, and I feel like I'm going to fail. After school, I go home and study for hours, even though I know it won't make a difference. In the evening, I cry myself to sleep, worried about my grades.


Unnamed: 0,text,embedding,similar
1,I wake up feeling anxious about a test I have ...,"[-0.019288772717118263, 0.014115586876869202, ...",0.851096
3,I wake up and feel a sense of dread about goin...,"[0.005431780591607094, -0.005482162348926067, ...",0.829775
2,"I wake up early, excited to go to school. In c...","[0.011736449785530567, 0.009439075365662575, 0...",0.785964
0,"I wake up early, feeling exhausted from stayin...","[-0.0035102476831525564, -0.003641922259703278...",0.782036
5,I wake up late and barely make it to school on...,"[-0.0013890566769987345, 0.013013100251555443,...",0.775874
6,I wake up feeling grateful for another day. At...,"[0.0038523832336068153, -0.002531841630116105,...",0.774149
4,I wake up early and have a healthy breakfast b...,"[0.0019850502721965313, 0.009323081001639366, ...",0.768053
7,I wake up early and work out before heading to...,"[-0.008601831272244453, 0.007503862958401442, ...",0.766041


For the search prompt "Which students are anxious and worried?", the top-ranked memoir has a cosine similarity of 0.851096.

Its text is:

I wake up feeling anxious about a test I have later in the day. During classes, I can't focus on anything else, and I feel like I'm going to fail. After school, I go home and study for hours, even though I know it won't make a difference. In the evening, I cry myself to sleep, worried about my grades.
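
In the results above, all eight scores fall between roughly 0.77 and 0.85, so the ranking is more informative than any single absolute value. A search would therefore often return the top few matches rather than only the first. The sketch below shows one way to do that with pandas; `top_k_matches`, the toy frame, and its 2-D vectors are hypothetical and stand in for the real memoir embeddings.

```python
import numpy as np
import pandas as pd


def top_k_matches(frame, query_embedding, k=3):
    """Return the k rows of `frame` most similar to `query_embedding`."""
    q = np.asarray(query_embedding, dtype=float)

    def score(vec):
        v = np.asarray(vec, dtype=float)
        return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))

    # assign() returns a copy, so the caller's frame is left unchanged
    scored = frame.assign(similar=frame["embedding"].apply(score))
    return scored.nlargest(k, "similar")


# Toy data: short texts with made-up 2-D "embeddings"
toy = pd.DataFrame({
    "text": ["anxious about a test", "excited for school", "dread going to class"],
    "embedding": [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]],
})

matches = top_k_matches(toy, [1.0, 0.0], k=2)
```

With the real data, the same call would take `df` and `search_term_embedding` in place of the toy frame and query vector.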