# Codealong Notebook

Use this notebook as your "scratch pad" as you go through the course contents. Feel free to copy any example code and tweak it to get a better understanding of how it works!

Use the **+** button or `Insert` menu to add additional code cells as needed.

In [None]:
#Uncomment this to use OpenAI directly
#import openai
#openai.api_key = "Your OpenAI API key"

# This notebook accesses OpenAI through Vocareum; comment out the OPENAI_API_BASE line below if you are using OpenAI directly
import getpass
import os

os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

### Install PyPDF2 to extract text from PDFs

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Using cached pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Using cached pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


## Helper function to extract text from PDFs

In [16]:
import PyPDF2
import pandas as pd
import re

def extract_paragraphs_from_pdf(pdf_path):
    # Initialize list to store paragraphs
    paragraphs = []
    
    try:
        # Open the PDF file
        with open(pdf_path, 'rb') as file:
            # Create PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)
            
            # Extract text from each page
            for page in pdf_reader.pages:
                text = page.extract_text()
                
                # Split text into paragraphs
                # This splits on double line breaks and removes empty strings
                page_paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
                paragraphs.extend(page_paragraphs)
        
        # Create DataFrame
        df = pd.DataFrame({
            'paragraph_id': range(len(paragraphs)),
            'text': paragraphs,
            'length': [len(p) for p in paragraphs]
        })
        
        return df
    
    except FileNotFoundError:
        print("Error: PDF file not found")
        return None
    except Exception as e:
        print(f"Error: {str(e)}")
        return None
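
To see how the paragraph split inside `extract_paragraphs_from_pdf` behaves, here is a small sketch that applies the same regular expression to made-up strings (the sample text and the helper name `split_paragraphs` are hypothetical, not part of the notebook). Text without blank lines comes back as a single chunk, which may be why each page of this PDF ends up as one long "paragraph".

```python
import re

def split_paragraphs(text):
    """Split text on blank lines and drop empty pieces (same regex as above)."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

# Hypothetical sample: two blocks separated by a blank (whitespace-only) line
sample = "First paragraph line 1\nline 2\n\n  \nSecond paragraph"
print(split_paragraphs(sample))

# Text with single line breaks only is returned as one chunk
print(len(split_paragraphs("line 1\nline 2\nline 3")))
```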

### Extract text from a PDF file

In [None]:
# Use the function
pdf_path = "US20180282715A1.pdf"
df = extract_paragraphs_from_pdf(pdf_path)

Unnamed: 0,paragraph_id,text,length
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256


In [25]:

df.head(10)

Unnamed: 0,paragraph_id,text,length
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256
5,5,"Patent Application Publication Oct . 4 , 2018 ...",223
6,6,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n NOVEL C...",6734
7,7,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n cell , ...",7142
8,8,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n referen...",7082
9,9,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n limited...",7664


### Inspect a sample paragraph

In [32]:
print(df['text'][6])

US 2018 / 0282715 A1 Oct . 4 , 2018 
 NOVEL CRISPR - ASSOCIATED ( CAS ) PROTEIN 
 CROSS - REFERENCE TO RELATED 
 APPLICATIONS 
 [ 0001 ] This application claims the benefit under 35 U . S . C . $ 119 ( e ) ( 1 ) of U . S . Provisional Application Nos . 62 / 477 , 494 , filed 28 Mar . 2017 , and 62 / 629 , 641 , filed 12 Feb . 2018 , which applications are incorporated herein by reference in their entireties . 
 TECHNICAL FIELD 
 [ 0002 ] The present invention relates to Clustered Regu larly Interspaced Short Palindromic Repeats ( CRISPR ) sys 
 tems . In particular , the invention relates to a new CRISPR associated ( Cas ) protein , termed “ CasM , ” and the uses of CasM for site - specific nucleic acid engineering . 
 BACKGROUND OF THE INVENTION 
 [ 0003 ] Clustered Regularly Interspaced Short Palindromic 
 Repeats ( CRISPR ) and CRISPR - associated ( Cas ) proteins 
 are found in prokaryotic immune systems . These systems 
 provide resistance against exogenous genetic elements , such

### Save dataset locally

In [27]:
df.to_csv("crispr.csv")

## Creating an Embeddings Index

In [35]:
import pandas as pd
df = pd.read_csv("crispr.csv", index_col=0)
df.head(10)

Unnamed: 0,paragraph_id,text,length
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256
5,5,"Patent Application Publication Oct . 4 , 2018 ...",223
6,6,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n NOVEL C...",6734
7,7,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n cell , ...",7142
8,8,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n referen...",7082
9,9,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n limited...",7664


### Creating an Embeddings Index with `client.embeddings.create`

### Dataframe with embeddings

In [48]:
from openai import OpenAI

# Get API key if not set
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Initialize OpenAI client with optional base URL
client_kwargs = {"api_key": os.environ["OPENAI_API_KEY"]}
if "OPENAI_API_BASE" in os.environ:
    client_kwargs["base_url"] = os.environ["OPENAI_API_BASE"]

client = OpenAI(**client_kwargs)

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

def get_embeddings(texts):
    try:
        response = client.embeddings.create(
            input=texts,
            model=EMBEDDING_MODEL_NAME
        )
        return [embedding.embedding for embedding in response.data]
    except Exception as e:
        print(f"Error creating embeddings: {e}")
        return None

# Use with your DataFrame
embeddings = get_embeddings(df["text"].tolist())
if embeddings:
    df["embeddings"] = embeddings

### Check the embeddings

In [49]:
df.head(10)

Unnamed: 0,paragraph_id,text,length,embeddings
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413,"[-0.044504206627607346, 0.0066853598691523075,..."
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93,"[-0.01686546765267849, 0.00898822396993637, 0...."
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168,"[-0.037664152681827545, 0.007959115318953991, ..."
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142,"[-0.028591148555278778, 0.003124697832390666, ..."
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256,"[-0.031801845878362656, -0.003988437354564667,..."
5,5,"Patent Application Publication Oct . 4 , 2018 ...",223,"[-0.031520090997219086, -0.0033901638817042112..."
6,6,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n NOVEL C...",6734,"[-0.03968781232833862, 0.006241672672331333, -..."
7,7,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n cell , ...",7142,"[-0.04525245353579521, -0.008158625103533268, ..."
8,8,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n referen...",7082,"[-0.03745700418949127, -0.0039087384939193726,..."
9,9,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n limited...",7664,"[-0.04072467237710953, 0.004864649381488562, -..."


### Saving new dataframe with the embeddings

In [50]:
df.to_csv("embeddings.csv")
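
Writing a column of Python lists to CSV stores each list as text, so it has to be parsed back into numbers on load. Below is a minimal sketch of that round trip using only the standard library; the sample values and the in-memory buffer are hypothetical stand-ins for the real embeddings file. `ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.

```python
import ast
import csv
import io

# Hypothetical two-row table with short stand-in "embeddings"
rows = [
    {"text": "first", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second", "embeddings": [0.0, 0.5, -1.0]},
]

# Write to an in-memory CSV; each list is stored as its string form
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embeddings column is now plain text
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]))

# Parse each string back into a list of floats
parsed = [ast.literal_eval(row["embeddings"]) for row in loaded]
print(parsed[0])
```

One caveat: a column holding large NumPy arrays appears to be written in NumPy's abbreviated printed form (with `...` in the middle), as the `distances.csv` preview in Step 3 suggests, and that form cannot be parsed back. Converting arrays to lists before saving avoids this.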

## Step 2

### Finding Relevant Data with Cosine Similarity

In [51]:
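import ast
import numpy as np

df = pd.read_csv("embeddings.csv", index_col=0)
# Parse the stringified lists safely (ast.literal_eval only accepts literals)
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)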
df

Unnamed: 0,paragraph_id,text,length,embeddings
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413,"[-0.044504206627607346, 0.0066853598691523075,..."
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93,"[-0.01686546765267849, 0.00898822396993637, 0...."
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168,"[-0.037664152681827545, 0.007959115318953991, ..."
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142,"[-0.028591148555278778, 0.003124697832390666, ..."
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256,"[-0.031801845878362656, -0.003988437354564667,..."
...,...,...,...,...
95,95,"US 2018 / 0282715 Al Oct . 4 , 2018 90 \n - co...",1945,"[-0.020597120746970177, -0.011165442876517773,..."
96,96,"US 2018 / 0282715 Al Oct . 4 , 2018 \n - conti...",2072,"[-0.026444602757692337, -0.01601344905793667, ..."
97,97,"US 2018 / 0282715 Al Oct . 4 , 2018 92 . . \n ...",1982,"[-0.025891972705721855, -0.012279598042368889,..."
98,98,"US 2018 / 0282715 Al Oct . 4 , 2018 \n - conti...",2086,"[-0.028477206826210022, -0.03346458077430725, ..."


In [54]:
question = "What is CRISPR?"
question_embedding = get_embeddings([question])[0]

### Assess similarity

In [56]:
import numpy as np

def distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine"):
    """Return the distances between a query embedding and a list of embeddings."""
    if distance_metric == "cosine":
        # Normalize the embeddings
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        embeddings = [e / np.linalg.norm(e) for e in embeddings]
        
        # Calculate cosine similarity
        similarities = [np.dot(query_embedding, embedding) for embedding in embeddings]
        
        # Convert to distances (1 - similarity)
        return [1 - s for s in similarities]
    else:
        raise ValueError(f"Unsupported distance metric: {distance_metric}")
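
As a sanity check on `distances_from_embeddings`, the sketch below computes cosine distance for a few toy 2-D vectors without NumPy. The helper name `cosine_distance` and the vectors are illustrative assumptions, not part of the notebook; real embeddings have many more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Toy vectors (hypothetical): same direction, orthogonal, opposite
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction: 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite: 2.0
```

A smaller distance means the two vectors point in a more similar direction, which is why the rows with the smallest distances are treated as the most relevant to the question.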

In [59]:
distances = distances_from_embeddings(question_embedding, df["embeddings"].tolist())
distances

[0.17378298831315253,
 0.2756749960358714,
 0.29949330899129634,
 0.2802443348338546,
 0.2860481106739243,
 0.2907269660759454,
 0.16586018915831846,
 0.2132141295451997,
 0.19827465526139043,
 0.20786868111961032,
 0.23243462353429512,
 0.2556178604781738,
 0.21869135323193734,
 0.2337948058831204,
 0.2376913264030145,
 0.19462048196206183,
 0.16606427926304823,
 0.17618794844386998,
 0.1905401816890251,
 0.23424776627530441,
 0.23567340454178076,
 0.20113750121921192,
 0.22338363048601872,
 0.21021963059193094,
 0.22302772806427107,
 0.1820966416400509,
 0.1777380861589155,
 0.19976166374577398,
 0.21295422199339287,
 0.22174568652006488,
 0.24629971628271374,
 0.25162697357354435,
 0.28335888332532,
 0.2897478061293399,
 0.2956850662295447,
 0.27997960946033185,
 0.2815042028138569,
 0.273354976958492,
 0.2860451423952237,
 0.28108405021543104,
 0.2739315399290936,
 0.28898658938533406,
 0.28039716621691024,
 0.2711576450363099,
 0.2825557138924599,
 0.2913701452015016,
 0.280877828

In [60]:
df["distances"] = distances
df.head(10)

Unnamed: 0,paragraph_id,text,length,embeddings,distances
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413,"[-0.044504206627607346, 0.0066853598691523075,...",0.173783
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93,"[-0.01686546765267849, 0.00898822396993637, 0....",0.275675
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168,"[-0.037664152681827545, 0.007959115318953991, ...",0.299493
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142,"[-0.028591148555278778, 0.003124697832390666, ...",0.280244
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256,"[-0.031801845878362656, -0.003988437354564667,...",0.286048
5,5,"Patent Application Publication Oct . 4 , 2018 ...",223,"[-0.031520090997219086, -0.0033901638817042112...",0.290727
6,6,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n NOVEL C...",6734,"[-0.03968781232833862, 0.006241672672331333, -...",0.16586
7,7,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n cell , ...",7142,"[-0.04525245353579521, -0.008158625103533268, ...",0.213214
8,8,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n referen...",7082,"[-0.03745700418949127, -0.0039087384939193726,...",0.198275
9,9,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n limited...",7664,"[-0.04072467237710953, 0.004864649381488562, -...",0.207869


In [61]:
df.to_csv("distances.csv")

### Wrapped in a function

In [66]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embeddings(question)[0]

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
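
The ranking step can also be illustrated on plain lists, without pandas. This is only a sketch: `rank_by_distance`, the labels, and the distance values are hypothetical and chosen for illustration.

```python
def rank_by_distance(texts, distances):
    """Pair each text with its distance and sort ascending (most relevant first)."""
    return sorted(zip(distances, texts), key=lambda pair: pair[0])

# Hypothetical labels and distances
texts = ["intro", "figures", "background"]
distances = [0.17, 0.29, 0.16]

for distance, text in rank_by_distance(texts, distances):
    print(f"{distance:.2f}  {text}")
```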

### Testing the function that sorts rows by relevance

In [68]:
df = get_rows_sorted_by_relevance(question, df)
df.head(10)

Unnamed: 0,paragraph_id,text,length,embeddings,distances
6,6,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n NOVEL C...",6734,"[-0.03968781232833862, 0.006241672672331333, -...",0.16586
16,16,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n se case...",7453,"[-0.029295915737748146, 0.0044411360286176205,...",0.166064
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413,"[-0.044504206627607346, 0.0066853598691523075,...",0.173783
17,17,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n possess...",7678,"[-0.04011465236544609, -0.006061012391000986, ...",0.176188
26,26,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n 21 \n T...",5037,"[-0.018203414976596832, 0.01511384453624487, -...",0.177738
25,25,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n 20 \n E...",5554,"[-0.03702671825885773, 0.01751861162483692, -0...",0.182097
18,18,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n known C...",6824,"[-0.01835833676159382, -0.007091736886650324, ...",0.19054
15,15,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n geese ;...",7742,"[-0.03625903278589249, -0.022521305829286575, ...",0.19462
8,8,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n referen...",7082,"[-0.03745700418949127, -0.0039087384939193726,...",0.198275
27,27,"US 2018 / 0282715 A1 Oct . 4 , 2018 \n [ 0208 ...",4980,"[-0.025646647438406944, 0.01512566115707159, -...",0.199762


## Step 3

### Tokenizing with `tiktoken`

`tiktoken` is OpenAI's tokenizer library. It splits text into tokens the same way OpenAI's models do, which makes it useful for:

- Counting tokens before making API calls
- Staying within model context limits
- Understanding how text will be processed by models like GPT-3.5 and GPT-4

In [70]:
!pip install tiktoken



In [71]:
import tiktoken

In [72]:
tokenizer = tiktoken.get_encoding("cl100k_base")

In [76]:
question = "What is CRISPR?"

In [79]:
tokens = tokenizer.encode(question)

In [80]:
len(tokens)

6
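
The counting steps above can be wrapped in a small helper that accepts any `encode` callable. In the sketch below, `count_tokens` and `whitespace_encode` are hypothetical names; the whitespace encoder is a stand-in so the example runs without `tiktoken`, and its counts will differ from the real tokenizer's.

```python
def count_tokens(text, encode):
    """Count tokens in `text` using any callable that returns a token list."""
    return len(encode(text))

def whitespace_encode(text):
    """Stand-in encoder (assumption): one token per whitespace-separated word."""
    return text.split()

# With tiktoken installed, the real encoder could be passed instead, e.g.
# count_tokens(question, tiktoken.get_encoding("cl100k_base").encode)
question = "What is CRISPR?"
print(count_tokens(question, whitespace_encode))
```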

### Composing a Custom Text Prompt

In [83]:
prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:
{}

---
Question: {}
Answer: """

In [84]:
question = "What is CRISPR?"

In [86]:
print(prompt_template.format("here we have to add our new context", question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:
here we have to add our new context

---
Question: What is CRISPR?
Answer: 


In [87]:
max_token_count = 1000  # example cap; the actual limit depends on the model

### How do we check whether our prompt is underneath that cap?

1. First, we need to count the number of tokens in the template and the question itself.

In [88]:
tokenizer = tiktoken.get_encoding("cl100k_base")

In [89]:
question_tokens = len(tokenizer.encode(question))
template_tokens = len(tokenizer.encode(prompt_template))
current_token_count = question_tokens + template_tokens
current_token_count

47

**The template and question use 47 tokens, so up to 1000 - 47 = 953 tokens remain for the context.**

In [90]:
context = []

In [91]:
import pandas as pd
df = pd.read_csv("distances.csv", index_col=0)
df

Unnamed: 0,paragraph_id,text,length,embeddings,distances
0,0,THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 2...,1413,[-0.04450421 0.00668536 -0.00587228 ... 0.00...,0.173783
1,1,"Patent Application Publication Oct . 4 , 2018 ...",93,[-0.01686547 0.00898822 0.01265579 ... -0.00...,0.275675
2,2,"Patent Application Publication Oct . 4 , 2018 ...",168,[-0.03766415 0.00795912 0.00553091 ... 0.00...,0.299493
3,3,"Patent Application Publication Oct . 4 , 2018 ...",142,[-0.02859115 0.0031247 0.01342997 ... 0.00...,0.280244
4,4,"Patent Application Publication Oct . 4 , 2018 ...",256,[-0.03180185 -0.00398844 0.00010493 ... 0.00...,0.286048
...,...,...,...,...,...
95,95,"US 2018 / 0282715 Al Oct . 4 , 2018 90 \n - co...",1945,[-0.02059712 -0.01116544 -0.02790667 ... -0.03...,0.281965
96,96,"US 2018 / 0282715 Al Oct . 4 , 2018 \n - conti...",2072,[-0.0264446 -0.01601345 -0.03496066 ... -0.03...,0.299942
97,97,"US 2018 / 0282715 Al Oct . 4 , 2018 92 . . \n ...",1982,[-0.02589197 -0.0122796 -0.03976119 ... -0.03...,0.289461
98,98,"US 2018 / 0282715 Al Oct . 4 , 2018 \n - conti...",2086,[-0.02847721 -0.03346458 -0.02478234 ... -0.02...,0.281179


In [92]:
# Rows are in document order here (distances.csv was saved before sorting),
# not ranked by relevance
for text in df["text"].values:
    text_token_count = len(tokenizer.encode(text))
    current_token_count += text_token_count
    
    if current_token_count <= max_token_count:
        context.append(text)
    else:
        break  # stop adding text once the limit would be exceeded
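
The loop above can be sketched as a reusable function. `pack_context`, its parameters, and the word-count tokenizer are hypothetical stand-ins so the example runs without `tiktoken`; the sketch assumes the texts are already ordered from most to least relevant.

```python
def pack_context(texts, count_tokens, max_tokens, used_tokens=0):
    """Greedily add texts in order until the next one would exceed max_tokens."""
    context = []
    for text in texts:
        cost = count_tokens(text)
        if used_tokens + cost > max_tokens:
            break  # stop at the first text that does not fit
        context.append(text)
        used_tokens += cost
    return context, used_tokens

def word_count(text):
    """Stand-in token counter (assumption): one token per word."""
    return len(text.split())

# Hypothetical inputs: 4 tokens already used by the template and question
texts = ["a b c", "d e", "f g h i", "j"]
context, used = pack_context(texts, word_count, max_tokens=10, used_tokens=4)
print(context, used)
```

Because the loop stops at the first text that does not fit, the short final text `"j"` is never considered even though it would fit. Replacing `break` with `continue` would keep scanning for shorter texts, at the cost of possibly skipping a more relevant one.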

In [93]:
context

['THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 20180282715A1 ( 19 ) United States ( 12 ) Patent Application Publication ( 10 ) Pub . No . : US 2018 / 0282715 A1 Carter et al . ( 43 ) Pub . Date : Oct . 4 , 2018 \n ( 54 ) NOVEL CRISPR - ASSOCIATED ( CAS ) PROTEIN \n ( 71 ) Applicant : Caribou Biosciences , Inc . , Berkeley , \n CA ( US ) \n ( 72 ) Inventors : Matthew Merrill Carter , Berkeley , CA ( US ) ; Paul Daniel Donohoue , Berkeley . \n CA ( US ) Publication Classification \n ( 51 ) Int . Cl . \n C12N 9 / 22 ( 2006 . 01 ) \n C12N 15 / 11 ( 2006 . 01 ) C12N 15 / 85 ( 2006 . 01 ) \n C12N 15 / 113 ( 2006 . 01 ) \n ( 52 ) U . S . Cl . ??? . . . . C12N 9 / 22 ( 2013 . 01 ) ; C12N 15 / 11 \n ( 2013 . 01 ) ; C12N 2800 / 22 ( 2013 . 01 ) ; C12N 15 / 1136 ( 2013 . 01 ) ; C12N 2310 / 20 ( 2017 . 05 ) ; C12N 15 / 85 ( 2013 . 01 ) \n ABSTRACT \n A new CRISPR - associated ( Cas ) protein , termed “ CasM , ” is described , as well as polynucleotides encoding the same and methods of using CasM f

In [94]:
print (prompt_template.format("\n\n###\n\n".join(context), question) )


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:
THE MAIN TEA ETA AUTOMAT U TEMAMA ANATUAN US 20180282715A1 ( 19 ) United States ( 12 ) Patent Application Publication ( 10 ) Pub . No . : US 2018 / 0282715 A1 Carter et al . ( 43 ) Pub . Date : Oct . 4 , 2018 
 ( 54 ) NOVEL CRISPR - ASSOCIATED ( CAS ) PROTEIN 
 ( 71 ) Applicant : Caribou Biosciences , Inc . , Berkeley , 
 CA ( US ) 
 ( 72 ) Inventors : Matthew Merrill Carter , Berkeley , CA ( US ) ; Paul Daniel Donohoue , Berkeley . 
 CA ( US ) Publication Classification 
 ( 51 ) Int . Cl . 
 C12N 9 / 22 ( 2006 . 01 ) 
 C12N 15 / 11 ( 2006 . 01 ) C12N 15 / 85 ( 2006 . 01 ) 
 C12N 15 / 113 ( 2006 . 01 ) 
 ( 52 ) U . S . Cl . ??? . . . . C12N 9 / 22 ( 2013 . 01 ) ; C12N 15 / 11 
 ( 2013 . 01 ) ; C12N 2800 / 22 ( 2013 . 01 ) ; C12N 15 / 1136 ( 2013 . 01 ) ; C12N 2310 / 20 ( 2017 . 05 ) ; C12N 15 / 85 ( 2013 . 01 ) 
 ABSTRACT 
 A new CRISPR - associated 

## Step 4

### Getting a Custom Q&A Response with `client.completions.create`

In [96]:
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_template.format("\n\n###\n\n".join(context), question),
    max_tokens=150
)

print(response.choices[0].text.strip())

CRISPR stands for Clustered Regularly Interspaced Short Palindromic Repeats. It refers to a type of DNA sequence found in bacteria that can be used for gene editing and other genetic research purposes.


# TL;DR - Automatic analysis - RAG!

In [144]:
import tiktoken
from openai import OpenAI

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

# Get API key if not set
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Initialize OpenAI client with optional base URL
client_kwargs = {"api_key": os.environ["OPENAI_API_KEY"]}
if "OPENAI_API_BASE" in os.environ:
    client_kwargs["base_url"] = os.environ["OPENAI_API_BASE"]

client = OpenAI(**client_kwargs)

def get_embeddings(texts):
    try:
        response = client.embeddings.create(
            input=texts,
            model=EMBEDDING_MODEL_NAME
        )
        return [embedding.embedding for embedding in response.data]
    except Exception as e:
        print(f"Error creating embeddings: {e}")
        return None

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get the embedding for the question text (the first and only item returned)
    question_embeddings = get_embeddings(question)[0]

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []

    df_distances = get_rows_sorted_by_relevance(question, df)["text"]
    
    for text in df_distances.values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    # print("######### Custom Context #########")
    # print(context)
    # print("######### End Custom Context #########\n")

    return prompt_template.format("\n\n###\n\n".join(context), question)

def answer_question(
    question, 
    df, 
    max_prompt_tokens=2048, 
    max_answer_tokens=1024
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

In [135]:
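import ast
import numpy as np

df = pd.read_csv("embeddings.csv", index_col=0)
# Parse the stringified lists safely (ast.literal_eval only accepts literals)
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)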

In [None]:
custom_answer = answer_question("What is CRISPR?", df)
print(custom_answer)

Clustered Regularly Interspaced Short Palindromic Repeats


In [149]:
custom_answer = answer_question("Explain what is CRISPR to a 10 years old kid", df)
print(custom_answer)

CRISPR is a tool that helps scientists edit genes, kind of like using scissors to change a recipe. It can help fix mistakes or make new changes in cells, which can lead to new and exciting discoveries in science and medicine!


In [116]:
custom_twitter_answer = answer_question("Who owns Twitter?", df)
print(custom_twitter_answer)

I don't know


In [127]:
custom_question = answer_question("create bullet points for the main events of this text", df)
print(custom_question)

- Patent Application Publication
- Description of a new CRISPR-associated (Cas) protein called "CasM"
- Methods of using CasM for site-specific genome engineering
- Polynucleotides encoding CasM
- CasM's ability to target and cleave single-stranded RNA
- Use of CasM in creating a Sequence Listing
- Filing of provisional applications in related U.S. Application Data
- Patent Application Publication Sheet 1 of 5
- Patent Application Publication Sheet 2 of 5


In [128]:
custom_question = answer_question("who filled this patent?", df)
print(custom_question)

The application for this patent was filled by Caribou Biosciences, Inc. and the inventors listed are Matthew Merrill Carter and Paul Daniel Donohoue.


In [139]:
custom_question = answer_question("when was the patent published?", df)
print(f"Answer:\n{custom_question}")

######### Custom Context #########
['Patent Application Publication Oct . 4 , 2018 Sheet 3 of 5 US 2018 / 0282715 A1 \n| 1 | 2 | 3 | 4 5 6 7 1 2 . 3 4 5 6 \n + 480 \n - 481 \n FIG . 3', "Patent Application Publication Oct . 4 , 2018 Sheet 4 of 5 US 2018 / 0282715 A1 \n 472 471 \n 490 TIID 491 \n 473 473 470 \n FIG . 4 \n 492 493 494 \n TIID - \n am 31 5 AL O ETSERIER \n RET \n 495 495 495 \n 496 - \n AP \n 543 ' 498 496 53 ' 497 53 3 497 499 \n FIG . 5", 'Patent Application Publication Oct . 4 , 2018 Sheet 2 of 5 US 2018 / 0282715 A1 \n 472 \n 471 20 \n . LIILOTTI 475 104 \nLoe 470 - - - 473 \n @ @ @ \n O \n 4 - \n 474 \n FIG . 2', 'Patent Application Publication Oct . 4 , 2018 Sheet 5 of 5 US 2018 / 0282715 A1 \n . . . \n . . . . . . . . . . . . . \n . . . . 1 . . 1 . iii \n st | 03 | | 41 | 4 | 01 | * * | 4 | 99 | 0 | 71 711 FIG . 6 \n 501 - 500 - 502 503', 'Patent Application Publication Oct . 4 , 2018 Sheet 1 of 5 US 2018 / 0282715 A1 \nRKHK FIG . 1', 'THE MAIN TEA ETA AUTOMAT U TE

In [148]:
custom_question = answer_question("what is Genomic DNA?", df)
print(f"Answer:\n{custom_question}")

Answer:
Genomic DNA is a segment of a chromosome in the genome of a host cell, which may include portions of a nucleic acid target sequence site.


In [147]:
custom_question = answer_question("explain what is Genomic DNA to a 10 years old kid?", df)
print(custom_question)

Genomic DNA is like a blueprint for our body. It contains all the information that makes us who we are, like our eye color, height, and even our personalities! Scientists use it to learn more about how our bodies work and how we can stay healthy.


In [150]:
custom_answer = answer_question("For how long has been CRISPR in research and development?", df)
print(custom_answer)

I don't know


In [156]:
custom_answer = answer_question("What can you tell me about the BACKGROUND OF THE INVENTION?", df)
text = custom_answer.split(".")

for t in text:
    print(t)

The background of the invention includes a new CRISPR-associated (Cas) protein called "CasM," polynucleotides encoding this protein, and methods of using it for site-specific genome engineering
 It also mentions a Sequence Listing and related US application data, including provisional applications filed in 2017 and 2018



In [154]:
custom_answer = answer_question("Explain the BACKGROUND OF THE INVENTION to a 10 year old kid", df)
text = custom_answer.split(".")

for t in text:
    print(t)

The invention is about a new kind of protein called CasM that can help make specific changes to genetic material in living things
 It can target and cut single strands of RNA, which is like the instructions that cells follow to do their jobs
 This invention can help scientists study and change how living things grow and develop
 It's like a special tool that can make tiny changes to things inside our bodies



In [157]:
custom_answer = answer_question("Sum up the SUMMARY of the invention", df)
text = custom_answer.split(".")

for t in text:
    print(t)

The invention is a new CRISPR-associated protein called CasM that can target and cleave single-stranded RNA
 The protein has been found to be useful for site-specific genome engineering
 The application also includes polynucleotides encoding CasM and describes methods for using the protein



In [158]:
custom_answer = answer_question("does the invention mentions a method of screening and killing cells?", df)
text = custom_answer.split(".")

for t in text:
    print(t)

Yes


In [159]:
custom_answer = answer_question("explain the method of screening and killing cells", df)
text = custom_answer.split(".")

for t in text:
    print(t)

The method involves contacting a NATNA / Cas9 complex to a locus of interest in a population of cells, resulting in DNA cleavage and subsequent repair of the break by the endogenous cellular repair machine
 This introduces indels at the break site
 The targeting of the NATNA / Cas9 complex to a targeted locus that encodes an RNA transcript results in indels in an RNA transcript sequence
 This modified RNA transcript sequence is then targeted by a crRNA / CasM complex, resulting in activation of the CasM protein and subsequent cell death
 This method can be adapted to screen for the incorporation of a donor-poly nucleotide into the NATNA / Cas9 break site



In [160]:
custom_answer = answer_question("explain the method of screening and killing cells to a 10 years old kid", df)
text = custom_answer.split(".")

for t in text:
    print(t)

Scientists used special molecules called crRNAs and a protein called CasM to help find and destroy bad cells in the lab
 They do this by putting the bad cells in a nutritious soup called lysogeny broth, adding a special sauce called carbenicillin to it, and then making a jello-like substance on top
 They also put a tiny amount of something called MS2 phage on top
 After waiting overnight, they look to see if any holes have formed in the jello, which would mean the bad cells have been destroyed
 They also put the same soup on different plates with different ingredients to make sure the crRNAs and CasM only kill the bad cells and not the good ones
 Instead of using a microscope, they use a machine called a flow cytometer to look at the cells and see if the bad ones are gone
 This helps them understand how the crRNAs and CasM work and how to make them better at fighting diseases

