# Custom Chatbot Project

Survivor is an American reality series that has aired since May 2000, with two seasons every year. It is quite popular, having aired continuously for over 24 years.

pi.ai and ChatGPT only have data up to 2021.

## Table of contents

1. [Data Wrangling](#data-wrangling)

2. [Import the support Libraries](#import)

3. [getHTML](#getHTML)

4. [scrape data from the wikipedia](#wikipedia)

5. [scrape data from fandom](#fandom)

6. [merge](#merge)

7. [createText](#createText)


<a id="data-wrangling"></a>



## Data Wrangling 

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

<div id="import" name="import">
## Import the support libraries

</div>




 requests : loads the web page <br>
 pandas : holds the data in a tabular format<br>
 re : regular expressions to clean up and extract the required data<br>
 BeautifulSoup : parses and traverses the HTML data

In [2]:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup

Two different websites supply the required data: <br>
Wikipedia provides the contestant details, including season, age, hometown, profession and rank in the season<br>
Fandom provides the premiere and finale dates of each season.<br><br>

Combining the two gives us the start and end of each season along with its contestants and their ranks.

In [183]:
contestants_url = "https://en.wikipedia.org/wiki/List_of_Survivor_(American_TV_series)_contestants"
seasons_url="https://gameshows.fandom.com/wiki/Survivor/Airdates"

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
SEASON_41 = 41  
MAX_TOKENS = 150
# EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
EMBEDDING_MODEL_NAME = "text-embedding-3-large"
DISTANCE_METRIC = "cosine"
ENCODING_MODEL = "cl100k_base"
TEMPERATURE = 0.2

<a id="getHTML" ></a>


#### function getHTML





`requests` fetches the HTML.

`BeautifulSoup` parses the response into a soup object, which we later use to traverse the document.


In [4]:
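def getHTML(url):
    # fetch the page; the timeout keeps the call from hanging on a slow server
    resp = requests.get(url, timeout=30)
    # fail loudly on HTTP errors instead of parsing an error page
    resp.raise_for_status()

    # parse the response body into a BeautifulSoup object
    soup = BeautifulSoup(resp.text,'html.parser')

    return soup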
    


<div id="wikipedia" name="wikipedia" >


### scrape data from the wikipedia


</div>




The 46 seasons (as of this project) are spread across 10 separate tables (TABLE tags with TH, TR, TD). <br>
Each **table** has a **tbody**, which in turn contains the **tr** table rows. <br>
Each table row has its columns as **th** and **td**. <br>
We iterate through the tables and their rows, combine the columns within each row into a list, and append that list to a dataframe.

In [5]:
def getContestantData(soup):

    tables=soup.find_all('table', {'class' : 'wikitable'})

    # create an empty dataframe to hold the data
    dfContestents = pd.DataFrame()
    # pat = "[^\d]*"

    # clean up the season number, which contains spaces and an exclamation point.
    # this pattern extracts only the digits, however many there are
    pat = r"\d+"

    # enumerate, and iterate through the tables list
    for idx_table, table in enumerate(tables):

    #     if idx_table==9:    # skip season 46 to avoid spoilers, I'm still catching up on the episodes :)
    #         continue

        
        # capture the tbody within each table; we only need the first (immediate) one
        # this is where BeautifulSoup makes processing HTML simpler
        tbody = table.find('tbody') 
        

        # find all rows under the tbody
        rows = tbody.find_all('tr')
        
        # initialize the variables
        season_contestants = []
        col_headings=[]
        season = 0

        
        # Iterate the rows
        for idx_row, row in enumerate(rows):
        
            # Select TH and TD
            cols = row.find_all(['th','td'])
            

            
            # this is specific to how the page is structured:
            # in some places the contestant name is a hyperlink and therefore uses TH along
            # with span and anchor,
            # so the checks below must be adapted to the page being scraped

            lst = []
            
            for idx_col, col in enumerate(cols):
                span   = col.find("span")
                anchor = col.find("a")
                    
    #             print('col = ', col.text.strip())
            
                # Capture the season number and name, which appear only once for all the contestants of a season
                if  (span is not None
                    and anchor is not None
                    and "Survivor" in anchor.get('title', '')
                    and "Survivor contestant" not in anchor.get('title', '')
                    ):
                
    #                 season = re.sub(pat,"", span['data-sort-value']).strip()
                    season = re.findall(pat, span['data-sort-value']) 
                    if len(season) > 0:
                        season = int(season[0])
                        season_name = col.text.strip()

                    continue
                    
                # value of the columnar table in HTML
                else: 
                    val = col.text.strip()
            

                # make a list of all the HTML column within a single row
                lst.append(val)
            
            # special handling, the first row in each table has headings
            # so we are capturing it by referring to the first row in the list
            # since, season number and name, only appear once for a season,
            # these are added against each row prior to reading the next row
            
            if (idx_row == 0 ):
                col_headings = lst
            else:
                lst.insert(0, season)
                lst.insert(1, season_name)
                season_contestants.append(lst)

    #     print("column headings = ", col_headings)
    #     print ("season plus list ", season_contestants)
        
        # update the heading to include season name
        col_headings.insert(1, "Season Name")

        # finally, add all the rows of this table to the pandas dataframe before processing the next table
        # remember there are multiple tables within the page, with about 5 seasons per table
        dfContestents = pd.concat([dfContestents,
                                pd.DataFrame(season_contestants, columns=col_headings)
                                ])
        
        
    dfContestents['Season'] = dfContestents['Season'].astype(str).astype(int)
        
    return dfContestents
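
The season-number cleanup can be checked in isolation. The sketch below assumes a `data-sort-value` that looks like `" 041 !"`; the exact value on the page is an assumption, and `extract_season` is a hypothetical helper that is not part of this notebook.

```python
import re

# Hypothetical helper: pull the first run of digits out of a sort key.
def extract_season(sort_value):
    digits = re.findall(r"\d+", sort_value)
    # int() drops leading zeros; return None when no digits are present
    return int(digits[0]) if digits else None

print(extract_season(" 041 !"))  # 41
print(extract_season("n/a"))     # None
```

Returning `None` for a value without digits makes a missing season explicit instead of silently keeping the previous one.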



In [6]:
soup = getHTML(contestants_url)
dfContestents = getContestantData(soup)


In [7]:
dfContestents.head()

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish
0,1,Survivor: Borneo,Sonja Christopher,63,"Walnut Creek, CA",Gym Teacher / Retired,16th
1,1,Survivor: Borneo,"Bill ""B.B."" Andersen",64,"Mission Hills, KS",Real Estate Developer,15th
2,1,Survivor: Borneo,Stacey Stillman,27,"San Francisco, CA",Attorney,14th
3,1,Survivor: Borneo,Ramona Gray,29,"Edison, NJ",Biochemist,13th
4,1,Survivor: Borneo,Dirk Been,23,"Spring Green, WI",Dairy Farmer,12th


<div id="fandom" name="fandom"></div>



## scrape data from fandom
Grab the season air dates from Fandom, following the same steps that were applied to the Wikipedia (contestants) page.


In [8]:
def getSeasonsData(soup):
    table = soup.find('table', {'class' : 'wikitable'})
    
    rows = table.find_all('tr')

    seasons_list = []
    for idx_row, row in enumerate(rows):
        
        cols = row.find_all(['th','td'])
        lst = []
        for col in cols:
            lst.append(col.text.strip())
            
        seasons_list.append(lst)

    # print(seasons_list)
    # create a dataframe from the season list; the first row contains the headings and the rest is the data
    dfSeasons = pd.DataFrame(seasons_list[1:], columns=seasons_list[0])

    # before merging the two dataframes, and for simplicity make the name of the column used for merging the same
    # Change the datatype of the season to integer

    dfSeasons.rename(columns={'Season #': 'Season'}, inplace=True)
    dfSeasons['Season']     = dfSeasons['Season'].astype(str).astype(int)
 
    return dfSeasons

#### same steps as used for scraping Wikipedia, only simpler
Here the table has rows (TR), each of which consists of TH and TD cells.

In [9]:
soup = getHTML(seasons_url)
dfSeasons = getSeasonsData(soup)


### Tried to use Selenium, but it's not available in the Udacity workspace

In [10]:
# !pip install selenium --upgrade

In [11]:
# dfContestents['Season'].astype(str).astype(int)

In [12]:
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options

# options = Options()
# options.headless = True
# options.add_argument("--window-size=1920,1080")

# driver = webdriver.Chrome(options=options, executable_path='/path/to/chromedriver')
# driver.get(seasons_url)
# page_source = driver.page_source
# # with open("udacity_home.html", "w") as f:
# #     f.write(page_source)
# driver.quit()

In [13]:

dfSeasons.head(5)

Unnamed: 0,Season,Premiering Date,Finale Date
0,1,"May 31, 2000","August 23, 2000"
1,2,"January 28, 2001","May 3, 2001"
2,3,"October 11, 2001","January 10, 2002"
3,4,"February 28, 2002","May 19, 2002"
4,5,"September 19, 2002","December 19, 2002"


<div id="merge" name="merge"></div>


## Merge the two dataframes to form a single dataframe

In [14]:
# merge the two dataframes to create the single dataframe, which now has contestant as well as season start/end date

dfAll=dfContestents.merge(dfSeasons,
                    on  ='Season',
                    how = 'left',
                   )
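
To make the effect of `how='left'` concrete, here is a minimal sketch on toy data. The two small frames are illustrative stand-ins for `dfContestents` and `dfSeasons`, not the scraped data, and the season 2 gap is deliberately invented to show the behaviour.

```python
import pandas as pd

# Toy stand-ins for dfContestents and dfSeasons (values are illustrative only)
contestants = pd.DataFrame({
    "Season": [1, 1, 2],
    "Name": ["A", "B", "C"],
})
seasons = pd.DataFrame({
    "Season": [1],
    "Premiering Date": ["May 31, 2000"],
})

# how='left' keeps every contestant row; seasons without a match get NaN dates
merged = contestants.merge(seasons, on="Season", how="left")
print(merged)
print(merged["Premiering Date"].isna().sum())  # 1 row (season 2) has no date
```

Counting the missing dates after the merge is a quick way to spot seasons that exist on one site but not on the other.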

In [15]:
dfAll.head()

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date
0,1,Survivor: Borneo,Sonja Christopher,63,"Walnut Creek, CA",Gym Teacher / Retired,16th,"May 31, 2000","August 23, 2000"
1,1,Survivor: Borneo,"Bill ""B.B."" Andersen",64,"Mission Hills, KS",Real Estate Developer,15th,"May 31, 2000","August 23, 2000"
2,1,Survivor: Borneo,Stacey Stillman,27,"San Francisco, CA",Attorney,14th,"May 31, 2000","August 23, 2000"
3,1,Survivor: Borneo,Ramona Gray,29,"Edison, NJ",Biochemist,13th,"May 31, 2000","August 23, 2000"
4,1,Survivor: Borneo,Dirk Been,23,"Spring Green, WI",Dairy Farmer,12th,"May 31, 2000","August 23, 2000"


In [16]:
# dfContestents.to_csv('temp')

In [17]:
# # df['text'] = 
# "Season {} titled {} of Survivor, contestent {} of Age {} from hometown {} having profession {} finished {}".format(
#         dfSurvivor['Season'],
#         dfSurvivor['Season Name'],
#         dfSurvivor['Name'],
#         dfSurvivor['Age'],
#         dfSurvivor['Hometown'],
#         dfSurvivor['Profession'],
#         dfSurvivor['Finish'],
#         dfSurvivor['Premiering Date'],
#         dfSurvivor['Finale Date'],
# )
            

<div id="createText" name="createText"></div>


## createText

In [144]:
def createText(x):
    # JANE DOE is WINNER of USA survivor season 99 titled SURVIVOR-99,
    # JANE DOE is 99 years old from MARS and works as ASTRONAUT.
    # SEASON-99 titled SURVIVOR-99 was aired between JAN-99-9999 AND DEC-99-9999


#     The winner of Survivor: Edge of Extinction (season 38) was Chris Underwood. The winner of Survivor: Island of the Idols (season 39) was Tommy Sheehan. 
#     The winner of Survivor: Winners at War (season 40) was Tony Vlachos. 
#     There is no season 44 of Survivor listed as of May 2020.

# Heidi Lagares-Greenblatt  finishes Runner-Up in Season 44 of Survivor USA titled 44, 
#     Heidi Lagares-Greenblatt is 43 years old from Pittsburgh, PA and works as Engineering Manager. 
#     Season 44 titled 44 aired during March 1, 2023 and May 24, 2023
    

# ###

# Dee Valladares is a Winner in Season 45 of Survivor USA titled 45, 
#     Dee Valladares is 26 years old from Miami, FL and works as Entrepreneur. 
#     Season 45 titled 45 aired during September 27, 2023 and December 20, 2023




    text= """{} {} USA Survivor Season {} titled {}, 
    {} is {} years old from {} and works as {}. 
    Season {} titled {} aired during {} and {}
    """

    text=text.format(x['Name'],
                    ('is a ' + str(x['Finish']) + ' of ' 
                        if x['Finish']=='Winner' 
                      else 'finishes ' + str(x['Finish']) + ' in '),
                    x['Season'],
                    x['Season Name'],
                    x['Name'],
                    x['Age'],
                    x['Hometown'],
                    x['Profession'],
                    x['Season'],
                    x['Season Name'],
                    x['Premiering Date'],
                    x['Finale Date'],
                   )
    
        
    return text

In [145]:
# RAG requires a single TEXT column, so we concatenate the columns to form a single descriptive sentence,
# which will then be processed further with OpenAI


dfAll['text'] = dfAll.apply(createText, axis=1)
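
To see what one entry of the `text` column looks like, the same sentence template can be applied to a plain dictionary. This is a simplified sketch using f-strings rather than the notebook's `createText`; `describe` is a hypothetical name, and the sample values are copied from the season 41 rows shown in this notebook.

```python
def describe(row):
    # Winners read "is a Winner of"; everyone else "finishes <place> in"
    if row["Finish"] == "Winner":
        placing = f"is a {row['Finish']} of"
    else:
        placing = f"finishes {row['Finish']} in"
    return (
        f"{row['Name']} {placing} USA Survivor Season {row['Season']} "
        f"titled {row['Season Name']}, "
        f"{row['Name']} is {row['Age']} years old from {row['Hometown']} "
        f"and works as {row['Profession']}. "
        f"Season {row['Season']} titled {row['Season Name']} aired during "
        f"{row['Premiering Date']} and {row['Finale Date']}"
    )

row = {
    "Name": "Eric Abraham", "Finish": "18th", "Season": 41,
    "Season Name": "41", "Age": 51, "Hometown": "San Antonio, TX",
    "Profession": "Cyber Security Analyst",
    "Premiering Date": "September 22, 2021",
    "Finale Date": "December 15, 2021",
}
print(describe(row))
```

Because the function only uses key lookups, it accepts a dictionary as well as a dataframe row passed by `apply(..., axis=1)`.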

            

### Survivor 40 is the last season available in ChatGPT, so we take the data scraped above from Season 41 onward

In [146]:
# filter only season 41 and onward for augmentation (RAG), as rest of the data is available in the model

cond=dfAll['Season'] >= SEASON_41
df = dfAll[cond].copy()

In [147]:
df[df['Season'].eq(SEASON_41)].head() #.loc[17]['text']

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text
731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,"September 22, 2021","December 15, 2021",Eric Abraham finishes 18th in USA Survivor Se...
732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,"September 22, 2021","December 15, 2021",Sara Wilson finishes 17th in USA Survivor Sea...
733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,"September 22, 2021","December 15, 2021",David Voce finishes 16th in USA Survivor Seas...
734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,"September 22, 2021","December 15, 2021",Brad Reese finishes 15th in USA Survivor Seas...
735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,"September 22, 2021","December 15, 2021","Jairus ""JD"" Robinson finishes 14th in USA Sur..."


In [148]:
# # since the RAG requires a single column as TEXT, so we concatenate the columns for form a single descriptive description
# # which will then be processed further with the openai


# text = "Season {} titled {} of Survivor, aired between {} and {}, contestent {} Aged {} from hometown {}, works as a {} and a  {} {}"
# df['text'] = df.apply(lambda x: text.format(
#                                                         x['Season'],
#                                                         x['Season Name'],
#                                                         x['Premiering Date'],
#                                                         x['Finale Date'],
#                                                         x['Name'],
#                                                         x['Age'],
#                                                         x['Hometown'],
#                                                         x['Profession'],
#                                                         ('Winner of Survivor '+str(x['Season']) + ' ' + str(x['Season Name']) 
#                                                                   if x['Finish']=='Winner' 
#                                                                 else 'Ranked'),
#                                                         x['Finish']
#                                                         ),
#                                                     axis=1
#                                                                 )

            

In [149]:
df[df['Season'] == SEASON_41].head() #.loc[1]['text']

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text
731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,"September 22, 2021","December 15, 2021",Eric Abraham finishes 18th in USA Survivor Se...
732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,"September 22, 2021","December 15, 2021",Sara Wilson finishes 17th in USA Survivor Sea...
733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,"September 22, 2021","December 15, 2021",David Voce finishes 16th in USA Survivor Seas...
734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,"September 22, 2021","December 15, 2021",Brad Reese finishes 15th in USA Survivor Seas...
735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,"September 22, 2021","December 15, 2021","Jairus ""JD"" Robinson finishes 14th in USA Sur..."


In [150]:
df.columns

Index(['Season', 'Season Name', 'Name', 'Age', 'Hometown', 'Profession',
       'Finish', 'Premiering Date', 'Finale Date', 'text'],
      dtype='object')

#### Saving the data for later use, so as not to scrape it every time

In [151]:
dfAll.to_csv('Survivor_data.csv')
df.to_csv('Survivor_data_41_onward.csv')

In [152]:
dfAll.head(5)

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text
0,1,Survivor: Borneo,Sonja Christopher,63,"Walnut Creek, CA",Gym Teacher / Retired,16th,"May 31, 2000","August 23, 2000",Sonja Christopher finishes 16th in USA Surviv...
1,1,Survivor: Borneo,"Bill ""B.B."" Andersen",64,"Mission Hills, KS",Real Estate Developer,15th,"May 31, 2000","August 23, 2000","Bill ""B.B."" Andersen finishes 15th in USA Sur..."
2,1,Survivor: Borneo,Stacey Stillman,27,"San Francisco, CA",Attorney,14th,"May 31, 2000","August 23, 2000",Stacey Stillman finishes 14th in USA Survivor...
3,1,Survivor: Borneo,Ramona Gray,29,"Edison, NJ",Biochemist,13th,"May 31, 2000","August 23, 2000",Ramona Gray finishes 13th in USA Survivor Sea...
4,1,Survivor: Borneo,Dirk Been,23,"Spring Green, WI",Dairy Farmer,12th,"May 31, 2000","August 23, 2000",Dirk Been finishes 12th in USA Survivor Seaso...


In [153]:
df = pd.read_csv('Survivor_data_41_onward.csv',index_col=None)
df.head()

Unnamed: 0.1,Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text
0,731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,"September 22, 2021","December 15, 2021",Eric Abraham finishes 18th in USA Survivor Se...
1,732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,"September 22, 2021","December 15, 2021",Sara Wilson finishes 17th in USA Survivor Sea...
2,733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,"September 22, 2021","December 15, 2021",David Voce finishes 16th in USA Survivor Seas...
3,734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,"September 22, 2021","December 15, 2021",Brad Reese finishes 15th in USA Survivor Seas...
4,735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,"September 22, 2021","December 15, 2021","Jairus ""JD"" Robinson finishes 14th in USA Sur..."


## Inspect the non-customized results

In [154]:
from openai import OpenAI 
import getpass # using getpass to keep the key secret



In [161]:
apiKey = getpass.getpass("Enter OpenAI API key: ")
# openai.api_key = api_key # "YOUR API KEY"
client = OpenAI(
    api_key=apiKey
)

In [162]:
model = COMPLETION_MODEL_NAME
prompt = ""
max_tokens = MAX_TOKENS

In [163]:


def get_AI_response(model, prompt, max_tokens=MAX_TOKENS):
    
    """
    Returns the OpenAI completion text for a prompt.
    Wrapping the call in one function also makes it easy to swap in a different provider.
    
    ARGUMENTS:
    model      : LLM model; must be one supported by the service used
    prompt     : prompt (question) for the model to respond to
    max_tokens : maximum number of tokens to return
    """
    
    # client = OpenAI()
    openai_response  = client.completions.create(
        model     = model,
        prompt    = prompt,
        max_tokens= max_tokens
    )
    
    # response = openai_response.choices[0]
    # ["choices"][0]["text"].strip()
    
    return openai_response.choices[0].text.strip()

https://community.openai.com/t/getting-error-with-completion-object-in-openai-1-0-0/618638/2
<br>
https://github.com/openai/openai-python/discussions/742

In [164]:
# # client.completions.create(  model=model, 
# #                           prompt=prompt_latest_season,
# #                           max_tokens=150)

# Completion(id='cmpl-9R7WSIW8CXRLXKUq9UUNrQai0mOIr', 
#            choices=[CompletionChoice(finish_reason='stop', 
#                                      index=0, 
#                                      logprobs=None, text
#                                      ='As of 2021, the latest season of Survivor USA is Season 41, which is set to premiere on September 22, 2021.')], 
#                                      created=1716250800, 
#                                      model='gpt-3.5-turbo-instruct', 
#                                      object='text_completion', 
#                                      system_fingerprint=None, 
#                                      usage=CompletionUsage(completion_tokens=31, prompt_tokens=15, total_tokens=46)
#                                      )

### reviewing the model response prior to adding newer data to the prompt as context

In [165]:
promptsList = []
promptsList.append("""
Question: "What is the latest season of survivor USA?"
Answer:
""")
promptsList.append("""
Question: "Who won the season 40 of survivor USA?"
Answer:
""")
promptsList.append("""
Question: "Who won the season 42 of survivor USA?"
Answer:
""")
promptsList.append("""
Question: "Who was the winner of the season 32 of survivor USA?"
Answer:
""")
promptsList.append("""
Question: "Who won first person eliminated in season 31 of survivor USA?"
Answer:
""")

promptsList

['\nQuestion: "What is the latest season of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who won the season 40 of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who won the season 42 of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who was the winner of the season 32 of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who won first person eliminated in season 31 of survivor USA?"\nAnswer:\n']

In [166]:
promptResponses = []
for p in promptsList:
    response = get_AI_response( model=COMPLETION_MODEL_NAME, 
                                prompt=p, 
                                max_tokens=MAX_TOKENS)
    promptResponses.append(response)
    print(p)
    print(response)



Question: "What is the latest season of survivor USA?"
Answer:

As of December 2020, the latest season of Survivor USA is Season 40, also known as Survivor: Winners at War.

Question: "Who won the season 40 of survivor USA?"
Answer:

The winner of Survivor USA Season 40: Winners at War was Tony Vlachos.

Question: "Who won the season 42 of survivor USA?"
Answer:

The winner of season 42 of survivor USA has not yet been announced, as it is a hypothetical season that has not aired yet. Each season of Survivor features a different group of contestants and is given a different title, so season 42 has not yet been confirmed.

Question: "Who was the winner of the season 32 of survivor USA?"
Answer:

The winner of Survivor: Kaôh Rōng (season 32) was Michele Fitzgerald.

Question: "Who won first person eliminated in season 31 of survivor USA?"
Answer:

Kaitlyn Anderson


In [167]:
promptsList

['\nQuestion: "What is the latest season of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who won the season 40 of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who won the season 42 of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who was the winner of the season 32 of survivor USA?"\nAnswer:\n',
 '\nQuestion: "Who won first person eliminated in season 31 of survivor USA?"\nAnswer:\n']

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Generating embeddings

We'll use the embeddings API described in the OpenAI documentation to create vectors representing each row of our custom dataset.

In order to avoid a `RateLimitError`, we'll send our data in batches to `client.embeddings.create`.


In [168]:
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    # response = openai.Embedding.create(
    #     input=df.iloc[i:i+batch_size]["text"].tolist(),
    #     engine=EMBEDDING_MODEL_NAME
    # )
    response = client.embeddings.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    # break
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

df["embeddings"] = embeddings
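
The batch boundaries used by the loop can be checked without calling the API. The sketch below is a hypothetical helper (`batch_ranges` is not part of the notebook) that yields the same slice bounds for the 108 rows in this dataset.

```python
def batch_ranges(n_rows, batch_size):
    # Yield (start, stop) slice bounds covering n_rows in order
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

print(list(batch_ranges(108, 100)))  # [(0, 100), (100, 108)]
```

With a batch size of 100, the 108 rows therefore need two embedding requests.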

In [169]:

# Add embeddings list to dataframe

df.head(5)

Unnamed: 0.1,Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text,embeddings
0,731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,"September 22, 2021","December 15, 2021",Eric Abraham finishes 18th in USA Survivor Se...,"[0.01449219137430191, -0.006638483610004187, -..."
1,732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,"September 22, 2021","December 15, 2021",Sara Wilson finishes 17th in USA Survivor Sea...,"[0.0195144172757864, -0.01527693122625351, -0...."
2,733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,"September 22, 2021","December 15, 2021",David Voce finishes 16th in USA Survivor Seas...,"[0.030623018741607666, -0.0020657856948673725,..."
3,734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,"September 22, 2021","December 15, 2021",Brad Reese finishes 15th in USA Survivor Seas...,"[0.009501384571194649, -0.021842408925294876, ..."
4,735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,"September 22, 2021","December 15, 2021","Jairus ""JD"" Robinson finishes 14th in USA Sur...","[0.035467758774757385, -0.03187376260757446, -..."


In [170]:
# for data in response.data:
#     print (data.embedding) #0].embedding)

In [171]:
df.columns

Index(['Unnamed: 0', 'Season', 'Season Name', 'Name', 'Age', 'Hometown',
       'Profession', 'Finish', 'Premiering Date', 'Finale Date', 'text',
       'embeddings'],
      dtype='object')

In [172]:
df.to_csv("embeddings.csv")

df = df[['text', 'embeddings']].copy() 

In [173]:
# df = pd.read_csv("embeddings.csv")
# # df.head()

In [174]:
# import numpy as np

# df = df[['text', 'embeddings']].copy()
# df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)

In [175]:
len(df)

108


### Step 2: Create a Function that Finds Related Pieces of Text for a Given Question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from most relevant to least relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.
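
As a rough illustration of the comparison, cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing the same way have distance 0. The sketch below uses tiny hand-made 2-D vectors in place of real embeddings; the vector values and the `cosine_distance` helper are assumptions for illustration only.

```python
import math

def cosine_distance(a, b):
    # 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more relevant, so sort in ascending order
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Note that vector length does not matter here: `[2.0, 0.0]` is as close to the question as `[1.0, 0.0]` would be.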


In [176]:
def get_embedding(text, model=EMBEDDING_MODEL_NAME):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], 
                                   model=model).data[0].embedding


# https://community.openai.com/t/embeddings-utils-distance-formulas-where-did-it-move/479868/8
from typing import List#, Optional
from scipy import spatial

def distances_from_embeddings(
    query_embedding: List[float],
    embeddings: List[List[float]],
    distance_metric="cosine",
) -> List[float]:
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances

In [177]:
# openai.embeddings_utils is no longer available in openai versions later than 0.28
# from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, param_df=df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = param_df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric=DISTANCE_METRIC
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy


### Step 3: Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to create a text prompt that provides context to a Completion model in order to help it answer a question. The outline of the prompt looks like this:

Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the Completion model, which is currently about 4,000 (4,096 for `gpt-3.5-turbo-instruct`). So we'll loop over the rows in order of relevance, counting the tokens as we go, and stop when we hit the limit. Then we'll join that list of text data into a single string and add it to the prompt.
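
The budget loop can be sketched without a tokenizer. In the sketch below, `fill_context` is a hypothetical helper and the whitespace word count is only a stand-in for a real token count (the notebook uses `tiktoken`); `used` represents the tokens already spent on the template and question.

```python
def fill_context(rows, budget, count_tokens, used=0):
    """Collect rows in order until adding the next one would exceed budget."""
    context = []
    for text in rows:
        used += count_tokens(text)
        if used > budget:
            break
        context.append(text)
    return context

# Assumption: whitespace word count as a stand-in for a real tokenizer
def word_count(text):
    return len(text.split())

rows = ["a b c", "d e", "f g h i", "j"]
print(fill_context(rows, budget=6, count_tokens=word_count, used=1))
# ['a b c', 'd e']
```

The loop stops at the first row that does not fit, so the short row `"j"` is left out even though it would fit on its own. That keeps the context strictly in relevance order.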


In [178]:
# EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

# def get_embedding(text, model=EMBEDDING_MODEL_NAME):
#     text = text.replace("\n", " ")
#     return openai.Embedding.create(input = [text], model=model).data[0].embedding


# # df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-3-small'))
# # df.to_csv('output/embedded_1k_reviews.csv', index=False)

###### https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken

| Encoding name | OpenAI models |
| --- | --- |
| cl100k_base | gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large |

In [179]:
import tiktoken

def create_prompt(question, param_df=df, max_token_count=MAX_TOKENS):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding(ENCODING_MODEL)
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    
    prompt_current_model = """
Question: {}
Answer:
"""
    df_copy = param_df.copy()
    
    prompt_current_model=prompt_current_model.format(question)

    current_model_answer = get_AI_response(COMPLETION_MODEL_NAME,
                                   prompt_current_model,
                                   max_tokens=MAX_TOKENS)

    
    embedding= client.embeddings.create(input = [current_model_answer],
                                        model = EMBEDDING_MODEL_NAME).data[0].embedding
    
    
    # append the base model's answer to the dataframe, so the existing answer can be used
    # alongside the new context to produce the best response
    
    df_copy.loc[len(df_copy), ['text','embeddings']] = [current_model_answer, embedding]

    
    context = []
    # Start the count with the tokens used by the prompt template and the question
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question)) 
    
    for text in get_rows_sorted_by_relevance(question, df_copy)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
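
The idea of adding the baseline model's own answer as one more candidate row before ranking can be shown in isolation. This is a hedged sketch in plain Python rather than `pandas`: `augment_with_baseline` and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder for an embeddings API call, not a real embedding.

```python
def augment_with_baseline(rows, baseline_answer, embed):
    """Return a new list: the dataset rows plus the baseline answer as one extra row."""
    augmented = list(rows)  # shallow copy so the caller's list is left untouched
    augmented.append({"text": baseline_answer, "embeddings": embed(baseline_answer)})
    return augmented


def fake_embed(text):
    # Assumption: toy 2-number vector standing in for a real embedding.
    return [float(len(text)), float(text.count(" "))]


season_41 = "Season 41 titled 41 aired during September 22, 2021 and December 15, 2021"
dataset = [{"text": season_41, "embeddings": fake_embed(season_41)}]

baseline = "As of September 2021, the latest season of Survivor USA is season 41."
candidates = augment_with_baseline(dataset, baseline, fake_embed)
print(len(dataset), len(candidates))
```

Working on a copy keeps the original dataset unchanged between questions, so one question's baseline answer never leaks into the context for the next.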

In [184]:


def answer_question(
    question, param_df=df, max_prompt_tokens=1800, max_answer_tokens=MAX_TOKENS
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, param_df, max_prompt_tokens)
    
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens,
            temperature=TEMPERATURE
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""
    

## Custom Performance Demonstration


In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
<br>

Five questions and their responses from the current model and the custom RAG-backed query are captured below via a `for` loop.<BR>
The loop was executed three times; for the third run, the temperature was set to 0.2, which provided better responses than the prior two attempts.

In [181]:
customResponses = []
for idx, p in enumerate(promptsList):
    response = answer_question(p, df)
    customResponses.append(response)
    print(p)
    print("          Current Model Response : {}".format(promptResponses[idx]))
    print("\n          Custom RAG Response : {}".format(response))
    print("\n----------------------------------------------")



Question: "What is the latest season of survivor USA?"
Answer:

          Current Model Response : As of December 2020, the latest season of Survivor USA is Season 40, also known as Survivor: Winners at War.

          Custom RAG Response : The latest season of survivor USA is Survivor: Winners at War, which aired in February-May 2020.

----------------------------------------------

Question: "Who won the season 40 of survivor USA?"
Answer:

          Current Model Response : The winner of Survivor USA Season 40: Winners at War was Tony Vlachos.

          Custom RAG Response : Tony Vlachos

----------------------------------------------

Question: "Who won the season 42 of survivor USA?"
Answer:

          Current Model Response : The winner of season 42 of survivor USA has not yet been announced, as it is a hypothetical season that has not aired yet. Each season of Survivor features a different group of contestants and is given a different title, so season 42 has not yet been confi

In [182]:
## Second Run

customResponses = []
for idx, p in enumerate(promptsList):
    response = answer_question(p, df)
    customResponses.append(response)
    print(p)
    print("          Current Model Response : {}".format(promptResponses[idx]))
    print("\n          Custom RAG Response : {}".format(response))
    print("\n----------------------------------------------")



Question: "What is the latest season of survivor USA?"
Answer:

          Current Model Response : As of December 2020, the latest season of Survivor USA is Season 40, also known as Survivor: Winners at War.

          Custom RAG Response : The latest season of Survivor USA is Season 46, titled "46".

----------------------------------------------

Question: "Who won the season 40 of survivor USA?"
Answer:

          Current Model Response : The winner of Survivor USA Season 40: Winners at War was Tony Vlachos.

          Custom RAG Response : Tony Vlachos

----------------------------------------------

Question: "Who won the season 42 of survivor USA?"
Answer:

          Current Model Response : The winner of season 42 of survivor USA has not yet been announced, as it is a hypothetical season that has not aired yet. Each season of Survivor features a different group of contestants and is given a different title, so season 42 has not yet been confirmed.

          Custom RAG Response

In [186]:
# third attempt

customResponses = []
for idx, p in enumerate(promptsList):
    response = answer_question(p, df)
    customResponses.append(response)
    print(p)
    print("          Current Model Response : {}".format(promptResponses[idx]))
    print("\n          Custom RAG Response : {}".format(response))
    print("\n----------------------------------------------")



Question: "What is the latest season of survivor USA?"
Answer:

          Current Model Response : As of December 2020, the latest season of Survivor USA is Season 40, also known as Survivor: Winners at War.

          Custom RAG Response : The latest season of Survivor USA is season 46, titled "Survivor: Winners at War."

----------------------------------------------

Question: "Who won the season 40 of survivor USA?"
Answer:

          Current Model Response : The winner of Survivor USA Season 40: Winners at War was Tony Vlachos.

          Custom RAG Response : Tony Vlachos

----------------------------------------------

Question: "Who won the season 42 of survivor USA?"
Answer:

          Current Model Response : The winner of season 42 of survivor USA has not yet been announced, as it is a hypothetical season that has not aired yet. Each season of Survivor features a different group of contestants and is given a different title, so season 42 has not yet been confirmed.

       

## THE END

In [142]:
q= "What is the latest season of survivor USA?"
p = create_prompt(q, df, 500)

In [143]:
print(p)


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know""

Context: 

As of September 2021, the latest season of Survivor USA is season 41, which is set to premiere on September 22, 2021.

###

Swati Goel finishes 14th in  Survivor Season 42 of Survivor USA titled 42, 
    Swati Goel is 19 years old from Palo Alto, CA and works as Ivy League Student. 
    Season 42 titled 42 aired during March 9, 2022 and May 25, 2022
    

###

Sydney Segal finishes 12th in  Survivor Season 41 of Survivor USA titled 41, 
    Sydney Segal is 26 years old from Brooklyn, NY and works as Law Student. 
    Season 41 titled 41 aired during September 22, 2021 and December 15, 2021
    

###

Sarah Wade finishes 14th in  Survivor Season 44 of Survivor USA titled 44, 
    Sarah Wade is 27 years old from Rochester, MN and works as Management Consultant. 
    Season 44 titled 44 aired during March 1, 2023 and May 24, 2023
    

###

Cassidy C

In [137]:
get_rows_sorted_by_relevance(q, df)


| Unnamed: 0 | text | embeddings | distances |
|---|---|---|---|
| 22 | Swati Goel finishes 14th in Survivor Season 4... | [-0.025238176807761192, -0.02735145390033722, ... | 0.450127 |
| 6 | Sydney Segal finishes 12th in Survivor Season... | [0.013477068394422531, -0.025735730305314064, ... | 0.459233 |
| 58 | Sarah Wade finishes 14th in Survivor Season 4... | [-0.02776567079126835, -0.030176833271980286, ... | 0.464100 |
| 52 | Cassidy Clark finishes Runner-Up in Survivor ... | [0.03023993782699108, -0.013639522716403008, -... | 0.468801 |
| 19 | Zachary "Zach" Wurtenberger finishes 17th in ... | [0.037157200276851654, -0.02367389388382435, -... | 0.469998 |
| ... | ... | ... | ... |
| 64 | Frannie Marin finishes 8th in Survivor Season... | [-0.02440713904798031, -0.019201941788196564, ... | 0.545530 |
| 101 | Venus Vafa finishes 7th in Survivor Season 46... | [-0.0469750240445137, -0.01781335100531578, -0... | 0.549708 |
| 71 | Yamil "Yam Yam" Arocho is a Winner of Survivo... | [-0.01828192174434662, 0.010416443459689617, -... | 0.551980 |
| 33 | Romeo Escobar finishes 2nd Runner-up in Survi... | [0.0016588345170021057, -0.011710562743246555,... | 0.556590 |
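
The `distances` column is sorted in ascending order, so the most relevant rows come first. Assuming these values are cosine distances (as the `DISTANCE_METRIC = "cosine"` setting suggests), the ranking can be sketched with toy vectors. `cosine_distance` and `sort_by_relevance` are hypothetical helpers written for this illustration, not the notebook's `get_rows_sorted_by_relevance`.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sort_by_relevance(question_vec, rows):
    """rows: list of (text, embedding) pairs. Returns (text, distance) pairs, nearest first."""
    scored = [(text, cosine_distance(question_vec, emb)) for text, emb in rows]
    return sorted(scored, key=lambda pair: pair[1])


# Toy 2-dimensional "embeddings" (assumption: real embeddings have thousands of dimensions)
toy_question = [1.0, 0.0]
toy_rows = [
    ("row pointing away", [-1.0, 0.0]),
    ("row pointing the same way", [1.0, 0.0]),
    ("row at a right angle", [0.0, 1.0]),
]
ranked = sort_by_relevance(toy_question, toy_rows)
for text, distance in ranked:
    print(f"{distance:.3f}  {text}")
```

A smaller distance means the row's embedding points in a direction closer to the question's embedding, which is why the rows at the top of the table are the first ones packed into the prompt.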
