# Custom Chatbot Project

Survivor is an American reality series running since May of 2000 with two seasons every year. Its quiet popular, seeing its a continous airing for over 24 years.

pi.ai and chatgpt.open.ai only has the data up until 2021.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

## Import the support libraries
 requests : to load the webpage <br>
 pandas : to capture the data in a tabular format<br>
 re : regular expression to cleanup and extract the required data<br>
 BeautifulSoup : to wrangle the HTML data

In [1]:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup

Two different websites are used to capture the required data <br>
Wikipedia provides the contestants details that include season, age, home town and their rank in the season<br>
fandom to capture the air dates of the episodes.<br><br>

combining the two will provide us the information on the start, end of the season, contestant and their ranks

In [2]:
contestants_url = "https://en.wikipedia.org/wiki/List_of_Survivor_(American_TV_series)_contestants"
seasons_url="https://gameshows.fandom.com/wiki/Survivor/Airdates"

### scrape data from the wikipedia

using requests to get the HTML

In [3]:
resp = requests.get(contestants_url)
resp

<Response [200]>

Using beautifulSoup to parse the response into an soup object, which we later use to traverse the document

In [4]:
soup = BeautifulSoup(resp.text,'html.parser')

the 46 seasons (as of this project) are captured under 10 separate tables (TABLE tag with TH,TR, TD) </br>
Each **table**, has **tbody**, which in turn contains the **tr** table-row <br>
each table row has columns as **th** and **td** <br>
we iterate through table, rows and combine the columns within a row to form a list, which is appended to a dataframe

In [5]:
tables=soup.find_all('table', {'class' : 'wikitable'})

In [6]:
# create an empty dataframe to hold the data
dfContestents = pd.DataFrame()
# pat = "[^\d]*"

# cleanup the season number, which has spaces and an exclamation point. 
# we use this pattern from extract only the digits, as many as they are
pat = "[\d]+"

# enumerate, and iterate through the tables list
for idx_table, table in enumerate(tables):

#     if idx_table==9:    # this is to elimnate the season 46, to avoid spoiler alert, I'm catching up the episides :)
#         continue

    
    # capture the tbody with the tables, one by one, we don't need all of them, just the immediate next one
    # this is where beautifulsoap makes processing HTML simpler
    tbody = table.find('tbody') 
    

    # find all rows under the tbody
    rows = tbody.find_all('tr')
    
    # initialize the variables
    season_contestants = []
    col_headings=[]
    season = 0

    
    # Iterate the rows
    for idx_row, row in enumerate(rows):
    
        # Select TH and TD
        cols = row.find_all(['th','td'])
        

        
        # this is specific to how the page is structured
        # for some places in contestant name is an hyperlink and therefore make use of TH along
        # with span and anchor 
        # we need to make the following specific to the page being scapred

        lst = []
        
        for idx_col, col in enumerate(cols):
            span   = col.find("span")
            anchor = col.find("a")
                
#             print('col = ', col.text.strip())
        
            # Capture the season number ana name, which only appears once against all the contestants
            if  (span!=None 
                 and anchor!=None 
                 and "Survivor" in anchor['title']
                 and "Survivor contestant" not in anchor['title']
                ):
            
#                 season = re.sub(pat,"", span['data-sort-value']).strip()
                season = re.findall(pat, span['data-sort-value']) 
                if len(season) > 0:
                    season = int(season[0])
                    season_name = col.text.strip()

                continue
                
            # value of the columnar table in HTML
            else: 
                val = col.text.strip()
          

            # make a list of all the HTML column within a single row
            lst.append(val)
        
        # special handling, the first row in each table has headings
        # so we are capturing it by referring to the first row in the list
        # since, season number and name, only appear once for a season,
        # these are added against each row prior to reading the next row
        
        if (idx_row == 0 ):
            col_headings = lst
        else:
            lst.insert(0, season)
            lst.insert(1, season_name)
            season_contestants.append(lst)

#     print("column headings = ", col_headings)
#     print ("season plus list ", season_contestants)
    
    # update the heading to include season name
    col_headings.insert(1, "Season Name")

    # finally, we add all the rows within a table into pandas dataframe before processing the next table
    # remember there are multiple tables with a page, breaking the seasons into 5 seasons per table
    dfContestents = pd.concat([dfContestents,
                               pd.DataFrame(season_contestants, columns=col_headings)
                              ])



In [7]:
col_headings

['Season', 'Season Name', 'Name', 'Age', 'Hometown', 'Profession', 'Finish']

In [8]:
dfContestents.head()

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish
0,1,Survivor: Borneo,Sonja Christopher,63,"Walnut Creek, CA",Gym Teacher / Retired,16th
1,1,Survivor: Borneo,"Bill ""B.B."" Andersen",64,"Mission Hills, KS",Real Estate Developer,15th
2,1,Survivor: Borneo,Stacey Stillman,27,"San Francisco, CA",Attorney,14th
3,1,Survivor: Borneo,Ramona Gray,29,"Edison, NJ",Biochemist,13th
4,1,Survivor: Borneo,Dirk Been,23,"Spring Green, WI",Dairy Farmer,12th


### Tried to use selenium, but its not available in udacity workspace

In [9]:
# !pip install selenium --upgrade

In [10]:
# dfContestents['Season'].astype(str).astype(int)

In [11]:
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options

# options = Options()
# options.headless = True
# options.add_argument("--window-size=1920,1080")

# driver = webdriver.Chrome(options=options, executable_path='/path/to/chromedriver')
# driver.get(seasons)
# page_source = driver.page_source
# # with open("udacity_home.html", "w") as f:
# #     f.write(page_source)
# driver.quit()

### scrape data from fandom
grab the season airdates from fandom and follow same steps that were applied for wiki (contestants) pages

In [12]:
seasons_url="https://gameshows.fandom.com/wiki/Survivor/Airdates"

In [13]:
resp = requests.get(seasons_url)
resp

<Response [200]>

In [14]:
soup = BeautifulSoup(resp.text,'html.parser')
# print(soup.prettify())

#### same steps as used for scraping the wikipedia, only simpler
here the Table has rows (TR), which comprises of TH and TD

In [15]:
table = soup.find('table', {'class' : 'wikitable'})

In [16]:
rows = table.find_all('tr')

seasons_list = []
for idx_row, row in enumerate(rows):
    
    cols = row.find_all(['th','td'])
    lst = []
    for col in cols:
        lst.append(col.text.strip())
        
    seasons_list.append(lst)

# print(seasons_list)
 

In [17]:
seasons_list[0]

['Season #', 'Premiering Date', 'Finale Date']

In [18]:
# create a dataframe from the season list, the first row contains the heading and rest i s the data
dfSeasons = pd.DataFrame(seasons_list[1:], columns=seasons_list[0])
dfSeasons.head(5)

Unnamed: 0,Season #,Premiering Date,Finale Date
0,1,"May 31, 2000","August 23, 2000"
1,2,"January 28, 2001","May 3, 2001"
2,3,"October 11, 2001","January 10, 2002"
3,4,"February 28, 2002","May 19, 2002"
4,5,"September 19, 2002","December 19, 2002"


In [19]:
# # change the data format to yyyy-mm-dd for both the premiering and the finale dates
# dfSeasons['Premiering Date']=pd.to_datetime(dfSeasons['Premiering Date'], format='%B %d, %Y')
# dfSeasons['Finale Date']    =pd.to_datetime(dfSeasons['Finale Date']    , format='%B %d, %Y')
# dfSeasons.head(5)

In [20]:
# before merging the two dataframes, and for simplicity make the name of the column used for merging the same
# Change the datatype of the season to integer

dfSeasons.rename(columns={'Season #': 'Season'}, inplace=True)

dfSeasons['Season']     = dfSeasons['Season'].astype(str).astype(int)
dfContestents['Season'] = dfContestents['Season'].astype(str).astype(int)

In [21]:
# merge the two dataframes to create the single dataframe, which now has contestant as well as season start/end date

dfSurvivor=dfContestents.merge(dfSeasons,
                    on  ='Season',
                    how = 'left',
                   )

In [22]:
dfSurvivor.head()

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date
0,1,Survivor: Borneo,Sonja Christopher,63,"Walnut Creek, CA",Gym Teacher / Retired,16th,"May 31, 2000","August 23, 2000"
1,1,Survivor: Borneo,"Bill ""B.B."" Andersen",64,"Mission Hills, KS",Real Estate Developer,15th,"May 31, 2000","August 23, 2000"
2,1,Survivor: Borneo,Stacey Stillman,27,"San Francisco, CA",Attorney,14th,"May 31, 2000","August 23, 2000"
3,1,Survivor: Borneo,Ramona Gray,29,"Edison, NJ",Biochemist,13th,"May 31, 2000","August 23, 2000"
4,1,Survivor: Borneo,Dirk Been,23,"Spring Green, WI",Dairy Farmer,12th,"May 31, 2000","August 23, 2000"


### Survivor 40 is the last season available in chatGPT, so we take the data from Season 41 onward scraped above

In [23]:
cond=dfSurvivor['Season'] >= 41
dfSurvivor_copy = dfSurvivor[cond].copy()

In [24]:
dfSurvivor_copy[dfSurvivor_copy['Season'].eq(41)]#.loc[17]['text']

Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date
731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,"September 22, 2021","December 15, 2021"
732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,"September 22, 2021","December 15, 2021"
733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,"September 22, 2021","December 15, 2021"
734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,"September 22, 2021","December 15, 2021"
735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,"September 22, 2021","December 15, 2021"
736,41,41,Genie Chen,46,"Portland, OR",Grocery Clerk,13th,"September 22, 2021","December 15, 2021"
737,41,41,Sydney Segal,26,"Brooklyn, NY",Law Student,12th,"September 22, 2021","December 15, 2021"
738,41,41,Tiffany Seely,47,"Plainview, NY",Teacher,11th,"September 22, 2021","December 15, 2021"
739,41,41,Naseer Muttalif,37,"Morgan Hill, CA",Sales Manager,10th,"September 22, 2021","December 15, 2021"
740,41,41,"Evelyn ""Evvie"" Jagoda",28,"Arlington, MA",PhD Student,9th,"September 22, 2021","December 15, 2021"


In [25]:
def createText(x):
    # JANE DOE is WINNER of survivor season 99 titled SURVIVOR-99,
    # JANE DOE is 99 years old from MARS and works as ASTRONAUT.
    # SEASON-99 titled SURVIVOR-99 was aired between JAN-99-9999 AND DEC-99-9999
    text= """{} {} of Survivor Season {} titled {}, 
    {} is {} years old from {} and works as {}. 
    Season {} titled {} aired during {} and {}
    """

    text=text.format(x['Name'],
                    str(x['Finish'])+ ' is ' if x['Finish']=='Winner' else ' finishes ' + str(x['Finish']),
                    x['Season'],
                    x['Season Name'],
                    x['Name'],
                    x['Age'],
                    x['Hometown'],
                    x['Profession'],
                    x['Season'],
                    x['Season Name'],
                    x['Premiering Date'],
                    x['Finale Date'],
                   )
    
        
    return text

In [26]:
print(dfSurvivor_copy.apply(createText,axis=1) )

731    Eric Abraham  finishes 18th of Survivor Season...
732    Sara Wilson  finishes 17th of Survivor Season ...
733    David Voce  finishes 16th of Survivor Season 4...
734    Brad Reese  finishes 15th of Survivor Season 4...
735    Jairus "JD" Robinson  finishes 14th of Survivo...
                             ...                        
834    Ben Katzman  finishes TBA of Survivor Season 4...
835    Charlie Davis  finishes None of Survivor Seaso...
836    Kenzie Petty  finishes None of Survivor Season...
837    Liz Wilcox  finishes None of Survivor Season 4...
838    Maria Shrime Gonzalez  finishes None of Surviv...
Length: 108, dtype: object


In [27]:
# # since the RAG requires a single column as TEXT, so we concatenate the columns for form a single descriptive description
# # which will then be processed further with the openai


# text = "Season {} titled {} of Survivor, aired between {} and {}, contestent {} Aged {} from hometown {}, works as a {} and a  {} {}"
# dfSurvivor_copy['text'] = dfSurvivor_copy.apply(lambda x: text.format(
#                                                         x['Season'],
#                                                         x['Season Name'],
#                                                         x['Premiering Date'],
#                                                         x['Finale Date'],
#                                                         x['Name'],
#                                                         x['Age'],
#                                                         x['Hometown'],
#                                                         x['Profession'],
#                                                         ('Winner of Survivor '+str(x['Season']) + ' ' + str(x['Season Name']) 
#                                                                   if x['Finish']=='Winner' 
#                                                                 else 'Ranked'),
#                                                         x['Finish']
#                                                         ),
#                                                     axis=1
#                                                                 )

            

In [28]:
# since the RAG requires a single column as TEXT, so we concatenate the columns for form a single descriptive description
# which will then be processed further with the openai


text = "Season {} titled {} of Survivor, aired between {} and {}, contestent {} Aged {} from hometown {}, works as a {} and a  {} {}"
dfSurvivor_copy['text'] = dfSurvivor_copy.apply(createText, axis=1)

            

In [29]:
dfSurvivor_copy[dfSurvivor_copy['Season'] == 41].loc[732]['text']

'Sara Wilson  finishes 17th of Survivor Season 41 titled 41, \n    Sara Wilson is 24 years old from Boston, MA and works as Healthcare Consultant. \n    Season 41 titled 41 aired during September 22, 2021 and December 15, 2021\n    '

In [30]:
dfSurvivor_copy.columns

Index(['Season', 'Season Name', 'Name', 'Age', 'Hometown', 'Profession',
       'Finish', 'Premiering Date', 'Finale Date', 'text'],
      dtype='object')

#### Saving the data for later use and so as not to scrape it everytime

In [31]:
dfSurvivor.to_csv('Survivor_data.csv')
dfSurvivor_copy.to_csv('Survivor_data_41_onward.csv')

In [90]:
dfSurvivor_copy = pd.read_csv('Survivor_data_41_onward.csv')
dfSurvivor_copy.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text
0,0,731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between..."
1,1,732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between..."
2,2,733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between..."
3,3,734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between..."
4,4,735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between..."


## Inspect the non-customized results

In [91]:
import openai
import getpass # using getpass to keep the key secret

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

In [92]:
api_key = getpass.getpass("Enter openAPI key")
openai.api_key = api_key # "YOUR API KEY"

Enter openAPI key········


In [93]:
model = "gpt-3.5-turbo-instruct"
prompt = ""
max_tokens = 150

In [94]:


def get_AI_response(model, prompt, max_tokens):
    
    """
    Returns the openai response.
    this also enables to use a different provider, if one so chooses
    
    ARGUMENTS:
    model      : LLM model, must be among the ones supported by the service used
    prompt     : prompt (question) for the model to respond
    max_tokens : Maximum number of tokens to return
    """
    
    openai_response  = openai.Completion.create(
        model     = model,
        prompt    = prompt,
        max_tokens= max_tokens
    )["choices"][0]["text"].strip()
    
    return openai_response

### reviewing the model response prior to adding newer data to the prompt as context


In [139]:
question_latest_season    = "What is the latest season of survivor USA?"
question_season_42_winner = "who won the Season 42 of the Survivor USA?"
question_season_44_runnerup = "who was the runner-up of season 44 of Survivor USA?"

In [128]:
prompt_latest_season = """
Question: {}
Answer:
""".format(question_latest_season)

prompt_latest_season_answer = get_AI_response( model, 
                                                    prompt_latest_season, 
                                                    max_tokens)
print(prompt_latest_season_answer)

The latest season of Survivor USA is season 40, titled "Winners at War." It aired in early 2020.


In [129]:
prompt_season_42_winner = """
Question: {}
Answer:
""".format(question_season_42_winner)

prompt_season_42_winner_answer = get_AI_response( model, 
                                                    prompt_latest_season, 
                                                    max_tokens)
print(prompt_season_42_winner_answer)

As of October 2021, the latest season of Survivor USA is Survivor: Island of the Idols, which aired in Fall 2019. The 41st season, Survivor: Winners at War, aired in Spring 2020.


In [130]:
prompt_season_44_runnerup = """
Question: {}
Answer:
""".format(question_season_44_runnerup)

prompt_season_44_runnerup_answer = get_AI_response( model, 
                                                    prompt_latest_season, 
                                                    max_tokens)
print(prompt_season_44_runnerup_answer)

As of September 2021, the latest season of Survivor USA is Season 41, which is set to premiere on September 22nd, 2021 on CBS.


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Generating embeddings

We'll use the Embedding tooling from OpenAI documentation here to create vectors representing each row of our custom dataset.

In order to avoid a RateLimitError we'll send our data in batches to the Embedding.create function.


In [98]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

In [99]:
batch_size = 100
embeddings = []
for i in range(0, len(dfSurvivor_copy), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=dfSurvivor_copy.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
dfSurvivor_copy["embeddings"] = embeddings
dfSurvivor_copy.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Season,Season Name,Name,Age,Hometown,Profession,Finish,Premiering Date,Finale Date,text,embeddings
0,0,731,41,41,Eric Abraham,51,"San Antonio, TX",Cyber Security Analyst,18th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between...","[0.0009507008944638073, -0.030369749292731285,..."
1,1,732,41,41,Sara Wilson,24,"Boston, MA",Healthcare Consultant,17th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between...","[0.012535907328128815, -0.008541460148990154, ..."
2,2,733,41,41,David Voce,35,"Chicago, IL",Neurosurgeon,16th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between...","[-0.01699759252369404, -0.01152073871344328, 0..."
3,3,734,41,41,Brad Reese,50,"Shawnee, WY",Rancher,15th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between...","[-0.011191071011126041, -0.019860779866576195,..."
4,4,735,41,41,"Jairus ""JD"" Robinson",20,"Oklahoma City, OK",College Student,14th,2021-09-22,2021-12-15,"Season 41 titled 41 of Survivor, aired between...","[-0.012426728382706642, -0.02102375216782093, ..."


In [100]:
dfSurvivor_copy.to_csv("embeddings.csv")

df = dfSurvivor_copy[['text', 'embeddings']].copy()

In [101]:
# dfSurvivor_copy = pd.read_csv("embeddings.csv")
# # dfSurvivor_copy.head()

In [102]:
# import numpy as np

# df = dfSurvivor_copy[['text', 'embeddings']].copy()
# df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)

In [103]:
len(dfSurvivor_copy)

108


### Step 2: Create a Function that Finds Related Pieces of Text for a Given Question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from least relevant to most relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.


In [104]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from least to most relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [141]:
question_season_42_winner

'who won the Season 42 of the Survivor USA?'

In [140]:
get_rows_sorted_by_relevance(question_season_42_winner, df)

Unnamed: 0,text,embeddings,distances
69,"Season 44 titled 44 of Survivor, aired between...","[0.0011320209596306086, -0.018356913700699806,...",0.142343
70,"Season 44 titled 44 of Survivor, aired between...","[-0.0032948569860309362, -0.019257549196481705...",0.143621
52,"Season 43 titled 43 of Survivor, aired between...","[-0.007554784417152405, -0.010118015110492706,...",0.144845
87,"Season 45 titled 45 of Survivor, aired between...","[-0.006924946792423725, -0.022197775542736053,...",0.145605
27,"Season 42 titled 42 of Survivor, aired between...","[-0.02053735964000225, -0.03299356997013092, -...",0.146603
...,...,...,...
90,"Season 46 titled 46 of Survivor, aired between...","[-0.026984089985489845, -0.030671605840325356,...",0.168511
97,"Season 46 titled 46 of Survivor, aired between...","[0.0030906544998288155, -0.016236059367656708,...",0.172134
103,"Season 46 titled 46 of Survivor, aired between...","[-0.007374738343060017, -0.014268365688621998,...",0.173293
105,"Season 46 titled 46 of Survivor, aired between...","[-0.013762668706476688, -0.02342640981078148, ...",0.173542


In [106]:
get_rows_sorted_by_relevance("Who won the season 42 of survivor USA?", df)

Unnamed: 0,text,embeddings,distances
69,"Season 44 titled 44 of Survivor, aired between...","[0.0011320209596306086, -0.018356913700699806,...",0.138757
70,"Season 44 titled 44 of Survivor, aired between...","[-0.0032948569860309362, -0.019257549196481705...",0.140265
52,"Season 43 titled 43 of Survivor, aired between...","[-0.007554784417152405, -0.010118015110492706,...",0.142574
87,"Season 45 titled 45 of Survivor, aired between...","[-0.006924946792423725, -0.022197775542736053,...",0.144139
27,"Season 42 titled 42 of Survivor, aired between...","[-0.02053735964000225, -0.03299356997013092, -...",0.145049
...,...,...,...
90,"Season 46 titled 46 of Survivor, aired between...","[-0.026984089985489845, -0.030671605840325356,...",0.168694
103,"Season 46 titled 46 of Survivor, aired between...","[-0.007374738343060017, -0.014268365688621998,...",0.172188
97,"Season 46 titled 46 of Survivor, aired between...","[0.0030906544998288155, -0.016236059367656708,...",0.173472
106,"Season 46 titled 46 of Survivor, aired between...","[-0.03036060929298401, -0.020625118166208267, ...",0.173643



### Step 3: Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to select the create a text prompt that provides context to a Completion model in order to help it answer a question. The outline of the prompt looks like this:

Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the Completion model, which is currently 4,000. So we'll loop over the dataset, counting the tokens as we go, and stop when we hit the limit. Then we'll join that list of text data into a single string and add it to the prompt.


In [116]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know""

Context: 

{}

---

Question: {}
Answer:"""

    df_copy = df.copy()
    
    
    context = []  
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question)) 
    
    for text in get_rows_sorted_by_relevance(question, df_copy)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [131]:
print(create_prompt(question_season_42_winner, df , 150))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know""

Context: 

Season 44 titled 44 of Survivor, aired between 2023-03-01 and 2023-05-24, contestent Carolyn Wiger Aged 35 from hometown North St. Paul, MN, works as a Drug Counselor and a  Ranked 2nd Runner-Up

---

Question: what won the Season 42 of the Survivor USA?
Answer:


In [111]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
    

In [112]:
print(create_prompt("Who had won the season 45 of survivor USA?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know""

Context: 

Season 45 titled 45 of Survivor, aired between 2023-09-27 and 2023-12-20, contestent Jake O'Kane Aged 26 from hometown Boston, MA, works as a Attorney and a  Ranked 2nd Runner-Up

###

Season 45 titled 45 of Survivor, aired between 2023-09-27 and 2023-12-20, contestent Austin Li Coon Aged 26 from hometown Chicago, IL, works as a Grad Student and a  Ranked Runner-Up

---

Question: Who had won the season 45 of survivor USA?
Answer:



### Step 4: Create a Function that Answers a Question

Our final step is to send that text prompt to a Completion model and parse the model output!


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Check prompts

In [132]:
print(create_prompt(question_season_42_winner, df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know""

Context: 

Season 44 titled 44 of Survivor, aired between 2023-03-01 and 2023-05-24, contestent Carolyn Wiger Aged 35 from hometown North St. Paul, MN, works as a Drug Counselor and a  Ranked 2nd Runner-Up

###

Season 41 titled 41 of Survivor, aired between 2021-09-22 and 2021-12-15, contestent Sara Wilson Aged 24 from hometown Boston, MA, works as a Healthcare Consultant and a  Ranked 17th

---

Question: what won the Season 42 of the Survivor USA?
Answer:


In [134]:
print(create_prompt(question_season_44_runnerup, df, 150))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know""

Context: 

Season 44 titled 44 of Survivor, aired between 2023-03-01 and 2023-05-24, contestent Carolyn Wiger Aged 35 from hometown North St. Paul, MN, works as a Drug Counselor and a  Ranked 2nd Runner-Up

---

Question: what was the runner-up of season 44 of Survivor USA?
Answer:


In [121]:
print(prompt_44_runnerup)


Question: "Who was the runner-up in the season 44 of survivor USA?"
Answer:



In [122]:
def get_compare_AI_response(prompt, 
                            initial_response,
                            func=answer_question, 
                            df=df):
    """
    Function to get the response with the added context (new data)
    and compare with the initial response from teh model
    
    ARGUMENTS:
    func              : function to return the reponse from AI
    prompt            : prompt or question
    initial_reponse   : response from the initial model, without newly added context/data
    df                : dataframe with the embeddings to use as context
    """
    
    answer = func(prompt_latest_season, df)
 
    response = f"""
{prompt}

Original Answer: {initial_response}
Custom Answer:   {answer}

""".format(prompt, initial_response, answer)
    
    return response

## Question 1

In [135]:
print(get_compare_AI_response( question_latest_season, prompt_latest_season_answer))


What is the latest season of survivor USA?

Original Answer: The latest season of Survivor USA is season 40, titled "Winners at War." It aired in early 2020.
Custom Answer:   The latest season of Survivor USA is season 46, which aired between 2024-02-28 and 2024-05-22.




## Question 2

In [136]:
print(get_compare_AI_response( question_season_42_winner, prompt_42_winner_answer))


what won the Season 42 of the Survivor USA?

Original Answer: The winner of Survivor USA season 42 has not been determined yet as season 42 has not yet aired. The most recent winner of Survivor USA (season 41) was Erika Casupanan.
Custom Answer:   Season 46 titled 46 of Survivor, aired between 2024-02-28 and 2024-05-22.




## Question 3

In [137]:
print(get_compare_AI_response( question_season_44_runnerup, prompt_44_runnerup_answer))


what was the runner-up of season 44 of Survivor USA?

Original Answer: The runner-up in Survivor season 44 cannot be determined as the show has not completed that many seasons yet. As of 2021, the 40th season of Survivor has aired.
Custom Answer:   Season 46


