# Build a Custom OpenAI Chatbot with ML-Driven Prompt Engineering

The goal of this project is to build a custom OpenAI chatbot using a technique called retrieval-augmented generation (RAG): the prompt is supplemented with up-to-date information retrieved from an external source, so the model can answer specific questions about a subject it was not trained on.

## Install Dependencies

In [1]:
!pip install -r requirements.txt

Collecting aiohttp==3.8.3 (from -r requirements.txt (line 1))
  Downloading aiohttp-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting appnope==0.1.3 (from -r requirements.txt (line 3))
  Downloading appnope-0.1.3-py2.py3-none-any.whl (4.4 kB)
Collecting asttokens==2.2.1 (from -r requirements.txt (line 4))
  Downloading asttokens-2.2.1-py2.py3-none-any.whl (26 kB)
Collecting async-timeout==4.0.2 (from -r requirements.txt (line 5))
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting attrs==22.2.0 (from -r requirements.txt (line 6))
  Downloading attrs-22.2.0-py3-none-any.whl (60 kB)
Collecting beautifulsoup4==4.11.1 (from -r requirements.txt (line 8))
  Downloading beautifulsoup4-4.11.1-py3-none-any.w

## Explain which dataset you chose and why it is appropriate for this task:

The data comes from the Wikipedia API, and the dataset covers the main events of 2024. It suits the task because our base model was trained only on data up to 2021, so it cannot answer questions about current events without the information we provide through our custom dataset.

## Data preparation

In the cells below, we will load the previously chosen dataset into a `pandas` dataframe with a column called `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [118]:
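# Requires the legacy openai SDK (<1.0): `openai.embeddings_utils` and the
# module-level `openai.Embedding` / `openai.Completion` APIs were removed in 1.0
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings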
from dateutil.parser import parse
import pandas as pd
import requests
import numpy as np
import tiktoken

In [119]:
openai.api_key = "Put Your Key Here"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
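
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`. The helper `load_api_key` is hypothetical and not part of the `openai` package:

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage in the notebook (uncomment once the variable is set):
# openai.api_key = load_api_key()
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.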

# Prepare dataset

**The data will be loaded into a pandas `DataFrame` called `df` where each row represents a sample of text, and there is only one column, `"text"`, which contains the raw text data.**

In this specific case, we are collecting data from the Wikipedia page for the [year 2024](https://en.wikipedia.org/wiki/2024) and doing some data wrangling to get it into the appropriate format.

In [240]:
url="https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2024&explaintext=1&formatversion=2&format=json"
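
The query string above is easier to read and adapt when built from a dictionary. A small sketch using only the standard library: the parameter values are copied from the URL above, while the helper name `build_extract_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a MediaWiki API URL that returns a page's plain-text extract."""
    params = {
        "action": "query",
        "prop": "extracts",   # ask for the page extract
        "exlimit": 1,         # a single extract
        "titles": title,
        "explaintext": 1,     # plain text instead of HTML
        "formatversion": 2,   # "pages" comes back as a list
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_extract_url("2024"))
```

Calling `build_extract_url("2024")` reproduces the URL used in this notebook. Other page titles can be passed the same way.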

In [242]:
# Get the Wikipedia page for "2024" since OpenAI's models stop in 2021
resp = requests.get(url)

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings; copy so that later
# assignments modify this dataframe rather than a slice of the original
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – " (en dash), it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because the row yielded by iterrows may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]
# Keep only rows with a date prefix (this drops the bare date headings)
df = df[df["text"].str.contains(" – ")]
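
The date-prefix step is easier to follow on a toy list. The sketch below mirrors the loop above without pandas. It assumes headings look like `January 1` and uses `datetime.strptime` as a simplified stand-in for `dateutil`'s `parse`. The function names are hypothetical.

```python
from datetime import datetime


def is_date_heading(line):
    """Stand-in for dateutil's parse(): accept lines like 'January 1'."""
    try:
        # A fixed leap year is appended so that 'February 29' also parses
        datetime.strptime(line + " 2024", "%B %d %Y")
        return True
    except ValueError:
        return False


def prefix_with_dates(lines, sep=" – "):
    """Attach the most recent date heading to each undated line."""
    prefix, result = "", []
    for line in lines:
        if sep in line:
            result.append(line)          # already carries its date
        elif is_date_heading(line):
            prefix = line                # remember the heading, drop the row
        else:
            result.append(prefix + sep + line)
    return result


sample = [
    "Intro sentence.",
    "January 1",
    "First event.",
    "Second event.",
    "January 2 – Third event.",
]
for line in prefix_with_dates(sample):
    print(line)
```

Lines that appear before any date heading receive an empty prefix, which is why the first rows of the dataframe start with a bare ` – `.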

In [243]:
df.head(30)

Unnamed: 0,text
0,"– 2024 (MMXXIV) is the current year, and is a..."
1,"– So far, this year has witnessed the continu..."
2,"– Approximately 79 countries, representing ar..."
10,"January 1 – Egypt, Ethiopia, Iran and the Unit..."
11,January 1 – The Republic of Artsakh is formall...
12,January 1 – A 7.5 Mww earthquake strikes the w...
13,January 1 – Ethiopia announces an agreement wi...
14,January 2 – 2023 Marshallese general election:...
15,January 3 – 2024 Kerman bombings: An Islamic S...
16,January 7 – 2024 Bangladeshi general election:...


In [244]:
df.tail(30)

Unnamed: 0,text
102,June 9 – 2024 Belgian general election.
103,June 28 – 2024 Mongolian parliamentary election.
104,June 29 – 2024 Mauritanian presidential election.
105,July 15–16 – 2024 Rwandan general election.
106,"July 26 – 2024 Summer Olympics in Paris, France."
107,July 28 – 2024 Venezuelan presidential election.
108,August 17 – Nusantara will become the new capi...
109,September 7 – 2024 Algerian presidential elect...
110,September 15 – 2024 Romanian presidential elec...
111,October 6 – 2024 Brazilian municipal elections...


## Generating Embeddings

We'll use the `Embedding` tooling from OpenAI ([documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)) to create vectors representing each row of our custom dataset.

In order to avoid a `RateLimitError` we'll send our data in batches to the `Embedding.create` function.

In [246]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []

for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
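
Batching reduces the number of requests, but a request can still fail with a rate-limit error. The sketch below separates the two concerns: a batching generator and a retry wrapper with exponential backoff. It is an illustration under assumptions, not the notebook's method: `fake_embed` is a hypothetical stand-in for the embeddings request, and the retry catches every exception for simplicity.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise                      # out of attempts, surface the error
            sleep(base_delay * 2 ** attempt)


def fake_embed(batch):
    """Hypothetical stand-in for the embeddings request."""
    return [[float(len(text))] for text in batch]


texts = [f"row {i}" for i in range(250)]
vectors = []
for batch in batched(texts, 100):
    vectors.extend(call_with_retry(fake_embed, batch))
print(len(vectors))
```

In real use, `fake_embed` would wrap the `openai.Embedding.create` call shown above, and the `except` clause would be narrowed to the rate-limit error.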

In order to avoid having to run that code again in the future, we'll save the generated embeddings as a CSV file.

In [247]:
df.to_csv("embeddings.csv")

## Checkpoint

If you want to stop the tutorial here and come back, you can reload `df` using this code (again adding your API key) rather than generating the embeddings again:

In [179]:
import ast  # literal_eval parses the stringified lists without executing code

df = pd.read_csv("embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
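
When a list column is written to CSV, each cell is stored as the string form of the list, so it has to be parsed back on load. A short sketch of why `ast.literal_eval` is a safer parser than `eval` for that job. The sample values are made up for illustration.

```python
import ast

stored = str([0.0013, -0.0179, 0.0042])   # how a list cell looks in the CSV
restored = ast.literal_eval(stored)       # accepts Python literals only
print(restored)

try:
    # An expression that eval would execute is rejected by literal_eval
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```

This matters mainly when the CSV file could come from an untrusted source.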

In [248]:
df

Unnamed: 0,text,embeddings
0,"– 2024 (MMXXIV) is the current year, and is a...","[0.001296337111853063, -0.017870482057332993, ..."
1,"– So far, this year has witnessed the continu...","[-0.021781660616397858, -0.02024887688457966, ..."
2,"– Approximately 79 countries, representing ar...","[0.0007858542376197875, -0.021182088181376457,..."
10,"January 1 – Egypt, Ethiopia, Iran and the Unit...","[-0.00607002479955554, -0.0236552432179451, -0..."
11,January 1 – The Republic of Artsakh is formall...,"[0.00801378209143877, 0.008033107966184616, -0..."
...,...,...
130,June 1 – 2024 Austrian legislative election.,"[-0.0058554308488965034, -0.008935715071856976..."
131,June 1 – 2024 Sri Lankan presidential election.,"[0.0016323139425367117, -0.004772505257278681,..."
133,October – 2024 Botswana general election.,"[-0.020383397117257118, -0.023523297160863876,..."
134,October – 2024 Georgian presidential election.,"[-0.006440339144319296, -0.010024076327681541,..."


## Inspecting Non-Customized Results

Before we perform any prompt engineering, **let's ask the OpenAI model some questions and see how it answers**.

(If you encounter an `AuthenticationError` when running this code, make sure that you have added a valid API key to the cell above and executed it.)

In [284]:
prompt1 = """
Question: "What year is it?"
"""

prompt2 = """
Question: "Which countries have recently become members of BRICS?"
"""

prompt3 = """
Question: "What are the recent advances in artificial intelligence?"
"""

prompt4 = """
Question: "What are the biggest conflicts in the world today?"
"""

In [285]:
len_tokens=200

In [286]:
initial_answer_1 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt1,
    max_tokens=len_tokens)["choices"][0]["text"].strip()

initial_answer_2 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt2,
    max_tokens=len_tokens)["choices"][0]["text"].strip()

initial_answer_3 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt3,
    max_tokens=len_tokens)["choices"][0]["text"].strip()

initial_answer_4 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt4,
    max_tokens=len_tokens)["choices"][0]["text"].strip()

In [289]:
print(initial_answer_1+"\n")

As an AI, I do not have a physical body or keep track of time in the same way that humans do. However, the current year in the Gregorian calendar is 2021.



In [290]:
print(initial_answer_2+"\n")

Answer: South Africa, which joined in 2011, is the most recent member of BRICS.



In [291]:
print(initial_answer_3+"\n")

Answer: There have been many recent advances in the field of artificial intelligence, including:

1. Deep learning: This is a subset of machine learning that uses neural networks to learn from large amounts of data and make predictions. Deep learning has been instrumental in many AI applications, such as image recognition, speech recognition, and natural language processing.

2. Robotics: AI-powered robots have become more advanced and capable, with the ability to perform complex tasks and adapt to changing environments. This has led to increased use of robots in industries such as manufacturing, healthcare, and logistics.

3. Natural language processing (NLP): NLP is the ability of computers to understand and process human language. Recent advancements in this field have enabled machines to accurately translate between languages, generate human-like text, and have conversations with humans.

4. Computer vision: Computer vision involves teaching computers to understand and interpret vi

In [292]:
print(initial_answer_4+"\n")

Answer:

1. War in Syria: The ongoing civil war in Syria, which has been ongoing since 2011, has resulted in hundreds of thousands of deaths and mass displacement of people.

2. Terrorism: The rise of extremist groups such as ISIS, Al-Qaeda, and Boko Haram has led to violent attacks and conflicts in various countries around the world.

3. Israeli-Palestinian Conflict: The ongoing conflict between Israel and Palestine over land and territories has been a source of tension and violence in the Middle East for decades.

4. North Korea's Nuclear Program: The development and testing of nuclear weapons by North Korea has led to international tensions and conflicts, particularly with the United States.

5. Trade Wars: The trade disputes between major economies such as the United States, China, and the European Union have led to conflicts and tariffs that have a global impact.

6. Religious and Ethnic Conflicts: Violent conflicts based on religious and ethnic differences are ongoing in various 

# Creating a function that finds related snippets of text for a given question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all the rows in our dataset from most relevant to least relevant.

This will use the embeddings we generated earlier to compare the vectorized version of our question to the vectorized versions of the rows in the dataset.

The metric we are using to rank the results is cosine distance (one minus cosine similarity), so a smaller distance means a more relevant row.
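
To make the metric concrete, here is a minimal pure-Python sketch of cosine distance and of ranking rows by it. The two-dimensional vectors are toy values chosen for illustration, and the function names are hypothetical. The notebook itself relies on `distances_from_embeddings`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, rows):
    """Return (distance, label) pairs sorted from most to least relevant."""
    return sorted((cosine_distance(query, vec), label) for label, vec in rows)


# Toy 2-D "embeddings"; real ones have many more dimensions.
rows = [
    ("opposite", [-1.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("same direction", [2.0, 0.0]),
]
for distance, label in rank_by_distance([1.0, 0.0], rows):
    print(f"{distance:.1f}  {label}")
# 0.0  same direction
# 1.0  orthogonal
# 2.0  opposite
```

Vector length does not affect the result: `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]` because only direction is compared.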

In [293]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy


Let's test that out for a couple different questions:

In [294]:
get_rows_sorted_by_relevance(prompt1, df)

Unnamed: 0,text,embeddings,distances
0,"– 2024 (MMXXIV) is the current year, and is a...","[0.001296337111853063, -0.017870482057332993, ...",0.199006
124,December 24 – The 2025 Jubilee will take place...,"[-0.015491542406380177, -0.021557891741394997,...",0.231859
129,June 1 – September or October,"[0.005865198094397783, -0.011174241080880165, ...",0.232077
112,October 9 – 2024 Mozambican general election.,"[-0.020175598561763763, -0.02026589959859848, ...",0.235496
121,December 7 – 2024 Ghanaian general election.,"[-0.010682174935936928, -0.01834850385785103, ...",0.235920
...,...,...,...
10,"January 1 – Egypt, Ethiopia, Iran and the Unit...","[-0.00607002479955554, -0.0236552432179451, -0...",0.275125
25,January 16 – Iran carries out a series of miss...,"[-0.03714895620942116, -0.015521847642958164, ...",0.277382
69,April 5 – Ecuadorian police raid the Mexican e...,"[-0.02853575348854065, -0.003814217634499073, ...",0.278981
28,January 24 – 2024 Korochansky Ilyushin Il-76 c...,"[-0.012084310874342918, -0.01990433596074581, ...",0.283235


In [295]:
get_rows_sorted_by_relevance(prompt2, df)

Unnamed: 0,text,embeddings,distances
10,"January 1 – Egypt, Ethiopia, Iran and the Unit...","[-0.00607002479955554, -0.0236552432179451, -0...",0.157725
62,March 31 – Bulgaria and Romania become members...,"[0.012220675125718117, -0.011527257971465588, ...",0.213607
1,"– So far, this year has witnessed the continu...","[-0.021781660616397858, -0.02024887688457966, ...",0.215256
2,"– Approximately 79 countries, representing ar...","[0.0007858542376197875, -0.021182088181376457,...",0.221019
52,March 7 – As the final Nordic country to join ...,"[-0.0019085346721112728, -0.037938594818115234...",0.221837
...,...,...,...
118,November 12 – 2024 Palauan general election.,"[-0.022184129804372787, -0.010053795762360096,...",0.292568
22,January 14 – Margrethe II formally abdicates a...,"[-0.0056414855644106865, -0.02698841318488121,...",0.293563
124,December 24 – The 2025 Jubilee will take place...,"[-0.015491542406380177, -0.021557891741394997,...",0.298074
14,January 2 – 2023 Marshallese general election:...,"[-0.032947033643722534, -0.014056704938411713,...",0.301301


In [296]:
get_rows_sorted_by_relevance(prompt3, df)

Unnamed: 0,text,embeddings,distances
56,"March 13 – The Artificial Intelligence Act, th...","[-0.0028268315363675356, -0.02203412726521492,...",0.182789
1,"– So far, this year has witnessed the continu...","[-0.021781660616397858, -0.02024887688457966, ...",0.245909
45,February 22 – American company Intuitive Machi...,"[0.004767944570630789, -0.013238654471933842, ...",0.252384
57,March 15–17 – 2024 Russian presidential electi...,"[-0.010838480666279793, -0.01607760600745678, ...",0.261778
16,January 7 – 2024 Bangladeshi general election:...,"[-0.02312278375029564, -0.022216515615582466, ...",0.266135
...,...,...,...
22,January 14 – Margrethe II formally abdicates a...,"[-0.0056414855644106865, -0.02698841318488121,...",0.301564
69,April 5 – Ecuadorian police raid the Mexican e...,"[-0.02853575348854065, -0.003814217634499073, ...",0.302123
40,February 6 – Former President of Chile Sebasti...,"[-0.00816845428198576, -0.0024274776224046946,...",0.303147
66,April 1 – Israel attacks the Iranian embassy i...,"[-0.025679413229227066, 0.002990813460201025, ...",0.303968


In [297]:
get_rows_sorted_by_relevance(prompt4, df)

Unnamed: 0,text,embeddings,distances
1,"– So far, this year has witnessed the continu...","[-0.021781660616397858, -0.02024887688457966, ...",0.174346
17,January 8 – 2024 conflict in Ecuador: Ecuadori...,"[-0.01292432751506567, -0.010080292820930481, ...",0.232163
47,February 29 – Israel–Hamas war: Soldiers of th...,"[-0.025512726977467537, 0.0004744035250041634,...",0.234949
75,April 16 – 2024 Persian Gulf floods: At least ...,"[-0.013219978660345078, -0.014456474222242832,...",0.235604
30,January 26 – Israel–Hamas war: The UN's Intern...,"[-0.038517553359270096, -0.006535387597978115,...",0.244307
...,...,...,...
32,January 31 – Sultan of Johor Ibrahim Iskandar ...,"[0.0030541284941136837, -0.01499270647764206, ...",0.288241
123,December 17 – Assuming the next United Kingdom...,"[-0.025605235248804092, -0.017141951248049736,...",0.289664
14,January 2 – 2023 Marshallese general election:...,"[-0.032947033643722534, -0.014056704938411713,...",0.292212
72,April 9 – After Leo Varadkar handed in his res...,"[-0.009332936257123947, -0.0027552535757422447...",0.293547


# Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to create a text prompt that provides context to a `Completion` model to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the `Completion` model, which is currently about 4,000 (prompt and completion combined). So we'll loop over the rows in order of relevance, counting the tokens as we go, and stop when we hit the limit. Then we'll join the selected rows into a single string and add it to the prompt.
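
The packing logic can be sketched without any API calls. The example below assumes a crude whitespace tokenizer in place of `tiktoken`, and the names `count_tokens` and `pack_context` are hypothetical. It shows the greedy rule described above: take rows in order and stop at the first one that does not fit.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(rows, budget, overhead=0):
    """Take rows in order until adding the next one would exceed `budget`."""
    used, chosen = overhead, []
    for text in rows:
        cost = count_tokens(text)
        if used + cost > budget:
            break                 # stop at the first row that does not fit
        chosen.append(text)
        used += cost
    return chosen, used


rows = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
chosen, used = pack_context(rows, budget=8, overhead=2)
print(chosen, used)
# ['alpha beta gamma', 'delta epsilon'] 7
```

Note the design choice: the loop stops rather than skipping ahead, so `"kappa"` is left out even though it would still fit. Because rows arrive sorted by relevance, this keeps the context limited to the most relevant rows.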

In [298]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:
    {}
    ---

    {}
    """

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

Now let's test that out! We'll use a `max_token_count` below the actual limit just to keep the output shorter and more readable.

In [299]:
max_token_count=500

In [300]:
print(create_prompt(prompt1, df, max_token_count))


    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:
     – 2024 (MMXXIV) is the current year, and is a leap year starting on Monday of the Gregorian calendar, the 2024th year of the Common Era (CE) and Anno Domini (AD) designations, the 24th  year of the 3rd millennium and the 21st century, and the  5th   year of the 2020s decade.  

###

December 24 – The 2025 Jubilee will take place on this date.

###

June 1 – September or October

###

October 9 – 2024 Mozambican general election.

###

December 7 – 2024 Ghanaian general election.

###

June 1 – 2024 Icelandic presidential election.

###

June 1 – 2024 Sri Lankan presidential election.

###

July 15–16 – 2024 Rwandan general election.

###

October 27 – 2024 Uruguayan general election.

###

May 6 –  2024 Chadian presidential election.

###

June 2 – 2024 Mexican general election.

###

October – 2024 Georgian presidential election.

#

In [301]:
print(create_prompt(prompt2, df, max_token_count))


    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:
    January 1 – Egypt, Ethiopia, Iran and the United Arab Emirates become BRICS members.

###

March 31 – Bulgaria and Romania become members of the Schengen Area through sea and air routes.

###

 – So far, this year has witnessed the continuation of major armed conflicts, including the Russian invasion of Ukraine, the Myanmar civil war, the war in Sudan, and the Islamist insurgency in the Sahel. The continuation of the Israel–Hamas war has further caused spillover into many countries, including a crisis in the Red Sea impacting global shipping. 

###

 – Approximately 79 countries, representing around four billion people, are expected to conduct national elections throughout the course of the year, including eight out of ten of the world's most populous countries (Bangladesh, Brazil, Pakistan, Russia, India, Mexico, Indonesia, and the U

In [302]:
print(create_prompt(prompt3, df, max_token_count))


    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:
    March 13 – The Artificial Intelligence Act, the world's first comprehensive legal and regulatory framework for artificial intelligence, is passed by the European Union.

###

 – So far, this year has witnessed the continuation of major armed conflicts, including the Russian invasion of Ukraine, the Myanmar civil war, the war in Sudan, and the Islamist insurgency in the Sahel. The continuation of the Israel–Hamas war has further caused spillover into many countries, including a crisis in the Red Sea impacting global shipping. 

###

February 22 – American company Intuitive Machines' Nova-C lander becomes the first commercial vehicle to land on the Moon.

###

March 15–17 – 2024 Russian presidential election: Incumbent Vladimir Putin is re-elected for a fifth term.

###

January 7 – 2024 Bangladeshi general election: The Awami League, l

In [303]:
print(create_prompt(prompt4, df, max_token_count))


    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:
     – So far, this year has witnessed the continuation of major armed conflicts, including the Russian invasion of Ukraine, the Myanmar civil war, the war in Sudan, and the Islamist insurgency in the Sahel. The continuation of the Israel–Hamas war has further caused spillover into many countries, including a crisis in the Red Sea impacting global shipping. 

###

January 8 – 2024 conflict in Ecuador: Ecuadorian President Daniel Noboa declares a state of emergency following the escape of Los Choneros drug cartel leader José Adolfo Macías Villamar from prison. The military was deployed onto the streets and into prisons, while setting a national nighttime curfew.

###

February 29 – Israel–Hamas war: Soldiers of the Israel Defense Forces open fire on a crowd of civilians in Gaza City, killing more than a hundred people, as the Palestinian c

## Custom Query Completion

In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model.

# Create a Function that Answers a Question

Our final step is to send that text prompt to a `Completion` model and parse the model output!

In [304]:
def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


In [305]:
custom_answer_1 = answer_question(prompt1, df)
print(custom_answer_1)

Answer: 2024


In [306]:
custom_answer_2 = answer_question(prompt2, df)
print(custom_answer_2)

Egypt, Ethiopia, Iran, and the United Arab Emirates.


In [307]:
custom_answer_3 = answer_question(prompt3, df)
print(custom_answer_3)

Answer: On March 13, the European Union passed the Artificial Intelligence Act, the first comprehensive legal and regulatory framework for artificial intelligence. This is the most recent advancement in the field of AI.


In [310]:
custom_answer_4 = answer_question(prompt4, df)
print(custom_answer_4)

Answer: From the context provided, some of the biggest conflicts in the world today include the Russian invasion of Ukraine, the Myanmar civil war, the conflict in Sudan, the Islamist insurgency in the Sahel, the war between Israel and Hamas, the crisis in the Red Sea, and the ongoing conflicts in Yemen and Syria.
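
Some of the completions above begin with a literal `Answer:` label, likely because the template in `create_prompt` ends with the question rather than an `Answer:` cue. A small post-processing sketch that normalizes this, where `strip_answer_prefix` is a hypothetical helper:

```python
def strip_answer_prefix(text, prefix="Answer:"):
    """Remove a leading 'Answer:' label, if present, and surrounding whitespace."""
    text = text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()


print(strip_answer_prefix("Answer: 2024"))
print(strip_answer_prefix("Egypt, Ethiopia, Iran, and the United Arab Emirates."))
```

Answers without the label pass through unchanged, so the helper can be applied to every response.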


## Custom Query Performance Demo

We will demonstrate the performance of the custom query by comparing, for each question, the answer to a basic query from the `Completion` model with the answer to the custom query, which includes the context we provide to the model.
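
The same comparison can also be produced in a loop instead of one cell per question. A minimal sketch, assuming a hypothetical helper `format_comparison`. The sample strings are copied from the outputs earlier in this notebook.

```python
def format_comparison(question, original, custom, width=60):
    """Return one text block contrasting the baseline and custom answers."""
    return "\n".join([
        question.strip(),
        f"Original Answer: {original}",
        "-" * width,
        f"Custom Answer: {custom}",
    ])


# Stand-ins for the prompt and answer variables defined earlier.
results = [
    (
        'Question: "Which countries have recently become members of BRICS?"',
        "Answer: South Africa, which joined in 2011, is the most recent member of BRICS.",
        "Egypt, Ethiopia, Iran, and the United Arab Emirates.",
    ),
]
for question, original, custom in results:
    print(format_comparison(question, original, custom))
```

In the notebook, `results` would be built from `prompt1` through `prompt4` and the matching `initial_answer_*` and `custom_answer_*` variables.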

### Question 1

In [348]:
prompt_template="""
{}
Original Answer: {}
-------------------------------------------------------------------------------------------------------------------------
Custom Answer: {}
"""

In [349]:
print(prompt_template.format(prompt1,initial_answer_1,custom_answer_1))



Question: "What year is it?"

Original Answer: As an AI, I do not have a physical body or keep track of time in the same way that humans do. However, the current year in the Gregorian calendar is 2021.
-------------------------------------------------------------------------------------------------------------------------
Custom Answer: Answer: 2024



### Question 2

In [350]:
print(prompt_template.format(prompt2,initial_answer_2,custom_answer_2))



Question: "Which countries have recently become members of BRICS?"

Original Answer: Answer: South Africa, which joined in 2011, is the most recent member of BRICS.
-------------------------------------------------------------------------------------------------------------------------
Custom Answer: Egypt, Ethiopia, Iran, and the United Arab Emirates.



### Question 3

In [351]:
print(prompt_template.format(prompt3,initial_answer_3,custom_answer_3))



Question: "What are the recent advances in artificial intelligence?"

Original Answer: Answer: There have been many recent advances in the field of artificial intelligence, including:

1. Deep learning: This is a subset of machine learning that uses neural networks to learn from large amounts of data and make predictions. Deep learning has been instrumental in many AI applications, such as image recognition, speech recognition, and natural language processing.

2. Robotics: AI-powered robots have become more advanced and capable, with the ability to perform complex tasks and adapt to changing environments. This has led to increased use of robots in industries such as manufacturing, healthcare, and logistics.

3. Natural language processing (NLP): NLP is the ability of computers to understand and process human language. Recent advancements in this field have enabled machines to accurately translate between languages, generate human-like text, and have conversations with humans.

4. Co

### Question 4

In [352]:
print(prompt_template.format(prompt4,initial_answer_4,custom_answer_4))



Question: "What are the biggest conflicts in the world today?"

Original Answer: Answer:

1. War in Syria: The ongoing civil war in Syria, which has been ongoing since 2011, has resulted in hundreds of thousands of deaths and mass displacement of people.

2. Terrorism: The rise of extremist groups such as ISIS, Al-Qaeda, and Boko Haram has led to violent attacks and conflicts in various countries around the world.

3. Israeli-Palestinian Conflict: The ongoing conflict between Israel and Palestine over land and territories has been a source of tension and violence in the Middle East for decades.

4. North Korea's Nuclear Program: The development and testing of nuclear weapons by North Korea has led to international tensions and conflicts, particularly with the United States.

5. Trade Wars: The trade disputes between major economies such as the United States, China, and the European Union have led to conflicts and tariffs that have a global impact.

6. Religious and Ethnic Conflicts: 

## Conclusion

In this project we used unsupervised machine learning to perform prompt engineering for custom responses from an OpenAI chatbot. By supplying up-to-date context from 2024 headlines, we enabled the chatbot to answer questions about current events in a way that is more consistent with the current context.
