# 2. 합성 Q&A 데이터세트 만들기
다음 지침에 특화된 모델인 [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta)를 사용하여 주어진 컨텍스트를 기반으로 질문을 생성합니다. . 그런 다음 동일한 맥락에서 [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta)를 사용하여 이러한 질문에 답합니다.

이것은 비용이 많이 들고, 또한 각 섹션에 대해 davinci 엔진을 호출하기 때문에 시간이 오래 걸립니다. 대신 최종 데이터 세트를 다운로드하기만 하면 됩니다.

[이전 노트북](olympics-1-collect-data.ipynb)으로 생성한 데이터셋을 사용하고 있습니다.

## 2.1 데이터를 읽고 컨텍스트 생성
해당 섹션의 제목, 제목 및 내용을 연결하여 컨텍스트를 만듭니다.

In [1]:
import pandas as pd
df = pd.read_csv('olympics-data/olympics_sections.csv')
df['context'] = df.title + "\n" + df.heading + "\n\n" + df.content
df.head()

Unnamed: 0,title,heading,content,tokens,context
0,2020 Summer Olympics,Summary,The 2020 Summer Olympics (Japanese: 2020年夏季オリン...,726,2020 Summer Olympics\nSummary\n\nThe 2020 Summ...
1,2020 Summer Olympics,Host city selection,The International Olympic Committee (IOC) vote...,126,2020 Summer Olympics\nHost city selection\n\nT...
2,2020 Summer Olympics,Impact of the COVID-19 pandemic,"In January 2020, concerns were raised about th...",374,2020 Summer Olympics\nImpact of the COVID-19 p...
3,2020 Summer Olympics,Qualifying event cancellation and postponement,Concerns about the pandemic began to affect qu...,298,2020 Summer Olympics\nQualifying event cancell...
4,2020 Summer Olympics,Effect on doping tests,Mandatory doping tests were being severely res...,163,2020 Summer Olympics\nEffect on doping tests\n...


## 2.2 상황에 따라 질문 만들기
davinci-instruct를 사용하여 Wikipedia 섹션 내용과 관련된 여러 그럴듯한 질문을 생성합니다.

참고: 우리는 온도=0을 사용했지만 더 높은 온도로 실험하여 더 다양한 질문을 얻는 것이 도움이 될 수 있습니다.

<span style="color:orange">**경고: 이 단계는 시간이 오래 걸리고 많은 토큰을 소모합니다. 모든 섹션에 대해 davinci-instruct를 호출하여 여러 질문을 생성하기 때문입니다.**</span >

In [2]:
import openai

def get_questions(context):
    try:
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v2",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"]
        )
        return response['choices'][0]['text']
    except:
        return ""

# context = df[0:1]['context']
# print(context)
# response = get_questions(context)
# print(response)

df['questions']= df.context.apply(get_questions)
df['questions'] = "1." + df.questions
print(df[['questions']].values[0][0])

1.


프롬프트는 여러 질문을 생성하도록 설계되었습니다. 위의 예시 질문은 2020년 하계 올림픽 페이지의 요약 섹션을 기반으로 생성되었습니다.

위의 3번과 5번 질문이 반복되는 것을 볼 수 있습니다. 때때로 생성된 질문은 컨텍스트 없이 모호할 수 있습니다. 이러한 한계에도 불구하고 성공적인 모델을 만들 수 있음을 보여드리겠습니다.

In [3]:
print(df.content.values[0])

The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and also known as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.
Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013. The Games were originally scheduled to take place from 24 July to 9 August 2020, but due to the global COVID-19 pandemic, on 24 March 2020, the event was postponed to 2021, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 branding for marketing purpose. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergenc

## 2.3 컨텍스트를 기반으로 답변 만들기
davinci-instruct를 사용하여 관련 Wikipedia 섹션 내용이 주어진 질문에 답하십시오.

참고: 우리는 온도=0을 사용했지만 더 높은 온도로 실험하여 더 다양한 질문을 얻는 것이 도움이 될 수 있습니다.

<span style="color:orange">**경고: 이 단계는 모든 질문에 답하기 위해 모든 섹션에 대해 davinci-instruct를 호출하므로 시간이 오래 걸리고 많은 토큰을 소모합니다.**</span>

In [None]:
def get_answers(row):
    try:
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v2",
            prompt=f"Write questions based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response['choices'][0]['text']
    except Exception as e:
        print (e)
        return ""


df['answers']= df.apply(get_answers, axis=1)
df['answers'] = "1." + df.answers
df = df.dropna().reset_index().drop('index',axis=1)
print(df[['answers']].values[0][0])

That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model d

That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model d

That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model does not exist
That model d

다음은 개최 도시 선택과 관련된 컨텍스트를 기반으로 위의 질문에 대한 답변입니다.

답변 3-5에 정답이 포함되어 있음을 알 수 있지만 질문에 직접 답변하는 대신 답변을 그대로 추출한 것입니다. 이러한 낮은 품질의 답변에도 불구하고 많은 수의 예제가 주어지면 모델이 작업을 합리적으로 잘 학습할 수 있음을 보여줍니다.

## 2.4 Wikipedia 섹션을 기반으로 올림픽 Q&A 데이터 세트 저장
[다음 노트북](olympics-3-train-qa.ipynb)에서 사용할 파일을 저장합니다.

In [None]:
df.to_csv('olympics-data/olympics_qa.csv', index=False)

## 2.5 파일 검색
질문이 있을 때 관련 컨텍스트를 검색하는 데 사용할 수 있는 검색 파일([API 참조](https://beta.openai.com/docs/api-reference/files/list))을 만듭니다.


In [None]:
df = df[df.tokens<2000]
df[['context', 'tokens']].rename(columns={'context':'text','tokens':'metadata'}).to_json('olympics-data/olympics_search.jsonl', orient='records', lines=True)

search_file = openai.File.create(
  file=open("olympics-data/olympics_search.jsonl"),
  purpose='search'
)
olympics_search_fileid = search_file['id']

## 2.6 제공된 컨텍스트를 기반으로 질문에 답변

우리는 답변 엔드포인트의 간단한 구현을 사용할 것입니다. 이는 컨텍스트에 포함될 수 있는 관련 섹션을 얻기 위해 색인된 파일을 검색하는 [/search endpoint](https://beta.openai.com/docs/api-reference/searches)를 사용하여 작동합니다. 지정된 모델에 따라 질문 및 응답 프롬프트가 표시됩니다.

In [None]:
from answers_with_ft import create_context, answer_question
print(create_context("Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", olympics_search_fileid, max_len=400))

In [None]:
answer_question(olympics_search_fileid, "davinci-instruct-beta-v2", 
            "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?")

Q&A용 모델을 미세 조정한 후에는 [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta ), 문맥에 따라 질문에 답할 수 없을 때 더 나은 답변을 얻으려면. 우리는 [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta)의 단점을 봅니다. 컨텍스트가 존재하는지 여부. (두 번째 질문은 2024년을 배경으로 하는 미래의 사건에 대한 질문입니다.)

In [None]:
answer_question(olympics_search_fileid, "davinci-instruct-beta-v2", 
            "Where did women's 4 x 100 metres relay event take place during the 2048 Summer Olympics?", max_len=1000)

주어진 맥락에서 질문에 답할 수 없는 경우에도 davinci는 질문에 답하는 경향이 있음을 알 수 있습니다. 아직 발생하지 않은 2048년 하계 올림픽에 관한 질문에 유의하십시오. 검색된 콘텐츠는 2020년에 대한 결과만 반환했습니다.

## 2.7 (선택 사항) 검색 엔드포인트가 관련 컨텍스트를 반환할 가능성에 대한 조사

In [None]:
def check_context(title, heading, question, max_len=1800, search_model='ada', max_rerank=10):
    """
    Evaluate the performance of the search model in retrieving the correct context

    Parameters
    ----------
    title: str
        The title of the Wikipedia page
    heading: str
        The heading of the Wikipedia section
    qusetion: str
        The question
    max_len: int
        The maximum length of the context
    search_model: str
        The search model to use - `ada` is most cost effective
    max_rerank: int
        The maximum number of reranking documents to use the search model on

    Returns
    -------
    rank: int
        The rank of the correct context
    token_length: int
        The number of tokens needed to obtain the correct context
    """
    
    try:
        results = openai.Engine(search_model).search(
            search_model=search_model, 
            query=question, 
            max_rerank=max_rerank,
            file=olympics_search_fileid,
            return_metadata=True
        )
        index=-1
        returns = []
        cur_len = 0
        for result in results['data']:
            cur_len += int(result['metadata']) + 4 # we add 4 tokens for the separator `\n\n###\n\n`
            if cur_len > max_len:
                break
            returns.append(result['text'])
            res = result['text'].split('\n')
            if res[0] == title and res[1] == heading:
                index = len(returns) - 1
                break
        return index, cur_len
    except Exception as e:
        #print (e)
        return []
print(check_context("Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay", "Summary", "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", max_len=10000))

컨텍스트를 기반으로 생성된 질문을 활용하여 원래 컨텍스트를 검색할 수 있는 빈도를 추정합니다. 이러한 질문은 시끄럽기 때문에 완벽한 추정치는 아닙니다.

질문과 답변에는 번호가 붙은 글머리 기호가 붙지만 생성 방식으로 인해 첫 번째 숫자가 누락되어 "1"을 추가합니다. 질문(및 답변) 목록으로 이동합니다.

우리는 ada 검색을 사용하여 검색된 섹션의 순위와 전체 관련 섹션을 검색하는 데 필요한 컨텍스트의 토큰 수를 계산합니다.

In [None]:
ada_results = df.apply(lambda x: [
                    check_context( x.title, 
                                   x.heading, 
                                   q[3:],     # remove the number prefix
                                   max_len=1000000, # set a large number to get the full context 
                                   search_model='ada', 
                                   max_rerank=200,
                                 ) 
                    for q in (x.questions).split('\n') # split the questions
                    if len(q) >10 # remove the empty questions
                ], axis=1)
ada_results.head()

In [None]:
out = pd.concat([ada_results], axis=1)
out.columns = ['ada']
out.to_csv('olympics-data/search_engine_results.csv')

In [None]:
def expand_lists(out):
    """
    Expand a pandas series containing lists into a series, where each list element becomes a value on its own

    Input is a row per paragraph, which has multiple questions
    Output is a row per question
    """
    cols = [pd.DataFrame(out[name].tolist()).stack().reset_index(level=1, drop=True).rename(name) for name in out.columns] 
    return pd.concat(cols, axis=1)

out_expanded = expand_lists(out)
out_expanded['rank'] = out_expanded.ada.apply(lambda x: x[0] if x != [] else -2)
out_expanded['tokens'] = out_expanded.ada.apply(lambda x: x[1] if x != [] else -2)


In [None]:
within_2k = (out_expanded.tokens < 2000).mean()
print(f"{within_2k*100:.1f}% of relevant paragraphs are retrieved within the first 2k tokens")

관련 컨텍스트는 이 데이터 세트에서 시간의 74%를 얻을 수 있습니다.

In [None]:
outside_200 = (out_expanded['rank'] == -1).mean()
print(f"{outside_200*100:.1f}% of relevant paragraphs are not retrieved within the first 200 results")

시간의 7.4%는 검색 알고리즘의 키워드 검색 부분이 처음 200개 결과 내에서 관련 컨텍스트를 검색하지 않기 때문입니다.
시간의 18.3%는 시맨틱 검색이 처음 2000개 토큰 내에 관련 컨텍스트를 배치하지 않기 때문입니다.

In [None]:
import matplotlib.pyplot as plt

# plot a histogram, and add axis descriptions and title
out_expanded[(out_expanded['rank'] >=0)&(out_expanded['rank'] <30)]['rank'].hist(bins=29)
plt.xlabel('rank')
plt.ylabel('count')
plt.title('Histogram of ranks of retrieved paragraphs')
plt.show()

In [None]:
out_expanded[(out_expanded.tokens>=0)&(out_expanded.tokens < 2000)]['tokens'].hist(bins=29)
plt.xlabel('tokens')
plt.ylabel('count')
plt.title('Histogram of the number of minimum tokens needed')
plt.show()

컨텍스트가 첫 번째 결과 중 하나로 반환될 가능성이 가장 높고 처음 200-500개 토큰 내에서 반환될 가능성이 가장 높다는 것을 관찰할 수 있습니다.

In [None]:
# normalized value_counts
out_expanded['rank'].value_counts(normalize=True).sort_index()[:13]

각 순위에서 관련 컨텍스트가 반환될 확률. (-2는 처리 오류를 의미하고 -1은 순위가 >200임을 의미)