# Prompt engineering and ChatGPT tutorial
- Get an overview of prompt engineering
- Try out the tips introduced here and think about how they could be used in the app you want to build


## About prompt engineering
- [OpenAI Cookbook](https://github.com/openai/openai-cookbook)
- [OpenAI Prompt engineering](https://help.openai.com/en/collections/3675942-prompt-engineering)
    - https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
- [Lil' Log, Prompt Engineering (an article on prompt engineering by an author who writes survey-style blog posts on many topics)](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)
- [Prompt Engineering Guide](https://github.com/dair-ai/Prompt-Engineering-Guide)
    - https://github.com/dair-ai/Prompt-Engineering-Guide/blob/main/lecture/Prompt-Engineering-Lecture-Elvis.pdf
    - https://www.promptingguide.ai/notebooks
- [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts)

## Have ChatGPT summarize the key points of a paper as bullet points

In [1]:
import os
import requests
from pathlib import Path
from urllib.parse import urlparse

pdf_url = 'https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf'
r = requests.get(pdf_url)
if r.status_code != 200:
    raise ValueError(
        "Check the url of your file; returned status code %s"
        % r.status_code
    )

# Get the file name from the URL path
parsed_url = urlparse(pdf_url)
file_name = os.path.basename(parsed_url.path)

# Create the pdf_files directory and save the file there
pdf_files_directory = Path("pdf_files")
pdf_files_directory.mkdir(exist_ok=True)
file_path = pdf_files_directory / file_name

with open(file_path, "wb") as f:
    f.write(r.content)

print(f"File saved at: {file_path}")
from pdfminer.high_level import extract_text
text = extract_text(file_path)

File saved at: pdf_files/language_understanding_paper.pdf


In [2]:
print(text[:300])

Improving Language Understanding
by Generative Pre-Training

Alec Radford
OpenAI
alec@openai.com

Karthik Narasimhan
OpenAI
karthikn@openai.com

Tim Salimans
OpenAI
tim@openai.com

Ilya Sutskever
OpenAI
ilyasu@openai.com

Abstract

Natural language understanding comprises a wide range of diverse tas


In [3]:
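def is_digit_or_uppercase_word(element):
    """Return True if the token is all digits or starts with an uppercase letter."""
    return element.isdigit() or element[0].isupper()

# Treat a blank-line-separated chunk as a heading candidate when every
# whitespace-separated token is a number or a capitalized word.
chunk_idx = []
chunk_key_names = []
for idx, chunk in enumerate(text.split('\n\n')):
    chunk_elements = chunk.split()
    chunk_judgement = [is_digit_or_uppercase_word(element) for element in chunk_elements]
    # Skip very short chunks such as page numbers.
    if all(chunk_judgement) and len(chunk) > 3:
        print(chunk)
        print('chunk_idx:', idx)
        chunk_idx.append(idx)
        chunk_key_names.append(chunk)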

Abstract
chunk_idx: 5
Introduction
chunk_idx: 8
2 Related Work
chunk_idx: 16
3 Framework
chunk_idx: 24
4 Experiments
chunk_idx: 57
Task
chunk_idx: 62
Datasets
chunk_idx: 63
Method
chunk_idx: 76
MNLI-m MNLI-mm SNLI
chunk_idx: 77
SciTail QNLI RTE
chunk_idx: 78
Method
chunk_idx: 106
Story Cloze RACE-m RACE-h RACE
chunk_idx: 107
Method
chunk_idx: 128
Classiﬁcation
chunk_idx: 129
Semantic Similarity
chunk_idx: 130
GLUE
chunk_idx: 131
5 Analysis
chunk_idx: 171
Method
chunk_idx: 177
Avg. Score
chunk_idx: 178
6 Conclusion
chunk_idx: 209
References
chunk_idx: 211
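
The heuristic above also matches table header cells such as "Method" or "MNLI-m MNLI-mm SNLI". One possible refinement, sketched here, keeps only candidates that look like numbered headings ("2 Related Work") or that appear in a small allow-list of unnumbered headings. `UNNUMBERED_HEADINGS`, `NUMBERED_HEADING`, and `looks_like_section_heading` are hypothetical names, and the allow-list is an assumption that happens to fit this paper's layout, not a general rule.

```python
import re

# Hypothetical allow-list of headings that appear without a section number.
UNNUMBERED_HEADINGS = {"Abstract", "Introduction", "References"}

# A numbered heading: one or more digits, whitespace, then a capitalized title.
NUMBERED_HEADING = re.compile(r"^\d+\s+[A-Z][A-Za-z ]*$")


def looks_like_section_heading(chunk: str) -> bool:
    """Return True for numbered headings or allow-listed unnumbered ones."""
    chunk = chunk.strip()
    return chunk in UNNUMBERED_HEADINGS or bool(NUMBERED_HEADING.match(chunk))


candidates = ["Abstract", "2 Related Work", "Method", "MNLI-m MNLI-mm SNLI", "6 Conclusion"]
headings = [c for c in candidates if looks_like_section_heading(c)]
print(headings)
```

Applied to the candidates printed above, a filter like this would drop the table headers while keeping the real section titles, at the cost of missing any unnumbered heading that is not in the allow-list.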


In [4]:
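chunk_dict = dict()
num_chunks = len(text.split('\n\n'))

# Map each heading to its (start, end) chunk range; the last heading runs to the end of the text.
# Note: repeated names such as 'Method' share one key, so only the last occurrence is kept.
for idx, chunk_key_name in enumerate(chunk_key_names):
    if idx == len(chunk_key_names) - 1:
        chunk_dict[chunk_key_name] = (chunk_idx[idx], num_chunks)
    else:
        chunk_dict[chunk_key_name] = (chunk_idx[idx], chunk_idx[idx+1])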

In [5]:
chunk_dict

{'Abstract': (5, 8),
 'Introduction': (8, 16),
 '2 Related Work': (16, 24),
 '3 Framework': (24, 57),
 '4 Experiments': (57, 62),
 'Task': (62, 63),
 'Datasets': (63, 76),
 'Method': (177, 178),
 'MNLI-m MNLI-mm SNLI': (77, 78),
 'SciTail QNLI RTE': (78, 106),
 'Story Cloze RACE-m RACE-h RACE': (107, 128),
 'Classiﬁcation': (129, 130),
 'Semantic Similarity': (130, 131),
 'GLUE': (131, 171),
 '5 Analysis': (171, 177),
 'Avg. Score': (178, 209),
 '6 Conclusion': (209, 211),
 'References': (211, 332)}
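
In the dictionary above, "Method" appears only once with the range (177, 178), although the heading was detected four times: a dict keeps one value per key, so later occurrences overwrite earlier ones. A minimal sketch of one alternative is to store the ranges in a list of tuples. `build_section_ranges` is a hypothetical helper, and the toy values are a slice of the indices printed earlier.

```python
from typing import List, Tuple


def build_section_ranges(
    names: List[str], starts: List[int], total_chunks: int
) -> List[Tuple[str, int, int]]:
    """Pair each heading with its (start, end) chunk range, keeping duplicates."""
    ranges = []
    for i, (name, start) in enumerate(zip(names, starts)):
        # The last heading runs to the end of the chunk list.
        end = starts[i + 1] if i + 1 < len(starts) else total_chunks
        ranges.append((name, start, end))
    return ranges


# A slice of the detected headings, including the repeated "Method".
names = ["Datasets", "Method", "MNLI-m MNLI-mm SNLI", "SciTail QNLI RTE", "Method"]
starts = [63, 76, 77, 78, 106]
ranges = build_section_ranges(names, starts, total_chunks=107)
for name, start, end in ranges:
    print(name, start, end)
```

A list preserves every occurrence in document order; the trade-off is that lookups by name need a scan or a secondary index.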

In [23]:
sections = dict()
chunks = text.split('\n\n')  # split once instead of on every iteration
for chunk_key_name, (start, end) in chunk_dict.items():
    # Join the chunks between this heading and the next one, skipping the heading itself.
    sentences = ' '.join(chunks[start+1:end])
    sentences = sentences.replace('\n', ' ')
    sections[chunk_key_name] = sentences

In [7]:
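# !pip install --upgrade openai
# Create an API key at https://platform.openai.com/account/api-keys

# Message format reference:
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb
# Note: openai.ChatCompletion is the interface of openai package versions before 1.0.
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")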

MODEL = "gpt-3.5-turbo"
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Orange."},
    ],
    temperature=0,
)

response

<OpenAIObject chat.completion id=chatcmpl-71B86N0lXY2EBRonp9l56cZTCzP3L at 0x10c505540> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Orange who?",
        "role": "assistant"
      }
    }
  ],
  "created": 1680515466,
  "id": "chatcmpl-71B86N0lXY2EBRonp9l56cZTCzP3L",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 3,
    "prompt_tokens": 39,
    "total_tokens": 42
  }
}
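
The request above passes a system message followed by alternating user and assistant turns. As a minimal sketch of that structure, the hypothetical helper `build_messages` below assembles the same list from a system prompt and a flat list of turns; it assumes the conversation always starts with the user and makes no API call.

```python
from typing import Dict, List


def build_messages(system_prompt: str, turns: List[str]) -> List[Dict[str, str]]:
    """Build a chat `messages` list: a system message, then alternating
    user/assistant turns starting with the user."""
    messages = [{"role": "system", "content": system_prompt}]
    for i, content in enumerate(turns):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": content})
    return messages


messages = build_messages(
    "You are a helpful assistant.",
    ["Knock knock.", "Who's there?", "Orange."],
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as the `messages` argument used in the cell above.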

In [24]:
# sections['Abstract']

In [9]:
gpt_prompt = f"""
Summarize the text below as a bullet point list of the most important points.

Text: ###
{sections['Introduction']}
###
"""

response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": gpt_prompt},
    ],
    temperature=0,
)

response

<OpenAIObject chat.completion id=chatcmpl-71B87kIMQJVFDQ7CMaKR72HwTpZKu at 0x107e964a0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "- Learning effectively from raw text is important in natural language processing (NLP)\n- Deep learning methods require substantial amounts of manually labeled data, which limits their applicability in many domains\n- Models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation\n- Learning good representations in an unsupervised fashion can provide a significant performance boost\n- Pre-trained word embeddings have been extensively used to improve performance on a range of NLP tasks\n- Leveraging more than word-level information from unlabeled text is challenging due to uncertainties in optimization objectives and transfer techniques\n- The paper proposes a semi-supervised approach for language understanding tasks usin

In [10]:
print(response['choices'][0]['message']['content'])

- Learning effectively from raw text is important in natural language processing (NLP)
- Deep learning methods require substantial amounts of manually labeled data, which limits their applicability in many domains
- Models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation
- Learning good representations in an unsupervised fashion can provide a significant performance boost
- Pre-trained word embeddings have been extensively used to improve performance on a range of NLP tasks
- Leveraging more than word-level information from unlabeled text is challenging due to uncertainties in optimization objectives and transfer techniques
- The paper proposes a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning
- The approach uses a two-stage training procedure and the Transformer model architecture
- The approach outperforms discriminatively train
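
The prompt above puts the instruction first and wraps the input text in `###` delimiters so the model can tell the two apart. A minimal sketch of that pattern as a reusable function follows; `build_summary_prompt` and its `delimiter` parameter are hypothetical names, and the sample sentence is a placeholder, not text from the paper.

```python
def build_summary_prompt(text: str, delimiter: str = "###") -> str:
    """Place a fixed summarization instruction above the delimited input text."""
    instruction = (
        "Summarize the text below as a bullet point list of the most important points."
    )
    return f"{instruction}\n\nText: {delimiter}\n{text.strip()}\n{delimiter}\n"


prompt = build_summary_prompt("  Placeholder sentence standing in for a paper section.  ")
print(prompt)
```

Keeping the instruction in one place makes it easier to compare prompt variants across sections.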

In [11]:
summarys_en = dict()
summarys_jp = dict()
for section_name in sections.keys():
    gpt_prompt = f'''
    Summarize the text below as a bullet point list of the most important points.

    Text: """
    {sections[section_name]}
    """
    '''

    response = openai.ChatCompletion.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": gpt_prompt},
        ],
        temperature=0,
    )

    summarys_en[section_name] = response['choices'][0]['message']['content']

    response = openai.ChatCompletion.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Please translate texts into Japanese. 英語を日本語に翻訳して下さい"},
            {"role": "user", "content": summarys_en[section_name]},
        ],
        temperature=0,
    )
    summarys_jp[section_name] = response['choices'][0]['message']['content']
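
Some entries in `chunk_dict`, such as `'Task': (62, 63)`, span a single chunk, so the slice between the heading and the next heading is empty and the loop above sends an empty text to the model. One way to avoid those requests, sketched here with a hypothetical helper `non_empty_sections`, is to filter the dictionary before looping. The `min_chars` threshold is an assumption to tune.

```python
from typing import Dict


def non_empty_sections(sections: Dict[str, str], min_chars: int = 1) -> Dict[str, str]:
    """Keep only sections whose stripped text has at least `min_chars` characters."""
    return {
        name: body
        for name, body in sections.items()
        if len(body.strip()) >= min_chars
    }


toy_sections = {"Abstract": "Some abstract text.", "Task": "", "Method": "   "}
kept = non_empty_sections(toy_sections)
print(list(kept))
```

Iterating over the filtered dictionary instead of `sections` would skip the empty entries without changing the rest of the loop.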


In [25]:
# for section_name in summarys_en.keys():
#     print(section_name)
#     print(summarys_en[section_name])
#     print(summarys_jp[section_name])

## Building a ChatPDF-like system
- [ChatPDF](https://www.chatpdf.com/)
    - https://pbs.twimg.com/media/FsPqJy6aUAINwq0?format=jpg&name=large

In [13]:
import os
import requests
from pathlib import Path
from urllib.parse import urlparse

pdf_url = 'https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf'
r = requests.get(pdf_url)
if r.status_code != 200:
    raise ValueError(
        "Check the url of your file; returned status code %s"
        % r.status_code
    )

# Get the file name from the URL path
parsed_url = urlparse(pdf_url)
file_name = os.path.basename(parsed_url.path)

# Create the pdf_files directory and save the file there
pdf_files_directory = Path("pdf_files")
pdf_files_directory.mkdir(exist_ok=True)
file_path = pdf_files_directory / file_name

with open(file_path, "wb") as f:
    f.write(r.content)

print(f"File saved at: {file_path}")
from pdfminer.high_level import extract_text
text = extract_text(file_path)

File saved at: pdf_files/language_understanding_paper.pdf


In [26]:
text = text.replace('\n', ' ')
# text

In [15]:
# !pip install -U tiktoken
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
num_tokens = len(encoding.encode(text))
# https://platform.openai.com/docs/models/gpt-3-5
# gpt-3.5-turbo accepts at most 4,096 tokens per request
# GPT-4 variants accept up to 8k or 32k tokens
print(num_tokens)

10631
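
At 10,631 tokens the full text does not fit in a 4,096-token request, so it has to be split. A minimal sketch of greedy splitting under a token budget follows. `split_by_token_budget` is a hypothetical helper, and the default `count_tokens` counts whitespace-separated words as a stand-in for a real tokenizer, which is an assumption made to keep the example self-contained.

```python
from typing import Callable, List


def split_by_token_budget(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks whose token
    count, as measured by `count_tokens`, stays within `max_tokens`.
    A single word that exceeds the budget still becomes its own chunk."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_by_token_budget("a b c d e f g", max_tokens=3)
print(chunks)  # ['a b c', 'd e f', 'g']
```

With the `encoding` object from the cell above, passing `count_tokens=lambda s: len(encoding.encode(s))` would measure real tokens instead of words. The cells that follow take a different route and split the PDF by page.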


In [16]:
# !pip install 'openai[embeddings]'

In [17]:
from dataclasses import dataclass
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
from typing import List

# This import assumes the embeddings helper module is available; in practice, import whichever module is appropriate for your setup.
from openai.embeddings_utils import get_embedding, cosine_similarity

@dataclass
class Page:
    page_number: int
    text: str
    embedding: List[float]

def extract_page_text(pdf_file):
    pages_text = []
    for page_layout in extract_pages(pdf_file):
        single_page_text = ''
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                single_page_text += element.get_text().replace('\n', ' ')
        pages_text.append(single_page_text)
    return pages_text
file_path = 'pdf_files/language_understanding_paper.pdf'
pages_text = extract_page_text(file_path)
# for page_text in pages_text:
#     print(len(encoding.encode(page_text)))

pages = []
for idx, page_text in enumerate(pages_text):
    # https://github.com/openai/openai-cookbook/blob/main/examples/Semantic_text_search_using_embeddings.ipynb
    embedding = get_embedding(
        page_text,
        engine="text-embedding-ada-002"
    )
    page = Page(page_number=idx + 1, text=page_text, embedding=embedding)
    pages.append(page)

In [18]:
import pandas as pd

# Convert the pages list into a DataFrame
pages_df = pd.DataFrame(pages)
pages_df

|   | page_number | text | embedding |
|---|---|---|---|
| 0 | 1 | Improving Language Understanding by Generative... | [-0.010624888353049755, -0.0002075700904242694... |
| 1 | 2 | In this paper, we explore a semi-supervised ap... | [-0.001758279511705041, -0.002217849250882864,... |
| 2 | 3 | pre-trained language or machine translation mo... | [-0.005748812574893236, 0.0015207912074401975,... |
| 3 | 4 | Figure 1: (left) Transformer architecture and ... | [-0.0037100473418831825, -0.011732332408428192... |
| 4 | 5 | Table 1: A list of the different tasks and dat... | [-0.00776058342307806, 0.006178564857691526, 0... |
| 5 | 6 | Table 2: Experimental results on natural langu... | [-0.01649334840476513, 0.01438925787806511, 0.... |
| 6 | 7 | Table 4: Semantic similarity and classiﬁcation... | [-0.005145553965121508, -0.007219440769404173,... |
| 7 | 8 | Table 5: Analysis of various model ablations o... | [-0.018282189965248108, 0.0034548065159469843,... |
| 8 | 9 | [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. L... | [-0.02173006534576416, 0.007508637383580208, 0... |
| 9 | 10 | [24] F. Jiao, S. Wang, C.-H. Lee, R. Greiner, ... | [-0.004755450412631035, -0.0022865389473736286... |


In [27]:
# Query string and its embedding
# (the Japanese query means "What are the key points of this paper's method?")
query = "この論文の手法の重要な点は何ですか？"
query_embedding = get_embedding(
    query,
    engine="text-embedding-ada-002"
)

# Compute the cosine similarity between each page and the query
pages_df['similarity'] = pages_df['embedding'].apply(lambda x: cosine_similarity(x, query_embedding))

# Sort by similarity in descending order
sorted_pages_df = pages_df.sort_values(by='similarity', ascending=False)

# Take the top-n pages
n = 3
top_n_pages = sorted_pages_df.head(n)

# Print the results
print(f"Top {n} pages similar to the query '{query}':")
context_texts = []
for index, row in top_n_pages.iterrows():
    print(f"\nPage {row.page_number} (similarity: {row.similarity}):")
    # print(row.text)
    context_texts.append(row.text)

Top 3 pages similar to the query 'この論文の手法の重要な点は何ですか？':

Page 7 (similarity: 0.7406577225533086):

Page 1 (similarity: 0.7346770669052388):

Page 6 (similarity: 0.7345843612395349):
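
The ranking above relies on cosine similarity, the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch of the same ranking step on toy two-dimensional vectors follows. `cosine` and `top_n` are hypothetical helpers written for illustration, not the functions from `openai.embeddings_utils`, and they assume non-zero vectors.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(
    query: Sequence[float], vectors: List[Sequence[float]], n: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the n most similar vectors."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = top_n([1.0, 0.0], toy_vectors, n=2)
print(ranking)
```

Here the first vector points in the same direction as the query and ranks first, followed by the diagonal vector; the orthogonal vector is left out.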


In [20]:
# https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
SEPARATOR = "\n* "
header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
gpt_prompt = ""
gpt_prompt += header
for context_text in context_texts:
    gpt_prompt += SEPARATOR
    gpt_prompt += context_text
gpt_prompt += "\n\n Q: " + query + "\n A:"
print(gpt_prompt)

Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* Table 4: Semantic similarity and classiﬁcation results, comparing our model with current state-of-the- art methods. All task evaluations in this table were done using the GLUE benchmark. (mc= Mathews correlation, acc=Accuracy, pc=Pearson correlation) Method Classiﬁcation Semantic Similarity GLUE CoLA SST2 MRPC STSB QQP (F1) (F1) (mc) (acc) (pc) Sparse byte mLSTM [16] TF-KLD [23] ECNU (mixed ensemble) [60] Single-task BiLSTM + ELMo + Attn [64] Multi-task BiLSTM + ELMo + Attn [64] Finetuned Transformer LM (ours) - - - 35.0 18.9 45.4 93.2 - - 90.2 91.6 91.3 - 86.0 - 80.2 83.5 82.3 - - 81.0 55.5 72.8 82.0 - - - - - - 66.1 63.3 70.3 64.8 68.9 72.8 Overall, our approach achieves new state-of-the-art results in 9 out of the 12 datasets we evaluate on, outperforming ensembles in many cases. Our results also indicate that our approa
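
The prompt above concatenates every selected page, so longer pages could push the request past the model's context limit. A minimal sketch of one safeguard is to stop adding contexts once a budget is reached. `build_qa_prompt` and `max_context_chars` are hypothetical names, and the character budget is an assumed stand-in for a token budget.

```python
from typing import List

HEADER = (
    "Answer the question as truthfully as possible using the provided context, "
    'and if the answer is not contained within the text below, say "I don\'t know."'
    "\n\nContext:\n"
)
SEPARATOR = "\n* "


def build_qa_prompt(contexts: List[str], query: str, max_context_chars: int) -> str:
    """Append contexts in ranked order until the next one would exceed the budget."""
    prompt = HEADER
    used = 0
    for context in contexts:
        if used + len(context) > max_context_chars:
            break
        prompt += SEPARATOR + context
        used += len(context)
    return prompt + "\n\n Q: " + query + "\n A:"


prompt = build_qa_prompt(
    ["short page", "x" * 50], "What is the method?", max_context_chars=20
)
print(prompt)
```

Because the contexts arrive sorted by similarity, truncating from the end drops the least relevant pages first.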

In [21]:
MODEL = "gpt-3.5-turbo"
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Please answer in Japanese. 回答は日本語でお願いします。"},
        {"role": "user", "content": gpt_prompt},
    ],
    temperature=0,
)

In [22]:
print(response['choices'][0]['message']['content'])

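この論文の手法の重要な点は、大規模な未ラベルのテキストコーパスを用いた言語モデルの生成的事前学習によって、多様な自然言語理解タスクにおいて高い性能を発揮することができることです。また、タスクに応じた入力変換を行うことで、モデルアーキテクチャを最小限変更することで効果的な転移学習を実現しています。

(English translation: The key point of this paper's method is that generative pre-training of a language model on a large corpus of unlabeled text achieves high performance across diverse natural language understanding tasks. In addition, by applying task-specific input transformations, it achieves effective transfer learning with minimal changes to the model architecture.)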
