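## Overview

- A script that uses `gemini` to evaluate the abstracts of research papers
- The `gemini` output is fixed to JSON format
- Evaluation criteria are specified in `rules`
- Definitions of terms and similar items are specified in `definition`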


In [249]:
# Initial setup
!pip install pandas
import pathlib
import textwrap
import google.generativeai as genai
import os
from dotenv import load_dotenv
from IPython.display import display
from IPython.display import Markdown
import pandas as pd
import time
import json



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [250]:
# Create the model instance
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"}
)
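The notebook imports `os` and `load_dotenv` but does not show where the API key is configured. The following is a minimal sketch of one way to read it, assuming the key is stored in an environment variable named `GOOGLE_API_KEY`. Both that variable name and the helper `get_api_key` are assumptions for illustration and do not appear in the original notebook.

```python
import os


def get_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    The variable name is an assumption; adjust it to match your .env file.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# In the notebook this would run before the model is created, for example:
#   load_dotenv()                          # load variables from a .env file
#   genai.configure(api_key=get_api_key())
```

Failing early with an explicit message makes a missing key easier to diagnose than an authentication error raised by the first request.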

In [251]:
# Generate a response with gemini
def generate_response(model, prompt):
    response = model.generate_content(prompt)
    return response

def process_abstracts(model, abstracts, definition, instruction, rules, interval=5):
    results = []
    for abstract in abstracts:
        abstract_id = abstract["abstract_id"]
        content = abstract["content"]
        
        # Build the prompt (create_prompt is defined in a later cell)
        prompt = create_prompt(content, definition, instruction, rules)
        
        # Generate the response
        response = generate_response(model, prompt)
        
        # Store the result
        results.append({
            "abstract_id": abstract_id,
            "response": response.text
        })
        
        # Wait between requests
        time.sleep(interval)
    
    return results
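`process_abstracts` stops at the first failed request, so a single transient error loses the whole batch. One possible mitigation is a retry wrapper with exponential backoff. The sketch below is not part of the original notebook: `call_with_retry` and its parameters are hypothetical, and it retries on any exception, which may be broader than you want in practice.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying failed calls with exponential backoff.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts and
    re-raises the last exception once `retries` retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

Inside the loop, the call could then read `response = call_with_retry(generate_response, model, prompt)`. The `sleep` parameter exists only so the delay can be replaced in tests.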

# Parse the `response` column and split it into rule1, rule2, rule3, ...
def parse_response(response):
    if pd.isna(response):
        return {}
    try:
        # Decode the JSON string
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return {}
    rules = []
    if isinstance(parsed, list):  # top-level list
        if parsed and isinstance(parsed[0], dict):
            rules = parsed[0].get("rules", [])
    elif isinstance(parsed, dict):  # top-level dict
        results = parsed.get("results")
        if isinstance(results, list):  # wrapped in a "results" key
            if results and isinstance(results[0], dict):
                rules = results[0].get("rules", [])
        else:  # "rules" key directly at the top level
            rules = parsed.get("rules", [])
    if not isinstance(rules, list):
        rules = []
    # Return the rules as a dict keyed rule1, rule2, ...
    return {f"rule{i+1}": rule for i, rule in enumerate(rules)}
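`parse_response` accepts whatever list the model returns, so a response with too few answers or a value other than "yes"/"no" would pass through unnoticed. A stricter check could look like the sketch below. The helper `validate_rules` is hypothetical and assumes the `{"results": [{"rules": [...]}]}` shape requested in the prompt.

```python
import json


def validate_rules(response_text, expected_count):
    """Return a normalized list of 'yes'/'no' answers, or None if malformed."""
    try:
        parsed = json.loads(response_text)
        answers = parsed["results"][0]["rules"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return None
    # Reject payloads with the wrong number of answers
    if not isinstance(answers, list) or len(answers) != expected_count:
        return None
    normalized = [str(a).strip().lower() for a in answers]
    # Reject anything that is not a plain yes/no
    if any(a not in ("yes", "no") for a in normalized):
        return None
    return normalized
```

Returning `None` for every malformed case keeps the caller's logic simple: a row is either fully valid or flagged for a re-run.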

In [252]:
file_name = "Biochemistry_Molecular_Biology_low1000"
input_file = f"../data/csv/{file_name}.csv"
output_file = f"../data/result/{file_name}.csv"

# Number of rows to read
num_rows_to_read = 5

try:
    df = pd.read_csv(input_file, encoding="utf-8", nrows=num_rows_to_read)
    print("Data loaded successfully.")
except Exception as e:
    print(f"An error occurred: {e}")

Data loaded successfully.


In [253]:
df["ID"] = df.index
cols = ["ID"] + [col for col in df.columns if col != "ID"]
df = df[cols]
df.head()

Unnamed: 0,ID,Publication Type,Authors,Title,Abstract,DOI
0,0,J,"Athare, SV; Gejji, SP",Regioselectivity in nonsymmetric methyl pentyl...,The present work illustrates regioselective bi...,10.1016/j.jmgm.2019.107960
1,1,J,"Brunetti, M; Mortola, JP",Hypoxic hypometabolism in chicken embryos: con...,"Postnatally, during hypoxia the decrease in ox...",10.1016/j.cbpa.2019.110578
2,2,J,"Du, ZF; Qu, Y; Farrell, NP",Intramolecular platinum migration on a peptide...,We report the migration of platinum ligand uni...,10.1016/j.jinorgbio.2019.110858
3,3,B,"Jahn, D; Geier, A",Transcriptional control of cells by vitamin D ...,,10.1016/B978-0-12-811907-5.00030-0
4,4,J,"Pakravan, M; Shamsollahi, MB","Spatial and temporal joint, partially-joint an...",absectionBackground Three types of sources can...,10.1016/j.jneumeth.2019.108453


In [254]:
# Definitions of terms, etc.
definition = """

"""

# Instruction block
instruction = """
Please answer yes or no if your abstract follows each of the rules in JSON format.

Use this JSON schema:

**JSON schema**:
{
    "results": [
        {
            "rules": ["yes" or "no", ...]
        }
    ]
}

**Example Response**:
{
    "results": [
        {
            "rules": ["yes", "no", "yes"]
        }
    ]
}
"""

# Evaluation criteria
rules = """
Rules:
1. The objectives, methods, results, and conclusions are clearly stated.
2. It avoids the passive voice and uses active expressions.
3. Separating facts from claims.
4. Research background is written
"""

# Prompt builder function
def create_prompt(abstract, definition, instruction, rules):
    return f"""
    {instruction}
    Abstract: {abstract}
    {definition}
    {rules}
    """
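`create_prompt` indents some lines of the prompt and leaves a blank gap when `definition` is empty. Whether that affects the model's answers is not established here. If a cleaner prompt is preferred, one option is to join only the non-empty sections, as in this sketch. `build_prompt` is a hypothetical alternative, not the function used to produce the outputs in this notebook.

```python
def build_prompt(abstract, definition, instruction, rules):
    """Join the non-empty prompt sections with blank lines, without extra indentation."""
    sections = [instruction, f"Abstract: {abstract}", definition, rules]
    return "\n\n".join(s.strip() for s in sections if s and s.strip())


# Example with an empty definition, as in this notebook
print(build_prompt("Some abstract text.", "\n\n", "Answer yes or no.", "Rules:\n1. Example rule"))
```

The section order matches `create_prompt`: instruction, abstract, definition, then rules.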
    


In [255]:
abstracts = [
    {"abstract_id": row["ID"], "content": row["Abstract"]}
    for _, row in df.dropna(subset=["Abstract"]).iterrows()
]

In [256]:
abstracts

[{'abstract_id': 0,
  'content': 'The present work illustrates regioselective binding of nonsymmetric axle BuX (X = F, Cl, Br, CN) and 5-bromovaleronitrile (BVN) to the non-symmetric methyl pentyl pillar[5]arene (MPP5). Theoretical calculations reveal that the guest encapsulation within MPP5 is spontaneous and the conformer showing X weakly bound to pentyl rim of MPP5 is favoured over its other conformer wherein it interacts with methyl rim of the host. The noncovalent interactions namely C-H-pi, C-H-X and H\\-H prevail over C- H center dot center dot center dot O hydrogen bonding in the complexes of MPP5. The manifestations of these to vibrational spectra obtained from the present theory are discussed. The strength of host-guest binding further is shown to correlate well with weakening of the C-X bond through natural bond orbital analyses. (C) 2019 Elsevier Inc. All rights reserved.'},
 {'abstract_id': 1,
  'content': 'Postnatally, during hypoxia the decrease in oxygen consumption ((V

In [257]:
%%time
raw_responses = process_abstracts(model, abstracts, definition, instruction, rules)

CPU times: user 30.9 ms, sys: 28.1 ms, total: 59.1 ms
Wall time: 23.2 s


In [258]:
raw_responses

[{'abstract_id': 0,
  'response': '{"results": [{"rules": ["yes", "yes", "yes", "no"]}]}'},
 {'abstract_id': 1,
  'response': '{"results": [{"rules": ["yes", "yes", "yes", "no"]}]}'},
 {'abstract_id': 2,
  'response': '{"results": [{"rules": ["yes", "yes", "yes", "no"]}]}'},
 {'abstract_id': 4,
  'response': '{"results": [{"rules": ["yes", "yes", "yes", "yes"]}]}'}]
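To get a quick view of how often each rule is satisfied, the raw responses can be tallied directly. The sketch below uses only the standard library and reproduces the four responses shown above as sample data. `tally_yes` is a hypothetical helper and assumes every response follows the `{"results": [{"rules": [...]}]}` shape.

```python
import json
from collections import Counter


def tally_yes(raw_responses):
    """Count 'yes' answers per rule position across all responses."""
    counts = Counter()
    for item in raw_responses:
        rules = json.loads(item["response"])["results"][0]["rules"]
        for i, answer in enumerate(rules, start=1):
            # Adding 0 still creates the key, so rules with no 'yes' are listed
            counts[f"rule{i}"] += int(answer == "yes")
    return dict(counts)


# Sample data copied from the output above
sample = [
    {"abstract_id": 0, "response": '{"results": [{"rules": ["yes", "yes", "yes", "no"]}]}'},
    {"abstract_id": 1, "response": '{"results": [{"rules": ["yes", "yes", "yes", "no"]}]}'},
    {"abstract_id": 2, "response": '{"results": [{"rules": ["yes", "yes", "yes", "no"]}]}'},
    {"abstract_id": 4, "response": '{"results": [{"rules": ["yes", "yes", "yes", "yes"]}]}'},
]
print(tally_yes(sample))  # {'rule1': 4, 'rule2': 4, 'rule3': 4, 'rule4': 1}
```

With only four abstracts these counts say little, but the same tally applies unchanged to a larger run.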

In [259]:
# Convert to a DataFrame
results_df = pd.DataFrame(raw_responses)

# Parse `response` and create one new column per rule
rules_df = results_df["response"].apply(parse_response).apply(pd.Series)

# Attach the rule columns alongside `abstract_id`
results_df = pd.concat([results_df, rules_df], axis=1).drop(columns=["response"])

# Merge the evaluation results into the original DataFrame
merged_df = df.merge(results_df, left_on="ID", right_on="abstract_id", how="left").drop(columns=["abstract_id"])


In [260]:
merged_df.head()

Unnamed: 0,ID,Publication Type,Authors,Title,Abstract,DOI,rule1,rule2,rule3,rule4
0,0,J,"Athare, SV; Gejji, SP",Regioselectivity in nonsymmetric methyl pentyl...,The present work illustrates regioselective bi...,10.1016/j.jmgm.2019.107960,yes,yes,yes,no
1,1,J,"Brunetti, M; Mortola, JP",Hypoxic hypometabolism in chicken embryos: con...,"Postnatally, during hypoxia the decrease in ox...",10.1016/j.cbpa.2019.110578,yes,yes,yes,no
2,2,J,"Du, ZF; Qu, Y; Farrell, NP",Intramolecular platinum migration on a peptide...,We report the migration of platinum ligand uni...,10.1016/j.jinorgbio.2019.110858,yes,yes,yes,no
3,3,B,"Jahn, D; Geier, A",Transcriptional control of cells by vitamin D ...,,10.1016/B978-0-12-811907-5.00030-0,,,,
4,4,J,"Pakravan, M; Shamsollahi, MB","Spatial and temporal joint, partially-joint an...",absectionBackground Three types of sources can...,10.1016/j.jneumeth.2019.108453,yes,yes,yes,yes
