# 研究課題の主要データをDBに保存するプログラム

## プログラムの概要

- 前提：研究課題のXMLファイルが./xmlフォルダに保存されていること
- 内定時点のデータを保存する
- 原則としてデータはsummary要素から取得する。研究機関データのみgrantlist要素から取得する。
研究者情報は、grantAward/summary/memberとgrantAward/memberList/memberの2箇所にある。
前者は同じ人は複数出てこなくてまとまっているが、所属機関等のコードがない。
後者は所属機関コードがあるが、毎年度の実績報告書があるので同じ人が複数回出てくる。
差し当たって前者からデータを取得することにする。そのうち余裕が出たら、後者のデータと突合したい。

### 流れ

1. grantaward : 研究課題メインになる部分。課題番号、研究種目、開始年度、終了年度、直接経費総額など。


- 部品1：研究課題データのうち、課題番号や研究種目など、変更にならない項目で、課題番号に対して一対一になる項目
- 部品2：採択年度の研究機関
- 部品3：採択年度の研究代表者

上記の3つの部品を課題番号をキーにして、結合して一つのテーブルを作り、DBに書き込む

以下のテーブルは、grantawardに対して、基本的に一対多のリレーションになっている。DBに書き込む。

2. grantaward_member : 研究代表者、研究分担者など
3. grantaward_field : 研究分野。系分野分科細目表に基づくもの。2017年度まで。
4. grantaward_review_section : 審査区分。審査区分表に基づくもの。2018年度以降。
5. grantaward_annual : 年度ごとの直接経費金額
6. grantaward_keyword : 研究課題のキーワード
7. grantaward_paragraph : 研究概要等のテキストデータ
8. grantaward_product: 研究成果物

## 事前準備

In [1]:
import configparser
import os
import pickle
import re
import shutil
from glob import glob

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from lxml import etree
from sqlalchemy import create_engine
from sqlalchemy.types import Date, Integer, String, BigInteger
from tqdm import tqdm_notebook as tqdm

In [2]:
# DB設定
config = configparser.ConfigParser()
config.read("../kaken_parse_grants_masterxml/config.ini")
username = config["mariadb"]["username"]
password = config["mariadb"]["password"]
url = (
    "mysql+pymysql://"
    + username
    + ":"
    + password
    + "@localhost:3306/"
    + "kaken"
    + "?charset=UTF8MB4"
)
engine = create_engine(url, echo=True)

データセットを作成する年度を指定する

In [3]:
startyear = 2018
endyear = 2020

## XMLファイルからデータ抽出

関数を定義する

In [4]:
# 研究課題の主な項目
def kadai(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    kadailist = []
    for grantAward in tree.iterfind("grantAward"):
        projecttype = grantAward.get("projectType")
        awardnumber = grantAward.get("awardNumber")
        summary = grantAward.find("summary[@xml:lang='ja']", nsmap)
        projectstatus = summary.find("projectStatus")
        try:
            projectstatus_fiscalyear = projectstatus.get("fiscalYear")
        except AttributeError:
            projectstatus_fiscalyear = None
        try:
            projectstatus_statuscode = projectstatus.get("statusCode")
        except AttributeError:
            projectstatus_statuscode = None
        startfiscalyear = summary.find("periodOfAward").get("searchStartFiscalYear")
        endfiscalyear = summary.find("periodOfAward").get("searchEndFiscalYear")
        try:
            category_niicode = summary.find("category").get("niiCode")
        except AttributeError:
            category_niicode = None
        try:
            category = summary.find("category").text
        except AttributeError:
            category = None
        try:
            section_niicode = summary.find("section").get("niiCode")
        except AttributeError:
            section_niicode = None
        try:
            section = summary.find("section").text
        except AttributeError:
            section = None
        try:
            title_ja = summary.find("title").text
        except AttributeError:
            title_ja = None
        try:
            title_en = summary.find("title").text
        except AttributeError:
            title_en = None
        try:
            directcost = summary.find("overallAwardAmount/directCost").text
        except AttributeError:
            directcost = None
        row = [
            awardnumber,
            projecttype,
            projectstatus_fiscalyear,
            projectstatus_statuscode,
            startfiscalyear,
            endfiscalyear,
            category_niicode,
            category,
            section_niicode,
            section,
            title_ja,
            title_en,
            directcost,
        ]
        kadailist.append(row)
    dumpfilename = (
        "dump_kadai/main/main_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(kadailist, f)

In [5]:
# 研究代表者が所属する研究機関
def institution(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    institutionlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        grantlist = grantAward.find("grantList")
        try:
            for grant in grantlist.iterfind("grant[@xml:lang='ja']", nsmap):
                fiscalyear = grant.get("fiscalYear")
                grant_sequence = grant.get("sequence")
                for institution in grant.iterfind("institution"):
                    institution_sequence = institution.get("sequence")
                    institution_niicode = institution.get("niiCode")
                    institution_mextcode = institution.get("mextCode")
                    institution_jspscode = institution.get("jspsCode")
                    institution_name = institution.text
                    row = [
                        awardnumber,
                        fiscalyear,
                        grant_sequence,
                        institution_sequence,
                        institution_niicode,
                        institution_mextcode,
                        institution_jspscode,
                        institution_name,
                    ]
                    institutionlist.append(row)
        except AttributeError:
            row = [awardnumber] + [None] * 7
    dumpfilename = (
        "dump_kadai/institution/institution_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(institutionlist, f)

In [6]:
# 研究代表者等の研究者番号等
def member(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    memberlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        summary = grantAward.find("summary[@xml:lang='ja']", nsmap)
        for member in summary.iterfind("member", nsmap):
            sequence = member.get("sequence")
            try:
                participate = member.get("participate")
            except AttributeError:
                participate = None
            eradcode = member.get("eradCode")
            role = member.get("role")
            try:
                fullname = member.find("personalName/fullName").text
            except AttributeError:
                fullname = None
            try:
                familyname = member.find("personalName/familyName").text
            except AttributeError:
                familyname = None
            try:
                givenname = member.find("personalName/givenName").text
            except AttributeError:
                givenname = None
            try:
                familyname_yomi = member.find("personalName/familyName").get("yomi")
            except AttributeError:
                familyname_yomi = None
            try:
                givenname_yomi = member.find("personalName/givenName").get("yomi")
            except AttributeError:
                givenname_yomi = None
            row = [
                awardnumber,
                sequence,
                participate,
                eradcode,
                role,
                fullname,
                familyname,
                givenname,
                familyname_yomi,
                givenname_yomi,
            ]
            memberlist.append(row)
    dumpfilename = (
        "dump_kadai/member/member_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(memberlist, f)

In [7]:
# 系分野分科細目表に基づく研究分野データ
def field(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    fieldlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        summary = grantAward.find("summary[@xml:lang='ja']", nsmap)
        for field in summary.iterfind("field"):
            field_sequence = field.get("sequence")
            field_path = field.get("path")
            field_niicode = field.get("niiCode")
            field_table = field.get("fieldTable")
            field_name = field.text
            row = [
                awardnumber,
                field_sequence,
                field_path,
                field_niicode,
                field_table,
                field_name,
            ]
            fieldlist.append(row)
    dumpfilename = (
        "dump_kadai/field/field_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(fieldlist, f)

In [8]:
# 審査区分表に基づく研究分野
def review_section(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    review_secitonlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        summary = grantAward.find("summary[@xml:lang='ja']", nsmap)
        for review_section in summary.iterfind("review_section", nsmap):
            review_section_sequence = review_section.get("sequence")
            review_section_niicode = review_section.get("niiCode")
            review_section_table_type = review_section.get("tableType")
            review_section_name = review_section.text
            row = [
                awardnumber,
                review_section_sequence,
                review_section_niicode,
                review_section_table_type,
                review_section_name,
            ]
            review_secitonlist.append(row)
        dumpfilename = (
            "dump_kadai/review_section/review_section_"
            + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
            + ".dump"
        )
    with open(dumpfilename, "wb") as f:
        pickle.dump(review_secitonlist, f)

In [9]:
# 年度ごとの直接経費金額
def annual(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    directcostlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        for awardamountlist in grantAward.iterfind("awardAmountList"):
            sequence = awardamountlist.get("sequence")
            for awardamount in awardamountlist.iterfind("awardAmount"):
                try:
                    fiscalyear = awardamount.get("fiscalYear")
                except AttributeError:
                    fiscalyear = None
                try:
                    directcost = awardamount.find("directCost").text
                except AttributeError:
                    directcost = None
                row = [awardnumber, sequence, fiscalyear, directcost]
                directcostlist.append(row)
    dumpfilename = (
        "dump_kadai/annual/annual_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(directcostlist, f)

In [10]:
# 研究課題のキーワード
def keyword(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    keywordlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        try:
            keywordList = grantAward.find("summary[@xml:lang='ja']/keywordList", nsmap)
            for keyword in keywordList.iterfind("keyword"):
                keyword_sequence = keyword.get("sequence")
                keyword_text = keyword.text
                row = [awardnumber, keyword_sequence, keyword_text]
                keywordlist.append(row)
        except AttributeError:
            row = [awardnumber] + [None] * 2
    dumpfilename = (
        "dump_kadai/keyword/keyword_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(keywordlist, f)

In [11]:
# 研究課題のテキストデータ
def paragraph(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    textlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        summary = grantAward.find("summary[@xml:lang='ja']", nsmap)
        try:
            for paragraphlist in summary.iterfind("paragraphList"):
                paragraphlist_sequence = paragraphlist.get("sequence")
                paragraphlist_parentid = paragraphlist.get("parentId")
                paragraphlist_type = paragraphlist.get("type")
                for paragraph in paragraphlist.iterfind("paragraph"):
                    paragraph_sequence = paragraph.get("sequence")
                    paragraph_text = paragraph.text
                    row = [
                        awardnumber,
                        paragraphlist_sequence,
                        paragraphlist_parentid,
                        paragraphlist_type,
                        paragraph_sequence,
                        paragraph_text,
                    ]
                    textlist.append(row)
        except AttributeError:
            row = [awardnumber] + [None] * 5
            textlist.append(row)
    dumpfilename = (
        "dump_kadai/paragraph/paragraph_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(textlist, f)

In [12]:
# 研究成果物
def product(xmlfile):
    tree = etree.parse(xmlfile)
    nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}
    productlist = []
    for grantAward in tree.iterfind("grantAward"):
        awardnumber = grantAward.get("awardNumber")
        try:
            productlistenriched = grantAward.find("productListEnriched")
            for product in productlistenriched.iterfind("product"):
                product_type = product.get("type")
                sequence = product.get("sequence")
                try:
                    reviewed = product.get("reviewed")
                except AttributeError:
                    reviewed = None
                try:
                    doi = product.find("doi").text
                except AttributeError:
                    doi = None
                try:
                    author_ja = product.find("author[@xml:lang='ja']", nsmap).text
                except AttributeError:
                    author_ja = None
                try:
                    author_en = product.find("author[@xml:lang='en']", nsmap).text
                except AttributeError:
                    author_en = None
                try:
                    title_ja = product.find("title[@xml:lang='ja']", nsmap).text
                except AttributeError:
                    title_ja = None
                try:
                    title_en = product.find("title[@xml:lang='en']", nsmap).text
                except AttributeError:
                    title_en = None
                try:
                    journaltitle_ja = product.find(
                        "journalTitle[@xml:lang='ja']", nsmap
                    ).text
                except AttributeError:
                    journaltitle_ja = None
                try:
                    journaltitle_en = product.find(
                        "journalTitle[@xml:lang='en']", nsmap
                    ).text
                except AttributeError:
                    journaltitle_en = None
                try:
                    year = product.find("year").text
                except AttributeError:
                    year = None
                row = [
                    awardnumber,
                    product_type,
                    sequence,
                    reviewed,
                    doi,
                    author_ja,
                    author_en,
                    title_ja,
                    title_en,
                    journaltitle_ja,
                    journaltitle_en,
                    year,
                ]
                productlist.append(row)
        except:
            row = [awardnumber] + [None] * 11
            productlist.append(row)

    dumpfilename = (
        "dump_kadai/product/product_"
        + re.search("[0-9]{4}_[0-9]+-[0-9]+.xml", xmlfile).group()
        + ".dump"
    )
    with open(dumpfilename, "wb") as f:
        pickle.dump(productlist, f)

XMLファイルから研究課題に関するデータを抽出して保存する

In [13]:
# dump_kadaiフォルダを空にしておく
target_dir = "dump_kadai"
if os.path.isdir(target_dir):
    shutil.rmtree(target_dir)
parts = [
    "main",
    "institution",
    "member",
    "field",
    "review_section",
    "annual",
    "keyword",
    "paragraph",
    "product",
]
dirlist = [target_dir + "/" + p for p in parts]
for d in dirlist:
    os.makedirs(d)

In [15]:
# XMLファイルのリストを作成する
filenames = []
for i in range(startyear, endyear + 1):
    globdir = "../kaken_parse_grants_masterxml/xml/" + str(i) + "*.xml"
    filenames.extend(glob(globdir))

# XMLファイルをパースする関数を束ねる
def parse(xmlfile):
    kadai(xmlfile)
    institution(xmlfile)
    member(xmlfile)
    field(xmlfile)
    review_section(xmlfile)
    annual(xmlfile)
    keyword(xmlfile)
    paragraph(xmlfile)
    product(xmlfile)


# Joblibで並列処理する
Parallel(n_jobs=-1, verbose=1)([delayed(parse)(i) for i in filenames])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done 176 out of 176 | elapsed:   46.6s finished


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,

## データ処理用の関数の準備

In [16]:
def merge_list(parts):
    lists = []
    for dump in tqdm(glob("dump_kadai/" + parts + "/" + parts + "*.dump")):
        with open(dump, mode="rb") as f:
            l = pickle.load(f)
            lists += l
    return lists

## 研究課題基礎テーブル

### 基礎テーブル部品1. 研究課題メインデータ

In [17]:
# リストを結合する
lists = merge_list("main")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "projecttype",
    "projectstatus_fiscalyear",
    "projectstatus_statuscode",
    "startfiscalyear",
    "endfiscalyear",
    "category_niicode",
    "category",
    "section_niicode",
    "section",
    "title_ja",
    "title_en",
    "directcost",
]
base_main = pd.DataFrame(lists, columns=columns)
# 課題番号に重複がないことを確認して、インデックスに設定する
assert not base_main["awardnumber"].duplicated().any(), "awardnumber is duplicated."
base_main = base_main.set_index("awardnumber")
# データを見ると、研究種目名と区分名は表記ゆれなどがあって使いにくいので、それぞれのniicodeだけ残して、削除しておく。
base_main = base_main.drop(columns=["category", "section"])
base_main

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0_level_0,projecttype,projectstatus_fiscalyear,projectstatus_statuscode,startfiscalyear,endfiscalyear,category_niicode,section_niicode,title_ja,title_en,directcost
awardnumber,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
20K16271,project,2020,adopted,2020,2021,252,,Understanding and restoring host-microbe inter...,Understanding and restoring host-microbe inter...,3200000
20K16270,project,2020,adopted,2020,2021,252,,ヒトメタニューモウイルスによる宿主自然免疫ハイジャック機構の解明,ヒトメタニューモウイルスによる宿主自然免疫ハイジャック機構の解明,3200000
20K16269,project,2020,adopted,2020,2022,252,,経鼻ワクチンの効率的なIgA産生を誘導する新規樹状細胞サブセットの同定と機能解明,経鼻ワクチンの効率的なIgA産生を誘導する新規樹状細胞サブセットの同定と機能解明,3300000
20K16268,project,2020,adopted,2020,2021,252,,ノロウイルスの感染・防御のための構造基盤の研究,ノロウイルスの感染・防御のための構造基盤の研究,3200000
20K16267,project,2020,adopted,2020,2021,252,,C型肝炎ウイルス伝播経路の選択バランス制御機構の解析,C型肝炎ウイルス伝播経路の選択バランス制御機構の解析,3200000
20K16266,project,2020,adopted,2020,2021,252,,パラミクソウイルスゲノムの塩基数はなぜ6の倍数でなければならないのか,パラミクソウイルスゲノムの塩基数はなぜ6の倍数でなければならないのか,3200000
20K16265,project,2020,adopted,2020,2021,252,,B型肝炎ウイルスの持続感染の成立および維持における制御性T細胞の寄与の解明,B型肝炎ウイルスの持続感染の成立および維持における制御性T細胞の寄与の解明,3200000
20K16264,project,2020,adopted,2020,2021,252,,インフルエンザRNAポリメラーゼとウイルス増殖阻害抗体のクライオ電子顕微鏡解析,インフルエンザRNAポリメラーゼとウイルス増殖阻害抗体のクライオ電子顕微鏡解析,3200000
20K16263,project,2020,adopted,2020,2022,252,,ジカウイルス感染が卵巣に及ぼす病原性の究明,ジカウイルス感染が卵巣に及ぼす病原性の究明,2900000
20K16262,project,2020,adopted,2020,2023,252,,Real-time observation of native conformations ...,Real-time observation of native conformations ...,3200000


### 基礎テーブル部品2. 採択時の代表研究機関

In [18]:
# リストを結合する
lists = merge_list("institution")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "fiscalyear",
    "grant_sequence",
    "institution_sequence",
    "institution_niicode",
    "institution_mextcode",
    "institution_jspscode",
    "institution_name",
]
base_institution = pd.DataFrame(lists, columns=columns)
# awardnumberごとにfiscalyearが最小の行（＝採択時点の研究機関の行）を取得する
oldest = base_institution.groupby("awardnumber")["fiscalyear"].min().reset_index()
# dfのうち、oldestと一致する行のみ残す
base_institution = pd.merge(oldest, base_institution, on=["awardnumber", "fiscalyear"])
# 課題番号に重複がないことを確認して、インデックスに設定する
assert not base_institution["awardnumber"].duplicated().any(), "awardnumber is duplicated."
base_institution = base_institution.set_index("awardnumber")
# 使用しない列を削除する
base_institution = base_institution.drop(columns=["fiscalyear", "grant_sequence", "institution_sequence"])
base_institution

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0_level_0,institution_niicode,institution_mextcode,institution_jspscode,institution_name
awardnumber,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
15KK0114,0013301,13301,13301,金沢大学
16H00409,0013801,13801,13801,静岡大学
16K21741,0012613,12613,12613,一橋大学
16K21746,382803,82606,,国立研究開発法人国立がん研究センター
16K21747,0014401,14401,14401,大阪大学
16K21749,382814,92101,92101,株式会社IDファーマ
16K21751,0017102,17102,17102,九州大学
17H00239,0051303,51303,51303,仙台高等専門学校
17H00305,0017201,17201,17201,佐賀大学
17H04732,0011301,11301,11301,東北大学


### 基礎テーブル部品3. 採択時の研究代表者

In [19]:
# リストを結合する
lists = merge_list("member")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "sequence",
    "participate",
    "eradcode",
    "role",
    "fullname",
    "familyname",
    "givenname",
    "familyname_yomi",
    "givenname_yomi",
]
base_member = pd.DataFrame(lists, columns=columns)
# 代表者のみ抽出
daihyou = [
    "principal_investigator",
    "area_organizer",
    "principal_investigator_support",
    "research_fellow",
    "foreign_research_fellow",
]
base_member = base_member[base_member["role"].isin(daihyou)]
# データ型を指定する
base_member = base_member.astype({"sequence": int})
# awardnumberごとにsequenceが最大のレコードのみ抽出する。生のXMLを眺めてみると、sequenceが大きいほど古い年度のデータなので。
seqmax = base_member.groupby('awardnumber')['sequence'].max().reset_index()
base_member = pd.merge(seqmax, base_member, on=['awardnumber', 'sequence'])
# 課題番号に重複がないことを確認して、インデックスに設定する
assert not base_member["awardnumber"].duplicated().any(), "awardnumber is duplicated."
base_member = base_member.set_index("awardnumber")
base_member

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0_level_0,sequence,participate,eradcode,role,fullname,familyname,givenname,familyname_yomi,givenname_yomi
awardnumber,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
15KK0114,1,,60377009,principal_investigator,杉橋 やよい,杉橋,やよい,スギハシ,ヤヨイ
16H00409,1,,,principal_investigator,上田 瑞恵,上田,瑞恵,ウエダ,ミズエ
16K21741,1,,,principal_investigator,森田 穂高,森田,穂高,,
16K21746,1,,,principal_investigator,小林 進,小林,進,,
16K21747,1,,,principal_investigator,北川 克己,北川,克己,,
16K21749,1,,,principal_investigator,横山 英樹,横山,英樹,,
16K21751,1,,,principal_investigator,知念 孝敏,知念,孝敏,,
17H00239,1,,,principal_investigator,田中 ゆみ,田中,ゆみ,タナカ,ユミ
17H00305,1,,,principal_investigator,新地 姉理華,新地,姉理華,シンチ,エリカ
17H04732,1,,90610626,principal_investigator,王 欣,王,欣,オウ,キン


### 基礎テーブルの3つの部品を結合してDBに書き込む

In [20]:
# 3つのデータフレームを結合する
base = base_main.join(base_institution)
base = base.join(base_member)
base

Unnamed: 0_level_0,projecttype,projectstatus_fiscalyear,projectstatus_statuscode,startfiscalyear,endfiscalyear,category_niicode,section_niicode,title_ja,title_en,directcost,...,institution_name,sequence,participate,eradcode,role,fullname,familyname,givenname,familyname_yomi,givenname_yomi
awardnumber,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
20K16271,project,2020,adopted,2020,2021,252,,Understanding and restoring host-microbe inter...,Understanding and restoring host-microbe inter...,3200000,...,北海道大学,1,,10854804,principal_investigator,Cho Steven・Shian・Chin,Cho,Steven・Shian・Chin,チヨウ,ステイーブン・シアン・チン
20K16270,project,2020,adopted,2020,2021,252,,ヒトメタニューモウイルスによる宿主自然免疫ハイジャック機構の解明,ヒトメタニューモウイルスによる宿主自然免疫ハイジャック機構の解明,3200000,...,国立感染症研究所,1,,00781741,principal_investigator,直 亨則,直,亨則,ナオ,ナガノリ
20K16269,project,2020,adopted,2020,2022,252,,経鼻ワクチンの効率的なIgA産生を誘導する新規樹状細胞サブセットの同定と機能解明,経鼻ワクチンの効率的なIgA産生を誘導する新規樹状細胞サブセットの同定と機能解明,3300000,...,国立感染症研究所,1,,40762216,principal_investigator,佐々木 永太,佐々木,永太,ササキ,エイタ
20K16268,project,2020,adopted,2020,2021,252,,ノロウイルスの感染・防御のための構造基盤の研究,ノロウイルスの感染・防御のための構造基盤の研究,3200000,...,生理学研究所,1,,20755516,principal_investigator,ソン チホン,ソン,チホン,ソン,チホン
20K16267,project,2020,adopted,2020,2021,252,,C型肝炎ウイルス伝播経路の選択バランス制御機構の解析,C型肝炎ウイルス伝播経路の選択バランス制御機構の解析,3200000,...,東京理科大学,1,,40866761,principal_investigator,大橋 啓史,大橋,啓史,オオハシ,ヒロフミ
20K16266,project,2020,adopted,2020,2021,252,,パラミクソウイルスゲノムの塩基数はなぜ6の倍数でなければならないのか,パラミクソウイルスゲノムの塩基数はなぜ6の倍数でなければならないのか,3200000,...,和歌山県立医科大学,1,,00735912,principal_investigator,松本 祐介,松本,祐介,マツモト,ユウスケ
20K16265,project,2020,adopted,2020,2021,252,,B型肝炎ウイルスの持続感染の成立および維持における制御性T細胞の寄与の解明,B型肝炎ウイルスの持続感染の成立および維持における制御性T細胞の寄与の解明,3200000,...,名古屋市立大学,1,,70843027,principal_investigator,浦木 隆太,浦木,隆太,ウラキ,リユウタ
20K16264,project,2020,adopted,2020,2021,252,,インフルエンザRNAポリメラーゼとウイルス増殖阻害抗体のクライオ電子顕微鏡解析,インフルエンザRNAポリメラーゼとウイルス増殖阻害抗体のクライオ電子顕微鏡解析,3200000,...,横浜市立大学,1,,90724774,principal_investigator,吉田 尚史,吉田,尚史,ヨシダ,ヒサシ
20K16263,project,2020,adopted,2020,2022,252,,ジカウイルス感染が卵巣に及ぼす病原性の究明,ジカウイルス感染が卵巣に及ぼす病原性の究明,2900000,...,浜松医科大学,1,,10766059,principal_investigator,今川 稔文,今川,稔文,イマガワ,トシフミ
20K16262,project,2020,adopted,2020,2023,252,,Real-time observation of native conformations ...,Real-time observation of native conformations ...,3200000,...,金沢大学,1,,60842987,principal_investigator,LIM KEE・SIANG,LIM,KEE・SIANG,リン,キイ・シヤン


In [21]:
# 外部キー制約を外す
try:
    with engine.connect() as con:
        con.execute("ALTER TABLE grantaward_review_section DROP FOREIGN KEY fk_grantaward_review_section_grantaward;")
        con.execute("ALTER TABLE grantaward_field DROP FOREIGN KEY fk_grantaward_field_grantaward;")
        con.execute("ALTER TABLE grantaward_annual DROP FOREIGN KEY fk_grantaward_annual_grantaward;")
        con.execute("ALTER TABLE grantaward_member DROP FOREIGN KEY fk_grantaward_member_grantaward;")
        con.execute("ALTER TABLE grantaward_paragraph DROP FOREIGN KEY fk_grantaward_paragraph_grantaward;")
        con.execute("ALTER TABLE grantaward_keyword DROP FOREIGN KEY fk_grantaward_keyword_grantaward;")
        con.execute("ALTER TABLE grantaward_product DROP FOREIGN KEY fk_grantaward_product_grantaward;")
        con.execute("ALTER TABLE grantaward DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_member DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_field DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_review_section DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_annual DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_keyword DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_paragraph DROP PRIMARY KEY;")
        con.execute("ALTER TABLE grantaward_product DROP PRIMARY KEY;")
except:
    pass

# データベースに書き込む
base.to_sql(
    "grantaward",
    engine,
    if_exists="replace",
    dtype={
        "awardnumber": String(255),
        "startfiscalyear": Integer,
        "endfiscalyear": Integer,
        "projectstatus_fiscalyear": Integer,
        "category_niicode": Integer,
        "section_niicode": Integer,
        "institution_niicode": Integer,
        "directcost": BigInteger,
        "sequence": Integer,
        "eradcode": String(8),
    },
)

2020-05-23 09:35:15,638 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'sql_mode'
2020-05-23 09:35:15,639 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:15,642 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2020-05-23 09:35:15,643 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:15,644 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = 'utf8' and `Collation` = 'utf8_bin'
2020-05-23 09:35:15,645 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:15,648 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR(60)) AS anon_1
2020-05-23 09:35:15,649 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:15,650 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR(60)) AS anon_1
2020-05-23 09:35:15,651 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:15,653 INFO sqlalchemy.engine.base.Engine SELECT CAST('test collated returns' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin AS anon_1
2020-05-23 09

2020-05-23 09:35:27,670 INFO sqlalchemy.engine.base.Engine COMMIT


In [22]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE grantaward ADD PRIMARY KEY(awardnumber)")
    con.execute(
        "ALTER TABLE grantaward ADD CONSTRAINT category_niicode_1 FOREIGN KEY (category_niicode) REFERENCES categories(category_niicode);"
    )
    con.execute(
        "ALTER TABLE grantaward ADD CONSTRAINT section_niicode_1 FOREIGN KEY (section_niicode) REFERENCES sections(section_niicode);"
    )
    con.execute(
        "ALTER TABLE grantaward ADD CONSTRAINT institution_niicode_1 FOREIGN KEY (institution_niicode) REFERENCES institutions(institution_niicode);"
    )

2020-05-23 09:35:34,108 INFO sqlalchemy.engine.base.Engine ALTER TABLE grantaward ADD PRIMARY KEY(awardnumber)
2020-05-23 09:35:34,109 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:34,936 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:35:34,937 INFO sqlalchemy.engine.base.Engine ALTER TABLE grantaward ADD CONSTRAINT category_niicode_1 FOREIGN KEY (category_niicode) REFERENCES categories(category_niicode);
2020-05-23 09:35:34,938 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:35,913 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:35:35,914 INFO sqlalchemy.engine.base.Engine ALTER TABLE grantaward ADD CONSTRAINT section_niicode_1 FOREIGN KEY (section_niicode) REFERENCES sections(section_niicode);
2020-05-23 09:35:35,915 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:36,994 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:35:36,995 INFO sqlalchemy.engine.base.Engine ALTER TABLE grantaward ADD CONSTRAINT institution_niicode_1 FOREIGN KEY (in

---

## 研究者テーブルを作る

In [23]:
# リストを結合する
lists = merge_list("member")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "sequence",
    "participate",
    "eradcode",
    "role",
    "fullname",
    "familyname",
    "givenname",
    "familyname_yomi",
    "givenname_yomi",
]
member = pd.DataFrame(lists, columns=columns)
# 研究者番号が数字のみで構成されていることを確認する
assert member["eradcode"].str.match('^[0-9]*$').all(), "eradcode contains non-integer letter."
member

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,sequence,participate,eradcode,role,fullname,familyname,givenname,familyname_yomi,givenname_yomi
0,19H02984,1,,60302379,principal_investigator,横張 真,横張,真,ヨコハリ,マコト
1,19H02984,2,,60396760,co_investigator_buntan,村山 顕人,村山,顕人,ムラヤマ,アキト
2,19H02984,3,,00619934,co_investigator_buntan,寺田 徹,寺田,徹,テラダ,トオル
3,19H02984,4,,90700930,co_investigator_buntan,飯田 晶子,飯田,晶子,イイダ,アキコ
4,19H02984,5,,20447345,co_investigator_buntan,秋田 典子,秋田,典子,アキタ,ノリコ
5,19H02983,1,,10313016,principal_investigator,村上 暁信,村上,暁信,ムラカミ,アキノブ
6,19H02983,2,,40396411,co_investigator_buntan,原科 幸爾,原科,幸爾,ハラシナ,コウジ
7,19H02983,3,,70509207,co_investigator_buntan,福永 真弓,福永,真弓,フクナガ,マユミ
8,19H02983,4,,90716135,co_investigator_buntan,熊倉 永子,熊倉,永子,クマクラ,エイコ
9,19H02982,1,,40359485,principal_investigator,松島 肇,松島,肇,マツシマ,ハジメ


In [24]:
# データベースに書き込む
member.to_sql(
    "grantaward_member",
    engine,
    if_exists="replace",
    dtype={"awardnumber": String(255), "sequence": Integer, "eradcode": String(8)},
)

2020-05-23 09:35:52,052 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_member`
2020-05-23 09:35:52,053 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:52,063 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_member`
2020-05-23 09:35:52,064 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:52,068 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:35:52,069 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:52,073 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_member`
2020-05-23 09:35:52,073 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:52,077 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_member
2020-05-23 09:35:52,078 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:35:52,081 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:35:52,086 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_member (
	`index` BIGINT, 
	awardnumber VARCHAR(255), 
	sequence INTEGER, 
	participate T

In [25]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_member` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_member` ADD CONSTRAINT fk_grantaward_member_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )

2020-05-23 09:36:06,782 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_member` ADD PRIMARY KEY(`index`);
2020-05-23 09:36:06,783 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:07,350 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:07,350 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_member` ADD CONSTRAINT fk_grantaward_member_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:36:07,352 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:08,574 INFO sqlalchemy.engine.base.Engine COMMIT


---

## 研究分野テーブルを作る

In [26]:
# リストを結合する
lists = merge_list("field")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "field_sequence",
    "field_path",
    "field_niicode",
    "field_table",
    "field_name",
]
field = pd.DataFrame(lists,columns=columns)
field

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,field_sequence,field_path,field_niicode,field_table,field_name
0,18J15176,1,001982,1982,tokubetsu_kenkyu,土木工学
1,18J15176,1,001982001985,1985,tokubetsu_kenkyu,地盤工学
2,18J15171,1,001994,1994,tokubetsu_kenkyu,材料工学
3,18J15171,1,001994001997,1997,tokubetsu_kenkyu,複合材料・表界面工学
4,18J15167,1,001954,1954,tokubetsu_kenkyu,複合化学
5,18J15167,1,001954001956,1956,tokubetsu_kenkyu,合成化学
6,18J15154,1,002036,2036,tokubetsu_kenkyu,基礎生物学
7,18J15154,1,002036002041,2041,tokubetsu_kenkyu,進化生物学
8,18J15153,1,001950,1950,tokubetsu_kenkyu,基礎化学
9,18J15153,1,001950001952,1952,tokubetsu_kenkyu,有機化学


In [27]:
# データベースに書き込む
field.to_sql(
    "grantaward_field",
    engine,
    if_exists="replace",
    dtype={
        "awardnumber": String(255),
        "field_niicode": Integer,
        "field_path": String(255),
    },
)

2020-05-23 09:36:21,133 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_field`
2020-05-23 09:36:21,135 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:21,141 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_field`
2020-05-23 09:36:21,142 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:21,145 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:36:21,146 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:21,148 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_field`
2020-05-23 09:36:21,149 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:21,152 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_field
2020-05-23 09:36:21,152 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:21,155 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:21,158 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_field (
	`index` BIGINT, 
	awardnumber VARCHAR(255), 
	field_sequence TEXT, 
	field_path VARC

In [28]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_field` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_field` ADD CONSTRAINT fk_grantaward_field_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )
    con.execute(
        "ALTER TABLE `grantaward_field` ADD CONSTRAINT fk_grantaward_field_field_niicode FOREIGN KEY (`field_niicode`) REFERENCES `fields`(`field_niicode`);"
    )
    con.execute(
        "ALTER TABLE grantaward_field ADD CONSTRAINT fk_grantaward_field_field_path FOREIGN KEY (field_path) REFERENCES fields (field_path);"
    )

2020-05-23 09:36:28,892 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_field` ADD PRIMARY KEY(`index`);
2020-05-23 09:36:28,893 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:28,954 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:28,956 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_field` ADD CONSTRAINT fk_grantaward_field_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:36:28,957 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:29,056 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:29,057 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_field` ADD CONSTRAINT fk_grantaward_field_field_niicode FOREIGN KEY (`field_niicode`) REFERENCES `fields`(`field_niicode`);
2020-05-23 09:36:29,058 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:29,172 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:29,173 INFO sqlalchemy.engine.base.Engine ALTER TABLE grantaward_field A

---

## 審査区分テーブルを作る 

In [29]:
# リストを結合する
lists = merge_list("review_section")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "review_section_sequence",
    "review_section_niicode",
    "review_section_table_type",
    "review_section_name",
]
review_section = pd.DataFrame(lists, columns=columns)
review_section

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,review_section_sequence,review_section_niicode,review_section_table_type,review_section_name
0,19J12105,1,511,review_section_tokken,小区分04030:文化人類学および民俗学関連
1,19J12097,1,521,review_section_tokken,小区分07010:理論経済学関連
2,19J12079,1,680,review_section_tokken,小区分47020:薬系分析および物理化学関連
3,19J12072,1,708,review_section_tokken,小区分53030:呼吸器内科学関連
4,19J12059,1,648,review_section_tokken,小区分40030:水圏生産科学関連
5,19J12055,1,504,review_section_tokken,小区分03030:アジア史およびアフリカ史関連
6,19J12050,1,554,review_section_tokken,小区分13030:磁性、超伝導および強相関系関連
7,19J12043,1,626,review_section_tokken,小区分35020:高分子材料関連
8,19J12037,1,546,review_section_tokken,小区分11010:代数学関連
9,19J12030,1,643,review_section_tokken,小区分39050:昆虫科学関連


In [30]:
# データベースに書き込む
review_section.to_sql(
    "grantaward_review_section",
    engine,
    if_exists="replace",
    dtype={"awardnumber": String(255), "review_section_niicode": Integer},
)

2020-05-23 09:36:47,663 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_review_section`
2020-05-23 09:36:47,664 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:47,669 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_review_section`
2020-05-23 09:36:47,670 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:47,674 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:36:47,675 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:47,677 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_review_section`
2020-05-23 09:36:47,679 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:47,685 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_review_section
2020-05-23 09:36:47,686 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:47,692 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:47,695 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_review_section (
	`index` BIGINT, 
	awardnumber VARCHAR(2

In [31]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_review_section` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_review_section` ADD CONSTRAINT fk_grantaward_review_section_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )
    con.execute(
        "ALTER TABLE `grantaward_review_section` ADD CONSTRAINT fk_grantaward_review_section_review_section_niicode FOREIGN KEY (`review_section_niicode`) REFERENCES `review_sections`(`review_section_niicode`);"
    )

2020-05-23 09:36:55,363 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_review_section` ADD PRIMARY KEY(`index`);
2020-05-23 09:36:55,364 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:55,658 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:55,659 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_review_section` ADD CONSTRAINT fk_grantaward_review_section_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:36:55,660 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:56,364 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:36:56,365 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_review_section` ADD CONSTRAINT fk_grantaward_review_section_review_section_niicode FOREIGN KEY (`review_section_niicode`) REFERENCES `review_sections`(`review_section_niicode`);
2020-05-23 09:36:56,366 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:36:57,169 INFO sqlalchemy.engine.base.Engine COMMIT


---

## 年度ごとの直接経費金額テーブルを作る

In [32]:
# リストを結合する
lists = merge_list("annual")
# リストをデータフレームに変換する
columns = ["awardnumber", "sequence", "fiscalyear", "directcost"]
annual = pd.DataFrame(lists, columns=columns)
annual

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,sequence,fiscalyear,directcost
0,18J22004,1,2018,800000
1,18J22004,1,2019,700000
2,18J21997,1,2018,1000000
3,18J21997,1,2019,900000
4,18J21985,1,2018,1000000
5,18J21985,1,2019,900000
6,18J21963,1,2018,1000000
7,18J21963,1,2019,900000
8,18J21961,1,2018,800000
9,18J21961,1,2019,700000


In [33]:
# データベースに書き込む
annual.to_sql(
    "grantaward_annual",
    engine,
    if_exists="replace",
    dtype={
        "awardnumber": String(255),
        "sequence": Integer,
        "fiscalyaer": Integer,
        "directcost": BigInteger,
    },
)

2020-05-23 09:37:08,966 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_annual`
2020-05-23 09:37:08,967 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:08,974 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_annual`
2020-05-23 09:37:08,975 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:08,980 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:37:08,984 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:08,986 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_annual`
2020-05-23 09:37:08,987 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:08,992 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_annual
2020-05-23 09:37:08,993 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:08,997 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:37:09,000 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_annual (
	`index` BIGINT, 
	awardnumber VARCHAR(255), 
	sequence INTEGER, 
	fiscalyear TE

In [34]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_annual` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_annual` ADD CONSTRAINT fk_grantaward_annual_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )

2020-05-23 09:37:19,795 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_annual` ADD PRIMARY KEY(`index`);
2020-05-23 09:37:19,797 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:20,364 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:37:20,365 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_annual` ADD CONSTRAINT fk_grantaward_annual_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:37:20,365 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:22,196 INFO sqlalchemy.engine.base.Engine COMMIT


---

## キーワードテーブルを作る

In [35]:
# リストを結合する
lists = merge_list("keyword")
# リストをデータフレームに変換する
columns = ["awardnumber", "keyword_sequence", "keyword_text"]
keyword = pd.DataFrame(lists, columns=columns)
keyword

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,keyword_sequence,keyword_text
0,18H02445,1,始原生殖細胞
1,18H02445,2,細胞移動
2,18H02445,3,生殖細胞性
3,18H02445,4,鳥類
4,18H02445,5,生殖腺
5,18H02445,6,鳥類胚
6,18H02444,1,ERK
7,18H02444,2,G1-S期チェックポイント
8,18H02444,3,イメージング
9,18H02444,4,G1/S


In [36]:
# データベースに書き込む
keyword.to_sql(
    "grantaward_keyword",
    engine,
    if_exists="replace",
    dtype={"awardnumber": String(255)},
)

2020-05-23 09:37:39,888 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_keyword`
2020-05-23 09:37:39,889 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:39,894 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_keyword`
2020-05-23 09:37:39,895 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:39,898 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:37:39,899 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:39,902 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_keyword`
2020-05-23 09:37:39,903 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:39,907 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_keyword
2020-05-23 09:37:39,908 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:39,911 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:37:39,914 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_keyword (
	`index` BIGINT, 
	awardnumber VARCHAR(255), 
	keyword_sequence TEXT, 
	key

In [37]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_keyword` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_keyword` ADD CONSTRAINT fk_grantaward_keyword_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )

2020-05-23 09:37:51,395 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_keyword` ADD PRIMARY KEY(`index`);
2020-05-23 09:37:51,396 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:52,102 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:37:52,103 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_keyword` ADD CONSTRAINT fk_grantaward_keyword_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:37:52,103 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:37:54,262 INFO sqlalchemy.engine.base.Engine COMMIT


---

## 研究概要等のテキストのテーブルを作る

In [38]:
# リストを結合する
lists = merge_list("paragraph")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "paragraphlist_sequence",
    "paragraphlist_parentid",
    "paragraphlist_type",
    "paragraph_sequence",
    "paragraph_text",
]
paragraph = pd.DataFrame(lists, columns=columns)
paragraph

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,paragraphlist_sequence,paragraphlist_parentid,paragraphlist_type,paragraph_sequence,paragraph_text
0,18K14670,1,18K146702018hokoku,outline_of_research_performance,1,本研究では哺乳類概日時計の動作原理として考えられる自律性をもつミニマルなリン酸化振動子の試験...
1,18K14670,1,18K146702018hokoku,outline_of_research_performance,2,2.時計タンパク質由来のペプチドライブラリを用いたスクリーニングよりCKIのリン酸化活性を向...
2,18K14670,2,classification18K14670progress2018,progress,1,"時計タンパク質(PER1/2,CRY1/2,BMAL1,CLOCK)由来のペプチドを化学合成..."
3,18K14670,3,18K146702018hokoku,planning_scheme,1,リン酸化/脱リン酸化と相互作用させるペプチドを用いて自律的なリン酸化振動子が構成しうるかを検...
4,18K14669,1,18K146692018hokoku,outline_of_research_performance,1,培養細胞に外的刺激を与えた際に生じる細胞内水の変化を定量的に解析することを目指し，本年度は細...
5,18K14669,1,18K146692018hokoku,outline_of_research_performance,2,（１）THz TD-ATR測定系の構築に関しては，フェムト秒レーザー励起をもとにダイポール型...
6,18K14669,1,18K146692018hokoku,outline_of_research_performance,3,（２）誘電率フィッティング解析法の最適化に関する論文はPhysical Chemistry ...
7,18K14669,2,classification18K14669progress2018,progress,1,計画時点では，本年度中に細胞内水和割合3 %の変化に相当する振幅変動±1.5 %以下かつ位相...
8,18K14669,3,18K146692018hokoku,planning_scheme,1,まずは目的とする高精度・高安定なTHz TD-ATR測定系を完成させることを目指す．そのため...
9,18K14668,1,18K146682018hokoku,outline_of_research_performance,1,平成30年度はターゲットとする酵素タンパク質の熱安定化変異体の作製及び熱安定性解析と計算機を...


In [39]:
# データベースに書き込む
paragraph.to_sql(
    "grantaward_paragraph",
    engine,
    if_exists="replace",
    dtype={"awardnumber": String(255)},
)

2020-05-23 09:38:07,474 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_paragraph`
2020-05-23 09:38:07,475 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:07,485 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_paragraph`
2020-05-23 09:38:07,486 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:07,490 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:38:07,491 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:07,494 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_paragraph`
2020-05-23 09:38:07,496 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:07,501 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_paragraph
2020-05-23 09:38:07,502 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:07,506 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:38:07,510 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_paragraph (
	`index` BIGINT, 
	awardnumber VARCHAR(255), 
	paragraphlist_sequ

In [40]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_paragraph` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_paragraph` ADD CONSTRAINT fk_grantaward_paragraph_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )

2020-05-23 09:38:25,754 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_paragraph` ADD PRIMARY KEY(`index`);
2020-05-23 09:38:25,755 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:27,782 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:38:27,784 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_paragraph` ADD CONSTRAINT fk_grantaward_paragraph_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:38:27,784 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:30,341 INFO sqlalchemy.engine.base.Engine COMMIT


## 成果物テーブルを作る

In [41]:
# リストを結合する
lists = merge_list("product")
# リストをデータフレームに変換する
columns = [
    "awardnumber",
    "product_type",
    "sequence",
    "reviewed",
    "doi",
    "author_ja",
    "author_en",
    "title_ja",
    "title_en",
    "journaltitle_ja",
    "journaltitle_en",
    "year",
]
product = pd.DataFrame(lists, columns=columns)
product

HBox(children=(IntProgress(value=0, max=176), HTML(value='')))




Unnamed: 0,awardnumber,product_type,sequence,reviewed,doi,author_ja,author_en,title_ja,title_en,journaltitle_ja,journaltitle_en,year
0,20H01698,,,,,,,,,,,
1,20H01697,,,,,,,,,,,
2,20H01696,,,,,,,,,,,
3,20H01695,,,,,,,,,,,
4,20H01694,,,,,,,,,,,
5,20H01693,,,,,,,,,,,
6,20H01692,,,,,,,,,,,
7,20H01691,,,,,,,,,,,
8,20H01690,,,,,,,,,,,
9,20H01689,,,,,,,,,,,


In [42]:
# データベースに書き込む
product.to_sql(
    "grantaward_product",
    engine,
    if_exists="replace",
    dtype={"awardnumber": String(255), "year": Integer},
)

2020-05-23 09:38:46,139 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_product`
2020-05-23 09:38:46,140 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:46,145 INFO sqlalchemy.engine.base.Engine DESCRIBE `grantaward_product`
2020-05-23 09:38:46,146 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:46,149 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `kaken`
2020-05-23 09:38:46,150 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:46,153 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `grantaward_product`
2020-05-23 09:38:46,154 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:46,159 INFO sqlalchemy.engine.base.Engine 
DROP TABLE grantaward_product
2020-05-23 09:38:46,160 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:46,165 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:38:46,169 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE grantaward_product (
	`index` BIGINT, 
	awardnumber VARCHAR(255), 
	product_type TEXT, 
	sequenc

In [43]:
# 主キーと外部キー制約を設定する
with engine.connect() as con:
    con.execute("ALTER TABLE `grantaward_product` ADD PRIMARY KEY(`index`);")
    con.execute(
        "ALTER TABLE `grantaward_product` ADD CONSTRAINT fk_grantaward_product_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);"
    )

2020-05-23 09:38:55,204 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_product` ADD PRIMARY KEY(`index`);
2020-05-23 09:38:55,206 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:55,475 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-23 09:38:55,476 INFO sqlalchemy.engine.base.Engine ALTER TABLE `grantaward_product` ADD CONSTRAINT fk_grantaward_product_grantaward FOREIGN KEY (`awardnumber`) REFERENCES `grantaward`(`awardnumber`);
2020-05-23 09:38:55,477 INFO sqlalchemy.engine.base.Engine {}
2020-05-23 09:38:56,289 INFO sqlalchemy.engine.base.Engine COMMIT


おしまい