# Splitting a GWO hourly or daily data file containing multiple years of data at one station into yearly files
**Author: Jun Sasaki, Coded on December 14, 2017, Revised on December 17, 2023**<br>
The GWO DVD can export a run of consecutive years for one station in a single batch. The purpose of this notebook is to split such a single-file export into one file per year.

- The hourly values on the surface-observation meteorological database (GWO) DVD are at 3-hour intervals from 1961 to 1990 and at 1-hour intervals from 1991 onward. This code applies to either interval, or to a file containing a mixture of both.
- Daily values are supported in the same way.
- The input CSV file is encoded in Shift-JIS with CRLF line endings; the split yearly files are encoded in UTF-8 with LF line endings.
- nkf is used to convert the line endings to the Linux format (LF).
- To export from the database, start SQLViewer7.exe, uncheck "Valid Data" and "Threshold Sort", check "All Database Items", select one station, specify the entire period, and run.
- When saving to a CSV file, check "Output Column Labels" under "Edit" on the Extraction DB screen before saving.
- On Windows the output was automatically written in Shift-JIS, so UTF-8 must be specified explicitly as the encoding in pandas `to_csv`.
- For hourly values the last record of each day is timestamped 24:00, so the 24:00 record at the end of a year corresponds to 0:00 at the start of the following year.
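
The 24:00 convention in the last item can be illustrated with a minimal sketch. The helper name `gwo_timestamp` is hypothetical and not part of this notebook; it assumes the year, month, day, and hour of a record are already available as integers, with the hour running from 1 to 24.

```python
from datetime import datetime, timedelta


def gwo_timestamp(year, month, day, hour):
    """Convert a GWO-style record time (hour 1..24) to a datetime.

    Hypothetical helper: hour 24 is treated as 0:00 of the following day.
    """
    # Start from midnight of the record's date and add the hour offset,
    # so hour=24 naturally rolls over to the next day (and year).
    return datetime(year, month, day) + timedelta(hours=hour)


ts = gwo_timestamp(1990, 12, 31, 24)
print(ts)  # 1991-01-01 00:00:00
```

Note that the splitting code in this notebook filters on the year column as written in the file, so under this convention the 24:00 record of December 31 stays in the file of the year it is labeled with.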

In [None]:
import pandas as pd
import subprocess
import os
import sys
import glob

In [None]:
stn_dict={"稚内":"Wakkanai", "北見枝幸":"Kitamiesashi", "羽幌":"Haboro", "雄武":"Oumu", "留萌":"Rumoi", "旭川":"Asahikawa", \
          "網走":"Abashiri", "小樽":"Otaru", "札幌":"Sapporo", "岩見沢":"Iwamizawa", "帯広":"Obihiro", "釧路":"Kushiro", \
          "根室":"Nemuro", "寿都":"Suttu", "室蘭":"Muroran", "苫小牧":"Tomakomai", \
          "浦河":"Urakawa", "江差":"Esashi", "函館":"Hakodate", "倶知安":"Kutchan", "紋別":"Monbetsu", "広尾":"Hiroo", \
          "大船渡":"Ofunato", "新庄":"Shinjo", "若松":"Wakamatsu", "深浦":"Fukaura", \
          "青森":"Aomori", "むつ":"Mutsu", "八戸":"Hachinohe", "秋田":"Akita", "盛岡":"Morioka", "宮古":"Miyako", \
          "酒田":"Sakata", "山形":"Yamagata", "仙台":"Sendai", "石巻":"Ishinomaki", \
          "福島":"Fukushima", "白河":"Shirakawa", "小名浜":"Onahama", "輪島":"Wajima", "相川":"Aikawa", "新潟":"Niigata", \
          "金沢":"Kanazawa", "伏木":"Fushiki", "富山":"Toyama", "長野":"Nagano", \
          "高田":"Takada", "宇都宮":"Utsunomiya", "福井":"Fukui", "高山":"Takayama", "松本":"Matsumoto", \
          "諏訪":"Suwa", "軽井沢":"Karuizawa", "前橋":"Maebashi", "熊谷":"Kumagaya", "水戸":"Mito", \
          "敦賀":"Tsuruga", "岐阜":"Gifu", "名古屋":"Nagoya", "飯田":"Iida", "甲府":"Kofu", \
          "河口湖":"Kawaguchiko", "秩父":"Chichibu", "館野":"Tateno", "銚子":"Choshi", "上野":"Ueno", \
          "津":"Tsu", "伊良湖":"Irago", "浜松":"Hamamatsu", "御前崎":"Omaezaki", "静岡":"Shizuoka", \
          "三島":"Mishima", "東京":"Tokyo", "尾鷲":"Owase", "石廊崎":"Irozaki", "網代":"Ajiro", \
          "横浜":"Yokohama", "館山":"Tateyama", "勝浦":"Katsuura", "大島":"Oshima", "三宅島":"Miyakejima", \
          "八丈島":"Hachijojima", "千葉":"Chiba", "四日市":"Yokkaichi", "日光":"Nikko", "西郷":"Saigo", \
          "松江":"Matsue", "境":"Sakai", "米子":"Yonago", "鳥取":"Tottori", "豊岡":"Toyooka", "舞鶴":"Maiduru", \
          "伊吹山":"Ibukiyama", "萩":"Hagi", "浜田":"Hamada", "津山":"Tsuyama", \
          "京都":"Kyoto", "彦根":"Hikone", "下関":"Shimonoseki", "広島":"Hiroshima", "呉":"Kure", \
          "福山":"Fukuyama", "岡山":"Okayama", "姫路":"Himeji", "神戸":"Kobe", "大阪":"Osaka", \
          "洲本":"Sumoto", "和歌山":"Wakayama", "潮岬":"Shionomisaki", "奈良":"Nara", "山口":"Yamaguchi", \
          "厳原":"Izuhara", "平戸":"Hirado", "福岡":"Fukuoka", "飯塚":"Iiduka", "佐世保":"Sasebo", \
          "佐賀":"Saga", "日田":"Hita", "大分":"Oita", "長崎":"Nagasaki", "熊本":"Kumamoto", \
          "阿蘇山":"Asosan", "延岡":"Nobeoka", "阿久根":"Akune", "人吉":"Hitoyoshi", "鹿児島":"Kagoshima", \
          "都城":"Miyakonojo", "宮崎":"Miyazaki", "枕崎":"Makurazaki", "油津":"Aburatsu", "屋久島":"Yakushima", \
          "種子島":"Tanegashima", "牛深":"Ushibuka", "福江":"Fukue", "松山":"Matsuyama", "多度津":"Tadotsu", \
          "高松":"Takamatsu", "宇和島":"Uwajima", "高知":"Kochi", "剣山":"Tsurugisan", "徳島":"Tokushima", \
          "宿毛":"Sukumo", "清水":"Shimizu", "室戸岬":"Murotomisaki", "名瀬":"Nase", "与那国島":"Yonakunijima", \
          "石垣島":"Ishigakijima", "宮古島":"Miyakojima", "久米島":"Kumejima", "那覇":"Naha", "名護":"Nago", \
          "沖永良部":"Okinoerabu", "南大東島":"Minamidaitojima", "父島":"Chichijima", "南鳥島":"Minamitorishima"}

In [None]:
def GWO_div_year(stn="Tokyo", year_ini=None, year_end=None, db_path="/mnt/d/dat/met/JMA_DataBase/GWO/Hourly/"):
    '''Divide a file containing continuous multi-year data into one file per year'''
    ### Directory holding the input file, e.g., Tokyo1961-1990.csv (Shift-JIS, CRLF)
    dirpath = db_path + stn + "/"
    if not os.path.isdir(dirpath):
        print('Error: No such directory:', dirpath)
        sys.exit()
    fpath = glob.glob(dirpath + '*-*.csv')
    if year_ini is not None and year_end is not None:  ### Both year_ini and year_end are given
        fpath = db_path + stn + "/" + stn + str(year_ini) + "-" + str(year_end) + ".csv"
        print(fpath)
    else:  ### At least one of year_ini and year_end is None
        ### Hourly data come as two files (up to 1990 and from 1991 onward)
        if len(fpath) == 1:  ### Daily data: a single file
            fpath = fpath[0]
        elif len(fpath) > 1:
            if year_end == 1990:
                fpath = glob.glob(dirpath + '*-1990*.csv')[0]
            elif year_ini == 1991:
                fpath = glob.glob(dirpath + '*1991-*.csv')[0]
            else:
                print("Error: Cannot select a unique multi-year CSV file.")
                print(fpath)
                sys.exit()
        else:
            print("Error: No multi-year CSV file was found.")
            print(fpath)
            sys.exit()
    print("Reading ", fpath)
    df = pd.read_csv(fpath, header=None, dtype="str", encoding="SHIFT-JIS")
    ### If a bound is not given, take it from the first or last row of the file
    if year_ini is None:
        year_ini = int(df.iloc[0,3])
    if year_end is None:
        year_end = int(df.iloc[-1,3])
    for year in range(year_ini, year_end + 1):
        df_year = df[df.iloc[:,3] == str(year)]  ### Column 3 of df is year.
        fpath_year = db_path + stn + "/" + stn + str(year) + ".csv"  ### output CSV file path for each year
        print("Creating", fpath_year)
        df_year.to_csv(fpath_year, header=False, index=False, encoding="utf-8")  ### Write as UTF-8
        ### Convert line endings to LF for Linux compatibility
        cmd = ['nkf', '-w', '-Lu', '--overwrite', fpath_year]
        print(f"Converting to LF: {' '.join(cmd)}")
        subprocess.call(cmd)
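
The function above relies on the external `nkf` command for the line-ending conversion. Where `nkf` is not installed, the same re-encoding could be done with the Python standard library alone. The following is a minimal sketch, not the method used in this notebook; the helper name `to_utf8_lf` and the demo rows are made up for illustration.

```python
import os
import tempfile


def to_utf8_lf(src_path, dst_path, src_encoding="shift_jis"):
    """Re-encode a text file to UTF-8 and normalize CRLF line endings to LF."""
    # newline="" disables newline translation, so CRLF is read as-is
    with open(src_path, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    text = text.replace("\r\n", "\n")
    # newline="\n" writes "\n" unchanged on every platform
    with open(dst_path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


# Demo with dummy rows (not real GWO records)
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "src.csv")
dst = os.path.join(tmpdir, "dst.csv")
with open(src, "wb") as f:
    f.write("1,東京,1,1990\r\n1,東京,1,1991\r\n".encode("shift_jis"))

to_utf8_lf(src, dst)
with open(dst, "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```

Recent pandas versions also accept a line-terminator argument in `to_csv`; its name has changed between releases, so check the documentation of the installed version before relying on it.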

## Split an hourly multi-year data file into yearly files
Under the directory given by **db_path**, e.g., `/mnt/d/dat/met/GWO/Hourly/`, a directory named after each station must exist, and each station directory must contain the corresponding multi-year CSV file, e.g., **Tokyo1961-1990.csv**.
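
The file naming convention above (station name followed by the first and last year) can be checked before processing. The sketch below is an illustration only; `parse_multi_year_name` is a hypothetical helper, and it assumes station names consist of ASCII letters as in the values of `stn_dict`.

```python
import re


def parse_multi_year_name(filename):
    """Return (station, first_year, last_year) or None if the name does not match."""
    # Expected form: <Station><YYYY>-<YYYY>.csv, e.g. Tokyo1961-1990.csv
    m = re.fullmatch(r"([A-Za-z]+)(\d{4})-(\d{4})\.csv", filename)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


print(parse_multi_year_name("Tokyo1961-1990.csv"))  # ('Tokyo', 1961, 1990)
print(parse_multi_year_name("Tokyo1961.csv"))       # None
```

A yearly output file such as `Tokyo1961.csv` does not match, so the check separates multi-year inputs from files already produced by the split.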

### Simple test case at one station

In [None]:
'''
stn="Tokyo"  ### "Tokyo" or "Yokohama" or "Chiba" or "Tateyama"
year_ini=1961  ### The first year with data differs by station, so set it as needed
GWO_div_year(stn=stn, year_ini=None, year_end=1990)
GWO_div_year(stn=stn, year_ini=1991, year_end=2021)
'''

### Split the hourly multi-year files into yearly files for all stations and all years at once
- The data interval differs between the period up to 1990 and the period from 1991 onward, so the file for each period must be processed separately.
- For the files up to 1990, use `year_ini=None`, because the first year of observation differs from station to station.
- For the files from 1991 onward, use `year_ini=1991` and `year_end=2021`. Set `year_end` to the last year available in the meteorological database (currently 2021).
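
`GWO_div_year` calls `sys.exit()` when a station directory or input file is missing, which stops a loop over all stations at the first failure. One possible way to let the batch continue is to catch `SystemExit` per station, as in the sketch below. The wrapper `run_for_all` and the stand-in `fake_split` are hypothetical and exist only for this illustration.

```python
def run_for_all(stations, split_func, **kwargs):
    """Call split_func for every station and return the stations that aborted."""
    failed = []
    for stn in stations:
        try:
            split_func(stn=stn, **kwargs)
        except SystemExit:
            # sys.exit() raises SystemExit; record the station and carry on
            failed.append(stn)
    return failed


# Stand-in for GWO_div_year, used only to demonstrate the wrapper
def fake_split(stn, year_ini=None, year_end=None):
    if stn == "Nowhere":
        raise SystemExit("Error: No such directory")


failed = run_for_all(["Tokyo", "Nowhere", "Chiba"], fake_split,
                     year_ini=1991, year_end=2021)
print(failed)  # ['Nowhere']
```

In the notebook, `split_func` would be `GWO_div_year` and `db_path` would be passed as an additional keyword argument.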

In [None]:
## Specify the directory db_path containing station directories with a multiple-year CSV file.
db_path = "/mnt/d/dat/met/JMA_DataBase/GWO/Hourly2/"
stns = list(stn_dict.values())

for stn in stns:
    ## None-1990
    # GWO_div_year(stn=stn, year_ini=None, year_end=1990, db_path=db_path)
    ## 1991-2021
    ## The years must be consistent with the years in the CSV file name.
    GWO_div_year(stn=stn, year_ini=1991, year_end=2021, db_path=db_path)


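# The following has not been tested yet
## Split a daily data file
As with hourly values, prepare the multi-year CSV file in the default Shift-JIS encoding with CRLF line endings. Example: Tokyo1961-2017.csv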

In [None]:
'''
stn="Matsue"
year_ini=None
year_end=None
db_path = "/mnt/d/dat/met/GWO/Daily/"
GWO_div_year(stn=stn, year_ini=year_ini, year_end=year_end, db_path=db_path)
'''

### Split the daily multi-year files into yearly files for all stations and all years at once

In [None]:
'''
stns = list(stn_dict.values())
db_path = "/mnt/d/dat/met/GWO/Daily/"
for stn in stns:
    GWO_div_year(stn=stn, year_ini=None, year_end=None, db_path=db_path)
'''