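## National Land Numerical Information: Tourism Resource Data
2023/04/25  
Download the data and convert it into a format that is easy to analyze.  
[Ministry of Land, Infrastructure, Transport and Tourism: Tourism Resource Data - National Land Numerical Information](https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-P12-v2_2.html)

### Data cleansing
P12-14_22.xml: data for Shizuoka Prefecture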

In [1]:
import pandas as pd
import xmljson
from lxml import etree
import os
import csv
import collections

In [98]:
!pip install xmljson

Collecting xmljson
  Downloading xmljson-0.2.1-py2.py3-none-any.whl (10 kB)
Installing collected packages: xmljson
Successfully installed xmljson-0.2.1


In [2]:
def getCurve(name, xmldata):
    # ★ Tourism resources (line and surface): the geometry is keyed by the IDs found here (latitude/longitude)
    if isinstance(xmldata, collections.OrderedDict):
        xml_id = xmldata['{http://www.opengis.net/gml/3.2}id']
        if 'pc_' in xml_id:
            xml_id = 'pa'+xml_id[2:]
        savelatlon(name, xml_id, xmldata['{http://www.opengis.net/gml/3.2}segments']['{http://www.opengis.net/gml/3.2}LineStringSegment'])
    else:
        for i in range(len(xmldata)):
            xml_id = xmldata[i]['{http://www.opengis.net/gml/3.2}id']
            #print(xml_id)
            if 'pc_' in xml_id:
                xml_id = 'pa'+xml_id[2:]
            savelatlon(name, xml_id, xmldata[i]['{http://www.opengis.net/gml/3.2}segments']['{http://www.opengis.net/gml/3.2}LineStringSegment'])
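
The `isinstance` branches in `getCurve` and in the functions that follow suggest that the converted XML yields a single mapping when an element occurs once and a list when it repeats. A minimal sketch of one way to handle both shapes with a single code path is shown here. The helper name `as_list` and the sample IDs are assumptions for illustration and are not part of the original notebook.

```python
from collections import OrderedDict

GML_ID = '{http://www.opengis.net/gml/3.2}id'

def as_list(node):
    """Wrap a single mapping in a list; return a list unchanged (hypothetical helper)."""
    return [node] if isinstance(node, dict) else list(node)

# Made-up samples: one element on its own, and the same element repeated
one_curve = OrderedDict([(GML_ID, 'pc_00000_0')])
two_curves = [OrderedDict([(GML_ID, 'lc_00000_0')]),
              OrderedDict([(GML_ID, 'pc_00001_0')])]

ids_one = [curve[GML_ID] for curve in as_list(one_curve)]
ids_two = [curve[GML_ID] for curve in as_list(two_curves)]
print(ids_one, ids_two)
```

With such a helper, each function could loop over `as_list(xmldata)` instead of duplicating its body in an `if`/`else`.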

In [3]:
def savelatlon(name, xml_id, xmldata):
    if isinstance(xmldata, collections.OrderedDict):
        data = xmldata['{http://www.opengis.net/gml/3.2}posList']
        data = data.split('\n')[1:-1]
        data = [[data[i].split(' ')[0], data[i].split(' ')[1]] for i in range(len(data))]
    else:
        data = []
        for i in range(len(xmldata)):
            xdata = xmldata[i]['{http://www.opengis.net/gml/3.2}posList']
            xdata = xdata.split('\n')[1:-1]
            data = data + [[xdata[i].split(' ')[0], xdata[i].split(' ')[1]] for i in range(len(xdata))]
    # newline='' avoids blank rows on Windows; the with-block closes the file even if writing fails
    with open('01_output/{}/{}.csv'.format(name, xml_id), 'w', newline='') as f:
        csv.writer(f).writerows(data)
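
To make the parsing step in `savelatlon` concrete, the sketch below splits a `posList`-style string into coordinate pairs. It assumes, as the `[1:-1]` slice in `savelatlon` implies, that the text begins and ends with a newline and holds one space-separated pair per line. The helper name `parse_pos_list` and the sample coordinates are made up for illustration.

```python
def parse_pos_list(pos_list_text):
    """Split a newline-delimited coordinate string into [first, second] string pairs."""
    lines = pos_list_text.split('\n')[1:-1]  # drop the empty pieces before/after the data
    return [line.split()[:2] for line in lines]

# Made-up sample in the assumed layout
sample = '\n35.0 138.0\n35.1 138.1\n'
pairs = parse_pos_list(sample)
print(pairs)
```

Using `split()` without an argument also tolerates repeated or leading spaces, which `split(' ')` does not.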

In [4]:
def multicode(xmldata):
    text = ''
    if isinstance(xmldata, collections.OrderedDict):#len(xmldata) == 1
        return xmldata['content']
    else:
        for i in range(len(xmldata)):
            if i > 0:
                text = text + ','
            text = text + xmldata[i]['content']
        return text

In [5]:
def getResource(resources, name, xmldata):
    if isinstance(xmldata, collections.OrderedDict):
        xml_id = xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}'+'{}'.format(getflag(xmldata['{http://www.opengis.net/gml/3.2}id']))]['{http://www.w3.org/1999/xlink}href'][1:]
        latlogs = pd.read_csv('01_output/{}/{}.csv'.format(name, xml_id),header=None)
        resources.append([xml_id,xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceName'],xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}address'], latlogs.mean()[0], latlogs.mean()[1],multicode(xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}prefectureCode']),multicode(xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}administartiveAreaCode']),xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceKindName'],xmldata['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}tourismResourceCategoryCode']])
    else:
        flag = getflag(xmldata[0]['{http://www.opengis.net/gml/3.2}id'])
        for i in range(len(xmldata)):
            xml_id = xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}'+'{}'.format(flag)]['{http://www.w3.org/1999/xlink}href'][1:]
            latlogs = pd.read_csv('01_output/{}/{}.csv'.format(name, xml_id),header=None)
            resources.append([xml_id,xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceName'],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}address'], latlogs.mean()[0], latlogs.mean()[1],multicode(xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}prefectureCode']),multicode(xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}administartiveAreaCode']),xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceKindName'],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}tourismResourceCategoryCode']])
            #print([xml_id,xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceName'],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}address'], latlogs.mean()[0], latlogs.mean()[1],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}prefectureCode']['content'],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}administartiveAreaCode']['content'],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceKindName'],xmldata[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}tourismResourceCategoryCode']])
    return resources
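
`getResource` reduces each line or surface to one representative point by averaging the coordinates saved for it (`latlogs.mean()`). The sketch below shows the same idea with the standard library only. The function name `representative_point` and the sample vertices are assumptions for illustration. Note that this is a plain vertex average, not a length-weighted or area-weighted centroid.

```python
from statistics import mean

def representative_point(pairs):
    """Average the first and second coordinate over all vertices (hypothetical helper)."""
    firsts = [float(p[0]) for p in pairs]
    seconds = [float(p[1]) for p in pairs]
    return mean(firsts), mean(seconds)

# Made-up vertices in the same [lat, lon] string form that savelatlon writes
vertices = [['35.0', '138.0'], ['35.5', '138.5'], ['36.0', '139.0']]
lat, lon = representative_point(vertices)
print(lat, lon)
```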

In [6]:
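def getflag(gmlid):
    # Map the feature's gml:id prefix to the ksj element that holds its geometry reference
    if 'FL03_' in gmlid:
        return 'location'
    elif 'FL02_' in gmlid:
        return 'bounds'
    #elif 'FL04_' in gmlid:
    #    return 'position'
    # Fail with a clear message rather than an unbound local when the prefix is unknown
    raise ValueError('unexpected gml:id prefix: {}'.format(gmlid))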

In [7]:
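def getPointLatlon(pos):
    # A list of "lat lon" strings yields [[lats], [lons]]; a single string yields [lat, lon]
    if isinstance(pos, list):
        plist = [[],[]]
        for p in pos:
            lat, lon = p.split(' ')[:2]
            # Convert to float so both branches return numbers
            plist[0].append(float(lat))
            plist[1].append(float(lon))
        return plist
    else:
        return [float(s) for s in pos.split(' ')]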

In [8]:
def getResourcePoint(resources, xmldata1, xmldata2):
    dict_points = {}
    for i in range(len(xmldata1)):
        dict_points[xmldata1[i]['{http://www.opengis.net/gml/3.2}id']] = getPointLatlon(xmldata1[i]['{http://www.opengis.net/gml/3.2}pos'])
    #print(dict_points)
    for i in range(len(xmldata2)):
        xml_id = xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}position']['{http://www.w3.org/1999/xlink}href'][1:]
        if isinstance(dict_points[xml_id][0], list):
            for j in range(len(dict_points[xml_id][0])):
                resources.append([xml_id, xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceName'], xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}address'], dict_points[xml_id][0][j], dict_points[xml_id][1][j],multicode(xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}prefectureCode']),multicode(xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}administartiveAreaCode']),xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceKindName'],xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}tourismResourceCategoryCode']])
        else:
            resources.append([xml_id, xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceName'], xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}address'], dict_points[xml_id][0], dict_points[xml_id][1],multicode(xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}prefectureCode']),multicode(xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}administartiveAreaCode']),xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}turismResorceKindName'],xmldata2[i]['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}tourismResourceCategoryCode']])
    return resources
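
`getResourcePoint` joins two parts of the dataset: each point resource carries an `xlink:href` such as `#n00156`, and stripping the leading `#` gives the `gml:id` of the point that holds its coordinates. A minimal sketch of that lookup follows. The dictionary shapes mirror the keys used in the notebook but are simplified assumptions, the `'name'` key is a stand-in for the namespaced name field, and the two sample records reuse values from the output table.

```python
GML_ID = '{http://www.opengis.net/gml/3.2}id'
GML_POS = '{http://www.opengis.net/gml/3.2}pos'
KSJ_POSITION = '{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}position'
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'

# Simplified stand-ins for the Point and TourismResource_Point entries
points = [
    {GML_ID: 'n00156', GML_POS: '24.332775 123.815205'},
    {GML_ID: 'n00157', GML_POS: '24.341647 123.980528'},
]
point_resources = [
    {KSJ_POSITION: {XLINK_HREF: '#n00157'}, 'name': '小浜島'},
    {KSJ_POSITION: {XLINK_HREF: '#n00156'}, 'name': '西表島'},
]

# Index the coordinates by gml:id, then resolve each href against that index
coords_by_id = {p[GML_ID]: [float(v) for v in p[GML_POS].split(' ')] for p in points}

resolved = []
for res in point_resources:
    target_id = res[KSJ_POSITION][XLINK_HREF][1:]  # '#n00157' -> 'n00157'
    lat, lon = coords_by_id[target_id]
    resolved.append((target_id, res['name'], lat, lon))
print(resolved)
```

Binding the namespaced keys to short constants, as done here, is also one way to shorten the very long lines in the functions above.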

In [9]:
# Load the XML data
allresources = []
for prenum in range(1,48):
    resources = []
    name = 'P12-14_{:02}'.format(prenum)
    if prenum in [10,12,17,21,22,28]:
        xml_tree = etree.parse('01_input/{}_replace.xml'.format(name))
    else:
        xml_tree = etree.parse('01_input/{}.xml'.format(name))
    # Get the root element (all tags)
    xml_root = xml_tree.getroot()
    # Convert the XML data to a dict
    xml_dict = xmljson.yahoo.data(xml_root)
    
    # Create the output directory if it does not exist
    os.makedirs('01_output/{}'.format(name), exist_ok=True)
        
    xml_keys = xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset'].keys()
    print('name', name)
    #print('xml_keys', xml_keys)
    
    # Dataset id
    #xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://www.opengis.net/gml/3.2}id']
    
    # Dataset name
    #xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://www.opengis.net/gml/3.2}description']
    
    # Basic dataset information
    #xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://www.opengis.net/gml/3.2}boundedBy']
    
    # Get the latitude/longitude coordinates of the surfaces and lines
    getCurve(name, xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://www.opengis.net/gml/3.2}Curve'])
    
    # Tourism resources (line)
    if '{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}TourismResource_Line' in xml_keys:
        resources = getResource(resources, name, xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}TourismResource_Line'])
    # Tourism resources (surface): name and address
    if '{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}TourismResource_Surface' in xml_keys:
        resources = getResource(resources, name, xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}TourismResource_Surface'])
    
    if '{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}TourismResource_Point' in xml_keys:
        resources = getResourcePoint(resources, xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://www.opengis.net/gml/3.2}Point'], xml_dict['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Dataset']['{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}TourismResource_Point'])
    
    pd_resources = pd.DataFrame(resources, columns=['id', 'name', 'adress', 'lat', 'lon', 'precode', 'areacode', 'category', 'category_flag'])
    pd_resources.to_csv('01_output/{}/resources.csv'.format(name))
    pd_resources.to_excel('01_output/{}/resources.xlsx'.format(name))
    
    allresources = allresources + resources
pd_resources = pd.DataFrame(allresources, columns=['id', 'name', 'adress', 'lat', 'lon', 'precode', 'areacode', 'category', 'category_flag'])
pd_resources.to_csv('01_output/resources.csv')
pd_resources.to_excel('01_output/resources.xlsx')
pd_resources

name P12-14_01
name P12-14_02
name P12-14_03
name P12-14_04
name P12-14_05
name P12-14_06
name P12-14_07
name P12-14_08
name P12-14_09
name P12-14_10
name P12-14_11
name P12-14_12
name P12-14_13
name P12-14_14
name P12-14_15
name P12-14_16
name P12-14_17
name P12-14_18
name P12-14_19
name P12-14_20
name P12-14_21
name P12-14_22
name P12-14_23
name P12-14_24
name P12-14_25
name P12-14_26
name P12-14_27
name P12-14_28
name P12-14_29
name P12-14_30
name P12-14_31
name P12-14_32
name P12-14_33
name P12-14_34
name P12-14_35
name P12-14_36
name P12-14_37
name P12-14_38
name P12-14_39
name P12-14_40
name P12-14_41
name P12-14_42
name P12-14_43
name P12-14_44
name P12-14_45
name P12-14_46
name P12-14_47

| | id | name | adress | lat | lon | precode | areacode | category | category_flag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | lc_00000_0 | 層雲峡 | 上川町 | 43.729559 | 142.962962 | 01 | 01457 | 河川・峡谷 | -1 |
| 1 | lc_00001_0 | 天塩川 | 音威子府村　および　中川町　および　天塩町　および　幌延町 | 44.861605 | 141.982589 | 01 | 01470014710148701520 | 河川・峡谷 | -1 |
| 2 | lc_00002_0 | 石狩川 | 札幌市北区　および　札幌市東区　および　石狩市　および　当別町 | 43.216942 | 141.382275 | 01 | 01102011030123501303 | 河川・峡谷 | -1 |
| 3 | pa_00000_0 | サロベツ原野 | 豊富町　および　幌延町 | 45.133595 | 141.673084 | 01 | 0151601520 | 高原・湿原・原野 | -1 |
| 4 | pa_00001_0 | 霧多布湿原 | 浜中町 | 43.093102 | 145.064205 | 01 | 01663 | 高原・湿原・原野 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18157 | n00156 | 西表島 | 竹富町 | 24.332775 | 123.815205 | 47 | 47381 | ‐ | 1 |
| 18158 | n00157 | 小浜島 | 竹富町 | 24.341647 | 123.980528 | 47 | 47381 | ‐ | 1 |
| 18159 | n00158 | 黒島 | 竹富町 | 24.236759 | 124.011932 | 47 | 47381 | ‐ | 1 |
| 18160 | n00159 | 波照間島 | 竹富町 | 24.05975 | 123.783813 | 47 | 47381 | ‐ | 1 |
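
Codes such as `01` and `01457` begin with a zero, so they need to stay strings when the CSV is read back. Parsing them as integers would drop the leading zero. The sketch below reads two rows copied from the output above with the standard `csv` module, which keeps every field as text, and converts only the coordinates to numbers. The in-memory sample stands in for `01_output/resources.csv`. With pandas, passing `dtype=str` for these columns to `pd.read_csv` should have the same effect.

```python
import csv
import io

# Two rows copied from the output above; the first column is the unnamed index
sample_csv = (
    ",id,name,adress,lat,lon,precode,areacode,category,category_flag\n"
    "0,lc_00000_0,層雲峡,上川町,43.729559,142.962962,01,01457,河川・峡谷,-1\n"
    "4,pa_00001_0,霧多布湿原,浜中町,43.093102,145.064205,01,01663,高原・湿原・原野,-1\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # Only the coordinates become numbers; the codes remain strings
    row['lat'] = float(row['lat'])
    row['lon'] = float(row['lon'])

print(rows[0]['precode'], rows[0]['areacode'], rows[0]['lat'])
```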


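## Memo: fixes to the XML files
Some files in the Ministry of Land, Infrastructure, Transport and Tourism dataset raised errors when parsed, so they were corrected as follows.  
■P12-14_12.xml  
Append the following to the end of the file:  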
``` 
</ksj:TourismResource_Surface>
</ksj:Dataset>
```
■P12-14_17.xml P12-14_21.xml P12-14_22.xml P12-14_28.xml  
Replace each half-width (ASCII) `&` with `&amp;`  
``` python
# Replace XX with the number of the file to fix
with open('01_input/P12-14_XX.xml') as file:
    xml_text = file.read()
# Escape every raw ampersand so that the XML parses
xml_text = xml_text.replace('&', '&amp;')
with open('01_input/P12-14_XX_replace.xml', 'w') as f:
    f.write(xml_text)
``` 
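
A plain `replace('&', '&amp;')` also rewrites ampersands that already belong to an entity, so `&amp;` would become `&amp;amp;`. If a file mixes bare ampersands with valid entities, a regular expression that skips existing entities may be safer. The following is a heuristic sketch, not the method used in this notebook; the pattern, the function name `escape_bare_ampersands`, and the sample text are assumptions.

```python
import re

# Match '&' only when it does not already start an entity such as &amp;, &#12354; or &#x41;
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')

def escape_bare_ampersands(xml_text):
    """Escape ampersands that are not part of an entity reference (heuristic)."""
    return BARE_AMP.sub('&amp;', xml_text)

# Made-up sample mixing a bare ampersand with entities that are already valid
sample = '<name>A&B</name><name>C &amp; D</name><name>&#x41;&lt;</name>'
fixed = escape_bare_ampersands(sample)
print(fixed)
```

Text such as `A&B;` would still be mistaken for an entity, so the result is worth checking by parsing the rewritten file.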