# Fetch Population Data Page

Here, a **Page** means one entry in the left-hand panel of the website at

```text
http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/indexch.htm
```

The **example page** used throughout this notebook is

```text
http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/html/A0101a.htm
```

In [1]:
import os
import json 
import pandas as pd 
import plotly.express as px

## Fetch Example Data Page

In [2]:
# Set up the example URL
url = 'http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/html/A0101a.htm'

In [3]:
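# Cache the fetched table locally so the page is downloaded only once
fname = 'fetched.json'
if os.path.isfile(fname):
    fetched = pd.read_json(fname)
else:
    # read_html returns every <table> found on the page; keep the first one
    fetched = pd.read_html(url)[0]
    fetched.to_json(fname)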
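
The cell above follows a cache-or-fetch pattern: read the local file if it exists, otherwise download and save. As a minimal sketch of the same idea with only the standard library, the hypothetical helper `load_or_fetch` below takes any zero-argument `fetch` callable; `fake_fetch` is a stand-in for the real network request and is not part of the notebook.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return the cached JSON at `path`, calling `fetch()` only if it is missing."""
    if os.path.isfile(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    data = fetch()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)
    return data


calls = []


def fake_fetch():
    # Stand-in for the network request; records every call it receives
    calls.append(1)
    return {'title': 'demo', 'rows': [[1, 2], [3, 4]]}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'fetched.json')
    first = load_or_fetch(cache, fake_fetch)   # cache miss: fetches and writes
    second = load_or_fetch(cache, fake_fetch)  # cache hit: reads the file

print(len(calls), first == second)
```

The second call never reaches `fake_fetch`, which is the property that keeps repeated notebook runs from hitting the website again.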

In [4]:
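def merge_objs(lst):
    '''Join the distinct cells of [lst] with "-", dropping repeats and inner whitespace.'''
    lst = list(lst)
    # Keep a cell only the first time it appears, then strip all whitespace from it
    res = [''.join(e.split()) for n, e in enumerate(lst) if e not in lst[:n]]
    return '-'.join(res)


def parse_dataFrame(df):
    '''Split the raw table into its title, merged column header, and data body.'''
    # The table title sits in the top-left cell
    title = df[0][0]

    # Header rows are those whose first cell is '地 区' ("Region");
    # merge them column by column into one flat label per column
    _df = df[df[0] == '地 区']
    header = _df.apply(merge_objs)

    # Everything after the title row that is not a header row is data
    _df = df.iloc[1:]
    body = _df[_df[0] != '地 区'].copy()

    body.columns = header.to_list()
    # Region names are padded with spaces (e.g. '北 京'); strip them for matching
    body['Location'] = body['地区'].map(lambda e: ''.join(e.split()))
    # Drop the nationwide total row ('全国')
    body = body[body['Location'] != '全国']
    body.index = range(len(body))

    return title, header, body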
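
To make the header merging concrete, the sketch below repeats `merge_objs` so it runs on its own and applies it to a few made-up header columns. The input lists are illustrative assumptions, not the actual cells of the fetched page. Note that repeats are detected on the raw text, before whitespace is removed, so `'合计'` and `'合 计'` count as different cells.

```python
def merge_objs(lst):
    lst = list(lst)
    # Keep a cell only the first time it appears, then strip all whitespace from it
    res = [''.join(e.split()) for n, e in enumerate(lst) if e not in lst[:n]]
    return '-'.join(res)


# A cell spanning every header row collapses to a single label
region = merge_objs(['地 区', '地 区', '地 区'])

# Distinct levels are joined from top to bottom
male = merge_objs(['人口数', '合 计', '男'])

# Cells that differ only by whitespace are NOT treated as repeats
mixed = merge_objs(['人口数', '合计', '合 计'])

print(region)  # 地区
print(male)    # 人口数-合计-男
print(mixed)   # 人口数-合计-合计
```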

In [5]:
raw = fetched.dropna()
title, header, df = parse_dataFrame(raw)
print(f'Title is "{title}"')
# print(f'Header is "{header}"')
df.iloc[:10]

Title is "1-1 各地区户数、人口数和性别比"


Unnamed: 0,地区,户数-合计,户数-家庭户,户数-集体户,人口数-合计-合计,人口数-合计-男,人口数-合计-女,人口数-合计-性别比-(女=100),人口数-家庭户-小计,人口数-家庭户-男,人口数-家庭户-女,人口数-家庭户-性别比-(女=100),人口数-集体户-小计,人口数-集体户-男,人口数-集体户-女,人口数-集体户-性别比-(女=100),平均家庭-户规模-（人/户）,Location
0,北 京,7355291,6680552,674739,19612368,10126430,9485938,106.75,16389723,8173161,8216562,99.47,3222645,1953269,1269376,153.88,2.45,北京
1,天 津,3963604,3661992,301612,12938693,6907091,6031602,114.52,10262186,5129604,5132582,99.94,2676507,1777487,899020,197.71,2.8,天津
2,河 北,20813492,20395116,418376,71854210,36430286,35423924,102.84,68538709,34552649,33986060,101.67,3315501,1877637,1437864,130.59,3.36,河北
3,山 西,10654162,10330207,323955,35712101,18338760,17373341,105.56,33484131,16988087,16496044,102.98,2227970,1350673,877297,153.96,3.24,山西
4,内 蒙 古,8470472,8205498,264974,24706291,12838243,11868048,108.17,23071690,11725291,11346399,103.34,1634601,1112952,521649,213.35,2.81,内蒙古
5,辽 宁,15334912,14994046,340866,43746323,22147745,21598578,102.54,41755874,20956756,20799118,100.76,1990449,1190989,799460,148.97,2.78,辽宁
6,吉 林,9162183,8998492,163691,27452815,13907218,13545597,102.67,26457769,13358390,13099379,101.98,995046,548828,446218,123.0,2.94,吉林
7,黑 龙 江,13192935,13000088,192847,38313991,19426106,18887885,102.85,36884039,18603181,18280858,101.76,1429952,822925,607027,135.57,2.84,黑龙江
8,上 海,8893483,8253257,640226,23019196,11854916,11164280,106.19,20593430,10318168,10275262,100.42,2425766,1536748,889018,172.86,2.5,上海
9,江 苏,25635291,24381782,1253509,78660941,39626707,39034234,101.52,71685839,35542124,36143715,98.34,6975102,4084583,2890519,141.31,2.94,江苏


## Load Province Map 

In [6]:
# Map the full names of the autonomous regions used in the GeoJSON
# to the short names used in the census table
alias_province_name = {
    '广西壮族自治区': '广西',
    '内蒙古自治区': '内蒙古',
    '宁夏回族自治区': '宁夏',
    '新疆维吾尔自治区': '新疆',
    '西藏自治区': '西藏',
}

# with open(os.path.join(os.environ['SYNC'], 'GeoData', 'json-files', '100000_full.json')) as f:
with open('china_province.geojson', encoding='utf-8') as f:
    province_map = json.load(f)

# Rename the aliased features in place, printing every name and each rename
for feature in province_map['features']:
    name = feature['properties']['NL_NAME_1']
    print(name)
    if name in alias_province_name:
        feature['properties']['NL_NAME_1'] = alias_province_name[name]
        print(name, '-->', alias_province_name[name])

安徽
北京
重庆
福建
甘肃
广东
广西壮族自治区
广西壮族自治区 --> 广西
贵州
海南
河北
黑龙江
河南
湖北
湖南
江苏
江西
吉林
辽宁
内蒙古自治区
内蒙古自治区 --> 内蒙古
宁夏回族自治区
宁夏回族自治区 --> 宁夏
青海
陕西
山东
上海
山西
四川
天津
新疆维吾尔自治区
新疆维吾尔自治区 --> 新疆
西藏自治区
西藏自治区 --> 西藏
云南
浙江
台湾
香港
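
The renaming matters because the choropleth matches table rows to map features by exact name. A quick way to check the match is to list the table locations that have no feature with the same name. The sketch below uses a hypothetical helper `unmatched_locations` and a two-feature stand-in for the real GeoJSON; only the `NL_NAME_1` property name is taken from the notebook.

```python
def unmatched_locations(locations, geojson, prop='NL_NAME_1'):
    """Return the sorted locations that match no feature's `prop` value."""
    names = {f['properties'][prop] for f in geojson['features']}
    return sorted(set(locations) - names)


# Tiny stand-in for the province GeoJSON (geometry omitted)
sample_map = {'features': [
    {'properties': {'NL_NAME_1': '北京'}},
    {'properties': {'NL_NAME_1': '内蒙古自治区'}},
]}
aliases = {'内蒙古自治区': '内蒙古'}
locations = ['北京', '内蒙古']

before = unmatched_locations(locations, sample_map)

# Apply the aliases in place, as the cell above does
for feature in sample_map['features']:
    name = feature['properties']['NL_NAME_1']
    feature['properties']['NL_NAME_1'] = aliases.get(name, name)

after = unmatched_locations(locations, sample_map)

print(before)  # ['内蒙古']
print(after)   # []
```

Running the same check with `df['Location']` and `province_map` would show any region that would be left uncolored on the map.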


In [7]:
# Cast the column that will be plotted to float; the other columns stay as object
df = df.astype({'户数-合计': 'float'})
df.dtypes

地区                      object
户数-合计                  float64
户数-家庭户                  object
户数-集体户                  object
人口数-合计-合计               object
人口数-合计-男                object
人口数-合计-女                object
人口数-合计-性别比-(女=100)      object
人口数-家庭户-小计              object
人口数-家庭户-男               object
人口数-家庭户-女               object
人口数-家庭户-性别比-(女=100)     object
人口数-集体户-小计              object
人口数-集体户-男               object
人口数-集体户-女               object
人口数-集体户-性别比-(女=100)     object
平均家庭-户规模-（人/户）          object
Location                object
dtype: object
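
Only the plotted column is converted above, so every other value column is still `object`. If more columns are needed as numbers, one option is `pd.to_numeric` applied to all non-text columns at once. This is a sketch on a two-row sample, assuming the values are stored as numeric strings; `text_cols` and `value_cols` are illustrative names.

```python
import pandas as pd

# Two-row sample with the same kind of columns as the parsed table
sample = pd.DataFrame({
    '地区': ['北 京', '天 津'],
    '户数-合计': ['7355291', '3963604'],
    '平均家庭-户规模-（人/户）': ['2.45', '2.8'],
    'Location': ['北京', '天津'],
})

# Columns that hold names rather than numbers
text_cols = ['地区', 'Location']
value_cols = [c for c in sample.columns if c not in text_cols]

converted = sample.copy()
# errors='coerce' turns anything unparsable into NaN instead of raising
converted[value_cols] = converted[value_cols].apply(pd.to_numeric, errors='coerce')

print(converted.dtypes)
```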

In [8]:
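# Read the Mapbox access token from a local file, closing the file afterwards
with open(os.path.join(os.environ['Onedrive'], 'SafeBox', '.mapbox_token')) as f:
    mapbox_token = f.read().strip()

# Color each region by its total number of households ('户数-合计'),
# matching df['Location'] against the NL_NAME_1 property of each GeoJSON feature
fig = px.choropleth_mapbox(
    data_frame=df,
    geojson=province_map,
    color='户数-合计',
    locations='Location',
    featureidkey='properties.NL_NAME_1',
    # mapbox_style="Dark",
    # mapbox_accesstoken=mapbox_token,
    color_continuous_scale='viridis'
)
# Center the view on China and save the interactive map as a standalone HTML file
fig.update_layout(mapbox_style='light',
                  mapbox_accesstoken=mapbox_token,
                  mapbox_zoom=3,
                  mapbox_center={'lat': 37.110573, 'lon': 106.493924})
fig.write_html('example.html')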