<a href="https://colab.research.google.com/github/hannari-python/tutorial/blob/master/family_budget/kakei_chosa_prepro_to_dash.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing Household Budget Survey Data

## About the Data

- [National Statistics Center (NSTAC), General-Use Microdata: 2009 National Survey of Family Income and Expenditure (ten major expenditure categories)](https://github.com/hannari-python/tutorial/raw/master/data/ippan-microdata/ippan_2009zensho.zip) https://www.nstac.go.jp/services/ippan-microdata.html
- [National Statistics Center (NSTAC), General-Use Microdata: 2009 National Survey of Family Income and Expenditure (detailed items)](https://github.com/hannari-python/tutorial/raw/master/data/ippan-microdata/ippan_2009zensho_s.zip) https://www.nstac.go.jp/services/ippan-microdata.html
- [National Statistics Center (NSTAC), General-Use Microdata: Employment Status Survey (1992 to 2012)](https://github.com/hannari-python/tutorial/raw/master/data/ippan-microdata/ippan_shugyou.zip) https://www.nstac.go.jp/services/ippan-microdata.html

## Exercise 1

Open

https://www.nstac.go.jp/services/ippan-microdata.html

and check how ippan_2009zensho.zip and ippan_2009zensho_s.zip differ.

## Downloading and Extracting the Data

When you write a Linux command after `!`, that command is executed as a Linux command rather than as Python code.

wget is a Linux command that can be used to download data.

In [1]:
!wget https://github.com/hannari-python/tutorial/raw/master/data/ippan-microdata/ippan_2009zensho_s.zip

--2020-08-19 05:26:36--  https://github.com/hannari-python/tutorial/raw/master/data/ippan-microdata/ippan_2009zensho_s.zip
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/hannari-python/tutorial/master/data/ippan-microdata/ippan_2009zensho_s.zip [following]
--2020-08-19 05:26:36--  https://raw.githubusercontent.com/hannari-python/tutorial/master/data/ippan-microdata/ippan_2009zensho_s.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33090889 (32M) [application/zip]
Saving to: ‘ippan_2009zensho_s.zip’


2020-08-19 05:26:37 (57.4 MB/s) - ‘ippan_2009zensho_s.zip’ saved [33090889/33090889]



unzip is a Linux command that extracts zip files.

In [2]:
!unzip /content/ippan_2009zensho_s.zip

Archive:  /content/ippan_2009zensho_s.zip
  inflating: ippan_2009zensho_s/ippan_2009zensho_s.xls  
  inflating: ippan_2009zensho_s/ippan_2009zensho_s_dataset.csv  


## Exercise 2

- Check on which line of `ippan_2009zensho_s_dataset.csv` the table data starts.
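
One way to approach this exercise is to print the first few lines of the file together with their line numbers. The sketch below assumes the archive was extracted under `/content` as shown above; `show_head` is a hypothetical helper written for this illustration, not part of any library.

```python
import os
from itertools import islice

# Path assumed from the unzip output above.
CSV_PATH = '/content/ippan_2009zensho_s/ippan_2009zensho_s_dataset.csv'

def show_head(path, n=12, encoding='shift_jis'):
    """Return the first n lines of a text file, each prefixed with its 1-based line number."""
    with open(path, encoding=encoding) as f:
        return [f"{i}: {line.rstrip()}" for i, line in enumerate(islice(f, n), start=1)]

# Only print when the file is actually present (for example, on Colab after unzip).
if os.path.exists(CSV_PATH):
    for row in show_head(CSV_PATH):
        print(row)
```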


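## Installing the Python Packages Needed for the Analysis

pip is the command for installing Python packages.
Listing several package names separated by spaces installs all of them.

dash is a Python framework for building dashboard web applications.

jupyter_dash is a Python package for using dash inside Jupyter Notebook.

plotly is an open-source Python library for interactive graphs.

Colab comes with plotly preinstalled, but the cell below reinstalls it with the `--upgrade` option in order to use the latest version.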

In [3]:
!pip install dash jupyter_dash 
!pip install --upgrade plotly

Collecting dash
  Downloading https://files.pythonhosted.org/packages/1d/d1/191ad32bd9e6d10b2fc0f5d31e9e6a85fdb2642088658f75817d67bdeaea/dash-1.14.0.tar.gz (70kB)
     |████████████████████████████████| 71kB 3.3MB/s 
Collecting jupyter_dash
  Downloading https://files.pythonhosted.org/packages/b9/b9/5f9499a0154124a262c85e3a99033b9b3a20dc3d2707b587f52b32b60d76/jupyter_dash-0.3.1-py3-none-any.whl (49kB)
     |████████████████████████████████| 51kB 6.6MB/s 
Collecting flask-compress
  Downloading https://files.pythonhosted.org/packages/a0/96/cd684c1ffe97b513303b5bfd4bbfb4114c5f4a5ea8a737af6fd813273df8/Flask-Compress-1.5.0.tar.gz
Collecting dash_renderer==1.6.0
  Downloading https://files.pythonhosted.org/packages/da/a6/ddbcd01c638a2c235bfe13fd75155b344c7b7ab1c6466fe6d46b159897ad/dash_renderer-1.6.0.tar.gz (1.2MB)
     |████████████████████████████████| 1.2MB 37.5MB/s 
Collecting dash-core-components==1.10.2
  Downloading https://files.pythonhos

In [4]:
import pandas as pd

In [5]:
s = pd.read_csv('/content/ippan_2009zensho_s/ippan_2009zensho_s_dataset.csv', encoding='shift_jis', header=8)

## Exercise 3
- Check what happens to `s` if you omit the `encoding` option.
- Check what happens to `s` if you omit the `header` option.
- Check what happens to `s` if you change the value of the `header` option.
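
The effect of `header` can also be tried on a small in-memory CSV before touching the real file. In the sketch below the sample text in `raw` is made up for illustration and only mimics a file that has preamble lines before the actual column names.

```python
import io
import pandas as pd

# Made-up sample: two preamble lines, then the real header and two data rows.
raw = "Survey title,\nNote,\nA,B\n1,2\n3,4\n"

# Default (header=0): the first line is taken as the column names.
df_default = pd.read_csv(io.StringIO(raw))

# header=2: the line at 0-based index 2 is the header; earlier lines are skipped.
df_skip = pd.read_csv(io.StringIO(raw), header=2)

print(df_default.columns.tolist())
print(df_skip.columns.tolist())
```

By the same rule, `header=8` in the `pd.read_csv` call above tells pandas to take the column names from the line at 0-based index 8 (blank lines are not counted by default).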

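## What Is a DataFrame?

The data structure of `s` here is called a DataFrame.
For now, it is enough to think of a DataFrame as a data structure for holding tabular data.
Running a cell that contains only `s` displays the contents of the data.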
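
As a minimal sketch of what a DataFrame offers, the toy table below (an assumption for illustration, with a few `Food` and `Housing` values copied from the output of `s`) shows how to inspect its size, column names, and first rows.

```python
import pandas as pd

# Toy table: columns are named, rows are indexed 0, 1, 2, ...
toy = pd.DataFrame({
    'Food': [47756, 34054, 84501],
    'Housing': [16028, 7416, 1927],
})

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # column names
print(toy.head(2))           # first two rows
```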

## Exercise 4

- Open `ippan_2009zensho_s.xls` in Google Sheets and check the definitions of the rows and columns of the data matrix.

In [7]:
s

Unnamed: 0,3City,T_SeJinin,T_SyuJinin,T_JuSyoyu,T_Syuhi,T_Age_5s,T_Age_65,Weight,Y_Income,L_Expenditure,Food,Housing,LFW,Furniture,Clothes,Health,Transport,Education,Recreation,OL_Expenditure,E001,E002,E003,E004,E005,E006,E007,E008,E009,E010,E011,E012,E013,E014,E015,E016,E017,E018,E019,E020,...,E371,E372,E373,E374,E375,E376,E377,E378,E379,E380,E381,E382,E383,E384,E385,E386,E387,E388,E389,E390,E391,E392,E393,E394,E395,E396,E397,E398,E399,E400,E401,E402,E403,E404,E405,E406,E407,E408,E409,E410
0,1,2,1,1,1,1,1,895.266667,3917,201649,47756,16028,9652,6702,8088,726,21546,0,14433,76719,47756,4574,1728,1557,1027,262,3450,2070,555,393,432,3722,2882,785,1286,621,64,126,840,636,...,1320,2186,6448,138,512,535,582,277,4404,2885,65,939,1089,224,538,31,2240,18347,1185,57,5425,244,284,9490,231,95,182,1154,17820,15248,2572,24149,19000,5149,417,3591,1141,576,186,390
1,1,2,1,1,1,1,1,895.266667,6675,166381,34054,7416,26313,17062,6989,7637,20773,0,19048,27089,34054,2813,1019,1050,489,254,1863,905,564,132,261,2432,2112,632,621,697,23,139,320,203,...,231,668,1677,21,258,100,78,38,1182,1581,26,899,189,161,291,14,656,5794,229,10,3980,49,96,804,96,32,95,404,7335,6811,523,8710,7687,1023,239,437,348,282,94,188
2,1,2,1,1,1,1,1,895.266667,6706,259736,84501,1927,10082,6741,5090,11015,53372,0,17289,69719,84501,5004,657,1122,3067,158,7010,4216,718,836,1239,7742,5284,3382,1136,628,43,95,2458,2248,...,4942,3637,8037,190,1210,1606,423,557,4050,8032,124,3204,3350,316,1007,30,1096,20994,1447,40,4398,497,258,9418,295,75,121,4445,9115,8095,1021,11933,7405,4529,1166,2705,658,429,183,246
3,1,2,1,1,1,1,1,895.266667,2790,114511,41664,730,22358,5413,1205,5049,17411,0,8605,12077,41664,4372,1785,1474,910,202,3680,2713,363,255,349,3093,2512,925,828,565,45,149,581,383,...,306,113,2073,36,116,202,109,56,1554,487,18,116,147,45,146,14,182,1661,45,3,157,349,63,667,34,6,24,314,2692,2291,401,4136,2644,1492,69,977,445,205,55,151
4,1,2,1,1,1,1,1,895.266667,2577,193505,56981,3779,28747,4812,4243,751,16435,0,38231,39527,56981,4957,1463,1463,1763,268,4712,3181,558,413,560,4644,3726,1310,1682,518,87,129,919,637,...,842,699,7654,46,160,131,612,93,6613,1780,7,643,870,170,82,8,1116,4768,280,22,2000,164,50,1347,17,96,133,660,5131,3859,1272,16907,4840,12068,699,10940,428,445,138,307
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45806,0,3,2,2,2,0,2,552.266667,4398,216774,46169,39809,32508,2896,4203,18778,21739,30,4315,46326,46169,5207,2668,1391,913,235,4929,3086,810,515,518,3851,3055,1001,1263,607,64,119,797,604,...,1022,1163,3508,82,243,280,362,165,2376,1196,40,576,276,67,193,45,1122,12331,1560,291,728,1149,292,6481,243,48,904,634,10563,7368,3194,12858,10883,1975,175,1076,724,1817,957,860
45807,0,3,2,2,2,0,2,552.266667,4844,165978,52670,37839,25403,3532,2293,7016,5022,8,12656,19540,52670,5932,3008,1602,1051,271,5653,3556,927,578,593,4410,3527,1164,1477,680,71,134,883,669,...,408,335,1003,43,200,128,150,76,406,527,19,231,172,20,68,16,516,3927,727,82,125,888,81,1493,115,18,191,208,6634,5082,1552,5104,4120,984,89,523,372,769,363,406
45808,0,3,2,2,2,0,2,552.266667,4630,244064,53784,76871,18196,17747,3308,13499,27705,25,4309,28619,53784,6033,3086,1614,1052,280,5761,3633,932,596,599,4503,3581,1184,1485,700,75,138,922,707,...,626,757,2547,41,121,227,423,104,1630,936,26,282,295,38,266,29,402,6524,847,117,300,812,186,3022,187,38,608,408,6198,3433,2765,8804,7553,1251,190,699,362,1431,973,457
45809,0,3,2,2,2,0,2,552.266667,6738,399003,123080,38810,19711,2816,5666,21831,23914,28,58844,104302,123080,13423,6772,3770,2247,634,13315,8362,2196,1353,1405,10379,8271,2713,3424,1644,174,316,2108,1603,...,2179,2480,7502,198,515,597,772,353,5066,2641,85,1228,606,119,508,95,2025,26762,3328,817,1364,3160,624,13367,517,102,1928,1554,27466,16321,11145,27540,23209,4331,373,2395,1562,4129,2041,2088


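## Selecting DataFrame Columns

Let's select the columns of `s` up to `OL_Expenditure` and drop the columns that hold the finely itemized expenditure data.
To do this, use the DataFrame's `loc` indexer. Unlike ordinary Python slices, a label slice with `loc` includes its end label.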

In [8]:
data = s.loc[:, :'OL_Expenditure']
data

Unnamed: 0,3City,T_SeJinin,T_SyuJinin,T_JuSyoyu,T_Syuhi,T_Age_5s,T_Age_65,Weight,Y_Income,L_Expenditure,Food,Housing,LFW,Furniture,Clothes,Health,Transport,Education,Recreation,OL_Expenditure
0,1,2,1,1,1,1,1,895.266667,3917,201649,47756,16028,9652,6702,8088,726,21546,0,14433,76719
1,1,2,1,1,1,1,1,895.266667,6675,166381,34054,7416,26313,17062,6989,7637,20773,0,19048,27089
2,1,2,1,1,1,1,1,895.266667,6706,259736,84501,1927,10082,6741,5090,11015,53372,0,17289,69719
3,1,2,1,1,1,1,1,895.266667,2790,114511,41664,730,22358,5413,1205,5049,17411,0,8605,12077
4,1,2,1,1,1,1,1,895.266667,2577,193505,56981,3779,28747,4812,4243,751,16435,0,38231,39527
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45806,0,3,2,2,2,0,2,552.266667,4398,216774,46169,39809,32508,2896,4203,18778,21739,30,4315,46326
45807,0,3,2,2,2,0,2,552.266667,4844,165978,52670,37839,25403,3532,2293,7016,5022,8,12656,19540
45808,0,3,2,2,2,0,2,552.266667,4630,244064,53784,76871,18196,17747,3308,13499,27705,25,4309,28619
45809,0,3,2,2,2,0,2,552.266667,6738,399003,123080,38810,19711,2816,5666,21831,23914,28,58844,104302
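
Label-based slices with `loc` include the end label, which is why `OL_Expenditure` itself appears in `data`. A minimal sketch on a made-up one-row table (the `toy` frame is an assumption for illustration) contrasts this with position-based `iloc`, whose end position is excluded.

```python
import pandas as pd

toy = pd.DataFrame({'Weight': [895.3], 'Y_Income': [3917],
                    'Food': [47756], 'E001': [47756]})

# Label slice: everything up to AND including 'Food'.
by_label = toy.loc[:, :'Food']

# Position slice: columns 0 and 1 only; position 2 is excluded.
by_position = toy.iloc[:, :2]

print(by_label.columns.tolist())
print(by_position.columns.tolist())
```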


## Checking [Descriptive Statistics](https://ja.wikipedia.org/wiki/%E8%A6%81%E7%B4%84%E7%B5%B1%E8%A8%88%E9%87%8F)

Let's check the descriptive statistics to see the range of the data and what its typical values are.
To do this, use the DataFrame's `describe` method.


In [9]:
data.describe()

Unnamed: 0,3City,T_SeJinin,T_SyuJinin,T_JuSyoyu,T_Syuhi,T_Age_5s,T_Age_65,Weight,Y_Income,L_Expenditure,Food,Housing,LFW,Furniture,Clothes,Health,Transport,Education,Recreation,OL_Expenditure
count,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0,45811.0
mean,0.400559,2.61291,1.489904,1.176966,1.250289,4.200214,1.29148,693.326887,6401.920652,298373.7,68740.210364,16127.91,19420.995569,9373.956342,12054.818668,13280.96885,44692.37,15014.89,31099.29,68568.28
std,0.490017,0.48709,0.499904,0.381644,0.433184,3.137751,0.454449,161.527843,3810.787269,164932.8,29209.813101,38787.98,8246.369457,12941.144209,15069.659817,18928.182013,77388.39,44170.47,32895.03,84566.48
min,0.0,2.0,1.0,1.0,1.0,0.0,1.0,510.050847,366.0,42084.0,9737.0,0.0,1632.0,27.0,64.0,45.0,78.0,0.0,372.0,345.0
25%,0.0,2.0,1.0,1.0,1.0,0.0,1.0,566.503597,3805.0,194216.5,48104.5,659.0,13589.5,2877.5,3893.5,3849.0,10372.5,0.0,12182.0,24793.0
50%,0.0,3.0,1.0,1.0,1.0,4.0,1.0,594.246377,5504.0,260256.0,63498.0,2877.0,17905.0,5614.0,7686.0,7687.0,23465.0,1761.0,21708.0,44634.0
75%,1.0,3.0,2.0,1.0,2.0,7.0,2.0,811.616592,7983.5,354991.0,83502.0,17462.0,23544.5,11058.0,14515.5,15434.5,48760.5,13535.0,38311.0,80909.5
max,1.0,3.0,2.0,2.0,2.0,9.0,2.0,1365.894737,67394.0,3161852.0,394347.0,1702352.0,128725.0,584980.0,569141.0,715039.0,2306300.0,2711758.0,1019545.0,2133987.0
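
In the table above, the mean of `Y_Income` (about 6402) is larger than its median (the `50%` row, 5504), which suggests a distribution skewed toward high incomes. The sketch below shows how to read those entries out of a `describe` result; the five income values are a small sample copied from the first rows of `s` and are used only for illustration.

```python
import pandas as pd

income = pd.Series([2577, 2790, 3917, 6675, 6706], name='Y_Income')

summary = income.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# The '50%' entry is the median.
print(summary['50%'], income.median())
```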


## Visualization with plotly (express)

Here, let's visualize ...

In [None]:
import plotly.express as px

In [None]:
px.violin(data, y=data.columns)

In [None]:
px.histogram(data, x='Y_Income', histnorm='percent', nbins=100)

In [None]:
data_bins = pd.cut(data['Y_Income'].values, bins=10)

In [None]:
data_bins

# ↑ The automatically generated bins are not very useful...
# ↓ Define the bins by hand instead

In [None]:
bin_image = [0, 2000, 4000, 6000, 8000, 12000, 20000]

In [None]:
bin_image

In [None]:
import numpy as np

In [None]:
bin_image = [0, 2000, 4000, 6000, 8000, 12000, 20000]
bin_array = np.digitize(data['Y_Income'], bin_image)
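
`np.digitize` returns, for each value, the index of the bin it falls into: with the default settings, index `i` means `bin_image[i-1] <= x < bin_image[i]`, and values at or beyond the last edge get `len(bin_image)`. A minimal sketch with made-up incomes (`sample_incomes` is an assumption for illustration):

```python
import numpy as np

bin_image = [0, 2000, 4000, 6000, 8000, 12000, 20000]
sample_incomes = [1500, 2000, 5000, 25000]

# 1500 -> bin 1, 2000 -> bin 2 (left edge included), 5000 -> bin 3,
# 25000 -> bin 7 (beyond the last edge)
sample_bins = np.digitize(sample_incomes, bin_image)
print(sample_bins)  # → [1 2 3 7]
```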

In [None]:
cont = pd.read_excel("/content/ippan_2009zensho_s/ippan_2009zensho_s.xls")

In [None]:
cont[:50]

## Computing Each Category's Share of Consumption Expenditure

In [None]:
data_test = data.loc[: ,'L_Expenditure':'OL_Expenditure']

In [None]:
data_test

In [None]:
data_test.apply(lambda x: x/ data_test['L_Expenditure'])

In [None]:
data_expenditure_ratio = data_test.apply(lambda x: x / data_test['L_Expenditure'])
data_front = data.loc[:, :'Y_Income']
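
The `apply` call divides every column by `L_Expenditure` row by row. The same result can be obtained with `DataFrame.div(..., axis=0)`, which some readers may find more explicit. A minimal sketch on a made-up table (the `toy` values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'L_Expenditure': [200, 100],
                    'Food': [50, 40],
                    'Housing': [20, 10]})

# Column-by-column division, as in the cell above.
ratio_apply = toy.apply(lambda x: x / toy['L_Expenditure'])

# Equivalent: divide each row by that row's L_Expenditure.
ratio_div = toy.div(toy['L_Expenditure'], axis=0)

print(ratio_div)
```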

In [None]:
px.violin(data_expenditure_ratio, x=data_expenditure_ratio.columns[1:])

In [None]:
data_front = data_front.drop('Weight',axis=1)

In [None]:
data_preped = pd.concat([data_front, data_expenditure_ratio], axis=1)

In [None]:
data_preped

In [None]:
bin_image = [0, 2000, 4000, 6000, 8000, 12000, 20000]
bin_array = np.digitize(data['Y_Income'], bin_image)
data_preped['bins'] = bin_array

In [None]:
data_preped

In [None]:
import plotly.graph_objects as go

In [None]:
fig = go.Figure()

for num in data_preped.bins.unique():
  data_preped_num = data_preped[data_preped['bins'] == num]
  fig.add_trace(go.Histogram(x=data_preped_num['Food'], name=f'{num}', histnorm='probability', nbinsx=20))

fig.show()

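## What Kind of Visualization Should We Build?

- The number of bins can be changed
- The bin boundary values can be changed
- The result can be drawn as histograms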
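
Leaving the user interface aside, the computation behind these three points can be sketched with NumPy alone: group rows by user-supplied income boundaries, then build a normalized histogram of one column per group. `histograms_by_income_bin` and the toy data are hypothetical names and values for illustration. The sketch uses exactly `n_bins` equal-width bins, whereas plotly's `nbinsx` is only a maximum, so the bins plotly draws may differ.

```python
import numpy as np
import pandas as pd

def histograms_by_income_bin(df, column, income_edges, n_bins):
    """Return ({income_group: probabilities}, shared_bin_edges) for `column`."""
    # Assign each row to an income group; edges must be in increasing order.
    groups = np.digitize(df['Y_Income'], sorted(income_edges))
    # Use the same histogram bin edges for every group so they are comparable.
    edges = np.histogram_bin_edges(df[column], bins=n_bins)
    result = {}
    for g in np.unique(groups):
        counts, _ = np.histogram(df.loc[groups == g, column], bins=edges)
        result[int(g)] = counts / counts.sum()  # normalize to probabilities
    return result, edges

toy = pd.DataFrame({'Y_Income': [1000, 3000, 3500, 9000],
                    'Food': [0.1, 0.2, 0.3, 0.4]})
probs, edges = histograms_by_income_bin(toy, 'Food', [0, 2000, 4000], n_bins=3)
print(sorted(probs))
```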

In [None]:
data_for_dash = data_preped.drop('bins', axis=1)

In [None]:
import dash  # provides dash.no_update, which update_graph returns
from jupyter_dash import JupyterDash
import dash_core_components as dcc
import dash_html_components as html

from dash.dependencies import Input, Output, State, ALL

In [None]:
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = JupyterDash(__name__, external_stylesheets=external_stylesheets)

app.layout = html.Div([
                       

                       html.Button(id='my_button', children='Add Input'),
                      
                      html.Div([
                       html.Div([
                       html.Div(id='input_zone', children=[]),
                      ], style={'width': '25%', 'display': 'inline-block', 'verticalAlign': 'top'}),
                       


                       html.Div([
                                 dcc.Dropdown(id='my_dropdown',
                                              options=[{'label': col, 'value': col} for col in data_for_dash.columns],
                                              value='Food'
                                              ),
                                 dcc.Graph(id='my_graph'),
                                
                                html.Div([
                                 html.H3('Histogram Bin Num: '),
                                 dcc.Input(id='bin_num', value=10, type='number'),
                                ]),
                       ], style={'width': '70%', 'display': 'inline-block'}),
                      ]),
                       
                       html.Div([
                       dcc.RangeSlider(id='my_range_slider',
                                       min=0,
                                       max=data_for_dash['Y_Income'].max()
                                       
                                       ),
                        html.Button(id='slider_button', children='graph update'),
                       ], style={'width': '90%', 'height': 100, 'margin': 'auto'}),

])

@app.callback(Output('input_zone', 'children'), [Input('my_button', 'n_clicks')], [State('input_zone', 'children')], prevent_initial_call=True)
def update_input_zone(n_clicks, existing_children):
  my_inputs = html.Div([
                        dcc.Input(id={'type': 'my_inputs', 'index': n_clicks}, value=0)
  ])
  existing_children.append(my_inputs)
  return existing_children

@app.callback(Output('my_range_slider', 'value'), [Input({'type': 'my_inputs', 'index': ALL}, 'value')])
def update_slider(enter_values):
  if len(enter_values) > 1:
    enter_values = [int(i) for i in enter_values]
  return enter_values

@app.callback(Output('my_graph', 'figure'), [Input('slider_button', 'n_clicks'), Input('my_dropdown', 'value'), Input('bin_num', 'value')], [State('my_range_slider','value')], prevent_initial_call=True)
def update_graph(n_clicks, selected_column, bin_num, bin_edges):
  # Arguments arrive in Input/State order: button clicks, the dropdown column,
  # the histogram bin count, then the income boundaries held by the slider.
  # The slider has no value until inputs are added, so guard against None.
  if bin_edges and len(bin_edges) > 1 and sum(bin_edges) > 1:
    # np.digitize needs ordered edges; keep the result local so that
    # data_for_dash itself is not modified by the callback.
    income_bins = np.digitize(data_for_dash['Y_Income'], sorted(bin_edges))
    fig = go.Figure()
    for num in np.unique(income_bins):
      update_df = data_for_dash[income_bins == num]
      fig.add_trace(go.Histogram(x=update_df[selected_column], name=f'{num}', histnorm='probability', nbinsx=bin_num))

    return fig
  return dash.no_update


app.run_server(mode='inline')

In [None]:
data_for_dash

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!mkdir '/content/drive/My Drive/pycon-tutorial'

In [None]:
data_for_dash.to_csv('/content/drive/My Drive/pycon-tutorial/data_for_dash.csv')

In [None]:
test = dcc.RangeSlider(min=0, max=100, value=[0,100,200,300,400])

In [None]:
test.value

In [None]:
sum(test.value)

In [None]:
len(test.value)

In [None]:
if (len(test.value) > 4) & (sum(test.value) > 1):
  print("yeah")
else:
  print('nooo')