<table width="100%" style="background-color:#caf0fa">
    <tr style="background-color:#caf0fa">
        <td>
            <h1 style="text-align:right">
                Python for Data Science Training - Week 4
            </h1>
        </td>
        <td>
            <img src="../img/jica-logo.png" alt = "JICA Training" style = "width: 100px;"/>
        </td>
    </tr>
</table>

# Today's Contents
1. Matplotlib
2. Seaborn

---

# 1. Matplotlib
Matplotlib is the most widely used visualization library in Python. Both the pandas `.plot()` methods and Seaborn are built on top of it.

### Creating a Japanese language environment for matplotlib
By default, Matplotlib garbles Japanese text when rendering it. You need to change Matplotlib's font settings as shown below.  
Reference 1: [Qiita: Japanese in matplotlib](https://qiita.com/yniji/items/3fac25c2ffa316990d0c)  
Reference 2: [ESRI Japan: Using Japanese with matplotlib](https://esrijapan.github.io/arcgis-dev-resources/tips/python/matplotlib-japanese/)

In [None]:
import matplotlib.font_manager as fm
# List the font files installed on this system
fm_list = fm.findSystemFonts()
fm_list

In [None]:
import matplotlib.pyplot as plt
from matplotlib import rcParams
# Matplotlib uses the first font in this list that is installed on the system
rcParams["font.family"] = "sans-serif"
rcParams['font.sans-serif'] = ['Hiragino Maru Gothic Pro', 'Yu Gothic', 'Meiryo', 'Takao', 'IPAexGothic', 'IPAPGothic', 'VL PGothic', 'Noto Sans CJK JP']
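
Matplotlib falls back through the `font.sans-serif` list in order, so it helps to know which candidates are actually installed. The following is a minimal sketch of such a check; the names `candidates`, `installed`, and `available` are illustrative, and the font list should be adjusted to your system.

```python
import matplotlib.font_manager as fm

# Candidate Japanese font families (adjust to your system)
candidates = ['Hiragino Maru Gothic Pro', 'Yu Gothic', 'Meiryo', 'Takao',
              'IPAexGothic', 'IPAPGothic', 'VL PGothic', 'Noto Sans CJK JP']

# Family names of every font Matplotlib has registered on this machine
installed = {font.name for font in fm.fontManager.ttflist}

# Keep only the candidates that are installed, preserving priority order
available = [name for name in candidates if name in installed]
print(available)
```

If the printed list is empty, none of the candidates is installed and Japanese labels will still be garbled until one of them is added.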

## Preprocessing

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Lines starting with % are IPython "magic commands"; this one displays plots inline in the notebook.

In [None]:
# Population Census: population by sex, national and by prefecture (1920 to 2015)
url = 'https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1'
df = pd.read_csv(url, encoding="shift-jis")
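
The `encoding` argument matters because this file is not UTF-8. As a small sketch of what it does, the snippet below builds a tiny Shift-JIS encoded CSV in memory and reads it back; the two-column table and the name `df_demo` are made up for illustration and are not part of the census file.

```python
import io

import pandas as pd

# A made-up two-line CSV, encoded as Shift-JIS bytes (illustration only)
raw = "都道府県名,人口\n東京都,100\n".encode("shift_jis")

# Telling pandas the encoding lets it decode the Japanese header and values
df_demo = pd.read_csv(io.BytesIO(raw), encoding="shift_jis")
print(df_demo)
```

Reading the same bytes without `encoding` would make pandas assume UTF-8, which fails on Shift-JIS data.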

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.tail()

In [None]:
df = df.iloc[:-2, :].reset_index(drop = True)

In [None]:
df.tail()

In [None]:
# Drop the era name, Japanese-calendar year, and notes columns
df = df.drop(columns = ['元号', '和暦（年）', '注'])

In [None]:
# Change column names
df.columns = ['code', 'pref', 'year', 'pop_total', 'pop_men', 'pop_women']

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
# Change data type of year, pop_total, pop_men, pop_women
df['year']  = df['year'].astype(int)

# The population columns contain non-numeric strings, so .astype(int) would raise an error.
# pd.to_numeric with errors='coerce' converts those entries to NaN instead.
df['pop_total'] = pd.to_numeric(df['pop_total'], errors = 'coerce')
df['pop_men'] = pd.to_numeric(df['pop_men'], errors = 'coerce')
df['pop_women'] = pd.to_numeric(df['pop_women'], errors = 'coerce')

# drop null values
df = df.dropna().reset_index(drop = True)
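
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a made-up Series (the values are placeholders, not census data):

```python
import pandas as pd

# Placeholder strings: two clean numbers and two entries that cannot be parsed
s = pd.Series(['100', '250', '-', '1,000'])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(s, errors='coerce')
print(converted)

# dropna() then removes the rows that failed to convert
print(converted.dropna())
```

Note that the result is a float column, because NaN cannot be stored in a plain integer column. This is why the notebook converts back to integers later.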

In [None]:
for col in ['pref', 'year']:
    print('\n' + col + ': ' , df[col].unique())

In [None]:
# Select national stats
df_japan = df[df['pref'] == '全国'].reset_index(drop = True)

# Select only prefecture
cond1 = df['pref'] != '全国'
cond2 = df['pref'] != '人口集中地区'
cond3 = df['pref'] != '人口集中地区以外の地区'
conds = cond1 & cond2 & cond3
df_pref = df[conds].reset_index(drop = True)
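
When several values need to be excluded, chaining `!=` conditions gets long. One alternative is `isin` with negation, sketched here on a made-up DataFrame; `df_demo` and `aggregates` are illustrative names and the population values are placeholders.

```python
import pandas as pd

# Placeholder rows mixing aggregate labels and prefectures
df_demo = pd.DataFrame({
    'pref': ['全国', '北海道', '人口集中地区', '東京都', '人口集中地区以外の地区'],
    'pop_total': [5, 1, 3, 2, 2],
})

# Labels that are aggregates rather than prefectures
aggregates = ['全国', '人口集中地区', '人口集中地区以外の地区']

# ~ negates the mask: keep rows whose 'pref' is NOT in the list
df_demo_pref = df_demo[~df_demo['pref'].isin(aggregates)].reset_index(drop=True)
print(df_demo_pref)
```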

In [None]:
df_japan.head()

In [None]:
df_pref.head()

In [None]:
print(df_japan['pref'].unique())
print(df_pref['pref'].unique())

In [None]:
df_japan.dtypes

In [None]:
# Convert float to integer on pop columns
for col in ['pop_total', 'pop_men', 'pop_women']:
    df_japan[col] = df_japan[col].astype(int)
    df_pref[col] = df_pref[col].astype(int)

In [None]:
df_japan.dtypes

In [None]:
df_pref.dtypes

Finished preprocessing!!

## Quick overview of the data

In [None]:
df_japan.describe()

In [None]:
# Plot the graph
df_japan.plot(x = 'year')

In [None]:
# Add some labels
df_japan.plot(x = 'year')
plt.title('Japan\'s population')
plt.xlabel('Survey year')
plt.ylabel('Population')
plt.legend()

In [None]:
# Add columns with the population expressed in millions
df_japan['pop_total_m'] = df_japan['pop_total']/1000000
df_japan['pop_men_m'] = df_japan['pop_men']/1000000
df_japan['pop_women_m'] = df_japan['pop_women']/1000000

In [None]:
# Plot the population columns expressed in millions
df_japan.plot(x = 'year', y = ['pop_total_m', 'pop_men_m', 'pop_women_m'])
plt.title('Japan\'s population')
plt.xlabel('Survey year')
plt.ylabel('Population (million)')
plt.legend()

In [None]:
# Check the range of years and of total population (useful for choosing tick positions)
print('year min {}, max {}'.format(df_japan['year'].min(), df_japan['year'].max()))
print('population min {}, max {}'.format(df_japan['pop_total_m'].min(), df_japan['pop_total_m'].max()))

In [None]:
# Set line colors, figure size, tick positions, and legend labels
df_japan.plot(x = 'year', y = ['pop_total_m', 'pop_men_m', 'pop_women_m'],
              color = ['black', 'blue', 'red'], figsize = (12, 6))
plt.title('Japan\'s population')
plt.xlabel('Survey year')
plt.ylabel('Population (million)')
plt.xticks(range(1920, 2020, 5))  # the stop value is exclusive, so 2020 keeps the 2015 tick
plt.yticks(range(0, 150, 20))
plt.legend(labels = ['Total Population', 'Population (Men)', 'Population (Women)'])
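
The `plt.*` calls above act implicitly on the current axes. The same kind of customization can also be written with Matplotlib's explicit `Axes` interface. The following is a sketch on a small made-up DataFrame; the values in `demo` are placeholders, not census figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only (not real census data)
demo = pd.DataFrame({'year': [2000, 2005, 2010, 2015],
                     'value': [1.0, 2.0, 3.0, 2.5]})

# plt.subplots returns the figure and an Axes object to customize directly
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(demo['year'], demo['value'], color='black', label='Total')
ax.set_title('Demo series')
ax.set_xlabel('Survey year')
ax.set_ylabel('Value')
ax.set_xticks(range(2000, 2020, 5))  # stop is exclusive: ticks at 2000 to 2015
ax.legend()
```

The explicit form is handy once a figure has more than one subplot, since each `Axes` can be styled independently.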

In [None]:
# bar plot
df_japan.plot.bar(x = 'year', y = 'pop_total_m')
plt.title('Japan\'s Total Population')
plt.xlabel('Survey Year')
plt.ylabel('Population (Million)')

In [None]:
# Grouped bar plot by gender
df_japan.plot.bar(x = 'year', y = ['pop_men_m', 'pop_women_m'], figsize = (10, 5), color = ['blue', 'red'])
plt.title('Japan\'s population by gender')
plt.xlabel('Survey Year')
plt.ylabel('Population (Million)')
plt.legend(labels = ['Population (Men)', 'Population (Women)'])

### Now let's look at the prefectures

In [None]:
df_pref.columns

In [None]:
df_pref[['pop_total', 'pop_men', 'pop_women']].describe().T[['min', 'max']]

In [None]:
# Add columns with the population in thousands (floor division drops the remainder)
for col in ['pop_total', 'pop_men', 'pop_women']:
    df_pref[col + '_t'] = df_pref[col] // 1000

In [None]:
df_pref.head()

In [None]:
# Show histogram
df_pref.hist(column = ['pop_total_t'], bins  = 10)

In [None]:
# Show boxplot
df_pref.boxplot(column = ['pop_total_t'], by = 'pref', figsize = (15, 7))
plt.title('Box plot of population by prefecture')
plt.xticks(rotation = 90);

Next, let's look at the Kanto region.

In [None]:
df_pref['pref'].unique()

In [None]:
# Select prefectures in Kanto region
kanto = ['茨城県', '栃木県', '群馬県', '埼玉県', '千葉県', '東京都', '神奈川県']
df_kanto = df_pref.query("pref == @kanto").reset_index(drop = True)
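
Inside a `query` string, the `@` prefix refers to a Python variable defined outside the DataFrame. A minimal sketch on made-up data, using the more explicit `in` and `not in` forms; `demo` and `targets` are illustrative names and the values are placeholders.

```python
import pandas as pd

# Placeholder data for illustration only
demo = pd.DataFrame({'pref': ['東京都', '大阪府', '千葉県', '福岡県'],
                     'pop_total_t': [1, 2, 3, 4]})
targets = ['東京都', '千葉県']

# @targets refers to the Python list defined above
selected = demo.query("pref in @targets").reset_index(drop=True)
others = demo.query("pref not in @targets").reset_index(drop=True)

print(selected)
print(others)
```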

In [None]:
df_kanto.head()

In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize = (15, 5))
sns.violinplot(x = 'pref', y = 'pop_total_t', data = df_kanto, palette = 'tab20c')

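It would be more interesting to see the population distribution by gender for each prefecture. To do this, we need to reshape the DataFrame into long form, with one row per observation and a column that identifies the gender, so that Seaborn can split the plot by that column.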

In [None]:
# "Melt" the data from wide form to long form.
# var_name must be a scalar string; recent pandas versions reject a list here.
df_kanto_long = df_kanto.melt(id_vars = 'pref', value_vars = ['pop_men_t', 'pop_women_t'],
                              var_name = 'gender', value_name = 'population')
df_kanto_long = df_kanto_long.replace({'pop_men_t':'Men', 'pop_women_t':'Women'})
df_kanto_long.head()
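
To make the reshaping concrete, here is a minimal sketch of `melt` on a tiny made-up table. The prefecture labels `A` and `B` and all numbers are placeholders.

```python
import pandas as pd

# Wide form: one row per prefecture, one column per gender (placeholder values)
wide = pd.DataFrame({'pref': ['A', 'B'],
                     'pop_men_t': [10, 20],
                     'pop_women_t': [11, 21]})

# Long form: one row per (prefecture, gender) pair
long = wide.melt(id_vars='pref', value_vars=['pop_men_t', 'pop_women_t'],
                 var_name='gender', value_name='population')
print(long)
```

The two value columns are stacked into a single `population` column, and the former column names end up in `gender`, which is the shape Seaborn expects for its `hue` argument.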

In [None]:
# Finally, show the violin plot split by gender
plt.figure(figsize = (15, 5))
sns.violinplot(x = 'pref', y = 'population', hue = 'gender', data = df_kanto_long, palette = ['#E8F1F2','#B3EFB2'])
plt.xlabel('Kanto region')
plt.ylabel('Population (thousands)');

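That's it! Good work. Matplotlib and Seaborn can do a great deal, so please browse their galleries. As an example, a figure I made previously shows where and how spatial clustering occurs in Mumbai (see below). This was also made with Matplotlib. (How to read the figure: the center point is the project intervention site (road improvement), the outer ring shows the compass direction, the guide lines show the distance from the road, and the red dots mark where spatial clustering occurs.)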

<img src="img/Mumbai_horizontal_clustering.png" alt = "Mumbai Horizontal Clustering" style = "width: 300px;"/>

The quickest way to learn visualization is to imitate samples and examples, then gradually add your own touches.<br>
For Matplotlib samples, see the [gallery](https://matplotlib.org/stable/gallery/index.html).<br>
For Seaborn samples, see the [gallery](https://seaborn.pydata.org/examples/index.html).<br>

For figure colors, I recommend [Coolors](https://coolors.co/). I think the quickest approach is to build a color palette in Coolors and copy the HEX codes from it.
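
HEX codes copied from a palette tool can be passed to Matplotlib or Seaborn as a plain list of strings. A small sketch that checks the codes are valid before use; the two codes are the ones used in the violin plot earlier, and the name `palette` is illustrative.

```python
from matplotlib.colors import is_color_like, to_rgb

# HEX codes as copied from a palette tool
palette = ['#E8F1F2', '#B3EFB2']

for code in palette:
    # is_color_like catches typos such as a missing digit
    if not is_color_like(code):
        raise ValueError(f'Not a valid color: {code}')
    # to_rgb shows the (red, green, blue) values Matplotlib will use
    print(code, to_rgb(code))
```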