# Sandbox for Trying Out SQL

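## Introduction
- The database is PostgreSQL 13.
- Run the cell below first.
- Start a cell with %%sql to issue SQL from it.
- Table structures cannot be inspected with a describe command from Jupyter, so use a SELECT with a LIMIT or similar instead.
- You may also use an SQL client you are familiar with (connection details below).
  - IP address: localhost for Docker Desktop, 192.168.99.100 for Docker Toolbox
  - Port: 5432
  - Database name: dsdojo_db
  - User name: padawan
  - Password: padawan12345
- Large outputs can freeze Jupyter, so limiting the number of output rows is recommended (each exercise also states the number of rows to output).
    - Keeping the amount displayed for checking results under control, so that work stays responsive, is itself a skill that data wrangling requires.
- If a large result has been output, the file can become so heavy that it can no longer be opened.
    - In that case, fetch the file from GitHub again, although your work will be lost.
    - You can also delete the large output range with an editor such as vim.
- Names, addresses, and similar values are dummy data and do not refer to anything real.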

In [1]:
%load_ext sql
import os

pgconfig = {
    'host': 'db',
    'port': os.environ['PG_PORT'],
    'database': os.environ['PG_DATABASE'],
    'user': os.environ['PG_USER'],
    'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)

# Connection setup so that SQL can be written with the magic command
%sql $dsl

'Connected: padawan@dsdojo_db'

# Exercises

---
> Sample

In [4]:
%%sql
SELECT
    *
FROM reserve_tb
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price
r1,h_75,c_1,2016-03-06 13:09:42,2016-03-26,10:00:00,2016-03-29,4,97200
r2,h_219,c_1,2016-07-16 23:39:55,2016-07-20,11:30:00,2016-07-21,2,20600
r3,h_179,c_1,2016-09-24 10:03:17,2016-10-19,09:00:00,2016-10-22,2,33600
r4,h_214,c_1,2017-03-08 03:20:10,2017-03-29,11:00:00,2017-03-30,4,194400
r5,h_16,c_1,2017-09-05 19:50:37,2017-09-22,10:30:00,2017-09-23,3,68100
r6,h_241,c_1,2017-11-27 18:47:05,2017-12-04,12:00:00,2017-12-06,3,36000
r7,h_256,c_1,2017-12-29 10:38:36,2018-01-25,10:30:00,2018-01-28,1,103500
r8,h_241,c_1,2018-05-26 08:42:51,2018-06-08,10:00:00,2018-06-09,1,6000
r9,h_217,c_2,2016-03-05 13:31:06,2016-03-25,09:30:00,2016-03-27,3,68400
r10,h_240,c_2,2016-06-25 09:12:22,2016-07-14,11:00:00,2016-07-17,4,320400


In [6]:
%%sql
SELECT
    *
FROM hotel_tb
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


hotel_id,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
h_1,26100,D,D-2,43.06456855822263,141.51139666434983,True
h_2,26400,A,A-1,35.71531970202519,139.93944644588987,True
h_3,41300,E,E-4,35.28157168065808,136.98856536061078,False
h_4,5200,C,C-3,38.43129309709185,140.7956148443442,False
h_5,13500,G,G-3,33.597291462214194,130.5338715811148,True
h_6,49500,A,A-3,35.912763699782325,139.73128106597213,True
h_7,18900,C,C-2,38.32870158964997,140.89496934194355,False
h_8,12400,B,B-2,35.543318290612,139.7987370494236,False
h_9,31400,C,C-1,38.23267359673196,140.79569303443034,False
h_10,5600,A,A-3,35.91387423929255,139.93100303561386,False


### JOIN

- A plain JOIN (equivalent to INNER JOIN).
- INNER JOIN keeps only the rows that have a match in both tables.

In [30]:
%%sql
SELECT *
FROM reserve_tb
JOIN hotel_tb ON reserve_tb.hotel_id = hotel_tb.hotel_id
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price,hotel_id_1,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
r1,h_75,c_1,2016-03-06 13:09:42,2016-03-26,10:00:00,2016-03-29,4,97200,h_75,8100,B,B-2,35.54586020010928,139.70121711838777,False
r2,h_219,c_1,2016-07-16 23:39:55,2016-07-20,11:30:00,2016-07-21,2,20600,h_219,10300,B,B-3,35.64472936123362,139.6933889258853,True
r3,h_179,c_1,2016-09-24 10:03:17,2016-10-19,09:00:00,2016-10-22,2,33600,h_179,5600,G,G-4,33.599961904114444,130.63201906517207,False
r4,h_214,c_1,2017-03-08 03:20:10,2017-03-29,11:00:00,2017-03-30,4,194400,h_214,48600,C,C-2,38.33399424275985,140.79183603507371,False
r5,h_16,c_1,2017-09-05 19:50:37,2017-09-22,10:30:00,2017-09-23,3,68100,h_16,22700,A,A-3,35.911392792694706,139.93251080375714,False
r6,h_241,c_1,2017-11-27 18:47:05,2017-12-04,12:00:00,2017-12-06,3,36000,h_241,6000,A,A-1,35.81540930881771,139.8390455477878,False
r7,h_256,c_1,2017-12-29 10:38:36,2018-01-25,10:30:00,2018-01-28,1,103500,h_256,34500,C,C-1,38.23729380047828,140.69613145016714,True
r8,h_241,c_1,2018-05-26 08:42:51,2018-06-08,10:00:00,2018-06-09,1,6000,h_241,6000,A,A-1,35.81540930881771,139.8390455477878,False
r9,h_217,c_2,2016-03-05 13:31:06,2016-03-25,09:30:00,2016-03-27,3,68400,h_217,11400,B,B-2,35.54470322228935,139.79443971256103,True
r10,h_240,c_2,2016-06-25 09:12:22,2016-07-14,11:00:00,2016-07-17,4,320400,h_240,26700,C,C-2,38.330799991341095,140.79725835352326,False
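
The full tables are too large to see which rows an INNER JOIN drops, so here is a minimal sketch on toy data. It uses Python's standard `sqlite3` module with an in-memory database purely as a stand-in for the PostgreSQL sandbox; the tables `reserve_demo` and `hotel_demo` and their contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the PostgreSQL sandbox.
# Table names and rows are hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reserve_demo (reserve_id TEXT, hotel_id TEXT);
    CREATE TABLE hotel_demo (hotel_id TEXT, big_area_name TEXT);
    INSERT INTO reserve_demo VALUES ('r1', 'h_1'), ('r2', 'h_2'), ('r3', 'h_9');
    INSERT INTO hotel_demo VALUES ('h_1', 'D'), ('h_2', 'A'), ('h_3', 'E');
""")

# r3 points at a hotel that does not exist, and h_3 has no reservation,
# so neither appears in the INNER JOIN result.
inner_rows = conn.execute("""
    SELECT r.reserve_id, r.hotel_id, h.big_area_name
    FROM reserve_demo AS r
    JOIN hotel_demo AS h ON r.hotel_id = h.hotel_id
    ORDER BY r.reserve_id
""").fetchall()

print(inner_rows)
```

Only `r1` and `r2` come back: a row survives only when both sides have a matching `hotel_id`.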


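### When the JOIN condition includes a condition on the primary table

- Adding a condition on the primary table (reserve_tb, the table in the FROM clause) to the ON clause removes the rows that fail it from the result.
  - In the example below, the primary table is limited to rows with a checkout_date in January 2016.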

In [29]:
%%sql
SELECT *
FROM reserve_tb
JOIN hotel_tb ON reserve_tb.hotel_id = hotel_tb.hotel_id
    AND reserve_tb.checkout_date BETWEEN '2016-01-01' AND '2016-01-31'
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price,hotel_id_1,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
r257,h_20,c_60,2016-01-04 06:14:07,2016-01-15,09:00:00,2016-01-17,4,56800,h_20,7100,C,C-3,38.43569663944218,140.69932577987717,False
r261,h_5,c_62,2016-01-20 06:57:16,2016-01-30,11:30:00,2016-01-31,3,40500,h_5,13500,G,G-3,33.597291462214194,130.5338715811148,True
r311,h_157,c_74,2016-01-19 14:03:43,2016-01-19,11:30:00,2016-01-20,2,69200,h_157,34600,A,A-3,35.91639535520708,139.93984510223416,False
r324,h_173,c_76,2016-01-19 19:14:28,2016-01-23,09:00:00,2016-01-26,3,360000,h_173,40000,G,G-4,33.600076015265074,130.6364420096841,True
r422,h_120,c_98,2016-01-12 17:49:41,2016-01-16,09:30:00,2016-01-19,1,130200,h_120,43400,G,G-3,33.59087482943807,130.53113753827935,True
r458,h_172,c_106,2016-01-20 23:34:39,2016-01-23,12:30:00,2016-01-25,2,64000,h_172,16000,F,F-1,34.532760079912705,132.46614373741406,False
r479,h_179,c_110,2016-01-02 05:24:11,2016-01-21,10:30:00,2016-01-24,3,50400,h_179,5600,G,G-4,33.599961904114444,130.63201906517207,False
r596,h_222,c_145,2016-01-12 19:52:43,2016-01-21,10:30:00,2016-01-23,2,80000,h_222,20000,C,C-1,38.2293701817522,140.89767922197402,True
r616,h_264,c_149,2016-01-11 10:46:10,2016-01-22,10:00:00,2016-01-23,4,52400,h_264,13100,E,E-2,35.18762427290423,136.98049277390308,True
r624,h_243,c_151,2016-01-03 03:51:39,2016-01-21,11:30:00,2016-01-23,3,166200,h_243,27700,A,A-1,35.71235675523694,139.93903193206336,True


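### When the LEFT JOIN condition includes a condition on the primary table

- What happens if this is changed to a LEFT JOIN?
    - A LEFT JOIN never drops rows from the primary table, whatever the ON clause says.
    - So rows that fail the primary-table condition stay in the result, with every column from the secondary table set to NULL.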

In [31]:
%%sql
SELECT *
FROM reserve_tb
LEFT JOIN hotel_tb ON reserve_tb.hotel_id = hotel_tb.hotel_id
    AND reserve_tb.checkout_date BETWEEN '2016-01-01' AND '2016-01-31'
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price,hotel_id_1,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
r1,h_75,c_1,2016-03-06 13:09:42,2016-03-26,10:00:00,2016-03-29,4,97200,,,,,,,
r2,h_219,c_1,2016-07-16 23:39:55,2016-07-20,11:30:00,2016-07-21,2,20600,,,,,,,
r3,h_179,c_1,2016-09-24 10:03:17,2016-10-19,09:00:00,2016-10-22,2,33600,,,,,,,
r4,h_214,c_1,2017-03-08 03:20:10,2017-03-29,11:00:00,2017-03-30,4,194400,,,,,,,
r5,h_16,c_1,2017-09-05 19:50:37,2017-09-22,10:30:00,2017-09-23,3,68100,,,,,,,
r6,h_241,c_1,2017-11-27 18:47:05,2017-12-04,12:00:00,2017-12-06,3,36000,,,,,,,
r7,h_256,c_1,2017-12-29 10:38:36,2018-01-25,10:30:00,2018-01-28,1,103500,,,,,,,
r8,h_241,c_1,2018-05-26 08:42:51,2018-06-08,10:00:00,2018-06-09,1,6000,,,,,,,
r9,h_217,c_2,2016-03-05 13:31:06,2016-03-25,09:30:00,2016-03-27,3,68400,,,,,,,
r10,h_240,c_2,2016-06-25 09:12:22,2016-07-14,11:00:00,2016-07-17,4,320400,,,,,,,


- To deal with this, adding a WHERE clause works, but it is probably better to change the LEFT JOIN to a JOIN (INNER JOIN) in the first place.

In [32]:
%%sql
SELECT *
FROM reserve_tb
LEFT JOIN hotel_tb ON reserve_tb.hotel_id = hotel_tb.hotel_id
    AND reserve_tb.checkout_date BETWEEN '2016-01-01' AND '2016-01-31'
WHERE reserve_tb.checkout_date BETWEEN '2016-01-01' AND '2016-01-31'
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price,hotel_id_1,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
r257,h_20,c_60,2016-01-04 06:14:07,2016-01-15,09:00:00,2016-01-17,4,56800,h_20,7100,C,C-3,38.43569663944218,140.69932577987717,False
r261,h_5,c_62,2016-01-20 06:57:16,2016-01-30,11:30:00,2016-01-31,3,40500,h_5,13500,G,G-3,33.597291462214194,130.5338715811148,True
r311,h_157,c_74,2016-01-19 14:03:43,2016-01-19,11:30:00,2016-01-20,2,69200,h_157,34600,A,A-3,35.91639535520708,139.93984510223416,False
r324,h_173,c_76,2016-01-19 19:14:28,2016-01-23,09:00:00,2016-01-26,3,360000,h_173,40000,G,G-4,33.600076015265074,130.6364420096841,True
r422,h_120,c_98,2016-01-12 17:49:41,2016-01-16,09:30:00,2016-01-19,1,130200,h_120,43400,G,G-3,33.59087482943807,130.53113753827935,True
r458,h_172,c_106,2016-01-20 23:34:39,2016-01-23,12:30:00,2016-01-25,2,64000,h_172,16000,F,F-1,34.532760079912705,132.46614373741406,False
r479,h_179,c_110,2016-01-02 05:24:11,2016-01-21,10:30:00,2016-01-24,3,50400,h_179,5600,G,G-4,33.599961904114444,130.63201906517207,False
r596,h_222,c_145,2016-01-12 19:52:43,2016-01-21,10:30:00,2016-01-23,2,80000,h_222,20000,C,C-1,38.2293701817522,140.89767922197402,True
r616,h_264,c_149,2016-01-11 10:46:10,2016-01-22,10:00:00,2016-01-23,4,52400,h_264,13100,E,E-2,35.18762427290423,136.98049277390308,True
r624,h_243,c_151,2016-01-03 03:51:39,2016-01-21,11:30:00,2016-01-23,3,166200,h_243,27700,A,A-1,35.71235675523694,139.93903193206336,True
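
The three variants in this section (LEFT JOIN with the condition only in ON, LEFT JOIN plus WHERE, and INNER JOIN) can be compared side by side on toy data. This is a sketch under assumptions: it runs on an in-memory SQLite database through Python's `sqlite3` module rather than on the PostgreSQL sandbox, dates are stored as ISO-formatted text, and `reserve_demo`, `hotel_demo`, and the helper `run` are hypothetical names.

```python
import sqlite3

# In-memory SQLite stand-in for the sandbox; all names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reserve_demo (reserve_id TEXT, hotel_id TEXT, checkout_date TEXT);
    CREATE TABLE hotel_demo (hotel_id TEXT, base_price INTEGER);
    INSERT INTO reserve_demo VALUES
        ('r1', 'h_1', '2016-01-17'),
        ('r2', 'h_2', '2016-03-29'),
        ('r3', 'h_1', '2016-01-31');
    INSERT INTO hotel_demo VALUES ('h_1', 7100), ('h_2', 8100);
""")

# Condition on the primary table, reused in ON and (optionally) in WHERE.
JANUARY = "r.checkout_date BETWEEN '2016-01-01' AND '2016-01-31'"


def run(join_type, where=""):
    """Run the same query with a different join type and optional WHERE clause."""
    sql = f"""
        SELECT r.reserve_id, h.base_price
        FROM reserve_demo AS r
        {join_type} hotel_demo AS h
            ON r.hotel_id = h.hotel_id AND {JANUARY}
        {where}
        ORDER BY r.reserve_id
    """
    return conn.execute(sql).fetchall()


left_on_only = run("LEFT JOIN")                        # keeps r2 with NULL hotel columns
left_with_where = run("LEFT JOIN", f"WHERE {JANUARY}")  # r2 filtered out afterwards
inner_on_only = run("JOIN")                            # r2 never joined, so dropped

print(left_on_only)
print(left_with_where)
print(inner_on_only)
```

On this data the last two queries return the same rows, which is why switching to an INNER JOIN is the simpler fix. They would differ only if a January reservation referred to a hotel missing from the secondary table.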


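### When the JOIN condition includes a condition on the secondary table

- This case may be easier to picture.
- In short, rows that do not match the condition on the secondary table (hotel_tb) have no join partner, and because this is an INNER JOIN they are removed.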

In [37]:
%%sql
SELECT *
FROM reserve_tb
JOIN hotel_tb ON reserve_tb.hotel_id = hotel_tb.hotel_id
    AND hotel_tb.big_area_name = 'D'
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price,hotel_id_1,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
r20,h_292,c_3,2017-02-23 07:10:30,2017-03-03,11:00:00,2017-03-04,2,18200,h_292,9100,D,D-1,43.05786893483166,141.40998159306062,True
r28,h_119,c_4,2016-10-07 04:38:54,2016-11-04,10:00:00,2016-11-06,4,52800,h_119,6600,D,D-1,43.06207503404144,141.4120219592387,False
r65,h_221,c_11,2016-01-24 18:58:29,2016-02-19,12:30:00,2016-02-22,2,33000,h_221,5500,D,D-3,43.159853379903545,141.40684391661392,False
r79,h_218,c_13,2017-03-02 05:54:51,2017-03-19,10:30:00,2017-03-22,3,189000,h_218,21000,D,D-4,43.15851330740281,141.5063404776606,True
r96,h_127,c_17,2016-06-08 22:39:12,2016-07-08,10:00:00,2016-07-09,4,168000,h_127,42000,D,D-2,43.056172375637054,141.50815533916463,True
r117,h_221,c_22,2019-02-25 03:10:39,2019-03-13,11:30:00,2019-03-15,1,11000,h_221,5500,D,D-3,43.159853379903545,141.40684391661392,False
r135,h_91,c_27,2016-12-16 10:42:36,2017-01-04,10:30:00,2017-01-06,2,184000,h_91,46000,D,D-3,43.16366397868197,141.4158682781267,False
r140,h_221,c_28,2016-03-13 23:16:53,2016-03-16,09:30:00,2016-03-18,4,44000,h_221,5500,D,D-3,43.159853379903545,141.40684391661392,False
r156,h_40,c_34,2016-06-06 20:04:33,2016-06-30,12:30:00,2016-07-01,3,26700,h_40,8900,D,D-3,43.16539930665659,141.41334946543185,False
r164,h_40,c_36,2016-12-06 05:50:33,2016-12-30,10:30:00,2017-01-01,4,71200,h_40,8900,D,D-3,43.16539930665659,141.41334946543185,False


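### When the LEFT JOIN condition includes a condition on the secondary table

- Here too, the rows are displayed including those with missing (NULL) values.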

In [41]:
%%sql
SELECT *
FROM reserve_tb
LEFT JOIN hotel_tb ON reserve_tb.hotel_id = hotel_tb.hotel_id
    AND hotel_tb.big_area_name = 'D'
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price,hotel_id_1,base_price,big_area_name,small_area_name,hotel_latitude,hotel_longitude,is_business
r1,h_75,c_1,2016-03-06 13:09:42,2016-03-26,10:00:00,2016-03-29,4,97200,,,,,,,
r2,h_219,c_1,2016-07-16 23:39:55,2016-07-20,11:30:00,2016-07-21,2,20600,,,,,,,
r3,h_179,c_1,2016-09-24 10:03:17,2016-10-19,09:00:00,2016-10-22,2,33600,,,,,,,
r4,h_214,c_1,2017-03-08 03:20:10,2017-03-29,11:00:00,2017-03-30,4,194400,,,,,,,
r5,h_16,c_1,2017-09-05 19:50:37,2017-09-22,10:30:00,2017-09-23,3,68100,,,,,,,
r6,h_241,c_1,2017-11-27 18:47:05,2017-12-04,12:00:00,2017-12-06,3,36000,,,,,,,
r7,h_256,c_1,2017-12-29 10:38:36,2018-01-25,10:30:00,2018-01-28,1,103500,,,,,,,
r8,h_241,c_1,2018-05-26 08:42:51,2018-06-08,10:00:00,2018-06-09,1,6000,,,,,,,
r9,h_217,c_2,2016-03-05 13:31:06,2016-03-25,09:30:00,2016-03-27,3,68400,,,,,,,
r10,h_240,c_2,2016-06-25 09:12:22,2016-07-14,11:00:00,2016-07-17,4,320400,,,,,,,
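
A related point that the output above hints at: under a LEFT JOIN, it matters whether the secondary-table condition sits in ON or in WHERE. The sketch below illustrates this on toy data, again using an in-memory SQLite database via Python's `sqlite3` as a stand-in for the sandbox; the table names and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the sandbox; all names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reserve_demo (reserve_id TEXT, hotel_id TEXT);
    CREATE TABLE hotel_demo (hotel_id TEXT, big_area_name TEXT);
    INSERT INTO reserve_demo VALUES ('r1', 'h_1'), ('r2', 'h_2'), ('r3', 'h_3');
    INSERT INTO hotel_demo VALUES ('h_1', 'D'), ('h_2', 'A'), ('h_3', 'D');
""")

# Condition in ON: every reservation is kept, non-'D' hotels become NULL.
left_on = conn.execute("""
    SELECT r.reserve_id, h.big_area_name
    FROM reserve_demo AS r
    LEFT JOIN hotel_demo AS h
        ON r.hotel_id = h.hotel_id AND h.big_area_name = 'D'
    ORDER BY r.reserve_id
""").fetchall()

# Condition in WHERE: applied after the join, so the non-'D' row is removed.
left_where = conn.execute("""
    SELECT r.reserve_id, h.big_area_name
    FROM reserve_demo AS r
    LEFT JOIN hotel_demo AS h ON r.hotel_id = h.hotel_id
    WHERE h.big_area_name = 'D'
    ORDER BY r.reserve_id
""").fetchall()

print(left_on)
print(left_where)
```

With the condition in WHERE, the LEFT JOIN ends up returning the same rows as an INNER JOIN on this data.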


### WHERE, OR, IN, and so on
- IN works much like Python's `in`, so it should cause no trouble.

In [47]:
%%sql
SELECT *
FROM reserve_tb
WHERE reserve_tb.people_num = 4 OR ( reserve_tb.customer_id IN ('c_1', 'c_4') )
LIMIT 10

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price
r1,h_75,c_1,2016-03-06 13:09:42,2016-03-26,10:00:00,2016-03-29,4,97200
r2,h_219,c_1,2016-07-16 23:39:55,2016-07-20,11:30:00,2016-07-21,2,20600
r3,h_179,c_1,2016-09-24 10:03:17,2016-10-19,09:00:00,2016-10-22,2,33600
r4,h_214,c_1,2017-03-08 03:20:10,2017-03-29,11:00:00,2017-03-30,4,194400
r5,h_16,c_1,2017-09-05 19:50:37,2017-09-22,10:30:00,2017-09-23,3,68100
r6,h_241,c_1,2017-11-27 18:47:05,2017-12-04,12:00:00,2017-12-06,3,36000
r7,h_256,c_1,2017-12-29 10:38:36,2018-01-25,10:30:00,2018-01-28,1,103500
r8,h_241,c_1,2018-05-26 08:42:51,2018-06-08,10:00:00,2018-06-09,1,6000
r10,h_240,c_2,2016-06-25 09:12:22,2016-07-14,11:00:00,2016-07-17,4,320400
r12,h_268,c_2,2017-05-24 10:06:21,2017-06-20,09:00:00,2017-06-21,4,81600
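
To make the comparison with Python concrete, the sketch below applies the same filter twice: once as SQL with `OR` and `IN`, and once as a Python list comprehension with `or` and `in`. It assumes an in-memory SQLite database (Python's `sqlite3`) in place of the sandbox, and `reserve_demo` with its rows is hypothetical toy data.

```python
import sqlite3

# Hypothetical toy rows: (reserve_id, customer_id, people_num).
rows = [
    ("r1", "c_1", 4),
    ("r2", "c_1", 2),
    ("r9", "c_2", 3),
    ("r10", "c_2", 4),
    ("r28", "c_4", 4),
    ("r30", "c_5", 1),
]

# In-memory SQLite stand-in for the PostgreSQL sandbox.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reserve_demo (reserve_id TEXT, customer_id TEXT, people_num INTEGER)"
)
conn.executemany("INSERT INTO reserve_demo VALUES (?, ?, ?)", rows)

# SQL version: OR combined with IN, returned in insertion order.
sql_ids = [
    row[0]
    for row in conn.execute("""
        SELECT reserve_id
        FROM reserve_demo
        WHERE people_num = 4 OR (customer_id IN ('c_1', 'c_4'))
        ORDER BY rowid
    """)
]

# Python version of the same predicate.
py_ids = [rid for rid, cid, num in rows if num == 4 or cid in ("c_1", "c_4")]

print(sql_ids)
print(py_ids)
```

Both lists contain the same reservation IDs, in the same order.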


### UNION

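- UNION simply stacks result sets vertically.
- The queries being combined must therefore return the same columns (the same number of columns, with compatible types).
- There are two forms, UNION and UNION ALL; UNION removes duplicate rows, which makes it more expensive to compute.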

In [52]:
%%sql
SELECT *
FROM reserve_tb
WHERE reserve_tb.people_num = 4
UNION ALL
SELECT *
FROM reserve_tb
WHERE reserve_tb.people_num = 3
LIMIT 20

 * postgresql://padawan:***@db:5432/dsdojo_db
20 rows affected.


reserve_id,hotel_id,customer_id,reserve_datetime,checkin_date,checkin_time,checkout_date,people_num,total_price
r1,h_75,c_1,2016-03-06 13:09:42,2016-03-26,10:00:00,2016-03-29,4,97200
r4,h_214,c_1,2017-03-08 03:20:10,2017-03-29,11:00:00,2017-03-30,4,194400
r10,h_240,c_2,2016-06-25 09:12:22,2016-07-14,11:00:00,2016-07-17,4,320400
r12,h_268,c_2,2017-05-24 10:06:21,2017-06-20,09:00:00,2017-06-21,4,81600
r16,h_135,c_2,2018-07-06 04:18:28,2018-07-08,10:00:00,2018-07-09,4,46400
r22,h_12,c_3,2017-07-24 19:15:54,2017-08-08,09:00:00,2017-08-09,4,26800
r24,h_34,c_3,2018-04-27 08:51:07,2018-05-07,09:30:00,2018-05-10,4,102000
r28,h_119,c_4,2016-10-07 04:38:54,2016-11-04,10:00:00,2016-11-06,4,52800
r29,h_222,c_4,2016-11-10 21:59:02,2016-11-13,12:30:00,2016-11-16,4,240000
r31,h_143,c_4,2017-08-17 03:16:51,2017-09-14,10:30:00,2017-09-15,4,79200


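- In the example above, putting several conditions in a single WHERE clause would do the job.
- In practice, UNION is useful for combining results that were joined under different rules, or in other cases that WHERE alone cannot handle.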
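
The difference between UNION and UNION ALL described in this section is easiest to see on data that actually contains duplicates. The following is a minimal sketch on an in-memory SQLite database (Python's `sqlite3`), used as a stand-in for the sandbox; `reserve_demo` and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the sandbox; table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reserve_demo (customer_id TEXT, people_num INTEGER);
    INSERT INTO reserve_demo VALUES
        ('c_1', 4), ('c_1', 4), ('c_2', 3), ('c_2', 4);
""")

QUERY = """
    SELECT customer_id FROM reserve_demo WHERE people_num = 4
    {operator}
    SELECT customer_id FROM reserve_demo WHERE people_num = 3
"""

# UNION ALL keeps every row from both branches, duplicates included.
union_all_rows = conn.execute(QUERY.format(operator="UNION ALL")).fetchall()

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(QUERY.format(operator="UNION")).fetchall()

# Row order is not guaranteed without ORDER BY, so sort before printing.
print(sorted(union_all_rows))
print(sorted(union_rows))
```

UNION ALL returns four rows here, while UNION collapses them to one row per customer.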

### DISTINCT

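- Written immediately after SELECT, it removes duplicate rows from the result.
- Duplicates are judged only on the selected columns.
  - So it does not deduplicate the table itself; it removes duplicates within the set of columns being retrieved.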

In [62]:
%%sql
SELECT DISTINCT customer_id, people_num FROM reserve_tb

 * postgresql://padawan:***@db:5432/dsdojo_db
2366 rows affected.


customer_id,people_num
c_568,4
c_796,2
c_332,3
c_440,1
c_848,1
c_378,4
c_83,2
c_108,3
c_105,3
c_726,1
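
How the selected columns determine what counts as a duplicate can be sketched on a few rows. As an assumption for illustration, this uses an in-memory SQLite database through Python's `sqlite3` instead of the sandbox, and `reserve_demo` with its contents is hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the sandbox; table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reserve_demo (reserve_id TEXT, customer_id TEXT, people_num INTEGER);
    INSERT INTO reserve_demo VALUES
        ('r1', 'c_1', 4),
        ('r2', 'c_1', 4),
        ('r3', 'c_1', 2),
        ('r4', 'c_2', 4);
""")

# Uniqueness is judged on the pair (customer_id, people_num).
distinct_pairs = conn.execute("""
    SELECT DISTINCT customer_id, people_num
    FROM reserve_demo
    ORDER BY customer_id, people_num
""").fetchall()

# With only customer_id selected, more rows collapse together.
distinct_customers = conn.execute("""
    SELECT DISTINCT customer_id
    FROM reserve_demo
    ORDER BY customer_id
""").fetchall()

# The table itself is untouched and still holds all four rows.
total_rows = conn.execute("SELECT COUNT(*) FROM reserve_demo").fetchone()[0]

print(distinct_pairs)
print(distinct_customers)
print(total_rows)
```

The four stored rows yield three distinct pairs but only two distinct customers, and the table keeps all four rows either way.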
