In [2]:
system"cd ",getenv[`HOME],"/course-introductory-workshop"
.trn.nbdir:system"cd"
\l scripts/loaddata.q

"Initializing variables"
"Loaded Weather CSV"
"Loaded Taxi Trips partitioned DB"
"Defining exercise results"
"Ready"


**Learning objectives**
* How to apply a left join
* How to apply an as-of join

# Joins

A join combines data from two tables, or from a table and a dictionary.

Some joins are keyed, in that columns in the first argument are matched with the key columns of the second argument.

Some joins are as-of, where a time column in the first argument specifies corresponding intervals in a time column of the second argument. Such joins are not keyed.

In each case, the result has the merge of columns from both arguments. Where necessary, rows are filled with nulls or zeros. 

Here is a list of some joins possible using kdb+/q:

+ [Left Join](https://code.kx.com/q/ref/lj/)
+ [AJ (As-of) Join](https://code.kx.com/q/ref/aj/)

To demonstrate some common join types, we will use the preloaded table `weather`, which covers the same period as our taxi data, and try to get some insight into how the weather affected taxi journeys during this period.

In [3]:
// Check number of records in weather and the meta
count weather
meta weather

31


c            | t f a
-------------| -----
date         | d    
maxtemp      | f    
mintemp      | f    
avgtemp      | f    
departuretemp| f    
hdd          | f    
cdd          | f    
precip       | f    
newsnow      | f    
snowdepth    | f    


In [4]:
select mindate:min date, maxdate:max date from weather

mindate    maxdate   
---------------------
2009.01.01 2009.01.31


##### Exercise 9
- Display the maximum and minimum temperatures for NYC for each week of January (for this query, a week is simply a 7-day bucket)

In [5]:
select max maxtemp, min mintemp by 7 xbar date from weather
// alternative: select max maxtemp, min mintemp by date.week from weather

date      | maxtemp mintemp
----------| ---------------
2008.12.27| 34      15     
2009.01.03| 43      25     
2009.01.10| 41      9      
2009.01.17| 47      6      
2009.01.24| 46      13     
2009.01.31| 27      20     


 <img src="images/qbies.png" width="50px" align="left"/><p style='color:#273a6e'><i> Note the difference in start date between the two solutions. With 7 xbar, each bucket starts on a Saturday: kdb+ stores a date as the number of days since 2000.01.01, which was a Saturday, and 7 xbar rounds that count down to a multiple of 7. With date.week, each bucket starts on a Monday, which kdb+ treats as the first day of the week. </i></p>
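
The two bucketing rules can be checked on a single date. The sketch below uses 2009.01.01, a Thursday; the variable name `d` is only for illustration.

```q
d:2009.01.01            // a Thursday
`long$2000.01.01        // 0: dates are counted in days from 2000.01.01 (a Saturday)
7 xbar d                // 2008.12.27, the Saturday on or before d
`week$d                 // 2008.12.29, the Monday on or before d
```

This matches the first bucket, `2008.12.27`, in the `7 xbar` result above.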

In [None]:
//Enter your code here 

In [6]:
ex9[] //check correct output

date      | maxtemp mintemp
----------| ---------------
2008.12.27| 34      15     
2009.01.03| 43      25     
2009.01.10| 41      9      
2009.01.17| 47      6      
2009.01.24| 46      13     
2009.01.31| 27      20     


We now have two tables of related data, taxi trips and weather data, for each day on which a trip occurred.

It would be nice to combine these tables so we could easily ask questions across both data sets.

For example, are average trip durations shorter or longer on days with lots of precipitation?

## Left Join

Like SQL, qSQL supports a number of join operations. Here we will use a left join to create a single table containing both trip and weather data. In kdb+/q, the `lj` keyword matches rows on the key columns of the right-hand table, so those columns must also be present in the left-hand table.

`t1 lj t2` - [left join](https://code.kx.com/q/ref/lj/)

<img src="images/LeftJoin.png" width="400" height="200">
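
As a minimal sketch of how `lj` behaves, consider two small made-up tables (the names `l`, `r` and `res` and all values are hypothetical, not part of the workshop data). Every row of the left table is kept; where the right table has no matching key, the new column is filled with a null.

```q
// Hypothetical tables (all values invented for illustration)
l:([]id:1 2 3;x:10 20 30)   // left table, unkeyed
r:([id:1 3] y:`a`c)         // right table, keyed on id
res:l lj r                  // id 2 has no match in r, so its y is the null symbol
res
```

The result has three rows, one per row of `l`, with `y` equal to `` `a``c ``.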

For our purpose we will be looking at the daily weather data so we'll use the `date` column, which is in both tables.

In [12]:
// Find the number of trips per day

jan09:select from trips where date within 2009.01.01 2009.01.31
jan09C:select trips: count i by date from jan09
jan09C

date      | trips 
----------| ------
2009.01.01| 327625
2009.01.02| 376708
2009.01.03| 432710
2009.01.04| 367525
2009.01.05| 370901
2009.01.06| 427394
2009.01.07| 371043
2009.01.08| 477502
2009.01.09| 520846
2009.01.10| 483350
2009.01.11| 405075
2009.01.12| 414642
2009.01.13| 442543
2009.01.14| 489177
2009.01.15| 486450
2009.01.16| 535200
2009.01.17| 511023
2009.01.18| 419962
2009.01.19| 352534
2009.01.20| 433639
..


Looking more closely at `jan09C`, it doesn't look like a normal kdb+ table: a vertical line appears between the columns `date` and `trips`. This tells us that the table is a **keyed table**, which most kdb+ joins require. But how do we create keyed tables explicitly? We have two choices:

1. Use a `by` clause, as in the query above
2. Use [xkey](https://code.kx.com/q/ref/keys/#xkey) or [!(bang)](https://code.kx.com/q/ref/enkey/)

In [13]:
`date xkey weather // key on the date column
1!weather          // key on the first column
3!weather          // key on the first 3 columns (n!t keys the first n)

date      | maxtemp mintemp avgtemp departuretemp hdd cdd precip newsnow snow..
----------| -----------------------------------------------------------------..
2009.01.01| 26      15      20.5    -12.9         44  0   0      0       0   ..
2009.01.02| 34      23      28.5    -4.8          36  0                  0   ..
2009.01.03| 38      29      33.5    0.4           31  0                  0   ..
2009.01.04| 42      25      33.5    0.5           31  0   0      0       0   ..
2009.01.05| 43      38      40.5    7.6           24  0          0       0   ..
2009.01.06| 38      31      34.5    1.7           30  0   0.08           0   ..
2009.01.07| 38      31      34.5    1.8           30  0   1.19   0       0   ..
2009.01.08| 38      29      33.5    0.9           31  0   0      0       0   ..
2009.01.09| 32      26      29      -3.5          36  0   0      0       0   ..
2009.01.10| 30      23      26.5    -5.9          38  0   0.14   1       0   ..
2009.01.11| 31      24      27.5    -4.9

date      | maxtemp mintemp avgtemp departuretemp hdd cdd precip newsnow snow..
----------| -----------------------------------------------------------------..
2009.01.01| 26      15      20.5    -12.9         44  0   0      0       0   ..
2009.01.02| 34      23      28.5    -4.8          36  0                  0   ..
2009.01.03| 38      29      33.5    0.4           31  0                  0   ..
2009.01.04| 42      25      33.5    0.5           31  0   0      0       0   ..
2009.01.05| 43      38      40.5    7.6           24  0          0       0   ..
2009.01.06| 38      31      34.5    1.7           30  0   0.08           0   ..
2009.01.07| 38      31      34.5    1.8           30  0   1.19   0       0   ..
2009.01.08| 38      29      33.5    0.9           31  0   0      0       0   ..
2009.01.09| 32      26      29      -3.5          36  0   0      0       0   ..
2009.01.10| 30      23      26.5    -5.9          38  0   0.14   1       0   ..
2009.01.11| 31      24      27.5    -4.9

date       maxtemp mintemp| avgtemp departuretemp hdd cdd precip newsnow snow..
--------------------------| -------------------------------------------------..
2009.01.01 26      15     | 20.5    -12.9         44  0   0      0       0   ..
2009.01.02 34      23     | 28.5    -4.8          36  0                  0   ..
2009.01.03 38      29     | 33.5    0.4           31  0                  0   ..
2009.01.04 42      25     | 33.5    0.5           31  0   0      0       0   ..
2009.01.05 43      38     | 40.5    7.6           24  0          0       0   ..
2009.01.06 38      31     | 34.5    1.7           30  0   0.08           0   ..
2009.01.07 38      31     | 34.5    1.8           30  0   1.19   0       0   ..
2009.01.08 38      29     | 33.5    0.9           31  0   0      0       0   ..
2009.01.09 32      26     | 29      -3.5          36  0   0      0       0   ..
2009.01.10 30      23     | 26.5    -5.9          38  0   0.14   1       0   ..
2009.01.11 31      24     | 27.5    -4.9

To unkey a keyed table, use `0!`:

In [14]:
0!jan09C     

date       trips 
-----------------
2009.01.01 327625
2009.01.02 376708
2009.01.03 432710
2009.01.04 367525
2009.01.05 370901
2009.01.06 427394
2009.01.07 371043
2009.01.08 477502
2009.01.09 520846
2009.01.10 483350
2009.01.11 405075
2009.01.12 414642
2009.01.13 442543
2009.01.14 489177
2009.01.15 486450
2009.01.16 535200
2009.01.17 511023
2009.01.18 419962
2009.01.19 352534
2009.01.20 433639
..
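
Keying and unkeying are inverse operations, which can be checked on a small made-up table. This is a sketch only: `t` and `kt` are hypothetical names and the values are invented.

```q
// Hypothetical table (values invented for illustration)
t:([]date:2009.01.01 2009.01.02;trips:5 7)
kt:`date xkey t      // key on the named column
keys kt              // ,`date  (the list of key columns)
kt~1!t               // 1b: keying the first column gives the same keyed table
t~0!kt               // 1b: 0! recovers the original unkeyed table
```

`keys` is a convenient way to confirm which columns a table is keyed on before using it in a join.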


The `lj` keyword requires that at least the right-hand table be keyed. A table can be keyed in a number of ways; the easiest is to use the [`xkey`](https://code.kx.com/q/ref/keys/#xkey) function.

In [18]:
// select date and precipitation from the weather table
// key the result on date
// left join it to jan09C (already keyed on date by its by clause)
jan09W:jan09C lj `date xkey select date, precip from weather 
jan09W

// alternative: use a by clause to key the right-hand table
jan09W:jan09C lj select avg precip by date from weather

date      | trips  precip
----------| -------------
2009.01.01| 327625 0     
2009.01.02| 376708       
2009.01.03| 432710       
2009.01.04| 367525 0     
2009.01.05| 370901       
2009.01.06| 427394 0.08  
2009.01.07| 371043 1.19  
2009.01.08| 477502 0     
2009.01.09| 520846 0     
2009.01.10| 483350 0.14  
2009.01.11| 405075 0.19  
2009.01.12| 414642 0     
2009.01.13| 442543 0     
2009.01.14| 489177 0     
2009.01.15| 486450 0.05  
2009.01.16| 535200 0     
2009.01.17| 511023 0     
2009.01.18| 419962 0.18  
2009.01.19| 352534 0.18  
2009.01.20| 433639 0     
..


Do we get the same result if the left-hand table is unkeyed?

In [19]:
unkeyedJan09C:0!jan09C
unkeyedJan09C lj `date xkey select date, precip from weather

date       trips  precip
------------------------
2009.01.01 327625 0     
2009.01.02 376708       
2009.01.03 432710       
2009.01.04 367525 0     
2009.01.05 370901       
2009.01.06 427394 0.08  
2009.01.07 371043 1.19  
2009.01.08 477502 0     
2009.01.09 520846 0     
2009.01.10 483350 0.14  
2009.01.11 405075 0.19  
2009.01.12 414642 0     
2009.01.13 442543 0     
2009.01.14 489177 0     
2009.01.15 486450 0.05  
2009.01.16 535200 0     
2009.01.17 511023 0     
2009.01.18 419962 0.18  
2009.01.19 352534 0.18  
2009.01.20 433639 0     
..


 <img src="images/qbies.png" width="50px" align="left"/><p style='color:#273a6e'><i> The left-hand table can be keyed or unkeyed. The result is keyed in the same way as the left-hand table. </i></p> 

Now we can compare trips with precipitation:

In [20]:
select date,trips,precip from jan09W

date       trips  precip
------------------------
2009.01.01 327625 0     
2009.01.02 376708       
2009.01.03 432710       
2009.01.04 367525 0     
2009.01.05 370901       
2009.01.06 427394 0.08  
2009.01.07 371043 1.19  
2009.01.08 477502 0     
2009.01.09 520846 0     
2009.01.10 483350 0.14  
2009.01.11 405075 0.19  
2009.01.12 414642 0     
2009.01.13 442543 0     
2009.01.14 489177 0     
2009.01.15 486450 0.05  
2009.01.16 535200 0     
2009.01.17 511023 0     
2009.01.18 419962 0.18  
2009.01.19 352534 0.18  
2009.01.20 433639 0     
..
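
One way to start comparing wet and dry days is to group on a boolean. The sketch below uses a small invented table `t` shaped like `jan09W`; the values are made up, and treating a missing precipitation reading as a dry day is an assumption of this sketch, not something the workshop data guarantees.

```q
// Hypothetical table in the shape of jan09W (all values invented)
t:([]date:2009.01.01+til 4;trips:100 200 300 400;precip:0 0.5 0n 1.5)
// precip>0 is 0b for both zero and null readings, so nulls are grouped with dry days
r:select avgtrips:avg trips by wet:precip>0 from t
r
```

Here the dry group averages 200 trips and the wet group 300. The same query shape could be applied to `0!jan09W`.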


##### Exercise 10
- Join the number of trips per day with the average temperature from the weather data for the month of January

In [21]:
jan09C lj `date xkey select date, avgtemp from weather 

date      | trips  avgtemp
----------| --------------
2009.01.01| 327625 20.5   
2009.01.02| 376708 28.5   
2009.01.03| 432710 33.5   
2009.01.04| 367525 33.5   
2009.01.05| 370901 40.5   
2009.01.06| 427394 34.5   
2009.01.07| 371043 34.5   
2009.01.08| 477502 33.5   
2009.01.09| 520846 29     
2009.01.10| 483350 26.5   
2009.01.11| 405075 27.5   
2009.01.12| 414642 27     
2009.01.13| 442543 34.5   
2009.01.14| 489177 26     
2009.01.15| 486450 18.5   
2009.01.16| 535200 12.5   
2009.01.17| 511023 14     
2009.01.18| 419962 28.5   
2009.01.19| 352534 29.5   
2009.01.20| 433639 24     
..


In [None]:
//Enter your code here 

In [22]:
ex10[] //check correct output

date      | trips  avgtemp
----------| --------------
2009.01.01| 327625 20.5   
2009.01.02| 376708 28.5   
2009.01.03| 432710 33.5   
2009.01.04| 367525 33.5   
2009.01.05| 370901 40.5   
2009.01.06| 427394 34.5   
2009.01.07| 371043 34.5   
2009.01.08| 477502 33.5   
2009.01.09| 520846 29     
2009.01.10| 483350 26.5   
2009.01.11| 405075 27.5   
2009.01.12| 414642 27     
2009.01.13| 442543 34.5   
2009.01.14| 489177 26     
2009.01.15| 486450 18.5   
2009.01.16| 535200 12.5   
2009.01.17| 511023 14     
2009.01.18| 419962 28.5   
2009.01.19| 352534 29.5   
2009.01.20| 433639 24     
..


## As-of Join

`aj[matching columns;t1;t2]` - [as-of join](https://code.kx.com/q/ref/aj/)

qSQL also supports time-series joins, a powerful feature not typically found in other databases and languages.

Given the data we have, we could ask what were the latest pick-ups for each vendor, as of a particular time.

Let's say three individuals each report losing a phone or wallet in a taxi. Each report gives the number of passengers in the taxi and a time shortly after pick-up. Which vendor were they riding with?

We will create a temporary table with the passenger count and time from each report:

In [23]:
timetab:([] passengers:1 2 3; event_time:2009.01.06D03:30:00+00:30*til 3)
timetab

passengers event_time                   
----------------------------------------
1          2009.01.06D03:30:00.000000000
2          2009.01.06D04:00:00.000000000
3          2009.01.06D04:30:00.000000000


Using `aj`, we can look up the table `jan09` to find the last trip picked up at or before each of the times above with the same number of passengers:

In [24]:
aj[`passengers`event_time;timetab;select passengers, event_time:pickup_time, vendor, pickup_time from jan09]

passengers event_time                    vendor pickup_time                  
-----------------------------------------------------------------------------
1          2009.01.06D03:30:00.000000000 VTS    2009.01.06D03:30:00.000000000
2          2009.01.06D04:00:00.000000000 VTS    2009.01.06D04:00:00.000000000
3          2009.01.06D04:30:00.000000000 CMT    2009.01.06D04:29:22.000000000


For each row of `timetab`, the result is the latest record with the same `passengers` value whose `event_time` is ≤ the time we specified.
- An `aj` join always selects the last record at or before the specified time.
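
The at-or-before rule, and what happens when there is no earlier record, can be seen on two small made-up tables. This is a sketch only: `rides`, `asks`, `res` and all values are hypothetical, and `rides` is deliberately sorted by time within each vendor, which `aj` expects of its second table.

```q
// Hypothetical data (all values invented for illustration)
rides:([]vendor:`VTS`VTS`CMT;time:09:00 09:05 09:02;fare:10 11 20)
asks:([]vendor:`VTS`VTS`CMT`CMT;time:09:05 09:03 09:01 09:10)
// match exactly on vendor, then take the last ride at or before each asked time
res:aj[`vendor`time;asks;rides]
res
```

The `fare` column of `res` is `11 10 0N 20`: an exact time match is used (09:05), an earlier ride is carried forward (09:03 picks up the 09:00 ride), and a time with no earlier ride for that vendor (CMT at 09:01) gets a null.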

In [25]:
timetab:([] passengers:1 2 3 4 5 6; event_time:2009.01.06D03:30:00+00:30*til 6)
timetab

passengers event_time                   
----------------------------------------
1          2009.01.06D03:30:00.000000000
2          2009.01.06D04:00:00.000000000
3          2009.01.06D04:30:00.000000000
4          2009.01.06D05:00:00.000000000
5          2009.01.06D05:30:00.000000000
6          2009.01.06D06:00:00.000000000


We have created a new `timetab` table. What will the output of the as-of join be now?

In [26]:
aj[`passengers`event_time;timetab;select passengers, event_time:pickup_time, vendor, pickup_time from jan09]

passengers event_time                    vendor pickup_time                  
-----------------------------------------------------------------------------
1          2009.01.06D03:30:00.000000000 VTS    2009.01.06D03:30:00.000000000
2          2009.01.06D04:00:00.000000000 VTS    2009.01.06D04:00:00.000000000
3          2009.01.06D04:30:00.000000000 CMT    2009.01.06D04:29:22.000000000
4          2009.01.06D05:00:00.000000000 CMT    2009.01.06D04:59:54.000000000
5          2009.01.06D05:30:00.000000000 VTS    2009.01.06D05:30:00.000000000
6          2009.01.06D06:00:00.000000000 VTS    2009.01.06D05:59:00.000000000


##### Exercise 11

Find the latest trips as of 09:30 on the 31st of January for each vendor.

In [27]:
timetab:([] vendor: `VTS`DDS`CMT; pickup_time:3#2009.01.31D09:30:00)
aj[`vendor`pickup_time;timetab;jan09]

vendor pickup_time                   date       month   dropoff_time         ..
-----------------------------------------------------------------------------..
VTS    2009.01.31D09:30:00.000000000 2009.01.31 2009.01 2009.01.31D09:41:00.0..
DDS    2009.01.31D09:30:00.000000000 2009.01.31 2009.01 2009.01.31D09:35:17.0..
CMT    2009.01.31D09:30:00.000000000 2009.01.31 2009.01 2009.01.31D09:38:56.0..


In [None]:
//Enter your code here

In [28]:
ex11[] //check correct output

vendor pickup_time                   date       month   dropoff_time         ..
-----------------------------------------------------------------------------..
VTS    2009.01.31D09:30:00.000000000 2009.01.31 2009.01 2009.01.31D09:41:00.0..
DDS    2009.01.31D09:30:00.000000000 2009.01.31 2009.01 2009.01.31D09:35:17.0..
CMT    2009.01.31D09:30:00.000000000 2009.01.31 2009.01 2009.01.31D09:38:56.0..
