Skip to content

[feature request] Complex join conditions #173

@derekmahar

Description

@derekmahar

Thank you for developing such a well designed and highly useful CSV processing tool!

While the Miller join command is already very useful, I think it would be even better if it were to support join conditions more complex than simple equality. I propose an additional join command option --condition or -c, that accepts a logical expression that may refer to an arbitrary number of columns in the left and right tables (CSV input files). Miller would use the result of this logical expression to decide whether two rows in each table match and whether to include the merger of these two rows in the final table result.

The specific use case I have in mind is the association of a sparse history of currency exchange rates to a history of trades of some other asset like stocks in a particular currency where the currency exchange rate times may not cover all of those of the stock trades. For example, consider Bitcoin trades which happen continuously in USD on many exchanges worldwide and the currency exchange of USD to CAD which typically happens only during business hours and skips weekends. In order to assign a USD to CAD exchange rate to each Bitcoin trade, given a Bitcoin trade time, we must select the most recent exchange rate time that equals or precedes the Bitcoin trade time.

Consider the following highly simplified example involving a sequence of eleven imaginary currency exchange rates and ten trades of some imaginary asset where the trade times are continuous, but there exist gaps between some exchange rate times. While I include the value of each exchange rate, I omit all details of each trade except its time. For both the exchange rates and trades, I represent time as a monotonically increasing integer.

# Exchange rates
cat << EOF | mlr --csv --opprint cat > exchange_rates.csv
time,rate
1,1.0
4,1.1
5,1.5
7,2.0
9,1.8
10,1.7
11,1.9
EOF
time rate
1    1.0
4    1.1
5    1.5
7    2.0
9    1.8
10   1.7
11   1.9

# Trade times
cat << EOF | mlr --csv --opprint cat > trades.csv
time
1
2
3
4
5
6
7
8
9
10
EOF

The following table shows the expected result of joining the earlier two tables together:

# Trade times joined with exchange rates
$ cat << EOF | mlr --csv --opprint cat
time,rate
1,1
2,1
3,1
4,1.1
5,1.5
6,1.5
7,2
8,2
9,1.8
10,1.7
EOF
time rate
1    1
2    1
3    1
4    1.1
5    1.5
6    1.5
7    2
8    2
9    1.8
10   1.7

While the Miller join command cannot produce this result, you can calculate this result using an SQL Common Table Expression (CTE) and SQL window functions available in PostgreSQL 9.6:

PostgreSQL 9.6 Schema Setup:

CREATE TABLE exchange_rate(time INT, rate FLOAT);

INSERT INTO exchange_rate VALUES
(1, 1.0)
,(4, 1.1)
,(5, 1.5)
,(7, 2.0)
,(9, 1.8)
,(10, 1.7)
,(11, 1.9)
;

CREATE TABLE trade(time int);

INSERT INTO trade VALUES
(1)
,(2)
,(3)
,(4)
,(5)
,(6)
,(7)
,(8)
,(9)
,(10)
;

Query 1:

WITH exchange_rate AS (
SELECT
  time AS start_time,
  LEAD(time) OVER (ORDER BY time) as end_time,
  rate
FROM exchange_rate)
SELECT
  t.time,
  e.rate
FROM trade t
JOIN exchange_rate e
ON e.start_time <= t.time AND e.end_time > t.time

Results:

time rate
1 1
2 1
3 1
4 1.1
5 1.5
6 1.5
7 2
8 2
9 1.8
10 1.7

Using Miller, the first step towards this result is to use the step and shift command combination to transform the original currency exchange rate table into one where each row contains the starting and ending times within which an exchange rate is "current":

cat << EOF > exchange_rates.csv
time,rate
1,1.0
4,1.1
5,1.5
7,2.0
9,1.8
10,1.7
11,1.9
EOF

{ cat exchange_rates.csv |
  mlr --csv step -a shift -f time,rate \
    then filter -e 'NR != 1' \
    then cut -f time_shift,time,rate_shift \
    then rename time_shift,start_time,time,end_time,rate_shift,rate \
    then reorder -f start_time,end_time,rate
  # Append last row
  cat exchange_rates.csv |
    mlr --csv tail -n 1 |
    mlr --csv put '$start_time = $time; $end_time="";' \
      then reorder -f start_time,end_time,rate \
      then cut -f start_time,end_time,rate |
    mlr --csv --headerless-csv-output cat;
} |
mlr --csv --opprint cat
start_time end_time rate
1          4        1.0
4          5        1.1
5          7        1.5
7          9        2.0
9          10       1.8
10         11       1.7
11         -        1.9

The result is the same as that of the SQL query exchange_rate in the earlier CTE query:

PostgreSQL 9.6 Schema Setup:

CREATE TABLE exchange_rate(time INT, rate FLOAT);

INSERT INTO exchange_rate VALUES
(1, 1.0)
,(4, 1.1)
,(5, 1.5)
,(7, 2.0)
,(9, 1.8)
,(10, 1.7)
,(11, 1.9)
;

CREATE TABLE trade(time int);

INSERT INTO trade VALUES
(1)
,(2)
,(3)
,(4)
,(5)
,(6)
,(7)
,(8)
,(9)
,(10)
;

Query 1:

SELECT
  time as start_time,
  LEAD(time) OVER (ORDER BY time) as end_time,
  rate
FROM exchange_rate

Results:

start_time end_time rate
1 4 1
4 5 1.1
5 7 1.5
7 9 2
9 10 1.8
10 11 1.7
11 (null) 1.9

In order for Miller to produce the final result which joins the trades with the exchange rates, we'd have to apply an additional hypothetical join command -c '$start_time <= $time && $end_time > $time' that resembles the SQL join operation FROM trade t JOIN exchange_rate e ON e.start_time <= t.time AND e.end_time > t.time:

cat << EOF > exchange_rates.csv
time,rate
1,1.0
4,1.1
5,1.5
7,2.0
9,1.8
10,1.7
11,1.9
EOF

{ cat exchange_rates.csv |
  mlr --csv step -a shift -f time,rate \
    then filter -e 'NR != 1' \
    then cut -f time_shift,time,rate_shift \
    then rename time_shift,start_time,time,end_time,rate_shift,rate \
    then reorder -f start_time,end_time,rate
  # Append last row
  cat exchange_rates.csv |
    mlr --csv tail -n 1 |
    mlr --csv put '$start_time = $time; $end_time="";' \
      then reorder -f start_time,end_time,rate \
      then cut -f start_time,end_time,rate |
    mlr --csv --headerless-csv-output cat;
} |
mlr --csv --opprint join -f trades.csv -j time,rate -r start_time,end_time,rate \
  -c '$start_time <= $time && $end_time > $time'
time rate
1    1
2    1
3    1
4    1.1
5    1.5
6    1.5
7    2
8    2
9    1.8
10   1.7

Aside from this use case, another reason for Miller to support complex join conditions would simply be to match the join conditions possible in SQL.

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions