-
Notifications
You must be signed in to change notification settings - Fork 543
/
ibis-for-sql-users.qmd
1205 lines (893 loc) · 26.5 KB
/
ibis-for-sql-users.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Tutorial: Ibis for SQL users
## Prerequisites
{{< include ../_tabsets/install_default.qmd >}}
## Overview
Ibis provides a full-featured replacement for SQL
`SELECT` queries, but expressed with Python code that is:
- Type-checked and validated as you go. No more debugging cryptic database
errors; Ibis catches your mistakes right away.
- Easier to write. Pythonic function calls with tab completion in IPython.
- More composable. Break complex queries down into easier-to-digest pieces
- Easier to reuse. Mix and match Ibis snippets to create expressions tailored
for your analysis.
We intend for all `SELECT` queries to be fully portable to Ibis. Coverage of
other DDL statements (e.g. `CREATE TABLE` or `INSERT`) may vary from engine to
engine.
::: {.callout-note}
If you find any SQL idioms or use cases in your work that are not represented
here, please reach out so we can add more to this guide!
:::
## Projections: select/add/remove columns
All tables in Ibis are immutable. To select a subset of a table's columns, or
to add new columns, you must produce a new table by means of a *projection*.
```{python}
import ibis
t = ibis.table(dict(one="string", two="float", three="int32"), name="my_data")
t
```
In SQL, you might write something like:
```sql
SELECT two, one
FROM my_data
```
In Ibis, this is
```{python}
proj = t["two", "one"]
```
or
```{python}
proj = t.select("two", "one")
```
This generates the expected SQL:
```{python}
ibis.to_sql(proj)
```
What about adding new columns?
```sql
SELECT two, one, three * 2 AS new_col
FROM my_data
```
The last expression is written:
```{python}
new_col = (t.three * 2).name("new_col")
```
Now, we have:
```{python}
proj = t["two", "one", new_col]
ibis.to_sql(proj)
```
### `mutate`: Add or modify columns easily
Since adding new columns or modifying existing columns is so common, there is
a convenience method `mutate`:
```{python}
mutated = t.mutate(new_col=t.three * 2)
```
Notice that using the `name` was not necessary here because we're using Python
keyword arguments to provide the name:
```{python}
ibis.to_sql(mutated)
```
If you modify an existing column with `mutate` it will list out all the other
columns:
```{python}
mutated = t.mutate(two=t.two * 2)
ibis.to_sql(mutated)
```
### `SELECT *` equivalent
Especially in combination with relational joins, it's convenient to be able to
select all columns in a table using the `SELECT *` construct. To do this, use
the table expression itself in a projection:
```{python}
proj = t[t]
ibis.to_sql(proj)
```
This is how `mutate` is implemented. The example above
`t.mutate(new_col=t.three * 2)` can be written as a normal projection:
```{python}
proj = t[t, new_col]
ibis.to_sql(proj)
```
Let's consider a table we might wish to join with `t`:
```{python}
t2 = ibis.table(dict(key="string", value="float"), name="dim_table")
```
Now let's take the SQL:
```sql
SELECT t0.*, t0.two - t1.value AS diff
FROM my_data t0
INNER JOIN dim_table t1
ON t0.one = t1.key
```
To write this with Ibis, it is:
```{python}
diff = (t.two - t2.value).name("diff")
joined = t.join(t2, t.one == t2.key)[t, diff]
```
And verify the generated SQL:
```{python}
ibis.to_sql(joined)
```
### Using functions in projections
If you pass a function instead of a string or Ibis expression in any projection
context, it will be invoked with the "parent" table as its argument. This can
help significantly when [composing complex operations. Consider this SQL:
```sql
SELECT one, avg(abs(the_sum)) AS mad
FROM (
SELECT one, three, sum(two) AS the_sum
FROM my_data
GROUP BY 1, 2
) t0
GROUP BY 1
```
This can be written as one chained expression:
```{python}
expr = (
t.group_by(["one", "three"])
.aggregate(the_sum=t.two.sum())
.group_by("one")
.aggregate(mad=lambda x: x.the_sum.abs().mean())
)
```
Here's the SQL:
```{python}
ibis.to_sql(expr)
```
## Filtering / `WHERE`
You can add filter clauses to a table expression either by indexing with
`[]` (similar to pandas) or use the `filter` method:
```{python}
filtered = t[t.two > 0]
ibis.to_sql(filtered)
```
`filter` can take a list of expressions, which must all be satisfied for
a row to appear in the result:
```{python}
filtered = t.filter([t.two > 0, t.one.isin(["A", "B"])])
ibis.to_sql(filtered)
```
To compose boolean expressions with `AND` or `OR`, use the respective
`&` and `|` operators:
```{python}
cond = (t.two < 0) | ((t.two > 0) | t.one.isin(["A", "B"]))
filtered = t[cond]
ibis.to_sql(filtered)
```
Note the parentheses around the second expression. These are necessary due to
operator precedence.
## Aggregation / `GROUP BY`
To aggregate a table, you need:
- Zero or more grouping expressions (these can be column names)
- One or more aggregation expressions
Let's look at the `aggregate` method on tables:
```{python}
agged = t.aggregate(total_two=t.two.sum(), avg_three=t.three.mean())
```
If you don't use any group expressions, the result will have a single
row with your statistics of interest:
```{python}
agged.schema()
```
```{python}
ibis.to_sql(agged)
```
To add groupings, use either the `by` argument of `aggregate` or use the
`group_by` construct:
```{python}
agged2 = t.aggregate(total_two=t.two.sum(), avg_three=t.three.mean())
agged3 = t.group_by("one").aggregate(total_two=t.two.sum(), avg_three=t.three.mean())
ibis.to_sql(agged3)
```
### Non-trivial grouping keys
You can use any expression (or function, like in projections) deriving
from the table you are aggregating. The only constraint is that the
expressions must be named. Let's look at an example:
```{python}
events = ibis.table(
dict(ts="timestamp", event_type="int32", session_id="int64"),
name="web_events",
)
```
Suppose we wanted to total up event types by year and month:
```{python}
stats = (
events.group_by(year=events.ts.year(), month=events.ts.month())
.aggregate(total=events.count(), sessions=events.session_id.nunique())
)
```
Now we have:
```{python}
ibis.to_sql(stats)
```
### Aggregates considering table subsets
In analytics is it common to compare statistics from different subsets
of a table. Let's consider a dataset containing people's name, age,
gender, and nationality:
```{python}
pop = ibis.table(
dict(name="string", country="string", gender="string", age="int16"),
name="population",
)
```
Now, suppose you wanted to know for each country:
- Average overall age
- Average male age
- Average female age
- Total number of persons
In SQL, you may write:
```sql
SELECT
country,
count(*) AS num_persons,
AVG(age) AS avg_age
AVG(
CASE WHEN gender = 'M' THEN age
ELSE NULL
END
) AS avg_male,
AVG(
CASE WHEN gender = 'F' THEN age
ELSE NULL
END
) AS avg_female,
FROM population
GROUP BY 1
```
Ibis makes this much simpler by giving you `where` option in aggregation
functions:
```{python}
expr = pop.group_by("country").aggregate(
num_persons=pop.count(),
avg_age=pop.age.mean(),
avg_male=pop.age.mean(where=pop.gender == "M"),
avg_female=pop.age.mean(where=pop.gender == "F"),
)
```
This indeed generates the correct SQL. Note that SQL engines handle `NULL`
values differently in aggregation functions, but Ibis will write the SQL
expression that is correct for your query engine.
```{python}
ibis.to_sql(expr)
```
### Counting rows with [`Table.count()`](../reference/expression-tables.qmd#ibis.expr.types.relations.Table.count):
Computing group frequencies can be done by calling `count()` after calling
`group_by()`:
```{python}
freqs = events.group_by(year=events.ts.year(), month=events.ts.month()).count()
ibis.to_sql(freqs)
```
### Frequency table convenience: `value_counts`
Consider the SQL idiom:
```sql
SELECT some_column_expression, count(*)
FROM table
GROUP BY 1
```
This is so common that, like pandas, there is a generic column method
`value_counts` which does this:
```{python}
expr = events.ts.year().value_counts()
ibis.to_sql(expr)
```
### `HAVING` clause
The SQL `HAVING` clause enables you to filter the results of an aggregation
based on some group-wise condition holding true. For example, suppose we wanted
to limit our analysis to groups containing at least 1000 observations:
```sql
SELECT one, sum(two) AS total
FROM my_data
GROUP BY 1
HAVING count(*) >= 1000
```
With Ibis, you can do:
```{python}
expr = (
t.group_by("one")
.having(t.count() >= 1000)
.aggregate(total=t.two.sum())
)
ibis.to_sql(expr)
```
## Sorting / `ORDER BY`
To sort a table, use the `order_by` method along with either column names or
expressions that indicate the sorting keys:
```{python}
sorted = events.order_by([events.ts.year(), events.ts.month()])
ibis.to_sql(sorted)
```
The default for sorting is in ascending order. To reverse the sort direction of
any key, wrap it in `ibis.desc`:
```{python}
sorted = (
events.order_by([ibis.desc("event_type"), ibis.desc(events.ts.month())])
.limit(100)
)
ibis.to_sql(sorted)
```
## `LIMIT` and `OFFSET`
The table `limit` function truncates a table to the indicates number of rows.
So if you only want the first 1000 rows (which may not be deterministic
depending on the SQL engine), you can do:
```{python}
limited = t.limit(1000)
ibis.to_sql(limited)
```
The `offset` option in `limit` skips rows. So if you wanted rows 11
through 20, you could do:
```{python}
limited = t.limit(10, offset=10)
ibis.to_sql(limited)
```
## Common column expressions
See the full [API documentation](../reference/expression-generic.qmd) for all
of the available value methods and tools for creating value expressions.
We mention a few common ones here as they relate to common SQL queries.
### Type casts
Ibis's type system is independent of any SQL system. You cast Ibis
expressions from one Ibis type to another. For example:
```{python}
expr = t.mutate(
date=t.one.cast("timestamp"),
four=t.three.cast("float32"),
)
ibis.to_sql(expr)
```
### `CASE` statements
SQL dialects typically support one or more kind of `CASE` statements. The first
is the *simple case* that compares against exact values of an expression.
```sql
CASE expr
WHEN value_1 THEN result_1
WHEN value_2 THEN result_2
ELSE default
END
```
Value expressions in Ibis have a `case` method that allows us to emulate these
semantics:
```{python}
case = (
t.one.cast("timestamp")
.year()
.case()
.when(2015, "This year")
.when(2014, "Last year")
.else_("Earlier")
.end()
)
expr = t.mutate(year_group=case)
ibis.to_sql(expr)
```
The more general case is that of an arbitrary list of boolean expressions and
result values:
```sql
CASE
WHEN boolean_expr1 THEN result_1
WHEN boolean_expr2 THEN result_2
WHEN boolean_expr3 THEN result_3
ELSE default
END
```
To do this, use `ibis.case`:
```{python}
case = (
ibis.case()
.when(t.two < 0, t.three * 2)
.when(t.two > 1, t.three)
.else_(t.two)
.end()
)
expr = t.mutate(cond_value=case)
ibis.to_sql(expr)
```
There are several places where Ibis builds cases for you in a simplified way.
One example is the `ifelse` function:
```{python}
switch = (t.two < 0).ifelse("Negative", "Non-Negative")
expr = t.mutate(group=switch)
ibis.to_sql(expr)
```
### Using `NULL` in expressions
To use `NULL` in an expression, either use the special `ibis.NA` value:
```{python}
pos_two = (t.two > 0).ifelse(t.two, ibis.NA)
expr = t.mutate(two_positive=pos_two)
ibis.to_sql(expr)
```
### Set membership: `IN` / `NOT IN`
Let's look again at the population dataset. Suppose you wanted to
combine the United States and Canada data into a "North America"
category. Here would be some SQL to do it:
```sql
CASE
WHEN UPPER(country) IN ('UNITED STATES', 'CANADA') THEN 'North America'
ELSE country
END AS refined_group
```
The Ibis equivalent of `IN` is the `isin` method. So we can write:
```{python}
refined = (
pop.country.upper()
.isin(["UNITED STATES", "CANADA"])
.ifelse("North America", pop.country)
)
expr = pop.mutate(refined_group=refined)
ibis.to_sql(expr)
```
The opposite of `isin` is `notin`.
### Constant and literal expressions
Consider a SQL expression like:
```sql
'foo' IN (column1, column2)
```
which is equivalent to
```sql
column1 = 'foo' OR column2 = 'foo'
```
To build expressions off constant values, you must first convert the
value (whether a Python string or number) to an Ibis expression using
`ibis.literal`:
```{python}
t3 = ibis.table(
dict(column1="string", column2="string", column3="float"),
name="data",
)
value = ibis.literal("foo")
```
Once you've done this, you can use the literal expression like any
other array or scalar expression:
```{python}
has_foo = value.isin([t3.column1, t3.column2])
expr = t3.mutate(has_foo=has_foo)
ibis.to_sql(expr)
```
In many other situations, you can use constants without having to use
`ibis.literal`. For example, we could add a column containing nothing
but the number 5 like so:
```{python}
expr = t3.mutate(number5=5)
ibis.to_sql(expr)
```
### `IS NULL` and `IS NOT NULL`
These are simple: use the `isnull` and `notnull` functions respectively,
which yield boolean arrays:
```{python}
indic = t.two.isnull().ifelse("valid", "invalid")
expr = t.mutate(is_valid=indic)
ibis.to_sql(expr)
```
```{python}
agged = (
expr[expr.one.notnull()]
.group_by("is_valid")
.aggregate(three_count=lambda t: t.three.notnull().sum())
)
ibis.to_sql(agged)
```
### `BETWEEN`
The `between` method on arrays and scalars compiles to the SQL `BETWEEN`
keyword. The result of `between` is boolean and can be used with any
other boolean expression:
```{python}
expr = t[t.two.between(10, 50) & t.one.notnull()]
ibis.to_sql(expr)
```
## Joins
Ibis supports several kinds of joins between table expressions:
- `inner_join`: maps to SQL `INNER JOIN`
- `cross_join`: a cartesian product join with no keys. Equivalent to
`inner_join` with no join predicates
- `left_join`: maps to SQL `LEFT OUTER JOIN`
- `outer_join`: maps to SQL `FULL OUTER JOIN`
- `semi_join`: maps to SQL `LEFT SEMI JOIN`. May or may not be an
explicit join type in your query engine.
- `anti_join`: maps to SQL `LEFT ANTI JOIN`. May or may not be an
explicit join type in your query engine.
The `join` table method is by default the same as `inner_join`.
Let's look at a couple example tables to see how joins work in Ibis:
```{python}
t1 = ibis.table(dict(value1="float", key1="string", key2="string"), name="table1")
t2 = ibis.table(dict(value2="float", key3="string", key4="string"), name="table2")
```
Let's join on one key:
```{python}
joined = t1.left_join(t2, t1.key1 == t2.key3)
```
The immediate result of a join does not yet have a set schema. That is
determined by the next action that you take. There's several ways forward from
here that address a variety of SQL use cases.
### Join + projection
Consider the SQL:
```sql
SELECT t0.*, t1.value2
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
```
After one or more joins, you can reference any of the joined tables in
a projection immediately after:
```{python}
expr = joined[t1, t2.value2]
ibis.to_sql(expr)
```
If you need to compute an expression that involves both tables, you can
do that also:
```{python}
expr = joined[t1.key1, (t1.value1 - t2.value2).name("diff")]
ibis.to_sql(expr)
```
### Join + aggregation
You can directly aggregate a join without need for projection, which
also allows you to form statistics that reference any of the joined
tables.
Consider this SQL:
```sql
SELECT t0.key1, AVG(t0.value1 - t1.value2) AS avg_diff
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
GROUP BY 1
```
As you would hope, the code is as follows:
```{python}
avg_diff = (t1.value1 - t2.value2).mean()
expr = (
t1.left_join(t2, t1.key1 == t2.key3)
.group_by(t1.key1)
.aggregate(avg_diff=avg_diff)
)
ibis.to_sql(expr)
```
### Join with `SELECT *`
If you try to compile or execute a join that has not been projected or
aggregated, it will be *fully materialized*:
```{python}
joined = t1.left_join(t2, t1.key1 == t2.key3)
ibis.to_sql(joined)
```
### Multiple joins
You can join multiple tables together in succession without needing to address
any of the above concerns.
```{python}
t3 = ibis.table(dict(value3="float", key5="string"), name="table3")
total = (t1.value1 + t2.value2 + t3.value3).sum()
expr = (
t1.join(t2, [t1.key1 == t2.key3, t1.key2 == t2.key4])
.join(t3, t1.key1 == t3.key5)
.group_by([t2.key4, t3.key5])
.aggregate(total=total)
)
ibis.to_sql(expr)
```
### Self joins
What about when you need to join a table on itself? For example:
```sql
SELECT t0.one, avg(t0.two - t1.three) AS metric
FROM my_data t0
INNER JOIN my_data t1
ON t0.one = t1.one
GROUP BY 1
```
The table `view` method enables you to form a *self-reference* that is
referentially distinct in expressions. Now you can proceed normally:
```{python}
t_view = t.view()
stat = (t.two - t_view.three).mean()
expr = (
t.join(t_view, t.three.cast("string") == t_view.one)
.group_by(t.one)
.aggregate(metric=stat)
)
ibis.to_sql(expr)
```
### Overlapping join keys
In many cases the columns being joined between two tables or table
expressions have the same name. Consider this example:
```{python}
t4 = ibis.table(
dict(key1="string", key2="string", key3="string", value1="float"),
name="table4",
)
t5 = ibis.table(
dict(key1="string", key2="string", key3="string", value2="float"),
name="table5",
)
```
In these case, we can specify a list of common join keys:
```{python}
joined = t4.join(t5, ["key1", "key2", "key3"])
expr = joined[t4, t5.value2]
ibis.to_sql(expr)
```
You can mix the overlapping key names with other expressions:
```{python}
joined = t4.join(t5, ["key1", "key2", t4.key3.left(4) == t5.key3.left(4)])
expr = joined[t4, t5.value2]
ibis.to_sql(expr)
```
### Non-equality join predicates
You can join tables with boolean clauses that are not equality. Some
query engines support these efficiently, some inefficiently, or some not
at all. In the latter case, these conditions get moved by Ibis into the
`WHERE` part of the `SELECT` query.
```{python}
expr = t1.join(t2, t1.value1 < t2.value2).group_by(t1.key1).size()
ibis.to_sql(expr)
```
### Other ways to specify join keys
You can also pass a list of column names instead of forming boolean
expressions:
```{python}
joined = t1.join(t2, [("key1", "key3"), ("key2", "key4")])
```
## Subqueries
Ibis creates inline views and nested subqueries automatically. This
section concerns more complex subqueries involving foreign references
and other advanced relational algebra.
### Correlated `EXISTS` / `NOT EXISTS` filters
The SQL `EXISTS` and `NOT EXISTS` constructs are typically used for
efficient filtering in large many-to-many relationships.
Let's consider a web dataset involving website session / usage data and
purchases:
```{python}
events = ibis.table(
dict(session_id="int64", user_id="int64", event_type="int32", ts="timestamp"),
name="events",
)
purchases = ibis.table(
dict(item_id="int64", user_id="int64", price="float", ts="timestamp"),
name="purchases",
)
```
Now, the key `user_id` appears with high frequency in both tables. But let's
say you want to limit your analysis of the `events` table to only sessions by
users who have made a purchase.
In SQL, you can do this using the somewhat esoteric `EXISTS` construct:
```sql
SELECT t0.*
FROM events t0
WHERE EXISTS (
SELECT 1
FROM purchases t1
WHERE t0.user_id = t1.user_id
)
```
To describe this operation in Ibis, you compare the `user_id` columns and use
the `any` reduction:
```{python}
cond = (events.user_id == purchases.user_id).any()
```
This can now be used to filter `events`:
```{python}
expr = events[cond]
ibis.to_sql(expr)
```
If you negate the condition, it will instead give you only event data
from user *that have not made a purchase*:
```{python}
expr = events[-cond]
ibis.to_sql(expr)
```
### Subqueries with `IN` / `NOT IN`
Subquery filters with `IN` (and `NOT IN`) are functionally similar to
`EXISTS` subqueries. Let's look at some SQL:
```sql
SELECT *
FROM events
WHERE user_id IN (
SELECT user_id
FROM purchases
)
```
This is almost semantically the same as the `EXISTS` example. Indeed,
you can write with Ibis:
```{python}
cond = events.user_id.isin(purchases.user_id)
expr = events[cond]
ibis.to_sql(expr)
```
Depending on the query engine, the query planner/optimizer will often
rewrite `IN` or `EXISTS` subqueries into the same set of relational
algebra operations.
### Comparison with scalar aggregates
Sometime you want to compare a value with an unconditional aggregate
value from a different table. Take the SQL:
```sql
SELECT *
FROM table1
WHERE value1 > (
SELECT max(value2)
FROM table2
)
```
With Ibis, the code is simpler and more pandas-like:
```{python}
expr = t1[t1.value1 > t2.value2.max()]
ibis.to_sql(expr)
```
### Conditional aggregates
Suppose you want to compare a value with the aggregate value for some
common group values between two tables. Here's some SQL:
```sql
SELECT *
FROM table1 t0
WHERE value1 > (
SELECT avg(value2)
FROM table2 t1
WHERE t0.key1 = t1.key3
)
```
This query computes the average for each distinct value of `key3` and
uses the corresponding average for the comparison, rather than the
whole-table average as above.
With Ibis, the code is similar, but you add the correlated filter to the
average statistic:
```{python}
stat = t2[t1.key1 == t2.key3].value2.mean()
expr = t1[t1.value1 > stat]
ibis.to_sql(expr)
```
## `DISTINCT` expressions
In SQL, the `DISTINCT` keyword is used in a couple of ways:
- Deduplicating identical rows in some `SELECT` statement
- Aggregating on the distinct values of some column expression
Ibis supports both use cases. So let's have a look. The first case is
the simplest: call `distinct` on a table expression. First, here's the
SQL:
```sql
SELECT DISTINCT *
FROM table1
```
And the Ibis Python code:
```{python}
expr = t1.distinct()
ibis.to_sql(expr)
```
For distinct aggregates, the most common case is `COUNT(DISTINCT ...)`,
which computes the number of unique values in an expression. So if