# Impala
[https://impala.apache.org](https://impala.apache.org)
![](https://img.shields.io/badge/memtables-native-green?style=flat-square) ![](https://img.shields.io/badge/inputs-Impala%20tables-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-Impala%20tables%20%7C%20CSV%20%7C%20pandas%20%7C%20Parquet%20%7C%20PyArrow-orange?style=flat-square)
## Install
Install Ibis and dependencies for the Impala backend:
::: {.panel-tabset}
## `pip`
Install with the `impala` extra:
```{.bash}
pip install 'ibis-framework[impala]'
```
And connect:
```{.python}
import ibis
con = ibis.impala.connect() # <1>
```
1. Adjust connection parameters as needed.
## `conda`
Install for Impala:
```{.bash}
conda install -c conda-forge ibis-impala
```
And connect:
```{.python}
import ibis
con = ibis.impala.connect() # <1>
```
1. Adjust connection parameters as needed.
## `mamba`
Install for Impala:
```{.bash}
mamba install -c conda-forge ibis-impala
```
And connect:
```{.python}
import ibis
con = ibis.impala.connect() # <1>
```
1. Adjust connection parameters as needed.
:::
```{python}
#| echo: false
# setup dynamic quartodoc rendering
from functools import partial
from _utils import get_backend, get_object, render_methods
render_methods = partial(render_methods, level=4)
backend = get_object("ibis.backends.impala", "Backend")
table = get_object("ibis.backends.impala.client", "ImpalaTable")
```
## Database methods
```{python}
#| echo: false
#| output: asis
render_methods(backend, "create_database", "drop_database", "list_databases")
```
## Table methods
The `Backend` object itself has a number of helper utility methods, but
you'll find most table-specific methods on `ImpalaTable`.
```{python}
#| echo: false
#| output: asis
render_methods(
backend,
"table",
"sql",
"raw_sql",
"list_tables",
"drop_table",
"insert",
"truncate_table",
"get_schema",
"cache_table",
)
```
The best way to interact with a single table is through the
`ImpalaTable` object you get back from `Backend.table`.
```{python}
#| echo: false
#| output: asis
render_methods(
table,
"drop",
"drop_partition",
"files",
"insert",
"is_partitioned",
"partition_schema",
"partitions",
"refresh",
"describe_formatted",
)
```
## Creating views
```{python}
#| echo: false
#| output: asis
render_methods(backend, "drop_table_or_view", "create_view")
```
## Accessing data formats in HDFS
```{python}
#| echo: false
#| output: asis
render_methods(backend, "delimited_file", "parquet_file", "avro_file")
```
## HDFS Interaction
Ibis delegates all HDFS interaction to the
[`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) library.
## The Impala client object
To use Ibis with Impala, you must first connect to a cluster using the
`ibis.impala.connect` function, optionally supplying an HDFS connection:
```python
import ibis
hdfs = ibis.impala.hdfs_connect(host=webhdfs_host, port=webhdfs_port)
client = ibis.impala.connect(host=impala_host, port=impala_port, hdfs_client=hdfs)
```
By default, the binary transport mode is used; HTTP transport is also supported.
Depending on your configuration, you may need to provide additional connection arguments.
For the full list of possible connection arguments, refer to
the [`impyla`](https://github.com/cloudera/impyla) documentation.
```python
import ibis
client = ibis.impala.connect(
host=impala_host,
port=impala_port,
username=username,
password=password,
use_ssl=True,
auth_mechanism='LDAP',
use_http_transport=True,
http_path='cliservice',
)
```
All examples here use the following block of code to connect to Impala
running in Docker:
```python
import ibis
host = "localhost"
hdfs = ibis.impala.hdfs_connect(host=host, port=50070)
client = ibis.impala.connect(host=host, hdfs_client=hdfs)
```
You can accomplish many tasks directly through the client object, but we
additionally provide APIs to streamline tasks involving a single Impala
table or database.
## Table objects
```{python}
#| echo: false
#| output: asis
render_methods(get_object("ibis.backends.base.sql", "BaseSQLBackend"), "table")
```
The client's `table` method allows you to create an Ibis table
expression referencing a physical Impala table:
```python
table = client.table('functional_alltypes', database='ibis_testing')
```
`ImpalaTable` is a Python subclass of the more general Ibis `Table`
that has additional Impala-specific methods. So you can use it
interchangeably with any code expecting a `Table`.
While the client has a `drop_table` method you can use to drop tables,
the table itself has a method `drop` that you can use:
```python
table.drop()
```
## Expression execution
Ibis expressions have execution methods like `to_pandas` that compile and run the
expressions on Impala or whichever backend is being referenced.
For example:
```python
>>> fa = db.functional_alltypes
>>> expr = fa.double_col.sum()
>>> expr.to_pandas()
331785.00000000006
```
For longer-running queries, Ibis will attempt to cancel the query in
progress if an interrupt is received.
## Creating tables
There are several ways to create new Impala tables:
- From an Ibis table expression
- Empty, from a declared schema
- Empty and partitioned
In all cases, you should use the `create_table` method either on the
top-level client connection or a database object.
```{python}
#| echo: false
#| output: asis
render_methods(backend, "create_table")
```
### Creating tables from a table expression
If you pass an Ibis expression to `create_table`, Ibis issues a
`CREATE TABLE ... AS SELECT` (CTAS) statement:
```python
>>> table = db.table('functional_alltypes')
>>> expr = table.group_by('string_col').size()
>>> db.create_table('string_freqs', expr, format='parquet')
>>> freqs = db.table('string_freqs')
>>> freqs.to_pandas()
string_col count
0 9 730
1 3 730
2 6 730
3 4 730
4 1 730
5 8 730
6 2 730
7 7 730
8 5 730
9 0 730
>>> files = freqs.files()
>>> files
Path Size Partition
0 hdfs://impala:8020/user/hive/warehouse/ibis_te... 584B
>>> freqs.drop()
```
You can also choose to create an empty table and use `insert` (see
below).
### Creating an empty table
To create an empty table, you must declare an Ibis schema that will be
translated to the appropriate Impala schema and data types.
Because Ibis types are simplified compared with Impala types, schema
declarations may become more fine-grained in the future.
You can use the `create_table` method either on a database or client
object.
```python
schema = ibis.schema([('foo', 'string'),
('year', 'int32'),
('month', 'int16')])
name = 'new_table'
db.create_table(name, schema=schema)
```
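The type translation mentioned above can be sketched roughly as follows. The `IBIS_TO_IMPALA` mapping and the `create_table_ddl` helper are hypothetical illustrations of the idea, not the translation code Ibis itself uses:

```python
# Hypothetical sketch: how simplified Ibis types might map onto Impala
# DDL types. This is an illustration only, not Ibis's actual code.
IBIS_TO_IMPALA = {
    'int8': 'TINYINT',
    'int16': 'SMALLINT',
    'int32': 'INT',
    'int64': 'BIGINT',
    'float32': 'FLOAT',
    'float64': 'DOUBLE',
    'boolean': 'BOOLEAN',
    'string': 'STRING',
    'timestamp': 'TIMESTAMP',
}


def create_table_ddl(name, schema):
    """Render a CREATE TABLE statement from (column, ibis_type) pairs."""
    cols = ',\n  '.join(
        f'`{col}` {IBIS_TO_IMPALA[typ]}' for col, typ in schema
    )
    return f'CREATE TABLE {name} (\n  {cols}\n)'


print(create_table_ddl(
    'new_table',
    [('foo', 'string'), ('year', 'int32'), ('month', 'int16')],
))
```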
By default, this stores the data files in the database default location.
You can force a particular path with the `location` option.
```python
from getpass import getuser
schema = ibis.schema([('foo', 'string'),
('year', 'int32'),
('month', 'int16')])
name = 'new_table'
location = '/home/{}/new-table-data'.format(getuser())
db.create_table(name, schema=schema, location=location)
```
If the schema matches a known table schema, you can always use the
`schema` method to get a schema object:
```python
>>> t = db.table('functional_alltypes')
>>> t.schema()
ibis.Schema {
id int32
bool_col boolean
tinyint_col int8
smallint_col int16
int_col int32
bigint_col int64
float_col float32
double_col float64
date_string_col string
string_col string
timestamp_col timestamp
year int32
month int32
}
```
### Creating a partitioned table
To create an empty partitioned table, include a list of columns to be
used as the partition keys.
```python
schema = ibis.schema([('foo', 'string'),
('year', 'int32'),
('month', 'int16')])
name = 'new_table'
db.create_table(name, schema=schema, partition=['year', 'month'])
```
## Partitioned tables
Ibis enables you to manage partitioned tables in various ways. Since
each partition behaves as its own "subtable" sharing a common schema, it
can have its own file format, directory path, serialization properties,
and so forth.
There are a handful of table methods for adding and removing partitions
and getting information about the partition schema and any existing
partition data:
```{python}
#| echo: false
#| output: asis
render_methods(
table,
"add_partition",
"drop_partition",
"is_partitioned",
"partition_schema",
"partitions",
)
```
To address a specific partition, any partition-specific method accepts
either a dict mapping the partition key names to values, or a list of
the partition values:
```python
schema = ibis.schema([('foo', 'string'),
('year', 'int32'),
('month', 'int16')])
name = 'new_table'
db.create_table(name, schema=schema, partition=['year', 'month'])
table = db.table(name)
table.add_partition({'year': 2007, 'month': 4})
table.add_partition([2007, 5])
table.add_partition([2007, 6])
table.drop_partition([2007, 6])
```
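Both addressing forms ultimately resolve to the same `PARTITION (...)` clause in the emitted DDL. A hypothetical sketch of that normalization (not Ibis's actual implementation):

```python
def partition_clause(spec, partition_keys):
    """Normalize a dict or list partition spec into a PARTITION clause.

    spec: either {'year': 2007, 'month': 4} or [2007, 4]
    partition_keys: ordered partition column names, e.g. ['year', 'month']
    """
    if isinstance(spec, dict):
        # look up each key so dict ordering doesn't matter
        values = [spec[key] for key in partition_keys]
    else:
        values = list(spec)
    pairs = ', '.join(f'{k}={v!r}' for k, v in zip(partition_keys, values))
    return f'PARTITION ({pairs})'


print(partition_clause({'year': 2007, 'month': 4}, ['year', 'month']))
# PARTITION (year=2007, month=4)
print(partition_clause([2007, 5], ['year', 'month']))
# PARTITION (year=2007, month=5)
```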
We'll cover partition metadata management and data loading below.
## Inserting data into tables
If the schemas are compatible, you can insert into a table directly from
an Ibis table expression:
```python
>>> t = db.functional_alltypes
>>> db.create_table('insert_test', schema=t.schema())
>>> target = db.table('insert_test')
>>> target.insert(t[:3])
>>> target.insert(t[:3])
>>> target.insert(t[:3])
>>> target.to_pandas()
id bool_col tinyint_col ... timestamp_col year month
0 5770 True 0 ... 2010-08-01 00:00:00.000 2010 8
1 5771 False 1 ... 2010-08-01 00:01:00.000 2010 8
2 5772 True 2 ... 2010-08-01 00:02:00.100 2010 8
3 5770 True 0 ... 2010-08-01 00:00:00.000 2010 8
4 5771 False 1 ... 2010-08-01 00:01:00.000 2010 8
5 5772 True 2 ... 2010-08-01 00:02:00.100 2010 8
6 5770 True 0 ... 2010-08-01 00:00:00.000 2010 8
7 5771 False 1 ... 2010-08-01 00:01:00.000 2010 8
8 5772 True 2 ... 2010-08-01 00:02:00.100 2010 8
[9 rows x 13 columns]
>>> target.drop()
```
If the table is partitioned, you must indicate the partition you are
inserting into:
```python
part = {'year': 2007, 'month': 4}
table.insert(expr, partition=part)
```
## Managing table metadata
Ibis has functions that wrap many of the DDL commands for Impala table
metadata.
### Detailed table metadata: `DESCRIBE FORMATTED`
To get a handy wrangled version of `DESCRIBE FORMATTED` use the
`metadata` method.
```{python}
#| echo: false
#| output: asis
render_methods(table, "metadata")
```
```python
>>> t = client.table('ibis_testing.functional_alltypes')
>>> meta = t.metadata()
>>> meta
<class 'ibis.backends.impala.metadata.TableMetadata'>
{'info': {'CreateTime': datetime.datetime(2021, 1, 14, 21, 23, 8),
'Database': 'ibis_testing',
'LastAccessTime': 'UNKNOWN',
'Location': 'hdfs://impala:8020/__ibis/ibis-testing-data/parquet/functional_alltypes',
'Owner': 'root',
'Protect Mode': 'None',
'Retention': 0,
'Table Parameters': {'COLUMN_STATS_ACCURATE': False,
'EXTERNAL': True,
'STATS_GENERATED_VIA_STATS_TASK': True,
'numFiles': 3,
'numRows': 7300,
'rawDataSize': '-1',
'totalSize': 106278,
'transient_lastDdlTime': datetime.datetime(2021, 1, 14, 21, 23, 17)},
'Table Type': 'EXTERNAL_TABLE'},
'schema': [('id', 'int'),
('bool_col', 'boolean'),
('tinyint_col', 'tinyint'),
('smallint_col', 'smallint'),
('int_col', 'int'),
('bigint_col', 'bigint'),
('float_col', 'float'),
('double_col', 'double'),
('date_string_col', 'string'),
('string_col', 'string'),
('timestamp_col', 'timestamp'),
('year', 'int'),
('month', 'int')],
'storage info': {'Bucket Columns': '[]',
'Compressed': False,
'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
'Num Buckets': 0,
'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
'SerDe Library': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe',
'Sort Columns': '[]'}}
>>> meta.location
'hdfs://impala:8020/__ibis/ibis-testing-data/parquet/functional_alltypes'
>>> meta.create_time
datetime.datetime(2021, 1, 14, 21, 23, 8)
```
The `files` method is also available to see all of the physical HDFS
data files backing a table:
```{python}
#| echo: false
#| output: asis
render_methods(table, "files")
```
```python
>>> ss = c.table('tpcds_parquet.store_sales')
>>> ss.files()[:5]
path size \
0 hdfs://localhost:20500/test-warehouse/tpcds.st... 160.61KB
1 hdfs://localhost:20500/test-warehouse/tpcds.st... 123.88KB
2 hdfs://localhost:20500/test-warehouse/tpcds.st... 139.28KB
3 hdfs://localhost:20500/test-warehouse/tpcds.st... 139.60KB
4 hdfs://localhost:20500/test-warehouse/tpcds.st... 62.84KB
partition
0 ss_sold_date_sk=2451803
1 ss_sold_date_sk=2451819
2 ss_sold_date_sk=2451772
3 ss_sold_date_sk=2451789
4 ss_sold_date_sk=2451741
```
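The `size` column above holds human-readable strings. To aggregate them, a small hypothetical helper (not part of the Ibis API) can convert the strings back to byte counts:

```python
import re

# Hypothetical helper: parse sizes like '160.61KB' into bytes so they
# can be summed. Not part of the Ibis API.
UNITS = {'B': 1, 'KB': 2**10, 'MB': 2**20, 'GB': 2**30}


def parse_size(text):
    match = re.fullmatch(r'([0-9.]+)\s*([KMG]?B)', text.strip())
    number, unit = match.groups()
    return float(number) * UNITS[unit]


sizes = ['160.61KB', '123.88KB', '139.28KB', '139.60KB', '62.84KB']
total = sum(parse_size(s) for s in sizes)
print(f'{total / 2**10:.2f}KB')
# 626.21KB
```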
### Modifying table metadata
For unpartitioned tables, you can use the `alter` method to change its
location, file format, and other properties. For partitioned tables, to
change partition-specific metadata use `alter_partition`.
```{python}
#| echo: false
#| output: asis
render_methods(table, "alter", "alter_partition")
```
For example, if you wanted to "point" an existing table at a directory
of CSV files, you could run the following command:
```python
from getpass import getuser
csv_props = {
'serialization.format': ',',
'field.delim': ',',
}
data_dir = '/home/{}/my-csv-files'.format(getuser())
table.alter(location=data_dir, format='text', serde_properties=csv_props)
```
If the table is partitioned, you can modify only the properties of a
particular partition:
```python
table.alter_partition(
{'year': 2007, 'month': 5},
location=data_dir,
format='text',
serde_properties=csv_props
)
```
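Conceptually, the `serde_properties` argument is rendered into a `SET SERDEPROPERTIES (...)` clause in the generated `ALTER TABLE` statement. A hypothetical sketch of that rendering (not Ibis's actual DDL code):

```python
def serde_properties_clause(props):
    """Render a SET SERDEPROPERTIES clause from a properties dict.

    Illustration only; Ibis generates this DDL internally.
    """
    pairs = ', '.join(f"'{key}'='{value}'" for key, value in props.items())
    return f'SET SERDEPROPERTIES ({pairs})'


csv_props = {
    'serialization.format': ',',
    'field.delim': ',',
}
print(serde_properties_clause(csv_props))
# SET SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=',')
```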
## Table statistics
### Computing table and partition statistics
```{python}
#| echo: false
#| output: asis
render_methods(table, "compute_stats")
```
Impala-backed physical tables have a method `compute_stats` that
computes table, column, and partition-level statistics to assist with
query planning and optimization. It is standard practice to invoke this
after creating a table or loading new data:
```python
table.compute_stats()
```
If you are using a recent version of Impala, you can also access the
`COMPUTE INCREMENTAL STATS` DDL command:
```python
table.compute_stats(incremental=True)
```
### Seeing table and column statistics
```{python}
#| echo: false
#| output: asis
render_methods(table, "column_stats", "stats")
```
The `column_stats` and `stats` methods return the results of
`SHOW COLUMN STATS` and `SHOW TABLE STATS`, respectively, and their
output will depend, of course, on the last `COMPUTE STATS` call.
```python
>>> ss = c.table('tpcds_parquet.store_sales')
>>> ss.compute_stats(incremental=True)
>>> stats = ss.stats()
>>> stats[:5]
ss_sold_date_sk #Rows #Files Size Bytes Cached Cache Replication \
0 2450829 1071 1 78.34KB NOT CACHED NOT CACHED
1 2450846 839 1 61.83KB NOT CACHED NOT CACHED
2 2450860 747 1 54.86KB NOT CACHED NOT CACHED
3 2450874 922 1 66.74KB NOT CACHED NOT CACHED
4 2450888 856 1 63.33KB NOT CACHED NOT CACHED
Format Incremental stats \
0 PARQUET true
1 PARQUET true
2 PARQUET true
3 PARQUET true
4 PARQUET true
Location
0 hdfs://localhost:20500/test-warehouse/tpcds.st...
1 hdfs://localhost:20500/test-warehouse/tpcds.st...
2 hdfs://localhost:20500/test-warehouse/tpcds.st...
3 hdfs://localhost:20500/test-warehouse/tpcds.st...
4 hdfs://localhost:20500/test-warehouse/tpcds.st...
>>> cstats = ss.column_stats()
>>> cstats
Column Type #Distinct Values #Nulls Max Size Avg Size
0 ss_sold_time_sk BIGINT 13879 -1 NaN 8
1 ss_item_sk BIGINT 17925 -1 NaN 8
2 ss_customer_sk BIGINT 15207 -1 NaN 8
3 ss_cdemo_sk BIGINT 16968 -1 NaN 8
4 ss_hdemo_sk BIGINT 6220 -1 NaN 8
5 ss_addr_sk BIGINT 14077 -1 NaN 8
6 ss_store_sk BIGINT 6 -1 NaN 8
7 ss_promo_sk BIGINT 298 -1 NaN 8
8 ss_ticket_number INT 15006 -1 NaN 4
9 ss_quantity INT 99 -1 NaN 4
10 ss_wholesale_cost DECIMAL(7,2) 10196 -1 NaN 4
11 ss_list_price DECIMAL(7,2) 19393 -1 NaN 4
12 ss_sales_price DECIMAL(7,2) 15594 -1 NaN 4
13 ss_ext_discount_amt DECIMAL(7,2) 29772 -1 NaN 4
14 ss_ext_sales_price DECIMAL(7,2) 102758 -1 NaN 4
15 ss_ext_wholesale_cost DECIMAL(7,2) 125448 -1 NaN 4
16 ss_ext_list_price DECIMAL(7,2) 141419 -1 NaN 4
17 ss_ext_tax DECIMAL(7,2) 33837 -1 NaN 4
18 ss_coupon_amt DECIMAL(7,2) 29772 -1 NaN 4
19 ss_net_paid DECIMAL(7,2) 109981 -1 NaN 4
20 ss_net_paid_inc_tax DECIMAL(7,2) 132286 -1 NaN 4
21 ss_net_profit DECIMAL(7,2) 122436 -1 NaN 4
22 ss_sold_date_sk BIGINT 120 0 NaN 8
```
### `REFRESH` and `INVALIDATE METADATA`
These DDL commands are available as table-level and client-level
methods:
```{python}
#| echo: false
#| output: asis
render_methods(backend, "invalidate_metadata")
render_methods(table, "invalidate_metadata", "refresh")
```
You can invalidate the cached metadata for a single table or for all
tables using `invalidate_metadata`, and similarly invoke
`REFRESH db_name.table_name` using the `refresh` method.
```python
client.invalidate_metadata()
table = db.table(table_name)
table.invalidate_metadata()
table.refresh()
```
These methods are often used in conjunction with the `LOAD DATA`
commands and `COMPUTE STATS`. See the Impala documentation for full
details.
## Parquet and other session options
Ibis gives you access to Impala session-level variables that affect
query execution:
```{python}
#| echo: false
#| output: asis
render_methods(
backend,
"disable_codegen",
"get_options",
"set_options",
"set_compression_codec",
)
```
For example:
```python
>>> client.get_options()
{'ABORT_ON_ERROR': '0',
'APPX_COUNT_DISTINCT': '0',
'BUFFER_POOL_LIMIT': '',
'COMPRESSION_CODEC': '',
'COMPUTE_STATS_MIN_SAMPLE_SIZE': '1073741824',
'DEFAULT_JOIN_DISTRIBUTION_MODE': '0',
'DEFAULT_SPILLABLE_BUFFER_SIZE': '2097152',
'DISABLE_CODEGEN': '0',
'DISABLE_CODEGEN_ROWS_THRESHOLD': '50000',
'DISABLE_ROW_RUNTIME_FILTERING': '0',
'DISABLE_STREAMING_PREAGGREGATIONS': '0',
'DISABLE_UNSAFE_SPILLS': '0',
'ENABLE_EXPR_REWRITES': '1',
'EXEC_SINGLE_NODE_ROWS_THRESHOLD': '100',
'EXEC_TIME_LIMIT_S': '0',
'EXPLAIN_LEVEL': '1',
'HBASE_CACHE_BLOCKS': '0',
'HBASE_CACHING': '0',
'IDLE_SESSION_TIMEOUT': '0',
'MAX_ERRORS': '100',
'MAX_NUM_RUNTIME_FILTERS': '10',
'MAX_ROW_SIZE': '524288',
'MEM_LIMIT': '0',
'MIN_SPILLABLE_BUFFER_SIZE': '65536',
'MT_DOP': '',
'NUM_SCANNER_THREADS': '0',
'OPTIMIZE_PARTITION_KEY_SCANS': '0',
'PARQUET_ANNOTATE_STRINGS_UTF8': '0',
'PARQUET_ARRAY_RESOLUTION': '2',
'PARQUET_DICTIONARY_FILTERING': '1',
'PARQUET_FALLBACK_SCHEMA_RESOLUTION': '0',
'PARQUET_FILE_SIZE': '0',
'PARQUET_READ_STATISTICS': '1',
'PREFETCH_MODE': '1',
'QUERY_TIMEOUT_S': '0',
'REPLICA_PREFERENCE': '0',
'REQUEST_POOL': '',
'RUNTIME_BLOOM_FILTER_SIZE': '1048576',
'RUNTIME_FILTER_MAX_SIZE': '16777216',
'RUNTIME_FILTER_MIN_SIZE': '1048576',
'RUNTIME_FILTER_MODE': '2',
'RUNTIME_FILTER_WAIT_TIME_MS': '0',
'S3_SKIP_INSERT_STAGING': '1',
'SCHEDULE_RANDOM_REPLICA': '0',
'SCRATCH_LIMIT': '-1',
'SEQ_COMPRESSION_MODE': '',
'SYNC_DDL': '0'}
```
To enable Snappy compression for Parquet files, you can use either of the following:
```python
>>> client.set_options({'COMPRESSION_CODEC': 'snappy'})
>>> client.set_compression_codec('snappy')
>>> client.get_options()['COMPRESSION_CODEC']
'SNAPPY'
```
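Note in the snippet above that Impala reports the value back in upper case (`'snappy'` comes back as `'SNAPPY'`). A toy model of that round-trip behavior, purely for illustration (the `SessionOptions` class below is hypothetical, not part of Ibis):

```python
class SessionOptions:
    """Toy model of Impala session options: the server stores string
    option values in normalized (upper-cased) form. Illustration only."""

    def __init__(self):
        self._options = {'COMPRESSION_CODEC': ''}

    def set_options(self, options):
        # keys and values are normalized to upper case, mirroring how
        # Impala echoes them back from SET
        for key, value in options.items():
            self._options[key.upper()] = str(value).upper()

    def get_options(self):
        return dict(self._options)


opts = SessionOptions()
opts.set_options({'COMPRESSION_CODEC': 'snappy'})
print(opts.get_options()['COMPRESSION_CODEC'])
# SNAPPY
```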
## Ingesting data from pandas
Overall interoperability between the Hadoop / Spark ecosystems and
pandas / the PyData stack is poor, but it will improve in time (this is
a major part of the Ibis roadmap).
Ibis's Impala tools currently interoperate with pandas in these ways:
- Ibis expressions return pandas objects (i.e. DataFrame or Series)
for non-scalar expressions when calling their `to_pandas` method
- The `create_table` and `insert` methods can accept pandas objects.
This includes inserting into partitioned tables. It currently uses
CSV as the ingest route.
For example:
```python
>>> import pandas as pd
>>> data = pd.DataFrame({'foo': [1, 2, 3, 4], 'bar': ['a', 'b', 'c', 'd']})
>>> db.create_table('pandas_table', data)
>>> t = db.pandas_table
>>> t.to_pandas()
foo bar
0 1 a
1 2 b
2 3 c
3 4 d
>>> t.drop()
>>> db.create_table('empty_for_insert', schema=t.schema())
>>> to_insert = db.empty_for_insert
>>> to_insert.insert(data)
>>> to_insert.to_pandas()
foo bar
0 1 a
1 2 b
2 3 c
3 4 d
>>> to_insert.drop()
```
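As noted above, pandas ingest currently goes through CSV. The serialization step can be pictured with a minimal, hypothetical sketch using only the standard library (this is not the code Ibis uses):

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize rows (a list of dicts) to delimited text, the kind of
    staging data a CSV ingest route produces. Illustration only."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[col] for col in columns])
    return buf.getvalue()


data = [{'foo': 1, 'bar': 'a'}, {'foo': 2, 'bar': 'b'}]
print(rows_to_csv(data, ['foo', 'bar']))
```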
## Uploading / downloading data from HDFS
If you've set up an HDFS connection, you can use the Ibis HDFS interface
to look through your data and read and write files to and from HDFS:
```python
>>> hdfs = con.hdfs
>>> hdfs.ls('/__ibis/ibis-testing-data')
['README.md',
'avro',
'awards_players.csv',
'batting.csv',
'csv',
'diamonds.csv',
'functional_alltypes.csv',
'functional_alltypes.parquet',
'geo.csv',
'ibis_testing.db',
'parquet',
'struct_table.avro',
'udf']
```
```python
>>> hdfs.ls('/__ibis/ibis-testing-data/parquet')
['functional_alltypes',
'tpch_customer',
'tpch_lineitem',
'tpch_nation',
'tpch_orders',
'tpch_part',
'tpch_partsupp',
'tpch_region',
'tpch_supplier']
```
Suppose we wanted to download
`/__ibis/ibis-testing-data/parquet/functional_alltypes`, which is a
directory. We need only do:
```bash
$ rm -rf parquet_dir/
```
```python
>>> hdfs.get('/__ibis/ibis-testing-data/parquet/functional_alltypes',
... 'parquet_dir',
... recursive=True)
'/ibis/docs/source/tutorial/parquet_dir'
```
Now we have that directory locally:
```bash
$ ls parquet_dir/
9a41de519352ab07-4e76bc4d9fb5a789_1624886651_data.0.parq
9a41de519352ab07-4e76bc4d9fb5a78a_778826485_data.0.parq
9a41de519352ab07-4e76bc4d9fb5a78b_1277612014_data.0.parq
```
Files and directories can be written to HDFS just as easily using `put`:
```python
>>> path = '/__ibis/dir-write-example'
>>> hdfs.rm(path, recursive=True)
>>> hdfs.put('parquet_dir', path, recursive=True)
```
```python
>>> hdfs.ls('/__ibis/dir-write-example')
['9a41de519352ab07-4e76bc4d9fb5a789_1624886651_data.0.parq',
'9a41de519352ab07-4e76bc4d9fb5a78a_778826485_data.0.parq',
'9a41de519352ab07-4e76bc4d9fb5a78b_1277612014_data.0.parq']
```
Delete files and directories with `rm`:
```python
>>> hdfs.rm('/__ibis/dir-write-example', recursive=True)
```
```bash
$ rm -rf parquet_dir/
```
## Queries on Parquet, Avro, and Delimited files in HDFS
Ibis can easily create temporary or persistent Impala tables that
reference data in the following formats:
- Parquet (`parquet_file`)
- Avro (`avro_file`)
- Delimited text formats (CSV, TSV, etc.) (`delimited_file`)
Parquet is the easiest because the schema can be read from the data
files:
```python
>>> path = '/__ibis/ibis-testing-data/parquet/tpch_lineitem'
>>> lineitem = con.parquet_file(path)
>>> lineitem.limit(2)
l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice \
0 1 155190 7706 1 17.00 21168.23
1 1 67310 7311 2 36.00 45983.16
l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate \
0 0.04 0.02 N O 1996-03-13 1996-02-12
1 0.09 0.06 N O 1996-04-12 1996-02-28
l_receiptdate l_shipinstruct l_shipmode \
0 1996-03-22 DELIVER IN PERSON TRUCK
1 1996-04-20 TAKE BACK RETURN MAIL
l_comment
0 egular courts above the
1 ly final dependencies: slyly bold
```
```python
>>> lineitem.l_extendedprice.sum()
Decimal('229577310901.20')
```
If you want to query a Parquet file and also create a table in Impala
that remains after your session, you can pass more information to
`parquet_file`:
```python
>>> table = con.parquet_file(path, name='my_parquet_table',
... database='ibis_testing',
... persist=True)
>>> table.l_extendedprice.sum()
Decimal('229577310901.20')
```
```python
>>> con.table('my_parquet_table').l_extendedprice.sum()
Decimal('229577310901.20')
```
```python
>>> con.drop_table('my_parquet_table')
```
To query delimited files, you need to write down an Ibis schema. At some
point we'd like to build some helper tools that will infer the schema