### <div class="alert alert-success" style="background:#2C3E50;color:white">Compression Algorithms</div>

<p style="background:#F1C40F"><b>General Guidelines</b></p>

* You need to balance the processing capacity required to compress and uncompress the data, the disk IO required to read and write the data, and the network bandwidth required to send the data across the network. The correct balance of these factors depends upon the characteristics of your cluster and your data, as well as your usage patterns.
* Compression is not recommended if your data is already compressed (such as images in JPEG format). The resulting file can actually be larger than the original.
* gzip compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. gzip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO is a better choice for hot data, which is accessed frequently.
* bzip2 can produce more compression than gzip for some types of files, at the cost of some speed when compressing and decompressing. HBase does not support bzip2 compression.
* Snappy often performs better than LZO. It is worth running tests to see if you detect a significant difference.

* Some of the standard compression algorithms are -
    - gzip
    - snappy
    - lzo
    - bzip2
* Some algorithms are splittable while others are non-splittable.
    - A file compressed with a splittable algorithm can be divided into parts that separate tasks process in parallel.
* Most of the algorithms have both native and Java implementations (bzip2 is the exception in older Hadoop releases, which provide only a Java implementation for it).
* Native implementations are generally faster than Java implementations.
* In Spark, intermediate data can be compressed as well as the final output.
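
The trade-offs above can be tried out on a small scale with Python's standard library. This is only a rough sketch: the sample rows are made up, seeded pseudo-random bytes stand in for already-compressed data such as JPEG, and Snappy and LZO are left out because they are not part of the standard library.

```python
import bz2
import gzip
import random
import time


def measure(name, compress, data):
    """Return (name, compressed size in bytes, seconds taken) for one codec."""
    start = time.perf_counter()
    out = compress(data)
    return name, len(out), time.perf_counter() - start


# Compressible sample: repetitive CSV-like text (hypothetical data).
text = b"1,2013-07-25 00:00:00.0,11599,CLOSED\n" * 20000

# Stand-in for already-compressed data: seeded pseudo-random bytes.
rng = random.Random(42)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))

results = [
    measure("gzip", gzip.compress, text),
    measure("bzip2", bz2.compress, text),
]
for name, size, seconds in results:
    print(f"{name:6s} {len(text)} -> {size} bytes in {seconds:.4f}s")

# Compressing incompressible input adds framing overhead instead of saving space.
noise_gz = gzip.compress(noise)
print(f"noise  {len(noise)} -> {len(noise_gz)} bytes")
```

Timings and ratios depend heavily on the data and the machine, so numbers from a sketch like this are only a starting point for tests on real cluster data.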

<p style="background:#AED6F1"><b>Compression Formats / Types</b></p>

<table align='left'><caption>Summary of Compression Formats</caption>
<tr><th>Compression Format</th><th>Tool</th><th>Algorithm</th><th>Filename Extension</th><th>Splittable?</th></tr>
<tr><td>Deflate[a]</td><td>N/A</td><td>DEFLATE</td><td>.deflate</td><td>No</td></tr>
<tr><td>gzip</td><td>gzip</td><td>DEFLATE</td><td>.gz</td><td>No</td></tr>
<tr><td>bzip2</td><td>bzip2</td><td>bzip2</td><td>.bz2</td><td>Yes</td></tr>
<tr><td>LZO</td><td>lzop</td><td>LZO</td><td>.lzo</td><td>No[b]</td></tr>
<tr><td>LZ4</td><td>N/A</td><td>LZ4</td><td>.lz4</td><td>No</td></tr>    
<tr><td>Snappy</td><td>N/A</td><td>Snappy</td><td>.snappy</td><td>No</td></tr>
<tr><td>[a] DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with an extra header and footer.) The .deflate extension is a Hadoop convention.</td><td></td><td></td><td></td><td></td></tr>
<tr><td>[b] However, LZO files are splittable, if they have been indexed in a preprocessing step.</td><td></td><td></td><td></td><td></td></tr>
</table>
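
Footnote [a] says that a gzip file is a DEFLATE stream wrapped in a header and a footer. The sketch below checks this with Python's standard library. It assumes the header has no optional fields (which is what `gzip.compress` produces), so the header is 10 bytes and the trailer is 8 bytes; the sample data is made up.

```python
import gzip
import struct
import zlib

data = b"order_id,order_date,order_cust_id,order_status\n" * 100

gz = gzip.compress(data)

# A gzip member starts with the magic bytes 1f 8b and method 8 (DEFLATE).
magic, method = gz[:2], gz[2]

# With no optional header fields, the header is 10 bytes and the trailer is
# 8 bytes: CRC-32 of the original data, then its length, both little-endian.
body = gz[10:-8]
crc, size = struct.unpack("<II", gz[-8:])

# The body is a raw DEFLATE stream; wbits=-15 tells zlib to expect no wrapper.
restored = zlib.decompress(body, wbits=-15)

print(magic.hex(), method, restored == data, size == len(data))
```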

<p style="background:#AED6F1"><b>Compression Reading &amp; Writing</b></p>

* Compressing Text / CSV Files -
    - Reading - No special action needs to be taken as long as the file uses a supported algorithm.
    - Writing - 
        - Can compress to most of the algorithms (bzip2, deflate, uncompressed, lz4, gzip, snappy, none)
        - Set the option on <code>df.write</code> before calling <code>csv</code> -<br>
        <code>df.write.option("codec", "gzip").csv("&lt;PATH&gt;")</code>
        - The <code>compression</code> option works as well, and it is the option name that the JSON writer expects.

* Compressing Json Files -
    - Reading - No special action needs to be taken as long as the file uses a supported algorithm.
    - Writing - 
        - Can compress to most of the algorithms (bzip2, deflate, uncompressed, lz4, gzip, snappy, none)
        - Use option with compression -<br>
        <code>option('compression', 'gzip')</code>

* Compressing orc Files -
    - Reading - No special action needs to be taken as long as the file uses a supported algorithm.
    - Writing - 
        - Default - Snappy
        - Supported codecs - none, uncompressed, snappy, zlib, lzo.
    - <code>spark.sql.orc.compression.codec</code> - Sets the compression codec used when writing ORC files. If either compression or orc.compress is specified in the table-specific options/properties, the precedence would be compression, orc.compress, spark.sql.orc.compression.codec.

* Compressing parquet Files -
    - Reading - No special action needs to be taken as long as the file uses a supported algorithm.
    - Writing - 
        - Default - Snappy
        - Supported codecs - none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
        - Set <code>spark.sql.parquet.compression.codec</code> to the appropriate algorithm.
        - <code>spark.sql.parquet.compression.codec</code> - Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec.
        - <b>compression</b> – compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd). This overrides spark.sql.parquet.compression.codec. If None is set, the value specified in spark.sql.parquet.compression.codec is used.

* Compressing avro Files -
    - Reading - No special action needs to be taken as long as the file uses a supported algorithm.
    - Writing - 
        - Default - Snappy
        - Supported codecs: uncompressed, deflate, snappy, bzip2 and xz.
        - Set <code>spark.sql.avro.compression.codec</code> to the appropriate algorithm.
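
The precedence rule quoted for ORC and Parquet can be modelled in a few lines of plain Python. This is an illustrative model of the rule as stated in these notes, not Spark's own code: `resolve_codec`, `PRECEDENCE` and `DEFAULT_CODEC` are hypothetical names, and lowercasing the result is just a convenience.

```python
# Option and configuration keys per format, highest precedence first,
# following the order given in the notes above.
PRECEDENCE = {
    "orc": ("compression", "orc.compress", "spark.sql.orc.compression.codec"),
    "parquet": ("compression", "parquet.compression",
                "spark.sql.parquet.compression.codec"),
}
DEFAULT_CODEC = "snappy"  # default noted above for both formats


def resolve_codec(fmt, options, session_conf):
    """Return the codec name that wins for one write (illustrative model)."""
    first, second, conf_key = PRECEDENCE[fmt]
    for key in (first, second):
        value = options.get(key)
        if value is not None:
            return value.lower()
    return session_conf.get(conf_key, DEFAULT_CODEC).lower()


session = {"spark.sql.parquet.compression.codec": "gzip"}

# No write option: the session-level setting applies.
print(resolve_codec("parquet", {}, session))
# parquet.compression beats the session-level setting.
print(resolve_codec("parquet", {"parquet.compression": "lzo"}, session))
# compression beats both.
print(resolve_codec("parquet", {"compression": "snappy",
                                "parquet.compression": "lzo"}, session))
# Nothing set anywhere: the default applies.
print(resolve_codec("orc", {}, {}))
```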

<p style="background:#AED6F1"><b>Compression Examples using different file formats</b></p>

<p style="background :#d0d5db"><b>Loading pyspark using command below</b> </p>

In [None]:
pyspark --master yarn --conf spark.ui.port=21117 --packages com.databricks:spark-avro_2.11:4.0.0

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName('orders_testing'). \
    master('local'). \
    getOrCreate()

In [4]:
>>> ordersCSV = spark.read. \
                    format('csv'). \
                    schema('order_id int, order_date string, order_cust_id int, order_status string'). \
                    load('/Users/monikamendiratta/data/retail_db/orders/part-00000.csv')

In [5]:
>>> ordersCSV.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_cust_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [6]:
ordersCSV.show()

+--------+--------------------+-------------+---------------+
|order_id|          order_date|order_cust_id|   order_status|
+--------+--------------------+-------------+---------------+
|       1|2013-07-25 00:00:...|        11599|         CLOSED|
|       2|2013-07-25 00:00:...|          256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|        12111|       COMPLETE|
|       4|2013-07-25 00:00:...|         8827|         CLOSED|
|       5|2013-07-25 00:00:...|        11318|       COMPLETE|
|       6|2013-07-25 00:00:...|         7130|       COMPLETE|
|       7|2013-07-25 00:00:...|         4530|       COMPLETE|
|       8|2013-07-25 00:00:...|         2911|     PROCESSING|
|       9|2013-07-25 00:00:...|         5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|         5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|          918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|         1837|         CLOSED|
|      13|2013-07-25 00:00:...|         9149|PENDING_PAYMENT|
|      1

In [None]:
>>> orders = ordersCSV

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab -</b> Creating directories for better clarity and visibility </p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/csv
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/avro
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/json
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/text
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/parquet
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/orc

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed

<p style="background:#F1C40F"><b> CSV</b></p>

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

<p style="background:#AED6F1"><b> Writing/ Saving to .csv file with gzip compression</b></p>

In [None]:
>>> orders.write.\
...     format('csv'). \
...     option('codec', 'gzip').\
...     save('/user/monahadoop/spark_compressed/csv/orders_csv')

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab</b> </p>

<p style="background:#AED6F1"><b> Viewing .csv file with gzip compression</b></p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/spark_compressed/csv

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls  /user/monahadoop/spark_compressed/csv/orders_csv

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -h /public/retail_db/orders

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -h /user/monahadoop/spark_compressed/csv/orders_csv

<p style="background:#F1C40F"><b> JSON</b></p>

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

<p style="background:#AED6F1"><b> Writing/ Saving to .json file with gzip compression</b></p>

In [None]:
>>> orders.write.\
...     format('json'). \
...     option('codec', 'gzip'). \
...     save('/user/monahadoop/spark_compressed/json/orders_json')

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab</b> </p>

<p style="background:#AED6F1"><b> Viewing .json file with gzip compression</b> - the output did not get compressed, because the write command above used the option 'codec' instead of 'compression'. </p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed/json

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed/json/orders_json
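
Besides looking at the file extension in the listing, a part file can be checked for the gzip magic bytes (`1f 8b`). The sketch below works on local files, so it assumes the part file has first been copied out of HDFS (for example with `hdfs dfs -get`). `is_gzip` and the file names are hypothetical and used only for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demonstration with two throwaway local files (hypothetical names).
with tempfile.TemporaryDirectory() as workdir:
    plain = os.path.join(workdir, "part-00000.json")
    packed = os.path.join(workdir, "part-00000.json.gz")
    record = b'{"order_id": 1, "order_status": "CLOSED"}\n'
    with open(plain, "wb") as handle:
        handle.write(record)
    with gzip.open(packed, "wb") as handle:
        handle.write(record)
    checks = {os.path.basename(p): is_gzip(p) for p in (plain, packed)}

print(checks)
```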

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -rm -R /user/monahadoop/spark_compressed/json

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_compressed/json

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

<p style="background:#AED6F1"><b> Writing/ Saving to .json file with gzip compression - </b>Again, with the correct command containing 'compression' in option instead of 'codec' </p>

In [None]:
>>> orders.write.option('compression', 'gzip').json('/user/monahadoop/spark_compressed/json/orders_json')

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab</b> </p>

<p style="background:#AED6F1"><b> Viewing .json file with gzip compression</b> - compressed this time </p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed/json

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed/json/orders_json

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -h /user/monahadoop/spark_compressed/json/orders_json

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

<p style="background:#AED6F1"><b> Writing/ Saving to .json file with gzip compression - </b>with mode='overwrite' </p>

In [None]:
>>> orders.write.option('compression', 'gzip').json('/user/monahadoop/spark_compressed/json/orders_json', mode='overwrite')
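
The write pattern used in the cells above can be wrapped in a small helper so that the `compression` option and the save mode are always set the same way. This is a minimal sketch: `write_compressed` is a hypothetical name, and it relies only on the `format`, `option`, `mode` and `save` methods of `DataFrameWriter`. It is meant for the formats whose notes above mention a `compression` option; for the Avro package used in this notebook the codec is set through `spark.sql.avro.compression.codec` instead.

```python
def write_compressed(df, fmt, path, codec="gzip", mode="overwrite"):
    """Write `df` to `path` in format `fmt` using the `compression` option.

    `df` is expected to be a Spark DataFrame; nothing is imported here
    because only its `write` attribute is used.
    """
    (df.write
       .format(fmt)
       .option("compression", codec)
       .mode(mode)
       .save(path))


# Example call inside a pyspark session where `orders` already exists:
# write_compressed(orders, "json",
#                  "/user/monahadoop/spark_compressed/json/orders_json")
```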

<p style="background:#F1C40F"><b> PARQUET</b></p>

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

<p style="background:#AED6F1"><b> Writing/ Saving to .parquet file with gzip compression</b></p>

In [None]:
>>> #setting spark property for parquet file compression 

>>> spark.conf.set('spark.sql.parquet.compression.codec', 'gzip')

In [None]:
>>> # writing without an explicit compression option; the session-level codec set above (gzip) applies

>>> orders.write.parquet('/user/monahadoop/spark_compressed/parquet/orders_parquet')

In [None]:
>>> orders.write.\
...     format('parquet').\
...     option('compression', 'gzip'). \
...     save('/user/monahadoop/spark_compressed/parquet/orders_parquet', mode='overwrite')

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab</b> </p>

<p style="background:#AED6F1"><b> Viewing .parquet file with gzip compression</b></p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed/parquet

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls /user/monahadoop/spark_compressed/parquet/orders_parquet

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -tail /user/monahadoop/spark_compressed/parquet/orders_parquet/part-00000-fa82e1cc-2974-4ad5-a615-5037de05e104-c000.gz.parquet

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -h /user/monahadoop/spark_compressed/parquet/orders_parquet

<p style="background:#F1C40F"><b> AVRO</b></p>

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

<p style="background:#AED6F1"><b> Writing/ Saving to .avro file with snappy compression</b></p>

In [None]:
>>> #setting spark property for avro file compression 

>>> spark.conf.set('spark.sql.avro.compression.codec', 'snappy')

In [None]:
>>> orders.write.\
...     format('com.databricks.spark.avro'). \
...     save('/user/monahadoop/spark_compressed/avro/orders_avro', mode='overwrite')

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab</b> </p>

<p style="background:#AED6F1"><b> Viewing .avro file with snappy compression</b></p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/spark_compressed

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/spark_compressed/avro/orders_avro

<p style="background:#F1C40F"><b> Reading AVRO & PARQUET Files</b></p>

<p style="background :#d0d5db"><b>Reading compressed .avro file - </b> To show reading needs no special action.</p>

In [None]:
>>> #reading avro compressed file

>>> spark.read. \
...     format('com.databricks.spark.avro'). \
...     load('/user/monahadoop/spark_compressed/avro/orders_avro').\
...     show()

<p style="background :#d0d5db"><b>Reading uncompressed .avro file</b></p>

<p style="background :#d0d5db"><b>HDFS commands executed on other terminal tab</b> </p>

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_uncompressed

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -mkdir /user/monahadoop/spark_uncompressed/avro

<p style="background :#d0d5db"><b>on pyspark terminal tab</b> </p>

In [None]:
>>> spark.conf.set('spark.sql.avro.compression.codec', 'uncompressed')

In [None]:
>>> #first creating an uncompressed demo .avro file

>>> orders.write. \
...     format('com.databricks.spark.avro').\
...     save('/user/monahadoop/spark_uncompressed/avro/orders_avro')

In [None]:
>>> #reading uncompressed avro file

>>> spark.read.\
...     format('com.databricks.spark.avro').\
...     load('/user/monahadoop/spark_uncompressed/avro/orders_avro').show()

In [None]:
>>> df = spark.read.\
...     format('com.databricks.spark.avro').\
...     load('/user/monahadoop/spark_uncompressed/avro/orders_avro')

In [None]:
>>> df.show()

In [None]:
>>> df.printSchema()

<p style="background :#d0d5db"><b>Reading compressed .parquet file - </b> To show reading needs no special action.</p>

In [None]:
>>> spark.read.\
...     format('parquet'). \
...     load('/user/monahadoop/spark_compressed/parquet/orders_parquet').show()

In [None]:
>>> spark.read.\
...     format('parquet'). \
...     load('/user/monahadoop/spark_compressed/parquet/orders_parquet').printSchema()

<p style="background:#AED6F1"><b>Criteria and Tips</b></p>

* Prefer the algorithms that have a native implementation.
* Most of the compression algorithms that have a native implementation are non-splittable, which means that a compressed file is processed by one task only, irrespective of its size.
* To work around this one-task-per-file limitation, make sure when writing that the data is saved in multiple files of manageable size.
* Some file formats, such as Parquet and ORC, are compressed by default. It is usually better to keep the default compression (for example, Parquet is compressed using Snappy).
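
For the tip about saving multiple files of manageable size, a rough file count can be derived from the expected output size. This is a sketch under assumptions: `files_for_target_size` is a hypothetical helper, the 128 MB target is an arbitrary choice, and the example size is the 1,564,151,568-byte gzip file listed at the end of this section.

```python
import math


def files_for_target_size(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files needed so each is at most target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return math.ceil(total_bytes / target_file_bytes)


compressed_bytes = 1_564_151_568
n = files_for_target_size(compressed_bytes)
print(n)
```

Calling `df.repartition(n)` before the write would then produce `n` part files. Partitions are only approximately equal in size, so the number is a starting point rather than a guarantee.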

<p style="background:#F1C40F"><b> Example demonstrating Splittable vs Non-Splittable Compression</b></p>

Taking <b>/public/yelp-dataset/yelp_review.csv</b> original uncompressed file for demonstrating.

Steps for each -

* uncompressed file at this HDFS location<br>
    **/public/yelp-dataset/yelp_review.csv**
    - check its number of blocks and files on HDFS using the hdfs fsck command.
    - This should display more than 24 blocks as the file size is 3.8 GB (approx) **[29 blocks]**
    - create an RDD for this file and do a count on that RDD; the count runs on the executors and we can see that it takes more than 24 tasks to complete.<br><br>
    
* compress this file at some other HDFS location<br>
    **/user/monahadoop/spark_compressed/csv**
    - check its number of blocks and files on HDFS using the hdfs fsck command.
    - This should display more than 24 blocks, as the write creates one .gz part file per partition. **[29 blocks]**
    - create an RDD for this directory and do a count on that RDD; each .gz part file is read by one task, and the run recorded below took 29 tasks to complete the count.<br><br>
    
* compress deliberately so that there is only one non-splittable compressed file at some other HDFS location (using coalesce)<br>
    **/user/monahadoop/spark_compressed**
    - check its number of blocks and files on HDFS using the hdfs fsck command.
    - The expectation was 1 block if the file is compressed in a non-splittable manner.
    - This instead displayed 19 blocks, as the file size is 1.6 GB (approx) when compressed. The write produced a single .snappy.parquet file (no format was specified, so Spark used its default Parquet format), and HDFS still stores one large file across several blocks. **[19 blocks]**
    - create an RDD for this file and do a count on that RDD; a file that is not split takes 1 task, but the run recorded below took 19 tasks because the Parquet file was read as plain text and split by block.
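
The block and task counts in these steps can be estimated up front. The sketch below assumes an HDFS block size of 128 MB and a source file of about 3.8 GB (the size consistent with the 29 blocks reported), and it treats splittability by filename extension as in the table earlier in this section. `block_count` and `expected_tasks` are hypothetical helpers, and the result is only a rough estimate of what `sc.textFile` does.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (dfs.blocksize)

# Splittability by filename extension, following the table in this section.
# LZO is listed as non-splittable because no index is assumed.
SPLITTABLE = {".bz2": True, ".gz": False, ".deflate": False,
              ".lzo": False, ".lz4": False, ".snappy": False}


def block_count(size_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file of this size occupies."""
    return max(1, math.ceil(size_bytes / block_size))


def expected_tasks(filename, size_bytes, block_size=BLOCK_SIZE):
    """Rough task count for reading one text file with sc.textFile."""
    for ext, splittable in SPLITTABLE.items():
        if filename.endswith(ext):
            return block_count(size_bytes, block_size) if splittable else 1
    # No known compression extension: treated as splittable plain text.
    return block_count(size_bytes, block_size)


print(block_count(3_800_000_000))                           # 29
print(expected_tasks("yelp_review.csv", 3_800_000_000))     # 29
print(expected_tasks("part-00000.csv.gz", 1_564_151_568))   # 1
```

A non-splittable file still occupies several HDFS blocks when it is larger than the block size; what changes is that a single task has to read all of them.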

<p style="background :#d0d5db">uncompressed file at this HDFS location<br>
    <b>/public/yelp-dataset/yelp_review.csv</b></p>

In [None]:
[monahadoop@gw03 ~]$ hdfs fsck /public/yelp-dataset/yelp_review.csv -files -blocks

In [None]:
# creating RDD and counting -> this took 29 tasks

In [None]:
>>> yelp_rvw = sc.textFile('/public/yelp-dataset/yelp_review.csv')

In [None]:
>>> yelp_rvw.count()

<p style="background :#d0d5db">compress this file at some other HDFS location<br>
    <b>/user/monahadoop/spark_compressed/csv</b></p>

In [None]:
>>> yelpdf = spark.read. \
...             format('csv'). \
...             load('/public/yelp-dataset/yelp_review.csv')

In [None]:
# normally_compressed

>>> yelpdf.write.\
...     option('codec', 'gzip'). \
...     format('csv').\
...     save('/user/monahadoop/spark_compressed/csv/yelp_review_csv/normal_compressed')

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/spark_compressed/csv/yelp_review_csv/normal_compressed

In [None]:
[monahadoop@gw03 ~]$ hdfs fsck /user/monahadoop/spark_compressed/csv/yelp_review_csv/normal_compressed

In [None]:
# creating RDD and counting -> took 29 tasks

In [None]:
>>> yelp_nc = sc.textFile('/user/monahadoop/spark_compressed/csv/yelp_review_csv/normal_compressed')

In [None]:
>>> yelp_nc.count()

<p style="background :#d0d5db">compress deliberately so that there is only one non-splittable compressed file at some other HDFS location (using coalesce)<br>
    <b>/user/monahadoop/spark_compressed</b></p>

In [None]:
>>> yelpdf.\
...     coalesce(1). \
...     write. \
...     option('codec', 'gzip'). \
...     save('/user/monahadoop/spark_compressed/csv/yelp_review_csv/coalesce_compressed', mode='overwrite')

In [None]:
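# created a .snappy.parquet file even though gzip was requested: no format was specified, so Spark used its default data source (Parquet), which does not recognize the 'codec' option and falls back to spark.sql.parquet.compression.codec (snappy by default)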

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/spark_compressed/csv/yelp_review_csv/coalesce_compressed

In [None]:
[monahadoop@gw03 ~]$ hdfs fsck /user/monahadoop/spark_compressed/csv/yelp_review_csv/coalesce_compressed/part-00000-d79829db-5ab4-483f-86fa-a5b777973aff-c000.snappy.parquet

In [None]:
#  trying to do gzip compression again

In [None]:
>>> yelpdf.coalesce(1).write.option('codec', 'gzip'). \
...      save('/user/monahadoop/spark_compressed/csv/yelp_review_csv/coalesce_compressed_gz', mode='overwrite')

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/spark_compressed/csv/yelp_review_csv/coalesce_compressed_gz

In [None]:
# it still created a .snappy.parquet file, because format('csv') was still missing from the write

In [None]:
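# creating RDD and counting -> took 19 tasks (textFile reads the Parquet file as plain text and splits it by block, so this count is not a row count)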

In [None]:
>>> yelp_cc = sc.textFile('/user/monahadoop/spark_compressed/csv/yelp_review_csv/coalesce_compressed/part-00000-d79829db-5ab4-483f-86fa-a5b777973aff-c000.snappy.parquet')

In [None]:
>>> yelp_cc.count()

In [None]:
# With .format('csv') specified, this time it created a .csv.gz file

In [None]:
>>> yelpdf.coalesce(1).write.option('codec', 'gzip').format('csv').\
...     save('/user/monahadoop/compressed/yelp_review_csv')

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/compressed
Found 2 items
drwxr-xr-x   - monahadoop hdfs          0 2020-07-17 02:42 /user/monahadoop/compressed/yelp_review_csv
drwxr-xr-x   - monahadoop hdfs          0 2020-07-17 02:26 /user/monahadoop/compressed/orders

In [None]:
[monahadoop@gw03 ~]$ hdfs dfs -ls -t /user/monahadoop/compressed/yelp_review_csv
Found 2 items
-rw-r--r--   2 monahadoop hdfs          0 2020-07-17 02:42 /user/monahadoop/compressed/yelp_review_csv/_SUCCESS
-rw-r--r--   2 monahadoop hdfs 1564151568 2020-07-17 02:42 /user/monahadoop/compressed/yelp_review_csv/part-00000-97fdbe13-d109-413d-b7db-6aad143e4a62-c000.csv.gz

<p style="background :#d0d5db"><b>END</b> </p>