# Uncompressing files in parallel

In this lab we will see how to take advantage of the `pipe` method to launch commands in parallel.

The objective is to uncompress all files in a directory in parallel.

## Files to uncompress

The files we have to uncompress are in the `/opt/cesga/cursos/pyspark_2022/datasets/compressed-files` directory.

NOTE: This directory is on NFS, not on HDFS.

In [1]:
!ls /opt/cesga/cursos/pyspark_2022/datasets/compressed-files

file100.gz  file23.gz  file37.gz  file50.gz  file64.gz	file78.gz  file91.gz
file10.gz   file24.gz  file38.gz  file51.gz  file65.gz	file79.gz  file92.gz
file11.gz   file25.gz  file39.gz  file52.gz  file66.gz	file7.gz   file93.gz
file12.gz   file26.gz  file3.gz   file53.gz  file67.gz	file80.gz  file94.gz
file13.gz   file27.gz  file40.gz  file54.gz  file68.gz	file81.gz  file95.gz
file14.gz   file28.gz  file41.gz  file55.gz  file69.gz	file82.gz  file96.gz
file15.gz   file29.gz  file42.gz  file56.gz  file6.gz	file83.gz  file97.gz
file16.gz   file2.gz   file43.gz  file57.gz  file70.gz	file84.gz  file98.gz
file17.gz   file30.gz  file44.gz  file58.gz  file71.gz	file85.gz  file99.gz
file18.gz   file31.gz  file45.gz  file59.gz  file72.gz	file86.gz  file9.gz
file19.gz   file32.gz  file46.gz  file5.gz   file73.gz	file87.gz
file1.gz    file33.gz  file47.gz  file60.gz  file74.gz	file88.gz
file20.gz   file34.gz  file48.gz  file61.gz  file75.gz	file89.gz
file21.gz   file35.gz  file49.gz  file62.gz  fi

We will create a `tmp/compressed-files-lab` directory in our HOME and copy the files there. Review the lines below before executing them, so you are sure of what they do: they open up the permissions of the new directory and of your HOME so that other users can access them.

In [2]:
!mkdir -p ~/tmp/compressed-files-lab
!cp /opt/cesga/cursos/pyspark_2022/datasets/compressed-files/*.gz ~/tmp/compressed-files-lab
!chmod a+rwx ~/tmp/compressed-files-lab
!chmod a+rx ~

**SECURITY WARNING:** Remember to restore the permissions once you finish this lab.

## Obtain the name of the files

First we need to get the names of the files from Python:

In [3]:
import os

filenames = os.listdir(os.path.expanduser('~/tmp/compressed-files-lab'))

Check that the filenames variable contains the expected results:

In [4]:
filenames

['file100.gz',
 'file10.gz',
 'file11.gz',
 'file12.gz',
 'file13.gz',
 'file14.gz',
 'file15.gz',
 'file16.gz',
 'file17.gz',
 'file18.gz',
 'file19.gz',
 'file1.gz',
 'file20.gz',
 'file21.gz',
 'file22.gz',
 'file23.gz',
 'file24.gz',
 'file25.gz',
 'file26.gz',
 'file27.gz',
 'file28.gz',
 'file29.gz',
 'file2.gz',
 'file30.gz',
 'file31.gz',
 'file32.gz',
 'file33.gz',
 'file34.gz',
 'file35.gz',
 'file36.gz',
 'file37.gz',
 'file38.gz',
 'file39.gz',
 'file3.gz',
 'file40.gz',
 'file41.gz',
 'file42.gz',
 'file43.gz',
 'file44.gz',
 'file45.gz',
 'file46.gz',
 'file47.gz',
 'file48.gz',
 'file49.gz',
 'file4.gz',
 'file50.gz',
 'file51.gz',
 'file52.gz',
 'file53.gz',
 'file54.gz',
 'file55.gz',
 'file56.gz',
 'file57.gz',
 'file58.gz',
 'file59.gz',
 'file5.gz',
 'file60.gz',
 'file61.gz',
 'file62.gz',
 'file63.gz',
 'file64.gz',
 'file65.gz',
 'file66.gz',
 'file67.gz',
 'file68.gz',
 'file69.gz',
 'file6.gz',
 'file70.gz',
 'file71.gz',
 'file72.gz',
 'file73.gz',
 'file74.gz
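
`os.listdir` returns every entry in the directory, in no guaranteed order. The following is a minimal sketch, not part of the original lab, of a hypothetical helper `list_gz_files` that keeps only the regular `.gz` files and sorts them in natural order. The demo runs on a throwaway directory with made-up file names.

```python
import os
import re
import tempfile


def list_gz_files(directory):
    """Return the names of the regular *.gz files in `directory`, naturally sorted."""
    names = [
        name
        for name in os.listdir(directory)
        if name.endswith(".gz") and os.path.isfile(os.path.join(directory, name))
    ]

    def natural_key(name):
        # Split digits from text so that 'file2.gz' sorts before 'file10.gz'.
        return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

    return sorted(names, key=natural_key)


# Demo on a temporary directory (hypothetical file names, not the lab data).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["file10.gz", "file2.gz", "file1.gz", "notes.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_names = list_gz_files(demo_dir)

print(demo_names)
```

In the lab it could be called as `list_gz_files(os.path.expanduser('~/tmp/compressed-files-lab'))`.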

## Create a RDD

Now we have to create an RDD. Remember that we can control the level of parallelism by setting the number of partitions:

In [5]:
PARTITIONS = 4

In [6]:
rdd = sc.parallelize(filenames, PARTITIONS)

Let's see how the work will be distributed:

In [7]:
rdd.glom().collect()

[['file100.gz',
  'file10.gz',
  'file11.gz',
  'file12.gz',
  'file13.gz',
  'file14.gz',
  'file15.gz',
  'file16.gz',
  'file17.gz',
  'file18.gz',
  'file19.gz',
  'file1.gz',
  'file20.gz',
  'file21.gz',
  'file22.gz',
  'file23.gz',
  'file24.gz',
  'file25.gz',
  'file26.gz',
  'file27.gz',
  'file28.gz',
  'file29.gz',
  'file2.gz',
  'file30.gz',
  'file31.gz'],
 ['file32.gz',
  'file33.gz',
  'file34.gz',
  'file35.gz',
  'file36.gz',
  'file37.gz',
  'file38.gz',
  'file39.gz',
  'file3.gz',
  'file40.gz',
  'file41.gz',
  'file42.gz',
  'file43.gz',
  'file44.gz',
  'file45.gz',
  'file46.gz',
  'file47.gz',
  'file48.gz',
  'file49.gz',
  'file4.gz',
  'file50.gz',
  'file51.gz',
  'file52.gz',
  'file53.gz',
  'file54.gz'],
 ['file55.gz',
  'file56.gz',
  'file57.gz',
  'file58.gz',
  'file59.gz',
  'file5.gz',
  'file60.gz',
  'file61.gz',
  'file62.gz',
  'file63.gz',
  'file64.gz',
  'file65.gz',
  'file66.gz',
  'file67.gz',
  'file68.gz',
  'file69.gz',
  'file6.gz'
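
To build some intuition for the output above, here is a plain-Python sketch of how a list can be cut into contiguous chunks of near-equal size. It is an approximation for illustration only, not Spark's actual implementation, and `slice_into_partitions` is a hypothetical name.

```python
def slice_into_partitions(items, num_partitions):
    """Split `items` into `num_partitions` contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# Hypothetical stand-in for the 100 file names used in the lab.
demo_filenames = ["file%d.gz" % i for i in range(1, 101)]
demo_partitions = slice_into_partitions(demo_filenames, 4)

# With 100 items and 4 partitions, every chunk holds 25 items.
print([len(part) for part in demo_partitions])
```

Each chunk corresponds to the work of one task, so with 4 partitions at most 4 files are processed at the same time.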

## Create helper script

First we will try with a simple `run.sh` script that echoes the lines it receives as input:

In [8]:
%%writefile ~/tmp/run.sh
#!/bin/bash
# Echo every line received on stdin; -r and IFS= keep backslashes and spaces intact
while IFS= read -r LINE; do
   echo "$LINE"
done

Overwriting /home/cesga/jlopez/tmp/run.sh


Make the file readable and executable for everyone so that Spark can run it (Spark runs as the spark user):

In [9]:
!chmod a+rx ~/tmp/run.sh

Let's store the location of the script:

In [10]:
run = os.path.expanduser('~/tmp/run.sh')

Let's test it:

In [11]:
rdd.pipe(run).collect()

[u'file100.gz',
 u'file10.gz',
 u'file11.gz',
 u'file12.gz',
 u'file13.gz',
 u'file14.gz',
 u'file15.gz',
 u'file16.gz',
 u'file17.gz',
 u'file18.gz',
 u'file19.gz',
 u'file1.gz',
 u'file20.gz',
 u'file21.gz',
 u'file22.gz',
 u'file23.gz',
 u'file24.gz',
 u'file25.gz',
 u'file26.gz',
 u'file27.gz',
 u'file28.gz',
 u'file29.gz',
 u'file2.gz',
 u'file30.gz',
 u'file31.gz',
 u'file32.gz',
 u'file33.gz',
 u'file34.gz',
 u'file35.gz',
 u'file36.gz',
 u'file37.gz',
 u'file38.gz',
 u'file39.gz',
 u'file3.gz',
 u'file40.gz',
 u'file41.gz',
 u'file42.gz',
 u'file43.gz',
 u'file44.gz',
 u'file45.gz',
 u'file46.gz',
 u'file47.gz',
 u'file48.gz',
 u'file49.gz',
 u'file4.gz',
 u'file50.gz',
 u'file51.gz',
 u'file52.gz',
 u'file53.gz',
 u'file54.gz',
 u'file55.gz',
 u'file56.gz',
 u'file57.gz',
 u'file58.gz',
 u'file59.gz',
 u'file5.gz',
 u'file60.gz',
 u'file61.gz',
 u'file62.gz',
 u'file63.gz',
 u'file64.gz',
 u'file65.gz',
 u'file66.gz',
 u'file67.gz',
 u'file68.gz',
 u'file69.gz',
 u'file6.gz',
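
Conceptually, `pipe` writes the elements of each partition to the standard input of the command, one per line, and every line the command prints becomes an element of the resulting RDD. The sketch below imitates that behaviour without Spark, under two assumptions: `pipe_partition` is a hypothetical helper, and a small Python child process stands in for `run.sh`. It processes the partitions one after another, whereas Spark runs them in parallel.

```python
import subprocess
import sys

# Stand-in for run.sh: a tiny child process that echoes every stdin line.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line)",
]


def pipe_partition(partition, command):
    """Feed one partition to `command` on stdin and return its stdout lines."""
    stdin_text = "".join(item + "\n" for item in partition)
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Two hypothetical partitions; the outputs are concatenated in partition order.
demo_partitions = [["file1.gz", "file2.gz"], ["file3.gz"]]
demo_output = [
    line for part in demo_partitions for line in pipe_partition(part, ECHO_COMMAND)
]
print(demo_output)
```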


## Launch gunzip in parallel

We just have to update the script so it executes the `gunzip` command instead of the `echo` command:

IMPORTANT: You have to use the full path of your HOME directory instead of `~`, because the script will run as the spark user (not your user).

In [12]:
%%writefile ~/tmp/run.sh
#!/bin/bash
# Uncompress every file name received on stdin
while IFS= read -r LINE; do
   gunzip "/home/cesga/jlopez/tmp/compressed-files-lab/$LINE"
done

Overwriting /home/cesga/jlopez/tmp/run.sh
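
The path in the script above is specific to one user. As a sketch of one way to avoid typing it by hand, the script can be generated from Python with the directory filled in. `write_gunzip_script` is a hypothetical helper, not part of the lab, and the demo writes to a temporary directory with a made-up data path.

```python
import os
import stat
import tempfile


def write_gunzip_script(script_path, data_dir):
    """Write a run.sh-style script that gunzips each file name read from stdin."""
    script = (
        "#!/bin/bash\n"
        "while IFS= read -r LINE; do\n"
        '   gunzip "%s/$LINE"\n'
        "done\n"
    ) % data_dir
    with open(script_path, "w") as handle:
        handle.write(script)
    # Equivalent of `chmod a+rx`: add read and execute for user, group and others.
    mode = os.stat(script_path).st_mode
    everyone_rx = (
        stat.S_IRUSR | stat.S_IXUSR
        | stat.S_IRGRP | stat.S_IXGRP
        | stat.S_IROTH | stat.S_IXOTH
    )
    os.chmod(script_path, mode | everyone_rx)
    return script


# Demo with a hypothetical data directory; nothing is uncompressed here.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_script_path = os.path.join(demo_dir, "run.sh")
    demo_script = write_gunzip_script(
        demo_script_path, "/hypothetical/home/tmp/compressed-files-lab"
    )
    demo_script_mode = stat.S_IMODE(os.stat(demo_script_path).st_mode)

print(demo_script)
```

In the lab, both arguments could be built with `os.path.expanduser`, which resolves `~` to your own HOME before the script is handed to the spark user.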


In [13]:
rdd.pipe(run).collect()

[]
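
For reference, this is roughly what `gunzip` does to each file, written with Python's standard `gzip` and `shutil` modules. It is a different technique from the `pipe` approach used in this lab, shown only to make the per-file work explicit. `gunzip_file` is a hypothetical helper and the demo uses a temporary file with made-up contents.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(path):
    """Decompress `path` (ending in .gz) next to it and remove the original."""
    if not path.endswith(".gz"):
        raise ValueError("expected a .gz file: %r" % path)
    target = path[:-len(".gz")]
    with gzip.open(path, "rb") as source, open(target, "wb") as dest:
        shutil.copyfileobj(source, dest)
    os.remove(path)
    return target


# Demo: compress some hypothetical content, then uncompress it again.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_gz = os.path.join(demo_dir, "file1.gz")
    with gzip.open(demo_gz, "wb") as handle:
        handle.write(b"hello lab\n")
    demo_target = gunzip_file(demo_gz)
    with open(demo_target, "rb") as handle:
        demo_content = handle.read()
    demo_gz_still_exists = os.path.exists(demo_gz)

print(demo_content)
```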

The result is an empty list because `gunzip` does not print anything to standard output. Finally, we can check that the files have actually been uncompressed:

In [14]:
!ls ~/tmp/compressed-files-lab

file1	 file18  file27  file36  file45  file54  file63  file72  file81  file90
file10	 file19  file28  file37  file46  file55  file64  file73  file82  file91
file100  file2	 file29  file38  file47  file56  file65  file74  file83  file92
file11	 file20  file3	 file39  file48  file57  file66  file75  file84  file93
file12	 file21  file30  file4	 file49  file58  file67  file76  file85  file94
file13	 file22  file31  file40  file5	 file59  file68  file77  file86  file95
file14	 file23  file32  file41  file50  file6	 file69  file78  file87  file96
file15	 file24  file33  file42  file51  file60  file7	 file79  file88  file97
file16	 file25  file34  file43  file52  file61  file70  file8	 file89  file98
file17	 file26  file35  file44  file53  file62  file71  file80  file9	 file99
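
Instead of inspecting the listing by eye, the same check can be done from Python. This is a minimal sketch with a hypothetical helper `remaining_compressed`; an empty result means nothing is left to uncompress. The demo runs on a temporary directory with made-up file names.

```python
import os
import tempfile


def remaining_compressed(directory):
    """Return the sorted names in `directory` that still end in .gz."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".gz"))


# Demo: one file already uncompressed, one still compressed (hypothetical names).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["file1", "file2.gz"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_remaining = remaining_compressed(demo_dir)

print(demo_remaining)
```

In the lab, `remaining_compressed(os.path.expanduser('~/tmp/compressed-files-lab'))` should return an empty list after the `pipe` step.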


## Cleaning up and restoring permissions

Let's clean up now and remove the directory:

In [15]:
!rm -rf ~/tmp/compressed-files-lab

For your security, you should also restore the permissions of your HOME directory:
```
chmod go-rx ~
```
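
The same change can be made from Python if you prefer to stay in the notebook. The sketch below mirrors `chmod go-rx` using the standard `os` and `stat` modules; `remove_group_other_rx` is a hypothetical helper, and the demo runs on a temporary directory rather than on your real HOME.

```python
import os
import stat
import tempfile


def remove_group_other_rx(path):
    """Python equivalent of `chmod go-rx path`; returns the new permission bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = current & ~(stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH)
    os.chmod(path, new_mode)
    return new_mode


# Demo: start from rwxr-xr-x (0o755) and drop read/execute for group and others.
with tempfile.TemporaryDirectory() as demo_dir:
    os.chmod(demo_dir, 0o755)
    demo_mode = remove_group_other_rx(demo_dir)

print(oct(demo_mode))
```

For your HOME, the call would be `remove_group_other_rx(os.path.expanduser('~'))`.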