# Uncompressing files in parallel

In this lab we will see how to take advantage of the `pipe` method to launch commands in parallel.

The objective is to uncompress all files in a directory in parallel.

## Files to uncompress

The files we have to uncompress are in the `/opt/cesga/cursos/pyspark_2022/datasets/compressed-files` directory.

NOTE: This directory is on NFS, not on HDFS.

In [None]:
!ls /opt/cesga/cursos/pyspark_2022/datasets/compressed-files

We will create a `tmp/compressed-files-lab` directory in our HOME and we will copy the files there.

In [None]:
!mkdir -p ~/tmp/compressed-files-lab
!cp /opt/cesga/cursos/pyspark_2022/datasets/compressed-files/*.gz ~/tmp/compressed-files-lab
!chmod a+rwx ~/tmp/compressed-files-lab

## Obtain the names of the files

First we need to get the names of the files from Python:

In [None]:
import os

filenames = os.listdir(os.path.expanduser('~/tmp/compressed-files-lab'))

Check that the `filenames` variable contains the expected results:

In [None]:
filenames
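Note that `os.listdir` returns every entry in the directory, in arbitrary order. If the directory may contain anything other than the `.gz` files (for example, files already uncompressed in a previous run), a filter can help. The following is a minimal sketch; `list_gz_files` is a hypothetical helper, not part of the lab:

```python
import os

def list_gz_files(directory):
    """Return the sorted names of the .gz files found directly in `directory`."""
    directory = os.path.expanduser(directory)
    return sorted(
        name for name in os.listdir(directory)
        # Keep only regular files whose name ends in .gz
        if name.endswith('.gz') and os.path.isfile(os.path.join(directory, name))
    )
```

It could be used in place of the plain `os.listdir` call, e.g. `filenames = list_gz_files('~/tmp/compressed-files-lab')`.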

## Create an RDD

Now we have to create an RDD. Remember that we can control the level of parallelism by setting the number of partitions:

In [None]:
PARTITIONS = 4

In [None]:
rdd = ...

Let's see how the work will be distributed:

In [None]:
rdd.glom().collect()
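`glom()` groups the elements of each partition into a list, so the result of `collect()` is a list with one inner list per partition. The following pure-Python sketch only illustrates the shape of that output. The `split_evenly` helper and the file names are hypothetical, and the actual assignment of elements to partitions made by Spark may differ:

```python
def split_evenly(items, num_partitions):
    """Split `items` into `num_partitions` contiguous, near-equal chunks."""
    size = len(items)
    return [
        items[(i * size) // num_partitions:((i + 1) * size) // num_partitions]
        for i in range(num_partitions)
    ]

# Ten made-up file names spread over 4 "partitions"
example = ['file%d.gz' % i for i in range(10)]
chunks = split_evenly(example, 4)
print([len(chunk) for chunk in chunks])  # prints [2, 3, 2, 3]
```

Each inner list corresponds to the work that a single task would receive.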

## Create helper script

First we will try with a simple `run.sh` script that echoes the lines it receives as input:

In [None]:
%%writefile ~/tmp/run.sh
#!/bin/bash
while IFS= read -r LINE; do
   echo "$LINE"
done

Give executable permissions to the file so that Spark can execute it (Spark runs as the `spark` user):

In [None]:
!chmod a+rx ~/tmp/run.sh

Let's store the location of the script:

In [None]:
run = os.path.expanduser('~/tmp/run.sh')

Let's test it:

In [None]:
rdd.pipe(run).collect()
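Conceptually, `pipe` launches the external command for each partition, writes every element of the partition to its standard input as one line, and turns each line of its standard output into an element of the resulting RDD. The following is a simplified local imitation of that behaviour for a single partition, not Spark's implementation. `pipe_locally` is a hypothetical helper, and a small Python child process stands in for `run.sh`:

```python
import subprocess
import sys

def pipe_locally(partition, command):
    """Send each element as one line to `command` and return its output lines."""
    stdin_text = ''.join(str(element) + '\n' for element in partition)
    completed = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()

# Stand-in for run.sh: a child process that echoes every line it receives
echo_command = [sys.executable, '-c',
                'import sys; sys.stdout.write(sys.stdin.read())']
result = pipe_locally(['a.gz', 'b.gz'], echo_command)
print(result)  # prints ['a.gz', 'b.gz']
```

Because the script only sees lines of text, anything it needs beyond the file name (such as the directory) has to be written into the script itself.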

## Launch gunzip in parallel

We just have to update the script so it executes the `gunzip` command instead of the `echo` command:

IMPORTANT: You have to set the full path, including your HOME directory, because the script will run as the `spark` user (not as your user), so `~` would not point to your HOME.
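If you are unsure of the absolute path to use, it can be printed from Python. This is a small sketch that assumes the lab directory created earlier under `tmp/compressed-files-lab`; the variable names are only illustrative:

```python
import os

# Expand "~" to the absolute path of the current user's HOME directory
home = os.path.expanduser('~')
target_dir = os.path.join(home, 'tmp', 'compressed-files-lab')
print(target_dir)
```

The printed value of `home` is what should replace the placeholder in the script.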

In [None]:
%%writefile ~/tmp/run.sh
#!/bin/bash
while IFS= read -r LINE; do
   gunzip "SET_YOUR_HOME_PATH_HERE/tmp/compressed-files-lab/$LINE"
done

In [None]:
rdd.pipe(...).collect()

Finally, we can check that the files have actually been uncompressed:

In [None]:
!ls ~/tmp/compressed-files-lab