# PySpark training for data engineers
## 03. Data Cleaning

### Goal

* Read the RDD created from the XML in the previous notebook and clean up the contents.
* Read the RDD created from the CSV and map the content to useful data.

### Highlights

* `spark.pickleFile()` reads a pickle into an RDD
* `rdd.map()` applies a function to each row of an RDD
* Inside `map()` you can use either a lambda function or an explicitly defined function
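
As a rough illustration of the last point, the sketch below uses Python's built-in `map()` on a plain list instead of an RDD, so it runs without Spark; `rdd.map()` accepts a callable in the same way. The sample row and the helper name `drop_filename` are made up for this example.

```python
# A row shaped like the (filename, filecontent) tuples used in this notebook.
# The path and content are hypothetical sample values.
rows = [('file:/tmp/example.xml', '<a>\n</a>')]

def drop_filename(row):
    # Explicitly defined function: keep only the file content
    _, filecontent = row
    return filecontent

# The same transformation, once with a named function and once with a lambda
with_function = list(map(drop_filename, rows))
with_lambda = list(map(lambda row: row[1], rows))

print(with_function == with_lambda)
```

A named function is usually easier to read and test once the logic grows beyond a single expression.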

### Implementation

In [1]:
from pyspark import SparkConf, SparkContext
config = SparkConf().setMaster('local')
spark = SparkContext.getOrCreate(conf=config)

In [2]:
pwd

'/home/jitsejan/itility/pyspark-101'

Load the data saved in the previous step. Note that `collect()` returns the contents of the RDD as a Python list.

In [3]:
xmlrdd = spark.pickleFile('xml-pickle-file/')

In [4]:
xmlrdd.collect()

[('file:/home/jitsejan/itility/pyspark-101/test.xml',
  '<teststructure>\n    <info>testfile</info>\n    <outerlist>\n        <elt>\n            <subelement>One</subelement>\n        </elt>\n        <elt>\n            <subelement>Two</subelement>\n        </elt>\n    </outerlist>\n</teststructure>')]

In [5]:
def _parse(row):
    # Drop the filename from the (filename, filecontent) tuple and keep the content
    _, filecontent = row
    return filecontent

In [6]:
xmlrdd.map(_parse).collect()

['<teststructure>\n    <info>testfile</info>\n    <outerlist>\n        <elt>\n            <subelement>One</subelement>\n        </elt>\n        <elt>\n            <subelement>Two</subelement>\n        </elt>\n    </outerlist>\n</teststructure>']

In [7]:
def _parse_remove_newline(row):
    # Drop the filename from the (filename, filecontent) tuple
    # and remove the newlines from the content
    _, filecontent = row
    return filecontent.replace('\n', '')

In [8]:
xmlrdd.map(_parse_remove_newline).collect()

['<teststructure>    <info>testfile</info>    <outerlist>        <elt>            <subelement>One</subelement>        </elt>        <elt>            <subelement>Two</subelement>        </elt>    </outerlist></teststructure>']

In [9]:
import xml.etree.ElementTree as ET

In [10]:
root = ET.parse('test.xml')
root

<xml.etree.ElementTree.ElementTree at 0x7f487004cef0>

Get the text of every subelement with a list comprehension:

In [11]:
[elem.text for elem in root.findall('outerlist/elt/subelement')]

['One', 'Two']

In [12]:
def _parse_xml(row):
    # Parse the XML in the file content and return one dictionary per subelement
    _, filecontent = row
    # Get the root of the XML document
    root = ET.fromstring(filecontent.replace('\n', ''))
    # Find the information
    info = root.find('info').text
    # Find the subelements and return a list of dictionaries
    return [{'text': subelem.text, 'info': info}
            for subelem in root.findall('outerlist/elt/subelement')]

In [13]:
xmlrdd = xmlrdd.flatMap(_parse_xml)

In [14]:
xmlrdd.collect()

[{'text': 'One', 'info': 'testfile'}, {'text': 'Two', 'info': 'testfile'}]

In [15]:
xmlrdd.saveAsPickleFile('xml-pickle-03')

### CSV

In [16]:
csvrdd = spark.pickleFile('csv-pickle-file/')

In [17]:
csvrdd.collect()

[('file:/home/jitsejan/itility/pyspark-101/csvfile01.csv',
  'john,doe,male,32\njake,doe,male,16'),
 ('file:/home/jitsejan/itility/pyspark-101/csvfile02.csv',
  'jane,doe,female,31\njanet,doe,female,13')]

We can also pass a lambda function directly to `map()`. As we saw earlier, each row of the RDD read from the files is a tuple of filename and file content. If we call the row `x`, then `x[1]` is the file content.

Splitting by a comma seems to yield the wrong result:

In [18]:
csvrdd.map(lambda x:x[1].split(',')).collect()

[['john', 'doe', 'male', '32\njake', 'doe', 'male', '16'],
 ['jane', 'doe', 'female', '31\njanet', 'doe', 'female', '13']]

We expect four rows of data, not two. The newline characters are not treated as separators, so the two lines of each CSV file end up in one list, and values such as `'32\njake'` straddle two records. Let's split on the newline character instead:

In [19]:
csvrdd.map(lambda x:x[1].split('\n')).collect()

[['john,doe,male,32', 'jake,doe,male,16'],
 ['jane,doe,female,31', 'janet,doe,female,13']]

Almost: `map` returns one list per file, so the lines stay nested inside the row of their file. We want every line to become its own row, so we use `flatMap`, which flattens the returned lists:

In [20]:
csvrdd = csvrdd.flatMap(lambda x:x[1].split('\n'))
csvrdd.collect()

['john,doe,male,32',
 'jake,doe,male,16',
 'jane,doe,female,31',
 'janet,doe,female,13']
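
The difference between `map` and `flatMap` can be imitated on plain Python lists. The sketch below runs without Spark and only mirrors the semantics: `map` keeps one result per input row, while `flatMap` chains the results together. The shortened filenames are placeholders.

```python
from itertools import chain

# Rows shaped like the (filename, filecontent) tuples of the CSV RDD
rows = [('csvfile01.csv', 'john,doe,male,32\njake,doe,male,16'),
        ('csvfile02.csv', 'jane,doe,female,31\njanet,doe,female,13')]

# map-like: one list of lines per file
mapped = [content.split('\n') for _, content in rows]

# flatMap-like: the per-file lists are chained into one flat list of lines
flat_mapped = list(chain.from_iterable(mapped))

print(len(mapped), len(flat_mapped))
```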

In [21]:
csvrdd.saveAsPickleFile('csv-pickle-03')
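
The rows saved here are still comma-separated strings. A possible next step is to turn each line into a dictionary with typed values. The sketch below works on a plain list so it runs without Spark; on the RDD the same function could be passed to `map()`. The CSV files have no header row, so the column names `first_name`, `last_name`, `gender` and `age` are assumptions, as is the helper name `_parse_line`.

```python
def _parse_line(line):
    # Column names are assumed; the CSV files have no header row
    first_name, last_name, gender, age = line.split(',')
    return {'first_name': first_name, 'last_name': last_name,
            'gender': gender, 'age': int(age)}

lines = ['john,doe,male,32', 'jake,doe,male,16']
records = [_parse_line(line) for line in lines]
print(records)
```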