# PySpark training for data engineers
## 04. Data Filtering

### Goal

Practise filtering on the two RDDs created in the earlier notebooks. Both RDDs are converted to dataframes, after which some simple filtering examples are shown.

### Highlights

* `sqlContext.createDataFrame()` creates a Spark dataframe from an RDD.
* If the schema cannot be inferred from the RDD, a schema has to be supplied when converting the RDD to a dataframe.
* `dataframe.select()` chooses columns or expressions, and `dataframe.where()` filters rows.

### Implementation

In [1]:
from pyspark import SparkConf, SparkContext
config = SparkConf().setMaster('local')
spark = SparkContext.getOrCreate(conf=config)

#### XML

In [None]:
xmlrdd = spark.pickleFile('xml-pickle-03/')

In [None]:
xmlrdd.collect()

Let us create a `SQLContext`, which gives us the dataframe API on top of the SparkContext.

In [None]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(spark)

Now we can convert the RDD to a dataframe, which makes the data easier to work with:

In [None]:
xmldf = sqlContext.createDataFrame(xmlrdd)

In [None]:
xmldf.show()

The output above shows that the columns are correct: both columns (`info` and `text`) were inferred from the RDD.

In [None]:
xmldf.where(xmldf['text'] == 'One').show()

In [None]:
xmldf.where(xmldf['text'] == 'Two').show()

#### CSV

In [None]:
csvrdd = spark.pickleFile('csv-pickle-03/')

In [None]:
csvrdd.collect()

First we need to turn each CSV line into a proper `Row` by mapping a function over the RDD.

In [None]:
from pyspark.sql import Row

def processCSV(row):
    # Split the CSV line into a list of string fields
    fields = row.split(',')
    # Return the four fields as a Row
    return Row(fields[0], fields[1], fields[2], fields[3])

# Apply the parser to every line of the RDD
csvrdd = csvrdd.map(processCSV)
csvrdd.collect()
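
A plain `split(',')` breaks when a field contains a comma inside quotes. As a minimal sketch, assuming the CSV lines may contain quoted fields, Python's standard `csv` module can do the splitting instead. The helper name `parse_csv_line` and the sample line are hypothetical and not part of the notebook.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into a list of fields, honouring quoted commas."""
    # csv.reader accepts any iterable of lines, so we pass a one-element list
    return next(csv.reader([line]))

# Hypothetical line in which the second field contains a comma
fields = parse_csv_line('Ann,"Lee, Jr.",F,25')
print(fields)
```

The returned list could be passed on as `Row(*fields)` inside the function that is mapped over the RDD.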

We define the schema of the data explicitly so that `createDataFrame` does not have to infer it. All four fields are declared as strings, because `split()` returns strings.

In [None]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType
schema = StructType([
            StructField("first_name", StringType(), True),
            StructField("last_name", StringType(), True),
            StructField("gender", StringType(), True),
            StructField("age", StringType(), True)
        ])

In [None]:
csvdf = sqlContext.createDataFrame(csvrdd, schema=schema)

In [None]:
csvdf.show()

In [None]:
csvdf.select(csvdf.age > 30).collect()

In [None]:
csvdf.select(csvdf.age > 30).show()

In [None]:
csvdf.select('first_name', csvdf.age > 30).collect()

In [None]:
from pyspark.sql import functions as psf
csvdf.select('first_name', psf.when(csvdf.age > 30, 1).otherwise(0)).show()

### Important links
[Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf)