# Pivotal Greenplum-Spark Connector
## PySpark Example

----

Pivotal Greenplum-Spark Connector documentation (notes below extracted from Pivotal documentation):

https://greenplum-spark.docs.pivotal.io/110/index.html

----

Steps to launch Jupyter Notebook with the Greenplum-Spark connector available

1. Download the Greenplum-Spark connector from Pivotal Network: https://network.pivotal.io/products/pivotal-gpdb (this example uses greenplum-spark_2.11-1.1.0.jar)

2. Set environment variables so that `pyspark` launches Jupyter Notebook
```bash
# set environment variables
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --port=8888 --no-browser --ip=0.0.0.0 --notebook-dir=/notebooks'
```
3. Launch Jupyter Notebook
```bash
# Launch notebooks
# Set jar to location of greenplum-spark connector jar
pyspark --master spark://spark:7077 --jars=../spark-jars/greenplum-spark_2.11-1.1.0.jar
```

*Note: this example uses the Wine data set, https://archive.ics.uci.edu/ml/datasets/wine*

----

In [93]:
# dependencies
import pyspark              # http://spark.apache.org/docs/latest/api/python/

Note that the .load() operation does not initiate the movement of data from Greenplum Database to Spark.
Spark employs lazy evaluation for transformations; it does not compute the results until the application
performs an action on the DataFrame, such as displaying the data or counting the number of rows.

https://greenplum-spark.docs.pivotal.io/110/read_from_gpdb.html
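
The laziness described above can be illustrated without a cluster. The following is a plain-Python analogy built on generators, not Spark code: `read_rows` is a hypothetical stand-in for a table scan, and the `log` list exists only to show when rows are actually produced.

```python
def read_rows(log, n=5):
    """Hypothetical stand-in for a table scan; logs each row as it is produced."""
    for i in range(n):
        log.append("read row {}".format(i))
        yield {"id": i}


log = []
rows = read_rows(log)                           # analogous to .load(): nothing is read yet
evens = (r for r in rows if r["id"] % 2 == 0)   # analogous to a transformation: still nothing
reads_before_action = len(log)                  # 0, no row has been produced so far

even_count = sum(1 for _ in evens)              # analogous to an action such as count()
reads_after_action = len(log)                   # rows are read only once the action runs
```

Spark's planner does far more than a generator chain, so this should be read only as a picture of when work happens, not of how Spark executes it.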

Options
* **url** format: jdbc:postgresql://[hostname]:[port]/[database]
* **dbtable** the table must be in the GPDB search_path and must have a distribution column (it cannot be randomly distributed)
* **partitionColumn** must be one of the types bigint, bigserial, integer, or serial
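
The options above can be assembled and checked before they are handed to the reader. This is a minimal sketch, assuming the caller already knows the Greenplum type of the partition column. `greenplum_options` and its `partition_type` argument are hypothetical names used only for this local check. They are not part of the connector, and `partition_type` is never passed to it.

```python
# Partition column types listed in the options above.
ALLOWED_PARTITION_TYPES = {"bigint", "bigserial", "integer", "serial"}


def greenplum_options(host, port, database, user, password,
                      table, partition_column, partition_type):
    """Build the reader options dict (hypothetical helper, not a connector API)."""
    if partition_type.lower() not in ALLOWED_PARTITION_TYPES:
        raise ValueError(
            "partitionColumn type must be one of {}, got {!r}".format(
                sorted(ALLOWED_PARTITION_TYPES), partition_type))
    return {
        "url": "jdbc:postgresql://{}:{}/{}".format(host, port, database),
        "user": user,
        "password": password,
        "dbtable": table,
        "partitionColumn": partition_column,
    }


# Values taken from the example in this document.
opts = greenplum_options("gpdb", 5432, "gpadmin", "gpadmin", "pivotal",
                         "wine", "cultivars", "integer")
```

The resulting dict could then be unpacked into the reader with `.options(**opts)` in place of the keyword arguments.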

In [84]:
# create a DataFrame reference to the table 'wine' in Greenplum
gpdf = sqlContext.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider").options(
    url="jdbc:postgresql://gpdb:5432/gpadmin",
    user="gpadmin",
    password="pivotal",
    dbtable="wine",
    partitionColumn="cultivars").load()


Note: By default, Spark recomputes a transformed DataFrame each time you run an action on it.
If you have a large data set on which you want to perform multiple transformations, you may choose
to keep the DataFrame in memory for performance reasons. You can use the DataFrame.persist() method
for this purpose. Keep in mind that there are memory implications to persisting large data sets.

In [85]:
gpdf.persist()

DataFrame[cultivars: int, alcohol: double, malic_acid: double, ash: double, alcalinity_of_ash: double, magnesium: double, total_phenols: double, flavanoids: double, nonflavanoid_phenols: double, proanthocyanins: double, color_intensity: double, hue: double, od280_od315: double, proline: double]
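
Because a persisted DataFrame holds memory until it is released, it can help to pair `persist()` with `unpersist()`. This is a minimal sketch, assuming only that the object passed in offers PySpark's `persist()` and `unpersist()` methods. The `persisted` helper is a hypothetical name, not part of PySpark or the connector.

```python
from contextlib import contextmanager


@contextmanager
def persisted(df):
    """Persist df for the duration of a with-block, then release it.

    Hypothetical helper: relies only on df.persist() and df.unpersist().
    """
    df.persist()
    try:
        yield df
    finally:
        # Runs even if an action inside the block raises.
        df.unpersist()
```

With the DataFrame from this example, usage might look like `with persisted(gpdf) as cached:` followed by several actions on `cached`. Note that `persist()` is itself lazy, so the cache is filled by the first action that runs inside the block.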

In [86]:
# Check out data types of columns
gpdf.printSchema()

root
 |-- cultivars: integer (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- malic_acid: double (nullable = true)
 |-- ash: double (nullable = true)
 |-- alcalinity_of_ash: double (nullable = true)
 |-- magnesium: double (nullable = true)
 |-- total_phenols: double (nullable = true)
 |-- flavanoids: double (nullable = true)
 |-- nonflavanoid_phenols: double (nullable = true)
 |-- proanthocyanins: double (nullable = true)
 |-- color_intensity: double (nullable = true)
 |-- hue: double (nullable = true)
 |-- od280_od315: double (nullable = true)
 |-- proline: double (nullable = true)



In [87]:
# Column names 
gpdf.columns

['cultivars',
 'alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280_od315',
 'proline']

In [88]:
# row count
gpdf.count()

178

In [89]:
# show first 5 rows
gpdf.show(5, truncate=True)

+---------+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+-----------+-------+
|cultivars|alcohol|malic_acid| ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity| hue|od280_od315|proline|
+---------+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+-----------+-------+
|        1|  13.16|      2.36|2.67|             18.6|    101.0|          2.8|      3.24|                 0.3|           2.81|           5.68|1.03|       3.17| 1185.0|
|        1|  13.24|      2.59|2.87|             21.0|    118.0|          2.8|      2.69|                0.39|           1.82|           4.32|1.04|       2.93|  735.0|
|        1|  13.75|      1.73|2.41|             16.0|     89.0|          2.6|      2.76|                0.29|           1.81|            5.6|1.15|        2.9| 1320.0|

In [90]:
# summary stats
# toPandas(): pySpark dataframe -> pandas dataframe
gpdf.describe().toPandas()

| | summary | cultivars | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_od315 | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| 1 | mean | 1.9382022471910112 | 13.000617977528092 | 2.3363483146067416 | 2.366516853932583 | 19.49494382022472 | 99.74157303370788 | 2.295112359550562 | 2.029269662921348 | 0.3618539325842696 | 1.590898876404494 | 5.058089882022472 | 0.9574494382022468 | 2.611685393258428 | 746.8932584269663 |
| 2 | stddev | 0.7750349899850566 | 0.8118265380058567 | 1.1171460976144625 | 0.2743440090608148 | 3.339563767173505 | 14.282483515295668 | 0.6258510488339892 | 0.9988586850169464 | 0.1244533402966793 | 0.5723588626747613 | 2.318285871822413 | 0.2285715658298234 | 0.7099904287650505 | 314.90747427684926 |
| 3 | min | 1.0 | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 4 | max | 3.0 | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [91]:
# select a subset of columns
gpdf.select(gpdf.columns[0:2]).show(5)

+---------+-------+
|cultivars|alcohol|
+---------+-------+
|        1|  13.16|
|        1|  13.24|
|        1|  13.75|
|        1|  14.75|
|        1|  14.38|
+---------+-------+
only showing top 5 rows



In [92]:
# Select the first 5 columns, keep rows where cultivars = 1, and show the 5 rows with the lowest alcohol values

# select columns -> filter rows -> order results by
gpdf.select(gpdf.columns[0:5]).filter("cultivars = 1").orderBy("alcohol").limit(5).toPandas()

| | cultivars | alcohol | malic_acid | ash | alcalinity_of_ash |
|---|---|---|---|---|---|
| 0 | 1 | 12.85 | 1.6 | 2.52 | 17.8 |
| 1 | 1 | 12.93 | 3.8 | 2.65 | 18.6 |
| 2 | 1 | 13.05 | 2.05 | 3.22 | 25.0 |
| 3 | 1 | 13.05 | 1.65 | 2.55 | 18.0 |
| 4 | 1 | 13.05 | 1.77 | 2.1 | 17.0 |


**Running Spark SQL query against DataFrame**

In [105]:
# Register a global temporary view so the DataFrame can be queried with SQL
gpdf.createGlobalTempView("wine")


In [106]:
# Select the first 5 columns, keep rows where cultivars = 1, and show the 5 rows with the lowest alcohol values

# prepare query
query = """
    SELECT {} 
    FROM global_temp.wine 
    WHERE cultivars = 1
    ORDER BY alcohol
""".format(','.join(gpdf.columns[0:5]))

# run query
spark.sql(query).limit(5).toPandas()

| | cultivars | alcohol | malic_acid | ash | alcalinity_of_ash |
|---|---|---|---|---|---|
| 0 | 1 | 12.85 | 1.6 | 2.52 | 17.8 |
| 1 | 1 | 12.93 | 3.8 | 2.65 | 18.6 |
| 2 | 1 | 13.05 | 2.05 | 3.22 | 25.0 |
| 3 | 1 | 13.05 | 1.65 | 2.55 | 18.0 |
| 4 | 1 | 13.05 | 1.77 | 2.1 | 17.0 |
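
The query in the cell above interpolates column names directly into the SQL string. This is a minimal sketch of a slightly more defensive variant, assuming Spark SQL's backtick quoting for identifiers. `quote_identifier` and `build_select` are hypothetical helper names, not part of Spark or the connector, and the defaults mirror the values used in this example.

```python
def quote_identifier(name):
    """Backtick-quote a column name, doubling any embedded backticks."""
    return "`{}`".format(name.replace("`", "``"))


def build_select(columns, view="global_temp.wine",
                 where="cultivars = 1", order_by="alcohol"):
    """Build the SELECT statement as a string (hypothetical helper).

    Only column names are quoted; `view` and `where` are inserted as given,
    so they should come from trusted code, not from user input.
    """
    select_list = ", ".join(quote_identifier(c) for c in columns)
    return "SELECT {}\nFROM {}\nWHERE {}\nORDER BY {}".format(
        select_list, view, where, quote_identifier(order_by))


# First five column names from the wine table shown earlier.
query = build_select(["cultivars", "alcohol", "malic_acid", "ash", "alcalinity_of_ash"])
```

The resulting string could then be passed to `spark.sql(...)` in the same way as the hand-written query.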
