# More ID tricks

- Once you define a Spark process, you'll likely want to run it many times. Depending on your needs, you may want the IDs to start at a certain value so they don't overlap with those from previous runs of the Spark task, similar to how IDs behave in a relational database. Your task is to make sure that the IDs output by a monthly Spark task start at the highest value from the previous month.

- The Spark session and two DataFrames, `voter_df_march` and `voter_df_april`, are available in your workspace. The `pyspark.sql.functions` module is available under the alias `F`.

## Instructions

- Determine the highest `ROW_ID` in `voter_df_march` and save it in the variable `previous_max_ID`. Calling `.rdd.max()[0]` on the selected column returns the maximum ID.
- Add a `ROW_ID` column to `voter_df_april` starting at the value of `previous_max_ID`.
- Show the `ROW_ID` values from both DataFrames and compare.
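
The steps above come down to simple offset arithmetic. The following plain-Python sketch models it without Spark; the helper name `offset_ids` and the sample values are hypothetical, chosen only for illustration. It assumes the newly generated IDs start at 0, in which case adding the previous maximum makes the first new ID equal to that maximum. If IDs must be strictly unique across months, adding 1 as well is one option.

```python
def offset_ids(new_ids, previous_max_id, strict=False):
    """Shift zero-based IDs so they continue after an earlier run.

    new_ids: IDs generated for the current run (assumed to start at 0).
    previous_max_id: highest ID produced by the previous run.
    strict: if True, start one past the previous maximum to avoid reuse.
    """
    start = previous_max_id + 1 if strict else previous_max_id
    return [start + i for i in new_ids]


# Hypothetical sample data: March ended at ID 4, April generated IDs 0..3.
march_ids = [0, 1, 2, 3, 4]
april_raw = [0, 1, 2, 3]

previous_max_id = max(march_ids)
april_ids = offset_ids(april_raw, previous_max_id)
april_ids_strict = offset_ids(april_raw, previous_max_id, strict=True)

print(april_ids)         # [4, 5, 6, 7]
print(april_ids_strict)  # [5, 6, 7, 8]
```

With the non-strict offset, ID 4 appears in both months; the strict variant leaves no shared IDs.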

In [4]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [None]:
# Entry point (Spark 2.x)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On YARN, specify .master("yarn"):
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()

sc = spark.sparkContext

In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import *

# Load the CSV files
voter_df_march = spark.read.format('csv').options(Header=True).load('file://<pwd>/Dataset/voter_data1.csv')
voter_df_april = spark.read.format('csv').options(Header=True).load('file://<pwd>/Dataset/voter_data2.csv')

# Determine the highest ROW_ID and save it in previous_max_ID
____ = voter_df_march.select('ROW_ID').rdd.max()[0]

# Add a ROW_ID column to voter_df_april starting at the desired value
voter_df_april = voter_df_april.withColumn('ROW_ID', ____ + ____)

# Show the ROW_ID from both DataFrames and compare
____.select('ROW_ID').show()
____
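
One thing to watch for, assuming `ROW_ID` is read from the CSV files as loaded above: without schema inference or an explicit schema, Spark's CSV reader loads every column as a string, and `.rdd.max()` then compares rows the way Python compares tuples of strings, that is, lexicographically. The plain-Python sketch below models this with ordinary tuples standing in for Spark `Row` objects; the sample values are made up for illustration. Casting the column to a numeric type before taking the maximum, for example with `.cast('long')`, is one way to avoid the problem.

```python
# Tuples stand in for Spark Row objects; Row is a tuple subclass,
# so Python compares them element by element in the same way.
# The values are hypothetical sample IDs, stored as strings.
rows_as_strings = [("8",), ("9",), ("10",), ("100",)]

# Lexicographic comparison: "9" sorts after "100" because "9" > "1".
string_max = max(rows_as_strings)[0]

# Converting to int first gives the numeric maximum.
rows_as_ints = [(int(value),) for (value,) in rows_as_strings]
numeric_max = max(rows_as_ints)[0]

print(string_max)   # 9
print(numeric_max)  # 100
```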

