# **PySpark Core Program Techniques**

In [17]:
#Imports
from pyspark.sql.session import SparkSession


## **1. How do you create a SparkSession object and get a SparkContext reference?**

In [23]:
'''
SparkSession is the entry point for running Spark operations on a cluster.
A SparkSession creates (or reuses) the underlying SparkContext and provides the functionality that previously required separate SQLContext and HiveContext objects.
|__SparkSession is a class
      |__builder is a class attribute that exposes the builder methods master(), appName(), enableHiveSupport(), getOrCreate()
           |__master('yarn')           => sets the cluster manager the Spark application is submitted to
           |__appName('program-name')  => sets the application name that identifies its jobs on the cluster
           |__enableHiveSupport()      => enables Hive support: persistent Hive metastore connectivity, Hive SerDes, and Hive UDFs
           |__getOrCreate()            => returns the existing SparkSession, or creates a new one if none exists
'''

In [22]:
#SparkSession Object Creation
spark=SparkSession.builder.getOrCreate()
print(spark)

#Accessing the underlying SparkContext through the SparkSession
sc=spark.sparkContext
print("SparkContext:",sc)

#Building a SQLContext on top of the SparkContext (legacy API; SparkSession supersedes it)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
print("SQLContext:",sqlContext)

#Building a HiveContext on top of the SparkContext (legacy API; prefer enableHiveSupport())
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
print("HiveContext:",hiveContext)


<pyspark.sql.session.SparkSession object at 0xffff5008f5e0>
SparkContext: <SparkContext master=local[*] appName=pyspark-shell>
SQLContext: <pyspark.sql.context.SQLContext object at 0xffff44a72280>
HiveContext: <pyspark.sql.context.HiveContext object at 0xffff44bdff70>


In [None]:
'''
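Q1.  Can we have more than one SparkContext in the same application?
        No, only one SparkContext can be active per application (driver process). Loading the same program into separate memory containers again and again is also not good practice.
        Trying to construct a second one raises "ValueError: Cannot run multiple SparkContexts at once".
        Note that sparkContext is a property of the session, not a method, so calling it as sc=spark.sparkContext() fails with: TypeError: 'SparkContext' object is not callable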
'''

In [31]:
try:
    sc1=spark.sparkContext()   #sparkContext is a property, so adding () tries to call the SparkContext object itself
    print("SparkContext:",sc1)
except Exception as e:
    print(f"Exception Occurred: {e}")

Exception Occurred: 'SparkContext' object is not callable


## **2. RDD (Resilient Distributed Dataset)**

In [None]:
'''
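Q1. Spark Terminologies?
    1. RDD            :Resilient (can be rebuilt) Distributed (across the memory of multiple nodes) Dataset (can come from any source)
    2. DAG            :Directed Acyclic Graph, the execution plan Spark builds from the chained transformations
    3. Transformation :An operation that returns another RDD, e.g. map(), filter()
    4. Action         :An operation that triggers computation and returns a result (value), e.g. collect(), count()
    5. Lineage        :The recorded chain of transformations that produced an RDD, used to rebuild lost partitions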
    
Q2. What is RDD?
        Resilient Distributed Dataset: Spark's core abstraction and fundamental unit of data. It is immutable, lazily evaluated, and carries its lineage so that lost partitions can be rebuilt.
    
Q3. What are ways to create RDDs?
        1. RDD from any source (different filesystems)
        2. RDDs/DFs can be created programmatically
        3. RDDs/DFs from another RDD/DF
        4. RDD/DF from memory (a cached RDD/DF)
'''

In [None]:
'''
Download the dataset "custinfo.csv" to the Linux home folder, then copy it to the HDFS home folder:

ls /home/hduser/custinfo.csv

hadoop fs -put /home/hduser/custinfo.csv /user/hduser/
hadoop fs -ls /user/hduser/custinfo.csv
'''

In [46]:
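#1. RDD creation from any source (different filesystems)
file_rdd1 = sc.textFile("file:///home/hduser/custinfo.csv") #Linux file system (file:// forces local; a bare path resolves against the default filesystem)
hdfs_rdd1 = sc.textFile("/user/hduser/custinfo.csv") #HDFS, assuming HDFS is the default filesystem (fs.defaultFS)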


In [40]:
#2. RDD creation programmatically
program_rdd1 =  sc.parallelize(range(1,1000))

salary_list = [20000,30000,15000,40000,50000]
salary_list_rdd1 = sc.parallelize(salary_list,2) #Creating a distributed RDD with 2 partitions
print(f"salary_list_rdd1.collect()=>{salary_list_rdd1.collect()}") #collect action: gathers all partitions into one list on the driver
print(f"salary_list_rdd1.glom().collect()=>{salary_list_rdd1.glom().collect()}") #glom groups each partition's elements into a list, so collect shows the per-partition layout


salary_list_rdd1.collect()=>[20000, 30000, 15000, 40000, 50000]
salary_list_rdd1.glom().collect()=>[[20000, 30000], [15000, 40000, 50000]]
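The relationship between `collect()` and `glom().collect()` can be pictured in plain Python, without a cluster. The sketch below uses a hypothetical helper, `slice_contiguous`, that cuts a list into contiguous chunks. This slicing rule is an assumption chosen because it reproduces the layout shown above for this input; PySpark's own batching may place elements differently for other sizes.

```python
def slice_contiguous(items, num_slices):
    """Hypothetical helper: split items into num_slices contiguous chunks."""
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

salary_list = [20000, 30000, 15000, 40000, 50000]

# Per-partition view, analogous to rdd.glom().collect()
partitions = slice_contiguous(salary_list, 2)

# Flattened view, analogous to rdd.collect()
flattened = [sal for part in partitions for sal in part]

print(partitions)
print(flattened)
```

Flattening the per-partition lists in order gives back the original sequence, which is why the two outputs above contain the same elements.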


In [47]:
#A typical Spark Core application looks like this
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd1 = sc.parallelize([20000,30000,15000,40000,50000],2)
rdd2 = rdd1.map(lambda sal:sal+1000) #Transformation (map: the element count is unchanged)
print(f"rdd2.collect()=>{rdd2.collect()}")
print(f"rdd2.glom().collect()=>{rdd2.glom().collect()}")


rdd2.collect()=>[21000, 31000, 16000, 41000, 51000]
rdd2.glom().collect()=>[[21000, 31000], [16000, 41000, 51000]]


In [45]:
#3. RDDs/DFs from another RDD/DF
rdd2 = rdd1.map(lambda sal:sal+1000) #rdd2 created from rdd1


In [49]:
#4. RDD/DF from memory
rdd1.cache()                         #marks rdd1 to be kept in memory; it is materialized by the first action and stays cached until unpersist() or the application exits
rdd2 = rdd1.map(lambda sal:sal+1000) #rdd2 is created from the in-memory RDD (rdd1)


## **3. What are Transformations & Actions in RDD?**

In [None]:
'''
Q1. What are Transformations & Actions in RDD?
        Transformation:  A function/method that returns another RDD (lazy, nothing is computed yet).  Operations => map(), flatMap(), filter(), distinct(), union()
        Action        :  A function/method that triggers computation and returns a RESULT (VALUE) or writes output.  Operations => collect(), count(), take(3), reduce(), saveAsTextFile()
'''

In [51]:
rdd1 = sc.parallelize([20000,30000,15000,40000,50000],2)
rdd2 = rdd1.map(lambda sal:sal+1000) #Transformation (MAP returns another RDD)
print(rdd2.count())                  #Action (COUNT triggers computation and returns RESULT)


5
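The lazy behaviour behind this cell can be modelled in plain Python. The class below, `LazyDataset`, is a hypothetical single-machine stand-in for an RDD (no partitions, no cluster): its "transformations" only record a step in a lineage tuple, and its "actions" replay that lineage to produce a value. It is a sketch of the idea, not how Spark implements RDDs.

```python
class LazyDataset:
    """Toy, hypothetical stand-in for an RDD; not part of PySpark."""

    def __init__(self, source, lineage=()):
        self._source = list(source)
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    # "Transformations": return a new LazyDataset and compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._lineage + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    # "Actions": replay the lineage and return a plain Python value.
    def collect(self):
        data = self._source
        for kind, func in self._lineage:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return list(data)

    def count(self):
        return len(self.collect())


calls = []  # records every time the mapped function actually runs

def add_bonus(sal):
    calls.append(sal)
    return sal + 1000

salaries = LazyDataset([20000, 30000, 15000, 40000, 50000])
raised = salaries.map(add_bonus)      # transformation: nothing runs yet
calls_before_action = len(calls)
result = raised.collect()             # action: the lineage is replayed now
calls_after_action = len(calls)

print(calls_before_action, calls_after_action)
print(result)
```

The function passed to `map` is not invoked until `collect()` runs, which mirrors why `rdd1.map(...)` returns immediately and `count()` is what triggers the work.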


In [None]:
'''
Q2. What are the types of Transformation?
        Active:  The number of output elements can differ from the number of input elements, e.g. filter(), flatMap(), distinct().
        Passive: The number of output elements is the same as the number of input elements, e.g. map().

#Sample rows from /home/hduser/custinfo.csv
4000001,Kristina,Chung,55,Pilot
4000002,Paige,Chen,77,Teacher
4000003,Sherri,Melton,34,Firefighter
4000004,Gretchen,Hill,66,Computer hardware engineer
'''
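The active/passive distinction can be checked on the four sample rows using plain Python list comprehensions as stand-ins for the RDD operations. This is an illustration only: the comprehensions mimic `map()`, `filter()`, and `flatMap()` on a local list, and the age threshold of 60 is an arbitrary value picked for the example.

```python
rows = [
    "4000001,Kristina,Chung,55,Pilot",
    "4000002,Paige,Chen,77,Teacher",
    "4000003,Sherri,Melton,34,Firefighter",
    "4000004,Gretchen,Hill,66,Computer hardware engineer",
]

# Passive, like rdd.map(): exactly one output element per input element.
split_rows = [row.split(",") for row in rows]

# Active, like rdd.filter(): the output can have fewer elements.
over_60 = [fields for fields in split_rows if int(fields[3]) > 60]

# Active, like rdd.flatMap(): the output can have more elements.
all_fields = [field for fields in split_rows for field in fields]

print(len(rows), len(split_rows), len(over_60), len(all_fields))
```

Only the map-style step preserves the element count; the filter-style and flatMap-style steps change it, which is what makes them active.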
