In [24]:
sc

In [27]:
spark

In [4]:
hr_employee = spark.read.csv("file:///home/hadoop/Downloads/HR_Employee.csv", inferSchema=True, header=True)

In [5]:
hr_employee.printSchema()

root
 |-- EmployeeID: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- JobInvolvement: string (nullable = true)
 |-- JobLevel: integer (nullable = true)
 |-- JobSatisfaction: string (nullable = true)
 |-- Hourlyrate: integer (nullable = true)
 |-- Income: integer (nullable = true)
 |-- Salaryhike: integer (nullable = true)
 |-- OverTime: string (nullable = true)
 |-- Workex: integer (nullable = true)
 |-- YearsSinceLastPromotion: integer (nullable = true)
 |-- EmpSatisfaction: string (nullable = true)
 |-- TrainingTimesLastYear: integer (nullable = true)
 |-- WorkLifeBalance: string (nullable = true)
 |-- Performance_Rating: string (nul

#### Big Data File Types
* Parquet file format: records are stored column-wise. Writing a .csv or other structured dataset
  as Parquet compresses it, so the resulting files are usually smaller than the original.
* Other formats work along similar lines: ORC and Avro (Avro stores records as key-value pairs and
  stores the schema as JSON).
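To make the column-wise idea concrete, here is a minimal plain-Python sketch (no Spark needed). The sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration. Parquet's real encodings are more elaborate, but they build on the same observation: values within one column tend to repeat, so a column compresses better than a row.

```python
from itertools import groupby

# Hypothetical sample rows, shaped like a few HR_Employee columns.
rows = [
    {"EmployeeID": 1, "Department": "Sales", "Age": 41},
    {"EmployeeID": 2, "Department": "Sales", "Age": 49},
    {"EmployeeID": 3, "Department": "Sales", "Age": 37},
    {"EmployeeID": 4, "Department": "HR", "Age": 33},
    {"EmployeeID": 5, "Department": "HR", "Age": 27},
]

def to_columns(records):
    """Turn a list of row dicts into one list per column (columnar layout)."""
    return {name: [record[name] for record in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows)
encoded_department = run_length_encode(columns["Department"])

# Reading one column touches only that column's values.
print(columns["Age"])
# Five strings shrink to two (value, count) pairs.
print(encoded_department)
```

A query that only needs `Age` reads one list here, whereas a row-oriented layout such as CSV would have to scan every field of every row.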

In [10]:
# Write the DataFrame loaded from the CSV file to a local directory in Parquet format
hr_employee.write.parquet('file:///home/hadoop/Downloads/HR')

In [14]:
hr_employee.write.parquet('/home/')

In [15]:
!hdfs dfs -rm -r /home

Deleted /home


In [8]:
spark.read.parquet('file:///home/hadoop/Downloads/HR/').show()

+----------+--------------------+--------------------+---------+------+---+-------------+-------------+--------------+-----------------+--------------+--------+---------------+----------+------+----------+--------+------+-----------------------+---------------+---------------------+---------------+------------------+
|EmployeeID|          Department|             JobRole|Attrition|Gender|Age|MaritalStatus|    Education|EducationField|   BusinessTravel|JobInvolvement|JobLevel|JobSatisfaction|Hourlyrate|Income|Salaryhike|OverTime|Workex|YearsSinceLastPromotion|EmpSatisfaction|TrainingTimesLastYear|WorkLifeBalance|Performance_Rating|
+----------+--------------------+--------------------+---------+------+---+-------------+-------------+--------------+-----------------+--------------+--------+---------------+----------+------+----------+--------+------+-----------------------+---------------+---------------------+---------------+------------------+
|         1|               Sales|     Sales

In [12]:
hr_employee.write.orc('/HR_Data')

In [13]:
spark.read.orc("/HR_Data").show()

+----------+--------------------+--------------------+---------+------+---+-------------+-------------+--------------+-----------------+--------------+--------+---------------+----------+------+----------+--------+------+-----------------------+---------------+---------------------+---------------+------------------+
|EmployeeID|          Department|             JobRole|Attrition|Gender|Age|MaritalStatus|    Education|EducationField|   BusinessTravel|JobInvolvement|JobLevel|JobSatisfaction|Hourlyrate|Income|Salaryhike|OverTime|Workex|YearsSinceLastPromotion|EmpSatisfaction|TrainingTimesLastYear|WorkLifeBalance|Performance_Rating|
+----------+--------------------+--------------------+---------+------+---+-------------+-------------+--------------+-----------------+--------------+--------+---------------+----------+------+----------+--------+------+-----------------------+---------------+---------------------+---------------+------------------+
|         1|               Sales|     Sales

#### Optimization techniques
* Tuning Spark jobs can significantly improve the performance of Spark queries and jobs.

1. Partitioning
   * Partitioning divides data into smaller chunks, which can be processed in parallel.

In [16]:
partitioned_df = hr_employee.repartition(3)

In [17]:
partitioned_df.write.parquet("/HRPartition")
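As a rough illustration of what splitting data into three partitions buys you, the plain-Python sketch below deals values into three lists and sums each list on its own worker. The function `round_robin_partition` and the income values are hypothetical. It is only a model of the idea, not of Spark's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def round_robin_partition(items, num_partitions):
    """Deal items into num_partitions lists, one at a time, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        partitions[index % num_partitions].append(item)
    return partitions

# Made-up income values standing in for a DataFrame column.
incomes = [5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693, 9526, 5237]
partitions = round_robin_partition(incomes, 3)

# Each partition is summed independently, then the partial sums are combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)
partition_sizes = [len(part) for part in partitions]
print(partition_sizes, total)
```

Because every partition can be handled without looking at the others, adding workers (executors, in Spark's case) lets the partial results be computed at the same time.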

#### 3. Caching and Persistence
* Caching keeps a DataFrame around for reuse across actions; persist() lets you choose the storage level (memory, disk, or both).

In [18]:
hr_employee.cache()

DataFrame[EmployeeID: int, Department: string, JobRole: string, Attrition: string, Gender: string, Age: int, MaritalStatus: string, Education: string, EducationField: string, BusinessTravel: string, JobInvolvement: string, JobLevel: int, JobSatisfaction: string, Hourlyrate: int, Income: int, Salaryhike: int, OverTime: string, Workex: int, YearsSinceLastPromotion: int, EmpSatisfaction: string, TrainingTimesLastYear: int, WorkLifeBalance: string, Performance_Rating: string]

In [19]:
# Persist the DataFrame with a specific storage level
from pyspark import StorageLevel
hr_employee.persist(StorageLevel.MEMORY_AND_DISK)

DataFrame[EmployeeID: int, Department: string, JobRole: string, Attrition: string, Gender: string, Age: int, MaritalStatus: string, Education: string, EducationField: string, BusinessTravel: string, JobInvolvement: string, JobLevel: int, JobSatisfaction: string, Hourlyrate: int, Income: int, Salaryhike: int, OverTime: string, Workex: int, YearsSinceLastPromotion: int, EmpSatisfaction: string, TrainingTimesLastYear: int, WorkLifeBalance: string, Performance_Rating: string]
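Why caching helps is easiest to see with a toy model. The class `LazyDataset` below is a hypothetical stand-in for a lazily evaluated DataFrame, written in plain Python. It counts how often the source is recomputed with and without a cache, assuming (as in Spark) that each action re-runs the lineage unless the result was cached.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated DataFrame."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._use_cache = False
        self._cached = None
        self.load_count = 0  # how many times the source was recomputed

    def cache(self):
        """Mark the dataset for caching; nothing is computed yet."""
        self._use_cache = True
        return self

    def _materialize(self):
        # Reuse the stored result if caching is on and a result exists.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.load_count += 1
        data = self._load_fn()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())

# Two actions without caching: the source is loaded twice.
uncached = LazyDataset(lambda: [1, 2, 3, 4])
uncached.count()
uncached.total()

# Two actions with caching: the source is loaded once.
cached = LazyDataset(lambda: [1, 2, 3, 4]).cache()
cached.count()
cached.total()

print(uncached.load_count, cached.load_count)
```

Note that `cache()` itself does no work in this model; the first action fills the cache, which mirrors the lazy behaviour of `hr_employee.cache()`.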

#### 4. Serialization
* Efficient serialization reduces the time needed to read and write data and to transfer it over the network. Kryo serialization is a popular alternative that performs better than the default Java serialization.

a) Java serialization: the default serialization method. It is easy to use, but it slows down reads and writes and can produce large serialized output.

In [26]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Java Serialization Example").getOrCreate()

b) Kryo serialization: faster and more compact than Java serialization.

In [21]:
"""
spark = SparkSession.builder.config("spark.serializer","org.apache.spark.serializer.KryoSerializer")\
.config("spark.kryo.registrationRequired","true")\
.config("spark.kryo.classesToRegister","org.apache.spark.example.Person")\
.appName("Kryo Serialization Example").getOrCreate()
"""

In [None]:
# spark.stop()

#### 5. Broadcasting variables
* Broadcasting a small dataset to every executor avoids shuffling the large side of a join, which improves join performance.

In [31]:
small_df = spark.read.csv("file:///home/hadoop/Downloads/airports.csv", inferSchema=True, header= True)
df = spark.read.csv("file:///home/hadoop/Downloads/raw_flight_data.csv", inferSchema=True, header= True)

In [32]:
df.printSchema()

root
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- OriginAirportID: integer (nullable = true)
 |-- DestAirportID: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)



In [34]:
from pyspark.sql.functions import broadcast
broadcast_df = broadcast(small_df)

In [35]:
broadcast_df.cache()
df.cache()

DataFrame[DayofMonth: int, DayOfWeek: int, Carrier: string, OriginAirportID: int, DestAirportID: int, DepDelay: int, ArrDelay: int]

In [36]:
df.join(broadcast_df, df.OriginAirportID == broadcast_df.airport_id)

DataFrame[DayofMonth: int, DayOfWeek: int, Carrier: string, OriginAirportID: int, DestAirportID: int, DepDelay: int, ArrDelay: int, airport_id: int, city: string, state: string, name: string]
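Conceptually, a broadcast join copies the small table to every worker, builds a hash lookup from it, and then probes that lookup once per row of the large table. The plain-Python sketch below models this; the function `broadcast_hash_join`, the airport IDs, and the flight rows are invented for illustration and only borrow column names from the schemas above.

```python
# Hypothetical small lookup table (like airports.csv) and larger fact table.
airports = [
    {"airport_id": 10140, "city": "Albuquerque"},
    {"airport_id": 10397, "city": "Atlanta"},
]
flights = [
    {"Carrier": "DL", "OriginAirportID": 10397, "DepDelay": 4},
    {"Carrier": "WN", "OriginAirportID": 10140, "DepDelay": -3},
    {"Carrier": "AA", "OriginAirportID": 99999, "DepDelay": 12},  # no matching airport
]

def broadcast_hash_join(large_rows, small_rows, large_key, small_key):
    """Inner join: hash the small side once, then probe it for every large row."""
    lookup = {row[small_key]: row for row in small_rows}
    joined = []
    for row in large_rows:
        match = lookup.get(row[large_key])
        if match is not None:
            joined.append({**row, **match})
    return joined

joined = broadcast_hash_join(flights, airports, "OriginAirportID", "airport_id")
for row in joined:
    print(row["Carrier"], row["city"])
```

The large table is scanned exactly once and never regrouped by key, which is the work a regular shuffle join would otherwise have to do.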

#### 7. Level of parallelism

In [37]:
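# Adjust the level of parallelism to your cluster size.
# Note: spark.default.parallelism is normally read when the SparkContext starts, so it is
# usually set when the session is built; DataFrame shuffles use spark.sql.shuffle.partitions.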
spark.conf.set("spark.default.parallelism", 100)

#### 8. Avoid groupByKey
* Use reduceByKey() or aggregateByKey() instead of groupByKey() to reduce shuffling: they combine values within each partition before the data is shuffled.

In [38]:
rdd = spark.sparkContext.parallelize([('dosa',2),('salad',3),('idli',1),('dosa',3),('chocolates',4),('idli',2)])
rdd.reduceByKey(lambda x,y:x+y).collect()


[('salad', 3), ('chocolates', 4), ('dosa', 5), ('idli', 3)]
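The difference in shuffle volume can be modelled in plain Python. The sketch below uses the same key-value pairs as the cell above, laid out in two hypothetical partitions, and counts how many records would cross the network under each strategy. The function names are illustrative, not Spark APIs.

```python
from collections import defaultdict

# Same pairs as the RDD above, in an assumed two-partition layout.
partitions = [
    [("dosa", 2), ("dosa", 3), ("idli", 1), ("idli", 2)],
    [("salad", 3), ("chocolates", 4)],
]

def group_by_key_style(parts):
    """Ship every record across the 'network', then sum per key."""
    shuffled = [pair for part in parts for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def reduce_by_key_style(parts):
    """Sum per key inside each partition first, then ship the partial sums."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = group_by_key_style(partitions)
reduced_totals, reduced_shuffled = reduce_by_key_style(partitions)
print(grouped_shuffled, reduced_shuffled)
```

Both strategies give the same totals, but the second one shuffles fewer records. How much is saved depends on how often a key repeats within a partition.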

#### 8. Reduce shuffle
* Reduce the number of shuffles by choosing transformations carefully.
* Prefer reduceByKey() over groupByKey().
* Prefer map() and reduce() over groupBy() where possible.

#### 9. Repartition and coalesce
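Both operations change the number of partitions. `repartition` redistributes every row (a full shuffle) and can increase or decrease the count, while `coalesce` merges existing partitions without a full shuffle and can only decrease it. The plain-Python model below is a simplified sketch of that difference; the two functions are hypothetical and ignore details such as data locality.

```python
def repartition(partitions, n):
    """Full shuffle: every row is redistributed round-robin over n partitions."""
    rows = [row for part in partitions for row in part]
    result = [[] for _ in range(n)]
    for index, row in enumerate(rows):
        result[index % n].append(row)
    return result

def coalesce(partitions, n):
    """No shuffle: whole existing partitions are merged into n groups."""
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    result = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        result[index % n].extend(part)
    return result

original = [[1, 2], [3, 4], [5, 6], [7, 8]]

coalesced = coalesce(original, 2)
repartitioned = repartition(original, 2)
print(coalesced)      # whole input partitions stay together
print(repartitioned)  # rows are spread evenly regardless of origin
```

In this model `coalesce` moves partitions as units, so it is cheaper but may leave uneven sizes; `repartition` pays for moving every row and gets an even spread in return.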

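#### 10. Accumulators
* Use an accumulator to aggregate information such as counts and sums across executors. The built-in accumulator only adds values, so statistics like max(), std(), or corr() need a custom accumulator or a regular aggregation.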

In [40]:
acc = spark.sparkContext.accumulator(0)

In [41]:
type(acc)

pyspark.accumulators.Accumulator

In [66]:
rdd = spark.sparkContext.parallelize([1,2,3,4,5,6,7,8])

In [67]:
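# Plain Python function that adds each element to the accumulator.
# Named add_to_acc so it does not shadow the built-in sum().
def add_to_acc(x):
    acc.add(x)

In [68]:
rdd.foreach(add_to_acc)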

In [69]:
print(acc.value)

216
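An accumulator is not reset between actions, so re-running the `rdd.foreach(...)` cell on the same `acc` keeps adding to the earlier total. The plain-Python sketch below shows the effect; `ToyAccumulator` and `foreach` are hypothetical stand-ins, not PySpark APIs. One pass over 1 to 8 adds 36, so a printed value such as 216 would be consistent with six passes over the same accumulator.

```python
class ToyAccumulator:
    """Minimal add-only counter, loosely modelled on a Spark accumulator."""

    def __init__(self, initial):
        self.value = initial

    def add(self, amount):
        self.value += amount

def foreach(items, fn):
    """Apply fn to every item for its side effect, like rdd.foreach."""
    for item in items:
        fn(item)

numbers = [1, 2, 3, 4, 5, 6, 7, 8]
toy_acc = ToyAccumulator(0)

# First pass adds 1 + 2 + ... + 8.
foreach(numbers, toy_acc.add)
after_first_run = toy_acc.value

# Second pass reuses the same accumulator, so the total keeps growing.
foreach(numbers, toy_acc.add)
after_second_run = toy_acc.value

print(after_first_run, after_second_run)
```

Creating a fresh accumulator before each run is the simplest way to get a per-run total.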


#### 11. Bucketing
* Bucketing splits a large dataset into a fixed number of buckets based on a column, which makes queries and joins on that column more efficient.

In [72]:
df.write.bucketBy(7, 'DayOfWeek').saveAsTable("fbucket_table")
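The idea behind `bucketBy(7, 'DayOfWeek')` can be sketched in plain Python: each row goes to the bucket given by a hash of the key modulo the bucket count, so rows with the same key always land in the same bucket. The helper `bucket_rows` and the sample rows are hypothetical, and Python's built-in `hash` is used here only as a stand-in for Spark's own hash function.

```python
def bucket_rows(rows, key, num_buckets):
    """Assign each row to bucket hash(row[key]) % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets

# Made-up rows using two columns from the flight schema.
flights = [
    {"DayOfWeek": 1, "Carrier": "DL"},
    {"DayOfWeek": 5, "Carrier": "WN"},
    {"DayOfWeek": 1, "Carrier": "AA"},
    {"DayOfWeek": 7, "Carrier": "UA"},
    {"DayOfWeek": 5, "Carrier": "DL"},
]

buckets = bucket_rows(flights, "DayOfWeek", 7)

# A lookup for DayOfWeek == 1 only needs to read one bucket.
day1_bucket = buckets[hash(1) % 7]
print([row["Carrier"] for row in day1_bucket])
```

When two tables are bucketed the same way on the join column, matching keys sit in buckets with the same index, so a join can pair buckets up instead of shuffling both tables.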

In [71]:
!hdfs dfs -

root
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- OriginAirportID: integer (nullable = true)
 |-- DestAirportID: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)

