Hadoop vs. Spark

| Parameter       | Hadoop                                                                                         | Spark                                                                                                         |
|-----------------|------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| Performance     | Slower than Spark. MapReduce writes intermediate data to disk and reads it back between steps. | Faster than Hadoop because Spark keeps intermediate data in memory where possible.                            |
| Batch/Streaming | Built for batch data processing.                                                               | Built for batch as well as streaming data processing.                                                         |
| Ease of Use     | Code is difficult to write in Hadoop MapReduce. Hive was built to make it easier.              | Easy to write and debug code. Interactive shell to develop and test. Spark provides high- and low-level APIs. |
| Security        | Uses Kerberos authentication and ACL authorization (YARN).                                     | No strong security of its own; relies on HDFS (ACL) and YARN (Kerberos).                                      |
| Fault Tolerance | Stores data in blocks (128 MB) and uses a replication factor to handle failures.               | Uses the DAG (lineage) to recompute lost data and provide fault tolerance.                                    |

Read Data In Spark
1. Format (Optional) -> CSV, JSON, JDBC/ODBC, Table, Parquet. If you do not specify a format, Spark reads the data as Parquet by default.
2. Option (Optional) -> inferSchema, mode, header.
3. Schema (Optional) -> Custom schema can be used.
4. Load -> File path.

Read Mode:
1. FAILFAST: Fails execution if the dataset contains a malformed record.
2. DROPMALFORMED: Drops the corrupted records.
3. PERMISSIVE: Default mode. Sets all corrupted fields to null.
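
The three modes can be illustrated with a small plain-Python sketch. This is not Spark's parser: `read_rows` and the sample lines are made up for illustration, and a record counts as malformed here only when its age is not an integer.

```python
# Plain-Python illustration of the three read modes; this is not Spark's parser.
def read_rows(lines, mode="PERMISSIVE"):
    """Parse 'name,age' lines, treating a non-integer age as malformed."""
    rows = []
    for line in lines:
        name, raw_age = line.split(",")
        try:
            age = int(raw_age)
        except ValueError:
            if mode == "FAILFAST":
                # Stop at the first malformed record.
                raise ValueError(f"malformed record: {line!r}")
            if mode == "DROPMALFORMED":
                continue  # skip the whole record
            age = None  # PERMISSIVE: keep the record, null the bad field
        rows.append((name, age))
    return rows


# Made-up sample data; the second record is malformed.
sample = ["asha,31", "ravi,abc", "meena,27"]
permissive = read_rows(sample, "PERMISSIVE")
dropped = read_rows(sample, "DROPMALFORMED")
```

With this sample, PERMISSIVE keeps all three records and nulls the bad age, DROPMALFORMED keeps two records, and FAILFAST raises on the second line.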

In [21]:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, FloatType
from pyspark.sql import functions as Fun


In [22]:
spark = SparkSession.builder.master("local[*]").appName("SparkSumbittest").getOrCreate()

In [23]:
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferschema", "true")\
                .option("mode", "PERMISSIVE")\
                .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")

In [24]:
df.show(5)

+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|
|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|
| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|
| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|
+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+

Spark Job Submit (Assumptions: DriverMemory-20GB, TotalExecutors-5, ExecutorCores-4, ExecutorMemory-25GB)
1. The master node first creates the driver on one of the worker nodes.
2. The driver is also known as the Application Driver. Spark is written in Scala, and Scala runs on the JVM. Inside the driver container there are 2 main methods, one for PySpark and one for the JVM. Spark Core -> Java Wrapper -> Python Wrapper. The JVM process is called the application driver and the Python process is called the PySpark driver.
3. The driver then checks the requested executor details and sends the request to the resource manager.
4. The resource manager sends the request to the node managers (workers), which create 5 executors on the idle workers.
5. The application driver sends data and other details to the executors for processing.
6. All the executors send their computed results to the driver.
7. In the end, all the containers (driver and executors) are deleted.
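
As a quick check on what these assumptions ask of the cluster, the arithmetic can be sketched in plain Python. The variable names are made up for illustration, and the totals ignore any per-container memory overhead that a cluster manager may add.

```python
# Assumed values taken from the heading above.
driver_memory_gb = 20
num_executors = 5
executor_cores = 4
executor_memory_gb = 25

total_executor_cores = num_executors * executor_cores          # 5 * 4
total_executor_memory_gb = num_executors * executor_memory_gb  # 5 * 25
total_memory_gb = driver_memory_gb + total_executor_memory_gb  # 20 + 125

print(total_executor_cores, total_executor_memory_gb, total_memory_gb)
# prints: 20 125 145
```

So the job requests about 145 GB in total, and with one core per task the 5 executors can run up to 20 tasks at the same time.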

Note: Avoid writing/using UDFs in PySpark. A UDF requires a Python worker inside the executor container, which hurts performance. Prefer built-in functions whenever one exists.
  


Spark Schema: There are 2 ways to define a schema
1. Using StructType & StructField
    i. StructType: Defines the structure of the dataframe. It is a list of StructField.
    ii. StructField: Defines a column's name, data type, and nullability.
    Example: StructType([StructField("id", IntegerType(), True), StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
2. Using DDL: A quoted string of comma-separated columns, each written as the column name, a space, and the data type.
    Example: "id integer, name string, age integer"
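
To make the link between the two notations concrete, here is a plain-Python sketch that splits a flat DDL string into (name, type) pairs, the same information a list of StructField entries carries. `parse_ddl` is a hypothetical helper, not part of the Spark API, and it handles only simple column lists without nested or parameterized types.

```python
def parse_ddl(ddl):
    """Split a flat DDL string such as 'id integer, name string' into (name, type) pairs."""
    fields = []
    for column in ddl.split(","):
        # Each column is "<name> <type>", separated by whitespace.
        name, data_type = column.split()
        fields.append((name, data_type))
    return fields


fields = parse_ddl("id integer, name string, age integer")
print(fields)
# prints: [('id', 'integer'), ('name', 'string'), ('age', 'integer')]
```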

In [39]:
# duration holds decimal values (e.g. 2.17), so it is declared as FloatType rather than IntegerType
flight_schema = StructType([StructField("airline", StringType(), True), StructField("flight", StringType(), True),
                            StructField("source_city", StringType(), True), StructField("departure_time", StringType(), True),
                            StructField("stops", StringType(), True), StructField("arrival_time", StringType(), True),
                            StructField("destination_city", StringType(), True), StructField("class", StringType(), True),
                            StructField("duration", FloatType(), True), StructField("days_left", IntegerType(), True), StructField("price", IntegerType(), True)])

In [45]:
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferschema", "false")\
                .schema(flight_schema)\
                .option("mode", "PERMISSIVE")\
                .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")
df.printSchema()
df.count()

root
 |-- airline: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- source_city: string (nullable = true)
 |-- departure_time: string (nullable = true)
 |-- stops: string (nullable = true)
 |-- arrival_time: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- class: string (nullable = true)
 |-- duration: float (nullable = true)
 |-- days_left: integer (nullable = true)
 |-- price: integer (nullable = true)



300153

In [46]:
df1 = df.filter(df["departure_time"]=="Evening")    # Transformation
df1.count()     # Action

65102

Transformations are triggered only when an action is called.
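
The idea can be sketched without Spark. `LazyDataset` below is a toy stand-in written for this note, not a Spark class: `filter` only records the predicate, and `count` is the action that finally runs it. The sample rows are made up.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: records transformations, runs them on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate):
        # Transformation: return a new plan and touch no data.
        return LazyDataset(self._rows, self._steps + (predicate,))

    def count(self):
        # Action: only now is every recorded predicate evaluated.
        return sum(1 for row in self._rows if all(p(row) for p in self._steps))


calls = []  # records every row the predicate actually sees


def is_evening(row):
    calls.append(row)
    return row["departure_time"] == "Evening"


data = [
    {"departure_time": "Evening"},
    {"departure_time": "Morning"},
    {"departure_time": "Evening"},
]
evening = LazyDataset(data).filter(is_evening)  # nothing evaluated yet
calls_before_action = len(calls)                # still 0
evening_count = evening.count()                 # predicate runs here, once per row
```

Before `count()` the predicate has not been called at all; afterwards it has been called once for each of the three rows.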

Types Of Transformations
1. Narrow -> Transformation that doesn't require data movement between partitions. E.g. Filter, Select, Union, Map
2. Wide -> Transformation that requires data movement between partitions. E.g. Join, GroupBy, Distinct. A wide transformation needs to shuffle data between the partitions.
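
The difference can be sketched in plain Python with two made-up partitions. This is only an illustration of the data movement, not how Spark implements it; `shuffle_by_key` is a hypothetical helper and the CRC32-based routing is an assumption chosen to keep the example deterministic.

```python
from collections import defaultdict
from zlib import crc32

# Two made-up partitions of (departure_time, price) records.
partitions = [
    [("Evening", 5953), ("Morning", 5955)],
    [("Evening", 5956), ("Morning", 5955), ("Night", 6000)],
]

# Narrow: filter runs inside each partition; no record leaves its partition.
filtered = [[r for r in part if r[0] == "Evening"] for part in partitions]


# Wide: grouping by key needs every record with the same key in one place,
# so records are first routed (shuffled) to a target partition by key hash.
def shuffle_by_key(parts, num_out):
    out = [[] for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            out[crc32(key.encode()) % num_out].append((key, value))
    return out


shuffled = shuffle_by_key(partitions, 2)

# After the shuffle each partition holds complete keys and can be counted on its own.
counts = defaultdict(int)
for part in shuffled:
    for key, _ in part:
        counts[key] += 1
```

The filter leaves "Evening" rows where they already were, while the group-by count is only possible because the shuffle first brought both "Evening" rows (and both "Morning" rows) into the same partition.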