### Hadoop vs. Spark

| Parameter       | Hadoop                                                                                        | Spark                                                                                                         |
|-----------------|-----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| Performance     | Hadoop is slower than Spark because it writes intermediate data back to disk and reads it again into memory. | Spark is faster than Hadoop because it performs computation in memory. |
| Batch/Streaming | Built for batch data processing. | Built for batch as well as streaming data processing. |
| Ease of Use     | Writing code for Hadoop is difficult; Hive was built to make it easier. | Code is easy to write and debug. An interactive shell is available for development and testing. Spark provides both high-level and low-level APIs. |
| Security        | Uses Kerberos authentication and ACL authorization (YARN). | Has no strong security of its own; relies on HDFS (ACLs) and YARN (Kerberos). |
| Fault Tolerance | Stores data in blocks (128 MB) and uses a replication factor to handle failures. | Uses the DAG to provide fault tolerance. |

### Read CSV Data In Spark
1. Format (optional) -> CSV, JSON, JDBC/ODBC, table, Parquet. Parquet is the default read format.
2. Option (optional) -> inferSchema, mode, header.
3. Schema (optional) -> A custom schema can be supplied.
4. Load -> File path.

Read modes:
1. FAILFAST: Fails execution if the dataset contains a malformed record.
2. DROPMALFORMED: Drops corrupted records.
3. PERMISSIVE: Default mode. Sets all corrupted fields to null.
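
The three read modes can be mimicked in plain Python to make the difference concrete. This is only a sketch of the behaviour, not Spark's implementation: `read_with_mode`, the two-column `name,age` layout, and the sample rows are all made up for illustration.

```python
import csv
import io


def read_with_mode(text, mode="PERMISSIVE"):
    """Parse 'name,age' CSV text where age must be an integer (hypothetical helper)."""
    rows = []
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    for record in reader:
        try:
            if len(record) != 2:
                raise ValueError("wrong number of fields")
            rows.append((record[0], int(record[1])))
        except ValueError:
            if mode == "FAILFAST":
                raise  # abort on the first malformed record
            if mode == "DROPMALFORMED":
                continue  # skip the malformed record
            # PERMISSIVE: keep the row, set the unparseable field to None
            name = record[0] if record else None
            rows.append((name, None))
    return rows


# Made-up sample data: the second record has a non-integer age.
sample = "name,age\nasha,31\nravi,not_a_number\nmeera,27\n"
permissive_rows = read_with_mode(sample, "PERMISSIVE")
dropped_rows = read_with_mode(sample, "DROPMALFORMED")
```

With `FAILFAST`, the same call raises a `ValueError` as soon as it reaches the malformed record.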

In [10]:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, FloatType
from pyspark.sql import functions as Fun


In [11]:
spark = SparkSession.builder.master("local[*]").appName("SparkSumbittest").getOrCreate()

In [12]:
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferschema", "true")\
                .option("mode", "PERMISSIVE")\
                .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")
df.show(5)

+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|
|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|
| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|
| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|
+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+

### Spark Job Submit 

(Assumptions: DriverMemory = 20 GB, TotalExecutors = 5, ExecutorCores = 4, ExecutorMemory = 25 GB)

1. The master node first creates the driver on one of the worker nodes.
2. The driver is also known as the application driver. Spark is written in Scala, which runs on the JVM. Inside the driver container there are two main methods, one for PySpark and one for the JVM (Spark Core -> Java wrapper -> Python wrapper). The JVM one is called the application driver and the PySpark one is called the PySpark driver.
3. The driver then checks the executor details and sends a request to the resource manager.
4. The resource manager sends the request to the node managers (workers), which create 5 executors on idle workers.
5. The application driver sends the data and other details to the executors for processing.
6. All executors send their computed results to the driver.
7. At the end, all driver and executor containers are deleted.

Note: Avoid writing or using UDFs in PySpark. A UDF requires a Python worker inside the executor container, which hurts performance. Prefer built-in functions.
  


### Spark Schema

There are two ways to define a schema:
1. Using StructType & StructField
    i. StructType: Defines the structure of the DataFrame. It is a list of StructField objects.
    ii. StructField: Defines a column's name, data type, and nullability.
    Example: StructType([StructField("id", IntegerType(), True), StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
2. Using DDL: A quoted string of comma-separated columns, each written as the column name, a space, and the data type.
    Example: "id integer, name string, age integer"

In [13]:
flight_schema = StructType([StructField("airline", StringType(), True), StructField("flight", StringType(), True), \
                            StructField("source_city", StringType(), True), StructField("departure_time", StringType(), True), \
                            StructField("stops", StringType(), True), StructField("arrival_time", StringType(), True), \
                            StructField("destination_city", StringType(), True), StructField("class", StringType(), True), \
                            StructField("duration", IntegerType(), True), StructField("days_left", IntegerType(), True), StructField("price", IntegerType(), True)])

In [14]:
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferschema", "false")\
                .schema(flight_schema)\
                .option("mode", "PERMISSIVE")\
                .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")
df.printSchema()
df.count()

root
 |-- airline: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- source_city: string (nullable = true)
 |-- departure_time: string (nullable = true)
 |-- stops: string (nullable = true)
 |-- arrival_time: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- class: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- days_left: integer (nullable = true)
 |-- price: integer (nullable = true)



300153

### Transformation & Action

Transformations are executed only when an action is called.

Transformation: Filter, Select, Union, Join, GroupBy, Distinct.

Action: count, collect, show, read (when Spark has to scan the file, for example to infer the schema).

Types of Transformations
1. Narrow -> A transformation that does not require data movement between partitions. E.g. Filter, Select, Union, Map.
2. Wide -> A transformation that requires data movement between partitions. E.g. Join, GroupBy, Distinct. A wide transformation shuffles data between the partitions.

Data Shuffling: Data is transferred over the network between partitions, which makes it a very expensive operation.

Note: When an action is executed, the output data is collected by the driver. Driver memory should always be larger than the action's output data; otherwise a memory error occurs.
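
The difference between narrow and wide transformations can be sketched with ordinary Python lists standing in for partitions. This is a toy model, not Spark code: `narrow_filter`, `shuffle_by_key`, `sum_by_key`, and the `(stops, price)` values are hypothetical, and a CRC32 hash is assumed here only to get a deterministic partition number.

```python
import zlib

# Three partitions of (stops, price) rows; the values are made up.
partitions = [
    [("zero", 100), ("one", 250)],
    [("one", 300), ("zero", 120)],
    [("two_or_more", 400), ("one", 50)],
]


def narrow_filter(parts, predicate):
    # Narrow: each output partition is built from exactly one input partition.
    return [[row for row in part if predicate(row)] for part in parts]


def shuffle_by_key(parts, num_partitions):
    # Wide: rows move between partitions so that equal keys end up together.
    out = [[] for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            target = zlib.crc32(key.encode("utf-8")) % num_partitions
            out[target].append((key, value))
    return out


def sum_by_key(parts):
    # A groupBy-style sum needs the shuffle before it can aggregate per key.
    totals = {}
    for part in shuffle_by_key(parts, len(parts)):
        for key, value in part:
            totals[key] = totals.get(key, 0) + value
    return totals


filtered = narrow_filter(partitions, lambda row: row[1] >= 120)
totals = sum_by_key(partitions)
```

The filter never looks outside its own partition, while the sum has to regroup every row by key first, which is the step that costs network traffic in a real cluster.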

In [15]:
df1 = df.filter(df["departure_time"]=="Evening")    # Transformation
df1.count()     # Action

65102

### DAG & Lazy Evaluation

DAG -> Directed Acyclic Graph. It never contains a loop, and execution flows in one direction through a tree-like structure.

On every action a job is created, and each job has its own DAG.

df = (
    spark.read.format("csv")                # Action
    .option("header", "true")
    .option("inferschema", "false")         # Action
    .schema(flight_schema)
    .option("mode", "PERMISSIVE")
    .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")
)
data_repartition = df.repartition(3)        # Wide transformation
df1 = df.filter(df["departure_time"]=="Evening")        # Narrow transformation
df = df.filter((Fun.col("destination_city")=="Mumbai") | (Fun.col("destination_city")=="Delhi"))        # Narrow transformation
df = df.groupby("stops").sum("price")           # Wide transformation
df.show()       # Action

After the above code is executed, it will create 4 jobs: 3 for actions and 1 for _______. The DAG can be viewed in the Spark UI.
1. Read -> Reading and generating Java bytecode.
2. Inferschema -> Mapping
3. Show

Wide and narrow transformations run only when an action is called. This is called lazy evaluation. Lazy evaluation also lets Spark optimize the code; for example, filters on departure_time and destination_city applied one after another can be merged into a single filter.
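
Lazy evaluation can be illustrated with a small plain-Python class that only records filters and runs them in a single pass when an action is called. This is a toy stand-in, not how Spark is implemented: `LazyFrame`, `is_evening`, `to_metro`, and the three sample rows are hypothetical.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records a plan, runs it on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = tuple(plan)  # recorded predicates, not yet executed

    def filter(self, predicate):
        # Transformation: returns a new object with a longer plan, touches no data.
        return LazyFrame(self._rows, self._plan + (predicate,))

    def collect(self):
        # Action: one pass over the data with all recorded filters applied together.
        return [r for r in self._rows if all(p(r) for p in self._plan)]


calls = []  # records every row the first predicate is evaluated on


def is_evening(row):
    calls.append(row)
    return row["departure_time"] == "Evening"


def to_metro(row):
    return row["destination_city"] in ("Mumbai", "Delhi")


# Made-up rows for illustration.
flights = [
    {"departure_time": "Evening", "destination_city": "Mumbai"},
    {"departure_time": "Morning", "destination_city": "Delhi"},
    {"departure_time": "Evening", "destination_city": "Chennai"},
]

lazy = LazyFrame(flights).filter(is_evening).filter(to_metro)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = lazy.collect()           # the action triggers the work
```

Both filters are checked in the same loop over the rows, which is the kind of merging an optimizer can do once it sees the whole plan before running anything.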

In [16]:
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferschema", "false")\
                .schema(flight_schema)\
                .option("mode", "PERMISSIVE")\
                .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")

data_repartition = df.repartition(3)

df1 = df.filter(df["departure_time"]=="Evening") 

df = df.filter((Fun.col("destination_city")=="Mumbai") | (Fun.col("destination_city")=="Delhi"))

df = df.groupby("stops").sum("price")

df.show()

+-----------+----------+
|      stops|sum(price)|
+-----------+----------+
|two_or_more|  26808235|
|        one|2098748431|
|       zero| 195028713|
+-----------+----------+



### Spark SQL Engine

Spark SQL Engine / Catalyst Optimizer:

SQL / DataFrame / Dataset -> Spark SQL Engine / Catalyst Optimizer -> RDD Java bytecode

4 phases of the Spark SQL Engine:
1. Analysis: Linked with the catalog. It checks whether the table, columns, or path exist. If not, it throws an AnalysisException.
2. Logical Optimization: Thanks to lazy evaluation, Spark optimizes the code automatically. For example, it merges multiple filters into a single filter, or, when a computation needs only 2 columns, it reads only those 2 columns.
3. Physical Planning: Spark creates multiple physical plans and uses a cost model to choose the best one. For example, when a big table is joined with a small one, it broadcasts the small table to avoid shuffling.
4. Code Generation

<u>Code</u> --------> <u>Unresolved Logical Plan</u> ----Analysis----> <u>Resolved Logical Plan</u> ----Logical-Optimization----> <u>Optimized Logical Plan</u> --------> <u>Physical Plans</u> ----Cost-Model----> <u>Best Physical Plan</u> --------> <u>Final Code</u>

Catalog -> The metadata about the data.

Analysis Exception Error:



In [17]:
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferschema", "false")\
                .schema(flight_schema)\
                .option("mode", "PERMISSIVE")\
                .load("/home/manish/Documents/VSCodeProjects/SparkTutorial/flight_data.csv")
df.select("name1").count()      # The column does not exist, so this raises an AnalysisException.

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `name1` cannot be resolved. Did you mean one of the following? [`class`, `price`, `stops`, `flight`, `airline`].;
'Project ['name1]
+- Relation [airline#435,flight#436,source_city#437,departure_time#438,stops#439,arrival_time#440,destination_city#441,class#442,duration#443,days_left#444,price#445] csv


### RDD (Resilient Distributed Dataset)

An RDD is used when full control over the data is required. An RDD is a data structure in Spark. Example: an RDD distributes a list across different nodes for processing.

Resilient -> Recovers in case of failure. <BR>
Distributed -> Data is spread over the cluster. <BR>
Dataset -> The actual data, held in different partitions. <BR>

An RDD is immutable. RDD2 is created from RDD1: a filter creates a new RDD (RDD2) without losing the unfiltered data in RDD1. <br>
Example: <u>RDD1 (100 Rows Data)</u> --Filter-On-Age-Column--> <u>RDD2 (70 Rows Data)</u> --Filter-On-City-Column--> <u>RDD3 (30 Rows Data)</u> <br>
Suppose RDD3 fails. Through the DAG, Spark knows how to create RDD3 from RDD2, so it automatically recreates RDD3. This feature is called fault tolerance. <br>
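
The recovery idea can be sketched in plain Python: each derived dataset remembers its parent and the step that produced it, so lost data can be rebuilt. This is a toy model under simplifying assumptions, not Spark's RDD class; `ToyRDD`, `lose_data`, and the sample rows are hypothetical.

```python
class ToyRDD:
    """Immutable toy RDD that remembers its parent and the step that built it."""

    def __init__(self, data=None, parent=None, step=None):
        self._data = data      # materialized rows, or None if lost / not yet computed
        self.parent = parent
        self.step = step

    def filter(self, predicate):
        # Returns a new ToyRDD; the current one is left untouched.
        return ToyRDD(parent=self, step=lambda rows: [r for r in rows if predicate(r)])

    def compute(self):
        if self._data is None:
            # Rebuild from lineage: recompute the parent, then re-apply the step.
            self._data = self.step(self.parent.compute())
        return list(self._data)

    def lose_data(self):
        # Simulate a failure; only derived data can be rebuilt from lineage.
        if self.parent is not None:
            self._data = None


# Made-up rows for illustration.
rdd1 = ToyRDD(data=[
    {"age": 17, "city": "Delhi"},
    {"age": 34, "city": "Delhi"},
    {"age": 41, "city": "Mumbai"},
])
rdd2 = rdd1.filter(lambda r: r["age"] >= 18)         # filter on the age column
rdd3 = rdd2.filter(lambda r: r["city"] == "Delhi")   # filter on the city column

before_failure = rdd3.compute()
rdd3.lose_data()
after_recovery = rdd3.compute()   # rebuilt from rdd2 through the recorded step
```

Because `rdd1` and `rdd2` are never modified, replaying the recorded step always gives the same `rdd3`.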

Advantages:
1. Best for unstructured data.
2. It is type safe (in Scala or Java). A column error is caught at compile time, whereas a DataFrame throws the error at run time, possibly after 2 hours of code execution.
3. Flexibility & control.


Disadvantages:
1. Spark performs no optimization. The developer needs to write the optimizations.
2. With an RDD you specify "how to"; with a DataFrame you specify "what to".
3. RDD code is very difficult to write.

| Method          | Code                                                                                          | 
|-----------------|-----------------------------------------------------------------------------------------------|
| Dataframe       | data.groupBy("dept").avg("age")                                                               |
| SQL             | SELECT dept, avg(age) from data group by dept                                                 |
| RDD             | data.map { case (dept, age) => dept -> (age, 1) } <br /> .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) } <br /> .map { case (dept, (age, c)) => dept -> age / c } |
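
The RDD row of the table spells out "how" the average is computed: pair each age with a count of 1, add the pairs per key, then divide. The same three steps can be followed in plain Python. This is a sketch of the logic only; `reduce_by_key` is a hypothetical helper that imitates the RDD method, and the rows are made up.

```python
# Made-up (dept, age) rows for illustration.
data = [("sales", 30), ("sales", 50), ("hr", 25), ("hr", 35), ("hr", 45)]


def reduce_by_key(pairs, combine):
    """Merge the values of equal keys with `combine` (imitates RDD.reduceByKey)."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc


# Step 1 (map): dept -> (age, 1)
pairs = [(dept, (age, 1)) for dept, age in data]

# Step 2 (reduceByKey): add up ages and counts per department
sums = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Step 3 (map): dept -> total age / count
averages = {dept: total / count for dept, (total, count) in sums.items()}
```

The DataFrame and SQL rows express the same result by stating only "what" is wanted and leave these steps to the engine.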


### Parquet

| Title | Data | Chart|
|-------|------|------|
| 1     | 2    | 3    |
| 4     | 5    | 6    |
| 7     | 8    | 9    |

Ways a file format can store data on disk:
1. Columnar file format (used by Parquet). | 1 | 4 | 7 | 2 | 5 | 8 | 3 | 6 | 9 | <br>
    Example: OLAP (Online Analytical Processing). Needed when analysis runs on a few columns, as in groupBy or join.
2. Row-based file format. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | => To read specific columns, such as index 0 and 2, the reader has to skip index 1 in every row, which slows the process down. <br>
    Example: OLTP (Online Transaction Processing). Needed when updates, inserts, and deletes are performed on the data.

Big Data -> Write Once & Read Many

Parquet is a binary file format (it cannot be read with the naked eye). 3 main advantages:
1. Reduced cost
2. Reduced time
3. Increased performance
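
The two layouts can be reproduced in plain Python for the 3x3 table above to show which positions have to be read for a single column. This is an illustration of the layout idea only, not of the Parquet file format itself; `offsets_for_column` is a hypothetical helper.

```python
# The table from above: rows of (Title, Data, Chart).
table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n_rows, n_cols = 3, 3

# Row-based layout: one full row after another.
row_layout = [value for row in table for value in row]
# Columnar layout: one full column after another.
column_layout = [row[c] for c in range(n_cols) for row in table]


def offsets_for_column(col, layout):
    """Positions that must be read to fetch one column from the flat layout."""
    if layout == "row":
        return [r * n_cols + col for r in range(n_rows)]   # scattered positions
    return [col * n_rows + r for r in range(n_rows)]       # one contiguous run


row_offsets = offsets_for_column(0, "row")        # [0, 3, 6]
col_offsets = offsets_for_column(0, "column")     # [0, 1, 2]
title_from_rows = [row_layout[i] for i in row_offsets]
title_from_columns = [column_layout[i] for i in col_offsets]
```

Both reads return the Title column, but the columnar layout gets it from adjacent positions, which is why column-oriented queries favour it.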


Data Encoding
![Data Compression](/home/manish/Documents/VSCodeProjects/SparkTutorial/CompressionInParquet.png)

Parquet compression codecs:
1. GZIP
2. LZO
3. Snappy

Time taken by a sample query:
1. CSV: 2892 sec.
2. LZO: 50 sec.
3. GZIP: 40 sec.
4. Snappy: 28 sec.

### Write in Spark

df.write.format("csv")<br>
        .option("header", "true")<br>
        .mode("overwrite")<br>
        .option("path", "file_path")<br>
        .save() # The path can also be passed here

Types of write modes:
1. append
2. overwrite
3. errorIfExists
4. ignore

### Partitioning & Bucketing

Both methods are applied when writing data. Spark queries run faster when data written this way is read.

Partitioning: Creates one directory per distinct value of a categorical column, such as City or Gender.

Bucketing: Creates a user-specified number of buckets. Used when the column holds continuous values, such as Age or ID.

Suppose 200 tasks are running in the backend and you create 5 buckets at the end. Spark will then create 200 * 5 = 1000 bucket files. To avoid this, repartition to 5 in the code, e.g. df.repartition(5).

Bucket shuffle elimination: To join 2 bucketed tables without a shuffle, they should have:
1. The same number of buckets.
2. The same bucketing column.

Bucket Pruning: It makes searching and joining faster.
Example: Take the Aadhaar card number 1234 5678 9102. The remainder of 123456789102 divided by 10000 is 9102, which means only bucket number 9102 needs to be searched to get the complete details of this Aadhaar card. This bucket could hold 2%-20% of the total data.
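
The arithmetic in the Aadhaar example is a remainder (modulo) operation, and it can be checked in a few lines of plain Python. This is a simplified sketch: `bucket_for`, `NUM_BUCKETS`, and the extra IDs are hypothetical, and Spark hashes the bucketing column before taking the remainder, so real bucket numbers will differ.

```python
from collections import defaultdict

NUM_BUCKETS = 10_000  # assumed bucket count, taken from the example above


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Bucket number as a plain remainder (simplified; no hashing)."""
    return key % num_buckets


# Made-up IDs, including the one from the example.
ids = [123456789102, 555500001234, 999988889102, 111122223333]

buckets = defaultdict(list)
for value in ids:
    buckets[bucket_for(value)].append(value)

target = 123456789102
target_bucket = bucket_for(target)       # 9102
candidates = buckets[target_bucket]      # only this bucket is searched
found = target in candidates
```

Two of the four IDs land in bucket 9102, so the lookup scans those two entries and skips every other bucket.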

In [27]:
# The save mode is set with .mode(), not .option("mode", ...)
df.write.format("csv")\
        .option("header", "true")\
        .mode("overwrite")\
        .option("path", "/home/manish/Documents/VSCodeProjects/SparkTutorial/partition_destination_city")\
        .partitionBy("destination_city")\
        .save() 

df.write.format("csv")\
        .option("header", "true")\
        .mode("overwrite")\
        .option("path", "/home/manish/Documents/VSCodeProjects/SparkTutorial/bucket_duration")\
        .bucketBy(3, "duration")\
        .saveAsTable("bucket_duration_flight") 

Note: If the save mode is passed as .option("mode", "overwrite") instead, Spark treats it as a parse mode and logs the following warning:

23/08/04 01:41:55 WARN ParseMode: overwrite is not a valid parse mode. Using PERMISSIVE.


### Application, Jobs, Stages & Tasks

Application: Code executed through spark-submit. A single application is submitted at a time, though deploying multiple applications is also feasible. <br>
Job: Within an application, the number of jobs generally equals the number of actions in your code. <br>
Stages: A job is divided into stages; a new stage begins at each wide transformation (shuffle) in your code. <br>
Tasks: Execute the code on the actual data. They run on the executors.

When a job is created, it contains at least one stage, and each stage contains at least one task.