## Distribution of Executors, Cores and Memory for a Spark Application Running on YARN

https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
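
As a rough illustration of the kind of sizing arithmetic this topic involves, here is a minimal sketch. Everything in it is an assumption rather than a Spark setting: the hypothetical cluster (10 nodes, 16 cores and 64 GB each), the rule of thumb of 5 cores per executor, the 1 core and 1 GB reserved per node for the OS and Hadoop daemons, the executor slot given up for the YARN ApplicationMaster, and the 7% memory overhead. The function name `size_executors` is made up for this example.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.07):
    """Rule-of-thumb sizing sketch; all defaults are assumptions, not Spark defaults."""
    usable_cores = cores_per_node - 1        # reserve 1 core per node for OS/daemons
    usable_mem_gb = mem_gb_per_node - 1      # reserve 1 GB per node for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    # Give up one executor slot for the YARN ApplicationMaster.
    total_executors = nodes * executors_per_node - 1
    mem_per_executor_gb = usable_mem_gb / executors_per_node
    # Part of each executor's share goes to off-heap overhead, the rest to the heap.
    heap_gb = int(mem_per_executor_gb * (1 - overhead_fraction))
    return {
        "num_executors": total_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


plan = size_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(plan)
```

The resulting numbers would map onto `--num-executors`, `--executor-cores` and `--executor-memory`; treat them as a starting point to tune, not a fixed answer.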

## Understanding Spark UI

https://sparkbyexamples.com/spark/spark-web-ui-understanding/

### When to use Avro, when to use Parquet, and why?

<p><b>AVRO</b> is a row-based storage format, whereas <b>PARQUET</b> is a columnar storage format.</p>
<p><b>PARQUET</b> is much better for analytical querying: reads that touch only a few columns are far more efficient, at the cost of slower writes.<br> Write operations are faster in <b>AVRO</b>, which makes it the better fit for write-heavy workloads.</p>
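
The difference between the two layouts can be sketched in plain Python. This is only a conceptual model with hypothetical sample data, not how either file format is actually encoded: a list of records stands in for a row-based layout, and a dictionary of per-column lists stands in for a columnar one.

```python
# Hypothetical sample records.
records = [
    {"id": 1, "city": "Pune", "amount": 10.0},
    {"id": 2, "city": "Delhi", "amount": 20.0},
    {"id": 3, "city": "Pune", "amount": 30.0},
]

# Row-based layout (Avro-like): each record is kept together,
# so adding a new record is a single append.
row_store = list(records)
row_store.append({"id": 4, "city": "Goa", "amount": 40.0})

# Columnar layout (Parquet-like): all values of one column are kept together.
column_store = {name: [r[name] for r in row_store] for name in row_store[0]}

# Analytical query: total of one column.
# The row layout has to walk through whole records ...
total_from_rows = sum(r["amount"] for r in row_store)
# ... while the columnar layout reads only the "amount" column.
total_from_columns = sum(column_store["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; what differs is how much unrelated data has to be read to get there, which is why columnar formats suit queries that select a few columns from wide tables.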

### What is Hadoop erasure coding?
<p>Erasure coding, a new feature in HDFS (Hadoop 3), <b>can reduce storage overhead by approximately 50% compared to replication</b> while maintaining the same durability guarantees. The post linked below explains how it works.</p>

Link: https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
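
The storage saving can be checked with a little arithmetic. The sketch below assumes 3x replication on one side and a Reed-Solomon layout with 6 data blocks and 3 parity blocks on the other. The 600 MB file size and the helper name `raw_storage_mb` are made up for illustration.

```python
def raw_storage_mb(file_mb, data_units, redundancy_units):
    """Raw MB consumed when every `data_units` of data carry `redundancy_units` of redundancy."""
    return file_mb * (data_units + redundancy_units) / data_units


file_mb = 600  # hypothetical file size

# 3x replication: each block has 2 extra copies.
replicated = raw_storage_mb(file_mb, data_units=1, redundancy_units=2)

# Reed-Solomon (6 data + 3 parity): 3 parity blocks per 6 data blocks.
erasure_coded = raw_storage_mb(file_mb, data_units=6, redundancy_units=3)

saving = 1 - erasure_coded / replicated
print(replicated, erasure_coded, saving)
```

Under these assumptions the raw footprint drops from 1800 MB to 900 MB. The erasure-coded layout survives the loss of any 3 of its 9 blocks, while 3x replication survives the loss of 2 copies.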

In Hadoop 3, an erasure coding policy can be applied to any directory in HDFS. Erasure coding is not applied by default.

### How to configure it?
Link: https://stackoverflow.com/questions/51475712/hadoop-3-how-to-configure-enable-erasure-coding

### What is Spark driver?
<p>The Spark driver is the program that declares the <b>transformations and actions on RDDs</b> of data and submits those requests to the master. It runs the application's <code>main()</code> function and creates the SparkContext. In cluster deploy mode the driver runs on a node inside the cluster; in client mode it runs on the machine that submitted the application.</p>


### What is Spark session ?
<p>The Spark session has been the unified entry point of a Spark application since Spark 2.0. <br> It gives access to Spark's functionality through fewer constructs: instead of separate SparkContext, HiveContext, and SQLContext objects, everything is now encapsulated in a single SparkSession.</p>

### Who manages Spark cluster resources?
<p>The cluster manager (Spark standalone, Mesos, YARN, or Kubernetes).</p>
<p>The driver talks to the cluster manager and negotiates for resources. The cluster manager launches executors on worker nodes on behalf of the driver.</p>
<p>In a standalone cluster, the Spark Master plays this role: it allocates resources on the Worker nodes, makes them available to the Spark driver, and tracks the Workers' status and progress.</p>

### Deployment modes: Cluster and Client?

### How does a client read a file from HDFS?

<ul>
    <li>HDFS Client: On the user's behalf, the HDFS client interacts with the NameNode and DataNodes to fulfill user requests.</li>
    <li>NameNode: NameNode is the master node that stores metadata about block locations, blocks of a file, etc. This metadata is used for file read and write operation.</li>
    <li>DataNode: DataNodes are the slave nodes in HDFS. They store actual data (data blocks).</li>
</ul>
 <p>Suppose the HDFS client wants to read a file “File.txt”.</p>
 <p>The client asks the NameNode for the locations of the DataNodes that hold the file's data blocks. <br>
    The NameNode first checks the client's privileges and, if they are sufficient, returns the locations of the DataNodes containing the blocks.</p>
 <p>The NameNode also gives the client a security token, which the client must present to the DataNodes for authentication. The client then reads the blocks directly from those DataNodes.</p>
 <a href="https://data-flair.training/blogs/hdfs-data-read-operation/">How the client reads a file from HDFS Link</a>
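
The read path described above can be modelled with a small toy simulation. This is not the HDFS client API: the classes `NameNode` and `DataNode`, the function `read_file`, the user `alice` and the block names are all hypothetical. Token handling is also simplified, since here the NameNode hands the token straight to the DataNodes, which real HDFS does not do.

```python
import secrets


class DataNode:
    """Toy DataNode: stores block bytes and serves them only for a known token."""

    def __init__(self):
        self.blocks = {}              # block_id -> bytes
        self.accepted_tokens = set()

    def read_block(self, block_id, token):
        if token not in self.accepted_tokens:
            raise PermissionError("invalid or missing token")
        return self.blocks[block_id]


class NameNode:
    """Toy NameNode: holds only metadata (block locations and read permissions)."""

    def __init__(self, datanodes):
        self.datanodes = datanodes    # name -> DataNode
        self.block_map = {}           # path -> [(block_id, [datanode names])]
        self.readers = {}             # path -> set of users allowed to read

    def get_block_locations(self, path, user):
        # Step 1: privilege check.
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        # Step 2: issue a token the DataNodes will accept (simplified).
        token = secrets.token_hex(8)
        for _, holders in self.block_map[path]:
            for name in holders:
                self.datanodes[name].accepted_tokens.add(token)
        return self.block_map[path], token


def read_file(namenode, path, user):
    """Client side: ask the NameNode for locations, then read each block from a DataNode."""
    locations, token = namenode.get_block_locations(path, user)
    data = b""
    for block_id, holders in locations:
        datanode = namenode.datanodes[holders[0]]   # pick the first replica
        data += datanode.read_block(block_id, token)
    return data


dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
namenode = NameNode({"dn1": dn1, "dn2": dn2})
namenode.block_map["File.txt"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]
namenode.readers["File.txt"] = {"alice"}

content = read_file(namenode, "File.txt", "alice")
print(content)
```

Note that the file bytes never pass through the NameNode; it only answers the metadata question of where the blocks live.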

### What types of Hadoop nodes are used by HBase?
Masters -- HDFS NameNode, YARN ResourceManager, and HBase Master. Slaves -- HDFS DataNodes, YARN NodeManagers, <br>
and HBase RegionServers. The DataNodes, NodeManagers, and HBase RegionServers are co-located or co-deployed <br>
for optimal data locality.

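### When using the default storage level for persist(), what happens when an RDD does not fit in the cluster's memory?
<p>For RDDs, the default storage level of <code>persist()</code> is <code>MEMORY_ONLY</code>: partitions that do not fit in memory are not cached and are recomputed on the fly each time they are needed. Storing the partitions that do not fit on disk and reading them from there when needed is the behavior of <code>MEMORY_AND_DISK</code>, which is the default for DataFrames and Datasets.</p>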

### Spark OOM (Out of Memory ) Error — Closeup
https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d

### Python Spark Cumulative Sum by Group Using DataFrame

https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe
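
In PySpark this is usually expressed with a window that is partitioned by the group column and ordered by a sort column, with `sum(...).over(window)` applied on top. The sketch below shows, in plain Python, what such a window computes. The sample rows and the function name `cumulative_sum_by_group` are hypothetical.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical rows: (group, order_key, value)
rows = [("a", 1, 10), ("b", 1, 5), ("a", 2, 20), ("b", 2, 7), ("a", 3, 30)]


def cumulative_sum_by_group(rows):
    """Return (group, order_key, value, running_total) with the total restarting per group."""
    # Equivalent of partitioning by group and ordering by order_key within each group.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for _, members in groupby(ordered, key=itemgetter(0)):
        members = list(members)
        running = accumulate(value for _, _, value in members)
        result.extend((g, k, v, total) for (g, k, v), total in zip(members, running))
    return result


for row in cumulative_sum_by_group(rows):
    print(row)
```

The ordering column matters: without a defined order inside each group, a running total is not well defined.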

### What is the typical process you follow to optimize a Spark job?

Note: do not list everything from the linked page; mention only the techniques you know well. <br>
https://www.xenonstack.com/blog/apache-spark-optimisation

## MCQ
### Which property must a Spark Structured Streaming sink possess to ensure end-to-end exactly-once semantics?
<ul>
    <li>Horizontally scalable</li>
    <li>Predictable</li>
    <li>Cacheable</li>
    <li>Idempotent</li>
</ul>
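
Of the options above, idempotence is the property that makes a replayed micro-batch harmless. The sketch below is a toy model, not a Structured Streaming API: both sink classes and the batch IDs are hypothetical. It shows why a retried batch produces duplicates in a plain append-only sink but not in a sink that remembers which batch IDs it has already committed.

```python
class AppendOnlySink:
    """Non-idempotent: every write call appends, even if the batch was seen before."""

    def __init__(self):
        self.rows = []

    def write(self, batch_id, rows):
        self.rows.extend(rows)


class IdempotentSink:
    """Idempotent: writing the same batch_id again has no further effect."""

    def __init__(self):
        self.committed = {}           # batch_id -> rows

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return                    # replay of an already committed batch
        self.committed[batch_id] = list(rows)

    def all_rows(self):
        return [row for batch_id in sorted(self.committed)
                for row in self.committed[batch_id]]


# Simulate a failure after batch 1: the engine retries batch 1 before moving on.
deliveries = [(0, [1, 2]), (1, [3]), (1, [3]), (2, [4])]

append_sink, idempotent_sink = AppendOnlySink(), IdempotentSink()
for batch_id, batch_rows in deliveries:
    append_sink.write(batch_id, batch_rows)
    idempotent_sink.write(batch_id, batch_rows)

print(append_sink.rows)
print(idempotent_sink.all_rows())
```

A replayable source plus an idempotent sink is what lets at-least-once delivery look like exactly-once in the final output.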

### Which of the following will ensure better performance of queries when using Spark as a query engine?
<ul>
    <li>Querying a large dataset</li>
    <li>Including a surrogate primary key in the where clause</li>
    <li>Increasing the amount of data skew on a join key</li>
    <li>Including a filter on partition column in the where clause for predicate pushdown</li>
</ul>

### Which file format enables the use of predicate pushdown filtering as well as column pruning at the storage layer ?
<ul>
    <li>Parquet</li>
    <li>Avro</li>
    <li>CSV</li>
    <li>JSON</li>
</ul>

## Example of a spark-submit command

<pre>
spark-submit --deploy-mode cluster --master yarn --driver-memory 10g \
--executor-memory 35g \
--executor-cores 5 \
--num-executors 12 \
--jars s3://< s3-path > \
python_file.py 20190401 20190430
</pre>

## Example of creating a Spark session in Prod

<pre>
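from pyspark.sql import SparkSession

# Build (or reuse) the session; S3A credentials come from the instance profile.
spark = SparkSession \
        .builder \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider") \
        .getOrCreate()

# Settings used by the BigQuery connector; base64_string must be defined beforehand.
spark.conf.set("credentials", base64_string)  # base64-encoded credentials to connect to the GCP account
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "34532456245")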
</pre>

<pre>
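# query must be defined beforehand; it is not set anywhere in this snippet.
df = spark.read \
     .format("bigquery") \
     .option("parentProject", "lively-encoder-854") \
     .load(query)

output_s3_path = "< s3 path >"

# Coalesce to at most 50 partitions, then append the result to S3 as Parquet.
df.coalesce(50).write.mode('append').parquet(output_s3_path)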
</pre>