## What is PySpark?

PySpark is the Python API for Apache Spark, an open-source big data processing framework. It lets users write Spark applications in Python, providing a simple way to perform big data processing, machine learning, and graph processing.


## What are the key features of PySpark?

PySpark offers several key features:
1. In-memory computation
2. Fault tolerance
3. Scalability (the ability to handle increasingly large volumes of data and computational workloads by adding more nodes to a cluster)
4. Support for a wide range of data formats

It also provides APIs for working with structured and unstructured data, making it suitable for various big data processing tasks.

The secret to Spark's performance is parallelism.

The reason for putting the data on more than one computer is intuitive: either the data is too large to fit on one machine, or the computation would simply take too long on one machine.

## What is PySpark SQL?

PySpark SQL is a module in PySpark that allows users to run SQL queries on DataFrames (an RDD must first be converted to a DataFrame before it can be queried with SQL). It supports querying structured data in a distributed environment, making it easy to perform complex data analysis using SQL syntax.

## How do you handle large-scale data processing in PySpark?
This is one of the most common PySpark interview questions for candidates with about 3 years of experience.

“When handling large-scale data in PySpark, I use its in-memory processing capabilities to speed up computations. I use DataFrames for efficient data manipulation and ensure proper data partitioning to optimize processing. I also apply transformations and actions carefully to minimize shuffling and reduce overall processing time.”

## How do you manage and monitor resource allocation in a PySpark application?

Managing and monitoring resource allocation involves setting appropriate configurations for memory and CPU usage based on the workload. 

Using Spark’s built-in tools, such as the Spark UI, helps track resource usage and identify bottlenecks. 

Adjusting configurations like executor memory, core count, and parallelism settings based on the monitoring data ensures efficient resource utilization.


## How do you optimize PySpark jobs for better performance?
This is one of the most important PySpark interview questions for candidates with about 5 years of experience.

1. To optimize PySpark jobs, I cache DataFrames to avoid recomputation and use built-in functions instead of UDFs for efficiency.

2. I adjust the number of partitions to ensure even data distribution and manage resources by tuning Spark configurations.

3. Additionally, I monitor job performance through the Spark UI to identify and address bottlenecks.

## What is a JVM?

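The JVM (Java Virtual Machine) manages system memory and provides a portable execution environment for Java-based applications. Spark itself runs on the JVM; PySpark drives it from Python through the Py4J bridge.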

Technical definition: The JVM is the specification for a software program that executes code and provides the runtime environment for that code.

Everyday definition: The JVM is how we run our Java programs. We configure the JVM's settings and then rely on it to manage program resources during execution.

## Driver 

The driver orchestrates and monitors execution of a Spark application. There’s always one driver per Spark application. You can think of the driver as a wrapper around the application.

The driver process runs our main() function, sits on a node in the cluster, and is responsible for:

1. Maintaining information about the Spark Application
2. Responding to a user’s program or input
3. Requesting memory and CPU resources from cluster managers
4. Breaking application logic into stages and tasks
5. Sending tasks to executors
6. Collecting the results from the executors

## Executors
The executors are responsible for actually executing the work that the driver assigns them. This means each executor is responsible for only two things:

1) Executing code assigned to it by the driver
2) Reporting the state of the computation on that executor back to the driver node

## Cluster Manager

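Spark employs a Cluster Manager (for example Spark Standalone, YARN, or Kubernetes) that is responsible for provisioning nodes in our cluster and allocating their resources to Spark applications.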

#   --------  DAG -------------

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan into a physical execution plan.

After an action (see below) has been called, SparkContext hands a logical plan over to DAGScheduler, which translates it into a set of stages. Each stage is then submitted as a set of tasks for execution.

The fundamental concepts of DAGScheduler are jobs and stages that it tracks through internal registries and counters.

### Lineage and Transformations

Spark maintains a directed acyclic graph (DAG) of transformations applied to RDDs. If a node fails, Spark can recompute only the lost partitions by tracing back the lineage graph. This minimizes the amount of data that needs to be recalculated.


# --------- JOB ------------

A Job is a sequence of stages, triggered by an action such as count(), collect(), or a write. A read can also launch a job, for example when Spark has to scan a file to infer its schema.


Each parallelized action is referred to as a Job.
The results of each Job (parallelized/distributed action) are returned to the Driver from the Executors.
Depending on the work required, multiple Jobs may be needed.

# ----------------- STAGE ------------------------------

Each job is divided into smaller sets of tasks called stages.

A Stage is a sequence of Tasks that can all be run together (i.e. in parallel) without a shuffle. For example, using .read to read a file from disk and then running .filter can be done without a shuffle, so both fit in a single stage. The number of Tasks in a Stage depends on the number of Partitions in your dataset: one task runs per partition.
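To make the idea of a shuffle boundary concrete, here is a minimal plain-Python sketch. It is not Spark's scheduler: the operation names and the `WIDE_OPS` set are assumptions made for illustration, and the helper `split_into_stages` is hypothetical. It only shows how a list of operations could be cut into stages wherever a shuffle is needed.

```python
# Operations assumed (for this sketch only) to require a shuffle.
WIDE_OPS = {"groupBy", "join", "repartition", "distinct", "orderBy"}


def split_into_stages(operations):
    """Group operations into stages, starting a new stage at every shuffle."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS and stages[-1]:
            # A shuffle ends the current stage; in this simplified model the
            # wide operation opens the next one.
            stages.append([])
        stages[-1].append(op)
    return stages


plan = ["read", "filter", "groupBy", "select", "orderBy"]
for number, stage in enumerate(split_into_stages(plan)):
    print(f"stage {number}: {stage}")
```

In this toy plan, `read` and `filter` share a stage because neither needs data from other partitions, while `groupBy` and `orderBy` each force a new stage.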

## Explain what PySpark DataFrame is.
A PySpark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. DataFrames provide a higher-level abstraction than RDDs (the basic abstraction in Spark) and support various operations like filtering, grouping, and aggregations.

A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. The list of columns and the types in those columns is called the schema.

#  ~~~~~~~~~~~~ RDD (Resilient distributed dataset)  ~~~~~~~~~~~~~~~~

RDDs are the fundamental data structures in Spark. They are immutable and partitioned across the cluster. When a node fails, Spark can reconstruct lost RDD partitions using lineage information, which represents the sequence of transformations applied to create the RDD. This lineage allows Spark to recompute the lost data without needing to store the entire dataset.
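The recovery idea can be sketched in plain Python. This is a toy model, not Spark's implementation: the class `LineageDataset` and its methods are hypothetical names, and the source partitions are assumed to live in durable storage. It shows that replaying the recorded transformations is enough to rebuild a single lost partition.

```python
class LineageDataset:
    """Toy stand-in for an RDD: source partitions plus a recorded lineage."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # assumed durable input
        self.lineage = tuple(lineage)               # ordered transformations

    def map(self, fn):
        # Immutable: return a new dataset with one more lineage step.
        return LineageDataset(self.source_partitions, self.lineage + (fn,))

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage on its source rows."""
        rows = list(self.source_partitions[index])
        for fn in self.lineage:
            rows = [fn(row) for row in rows]
        return rows


source = [[1, 2], [3, 4], [5, 6]]
doubled_plus_one = LineageDataset(source).map(lambda x: x * 2).map(lambda x: x + 1)

# Materialize every partition, as executors would hold them in memory.
materialized = {i: doubled_plus_one.compute_partition(i) for i in range(len(source))}

# Simulate losing the executor that held partition 1.
del materialized[1]

# Recovery touches only the lost partition, not the whole dataset.
materialized[1] = doubled_plus_one.compute_partition(1)
print(materialized[1])
```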

### RDDs offer two types of operations:
1. Transformations take an RDD as input and produce one or more new RDDs as output.
2. Actions take an RDD as input and return a result to the driver (or write it to storage) instead of producing another RDD.

# ----------------------------  Data  Partition  -------------------------------------


In order to allow every executor to perform work in parallel, Spark breaks up the data into chunks, called partitions.

A partition is a collection of rows that sit on one physical machine in our cluster. A DataFrame’s partitions represent how the data is physically distributed across your cluster of machines during execution.

An important thing to note is that with DataFrames, we do not (for the most part) manipulate partitions manually (on an individual basis). We simply specify high level transformations of data in the physical partitions and Spark determines how this work will actually execute on the cluster.
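As a rough illustration of how rows might be assigned to partitions by key, here is a plain-Python sketch. It assumes a simple `crc32`-based hash and hypothetical helper names (`partition_for`, `hash_partition`); Spark uses its own partitioners and hash functions, so treat this only as a model of the idea that equal keys land in the same partition.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key (crc32, for repeatability)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(rows, key_fn, num_partitions):
    """Return a dict {partition index: list of rows}."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(key_fn(row), num_partitions)].append(row)
    return dict(partitions)


# Hypothetical (name, amount) rows.
rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]
partitions = hash_partition(rows, key_fn=lambda row: row[0], num_partitions=3)
for index in sorted(partitions):
    print(index, partitions[index])
```

Because the partition depends only on the key, all rows for one key end up together, which is what makes per-key operations possible without further data movement and also what causes skew when one key is very frequent.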

## Explain the importance of partitioning in PySpark and how it affects performance.

Partitioning in PySpark is critical for distributing data across the cluster and ensuring parallel processing. Proper partitioning helps in reducing data shuffling and improves job performance. It’s important to partition data based on key columns to ensure even distribution and minimize bottlenecks during data processing.

## How do you optimize PySpark jobs with skewed data?

“To optimize PySpark jobs with skewed data, I use techniques such as:
1) Salting, where I add a random prefix to keys to distribute data more evenly.
2) Repartitioning the data to ensure balanced partitions.
3) Broadcast joins to handle skewed joins efficiently.”

## How can you improve the performance of PySpark jobs?
https://medium.com/@sounder.rahul/pyspark-optimization-techniques-for-data-engineers-df5033778709

    
•	Partitioning

Proper partitioning can significantly improve the speed and efficiency of code. However, improper partitioning can lead to poor performance and inefficient use of resources.

•	Caching 

Caching can improve performance by reducing the time required to process data frames. You can use functions such as cache and persist to cache data frames in memory. However, if used at the wrong locations in a query, it might eat up all memory and can even slow down queries substantially.

•	Avoid UDFs

Avoid UDFs (User-Defined Functions) that perform more than one thing. Splitting UDFs allows you to use built-in functions for one part of the resulting code and greatly simplifies testing.

•	Parallel computing

PySpark can perform parallel processing across a cluster of machines by splitting data into smaller partitions and performing parallel processing on them. This makes PySpark faster and more efficient than Pandas for large-scale data processing. 


## How can you handle skewed data in PySpark?

Spark has data loaded into memory in the form of partitions. Ideally, the data in the partitions should be uniformly distributed. Data skew is when one or some partitions have significantly more data compared to other partitions. Data skew is usually the result of operations that require re-partitioning the data, mostly join and grouping (GroupBy) operations.

### Handling Data Skewness in Apache Spark

1.  Custom Partitioning: Instead of relying on Spark’s default partitioning strategy, implementing a custom partitioning strategy can help distribute data more evenly across partitions. For example, range partitioning can be more effective when dealing with numeric keys. 
2.  Salting: Salting is a technique where a random value (salt) is appended to the key, which helps distribute the data more evenly across partitions. This can be particularly useful when dealing with hot keys.
3.  Dynamic Partition Pruning: Dynamic partition pruning optimizes join operations by skipping partitions of a partitioned table that cannot match the filtered rows on the other side of the join. It reduces the amount of data read rather than rebalancing keys, so it can ease, but does not by itself remove, skew caused by join operations.
4.  Splitting Skewed Data: Another strategy is to split the skewed data across multiple partitions. This involves identifying the skewed keys and redistributing the data associated with these keys.
5.  Avoid groupByKey for Large Datasets: When working with RDDs, avoid groupByKey on large datasets with heavily repeated keys where possible. Alternatives such as reduceByKey, which combines values locally on each partition before the shuffle, can be more efficient.
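The salting technique from the list above can be sketched in plain Python. This is a simulation under stated assumptions, not PySpark code: `NUM_SALTS`, `salt_key`, and `two_stage_sum` are hypothetical names, and the salt is appended as a suffix. The point is that a two-stage aggregation over salted keys gives the same totals while splitting the hot key's work into several smaller groups.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical number of salt buckets


def salt_key(key, rng):
    """Append a random salt so one hot key becomes up to NUM_SALTS distinct keys."""
    return f"{key}_{rng.randrange(NUM_SALTS)}"


def two_stage_sum(records, rng):
    """Sum values per key: first per salted key, then per original key."""
    # Stage 1: partial sums keyed by the salted key (work is spread out).
    partial = defaultdict(int)
    for key, value in records:
        partial[salt_key(key, rng)] += value

    # Stage 2: strip the salt and combine the few partial sums.
    totals = defaultdict(int)
    for salted_key, subtotal in partial.items():
        original_key = salted_key.rsplit("_", 1)[0]
        totals[original_key] += subtotal
    return dict(totals), dict(partial)


# Skewed input: one hot key dominates.
records = [("hot", 1)] * 1000 + [("cold_a", 1)] * 10 + [("cold_b", 1)] * 10
totals, partial = two_stage_sum(records, random.Random(42))

print(totals)
# How many salted variants each original key was split into.
print(Counter(k.rsplit("_", 1)[0] for k in partial))
```

The cost of salting is the extra second stage, so it tends to pay off only when a small number of keys carry most of the rows.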


## How is data skew handled?
Data skew means data is not evenly distributed across partitions. Common remedies:

-	Broadcast join

-	Repartition & coalesce (increase or decrease the number of partitions; coalesce minimizes shuffling)

-	Dynamic allocation (if you don’t know the workload or amount of data in advance)

-	Hash partitioning (on a uniformly distributed key)


#                         ~~~~~~~~~~ Miscellanies  ~~~~~~~~~~~~

## What are some best practices for managing dependencies in a PySpark environment?

Managing dependencies in PySpark involves using a virtual environment or Conda to isolate dependencies, so that all nodes in the cluster have consistent library versions. It’s also important to package dependencies with your PySpark job using options like `--py-files`, or to manage them through the cluster’s resource manager (like YARN).

## You are processing a large dataset in PySpark, but you notice that certain stages are taking significantly longer. How would you troubleshoot and resolve this issue?

“If I am processing a large dataset in PySpark and notice that certain stages are taking significantly longer, I would start by checking the Spark UI for details on the slow stages. I would look for signs of excessive shuffling, data skew, or improper partitioning. 

To address these issues, I would adjust the partition size, optimize data distribution, and use operations that minimize shuffling. Additionally, I might cache intermediate results to help speed up the process.”

## How do you handle iterative algorithms in PySpark, such as those used in machine learning?

“Iterative algorithms in PySpark, like those in machine learning, can be optimized by using the DataFrame API and caching intermediate results. I also use MLlib’s built-in algorithms, which are optimized for distributed computing, and monitor performance to fine-tune iterations.”

## What are the key considerations when using PySpark for data engineering tasks?
You may also come across PySpark interview questions for data engineers like this one. 

Key considerations include managing data ingestion and storage efficiently, optimizing data transformations and aggregations, and ensuring data quality. It’s important to use partitioning and caching to improve performance and use Spark’s built-in functions for efficient data processing.

## How do you implement fault tolerance in PySpark applications?
Fault tolerance in PySpark applications is implemented through checkpointing, which saves the state of the RDDs or DataFrames to reliable storage. This allows recovery from failures. Additionally, using lineage information to recompute lost data and configuring retries for failed tasks helps ensure the application can handle and recover from errors effectively.

## You need to process a massive amount of data using PySpark on AWS, but you’re facing high costs and slow performance. What steps would you take to optimize the job?
“To optimize the job, I’d review the cluster setup and choose appropriate instance types to balance cost and performance. I’d use Spot Instances to lower costs and optimize job performance by adjusting partitions and resource settings. I’d also use Amazon S3 for storage and Amazon EMR to manage the Spark cluster efficiently.”

## A PySpark job is failing intermittently due to resource constraints. What would you do to diagnose and address the issue?
“To address the issue, I’d start by checking the Spark UI for resource bottlenecks. I’d then increase executor memory and cores as needed. Implementing data caching and adjusting partition sizes can help. If necessary, I’d enable dynamic resource allocation to match resource use with job needs.”


## You need to join two large datasets in PySpark, but the join operation is taking too long. How would you improve the performance of this join operation?
“To improve join performance, I’d use a broadcast join if one dataset is small enough to fit in memory, which reduces shuffling. I’d also repartition the data based on the join keys to balance the load and optimize data partitioning to speed up the join process.”
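Why a broadcast join avoids shuffling can be shown with a small plain-Python model. The function `broadcast_hash_join` and the sample data are hypothetical; the sketch assumes the small side fits in memory and uses inner-join semantics. Every partition of the large side receives the same lookup table and joins locally, so no large-side row has to move.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Join each partition of the large side against a full copy of the small side."""
    # "Broadcast": build the lookup once; every partition gets the same copy.
    lookup = dict(small_table)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition joins locally; large-side rows never change partition.
        joined = [
            (key, value, lookup[key])
            for key, value in partition
            if key in lookup  # inner-join semantics
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: orders (large, already partitioned) and countries (small).
orders = [
    [("US", 100), ("DE", 40)],
    [("US", 25), ("FR", 60)],
    [("JP", 80)],
]
countries = [("US", "United States"), ("DE", "Germany"), ("FR", "France")]

for index, partition in enumerate(broadcast_hash_join(orders, countries)):
    print(index, partition)
```

In PySpark itself, the usual way to request this behavior is the `broadcast` hint from `pyspark.sql.functions`; Spark may also choose it automatically for tables below `spark.sql.autoBroadcastJoinThreshold`.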

## What is the difference between Parquet and CSV?

https://medium.com/@dinesh1.chopra/unveiling-the-battle-apache-parquet-vs-csv-exploring-the-pros-and-cons-of-data-formats-b6bfd8e43107  

1. Storage efficiency: Parquet is a columnar format that applies compression and encoding schemes to each column, which reduces storage space. CSV is plain text with no built-in compression.
2. Performance: When a query needs only specific columns, Parquet can skip reading irrelevant data, resulting in faster query execution time. CSV files must be read as entire rows.
3. Data Types and schema evolution: Parquet supports complex data types and nested structures, making it suitable for handling structured and semi-structured data. It also supports schema evolution, allowing new columns to be added to a Parquet dataset without rewriting the existing files. CSV, on the other hand, represents data in a flat, tabular format and does not provide built-in support for complex data types or schema evolution.
4. Ease of Use and Interoperability: CSV files are widely supported and can be easily opened, viewed, and edited using standard text editors or spreadsheet software. They have a simple, human-readable format and are commonly used for data exchange between different systems. Parquet files, although not directly readable by humans, can be processed by various data processing frameworks and tools that support the Parquet format, such as Apache Spark, Apache Hive, and Apache Arrow.
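The performance point about skipping columns can be illustrated with a plain-Python sketch. It does not read or write real Parquet files: the in-memory dictionary of columns is only an assumed stand-in for a columnar layout, and the function names are hypothetical. It contrasts parsing whole rows with touching a single column.

```python
import csv
import io

# Hypothetical CSV content.
csv_text = "id,name,amount\n1,alice,10\n2,bob,5\n3,carol,7\n"


def sum_amount_row_oriented(text):
    """Row-oriented (CSV-like): every row is parsed in full to reach one field."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(int(row["amount"]) for row in reader)


def to_columns(text):
    """Re-lay the same data column by column (a stand-in for a columnar file)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_amount_columnar(columns):
    """Column-oriented: only the "amount" column is touched."""
    return sum(int(value) for value in columns["amount"])


columns = to_columns(csv_text)
print(sum_amount_row_oriented(csv_text))
print(sum_amount_columnar(columns))
```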


## What is Lazy Evaluation?

Lazy evaluation is a key concept in Apache Spark, where the transformations on data are not immediately executed, but rather their execution is delayed until an action is triggered.

•	Transformations in Spark are not executed immediately, but are stored as a directed acyclic graph (DAG) of operations.

•	Actions trigger the execution of the DAG, allowing for optimizations like pipelining and avoiding unnecessary computations. 

•	Lazy evaluation helps in optimizing the execution plan and improving performance by delaying the actual computation until necessary.

- In short: each transformation is only added to the DAG, and nothing is executed until the driver program calls an action.
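The behavior described above can be mimicked in a few lines of plain Python. This is a toy model with a hypothetical `LazyPipeline` class, not Spark's API: transformations only record a step, and an action replays the recorded steps.

```python
class LazyPipeline:
    """Toy model: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    # --- transformations: return a new pipeline, compute nothing ---
    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    # --- actions: run the recorded steps and return a result ---
    def collect(self):
        rows = iter(self._data)
        for kind, fn in self._steps:
            # Built-in map/filter are lazy iterators, so each row flows through
            # all steps in one pass (a rough analogue of pipelining).
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []


def traced_square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


pipeline = LazyPipeline(range(6)).map(traced_square).filter(lambda x: x % 2 == 0)
print(calls)               # still empty: no transformation has run yet
print(pipeline.collect())  # the action triggers execution
print(calls)               # now every input has been processed
```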

## repartition() versus coalesce()

The number of partitions of an existing RDD can be changed using repartition() or coalesce(). Both redistribute the RDD according to the number of partitions provided. repartition() can increase or decrease the number of partitions, but it involves heavy data shuffling across the cluster. coalesce(), on the other hand, can only decrease the number of partitions, and in most cases it does not trigger a shuffle. It is useful right after heavy filtering to optimize execution time. It is important to notice that coalesce() does not always avoid shuffling. If the number of partitions provided is much smaller than the number of available nodes in the cluster then ...
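The difference in data movement can be modeled in plain Python. The two functions below are simplified stand-ins that borrow the names `coalesce` and `repartition`; they are assumptions for illustration, not Spark's algorithms (Spark's coalesce, for example, also considers data locality when grouping partitions). The sketch shows that coalescing merges whole partitions, while repartitioning moves individual rows.

```python
def coalesce(partitions, num_partitions):
    """Reduce the partition count by merging whole partitions (no row-level shuffle)."""
    if num_partitions >= len(partitions):
        return [list(p) for p in partitions]  # coalesce cannot increase the count
    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        # Each existing partition is appended intact to one target partition.
        merged[index % num_partitions].extend(partition)
    return merged


def repartition(partitions, num_partitions):
    """Redistribute every row round-robin (a full shuffle); count may go up or down."""
    redistributed = [[] for _ in range(num_partitions)]
    all_rows = [row for partition in partitions for row in partition]
    for position, row in enumerate(all_rows):
        redistributed[position % num_partitions].append(row)
    return redistributed


# Uneven partitions, e.g. after a heavy filter.
partitions = [[1, 2, 3, 4, 5, 6], [7], [], [8, 9]]

print(coalesce(partitions, 2))
print(repartition(partitions, 2))
print(repartition(partitions, 5))
```

Note that the coalesced result stays uneven because partitions are merged as they are, whereas the repartitioned result is balanced at the price of moving every row.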


## Difference between cache and persist?

The persist() method is used to persist (or cache) the RDD, DataFrame, or Dataset in memory or disk. This means that the data is stored for future use, which can be beneficial if you need to use the same data multiple times in your Spark application.

With cache(), you use only the default storage level:

•	MEMORY_ONLY for RDD

•	MEMORY_AND_DISK for Dataset

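With persist(), you can choose the storage level explicitly. Commonly listed levels include the following (the `_SER` variants belong to the Scala/Java API; data persisted from Python is always serialized):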

•	MEMORY_ONLY

•	MEMORY_AND_DISK

•	MEMORY_ONLY_SER

•	MEMORY_AND_DISK_SER

•	DISK_ONLY


## Spark submit example

    ./bin/spark2-submit \
       --verbose \
       --master yarn \
       --deploy-mode cluster \
       --driver-memory 8g \
       --executor-memory 16g \
       --executor-cores 2 \
       --class org.apache.spark.examples.SparkPi \
       /spark-home/examples/jars/spark-examples_versionxx.jar 80

## Catalyst optimizer 

-   It’s a core component of Spark SQL.
-   It works in four phases: analysis, logical optimization, physical planning, and code generation.
-   It’s an internal process that optimizes query performance.

Spark Catalyst Optimizer is a powerful tool that can greatly enhance the performance of your Spark applications.
•	Logical optimization: Spark Catalyst Optimizer applies rule-based and cost-based optimization techniques to enhance logical query plans.
•	Physical planning: It then transforms the logically optimized query plan into an efficient physical execution plan.
•	Query execution: This optimized physical plan is utilized for executing the query, improving performance and resource utilization.
