### **Importing necessary modules**

In [None]:
# Enable displaying output of all expressions in a single Jupyter cell (not just the last one)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Suppress warnings to keep the output clean
import warnings
warnings.filterwarnings('ignore')

# Required to manually set environment variables for Spark, Java, and Hadoop in a local Windows setup
import os
# Set the Java installation path
os.environ["JAVA_HOME"] = "D:/Programs/Java"  # Change this path based on your Java installation
# Set the Hadoop home directory (required even if not using full Hadoop setup, for Windows compatibility)
os.environ["HADOOP_HOME"] = "D:/Programs/hadoop"  # Change if you installed Hadoop elsewhere
# Set the Spark installation directory
os.environ["SPARK_HOME"] = "D:/Programs/spark/spark-3.5.6-bin-hadoop3"  # Match your Spark installation path

# Use findspark to make PySpark (Spark Python API) importable
import findspark
findspark.init(os.environ["SPARK_HOME"])  # Initialize Spark environment using the SPARK_HOME path set above

# Import specific PySpark modules for data processing and transformations
from pyspark.sql import SparkSession              # Main entry point for DataFrame and SQL functionality in Spark
import pyspark.sql.functions as F                 # Import Spark SQL functions with alias 'F'
from pyspark.sql.types import IntegerType         # For defining schema types
from pyspark.sql.types import DoubleType          # For schema definition using double precision (float)
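
Because the three environment variables above are machine-specific, it can help to confirm that each one points to an existing directory before starting Spark. The following is a minimal sketch using only the standard library; the helper name `missing_spark_paths` is hypothetical and not part of PySpark or findspark.

```python
import os
from pathlib import Path

# Variables set earlier in this notebook
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")

def missing_spark_paths(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or do not point to a directory.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name)
        if not value or not Path(value).is_dir():
            missing.append(name)
    return missing

problems = missing_spark_paths()
if problems:
    print("Check these environment variables:", ", ".join(problems))
else:
    print("All Spark-related paths exist.")
```

Running this right after setting the variables gives an early, readable message when a path is wrong.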

### **Setting Up Multiple Executors Using Spark's Standalone Cluster**

Follow the steps below to manually start a Spark master and multiple workers on your local machine using the Spark Standalone Cluster mode.

---

##### **Step 1**: Open Spark `bin` Directory in Command Prompt

```bash
cd D:\Programs\spark\spark-3.5.6-bin-hadoop3\bin
```

> Make sure to run all the following commands from within this directory.

---

##### **Step 2**: Start the Spark Master Node

```bash
spark-class2.cmd org.apache.spark.deploy.master.Master
```

* After running the above command, visit the Spark Master Web UI at:
  [http://localhost:8080/](http://localhost:8080/)

* Copy the master's URL from the UI — it will look something like:

```text
spark://192.168.171.138:7077
```

This URL is needed to connect worker nodes to the master.
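
A copied master URL is easy to mistype, so a quick check from Python can be useful. The sketch below parses a `spark://host:port` URL and, optionally, tests whether the port accepts TCP connections. Both function names are hypothetical helpers, and a successful TCP connection only suggests that something is listening on that port, not that it is a healthy Spark master.

```python
import socket
from urllib.parse import urlparse

def parse_master_url(url):
    """Split a spark://host:port URL into (host, port); raise ValueError otherwise."""
    parsed = urlparse(url)
    if parsed.scheme != "spark" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"Not a spark://host:port URL: {url!r}")
    return parsed.hostname, parsed.port

def is_master_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the master's host and port succeeds."""
    host, port = parse_master_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The address below is the example from this notebook; yours will differ.
print(parse_master_url("spark://192.168.171.138:7077"))
```

Calling `is_master_reachable("spark://<master-ip>:7077")` with your own URL before starting workers can catch a wrong address early.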

---

##### **Step 3**: Start Worker Nodes

Each worker must be launched from a **separate command prompt window**. Workers host the executors that Spark applications request, so for every worker you want, repeat the following command with its own resource configuration:

```bash
spark-class2.cmd org.apache.spark.deploy.worker.Worker -c <num_cores> -m <memory> spark://192.168.171.138:7077
```

Example commands:

```bash
spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://192.168.171.138:7077
spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 2 -m 2G spark://192.168.171.138:7077
spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 1 -m 2G spark://192.168.171.138:7077
spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 2 -m 1G spark://192.168.171.138:7077
```

* `-c` specifies the number of CPU cores the worker makes available to executors.
* `-m` specifies the amount of memory the worker makes available to executors.

> The workers will register with the master, and you can monitor them at [http://localhost:8080/](http://localhost:8080/)
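
If you start several workers, generating the command lines from a list of resource pairs avoids copy-paste mistakes. The sketch below only builds the strings shown above; it does not launch anything, and `worker_command` is a hypothetical helper name.

```python
def worker_command(cores, memory, master_url):
    """Build the worker start command for one worker (string only, nothing is executed)."""
    return (
        "spark-class2.cmd org.apache.spark.deploy.worker.Worker "
        f"-c {cores} -m {memory} {master_url}"
    )

# Example master URL and worker resources taken from this notebook
master_url = "spark://192.168.171.138:7077"
worker_specs = [(1, "1G"), (2, "2G"), (1, "2G"), (2, "1G")]

for cores, memory in worker_specs:
    print(worker_command(cores, memory, master_url))
```

Each printed line can be pasted into its own command prompt window.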

---

##### Spark Standalone Cluster is Now Ready

---


### **Setting Up the Spark Session - Dynamic Allocation**

In [None]:
# In .master(), use the master URL copied from the Spark Master Web UI
spark = (
    SparkSession
    .builder
    .appName("Dynamic Allocation")
    .master("spark://192.168.171.138:7077")
    .config("spark.executor.cores", 1)
    .config("spark.executor.memory", "512M")
    .config("spark.dynamicAllocation.enabled", True)
    .config("spark.dynamicAllocation.minExecutors", 0)
    .config("spark.dynamicAllocation.maxExecutors", 3)
    .config("spark.dynamicAllocation.initialExecutors", 1)
    .config("spark.dynamicAllocation.shuffleTracking.enabled", True)
    .getOrCreate()
)

spark

```python
    .appName("Dynamic Allocation")  # Sets the name of your Spark application
```
* `SparkSession.builder`: Starts the configuration of your Spark session.
* `.appName("Dynamic Allocation")`: Names your application. Useful for identifying it in Spark UI (`http://localhost:4040` or the master web UI).


```python
    .master("spark://192.168.171.138:7077")  # Connect to the master node of your standalone Spark cluster
```
* The master URL should be the same one shown on your Spark master UI (`spark://<master-ip>:7077`).
* It tells the Spark driver where the cluster manager (master) is running.


```python
    .config("spark.executor.cores", 1)  # Each executor will use 1 CPU core
    .config("spark.executor.memory", "512M")  # Each executor will be allocated 512MB memory
```
* These are the base resources given to each executor.
* Keep these values small if you're running locally or testing on a low-resource machine.
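
To get a feel for how these per-executor settings relate to the example workers started earlier, the sketch below estimates how many 1-core, 512M executors each worker could host. It is a rough upper bound under simplifying assumptions: it ignores any memory overhead and other running applications, and it only understands `M` and `G` suffixes. The helper names are hypothetical.

```python
def parse_memory_mb(text):
    """Convert strings such as '512M' or '1G' to megabytes (only M and G are handled)."""
    units = {"M": 1, "G": 1024}
    suffix = text[-1].upper()
    if suffix not in units:
        raise ValueError(f"Unsupported memory string: {text!r}")
    return int(text[:-1]) * units[suffix]

def executors_per_worker(worker_cores, worker_memory,
                         executor_cores=1, executor_memory="512M"):
    """Estimate how many executors fit on one worker, limited by cores and by memory."""
    by_cores = worker_cores // executor_cores
    by_memory = parse_memory_mb(worker_memory) // parse_memory_mb(executor_memory)
    return min(by_cores, by_memory)

# The four example workers from Step 3: (cores, memory)
workers = [(1, "1G"), (2, "2G"), (1, "2G"), (2, "1G")]
capacity = sum(executors_per_worker(c, m) for c, m in workers)
print(capacity)  # 6
```

With these numbers the cores, not the memory, are the limiting factor on every worker.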


```python
    .config("spark.dynamicAllocation.enabled", True)  # Enables dynamic allocation of executors
```
* **Dynamic Allocation** lets Spark scale the number of executors up or down based on workload.


```python
    .config("spark.dynamicAllocation.minExecutors", 0)  # Minimum number of executors
    .config("spark.dynamicAllocation.maxExecutors", 3)  # Maximum number of executors
    .config("spark.dynamicAllocation.initialExecutors", 1)  # Number of executors to start with
```
* Spark will begin with 1 executor and scale up to a maximum of 3 depending on the workload.
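* Executors that stay idle longer than `spark.dynamicAllocation.executorIdleTimeout` (60 seconds by default) are removed, so with `minExecutors` set to 0 Spark may scale down to 0 executors.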
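
The scaling behaviour can be pictured as clamping the number of executors needed for the current tasks between the minimum and maximum. The sketch below is a simplified model, not Spark's actual allocation logic: it ignores the backlog and idle timeouts and the gradual ramp-up, and it assumes one CPU per task (`spark.task.cpus` at its default of 1). The function name is hypothetical.

```python
import math

def target_executors(pending_and_running_tasks, executor_cores=1, task_cpus=1,
                     min_executors=0, max_executors=3):
    """Simplified estimate of how many executors dynamic allocation would aim for."""
    # Each executor can run this many tasks at once
    tasks_per_executor = max(executor_cores // task_cpus, 1)
    needed = math.ceil(pending_and_running_tasks / tasks_per_executor)
    # Clamp to the configured bounds
    return max(min_executors, min(needed, max_executors))

for tasks in (0, 1, 2, 10):
    print(tasks, "tasks ->", target_executors(tasks), "executors")
```

With the settings in this notebook, ten outstanding tasks would still be capped at 3 executors, and no outstanding tasks would allow the count to drop to 0.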


```python
    .config("spark.dynamicAllocation.shuffleTracking.enabled", True)  # Enables tracking of shuffle dependencies
```
* **Shuffle Tracking** lets Spark keep track of which executors still hold shuffle data, so it can decide when it is safe to remove them.
* This makes dynamic allocation usable **even without an external shuffle service**, which is not commonly set up in standalone/local mode.


```python
spark  # Shows basic info about the created SparkSession (useful in Jupyter or REPL)
```

---

##### **Summary**:

* You're connecting to a standalone Spark cluster running at `spark://192.168.171.138:7077`.
* You're enabling **dynamic allocation** to allow Spark to automatically scale executors based on the workload.
* Executors will be created with 1 core and 512 MB of memory each, and their number will range from 0 to 3 depending on the workload.

---



In [None]:
sc = spark.sparkContext
sc.setLogLevel("ERROR")

### **Execute the Job**

In [None]:
# Change the transactions_file path to wherever your data resides
transactions_file = "D:/Internship/spark-project-main/pyspark_test_project/data/transactions.parquet"
df_transactions = spark.read.parquet(transactions_file)
df_transformed = (
    df_transactions
    .withColumn("amt", F.col("amt").cast(DoubleType()))
    .filter(F.col("amt") > 10)
    .groupBy("city")
    .agg(F.avg("amt").alias("avg_amt"))
)
# Change the output directory to wherever you want to store the results
df_transformed.write.mode("overwrite").csv("output/repartition_4")
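
For intuition about what the job computes, here is a plain-Python sketch of the same steps (cast `amt` to a number, keep rows with `amt` above 10, then average per city) on a few made-up rows. The sample data and the function name are hypothetical and only illustrate the semantics; this is not how Spark executes the query.

```python
from collections import defaultdict

def average_amount_by_city(rows, threshold=10.0):
    """Mimic the cast, filter, groupBy, and avg steps on a list of dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum, count]
    for row in rows:
        try:
            amt = float(row.get("amt"))
        except (TypeError, ValueError):
            continue  # values that cannot be cast are dropped by the filter
        if amt > threshold:
            totals[row["city"]][0] += amt
            totals[row["city"]][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}

# Made-up sample rows for illustration only
sample_rows = [
    {"city": "A", "amt": "25.0"},
    {"city": "A", "amt": "5.0"},
    {"city": "A", "amt": "15.0"},
    {"city": "B", "amt": "12.5"},
    {"city": "C", "amt": "3"},
]
print(average_amount_by_city(sample_rows))  # {'A': 20.0, 'B': 12.5}
```

City `C` does not appear in the result because its only row is filtered out before grouping.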