# PySpark Tutorial — Beginner to Advanced
Comprehensive notes, theory (Hadoop/HDFS/RDD/MapReduce), practical PySpark examples, advanced topics, diagrams, and sample datasets.
Each code cell includes comments explaining what it does, internal working notes, and keywords.

## Setup: install & environment
**What this cell does:** shows how to install PySpark and the optional `findspark` helper. **Note:** creating a real SparkSession requires a Java runtime; the `pyspark` pip package bundles Spark itself for local use, while a cluster needs a matching Spark installation. The install commands are commented out; uncomment them if needed.

**Keywords:** `pyspark`, `SparkSession`, `spark-submit`, `PYSPARK_PYTHON`

In [None]:
# Install PySpark (uncomment to run)
# !pip install pyspark
# !pip install findspark  # optional helper

# Typical SparkSession creation (do not run unless Java/Spark present)
# from pyspark.sql import SparkSession
# spark = SparkSession.builder \
#     .appName('pyspark-notebook') \
#     .config('spark.some.config.option', 'some-value') \
#     .getOrCreate()

# print(spark)

## Big Data Theory: Hadoop, HDFS, MapReduce, Distributed Systems (Short)
- **Hadoop**: ecosystem for distributed storage and processing.
- **HDFS**: distributed file system with a NameNode (metadata) and DataNodes (blocks). Files are split into blocks (128 MB by default), and each block is replicated across DataNodes (replication factor 3 by default).
- **MapReduce**: programming model (Map + Shuffle + Reduce). Batch processing; writes intermediate results to disk.
- **Distributed Systems basics**: partitioning, replication, consensus, fault tolerance, data locality (run close to data), network and disk IO bottlenecks.

**Keywords:** blocks, replication, NameNode, DataNode, mapper, reducer, shuffle, data locality, eventual consistency.
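
**What this cell does:** sketches the Map + Shuffle + Reduce model described above as a single-process word count in plain Python. The input lines and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are made up for illustration; a real MapReduce job runs these phases on different machines and writes intermediate results to disk.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: collapse the grouped values for one key into a single result."""
    return key, sum(values)

# Made-up input; in Hadoop each line would come from an HDFS block
lines = ["spark reads data", "hadoop stores data", "spark processes data"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)
```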

## RDDs (Resilient Distributed Datasets)
**What:** low-level immutable distributed collection of objects. Operations: transformations (map, filter) and actions (collect, count).
**Internals:** lineage graph for fault tolerance; recompute partitions on failure; operations are lazy; partitions are units of parallelism.

**Keywords:** partition, lineage, transformation, action, narrow/wide dependencies, shuffle, checkpointing.

In [None]:
# Example RDD code (conceptual, inside SparkSession)
# rdd = spark.sparkContext.parallelize([1,2,3,4], numSlices=2)
# rdd2 = rdd.map(lambda x: x*2)
# print(rdd2.collect())

# Notes: avoid collect() on large datasets; use actions like count(), take(n) for sampling.
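
**What this cell does:** models lazy transformations and lineage in plain Python, so it runs without Spark. `ToyRDD` is a hypothetical teaching class, not Spark's implementation: it only records a parent and a function, and does the work when an action (`collect`) is called. Calling the action twice recomputes from the source, which is the same mechanism Spark uses to rebuild a lost partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores a parent and a function instead of results."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self.source = source    # only the root dataset holds data
        self.parent = parent
        self.op = op            # function applied to the parent's iterator
        self.name = name

    def map(self, f):
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)), name="filter")

    def lineage(self):
        """Chain of operation names from the source to this dataset."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def _compute(self):
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._compute())

    def collect(self):
        """Action: only now is the recorded lineage executed."""
        return list(self._compute())

calls = []  # records when the mapped function actually runs

def double(x):
    calls.append(x)
    return x * 2

rdd = ToyRDD(source=[1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # transformations alone do no work
result = rdd.collect()
print(rdd.lineage(), result)
```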

## Spark Core: Transformations, Actions, DAG, Stages & Tasks
- **Transformations**: lazy operations returning new RDD/DataFrame (map, filter, select, join).
- **Actions**: trigger execution (count, collect, show, write/save).
- **DAG**: when an action runs, the DAG scheduler turns the recorded transformations into a Directed Acyclic Graph of stages.
- **Stages**: split by shuffle boundaries; each stage contains tasks executed in parallel per partition.

**Keywords:** lineage, shuffle, stage, task, narrow dependency, wide dependency.
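
**What this cell does:** illustrates how shuffle boundaries split a job into stages, using a plain-Python toy. The `WIDE_OPS` set and `split_into_stages` are assumptions made for this sketch; the real scheduler works on a dependency graph rather than a linear list, and some operations listed as wide (such as `join`) can avoid a shuffle when inputs are already co-partitioned.

```python
# Operations treated as wide (shuffle-producing) in this toy model
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition", "distinct", "sortByKey"}

def split_into_stages(ops):
    """Group a linear chain of transformations into stages.

    A wide operation reads shuffled data, so it starts a new stage;
    narrow operations are pipelined into the current stage.
    """
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

job = ["parallelize", "map", "filter", "reduceByKey", "mapValues", "join", "map"]
stages = split_into_stages(job)
for i, stage in enumerate(stages):
    print(f"stage {i}: {' -> '.join(stage)}")
```

Within each stage Spark launches one task per partition, so a stage over 8 partitions runs 8 tasks.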

## PySpark DataFrame API — Basics
**What this cell does:** introduces creation and basic operations on DataFrames.
**Internals:** PySpark DataFrame is a distributed collection of rows with schema; uses Catalyst optimizer and Tungsten execution backend.
**Keywords:** SparkSession, DataFrame, Column, select, filter, withColumn, groupBy, agg, join

In [None]:
# Example PySpark DataFrame usage (conceptual code to be run in a Spark environment)
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('demo').getOrCreate()
# df = spark.read.csv('Pyspark_tutorial/pyspark_samples/sales.csv', header=True, inferSchema=True)
# df.printSchema()
# df.show(5)
# df.select('customer', 'amount').filter(df.amount > 100).show()

# Note: the above commands work in an environment with Spark installed and Java available.

## Reading & Writing Data
**What this cell does:** demonstrates formats and options.
**Internals:** Spark supports many data sources (CSV, Parquet, ORC, JSON, JDBC). Columnar formats such as Parquet and ORC are usually preferred for performance because they store the schema with the data and allow column pruning and predicate pushdown.
**Keywords:** parquet, predicate pushdown, partitionBy, mode, header, inferSchema

## Partitioning, Shuffle, & Broadcast Joins
**What this cell does:** explains how data movement happens and best practices.
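**Internals:** A shuffle redistributes rows between executors: intermediate files are written to local disk and then fetched over the network, which makes it expensive. A broadcast join sends the small table to every executor so the large table does not have to be shuffled.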
**Keywords:** repartition, coalesce, broadcast, shuffle, shuffle partitions (spark.sql.shuffle.partitions)

In [None]:
# Broadcast join example (conceptual)
# from pyspark.sql.functions import broadcast
# small = spark.read.csv('/small.csv', header=True, inferSchema=True)
# large = spark.read.csv('/large.csv', header=True, inferSchema=True)
# joined = large.join(broadcast(small), on='id')
# joined.show()

# Repartition vs coalesce notes:
# - repartition(n): full shuffle to create n partitions
# - coalesce(n): avoid full shuffle when decreasing partitions
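
**What this cell does:** sketches the idea behind a broadcast join in plain Python. The small table becomes an in-memory lookup that every partition of the large table can use, so no large row has to move. `broadcast_hash_join` and the sample rows are hypothetical; Spark's own implementation differs in detail.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large table against an in-memory copy of the small table."""
    # "Broadcast": build one lookup table from the small side; every partition uses a copy
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already lives; large rows never change partition
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined_partitions.append(out)
    return joined_partitions

# Made-up data: the large table is already split into two partitions
large = [
    [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 80.0}],
    [{"id": 1, "amount": 45.5}, {"id": 3, "amount": 10.0}],
]
small = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

joined = broadcast_hash_join(large, small, key="id")
print(joined)
```

The output keeps the same two partitions as the input, which is the point: only the small table was copied around.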

## Checkpointing & Fault Tolerance
**What this cell does:** describes checkpointing and lineage management.
**Internals:** RDD/DataFrame lineage records transformations; excessive lineage increases job planning overhead; checkpointing writes to reliable storage to truncate lineage.
**Keywords:** checkpoint, setCheckpointDir, lineage, fault recovery

In [None]:
# Checkpoint example (conceptual)
# spark.sparkContext.setCheckpointDir('/tmp/checkpoints')
# rdd = spark.sparkContext.parallelize(range(1000)).map(lambda x: x+1)
# rdd.checkpoint()             # marks the RDD for checkpointing; returns None
# rdd.count()                  # the first action materializes the checkpoint
# print(rdd.isCheckpointed())
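
**What this cell does:** shows, with a plain-Python toy, why long iterative jobs checkpoint: every transformation adds a link to the lineage chain, and a checkpoint stores the current values and drops the chain. `ToyDataset` is a hypothetical class for illustration only; it keeps data in memory, whereas Spark writes checkpoints to reliable storage.

```python
class ToyDataset:
    """Toy dataset that remembers its parent and the function that derives it."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data
        self.parent = parent
        self.fn = fn

    def map(self, fn):
        return ToyDataset(parent=self, fn=fn)

    def lineage_depth(self):
        """Number of transformations that must be replayed to rebuild this dataset."""
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth

    def compute(self):
        """Replay the lineage from the nearest materialized ancestor."""
        chain, node = [], self
        while node.parent is not None:
            chain.append(node.fn)
            node = node.parent
        values = list(node.data)
        for fn in reversed(chain):
            values = [fn(v) for v in values]
        return values

    def checkpoint(self):
        """Store the current values and forget the parents (truncates lineage)."""
        self.data = self.compute()
        self.parent = None
        self.fn = None

def run(steps, checkpoint_every=None):
    ds = ToyDataset(data=[0, 1, 2])
    for step in range(1, steps + 1):
        ds = ds.map(lambda x: x + 1)
        if checkpoint_every and step % checkpoint_every == 0:
            ds.checkpoint()
    return ds

plain = run(50)
checkpointed = run(50, checkpoint_every=20)
print(plain.lineage_depth(), checkpointed.lineage_depth())
```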

## Structured Streaming (Intro)
**What this cell does:** introduces streaming model and concepts.
**Internals:** Structured Streaming treats streams as unbounded tables; uses micro-batch or continuous processing; supports event-time aggregations and watermarks.
**Keywords:** readStream, writeStream, trigger, watermark, window, outputMode

In [None]:
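# Structured Streaming example (conceptual)
# from pyspark.sql.functions import col, to_timestamp, window
# # File sources need an explicit schema; the column types below are assumed
# df_stream = (spark.readStream
#              .schema('ts STRING, level STRING')
#              .option('sep', ',')
#              .csv('/stream/path'))
# events = df_stream.withColumn('event_time', to_timestamp(col('ts')))
# query = (events
#          .groupBy(window(col('event_time'), '1 minute'), col('level'))
#          .count()
#          .writeStream
#          .outputMode('update')
#          .format('console')
#          .start())
# query.awaitTermination()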
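
**What this cell does:** models event-time windows and a watermark in plain Python, so the semantics can be checked without a streaming source. This is a simplified sketch: the 60-second window, the 120-second allowed lateness, and the sample events are assumptions, and Spark's actual state handling is more involved. The watermark is computed once per micro-batch from event times seen in earlier batches, and an event is dropped when its window has already closed.

```python
from collections import Counter

WINDOW_SECONDS = 60      # tumbling window length
LATENESS_SECONDS = 120   # hypothetical allowed lateness

def window_start(event_time):
    """Start of the tumbling window that contains event_time (seconds)."""
    return event_time - event_time % WINDOW_SECONDS

def run_micro_batches(batches):
    counts = Counter()    # (window_start, level) -> count
    dropped = []
    max_event_time = 0
    for batch in batches:
        # Watermark is fixed for the whole batch
        watermark = max_event_time - LATENESS_SECONDS
        for event_time, level in batch:
            window_end = window_start(event_time) + WINDOW_SECONDS
            if window_end <= watermark:
                dropped.append((event_time, level))   # too late: window already closed
                continue
            counts[(window_start(event_time), level)] += 1
        max_event_time = max([max_event_time] + [t for t, _ in batch])
    return counts, dropped

# Made-up events: (event time in seconds, log level)
batches = [
    [(10, "INFO"), (70, "ERROR"), (75, "ERROR")],
    [(400, "INFO"), (65, "ERROR")],   # 65 is late but still within the watermark
    [(20, "INFO"), (410, "INFO")],    # 20 arrives after its window closed
]
counts, dropped = run_micro_batches(batches)
print(dict(counts), dropped)
```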

## Advanced Internals: Catalyst & Tungsten
- **Catalyst**: Spark SQL optimizer that builds logical plan, applies rule-based and cost-based optimizations, and generates physical plan.
- **Tungsten**: execution engine focusing on memory and CPU efficiency (off-heap, whole-stage code generation).

**Keywords:** logical plan, physical plan, whole-stage codegen, vectorized execution, predicate pushdown.
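
**What this cell does:** gives a flavor of rule-based optimization with one rule, constant folding, applied to a tiny expression tree in plain Python. The tuple encoding (`lit`, `col`, `add`, `mul`, `gt`) and `fold_constants` are invented for this sketch; Catalyst has its own tree classes and many more rules.

```python
def fold_constants(expr):
    """Rewrite an expression tree bottom-up, replacing operations on literals with their value."""
    kind = expr[0]
    if kind in ("lit", "col"):
        return expr                      # leaves are returned unchanged
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "lit" and right[0] == "lit":
        if op == "add":
            return ("lit", left[1] + right[1])
        if op == "mul":
            return ("lit", left[1] * right[1])
        if op == "gt":
            return ("lit", left[1] > right[1])
    return (op, left, right)             # something depends on a column: keep the node

# amount > (50 * 2) + 0
predicate = ("gt", ("col", "amount"), ("add", ("mul", ("lit", 50), ("lit", 2)), ("lit", 0)))
optimized = fold_constants(predicate)
print(optimized)
```

The folded predicate compares the column with a single literal, so the arithmetic is done once at planning time instead of once per row.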

## Performance Tuning & Best Practices
- Use columnar formats (Parquet/ORC)
- Avoid wide transformations when possible
- Tune `spark.sql.shuffle.partitions`
- Use broadcast joins for small lookup tables
- Cache intermediate results only when reused
- Use vectorized UDFs (pandas_udf) instead of row UDFs
- Prefer built-in functions over Python UDFs

**Keywords:** serialization (Kryo), memory fraction, executor cores, executor memory, GC tuning

In [None]:
# Example: tuning shuffle partitions (conceptual)
# spark.conf.set('spark.sql.shuffle.partitions', '200')
# spark.conf.get('spark.sql.shuffle.partitions')

# Example: enabling Kryo serialization
# (set it before the session starts; it cannot be changed on a running session)
# spark = SparkSession.builder \
#     .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
#     .getOrCreate()
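
**What this cell does:** turns "tune `spark.sql.shuffle.partitions`" into a rough calculation. This is a rule of thumb, not something Spark computes for you: the 128 MiB target partition size and the helper `suggest_shuffle_partitions` are assumptions for this sketch, and the result should be checked against the Spark UI.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, total_cores, target_partition_bytes=128 * 1024**2):
    """Heuristic: enough partitions that each is about the target size,
    rounded up to a multiple of the core count so every core gets work."""
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    n = max(by_size, total_cores)
    return math.ceil(n / total_cores) * total_cores

GIB = 1024**3
print(suggest_shuffle_partitions(50 * GIB, total_cores=16))
print(suggest_shuffle_partitions(1 * GIB, total_cores=16))
print(suggest_shuffle_partitions(10 * GIB, total_cores=12))
```

The suggested value would then be applied with `spark.conf.set('spark.sql.shuffle.partitions', ...)`; the default is 200.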

## MLlib (Machine Learning)
**What this cell does:** introduces Spark MLlib API (DataFrame-based) and pipelines.
**Internals:** algorithms are adapted to distributed data; some are iterative or approximate (e.g., ALS, alternating least squares, for collaborative filtering).
**Keywords:** Pipeline, Transformer, Estimator, fit, transform, MLlib, ALS, logistic regression

In [None]:
# Example MLlib pipeline (conceptual; train_df and test_df are assumed to exist)
# from pyspark.ml import Pipeline
# from pyspark.ml.feature import VectorAssembler
# from pyspark.ml.classification import LogisticRegression
# assembler = VectorAssembler(inputCols=['feature1','feature2'], outputCol='features')
# lr = LogisticRegression(featuresCol='features', labelCol='label')
# pipeline = Pipeline(stages=[assembler, lr])
# model = pipeline.fit(train_df)       # fits each Estimator stage in order
# preds = model.transform(test_df)     # applies the fitted stages to new data
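
**What this cell does:** mimics the Estimator / Transformer / Pipeline contract in plain Python to show what `fit` and `transform` do. All class names here (`MeanCenterer`, `PositiveFlag`, `ToyPipeline`) are hypothetical and operate on lists of dicts, not DataFrames. An Estimator learns something in `fit` and returns a fitted model; a Transformer only maps rows to rows.

```python
class MeanCenterer:
    """Estimator: fit() learns the column mean and returns a fitted model."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCentererModel(self.column, mean)

class MeanCentererModel:
    """Transformer produced by MeanCenterer.fit(); applies the learned mean."""
    def __init__(self, column, mean):
        self.column, self.mean = column, mean

    def transform(self, rows):
        return [{**r, self.column: r[self.column] - self.mean} for r in rows]

class PositiveFlag:
    """Stateless Transformer: adds a boolean column without learning anything."""
    def __init__(self, column, output):
        self.column, self.output = column, output

    def transform(self, rows):
        return [{**r, self.output: r[self.column] > 0} for r in rows]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # Estimator: fit on the data as transformed so far
                stage = stage.fit(rows)
            rows = stage.transform(rows)    # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

train_rows = [{"x": 1.0}, {"x": 3.0}]   # made-up training data, mean is 2.0
new_rows = [{"x": 5.0}, {"x": 0.0}]
model = ToyPipeline([MeanCenterer("x"), PositiveFlag("x", "above_mean")]).fit(train_rows)
preds = model.transform(new_rows)
print(preds)
```

The mean learned from the training rows is reused on the new rows, which is why fitting and transforming are separate steps.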

## Debugging, Logging & UI
- Use the Spark Web UI (served by the driver, on port 4040 by default) to inspect jobs, stages, tasks, executors, and storage.
- Check driver/executor logs for stack traces.
- Enable event logs for history server.

**Keywords:** Spark UI, executor logs, eventLog.enabled, history server.

In [None]:
# Example: enabling event logs (conceptual; the log directory must already exist)
# spark = (SparkSession.builder
#          .config('spark.eventLog.enabled', 'true')
#          .config('spark.eventLog.dir', '/tmp/spark-events')
#          .getOrCreate())

## Local Pandas Examples for Quick Testing (useful when Spark not available locally)
**What this cell does:** provides pandas equivalents for quick experimentation and mirrors PySpark examples so you can prototype logic locally before running on Spark cluster.
**Internal notes:** Use pandas for small datasets; scale to Spark for big data.
**Keywords:** pandas, prototype, sample data

In [None]:
import pandas as pd
sales = pd.read_csv('Pyspark_tutorial/pyspark_samples/sales.csv')
users = pd.read_csv('Pyspark_tutorial/pyspark_samples/users.csv')
logs = pd.read_csv('Pyspark_tutorial/pyspark_samples/logs.csv')

# Quick analysis: total by customer
print(sales.groupby('customer')['amount'].sum())
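# Join example (NOTE: order_id = user_id is an unusual key pairing; confirm it matches the sample files)
print(sales.merge(users, left_on='order_id', right_on='user_id', how='left').head())
# Time series: sums over 7-day bins (bins start at the first date, not on calendar weeks)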
sales['date'] = pd.to_datetime(sales['date'])
print(sales.set_index('date').resample('7D')['amount'].sum())

## Visual Diagrams
Included diagrams: Big Data stack and Spark DAG/execution flow.

In [None]:
from IPython.display import Image, display
display(Image(filename='Pyspark_tutorial/pyspark_samples/bigdata_stack.png'))
display(Image(filename='Pyspark_tutorial/pyspark_samples/spark_dag.png'))