In [None]:
1. Basic Databricks Questions


What is Databricks?
Databricks is a unified data analytics platform built on Apache Spark that supports data engineering, 
data science, machine learning, and BI tasks via a Lakehouse architecture.


What is a cluster in Databricks? What are the different cluster types?
A cluster is a set of compute (VMs) that run your Spark workloads. 
Types include all-purpose (interactive) and job clusters (ephemeral for jobs).


Difference between a job cluster and an all-purpose cluster?

All-purpose: multi-user, shared, interactive development

Job cluster: spun up per job, more transient and optimized for scheduled workloads




What is the Databricks Workspace?
The workspace is the UI environment where users create and manage notebooks, jobs, folders, and repos.


What are notebooks in Databricks and which languages are supported?
Notebooks are interactive documents for code, visualizations, and markdown. 
Supported languages are Python, SQL, Scala, and R. Languages can be mixed within one notebook by starting a cell with a magic command such as %sql, %python, %scala, or %r.


What is DBFS (Databricks File System)?
DBFS is a distributed file system mounted into Databricks clusters, abstracting storage (like S3, ADLS) into a file-like API.


How do you mount Azure Data Lake or AWS S3 in Databricks?
Either create a mount point with dbutils.fs.mount, or skip mounting and 
read the storage directly through its URI (s3:// or abfss://) after granting access 
with an IAM role / instance profile (AWS) or a service principal (Azure).


What is Auto Loader?
Auto Loader is a Databricks feature that incrementally ingests new data files as they arrive in cloud storage 
(e.g., S3 or ADLS). It discovers files through directory listing or file notifications and can infer and track the schema.
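
As a rough sketch of how an Auto Loader stream is usually configured, the helper below only assembles the reader options as a plain dictionary, so it runs without a cluster. The helper name, the paths, and the choice of JSON are illustrative assumptions; the option keys themselves (cloudFiles.format, cloudFiles.schemaLocation, cloudFiles.useNotifications) are standard Auto Loader options.

```python
def autoloader_options(file_format, schema_location, use_notifications=False):
    """Build reader options for an Auto Loader (cloudFiles) stream.

    Hypothetical helper: it only assembles a dict and never touches Spark.
    """
    return {
        "cloudFiles.format": file_format,              # format of the incoming files
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
        # "true" = file notification mode, "false" = directory listing mode
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


opts = autoloader_options("json", "/tmp/example/_schemas/orders")

# On a Databricks cluster (where `spark` is predefined) the options would be used as:
#   df = (spark.readStream.format("cloudFiles")
#           .options(**opts)
#           .load("/tmp/example/landing/orders"))
```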


What are widgets in Databricks notebooks?
Widgets allow parameterization of notebooks (dropdowns, text boxes) so that jobs can be run with 
different input parameters.


How do you schedule a Databricks notebook as a job?
Use the Databricks Jobs UI or the Jobs API: define a job with a notebook task, choose the cluster, set a schedule (cron), 
and optionally add dependent tasks.

Method 1: Using the Jobs UI (Recommended)
In the left sidebar, click Jobs.
Click Create job.
In the Tasks tab, click Add task.
Give the task a name and select Notebook from the Type drop-down menu.
Under Source, select Workspace and browse to find your notebook.
Configure a job cluster or select an existing cluster for the task.
Optionally, set up a schedule, alerts, or parameters and click Create. 

Method 2: Scheduling from the notebook
Open the notebook you want to run as a job.
Click the Schedule button in the top-right corner.
If jobs already exist for the notebook, click Add a schedule.
In the Schedule dialog, enter a name, choose a schedule type (Simple or Advanced), and select the compute resource.
You can also add parameters or email alerts to be sent on job events.
Click Create (or Submit) to save the scheduled job. 
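
For the API route, a minimal sketch of the request body for the Jobs API create endpoint (POST /api/2.1/jobs/create) might look like the following. The job name, notebook path, cluster values, and the helper function are all placeholders invented for illustration; only the field names follow the Jobs API.

```python
import json


def notebook_job_payload(job_name, notebook_path, cron, timezone="UTC"):
    """Assemble a request body for POST /api/2.1/jobs/create (sketch only)."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {"notebook_path": notebook_path},
                # Job cluster created for this run; values are placeholders.
                "new_cluster": {
                    "spark_version": "<runtime-version>",
                    "node_type_id": "<node-type>",
                    "num_workers": 2,
                },
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,  # Quartz syntax: sec min hour day month weekday
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
    }


payload = notebook_job_payload(
    "nightly_etl",
    "/Workspace/Users/someone@example.com/etl",  # hypothetical notebook path
    "0 0 6 * * ?",                               # every day at 06:00
)
print(json.dumps(payload, indent=2))
```

The dictionary would be sent as the JSON body of an authenticated HTTP request; the HTTP call is left out so the sketch stays runnable offline.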


2. Intermediate Databricks / Delta Questions


Explain the advantages of Delta Lake.
Delta Lake brings ACID transactions, time-travel, scalable metadata handling, and schema enforcement 
to data lakes.


What problem does Delta Lake solve?
It solves data reliability, consistency, and performance issues in data lakes (e.g., handling concurrent writes, data corruption, small files).


Delta Lake features: ACID, time travel, schema enforcement, schema evolution.


ACID: ensures transactional consistency


Time travel: query data as of previous versions


Schema enforcement: reject writes not matching schema


Schema evolution: allows schema to change over time in a controlled way




How does schema enforcement work in Delta Lake?
When writing to a Delta table, Delta validates incoming data against the defined schema and 
rejects the write if incompatible.
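
To make the difference between enforcement and evolution concrete, here is a toy model in plain Python. Schemas are simple column-to-type dictionaries and the function name is made up; it mimics the idea only and is not how Delta implements the check. In Delta itself, evolution is typically enabled per write with the mergeSchema option.

```python
def check_write(table_schema, batch_schema, merge_schema=False):
    """Return the table schema after a write, or raise if the write is rejected.

    Schemas are plain {column: type} dicts; this mimics the idea, not Delta's code.
    """
    for column, dtype in batch_schema.items():
        if column in table_schema and table_schema[column] != dtype:
            raise ValueError(f"type mismatch for {column!r}")
        if column not in table_schema and not merge_schema:
            raise ValueError(f"unexpected column {column!r}")
    # With merge_schema=True, new columns are appended (schema evolution).
    return {**table_schema, **batch_schema}


table = {"id": "long", "amount": "double"}

# Evolution allowed: the new "currency" column is added to the schema.
evolved = check_write(table, {"id": "long", "currency": "string"}, merge_schema=True)

# Enforcement: the same batch without merge_schema would raise ValueError.
```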


Difference between MERGE and UPDATE in Delta?


UPDATE: modifies existing rows based on a condition


MERGE: upsert pattern — can insert, update, or delete based on matching conditions
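
A small sketch of what an upsert MERGE statement looks like, generated from Python so it can be inspected without a cluster. The table names, column names, and the helper itself are hypothetical; the statement follows the Delta MERGE INTO syntax.

```python
def build_merge_sql(target, source, key, columns):
    """Render a Delta MERGE statement that upserts `columns` matched on `key`."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_cols = ", ".join([key, *columns])
    insert_vals = ", ".join(f"s.{c}" for c in [key, *columns])
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


sql = build_merge_sql("customers", "customer_updates", "customer_id", ["email", "city"])
print(sql)
# On a cluster this string could be passed to spark.sql(sql).
```

An UPDATE would only cover the WHEN MATCHED branch; the WHEN NOT MATCHED branch is what turns the statement into an upsert.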




What is OPTIMIZE and ZORDER in Databricks?


OPTIMIZE: compacts small files into larger ones for performance


ZORDER: co-locates rows with similar values of the specified columns in the same files, so queries that filter on those columns (including range filters) can skip more files




What is the small files problem in Delta tables?
When many tiny files are created (e.g., from micro-batches), query performance suffers due to overhead. 
It can be mitigated by compaction (OPTIMIZE).
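
Conceptually, compaction is a bin-packing step: many small files are rewritten into a few files near a target size. The sketch below illustrates that idea with made-up file sizes and an assumed 1 GB target; it is not the algorithm OPTIMIZE actually runs.

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedily group small files into bins of roughly `target_mb` each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current bin once adding this file would exceed the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [8] * 500          # 500 files of 8 MB each (made-up numbers)
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
```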


What is the purpose of VACUUM in Delta?
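VACUUM deletes data files that are no longer referenced by the current table version and are older than the retention period (7 days by default), freeing storage. 
Be careful: time travel to versions that depended on the deleted files is no longer possible.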


Explain COPY INTO. When do you use it?
COPY INTO is a SQL command that bulk loads files from cloud storage into a Delta table. It is idempotent: files that were already loaded are skipped on re-runs, which makes it a good fit for scheduled or ad hoc batch loads.


What is Databricks Runtime (DBR)?
DBR is the managed Spark runtime provided by Databricks, optimized for performance, with pre-installed libraries, ML support, GPU support, and more.



3. Spark / Databricks SQL Questions


Difference between Spark DataFrame and Pandas DataFrame.


Spark DataFrame: distributed, parallel, fault-tolerant


Pandas DataFrame: in-memory, single-node, good for small-to-medium data




What is the Catalyst Optimizer and Tungsten?


Catalyst: Spark’s query optimizer, does logical and physical planning


Tungsten: Spark’s execution engine optimizations (memory, code gen) for speed and efficiency




Explain narrow vs wide transformations in Spark.


Narrow: partitions don’t need data from other partitions (e.g., map, filter)


Wide: require shuffles / data exchange across partitions (e.g., groupBy, join)
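
The distinction can be illustrated without Spark by modelling partitions as Python lists. The data and the toy partitioner (character code modulo the partition count) are assumptions for illustration only.

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]  # two input partitions

# Narrow: each output partition is computed from exactly one input partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]


def shuffle_by_key(parts, num_out):
    """Wide: every output partition may need rows from every input partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            out[ord(key) % num_out][key].append(value)  # toy partitioner
    return [dict(p) for p in out]


grouped = shuffle_by_key(doubled, num_out=2)
print(grouped)
```

Note how the values for key "a" start in two different input partitions and end up in one output partition; that movement is the shuffle.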




How do you handle skew in Spark?


1. Broadcast small tables for joins

2. Salting keys to distribute skewed keys

3. Using adaptive query execution (AQE)

4. Repartition data manually
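
Salting (item 2) can be sketched in plain Python by counting rows per key before and after a salt suffix is added. The row counts and the number of salt buckets are arbitrary assumptions, and the salt is assigned round-robin here to keep the example deterministic; in Spark it would usually be random.

```python
from collections import Counter

NUM_SALTS = 8  # assumed number of salt buckets

rows = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]

# Without salting, every "hot" row lands in the same shuffle group.
before = Counter(key for key, _ in rows)

# With salting, the hot key is spread over NUM_SALTS sub-keys. The other side of
# the join must be replicated once per salt value so the sub-keys still match.
after = Counter(f"{key}_{i % NUM_SALTS}" for i, (key, _) in enumerate(rows))

print("largest group before:", max(before.values()))
print("largest group after: ", max(after.values()))
```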




What causes shuffle in Spark?
Shuffle happens when data needs to be redistributed across partitions — e.g., 
groupBy, join, distinct, repartition.


How do you reduce shuffle in Spark applications?


1. Use broadcast joins.
2. Use map-side aggregations.
3. Use partitioning.
4. Prefer coalesce over repartition when only reducing the number of partitions.
5. Use filter, select early to reduce data.




What is broadcast join? When should you use it?
A broadcast join sends a small dataset to all executors so a large dataset can join locally without 
shuffling. Use when one side of the join is small.
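
The mechanics can be sketched in plain Python: the small table becomes a lookup that every partition of the large table joins against locally. The tables and the function are invented for illustration. In PySpark the equivalent hint is the broadcast function from pyspark.sql.functions applied to the small DataFrame inside the join.

```python
def broadcast_join(large_partitions, small_table, key):
    """Inner-join each partition of the large side against a full copy of the small side."""
    lookup = {row[key]: row for row in small_table}  # the "broadcast" copy
    joined = []
    for partition in large_partitions:               # each executor works locally
        joined.append(
            [{**row, **lookup[row[key]]} for row in partition if row[key] in lookup]
        )
    return joined


orders = [
    [{"country": "DE", "amount": 10}],
    [{"country": "FR", "amount": 7}, {"country": "XX", "amount": 1}],
]
countries = [{"country": "DE", "name": "Germany"}, {"country": "FR", "name": "France"}]

result = broadcast_join(orders, countries, "country")
print(result)
```

No rows of the large side move between partitions; only the small table is copied.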


What is AQE (Adaptive Query Execution)?
Adaptive Query Execution dynamically adjusts the query plan at runtime, e.g., 
changing join strategy or number of shuffle partitions based on statistics.


What are Spark UDF, Pandas UDF, and SQL functions?

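Spark UDF: user-defined function written in Scala, Java, or Python; opaque to the optimizer, and Python UDFs add serialization overhead between the JVM and the Python worker, so usually slower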
Pandas UDF (vectorized): uses Apache Arrow, faster, good for batch vectorized operations
SQL functions: built-in Spark SQL functions, most efficient




What is checkpointing vs. caching in Spark?
Caching: storing data in memory/disk to reuse it across operations
Checkpointing: writing to stable storage (HDFS/DBFS) to truncate lineage and provide fault tolerance





4. Lakehouse & Architecture Questions

What is the Databricks Lakehouse Platform?
A unified architecture combining data lake flexibility and data warehouse reliability with Delta Lake 
as the storage layer.

How does the Lakehouse differ from a traditional data warehouse?
Lakehouse uses cheaper, scalable object storage
Supports batch & streaming
Provides ACID via Delta
More flexible schema handling




What is Unity Catalog?
A centralized data governance service in Databricks that handles access control, 
metadata management, and lineage across your lakehouse.


How do you secure data using Unity Catalog?
By organizing data into catalogs, schemas, and tables and granting fine-grained privileges 
(e.g., SELECT, MODIFY, USE SCHEMA) on them to users and groups; 
lineage and audit logs then show how the data is actually used.


Difference between a catalog, schema, and table in Unity Catalog.
Catalog: top-level namespace (e.g., “finance_catalog”)
Schema: a namespace inside a catalog (like a database)
Table: the actual data structure inside a schema




What is the medallion architecture (Bronze, Silver, Gold)?

Bronze: raw, ingested data
Silver: cleaned, deduplicated, enriched data
Gold: business-level aggregates / curated tables / serving layer
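
A toy version of the three layers, using lists of dictionaries in place of Delta tables. The records, column names, and cleaning rules are invented for illustration.

```python
bronze = [  # raw records exactly as ingested, including a duplicate and a bad row
    {"order_id": "1", "amount": "10.5", "country": "DE"},
    {"order_id": "1", "amount": "10.5", "country": "DE"},
    {"order_id": "2", "amount": "oops", "country": "FR"},
    {"order_id": "3", "amount": "4.5", "country": "DE"},
]


def to_silver(rows):
    """Clean: cast types, drop unparseable rows, deduplicate on order_id."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline might route this row to a quarantine table
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            clean.append(
                {"order_id": int(row["order_id"]), "amount": amount, "country": row["country"]}
            )
    return clean


def to_gold(rows):
    """Aggregate: revenue per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```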




How do you design a Delta Live Tables (DLT) pipeline?


1. Define source tables
2. Apply transformations (cleaning, business logic)
3. Use expectations (data quality)
4. Define target tables for Bronze, Silver, Gold
5. Set pipeline scheduling and monitoring




What is serverless SQL warehouse?
A fully managed SQL compute resource in Databricks for running BI/analytics queries. It scales 
automatically and is billed by usage.


What are DBU charges?
DBU = Databricks Unit, a usage-based cost metric representing compute resource consumption in Databricks.


Explain cluster autoscaling.
Autoscaling automatically adds or removes worker nodes in a cluster based on workload demand to 
optimize cost and performance.



5. Real-Time & Streaming Questions


What is Structured Streaming?
Spark’s high-level API for stream processing, treating streaming data as incremental micro-batches 
(or continuous, depending on mode) of DataFrames.


What is Auto Loader and how is it different from readStream?
Auto Loader is optimized for file-based incremental ingestion; 
readStream is Spark’s API for reading streaming data regardless of source, but requires more manual 
handling.


What are checkpoints and triggers in streaming?

Checkpointing: storing progress and offsets to recover from failures
Triggers: define when micro-batches are processed (e.g., processingTime, once, availableNow, continuous)




How do you perform stream-to-stream and stream-to-batch join?
Stream-to-stream: both sides are streaming DataFrames; define watermarks and a time-range join condition so Spark can bound the state it keeps
Stream-to-batch: join streaming DataFrame with a static (batch) DataFrame, simpler and common for lookups
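
Watermarking is what keeps streaming state bounded, and its core rule fits in a few lines of plain Python. This sketch applies the rule per event for clarity; Spark advances the watermark between micro-batches, so exact drop behavior differs. The event times and the 20-second delay are made up.

```python
def apply_watermark(events, delay):
    """Process events in arrival order, dropping those older than the watermark.

    watermark = (max event time seen so far) - delay
    """
    max_seen, accepted, dropped = None, [], []
    for event_time, payload in events:
        if max_seen is not None and event_time < max_seen - delay:
            dropped.append(payload)   # too late: state for this time was already released
        else:
            accepted.append(payload)
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
    return accepted, dropped


# (event_time_in_seconds, payload) in arrival order; "c" arrives very late.
events = [(100, "a"), (130, "b"), (90, "c"), (125, "d")]
accepted, dropped = apply_watermark(events, delay=20)
print(accepted, dropped)
```

In Structured Streaming the delay is declared with withWatermark on the event-time column of each streaming side.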




How do you guarantee exactly-once processing in Databricks?
Use checkpointing, idempotent writes (e.g., to Delta with merge), and transactional writes supported by Delta Lake.



6. Delta Lake Deep-Dive Questions


What is a Delta transaction log? Explain _delta_log.
_delta_log is the directory inside a Delta table that holds the transaction log: one JSON file per commit listing its actions (such as add and remove file), plus periodic Parquet checkpoints.
It is what provides atomicity and versioning.
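
Reading a table at a given version amounts to replaying the log up to that commit. The sketch below does this for a heavily simplified log that contains only add and remove actions; real commits also carry metadata, protocol, and commit-info actions, and the file names here are invented.

```python
import json

# Each commit is one JSON-lines file in _delta_log; contents are simplified.
commits = [
    '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}',
    '{"add": {"path": "part-2.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}\n{"remove": {"path": "part-1.parquet"}}\n'
    '{"add": {"path": "part-3.parquet"}}',  # e.g. a compaction
]


def files_at_version(commit_files, version):
    """Replay commits 0..version and return the data files that are live."""
    live = set()
    for content in commit_files[: version + 1]:
        for line in content.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


print(files_at_version(commits, 1))  # state before the compaction (time travel)
print(files_at_version(commits, 2))  # latest state
```

The removed files still exist on storage after version 2, which is why older versions stay queryable until VACUUM deletes them.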


What is an OPTIMIZE with ZORDER?
OPTIMIZE compacts files; ZORDER reorders data within those files on chosen columns, improving query performance for those columns.
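
The name comes from the Z-order (Morton) curve, which interleaves the bits of several column values so that rows close in every column end up close in the sort order. The following is a generic illustration of that curve for two small integers, not a description of Databricks' internal implementation.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:4])  # the first "file" would hold the 2x2 block nearest the origin
```

If the sorted points are cut into files of four, each file covers a compact square of (x, y) values, so a filter on either column can skip most files.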


What is change data feed (CDF) in Delta Lake?
CDF allows you to read row-level changes (inserts, updates, deletes) between table versions, which is useful for CDC 
(change data capture) patterns.


How does Delta Lake handle late-arriving data?
Use merge logic (e.g., MERGE), watermarking + streaming, or separate staging + deduplication to incorporate late data without corruption.


What happens internally when you run a MERGE INTO?
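Spark joins the source with the target to find matching rows, rewrites the data files that contain matched rows (and writes new files for inserts), and commits the result as a new table version: the transaction log records remove actions for the old files and add actions for the new ones in a single atomic commit.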



7. Security & Governance


How do you use cluster policies?
Define policy rules (e.g., allowed node types, instance pools, autoscaling limits) and enforce them so that users can only spin up clusters that meet compliance/cost guidelines.
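
A policy definition is a JSON document that maps cluster attributes to rules. The sketch below shows three common rule types (allowlist, range, fixed) as a Python dictionary, together with a toy checker. The node type names and the checker are hypothetical; Databricks evaluates policies itself when a cluster is created.

```python
policy = {  # illustrative policy definition; node types are placeholders
    "node_type_id": {"type": "allowlist", "values": ["small-node", "medium-node"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "fixed", "value": 30},
}


def violations(cluster, policy):
    """Return the attributes of a flattened cluster spec that break the policy."""
    bad = []
    for attr, rule in policy.items():
        value = cluster.get(attr)
        if rule["type"] == "allowlist" and value not in rule["values"]:
            bad.append(attr)
        elif rule["type"] == "range" and (value is None or value > rule["maxValue"]):
            bad.append(attr)
        elif rule["type"] == "fixed" and value != rule["value"]:
            bad.append(attr)
    return bad


request = {"node_type_id": "large-node", "autoscale.max_workers": 8, "autotermination_minutes": 30}
print(violations(request, policy))
```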


How does Unity Catalog enforce permission inheritance?
Privileges granted on a catalog or schema are inherited by all current and future child objects (schemas, tables, views) within it, so a grant at a higher level does not need to be repeated on each table.


How do you audit data access in Databricks?
Use Unity Catalog’s lineage and audit logging, or the Databricks audit logs (workspace, clusters, jobs), to monitor who accessed what, when, and how.


What is Credential Passthrough / IAM roles?


Credential Passthrough: forwards user credentials (e.g., via IAM) to the storage layer so that access enforcement happens at the storage level


IAM roles: assign roles (e.g., in AWS IAM / Azure RBAC) so that clusters access storage securely using those roles




How do you secure PII data in Databricks?
Use Unity Catalog for object-level permissions, implement column-level masking or row-level filtering, encrypt sensitive data, and enforce data governance policies.



8. Performance & Optimization


How do you handle slow Spark jobs in Databricks?


Profile jobs using Spark UI


Identify skew, shuffle, or memory spills


Use partitioning, caching, or broadcast joins


Tune shuffle partitions, executors, or cluster size




How do you identify shuffle spill or skew in the Spark UI?


Look at Stage Summary → "Shuffle Read / Write" metrics


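Check the per-task metrics of a stage: non-zero “Spill (Memory)” / “Spill (Disk)” values indicate spill, and a maximum task duration far above the median indicates skew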


Use the DAG and SQL physical plan to spot data skew




Best practices to optimize Delta tables?


Periodically OPTIMIZE


Use ZORDER on frequently filtered columns


Partition appropriately


Clean up with VACUUM


Avoid too many small files




Describe file compaction strategy in Silver layer.
In Silver (cleaned) layer, compact small files periodically (or via automated jobs) so downstream analytics / BI queries run efficiently; often OPTIMIZE + ZORDER.


How do you split large Bronze → Silver → Gold loads?


Bronze: raw ingestion (Auto Loader or batch)


Silver: cleaning, deduplication, data quality checks


Gold: aggregations, business logic, serving layer
Use DLT or scheduled jobs to orchestrate these layers.





9. ML & GenAI (Databricks)


What is MLflow?
MLflow is an open-source platform for managing the ML lifecycle: tracking experiments, packaging code, deploying models, and registering them.


What is Model Serving in Databricks?
You can deploy MLflow models as real-time REST API endpoints (or batch), enabling integration with production systems.


What is Vector Search in Databricks?
Vector Search allows you to store and query embeddings (vectors) efficiently – useful for semantic search, recommendation, and GenAI.


What is DBRX?
DBRX is an open large language model released by Databricks in 2024, built on a mixture-of-experts (MoE) architecture. It is a model rather than a retrieval or vector-search framework, though it can serve as the LLM in retrieval-augmented (RAG) applications.


Explain Feature Store in Databricks.
A managed store for ML features: centralizes feature definitions, ensures consistency between training and inference, supports batch and streaming feature pipelines, and versioning.



10. Scenario-Based Questions


Your Delta table has lots of small files. How do you fix it?
Use OPTIMIZE to compact the files. Optionally ZORDER by commonly filtered columns. Also adjust upstream batching strategy (larger micro-batches or coalescing).


A MERGE operation is running slow. What steps do you take?


Check if source or target is skewed


Ensure predicates are selective


Use broadcast join if appropriate


Partitioning strategy


Possibly reorganize or optimize the target table




Your Silver table is showing corrupted records. How do you handle them?


Use Delta “expectations” in DLT or validate schema


Define quality checks (null checks, regex, ranges)


Route bad data to a separate “quarantine” table for later investigation




Your stream stopped due to schema mismatch. How do you fix it?


Use schema evolution (Delta)


Apply schema enforcement


Use Auto Loader with cloudFiles.schemaLocation to track schema changes


Validate incoming schema before ingest




How do you enforce data quality rules in Databricks?


Use Delta Live Tables (DLT) expectations


Implement custom validation in notebooks or jobs


Use monitoring, alerting on failed records


Create data contracts with producers





11. Behavioral / HR Questions


How did you use Databricks in your last project?
Sample answer: “I built a Bronze-Silver-Gold pipeline using Auto Loader, Delta Live Tables, and MLflow. We ingested raw data from S3, cleaned and transformed in Silver, then exposed Gold tables to BI and ML teams.”


What part of Databricks do you use the most?
“I primarily use Delta Lake and Databricks SQL for building reliable ETL pipelines and serving tables. I also use notebooks for ad hoc data exploration and job orchestration.”


What’s the biggest optimization you did in Databricks?
“I optimized a job by compacting small files using OPTIMIZE + ZORDER, which reduced query latency by 60%. I also tuned the cluster size and shuffle partitions, which cut costs by 30%.”


Explain a time you debugged a slow Spark job.
“We had a Spark job with a long shuffle stage. I used the Spark UI to analyze task durations, found data skew on a join key, applied salting to that key, and repartitioned data — performance improved dramatically.”


Describe a Databricks architecture you designed.
“I designed a lakehouse architecture: raw data from S3 landed in a Bronze Delta table via Auto Loader; then Delta Live Tables cleaned and transformed data into Silver; Gold tables served BI dashboards in Databricks SQL. Used Unity Catalog to manage permissions, and scheduled jobs using the Jobs UI. For ML, I used MLflow to track experiments and deploy models.”




In [None]:
What is the difference between Auto Loader and spark.readStream?

spark.readStream:
General Structured Streaming: spark.readStream is the fundamental API for Apache Spark Structured Streaming. It allows you to read data from various streaming sources like Kafka, Kinesis, and filesystems (including cloud storage).
Flexibility: It offers broad flexibility in configuring how you read and process streaming data, including defining schema, handling watermarks, and specifying output modes.
File Source Limitations: When used directly with file sources, spark.readStream can be less efficient for discovering new files, especially in very large and dynamic directories. It typically relies on directory listing, which can be resource-intensive.
Auto Loader:
Optimized for Cloud Object Storage: Auto Loader is a Databricks-specific feature built on top of spark.readStream that is specifically optimized for incrementally and efficiently ingesting files from cloud object storage (e.g., S3, ADLS, GCS).
Scalable File Discovery: It offers highly scalable file discovery mechanisms, including file notification mode (using cloud services like SQS or Event Grid) and optimized directory listing, making it performant even with billions of files.
Schema Inference and Evolution: Auto Loader automatically infers schema and provides robust mechanisms for handling schema evolution, including detecting changes and rescuing data that might otherwise be lost due to schema mismatches.
Cost Efficiency: By leveraging native cloud APIs and file notification services, Auto Loader can significantly reduce the cost of file discovery compared to traditional directory listing.
Exactly-Once Guarantees: It ensures exactly-once processing guarantees for data ingested from cloud storage.
Key Differences Summarized:
Purpose: spark.readStream is a general streaming API; Auto Loader is specifically designed for efficient, scalable, and cost-effective ingestion of files from cloud object storage.
File Discovery: Auto Loader offers superior file discovery mechanisms (file notification, optimized directory listing) compared to spark.readStream's standard directory listing for file sources.
Schema Handling: Auto Loader provides built-in schema inference and evolution capabilities, which are not directly part of the base spark.readStream API.
Cost: Auto Loader can be more cost-efficient for cloud storage ingestion due to its optimized file discovery methods.
In essence, Auto Loader can be considered a specialized and enhanced version of spark.readStream for handling file-based streaming ingestion from cloud storage, providing significant advantages in terms of scalability, performance, cost, and schema management.

In [None]:
What is the difference between Auto Loader and COPY INTO?

In [None]:
* **Scale:** Auto Loader excels at handling large volumes (millions+) of continuously arriving files, while COPY INTO is better for smaller, batch-oriented loads (thousands of files).
* **Ingestion Type:** Auto Loader is for streaming/incremental ingestion; COPY INTO is for batch/ad hoc ingestion.
* **Schema Evolution:** Auto Loader infers the schema and tracks changes automatically, whereas COPY INTO needs schema changes to be handled explicitly (for example with its mergeSchema option).
* **Complexity:** Auto Loader, being built on Structured Streaming, can be more complex to configure for advanced scenarios, while COPY INTO offers simpler syntax for basic loads.
* **Cost Efficiency:** Auto Loader, especially with file notification mode, can be more cost-effective for large-scale, continuous ingestion by optimizing file discovery.

1. What is the difference between Spark DataFrame and Pandas DataFrame?
|Feature|Spark DataFrame|Pandas DataFrame|
|-------|---------------|----------------|
|Engine|Distributed (cluster)|Single machine (local memory)|
|Memory Limit|Scales to TBs|Limited by RAM|
|Speed|Fast for large data; parallel|Fast for small data|
|Lazy Evaluation|Yes|No|
|Fault Tolerant|Yes (RDD lineage)|No|
|Joins/Operations|Distributed|In-memory|
|API Style|SQL-like|Python-native|

**Summary:**  
Pandas is ideal for small datasets; Spark DataFrames are built for big data, distributed processing, and fault tolerance.

2. What is Catalyst Optimizer and Tungsten?
**Catalyst Optimizer**

**Spark SQL’s query optimizer.**

It performs:

* Logical plan analysis
* Predicate pushdown
* Constant folding
* Column pruning
* Join reordering
* Physical plan selection

**Catalyst ensures optimized SQL/DataFrame execution.**

**Tungsten**

Spark’s execution engine for efficient memory and CPU usage.

Includes:

* Off-heap memory management
* Cache-friendly binary format
* Whole-stage code generation (Java bytecode)
* Better CPU utilization

Catalyst = Optimizer, Tungsten = Execution engine.

3. Explain narrow vs wide transformations in Spark.
**Narrow Transformations**

Each partition depends on a single parent partition.

Examples:

* map
* filter
* flatMap
* mapPartitions

**Characteristics:**
* No shuffle
* Fast
* Pipelineable

**Wide Transformations**

Data from multiple partitions is required.

Examples:

* groupByKey
* reduceByKey
* join
* distinct
* repartition

**Characteristics:**

* Causes shuffle
* Expensive

**4. How do you handle skew in Spark?**

Data skew happens when one key has many more records than others.

**Ways to handle:**

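* Salting keys (append a random bucket number to the hot key; 8 buckets here is an arbitrary choice):

df.withColumn("key_salted", concat(col("key"), lit("_"), floor(rand() * 8).cast("string")))

* Broadcast smaller table
* Use AQE (Adaptive Query Execution), which includes automatic skew join optimization
* Use reduceByKey instead of groupByKey
* Increase shuffle partitions
* Split heavy keys manually
* Use skew hints (Databricks-specific; the hint names the relation and, optionally, the skewed column):

SELECT /*+ SKEW('table', 'key') */ * FROM table;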

5. What causes shuffle in Spark?

Shuffle happens when Spark redistributes data across nodes due to operations like:

* groupBy
* join
* distinct
* orderBy
* repartition
* reduceByKey
* aggregateByKey
* coalesce (only when called with shuffle = true on an RDD; DataFrame coalesce avoids a shuffle)

Reason: data needs to be grouped by key or sorted across partitions.

6. How do you reduce shuffle in Spark applications?

* Use map-side operations
* Use reduceByKey instead of groupByKey
* Broadcast small tables
* Use partition pruning
* Use bucketing or ZORDER
* Filter early (predicate pushdown)
* Cache reused datasets
* Use AQE to optimize shuffle partitions automatically
* Avoid unnecessary repartition()

7. What is broadcast join? When should you use it?

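Broadcast join sends the small table to all executors so the large table does not have to be shuffled.

Example: df_large.join(broadcast(df_small), "key"), where df_large and "key" are placeholder names

**When to use:**

* Smaller table is below roughly 10–20 MB (spark.sql.autoBroadcastJoinThreshold defaults to 10 MB)
* Star-schema joins
* Fact table joining with small dimension table
* Repeated joins in ETL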

**Benefit:**

* No shuffle
* Extremely fast

8. What is AQE (Adaptive Query Execution)?

* AQE adjusts the query plan at runtime based on real statistics.

**Features:**

* Dynamic shuffle partition coalescing
* Automatic skew join handling
* Automatic broadcast join selection

9. What are Spark UDF, Pandas UDF, and SQL functions?
**Spark UDF (Standard UDF)**

* Executes row-by-row in Python/Java
* Slow (serialization overhead)
* Should be avoided if possible
    
**Pandas UDF (Vectorized UDF)**
* Uses Apache Arrow
* Processes data in batches
* Much faster than normal UDF
* Suitable for ML, custom transformations

**SQL functions (built-in)**

* Fastest (highly optimized)
* Runs inside Catalyst/Tungsten
* Prefer built-in functions over UDFs

10. What is checkpointing vs. caching in Spark?
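**Caching**

**Purpose: Improve performance**

* Stores data in memory/disk
* Used for iterative algorithms and for datasets reused across several actions
* Cached blocks are lost if an executor fails and are then recomputed from lineage
* Not suitable for streaming fault tolerance
* Example: df.cache()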

**Checkpointing**

**Purpose: Fault tolerance**

* Breaks RDD lineage
* Saves data to reliable storage (HDFS/DBFS)
* Used in streaming and in iterative algorithms with very long lineage (e.g., machine learning)
* Overhead is higher