Great! Below are **10 structured PySpark practice questions** with:

* ✅ Question
* 📥 Input Data (Schema & Sample Data)
* 📤 Expected Output (Result explanation)

We'll continue with more once you complete these.

---

### **1. Create a DataFrame from a List of Tuples**

**✅ Question:** Create a DataFrame with schema `id`, `name`, `age` from a list.

**📥 Input:**

```python
data = [(1, "Navin", 28), (2, "Priya", 26), (3, "Amit", 17)]
schema = ["id", "name", "age"]
```

**📤 Output:**

```
+---+-----+---+
|id |name |age|
+---+-----+---+
|1  |Navin|28 |
|2  |Priya|26 |
|3  |Amit |17 |
+---+-----+---+
```

---

### **2. Filter Data Where Age > 25**

**✅ Question:** Filter records with `age > 25`.

**📥 Input:** Same as above

**📤 Output:**

```
+---+-----+---+
|id |name |age|
+---+-----+---+
|1  |Navin|28 |
|2  |Priya|26 |
+---+-----+---+
```

---

### **3. Add Column `is_adult` (True if age >= 18)**

**✅ Question:** Add a boolean column `is_adult` that is true when `age >= 18`.

**📥 Input:** Same as Question 1

**📤 Output:**

```
+---+-----+---+--------+
|id |name |age|is_adult|
+---+-----+---+--------+
|1  |Navin|28 |true    |
|2  |Priya|26 |true    |
|3  |Amit |17 |false   |
+---+-----+---+--------+
```

---

### **4. Group by Department and Calculate Avg Salary**

**✅ Question:** Group employees by `department` and get average salary.

**📥 Input:**

```python
data = [
  (1, "Navin", "IT", 60000),
  (2, "Priya", "HR", 50000),
  (3, "Amit", "IT", 80000),
  (4, "Sara", "HR", 55000)
]
schema = ["id", "name", "department", "salary"]
```

**📤 Output:**

```
+----------+-------------+
|department|avg(salary)  |
+----------+-------------+
|IT        |70000.0      |
|HR        |52500.0      |
+----------+-------------+
```

---

### **5. Rename Column `name` to `employee_name`**

**✅ Question:** Rename column `name` to `employee_name`.

**📥 Input:** Same as Question 4

**📤 Output:**

```
+---+--------------+----------+------+
|id |employee_name |department|salary|
+---+--------------+----------+------+
```

---

### **6. Sort by Salary Descending**

**✅ Question:** Sort employees by salary in descending order.

**📥 Input:** Same as Question 4

**📤 Output:**

```
+---+-----+----------+------+
|id |name |department|salary|
+---+-----+----------+------+
|3  |Amit |IT        |80000 |
|1  |Navin|IT        |60000 |
|4  |Sara |HR        |55000 |
|2  |Priya|HR        |50000 |
+---+-----+----------+------+
```

---

### **7. Add Row Number by Department using Window**

**✅ Question:** Use a window function to add a row number within each `department`, ordered by salary descending.

**📥 Input:** Same as Question 4

**📤 Output:**

```
+---+-----+----------+------+----------+
|id |name |department|salary|row_number|
+---+-----+----------+------+----------+
|3  |Amit |IT        |80000 |1         |
|1  |Navin|IT        |60000 |2         |
|4  |Sara |HR        |55000 |1         |
|2  |Priya|HR        |50000 |2         |
+---+-----+----------+------+----------+
```

---

### **8. Join Employees with Department Table**

**✅ Question:** Join the employees DataFrame with the `departments` DataFrame on `dept_id`.

**📥 Input Employees:**

```python
emp_data = [(1, "Navin", 101), (2, "Priya", 102)]
emp_schema = ["id", "name", "dept_id"]
```

**📥 Input Departments:**

```python
dept_data = [(101, "IT"), (102, "HR")]
dept_schema = ["dept_id", "dept_name"]
```

**📤 Output:**

```
+---+-----+--------+---------+
|id |name |dept_id |dept_name|
+---+-----+--------+---------+
|1  |Navin|101     |IT       |
|2  |Priya|102     |HR       |
+---+-----+--------+---------+
```

---

### **9. Replace Null Bonus with 0**

**✅ Question:** Replace null values in `bonus` column with 0.

**📥 Input:**

```python
data = [(1, "Navin", 1000), (2, "Priya", None), (3, "Amit", 500)]
schema = ["id", "name", "bonus"]
```

**📤 Output:**

```
+---+-----+-----+
|id |name |bonus|
+---+-----+-----+
|1  |Navin|1000 |
|2  |Priya|0    |
|3  |Amit |500  |
+---+-----+-----+
```

---

### **10. Write DataFrame to Parquet**

**✅ Question:** Save a DataFrame as Parquet in `/tmp/output/`.

**📤 Output:** A directory `/tmp/output/` containing one or more `part-*.parquet` files (one per partition of the DataFrame) and a `_SUCCESS` marker file.

---

Would you like me to continue with more **Level 3+ questions** (UDFs, JSON, Explode, Aggregations)? I can also give you a ready-to-run notebook version or code examples for these.


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("pyspark-practice-01").getOrCreate()


In [2]:
spark

In [4]:
# Q1: create a DataFrame from a list of tuples

data = [(1, "Navin", 28), (2, "Priya", 26), (3, "Amit", 17)]
schema = ["id", "name", "age"]
spark.createDataFrame(data, schema).show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Navin| 28|
|  2|Priya| 26|
|  3| Amit| 17|
+---+-----+---+



In [None]:
# Q2: filter rows where age > 25

df = spark.createDataFrame(data, schema)
df.filter("age > 25").show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Navin| 28|
|  2|Priya| 26|
+---+-----+---+



In [6]:
# Q3: add column `is_adult` (true if age >= 18)

# The comparison already yields a boolean column, so when/otherwise is not needed.
df.withColumn("is_adult", col("age") >= 18).show()

+---+-----+---+--------+
| id| name|age|is_adult|
+---+-----+---+--------+
|  1|Navin| 28|    true|
|  2|Priya| 26|    true|
|  3| Amit| 17|   false|
+---+-----+---+--------+



In [9]:
# Q4: group by department and calculate the average salary

data1 = [
  (1, "Navin", "IT", 60000),
  (2, "Priya", "HR", 50000),
  (3, "Amit", "IT", 80000),
  (4, "Sara", "HR", 55000)
]
schema1 = ["id", "name", "department", "salary"]

df1 = spark.createDataFrame(data1, schema1)
# Average only the salary column; a bare .avg() would also average id.
# DataFrame.alias() names the DataFrame, not the column, so it is dropped here.
df2 = df1.groupBy("department").avg("salary")
df2.show()

+----------+-----------+
|department|avg(salary)|
+----------+-----------+
|        IT|    70000.0|
|        HR|    52500.0|
+----------+-----------+

