**Core DataFrame & Column Skills (Basic)**

Q1: Create a DataFrame with schema enforcement

Problem: Create a DataFrame with user info ensuring proper types.

Constraints: Enforce an explicit schema; every value must match its declared type.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35)
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

In [0]:

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  2|    Bob| 25|
|  3|Charlie| 35|
+---+-------+---+


Q2: Add a literal column
Problem: Add a column country with value "USA" for all rows.

Expected Output:

In [0]:
+---+-------+---+-------+
| id|   name|age|country|
+---+-------+---+-------+
|  1|  Alice| 30|    USA|
|  2|    Bob| 25|    USA|
|  3|Charlie| 35|    USA|
+---+-------+---+-------+

Q3: Rename a column
Problem: Rename column name to full_name.

Expected Output:

In [0]:
+---+---------+---+
| id|full_name|age|
+---+---------+---+
|  1|    Alice| 30|
|  2|      Bob| 25|
|  3|  Charlie| 35|
+---+---------+---+


Q4: Use col and arithmetic operations
Problem: Create a column age_in_5_years = age + 5.

Expected Output:

In [0]:
+---+-------+---+--------------+
| id|   name|age|age_in_5_years|
+---+-------+---+--------------+
|  1|  Alice| 30|            35|
|  2|    Bob| 25|            30|
|  3|Charlie| 35|            40|
+---+-------+---+--------------+


Q5: Handle null values with fillna
Problem: Replace null values in age with 0.

In [0]:
data_with_null = [
    (1, "Alice", None),
    (2, "Bob", 25)
]

df_null = spark.createDataFrame(data_with_null, ["id", "name", "age"])
df_null.fillna({"age": 0}).show()


In [0]:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice|  0|
|  2|  Bob| 25|
+---+-----+---+


Q6: Conditional column with when / otherwise
Problem: Add age_group column: "Adult" if age >= 30 else "Young".

Expected Output:

In [0]:
+---+-------+---+---------+
| id|   name|age|age_group|
+---+-------+---+---------+
|  1|  Alice| 30|    Adult|
|  2|    Bob| 25|    Young|
|  3|Charlie| 35|    Adult|
+---+-------+---+---------+


Q7: Use expr for string concatenation
Problem: Create user_label = "User_" + name.

Expected Output:

In [0]:
+---+-------+---+------------+
| id|   name|age|  user_label|
+---+-------+---+------------+
|  1|  Alice| 30|  User_Alice|
|  2|    Bob| 25|    User_Bob|
|  3|Charlie| 35|User_Charlie|
+---+-------+---+------------+


Q8: Use coalesce to handle multiple null columns
Problem: Create value as the first non-null of col1 and col2, falling back to "Unknown": coalesce(col1, col2, lit("Unknown")).

In [0]:
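from pyspark.sql.functions import coalesce, col, lit

data_multi_null = [
    (1, None, None),
    (2, None, "ValueB"),
    (3, "ValueA", None)
]
df_multi_null = spark.createDataFrame(data_multi_null, ["id", "col1", "col2"])

# coalesce returns its first non-null argument, so "Unknown" is used only when both columns are null
df_multi_null.select("id", coalesce(col("col1"), col("col2"), lit("Unknown")).alias("value")).show()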


In [0]:
+---+-------+
| id|  value|
+---+-------+
|  1|Unknown|
|  2| ValueB|
|  3| ValueA|
+---+-------+


Q9: Drop rows with null values
Problem: Remove all rows in which age is null (using df_null from Q5).

Expected Output:

In [0]:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|  Bob| 25|
+---+-----+---+


Q10: Select and alias multiple columns
Problem: Select id and name as username.

Expected Output:

In [0]:
+---+--------+
| id|username|
+---+--------+
|  1|   Alice|
|  2|     Bob|
|  3| Charlie|
+---+--------+


Q11: Chain multiple column operations
Problem: Add age_in_10 = age + 10, is_senior = age >= 60, greeting = "Hello " + name.

Expected Output:

In [0]:
+---+-------+---+---------+---------+-------------+
| id|   name|age|age_in_10|is_senior|     greeting|
+---+-------+---+---------+---------+-------------+
|  1|  Alice| 30|       40|    false|  Hello Alice|
|  2|    Bob| 25|       35|    false|    Hello Bob|
|  3|Charlie| 35|       45|    false|Hello Charlie|
+---+-------+---+---------+---------+-------------+


Q12: Use expr for conditional math
Problem: bonus = 1000 if age > 30 else 500.

Expected Output:

In [0]:
+---+-------+---+-----+
| id|   name|age|bonus|
+---+-------+---+-----+
|  1|  Alice| 30|  500|
|  2|    Bob| 25|  500|
|  3|Charlie| 35| 1000|
+---+-------+---+-----+


Q13: Handle nested nulls with coalesce and when
Problem: Create final_value = col1 if not null else col2 if not null else "N/A".

Expected Output:

In [0]:
+---+----+----+-----------+
| id|col1|col2|final_value|
+---+----+----+-----------+
|  1|null|null|        N/A|
|  2|null|ValB|       ValB|
|  3|ValA|null|       ValA|
+---+----+----+-----------+


Q14: Add multiple literal columns with different types
Problem: Add country="USA", score=100, is_active=True.

Expected Output:

In [0]:
+---+-----+---+-------+-----+---------+
| id| name|age|country|score|is_active|
+---+-----+---+-------+-----+---------+
|  1|Alice| 30|    USA|  100|     true|
|  2|  Bob| 25|    USA|  100|     true|
+---+-----+---+-------+-----+---------+


Q15: Complex column expression with multiple operations
Problem: Create status = "Senior" if age > 30 and score > 90, else "Junior".

Expected Output:

In [0]:
+---+-------+---+-----+------+
| id|   name|age|score|status|
+---+-------+---+-----+------+
|  1|  Alice| 30|  100|Junior|
|  2|    Bob| 25|  100|Junior|
|  3|Charlie| 35|   95|Senior|
+---+-------+---+-----+------+
