Basic to Advanced PySpark Queries

# DATA READING  

In [0]:
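# Read data from CSV (do not chain .display() here: it returns None, so df_csv would not be a DataFrame)
df_csv = spark.read.csv('/FileStore/tables/employees.csv', inferSchema=True, header=True)
display(df_csv.limit(5))        # Databricks: render the first 5 rows as a table
df_csv.display()                # Alternative: render the whole DataFrame
df_csv.show(5)                  # Prints the first 5 rows as text; returns None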



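# inferSchema=True makes Spark scan the CSV file and detect each column's data type automatically.

# JSON is read the same way, but its schema is always inferred: inferSchema and header are CSV options and are not accepted by json()
df_json = spark.read.json('/FileStore/tables/employees.json')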

## Data Reading Utils

In [0]:
# List the CSV/JSON files uploaded to Databricks
dbutils.fs.ls('dbfs:/FileStore/tables/')

Out[6]: [FileInfo(path='dbfs:/FileStore/tables/departments.csv', name='departments.csv', size=687, modificationTime=1738674464000),
 FileInfo(path='dbfs:/FileStore/tables/emp.json', name='emp.json', size=752, modificationTime=1739564468000),
 FileInfo(path='dbfs:/FileStore/tables/employees.csv', name='employees.csv', size=7920, modificationTime=1738674451000)]

**Delete a file from the location**

In [0]:
#Delete a Single File
dbutils.fs.rm("dbfs:/FileStore/tables/employees-1.csv")


# Delete all files in the folder (recurse=True removes the folder's contents recursively)
# dbutils.fs.rm("dbfs:/FileStore/tables/", recurse=True)


Out[5]: True

**Schema Definition**

In [0]:

# Print the DataFrame schema (column names, data types, nullability)
df_csv.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- job_id: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- commission_pct: double (nullable = true)
 |-- manager_id: integer (nullable = true)
 |-- department_id: integer (nullable = true)



# Select query 

In [0]:
from pyspark.sql.functions import col

df_csv.display()   # Shows all rows (Databricks)

df_csv.show(5)      # Alternative: print the first 5 rows

df_csv.select('first_name','salary','department_id').show(5)    # Select a few columns by name; plain string names cannot be renamed here

df_csv.select(col('first_name').alias('Name'),col('salary'),col('department_id')).show(5)    # Select a few columns and rename one with alias()

## Select with filter condition
The filter transformation in PySpark allows you to select rows from a DataFrame based on a condition. 

### 🚀 Summary of `filter()` Use Cases  

| Use Case               | Example  |
|------------------------|----------|
| **Logical operators** | `(condition1) & (condition2)` (AND), `(condition1) \| (condition2)` (OR), `~(condition)` (NOT) |
| **Equality** | `col("column_name") == value` |
| **Inequality** | `col("column_name") != value` |
| **Greater than** | `col("column_name") > value` |
| **Less than** | `col("column_name") < value` |
| **Greater than or equal to** | `col("column_name") >= value` |
| **Less than or equal to** | `col("column_name") <= value` |
| **NULL values** | `col("column_name").isNull()` |
| **Non-NULL values** | `col("column_name").isNotNull()` |
| **Membership** | `col("column_name").isin(value1, value2)` |
| **SQL expression string** | `filter("department_id IS NOT NULL AND commission_pct IS NULL")` |


In [0]:
from pyspark.sql.functions import col

df_csv.filter(col('department_id')==60).show(10)

df_csv.filter(col('department_id').isin(60,80)).show(5)

df_csv.filter(~col('department_id').isin(60,80)).show()    # NOT IN: negate isin() with ~

df_csv.filter((col('department_id').isNotNull()) & (col('commission_pct').isNull())).show(5)

df_csv.filter("department_id IS NOT NULL AND commission_pct IS NULL").show(5)  # Alternative: SQL expression string

# withColumn
The withColumn() function in PySpark adds a new column or replaces an existing column of the same name, and returns a new DataFrame.

### 🚀 Summary of `withColumn()` Use Cases  

| Use Case               | Example  |
|------------------------|----------|
| **Add a new column**   | `withColumn("bonus", lit(500))` |
| **Modify existing column** | `withColumn("salary", col("salary") * 1.1)` |
| **Rename column** | `withColumn("emp_name", col("first_name")).drop("first_name")` |
| **Change data type** | `withColumn("department_id", col("department_id").cast("string"))` |
| **String transformations** | `withColumn("first_name_upper", upper(col("first_name")))` |
| **Replace values** | `withColumn("job_id", regexp_replace(col("job_id"), "FI_ACCOUNT", "ACNT"))` |
| **Handle NULL values** | `withColumn("commission_pct", coalesce(col("commission_pct"), lit(0)))` |
| **Extract date parts** | `withColumn("hire_year", year(col("hire_date")))` |
| **Conditional logic** | `withColumn("salary_category", when(col("salary") > 60000, "High").otherwise("Low"))` |
| **Mathematical operations** | `withColumn("yearly_salary", col("salary") * 12)` |


In [0]:
from pyspark.sql.functions import col, lit, coalesce, round


df_csv.withColumn("bonus", lit(500)).show(5)  # Adds a new column with a fixed value

df_csv.withColumn(
    "Full_sal", round((col("salary")) * (1 + coalesce(col("commission_pct"), lit(0))), 2) # Round to 2 decimal places
).display()


df_new = df_csv.withColumn("department_id", col("department_id").cast("string"))
df_new.printSchema()          # department_id is now reported as string



## StructType() Schema

**Importing libraries**

In [0]:
from pyspark.sql.types import * 
from pyspark.sql.functions import *

# Sorting ascending / descending

In [0]:
from pyspark.sql.types import * 
from pyspark.sql.functions import *

#   Method 1: column expressions
df_csv.sort(col('department_id').desc()).show(5)

df_csv.orderBy(col("salary").desc()).show(5)  # Alternative: orderBy() is an alias of sort()

df_csv.sort(col('department_id').asc()).show(5) # Ascending order


#   Method 2: two or more columns
df_csv.sort(['department_id','salary'],ascending = [1,0]).show(10)     # 1 (True) = ascending, 0 (False) = descending, matched to the columns by position

df_csv.sort(['first_name'],ascending = [1]).show(5)

In [0]:
n.printSchema()

root
 |-- employee_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- job_id: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- commission_pct: double (nullable = true)
 |-- manager_id: integer (nullable = true)
 |-- department_id: integer (nullable = true)


