# 01 - DataFrame Basics

This notebook provides a hands-on introduction to PySpark DataFrames. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python (pandas), but with richer optimizations under the hood.

## 1. Setup

In [None]:
# Install required packages if needed (uncomment to run)
# !pip install pyspark
# !pip install findspark
# !pip install pandas

In [None]:
import findspark
findspark.init()

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, avg

# Initialize SparkSession
spark = SparkSession.builder.appName("DataFrameBasics").getOrCreate()

## 2. Creating DataFrames

There are several ways to create DataFrames in PySpark:

- From a list of rows
- From an RDD
- From an external data source (e.g., CSV, JSON, Parquet)

### 2.1 Creating a DataFrame from a List

In [None]:
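# Sample data as a list of tuples; None (not "") marks a missing name part,
# so Spark stores it as null and the null-handling cells below have rows to act on
data = [
    ("James", None, "Smith", "1991-04-01", "M", 3000),
    ("Michael", "Rose", None, "2000-05-19", "M", 4000),
    ("Robert", None, "Williams", "1978-09-05", "M", 4000),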
    ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),
    ("Jen", "Mary", "Brown", "1980-02-17", "F", -1),
]

# Define column names
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]

# Create DataFrame
df = spark.createDataFrame(data=data, schema=columns)

### 2.2 Creating a DataFrame from an RDD

In [None]:
# Create an RDD from the data list
rdd = spark.sparkContext.parallelize(data)
df_from_rdd = spark.createDataFrame(rdd, schema=columns)

### 2.3 Creating a DataFrame from a CSV file (External Data Source)

Download the California Housing Prices dataset from Kaggle:
https://www.kaggle.com/datasets/camnugent/california-housing-prices

In [None]:
# Update this path to your file location
file_path = "housing.csv"

# Create a DataFrame from the CSV file (uncomment once housing.csv has been downloaded)
# df_housing = spark.read.csv(file_path, header=True, inferSchema=True)

## 3. Basic DataFrame Operations

### 3.1 `printSchema()`

Prints the schema of the DataFrame as a tree: each column's name, data type, and nullability.

In [None]:
df.printSchema()

### 3.2 `show()`

Displays the first 20 rows of the DataFrame by default and truncates cell values longer than 20 characters.

In [None]:
df.show()

### 3.3 `select()`

Selects specific columns from the DataFrame.

In [None]:
df.select("firstname", "lastname").show()

### 3.4 `filter()`

Keeps only the rows for which the condition is true. `where()` is an alias for `filter()`.

In [None]:
df.filter(df.salary >= 4000).show()

### 3.5 `withColumn()`

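Adds a new column or replaces an existing one. DataFrames are immutable, so `withColumn()` returns a new DataFrame; assign the result to keep the change.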

In [None]:
df = df.withColumn("salary_increased", col("salary") * lit(1.1))  # Increase salary by 10%
df.show()

### 3.6 `withColumnRenamed()`

Renames an existing column.

In [None]:
df = df.withColumnRenamed("salary_increased", "new_salary")
df.show()

## 4. Handling Null Values

### 4.1 Identifying Null Values

In [None]:
df.filter(col("middlename").isNull()).show()

### 4.2 Filling Null Values

In [None]:
df_filled = df.na.fill("Unknown", subset=["middlename"])
df_filled.show()

### 4.3 Dropping Rows with Null Values

In [None]:
df_dropped = df.na.drop(subset=["middlename"])
df_dropped.show()

## 5. Conditional Logic with `when()` and `otherwise()`

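Create new columns based on conditions. `when()` clauses are evaluated in order and the first match wins; `otherwise()` supplies the value for rows that match none of them.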

In [None]:
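df = df.withColumn(
    "salary_grade",
    when(col("new_salary") >= 4000, "High")
    .when(col("new_salary") < 4000, "Medium")
    # The two conditions above cover every non-null value,
    # so otherwise() applies only to rows where new_salary is null
    .otherwise("Low"),
)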
df.show()

## 6. Conclusion

This notebook covered the basics of creating and manipulating PySpark DataFrames, including:

- Creating DataFrames from various sources
- Performing basic operations like `printSchema()`, `show()`, `select()`, `filter()`, `withColumn()`, and `withColumnRenamed()`
- Handling null values
- Using conditional logic with `when()` and `otherwise()`

In [None]:
# Stop Spark Session
spark.stop()