# 02: Basic Transformations in PySpark

This notebook covers the following:
|#|Basics|
|--|---|
|1|Filtering Columns|
|2|Showing Columns|
|3|Showing head of dataframe|
|4|Filtering Columns|
|5|Dropping Columns|


### Imports

In [2]:
# Imports
import pyspark
import numpy as np
import pandas as pd
import os
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark = SparkSession.builder.appName("Practice").getOrCreate()


### Loading the DataFrame with a Predefined Schema

In [3]:
# This needs a new import. The data types you'll use are listed after 'import'.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Example schema definition (you need to adjust this to your actual CSV columns)
schema = StructType([
    StructField("index", IntegerType(), True),
    StructField("airline", StringType(), True),
    StructField("flight", StringType(), True),
    StructField("source_city", StringType(), True),
    StructField("departure_time", StringType(), True),
    StructField("stops", StringType(), True),
    StructField("arrival_time", StringType(), True),
    StructField("destination_city", StringType(), True),
    StructField("class", StringType(), True),
    StructField("duration", DoubleType(), True),
    StructField("days_left", IntegerType(), True),
    StructField("price", IntegerType(), True),
])

# Load with predefined schema
df = spark.read.option("header", "true").schema(schema).csv("./datasets/airlines_flights_data.csv")

# Printing Schema
df.printSchema()

# If values start showing as null, it's probably because the schema skips a CSV column
# or declares a type that does not match the data.

root
 |-- index: integer (nullable = true)
 |-- airline: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- source_city: string (nullable = true)
 |-- departure_time: string (nullable = true)
 |-- stops: string (nullable = true)
 |-- arrival_time: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- class: string (nullable = true)
 |-- duration: double (nullable = true)
 |-- days_left: integer (nullable = true)
 |-- price: integer (nullable = true)



---
## Handling Missing Data

#### Dropping all missing rows

In [8]:
# Drop every row that contains at least one null (the default is how='any')
df.na.drop().show(2)

+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|       Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|     Morning|          Mumbai|Economy|    2.33|        1| 5953|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
only showing top 2 rows


In [11]:
# Dropping nulls -- drop a row if it contains any nulls.
df.na.drop(how='any').show(2)

# Dropping nulls -- keep only rows that have at least `thresh` non-null values.
# When thresh is set it overrides `how`, so how='any' has no effect here.
df.na.drop(how='any', thresh=2).show()

+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|       Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|     Morning|          Mumbai|Economy|    2.33|        1| 5953|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
only showing top 2 rows
+-----+---------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|index|  airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+---

In [10]:
# Dropping nulls -- drop a row only if all of its values are null.
df.na.drop(how='all').show(2)


+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|       Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|     Morning|          Mumbai|Economy|    2.33|        1| 5953|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
only showing top 2 rows
