# PySpark DataFrames - Part 1

In this notebook, we will learn how to work with PySpark DataFrames. We will see how to create DataFrames, perform operations on them, and save them.

**A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database**. It is part of the PySpark library, and can be created using various functions in SparkSession.

DataFrames in PySpark are conceptually equivalent to pandas DataFrames, but they are distributed across a cluster and designed to handle large-scale data processing.

To understand PySpark DataFrames, it helps to first understand RDDs, the building blocks of PySpark that we explored in the previous module. **PySpark DataFrames are built on top of RDDs**, but they provide a more user-friendly API for manipulating data. Here are three key features of PySpark DataFrames that distinguish them from RDDs:

- **Schema information**: Unlike RDDs, DataFrames have schema information, meaning they have named columns and defined data types.
- **Ease of use**: DataFrames provide high-level methods and operations similar to those found in SQL and pandas, making them more user-friendly.
- **Flexibility**: DataFrames are more flexible and expressive than RDDs. They allow you to perform complex operations with less code.

For a more detailed comparison between RDDs and DataFrames, please go back to the notebook `3-PySpark-RDDs` of the previous module.

Once created, DataFrames can be manipulated using the various methods that we'll explore in this notebook.

***Note:*** As we've seen in the previous notebooks, Databricks cluster initialization automatically creates a SparkSession object called `spark`, which we can use to create DataFrames.

Let's start by exploring the different data sources that can be used to create a DataFrame.


## Data Sources

### RDDs

You can create a DataFrame from an existing RDD. Here's an example of how to create a DataFrame from an RDD of tuples:

In [None]:
# Create RDD using spark context
rdd = sc.parallelize([(1, "John"), (2, "Isabel"), (3, "Bart")])

# Create DataFrame from RDD
df = spark.createDataFrame(rdd, ["ID", "Name"])

DataFrames also have the lazy evaluation property as they are built on top of RDDs. This means that transformations on DataFrames are not computed until an action is executed.

In the previous cell we created a DataFrame, but no action was executed. In the next cell we'll call an action to visualize it.

In [None]:
df.display()
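As a rough analogy only (this is plain Python, not Spark), a `map` pipeline behaves in a similar way: building it does no work, and the work happens only when a result is requested. The function `add_one` and the `calls` list are made-up names for illustration.

```python
# Plain-Python analogy for lazy evaluation; no Spark involved.
calls = []  # records every time the "transformation" actually runs


def add_one(x):
    calls.append(x)
    return x + 1


# "Transformation": map() only describes the work, nothing runs yet.
pipeline = map(add_one, [1, 2, 3])
calls_before_action = len(calls)

# "Action": materializing the result forces the computation.
result = list(pipeline)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

In Spark the same idea applies at cluster scale: transformations only extend the execution plan, and an action such as `display()` or `count()` runs it.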

### List of Tuples

You can create a DataFrame from a list of tuples, where each tuple represents a row. Then, you can specify the column names separately when creating the DataFrame.

Here's an example:

In [None]:
data = [(1, "John"), (2, "Isabel"), (3, "Bart")]

df = spark.createDataFrame(data, ["ID", "Name"])

df.display()

### List of Dictionaries

As in pandas, you can create a DataFrame from a list of dictionaries. Each dictionary represents a row that maps the column names to the values.

Here's an example:

In [None]:
data = [{"ID": 1, "Name": "John"}, {"ID": 2, "Name": "Isabel"}, {"ID": 3, "Name": "Bart"}]

df = spark.createDataFrame(data)

df.display()

### Pandas DataFrame

You can also create a PySpark DataFrame directly from a Pandas DataFrame.

In [None]:
import pandas as pd

pandas_df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["John", "Isabel", "Bart"]})

spark_df = spark.createDataFrame(pandas_df)

spark_df.display()

Similarly, you can convert a PySpark DataFrame to a Pandas DataFrame using the `toPandas()` method.

In [None]:
df = spark_df.toPandas()

type(df)

### CSV

You can create a DataFrame from a CSV file.

Let's download a CSV file and create a DataFrame from it.

In [None]:
%sh

wget https://raw.githubusercontent.com/inesmcm26/lp-big-data/main/data/people-100.csv -O people.csv

To avoid saving this data to the DBFS, let's read it directly from the driver's local file system. Remember that you need to specify the full path to the file, beginning with `file:`.

In [None]:
# You can do it like this
df = spark.read.csv("file:/databricks/driver/people.csv", header=True, sep=',', inferSchema=True)

df.display()

In [None]:
# Or like this to make it more readable
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("file:/databricks/driver/people.csv")
)

df.display()

You can also specify the schema of the data you're reading.

The schema of a DataFrame defines the column names, each column's data type and whether `null` values are accepted in that column.

In the last cell, we used the option `.option("inferSchema", "true")` when reading the data. By doing this, we are asking Spark to find out the structure of the data before creating our DataFrame.

However, instead of relying on Spark to infer the schema, you can explicitly define the structure of the table.

This is extremely useful when you want to improve the robustness of your data pipeline; **it is better to know you are missing a few columns at ingestion time than to get an error later in the program**. Additionally, defining the schema in advance can improve performance because inferring the schema requires a pre-read of the data.

In PySpark, we have a specific data type to represent schemas: the `StructType`.

You can think of a `StructType` as a dictionary whose entries are `StructField` objects. In each `StructField` you specify the column name, the data type and whether the column can have null values.

All the data types available for each column are described in the [pyspark.sql.types](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/data_types.html) module.

In [None]:
from pyspark.sql.types import IntegerType, StringType, DateType, StructType, StructField

people_df_schema = StructType([
    StructField('Index', IntegerType(), False),
    StructField('User Id', StringType(), True),
    StructField('First Name', StringType(), True),
    StructField('Last Name', StringType(), True),
    StructField('Gender', StringType(), True),
    StructField('Email', StringType(), True),
    StructField('Phone', StringType(), True),
    StructField('Date of birth', DateType(), True),
    StructField('Job Title', StringType(), True),
])

df = (
    spark.read.format('csv')
    .schema(people_df_schema)
    .option('header', 'true')
    .option('sep', ',')
    .load('file:/databricks/driver/people.csv')
)

df.display()

### External Databases

You can also create a DataFrame from an external database like MySQL, PostgreSQL, etc. You need to specify the JDBC URL, the table name, and the connection properties.

The code below is just an example. It won't run because we don't have an external database running.

In [None]:
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://hostname:port/database_name")
    .option("dbtable", "table_name")
    .option("user", "username")
    .option("password", "password")
    .load()
)

### Tables

You can also create/save a DataFrame from/to a table. 

Let's first save a simple DataFrame to a table.

In [None]:
pandas_df = pd.DataFrame({"ID": ['1', '2', '3'], "Name": ["John", "Isabel", "Bart"]})

df = spark.createDataFrame(pandas_df)

df.write.saveAsTable('users')

In Databricks, when you use the `saveAsTable('name')` method on a PySpark DataFrame, **the data is stored in a managed table within the Databricks environment**. Databricks manages the underlying storage for you, so you don't need to worry about the specifics of the distributed file system.

**The data is typically stored in a distributed storage layer managed by Databricks, which abstracts away the details of the storage infrastructure** and lets you focus on your data analysis tasks.

**To access the data stored in the table, you can use SQL in Databricks or the PySpark DataFrame API**. Databricks provides a unified interface for accessing and querying data, regardless of its underlying storage. So, you can interact with the table just like any other table in your Databricks environment.

You can check that the table was created by running a SQL query directly in the Databricks environment. As we've seen in the previous module, you can do this by using the `%sql` magic command.

When running this command, **you can access any data that is referenced in the metastore of the running cluster**. Since we've saved the DataFrame as a table, a reference to the table is stored in the metastore and can be accessed using the `%sql` magic command.

In [None]:
%sql

SHOW TABLES;

Now that the data is stored in a table, you can read it into a Spark DataFrame using the `spark.read.table()` method.

In [None]:
spark.read.table('users').display()

## PySpark DataFrame Operations - Part 1

### PySpark DataFrame Operations VS SQL Queries

PySpark DataFrame methods provide functionality similar to SQL queries for data manipulation and transformation tasks.

In fact, **most DataFrame operations can also be written as SQL queries, using the `sql` method of the SparkSession or the `%sql` magic in a Databricks cell**. **Both approaches are optimized to run efficiently on Spark clusters**.

However, **DataFrame methods are more flexible** and can be used in more complex scenarios than SQL queries. They are also more readable and **easier to maintain**. Here are some key advantages of using DataFrame methods over SQL queries:

- **Programmatic control:** You can manipulate DataFrames using Python code, which provides flexibility in handling complex logic, conditional operations, and iterative processes that may not be easily expressed in SQL.
- **Easier to maintain:** DataFrames are easier to maintain than SQL queries because they are written in Python code, which is more readable and less error-prone than SQL.
- **Debugging:** A chain of DataFrame methods can be split into intermediate variables and inspected step by step with ordinary Python tooling, whereas a long SQL query is a single string that is harder to test piece by piece. Note that Python does not add static type safety: in both approaches, mistakes such as a misspelled column name only surface at runtime, when Spark analyzes the query.
- **Custom functions:** You can define custom Python functions (UDFs - User Defined Functions) within PySpark DataFrames, enabling you to encapsulate business logic and reuse it across different parts of your data pipeline.


Nevertheless, we will see how some Spark DataFrame methods can be translated into SQL queries, just to show how similar they are.

### Downloading the Data

Now we're going to go through some [PySpark DataFrame methods](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.html) and [PySpark SQL functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html).

For that, we'll be using a dataset with five years of customer orders information, including thousands of products sold. The records comprise requests from 2017 to 2021. The data was adapted from the JMP Case Study Library and was obtained from [Kaggle](https://www.kaggle.com/datasets/gabrielsantello/wholesale-and-retail-orders-dataset).

Our goal is to perform some data analysis using this dataset. For that, we'll first go through these steps:
1. Data Cleaning
2. Feature Engineering
3. Data Analytics

In this notebook we'll cover the first two steps. The third step will be covered in the next two notebooks.

Let's start by downloading the data from a downloadable link using the `wget` command we explored earlier. Since the file we're downloading is a zip file, it is also necessary to unzip it.

![Retail Orders](https://www.smesouthafrica.co.za/wp-content/uploads/2020/10/DPO-SA--1024x640.jpg)

In [None]:
%sh

wget https://github.com/inesmcm26/lp-big-data/raw/main/data/orders-data.zip

unzip orders-data.zip

There are two files inside the zip:
- `orders.csv`: A file containing orders information, including purchase date, product ID, price etc.
- `product-supplier.csv`: A file containing information about purchased products as well as their supplier.

To avoid downloading the files every time we want to run this notebook, let's save them to the DBFS:

In [None]:
%fs
cp -r file:/databricks/driver/orders-data/ /FileStore/lp-big-data/orders-data/

Starting with the orders data, load it into a Spark DataFrame using one of the methods we saw earlier.

In [None]:
df_orders = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("/FileStore/lp-big-data/orders-data/orders.csv")
)

df_orders.display()

### DataFrame properties

Check some information about the columns.

For that, we can use the `printSchema()` method. This method prints the schema of the DataFrame, including the data type and nullable property of each column.

In [None]:
df_orders.printSchema()

We can also see how many rows the dataset contains by using the `count()` method.

In [None]:
df_orders.count()

For a general overview, let's check some statistics about the data. For that we have two methods: `describe()` and `summary()`.

Run both and check the differences.

***Note:*** Remember Spark's lazy evaluation. To visualize the results you need to trigger an action, for example by calling the `display()` method.

In [None]:
df_orders.describe().display()

In [None]:
df_orders.summary().display()

The `summary()` method provides a more in-depth statistical analysis: in addition to the statistics returned by `describe()`, it includes the quartiles (25%, 50% and 75%).

### Data Cleaning

Now that we've explored our dataset a bit, let's clean it.

In [None]:
# Import the pyspark SQL functions module
import pyspark.sql.functions as f

#### Drop Duplicates

We don't want duplicate order records. Drop the duplicate entries in case they exist.

In [None]:
df_orders = df_orders.dropDuplicates()

#### Rename columns

The original column names are too long and contain spaces. This makes them harder to use, so let's standardize them and make them shorter.

There are several ways of renaming a DataFrame's columns.

1. Using `select()` with `alias()`.

    You can rename the column names by selecting them and giving them a new alias.

**Selecting columns**

First let's see the different ways of selecting columns and then we can put everything together.

To select columns you have the following options:

In [None]:
# 1
df_orders_1 = (
    df_orders.select(
        'Customer ID',
        'Customer Status',
        'Date Order was placed',
        'Delivery Date',
        'Order ID',
        'Product ID',
        'Quantity Ordered',
        'Total Retail Price for This Order',
        'Cost Price Per Unit')
)

# 2
df_orders_2 = (
    df_orders.select(
        df_orders['Customer ID'],
        df_orders['Customer Status'],
        df_orders['Date Order was placed'],
        df_orders['Delivery Date'],
        df_orders['Order ID'],
        df_orders['Product ID'],
        df_orders['Quantity Ordered'],
        df_orders['Total Retail Price for This Order'],
        df_orders['Cost Price Per Unit'],
    )
)

# 3
df_orders_3 = (
    df_orders.select(
        f.col('Customer ID'),
        f.col('Customer Status'),
        f.col('Date Order was placed'),
        f.col('Delivery Date'),
        f.col('Order ID'),
        f.col('Product ID'),
        f.col('Quantity Ordered'),
        f.col('Total Retail Price for This Order'),
        f.col('Cost Price Per Unit'),
    )
)

I recommend using the third option. It allows you to be explicit about selecting a column and makes it easier to apply `Column` methods like `alias()`.

Let's use it to rename the columns.

In [None]:
df_orders_renamed = (
    df_orders.select(
        f.col('Customer ID').alias('customer_id'),
        f.col('Customer Status').alias('customer_status'),
        f.col('Date Order was placed').alias('placing_date'),
        f.col('Delivery Date').alias('delivery_date'),
        f.col('Order ID').alias('order_id'),
        f.col('Product ID').alias('product_id'),
        f.col('Quantity Ordered').alias('amount'),
        f.col('Total Retail Price for This Order').alias('revenue'),
        f.col('Cost Price Per Unit').alias('cost_per_unit'),
    )
)

df_orders_renamed.display()

***Note:*** We can use a SQL query to select and rename columns as well. Let's see how it would be done.

First, we need to create a temporary view of the DataFrame. Only then can we run a SQL query on it using the SparkSession `sql()` method.

In [None]:
df_orders.createOrReplaceTempView('df_orders_view')

Now let's run the query to rename the columns.

In [None]:
df_orders_sql = spark.sql(
    '''
        SELECT
            `Customer ID` AS customer_id,
            `Customer Status` AS customer_status,
            `Date Order was placed` AS placing_date,
            `Delivery Date` AS delivery_date,
            `Order ID` AS order_id,
            `Product ID` AS product_id,
            `Quantity Ordered` AS amount,
            `Total Retail Price for This Order` AS revenue,
            `Cost Price Per Unit` AS cost_per_unit
        FROM
            df_orders_view
    '''
)

df_orders_sql.display()

In terms of performance, both methods are similar. However, using DataFrame methods is more flexible and easier to maintain.

If we wanted to query the resulting DataFrame using SQL again, we would have to create a temporary view of the DataFrame first, as we did before.

Instead, we can simply assign the resulting DataFrame to a new variable and continue to use DataFrame methods on that auxiliary variable. This is easier to maintain because it avoids creating temporary views and makes it easier to debug.

The SQL approach becomes more cumbersome as the number of operations increases. Therefore, it is recommended to use DataFrame methods whenever possible.

2. Using the `withColumnRenamed` DataFrame method.

    This method receives two arguments:
    - The old column name
    - The new column name

In [None]:
df_orders_renamed_2 = (
    df_orders
    .withColumnRenamed('Customer ID', 'customer_id')
    .withColumnRenamed('Customer Status', 'customer_status')
    .withColumnRenamed('Date Order was placed', 'placing_date')
    .withColumnRenamed('Delivery Date', 'delivery_date')
    .withColumnRenamed('Order ID', 'order_id')
    .withColumnRenamed('Product ID', 'product_id')
    .withColumnRenamed('Quantity Ordered', 'amount')
    .withColumnRenamed('Total Retail Price for This Order', 'revenue')
    .withColumnRenamed('Cost Price Per Unit', 'cost_per_unit')
)

df_orders_renamed_2.display()

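3. Using the `toDF()` method. This is the easiest way of doing it when you want to rename all the columns in your DataFrame. The new names are assigned by position, so pass exactly one name per column, in the same order as the existing columns.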

In [None]:
df_orders_renamed_3 = (
    df_orders.toDF(
        'customer_id',
        'customer_status',
        'placing_date',
        'delivery_date',
        'order_id',
        'product_id',
        'amount',
        'revenue',
        'cost_per_unit'
    )
)

df_orders_renamed_3.display()

#### Create and drop columns

Have you noticed that the `customer_status` column has some values in lower case and others in upper case? We should deal with this issue and make them all lower case.

Let's check how many times that happens. We can use `groupBy` together with the `count` aggregation method to see how many times each unique value appears in the `customer_status` column.

We'll see how `groupBy` works in more detail later. For now let's just use it to mimic the pandas DataFrame `value_counts` method.

In [None]:
df_orders_renamed.groupBy('customer_status').count().display()

We can see that we should only have 3 unique values (Platinum, Silver and Gold), but instead we also have records with those values in upper case.

To standardize them and eliminate these redundancies we can create a new column in the DataFrame using the `withColumn` method. This new column will be obtained by applying a function to the old column. In this case, we want to apply the `lower` function from PySpark SQL functions.

Once the new column is created we need to drop the old one. For that we can use the `drop` method.

Finally, we can rename the new column to match the old column's name.

In [None]:
df_orders_standardized = (
    df_orders_renamed
    .withColumn('customer_status_new', f.lower(df_orders_renamed['customer_status']))
    .drop('customer_status')
    .withColumnRenamed('customer_status_new', 'customer_status')
)

df_orders_standardized.display()

Let's check the unique values in the `customer_status` column again.

In [None]:
df_orders_standardized.groupBy('customer_status').count().display()

#### Null values

Have you noticed that we have some columns with null values? Let's check the `describe` method again.

In [None]:
df_orders_standardized.describe().display()

All columns except for `placing_date` and `amount` have 185013 values.

We need to deal with these missing values before we proceed with our analysis.

Let's make some assumptions. If the `amount` of product ordered is missing, let's assume only one unit was ordered. Therefore, we'll impute the missing values with value `1`.

In the cases with missing `placing_date`, it is really hard to infer the missing values, so let's simply drop these observations from the analysis.


In [None]:
df_orders_cleaned = (
    df_orders_standardized
    # fillna receives the value used to fill the missing entries in the subset of columns
    .fillna(1, subset=['amount'])
    # dropna drops all the rows with a missing value in the specified subset of columns
    .dropna(subset=['placing_date'])
)

df_orders_cleaned.describe().display()
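To make the two rules concrete, here is a plain-Python sketch of the same logic applied to a few made-up rows (the values are invented for illustration, not taken from the dataset). `None` plays the role of a Spark null.

```python
# Made-up rows; None stands in for a missing (null) value.
rows = [
    {"order_id": 1, "amount": 2, "placing_date": "01-Jan-17"},
    {"order_id": 2, "amount": None, "placing_date": "05-Feb-17"},
    {"order_id": 3, "amount": 4, "placing_date": None},
]

# Same idea as fillna(1, subset=['amount']): only 'amount' is imputed.
filled = [
    {**row, "amount": 1 if row["amount"] is None else row["amount"]}
    for row in rows
]

# Same idea as dropna(subset=['placing_date']): only 'placing_date' is checked.
cleaned = [row for row in filled if row["placing_date"] is not None]

print(cleaned)
```

Order 2 keeps its row with `amount` set to 1, while order 3 is removed because its placing date is missing.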

Perfect! We managed to clean our dataset.

### Feature Engineering

Now we can perform some feature engineering to add value to our dataset. This will enable us to make a better analysis in the end.

#### Date objects

There are some new columns we can create based on the `placing_date` and `delivery_date`.

As we've seen in the initial schema, these columns are stored as strings, although they represent dates. So first we need to convert them to dates.

We can do that by applying a function to the old columns, as we did before. In this case we want the `to_date` function from PySpark SQL functions. It receives a column and a date format and converts the strings to the `DateType` data type based on the specified format.

There's one more problem we need to solve first. The months are stored as abbreviations (for example `Jan`) instead of numbers. Therefore, we add an extra step that converts these abbreviations to numbers, so that we can then apply `to_date` with a purely numeric format.

There are several ways of doing this, but the most straightforward one is to use the `regexp_replace` function in the PySpark SQL functions library.

In [None]:
# For simplicity we can create a mapping between strings and numbers
months_mapping = {
    'Jan': '01',
    'Feb': '02',
    'Mar': '03',
    'Apr': '04',
    'May': '05',
    'Jun': '06',
    'Jul': '07',
    'Aug': '08',
    'Sep': '09',
    'Oct': '10',
    'Nov': '11',
    'Dec': '12'
}

# DataFrames are immutable, so no copy is needed: start from the cleaned DataFrame
df_orders_date = df_orders_cleaned

# Loop through the months and replace the abbreviations with the corresponding values
for abbr, nr in months_mapping.items():
    df_orders_date = (
        df_orders_date
        .withColumn('placing_date', f.regexp_replace('placing_date', abbr, nr))
        .withColumn('delivery_date', f.regexp_replace('delivery_date', abbr, nr))
    )

In [None]:
df_orders_date.display()
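To see what the loop does to a single value, here is a plain-Python sketch that uses `re.sub` in place of `regexp_replace`. The sample string `'01-Jan-17'` is an assumed example of the day-month-year layout, not a value copied from the dataset, and `replace_month` is a hypothetical helper name.

```python
import re

# Same mapping as in the notebook, built from the ordered abbreviations.
abbreviations = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
months_mapping = {abbr: f'{i:02d}' for i, abbr in enumerate(abbreviations, start=1)}


def replace_month(date_str):
    """Replace a month abbreviation with its two-digit number."""
    for abbr, nr in months_mapping.items():
        date_str = re.sub(abbr, nr, date_str)
    return date_str


sample = '01-Jan-17'  # assumed example value, not copied from the dataset
converted = replace_month(sample)
print(converted)
```

Only one of the twelve patterns matches a given date, so the other eleven replacements leave the string unchanged.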

We are ready to convert the strings to dates.

In [None]:
df_orders_date_format = (
    df_orders_date
    .withColumn('placing_date', f.to_date(f.col('placing_date'), 'dd-MM-yy'))
    .withColumn('delivery_date', f.to_date(f.col('delivery_date'), 'dd-MM-yy'))
)

df_orders_date_format.display()

Now that we have our date columns in the correct format, we can think of new columns we may engineer.

I have prepared some suggestions:

- Difference between order and delivery dates: `datediff(end, start)`
- Extract the month: `month()`
- Extract the year: `year()`
- Extract the day of week: `dayofweek()` (returns 1 for Sunday through 7 for Saturday)

In [None]:
df_orders_date_engineered = (
    df_orders_date_format
    .withColumn('days_to_delivery', f.datediff(f.col('delivery_date'), f.col('placing_date')))
    .withColumn('order_month', f.month(f.col('placing_date')))
    .withColumn('order_year', f.year(f.col('placing_date')))
    .withColumn('order_day_of_week', f.dayofweek(f.col('placing_date')))
)

df_orders_date_engineered.display()
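If you want to sanity-check the values Spark returns, the four features can be reproduced in plain Python for a single pair of dates. The dates below are made up for illustration. Since Spark's `dayofweek()` counts from Sunday, the weekday needs a small conversion from Python's `isoweekday()`.

```python
from datetime import date

# Made-up example dates; 1 January 2017 was a Sunday.
placing_date = date(2017, 1, 1)
delivery_date = date(2017, 1, 4)

# Like datediff(end, start): number of days from start to end.
days_to_delivery = (delivery_date - placing_date).days
order_month = placing_date.month
order_year = placing_date.year

# isoweekday(): Monday=1 ... Sunday=7
# Spark dayofweek(): Sunday=1 ... Saturday=7
order_day_of_week = placing_date.isoweekday() % 7 + 1

print(days_to_delivery, order_month, order_year, order_day_of_week)
```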

#### Additional features

There are some other features we can add to our dataset, besides the ones related to the dates.

Here are some suggestions:

- **Order Profit**: Subtract the total cost (cost_per_unit * amount) from the revenue to calculate the profit made from each order. This can help in analyzing profitability.

- **Delivery Speed**: We can define that an order was delivered 'Fast' if the delivery took 1 day or less, 'Medium' if it took more than 1 and up to 3 days, and 'Slow' if it took longer than that. For that we can use the `when(condition, value)` function.

Let's create them!

In [None]:
df_orders_engineered = (
    df_orders_date_engineered
    .withColumn('profit', f.col('revenue') - (f.col('cost_per_unit') * f.col('amount')))
    .withColumn('delivery_speed',
                f.when(f.col('days_to_delivery') <= 1, 'Fast')
                .when(f.col('days_to_delivery') <= 3, 'Medium')
                .otherwise('Slow'))
)

df_orders_engineered.display()
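The chained `when(...).when(...).otherwise(...)` expression is evaluated like an if/elif/else: the first condition that holds wins. A minimal plain-Python sketch of the same rules, with hypothetical helper names `delivery_speed` and `profit`, could look like this. It assumes that a missing `days_to_delivery` falls through to `'Slow'`, which matches how Spark treats a null comparison (it is never true).

```python
def delivery_speed(days_to_delivery):
    """Mirror of the when/otherwise chain: the first matching branch wins."""
    if days_to_delivery is not None and days_to_delivery <= 1:
        return 'Fast'
    if days_to_delivery is not None and days_to_delivery <= 3:
        return 'Medium'
    # Reached for values above 3 and for missing values.
    return 'Slow'


def profit(revenue, cost_per_unit, amount):
    """Revenue minus the total cost of the ordered units."""
    return revenue - cost_per_unit * amount


for days in (0, 1, 2, 3, 4, None):
    print(days, delivery_speed(days))

print(profit(10.0, 2.0, 3))
```

Because the branches are checked in order, the second condition does not need a lower bound: any value that reaches it is already greater than 1.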

Great! We managed to create some new features that will be valuable for later analysis.

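Let's save the preprocessed orders dataset to the DBFS to avoid running the preprocessing steps all over again.

Save it in CSV format to `/FileStore/lp-big-data/orders-data/` with the name `orders-preprocessed.csv`. Note that Spark writes the output as a directory with that name containing one or more part files, rather than as a single CSV file.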

In [None]:
df_orders_engineered.write.csv('/FileStore/lp-big-data/orders-data/orders-preprocessed.csv', header=True)

---

It's time to apply what we've learned so far. Go to the `exercises-part1` notebook and try to solve the exercises.