## The DataFrame

The PySpark data structure (a format for organising and storing data) that we will use the most is the DataFrame, the standard data structure in modern PySpark. A DataFrame organises data into a 2-dimensional table of rows and columns, much like a spreadsheet, but with named columns and a number of additional capabilities.

The PySpark DataFrame is built upon Spark's older data structure, the Resilient Distributed Dataset (RDD). The idea behind Spark is that modern datasets are often too large for a single computer. Such datasets need to be distributed over several computers, perhaps over several locations (the ubiquitous cloud), hence *distributed* dataset. Of course, this distribution must be fault-tolerant so that the dataset can be restored after a failure, hence *resilient*.

The DataFrame is the data structure to use when working with Python. The Pandas library also uses a DataFrame as its core data structure. In fact, you can create a Spark DataFrame from a Pandas DataFrame. Before you can create or read a DataFrame, you will need to open a Spark session.

In [None]:
# cell for imports
import os
from doctest import testmod
from typing import NamedTuple

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession

The Spark session is your entry point to the functionality that Spark offers. Accordingly, you will see almost every Spark script start with a line like the following:

In [None]:
spark: SparkSession = SparkSession.builder.getOrCreate()

Spark is great at giving you feedback, perhaps a little too great. The line below sets the log level so that only errors are logged, which is probably what you want while writing code. When running the code in production, you will probably want a different log level. The default log level is `WARN`.

In [None]:
spark.sparkContext.setLogLevel("ERROR")

#### Introduction to PySpark and Python

Actually, you cannot read these notebooks without understanding Python, so this is not a true Python introduction. Python is a great programming language; it is easy to learn, easy to read, easy to write, very flexible in use, and has a great ecosystem of libraries you can use, such as PySpark. However, Python code is also quite prone to mistakes, mistakes that can come as a surprise even to the most experienced programmer. The people behind Python are well aware of this and continue to improve the tooling you have for writing better code.

For the kind of scripting that is done by your average data scientist or engineer, the basics of writing clean, safe, and understandable code are not that complicated. If you start writing larger object-oriented applications, you will need more steps, but for this type of notebook, the following four steps will result in better code:

1. Typing
2. Naming
3. Documenting
4. Testing

In these notebooks, we will follow these steps.
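As a rough illustration of the four steps working together, here is a minimal sketch in plain Python. The function `calculate_item_cost` and its parameters are hypothetical names chosen for this example; the docstring doubles as documentation and as a test that `doctest` can run.

```python
import doctest


def calculate_item_cost(unit_price: float, quantity: int) -> float:
    """Return the cost of `quantity` units at `unit_price` each.

    >>> calculate_item_cost(unit_price=0.75, quantity=2)
    1.5
    """
    return unit_price * quantity


# Run the example in the docstring; nothing is printed when it passes.
doctest.run_docstring_examples(
    calculate_item_cost, {"calculate_item_cost": calculate_item_cost}
)
```

In a notebook, the `testmod` function imported earlier serves a similar purpose: it runs the examples in the docstrings of the current module.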

## Typing
Typing is a more complicated issue in Python than it is in a statically typed language like Java or Scala. Scala even supports type inference, so you often don't have to specify the type. Instead of being a strict typer from the start, you should probably grow into it. In Python, types are no more than hints; the interpreter does not enforce them. Python will never be a statically typed language; instead, it relies on external type checkers, such as MyPy. [MyPy](https://mypy.readthedocs.io/en/stable/) has a great intro to typing in Python, which is a good place to start. You could also read [PEP 484](https://peps.python.org/pep-0484/), co-authored by Guido van Rossum himself, where the case for typing was made. As Python treats types as hints, I suggest you use them as such: as a form of documentation.
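To see that hints really are only hints, consider the small sketch below. The function `double_quantity` is a hypothetical example; the point is that the interpreter happily runs a call that contradicts the annotation, whereas an external type checker such as MyPy would be expected to flag it.

```python
def double_quantity(quantity: int) -> int:
    """Return twice the given quantity."""
    return quantity * 2


# The call the hints describe.
print(double_quantity(2))  # 4

# Python does not enforce the hint: a string is accepted at runtime
# and simply repeated, which is almost certainly not what was meant.
print(double_quantity("ab"))  # abab
```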

First, we need a DataFrame to manipulate. You can create a DataFrame in several ways, the easiest probably being to pass a list of rows, e.g., `groceries = [["courgette", 1, 0.75], ["lentils", 1, 1.45], ["toetje", 2, 3.75]]`

This example comes from the book (it is a good book; just the coding practice is at times a bit iffy!). We should avoid code like this. At the least we should try to type it; however, we cannot type it precisely. A `list` annotation takes a single element type, so the best we could do for the mixed inner lists is a union such as `list[str | int | float]`, which says nothing about which position holds which type; we will come back to this in a later notebook. More important is that this code does not capture the essence of what we are trying to do. The inner elements are obviously different from the outer list, which is just a container. The inner elements are more complex; they have attributes and values. It is better to treat them as objects in themselves instead of as lists.

In [None]:
class Item(NamedTuple):
    name: str
    quantity: tuple[int, str]
    price: float


groceries: list[Item] = [
    Item(name="courgette", quantity=(1, "piece"), price=0.75),
    Item(name="lentils", quantity=(150, "grams"), price=1.45),
    Item(name="dessert", quantity=(2, "piece"), price=4.0),
]

In the code above, I created the same grocery list, but now each element of the list is an `Item`. An `Item` looks like a class, but underneath the exterior it is just a tuple. A `NamedTuple` has fields that are accessible by attribute lookup, is indexable and iterable, and has a nice `__repr__` method defined. All of this allows us not only to inspect the items on the grocery list thoroughly but also to direct the input to the list. An item must have a name, which is a string; a quantity, which is a tuple; and a price, which is a float. Writing code like this makes your code much cleaner and helps prevent faults from using the wrong type; more on this shortly.

In [None]:
repr(groceries[0])
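The other `NamedTuple` properties mentioned above can be sketched in a few lines. `GroceryItem` is a hypothetical, simplified stand-in for `Item`, used here only so that the example is self-contained.

```python
from typing import NamedTuple


class GroceryItem(NamedTuple):
    """Hypothetical two-field item, for illustration only."""

    name: str
    price: float


courgette = GroceryItem(name="courgette", price=0.75)

print(courgette.price)  # attribute lookup: 0.75
print(courgette[1])  # indexing, like any tuple: 0.75
name, price = courgette  # iterable, so it can be unpacked
print(courgette._asdict())  # {'name': 'courgette', 'price': 0.75}

# Like any tuple, a NamedTuple is immutable.
try:
    courgette.price = 1.0
except AttributeError:
    print("fields cannot be reassigned")
```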

This type of code is straightforward to work with. If we want to know the total, we can write very legible code.

In [None]:
total: float = sum([item.price for item in groceries])
total

Using code from the book, I can get a total too, but far less clearly: I have to index into the inner tuple. The meaning of `item[2]` is not immediately clear, unlike `item.price`.

In [None]:
groceries_shoddy: list[tuple[str, int, float]] = [
    ("courgette", 1, 0.75),
    ("lentils", 1, 1.45),
    ("dessert", 2, 4.0),
]
total: float = sum([item[2] for item in groceries_shoddy])
total

A final example of the superiority of a named tuple is that I can easily add to the complexity of our item without creating difficult-to-read code. What a quantity is can vary depending on the item; for instance, you will likely buy courgettes per piece and lentils by weight. Let's code it up using the shoddy coding practice.

In [None]:
groceries_shoddy: list[tuple[str, tuple[int, str], float]] = [
    ("courgette", (1, "piece"), 0.75),
    ("lentils", (150, "gram"), 1.45),
    ("dessert", (2, "piece"), 4.0),
]
type(groceries_shoddy[1][1])

In [None]:
for t in groceries:
    print(type(t.quantity))

Let's make a DataFrame:

In [None]:
df_groceries: DataFrame = spark.createDataFrame(
    groceries, ["item", "quantity", "price"]
)
df_groceries

#### Warning: PySpark does not coerce types!
You cannot carelessly mix certain types in PySpark like you can in Python. Spark will throw an error if, for example, the dessert is created with `price=4` while the other prices are floats: Spark does not automatically coerce an `int` into a `float`. Spark, and thus PySpark, is less flexible than Python, which will coerce these values. This is quite important to remember when you move from Python to PySpark. There will be more examples where Python and PySpark do not map one-to-one. Using a `NamedTuple` has the benefit of typed attributes, which direct the input. If a programmer follows your design, this fault should not occur, even if she does not know that PySpark does not coerce types.

In [None]:
df_groceries.printSchema()
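Because a `NamedTuple` does not enforce its annotations at runtime, an `int` price can still slip in unnoticed. One possible safeguard, sketched below in plain Python, is to check the values before handing them to Spark. `PricedItem` and `find_non_float_prices` are hypothetical names introduced for this example; they are not part of PySpark.

```python
from typing import NamedTuple


class PricedItem(NamedTuple):
    """Hypothetical item with a price that is meant to be a float."""

    name: str
    price: float


def find_non_float_prices(items: list[PricedItem]) -> list[str]:
    """Return the names of the items whose price is not a float."""
    return [item.name for item in items if not isinstance(item.price, float)]


items = [
    PricedItem(name="courgette", price=0.75),
    PricedItem(name="dessert", price=4),  # int, accepted silently by Python
]
print(find_non_float_prices(items))  # ['dessert']
```

A check like this could run just before `createDataFrame`, so that a mixed column is reported in Python terms rather than as a Spark error.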

## Spark SQL
Spark SQL is the API you will use most. The Spark SQL API introduced the [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) as a data abstraction. The DataFrame allows you to work with structured (e.g., tabular) data but also with semi-structured data, such as full-text documents or JSON. A DataFrame is column-oriented: when we start manipulating our data, we do this per column and not per row.

In a normal environment, you will not have to design data as we did with `Item`; instead, you will read data from a source, such as a text file.

In [None]:
path: str = (
    "./ProgrammingProjects/SparkTest/DataAnalysisWithPythonAndPySpark-Data-trunk/gutenberg_books/"
)

pride_and_pred: DataFrame = spark.read.text(path + "1342-0.txt")

## Quick inspection tools
There are a few easy and useful inspection tools for any DataFrame:
1. `printSchema`, which prints the schema of our DataFrame.
2. `show`, which shows us the first 20 rows of our DataFrame by default.
3. `count`, which counts the rows.
4. `rdd.countApprox`, which approximates the number of rows in the DataFrame; useful when working with very large DataFrames.
5. `columns`, an attribute of the DataFrame that gives you a list of column names.
6. `dtypes`, another attribute, which gives you the name and type of each column in the DataFrame.

In [None]:
pride_and_pred.printSchema()

In [None]:
pride_and_pred.show(truncate=False)

In [None]:
f"the row count of pride_and_pred is {pride_and_pred.count()}"

In [None]:
# timeout is given in milliseconds
pride_and_pred.rdd.countApprox(timeout=5)

In [None]:
pride_and_pred.columns

In [None]:
pride_and_pred.dtypes

## PySpark.sql API
We can manipulate our DataFrame with all kinds of functions and methods from the [PySpark SQL API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html#).

Below, we will use three functions:
1. The select function: `DataFrame.select(*cols: ColumnOrName) → DataFrame`
2. The split function: `pyspark.sql.functions.split(str: ColumnOrName, pattern: str, limit: int = -1) → pyspark.sql.column.Column`
3. The alias function: `Column.alias(*alias: str, **kwargs: Any) → pyspark.sql.column.Column`

In the PySpark community, it is a convention to `import pyspark.sql.functions as F` and then call the function as `F.split`. It is good programming practice to follow the convention. It is often said that code is read about ten times more often than it is written; adhering to conventions increases the legibility of your code.

In [None]:
lines: DataFrame = pride_and_pred.select(
    F.split(pride_and_pred.value, " ").alias("line")
)

lines.show(n=5, truncate=False)

We can use the alias we have created to manipulate our data further. We want to get rid of these lists; instead, we want a column that contains all the words in these lists, one word per row. Enter the explode function:

`pyspark.sql.functions.explode(col: ColumnOrName) → pyspark.sql.column.Column`

Explode returns a new row for each element in the given array or map. In other words, for every string in our list (array) of strings, we get a new row. 

In [None]:
words: DataFrame = lines.select(F.explode(F.col("line")).alias("word"))
words.show()
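If the combination of split and explode feels abstract, the plain-Python analogy below may help. It is only a sketch of the idea on a hypothetical two-line sample, not how Spark executes it: splitting turns each line into a list of words, and exploding flattens those lists into one word per row. Note that `F.split` treats its pattern as a regular expression, while `str.split` takes a literal separator; for a single space the result is the same.

```python
# Two sample lines, standing in for the rows of the `value` column.
lines = ["It is a truth", "universally acknowledged"]

# "split": every row becomes a list of words.
split_lines = [line.split(" ") for line in lines]
print(split_lines)
# [['It', 'is', 'a', 'truth'], ['universally', 'acknowledged']]

# "explode": every element of every list becomes its own row.
words = [word for line in split_lines for word in line]
print(words)
# ['It', 'is', 'a', 'truth', 'universally', 'acknowledged']
```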

## Selecting columns
PySpark's select is very much equivalent to SQL's select. The function definition of select is:

`DataFrame.select(*cols: ColumnOrName) → DataFrame` 

As arguments, we pass column names (strings), `Column` objects, or a list of either. A column can be referenced in all the ways we would expect in Python, as well as with the `col` function, `pyspark.sql.functions.col(col: str) → pyspark.sql.column.Column`:
1. `select(words.word)`
2. `select(words['word'])`
3. `select(F.col('word'))`

You will see all three in use, and all three work, but there are slight differences under the hood:
 - `words.word` uses the dot notation you know from objects, `words` being the object and `word` being the attribute. Dot notation is handled by the `__getattr__` special method. This is the most inflexible of the three. In Python, attribute names cannot contain special characters, start with a number, or contain spaces. Using this notation on a column whose name includes, for instance, a space will result in a `SyntaxError`.
 - `words['word']` uses the `__getitem__` special method. Because the name is passed as a string, it may contain spaces or special characters. The DataFrame is not a sequence; it simply implements the method that the bracket syntax calls. This type of implicit interface implementation is called duck typing, a term you might come across.
 - `F.col('word')` is the Spark-native way of selecting a column and returns an expression. An expression is a construct that can be evaluated to determine its value. This is an important distinction: because the expression is not tied to a particular DataFrame variable, we can build it and operate on it before the DataFrame it applies to has been assigned to a name.

You may think we can fetch a column with get, as in `words.get('word')`, but we cannot: `get` is not defined for `DataFrame`.
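To make the role of the two special methods concrete, here is a toy class in plain Python. `TinyFrame` is purely hypothetical and is not how PySpark implements its DataFrame; it only shows how dot notation and bracket notation can be routed to the same underlying data.

```python
class TinyFrame:
    """Hypothetical toy container that maps column names to lists of values."""

    def __init__(self, columns: dict[str, list]) -> None:
        self._columns = columns

    def __getitem__(self, name: str) -> list:
        # Called for bracket notation: frame["word"]
        return self._columns[name]

    def __getattr__(self, name: str) -> list:
        # Called for dot notation, but only when normal lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


frame = TinyFrame({"word": ["pride"], "word count": [1]})

print(frame.word)  # ['pride']
print(frame["word"])  # ['pride']

# A name with a space only works with brackets;
# `frame.word count` would be a SyntaxError.
print(frame["word count"])  # [1]
```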

In [None]:
words_lower: DataFrame = words.select(F.lower(F.col("word")).alias("word_lower"))
words_lower.show()

In [None]:
words_clean: DataFrame = words_lower.select(
    F.regexp_extract(F.col("word_lower"), "[a-z]+", 0).alias("word")
)
words_clean.show()

In [None]:
words_no_null: DataFrame = words_clean.filter(F.col("word") != "")
words_no_null.show()

In a few easy steps we have changed the text into an analysable format.
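The same three cleaning steps (lower-casing, extracting the first run of letters, dropping empty results) can be mimicked in plain Python on a handful of sample tokens. This is only an analogy to make the logic visible; `extract_word` is a hypothetical helper, and the sample tokens are made up for illustration.

```python
import re


def extract_word(token: str) -> str:
    """Lower-case the token and return its first run of letters, or ''."""
    match = re.search("[a-z]+", token.lower())
    # Like regexp_extract, fall back to an empty string when nothing matches.
    return match.group(0) if match else ""


tokens = ["Pride", "and", "PREJUDICE,", "", "1813", "it's"]

cleaned = [extract_word(token) for token in tokens]
print(cleaned)  # ['pride', 'and', 'prejudice', '', '', 'it']

# The filter step: keep only the non-empty words.
words_only = [word for word in cleaned if word != ""]
print(words_only)  # ['pride', 'and', 'prejudice', 'it']
```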

Say we want to know the number of words that have 12 or more letters; we can simply write:

In [None]:
big_words: DataFrame = words_no_null.filter(F.length(F.col("word")) > 11)
big_words.count()

**Jane Austen uses a lot of big words!**

Most likely, you will want to analyse some form of tabular data, for instance a CSV file. Similar to reading text, Spark comes with a CSV reader.

The CSV file is downloadable via the author's [GitHub page](https://github.com/jonesberg/DataAnalysisWithPythonAndPySpark-Data/tree/trunk/broadcast_logs).

In [None]:
path: str = "./Downloads"

broadcast_logs: DataFrame = spark.read.csv(
    path=os.path.join(path, "BroadcastLogs_2018_Q3_M8.CSV"),
    sep="|",
    header=True,
    inferSchema=True,
    timestampFormat="yyyy-MM-dd",
)

For the sake of whoever reads your code, I suggest you always pass arguments to methods and functions by keyword; the reader then at least has an idea of what each argument is for. Using keywords makes your code more readable. There is, however, another benefit: you get to know the methods and functions better.

In [None]:
broadcast_logs.select("BroadcastLogID", "LogServiceID", "LogDate").show(
    n=5, truncate=False
)

As I said previously, `col` returns an expression, one we can use straight away.

In [None]:
broadcast_logs.select(
    F.col("Duration"),
    F.col("Duration").substr(1, 2).cast("int").alias("Hours"),
    F.col("Duration").substr(4, 2).cast("int").alias("Minutes"),
    F.col("Duration").substr(7, 2).cast("int").alias("Seconds"),
).distinct().show(5)

Or we can use the column values in a calculation:

In [None]:
broadcast_logs.select(
    F.col("Duration"),
    (
        F.col("Duration").substr(1, 2).cast("int") * 3600
        + F.col("Duration").substr(4, 2).cast("int") * 60
        + F.col("Duration").substr(7, 2).cast("int")
    ).alias("DurationInSeconds"),
).distinct().show(5)
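The arithmetic behind `DurationInSeconds` can be checked on a single value in plain Python. The sketch below assumes, as the substring positions above imply, that a duration starts with `HH:MM:SS`; the helper `duration_in_seconds` and the sample value are hypothetical. Keep in mind that Spark's `substr` counts from 1, whereas Python slices count from 0.

```python
def duration_in_seconds(duration: str) -> int:
    """Convert a duration that starts with 'HH:MM:SS' into seconds."""
    hours = int(duration[0:2])  # Spark: substr(1, 2)
    minutes = int(duration[3:5])  # Spark: substr(4, 2)
    seconds = int(duration[6:8])  # Spark: substr(7, 2)
    return hours * 3600 + minutes * 60 + seconds


print(duration_in_seconds("01:30:15.0000000"))  # 5415
```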

In [None]:
broadcast_logs.printSchema()

In [None]:
len(broadcast_logs.columns)

These two quick checks give us quite some insight into the metadata of our DataFrame.
We have 30 columns, and quite a few are simply IDs; let's drop those. There are two ways to do that: either we use the `drop` method, or we write a bit of Python that saves us from having to spell out the names of all the columns we want to drop.

In [None]:
broadcast_logs = broadcast_logs.select(
    *[x for x in broadcast_logs.columns if x[-2:] != "ID"]
)
broadcast_logs.printSchema()

We have used an alias above, but what if we want the original DataFrame with an added column? For that, we use `withColumn`:

In [None]:
extended_logs = broadcast_logs.withColumn(
    "DurationInSeconds",
    (
        F.col("Duration").substr(1, 2).cast("int") * 3600
        + F.col("Duration").substr(4, 2).cast("int") * 60
        + F.col("Duration").substr(7, 2).cast("int")
    ),
)
extended_logs.printSchema()

We can also sort the columns alphabetically:

In [None]:
extended_logs.select(sorted(extended_logs.columns)).printSchema()

We can also use standard statistical functions on a DataFrame:
1. `describe`
2. `summary`

In [None]:
extended_logs.select(F.col("DurationInSeconds")).describe().show()

In [None]:
extended_logs.select(F.col("DurationInSeconds")).summary().show()

## Naming functions, classes and variables
We have discussed typing; we should also discuss naming, as this is also an important part of proper coding. Python's conventions are set out in [PEP 8](https://peps.python.org/pep-0008/). Of course, who can remember all those conventions? I cannot; I use a linter, [Ruff](https://github.com/astral-sh/ruff) to be precise. Ruff is written in Rust and is one of the fastest linters and code formatters available for Python. You should use a linter too.

As for naming methods, variables, etc., there are a few things to remember:

1. Use nouns, verbs, and adjectives. Verbs should indicate an action like `calculate()` or `print()`. Nouns describe a return value of the function or describe the variable meaningfully, e.g., `name()`, `user`. Adjectives add specificity to a name, for instance, `total_price()`. Mixing nouns, verbs, and adjectives together builds descriptive names like `calculate_total_price()`.
2. Avoid ambiguity: do not use generic terms such as `process` or `data`, or names with multiple meanings, e.g., `check`, `file`, `object`. Finally, do not use abbreviations unless they are widely known.
3. If you want a function or attribute to be treated as private, prefix its name with an underscore: `_private`.

Everyone who writes code has used the variable `x`. I am no exception, but keeping in mind that code is read far more often than it is written, I try my best.