# First Steps
### A Simple Program in PySpark

Data-driven applications, no matter how complex, all boil down to what we can
think of as three meta steps, which are easy to distinguish in a program:
1. We start by loading or reading the data we wish to work with.
2. We transform the data, either via a few simple instructions or a very complex machine learning model.
3. We then export (or sink) the resulting data, either into a file or by summarizing our findings into a visualization.

> **Note**:
> This assumes you have Spark and Java installed and set up.

To run PySpark interactively with IPython, run the commands below in the terminal (the `set` syntax shown is for the Windows Command Prompt):
- `set PYSPARK_DRIVER_PYTHON=ipython`
- `set PYSPARK_DRIVER_PYTHON_OPTS=`

Using the pyspark REPL

```
>pyspark

Python 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]        
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).   
23/01/02 22:24:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.3
      /_/

Using Python version 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021 11:48:03)
Spark context Web UI available at http://ManojZephyrusG15:4040
Spark context available as 'sc' (master = local[*], app id = local-1672694655134).
SparkSession available as 'spark'.

In [1]:
```


### SparkSession entry point



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()

In [3]:
# Older PySpark code often refers to `sc` and `sqlContext`; both can be derived from the SparkSession
sc = spark.sparkContext
sqlContext = spark
sc

A simple problem: “What are the most popular words used in the English language?”

Steps to carry out:
1. [*Read*](#step-1): Read the input data (we’re assuming a plain text file).
2. [*Token*](#step-2): Tokenize each word.
3. [*Clean*](#step-3): Remove any punctuation and/or tokens that aren’t words. Lowercase each word.
4. [*Count*](3_Scaling.ipynb#step-4): Count the frequency of each word present in the text.
5. [*Answer*](3_Scaling.ipynb#step-5): Return the top 10 (or 20, 50, 100) words.

![A simple program](images/first_steps_simple_program.png)
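
For orientation, the five steps can be sketched in plain Python on a tiny in-memory string before we build them with PySpark. This is only an illustration under stated assumptions: the sample text and variable names are made up, and `collections.Counter` stands in for the distributed counting that Spark would do on a real file.

```python
import re
from collections import Counter

# 1. Read: a made-up two-line sample stands in for the text file.
text = (
    "It is a truth universally acknowledged, that a single man\n"
    "in possession of a good fortune, must be in want of a wife."
)

# 2. Token: split every line on single spaces.
tokens = [token for line in text.splitlines() for token in line.split(" ")]

# 3. Clean: lowercase, keep the first run of letters, drop tokens with no letters.
matches = (re.search("[a-z]+", token.lower()) for token in tokens)
words = [match.group(0) for match in matches if match]

# 4. Count: tally how often each word appears.
counts = Counter(words)

# 5. Answer: return the most frequent words.
top_3 = counts.most_common(3)
print(top_3)
```

The PySpark version that follows has the same shape, but each step becomes a transformation on a data frame column instead of a loop over a Python list.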


<a id="step-1"></a>
## Step-1-READ

### Ingest and Explore

We need to choose how we are going to store the ingested data. PySpark provides two main structures:
1. RDD
2. Data frame

The **RDD** is like a distributed collection of objects (or rows). You pass orders to the RDD through regular Python functions over the items in it.

The **data frame** is a stricter version of the RDD. Conceptually, you can think of it as a table where each cell contains one value. The data frame leans heavily on the concept of columns: you operate on columns instead of on records, as you would with the *RDD*.

![RDD vs DF](images/first_steps_rdd_df.png)

The data frame is now the dominant data structure, and we will almost exclusively use it here. It is built on top of the RDD, which remains available when record-by-record flexibility is needed.

In [4]:
dir(spark.read)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_df',
 '_jreader',
 '_set_opts',
 '_spark',
 'csv',
 'format',
 'jdbc',
 'json',
 'load',
 'option',
 'options',
 'orc',
 'parquet',
 'schema',
 'table',
 'text']

In [5]:
book = spark.read.text("data/gutenberg_books/1342-0.txt")
book

DataFrame[value: string]

<img align="right" height="300px" src="images/first_steps_gutengery_text.png">

PySpark doesn’t output any data to the screen. Instead, it prints the schema, which is the names of the columns and their types. In PySpark’s world, each column has a type that describes how Spark’s engine represents its values. With the type attached to each column, you can instantly know which operations you can perform on the data, so you won’t inadvertently try to add an integer to a string: PySpark won’t let you add 1 to “blue.” Here, we have one column, named `value`, of type string.

A quick graphical representation of our data frame would look like the figure on the right: each line of text (separated by a newline character) is a record.

In [6]:
book.printSchema()

root
 |-- value: string (nullable = true)



In [7]:
print(book.dtypes)

[('value', 'string')]


In [8]:
print(spark.__doc__)

The entry point to programming Spark with the Dataset and DataFrame API.

    A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
    tables, execute SQL over tables, cache tables, and read parquet files.
    To create a :class:`SparkSession`, use the following builder pattern:

    .. autoattribute:: builder
       :annotation:

    Examples
    --------
    >>> spark = SparkSession.builder \
    ...     .master("local") \
    ...     .appName("Word Count") \
    ...     .config("spark.some.config.option", "some-value") \
    ...     .getOrCreate()

    >>> from datetime import datetime
    >>> from pyspark.sql import Row
    >>> spark = SparkSession(sc)
    >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
    ...     b=True, list=[1, 2, 3], dict={"s": 0}, row=Row(a=1),
    ...     time=datetime(2014, 8, 1, 14, 1, 5))])
    >>> df = allTypes.toDF()
    >>> df.createOrReplaceTempView("allTypes")
    >>> spark.sql('select i+1, d+1, not b, 

In [9]:
book.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
| Author: Jane Austen|
|                    |
|Posting Date: Aug...|
|Release Date: Jun...|
|Last Updated: Mar...|
|                    |
|   Language: English|
|                    |
|Character set enc...|
|                    |
+--------------------+
only showing top 20 rows



In [10]:
book.show(5, truncate=50)

+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejud...|
|                                                  |
|This eBook is for the use of anyone anywhere at...|
|almost no restrictions whatsoever.  You may cop...|
|re-use it under the terms of the Project Gutenb...|
+--------------------------------------------------+
only showing top 5 rows



<a id="step-2"></a>
## Step-2-Token

Now, we will perform simple column transformations to tokenize and clean the data.


In [11]:
# splitting our lines of text into arrays of words

from pyspark.sql.functions import split

lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



Let's go a little deeper to understand the `select()` statement. It selects one or more columns from your data frame. In PySpark’s world, a data frame is made out of `Column` objects, and you perform transformations on them. The most basic transformation is the identity, where you return exactly what was provided to you, similar to SQL's `select` statement.

In [12]:
book.select(book.value).show(10)  # book.value refers to the column by name using dot notation

# ---- all the methods below are valid cases ----- #

# from pyspark.sql.functions import col

# book.select(book.value)
# book.select(book["value"])
# book.select(col("value")) <- moving forward we will prefer this method
# book.select("value")

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
+--------------------+
only showing top 10 rows



Next, we split each string into a list of words. The `split()` function takes two or three parameters:
- A column object containing strings
- A Java regular expression delimiter to split the strings against
- An optional integer limit that caps the number of elements in the resulting array (not used here)

In [13]:
from pyspark.sql.functions import col, split

lines = book.select(split(col("value"), " "))
lines

DataFrame[split(value,  , -1): array<string>]

In [14]:
lines.printSchema()

root
 |-- split(value,  , -1): array (nullable = true)
 |    |-- element: string (containsNull = true)



In [15]:
lines.show(5)

+--------------------+
| split(value,  , -1)|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



The `split()` function transformed our string column into an array column containing one or more string elements, as the schema shows. With our lines of text now tokenized into words, one small annoyance remains: Spark gave our column a very unintuitive name, `split(value,  , -1)`. Let's see how to rename columns.

Renaming columns with **alias** and **withColumnRenamed**

In [16]:
book.select(split(col("value"), " ")).printSchema()

root
 |-- split(value,  , -1): array (nullable = true)
 |    |-- element: string (containsNull = true)



In [17]:
book.select(split(col("value"), " ").alias("line")).printSchema()

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [18]:
# This is messier, and you have to remember the name PySpark assigns automatically
# (note the two spaces between the commas: the second one is the delimiter itself)
lines = book.select(split(book.value, " "))
lines = lines.withColumnRenamed("split(value,  , -1)", "line")

Exploding a list into rows:

When working with data, a key element in data preparation is making sure that it “fits the mold”: the structure containing the data should be logical and appropriate for the work at hand. At the moment, each record of our data frame holds multiple words in an array of strings. It would be better to have one record for each word.

Let's look at the `explode()` function.

When applied to a column containing a container-like data structure (such as an array), it takes each element and gives it its own row.

![Explode function](images/first_steps_explode.png)

In [19]:
# The code follows the same structure as split()

from pyspark.sql.functions import col, explode, split

lines = book.select(split(col("value"), " ").alias("line"))
words = lines.select(explode(col("line")).alias("word"))

words.show(15)

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
|    Austen|
|          |
|      This|
|     eBook|
|        is|
+----------+
only showing top 15 rows



Before continuing our data-processing journey, we can take a step back and look at a sample of the data. Just by looking at the 15 rows returned, we can see that *Prejudice,* has a comma and that the cell between *Austen* and *This* contains the empty string. Let's clean the data.

<a id="step-3"></a>
## Step-3-Clean

In this section we will lowercase the words with the `lower()` function and remove punctuation with a regular expression.

In [20]:
from pyspark.sql.functions import lower

words_lower = words.select(lower(col("word")).alias("word_lower"))

words_lower.show()

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



Now let's remove the punctuation. To keep it simple, we will do a crude cleanup: we’ll keep the first contiguous group of letters as the word and discard the rest. This effectively removes punctuation, quotation marks, and other symbols, at the expense of being less robust with more exotic word constructions.

In [21]:
from pyspark.sql.functions import regexp_extract

words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
+---------+
only showing top 20 rows



Now let's see how to filter records from a data frame. We provide a test to perform on each record: if it returns true, we keep the record; if it returns false, the record is dropped. PySpark provides not one but two identical methods for this task: `.filter()` and its alias `.where()`.



In [22]:
words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
+---------+
only showing top 20 rows



***
<p style="text-align:left;">
    <a href="./1_Pyspark_Intro.ipynb">Previous Chapter</a>
    <span style="float:right;">
        <a href="./3_Scaling.ipynb">Next Chapter</a>
    </span>
</p>
