# Submitting and scaling your first PySpark program

This chapter covers

- Summarizing data using `groupby` and a simple aggregate function
- Ordering results for display
- Writing data from a data frame
- Using `spark-submit` to launch your program in batch mode
- Simplifying PySpark writing using method chaining
- Scaling your program to multiple files at once


Chapter 2 dealt with all the data preparation work for our word frequency program. We read the input data, tokenized each word, and cleaned our records to only keep lowercase words. If we bring out our outline, we only have steps 4 and 5 to complete:

- [DONE] Read: Read the input data (we’re assuming a plain text file).
- [DONE] Token: Tokenize each word.
- [DONE] Clean: Remove any punctuation and/or tokens that aren’t words. Lowercase each word.
- Count: Count the frequency of each word present in the text.
- Answer: Return the top 10 (or 20, 50, 100).


## Grouping records: Counting word frequencies

If you take our data frame in the same shape as it was at the end of chapter 2 (you can find the code in a single file in the book’s code repository at `code/Ch02/end_of_chapter.py`), there is just a little more work to be done. With a data frame containing a single word per record, we just have to count the word occurrences and take the top contenders. This section shows you how to count records using the `GroupedData` object and perform an aggregation function (here, counting the items) on each group.

Intuitively, we count the number of each word by creating groups: one for each word. Once those groups are formed, we can perform an **aggregation function** on each one of them. In this specific case, we count the number of records for each group, which will give us the number of occurrences for each word in the data frame. Under the hood, PySpark represents a grouped data frame in a `GroupedData` object; think of it as a transitional object that awaits an aggregation function to become a transformed data frame.

![](https://drek4537l1klr.cloudfront.net/rioux/Figures/03-01.png)
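The two-step idea (form one group per word, then aggregate each group) can be sketched in plain Python. This is only an analogy, not how PySpark implements `GroupedData`; the sample words are made up for illustration.

```python
from collections import defaultdict

# A handful of cleaned words standing in for the `word` column (made-up sample).
words = ["the", "pride", "the", "and", "pride", "the"]

# Step 1: form one group per distinct word (the role played by `GroupedData`).
groups = defaultdict(list)
for word in words:
    groups[word].append(word)

# Step 2: apply an aggregation function (here, a count) to each group.
counts = {word: len(members) for word, members in groups.items()}

print(counts)
```

Until step 2 runs, `groups` is only an intermediate structure, much like a `GroupedData` object waiting for its aggregation.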

In [7]:
# end-of-chapter.py############################################################
#
# Use this to get a free pass from Chapter 2 to Chapter 3.
#
# Remember, with great power comes great responsibility. Make sure you
# understand the code before running it! If necessary, refer to the text in
# Chapter 2.
#
###############################################################################

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode, lower, regexp_extract

spark = SparkSession.builder.getOrCreate()

book = spark.read.text("../../data/gutenberg_books/1342-0.txt")

lines = book.select(split(book.value, " ").alias("line"))

words = lines.select(explode(col("line")).alias("word"))

words_lower = words.select(lower(col("word")).alias("word_lower"))

words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]*", 0).alias("word")
)

words_nonull = words_clean.filter(col("word") != "")


The easiest way to count record occurrences is to use the `groupby()` method, passing the columns we wish to group by as a parameter. The `groupby()` method in listing 3.1 returns a `GroupedData` object and awaits further instructions. Once we apply the `count()` method, we get back a data frame containing the grouping column `word`, as well as the `count` column containing the number of occurrences for each word.

In [2]:
groups = words_nonull.groupby(col("word"))

groups

<pyspark.sql.group.GroupedData at 0x222545b9e50>

In [2]:
results = words_nonull.groupby(col("word")).count()
 
results

DataFrame[word: string, count: bigint]

In [4]:
results.show()

+-------------+-----+
|         word|count|
+-------------+-----+
|       online|    4|
|         some|  203|
|        still|   72|
|          few|   72|
|         hope|  122|
|        those|   60|
|     cautious|    4|
|    imitation|    1|
|          art|    3|
|      solaced|    1|
|       poetry|    2|
|    arguments|    5|
| premeditated|    1|
|      elevate|    1|
|       doubts|    2|
|    destitute|    1|
|    solemnity|    5|
|   lieutenant|    1|
|gratification|    1|
|    connected|   14|
+-------------+-----+
only showing top 20 rows



Peeking at the `results` data frame in listing 3.1, we see that the results are in no specific order. As a matter of fact, I’d be very surprised if you had the exact same order of words that I do! This has to do with how PySpark manages data. In chapter 1, we learned that PySpark distributes the data across multiple nodes. When performing a grouping function, such as `groupby()`, each worker performs the work on its assigned data. `groupby()` and `count()` are transformations, so PySpark will queue them lazily until we request an action. When we call the `show()` method on our `results` data frame, it triggers the chain of computation that we see in figure 3.2.

![](https://drek4537l1klr.cloudfront.net/rioux/Figures/03-02.png)
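Lazy evaluation can be approximated with Python generators. This is a loose, single-machine analogy rather than PySpark's actual machinery, and the function names are hypothetical. The point is that building the pipeline does no work; only consuming it does.

```python
log = []

def lowercase(records):
    # A lazy "transformation": nothing runs until a consumer asks for items.
    for record in records:
        log.append(f"lowercase {record}")
        yield record.lower()

def non_empty(records):
    # Another lazy step, chained on top of the previous one.
    for record in records:
        log.append(f"filter {record}")
        if record:
            yield record

pipeline = non_empty(lowercase(["Pride", "", "AND"]))
steps_before_action = len(log)   # still 0: only a plan has been built

result = list(pipeline)          # the "action": triggers the whole chain
steps_after_action = len(log)

print(steps_before_action, result, steps_after_action)
```

Here `list()` plays the role that `show()` plays for a data frame: it is the request that forces the queued steps to run.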

### Exercise 3.1

Starting with the `words_nonull` data frame seen in this section, which of the following expressions would return the number of words per letter count (e.g., there are X one-letter words, Y two-letter words, etc.)?

Assume that `pyspark.sql.functions.col`, `pyspark.sql.functions.length` are imported.

a) `words_nonull.select(length(col("word"))).groupby("length").count()`

b) `words_nonull.select(length(col("word")).alias("length")).groupby("length").count()`

c) `words_nonull.groupby("length").select("length").count()`

d) None of those options would work.


In [7]:
from pyspark.sql.functions import length, col

words_nonull.select(length(col("word")).alias("length")).groupby("length").count()

DataFrame[length: int, count: bigint]

## Ordering the results on the screen using orderBy

In section 3.1, we explained why PySpark doesn’t necessarily maintain an order of records when performing transformations. If we look at our five-step blueprint, the last step is to return the top N records for different values of N. We already know how to show a specific number of records, so this section focuses on ordering the records in a data frame before displaying them:

- [DONE] Read: Read the input data (we’re assuming a plain text file).
- [DONE] Token: Tokenize each word.
- [DONE] Clean: Remove any punctuation and/or tokens that aren’t words. Lowercase each word.
- [DONE] Count: Count the frequency of each word present in the text.
- Answer: Return the top 10 (or 20, 50, 100).


Just like we use `groupby()` to group a data frame by the values in one or many columns, we use `orderBy()` to order a data frame by the values of one or many columns. PySpark provides two different syntaxes to order records:

- We can provide the column names as parameters, with an optional `ascending` parameter. By default, we order a data frame in ascending order; by setting `ascending` to `False`, we reverse the order, getting the largest values first.
- Or we can use the Column object directly, via the `col` function. When we want to reverse the ordering, we use the `desc()` method on the column.

PySpark orders the data frame using each column, one at a time. If you pass multiple columns (see chapter 5), PySpark uses the first column’s values to order the data frame, then the second (and then third, etc.) when there are identical values. Since we have a single column—and no duplicates because of `groupby()`—the application of `orderBy()` in the next listing is simple, regardless of the syntax we pick.

In [8]:
results.orderBy("count", ascending=False).show(10)

+----+-----+
|word|count|
+----+-----+
| the| 4480|
|  to| 4218|
|  of| 3711|
| and| 3504|
| her| 2199|
|   a| 1982|
|  in| 1909|
| was| 1838|
|   i| 1750|
| she| 1668|
+----+-----+
only showing top 10 rows



In [9]:
results.orderBy(col("count").desc()).show(10)

+----+-----+
|word|count|
+----+-----+
| the| 4480|
|  to| 4218|
|  of| 3711|
| and| 3504|
| her| 2199|
|   a| 1982|
|  in| 1909|
| was| 1838|
|   i| 1750|
| she| 1668|
+----+-----+
only showing top 10 rows
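The tie-breaking rule for several ordering columns (use the first column, then the second when values are identical) can be sketched in plain Python. This is an analogy built on `sorted()`, not PySpark code; the four rows are taken from the unordered output shown earlier, where `still` and `few` both have a count of 72.

```python
# (word, count) rows; "still" and "few" tie on count, as in the earlier output.
rows = [("some", 203), ("still", 72), ("few", 72), ("hope", 122)]

# Order by count descending, then by word ascending to break ties.
ordered = sorted(rows, key=lambda row: (-row[1], row[0]))

for word, count in ordered:
    print(f"{word:>5} {count:>4}")
```

In PySpark, the same intent would be expressed by passing both columns to `orderBy()`, for instance `results.orderBy(col("count").desc(), col("word"))`.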



The list is very unsurprising: even though we can’t argue with Austen’s vocabulary, she isn’t immune to the fact that the English language needs pronouns and other common words. In natural language processing, those words are called stop words and could be removed. We solved our original query and can rest easy. Should you want to get the top 20, top 50, or even top 1,000, it’s easily done by changing the parameter to `show()`.

> PySpark’s method naming convention zoo
>
>If you are detail-oriented, you might have noticed we used ``groupby`` (lowercase), but ``orderBy`` (lowerCamelCase, where you capitalize the first letter of each word but the first word). This seems like an odd design choice.
>
>``groupby()`` is an alias for ``groupBy()``, just like ``where()`` is an alias of ``filter()``. I guess that the PySpark developers found that a lot of typing mistakes were avoided by accepting the two cases. ``orderBy()`` didn’t have that luxury, for a reason that escapes my understanding, so we need to be mindful of this.
>
>Part of this incoherence is due to Spark’s heritage. Scala prefers camelCase for methods. On the other hand, we saw ``regexp_extract``, which uses Python’s preferred ``snake_case`` (words separated by an underscore) in chapter 2. There is no magic secret here: you’ll have to be mindful of the different case conventions at play in PySpark.


### Exercise 3.2

Why isn’t the order preserved in the following code block?


In [11]:
(
    results.orderBy("count", ascending=False)
    .groupby(length(col("word")))
    .count()
    .show(5)
)
# +------------+-----+
# |length(word)|count|
# +------------+-----+
# |          12|  199|
# |           1|   10|
# |          13|  113|
# |           6|  908|
# |          16|    4|
# +------------+-----+
# only showing top 5 rows


+------------+-----+
|length(word)|count|
+------------+-----+
|          12|  197|
|           1|   10|
|          13|  109|
|           6|  897|
|          16|    4|
+------------+-----+
only showing top 5 rows



---

Showing results on the screen is great for a quick assessment, but most of the time you’ll want them to have some sort of longevity. It’s much better to save those results to a file so that we’ll be able to reuse them without having to compute everything each time. The next section covers writing a data frame to a file.

## Writing data from a data frame

Just like we use the ``read`` attribute and its ``DataFrameReader`` object to read data in Spark, we use ``write`` and the ``DataFrameWriter`` object to write our data frame back to disk. In listing 3.3, I specialize the ``DataFrameWriter`` to export text into a CSV file, naming the output ``simple_count.csv``. If we look at the results, we can see that PySpark didn’t create a ``simple_count.csv`` file. Instead, it created a directory of the same name and put the data inside it as part files, one per partition, next to an empty ``_SUCCESS`` marker file and hidden ``.crc`` checksum files. With 200 partitions, that means 201 files (200 CSVs + 1 ``_SUCCESS`` file); the run shown below produced a single part file.

In [8]:
results.write.csv("../../data/simple_count.csv", mode="overwrite")
 
# The ls command is run using a shell, not a Python prompt.
# If you use IPython, you can use the bang pattern (! ls -1).
# Use this to get the same results without leaving the IPython console.
 
#! ls -1 ../../data/simple_count.csv # for UNIX and MAC
! dir ..\..\data\simple_count.csv


 Volume in drive D is Data
 Volume Serial Number is D824-FAAC

 Directory of d:\git\manning\DataAnalysisWithPythonAndPySpark\data\simple_count.csv

28.04.2022  15:01    <DIR>          .
28.04.2022  15:01    <DIR>          ..
28.04.2022  15:01               604 .part-00000-dd9d26e2-484d-4ec0-b810-2aa10a13bf4c-c000.csv.crc
28.04.2022  15:01                 8 ._SUCCESS.crc
28.04.2022  15:01            76.075 part-00000-dd9d26e2-484d-4ec0-b810-2aa10a13bf4c-c000.csv
28.04.2022  15:01                 0 _SUCCESS
               4 File(s),         76.687 bytes
               2 Dir(s), 950.455.422.976 bytes free


In [None]:
results.repartition(100).write.csv("../../data/simple_count.csv", mode="overwrite")

#! ls -1 ../../data/simple_count.csv # for UNIX and MAC
! dir ..\..\data\simple_count.csv

There it is, folks! The first moment where we have to care about PySpark’s distributed nature. Just like PySpark will distribute the transformation work across multiple workers, it’ll do the same for writing data. While it might look like a nuisance for our simple program, it is tremendously useful when working in distributed environments. When you have a large cluster of nodes, having many smaller files makes it easy to logically distribute reading and writing the data, making it way faster than having a single massive file.

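By default, PySpark will give you one **file per partition**. This means that our program, as run on my machine, yields 200 partitions at the end, which is the default value of ``spark.sql.shuffle.partitions``. (Spark 3.2 and later enable adaptive query execution by default, which can merge small partitions after a shuffle; this is why the run shown above produced a single part file.) Depending on the environment for the number of files isn’t the best for portability. To reduce the number of partitions, we apply the ``coalesce()`` method with the desired number of partitions. The next listing shows the difference when using ``coalesce(1)`` on our data frame before writing to disk. We still get a directory, but there is a single CSV file inside of it. Mission accomplished!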


In [13]:
results.coalesce(1).write.csv("../../data/simple_count_single_partition.csv", mode="overwrite")
 
#! ls -1 ../../data/simple_count_single_partition.csv/ # for UNIX and MAC
! dir ..\..\data\simple_count_single_partition.csv

 Volume in drive D is Data
 Volume Serial Number is D824-FAAC

 Directory of d:\git\manning\DataAnalysisWithPythonAndPySpark\data\simple_count_single_partition.csv

28.04.2022  15:32    <DIR>          .
28.04.2022  15:32    <DIR>          ..
28.04.2022  15:32               604 .part-00000-0382faee-a0fa-4798-97c0-0ec26bba0cf5-c000.csv.crc
28.04.2022  15:32                 8 ._SUCCESS.crc
28.04.2022  15:32            76.075 part-00000-0382faee-a0fa-4798-97c0-0ec26bba0cf5-c000.csv
28.04.2022  15:32                 0 _SUCCESS
               4 File(s),         76.687 bytes
               2 Dir(s), 950.454.566.912 bytes free
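To make the "one file per partition" layout concrete, here is a plain-Python sketch that writes a small list of rows into a directory of part files plus a `_SUCCESS` marker. It only imitates the shape of the output: the function `write_partitions`, the simplified file names, and the round-robin split are all assumptions made for illustration, not what Spark does internally.

```python
import csv
import tempfile
from pathlib import Path

def write_partitions(rows, out_dir, num_partitions):
    """Write rows as `num_partitions` CSV part files plus a _SUCCESS marker."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(num_partitions):
        # Round-robin split: partition i gets every num_partitions-th row.
        part = rows[i::num_partitions]
        with open(out_dir / f"part-{i:05d}.csv", "w", newline="") as f:
            csv.writer(f).writerows(part)
    (out_dir / "_SUCCESS").touch()
    return sorted(p.name for p in out_dir.iterdir())

rows = [("the", 4480), ("to", 4218), ("of", 3711), ("and", 3504)]

with tempfile.TemporaryDirectory() as tmp:
    many = write_partitions(rows, Path(tmp) / "simple_count.csv", 4)
    single = write_partitions(rows, Path(tmp) / "single.csv", 1)

print(many)    # four part files and the marker
print(single)  # one part file and the marker
```

Writing with a single partition mirrors the effect of `coalesce(1)`: the output is still a directory, but it holds only one data file.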


> You might have realized that we’re not ordering the file before writing it. Since our data here is pretty small, we could have written the words by decreasing order of frequency. If you have a large data set, this operation will be quite expensive. Furthermore, since reading is a potentially distributed operation, what guarantees that it’ll get read the same way? Never assume that your data frame will keep the same ordering of records unless you explicitly ask via ``orderBy()`` right before the showing step.

## Putting it all together: Counting

The REPL allows you to go back in history using the directional arrows on your keyboard, just like a regular Python REPL. To make things a bit easier, I am providing the step-by-step program in the next listing. This section is dedicated to streamlining and making our code more succinct and readable.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    explode,
    lower,
    regexp_extract,
    split,
)
 
spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()
 
book = spark.read.text("../../data/gutenberg_books/1342-0.txt")
 
lines = book.select(split(book.value, " ").alias("line"))
 
words = lines.select(explode(col("line")).alias("word"))
 
words_lower = words.select(lower(col("word")).alias("word"))
 
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
 
words_nonull = words_clean.where(col("word") != "")
 
results = words_nonull.groupby(col("word")).count()
 
results.orderBy("count", ascending=False).show(10)
 
# mode="overwrite" lets the script run again when the directory already exists.
results.coalesce(1).write.csv("../../data/simple_count_single_partition.csv", mode="overwrite")

+----+-----+
|word|count|
+----+-----+
| the| 4480|
|  to| 4218|
|  of| 3711|
| and| 3504|
| her| 2199|
|   a| 1982|
|  in| 1909|
| was| 1838|
|   i| 1749|
| she| 1668|
+----+-----+
only showing top 10 rows



This program runs perfectly if you paste its entirety into the ``pyspark`` shell. With everything in the same file, we can make our code more friendly and make it easier for future you to come back to it. First, we adopt common import conventions when working with PySpark. We then rearrange our code to make it more readable, as seen in chapter 1.

### Simplifying your dependencies with PySpark’s import conventions

This program uses five distinct functions from the ``pyspark.sql.functions`` module. We should probably replace this with a qualified import, which is Python’s way of importing a module under an alias. While there is no hard rule, the common wisdom is to use ``F`` to refer to PySpark’s functions. The next listing shows the before and after.

In [None]:
# Before
from pyspark.sql.functions import col, explode, lower, regexp_extract, split
 
# After
import pyspark.sql.functions as F

Since ``col``, ``explode``, ``lower``, ``regexp_extract``, and ``split`` are all in ``pyspark.sql.functions``, we can import the whole module and give it the alias ``F``. The PySpark community seems to have implicitly settled on using ``F`` for ``pyspark.sql.functions``, and I encourage you to do the same. It’ll make your programs consistent, and since many functions in the module share their name with pandas or Python built-in functions, you’ll avoid name clashes. Each function application in the program will then be prefixed by ``F.``, just like with regular Python-qualified imports.
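The name-clash risk is easy to reproduce without PySpark. The sketch below uses the standard library's `math` module as a stand-in (an assumption made so the example runs anywhere): `math.pow` shares its name with the built-in `pow`, just as functions such as `sum`, `min`, `max`, and `round` in ``pyspark.sql.functions`` share theirs with Python built-ins. The alias `M` is arbitrary.

```python
import math as M   # qualified import: everything is reached through the alias

# The built-in pow() accepts an optional third (modulus) argument.
builtin_result = pow(2, 3, 5)      # (2 ** 3) % 5
qualified_result = M.pow(2, 3)     # math's version, reached through the alias

def shadowed_call():
    # Simulate `from math import pow`: the bare name now means math.pow,
    # which does not accept a modulus, so the same call fails.
    from math import pow
    try:
        return pow(2, 3, 5)
    except TypeError:
        return "TypeError"

shadowed_result = shadowed_call()

print(builtin_result, qualified_result, shadowed_result)
```

With the qualified import, both versions stay available side by side; with the unqualified one, the built-in is silently replaced.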

### Simplifying our program via method chaining

If we look at the transformation methods we applied to our data frames (``select()``, ``where()``, ``groupBy()``, and ``count()``), they all have something in common: they take a structure as a parameter—the data frame or ``GroupedData`` in the case of ``count()``—and return a structure. All transformations can be seen as **pipes** that ingest a structure and return a modified structure. This section will look at method chaining and how it makes a program less verbose and thus easier to read by eliminating intermediate variables.

In PySpark, every transformation returns an object, which is why we need to assign a variable to the result. This means that PySpark doesn’t perform modifications in place. For instance, the following code block by itself in a program wouldn’t do anything because we don’t assign the result to a variable.
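As a plain-Python analogy (not PySpark code), an immutable tuple behaves the same way: an expression that transforms it returns a new object, and unless that object is assigned to a name, the work is lost and the original is untouched.

```python
words = ("Pride", "and", "Prejudice")   # a tuple: immutable, like a data frame

# The result of the expression is computed and immediately thrown away.
tuple(w.lower() for w in words)
unchanged = words

# Assigning the result keeps the transformed copy; the original stays as it was.
words_lower = tuple(w.lower() for w in words)

print(unchanged)
print(words_lower)
```

In a REPL, the unassigned expression would at least print its value; in a script run from a file, it leaves no trace.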

We can avoid intermediate variables by chaining the results of one method to the next. Since each transformation returns a data frame (or ``GroupedData``, when we perform the ``groupby()`` method), we can directly append the next method without assigning the result to a variable. This means that we can eschew all but one variable assignment. The code in the next listing shows the before and after. Note that we also added the ``F`` prefix to our functions to respect the import convention we outlined in section 3.4.1.

In [None]:
# Before
book = spark.read.text("../../data/gutenberg_books/1342-0.txt")
 
lines = book.select(split(book.value, " ").alias("line"))
 
words = lines.select(explode(col("line")).alias("word"))
 
words_lower = words.select(lower(col("word")).alias("word"))
 
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
 
words_nonull = words_clean.where(col("word") != "")
 
results = words_nonull.groupby("word").count()

In [3]:
# After
import pyspark.sql.functions as F

results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
)


![](https://drek4537l1klr.cloudfront.net/rioux/Figures/03-03.png)

In [None]:
# Backslash continuation is an alternative to wrapping the chain in parentheses.
results = spark\
          .read.text('../../data/ch02/1342-0.txt')

## Using spark-submit to launch your program in batch mode

If we start PySpark with the ``pyspark`` program, the launcher takes care of creating the ``SparkSession`` for us. In chapter 2, we started from a basic Python REPL, so we created our entry point and named it ``spark``. This section takes our program and submits it in batch mode. It is the equivalent of running a Python script; if you only need the result and not the REPL, this will do the trick.

Unlike the interactive REPL, where the choice of language determines which program to launch, Spark provides a single program, named ``spark-submit``, to submit Spark (Scala, Java, SQL), PySpark (Python), and SparkR (R) programs, as shown in listing 3.10. The full code for our program is available on the book’s repository under ``code/Ch03/word_count_submit.py``.


In [1]:
! spark-submit --help

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-

In [4]:
! spark-submit ./word_count_submit.py
 

+----+-----+
|word|count|
+----+-----+
| the| 4480|
|  to| 4218|
|  of| 3711|
| and| 3504|
| her| 2199|
|   a| 1982|
|  in| 1909|
| was| 1838|
|   i| 1749|
| she| 1668|
+----+-----+
only showing top 10 rows



Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/04/29 15:39:02 INFO SparkContext: Running Spark version 3.2.1
22/04/29 15:39:02 INFO ResourceUtils: No custom resources configured for spark.driver.
22/04/29 15:39:02 INFO SparkContext: Submitted application: Counting word occurences from a book.
22/04/29 15:39:02 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/04/29 15:39:02 INFO ResourceProfile: Limiting resource is cpu
22/04/29 15:39:02 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/04/29 15:39:02 INFO SecurityManager: Changing view acls to: micha
22/04/29 15:39:02 INFO SecurityManager: Changing modify acls to: micha
22/04/29 15:39:02 INFO SecurityManager: Changing view acls groups to: 


> If you get a deluge of ``INFO`` messages, don’t forget that you have control over this: use ``spark.sparkContext.setLogLevel("WARN")`` right after your ``spark`` definition. If your local configuration has ``INFO`` as a default, you’ll still get a slew of messages until it catches this line, but it won’t obscure your results.

## Scaling up our word frequency program

Teaching big data processing has a catch-22. While I want to show the power of PySpark to work with massive data sets, I don’t want you to purchase a cluster or rack up a massive cloud bill. It’s easier to show you the ropes using a smaller set of data, knowing that we can scale using the same code.

Let’s take our word-counting example: How can we scale this to a larger corpus of text? Let’s download more files from Project Gutenberg and place them in the same directory:

In [5]:
! dir "../../data/gutenberg_books"

 Volume in drive D is Data
 Volume Serial Number is D824-FAAC

 Directory of d:\git\manning\DataAnalysisWithPythonAndPySpark\data\gutenberg_books

26.04.2022  20:16    <DIR>          .
29.04.2022  15:32    <DIR>          ..
26.04.2022  20:16           173.595 11-0.txt
26.04.2022  20:16           724.726 1342-0.txt
26.04.2022  20:16           607.788 1661-0.txt
26.04.2022  20:16         1.276.201 2701-0.txt
26.04.2022  20:16         1.076.254 30254-0.txt
26.04.2022  20:16           450.783 84-0.txt
               6 File(s),      4.309.347 bytes
               2 Dir(s), 950.447.849.472 bytes free


We modify our ``word_count_submit.py`` in a very subtle way. Where we ``.read.text()``, we’ll change the path to account for all files in the directory. The next listing shows the before and after: we are only changing the ``1342-0.txt`` to a ``*.txt``, which is called a glob pattern. The ``*`` means that Spark selects all the ``.txt`` files in the directory.

In [13]:
# Before
results = spark.read.text('../../data/gutenberg_books/1342-0.txt')
print("Count single book: ", results.count())
 
# After
results = spark.read.text('../../data/gutenberg_books/*.txt')
print("Count all books: ", results.count())

Count single book:  13427
Count all books:  77910


> You can also just pass the name of the directory if you want PySpark to ingest all the files within the directory.
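To see what the `*` in a glob pattern selects, here is a small sketch using Python's `fnmatch` module on the file names from the directory listing above. Spark resolves the pattern itself when reading, so `fnmatch` is only a stand-in with similar `*` semantics; the `README.md` entry is hypothetical and added to show what the pattern leaves out.

```python
from fnmatch import fnmatch

# File names from the directory listing above, plus one hypothetical
# non-text file to show what the pattern leaves out.
files = ["11-0.txt", "1342-0.txt", "1661-0.txt", "2701-0.txt",
         "30254-0.txt", "84-0.txt", "README.md"]

# A literal name matches exactly one file; `*.txt` matches every .txt file.
single = [f for f in files if fnmatch(f, "1342-0.txt")]
all_txt = [f for f in files if fnmatch(f, "*.txt")]

print(single)
print(all_txt)
```

The pattern keeps the six books and skips anything that doesn't end in `.txt`, which is a reason to prefer `*.txt` over the bare directory name when the directory may hold other files.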

With this, you can confidently say that you can scale a simple data analysis program using PySpark. You can use the general formula we’ve outlined here and modify some of the parameters and methods to fit your use case. Chapters 4 and 5 will dig a little deeper into some interesting and common data transformations, building on what we’ve learned here.

## Summary

- You can group records using the ``groupby`` method, passing the column names you want to group against as a parameter. This returns a ``GroupedData`` object that waits for an aggregation method to return the results of computation over the groups, such as the ``count()`` of records.
- PySpark’s repertoire of functions that operates on columns is located in ``pyspark.sql.functions``. The unofficial but well-respected convention is to qualify this import in your program using the ``F`` alias.
- When writing a data frame to a file, PySpark will create a directory and put one file per partition. If you want to write a single file, use the ``coalesce(1)`` method.
- To prepare your program to work in batch mode via ``spark-submit``, you need to create a ``SparkSession``. PySpark provides a builder pattern in the ``pyspark.sql`` module.
- If your program needs to scale across multiple files within the same directory, you can use a glob pattern to select many files at once. PySpark will collect them in a single data frame.


## Additional Exercises

For these exercises, you’ll need the ``word_count_submit.py`` program we worked on in this chapter. You can pick it from the book’s code repository (``code/Ch03/word_count_submit.py``).


### Exercise 3.3

1. By modifying the ``word_count_submit.py`` program, return the number of distinct words in Jane Austen’s Pride and Prejudice. (Hint: ``results`` contains one record for each unique word.)

In [17]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


spark = SparkSession.builder.appName(
    "Counting word occurences from a book."
).getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# If you need to read multiple text files, replace `1342-0` by `*`.
results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")  # or "*.txt" for all txt files, or only the parent dir
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)

results.count()
#results.coalesce(1).write.csv("../../data/results_single_partition.csv")


6595

In [19]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


spark = SparkSession.builder.appName(
    "Counting word occurences from a book."
).getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# If you need to read multiple text files, replace `1342-0` by `*`.
results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .distinct()
)

results.count()
#results.coalesce(1).write.csv("../../data/results_single_partition.csv")


6595

2. (Challenge) Wrap your program in a function that takes a file name as a parameter. It should return the number of distinct words.

In [24]:
def get_unique_word_count(file: str):
    results = (
        spark.read.text(file)
        .select(F.split(F.col("value"), " ").alias("line"))
        .select(F.explode(F.col("line")).alias("word"))
        .select(F.lower(F.col("word")).alias("word"))
        .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
        .where(F.col("word") != "")
        .distinct()
    )

    return results.count()

get_unique_word_count("../../data/gutenberg_books/*.txt")

23754

### Exercise 3.4

Taking ``word_count_submit.py``, modify the script to return a sample of five words that appear only once in Jane Austen’s Pride and Prejudice.


In [27]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


spark = SparkSession.builder.appName(
    "Counting word occurences from a book."
).getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# If you need to read multiple text files, replace `1342-0` by `*`.
results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupBy(F.col("word"))
    .count()
    .filter(F.col("count") == 1)
)

results.show(5)
#results.coalesce(1).write.csv("../../data/results_single_partition.csv")


+------------+-----+
|        word|count|
+------------+-----+
|   imitation|    1|
|     solaced|    1|
|premeditated|    1|
|     elevate|    1|
|   destitute|    1|
+------------+-----+
only showing top 5 rows



### Exercise 3.5

1. Using the ``substring`` function (refer to PySpark’s API or the pyspark shell if needed), return the top five most popular first letters (keep only the first letter of each word).


In [30]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


spark = SparkSession.builder.appName(
    "Counting word occurences from a book."
).getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# If you need to read multiple text files, replace `1342-0` by `*`.
results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .select(F.substring(F.col("word"), 1, 1).alias("first_letter"))
    .groupBy(F.col("first_letter"))
    .count()
)

results.orderBy("count", ascending=False).show(5)
#results.coalesce(1).write.csv("../../data/results_single_partition.csv")


+------------+-----+
|first_letter|count|
+------------+-----+
|           t|16101|
|           a|13684|
|           h|10419|
|           w| 9091|
|           s| 8791|
+------------+-----+
only showing top 5 rows



2. Compute the number of words starting with a consonant or a vowel. (Hint: The ``isin()`` method of a column might be useful.)

In [36]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


spark = SparkSession.builder.appName(
    "Counting word occurences from a book."
).getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# If you need to read multiple text files, replace `1342-0` by `*`.
words_with_vowel = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .filter(F.substring(F.col("word"), 1, 1).isin("a","e","i","o","u").alias("starts_with_vowel"))
    .count()
)

words_with_vowel

33522

In [38]:
words_without_vowel = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .filter(~F.substring(F.col("word"), 1, 1).isin("a","e","i","o","u").alias("starts_with_vowel"))
    .count()
)

words_without_vowel

88653

In [40]:
(
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .count()
)


127368

### Exercise 3.6

Let’s say you want to get both the ``count()`` and ``sum()`` of a ``GroupedData`` object. Why doesn’t this code work? Map the inputs and outputs of each method.

In [None]:
my_data_frame.groupby("my_column").count().sum()