# Cleaning Data with PySpark - Part 2

## Manipulating DataFrames in the real world
A look at various techniques to modify the contents of DataFrames in Spark.

In [1]:
BUCKET = 'driven-actor-210609'

In [59]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

In [3]:
spark = SparkSession.builder.getOrCreate()


### Filtering column content with Python
You've looked at using various operations on DataFrame columns - now you can modify a real dataset. The DataFrame `voter_df` contains information regarding the voters on the Dallas City Council from the past few years. This truncated DataFrame contains the date of the vote being cast and the name and position of the voter. Your manager has asked you to clean this data so it can later be integrated into some desired reports. The primary task is to remove any null entries or odd characters and return a specific set of voters where you can validate their information.

This is often one of the first steps in data cleaning: removing anything that is obviously outside the expected format. For this dataset, make sure to look at the original data and see what looks out of place for the `VOTER_NAME` column.

The `pyspark.sql.functions` library is already imported under the alias `F`.

In [73]:
file_path = f'gs://{BUCKET}/pyspark/datasets/DallasCouncilVoters.csv.gz'
# file_path = 'datasets/DallasCouncilVoters.csv.gz'

voter_df = spark.read.format('csv').options(Header=True).load(file_path)

In [74]:
voter_df.show(10)

+----------+-------------+-------------------+
|      DATE|        TITLE|         VOTER_NAME|
+----------+-------------+-------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|
|02/08/2017|Councilmember| Philip T. Kingston|
|02/08/2017|        Mayor|Michael S. Rawlings|
|02/08/2017|Councilmember|       Adam Medrano|
|02/08/2017|Councilmember|       Casey Thomas|
|02/08/2017|Councilmember|Carolyn King Arnold|
|02/08/2017|Councilmember|       Scott Griggs|
|02/08/2017|Councilmember|   B. Adam  McGough|
|02/08/2017|Councilmember|       Lee Kleinman|
|02/08/2017|Councilmember|      Sandy Greyson|
+----------+-------------+-------------------+
only showing top 10 rows



In [75]:
# Show the distinct VOTER_NAME entries
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)

+----------+
|VOTER_NAME|
+----------+
...

In [76]:
# Filter voter_df where the VOTER_NAME is 1-19 characters in length (length > 0 and < 20)
voter_df = voter_df.filter('length(VOTER_NAME) > 0 and length(VOTER_NAME) < 20')

# Filter out voter_df where the VOTER_NAME contains an underscore
voter_df = voter_df.filter(~ F.col('VOTER_NAME').contains('_'))

# Filter out voter_df where the VOTER_NAME starts with "S"
voter_df = voter_df.filter(~ voter_df['VOTER_NAME'].like('S%'))

# Filter voter_df where VOTER_NAME is not null
voter_df = voter_df.filter(voter_df['VOTER_NAME'].isNotNull())

In [77]:
# Show the distinct VOTER_NAME entries again
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)

+-------------------+
|VOTER_NAME         |
+-------------------+
|Mark  Clayton      |
|Omar Narvaez       |
|Dwaine R. Caraway  |
|Casey Thomas       |
|Tiffinni A. Young  |
|Carolyn King Arnold|
|Casey  Thomas      |
|B. Adam  McGough   |
|Philip T. Kingston |
|Monica R. Alonzo   |
|Michael S. Rawlings|
|Kevin Felder       |
|Adam Medrano       |
|Mark Clayton       |
|Rickey D. Callahan |
|Lee M. Kleinman    |
|Erik Wilson        |
|Tennell Atkins     |
|Jennifer S.  Gates |
|Philip T.  Kingston|
|Jennifer S. Gates  |
|Rickey D.  Callahan|
|Lee Kleinman       |
+-------------------+
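
The four filters can be sanity-checked outside Spark with a plain-Python mirror of the same rules. This is only a sketch: `is_valid_voter_name` is a hypothetical helper, and `'011018__42'` is a made-up malformed value, not a row taken from the dataset.

```python
from typing import Optional


def is_valid_voter_name(name: Optional[str]) -> bool:
    """Plain-Python mirror of the four filters applied to VOTER_NAME."""
    if name is None:               # .isNotNull()
        return False
    if not 0 < len(name) < 20:     # length(VOTER_NAME) > 0 and length(VOTER_NAME) < 20
        return False
    if '_' in name:                # ~ F.col('VOTER_NAME').contains('_')
        return False
    if name.startswith('S'):       # ~ voter_df['VOTER_NAME'].like('S%')
        return False
    return True


# Hypothetical sample values, including a null and a malformed entry
samples = ['Michael S. Rawlings', 'Scott Griggs', 'Sandy Greyson', '', None, '011018__42']
kept = [name for name in samples if is_valid_voter_name(name)]
print(kept)
```

In Spark, a comparison against a null value evaluates to null rather than true, so rows with a null `VOTER_NAME` are already dropped by the length filter. The final `isNotNull()` filter mostly documents the intent.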



### Modifying DataFrame columns
Previously, you filtered out any rows that didn't conform to something generally resembling a name. Now, based on your earlier work, your manager has asked you to create two new columns: `first_name` and `last_name`. She asks you to split the `VOTER_NAME` column into words on any whitespace. You'll treat the last word as the `last_name` and the first word as the `first_name`. You'll be using some new functions in this exercise, including `.split()`, `.size()`, and `.getItem()`. The `.getItem(index)` method takes an integer index and returns the item at that position in an array column. The functions `.split()` and `.size()` are in the `pyspark.sql.functions` library.

Please note that these operations are always somewhat specific to the use case. Having your data conform to a format often matters more than the specific details of the format. Rarely is a data cleaning task meant just for one person. Matching a defined format allows for easier sharing of the data later (i.e., Paul doesn't need to worry about names because Mary already cleaned the dataset).

The filtered voter DataFrame from your previous exercise is available as `voter_df`. The `pyspark.sql.functions` library is available under the alias `F`.

In [78]:
# Add a new column called splits separated on whitespace
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, r'\s+'))

# Create a new column called first_name based on the first item in splits
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))

# Get the last entry of the splits list and create a column called last_name
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(F.size('splits') - 1))

In [79]:
voter_df.show(10, truncate=False)

+----------+-------------+-------------------+-----------------------+----------+---------+
|DATE      |TITLE        |VOTER_NAME         |splits                 |first_name|last_name|
+----------+-------------+-------------------+-----------------------+----------+---------+
|02/08/2017|Councilmember|Jennifer S. Gates  |[Jennifer, S., Gates]  |Jennifer  |Gates    |
|02/08/2017|Councilmember|Philip T. Kingston |[Philip, T., Kingston] |Philip    |Kingston |
|02/08/2017|Mayor        |Michael S. Rawlings|[Michael, S., Rawlings]|Michael   |Rawlings |
|02/08/2017|Councilmember|Adam Medrano       |[Adam, Medrano]        |Adam      |Medrano  |
|02/08/2017|Councilmember|Casey Thomas       |[Casey, Thomas]        |Casey     |Thomas   |
|02/08/2017|Councilmember|Carolyn King Arnold|[Carolyn, King, Arnold]|Carolyn   |Arnold   |
|02/08/2017|Councilmember|B. Adam  McGough   |[B., Adam, McGough]    |B.        |McGough  |
|02/08/2017|Councilmember|Lee Kleinman       |[Lee, Kleinman]        |Lee       
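
To see why the pattern matches runs of whitespace rather than a single space, the same split can be tried with Python's `re` module. This is a rough stand-in for `F.split()`, not Spark itself, and `split_name` is a hypothetical helper.

```python
import re


def split_name(voter_name: str) -> dict:
    """Split on runs of whitespace and pick the first and last words."""
    splits = re.split(r'\s+', voter_name)
    return {
        'splits': splits,
        'first_name': splits[0],   # like splits.getItem(0)
        'last_name': splits[-1],   # like splits.getItem(size(splits) - 1)
    }


for name in ['Jennifer S. Gates', 'B. Adam  McGough', 'Adam Medrano']:
    print(split_name(name))

# A leading space yields an empty first element in this Python version
leading = re.split(r'\s+', ' Adam')
print(leading)
```

The double space in `B. Adam  McGough` is consumed as one separator, so no empty string ends up in the list. Values with leading whitespace are a different story in this sketch, which suggests trimming before splitting may be worth considering.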

In [80]:
# Convert string to date format
voter_df = voter_df.withColumn('dt', F.to_date(voter_df['DATE'], 'MM/dd/yyyy'))

# Create a new column called year with year from date column
voter_df = voter_df.withColumn('year', F.year(voter_df['dt']))

# Drop the dt column
voter_df = voter_df.drop('dt')

# Show the voter_df DataFrame
voter_df.show()

+----------+--------------------+-------------------+--------------------+----------+---------+----+
|      DATE|               TITLE|         VOTER_NAME|              splits|first_name|last_name|year|
+----------+--------------------+-------------------+--------------------+----------+---------+----+
|02/08/2017|       Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|  Jennifer|    Gates|2017|
|02/08/2017|       Councilmember| Philip T. Kingston|[Philip, T., King...|    Philip| Kingston|2017|
|02/08/2017|               Mayor|Michael S. Rawlings|[Michael, S., Raw...|   Michael| Rawlings|2017|
|02/08/2017|       Councilmember|       Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2017|
|02/08/2017|       Councilmember|       Casey Thomas|     [Casey, Thomas]|     Casey|   Thomas|2017|
|02/08/2017|       Councilmember|Carolyn King Arnold|[Carolyn, King, A...|   Carolyn|   Arnold|2017|
|02/08/2017|       Councilmember|   B. Adam  McGough| [B., Adam, McGough]|        B.|  McGo
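
The `'MM/dd/yyyy'` pattern reads the first field as the month; in Spark's datetime patterns a lowercase `m` means minutes instead. As a quick cross-check outside Spark, the same parse can be sketched with Python's `datetime`, where `'%m/%d/%Y'` plays the equivalent role. The helper name `vote_year` is hypothetical.

```python
from datetime import datetime


def vote_year(date_string: str) -> int:
    """Parse a month/day/year string and return the year."""
    # '%m/%d/%Y' is the strptime counterpart of the 'MM/dd/yyyy' pattern
    return datetime.strptime(date_string, '%m/%d/%Y').year


parsed = datetime.strptime('02/08/2017', '%m/%d/%Y')
print(vote_year('02/08/2017'))   # 2017
print(parsed.month, parsed.day)  # 2 8 (February 8th, not August 2nd)
```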

### when() example
The `when()` clause lets you conditionally modify a DataFrame based on its content. You'll want to modify your `voter_df` DataFrame to add a random number to any voting member that is defined as a "Councilmember".

The `voter_df` DataFrame is defined and available to you. The `pyspark.sql.functions` library is available as `F`. You can use `F.rand()` to generate the random value.

In [81]:
# Add a random_val column that is set only for voters with the title Councilmember
voter_df = voter_df.withColumn('random_val',
                               F.when(voter_df.TITLE == 'Councilmember', F.rand()))

# Show some of the DataFrame rows, noting whether the when clause worked
voter_df.show()

+----------+--------------------+-------------------+--------------------+----------+---------+----+--------------------+
|      DATE|               TITLE|         VOTER_NAME|              splits|first_name|last_name|year|          random_val|
+----------+--------------------+-------------------+--------------------+----------+---------+----+--------------------+
|02/08/2017|       Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|  Jennifer|    Gates|2017|   0.927784486278443|
|02/08/2017|       Councilmember| Philip T. Kingston|[Philip, T., King...|    Philip| Kingston|2017|  0.2802769543704745|
|02/08/2017|               Mayor|Michael S. Rawlings|[Michael, S., Raw...|   Michael| Rawlings|2017|                NULL|
|02/08/2017|       Councilmember|       Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2017|  0.9243553842147799|
|02/08/2017|       Councilmember|       Casey Thomas|     [Casey, Thomas]|     Casey|   Thomas|2017| 0.20573991356703414|
|02/08/2017|       Counc

### When / Otherwise
This requirement is similar to the last, but now you want to add multiple values based on the voter's position. Modify your `voter_df` DataFrame to add a random number to any voting member that is defined as a `Councilmember`. Use 2 for the `Mayor` and 0 for any other position.

The `voter_df` DataFrame is defined and available to you. The `pyspark.sql.functions` library is available as `F`. You can use `F.rand()` to generate the random value.

In [82]:
# Add a column to voter_df for a voter based on their position
voter_df = voter_df.withColumn('random_val',
                               F.when(voter_df.TITLE == 'Councilmember', F.rand())
                               .when(voter_df.TITLE == 'Mayor', 2)
                               .otherwise(0))

# Show some of the DataFrame rows
voter_df.show()

+----------+--------------------+-------------------+--------------------+----------+---------+----+--------------------+
|      DATE|               TITLE|         VOTER_NAME|              splits|first_name|last_name|year|          random_val|
+----------+--------------------+-------------------+--------------------+----------+---------+----+--------------------+
|02/08/2017|       Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|  Jennifer|    Gates|2017|  0.5832721810453418|
|02/08/2017|       Councilmember| Philip T. Kingston|[Philip, T., King...|    Philip| Kingston|2017|  0.2788913277014696|
|02/08/2017|               Mayor|Michael S. Rawlings|[Michael, S., Raw...|   Michael| Rawlings|2017|                 2.0|
|02/08/2017|       Councilmember|       Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2017|0.020739051913620576|
|02/08/2017|       Councilmember|       Casey Thomas|     [Casey, Thomas]|     Casey|   Thomas|2017|  0.5459341954924434|
|02/08/2017|       Counc

In [83]:
# Use the .filter() clause with random_val
voter_df.filter(voter_df.random_val == 0).show()

+----------+--------------------+-----------------+--------------------+----------+---------+----+----------+
|      DATE|               TITLE|       VOTER_NAME|              splits|first_name|last_name|year|random_val|
+----------+--------------------+-----------------+--------------------+----------+---------+----+----------+
|04/25/2018|Deputy Mayor Pro Tem|     Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2018|       0.0|
|04/25/2018|       Mayor Pro Tem|Dwaine R. Caraway|[Dwaine, R., Cara...|    Dwaine|  Caraway|2018|       0.0|
|06/20/2018|Deputy Mayor Pro Tem|     Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2018|       0.0|
|06/20/2018|       Mayor Pro Tem|Dwaine R. Caraway|[Dwaine, R., Cara...|    Dwaine|  Caraway|2018|       0.0|
|06/20/2018|Deputy Mayor Pro Tem|     Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2018|       0.0|
|06/20/2018|       Mayor Pro Tem|Dwaine R. Caraway|[Dwaine, R., Cara...|    Dwaine|  Caraway|2018|       0.0|
|08/15/201
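
The chained `when()` / `otherwise()` expression behaves like an if / elif / else ladder evaluated per row: the first matching branch wins. A plain-Python sketch of that logic follows; `random_val_for` is a hypothetical helper, and the seeded `random.Random` merely stands in for `F.rand()`.

```python
import random


def random_val_for(title: str, rng: random.Random) -> float:
    """Mirror of the when/when/otherwise chain for a single row."""
    if title == 'Councilmember':   # F.when(TITLE == 'Councilmember', F.rand())
        return rng.random()
    if title == 'Mayor':           # .when(TITLE == 'Mayor', 2)
        return 2.0
    return 0.0                     # .otherwise(0)


rng = random.Random(42)
titles = ['Councilmember', 'Mayor', 'Mayor Pro Tem', 'Deputy Mayor Pro Tem']
values = [random_val_for(title, rng) for title in titles]
print(list(zip(titles, values)))
```

The comparison is an exact match, so `Mayor Pro Tem` and `Deputy Mayor Pro Tem` fall through to the default, which agrees with the rows returned by the `random_val == 0` filter. Without the `otherwise()` branch, unmatched rows hold a null instead, as in the `Mayor` row of the first `when()` output.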

### Using user defined functions in Spark
You've seen some of the power behind Spark's built-in string functions when it comes to manipulating DataFrames. However, once you reach a certain point, it becomes difficult to process the data without creating a rat's nest of function calls. Here's one place where you can use User Defined Functions to manipulate your DataFrames.

For this exercise, you'll use your `voter_df` DataFrame and add a `first_and_middle_name` column that holds the first and middle names, a fuller alternative to the existing `first_name` column.

The `pyspark.sql.functions` library is available under the alias `F`. The classes from `pyspark.sql.types` are already imported.

In [84]:
def getFirstAndMiddle(names):
  # Return a space separated string of names
  return ' '.join(names[:-1])

# Define the method as a UDF
udfFirstAndMiddle = F.udf(getFirstAndMiddle, StringType())

# Create a new column using your UDF
voter_df = voter_df.withColumn('first_and_middle_name', udfFirstAndMiddle(voter_df['splits']))

# Show the DataFrame
voter_df.show()

+----------+--------------------+-------------------+--------------------+----------+---------+----+--------------------+---------------------+
|      DATE|               TITLE|         VOTER_NAME|              splits|first_name|last_name|year|          random_val|first_and_middle_name|
+----------+--------------------+-------------------+--------------------+----------+---------+----+--------------------+---------------------+
|02/08/2017|       Councilmember|  Jennifer S. Gates|[Jennifer, S., Ga...|  Jennifer|    Gates|2017|  0.5832721810453418|          Jennifer S.|
|02/08/2017|       Councilmember| Philip T. Kingston|[Philip, T., King...|    Philip| Kingston|2017|  0.2788913277014696|            Philip T.|
|02/08/2017|               Mayor|Michael S. Rawlings|[Michael, S., Raw...|   Michael| Rawlings|2017|                 2.0|           Michael S.|
|02/08/2017|       Councilmember|       Adam Medrano|     [Adam, Medrano]|      Adam|  Medrano|2017|0.020739051913620576|               
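
Because a UDF wraps an ordinary Python function, that function can be exercised on plain lists before it is registered with `F.udf()`. The sketch below restates the logic under a snake_case name and adds a null-tolerant variant; both names are hypothetical. The variant assumes a null array reaches the function as `None`, which is how PySpark passes nulls to Python UDFs.

```python
from typing import List, Optional


def get_first_and_middle(names: List[str]) -> str:
    """Join every element except the last with single spaces."""
    return ' '.join(names[:-1])


def get_first_and_middle_safe(names: Optional[List[str]]) -> Optional[str]:
    """Same logic, but pass a missing list through as None instead of raising."""
    if names is None:
        return None
    return ' '.join(names[:-1])


print(get_first_and_middle(['Jennifer', 'S.', 'Gates']))  # Jennifer S.
print(get_first_and_middle(['Adam', 'Medrano']))          # Adam
print(get_first_and_middle_safe(None))                    # None
```

A single-word name yields an empty string here, since every element except the last is joined. Whether that is acceptable depends on the reporting requirements.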

### Adding an ID Field
When working with data, you sometimes only want to access certain fields and perform various operations. In this case, find all the unique voter names from the DataFrame and add a unique ID number. Remember that Spark IDs are assigned based on the DataFrame partition - as such the ID values may be much greater than the actual number of rows in the DataFrame.

With Spark's lazy processing, the IDs are not actually generated until an action is performed and can be somewhat random depending on the size of the dataset.

The `spark` session and the `voter_df` DataFrame, built from the `DallasCouncilVoters.csv.gz` file, are available in your workspace. The `pyspark.sql.functions` library is available under the alias `F`.

In [85]:
# Select all the unique council voters
voter_df = voter_df.select(voter_df["VOTER_NAME"]).distinct()

# Count the rows in voter_df
print("\nThere are %d rows in the voter_df DataFrame.\n" % voter_df.count())

# Add a ROW_ID
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())

# Show the rows with 10 highest IDs in the set
voter_df.orderBy(voter_df.ROW_ID.desc()).show(10)


There are 23 rows in the voter_df DataFrame.

+-------------------+------+
|         VOTER_NAME|ROW_ID|
+-------------------+------+
|       Lee Kleinman|    22|
|Rickey D.  Callahan|    21|
|  Jennifer S. Gates|    20|
|Philip T.  Kingston|    19|
| Jennifer S.  Gates|    18|
|     Tennell Atkins|    17|
|        Erik Wilson|    16|
|    Lee M. Kleinman|    15|
| Rickey D. Callahan|    14|
|       Mark Clayton|    13|
+-------------------+------+
only showing top 10 rows
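
The Spark documentation describes the current implementation of `monotonically_increasing_id()` as placing the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below simulates that layout in plain Python to show why IDs can jump far beyond the row count. The helper `simulated_row_ids` and the partition sizes are hypothetical.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def simulated_row_ids(partition_sizes):
    """Generate IDs as (partition_id << 33) + record_number for each row."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_id << RECORD_BITS) + record_number)
    return ids


one_partition = simulated_row_ids([5])      # all rows in a single partition
two_partitions = simulated_row_ids([3, 2])  # same row count, split in two

print(one_partition)   # [0, 1, 2, 3, 4]
print(two_partitions)  # [0, 1, 2, 8589934592, 8589934593]
```

With a single partition the IDs look like a simple counter, which matches the 0 to 22 range shown above for 23 rows.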



### IDs with different partitions
You've just completed adding an ID field to a DataFrame. Now, take a look at what happens when you do the same thing on DataFrames containing a different number of partitions.

To check the number of partitions, use the method `.rdd.getNumPartitions()` on a DataFrame.

The `spark` session is available in your workspace. Two DataFrames derived from `voter_df`, `voter_df_double` and `voter_df_single`, are created below with two partitions and one partition respectively. The instructions will help you discover the difference between the DataFrames. The `pyspark.sql.functions` library is available under the alias `F`.

In [86]:
voter_df_single = voter_df.select("*")
voter_df_double = voter_df.select("*")
voter_df_single = voter_df_single.repartition(1, "VOTER_NAME")
voter_df_double = voter_df_double.repartition(2, "VOTER_NAME")

In [87]:
# Print the number of partitions in each DataFrame
print("\nThere are %d partitions in the voter_df DataFrame.\n" % voter_df_double.rdd.getNumPartitions())
print("\nThere are %d partitions in the voter_df_single DataFrame.\n" % voter_df_single.rdd.getNumPartitions())

# Add a ROW_ID field to each DataFrame
voter_df_double = voter_df_double.withColumn('ROW_ID', F.monotonically_increasing_id())
voter_df_single = voter_df_single.withColumn('ROW_ID', F.monotonically_increasing_id())

# Show the top 10 IDs in each DataFrame 
voter_df_double.orderBy(voter_df_double.ROW_ID.desc()).show(10)
voter_df_single.orderBy(voter_df_single.ROW_ID.desc()).show(10)


There are 2 partitions in the voter_df DataFrame.


There are 1 partitions in the voter_df_single DataFrame.

+------------------+----------+
|        VOTER_NAME|    ROW_ID|
+------------------+----------+
|      Lee Kleinman|8589934603|
| Jennifer S. Gates|8589934602|
|    Tennell Atkins|8589934601|
|   Lee M. Kleinman|8589934600|
|Rickey D. Callahan|8589934599|
|      Kevin Felder|8589934598|
|  Monica R. Alonzo|8589934597|
|Philip T. Kingston|8589934596|
|  B. Adam  McGough|8589934595|
|     Casey  Thomas|8589934594|
+------------------+----------+
only showing top 10 rows

+-------------------+------+
|         VOTER_NAME|ROW_ID|
+-------------------+------+
|       Lee Kleinman|    22|
|Rickey D.  Callahan|    21|
|  Jennifer S. Gates|    20|
|Philip T.  Kingston|    19|
| Jennifer S.  Gates|    18|
|     Tennell Atkins|    17|
|        Erik Wilson|    16|
|    Lee M. Kleinman|    15|
| Rickey D. Callahan|    14|
|       Mark Clayton|    13|
+-------------------+------+
only show
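
The large values in the two-partition output can be decoded by hand, assuming the documented layout of `monotonically_increasing_id()` (partition ID in the upper 31 bits, record number in the lower 33 bits). `decode_row_id` is a hypothetical helper written for this illustration.

```python
RECORD_BITS = 33
RECORD_MASK = (1 << RECORD_BITS) - 1


def decode_row_id(row_id: int):
    """Return (partition_id, record_number) for a generated ID."""
    partition_id = row_id >> RECORD_BITS
    record_number = row_id & RECORD_MASK
    return partition_id, record_number


print(decode_row_id(8589934603))  # (1, 11)
print(decode_row_id(8589934594))  # (1, 2)
print(decode_row_id(22))          # (0, 22)
```

So `8589934603` is simply record 11 of partition 1, while the single-partition IDs stay small because every row lives in partition 0.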

### More ID tricks
Once you define a Spark process, you'll likely want to use it many times. Depending on your needs, you may want to start your IDs at a certain value so there isn't overlap with previous runs of the Spark task. This behavior is similar to how IDs would behave in a relational database. You have been given the task to make sure that the IDs output from a monthly Spark task start just after the highest value from the previous month.

The `spark` session and two DataFrames, `voter_df_march` and `voter_df_april`, are available in your workspace. The `pyspark.sql.functions` library is available under the alias `F`.

In [97]:
voter_df_march = voter_df.offset(2).limit(10)
voter_df_april = voter_df.offset(12).limit(10)

In [98]:
# Determine the highest ROW_ID, add 1, and save the result in previous_max_ID
previous_max_ID = voter_df_march.select('ROW_ID').rdd.max()[0] + 1

# Add a ROW_ID column to voter_df_april starting at the desired value
voter_df_april = voter_df_april.withColumn('ROW_ID', previous_max_ID + F.monotonically_increasing_id())

# Show the ROW_ID from both DataFrames and compare
voter_df_march.select('ROW_ID').show()
voter_df_april.select('ROW_ID').show()

+------+
|ROW_ID|
+------+
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
+------+

+------+
|ROW_ID|
+------+
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
|    20|
|    21|
+------+
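
The offset arithmetic can be checked in plain Python. In this sketch `shift_ids` is a hypothetical helper, `march_ids` copies the IDs shown above, and `generated_ids` stands in for whatever `monotonically_increasing_id()` would produce for the new rows.

```python
def shift_ids(previous_ids, generated_ids):
    """Offset freshly generated IDs so they start after the previous maximum."""
    start = max(previous_ids) + 1
    return [start + generated for generated in generated_ids]


march_ids = list(range(2, 12))                # 2..11, as in the March output
april_ids = shift_ids(march_ids, range(10))   # one partition: offsets 0..9
print(april_ids)                              # [12, 13, ..., 21]

# Offsets as they might look with rows spread over two partitions
sparse_ids = shift_ids(march_ids, [0, 1, 1 << 33])
print(sparse_ids)                             # [12, 13, 8589934604]
```

Adding a constant keeps the IDs unique and above the previous maximum even when the generated values are not consecutive, so the two months never overlap.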

