# Feature Engineering with PySpark

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, their data must be transformed before machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

## Table of Contents

- [Introduction](#intro)
- [Wrangling with Spark Functions](#wra)

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

path = "data/dc34/"

In [2]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('First App').getOrCreate()

---
<a id='intro'></a>

## Where to Begin

<img src="images/spark4_001.png" alt="" style="width: 800px;"/>

<img src="images/spark4_002.png" alt="" style="width: 800px;"/>

<img src="images/spark4_003.png" alt="" style="width: 800px;"/>

<img src="images/spark4_004.png" alt="" style="width: 800px;"/>

## Check Version

Checking which versions of Spark and Python are installed is important because both change quickly and drastically between releases. Reading the wrong documentation can cost a lot of time and cause unnecessary frustration!

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the [PySpark Cheat Sheet](https://datacamp-community-prod.s3.amazonaws.com/65076e3c-9df1-40d5-a0c2-36294d9a3ca9) and keep it handy!

In [4]:
# Return spark version
print(spark.version)

# Return python version
import sys
print(sys.version_info)

2.4.4
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
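
One way to act on the version check is to fail fast when the running versions differ from the ones these notes were written against. The following is a minimal sketch in plain Python; the helper `parse_major_minor` and the minimum versions are illustrative assumptions, not part of the course material.

```python
import sys


def parse_major_minor(version):
    """Turn a version string such as '2.4.4' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# In the notebook this string would come from spark.version; it is hard-coded
# here so the sketch runs without a Spark session.
spark_version = "2.4.4"

# Assumed minimums, chosen only to match the versions printed above
assert parse_major_minor(spark_version) >= (2, 4), "notes were written against Spark 2.4"
assert sys.version_info >= (3, 0), "notes assume Python 3"
print(parse_major_minor(spark_version))
```

Comparing tuples rather than raw strings avoids the trap where `'2.10'` sorts before `'2.4'` alphabetically.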


## Load in the data

Reading in data is the first step to using PySpark for data science! Let's leverage the new industry standard of parquet files!

In [None]:
# Read the file into a dataframe
df = spark.read.parquet('Real_Estate.parq')
# Print columns in dataframe
print(df.columns)

```
<script.py> output:
    ['NO', 'MLSID', 'STREETNUMBERNUMERIC', 'STREETADDRESS', 'STREETNAME', 'POSTALCODE', 'STATEORPROVINCE', 'CITY', 'SALESCLOSEPRICE', 'LISTDATE', 'LISTPRICE', 'LISTTYPE', 'ORIGINALLISTPRICE', 'PRICEPERTSFT', 'FOUNDATIONSIZE', 'FENCE', 'MAPLETTER', 'LOTSIZEDIMENSIONS', 'SCHOOLDISTRICTNUMBER', 'DAYSONMARKET', 'OFFMARKETDATE', 'FIREPLACES', 'ROOMAREA4', 'ROOMTYPE', 'ROOF', 'ROOMFLOOR4', 'POTENTIALSHORTSALE', 'POOLDESCRIPTION', 'PDOM', 'GARAGEDESCRIPTION', 'SQFTABOVEGROUND', 'TAXES', 'ROOMFLOOR1', 'ROOMAREA1', 'TAXWITHASSESSMENTS', 'TAXYEAR', 'LIVINGAREA', 'UNITNUMBER', 'YEARBUILT', 'ZONING', 'STYLE', 'ACRES', 'COOLINGDESCRIPTION', 'APPLIANCES', 'BACKONMARKETDATE', 'ROOMFAMILYCHAR', 'ROOMAREA3', 'EXTERIOR', 'ROOMFLOOR3', 'ROOMFLOOR2', 'ROOMAREA2', 'DININGROOMDESCRIPTION', 'BASEMENT', 'BATHSFULL', 'BATHSHALF', 'BATHQUARTER', 'BATHSTHREEQUARTER', 'CLASS', 'BATHSTOTAL', 'BATHDESC', 'ROOMAREA5', 'ROOMFLOOR5', 'ROOMAREA6', 'ROOMFLOOR6', 'ROOMAREA7', 'ROOMFLOOR7', 'ROOMAREA8', 'ROOMFLOOR8', 'BEDROOMS', 'SQFTBELOWGROUND', 'ASSUMABLEMORTGAGE', 'ASSOCIATIONFEE', 'ASSESSMENTPENDING', 'ASSESSEDVALUATION']
```

In [8]:
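# Load from provided CSV file
# Read the file into a dataframe. No header option is passed, so Spark assigns
# default column names (_c0, _c1, ...); pass header=True to use the file's header row.
df = spark.read.csv(path+'2017_StPaul_MN_Real_Estate.csv')
# Print columns in dataframe
print(df.columns)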

['_c0', '_c1', '_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13', '_c14', '_c15', '_c16', '_c17', '_c18', '_c19', '_c20', '_c21', '_c22', '_c23', '_c24', '_c25', '_c26', '_c27', '_c28', '_c29', '_c30', '_c31', '_c32', '_c33', '_c34', '_c35', '_c36', '_c37', '_c38', '_c39', '_c40', '_c41', '_c42', '_c43', '_c44', '_c45', '_c46', '_c47', '_c48', '_c49', '_c50', '_c51', '_c52', '_c53', '_c54', '_c55', '_c56', '_c57', '_c58', '_c59', '_c60', '_c61', '_c62', '_c63', '_c64', '_c65', '_c66', '_c67', '_c68', '_c69', '_c70', '_c71', '_c72', '_c73']


## Defining A Problem

<img src="images/spark4_005.png" alt="" style="width: 800px;"/>

<img src="images/spark4_006.png" alt="" style="width: 800px;"/>

<img src="images/spark4_007.png" alt="" style="width: 800px;"/>

<img src="images/spark4_008.png" alt="" style="width: 800px;"/>

<img src="images/spark4_009.png" alt="" style="width: 800px;"/>

## What are we predicting?

Which of these fields (or columns) is the value we are trying to predict?

- TAXES
- SALESCLOSEPRICE
- DAYSONMARKET
- LISTPRICE

In [None]:
# Select our dependent variable
Y_df = df.select(['SALESCLOSEPRICE'])

# Display summary statistics
Y_df.describe().show()

```
<script.py> output:
    +-------+------------------+
    |summary|   SALESCLOSEPRICE|
    +-------+------------------+
    |  count|              5000|
    |   mean|       262804.4668|
    | stddev|140559.82591998563|
    |    min|             48000|
    |    max|           1700000|
    +-------+------------------+
```
We want to know how much a house will actually sell for. The summary shows the range of values and the average, which will help us in our next steps!

## Verifying Data Load

Let's suppose each month you get a new file. You know to expect a certain number of records and columns. In this exercise we will create a function that will validate the file loaded.

- Create a data validation function check_load() with parameters df (a dataframe), num_records (the expected number of records), and num_columns (the expected number of columns).
- Using num_records, check whether the input dataframe df has the same number of records with count().
- Compare the number of columns in the input dataframe with num_columns by using len() on columns.
- If both checks return True, print Validation Passed.

In [None]:
def check_load(df, num_records, num_columns):
  # Takes a dataframe and compares record and column counts to input
  # Message to return if the criteria below aren't met
  message = 'Validation Failed'
  # Check number of records
  if num_records == df.count():
    # Check number of columns
    if num_columns == len(df.columns):
      # Success message
      message = 'Validation Passed'
  return message

# Print the data validation message
print(check_load(df, 5000, 74))
#check_load(spark.createDataFrame([[1,2], [-1,1]], ['a', 'b']), 2, 2)
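
When validation fails, it helps to know which check failed. The sketch below is a hypothetical variant, `check_load_verbose`, that reports each mismatch. `FakeFrame` is an assumed stand-in that exposes only `count()` and `columns`, so the logic can be tried without a Spark session; a real Spark DataFrame could be passed in its place.

```python
class FakeFrame:
    """Hypothetical stand-in exposing only count() and columns, like a Spark DataFrame."""

    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns

    def count(self):
        return len(self._rows)


def check_load_verbose(df, num_records, num_columns):
    """Return a list of human-readable problems; an empty list means the load passed."""
    problems = []
    actual_records = df.count()
    if actual_records != num_records:
        problems.append(f"expected {num_records} records, found {actual_records}")
    actual_columns = len(df.columns)
    if actual_columns != num_columns:
        problems.append(f"expected {num_columns} columns, found {actual_columns}")
    return problems


frame = FakeFrame(rows=[(1, 2), (-1, 1)], columns=["a", "b"])
print(check_load_verbose(frame, 2, 2))
print(check_load_verbose(frame, 5000, 74))
```

Returning a list instead of a single message lets a monthly load job log every discrepancy at once.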

## Verifying DataTypes

In the age of data we have access to more attributes than we ever had before. To handle all of them we will build a lot of automation, which at a minimum requires that their datatypes be correct. In this exercise we will validate a dictionary of attributes and their datatypes to see if they are correct. This dictionary is stored in the variable validation_dict and is available in your workspace.

- Using df create a list of attribute and datatype tuples with dtypes called actual_dtypes_list.
- Iterate through actual_dtypes_list, checking if the column names exist in the dictionary of expected dtypes validation_dict.
- For the keys that exist in the dictionary, check their dtypes and print those that match.

In [9]:
validation_dict = {'ASSESSMENTPENDING': 'string',
 'AssessedValuation': 'double',
 'AssociationFee': 'bigint',
 'AssumableMortgage': 'string',
 'SQFTBELOWGROUND': 'bigint'}

In [None]:
# create list of actual dtypes to check
actual_dtypes_list = df.dtypes
print(actual_dtypes_list)

# Iterate through the list of actual dtypes tuples
for attribute_tuple in actual_dtypes_list:
  
  # Check if column name is in the dictionary of expected dtypes
  col_name = attribute_tuple[0]
  if col_name in validation_dict:

    # Compare attribute types
    col_type = attribute_tuple[1]
    if col_type == validation_dict[col_name]:
      print(col_name + ' has expected dtype.')

You've created a way to loop through your expected dtypes and compare them to how they got loaded. You could use a similar loop to print or count all the numeric or text fields if you don't have a list of verified field types to compare against.
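
The loop above prints only the columns that match, while the mismatches are usually what need attention. The sketch below collects them instead. It takes a list of `(name, dtype)` tuples in the shape `df.dtypes` returns; the function name `find_dtype_mismatches` and the sample values are illustrative assumptions. Because validation_dict mixes upper-case and mixed-case keys, the sketch compares names case-insensitively. Whether that is appropriate for your data is also an assumption.

```python
def find_dtype_mismatches(actual_dtypes, expected):
    """Compare (name, dtype) pairs against expected dtypes, ignoring name case.

    Returns {column_name: (expected_dtype, actual_dtype)} for every mismatch.
    """
    expected_upper = {name.upper(): dtype for name, dtype in expected.items()}
    mismatches = {}
    for col_name, col_type in actual_dtypes:
        wanted = expected_upper.get(col_name.upper())
        if wanted is not None and wanted != col_type:
            mismatches[col_name] = (wanted, col_type)
    return mismatches


# Illustrative values in the shape returned by df.dtypes; not real output
sample_dtypes = [("ASSESSEDVALUATION", "double"),
                 ("ASSOCIATIONFEE", "string"),
                 ("SQFTBELOWGROUND", "bigint")]
expected_dtypes = {"AssessedValuation": "double",
                   "AssociationFee": "bigint",
                   "SQFTBELOWGROUND": "bigint"}

print(find_dtype_mismatches(sample_dtypes, expected_dtypes))
```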

## Visually Inspecting Data / EDA

<img src="images/spark4_010.png" alt="" style="width: 800px;"/>

<img src="images/spark4_011.png" alt="" style="width: 800px;"/>

<img src="images/spark4_012.png" alt="" style="width: 800px;"/>

<img src="images/spark4_013.png" alt="" style="width: 800px;"/>

<img src="images/spark4_014.png" alt="" style="width: 800px;"/>

<img src="images/spark4_015.png" alt="" style="width: 800px;"/>

<img src="images/spark4_016.png" alt="" style="width: 800px;"/>

<img src="images/spark4_017.png" alt="" style="width: 800px;"/>

<img src="images/spark4_018.png" alt="" style="width: 800px;"/>

<img src="images/spark4_019.png" alt="" style="width: 800px;"/>

## Using Corr()

The old adage 'Correlation does not imply Causation' is a cautionary tale. However, correlation does give us a good nudge toward where to start looking for promising features to use in our models. Use this exercise to get a feel for searching through your data for the first time, trying to find patterns.

A list called columns containing column names has been created for you. In this exercise you will compute the correlation between those columns and 'SALESCLOSEPRICE', and find the maximum.

- Use a for loop to iterate through the columns.
- In each loop cycle, compute the correlation between the current column and 'SALESCLOSEPRICE' using the corr() method.
- Create logic to track the maximum observed correlation and the column it belongs to.
- Print out the name of the column that has the maximum correlation with 'SALESCLOSEPRICE'.

In [11]:
columns = ['FOUNDATIONSIZE', 'DAYSONMARKET', 'FIREPLACES', 'PDOM', 'SQFTABOVEGROUND', 'TAXES', 
           'TAXWITHASSESSMENTS', 'TAXYEAR', 'LIVINGAREA', 'YEARBUILT', 'ACRES', 'BACKONMARKETDATE', 
           'BATHSFULL', 'BATHSHALF', 'BATHQUARTER', 'BATHSTHREEQUARTER', 'BATHSTOTAL', 'BEDROOMS', 
           'SQFTBELOWGROUND', 'ASSOCIATIONFEE', 'ASSESSEDVALUATION']

In [None]:
# Name and value of col with max corr
corr_max = 0
corr_max_col = columns[0]

# Loop to check all columns contained in list
for col in columns:
    # Check the correlation of a pair of columns
    corr_val = df.corr('SALESCLOSEPRICE', col)
    # Logic to compare corr_max with current corr_val
    if corr_max < corr_val:
        # Update the column name and corr value
        corr_max = corr_val
        corr_max_col = col

print(corr_max_col)
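
The loop above keeps the largest positive correlation, so a strong negative relationship would never be selected. The sketch below shows one possible variation that compares magnitudes instead. It uses a hand-written Pearson coefficient (the measure `df.corr()` computes) on made-up toy lists so it runs without Spark; `pearson` and the toy data are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values, only to exercise the logic
price = [100, 150, 200, 250, 300]
features = {
    "living_area": [900, 1300, 1500, 2100, 2400],   # rises with price
    "days_on_market": [90, 70, 50, 30, 10],          # falls with price
}

best_col, best_corr = None, 0.0
for name, values in features.items():
    corr_val = pearson(price, values)
    # Compare magnitudes so a strong negative relationship is not ignored
    if abs(corr_val) > abs(best_corr):
        best_col, best_corr = name, corr_val

print(best_col, round(best_corr, 3))
```

Whether a negatively correlated column counts as the "maximum" depends on the goal; for feature selection the magnitude is usually what matters.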

## Using Visualizations: distplot

Understanding the distribution of our dependent variable is very important and can impact the type of model or preprocessing we do. A great way to do this is to plot it; however, plotting is not built into PySpark, so we need some intermediary steps to make it work correctly. In this exercise you will visualize the 'LISTPRICE' variable and gain more insight into its distribution by computing its skewness.

The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.

- Sample 50% of the dataframe df with sample() making sure to not use replacement and setting the random seed to 42.
- Convert the Spark DataFrame to a pandas.DataFrame() with toPandas().
- Plot a distribution plot using seaborn's distplot() method.
- Import the skewness() function from pyspark.sql.functions and compute it on the aggregate of the 'LISTPRICE' column with the agg() method. Remember to collect() your result to evaluate the computation.

In [None]:
# Select a single column and sample and convert to pandas
sample_df = df.select(['LISTPRICE']).sample(False, 0.5, 42)
pandas_df = sample_df.toPandas()

# Plot distribution of pandas_df and display plot
sns.distplot(pandas_df)
plt.show()

# Import skewness function
from pyspark.sql.functions import skewness

# Compute and print skewness of LISTPRICE
print(df.agg({'LISTPRICE': 'skewness'}).collect())

```
<script.py> output:
    [Row(skewness(LISTPRICE)=2.790448093916559)]
```
Checking the distribution visually is a great way to get an idea of what steps will need to be taken before applying a model. We can see that 'LISTPRICE' is mostly pushed to the left with a long tail to the right, which means it's positively skewed. We can use the skewness function to verify this numerically rather than visually.
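
To make the sign of the skewness value concrete, here is a small plain-Python sketch of the moment-based (population) definition: the third central moment divided by the variance raised to the power 1.5. The source does not state which estimator Spark uses, so treat this definition as an assumption. The helper below is written by hand on made-up lists.

```python
def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]   # one large value stretches the right tail

print(skewness(symmetric))
print(skewness(right_tailed))
```

A symmetric sample has skewness zero, while a sample with a long right tail has a positive value, which matches the positive figure reported for 'LISTPRICE'.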

## Using Visualizations: lmplot

Creating linear model plots helps us visualize whether variables have relationships with the dependent variable. If they do, they are good candidates to include in our analysis. If they don't, it doesn't mean that we should throw them out; it means we may have to process or wrangle them before they can be used.

seaborn is available in your workspace with the customary alias sns.

- Using the loaded data set df filter it down to the columns 'SALESCLOSEPRICE' and 'LIVINGAREA' with select().
- Sample 50% of the dataframe with sample() making sure to not use replacement and setting the random seed to 42.
- Convert the Spark DataFrame to a pandas.DataFrame() with toPandas().
- Using 'SALESCLOSEPRICE' as your dependent variable and 'LIVINGAREA' as your independent, plot a linear model plot using seaborn lmplot().

In [None]:
# Select the relevant columns and sample
sample_df = df.select(['SALESCLOSEPRICE', 'LIVINGAREA']).sample(False, 0.5, 42)

# Convert to pandas dataframe
pandas_df = sample_df.toPandas()

# Linear model plot of pandas_df
sns.lmplot(x='LIVINGAREA', y='SALESCLOSEPRICE', data=pandas_df)
plt.show()

---
<a id='wra'></a>

## Wrangling with Spark Functions

<img src="images/spark4_020.png" alt="" style="width: 800px;"/>

<img src="images/spark4_021.png" alt="" style="width: 800px;"/>

<img src="images/spark4_022.png" alt="" style="width: 800px;"/>

<img src="images/spark4_023.png" alt="" style="width: 800px;"/>

<img src="images/spark4_024.png" alt="" style="width: 800px;"/>

<img src="images/spark4_025.png" alt="" style="width: 800px;"/>

<img src="images/spark4_026.png" alt="" style="width: 800px;"/>

<img src="images/spark4_027.png" alt="" style="width: 800px;"/>

<img src="images/spark4_028.png" alt="" style="width: 800px;"/>

## Dropping a list of columns

Our data set is rich with a lot of features, but not all are valuable. We have many that are going to be hard to wrangle into anything useful. For now, let's remove any columns that aren't immediately useful by dropping them.

- 'STREETNUMBERNUMERIC': The postal address number on the home
- 'FIREPLACES': Number of Fireplaces in the home
- 'LOTSIZEDIMENSIONS': Free text describing the lot shape
- 'LISTTYPE': Set list of values of sale type
- 'ACRES': Numeric area of lot size

Instructions

- Read the list of column descriptions above and explore their top 30 values with show(). The dataframe df is already filtered to the listed columns.
- Create a list of two columns to drop based on their lack of relevance to predicting house prices called cols_to_drop. Recall that computers only interpret numbers explicitly and don't understand context.
- Use the drop() function to remove the columns in the list cols_to_drop from the dataframe df.

In [None]:
# Show top 30 records
df.show(30)

# List of columns to remove from dataset
cols_to_drop = ['LOTSIZEDIMENSIONS', 'STREETNUMBERNUMERIC']

# Drop columns in list
df = df.drop(*cols_to_drop)

```
<script.py> output:
    +-------------------+----------+--------------------+---------------+------------------+
    |STREETNUMBERNUMERIC|FIREPLACES|   LOTSIZEDIMENSIONS|       LISTTYPE|             ACRES|
    +-------------------+----------+--------------------+---------------+------------------+
    |              11511|         0|             279X200|Exclusive Right|              1.28|
    |              11200|         0|             100x140|Exclusive Right|              0.32|
    |               8583|         0|             120x296|Exclusive Right|0.8220000000000001|
    |               9350|         1|             208X208|Exclusive Right|              0.94|
    |               2915|         1|             116x200|Exclusive Right|               0.0|
    |               3604|         1|              50x150|Exclusive Right|             0.172|
    |               9957|         0|              common|Exclusive Right|              0.05|
    |               9934|         0|              common|Exclusive Right|              0.05|
    |               9926|         0|              common|Exclusive Right|              0.05|
    |               9928|         0|              common|Exclusive Right|              0.05|
    |               9902|         0|              common|Exclusive Right|              0.05|
    |               9904|         0|              common|Exclusive Right|              0.05|
    |               9894|         0|              common|Exclusive Right|              0.05|
    |               9892|         0|              COMMON|Exclusive Right|              0.05|
    |               9295|         1|261 x 293 x 287 x...|Exclusive Right|             1.661|
    |               9930|         0|               36X32|Exclusive Right|              0.05|
    |               9898|         0|               36X32|Exclusive Right|              0.05|
    |               9924|         0|              COMMON|Exclusive Right|              0.05|
    |               9906|         0|              COMMON|Exclusive Right|              0.05|
    |               9938|         0|              COMMON|Exclusive Right|              0.05|
    |               9795|         1|               32X60|Exclusive Right|              0.04|
    |               9797|         1|               32X60|Exclusive Right|              0.04|
    |               8909|         2|             125x150|Exclusive Right|              0.43|
    |               3597|         2|             100x250|Exclusive Right|             0.574|
    |               8656|         1|     151x158x130x151|Exclusive Right|             0.498|
    |               9775|         1|               36X32|Exclusive Right|              0.04|
    |               8687|         2|                   -|Exclusive Right|              1.03|
    |               8367|         0|             285x305|Exclusive Right|             1.995|
    |               2866|         0|           Irregular|Exclusive Right|              0.72|
    |               9793|         1|               42x60|Exclusive Right|              0.06|
    +-------------------+----------+--------------------+---------------+------------------+
    only showing top 30 rows
```
Knowing just the house number doesn't tell us anything about what value the house should be. Likewise the freeform text field is likely too messy to extract useful information from. We can always come back to these after our initial model if we need more information.

## Using text filters to remove records

It pays to ask your clients lots of questions and take time to understand your variables. You find out that assumable mortgages are an unusual occurrence in the real estate industry, and your client suggests you exclude them. In this exercise we will use isin(), which is similar to like() but allows us to pass a list of values to use as a filter rather than a single one.

- Use select() and show() to inspect the distinct values in the column 'ASSUMABLEMORTGAGE' and create the list yes_values for all the values containing the string 'Yes'.
- Use ~df['ASSUMABLEMORTGAGE'], isin(), and .isNull() to create a NOT filter to remove records containing corresponding values in the list yes_values and to keep records with null values. Store this filter in the variable text_filter.
- Use where() to apply the text_filter to df.
- Print out the number of records remaining in df.

In [None]:
# Inspect unique values in the column 'ASSUMABLEMORTGAGE'
df.select(['ASSUMABLEMORTGAGE']).distinct().show()

# List of possible values containing 'yes'
yes_values = ['Yes w/ Qualifying', 'Yes w/No Qualifying']

# Filter the text values out of df but keep null values
text_filter = ~df['ASSUMABLEMORTGAGE'].isin(yes_values) | df['ASSUMABLEMORTGAGE'].isNull()
df = df.where(text_filter)

# Print count of remaining records
print(df.count())

```
<script.py> output:
    +-------------------+
    |  ASSUMABLEMORTGAGE|
    +-------------------+
    |  Yes w/ Qualifying|
    | Information Coming|
    |               null|
    |Yes w/No Qualifying|
    |      Not Assumable|
    +-------------------+
    
    4976
```
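
The explicit isNull() term matters because SQL-style filters use three-valued logic: a null compared against a list is neither true nor false, its negation is still null, and a filter keeps only rows where the predicate is true. The sketch below imitates that behavior in plain Python, with `None` standing in for null. The helper names `sql_not`, `sql_or`, and `sql_isin` are hypothetical and only model the semantics.

```python
def sql_not(value):
    """SQL NOT under three-valued logic: NOT NULL is NULL (None here)."""
    return None if value is None else not value


def sql_or(left, right):
    """SQL OR: TRUE if either side is TRUE, NULL if undecided, else FALSE."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


def sql_isin(value, options):
    """SQL IN for a non-null option list: a NULL input gives NULL."""
    return None if value is None else value in options


yes_values = ['Yes w/ Qualifying', 'Yes w/No Qualifying']
rows = ['Yes w/ Qualifying', 'Not Assumable', None, 'Information Coming']

# A filter keeps a row only when the predicate is exactly TRUE
without_null_check = [r for r in rows if sql_not(sql_isin(r, yes_values)) is True]
with_null_check = [r for r in rows
                   if sql_or(sql_not(sql_isin(r, yes_values)), r is None) is True]

print(without_null_check)
print(with_null_check)
```

Without the null check, the row holding `None` silently disappears along with the 'Yes' rows.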

## Filtering numeric fields conditionally

Again, understanding the context of your data is extremely important. We want to understand the normal range of prices that houses sell for. Let's make sure we exclude any outlier homes that have sold for significantly more or less than the average. Here we will calculate the mean and standard deviation and use them to filter the near-normal field log_SalesClosePrice.

- Import mean() and stddev() from pyspark.sql.functions.
- Use agg() to calculate the mean and standard deviation for 'log_SalesClosePrice' with the imported functions.
- Create the upper and lower bounds by taking mean_val +/- 3 times stddev_val.
- Create a where() filter for 'log_SalesClosePrice' using both low_bound and hi_bound.

In [None]:
from pyspark.sql.functions import mean, stddev

# Calculate values used for outlier filtering
mean_val = df.agg({'log_SalesClosePrice': 'mean'}).collect()[0][0]
stddev_val = df.agg({'log_SalesClosePrice': 'stddev'}).collect()[0][0]

# Create three standard deviation (μ ± 3σ) lower and upper bounds for data
low_bound = mean_val - (3 * stddev_val)
hi_bound = mean_val + (3 * stddev_val)

# Filter the data to fit between the lower and upper bounds
df = df.where((df['log_SalesClosePrice'] < hi_bound) & (df['log_SalesClosePrice'] > low_bound))

Now we've set proper constraints on our data. If we were to get new data, or the value for Jumbo Loans changes, we can dynamically refilter it!
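
The same three-standard-deviation rule can be traced on a small list in plain Python. This is a sketch with made-up prices: the log transform mirrors the 'log_SalesClosePrice' field, and `statistics.stdev` (the sample standard deviation) is assumed to play the role of the 'stddev' aggregate.

```python
import math
import statistics

# Made-up sale prices; the last one is an extreme outlier
prices = [200_000 + 5_000 * i for i in range(30)] + [100_000_000]

# Log-transform first, mirroring the 'log_SalesClosePrice' column used above
log_prices = [math.log(p) for p in prices]

mean_val = statistics.mean(log_prices)
stddev_val = statistics.stdev(log_prices)   # sample standard deviation

low_bound = mean_val - 3 * stddev_val
hi_bound = mean_val + 3 * stddev_val

# Keep only values strictly inside the bounds, as the where() filter does
kept = [p for p, lp in zip(prices, log_prices) if low_bound < lp < hi_bound]
print(len(prices), len(kept))
```

The sample is deliberately larger than ten points: with very few observations, a single value cannot lie more than three sample standard deviations from the mean, so the rule would never flag anything.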

## Adjusting Data

<img src="images/spark4_029.png" alt="" style="width: 800px;"/>

<img src="images/spark4_030.png" alt="" style="width: 800px;"/>

<img src="images/spark4_031.png" alt="" style="width: 800px;"/>

<img src="images/spark4_032.png" alt="" style="width: 800px;"/>

<img src="images/spark4_033.png" alt="" style="width: 800px;"/>

<img src="images/spark4_034.png" alt="" style="width: 800px;"/>

<img src="images/spark4_035.png" alt="" style="width: 800px;"/>

## Custom Percentage Scaling

In the slides we showed how to scale the data between 0 and 1. Sometimes you may wish to scale things differently for modeling or display purposes.

- Calculate the max and min of DAYSONMARKET and put them into variables max_days and min_days; don't forget to use collect() on agg().
- Using withColumn(), create a new column called 'percentage_scaled_days' based on DAYSONMARKET.
- percentage_scaled_days should be a column of integers ranging from 0 to 100; use round() to get integers.
- Print the max() and min() for the new column percentage_scaled_days.

In [None]:
# Define max and min values and collect them
max_days = df.agg({'DAYSONMARKET': 'max'}).collect()[0][0]
min_days = df.agg({'DAYSONMARKET': 'min'}).collect()[0][0]

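# Create a new column based off the scaled data. Use Spark's round() and scale
# to 0-100 before rounding; rounding the 0-1 fraction first yields only 0 or 100.
from pyspark.sql.functions import round
df = df.withColumn('percentage_scaled_days', 
                  round((df['DAYSONMARKET'] - min_days) / (max_days - min_days) * 100))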

# Calc max and min for new column
print(df.agg({'percentage_scaled_days': 'max'}).collect())
print(df.agg({'percentage_scaled_days': 'min'}).collect())
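
Where round() is applied changes the result, which a few made-up values in plain Python make visible. In this sketch Python's built-in `round()` stands in for Spark's. The two differ on exact .5 ties, which these values avoid. The helper `fraction` is hypothetical.

```python
days = [0, 10, 45, 225]          # made-up DAYSONMARKET values
min_days, max_days = min(days), max(days)


def fraction(d):
    """Min-max scale a single value into the range [0, 1]."""
    return (d - min_days) / (max_days - min_days)


# Rounding the 0-1 fraction first leaves only 0 or 1, so the result is 0 or 100
round_then_scale = [round(fraction(d)) * 100 for d in days]

# Scaling to 0-100 first and rounding last keeps the intermediate percentages
scale_then_round = [round(fraction(d) * 100) for d in days]

print(round_then_scale)
print(scale_then_round)
```

Only the second ordering produces the full range of integers from 0 to 100 that the exercise asks for.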

## Scaling your scalers

In the previous exercise, we minmax scaled a single variable. Suppose you have a LOT of variables to scale; you don't want to write hundreds of lines of code for each one. Let's expand on the previous exercise and make it a function.

- Define a function called min_max_scaler that takes parameters df a dataframe and cols_to_scale the list of columns to scale.
- Use a for loop to iterate through each column in the list and minmax scale them.
- Return the dataframe df with the new columns added.
- Apply the function min_max_scaler() on df and the list of columns cols_to_scale.

In [14]:
cols_to_scale = ['FOUNDATIONSIZE', 'DAYSONMARKET', 'FIREPLACES']

In [None]:
def min_max_scaler(df, cols_to_scale):
  # Takes a dataframe and list of columns to minmax scale. Returns a dataframe.
  for col in cols_to_scale:
    # Define min and max values and collect them
    max_val = df.agg({col: 'max'}).collect()[0][0]
    min_val = df.agg({col: 'min'}).collect()[0][0]
    new_column_name = 'scaled_' + col
    # Create a new column based off the scaled data
    df = df.withColumn(new_column_name, 
                      (df[col] - min_val) / (max_val - min_val))
  return df
  
df = min_max_scaler(df, cols_to_scale)
# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()

```
<script.py> output:
    +------------+--------------------+
    |DAYSONMARKET| scaled_DAYSONMARKET|
    +------------+--------------------+
    |          10|0.044444444444444446|
    |           4|0.017777777777777778|
    |          28| 0.12444444444444444|
    |          19| 0.08444444444444445|
    |          21| 0.09333333333333334|
    |          17| 0.07555555555555556|
    |          32| 0.14222222222222222|
    |           5|0.022222222222222223|
    |          23| 0.10222222222222223|
    |          73|  0.3244444444444444|
    |          80| 0.35555555555555557|
    |          79|  0.3511111111111111|
    |          12| 0.05333333333333334|
    |           1|0.004444444444444...|
    |          18|                0.08|
    |           2|0.008888888888888889|
    |          12| 0.05333333333333334|
    |          45|                 0.2|
    |          31| 0.13777777777777778|
    |          16| 0.07111111111111111|
    +------------+--------------------+
    only showing top 20 rows
```
Creating scalable solutions that can be reused will free up many hours of your and your team's time. Additionally, it means that you have fewer things to correct should you need to make changes.
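
One edge case a reusable scaler may need to handle is a column whose values are all identical: max and min are then equal and the denominator is zero. These notes do not cover what Spark returns in that case, so the plain-Python sketch below picks one possible convention and maps such a column to zeros. The names `min_max_scale` and `scale_columns`, the dict-of-lists table, and the convention itself are assumptions for illustration.

```python
def min_max_scale(values):
    """Scale a list of numbers into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Every value is identical, so there is no range to scale over
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def scale_columns(table, cols_to_scale):
    """Return a copy of a {column: values} dict with 'scaled_<col>' columns added."""
    scaled = dict(table)
    for col in cols_to_scale:
        scaled["scaled_" + col] = min_max_scale(table[col])
    return scaled


# Made-up values; FIREPLACES is constant to exercise the zero-range guard
table = {"DAYSONMARKET": [0, 45, 225], "FIREPLACES": [1, 1, 1]}
result = scale_columns(table, ["DAYSONMARKET", "FIREPLACES"])
print(result["scaled_DAYSONMARKET"])
print(result["scaled_FIREPLACES"])
```

Deciding up front how constant columns are treated avoids surprises when the same function runs over hundreds of columns.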

<img src="images/spark4_036.png" alt="" style="width: 800px;"/>
