# Feature Engineering with PySpark

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, their data still needs to be transformed before machine learning algorithms can use it to extract meaning, forecast, classify or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

## Table of Contents

- [Introduction](#intro)
- [Wrangling with Spark Functions](#wra)
- [Feature Engineering](#feat)
- [Building a Model](#model)

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

path = "data/dc34/"

In [2]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('First App').getOrCreate()

---
<a id='intro'></a>

## Where to Begin

<img src="images/spark4_001.png" alt="" style="width: 800px;"/>

<img src="images/spark4_002.png" alt="" style="width: 800px;"/>

<img src="images/spark4_003.png" alt="" style="width: 800px;"/>

<img src="images/spark4_004.png" alt="" style="width: 800px;"/>

## Check Version

Checking which versions of Spark and Python are installed is important because both change quickly and drastically between releases. Reading the wrong documentation can cost a lot of time and cause unnecessary frustration!

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the [PySpark Cheat Sheet](https://datacamp-community-prod.s3.amazonaws.com/65076e3c-9df1-40d5-a0c2-36294d9a3ca9) and keep it handy!

In [4]:
# Return spark version
print(spark.version)

# Return python version
import sys
print(sys.version_info)

2.4.4
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)


## Load in the data

Reading in data is the first step to using PySpark for data science! Let's leverage the new industry standard of parquet files!

In [None]:
# Read the file into a dataframe
df = spark.read.parquet('Real_Estate.parq')
# Print columns in dataframe
print(df.columns)

```
<script.py> output:
    ['NO', 'MLSID', 'STREETNUMBERNUMERIC', 'STREETADDRESS', 'STREETNAME', 'POSTALCODE', 'STATEORPROVINCE', 'CITY', 'SALESCLOSEPRICE', 'LISTDATE', 'LISTPRICE', 'LISTTYPE', 'ORIGINALLISTPRICE', 'PRICEPERTSFT', 'FOUNDATIONSIZE', 'FENCE', 'MAPLETTER', 'LOTSIZEDIMENSIONS', 'SCHOOLDISTRICTNUMBER', 'DAYSONMARKET', 'OFFMARKETDATE', 'FIREPLACES', 'ROOMAREA4', 'ROOMTYPE', 'ROOF', 'ROOMFLOOR4', 'POTENTIALSHORTSALE', 'POOLDESCRIPTION', 'PDOM', 'GARAGEDESCRIPTION', 'SQFTABOVEGROUND', 'TAXES', 'ROOMFLOOR1', 'ROOMAREA1', 'TAXWITHASSESSMENTS', 'TAXYEAR', 'LIVINGAREA', 'UNITNUMBER', 'YEARBUILT', 'ZONING', 'STYLE', 'ACRES', 'COOLINGDESCRIPTION', 'APPLIANCES', 'BACKONMARKETDATE', 'ROOMFAMILYCHAR', 'ROOMAREA3', 'EXTERIOR', 'ROOMFLOOR3', 'ROOMFLOOR2', 'ROOMAREA2', 'DININGROOMDESCRIPTION', 'BASEMENT', 'BATHSFULL', 'BATHSHALF', 'BATHQUARTER', 'BATHSTHREEQUARTER', 'CLASS', 'BATHSTOTAL', 'BATHDESC', 'ROOMAREA5', 'ROOMFLOOR5', 'ROOMAREA6', 'ROOMFLOOR6', 'ROOMAREA7', 'ROOMFLOOR7', 'ROOMAREA8', 'ROOMFLOOR8', 'BEDROOMS', 'SQFTBELOWGROUND', 'ASSUMABLEMORTGAGE', 'ASSOCIATIONFEE', 'ASSESSMENTPENDING', 'ASSESSEDVALUATION']
```

In [8]:
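# Load from the provided CSV file and read it into a dataframe.
# No header option is passed, so Spark assigns default column names (_c0, _c1, ...)
# and reads every column as a string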
df = spark.read.csv(path+'2017_StPaul_MN_Real_Estate.csv')
# Print columns in dataframe
print(df.columns)

['_c0', '_c1', '_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13', '_c14', '_c15', '_c16', '_c17', '_c18', '_c19', '_c20', '_c21', '_c22', '_c23', '_c24', '_c25', '_c26', '_c27', '_c28', '_c29', '_c30', '_c31', '_c32', '_c33', '_c34', '_c35', '_c36', '_c37', '_c38', '_c39', '_c40', '_c41', '_c42', '_c43', '_c44', '_c45', '_c46', '_c47', '_c48', '_c49', '_c50', '_c51', '_c52', '_c53', '_c54', '_c55', '_c56', '_c57', '_c58', '_c59', '_c60', '_c61', '_c62', '_c63', '_c64', '_c65', '_c66', '_c67', '_c68', '_c69', '_c70', '_c71', '_c72', '_c73']


## Defining A Problem

<img src="images/spark4_005.png" alt="" style="width: 800px;"/>

<img src="images/spark4_006.png" alt="" style="width: 800px;"/>

<img src="images/spark4_007.png" alt="" style="width: 800px;"/>

<img src="images/spark4_008.png" alt="" style="width: 800px;"/>

<img src="images/spark4_009.png" alt="" style="width: 800px;"/>

## What are we predicting?

Which of these fields (or columns) holds the value we are trying to predict?

- TAXES
- SALESCLOSEPRICE
- DAYSONMARKET
- LISTPRICE

In [None]:
# Select our dependent variable
Y_df = df.select(['SALESCLOSEPRICE'])

# Display summary statistics
Y_df.describe().show()

```
<script.py> output:
    +-------+------------------+
    |summary|   SALESCLOSEPRICE|
    +-------+------------------+
    |  count|              5000|
    |   mean|       262804.4668|
    | stddev|140559.82591998563|
    |    min|             48000|
    |    max|           1700000|
    +-------+------------------+
```
We want to know how much a house will actually sell for. The summary shows the range of values and the average, which will help us in the next steps!

## Verifying Data Load

Suppose you get a new file each month and know to expect a certain number of records and columns. In this exercise we will create a function that validates the loaded file.

- Create a data validation function check_load() with parameters df (a dataframe), num_records (the expected number of records) and num_columns (the expected number of columns).
- Using num_records, check whether the input dataframe df has the same number of records with count().
- Compare the number of columns in the input dataframe with num_columns by using len() on columns.
- If both checks return True, print Validation Passed.

In [None]:
def check_load(df, num_records, num_columns):
  # Takes a dataframe and compares record and column counts to input
  # Message to return if the criteria below aren't met
  message = 'Validation Failed'
  # Check number of records
  if num_records == df.count():
    # Check number of columns
    if num_columns == len(df.columns):
      # Success message
      message = 'Validation Passed'
  return message

# Print the data validation message
print(check_load(df, 5000, 74))
#check_load(spark.createDataFrame([[1,2], [-1,1]], ['a', 'b']), 2, 2)
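
A variant of this check that reports which comparison failed can make a monthly load easier to debug. The sketch below is one possible shape, not the course solution: `describe_load` and the `FakeFrame` stand-in are hypothetical names, and the stand-in only mimics the two members the check relies on (`count()` and `columns`) so the example runs without a Spark session.

```python
class FakeFrame:
    """Hypothetical stand-in exposing only count() and columns, like a Spark DataFrame."""

    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns

    def count(self):
        return len(self._rows)


def describe_load(df, num_records, num_columns):
    """Return a list of human-readable problems; an empty list means the load is valid."""
    problems = []
    actual_records = df.count()
    actual_columns = len(df.columns)
    if actual_records != num_records:
        problems.append(f"expected {num_records} records, found {actual_records}")
    if actual_columns != num_columns:
        problems.append(f"expected {num_columns} columns, found {actual_columns}")
    return problems


frame = FakeFrame(rows=[(1, 2), (-1, 1)], columns=["a", "b"])
print(describe_load(frame, 2, 2))   # no problems reported
print(describe_load(frame, 3, 2))   # record count mismatch reported
```

Because the function only calls `count()` and reads `columns`, the same body should work unchanged on a real Spark DataFrame.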

## Verifying DataTypes

In the age of data we have access to more attributes than ever before. Handling all of them takes a lot of automation, which at a minimum requires that their datatypes be correct. In this exercise we will validate a dictionary of attributes and their datatypes to see if they are correct. This dictionary is stored in the variable validation_dict and is available in your workspace.

- Using df create a list of attribute and datatype tuples with dtypes called actual_dtypes_list.
- Iterate through actual_dtypes_list, checking if the column names exist in the dictionary of expected dtypes validation_dict.
- For the keys that exist in the dictionary, check their dtypes and print those that match.

In [9]:
# Keys are upper case to match df.columns; the lookup below is case-sensitive
validation_dict = {'ASSESSMENTPENDING': 'string',
 'ASSESSEDVALUATION': 'double',
 'ASSOCIATIONFEE': 'bigint',
 'ASSUMABLEMORTGAGE': 'string',
 'SQFTBELOWGROUND': 'bigint'}

In [None]:
# create list of actual dtypes to check
actual_dtypes_list = df.dtypes
print(actual_dtypes_list)

# Iterate through the list of actual dtypes tuples
for attribute_tuple in actual_dtypes_list:
  
  # Check if column name is in the dictionary of expected dtypes
  col_name = attribute_tuple[0]
  if col_name in validation_dict:

    # Compare attribute types
    col_type = attribute_tuple[1]
    if col_type == validation_dict[col_name]:
      print(col_name + ' has expected dtype.')

You've created a way to loop through your expected dtypes and compare them to how they got loaded. You could use a similar loop to print or count all the numeric or text fields if you don't have a list of verified field types to compare against.
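
Printing only the matches hides the cases we usually care about: columns whose dtype differs and expected columns that are absent. The following sketch works on the list of `(name, dtype)` tuples that `df.dtypes` returns and sorts the expected columns into three groups. The function name `compare_dtypes`, the upper-casing of names and the sample input are assumptions made for illustration.

```python
def compare_dtypes(actual_dtypes, expected):
    """Split expected columns into matches, mismatches and missing names.

    actual_dtypes: list of (column_name, dtype_string) tuples, the shape df.dtypes returns.
    expected: dict mapping column name to expected dtype string.
    """
    # Normalise names to upper case so 'AssociationFee' and 'ASSOCIATIONFEE' compare equal
    actual = {name.upper(): dtype for name, dtype in actual_dtypes}
    matches, mismatches, missing = [], [], []
    for name, want in expected.items():
        key = name.upper()
        if key not in actual:
            missing.append(key)
        elif actual[key] == want:
            matches.append(key)
        else:
            mismatches.append((key, want, actual[key]))
    return matches, mismatches, missing


# Illustrative input only; these dtypes are made up, not taken from the real file
sample_dtypes = [("ASSOCIATIONFEE", "bigint"), ("ASSESSEDVALUATION", "string")]
sample_expected = {"AssociationFee": "bigint", "AssessedValuation": "double", "Taxes": "double"}
matches, mismatches, missing = compare_dtypes(sample_dtypes, sample_expected)
print(matches, mismatches, missing)
```

With a real dataframe the first argument would simply be `df.dtypes`.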

## Visually Inspecting Data / EDA

<img src="images/spark4_010.png" alt="" style="width: 800px;"/>

<img src="images/spark4_011.png" alt="" style="width: 800px;"/>

<img src="images/spark4_012.png" alt="" style="width: 800px;"/>

<img src="images/spark4_013.png" alt="" style="width: 800px;"/>

<img src="images/spark4_014.png" alt="" style="width: 800px;"/>

<img src="images/spark4_015.png" alt="" style="width: 800px;"/>

<img src="images/spark4_016.png" alt="" style="width: 800px;"/>

<img src="images/spark4_017.png" alt="" style="width: 800px;"/>

<img src="images/spark4_018.png" alt="" style="width: 800px;"/>

<img src="images/spark4_019.png" alt="" style="width: 800px;"/>

## Using Corr()

The old adage 'Correlation does not imply Causation' is a cautionary tale. However, correlation does give us a good nudge about where to start looking for promising features to use in our models. Use this exercise to get a feel for searching through your data for the first time, trying to find patterns.

A list called columns containing column names has been created for you. In this exercise you will compute the correlation between those columns and 'SALESCLOSEPRICE', and find the maximum.

- Use a for loop to iterate through the columns.
- In each loop cycle, compute the correlation between the current column and 'SALESCLOSEPRICE' using the corr() method.
- Create logic to track the maximum observed correlation and the column it belongs to.
- Print out the name of the column that has the maximum correlation with 'SALESCLOSEPRICE'.

In [11]:
columns = ['FOUNDATIONSIZE', 'DAYSONMARKET', 'FIREPLACES', 'PDOM', 'SQFTABOVEGROUND', 'TAXES', 
           'TAXWITHASSESSMENTS', 'TAXYEAR', 'LIVINGAREA', 'YEARBUILT', 'ACRES', 'BACKONMARKETDATE', 
           'BATHSFULL', 'BATHSHALF', 'BATHQUARTER', 'BATHSTHREEQUARTER', 'BATHSTOTAL', 'BEDROOMS', 
           'SQFTBELOWGROUND', 'ASSOCIATIONFEE', 'ASSESSEDVALUATION']

In [None]:
# Name and value of col with max corr
corr_max = 0
corr_max_col = columns[0]

# Loop to check all columns contained in list
for col in columns:
    # Check the correlation of a pair of columns
    corr_val = df.corr('SALESCLOSEPRICE', col)
    # Logic to compare corr_max with current corr_val
    if corr_max < corr_val:
        # Update the column name and corr value
        corr_max = corr_val
        corr_max_col = col

print(corr_max_col)
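
The loop above tracks the largest signed correlation, so a feature with a strong negative relationship to price would be passed over. One alternative, sketched here in plain Python on made-up toy data, is to compare absolute values. The `pearson` helper and the toy feature names are hypothetical and only stand in for `df.corr()` and the real columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy data: price rises with area and falls with age
price = [100, 150, 200, 250, 300]
features = {
    "area": [50, 70, 95, 120, 160],
    "age": [40, 32, 20, 11, 2],
    "noise": [3, 1, 4, 1, 5],
}

best_col, best_abs = None, -1.0
for name, values in features.items():
    r = pearson(price, values)
    # Compare magnitudes so a strong negative relationship is not overlooked
    if abs(r) > best_abs:
        best_col, best_abs = name, abs(r)

print(best_col, round(best_abs, 3))
```

In this toy data the negatively correlated `age` column wins on magnitude, whereas a signed comparison would have picked `area`.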

## Using Visualizations: distplot

Understanding the distribution of our dependent variable is very important and can affect the type of model or preprocessing we choose. A great way to do this is to plot it. However, plotting is not built into PySpark, so we need to take some intermediary steps to make sure it works correctly. In this exercise you will visualize the 'LISTPRICE' variable and gain more insight into its distribution by computing the skewness.

The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.

- Sample 50% of the dataframe df with sample() making sure to not use replacement and setting the random seed to 42.
- Convert the Spark DataFrame to a pandas.DataFrame() with toPandas().
- Plot a distribution plot using seaborn's distplot() method.
- Import the skewness() function from pyspark.sql.functions and compute it on the aggregate of the 'LISTPRICE' column with the agg() method. Remember to collect() your result to evaluate the computation.

In [None]:
# Select a single column and sample and convert to pandas
sample_df = df.select(['LISTPRICE']).sample(False, 0.5, 42)
pandas_df = sample_df.toPandas()

# Plot distribution of pandas_df and display plot
sns.distplot(pandas_df)
plt.show()

# Import skewness function
from pyspark.sql.functions import skewness

# Compute and print skewness of LISTPRICE
print(df.agg({'LISTPRICE': 'skewness'}).collect())

```
<script.py> output:
    [Row(skewness(LISTPRICE)=2.790448093916559)]
```
Checking the distribution visually is a great way to get an idea of what steps will need to be taken before applying a model. We can see that 'LISTPRICE' is mostly pushed to the left with a long tail to the right, which means it's positively skewed. We can use the skewness function to verify this numerically rather than visually.
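
To make the sign of the skewness concrete, here is a small plain-Python sketch of the moment-based definition (third central moment divided by the 1.5 power of the second). The helper name `moment_skewness` and the toy lists are assumptions. Spark's `skewness` is believed to use this population form, but treat that as an assumption and compare against your own data before relying on exact values.

```python
def moment_skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 2, 3, 4, 20]   # one large value stretches the right tail

print(moment_skewness(symmetric))        # zero for perfectly symmetric data
print(moment_skewness(long_right_tail))  # positive: the tail points to the right
```

A positive result, as with 'LISTPRICE', means the bulk of the data sits on the left and the tail extends to the right.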

## Using Visualizations: lmplot

Creating linear model plots helps us visualize if variables have relationships with the dependent variable. If they do they are good candidates to include in our analysis. If they don't it doesn't mean that we should throw them out, it means we may have to process or wrangle them before they can be used.

seaborn is available in your workspace with the customary alias sns.

- Using the loaded data set df filter it down to the columns 'SALESCLOSEPRICE' and 'LIVINGAREA' with select().
- Sample 50% of the dataframe with sample() making sure to not use replacement and setting the random seed to 42.
- Convert the Spark DataFrame to a pandas.DataFrame() with toPandas().
- Using 'SALESCLOSEPRICE' as your dependent variable and 'LIVINGAREA' as your independent, plot a linear model plot using seaborn lmplot().

In [None]:
# Select the relevant columns and sample
sample_df = df.select(['SALESCLOSEPRICE', 'LIVINGAREA']).sample(False, 0.5, 42)

# Convert to pandas dataframe
pandas_df = sample_df.toPandas()

# Linear model plot of pandas_df
sns.lmplot(x='LIVINGAREA', y='SALESCLOSEPRICE', data=pandas_df)
plt.show()

---
<a id='wra'></a>

## Wrangling with Spark Functions

<img src="images/spark4_020.png" alt="" style="width: 800px;"/>

<img src="images/spark4_021.png" alt="" style="width: 800px;"/>

<img src="images/spark4_022.png" alt="" style="width: 800px;"/>

<img src="images/spark4_023.png" alt="" style="width: 800px;"/>

<img src="images/spark4_024.png" alt="" style="width: 800px;"/>

<img src="images/spark4_025.png" alt="" style="width: 800px;"/>

<img src="images/spark4_026.png" alt="" style="width: 800px;"/>

<img src="images/spark4_027.png" alt="" style="width: 800px;"/>

<img src="images/spark4_028.png" alt="" style="width: 800px;"/>

## Dropping a list of columns

Our data set is rich with a lot of features, but not all are valuable. We have many that are going to be hard to wrangle into anything useful. For now, let's remove any columns that aren't immediately useful by dropping them.

- 'STREETNUMBERNUMERIC': The postal address number on the home
- 'FIREPLACES': Number of Fireplaces in the home
- 'LOTSIZEDIMENSIONS': Free text describing the lot shape
- 'LISTTYPE': Set list of values of sale type
- 'ACRES': Numeric area of lot size

Instructions

- Read the list of column descriptions above and explore their top 30 values with show(). The dataframe df is already filtered to the listed columns.
- Create a list called cols_to_drop of two columns to drop based on their lack of relevance to predicting house prices. Recall that computers only interpret numbers explicitly and don't understand context.
- Use the drop() function to remove the columns in the list cols_to_drop from the dataframe df.

In [None]:
# Show top 30 records
df.show(30)

# List of columns to remove from dataset
cols_to_drop = ['LOTSIZEDIMENSIONS', 'STREETNUMBERNUMERIC']

# Drop columns in list
df = df.drop(*cols_to_drop)

```
<script.py> output:
    +-------------------+----------+--------------------+---------------+------------------+
    |STREETNUMBERNUMERIC|FIREPLACES|   LOTSIZEDIMENSIONS|       LISTTYPE|             ACRES|
    +-------------------+----------+--------------------+---------------+------------------+
    |              11511|         0|             279X200|Exclusive Right|              1.28|
    |              11200|         0|             100x140|Exclusive Right|              0.32|
    |               8583|         0|             120x296|Exclusive Right|0.8220000000000001|
    |               9350|         1|             208X208|Exclusive Right|              0.94|
    |               2915|         1|             116x200|Exclusive Right|               0.0|
    |               3604|         1|              50x150|Exclusive Right|             0.172|
    |               9957|         0|              common|Exclusive Right|              0.05|
    |               9934|         0|              common|Exclusive Right|              0.05|
    |               9926|         0|              common|Exclusive Right|              0.05|
    |               9928|         0|              common|Exclusive Right|              0.05|
    |               9902|         0|              common|Exclusive Right|              0.05|
    |               9904|         0|              common|Exclusive Right|              0.05|
    |               9894|         0|              common|Exclusive Right|              0.05|
    |               9892|         0|              COMMON|Exclusive Right|              0.05|
    |               9295|         1|261 x 293 x 287 x...|Exclusive Right|             1.661|
    |               9930|         0|               36X32|Exclusive Right|              0.05|
    |               9898|         0|               36X32|Exclusive Right|              0.05|
    |               9924|         0|              COMMON|Exclusive Right|              0.05|
    |               9906|         0|              COMMON|Exclusive Right|              0.05|
    |               9938|         0|              COMMON|Exclusive Right|              0.05|
    |               9795|         1|               32X60|Exclusive Right|              0.04|
    |               9797|         1|               32X60|Exclusive Right|              0.04|
    |               8909|         2|             125x150|Exclusive Right|              0.43|
    |               3597|         2|             100x250|Exclusive Right|             0.574|
    |               8656|         1|     151x158x130x151|Exclusive Right|             0.498|
    |               9775|         1|               36X32|Exclusive Right|              0.04|
    |               8687|         2|                   -|Exclusive Right|              1.03|
    |               8367|         0|             285x305|Exclusive Right|             1.995|
    |               2866|         0|           Irregular|Exclusive Right|              0.72|
    |               9793|         1|               42x60|Exclusive Right|              0.06|
    +-------------------+----------+--------------------+---------------+------------------+
    only showing top 30 rows
```
Knowing just the house number doesn't tell us anything about what value the house should be. Likewise, the freeform text field is likely too messy to extract useful information from. We can always come back to these after our initial model if we need more information.

## Using text filters to remove records

It pays to ask your clients lots of questions and take time to understand your variables. You find out that an assumable mortgage is an unusual occurrence in the real estate industry, and your client suggests you exclude such records. In this exercise we will use isin(), which is similar to like() but lets us pass a list of values to use as a filter rather than a single one.

- Use select() and show() to inspect the distinct values in the column 'ASSUMABLEMORTGAGE' and create the list yes_values for all the values containing the string 'Yes'.
- Use ~df['ASSUMABLEMORTGAGE'], isin(), and .isNull() to create a NOT filter to remove records containing corresponding values in the list yes_values and to keep records with null values. Store this filter in the variable text_filter.
- Use where() to apply the text_filter to df.
- Print out the number of records remaining in df.

In [None]:
# Inspect unique values in the column 'ASSUMABLEMORTGAGE'
df.select(['ASSUMABLEMORTGAGE']).distinct().show()

# List of possible values containing 'yes'
yes_values = ['Yes w/ Qualifying', 'Yes w/No Qualifying']

# Filter the text values out of df but keep null values
text_filter = ~df['ASSUMABLEMORTGAGE'].isin(yes_values) | df['ASSUMABLEMORTGAGE'].isNull()
df = df.where(text_filter)

# Print count of remaining records
print(df.count())

```
<script.py> output:
    +-------------------+
    |  ASSUMABLEMORTGAGE|
    +-------------------+
    |  Yes w/ Qualifying|
    | Information Coming|
    |               null|
    |Yes w/No Qualifying|
    |      Not Assumable|
    +-------------------+
    
    4976
```
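
The explicit isNull() term matters because SQL comparisons involving null evaluate to null rather than True or False, and where() keeps only rows whose condition is True. The plain-Python sketch below imitates that three-valued logic with `None`. The helper names `sql_isin`, `sql_not` and `sql_or` are hypothetical and exist only to illustrate the behaviour.

```python
def sql_isin(value, candidates):
    """isin() under SQL semantics: a null input yields null (None), not False."""
    if value is None:
        return None
    return value in candidates


def sql_not(flag):
    """NOT under SQL semantics: NOT null is still null."""
    return None if flag is None else not flag


def sql_or(left, right):
    """OR under SQL semantics: True wins, otherwise null is contagious."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


yes_values = ["Yes w/ Qualifying", "Yes w/No Qualifying"]
rows = ["Yes w/ Qualifying", "Not Assumable", None, "Information Coming"]

# where() keeps a row only when the condition is exactly True
without_null_check = [r for r in rows if sql_not(sql_isin(r, yes_values)) is True]
with_null_check = [
    r for r in rows
    if sql_or(sql_not(sql_isin(r, yes_values)), r is None) is True
]

print(without_null_check)  # the None row is silently dropped
print(with_null_check)     # the None row survives
```

Without the null check, records with a missing 'ASSUMABLEMORTGAGE' would disappear along with the 'Yes' records.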

## Filtering numeric fields conditionally

Again, understanding the context of your data is extremely important. We want to understand what a normal range of houses sell for. Let's exclude any outlier homes that sold for significantly more or less than the average. Here we will calculate the mean and standard deviation and use them to filter the near-normal field log_SalesClosePrice.

- Import mean() and stddev() from pyspark.sql.functions.
- Use agg() to calculate the mean and standard deviation for 'log_SalesClosePrice' with the imported functions.
- Create the upper and lower bounds by taking mean_val +/- 3 times stddev_val.
- Create a where() filter for 'log_SalesClosePrice' using both low_bound and hi_bound.

In [None]:
from pyspark.sql.functions import mean, stddev

# Calculate values used for outlier filtering
mean_val = df.agg({'log_SalesClosePrice': 'mean'}).collect()[0][0]
stddev_val = df.agg({'log_SalesClosePrice': 'stddev'}).collect()[0][0]

# Create three standard deviation (μ ± 3σ) lower and upper bounds for data
low_bound = mean_val - (3 * stddev_val)
hi_bound = mean_val + (3 * stddev_val)

# Filter the data to fit between the lower and upper bounds
df = df.where((df['log_SalesClosePrice'] < hi_bound) & (df['log_SalesClosePrice'] > low_bound))

Now we've set proper constraints on our data. If we were to get new data, or the value for Jumbo Loans changes, we can dynamically refilter it!
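
The same three-standard-deviation rule can be checked by hand on a small list. The sketch below uses made-up log prices and Python's `statistics.stdev`, which is the sample standard deviation, the same flavour Spark's `stddev` computes. The toy list has 21 values on purpose: in a sample of ten or fewer values, no point can lie more than three sample standard deviations from the mean.

```python
from statistics import mean, stdev

# Made-up log prices: twenty ordinary values plus one extreme value
log_prices = [12.0 + 0.05 * i for i in range(20)] + [20.0]

mean_val = mean(log_prices)
stddev_val = stdev(log_prices)   # sample standard deviation

# Three standard deviation lower and upper bounds
low_bound = mean_val - 3 * stddev_val
hi_bound = mean_val + 3 * stddev_val

# Keep only the values strictly inside the bounds
kept = [p for p in log_prices if low_bound < p < hi_bound]
print(len(log_prices), len(kept))
```

Only the extreme value falls outside the bounds; the twenty ordinary values are kept.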

## Adjusting Data

<img src="images/spark4_029.png" alt="" style="width: 800px;"/>

<img src="images/spark4_030.png" alt="" style="width: 800px;"/>

<img src="images/spark4_031.png" alt="" style="width: 800px;"/>

<img src="images/spark4_032.png" alt="" style="width: 800px;"/>

<img src="images/spark4_033.png" alt="" style="width: 800px;"/>

<img src="images/spark4_034.png" alt="" style="width: 800px;"/>

<img src="images/spark4_035.png" alt="" style="width: 800px;"/>

## Custom Percentage Scaling

In the slides we showed how to scale the data between 0 and 1. Sometimes you may wish to scale things differently for modeling or display purposes.

- Calculate the max and min of DAYSONMARKET and put them into variables max_days and min_days. Don't forget to use collect() on agg().
- Using withColumn(), create a new column called 'percentage_scaled_days' based on DAYSONMARKET.
- percentage_scaled_days should be a column of integers ranging from 0 to 100; use round() to get integers.
- Print the max() and min() for the new column percentage_scaled_days.

In [None]:
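# Spark's column-aware round() (the Python builtin cannot round a Column)
from pyspark.sql.functions import round

# Define max and min values and collect them
max_days = df.agg({'DAYSONMARKET': 'max'}).collect()[0][0]
min_days = df.agg({'DAYSONMARKET': 'min'}).collect()[0][0]

# Create a new column based off the scaled data; multiply by 100 before rounding
# so the result spans 0 to 100 instead of collapsing to 0 or 100
df = df.withColumn('percentage_scaled_days', 
                  round((df['DAYSONMARKET'] - min_days) / (max_days - min_days) * 100))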

# Calc max and min for new column
print(df.agg({'percentage_scaled_days': 'max'}).collect())
print(df.agg({'percentage_scaled_days': 'min'}).collect())
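
Where round() sits relative to the multiplication by 100 changes the result substantially. A quick arithmetic check in plain Python, using illustrative numbers (a listing 28 days on the market, with a range of 0 to 225 days), shows the difference. Note that Python's builtin `round` resolves exact ties to the nearest even number while Spark's `round` rounds ties away from zero; the example avoids ties, so that difference does not affect it.

```python
def scale_fraction(value, min_val, max_val):
    """Min-max scale a value to the range 0..1."""
    return (value - min_val) / (max_val - min_val)


# Illustrative numbers: a listing 28 days on market, with range 0..225 days
fraction = scale_fraction(28, 0, 225)

rounded_then_scaled = round(fraction) * 100   # rounds 0.124... to 0 first
scaled_then_rounded = round(fraction * 100)   # rounds 12.44... to 12

print(rounded_then_scaled, scaled_then_rounded)
```

Rounding first can only ever produce 0 or 100, so the multiplication has to happen inside the call to round().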

## Scaling your scalers

In the previous exercise, we min-max scaled a single variable. Suppose you have a LOT of variables to scale; you don't want to write hundreds of lines of code, one for each. Let's expand on the previous exercise and make it a function.

- Define a function called min_max_scaler that takes parameters df a dataframe and cols_to_scale the list of columns to scale.
- Use a for loop to iterate through each column in the list and minmax scale them.
- Return the dataframe df with the new columns added.
- Apply the function min_max_scaler() on df and the list of columns cols_to_scale.

In [14]:
cols_to_scale = ['FOUNDATIONSIZE', 'DAYSONMARKET', 'FIREPLACES']

In [None]:
def min_max_scaler(df, cols_to_scale):
  # Takes a dataframe and list of columns to minmax scale. Returns a dataframe.
  for col in cols_to_scale:
    # Define min and max values and collect them
    max_val = df.agg({col: 'max'}).collect()[0][0]
    min_val = df.agg({col: 'min'}).collect()[0][0]
    new_column_name = 'scaled_' + col
    # Create a new column based off the scaled data
    df = df.withColumn(new_column_name, 
                      (df[col] - min_val) / (max_val - min_val))
  return df
  
df = min_max_scaler(df, cols_to_scale)
# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()

```
<script.py> output:
    +------------+--------------------+
    |DAYSONMARKET| scaled_DAYSONMARKET|
    +------------+--------------------+
    |          10|0.044444444444444446|
    |           4|0.017777777777777778|
    |          28| 0.12444444444444444|
    |          19| 0.08444444444444445|
    |          21| 0.09333333333333334|
    |          17| 0.07555555555555556|
    |          32| 0.14222222222222222|
    |           5|0.022222222222222223|
    |          23| 0.10222222222222223|
    |          73|  0.3244444444444444|
    |          80| 0.35555555555555557|
    |          79|  0.3511111111111111|
    |          12| 0.05333333333333334|
    |           1|0.004444444444444...|
    |          18|                0.08|
    |           2|0.008888888888888889|
    |          12| 0.05333333333333334|
    |          45|                 0.2|
    |          31| 0.13777777777777778|
    |          16| 0.07111111111111111|
    +------------+--------------------+
    only showing top 20 rows
```
Creating scalable solutions that can be reused will free up many hours of your and your team's time. Additionally, it means that you have fewer things to correct should you need to make changes.
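
One edge case worth thinking about when scaling many columns automatically is a column whose values are all identical, because then max minus min is zero. The following plain-Python sketch applies the same min-max formula to a dictionary of lists and guards against that case. The function name `min_max_scale`, the toy values and the choice to map a constant column to 0.0 are all assumptions, one reasonable option among several.

```python
def min_max_scale(table, cols_to_scale):
    """Return a copy of a dict-of-lists table with a 'scaled_<col>' list per requested column."""
    result = dict(table)
    for col in cols_to_scale:
        values = table[col]
        min_val, max_val = min(values), max(values)
        span = max_val - min_val
        if span == 0:
            # A constant column has no spread; map it to 0.0 rather than divide by zero
            result["scaled_" + col] = [0.0 for _ in values]
        else:
            result["scaled_" + col] = [(v - min_val) / span for v in values]
    return result


# Made-up values for illustration
toy = {"DAYSONMARKET": [0, 45, 225], "FIREPLACES": [1, 1, 1]}
scaled = min_max_scale(toy, ["DAYSONMARKET", "FIREPLACES"])
print(scaled["scaled_DAYSONMARKET"])
print(scaled["scaled_FIREPLACES"])
```

The original table is left untouched and the scaled columns are added alongside, mirroring how withColumn() returns a new dataframe.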

## Correcting Negatively Skewed Data

In the slides we showed how you might use log transforms to fix positively skewed data (data whose distribution is mostly to the left). To correct negative skew (data mostly to the right), you need to take an extra step called "reflecting" before you can apply the inverse of log, written as (1/log), to make the data look more like a normal distribution. Reflecting uses the following formula for each value: $(x_{max} + 1) - x$.

- Use the aggregate function skewness() to verify that 'YEARBUILT' has negative skew.
- Use the withColumn() to create a new column 'Reflect_YearBuilt' and reflect the values of 'YEARBUILT'.
- Using 'Reflect_YearBuilt' column, create another column 'adj_yearbuilt' by taking 1/log() of the values.

In [None]:
from pyspark.sql.functions import log

# Compute the skewness
print(df.agg({'YEARBUILT': 'skewness'}).collect())

# Calculate the max year
max_year = df.agg({'YEARBUILT': 'max'}).collect()[0][0]

# Create a new column of reflected data
df = df.withColumn('Reflect_YearBuilt', (max_year + 1) - df['YEARBUILT'])

# Create a new column based on the reflected data
df = df.withColumn('adj_yearbuilt', 1 / log(df['Reflect_YearBuilt']))

Adjusting variables is a complex task. What you've seen here are only a few of the ways that you might try to make your data fit a normal distribution.
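
A worked example helps to see what the reflect-then-inverse-log transform does to individual values, and where it breaks down. The row holding the maximum reflects to exactly 1, and since ln(1) is 0 the expression 1/ln(1) is undefined. The plain-Python sketch below marks that case with `None`; in Spark's default (non-ANSI) mode a division by zero is understood to yield null, which is worth confirming on your version. The helper name and the build years are made up.

```python
from math import log


def reflect_inverse_log(values):
    """Reflect each value around (max + 1), then apply 1 / ln(reflected)."""
    max_val = max(values)
    adjusted = []
    for v in values:
        reflected = (max_val + 1) - v
        if reflected == 1:
            # ln(1) == 0, so 1 / ln(1) is undefined; None marks the gap
            adjusted.append(None)
        else:
            adjusted.append(1 / log(reflected))
    return adjusted


# Made-up build years, bunched toward the recent end
years = [1950, 2000, 2010, 2015, 2017]
print(reflect_inverse_log(years))
```

The transformed values keep the original ordering of the years, while the newest year ends up as a missing value that would need handling downstream.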

## Working with Missing Data

<img src="images/spark4_036.png" alt="" style="width: 800px;"/>

<img src="images/spark4_037.png" alt="" style="width: 800px;"/>

<img src="images/spark4_038.png" alt="" style="width: 800px;"/>

<img src="images/spark4_039.png" alt="" style="width: 800px;"/>

<img src="images/spark4_040.png" alt="" style="width: 800px;"/>

<img src="images/spark4_041.png" alt="" style="width: 800px;"/>

<img src="images/spark4_042.png" alt="" style="width: 800px;"/>

## Visualizing Missing Data

Being able to plot missing values is a great way to quickly understand how much of your data is missing. It can also help highlight when variables are missing in a pattern, something that will need to be handled with care lest your model be biased.

Which variable has the most missing values? Run all lines of code except the last one to determine the answer. Once you're confident, fill in the value and hit "Submit Answer".

- Use select() to subset the dataframe df with the list of columns columns, sample it with the provided sample() function, and assign the result to the variable sample_df.
- Convert the subset dataframe to a pandas dataframe pandas_df, and use pandas isnull() to convert it into a True/False DataFrame. Store this result in tf_df.
- Use seaborn's heatmap() to plot tf_df.
- Hit "Run Code" to view the plot. Then assign the name of the variable with most missing values to answer.

In [15]:
columns = ['APPLIANCES',
 'BACKONMARKETDATE',
 'ROOMFAMILYCHAR',
 'BASEMENT',
 'DININGROOMDESCRIPTION']

In [None]:
# Sample the dataframe and convert to Pandas
sample_df = df.select(columns).sample(False, 0.1, 42)
pandas_df = sample_df.toPandas()

# Convert all values to T/F
tf_df = pandas_df.isnull()

# Plot it
sns.heatmap(data=tf_df)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()

# Set the answer to the column with the most missing data
answer = 'BACKONMARKETDATE'

<img src="images/spark4_043.png" alt="" style="width: 800px;"/>

Visuals like this can help you to quickly eliminate variables that provide no value to your analysis.

## Imputing Missing Data

Missing data happens. If we assume that our data is missing completely at random, we are assuming that the data we do have is a good representation of the population. If only a few values are missing, we could remove those records or use the mean or median as a replacement. In this exercise, we will look at 'PDOM': Days on Market at Current Price.

- Get a count of the missing values in the column 'PDOM' using where(), isNull() and count().
- Calculate the mean value of 'PDOM' using the aggregate function mean().
- Use fillna() with the value set to the 'PDOM' mean value and only apply it to the column 'PDOM' using the subset parameter.

In [None]:
# Count missing rows
missing = df.where(df['PDOM'].isNull()).count()

# Calculate the mean value
col_mean = df.agg({'PDOM': 'mean'}).collect()[0][0]

# Replace missing values with the mean for that column.
# fillna() returns a new dataframe, so assign the result back to df
df = df.fillna(col_mean, subset=['PDOM'])

Missing value replacement is easy, however its ramifications can be huge. Make sure to spend time considering the appropriate ways to handle missing data in your problems.
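
Two details of mean imputation are easy to miss: the mean is computed over the non-missing values only, and the fill produces a new object rather than changing the original. The plain-Python sketch below shows both on a made-up list, with `None` standing in for null. The helper name `fill_with_mean` is hypothetical.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    col_mean = sum(present) / len(present)
    return [col_mean if v is None else v for v in values], col_mean


# Made-up PDOM values with two gaps
pdom = [10, None, 30, 50, None]
filled, col_mean = fill_with_mean(pdom)
print(col_mean, filled)
print(pdom)   # the original list still contains its gaps
```

Every imputed row receives the same value, which shrinks the variance of the column; that is one of the ramifications to weigh before choosing this approach.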

## Calculate Missing Percents

Automation is the future of data science. Learning to automate some of your data preparation pays dividends. In this exercise, we will automate dropping columns if they are missing data beyond a specific threshold.

- Define a function column_dropper() that takes the parameters df, a dataframe, and threshold, a float between 0 and 1.
- Calculate the percentage of values that are missing using where(), isNull() and count().
- Check whether the percentage missing is higher than the threshold; if so, drop the column using drop().
- Run column_dropper() on df with the threshold set to .6

In [None]:
def column_dropper(df, threshold):
  # Takes a dataframe and threshold for missing values. Returns a dataframe.
  total_records = df.count()
  for col in df.columns:
    # Calculate the percentage of missing values
    missing = df.where(df[col].isNull()).count()
    missing_percent = missing / total_records
    # Drop column if percent of missing is more than threshold
    if missing_percent > threshold:
      df = df.drop(col)
  return df

# Drop columns that are more than 60% missing
df = column_dropper(df, 0.6)
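
The threshold comparison is strict, so a column that is missing exactly 60% of its values is kept. The plain-Python sketch below applies the same rule to a made-up five-row table stored as a dictionary of lists. The function name `columns_to_keep` and the toy values are assumptions for illustration.

```python
def columns_to_keep(table, threshold):
    """Return the names of columns whose missing fraction does not exceed threshold."""
    keep = []
    for name, values in table.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        # Strict '>' mirrors the drop rule: a column exactly at the threshold is kept
        if not missing_fraction > threshold:
            keep.append(name)
    return keep


# Made-up five-row table
toy = {
    "BASEMENT": ["Full", "None", "Full", "Partial", "Full"],     # 0% missing
    "APPLIANCES": ["Range", None, None, None, "Dryer"],          # 60% missing
    "BACKONMARKETDATE": [None, None, None, None, "2017-05-01"],  # 80% missing
}
print(columns_to_keep(toy, 0.6))
```

If a column at exactly the threshold should be dropped as well, the comparison would need to be `>=` instead.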

## Getting More Data

<img src="images/spark4_044.png" alt="" style="width: 800px;"/>

<img src="images/spark4_045.png" alt="" style="width: 800px;"/>

<img src="images/spark4_046.png" alt="" style="width: 800px;"/>

<img src="images/spark4_047.png" alt="" style="width: 800px;"/>

<img src="images/spark4_048.png" alt="" style="width: 800px;"/>

## A Dangerous Join

In this exercise, we will be joining on Latitude and Longitude to bring in another dataset that measures how walk-friendly a neighborhood is. We'll need to be careful to make sure our joining columns are the same data type and ensure we are joining on the same precision (number of digits after the decimal) or our join won't work!

Below you will find that df['latitude'] and df['longitude'] are at a higher precision than walk_df['longitude'] and walk_df['latitude'], so we'll need to round them to the same precision for the join to work correctly.

- Convert walk_df['latitude'] and walk_df['longitude'] to type double by using cast('double') on the column and replacing the column in place withColumn().
- Round the columns in place with withColumn() and round('latitude', 5) and round('longitude', 5).
- Create the join condition of walk_df['latitude'] matching df['latitude'] and walk_df['longitude'] matching df['longitude'].
- Join df and walk_df together with join(), using the condition above and the left join type. Save the joined dataframe as join_df.

In [None]:
# Cast data types
walk_df = walk_df.withColumn('longitude', walk_df['longitude'].cast('double'))
walk_df = walk_df.withColumn('latitude', walk_df['latitude'].cast('double'))

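# Round precision (this needs Spark's round(), not Python's built-in round())
from pyspark.sql.functions import round
df = df.withColumn('longitude', round('longitude', 5))
df = df.withColumn('latitude', round('latitude', 5))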

# Create join condition
condition = [walk_df['latitude'] == df['latitude'], walk_df['longitude'] == df['longitude']]

# Join the dataframes together
join_df = df.join(walk_df, on=condition, how='left')
# Count non-null records from new field
print(join_df.where(~join_df['walkscore'].isNull()).count())
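
Why precision matters for an equality join can be shown with a tiny plain-Python sketch. The coordinates and the helper key() are hypothetical; the point is only that two descriptions of the same location compare unequal until both are rounded to the same number of digits.

```python
# Hypothetical coordinates: the listing data carries more decimals
# than the walk-score data for the same physical location
listing = {"latitude": 44.95431234, "longitude": -92.95678912}
walk = {"latitude": 44.95431, "longitude": -92.95679, "walkscore": 62}

def key(record, digits=None):
    """Build a (latitude, longitude) join key, optionally rounded."""
    lat, lon = record["latitude"], record["longitude"]
    if digits is not None:
        lat, lon = round(lat, digits), round(lon, digits)
    return (lat, lon)

raw_match = key(listing) == key(walk)            # full precision vs 5 digits
rounded_match = key(listing, 5) == key(walk, 5)  # both at 5 digits
print(raw_match, rounded_match)
```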

## Spark SQL Join

Sometimes it is much easier to write complex joins in SQL. In this exercise, we will start with the join keys already in the same format and precision but will use SparkSQL to do the joining.

- Register the dataframes as SparkSQL tables with createOrReplaceTempView(), naming them df and walk_df respectively.
- In the join_sql string, set the left table to df and the right table to walk_df.
- Call spark.sql() on the join_sql string to perform the join.

In [None]:
# Register dataframes as tables
df.createOrReplaceTempView('df')
walk_df.createOrReplaceTempView('walk_df')

# SQL to join dataframes
join_sql = 	"""
			SELECT 
				*
			FROM df
			LEFT JOIN walk_df
			ON df.longitude = walk_df.longitude
			AND df.latitude = walk_df.latitude
			"""
# Perform sql join
joined_df = spark.sql(join_sql)

## Checking for Bad Joins

Joins can go bad silently if we are not careful, meaning they will not error out but instead return mangled data with more or fewer records than you'd intended. Let's take a look at a couple of ways that joining incorrectly can change your data set for the worse.

In this example we will look at what happens if you join two dataframes together when the join keys are not the same precision and compare the record counts between the correct join and the incorrect one.

- Create a join between df_orig, the dataframe before its precision was corrected, and walk_df that matches on longitude and latitude in the respective dataframes.
- Count the number of missing values with where() and isNull() on wrong_prec_df['walkscore'] and correct_join_df['walkscore']. You should notice that the first has many more missing values because the precision of the join keys does not match.
- Create a join between df and walk_df that only matches on longitude.
- Count the number of records with count() on few_keys_df and correct_join_df. You should notice that there are many more records because we have not constrained our matching correctly.

In [None]:
# Join on mismatched keys precision 
wrong_prec_cond = [df_orig['longitude'] == walk_df['longitude'], df_orig['latitude'] == walk_df['latitude']]
wrong_prec_df = df_orig.join(walk_df, on=wrong_prec_cond, how='left')

# Compare bad join to the correct one
print(wrong_prec_df.where(wrong_prec_df['walkscore'].isNull()).count())
print(correct_join_df.where(correct_join_df['walkscore'].isNull()).count())

# Create a join on too few keys
few_keys_cond = [df['longitude'] == walk_df['longitude']]
few_keys_df = df.join(walk_df, on=few_keys_cond, how='left')

# Compare bad join to the correct one
print("Record Count of the Too Few Keys Join Example: " + str(few_keys_df.count()))
print("Record Count of the Correct Join Example: " + str(correct_join_df.count()))
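
The row inflation from joining on too few keys can be reproduced with a naive left join in plain Python. This is only an illustrative sketch: left_join() and the sample rows are hypothetical and not part of the course data.

```python
def left_join(left, right, keys):
    """Naive left join of two lists of dicts on the given key names."""
    joined = []
    for l_row in left:
        matches = [r for r in right if all(l_row[k] == r[k] for k in keys)]
        if matches:
            joined.extend({**l_row, **r} for r in matches)
        else:
            joined.append(dict(l_row))  # keep unmatched left rows
    return joined

# Hypothetical data: two homes and three walk-score points share a longitude
homes = [
    {"id": 1, "latitude": 44.95, "longitude": -92.95},
    {"id": 2, "latitude": 44.97, "longitude": -92.95},
]
walk = [
    {"latitude": 44.95, "longitude": -92.95, "walkscore": 60},
    {"latitude": 44.97, "longitude": -92.95, "walkscore": 35},
    {"latitude": 44.99, "longitude": -92.95, "walkscore": 10},
]

correct = left_join(homes, walk, ["latitude", "longitude"])
too_few = left_join(homes, walk, ["longitude"])
print(len(homes), len(correct), len(too_few))
```

A quick sanity check after any left join is to compare the joined record count with the count of the left table; a larger number means some left rows matched more than once.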

---
<a id='feat'></a>

## Feature Engineering

<img src="images/spark4_049.png" alt="" style="width: 800px;"/>

<img src="images/spark4_050.png" alt="" style="width: 800px;"/>

<img src="images/spark4_051.png" alt="" style="width: 800px;"/>

<img src="images/spark4_052.png" alt="" style="width: 800px;"/>

<img src="images/spark4_053.png" alt="" style="width: 800px;"/>

## Differences

Let's explore generating features using existing ones. In the midwest of the U.S. many single family homes have extra land around them for green space. In this example you will create a new feature called 'YARD_SIZE', and then see if the new feature is correlated with our outcome variable.

- Create a new column using withColumn() called LOT_SIZE_SQFT and convert ACRES to square feet by multiplying by acres_to_sqfeet the conversion factor.
- Create another new column called YARD_SIZE by subtracting FOUNDATIONSIZE from LOT_SIZE_SQFT.
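- Run corr() on each of the independent variables YARD_SIZE, FOUNDATIONSIZE and LOT_SIZE_SQFT against the dependent variable SALESCLOSEPRICE. Does the new feature show a stronger correlation than either of its components? (The code below uses ACRES in place of LOT_SIZE_SQFT; since one is a constant multiple of the other, the correlation is the same.)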

In [None]:
# Lot size in square feet
acres_to_sqfeet = 43560
df = df.withColumn('LOT_SIZE_SQFT', df['ACRES'] * acres_to_sqfeet)

# Create new column YARD_SIZE
df = df.withColumn('YARD_SIZE', df['LOT_SIZE_SQFT'] - df['FOUNDATIONSIZE'])

# Corr of ACRES vs SALESCLOSEPRICE
print("Corr of ACRES vs SALESCLOSEPRICE: " + str(df.corr('ACRES', 'SALESCLOSEPRICE')))
# Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE
print("Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE: " + str(df.corr('FOUNDATIONSIZE', 'SALESCLOSEPRICE')))
# Corr of YARD_SIZE vs SALESCLOSEPRICE
print("Corr of YARD_SIZE vs SALESCLOSEPRICE: " + str(df.corr('YARD_SIZE', 'SALESCLOSEPRICE')))

```
<script.py> output:
    Corr of ACRES vs SALESCLOSEPRICE: 0.22060612588935327
    Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE: 0.6152231695664401
    Corr of YARD_SIZE vs SALESCLOSEPRICE: 0.20714585430854268
```
Not all generated features are worthwhile; many are not, but it's still worth trying! Most likely this is because there isn't enough variation in lot sizes in the neighborhoods we are looking at to create a strong feature. In addition, if we look at our data, some of the homes have 0 ACRES, which can make YARD_SIZE negative. If we really wanted to handle this correctly, we would have to set the minimum YARD_SIZE to 0.
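
Flooring the yard size at zero could look like the following plain-Python sketch. The function yard_size() and the sample homes are hypothetical; only the conversion factor comes from the exercise.

```python
ACRES_TO_SQFT = 43560  # same conversion factor as in the exercise

def yard_size(acres, foundation_sqft):
    """Lot area minus foundation area, floored at zero."""
    lot_sqft = acres * ACRES_TO_SQFT
    return max(0.0, lot_sqft - foundation_sqft)

# Hypothetical homes: (ACRES, FOUNDATIONSIZE)
homes = [(0.25, 1200), (0.0, 980), (1.0, 2000)]
yards = [yard_size(a, f) for a, f in homes]
print(yards)
```

In PySpark, one way to apply the same floor would be greatest() from pyspark.sql.functions, comparing the difference with a literal zero.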

## Ratios

Ratios are all around us. Whether it's miles per gallon or click through rate, they are everywhere. In this exercise, we'll create some ratios by dividing out pairs of columns.

- Create a new variable ASSESSED_TO_LIST by dividing ASSESSEDVALUATION by LISTPRICE to help us understand if having a high or low assessment value impacts our price.
- Create another new variable TAX_TO_LIST to help us understand the approximate tax rate by dividing TAXES by LISTPRICE.
- Lastly create another variable BED_TO_BATHS to help us know how crowded our bathrooms might be by dividing BEDROOMS by BATHSTOTAL.

In [None]:
# ASSESSED_TO_LIST
df = df.withColumn('ASSESSED_TO_LIST', df['ASSESSEDVALUATION'] / df['LISTPRICE'])
df[['ASSESSEDVALUATION', 'LISTPRICE', 'ASSESSED_TO_LIST']].show(5)
# TAX_TO_LIST
df = df.withColumn('TAX_TO_LIST', df['TAXES'] / df['LISTPRICE'])
df[['TAX_TO_LIST', 'TAXES', 'LISTPRICE']].show(5)
# BED_TO_BATHS
df = df.withColumn('BED_TO_BATHS', df['BEDROOMS'] / df['BATHSTOTAL'])
df[['BED_TO_BATHS', 'BEDROOMS', 'BATHSTOTAL']].show(5)

```
<script.py> output:
    +-----------------+---------+----------------+
    |ASSESSEDVALUATION|LISTPRICE|ASSESSED_TO_LIST|
    +-----------------+---------+----------------+
    |              0.0|   139900|             0.0|
    |              0.0|   210000|             0.0|
    |              0.0|   225000|             0.0|
    |              0.0|   230000|             0.0|
    |              0.0|   239900|             0.0|
    +-----------------+---------+----------------+
    only showing top 5 rows
    
    +--------------------+-----+---------+
    |         TAX_TO_LIST|TAXES|LISTPRICE|
    +--------------------+-----+---------+
    |0.013280914939242315| 1858|   139900|
    | 0.00780952380952381| 1640|   210000|
    |0.010622222222222222| 2390|   225000|
    |0.009330434782608695| 2146|   230000|
    |0.008378491037932471| 2010|   239900|
    +--------------------+-----+---------+
    only showing top 5 rows
    
    +------------------+--------+----------+
    |      BED_TO_BATHS|BEDROOMS|BATHSTOTAL|
    +------------------+--------+----------+
    |               1.5|       3|         2|
    |1.3333333333333333|       4|         3|
    |               2.0|       2|         1|
    |               1.0|       2|         2|
    |               1.5|       3|         2|
    +------------------+--------+----------+
    only showing top 5 rows
```
Well done, we've created some great ratios to use in our model that people looking at homes might be considering! Often, rather than just hoping that features will be important and trying them all by brute force, it's more worthwhile to talk to someone who knows the context to get ideas!

## Deeper Features

In previous exercises we showed how combining two features together can create good additional features for a predictive model. In this exercise, you will generate 'deeper' features by combining the effects of three variables into one. Then you will check to see if deeper and more complicated features always make for better predictors.

- Create a new feature by adding SQFTBELOWGROUND and SQFTABOVEGROUND and creating a new column Total_SQFT
- Using Total_SQFT, create yet another feature called BATHS_PER_1000SQFT with BATHSTOTAL. Be sure to scale Total_SQFT to 1000's
- Use describe() to inspect the new min, max and mean of our newest feature BATHS_PER_1000SQFT. Notice anything strange?
- Create two jointplot()s with Total_SQFT and BATHS_PER_1000SQFT as the x values and SALESCLOSEPRICE as the y value to see which has the better R**2 fit. Does this more complicated feature have a stronger relationship with SALESCLOSEPRICE?

In [None]:
# Create new feature by adding two features together
df = df.withColumn('Total_SQFT', df['SQFTBELOWGROUND'] + df['SQFTABOVEGROUND'])

# Create additional new feature using previously created feature
df = df.withColumn('BATHS_PER_1000SQFT', df['BATHSTOTAL'] / (df['Total_SQFT'] / 1000))
df[['BATHS_PER_1000SQFT']].describe().show()

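# Sample and create pandas dataframe
pandas_df = df.sample(False, 0.5, 0).toPandas()

# R**2 of each feature against the sale price (squared Pearson correlation)
for feature in ['Total_SQFT', 'BATHS_PER_1000SQFT']:
    r_squared = pandas_df[feature].corr(pandas_df['SALESCLOSEPRICE']) ** 2
    print(feature + " R**2: " + str(r_squared))

# Linear model plots
# (recent seaborn releases no longer accept stat_func, so R**2 is computed separately above)
sns.jointplot(x='Total_SQFT', y='SALESCLOSEPRICE', data=pandas_df, kind="reg")
plt.show()
sns.jointplot(x='BATHS_PER_1000SQFT', y='SALESCLOSEPRICE', data=pandas_df, kind="reg")
plt.show()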

```
<script.py> output:
    +-------+-------------------+
    |summary| BATHS_PER_1000SQFT|
    +-------+-------------------+
    |  count|               5000|
    |   mean| 1.4302617483739894|
    | stddev|  14.12890410245937|
    |    min|0.39123630672926446|
    |    max|             1000.0|
    +-------+-------------------+
```
<img src="images/spark4_054.png" alt="" style="width: 800px;"/>

<img src="images/spark4_055.png" alt="" style="width: 800px;"/>

Using the describe() function you could have seen there was a max of 1000 bathrooms per 1000 sqft, which is almost certainly an issue with our data, since no sane person would need a bathroom for every square foot! If you really wanted to use this feature you'd have to filter that outlier out or overwrite it to NULL with when(). After plotting the jointplot()s you should have seen that the less complicated feature Total_SQFT had a much better R**2 of .67 vs BATHS_PER_1000SQFT's .02. Often simpler is better!
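
As a rough illustration of overwriting suspicious rows with a null, here is a plain-Python sketch. The cut-off min_sqft is an assumption made up for this example, not a value from the course, and the function name is hypothetical.

```python
def baths_per_1000sqft(baths, total_sqft, min_sqft=100):
    """Ratio of bathrooms to living area in thousands of square feet.

    Returns None when total_sqft is missing or below min_sqft, since
    such rows are more likely data-entry errors than real homes.
    min_sqft is an illustrative cut-off, not a value from the course.
    """
    if total_sqft is None or total_sqft < min_sqft:
        return None
    return baths / (total_sqft / 1000)

print(baths_per_1000sqft(1, 1))      # suspicious row is nulled out
print(baths_per_1000sqft(2, 2000))   # ordinary home
print(1 / (1 / 1000))                # one way a ratio of 1000.0 can arise
```

A single bathroom recorded against one square foot is one combination that would produce the extreme ratio, which is why inspecting min and max with describe() is worthwhile before trusting a derived feature.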

## Time Features

<img src="images/spark4_056.png" alt="" style="width: 800px;"/>

<img src="images/spark4_057.png" alt="" style="width: 800px;"/>

<img src="images/spark4_058.png" alt="" style="width: 800px;"/>

<img src="images/spark4_059.png" alt="" style="width: 800px;"/>

<img src="images/spark4_060.png" alt="" style="width: 800px;"/>

<img src="images/spark4_061.png" alt="" style="width: 800px;"/>

<img src="images/spark4_062.png" alt="" style="width: 800px;"/>

<img src="images/spark4_063.png" alt="" style="width: 800px;"/>

## Time Components

Being able to work with time components for building features is important, but you can also use them to explore and understand your data further. In this exercise, you'll be looking to see if there is a pattern to which day of the week a house lists on. Please keep in mind that PySpark's week starts on Sunday, with a value of 1, and ends on Saturday, with a value of 7.

- Import to_date() and dayofweek() functions from pyspark.sql.functions
- Use the to_date() function to convert LISTDATE to a Spark date type, save the converted column in place using withColumn()
- Create a new column using LISTDATE and dayofweek() then save it as List_Day_of_Week using withColumn()
- Sample half the dataframe and convert it to a pandas dataframe with toPandas() and plot the count of the pandas dataframe's List_Day_of_Week column by using seaborn countplot() where x = List_Day_of_Week.

In [None]:
# Import needed functions
from pyspark.sql.functions import to_date, dayofweek

# Convert to date type
df = df.withColumn('LISTDATE', to_date('LISTDATE'))

# Get the day of the week
df = df.withColumn('List_Day_of_Week', dayofweek('LISTDATE'))

# Sample and convert to pandas dataframe
sample_df = df.sample(False, 0.5, 42).toPandas()

# Plot count plot of day of week
sns.countplot(x="List_Day_of_Week", data=sample_df)
plt.show()

<img src="images/spark4_064.png" alt="" style="width: 800px;"/>

Using these time components and some visualization techniques from earlier, we can see it's pretty unlikely for a home to be listed on the weekend (values 1 and 7).
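
The Sunday=1 numbering is easy to mix up with Python's own conventions. The sketch below, using only the standard library, converts Python's isoweekday() (Monday=1 to Sunday=7) to the Sunday=1 to Saturday=7 numbering described above; the helper names are hypothetical.

```python
from datetime import date

def spark_dayofweek(d):
    """Day number under the Sunday=1 ... Saturday=7 convention."""
    # isoweekday() is Monday=1 ... Sunday=7, so shift by one and wrap
    return d.isoweekday() % 7 + 1

def is_weekend(d):
    """True for Sunday (1) and Saturday (7)."""
    return spark_dayofweek(d) in (1, 7)

print(spark_dayofweek(date(2017, 7, 15)))  # a Saturday
print(spark_dayofweek(date(2017, 7, 16)))  # a Sunday
print(spark_dayofweek(date(2017, 7, 17)))  # a Monday
```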

## Joining On Time Components

Often you will use date components to join in other sets of information. However, in this example, we need to use data that would have been available to those considering buying a house. This means we will need to use the previous year's reporting data for our analysis.

- Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
- Create another new column called report_year by subtracting 1 from the list_year
- Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
- Perform a left join between df and price_df

In [None]:
from pyspark.sql.functions import year

# Initialize dataframes
df = real_estate_df
price_df = median_prices_df

# Create year column
df = df.withColumn('list_year', year('LISTDATE'))

# Adjust year to match
df = df.withColumn('report_year', (df['list_year'] - 1))

# Create join condition
condition = [df['CITY'] == price_df['City'], df['report_year'] == price_df['Year']]

# Join the dataframes together
df = df.join(price_df, on=condition, how='left')
# Inspect that new columns are available
df[['MedianHomeValue']].show()

```
<script.py> output:
    +---------------+
    |MedianHomeValue|
    +---------------+
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    |         401000|
    +---------------+
    only showing top 20 rows
```
You can see how easy it is to join data that is reported out at different intervals to use in your data. You also can see how easy it is to use data that would not have been available at the time of someone buying a home; a form of data leakage.
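
The idea of joining on the previous year's report can be sketched in plain Python as a dictionary lookup. The city, years and prices below are hypothetical and only show how shifting the year avoids using a report that was not yet published.

```python
# Hypothetical median prices keyed by (city, report year)
median_prices = {
    ("Stillwater", 2016): 300000,
    ("Stillwater", 2017): 320000,
}

def prior_year_median(city, list_year):
    """Look up the median value reported the year before listing."""
    report_year = list_year - 1
    return median_prices.get((city, report_year))  # None if no match

print(prior_year_median("Stillwater", 2017))  # uses the 2016 report
print(prior_year_median("Stillwater", 2016))  # no 2015 report available
```

Looking up the listing year directly would return the 2017 figure for a 2017 listing, which is exactly the leakage described above.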

## Date Math

In this example, we'll look at verifying the frequency of our data. The Mortgage dataset is supposed to have weekly data but let's make sure by lagging the report date and then taking the difference of the dates.

Recall that to create a lagged feature we need to define a Window. A window lets you return a value for each record based on some calculation against a group of records, in this case the date of the previous report.

- Cast mort_df['DATE'] to date type with to_date().
- Create a window with Window and use orderBy() to sort by mort_df['DATE'].
- Create a new column DATE-1 using withColumn() by lagging the DATE column with lag() and window it using over(w).
- Calculate the difference between DATE and DATE-1 using datediff() and name it Days_Between_Report.

In [None]:
from pyspark.sql.functions import lag, datediff, to_date
from pyspark.sql.window import Window

# Cast data type
mort_df = mort_df.withColumn('DATE', to_date('DATE'))

# Create window
w = Window().orderBy(mort_df['DATE'])
# Create lag column
mort_df = mort_df.withColumn('DATE-1', lag('DATE', count=1).over(w))

# Calculate difference between date columns
mort_df = mort_df.withColumn('Days_Between_Report', datediff('DATE', 'DATE-1'))
# Print results
mort_df.select('Days_Between_Report').distinct().show()

```
<script.py> output:
    +-------------------+
    |Days_Between_Report|
    +-------------------+
    |               null|
    |                  7|
    |                  6|
    |                  8|
    +-------------------+
```
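The distinct gaps are 6, 7 and 8 days, plus a null for the first record, which has no previous date. So the mortgage rate data set is reported roughly weekly, with an occasional report shifted by a day.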
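
The lag-and-difference pattern can be mimicked in plain Python with the standard library. The report dates below are hypothetical; they are chosen so that one report is moved a day early, which produces a 6-day gap followed by an 8-day gap.

```python
from datetime import date

# Hypothetical report dates: weekly, with one report moved a day early
report_dates = [
    date(2017, 11, 9),
    date(2017, 11, 16),
    date(2017, 11, 22),  # moved up by one day
    date(2017, 11, 30),
]

# Pair each date with its predecessor; the first date has none
previous = [None] + report_dates[:-1]
days_between = [
    None if prev is None else (cur - prev).days
    for cur, prev in zip(report_dates, previous)
]
print(days_between)
```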

## Extracting Features

<img src="images/spark4_065.png" alt="" style="width: 800px;"/>

<img src="images/spark4_066.png" alt="" style="width: 800px;"/>

<img src="images/spark4_067.png" alt="" style="width: 800px;"/>

<img src="images/spark4_068.png" alt="" style="width: 800px;"/>

<img src="images/spark4_069.png" alt="" style="width: 800px;"/>

<img src="images/spark4_070.png" alt="" style="width: 800px;"/>

<img src="images/spark4_071.png" alt="" style="width: 800px;"/>

## Extracting Text to New Features

Garages are an important consideration for houses in Minnesota, where most people own a car and the snow is annoying to clear off a car parked outside. The type of garage is also important: can you get to your car without braving the cold or not? Let's look at creating a feature has_attached_garage that captures whether the garage is attached to the house or not.

- Import the needed function when() from pyspark.sql.functions.
- Create a string matching condition using like() to look for the string pattern Attached Garage in df['GARAGEDESCRIPTION'] and use wildcards % so it will match anywhere in the field.
- Similarly, create another condition using like() to find the string pattern Detached Garage in df['GARAGEDESCRIPTION'] and use wildcards % so it will match anywhere in the field.
- Create a new column has_attached_garage using when() to assign the value 1 if it has an attached garage and 0 if detached, and use otherwise() to assign null with None if it is neither.

In [None]:
# Import needed functions
from pyspark.sql.functions import when

# Create boolean conditions for string matches
has_attached_garage = df['GARAGEDESCRIPTION'].like('%Attached Garage%')
has_detached_garage = df['GARAGEDESCRIPTION'].like('%Detached Garage%')

# Conditional value assignment 
df = df.withColumn('has_attached_garage', (when(has_attached_garage, 1)
                                          .when(has_detached_garage, 0)
                                          .otherwise(None)))

# Inspect results
df[['GARAGEDESCRIPTION', 'has_attached_garage']].show(truncate=100)

```
<script.py> output:
    +------------------------------------------------------------------+-------------------+
    |                                                 GARAGEDESCRIPTION|has_attached_garage|
    +------------------------------------------------------------------+-------------------+
    |                                                   Attached Garage|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |                                                   Attached Garage|                  1|
    |    Attached Garage, Detached Garage, Tuckunder, Driveway - Gravel|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |                               Attached Garage, Driveway - Asphalt|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |                                                   Attached Garage|                  1|
    |                                                   Attached Garage|                  1|
    |                                                   Attached Garage|                  1|
    |                                                   Attached Garage|                  1|
    |                                                   Attached Garage|                  1|
    |                                                   Attached Garage|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |Attached Garage, Tuckunder, Driveway - Asphalt, Garage Door Opener|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    |           Attached Garage, Driveway - Asphalt, Garage Door Opener|                  1|
    +------------------------------------------------------------------+-------------------+
    only showing top 20 rows
```
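By extracting important string values and conditionally assigning values, we've created an interesting feature to use! Note that a home listing both an attached and a detached garage is assigned 1, because the when() conditions are evaluated in order.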
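
The same first-match-wins logic, written as a plain-Python sketch with a hypothetical helper garage_flag(), shows how each description is classified:

```python
def garage_flag(description):
    """1 if attached, 0 if detached only, None otherwise."""
    if description is None:
        return None
    if "Attached Garage" in description:   # like('%Attached Garage%')
        return 1
    if "Detached Garage" in description:   # like('%Detached Garage%')
        return 0
    return None

print(garage_flag("Attached Garage, Detached Garage, Tuckunder"))
print(garage_flag("Detached Garage, Driveway - Asphalt"))
print(garage_flag("On-Street Parking Only"))
```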

## Splitting & Exploding

Taking a compound field like GARAGEDESCRIPTION and massaging it into something useful is an involved process. It's helpful to understand early what value you might gain out of expanding it. In this example, we will convert our string to a list-like array, explode it and then inspect the unique values.

- Import the needed functions split() and explode() from pyspark.sql.functions
- Use split() to create a new column garage_list by splitting df['GARAGEDESCRIPTION'] on ', ' which is both a comma and a space.
- Create a new record for each value in the df['garage_list'] using explode() and assign it a new column ex_garage_list
- Use distinct() to get unique values of ex_garage_list and show the first 100 rows, truncating them at 50 characters to display the values.

In [None]:
# Import needed functions
from pyspark.sql.functions import split, explode

# Convert string to list-like array
df = df.withColumn('garage_list', split(df['GARAGEDESCRIPTION'], ', '))

# Explode the values into new records
ex_df = df.withColumn('ex_garage_list', explode(df['garage_list']))

# Inspect the values
ex_df[['ex_garage_list']].distinct().show(100, truncate=50)

```
<script.py> output:
    +----------------------------+
    |              ex_garage_list|
    +----------------------------+
    |             Attached Garage|
    |      On-Street Parking Only|
    |                        None|
    | More Parking Onsite for Fee|
    |          Garage Door Opener|
    |   No Int Access to Dwelling|
    |           Driveway - Gravel|
    |       Valet Parking for Fee|
    |              Uncovered/Open|
    |               Heated Garage|
    |          Underground Garage|
    |                       Other|
    |                  Unassigned|
    |More Parking Offsite for Fee|
    |    Driveway - Other Surface|
    |       Contract Pkg Required|
    |                     Carport|
    |                     Secured|
    |             Detached Garage|
    |          Driveway - Asphalt|
    |                  Units Vary|
    |                    Assigned|
    |                   Tuckunder|
    |                     Covered|
    |            Insulated Garage|
    |         Driveway - Concrete|
    |                      Tandem|
    |           Driveway - Shared|
    +----------------------------+
```
Looking at the values, there is a decent number of them here but not hundreds. If you have too many, pivoting them can make your dataset a mess.
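
What split and explode do to the data can be seen in a small plain-Python sketch with hypothetical rows: each description becomes a list, and each list item becomes its own record paired with the original id.

```python
# Hypothetical rows: (record id, GARAGEDESCRIPTION)
rows = [
    (1, "Attached Garage, Driveway - Asphalt"),
    (2, "Detached Garage"),
    (3, "Attached Garage, Tuckunder"),
]

# split: turn each string into a list; explode: one output row per item
exploded = [
    (record_id, item)
    for record_id, description in rows
    for item in description.split(", ")
]

distinct_values = sorted({item for _, item in exploded})
print(len(exploded), distinct_values)
```

Three input rows become five exploded rows, so any counts taken after exploding refer to attributes, not homes.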

## Pivot & Join

Being able to explode and pivot a compound field is great, but you are left with a dataframe of only those pivoted values. To really be valuable you'll need to rejoin it to the original dataset! After joining the datasets we will have a lot of NULL values in the newly created columns. Since we know the context of how they were created, we can safely fill them in with zero, as either the home has an attribute or it doesn't.

- Pivot the values of ex_garage_list by grouping by the record id NO with groupBy(), then use the provided code to aggregate constant_val, ignoring nulls and taking the first value.
- Left join piv_df to df using NO as the join condition.
- Create the list of columns, zfill_cols, to zero fill by using the columns attribute on piv_df
- Zero fill the pivoted dataframes columns, zfill_cols, by using fillna() with a subset.

In [None]:
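from pyspark.sql.functions import coalesce, first, lit

# Flag column aggregated in the pivot (assumed here to be a constant 1 marking that the attribute is present)
ex_df = ex_df.withColumn('constant_val', lit(1))

# Pivot 
piv_df = ex_df.groupBy('NO').pivot('ex_garage_list').agg(coalesce(first('constant_val')))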

# Join the dataframes together and fill null
joined_df = df.join(piv_df, on='NO', how='left')

# Columns to zero fill
zfill_cols = piv_df.columns

# Zero fill the pivoted values
zfilled_df = joined_df.fillna(0, subset=zfill_cols)

You now have a bunch of boolean columns created from the single compound field. Hopefully some of these will be valuable in our model!
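
The pivot, rejoin and zero-fill steps amount to building one 0/1 indicator per attribute for every record. A plain-Python sketch with hypothetical ids and attributes:

```python
# Hypothetical exploded rows: (record id, garage attribute)
exploded = [
    (1, "Attached Garage"),
    (1, "Driveway - Asphalt"),
    (2, "Detached Garage"),
]
all_ids = [1, 2, 3]  # record 3 had no garage description at all

attributes = sorted({attr for _, attr in exploded})
present = set(exploded)

# One row per id, one 0/1 column per attribute; absent pairs become 0
pivoted = {
    record_id: {attr: int((record_id, attr) in present) for attr in attributes}
    for record_id in all_ids
}
print(pivoted[1])
print(pivoted[3])
```

Record 3 never appears in the exploded rows, which mirrors the NULLs produced by the left join before they are zero filled.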

## Binarizing, Bucketing & Encoding

<img src="images/spark4_072.png" alt="" style="width: 800px;"/>

<img src="images/spark4_073.png" alt="" style="width: 800px;"/>

<img src="images/spark4_074.png" alt="" style="width: 800px;"/>

<img src="images/spark4_075.png" alt="" style="width: 800px;"/>

<img src="images/spark4_076.png" alt="" style="width: 800px;"/>

<img src="images/spark4_077.png" alt="" style="width: 800px;"/>

## Binarizing Day of Week

It is very unlikely for a home to list on the weekend. Let's create a new field that says whether the house was listed for sale on a weekday or not. In this example there is a field called List_Day_of_Week in which Monday is labeled 1.0 and Sunday is 7.0 (note that this differs from the dayofweek() numbering used earlier). Let's convert this to a binary field with weekday being 0 and weekend being 1. We can use the pyspark feature transformer Binarizer to do this.

- Import the feature transformer Binarizer from pyspark's ml.feature module.
- Create the transformer using Binarizer() with the threshold for setting the value to 1 as anything after Friday, 5.0, then set the input column as List_Day_of_Week and output column as Listed_On_Weekend.
- Apply the binarizer transformation on df using transform().
- Verify the transformation worked correctly by selecting the List_Day_of_Week and Listed_On_Weekend columns with show().

In [None]:
# Import transformer
from pyspark.ml.feature import Binarizer

# Create the transformer
binarizer = Binarizer(threshold=5.0, inputCol='List_Day_of_Week', outputCol='Listed_On_Weekend')

# Apply the transformation to df
df = binarizer.transform(df)

# Verify transformation
df[['List_Day_of_Week', 'Listed_On_Weekend']].show()

```
<script.py> output:
    +----------------+-----------------+
    |List_Day_of_Week|Listed_On_Weekend|
    +----------------+-----------------+
    |             6.0|              1.0|
    |             1.0|              0.0|
    |             1.0|              0.0|
    |             5.0|              1.0|
    |             2.0|              1.0|
    |             1.0|              0.0|
    |             4.0|              1.0|
    |             7.0|              1.0|
    |             4.0|              1.0|
    |             6.0|              1.0|
    |             5.0|              1.0|
    |             4.0|              1.0|
    |             7.0|              1.0|
    |             1.0|              0.0|
    |             4.0|              1.0|
    |             7.0|              1.0|
    |             7.0|              1.0|
    |             5.0|              1.0|
    |             6.0|              1.0|
    |             5.0|              1.0|
    +----------------+-----------------+
    only showing top 20 rows
```
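Transforming features with Binarizer is helpful in creating more powerful features, in both explainability of your model and performance. Keep in mind that Binarizer sets values strictly greater than the threshold to 1.0, so with threshold=5.0 only 6.0 and 7.0 are flagged as weekend. The sample output above flags every value above 1.0, which is not what this threshold would produce, so treat it with caution.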
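
The thresholding rule itself is simple enough to write out as a plain-Python sketch. The function binarize() is a hypothetical stand-in that follows the documented Binarizer behaviour: values strictly greater than the threshold become 1.0, everything else 0.0.

```python
def binarize(value, threshold):
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# Days encoded Monday=1.0 ... Sunday=7.0, threshold at Friday (5.0)
days = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
flags = [binarize(d, 5.0) for d in days]
print(flags)
```

Because the comparison is strict, Friday (5.0) stays on the weekday side of the split.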

## Bucketing

If you are a homeowner, it's very important whether a house has 1, 2, 3 or 4 bedrooms. But like bathrooms, once you hit a certain point you don't really care whether the house has 7 or 8. In this example we'll look at how to figure out where some good value points to bucket are.

- Plot a distribution plot of the pandas dataframe sample_df using Seaborn distplot().
- Given it looks like there is a long tail of infrequent values after 5, create the bucket splits of 1, 2, 3, 4, 5+
- Create the transformer buck by instantiating Bucketizer() with the splits for setting the buckets, then set the input column as BEDROOMS and output column as bedrooms.
- Apply the Bucketizer transformation on df using transform() and assign the result to df_bucket. Then verify the results with show()

In [None]:
from pyspark.ml.feature import Bucketizer

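# Plot distribution of the BEDROOMS column of sample_df
# (distplot() is deprecated in recent seaborn releases; histplot() is the suggested replacement)
sns.distplot(sample_df['BEDROOMS'], axlabel='BEDROOMS')
plt.show()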

# Create the bucket splits and bucketizer
splits = [0, 1, 2, 3, 4, 5, float('Inf')]
buck = Bucketizer(splits=splits, inputCol='BEDROOMS', outputCol='bedrooms')

# Apply the transformation to df: df_bucket
df_bucket = buck.transform(df)

# Display results
df_bucket[['BEDROOMS', 'bedrooms']].show()

```
<script.py> output:
    +--------+--------+
    |BEDROOMS|bedrooms|
    +--------+--------+
    |     3.0|     3.0|
    |     4.0|     4.0|
    |     2.0|     2.0|
    |     2.0|     2.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     2.0|     2.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    |     3.0|     3.0|
    +--------+--------+
    only showing top 20 rows
```
<img src="images/spark4_078.png" alt="" style="width: 800px;"/>

<img src="images/spark4_079.png" alt="" style="width: 800px;"/>

Being able to inspect a distribution plot is important if you are considering bucketing feature values together. Here we saw that homes with more than 5 bedrooms were exceedingly rare, so we could combine their effects together in a 5+ value.
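
How the splits map values to buckets can be sketched in plain Python with bisect. The helper bucket_index() is hypothetical; it follows the documented Bucketizer rule that each bucket is half-open, [lower, upper), except the last one, which also includes its upper bound.

```python
from bisect import bisect_right

splits = [0, 1, 2, 3, 4, 5, float("inf")]

def bucket_index(value, splits):
    """Index of the half-open bucket [splits[i], splits[i+1]) holding value.

    The last bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value outside the range covered by splits")
    return min(bisect_right(splits, value) - 1, len(splits) - 2)

bedrooms = [0, 1, 3, 5, 7, 12]
buckets = [bucket_index(b, splits) for b in bedrooms]
print(buckets)
```

With these splits, homes with 5, 7 or 12 bedrooms all land in the same final bucket, which is the 5+ grouping described above.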

## One Hot Encoding

In the United States, where you live determines which schools your kids can attend. Therefore it's understandable that many people care deeply about which school district their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER, they are really categorical, meaning that summing or averaging these values has no apparent meaning. Therefore in this example we will convert SCHOOLDISTRICTNUMBER from a categorical variable into a numeric vector to use in our machine learning model later.

- Instantiate a StringIndexer transformer called string_indexer with SCHOOLDISTRICTNUMBER as the input and School_Index as the output.
- Apply the transformer string_indexer to df with fit() and transform(). Store the transformed dataframe in indexed_df.
- Create a OneHotEncoder transformer called encoder using School_Index as the input and School_Vec as the output.
- Apply the transformation to indexed_df using transform(). Inspect the iterative steps of the transformation with the supplied code.

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map strings to numbers with string indexer
string_indexer = StringIndexer(inputCol='SCHOOLDISTRICTNUMBER', outputCol='School_Index')
indexed_df = string_indexer.fit(df).transform(df)

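# Onehot encode indexed values
# (in Spark 3.0+, OneHotEncoder is an estimator: use encoder.fit(indexed_df).transform(indexed_df))
encoder = OneHotEncoder(inputCol='School_Index', outputCol='School_Vec')
encoded_df = encoder.transform(indexed_df)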

# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)

```
<script.py> output:
    +-----------------------------+------------+-------------+
    |         SCHOOLDISTRICTNUMBER|School_Index|   School_Vec|
    +-----------------------------+------------+-------------+
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |622 - North St Paul-Maplewood|         1.0|(7,[1],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |622 - North St Paul-Maplewood|         1.0|(7,[1],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    |             834 - Stillwater|         3.0|(7,[3],[1.0])|
    +-----------------------------+------------+-------------+
    only showing top 20 rows
```
One Hot Encoding is a great way to handle categorical variables. You may have noticed that the implementation in PySpark differs from pandas' get_dummies(), as it puts everything into a single column of type vector rather than creating a new column for each value. It also differs from sklearn's OneHotEncoder in that the last categorical value is captured by a vector of all zeros.
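
The sparse notation in the output can be hard to read at first, so here is a small plain-Python sketch of what it means. `(7,[3],[1.0])` is read as a vector of size 7 with the value 1.0 at position 3. The helpers `to_dense` and `one_hot` are hypothetical names, and the count of 8 categories is an assumption inferred from the vector size of 7 with the last category dropped.

```python
def to_dense(size, indices, values):
    """Expand sparse notation (size, [indices], [values]) into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot list for a category index, dropping the last category."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# (7,[3],[1.0]) from the output above: 7 slots, a 1.0 at position 3
print(to_dense(7, [3], [1.0]))
# Assuming 8 categories with the last one dropped, index 7 becomes all zeros
print(one_hot(7, 8))
```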

---
<a id='model'></a>

## Building a Model

<img src="images/spark4_080.png" alt="" style="width: 800px;"/>

<img src="images/spark4_081.png" alt="" style="width: 800px;"/>

<img src="images/spark4_082.png" alt="" style="width: 800px;"/>

<img src="images/spark4_083.png" alt="" style="width: 800px;"/>

## Creating Time Splits

Splitting data randomly can be dangerous for time series, as data from the future can leak into training and cause overfitting in our model. Often with time series, you acquire new data as it is made available and you will want to retrain your model using the newest data. We showed how to do a percentage split for test and training sets, but suppose you wish to train on all available data except for the last 45 days, which you want to use as a test set.

In this exercise, we will create a function to find the split date for using the last 45 days of data for testing and the rest for training. Please note that `timedelta()` has already been imported for you from the standard python library datetime.

- Create a function train_test_split_date() that takes in a dataframe, df, the date column to use for splitting, split_col, and the number of days to use for the test set, test_days, with a default value of 45.
- Find the min and max dates for split_col using agg() with the 'min' and 'max' aggregations.
- Find the date to split the test and training sets by subtracting test_days from max_date using timedelta(), which takes a days parameter; in this case, pass in `test_days`.
- Using OFFMKTDATE as the split_col, find split_date and use it to filter the dataframe into two new ones, train_df and test_df, where test_df is only the last 45 days of the data. Additionally, ensure that test_df only contains homes listed as of the split date by filtering df['LISTDATE'] less than or equal to the split_date.

In [None]:
from datetime import timedelta

def train_test_split_date(df, split_col, test_days=45):
  """Calculate the date to split test and training sets"""
  # Find the first and last dates in the split column
  # (min_date is not used below; it only shows the span of the data)
  max_date = df.agg({split_col: 'max'}).collect()[0][0]
  min_date = df.agg({split_col: 'min'}).collect()[0][0]
  # Subtract an integer number of days from the last date in dataset
  split_date = max_date - timedelta(days=test_days)
  return split_date

# Find the date to use in splitting test and train
split_date = train_test_split_date(df, 'OFFMKTDATE')

# Create Sequential Test and Training Sets
train_df = df.where(df['OFFMKTDATE'] < split_date)
test_df = df.where(df['OFFMKTDATE'] >= split_date).where(df['LISTDATE'] <= split_date)

Creating functions like this takes more time upfront, but if you intend to use the model over and over again, it's worth spending the time to do things properly.
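
The date arithmetic itself can be checked without Spark. The sketch below applies the same idea (latest date minus `test_days`) to a plain Python list. The helper `split_date_for` and the sample dates are hypothetical and are not taken from the real dataset.

```python
from datetime import date, timedelta

def split_date_for(dates, test_days=45):
    """Return the date that starts the final `test_days` days of data."""
    return max(dates) - timedelta(days=test_days)

# Hypothetical off-market dates, made up for illustration
off_market = [date(2017, 3, 1), date(2017, 11, 30),
              date(2017, 12, 20), date(2018, 1, 24)]

split = split_date_for(off_market)
# Sequential split: everything before the split date trains, the rest tests
train = [d for d in off_market if d < split]
test = [d for d in off_market if d >= split]
print(split, len(train), len(test))
```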

## Adjusting Time Features

We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to have very optimistic metrics for accuracy but once real data is run through it the results are often very disappointing.

In this exercise, we are going to ensure that DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.

NOTE: This example will use the `lit()` function. This function is used to allow single values where an entire column is expected in a function call.

- Import the following functions from pyspark.sql.functions to use later on: datediff(), to_date(), lit().
- Convert the date string '2017-12-10' to a pyspark date by first calling the literal function, lit() on it and then to_date()
- Create test_df by filtering OFFMKTDATE greater than or equal to the split_date and LISTDATE less than or equal to the split_date using where().
- Replace DAYSONMARKET by calculating a new column with the same name, DAYSONMARKET. The new column should be the difference between split_date and LISTDATE; use datediff() to perform the date calculation. Inspect the new column and the original using the code provided.

In [None]:
from pyspark.sql.functions import datediff, to_date, lit

split_date = to_date(lit('2017-12-10'))
# Create Sequential Test set
test_df = df.where(df['OFFMKTDATE'] >= split_date).where(df['LISTDATE'] <= split_date)

# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])

# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn('DAYSONMARKET', datediff(split_date, 'LISTDATE'))

# Review the difference
test_df[['LISTDATE', 'OFFMKTDATE', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show()

```
<script.py> output:
    +-------------------+-------------------+---------------------+------------+
    |           LISTDATE|         OFFMKTDATE|DAYSONMARKET_Original|DAYSONMARKET|
    +-------------------+-------------------+---------------------+------------+
    |2017-10-06 00:00:00|2018-01-24 00:00:00|                  110|          65|
    |2017-09-18 00:00:00|2017-12-12 00:00:00|                   82|          83|
    |2017-11-07 00:00:00|2017-12-12 00:00:00|                   35|          33|
    |2017-10-30 00:00:00|2017-12-11 00:00:00|                   42|          41|
    |2017-07-14 00:00:00|2017-12-19 00:00:00|                  158|         149|
    |2017-10-25 00:00:00|2017-12-20 00:00:00|                   45|          46|
    |2017-12-07 00:00:00|2017-12-23 00:00:00|                   16|           3|
    |2017-11-22 00:00:00|2017-12-16 00:00:00|                   24|          18|
    |2017-10-27 00:00:00|2017-12-13 00:00:00|                   47|          44|
    |2017-09-29 00:00:00|2017-12-12 00:00:00|                   12|          72|
    |2017-11-28 00:00:00|2017-12-11 00:00:00|                   13|          12|
    |2017-09-09 00:00:00|2018-01-17 00:00:00|                  119|          92|
    |2017-11-18 00:00:00|2017-12-15 00:00:00|                   26|          22|
    |2017-12-07 00:00:00|2017-12-18 00:00:00|                   11|           3|
    |2017-11-25 00:00:00|2018-01-02 00:00:00|                   38|          15|
    |2017-11-09 00:00:00|2018-01-03 00:00:00|                   55|          31|
    |2017-10-18 00:00:00|2017-12-26 00:00:00|                   69|          53|
    |2017-10-03 00:00:00|2017-12-15 00:00:00|                   40|          68|
    |2017-10-16 00:00:00|2017-12-15 00:00:00|                   60|          55|
    |2017-11-18 00:00:00|2017-12-28 00:00:00|                   40|          22|
    +-------------------+-------------------+---------------------+------------+
    only showing top 20 rows
```
Thinking critically about what information would be available at the time of prediction is crucial for having accurate model metrics, and it saves a lot of embarrassment down the road if decisions are being made based on your results!
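
As a sanity check on the recalculated column, the same subtraction can be reproduced with the standard library. This sketch assumes that `datediff(end, start)` returns the number of days from `start` to `end`. The helper `days_on_market_as_of` is a hypothetical name, and the listing dates are copied from the first two rows of the output above.

```python
from datetime import date

split_date = date(2017, 12, 10)

def days_on_market_as_of(list_date, as_of=split_date):
    """Days between listing and the as-of date, like datediff(as_of, list_date)."""
    return (as_of - list_date).days

# Listing dates from the first two rows of the output above
print(days_on_market_as_of(date(2017, 10, 6)))
print(days_on_market_as_of(date(2017, 9, 18)))
```

Both results match the recalculated DAYSONMARKET values shown for those rows (65 and 83).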

## Feature Engineering Assumptions for RFR

<img src="images/spark4_084.png" alt="" style="width: 800px;"/>

<img src="images/spark4_085.png" alt="" style="width: 800px;"/>

<img src="images/spark4_086.png" alt="" style="width: 800px;"/>

<img src="images/spark4_087.png" alt="" style="width: 800px;"/>

<img src="images/spark4_088.png" alt="" style="width: 800px;"/>

## Dropping Columns with Low Observations

After doing a lot of feature engineering it's a good idea to take a step back and look at what you've created. If you've used some automation techniques on your categorical features like exploding or OneHot Encoding, you may find that you now have hundreds of new binary features. While the subject of feature selection is material for a whole other course, there are some quick steps you can take to reduce the dimensionality of your data set.

In this exercise, we are going to remove columns that have too few observations, using a threshold of 30. 30 is a commonly cited rule-of-thumb minimum number of observations for statistical significance. With fewer than that, an apparent relationship may be sheer coincidence, which can cause overfitting!

NOTE: The data is available in the dataframe, df.

- Using the provided for loop that iterates through the list of binary columns, calculate the sum of the values in the column using the agg function. Use collect() to run the calculation immediately and save the results to obs_count.
- Compare obs_count to obs_threshold, the if statement should be true if obs_count is less than or equal to obs_threshold.
- Remove columns that have been appended to cols_to_remove list by using drop(). Recall that the * allows the list to be unpacked.
- Print the starting and ending shape of the PySpark dataframes by using count() for number of records and len() on df.columns or new_df.columns to find the number of columns.

In [None]:
obs_threshold = 30
cols_to_remove = list()
# Inspect first 10 binary columns in list
for col in binary_cols[0:10]:
  # Count the number of 1 values in the binary column
  obs_count = df.agg({col: 'sum'}).collect()[0][0]
  # If at or below our observation threshold, remove
  if obs_count <= obs_threshold:
    cols_to_remove.append(col)
    
# Drop columns and print starting and ending dataframe shapes
new_df = df.drop(*cols_to_remove)

print('Rows: ' + str(df.count()) + ' Columns: ' + str(len(df.columns)))
print('Rows: ' + str(new_df.count()) + ' Columns: ' + str(len(new_df.columns)))

```
<script.py> output:
    Rows: 5000 Columns: 253
    Rows: 5000 Columns: 250
```
Removing low observation features is helpful in many ways. It can improve processing speed of model training, prevent overfitting by coincidence and help interpretability by reducing the number of things to consider.
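
The filtering rule is simple enough to show on plain Python lists. The sketch below uses hypothetical column names and made-up counts. It illustrates the threshold logic only and is not a replacement for the Spark aggregation above.

```python
# Hypothetical binary feature columns stored as plain lists
columns = {
    'HAS_POOL': [1] * 5 + [0] * 95,
    'HAS_GARAGE': [1] * 60 + [0] * 40,
    'HAS_DOCK': [1] * 30 + [0] * 70,
}
obs_threshold = 30

# A column is dropped when its count of 1 values is at or below the threshold
cols_to_remove = [name for name, values in columns.items()
                  if sum(values) <= obs_threshold]
kept = {name: values for name, values in columns.items()
        if name not in cols_to_remove}

print(cols_to_remove)
print(list(kept))
```

Note that a column with exactly 30 observations is removed, because the comparison is `<=`.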

## Naively Handling Missing and Categorical Values

Random Forest Regression is robust enough to allow us to ignore many of the more time-consuming and tedious data preparation steps. While some implementations of Random Forest handle missing and categorical values automatically, PySpark's does not. The math remains the same, however, so we can get away with some naive value replacements.

For missing values, since our data is strictly positive, we will assign -1. The random forest will split on this value and handle it differently than the rest of the values in the same feature.

For categorical values, we can just map the text values to numbers and again the random forest will appropriately handle them by splitting on them. In this example, we will dust off pipelines from Introduction to PySpark to write our code more concisely. Please note that the exercise will start by displaying the dtypes of the columns in the dataframe; compare them to the results at the end of this exercise.

NOTE: Pipeline and StringIndexer are already imported for you. The list categorical_cols is also available.

- Replace the missing values in WALKSCORE and BIKESCORE with -1 using fillna() and the subset parameter.
- Create a list of StringIndexers by using list comprehension to iterate over each column in categorical_cols.
- Apply fit() and transform() to the pipeline indexer_pipeline.
- Drop the categorical_cols using drop() since they are no longer needed. Inspect the result data types using dtypes.

In [None]:
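from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Replace missing values
df = df.fillna(-1, subset=['WALKSCORE', 'BIKESCORE'])

# Create list of StringIndexers using list comprehension
# handleInvalid="keep" puts unseen or null labels in an extra index
indexers = [StringIndexer(inputCol=col, outputCol=col+"_IDX")
            .setHandleInvalid("keep") for col in categorical_cols]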
# Create pipeline of indexers
indexer_pipeline = Pipeline(stages=indexers)
# Fit and Transform the pipeline to the original data
df_indexed = indexer_pipeline.fit(df).transform(df)

# Clean up redundant columns
df_indexed = df_indexed.drop(*categorical_cols)
# Inspect data transformations
print(df_indexed.dtypes)

```
[('CITY', 'string'), ('LISTTYPE', 'string'), ('SCHOOLDISTRICTNUMBER', 'string'), ('POTENTIALSHORTSALE', 'string'), ('STYLE', 'string'), ('ASSUMABLEMORTGAGE', 'string'), ('ASSESSMENTPENDING', 'string'), ('WALKSCORE', 'double'), ('BIKESCORE', 'double')]

<script.py> output:
    [('WALKSCORE', 'double'), ('BIKESCORE', 'double'), ('CITY_IDX', 'double'), ('LISTTYPE_IDX', 'double'), ('SCHOOLDISTRICTNUMBER_IDX', 'double'), ('POTENTIALSHORTSALE_IDX', 'double'), ('STYLE_IDX', 'double'), ('ASSUMABLEMORTGAGE_IDX', 'double'), ('ASSESSMENTPENDING_IDX', 'double')]

```
As you can hopefully see, handling missing and categorical values for Random Forest Regression is fairly painless compared to some of the other things we would have had to do if we chose a different algorithm!
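
To see what the indexing step does to a single column, here is a plain-Python sketch of the idea. It orders labels by descending frequency, which is StringIndexer's default, and sends unseen labels to one extra index, mirroring `handleInvalid="keep"`. The alphabetical tie-breaking, the helper names, and the city values are assumptions for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def index_value(mapping, label):
    """Unseen labels get one extra index, mirroring handleInvalid='keep'."""
    return mapping.get(label, float(len(mapping)))

# Hypothetical city values, not taken from the real dataset
cities = ['STP', 'WB', 'STP', 'LELM', 'STP', 'WB']
mapping = fit_string_index(cities)
print(mapping)
print(index_value(mapping, 'OAKD'))
```

The resulting numbers carry no order in any real sense, but a tree-based model can still isolate individual categories by splitting on them.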

## Building a Model

<img src="images/spark4_089.png" alt="" style="width: 800px;"/>

<img src="images/spark4_090.png" alt="" style="width: 800px;"/>

<img src="images/spark4_091.png" alt="" style="width: 800px;"/>

<img src="images/spark4_092.png" alt="" style="width: 800px;"/>

## Building a Regression Model

One of the great things about the PySpark ML module is that most algorithms can be tried and tested without changing much code. Random Forest Regression is a fairly simple ensemble model that uses bagging to fit. Another tree-based ensemble model is Gradient Boosted Trees, which uses a different approach called boosting. In this exercise, let's train a GBTRegressor.

- Import GBTRegressor from pyspark.ml.regression which you will notice is the same module as RandomForestRegressor.
- Instantiate GBTRegressor with featuresCol set to the vector column of our features, named features, labelCol set to our dependent variable, SALESCLOSEPRICE, and the random seed set to 42.
- Train the model by calling fit() on gbt with the imported training data, train_df.

In [None]:
from pyspark.ml.regression import GBTRegressor

# Train a Gradient Boosted Trees (GBT) model.
gbt = GBTRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42
                           )

# Train model.
model = gbt.fit(train_df)

As you can see switching from RandomForestRegressor to GBTRegressor was very easy to do. In practice you should try multiple algorithms and evaluate which one fits your data best.

## Evaluating & Comparing Algorithms

Now that we've created a new model with GBTRegressor, it's time to compare it against our baseline of RandomForestRegressor. To do this we will compare the predictions of both models to the actual data and calculate RMSE and R^2.

- Import RegressionEvaluator from pyspark.ml.evaluation so it is available for use later.
- Initialize RegressionEvaluator by setting labelCol to our actual data, SALESCLOSEPRICE and predictionCol to our predicted data, Prediction_Price
- To calculate our metrics, call evaluate on evaluator with the prediction values preds and a dictionary with key evaluator.metricName and value rmse. Do the same for the r2 metric.

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE", 
                                predictionCol="Prediction_Price")
# Dictionary of model predictions to loop over
models = {'Gradient Boosted Trees': gbt_predictions, 'Random Forest Regression': rfr_predictions}
for key, preds in models.items():
  # Create evaluation metrics
  rmse = evaluator.evaluate(preds, {evaluator.metricName: "rmse"})
  r2 = evaluator.evaluate(preds, {evaluator.metricName: "r2"})

  # Print Model Metrics
  print(key + ' RMSE: ' + str(rmse))
  print(key + ' R^2: ' + str(r2))

```
<script.py> output:
    Gradient Boosted Trees RMSE: 74380.63652512032
    Gradient Boosted Trees R^2: 0.6482244200795505
    Random Forest Regression RMSE: 22898.84041072095
    Random Forest Regression R^2: 0.9666594402208077
```
Be careful about discarding an algorithm just because its first pass was not great. Even though Gradient Boosted Trees performed much worse here, it has many hyperparameters; with proper tuning it may achieve comparable or better results!
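
For reference, the two metrics compared above can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not of RegressionEvaluator's internals, and the prices are made up.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and predictions
actual = [200000, 250000, 300000, 350000]
predicted = [210000, 240000, 310000, 340000]
print(rmse(actual, predicted), r2(actual, predicted))
```

RMSE is in the same units as the label (dollars here), so lower is better, while R^2 is unitless and values closer to 1 are better.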

## Interpreting, Saving & Loading

<img src="images/spark4_093.png" alt="" style="width: 800px;"/>

<img src="images/spark4_094.png" alt="" style="width: 800px;"/>

<img src="images/spark4_095.png" alt="" style="width: 800px;"/>

## Interpreting Results

It is almost always important to know which features are influencing your prediction the most. Perhaps the result is counterintuitive, and that's an insight in itself. Perhaps a handful of features account for most of the accuracy of your model and you don't need to spend time acquiring or massaging other features.

In this example we will be looking at a model that has been trained without any LISTPRICE information. With that gone, what influences the price the most?

NOTE: The array of feature importances, importances, has already been created for you from `model.featureImportances.toArray()`.

- Create a pandas dataframe using the values of importances and name the column importance by setting the parameter columns.
- Using the imported list of features names, feature_cols, create a new pandas.Series by wrapping it in the pd.Series() function. Set it to the column fi_df['feature'].
- Sort the dataframe using sort_values(), setting the by parameter to our importance column and sort it descending by setting ascending to False. Inspect the results.

In [None]:
# Convert feature importances to a pandas column
fi_df = pd.DataFrame(importances, columns=['importance'])

# Convert list of feature names to pandas column
fi_df['feature'] = pd.Series(feature_cols)

# Sort the data based on feature importance
fi_df.sort_values(by=['importance'], ascending=False, inplace=True)

# Inspect Results
fi_df.head(10)

We can see that the most important features are now things like the area of the house and taxes, both of which are highly correlated with the price of the home.

## Saving & Loading Models

Oftentimes you may find yourself going back to a previous model to see what assumptions or settings were used when diagnosing where your prediction errors are coming from. Perhaps there was something wrong with the data? Maybe you need to incorporate a new feature to capture an unusual event that occurred?

In this example, you will practice saving and loading a model.

- Import RandomForestRegressionModel from pyspark.ml.regression.
- Using the model in memory called model, call the save() method on it and name the saved model rfr_no_listprice.
- Reload the saved model file rfr_no_listprice by calling load() on RandomForestRegressionModel and storing it into loaded_model.

In [None]:
from pyspark.ml.regression import RandomForestRegressionModel

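# Save model (save() raises an error if the path already exists;
# model.write().overwrite().save('rfr_no_listprice') replaces it instead)
model.save('rfr_no_listprice')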

# Load model
loaded_model = RandomForestRegressionModel.load('rfr_no_listprice')

<img src="images/spark4_096.png" alt="" style="width: 800px;"/>
