# Feature Engineering with PySpark

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, the data needs to be transformed before machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

## Table of Contents

- [Introduction](#intro)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

path = "data/dc34/"

In [2]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('First App').getOrCreate()

---
<a id='intro'></a>

## Where to Begin

<img src="images/spark4_001.png" alt="" style="width: 800px;"/>

<img src="images/spark4_002.png" alt="" style="width: 800px;"/>

<img src="images/spark4_003.png" alt="" style="width: 800px;"/>

<img src="images/spark4_004.png" alt="" style="width: 800px;"/>

## Check Version

Checking which versions of Spark and Python are installed is important because both change quickly, and sometimes drastically, between releases. Reading the wrong documentation can cost a lot of time and cause unnecessary frustration!

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the [PySpark Cheat Sheet](https://datacamp-community-prod.s3.amazonaws.com/65076e3c-9df1-40d5-a0c2-36294d9a3ca9) and keep it handy!

In [4]:
# Return spark version
print(spark.version)

# Return python version
import sys
print(sys.version_info)

2.4.4
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
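
A version check can also be made programmatic, so that a notebook fails fast in an unexpected environment. The following is a minimal sketch and not part of the original course: `parse_version` and `meets_minimum` are hypothetical helper names, and the minimum versions used are illustrative assumptions.

```python
import sys


def parse_version(version_string):
    """Turn a dotted version string such as '2.4.4' into a tuple of ints.

    Non-numeric suffixes (for example '3.0.0-preview') are ignored; this is
    a simplification, not a full version parser.
    """
    parts = []
    for piece in version_string.split('.'):
        digits = ''
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if digits == '':
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_string, minimum):
    """Return True if version_string is at least the `minimum` tuple."""
    return parse_version(version_string) >= tuple(minimum)


# Hypothetical usage; with a live session you would pass spark.version.
print(meets_minimum('2.4.4', (2, 4)))
# sys.version_info already compares as a tuple.
print(sys.version_info >= (3, 6))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings directly, where `'2.10'` would sort before `'2.4'`.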


## Load in the data

Reading in data is the first step to using PySpark for data science! Let's use Parquet, a columnar file format that stores column names and types alongside the data and is widely used with Spark.

In [None]:
# Read the file into a dataframe
df = spark.read.parquet('Real_Estate.parq')
# Print columns in dataframe
print(df.columns)

```
<script.py> output:
    ['NO', 'MLSID', 'STREETNUMBERNUMERIC', 'STREETADDRESS', 'STREETNAME', 'POSTALCODE', 'STATEORPROVINCE', 'CITY', 'SALESCLOSEPRICE', 'LISTDATE', 'LISTPRICE', 'LISTTYPE', 'ORIGINALLISTPRICE', 'PRICEPERTSFT', 'FOUNDATIONSIZE', 'FENCE', 'MAPLETTER', 'LOTSIZEDIMENSIONS', 'SCHOOLDISTRICTNUMBER', 'DAYSONMARKET', 'OFFMARKETDATE', 'FIREPLACES', 'ROOMAREA4', 'ROOMTYPE', 'ROOF', 'ROOMFLOOR4', 'POTENTIALSHORTSALE', 'POOLDESCRIPTION', 'PDOM', 'GARAGEDESCRIPTION', 'SQFTABOVEGROUND', 'TAXES', 'ROOMFLOOR1', 'ROOMAREA1', 'TAXWITHASSESSMENTS', 'TAXYEAR', 'LIVINGAREA', 'UNITNUMBER', 'YEARBUILT', 'ZONING', 'STYLE', 'ACRES', 'COOLINGDESCRIPTION', 'APPLIANCES', 'BACKONMARKETDATE', 'ROOMFAMILYCHAR', 'ROOMAREA3', 'EXTERIOR', 'ROOMFLOOR3', 'ROOMFLOOR2', 'ROOMAREA2', 'DININGROOMDESCRIPTION', 'BASEMENT', 'BATHSFULL', 'BATHSHALF', 'BATHQUARTER', 'BATHSTHREEQUARTER', 'CLASS', 'BATHSTOTAL', 'BATHDESC', 'ROOMAREA5', 'ROOMFLOOR5', 'ROOMAREA6', 'ROOMFLOOR6', 'ROOMAREA7', 'ROOMFLOOR7', 'ROOMAREA8', 'ROOMFLOOR8', 'BEDROOMS', 'SQFTBELOWGROUND', 'ASSUMABLEMORTGAGE', 'ASSOCIATIONFEE', 'ASSESSMENTPENDING', 'ASSESSEDVALUATION']
```

In [8]:
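# Load from the provided CSV file into a dataframe
# Note: no header option is set here, so Spark assigns default names (_c0, _c1, ...);
# if the file's first row holds column names, pass header=True to spark.read.csv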
df = spark.read.csv(path+'2017_StPaul_MN_Real_Estate.csv')
# Print columns in dataframe
print(df.columns)

['_c0', '_c1', '_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13', '_c14', '_c15', '_c16', '_c17', '_c18', '_c19', '_c20', '_c21', '_c22', '_c23', '_c24', '_c25', '_c26', '_c27', '_c28', '_c29', '_c30', '_c31', '_c32', '_c33', '_c34', '_c35', '_c36', '_c37', '_c38', '_c39', '_c40', '_c41', '_c42', '_c43', '_c44', '_c45', '_c46', '_c47', '_c48', '_c49', '_c50', '_c51', '_c52', '_c53', '_c54', '_c55', '_c56', '_c57', '_c58', '_c59', '_c60', '_c61', '_c62', '_c63', '_c64', '_c65', '_c66', '_c67', '_c68', '_c69', '_c70', '_c71', '_c72', '_c73']


## Defining A Problem

<img src="images/spark4_005.png" alt="" style="width: 800px;"/>

<img src="images/spark4_006.png" alt="" style="width: 800px;"/>

<img src="images/spark4_007.png" alt="" style="width: 800px;"/>

<img src="images/spark4_008.png" alt="" style="width: 800px;"/>

<img src="images/spark4_009.png" alt="" style="width: 800px;"/>

## What are we predicting?

Which of these fields (or columns) is the value we are trying to predict?

- TAXES
- SALESCLOSEPRICE
- DAYSONMARKET
- LISTPRICE

In [None]:
# Select our dependent variable
Y_df = df.select(['SALESCLOSEPRICE'])

# Display summary statistics
Y_df.describe().show()

```
<script.py> output:
    +-------+------------------+
    |summary|   SALESCLOSEPRICE|
    +-------+------------------+
    |  count|              5000|
    |   mean|       262804.4668|
    | stddev|140559.82591998563|
    |    min|             48000|
    |    max|           1700000|
    +-------+------------------+
```

We want to know how much a house will actually sell for, so SALESCLOSEPRICE is our dependent variable. The summary shows its range and its average, which will help us in our next steps!

## Verifying Data Load

Suppose you receive a new file each month and know to expect a certain number of records and columns. In this exercise we will create a function that validates the loaded file.

- Create a data validation function check_load() with parameters df (a dataframe), num_records (the expected number of records) and num_columns (the expected number of columns).
- Using num_records, check whether the input dataframe df has the same number of records with count().
- Compare the number of columns in the input dataframe with num_columns by using len() on columns.
- If both checks return True, print Validation Passed.

In [None]:
def check_load(df, num_records, num_columns):
  # Takes a dataframe and compares record and column counts to input
  # Message to return if the criteria below aren't met
  message = 'Validation Failed'
  # Check number of records
  if num_records == df.count():
    # Check number of columns
    if num_columns == len(df.columns):
      # Success message
      message = 'Validation Passed'
  return message

# Print the data validation message
print(check_load(df, 5000, 74))
#check_load(spark.createDataFrame([[1,2], [-1,1]], ['a', 'b']), 2, 2)
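
The function above returns a single pass/fail message. As a sketch of one possible variation (not part of the original exercise), the hypothetical `check_load_detailed` below reports which check failed and the counts involved. It assumes only that `df` exposes `count()` and `columns`, as a PySpark dataframe does.

```python
def check_load_detailed(df, num_records, num_columns):
    """Compare a dataframe's record and column counts to expected values.

    Returns a list of human-readable problems; an empty list means the
    load passed both checks. Only relies on df.count() and df.columns.
    """
    problems = []

    # Check number of records
    actual_records = df.count()
    if actual_records != num_records:
        problems.append(
            'Expected {} records, found {}'.format(num_records, actual_records))

    # Check number of columns
    actual_columns = len(df.columns)
    if actual_columns != num_columns:
        problems.append(
            'Expected {} columns, found {}'.format(num_columns, actual_columns))

    return problems


# Hypothetical usage with the dataframe loaded earlier:
# problems = check_load_detailed(df, 5000, 74)
# print('Validation Passed' if not problems else problems)
```

Returning the list of problems instead of a fixed string makes it easier to see at a glance whether a monthly file is short on rows, columns, or both.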

## Verifying DataTypes

In the age of data we have access to more attributes than ever before. Handling all of them takes a lot of automation, which at a minimum requires that their datatypes be correct. In this exercise we will validate a dictionary of attributes and their datatypes to see if they are correct. This dictionary is stored in the variable validation_dict and is available in your workspace.

- Using df, create a list of (attribute, datatype) tuples from dtypes called actual_dtypes_list.
- Iterate through actual_dtypes_list, checking if the column names exist in the dictionary of expected dtypes validation_dict.
- For the keys that exist in the dictionary, check their dtypes and print those that match.

In [9]:
validation_dict = {'ASSESSMENTPENDING': 'string',
 'AssessedValuation': 'double',
 'AssociationFee': 'bigint',
 'AssumableMortgage': 'string',
 'SQFTBELOWGROUND': 'bigint'}

In [None]:
# create list of actual dtypes to check
actual_dtypes_list = df.dtypes
print(actual_dtypes_list)

# Iterate through the list of actual dtypes tuples
for attribute_tuple in actual_dtypes_list:
  
  # Check if column name is in the dictionary of expected dtypes
  col_name = attribute_tuple[0]
  if col_name in validation_dict:

    # Compare attribute types
    col_type = attribute_tuple[1]
    if col_type == validation_dict[col_name]:
      print(col_name + ' has expected dtype.')
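
Note that the loop above matches names exactly. Several keys in validation_dict are mixed case (for example AssessedValuation), while the column names printed from the Parquet file are upper case (ASSESSEDVALUATION), so only keys with identical casing can match. One possible adjustment, offered as a sketch and not as the course's method, is to compare names case-insensitively and also report mismatched and missing columns. The function name `compare_dtypes` and the sample inputs are hypothetical.

```python
def compare_dtypes(actual_dtypes, expected_dtypes):
    """Compare (name, dtype) tuples against a dict of expected dtypes.

    Column names are matched case-insensitively. Returns three lists:
    names with matching dtypes, (name, actual, expected) tuples for
    differing dtypes, and expected names that were not found at all.
    """
    # Normalize expected names to upper case for lookup
    expected_upper = {name.upper(): dtype
                      for name, dtype in expected_dtypes.items()}
    matched, mismatched = [], []
    seen = set()
    for col_name, col_type in actual_dtypes:
        key = col_name.upper()
        if key in expected_upper:
            seen.add(key)
            if col_type == expected_upper[key]:
                matched.append(col_name)
            else:
                mismatched.append((col_name, col_type, expected_upper[key]))
    missing = sorted(set(expected_upper) - seen)
    return matched, mismatched, missing


# Illustrative input only; with Spark you would pass df.dtypes instead.
sample_dtypes = [('ASSESSEDVALUATION', 'double'),
                 ('ASSOCIATIONFEE', 'string'),
                 ('BEDROOMS', 'bigint')]
expected = {'AssessedValuation': 'double',
            'AssociationFee': 'bigint',
            'SQFTBELOWGROUND': 'bigint'}
matched, mismatched, missing = compare_dtypes(sample_dtypes, expected)
print(matched)     # ['ASSESSEDVALUATION']
print(mismatched)  # [('ASSOCIATIONFEE', 'string', 'bigint')]
print(missing)     # ['SQFTBELOWGROUND']
```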

You've created a way to loop through your expected dtypes and compare them to how they got loaded. You could use a similar loop to print or count all the numeric or text fields if you don't have a list of verified field types to compare against.
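
The idea of counting or listing fields by type can be sketched in plain Python over the same list of (name, dtype) tuples. The helper names `count_by_dtype` and `columns_of_type` are hypothetical, and the sample tuples are illustrative assumptions, not the dtypes of the actual dataset.

```python
from collections import Counter


def count_by_dtype(dtypes):
    """Count how many columns have each dtype in a list of (name, dtype) tuples."""
    return Counter(col_type for _, col_type in dtypes)


def columns_of_type(dtypes, wanted_types):
    """Return the names of columns whose dtype is in wanted_types."""
    return [name for name, col_type in dtypes if col_type in wanted_types]


# Illustrative (name, dtype) tuples; with Spark you would pass df.dtypes.
sample_dtypes = [('SALESCLOSEPRICE', 'bigint'),
                 ('ACRES', 'double'),
                 ('CITY', 'string'),
                 ('ROOF', 'string')]

# Number of columns per dtype
print(count_by_dtype(sample_dtypes))

# Names of the numeric columns, using an assumed set of numeric dtype names
print(columns_of_type(sample_dtypes, {'bigint', 'double'}))
```

Which dtype strings count as numeric depends on the data; `int`, `float` and `decimal` types may need to be added to the set.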

<img src="images/spark4_010.png" alt="" style="width: 800px;"/>
