# Feature Engineering with PySpark

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, the data must be transformed before machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

## Table of Contents

- [Introduction](#intro)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

path = "data/dc34/"

In [2]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('First App').getOrCreate()

---
<a id='intro'></a>

## Where to Begin

<img src="images/spark4_001.png" alt="" style="width: 800px;"/>

<img src="images/spark4_002.png" alt="" style="width: 800px;"/>

<img src="images/spark4_003.png" alt="" style="width: 800px;"/>

<img src="images/spark4_004.png" alt="" style="width: 800px;"/>

## Check Version

Checking which versions of Spark and Python are installed is important because both change quickly and drastically between releases. Reading the wrong documentation can cost a lot of time and cause unnecessary frustration!

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the [PySpark Cheat Sheet](https://datacamp-community-prod.s3.amazonaws.com/65076e3c-9df1-40d5-a0c2-36294d9a3ca9) and keep it handy!

In [4]:
# Return spark version
print(spark.version)

# Return python version
import sys
print(sys.version_info)

2.4.4
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
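
If later cells depend on a particular release, the printed version can also be checked programmatically. The following is a minimal sketch, not part of the original course material: `parse_version` and `meets_minimum` are hypothetical helper names, and the minimum versions used are only examples.

```python
import re


def parse_version(version):
    """Turn a dotted version string such as '2.4.4' into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits, so '0-SNAPSHOT' contributes 0.
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    have, need = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '2.4' compares equal to '2.4.0'.
    width = max(len(have), len(need))
    have = have + (0,) * (width - len(have))
    need = need + (0,) * (width - len(need))
    return have >= need


print(meets_minimum("2.4.4", "2.4"))  # True
print(meets_minimum("2.4.4", "3.0"))  # False
```

In the notebook, the first argument would be `spark.version` rather than a literal string.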


## Load in the data

Reading in data is the first step in using PySpark for data science! The data here is stored as Parquet, a columnar file format that Spark reads natively.

In [None]:
# Read the file into a dataframe
df = spark.read.parquet('Real_Estate.parq')
# Print columns in dataframe
print(df.columns)

```
<script.py> output:
    ['NO', 'MLSID', 'STREETNUMBERNUMERIC', 'STREETADDRESS', 'STREETNAME', 'POSTALCODE', 'STATEORPROVINCE', 'CITY', 'SALESCLOSEPRICE', 'LISTDATE', 'LISTPRICE', 'LISTTYPE', 'ORIGINALLISTPRICE', 'PRICEPERTSFT', 'FOUNDATIONSIZE', 'FENCE', 'MAPLETTER', 'LOTSIZEDIMENSIONS', 'SCHOOLDISTRICTNUMBER', 'DAYSONMARKET', 'OFFMARKETDATE', 'FIREPLACES', 'ROOMAREA4', 'ROOMTYPE', 'ROOF', 'ROOMFLOOR4', 'POTENTIALSHORTSALE', 'POOLDESCRIPTION', 'PDOM', 'GARAGEDESCRIPTION', 'SQFTABOVEGROUND', 'TAXES', 'ROOMFLOOR1', 'ROOMAREA1', 'TAXWITHASSESSMENTS', 'TAXYEAR', 'LIVINGAREA', 'UNITNUMBER', 'YEARBUILT', 'ZONING', 'STYLE', 'ACRES', 'COOLINGDESCRIPTION', 'APPLIANCES', 'BACKONMARKETDATE', 'ROOMFAMILYCHAR', 'ROOMAREA3', 'EXTERIOR', 'ROOMFLOOR3', 'ROOMFLOOR2', 'ROOMAREA2', 'DININGROOMDESCRIPTION', 'BASEMENT', 'BATHSFULL', 'BATHSHALF', 'BATHQUARTER', 'BATHSTHREEQUARTER', 'CLASS', 'BATHSTOTAL', 'BATHDESC', 'ROOMAREA5', 'ROOMFLOOR5', 'ROOMAREA6', 'ROOMFLOOR6', 'ROOMAREA7', 'ROOMFLOOR7', 'ROOMAREA8', 'ROOMFLOOR8', 'BEDROOMS', 'SQFTBELOWGROUND', 'ASSUMABLEMORTGAGE', 'ASSOCIATIONFEE', 'ASSESSMENTPENDING', 'ASSESSEDVALUATION']
```
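
With this many columns, it can help to confirm up front that the ones an analysis relies on are actually present. Below is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper, the `columns` list is a short stand-in for `df.columns` using a few names from the output above, and `NOT_A_COLUMN` is a made-up name included only to show a failed lookup.

```python
def missing_columns(available, required):
    """Return the required column names absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Stand-in for df.columns; a few names copied from the printed output.
columns = ['NO', 'MLSID', 'SALESCLOSEPRICE', 'LISTDATE', 'LISTPRICE']

# 'NOT_A_COLUMN' is hypothetical and deliberately absent.
required = ['SALESCLOSEPRICE', 'LISTPRICE', 'NOT_A_COLUMN']

print(missing_columns(columns, required))  # ['NOT_A_COLUMN']
```

In the notebook, `df.columns` would be passed as the first argument; an empty list means every required column was found.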

## Defining A Problem

<img src="images/spark4_005.png" alt="" style="width: 800px;"/>

<img src="images/spark4_006.png" alt="" style="width: 800px;"/>

<img src="images/spark4_007.png" alt="" style="width: 800px;"/>

<img src="images/spark4_008.png" alt="" style="width: 800px;"/>

<img src="images/spark4_009.png" alt="" style="width: 800px;"/>

<img src="images/spark4_010.png" alt="" style="width: 800px;"/>
