# 6.3 Random Forest Pipeline

##### Description

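This notebook implements the training pipeline for the random forest model. The preprocessing steps (feature hashing and scaling) are embedded in the pipeline. Candidate hyperparameter settings are compared by cross-validation to find the best-scoring one. The score comes from `BinaryClassificationEvaluator`, whose default metric is area under the ROC curve rather than accuracy.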

##### Notebook Steps

1. Connect Spark
1. Load data
1. Define pipeline steps
1. Define model and parameter grid
1. Define pipeline
1. Define cross validator
1. Fit model
1. Save best params

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark


Added endpoint http://ec2-18-234-35-214.compute-1.amazonaws.com:8998/
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
4,application_1611833527658_0005,pyspark,idle,Link,Link,✔



SparkSession available as 'spark'.


In [3]:
%%spark
spark.sparkContext.setCheckpointDir('./checkpoints')


## 2. Load Data
Two separate datasets are used for modeling: `train.csv` and `validate.csv`. The training set is used to fit the model, and the validation set is used to evaluate it on unseen data. Only the training set is needed in this notebook.

In [4]:
%%spark
train = spark.read.csv("s3://jolfr-capstone3/training/train.csv", header=True, inferSchema=True)


## 4. Define Pipeline Steps

 1. Feature Hasher
 1. Standard Scaler
 1. Model

Collect the feature column names (every column except `label`) for later use in the pipeline.

In [6]:
%%spark
cols = train.drop("label").columns


##### Define Feature Hasher

In [7]:
%%spark
from pyspark.ml.feature import FeatureHasher
hasher = FeatureHasher(inputCols=cols, outputCol="hash")

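As a rough illustration of what the hashing step does, the sketch below implements the hashing trick in plain Python. It is not Spark's implementation: `FeatureHasher` uses MurmurHash3 and defaults to 262,144 output dimensions, whereas this sketch assumes MD5 from the standard library and a hypothetical 16-slot vector. The helper names `bucket` and `hash_features` and the sample row are made up for the example. Numeric columns contribute their value at the index hashed from the column name; string columns contribute 1.0 at the index hashed from `column=value`.

```python
import hashlib

NUM_FEATURES = 16  # hypothetical size for readability; Spark's default is 262144


def bucket(key: str, num_features: int = NUM_FEATURES) -> int:
    """Map a string key to a vector index (MD5 stands in for MurmurHash3)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_features


def hash_features(row: dict, num_features: int = NUM_FEATURES) -> list:
    """Project a row of mixed numeric/string columns onto a fixed-size vector."""
    vector = [0.0] * num_features
    for column, value in row.items():
        if isinstance(value, str):
            # Categorical: indicator of 1.0 at the index of "column=value".
            vector[bucket(f"{column}={value}", num_features)] += 1.0
        else:
            # Numeric: the raw value at the index of the column name.
            vector[bucket(column, num_features)] += float(value)
    return vector


row = {"duration": 2.5, "protocol": "tcp", "bytes": 100}  # made-up sample row
hashed = hash_features(row)
print(hashed)
```

Because colliding keys add into the same slot, the output length stays fixed no matter how many distinct categories appear, at the cost of occasional collisions when the vector is small.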

##### Define Scaler

In [8]:
%%spark
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol=hasher.getOutputCol(), outputCol="features")

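With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean, which keeps sparse hashed vectors sparse. A minimal plain-Python sketch of that arithmetic follows; the values are made up and `scale_to_unit_std` is a hypothetical helper name, not a Spark function.

```python
import statistics


def scale_to_unit_std(column):
    """Divide values by the sample standard deviation; no mean centering."""
    std = statistics.stdev(column)  # corrected (n - 1) sample standard deviation
    if std == 0:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [x / std for x in column]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature column
scaled = scale_to_unit_std(values)
print(round(statistics.stdev(scaled), 6))
```

The scaled column has unit standard deviation but keeps a non-zero mean, since nothing is subtracted.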

## 5. Define Model and Parameter Grid

In [9]:
%%spark
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder

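# Random forest that reads the scaled feature vector produced by the pipeline.
rf = RandomForestClassifier(featuresCol='features', labelCol='label')

# Candidate forest sizes to compare during cross-validation.
params = ParamGridBuilder().addGrid(rf.numTrees, [200, 400, 800]).build()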

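`ParamGridBuilder` builds the Cartesian product of every value list added with `addGrid`. The sketch below reproduces that expansion in plain Python to show how the grid grows. The helper `build_grid` is hypothetical, and the `maxDepth` values are an assumption for illustration only; the grid in this notebook varies `numTrees` alone.

```python
from itertools import product


def build_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


# The grid used in this notebook: one parameter, three candidates.
single = build_grid({"numTrees": [200, 400, 800]})

# Hypothetical second parameter, to show how quickly the grid grows.
double = build_grid({"numTrees": [200, 400, 800], "maxDepth": [5, 10]})

print(len(single), len(double))
```

Each extra parameter multiplies the number of candidates, and every candidate is fitted once per fold, so grids are best kept small on large datasets.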

## 6. Define Pipeline

In [10]:
%%spark
from pyspark.ml import Pipeline
pipe = Pipeline(stages=[hasher, scaler, rf])


## 6. Define Cross Validator

In [11]:
%%spark
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
crossval = CrossValidator(estimator=pipe, estimatorParamMaps=params,
                          evaluator=BinaryClassificationEvaluator(),
                          parallelism=5, numFolds=3, seed=42)

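To make the cost of this step concrete: with 3 parameter maps and `numFolds=3`, cross-validation fits the pipeline 9 times, then refits the best setting once on the full training set. The sketch below shows the selection logic in plain Python under stated assumptions. The fold scores are invented numbers, not results from this notebook, and `pick_best` is a hypothetical helper. Larger scores are treated as better, as with area under the ROC curve.

```python
from statistics import mean


def pick_best(fold_scores):
    """Average each candidate's per-fold scores and return the best candidate."""
    averages = {key: mean(scores) for key, scores in fold_scores.items()}
    best = max(averages, key=averages.get)
    return best, averages


num_candidates, num_folds = 3, 3
total_fits = num_candidates * num_folds + 1  # +1 for the final refit on all training data

# Invented per-fold scores keyed by numTrees, for illustration only.
fold_scores = {
    200: [0.90, 0.92, 0.91],
    400: [0.93, 0.92, 0.94],
    800: [0.93, 0.93, 0.92],
}
best, averages = pick_best(fold_scores)
print(total_fits, best)
```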

## 7. Fit Model

In [None]:
%%spark
cvModel = crossval.fit(train)

## 8. Save Best Params

In [3]:
%%spark
params = {'featuresCol': 'features', 'labelCol': 'label', 'predictionCol': 'prediction', 'probabilityCol': 'probability', 'rawPredictionCol': 'rawPrediction', 'maxDepth': 30, 'minInstancesPerNode': 50, 'numTrees': 100}


In [None]:
%%spark
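# The last pipeline stage is the fitted random forest; extract its parameters.
best_params = cvModel.bestModel.stages[-1].extractParamMap()

# Key the values by parameter name so the result is a plain, picklable dict.
params = {param.name: value for param, value in best_params.items()}

params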

In [4]:
%%spark
params


{'featuresCol': 'features', 'labelCol': 'label', 'predictionCol': 'prediction', 'probabilityCol': 'probability', 'rawPredictionCol': 'rawPrediction', 'maxDepth': 30, 'minInstancesPerNode': 50, 'numTrees': 100}

In [5]:
%%spark
import boto3
import pickle

s3 = boto3.client('s3')

# Serialize the parameter dictionary and upload it to the project bucket.
serialized = pickle.dumps(params)

s3.put_object(Bucket='jolfr-capstone3', Key='ForestParams', Body=serialized)


{'ResponseMetadata': {'RequestId': '718129FB0C67EEB0', 'HostId': 'BYW55X8rP2ZBXbsNK4tCCzwytLqyhCK5xQBRU7LqbatQYjCj02pESKlhOeU1UGOTRCcm83MwfgM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'BYW55X8rP2ZBXbsNK4tCCzwytLqyhCK5xQBRU7LqbatQYjCj02pESKlhOeU1UGOTRCcm83MwfgM=', 'x-amz-request-id': '718129FB0C67EEB0', 'date': 'Thu, 28 Jan 2021 12:55:43 GMT', 'etag': '"895711ae3d6177782ef2f3acc0f2ff1f"', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'ETag': '"895711ae3d6177782ef2f3acc0f2ff1f"'}
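For later notebooks that need these parameters, the bytes can be turned back into a dictionary with `pickle.loads`. The sketch below checks the round trip locally without touching S3. Reading the object back would use `s3.get_object(Bucket=..., Key=...)["Body"].read()` in place of the local `serialized` variable. Only unpickle data from a trusted source, since `pickle` can execute arbitrary code on load.

```python
import pickle

# Parameters as saved above.
params = {
    'featuresCol': 'features', 'labelCol': 'label', 'predictionCol': 'prediction',
    'probabilityCol': 'probability', 'rawPredictionCol': 'rawPrediction',
    'maxDepth': 30, 'minInstancesPerNode': 50, 'numTrees': 100,
}

serialized = pickle.dumps(params)    # bytes, as passed to put_object
restored = pickle.loads(serialized)  # what a downstream notebook would recover

print(restored == params)
```

The restored dictionary could then be unpacked into the estimator, for example `RandomForestClassifier(**restored)`, assuming every key is a constructor argument of the installed Spark version.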