# 5.1 Exploratory Modeling

##### Description

Pipelines to train the models are implemented in this notebook. The preprocessing steps of scaling and encoding are embedded in the pipeline. The exploratory models are trained to determine which has the highest accuracy score.

##### Notebook Steps

1. Connect Spark
1. Load data
1. Define pipeline
1. Define model
1. Define parameter grid
1. Define cross validator
1. Fit model
1. Save best params

## 1. Connect Spark

In [1]:
%load_ext sparkmagic.magics

In [2]:
%manage_spark


Added endpoint http://ec2-54-221-79-184.compute-1.amazonaws.com:8998/
Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 1 | application_1611312942316_0002 | pyspark | idle | Link | Link | ✔ |



SparkSession available as 'spark'.


Cleaned up endpoint http://ec2-54-221-79-184.compute-1.amazonaws.com:8998/


## 2. Load Data
There are two separate datasets we will be working with while modeling: train.csv and validate.csv. Train is used to fit the models, while validate is used to evaluate them on unseen data.

In [3]:
%%spark
train = spark.read.csv("s3://jolfr-capstone3/training/train.csv", header=True, inferSchema=True)

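Only train.csv is loaded in this section. A minimal sketch of loading the validation split the same way, assuming validate.csv sits under the same S3 prefix as train.csv. That location is an assumption, and the helper names `split_path` and `read_split` are hypothetical:

```python
def split_path(split, base="s3://jolfr-capstone3/training"):
    """Build the CSV path for a split name such as "train" or "validate".

    Assumption: every split lives under the same prefix as train.csv.
    """
    return f"{base}/{split}.csv"


def read_split(spark, split):
    """Read one split with the same options used for train.csv."""
    return spark.read.csv(split_path(split), header=True, inferSchema=True)
```

Inside a `%%spark` cell this could be called as `validate = read_split(spark, "validate")`, provided the file actually exists at that path.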

## 4. Define Pipeline

 1. Feature Hasher
 1. Standard Scaler
 1. Model(s)

In [17]:
%%spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier, RandomForestClassifier, DecisionTreeClassifier
from pyspark.ml.feature import FeatureHasher, StandardScaler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


Collect the feature column names (every column except `label`) for later use in the pipeline.

In [23]:
%%spark
cols = train.drop("label").columns


##### Define Feature Hasher

In [24]:
%%spark
hasher = FeatureHasher(inputCols=cols, outputCol="hash")

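To make the hashing step less of a black box, here is a rough pure-Python sketch of the hashing trick that `FeatureHasher` is built on. It is not Spark's implementation: Spark uses MurmurHash3 and defaults to 2^18 output slots, while this sketch swaps in `zlib.crc32` and a tiny vector so it runs anywhere. The function name `hash_features` and the sample row are hypothetical.

```python
import zlib


def hash_features(row, num_features=16):
    """Map a dict of column -> value to a sparse {index: value} vector."""
    vec = {}
    for col, value in row.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Numeric column: the column name picks the slot, the value is kept.
            key, weight = col, float(value)
        else:
            # String or boolean column: "column=value" picks the slot,
            # and the entry is an indicator of 1.0.
            key, weight = f"{col}={value}", 1.0
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        # Keys that collide share a slot and their values add up.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


hashed = hash_features({"age": 30, "city": "NYC"})
```

Because categorical values are hashed rather than indexed, no vocabulary has to be fitted first. The trade-off is that collisions can merge unrelated features, which becomes more likely as the output size shrinks.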

##### Define Scaler

In [25]:
%%spark
scaler = StandardScaler(inputCol=hasher.getOutputCol(), outputCol="features")

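The scaler above is created with default settings, which in Spark means each feature is divided by its sample standard deviation without being centered (`withStd=True`, `withMean=False`). A small pure-Python sketch of that arithmetic for a single feature column follows; the function name `scale_to_unit_std` is hypothetical.

```python
from statistics import stdev


def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation."""
    s = stdev(column)  # corrected (n - 1) sample standard deviation
    if s == 0:
        # A constant feature carries no spread; emit zeros for it.
        return [0.0 for _ in column]
    return [x / s for x in column]


scaled = scale_to_unit_std([2.0, 4.0, 6.0])
```

Skipping the mean subtraction keeps zero entries at zero, which suits the sparse vectors that feature hashing produces.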

##### Define Pipeline

In [26]:
%%spark
pipe = Pipeline(stages=[hasher, scaler])


## 5. Define Model

In [27]:
%%spark
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

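The two regularization arguments interact: `regParam` sets the overall penalty strength and `elasticNetParam` mixes L1 against L2. As a sketch of the penalty term these settings imply, following the elastic-net form given in the Spark documentation (the function name `elastic_net_penalty` and the example weights are hypothetical):

```python
def elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8):
    """Penalty = reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared
    )


penalty = elastic_net_penalty([1.0, -2.0])
```

With `elasticNetParam=0.8` most of the penalty is L1, so many coefficients are likely to be pushed to exactly zero. Setting it to 0 would give a pure L2 (ridge) penalty, and 1 a pure L1 (lasso) penalty.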

## 6. Define Parameter Grid
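No grid is defined in this notebook yet. As an illustration of what a parameter grid is, the sketch below expands candidate values into every combination, which is the same shape of output that Spark's `ParamGridBuilder` produces as a list of param maps. The helper `build_param_grid` and the candidate values are hypothetical and are not settings taken from this project.

```python
from itertools import product


def build_param_grid(candidates):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(candidates)
    value_lists = [candidates[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = build_param_grid({
    "regParam": [0.01, 0.1, 0.3],    # hypothetical candidates
    "elasticNetParam": [0.0, 0.8],   # hypothetical candidates
})
```

The grid size is the product of the list lengths (here 3 x 2 = 6), and cross-validation fits one model per combination per fold, so the grid is worth keeping small while exploring.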

## 7. Define Cross Validator
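The cross validators further down use `numFolds=3`. As a rough illustration of what that setting means, the pure-Python sketch below splits row indices into three folds and holds each one out in turn. It is not Spark's implementation, and the function name `k_fold_indices` and the seed are hypothetical.

```python
import random


def k_fold_indices(n_rows, num_folds=3, seed=42):
    """Return a list of (training_indices, held_out_indices) pairs."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Deal the shuffled indices into num_folds roughly equal folds.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for i, held_out in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, held_out))
    return splits


splits = k_fold_indices(10)
```

Each row is held out exactly once, so every candidate model is scored on data it was not fitted on, and the scores are averaged across the folds.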

## 8. Fit Model

In [28]:
%%spark
lr_pipe = Pipeline(stages=[pipe, lr])


In [29]:
%%spark
# Keep the fitted pipeline so it can be used for predictions later.
lr_model = lr_pipe.fit(train)


KeyboardInterrupt: 

## 9. Save Best Params


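In [None]:
%%spark

# Gradient-boosted trees with default hyperparameters.
gbt = GBTClassifier()

# Reuse the `hasher` and `scaler` stages defined in section 4.
gbt_pipeline = Pipeline(stages=[hasher, scaler, gbt])

# `paramGrid` must be defined first (section 6 is still empty).
gbt_cv = CrossValidator(estimator=gbt_pipeline,
                        estimatorParamMaps=paramGrid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)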

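In [None]:
%%spark

# Random forest with default hyperparameters.
rf = RandomForestClassifier()

# Reuse the `hasher` and `scaler` stages defined in section 4.
rf_pipeline = Pipeline(stages=[hasher, scaler, rf])

# `paramGrid` must be defined first (section 6 is still empty).
rf_cv = CrossValidator(estimator=rf_pipeline,
                       estimatorParamMaps=paramGrid,
                       evaluator=BinaryClassificationEvaluator(),
                       numFolds=3)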

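In [None]:
%%spark

# Single decision tree with default hyperparameters.
dt = DecisionTreeClassifier()

# Reuse the `hasher` and `scaler` stages defined in section 4.
dt_pipeline = Pipeline(stages=[hasher, scaler, dt])

# `paramGrid` must be defined first (section 6 is still empty).
dt_cv = CrossValidator(estimator=dt_pipeline,
                       estimatorParamMaps=paramGrid,
                       evaluator=BinaryClassificationEvaluator(),
                       numFolds=3)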

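In [None]:
%%spark
gbt_model = gbt_cv.fit(train)

In [None]:
%%spark
rf_model = rf_cv.fit(train)

In [None]:
%%spark
dt_model = dt_cv.fit(train)

In [None]:
%%spark
# `test` must hold the held-out data; it is not loaded anywhere in this notebook.
gbt = gbt_model.transform(test)

In [None]:
%%spark
rf = rf_model.transform(test)

In [None]:
%%spark
dt = dt_model.transform(test)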