## Final Project - Decision Tree Classifier

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [3]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator


### Read Your CSV File from Its Databricks Table

In [5]:
# Load the source data
csv = sqlContext.sql("SELECT * FROM final_csv")

In [6]:
# Select features and label
data1 = csv.select("tract_to_msamd_income","population", "minority_population", "loan_amount_000s", "applicant_income_000s","purchaser_type_name","preapproval_name","owner_occupancy_name","loan_type_name","lien_status_name","co_applicant_sex_name","co_applicant_race_name_1","co_applicant_ethnicity_name","agency_name","agency_abbr",col("action_taken_name").alias("label"))

In [7]:
# Drop every row that contains at least one null value
data2 = data1.dropna()

In [8]:
# Split the data
splits = data2.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")

### Define the Pipeline
Now define a pipeline that assembles the selected columns into a feature vector and trains a decision tree classification model.

In [10]:
vectorAssembler = VectorAssembler(inputCols = ["tract_to_msamd_income","population", "minority_population", "loan_amount_000s", "applicant_income_000s","purchaser_type_name","preapproval_name","owner_occupancy_name","loan_type_name","lien_status_name","co_applicant_sex_name","co_applicant_race_name_1","co_applicant_ethnicity_name","agency_name","agency_abbr"], outputCol="features")
#Model1 - Decision Tree 
decision = DecisionTreeClassifier(labelCol="label", featuresCol= "features")
pipeline = Pipeline(stages=[vectorAssembler, decision])
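
The assembler above takes its input columns as they are. `VectorAssembler` accepts numeric, boolean, and vector columns, so any `_name` column that is still stored as text would first have to be turned into numeric indices. The schema of `final_csv` is not shown here, so whether this applies is an assumption. In Spark this is typically done with `StringIndexer`, which by default gives the most frequent value index 0. The plain-Python sketch below only illustrates that idea; `fit_string_index` and the sample values are hypothetical and not part of the notebook.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical category values, for illustration only.
loan_types = [
    "Conventional", "FHA-insured", "Conventional",
    "VA-guaranteed", "Conventional", "FHA-insured",
]

mapping = fit_string_index(loan_types)
encoded = [mapping[v] for v in loan_types]

print(mapping)
print(encoded)
```

A decision tree can split on such indices directly, which is why indexing alone (without one-hot encoding) is often enough for this kind of model.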

### Train the Model

In [12]:
# Fit the pipeline (feature assembly + decision tree) on the training data
model = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data.

In [14]:
prediction = model.transform(test)
predicted = prediction.select("prediction", "trueLabel")

### Review the Area Under ROC
One way to assess the performance of a classification model is to measure the area under the ROC curve (AUC) for the model. The spark.ml library includes a **BinaryClassificationEvaluator** class that you can use to compute this.

In [16]:
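# Use the rawPrediction column so the ROC curve is built from the model's scores rather than from hard 0/1 predictions
evaluator = BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="rawPrediction", metricName="areaUnderROC")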
auc = evaluator.evaluate(prediction)
print("Area under ROC = {}".format(auc))
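
To make the metric more concrete: the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python. It is an illustration of the definition, not how Spark computes the value internally; `area_under_roc` and the sample labels and scores are hypothetical.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts 1.0 if the positive example has the higher score,
    0.5 if the scores are tied, and 0.0 otherwise.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc_example = area_under_roc(labels, scores)
print("Area under ROC = {}".format(auc_example))
```

In this example 8 of the 9 positive/negative pairs are ordered correctly, so the result is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.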

### Review the Recall And Precision
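Another way to assess the performance of a classification model is to compute its precision and recall. Precision is the fraction of predicted positives that are actually positive, TP / (TP + FP); recall is the fraction of actual positives that the model finds, TP / (TP + FN). Both can be derived by counting true and false positives and negatives in the predictions.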

In [18]:
# Count each cell of the confusion matrix (1 = positive class, 0 = negative class)
tp = float(predicted.filter("prediction == 1.0 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND trueLabel == 1").count())
metrics = spark.createDataFrame([("Precision", tp / (tp + fp)), ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()
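
The cell above divides by `tp + fp` and `tp + fn`, which raises a `ZeroDivisionError` if the model never predicts the positive class or the test set contains no positive rows. A minimal sketch of a guarded version follows. The helper `precision_recall` and the counts are hypothetical, and returning 0.0 for an undefined ratio is one possible convention, not the only one.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical confusion-matrix counts, for illustration only.
precision, recall = precision_recall(tp=80.0, fp=20.0, fn=40.0)
print("Precision = {}".format(precision))
print("Recall = {}".format(recall))
```

With these counts, precision is 80 / 100 and recall is 80 / 120. The same function could be called with the `tp`, `fp`, and `fn` values counted from the `predicted` DataFrame.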