### DS5110 Final Project Assignment

#### The Ed Squad
* Isaac Stevens (is3sb)<br>
* Jamie Oh (hso6b)<br>
* Ashlie Ossege (ajo5fs)<br>
* Shilpa Narayan (smn7ba)<br>

### Research question/hypothesis

Inspired by the question "How much has your ZIP Code Determined Your Opportunities" , posed in the Student Opinion section of NY Times<sup>1</sup>, The Ed Squad seeks to answer the question:

<b>What factors of a household contribute to completing education?</b></br>

Specifically, we hypothesize that Public Use Microdata Areas(PUMA) will be one of the leading indictors of whether or not one is predicted to complete their education. Our census data source from Ipums.org does not contain zipcode identifiers in public-use data, which is why we are looking at the most granular geographic factor our dataset has available.
    
H<sub>0</sub> = B<sub>puma</sub> = 0 <br>
H<sub>a</sub> = B<sub>puma</sub> ne 0


### Data Sources

Our data comes from the American Community Survey 2015-2019 Sample from usa.ipsums.org. and contains households and person information such as age, income, health insurance, and other demographic variables. In total there are 194 variables. Our focus of the study is in the South Region.

Data: https://usa.ipums.org/usa/sampdesc.shtml#us2019c </br>


### Supplemental Sources
<sup>1</sup>  https://www.nytimes.com/2020/05/19/learning/how-much-has-your-zip-code-determined-your-opportunities.html </br>
<sup>2</sup> https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html 

In [10]:
# import context manager: SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext
# set up the session
spark = SparkSession \
        .builder \
        .appName("project")\
        .config("spark.executor.memory", "100g")\
        .getOrCreate()
        
sqlContext = SQLContext(spark)

In [11]:

spark.conf.get("spark.sql.shuffle.partitions")

'200'

In [12]:

import os
os.listdir()
os.getcwd()

'/sfs/qumulo/qhome/ajo5fs'

In [13]:

#import pandas too for visualizations
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import itertools
from sklearn.metrics import confusion_matrix
pd.set_option('display.max_rows', 200000)

In [4]:
%%time
#import mlLib libraries for classification
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder,TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from pyspark.mllib.evaluation import MulticlassMetrics,BinaryClassificationMetrics
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler

CPU times: user 4.76 ms, sys: 747 µs, total: 5.51 ms
Wall time: 4.86 ms


### Read Data; Create a binary flag' rename columns drop if necessary

In [15]:
#import whole data from the census
data = spark.read.csv('/project/ds5559/ds5110_project_snoo/acs_15_19_south.csv', inferSchema="true", header="true")

In [16]:
#writing a user defined function to create a Educated or Not Flag - if EDUC>6 then it is 1 and if not 0
#https://towardsdatascience.com/5-ways-to-add-a-new-column-in-a-pyspark-dataframe-4e75c2fd8c08
def EDUCFunc(value):
  if   value > 6: 
      return 1
  else:
      return 0

#create the function to be applied and create a new column EDUC_FLAG
udfsomefunc = F.udf(EDUCFunc, IntegerType())
data = data.withColumn("EDUC_FLAG", udfsomefunc("EDUC"))
#see sample data
data.select('EDUC_FLAG').show(5)

+---------+
|EDUC_FLAG|
+---------+
|        0|
|        1|
|        0|
|        1|
|        0|
+---------+
only showing top 5 rows



In [7]:
df = data.withColumn("label",data.EDUC_FLAG) \
      .drop("EDUC_FLAG")

In [None]:

%%time
#check the count for EDUC>6 or verify if flag was populated correctly
data.filter(data.EDUC>6).count()

In [None]:

%%time
#Verify the flag count. Should match number above
data.filter(data.EDUC_FLAG!=0).count()

In [None]:
%%time
#renaming dependent variable to label because the classfier is not recognizing other names. Skip thsi if you are trying other classifiers

df = data.withColumn("label",data.EDUC_FLAG) \
      .drop("EDUC_FLAG")

In [18]:

#saving col names in case if we can use it later ot iterate or use the list for labels etc.
cols = df.columns
#spark.createDataFrame(cols,StringType()).toPandas()

#saving col names in case if we can use it later ot iterate or use the list for labels etc.
#cols = df.columns
cols = df.drop('_c0','EDUC','CLUSTER','CBSERIAL','STRATA','HHWT','EDUCD',\
 'QCOSTELE',\
 'QCOSTFUE',\
 'QCOSTGAS',\
 'QCOSTWAT',\
 'QFOODSTM',\
 'QINSINCL',\
 'QMORTGAG',\
 'QOWNERSH',\
 'QPROPINS',\
 'QTAXINCL',\
 'QVALUEH',\
 'QFUELHEA',\
 'QCIDIAL',\
 'QCILAPTOP',\
 'QCINETHH',\
 'QCIOTHSVC',\
 'QCISAT',\
 'QCISMRTPHN',\
 'QCITABLET',\
 'QCIDATAPLN',\
 'QVEHICLE',\
 'QAGE',\
 'QMARRNO',\
 'QMARST',\
 'QRELATE',\
 'QSEX',\
 'QYRMARR',\
 'QBPL',\
 'QCITIZEN',\
 'QHISPAN',\
 'QRACE',\
 'QYRNATUR',\
 'QHINSEMP',\
 'QHINSPUR',\
 'QHINSTRI',\
 'QHINSCAI',\
 'QHINSCAR',\
 'QHINSVA',\
 'QHINSIHS',\
 'QEDUC',\
 'QGRADEAT',\
 'QDEGFIELD',\
 'QSCHOOL',\
 'QCLASSWK',\
 'QEMPSTAT',\
 'QIND',\
 'QOCC',\
 'QUHRSWOR',\
 'QINCEARN',\
 'QINCBUS',\
 'QINCINVS',\
 'QINCOTHE',\
 'QINCRETI',\
 'QINCSS',\
 'QINCSUPP',\
 'QINCTOT',\
 'QFTOTINC',\
 'QINCWAGE',\
 'QINCWELF',\
 'QVETSTAT',\
 'QCARPOOL',\
 'QDEPARTS',\
 'QPWSTAT2',\
 'QRIDERS',\
 'QTRANTIM',\
 'QTRANWOR',\
 'QGCHOUSE',\
 'QGCMONTH',\
 'QGCRESPO').columns
#spark.createDataFrame(cols,StringType()).toPandas()

In [19]:
len(cols)

130

In [20]:
#Define variables which will be consistently used
seed = 42
split_ratio = [0.7,0.3]
numFolds = 3
threads = 6

rf = RandomForestClassifier(labelCol = "label", featuresCol = "scaledFeatures_train")

paramGrid = ParamGridBuilder().addGrid(rf.numTrees, [30, 50]).build()

pca_model = PCA(k=10, inputCol = "scaledFeatures_train", outputCol = "pca_features_cv")

#create a param grid to pass to cross validator 
#k --> number of principal components
#number of treess in rf
#need to add more later
paramGrid_pca = ParamGridBuilder().addGrid(rf.numTrees, [20, 30, 50]).build()

#.addGrid(pca_model.k, [10]) \ 
bcm = BinaryClassificationEvaluator()

selected_cols=[cols for cols in cols if cols not in['label','MULTYEAR']]
#Identified first 35 variables of interest
keep_35 = ["HHTYPE","REGION","STATEFIP","COUNTYFIP","METRO","COSTELEC","COSTGAS","COSTWATR","COSTFUEL","FOODSTMP","CINETHH","CILAPTOP",\
        "CISMRTPHN","CITABLET","VEHICLES","COUPLETYPE","NFAMS","NMOTHERS","NFATHERS","CITIZEN","YRSUSA1","RACAMIND","RACASIAN","RACBLK","RACPACIS"\
        ,"RACWHT","RACOTHER","HCOVANY","EMPSTAT","LABFORCE","CLASSWKR","UHRSWORK","VETSTAT","TRANWORK","GCHOUSE","label","MULTYEAR"]

In [9]:
df.show(5)

+---+----+--------+------+------+-------------+----+------+-------------+------+--------+---------+-----+------+---+--------+---------+--------+-------+-------+--------+--------+-------+--------+--------+--------+-------+------+-------+--------+---------+--------+---------+---------+-----+------+--------+--------+--------+----------+----+-----+-------+--------+--------+--------+-------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+--------+-------+---------+--------+---------+------+----------+---------+----------+--------+--------+------+-----+------+-------+---+---+-----+-------+------+------+----+-----+------+-------+---+----+-------+-------+-------+-------+--------+--------+------+--------+------+--------+-------+--------+-------+-------+-------+-------+--------+--------+------+-------+------+----+-----+--------+---------+--------+--------+---------+---------+----------+-------+--------+--------+--------+---------+----+--

### Reusable Functions to create sample data, split data, preprocess train, tst, fit model and generate classfication metrics and CM

In [25]:
def createSampleData(df,cols,sampleweight):
    df_small = df.select(cols)
    sampled = df_small.sampleBy("MULTYEAR", fractions={2015:sampleweight, 2016: sampleweight, 2017:sampleweight, 2018:sampleweight, 2019:sampleweight}, seed=seed)
    return sampled

def splitData(dataframe,split_ratio,seed):
    training_data, test_data = dataframe.randomSplit(split_ratio, seed=seed)
    cached_tr = training_data.cache()
    cached_test = test_data.cache()
    return cached_tr,cached_test

def preProcessTrainFit(cached_tr,model,pca_model,paramGrid,evaluator,numFolds,seed ):
    #Assemble traininngdata
    #pass all the features into vector assembler to create a vector format to pass to the classification model
    selected_cols=[cols for cols in cached_tr.columns if cols not in['label','MULTYEAR']]
    assembler = VectorAssembler(inputCols=selected_cols, outputCol="features") 
    #scale
    scaler_train = StandardScaler(inputCol="features", outputCol="scaledFeatures_train",withStd=False, withMean=True)

    #creating a pipeline with the assembler,scaler and model to use in the cross validator
    if pca_model is None:  print("Not PCA"); ppl_cv = Pipeline(stages = [assembler,scaler_train, model])
        
    else: 
        print("PCA");
        rf = RandomForestClassifier(labelCol = "label", featuresCol = "pca_features_cv")
        ppl_cv = Pipeline(stages = [assembler,scaler_train,pca_model, rf])

    #passs the model with variosu combinations of the parameters and it will pick the best one. Using 3 folds to save time. Check seed=42.
    crossval = CrossValidator(estimator = ppl_cv,\
                                            estimatorParamMaps=paramGrid,\
                                            evaluator = evaluator ,\
                                            numFolds= numFolds,seed=seed,parallelism=threads)
    #this is our best model - fit the training data
    #https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
    return crossval.fit(cached_tr)
    

def preProcessTest(test_data):
    #prepare test data to test predictions
    selected_cols=[cols for cols in test_data.columns if cols not in['label','MULTYEAR']]
    assembler_test = VectorAssembler(inputCols=selected_cols, outputCol="features") 
    transformed_test = assembler_test.transform(test_data)
    #register table as sql table and keep only columns fo interest and save in a new dataframe. This can be done without using SQl as well.
    transformed_test.registerTempTable('transformed_tbl_test')
    transformed_df_test = sqlContext.sql('select label,features from transformed_tbl_test')
    #scale test data
    scaler_test = StandardScaler(inputCol="features", outputCol="scaledFeatures",withStd=False, withMean=True)
    scalerModel_test = scaler_test.fit(transformed_df_test)
    scaledData_test = scalerModel_test.transform(transformed_df_test)
    
    return scaledData_test

In [22]:
#https://runawayhorse001.github.io/LearningApacheSpark/classification.html
#https://shihaojran.com/distributed-machine-learning-using-pyspark/
#https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
# Calculate the elements of the confusion matrix
#https://runawayhorse001.github.io/LearningApacheSpark/classification.html#random-forest-classification

def createLabelsCM(preds):
    
    ##saving labels in a list to pass to the plot
    class_temp = preds.select("label").groupBy("label")\
                            .count().sort('count', ascending=False).toPandas()
    class_temp = class_temp["label"].values.tolist()
    y_true = preds.select("label")
    y_true = y_true.toPandas()

    y_pred = preds.select("prediction")
    y_pred = y_pred.toPandas()

    return confusion_matrix(y_true, y_pred,class_temp)

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

def classificationMetrics(preds, evaluator):
    #calcualte classification report
    TN = preds.filter('prediction = 0 AND label = prediction').count()
    TP = preds.filter('prediction = 1 AND label = prediction').count()
    FN = preds.filter('prediction = 0 AND label <> prediction').count()
    FP = preds.filter('prediction = 1 AND label <> prediction').count()
    # show confusion matrix
    preds.groupBy('label', 'prediction').count().show()
    # calculate metrics by the confusion matrix
    accuracy = (TN + TP) / (TN + TP + FN + FP)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    F =  2 * (precision*recall) / (precision + recall)
    # calculate auc
    auc = evaluator.evaluate(preds, {evaluator.metricName: 'areaUnderROC'})
    print('n precision: %0.3f' % precision)
    print('n recall: %0.3f' % recall)
    print('n accuracy: %0.3f' % accuracy)
    print('n F1 score: %0.3f' % F)
    print('AUC: %0.3f' % auc)

### PreProcess

In [26]:

#sampling data to use more effeciently; seed = 42
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.sampleBy.html
#https://towardsdatascience.com/exploratory-data-analysis-eda-with-pyspark-on-databricks-e8d6529626b1
#https://www.kaggle.com/tientd95/advanced-pyspark-for-exploratory-data-analysis
sampled = createSampleData(df,cols,0.1)

### Random Forest Model with Dimension Reduction

In [28]:
%%time
#split data
cached_tr_pca, cached_test_pca = splitData(sampled,split_ratio,seed)

#preprocesstrain
cv_model_pca = preProcessTrainFit(cached_tr_pca,rf,pca_model,paramGrid_pca,bcm,numFolds,seed)

#preprocess test
scaled_test_pca = preProcessTest(cached_test_pca)

PCA
CPU times: user 1.38 s, sys: 370 ms, total: 1.75 s
Wall time: 6min 31s


In [30]:
#select the best model ffro mthe cross validator
bestPipeline = cv_model_pca.bestModel

#creating a new dataframe wiith labels features and predictions
predictions = bestPipeline.transform(cached_test_pca)

#Metrics
#call classification metrics method to print metrics
classificationMetrics(predictions,bcm)

# Plot normalized confusion matrix
cnf_matrix = createLabelsCM(predictions)
plot_confusion_matrix(cnf_matrix, classes=['Not Educated','Educated'], normalize=True)
                      #title='Normalized confusion matrix')
#plt.show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|32484|
|    0|       0.0|88025|
|    1|       1.0|41389|
|    0|       1.0|16534|
+-----+----------+-----+

n precision: 0.715
n recall: 0.560
n accuracy: 0.725
n F1 score: 0.628
AUC: 0.810




In [None]:
##PCA Loadings
pipe = bestPipeline.stages[2]
exp_var = pipe.explainedVariance
print("Explained Variance: ",exp_var)
data_scaled = bestPipeline.stages[1]
#https://stackoverflow.com/questions/22984335/recovering-features-names-of-explained-variance-ratio-in-pca-with-sklearn
#print(pd.DataFrame(pipe.pc,columns=pd.DataFrame(data_scaled).columns,index = ['PC-0','PC-1','PC-2','PC-3','PC-4','PC-5','PC-6','PC-7','PC-8','PC-9']))
#https://www.py4u.net/discuss/218858
#https://datascience-enthusiast.com/Python/PCA_Spark_Python_R.html
rows = pipe.pc.toArray().tolist()
df_pca = spark.createDataFrame(rows,['PC-0','PC-1','PC-2','PC-3','PC-4','PC-5','PC-6','PC-7','PC-8','PC-9'])
df_pandas = df_pca.toPandas()
df_pandas.index = selected_cols
df_pandas.sort_values(by='PC-0', ascending=False)

In [None]:

#look at the chosen model and models in all the folds and with all params
rf_model = bestPipeline.stages[3]
print(rf_model)

#all the 9 model accuracies. The max one was picked as best
avgMetricsGrid = cv_model_pca.avgMetrics
print(avgMetricsGrid)

#https://tsmatz.github.io/azure-databricks-exercise/exercise04-hyperparams-tuning.html
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html
# View all results (accuracy) by each params - these can be converted to pretty tables in pandas later
list(zip(cv_model_pca.getEstimatorParamMaps()))

### Model Selection features based on PC-0 component

In [None]:
%%time
#create sample dataframe
keep_pca =['VALUEH','FTOTINC','INCTOT','INCEARN','BPLD','COSTFUEL','PROPINSR','YRMARR','DEGFIELDD','PWPUMA00','IND','label','MULTYEAR']
sampled_35 = createSampleData(df,keep_pca,0.1)

#split data
cached_tr, cached_test = splitData(sampled_35,split_ratio,seed)

#preProcessTrain data and fit
model_35 = preProcessTrainFit(cached_tr,rf,None,paramGrid,bcm,numFolds,seed)

#preprocess testdata
scaled_test = preProcessTest(cached_test)

In [None]:
#predictions
bestPipeline_35 = model_35.bestModel
pipe_35 = bestPipeline_35.stages[1]
predictions_35 = bestPipeline_35.transform(cached_test)
classificationMetrics(predictions_35,bcm)

#Visualize Metrics
cnf_matrix_35 = createLabelsCM(predictions_35)
plot_confusion_matrix(cnf_matrix_35, classes=['Not Educated','Educated'], normalize=True,
                      title='Normalized confusion matrix')
plt.show()

In [None]:
avgMetricsGrid_35 = model_35.avgMetrics
print(avgMetricsGrid_35)
list(zip(model_35.getEstimatorParamMaps()))

### Checking Feature Importance to improve the model by removing features which are not important

In [None]:

bestModel = bestPipeline_35.stages[2]
importances = bestModel.featureImportances
x_values = list(range(len(importances)))
selected_cols_imp=[cols for cols in keep_pca if cols not in['label','MULTYEAR']]
plt.barh(x_values,importances);
plt.yticks(x_values,selected_cols_imp, rotation=0);
plt.ylabel('Feature');
plt.xlabel('Importance');
plt.title('Feature Importances');
plt.show();

### GRADIANT BOOSTING - ASHLIE



In [33]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [32]:
# Load and parse the data file, converting it to a DataFrame.
df = df.select(keep_35)


+------+------+--------+---------+-----+--------+-------+--------+--------+--------+-------+--------+---------+--------+--------+----------+-----+--------+--------+-------+-------+--------+--------+------+--------+------+--------+-------+-------+--------+--------+--------+-------+--------+-------+-----+--------+
|HHTYPE|REGION|STATEFIP|COUNTYFIP|METRO|COSTELEC|COSTGAS|COSTWATR|COSTFUEL|FOODSTMP|CINETHH|CILAPTOP|CISMRTPHN|CITABLET|VEHICLES|COUPLETYPE|NFAMS|NMOTHERS|NFATHERS|CITIZEN|YRSUSA1|RACAMIND|RACASIAN|RACBLK|RACPACIS|RACWHT|RACOTHER|HCOVANY|EMPSTAT|LABFORCE|CLASSWKR|UHRSWORK|VETSTAT|TRANWORK|GCHOUSE|label|MULTYEAR|
+------+------+--------+---------+-----+--------+-------+--------+--------+--------+-------+--------+---------+--------+--------+----------+-----+--------+--------+-------+-------+--------+--------+------+--------+------+--------+-------+-------+--------+--------+--------+-------+--------+-------+-----+--------+
|     1|    32|       1|       97|    4|    2724|    648| 

In [37]:
from pyspark.ml.feature import VectorAssembler

# inputCols take a list of column names
# outputCol is arbitrary name of new column; generally called features

keep_34 = ["HHTYPE","REGION","STATEFIP","COUNTYFIP","METRO","COSTELEC","COSTGAS","COSTWATR","COSTFUEL","FOODSTMP","CINETHH","CILAPTOP",\
        "CISMRTPHN","CITABLET","VEHICLES","COUPLETYPE","NFAMS","NMOTHERS","NFATHERS","CITIZEN","YRSUSA1","RACAMIND","RACASIAN","RACBLK","RACPACIS"\
        ,"RACWHT","RACOTHER","HCOVANY","EMPSTAT","LABFORCE","CLASSWKR","UHRSWORK","VETSTAT","TRANWORK","GCHOUSE","MULTYEAR"]

assembler = VectorAssembler(inputCols=keep_34,
                            outputCol="features")

tr = assembler.transform(df)


+------+------+--------+---------+-----+--------+-------+--------+--------+--------+-------+--------+---------+--------+--------+----------+-----+--------+--------+-------+-------+--------+--------+------+--------+------+--------+-------+-------+--------+--------+--------+-------+--------+-------+-----+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|HHTYPE|REGION|STATEFIP|COUNTYFIP|METRO|COSTELEC|COSTGAS|COSTWATR|COSTFUEL|FOODSTMP|CINETHH|CILAPTOP|CISMRTPHN|CITABLET|VEHICLES|COUPLETYPE|NFAMS|NMOTHERS|NFATHERS|CITIZEN|YRSUSA1|RACAMIND|RACASIAN|RACBLK|RACPACIS|RACWHT|RACOTHER|HCOVANY|EMPSTAT|LABFORCE|CLASSWKR|UHRSWORK|VETSTAT|TRANWORK|GCHOUSE|label|MULTYEAR|features                                                                                                                                                         |
+------+------+--------+---------+-----+--------

In [38]:

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(tr)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 8 distinct values are treated as continuous.

featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=8).fit(tr)

In [39]:

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = tr.randomSplit([0.7, 0.3])


In [40]:
# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

In [41]:



# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       0.0|         0.0|(36,[1,2,9,16,21,...|
|       0.0|         0.0|(36,[1,2,9,16,21,...|
|       0.0|         1.0|(36,[1,2,9,16,21,...|
|       0.0|         1.0|(36,[1,2,9,16,21,...|
|       0.0|         1.0|(36,[1,2,9,16,21,...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.260424
GBTClassificationModel: uid = GBTClassifier_4d6012abc4d7, numTrees=10, numClasses=2, numFeatures=36
