## 101 - Training and Evaluating Classifiers with `mmlspark`

In this example, we try to predict incomes from the *Adult Census* dataset.

First, we import the packages (use `help(mmlspark)` to view the package contents):

In [0]:
import numpy as np
import pandas as pd

In [0]:
import mmlspark

Now let's read the data and split it into training and test sets:

In [0]:
data = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet")
data = data.select(["education", "marital-status", "hours-per-week", "income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()

Unnamed: 0,education,marital-status,hours-per-week,income
0,10th,Divorced,1.0,<=50K
1,10th,Divorced,40.0,<=50K
2,10th,Divorced,40.0,<=50K
3,10th,Divorced,40.0,<=50K
4,10th,Divorced,40.0,<=50K
5,10th,Divorced,40.0,<=50K
6,10th,Divorced,40.0,<=50K
7,10th,Divorced,40.0,<=50K
8,10th,Divorced,68.0,>50K
9,10th,Married-civ-spouse,16.0,<=50K
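
Because each row is assigned to a split by an independent random draw, the 75/25 proportions are approximate rather than exact, and the same seed reproduces the same split only when the input data is unchanged. The following is a minimal plain-Python sketch of that idea, not Spark's actual implementation; the helper name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent uniform draw (illustrative only)."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.75, 1.0].
    bounds = []
    cumulative = 0.0
    for weight in weights:
        cumulative += weight / total
        bounds.append(cumulative)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=123)
print(len(train_rows), len(test_rows))
```

The two lists always partition the input, but their sizes vary around 750 and 250 from seed to seed.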


`TrainClassifier` wraps SparkML classifiers and can be used to initialize and fit a model.
You can use `help(mmlspark.train.TrainClassifier)` to view the different parameters.

Note that it implicitly converts the data into the format expected by the algorithm: it tokenizes
and hashes strings, one-hot encodes categorical variables, assembles the features into a vector,
and so on. The parameter `numFeatures` controls the number of hashed features.
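
To make the featurization steps concrete, here is a minimal plain-Python sketch of the hashing trick, one-hot encoding, and vector assembly. It is an illustration under stated assumptions, not what `TrainClassifier` runs internally: the helpers `hash_features` and `one_hot`, the sample text, the category list, and the use of `zlib.crc32` as the hash function are all hypothetical stand-ins.

```python
import zlib


def hash_features(tokens, num_features=256):
    """Map tokens to a fixed-length count vector using the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash; it is deterministic across runs.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical free-text value: tokenize on whitespace, then hash into 8 buckets.
tokens = "some college no degree".lower().split()
hashed = hash_features(tokens, num_features=8)

# Hypothetical category list for a categorical column.
categories = ["Divorced", "Married-civ-spouse", "Never-married"]
encoded = one_hot("Divorced", categories)

# "Assembling" concatenates the per-column pieces into one feature vector.
features = hashed + encoded
print(features)
```

A smaller `num_features` gives shorter vectors but makes it more likely that distinct tokens collide in the same bucket.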

In [0]:
from mmlspark.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol="income", numFeatures=256).fit(train)

In [0]:
help(TrainClassifier)

After the model is trained, we score it against the test dataset and view metrics.

In [0]:
from mmlspark.train import ComputeModelStatistics, TrainedClassifierModel
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()

Unnamed: 0,evaluation_type,confusion_matrix,accuracy,precision,recall,AUC
0,Classification,"DenseMatrix([[5780., 378.],\n [10...",0.825283,0.708333,0.468846,0.87245
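
Accuracy, precision, and recall can all be derived from the four cells of a binary confusion matrix, whereas AUC cannot, because it depends on the predicted scores rather than the thresholded labels. The sketch below shows the arithmetic with hypothetical counts (not the values from the output above) and assumes rows are actual classes and columns are predicted classes; the helper `binary_metrics` is illustrative only.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    predicted_positive = tp + fp
    actual_positive = tp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Fall back to 0.0 when a denominator is empty to avoid division by zero.
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        "recall": tp / actual_positive if actual_positive else 0.0,
    }


# Hypothetical counts, laid out as [[tn, fp], [fn, tp]].
example = binary_metrics(tn=50, fp=10, fn=5, tp=35)
print(example)
```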


Finally, we save the model so it can be used in a scoring program.

In [0]:
model.write().overwrite().save("AdultCensus.mml")
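
A scoring program could later reload the saved model and apply it to new data. The sketch below assumes that `TrainedClassifierModel` exposes the standard Spark ML `load` method and that the caller passes in an active `SparkSession`; the function name `load_and_score` and its parameters are hypothetical.

```python
def load_and_score(spark, model_path, data_path):
    """Reload a saved model and score a Parquet dataset (sketch, assumptions above)."""
    # Imported here so that defining the function does not require mmlspark.
    from mmlspark.train import TrainedClassifierModel

    # Assumption: load() is available, as on other Spark ML readable models.
    loaded_model = TrainedClassifierModel.load(model_path)
    new_data = spark.read.parquet(data_path)
    return loaded_model.transform(new_data)
```

In a notebook session this might be called as `load_and_score(spark, "AdultCensus.mml", data_path)`, where `data_path` points to data with the same columns used for training.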