## 101 - Training and Evaluating Classifiers with `mmlspark`

In this example, we try to predict incomes from the *Adult Census* dataset.

First, we import the packages (use `help(mmlspark)` to view contents):

In [1]:
import numpy as np
import pandas as pd
from pyspark.sql.types import DoubleType, StringType, StructField, StructType
from pyspark.sql import SparkSession
import pyspark
# Start Spark with the mmlspark package on the classpath so that
# `import mmlspark` below can find it.
spark = SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
            .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
            .getOrCreate()
import mmlspark

Now let's read the data and split it into train and test sets:

In [4]:
schema = StructType([
  StructField("age", DoubleType(), False),
  StructField("workclass", StringType(), False),
  StructField("fnlwgt", DoubleType(), False),
  StructField("education", StringType(), False),
  StructField("education_num", DoubleType(), False),
  StructField("marital_status", StringType(), False),
  StructField("occupation", StringType(), False),
  StructField("relationship", StringType(), False),
  StructField("race", StringType(), False),
  StructField("sex", StringType(), False),
  StructField("capital_gain", DoubleType(), False),
  StructField("capital_loss", DoubleType(), False),
  StructField("hours_per_week", DoubleType(), False),
  StructField("native_country", StringType(), False),
  StructField("income", StringType(), False)
])

data = spark.read.format("csv").schema(schema).load("/home/robin/datatsets/adult/adult.data")
# Column names must match the schema above (underscores, not hyphens).
data = data.select(["education", "marital_status", "hours_per_week", "income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()

Note that the selected column names must match the schema exactly (`marital_status` and `hours_per_week`, not `marital-status` and `hours-per-week`); otherwise Spark raises an `AnalysisException` reporting that it cannot resolve the column given the input columns.

`TrainClassifier` wraps SparkML classifiers and can be used to initialize and fit a model.
You can use `help(mmlspark.TrainClassifier)` to view the different parameters.

Note that it implicitly converts the data into the format expected by the algorithm: it tokenizes
and hashes strings, one-hot encodes categorical variables, assembles the features into a vector,
and so on. The parameter `numFeatures` controls the number of hashed features.
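
To make the role of `numFeatures` more concrete, here is a minimal pure-Python sketch of the hashing trick. It is an illustration only, not the code `TrainClassifier` runs: the helper `hash_tokens` is hypothetical, and `zlib.crc32` is a stand-in for whatever hash function Spark applies internally. The point is that every token lands in one of `num_features` buckets, so the vector length is fixed regardless of vocabulary size.

```python
import zlib


def hash_tokens(text, num_features=256):
    """Map a string to sparse token counts over num_features hash buckets.

    Hypothetical helper for illustration; crc32 is a stand-in hash and is
    not claimed to match the hash used by Spark or mmlspark.
    """
    counts = {}
    for token in text.lower().split():
        # The modulo keeps every index inside [0, num_features).
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts


# Three tokens are spread over at most three of the 256 buckets.
vec = hash_tokens("Married civ spouse", num_features=256)

# With a single bucket, all tokens collide into index 0.
collided = hash_tokens("Married civ spouse", num_features=1)
```

A smaller `num_features` saves memory but makes collisions between unrelated tokens more likely; a larger value does the opposite.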

In [None]:
from mmlspark.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol="income", numFeatures=256).fit(train)

After the model is trained, we score it against the test dataset and view metrics.

In [None]:
from mmlspark.train import ComputeModelStatistics, TrainedClassifierModel
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()
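
For intuition about the kind of numbers such a metrics table reports, the following pure-Python sketch computes accuracy, precision, and recall for a binary problem from raw labels and predictions. It is an assumption that these are among the statistics returned; the helper `binary_metrics` and the toy data are hypothetical and do not reproduce the implementation of `ComputeModelStatistics`.

```python
def binary_metrics(labels, predictions, positive=1):
    """Compute confusion counts and basic metrics for binary classification.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    total = tp + fp + fn + tn
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        # Guard each ratio against an empty denominator.
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data: 1 could stand for the higher income class, 0 for the lower one.
labels = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]
stats = binary_metrics(labels, predictions)
```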

Finally, we save the model so it can be used in a scoring program.

In [None]:
model.write().overwrite().save("AdultCensus.mml")
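
A scoring program could then reload the model along the following lines. This is a sketch under assumptions: the helper `load_and_score` is hypothetical, and it assumes `TrainedClassifierModel.load` is available as the usual Spark ML reader counterpart to `write().save()`. It also needs an active `SparkSession` configured with the mmlspark package, as in the first cell.

```python
def load_and_score(model_path, df):
    """Reload a saved TrainClassifier model and score a Spark DataFrame.

    Hypothetical helper; assumes a SparkSession with mmlspark is running.
    """
    # Imported lazily because mmlspark is only importable once the Spark
    # session has pulled in the package.
    from mmlspark.train import TrainedClassifierModel

    # Assumption: load() follows the standard Spark ML reader convention.
    model = TrainedClassifierModel.load(model_path)
    return model.transform(df)
```

For example, `load_and_score("AdultCensus.mml", test)` would be expected to return the same kind of scored DataFrame as `model.transform(test)` above.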