## Binary classification example: training and evaluating classifiers with `mmlspark`, then creating a web service with MMLSpark Serving

In this example, we try to predict incomes from the *Adult Census* dataset, then use MMLSpark Serving to expose the trained model as a web service.

First, we import the packages (use `help(mmlspark)` to view contents):

In [2]:
import sys

In [3]:
import numpy as np
import pandas as pd
import mmlspark

# help(mmlspark)

Now let's read the data and split it into training and test sets:

In [5]:
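dataFilePath = "AdultCensusIncome.csv"
import os
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2
if not os.path.isfile(dataFilePath):
    urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFilePath, dataFilePath)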
data = spark.createDataFrame(pd.read_csv(dataFilePath, dtype={" hours-per-week": np.float64}))
data = data.select([" education", " marital-status", " hours-per-week", " income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()

`TrainClassifier` wraps SparkML classifiers and can be used to initialize and fit a model.
You can use `help(mmlspark.TrainClassifier)` to view the different parameters.

Note that it implicitly converts the data into the format expected by the algorithm: it tokenizes
and hashes strings, one-hot encodes categorical variables, assembles the features into a vector,
and so on. The parameter `numFeatures` controls the number of hashed features.
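
To make the role of `numFeatures` more concrete, here is a minimal pure-Python sketch of feature hashing. It is not MMLSpark's implementation: the helper `hash_features`, the use of MD5, and the way tokens are built from column names are all assumptions made for illustration. The point is only that every token lands in one of `num_features` slots, so the vector length is fixed no matter how many distinct strings appear.

```python
import hashlib

def hash_features(tokens, num_features=256):
    """Map a list of string tokens to a fixed-length count vector."""
    vector = [0.0] * num_features
    for token in tokens:
        # A stable hash (MD5, chosen only for illustration) selects the slot.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_features
        vector[index] += 1.0
    return vector

# Hypothetical tokens: each string value is prefixed with its column name.
tokens = ["education= 10th", "marital-status= Divorced"]
features = hash_features(tokens, num_features=256)
print(len(features), sum(features))
```

A smaller `num_features` saves memory but makes collisions (two tokens sharing a slot) more likely; a larger value does the opposite.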

In [7]:
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol=" income", numFeatures=256).fit(train)

After the model is trained, we score it against the test dataset and view metrics.

In [9]:
from mmlspark import ComputeModelStatistics, TrainedClassifierModel
prediction = model.transform(test)
prediction.head(5)
test.printSchema()

In [10]:
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()
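
For readers who want to sanity-check the reported numbers, the sketch below shows how common binary classification metrics follow from a confusion matrix. It is a plain-Python illustration under stated assumptions, not the code `ComputeModelStatistics` runs: the helper `binary_metrics` and the toy labels are hypothetical, and the exact set of columns MMLSpark returns may differ.

```python
def binary_metrics(labels, predictions, positive=1):
    """Compute confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / float(len(pairs)),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / float(tp + fp) if (tp + fp) else 0.0,
        "recall": tp / float(tp + fn) if (tp + fn) else 0.0,
    }

# Toy data: 1 stands for " >50K", 0 for " <=50K".
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
stats = binary_metrics(labels, predictions)
print(stats)
```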

In [11]:
#Link to documentation: https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md 

Determine the schema of the input row for the web service:

In [13]:
from pyspark.sql.functions import length, col, lit, to_json, struct

row = test.select(to_json(struct("*")).alias("json")).take(1)[0]
row.json

Define the web service input and output:

In [15]:
from pyspark.sql.functions import length, col, lit, from_json
from pyspark.sql.types import *
import uuid
import requests


serving_inputs = spark.readStream.server() \
    .address("localhost", 8888, "my_api") \
    .load()\
    .withColumn("variables", from_json(col("value"), test.schema))\
    .select("id","variables.*")
    
serving_outputs = model.transform(serving_inputs) \
  .withColumn("scored_labels", col("scored_labels").cast("string"))

serving_outputs.writeStream \
    .server() \
    .option("name", "my_api") \
    .queryName("my_query") \
    .option("replyCol", "scored_labels") \
    .option("checkpointLocation", "checkpoints-{}".format(uuid.uuid1())) \
    .start()


Test the web service:

In [17]:
import requests
import base64

data = u'{" education":" 10th"," marital-status":" Divorced"," hours-per-week":40.0," income":" <=50K"}'
r = requests.post(data=data, url="http://localhost:8888/my_api")
print(r.text)
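
Hand-writing the JSON string is easy to get wrong, because the column names in this dataset carry a leading space. One possible alternative is to build the payload from a dictionary with `json.dumps`. The helper `build_payload` below is hypothetical and only mirrors the four columns selected earlier; the request itself is left as a comment because it needs the service to be running.

```python
import json

def build_payload(education, marital_status, hours_per_week, income):
    """Serialize one input row using the dataset's column names (leading spaces included)."""
    row = {
        " education": education,
        " marital-status": marital_status,
        " hours-per-week": float(hours_per_week),  # read as float64 earlier
        " income": income,
    }
    return json.dumps(row)

payload = build_payload(" 10th", " Divorced", 40, " <=50K")
print(payload)

# With the service running, the payload could be sent like this:
# import requests
# r = requests.post(url="http://localhost:8888/my_api", data=payload)
# print(r.text)
```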