# PySpark ML

In [2]:
%fs ls dbfs:/FileStore/tables/

path,name,size
dbfs:/FileStore/tables/01_pyspark_ml.ipynb,01_pyspark_ml.ipynb,19104
dbfs:/FileStore/tables/births.csv,births.csv,1908068
dbfs:/FileStore/tables/births_transformed.csv,births_transformed.csv,1908068


## Predict chances of infant survival with ML

### Load the data

In [5]:
from pyspark.sql.types import IntegerType, StructField, StructType

labels = [
    ('INFANT_ALIVE_AT_REPORT', IntegerType()),
    ('BIRTH_PLACE', IntegerType()),
    ('MOTHER_AGE_YEARS', IntegerType()),
    ('FATHER_COMBINED_AGE', IntegerType()),
    ('CIG_BEFORE', IntegerType()),
    ('CIG_1_TRI', IntegerType()),
    ('CIG_2_TRI', IntegerType()),
    ('CIG_3_TRI', IntegerType()),
    ('MOTHER_HEIGHT_IN', IntegerType()),
    ('MOTHER_PRE_WEIGHT', IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', IntegerType()),
    ('MOTHER_WEIGHT_GAIN', IntegerType()),
    ('DIABETES_PRE', IntegerType()),
    ('DIABETES_GEST', IntegerType()),
    ('HYP_TENS_PRE', IntegerType()),
    ('HYP_TENS_GEST', IntegerType()),
    ('PREV_BIRTH_PRETERM', IntegerType())
]

schema = StructType([
    StructField(e[0], e[1], False) for e in labels
])

births = spark.read.csv('dbfs:/FileStore/tables/births.csv', 
                        header=True, 
                        schema=schema)

In [6]:
births.show()

In [7]:
births.groupby('BIRTH_PLACE').count().show()

### Create transformers
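- With the `BIRTH_PLACE` counts inspected, we can now create our first `Transformer`: a `OneHotEncoder` for this categorical column. (In Spark 3.0 and later, `OneHotEncoder` is an `Estimator` that is fitted to produce the transformer; inside a `Pipeline` the code is the same.)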

In [9]:
from pyspark.ml.feature import OneHotEncoder, VectorAssembler

encoder = OneHotEncoder(
    inputCol='BIRTH_PLACE', 
    outputCol='BIRTH_PLACE_VEC')
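
To make the encoder's output concrete, here is a minimal plain-Python sketch of one-hot encoding. It assumes category indices run from 0 to `num_categories - 1` and mirrors the encoder's default `dropLast=True`, under which the last category is represented by an all-zeros vector. The helper `one_hot` is hypothetical and not part of PySpark, which also returns a sparse vector instead of a list.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    # With drop_last, the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # The dropped (last) category has no slot, so it stays all zeros.
    if index < size:
        vec[index] = 1.0
    return vec


print(one_hot(0, 3))                   # [1.0, 0.0]
print(one_hot(2, 3))                   # [0.0, 0.0]
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```

Dropping the last category avoids a redundant column, because its value is implied whenever all the other slots are zero.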

Let's now create a single column with all the features collated together.

In [11]:
labels[0:]

In [12]:
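# Skip labels[0] (the target) and labels[1] (raw BIRTH_PLACE);
# BIRTH_PLACE enters the feature vector through its one-hot encoded column.
featuresCreator = VectorAssembler(
    inputCols=[col[0] for col in labels[2:]] + [encoder.getOutputCol()],
    outputCol='features'
)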

In [13]:
[col[0] for col in labels[2:]] + [encoder.getOutputCol()]
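
Conceptually, the assembler concatenates the listed columns, in order, into one flat vector per row. The sketch below shows that idea in plain Python. The helper `assemble` and the row values are made up for illustration, and only three of the input columns are used to keep it short.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns of one row into a flat list.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector-valued column: append each element in order.
            features.extend(float(v) for v in value)
        else:
            # Scalar column: append the single value.
            features.append(float(value))
    return features


# Made-up example row; BIRTH_PLACE_VEC stands in for the encoder output.
row = {'MOTHER_AGE_YEARS': 29, 'CIG_BEFORE': 0, 'BIRTH_PLACE_VEC': [0.0, 1.0]}
cols = ['MOTHER_AGE_YEARS', 'CIG_BEFORE', 'BIRTH_PLACE_VEC']
print(assemble(row, cols))  # [29.0, 0.0, 0.0, 1.0]
```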

### Create an estimator
- Which machine learning model might be a good start? Why?

In [15]:
from pyspark.ml.classification import LogisticRegression

Once loaded, let's create the model.

In [17]:
logistic = LogisticRegression(
    maxIter=10, 
    regParam=0.01, 
    labelCol='INFANT_ALIVE_AT_REPORT')
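
For intuition about what this estimator learns, a fitted binary logistic regression scores a row by passing a weighted sum of its features through the sigmoid function and comparing the result with a threshold (0.5 by default in Spark). The sketch below is plain Python, not Spark's implementation; the function names, weights and inputs are hypothetical.

```python
import math


def predict_proba(features, weights, intercept):
    """Probability of the positive class under a logistic model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, intercept, threshold=0.5):
    """Predict 1.0 when the probability exceeds the threshold, else 0.0."""
    return 1.0 if predict_proba(features, weights, intercept) > threshold else 0.0


# Hypothetical weights, chosen only to exercise the functions.
weights = [0.8, -1.5]
intercept = 0.2
print(predict_proba([0.0, 0.0], [1.0, 1.0], 0.0))  # 0.5
print(predict([1.0, 0.0], weights, intercept))     # 1.0
print(predict([0.0, 1.0], weights, intercept))     # 0.0
```

The `regParam` setting penalizes large weights during fitting; it does not change how a fitted model scores a row.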

### Create a pipeline

All that is left now is to create a `Pipeline` and fit the model. First, let's load the `Pipeline` from the package.

In [20]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
        encoder, 
        featuresCreator, 
        logistic
    ])

### Fit the model

Conveniently, the `DataFrame` API has a `.randomSplit(...)` method for splitting the data into training and test sets.

In [23]:
births_train, births_test = births.randomSplit([0.7, 0.3], seed=1)
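
The weights passed to `.randomSplit(...)` are target proportions, not exact sizes: each row is assigned to a split at random, so the resulting counts are only approximately 70/30, and the weights are normalized if they do not sum to 1. The plain-Python sketch below illustrates that behaviour under those assumptions. It is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.7, 0.3], seed=1)
print(len(train), len(test))  # roughly 700 and 300, not exactly
```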

Now fit our `pipeline` on the training set and apply the fitted model to the test set.

In [25]:
model = pipeline.fit(births_train)
test_model = model.transform(births_test)

Here's what the `test_model` looks like.

In [27]:
test_model.printSchema()

Let's look only at the columns `prediction`, `INFANT_ALIVE_AT_REPORT` and `features`.

In [29]:
test_model.select('prediction','INFANT_ALIVE_AT_REPORT','features').show(10)

### Model performance

How well did the model do?

In [32]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='probability', 
    labelCol='INFANT_ALIVE_AT_REPORT')

print(evaluator.evaluate(test_model, 
     {evaluator.metricName: 'areaUnderROC'}))
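
As a reading aid for the `areaUnderROC` metric: it can be interpreted as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 is chance level and 1.0 is a perfect ranking. The sketch below computes it that way in plain Python on made-up labels and scores. Spark's evaluator derives the area from the ROC curve instead, so treat this as a conceptual illustration, not its implementation.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('need at least one positive and one negative label')

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 5 of the 6 positive/negative pairs are ordered correctly.
example_labels = [1, 1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = area_under_roc(example_labels, example_scores)
print(round(auc, 4))  # 0.8333
```

Because only the ordering of the scores matters, any strictly increasing transformation of them leaves this value unchanged.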

THE END!

Other things you can do:
  - Run a grid search (`ParamGridBuilder` with `CrossValidator` from `pyspark.ml.tuning`) to find the best hyperparameters for logistic regression
  - Try out another machine learning model, such as a random forest (`RandomForestClassifier`)
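
The grid-search idea in the first bullet amounts to trying every combination of candidate hyperparameter values and keeping the best-scoring one. The plain-Python sketch below only enumerates such a grid; in PySpark, the `pyspark.ml.tuning` module covers both the enumeration and the evaluation. The helper `build_param_grid` and the candidate values are hypothetical.

```python
from itertools import product


def build_param_grid(options):
    """Return one dict per combination of the candidate values."""
    names = sorted(options)
    combos = product(*(options[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Illustrative candidates for the two settings used earlier in this notebook.
grid = build_param_grid({
    'maxIter': [10, 50],
    'regParam': [0.01, 0.1, 1.0],
})
print(len(grid))  # 6
print(grid[0])    # {'maxIter': 10, 'regParam': 0.01}
```

Each dict would then be used to fit and score one candidate model, for example with the area under the ROC curve as the selection metric.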