**2021/22**

# Binary classification
This lecture is about binary classification of discrete data.

Previously, we set up an ML pipeline to work with a linear regression algorithm. Now the focus is on binary classification in a discrete space.

The dataset under consideration is from the domain of financial markets. As expected, we will be using Apache Spark MLlib.

# ML pipelines

In a prior notebook we used pipelines to specify the stages through which data is processed, but at the time we barely went into detail.

As stated in Spark's programming guide, **"ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines."**

Hence, it is possible to combine multiple algorithms into a single pipeline, or workflow. Besides DataFrames, it involves the following:
1. Transformer: an algorithm which can transform one DataFrame into another DataFrame. For example, an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
2. Estimator: an algorithm which can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
3. Pipeline: the way to chain multiple Transformers and Estimators together to specify an ML workflow.
4. Parameter: all Transformers and Estimators share a common API for specifying parameters.

Further details can be found at http://spark.apache.org/docs/latest/ml-pipeline.html
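
To make the Transformer/Estimator/Pipeline contract concrete, here is a toy sketch in plain Python. The classes (`AddRatio`, `MeanThreshold`, `ThresholdModel`, `ToyPipeline`) and the helper `run` are hypothetical and are not Spark's API; rows are plain dictionaries standing in for a DataFrame.

```python
# Toy illustration of the Estimator / Transformer / Pipeline contract.
# These classes are hypothetical and are NOT Spark's API; rows are plain dicts.

class AddRatio:
    """Transformer: rule-based, nothing is learned from the data."""
    def transform(self, rows):
        return [{**r, "ratio": r["a"] / r["b"]} for r in rows]

class ThresholdModel:
    """Transformer produced by fitting: appends a prediction column."""
    def __init__(self, threshold):
        self.threshold = threshold
    def transform(self, rows):
        return [{**r, "prediction": int(r["ratio"] > self.threshold)} for r in rows]

class MeanThreshold:
    """Estimator: fit() learns a parameter and returns a Transformer."""
    def fit(self, rows):
        mean = sum(r["ratio"] for r in rows) / len(rows)
        return ThresholdModel(mean)

class ToyPipeline:
    """Chains stages; fitting it yields a list holding Transformers only."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # Estimator -> Transformer
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return fitted

def run(fitted, rows):
    """Apply every fitted stage, in order, to new rows."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

train = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 2.0}]
model = ToyPipeline([AddRatio(), MeanThreshold()]).fit(train)
predictions = run(model, [{"a": 5.0, "b": 2.0}])
```

The point of the sketch is the shape of the workflow: fitting turns every Estimator into a Transformer, so the fitted result can be applied to unseen rows as is.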



Recall that, in general, a typical ML workflow is designed to work as depicted below:

![ml-pipeline.png](attachment:ml-pipeline.png)

# Problem formulation

This exercise is about stock market prediction in the financial industry. Our case study and dataset are based on a past Kaggle competition, whose details can be found at https://www.kaggle.com/c/jane-street-market-prediction .

Obviously, we are not going to create a complex quantitative trading system. We will work on just a small part of it: deciding whether a particular trade proposal is accepted (and so traded) or not. In the end, it is a binary classification problem.

Basically, the functional requirements for the Spark program we are about to create are as follows:
1. To load the dataset under analysis and make sure it can be further processed by an ML classifier.
2. To create a classification model, supported by an SVM algorithm, that is fit for purpose.
3. For each day, to compute the daily score as the sum of the product **weight x resp** (see schema below) over the trades the system opts to make on that day. The data processed for this purpose must be different from the data used to create the model.
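
As a worked example of the daily score in requirement 3, the plain-Python sketch below sums `weight * resp` over the trades that were actually made on each day. The trade values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict

# Hypothetical trades: (date, weight, resp, action); action is 1 (trade) or 0 (pass).
trades = [
    (0, 2.0,  0.5,  1),
    (0, 1.0, -1.0,  0),  # passed: contributes nothing
    (0, 2.0, -0.25, 1),  # traded at a loss
    (1, 0.0,  3.0,  1),  # zero weight: contributes nothing
    (1, 8.0,  0.25, 1),
]

# Daily score: sum of weight * resp over the trades made on that day.
daily_score = defaultdict(float)
for date, weight, resp, action in trades:
    daily_score[date] += weight * resp * action

for date in sorted(daily_score):
    print(date, daily_score[date])
```

Multiplying by the 0/1 action is what restricts the sum to the trades the system opted to make.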

You can find the dataset file JaneStreetMarket.csv.zip at:


https://iscteiul365.sharepoint.com/:u:/s/G365_UC_AlgoritmosparaBigData/Ebj3nk6DLhZHpuoZFEeptqcB6wovOyeUgi72o7QLC-kBeQ?e=GHSn4C


**Context**

Financial markets are very complex. In such a fast-moving environment, electronic trading allows thousands of transactions to occur within a fraction of a second, thereby providing many opportunities to find and take advantage of price differences in real time.
In an efficient market, buyers and sellers would have all the information they need to make rational decisions. As a result, products would remain at fair values, neither undervalued nor overpriced. But in the real world, markets do not work like that.

Developing trading strategies to identify and take advantage of inefficiencies is challenging. Even if a strategy is profitable at some point in time, it may not be in the future, and market volatility makes it impossible to predict the profitability of any given trade with certainty. It is hard to distinguish good luck from a good trading decision.

Hence, the goal is to build a quantitative trading model to maximize returns using real-time market data. Once the model faces trading opportunities electronically, it must decide whether to accept or reject them.
The whole building process includes testing its predictiveness against future market returns, as well as checking it against historical data, the so-called backtesting. There are a few more notes worth considering, namely:

- The development of good models is very challenging. For example, (i) we may collect too much noise from the market data, (ii) there will be redundancy in the information collected, (iii) we may experience strong feature correlations, (iv) it is difficult to establish a proper mathematical formulation, etc.
- A highly predictive model, which selects the right trades to execute, will contribute to pushing prices closer to fair values, as it sends correct messages to the market.
- As more people use this kind of strategy, even the slightest advantage gained from having good models will pay off.


**Market data**


The dataset we are going to use relates to the Kaggle competition mentioned above. It contains an anonymized set of features, feature_{0...129}, representing real stock market data features. For example, trading volume and volatility in various time horizons, statistical indicators like *Relative Strength Index (RSI)*, etc.

Each row in the dataset represents a trading opportunity, for which the system will be predicting an action value: 1 to make the trade or 0 to pass on it.
For each trade, there is an associated `weight` and `resp`: 
- The `weight` is the importance of the trade, for example a ratio of transaction cost or, in other words, the capital invested in the trade.
- The `resp` represents the return of the trade. The dataset also holds additional returns (`resp_1` to `resp_4`) over specific but undisclosed time horizons, but what matters the most is `resp`.

The data also includes the day of the trade (as a number) and a value *ts_id* representing time ordering.


The **data schema** of the dataset is the following:

|Column     | Type| Description |
|:---:|:---:| :---:| 
| **date** |Integer| Day of the trade|
| **weight** | Double | The importance of the trade. When 0, it does not contribute to the score used to evaluate the model, but it is included for the purpose of completeness |
| **resp_1** | Double | Value related to returns over time horizon 1 |
| **resp_2** | Double | Value related to returns over time horizon 2 |
| **resp_3**  | Double | Value related to returns over time horizon 3 |
| **resp_4** | Double | Value related to returns over time horizon 4 |
| **resp** | Double | Return of the trade |
| **feature_0** | Double  | Value of anonymized feature 0 |
| ... |  ... | Columns with features 1 to 128 |
| **feature_129** | Double |  Value of anonymized feature 129 |
| **ts_id** | Integer |  Time ordering |



In [None]:
# If we need to install some packages, e.g. matplotlib

# ! pip3 install matplotlib
# ! pip3 install seaborn

In [None]:
# Some imports 

import os 

import numpy as np 
import pandas as pd  
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Useful visualization functions

Some functions that we can use to plot data, once it has been converted to pandas DataFrames.

**Disclaimer**: these functions are broadly distributed among users. Further adjustments are needed and/or advisable. Feel free to use your own plotting functions.

In [None]:
def plotHistogram(df, xcol, huecol):
    sns.histplot(data=df, x=xcol, hue=huecol, multiple="stack")

In [None]:
def plot(df, xcol, ycol):
    sns.lineplot(data=df, x=xcol, y=ycol)

In [None]:
def plotScatter(df, xcol, ycol, huecol):
    sns.scatterplot(data=df, x=xcol, y=ycol, hue=huecol)

In [None]:
def plotScatterMatrix(df, huecol):
    sns.pairplot(data=df, hue=huecol)

In [None]:
def plotCorrelationMatrix1(df):
    # compute the correlation matrix
    corr = df.corr()

    # generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
def plotCorrelationMatrix2(df):
    # compute a correlation matrix and convert to long-form
    corr_mat = df.corr().stack().reset_index(name="correlation")
    # draw each cell as a scatter point with varying size and color
    g = sns.relplot(
        data=corr_mat,
        x="level_0", y="level_1", hue="correlation", size="correlation",
        palette="vlag", hue_norm=(-1, 1), edgecolor=".7",
        height=10, sizes=(50, 250), size_norm=(-.2, .8),
    )

    # tweak the figure to finalize
    g.set(xlabel="", ylabel="", aspect="equal")
    g.despine(left=True, bottom=True)
    g.ax.margins(.02)
    for label in g.ax.get_xticklabels():
        label.set_rotation(90)
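    # legend handles: `legend_handles` in recent matplotlib, `legendHandles` in older versions
    handles = getattr(g.legend, "legend_handles", None)
    if handles is None:
        handles = g.legend.legendHandles
    for artist in handles:
        artist.set_edgecolor(".7")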

# Collect and label data

## Data ingestion

In [None]:
! pwd 
! ls -la

In [None]:
! head -n 2 JaneStreetMarket.csv
! tail -n 2 JaneStreetMarket.csv

In [None]:
# some Spark related imports we will use hereafter

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

from pyspark.ml import Pipeline
from pyspark.ml.stat import Correlation, ChiSquareTest
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
# Build a SparkSession instance if one does not exist. Notice that we can only have one per JVM

spark = SparkSession\
    .builder\
    .appName("JaneStreetMarket")\
    .config("spark.sql.shuffle.partitions",6)\
    .getOrCreate()

In [None]:
# Read the dataset 

df_raw = spark.read.csv("JaneStreetMarket.csv", header="true", inferSchema="true", sep=',')

## Columns to keep

In [None]:
# check the data - schema, show and count

df_raw.printSchema()
df_raw.show(5)
df_raw.count()

There seems to be no reason to drop any column.

# Evaluate data

Let us get some data insight, with some exploratory data analysis based on descriptive statistics and visualizations.

In [None]:
# Check some column statistics, one by one.
# Leave aside those that are named like "feature_nn"

for cl in df_raw.columns:
    if 'feature' not in cl:
        df_raw.describe(cl).show()


Following previous understanding, all collected data should be considered as of interest.

# Feature Engineering

Now we have to prepare data in a way that it can be properly used by ML algorithms, which includes selection and extraction of features, as well as dealing with poor data quality if that is the case.

As we can see, all columns are numeric. Furthermore, data types are OK.

## Data cleansing

We will look at
* Nulls
* Extreme values e.g. outliers

In [None]:
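# Nulls: if needed, the brute-force solution: remove rows where at least one of the columns is null or NaN value
# (df_no_nulls is a name of our own choice, used only for this check)

df_no_nulls = df_raw.dropna()

# checking the sizes, just to make sure we can move on

[df_raw.count(), df_no_nulls.count()]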

In [None]:
# Outliers: for that, we use summary(), one column by one, except those starting with feature_ as above

for cl in df_raw.columns:
    if 'feature' not in cl:
        df_raw.select(cl).summary().show()

No nulls. Fine. And in respect to outliers, likewise there are no problems.

## Saving clean data

In [None]:
# As we have a large dataset, we should also have a smaller one,
# just for the purpose of working locally, maybe starting 
# from the beginning (as it is sorted by timing)

small_num_rows = 300000 # out of 2 390 491 
df_small = df_raw.orderBy("ts_id").limit(small_num_rows)

In [None]:
# Save the smaller version to a file for future use in case of need

df_small.write.mode("overwrite").parquet("small-clean-janestreetmarket")

# and later on, we can use spark.read.parquet() to load files

In [None]:
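# Also, although no changes were made to the initial dataset, 
# it may be convenient to have the original data stored in 
# the parquet format as well. It will be the normal dataset
# (the directory name "clean-janestreetmarket" is our own choice)

df_raw.write.mode("overwrite").parquet("clean-janestreetmarket")

In [None]:
# Check in the running directory if that was accomplished

! ls -la

In [None]:
! ls -la small-clean-janestreetmarket

## Data to be used hereafter

In [None]:
# df_clean = spark.read.parquet("clean-janestreetmarket")  # the normal (full) dataset
df_clean = spark.read.parquet("small-clean-janestreetmarket")  # the smaller one, to work locally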

In [None]:
# Delete memory consuming variables that are no longer needed

del df_raw, df_small

## Final  overview
After establishing the clean data to be used, let us get an overview about what we have achieved, with some statistics and visualizations.

In [None]:
# Particular columns to check

cls1 = ["resp_1", "resp_2", "resp_3", "resp_4", "resp"]
cls2 = ["date", "weight", "resp_1", "resp_2", "resp_3", "resp_4", "resp", "ts_id" ] # non feature_nn


### Descriptive statistics

In [None]:
# Describe

df_clean.describe(cls1).show()
df_clean.describe(cls2).show()

In [None]:
# Summary

for cl in cls1:
    df_clean.select(cl).summary().show()

for cl in cls2:
    df_clean.select(cl).summary().show()

### Correlations

In [None]:
# check some columns e.g. resp vs weight; 
# Correlation needs vectors so we convert to vector column first

# the columns to compute correlations
#cols_corr = df_clean.columns  # or specific ones
cols_corr = cls2

vector_col = "corr_features"
assembler = VectorAssembler(inputCols=cols_corr, outputCol=vector_col)
df_vector = assembler.transform(df_clean).select(vector_col)

# get correlation matrix - it can be Pearson’s (default) or Spearman’s correlation

# corr = Correlation.corr(df_vector, vector_col).head()
# print("Pearson correlation matrix:\n" + str(corr[0]))

# corr = Correlation.corr(df_vector, vector_col, "spearman").head()
# print("Spearman correlation matrix:\n" + str(corr[0]))

corrmatrix = Correlation.corr(df_vector, vector_col).collect()[0][0].toArray().tolist()
corrmatrix
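
To illustrate the difference between the Pearson and Spearman options mentioned in the cell above, here is a small plain-Python sketch on hypothetical data. The helper functions `pearson`, `ranks` and `spearman` are written here for illustration only (they are not Spark calls), and `ranks` assumes there are no ties.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: measures linear association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank 1 = smallest value; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]  # monotonic but non-linear relation

print("Pearson :", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Because `y` always grows with `x`, the rank-based Spearman value is (numerically) 1, while Pearson stays below 1 since the relation is not a straight line.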

In [None]:
# Just for visualization purposes, convert to Pandas 

df_plot = pd.DataFrame(data = corrmatrix, index=cols_corr, columns=cols_corr)

In [None]:
# a plot 
plotCorrelationMatrix1(df_plot)

### Daily averages
To put into context, it also seems sensible to look at daily averages, since in most of the days the intra-day behaviour does not change that much, relatively speaking.

In [None]:
# Trading days, sorted

df_clean.select("date").distinct().orderBy("date").show()
df_clean.select("date").distinct().count()

In [None]:
# Check average values by day 

df_clean_daily = df_clean.groupBy("date").mean() 
df_clean_daily.columns

In [None]:
df_clean_daily.count()

In [None]:
df_plot = df_clean_daily.toPandas()

In [None]:
# some plots

plot(df_plot, 'date', 'avg(weight)')

## Columns selection

It is time to start thinking about which columns to use in the model, whether existing or newly derived ones. The better we understand what the business is all about, including the characteristics of the data we were given, the better our choices will be. The statistics computed so far, and others still to be computed, help to figure out patterns of interest. (Recall that the features in this problem are anonymized.)

However, in the context of this problem we will make straightforward decisions, leaving a thorough business analysis to the experts. At the end of the day, there is a score to pursue, as pointed out in the problem formulation:

    For each day, to compute the daily score as the sum of the product `weight * resp` (see schema above) of the trades the system opts to make on that day. The data to be processed shall be different from the one used to create the model.

Notice: MLlib provides a set of tools to help tackle this issue of features. See http://spark.apache.org/docs/latest/ml-features.html . 
But we leave it for another exercise.

In [None]:
# All columns are numeric

cols = df_clean.columns

cols_features_nn = [cl for cl in cols if 'feature' in cl]

cols_non_feature_nn = ["date", "weight", "resp_1", "resp_2", "resp_3", "resp_4", "resp", "ts_id" ]
                     
cols_non_features_nn = [cl for cl in cols if cl not in cols_features_nn]

[cols_non_features_nn, len(cols_features_nn)]

In [None]:
# Columns as features - let us use cols_features_nn

features = cols_features_nn

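# Set the label, the target to predict: we should trade when weight * resp is positive 

df_preprocessed = df_clean.withColumn("label", when(expr("weight * resp > 0"), 1).otherwise(0))

df_preprocessed.select("date", "weight", "resp", "label").show(10)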

# Select and train model

Now it is time to train and test a model to be used for binary classification, that is, to decide whether to trade or not.

We are going to use a Linear Support Vector Machine algorithm, as presented in
http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine . 

But before going further, probably it is worth having a look at both the supervised learning and the ML pipeline slides from the lectures.

## Train/test split

Creating an ML model means we should keep some part of the data in the dark, that is, unseen during training.
A standard split is 80/20 (or 70/30 if the dataset is really large).

Recall that a model may memorize the training data and so reach an overfitting situation. If the test part is too small, there is not enough data left to evaluate how well the model generalizes to unseen data, so overfitting may go unnoticed.
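
The idea of a reproducible random split can be sketched in plain Python as follows. The function `random_split` is a hypothetical helper, not Spark code; in Spark the corresponding call is `DataFrame.randomSplit`, which takes a list of weights and an optional seed.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train with probability train_fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

# Same seed, same split; the sizes are only approximately 80/20.
print(len(train_a), len(test_a), train_a == train_b)
```

Fixing the seed is what makes the experiment repeatable: anyone rerunning the split on the same rows gets the same training and test parts.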


In [None]:
# train/test split

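# (the seed value is an arbitrary choice, set only to make the split reproducible)
df_train, df_test = df_preprocessed.randomSplit([0.8, 0.2], seed=42)

# caching data ... but just the training part
df_train.cache()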

# print the number of rows in each part
print(f"There are {df_train.count()} rows in the training set and {df_test.count()} in the test set.")

**Notice** 

As we did with clean data, we may consider storing the data split into files, should we want to use it elsewhere. 

This relates to the need to guarantee uniqueness of the split in another experimental environment.

We leave it as it is for now.

## Features transformation to vector

ML algorithms require that all input features are contained within a single vector. Therefore we need a transformation, so we use the `VectorAssembler` transformer.

Transformers accept a DataFrame as input and return a new one with one or more columns appended to it, using a `transform()` method following rule-based transformations - they do not learn from the data. Notice that they are lazily evaluated.
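
What "assembling" means can be shown with a toy plain-Python sketch. The function `assemble` and the rows below are hypothetical (this is not Spark code): each row keeps its original columns and gains one extra column holding all the input features together.

```python
# Hypothetical rows with three feature columns, as plain dicts (not Spark).
rows = [
    {"feature_0": 1.0, "feature_1": -0.5, "feature_2": 2.0, "label": 1},
    {"feature_0": 0.0, "feature_1": 0.25, "feature_2": -1.0, "label": 0},
]
input_cols = ["feature_0", "feature_1", "feature_2"]

def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns gathered in one list."""
    return {**row, output_col: [row[c] for c in input_cols]}

assembled = [assemble(r, input_cols) for r in rows]
print(assembled[0]["features"])
```

Note that the transformation is purely rule-based: nothing is learned from the data, and the original rows are left untouched.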

In [None]:
# Put all input features into a single vector, by using a transformer

vec_assembler = VectorAssembler(inputCols=features, outputCol="features") 


## Linear SVM model

Once we have a vector assembled with the features in place, then we can use the `Linear SVM` estimator (the algorithm) to learn from the training data and consequently to build the model. 
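
For intuition, a linear SVM scores each sample with a linear function of its features and classifies by the sign of that score; Spark documents `LinearSVC` as optimizing the hinge loss. The plain-Python sketch below illustrates both ideas with hypothetical weights and samples. It is not how Spark trains the model, only what the decision function and the loss look like.

```python
def margin(w, b, x):
    """Signed score w.x + b; its sign decides the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class 1 when the score is positive, class 0 otherwise."""
    return 1 if margin(w, b, x) > 0 else 0

def hinge_loss(w, b, x, label):
    """Hinge loss, with the 0/1 label mapped to -1/+1."""
    y = 2 * label - 1
    return max(0.0, 1.0 - y * margin(w, b, x))

# Hypothetical weights and samples, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
confident = [2.0, 1.0]    # score 3.5: far from the decision boundary
borderline = [0.0, 0.25]  # score 0.25: close to the decision boundary

print(predict(w, b, confident), hinge_loss(w, b, confident, 1))
print(predict(w, b, borderline), hinge_loss(w, b, borderline, 1))
```

Both samples are classified as 1, but only the confident one has zero loss: the hinge loss also penalizes correct predictions that fall too close to the boundary.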

In [None]:
# Linear SVC algorithm
# default: featuresCol='features', labelCol='label', predictionCol='prediction'

lsvc = LinearSVC(maxIter=10, regParam=0.1)


## ML pipeline configuration

In [None]:
# The pipeline holds two stages set above: 
#  1. vec_assembler (related to features) 
#  2. lsvc (related to ML estimator)

pipeline = Pipeline(stages=[vec_assembler, lsvc])


## Model fitting
Get the model (as transformer) by fitting the pipeline to training data.

In [None]:
pipeline_model = pipeline.fit(df_train)

# Evaluate model

Let us evaluate the Linear SVM model.

## Testing the model

It is time to apply the model built to test data. Again, we will use the pipeline set above, meaning the stages already specified will be reused. Notice that, since the pipeline model is a transformer, we can easily apply it to test data.

In [None]:
# Make predictions on test data and show values of columns of interest

df_prediction = pipeline_model.transform(df_test)

# Check its schema

df_prediction.printSchema()

In [None]:
# Columns of concern

df_prediction.select("features", "label", "rawPrediction", "prediction").show(10)

## Evaluation metrics

How right was the model? Let us figure out using:
1. Evaluators
2. Confusion matrix


In [None]:
# Compute evaluation metrics on testing data

prediction_label = df_prediction.select("rawPrediction", "label")  

# supports metricName="areaUnderROC" (default) and "areaUnderPR"
# the ROC curve relates sensitivity (TP rate) to the FP rate (1 - specificity)

evaluator = BinaryClassificationEvaluator()

print("areaUnderROC = " + str(evaluator.evaluate(prediction_label)))
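
The area under the ROC curve has a handy interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on hypothetical scores. The function `auc` is written here for illustration and is not the evaluator's implementation.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical raw scores (e.g. SVM margins) and true labels.
scores = [2.1, 0.4, -0.3, 1.2, -1.5, 0.4]
labels = [1,   1,    0,   0,    0,   0]

print("AUC =", auc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives.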


Recalling the confusion matrix:

- True Positive: the prediction was positive and it is true. (Great, we trade and score)
- True Negative: the prediction was negative and it is true. (Fine, we do not trade and so avoiding a loss) 
- False Positive: the prediction was positive and it is false. (Terrible, we trade and have a loss)
- False Negative: the prediction was negative and it is false. (Not so good, we do not trade and therefore missing to score)
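
The four cases above can be counted directly from pairs of (label, prediction). The sketch below does so in plain Python on hypothetical values; `confusion_counts` is a helper name introduced here for illustration.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count TP, TN, FP and FN for 0/1 labels and predictions."""
    names = {(1, 1): "TP", (0, 0): "TN", (0, 1): "FP", (1, 0): "FN"}
    return Counter(names[(l, p)] for l, p in zip(labels, predictions))

# Hypothetical labels and predictions for eight trades.
labels      = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(labels, predictions)
print(counts["TP"], counts["TN"], counts["FP"], counts["FN"])
```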


 ![confusion_matrix.png](attachment:confusion_matrix.png)

The TP, TN, FP and FN computations follow.

In [None]:
# counting rows for each case TP, TN, FP and FN respectively

n = df_prediction.count()
tp = df_prediction.filter(expr("prediction > 0") & expr("label == prediction")).count()
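tn = df_prediction.filter(expr("prediction == 0") & expr("label == prediction")).count()
fp = df_prediction.filter(expr("prediction > 0") & expr("label != prediction")).count()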
fn = n - tp - tn - fp
[tp, tn, fp, fn, n]

**Accuracy** = (TP + TN) / (TP + TN + FP + FN)

How often is the classifier correct? (score)

A widely used metric, but not so useful when there are many TN cases.
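
A small worked example, with hypothetical counts, shows why accuracy can mislead when negatives dominate: a classifier that never trades still looks very accurate.

```python
# Hypothetical imbalanced test set: 95 negatives, 5 positives,
# and a classifier that always predicts "do not trade" (0).
tp, fn = 0, 5
tn, fp = 95, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print("Accuracy =", accuracy)
print("Recall   =", recall)
```

The accuracy is high only because of the many TN cases, while the recall shows that no profitable trade was ever taken.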

In [None]:
accuracy = (tp + tn) / n

**Precision** = TP / (TP + FP)

Positive predictive value: the proportion of positive predictions that were correctly identified.

It removes TN and FN from consideration.

In [None]:
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

**Recall** = TP / (TP + FN)

True positive rate. (hit rate, sensitivity)

In [None]:
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

**Specificity** = TN / (TN + FP)

True negative rate. (selectivity)

In [None]:
specifity = tn / (tn + fp) if (tn + fp) > 0 else 0.0

**F1 score** = 2 * Recall * Precision / (Recall + Precision)

A useful metric because it is difficult to compare two models when one has low precision and high recall and the other the opposite. 
By combining recall and precision into a single number, it allows us to assess both at once.
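
The sketch below, with hypothetical precision and recall values, shows how the F1 score (a harmonic mean) penalizes a lopsided model even when the arithmetic mean of precision and recall is the same.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Two hypothetical models with the same arithmetic mean (0.5) of precision and recall.
balanced = f1(0.5, 0.5)
lopsided = f1(0.9, 0.1)

print("balanced:", balanced)
print("lopsided:", lopsided)
```

The balanced model scores clearly higher, so a good F1 score requires both precision and recall to be reasonably high.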


In [None]:
f1_score = 2 * recall * precision / (recall + precision) if (recall + precision) > 0 else 0.0

In [None]:
# Confusion matrix conclusions

print("TP = {}, TN = {}, FP = {}, FN = {}, Total = {}".format(tp, tn, fp, fn, n)) 
print("Accuracy = {}".format(accuracy))
print("Precision = {}".format(precision))
print("Recall = {}".format(recall))
print("Specificity = {}".format(specifity))
print("F1 score = {}".format(f1_score))

## Visual analysis
Plotting `label` versus `prediction` obtained above.

In [None]:
# plots

cols_to_plot = ["date", "label", "prediction"]
df_plot = df_prediction.select(cols_to_plot).limit(100000).toPandas()

In [None]:
plotScatter(df_plot, "date", "prediction", "label")

## Back to business: the daily score


In [None]:
# Compute the daily score,
# based on the trades made in that day, as  weight * resp

df_score = df_prediction.withColumn("score", expr("weight * resp * prediction")) 

gdf = df_score.groupBy(df_score.date)
df_daily_score = gdf.agg({"score": "sum"}).sort("date") 

trading_days = df_daily_score.count()
acumulative_score = df_daily_score.agg({"sum(score)": "sum"}).collect()[0][0]

df_daily_score.show()
print("For the {} trading days, the accumulative score is {}".format(trading_days, acumulative_score))

## Saving the pipeline

In [None]:
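# We can save the pipeline for further use should it be required
# (overwrite() avoids an error if the directory already exists)

pipeline.write().overwrite().save("pipeline-LinearSVM")

# later on, it can be loaded anywhere with Pipeline.load("pipeline-LinearSVM")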


In [None]:
! ls -la

In [None]:
! ls -la pipeline-LinearSVM

# Tune model

We should improve the model. For example, we can think about:
- How can we interpret the scores above?
- Would a model with a different set of features and/or target engineering perform better? 
- And what about using real-time data, that is, data used neither for training nor for testing?

See the exercise below.

# Additional exercise

Once this exercise has been completed, create a new notebook with similar implementation but using
the following classifiers instead:
1. Logistic Regression 
2. Decision Tree

Also, try to improve the process of feature/target engineering.

See related information in:
http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression

http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

# References

* Learning Spark - Lightning-Fast Data Analytics, 2nd Ed. J. Damji, B. Wenig, T. Das, and D. Lee. O'Reilly, 2020
* Spark: The Definitive Guide - Big Data Processing Made Simple, 1st Ed. B. Chambers and M. Zaharia. O'Reilly, 2018
* http://spark.apache.org/docs/latest/ml-guide.html
* https://docs.python.org/3/ 
* https://www.kaggle.com/c/jane-street-market-prediction
