### Data challenge
##### Concepts: PySpark and Spark MLlib, class imbalance


In [1]:
from pyspark.sql import SQLContext
from pyspark import SparkContext

In [2]:
# set up Spark context
sc=SparkContext("local","hw")
sqlContext=SQLContext(sc)

Read the data into a dataframe:

In [3]:
df=sqlContext.read.parquet('./dataset.parquet/*parquet')

#### Data cleaning discussion

Below are some quick explorations: number of columns, data types, number of rows and descriptive statistics for an example feature.

If I were to do data cleaning, I would look for the following:
* NaN values: for a categorical feature, I would make NaN its own category; for a continuous feature, I would set the value to zero and add a binary indicator feature that is True where the original value was NaN and False otherwise
* correct data types in each column
* remove strings (e.g. dollar signs) from numerical variables
* make sure dates and times are in the correct format, if applicable
* categorical variables with too many levels (maybe more than 5-10, depending on the application) would be treated as continuous variables
* one-hot encode categorical variables, unless the algorithm doesn't require it (most do, however)
* correct feature labels, no typos
* balanced classes (check if classes need to be combined, if they are very similar)
* tidy data (every observation has one and only one row)
* outliers: check whether they are physically possible; if they are, typically keep them
* double check for duplicate observations
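
The NaN handling described in the first bullet can be sketched in plain Python, without Spark. This is only an illustration of the idea: the helper names `zero_fill_with_indicator` and `nan_as_category`, the `"missing"` label, and the sample values are all hypothetical and not part of this notebook.

```python
import math

def zero_fill_with_indicator(values):
    """Continuous feature: replace missing values with 0.0 and
    return a parallel list of booleans marking where they were."""
    filled, was_missing = [], []
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        filled.append(0.0 if missing else float(v))
        was_missing.append(missing)
    return filled, was_missing

def nan_as_category(values, missing_label="missing"):
    """Categorical feature: treat a missing value as its own level."""
    return [missing_label if v is None else v for v in values]

# Hypothetical example columns (not taken from the dataset)
amounts = [12.5, float("nan"), 3.0, None]
colors = ["red", None, "blue", "red"]

filled, was_missing = zero_fill_with_indicator(amounts)
levels = nan_as_category(colors)
print(filled, was_missing, levels)
```

The indicator column lets a model distinguish a true zero from a value that was missing in the raw data.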

#### Quick data exploration

In [4]:
df.printSchema()

root
 |-- label: integer (nullable = true)
 |-- f1: long (nullable = true)
 |-- f2: long (nullable = true)
 |-- f3: long (nullable = true)
 |-- f4: long (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: long (nullable = true)
 |-- f9: long (nullable = true)
 |-- f10: long (nullable = true)
 |-- f11: long (nullable = true)
 |-- f12: long (nullable = true)
 |-- f13: long (nullable = true)
 |-- f14: long (nullable = true)
 |-- f15: long (nullable = true)
 |-- f16: long (nullable = true)
 |-- f17: long (nullable = true)
 |-- f18: long (nullable = true)
 |-- f19: long (nullable = true)
 |-- f20: long (nullable = true)
 |-- f21: long (nullable = true)
 |-- f22: long (nullable = true)
 |-- f23: long (nullable = true)
 |-- f24: integer (nullable = true)
 |-- f25: long (nullable = true)
 |-- f26: long (nullable = true)
 |-- f27: long (nullable = true)
 |-- f28: long (nullable = true)
 |-- f29: long (nullable = true)


In [5]:
df.count()

6883958

In [6]:
df.describe('f502').show()

+-------+--------------------+
|summary|                f502|
+-------+--------------------+
|  count|             6883958|
|   mean|0.001443065166870...|
| stddev| 0.03796028107271428|
|    min|                   0|
|    max|                   1|
+-------+--------------------+



I initially chose a Random Forest algorithm, as it often works quite well out of the box for many problems. However, on my first pass through this dataset with a RF algorithm, I didn't notice the huge class imbalance shown below, and the resulting F-score left much to be desired.

In [7]:
df.groupBy('label').count().show()

+-----+-------+
|label|  count|
+-----+-------+
|    1|    368|
|    0|6883590|
+-----+-------+
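
To put the imbalance in numbers, the short calculation below uses only the two counts from the table above. It shows how rare label 1 is, and what accuracy a trivial model that always predicts 0 would reach.

```python
# Counts copied from the groupBy('label') output above
n_pos = 368
n_neg = 6883590
n_total = n_pos + n_neg

pos_fraction = n_pos / n_total        # share of rows with label 1
neg_per_pos = n_neg / n_pos           # label 0 rows per label 1 row
baseline_accuracy = n_neg / n_total   # accuracy of always predicting 0

print(f"total rows:        {n_total}")
print(f"positive fraction: {pos_fraction:.6%}")
print(f"negatives per positive: {neg_per_pos:.0f}")
print(f"always-0 accuracy: {baseline_accuracy:.6f}")
```

With roughly 18,700 rows of class 0 for every row of class 1, a model can score very well on frequency-weighted metrics while learning almost nothing about class 1.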



#### Undersampling for class balancing

I decided to heavily undersample class 0 and use the resulting dataset to train the Random Forest.

In [10]:
zero_label=df.where(df['label']==0).rdd.takeSample(False, 368, seed=12)

In [11]:
zerodf=sqlContext.createDataFrame(zero_label)

In [12]:
onedf=df.where(df['label']==1)

In [13]:
df_bal=zerodf.union(onedf)
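
The same undersampling idea, stripped of Spark, looks roughly like the sketch below. It is a plain-Python illustration rather than the code used in this notebook: the `undersample` helper and the toy rows are hypothetical, and only the seed value 12 is borrowed from the cell above.

```python
import random

def undersample(rows, label_key="label", minority=1, seed=12):
    """Keep every minority row and an equally sized random sample
    (without replacement) of the remaining rows."""
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = rng.sample(majority_rows, k=len(minority_rows))
    return sampled + minority_rows

# Hypothetical toy data: 5 rows of class 1, 1000 rows of class 0
rows = [{"label": 1, "f1": i} for i in range(5)]
rows += [{"label": 0, "f1": i} for i in range(1000)]

balanced = undersample(rows)
n_ones = sum(r["label"] == 1 for r in balanced)
n_zeros = sum(r["label"] == 0 for r in balanced)
print(n_ones, n_zeros)
```

Deriving the sample size from the minority count, rather than hard-coding it, keeps the two classes balanced if the data changes.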

#### Model training

Here, I pipelined a VectorAssembler and a RF classifier together and trained the model on the 736-row balanced dataframe (368 rows per class).

In [14]:
from pyspark.ml.feature import VectorAssembler

In [15]:
ignore=['label']
assembler = VectorAssembler(
        inputCols=[x for x in df.columns if x not in ignore],
        outputCol='features')

In [16]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

In [17]:
rf=RandomForestClassifier(featuresCol='features')

In [18]:
pipeline=Pipeline(stages=[assembler, rf])

In [19]:
model=pipeline.fit(df_bal)

In [20]:
model.save('rf.model')

#### Prediction

I predicted labels for the entire dataset here:

In [22]:
pred=model.transform(df)

In [23]:
pred.select("prediction", "label", "features").show(5)

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|    0|(502,[22,24,26,29...|
|       0.0|    0|(502,[22,100,368,...|
|       0.0|    0|(502,[22,100,368,...|
|       0.0|    0|(502,[22,100,340,...|
|       0.0|    0|(502,[22,100,340,...|
+----------+-----+--------------------+
only showing top 5 rows



Finally, I evaluated these predictions according to their F1-score, as seen below:

In [24]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [25]:
evaluator=MulticlassClassificationEvaluator(metricName='f1')

In [26]:
f1=evaluator.evaluate(pred)

In [27]:
f1

0.9447626111943398
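
A note on reading this number: the f1 metric of MulticlassClassificationEvaluator is an average of the per-class F1 scores weighted by class frequency, so on data this imbalanced it is dominated by class 0. The sketch below computes per-class and weighted F1 from a binary confusion matrix. The counts are hypothetical, chosen only to illustrate the effect; they are not this model's confusion matrix, and `f1_report` is an illustrative helper.

```python
def f1_report(tp, fp, fn, tn):
    """Per-class and support-weighted F1 for a binary problem.
    tp/fp/fn/tn are counted with class 1 as the positive class."""
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    f1_pos = f1(tp, fp, fn)   # class 1 treated as positive
    f1_neg = f1(tn, fn, fp)   # class 0 treated as positive
    n_pos, n_neg = tp + fn, tn + fp
    weighted = (f1_pos * n_pos + f1_neg * n_neg) / (n_pos + n_neg)
    return {"f1_class_1": f1_pos, "f1_class_0": f1_neg, "weighted_f1": weighted}

# Hypothetical counts: 100 true positives-class rows, 1,000,000 negatives
report = f1_report(tp=90, fp=50_000, fn=10, tn=950_000)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In this made-up case the weighted F1 is high even though the F1 for class 1 is close to zero, so per-class precision and recall for label 1 are worth checking alongside the weighted score.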

An F1 score of roughly 0.94 is quite good (much better than training on the unbalanced classes, where my F1 score was closer to 0.01). Since the score was quite high, I decided not to try any further algorithms. To assess feature importances, the PySpark Random Forest model exposes them directly through its featureImportances attribute:

#### Feature importance and discussion

In [38]:
featimp=model.stages[-1].featureImportances.toArray()

In [40]:
# top feature importances
featimp[featimp>0.05]

array([ 0.05568427,  0.06748862,  0.12727115,  0.0515309 ,  0.05874978,
        0.05186435,  0.05402104])

In [44]:
# the indices of the top feature importances
import numpy as np
np.argwhere(featimp>0.05)

array([[  0],
       [  1],
       [  3],
       [  5],
       [ 22],
       [ 26],
       [479]])

Seven features have an importance greater than 0.05; their importances and column indices are shown above.
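
The indices above are positions in the assembled feature vector, which follow the order of the VectorAssembler's inputCols. A minimal sketch of mapping them back to column names and ranking them is shown below. It assumes the input columns are named f1 through f502 in that order; in the notebook the list would instead come from `[x for x in df.columns if x not in ignore]`.

```python
# Assumption: assembler input columns are f1 ... f502, in order
feature_cols = [f"f{i}" for i in range(1, 503)]

# Indices and importances copied from the two outputs above
# (both are in ascending index order, so they pair up positionally)
top_idx = [0, 1, 3, 5, 22, 26, 479]
top_imp = [0.05568427, 0.06748862, 0.12727115, 0.0515309,
           0.05874978, 0.05186435, 0.05402104]

# Sort by importance, largest first, and attach the column names
ranked = sorted(zip(top_idx, top_imp), key=lambda p: p[1], reverse=True)
named = [(feature_cols[i], imp) for i, imp in ranked]

for name, imp in named:
    print(f"{name}: {imp:.4f}")
```

Under that naming assumption, vector index 3 (the largest importance, about 0.127) corresponds to column f4.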

As for determining the directionality of the features, the first idea that springs to mind is to train a logistic regression model and look at the signs of the coefficients. However, training a new model can be expensive, and logistic regression does not necessarily rely on the same decision boundaries as the RF model. Instead, we can examine the top features and determine where they are most represented: for example, if feature X has higher values in class 0 than in class 1, this would indicate negative directionality. We would want to take multiple random subsets of the 0-class data again to make sure that both classes are well represented when making such an assessment. Additionally, we could use concepts like permutation importance or SHAP values to help us understand how our model quantifies contributions.
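
The class-comparison idea can be sketched as follows: compare the mean of a feature in class 1 against its mean in several random, equally sized subsets of class 0, and look at the sign of the average difference. This is a plain-Python illustration with made-up values; `mean_difference` and its parameters are hypothetical, and a difference in means is only a rough proxy for how the forest actually uses the feature.

```python
import random
from statistics import mean

def mean_difference(pos_values, neg_pool, n_subsets=20, seed=12):
    """Average of (class-1 mean minus class-0 subset mean) over several
    random class-0 subsets, each the same size as the class-1 sample."""
    rng = random.Random(seed)
    pos_mean = mean(pos_values)
    diffs = []
    for _ in range(n_subsets):
        subset = rng.sample(neg_pool, k=len(pos_values))
        diffs.append(pos_mean - mean(subset))
    return mean(diffs)

# Hypothetical feature values (not taken from the dataset)
pos_values = [5.0, 6.0, 7.0, 6.5]                 # class 1
neg_pool = [float(i % 4) for i in range(1000)]    # class 0, values 0..3

diff = mean_difference(pos_values, neg_pool)
direction = "positive" if diff > 0 else "negative"
print(f"mean difference: {diff:.3f} ({direction} directionality)")
```

Averaging over several subsets reduces the chance that a single unlucky draw from class 0 flips the sign.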