 # Analysis of the playlist from TheCurrent.org
An attempt to predict when an artist will be played on the radio station TheCurrent.org. This example labels each play by whether it aired on a Thursday and trains a classifier to predict that label.
 
## Setup 
 1. Download Spark 2.0.1 from here: http://d3kbcqa49mib13.cloudfront.net/spark-2.0.1-bin-hadoop2.7.tgz
 1. Untar it to the location /opt/spark-2.0.1-bin-hadoop2.7
 1. From the command line execute: `source profile && pyspark`


## Read and label data
Read in the source CSV files from the `output/csv` directory. Label each row as positive if it occurred on a Thursday.

In [1]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType


def get_label(day_of_week, hour):
    '''Return 1 if the day is Thursday, otherwise 0. Used by the UDF to label a row. `hour` is accepted but not used.'''
    return 1 if day_of_week == "Thursday" else 0

# The UDF
label_udf = udf(get_label, IntegerType())

# Read the CSV files
df = sqlContext.read.format('com.databricks.spark.csv') \
        .option("header", True) \
        .option("inferSchema", True) \
        .load('output/csv/*/*') \
        .selectExpr("id", "datetime", "artist", "title", "cast(year as int) year", "cast(month as int) month", "cast(day as int) day", "day_of_week", "cast(hour as int) hour")
        
# Add a column called `label` which indicates if the song was played on a Thursday (1) or not (0)
df = df.withColumn("label", label_udf(col("day_of_week"), col("hour")))
df = df.dropna()
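
The labelling rule depends on `day_of_week` holding English names such as "Thursday". As a quick sanity check outside Spark, the sketch below derives the weekday from the `year`, `month`, and `day` values and compares it with the stored name. The sample rows and the `day_name` helper are hypothetical and only illustrate the idea.

```python
import datetime

# Fixed English names so the result does not depend on the system locale.
# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_name(year, month, day):
    """Return the English weekday name for a calendar date."""
    return DAY_NAMES[datetime.date(year, month, day).weekday()]


def get_label(day_of_week, hour):
    """Same rule as the UDF above: 1 for Thursday, otherwise 0."""
    return 1 if day_of_week == "Thursday" else 0


# Hypothetical rows: (year, month, day, day_of_week, hour)
sample_rows = [
    (2016, 10, 13, "Thursday", 9),
    (2016, 10, 14, "Friday", 9),
]

for year, month, day, day_of_week, hour in sample_rows:
    # The stored name should agree with the one derived from the date.
    assert day_name(year, month, day) == day_of_week
    print("{0}-{1:02d}-{2:02d} {3} -> label {4}".format(
        year, month, day, day_of_week, get_label(day_of_week, hour)))
```
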

## Create training and test data
Split the source data at random into a training set (about 80% of the rows) and a test set (about 20%). The split is random, so the exact proportions vary slightly from run to run.

In [2]:
splits = df.randomSplit([0.8, 0.2])
train = splits[0].cache()
test = splits[1].cache()
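
To see why the proportions are only approximate, here is a simplified pure-Python model of a weighted random split. It is not Spark's implementation. The `random_split` helper and its `seed` parameter are assumptions made for illustration. `randomSplit` also accepts an optional `seed` argument, which makes a run repeatable.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any floating-point remainder.
            if draw < bound or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print("train: {0}, test: {1}".format(len(train_rows), len(test_rows)))
```

Each row is drawn independently, so the bucket sizes land near 800 and 200 rather than exactly on them.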

## Generate features
Use the `StringIndexer` to create numeric representations of the `artist`, `title`, and `day_of_week` columns. Use a `VectorAssembler` to combine all numeric data into a vector.

In [3]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler


artistInd = StringIndexer(inputCol="artist", outputCol="artistIndex").setHandleInvalid("skip")
titleInd = StringIndexer(inputCol="title", outputCol="titleIndex").setHandleInvalid("skip")
dayOfWeekInd = StringIndexer(inputCol="day_of_week", outputCol="dayOfWeekIndex").setHandleInvalid("skip")
assembler = VectorAssembler(inputCols=["artistIndex", "dayOfWeekIndex", "titleIndex", "hour", "day", "month", "year"], outputCol="features")
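
As a rough illustration of what these stages do, the sketch below indexes one string column by descending frequency, skips rows whose value was not seen during fitting (the effect of `setHandleInvalid("skip")`), and appends numeric columns to form a feature vector. The helper names, the artist names, and the alphabetical tie-breaking are assumptions of this sketch, not Spark's code.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (0.0, 1.0, ...)."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically here,
    # which is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def assemble(rows, index, column, numeric_columns):
    """Index one string column and append numeric columns, skipping unseen labels."""
    vectors = []
    for row in rows:
        if row[column] not in index:
            continue  # mirrors handleInvalid="skip"
        vectors.append([index[row[column]]] +
                       [float(row[c]) for c in numeric_columns])
    return vectors


# Hypothetical training values for the `artist` column
train_artists = ["Artist A", "Artist B", "Artist A",
                 "Artist C", "Artist A", "Artist B"]
artist_index = fit_string_index(train_artists)

# Hypothetical test rows; "Artist Z" was never seen during fitting
test_rows = [
    {"artist": "Artist B", "hour": 9},
    {"artist": "Artist Z", "hour": 10},
]
vectors = assemble(test_rows, artist_index, "artist", ["hour"])
print(artist_index)
print(vectors)  # [[1.0, 9.0]]
```
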

## Build the vector pipeline and train the model
The pipeline executes the following stages:
 1. Creates the `artist` index 
 1. Creates the `title` index 
 1. Creates the `day_of_week` index
 1. Assembles all numeric columns into a vector
 1. Trains the model using 80% of the source data we split previously

In [23]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

lr = LogisticRegression(maxIter=300000, threshold=.1)
pipeline = Pipeline().setStages([artistInd, titleInd, dayOfWeekInd, assembler, lr])

model = pipeline.fit(train)
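
For intuition about the final stage, a binary logistic regression model turns a feature vector into a probability by applying the logistic function to a weighted sum. The sketch below shows that calculation in plain Python. The weights and intercept are hypothetical, and a fitted model would supply its own.

```python
import math


def predict_probability(features, weights, intercept):
    """Return P(label = 1) for one feature vector under a logistic model."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical coefficients, chosen only for illustration
weights = [0.5, -0.25]
intercept = 0.0

# A zero margin maps to a probability of exactly 0.5
print(predict_probability([0.0, 0.0], weights, intercept))  # 0.5
```
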

## Test the model
Test the model on the 20% of the source data held out earlier. Use `BinaryClassificationMetrics` to calculate the area under the ROC curve (AUC) and the area under the precision-recall curve (AUP).

In [13]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

results = model.transform(test)
predictionsAndLabels = results.select("prediction", "label") \
                              .rdd.map(lambda r: (float(r["prediction"]), float(r["label"])))
metrics = BinaryClassificationMetrics(predictionsAndLabels)
print("AUC: {0}".format(metrics.areaUnderROC))
print("AUP: {0}".format(metrics.areaUnderPR))

AUC: 0.682087698739
AUP: 0.606145311924
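
One way to read the AUC figure is as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as half. The sketch below computes that quantity directly from `(score, label)` pairs in plain Python. It is an illustrative equivalent for small inputs, not Spark's implementation, and the sample pairs are hypothetical. The pairs may carry any real-valued score. The cell above passes the thresholded `prediction` column, so its scores are only 0.0 or 1.0.

```python
def area_under_roc(scores_and_labels):
    """Probability that a random positive scores above a random negative.

    Ties count as half. This pairwise form is O(n^2), fine for a small sample.
    """
    positives = [s for s, label in scores_and_labels if label == 1.0]
    negatives = [s for s, label in scores_and_labels if label == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical (score, label) pairs
sample = [(0.1, 0.0), (0.4, 0.0), (0.35, 1.0), (0.8, 1.0)]
print(area_under_roc(sample))  # 0.75
```
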


In [17]:
results.select("probability").take(10)



[Row(probability=DenseVector([0.9363, 0.0637])),
 Row(probability=DenseVector([0.7791, 0.2209])),
 Row(probability=DenseVector([0.9293, 0.0707])),
 Row(probability=DenseVector([0.7848, 0.2152])),
 Row(probability=DenseVector([0.937, 0.063])),
 Row(probability=DenseVector([0.8365, 0.1635])),
 Row(probability=DenseVector([0.9419, 0.0581])),
 Row(probability=DenseVector([0.7565, 0.2435])),
 Row(probability=DenseVector([0.8412, 0.1588])),
 Row(probability=DenseVector([0.8402, 0.1598]))]
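
To connect these probabilities with the `threshold=.1` setting, the sketch below applies that threshold to the ten rows shown above. It assumes the second entry of each vector is the probability of the positive class, and that a row is predicted positive when that probability exceeds the threshold.

```python
THRESHOLD = 0.1  # same value passed to LogisticRegression above

# P(label = 1) for the ten rows shown above (second entry of each vector)
positive_probabilities = [0.0637, 0.2209, 0.0707, 0.2152, 0.063,
                          0.1635, 0.0581, 0.2435, 0.1588, 0.1598]

# Assumption: a row is predicted positive when its probability exceeds the threshold
predictions = [1 if p > THRESHOLD else 0 for p in positive_probabilities]
print(predictions)  # [0, 1, 0, 1, 0, 1, 0, 1, 1, 1]
```

With a threshold this low, six of the ten rows are predicted positive even though none has a positive-class probability above 0.25.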