WordPress Blog Post Recommender in Spark and Scala
At the Advanced Data Analytics team at Barclays we solved the Kaggle competition as proof-of-concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end. �The case study is recommending a sequence of WordPress blog posts that the users may like based on their historical likes and blog/post/author characteristics. Details of the competition available at https://www.kaggle.com/c/predict-wordpress-likes.
Workshop presentation slides available at: https://speakerdeck.com/gmspacagna/wordpress-blog-posts-recommender-in-spark-scala-and-the-sparknotebook.
Recommending a sequence of blog posts that the users may like on WordPress based on their historical likes and blog/post/author characteristics.
Each data point consisted on the event like/not-like of a particular pair of user and blog post.
We computed the following set of features:
- categoriesLikelihood: probability of liking the post by the user only based on the categories term frquency of likes.
- tagsLikelihood: probability of liking the post by the user only based on the tags term frquency of likes.
- languageLikelihood: probability of liking the post by the user only based on the language frquency of likes.
- authorLikelihood: probability of liking the post by the user only based on the author frequency of likes.
- titleLengthMeanError: absolute difference between the title length and the average title length of liked posts.
- blogLikelihood: probability of liking the post by the user only based on the blog frequency of likes.
- averageLikesPerPost: average number of likes per post of that blog.
The raw data only contained the true class points. We generated the false class points according to the same distribution and by combining posts with random users who did not like that post.
Some features used some summary historical statistics about the user who were not available for all of them. We computed the global prior probabilities averaging all of the available information of other users instead of filling missing features using interpolation techniques.
We implemented two models:
- Simple Recommender simply based on the tags likelihood.
- Logistic Regression model to predict wheter the user is going to like or not a given post and then ranking the true-classified posts based on the logistic raw score.
Final end-to-end evaluation was made using Mean Average Precision @ 100.
For the logistic regression we also run a bunch of standard classifier evaluations (e.g. ROC curve) to evaluate how good it is on predicting the true/false like class and finding the best threshold for limiting the number of recommended blog posts.
For the ETL and Feature Engineering we used a mixed approach where data collections were trasnformed from DataFrame into RDD of case classes and viceversa based on the specific use case so that we could exploit both the functionalities of the two APIs. We followed an investigate then develop methodology where we iterate through the following steps:
- Run some interactive exploratory analysis in the SparkNotebook
- Switch into a proper IDE (IntelliJ in our case) to develop the code and make the API available in the form of functions
- Build the jar as a library and add it to the SparkContext classpath
- Call the library functions within the notebook to trigger the execution
- Analyse and present results
This workflow allowed us to first understand the data interactively and then develop the main logic into a more productive and robust environment like IntelliJ so that the Notebook stays clean and the library can be reused in different contexts.
The end-to-end workflow was implemented and performed according to the following external references:
- [Manifesto for Agile Data Science] (http://www.datasciencemanifesto.org)
- The complete 18 steps to start a new Agile Data Science project
We deployed Spark in local mode in a 8 cores 2.5 Ghz i7 Macbook Pro with 16GB of RAM.
Spark version 1.5.0
Notebook version 1.6.1
Scala version 2.10.4
We run our experiments in Spark local mode since that the size of data was small enough but the implementation is so that it can scale for any size and any cluster.
We did not leverage the DataFrame optimizations for the engineering part but we preferred the flexibility and functionalities of the classic RDDs.
Features independency was not verified, for example we expect tagsLikelihood and categoriesLikelihood features to be highly correlated. PCA or similar techniques could have been adopted to make the solution more reliable.
Whilst the training set was balanced, the test set contained many more false records.
The whole analysis/modeling was performed statically without considering timestamps or sequence of events.
In order to reduce the search space in the evaluation we pre-filtered only the blog posts who were liked at least once in the test dataset. That is obviously cheating but we needed to reduce the search space for our experiments because we did not use a proper Big Data cluster.
Same for evaluated users, we only selected 100 random ones.
We have not compared how Zeppelin compare to the SparkNotebook in terms of visualizations.
Bear in mind that ehe goal was not proving the correctness of the solution but more showing how you can easily implement and end-to-end scalable and production-quality solution for typical data science problem by leveraging Spark, Scala and the SparkNotebook.
Lessons learnt are:
- DataFrame is great for I/O, schema inference from the sources and when you have flatten schemas. Operations start to be more complicated with nested and array fields.
- RDD gives you the flexibility of doing your ETL using the richness of the Scala framework, in the other hand you must be careful on optimizing your execution plans.�Functional Programming allowed us to express complex logic with a simple and clear code and free of side effects.
- Map joins with broadcast maps is very efficient but we need to make sure to reduce at minimum its size before to broadcast, e.g. applying some filtering to remove the unmatched keys before the join or capping the size of each value in case of size-variable structures (e.g. hash maps).
- ETL and feature engineering is the most time-consuming part, once you obtained the data you want in vector format then you can convert back to DataFrame and use the ML APIs.
- ML unfortunately does not wrap everything available in MlLib, sometime you have to convert back to RDD[LabeledPoint] or RDD[(Double, Vector)] in order to use the MlLib features (e.g. evaluation metrics).
- ML pipeline API (Transformer, Estimator, Evaluator) seems cool but for an MVP is a pre-mature abstraction.
- Do not underestimate simple solutions. In the worst case they serve as baseline for benchmarking.
- Even tough the Logistic Regression was better on classifying as true or false, the simple model outperformed when running the end-to-end ranking evaluation.
- Focus on solving problems rather than models or algorithms.�Many Data Science problems can be solved with counts and divisions, e.g. Naïve Bayes.
- Logistic Regression “raw scores” are NOT probabilities, �treat them carefully!
- SparkNotebook is good for EDA and as entry point for calling APIs and presenting results.
- Developing in the notebook is non very productive, the more you write code the more become harder to track and refactor previously developed code.
- Better writing code in IntelliJ and then either pack it into a fat jar and import it from the notebook or copy and paste every time into a notebook dedicated cell.
- In order to keep normal Notebook cells clean, they should not contain more than 4/5 lines of code or complex logic, they should ideally just code queries in the form of functional processing and entry points of a logic API.
- Plotting in the notebook with the built in visualization is handy but very rudimental, can only visualize 25 points, we created a Pimp to take any Array[(Double,Double)] and interpolate its values to only 25 points.
- Tip: when you visualize a Scala Map with Double keys in the range 0.0 to 1.0, the take(25) method will return already uniform samples in that range and since the x-axis is numerical, the built-in visualization will automatically sort it for you.
- Probably we should have investigated advanced libraries like Bokeh or D3 that are already supported in the Notebook.
Manifesto for Agile Data Science: http://www.datasciencemanifesto.org
The complete 18 steps to start a new Agile Data Science project: https://datasciencevademecum.wordpress.com/2015/11/12/the-complete-18-steps-to-start-a-new-agile-data-science-project/