Data Science examples
This is a collection of interesting examples and projects in machine learning, optimization and business intelligence that I’ve done as part of coursework, books, online courses, personal projects and others, with a focus on collaborative filtering. Most of them use public or simulated data.
Note: most of the links below point to pages rendered through nbviewer and htmlpreview. Opening the files directly within GitHub might not render them correctly (e.g. mathematical symbols not appearing, TOC not clickable).
- Recommending movies based on movie rating history (Python + Spark) (click here): building a very basic, distributed recommender system using collaborative filtering on the MovieLens movie ratings dataset, based on low-rank matrix factorization of users' movie rating history.
- Recommending movies based on movie rating history + user demographics + movie genres (Python + TensorFlow) (click here) (algorithm implementation): extending the previous collaborative filtering model to incorporate information about the users and movies through collective matrix factorization.
- Recommending movies based on user demographics and movie tags (Python + Spark) (click here): building a recommender system according to how users with different characteristics rate certain types of movies better or worse than average. Unlike collaborative filtering, this can give personalized recommendations to new users without ratings and recommend unrated items.
- Recommending songs based on music listening history (Python) (click here) (better implementation): building a recommender system according to the history of tracks played by users from the Million Song Dataset, using Hierarchical Poisson Factorization (HPF). Unlike a movie rating, the number of times a song is played is only an indirect measure of user preference and doesn’t signal dislikes, so it’s harder to build recommendations from. Contrary to methods like BPR (Bayesian Personalized Ranking) or weighted-implicit-ALS, HPF does not require iterating over songs not played by each user.
- Recommending products based on event logs and item descriptions (Python) (click here) (algorithm implementation): extending the model above to include text descriptions of the items being recommended through Collaborative Topic Poisson Factorization, using the RetailRocket dataset of event logs (click, add-to-cart, buy) from an e-commerce retailer.
- Combining recommended lists from different methods (Python) (click here): combining lists of Top-N recommended movies from very different methodologies (collaborative filtering, content-based, most-popular) through interleaved ranking, which can force a greater degree of mixture than summing or multiplying scores from different algorithms.
- Diversifying recommendation lists (Python) (click here): selecting diverse top-N recommended items from a larger list. As recommendation formulas tend to be based on predicted ranking/probability, the resulting lists tend to contain elements that are too similar to each other (e.g. all books from the same author), which might not be received positively by the user.
- Online Contextual Bandits (Python) (click here) (algorithm implementations): a comparison of algorithms for online bandits with side information. This represents scenarios such as online advertising, where there are different choices to present to a user, some have been shown many times, some are new, and we don't know whether a user would have clicked the ads they were not shown (for the version without side information see the other projects below).
- Clustering musical artists (Python + Spark) (click here): clustering (finding similar groups of) musical artists using the data from Last.fm on top-played artists per user (large dataset) with graph-based methods, using Spark to parallelize the computations and speed up the process.
- Entity Resolution (R) (click here) (computations in Spark): (also called record linkage) determining (with probabilistic models) which product descriptions indexed by Google from different sites refer to the same product listed at Amazon based on their description.
- Topic Modeling (R’s Shiny) (click here): topic modeling on the latest tweets from popular data science sites, in other words, finding out what they are talking about by means of computer algorithms.
- List ranking (Python) (click here): looking for the optimal order in which to rank the items in a list, according to either an incomplete series of aggregated preferences between two items (ItemA > ItemB) from different people, or full or partial rankings of some items. The number of possible orderings makes a brute-force search infeasible, but the solution can be efficiently approximated with local search methods, greedy algorithms (e.g. Kwik-Sort, PageRank), or a relaxation of the problem that makes it convex.
- Choosing Ads to display (Python) (click here): choosing ads to display by different policies so as to maximize clicks (based on the multi-armed bandits problem), in simulations under different settings such as equal vs. unequal pay-per-click, few vs. infinite number of ads (many-armed bandits), fixed vs. changing click probabilities (restless bandits), permanent vs. expiring ads (mortal multi-armed bandits).
- Splitting biased dataset from convenience-sampling (Python) (click here): splitting a dataset of feedback from users about products into a training and test set (for a recommendation algorithm) in such a way that the sets contain non-intersecting subsets of users and products, while meeting some criteria such as minimum size and minimum number of products from each category. This step in itself represents a whole optimization problem.
- Non-linear optimization examples (Python) (click here): small tutorial illustrating how to solve a toy constrained non-linear optimization problem (hs071, minimizing a function subject to constraints) in different solver interfaces for Python (scipy, casadi, nlopt, cvxopt, pyipopt), and a short benchmark when solving a larger problem (hs103).
- RFM Analysis (R) (click here): using transaction data from an online application (software) to calculate the lifetime value of customers, see which customers are at risk of defecting, and find segments (groups of similar users) among the users that are still active, based on probabilistic models using RFM analysis (Recency-Frequency-Monetary). One of the difficulties with such data is that payments are not made on a regular basis and it’s hard to determine who is still an active user.
- Database Marketing (R) (click here): using survival analysis (also called duration modeling) to model customer retention and attrition over time in a phone company, according to the channel through which they were acquired and the market to which they belong, then building a model for estimating expected customer revenue after 1 year and after 3 years according to how customers were obtained.
- Marketing Research (R) (click here): creating a perceptual map (also called “brand map”) of laptop brands from a small survey using principal components biplots and comparing it to other types of brand maps with data from surveys on university courses and perception of countries.
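
As a rough illustration of the interleaved ranking idea from the "Combining recommended lists from different methods" entry above, here is a minimal sketch. It assumes a plain round-robin scheme (take the next unseen item from each list in turn), which may differ from what the linked notebook actually does; the function name `interleave_rankings` and the toy item IDs are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for lists that ran out of items


def interleave_rankings(ranked_lists, top_n):
    """Round-robin merge of several ranked lists, skipping duplicates.

    Assumption: each list is ordered best-first and every list gets
    equal priority; ties are broken by the order of `ranked_lists`.
    """
    if top_n <= 0:
        return []
    merged, seen = [], set()
    # Each "tier" holds the items at the same rank position in every list.
    for tier in zip_longest(*ranked_lists, fillvalue=_MISSING):
        for item in tier:
            if item is _MISSING or item in seen:
                continue
            merged.append(item)
            seen.add(item)
            if len(merged) == top_n:
                return merged
    return merged


# Toy top-N lists from three hypothetical recommenders
collaborative = ["A", "B", "C", "D"]
content_based = ["B", "E", "A", "F"]
most_popular = ["G", "A", "H"]

combined = interleave_rankings(
    [collaborative, content_based, most_popular], top_n=5
)
print(combined)  # ['A', 'B', 'G', 'E', 'C']
```

Because items are picked by rank position rather than by score, every method contributes to the top of the combined list even when the methods' scores are on very different scales.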