# Redis for Recommendations

## About me

I am adding this section because I plan to use this workbook for a talk next week, and it is customary to introduce yourself.

So here are the standard sentences you will find in the agenda of one conference or meetup or another:

> David is a creative Software Engineer and a skilled Consultant with experience in software project management, for both product development and customer projects. Furthermore, he has a strong database background, specializing in NoSQL database systems.

My experience is mainly based on my work as a Senior Software Engineer, Project Lead, Software Architect, Consultant, Principal Solutions Engineer, and Performance Architect for database companies like Ingres, sones (GraphDB), Couchbase, and Redis Labs. As you can see, working on recommender systems (recommendation engines) is not my daily business. I am more of a database guy who loves to write code (even if I don't find much time for it at the moment).

I currently work for Redis Labs as a Technical Enablement Manager, responsible for the onboarding of customers and technical field resources.

## About Redis

It is a good idea to mention what Redis actually is, even though Redis is very popular and most of you probably already know it. Here is a ranking of the most popular database systems (just to highlight how popular Redis is):

* https://db-engines.com/en/ranking

As Redis Enterprise adds a number of useful features, I will also explain very briefly what the benefits of using Redis Enterprise could be.

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence.

Redis Labs is the home of OSS Redis and the provider of the multi-model in-memory database system 'Redis Enterprise'. Redis Enterprise is based on open source Redis and provides the following additional features:

* **Easier Operability**: Admin Web UI, Admin REST Service, several CLI tools, Redis Enterprise Cloud (Hosted or VPC), Enterprise Support
* **Enhanced High Availability**: Node-based quorum, rack-zone awareness, disaster recovery, periodic backup, faster failover times and different watchdog profiles, ...
* **Improved Scalability and Consistent Performance**: Multiple Redis shards behind a single endpoint, different shard placement policies, built-in resource management for better resource isolation, multi-tenancy, tunable frontend thread management, ...
* **Active-Active Geo-replicated Databases**: By leveraging Conflict-free Replicated Data Types (resettable PN-Counters, OR-Sets, LWW (last-writer-wins) Registers, causal consistency via Vector Clocks)
* **Redis on Flash**: Uses Flash drives as RAM extension in order to store more data at lower costs.

In addition Redis Labs is maintaining the following modules:

* **RediSearch**: Search engine over Redis
* **Redis-ML**: Machine Learning Model Server
* **Redis Graph**: Graph database with an Open Cypher-based query language
* **ReJSON**: A JSON data type for Redis
* **ReBloom**: Scalable Bloom filters

The source code of all of these modules is available on Github.

## Preparations

Here is how to establish a connection to a Redis server that uses the default port and has authentication disabled. Note that the cell below also calls `flushall()`, which deletes all data on both instances.

In [76]:
import json
import redis

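# Standard Redis connections
# r: default instance (localhost:6379), r_g: second instance on port 7777, used for the graph examples
r = redis.StrictRedis()
r_g = redis.StrictRedis(host='localhost', port=7777)

instances = [r, r_g]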

for i in instances:
    print("Connected: {0}".format(i))
    print("Flushed: {0}".format(i.flushall()))

Connected: Redis<ConnectionPool<Connection<host=localhost,port=6379,db=0>>>
Flushed: True
Connected: Redis<ConnectionPool<Connection<host=localhost,port=7777,db=0>>>
Flushed: True


## Content Based Filtering

The idea is to look at what a specific user is interested in and then to recommend things that are similar to (i.e. in the same category as) other things the user likes.

* Data structures: **Sets**
* Operations: Members/Scans, Union

In [77]:
# David owns, for example, one comic in each of the following categories:
r.sadd('usr:david:catg', 'fantasy', 'super-heros', 'scifi')

# Here the items per category
r.sadd('ctg:scifi:items','Valerian', 'Fantastic Four')
r.sadd('ctg:super-heros:items', 'Batman', 'Spiderman', 'Wonder Woman')
r.sadd('ctg:fantasy:items', 'Avatar', 'Dragon Age')

# The following items could be interesting for David
## BTW: SSCAN is better suited for large sets
categories = r.smembers('usr:david:catg')

## Helper to prepare the key list (redis-py returns bytes by default, so decode first)
keys = ["ctg:" + ctg.decode('UTF-8') + ":items" for ctg in categories]

## BTW: SUNIONSTORE is useful for materializing large result sets
result = r.sunion(keys)
print("David could be also interested in: {0}".format(result))

David could be also interested in: {b'Wonder Woman', b'Avatar', b'Batman', b'Spiderman', b'Fantastic Four', b'Dragon Age', b'Valerian'}
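
The same idea can be tried without a running Redis server. The following is only a minimal sketch, assuming plain Python sets as stand-ins for the Redis Sets used above; the dictionaries `user_categories` and `category_items` and the function `recommend_by_category` are hypothetical names that exist only for this illustration.

```python
# In-memory stand-ins for the Redis Sets (hypothetical model, no server needed)
user_categories = {"david": {"fantasy", "super-heros", "scifi"}}
category_items = {
    "scifi": {"Valerian", "Fantastic Four"},
    "super-heros": {"Batman", "Spiderman", "Wonder Woman"},
    "fantasy": {"Avatar", "Dragon Age"},
}


def recommend_by_category(user):
    """Union of the items of every category the user is interested in.

    This mirrors what SUNION returns for the list of ctg:<category>:items keys.
    """
    result = set()
    for category in user_categories.get(user, set()):
        result |= category_items.get(category, set())
    return result


print(sorted(recommend_by_category("david")))
```

Like the Redis version, this sketch does not exclude items the user already owns; it only collects everything that belongs to the user's categories.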


## Collaborative Filtering

It is mandatory to have collected details about many other users. The underlying idea is that if person A likes the same things as person B, then person B might also like the other items that person A likes.

* Data structures: **Sets**
* Operations: Members/Scans, Union, Diff

In [78]:
# David owns the comics Spiderman and Batman
r.sadd('usr:david:items','Spiderman', 'Batman')

# Pieter owns the comics Wonder Woman and Batman
r.sadd('usr:pieter:items', 'Wonder Woman', 'Batman')

# The following is the reverse mapping per item
r.sadd('itm:spiderman:users', 'david')
r.sadd('itm:batman:users', 'david', 'pieter')
r.sadd('itm:wonder_woman:users', 'pieter')

# These are all the users interested in the same items as David
items = r.smembers('usr:david:items')
keys = []
for item in items:
    keys.append("itm:" + item.decode('UTF-8').lower().replace(' ', '_') + ":users")

users = r.sunion(keys)
print("Users interested in the same items as David: {0}".format(users))

                 
# Pieter is interested in the same items as David, so here is the recommendation for David based on Pieter's interests
print("David is interested in: {0}".format(r.smembers('usr:david:items')))
david_key = 'usr:david:items'

for usr in users:
    usr_key = "usr:" + usr.decode('UTF-8') + ":items"
    if usr_key != david_key:
        print("David could be also interested in: {0}".format(r.sdiff(usr_key, david_key)))



Users interested in the same items as David: {b'pieter', b'david'}
David is interested in: {b'Batman', b'Spiderman'}
David could be also interested in: {b'Wonder Woman'}
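
To make the union and diff steps easier to follow, here is a minimal sketch that models them with plain Python sets instead of Redis Sets. The names `user_items`, `build_item_users` and `recommend_from_peers` are assumptions made for this illustration and are not part of the workbook or of redis-py.

```python
# In-memory stand-in for the usr:<user>:items Sets (hypothetical model)
user_items = {
    "david": {"Spiderman", "Batman"},
    "pieter": {"Wonder Woman", "Batman"},
}


def build_item_users(user_items):
    """Reverse mapping item -> users, like the itm:<item>:users Sets."""
    item_users = {}
    for user, items in user_items.items():
        for item in items:
            item_users.setdefault(item, set()).add(user)
    return item_users


def recommend_from_peers(user, user_items):
    """Items owned by users who share at least one item with `user`."""
    item_users = build_item_users(user_items)
    own = user_items[user]
    # Equivalent of SUNION over the reverse mappings, minus the user itself
    peers = set().union(*(item_users[item] for item in own)) - {user}
    recommendations = set()
    for peer in peers:
        # Equivalent of SDIFF usr:<peer>:items usr:<user>:items
        recommendations |= user_items[peer] - own
    return recommendations


print(recommend_from_peers("david", user_items))
```

Deriving the reverse mapping from `user_items` avoids the two structures drifting apart, which is easy to get wrong when both are maintained by hand.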


## Ratings based Collaborative Filtering

Same as collaborative filtering, but we are now interested in how much a user likes an item, which allows us to find out whether two or more users like similar things. Items that user B likes but user A has not rated yet could then also be interesting for user A.

* Structures: **Sorted Sets**
* Operations: Intersections, Unions, Members/Scans, Ranges, Weights & Aggregations

In [80]:
# TODO: Check syntax of r.zadd
def zadd(key, score, item):
    return r.execute_command('ZADD', key, score, item)

# TODO: redis-py doesn't support weights here
def zinterstore(target, keys, weights):
    return r.execute_command('ZINTERSTORE', target, len(keys), *keys, 'WEIGHTS', *weights)

def zunionstore_agg_min(target, keys, weights):
    # Weights will be applied before the aggregation is executed as part of the union
    return r.execute_command('ZUNIONSTORE', target, len(keys), *keys, 'WEIGHTS', *weights, 'AGGREGATE', 'MIN')

# Root Mean Square
import math
def rms(values):
    sq_sum = 0
    for v in values:
        v = v[1]
        v = v ** 2
        sq_sum = sq_sum + v
    sq_sum_avg = sq_sum / len(values)
    return math.sqrt(sq_sum_avg)
        
# Ratings by user
zadd('usr:david:ratings', 3.0, 'spiderman')
zadd('usr:david:ratings', 4.0, 'batman')
zadd('usr:david:ratings', 3.0, 'superman')
zadd('usr:pieter:ratings', 3.0, 'batman')
zadd('usr:pieter:ratings', 1.0, 'wonder_woman')
zadd('usr:pieter:ratings', 5.0, 'aqua_man')
zadd('usr:pieter:ratings', 4.0, 'superman')


# Ratings by item
zadd('itm:spiderman:ratings', 3.0, 'david')
zadd('itm:batman:ratings', 4.0, 'david')
zadd('itm:batman:ratings', 3.0, 'pieter')
zadd('itm:wonder_woman:ratings', 1.0, 'pieter')

# Items rated by David
rated_david = r.zrange('usr:david:ratings', 0, -1)
keys = []
for rt in rated_david:
    key = "itm:" + rt.decode('UTF8') + ":ratings"
    keys.append(key)

r.zunionstore('usr:david:ratings:same', keys)
users = r.zrange('usr:david:ratings:same', 0, -1)
print("The following users rated David's items: {}".format(users))

#Calculate similarities
david_key = 'usr:david:ratings'
for usr in users:
        usr = usr.decode('UTF-8')
        usr_key = "usr:" + usr + ':ratings'
        
        if usr_key != david_key:
            usr_keys = [ david_key, usr_key ]
            # Each score is multiplied by its weight before the aggregation
            usr_weights = [1, -1]
            '''
            By default, the resulting score of an element is the sum of its scores in the sorted sets where it exists.
            The weights are multipliers for the scores.
            Weights of (1, -1) mean that the second user's rating is subtracted from David's rating.
            So dist:david:<user> stores, for now, just the per-item distance between the two users' ratings.
            '''
            zinterstore("dist:david:" + usr, usr_keys, usr_weights)
            dists = r.zrange("dist:david:" + usr, 0, -1, desc=True, withscores=True)
            print("The rating distance to {0} is {1}".format(usr, dists))
            print("The average distance (RMS) to {0} is {1}".format(usr, rms(dists)))
            
            # If the user is similar enough to David, add the other user's items to the recommendation list
            if rms(dists) <= 1:
                #print(r.zrangebyscore(usr_key,4,5))

                # Items that were rated by David get a negative score (weight -1), and AGGREGATE MIN keeps it
                usr_filter = [-1, 1]
                zunionstore_agg_min('rec:david', usr_keys, usr_filter)
                # Keep only the items with a score between 4 and 5
                print("The following is highly recommended: {}".format(r.zrangebyscore('rec:david',4,5, withscores=True)))

The following users rated David's items: [b'pieter', b'david']
The rating distance to pieter is [(b'batman', 1.0), (b'superman', -1.0)]
The average distance (RMS) to pieter is 1.0
The following is highly recommended: [(b'aqua_man', 5.0)]
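
The arithmetic behind the sorted set operations can be checked without Redis. The following is a minimal sketch, assuming plain dictionaries in place of the Sorted Sets; `ratings`, `rating_distances`, `rms_of` and `recommend` are hypothetical names used only here. Instead of the negative-weight trick with `AGGREGATE MIN`, the sketch simply skips the items the first user has already rated, which should lead to the same result for this data.

```python
import math

# In-memory stand-in for the usr:<user>:ratings Sorted Sets (hypothetical model)
ratings = {
    "david": {"spiderman": 3.0, "batman": 4.0, "superman": 3.0},
    "pieter": {"batman": 3.0, "wonder_woman": 1.0, "aqua_man": 5.0, "superman": 4.0},
}


def rating_distances(a, b):
    """Per-item rating difference for items both users rated.

    This mirrors ZINTERSTORE with WEIGHTS 1 -1 and the default SUM aggregation.
    """
    common = ratings[a].keys() & ratings[b].keys()
    return {item: ratings[a][item] - ratings[b][item] for item in common}


def rms_of(values):
    """Root mean square; infinite if there is nothing to compare."""
    values = list(values)
    if not values:
        return math.inf
    return math.sqrt(sum(v * v for v in values) / len(values))


def recommend(a, b, max_rms=1.0, min_score=4.0):
    """Items that b rated highly and a has not rated, if b is similar enough to a."""
    if rms_of(rating_distances(a, b).values()) > max_rms:
        return {}
    return {
        item: score
        for item, score in ratings[b].items()
        if item not in ratings[a] and score >= min_score
    }


print(rating_distances("david", "pieter"))
print(rms_of(rating_distances("david", "pieter").values()))
print(recommend("david", "pieter"))
```

The thresholds `max_rms=1.0` and `min_score=4.0` correspond to the values used in the cell above; they are tuning knobs rather than fixed rules.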


## Social Collaborative Filtering

The previous examples used Sets and Sorted Sets. We are now exploring how to use Graphs. Our example is taking a social ('friend of') aspect into account.

In [81]:
# TODO: Check syntax of r.zadd
def zadd(key, score, item):
    return r_g.execute_command('ZADD', key, score, item)

def format_query_props(query):
    query = query.replace(': "', ": '")
    query = query.replace('",', "',")
    query = query.replace('"}', "'}")
    query = query.replace('"', "")
    #DEBUG: print("query = " + query)
    return query

'''
CREATE ( :person { name: 'A', age: B})
'''
def create_vertex(graph, label, props ):
    # Some query formatting
    query = 'CREATE ( :{0} {1} )'.format(label, json.dumps(props))
    query = format_query_props(query)
    r_g.execute_command('GRAPH.QUERY', graph, query)
    
'''
MATCH (a:Person),(b:Person)
WHERE a.name = 'A' AND b.name = 'B'
CREATE (a)-[r:RELTYPE]->(b)
RETURN type(r)
'''    
def create_edge(graph, slabel, tlabel, source, target, elabel):
    query = "MATCH (a:{0}),(b:{1}) WHERE a.name = '{2}' AND b.name = '{3}' CREATE (a)-[r:{4}]->(b) RETURN type(r)"
    query = query.format(slabel, tlabel, source, target, elabel)
    #DEBUG print(query)
    return r_g.execute_command('GRAPH.QUERY', graph, query)
    
'''
MATCH (a:Person)-[r:RELTYPE]->(b:Person)
WHERE a.name = 'A' RETURN b.name
'''
def neighbours(graph, slabel, tlabel, elabel, source):
    query = "MATCH (a:{0})-[r:{1}]->(b:{2}) WHERE a.name = '{3}' RETURN b.name".format(slabel, elabel, tlabel, source)
    #DEBUG: print(query)
    return r_g.execute_command('GRAPH.QUERY', graph, query)
    
# Constants
GRAPH = 'Comics'
T_PERSON = 'Person'
T_COMIC = 'Comic'
T_CATEGORY = 'Category'
T_FRIEND = 'IS_FRIEND_OF'
T_LIKES = 'LIKES'
T_CONTAINS = 'CONTAINS'


# Load some vertices
## Ages might not be real ;-)
david={"name": "David", "age": 38, "gender": "male"}
pieter={"name": "Pieter", "age": 35, "gender": "male"}
itamar={"name": "Itamar", "age": 40, "gender": "male"}
vassilis={"name": "Vassilis", "age": 39, "gender": "male"}
katrin={"name": "Katrin", "age": 38, "gender": "female"}
romy={"name": "Romy", "age": 35, "gender": "female"}

spiderman={"name": "Spiderman"}
batman={"name": "Batman"}
wonder_woman={"name": "Wonder Woman"}
superman={"name": "Superman"}
aquaman={"name": "Aquaman"}
valerian={"name" : "Valerian"}
fantastic_four={"name" : "Fantastic Four"}

super_heros = { "name" : "Super Heros" }
scifi = { "name" : "SciFi" }


v_persons = [ david, pieter, itamar, vassilis, katrin, romy ]  
# Valerian and Fantastic Four must be created as well, otherwise the SciFi edges below match nothing
v_comics = [ spiderman, batman, wonder_woman, superman, aquaman, valerian, fantastic_four ]
v_categories = [ super_heros, scifi ]


# Clean the graph
r_g.flushall()

for v in v_persons:
    create_vertex(GRAPH, T_PERSON, v)
    
for v in v_comics:
    create_vertex(GRAPH, T_COMIC, v)

for v in v_categories:
    create_vertex(GRAPH, T_CATEGORY, v)
    

# Create some edges
## Person has Friends
create_edge(GRAPH, T_PERSON, T_PERSON, 'David', 'Pieter', T_FRIEND)
create_edge(GRAPH, T_PERSON, T_PERSON, 'David', 'Vassilis', T_FRIEND)
create_edge(GRAPH, T_PERSON, T_PERSON, 'David', 'Katrin', T_FRIEND)

## Person likes Comics
create_edge(GRAPH, T_PERSON, T_COMIC, 'David', 'Spiderman', T_LIKES)
create_edge(GRAPH, T_PERSON, T_COMIC, 'David', 'Batman', T_LIKES)
create_edge(GRAPH, T_PERSON, T_COMIC, 'Pieter', 'Batman', T_LIKES)
create_edge(GRAPH, T_PERSON, T_COMIC, 'Pieter', 'Wonder Woman', T_LIKES)
create_edge(GRAPH, T_PERSON, T_COMIC, 'Vassilis', 'Wonder Woman', T_LIKES)
create_edge(GRAPH, T_PERSON, T_COMIC, 'Vassilis', 'Superman', T_LIKES)


## Comic is type of
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'Super Heros', 'Spiderman', T_CONTAINS)
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'Super Heros', 'Batman', T_CONTAINS)
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'Super Heros', 'Wonder Woman', T_CONTAINS)
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'Super Heros', 'Superman', T_CONTAINS)
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'Super Heros', 'Aquaman', T_CONTAINS)
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'SciFi', 'Valerian', T_CONTAINS)
create_edge(GRAPH, T_CATEGORY, T_COMIC, 'SciFi', 'Fantastic Four', T_CONTAINS)

## Basic test of the Graph
print("David has the following friends: {0}".format(neighbours(GRAPH, T_PERSON, T_PERSON, T_FRIEND, 'David')))
print("David likes {0}".format(neighbours(graph, T_PERSON, T_COMIC, T_LIKES, 'David')))

## Super hero comics that David's friends like
            
            
'''
MATCH (person:Person)-[:IS_FRIEND_OF]->(friend)-[likes:LIKES]->(comic)<-[:CONTAINS]-(type) 
WHERE person.name = 'David' 
AND type.name = 'Super Heros' 
RETURN comic.name, count(likes) AS relevance 
ORDER BY relevance DESC
LIMIT 10
'''

query = "MATCH (person:Person)-[:IS_FRIEND_OF]->(friend)-[likes:LIKES]->(comic)<-[:CONTAINS]-(type) WHERE person.name = 'David' AND type.name = 'Super Heros' RETURN comic.name, count(likes) AS relevance ORDER BY relevance DESC LIMIT 10"
result = r_g.execute_command('GRAPH.QUERY', GRAPH, query)
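# The first row of the result set is the header, so skip it.
# The loop variable is named 'row' to avoid shadowing the Redis connection 'r'.
for row in result[0][1:]:
    comic = row[0]
    relevance = row[1]
    print("Comic {0} with relevance {1}".format(comic, relevance))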

David has the following friends: [[[b'b.name'], [b'Pieter'], [b'Vassilis'], [b'Katrin']], [b'Query internal execution time: 0.112000 milliseconds']]
David likes [[[b'b.name'], [b'Spiderman'], [b'Batman']], [b'Query internal execution time: 0.109000 milliseconds']]
Comic b'Wonder Woman' with relevance b'2.000000'
Comic b'Batman' with relevance b'1.000000'
Comic b'Superman' with relevance b'1.000000'
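
To see what the Cypher query computes, here is a minimal sketch of the same traversal over plain Python dictionaries. The adjacency structures `friends`, `likes` and `contains` and the function `friend_recommendations` are assumptions made for this illustration; they are not how the graph module stores or evaluates the graph.

```python
from collections import Counter

# Hypothetical adjacency lists mirroring the edges created above
friends = {"David": ["Pieter", "Vassilis", "Katrin"]}
likes = {
    "David": ["Spiderman", "Batman"],
    "Pieter": ["Batman", "Wonder Woman"],
    "Vassilis": ["Wonder Woman", "Superman"],
}
contains = {
    "Super Heros": {"Spiderman", "Batman", "Wonder Woman", "Superman", "Aquaman"},
    "SciFi": {"Valerian", "Fantastic Four"},
}


def friend_recommendations(person, category, limit=10):
    """Count, per comic of the category, how many friends of `person` like it."""
    relevance = Counter()
    for friend in friends.get(person, []):          # (person)-[:IS_FRIEND_OF]->(friend)
        for comic in likes.get(friend, []):         # (friend)-[:LIKES]->(comic)
            if comic in contains.get(category, set()):  # (comic)<-[:CONTAINS]-(type)
                relevance[comic] += 1
    # ORDER BY relevance DESC LIMIT <limit>
    return relevance.most_common(limit)


for comic, relevance in friend_recommendations("David", "Super Heros"):
    print("Comic {0} with relevance {1}".format(comic, relevance))
```

As in the query, comics that David already likes (Batman in this data) are not filtered out; excluding them would need an additional condition.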


## TODO-s

### Probabilistic data structures

* Bloom filters: Space-efficient check whether something belongs to a specific category
* HLL (HyperLogLog): Space-efficient cardinality estimation of a set, e.g. unique visits
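
As a rough illustration of the Bloom filter idea, here is a minimal pure-Python sketch. It is not the implementation used by the Redis module; the class `BloomFilter`, its parameters `size_bits` and `num_hashes`, and the SHA-256 based hashing are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed
        for seed in range(self.num_hashes):
            data = "{0}:{1}".format(seed, item).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly in the set", False means "definitely not"
        return all(self.bits & (1 << pos) for pos in self._positions(item))


super_heros_filter = BloomFilter()
for comic in ["Batman", "Spiderman", "Wonder Woman"]:
    super_heros_filter.add(comic)

print(super_heros_filter.might_contain("Batman"))
```

The filter only stores bits, not the items themselves, which is where the space saving comes from. The price is that a positive answer can occasionally be wrong.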

### RediSearch

E.g. boosts for weighting categories

* Scoring
* Boost-based queries

### Redis-ML

* Neural networks as function approximators + classifiers
* Tree ensembles

### RedisAI

* Outlook