# Counting Unique Page Views with Sets

Page metrics are an important measurement of the health and vitality of a website. Page
view metrics can not only measure customer interest in your site, they can also identify
user interface and operational problems in real time. Commonly tracked site metrics
include page views, daily active users, unique views, and monthly active users. Redis can
be a vital part of your analytics processing pipeline, computing aggregate statistics.

In this chapter, we are going to look at how the Redis set datatype is used to calculate
and store unique viewers. The topics we will cover include:

* Sets in Redis and Python 
* Specifying data storage conventions 
* Using structured keys for data 
* Adding views 
* Scanning keys
* Secondary indexing

## Sets

### Redis Sets

Redis provides two types of set data structures, the set and the sorted set. Either set
type would work to calculate unique viewers, but in this chapter we are going to look at
examples using the set data type.

Sets in Redis are very similar to the set data structure found in many programming
languages or mathematics. Redis sets are an unordered collection of unique elements.
Elements in Redis are represented as strings, and a byte-wise equality test is used to
determine if two elements are equal. Unlike lists, the sets `(1, 2, 3)`, `(2, 3, 1)`, and
`(3, 1, 2)` are all identical as they contain the exact same elements.

There are fifteen different commands to manipulate set data in the current release of Redis
(3.2). These commands provide a variety of different operations including:

* Adding elements to and removing elements from a set
* Testing for elements in the set
* Retrieving elements of a set
* Comparing and combining multiple sets

Redis' set commands generally fall into one of two categories: those that operate on a
single set and those that operate on multiple sets. The commands that operate on an
individual set normally take a key and a series of one or more elements as parameters. The
commands that operate on multiple sets normally take a series of keys as parameters and
either return a set as a result or write the result into a destination set.

### Sets in Python

The Python language provides a built-in set datatype that is used by Redis client
libraries, including **redis-py**, the client we are using for our example code, to
represent sets returned from a Redis command.

Sets were a later introduction to the Python language. Sets are constructed in Python
using the `set([iterable])` constructor. Newer versions of Python, including the one we are
using for this Notebook, provide the syntactical shortcut `{item1, item2,...}`. If you are
unfamiliar with Python sets, please see the Python documentation on [standard Python
types](https://docs.python.org/2/library/stdtypes.html#set) for more information.
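
As a quick illustration of the properties described above, here is a minimal sketch that uses only Python's built-in `set`. The variable names are illustrative and are not used elsewhere in the chapter.

```python
# Both construction styles produce the same kind of object.
from_constructor = set([1, 2, 3, 2, 1])   # duplicates are discarded
from_literal = {3, 1, 2}

# Order does not matter for set equality, but it does for lists.
print(from_constructor == from_literal)   # True
print([1, 2, 3] == [3, 1, 2])             # False

# Membership test and size
print(2 in from_literal)                  # True
print(len(from_constructor))              # 3
```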

## Page Views

Raw page views are an important metric for websites, but there are other important metrics
based on individual users that require additional resources to compute. Unique daily page
views, daily active users (DAUs) and monthly active users (MAUs) are all common metrics
for measuring the performance of a website.

In this chapter, we are going to look at how Redis can be used as part of an event-based
analytics pipeline. In our example, Redis will be used to store information about users,
allowing us to compute aggregated statistics for unique users on a daily, weekly, and
monthly basis.

Our analytics system reads processed data from an event log which could be implemented by
a wide range of software, including Redis. Our log system provides a stream of interesting
events - logins, page views, downloads - that our analytics pipeline can process into
interesting metrics and data sets.

Our examples will look at how to process a stream of events into Redis to generate sets of
unique users. Then once we have that data loaded into Redis, we show examples of how Redis
can be used to compute additional metrics and aggregations from that data.


### Data Storage Conventions

Our pipeline will store unique page viewers in a set associated with both the
page and the date. Our system provides us with a unique, integer page id for
each page in the system, so we can use that id plus the date to construct a
key of the form `page:{page_id}:unique:{year}:{month}:{day}` to reference our unique
user set. The members of the set will be the integer user ids associated with the
database users who viewed the page.

* * *

> **Note**
>
> The standard date and time handling libraries in programming languages provide 
> facilities for getting the day, month and year components from a timestamp.
> In Python the `time.gmtime` or `time.localtime` functions provide those conversions.
> Because we are doing log processing, we will use the `gmtime` function in our examples.
> If you are unfamiliar with the time handling functions in Python, please see the 
> [time package documentation](https://docs.python.org/2/library/time.html) for more 
> details.
>

* * * 
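
To make the note concrete, the sketch below derives the date components from a UNIX timestamp with `time.gmtime` and builds a key in the format described above. The helper name `daily_key_from_timestamp` is hypothetical and is not part of the chapter's sample code.

```python
import time

def daily_key_from_timestamp(page_id, timestamp):
    """Hypothetical helper: builds the daily unique-viewer key from a UNIX timestamp (UTC)."""
    t = time.gmtime(timestamp)  # struct_time expressed in UTC
    return "page:{}:unique:{}:{}:{}".format(page_id, t.tm_year, t.tm_mon, t.tm_mday)

# 1488585600 is 2017-03-04 00:00:00 UTC
print(daily_key_from_timestamp(201, 1488585600))  # page:201:unique:2017:3:4
```

Because `gmtime` always works in UTC, every event is bucketed into the same calendar day regardless of the time zone of the machine that processes the log.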


## Recording a Page View

In our first example, we are going to read events from our event stream and record the
page view events in Redis. Using the Redis set data structure, our stream processing
function will generate sets of unique users.

As our code processes events in the stream, it will:

* Parse the event
* Compute a key based on the event data
* Store the viewer in the viewers set

The Redis SADD (**S**et **ADD**) command adds the specified member to a set, creating
the set if it doesn't already exist.  Since a member can only exist in the set
once, we can add the user to the set every time we record a page view for the
user, but the user will only be stored once, giving us our unique user data.

*Try running the sample code below by selecting the code cell and pressing SHIFT + ENTER*

In [None]:
import redis

# example connection parameters 
config = {
    "host": "redis",
    "port": 6379
}

r = redis.StrictRedis(**config)

sample_page_view_events = [
    ("1ef81361-0071-11e7-bf3a-4c3275922049", 2017, 3, 4, 3001, 201),
    ("1ef81554-0071-11e7-b11b-4c3275922049", 2017, 3, 4, 3001, 202),
    ("1ef8164f-0071-11e7-80ec-4c3275922049", 2017, 3, 4, 3002, 201),
    ("1ef81717-0071-11e7-9791-4c3275922049", 2017, 3, 4, 3001, 202),
    ("1ef8188a-0071-11e7-a448-4c3275922049", 2017, 3, 4, 3001, 201),
    ("1ef81917-0071-11e7-9215-4c3275922049", 2017, 3, 4, 3003, 201),
    ("1ef81997-0071-11e7-ac0b-4c3275922049", 2017, 3, 4, 3004, 201),
    ("1ef81a2e-0071-11e7-a560-4c3275922049", 2017, 3, 4, 3003, 201),
    ("1ef81ac2-0071-11e7-9ffe-4c3275922049", 2017, 3, 4, 3001, 205),
    ("1ef81b59-0071-11e7-967a-4c3275922049", 2017, 3, 4, 3003, 202),
    ("4eb72d75-0072-11e7-b160-4c3275922049", 2017, 3, 5, 3001, 201),
    ("4eb72f57-0072-11e7-aa82-4c3275922049", 2017, 3, 5, 3002, 202),
    ("4eb732b0-0072-11e7-9153-4c3275922049", 2017, 3, 5, 3002, 201),
    ("4eb733c0-0072-11e7-b177-4c3275922049", 2017, 3, 5, 3001, 202),
    ("4eb734e1-0072-11e7-aeb5-4c3275922049", 2017, 3, 5, 3003, 204),
    ("4eb7358a-0072-11e7-a629-4c3275922049", 2017, 3, 5, 3003, 204),
    ("4eb7364a-0072-11e7-b999-4c3275922049", 2017, 3, 5, 3001, 204),
    ("4eb73780-0072-11e7-b7c1-4c3275922049", 2017, 3, 5, 3001, 202),
    ("4eb7385c-0072-11e7-a8c5-4c3275922049", 2017, 3, 5, 3003, 201),
    ("4eb73907-0072-11e7-9caf-4c3275922049", 2017, 3, 5, 3001, 202)
]

def daily_page_view_key(page_id, year, month, day):
    """Builds a structured key of the form page:{page_id}:unique:{year}:{month}:{day} to track
    unique page views
    """

    return  "page:" + str(page_id) + ":unique:" + str(year) + ":" + str(month) + ":" + str(day)

def record_user_page_view(r, pid, year, month, day, uid):
    "Records a page view in Redis to generate unique viewers"
    
    key = daily_page_view_key(pid, year, month, day)
    return r.sadd(key, uid)

def log_page_view_event(eid, pid, uid, year, month, day):
    "Utility function to log page view event for set exercises"

    print ("%s PAGE_VIEW: %04d user %04d %04d-%02d-%02d" % (eid, pid, uid, year, month, day))

def process_page_view_events(r, events):
    """Reads a list of events of the form:
    (event_id, year, month, day, user_id, page_id)
    and processes them into daily unique views
    """
    
    cnt = 0
    for (eid, year, month, day, uid, pid) in events:
        log_page_view_event(eid, pid, uid, year, month, day)
        record_user_page_view(r, pid, year, month, day, uid)
        cnt += 1
        
    print ("Events processed: {}".format(cnt))
                   
process_page_view_events(r, sample_page_view_events) 


When you execute the sample code, you should see a sequence of 20 log-like messages
representing processed page view events from our simulated stream which spans a
combination of users, pages, and dates. Each of these log messages corresponds to one page
view being recorded in the database.

We can use a tool such as RedisInsight to see the state of our database after we finish processing the stream of data.

Looking at the database state, you can see how our code has constructed
several sets of unique viewers for specific pages and dates. Using the simulated stream
provided, you should see that four unique users (3001, 3002, 3003, and 3004) viewed page 201
on March 4, 2017.
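
If you prefer to inspect the database from Python instead of a GUI tool, a minimal sketch along the following lines may help. It assumes a redis-py client and uses its `scan_iter` and `smembers` methods; the helper name `dump_unique_viewer_sets` and the default pattern are illustrative assumptions.

```python
def dump_unique_viewer_sets(r, pattern="page:*:unique:*"):
    """Hypothetical helper: returns {key: set_of_members} for every key matching pattern."""
    result = {}
    # scan_iter walks the keyspace incrementally using the SCAN command
    for key in r.scan_iter(match=pattern):
        result[key] = r.smembers(key)
    return result
```

With the client `r` created in the earlier cell, `dump_unique_viewer_sets(r)` should return one entry per page and day. Note that redis-py returns keys and members as bytes unless the client was created with `decode_responses=True`.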

## Counting Unique Page Views

Now that our event stream has been processed, we can use the results in Redis to compute a
variety of additional metrics. The first example we will look at is how we can compute the
daily unique viewers for the site.

With the Redis SCARD (**S**et **CARD**inality) command we can get the cardinality, or
size, of a set from the server. The set size is our unique viewer count. In the sample
code below, we implement a function to return unique viewers using the SCARD command:

In [None]:
def get_unique_views(r, pid, year, month, day):
    """Returns the number of unique views for a page on the given date"""
    
    key = daily_page_view_key(pid, year, month, day)
    # scard is cardinality of the set 
    return r.scard(key)
    

# Fetch page views for March 4, 2017
print ("Page 201 Unique views for March 4 2017: ", get_unique_views(r, 201, 2017, 3, 4))


### Maintaining a Secondary Index

Using the SCAN command is one way to find our data. Another way is to maintain a
*secondary index*, using Redis data structures to keep track of the keys that hold our
data. Maintaining a secondary index requires additional work on our part, but it is also
an effective way to speed up access to our data.

Implementing a secondary index is straightforward, and we can use the Redis set data type
to implement it. We only need to modify our `record_user_page_view` code to maintain the
index at insert time. An updated version of `record_user_page_view` with secondary index
management is shown below. Running the cell also reloads the sample data so that the
index is populated:

In [None]:
def log_page_view(pid, year, month, day, views):
    "Utility function to print page views for set exercises"

    print ("Unique Page Views %04d %04d-%02d-%02d: %03d" % (pid, year, month, day, views))

def convert_key_to_components(key):
    "Returns a (pid, year, month, day) tuple from a key"

    comps = key.split(b':')
    return (int(comps[1]), int(comps[3]), int(comps[4]), int(comps[5]))

def secondary_page_index():
    "Returns the key of the secondary index that tracks all daily page view keys"

    return "index:unique-page"

def record_user_page_view(r, pid, year, month, day, uid):
    "Records a page view in Redis to generate unique viewers"

    key = daily_page_view_key(pid, year, month, day)

    # maintain the secondary index: remember every daily key we write to
    r.sadd(secondary_page_index(), key)

    # add the viewer to the daily set for this page
    return r.sadd(key, uid)


process_page_view_events(r, sample_page_view_events) 


Unless our reporting code takes advantage of the secondary index, building it
is wasted work for our database. The `report_unique_page_views` function below
reads its keys from the secondary index stored at `index:unique-page` instead of
doing a key scan. The scan-based lines are left commented out for comparison.

In [None]:
def report_unique_page_views(r):
    "Implements a basic report of unique page views for the data in Redis"
    
    # keys = scan_keys(r, 'page:*:unique:*')
    # keys.sort()
    idx_name = secondary_page_index()
    keys = r.smembers(idx_name)
        
    for key in keys:
        pid, year, month, day = convert_key_to_components(key)
        
        views = get_unique_views(r, pid, year, month, day)
        log_page_view(pid, year, month, day, views)
        

report_unique_page_views(r)


Our example of secondary indexing makes a classic tradeoff: it increases the
amount of work when data is inserted to reduce the amount of work required to
access our data. This may or may not be the right choice for your workloads.
The best way to find out is to profile them.

### Monthly Active Users

Our unique viewer storage scheme is flexible enough that we can compute Monthly Active
Users (MAUs) quickly from the processed data we already loaded into Redis. The Monthly
Active User metric tracks the number of unique visitors to a site aggregated over the
entire month.

Using Redis' set manipulation commands, we can take the daily page data and generate
statistics for the monthly viewers. Usually MAUs are tracked at the site level, so we will
aggregate our statistics for the entire site and not individual pages.

Building up a set of monthly users from our existing data can be accomplished with a
simple procedure:

* Use our page index to find Daily Views
* Iteratively build a set of unique monthly viewers
* Count the monthly viewers

This can be accomplished using three Redis commands: the SMEMBERS (**S**et **MEMBERS**)
command, the SUNIONSTORE (**S**et **UNION** **STORE**) command, and the SCARD command we learned earlier.

The SUNIONSTORE command unions several sets and stores the result. It takes as
parameters the destination key in which to store the result and a sequence of one or more
sets that are unioned together into the final result. The union operation here refers to
the familiar set union operation: if our database has three sets, `s1 = (a, b, c)`,
`s2 = (a, b, f)`, and `s3 = (b, c, f)`, the result of calling `SUNIONSTORE s4 s1 s2 s3`
would be to store in s4 the set `(a, b, c, f)`.

The SMEMBERS command simply returns all of the members of the set provided as a
parameter. The result will be returned as a Python set. Remember that because
we are working with sets, we cannot depend on the order of the returned results.

In the following sample, we show code to compute our monthly user set and
store the results in Redis:

In [None]:
def get_keys_from_secondary_index(r):
    "Returns the keys from the secondary index"

    idx_name = secondary_page_index()
    return r.smembers(idx_name)

def get_mau_key(year, month):
    "Returns the key for the MAU storage"

    return "site:metrics:" + str(year) + ":" + str(month)

def compute_monthly_users(r, year, month):
    "Computes the set of active monthly users and stores in Redis"
    
    keys = get_keys_from_secondary_index(r)
    for key in keys:
        k_pid, k_year, k_month, k_day = convert_key_to_components(key)
        
        if k_year == year and k_month == month:
            mau_key = get_mau_key(k_year, k_month)
            day_key = daily_page_view_key(k_pid, k_year, k_month, k_day)
            r.sunionstore(mau_key, mau_key, day_key)

def get_mau_count(r, year, month):
    "Returns the count of active monthly users"
    
    return r.scard(get_mau_key(year, month))

compute_monthly_users(r, 2017, 3)
print ("Monthly Active Users (MAU) for March 2017: {}".format(get_mau_count(r, 2017, 3)))


In our sample code, we relied on the fact that Redis uses sensible defaults instead
of returning errors whenever possible. We iteratively build our final monthly user
set by applying the union operator to our results so far and the daily set we
are currently processing. We don't have to special-case the first iteration of the loop,
because Redis treats a nonexistent key like the empty set, so the union proceeds
without error.
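
Because SUNIONSTORE accepts several source keys at once, the loop can also be collapsed into a single call. The following is a sketch of that variant, not the chapter's original method: the function name and the `index_key` parameter are assumptions, and the key formats are copied from the helpers above so that the block stands on its own. Unlike the iterative version, it overwrites the monthly set instead of adding to it.

```python
def compute_monthly_users_single_call(r, year, month, index_key="index:unique-page"):
    """Hypothetical variant: unions all matching daily sets with one SUNIONSTORE call.

    Returns the size of the resulting set (0 if no daily sets match).
    """
    day_keys = []
    for key in r.smembers(index_key):
        # keys look like page:{page_id}:unique:{year}:{month}:{day}; may be bytes or str
        text = key.decode() if isinstance(key, bytes) else key
        parts = text.split(":")
        if int(parts[3]) == year and int(parts[4]) == month:
            day_keys.append(text)

    if not day_keys:
        return 0  # SUNIONSTORE needs at least one source key

    mau_key = "site:metrics:{}:{}".format(year, month)
    # redis-py accepts a list of source keys as the second argument
    return r.sunionstore(mau_key, day_keys)
```

This trades many round trips for one, which tends to matter more as the number of daily sets grows.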

There are many other commands in Redis that provide the familiar operations on sets.
Each of these operations comes in two variants: one that returns the result and
one that stores the result. The set operation commands provided in Redis are:

Operation    | Results Returned | Results Stored 
-------------|------------------|----------------
Union        | SUNION           | SUNIONSTORE   
Intersection | SINTER           | SINTERSTORE    
Difference   | SDIFF            | SDIFFSTORE     

More details about the set operation commands can be found in the [documentation](https://redis.io/commands#set)
page on [Redis.io](https://www.redis.io).
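
Python's built-in set operators follow the same semantics as these commands, so they are a convenient way to check your expectations before running the Redis versions. The sketch below reuses the `s1`, `s2`, and `s3` sets from the SUNIONSTORE explanation. It runs entirely in Python and does not talk to Redis.

```python
s1 = {"a", "b", "c"}
s2 = {"a", "b", "f"}
s3 = {"b", "c", "f"}

print(sorted(s1 | s2 | s3))  # union, like SUNION s1 s2 s3        -> ['a', 'b', 'c', 'f']
print(sorted(s1 & s2 & s3))  # intersection, like SINTER s1 s2 s3 -> ['b']
print(sorted(s1 - s2))       # difference, like SDIFF s1 s2       -> ['c']
```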

## Absent Users

Keeping users engaged with your website is critical, but sometimes despite your efforts
users may stop using your site. Often, you want to send an email or reach out to the users
to understand why they stopped visiting and encourage them to come back. We can build this
functionality on top of our processed data using additional features from Redis.

For this example, assume that there is already a processed set stored in the key
`site:users:all_users` that contains the user ids of all currently registered accounts.
Using Redis, we want to determine all the users that haven't visited our site in the last
month, so that we can send them an email inviting them to come back.

Redis also provides a command for testing membership in a particular set, SISMEMBER
(**S**et **IS** **MEMBER**), which we can use in conjunction with our processed data and
our user id set to find missing users.

The SISMEMBER command takes as parameters a set key and a candidate member. It returns
one if the element is a member of the specified set and zero if it is not.

To find absent users we need a function that will:

* Retrieve our registered user set
* Check each member to see if they visited the site in the current month
* Store the absent users in a result set (for other systems to use)

The sample code below shows the Redis commands used to implement this:

In [None]:
import pprint

def all_user_key():
    "Returns the key for all users in the system"

    return "site:users:all_users"

def create_all_users_set(r):
    "Creates a sample all user set"

    users = set()
    for event in sample_page_view_events:
        users.add(event[4])

    users.add(1001)
    users.add(1002)
    users.add(1003)

    r.sadd(all_user_key(), *users)

create_all_users_set(r)

def absent_user_key(year, month):
    "Returns the absent user key for the given year, month combo"

    return "site:users:absent_users:" + str(year) + ":" + str(month)

def generate_absent_users(r, year, month):
    "Computes the absent users for a given year and month and stores result in Redis"

    mau_key = get_mau_key(year, month)
    absent_users_key = absent_user_key(year, month)

    all_users_key = all_user_key()
    users = r.smembers(all_users_key)
    for user in users:
        if not r.sismember(mau_key, user):
            r.sadd(absent_users_key, user)
            
def get_absent_users(r, year, month):
    "Returns the absent users for a given year and month"
    
    absent_users_key = absent_user_key(year, month)
    return r.smembers(absent_users_key)

generate_absent_users(r, 2017, 3)
pprint.pprint(get_absent_users(r, 2017, 3))


The implementation above is deliberately naive: it issues one SISMEMBER call per
registered user, even though Redis provides more efficient set operations, which we have
already discussed, to compute the same result.

In the cell below, try to reimplement the `generate_absent_users` function for
the same set of data using more efficient Redis commands.

In [None]:
# your new code goes here
def generate_absent_users(r, year, month):

    pass
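
If you would like something to compare your answer against, here is one possible approach based on SDIFFSTORE. It is a sketch, not the only valid solution. The function name is hypothetical, chosen so that it does not overwrite your own `generate_absent_users`, and the key formats are copied from the helper functions earlier in the chapter so that the block stands on its own.

```python
def generate_absent_users_sdiff(r, year, month):
    """One possible solution: let Redis compute 'all users minus monthly actives' server-side."""
    all_users_key = "site:users:all_users"
    mau_key = "site:metrics:{}:{}".format(year, month)
    absent_users_key = "site:users:absent_users:{}:{}".format(year, month)

    # SDIFFSTORE dest key1 key2 stores (key1 - key2) in dest and returns the size of dest
    return r.sdiffstore(absent_users_key, all_users_key, mau_key)
```

This replaces one SISMEMBER round trip per user with a single command. Note that SDIFFSTORE overwrites the destination key, whereas the loop-based version only adds members to it.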


## Scaling

Many of the samples shown here would work well for small websites, but as the size of
your dataset grows, you will have to reconsider some of the techniques presented.

One of the first things to consider is how you fetch items from Redis. Most of the
implementations in our sample code read entire sets of keys into memory at once - this is
fine when the sets are small, but as the sets get larger you will need to look into
progressively fetching the results of a query. You may need to refactor your client
application to use an iterator or generator pattern to operate on a subset of results at a
time.
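
As a rough sketch of the generator pattern, the function below reads a large set in batches using redis-py's `sscan_iter`, which wraps the SSCAN command. The helper name `iter_members_in_batches` and the default batch size are assumptions made for illustration.

```python
def iter_members_in_batches(r, key, batch_size=1000):
    """Hypothetical helper: yields lists of at most batch_size members of the set at key."""
    batch = []
    # sscan_iter fetches the set incrementally; count is a hint to the server, not a strict limit
    for member in r.sscan_iter(key, count=batch_size):
        batch.append(member)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

A caller can then loop with `for batch in iter_members_in_batches(r, key):` and process each batch without holding the whole set in memory. Be aware that SSCAN may return the same member more than once if the set is modified during iteration, so downstream processing should tolerate duplicates.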

You may also need to reconsider how you store the data in Redis. The members of a set are
stored as strings, and equality is determined using a byte-wise string comparison. There are
other, more compact ways of representing this data in Redis. One way of reducing the
amount of storage required for larger datasets is to use Redis' bit operators, which treat
strings as vectors of bits. For more information on bitmap operators, see the [String
documentation](https://redis.io/commands#string) at [Redis.io](https://www.redis.io).
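
To give a feel for the bitmap approach, the sketch below records a view by setting the bit whose offset is the user id and counts unique viewers with BITCOUNT. It assumes user ids are small, non-negative integers. The key format `page:{page_id}:unique-bits:{year}:{month}:{day}` and both function names are invented for this illustration.

```python
def bitmap_key(page_id, year, month, day):
    """Hypothetical key format for the bitmap variant."""
    return "page:{}:unique-bits:{}:{}:{}".format(page_id, year, month, day)

def record_view_bitmap(r, page_id, year, month, day, user_id):
    """Sets bit number user_id; returns the previous value of that bit (0 or 1)."""
    return r.setbit(bitmap_key(page_id, year, month, day), user_id, 1)

def count_views_bitmap(r, page_id, year, month, day):
    """Counts the bits set to 1, i.e. the number of distinct user ids recorded."""
    return r.bitcount(bitmap_key(page_id, year, month, day))
```

The string grows to cover the highest offset written, so this representation tends to pay off when user ids are dense and can waste memory when they are sparse and large.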

## Review

This chapter looked at ways we can use Redis sets to process analytics data.
We first looked at how to process a stream of events to generate daily unique users,
then we saw how we could extend our application to compute monthly unique users, and 
finally how to identify absent users.  In the process we learned to use a variety of
Redis commands including:

* SADD
* SCARD
* SCAN
* SUNIONSTORE
* SMEMBERS
* SISMEMBER

We saw how Redis provides commands that modify the membership of sets and commands that
operate on multiple sets to return or store a result. The details of all the commands
Redis provides to work with sets can be found in the [Set
Documentation](https://redis.io/commands#set) on [Redis.io](https://www.redis.io).