## Chapter 5

In this chapter I will introduce the Mean Average Precision (MAP). In addition, I will try to answer the question *"What is the worst we could do?"* by computing the MAP@10 for random recommendations. 

### 5.1 Evaluation Metric

In the Kaggle competition we are asked to evaluate our recommendations in terms of [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)) at 10 recommendations (MAP@10).

The Average Precision (AP) is defined as (from Wikipedia):

$$\text{AP} = \frac{\sum_{k=1}^{n} (P(k) \times rel(k))}{\text{number of relevant documents}}$$

where $k$ is the rank in the sequence of retrieved documents, $n$ is the number of retrieved documents, $P(k)$ is the precision at cut-off $k$ in the list and $rel(k)$ is an indicator function equal to 1 if the item at rank $k$ is a relevant document and 0 otherwise. Let's illustrate this with an example. Say that a user has interacted with 5 items, which we will denote with numbers:

```
actual_items = [3, 7, 4, 2, 5]
```

Now let's say that our algorithm, out of all the items in stock, recommends the following 10, ranked by some score designed to represent that customer's preferences:

```
recommended_items = [12, 7, 53, 90, 3, 23, 14, 37, 18, 67]
```

Then the AP@10 is computed as follows: our first recommendation fails (since 12 is not among the actual items). The second recommendation is a "hit": by the time we have made two recommendations we got 1 right, so we add $1/2$. Our third recommendation is again a miss, as is the fourth. The fifth one, however, is another hit, so we add $2/5$. From there on we don't get any more hits. Therefore, the AP@10 is:

$$ \text{AP@10} = \frac{0.5 + 0.4}{5} = 0.18$$
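The hand calculation above can be traced rank by rank. The snippet below is only an illustrative sketch: `precision_terms` is a hypothetical helper (not part of any library), and it assumes the recommended list contains no duplicates.

```python
actual_items = [3, 7, 4, 2, 5]
recommended_items = [12, 7, 53, 90, 3, 23, 14, 37, 18, 67]


def precision_terms(actual, predicted):
    """Return (rank, item, P(rank)) for every rank whose item is relevant.

    Hypothetical helper for illustration; assumes `predicted` has no duplicates.
    """
    relevant = set(actual)
    terms, hits = [], 0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            hits += 1
            # precision at this cut-off: hits so far / recommendations so far
            terms.append((rank, item, hits / rank))
    return terms


terms = precision_terms(actual_items, recommended_items)
for rank, item, p_at_rank in terms:
    print(f"rank {rank}: item {item} is a hit, P({rank}) = {p_at_rank}")

# divide the sum of the precision terms by the number of relevant items
ap_at_10 = sum(p for _, _, p in terms) / len(actual_items)
print(round(ap_at_10, 2))
```

Only ranks 2 and 5 contribute, with precisions $1/2$ and $2/5$, which matches the two terms in the numerator above.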

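The MAP@10 is simply the mean of the AP@10 values over all users. Note that the implementation below divides by `min(len(actual), k)` rather than by the number of relevant items; the two agree in our example, since the user has 5 relevant items and $k=10$. In Python, AP and MAP are implemented as (**credit goes [here](https://github.com/benhamner/Metrics/tree/master/Python/ml_metrics)**):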

In [1]:
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

Let's check that the code reproduces our example:

In [2]:
actual_items = [3, 7, 4, 2, 5]
actual_items = [str(i) for i in actual_items]
recommended_items = [12, 7, 53, 90, 3, 23, 14, 37, 18, 67]
recommended_items = [str(i) for i in recommended_items]

apk(actual_items, recommended_items)

0.18
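To see how `mapk` averages over users, and how the `min(len(actual), k)` normaliser and the duplicate check behave, here is a small sketch. It restates `apk` and `mapk` compactly so that it runs on its own; the users and item lists are made up for illustration.

```python
import numpy as np


def apk(actual, predicted, k=10):
    """Compact restatement of the apk function defined above."""
    predicted = predicted[:k]
    score, num_hits = 0.0, 0
    for i, p in enumerate(predicted):
        # a repeated recommendation is only counted the first time it appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1)
    return score / min(len(actual), k) if actual else 0.0


def mapk(actual, predicted, k=10):
    return float(np.mean([apk(a, p, k) for a, p in zip(actual, predicted)]))


# two hypothetical users: the one from the example and one with a perfect top 2
actual = [[3, 7, 4, 2, 5], [1, 2]]
predicted = [
    [12, 7, 53, 90, 3, 23, 14, 37, 18, 67],
    [1, 2, 9, 8, 6, 5, 4, 3, 0, 11],
]
map_two_users = mapk(actual, predicted)  # mean of 0.18 and 1.0

# duplicates: recommending the same relevant item three times counts one hit
ap_duplicates = apk([7], [7, 7, 7])

# a user with 20 relevant items can still score 1.0 at k=10, because the
# denominator is min(20, 10) = 10 and not 20
many_relevant = list(range(20))
ap_many_relevant = apk(many_relevant, many_relevant[:10])
```

The last case is where this implementation departs from the formula quoted from Wikipedia, which would divide by 20 and give 0.5.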

In my working directory I have included a module called `recutils` where I place all the custom submodules that I will use for the different examples. There you can find an `average_precision.py` file with the `apk` and `mapk` functions defined above.

If you want to learn more about Information Retrieval metrics, please visit [this gist](https://gist.github.com/bwhite/3726239) and the references therein. In the future I aim to include a version based on [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (NDCG), where the relevance scale will be based on a measure of the interest of users in items.
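As a rough idea of what such a metric could look like, here is a minimal NDCG@k sketch. It is not the version the notebooks will eventually use: the `relevance` dictionary (item to gain) is a hypothetical stand-in for whatever measure of interest is chosen, and with all gains equal to 1 it reduces to binary relevance.

```python
import math


def ndcg_at_k(relevance, predicted, k=10):
    """NDCG@k for one user.

    relevance : dict mapping item -> non-negative gain (hypothetical scale)
    predicted : ranked list of recommended items
    """
    seen = set()
    dcg = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in seen:  # ignore repeated recommendations
            continue
        seen.add(item)
        dcg += relevance.get(item, 0.0) / math.log2(rank + 1)

    # ideal ordering: the k largest gains placed at the top ranks
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# same user as in the AP example, with binary gains
relevance = {item: 1.0 for item in [3, 7, 4, 2, 5]}
recommended_items = [12, 7, 53, 90, 3, 23, 14, 37, 18, 67]
ndcg_example = ndcg_at_k(relevance, recommended_items)
```

Unlike AP, the discount here depends only on the rank of each hit, and graded gains (for example, a purchase counting more than a visit) fit in without changing the function.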

### 5.2 Random recommendations

One would hope that any algorithm, complex or not, performs better than random recommendations. I would say that if the results of your algorithm are similar to those of random recommendations, it is almost certain that something is wrong in your data. Maybe you made a mistake during feature engineering, or the features you use carry no signal.

In any case, let's see what the MAP@10 is for our dataset when we use random recommendations:

In [3]:
import pandas as pd
import numpy as np
import os

from recutils.average_precision import mapk

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

# load train users dataframe
df_user_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_user_train_feat.p'))
train_users = df_user_train_feat.user_id_hash.unique()

# validation coupons
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_coupons_valid_feat.p'))

# validation activities
df_purchases_valid = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_purchases_valid.p'))
df_visits_valid = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_visits_valid.p'))
# visits name the coupon column differently; align it with the purchases dataframe
df_visits_valid.rename(index=str, columns={'view_coupon_id_hash': 'coupon_id_hash'}, inplace=True)

# subset users that were seen in training
df_vva = df_visits_valid[df_visits_valid.user_id_hash.isin(train_users)]
df_pva = df_purchases_valid[df_purchases_valid.user_id_hash.isin(train_users)]

Note that here we will count all interactions as "hits", whether they are purchases or visits. In the real world this decision would require a more thorough analysis. For example, when computing the MAP we might want to give purchase-related hits a higher weight than visits.

Also, if you are a data scientist, analyst or ML engineer building a recommendation algorithm at your company, one would hope that you are familiar with the product you are building it for. This knowledge can tell you what to do with the different types of interactions.

For example, let's say we are building a recommendation algorithm for an online retail company. A user visits an item page once and spends less than $X$ seconds on it (for example, less than 5 seconds). Very likely this indicates that the user did not like that item. However, if the user visits the item more than once, or spends more than $X$ seconds on the item page, that is probably an indicator of some interest. All this should be considered when implementing your algorithm.

Throughout the notebooks here, purchases and visits will be treated equally when computing the MAP.
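A rule like this could be sketched in pandas as follows. Everything here is hypothetical: the toy `visits` table, the `dwell_seconds` column and the 5-second threshold are assumptions for illustration and are not part of the dataset used in these notebooks.

```python
import pandas as pd

# toy visit log; `dwell_seconds` is an assumed column, not one from our data
visits = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "item_id": ["a", "a", "b", "a", "c"],
    "dwell_seconds": [3, 4, 2, 40, 1],
})

MIN_DWELL_SECONDS = 5  # the threshold X from the text, chosen arbitrarily

# one row per (user, item) with the number of visits and the longest visit
per_pair = (
    visits.groupby(["user_id", "item_id"])
    .agg(n_visits=("dwell_seconds", "size"), max_dwell=("dwell_seconds", "max"))
    .reset_index()
)

# interest = visited more than once OR at least one visit longer than X seconds
per_pair["interested"] = (per_pair.n_visits > 1) | (per_pair.max_dwell > MIN_DWELL_SECONDS)

interested_pairs = set(
    per_pair.loc[per_pair.interested, ["user_id", "item_id"]].itertuples(index=False, name=None)
)
```

Only the pairs flagged as `interested` would then be passed to the metric as relevant items; the single short visits are dropped.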

In [4]:
id_cols = ['user_id_hash', 'coupon_id_hash']
df_interactions_valid = pd.concat([df_pva[id_cols], df_vva[id_cols]], ignore_index=True)
df_interactions_valid = (df_interactions_valid.groupby('user_id_hash')
    .agg({'coupon_id_hash': 'unique'})
    .reset_index())

df_interactions_valid.head()

| | user_id_hash | coupon_id_hash |
|---|---|---|
| 0 | 000cc06982785a19e2a2fdb40b1c9d59 | [68b8f4ff1151b51f864764cab41a30b5, 977a7c4963a... |
| 1 | 002ae30377cd30f65652e52618e8b2d6 | [1ae11153f2bfacec6ab5450d01453c4d, 404d7f06930... |
| 2 | 002b08971471e6083dd716f6c3bb6572 | [fed386703b0295119cadda40b20efcba, bd336887b48... |
| 3 | 003a7b4941222b7e507fdc9e95de2cc1 | [4bb514d6a036c3caba31d4b62f873cd7, b2a128a9175... |
| 4 | 00441c9b51cfe60b82bdf7a20ad79fc8 | [d7fb5915505943b2c7a7f5c1c550bb25, 9beb2e8ddc1... |


Let's discuss an additional consideration. For this particular section it does not matter much, but going forward it is worth mentioning. Some users did not interact with any of the validation coupons during the validation period. This means that, however I rank the validation coupons for these users, their AP is always going to be 0. We could either remove or keep those users; it does not matter as long as ***all algorithm comparisons*** are made with the same set of users.
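The reason the choice does not affect comparisons is simple arithmetic: adding users whose AP is 0 rescales every algorithm's MAP by the same factor. The sketch below uses made-up AP values, and `map_with_zero_users` is a hypothetical helper.

```python
import numpy as np


def map_with_zero_users(ap_scores, n_zero_users):
    """MAP after adding `n_zero_users` users whose AP is necessarily 0."""
    ap_scores = np.asarray(ap_scores, dtype=float)
    return float(ap_scores.sum() / (len(ap_scores) + n_zero_users))


# hypothetical AP values for the users we keep, under two different algorithms
ap_algo_a = [0.18, 1.0, 0.0, 0.5]
ap_algo_b = [0.10, 0.5, 0.2, 0.4]
n_zero_users = 4  # users with no validation-coupon interactions

map_a_kept, map_b_kept = float(np.mean(ap_algo_a)), float(np.mean(ap_algo_b))
map_a_all = map_with_zero_users(ap_algo_a, n_zero_users)
map_b_all = map_with_zero_users(ap_algo_b, n_zero_users)

# both MAPs shrink by the same factor: n_kept / (n_kept + n_zero_users)
shrink = len(ap_algo_a) / (len(ap_algo_a) + n_zero_users)
```

Since the factor is identical for every algorithm, the ranking between algorithms is preserved; only the absolute values change, which is why the user set must be fixed before comparing numbers.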

In [5]:
tmp_valid_dict = pd.Series(df_interactions_valid.coupon_id_hash.values,
    index=df_interactions_valid.user_id_hash).to_dict()
valid_coupon_ids = df_coupons_valid_feat.coupon_id_hash.values

# keep users that, during the validation period, interacted with at least one validation coupon
keep_users = set()
for user, coupons in tmp_valid_dict.items():
    if np.intersect1d(valid_coupon_ids, coupons).size != 0:
        keep_users.add(user)

# out of 6924 users seen during validation, 6071 interacted with at least one validation coupon.
interactions_valid_dict = {k: v for k, v in tmp_valid_dict.items() if k in keep_users}

# for each kept user, this dictionary holds the coupon_id_hash values of the coupons
# that user interacted with during the validation period
interactions_valid_dict['002b08971471e6083dd716f6c3bb6572']

array(['fed386703b0295119cadda40b20efcba',
       'bd336887b48211fdfffefa487a3f9825',
       'b2a128a9175ce0b906f522136861c253',
       '61b14e823046233811b066724ded6ec6'], dtype=object)

Let's then recommend at random and see what MAP we obtain. We run the loop a few times so that we can average over several random-recommendation MAP values:

In [6]:
mapk_l = []

for _ in range(10):
    recomendations_dict = {}
    for user in interactions_valid_dict:
        # draw an independent shuffle per user; storing a reference to one array
        # that is shuffled in place would leave every user with the same ordering
        recomendations_dict[user] = np.random.permutation(valid_coupon_ids)

    actual = []
    pred = []
    for user, recs in recomendations_dict.items():
        actual.append(list(interactions_valid_dict[user]))
        pred.append(list(recs))

    mapk_l.append(mapk(actual, pred))

print(np.mean(mapk_l))

0.007868248221401582


Let's keep that number in mind, since we **ALWAYS** need to do better than that :)
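As a sanity check on the order of magnitude, the expected AP@k of a uniformly random ranking can be written in closed form under a simplifying assumption: the user has $m$ relevant items and all of them are among the $N$ candidates. The item at rank $i$ is relevant with probability $m/N$ and, given that, each of the $i-1$ ranks before it is relevant with probability $(m-1)/(N-1)$, so

$$E[\text{AP@}k] = \frac{1}{\min(m, k)} \sum_{i=1}^{k} \frac{1}{i}\,\frac{m}{N}\left(1 + \frac{(i-1)(m-1)}{N-1}\right)$$

The sketch below implements this and compares it with a small simulation. Both function names and the values of $m$ and $N$ are hypothetical; in our data a user's list of interactions may also contain coupons outside the candidate pool, so this is an approximation rather than a reproduction of the number above.

```python
import numpy as np


def expected_random_apk(m, n_items, k=10):
    """Expected AP@k of a uniformly random ranking (all m relevant items in the pool)."""
    if m == 0:
        return 0.0
    n_ranked = min(k, n_items)
    total = 0.0
    for i in range(1, n_ranked + 1):
        p_hit = m / n_items
        # expected number of earlier hits, given that rank i is a hit
        earlier_hits = (i - 1) * (m - 1) / (n_items - 1) if n_items > 1 else 0.0
        total += p_hit * (1.0 + earlier_hits) / i
    return total / min(m, k)


def simulated_random_apk(m, n_items, k=10, n_trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = np.random.default_rng(seed)
    relevant = np.zeros(n_items, dtype=bool)
    relevant[:m] = True
    scores = []
    for _ in range(n_trials):
        rel_at_rank = relevant[rng.permutation(n_items)[:k]]
        hits_so_far = np.cumsum(rel_at_rank)
        ranks = np.arange(1, len(rel_at_rank) + 1)
        scores.append(np.sum(rel_at_rank * hits_so_far / ranks) / min(m, k))
    return float(np.mean(scores))


analytic = expected_random_apk(m=3, n_items=50)
simulated = simulated_random_apk(m=3, n_items=50)
```

The expected value shrinks roughly like $1/N$ as the candidate pool grows, which is why random recommendations over a large set of coupons give such a small MAP.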