# 02 - Non-Personalized Content-Based Recommender

Here, our goal is to develop a non-personalized recommender system. This means the recommender will suggest similar items (e.g. restaurants) for any given item selection, regardless of which user makes the selection.

The code in this notebook was run with the following configuration:

    pd.__version__: 0.21.0

In [1]:
import os

import pandas as pd
from pprint import pprint
from random import choice
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

Load reviews from disk

In [2]:
df_reviews = pd.read_csv('../data/reviews.csv')

First, let's remove records with empty or missing reviews

In [3]:
df_reviews.dropna(subset=['review_text'], inplace=True)
df_reviews['review_text'] = df_reviews.review_text.astype(str)

## Concatenate Reviews per Item

We intend to represent each item by the words mentioned in its reviews. So, the first step is to combine all the reviews of each item, in preparation for computing a bag-of-words representation of it.

In [4]:
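def concat_reviews(df_reviews=None, id_column=None):
    """Yield (id_value, combined_review_text) for each group in id_column."""
    assert id_column in ('user_id', 'item_id')
    for id_value, df in df_reviews.groupby(id_column):
        # Join all reviews in the group into one space-separated string.
        yield (id_value, ' '.join(df.review_text.tolist()))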

In [5]:
df_user_reviews = pd.DataFrame(
    concat_reviews(df_reviews, id_column='user_id'),
    columns=['user_id', 'review_text']
)
df_user_reviews.to_csv(os.path.join('../data/user-reviews.csv'), index=False)

df_item_reviews = pd.DataFrame(
    concat_reviews(df_reviews, id_column='item_id'),
    columns=['item_id', 'review_text']
)
df_item_reviews.to_csv(os.path.join('../data/item-reviews.csv'), index=False)

del df_reviews    # We don't need this anymore!

In [6]:
print('Number of items: {:,}'.format(len(df_item_reviews)))
print('Number of users: {:,}'.format(len(df_user_reviews)))

Number of items: 4,174
Number of users: 11,165


## Create Bag-of-Words Representation

Next, using the newly created dataframe (i.e. `df_item_reviews`), we create a TF-IDF matrix where the rows represent items and the columns represent the words mentioned across all item reviews. Each cell holds the TF-IDF weight of a word in the combined reviews of an item.

In [7]:
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df_item_reviews.review_text)
print('TF-IDF matrix created with dimensions', matrix.shape)

TF-IDF matrix created with dimensions (4174, 49225)
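
To make the shape of this matrix concrete, here is a minimal sketch on a toy corpus. The three review strings and the `toy_*` names are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical combined reviews for three items (not from the dataset).
toy_reviews = [
    'great pizza and pasta',
    'great sushi and ramen',
    'pizza pizza pizza',
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# One row per item, one column per distinct word in the corpus.
print(toy_matrix.shape)

# The learned vocabulary maps each word to its column index.
print(sorted(toy_vectorizer.vocabulary_))
```

The result is a sparse matrix, which is why a vocabulary of tens of thousands of words is still manageable in memory.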


When dealing with a large corpus of text, the vocabulary can be massive. Because each item (or user) is then represented in a high-dimensional space, computing similarities can be costly.

Optionally, if resources are scarce (e.g. you're building a recommender on a laptop), you can reduce the dimensionality so that each item is represented using fewer latent features, for example:

In [8]:
# Optional: comment out these lines if you want to skip dimensionality reduction.
# 
n_features = 100    # a higher number of components may result in better recommendations.
reducer = TruncatedSVD(n_components=n_features)
matrix = reducer.fit_transform(matrix)
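
One way to judge whether the chosen number of components is reasonable is to check how much variance the reduced representation retains. The sketch below uses random toy data and hypothetical `toy_*` names, so the printed ratio says nothing about the review dataset itself.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the TF-IDF matrix: 50 "items" with 20 "word" features.
rng = np.random.RandomState(0)
toy_matrix = rng.rand(50, 20)

toy_reducer = TruncatedSVD(n_components=5, random_state=0)
toy_reduced = toy_reducer.fit_transform(toy_matrix)

# Each item is now described by 5 latent features instead of 20.
print(toy_reduced.shape)

# Fraction of the original variance retained by the 5 components.
print(toy_reducer.explained_variance_ratio_.sum())
```

Note that the output of `fit_transform` is a dense array, so the memory saving comes from the much smaller number of columns.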

In [9]:
# IMPORTANT: this resets the integer index of the item reviews. It's crucial
# that the index starts at 0 and increases consecutively to num_of_items - 1.
# It will then be used to build an item_id lookup table.
df_item_reviews.reset_index(drop=True, inplace=True)

Computing pairwise similarities (i.e. each item versus every other item) can be expensive in both time and memory, because `cosine_similarity` returns a dense matrix with one row and one column per item. Therefore, we split our item profiles into `n_parts` batches, each covering a contiguous range of rows. That way, we can compute the similarities of the rows in each batch against all rows in the matrix, one batch at a time.

In [10]:
def get_split_intervals(n_items: int, n_parts: int):
    """Get intervals for splitting an array of n_items into n_parts

    Args:
        n_items (int): The number of items in the array
        n_parts (int): The number of parts to split the array

    Returns:
        list: A list of tuples, each containing a lower and upper-bound index.
    """
    min_idx, max_idx = 0, 0
    intervals = []
    for p in range(0, n_parts):
        max_idx += int(n_items / n_parts)
        interval = (min_idx, n_items) if p == n_parts - 1 else (min_idx, max_idx)
        intervals.append(interval)
        min_idx = max_idx
    return intervals


n_parts = 5
intervals = get_split_intervals(n_items=matrix.shape[0], n_parts=n_parts)
print('intervals: ', intervals)

intervals:  [(0, 834), (834, 1668), (1668, 2502), (2502, 3336), (3336, 4174)]
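
As a sanity check on the batching idea, the following sketch (toy random data, hypothetical `toy_*` names) shows that stacking the per-batch results reproduces the full similarity matrix, while each batch only needs a block of `(upper - lower) x n_items` values at a time.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy item profiles: 10 items with 4 features each.
rng = np.random.RandomState(0)
toy_profiles = rng.rand(10, 4)

# Hand-written intervals in the same (lower, upper) format as above.
toy_intervals = [(0, 3), (3, 6), (6, 10)]

# Full pairwise similarity matrix, computed in one go.
full = cosine_similarity(toy_profiles)

# The same similarities, computed one batch of rows at a time.
blocks = [
    cosine_similarity(toy_profiles[lower:upper], toy_profiles)
    for lower, upper in toy_intervals
]
stacked = np.vstack(blocks)

print(np.allclose(full, stacked))
```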


Let's also create a lookup for item IDs. Essentially, it's a dictionary (or a lookup table) where

- the **keys** correspond to the row index of the item in the `df_item_reviews` dataframe, and
- the **values** correspond to the actual ID of the item (i.e. the restaurant).

In [11]:
item_id_lookup = df_item_reviews.item_id.to_dict()  
pprint(list(item_id_lookup.items())[:3])

[(0, '--9e1ONYQuAa-CB_Rrw7Tw'),
 (1, '--cZ6Hhc9F7VkKXxHMVZSQ'),
 (2, '-0NhdsDJsdarxyDPR523ZQ')]
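
If you also need to go the other way, from an item ID back to its row index, the lookup can be inverted. This is a small sketch on made-up IDs; `toy_items`, `toy_lookup` and `toy_reverse` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy dataframe with a clean 0..n-1 index, like df_item_reviews after reset_index.
toy_items = pd.DataFrame({'item_id': ['a1', 'b2', 'c3']})

# Row index -> item ID (same construction as item_id_lookup).
toy_lookup = toy_items.item_id.to_dict()

# Item ID -> row index; this assumes item IDs are unique.
toy_reverse = {item_id: row for row, item_id in toy_lookup.items()}

print(toy_lookup)
print(toy_reverse['b2'])
```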


## Generate Recommendations

In [12]:
recommendations = []
recommendation_size = 10
for z, (min_idx, max_idx) in enumerate(intervals):
    print('Processing batch {} of {}...'.format(z, n_parts))
    sims = cosine_similarity(matrix[min_idx: max_idx], matrix)

    for idx in range(min_idx, max_idx):
        # query_item_idx is the index of the item we want 
        # to generate recommendations for.
        query_item_idx = idx    
        
        # Let's get the actual ID of the item in position `query_item_idx`
        query_item_id = item_id_lookup[query_item_idx]

        # Rank all items in decreasing order of similarity to the query item.
        # Position 0 is normally the query item itself (similarity 1.0), so we
        # skip it; the slice [1:recommendation_size] therefore keeps the
        # next recommendation_size - 1 most similar items.
        sim_item_idxs = sims[query_item_idx - min_idx].argsort()[::-1][1:recommendation_size]
        
        # Convert those index positions to actual ID values using our lookup table.
        sim_item_ids = [item_id_lookup[item_idx] for item_idx in sim_item_idxs]
        
        # Get the similarity scores for the recommendations.
        sim_scores = sims[query_item_idx - min_idx][sim_item_idxs].tolist()
        
        recommendation = dict(
            item_id=query_item_id,
            sim_item_ids=','.join(sim_item_ids),
            sim_scores=','.join([str(s) for s in sim_scores])
        )
        
        recommendations.append(recommendation)

Processing batch 0 of 5...
Processing batch 1 of 5...
Processing batch 2 of 5...
Processing batch 3 of 5...
Processing batch 4 of 5...
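
The loop above relies on the query item ranking first in its own similarity row. If two items have identical profiles, that may not hold. A slightly more defensive variant, sketched here with a hypothetical helper `top_n_similar` and a toy similarity row, removes the query index explicitly before taking the top n.

```python
import numpy as np


def top_n_similar(sim_row, query_idx, n):
    """Return indexes of the n most similar items, excluding the query item."""
    # Indexes ordered from most to least similar.
    order = np.argsort(sim_row)[::-1]
    # Drop the query item wherever it landed in the ranking.
    order = order[order != query_idx]
    return order[:n]


# Toy similarity row for query item 1; item 3 is an exact duplicate of it.
toy_sims = np.array([0.2, 1.0, 0.9, 1.0, 0.5])
print(top_n_similar(toy_sims, query_idx=1, n=3))
```

With this approach, `n` is exactly the number of recommendations returned, as long as there are at least `n` other items.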


Let's convert our recommendations to a DataFrame, because we 💚 DataFrames. The columns are:
- `item_id`: The ID of the item.
- `sim_item_ids`: The recommendations for the item in `item_id`
- `sim_scores`: The similarities of the recommendations to the item in `item_id`

In [13]:
df_recommendations = pd.DataFrame(recommendations)
display(df_recommendations.head(3))

| | item_id | sim_item_ids | sim_scores |
|---|---|---|---|
| 0 | --9e1ONYQuAa-CB_Rrw7Tw | xkVMIk_Vqh17f48ZQ_6b0w,J4CATH00YZrq8Bne2S4_cw,... | 0.9934357337678983,0.9919156041485075,0.991792... |
| 1 | --cZ6Hhc9F7VkKXxHMVZSQ | NLEe-RzDSU-5BN6xp_WWCw,EGI8uU1uf0msVtu8XDrNIw,... | 0.9167905901560954,0.8988767942795876,0.893639... |
| 2 | -0NhdsDJsdarxyDPR523ZQ | Dxaz8OxaadecnWol18kAtw,bsFZnc2mYyGy5cFf5EWTWg,... | 0.8292243918141551,0.8219221781649738,0.820271... |


## Example Recommendations

Now, given any item ID, we should be able to produce a recommendation of the top-N most similar items (e.g. restaurants) based on the words that occur in their reviews.

In [14]:
df_restaurants = pd.read_csv('../data/items.csv')

# Let's randomly choose 3 restaurants
random_item_ids = df_restaurants.sample(3).item_id.values.tolist()

In [15]:
for item_id in random_item_ids:
    print('Generating recommendation for {}'.format(
        df_restaurants.query('item_id == @item_id').item_name.values[0]
    ))
    row = df_recommendations.query('item_id == @item_id')
    df_mlt = pd.DataFrame({
        'item_id': row.sim_item_ids.values[0].split(','),
        'similarities': row.sim_scores.values[0].split(',')
    })

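    columns_to_show = ['item_name', 'categories', 'average_rating', 'similarities']
    # Join on item_id to attach the restaurant details to each recommendation.
    df_biz_info = pd.merge(df_restaurants, df_mlt, on='item_id', how='inner')
    # Sort in ascending order, so the most similar restaurants appear last.
    # Note: the similarities are still strings here, so they sort lexicographically.
    df_biz_info = df_biz_info[columns_to_show].sort_values(
        ['similarities', 'average_rating']
    )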
    display(df_biz_info)

    

Generating recommendation for Oliva


| | item_name | categories | average_rating | similarities |
|---|---|---|---|---|
| 2 | Tappo Wine Bar & Restaurant | Nightlife,Restaurants,Canadian (New),Italian,W... | 3.0 | 0.6526109333265193 |
| 0 | Il Fornaio | Pizza,Seafood,Italian,Restaurants | 3.5 | 0.6566249215013181 |
| 8 | Roma Deli 1 & Restaurant | Delis,Italian,Restaurants | 4.5 | 0.6566437152594679 |
| 5 | Vargas Steakhouse & Sushi | Asian Fusion,Restaurants,Sushi Bars,Steakhouses | 3.5 | 0.6626549214920852 |
| 1 | La Scala | Restaurants,Italian | 4.0 | 0.6664917870830185 |
| 4 | Kit Kat Italian Bar & Grill | Restaurants,Italian,Mediterranean | 3.5 | 0.6726781569185242 |
| 7 | Vivoli | Food Delivery Services,Restaurants,Food,Italia... | 3.5 | 0.6743097243458839 |
| 3 | Spuntini Restaurant & Bar | Restaurants,Italian | 4.0 | 0.6753488505861416 |
| 6 | Franco's Trattoria | Italian,Restaurants | 4.5 | 0.7255768897501995 |


Generating recommendation for True Island BBQ


| | item_name | categories | average_rating | similarities |
|---|---|---|---|---|
| 8 | Aloha Hawaiian BBQ | Hawaiian,Barbeque,Restaurants | 3.5 | 0.8624947820243573 |
| 1 | Chicken Latino | Peruvian,Latin American,Restaurants | 4.0 | 0.8629458898326695 |
| 2 | Aloha Kitchen | Hawaiian,Restaurants | 3.5 | 0.8637571846537101 |
| 6 | Aloha Kitchen & Bar | Hawaiian,Karaoke,Nightlife,Restaurants,Bars | 3.5 | 0.8658845557156303 |
| 4 | The Stockyards | Barbeque,American (Traditional),Restaurants,So... | 4.0 | 0.8669109293801532 |
| 3 | Memphis Championship Barbecue | Barbeque,Restaurants | 3.0 | 0.868985194968142 |
| 0 | Lucille's Smokehouse Bar-B-Que | Cajun/Creole,Barbeque,Smokehouse,Food,Restaurants | 4.0 | 0.8849223445751552 |
| 5 | Island Flavor | Restaurants,Hawaiian,Event Planning & Services... | 4.5 | 0.8893857973238805 |
| 7 | Ohana Hawaiian BBQ | Hawaiian,Restaurants,Barbeque,Chinese,Cantonese | 3.5 | 0.9133319786510306 |


Generating recommendation for El Sol Mexican Art Cafe


| | item_name | categories | average_rating | similarities |
|---|---|---|---|---|
| 5 | Cafe Costa Rica | Latin American,Caribbean,Restaurants | 3.5 | 0.9008344878904984 |
| 2 | Taqueria Guadalajara | Mexican,Restaurants | 4.0 | 0.901085567017958 |
| 6 | El Trompo Taco Bar | Mexican,Restaurants | 4.0 | 0.9016792712583106 |
| 1 | Three Amigos Mexican Restaurant and Cantina | Restaurants,Mexican | 4.0 | 0.9052276194699012 |
| 7 | Jose Cuervo Tequileria | Mexican,Restaurants | 2.0 | 0.9089910086545344 |
| 0 | Pink Taco | Mexican,Restaurants | 3.5 | 0.9160802800245356 |
| 4 | Diego Mexican Cuisine | Nightlife,Restaurants,Mexican | 3.5 | 0.9165370508591864 |
| 3 | Zócalo Tequilería | Cocktail Bars,Bars,Mexican,Restaurants,Nightlife | 2.5 | 0.9255605248438642 |
| 8 | Tacos El Asador | Latin American,Mexican,Restaurants | 4.0 | 0.9313126015979492 |
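
The lookup steps used in these examples could be wrapped in a small helper. The sketch below is one possible shape for it, not the notebook's own method: `similar_items` and the toy dataframes are hypothetical, and it assumes the recommendation table uses the same comma-separated `sim_item_ids` and `sim_scores` columns as `df_recommendations`. It converts the scores to floats so that sorting is numeric, and lists the most similar item first.

```python
import pandas as pd


def similar_items(df_recs, df_items, item_id):
    """Return the items recommended for item_id, most similar first."""
    row = df_recs.loc[df_recs.item_id == item_id]
    if row.empty:
        # Unknown item ID: return an empty result instead of raising.
        return pd.DataFrame(columns=['item_id', 'item_name', 'similarity'])

    # Parse the comma-separated strings back into a list of IDs and floats.
    df_similar = pd.DataFrame({
        'item_id': row.sim_item_ids.iloc[0].split(','),
        'similarity': [float(s) for s in row.sim_scores.iloc[0].split(',')],
    })

    # Attach item details, then sort numerically in descending order.
    return (
        df_similar.merge(df_items, on='item_id', how='inner')
        .sort_values('similarity', ascending=False)
        .reset_index(drop=True)
    )


# Toy data standing in for df_recommendations and df_restaurants.
toy_recs = pd.DataFrame({
    'item_id': ['a1'],
    'sim_item_ids': ['b2,c3'],
    'sim_scores': ['0.75,0.9'],
})
toy_items = pd.DataFrame({
    'item_id': ['a1', 'b2', 'c3'],
    'item_name': ['Alpha', 'Beta', 'Gamma'],
})

result = similar_items(toy_recs, toy_items, 'a1')
print(result)
```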
