# Content-based recommender filtering

## **Introduction**

Content-based filtering is built from restaurant metadata and restaurant reviews. It recommends restaurants to a user based on that user's profile.

## **Implementation Strategy**

### **1 Ranking**
#### **1.1 Cosine Similarity between user - restaurant vectors by using Tfidf Vectorizer**
Each restaurant's feature vector is computed from its reviews with a Tfidf Vectorizer. Each user's feature vector is the sum of the feature vectors of the restaurants that user rated, weighted by the corresponding user ratings.
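
As a toy illustration of this idea (not the notebook's actual pipeline), the sketch below builds a user vector as the rating-weighted sum of restaurant vectors and scores a candidate restaurant by cosine similarity. The three-dimensional vectors, the restaurant names, and the helper names `user_profile` and `cosine` are made up for the example.

```python
import math

def user_profile(rated, vectors):
    """Rating-weighted sum of the vectors of the restaurants a user rated."""
    dim = len(next(iter(vectors.values())))
    profile = [0.0] * dim
    for restaurant, stars in rated.items():
        for i, value in enumerate(vectors[restaurant]):
            profile[i] += stars * value
    return profile

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical TF-IDF-like vectors over the terms (pizza, sushi, burger)
vectors = {
    "pizzeria": [1.0, 0.0, 0.0],
    "sushi_bar": [0.0, 1.0, 0.0],
    "trattoria": [0.9, 0.0, 0.1],
}
rated = {"pizzeria": 5, "sushi_bar": 1}   # stars given by one user

profile = user_profile(rated, vectors)    # [5.0, 1.0, 0.0]
score = cosine(profile, vectors["trattoria"])
print(profile, round(score, 3))
```

Because the user rated the pizzeria highly, the profile leans toward the "pizza" dimension and the unrated trattoria receives a high similarity score.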

#### **1.2 Predicted user rating of restaurant**
A user's rating of a restaurant is predicted with a supervised regression model. RMSE is used for model selection.
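
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so lower values are better. A minimal sketch with made-up ratings follows; the `rmse` helper is illustrative and not part of the notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual = [5, 3, 4, 1]        # hypothetical true star ratings
predicted = [4, 3, 5, 3]     # hypothetical model predictions
error = rmse(actual, predicted)   # sqrt((1 + 0 + 1 + 4) / 4)
print(error)
```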

### **2 Evaluation using NDCG**
`Normalized Discounted Cumulative Gain` (NDCG) is used to evaluate the ranking strategies above. For each strategy, NDCG@5 and NDCG@10 are computed. The results show Strategy I (cosine similarity, section 1.1) to be consistently the best. Strategy I computes its vectors from restaurant review data, and restaurant metadata can also be extracted from the rich review dataset.
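
A minimal sketch of NDCG@k is shown here, assuming the common formulation with linear gain and a `log2(rank + 1)` discount; the exact variant used for the evaluation is not shown in this notebook, so treat the helper names and the formula choice as assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Actual star ratings of a user's test restaurants, listed in recommended order
ranked_stars = [3, 5, 4, 1, 2]
observed = ndcg_at_k(ranked_stars, 5)
perfect = ndcg_at_k([5, 4, 3, 2, 1], 5)   # an ideal ordering scores 1.0
print(observed, perfect)
```

A ranking strategy that places the restaurants a user actually rated highest near the top gets a score close to 1.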

### **3 Implementation** 

#### **3.1 Development**
The best-performing ranking strategy is chosen to implement the content-based recommender. User and restaurant feature vectors are computed and saved to files. For the `user_id` of interest, cosine similarity scores between the user and every restaurant are calculated and added to the restaurant table as a new column. The restaurant list is then filtered and ranked by descending similarity score to generate the recommendations.
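
The filter-and-rank step can be sketched in a few lines. This is a simplified stand-in that uses a plain dictionary of hypothetical similarity scores instead of a DataFrame; `top_n` is an illustrative name.

```python
def top_n(scores, already_rated, n=10):
    """Return the n highest-scoring restaurant ids the user has not rated yet."""
    candidates = [(rid, s) for rid, s in scores.items() if rid not in already_rated]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [rid for rid, _ in candidates[:n]]

scores = {"a": 0.71, "b": 0.42, "c": 0.93, "d": -0.10}   # hypothetical similarity scores
recommended = top_n(scores, already_rated={"c"}, n=2)
print(recommended)
```

Restaurant `c` has the highest score but is excluded because the user has already rated it.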

#### **3.2 Testing**
Several test cases are run to check completeness and computing time.

In [3]:
# Import required libraries 

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import dill

import matplotlib.pyplot as plt

import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

In [None]:
# Load business data

business = pd.read_csv('clean_business.csv')
business.head()

In [None]:
# Cast the `postal_code` column of the business dataset to string

business['postal_code'] = business.postal_code.astype('str')

In [None]:
# Load review data

review = pd.read_csv('clean_review.csv')
review.head()

#### Subset of the `review` dataset containing only the filtered businesses (US-based restaurants)

In [None]:
%%time

# Clean review data based on cleaned business data (US based restaurant businesses)

review_clean = review[review.business_id.isin(business.business_id.unique())].reset_index(drop=True)

print(len(review), len(review_clean))

#### Split dataset

1. The `review_clean` dataset is randomly split into train and test sets at an 80:20 ratio.
2. Users and businesses that are only present in the test data are moved to the train data.
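
The second step can be sketched in plain Python. This is a simplified illustration on lists of dictionaries rather than the pandas index arithmetic used in the notebook; `move_unseen_to_train` is a hypothetical helper that moves one review per unseen id and leaves the rest in the test set.

```python
def move_unseen_to_train(train, test, key):
    """Move one test row per key value that never appears in train.

    `train` and `test` are lists of dicts; `key` is e.g. 'user_id'.
    """
    seen = {row[key] for row in train}
    new_train, new_test = list(train), []
    for row in test:
        if row[key] not in seen:
            new_train.append(row)   # first review of an unseen user/business
            seen.add(row[key])
        else:
            new_test.append(row)
    return new_train, new_test

train = [{"user_id": "u1", "business_id": "b1"}]
test = [
    {"user_id": "u2", "business_id": "b1"},
    {"user_id": "u2", "business_id": "b2"},
    {"user_id": "u1", "business_id": "b2"},
]
# Apply the user step first, then the business step on its result
train, test = move_unseen_to_train(train, test, "user_id")
train, test = move_unseen_to_train(train, test, "business_id")
print(len(train), len(test))
```

Running the business step on the output of the user step keeps both sets of moves, so every user and business in the test set also appears in the train set.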

In [None]:
%%time

# Split review data

train_review_clean, test_review_clean = train_test_split(review_clean, test_size=0.2, random_state=42)

print('Current train-test ratio: ', len(train_review_clean)/len(review_clean))
print('Test dataset: ', len(test_review_clean))

#### The `text` and `date` columns each contain one NaN value (in separate rows)

In [None]:
mask = train_review_clean.text.apply(lambda x: type(x) == float)
print(train_review_clean[mask])

In [None]:
train_review_clean = train_review_clean.dropna()

#### Move reviews of users that are only present in test data

In [None]:
test_only_user = test_review_clean[~test_review_clean.user_id.isin(train_review_clean.user_id.unique())]

idx_user = test_only_user['user_id'].drop_duplicates().index
idx_train = train_review_clean.index.union(idx_user)
idx_test = review_clean.index.difference(idx_train)

train_review = review_clean.loc[idx_train]
test_review = review_clean.loc[idx_test]

print('Current train-test ratio: ',len(train_review)/len(review_clean))

#### Move reviews of businesses that are only present in test data

In [None]:
test_only_bus = test_review_clean[~test_review_clean.business_id.isin(train_review_clean.business_id.unique())]

idx_bus = test_only_bus['business_id'].drop_duplicates().index
# Build on the train index from the user step so that those moves are kept
idx_train = train_review.index.union(idx_bus)
idx_test = review_clean.index.difference(idx_train)

train_review = review_clean.loc[idx_train]
test_review = review_clean.loc[idx_test]

print('Current train-test ratio: ',len(train_review)/len(review_clean))

### 1 Ranking 
#### 1.1 Cosine Similarity between user - restaurant vectors by using Tfidf Vectorizer

#### Concatenate all reviews for each `business_id`

In [None]:
# Drop rows with missing values before concatenating the review text
review_clean = review_clean.dropna()

In [None]:
# Combine all reviews of each restaurant into a single document

bus_rev = review_clean.groupby('business_id').agg\
        ({'review_id' : 'count','text': lambda a: '##'.join(a)}).rename\
        (columns={'review_id' : 'review_count', 'text': 'combined_reviews'})

bus_rev = bus_rev.reset_index()
bus_rev.head()

In [None]:
user_rev = review_clean.groupby('user_id').agg\
        ({'review_id' : 'count','text': lambda a: '##'.join(a)}).rename\
        (columns={'review_id' : 'review_count', 'text': 'combined_reviews'})

user_rev = user_rev.reset_index()
user_rev.head()

#### Restaurants vector using Tfidf Vectorizer

In [None]:
%%time

# Tfidf to extract top 500 features

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=500)   ## limit to top 500 words
X = vectorizer.fit_transform(bus_rev.combined_reviews)                                     ## scipy.sparse.csr.csr_matrix

In [None]:
X_df = pd.DataFrame(X.todense())
bus_revFeature = X_df.set_index(bus_rev.business_id)
bus_revFeature.columns = vectorizer.get_feature_names_out()   # get_feature_names() was removed in scikit-learn 1.2
bus_revFeature.head()

In [None]:
# look at all the bigrams being picked up in the top 500 features
for i in bus_revFeature.columns:
    if len(i.split())>1:
        print(i, end=',')

In [None]:
%%time
# feature selection
from sklearn.decomposition import PCA
pca = PCA()
bus_pcaFeature = pca.fit_transform(bus_revFeature)
vr = pca.explained_variance_ratio_

In [None]:
vr_cum = np.cumsum(vr)   # cumulative explained variance ratio
plt.plot(range(len(vr)), vr_cum, color='salmon');

In [None]:
vr_cum[299]   # cumulative variance explained by the first 300 components
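
The cumulative curve can also be read the other way round: given a target share of variance, find how many leading components are needed. The sketch below uses made-up variance ratios and a hypothetical helper, `components_needed`. Note that position `i` of a zero-indexed cumulative list covers the first `i + 1` components.

```python
from itertools import accumulate

def components_needed(variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for count, total in enumerate(accumulate(variance_ratio), start=1):
        if total >= threshold:
            return count
    return len(variance_ratio)

ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]   # made-up explained-variance ratios
needed = components_needed(ratios, 0.85)
print(needed)
```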

In [None]:
# extract and inspect the top 6 PCA components, in relationship to the original review features

components = pd.DataFrame(data=pca.components_, columns = bus_revFeature.columns)
for i in range(6):
    component = components.loc[i].sort_values(ascending=False)
    print("principal component #{}:\n".format(i), component[:5])

In [None]:
bus_pcaFeature = pd.DataFrame(bus_pcaFeature[:,:300], columns=[str(i) for i in np.arange(0,300)]).set_index\
               (bus_revFeature.index)
bus_pcaFeature.columns.name = 'principle_components'

bus_pcaFeature.head()

#### Dimension reduction with PCA

In [None]:
# Rescale the PCA coefficients so that every restaurant feature vector has unit length

bus_pcaFeature['root_sum_sq'] = bus_pcaFeature.apply(lambda row: np.sqrt(sum([i*i for i in row])), axis=1)
bus_pcaFeature = bus_pcaFeature.divide(bus_pcaFeature.root_sum_sq, axis=0).drop('root_sum_sq', axis=1)

bus_pcaFeature.head()

In [None]:
with open('bus_pcaFeature.pkl', 'rb') as f:
    bus_pcaFeature = pickle.load(f)

#### User vectors based on restaurant vectors

In [None]:
%%time

# The user profile is the weighted sum of the vectors of all restaurants the user rated, with the user's ratings as weights

user_pcaFeature = pd.merge(review_clean[['user_id', 'business_id', 'stars']], bus_pcaFeature, how='inner',\
                          left_on='business_id', right_index=True).drop('business_id', axis=1)

user_pcaFeature.head()

In [None]:
# Weight each restaurant vector by the user's star rating

user_pcaFeature.iloc[:, 2:302] = user_pcaFeature.iloc[:, 2:302].multiply(user_pcaFeature.stars, axis=0)

user_pcaFeature.head()

In [None]:
%%time

user_pcaFeature = user_pcaFeature.drop('stars', axis=1).groupby('user_id').sum()

In [None]:
# Rescale the PCA coefficients so that every user feature vector has unit length

user_pcaFeature['root_sum_sq'] = user_pcaFeature.apply(lambda row: np.sqrt(sum([i*i for i in row])), axis=1)
user_pcaFeature = user_pcaFeature.drop('root_sum_sq', axis=1).divide(user_pcaFeature.root_sum_sq, axis=0)

user_pcaFeature.head()

#### Save PCA features

In [None]:
max_bytes = 2**31 - 1
bytes_out = pickle.dumps(bus_pcaFeature)

with open('bus_pcaFeature.pkl','wb') as f:
    for idx in range(0, len(bytes_out), max_bytes):
        f.write(bytes_out[idx:idx+max_bytes])

In [None]:
max_bytes = 2**31 - 1
bytes_out = pickle.dumps(user_pcaFeature)

with open('user_pcaFeature.pkl','wb') as f:
    for idx in range(0, len(bytes_out), max_bytes):
        f.write(bytes_out[idx:idx+max_bytes])

In [None]:
bus_revFeature = None
#pca = None
#VR = None
#VR_sum = None
#bus_pcaFeature = None
user_pcaFeature = None

In [None]:
with open('train_bus_pcaFeature.pkl', 'rb') as f:
    bus_pcaFeature = pickle.load(f)
    
with open('train_user_pcaFeature.pkl', 'rb') as f:
    user_pcaFeature = pickle.load(f)

#### 1.2 Predicted user rating of restaurant

In [None]:
# Feature building

# Numerical columns from business dataset
reg_bus = business[['business_id','latitude','longitude','stars','review_count']]

reg_train = train_review[['review_id','user_id','business_id','stars']].set_index('review_id')
reg_test = test_review[['review_id','user_id','business_id','stars']].set_index('review_id')

# Merge numerical columns from business dataset on `business_id`
reg_train = reg_train.merge(reg_bus, how='inner', on='business_id', suffixes=('_review', '_business'))
reg_test = reg_test.merge(reg_bus, how='inner', on='business_id', suffixes=('_review', '_business'))

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# import regression models and metrics
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
# Create a results table and a function that records regression model performance in it

index = ['Lasso','Ridge','Random Forest']
result_table = pd.DataFrame(index = index, columns= ['r2_train','mse_train','rmse_train','r2_test','mse_test','rmse_test'])

def display_reg_results(model, pred_train, pred_test):
    
    r2_train = r2_score(y_train, pred_train)
    r2_test = r2_score(y_test, pred_test)
    mse_train = mean_squared_error(y_train, pred_train)
    mse_test = mean_squared_error(y_test, pred_test)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)
    
    result_table.loc[model,:] = r2_train, mse_train, rmse_train, r2_test, mse_test, rmse_test

In [None]:
%%time

x_train, y_train = reg_train.drop(['user_id','business_id','stars_review'], axis=1), reg_train.stars_review
x_test, y_test = reg_test.drop(['user_id','business_id','stars_review'], axis=1), reg_test.stars_review

## Lasso model ##

lasso = Pipeline([('scaler', StandardScaler()),('lasso', Lasso(alpha=0.0015, max_iter=1000, selection='random'))])
lasso.fit(x_train, y_train)
pred_train = lasso.predict(x_train)
pred_test = lasso.predict(x_test)

# Features, coefficients 
coeff_feature = pd.DataFrame({'feature':x_train.columns, 'coefficient':lasso.named_steps.lasso.coef_})
print("Feature coefficients for the fitted Lasso model:\n",coeff_feature.sort_values('coefficient',ascending=False))

# logging of model performance
display_reg_results("Lasso", pred_train, pred_test)

## Ridge model ##

ridge = Pipeline([('scaler', StandardScaler()),('ridge',Ridge(alpha=100,max_iter=1000,tol=0.001))])
ridge.fit(x_train, y_train)
pred_train = ridge.predict(x_train)
pred_test = ridge.predict(x_test)

# Features, coefficients
coeff_feature = pd.DataFrame({'feature':x_train.columns, 'coefficient':ridge.named_steps.ridge.coef_})
print("\nFeature coefficients for the fitted Ridge model:\n", coeff_feature.sort_values('coefficient',ascending=False))

# logging of model performance
display_reg_results("Ridge", pred_train, pred_test)

## Random forest model ##

rfr = Pipeline([('scaler', StandardScaler()),('rfr', RandomForestRegressor(n_estimators=70, max_features='log2'))])
rfr.fit(x_train, y_train)
pred_train = rfr.predict(x_train)
pred_test = rfr.predict(x_test)

# Features, importance
rank_feature = pd.DataFrame({'feature': x_train.columns, 'importance': rfr.named_steps.rfr.feature_importances_})
print("\nFeature importance for the fitted Random Forest Regressor model:\n", rank_feature.sort_values(by='importance',ascending=False))

# logging of model performance
display_reg_results("Random Forest", pred_train, pred_test)


result_table

In [None]:
%%time

X_train, Y_train = reg_train.drop(['user_id','business_id','stars_review'], axis=1), reg_train.stars_review
X_test, Y_test = reg_test.drop(['user_id','business_id','stars_review'], axis=1), reg_test.stars_review

lasso = Pipeline([('scaler', StandardScaler()),('lasso', Lasso(alpha=0.0015, max_iter=1000, selection='random'))])
lasso.fit(X_train, Y_train)
pred_test = lasso.predict(X_test)

In [None]:
rating_predict = pd.Series(pred_test)
rating_predict.name = 'rating_predict'


# Generate recommendation ranking by the predicted rating in descending order

# first, join predicted rating with all reviews in the testset
rec = pd.concat([test_review.reset_index(drop=True),rating_predict], axis=1)

# then rank by predicted rating in descending order
rec = rec.sort_values('rating_predict', ascending=False)

In [None]:
user_id = 'KbtcIPQdfmXToZV24trjVg'
rec_id = rec[rec.user_id == user_id].set_index('business_id').sort_values('stars', ascending=False)[0:10]
print('Actual vs. predicted rating (top 10 by actual stars):\n', rec_id[['stars','rating_predict']])

## Implementation

In [4]:
import os.path
from sklearn.metrics.pairwise import linear_kernel

In [5]:
business = pd.read_csv('clean_business.csv')
review = pd.read_csv('clean_review.csv')
review_clean = review[review.business_id.isin(business.business_id.unique())].reset_index(drop=True)

mean_global = ((business.stars * business.review_count).sum())/(business.review_count.sum())
k = 30 # smoothing strength; for reference, the 50% quantile of the review counts for all businesses is 22
business['stars_adj'] = (business.review_count * business.stars + k * mean_global)/(business.review_count + k)
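
The `stars_adj` formula shrinks each restaurant's mean rating toward the global mean, with `k` acting like `k` extra reviews at the global mean. A small worked example with made-up numbers follows; the `adjusted_stars` helper and the global mean of 3.5 are assumptions for illustration.

```python
def adjusted_stars(stars, review_count, global_mean, k):
    """Shrink a restaurant's mean rating toward the global mean.

    k acts like k extra reviews at the global mean, so restaurants with
    few reviews move more than restaurants with many.
    """
    return (review_count * stars + k * global_mean) / (review_count + k)

global_mean = 3.5   # hypothetical global mean rating
few = adjusted_stars(5.0, review_count=5, global_mean=global_mean, k=30)
many = adjusted_stars(5.0, review_count=500, global_mean=global_mean, k=30)
print(round(few, 3), round(many, 3))
```

A 5-star restaurant with only 5 reviews is pulled most of the way toward the global mean, while one with 500 reviews keeps nearly all of its rating. This stops sparsely reviewed restaurants from dominating the default ranking.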

In [6]:
class Recommender_Engine:
    
    def __init__(self, n=10, stars_original=False):
        """
        Instantiate the object. Default setting for ranking would be stars_adj with top 10 recommendations.
        """
        
        self.n = n                                                     # Number of recommendations
        self.stars_original = stars_original                           # Boolean for ranking method                            
        self.display_columns = ['name', 'address', 'city','state',\
                                'attributes.RestaurantsPriceRange2',\
                                'review_count','stars','stars_adj',\
                                'cuisine','style']                   # List of columns to be displayed in the results

        if self.stars_original:
            score = 'stars'
        else:
            score = 'stars_adj'

        self.recommendation = business[business.is_open == 1].sort_values(score, ascending=False)
                                                                      # Filter only open restaurants

    def display(self):

        if len(self.recommendation) == 0:
            print("Sorry, there are no matching recommendations.")
        elif self.n < len(self.recommendation):
            print("Below is the list of the top {} recommended restaurants for you: ".format(self.n))
            print(self.recommendation.iloc[:self.n][self.display_columns])
        else:
            print("Below is the list of the top {} recommended restaurants for you: ".format(len(self.recommendation)))
            print(self.recommendation[self.display_columns])        # Select columns by name (.iloc only accepts positions)
    
            
    def content_filtering(self, user_id=None):
        self.user_id = user_id
        if self.user_id is None:
            print('User ID is not provided')
            return None
        if len(user_id) != 22:                                        # Sanity check on length of user id
            print('Invalid user ID')
            return None
        if self.user_id not in review_clean.user_id.unique():
            print('No user data available yet!')
            return []
            
        self.recommendation = business[business.is_open == 1]
        if 'stars_pred' in self.recommendation.columns:
            self.recommendation.drop('stars_pred', axis=1, inplace=True)
            
        self.display_columns = ['name', 'address', 'city','state',\
                                'attributes.RestaurantsPriceRange2',\
                                'review_count','stars','stars_adj',\
                                'cuisine','style']
            
        # Load the saved restaurant and user feature vectors
        
        with open('bus_pcaFeature.pkl', 'rb') as f:
            bus_pcaFeature = pickle.load(f)
            
        with open('user_pcaFeature.pkl', 'rb') as f:
            user_pcaFeature = pickle.load(f)
           
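        # Score every restaurant against the user's vector (both are unit length, so the dot product is the cosine similarity)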
        score_matrix = linear_kernel(user_pcaFeature.loc[user_id].values.reshape(1,-1), bus_pcaFeature)
        score_matrix = score_matrix.flatten()
        score_matrix = pd.Series(score_matrix, index=bus_pcaFeature.index)
        score_matrix.name = 'cosine_sim_score'
        
        self.recommendation = pd.concat([score_matrix, self.recommendation.set_index('business_id')], axis=1, join='inner').reset_index()
        
        # Keep only restaurants the user has not rated yet
        rated_res = review_clean[review_clean.user_id == self.user_id].business_id.unique()
        self.recommendation = self.recommendation[~self.recommendation.business_id.isin(rated_res)]
        
        # Sort restaurants by cosine similarity score
        self.recommendation = self.recommendation.sort_values('cosine_sim_score', ascending=False).reset_index(drop=True)
       
        self.display_columns.insert(0, 'cosine_sim_score')
        self.display()
        
        return self.recommendation

In [7]:
%%time

# Instantiate the object
results = Recommender_Engine();

# Test case 1: Display results
print('Test case 1: *****------------*****\n');
results.display();

# Test case 2: No user_id input
print('Test case 2: *****------------*****\n');
results.content_filtering();

# Test case 3: User with no previous user data
print('Test case 3: *****------User with no previous user data------*****\n')
results.content_filtering(user_id='-NzChtoNOw706kps82x0Kg');

# Test case 4: User with few restaurant reviews
print('Test case 4: *****------User with few restaurants reviews------*****\n')
results.content_filtering(user_id='---89pEy_h9PvHwcHNbpyg');

# Test case 5: User with more than 100 restaurant reviews
print('Test case 5: *****------User with more than 100 restaurants reviews------*****\n')
results.content_filtering(user_id='---1lKK3aKOuomHnwAkAow');

Test case 1: *****------------*****

Below is the list of the top 10 recommended restaurants for you: 
                          name                           address        city  \
29761          Little Miss BBQ              4301 E University Dr     Phoenix   
2648              Brew Tea Bar      7380 S Rainbow Blvd, Ste 101   Las Vegas   
33734          Cocina Madrigal                    4044 S 16th St     Phoenix   
35172  Green Corner Restaurant        1038 W Southern Ave, Ste 1        Mesa   
3590            Worth Takeaway                     218 W Main St        Mesa   
9839            Zenaida's Cafe      3430 E Tropicana Ave, Ste 32   Las Vegas   
34397          Kodo Sushi Sake  15040 N Northsight Blvd, Ste 104  Scottsdale   
21628  Bajamar Seafood & Tacos             1615 S Las Vegas Blvd   Las Vegas   
11825                   Karved              3957 S Maryland Pkwy   Las Vegas   
34603    Not Your Typical Deli    1166 South Gilbert Rd, Ste 101     Gilbert   

      state attr

Unnamed: 0,business_id,cosine_sim_score,city,attributes.GoodForMeal,attributes.Smoking,attributes.BusinessAcceptsBitcoin,address,attributes.BYOBCorkage,attributes.WheelchairAccessible,attributes.RestaurantsDelivery,...,attributes.BusinessParking,hours.Monday,attributes.CoatCheck,hours,hours.Friday,attributes.BusinessAcceptsCreditCards,attributes.RestaurantsTableService,cuisine,style,stars_adj
0,Ehy00JWQixgoXzisVKhvag,0.717487,Las Vegas,"{'dessert': None, 'latenight': False, 'lunch':...",,,"3720 S Las Vegas Blvd, Ste 240",,,False,...,"{'garage': True, 'street': False, 'validated':...",11:30-22:0,,"{'Monday': '11:30-22:0', 'Tuesday': '11:30-22:...",11:30-23:0,True,,"pizza,italian",restaurants,3.993314
1,SsN-SaGGkJn2Qm-jSDQ4aQ,0.703229,Las Vegas,"{'dessert': False, 'latenight': False, 'lunch'...",,,3500 S Las Vegas Blvd,,,False,...,"{'garage': True, 'street': False, 'validated':...",11:0-23:0,False,"{'Monday': '11:0-23:0', 'Tuesday': '11:0-21:0'...",11:0-0:0,True,,"italian,pizza,ice cream & frozen yogurt,americ...","restaurants,food stands",3.993601
2,YNDxeeRUARbd8GRnscJSvg,0.701485,Las Vegas,"{'dessert': False, 'latenight': False, 'lunch'...",,,246 Via Antonio Ave,,True,False,...,"{'garage': False, 'street': False, 'validated'...",11:30-21:0,,"{'Monday': '11:30-21:0', 'Tuesday': '11:30-21:...",11:30-21:0,True,True,"italian,beer,wine & spirits,vegan","restaurants,nightlife,cocktail bars,bars",3.987394
3,N0apJkxIem2E8irTBRKnHw,0.694780,Las Vegas,"{'dessert': None, 'latenight': None, 'lunch': ...",,,3799 Las Vegas Blvd,,,False,...,"{'garage': True, 'street': False, 'validated':...",11:30-21:30,,"{'Monday': '11:30-21:30', 'Tuesday': '11:30-21...",11:30-6:0,True,,"pizza,sandwiches,beer,wine & spirits,american ...","nightlife,restaurants,bars",3.995993
4,e0JOkQYz_cnz91k6X55PLw,0.693439,Las Vegas,"{'dessert': False, 'latenight': False, 'lunch'...",,,3131 Las Vegas Blvd S,,,False,...,"{'garage': True, 'street': False, 'validated':...",17:30-22:0,,"{'Monday': '17:30-22:0', 'Tuesday': '17:30-22:...",17:30-22:0,True,,"italian,vegan","restaurants,bars,nightlife",3.989329
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30317,KOxJOievdW0w7lJVfE4QtQ,-0.318668,Medina,"{'dessert': False, 'latenight': False, 'lunch'...",,,933 North Court St.,,,False,...,"{'garage': False, 'street': False, 'validated'...",10:30-0:0,,"{'Monday': '10:30-0:0', 'Tuesday': '10:30-0:0'...",10:30-0:0,True,,burgers,"restaurants,fast food",3.239476
30318,HXdPqrO27tANiLpbCe9BVA,-0.318691,Phoenix,"{'dessert': False, 'latenight': False, 'lunch'...",,,19818 N 27th Ave,,,False,...,"{'garage': False, 'street': False, 'validated'...",0:0-0:0,,"{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...",0:0-0:0,True,,"burgers,mexican,tacos","restaurants,fast food,breakfast & brunch",2.996973
30319,CZNeNfo_6C8d3lFbkL_8vA,-0.318743,Bethel Park,"{'dessert': False, 'latenight': False, 'lunch'...",,,5261 Library Rd,,,True,...,"{'garage': False, 'street': False, 'validated'...",5:0-22:0,,"{'Monday': '5:0-22:0', 'Tuesday': '7:0-23:0', ...",5:0-0:0,True,,"coffee & tea,burgers","restaurants,fast food",3.070989
30320,b4dtR83mPAcvmM2a7mqpGw,-0.319842,Mesa,"{'dessert': False, 'latenight': False, 'lunch'...",,,146 W Baseline Rd,,,True,...,"{'garage': False, 'street': False, 'validated'...",5:0-23:0,,"{'Monday': '5:0-23:0', 'Tuesday': '5:0-23:0', ...",5:0-23:0,True,False,"coffee & tea,burgers","fast food,restaurants",2.838723
