- Author: Kemil Herath 
- MIS 753: Independent Study, Fall 2021

## Collaborative filtering 
- This project uses the Yelp dataset: 
https://www.yelp.com/dataset

- A subset of the business data and the review data is used. 

#### Sources: 
- https://predictivehacks.com/item-based-collaborative-filtering-in-python/
- https://chrisalbon.com/code/machine_learning/feature_engineering/select_best_number_of_components_in_tsvd/
- https://realpython.com/build-recommendation-engine-collaborative-filtering/

In [1]:
## Load Libraries 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string 
from wordcloud import WordCloud

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import random
import nltk
from nltk.corpus import stopwords

pd.set_option('display.max_rows', None)



In [2]:
# Load businesses data 
col_names = ['business_name', 'business_id', 'city', 'state', 'star_rating', 'service_categories']
business = pd.read_csv('Data/business.csv')
business.columns = col_names
business.head()

,business_name,business_id,city,state,star_rating,service_categories
0,Boruboru - Sandy Springs,kPiQ9kI_eP6LUob-Pv6AWg,Atlanta,GA,4.5,"Japanese, Sushi Bars, Food, Poke, Ramen, Resta..."
1,Smokey Bones Bar & Fire Grill,K-TMzK7eKQT4cAUoy_IlSw,Stoughton,MA,2.5,"Barbeque, Burgers, Restaurants, American (Trad..."
2,Dugans,mjJI-DFchylyZHB0b3n2lA,Atlanta,GA,2.5,"Bars, American (Traditional), Restaurants, Nig..."
3,China Station,3YsJCNhI4Vw7OtweFN622w,Boston,MA,4.0,"Chinese, Restaurants"
4,Ben Hill Grill,cOo7OHZinS8hn5P_-tFVKA,Atlanta,GA,4.5,"Seafood, Fast Food, Vegan, American (Tradition..."


In [3]:
business.shape

(34196, 6)

In [4]:
# Load reviews Data 
col_names = ['review_id', 'business_id', 'user_id', 'rating']
reviews = pd.read_csv('reviews.csv')
reviews.columns = col_names
reviews.head()

,review_id,business_id,user_id,rating
0,J4a2TuhDasjn2k3wWtHZnQ,xGXzsc-hzam-VArK6eTvtw,RNm_RWkcd02Li2mKPRe7Eg,1.0
1,9vqwvFCBG3FBiHGmOHMmiA,DbXHNl890xSXNiyRczLWAg,XGkAG92TQ3MQUKGX9sLUhw,5.0
2,FdoBFTjXXMn4hVnJ59EtiQ,WQFn1A7-UAA4JT5YWiop_w,eLAYHxHUutiXswy-CfeiUw,1.0
3,ucFOnqgaV40oQ2YNyz5ddQ,KXCXaF5qimmtKKqnPc_LQA,JHXQEayrDHOWGexs0dCviA,1.0
4,GDgXjXSZCA1iNQWD7OHXfg,mOnesB4IF9j6-ZmHoOHOig,1RCRKuHgP3FskGUVnmFdxg,4.0


In [5]:
reviews.shape

(250000, 4)

#### Take samples of the data for faster processing 

In [6]:
## Take a random sample of reviews data for easier processing 
reviews1 = reviews.sample(frac=0.16, random_state=12)

In [7]:
reviews1.head()

,review_id,business_id,user_id,rating
109672,7O6ujyFXDW4DWf4LFOO1fQ,j-JGN1f_cN_20oePpRWXrw,8KpEEKNjzgHu_zWXA4-cbQ,5.0
143738,Cg75QYA8b4jypg6GDDn7vQ,OX01EqImVbTGWepgNPNsJA,BQrLWECGBozqKaUVnv8TYw,5.0
180135,qSg3uobzatQaBBAwQZ57jQ,S7mL8gwckyeWuLGmE94kkw,s7qFhPQgZq_po9ffh_Icpg,2.0
248710,z7py0F2cQBJ0nwLzIaCjqA,EGsftsHMWmKF3mc2UQoQug,4GUk6uuf-9QQIUxJYYPNBA,2.0
41081,ocfbk7Rl5TB2otquxlurIw,vQNe7TD_QDpGtGbkaAqi8Q,LbfAaO0-15DunU_3qKGF4Q,5.0


In [8]:
reviews1.shape

(40000, 4)

In [9]:
## Take a random sample of the businesses data for easier processing 
business1 = business.sample(frac=0.15, random_state=42)

In [10]:
business1.shape

(5129, 6)

### Data Processing/ Exploration

In [11]:
### Merge reviews data and businesses data
reviewsData = pd.merge(business1[['business_name', 'business_id']],
                       reviews1[['business_id','user_id','rating']],
                       on='business_id')

reviewsData.head()

,business_name,business_id,user_id,rating
0,Khob Khun Thai Cuisine,wTVzfhLOYUyd6wY9LJWC2w,dQyXVDO7JYKtgZfsV0xOhA,5.0
1,Khob Khun Thai Cuisine,wTVzfhLOYUyd6wY9LJWC2w,4jRbtVIieuBbSBJWDhP3jw,5.0
2,Khob Khun Thai Cuisine,wTVzfhLOYUyd6wY9LJWC2w,WzCU-7Vjvv2eTYx8f85U_A,4.0
3,Khob Khun Thai Cuisine,wTVzfhLOYUyd6wY9LJWC2w,Ff_K-dXlwSJwN-IVMpKuOw,5.0
4,Khob Khun Thai Cuisine,wTVzfhLOYUyd6wY9LJWC2w,GyZRi_VETkrHGQoRKa9gsQ,1.0


In [12]:
reviewsData.shape

(3032, 4)

In [13]:
## Check which restaurants have the most reviews 
reviewsData.groupby('business_name')['rating'].count().sort_values(ascending =False).head(10)

business_name
Lechon                                                 128
Barcelona Wine Bar South End                            81
Tin Shed Garden Cafe                                    64
FLIP burger boutique                                    63
Santarpio's Pizza                                       62
JINYA Ramen Bar - Austin                                52
J & M Diner                                             49
H&F Burger                                              48
Cooper's Hawk Winery & Restaurant - Waterford Lakes     47
Sushi Katana                                            47
Name: rating, dtype: int64

- We can see that **Lechon** has the most reviews. 

### Utility matrix 
- A U × R matrix, where U is the set of users and R is the set of restaurants.
- Each cell holds the rating that a user gave to a restaurant.
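
As a small illustration of the utility matrix described above, the sketch below pivots a made-up ratings table. The user IDs, restaurant names, and ratings are hypothetical and are not taken from the Yelp data.

```python
import pandas as pd

# Hypothetical toy ratings (made up for illustration only)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "business_name": ["A", "B", "A", "C"],
    "rating": [5.0, 3.0, 4.0, 1.0],
})

# Rows are users, columns are restaurants.
# pivot_table averages the ratings if a user-restaurant pair appears more than once.
toy_utility = toy.pivot_table(values="rating", index="user_id", columns="business_name")
print(toy_utility)
```

Pairs with no rating show up as NaN, which is why the real matrix in the next cells is almost entirely empty.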

In [14]:
reviews_crosstab = reviewsData.pivot_table('rating', index='user_id', columns='business_name')

In [15]:
reviews_crosstab.head(10)

business_name,&pizza - Harvard Square,19th Hole,5 Star Pizza,African Paradise Restaurant,Alamo Drafthouse Cinema South Lamar,American Fresh Beer Garden,Anaheim Produce,Anna's Hand Cut Donuts,Arby's,Arepas Grill,...,Wok N' Kebab Asian Street Fare,X-Site Grill & Bistro,Yan's China Bistro,Yoko Sushi,Yokohama Teppanyaki,ZAZ,Zen Japanese Grill & Sushi Bar,Zeus Greek Street Food,Zoup!,sandoitchi Pop Up
user_id,,,,,,,,,,,,,,,,,,,,,
-1CV3L7RAk34790wXVQu4g,,,,,,,,,,,...,,,,,,,,,,
-3s52C4zL_DHRK0ULG6qtg,,,,,,,,,,,...,,,,,,,,,,
-4RRsux7RIX19uBsVTSsMw,,,,,,,,,,,...,,,,,,,,,,
-5E2DVKUmSqpKPOWAL7sTg,,,,,,,,,,,...,,,,,,,,,,
-6t5bJFJne44N_E7lgdR6g,,,,,,,,,,,...,,,,,,,,,,
-9XyhOsyd8tOfgvFLGJfAQ,,,,,,,,,,,...,,,,,,,,,,
-A9ICz4e9hgrNK4-Fs3TmQ,,,,,,,,,,,...,,,,,,,,,,
-BCHgVhr-mx8iEBbphbTXw,,,,,,,,,,,...,,,,,,,,,,
-CUA8xL-9bCZN3nOi1e_Hw,,,,,,,,,,,...,,,,,,,,,,
-CvJZ3v4XxnzQAIZu5JvxQ,,,,,,,,,,,...,,,,,,,,,,


In [16]:
reviews_crosstab.shape

(3005, 365)

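- This is a good example of a **sparse matrix (sparse data).**
- Most cells are null (NaN) because each user has rated only a few restaurants: at most 3,032 ratings are spread over 3,005 × 365 = 1,096,825 cells, so more than 99% of the cells are empty. 
- **This is a common occurrence in ratings data, and an obstacle for recommender systems.**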
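
One way to quantify the sparsity is to measure the share of empty cells before they are filled. This is a minimal sketch: the `sparsity` helper and the toy matrix are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def sparsity(utility: pd.DataFrame) -> float:
    """Fraction of user-restaurant cells that have no rating (NaN)."""
    return float(utility.isna().to_numpy().mean())

# Hypothetical 2 x 3 utility matrix with two ratings and four empty cells
toy_utility = pd.DataFrame(
    [[5.0, np.nan, np.nan],
     [np.nan, 4.0, np.nan]],
    index=["u1", "u2"],
    columns=["A", "B", "C"],
)
toy_sparsity = sparsity(toy_utility)  # 4 of the 6 cells are empty
```

On the notebook's data this would be called as `sparsity(reviews_crosstab)`, and it has to run before the `fillna(0)` step, since after that step no NaN values remain.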

#### Fill null values with 0

In [17]:
reviews_crosstab = reviews_crosstab.fillna(0)

### Model-Based Collaborative Filtering
- **Transpose** reviews_crosstab so that restaurants (items) are in rows and users are in columns.
- Identify the optimal number of components for SVD.
- Run truncated SVD on the transposed restaurant-by-user matrix.
- Calculate the Pearson correlation between the reduced restaurant vectors to identify similar restaurants. 
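
The steps above can be sketched end to end on a small random matrix. This is only an illustration under stated assumptions: the toy matrix, its size, and the choice of 3 components are made up, and the real notebook picks the number of components from the explained variance instead.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical utility matrix: 20 users x 6 restaurants, values 0-5 (0 = no rating)
toy_utility = rng.integers(0, 6, size=(20, 6)).astype(float)

# Step 1: transpose so that restaurants are rows and users are columns
items_by_users = toy_utility.T

# Steps 2-3: compress the user dimension into a few latent components
svd = TruncatedSVD(n_components=3, random_state=0)
item_factors = svd.fit_transform(items_by_users)

# Step 4: Pearson correlation between the reduced restaurant vectors
item_corr = np.corrcoef(item_factors)

print(item_factors.shape, item_corr.shape)
```

Because `np.corrcoef` treats each row as one variable, the result is a restaurant-by-restaurant matrix, which is what the recommendation step indexes into.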

In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

### 1. Transpose the utility matrix (reviews_crosstab)

In [19]:
reviews_crosstab.shape

(3005, 365)

In [20]:
# Convert X values to float
X = reviews_crosstab.astype(float)

In [21]:
# Transpose the float matrix from the previous cell so restaurants are rows and users are columns
X = X.T
X.shape

(365, 3005)

In [22]:
X.head()

user_id,-1CV3L7RAk34790wXVQu4g,-3s52C4zL_DHRK0ULG6qtg,-4RRsux7RIX19uBsVTSsMw,-5E2DVKUmSqpKPOWAL7sTg,-6t5bJFJne44N_E7lgdR6g,-9XyhOsyd8tOfgvFLGJfAQ,-A9ICz4e9hgrNK4-Fs3TmQ,-BCHgVhr-mx8iEBbphbTXw,-CUA8xL-9bCZN3nOi1e_Hw,-CvJZ3v4XxnzQAIZu5JvxQ,...,zpwbkEZPuJ5kpYULdEuRiw,zrbk3Xzx6V3s4fCOvNBEKw,zs4NnxnKBKTCf-DlLXHyzg,zsUMstOI2PIGkPsc58VZIw,ztD-vr_Q2iBEw7g9qLRceQ,zuJizFaUq59eywOjN5cdxw,zvTHuymw0LCHBcVEt5WHsA,zwsfGSYBHnssi6mDYXepTw,zxSTe_VEZPK8e2PnkwgcGA,zzdi0RIbc21HMJYIpBg8lw
business_name,,,,,,,,,,,,,,,,,,,,,
&pizza - Harvard Square,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19th Hole,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5 Star Pizza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
African Paradise Restaurant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Alamo Drafthouse Cinema South Lamar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2. Identify the optimal number of components for SVD

In [23]:
X.shape

(365, 3005)

In [24]:
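# Fit with n_features - 1 components; X has only 365 rows, so at most 365 components can carry variance
compSVD = TruncatedSVD(n_components=X.shape[1]-1)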

In [25]:
compSVD.fit(X)

TruncatedSVD(n_components=3004)

In [26]:
## Get the explained variance ratio of each component 
svd_var_ratios = compSVD.explained_variance_ratio_

In [27]:
## Create a function that returns the smallest n_components whose cumulative explained variance reaches the threshold
def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    
    # Set initial number of features
    n_components = 0
    
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        
        # Add the explained variance to the total
        total_variance += explained_variance
        
        # Add one to the number of components
        n_components += 1
        
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
            
    # Return the number of components
    return n_components
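
The same selection can be written without an explicit loop by using a cumulative sum. This is a sketch of an equivalent helper, assuming the variance ratios are non-negative; the name `select_n_components_vectorized` is hypothetical and does not appear in the original notebook.

```python
import numpy as np

def select_n_components_vectorized(var_ratio, goal_var: float) -> int:
    """Smallest number of components whose cumulative explained variance reaches goal_var."""
    cumulative = np.cumsum(var_ratio)
    # First position where the running total is >= goal_var, converted to a count
    k = int(np.searchsorted(cumulative, goal_var, side="left")) + 1
    # If the goal is never reached, fall back to all components, as the loop version does
    return min(k, len(cumulative))

toy_ratios = [0.5, 0.3, 0.15, 0.05]
print(select_n_components_vectorized(toy_ratios, 0.9))
```

Plotting `np.cumsum(svd_var_ratios)` against the component index is another way to inspect where the curve flattens before fixing a threshold.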

In [28]:
# Get the n_components using the function 
n_comp = select_n_components(svd_var_ratios, 0.95)

In [29]:
n_comp

231

In [42]:
n_comp/X.shape[1]

0.07687188019966722

### 3. Run SVD using n_components

In [30]:
X.shape

(365, 3005)

In [31]:
SVD = TruncatedSVD(n_components=n_comp, random_state=12)
resultant_matrix = SVD.fit_transform(X)
resultant_matrix.shape

(365, 231)

### 4. Create correlation matrix using Pearson Correlation 

In [32]:
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(365, 365)
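
`np.corrcoef` treats each row as one variable, which is why a 365 × 231 input gives a 365 × 365 restaurant-by-restaurant matrix. The toy vectors below are hypothetical and only illustrate how the values should be read.

```python
import numpy as np

# Hypothetical latent vectors: three restaurants (rows) over four components (columns)
toy_factors = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # moves exactly with row 0
    [4.0, 3.0, 2.0, 1.0],   # moves exactly against row 0
])

toy_corr = np.corrcoef(toy_factors)
print(toy_corr.round(2))
```

Entry `[i, j]` is the Pearson correlation between restaurants `i` and `j`, so the diagonal is 1 and the matrix is symmetric.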

#### Find indexes of restaurants to be recommended

In [33]:
## Check which restaurants have the most reviews 
reviewsData.groupby('business_name')['rating'].count().sort_values(ascending =False)[:100]

business_name
Lechon                                                 128
Barcelona Wine Bar South End                            81
Tin Shed Garden Cafe                                    64
FLIP burger boutique                                    63
Santarpio's Pizza                                       62
JINYA Ramen Bar - Austin                                52
J & M Diner                                             49
H&F Burger                                              48
Cooper's Hawk Winery & Restaurant - Waterford Lakes     47
Sushi Katana                                            47
Westgate Lakes Resort and Spa                           44
Marutama Ramen                                          43
Roaring Fork                                            42
The COOP: A Southern Affair                             37
Heo Eatery                                              37
Shaking Crab - Newton                                   33
Coquine                                   

### 5. Recommendations

In [35]:
col_idx = reviews_crosstab.columns.get_loc("FLIP burger boutique")
specific_corr = corr_mat[col_idx]
(pd.DataFrame({'Correlation': specific_corr, 'Restaurant': reviews_crosstab.columns})
   .sort_values('Correlation', ascending=False)
   .head(10))

,Correlation,Restaurant
102,1.0,FLIP burger boutique
364,0.012086,sandoitchi Pop Up
287,0.008818,Sunny Street Cafe
9,0.008106,Arepas Grill
358,0.008019,Yoko Sushi
139,0.007435,IHOP
135,0.007307,Hot Spot Pizza
255,0.007076,Relish Burger Bistro
243,0.007042,Pizza Peddler Deli
346,0.00664,Waffle House
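
The lookup is repeated for each restaurant, so it could be wrapped in a small function. This is a sketch rather than part of the original notebook: `top_similar` is a hypothetical helper, it assumes the restaurant names are unique, and it drops the queried restaurant so the trivial self-correlation of 1.0 does not take the first slot.

```python
import numpy as np
import pandas as pd

def top_similar(name, names, corr, n=10):
    """Return the n restaurants most correlated with `name`, excluding `name` itself."""
    names = pd.Index(names)
    idx = names.get_loc(name)                      # row of the queried restaurant
    scores = pd.Series(corr[idx], index=names, name="Correlation")
    return scores.drop(name).sort_values(ascending=False).head(n)

# Hypothetical 3 x 3 correlation matrix for illustration
toy_names = ["A", "B", "C"]
toy_corr = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])
toy_top = top_similar("A", toy_names, toy_corr, n=2)
print(toy_top)
```

With the notebook's objects the call would look like `top_similar("Lechon", reviews_crosstab.columns, corr_mat)`.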


In [36]:
col_idx = reviews_crosstab.columns.get_loc("Lechon")
specific_corr = corr_mat[col_idx]
(pd.DataFrame({'Correlation': specific_corr, 'Restaurant': reviews_crosstab.columns})
   .sort_values('Correlation', ascending=False)
   .head(10))

,Correlation,Restaurant
172,1.0,Lechon
364,0.012086,sandoitchi Pop Up
287,0.008818,Sunny Street Cafe
9,0.008106,Arepas Grill
358,0.008019,Yoko Sushi
139,0.007435,IHOP
135,0.007307,Hot Spot Pizza
255,0.007076,Relish Burger Bistro
243,0.007042,Pizza Peddler Deli
346,0.00664,Waffle House


In [37]:
business.loc[business.business_name == "Lechon"]

,business_name,business_id,city,state,star_rating,service_categories
25749,Lechon,eUbq0uNxRlXQ6sy7phM7yA,Portland,OR,4.5,"Tapas Bars, Restaurants, Cocktail Bars, Nightl..."


In [38]:
col_idx = reviews_crosstab.columns.get_loc("Marutama Ramen")
specific_corr = corr_mat[col_idx]
(pd.DataFrame({'Correlation': specific_corr, 'Restaurant': reviews_crosstab.columns})
   .sort_values('Correlation', ascending=False)
   .head(10))

,Correlation,Restaurant
189,1.0,Marutama Ramen
134,0.090809,Hong Sushi
364,0.012067,sandoitchi Pop Up
287,0.009975,Sunny Street Cafe
358,0.009168,Yoko Sushi
139,0.009054,IHOP
9,0.008537,Arepas Grill
243,0.007979,Pizza Peddler Deli
110,0.00749,Foodoko
346,0.007384,Waffle House
