# Content Based Recommendation System

This notebook outlines the process of building a recommendation system using Yelp's dataset. The system preprocesses review data to extract features from businesses and users, applies PCA for dimensionality reduction, and evaluates the recommendation quality using nDCG scores. We start by suppressing warnings and importing necessary libraries.


In [1]:
# Importing the essential libraries for data manipulation and numerical operations.
import pandas as pd
import numpy as np


## Loading Preprocessed Data

Here, we load the business and review datasets previously preprocessed and saved as CSV files. These datasets contain crucial information for feature extraction and further analysis.


In [2]:
business = pd.read_csv('business.csv') 
review = pd.read_csv('review.csv')


## Train-Test Split

Splitting the dataset into training and testing sets ensures that we can evaluate our recommendation system's performance on unseen data. This step involves random shuffling and splitting the review data.


In [3]:
# Establishing a random seed for consistent shuffling
np.random.seed(42)

# Randomly shuffling reviews
shuffled_indices = np.random.permutation(review.index)

# Determining the split point for training and testing
split_idx = int(len(shuffled_indices) * 0.9)

# Dividing the data into training and testing sets
train = review.loc[shuffled_indices[:split_idx]]
test = review.loc[shuffled_indices[split_idx:]]

# Displaying the ratio of training to testing data
print(f"Current train:test ratio: {len(train) / len(test):.2f}")


Current train:test ratio: 9.00


In [4]:
import pandas as pd

# Identifying users exclusive to the test set
users_test_only = test[~test.user_id.isin(train.user_id)].user_id.unique()

# Choosing one review per user to add to the training set for inclusivity
rows_to_add = test[test.user_id.isin(users_test_only)].groupby('user_id').head(1).index

# Updating the training and testing sets accordingly
idx_train = pd.Index(train.index.tolist() + rows_to_add.tolist()).unique()
idx_test = test.index.difference(rows_to_add)

train = review.loc[idx_train]
test = review.loc[idx_test]

# Reporting the updated training to testing data ratio
print(f"Current train:test ratio: {len(train) / len(test):.2f}")


Current train:test ratio: 12.06


In [5]:
# Identifying businesses unique to the test set
businesses_test_only = test[~test.business_id.isin(train.business_id)].business_id.unique()

# Selecting one review per such business to move to the training set
indices_to_move = test[test.business_id.isin(businesses_test_only)].groupby('business_id').head(1).index

# Reallocating indices for training and testing datasets
idx_train = train.index.union(indices_to_move)
idx_test = test.index.difference(indices_to_move)

train = review.loc[idx_train]
test = review.loc[idx_test]

# Displaying the new training to testing ratio


In [6]:
rev_by_rest = train.groupby('business_id').agg(
    review_count=('review_id', 'count'), 
    review_combined=('text', lambda x: '###'.join(x))
).reset_index()

# Checking the aggregation result
rev_by_rest.head(1)


Unnamed: 0,business_id,review_count,review_combined
0,-0iIxySkp97WNlwK66OGWg,221,"I really love this location, it's on the downt..."


## Feature Extraction from Reviews

Feature extraction is performed on the review texts by grouping them by businesses and applying TF-IDF vectorization. This process converts text data into a matrix of TF-IDF features.


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization for feature extraction from concatenated reviews
vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
    max_features=1000,
    max_df=0.5,
    min_df=2
)

X = vectorizer.fit_transform(rev_by_rest.review_combined)

# Inspecting the most significant features
top_features = vectorizer.get_feature_names_out()[:50]
print(top_features)


['100' '13' '14' '16' '18' '19' '20 minutes' '24' '30 minutes' '40' '45'
 '45 minutes' '95' '99' 'absolute' 'absolutely delicious' 'accommodating'
 'actual' 'addition' 'additional' 'adults' 'affordable' 'afternoon' 'ahi'
 'air' 'al' 'amazing food' 'amazing service' 'ambiance' 'ambience'
 'american' 'anniversary' 'answer' 'anymore' 'apart' 'apologized' 'app'
 'apparently' 'appetizer' 'appetizers' 'apple' 'appreciate' 'appreciated'
 'arrive' 'asada' 'asian' 'atlantis' 'authentic' 'avocado' 'avoid']


In [8]:
import pandas as pd

# Transforming the TF-IDF sparse matrix to a DataFrame
feature_names = vectorizer.get_feature_names_out()
rest_revfeature = pd.DataFrame.sparse.from_spmatrix(X, columns=feature_names)

# Aligning the business_id with the index
rest_revfeature.index = rev_by_rest.business_id.values
rest_revfeature.head()


Unnamed: 0,100,13,14,16,18,19,20 minutes,24,30 minutes,40,...,wrap,wrapped,write,yeah,year old,years ago,yesterday,york,yum,zero
-0iIxySkp97WNlwK66OGWg,0.002297,0.0,0.0,0.005097,0.00387,0.002542,0.002146,0.001332,0.001156,0.002162,...,0.002842,0.0,0.004241,0.00112,0.0,0.002387,0.0,0.008615,0.002175,0.002428
-1YvpVvnnLrTZ0zjtUYPXA,0.0,0.005898,0.017318,0.0,0.0,0.005974,0.0,0.0,0.010869,0.0,...,0.0,0.0,0.0,0.0,0.006119,0.0,0.005297,0.0,0.02555,0.017116
-3xX_IfttKjPJ792BOBJ-Q,0.004554,0.0,0.0,0.005052,0.015344,0.00504,0.034028,0.005281,0.027508,0.008573,...,0.0,0.010643,0.004203,0.017756,0.0,0.018928,0.013405,0.0,0.017243,0.014439
-7KnD-G4ZYi7-Xs4ZJAYWQ,0.010939,0.002988,0.0,0.003034,0.003071,0.003026,0.017879,0.0,0.008259,0.01287,...,0.0,0.0,0.0,0.010662,0.0,0.002842,0.0,0.0,0.0,0.0
-Dr6MZW6ZVP7X6ai30Qrlw,0.0,0.0,0.0,0.0,0.014408,0.014197,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01333,0.0,0.0,0.012143,0.013558


## Applying PCA for Dimensionality Reduction

To reduce the dimensionality of the TF-IDF feature matrix and capture the essence of the dataset, PCA (Principal Component Analysis) is applied. This transformation retains components that explain a significant portion of the variance.


In [9]:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np

# Executing PCA to capture 80% of the variance
pca = PCA(n_components=0.8)
rest_pcafeature = pca.fit_transform(rest_revfeature)

# Defining component names dynamically based on PCA results
num_components = pca.n_components_
rest_pcafeature_df = pd.DataFrame(rest_pcafeature, 
                                  index=rest_revfeature.index, 
                                  columns=[f'PCA_{i}' for i in range(1, num_components + 1)])
rest_pcafeature_df.columns.name = 'pca_components'

# Normalizing and merging PCA features with user preferences for profile construction

rest_pcafeature_df = rest_pcafeature_df.div(np.linalg.norm(rest_pcafeature_df, axis=1), axis=0)
user_prefs = pd.merge(train[['user_id', 'business_id', 'stars']], 
                      rest_pcafeature_df, 
                      how='inner', 
                      left_on='business_id', 
                      right_index=True).drop(columns=['business_id'])

for i in range(1, num_components + 1):
    user_prefs[f'PCA_{i}'] *= user_prefs['stars']

user_pcafeature_df = user_prefs.groupby('user_id').sum().drop(columns='stars')
user_pcafeature_df = user_pcafeature_df.div(np.linalg.norm(user_pcafeature_df, axis=1), axis=0)

# Displaying the initial entries for both restaurants and users after PCA transformation
display(rest_pcafeature_df.head(1))
display(user_pcafeature_df.head(1))

# Save the restaurant and user PCA features
rest_pcafeature_df.to_pickle('rest_pcafeature_train.pkl')
user_pcafeature_df.to_pickle('user_pcafeature_train.pkl')



pca_components,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5,PCA_6,PCA_7,PCA_8,PCA_9,PCA_10,...,PCA_120,PCA_121,PCA_122,PCA_123,PCA_124,PCA_125,PCA_126,PCA_127,PCA_128,PCA_129
-0iIxySkp97WNlwK66OGWg,-0.0178,-0.066374,0.041662,0.177428,0.074457,0.142361,0.029258,0.130991,0.029388,0.011707,...,-0.019519,-0.005879,-0.004544,0.007208,-0.007785,0.005574,-0.005158,0.005407,0.00089,-0.006741


Unnamed: 0_level_0,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5,PCA_6,PCA_7,PCA_8,PCA_9,PCA_10,...,PCA_120,PCA_121,PCA_122,PCA_123,PCA_124,PCA_125,PCA_126,PCA_127,PCA_128,PCA_129
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--3Hl2oAvTPlq-f7KtogJg,-0.02244,-0.036778,0.102362,0.053947,-0.095409,0.055039,-0.015558,0.152621,-0.144908,0.07329,...,0.006327,0.002786,0.009739,-0.007246,-0.003631,0.011568,0.014476,0.010554,-0.007374,0.004826


## Recommendation System Evaluation

The recommendation system's performance is evaluated using the normalized Discounted Cumulative Gain (nDCG) metric. This evaluation considers the relevance of recommended businesses to the users' preferences.


In [16]:
import pandas as pd
import numpy as np
import pickle
import os
from sklearn.metrics.pairwise import linear_kernel

def load_pca_features():
    # Load restaurant PCA features
    with open('rest_pcafeature_train.pkl', 'rb') as f:
        rest_pcafeature = pickle.load(f)
        
    # Load user PCA features
    max_bytes = 2**31 - 1
    bytes_in = bytearray(0)
    input_size = os.path.getsize('user_pcafeature_train.pkl')
    with open('user_pcafeature_train.pkl', 'rb') as f:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f.read(max_bytes)
        user_pcafeature = pickle.loads(bytes_in)
    
    return user_pcafeature, rest_pcafeature

def dcg_at_k(scores):
    """Calculate DCG"""
    return np.sum(
        (2**scores - 1) / np.log2(np.arange(2, scores.size + 2))
    )

def ndcg_at_k(true_scores, pred_scores, k):
    """Calculate nDCG"""
    best_dcg = dcg_at_k(np.sort(true_scores)[::-1][:k])
    if best_dcg == 0:
        return 0
    return dcg_at_k(pred_scores[:k]) / best_dcg

def evaluate_ndcg(user_pcafeature, rest_pcafeature, test_data, k=10):
    ndcg_scores = []
    
    for user_id in test_data['user_id'].unique():
        if user_id in user_pcafeature.index:
            sim_matrix = linear_kernel(user_pcafeature.loc[[user_id]], rest_pcafeature).flatten()
            predictions = pd.Series(sim_matrix, index=rest_pcafeature.index).sort_values(ascending=False)

            # Filter to the business_ids in test_data for this user
            true_relevance = test_data[test_data['user_id'] == user_id]
            # Ensure business_id is unique in true_relevance
            if not true_relevance['business_id'].is_unique:
                # Handle duplicates as needed, for example by averaging duplicate entries
                true_relevance = true_relevance.groupby('business_id')['stars'].mean().reset_index()
            true_relevance = true_relevance.set_index('business_id')['stars'].reindex(predictions.index).fillna(0)
            
            # Calculate nDCG
            ndcg_score = ndcg_at_k(true_relevance.values, predictions.values, k)
            ndcg_scores.append(ndcg_score)
    
    # Average nDCG score across all users
    avg_ndcg = np.mean(ndcg_scores)
    return avg_ndcg

# Assuming you have test_data DataFrame with columns ['user_id', 'business_id', 'stars']
user_pcafeature, rest_pcafeature = load_pca_features()
avg_ndcg_score = evaluate_ndcg(user_pcafeature, rest_pcafeature, test)
print(f"Average nDCG Score: {avg_ndcg_score}")


Average nDCG Score: 0.559138825007025


## Saving PCA Features and Profiles

After transforming the business review features and user preferences into lower-dimensional spaces using PCA, these new feature sets are saved for future use in recommendation tasks.


In [11]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import pickle  # Ensure pickle is imported for serialization

# Assuming `review` is defined elsewhere in your code and contains the review data

# Group reviews by business_id, concatenate text, and count reviews
rev_by_rest = review.groupby('business_id').agg(
    review_count=('review_id', 'count'),
    review_combined=('text', lambda texts: '###'.join(texts))
).reset_index()

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=1000)
X = vectorizer.fit_transform(rev_by_rest.review_combined)

# The feature names from TF-IDF
feature_names = vectorizer.get_feature_names_out()

# Create DataFrame with TF-IDF features
rest_revfeature = pd.DataFrame(X.toarray(), index=rev_by_rest.business_id, columns=feature_names)

# Dimensionality Reduction with PCA
pca = PCA(n_components=0.80)  # Keep components that explain 80% of the variance
rest_pcafeature = pca.fit_transform(rest_revfeature)
rest_pcafeature_df = pd.DataFrame(rest_pcafeature, index=rest_revfeature.index)

# Get the original feature names corresponding to the PCA components
original_feature_names = np.array(feature_names)[np.argsort(np.abs(pca.components_), axis=1)[:, ::-1]]


# Normalize restaurant feature vectors
rest_pcafeature_df = rest_pcafeature_df.div(np.linalg.norm(rest_pcafeature_df, axis=1), axis=0)

# Merge user ratings with restaurant PCA features
user_pcafeature = pd.merge(
    review[['user_id', 'business_id', 'stars']],
    rest_pcafeature_df,
    how='inner',
    left_on='business_id',
    right_index=True
).drop('business_id', axis=1)

# Scale PCA components by user ratings
for col in user_pcafeature.columns[2:]:  # Skip user_id and stars columns
    user_pcafeature[col] = user_pcafeature[col] * user_pcafeature['stars']

# Aggregate PCA components by user
user_pcafeature_df = user_pcafeature.groupby('user_id').sum().drop(columns='stars')

# Normalize user feature vectors
user_pcafeature_df = user_pcafeature_df.div(np.linalg.norm(user_pcafeature_df, axis=1), axis=0)




In [12]:
# Save the restaurant and user PCA features
# Save the restaurant and user PCA features
rest_pcafeature_df.to_pickle('rest_pcafeature_all.pkl')
user_pcafeature_df.to_pickle('user_pcafeature_all.pkl')

with open('./model/original_feature_names.pickle', 'wb') as f:
    pickle.dump(original_feature_names, f)

