Food Recommender System
==

Reference site : https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101

This notebook documents the development of a food recommender system for people who have a hard time deciding what to eat.

Contributor : Taenyun Kim



Things to Develop (After midterm...)
--


1. function to save **df_user_info**, **interactions_full_df**, **food_df** into .csv file.

1. function to recommend different items each round.

1. function to update food ratings.

1. function to crawl recipes from the web for the content-based filtering... *(might not be needed, but I personally think the model would perform better with both hashtag data and web-crawled recipe data.)*

1. hybrid recommender system.

1. create knowledge based filtering for mood-and-weather-based recommender

1. function to extract *n* food items that follow a Gaussian distribution and have new users rate them, to address the cold-start problem.

In [1]:
#import modules
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.utils import shuffle
from scipy.sparse.linalg import svds
from itertools import combinations 
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

Import Data
---

In [2]:
#import food rating data
df_food = pd.read_csv('data\\food_rating.csv')
df_food.head()

Unnamed: 0,타임스탬프,이름 ex) 홍길동,성별,만 나이 ex) 24,찜 (갈비찜/찜닭 등),갈비탕/설렁탕,곱창/막창,볶음밥,김치찌개,된장찌개,...,오므라이스,컵밥,브리또&타코,햄버거,샌드위치,치킨,혼밥은 주로 어떻게 하나요?,혼밥할 때 주로 먹는 음식 메뉴는 무엇인가요? 한 가지만 적어주세요.,(선택사항) 데이트를 하는 상황에서는 주로 무슨 음식을 즐겨 먹나요? 한 가지만 적어주세요.,친한 친구들 여럿이서 만나는 자리에서는 무슨 음식을 즐겨 먹나요? 한 가지만 적어 주세요.
0,10-6-2018 14:32:49,이영건,남자,24,2,4,4,3,2,4,...,3,2,4,5,5,4,집에서 음식을 해 먹는다.,볶음밥,파스타,
1,10-6-2018 15:18:04,성창민,남자,20,5,5,0,5,5,5,...,5,4,5,1,1,3,밖에서 사 먹는다.,덥밥,데이트안함,고기꾸어먹음
2,10-6-2018 15:19:56,윤혜진,여자,20,4,4,3,4,3,5,...,4,3,2,4,3,4,밖에서 사 먹는다.,알밥,파스타,떡볶이
3,10-6-2018 15:21:10,한상욱,남자,21,4,4,5,4,4,2,...,4,2,3,3,4,4,밖에서 사 먹는다.,제육덮밥,스테이크,막창구이
4,10-6-2018 15:21:55,황준원,남자,20,3,5,3,5,4,4,...,5,4,5,5,5,5,배달을 시킨다.,햄버거,양식(파스타),양 많은 것(닭갈비)


In [3]:
#set user id
df_food_reset_index = df_food.reset_index()
df_food_reset_index.head()

Unnamed: 0,index,타임스탬프,이름 ex) 홍길동,성별,만 나이 ex) 24,찜 (갈비찜/찜닭 등),갈비탕/설렁탕,곱창/막창,볶음밥,김치찌개,...,오므라이스,컵밥,브리또&타코,햄버거,샌드위치,치킨,혼밥은 주로 어떻게 하나요?,혼밥할 때 주로 먹는 음식 메뉴는 무엇인가요? 한 가지만 적어주세요.,(선택사항) 데이트를 하는 상황에서는 주로 무슨 음식을 즐겨 먹나요? 한 가지만 적어주세요.,친한 친구들 여럿이서 만나는 자리에서는 무슨 음식을 즐겨 먹나요? 한 가지만 적어 주세요.
0,0,10-6-2018 14:32:49,이영건,남자,24,2,4,4,3,2,...,3,2,4,5,5,4,집에서 음식을 해 먹는다.,볶음밥,파스타,
1,1,10-6-2018 15:18:04,성창민,남자,20,5,5,0,5,5,...,5,4,5,1,1,3,밖에서 사 먹는다.,덥밥,데이트안함,고기꾸어먹음
2,2,10-6-2018 15:19:56,윤혜진,여자,20,4,4,3,4,3,...,4,3,2,4,3,4,밖에서 사 먹는다.,알밥,파스타,떡볶이
3,3,10-6-2018 15:21:10,한상욱,남자,21,4,4,5,4,4,...,4,2,3,3,4,4,밖에서 사 먹는다.,제육덮밥,스테이크,막창구이
4,4,10-6-2018 15:21:55,황준원,남자,20,3,5,3,5,4,...,5,4,5,5,5,5,배달을 시킨다.,햄버거,양식(파스타),양 많은 것(닭갈비)


Divide Data into User information, and Rating
--

In [4]:
user_info_index = [0,1,2,3,4]  + list(range(54,58))
food_drop_index = list(range(0,5)) + list(range(54,58))

In [5]:
df_user_info = df_food_reset_index.iloc[:,user_info_index]
df_food_rating = df_food_reset_index.drop(axis =1,columns=df_food_reset_index.columns[food_drop_index])
df_food_rating.head()

Unnamed: 0,찜 (갈비찜/찜닭 등),갈비탕/설렁탕,곱창/막창,볶음밥,김치찌개,된장찌개,닭갈비,닭도리탕,불고기,냉면(물/비빔),...,카레/커리,김밥,분식(떡볶이/튀김/순대),라면,오므라이스,컵밥,브리또&타코,햄버거,샌드위치,치킨
0,2,4,4,3,2,4,3,1,3,4,...,3,3,3,4,3,2,4,5,5,4
1,5,5,0,5,5,5,2,2,5,2,...,3,3,1,1,5,4,5,1,1,3
2,4,4,3,4,3,5,3,2,4,3,...,3,2,4,4,4,3,2,4,3,4
3,4,4,5,4,4,2,4,5,4,4,...,3,3,5,3,4,2,3,3,4,4
4,3,5,3,5,4,4,5,3,5,5,...,5,5,4,5,5,4,5,5,5,5


In [6]:
df1 = pd.DataFrame([[1],[1],[1]])
df2 = pd.DataFrame([[2],[3]])

In [7]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df1 = pd.concat([df1, df2])
df1

Unnamed: 0,0
0,1
1,1
2,1
0,2
1,3


In [9]:
df_user_info.columns = ['personId','timestamp','userName','sex','age','aloneHow','eatAlone','eatDate','eatTogether']
#personId : unique user ID
#timestamp : time the survey was submitted
#userName : user's name
#sex : sex
#age : age
#aloneHow : "How do you usually eat when eating alone?"
#eatAlone : "What do you usually eat when eating alone? Name one dish."
#eatDate : "(Optional) What do you usually eat on a date? Name one dish."
#eatTogether : "What do you usually eat when meeting a group of close friends? Name one dish."

df_user_info.head()

Unnamed: 0,personId,timestamp,userName,sex,age,aloneHow,eatAlone,eatDate,eatTogether
0,0,10-6-2018 14:32:49,이영건,남자,24,집에서 음식을 해 먹는다.,볶음밥,파스타,
1,1,10-6-2018 15:18:04,성창민,남자,20,밖에서 사 먹는다.,덥밥,데이트안함,고기꾸어먹음
2,2,10-6-2018 15:19:56,윤혜진,여자,20,밖에서 사 먹는다.,알밥,파스타,떡볶이
3,3,10-6-2018 15:21:10,한상욱,남자,21,밖에서 사 먹는다.,제육덮밥,스테이크,막창구이
4,4,10-6-2018 15:21:55,황준원,남자,20,배달을 시킨다.,햄버거,양식(파스타),양 많은 것(닭갈비)


In [8]:
df_food_rating_stack = pd.DataFrame(df_food_rating.stack()).reset_index() 
df_food_rating_stack = shuffle(df_food_rating_stack)
df_food_rating_stack.columns =['personId','contentId','eventStrength']

Set Up New Variables for the Recommender System
--

In [9]:
#food_df 
#contentId : unique food ID
#foodName : food name for each ID

food_df = pd.DataFrame(df_food_rating_stack.contentId.unique()).reset_index()
food_df.columns = ['contentId','foodName'] 

In [10]:
#interactions_full_df 
#personId : unique user ID
#contentId : unique food ID
#eventStrength : the user's rating of the food

interactions_full_with_zeros_df = df_food_rating_stack.copy().reset_index(drop=True)

for food in range(len(food_df.foodName)):
    interactions_full_with_zeros_df.loc[interactions_full_with_zeros_df.contentId == food_df.foodName[food],'contentId'] = food_df.contentId[food]
interactions_df  = interactions_full_with_zeros_df[interactions_full_with_zeros_df['eventStrength'] != 0]

Save DataFrames
--

In [11]:
df_user_info.to_csv('data\\info\\df_user_info.csv',index=False)
food_df.to_csv('data\\info\\food_df.csv',index=False)
interactions_df.to_csv('data\\info\\interactions_df.csv',index=False)


Users with at Least 5 Interactions
--
Recommender systems suffer from a problem known as user cold-start: it is hard to provide personalized recommendations for users who have consumed no items, or very few, because there is too little information to model their preferences.
For this reason, we keep only users with at least 5 interactions in the dataset.

In [12]:
users_interactions_count_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()
print('# users: %d' % len(users_interactions_count_df))
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]
print('# users with at least 5 interactions: %d' % len(users_with_enough_interactions_df))

# users: 159
# users with at least 5 interactions: 159


In [13]:
print('# of interactions: %d' % len(interactions_df))
interactions_from_selected_users_df = interactions_df.merge(users_with_enough_interactions_df, 
               how = 'right',
               left_on = 'personId',
               right_on = 'personId')
print('# of interactions from users with at least 5 interactions: %d' % len(interactions_from_selected_users_df))

# of interactions: 7693
# of interactions from users with at least 5 interactions: 7693


In [14]:
#Users may rate a food more than once, possibly with different ratings.
#To model a user's interest in a given food, we aggregate all the ratings the user gave it
#and apply a log transformation to smooth the distribution.

#Note: the aggregation was originally described as a weighted sum; the code below uses the mean instead.

def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = interactions_from_selected_users_df \
                    .groupby(['personId', 'contentId'])['eventStrength'].mean() \
                    .apply(smooth_user_preference).reset_index()
print('# of unique user/item interactions: %d' % len(interactions_full_df))
interactions_full_df.head(10)

# of unique user/item interactions: 7693


Unnamed: 0,personId,contentId,eventStrength
0,0,0,2.584963
1,0,1,2.321928
2,0,2,2.584963
3,0,3,1.584963
4,0,4,2.0
5,0,5,2.0
6,0,6,2.0
7,0,7,1.0
8,0,8,2.0
9,0,9,2.584963
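
To make the smoothing step concrete, here is a small sketch that applies the same `log2(1 + x)` transform to each possible rating. It assumes ratings run from 1 to 5 (ratings of 0 were dropped earlier); the `smoothed` dictionary is introduced only for illustration.

```python
import math

def smooth_user_preference(x):
    # Same transform as in the cell above: log base 2 of (1 + rating).
    return math.log(1 + x, 2)

# Assumed rating scale: 1 to 5 (0 was filtered out as "no interaction").
smoothed = {rating: round(smooth_user_preference(rating), 6) for rating in range(1, 6)}
for rating, value in smoothed.items():
    print(rating, value)
```

The five resulting values are the only ones that appear in the `eventStrength` column above, which is a quick way to check that each user rated each food once.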


Model Evaluation
--

Evaluation is important for machine learning projects because it allows different algorithms and hyperparameter choices to be compared objectively.
One key aspect of evaluation is ensuring that the trained model generalizes to data it was not trained on, using cross-validation techniques. Here we use a simple approach named holdout, in which a random data sample (20% in this case) is kept aside during training and used exclusively for evaluation. All evaluation metrics reported here are computed on the test set.

P.S. **A more robust evaluation approach** would be to split the train and test sets by a **reference date**, where **the train set** is composed of **all interactions before that date** and **the test set** of interactions **after that date**. For the sake of simplicity, we chose the random approach for this notebook, but you may want to try the second approach to better simulate how the recommender would perform in production when predicting users' "future" interactions.
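
A reference-date split could look roughly like the sketch below. It is not used in this notebook: the toy DataFrame, its per-interaction `timestamp` column, and the helper name `split_by_reference_date` are all assumptions made for illustration (the survey data has one timestamp per respondent, not per rating).

```python
import pandas as pd

# Toy interaction log; the per-interaction 'timestamp' column is hypothetical.
toy_interactions_df = pd.DataFrame({
    'personId':      [0, 0, 1, 1, 2, 2],
    'contentId':     [3, 7, 3, 5, 7, 9],
    'eventStrength': [2.0, 1.0, 2.3, 1.6, 2.6, 2.0],
    'timestamp': pd.to_datetime([
        '2018-10-01', '2018-10-05', '2018-10-02',
        '2018-10-08', '2018-10-03', '2018-10-09',
    ]),
})

def split_by_reference_date(interactions_df, reference_date):
    """Train on interactions before reference_date, test on the rest."""
    reference_date = pd.Timestamp(reference_date)
    is_train = interactions_df['timestamp'] < reference_date
    return interactions_df[is_train], interactions_df[~is_train]

toy_train_df, toy_test_df = split_by_reference_date(toy_interactions_df, '2018-10-06')
print(len(toy_train_df), len(toy_test_df))
```

Unlike the stratified random split, this one does not guarantee that every user appears in both sets, so users with no training interactions would need separate handling.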

In [14]:
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['personId'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 6154
# interactions on Test set: 1539


In [15]:
#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('personId')
interactions_train_indexed_df = interactions_train_df.set_index('personId')
interactions_test_indexed_df = interactions_test_df.set_index('personId')

In [16]:
def get_items_interacted(person_id, interactions_df):
    # Get the set of item ids the user has interacted with.
    interacted_items = interactions_df.loc[person_id]['contentId']
    return set(interacted_items if isinstance(interacted_items, pd.Series) else [interacted_items])

In [17]:
#Top-N accuracy metrics consts

#EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 0

class ModelEvaluator:
    
    def set_random_sample_non_interacted(self,EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS):
        self.EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS


    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df)
        all_items = set(food_df['contentId'])
        non_interacted_items = all_items - interacted_items

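        random.seed(seed)
        # random.sample() needs a sequence (sets are rejected from Python 3.11 on);
        # sorting also keeps the sample reproducible for a given seed.
        non_interacted_items_sample = random.sample(sorted(non_interacted_items), sample_size)
        return set(non_interacted_items_sample)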

    def _verify_hit_top_n(self, item_id, recommended_items, topn):
        try:
            index = next(i for i, c in enumerate(recommended_items) if c == item_id)
        except StopIteration:
            # The item is not in the recommendation list at all
            index = -1
        hit = int(index in range(0, topn))
        return hit, index

    def evaluate_model_for_user(self, model, person_id):
        #Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[person_id]
        if type(interacted_values_testset['contentId']) == pd.core.series.Series:
            person_interacted_items_testset = set(interacted_values_testset['contentId'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['contentId'])])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id, 
                                                                                    interactions_train_indexed_df), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            #Getting a random sample of items the user has not interacted with
            #(assumed to be not relevant to the user); the sample size is set by set_random_sample_non_interacted()
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=self.EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=item_id%(2**32))
            #Combining the current interacted item with the sampled non-interacted items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            #Filtering only recommendations that are either the interacted item or from the sample of non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['contentId'].values
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()   
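
The hit test at the core of the evaluator can be sketched in isolation as follows. This is a simplified stand-in for `_verify_hit_top_n` plus the filtering step, not a replacement for the class; the function name `hit_at_n` and all item ids below are hypothetical.

```python
def hit_at_n(item_id, ranked_items, candidate_items, topn):
    """Return 1 if item_id ranks in the top-n among the candidates, else 0.

    ranked_items: all item ids ordered from most to least recommended.
    candidate_items: the held-out item plus the sampled non-interacted items.
    """
    filtered = [item for item in ranked_items if item in candidate_items]
    if item_id not in filtered:
        return 0
    return int(filtered.index(item_id) < topn)

ranked = [17, 22, 39, 7, 18, 6, 36]   # hypothetical ranking for one user
test_items = [18, 36]                 # hypothetical held-out items
negatives = {17, 22, 39, 7, 6}        # hypothetical non-interacted sample

hits_at_3 = sum(hit_at_n(item, ranked, negatives | {item}, topn=3) for item in test_items)
hits_at_5 = sum(hit_at_n(item, ranked, negatives | {item}, topn=5) for item in test_items)
recall_at_3 = hits_at_3 / len(test_items)
recall_at_5 = hits_at_5 / len(test_items)

# With an empty negative sample each held-out item is ranked alone.
recall_no_negatives = sum(hit_at_n(item, ranked, {item}, topn=5) for item in test_items) / len(test_items)
print(recall_at_3, recall_at_5, recall_no_negatives)
```

The last line illustrates a design consequence: when the non-interacted sample size is 0, every held-out item that the model scores at all is a hit, so recall says nothing about ranking quality.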

Popularity-Based Recommender
==

In [18]:
df_food_grouped = interactions_full_df.groupby(['contentId']).agg({'eventStrength': 'sum'}).reset_index()
grouped_rating = interactions_full_df.groupby(['contentId']).agg({'personId': 'count'}).reset_index()
df_food_grouped['eventStrength']  = df_food_grouped['eventStrength'].div(grouped_rating['personId'])
df_food_ranking = df_food_grouped.sort_values(['eventStrength', 'contentId'], ascending = [0,1])

df_food_ranking['foodName'] = ''

for food in range(len(food_df.contentId)):
    df_food_ranking.loc[df_food_ranking.contentId == food_df.contentId[food],'foodName'] = food_df.foodName[food]

df_food_ranking.index = range(1,len(df_food_ranking)+1)

Most Favored 10 Foods
--

In [19]:
df_food_ranking.head(10)

Unnamed: 0,contentId,eventStrength,foodName
1,17,2.40147,삼겹살(구이)
2,22,2.396908,초밥
3,39,2.393494,스테이크
4,7,2.378197,치킨
5,18,2.344726,파스타
6,6,2.34421,수육/보쌈
7,36,2.316628,회(사시미)
8,16,2.298451,피자
9,45,2.285019,분식(떡볶이/튀김/순대)
10,34,2.284148,닭갈비


Least Favored 10 Foods
--

In [20]:
df_food_ranking.tail(10)

Unnamed: 0,contentId,eventStrength,foodName
40,11,2.11305,삼계탕
41,31,2.10899,김밥
42,29,2.099598,베트남 쌀국수
43,35,2.082919,샌드위치
44,20,2.069607,비빔밥
45,2,2.056446,수제비
46,0,2.036338,함박스테이크
47,19,2.032874,월남쌈
48,12,1.968258,컵밥
49,47,1.806685,콩국수


Food Recommender
==

Collaborative Filtering Model
==

This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.
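
The assumption above can be illustrated with a tiny neighbor-based sketch. Note that this is not the method used in this notebook (which factorizes the user-item matrix with SVD); the ratings matrix and the similarity-weighted average are assumptions chosen only to show the "A agrees with B" idea.

```python
import numpy as np

# Rows are hypothetical users A, B, C; columns are four foods. 0 marks "not rated".
ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],   # A has not rated the last food
    [5.0, 4.0, 1.0, 5.0],   # B agrees with A on the shared foods
    [1.0, 2.0, 5.0, 1.0],   # C disagrees with A
])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

shared = slice(0, 3)  # compare users only on the foods that A has rated
sim_ab = cosine(ratings[0, shared], ratings[1, shared])
sim_ac = cosine(ratings[0, shared], ratings[2, shared])

# Similarity-weighted average of B's and C's ratings for the food A has not rated.
predicted_for_a = (sim_ab * ratings[1, 3] + sim_ac * ratings[2, 3]) / (sim_ab + sim_ac)
print(sim_ab, sim_ac, predicted_for_a)
```

Because A is far more similar to B than to C, the prediction for A lands closer to B's rating of 5 than to C's rating of 1.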

In [21]:
users_items_pivot_matrix_df = interactions_train_df.pivot(index='personId', 
                                                          columns='contentId', 
                                                          values='eventStrength').fillna(0)

users_items_pivot_matrix_df.head(10)

personId \ contentId,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
0,2.0,0.0,1.0,0.0,2.321928,2.0,2.584963,2.321928,0.0,2.584963,...,2.584963,1.584963,2.0,1.584963,2.321928,2.0,2.0,1.584963,2.0,1.584963
1,0.0,1.584963,1.584963,2.584963,1.584963,1.0,0.0,0.0,0.0,2.321928,...,1.584963,2.584963,1.0,1.584963,2.584963,2.584963,1.0,0.0,2.584963,2.584963
2,2.0,2.0,2.321928,1.584963,2.0,2.0,2.321928,2.321928,0.0,2.321928,...,2.584963,2.0,2.0,2.0,0.0,0.0,2.321928,2.321928,2.321928,2.321928
3,2.0,0.0,2.321928,2.0,2.321928,2.321928,2.321928,2.321928,1.584963,2.584963,...,2.584963,0.0,2.584963,2.0,0.0,1.0,0.0,2.0,0.0,2.321928
4,2.584963,2.321928,1.584963,2.584963,2.584963,2.584963,0.0,2.584963,2.321928,2.584963,...,0.0,0.0,2.321928,0.0,2.584963,2.321928,2.321928,2.584963,1.584963,2.0
5,2.0,1.0,2.321928,2.321928,2.321928,2.584963,2.584963,2.584963,2.584963,2.584963,...,0.0,2.321928,2.321928,1.584963,0.0,0.0,2.584963,2.584963,1.584963,1.584963
6,2.0,2.0,2.321928,0.0,2.584963,0.0,2.321928,2.584963,2.321928,2.0,...,2.0,2.584963,2.0,2.0,0.0,2.321928,2.321928,2.584963,0.0,2.584963
7,0.0,2.321928,2.321928,2.321928,2.321928,2.321928,0.0,0.0,2.321928,2.321928,...,2.584963,2.321928,2.321928,2.321928,2.321928,0.0,2.321928,2.584963,2.321928,2.321928
8,1.0,2.321928,2.584963,1.584963,2.0,1.584963,2.321928,2.0,2.584963,1.584963,...,2.0,2.321928,2.584963,0.0,0.0,2.584963,2.0,0.0,2.0,2.0
9,0.0,2.0,1.584963,0.0,0.0,2.321928,2.0,1.584963,2.321928,0.0,...,2.584963,2.321928,2.0,2.321928,2.0,2.584963,2.584963,2.321928,1.584963,0.0


In [22]:
#The number of factors to factor the user-item matrix.
users_items_pivot_matrix = users_items_pivot_matrix_df.values
NUMBER_OF_FACTORS_MF = 20
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)
sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)

print('Users items pivot matrix shape : ', users_items_pivot_matrix.shape, 
      '\n\n---------------------------------------------------\n',
      '\nU shape : ', U.shape,
      '\nVt shape : ', Vt.shape,
      '\nSigma : ', sigma.shape,
      '\n\n====================================================',
      '\n\nU*sigma*Vt ='
     '\n\nAll user predicted ratings : ', all_user_predicted_ratings.shape)

Users items pivot matrix shape :  (159, 49) 

---------------------------------------------------
 
U shape :  (159, 20) 
Vt shape :  (20, 49) 
Sigma :  (20, 20) 

====================================================

U*sigma*Vt =

All user predicted ratings :  (159, 49)
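
The shapes printed above follow from how a truncated SVD works. The sketch below repeats the same steps on a small random matrix and cross-checks the result against a full SVD; the matrix size, the seed, and `k = 2` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 5))      # 6 hypothetical users x 5 hypothetical items

k = 2                                # svds needs 1 <= k < min(toy_matrix.shape)
U, sigma, Vt = svds(toy_matrix, k=k)
approx = U @ np.diag(sigma) @ Vt     # rank-k reconstruction, same shape as the input

# Cross-check: svds should return the k largest singular values of the matrix.
full_sigma = np.linalg.svd(toy_matrix, compute_uv=False)
print(U.shape, sigma.shape, Vt.shape, approx.shape)
print(np.allclose(np.sort(sigma), np.sort(full_sigma)[-k:]))
```

A smaller `k` gives a coarser reconstruction that generalizes more across users, while a `k` close to the matrix rank mostly reproduces the training ratings.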


In [23]:
item_ids = food_df['contentId'].tolist()

In [24]:
#randomise
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['personId'], 
                                   test_size=0.20)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('personId')
interactions_train_indexed_df = interactions_train_df.set_index('personId')
interactions_test_indexed_df = interactions_test_df.set_index('personId')

#Creating a sparse pivot table with users in rows and items in columns
users_items_pivot_matrix_df = interactions_train_df.pivot(index='personId', 
                                                          columns='contentId', 
                                                          values='eventStrength').fillna(0)

users_items_pivot_matrix = users_items_pivot_matrix_df.values
users_ids = list(users_items_pivot_matrix_df.index)
 
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

sigma = np.diag(sigma)

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 

#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = users_items_pivot_matrix_df.columns, index=users_ids).transpose()

# interactions on Train set: 6154
# interactions on Test set: 1539


In [None]:
# Draft helper (never executed in this notebook); the function name and
# signature are placeholders added so that the cell is valid Python.
def build_cf_predictions(interactions_full_df, number_of_factors_mf=15):
    #Random train/test split (no random_state, so every call gives a different split)
    interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                               stratify=interactions_full_df['personId'], 
                               test_size=0.20)

    #Creating a pivot table with users in rows and items in columns
    users_items_pivot_matrix_df = interactions_train_df.pivot(index='personId', 
                                                      columns='contentId', 
                                                      values='eventStrength').fillna(0)

    users_items_pivot_matrix = users_items_pivot_matrix_df.values
    users_ids = list(users_items_pivot_matrix_df.index)

    #Performs matrix factorization of the original user item matrix
    U, sigma, Vt = svds(users_items_pivot_matrix, k = number_of_factors_mf)
    sigma = np.diag(sigma)
    all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 

    #Converting the reconstructed matrix back to a Pandas dataframe
    cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = users_items_pivot_matrix_df.columns, index=users_ids).transpose()
    return cf_preds_df

In [25]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={user_id: 'recStrength'})

        # Recommend the foods with the highest predicted rating that are not in items_to_ignore.
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['contentId'].isin(items_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'foodName']]


        return recommendations_df
    

cf_recommender_model = CFRecommender(cf_preds_df, food_df)


In [26]:
class CFRecommenderRandom:
    
    MODEL_NAME = 'Collaborative Filtering Random'
    
    def __init__(self, interactions_full_df, items_df=None,number_of_factors_mf = 15):
        
        
        #randomise
        interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['personId'], 
                                   test_size=0.20)
        
        #Indexing by personId to speed up the searches during evaluation
        interactions_full_indexed_df = interactions_full_df.set_index('personId')
        interactions_train_indexed_df = interactions_train_df.set_index('personId')
        interactions_test_indexed_df = interactions_test_df.set_index('personId')
        
        #Creating a sparse pivot table with users in rows and items in columns
        users_items_pivot_matrix_df = interactions_train_df.pivot(index='personId', 
                                                          columns='contentId', 
                                                          values='eventStrength').fillna(0)
        
        users_items_pivot_matrix = users_items_pivot_matrix_df.values
        users_ids = list(users_items_pivot_matrix_df.index)
        
        #The number of factors to factor the user-item matrix.
        NUMBER_OF_FACTORS_MF = number_of_factors_mf

        
        #Performs matrix factorization of the original user item matrix
        
        U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

        sigma = np.diag(sigma)

        all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 


        #Converting the reconstructed matrix back to a Pandas dataframe

        cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = users_items_pivot_matrix_df.columns, index=users_ids).transpose()

        self.cf_predictions_df = cf_preds_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={user_id: 'recStrength'})

        # Recommend the foods with the highest predicted rating that are not in items_to_ignore.
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['contentId'].isin(items_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'foodName']]


        return recommendations_df

In [27]:
# randomized collaborative filtering recommender

personNum = int(input('PersonId를 입력해주세요 : '))

cf_recommender_model_random = CFRecommenderRandom(interactions_full_df, food_df)
cf_model_random = cf_recommender_model_random.recommend_items(personNum,verbose=True)

for food in range(len(food_df.contentId)):
    cf_model_random.loc[cf_model_random.contentId == food_df.contentId[food],'foodName'] = food_df.foodName[food]

cf_model_random.index = range(1,len(cf_model_random)+1)

cf_model_random

PersonId를 입력해주세요 :  100


Unnamed: 0,recStrength,contentId,foodName
1,3.562691,8,된장찌개
2,3.227592,36,회(사시미)
3,3.122075,20,비빔밥
4,2.717711,32,잔치국수
5,2.617616,1,우동
6,2.582844,37,냉모밀
7,2.519314,27,볶음밥
8,2.498544,19,월남쌈
9,2.49226,15,돈부리(일본식 덮밥)
10,2.450813,44,게장


In [28]:
personNum = int(input('PersonId를 입력해주세요 : '))

cf_model = cf_recommender_model.recommend_items(personNum,verbose=True)

for food in range(len(food_df.contentId)):
    cf_model.loc[cf_model.contentId == food_df.contentId[food],'foodName'] = food_df.foodName[food]

cf_model.index = range(1,len(cf_model)+1)

cf_model

PersonId를 입력해주세요 :  100


Unnamed: 0,recStrength,contentId,foodName
1,3.252363,17,삼겹살(구이)
2,2.836569,7,치킨
3,2.780454,30,닭도리탕
4,2.755653,5,돈까스
5,2.656902,8,된장찌개
6,2.630919,16,피자
7,2.625764,26,갈비탕/설렁탕
8,2.608113,15,돈부리(일본식 덮밥)
9,2.594904,20,비빔밥
10,2.565736,0,함박스테이크


In [29]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
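# With a sample size of 0, each held-out item is ranked against no other candidates,
# so any item the model scores at all counts as a hit and recall comes out as 1.0.
model_evaluator.set_random_sample_non_interacted(0)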
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...
158 users processed

Global metrics:
{'modelName': 'Collaborative Filtering', 'recall@5': 1.0, 'recall@10': 1.0}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
0,94,10,10,10,1.0,1.0
64,131,10,10,10,1.0,1.0
104,136,10,10,10,1.0,1.0
102,117,10,10,10,1.0,1.0
101,25,10,10,10,1.0,1.0
100,85,10,10,10,1.0,1.0
99,39,10,10,10,1.0,1.0
98,42,10,10,10,1.0,1.0
97,64,10,10,10,1.0,1.0
96,149,10,10,10,1.0,1.0


Content-based Filtering
==

This method uses only information about the description and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended.
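
As a minimal sketch of "recommend items similar to what the user liked", the example below compares three made-up item descriptions with TF-IDF and cosine similarity. The English descriptions are hypothetical; the notebook itself currently has only food names to vectorize.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions, used only for illustration.
descriptions = [
    'spicy pork stew with kimchi',      # item 0
    'spicy tofu stew with kimchi',      # item 1
    'cheese pizza with tomato sauce',   # item 2
]

tfidf = TfidfVectorizer().fit_transform(descriptions)
similarity = cosine_similarity(tfidf)   # 3 x 3 matrix of pairwise item similarities

liked_item = 0
scores = similarity[liked_item].copy()
scores[liked_item] = -1.0               # never recommend the item the user already has
best_match = int(scores.argmax())
print(best_match)
```

With richer text per item (for example crawled recipes), the same comparison would have more terms to work with than the short food names alone.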



**Model-based approach**

In this approach, models are developed using different machine learning algorithms to recommend items to users. There are many model-based CF algorithms, such as neural networks, Bayesian networks, clustering models, and latent factor models such as Singular Value Decomposition (SVD) and probabilistic latent semantic analysis.

In [30]:
with open('data\\stopwords-ko.txt', 'r',encoding='utf-8') as f:
    lines = f.readlines()
    stopwords_ko = list(map(lambda x: x.rstrip('\n'),lines))

In [31]:
stopwords_list = stopwords_ko
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.003,
                     max_df=0.5,
                     max_features=5000,
                     stop_words=stopwords_list)

item_ids = food_df['contentId'].tolist()
tfidf_matrix = vectorizer.fit_transform(food_df['foodName'])
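#foodName, foodDescription!!!
#get_feature_names() was removed in scikit-learn 1.2; use it instead of get_feature_names_out() only on older versions
tfidf_feature_names = vectorizer.get_feature_names_out()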
tfidf_matrix

<49x81 sparse matrix of type '<class 'numpy.float64'>'
	with 81 stored elements in Compressed Sparse Row format>

In [32]:
pd.DataFrame(tfidf_matrix.todense(),columns = tfidf_feature_names).head()

Unnamed: 0,갈비찜,갈비찜 찜닭,갈비탕,갈비탕 설렁탕,게장,곱창,곱창 막창,구이,김밥,김치찌개,...,콩국수,타코,튀김,튀김 순대,파스타,피자,함박스테이크,해장국,해장국 뼈해장국,햄버거
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
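
The column names in the table above mix single words and word pairs because of `ngram_range=(1, 2)`. A small sketch on three hypothetical cleaned food names shows how the vocabulary is built and why a name with two words gets weights of about 0.57735 (three equally rare features, L2-normalized to 1/sqrt(3)):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical food names; separators such as '/' or '&' are not word
# characters, so the default tokenizer would split on them as it does on spaces.
names = ['갈비찜 찜닭', '김치찌개', '곱창 막창']

vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
matrix = vec.fit_transform(names)

# vocabulary_ maps each unigram or bigram to its column index.
features = sorted(vec.vocabulary_)
print(matrix.shape)
print(features)

# Row 0 has three features (two unigrams and one bigram), each with the same weight.
print(matrix[0].toarray().round(5))
```

A one-word name such as '김치찌개' produces a single feature with weight 1.0, which is consistent with the 1.0 entries in the table.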


In [33]:
def get_item_profile(item_id):
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile

def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles

def build_users_profile(person_id, interactions_indexed_df):
    interactions_person_df = interactions_indexed_df.loc[person_id]
    user_item_profiles = get_item_profiles(interactions_person_df['contentId'])
    
    user_item_strengths = np.array(interactions_person_df['eventStrength']).reshape(-1,1)
    #Weighted average of item profiles by the interactions strength
    user_item_strengths_weighted_avg = np.sum(user_item_profiles.multiply(user_item_strengths), axis=0) / np.sum(user_item_strengths)
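    #np.asarray converts the np.matrix returned by the sparse sum, which recent scikit-learn versions reject
    user_profile_norm = sklearn.preprocessing.normalize(np.asarray(user_item_strengths_weighted_avg))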
    return user_profile_norm

def build_users_profiles(): 
    interactions_indexed_df = interactions_full_df[interactions_full_df['contentId'] \
                                                   .isin(food_df['contentId'])].set_index('personId')
    user_profiles = {}
    for person_id in interactions_indexed_df.index.unique():
        user_profiles[person_id] = build_users_profile(person_id, interactions_indexed_df)
    return user_profiles

In [34]:
user_profiles = build_users_profiles()
len(user_profiles)

159
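
The user profile built above is a rating-weighted average of item profiles, scaled to unit length. A worked example with three made-up item profiles and ratings (dense NumPy arrays here instead of the sparse TF-IDF rows, purely for readability):

```python
import numpy as np

# Hypothetical item profiles: one row per item the user rated.
item_profiles = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# The user's ratings (eventStrength) for those items, as a column vector.
strengths = np.array([5.0, 3.0, 0.0]).reshape(-1, 1)

# Weighted average: each item profile counts in proportion to its rating.
weighted_avg = (item_profiles * strengths).sum(axis=0) / strengths.sum()

# L2-normalize so profiles of heavy and light raters are comparable.
profile = weighted_avg / np.linalg.norm(weighted_avg)
print(weighted_avg)        # [0.625 0.375 0.   ]
print(profile.round(3))
```

An item rated 0 contributes nothing to the profile, so foods the user has never tried do not influence the recommendations.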

In [35]:
myuserid = int(input('Enter a personId: '))
myprofile = user_profiles[myuserid]
print(myprofile.shape)
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        myprofile.flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

Enter a personId:  90


(1, 81)


Unnamed: 0,token,relevance
0,닭갈비,0.16115
1,돈까스,0.16115
2,볶음밥,0.16115
3,사시미,0.16115
4,샌드위치,0.16115
5,스테이크,0.16115
6,제육볶음,0.16115
7,피자,0.16115
8,김치찌개,0.144752
9,닭도리탕,0.144752


In [36]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, items_df=None):
        self.item_ids = item_ids
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_items_to_user_profile(self, person_id, topn=1000):
        #Computes the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(user_profiles[person_id], tfidf_matrix)
        #Gets the top similar items
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        #Sort the similar items by similarity
        similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_items
        
    def recommend_items(self, user_id, items_to_ignore=None, topn=10, verbose=False):
        #Avoid a mutable default argument
        items_to_ignore = items_to_ignore if items_to_ignore is not None else []
        similar_items = self._get_similar_items_to_user_profile(user_id)
        #Ignores items the user has already interacted with
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        
        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['contentId', 'recStrength']) \
                                    .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'foodName']]


        return recommendations_df
    
content_based_recommender_model = ContentBasedRecommender(food_df)

In [37]:
personNum = int(input('Enter a personId: '))

cb_model = content_based_recommender_model.recommend_items(personNum,verbose=True)

#Refresh foodName from food_df (the verbose merge should already have attached it)
for food in range(len(food_df.contentId)):
    cb_model.loc[cb_model.contentId == food_df.contentId[food],'foodName'] = food_df.foodName[food]

#Number the recommendations from 1
cb_model.index = range(1,len(cb_model)+1)

cb_model

Enter a personId:  90


Unnamed: 0,recStrength,contentId,foodName
1,0.16115,27,볶음밥
2,0.16115,5,돈까스
3,0.16115,36,회(사시미)
4,0.16115,35,샌드위치
5,0.16115,34,닭갈비
6,0.16115,33,제육볶음
7,0.16115,16,피자
8,0.16115,39,스테이크
9,0.16115,3,브리또&타코
10,0.16115,14,곱창/막창


In [38]:
model_evaluator.set_random_sample_non_interacted(0)
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model...
158 users processed

Global metrics:
{'modelName': 'Content-Based', 'recall@5': 1.0, 'recall@10': 1.0}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
0,94,10,10,10,1.0,1.0
64,131,10,10,10,1.0,1.0
104,136,10,10,10,1.0,1.0
102,117,10,10,10,1.0,1.0
101,25,10,10,10,1.0,1.0
100,85,10,10,10,1.0,1.0
99,39,10,10,10,1.0,1.0
98,42,10,10,10,1.0,1.0
97,64,10,10,10,1.0,1.0
96,149,10,10,10,1.0,1.0
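
A recall of exactly 1.0 for every user deserves a second look. The evaluator is defined earlier in the notebook and is not repeated here, so the following is only a sketch under the assumption that it works like the one in the reference Kaggle notebook: each held-out item is ranked against a random sample of items the user never interacted with, and a hit is counted when it lands in the top k. If that assumption holds, a sample size of 0 leaves the held-out item competing only with itself. All names and scores below are hypothetical.

```python
import random

def hit_at_k(scores, held_out_item, non_interacted, n_negatives, k, rng):
    """Rank one held-out item against n_negatives sampled non-interacted items."""
    negatives = rng.sample(sorted(non_interacted), n_negatives)
    candidates = negatives + [held_out_item]
    ranked = sorted(candidates, key=lambda item: -scores[item])
    return int(held_out_item in ranked[:k])

rng = random.Random(0)
scores = {item: 1.0 / (item + 1) for item in range(20)}  # hypothetical model scores
non_interacted = set(range(15))   # items the user never rated
held_out = 19                     # the item with the lowest score of all

# With 10 negatives, the poorly scored held-out item misses the top 5.
hit_with_negatives = hit_at_k(scores, held_out, non_interacted, 10, 5, rng)

# With 0 negatives, the only candidate is the held-out item, so it always hits.
hit_without_negatives = hit_at_k(scores, held_out, non_interacted, 0, 5, rng)

print(hit_with_negatives, hit_without_negatives)  # 0 1
```

If the evaluator does behave this way, rerunning it with a non-zero sample size would give a more informative comparison between the two models.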


When Several People Eat Together
--
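
The group recommendation in this section widens every person's top-n list one step at a time until the lists share a food. The sketch below shows that search on plain Python lists; the function name, the tie-break by total rank, and the sample rankings are assumptions made for this example.

```python
def first_common_item(ranked_lists):
    """Return (item, top) for the smallest top-n in which all lists share an item."""
    max_len = max(len(ranked) for ranked in ranked_lists)
    for top in range(1, max_len + 1):
        # Intersect everybody's current top-n.
        common = set(ranked_lists[0][:top])
        for ranked in ranked_lists[1:]:
            common &= set(ranked[:top])
        if common:
            # Assumed tie-break: smallest sum of ranks, then smallest item.
            best = min(common,
                       key=lambda item: (sum(r.index(item) for r in ranked_lists), item))
            return best, top
    return None, None

# Hypothetical rankings for three people, best food first.
rankings = [
    ['a', 'b', 'c', 'd'],
    ['c', 'a', 'd', 'b'],
    ['d', 'c', 'b', 'a'],
]
result = first_common_item(rankings)
print(result)  # ('c', 3)
```

Bounding the loop by the longest list guarantees termination even when the lists never overlap.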

In [39]:
def joint_recommender_model(recommender_model, *args):
    
    model = recommender_model
    personId_list = list(args)
    
    #Find the first food shared by everyone, widening each person's list from top 1 onward.
    top = 1
    common_content = []
    while not common_content:
        
        model_list = [model.recommend_items(arg, verbose=True, topn=top) for arg in personId_list]
        
        common_content = set(model_list[0].contentId)
        for user in model_list:
            common_content = common_content.intersection(user.contentId)
        
        #Stop once no list can grow any further, so the loop cannot run forever
        if not common_content and all(len(user) < top for user in model_list):
            print('No food is shared by all', len(personId_list), 'people.')
            return
        
        top = top + 1
    
    #If several foods are shared at the same depth, take the one with the smallest contentId
    common_id = sorted(common_content)[0]
    joint_content_df = food_df[food_df.contentId == common_id]
    
    print('Recommended food for', len(personId_list), 'people eating together: [', list(joint_content_df.foodName)[0], ']!!!\n')
    
    for person_id, user in zip(personId_list, model_list):
        #Rank of the shared food in this person's list (1 = favorite)
        rank = user[user.contentId == common_id].index[0] + 1
        
        print('UserId', person_id, 'ranks this food number', rank, 'among their favorites!')

In [40]:
a = [1,2,4]

In [41]:
list(map(lambda x : str(x), a))

['1', '2', '4']

In [42]:
#Recommend a food for people eating together

valid_ids = [str(x) for x in df_user_info.personId]
rec_user_list = []

prompt = '\nEnter a personId (END to finish): '
retry_prompt = 'Enter a valid personId (END to finish): '
next_prompt = prompt

while True:
    get_user = input(next_prompt)
    
    if get_user == 'END':
        break
    
    if get_user not in valid_ids:
        print('\n\n[[[Unknown personId!!!]]]\n')
        next_prompt = retry_prompt
    elif int(get_user) in rec_user_list:
        print('\n\n[[[Duplicate personId!!!]]]\n')
        next_prompt = retry_prompt
    else:
        rec_user_list.append(int(get_user))
        next_prompt = prompt
    
print('\n\n---------------------------------------------------------------------------\n\n')

cf_recommender_model_random = CFRecommenderRandom(interactions_full_df, food_df)

joint_recommender_model(cf_recommender_model_random ,*rec_user_list)


Enter a personId (END to finish):  100

Enter a personId (END to finish):  200




[[[Unknown personId!!!]]]



Enter a valid personId (END to finish):  54

Enter a personId (END to finish):  100




[[[Duplicate personId!!!]]]



Enter a valid personId (END to finish):  0

Enter a personId (END to finish):  END




---------------------------------------------------------------------------


Recommended food for 3 people eating together: [ 수육/보쌈 ]!!!

UserId 100 ranks this food number 8 among their favorites!
UserId 54 ranks this food number 3 among their favorites!
UserId 0 ranks this food number 7 among their favorites!


In [34]:

#Add new interaction
def new_interaction(old_interaction_df, personId, contentId, eventStrength):
    new_interaction_df = pd.DataFrame([[personId,contentId,eventStrength]],columns = ['personId','contentId','eventStrength'])
    #DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    old_interaction_df = pd.concat([old_interaction_df, new_interaction_df], ignore_index=True)
    return old_interaction_df

#Add new food
def new_food(old_food_df,contentId,foodName):
    new_food_df = pd.DataFrame([[contentId,foodName]], columns = ['contentId','foodName'])
    old_food_df = pd.concat([old_food_df, new_food_df], ignore_index=True)
    return old_food_df


#Add new users
def new_user_info(old_user_info_df, personId,timestamp,userName,sex,age,aloneHow,eatAlone,eatDate,eatTogether):
    new_user_info_df = pd.DataFrame([[personId,timestamp,userName,sex,age,aloneHow,eatAlone,eatDate,eatTogether]], 
                                    columns = ['personId','timestamp','userName','sex','age','aloneHow','eatAlone','eatDate','eatTogether'])
    old_user_info_df = pd.concat([old_user_info_df, new_user_info_df], ignore_index=True)
    #Return the updated frame (the original function returned nothing)
    return old_user_info_df

In [57]:
food_df.head()

Unnamed: 0,contentId,foodName
0,0,잔치국수
1,1,스테이크
2,2,족발
3,3,피자
4,4,치킨


In [58]:
food_df.contentId == 0

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
Name: contentId, dtype: bool

In [16]:
contentId = 10
food = food_df.foodName[food_df.contentId == contentId].to_string(header=False,index=False)

In [None]:
def cold_start_question(personId,number):
    contentIdlist = food_df.contentId.tolist()
    #Pick `number` distinct foods at random for the new user to rate
    selected_content = np.random.choice(contentIdlist,number,replace=False )
    
    question = []
    valid_answers = ['0','1','2','3','4','5']
    
    for contentId in selected_content:
        #strip() removes the padding that to_string can add in some pandas versions
        food = food_df.foodName[food_df.contentId == contentId].to_string(header=False,index=False).strip()
        eventStrength = ''
        
        #Keep asking until the answer is an integer from 0 to 5
        while eventStrength not in valid_answers:
            eventStrength = input('\nHow much do you like ' + food + '?\n\n0 - never tried it\n1 - do not like it much\n5 - like it very much\n\n')
            
            if eventStrength not in valid_answers:
                print('\n\nPlease enter a number from 0 to 5!')
            
        eventStrength = int(eventStrength)
            
        question.append([personId,contentId,eventStrength])
        
    starting_question_df = pd.DataFrame(question,columns = ['personId','contentId','eventStrength'])
    
    return starting_question_df


In [None]:
cold_start_question(1000,5)
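
Because `cold_start_question` reads from `input()`, it is awkward to test. One possible variation, sketched here with hypothetical names (`parse_rating`, `collect_ratings`, `answer_fn`), separates the validation from the prompting so the same loop can be driven by scripted answers. It returns plain tuples rather than a DataFrame, and it uses Python's `random` module instead of `np.random.choice` only to keep the example dependency-free.

```python
import random

VALID_RATINGS = {'0', '1', '2', '3', '4', '5'}

def parse_rating(raw):
    """Return the rating as an int, or None if the answer is not 0-5."""
    raw = raw.strip()
    return int(raw) if raw in VALID_RATINGS else None

def collect_ratings(person_id, content_ids, number, answer_fn, seed=None):
    """Ask about `number` distinct items; answer_fn(content_id) supplies each reply."""
    rng = random.Random(seed)
    selected = rng.sample(list(content_ids), number)   # sampling without replacement
    rows = []
    for content_id in selected:
        rating = None
        while rating is None:                          # re-ask until the reply is valid
            rating = parse_rating(answer_fn(content_id))
        rows.append((person_id, content_id, rating))
    return rows

# Scripted replies stand in for a user: the first one is invalid and is asked again.
replies = iter(['seven', '4', '0', '5'])
rows = collect_ratings(1000, range(49), 3, lambda _content_id: next(replies), seed=1)
print(rows)
```

In interactive use, `answer_fn` could simply wrap `input()` with the question text for the given `content_id`.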