# Using Machine Learning in Hybrid Recommendation System for Diet Improvement Based on Health and Taste

## Yiren Qu

### Abstract
Recommendation systems are used everywhere today, for example in online shopping and Netflix video suggestions. Applying such systems to food requires the ability to predict how a user will rate the taste of a dish. In addition, other food concerns, such as food safety and nutritional value, have become more important than before. With these thoughts in mind, this paper proposes a novel diet improvement system based on a recommendation system that combines hybrid matrix decomposition with the K-nearest neighbor algorithm. To minimize the prediction error for taste and to obtain healthier recommendations, different combinations of algorithms were evaluated and individual components of the algorithms were varied to determine the most accurate way of improving the diet. The results of an additional psychological study on induced motivation were incorporated into the system to generate a model for each user that eases the transition from tasty food to food that is both tasty and healthy. The contribution is a new hybrid, optimized recommendation system built on a novel system construction.


#### Problem:
Food, like online advertising, product price prediction, and weather forecasting, is one of the most important components of people’s lives and has become the focus of a number of computer applications that enable information analysis. Under the pressure of modern life, people seem to pay less attention to what they eat. In the near future, personal data on food preferences will be connected to restaurants, home computers, and mobile devices, including cars. People eating at restaurants will immediately receive food recommendations based on their personal health condition and an optimized taste model. In the long term, the recommendation system would let people specify their own settings to make their diets healthier and easier to accept: using the induced motivation method, it first recommends tasty food to give the user a buffer period in which to get used to the change.

#### Goal: 
The system aims to recommend food that is both healthy and tasty for each user, in the hope that, over the long term, the diet becomes healthier because the personal model ranks food by a weighted sum of taste and health scores. The system is built on top of a traditional recommendation system and is developed with attention to both time efficiency and prediction accuracy.


#### Program Language Selection:
Python 3  

##### External Library of Python:
- NumPy: numeric calculation and quick data analysis
- pandas (Python Data Analysis Library): analysis of large tabular datasets

In [1]:
import math
import numpy as np
import pandas as pd

### Problem statement
This system aims to recommend food that is both healthy and tasty for each user, in the hope that, over the long term, the diet becomes healthier because the personal model ranks food by a weighted sum of taste and health scores.
Specifically, assume there are $N_{u}$ users and $N_{f}$ foods. Given a set of training data, which is a set of triples with the columns user id, food id, and rating, define the "user-food matrix" $\mathrm{A}_{u,f}$, containing only known food ratings, and the matrix $B_{u,f}$ as:

 $$ {\mathrm{A}}_{u,f}=\left[ \begin{array}{ccc}
{ID}_u & {ID}_f & R_{u,f} \\ 
\vdots  & \vdots  & \vdots  \end{array}
\right]\ \ \ \ \mathrm{size\ of\ A\ is\ }\left(N_a,3\right),\ \mathrm{in\ which\ }N_a\le N_u\times N_f $$

$$ B_{u,f}=\left( \begin{array}{ccc}
R_{u_1,f_1} & \cdots  & R_{u_1,f_{N_f}} \\ 
\vdots  & \ddots  & \vdots  \\ 
R_{u_{N_u},f_1} & \cdots  & R_{u_{N_u},f_{N_f}} \end{array}
\right)\ \ \ \ \mathrm{size\ of\ B\ is\ }\left(N_u,N_f\right) $$
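
The relation between the triple list A and the matrix B can be illustrated with a small sketch. The user ids, food ids, and ratings below are made up for illustration; only the column names follow the score file used later in this notebook.

```python
import pandas as pd

# Toy rating triples in the layout of A: (user id, food id, rating).
# All values here are invented for illustration.
A = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "food_id": [10, 20, 10, 30],
        "rating": [7, 4, 9, 2],
    }
)

# One row per user and one column per food; unknown ratings stay NaN.
B = A.pivot_table(values="rating", index="user_id", columns="food_id")

print(B.shape)                     # (number of users, number of foods)
print(B.loc[2, 10])                # known rating of user 2 for food 10
print(int(B.isna().sum().sum()))   # number of unknown ratings to predict
```

The NaN cells of B are exactly the ratings the recommendation system has to predict.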

The first task is to predict all the users’ ratings of given foods based on the training set, in this case the set of known food ratings. Each rating is an integer between 1 and 10, ranging from dislike to like, and the goal is to minimize the root mean square error (RMSE) on the test set, which contains the user-food pairs whose ratings are withheld. The test set, error, and RMSE are defined as:
$$ S_{Test}=\left(\left[ \begin{array}{cc}
{ID}_u & {ID}_f \end{array}
\right],\cdots \right)\ ,\ in\ which\ {ID}_u,{ID}_f\in A_{u,f}\ and\ R_{u,f}\notin A_{u,f} $$
$$E_{u,f}=R_{u,f}-P_{u,f}$$
$$ RMSE=\sqrt{\frac{1}{\left|S_{Test}\right|}\sum_{\left(u,f\right)\in S_{Test}}{{E_{u,f}}^2}}
$$ 
where $\left|S_{Test}\right|$ is the cardinality of the test set, $R_{u,f}$ is the true rating, and $P_{u,f}$ is the rating predicted by the recommendation system. As argued in [1], a survey on evaluating recommendation systems, RMSE is a popular measure of the accuracy of predicted ratings.

The second task is to generate a health score based on the comparison between users’ needs in terms of basic nutrition and food nutrition. In this problem, each health score is an integer between 1 and 10 and the goal is to sort the predicted food rating list after obtaining the average summation of the health score and predicted taste score. Specifically, define the health score, result score, and recommended food list as: 
$$ L_{{u,P}_f}=\left[ \begin{array}{cc}
{ID}_f & P_{u,f} \\ 
\vdots  & \vdots  \end{array}
\right]\ \ \left(u,f\right)\in \ S_{Test},\ \mathrm{where\ }L_{{u,P}_f}\ \mathrm{is\ sorted\ in\ descending\ order\ of\ }P_{u,f}$$
$$H_{u,f}=\sum_{i}{\frac{h_{f,i}}{h_{u,i}}\times a_i}\ \ \left(f\in \ L_{{u,P}_f}\right)$$
$$L_{u,Result}=\ \left[P_{u,f}\times a_{u,t}+H_{u,f}\times a_{u,h},\cdots \right]\ \left(u,f\right)\in \ S_{Test}
$$
where $L_{{u,P}_f}$ is a list of tuples of predicted food scores sorted in descending order, $h_{f,i}$ is the value of nutritional component $i$ in food $f$, $h_{u,i}$ is the user’s need for nutritional component $i$, $a_i$ is the weight of each nutritional component in the weighted summation, and $(a_{u,t},a_{u,h})$ are the weights given by the induced motivation model, which control the combination of the taste and health components so that the recommendations are easy for users to accept.
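
The final ranking step can be sketched as follows. This is a minimal illustration under assumed inputs: the function name `rank_foods`, the food ids, and all scores and weights are hypothetical and do not come from the survey data.

```python
import pandas as pd


def rank_foods(pred_taste, health, a_t, a_h):
    """Weighted sum of predicted taste and health scores, best food first."""
    combined = pred_taste * a_t + health * a_h
    return combined.sort_values(ascending=False)


# Hypothetical scores for three foods, indexed by food id.
taste = pd.Series({1: 9.0, 3: 8.0, 9: 6.0})
health = pd.Series({1: 4.0, 3: 9.0, 9: 9.0})

# Equal weights: food 3 wins because it is both tasty and healthy.
ranking = rank_foods(taste, health, a_t=0.5, a_h=0.5)
print(ranking)

# Taste only: the order falls back to the predicted taste scores.
taste_only = rank_foods(taste, health, a_t=1.0, a_h=0.0)
print(list(taste_only.index))
```

Shifting weight from `a_t` to `a_h` over time is what moves the list from tasty food toward food that is both tasty and healthy.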





### import files

In [2]:
#load food csv
def foodcsv():
    foodcsv = pd.read_csv('E:/project/main/DRS/DRS/data/food.csv', index_col=0, )
    return foodcsv
# load score.csv
def scoredata():
    scorecsv = pd.read_csv('E:/project/main/DRS/DRS/data/score_mod.csv', index_col=0, )
    return scorecsv
def scoredata_original():
    scorecsv = pd.read_csv('E:/project/main/DRS/DRS/data/score.csv', index_col=0, )
    return scorecsv



### Show Data set Examples

In [5]:
foodcsv()[["name","type","calories","fat","totalcarbs","sodium","protein","special"]].head(2)

foodid,name,type,calories,fat,totalcarbs,sodium,protein,special
1,Cheese Pizza,1,277,10,36,565,12,veg
3,Pepperoni Pizza,1,285,10,36,640,12,0


In [59]:
scoredata().head(2)

Unnamed: 0,food_id,rating,user_id
0,1,3.75,183
2,3,3.0,183


In [60]:
scoredata_original().head(2)

Unnamed: 0,1,3,4,5,6,7,8,9,10,11,...,48,63,64,60,65,66,70,69,67,68
183,3.75,3.0,3.0,5,3,7,3,8.0,5,4.0,...,,,,,,,,,,
147,9.666667,6.333333,5.5,7,9,3,9,,10,,...,,,,,,,,,,


## System design

The diet improvement system is divided into three parts: cloud data storage and methods, nearline storage and methods, and local storage and methods, as depicted in Figure 1.

In the system, all databases containing basic information are saved in the cloud: specifically, the "user-food matrix" and the food dataset, which contains basic nutritional values such as calories, fat, total carbohydrates, sodium, and protein. Nearline, according to [2], is a way of moving data between offline and online processing. In the nearline layer, we set up the recommendation system based on a hybrid SVD/KNN method, together with a database of similar users and similar foods generated by the KNN method. The local data, consisting of the user’s own goal and the induced motivation model generated from that goal, are saved offline and are recalled when generating the final recommended food list. The interaction interface displays all the results received from the recommendation system.
<img src="figure1.png" >
Figure 1 System Construction

### K-nearest neighbor (KNN) algorithm

- Euclidean Similarity Function
<img src="http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/img/knn/euc.png">
$$ sim\left(x,y\right)=\ \frac{1}{1+\sqrt{\sum{{(x-y)}^2}}}\ \ \ \ \ \ (1)$$

In [7]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

- Cosine Similarity Function
<img src="http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/img/knn/cos.png">

$$ sim(x,y) = \frac{x \cdot y}{\sqrt{(x \cdot x)(y \cdot y)}} $$

In [8]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

- Pearson correlation

$$ sim(x,y) = \frac{(x - \bar x) \cdot (y - \bar y)}{\sqrt{\left((x - \bar x) \cdot (x - \bar x)\right)\left((y - \bar y) \cdot (y - \bar y)\right)}} $$

In [9]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
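
A small worked example helps show how the three measures differ. The sketch below repeats the three functions so that it runs on its own and applies them to two made-up rating profiles that follow the same pattern but are shifted by two points.

```python
import numpy as np
import pandas as pd


def euclidean(s1, s2):
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))


def cosine(s1, s2):
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))


def pearson(s1, s2):
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))


# Invented profiles: user b rates every food two points higher than user a.
a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([3.0, 4.0, 5.0])

sim_euclidean = euclidean(a, b)  # penalizes the constant offset
sim_cosine = cosine(a, b)        # close to 1, but not exactly 1
sim_pearson = pearson(a, b)      # ignores the offset after mean-centering

print(sim_euclidean, sim_cosine, sim_pearson)
```

Because Pearson correlation centers each profile on its own mean, it treats a generous rater and a strict rater with the same preferences as identical, which is one reason it is a common choice for rating data.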

In [16]:
def knn(user_id, food_id):
    """Return up to 13 users most similar to user_id among those who rated food_id."""
    scorecsv_mod = scoredata()
    # One column per user, one row per food.
    all_user_profiles = scorecsv_mod.pivot_table('rating', index='food_id', columns='user_id')
    user_condition = scorecsv_mod.user_id != user_id
    food_condition = scorecsv_mod.food_id == food_id
    # Ratings of this food given by every other user
    # (the example below uses the most frequently rated food).
    ratings_by_others = scorecsv_mod.loc[user_condition & food_condition]
    if ratings_by_others.empty:
        return pd.DataFrame()

    ratings_by_others = ratings_by_others.set_index('user_id')
    their_ids = ratings_by_others.index
    their_profiles = all_user_profiles[their_ids]
    user_profile = all_user_profiles[user_id]
    # Pearson correlation between each other user's profile and the target user's profile.
    sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
    sims = sims.to_frame().sort_values(by=0, axis=0, ascending=False)
    sims.rename(columns={0: food_id}, inplace=True)
    sims = sims.head(13)
    return sims
def knnf(food_id, user_id):    # K nearest foods
    """Return up to 13 foods most similar to food_id among those rated by user_id."""
    scorecsv = scoredata()
    # One column per food, one row per user.
    all_food_profiles = scorecsv.pivot_table('rating', index='user_id', columns='food_id')
    user_condition = scorecsv.user_id == user_id
    food_condition = scorecsv.food_id != food_id
    # Ratings this user gave to every other food
    # (the example below uses user 183, who entered the most ratings).
    ratings_by_others = scorecsv.loc[user_condition & food_condition]
    if ratings_by_others.empty:
        return pd.DataFrame()
    ratings_by_others = ratings_by_others.set_index('food_id')
    their_ids = ratings_by_others.index

    their_profiles = all_food_profiles[their_ids]
    food_profile = all_food_profiles[food_id]
    # Pearson correlation between each other food's profile and the target food's profile.
    sims = their_profiles.apply(lambda profile: pearson(profile, food_profile), axis=0)
    sims = sims.to_frame().sort_values(by=0, axis=0, ascending=False)
    sims = sims.head(13)
    return sims

In [38]:
knn(68,3).head()

user_id,3
92,0.707107
192,0.47619
168,0.436436
58,0.31664
17,0.272772


In [23]:
knnf(1,183).head()

food_id,0
3,0.524233
13,0.394883
6,0.371358
5,0.365304
59,0.304019


In [21]:
%timeit -n10 knn(1,1)
%timeit -n10 knnf(1,1)

10 loops, best of 3: 105 ms per loop
10 loops, best of 3: 22.3 ms per loop
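
The neighbor search can be reproduced on a toy dataset without the survey files. The sketch below is an illustration only: the ratings are invented, and the search is reduced to its core steps (pivot the ratings into profiles, correlate, sort, keep the top k).

```python
import numpy as np
import pandas as pd


def pearson(s1, s2):
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))


# Invented ratings: user 2 shares user 1's taste, user 3 has the opposite taste.
ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "food_id": [10, 20, 30] * 3,
        "rating": [8, 6, 2, 9, 7, 3, 2, 5, 9],
    }
)

# One column per user, one row per food, as in knn().
profiles = ratings.pivot_table(values="rating", index="food_id", columns="user_id")
target = profiles[1]
others = profiles.drop(columns=1)

# Similarity of every other user to user 1, most similar first.
sims = others.apply(lambda profile: pearson(profile, target), axis=0)
top = sims.sort_values(ascending=False).head(2)
print(top)
```

User 2 comes out with a correlation of 1 and user 3 with a negative correlation, so only user 2 would contribute to a prediction that keeps positive similarities.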


### Pure KNN Method to predict foods' scores

In [24]:
class CollabEuclideanReco:
    """ Collaborative filtering using a custom sim(u,u'): averages a user-based and a food-based KNN estimate (Pearson similarity, despite the class name). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        scorecsv = scoredata()
        self.all_user_profiles = scorecsv.pivot_table('rating', index='food_id', columns='user_id')
        self.all_food_profiles = scorecsv.pivot_table('rating', index='user_id', columns='food_id')

    def estimate(self, user_id, food_id):
        """ Ratings weighted by correlation similarity. """
        scorecsv = scoredata()

        user_condition = scorecsv.user_id != user_id
        food_condition = scorecsv.food_id == food_id
        ratings_by_others = scorecsv.loc[user_condition & food_condition]
        if ratings_by_others.empty:
            return 5.5

        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]

        if ratings_sims.empty:
            userest = their_ratings.mean()
        else:
            b = ratings_sims.sort_values(by="sim", axis=0, ascending=False)
            b = b.head(13)
            userest = np.average(b.rating, weights=b.sim)

        user_condition = scorecsv.user_id == user_id
        food_condition = scorecsv.food_id != food_id
        ratings_by_others = scorecsv.loc[user_condition & food_condition]
        if ratings_by_others.empty:
            return 6

        ratings_by_others.set_index('food_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_food_profiles[their_ids]
        food_profile = self.all_food_profiles[food_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, food_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return (userest + their_ratings.mean()) / 2
        else:
            c = ratings_sims.sort_values(by="sim", axis=0, ascending=False)
            c = c.head(13)
        return (userest + np.average(c.rating, weights=c.sim)) / 2

In [51]:
rec=CollabEuclideanReco()
rec.learn()
rec.estimate(64,19)

6.9498260677054446
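
The core of each estimate is a similarity-weighted average over the most similar neighbors with positive similarity. The helper below is a hypothetical, self-contained sketch of that single step (the name `weighted_estimate` and the sample numbers are not part of the original code).

```python
import numpy as np
import pandas as pd


def weighted_estimate(ratings, sims, k=13):
    """Average the ratings of the k most similar neighbors, weighted by similarity."""
    df = pd.DataFrame({"rating": ratings, "sim": sims})
    positive = df[df.sim > 0]
    if positive.empty:
        # No positively correlated neighbor: fall back to the plain mean.
        return float(df.rating.mean())
    top = positive.sort_values(by="sim", ascending=False).head(k)
    return float(np.average(top.rating, weights=top.sim))


# The neighbor with negative similarity is ignored:
# (8 * 0.75 + 6 * 0.25) / (0.75 + 0.25) = 7.5
estimate = weighted_estimate([8, 6, 1], [0.75, 0.25, -0.9])
print(estimate)

# With no positive similarity, the estimate is the mean of the ratings.
fallback = weighted_estimate([4, 6], [-0.1, -0.2])
print(fallback)
```

In the class above this step is run twice, once over similar users and once over similar foods, and the two results are averaged.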

## Singular value decomposition (SVD)

$$Matrix\ B=\left( \begin{array}{ccc}
R_{u_1,f_1} & \cdots  & R_{u_1,f_{N_f}} \\ 
\vdots  & \ddots  & \vdots  \\ 
R_{u_{N_u},f_1} & \cdots  & R_{u_{N_u},f_{N_f}} \end{array}
\right)\ \ \ \approx \ \ \left( \begin{array}{ccc}
R_{u_1,k_1} & \cdots  & R_{u_1,k_K} \\ 
\vdots  & \ddots  & \vdots  \\ 
R_{u_{N_u},k_1} & \cdots  & R_{u_{N_u},k_K} \end{array}
\right)\ \ \times\ \ {\left( \begin{array}{ccc}
R_{f_1,k_1} & \cdots  & R_{f_1,k_K} \\ 
\vdots  & \ddots  & \vdots  \\ 
R_{f_{N_f},k_1} & \cdots  & R_{f_{N_f},k_K} \end{array}
\right)}^{T}
$$

$$B = UV^{T}$$
$$
U = \left(
\begin{array}{ccc}
 u_{00} & u_{01} & u_{02}  \\
 u_{10} & u_{11} & u_{12}  \\
 u_{20} & u_{21} & u_{22}  \\
 u_{30} & u_{31} & u_{32}  \\  
 u_{40} & u_{41} & u_{42} \\  
 u_{50} & u_{51} & u_{52}  \\ 
 u_{60} & u_{61} & u_{62}  \\
 u_{70} & u_{71} & u_{72}  \\
\end{array} 
\right) \, \text{and } V^{T} = \left(
\begin{array}{cccccccccc}
 v_{00}  & v_{01} & v_{02} & v_{03} & v_{04} &v_{05} & v_{06} & v_{07} & v_{08} & v_{09} \\
 v_{10}  & v_{11} & v_{12} & v_{13} & v_{14} &v_{15} & v_{16} & v_{17} & v_{18} & v_{19} \\
 v_{20}  & v_{21} & v_{22} & v_{23} & v_{24} &v_{25} & v_{26} & v_{27} & v_{28} & v_{29} \\
\end{array}
\right)
$$
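
How such a factorization can be fitted to the known ratings only is sketched below on a toy matrix. This is an illustrative variant, not the implementation used in this notebook: it uses stochastic gradient descent with L2 regularization, treats zeros as unknown ratings, and the matrix, the seed, and the values of `K`, `steps`, `alpha`, and `beta` are assumptions chosen for the example.

```python
import numpy as np


def factorize(R, K=2, steps=2000, alpha=0.002, beta=0.02, seed=0):
    """Fit R ~ U @ V.T on the non-zero entries of R with regularized SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_foods = R.shape
    U = rng.random((n_users, K))
    V = rng.random((n_foods, K))
    observed = [(i, j) for i in range(n_users) for j in range(n_foods) if R[i, j] > 0]
    for _ in range(steps):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            # Gradient step on the squared error plus an L2 penalty.
            U[i] += alpha * (2 * err * V[j] - beta * U[i])
            V[j] += alpha * (2 * err * u_old - beta * V[j])
    return U, V


def observed_rmse(R, U, V):
    """RMSE over the known (non-zero) entries only."""
    mask = R > 0
    diff = (R - U @ V.T)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy ratings; 0 marks an unknown rating.
R = np.array([[9.0, 0.0, 7.0],
              [4.0, 5.0, 0.0],
              [0.0, 7.0, 7.0],
              [10.0, 6.0, 9.0]])

rmse_before = observed_rmse(R, *factorize(R, steps=0))  # random initialization
U, V = factorize(R)
rmse_after = observed_rmse(R, U, V)
print(rmse_before, rmse_after)
print(np.round(U @ V.T, 2))  # the former zeros now hold predicted ratings
```

The product `U @ V.T` fills in every cell of the matrix, so the cells that were unknown become the predictions.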

#### Using KNN model to scale down the input array

In [28]:
def knn_array(f,u):
    R = scoredata().pivot_table('rating', index='user_id', columns='food_id').fillna(0)
    a = list(knnf(f,u).index)
    b = list(knn(u,f).index)
    a.append(f)
    b.append(u)
    RC = R.iloc[R.index.isin(b)]
    RT = R.T
    RT = RT.iloc[RT.index.astype(float).isin(a)]
    RO = RC[RT.index]
    R = np.array(RO)
    return R,RO

- Example: user 64 and food 19

In [50]:
R,RO=knn_array(19,64)
RO.head()

food_id,9,11,15,18,19,21,27,34
user_id,,,,,,,,
18,9,0,7,9,7,0,0,0
20,4,0,5,4,3,0,0,0
40,0,0,7,0,7,0,0,0
64,10,6,10,9,0,6,3,8
80,10,0,10,10,10,0,0,0


In [43]:
import numpy
def matrix_factorization(u, f, steps=3500, alpha=0.0003, beta=0.02):
    """Predict user u's rating of food f by factorizing the KNN-reduced matrix."""
    R, RO = knn_array(f, u)
    N = len(R)     # number of users in the reduced matrix
    M = len(R[0])  # number of foods in the reduced matrix
    K = 4          # number of latent factors
    P = numpy.random.rand(N, K)
    Q = numpy.random.rand(M, K)
    Q = Q.T
    for step in range(steps):
        # Gradient descent over the known (non-zero) ratings only.
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i, :], Q[:, j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        # Regularized squared error over the known ratings, used as the stopping test.
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i, :], Q[:, j]), 2)
                    for k in range(K):
                        e += (beta / 2) * (pow(P[i][k], 2) + pow(Q[k][j], 2))
        if e < 0.001:
            break
    Rpred = numpy.dot(P, Q)
    scorepred = pd.DataFrame(Rpred, index=RO.index, columns=RO.columns)
    try:
        temp = scorepred.loc[u].loc[f]
        if temp > 0:
            return temp
        else:
            return 0
    except Exception:
        # u or f is missing from the reduced matrix; fall back to a default score.
        print(u, f, "error")
        return 5.78

In [49]:
matrix_factorization(64,19)

10.150741312505179

- Real score given by user 64 to food 19

In [58]:
realsurvey_64 = pd.read_csv("data/survey/recfood/ID64.csv",index_col=0)
realsurvey_64.loc[19]["score"]

10.0

### Induced Motivation Model (IMM) based on custom goal

The health score of a food differs between users, and even between days, when the aim is to reach a dietary goal or to keep a healthy diet. A user may have a personal goal, such as losing two kilograms in 30 days or gaining four kilograms in 25 days to build muscle. According to [8], the ideal model for gradually changing a user’s diet and eating choices should first recommend tastier but less healthy food, after which the user can transition to food that is both tasty and healthy. We therefore use the IMM to set the weights in the weighted summation, controlling the combination of the taste and health components so that the recommendations are easy for the user to accept. The mathematical model used here is a logistic model, with slight changes at the start and the end and with logarithmic growth in the middle, to balance time efficiency against the adjustment time the user needs to adapt. In the evaluation, we use the profile of a virtual person to test whether the system meets these needs. This person is a healthy 18-year-old male student with normal activity levels, an initial height of 180 centimeters, and an initial weight of 75 kilograms, whose goal is to lose two kilograms in 30 days. The custom IMM graph is shown in Figure 2:

<img src="figure_3.png">
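
The exact curve behind the figure is not given in this notebook, so the following is only a rough sketch of the idea under stated assumptions: a plain logistic function of the day, centered on the middle of the goal period, gives the health weight, and the taste weight is its complement. The function name `imm_weights` and the parameters `total_days` and `steepness` are hypothetical.

```python
import math


def imm_weights(day, total_days=30, steepness=0.3):
    """Return (taste weight, health weight) for a given day of the plan.

    Assumed form: a standard logistic curve centered on the midpoint of the
    plan; the notebook's own model also adjusts the start and end of the curve.
    """
    midpoint = total_days / 2
    a_h = 1.0 / (1.0 + math.exp(-steepness * (day - midpoint)))
    return 1.0 - a_h, a_h


# Weights over a 30-day plan: taste dominates early, health dominates late.
schedule = [imm_weights(day) for day in range(0, 31)]
for day in (0, 15, 30):
    a_t, a_h = schedule[day]
    print(day, round(a_t, 3), round(a_h, 3))
```

Early in the plan the ranking is driven almost entirely by predicted taste, and the health score gradually takes over as the goal date approaches.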

## Health score calculation

$$ H_{u,f}=\sum_{i}{\frac{h_{f,i}}{h_{u,i}}\times a_i}\ \ \left(f\in \ L_{{u,P}_f}\right) $$

In [62]:
def healthscore(user, food):
    def dif(a, b):
        # Relative deviation of a from the target b.
        return math.fabs(float(math.fabs(a - b) / b))

    # Share of a lunch's calories assigned to each food type (1 if the type is unknown).
    ts = {1: 0.6, 2: 0.3, 3: 0.3, 4: 0.2, 5: 0.1}.get(food.type, 1)
    basic = user.getcalorie() / 3 * ts
    carbs = ts * basic * 0.45
    fatmin = basic * 0.2 / 9.0 * ts
    fatmax = basic * 0.35 / 9.0 * ts
    sodium = 2400 / 3.0 * ts
    if user.age >= 51:
        sodium = 1500 / 3 * ts
    protein = 0.9 * user.getweight()
    if 0.5 <= user.goal <= 1.5:
        protein = 1.5 * user.getweight()
    elif user.goal > 1.5:
        protein = 2.2 * user.getweight()
    if "protein" in user.special:
        protein = user.special["protein"]
    protein /= 3
    protein *= ts
    # No fat penalty inside the recommended range; otherwise distance to the nearer bound.
    if fatmin <= food.getfat() <= fatmax:
        fat = 0
    else:
        fat = min(dif(food.getfat(), fatmin), dif(food.getfat(), fatmax))
    # Weighted relative deviation over calories, carbs, sodium, protein and fat.
    diff = (dif(food.getcal(), basic) * 0.5
            + dif(food.gettotalcarbs(), carbs) * 0.1
            + dif(food.getsodium(), sodium) * 0.1
            + dif(food.getprotein(), protein) * 0.1
            + fat * 0.2)
    # Quadratic penalty on a 0-100 scale, capped at 100 and floored at 40.
    result = 100 - math.pow(7 * diff, 2)
    if diff <= 0.05:
        result = 100
    if result <= 40:
        result = 40
    # Vegetarian users get a score of 0 for non-vegetarian food.
    if "veg" in user.special and not veg(user, food):
        result = 0
    return result


def veg(user, food):
    """Return True if the user is vegetarian and the food is vegetarian, else False."""
    if "veg" in user.special:
        return bool(food.getspecial("veg"))
    return False


def health(p):
    """Build a user-by-food table of health scores for the user ids in p."""
    # fooddata() and userselect() are project helpers that are not shown in this notebook.
    food = fooddata()
    user = [userselect(i) for i in p]
    healthcsv = pd.DataFrame()
    for i in user:
        for j in food:
            if j.foodid not in healthcsv.columns:
                healthcsv[j.foodid] = np.nan
            score = healthscore(i, j)
            healthcsv.loc[i.id, j.foodid] = score
    healthcsv = healthcsv.round(3)
    return healthcsv

In [51]:
_health=health([183]).multiply(.1)
_health

Unnamed: 0,1,3,4,5,6,7,8,9,10,11,...,63,64,65,66,67,68,69,70,100,101
183,9.1476,9.1117,4,4,4,4,4,8.9055,4,7.9836,...,8.7119,8.7541,7.6183,8.4282,4,4,6.4836,4.2268,8.1613,8.1613


## Evaluation
### The next two algorithms have prediction error, so train and test files and an evaluation function are needed

- There are few usable open-source, large-scale datasets of food ratings; therefore, a three-week tracking survey on lunches was sent out on school days. The survey, which listed all the courses served in the cafeteria, was distributed online as a Google form. It recorded participants’ opinions, based only on taste, of the foods they ate for lunch during the specified period, expressed as a score from 1 to 10. 
- The resulting matrix generated from the tracking survey contains responses from 186 people and covers 64 courses. The respondents consist of 165 students aged 15 to 18 and 21 adult teachers; thus, the matrix has a size of $(186 \times 64)$. Matrix A contained a total of 1235 entries. The system was evaluated by randomly dividing the original Matrix A into $Matrix A_{Training}$ and $Set  S_{Test}$. The size of $Matrix A_{Training}$ is $(555\times3)$ and the cardinality of the test set is 212. All of the entries were usable, with integer values ranging from 1 to 10.


#### Create train & test files

In [7]:
def gene_test_train():
    scorecsv = scoredata()
    # Draw 800 ratings at random, without replacement.
    scorecsv = scorecsv.loc[np.random.choice(scorecsv.index, size=800, replace=False)]
    # Keep only users with more than one sampled rating.
    counts = scorecsv.user_id.value_counts()
    user_ids_larger_1 = counts[counts > 1].index
    scorecsv = scorecsv[scorecsv.user_id.isin(user_ids_larger_1)].copy()
    assert np.all(scorecsv.user_id.value_counts() > 1)

    def assign_to_set(df):
        # Flag roughly 20% of this user's ratings (at least one) for testing.
        df = df.copy()
        sampled_ids = np.random.choice(df.index,
                                       size=np.int64(np.ceil(df.index.size * 0.2)),
                                       replace=False)
        df.loc[sampled_ids, 'for_testing'] = True
        return df

    scorecsv['for_testing'] = False
    grouped = scorecsv.groupby('user_id', group_keys=False).apply(assign_to_set)
    # The flags live on `grouped`; align them to scorecsv's index before masking.
    # The 'for_testing' column written to both files therefore stays False.
    is_test = grouped.for_testing.reindex(scorecsv.index)
    scorecsv_train = scorecsv[~is_test]
    scorecsv_test = scorecsv[is_test]
    assert len(scorecsv_train.index.intersection(scorecsv_test.index)) == 0
    scorecsv_train.to_csv('E:/project/main/DRS/DRS/data/test_score/scorecsv_train.csv')
    scorecsv_test.to_csv('E:/project/main/DRS/DRS/data/test_score/scorecsv_test.csv')
gene_test_train()

In [64]:
scorecsv_train = pd.read_csv('E:/project/main/DRS/DRS/data/test_score/scorecsv_train.csv',index_col=0)
scorecsv_test=pd.read_csv('E:/project/main/DRS/DRS/data/test_score/scorecsv_test.csv',index_col=0)

In [65]:
scorecsv_train.head()

Unnamed: 0,food_id,rating,user_id,for_testing
1200,1,5,57,False
603,63,9,163,False
332,10,8,62,False
1123,3,8,59,False
1064,11,8,131,False


In [66]:
scorecsv_test.head()

Unnamed: 0,food_id,rating,user_id,for_testing
565,19,4,14,False
545,48,9,99,False
372,16,1,18,False
55,38,7,147,False
1316,1,1,132,False


### Evaluation: 

- RMSE: $\sqrt{\frac{\sum(\hat y - y)^2}{n}}$


In [67]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [68]:
def evaluate_data():
    train = pd.read_csv("E:/project/main/DRS/DRS/data/test_score/scorecsv_train.csv")
    test = pd.read_csv("E:/project/main/DRS/DRS/data/test_score/scorecsv_test.csv")
    return train, test

def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    scorecsv_train, scorecsv_test = evaluate_data()
    ids_to_estimate = zip(scorecsv_test.user_id, scorecsv_test.food_id)
    estimated = np.array([estimate_f(u, i) for (u, i) in ids_to_estimate])
    real = scorecsv_test.rating.values
    return compute_rmse(estimated, real)
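
The evaluation loop can be tried without the survey files. The sketch below repeats `compute_rmse` and uses a hypothetical variant, `evaluate_frame`, that takes the test ratings as a DataFrame argument instead of reading them from disk; the two test ratings are invented.

```python
import numpy as np
import pandas as pd


def compute_rmse(y_pred, y_true):
    """Compute Root Mean Squared Error."""
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))


def evaluate_frame(estimate_f, test):
    """RMSE of estimate_f(user_id, food_id) over the rows of a test DataFrame."""
    pairs = zip(test.user_id, test.food_id)
    estimated = np.array([estimate_f(u, f) for u, f in pairs])
    return compute_rmse(estimated, test.rating.values)


# Two invented test ratings and a constant predictor that always answers 6.
toy_test = pd.DataFrame({"user_id": [1, 2], "food_id": [10, 20], "rating": [4, 8]})
baseline_rmse = evaluate_frame(lambda u, f: 6.0, toy_test)

# Errors are +2 and -2, so the RMSE is sqrt((4 + 4) / 2) = 2.0
print(baseline_rmse)
```

Any estimator with the signature `estimate_f(user_id, food_id)` can be passed in the same way, which is how the methods in the next section are compared.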

## Evaluation Result

### naive program  (using average score)

In [74]:
def naive(a,b):
    return 5.798
evaluate(naive)

3.0204758775525073

### pure KNN method

<table>
<tr>
<td></td>
<td>Euclidean</td>
<td>Cosine</td>
<td>Pearson</td>
</tr>
<tr>
<td>RMSE-Using training dataset</td>
<td>2.46493</td>
<td>2.49006</td>
<td>2.54757</td>
</tr>
<tr>
<td>RMSE-Using whole dataset</td>
<td>1.85630</td>
<td>2.28051</td>
<td>1.80219</td>
</tr>

</table>


<img src="figure_4.png">
Figure 3 Cross Validation from KNN Algorithm

### Regular LFM Algorithm

- With steps = 3500, α = 0.0003, and β = 0.02, RMSE = 1.00169

- Best time of three runs: 5 min 30 s

### Modified LFM Algorithm ( Hybrid method with KNN)

#### Test RMSE for different training parameters (steps, α, β)

<img src="eva1.png" >
Figure 4 Cross Validation for LFM algorithm: steps range from 3000 to 7000


<img src="eva2.png">
Figure 5 Cross Validation for LFM algorithm: α ranges from 0.0001 to 0.001

<img src="eva3.png">
Figure 6 Cross Validation for LFM algorithm: β ranges from 0.01 to 0.1

- Best time of three runs: 13.1 seconds

### Health

The virtual user we created to represent normal health conditions is an 18-year-old male student with normal activity levels, an initial height of 180 centimeters, and an initial weight of 75 kilograms. The specified goal was to lose 2 kilograms in 30 days; his BMR was about 1860.55 calories, the daily intake needed to meet the goal was about 2435 calories, and about 1000 calories were permitted for lunch. In the list of recommended foods, 86.67% of the foods were found to satisfy the user’s needs.

## Real test result

This study used a post-survey to collect feedback on the accuracy of the predictions. For the post-survey, the system generated six recommended foods and randomly selected six more foods from the food list. The 12 foods were presented in mixed order, making the survey a blind test so that respondents answered without being influenced by knowing which foods were recommended. A total of 376 food ratings were received. Of these, 188 were for foods from the recommended list, for which the system predicted the user’s attitude toward the food with 63.3% accuracy. The average rating for recommended foods is <span style="color:red;">7.87234</span>, while the average rating for random foods is <span style="color:red;">4.49468</span>. Recommended foods were thus rated noticeably higher than randomly chosen ones.


## REFERENCES
- [1] Gunawardana, A., & Shani, G. (2015). Evaluating Recommender Systems. Recommender Systems Handbook, 265-308.
- [2] Neumann, S., Thum, A., & Böttcher, C. (2012). Nearline acquisition and processing of liquid chromatography-tandem mass spectrometry data. Metabolomics, 9(S1), 84-91.
- [3] Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the Tenth International Conference on World Wide Web - WWW '01.
- [4] Matrix Factorization: A Simple Tutorial and Implementation in Python. (n.d.). Retrieved Jan, 2016, from http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
- [5] Schelter, S., Boden, C., & Markl, V. (2012). Scalable similarity-based neighborhood methods with MapReduce. Proceedings of the Sixth ACM Conference on Recommender Systems - RecSys '12.
- [6] Ryza, S., Laserson, U., Owen, S., & Wills, J. (2015). Advanced analytics with Spark. Sebastopol, CA: O'Reilly Media.
- [7] Takács, G., Pilászy, I., Németh, B., & Tikk, D. (2008). Matrix factorization and neighbor based algorithms for the netflix prize problem. Proceedings of the 2008 ACM Conference on Recommender Systems - RecSys '08.
- [8] Miller, Neal E. "Motivation and Psychological Stress." The Physiological Mechanisms of Motivation (1982): 409-32. Web.
- [9] Farenga, Stephen J., and Daniel Ness. "Calories, Energy, and the Food You Eat." Questia. Web. Feb. 2016. <https://www.questia.com/article/1G1-143734185/calories-energy-and-the-food-you-eat>.
- [10] "Nonessential Nitrogen Supplements And Essential Amino Acid Requirements." Nutrition Reviews 21.6 (2009): 183-84. Web.
