# Getting Started

In this project, we analyze datasets describing restaurants, consumers, and user-item ratings. The goal is to implement collaborative filtering, i.e., to find similarities between consumers and use them to recommend restaurants.
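
As a rough illustration of the idea, the sketch below scores how alike two consumers are from their ratings and suggests a restaurant that the most similar consumer rated well. The toy ratings, the consumer and restaurant labels, and the choice of cosine similarity are all assumptions made for illustration; they are not the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are consumers, columns are restaurants,
# NaN marks a restaurant the consumer has not rated.
toy = pd.DataFrame(
    {'R1': [2, 2, 0], 'R2': [1, 1, 2], 'R3': [np.nan, 2, 0]},
    index=['A', 'B', 'C'],
)

def cosine(u, v):
    """Cosine similarity over the restaurants both consumers rated."""
    both = u.notna() & v.notna()
    a, b = u[both].to_numpy(float), v[both].to_numpy(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

target = 'A'
# Similarity of the target consumer to every other consumer
sims = {other: cosine(toy.loc[target], toy.loc[other])
        for other in toy.index if other != target}
neighbour = max(sims, key=sims.get)

# Recommend what the nearest neighbour rated best among the target's unrated restaurants
unrated = toy.columns[toy.loc[target].isna()]
recommendation = toy.loc[neighbour, unrated].idxmax()
print(neighbour, recommendation)
```

Here consumer B agrees with A on every restaurant both have rated, so B's rating of R3 drives the suggestion.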

The datasets for this project can be found on [Kaggle](https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings). 

The following code imports the Python libraries required for this project and loads the datasets.

In [1]:
# Import libraries necessary for this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display # Allows the use of display() for DataFrames
import scipy.stats

# Pretty display for notebooks
%matplotlib inline


In [2]:
print('Loading restaurant datasets')

# Load Restaurant Payment dataset
try:
    rest_pay = pd.read_csv('chefmozaccepts.csv')
    print('Payment dataset has %d samples with %d features each.' % (rest_pay.shape))
except FileNotFoundError:
    print('Payment dataset could not be loaded. Is the dataset missing?')
    
# Load the Restaurant Cuisine dataset
try:
    rest_cuisine = pd.read_csv('chefmozcuisine.csv')
    print('Cuisine dataset has %d samples with %d features each.' % (rest_cuisine.shape))
except FileNotFoundError:
    print('Cuisine dataset could not be loaded. Is the dataset missing?')
    
# Load the Restaurant Hours dataset
try:
    rest_hours = pd.read_csv('chefmozhours4.csv')
    print('Hours dataset has %d samples with %d features each.' % (rest_hours.shape))
except FileNotFoundError:
    print('Hours dataset could not be loaded. Is the dataset missing?')
    
# Load the Restaurant Parking dataset
try:
    rest_parking = pd.read_csv('chefmozparking.csv')
    print('Parking dataset has %d samples with %d features each.' % (rest_parking.shape))
except FileNotFoundError:
    print('Parking dataset could not be loaded. Is the dataset missing?')

#Load Restaurant Geo-places dataset
try:
    rest_geo = pd.read_csv('geoplaces2.csv')
    print('Geo-places dataset has %d samples with %d features each.' % (rest_geo.shape))
except FileNotFoundError:
    print('Geo-places dataset could not be loaded. Is the dataset missing?')

print('\n')

print('Loading consumer datasets')

# Load the Consumer Cuisine dataset
try:
    cons_cuisine = pd.read_csv('usercuisine.csv')
    print('Cuisine dataset has %d samples with %d features each.' % (cons_cuisine.shape))
except FileNotFoundError:
    print('Cuisine dataset could not be loaded. Is the dataset missing?')

#Load Consumer Payment dataset
try:
    cons_pay = pd.read_csv('userpayment.csv')
    print('Payment dataset has %d samples with %d features each.' % (cons_pay.shape))
except FileNotFoundError:
    print('Payment dataset could not be loaded. Is the dataset missing?')

#Load Consumer Profile dataset
try:
    cons_profile = pd.read_csv('userprofile.csv')
    print('Profile dataset has %d samples with %d features each.' % (cons_profile.shape))
except FileNotFoundError:
    print('Profile dataset could not be loaded. Is the dataset missing?')
    
print('\n')

print('Loading User-Item-Rating dataset')

#Load Rating dataset
try:
    rating = pd.read_csv('rating_final.csv')
    print('Rating dataset has %d samples with %d features each.' % (rating.shape))
except FileNotFoundError:
    print('Rating dataset could not be loaded. Is the dataset missing?')
    


Loading restaurant datasets
Payment dataset has 1314 samples with 2 features each.
Cuisine dataset has 916 samples with 2 features each.
Hours dataset has 2339 samples with 3 features each.
Parking dataset has 702 samples with 2 features each.
Geo-places dataset has 130 samples with 21 features each.


Loading consumer datasets
Cuisine dataset has 330 samples with 2 features each.
Payment dataset has 177 samples with 2 features each.
Profile dataset has 138 samples with 19 features each.


Loading User-Item-Rating dataset
Rating dataset has 1161 samples with 5 features each.
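
The loading blocks above all follow one pattern: read a CSV, print its shape, and report a missing file. A minimal sketch of a helper that captures this pattern is shown below; the function name `load_dataset` and the dictionary layout are hypothetical, and only the file names come from the cell above.

```python
import pandas as pd

def load_dataset(label, path):
    """Read one CSV and report its shape; return None if the file is missing."""
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        print('%s dataset could not be loaded. Is the dataset missing?' % label)
        return None
    print('%s dataset has %d samples with %d features each.' % ((label,) + df.shape))
    return df

# Label -> file name, using the restaurant files loaded above
restaurant_files = {
    'Payment': 'chefmozaccepts.csv',
    'Cuisine': 'chefmozcuisine.csv',
    'Hours': 'chefmozhours4.csv',
    'Parking': 'chefmozparking.csv',
    'Geo-places': 'geoplaces2.csv',
}
restaurant_data = {label: load_dataset(label, path)
                   for label, path in restaurant_files.items()}
```

Catching only `FileNotFoundError` means that other problems, such as a malformed file, still surface as errors instead of being reported as a missing dataset.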


# Data Exploration

In this section, we will begin exploring the data through visualizations and code to understand how features of each dataset are related to one another.

Restaurant datasets:<br>
1. rest_pay: 'placeID', 'Rpayment'<br>
2. rest_cuisine: 'placeID', 'Rcuisine' <br>
3. rest_hours: 'placeID', 'hours', 'days' <br>
4. rest_parking: 'placeID', 'parking_lot' <br>
5. rest_geo: 'placeID', 'latitude', 'longitude', 'the_geom_meter', 'name', 'address','city', 'state', 'country', 'fax', 'zip', 'alcohol', 'smoking_area','dress_code', 'accessibility', 'price', 'url', 'Rambience', 'franchise','area', 'other_services'<br>

User datasets:<br>
1. cons_pay: 'userID', 'Upayment'<br>
2. cons_cuisine: 'userID', 'Rcuisine'<br>
3. cons_profile: 'userID', 'latitude', 'longitude', 'smoker', 'drink_level', 'dress_preference', 'ambience', 'transport', 'marital_status', 'hijos', 'birth_year', 'interest', 'personality', 'religion', 'activity', 'color', 'weight', 'budget', 'height' <br>

Rating dataset:
1. rating: 'userID', 'placeID', 'rating', 'food_rating', 'service_rating'


In [3]:
# Number of users who have given ratings to the restaurants
list_users = rating.userID.unique()
print(len(list_users))

138


In [4]:
# #Merge all the restaurant dataframes into one
# from functools import reduce
# df = [rest_cuisine,rest_hours,rest_parking,rest_geo]
# rest_final = reduce(lambda left,right: pd.merge(left,right,on='placeID'), df)
# print(rest_final.columns)

# #Merge all the user dataframes into one
# df = [cons_cuisine,cons_profile]
# cons_final = reduce(lambda left,right: pd.merge(left,right,on='userID'), df)
# print(cons_final.columns)

In [5]:
# Keep only the users in cons_profile who have given ratings
# (deleting the loop variable inside iterrows() does not remove rows from the DataFrame)
cons_profile = cons_profile[cons_profile['userID'].isin(list_users)].reset_index(drop=True)

In [6]:
# #Remove features which are not useful for recommendation
# rest_final = rest_final.drop(['url'], axis = 1) #not useful as most of the values are '?'
# rest_final = rest_final.drop(['fax'], axis = 1) #all the values are '?'
# rest_final = rest_final.drop(['country','state','city','zip','address'], axis = 1) #Not useful as we can directly
#                                                                                    #use latitudes and longitudes


In [7]:
# #Scatter matrix for continuous values in the user dataset
# pd.plotting.scatter_matrix(cons_final, alpha = 0.3, figsize = (20,20), diagonal = 'kde')
# #From the graph below, we know that there's a correlation between weight and height and therefore we can remove one of them.

In [8]:
# Remove height since it shows high correlation with weight
cons_profile = cons_profile.drop('height', axis = 1)
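
The decision to drop one of two correlated features can be backed by a number. The sketch below computes the Pearson correlation coefficient on a small table and drops a column when it exceeds a cut-off; the height and weight values and the 0.9 threshold are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical height (m) and weight (kg) values, for illustration only
toy = pd.DataFrame({
    'height': [1.55, 1.62, 1.70, 1.78, 1.85],
    'weight': [50, 58, 66, 75, 84],
})

# Pearson correlation coefficient between the two columns
r = toy['height'].corr(toy['weight'])
print(round(r, 3))

# Drop one of the pair when the correlation exceeds a chosen threshold
threshold = 0.9  # assumed cut-off, not taken from the notebook
if abs(r) > threshold:
    toy = toy.drop('height', axis=1)
print(list(toy.columns))
```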

## Checking and replacing missing values in the datasets

### Restaurant Dataset

In [9]:
# The code below prints True for each attribute that contains missing values
# print('Restaurant:\n',rest_final.isin(['?']).any(), end = '\n\n')
print('Customer:\n',cons_profile.isin(['?']).any())

Customer:
 userID              False
latitude            False
longitude           False
smoker               True
drink_level         False
dress_preference     True
ambience             True
transport            True
marital_status       True
hijos                True
birth_year          False
interest            False
personality         False
religion            False
activity             True
color               False
weight              False
budget               True
dtype: bool
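
The check above reports only whether a column contains `'?'`. As a sketch of how the extent of the problem could be measured, the example below counts the markers per column and converts them to NaN; the toy rows are hypothetical and only mimic the layout of the profile table.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of a profile table using '?' as the missing-value marker
toy = pd.DataFrame({
    'userID': ['U1', 'U2', 'U3', 'U4'],
    'smoker': ['false', '?', 'true', '?'],
    'budget': ['low', 'medium', '?', 'low'],
    'weight': [69, 40, 60, 44],
})

# Count '?' entries per column, keeping only the affected columns
counts = toy.isin(['?']).sum()
print(counts[counts > 0])

# Converting '?' to NaN lets the pandas missing-data tools (isna, fillna) apply
cleaned = toy.replace('?', np.nan)
print(cleaned.isna().sum().sum())
```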


### User Dataset

In [10]:
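# Store the column positions of features having NaN or '?' values
indices = set()  # a set keeps each position only once
for index, row in cons_profile.iterrows():
    for i in range(len(row)):
        value = row.iloc[i]
        # Compare by value: identity checks (`is`) against literals are unreliable
        if pd.isna(value) or value == '?':
            indices.add(i)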

In [11]:
# Features having NaN or '?' values, listed in column order
missing = list(cons_profile.columns[sorted(indices)])
print(missing)

['smoker', 'dress_preference', 'ambience', 'transport', 'marital_status', 'hijos', 'activity', 'budget']


In [12]:
# Only the features with categorical data have missing values.
# Replace each NaN or '?' with a value drawn at random from the feature's observed values.
import random
for attr in missing:
    # Rows where the feature is missing
    mask = cons_profile[attr].isna() | (cons_profile[attr] == '?')
    # Unique observed values of the feature, excluding the missing-value markers
    uni = list(cons_profile.loc[~mask, attr].unique())
    for idx in cons_profile.index[mask]:
        # .loc assigns into the DataFrame itself and avoids chained assignment
        cons_profile.loc[idx, attr] = random.choice(uni)



In [13]:
len(cons_profile)

138
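
Because the replacement values are drawn at random, each run of the imputation cell can fill the gaps differently. A minimal sketch of the same idea with a seeded generator, so that the result is repeatable, is shown below; the toy column, the helper name `fill_random`, and the seed value are assumptions.

```python
import random
import pandas as pd

# Hypothetical column with both kinds of missing marker ('?' and None)
toy = pd.DataFrame({'transport': ['public', '?', 'car owner', None, 'on foot', '?']})

rng = random.Random(42)  # seeded generator; the seed value is arbitrary

def fill_random(series, rng):
    """Replace NaN or '?' entries with a random choice among the observed values."""
    missing = series.isna() | (series == '?')
    observed = sorted(series[~missing].unique())
    filled = series.copy()
    for idx in series.index[missing]:
        filled.loc[idx] = rng.choice(observed)
    return filled

toy['transport'] = fill_random(toy['transport'], rng)
print(toy['transport'].tolist())
```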

## Encoding String/Object type data into Integer

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode every categorical (string) column as integer labels
categorical_cols = ['smoker', 'drink_level', 'dress_preference', 'ambience',
                    'transport', 'marital_status', 'hijos', 'interest',
                    'personality', 'religion', 'activity', 'color', 'budget']
for col in categorical_cols:
    cons_profile[col] = le.fit_transform(cons_profile[col])
# cons_profile['Upayment'] = le.fit_transform(cons_profile['Upayment'])
# cons_profile['Rcuisine'] = le.fit_transform(cons_profile['Rcuisine'])


# rest_final['Rpayment'] = le.fit_transform(rest_final['Rpayment'])
# rest_final['parking_lot'] = le.fit_transform(rest_final['parking_lot'])
# rest_final['Rcuisine'] = le.fit_transform(rest_final['Rcuisine'])
# rest_final['days'] = le.fit_transform(rest_final['days'])
# rest_final['the_geom_meter'] = le.fit_transform(rest_final['the_geom_meter'])
# rest_final['name'] = le.fit_transform(rest_final['name'])
# rest_final['smoking_area'] = le.fit_transform(rest_final['smoking_area'])
# rest_final['dress_code'] = le.fit_transform(rest_final['dress_code'])
# rest_final['price'] = le.fit_transform(rest_final['price'])
# rest_final['alcohol'] = le.fit_transform(rest_final['alcohol'])
# rest_final['Rambience'] = le.fit_transform(rest_final['Rambience'])
# rest_final['accessibility'] = le.fit_transform(rest_final['accessibility'])
# rest_final['franchise'] = le.fit_transform(rest_final['franchise'])
# rest_final['area'] = le.fit_transform(rest_final['area'])
# rest_final['other_services'] = le.fit_transform(rest_final['other_services'])
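
Refitting a single `LabelEncoder` on every column keeps only the last column's label mapping, so earlier codes can no longer be translated back to their labels. As a sketch of one way to keep the mappings, the example below uses pandas categorical codes, which likewise number the sorted labels from 0. This swaps in pandas for scikit-learn, and the toy values are assumptions.

```python
import pandas as pd

# Hypothetical categorical columns, for illustration only
toy = pd.DataFrame({
    'drink_level': ['abstemious', 'social drinker', 'casual drinker', 'abstemious'],
    'budget': ['medium', 'low', 'high', 'low'],
})

mappings = {}  # column name -> list of labels, indexed by integer code
for col in ['drink_level', 'budget']:
    as_category = toy[col].astype('category')
    mappings[col] = list(as_category.cat.categories)
    toy[col] = as_category.cat.codes

print(toy)
print(mappings['budget'])
```

With the mapping stored, `mappings['budget'][code]` recovers the original label for any encoded value.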

In [15]:
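from sklearn.utils import shuffle
cons_profile = shuffle(cons_profile)
test_size = int(0.3*len(cons_profile))
# .copy() gives independent frames, so later column assignments do not act on a slice
cons_test = cons_profile[-test_size:].copy()
# Note: this keeps only the first test_size rows for training;
# cons_profile[:-test_size] would use the remaining 70% of the rows instead
cons_train = cons_profile[:test_size].copy()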
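
For comparison, a minimal sketch of a positional 70/30 split in which the training part is everything outside the test slice, so the two parts are disjoint and together cover every row. The toy user IDs and the seed value are assumptions; `DataFrame.sample` is used here in place of `sklearn.utils.shuffle` to keep the example free of extra dependencies.

```python
import pandas as pd

# Hypothetical table of ten users
toy = pd.DataFrame({'userID': ['U%d' % i for i in range(10)]})

# Shuffle with a fixed seed so the split is repeatable (seed value is arbitrary)
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

test_size = int(0.3 * len(shuffled))
toy_test = shuffled[-test_size:].copy()   # last 30% of the rows
toy_train = shuffled[:-test_size].copy()  # remaining 70% of the rows

print(len(toy_train), len(toy_test))
```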

In [16]:
print(test_size)

41


In [17]:
print(len(cons_train['latitude'].unique()))

40


In [18]:
print(cons_train['latitude'].unique())

[23.735698 23.733    22.156469 22.118464 23.753112 22.122989 22.207749
 22.162562 23.73944  22.150683 18.871674 22.149654 22.139997 23.753336
 22.15     22.146708 23.730569 22.168997 22.174624 18.952615 22.153385
 23.752269 19.347641 22.152884 22.169184 22.160572 18.879729 22.154339
 18.927072 22.19204  22.143524 23.724972 18.925773 22.143078 22.139511
 22.190949 23.77103  22.138245 18.813348 22.205802]


In [19]:
temp = pd.cut(cons_train['latitude'],2)

In [20]:
print(temp.value_counts())

(21.292, 23.771]    34
(18.808, 21.292]     7
Name: latitude, dtype: int64


In [21]:
cons_train['latitude'] = pd.cut(cons_train['latitude'],2)



In [22]:
print(cons_train['latitude'].head())

30     (21.292, 23.771]
25     (21.292, 23.771]
75     (21.292, 23.771]
6      (21.292, 23.771]
122    (21.292, 23.771]
Name: latitude, dtype: category
Categories (2, interval[float64]): [(18.808, 21.292] < (21.292, 23.771]]
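
`pd.cut` with an integer bin count derives the interval edges from the data it is given. A sketch of how the same edges could be reused on another set of values, and how a raw value can be tested against an interval, is shown below; the latitude values are hypothetical.

```python
import pandas as pd

# Hypothetical latitudes for a training and a test user group
train_lat = pd.Series([18.81, 22.15, 22.20, 23.77])
test_lat = pd.Series([19.35, 23.73])

# Bin the training values into 2 equal-width intervals and keep the bin edges
train_binned, edges = pd.cut(train_lat, 2, retbins=True)
print(train_binned.value_counts().sort_index())

# Reuse the same edges so that test values fall into the same intervals
test_binned = pd.cut(test_lat, bins=edges)
print(test_binned.tolist())

# A raw value can also be tested against an interval directly
print(23.73 in train_binned.iloc[3])
```

The membership test on the last line is the same operation the similarity function relies on when it compares a raw test latitude with a binned training latitude.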


In [23]:
temp = pd.cut(cons_train['longitude'],2)

In [24]:
print(temp.value_counts())

(-101.001, -100.033]    24
(-100.033, -99.067]     17
Name: longitude, dtype: int64


In [25]:
cons_train['longitude'] = pd.cut(cons_train['longitude'],2)



In [26]:
print(cons_train.head())

    userID          latitude             longitude  smoker  drink_level  \
30   U1031  (21.292, 23.771]   (-100.033, -99.067]       0            0   
25   U1026  (21.292, 23.771]   (-100.033, -99.067]       0            0   
75   U1076  (21.292, 23.771]  (-101.001, -100.033]       0            2   
6    U1007  (21.292, 23.771]  (-101.001, -100.033]       0            1   
122  U1123  (21.292, 23.771]   (-100.033, -99.067]       0            0   

     dress_preference  ambience  transport  marital_status  hijos  birth_year  \
30                  2         2          2               0      2        1992   
25                  1         0          2               1      1        1989   
75                  0         0          2               0      2        1987   
6                   2         2          2               1      1        1989   
122                 2         0          0               1      1        1987   

     interest  personality  religion  activity  color  weight 

In [27]:
info  = []
for index,row in cons_test.iterrows():
    if row['userID'] == 'U1129':
        for attr in row:
            info.append(attr)

In [28]:
print(info)

['U1129', 23.728798, -99.134047, 0, 1, 3, 1, 2, 1, 1, 1989, 4, 3, 4, 1, 4, 69, 1]


In [29]:
for index,row in cons_train.iterrows():
    if row['userID'] == 'U1011':
        print(row)

userID                            U1011
latitude               (21.292, 23.771]
longitude           (-100.033, -99.067]
smoker                                0
drink_level                           0
dress_preference                      3
ambience                              0
transport                             2
marital_status                        1
hijos                                 1
birth_year                         1989
interest                              4
personality                           1
religion                              0
activity                              1
color                                 4
weight                               68
budget                                2
Name: 10, dtype: object


In [67]:
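def similar_user(userid):
    """Return the IDs of the five training users most similar to `userid`."""
    # Attribute values of the query user, taken from the test set
    match = cons_test[cons_test['userID'] == userid]
    if match.empty:
        raise ValueError('%s is not in cons_test' % userid)
    info = list(match.iloc[0])
    users = {}
    for index, row in cons_train.iterrows():
        points = 0
        # Position 0 is userID and is skipped; as in the original loop,
        # the last column (budget) is not compared
        for i in range(1, len(row) - 1):
            if i == 1 or i == 2:
                # latitude and longitude are binned in cons_train:
                # score a point when the raw value falls inside the interval
                if info[i] in row.iloc[i]:
                    points += 1
            elif info[i] == row.iloc[i]:
                points += 1
        users[row.iloc[0]] = points
    # Five users with the highest number of matching attributes
    ranked = sorted(users.items(), key=lambda x: x[1], reverse=True)[:5]
    return [uid for uid, _ in ranked]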
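
The function scores each training user by the number of attributes that match the query user. The same matching count can be sketched on a toy table with a vectorized comparison; the column values and user IDs here are hypothetical, and the binned latitude and longitude columns are left out for brevity.

```python
import pandas as pd

# Hypothetical encoded profiles; each row is a training user
train = pd.DataFrame(
    {'smoker': [0, 0, 1, 0], 'transport': [2, 2, 0, 1], 'budget': [1, 2, 1, 2]},
    index=['U1', 'U2', 'U3', 'U4'],
)
# Attributes of the query user
query = pd.Series({'smoker': 0, 'transport': 2, 'budget': 1})

# Compare every row with the query and count the matching attributes
points = train.eq(query, axis=1).sum(axis=1)

# The k users with the most matches
k = 2
top = points.sort_values(ascending=False).head(k).index.tolist()
print(points.to_dict())
print(top)
```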
            

In [68]:
for i in range(len(rating)):
    if rating.userID[i] == 'U1111':
        print(rating.userID[i], rating.placeID[i], rating.rating[i], rating.food_rating[i], rating.service_rating[i])

U1111 132845 2 2 1
U1111 135071 2 2 2
U1111 132858 1 1 1
U1111 132854 2 2 2
U1111 132877 1 1 1
U1111 132851 2 1 0
U1111 135108 2 1 0
U1111 132869 0 0 0
U1111 132870 0 0 0
U1111 132847 0 0 0
U1111 135082 1 0 0


In [69]:
def maximum(usr):
    """Return the placeIDs that `usr` rated highest, by the mean of the three ratings."""
    # Rows of the rating table that belong to this user
    user_ratings = rating[rating.userID == usr]
    # Mean of rating, food_rating and service_rating for each of those rows
    average = user_ratings.loc[:, "rating":"service_rating"].mean(numeric_only=True, axis=1)
    # Nothing to recommend when the user has no ratings or the best average is 0
    if average.empty or average.max() == 0:
        return []
    # All places that share the user's highest average
    return user_ratings.placeID[average == average.max()].tolist()
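
The selection can be traced on the U1111 rows printed earlier. The sketch below rebuilds those rows as a small DataFrame and uses `groupby` to pick the places with the highest non-zero average rating per user; the variable names are hypothetical, and this is one possible formulation, not the function defined above.

```python
import pandas as pd

# The U1111 rows printed earlier in the notebook
toy = pd.DataFrame({
    'userID': ['U1111'] * 11,
    'placeID': [132845, 135071, 132858, 132854, 132877, 132851,
                135108, 132869, 132870, 132847, 135082],
    'rating': [2, 2, 1, 2, 1, 2, 2, 0, 0, 0, 1],
    'food_rating': [2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 0],
    'service_rating': [1, 2, 1, 2, 1, 0, 0, 0, 0, 0, 0],
})

# Mean of the three rating columns for every row
toy['average'] = toy[['rating', 'food_rating', 'service_rating']].mean(axis=1)

# Highest average per user, broadcast back to each of that user's rows
best = toy.groupby('userID')['average'].transform('max')

# Keep rows that reach the user's best non-zero average, then list places per user
top = toy[(toy['average'] == best) & (best > 0)]
favourites = top.groupby('userID')['placeID'].apply(list).to_dict()
print(favourites)
```

Two restaurants received the top score of 2 on all three ratings, so both are returned for this user.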

In [71]:
usr = 'U1129'
userList = similar_user(usr)
# Map each similar user to the restaurants that user rated highest
place = {}
for value in userList:
    place[value] = maximum(value)
print(place)
print(len(place))
# maximum('U1129')

{'U1019': [], 'U1007': [135057, 135058], 'U1087': [132667, 132732], 'U1011': [132717], 'U1094': [135108]}
5
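
The dictionary above still lists restaurants per similar user. One possible way to turn it into a single recommendation list, sketched here as an assumption and not as part of the original notebook, is to count how many similar users favour each restaurant and order the restaurants by that count.

```python
from collections import Counter

# Output of the cell above: similar user -> that user's top-rated restaurants
place = {
    'U1019': [],
    'U1007': [135057, 135058],
    'U1087': [132667, 132732],
    'U1011': [132717],
    'U1094': [135108],
}

# Count how many similar users put each restaurant among their favourites
votes = Counter(pid for places in place.values() for pid in places)

# Restaurants ordered by vote count; equal counts keep first-seen order
recommendations = [pid for pid, _ in votes.most_common()]
print(recommendations)
```

In this run every restaurant has a single vote, so the list simply follows the order of the similar users.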
