# Recommendation Systems

In this project, we build a popularity-based recommender with the Pandas library, followed by a correlation-based recommender that suggests similar items. We also apply several machine learning algorithms to make recommendations and evaluate the resulting recommender system.

## Popularity-Based Recommenders

Popularity-based recommenders offer items based on how popular they are among users. The assumption is that the item with the highest number of ratings is the most popular. Because every user sees the same ranking, this method cannot produce personalized results. <br>
Example: the most-shared articles on a news website.
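
As a minimal sketch of the idea, the snippet below ranks items by how many ratings they received. The `toy_ratings` frame and its column names are hypothetical and are not part of the restaurant dataset.

```python
import pandas as pd

# Toy ratings (hypothetical data, not taken from the restaurant dataset)
toy_ratings = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2", "u1"],
    "item": ["A", "A", "A", "B", "B", "C"],
    "rating": [2, 1, 2, 0, 1, 2],
})

# Popularity is measured as the number of ratings each item received
popularity = (
    toy_ratings.groupby("item")["rating"]
    .count()
    .sort_values(ascending=False)
)

# Every user receives the same top-N list, which is why the result is not personalized
top_items = popularity.head(2).index.tolist()
print(top_items)
```

Note that the ranking uses only the count of ratings, not their values, so a frequently rated but poorly rated item can still rank first.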

In [1]:
#importing libraries
import pandas as pd
import numpy as np

The dataset is from this link: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

In [2]:
#loading the data (rating_final & chefmozcuisine)
frame = pd.read_csv('rating_final.csv')
cuisine = pd.read_csv('chefmozcuisine.csv')

In [3]:
frame.head()

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


As you can see, this dataset lists the ratings that users gave to each place. Every rating is 0, 1, or 2, where 2 is the best rating and 0 is the worst.

In [4]:
cuisine.head()

Unnamed: 0,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


The cuisine dataset shows the type of cuisine that each place serves. 

In [5]:
rating_count = pd.DataFrame(frame.groupby('placeID')['rating'].count())
rating_count.sort_values('rating', ascending = False).head()

Unnamed: 0_level_0,rating
placeID,Unnamed: 1_level_1
135085,36
132825,32
135032,28
135052,25
132834,25


Looking at the rating counts, we see the top 5 most popular places by number of ratings. Now, let's look at the cuisine that each place serves and see what they have in common.

In [6]:
most_rated_places = pd.DataFrame([135085, 132825, 135032, 135052, 132834], index = np.arange(5), columns = ['placeID'])
summary = pd.merge(most_rated_places, cuisine, on = 'placeID')
summary

Unnamed: 0,placeID,Rcuisine
0,135085,Fast_Food
1,132825,Mexican
2,135032,Cafeteria
3,135032,Contemporary
4,135052,Bar
5,135052,Bar_Pub_Brewery


So, what we have here is a list of popular places in town and the cuisine that is served in each of them.

In [None]:
cuisine['Rcuisine'].describe()

There are 59 different kinds of cuisine, and the most frequently occurring one is <strong>Mexican</strong>. 

Mexican is the most common cuisine in the cuisine dataset, and it also appears among the most-rated places. It makes sense!

## Correlation-Based Recommenders

In correlation-based recommendation systems, items are recommended based on user reviews. In other words, the system chooses items based on how well they correlate with other items with respect to user ratings. Unlike popularity-based systems, correlation-based systems do take user preferences into account. They use Pearson's R correlation to offer the items that are most similar to the items a user has chosen in the past.
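
To make the idea concrete, here is a small sketch of item-to-item Pearson correlation on a hypothetical user-item matrix. The matrix `toy` and its item labels are invented for illustration; only `DataFrame.corrwith` is the same call used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are items, NaN = not rated
toy = pd.DataFrame(
    {
        "A": [2, 1, 0, 2, np.nan],
        "B": [2, 1, 0, np.nan, 1],
        "C": [0, 1, 2, 0, np.nan],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)

# Pearson's r of every item column against item A.
# For each pair, only the users who rated both items contribute.
similarity_to_a = toy.corrwith(toy["A"])
print(similarity_to_a.round(3))
```

In this toy data, item B is rated identically to A by their three common users, while C is rated in exactly the opposite way, so a user who liked A would be pointed toward B and away from C.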

In [8]:
#the dataset is from the same source cited in the previous section.
#the frame and cuisine dataframes were loaded in the previous section.

#note: the 'mbcs' codec is only available on Windows
geodata = pd.read_csv('geoplaces2.csv', encoding = 'mbcs')

In [9]:
geodata.head()

Unnamed: 0,placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,...,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
1,132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
2,135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
3,132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
4,132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,...,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none


We use the geodata dataframe because it provides a name for every reviewed place. 

In [10]:
places = geodata[['placeID', 'name']]

In [11]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating.head()

Unnamed: 0_level_0,rating
placeID,Unnamed: 1_level_1
132560,0.5
132561,0.75
132564,1.25
132572,1.0
132583,1.0


Now we have every place along with its average rating. Next, let's see how popular each place is by counting its ratings. 

In [12]:
rating['rating_count'] = frame.groupby('placeID')['rating'].count()
rating.head()

Unnamed: 0_level_0,rating,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
132560,0.5,4
132561,0.75,4
132564,1.25,4
132572,1.0,15
132583,1.0,4


In [13]:
rating.describe()

Unnamed: 0,rating,rating_count
count,130.0,130.0
mean,1.179622,8.930769
std,0.349354,6.124279
min,0.25,3.0
25%,1.0,5.0
50%,1.181818,7.0
75%,1.4,11.0
max,2.0,36.0


The count row shows that 130 unique places have been reviewed in the dataframe. 

In [14]:
rating.sort_values('rating_count', ascending = False).head()

Unnamed: 0_level_0,rating,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
135085,1.333333,36
132825,1.28125,32
135032,1.178571,28
135052,1.28,25
132834,1.0,25


In [15]:
#finding the name of the restaurant by placeID
places[places['placeID'] == 135085]

Unnamed: 0,placeID,name
121,135085,Tortas Locas Hipocampo


In [16]:
#type of the cuisine of this place
cuisine[cuisine['placeID'] == 135085]

Unnamed: 0,placeID,Rcuisine
44,135085,Fast_Food


In [17]:
#preparing the data for analysis
places_crosstab = pd.pivot_table(data = frame, values = 'rating', index = 'userID', columns = 'placeID')
places_crosstab.head(5)

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,


There are so many NaN values because most users have reviewed only a handful of places. 
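
The degree of sparsity can be quantified directly. The sketch below uses a small hypothetical crosstab (`toy_crosstab`, with made-up place IDs) rather than `places_crosstab`, but the same two expressions should apply to any user-by-item frame with NaN for missing ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-place matrix with NaN where a user has not rated a place
toy_crosstab = pd.DataFrame(
    {
        101: [2.0, np.nan, np.nan],
        102: [np.nan, 1.0, np.nan],
        103: [np.nan, np.nan, 0.0],
    },
    index=["u1", "u2", "u3"],
)

# Fraction of user/place cells that hold no rating
sparsity = toy_crosstab.isna().to_numpy().mean()

# Number of places each user actually rated
ratings_per_user = toy_crosstab.notna().sum(axis=1)

print(f"sparsity: {sparsity:.2%}")
print(ratings_per_user)
```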

In [None]:
tortas_rating = places_crosstab[135085]
tortas_rating[tortas_rating >= 0]

In [19]:
#evaluating based on correlation 
similar_to_tortas = places_crosstab.corrwith(tortas_rating)
corr_tortas = pd.DataFrame(similar_to_tortas, columns = ['PearsonR'])
corr_tortas.dropna(inplace = True) #dropping the NaN values
corr_tortas.head()


Unnamed: 0_level_0,PearsonR
placeID,Unnamed: 1_level_1
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823


In [20]:
tortas_corr_summary = corr_tortas.join(rating['rating_count'])

In [21]:
tortas_corr_summary[tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending = False).head(10)

Unnamed: 0_level_0,PearsonR,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
135076,1.0,13
135085,1.0,36
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12


Note that a place can have a PearsonR of 1 simply because very little data supports the correlation: the place may have few ratings overall, or it may share only one or two reviewers with Tortas Locas Hipocampo. To be on the safe side, we set aside the other two places with a PearsonR of exactly 1 (135076 and 135066) and work with seven places: 135085 itself and the six places with the next-highest correlations. 
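
One way to check the "few common reviewers" explanation is to count, for each place, how many users rated both it and the target. The following is a sketch on hypothetical data; the frame `toy`, its column names, and the `min_common` threshold are assumptions for illustration, not values from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item ratings; NaN means the user did not rate the item
toy = pd.DataFrame(
    {
        "target": [2, 1, 0, 2, 1],
        "few_overlap": [2, 1, np.nan, np.nan, np.nan],
        "good_overlap": [2, 1, 0, 1, 1],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)
target = toy["target"]

# Pearson's r against the target, computed on co-rated users only
pearson_r = toy.corrwith(target)

# Number of users who rated both the target and each item
n_common = toy[target.notna()].notna().sum()

summary = pd.DataFrame({"PearsonR": pearson_r, "n_common": n_common})

# Keep only correlations supported by enough shared reviewers (threshold is an assumption)
min_common = 3
trusted = summary[summary["n_common"] >= min_common]
print(trusted.sort_values("PearsonR", ascending=False))
```

Here `few_overlap` reaches a correlation of 1 from just two shared reviewers, which is exactly the kind of result the overlap filter is meant to screen out.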

In [22]:
places_corr_tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046], index = np.arange(7), 
                                  columns = ['placeID'])
summary = pd.merge(places_corr_tortas, cuisine, on='placeID')
summary

Unnamed: 0,placeID,Rcuisine
0,135085,Fast_Food
1,132754,Mexican
2,135028,Mexican
3,135042,Chinese
4,135046,Fast_Food


We see only 5 of the 7 places because not all of them are listed in the cuisine dataset. One of the listed places, 135046, serves fast food, just like Tortas Locas Hipocampo. 
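
Rows disappear because `pd.merge` performs an inner join by default. A left join with `indicator=True` makes the unmatched places visible, as sketched below on hypothetical frames (`toy_places` and `toy_cuisine` are invented; only the column names mirror the notebook).

```python
import pandas as pd

# Hypothetical place list and cuisine table; place 2 has no cuisine entry
toy_places = pd.DataFrame({"placeID": [1, 2, 3]})
toy_cuisine = pd.DataFrame(
    {"placeID": [1, 3, 3], "Rcuisine": ["Mexican", "Bar", "Bar_Pub_Brewery"]}
)

# how="left" keeps every place; indicator=True adds a "_merge" column
# that records whether each row found a match in the cuisine table
merged = pd.merge(toy_places, toy_cuisine, on="placeID", how="left", indicator=True)

# Places that are absent from the cuisine table
missing = merged.loc[merged["_merge"] == "left_only", "placeID"].tolist()
print(merged)
print(missing)
```

The same left join also shows that a place with several cuisine entries produces several rows, which is why a merged table can be both shorter and longer than expected.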

In [23]:
#name of the restaurant
places[places['placeID'] == 135046]

Unnamed: 0,placeID,name
42,135046,Restaurante El Reyecito


In [24]:
cuisine['Rcuisine'].describe()

count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

<strong>Conclusion</strong>: In this case, we recommend Restaurante El Reyecito to users who like Tortas Locas Hipocampo. 

## Classification-Based Collaborative Filtering (Machine Learning Based Recommenders - Logistic Regression)

Classification-based collaborative filtering can make personalized recommendations because it takes into account user attributes as well as purchase history and other contextual data (e.g. browser history). 

In [98]:
from pandas import DataFrame, Series
from sklearn.linear_model import LogisticRegression

The dataset is from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (UCI machine learning repository)

In [99]:
bank_full = pd.read_csv('bank_full_w_dummy_vars.csv')
bank_full.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,job_unknown,job_retired,job_services,job_self_employed,job_unemployed,job_maid,job_student,married,single,divorced
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,0,0,0,0,0,0,0,1,0,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,0,0,0,0,0,0,0,0,1,1
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,0,0,0,0,0,0,0,1,0,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,0,0,0,0,0,0,0,1,0,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,1,0,0,0,0,0,0,0,1,1


In [100]:
bank_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 37 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   age                           45211 non-null  int64 
 1   job                           45211 non-null  object
 2   marital                       45211 non-null  object
 3   education                     45211 non-null  object
 4   default                       45211 non-null  object
 5   balance                       45211 non-null  int64 
 6   housing                       45211 non-null  object
 7   loan                          45211 non-null  object
 8   contact                       45211 non-null  object
 9   day                           45211 non-null  int64 
 10  month                         45211 non-null  object
 11  duration                      45211 non-null  int64 
 12  campaign                      45211 non-null  int64 
 13  pdays           

Some dummy variables were added to the dataset to represent its non-numerical features. 

In [101]:
#columns 18 to 36 are used as the features
X = bank_full.iloc[:, 18:37].values

#column 17 is used as the target
y = bank_full.iloc[:, 17].values

In [102]:
LogReg = LogisticRegression()
model = LogReg.fit(X, y)

In [103]:
new_user = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
new_user_array = np.array([new_user])

In [104]:
y_pred = model.predict(new_user_array)
y_pred

array([0], dtype=int64)

Based on the prediction, the new user is not expected to accept the bank's offer, so the representative should not market the offer to this user.  
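
A hard 0/1 label hides how confident the model is. As a sketch, `predict_proba` returns the estimated probability of each class, which could be used to rank users instead of simply accepting or rejecting them. The data below is synthetic (the label is deliberately set equal to the first feature), so the numbers say nothing about the bank dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary features, standing in for dummy variables (not the bank data)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 3))

# Synthetic label: equal to the first feature, so the pattern is easy to learn
y_toy = X_toy[:, 0]

clf = LogisticRegression().fit(X_toy, y_toy)

# A hypothetical new user
new_user_toy = np.array([[1, 0, 1]])

# One row per user, one column per class, in the order given by clf.classes_
proba = clf.predict_proba(new_user_toy)
label = clf.predict(new_user_toy)

print(clf.classes_, proba.round(3), label)
```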

#### Evaluating our recommendation system

In [105]:
from sklearn.metrics import classification_report

In [109]:
y_pred_total = model.predict(X)

In [110]:
print(classification_report(y, y_pred_total))

              precision    recall  f1-score   support

           0       0.90      0.99      0.94     39922
           1       0.67      0.17      0.27      5289

    accuracy                           0.89     45211
   macro avg       0.79      0.58      0.61     45211
weighted avg       0.87      0.89      0.86     45211
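
The report above is computed on the same rows the model was trained on, so it may be optimistic. A common alternative is to hold out part of the data for testing. The sketch below shows that workflow on synthetic, imbalanced data; the variable names, the 25% test size, and the use of `class_weight="balanced"` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary features and an imbalanced synthetic label (not the bank data)
rng = np.random.default_rng(42)
X_syn = rng.integers(0, 2, size=(1000, 5))
y_syn = (
    (X_syn[:, 0] == 1) & (X_syn[:, 1] == 1) & (rng.random(1000) < 0.7)
).astype(int)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# class_weight="balanced" is one option for imbalanced classes (an assumption here)
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Evaluate only on rows the model has not seen during fitting
report = classification_report(y_test, clf.predict(X_test))
print(report)
```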



## Collaborative Filtering System

Collaborative filtering is a recommendation method that works from the reactions of similar users. 
1. User-based: based on similarity between users' attributes.
2. Item-based: people who like this product also like products x, y, and z. This relies on similarity between items with respect to user ratings (as in correlation-based recommenders).

In [52]:
import sklearn
from sklearn.decomposition import TruncatedSVD

The dataset was downloaded from https://grouplens.org/datasets/movielens/100k/ (GroupLens at the University of Minnesota)

In [54]:
#defining the column names
columns = ['user_id', 'item_id', 'rating', 'timestamp']
data = pd.read_csv('ml-100k/u.data', sep = '\t', names = columns)
data.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [56]:
columns_2 = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep = '|', names = columns_2, encoding = 'latin_1')

In [57]:
movie_names = movies[['item_id', 'movie title']]
movie_names.head()

Unnamed: 0,item_id,movie title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [58]:
#combining the two datasets into one
combined_movie_data = pd.merge(data, movie_names, on = 'item_id')
combined_movie_data.head()

Unnamed: 0,user_id,item_id,rating,timestamp,movie title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


In [59]:
combined_movie_data.groupby('movie title')['rating'].count().sort_values(ascending = False).head()

movie title
Star Wars (1977)             583
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: rating, dtype: int64

<strong>Utility Matrix</strong>: The data used in a recommendation system falls into two categories: the users and the items. Each user likes certain items, and the rating value r_ij (from 1 to 5) associated with user i and item j represents how much the user appreciates the item. (The definition is from the Machine Learning for the Web book.)

In [61]:
#building a utility matrix
rating_crosstab = combined_movie_data.pivot_table(values = 'rating', index = 'user_id', columns = 'movie title', 
                                                  fill_value = 0)
rating_crosstab.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,2,5,0,0,3,4,0,0,...,0,0,0,5,3,0,0,0,4,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,2,0,0,0,0,4,0,0,...,0,0,0,4,0,0,0,0,4,0


In [63]:
rating_crosstab.shape

(943, 1664)

In [65]:
#transposing the matrix
X1 = rating_crosstab.values.T
X1.shape

(1664, 943)

In [67]:
SVD = TruncatedSVD(n_components = 12, random_state = 17)
resultant_matrix = SVD.fit_transform(X1)
resultant_matrix.shape

(1664, 12)

In [68]:
#generating the correlation matrix
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(1664, 1664)

In [77]:
#isolating star wars from correlation matrix
movies_names = rating_crosstab.columns
movies_list = list(movies_names)
star_wars = movies_list.index('Star Wars (1977)')
print(star_wars)

1398


In [78]:
corr_star_wars = corr_mat[star_wars]
corr_star_wars.shape

(1664,)

In [83]:
#recommending a highly correlated movie
list(movies_names[(corr_star_wars < 1.0) & (corr_star_wars > 0.9)])

['Die Hard (1988)',
 'Empire Strikes Back, The (1980)',
 'Fugitive, The (1993)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Star Wars (1977)',
 'Terminator 2: Judgment Day (1991)',
 'Terminator, The (1984)',
 'Toy Story (1995)']

In [82]:
list(movies_names[(corr_star_wars < 1) & (corr_star_wars > 0.95)])

['Return of the Jedi (1983)', 'Star Wars (1977)']

Based on our machine learning recommender system, a person who enjoyed Star Wars (1977) has a good chance of liking Return of the Jedi (1983) as well. 
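
Note that 'Star Wars (1977)' appears in its own recommendation lists above, presumably because its self-correlation comes out marginally below 1.0 in floating point and so passes the `< 1` filter. A sketch of an alternative is to exclude the query item by position and take the top N by correlation. The titles and ratings below are synthetic, and "Movie B" is made an exact copy of "Movie A" so that the expected neighbour is known in advance.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic item-by-user rating matrix (6 hypothetical movies, 20 users)
rng = np.random.default_rng(17)
titles = np.array(["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"])
toy_item_user = rng.integers(0, 6, size=(6, 20)).astype(float)
toy_item_user[1] = toy_item_user[0]  # Movie B is rated exactly like Movie A

# Compress each movie into a few latent features, then correlate the movies
svd = TruncatedSVD(n_components=3, random_state=17)
latent = svd.fit_transform(toy_item_user)
corr = np.corrcoef(latent)

# Exclude the query movie by index instead of relying on a "< 1.0" comparison
target_idx = int(np.where(titles == "Movie A")[0][0])
scores = corr[target_idx].copy()
scores[target_idx] = -np.inf

# The two movies most correlated with the query, best first
top_n = titles[np.argsort(scores)[::-1][:2]].tolist()
print(top_n)
```

Ranking by position also avoids having to pick a correlation threshold such as 0.9 or 0.95 by hand.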

## Content-Based Recommender

A content-based recommender suggests items whose features are similar to the features of a given item. To do this, we use the nearest neighbors algorithm, which is an unsupervised learning method.

In [84]:
import sklearn
from sklearn.neighbors import NearestNeighbors

In [86]:
cars = pd.read_csv('mtcars.csv')
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.head()

Unnamed: 0,car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [89]:
# we need to find a car with these features (mpg, disp, hp, wt)
t = [15, 300, 160, 3.2]

X = cars.iloc[:, [1,3,4,6]].values
X[0:5]

array([[ 21.   , 160.   , 110.   ,   2.62 ],
       [ 21.   , 160.   , 110.   ,   2.875],
       [ 22.8  , 108.   ,  93.   ,   2.32 ],
       [ 21.4  , 258.   , 110.   ,   3.215],
       [ 18.7  , 360.   , 175.   ,   3.44 ]])

In [91]:
nbrs = NearestNeighbors(n_neighbors = 1).fit(X)

In [92]:
print(nbrs.kneighbors([t]))

(array([[10.77474942]]), array([[22]], dtype=int64))


In [96]:
cars.loc[22, 'car_names']

'AMC Javelin'

According to our results, the best match for this customer is AMC Javelin.
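
The search above uses raw feature values, so features on a large scale (displacement, in the hundreds) weigh far more in the Euclidean distance than features on a small scale (weight, in single digits). One common option, sketched here as an assumption rather than as part of the original analysis, is to standardize the features first. The five rows are copied from the `cars.head()` output above, so the result applies only to this small subset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Five rows in the order mpg, disp, hp, wt (copied from the cars.head() output)
X_toy = np.array([
    [21.0, 160.0, 110.0, 2.62],   # Mazda RX4
    [21.0, 160.0, 110.0, 2.875],  # Mazda RX4 Wag
    [22.8, 108.0, 93.0, 2.32],    # Datsun 710
    [21.4, 258.0, 110.0, 3.215],  # Hornet 4 Drive
    [18.7, 360.0, 175.0, 3.44],   # Hornet Sportabout
])
target = np.array([[15, 300, 160, 3.2]])

# Rescale every feature to zero mean and unit variance so each contributes comparably
scaler = StandardScaler().fit(X_toy)
nbrs_scaled = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X_toy))

# The query must go through the same scaler before the lookup
distance, index = nbrs_scaled.kneighbors(scaler.transform(target))
print(distance, index)
```

On these five rows the nearest car is the Hornet Sportabout (row 4) with or without scaling, but on the full dataset the two approaches may disagree.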