# Recommendation

Purpose: to find and recommend items that a user is most likely to be interested in.

### Examples of Recommendation Engines:

1. Product: Amazon, Etsy
2. Movie: Netflix
3. Music: Apple Music, Spotify, etc.
4. Social connections: Facebook, LinkedIn, Instagram

## Simple Approaches to Recommender Systems:

### 1. POPULARITY-BASED RECOMMENDERS

**Based on simple count statistics** (number of ratings given to an item)

| user   | place   | rating |
| :----:   | :----:    | :----:   |
| user A | place 1 | 10     |
| user B | place 1 | 8 |
| user C | place 2 | 8 |
| user D | place 2 | 7 |
| user E | place 1 | 8 |
| user F | place 1 | 7 |
| user G | place 1 | 10 |
| | ![](https://cdnjs.cloudflare.com/ajax/libs/fontisto/3.0.4/icons/directional/arrow-down.png) | |

| place   | rating count |
|:-------:|:------------:|
| place 1 | 5 |
| place 2 | 2 |
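
The step from the first toy table to the second can be sketched in pandas as below. The DataFrame simply retypes the toy rows; the names `toy` and `rating_count` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy ratings copied from the first table above (one row per user rating).
toy = pd.DataFrame({
    "user": ["A", "B", "C", "D", "E", "F", "G"],
    "place": ["place 1", "place 1", "place 2", "place 2",
              "place 1", "place 1", "place 1"],
    "rating": [10, 8, 8, 7, 8, 7, 10],
})

# Popularity = number of ratings per place, most-rated place first.
rating_count = (
    toy.groupby("place")["rating"]
       .count()
       .sort_values(ascending=False)
)
print(rating_count)
```

Note that the rating values themselves are ignored here: only how many ratings each place received matters.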

Fun facts on **Popularity based recommenders**:
- rely on purchase history data
- are often used by online news sites like Bloomberg
- cannot produce personalized results

In [1]:
import pandas as pd
import numpy as np

In [2]:
ratings = pd.read_csv('rating_final.csv')
ratings.head()

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


In [3]:
cuisines = pd.read_csv('chefmozcuisine.csv')
cuisines.head()

Unnamed: 0,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


In [11]:
# how many people have given ratings to a particular 'place'
ratings_counts = pd.DataFrame(ratings.groupby('placeID')['rating'].count())
ratings_counts.sort_values('rating', ascending=False).head()

Unnamed: 0_level_0,rating
placeID,Unnamed: 1_level_1
135085,36
132825,32
135032,28
135052,25
132834,25


In [13]:
cuisine_popularity = pd.merge(ratings_counts, cuisines, on='placeID')
cuisine_popularity.sort_values('rating', ascending=False).head()

Unnamed: 0,placeID,rating,Rcuisine
105,135085,36,Fast_Food
28,132825,32,Mexican
71,135032,28,Cafeteria
72,135032,28,Contemporary
86,135052,25,Bar_Pub_Brewery


In [15]:
cuisine_popularity['Rcuisine'].describe()

count         112
unique         23
top       Mexican
freq           28
Name: Rcuisine, dtype: object

In [16]:
cuisines['Rcuisine'].describe()

count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

### 2. CORRELATION BASED RECOMMENDER SYSTEMS

**Pearson's correlation coefficient (r) - "Pearson's r"**

|       r  | description  |
|----------|--------------|
| *r = 1*  | Perfect positive *linear* relationship |
| *r = 0*  | Not linearly correlated                |
| *r = -1* | Perfect negative *linear* relationship |
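
For reference, Pearson's r for two vectors $x$ and $y$ is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. A minimal sketch of that formula, with made-up rating vectors (`x`, `y_same`, `y_flip` are illustrative only) and a cross-check against `np.corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [0, 1, 2, 1, 2]        # made-up ratings for one place
y_same = [0, 2, 4, 2, 4]   # exactly 2 * x: perfectly positively correlated
y_flip = [2, 1, 0, 1, 0]   # exactly 2 - x: perfectly negatively correlated

print(pearson_r(x, y_same), pearson_r(x, y_flip))
print(np.corrcoef(x, y_same)[0, 1])   # NumPy's built-in, for comparison
```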

#### Item based similarity:
    Recommend an item based on how well it correlates with other items with respect to user ratings
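
Before working with the real data, the idea can be tried on a tiny, hypothetical user-by-place matrix (all names and values below are made up). `DataFrame.corrwith` correlates each column with a reference column, using only the users who rated both places:

```python
import numpy as np
import pandas as pd

# Hypothetical user x place rating matrix; NaN means "not rated".
toy_ratings = pd.DataFrame(
    {
        "place_a": [2.0, 1.0, 0.0, 2.0, np.nan],
        "place_b": [2.0, 1.0, 0.0, np.nan, 1.0],
        "place_c": [0.0, 1.0, 2.0, 0.0, 2.0],
    },
    index=["U1", "U2", "U3", "U4", "U5"],
)

# Correlate every place column with place_a; each coefficient is
# computed only over users who rated both places.
similar_to_a = toy_ratings.corrwith(toy_ratings["place_a"])
print(similar_to_a.sort_values(ascending=False))
```

In this toy data `place_b` is rated identically to `place_a` by their shared users, while `place_c` is rated in the opposite way, so the two land at opposite ends of the ranking.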

In [17]:
ratings = pd.read_csv('rating_final.csv')
cuisines = pd.read_csv('chefmozcuisine.csv')
geodata = pd.read_csv('geoplaces2.csv')

In [18]:
geodata.head()

Unnamed: 0,placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,...,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
1,132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
2,135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
3,132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
4,132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,...,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none


In [19]:
places = geodata[['placeID', 'name']]
places.head()

Unnamed: 0,placeID,name
0,134999,Kiku Cuernavaca
1,132825,puesto de tacos
2,135106,El Rincón de San Francisco
3,132667,little pizza Emilio Portes Gil
4,132613,carnitas_mata


In [90]:
# Average Rating based ranking w.r.t. places
place_avg_ratings = pd.DataFrame(ratings.groupby('placeID')['rating'].mean())
place_avg_ratings.head()

Unnamed: 0_level_0,rating
placeID,Unnamed: 1_level_1
132560,0.5
132561,0.75
132564,1.25
132572,1.0
132583,1.0


In [91]:
place_avg_ratings['rating_count'] = pd.DataFrame(ratings.groupby('placeID')['rating'].count())
place_avg_ratings.head()

Unnamed: 0_level_0,rating,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
132560,0.5,4
132561,0.75,4
132564,1.25,4
132572,1.0,15
132583,1.0,4


In [92]:
place_avg_ratings.describe()

Unnamed: 0,rating,rating_count
count,130.0,130.0
mean,1.179622,8.930769
std,0.349354,6.124279
min,0.25,3.0
25%,1.0,5.0
50%,1.181818,7.0
75%,1.4,11.0
max,2.0,36.0


In [93]:
place_avg_ratings.sort_values('rating_count', ascending=False).head()

Unnamed: 0_level_0,rating,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
135085,1.333333,36
132825,1.28125,32
135032,1.178571,28
135052,1.28,25
132834,1.0,25


In [94]:
ratings_cross_table = pd.pivot_table(data=ratings, values='rating', index='userID', columns='placeID')
ratings_cross_table.head()

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,


In [178]:
most_popular_place = place_avg_ratings['rating_count'].idxmax()
print("most popular place: {}".format(most_popular_place))

# .iloc[0] picks the first cuisine listed for that place (.name would only return the column label)
most_popular_cuisine = cuisines.loc[cuisines['placeID'] == most_popular_place, 'Rcuisine'].iloc[0]
print("most popular cuisine: {}".format(most_popular_cuisine))

most_popular_place_ratings = ratings_cross_table[most_popular_place]
most_popular_place_ratings[most_popular_place_ratings.notnull()]

most popular place: 135085
most popular cuisine: Fast_Food


userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

In [183]:
# Evaluating similarity based on "correlation":

similar_to_most_popular = ratings_cross_table.corrwith(most_popular_place_ratings)

corr_most_popular = pd.DataFrame(similar_to_most_popular, columns=['PearsonsR'])
corr_most_popular.dropna(inplace=True)
corr_most_popular.head()


Unnamed: 0_level_0,PearsonsR
placeID,Unnamed: 1_level_1
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823


In [186]:
corr_with_most_pop_ratcnt = corr_most_popular.join(place_avg_ratings['rating_count'])
corr_with_most_pop_ratcnt.head()

Unnamed: 0_level_0,PearsonsR,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
132572,-0.428571,15
132723,0.301511,12
132754,0.930261,13
132825,0.700745,32
132834,0.814823,25


In [189]:
top_10_places_like_most_pop = (
    corr_with_most_pop_ratcnt[corr_with_most_pop_ratcnt['rating_count'] >= 10]
    .sort_values('PearsonsR', ascending=False)
    .head(10)
)
top_10_places_like_most_pop

Unnamed: 0_level_0,PearsonsR,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
135076,1.0,13
135085,1.0,36
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12


In [195]:
pd.merge(pd.merge(top_10_places_like_most_pop, cuisines, on='placeID'), places, on='placeID')

Unnamed: 0,placeID,PearsonsR,rating_count,Rcuisine,name
0,135085,1.0,36,Fast_Food,Tortas Locas Hipocampo
1,132754,0.930261,13,Mexican,Cabana Huasteca
2,135028,0.892218,15,Mexican,La Virreina
3,135042,0.881409,20,Chinese,Restaurant Oriental Express
4,135046,0.867722,11,Fast_Food,Restaurante El Reyecito
5,132872,0.840168,12,American,Pizzeria Julios
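
Only six of the ten top places appear in this output. That is consistent with `pd.merge` defaulting to an inner join, which drops any place that has no matching row in the other table. A small sketch with made-up frames (all ids and values are hypothetical) comparing the default with `how='left'`:

```python
import pandas as pd

# Hypothetical stand-ins for the top-N table and the cuisine lookup.
top_places = pd.DataFrame({"placeID": [1, 2, 3], "PearsonsR": [1.0, 0.9, 0.8]})
cuisine_lookup = pd.DataFrame({"placeID": [1, 3], "Rcuisine": ["Mexican", "Cafeteria"]})

inner = pd.merge(top_places, cuisine_lookup, on="placeID")              # default: how="inner"
left = pd.merge(top_places, cuisine_lookup, on="placeID", how="left")   # keep every top place

print(len(inner), len(left))
print(left)
```

With `how='left'`, places without a known cuisine are kept and show `NaN` in `Rcuisine`, which may be preferable when the goal is to list all recommendations.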


## Collaborative Filtering Recommenders

**Recommend items based on crowdsourced information about users' preferences for items**.

2 approaches:
1. User based

    *Based on known user attributes, we know that User B is similar to User D. User D really likes his life insurance policy, so let's recommend it to User B as well.*

2. Item based

    *User B and User D both gave high ratings to the cell phone and the cell phone case. Since User A also likes the cell phone, let's recommend to her the cell phone case also.*
    
User attributes can be described as a list of values (possibly boolean).
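
As a rough illustration of the user-based idea, the sketch below encodes three users as boolean attribute vectors and finds the user most similar to User B. The attribute meanings, the values, and the choice of cosine similarity as the measure are all assumptions made for this example, not something the notebook prescribes.

```python
import numpy as np

# Hypothetical boolean attributes: [owns_home, has_children, is_student, owns_car]
users = {
    "User B": np.array([1, 1, 0, 1]),
    "User C": np.array([0, 0, 1, 0]),
    "User D": np.array([1, 1, 0, 0]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = "User B"
scores = {
    name: cosine_similarity(users[target], vec)
    for name, vec in users.items()
    if name != target
}
most_similar = max(scores, key=scores.get)
print(most_similar, scores)
```

Items that the most similar user rated highly would then be candidates to recommend to the target user.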

### Classification-Based Collaborative Filtering

Provides personalization by accepting:
- user and item attribute data
- purchase history data
- other contextual data

Output: a Yes/No classification (will the user accept or purchase the item?).

Example classification methods:

1. Naive Bayes classification
2. Logistic regression

#### 1. Logistic Regression as classifier

In [197]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame
from sklearn.linear_model import LogisticRegression

In [198]:
bank_full = pd.read_csv('bank_full_w_dummy_vars.csv')
bank_full.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,job_unknown,job_retired,job_services,job_self_employed,job_unemployed,job_maid,job_student,married,single,divorced
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,0,0,0,0,0,0,0,1,0,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,0,0,0,0,0,0,0,0,1,1
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,0,0,0,0,0,0,0,1,0,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,0,0,0,0,0,0,0,1,0,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,1,0,0,0,0,0,0,0,1,1


In [199]:
bank_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 37 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   age                           45211 non-null  int64 
 1   job                           45211 non-null  object
 2   marital                       45211 non-null  object
 3   education                     45211 non-null  object
 4   default                       45211 non-null  object
 5   balance                       45211 non-null  int64 
 6   housing                       45211 non-null  object
 7   loan                          45211 non-null  object
 8   contact                       45211 non-null  object
 9   day                           45211 non-null  int64 
 10  month                         45211 non-null  object
 11  duration                      45211 non-null  int64 
 12  campaign                      45211 non-null  int64 
 13  pdays           

In [219]:
X = bank_full.iloc[:,list(range(18,37))].values

y = bank_full.iloc[:,17].values

print("X shape = {}".format(X.shape))
print("y shape = {}".format(y.shape))

X shape = (45211, 19)
y shape = (45211,)


In [220]:
log_reg = LogisticRegression()
log_reg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [221]:
new_user = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = log_reg.predict([new_user])
y_pred

array([0])
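
`predict` returns only the hard Yes/No label. When candidates need to be ranked, `predict_proba` gives the estimated probability of each class instead. The sketch below uses synthetic data (the features, the labelling rule, and the 10% noise level are all made up) rather than the bank data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the dummy-variable matrix: 200 users, 3 binary features.
X_toy = rng.integers(0, 2, size=(200, 3))
# Made-up rule: the label equals feature 0, with roughly 10% of labels flipped.
y_toy = X_toy[:, 0] ^ (rng.random(200) < 0.1)

clf = LogisticRegression().fit(X_toy, y_toy)

new_users = np.array([[1, 0, 0], [0, 0, 0]])
proba = clf.predict_proba(new_users)   # one row per user, columns ordered as clf.classes_
print(clf.classes_)
print(proba)
```

The column for class `1` can be read as the estimated chance of a "yes", so users can be sorted by it before deciding whom to target.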

#### 2. Model-based Collaborative filtering

##### Singular Value Decomposition (SVD)
- A linear algebra method that decomposes a utility matrix into three compressed matrices.
- Model-based recommender - uses these compressed matrices to make recommendations without having to refer back to the complete data set.
- Latent variables - inferred, nonobservable variables that are present within, and affect the behavior of, a data set.

$\Large \begin{bmatrix} {}_{A} \end{bmatrix} = \begin{bmatrix} {}_{u} \end{bmatrix} \times \begin{bmatrix} {}_{S} \end{bmatrix} \times \begin{bmatrix} {}_{v} \end{bmatrix}$


$\Large {A} = {u} \times {S} \times {v}$

- **A** = Original matrix (utility matrix)
- **u** = Left orthogonal matrix - holds important, non-redundant information about users
- **v** = Right orthogonal matrix - holds important, non-redundant information on items
- **S** = Diagonal matrix - holds the singular values, which indicate how strongly each latent factor contributes to the decomposition
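
The decomposition can be checked numerically on a tiny, made-up utility matrix. This sketch uses `np.linalg.svd`, which returns the singular values as a 1-D array and the third factor already transposed (named `vh` here), so the product is written `u @ diag(s) @ vh`:

```python
import numpy as np

# Tiny made-up utility matrix: 4 users (rows) x 3 items (columns).
A = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])

u, s, vh = np.linalg.svd(A, full_matrices=False)

A_full = u @ np.diag(s) @ vh                     # reconstruction from all factors
k = 2
A_rank_k = u[:, :k] @ np.diag(s[:k]) @ vh[:k]    # keep only the k largest singular values

print(s)
print(np.round(A_rank_k, 2))
```

Keeping only the largest singular values is what makes the matrices "compressed": the rank-`k` product approximates `A` using fewer numbers.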

**Building a utility matrix**

```python
ratings_crosstab = combined_movies_data.pivot_table(
    values='rating', 
    index='user_id',
    columns='movie_title',
    fill_value=0)

# shape = (num_users, num_movies)
```

This generates a cross table with users as the rows (the index) and movies as the columns: a typical wide matrix.

**Transposing the Matrix**

```python
X = ratings_crosstab.values.T

# shape = (num_movies, num_users)
```

This transposes the matrix, so that movies become the rows and users the columns.

**Decomposing the Matrix**

```python
from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=12, random_state=17)

resultant_matrix = SVD.fit_transform(X)

# shape = (num_movies, n_components=12)
```

**Generating a Correlation Matrix**

```python
corr_mat = np.corrcoef(resultant_matrix)

# shape = (num_movies, num_movies)
```

**Isolating top movie from the correlation matrix**

```python
movie_names = ratings_crosstab.columns
movies_list = list(movie_names)

top_movie = movies_list.index('<Top movie title (9999)>')

corr_top_movie = corr_mat[top_movie]
corr_top_movie.shape

# shape = (num_movies,)
```

**Recommending a Highly Correlated Movie**

```python
list(movie_names[
    (corr_top_movie < 1.0) & (corr_top_movie > 0.9)
])

# list of movies highly correlated with the 'Top movie'
```
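
The snippets above depend on a `combined_movies_data` frame that is not shown here. The same sequence of steps can be run end to end on a small, made-up ratings table, as sketched below. All titles and ratings are invented, and `n_components=3` is chosen only because the toy matrix is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Made-up long-format ratings: 6 users, 5 movies.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
    "movie_title": ["A", "B", "C", "A", "B", "D", "C", "D", "E",
                    "A", "B", "E", "C", "D", "E", "A", "B", "C"],
    "rating": [5, 5, 1, 4, 5, 1, 5, 4, 5, 5, 4, 2, 4, 5, 4, 4, 4, 2],
})

crosstab = toy.pivot_table(values="rating", index="user_id",
                           columns="movie_title", fill_value=0)
X = crosstab.values.T                      # (num_movies, num_users)

svd = TruncatedSVD(n_components=3, random_state=17)
reduced = svd.fit_transform(X)             # (num_movies, 3)
corr = np.corrcoef(reduced)                # (num_movies, num_movies)

titles = list(crosstab.columns)
target = titles.index("A")
# Rank every other movie by its correlation with movie "A", highest first.
ranked = sorted(
    ((titles[i], corr[target, i]) for i in range(len(titles)) if i != target),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

On real data a threshold such as the `0.9` used above, or simply the top few entries of `ranked`, would select the recommendations.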

## Machine Learning based Recommenders

### Content-based recommender systems

**Content-based recommenders recommend items based on similarities between features.**

*Example: A user who loves Miami might also love Austin, based on the similarities in temperature, cost of living, and Wi-Fi speeds at both places.*

#### K-nearest neighbor algorithm

- Used here as an unsupervised nearest-neighbor search (no labels are required)
- Also known as a memory-based system
- Memorizes instances and then recommends an item (a single instance) based on how quantitatively similar it is to a new, incoming instance.

Example:
> I want to buy a car that gets 25 MPG, and has a 4.7 L engine with 425 HP.

The solution is to find the **one** car (k = 1) closest to the provided specification, measured by Euclidean distance.

In [223]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.neighbors import NearestNeighbors

In [225]:
cars = pd.read_csv('mtcars.csv')
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.head()

Unnamed: 0,car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [226]:
specifications_needed = [15, 300, 160, 3.2]
# mpg, disp, hp, wt

X = cars[['mpg', 'disp', 'hp', 'wt']]
X[0:5]

Unnamed: 0,mpg,disp,hp,wt
0,21.0,160.0,110,2.62
1,21.0,160.0,110,2.875
2,22.8,108.0,93,2.32
3,21.4,258.0,110,3.215
4,18.7,360.0,175,3.44


In [233]:
nearest_neighbors = NearestNeighbors(n_neighbors=1).fit(X)

# kneighbors returns (distances, indices): y holds the distance, neighbor_coord the row index
y, neighbor_coord = nearest_neighbors.kneighbors([specifications_needed])

print("y = {}, neighbor_coord = {}".format(y, neighbor_coord))

y = [[10.77474942]], neighbor_coord = [[22]]


In [242]:
cars.iloc[neighbor_coord[0]]

Unnamed: 0,car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
22,AMC Javelin,15.2,8,304.0,150,3.15,3.435,17.3,0,0,3,2
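
The reported distance can be verified by hand. The sketch below recomputes the Euclidean distance with NumPy for four cars whose `mpg`, `disp`, `hp`, and `wt` values are copied from the outputs in this notebook. The brute-force search is only an illustration of what the nearest-neighbor query computes, not how scikit-learn implements it.

```python
import numpy as np

# Rows copied from the notebook output: mpg, disp, hp, wt.
names = ["Mazda RX4", "Hornet Sportabout", "Duster 360", "AMC Javelin"]
features = np.array([
    [21.0, 160.0, 110.0, 2.62],
    [18.7, 360.0, 175.0, 3.44],
    [14.3, 360.0, 245.0, 3.57],
    [15.2, 304.0, 150.0, 3.435],
])
query = np.array([15.0, 300.0, 160.0, 3.2])   # same values as specifications_needed

# Euclidean distance from the query to every stored car.
distances = np.sqrt(((features - query) ** 2).sum(axis=1))
nearest = int(np.argmin(distances))
print(names[nearest], distances[nearest])
```

Because the features are not scaled, `disp` and `hp` (values in the hundreds) dominate the distance while `wt` barely matters. Standardizing the columns before fitting is a common remedy, if all four specifications should count comparably.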


In [243]:
cars

Unnamed: 0,car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
