# Yelp Recommender

## Intro

The purpose of this exercise is to use Spark in a real dataset, instead of just a toy example.

You will use the data from the [Yelp Dataset Challenge](https://www.yelp.de/dataset_challenge), which contains information about businesses, users, reviews and more.

For this exercise, you will need to focus only on the following files:
- yelp_academic_dataset_business.json
- yelp_academic_dataset_review.json

The goal is to build a recommender using [Spark's ALS (Alternating Least Squares)](https://spark.apache.org/docs/2.3.0/ml-collaborative-filtering.html) and then generate recommendations for a given user.

Since the dataset is quite big, you should pick a business category (e.g. Restaurants) and a city (e.g. Edinburgh) and work on the recommender using only this subset of the data.

Please take some time to:
- find out what information you will need to feed as input to Spark's ALS
- check how this information is available in the dataset
- plan how you will tackle this problem

In [1]:
# Download a small version of the Yelp dataset
#!wget https://s3.us-west-2.amazonaws.com/dsr-spark-appliedml/yelp_dataset_small.tar.gz
#!gunzip yelp_dataset_small.tar.gz

In [2]:
import pyspark
from pyspark import SparkContext, SQLContext
sc = pyspark.SparkContext('local[*]')
sqlc = SQLContext(sc)

## Business Data

- Load the file ***yelp_academic_dataset_business.json*** and select the following columns:
    - business_id
    - name
    - city
    - stars
    - categories
    - address

In [3]:
df_business = sqlc.read.json('../data/yelp/yelp_academic_dataset_business.json')

In [4]:
df_business.take(1)

[Row(address='227 E Baseline Rd, Ste J2', attributes=['BikeParking: True', 'BusinessAcceptsBitcoin: False', 'BusinessAcceptsCreditCards: True', "BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}", 'DogsAllowed: False', 'RestaurantsPriceRange2: 2', 'WheelchairAccessible: True'], business_id='0DI8Dt2PJp07XkVvIElIcQ', categories=['Tobacco Shops', 'Nightlife', 'Vape Shops', 'Shopping'], city='Tempe', hours=['Monday 11:0-21:0', 'Tuesday 11:0-21:0', 'Wednesday 11:0-21:0', 'Thursday 11:0-21:0', 'Friday 11:0-22:0', 'Saturday 10:0-22:0', 'Sunday 11:0-18:0'], is_open=0, latitude=33.3782141, longitude=-111.936102, name='Innovative Vapors', neighborhood='', postal_code='85283', review_count=17, stars=4.5, state='AZ', type='business')]

In [5]:
df_business.select('business_id', 'name', 'city', 'stars', 'categories', 'address').limit(5).toPandas()

Unnamed: 0,business_id,name,city,stars,categories,address
0,0DI8Dt2PJp07XkVvIElIcQ,Innovative Vapors,Tempe,4.5,"[Tobacco Shops, Nightlife, Vape Shops, Shopping]","227 E Baseline Rd, Ste J2"
1,LTlCaCGZE14GuaUXUGbamg,Cut and Taste,Las Vegas,5.0,"[Caterers, Grocery, Food, Event Planning & Ser...",495 S Grand Central Pkwy
2,EDqCEAGXVGCH4FJXgqtjqg,Pizza Pizza,Toronto,2.5,"[Restaurants, Pizza, Chicken Wings, Italian]",979 Bloor Street W
3,cnGIivYRLxpF7tBVR_JwWA,Plush Salon and Spa,Oakdale,4.0,"[Hair Removal, Beauty & Spas, Blow Dry/Out Ser...",7014 Steubenville Pike
4,cdk-qqJ71q6P7TJTww_DSA,Comfort Inn,Toronto,3.0,"[Hotels & Travel, Event Planning & Services, H...",321 Jarvis Street


### Choosing a business category

- Define a regular Python function that takes a list of categories and returns 1 if a category of your choice (for instance, 'Restaurants') is contained in the list of categories or 0 otherwise
- Using the Python function, define a Spark's User Defined Function (UDF) with an IntegerType return
- Using the UDF, filter the businesses that belong to the category you chose

In [6]:
list=["a","b","c"]


In [7]:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, PandasUDFType

@udf("integer")
def is_restaurant(categories):
    return 1 if "Restaurants" in categories else 0

# df_restaurants = df_business.withColumn("isRestaurant", is_restaurant(df_business.categories))
df_restaurants = df_business.filter(is_restaurant(df_business.categories) == 1)

In [8]:
df_restaurants.select('name','categories').limit(10).toPandas()

Unnamed: 0,name,categories
0,Pizza Pizza,"[Restaurants, Pizza, Chicken Wings, Italian]"
1,Taco Bell,"[Tex-Mex, Mexican, Fast Food, Restaurants]"
2,Ohana Hawaiian BBQ,"[Hawaiian, Restaurants, Barbeque]"
3,Chez Lionel,"[Restaurants, Cafes]"
4,La Prep,"[Sandwiches, Breakfast & Brunch, Salad, Restau..."
5,Chipotle Mexican Grill,"[Fast Food, Mexican, Restaurants]"
6,Carrabba's Italian Grill,"[Restaurants, Italian, Seafood]"
7,Don Tequila,"[Restaurants, Mexican, American (Traditional)]"
8,Lo-Lo's Chicken & Waffles,"[Restaurants, Waffles, Southern, Soul Food]"
9,Kabob Palace,"[Persian/Iranian, Restaurants, Ethnic Food, Fo..."


- The UDF approach works just fine, but there is a more straightforward way to perform the same operation
    - hint: look at ***array_contains*** SQL function

In [9]:
import pyspark.sql.functions as F

# you can overwrite the former df_restaurants
df_business.createOrReplaceTempView("business")
df_restaurants = sqlc.sql("select * from business where array_contains(categories, 'Restaurants')")

In [10]:
df_restaurants.select('name','categories').limit(10).toPandas()

Unnamed: 0,name,categories
0,Pizza Pizza,"[Restaurants, Pizza, Chicken Wings, Italian]"
1,Taco Bell,"[Tex-Mex, Mexican, Fast Food, Restaurants]"
2,Ohana Hawaiian BBQ,"[Hawaiian, Restaurants, Barbeque]"
3,Chez Lionel,"[Restaurants, Cafes]"
4,La Prep,"[Sandwiches, Breakfast & Brunch, Salad, Restau..."
5,Chipotle Mexican Grill,"[Fast Food, Mexican, Restaurants]"
6,Carrabba's Italian Grill,"[Restaurants, Italian, Seafood]"
7,Don Tequila,"[Restaurants, Mexican, American (Traditional)]"
8,Lo-Lo's Chicken & Waffles,"[Restaurants, Waffles, Southern, Soul Food]"
9,Kabob Palace,"[Persian/Iranian, Restaurants, Ethnic Food, Fo..."


### Choosing a city
- Having filtered by the business category, now it is time to filter by the city (for instance, Edinburgh)

In [11]:
df_city_restaurants = df_restaurants.filter(df_restaurants.city == 'Edinburgh')

In [12]:
df_city_restaurants.select('name','city','categories').limit(10).toPandas()

Unnamed: 0,name,city,categories
0,Juice Almighty,Edinburgh,"[Food, Fast Food, Restaurants, Juice Bars & Sm..."
1,Crofters,Edinburgh,"[Nightlife, Bars, Pubs, Burgers, Steakhouses, ..."
2,Spicy Bite,Edinburgh,"[Fast Food, Halal, Pizza, Restaurants]"
3,Croma Pizzeria,Edinburgh,"[Italian, Restaurants, Pizza]"
4,Dubh Prais Restaurant,Edinburgh,"[Restaurants, British]"
5,Chop Chop Leith,Edinburgh,"[Restaurants, Chinese]"
6,Chip Inn,Edinburgh,"[Restaurants, Fish & Chips]"
7,Cafe Voltaire,Edinburgh,"[Coffee & Tea, Bars, Food, Restaurants, Cockta..."
8,The Doric,Edinburgh,"[Gastropubs, Pubs, Bars, Nightlife, Restaurant..."
9,The Pompadour by Galvin,Edinburgh,"[French, Restaurants, Scottish]"


### Generating numeric IDs
- If you haven't done it yet, take one sample from your already filtered DataFrame and notice that the ***business_id*** contains an alphanumeric value - this is not good for Spark's ALS implementation, which requires IDs for items (in our case, businesses) and users to be numeric
- Use a ***StringIndexer*** to create a new column ***business_idn*** from the conversion of business_id into a numeric value

In [13]:
from pyspark.ml.feature import StringIndexer

t1 = StringIndexer(inputCol="business_id", outputCol="business_idn").fit(df_city_restaurants)
df_city_restaurants = t1.transform(df_city_restaurants)

In [14]:
df_city_restaurants.select('business_id', 'business_idn').take(5)

[Row(business_id='NsarUMMMPOlMBb6K04x6hw', business_idn=24.0),
 Row(business_id='m5CY1jy3dvBw8J4iiCBlXw', business_idn=1033.0),
 Row(business_id='raixiox15brAXfTUbg1YBw', business_idn=1127.0),
 Row(business_id='ynjAEdXdIw7JF153uABgoQ', business_idn=1318.0),
 Row(business_id='aCul-vH-5hCxX6XZs0g22A', business_idn=972.0)]

In [15]:
df_city_restaurants.cache()

DataFrame[address: string, attributes: array<string>, business_id: string, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, stars: double, state: string, type: string, business_idn: double]

## Review Data

- Load the file ***yelp_academic_dataset_review.json*** and select the following columns:
    - user_id
    - business-id
    - stars
    - date

In [16]:
df_reviews = sqlc.read.json('../data/yelp/yelp_academic_dataset_review.json')

In [17]:
df_reviews.limit(5).toPandas()


Unnamed: 0,business_id,cool,date,funny,review_id,stars,type,useful,user_id
0,2aFiy99vNLklCx3T_tGS9A,0,2011-10-10,0,NxL8SIC5yqOdnlXCg18IBg,5,review,0,KpkOkG6RIf4Ra25Lhhxf1A
1,2aFiy99vNLklCx3T_tGS9A,0,2010-12-29,0,pXbbIgOXvLuTi_SPs1hQEQ,5,review,1,bQ7fQq1otn9hKX-gXRsrgA
2,2aFiy99vNLklCx3T_tGS9A,0,2011-04-29,0,wslW2Lu4NYylb1jEapAGsw,5,review,0,r1NUhdNmL6yU9Bn-Yx6FTw
3,2LfIuF3_sX6uwe-IR-P0jQ,1,2014-07-14,0,GP6YEearUWrzPtQYSF1vVg,5,review,0,aW3ix1KNZAvoM8q-WghA3Q
4,2LfIuF3_sX6uwe-IR-P0jQ,0,2014-01-15,0,25RlYGq2s5qShi-pn3ufVA,4,review,0,YOo-Cip8HqvKp_p9nEGphw


In [18]:
df_reviews.count()

4153150

### Keeping reviews for the chosen city only

- You are only interested in reviews of businesses you kept after filtering for category and city - how to filter out everything else? (hint: take a look at the ***join*** operation of DataFrames)

In [19]:
df_city_reviews = df_reviews.join(df_city_restaurants, on="business_id").drop(df_city_restaurants["stars"])

In [20]:
df_city_reviews.count()

23779

### Generating numeric IDs

- As it happened with the ***business_id***, you also need to convert ***user_id*** into a numeric value - once again, use a ***StringIndexer*** to create a new column named ***user_idn*** containing the result of the conversion

In [21]:
t2 = StringIndexer(inputCol="user_id", outputCol="user_idn").fit(df_city_reviews)
df_city_reviews = t2.transform(df_city_reviews)

In [22]:
from pyspark.sql.functions import to_timestamp
df_city_reviews = df_city_reviews.withColumn('date', to_timestamp(df_city_reviews['date'], 'yyyy-mm-dd'))

In [23]:
df_city_reviews.cache()

DataFrame[business_id: string, cool: bigint, date: timestamp, funny: bigint, review_id: string, stars: bigint, type: string, useful: bigint, user_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, state: string, type: string, business_idn: double, user_idn: double]

In [24]:
df_city_reviews.limit(5).toPandas()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,type,useful,user_id,address,...,latitude,longitude,name,neighborhood,postal_code,review_count,state,type.1,business_idn,user_idn
0,-3pfhzz9CB7F2DpbF1Ko7Q,0,2008-01-06 00:07:00,0,H7eJZ9azd1eH5minOhc-uw,5,review,0,VRVCKQhYDCkzaEDce8GEtQ,31-35 Grassmarket,...,55.946932,-3.196491,Metro Bar & Brasserie,Old Town,EH1 2HS,3,EDH,business,1208.0,63.0
1,-3pfhzz9CB7F2DpbF1Ko7Q,0,2010-01-15 00:04:00,0,fTLZIeehPoPe_vv8NVS51g,1,review,0,SxV1Jq7UANuSYpn42JXvOA,31-35 Grassmarket,...,55.946932,-3.196491,Metro Bar & Brasserie,Old Town,EH1 2HS,3,EDH,business,1208.0,6.0
2,-3pfhzz9CB7F2DpbF1Ko7Q,2,2015-01-10 00:03:00,2,k5aSkCKZ7jhZcD3a5cy_Ag,3,review,2,soDF6mePh1SuNZI3rN7HPQ,31-35 Grassmarket,...,55.946932,-3.196491,Metro Bar & Brasserie,Old Town,EH1 2HS,3,EDH,business,1208.0,433.0
3,2PqCZxon6AZHJrQ5iam4LA,0,2012-01-04 00:02:00,0,qjn_nLwosOpcjxFsQ2Tmsw,2,review,1,LURC3E0DoXYgN9aYTF3XOg,10 Clifton Terrace,...,55.946036,-3.218771,Clifton Fish Bar,Haymarket,EH12 5DR,3,EDH,business,684.0,2.0
4,2PqCZxon6AZHJrQ5iam4LA,0,2011-01-24 00:02:00,0,6aldHbF3FeWKglU90NiVCg,1,review,1,W1Nl6_R7amuZ6NStXI3uBA,10 Clifton Terrace,...,55.946036,-3.218771,Clifton Fish Bar,Haymarket,EH12 5DR,3,EDH,business,684.0,91.0


### Adding a sequential number to the user's reviews

- Now add a ***sequential number*** to the user's reviews, that is, for each user, order his/her reviews by date (multiple reviews on the same date can be randomly ordered) and number them (hint: check ***window functions***)
- This sequential number will be useful later to perform a time-wise split of the dataset

In [25]:
from pyspark.sql import Window, functions as F

w1 = Window.partitionBy('user_id').orderBy('date')
df_city_reviews = df_city_reviews.withColumn('seq', F.row_number().over(w1))

In [26]:
df_city_reviews.select('user_id','date','seq').orderBy('user_id','seq').limit(40).toPandas()

Unnamed: 0,user_id,date,seq
0,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,1
1,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,2
2,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,3
3,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,4
4,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,5
5,-0WUJsVpizkaAYQp05giUA,2012-01-29 00:09:00,1
6,-0WUJsVpizkaAYQp05giUA,2012-01-29 00:09:00,2
7,-0WUJsVpizkaAYQp05giUA,2012-01-29 00:09:00,3
8,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-07 00:01:00,1
9,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-10 00:01:00,2


### Subsetting reviews to keep only users with more than 4 reviews

- Some users had rated only 1 or a few businesses - this would pose as a problem to make recommendations - so you would want to keep only users who had rated more than 4 reviews, for instance
- Find the ***total number of reviews*** for each user and then filter them using this information (hint: again, you can use a ***window function***)

In [27]:
w2 = Window.partitionBy('user_id')
df_selected = df_city_reviews\
.withColumn('numOfReviews', F.count("seq").over(w2))\

In [28]:
df_selected.select('user_id','date','seq', 'numOfReviews').orderBy('user_id', 'seq').limit(40).toPandas()

Unnamed: 0,user_id,date,seq,numOfReviews
0,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,1,5
1,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,2,5
2,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,3,5
3,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,4,5
4,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,5,5
5,-0WUJsVpizkaAYQp05giUA,2012-01-29 00:09:00,1,3
6,-0WUJsVpizkaAYQp05giUA,2012-01-29 00:09:00,2,3
7,-0WUJsVpizkaAYQp05giUA,2012-01-29 00:09:00,3,3
8,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-07 00:01:00,1,5
9,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-10 00:01:00,2,5


In [29]:
df_selected = df_selected.filter("numOfReviews > 4")

In [30]:
df_selected.filter("user_id = '-2EcIDIDnA8H7N81jwYpcQ'").toPandas()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,type,useful,user_id,address,...,name,neighborhood,postal_code,review_count,state,type.1,business_idn,user_idn,seq,numOfReviews
0,S5gtq6g6dIfyaFvyPYyzkg,0,2014-01-17 00:06:00,0,ygrE2MBNRlRGC1ISEB7jjg,3,review,0,-2EcIDIDnA8H7N81jwYpcQ,14 George Street,...,The Dome,New Town,EH2 2PF,101,EDH,business,551.0,504.0,1,7
1,P96Nq4JcXYrsw1gNK-knDg,0,2015-01-07 00:03:00,0,iUKqyPB969OUoSvPb3m8QQ,3,review,0,-2EcIDIDnA8H7N81jwYpcQ,91/93 Shadwick Place,...,Burger,West End,EH2 4SD,16,EDH,business,1331.0,504.0,2,7
2,Mv3YLMYrLpSctyamrQCdQw,1,2015-01-09 00:03:00,1,FQ_GCvJa-0kDGu2g6cYr2g,4,review,1,-2EcIDIDnA8H7N81jwYpcQ,"77B George Street, New Town",...,Café Andaluz,New Town,EH2 3EE,77,EDH,business,709.0,504.0,3,7
3,Nvi9RLcOdrJvSwcw-ziZQQ,0,2015-01-11 00:08:00,0,rdqQ0rg8aSnoQDYE9k4i-g,1,review,1,-2EcIDIDnA8H7N81jwYpcQ,18 Morrison St,...,Lebowskis,West End,EH3 8BJ,53,EDH,business,524.0,504.0,4,7
4,Nu1yhfnJxvdHW70r8FNvrg,0,2015-01-20 00:08:00,0,OTgYDyWaDDOzYokT0DxRkA,4,review,1,-2EcIDIDnA8H7N81jwYpcQ,13 A Brougham Street,...,El Quijote,Tollcross,EH3 9,19,EDH,business,18.0,504.0,5,7
5,EiYE3T3hfqcpcficZ4zJzQ,0,2015-01-28 00:03:00,0,9S7Ya1RQ3xEYg5N_AHlsxg,4,review,0,-2EcIDIDnA8H7N81jwYpcQ,"159-161 Bruntsfield Pl, Bruntsfield",...,Montpeliers,Bruntsfield,EH10 4DG,52,EDH,business,757.0,504.0,6,7
6,1yQUqh3_h1IOrXZmb4CBFw,1,2015-01-30 00:08:00,0,0EmiZlrU-P975xHjwzblrg,2,review,1,-2EcIDIDnA8H7N81jwYpcQ,88 Bruntsfield Place,...,TriBeCa,Bruntsfield,EH10 4HG,15,EDH,business,1242.0,504.0,7,7


In [31]:
df_selected.select('user_id','date','seq', 'numOfReviews').orderBy('user_id', 'seq').limit(40).toPandas()

Unnamed: 0,user_id,date,seq,numOfReviews
0,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,1,5
1,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,2,5
2,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,3,5
3,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,4,5
4,-0MQ4webH2uc1ZAsGsNENg,2013-01-14 00:03:00,5,5
5,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-07 00:01:00,1,5
6,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-10 00:01:00,2,5
7,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-14 00:06:00,3,5
8,-0wDYXGaz2mrHd6fQUvPHQ,2015-01-14 00:06:00,4,5
9,-0wDYXGaz2mrHd6fQUvPHQ,2016-01-15 00:05:00,5,5


In [32]:
df_selected.cache()

DataFrame[business_id: string, cool: bigint, date: timestamp, funny: bigint, review_id: string, stars: bigint, type: string, useful: bigint, user_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, state: string, type: string, business_idn: double, user_idn: double, seq: int, numOfReviews: bigint]

### Calculating mean rating by user

- Now you can calculate the mean rating by user and make it into a dictionary where the key is the ***user_id*** (hint: look at ***rdd*** method of DataFrames and ***collectAsMap*** method of RDDs)

In [33]:
dict_user_means = ...

### Centering rating by user

- The dictionary containing mean ratings by user can be seen as a ***lookup table*** - what is the appropriate way of dealing with those in Spark?
- Once you have figured this out, define a regular Python function that takes two arguments - ***user_id*** (String) and ***rating*** (String, which you will need to convert to float inside the function) - and returns the result of subtracting the mean rating of the user from the rating parameter
- Using the Python function, define a Spark's User Defined Function (UDF) with a DoubleType return
- Using the UDF, create a column in your DataFrame with the centered ratings

In [34]:
from pyspark.sql.types import DoubleType

lookup_user_means = ...

def zero_mean(user_id, rating):
    pass

df_centered = df_selected.withColumn('ratings_centered', df_selected.stars - F.avg('stars').over(w2))

In [35]:
df_centered.select('user_id','stars','date','seq', 'numOfReviews', 'ratings_centered').orderBy('user_id', 'seq').limit(20).toPandas()

Unnamed: 0,user_id,stars,date,seq,numOfReviews,ratings_centered
0,-0MQ4webH2uc1ZAsGsNENg,3,2013-01-14 00:03:00,1,5,-1.2
1,-0MQ4webH2uc1ZAsGsNENg,4,2013-01-14 00:03:00,2,5,-0.2
2,-0MQ4webH2uc1ZAsGsNENg,4,2013-01-14 00:03:00,3,5,-0.2
3,-0MQ4webH2uc1ZAsGsNENg,5,2013-01-14 00:03:00,4,5,0.8
4,-0MQ4webH2uc1ZAsGsNENg,5,2013-01-14 00:03:00,5,5,0.8
5,-0wDYXGaz2mrHd6fQUvPHQ,5,2015-01-07 00:01:00,1,5,1.2
6,-0wDYXGaz2mrHd6fQUvPHQ,5,2015-01-10 00:01:00,2,5,1.2
7,-0wDYXGaz2mrHd6fQUvPHQ,3,2015-01-14 00:06:00,3,5,-0.8
8,-0wDYXGaz2mrHd6fQUvPHQ,1,2015-01-14 00:06:00,4,5,-2.8
9,-0wDYXGaz2mrHd6fQUvPHQ,5,2016-01-15 00:05:00,5,5,1.2


In [36]:
df_centered.limit(5).toPandas()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,type,useful,user_id,address,...,neighborhood,postal_code,review_count,state,type.1,business_idn,user_idn,seq,numOfReviews,ratings_centered
0,hmAnDlbxTbSPv8MIw1yvkA,0,2007-01-29 00:08:00,0,UB38v64seaWXtGwYkAHQkg,5,review,0,6FFFFNZLeeHERUs1MatgBg,1a Dock Place,...,Leith,EH6 6LU,9,EDH,business,142.0,478.0,1,8,0.875
1,lWRbdK9YABr0auLG19FHhQ,0,2007-01-29 00:08:00,0,Ol9g6Xl0EjF48kQFJsWE_A,2,review,0,6FFFFNZLeeHERUs1MatgBg,"National Museums of Scotland, Chambers Street",...,Old Town,EH1 1JF,46,EDH,business,213.0,478.0,2,8,-2.125
2,nJdNzQ5Z278gTCFJ4HQBLA,0,2007-01-29 00:08:00,0,pzmJORq1c2GNmMYoM_t95Q,4,review,0,6FFFFNZLeeHERUs1MatgBg,15/16 George IV Bridge,...,Old Town,EH1 1EE,161,EDH,business,760.0,478.0,3,8,-0.125
3,GS06uO84yKQpa7oezgVZgQ,0,2007-01-29 00:08:00,0,5rtyHHV_8a242O4mP2K-_w,4,review,0,6FFFFNZLeeHERUs1MatgBg,34 Thistle Street Lane NW,...,New Town,EH2 1EA,27,EDH,business,134.0,478.0,4,8,-0.125
4,Q0fcX_1wvdmffqEPa246rg,0,2007-01-29 00:08:00,0,boGdTjjRLnsugeIAzjsUeg,5,review,0,6FFFFNZLeeHERUs1MatgBg,352 Castlehill,...,Royal Mile,EH1 2NF,156,EDH,business,491.0,478.0,5,8,0.875


- Once again, the UDF approach is not the most "Sparkonic" way of handling this - can you perform the same operation using only functions from ***pyspark.sql.functions*** (which was imported earlier as F)?
    - hint: you'll need ***Window functions***

In [37]:

df_centered.cache()

DataFrame[business_id: string, cool: bigint, date: timestamp, funny: bigint, review_id: string, stars: bigint, type: string, useful: bigint, user_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, state: string, type: string, business_idn: double, user_idn: double, seq: int, numOfReviews: bigint, ratings_centered: double]

## Dataset

### Splitting into training and test sets by time

- In recommender systems, it is common practice to do the training/test split timewise, that is, the test set is composed of the latest reviews
- First, filter only those reviews which have a sequential number smaller than the ***total number of reviews***, by user: this is your training set
- Then, filter only those reviews which have a sequential number identical to the ***total number of reviews***, by user: this is your test set
- Now you can see why you had to add a sequential number to the user's reiews - since some users had done all his/her reviews on the same day, you need to disambiguate them to split the dataset. By doing this, you guarantee your test set will have only 1 review for each user.

In [38]:
df_training = df_centered[df_centered.seq < df_centered.numOfReviews]
df_test = df_centered[df_centered.seq == df_centered.numOfReviews]

In [39]:
df_training.cache()

DataFrame[business_id: string, cool: bigint, date: timestamp, funny: bigint, review_id: string, stars: bigint, type: string, useful: bigint, user_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, state: string, type: string, business_idn: double, user_idn: double, seq: int, numOfReviews: bigint, ratings_centered: double]

DataFrame[business_id: string, cool: bigint, date: timestamp, funny: bigint, review_id: string, stars: bigint, type: string, useful: bigint, user_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, state: string, type: string, business_idn: double, user_idn: double, seq: int, numOfReviews: bigint, ratings_centered: double]

### If using Spark 2.1 (as in the Docker image), you need to filter out "new" businesses in the test set

In [52]:
businesses = df_training.select('business_id').distinct()
df_test = df_test.join(businesses, on='business_id')

In [53]:
df_test.cache()

DataFrame[business_id: string, cool: bigint, date: timestamp, funny: bigint, review_id: string, stars: bigint, type: string, useful: bigint, user_id: string, address: string, attributes: array<string>, categories: array<string>, city: string, hours: array<string>, is_open: bigint, latitude: double, longitude: double, name: string, neighborhood: string, postal_code: string, review_count: bigint, state: string, type: string, business_idn: double, user_idn: double, seq: int, numOfReviews: bigint, ratings_centered: double]

## Alternate Least Squares (ALS) Model

- This is the recommender itself - the ALS uses a iterative approach to find the underlying factors that yield the user/item rating matrix
- It takes as input a DataFrame with three columns, representing:
    - userCol: user IDs (numeric - remember the conversion you did)
    - itemCol: item IDs (numeric - remember the conversion you did)
    - ratingCol: rating (numeric, obviously)
    - coldStartStrategy: "drop" (if there is unseen data on the test set, meaning a new user/business, drop it) - ***only available from Spark 2.2 on***
- Its parameters are:
    - rank: the number of factors to consider
    - maxIter: the maximum number of iterations to perform
    - regParam: the regularization parameter
- Use Spark's ALS to fit a model based on your DataFrame

In [58]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.ml import Pipeline

model = ALS()\
            .setUserCol('user_idn')\
            .setItemCol('business_idn')\
            .setRatingCol('ratings_centered')\
            .setColdStartStrategy('drop')\
            .setRank(10)\
            .setMaxIter(10)\
            .setRegParam(0.1)
            
model = model.fit(df_training)

### Predictions for the training set

- Once the model is trained, make predictions for the training set and use a ***RegressionEvaluator*** to find out the RMSE of the predictions

In [59]:
predictions = model.transform(df_training)

In [55]:
# predictions.select('user_id','seq','stars','ratings_centered','numOfReviews', 'prediction').orderBy('user_id', 'seq').limit(20).toPandas()

In [60]:
evaluator = RegressionEvaluator().setLabelCol("ratings_centered") \
                            .setPredictionCol("prediction") \
                            .setMetricName("rmse")

train_rmse = evaluator.evaluate(predictions)

print(train_rmse)

0.38297118150098625


### Predictions for the test set

- Now, make predictions for the test set and use a ***RegressionEvaluator*** to find out the RMSE of the predictions

In [62]:
test_predictions = model.transform(df_test)
test_rmse = evaluator.evaluate(test_predictions)
print(test_rmse)

0.981803303289935


## Recommendations

Now, your model is trained, but how can you use it to make recommendations for a given user?

### Organizing business data

- It would not make sense to recommend a place the user has already rated, right? So, generate a dictionary where ***user_idn*** is the key and a list of the already rated ***business_idn*** is the value (hint: when aggregating DataFrames, ***collect_list*** is a VERY useful function to turn multiple records into a list)

In [None]:
from pyspark.sql.functions import collect_list

dict_visited_by_user = ...

- Besides, recommending a given business_id also does not help much, right? So you need to organize the business data in a way it can be shown to the user.
    - Define a regular Python function that takes one argument ***row*** (Row type) and returns a dictionary where ***business_idn*** is the key and the value is yet another dictionary with relevant fields (for instance: name, address, stars, categories)
    - Transform your business DataFrame into an RDD and apply the function you defined - upon collecting, you will end up with a list of dictionaries
    - Transform this list of dictionaries into a single dictionary

In [None]:
def rest_to_json(row):
    pass

rest = ...

dict_rest = {k: v for d in rest for k, v in d.items()}

### Making recommendations for a user

- To actually make the recommendations, we need to build an input DataFrame to feed the model
    - A DataFrame can be created using the SQL Context and a list of Rows, each containg two columns: user_idn and business_idn - the rating will be computed by the model
    - But you only need to have rows for the businesses which were not yet rated by the user - from all businesses, exclude the ones already rated by him/her

In [None]:
from pyspark.sql import Row
from pyspark.sql.functions import desc

user_idn = 317
n_business = len(dict_rest)

visited = ...
not_visited = ...

df_test_user = ...

- Now, you can use the generated DataFrame to make predictions
    - If there are any NA predictions, make sure to turn them into a really bad value (for instance, -5.0) (hint: remember ***na*** method of DataFrames)
- Order the predictions and take the ***business_idn*** of the top 5
- Finally, use this information to fetch the business data from the dictionary you assembled a couple of steps ago

In [None]:
predictions = ...

top_predictions = ...

response = list(map(lambda idn: dict_rest[idn], top_predictions))

In [None]:
response

## Congratulations, you finished the exercise!

In [None]:
sc.stop()