## Preprocessing for ALS Modelling

To train an ALS model with the Spark MLlib library, we need to join user.json, review.json and business.json.

In [1]:
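from pyspark.context import SparkContext
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.session import SparkSession

# Directory that holds the Yelp dataset JSON files
inputPath = "dataset/"

# Start a local Spark context and wrap it in a session
sc = SparkContext('local')
spark = SparkSession(sc)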

In [2]:
userDf = spark.read.json(inputPath+"user.json")

In [3]:
reviewDf = spark.read.json(inputPath+"review.json")

In [4]:
filteredBusinessDf = spark.read.json('business_restaurant.json')

The Spark MLlib library for collaborative filtering requires unique business and user ids in numeric format, so we create them before training our model.

In [8]:
userDf = userDf.withColumn("user_unique_user_id",monotonically_increasing_id())
filteredBusinessDf = filteredBusinessDf.withColumn("unique_business_id",monotonically_increasing_id())
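`monotonically_increasing_id()` guarantees unique, increasing ids, but not consecutive ones. According to the Spark documentation, the current implementation puts the partition id in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below mimics that layout to show why ids from any partition other than the first fall outside the 32-bit integer range that Spark's ALS expects, which is presumably why the join later keeps only ids below 2000000000. The helper name `sketch_monotonic_id` is hypothetical and is not a Spark API.

```python
# Largest value that fits in a signed 32-bit integer
INT32_MAX = 2**31 - 1


def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented id layout: partition id in the upper bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# A row in the first partition keeps a small id.
first = sketch_monotonic_id(0, 5)
# The first row of the second partition already jumps past INT32_MAX.
second = sketch_monotonic_id(1, 0)

print(first, first <= INT32_MAX)
print(second, second <= INT32_MAX)
```

If the input is read into more than one partition, filtering on the id value discards every row outside the first partition, so it is worth checking how many rows survive the filter.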

In [9]:
filteredBusinessDf.createOrReplaceTempView("business")
userDf.createOrReplaceTempView("user")
reviewDf.createOrReplaceTempView("review")

In [10]:
display(userDf)

DataFrame[average_stars: double, compliment_cool: bigint, compliment_cute: bigint, compliment_funny: bigint, compliment_hot: bigint, compliment_list: bigint, compliment_more: bigint, compliment_note: bigint, compliment_photos: bigint, compliment_plain: bigint, compliment_profile: bigint, compliment_writer: bigint, cool: bigint, elite: array<bigint>, fans: bigint, friends: array<string>, funny: bigint, name: string, review_count: bigint, useful: bigint, user_id: string, yelping_since: string, user_unique_user_id: bigint]

In [12]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

We join the review, business and user views into a combined view of user ratings and restaurants, keeping the state of each restaurant and only the rows whose numeric ids are below 2000000000.

In [24]:
df = sqlContext.sql('''
    select user_unique_user_id, unique_business_id, review.stars, state
    from review
    inner join business on business.business_id = review.business_id
    inner join user on user.user_id = review.user_id
    where user_unique_user_id < 2000000000
      and unique_business_id < 2000000000
''')

Save the filtered business reviews given by users to one file per state. The Yelp-MLib-ALS file uses these files later to create state-wise ALS models.

In [45]:
import pandas as pd

NC_rest_df = df[df['state']=="NC"]
d = NC_rest_df.toPandas()

In [47]:
d.to_json('NC_business.json', orient='records', lines=True)
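With `orient='records', lines=True`, pandas writes one JSON object per line (the JSON Lines layout), which is also the layout `spark.read.json` expects by default. The sketch below reproduces that layout with the standard library only, to show what a downstream reader of these files sees. The sample rows, the output file name and the two helper functions are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample rows with the same columns as the joined DataFrame
rows = [
    {"user_unique_user_id": 0, "unique_business_id": 3, "stars": 5, "state": "NC"},
    {"user_unique_user_id": 1, "unique_business_id": 7, "stars": 2, "state": "NC"},
]


def write_json_lines(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_json_lines(path):
    """Parse each non-empty line as its own JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip the sample rows through a temporary file
path = os.path.join(tempfile.mkdtemp(), "NC_business_sample.json")
write_json_lines(path, rows)
loaded = read_json_lines(path)
print(loaded)
```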

In [48]:
OH_rest_df = df[df['state']=="OH"]
oh = OH_rest_df.toPandas()
oh.to_json('OH_business.json', orient='records', lines=True)

In [49]:
ON_rest_df = df[df['state']=="ON"]
on = ON_rest_df.toPandas()
on.to_json('ON_business.json', orient='records', lines=True)

In [50]:
NV_rest_df = df[df['state']=="NV"]
nv = NV_rest_df.toPandas()
nv.to_json('NV_business.json', orient='records', lines=True)

In [51]:
AZ_rest_df = df[df['state']=="AZ"]
az = AZ_rest_df.toPandas()
az.to_json('AZ_business.json', orient='records', lines=True)
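The five cells above repeat the same filter, convert and write steps for each state. As a sketch only, the repetition could be folded into one helper. The function name `export_states` and the constant `TOP_STATES` are hypothetical; the calls inside the loop are the same ones used in the cells above.

```python
# States exported in this notebook
TOP_STATES = ["NC", "OH", "ON", "NV", "AZ"]


def export_states(df, states, suffix):
    """Write one JSON Lines file per state and return the file names.

    `df` is expected to be a Spark DataFrame with a `state` column.
    """
    written = []
    for state in states:
        # Keep only the rows for this state and collect them into pandas
        state_pdf = df[df["state"] == state].toPandas()
        path = f"{state}_{suffix}.json"
        state_pdf.to_json(path, orient="records", lines=True)
        written.append(path)
    return written


# Usage with the joined DataFrame from the cells above:
# export_states(df, TOP_STATES, "business")
```

Note that `toPandas()` collects every row for a state onto the driver, so this approach assumes each state's data fits in driver memory, as the original cells also do.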

Store the business details for the top states so that the Flask application can look up details of the recommended restaurants.

In [25]:
df = sqlContext.sql('''
    select user_unique_user_id, unique_business_id, review.stars, state,
           business.business_id, business.name, business.address
    from review
    inner join business on business.business_id = review.business_id
    inner join user on user.user_id = review.user_id
    where user_unique_user_id < 2000000000
      and unique_business_id < 2000000000
''')
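The two queries in this notebook share the same joins and id filter and differ only in the selected columns. One possible way to keep them in sync is to build the SQL text from a column list, as sketched below. The names `build_review_query` and `ID_LIMIT` are hypothetical; the table names, join conditions and limit are taken from the queries above.

```python
# Upper bound applied to both numeric ids in the notebook's queries
ID_LIMIT = 2000000000


def build_review_query(columns, id_limit=ID_LIMIT):
    """Return the review/business/user join with the given select list."""
    select_list = ", ".join(columns)
    return (
        f"select {select_list} "
        "from review "
        "inner join business on business.business_id = review.business_id "
        "inner join user on user.user_id = review.user_id "
        f"where user_unique_user_id < {id_limit} "
        f"and unique_business_id < {id_limit}"
    )


# Columns for the ALS training files
rating_columns = ["user_unique_user_id", "unique_business_id", "review.stars", "state"]
# The details files add the business fields used for lookups
detail_columns = rating_columns + ["business.business_id", "business.name", "business.address"]

ratings_sql = build_review_query(rating_columns)
details_sql = build_review_query(detail_columns)
print(details_sql)
```

Either string can then be passed to `sqlContext.sql(...)` in place of the literal query.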

In [26]:
display(df)

DataFrame[user_unique_user_id: bigint, unique_business_id: bigint, stars: bigint, state: string, business_id: string, name: string, address: string]

In [27]:
import pandas as pd

NC_rest_df = df[df['state']=="NC"]
d = NC_rest_df.toPandas()
d.to_json('NC_business_details.json', orient='records', lines=True)

In [28]:
OH_rest_df = df[df['state']=="OH"]
oh = OH_rest_df.toPandas()
oh.to_json('OH_business_details.json', orient='records', lines=True)

In [29]:
ON_rest_df = df[df['state']=="ON"]
on = ON_rest_df.toPandas()
on.to_json('ON_business_details.json', orient='records', lines=True)

In [30]:
NV_rest_df = df[df['state']=="NV"]
nv = NV_rest_df.toPandas()
nv.to_json('NV_business_details.json', orient='records', lines=True)

In [31]:
AZ_rest_df = df[df['state']=="AZ"]
az = AZ_rest_df.toPandas()
az.to_json('AZ_business_details.json', orient='records', lines=True)

### License: 

The text in the document by Shrikant Mudholkar, Varsha Bhanushali and Monas Bhar is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/us/

The code in the document by Shrikant Mudholkar, Varsha Bhanushali and Monas Bhar is licensed under the MIT License https://opensource.org/licenses/MIT