# Spark DataFrames continued

Let's read in two of the three data files from the Yelp academic dataset (https://www.kaggle.com/yelp-dataset/yelp-dataset) and examine the schema of each one. We're skipping the review file (review.json) for this class:

In [1]:
business = spark.read.json('s3://umsi-data-science/data/yelp/business.json')

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1555002898256_0002,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
business.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: boolean (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: struct (nullable = true)
 |    |    |-- casual: boolean (nullable = true)
 |    |    |-- classy: boolean (nullable = true)
 |    |    |-- divey: boolean (nullable = true)
 |    |    |-- hipster: boolean (nullable = true)
 |    |    |-- intimate: boolean (nullable = true)
 |    |    |-- romantic: boolean (nullable = true)
 |    |    |-- touristy: boolean (nullable = true)
 |    |    |-- trendy: boolean (nullable = true)
 |    |    |-- upscale: boolean (nullable = true)
 |    |-- BYOB: boolean (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: struct (nullable = true)
 |    |    |-- friday: boolean (nullable = true)
 |    |    |-- monday: boolean (nullable = true)
 |    |    |-- saturday: boolean (nullab

In [8]:
# review = spark.read.json('s3://umsi-data-science/data/yelp/review.json.gz')

In [42]:
# review.printSchema()

An error was encountered:
Session 0 unexpectedly reached final status 'dead'. See logs:
stdout: 
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 12301"...

stderr: 
19/04/11 17:56:34 INFO TaskSetManager: Finished task 180.0 in stage 72.0 (TID 3334) in 88 ms on ip-172-31-43-75.ec2.internal (executor 11) (181/200)
19/04/11 17:56:34 INFO TaskSetManager: Starting task 183.0 in stage 72.0 (TID 3337, ip-172-31-43-75.ec2.internal, executor 11, partition 183, NODE_LOCAL, 8095 bytes)
19/04/11 17:56:34 INFO TaskSetManager: Finished task 181.0 in stage 72.0 (TID 3335) in 70 ms on ip-172-31-43-75.ec2.internal (executor 11) (182/200)
19/04/11 17:56:34 INFO TaskSetManager: Starting task 184.0 in stage 72.0 (TID 3338, ip-172-31-43-75.ec2.internal, executor 11, partition 184, NODE_LOCAL, 8095 bytes)
19/04/11 17:56:34 INFO TaskSetManager: Finished task 182.0 in stage 72.0 (TID 3336) in 98 ms on ip-172-31-43-75.ec2.internal (executo

In [3]:
tip = spark.read.json('s3://umsi-data-science/data/yelp/tip.json')

In [11]:
tip.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- likes: long (nullable = true)
 |-- text: string (nullable = true)
 |-- user_id: string (nullable = true)

### Let's try to find the name of the business that has the highest number of "tips":

In [12]:
most_tips = tip.groupBy('business_id').count().sort('count',ascending=False)

In [13]:
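from pyspark.sql.functions import col
# Copy 'count' into 'the_count': Row.count is a built-in tuple method,
# so the count column cannot be read as an attribute after collect().
most_tips = most_tips.withColumn('the_count', col('count'))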

In [14]:
most_tips.show()

+--------------------+-----+---------+
|         business_id|count|the_count|
+--------------------+-----+---------+
|FaHADZARwnY4yvlvp...| 3517|     3517|
|JmI9nslLD7KZqRr__...| 2382|     2382|
|DkYS3arLOhA8si5uU...| 1474|     1474|
|5LNZ67Yw9RD6nf4_U...| 1436|     1436|
|K7lWdNUhCbcnEvI0N...| 1346|     1346|
|hihud--QRriCYZw1z...| 1287|     1287|
|RESDUcs7fIiihp38-...| 1149|     1149|
|yfxDa8RFOvJPQh0rN...| 1062|     1062|
|4JNXUYY8wbaaDmk3B...| 1038|     1038|
|iCQpiavjjPzJ5_3gP...| 1033|     1033|
|SMPbvZLSMMb7KU76Y...|  996|      996|
|7sPNbCx7vGAaH7SbN...|  981|      981|
|UPIYuRaZvknINOd1w...|  959|      959|
|eoHdUeQDNgQ6WYEnP...|  940|      940|
|yQab5dxZzgBLTEHCw...|  900|      900|
|JyxHvtj-syke7m9rb...|  888|      888|
|LNGBEEelQx4zbfWnl...|  854|      854|
|WUq8HJHIZU4uteB15...|  831|      831|
|f4x1YBxkLrZg652xt...|  800|      800|
|El4FC8jcawUVgw_0E...|  759|      759|
+--------------------+-----+---------+
only showing top 20 rows
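
For intuition, the `groupBy('business_id').count().sort(...)` pipeline has a close plain-Python analogue in `collections.Counter`. The sketch below is only an illustration on a handful of hypothetical tip records (the ids `'a'`, `'b'`, `'c'` are made up); it assumes the data fits in memory on one machine, which is exactly the assumption Spark removes.

```python
from collections import Counter

# Hypothetical tip records; only business_id matters for the count.
tips = [
    {'business_id': 'a', 'likes': 0},
    {'business_id': 'b', 'likes': 2},
    {'business_id': 'a', 'likes': 1},
    {'business_id': 'c', 'likes': 0},
    {'business_id': 'a', 'likes': 0},
    {'business_id': 'b', 'likes': 0},
]

# Analogue of groupBy('business_id').count()
counts = Counter(t['business_id'] for t in tips)

# Analogue of sort('count', ascending=False)
ranked = counts.most_common()
print(ranked)  # [('a', 3), ('b', 2), ('c', 1)]
```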

In [15]:
joined = most_tips.join(business,'business_id','left').sort('the_count',ascending=False)

In [16]:
most_tips_joined = joined.select("name","the_count").filter(joined['the_count'] > 1000).collect()

In [17]:
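for b in most_tips_joined:
    # b.count resolves to the Row's built-in count() method, not a column;
    # b.the_count holds the actual number of tips.
    print(b.name,b.count,b.the_count)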

(u'McCarran International Airport', <built-in method count of Row object at 0x7f44944230a8>, 3517)
(u'Phoenix Sky Harbor International Airport', <built-in method count of Row object at 0x7f4494423100>, 2382)
(u'Earl of Sandwich', <built-in method count of Row object at 0x7f4494423158>, 1474)
(u'The Cosmopolitan of Las Vegas', <built-in method count of Row object at 0x7f44944231b0>, 1436)
(u'Wicked Spoon', <built-in method count of Row object at 0x7f4494423208>, 1346)
(u'Gangnam Asian BBQ Dining', <built-in method count of Row object at 0x7f4494423260>, 1287)
(u'Bacchanal Buffet', <built-in method count of Row object at 0x7f44944232b8>, 1149)
(u'Pho Kim Long', <built-in method count of Row object at 0x7f4494423310>, 1062)
(u'Mon Ami Gabi', <built-in method count of Row object at 0x7f4494423368>, 1038)
(u'Secret Pizza', <built-in method count of Row object at 0x7f44944233c0>, 1033)
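
The `<built-in method count of Row object ...>` entries appear because a Spark `Row` is a tuple, and tuples already have a `count()` method that takes precedence over a field of the same name. The class below is a hypothetical, simplified stand-in (`DemoRow` is not part of PySpark) that reproduces the lookup order in plain Python. It also shows bracket access by field name, which a real `Row` supports as well, as an alternative to renaming the column.

```python
class DemoRow(tuple):
    """Hypothetical, simplified stand-in for a Spark Row (a tuple with field names)."""

    def __new__(cls, **fields):
        row = super().__new__(cls, tuple(fields.values()))
        row._names = list(fields)
        return row

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so tuple methods such as
        # count() and index() always win over same-named fields.
        if name.startswith('_') or name not in self._names:
            raise AttributeError(name)
        return tuple.__getitem__(self, self._names.index(name))

    def __getitem__(self, key):
        # Bracket access by field name bypasses attribute lookup entirely.
        if isinstance(key, str):
            key = self._names.index(key)
        return tuple.__getitem__(self, key)


row = DemoRow(name='McCarran International Airport', count=3517, the_count=3517)
shadowed = row.count      # the bound tuple method, not the field value
by_alias = row.the_count  # the field value, via the renamed column
by_key = row['count']     # the field value, via bracket access
print(callable(shadowed), by_alias, by_key)  # True 3517 3517
```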


## Your turn
Use a combination of Spark and plain old Python code to answer the following questions. Include code and a written response in English for each question.

### Q1. How many businesses in the data set are located in the state of Ohio (OH)?

In [4]:
q1 = business.select('state').filter(business.state == 'OH').count()

In [5]:
q1

12609

### Q2. How many Pennsylvania-based businesses have a hipster ambience?

In [31]:
q2 = business.filter(business.state == 'PA').filter(business.attributes.Ambience.hipster == True).count()


In [32]:
q2

71

### Q3. Which Nevada-based business has the most liked tip, and what is the text of the tip?

In [38]:
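# Join Nevada businesses to their tips (the original referenced an undefined
# DataFrame, nev) and fetch only the single most-liked row instead of collecting all.
q3 = business.filter(business.state == 'NV').join(tip, 'business_id', 'inner').sort(tip.likes, ascending=False).first()
print(q3.name, q3.text, q3.likes)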

An error was encountered:
Invalid status code '400' from https://172.31.33.24:18888/sessions/0/statements/38 with error payload: "requirement failed: Session isn't active."


### Q4. Excluding businesses in the state of Nevada, list 10 businesses with the highest number of tips

In [6]:
q4 = tip.join(business, 'business_id', 'left').filter(business.state != 'NV')\
     .groupby('business_id').count().sort('count', ascending=False)

In [None]:
# show() prints 20 rows by default; the question asks for 10.
q4.join(business, 'business_id').select('name','count').sort('count', ascending=False).show(10)

### Q5. List the names of the divey businesses from Ohio that have an overall rating of 4 or more stars and have at least 1000 tips.
You might want to do this in several steps.

In [None]:
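from pyspark.sql.functions import col

# "At least 1000 tips" means count >= 1000, and the comparison must be built
# from a Column (col('count')), not from the bare string 'count'.
q5 = tip.join(business, 'business_id', 'inner')\
        .filter(business.attributes.Ambience.divey == True)\
        .filter(business.state == 'OH')\
        .filter(business.stars >= 4.0)\
        .groupby('business_id').count()\
        .filter(col('count') >= 1000)\
        .join(business, 'business_id').select('name')
q5.show()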
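
One thing to watch when filtering on an aggregated column: a comparison such as `'count' > 1000` is evaluated by Python itself before Spark ever sees it, so `filter()` never receives a column expression. The snippet below is a small plain-Python illustration, assuming Python 3, where comparing a string with an integer raises a `TypeError`. Under Python 2 the same comparison silently yields a plain boolean, which is still not a valid filter condition.

```python
condition = None
message = ''
try:
    # Python compares the literal string 'count' with the integer 1000;
    # no DataFrame column is involved at any point.
    condition = 'count' > 1000
except TypeError as err:
    message = str(err)

print(message)
```

Wrapping the name in `col('count')`, or passing the whole condition as a SQL string, hands Spark an expression it can evaluate per row.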