## Yelp Reviews Data Analysis

## 1. Overall Project Objectives

Yelp is a platform where customers write reviews and give star ratings to businesses. Research indicates that a one-star increase in rating leads to a 5 to 9% increase in revenue for independent restaurants. We therefore see great potential in the Yelp dataset as a repository of valuable insights.

The main purpose of our project is to conduct a thorough analysis of restaurants across 7 cuisine types (Korean, Japanese, Chinese, Vietnamese, Thai, French and Italian), figure out what makes a good restaurant and what concerns customers, and then make recommendations for future improvement and profit growth. Specifically, we will mainly analyze customers' reviews to find out why customers love or dislike a restaurant. For example, great reviews may be driven primarily by friendly service, and negative reviews by high prices. We will also compare the 7 cuisine types, identify differences in their reviews, and use those insights to make customized recommendations for each type of restaurant.

## 2. Description of Data

The Yelp dataset was downloaded from the Yelp Reviews website. In total, it contains 5,200,000 user reviews and information on 174,000 businesses. We will focus on two tables, the business table and the review table. The attributes of the business table are as follows:

* business_id: ID of the business 
* name: name of the business
* neighborhood: neighborhood of the business
* address: address of the business
* city: city of the business
* state: state of the business
* postal_code: postal code of the business
* latitude: latitude of the business
* longitude: longitude of the business
* stars: average rating of the business
* review_count: number of reviews received
* is_open: 1 if the business is open, 0 otherwise
* categories: multiple categories of the business

The attributes of the review table are as follows:

* review_id: ID of the review
* user_id: ID of the user
* business_id: ID of the business
* stars: star rating the user gave the business
* date: review date
* text: review from the user
* useful: number of users who voted the review as useful
* funny: number of users who voted the review as funny
* cool: number of users who voted the review as cool


## 3. Direction of Analysis

**Exploratory Data Analysis**

* Count something
* Visualize something

In [1]:
#spark.sparkContext
#spark.stop()



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
        appName("SparkDataAnalysis").\
        master("spark://master:7077").\
        getOrCreate()

spark

25/10/19 20:05:39 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [3]:
#HARD RESET THE KERNEL
#import os
#os._exit(00)


## 4. Coding Stage

In [4]:
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
pd.set_option('display.max_columns', None)   # show all columns
pd.set_option('display.max_colwidth', None)  # don't truncate text
pd.set_option('display.width', 1000)

sns.set(style="whitegrid")

from pyspark.sql import functions as F
from pyspark.sql import types as T


#spark.conf.set("spark.sql.shuffle.partitions", "200")   # adjust for cluster cores
#spark.conf.set("spark.sql.debug.maxToStringFields", "2000")  # for printing big schemas
#spark.conf.set("spark.executor.memory", "4g")
#spark.conf.set("spark.driver.memory", "4g")


### *Read Yelp Dataset*

In [34]:

df_business = spark.read.json("/yelp_review_dataset/yelp_academic_dataset_business.json")
df_user = spark.read.json("/yelp_review_dataset/yelp_academic_dataset_user.json")
df_tip = spark.read.json("/yelp_review_dataset/yelp_academic_dataset_tip.json")
df_review = spark.read.json("/yelp_review_dataset/yelp_academic_dataset_review.json")
df_checkin = spark.read.json("/yelp_review_dataset/yelp_academic_dataset_checkin.json")


                                                                                

### Academic Business

💡 Role:

This file contains information about each business listed on Yelp: restaurants, salons, hotels, gyms, etc.
Every other file connects to it via business_id.

{
  "business_id": "1SWheh84yJXfytovILXOAQ",                                **main_objects**
  "name": "The Range at Lake Norman",
  "address": "10913 Bailey Rd",
  "city": "Cornelius",
  "state": "NC",
  "postal_code": "28031",
  "latitude": 35.4627242,
  "longitude": -80.8526119,
  "stars": 4.0,
  "review_count": 36,
  "is_open": 1,
  "categories": "Active Life, Gun/Rifle Ranges, Guns & Ammo",
  
  "attributes": {                                                           **nested object -> analysis later**
    "BusinessAcceptsCreditCards": "True",
    "WiFi": "free"
  },
  "hours": {                                                                **nested object -> analysis later**
    "Monday": "10:00-18:00",
    "Tuesday": "10:00-18:00"
  }
}

In [40]:

# Top-level (non-nested) columns of the business table
main_objects = ["business_id", "name", "address", "city", "state", "postal_code", "latitude", "longitude", "stars", "review_count", "is_open", "categories"]
sub_objects = ["hours", "attributes"]

df_business.select(main_objects).limit(10).toPandas()



Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,"Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists"
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,"Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services"
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores"
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries"
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"Brewpubs, Breweries, Food"
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,36.269593,-87.058943,2.0,6,1,"Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants"
6,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,"8522 Eager Road, Dierbergs Brentwood Point",Brentwood,MO,63144,38.627695,-90.340465,2.5,13,1,"Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories"
7,qkRM_2X51Yqxk3btlwAQIg,Temple Beth-El,400 Pasadena Ave S,St. Petersburg,FL,33707,27.76659,-82.732983,3.5,5,1,"Synagogues, Religious Organizations"
8,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,MO,63123,38.565165,-90.321087,3.0,19,0,"Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek"
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,36.208102,-86.76817,1.5,10,1,"Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food"


#### Academic Business Schema

In [8]:
print("Total businesses:", df_business.count(), '\n')
df_business.printSchema()



Total businesses: 150346
root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |  

                                                                                

In [None]:
# Basic numeric summary
df_business.select("stars", "review_count", "is_open").describe().show()

In [None]:
# Top cities and states by number of businesses
df_business.groupBy("city").count().orderBy(F.desc("count")).show(10)
df_business.groupBy("state").count().orderBy(F.desc("count")).show(10)

How many unique businesses are present and how many by city / state?

What are the top categories (and top cuisines) by number of businesses?

Distribution of average stars and review_count across businesses.

Which categories have the highest average stars (filtering for min reviews)?

Geographical spread: scatter of businesses (lat/lon) colored by average stars (sampled).

How many businesses are open vs closed (is_open) and by city?
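
The question about which categories have the highest average stars can be tried in plain Python before scaling it up in Spark. The following is a minimal sketch on four rows copied from the business sample output above. The function name `avg_stars_by_category` and the `min_reviews` threshold are hypothetical, and applying the minimum-review filter per business is an assumption.

```python
import re
from collections import defaultdict

# Four rows copied from the business sample shown earlier:
# (stars, review_count, categories)
rows = [
    (4.0, 80, "Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries"),
    (2.0, 6, "Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants"),
    (3.0, 19, "Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek"),
    (1.5, 10, "Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food"),
]

def avg_stars_by_category(rows, min_reviews=10):
    """Average business stars per category, skipping businesses with few reviews.

    Assumption: the minimum-review filter applies per business (review_count).
    """
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of stars, business count]
    for stars, review_count, categories in rows:
        if review_count < min_reviews or not categories:
            continue
        # Same split pattern as the Spark cell: comma followed by optional spaces
        for category in re.split(r",\s*", categories):
            totals[category][0] += stars
            totals[category][1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}

averages = avg_stars_by_category(rows, min_reviews=10)
for category, avg in sorted(averages.items(), key=lambda kv: -kv[1])[:3]:
    print(category, round(avg, 2))
```

With only four rows the ranking is not meaningful; the point is the shape of the computation (filter, split, group, average), which maps onto `filter`, `explode`, `groupBy` and `avg` in Spark.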

In [None]:
# Top categories (explode categories string into array)
df_business = df_business.withColumn("categories_array", F.split(F.col("categories"), ",\\s*"))
df_business.select(F.explode(F.col("categories_array")).alias("category")) \
    .groupBy("category").count().orderBy(F.desc("count")).show(15, truncate=False)
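
To connect the split categories back to the project objective, each business can be tagged with the cuisine types it matches. A minimal sketch, assuming the seven cuisine names from Section 1 appear verbatim as Yelp category labels; `CUISINES` and `cuisines_of` are hypothetical names introduced for illustration.

```python
import re

# The seven cuisine types named in the project objectives.
CUISINES = ["Korean", "Japanese", "Chinese", "Vietnamese", "Thai", "French", "Italian"]

def cuisines_of(categories):
    """Return the target cuisines found in a comma-separated categories string."""
    if not categories:  # guard against missing values
        return []
    labels = set(re.split(r",\s*", categories.strip()))
    return [c for c in CUISINES if c in labels]

# Two category strings copied from the business sample shown earlier.
pub = cuisines_of("Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek")
clinic = cuisines_of("Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, "
                     "Acupuncture, Health & Medical, Nutritionists")
print(pub)
print(clinic)
```

Matching whole labels rather than substrings keeps "Traditional Chinese Medicine" from being counted as a Chinese restaurant.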

### Academic User

💡 Role:

Contains details about users who write reviews or tips.



In [None]:

# Show every user column (same pattern as the review cell)
main_objects = df_user.columns

df_user.select(main_objects).limit(10).toPandas()


#### Academic User Schema

In [41]:
print("Total users:", df_user.count())
df_user.printSchema()



Total users: 1987897
root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)



                                                                                

Distribution of review_count per user (most active users).

Distribution of average_stars across users (do some users always give high/low stars?).

How many elite users and elite years distribution.

Relationship between user influence metrics (fans, useful, funny, cool) and their review_count.

Top users by useful votes or fans.

Network size: distribution of number of friends per user.
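
The `elite` and `friends` columns are typed as strings in the schema above, so the elite-years and network-size questions need those values parsed first. A minimal sketch, assuming each column holds a comma-separated list and that an empty string or the literal `None` marks a user with no entries (both are assumptions about the file, not confirmed by the output above); `split_list` and the sample users are hypothetical.

```python
from collections import Counter

def split_list(value):
    """Split a comma-separated string into items; empty or 'None' gives []."""
    if value is None or value.strip() in ("", "None"):
        return []
    return [item.strip() for item in value.split(",") if item.strip()]

# Hypothetical users: (user_id, elite, friends)
users = [
    ("u1", "2015,2016,2017", "u2, u3"),
    ("u2", "", "u1"),
    ("u3", "2017", "None"),
]

# Number of elite users per year
elite_years = Counter(year for _, elite, _ in users for year in split_list(elite))
# Number of friends per user
friend_counts = {user_id: len(split_list(friends)) for user_id, _, friends in users}

print(elite_years)
print(friend_counts)
```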

In [None]:
# Most active users
df_user.select("user_id","name","review_count","fans") \
    .orderBy(F.desc("review_count")).show(10)

### Academic Tip

💡 Role:

Similar to reviews but shorter, like quick "pro-tips" or mini comments.

In [None]:
df_tip.limit(5).toPandas()

#### Academic Tip Schema

In [None]:
print("Total tips:", df_tip.count())
df_tip.printSchema()

Distribution of compliment_count on tips (how many tips get compliments).

What are the common words in tips (quick top insights)?

Average tip length and relation with compliment_count.

Top businesses by number of tips.

Temporal trend of tips (tips per year).

Are tips more positive or negative? (quick polarity using TextBlob, optional if installed)
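
The common-words question can be prototyped with `Counter`, which the notebook already imports. A minimal sketch, assuming the tip body lives in a `text` column; the sample tips, the tiny stop-word list and the helper `top_words` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical tip texts standing in for the tip body column.
tips = [
    "Great happy hour, get the wings",
    "Parking is tough, go early",
    "Happy hour ends at 6, great wings",
]

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "at", "get", "go", "a", "and"}

def top_words(texts, n=3):
    """Count lowercase word tokens across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

print(top_words(tips, n=4))
```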

In [None]:
# Top tip users
df_tip.groupBy("user_id").agg(F.count("*").alias("tip_count"),
                              F.sum("compliment_count").alias("total_compliments")) \
      .orderBy(F.desc("tip_count")).show(10)

### Academic Review

💡 Role:

Contains all user-written reviews for each business.

{
  "review_id": "xQY8N_XvtGbearJ5X0KlyQ",
  "user_id": "OwjRMXRC0KyPrIlcjYv4-A",
  "business_id": "f9NumwFMBDn751xgFiRbNA",
  "stars": 4,
  "date": "2016-03-09",
  "text": "Great food, friendly staff...",
  "useful": 2,
  "funny": 0,
  "cool": 1
}



In [46]:

main_objects = df_review.columns

df_review.select(main_objects).limit(5).toPandas()


Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,XQfwVwDr-v0ZS3_CbbE5Xw,0,2018-07-07 22:09:11,0,KU_O5udG6zpxOg-VcAEodg,3.0,"If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.",0,mh_-eMZ6K5RLWhZyISBhwA
1,7ATYjTIgM3jUlt4UM3IypQ,1,2012-01-03 15:28:18,0,BiTunyQ73aT9WBnpR9DZGw,5.0,"I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space and amazing bikes, to the welcoming and motivating instructors, every class is a top notch work out.\n\nFor anyone who struggles to fit workouts in, the online scheduling system makes it easy to plan ahead (and there's no need to line up way in advanced like many gyms make you do).\n\nThere is no way I can write this review without giving Russell, the owner of Body Cycle, a shout out. Russell's passion for fitness and cycling is so evident, as is his desire for all of his clients to succeed. He is always dropping in to classes to check in/provide encouragement, and is open to ideas and recommendations from anyone. Russell always wears a smile on his face, even when he's kicking your butt in class!",1,OyoGAe7OKpv6SyGZT5g77Q
2,YjUWPpI6HXG530lwP-fb2A,0,2014-02-05 20:30:30,0,saUsX_uimxRlCVr67Z4Jig,3.0,"Family diner. Had the buffet. Eclectic assortment: a large chicken leg, fried jalapeño, tamale, two rolled grape leaves, fresh melon. All good. Lots of Mexican choices there. Also has a menu with breakfast served all day long. Friendly, attentive staff. Good place for a casual relaxed meal with no expectations. Next to the Clarion Hotel.",0,8g_iMtfSiwikVnbP2etR0A
3,kxX2SOes4o-D3ZQBkiMRfA,1,2015-01-04 00:01:03,0,AqPFMleE6RsU23_auESxiA,5.0,"Wow! Yummy, different, delicious. Our favorite is the lamb curry and korma. With 10 different kinds of naan!!! Don't let the outside deter you (because we almost changed our minds)...go in and try something new! You'll be glad you did!",1,_7bHUi9Uuf5__HHc_Q8guQ
4,e4Vwtrqf-wpJfwesgvdgxQ,1,2017-01-14 20:54:15,0,Sx8TMOWLNuJBWer-0pcmoA,4.0,"Cute interior and owner (?) gave us tour of upcoming patio/rooftop area which will be great on beautiful days like today. Cheese curds were very good and very filling. Really like that sandwiches come w salad, esp after eating too many curds! Had the onion, gruyere, tomato sandwich. Wasn't too much cheese which I liked. Needed something else...pepper jelly maybe. Would like to see more menu options added such as salads w fun cheeses. Lots of beer and wine as well as limited cocktails. Next time I will try one of the draft wines.",1,bcjbaE6dDog4jkNY91ncLQ


#### Academic Review Schema

In [47]:
print("Total reviews:", df_review.count())
df_review.printSchema()



Total reviews: 6990280
root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)



                                                                                

Distribution of stars in reviews (how many 1–5 stars)?

Review length distribution (characters / words) and extreme lengths.

Temporal trends: reviews per year/month; seasonal patterns.

Top businesses by number of reviews; which businesses attract most reviews?

Relationship between review text length and rating (do longer reviews correlate with lower/higher stars?).

Most frequent words / bigrams in positive vs negative reviews.

How many reviews were made in total?

What is the highest number of reviews written by a single user?

Which business has the most user reviews?

When were the reviews written?
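
The review-length questions come down to measuring each text and grouping by rating. A minimal sketch in plain Python, using short excerpts of the sample reviews shown above; counting whitespace-separated words is one possible definition of length, and `mean_words_by_stars` is a hypothetical helper.

```python
from collections import defaultdict

# (stars, text) pairs; the texts are short excerpts of the sample reviews shown above.
reviews = [
    (3.0, "The food is good, but it takes a very long time to come out."),
    (3.0, "Family diner. Had the buffet."),
    (5.0, "Wow! Yummy, different, delicious."),
    (4.0, "Cheese curds were very good and very filling."),
]

def mean_words_by_stars(reviews):
    """Average whitespace-separated word count per star rating."""
    lengths = defaultdict(list)
    for stars, text in reviews:
        lengths[stars].append(len(text.split()))
    return {stars: sum(v) / len(v) for stars, v in sorted(lengths.items())}

length_by_stars = mean_words_by_stars(reviews)
print(length_by_stars)
```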

In [None]:
# Reviews per year
df_review = df_review.withColumn("year", F.year("date"))
df_review.groupBy("year").count().orderBy("year").show()

In [None]:
# Stars distribution
df_review.groupBy("stars").count().orderBy("stars").show()

### Academic Checkin

💡 Role:

Stores check-in records: when users visited a business.

In [None]:
df_checkin.limit(5).toPandas()

#### Academic Checkin Schema

In [None]:
print("Total checkin entries:", df_checkin.count())
df_checkin.printSchema()

What are the busiest hours of day aggregated across businesses?

Which weekday has the highest check-ins?

Distribution of total check-ins per business (how many businesses are popular).

Time-series: check-in trends by month or year (if data contains dates).

Correlate check-in counts with business average stars / review_count (need join to business).

Peak hours by category (join checkin → business categories).
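
The busiest-hour and weekday questions reduce to parsing timestamps and counting. A minimal sketch, assuming each check-in row stores its visits as one comma-separated string of `YYYY-MM-DD HH:MM:SS` timestamps (an assumption, since the check-in schema is not shown here); the sample string and the helper `hour_and_weekday_counts` are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def hour_and_weekday_counts(date_string):
    """Count check-ins per hour of day and per weekday name."""
    hours, weekdays = Counter(), Counter()
    for stamp in date_string.split(","):
        stamp = stamp.strip()
        if not stamp:
            continue
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        hours[ts.hour] += 1
        weekdays[WEEKDAYS[ts.weekday()]] += 1  # weekday() returns 0 for Monday
    return hours, weekdays

# Hypothetical check-in string for one business.
sample = "2018-07-07 22:09:11, 2018-07-07 22:45:00, 2018-07-08 12:30:05"
hours, weekdays = hour_and_weekday_counts(sample)
print(hours.most_common(1))
print(dict(weekdays))
```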

In [None]:
# Total check-ins per business (if checkin_info is map)
df_checkin = df_checkin.withColumn("checkin_total",
                                   F.expr("aggregate(map_values(checkin_info), 0L, (acc, x) -> acc + x)"))
df_checkin.select(F.mean("checkin_total"), F.max("checkin_total")).show()

### Academic Restaurant Reviews (Merged)

In [51]:

df_restaurant_review = df_business.join(df_review, on="business_id", how="inner")

df_restaurant_review

DataFrame[business_id: string, address: string, attributes: struct<AcceptsInsurance:string,AgesAllowed:string,Alcohol:string,Ambience:string,BYOB:string,BYOBCorkage:string,BestNights:string,BikeParking:string,BusinessAcceptsBitcoin:string,BusinessAcceptsCreditCards:string,BusinessParking:string,ByAppointmentOnly:string,Caters:string,CoatCheck:string,Corkage:string,DietaryRestrictions:string,DogsAllowed:string,DriveThru:string,GoodForDancing:string,GoodForKids:string,GoodForMeal:string,HairSpecializesIn:string,HappyHour:string,HasTV:string,Music:string,NoiseLevel:string,Open24Hours:string,OutdoorSeating:string,RestaurantsAttire:string,RestaurantsCounterService:string,RestaurantsDelivery:string,RestaurantsGoodForGroups:string,RestaurantsPriceRange2:string,RestaurantsReservations:string,RestaurantsTableService:string,RestaurantsTakeOut:string,Smoking:string,WheelchairAccessible:string,WiFi:string>, categories: string, city: string, hours: struct<Friday:string,Monday:string,Saturday:string

In [53]:
df_restaurant_review.limit(1).toPandas()

25/10/19 22:46:56 WARN TaskSetManager: Lost task 1.0 in stage 46.0 (TID 309) (192.168.56.102 executor 0): java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
	at org.apache.spark.io.MutableCheckedOutputStream.write(MutableCheckedOutputStream.scala:43)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:225)
	at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:178)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at org.apache.spark.sql

Py4JJavaError: An error occurred while calling o468.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 4 times, most recent failure: Lost task 0.3 in stage 46.0 (TID 317) (192.168.56.102 executor 0): java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
	at org.apache.spark.io.MutableCheckedOutputStream.write(MutableCheckedOutputStream.scala:43)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:225)
	at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:178)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:519)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.writeValue(UnsafeRowSerializer.scala:69)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:312)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.immutable.List.foreach(List.scala:333)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:437)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
	at org.apache.spark.io.MutableCheckedOutputStream.write(MutableCheckedOutputStream.scala:43)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:225)
	at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:178)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:519)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.writeValue(UnsafeRowSerializer.scala:69)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:312)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)



Top 10 cities with most restaurants

Distribution of restaurants in each state

Top 10 cities with most reviews

Top 9 restaurants with most reviews


Distribution of positive and negative reviews in each category


### Reviews Analysis