# Yelp Open Dataset for Review Classification - Data Preparation

## Import Libraries

In [42]:
import pandas as pd
import numpy as np

## Data Preparation

In this phase we loaded the business and review JSON files, kept only the columns of interest, and saved the result as a CSV dataset for the later phases.

Along the way we dropped the reviews that belong to closed businesses, which reduces the number of reviews carried over from the source dataset.
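As a rough illustration of the steps just described, the sketch below applies the same filter-then-join idea to two tiny in-memory tables. All values and the `toy_*` names are invented for the example; only the column names mirror the Yelp files.

```python
import pandas as pd

# Toy stand-ins for the business and review files (values are invented).
toy_business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "stars": [4.0, 3.5, 5.0],
    "review_count": [10, 4, 7],
    "is_open": [1, 0, 1],
})
toy_review = pd.DataFrame({
    "business_id": ["b1", "b2", "b2", "b3"],
    "stars": [5.0, 2.0, 3.0, 4.0],
    "text": ["great", "closed now", "meh", "nice"],
})

# Keep open businesses only, then keep just the columns of interest.
open_business = (
    toy_business[toy_business["is_open"] == 1]
    .loc[:, ["business_id", "stars", "review_count"]]
    .rename(columns={"stars": "meanStars", "review_count": "reviewCount"})
)

# An inner join drops every review whose business was filtered out.
toy_review = toy_review.rename(columns={"stars": "reviewStars"})
toy_merged = pd.merge(open_business, toy_review, on="business_id", how="inner")
print(toy_merged)
```

The two reviews of the closed business `b2` disappear in the join, which is the effect the real pipeline relies on to shrink the review set.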

In [43]:
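# dtypes applied to the review columns while parsing the JSON
# (np.str was removed from NumPy; the built-in str is the replacement)
rtypes = {"stars": np.float16,
          "useful": np.int32,
          "funny": np.int32,
          "cool": np.int32,
          "text": str,
         }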
reviewPath = './data/yelp_academic_dataset_review.json'
businessPath = './data/yelp_academic_dataset_business.json'
chunkSize = 10000

In [44]:
%%time

bs = pd.read_json(businessPath, lines=True)

CPU times: user 3.13 s, sys: 450 ms, total: 3.58 s
Wall time: 3.59 s


In [45]:
bs.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"{'RestaurantsTableService': 'True', 'WiFi': 'u...","Gastropubs, Food, Beer Gardens, Restaurants, B...","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'..."
1,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsAtt...","Salad, Soup, Sandwiches, Delis, Restaurants, C...","{'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ..."
2,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Antiques, Fashion, Used, Vintage & Consignment...","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0..."
3,oaepsyvc0J17qwi8cfrOWg,Great Clips,2566 Enterprise Rd,Orange City,FL,32763,28.914482,-81.295979,3.0,8,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","Beauty & Spas, Hair Salons",
4,PE9uqAjdw0E4-8mjGl3wVA,Crossfit Terminus,1046 Memorial Dr SE,Atlanta,GA,30316,33.747027,-84.353424,4.0,14,1,"{'GoodForKids': 'False', 'BusinessParking': '{...","Gyms, Active Life, Interval Training Gyms, Fit...","{'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0'..."


In [46]:
# cleaning business dataframe
bs = bs[bs['is_open']==1] # removing closed business
bs = bs.drop(['name', 'address', 'city', 'postal_code', 
              'latitude','longitude','is_open', 'attributes', 
              'categories','hours','state'], axis=1)
bs = bs.rename(columns={'stars' : 'meanStars','review_count' : 'reviewCount'})


In [47]:
bs.head()

Unnamed: 0,business_id,meanStars,reviewCount
0,6iYb2HFDywm3zjuRg0shjw,4.0,86
1,tCbdrRPZA0oiIYSmHG3J0w,4.0,126
2,bvN78flM8NLprQ1a1y5dRg,4.5,13
3,oaepsyvc0J17qwi8cfrOWg,3.0,8
4,PE9uqAjdw0E4-8mjGl3wVA,4.0,14


In [48]:
%%time
# read the review file lazily, chunkSize lines at a time
review = pd.read_json(reviewPath, lines=True,
                      dtype=rtypes,
                      chunksize=chunkSize)
chunkList = []
for chunkReview in review:
    # keep only business_id, stars, text and date
    chunkReview = chunkReview.drop(['review_id','useful','funny','cool', 'user_id'], axis=1)
    chunkReview = chunkReview.rename(columns={'stars': 'reviewStars'})
    # the inner join keeps only reviews of businesses that are still open
    chunkMerged = pd.merge(bs, chunkReview, on='business_id', how='inner')
    print(f"{chunkMerged.shape[0]} out of {chunkSize:,} related reviews")
    chunkList.append(chunkMerged)
# stack the per-chunk results into a single dataframe
df = pd.concat(chunkList, ignore_index=True, join='outer', axis=0)

7678 out of 10,000 related reviews
7662 out of 10,000 related reviews
7769 out of 10,000 related reviews
7773 out of 10,000 related reviews
7711 out of 10,000 related reviews
7800 out of 10,000 related reviews
7809 out of 10,000 related reviews
7833 out of 10,000 related reviews
7707 out of 10,000 related reviews
7689 out of 10,000 related reviews
7748 out of 10,000 related reviews
7653 out of 10,000 related reviews
7663 out of 10,000 related reviews
7644 out of 10,000 related reviews
7677 out of 10,000 related reviews
7677 out of 10,000 related reviews
7783 out of 10,000 related reviews
7715 out of 10,000 related reviews
7644 out of 10,000 related reviews
7590 out of 10,000 related reviews
7769 out of 10,000 related reviews
7714 out of 10,000 related reviews
7684 out of 10,000 related reviews
7674 out of 10,000 related reviews
7711 out of 10,000 related reviews
7675 out of 10,000 related reviews
7709 out of 10,000 related reviews
7712 out of 10,000 related reviews
7772 out of 10,000 related reviews

9076 out of 10,000 related reviews
9107 out of 10,000 related reviews
9070 out of 10,000 related reviews
9001 out of 10,000 related reviews
9075 out of 10,000 related reviews
8983 out of 10,000 related reviews
9018 out of 10,000 related reviews
9067 out of 10,000 related reviews
9059 out of 10,000 related reviews
8971 out of 10,000 related reviews
8983 out of 10,000 related reviews
9013 out of 10,000 related reviews
9049 out of 10,000 related reviews
9028 out of 10,000 related reviews
9146 out of 10,000 related reviews
9188 out of 10,000 related reviews
9507 out of 10,000 related reviews
9518 out of 10,000 related reviews
9532 out of 10,000 related reviews
9461 out of 10,000 related reviews
9498 out of 10,000 related reviews
9487 out of 10,000 related reviews
8916 out of 10,000 related reviews
7578 out of 10,000 related reviews
7494 out of 10,000 related reviews
7513 out of 10,000 related reviews
7582 out of 10,000 related reviews
7478 out of 10,000 related reviews
7526 out of 10,000 related reviews

7437 out of 10,000 related reviews
7479 out of 10,000 related reviews
7351 out of 10,000 related reviews
7409 out of 10,000 related reviews
7371 out of 10,000 related reviews
7391 out of 10,000 related reviews
7361 out of 10,000 related reviews
7364 out of 10,000 related reviews
7368 out of 10,000 related reviews
7321 out of 10,000 related reviews
7391 out of 10,000 related reviews
8017 out of 10,000 related reviews
8340 out of 10,000 related reviews
8335 out of 10,000 related reviews
8476 out of 10,000 related reviews
8629 out of 10,000 related reviews
8339 out of 10,000 related reviews
8426 out of 10,000 related reviews
8507 out of 10,000 related reviews
8348 out of 10,000 related reviews
8331 out of 10,000 related reviews
8531 out of 10,000 related reviews
8513 out of 10,000 related reviews
8349 out of 10,000 related reviews
8460 out of 10,000 related reviews
8665 out of 10,000 related reviews
8576 out of 10,000 related reviews
8559 out of 10,000 related reviews
8373 out of 10,000 related reviews

7398 out of 10,000 related reviews
7484 out of 10,000 related reviews
7466 out of 10,000 related reviews
7537 out of 10,000 related reviews
7354 out of 10,000 related reviews
7448 out of 10,000 related reviews
7487 out of 10,000 related reviews
7486 out of 10,000 related reviews
7510 out of 10,000 related reviews
7486 out of 10,000 related reviews
7396 out of 10,000 related reviews
7491 out of 10,000 related reviews
7466 out of 10,000 related reviews
7458 out of 10,000 related reviews
7496 out of 10,000 related reviews
7520 out of 10,000 related reviews
7426 out of 10,000 related reviews
7442 out of 10,000 related reviews
7428 out of 10,000 related reviews
7518 out of 10,000 related reviews
7481 out of 10,000 related reviews
7522 out of 10,000 related reviews
7477 out of 10,000 related reviews
7423 out of 10,000 related reviews
7505 out of 10,000 related reviews
7470 out of 10,000 related reviews
7485 out of 10,000 related reviews
7452 out of 10,000 related reviews
7352 out of 10,000 related reviews
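The chunked read-and-merge loop above can be tried on a small scale without the real files. The following is a minimal sketch, assuming five invented reviews held in an in-memory JSON Lines string and a hypothetical `open_ids` table standing in for `bs`. It records the actual size of each chunk, since the last chunk is usually smaller than the requested chunk size.

```python
import io
import json
import pandas as pd

# Five invented reviews in JSON Lines form; "b2" plays a closed business.
records = [{"business_id": b, "stars": s, "text": t}
           for b, s, t in [("b1", 5, "a"), ("b2", 1, "b"), ("b1", 4, "c"),
                           ("b3", 3, "d"), ("b3", 2, "e")]]
json_lines = "\n".join(json.dumps(r) for r in records)

# Hypothetical stand-in for the cleaned business dataframe.
open_ids = pd.DataFrame({"business_id": ["b1", "b3"]})

kept_per_chunk = []   # (reviews kept, reviews read) for each chunk
merged_chunks = []
reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=2)
for chunk in reader:
    merged = pd.merge(open_ids, chunk, on="business_id", how="inner")
    kept_per_chunk.append((len(merged), len(chunk)))
    merged_chunks.append(merged)

toy_df = pd.concat(merged_chunks, ignore_index=True)
print(kept_per_chunk)
```

Only one chunk is held in memory at a time, which is the reason for reading the review file in pieces rather than in a single call.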

In [49]:
df.head()

Unnamed: 0,business_id,meanStars,reviewCount,reviewStars,text,date
0,6iYb2HFDywm3zjuRg0shjw,4.0,86,5.0,Stopped in on a busy Friday night. Despite the...,2018-03-04 00:59:21
1,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,4.0,Elephant's contacted me the same day I posted ...,2012-07-16 05:04:05
2,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,5.0,I'm not usually a fan of airport food. I usual...,2015-04-28 21:11:10
3,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,4.0,"If one must have breakfast at the airport, per...",2015-11-18 18:50:05
4,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,5.0,"Reasonably priced, tasty local joint. Lots of ...",2011-11-30 20:15:41


In [50]:
df = df.rename(columns={'business_id' : 'businessId'})

In [51]:
df.head()

Unnamed: 0,businessId,meanStars,reviewCount,reviewStars,text,date
0,6iYb2HFDywm3zjuRg0shjw,4.0,86,5.0,Stopped in on a busy Friday night. Despite the...,2018-03-04 00:59:21
1,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,4.0,Elephant's contacted me the same day I posted ...,2012-07-16 05:04:05
2,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,5.0,I'm not usually a fan of airport food. I usual...,2015-04-28 21:11:10
3,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,4.0,"If one must have breakfast at the airport, per...",2015-11-18 18:50:05
4,tCbdrRPZA0oiIYSmHG3J0w,4.0,126,5.0,"Reasonably priced, tasty local joint. Lots of ...",2011-11-30 20:15:41


In [53]:
df.to_csv('./data/yelp_academic_base_dataset.csv', index=False)
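When the CSV is loaded again in a later phase, the `date` column comes back as plain text unless it is parsed explicitly. The sketch below shows one possible way to handle this, using an in-memory buffer and invented rows in place of the real file; `parse_dates` is a standard `pd.read_csv` parameter, while the `toy` and `restored` names are only for illustration.

```python
import io
import pandas as pd

# Invented rows with the same column names as the exported dataset.
toy = pd.DataFrame({
    "businessId": ["b1", "b3"],
    "reviewStars": [5.0, 4.0],
    "date": pd.to_datetime(["2018-03-04 00:59:21", "2012-07-16 05:04:05"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype lost in the CSV round trip.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```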