In [1]:
# import the required libraries here
import pandas as pd
import matplotlib.pyplot as plt

# Extracting desired business reviews
In the last notebook we achieved two things. Firstly, we identified the business categories (Italian/Pizza) that our campaign was going to target. Secondly, we identified the business IDs associated with those categories. Knowing the business IDs allows us to select just the reviews of interest from the reviews file.

Before rushing to read in the review data and then filter for those IDs, as a data scientist you will first have got an idea of the size of the file. The review file is pretty big. Unless you have a very good computer with lots of RAM, or you particularly want to fire up your system/memory monitor and watch the free memory fall further and further until your computer seizes, you should be thinking in terms of how to read in only the lines of interest. This way, you consume only the minimum RAM necessary. Even then, you should generally estimate whether this is within the capabilities of your hardware. In this case, it should be doable for most reasonably modern computers.
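
One way to get an idea of a file's size without loading it is to ask the operating system for the byte count and then count lines by streaming through the file. The sketch below is only illustrative: the helper name `file_size_and_lines` and the small temporary file are made up for the demonstration, and in this notebook you would pass the path of the review file instead.

```python
import os
import tempfile

def file_size_and_lines(path):
    """Return (size in bytes, number of lines) without loading the file into memory."""
    size_bytes = os.path.getsize(path)
    # Iterating over the file object reads one line at a time
    with open(path, 'rb') as f:
        n_lines = sum(1 for _ in f)
    return size_bytes, n_lines

# Hypothetical two-line JSON Lines file, created only for this demonstration
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write('{"business_id": "a"}\n{"business_id": "b"}\n')
    demo_path = tmp.name

size_bytes, n_lines = file_size_and_lines(demo_path)
print(f'{size_bytes} bytes, {n_lines} lines')
os.remove(demo_path)
```

Counting lines still reads the whole file once, but only one line is held in memory at a time.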

In [58]:
businesses = pd.read_csv('business_list.csv')

In [59]:
businesses.head()

Unnamed: 0,business_id,review_count,stars,state
0,Apn5Q_b6Nz61Tq4XzPdf9A,24,4.0,AB
1,6YC6CsXRrmPv_iwfvc9onA,11,3.0,OH
2,0jtRI7hVMpQHpUVtUy4ITw,242,4.0,NV
3,AcGRSWCpb7YB95MTsHlGEw,4,2.0,AZ
4,AYL_y8ahquUW0o-cvIyLbg,85,3.5,NC


In [60]:
business_ids = businesses['business_id'].values
print(business_ids[:5])

['Apn5Q_b6Nz61Tq4XzPdf9A' '6YC6CsXRrmPv_iwfvc9onA'
 '0jtRI7hVMpQHpUVtUy4ITw' 'AcGRSWCpb7YB95MTsHlGEw'
 'AYL_y8ahquUW0o-cvIyLbg']


## Filtering a large data file in chunks
The review data file comprises nearly 6 million lines. It's rather slow to read it in one line at a time, check whether it's a line we want, add that line to a list if it is, and continue to the next line. On the other hand, reading the entire file in at once and then filtering the rows in memory is extremely demanding of RAM. A middle course is to read the file in chunks of many lines, filter each chunk, and keep only the matching rows.
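
As a minimal sketch of the chunked approach, the example below reads a tiny in-memory JSON Lines sample two rows at a time and keeps only the rows of interest. The sample records and the name `wanted_ids` are made up for illustration. The pattern of `lines=True` plus `chunksize` is the same one that applies to the real review file.

```python
import io
import pandas as pd

# Four made-up review records in JSON Lines format (hypothetical data)
sample = io.StringIO(
    '{"business_id": "a", "stars": 5}\n'
    '{"business_id": "b", "stars": 3}\n'
    '{"business_id": "a", "stars": 4}\n'
    '{"business_id": "c", "stars": 1}\n'
)
wanted_ids = ['a', 'c']

# With chunksize set, read_json returns an iterator of DataFrames
# rather than a single DataFrame
reader = pd.read_json(sample, lines=True, chunksize=2)

# Filter each chunk as it arrives, so only matching rows are retained
kept = [chunk[chunk['business_id'].isin(wanted_ids)] for chunk in reader]

filtered = pd.concat(kept)
print(filtered)
```

Note that the reader can only be iterated over once. To make a second pass you need to create it again.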

In [61]:
# For convenience here, we are accessing the data in the working directory that contains our notebooks.
# Normal best practice is to keep your data separate, but this keeps things simple.
# task: create a reader object for the review json file
# Hint: use lines=True as before but add the chunksize parameter (100000 rows per chunk here)
# one line of code here
review_reader = pd.read_json('yelp_academic_dataset_review.json', lines=True, chunksize=100000)

In [64]:
%%time
# task: process the file one chunk at a time,
# filter that chunk for rows with a business_id in business_ids
# You can either do this within a loop, having initialized an empty list,
# or using a more pythonic list comprehension
reviews = [review.loc[review['business_id'].apply(lambda x: x in business_ids)] for review in review_reader]
# (this took some 24 minutes on my old i7)

CPU times: user 24min 11s, sys: 27 s, total: 24min 38s
Wall time: 24min 36s
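
Much of that time likely goes on the membership test: `x in business_ids` scans the NumPy array once for every row. A vectorised alternative worth trying is `Series.isin`, which builds the same boolean mask in a single call. The sketch below uses made-up data (`ids` stands in for `business_ids`) to show that the two approaches select the same rows. It has not been timed on the actual review file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for business_ids and for one chunk of reviews
ids = np.array(['a', 'c'])
chunk = pd.DataFrame({
    'business_id': ['a', 'b', 'c', 'b'],
    'stars': [5, 3, 4, 1],
})

# Row-by-row membership test, as in the cell above
mask_apply = chunk['business_id'].apply(lambda x: x in ids)

# Vectorised membership test
mask_isin = chunk['business_id'].isin(ids)

print((mask_apply == mask_isin).all())
print(chunk.loc[mask_isin])
```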


In [52]:
%%time
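# Partial run: process only the first 10 chunks
counter = 0
batches = []
for review in review_reader:
    counter += 1
    # keep only the three ID columns
    batch = review[['business_id', 'review_id', 'user_id']]
    # keep the rows whose business_id is one of our target businesses
    batches.append(batch.loc[batch['business_id'].apply(lambda x: x in business_ids)])
    if counter == 10:
        break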


CPU times: user 3min 39s, sys: 3.74 s, total: 3min 43s
Wall time: 3min 44s


In [65]:
len(reviews)

60

In [66]:
type(reviews)

list

In [67]:
reviews_df = pd.concat(reviews)

In [68]:
reviews_df.shape

(506291, 9)

In [69]:
reviews_df.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,iCQpiavjjPzJ5_3gPD5Ebg,0,2011-02-25,0,x7mDIiDB3jEiPGPHOmDzyw,2,The pizza was okay. Not the best I've had. I p...,0,msQe1u7Z_XuqjGoqhB0J5g
17,q0n4I-zqiI47xispOqc1lA,0,2016-09-19,0,M9gD64U797dfIpLA9OHYVA,5,"Delicious, friendly staff, cool atmosphere, al...",0,eG6HneK9zLcuZpVuKcsCGQ
25,f-v1fvtnbdw_QQRsCnwH-g,0,2017-11-18,0,alI_kRKyEHfdHibYGgtJbw,1,I have to write a review on the Fractured Prun...,0,Fc_nb6N6Sdurqb-rwsY1Bw
37,RyTEGJz5tG7zC73BdXt-cQ,0,2016-11-18,0,6lRnO0n3QdkSGlJLUQb15w,5,Best veal sandwich ever ( i like rapini/ garli...,0,zx_Op2OAOM_fRic9tU-jqA
50,OX0T9dWI8b7meu-ljTo22A,0,2016-05-01,0,oz66Z8p9Etq0WbcZVCmm7w,5,"Friendliest staff, no matter how swamped they ...",0,0pf5VuzE4_1pwj5NJHG5TQ


In [70]:
'iCQpiavjjPzJ5_3gPD5Ebg' in business_ids

True
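
Checking one ID is a quick spot check. A fuller check, sketched below on made-up data, confirms that every row in the filtered frame belongs to a wanted business and counts how many distinct businesses actually have reviews. Here `wanted_ids` and `filtered` are hypothetical stand-ins for `business_ids` and `reviews_df`.

```python
import pandas as pd

# Hypothetical stand-ins for business_ids and reviews_df
wanted_ids = ['a', 'c', 'd']
filtered = pd.DataFrame({'business_id': ['a', 'c', 'a']})

# True only if every remaining row has a business_id we asked for
all_wanted = filtered['business_id'].isin(wanted_ids).all()

# Number of distinct businesses that have at least one review
n_businesses = filtered['business_id'].nunique()

print(all_wanted, n_businesses)
```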

In [71]:
reviews_df.to_csv('reviews_filtered.csv', index=False)
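
Passing `index=False` matters here. When the index is written out, it comes back as an extra `Unnamed: 0` column the next time the file is read, like the one visible in `businesses.head()` above. The sketch below shows the difference on a made-up two-row frame, using in-memory strings in place of files.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by filtering
df = pd.DataFrame({'business_id': ['a', 'b'], 'stars': [5, 3]}, index=[0, 17])

# to_csv with no path returns the CSV text, which we read straight back in
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```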