### Preparing data language modeling

dataset: https://www.yelp.com/dataset

### Description

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes. Available in both JSON and SQL files, use it to teach students about databases, to learn NLP, or for sample production data while you learn how to make mobile apps


Please visit the Yelp dataset webpage here: https://www.yelp.com/dataset
1. Click "Get the Data"
2. Please review, agree to, and respect Yelp's terms of use!
3. The dataset downloads as a compressed .tar file; 

That's it! You're ready to go.



The current iteration of the Yelp dataset (as of this demo) consists of the following data:

- 174K businesses
- 5.2M user reviews
- 11 metropolitan areas

The data is provided in a handful of files in .json format. We'll be using the following files for our demo:

- business.json — the records for individual businesses
- review.json — the records for reviews users wrote about businesses

The files are text files (UTF-8) with one json object per line, each one corresponding to an individual data record.

In [1]:
import os
import codecs

data_directory = os.path.join("./data/")

businesses_filepath = os.path.join(data_directory, 'business.json')

with codecs.open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline() 

print (first_business_record)

{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}



** **
The business records consist of key, value pairs containing information about the particular business. Attributes we'll be interested in for this project:

- business_id — unique identifier for businesses
- categories — an array containing relevant category values of businesses
- city - geographical locations of businesses

In [2]:
review_json_filepath = os.path.join(data_directory, 'review.json')

with codecs.open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print (first_review_record)

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}



A few attributes of interest the review records:

- business_id — indicates which business the review is about
- text — the natural language text the user wrote
- date - the date when review is captured
- stars - indicates the rating the user gave

### Data selection criteria - All the Restaurants based reviews

Read in each business record and convert it to a Python dict Filter out business records that aren't about restaurants (i.e., not in the "Restaurant" category)

Create a frozenset of the business IDs for restaurants, which we'll use in the next step

In [4]:
import json

restaurant_ids = set()

# open the businesses file
with codecs.open(businesses_filepath, encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if business[u'categories'] is not None and u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print ('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

59,853 restaurants in the dataset.


Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [5]:
# create a clean_data directory under parent data directory
clean_data_path = './data/clean_data/'

review_txt_filepath_all = os.path.join(clean_data_path, 'yelp_review_restausrant_all.txt')

In [6]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:
    
    review_count = 0

    # create & open a new file in write mode
    with codecs.open(review_txt_filepath_all, 'w', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with codecs.open(review_json_filepath, encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the business id, date, restaurant review and star as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print (u'''Text from {:,} restaurant reviews
              written to the new txt file.'''.format(review_count))
    
else:
    
    with codecs.open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print (u'Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1))

Text from 4,203,821 restaurant reviews
              written to the new txt file.
CPU times: user 2min 14s, sys: 10.1 s, total: 2min 24s
Wall time: 2min 25s
