<h2>Springboard, Capstone #2: NLP with Yelp reviews</h2>

<b>Overview:</b> Customer reviews offer a wealth of information, and Yelp.com is one of the most extensive sources of reviews of any kind. Whether the subject is a restaurant, a business, or another consumer experience, Yelp lets users tell other users what they thought of a product, which helps people decide what to buy and where to spend their money. Natural language processing, naturally (::chuckle::), allows us to harness this text data and use it for business insights.

<b>The dataset:</b> This dataset is publicly available from Yelp.com [here](https://www.yelp.com/dataset/download). Yelp originally published the data to encourage students to explore it and, in return, to share with Yelp the insights they gain about its customers. The files in the dataset are in JSON format, and for this project, we'll be looking at:
* 'yelp_academic_dataset_review.json', for the review content;
* and 'yelp_academic_dataset_user.json', for user-specific information.

<b>The goal:</b> The objective of this project is to process the text of customer reviews (starting with restaurant reviews, for example) to:
* 1) cluster restaurant goers into groups using machine learning techniques;
* 2) recommend other, similar restaurants to those users based on their previous Yelp reviews.

In [6]:
import pandas as pd

In [23]:
import tarfile

with tarfile.open('yelp_dataset.tar', 'r') as t:
    print(t.getnames())

['./._Dataset_Challenge_Dataset_Agreement.pdf', 'Dataset_Challenge_Dataset_Agreement.pdf', './._Yelp_Dataset_Challenge_Round_12.pdf', 'Yelp_Dataset_Challenge_Round_12.pdf', 'yelp_academic_dataset_business.json', 'yelp_academic_dataset_checkin.json', 'yelp_academic_dataset_photo.json', 'yelp_academic_dataset_review.json', 'yelp_academic_dataset_tip.json', 'yelp_academic_dataset_user.json']


In [25]:
import time

with tarfile.open('yelp_dataset.tar', 'r') as t:
    for member_info in t.getmembers():
        print(member_info.name)
        print('  Modified:', time.ctime(member_info.mtime))
        print('  Mode    :', oct(member_info.mode))
        print('  Type    :', member_info.type)
        print('  Size    :', member_info.size, 'bytes')
        print()

./._Dataset_Challenge_Dataset_Agreement.pdf
  Modified: Tue Jul 31 11:34:32 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 674 bytes

Dataset_Challenge_Dataset_Agreement.pdf
  Modified: Tue Jul 31 11:34:32 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 100912 bytes

./._Yelp_Dataset_Challenge_Round_12.pdf
  Modified: Tue Jul 31 11:34:00 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 674 bytes

Yelp_Dataset_Challenge_Round_12.pdf
  Modified: Tue Jul 31 11:34:00 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 111712 bytes

yelp_academic_dataset_business.json
  Modified: Mon Jul  2 17:22:59 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 146374098 bytes

yelp_academic_dataset_checkin.json
  Modified: Mon Jul  2 17:25:06 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 52744210 bytes

yelp_academic_dataset_photo.json
  Modified: Tue Jul 31 13:55:22 2018
  Mode    : 0o644
  Type    : b'0'
  Size    : 36596656 bytes

yelp_academic_dataset_review.json
  Modified: Mon

In [33]:
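import tarfile

# Extract every member of the archive into the current working directory.
# extractall() returns None, so there is nothing useful to assign.
with tarfile.open('yelp_dataset.tar', 'r') as y:
    y.extractall()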
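
Extracting the whole archive also writes out the PDFs and the `./._` entries shown in the listing, which this project does not use. As a rough sketch, the cell below extracts only the review and user files named in the dataset description. The helper `extract_selected` and its `dest` argument are illustrative names introduced here, not part of `tarfile` or of the original notebook.

```python
import tarfile

# Files this project actually uses (names taken from the archive listing).
WANTED = {
    'yelp_academic_dataset_review.json',
    'yelp_academic_dataset_user.json',
}

def extract_selected(tar_path, wanted=WANTED, dest='.'):
    """Extract only the regular files whose names appear in `wanted`.

    Hypothetical helper for illustration. Returns the list of member
    names that were extracted.
    """
    extracted = []
    with tarfile.open(tar_path, 'r') as t:
        for member in t.getmembers():
            if member.isfile() and member.name in wanted:
                t.extract(member, path=dest)
                extracted.append(member.name)
    return extracted

# Usage (assumes the archive sits in the working directory):
# extract_selected('yelp_dataset.tar')
```

Matching on exact member names keeps the sketch simple; it assumes the names inside the archive carry no leading `./`, which holds for the JSON files in the listing.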

In [40]:
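# Read only the first line: the file holds one JSON object per line.
# The built-in open() accepts an encoding, so codecs is not needed in Python 3.
with open('yelp_academic_dataset_review.json', encoding='utf-8') as f:
    first_review = f.readline()

print(first_review)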

{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id":"iCQpiavjjPzJ5_3gPD5Ebg","stars":2,"date":"2011-02-25","text":"The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo \/ Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...","useful":0,"funny":0,"cool":0}
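
Since each line of the review file is a standalone JSON object, a single line can be turned into a Python dict with the standard `json` module. The sketch below parses a shortened copy of the record above (the review text is cut down for brevity) and defines a small generator, `iter_reviews`, that yields reviews one at a time so the full file never has to be held in memory. `iter_reviews` and its `limit` parameter are hypothetical names introduced for illustration.

```python
import json
from itertools import islice

# Shortened copy of the first review printed above (text field truncated).
sample_line = (
    '{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","stars":2,"date":"2011-02-25",'
    '"text":"The pizza was okay.","useful":0,"funny":0,"cool":0}'
)

# Parse one line into a dict and read individual fields by key.
review = json.loads(sample_line)
print(review['stars'], review['date'])

def iter_reviews(path, limit=None):
    """Yield one dict per non-empty line of a line-delimited JSON file.

    Hypothetical helper. `limit` caps the number of lines read;
    None means read the whole file.
    """
    with open(path, encoding='utf-8') as f:
        for line in islice(f, limit):
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage (assumes the review file has been extracted):
# for r in iter_reviews('yelp_academic_dataset_review.json', limit=5):
#     print(r['stars'], r['text'][:60])
```

Reading lazily with a small `limit` is a convenient way to inspect the schema before committing to loading the entire file.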

