# A Basic Delve into Python Data Science
### *~ According to the Majestic Whims of Jay Kaiser ~*

**Note**: The vast majority of the code and methodology found here has been taken directly from [Modern NLP in Python](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb). This is merely a way for me to practice these topics in my own environment, using the most recent dataset provided by Yelp. I do not claim original creation of anything found here, and all credit belongs to the authors of that Jupyter notebook. I am merely striving to learn these concepts in a manner that is most conducive to my own research...

With that in mind, let's get started.

## Introduction

Hi. My name is Jay Kaiser, and I am in my final semester of grad school at Indiana University Bloomington, where I am receiving my M.S. in Computational Linguistics. Naturally, this has consisted mostly of separate classes in linguistics and in computer science, but occasionally a class appears that cleverly combines the two. Classes like these have slowly drawn me into the world of syntax and syntactic parsing. They have also led me to the wonderful world of data science and data analytics, an interest that only grew after an internship as a data scientist with Kingfisher Systems, Inc. in Washington, D.C. over the summer of 2017.

However, I must continue my studies if I ever hope to succeed. To acclimate myself rapidly to data science and its many aspects, I have invested in a range of resources to assist my learning. Nothing, though, will teach me better than getting my hands dirty and doing it myself, so here goes...

### Project: Yelp Dataset

I have downloaded to my computer the entire JSON portion of the [Yelp Dataset](https://www.yelp.com/dataset/challenge), which unzipped into six separate JSON files, each of which is outlined below.

| Dataset File | Description |
|-------------:|:------------|
| business.json | information on each business reviewed on Yelp |
| checkin.json | check-ins recorded at each business |
| photos.json | photos found on Yelp with their captions |
| review.json | text reviews of each business |
| tip.json | small quotes of advice for future visitors of the business |
| user.json | Yelp reviewers and information describing them |

However, only *business.json* and *review.json* will be used here.

### Step 1: A tour of the dataset

In [1]:
import os
import codecs

data_directory = os.path.join(r'C:\Users\jayka\Documents', 'Datasets', 'yelp')
business_filepath = os.path.join(data_directory, 'business.json')
review_filepath = os.path.join(data_directory, 'review.json')

with codecs.open(business_filepath, encoding='utf_8') as f:
    first_business_record = f.readline()
    
print(first_business_record)

{"business_id": "YDf95gJZaq05wvo7hTQbbQ", "name": "Richmond Town Square", "neighborhood": "", "address": "691 Richmond Rd", "city": "Richmond Heights", "state": "OH", "postal_code": "44143", "latitude": 41.5417162, "longitude": -81.4931165, "stars": 2.0, "review_count": 17, "is_open": 1, "attributes": {"RestaurantsPriceRange2": 2, "BusinessParking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "BikeParking": true, "WheelchairAccessible": true}, "categories": ["Shopping", "Shopping Centers"], "hours": {"Monday": "10:00-21:00", "Tuesday": "10:00-21:00", "Friday": "10:00-21:00", "Wednesday": "10:00-21:00", "Thursday": "10:00-21:00", "Sunday": "11:00-18:00", "Saturday": "10:00-21:00"}}



Records in *review.json* are stored the same way, with one JSON object per line. Of the review fields, **text** is the one I will use, though any of the other fields could be used for other types of analysis.

This JSON data is not easily navigable, so it'll have to be converted into a more-usable format.
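
To make the nesting in a record like the one above a little more concrete, here is a minimal sketch that parses a single line with the standard `json` module. The record is an abridged copy of the business shown above, and the variable names are my own placeholders.

```python
import json

# Abridged copy of the first business record printed above.
record_line = (
    '{"business_id": "YDf95gJZaq05wvo7hTQbbQ", '
    '"name": "Richmond Town Square", '
    '"stars": 2.0, '
    '"attributes": {"BusinessParking": {"lot": true, "valet": false}}, '
    '"categories": ["Shopping", "Shopping Centers"]}'
)

# json.loads turns one line of the file into nested dicts and lists.
business = json.loads(record_line)

# Top-level fields are plain dictionary lookups.
name = business['name']

# Nested objects become nested dictionaries.
has_lot_parking = business['attributes']['BusinessParking']['lot']

# "categories" is a list, so membership tests work directly.
is_restaurant = 'Restaurants' in business['categories']

print(name, has_lot_parking, is_restaurant)
```

This particular business is a shopping center, so it would be skipped by a restaurant filter.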

In [2]:
import json

restaurant_ids = set()

with codecs.open(business_filepath, encoding='utf_8') as f:
    for business_json in f:
        business = json.loads(business_json)
        
        if u'Restaurants' not in business[u'categories']:
            continue
        
        restaurant_ids.add(business[u'business_id'])
    
    restaurant_ids = frozenset(restaurant_ids)
    
    print('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

51,613 restaurants in the dataset.


There are 51,613 registered restaurants in the dataset, over twice as many as there had been in 2016.

Separately, I'll need to find the reviews for each of the restaurants, based on their ID. This will be placed in a separate file that's easier to parse.

In [3]:
intermediate_directory = os.path.join(r'C:\Users\jayka\Documents\Datasets', 'intermediate')
review_text_filepath = os.path.join(intermediate_directory, 'yelp_review_text.txt')

In [24]:
%%time

review_count = 0

if 0 == 1: # This branch only needs to be run once.
    
    with codecs.open(review_text_filepath, 'w', encoding='utf_8') as review_text_file:
        with codecs.open(review_filepath, encoding='utf_8') as review_json_file:
            for review_json in review_json_file:
                review = json.loads(review_json)

                if review[u'business_id'] not in restaurant_ids:
                    continue

                review_text_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1
            
    print(u'''Text from {:,} restaurant reviews
              written to the new text file.'''.format(review_count))

else:
    with codecs.open(review_text_filepath, encoding='utf_8') as review_text_file:
        for review_count, line in enumerate(review_text_file):
            pass

    print(u'''Text from {:,} restaurant reviews written to the new txt file.'''.format(review_count + 1))

Text from 2,929,512 restaurant reviews written to the new txt file.
Wall time: 52.5 s


This process takes me about 2.5 minutes on my dual-core i5 laptop. Moreover, there are a total of 2,927,731 reviews found, over three times as many as in 2016. I now have the IDs and review texts in separate files, and real parsing can now begin.
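
The one-review-per-line format depends on escaping line breaks when writing and unescaping them when reading. As a small sketch of that round trip (the helper names and the sample text are made up for illustration):

```python
def escape_newlines(text):
    """Replace real line breaks with the two characters backslash + n."""
    return text.replace('\n', '\\n')


def unescape_newlines(line):
    """Turn the two-character sequence backslash + n back into a line break."""
    return line.replace('\\n', '\n')


# A made-up review containing a blank line between two paragraphs.
review_text = 'Great staff.\n\nFood was cold.'

stored_line = escape_newlines(review_text)
restored = unescape_newlines(stored_line)

print(stored_line)
print(restored == review_text)
```

One caveat of this simple scheme: a review that already contains a literal backslash followed by `n` would come back with a line break in its place.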

### Step 2: Introduction to text processing with spaCy

spaCy is a library for performing NLP tasks, which I will use on the reviews just collected. It is free to use and incredibly fast, much more so than NLTK. For testing, *sample_review* has been created below.

In [5]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

with codecs.open(review_text_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')

print(sample_review)

The staff here is great and they're nice,  wonderful and quick. People were ranting in raving about pei wei, I had to try it.  Even good yelp reviews.  I'm highly dissatisfied with the flavor of the food. This  should be labeled Asian inspired and not Asian. I've tried a variety of Chinese restaurants, this doesn't taste close to anything I've had at other Asian restaurants. Their Mongolian beef  was 5 pieces of beef and large mushrooms cut into thirds in a thick sauce. You eat the rice to wash off the nasty flavor. My shrimp was thickly coated in an overpowering  sauce as well.  I only ate some of the veggies that take center stage on a meat dish.  The center of my pork egg roll was cold. The hot N sour soup was a much thicker consistency almost like that of a chili instead of being brothy. Worst of all was the price.  This was not worth it to us. Neither me or my husband enjoyed either of  our dishes.  We didn't even eat half of our plates.  We even refused to take it home with us.  

In [6]:
%%time
parsed_review = nlp(sample_review)

Wall time: 26.6 ms


For listing all of the sentences in the sample review:

In [7]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
The staff here is great and they're nice,  wonderful and quick.

Sentence 2:
People were ranting in raving about pei wei, I had to try it.  

Sentence 3:
Even good yelp reviews.  

Sentence 4:
I'm highly dissatisfied with the flavor of the food.

Sentence 5:
This  should be labeled Asian inspired and not Asian.

Sentence 6:
I've tried a variety of Chinese restaurants, this doesn't taste close to anything I've had at other Asian restaurants.

Sentence 7:
Their Mongolian beef  was 5 pieces of beef and large mushrooms cut into thirds in a thick sauce.

Sentence 8:
You eat the rice to wash off the nasty flavor.

Sentence 9:
My shrimp was thickly coated in an overpowering  sauce as well.  

Sentence 10:
I only ate some of the veggies that take center stage on a meat dish.  

Sentence 11:
The center of my pork egg roll was cold.

Sentence 12:
The hot N sour soup was a much thicker consistency almost like that of a chili instead of being brothy.

Sentence 13:
Worst of all was the 

For listing all of the named entities:

In [8]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: Asian - NORP

Entity 2: Asian - NORP

Entity 3: Chinese - NORP

Entity 4: Asian - NORP

Entity 5: Mongolian - NORP

Entity 6: 5 - CARDINAL

Entity 7: half - CARDINAL

Entity 8: Asian - NORP



For listing each word's POS tag:
(We'll be using *pandas* for this, since it makes the results easy to display and to parse further.)

In [9]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

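|   | token_text | part_of_speech |
|--:|:-----------|:---------------|
| 0 | The | DET |
| 1 | staff | NOUN |
| 2 | here | ADV |
| 3 | is | VERB |
| 4 | great | ADJ |
| 5 | and | CCONJ |
| 6 | they | PRON |
| 7 | 're | VERB |
| 8 | nice | ADJ |
| 9 | , | PUNCT |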


For lemmas and shape analysis:

In [10]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
            columns=['token_text', 'token_lemma', 'token_shape'])

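|   | token_text | token_lemma | token_shape |
|--:|:-----------|:------------|:------------|
| 0 | The | the | Xxx |
| 1 | staff | staff | xxxx |
| 2 | here | here | xxxx |
| 3 | is | be | xx |
| 4 | great | great | xxxx |
| 5 | and | and | xxx |
| 6 | they | -PRON- | xxxx |
| 7 | 're | be | 'xx |
| 8 | nice | nice | xxxx |
| 9 | , | , | , |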

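The shape column can look cryptic at first, so here is a rough pure-Python approximation that reproduces the shapes listed above. It is only a sketch of the idea (letters become `X` or `x`, digits become `d`, and long runs of the same symbol are capped at four), not spaCy's own code, and the function name is hypothetical.

```python
def approximate_shape(text):
    """Approximate a word-shape string for a token's text."""
    shape = []
    last_char = ''
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            # Punctuation and other symbols are kept as they are.
            shape_char = char
        if shape_char == last_char:
            run_length += 1
        else:
            run_length = 1
            last_char = shape_char
        # Keep at most four repeats of the same shape symbol in a row.
        if run_length <= 4:
            shape.append(shape_char)
    return ''.join(shape)


for word in ['The', 'staff', 'is', "'re", ',']:
    print(word, approximate_shape(word))
```

Capping the runs is why both *staff* and *great* collapse to `xxxx` even though they have five letters.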

For listing each token's entity type and inside-outside-begin (IOB) tag:

In [11]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
            columns=['token_text', 'entity_type', 'inside_outside_begin'])

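|   | token_text | entity_type | inside_outside_begin |
|--:|:-----------|:------------|:---------------------|
| 0 | The |  | O |
| 1 | staff |  | O |
| 2 | here |  | O |
| 3 | is |  | O |
| 4 | great |  | O |
| 5 | and |  | O |
| 6 | they |  | O |
| 7 | 're |  | O |
| 8 | nice |  | O |
| 9 | , |  | O |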

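Since the first ten tokens are all tagged `O`, the scheme is hard to see from the output above. As a small illustration of how inside-outside-begin tags relate to entity spans, here is a sketch in plain Python; the tokens, spans, and function name are invented for the example and do not come from the review.

```python
def iob_tags(tokens, entity_spans):
    """Assign (iob, label) pairs to tokens.

    entity_spans is a list of (start, end, label) tuples, where start and
    end are token indices and end is exclusive.
    """
    tags = [('O', '')] * len(tokens)
    for start, end, label in entity_spans:
        # The first token of an entity begins it.
        tags[start] = ('B', label)
        # Any remaining tokens are inside the same entity.
        for i in range(start + 1, end):
            tags[i] = ('I', label)
    return tags


# Hypothetical tokens with one single-token and one two-token entity.
tokens = ['other', 'Asian', 'restaurants', 'in', 'New', 'York']
spans = [(1, 2, 'NORP'), (4, 6, 'GPE')]

tags = iob_tags(tokens, spans)
for token, (iob, label) in zip(tokens, tags):
    print(token, iob, label)
```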

For listing stopwords, punctuation, whitespace, numbers, standard vocab words, etc:

In [12]:
token_attributes = [(token.orth_,
                    token.prob,
                    token.is_stop,
                    token.is_punct,
                    token.is_space,
                    token.like_num,
                    token.is_oov)
                   for token in parsed_review]

df = pd.DataFrame(token_attributes,
                 columns=['text',
                         'log_probability',
                         'stop?',
                         'punctuation?',
                         'whitespace?',
                         'number?',
                         'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                      .applymap(lambda x: u'Yes' if x else u''))

df

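|   | text | log_probability | stop? | punctuation? | whitespace? | number? | out of vocab.? |
|--:|:-----|----------------:|:------|:-------------|:------------|:--------|:---------------|
| 0 | The | -5.774222 | Yes |  |  |  |  |
| 1 | staff | -10.720455 |  |  |  |  |  |
| 2 | here | -7.175437 | Yes |  |  |  |  |
| 3 | is | -4.329765 | Yes |  |  |  |  |
| 4 | great | -7.822114 |  |  |  |  |  |
| 5 | and | -4.195279 | Yes |  |  |  |  |
| 6 | they | -5.429816 | Yes |  |  |  |  |
| 7 | 're | -6.377125 |  |  |  |  |  |
| 8 | nice | -8.462502 |  |  |  |  |  |
| 9 | , | -3.391480 |  | Yes |  |  |  |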


### Step 3: Automatic Phrase Modeling

The problem with these tokens is that, taken independently, they do not properly reflect the relationships between words. In the sample review, for example, *do* and *n't* are treated as separate tokens, but they would be better represented as a single collocation, since *n't* never appears on its own.

From the *gensim* library, I will be able to complete phrase modeling over the current list of tokens.

In [13]:
import warnings
# Because of Windows Anaconda, a warning appears without this filter.
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

There are four steps that must be done in data preparation here:

#### a) Segment text of complete reviews into sentences & normalize text

Three helper functions will be used for this step. They will, respectively, find punctuation and whitespace, line breaks, and word lemmas.

In [14]:
def punct_space(token):
    """returns True for tokens that are purely punctuation or whitespace"""
    return token.is_punct or token.is_space

def line_review(filename):
    """reads in each review and unescapes the line-breaks"""
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """uses spaCy to parse reviews, lemmatize, and tokenize sentences"""
    for parsed_review in nlp.pipe(line_review(filename),
                                 batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                            if not punct_space(token)])

In [15]:
unigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'unigram_sentences_all.txt')

*lemmatized_sentence_corpus* loops over the original review text and yields one normalized sentence at a time; the cell below writes each of them to *unigram_sentences_all.txt*, one per line.

In [16]:
%%time
if 0 == 1: # This only needs to be run once.
    
    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_text_filepath):
            f.write(sentence + '\n')

Wall time: 0 ns


This process took me 2 hours and 48 minutes to complete on my modest computer.
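
With a step this slow, it may be worth guarding it with a check on the output file instead of toggling `if 0 == 1` by hand. The following is only a sketch of that idea under my own assumptions: `write_once` and `fake_corpus` are hypothetical names, and a temporary directory stands in for the real intermediate directory.

```python
import os
import tempfile


def write_once(path, make_lines):
    """Write lines to path unless it already exists.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        for line in make_lines():
            f.write(line + '\n')
    return True


calls = []


def fake_corpus():
    """Stand-in for an expensive generator of normalized sentences."""
    calls.append(1)
    yield 'this place be clean'
    yield 'go away'


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'unigram_sentences_demo.txt')
    first_run = write_once(target, fake_corpus)   # does the work
    second_run = write_once(target, fake_corpus)  # skipped, file exists

print(first_run, second_run, len(calls))
```

A file left behind by an interrupted run would be mistaken for a finished one, so in practice it might be safer to write to a temporary name and rename it at the end.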

The gensim library provides a *LineSentence* class that streams documents of this format from disk, so even a huge corpus never has to be loaded into memory all at once.
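
To show what streaming from disk means here, this is a bare-bones stand-in written in plain Python. It is not gensim's implementation, and the class name `StreamedSentences` is hypothetical; it only illustrates the idea of an object that re-reads the file on every pass and yields one tokenized sentence at a time.

```python
import os
import tempfile


class StreamedSentences:
    """Iterate over a one-sentence-per-line file as lists of tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on each pass so the object can be iterated
        # more than once; only one line is held in memory at a time.
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sentences.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('service be a 4\n')
        f.write('go away\n')

    sentences = StreamedSentences(path)
    first_pass = list(sentences)
    second_pass = list(sentences)

print(first_pass)
```

Being able to iterate more than once matters because a model may need several passes over the same corpus.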

In [18]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

What do these sentences look like now?

In [19]:
for unigram_sentence in it.islice(unigram_sentences, 450, 460):
    print(u' '.join(unigram_sentence))
    print(u'')

this place be clean the staff be wonderful and go out of -PRON- way to help -PRON-

-PRON- stay a step ahead of -PRON- and get what what -PRON- ne the food be perfect

-PRON- have eat here many many time and -PRON- have always be fresh hot and delicious -PRON- highly recommend this place 100%

service be a 4

very sloppy also the hot and sour soup be not that bad if u add enough chile paste and salt and soy sauce the gentleman name pop -PRON- be the manager great service from -PRON- and this other lady do not get -PRON- name unfortunetly -PRON- have change -PRON- rating 2 time that be how impressed -PRON- be

love the food but -PRON- be great at all location

the customer service be horrible at this location

only two table be occupy at the time -PRON- walk in for a carry out

-PRON- walk in and -PRON- look at -PRON- like really

go away



#### b) First-order phrase modeling

From these normalized sentences, one can now use *gensim.models.Phrases* to find two-word collocations and link them together into single tokens. 

For example, in one of the sentences above are the words "soy sauce". Though they are technically separate words, when together they act as a distinct entity. Hence, by training a model on frequent bigrams, one can disambiguate cases like these.
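
To get a feel for how frequent bigrams can be detected, here is a toy sketch in plain Python. It uses a simple count-based score (how often a pair occurs together, relative to how often its words occur at all) and joins the pairs that score above a threshold. This is my own simplified illustration rather than gensim's code, and the corpus, function names, and the `min_count` and `threshold` values are all chosen just for this tiny example.

```python
from collections import Counter


def find_bigrams(sentences, min_count=1, threshold=2.0):
    """Score adjacent word pairs and keep those above the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    phrases = {}
    for (a, b), pair_count in bigrams.items():
        # Pairs seen together often, made of words that are otherwise
        # rare, receive the highest scores.
        score = (pair_count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


def join_bigrams(tokens, phrases):
    """Merge detected pairs into single tokens, scanning left to right."""
    joined = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            joined.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


# A tiny made-up corpus of normalized sentences.
toy_sentences = [
    ['add', 'soy', 'sauce', 'and', 'salt'],
    ['the', 'soy', 'sauce', 'be', 'good'],
    ['the', 'soup', 'be', 'good'],
    ['more', 'soy', 'sauce', 'and', 'the', 'rice'],
]

phrases = find_bigrams(toy_sentences)
print(join_bigrams(toy_sentences[0], phrases))
```

On a corpus this small the thresholds are arbitrary; on millions of sentences, much stricter values would be needed to keep only meaningful collocations.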

In [20]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [21]:
%%time
if 0 == 1: # This only needs to be run once.
    
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)

# load the finished model once it has been built
bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 14min 21s


This process took a little more than 14 minutes to complete on my computer.

#### c) Second-order phrase modeling

#### d) Apply text normalization and second-order phrase model to text of complete reviews

### Step 4: Visualizing topic models with pyLDAvis

### Step 5: Word vector models with word2vec

### Step 6: Visualizing word2vec with t-SNE