
# Explore NLP methods/vizs with Champaign restaurant review data

In [32]:
#data 
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
import pandas as pd

#spacy for tokenization and lemmatization
import spacy
#gensim for similarity
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.similarities.docsim import MatrixSimilarity,Similarity

#sklearn for distance metrics
from sklearn.metrics.pairwise import cosine_similarity

## Get data

In [33]:
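#drivername must be 'postgresql'; SQLAlchemy 1.4+ no longer accepts 'postgres'
postgres_db_params = {'drivername': 'postgresql',
                      'database': 'restaurants',
                      'username': 'michaelkranz',
                      'password': 'helloworld',
                      'host': 'localhost',
                      'port': 5432}

#URL.create is the supported constructor in SQLAlchemy 1.4+
postgres_db_url = URL.create(**postgres_db_params)
engine = create_engine(postgres_db_url)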

In [34]:
review_df = pd.read_sql(
    con=engine.connect(),
    sql='''
    SELECT *
    FROM champaign_restaurant_reviews
    ''')

## Preprocessing

### Data preview

In [35]:
review_df.head(2)

Unnamed: 0,business_id,name,review_id,user_id,text,stars,date
0,9A1C1f0m4nQltQrOOTl-Kw,Orange & Brew,m4AXzV9l14iFBd9DRdM82w,6X0i-oGUbh5DZdTHzFuKfg,The building is lovely. The remodel after But...,1.0,2013-12-07 02:26:13
1,VHsNB3pdGVcRgs6C3jt6Zg,Dublin O'Neil's,A-yKlSLEQQcoHR5q2lCyHg,Yximlvn0cfb3yVDaLuXDxw,LOVE LOVE LOVE this place! I'm a bit of a suck...,5.0,2013-08-03 19:59:56


In [36]:
review_df.head(2).text.values[0]

'The building is lovely.  The remodel after Buttita\'s is as good as one can do when turning a lovely restaurant into a sports bar-style restaurant (though why one would want to do that is beyond me).  And that\'s where the good stuff stops.\nIn short, the service and all things related to it were glacially slow; the food, when it arrived, was mediocre at best -- and missing key elements (one whole order, and other parts).  This restaurant cannot possibly last if they don\'t figure out how to greet and serve customers, or get the food that was ordered actually to the tables.\nThe full story.\nUpon arrival, we found a vacant hostess stand, then a long walk down the hall to the dining room along which we saw no one who works there.  Luckily, we ran into friends who assured us that we could seat ourselves.\nAfter waiting at our table for some 15 minutes, we used a bit of self-help, and accosted a waitress to ask for some menus.  After another 5-7 minutes another waitress finally showed up

### Tokenize

In [37]:
def tokenize_text(text_str,nlp_obj):
    '''
    use spacy to split text into words
    (ie tokenization)
    and return each word's lemma
    (ie "foot" for "feet")
    keeping only nouns and adjectives
    
    TODO: refine methodology
    '''
    spacy_doc = nlp_obj(text_str)
    
    tokenized_doc = [
        token.lemma_
        for token in spacy_doc
        if token.pos_ in ("NOUN","ADJ")
        ]
    
    return tokenized_doc

### Make dictionary & corpus

In [38]:
nlp = spacy.load('en_core_web_sm')

In [39]:
#TODO: combine reviews in SQL to scale
#TODO: scikit-learn?

In [40]:
reviews_tokenized = (
    review_df
    #.head(2)
    .groupby('business_id')
    .text
    .apply(lambda x: ' '.join(x))
    .apply(tokenize_text,nlp_obj=nlp)
)

In [41]:
reviews_tokenized.head(3)

business_id
-2q4dnUw0gGJniGW2aPamQ    [girlfriend, place, diner, full, unofficial, w...
-5NXoZeGBdx3Bdk70tuyCw    [amazing, tender, pork, sandwich, good, homema...
-5dd-RjojGVK9hjAMCXVZw    [great, restaurant, inexpensive, american, mex...
Name: text, dtype: object

In [42]:
reviews_dictionary = Dictionary(reviews_tokenized)

In [184]:
reviews_dictionary.num_docs

701

In [185]:
review_df.business_id.unique().shape

(701,)

In [59]:
#corpus
reviews_corpus = [reviews_dictionary.doc2bow(doc) for doc in reviews_tokenized]
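
For intuition, `doc2bow` turns each token list into sparse `(token_id, count)` pairs. The sketch below mimics that idea with the standard library only. `build_vocab` and `to_bow` are illustrative helper names, not gensim functions, and the first-seen id order is an assumption that need not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# toy token lists standing in for reviews_tokenized
toy_docs = [["pizza", "good", "pizza"], ["coffee", "good"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each restaurant becomes one such sparse vector, so the corpus has one entry per `business_id`.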

In [84]:
#tfidf with document being each restaurant and corpus being all restaurants
reviews_tfidf_model = TfidfModel(reviews_corpus)

In [85]:
reviews_tfidf_docs = [reviews_tfidf_model[review] for review in reviews_corpus]
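
To make the weighting less of a black box: with its default settings, gensim's `TfidfModel` should multiply each raw count by `log2(N / df)` and then scale every document vector to unit length. The pure-Python sketch below follows that recipe on a toy corpus. `tfidf_weights` is a hypothetical helper written for illustration, and the toy numbers are not taken from the review data.

```python
import math

def tfidf_weights(corpus):
    """corpus: list of docs, each a list of (token_id, count) pairs."""
    n_docs = len(corpus)

    # document frequency: number of docs that contain each token
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in corpus:
        vec = [(t, count * math.log2(n_docs / doc_freq[t])) for t, count in doc]
        # tokens present in every doc get weight 0 and are dropped
        vec = [(t, w) for t, w in vec if w > 0]
        # L2-normalize so each document vector has length 1
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted

# toy bag-of-words corpus: three "restaurants", three distinct tokens
toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3)]]
toy_tfidf = tfidf_weights(toy_corpus)
for vec in toy_tfidf:
    print(vec)
```

Token 0 appears in only one toy document, so it ends up with the largest weight there. In the real corpus the same effect should favor words that are distinctive to a single restaurant.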

In [87]:
#similarity indices for each doc
similarity_indices = MatrixSimilarity(reviews_tfidf_docs)

In [110]:
#doc example
orange_and_brew = reviews_corpus[0]
orange_and_brew_tfidf = reviews_tfidf_model[orange_and_brew]

In [113]:
#similarity to each restaurant (ie doc)
orange_and_brew_similarity = (
    pd.Series(similarity_indices[orange_and_brew_tfidf])
)
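
`MatrixSimilarity` scores a query against every indexed document using cosine similarity. A minimal sketch of that measure on sparse `(token_id, weight)` vectors follows. `sparse_cosine` is a hypothetical helper, not part of gensim, and the vectors are made up for illustration.

```python
import math

def sparse_cosine(vec_a, vec_b):
    """Cosine similarity between two sparse (token_id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    # only tokens present in both vectors contribute to the dot product
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = [(0, 0.6), (1, 0.8)]
doc_b = [(1, 1.0)]
doc_c = [(2, 1.0)]

print(sparse_cosine(doc_a, doc_a))  # a document compared with itself
print(sparse_cosine(doc_a, doc_b))  # one shared token
print(sparse_cosine(doc_a, doc_c))  # no shared tokens
```

A document compared with itself scores 1, which is why the query restaurant sits at the top of its own similarity ranking.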

In [159]:
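#get index:name mappings
#TODO: groupby sorts reviews_tokenized by business_id, so corpus position i
#is reviews_tokenized.index[i]; these names follow review_df row order instead
doc_mapping = (
    review_df[['business_id','name']]
    .drop_duplicates()
    .reset_index(drop=True)
    ['name'] #business name
    .to_dict()
)

#invert token2id so each integer id maps to its token string
token_mapping = {
    i:token
    for token,i in reviews_dictionary.token2id.items()
}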

In [120]:
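#pd.concat needs pandas objects, so rebuild the id/name frame here
#rather than passing the doc_mapping dict
orange_and_brew_similarity_df = (
    pd.concat([orange_and_brew_similarity,
               review_df[['business_id','name']].drop_duplicates().reset_index()],
              axis=1)
    .sort_values(0,ascending=False)
)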

In [121]:
orange_and_brew_similarity_df

Unnamed: 0,0,index,business_id,name
0,1.000000,0,9A1C1f0m4nQltQrOOTl-Kw,Orange & Brew
172,0.773942,7173,N5PfEojrY4rFqpqzno4aZg,Bevande Coffee
369,0.282999,17783,bkh5o6bCzDCovSkRYBsU1w,LongHorn Steakhouse
508,0.280406,23584,aFJ8NSdpJi9LcaMm6ICowQ,Burger King
161,0.264201,7044,t0f8qXrKGNWv6j8MGmosRg,Caffebene
...,...,...,...,...
153,0.002990,6981,3C_S1j70o3nYJUTvnVo7dQ,Arby's
461,0.002729,21083,SuQpsHxcxCAB8kxPEIXiBg,Wonderdogs
256,0.002531,10219,3B74iILT_W1ptfQeW7pIYQ,Buffalo Wild Wings
235,0.001795,9891,epvLkQNL6MOvk3s6JlTntA,Carmon's Restaurant


In [185]:
#per-restaurant word table: raw count and tf-idf weight of each word
all_bag_of_words_list = []
for doc_id in range(len(reviews_corpus)):
    bag_of_words = (
        pd.DataFrame(
            {"frequency":dict(reviews_corpus[doc_id]),
             "tf_idf":dict(reviews_tfidf_docs[doc_id]),
             "business":doc_mapping[doc_id],
             "word":token_mapping
            }
        )
        .set_index(['business','word'])
    )
    all_bag_of_words_list.append(bag_of_words)

In [186]:
pd.concat(all_bag_of_words_list).head(2)

business,word,frequency,tf_idf
Orange & Brew,additional,1.0,0.04959
Orange & Brew,advice,1.0,0.063035


In [None]:
#ratings

#location