# Natural Language Processing

This notebook uses the reviews dataframe for Philadelphia to identify key words in all of the reviews.

<i> Next up, create a subset of reviews for known gay businesses using the business_id. Run the same NLP code to generate a list of gay words.

In [1]:
import pandas as pd
import geopandas as gpd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import re
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harper/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/harper/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
reviews = pd.read_pickle('data/reviewsPhil.pandas')
reviews.reset_index(inplace=True)

Let's look at the reviews dataframe.

In [3]:
print('There are '+str(len(reviews)) + ' total reviews for '+ str(len(reviews.business_id.unique())) + ' unique businesses.')
print(reviews.text.head())

There are 1930159 total reviews for 44840 unique businesses.
0    If you decide to eat here, just be aware it is...
1    I've taken a lot of spin classes over the year...
2    Wow!  Yummy, different,  delicious.   Our favo...
3    I am a long term frequent customer of this est...
4    Amazingly amazing wings and homemade bleu chee...
Name: text, dtype: object


## Simple Word Counts

First, I'll put all the reviews into one massive string.

In [4]:
s = ""
for i in range(len(reviews)):
    s+=reviews['text'][i]

Now split up all the words, clean out symbols and stopwords, and count them!

In [5]:
def countWords(wordlist):
    counts = {} 
    
    for word in wordlist:
        lword = word.lower()
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    
    return df

wordlist = word_tokenize(re.sub(r"[^A-z\s]", "", s))

df = countWords(wordlist)
df.drop(index=stopwords.words('english'), errors='ignore', inplace=True)

df.head(10)

Unnamed: 0_level_0,word_count
word,Unnamed: 1_level_1
food,975227
good,902067
place,868397
great,736548
like,629904
time,616062
get,600139
one,583531
service,579999
would,562962


In [6]:
df.to_pickle('wordcounts_all.pandas')