# Natural Language Processing

This notebook uses the reviews dataframe for Philadelphia to identify key words in all of the reviews.

<i>Next up, create a subset of reviews for known gay businesses using the business_id. Run the same NLP code to generate a list of gay words.</i>

In [1]:
import pandas as pd
import geopandas as gpd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import re
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harper/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/harper/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
reviews = pd.read_pickle('data/gay_reviews_phil.pandas')  # this file is massive
reviews.reset_index(inplace=True)

In [3]:
reviews.head()

,index,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,641,9kB_3ZwSjvz22matz8L1hw,MyLGJPkC8q5sKTNImKBb9Q,8FALz8g6oaTca0ufiUuF6w,4,0,0,0,"Went there due to having a gc, and was suprise...",2011-10-27 23:30:47
1,19626,ZddJtqDHdOUHnW65f1NqAQ,bgyDhaVEcmZwCiWK24ULNw,8FALz8g6oaTca0ufiUuF6w,4,1,5,1,"As far as gay bars go, this place is the tits!...",2011-07-28 18:43:11
2,25242,TSxQdmKXrL9eQ6jxCc5l4g,84HvpQDxcHWmbMDfs8IEYw,8FALz8g6oaTca0ufiUuF6w,4,4,1,6,A Gay Sports Bar? Hell yeah! I went on St....,2010-03-18 18:50:09
3,28557,VLMnIlr9Iz8SlM6ins9Dlw,-wBQ2jBxanWHra6E9gQUcQ,8FALz8g6oaTca0ufiUuF6w,5,0,0,0,Yum yum yum! The new menu is great. I really ...,2017-01-28 19:16:34
4,31642,ydo3dp9Uh9DzSsngaRLA6g,lm56NZaqG90rCIq7mdZYgA,8FALz8g6oaTca0ufiUuF6w,4,1,1,2,I can only speak on the atmosphere since I did...,2015-10-09 22:00:48


Let's look at the reviews dataframe.

In [3]:
print(f'There are {len(reviews)} total reviews for {reviews.business_id.nunique()} unique businesses.')
print(reviews.text.head())

There are 1212 total reviews for 11 unique businesses.
0    Went there due to having a gc, and was suprise...
1    As far as gay bars go, this place is the tits!...
2    A Gay Sports Bar?  Hell yeah!    I went on St....
3    Yum yum yum!  The new menu is great. I really ...
4    I can only speak on the atmosphere since I did...
Name: text, dtype: object


## Simple Word Counts

First, I'll put all the reviews into one massive string.

In [5]:
s = ""
for i in range(len(reviews)):
    s+=reviews['text'][i]
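
One thing to watch with the loop above: `+=` adds no separator, so the last word of one review runs into the first word of the next. Here is a minimal sketch of an alternative using `str.join` with a space. The two sample strings are made up for illustration; in the notebook the iterable would be `reviews['text']`.

```python
# Hypothetical sample reviews standing in for reviews['text'].
sample_reviews = ["Great bar", "Drinks were cheap"]

# Plain concatenation, as in the loop above: no separator between reviews.
fused = ""
for text in sample_reviews:
    fused += text

# Joining with a space keeps the boundary words apart.
joined = " ".join(sample_reviews)

print(fused.split())   # the boundary words run together
print(joined.split())  # every word stays separate
```

With the real data this would probably be `s = " ".join(reviews['text'])`, which also avoids indexing row by row.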

Now split up all the words, clean out symbols and stopwords, and count them!

In [6]:
def countWords(wordlist):
    """Count case-insensitive word occurrences and return a DataFrame sorted by count."""
    counts = {}
    
    for word in wordlist:
        lword = word.lower()
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    
    return df

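# Spell out both letter ranges: [A-z] also spans the ASCII symbols between 'Z' and 'a' ([ \ ] ^ _ `)
wordlist = word_tokenize(re.sub(r"[^A-Za-z\s]", "", s))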

df = countWords(wordlist)
df.drop(index=stopwords.words('english'), errors='ignore', inplace=True)

df.head(10)

word,word_count
bar,1025
place,701
drinks,522
night,511
gay,510
good,504
great,487
time,475
like,468
go,436
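
The same tally can be sketched with `collections.Counter` from the standard library, which replaces the manual dictionary bookkeeping in `countWords`. This is only an illustration, not the notebook's method: the sample text and the tiny stopword set are made up, and a simple regex tokenizer stands in for `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical text and stopword set, for illustration only; the notebook
# uses the combined review string and NLTK's English stopword list.
sample_text = "The bar was great. Great drinks, and the bar staff was friendly!"
sample_stopwords = {"the", "was", "and"}

# Lowercase first, then pull out runs of letters.
tokens = re.findall(r"[a-z]+", sample_text.lower())

# Count every token that is not a stopword.
counts = Counter(t for t in tokens if t not in sample_stopwords)

# The two most frequent remaining words, with their counts.
print(counts.most_common(2))
```

Filtering stopwords before counting, rather than dropping rows afterwards, keeps the result small from the start. Either order gives the same counts for the remaining words.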


In [20]:
df.to_pickle('data/wordcounts_gay.pandas')

## Check for really gay words

In [12]:
gay_words = ['gay', 'gays', 'queer', 'queers', 'lesbian', 'lesbians', 'dyke', 'dykes','lgbt', 'lgbtq', 
             'homosexual', 'homosexuals', 'homo', 'homos', 'homophobic', 'drag', 
             'queen', 'queens', 'trans', 'transgender', 'transphobic', 'bisexual','bisexuals',
             'twink', 'twinks', 'bear', 'bears']

In [16]:
gay_in_yelp = [word for word in gay_words if word in df.index]
gay_in_yelp

['gay',
 'gays',
 'queer',
 'queers',
 'lesbian',
 'lesbians',
 'lgbt',
 'lgbtq',
 'homosexual',
 'homophobic',
 'drag',
 'queen',
 'queens',
 'trans',
 'transgender',
 'bisexual',
 'twinks',
 'bear',
 'bears']

In [19]:
print(len(gay_words))
print(len(gay_in_yelp))

25
19


Wow! Looks like 19 out of our 25 gay words are in the Yelp reviews!
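
A natural follow-up is to see how often each matched term occurs and which terms never show up. The sketch below does that with a plain dictionary. The counts for `bar` and `gay` are copied from the table above; the other counts and the short term list are invented for illustration.

```python
# Stand-in for df['word_count']: 'bar' and 'gay' come from the table above,
# the remaining values are hypothetical.
word_counts = {"bar": 1025, "gay": 510, "drag": 42, "queen": 17}

# A short illustrative slice of the term list (hypothetical selection).
terms = ["gay", "drag", "queen", "twink"]

# Matched terms with their counts, most frequent first.
found = sorted(
    ((t, word_counts[t]) for t in terms if t in word_counts),
    key=lambda pair: pair[1],
    reverse=True,
)

# Terms that never appear in the vocabulary.
missing = [t for t in terms if t not in word_counts]

print(found)
print(missing)
```

In the notebook itself, `df.loc[gay_in_yelp]` should give the matched rows directly, since every entry of `gay_in_yelp` is already known to be in `df.index`.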