# Natural Language Processing

This notebook uses the gay reviews dataframe for Philadelphia to identify key words in businesses that we have already identified as queer. 

The output of this notebook is a list of queer words that we will use to identify other queer businesses. 


In [1]:
import pandas as pd
import geopandas as gpd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import re
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harper/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/harper/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
reviews = pd.read_pickle('data/gay_reviews_phil.pandas')
reviews.reset_index(inplace=True)

Let's look at the reviews dataframe.

In [7]:
print('There are '+str(len(reviews)) + ' total reviews for '+ str(len(reviews.business_id.unique())) + ' unique businesses.')
print(reviews.text.head())

There are 10365 total reviews for 36 unique businesses.
0    Beautiful clean shop with knock your socks off...
1    Went there due to having a gc, and was suprise...
2    The Lemon Cake is as good as it is heavy.    I...
3    First time here and I have to say it's pretty ...
4    Not a great vibe, employees were pretty rude m...
Name: text, dtype: object


## Simple Word Counts

First, I'll put all the reviews into one massive string.

In [8]:
s = ""
for i in range(len(reviews)):
    s+=reviews['text'][i]

Now split up all the words, clean out symbols and stopwords, and count them!

In [10]:
def countWords(wordlist):
    counts = {} 
    
    for word in wordlist:
        lword = word.lower()
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    
    return df

wordlist = word_tokenize(re.sub(r"[^A-z\s]", "", s))

df = countWords(wordlist)
df.drop(index=stopwords.words('english'), errors='ignore', inplace=True)

df.to_pickle('data/wordcounts_gay.pandas')

In [29]:
#list(df[df.word_count < 100].index)

## Check for really gay words

Check the word list for our generated list of queer vocabulary and also pull the word count for each. 

In [30]:
gay_vocab = ['gay', 'gays', 'queer', 'queers', 'lesbian', 'lesbians', 'dyke', 'dykes','lgbt', 'lgbtq', 
             'homosexual', 'homosexuals', 'homo', 'homos', 'homophobic', 'drag', 
             'queen', 'queens', 'trans', 'transgender', 'transphobic', 'bisexual','bisexuals',
             'twink', 'twinks', 'bear', 'bears']

In [31]:
gay_word_count = df[df.index.isin([word for word in gay_vocab if word in df.index])] 
gay_word_count

Unnamed: 0_level_0,word_count
word,Unnamed: 1_level_1
gay,530
drag,197
queen,39
queer,28
queens,27
lgbt,25
bear,25
lesbian,24
gays,19
lgbtq,19


In [34]:
list(gay_word_count.index)
# just copy and paste these guys to the all biz notebook

['gay',
 'drag',
 'queen',
 'queer',
 'queens',
 'lgbt',
 'bear',
 'lesbian',
 'gays',
 'lgbtq',
 'homophobic',
 'bears',
 'trans',
 'lesbians',
 'transgender',
 'dyke',
 'twinks',
 'queers',
 'homosexual',
 'bisexual',
 'homosexuals']