# Natural Language Processing

This notebook uses the reviews dataframe for Philadelphia to identify the most frequent words across all of the reviews.

<i>Next up: create a subset of reviews for known gay businesses using the business_id, then run the same NLP code on that subset to generate a list of gay words.</i>
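The subsetting step described above could look roughly like the sketch below. It runs on a toy dataframe rather than the real reviews, and `known_ids` is a hypothetical placeholder for a list of business IDs that this notebook does not define.

```python
import pandas as pd

# Toy stand-in for the reviews dataframe (the real one is loaded from a pickle).
toy_reviews = pd.DataFrame({
    "business_id": ["a1", "b2", "a1", "c3"],
    "text": ["Great drag show", "Decent pizza", "Friendly bartenders", "Long wait"],
})

# Hypothetical IDs of known gay businesses; an assumption for illustration only.
known_ids = ["a1"]

# Keep only the reviews whose business_id appears in the list.
subset = toy_reviews[toy_reviews["business_id"].isin(known_ids)].reset_index(drop=True)
print(subset)
```

The resulting `subset` has the same columns as the full dataframe, so the word-count code further down should be able to run on it unchanged.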

In [1]:
import pandas as pd
import geopandas as gpd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import re
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harper/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/harper/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
reviews = pd.read_pickle('data/reviewsPhil.pandas')
reviews.reset_index(inplace=True)

Let's look at the reviews dataframe.

In [None]:
print(f"There are {len(reviews)} total reviews for {len(reviews.business_id.unique())} unique businesses.")
print(reviews.text.head())

## Simple Word Counts

First, I'll put all the reviews into one massive string.

In [None]:
# Join with a space so the last word of one review does not fuse with the first word of the next.
s = " ".join(reviews['text'])

Now split up all the words, clean out symbols and stopwords, and count them!

In [None]:
def countWords(wordlist):
    """Count case-insensitive occurrences of each word and return a DataFrame sorted by count."""
    counts = {}
    
    for word in wordlist:
        lword = word.lower()
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    
    return df

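# Keep only letters and whitespace. Note that [A-z] would also match [ \ ] ^ _ and `, so use A-Za-z.
wordlist = word_tokenize(re.sub(r"[^A-Za-z\s]", "", s))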

df = countWords(wordlist)
df.drop(index=stopwords.words('english'), errors='ignore', inplace=True)

df.head(10)
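For comparison, here is a minimal sketch of the same clean, split, and count pipeline on a tiny string using `collections.Counter`. It swaps in plain `str.split()` for `word_tokenize` and a three-word hypothetical stopword set for the NLTK list, so it is an illustration of the idea rather than a replacement for the cell above.

```python
import re
from collections import Counter

text = "The food was great. The drag show was GREAT, too!"
stop = {"the", "was", "too"}  # tiny hypothetical stopword set, not the NLTK list

# Strip everything except letters and whitespace, then lowercase each token.
letters_only = re.sub(r"[^A-Za-z\s]", "", text)
tokens = [w.lower() for w in letters_only.split()]

# Count the tokens that are not stopwords.
counts = Counter(w for w in tokens if w not in stop)
print(counts.most_common(3))
```

`Counter` handles the "increment or initialise" logic that `countWords` writes out by hand, and `most_common` returns the entries sorted by count.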

In [None]:
df.to_pickle('wordcounts_all.pandas')

## Extract gay words

In [2]:
gay = ['gay', 'queer', 'lesbian', 'lgbt', 'lgbtq', 'homosexual', 'homophobic', 'drag', 'trans', 'transgender', 'bisexual', 'twink']
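One possible next step, sketched here on a toy counts table, is to look up how often each keyword occurs. The `toy_counts` dataframe and its values are invented for illustration and only mimic the shape of the table that `countWords` returns (a `word` index and a `word_count` column).

```python
import pandas as pd

# Toy table in the same shape as the countWords output; the numbers are made up.
toy_counts = pd.DataFrame(
    {"word_count": [120, 45, 7]},
    index=pd.Index(["food", "drag", "queer"], name="word"),
)

keywords = ["gay", "queer", "drag"]

# Keep only rows whose word is in the keyword list; absent keywords are simply skipped.
keyword_counts = toy_counts[toy_counts.index.isin(keywords)]
print(keyword_counts)
```

Because the filter preserves row order, a table that is already sorted by count stays sorted.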