# All Businesses in Philadelphia

This notebook uses a list of gay words (created in the NLP notebook from reviews of existing gay businesses) to find other gay businesses that are not tagged as such.

I will bring in the dataframe that contains all reviews for Philadelphia (created in Process Yelp) and count the number of times each gay word appears in each business's reviews.

The output of this notebook is a list of Yelp Business IDs of our newly identified queer businesses. I'll take that list into our final notebook to analyze those businesses through mapping and inspecting other attributes. 

In [1]:
import pandas as pd
import geopandas as gpd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import ngrams, FreqDist

import re
nltk.download('stopwords')
nltk.download('punkt')

from collections import Counter

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harper/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/harper/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


First, I'll bring in the reviews. Then, I want to create a dataframe where the index is the business_id and each row is all the reviews smashed together. Essentially, there is one bag of reviews for each business.

In [2]:
# bring in all reviews in philadelphia
reviews = pd.read_pickle('data/reviewsPhil.pandas')

In [3]:
# standardize the reviews text
reviews['text'] = reviews['text'].fillna('').astype(str)

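# group by business_id and join each business's reviews into one string
# (a space separator keeps the words at review boundaries from fusing together)
bizDf = reviews.groupby('business_id')['text'].agg(' '.join).to_frame()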
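
One detail worth checking when building these bags: adding strings together puts no separator between them, so the last word of one review can fuse with the first word of the next. A minimal sketch on made-up data (the toy reviews and the names `bags_sum` and `bags_join` are hypothetical, not part of the real dataset) compares a plain `sum` with a space-separated join:

```python
import pandas as pd

# toy stand-in for the real reviews dataframe (made-up data)
toy = pd.DataFrame({
    'business_id': ['a', 'a', 'b'],
    'text': ['Great drag show', 'Loved it', 'Nice coffee'],
})

# summing strings concatenates them with nothing in between
bags_sum = toy.groupby('business_id')['text'].sum()

# joining with a space keeps the words at review boundaries apart
bags_join = toy.groupby('business_id')['text'].agg(' '.join)

print(bags_sum['a'])   # Great drag showLoved it
print(bags_join['a'])  # Great drag show Loved it
```

With the plain sum, "show" and "Loved" become the single token "showLoved", which would never match a word list.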

Next, I'll clean each review bag: make the text lowercase and strip everything except letters and whitespace. I'll also need to tokenize so I can count words in the next step.
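
To make the cleaning step concrete, here is a minimal sketch on a single made-up string. It assumes a plain `str.split` in place of `word_tokenize` so that it runs without the NLTK data, and `clean_and_split` is a hypothetical helper name. It also shows why the character class matters: the range `A-z` covers the ASCII characters between `Z` and `a` as well, so brackets and underscores survive it.

```python
import re

def clean_and_split(text):
    """Lowercase, drop everything except a-z and whitespace, split on whitespace."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    return letters_only.split()

sample = "The DRAG show was great!! 10/10 [would_go again]"

strict = clean_and_split(sample)
print(strict)
# ['the', 'drag', 'show', 'was', 'great', 'wouldgo', 'again']

# the wider range A-z also matches [ \ ] ^ _ and the backtick, so they are kept
loose = re.sub(r"[^A-z\s]", "", sample.lower()).split()
print(loose)
# ['the', 'drag', 'show', 'was', 'great', '[would_go', 'again]']
```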

<i>Ok, this is a really big df to work with. I want to clean the text for each row and then do the counts. I might need a function that does both, and then run that function in chunks.</i>

In [7]:
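def FindNewGays(in_df):
    """Clean and tokenize each review bag in in_df['text'], then count gay words per business.

    Relies on the module-level gay_words and mystop defined in the next cell.
    """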
    # lowercase, strip everything except letters and whitespace, then tokenize
    # (a-z rather than A-z: the A-z range also matches [ \ ] ^ _ and the backtick)
    df = in_df['text'].apply(lambda item: word_tokenize(re.sub(r"[^a-z\s]", "", item.lower()))).to_frame()
    
    # remove stopwords
    df['text'] = df['text'].apply(lambda review: [word for word in review if (word not in mystop)])

    # make column with gay words extracted
    df['gay_text'] = df['text'].apply(lambda review: [word for word in review if (word in gay_words)])

    # create new dataframe only with businesses that have at least one gay word in the reviews
    newgayDf = df[df['gay_text'].apply(lambda x: len(x) >0)].copy()

    # count number of words for each business and sort
    newgayDf['gay_text_len'] = newgayDf.gay_text.apply(lambda x: len(x))
    newgayDf.sort_values('gay_text_len', ascending=False, inplace=True)

    # create frequency distribution of gay words for each business
    newgayDf['gay_counts'] = newgayDf['gay_text'].apply(lambda words: Counter(words))

    # count unique gay words for each business
    newgayDf['gay_unique_count'] = newgayDf['gay_counts'].apply(lambda freq: len(freq.keys()))

    return newgayDf
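
If the full dataframe is too heavy to process in one go, the function can be run over blocks of rows and the results stacked afterwards. The sketch below is one way to do that, not something this notebook actually runs: `run_in_chunks`, `count_chars` and the toy data are hypothetical, and `count_chars` stands in for `FindNewGays` so the example is self-contained.

```python
import pandas as pd

def run_in_chunks(df, func, chunk_size):
    """Apply func to consecutive row blocks of df and stack the results."""
    pieces = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        pieces.append(func(chunk))
    return pd.concat(pieces)

def count_chars(chunk):
    """Toy per-chunk function: add a column with the length of each text."""
    out = chunk.copy()
    out['n_chars'] = out['text'].str.len()
    return out

toy = pd.DataFrame({'text': ['aa', 'b', 'cccc', 'dd', 'e']},
                   index=list('vwxyz'))

result = run_in_chunks(toy, count_chars, chunk_size=2)
print(result['n_chars'].tolist())  # [2, 1, 4, 2, 1]
```

Something like `run_in_chunks(bizDf, FindNewGays, chunk_size=5000)` would follow the same pattern (the chunk size is an arbitrary placeholder). Because `FindNewGays` sorts within each chunk, the stacked result would need one more `sort_values('gay_text_len', ascending=False)` at the end.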


In [8]:
gay_words = ['gay','drag','queen','queer','queens','lgbt','bear','lesbian','gays','lgbtq','homophobic','bears','trans','lesbians','transgender','dyke','twinks','queers','homosexual','bisexual','homosexuals']
mystop = set(stopwords.words('english'))  # a set makes the per-word membership test fast

newgayDf = FindNewGays(bizDf)

newgayDf.to_pickle('data/newgaybiz.pandas')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newgayDf['gay_text_len'] = newgayDf.gay_text.apply(lambda x: len(x))
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newgayDf.sort_values('gay_text_len', ascending=False, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newgayDf['gay_counts'] = newgayDf['gay_text'].apply(lambda words: Counter(words))
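
The messages above are pandas' SettingWithCopyWarning, which is raised when a column is assigned on a dataframe that pandas thinks may be a slice of another one. The function already takes a `.copy()` of the filtered rows, so this output may come from a run before that was added (that is a guess on my part). A minimal sketch of the pattern on made-up data:

```python
import pandas as pd

# toy stand-in for the tokenized dataframe (made-up data)
toy = pd.DataFrame({'gay_text': [['gay', 'drag'], [], ['queer']]},
                   index=['a', 'b', 'c'])

# .copy() makes the filtered subset an independent dataframe
subset = toy[toy['gay_text'].apply(len) > 0].copy()

# adding a column now touches only the copy, not toy
subset['gay_text_len'] = subset['gay_text'].apply(len)

print(list(subset.index))               # ['a', 'c']
print(subset['gay_text_len'].tolist())  # [2, 1]
print('gay_text_len' in toy)            # False
```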

In [10]:
# business_id is the index after the groupby, so reset it to get it back as a column
newgayList = newgayDf.reset_index()[['business_id', 'gay_text_len', 'gay_counts']]

## Refine on a Slice
Here I work through the same steps one at a time on a slice of 1,000 businesses, to count the occurrences of each gay word in the reviews.

In [None]:
gay_words = ['gay','drag','queen','queer','queens','lgbt','bear','lesbian','gays','lgbtq','homophobic','bears','trans','lesbians','transgender','dyke','twinks','queers','homosexual','bisexual','homosexuals']

In [None]:
# make a slice
sliceDf = bizDf.head(1000).copy()

In [None]:
# lowercase, strip everything except letters and whitespace, then tokenize
# (a-z rather than A-z: the A-z range also matches [ \ ] ^ _ and the backtick)
sliceDf = sliceDf['text'].apply(lambda item: word_tokenize(re.sub(r"[^a-z\s]", "", item.lower()))).to_frame()

In [None]:
# remove stopwords
mystop = stopwords.words('english') + ['pm']
sliceDf['text'] = sliceDf['text'].apply(lambda review: [word for word in review if (word not in mystop)])

In [None]:
# extract gay words into new column
sliceDf['gay_text'] = sliceDf['text'].apply(lambda review: [word for word in review if (word in gay_words)])

In [None]:
# subset for businesses with gay words
newgayDf = sliceDf[sliceDf['gay_text'].apply(lambda x: len(x) >0)].copy()

### Time to count! 

In [None]:
# Just count the number of words that were identified. This can be used later to screen out some businesses
newgayDf['gay_text_len'] = newgayDf.gay_text.apply(lambda x: len(x))
newgayDf.sort_values('gay_text_len', ascending=False, inplace=True)

In [None]:
# create frequency distribution of gay words for each business 
newgayDf['gay_counts'] = newgayDf['gay_text'].apply(lambda words: Counter(words))

In [None]:
# count unique gay words for each business
newgayDf['gay_unique_count'] = newgayDf['gay_counts'].apply(lambda freq: len(freq.keys()))
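
For a single business, the three counts above boil down to the following. This is a small sketch on a made-up token list, using only `collections.Counter`:

```python
from collections import Counter

# made-up list of gay words pulled out of one business's reviews
gay_text = ['drag', 'queen', 'drag', 'gay', 'drag']

total_hits = len(gay_text)       # what gay_text_len holds
counts = Counter(gay_text)       # what gay_counts holds
unique_hits = len(counts)        # what gay_unique_count holds

print(total_hits)             # 5
print(counts['drag'])         # 3
print(unique_hits)            # 3
print(counts.most_common(1))  # [('drag', 3)]
```

A business with many hits but only one unique word (say, a place called "Bear Creek") looks very different from one with the same number of hits spread over several words, which is why both numbers are kept.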

## Future Tasks

- Drop words that are getting false hits (like bear and drag)
- Drop businesses with only a few hits
- Do a manual check of what those businesses are (will need to join newgayDf back to bizPhil)
- Get the total frequency of each gay word across all businesses
- Do ngrams on text and gay text?
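
A rough sketch of how the first, second and fourth tasks might look, on made-up data. The `noisy_words` set and the `min_hits` threshold are hypothetical placeholders that would need to be chosen after the manual check, and the column names `clean_counts` and `clean_len` are invented for this example.

```python
from collections import Counter
import pandas as pd

# toy stand-in for newgayDf with one Counter per business (made-up data)
toy = pd.DataFrame({
    'gay_counts': [Counter({'drag': 4, 'queen': 2}),
                   Counter({'bear': 1}),
                   Counter({'gay': 3, 'bear': 2})],
}, index=['a', 'b', 'c'])

noisy_words = {'bear', 'drag'}  # hypothetical: words that give false hits
min_hits = 2                    # hypothetical: minimum hits to keep a business

def drop_noisy(counts):
    """Return a Counter without the words listed in noisy_words."""
    return Counter({w: n for w, n in counts.items() if w not in noisy_words})

# recount after removing the noisy words
toy['clean_counts'] = toy['gay_counts'].apply(drop_noisy)
toy['clean_len'] = toy['clean_counts'].apply(lambda c: sum(c.values()))

# drop businesses with too few remaining hits
kept = toy[toy['clean_len'] >= min_hits]

# add up the per-business Counters to get one overall frequency table
total = sum(kept['clean_counts'], Counter())

print(list(kept.index))  # ['a', 'c']
print(total)             # Counter({'gay': 3, 'queen': 2})
```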