# COGS 108 - Final Project 

# Overview

In this project, I set out to find what people like and dislike about San Diego Park & Recreation and to answer the question "What should the city of San Diego do in the next five years?" For the dataset, I used Yelp data, which contains both star ratings and review text. I cleaned the data by removing places unrelated to parks, then analyzed the review text with TF-IDF and a Bag of Words (BoW) approach.

# Name & GitHub

- Name: Hyunjin Shin
- GitHub Username: hyunjinUCSD

# Research Question

What do people like or dislike about the city of San Diego's parks and recreation places?

## Background and Prior Work

U.S. News & World Report, one of the biggest American media companies, lists visiting Balboa Park as the best thing to do in San Diego and visiting Mission Beach as the second best (see the first reference). Since many people see this ranking and visit the parks in San Diego, understanding what visitors like and dislike about the parks is crucial to making them want to come back. This is one of the reasons I decided to look at what people like and dislike about the parks and recreation places in San Diego.

I enjoy visiting parks in San Diego, especially Balboa Park, on weekends. However, every visit is different. Some visits were good: I readily found a parking spot and easily found a resting area. Others were bad: I could not find a resting area when I felt tired. On websites such as Yelp (see the second reference), I could easily find both high-rated and low-rated reviews of Balboa Park. So I decided to find out what people like and dislike about San Diego's parks and recreation places. I also chose this research question because, if San Diego improves what people dislike about its parks and recreation places, more people will revisit the city.

References (include links):
- 1) Best Things To Do in San Diego 
     https://travel.usnews.com/San_Diego_CA/Things_To_Do/
     
- 2) Balboa Park Yelp Reviews
     https://www.yelp.com/biz/balboa-park-san-diego?osq=Balboa+park

# Hypothesis


Cleanliness is what people like most about San Diego's parks and recreation places, and the lack of parking facilities is what people like least.

# Dataset(s)

- Dataset Name: yelp_SD_reviews.csv
- Link to the dataset: https://www.yelp.com/developers/documentation/v3/business_reviews
- Number of observations: 3,478 observations
- Description: Each observation consists of the name of a park or recreation place, the rating given by the reviewer, and the text written by the reviewer. Some places in this dataset are unrelated to San Diego Park & Recreation, so I'll remove them in the Data Cleaning step.

# Setup

### step 1. import libraries and packages

In [1]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import bs4
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist
import string
import warnings
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

plt.rcParams['figure.figsize'] = (17, 7)
plt.rcParams.update({'font.size': 14})

sns.set()
sns.set_context('talk')

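# use the full option name; the short form 'precision' is ambiguous in newer pandas versions
pd.set_option('display.precision', 2)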

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon') 

warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /Users/henry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/henry/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/henry/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


# Data Cleaning

I'll load yelp_SD_reviews.csv file into the pandas DataFrame df_review.

In [2]:
#df_review
df_review = pd.read_csv('yelp_SD_reviews.csv')

I'll check whether df_review contains any null values.

In [3]:
df_review.isnull().sum()

id        0
rating    0
text      0
dtype: int64

The output shows that there are no null values in df_review.

Next, I'll remove reviews of places unrelated to parks and recreation centers, such as apartments, offices, restaurants, and schools. (I decided to remove restaurants because some people consider a restaurant a recreation place and some don't.)

1. remove apartments

In [4]:
# key_substring = 'apartment'
df_review=df_review[~df_review.id.str.contains("Apartments")]
df_review=df_review[~df_review.id.str.contains("apartments")]
df_review=df_review[~df_review.id.str.contains("apartment")]
df_review=df_review[~df_review.id.str.contains("Apartment")]

#remove apartment called "Pinnacle at Otay Ranch"
df_review=df_review[~df_review.id.str.contains("Pinnacle at Otay Ranch")]

#remove apartment called  "5811 Lindo Paseo"
df_review=df_review[~df_review.id.str.contains("5811 Lindo Paseo")]

#remove apartment called  "The Villas At Camino Bernardo"
df_review=df_review[~df_review.id.str.contains("The Villas At Camino Bernardo")]

#remove apartment called  "Strata"
df_review=df_review[~df_review.id.str.contains("Strata")]

#remove apartment called  "Villa La Jolla Townhomes Association"
df_review=df_review[~df_review.id.str.contains("Villa La Jolla")]

#remove an apartment "Strauss On Fifth"
df_review=df_review[~df_review.id.str.contains("Strauss On Fifth")]

#remove an apartment "Arcadia at StoneCrest Village"
df_review=df_review[~df_review.id.str.contains("Arcadia")]

2. remove companies

In [5]:
# key_substring = 'company'
df_review=df_review[~df_review.id.str.contains("Company")]
df_review=df_review[~df_review.id.str.contains("company")]

# key_substring = 'inc'
df_review=df_review[~df_review.id.str.contains("Inc")]
df_review=df_review[~df_review.id.str.contains("inc")]

# key_substring = 'office'
df_review=df_review[~df_review.id.str.contains("Office")]
df_review=df_review[~df_review.id.str.contains("office")]

# remove a property management, homeowner association called  "ASPM-San Diego"
df_review=df_review[~df_review.id.str.contains("ASPM")]

3. remove restaurants & bars

In [6]:
# key_substring = 'kitchen'
df_review=df_review[~df_review.id.str.contains("Kitchen")]
df_review=df_review[~df_review.id.str.contains("kitchen")]

# key_substring = 'restaurant' (spelling corrected so the filter matches names that contain "Restaurant")
df_review=df_review[~df_review.id.str.contains("Restaurant")]
df_review=df_review[~df_review.id.str.contains("restaurant")]

# key_substring = 'bar'
df_review=df_review[~df_review.id.str.contains("Bar")]
df_review=df_review[~df_review.id.str.contains("bar")]

#remove a restaurant called  "Fernside"
df_review=df_review[~df_review.id.str.contains("Fernside")]

#remove a restaurant called  "Phil's BBQ"
df_review=df_review[~df_review.id.str.contains("BBQ")]

4. remove schools & academies

In [7]:
# key_substring = 'school'
df_review=df_review[~df_review.id.str.contains("School")]
df_review=df_review[~df_review.id.str.contains("school")]

# key_substring = 'college'
df_review=df_review[~df_review.id.str.contains("College")]
df_review=df_review[~df_review.id.str.contains("college")]

# key_substring = 'academy'
df_review=df_review[~df_review.id.str.contains("Academy")]
df_review=df_review[~df_review.id.str.contains("academy")]
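
The four cleaning steps above repeat the same filter once per spelling of each keyword. As a minimal sketch of an alternative (not the method used in this notebook), the filters can be folded into one case-insensitive, whole-word pattern. Matching whole words also avoids a side effect of plain substring matching, where a short keyword like "bar" matches inside longer names such as "Embarcadero". The helper name `drop_unrelated` and the demo rows are hypothetical.

```python
import re
import pandas as pd

def drop_unrelated(df, keywords, column="id"):
    """Return the rows whose `column` contains none of the keywords as a whole word."""
    # \b ... \b restricts matches to whole words; the optional "s" also catches plurals.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")s?\b"
    mask = df[column].str.contains(pattern, case=False, regex=True)
    return df[~mask]

# Hypothetical rows for illustration only (not taken from yelp_SD_reviews.csv).
demo = pd.DataFrame({
    "id": ["Balboa Park", "Sunset Apartments", "Embarcadero Marina Park", "Joe's Bar"],
    "rating": [5, 1, 4, 3],
})

kept = drop_unrelated(demo, ["apartment", "bar"])
print(kept["id"].tolist())
```

Places that carry no telltale keyword (the ones removed by name above) would still need to be listed explicitly.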

Now that the observations unrelated to San Diego Park & Recreation are removed, let's see what df_review looks like.

In [8]:
pd.set_option('display.max_rows', df_review.shape[0]+1)
df_review

Unnamed: 0,id,rating,text
0,Balboa Park,5,Balboa Park is a must see when coming to San D...
1,Balboa Park,5,Beautiful grounds even to take a stroll during...
2,Balboa Park,5,Beautiful sightseeing in San Diego. Lots of wa...
3,Civita Park,5,Was invited to child's B-Day party pre Covid-1...
4,Civita Park,5,"Pretty nice park, beautiful design. Anyone is..."
5,Civita Park,1,We love Civita Park for its wide open spaces t...
6,Waterfront Park,5,"After eating brunch in Little Italy, we decide..."
7,Waterfront Park,5,Definitely worth the price of admission.\n\nIf...
8,Waterfront Park,4,Well First I got here and I was quite amazed a...
12,Bay View Park,5,I LOVE this spot!!! So beautiful at night!! Th...


As we can see above, the ratings range from 1 to 5, and all of the ratings are integers. I'll separate df_review into two dataframes: <b>df_review_pos</b> consisting of positive reviews (rating >= 3) and <b>df_review_neg</b> consisting of negative reviews (rating < 3). 

In [9]:
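# .copy() creates independent DataFrames, so adding the 'token' column later does not write to a slice of df_review
df_review_pos=df_review[df_review['rating'] >= 3].copy()
df_review_neg=df_review[df_review['rating'] < 3].copy()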

Let's see if both of our datasets look great!

In [10]:
df_review_pos

Unnamed: 0,id,rating,text
0,Balboa Park,5,Balboa Park is a must see when coming to San D...
1,Balboa Park,5,Beautiful grounds even to take a stroll during...
2,Balboa Park,5,Beautiful sightseeing in San Diego. Lots of wa...
3,Civita Park,5,Was invited to child's B-Day party pre Covid-1...
4,Civita Park,5,"Pretty nice park, beautiful design. Anyone is..."
6,Waterfront Park,5,"After eating brunch in Little Italy, we decide..."
7,Waterfront Park,5,Definitely worth the price of admission.\n\nIf...
8,Waterfront Park,4,Well First I got here and I was quite amazed a...
12,Bay View Park,5,I LOVE this spot!!! So beautiful at night!! Th...
13,Bay View Park,5,I took my kids here on a week day before The 4...


In [11]:
df_review_neg

Unnamed: 0,id,rating,text
5,Civita Park,1,We love Civita Park for its wide open spaces t...
241,Cowles Mountain,1,Stay away!! \n\nGo take your graffiti and lou...
287,Edward Tyler Cramer Park,1,There is NO PARKING. This is a taxpayer mainta...
292,Bay Park,2,Bay Park has been a disappointment on the last...
318,Western Hills Park,2,I almost got a ticket there for having my dog ...
319,South Clairemont Recreation Center & Community...,1,Stay away from the youth theater program if yo...
327,Ray's Tennis Shop,1,"I've tried, for the better part of 3, years to..."
333,Cadman Park Leash-Free Area,1,This park brought awareness to the danger of o...
338,Nate's Point Dog Park - Balboa Park,2,Dog park review #4 - I would say 2.5 but Yelp ...
349,Emerald City Realty,1,We bought a home through Emerald. It was one o...


Since df_review_pos contains the reviews with a rating of 3 or higher and df_review_neg contains the reviews with a rating below 3, the split worked as intended, and I'll use these two dataframes in the data analysis and results.

# Data Analysis & Results

I'll <b>tokenize</b> the `text` column for the data in the df_review_pos and the df_review_neg.  

In [12]:
df_review_pos['token'] = df_review_pos['text'].apply(word_tokenize) 
df_review_neg['token'] = df_review_neg['text'].apply(word_tokenize) 

I'll remove <b>stop words</b> that are not important to my analysis for the data in the df_review_pos and the df_review_neg.

In [13]:
stop_words = set(stopwords.words('english'))
df_review_pos['token'] = df_review_pos['token'].apply(lambda x: [item for item in x if item not in stop_words])
df_review_neg['token'] = df_review_neg['token'].apply(lambda x: [item for item in x if item not in stop_words])
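
NLTK's English stop word list is all lowercase, so the comparison above leaves capitalized stop words such as "I", "We", or "This" in the token lists. The sketch below shows one way to make the check case-insensitive and drop punctuation-only tokens in the same pass. It is an illustration, not the step used in this notebook; the function name `clean_tokens` and the tiny `demo_stop_words` set are hypothetical stand-ins for `stopwords.words('english')`.

```python
import string

def clean_tokens(tokens, stop_words):
    """Lowercase each token, then drop stop words and punctuation-only tokens."""
    cleaned = []
    for tok in tokens:
        low = tok.lower()
        if low in stop_words:
            continue  # stop word, compared in lowercase
        if all(ch in string.punctuation for ch in low):
            continue  # token made only of punctuation, e.g. "!" or "."
        cleaned.append(low)
    return cleaned

# Hypothetical stop word set, used here so the example needs no NLTK download.
demo_stop_words = {"i", "this", "so", "at"}
tokens = ["I", "LOVE", "this", "spot", "!", "!", "So", "beautiful", "at", "night"]
print(clean_tokens(tokens, demo_stop_words))
```

Lowercasing also merges variants such as "Park" and "park" into one term before any counting is done.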

Here is the df_review_pos after the tokenization and removal of stop words.

In [14]:
df_review_pos

Unnamed: 0,id,rating,text,token
0,Balboa Park,5,Balboa Park is a must see when coming to San D...,"[Balboa, Park, must, see, coming, San, Diego, ..."
1,Balboa Park,5,Beautiful grounds even to take a stroll during...,"[Beautiful, grounds, even, take, stroll, Covid..."
2,Balboa Park,5,Beautiful sightseeing in San Diego. Lots of wa...,"[Beautiful, sightseeing, San, Diego, ., Lots, ..."
3,Civita Park,5,Was invited to child's B-Day party pre Covid-1...,"[Was, invited, child, 's, B-Day, party, pre, C..."
4,Civita Park,5,"Pretty nice park, beautiful design. Anyone is...","[Pretty, nice, park, ,, beautiful, design, ., ..."
6,Waterfront Park,5,"After eating brunch in Little Italy, we decide...","[After, eating, brunch, Little, Italy, ,, deci..."
7,Waterfront Park,5,Definitely worth the price of admission.\n\nIf...,"[Definitely, worth, price, admission, ., If, l..."
8,Waterfront Park,4,Well First I got here and I was quite amazed a...,"[Well, First, I, got, I, quite, amazed, crowde..."
12,Bay View Park,5,I LOVE this spot!!! So beautiful at night!! Th...,"[I, LOVE, spot, !, !, !, So, beautiful, night,..."
13,Bay View Park,5,I took my kids here on a week day before The 4...,"[I, took, kids, week, day, The, 4th, July, ., ..."


Here is the df_review_neg after the tokenization and removal of stop words.

In [15]:
df_review_neg

Unnamed: 0,id,rating,text,token
5,Civita Park,1,We love Civita Park for its wide open spaces t...,"[We, love, Civita, Park, wide, open, spaces, b..."
241,Cowles Mountain,1,Stay away!! \n\nGo take your graffiti and lou...,"[Stay, away, !, !, Go, take, graffiti, loud-a,..."
287,Edward Tyler Cramer Park,1,There is NO PARKING. This is a taxpayer mainta...,"[There, NO, PARKING, ., This, taxpayer, mainta..."
292,Bay Park,2,Bay Park has been a disappointment on the last...,"[Bay, Park, disappointment, last, visits, ., M..."
318,Western Hills Park,2,I almost got a ticket there for having my dog ...,"[I, almost, got, ticket, dog, leash, ., It, 's..."
319,South Clairemont Recreation Center & Community...,1,Stay away from the youth theater program if yo...,"[Stay, away, youth, theater, program, n't, bor..."
327,Ray's Tennis Shop,1,"I've tried, for the better part of 3, years to...","[I, 've, tried, ,, better, part, 3, ,, years, ..."
333,Cadman Park Leash-Free Area,1,This park brought awareness to the danger of o...,"[This, park, brought, awareness, danger, off-l..."
338,Nate's Point Dog Park - Balboa Park,2,Dog park review #4 - I would say 2.5 but Yelp ...,"[Dog, park, review, #, 4, -, I, would, say, 2...."
349,Emerald City Realty,1,We bought a home through Emerald. It was one o...,"[We, bought, home, Emerald, ., It, one, listin..."


Next, I'll reduce the tokenized words to their root forms using <b>stemming</b>.

In [16]:
ps = PorterStemmer()
df_review_pos['token'] = df_review_pos['token'].apply(lambda x: [ps.stem(y) for y in x])
df_review_neg['token'] = df_review_neg['token'].apply(lambda x: [ps.stem(y) for y in x])
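
To make the stemmed output easier to read, here is a short illustration of what the Porter stemmer does to a few individual words. The word list is my own choice for illustration. It shows why truncated forms such as `manag`, `favorit`, and `thi` appear as column names in the TF-IDF tables later on.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Example words chosen for illustration; several distinct words share one stem.
words = ["management", "managing", "favorite", "this"]
stems = {w: ps.stem(w) for w in words}
print(stems)
```

Because a stem like `manag` can come from several different words, it helps to look back at the original review text before attaching a meaning to it.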

I'll identify the tokens which are most unique to df_review_pos and df_review_neg, respectively, using <b>TF-IDF</b>.

But, first, let's create two lists for the positive reviews and the negative reviews with their `token`s, respectively.

In [17]:
# pos_list, neg_list
pos_list = list(df_review_pos['token'].values)
neg_list = list(df_review_neg['token'].values)

Then, I'll <b>remove punctuation</b> from the lists and combine the two lists into one corpus.

In [18]:
for c in string.punctuation:
    pos_list = str(pos_list).replace(c, "")
    neg_list = str(neg_list).replace(c, "")
    
corpus = [str(pos_list), str(neg_list)]
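
The cell above turns each list of token lists into its string representation and deletes punctuation characters from that string. A sketch of a more explicit alternative is to join the tokens with spaces and keep a list of labels in the same order as the documents, so it stays clear which row belongs to which class. The helper `tokens_to_document` and the demo tokens are hypothetical. Unlike the cell above, this version drops punctuation-only tokens and leaves punctuation inside words (for example the apostrophe in "n't") untouched.

```python
import string

def tokens_to_document(token_lists):
    """Join a list of token lists into one space-separated string, skipping punctuation-only tokens."""
    punct = set(string.punctuation)
    return " ".join(
        tok
        for tokens in token_lists
        for tok in tokens
        if not all(ch in punct for ch in tok)
    )

# Hypothetical stemmed tokens for illustration only.
pos_demo = [["beauti", "park", "!"], ["great", "spot", "."]]
neg_demo = [["no", "park", "spot", "!", "!"]]

# labels[i] describes corpus_demo[i]; building both together keeps them aligned.
labels = ["pos", "neg"]
corpus_demo = [tokens_to_document(pos_demo), tokens_to_document(neg_demo)]
print(dict(zip(labels, corpus_demo)))
```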

I'll create a <b>TfidfVectorizer</b> object to transform our text data into vectors.

In [19]:
# create vectorizer
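tfidf = TfidfVectorizer(sublinear_tf=True,
                        analyzer='word',
                        max_features=2000,
                        tokenizer=word_tokenize,
                        # passed as a list because recent scikit-learn releases validate this parameter and reject a set
                        stop_words=list(stop_words))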
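
For reference, and assuming the scikit-learn defaults that the cell above does not override (`smooth_idf=True`, `norm='l2'`), the weight given to a term $t$ that occurs in document $d$ can be written as follows, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain $t$:

$$
w(t, d) = \bigl(1 + \ln \mathrm{tf}(t, d)\bigr)\,\Bigl(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\Bigr), \qquad \mathrm{tf}(t, d) > 0
$$

Each document's vector of weights is then scaled to unit Euclidean length. With the two documents used here ($n = 2$), the second factor takes only two values: $\ln\frac{3}{2} + 1 \approx 1.41$ for a term found in one document and $\ln\frac{3}{3} + 1 = 1$ for a term found in both. Most of the variation in the scores therefore comes from the term counts rather than from the inverse document frequency.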

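Then, I'll calculate TF-IDF for the positive reviews and the negative reviews. The cell below labels row 0 as 'neg' and row 1 as 'pos'. Note, however, that `corpus` was built as `[pos_list, neg_list]`, so row 0 actually holds the positive reviews and row 1 the negative reviews; the 'neg' and 'pos' labels in the outputs that follow are therefore swapped relative to the data.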

In [20]:
# calculate TF-IDF
reviews_tfidf = pd.DataFrame(tfidf.fit_transform(corpus).toarray())
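# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement (scikit-learn >= 1.0)
reviews_tfidf.columns = tfidf.get_feature_names_out()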
reviews_tfidf = reviews_tfidf.rename(index={0:'neg', 1:'pos'})

TF-IDF calculation is done. Here is the output sorted by 'pos'.

In [21]:
reviews_tfidf.sort_values(by='pos', axis=1, ascending=False)

Unnamed: 0,park,nt,thi,place,time,dog,one,manag,use,year,...,hill,hiker,highlight,hidden,held,hedda,heat,heart,heard,zoo
neg,0.05,0.04,0.04,0.04,0.04,0.04,0.04,0.03,0.03,0.04,...,0.04,0.02,0.02,0.04,0.03,0.02,0.02,0.04,0.03,0.03
pos,0.09,0.08,0.08,0.08,0.08,0.08,0.07,0.07,0.07,0.07,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


From the output, I take `manag` (the stem of words such as "management") to be the important word.

Here is the same output sorted by 'neg'.

In [22]:
reviews_tfidf.sort_values(by='neg', axis=1, ascending=False)

Unnamed: 0,park,spot,fun,place,great,favorit,thi,north,town,time,...,cockroach,german,gambl,disgust,lil,jimmi,bombard,video,assur,daytim
neg,0.05,0.05,0.05,0.04,0.04,0.04,0.04,0.04,0.04,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
pos,0.09,0.0,0.0,0.08,0.07,0.0,0.08,0.0,0.0,0.08,...,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04


From this output, I think `spot` is the important word, because I take it to refer to the parking spots in the parks.

From the data analysis, `manag` carries a high weight in the row labeled 'pos'. I interpret this stem as referring to how well the parks are managed and kept clean, which would suggest that people mostly like the cleanliness of San Diego's parks and recreation places, in accordance with my hypothesis. 
Since `spot` carries a high weight in the row labeled 'neg', my reading is that what people dislike about San Diego's parks and recreation places is the lack of parking spots near the parks. Hence, the city of San Diego should extend the parking facilities near the parks and recreation places. 
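
Sorting by a single row puts terms that are common to both classes, such as `park`, at the top of both tables. As a rough sketch of one possible follow-up (not part of the analysis above), the terms can be ranked by the difference between the two rows, which brings out the terms that are distinctive for one class. The function name `distinctive_terms` and all numbers in the demo table are hypothetical.

```python
import pandas as pd

def distinctive_terms(tfidf_df, target, other, n=2):
    """Return the n terms whose weight in row `target` most exceeds their weight in row `other`."""
    diff = tfidf_df.loc[target] - tfidf_df.loc[other]
    return diff.nlargest(n).index.tolist()

# Hypothetical TF-IDF weights for illustration only.
demo = pd.DataFrame(
    {
        "park":  [0.09, 0.09],
        "place": [0.08, 0.08],
        "trail": [0.05, 0.00],
        "shade": [0.04, 0.00],
        "trash": [0.00, 0.07],
        "nois":  [0.00, 0.03],
    },
    index=["pos", "neg"],
)

print(distinctive_terms(demo, "pos", "neg"))
print(distinctive_terms(demo, "neg", "pos"))
```

Shared terms such as `park` get a difference of zero here and drop out of both rankings.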

# Ethics & Privacy

1. Consent issue in using yelp_SD_reviews.csv 

The dataset was collected without notifying the people who shared their reviews on Yelp, which raises a consent issue. To continue using the dataset in my project, I would have to obtain their consent.

2. Privacy issue in using yelp_SD_reviews.csv

The data itself doesn't include information about the Yelp users. However, when I pasted the `text` of a review into the Google search bar, I could easily find out who wrote the review and when it was written. Because of this privacy issue, I would have to notify each Yelp user whose review appears in the dataset and ask whether I can continue using their review in the project.

# Conclusion & Discussion

In conclusion, I investigated what people like and dislike about San Diego's parks and recreation places by cleaning the Yelp data and calculating TF-IDF for the positive and the negative reviews. While analyzing the data, I also found a consent issue and a privacy issue in the dataset: the data was collected without notification, and it can indirectly expose information about the Yelp users.

From the analysis, I found that what people like about San Diego's parks is environment management and what people dislike is the lack of parking spots. The result accords with my hypothesis. To answer the question "What should the city of San Diego do in the next five years?", the city should extend the parking facilities near the parks and recreation places that lack parking spots. 