# COGS 108 - Final Project 

# Overview

In this project I set out to measure how well the text of a park review predicts its score. I started from a dataset of Yelp reviews of places in San Diego and narrowed it down to parks. From this analysis, I found that positive reviews are easier to predict than negative reviews. The data seems to suggest that negative reviews use a wider variety of words than positive reviews.

# Name & GitHub ID

- Name: Connor Isenman
- GitHub Username: Konnur-I

# Research Question

Are positive reviews of parks in San Diego more predictable than negative reviews?

## Background and Prior Work

With 57 recreation centers, 13 aquatic centers, and approximately 260 playgrounds across 8,700 acres of developed parks, the San Diego Parks and Recreation Department is spread thin when deciding where to put money for future expansions and upgrades [1]. Balboa Park alone needs over 300 million dollars in repairs and upkeep, but money is instead being put towards parking projects in the park [2]. This kind of misallocation of funds might be avoided by looking at the relationship between traffic and budgets for improvements.

To allocate funds for parks around the city more efficiently, I believe it would be beneficial to know what traffic is like in each area when setting its improvement budget, so that money can be split between road improvements and facility improvements accordingly. The San Diego Parks and Recreation Department had a total budget of just over 116 million dollars in 2019 [3]. Allocating that money efficiently could save the city thousands of dollars that would otherwise be spent on underutilized facilities or roads because no one analyzed which areas would benefit from which improvements.

References:

- [1] https://www.sandiego.gov/sites/default/files/v3parkandrec_0.pdf
- [2] https://www.sandiegouniontribune.com/opinion/commentary/sdut-utbg-balboa-park-plaza-oppose-2016jul22-htmlstory.html
- [3] https://www.sandiego.gov/sites/default/files/fiscal_year_2020_parks_and_recreation_department_adopted_budget.pdf

# Hypothesis


Positive reviews are more predictable than negative reviews.

# Dataset(s)


- Dataset Name: yelp_SD_reviews.csv
- Link to the dataset: https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/yelp_SD_reviews.csv
- Number of observations: 2333

This dataset is a collection of Yelp reviews of places in San Diego. The analysis uses its id, rating, and text columns.

# Setup

In [8]:


import numpy as np
import pandas as pd
import requests
import io

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Connor\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Connor\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
url1 = 'https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/yelp_SD_reviews.csv'
download1 = requests.get(url1).content


df1 = pd.read_csv(io.StringIO(download1.decode('utf-8')))

vectorizer = CountVectorizer(analyzer='word', max_features=2000, tokenizer=word_tokenize,
                             stop_words=stopwords.words('english'))


In [10]:
df1 = df1[df1['id'].str.contains(' Park')]

In [11]:
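def convert_rating(rating):
    """Map a star rating to 1.0 (positive, 3 or more) or 0.0 (negative, below 3)."""
    if rating >= 3:
        output = 1.0
    elif rating < 3:
        output = 0.0
    else:
        # Reached only when both comparisons are False, e.g. a missing (NaN) rating.
        output = rating

    return output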
    

# Data Cleaning

I kept only the rows whose id contains " Park" (with a leading space and a capital P), which narrows the data from all Yelp reviews in San Diego to reviews of parks. I also converted each rating to a binary label: 1.0 (positive) for ratings of 3 or above and 0.0 (negative) for ratings below 3.
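As a minimal sketch of the two cleaning steps described above, the snippet below applies the same filter and the same threshold to a few toy rows. The rows are invented for illustration and are not taken from yelp_SD_reviews.csv; the names `toy` and `parks` are likewise hypothetical.

```python
import pandas as pd

# Toy rows invented for illustration; the real data comes from yelp_SD_reviews.csv.
toy = pd.DataFrame({
    "id": ["Balboa Park", "Parkside Cafe", "Kate Sessions Park", "Joe's Pizza"],
    "rating": [5, 4, 2, 1],
})

# ' Park' (leading space, capital P) matches names such as "Balboa Park"
# but not "Parkside Cafe", where "Park" starts the string.
parks = toy[toy["id"].str.contains(" Park", regex=False)].copy()

# Ratings of 3 or more become 1.0 (positive); lower ratings become 0.0 (negative).
parks["y"] = (parks["rating"] >= 3).astype(float)

print(parks)
```

Note that a substring filter like this would also keep any non-park business whose name happens to contain " Park", so the narrowed set is best read as an approximation of "parks".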

In [12]:
df1 = df1[df1['id'].str.contains(' Park')]
df1['y'] = df1['rating'].apply(convert_rating)

# Data Analysis & Results

In the first cell I created two arrays: PR_X, built by using the vectorizer to turn the text of each review into word counts, and PR_y, taken from the positive or negative column (y) added during data cleaning.

Then I computed two integers, 20% and 80% of the number of rows: the first 80% of the rows train the SVM, and the remaining 20% test whether its predictions are accurate. Next I split the arrays into training and testing groups at that boundary.
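To illustrate what the vectorizing step produces, here is a small sketch on two made-up sentences. It assumes scikit-learn's default tokenizer and no stop-word list (the notebook's vectorizer uses NLTK's word_tokenize and English stop words instead), and the names `docs` and `toy_vectorizer` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, used only to show the shape of the output.
docs = ["great park great views", "dirty park"]

# Default settings: lowercase the text and count each word of 2+ characters.
toy_vectorizer = CountVectorizer(analyzer="word")
X = toy_vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)
print(X)
```

Each row is one review and each column is one word, so the value in a cell is how many times that word appears in that review. This is the kind of array that PR_X holds, with up to 2000 columns.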

In [13]:
PR_X = vectorizer.fit_transform(df1['text']).toarray()
PR_y = np.array(df1['y'])

num_testing = int(df1.shape[0] * .2)
num_training = int(df1.shape[0] * .8)

PR_train_X = PR_X[:num_training]
PR_train_y = PR_y[:num_training]
PR_test_X = PR_X[num_training:]
PR_test_y = PR_y[num_training:]
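The cell above splits the rows in file order, so the test group is simply the last 20% of the rows. As an alternative sketch, not the method used in this notebook, scikit-learn's train_test_split can shuffle the rows and keep the positive/negative proportion the same in both groups. The arrays below are stand-ins invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data invented for illustration: 20 samples, 5 of them negative (0.0).
X = np.arange(40).reshape(20, 2)
y = np.array([0.0] * 5 + [1.0] * 15)

# stratify=y keeps the share of each class equal in the two groups;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=0
)

print(len(y_train), int((y_train == 0).sum()))
print(len(y_test), int((y_test == 0).sum()))
```

With few negative reviews available, a stratified split is one way to make sure both the training and testing groups contain some of them.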


I defined a function that initializes an SVM classifier (with a linear kernel by default) and trains it. I trained the SVM on the training group, then predicted the training data to check that the SVM fits it, and finally predicted the test data to see how well the SVM works on reviews it was not trained on. The first report below is for the training data and the second is for the test data.

In [14]:
def train_SVM(X, y, kernel='linear'):
    clf = SVC(kernel = kernel)
    return clf.fit(X,y)

PR_clf = train_SVM(PR_train_X, PR_train_y)

PR_predicted_train_y = PR_clf.predict(PR_train_X)
PR_predicted_test_y = PR_clf.predict(PR_test_X)

print(classification_report(PR_train_y,PR_predicted_train_y))

print(classification_report(PR_test_y, PR_predicted_test_y))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        50
         1.0       1.00      1.00      1.00       562

    accuracy                           1.00       612
   macro avg       1.00      1.00      1.00       612
weighted avg       1.00      1.00      1.00       612

              precision    recall  f1-score   support

         0.0       0.50      0.21      0.30        19
         1.0       0.90      0.97      0.93       135

    accuracy                           0.88       154
   macro avg       0.70      0.59      0.61       154
weighted avg       0.85      0.88      0.85       154
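To make the test report easier to read, the sketch below recomputes its figures from raw counts. The supports (19 and 135) are copied from the report; the counts of correctly classified reviews (4 negative, 131 positive) are an assumption inferred from the rounded precision and recall, because the notebook does not print a confusion matrix. The helper `prf` and the always-positive baseline are hypothetical additions for illustration.

```python
# Supports are copied from the test report; the two "correct" counts are inferred
# from its rounded precision and recall, since the notebook does not print them.
neg_support, pos_support = 19, 135
neg_correct, pos_correct = 4, 131

# Reviews predicted as each class: correct ones plus the other class's mistakes.
neg_predicted = neg_correct + (pos_support - pos_correct)
pos_predicted = pos_correct + (neg_support - neg_correct)


def prf(correct, predicted, support):
    """Return (precision, recall, f1) for one class."""
    precision = correct / predicted
    recall = correct / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


neg_scores = prf(neg_correct, neg_predicted, neg_support)
pos_scores = prf(pos_correct, pos_predicted, pos_support)

total = neg_support + pos_support
accuracy = (neg_correct + pos_correct) / total
# Accuracy of a hypothetical baseline that labels every review positive.
baseline_accuracy = pos_support / total

print([round(s, 2) for s in neg_scores])
print([round(s, 2) for s in pos_scores])
print(round(accuracy, 2), round(baseline_accuracy, 2))
```

Under these inferred counts, the SVM and the always-positive baseline each get 135 of the 154 test reviews right, so accuracy alone may not show whether the model has learned to recognize negative reviews. The per-class recall is likely the more informative figure here.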



# Ethics & Privacy

As all the data was collected from a public source and the names of the reviewers have been removed, there are no obvious issues with ethics or privacy. The reviewers chose to make their reviews visible to others, and the data contains no personal information, so this use of the data is ethical and respects their privacy.

# Conclusion & Discussion

Based on my prediction model, positive reviews are more predictable than negative reviews: on the test set the SVM recalled 97% of the positive reviews but only 21% of the negative reviews. This may be because the positive label covers more rating values (3-5) than the negative label (1-2), and the training set contains far more positive examples (562) than negative ones (50). People writing positive reviews also seem more likely to reuse the same generic words like "great" and "amazing", while negative reviews tend to explain why the experience was negative or what could have been better, so the wording of negative reviews varies a lot more.
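One rough way to probe the word-variety explanation is to compare the share of distinct words in each class. The sketch below does this on a few made-up reviews; the function `distinct_word_ratio` and the sample texts are hypothetical and were not run on the project data.

```python
def distinct_word_ratio(texts):
    """Return the number of distinct words divided by the total word count."""
    words = [word for text in texts for word in text.lower().split()]
    return len(set(words)) / len(words)


# Made-up reviews for illustration only; not drawn from the dataset.
positive = ["great park", "great views", "amazing park"]
negative = ["trash everywhere", "parking was full", "restrooms closed"]

pos_ratio = distinct_word_ratio(positive)
neg_ratio = distinct_word_ratio(negative)

print(round(pos_ratio, 2), round(neg_ratio, 2))
```

A higher ratio means fewer repeated words. Because this ratio tends to fall as the amount of text grows, a fair comparison on the real data would need to use the same number of words from each class.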