# COGS 108 - Final Project 

# Overview

In this project, I examine whether dog parks in San Diego have higher ratings on the website 'Yelp' than other types of parks. Using a dataset of Yelp reviews, I compare ratings between general parks, dog parks, and parks where dogs are allowed, using t-tests and simple and multiple linear regression models (Ordinary Least Squares). If dog parks do score a higher rating on average, the City of San Diego could invest in dog parks to improve residents' attitudes toward, and involvement in, outdoor activities.

# Name & GitHub

- Name: Bomed Pham
- GitHub Username: BPham151

# Research Question

Do dog parks in San Diego on average score a higher rating (on a scale from one to five, five being the highest) than other general types of parks on the website 'Yelp'? If not, do parks where dogs are allowed score a higher rating on 'Yelp'?

## Background and Prior Work

One way to make walks in parks more enjoyable is to bring a dog along. San Diego online news articles suggest that dog parks are highly regarded as good places to relax for both the walker and the dog.

For example, the Fiesta Island Off Leash Dog Park is well regarded as a dog park, based on a local article and the park's 'Yelp' reviews.

References:
- 1) https://www.nbcsandiego.com/news/local/clear-the-shelters-2019-dog-parks-top-five-balboa-park-coronado-animals-pets-adopt/129799/
- 2) https://www.nbcsandiego.com/news/local/city-council-approves-dog-friendly-makeover-for-fiesta-island/79566/
- 3) https://www.yelp.com/biz/fiesta-island-off-leash-dog-park-san-diego

# Hypothesis


I hypothesize that designated dog parks have higher ratings than other parks in San Diego on the website 'Yelp'. Higher ratings could reflect benefits to the city of San Diego in the form of community involvement and happiness, so improving dog parks could increase both ratings and those benefits.

# Dataset(s)

- Dataset Name: yelp_SD_reviews.csv
- Links to the dataset:
  - https://www.yelp.com/developers/documentation/v3/business_reviews
  - https://github.com/COGS108/individual_fa20/blob/master/data/yelp_SD_reviews.csv
- Number of observations: 2333 (shape: 2333 rows, 3 columns)

The provided dataset has three columns: id (the name of the business or place in San Diego), rating (the rating given in a single review on 'Yelp'), and text (the text of that review). Each row is therefore one review, not one place.

# Setup

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind

In [2]:
df = pd.read_csv('yelp_SD_reviews.csv', names=['park', 'rating', 'review'], header = 0)

# Data Cleaning

In [3]:
def has_park(str_in):
    """Return True if the string contains 'park' (case-insensitive)."""
    # A substring test already yields a boolean, so no if/else is needed.
    return 'park' in str_in.lower()

In [4]:
def has_dog(str_in):
    """Return True if the string contains 'dog' (case-insensitive)."""
    # Note: this also matches words that contain 'dog', such as 'hot dog'.
    return 'dog' in str_in.lower()

In [5]:
def dogs_allowed(s):
    """Return True if the row is a dog park review or its text mentions a dog."""
    return bool(s['is_dog_park'] or s['mentions_dog'])

These three functions allow us to clean up the dataset to only use what is relevant to the question. 

- The 'has_park' function determines if a string contains "park". It is used to decide whether the place is a park.
- The 'has_dog' function determines if a string contains "dog". It is used to decide whether the park is a dog park (applied to the name) or whether the review mentions "dog" (applied to the review text).
- The 'dogs_allowed' function takes one row of the dataframe and returns True if that row is for a dog park or its review mentions a dog.
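As a rough illustration of how these checks behave, the sketch below applies the same case-insensitive substring tests to a small made-up table, using pandas string methods in place of the helper functions. The names and review texts are invented for the example and are not rows from the Yelp dataset. It also shows a limitation of plain substring matching: "Parking" counts as a park and "hot dogs" counts as a dog mention.

```python
import pandas as pd

# Toy rows (hypothetical, not taken from yelp_SD_reviews.csv)
toy = pd.DataFrame({
    'park': ['Example Park', 'Example Dog Park', 'Ace Parking Lot', 'Taco Stand'],
    'review': ['Lovely gardens.', 'My dog loved it!', 'Fine.', 'Great hot dogs.'],
})

# Case-insensitive substring checks, mirroring has_park / has_dog
toy['is_park'] = toy['park'].str.lower().str.contains('park', regex=False)
toy['is_dog_park'] = toy['park'].str.lower().str.contains('dog', regex=False)
toy['mentions_dog'] = toy['review'].str.lower().str.contains('dog', regex=False)

# Same rule as dogs_allowed: dog park OR review mentions a dog
toy['dogs_allowed'] = toy['is_dog_park'] | toy['mentions_dog']

print(toy[['park', 'is_park', 'is_dog_park', 'mentions_dog', 'dogs_allowed']])
```

Because of these false positives, the flags are best read as rough proxies rather than ground truth about a place.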

In [6]:
df['is_park'] = df['park'].apply(has_park)

In [7]:
sd_parks = df[df['is_park'] == True]
sd_parks = sd_parks.drop(columns = ['is_park'])

sd_parks['is_dog_park'] = sd_parks['park'].apply(has_dog)
sd_parks['mentions_dog'] = sd_parks['review'].apply(has_dog)
sd_parks['dogs_allowed'] = sd_parks.apply(dogs_allowed, axis = 1)

Create a new dataframe, 'sd_parks', containing only the reviews of parks, with the new columns 'is_dog_park', 'mentions_dog', and 'dogs_allowed' added and the column 'is_park' removed.

- 'is_dog_park' = Is this park a dog park, based on whether "dog" appears in the name of the place?
- 'mentions_dog' = Does this review of the place mention a dog(s)?
- 'dogs_allowed' = Does this place appear to allow dogs, based on whether it is a dog park or the review mentions a dog?

# Data Analysis & Results


In [8]:
dogs_only = sd_parks[(sd_parks['is_dog_park'] == True)]
allows_dogs = sd_parks[(sd_parks['mentions_dog'] == True) | (sd_parks['is_dog_park'] == True)]
no_dogs = sd_parks[(sd_parks['mentions_dog'] == False) & (sd_parks['is_dog_park'] == False)]

Create three new dataframes to be used for data analysis.

- 'dogs_only' is a DataFrame containing all the reviews of places that are considered a dog park.
- 'allows_dogs' is a DataFrame containing all the reviews of places that are considered a dog park or where the review mentions a dog.
- 'no_dogs' is a DataFrame containing all the reviews of places that are not considered a dog park and where the review does not mention a dog.
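The relationship between these three subsets can be sanity-checked. The sketch below uses a tiny made-up table (the ratings and flags are hypothetical stand-ins for 'sd_parks') to confirm that 'allows_dogs' and 'no_dogs' split the rows with no overlap, and that every 'dogs_only' row is also in 'allows_dogs'.

```python
import pandas as pd

# Toy flags (hypothetical values) standing in for sd_parks
sd_toy = pd.DataFrame({
    'rating': [5, 4, 3, 2, 5],
    'is_dog_park': [True, False, False, True, False],
    'mentions_dog': [True, True, False, False, False],
})

# Same filters as in the analysis, written with boolean masks
dogs_only = sd_toy[sd_toy['is_dog_park']]
allows_dogs = sd_toy[sd_toy['mentions_dog'] | sd_toy['is_dog_park']]
no_dogs = sd_toy[~sd_toy['mentions_dog'] & ~sd_toy['is_dog_park']]

# allows_dogs and no_dogs together account for every row exactly once
assert len(allows_dogs) + len(no_dogs) == len(sd_toy)
assert not allows_dogs.index.isin(no_dogs.index).any()

# dogs_only is a subset of allows_dogs
assert dogs_only.index.isin(allows_dogs.index).all()

print(len(dogs_only), len(allows_dogs), len(no_dogs))
```

Since 'dogs_only' is nested inside 'allows_dogs', the two comparisons against 'no_dogs' are not independent of each other.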


In [9]:
y_dog_1 = dogs_only['rating']
y_dog_2 = allows_dogs['rating']
n_dog = no_dogs['rating']

avg_y_dog_1 = y_dog_1.mean()
avg_y_dog_2 = y_dog_2.mean()
avg_n_dog = n_dog.mean()

print('Average rating of dog park reviews is: \t\t\t {:2.2f}'.format(avg_y_dog_1))
print('Average rating of dogs-allowed park reviews is: \t {:2.2f}'.format(avg_y_dog_2))
print('Average rating of park reviews is: \t\t\t {:2.2f}'.format(avg_n_dog))

Average rating of dog park reviews is: 			 3.83
Average rating of dogs-allowed park reviews is: 	 3.90
Average rating of park reviews is: 			 4.14


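Calculate the mean rating of reviews of dog parks, of parks where dogs are allowed (dog parks or reviews mentioning a dog), and of the rest of the parks. Both dog-related groups have a lower mean rating (3.83 and 3.90) than the remaining parks (4.14).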

In [11]:
t_val_1, p_val_1 = stats.ttest_ind(y_dog_1, n_dog)
p_val_1

0.04696576726470643

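Run a two-sample t-test comparing the ratings of dog park reviews with the ratings of park reviews that neither belong to a dog park nor mention a dog. The p-value (about 0.047) is less than 0.05, so the difference is considered significant; note that the dog park mean (3.83) is the lower of the two.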

In [12]:
t_val_2, p_val_2 = stats.ttest_ind(y_dog_2, n_dog)
p_val_2

0.034890754855247404

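Run the same t-test for reviews of parks where dogs are allowed compared to the other parks. The p-value (about 0.035) is less than 0.05, so the difference is considered significant; again, the dogs-allowed mean (3.90) is lower than the mean of the other parks (4.14).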
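By default, `stats.ttest_ind` pools the variances of the two groups, which assumes they are equal. Since the groups here differ in size, one possible robustness check is Welch's t-test, requested with `equal_var=False`. The sketch below uses made-up ratings, not the Yelp data, purely to show the two calls side by side.

```python
from scipy import stats

# Hypothetical ratings for two groups of unequal size
group_a = [5, 4, 4, 3, 5, 2, 4]
group_b = [5, 5, 4, 5, 4, 5, 3, 5, 4, 5, 5, 4]

# Default: Student's t-test, which pools the two variances
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print('pooled p-value: {:.4f}'.format(p_pooled))
print('Welch  p-value: {:.4f}'.format(p_welch))
```

If the two p-values lead to the same decision at the 0.05 level, the conclusion does not hinge on the equal-variance assumption.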

In [13]:
outcome_1, predictors_1 = patsy.dmatrices('rating ~ is_dog_park', sd_parks)
mod_1 = sm.OLS(outcome_1, predictors_1)

res_1 = mod_1.fit()

In [14]:
print(res_1.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     3.577
Date:                Wed, 16 Dec 2020   Prob (F-statistic):             0.0590
Time:                        09:08:32   Log-Likelihood:                -1214.4
No. Observations:                 781   AIC:                             2433.
Df Residuals:                     779   BIC:                             2442.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               4.1248    

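First linear model using Ordinary Least Squares, predicting a review's rating from whether it is for a dog park. Being a dog park does not have a significant effect on the reviews' ratings (the P-value of 0.059 was not less than 0.05). Unlike the first t-test, this model compares dog park reviews with all other park reviews, including those that mention a dog, which is why its P-value differs.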

In [15]:
outcome_2, predictors_2 = patsy.dmatrices('rating ~ dogs_allowed', sd_parks)
mod_2 = sm.OLS(outcome_2, predictors_2)

res_2 = mod_2.fit()

In [16]:
print(res_2.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     4.466
Date:                Wed, 16 Dec 2020   Prob (F-statistic):             0.0349
Time:                        09:08:35   Log-Likelihood:                -1214.0
No. Observations:                 781   AIC:                             2432.
Df Residuals:                     779   BIC:                             2441.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                4.1405 

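Second linear model using Ordinary Least Squares, showing that whether dogs are allowed/mentioned is significantly associated with the reviews' ratings (the P-value of 0.0349 was less than 0.05). Based on the group means, the association is negative: dogs-allowed reviews average 3.90 compared to 4.14 for the rest.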
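The P-value of this model matches 'p_val_2' from the t-test, which is expected: a linear regression on a single True/False predictor is equivalent to a pooled-variance two-sample t-test between the two groups. The sketch below illustrates this on made-up ratings, using `scipy.stats.linregress` as a lightweight stand-in for the statsmodels fit (an assumption for brevity, not the method used in this notebook).

```python
import numpy as np
from scipy import stats

# Hypothetical ratings; flag = 1 for "dogs allowed", 0 otherwise
rating = np.array([5, 4, 4, 3, 5, 4, 3, 4, 2, 5, 3, 3], dtype=float)
flag = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)

# Pooled-variance two-sample t-test between the two groups
t_stat, p_ttest = stats.ttest_ind(rating[flag == 1], rating[flag == 0])

# Simple linear regression of rating on the 0/1 flag
fit = stats.linregress(flag, rating)

# The two p-values agree up to floating-point error
print(p_ttest, fit.pvalue)

# The slope equals the difference in group means (flag 1 minus flag 0)
print(fit.slope, rating[flag == 1].mean() - rating[flag == 0].mean())
```

In other words, the single-predictor regression adds no new evidence beyond the t-test; it restates the same comparison as a slope.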

In [17]:
outcome_3, predictors_3 = patsy.dmatrices('rating ~ is_dog_park + dogs_allowed', sd_parks)
mod_3 = sm.OLS(outcome_3, predictors_3)

res_3 = mod_3.fit()

In [18]:
print(res_3.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     2.454
Date:                Wed, 16 Dec 2020   Prob (F-statistic):             0.0866
Time:                        09:08:37   Log-Likelihood:                -1213.7
No. Observations:                 781   AIC:                             2433.
Df Residuals:                     778   BIC:                             2447.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                4.1405 

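Multiple linear regression model using Ordinary Least Squares to predict the reviews' ratings from both predictors: dog parks and parks where dogs are allowed. The overall model is not significant (Prob (F-statistic) of 0.0866 was not less than 0.05), and dog parks do not have a significant effect on the reviews' ratings.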
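When reading this model, it helps to remember that the two predictors are nested: by construction, 'dogs_allowed' is True whenever 'is_dog_park' is True. A cross-tabulation makes this visible. The sketch below uses made-up flags (hypothetical, not the Yelp data) to show that the combination "dog park but dogs not allowed" never occurs.

```python
import pandas as pd

# Hypothetical flags for six reviews
toy = pd.DataFrame({
    'is_dog_park': [True, True, False, False, False, False],
    'mentions_dog': [True, False, True, False, False, True],
})
toy['dogs_allowed'] = toy['is_dog_park'] | toy['mentions_dog']

# Counts of each combination of the two predictors
counts = pd.crosstab(toy['is_dog_park'], toy['dogs_allowed'])
print(counts)

# Rows that are a dog park but not flagged as dogs-allowed: always zero
n_impossible = int((toy['is_dog_park'] & ~toy['dogs_allowed']).sum())
print(n_impossible)
```

Under this nesting, the coefficient on 'is_dog_park' compares dog park reviews with other dogs-allowed reviews, while the coefficient on 'dogs_allowed' compares dog-mentioning reviews of non-dog parks with reviews that involve no dogs at all.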

# Ethics & Privacy

In this data analysis, the dataset consists only of the name of the park, the rating, and the text of the review. The reviewer's personal information, such as their name, is not included in the data. In this regard, the privacy of the users is not a major concern.

One problem is that the dataset comes from a single review website, which could introduce bias and thus make the analysis biased.

Another potential source of bias is the reviewers' own thoughts, emotions, and experiences at a given place, which may skew the scores they provided. Because only one dataset is used, this has to be taken into account when viewing the results.

# Conclusion & Discussion

From the data analysis using simple and multiple linear regression, the hypothesis is not supported: the P-values for the dog park predictor are not less than 0.05, so I fail to reject the null hypothesis. This indicates that a park being a dog park is not associated with higher ratings on the website 'Yelp'; in fact, the average rating of dog park reviews (3.83) was lower than that of other park reviews (4.14). Based on this, it would not be recommended to use the analysis as evidence for improving dog parks.

One large limitation of this project is that only one dataset was used for the analysis. This brings bias from relying on a single source of data and from the data itself (people will be biased in their reviews based on many different factors).

With that being said, there may be an association between ratings and whether dogs are allowed in parks, as that analysis yielded a P-value less than 0.05, with dogs-allowed reviews averaging a lower rating (3.90) than the rest (4.14). More research can be conducted to see if this is the case.