# COGS 108 - Final Project 

# Overview

*Fill in your overview here*

# Name & GitHub ID

- Name: Hector Penado Jr
- GitHub Username: Hector7179

# Research Question

Is there a relationship between the size of a San Diego park (in acres) and its corresponding rating on Yelp?

## Background and Prior Work

After living in San Diego for the better part of 3 years, I have been to more than a handful of small neighborhood parks around the area I live in and only one large park (Balboa Park). When I went to Balboa Park, I distinctly remember how many people were walking around or engaging in some activity. It was busier than most parks I have been to, and I associated that with the idea that this park must be well liked. Conversely, whenever I went to the smaller parks around my area, I noticed that they tended to be empty, both in the activities one could do and in the number of people. This got me wondering whether smaller parks are less liked than their bigger counterparts.

From the background work that I have gathered, it appears to me that bigger parks do tend to be more popular than smaller parks because there is often more to do in them.

USA Today compiled a list of the best parks in San Diego. All of the parks mentioned in the list were relatively large. These parks had large spaces for activities, gardens, close proximity to the beach, hiking trails, and so on. Small parks were not mentioned at all in the list. This reinforces my hypothesis that large parks are more liked than smaller ones.

An article by Andrew Price titled "Grand Parks vs. Neighborhood" distinguishes between "Grand Parks" and "Neighborhood Parks," which gave me ideas as to why larger parks may be more liked than smaller ones. He points out that to many people, neighborhood parks may seem irrelevant because they don't offer anything people couldn't already do at home, whereas grand parks offer opportunities for new experiences and excitement that neighborhood parks can't provide. I believe that because larger parks tend to offer more excitement than smaller parks, they will receive better ratings on Yelp: people have a more fun experience in a larger park than the more mundane experience they could expect from a smaller one.

References:
- 1) https://www.10best.com/destinations/california/san-diego/attractions/parks/
- 2) https://www.strongtowns.org/journal/2015/9/15/parks

# Hypothesis


The background knowledge I have gained in preparation for this topic leads me to believe that greater acreage allows for more activities that people can engage in, which in turn leads to a better experience. That is why I believe that an increase in the size of a park will lead to an increase in people's Yelp ratings of the park.

# Dataset(s)

The ideal dataset for this question would provide detailed information on the different parks of San Diego. The primary information I would be looking for is the name of each park, its size, and some sort of rating scale I could use as an indication of how people feel about the park. I would also want this information across multiple years to see whether a trend is present in the relationship between park size and people's ratings. I think it would be best to store the observations in a dataframe in which the columns are the variables I am interested in exploring and each row is one San Diego park.

Dataset 1:
- Dataset: yelp_SD_parks.csv
- Link: I was provided with this dataset by the instructor.
- Number of observations: 833

This dataset provides Yelp information on multiple parks in San Diego. The fields that matter for this project are the park name and the user rating.

Dataset 2:
- Dataset: parks_datasd.geojson
- Link: I was provided with this dataset by the instructor.
- Number of observations: 2769

This dataset provides location data for parks in San Diego. The fields that matter for this project are the park alias and gis_acres.

Plan for combining datasets: Since I am using two datasets, I plan on merging them based on park names.

# Setup

In [1]:

#import libraries needed for project
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl

import seaborn as sns

import patsy
import statsmodels.api as sm

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

from scipy.stats import skewnorm
import geopandas as gpd

# Data Cleaning

### creating data frames

In [2]:
# placing datasets into dataframes
df_1 = pd.read_csv('https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/yelp_SD_parks.csv')
df_2 = gpd.read_file('https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/parks_datasd.geojson')


### Cleaning park names for merging preparation

The goal is to merge both datasets on park name. For df_1 I will clean the 'name' column, and for df_2 I will clean the 'alias' column. I won't use the 'name' column of df_2 because I found that its 'alias' column more often writes park names the way df_1 does.

In [3]:
# Standardize park names in both datasets so they are more likely to match when merging.
def standardizer(string):
    # Make the input all lowercase
    new = string.lower()

    # Remove leading and trailing whitespace
    new = new.strip()
    # Remove the substring 'park' so names match whether or not they include it
    new = new.replace('park', '')
    # Strip again in case removing 'park' left whitespace at either end
    new = new.strip()
    return new


In [4]:
#applying function to both dataframes
df_1['name'] = df_1['name'].apply(standardizer)
df_2['alias'] = df_2['alias'].apply(standardizer)
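
One thing to keep in mind about the cleaning step above: `standardizer` removes the letters "park" wherever they occur (so a name such as "parkside" would become "side"), and it raises an error on a missing name. The sketch below is a possible more defensive variant, not the function used in this notebook. The name `standardize_name` and the example strings are hypothetical and chosen only for illustration.

```python
import re

def standardize_name(name):
    """Hypothetical, more defensive variant of `standardizer` (illustration only)."""
    # Missing or non-string names are passed through as None instead of raising
    if not isinstance(name, str):
        return None
    cleaned = name.lower()
    # Remove 'park' only when it is a whole word, so 'parkside' is left intact
    cleaned = re.sub(r"\bpark\b", " ", cleaned)
    # Collapse runs of whitespace and trim both ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned

# Made-up example names, not rows from either dataset
examples = ["  Balboa Park ", "Trolley Barn Park", "Parkside Park", None]
cleaned_examples = [standardize_name(n) for n in examples]
print(cleaned_examples)  # ['balboa', 'trolley barn', 'parkside', None]
```

Whether the stricter whole-word rule matches more or fewer parks than the original function would need to be checked against the real name columns.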

### Getting rid of unwanted columns

I do not need all of the columns from both datasets to conduct my research. Therefore, I will drop the columns I do not require.

In [5]:
df_1 = df_1[['name','rating']]
df_2 = df_2[['alias','gis_acres']]

### Merging

Now my datasets are almost ready to merge. I am just going to change the column name of the 'name' and 'alias' columns of df_1 and df_2, respectively, to 'park' because I want to use those columns to merge on.

In [6]:
df_1.rename(columns={'name':'park'}, inplace=True)
df_2.rename(columns={'alias':'park'}, inplace=True)

In [7]:
df_1.head()

| | park | rating |
|---|---|---|
| 0 | balboa | 5.0 |
| 1 | civita | 4.5 |
| 2 | waterfront | 4.5 |
| 3 | trolley barn | 4.5 |
| 4 | bay view | 5.0 |


In [8]:
df_2.head()

| | park | gis_acres |
|---|---|---|
| 0 | south carlsbad state beach | 115.895878 |
| 1 | torrey pines state beach | 67.294309 |
| 2 | ruocco | 3.312526 |
| 3 | tuna harbor | 0.639035 |
| 4 | san diego bayfront | 3.669272 |


In [9]:
# merging both datasets to variable park_df
park_df = pd.merge(df_1,df_2, on = 'park')
park_df

| | park | rating | gis_acres |
|---|---|---|---|
| 0 | balboa | 5.0 | 1089.476460 |
| 1 | waterfront | 4.5 | 12.693865 |
| 2 | centrum | 3.5 | 2.162943 |
| 3 | presidio | 4.5 | 61.265073 |
| 4 | olive grove | 4.0 | 9.176157 |
| ... | ... | ... | ... |
| 230 | carmel grove | 3.5 | 2.791804 |
| 231 | ashley falls | 4.5 | 11.660442 |
| 232 | cuvier | 4.5 | 0.610523 |
| 233 | saratoga | 5.0 | 1.251101 |


It is important to note that I lost quite a few observations from the original datasets during the merge. This is because the two datasets often write park names differently, or because one dataset does not include a park that the other does.
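
One way to see how many names failed to match is an outer merge with pandas' `indicator=True` option, which labels each row as coming from the left frame only, the right frame only, or both. The sketch below runs on two small made-up frames (`yelp_toy` and `city_toy` are hypothetical stand-ins for df_1 and df_2, and their values are illustrative only). It also flags names that appear more than once, since a repeated name in either frame produces extra rows in the merged result.

```python
import pandas as pd

# Made-up stand-ins for df_1 and df_2 (illustrative values only)
yelp_toy = pd.DataFrame({
    "park": ["balboa", "civita", "bay view"],
    "rating": [5.0, 4.5, 5.0],
})
city_toy = pd.DataFrame({
    "park": ["balboa", "ruocco", "bay view", "bay view"],
    "gis_acres": [1089.5, 3.3, 1.2, 0.8],
})

# An outer merge keeps unmatched rows; the '_merge' column records where each row came from
audit = pd.merge(yelp_toy, city_toy, on="park", how="outer", indicator=True)
match_counts = audit["_merge"].value_counts()
print(match_counts)

# Names that occur more than once in the right-hand frame
dup_names = city_toy.loc[city_toy["park"].duplicated(keep=False), "park"].unique()
print(dup_names)
```

Applied to the real frames, the `left_only` and `right_only` counts would show how many Yelp parks and city parks were dropped by the inner merge, and the duplicate check would show whether any park contributes more than one row.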

# Data Analysis & Results

Describing park_df gives me a quick summary of my data, which could be useful as I continue my analysis.

In [11]:
park_df.describe()

| | rating | gis_acres |
|---|---|---|
| count | 235.0 | 235.0 |
| mean | 3.974468 | 79.474211 |
| std | 0.61271 | 348.187215 |
| min | 1.0 | 0.04072 |
| 25% | 4.0 | 2.719932 |
| 50% | 4.0 | 8.024891 |
| 75% | 4.5 | 19.168185 |
| max | 5.0 | 4108.39715 |
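
The summary shows that gis_acres is heavily right-skewed (the mean is about 79 acres while the median is about 8), so a possible next step is to look at acreage on a log scale and to use a rank-based (Spearman) correlation, which is less sensitive to a few very large parks. The sketch below is one way this could be done, not an analysis this notebook has run. The helper `size_rating_summary` and the toy values are hypothetical, and it assumes every acreage is positive so the logarithm is defined.

```python
import numpy as np
import pandas as pd

def size_rating_summary(df, size_col="gis_acres", rating_col="rating"):
    """Hypothetical helper: add log10 acreage and compute a Spearman correlation."""
    out = df[[size_col, rating_col]].dropna().copy()
    # Log scale compresses the long right tail (assumes all sizes are > 0)
    out["log10_acres"] = np.log10(out[size_col])
    # Spearman correlation is the Pearson correlation of the ranks
    rho = out[size_col].rank().corr(out[rating_col].rank())
    return out, rho

# Made-up data for illustration only, not rows from park_df
toy = pd.DataFrame({
    "gis_acres": [0.5, 2.0, 8.0, 20.0, 100.0, 1000.0],
    "rating": [3.5, 4.0, 4.0, 4.5, 4.5, 5.0],
})
summary, rho = size_rating_summary(toy)
print(summary)
print(rho)
```

On the real data the call would be `size_rating_summary(park_df)`. The toy frame is constructed so that rating rises with size, so its correlation says nothing about the actual parks.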


# Ethics & Privacy

Because parks are public, information about them is also publicly available. The biggest privacy concern I currently see would arise if my study at some point needed personal information about the people visiting the parks. In that case, I would obtain permission to use that information and protect participants' privacy by keeping their data anonymous.
A potential bias that I see involves the location of the parks. San Diego has parks in both wealthier and poorer areas of the city, which could bias my study if I am not careful.
To deal with these potential issues, I will first need to ensure that all the data I gather is available for me to use. To ensure this, I will do my due diligence in researching what information about San Diego parks I can and can't use, and I will obtain informed consent if needed. I will also need to maintain the privacy of my data by keeping any personal information anonymous. For the biases my project may face, I will try to identify as many confounding variables as possible that could lead my study to produce biased results.

# Conclusion & Discussion

*Fill in your discussion information here*