# Notebook 2: All Parks
Having established our method for MacArthur Park, the next step was to apply the same process to 12 parks across LA. We selected these parks after a quick Flickr search to check that each had a variety of photos covering a variety of topics.

This notebook is broken into three sections:
1. Inspect & Explore the Flickr API
2. Data Wrangling
3. Visualizing Cultural Ecosystem Services

In [2]:
#The first step was to once again call the API function

import flickrapi
import json

api_key = u'f950122b83b682c546201f10d33edffe'
api_secret = u'057c65cd7fe1b2c8'

#flickr = flickrapi.FlickrAPI(api_key, api_secret)
#for json format
flickr = flickrapi.FlickrAPI(api_key, api_secret, format='parsed-json')

This time we wanted to pull data from all 12 pre-selected parks. Here, we searched by tags identifying each park, then saved the results into a dataframe.

In [3]:
import pandas as pd

def get_parks(num_pages):
    park_list = []
    for i in range(1, num_pages+1): #Flickr pages are numbered from 1 and range() excludes its stop value, so +1 includes the last page we feed the function below
        extras = 'geo,description,tags'
        tags = ['MacArthur Park, Woodley Avenue Park, Rio de Los Angeles State Park, Runyon Canyon, Temescal Gateway, Heidelberg Park, Hancock Park, Franklin Canyon Park, Angels Gate, Coldwater Canyon, Chatsworth Park South, Cheviot Hills, O''Melveny Park']
        parks_LA = flickr.photos.search(tags=tags, bbox = '-118.898278,33.704902,-118.161021,34.32848',
                                        method_name='flickr', page=i, per_page=500, extras=extras)  
        
        #.extend combines each page of search results together: https://www.programiz.com/python-programming/methods/list/extend
        park_list.extend(parks_LA['photos']['photo']) #pulls data from each individual photo 
        
    #reorients the data and converts to pandas df: https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe
    df = pd.DataFrame.from_dict(park_list, orient='columns') 
    
    return df

parks_data = get_parks(12) #pulls data from all 12 pages of photos
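
The loop above requests a fixed number of pages. As a hedged alternative, the sketch below reads the page count from the response instead. `fake_search` and `collect_all_pages` are hypothetical names, and `fake_search` is only an offline stand-in for `flickr.photos.search`; we assume the parsed-JSON response reports `pages` next to the `photo` list.

```python
def fake_search(page, per_page=2):
    """Hypothetical stand-in for flickr.photos.search (no network access)."""
    photos = [{'id': str(n), 'tags': 'macarthurpark'} for n in range(5)]
    start = (page - 1) * per_page
    total_pages = -(-len(photos) // per_page)  # ceiling division
    return {'photos': {'page': page, 'pages': total_pages,
                       'photo': photos[start:start + per_page]}}

def collect_all_pages(search):
    """Request page 1, then continue until the reported last page."""
    results, page = [], 1
    while True:
        response = search(page=page)['photos']
        results.extend(response['photo'])
        if page >= response['pages']:
            break
        page += 1
    return results

all_photos = collect_all_pages(fake_search)
print(len(all_photos))
```

Reading the page count from the response avoids requesting pages past the end of the results if the number of matching photos changes.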

In [4]:
#checking the dataframe 
parks_data

Unnamed: 0,id,owner,secret,server,farm,title,ispublic,isfriend,isfamily,description,...,latitude,longitude,accuracy,context,place_id,woeid,geo_is_public,geo_is_contact,geo_is_friend,geo_is_family
0,50952829406,77318907@N08,6fe5b8ef3a,65535,66,Angel's Gate Cloudscape,1,0,0,"{'_content': 'San Pedro, CA 01-02-21'}",...,33.721867,-118.271759,16,0,,5392528,1,0,0,0
1,50952896937,77318907@N08,ebe224b984,65535,66,Harbor Entrance at Sunrise,1,0,0,"{'_content': 'The Angel's Gate in San Pedro, C...",...,33.721707,-118.271791,16,0,,5392528,1,0,0,0
2,50292619641,66115413@N07,11c1a5bb33,65535,66,Never come here at night,1,0,0,{'_content': 'La Brea Tar Pits Los Angeles'},...,34.063858,-118.357064,15,0,,5381273,1,0,0,0
3,50220027608,66115413@N07,a917301efa,65535,66,Quarantined Sloths,1,0,0,{'_content': 'La Brea Tar Pits / Hancok Park'},...,34.064481,-118.357150,16,0,,5381273,1,0,0,0
4,49861807202,22316914@N06,ccec045e20,65535,66,Subway,1,0,0,{'_content': 'Westlake / MacArthur Park metro ...,...,34.055000,-118.274167,16,0,,8062690,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2745,168615083,53611868@N00,5da8b48efc,67,1,oh! the clouds,1,0,0,"{'_content': 'Los Angeles, 1996 <a target=""_b...",...,34.106989,-118.348996,15,0,ViHtwh1TWr3akxeLCg,28751511,1,0,0,0
2746,163958185,63348143@N00,222f36be46,73,1,slippery,1,0,0,{'_content': ''},...,34.081597,-118.327217,16,0,efoucK5TWr0U1ThAlQ,28751566,1,0,0,0
2747,161398070,63348143@N00,c3cdd3f644,50,1,flea market,1,0,0,{'_content': ''},...,34.081597,-118.327217,16,0,efoucK5TWr0U1ThAlQ,28751566,1,0,0,0
2748,140184967,35946983@N00,fed4bef5ee,50,1,MacArthur Park,1,0,0,"{'_content': 'Kent, Darryl, and Jan Dubois, De...",...,34.058623,-118.277918,15,0,c9d0dwRTUb8v0hz88w,23511979,1,0,0,0


The next step was to create a new column and assign the park name to that column so that we could more easily track tags as they relate to specific parks.

In [5]:
# Create a new column and assign the park name 
park_names = ['macarthur', 'woodley', 'riodelosangeles', 'runyoncanyon', 'temescalgateway', 'heidelbergpark', 'hancockpark', 'franklincanyonpark', 'angelsgate', 'coldwatercanyon', 'chatsworthparksouth','cheviothills']

def get_park_name(row):
    for park in park_names:
        if park in row['tags']:
            return park
    return 'Unknown'

parks_data['parkname'] = parks_data.apply(get_park_name, axis=1)
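
To make the matching rule concrete, here is a small sketch on made-up tag strings. `first_matching_park` is a hypothetical helper that mirrors the substring test in `get_park_name`; the shortened park list and the example strings are assumptions for illustration only.

```python
park_names = ['macarthur', 'runyoncanyon', 'hancockpark']

def first_matching_park(tag_string, names=park_names):
    """Return the first park name found as a substring of the tag string."""
    for name in names:
        if name in tag_string:
            return name
    return 'Unknown'

examples = [
    'westlake macarthurpark lake',      # 'macarthur' is inside 'macarthurpark'
    'hancockpark macarthurpark lacma',  # two parks tagged: list order decides
    'morning dog wet corner',           # no park tag at all
]
labels = [first_matching_park(t) for t in examples]
print(labels)
```

Two things follow from this rule: a photo tagged with more than one park is assigned to whichever park comes first in the list, and a photo whose tags contain none of the names falls into 'Unknown'.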

In [1]:
#checking our work
parks_data

NameError: name 'parks_data' is not defined

In [6]:
#save to a CSV file so we do not have to keep pulling from Flickr

parks_data.to_csv('parks_data.csv', index=False)  

In [None]:
import pandas as pd

#Read the .csv file and output as a pandas dataframe
parksDf = pd.read_csv('/Users/ellietroxell/Documents/GitHub - UP229.nosync/Final Project/LaParks_NLP/parks_data.csv')

# Confirm that the .csv was read properly
# We can confirm that there are the same number of rows and columns as the original dataframe
print(len(parksDf))
print(len(parksDf.columns))
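
The same round-trip check can be sketched on a toy dataframe. This is a minimal illustration that uses an in-memory buffer instead of a file; the column values are made up.

```python
import io
import pandas as pd

original = pd.DataFrame({'id': [1, 2], 'tags': ['lake music', 'hiking dog']})

# Write to an in-memory buffer rather than to disk, then read it back
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(original.shape == restored.shape)
```

If `index=False` is left out when saving, the row index is written as an extra unnamed column, which pandas labels `Unnamed: 0` when the file is read back.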

Next we take a look at the tags to determine how best to clean them up. 

Note: we could create a tags-only dataframe from the .csv here and use it from this point on.

In [None]:
df=parksDf.explode('tags').groupby('parkname')['tags'].value_counts()
df

In [2]:
#create function to remove punctuation, convert all text to lowercase, tokenize, remove stop words, and return the result as a list
#We also went back and created a list from our tags of common words we did not want to analyze (e.g. losangeles, la, california, etc)

import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#[A-Za-z] matches letters only; the range A-z would also match [ \ ] ^ _ and `
swords = [re.sub(r"[^A-Za-z\s]", "", sword) for sword in stopwords.words('english')]
swords += ['losangeles', 'la', 'losangelesca', 'ca', 'macarthur', 'macarthurpark', 'woodley', 'riodelosangeles', 'runyoncanyon', 
           'temescalgateway', 'heidelbergpark', 'hancockpark', 'franklincanyonpark', 'angelsgate', 
           'coldwatercanyon', 'chatsworthparksouth','cheviothills', 'california', 'usa', 'southerncalifornia', 'park', 'parklabrea', 
          'unitedstates', 'america']

def clean_string(text):
    # remove punctuation and digits
    text = re.sub(r"[^A-Za-z\s]", "", text)
    
    # lowercase, tokenize, and drop stop words
    cleaned_list_of_words = [word for word in word_tokenize(text.lower()) if word not in swords]
    
    return cleaned_list_of_words

#calling the function to only apply to the tags column 
parks_data['tags'] = parks_data['tags'].apply(clean_string)

NameError: name 'parks_data' is not defined
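
For a quick check of the cleaning logic without NLTK, here is a simplified sketch. It assumes whitespace splitting is good enough for Flickr tag strings and uses a tiny made-up stop list; `clean_tags` is a hypothetical name, not the function used above. It uses the explicit class `[A-Za-z]`, because the range `A-z` also matches the characters `[ \ ] ^ _` and the backtick.

```python
import re

# Small illustrative stop list; the notebook's list is NLTK's English stop words plus place names
stop_words = {'the', 'in', 'losangeles', 'la', 'california', 'park'}

def clean_tags(text):
    """Drop non-letters, lowercase, split on whitespace, remove stop words."""
    letters_only = re.sub(r"[^A-Za-z\s]", "", text)
    return [w for w in letters_only.lower().split() if w not in stop_words]

cleaned = clean_tags("LosAngeles sunrise D750 park_bench the harbor")
print(cleaned)  # ['sunrise', 'd', 'parkbench', 'harbor']
```

Note that digits are stripped too, so a camera tag such as `D750` collapses to `d`; tags like that may be worth adding to the stop list.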

In [7]:
# Check to make sure tags are in a list
parks_data['tags']

0       angelsgate losangelesharbor sanpedro sunrise c...
1       sanpedro losangeles harbor california southern...
2       hancockpark losangeles bw blackandwhite d750 m...
3       bw barrier blackandwhite d750 monochrome nikon...
4       losangeles ca california subway metro public t...
                              ...                        
2745    california sky skyline clouds 35mm geotagged l...
2746    morning dog wet corner crash accident clinton ...
2747    night corner sale clinton pickuptruck hollywoo...
2748    westlake 1972 darryl macarthurpark bypk kentka...
2749    westlake penny 1972 darryl macarthurpark carll...
Name: tags, Length: 2750, dtype: object

For our next step, we needed to break apart the tags so that each row pairs a single tag with a park name. To do this, we used the .explode method, which moves each list item (each tag) to its own row.

In [8]:
#extend the dataframe by pairing up each individual tag with its park name
#Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html

cols = ['tags', 'parkname']
tag_park = parks_data[cols].explode('tags', ignore_index=True)
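
A toy example shows what `.explode` does and what it needs. The sketch assumes the tags start out as space-separated strings, as Flickr returns them; `.explode` only splits list-like cells, so the strings are turned into lists first. The dataframe contents are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'tags': ['lake westlake music', 'hiking dog'],
    'parkname': ['macarthur', 'runyoncanyon'],
})

# explode() leaves plain strings untouched, so convert each string to a list of tags
toy['tags'] = toy['tags'].str.split()

# one row per (tag, parkname) pair
tag_rows = toy.explode('tags', ignore_index=True)
print(tag_rows)
```

If the tags column still holds whole strings when `.explode` is called, the row count does not change and every later count is per photo rather than per tag.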

Next, we grouped the tags by park name to determine how many tags are associated with each park.

In [9]:
#getting a simple count of tags per park 
print(tag_park.groupby('parkname').count())

                     tags
parkname                 
Unknown                42
angelsgate            348
chatsworthparksouth     1
cheviothills           19
coldwatercanyon        21
franklincanyonpark     37
hancockpark          1369
heidelbergpark          1
macarthur             490
riodelosangeles        10
runyoncanyon          406
temescalgateway         1
woodley                 5


Next, we obtained the top 100 most-used tags across all parks.

In [11]:
#groupby documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
#sort_values documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

#create a helper column of ones so that summing it gives the count of each tag 
tag_park['value'] = 1

#return top 100 most used tags sorted by value
top_100_tags = tag_park.groupby('tags')[['value']].sum().sort_values('value', ascending=False).head(100)

#so we can view all tags
pd.set_option('display.max_rows', 100)

print(top_100_tags)

                                                    value
tags                                                     
texture losangeles textures runyoncanyon textur...    152
california losangeles labrea tarpits labreatarp...     80
real estate realestate house hancockpark losang...     73
animals museum paleontology tigers bones fossil...     63
gay lesbian losangeles pride lgbt gaypride gard...     62
california losangeles labrea tarpits labreatarp...     57
hancockpark midcity losangeles socal southernca...     48
people museum losangeles lacma hancockpark             44
westlake macarthurpark lake architecture neighb...     41
park wrigley omelveny omelvenypark                     40
iceage tarpits pagemuseum hancockpark pleistoce...     37
familyseparations keepfamliestogethor march los...     36
garden losangeles hancockpark gardentour               32
real estate realestate hancockpark midcity losa...     29
california losangeles labrea tarpits labreatarp...     27
animals museum
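
As a cross-check on the counting step, the same ranking can be sketched with `value_counts` on a toy one-tag-per-row dataframe. The data below is made up, and this is an alternative to the helper-column approach, not the method used above.

```python
import pandas as pd

tag_rows = pd.DataFrame({
    'tags': ['lake', 'music', 'lake', 'hiking', 'lake', 'hiking'],
    'parkname': ['macarthur', 'macarthur', 'macarthur',
                 'runyoncanyon', 'macarthur', 'runyoncanyon'],
})

# value_counts() counts each distinct tag and sorts from most to least frequent
top_tags = tag_rows['tags'].value_counts().head(2)
print(top_tags)
```

In this toy data `lake` appears three times and `hiking` twice, so those are the two tags returned.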

Finally, we obtained the top 50 tags for each park.

In [12]:
#create a new dataframe, grouping by tags and parkname, and then find the sum of each tag's occurrences 
tag_park_counts = tag_park.groupby(cols).sum()

#then loop through each park name, returning the top 50 most common tags for each park, sorted by the "value" column (i.e. the frequency of that tag's occurrence)
#.index.get_level_values allows us to filter by park name: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_level_values.html
park_counts = {}
for park in park_names:
    park_counts[park] = tag_park_counts.loc[tag_park_counts.index.get_level_values('parkname') == park].sort_values('value', ascending=False).head(50)

park_counts

{'macarthur':                                                               value
 tags                                               parkname        
 westlake macarthurpark lake architecture neighb... macarthur     41
 familyseparations keepfamliestogethor march los... macarthur     36
 park urban la losangeles macarthurpark             macarthur     21
 tacos mexican lengua asada picounion burrito ca... macarthur     19
 mexican lengua asada picounion burrito carnitas... macarthur     17
 signs losangeles fear rally sanity macarthurpark   macarthur     15
 city urban losangeles westlake macarthurpark       macarthur     14
 loslobos music macarthurpark doslobos              macarthur     12
 langersdeli losangeles california ca la normlan... macarthur     12
 california ca music la losangeles live burns vi... macarthur     11
 monument landmark lodge parkplaza 1925 macarthu... macarthur      9
 streets geotagged losangeles downtown bikes bic... macarthur      8
 monument landmark lo
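
The per-park ranking can also be sketched without a loop. This is a hedged alternative on made-up data: it returns one flat dataframe instead of a dictionary of dataframes, and it keeps the top 1 tag per park only to keep the example small.

```python
import pandas as pd

tag_rows = pd.DataFrame({
    'tags': ['lake', 'music', 'lake', 'hiking', 'lake', 'hiking', 'dog'],
    'parkname': ['macarthur', 'macarthur', 'macarthur', 'runyoncanyon',
                 'macarthur', 'runyoncanyon', 'runyoncanyon'],
})

# count each (parkname, tag) pair, then sort so the most frequent tags come first within each park
counts = (tag_rows.groupby(['parkname', 'tags']).size()
          .reset_index(name='value')
          .sort_values(['parkname', 'value'], ascending=[True, False]))

# head(n) on a groupby keeps the first n rows of every group
top_per_park = counts.groupby('parkname').head(1)
print(top_per_park)
```

A single flat dataframe like this can be written straight to a .csv file, which a dictionary of dataframes cannot.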

In [14]:
#park_counts is a dict of dataframes, so combine them before saving; the index is kept because it holds the tags and park names
pd.concat(list(park_counts.values())).to_csv('park_counts.csv')

In [13]:
### ADAPT using the new tags only DF

#Use the function to clean up the list of tags and return a new list of cleaned tags
cleaned_macarthur_tags = [clean_string(x) for x in macarthurDf_tags]
print(cleaned_macarthur_tags)

NameError: name 'macarthurDf_tags' is not defined