# Mapping Instagram Data
***
### Part of [Photos in the National Parks](http://104.131.40.84:3000)

This notebook explains how I parsed and organized posts from the Instagram API to display online via Leaflet.js maps.  If you're interested in how to gather posts from the Instagram API, please see my [github repository](https://github.com/jefarrell/Python-Instagram_API_Scripts).

***
## Methodology 

The 10 parks chosen were the 10 most popular parks of 2015, ranked by total visitors.   

Gathering posts by a specific location tag (e.g. posts tagged at "Yellowstone National Park") results in every post showing up at the same location marker.  The best method I found was to search by popular hashtags used in relation to each specific park.  After finding the tags for each park, I gathered roughly 33,000 posts per park, for a total of about 330,000.  The dataset was filtered to remove posts that did not include location data.  I then used shapefiles of the National Park boundaries, provided by [data.gov](https://catalog.data.gov/dataset/national-park-boundariesf0a4c).  Using these boundaries, I could eliminate any posts that were not taken within the geographic boundaries of the parks themselves.  This process is demonstrated below.

Was that the right method?  I don't know; there are certainly problems with it.  But it was the best approach I found.  I'd love to hear if you have a better way.

***
## Where Does Location Data Come From?
Location data in photos comes from the GPS in your phone, through a process called geotagging.  That data is then picked up by Instagram.  This means our data is limited to photos taken where GPS was available (one of the problems I mentioned above).

People allow this data to be shared, but they don't always realize they are allowing it.

If that makes you uncomfortable, you can turn it off.  [This guide](http://www.igeeksblog.com/disable-geotagging-for-photos-on-iphone-ipad/) shows how to disable geotagging on an iPhone.  If you only want to turn off geotagging in Instagram, [go here](https://instagramtipsandtricks.blogspot.com/p/turn-location-on-or-off.html).   

***
## Data Sources
All data used in this project was gathered through the [Instagram API](https://instagram.com/developer), which provides access to publicly available posts.  The web application is NOT live: collection took place on May 25-26, 2016.  Instagram had big API changes scheduled for June 1st, so I opted to collect all data before that date.

***

#### I'm going to start with a pickle file I created via my [Instagram Tag Search](https://github.com/jefarrell/Python-Instagram_API_Scripts/blob/master/instagramTagSearch.py) script.

#### The output of that script will be a pickle file of Instagram posts containing a given hashtag.

In [1]:
# Import pickle, choose the file we want to examine
# The Tag Search script saves the pickle files as "whateverHashtag_data.p"
# We'll check out Yellowstone here
import pickle
used_tag="yellowstone"
path='data/'
# Open the file in a with-block so it gets closed after loading
with open(path+'%s_data.p'%used_tag,'rb') as f:
    new_media = pickle.load(f)
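
For reference, here is a minimal sketch of the pickle round trip this cell relies on. The post dictionaries and the temporary folder are stand-ins (assumptions, not the real Instagram media objects or the real `data/` folder); the `with` blocks make sure the file handles get closed.

```python
import os
import pickle
import tempfile

# Stand-in data: the real file holds Instagram media objects, not dicts
sample_posts = [{"user": "hiker_a", "tags": ["yellowstone"]},
                {"user": "hiker_b", "tags": ["yellowstone", "geyser"]}]

used_tag = "yellowstone"
path = tempfile.mkdtemp()  # hypothetical stand-in for the 'data/' folder
filename = os.path.join(path, "%s_data.p" % used_tag)

# Save the posts the way a collection script might
with open(filename, "wb") as fp:
    pickle.dump(sample_posts, fp)

# Load them back; 'rb' matters because pickle files are binary
with open(filename, "rb") as fp:
    loaded_posts = pickle.load(fp)

print(len(loaded_posts))
```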

In [2]:
# Quick check
print len(new_media)

33033


In [3]:
# See what data exists in a single media object
dir(new_media[0])

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 'caption',
 'comment_count',
 'comments',
 'created_time',
 'filter',
 'get_low_resolution_url',
 'get_standard_resolution_url',
 'get_thumbnail_url',
 'id',
 'images',
 'like_count',
 'likes',
 'link',
 'object_from_dictionary',
 'tags',
 'type',
 'user',
 'users_in_photo']

In [4]:
## That's more than I need, so let's make a new dictionary and save the important bits
#### Most importantly, I'm only saving posts that contain the "location" attribute
datadict = [{"time":m.created_time, "location":m.location, "photo":m.images['standard_resolution'].url, "user":m.user.username} for m in new_media if hasattr(m, 'location')]
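
The `hasattr` filter works because media objects without a location simply lack the attribute. A minimal sketch with a hypothetical stand-in class (`FakeMedia` is made up for illustration and is not part of the Instagram client library):

```python
class FakeMedia(object):
    """Hypothetical stand-in for an Instagram media object."""
    def __init__(self, username, location=None):
        self.username = username
        # Only set 'location' when one exists, so hasattr() can tell the difference
        if location is not None:
            self.location = location

posts = [
    FakeMedia("hiker_a", location=(44.46, -110.83)),
    FakeMedia("hiker_b"),
    FakeMedia("hiker_c", location=(44.72, -110.49)),
]

# Keep only posts that carry a 'location' attribute
located = [{"user": m.username, "location": m.location}
           for m in posts if hasattr(m, "location")]

print(len(posts))
print(len(located))
```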

In [5]:
# Check how many that is compared to our original number
print len(new_media)
print len(datadict)

33033
16264


#### So from our original list of 33,033 posts, 16,264 contain location data.  About 49% isn't too bad!






In [6]:
# Quick check that all posts have lat/lon
for d in datadict:
    if hasattr(d["location"].point, "latitude"):
        pass
    else:
        print "Exception"

In [7]:
## In some instances, I found posts with a location attribute but no lat/lon points
## This would cause the above test to fail, and it would be a useless post for our purposes
#### This is a fix to remove any stray posts
# Rebuild the list instead of calling remove() inside the loop, which can skip items
datadict = [d for d in datadict if hasattr(d["location"].point, "latitude")]
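
Removing items from a list while looping over it can skip the element that follows each removal, so filtering into a new list is the safer pattern. A small sketch with stand-in data (`None` plays the role of a post without lat/lon):

```python
# Stand-in data: None marks a post whose location has no lat/lon
points = [None, None, (44.4, -110.6), None, (44.7, -110.5)]

# Removing while iterating: the loop skips the item after each removal
buggy = list(points)
for p in buggy:
    if p is None:
        buggy.remove(p)

# Rebuilding the list checks every item exactly once
clean = [p for p in points if p is not None]

print(buggy)
print(clean)
```

In this sketch `buggy` still contains a `None`, while `clean` does not.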

#### Now that we only have posts with location, we should take a look at them

In [8]:
# We're going to use Folium to examine the data on maps
# https://github.com/python-visualization/folium
import folium

In [9]:
# Use the appropriate map coordinates for your data
mapCoords = [44.559337,-110.349717]
locationMap = folium.Map(location=mapCoords)
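
If you don't have center coordinates handy, one option is to average the points themselves. This is only a sketch with stand-in coordinates, and `mean_center` is a hypothetical helper; a plain average is a reasonable guess for a single park but would misbehave for data that straddles the antimeridian.

```python
# Stand-in (latitude, longitude) pairs; real values would come from datadict
coords = [(44.0, -110.0), (45.0, -111.0), (44.5, -110.5)]

def mean_center(latlon_pairs):
    """Return [mean latitude, mean longitude] for a list of pairs."""
    lats = [lat for lat, lon in latlon_pairs]
    lons = [lon for lat, lon in latlon_pairs]
    return [sum(lats) / float(len(lats)), sum(lons) / float(len(lons))]

center = mean_center(coords)
print(center)
```

The result is a `[lat, lon]` list, the same shape as `mapCoords`.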

In [10]:
# Add data points to the map
# I'm only checking the first 1000 posts - I would run into trouble mapping bigger sets
# Folium can be a bit slow, but it's much easier to do quick checks like this vs using Basemap

for d in datadict[0:1000]: 
    folium.CircleMarker([d['location'].point.latitude, d['location'].point.longitude], 
    popup = (d['user'] + str(d['location'].point.latitude)), radius=500, color="#D15400", fill_color="#D15400").add_to(locationMap)

In [11]:
# Let's see our map
locationMap

### Points on a map!  
#### But there are a lot of points that don't actually lie within the park boundaries
#### Let's find a way to identify only points within the parks

In [12]:
# Some new imports
import fiona
import shapely
from shapely.geometry import Polygon
import json

In [13]:
# We'll need an empty list
parksContainer = []

In [14]:
# The government provides shape files which give us the boundaries of all Nationally-protected areas
# https://catalog.data.gov/dataset/national-park-boundariesf0a4c
# We're going to check our point locations against these shapes
## More about shape files: https://doc.arcgis.com/en/arcgis-online/reference/shapefiles.htm

vector = fiona.open('geom/ne_10m_parks_and_protected_lands/ne_10m_parks_and_protected_lands_area.shp')

In [15]:
# There's lots of extra stuff in this file,
# So let's sort out only the National Parks, and add them to our empty list
for feature in vector:
    if (feature['properties']['unit_type']) == 'National Park':
        parksContainer.append(feature)
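
Before filtering, it can help to see which `unit_type` values a file contains. A sketch using stand-in features shaped like the records above (the names and types here are made-up sample values, not read from the shapefile):

```python
from collections import Counter

# Stand-in features shaped like the records Fiona yields (assumed values)
features = [
    {"properties": {"unit_type": "National Park", "name": "Yellowstone"}},
    {"properties": {"unit_type": "National Monument", "name": "Devils Tower"}},
    {"properties": {"unit_type": "National Park", "name": "Zion"}},
]

# Count how often each unit_type appears before deciding what to keep
type_counts = Counter(f["properties"]["unit_type"] for f in features)

# Same filter as the loop above, written as a list comprehension
parks_only = [f for f in features
              if f["properties"]["unit_type"] == "National Park"]

print(type_counts["National Park"])
print(len(parks_only))
```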

In [16]:
# Let's check it out
for i in parksContainer:
    print i['properties']['name']

Hawai'i Volcanoes
Channel Islands
Redwood
Olympic
Joshua Tree
Badlands
Zion
Petrified Forest
Grand Teton
Kobuk Valley
Sequoia
Kings Canyon
Yellowstone
Lassen Volcanic
Yosemite
Shenandoah
Rocky Mountain
North Cascades
Mount Rainier
Great Smoky Mountains
Death Valley
Crater Lake
Capitol Reef
Canyonlands
Kenai Fjords
Big Bend
Everglades
Isle Royale
Voyageurs
Grand Canyon
Glacier


In [17]:
# Cool!  Let's make sure we're looking at the park we want
parksContainer[12]['properties']['name']

u'Yellowstone'
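
A list index like `12` depends on the order of the file, so looking a park up by name may be more robust. A minimal sketch with stand-in features (the `parks_by_name` dictionary is a hypothetical helper, not something the shapefile library provides):

```python
# Stand-in for parksContainer (assumed values, only the fields used here)
parks = [
    {"properties": {"name": "Zion"}, "geometry": {"type": "Polygon"}},
    {"properties": {"name": "Yellowstone"}, "geometry": {"type": "Polygon"}},
]

# Build a name -> feature lookup once, then select parks by name
parks_by_name = {park["properties"]["name"]: park for park in parks}

selected = parks_by_name["Yellowstone"]
print(selected["properties"]["name"])
```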

In [18]:
# This is a list of all the coordinates that make up the polygon of the park boundary
parkPoints_list = parksContainer[12]['geometry']['coordinates'][0]

In [19]:
# If you get an error about a LinearRing needing at least 3 coordinate tuples,
# you might need to change how you select parkPoints_list
# In some cases, it's "...['coordinates'][0][0]", other times it's "[1][0]"
## This usually means the park is stored as a MultiPolygon (several separate pieces), which nests its coordinates one level deeper
park_polygon = Polygon(parkPoints_list)
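
The different nesting levels can be handled by checking the geometry's type. This sketch assumes the geometry is a GeoJSON-style mapping with `type` and `coordinates` keys; `exterior_rings` is a hypothetical helper and the square is a stand-in shape, not a real park boundary.

```python
def exterior_rings(geometry):
    """Return the outer ring of every polygon in a GeoJSON-style geometry."""
    coords = geometry["coordinates"]
    if geometry["type"] == "Polygon":
        # coordinates = [exterior, hole1, hole2, ...]
        return [coords[0]]
    if geometry["type"] == "MultiPolygon":
        # coordinates = [[exterior, hole1, ...], [exterior, ...], ...]
        return [polygon[0] for polygon in coords]
    raise ValueError("Unsupported geometry type: %s" % geometry["type"])

# Stand-in geometries (assumed shapes, not real park boundaries)
square = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
single = {"type": "Polygon", "coordinates": [square]}
multi = {"type": "MultiPolygon", "coordinates": [[square], [square]]}

print(len(exterior_rings(single)))
print(len(exterior_rings(multi)))
```

Each returned ring is a list of coordinate tuples, so a multi-part park would need one polygon test per ring.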

In [20]:
# Let's extract only the points that lie within that polygon, and put them in a new bucket
pointsinbounds = []

### Make note, Shapely expects points as long/lat here
for d in datadict:
    if park_polygon.contains(shapely.geometry.Point(d['location'].point.longitude, d['location'].point.latitude)):
        pointsinbounds.append(d)    
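
To show what a containment test does conceptually, here is a pure-Python ray-casting sketch. It is an illustration only and not how Shapely implements `contains`; `point_in_ring` and the square ring are hypothetical. It also shows why the long/lat order matters.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a closed ring of (lon, lat) pairs?"""
    inside = False
    for i in range(len(ring) - 1):
        x1, y1 = ring[i]
        x2, y2 = ring[i + 1]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / float(y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Stand-in ring in (lon, lat) order; first and last points match to close it
ring = [(-111, 44), (-111, 45), (-110, 45), (-110, 44), (-111, 44)]

correct_order = point_in_ring(-110.5, 44.5, ring)   # long, lat
swapped_order = point_in_ring(44.5, -110.5, ring)   # lat, long by mistake

print(correct_order)
print(swapped_order)
```

With the arguments swapped, a point that really is inside the ring is reported as outside.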

In [21]:
# How much data have we been filtering out?
print len(new_media)
print len(datadict)
print len(pointsinbounds)

33033
16264
10977


#### 33k original posts, down to 16k with location data, down to 11k that were actually taken within the park
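
The three counts above can be turned into retention rates with a little arithmetic. The numbers are the ones printed above; `percent` is a hypothetical helper.

```python
total_posts = 33033
with_location = 16264
in_bounds = 10977

def percent(part, whole):
    """Share of 'whole' that 'part' represents, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

location_rate = percent(with_location, total_posts)  # posts with location data
in_park_rate = percent(in_bounds, with_location)     # located posts inside the park
overall_rate = percent(in_bounds, total_posts)       # all posts inside the park

print(location_rate)
print(in_park_rate)
print(overall_rate)
```

So roughly half of the posts had location data, about two thirds of those fell inside the park, and about a third of the original set survived both filters.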

In [22]:
# Time to check out a map of our new data
inboundsmap = folium.Map(location=mapCoords)

In [23]:
# Again, we're only going to display some of the points so we don't crash our browser
for p in pointsinbounds[0:1000]: 
    folium.CircleMarker(
        [p['location'].point.latitude, p['location'].point.longitude], 
        # Use the loop variable p here (d is left over from an earlier cell)
        popup=p['user'],radius=500, color="#D15400",
        fill_color="#D15400").add_to(inboundsmap)

In [24]:
inboundsmap

### Great!  We can clearly see that the stray posts have been filtered, and we're only seeing data within the park boundaries.  
### Last thing we want to do is export our data so we can use it elsewhere.  We will use the geojson file format.  
#### It has specific formatting; you can find more info here: [geojson.org](http://www.geojson.org)

In [25]:
### MAKE NOTE ###
# Put points as long/lat, NOT as lat/long
# GeoJSON stores coordinates in that order, and Leaflet's GeoJSON layer reads them that way


# This is the top-level FeatureCollection object that wraps our geojson data
# We will then append our data into the 'features' list after we make this wrapper
points_map = {
    "type": "FeatureCollection",
    "crs": {
        "type": "name",
        "properties": {
            "name": "urn:ogc:def:crs:OGC:1.3:CRS84"
        }
    },
    "features":[]
    
    }


for d in pointsinbounds:
    points_map['features'].append(
        {
        "type":"Feature",
        "properties":{
            "time":str(d["time"]),
            "photo":d["photo"], 
            "user":d["user"]
        },
        "geometry":{
            "type": "Point",
            "coordinates": [
                d["location"].point.longitude,
                d["location"].point.latitude
                ]
            }
        }
    )

In [26]:
# Did we get them all?
print len(pointsinbounds)
print len(points_map['features'])

10977
10977


In [27]:
# Save our geojson file - it's now ready for display online via leaflet
with open("%s.geojson"%used_tag, "w") as fp:
    json.dump(points_map, fp)
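
As a quick sanity check on the exported format, this sketch builds a one-feature collection, serializes it, and parses it back. The post values are stand-ins (assumptions), and the range check is only a rough guard against swapped coordinates.

```python
import json

# Stand-in post (assumed values); note the [longitude, latitude] order
feature = {
    "type": "Feature",
    "properties": {"user": "hiker_a", "time": "2016-05-25 14:03:00"},
    "geometry": {"type": "Point", "coordinates": [-110.83, 44.46]},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize and parse again, as a browser would after fetching the file
parsed = json.loads(json.dumps(collection))

lon, lat = parsed["features"][0]["geometry"]["coordinates"]
# Latitude must fall in [-90, 90], so for longitudes beyond that range a swap shows up here
coords_look_valid = -180 <= lon <= 180 and -90 <= lat <= 90

print(coords_look_valid)
```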

### We took 33,000 Instagram posts, removed extraneous data, filtered for location, checked for geographic boundaries, and saved our output in a new usable format.  

## Please head over to [Photos in the Parks](http://104.131.40.84:3000) to view how I used this data.   

I'd love to hear any suggestions or updates you have, so please get in contact: <johnefarrell18@gmail.com>