# Procure Safegraph Patterns on Amazon EC2

This notebook speeds up extraction of the patterns data by taking the processing to the data. The Safegraph data resides in a Simple Storage Service (S3) bucket in the US-East-2 region. Therefore, the most efficient way to get the data we need, aside from standing up a true _big data_ infrastructure stack, is to run an Amazon Elastic Compute Cloud (EC2) instance with Python, Jupyter Notebook, and a copy of this repository, and use this execution environment to extract just the data we need.

In [1]:
import os
from pathlib import Path

from dotenv import find_dotenv, load_dotenv

from sg_data import SafegraphClient

In [2]:
dir_prj = Path('./').absolute().parent

dir_data = dir_prj/'data'

dir_raw = dir_data/'raw'

For the notebook to run, you will need to populate the variables below with the account credentials Safegraph provides you for data access.

In [3]:
# I am keeping these in a separate file called .env and using the dotenv module to load them - this keeps them out of version control
load_dotenv(find_dotenv())

AWS_KEY = os.getenv('AWS_KEY')
AWS_SECRET = os.getenv('AWS_SECRET')

sg_poi = 'sg:af471021a929414cbf69854e6f8f1b0c'  # white pass
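
Because `os.getenv` returns `None` for a variable that is not set, a missing `.env` entry would only surface later as an authentication failure. A minimal sketch of a fail-fast check follows. The `require_env` helper is hypothetical and is not part of `sg_data` or the dotenv module.

```python
import os


def require_env(*names):
    """Return the values of the named environment variables, or raise if any are unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(os.environ[name] for name in names)


# Usage, with the variable names from the cell above:
# AWS_KEY, AWS_SECRET = require_env('AWS_KEY', 'AWS_SECRET')
```

Raising early keeps the error message next to its cause instead of deep inside the S3 client.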

Using the credentials provided above, a Safegraph Client object instance can be created. This object is part of the Safegraph Data Utilities provided with this package.

In [4]:
# the module is designed to automatically try to load the settings from environment variables just like I did explicitly above, but I am showing it here for clarity
sg = SafegraphClient(access_key=AWS_KEY, secret_key=AWS_SECRET)

sg

<sg_data.main.SafegraphClient at 0x7f4a7a7c3a60>

The Safegraph Client also provides introspection: its `content_dataframe` attribute lists the available resources as a Pandas DataFrame, so you can interrogate what is there before pulling anything.

In [5]:
ptrns_df = sg.content_dataframe

ptrns_df.head()

Unnamed: 0,source_path,year,month,resource_type,standardized_path
0,monthly-patterns/brand_info_backfill/2020/12/1...,2018,1,brand_info,monthly-patterns/brand_info/2018/01/brand_info...
1,monthly-patterns/brand_info_backfill/2020/12/1...,2018,2,brand_info,monthly-patterns/brand_info/2018/02/brand_info...
2,monthly-patterns/brand_info_backfill/2020/12/1...,2018,3,brand_info,monthly-patterns/brand_info/2018/03/brand_info...
3,monthly-patterns/brand_info_backfill/2020/12/1...,2018,4,brand_info,monthly-patterns/brand_info/2018/04/brand_info...
4,monthly-patterns/brand_info_backfill/2020/12/1...,2018,5,brand_info,monthly-patterns/brand_info/2018/05/brand_info...


This DataFrame can be filtered and organized to discover what is available. In this case we are interested in what `patterns` data is available, and to more easily see the range available, we are sorting by year and month.

In [6]:
ptrns_df[ptrns_df['resource_type'] == 'patterns'].sort_values(['year', 'month'])

Unnamed: 0,source_path,year,month,resource_type,standardized_path
304,monthly-patterns/patterns_backfill/2020/05/07/...,2018,1,patterns,monthly-patterns/patterns/2018/01/patterns-par...
305,monthly-patterns/patterns_backfill/2020/05/07/...,2018,1,patterns,monthly-patterns/patterns/2018/01/patterns-par...
306,monthly-patterns/patterns_backfill/2020/05/07/...,2018,1,patterns,monthly-patterns/patterns/2018/01/patterns-par...
307,monthly-patterns/patterns_backfill/2020/05/07/...,2018,1,patterns,monthly-patterns/patterns/2018/01/patterns-par...
416,monthly-patterns/patterns_backfill/2020/12/13/...,2018,1,patterns,monthly-patterns/patterns/2018/01/core_poi-pat...
...,...,...,...,...,...
299,monthly-patterns/patterns/2020/11/06/11/patter...,2020,11,patterns,monthly-patterns/patterns/2020/11/patterns-par...
300,monthly-patterns/patterns/2020/12/04/04/patter...,2020,12,patterns,monthly-patterns/patterns/2020/12/patterns-par...
301,monthly-patterns/patterns/2020/12/04/04/patter...,2020,12,patterns,monthly-patterns/patterns/2020/12/patterns-par...
302,monthly-patterns/patterns/2020/12/04/04/patter...,2020,12,patterns,monthly-patterns/patterns/2020/12/patterns-par...
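
To see the coverage at a glance rather than scanning the sorted rows, the same filter can be followed by a count of files per year and month. The sketch below runs on a small stand-in DataFrame with the `year`, `month`, and `resource_type` columns shown above. The rows are made up for illustration, and `summarize_coverage` is a hypothetical helper, not part of `sg_data`.

```python
import pandas as pd


def summarize_coverage(content_df, resource_type='patterns'):
    """Count the files available for one resource type, by year and month."""
    subset = content_df[content_df['resource_type'] == resource_type]
    return (
        subset.groupby(['year', 'month'])
        .size()
        .rename('file_count')
        .reset_index()
        .sort_values(['year', 'month'], ignore_index=True)
    )


# Stand-in for sg.content_dataframe (illustrative rows only)
toy_df = pd.DataFrame({
    'year': [2018, 2018, 2018, 2020, 2020],
    'month': [1, 1, 1, 12, 12],
    'resource_type': ['patterns', 'patterns', 'brand_info', 'patterns', 'patterns'],
})

coverage = summarize_coverage(toy_df)
print(coverage)
```

In the notebook, the call would presumably be `summarize_coverage(ptrns_df)`.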


Finally, although we are going to retrieve three years' worth of data, we are interested only in ski season, so we are filtering to November through April. The pull is not instantaneous (roughly an hour and a half of wall time here), but when I performed the same data pull on a local machine in my office, the process took over six hours, so this is a dramatic improvement.

In [7]:
%%time

pt_df = sg.get_patterns_dataframe([2018, 2019, 2020], [11, 12, 1, 2, 3, 4], safegraph_pois=sg_poi)

pt_df.head()

CPU times: user 1h 19min 38s, sys: 3min 20s, total: 1h 22min 59s
Wall time: 1h 24min 22s


Unnamed: 0,placekey,safegraph_place_id,location_name,street_address,city,region,postal_code,safegraph_brand_ids,brands,date_range_start,...,naics_code,latitude,longitude,iso_country_code,phone_number,open_hours,opened_on,closed_on,tracking_opened_since,tracking_closed_since
169339,zzy-222@5xd-7jh-f75,sg:af471021a929414cbf69854e6f8f1b0c,White Pass Ski Area,48935 U.s. 12,Naches,WA,98937,,,2020-10-01T00:00:00-07:00,...,,,,,,,,,,
233622,zzw-222@5xd-7jh-ffz,sg:af471021a929414cbf69854e6f8f1b0c,White Pass Ski Area,48935 U.s. 12,Naches,WA,98937,,,2020-11-01T00:00:00-07:00,...,,,,,,,,,,
94681,,sg:af471021a929414cbf69854e6f8f1b0c,White Pass Ski Area,48935 U.s. 12,Naches,WA,98937,,,2018-01-01T00:00:00-08:00,...,,,,,,,,,,
128545,,sg:af471021a929414cbf69854e6f8f1b0c,White Pass Ski Area,48935 U.s. 12,Naches,WA,98937,,,2018-02-01T00:00:00-08:00,...,,,,,,,,,,
119562,,sg:af471021a929414cbf69854e6f8f1b0c,White Pass Ski Area,48935 U.s. 12,Naches,WA,98937,,,2018-03-01T00:00:00-08:00,...,,,,,,,,,,
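
The call above passes the years and months as two separate lists. Assuming the client combines them as a cross product (the source does not confirm how `get_patterns_dataframe` pairs them), the request covers 18 year-month combinations, and each ski season straddles two calendar years. The sketch below enumerates those combinations and labels each with the season it belongs to. `requested_periods` and `ski_season` are hypothetical helpers for illustration only.

```python
from itertools import product


def requested_periods(years, months):
    """Enumerate every (year, month) pair in chronological order."""
    return sorted(product(years, months))


def ski_season(year, month):
    """Label a November-April month with its season, e.g. January 2019 -> '2018-19'.

    Assumes the season starts in November; months outside November-April are not handled.
    """
    start = year if month >= 11 else year - 1
    return f"{start}-{(start + 1) % 100:02d}"


periods = requested_periods([2018, 2019, 2020], [11, 12, 1, 2, 3, 4])
for year, month in periods:
    print(year, month, ski_season(year, month))
```

Under this assumption, the first and last seasons in the pull are partial: January through April 2018 belongs to a season that started in 2017, and November and December 2020 open a season that runs into 2021.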


In [8]:
pt_df.to_parquet(dir_raw/'patterns_white_pass.parquet')