## Background and Motivation

For the past few weeks I've been working on learning Python and building a data science toolkit. Until now, this had largely entailed following canned tutorials and working with data that came from flat files downloaded from some online resource.

Given that this is not where most data scientists obtain their data, I wanted to do a small project that would require obtaining data using an API, populating a database, and creating visualizations on top of that database. 

In [1]:
import pandas as pd
import numpy as np
import requests # used for api calls
import json # parse json body
from pandas import json_normalize # pandas >= 1.0; the older pandas.io.json import path is deprecated
import psycopg2
from sqlalchemy import create_engine



## Strava API

Strava makes it very easy to use their API. If you are a user of the Strava app, you simply log in to their web app and create an application. Even as a complete beginner to using APIs, once I had the keys it only took a little experimenting to figure out how to use it. I found this [blog post](https://yizeng.me/2017/01/11/get-a-strava-api-access-token-with-write-permission/) by Yi Zeng very helpful for figuring out how to get the additional read and write permissions that are necessary to request more detailed information.

I learned that it is best practice to hide your API keys from public view (for example, on GitHub). There are clearly more sophisticated and robust ways of doing this, but for a quick and easy solution, you can create a `config.py` file in your project folder and use that script to store your keys. Then you add `config.py` to your `.gitignore` so it doesn't get uploaded.
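
As a rough sketch, a `config.py` along these lines would work with the rest of this notebook. The attribute names match the ones used later (`client_id`, `client_secret`, `refresh_token`, `athlete_id`); the environment variable names and the placeholder values are my own assumptions, so substitute whatever suits your setup.

```python
# config.py -- keep this file out of version control by listing it in .gitignore
import os

# Each value is read from an environment variable if one is set; otherwise the
# placeholder string is used. The variable names below are illustrative only.
client_id = os.environ.get("STRAVA_CLIENT_ID", "your-client-id")
client_secret = os.environ.get("STRAVA_CLIENT_SECRET", "your-client-secret")
refresh_token = os.environ.get("STRAVA_REFRESH_TOKEN", "your-refresh-token")
athlete_id = os.environ.get("STRAVA_ATHLETE_ID", "your-athlete-id")
```

Reading from the environment first means the same file can be used on another machine without editing it.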

At this point, I'll assume you have followed the blog post by Yi Zeng and have experimented with some API calls in your command line.

### Refresh API token
The access token provided by Strava expires after a certain amount of time. You are given a refresh token that you can exchange for a new access token. I wanted to be able to come back to this notebook and refresh the data at a future point, so I included the lines below to use the refresh token to acquire a new access token.

In [2]:
import config # create a configuration file to hide your API keys

In [3]:
# exchange the refresh token for a new access token (the original token was created earlier with greater permissions)
refresh = {'client_id':config.client_id,
          'client_secret':config.client_secret,
          'grant_type':'refresh_token',
          'refresh_token':config.refresh_token}
response = requests.post('https://www.strava.com/oauth/token', params=refresh)

In [4]:
# pull the new access token out of the response and build the authorization header
access_token = response.json()['access_token']
header = {'Authorization': 'Bearer {}'.format(access_token)}
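
If you come back to the notebook often, it can help to refresh only when the token is actually close to expiring. As far as I can tell from Strava's documentation, the token response also includes an `expires_at` timestamp in seconds since the epoch. The helper names below (`needs_refresh`, `auth_header`) and the five-minute margin are my own, so treat this as a sketch rather than anything Strava prescribes.

```python
import time


def needs_refresh(expires_at, now=None, margin_seconds=300):
    """Return True if the token has expired or will expire within the margin."""
    if now is None:
        now = time.time()
    return now >= expires_at - margin_seconds


def auth_header(token_payload):
    """Build the Authorization header from a parsed token response."""
    return {"Authorization": "Bearer {}".format(token_payload["access_token"])}


# Made-up payload, shaped like the fields used in this notebook.
example_payload = {"access_token": "abc123", "expires_at": 2000}
example_header = auth_header(example_payload)
```

In the notebook, `response.json()` would play the role of `example_payload`.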

In [5]:
# update weight to verify additional permissions are working
d = {"weight":"178"}
resp = requests.put('https://www.strava.com/api/v3/athlete', headers=header, data=d)
resp.json()['weight']

178.0

I'm primarily interested in the activity data, so let's see what the response looks like. I'll use this information to determine the structure of the table I create in my database.

In [6]:
activities = requests.get('https://www.strava.com/api/v3/athlete/activities', headers = header)

In [7]:
activities.json()
activities_test = json_normalize(activities.json(), sep = "_")
activities_test.columns

Index(['achievement_count', 'athlete_count', 'athlete_id',
       'athlete_resource_state', 'average_cadence', 'average_heartrate',
       'average_speed', 'average_temp', 'comment_count', 'commute',
       'display_hide_heartrate_option', 'distance', 'elapsed_time',
       'elev_high', 'elev_low', 'end_latlng', 'external_id', 'flagged',
       'from_accepted_tag', 'gear_id', 'has_heartrate', 'has_kudoed',
       'heartrate_opt_out', 'id', 'kudos_count', 'location_city',
       'location_country', 'location_state', 'manual', 'map_id',
       'map_resource_state', 'map_summary_polyline', 'max_heartrate',
       'max_speed', 'moving_time', 'name', 'photo_count', 'pr_count',
       'private', 'resource_state', 'start_date', 'start_date_local',
       'start_latitude', 'start_latlng', 'start_longitude', 'timezone',
       'total_elevation_gain', 'total_photo_count', 'trainer', 'type',
       'upload_id', 'utc_offset', 'visibility', 'workout_type'],
      dtype='object')

Note that the JSON structure can vary from activity to activity. I primarily use Strava for running, so I don't receive any of the fields specific to cycling. I also noticed that some fields are only included depending on whether or not you use a device for a run. The varying set of returned fields is something to consider when creating a pipeline from API to database.
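
To make the column names above less mysterious, here is a small pure-Python illustration of the idea behind `json_normalize(..., sep="_")`: nested keys are joined with an underscore, which is where names like `athlete_id` and `map_summary_polyline` come from. This is not how pandas implements it, and `flatten` and `align` are hypothetical helpers; the second one shows one way to force every record onto a fixed set of columns when fields come and go.

```python
def flatten(record, sep="_", prefix=""):
    """Flatten nested dicts so {'athlete': {'id': 1}} becomes {'athlete_id': 1}."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, sep=sep, prefix=name + sep))
        else:
            flat[name] = value  # lists such as start_latlng are kept as-is
    return flat


def align(record, columns):
    """Keep only the expected columns, filling missing ones with None."""
    flat = flatten(record)
    return {column: flat.get(column) for column in columns}


# Made-up activities: the second one has no heart rate data but an extra field.
with_device = {"id": 1, "athlete": {"id": 42}, "average_heartrate": 158.7}
without_device = {"id": 2, "athlete": {"id": 42}, "device_watts": False}

columns = ["id", "athlete_id", "average_heartrate"]
aligned = [align(activity, columns) for activity in (with_device, without_device)]
```

Every dict in `aligned` has exactly the same keys, so the rows line up with a table defined ahead of time.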

## PostgreSQL

I opted to work with PostgreSQL for this project. It seemed like an approachable step up from sqlite3 and you can access it from the command line, so while working on this notebook, I could have a terminal window open, validating steps and experimenting.

We use `sqlalchemy` as the primary tool for interacting with our Postgres database from Python. `sqlalchemy` uses `psycopg2`, which is a driver for PostgreSQL. Both allow us to write raw SQL statements and send them to the database; however, if we want to use some of the SQL features of Pandas (which we do), we need to use `sqlalchemy` to create an `engine`. The `engine` abstracts away the complexity of managing connections to the database.

There is a good [SO post on the differences here](https://stackoverflow.com/questions/8588126/sqlalchemy-or-psycopg2).

Finally, as this experimentation is meant to learn how to request data using the API and load that data into a database, I'm using a local database. In a future iteration of this project, I would like to learn to use a database hosted on a remote server.

In [8]:
# create an engine - run once
engine = create_engine('postgresql://jakekirsch:@localhost/jakekirsch')
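
The connection string follows the pattern `postgresql://user:password@host/database`. If you later move to a database that needs a password, special characters in it have to be URL-escaped. The helper below is a sketch of how the string could be assembled; `postgres_url` is a hypothetical name, and the second set of credentials is made up.

```python
from urllib.parse import quote_plus


def postgres_url(user, database, password="", host="localhost", port=None):
    """Assemble a PostgreSQL connection URL, escaping the user and password."""
    credentials = "{}:{}".format(quote_plus(user), quote_plus(password))
    location = host if port is None else "{}:{}".format(host, port)
    return "postgresql://{}@{}/{}".format(credentials, location, database)


# Reproduces the local URL used in this notebook (empty password).
local_url = postgres_url("jakekirsch", "jakekirsch")

# Made-up remote example with characters that need escaping.
remote_url = postgres_url("me", "strava", password="p@ss/word",
                          host="db.example.com", port=5432)
```

The resulting string can be passed to `create_engine` exactly as in the cell above.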

### Define an activities table
I defined the table up front because I found that the fields vary from request to request.

In [9]:
# create table
drop_create_table = """
DROP TABLE IF EXISTS activities;

CREATE TABLE activities (
achievement_count text,
athlete_id text,
athlete_resource_state text,
athlete_count text,
average_cadence float,
average_heartrate float,
average_speed float,
average_temp float,
comment_count float,
commute boolean,
display_hide_heartrate_option boolean,
distance float,
device_watts text,
elapsed_time float,
elev_high float,
elev_low float,
end_latlng text,
external_id text,
flagged boolean,
from_accepted_tag boolean,
gear_id text,
has_heartrate boolean,
has_kudoed boolean,
heartrate_opt_out boolean,
id bigint PRIMARY KEY,
kudos_count text,
location_city text,
location_country text,
location_state text,
manual boolean,
map_id text,
map_resource_state text,
map_summary_polyline text,
max_heartrate float,
max_speed float,
moving_time float,
name text,
photo_count text,
pr_count text,
private boolean,
resource_state text,
start_date timestamp,
start_date_local timestamp,
start_latitude text,
start_latlng text,
start_longitude text,
timezone text,
total_elevation_gain float,
total_photo_count text,
trainer boolean,
type text,
upload_id text,
utc_offset float,
visibility text,
workout_type text
);"""


In [10]:
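# create the table (note: Engine.execute was removed in SQLAlchemy 2.0; there, run the statement
# through a connection instead: `with engine.begin() as conn: conn.execute(sqlalchemy.text(drop_create_table))`)
engine.execute(drop_create_table)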

<sqlalchemy.engine.result.ResultProxy at 0x112306710>

### Get total activity count
This request returns activity totals for each sport, which we sum to get the total number of activities. We use that count to set the loop for requesting all the information.

In [11]:
# get athlete stats
athlete_id = config.athlete_id
athlete_stats = requests.get('https://www.strava.com/api/v3/athletes/{}/stats'.format(athlete_id), 
                             headers = header)

In [12]:
# total count of activities
total_activities = sum([athlete_stats.json()['all_ride_totals']['count'], 
                        athlete_stats.json()['all_run_totals']['count'], 
                        athlete_stats.json()['all_swim_totals']['count']])

In [13]:
# params for api calls - we'll split this into pages because there is a limit to the per_page request
per_page = 100
(total_activities / per_page) + 1

4.2
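
The `4.2` above corresponds to 320 activities at 100 per page, which really means four pages of results. A sketch of the same arithmetic with whole numbers is below; the helper names and the example counts are made up for illustration, and only the `all_*_totals` keys come from the stats response used above.

```python
import math


def total_activity_count(stats, kinds=("ride", "run", "swim")):
    """Sum the lifetime counts for each activity kind in a stats payload."""
    return sum(stats["all_{}_totals".format(kind)]["count"] for kind in kinds)


def page_count(total_items, per_page):
    """Number of pages needed to cover total_items, rounding up."""
    return math.ceil(total_items / per_page)


# Made-up counts that add up to 320.
example_stats = {
    "all_ride_totals": {"count": 5},
    "all_run_totals": {"count": 310},
    "all_swim_totals": {"count": 5},
}
total = total_activity_count(example_stats)
pages = page_count(total, 100)
```

Keep in mind that this sum only covers rides, runs, and swims, so other activity types would not be counted.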

### Insert data using pandas
This section structures a request for a set of activities. Then we normalize the JSON payload into a pandas dataframe. Finally, we use the handy SQL methods of pandas to append the returned information to our activities Postgres table.

In [14]:
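# the count above only sums rides, runs, and swims, so request one extra page as a buffer
n_pages = int(np.ceil(total_activities / per_page)) + 1
for page in range(1, n_pages + 1):  # integer page numbers rather than numpy floats
    activities = requests.get('https://www.strava.com/api/v3/athlete/activities', headers = header, 
                                 params = {
                                     'page': page,
                                     'per_page':per_page
                                 })
    activities.raise_for_status()  # fail loudly on an error response instead of loading it
    activities_df = json_normalize(activities.json(),  sep="_")
    if activities_df.empty:  # an empty page means we are past the last activity
        break
    activities_df.to_sql('activities', con=engine, if_exists="append", index=False)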
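
An alternative that does not depend on knowing the total count is to keep requesting pages until one comes back empty. The sketch below separates the paging logic from the HTTP call so it can be tried without touching the API; `iter_activities`, `fetch_page`, and the fake data are all hypothetical names of my own.

```python
def iter_activities(fetch_page, per_page=100, max_pages=1000):
    """Yield activities page by page until an empty page is returned.

    fetch_page(page, per_page) must return a list of activity dicts.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        if not batch:
            return
        yield from batch


# Stand-in for the API: 250 made-up activities served in pages.
FAKE_ACTIVITIES = [{"id": number} for number in range(1, 251)]


def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return FAKE_ACTIVITIES[start:start + per_page]


all_activities = list(iter_activities(fake_fetch))
```

Against the real API, `fetch_page` would wrap the `requests.get` call from the loop above and return `activities.json()`.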
    

We can use the command line tools to test the result, or we can use pandas again to inspect it.

In [15]:
test = pd.read_sql_table('activities', con=engine)

In [16]:
test.head()

| | achievement_count | athlete_id | athlete_resource_state | athlete_count | average_cadence | average_heartrate | average_speed | average_temp | comment_count | commute | ... | start_longitude | timezone | total_elevation_gain | total_photo_count | trainer | type | upload_id | utc_offset | visibility | workout_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8633517 | 1 | 1 | 93.9 | 158.7 | 3.031 | 22.0 | 0.0 | False | ... | | (GMT-05:00) America/New_York | 0.0 | 0 | True | Run | 2266650521 | -18000.0 | everyone | |
| 1 | 7 | 8633517 | 1 | 1 | 89.7 | 161.2 | 3.078 | 20.0 | 0.0 | False | ... | -73.45 | (GMT-05:00) America/New_York | 210.0 | 0 | False | Run | 2266650344 | -18000.0 | everyone | |
| 2 | 0 | 8633517 | 1 | 1 | 87.2 | 152.7 | 2.403 | 22.0 | 0.0 | False | ... | -73.45 | (GMT-05:00) America/New_York | 31.0 | 0 | False | Run | 2266649807 | -18000.0 | everyone | |
| 3 | 4 | 8633517 | 1 | 1 | 94.5 | 169.1 | 4.041 | 10.0 | 0.0 | False | ... | -73.46 | (GMT-05:00) America/New_York | 121.0 | 0 | False | Run | 2257236739 | -18000.0 | everyone | |
| 4 | 2 | 8633517 | 1 | 1 | 93.7 | 164.0 | 3.431 | 21.0 | 0.0 | False | ... | -73.45 | (GMT-05:00) America/New_York | 104.0 | 0 | False | Run | 2254958593 | -18000.0 | everyone | |


*Note that the `DataFrame.to_sql` method has a limitation: we can't easily use it to update existing rows or to upload only non-duplicates when a primary key is defined on the table. There are quite a few solutions on SO, but the most robust seems to be learning how to use the ORM capabilities of sqlalchemy to create a data pipeline. This would be less SQL code and more Python, but the underlying concepts are similar.*
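
One of the plain-SQL options is an "upsert": `INSERT ... ON CONFLICT (id) DO UPDATE`, which PostgreSQL supports. The sketch below demonstrates the statement against an in-memory SQLite database, purely as a stand-in so it runs without a Postgres server (recent SQLite versions accept the same clause). The three-column table is a cut-down, made-up version of `activities`, and `upsert_activities` is a hypothetical helper, not something from pandas or sqlalchemy.

```python
import sqlite3

# Insert a row, or update the existing row when the id is already present.
UPSERT_SQL = """
INSERT INTO activities (id, name, kudos_count)
VALUES (:id, :name, :kudos_count)
ON CONFLICT (id) DO UPDATE SET
    name = excluded.name,
    kudos_count = excluded.kudos_count
"""


def upsert_activities(conn, rows):
    """Insert or update each row (a dict) inside a single transaction."""
    with conn:
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activities (id INTEGER PRIMARY KEY, name TEXT, kudos_count INTEGER)"
)

# First load, then a second load that repeats id 1 with new data and adds id 2.
upsert_activities(conn, [{"id": 1, "name": "Morning Run", "kudos_count": 2}])
upsert_activities(conn, [
    {"id": 1, "name": "Morning Run", "kudos_count": 5},
    {"id": 2, "name": "Evening Run", "kudos_count": 0},
])

rows = conn.execute(
    "SELECT id, name, kudos_count FROM activities ORDER BY id"
).fetchall()
```

The second load updates the existing row instead of failing on the primary key, which is the behavior that is hard to get from `to_sql` with `if_exists="append"`.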

## Summary 

That's it! In summary, we accomplished the following.

1. Generated keys for the Strava API
2. Refreshed the API access token when needed
3. Defined a table in PostgreSQL
4. Requested activity data from Strava
5. Used the SQL methods of Pandas to insert into our Postgres table
6. Used the SQL methods of Pandas to retrieve data from the activities table

Hopefully this short walk-through gives you enough familiarity with some of the tools needed to get started working with the Strava API.

The next post will cover some examples creating visualizations using your Strava activity data. Because we need the SQL practice, we will challenge ourselves to do all of the data processing using SQL commands!