# Programming Project - Unit 2
*by Débora Azevedo, Eliseu Jayro, Francisco de Paiva and Igor Brandão*

**Goals**
The purpose of this project is to explore the following:

- Access the Health Graph API (Runkeeper) content;
- Perform a geolocation analysis, explaining the hypotheses in detail.

<hr>

# Global Imports section

Import the libraries needed to handle:

- Data frames and numeric data (pandas, NumPy);
- The tqdm progress bar;
- HTTP requests and JSON parsing (requests, json);
- Markdown output (IPython.display, tabulate).

In [None]:
### Libraries necessary to run this IPython Notebook
!pip install tqdm
!pip install tabulate
!pip install pandas-datareader
!pip install requests

In [1]:
# Import pandas
import pandas as pd

# Import numpy library
import numpy as np

# Import the tqdm progress bar
from tqdm import tqdm

# Import API libraries
import requests
import json
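# pandas >= 1.0 exposes json_normalize at the top level;
# on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize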

# Imports to output the result as a Markdown
from IPython.display import display, Markdown
from tabulate import tabulate

<hr>

# I - API section

## API data retrieving

#### In the cell below, we connect to the Health Graph API (Runkeeper).

In [2]:
# Access token
ACCESS_TOKEN = '25bc30d6dd6f4b99bbeb48e8619103b4'

# Base URI
api_URI = "http://api.runkeeper.com/fitnessActivities"

# Number of results
pageSize = 5000

# Final URI
url = '%s?pageSize=%s&access_token=%s' % \
    (api_URI, pageSize, ACCESS_TOKEN,)

# print(url)

# Receive the results from API
api_content = requests.get(url).json()

print(json.dumps(api_content, indent=1))

{
 "items": [
  {
   "entry_mode": "API", 
   "utc_offset": -3, 
   "tracking_mode": "outdoor", 
   "total_calories": 256, 
   "start_time": "Fri, 26 Oct 2018 18:03:16", 
   "uri": "/fitnessActivities/1248977291", 
   "source": "RunKeeper", 
   "total_distance": 8151.82599416682, 
   "duration": 2272, 
   "type": "Cycling", 
   "has_path": true
  }, 
  {
   "entry_mode": "API", 
   "utc_offset": -3, 
   "tracking_mode": "outdoor", 
   "total_calories": 193, 
   "start_time": "Fri, 26 Oct 2018 07:56:50", 
   "uri": "/fitnessActivities/1248977286", 
   "source": "RunKeeper", 
   "total_distance": 7900.92514314074, 
   "duration": 1555, 
   "type": "Cycling", 
   "has_path": true
  }, 
  {
   "entry_mode": "API", 
   "utc_offset": -3, 
   "tracking_mode": "outdoor", 
   "total_calories": 127, 
   "start_time": "Wed, 24 Oct 2018 12:33:35", 
   "uri": "/fitnessActivities/1247963081", 
   "source": "RunKeeper", 
   "total_distance": 4127.19294558563, 
   "duration": 1123, 
   "type": "Cyclin
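
The request URL in the cell above is assembled by string formatting with the token written in the notebook. As a minimal sketch of an alternative, the query string can be URL-encoded with the standard library and the token read from the environment. The helper `build_activities_url` and the environment variable name `RUNKEEPER_ACCESS_TOKEN` are hypothetical names chosen for illustration; only the endpoint and the two query parameters come from the cell above.

```python
import os
from urllib.parse import urlencode

# Same endpoint as the cell above
API_URI = "http://api.runkeeper.com/fitnessActivities"


def build_activities_url(access_token, page_size=5000):
    """Return the fitnessActivities URL with URL-encoded query parameters."""
    query = urlencode({"pageSize": page_size, "access_token": access_token})
    return "%s?%s" % (API_URI, query)


# RUNKEEPER_ACCESS_TOKEN is a hypothetical environment variable name
token = os.environ.get("RUNKEEPER_ACCESS_TOKEN", "<your-token>")
url = build_activities_url(token)
```

The resulting `url` can be passed to `requests.get(url).json()` exactly as before.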

## JSON to Data Frame conversion

To make the data easier to manipulate, the next cell converts the JSON returned by the API into a pandas data frame.

In [3]:
# Perform a conversion from JSON to Data Frame
api_df = json_normalize(api_content['items'])

# Convert the duration from seconds to minutes, rounded to two decimals
api_df['duration'] = (api_df['duration'] / 60).round(2)

# Round the distance to two decimals
api_df['total_distance'] = api_df['total_distance'].round(2)

api_df

Unnamed: 0,duration,entry_mode,has_path,source,start_time,total_calories,total_distance,tracking_mode,type,uri,utc_offset
0,37.87,API,True,RunKeeper,"Fri, 26 Oct 2018 18:03:16",256,8151.83,outdoor,Cycling,/fitnessActivities/1248977291,-3
1,25.92,API,True,RunKeeper,"Fri, 26 Oct 2018 07:56:50",193,7900.93,outdoor,Cycling,/fitnessActivities/1248977286,-3
2,18.72,API,True,RunKeeper,"Wed, 24 Oct 2018 12:33:35",127,4127.19,outdoor,Cycling,/fitnessActivities/1247963081,-3
3,24.63,API,True,RunKeeper,"Wed, 24 Oct 2018 07:22:57",201,7933.17,outdoor,Cycling,/fitnessActivities/1247963047,-3
4,20.37,API,True,RunKeeper,"Tue, 23 Oct 2018 13:15:36",158,5281.13,outdoor,Cycling,/fitnessActivities/1247963042,-3
5,24.95,API,True,RunKeeper,"Fri, 19 Oct 2018 07:30:10",203,7546.29,outdoor,Cycling,/fitnessActivities/1247963039,-3
6,33.05,API,True,RunKeeper,"Wed, 17 Oct 2018 15:20:50",252,7992.05,outdoor,Cycling,/fitnessActivities/1247963036,-3
7,25.77,API,True,RunKeeper,"Wed, 17 Oct 2018 07:58:51",196,7807.02,outdoor,Cycling,/fitnessActivities/1247963033,-3
8,41.90,API,True,RunKeeper,"Tue, 9 Oct 2018 13:42:13",282,8791.48,outdoor,Cycling,/fitnessActivities/1241059614,-3
9,24.40,API,True,RunKeeper,"Tue, 9 Oct 2018 07:01:20",195,7944.65,outdoor,Cycling,/fitnessActivities/1241059598,-3
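
The two conversions in the cell above are plain arithmetic, so they can be spot-checked without pandas. The sketch below applies them to the first item of the API response shown earlier. The helper name `convert_activity` is hypothetical, and it assumes `duration` is reported in seconds, as the cell's comment states.

```python
def convert_activity(item):
    """Return a copy of an activity with duration in minutes and a rounded distance."""
    converted = dict(item)
    converted["duration"] = round(item["duration"] / 60, 2)
    converted["total_distance"] = round(item["total_distance"], 2)
    return converted


# Values copied from the first item of the API response
sample = {"duration": 2272, "total_distance": 8151.82599416682, "type": "Cycling"}
result = convert_activity(sample)
print(result["duration"], result["total_distance"])  # 37.87 8151.83
```

The printed values match row 0 of the table above.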


## Data export [optional]

To inspect the data in a spreadsheet application such as Excel, the cell below exports it to a CSV file.

In [None]:
# Export the new dataSet to csv
api_df.to_csv('dataSource.csv', encoding="utf-8")

<hr>

# II - Activities by period

#### First, we need to split the information by period. To achieve that, the idea is to select groups of rows by matching partial strings in the timestamp column.

## Timestamp split into columns

To make the data easier to handle by period, the timestamp column needs to be split into separate columns.

In [4]:
# =================================================================================
# Dataframe timestamp split
# =================================================================================

# Copy the data
data_with_period = api_df.copy()
data_with_period["month"] = 0
data_with_period["month_index"] = 0
data_with_period["year"] = 0

# Fill the years
data_with_period.loc[data_with_period['start_time'].str.contains('2018'), 'year'] = '2018'
data_with_period.loc[data_with_period['start_time'].str.contains('2017'), 'year'] = '2017'
data_with_period.loc[data_with_period['start_time'].str.contains('2016'), 'year'] = '2016'
data_with_period.loc[data_with_period['start_time'].str.contains('2015'), 'year'] = '2015'
data_with_period.loc[data_with_period['start_time'].str.contains('2014'), 'year'] = '2014'

# Fill the months
data_with_period.loc[data_with_period['start_time'].str.contains('Jan'), 'month'] = 'Jan'
data_with_period.loc[data_with_period['start_time'].str.contains('Jan'), 'month_index'] = '1'

data_with_period.loc[data_with_period['start_time'].str.contains('Feb'), 'month'] = 'Feb'
data_with_period.loc[data_with_period['start_time'].str.contains('Feb'), 'month_index'] = '2'

data_with_period.loc[data_with_period['start_time'].str.contains('Mar'), 'month'] = 'Mar'
data_with_period.loc[data_with_period['start_time'].str.contains('Mar'), 'month_index'] = '3'

data_with_period.loc[data_with_period['start_time'].str.contains('Apr'), 'month'] = 'Apr'
data_with_period.loc[data_with_period['start_time'].str.contains('Apr'), 'month_index'] = '4'

data_with_period.loc[data_with_period['start_time'].str.contains('May'), 'month'] = 'May'
data_with_period.loc[data_with_period['start_time'].str.contains('May'), 'month_index'] = '5'

data_with_period.loc[data_with_period['start_time'].str.contains('Jun'), 'month'] = 'Jun'
data_with_period.loc[data_with_period['start_time'].str.contains('Jun'), 'month_index'] = '6'

data_with_period.loc[data_with_period['start_time'].str.contains('Jul'), 'month'] = 'Jul'
data_with_period.loc[data_with_period['start_time'].str.contains('Jul'), 'month_index'] = '7'

data_with_period.loc[data_with_period['start_time'].str.contains('Aug'), 'month'] = 'Aug'
data_with_period.loc[data_with_period['start_time'].str.contains('Aug'), 'month_index'] = '8'

data_with_period.loc[data_with_period['start_time'].str.contains('Sep'), 'month'] = 'Sep'
data_with_period.loc[data_with_period['start_time'].str.contains('Sep'), 'month_index'] = '9'

data_with_period.loc[data_with_period['start_time'].str.contains('Oct'), 'month'] = 'Oct'
data_with_period.loc[data_with_period['start_time'].str.contains('Oct'), 'month_index'] = '10'

data_with_period.loc[data_with_period['start_time'].str.contains('Nov'), 'month'] = 'Nov'
data_with_period.loc[data_with_period['start_time'].str.contains('Nov'), 'month_index'] = '11'

data_with_period.loc[data_with_period['start_time'].str.contains('Dec'), 'month'] = 'Dec'
data_with_period.loc[data_with_period['start_time'].str.contains('Dec'), 'month_index'] = '12'
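
The string matching above covers only the years and months that are listed explicitly. As an alternative sketch, not the notebook's own method, the same three values can be derived by parsing the timestamp. The helper name `split_timestamp` is hypothetical, and the format string assumes every `start_time` has the layout seen in the table (for example `Fri, 26 Oct 2018 18:03:16`) and that the default English month and weekday names are in effect.

```python
from datetime import datetime

# Assumed layout of start_time, e.g. "Tue, 9 Oct 2018 13:42:13"
START_TIME_FORMAT = "%a, %d %b %Y %H:%M:%S"


def split_timestamp(start_time):
    """Return year, month abbreviation and month number as strings."""
    parsed = datetime.strptime(start_time, START_TIME_FORMAT)
    return {
        "year": str(parsed.year),
        "month": parsed.strftime("%b"),
        "month_index": str(parsed.month),
    }


print(split_timestamp("Tue, 9 Oct 2018 13:42:13"))
```

The values are returned as strings so that they stay comparable with the string-based columns built above. In pandas, `pd.to_datetime` with the same format string gives a vectorized equivalent.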

## Data export [optional]

To avoid repeating the timestamp split, we export the new dataset to a CSV file. You can skip this step because the resulting .csv dataset is already included in the project.

In [None]:
# Export the new dataSet to csv
data_with_period.to_csv('dataSourceByPeriod.csv', encoding="utf-8")

## Filter the period [optional]

In [None]:
# Filter param
year = '2017'

# Copy the data with the timestamp already split
data_filter = data_with_period.copy()

# Perform the filter
data_filter = data_filter[data_filter['year'] == year]

# Export the new dataSet to csv
data_filter.to_csv('dataSourceByPeriod' + year + '.csv', encoding="utf-8")

<hr>

# III - Geolocation section

#### In this section, we handle the geolocation information.

## Geolocation data import [warning: too heavy]

The cell below performs multiple requests to the Runkeeper Health Graph API to retrieve the details of each activity and extract location information such as latitude and longitude.

In [10]:
# =================================================================================
# Geolocation data
# =================================================================================

# Empty data frame that accumulates the path points
geolocation_data = pd.DataFrame()

# =================================================================================
# API single activity request
# =================================================================================

from itertools import islice

# Base URI
base_URI = "http://api.runkeeper.com"

# Loop parameters
start = 499
stop = 585
step = 1

# Run through the selected slice of activities
for idx, row in islice(data_with_period.iterrows(), start, stop, step):
    # Final activity URI
    activity_url = base_URI + row['uri'] + "?access_token=" + ACCESS_TOKEN

    # Receive the results from API
    activity_content = requests.get(activity_url).json()

    # Perform a conversion from JSON to Data Frame
    if 'path' in activity_content:
        activity_df = json_normalize(activity_content['path'])
        activity_df['datetime'] = row['start_time']
        activity_df['year'] = row['year']
        activity_df['month'] = row['month_index']
    elif 'limit' in activity_content:
        # Reached the API limit (need to wait)
        print('The API reached the 15 minutes limit')
        break
    else:
        # No path data for this activity: skip it instead of
        # re-adding the previous activity's points
        continue

    # Add the activity path data to geolocation
    geolocation_data = pd.concat([geolocation_data, activity_df])
    
geolocation_data

Unnamed: 0,altitude,latitude,longitude,timestamp,type,datetime,year,month
0,51.194613,-5.832384,-35.204734,0.000,start,"Mon, 21 Nov 2016 16:16:48",2016,11
1,51.270707,-5.832340,-35.204739,3.801,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
2,51.350505,-5.832283,-35.204851,8.806,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
3,51.432772,-5.832272,-35.204768,10.955,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
4,51.516768,-5.832201,-35.204724,17.954,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
5,51.602020,-5.832112,-35.204726,24.957,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
6,51.750781,-5.832040,-35.204685,38.948,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
7,51.916070,-5.832085,-35.204604,45.018,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
8,52.081359,-5.832103,-35.204508,51.024,gps,"Mon, 21 Nov 2016 16:16:48",2016,11
9,52.209917,-5.832119,-35.204423,56.953,gps,"Mon, 21 Nov 2016 16:16:48",2016,11


## Geolocation data import by period [less heavy]

#### Note: Execute the dataSet filter first

The cell below performs multiple requests to the Runkeeper Health Graph API to retrieve the details of each filtered activity and extract location information such as latitude and longitude.

In [None]:
# =================================================================================
# Geolocation data
# =================================================================================

# Empty data frame that accumulates the path points
geolocation_data = pd.DataFrame()

# =================================================================================
# API single activity request
# =================================================================================

# Base URI
base_URI = "http://api.runkeeper.com"

# Run through filtered activities
for idx, row in tqdm(data_filter.iterrows(), total=len(data_filter)):
    
    # Final activity URI
    activity_url = base_URI + row['uri'] + "?access_token=" + ACCESS_TOKEN

    # Receive the results from API
    activity_content = requests.get(activity_url).json()

    # Perform a conversion from JSON to Data Frame
    if 'path' in activity_content:
        activity_df = json_normalize(activity_content['path'])
        activity_df['datetime'] = row['start_time']
        activity_df['year'] = row['year']
        activity_df['month'] = row['month_index']

        # Add the activity path data to geolocation
        # (only when this activity actually returned a path)
        geolocation_data = pd.concat([geolocation_data, activity_df])
    
geolocation_data
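
The core of the loop above is flattening each activity's `path` list into rows tagged with the activity's period fields. As a minimal sketch of that step in plain Python, the hypothetical helper `path_to_rows` below does the same for one response. The sample values are copied from the geolocation output shown earlier, and the `uri` is a made-up placeholder.

```python
def path_to_rows(activity_content, row):
    """Return one dict per path point, tagged with the activity's period fields."""
    rows = []
    for point in activity_content.get("path", []):
        tagged = dict(point)
        tagged["datetime"] = row["start_time"]
        tagged["year"] = row["year"]
        tagged["month"] = row["month_index"]
        rows.append(tagged)
    return rows


# Sample response and activity row built from values in the output above
sample_content = {
    "path": [
        {"altitude": 51.194613, "latitude": -5.832384, "longitude": -35.204734,
         "timestamp": 0.0, "type": "start"},
        {"altitude": 51.270707, "latitude": -5.832340, "longitude": -35.204739,
         "timestamp": 3.801, "type": "gps"},
    ]
}
sample_row = {"start_time": "Mon, 21 Nov 2016 16:16:48", "year": "2016",
              "month_index": "11", "uri": "/fitnessActivities/0"}  # placeholder uri

rows = path_to_rows(sample_content, sample_row)
print(len(rows), rows[0]["type"], rows[0]["month"])
```

Collecting the rows of every activity in one list and calling `pd.DataFrame(all_rows)` once at the end avoids copying the growing data frame on each `pd.concat`. A response without a `path` key simply contributes no rows.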

## Data export [optional]

To avoid re-running the cell above, we save the processed geolocation data here.

In [11]:
# Export the new dataSet to csv
geolocation_data.to_csv('geolocation006.csv', encoding="utf-8")