# Programming Project - Unit 2
*by Débora Azevedo, Eliseu Jayro, Francisco de Paiva and Igor Brandão*

**Goals**
The purpose of this project is to explore the following:

- Access the Health Graph API (Runkeeper) content;
- Analyze the geolocation data, explaining each hypothesis in detail.

<hr>

# Global Imports section

Import the libraries needed to handle:

- File input;
- tqdm progress bar;
- Requests;
- urlopen;
- HTTPError.

In [None]:
# Libraries needed to run this IPython Notebook
!pip install tqdm
!pip install tabulate
!pip install pandas-datareader
!pip install requests

In [None]:
# Import pandas
import pandas as pd

# Import numpy library
import numpy as np

# Import tqdm progressing bar plugin
from tqdm import tqdm

# Import API libraries
import requests
import json
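# json_normalize is exposed at the top level since pandas 1.0;
# the older pandas.io.json import path is deprecated
from pandas import json_normalize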

# Imports to output the result as a Markdown
from IPython.display import display, Markdown
from tabulate import tabulate

<hr>

# I - API section

## API data retrieving

#### In the cell below, we connect to the Health Graph API (Runkeeper).

In [None]:
# Access token
ACCESS_TOKEN = '25bc30d6dd6f4b99bbeb48e8619103b4'

# Base URI
api_URI = "http://api.runkeeper.com/fitnessActivities"

# Number of results
pageSize = 5000

# Final URI
url = '%s?pageSize=%s&access_token=%s' % \
    (api_URI, pageSize, ACCESS_TOKEN,)

# print(url)

# Receive the results from API
api_content = requests.get(url).json()

print(json.dumps(api_content, indent=1))
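
As a hedged alternative to formatting the URL by hand, the query string can be built with `urllib.parse.urlencode`, and HTTP failures can be caught explicitly with `urlopen` and `HTTPError`. This is a minimal sketch using only the standard library; `build_activities_url`, `fetch_json`, and the `YOUR_TOKEN` placeholder are illustrative names, not part of the project.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def build_activities_url(base_uri, page_size, access_token):
    """Return the list URL with properly escaped query parameters."""
    query = urlencode({"pageSize": page_size, "access_token": access_token})
    return f"{base_uri}?{query}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body; return None on an HTTP error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except HTTPError as error:
        print(f"Request failed with status {error.code}")
        return None


# YOUR_TOKEN is a placeholder, not a real credential
example_url = build_activities_url(
    "http://api.runkeeper.com/fitnessActivities", 5000, "YOUR_TOKEN"
)
print(example_url)
```

Calling `fetch_json(example_url)` with a valid token would then return the decoded response, or `None` when the server answers with an error status.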

## JSON to Data Frame conversion

To make the data easier to manipulate, the next cell converts the JSON returned by the API into a pandas data frame.

In [None]:
# Perform a conversion from JSON to Data Frame
api_df = json_normalize(api_content['items'])

# Convert the duration from seconds to minutes, rounded to 2 decimals
api_df['duration'] = (api_df['duration'] / 60).round(2)

# Round the distance
api_df['total_distance'] = api_df['total_distance'].round(2)

api_df
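
To see what the conversion does without calling the API, the sketch below applies the same steps to a small hand-written payload. The payload is invented for illustration and only assumes the `items`, `uri`, `duration`, and `total_distance` fields used in this notebook; the real response carries more fields. It uses `pd.json_normalize`, which assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical payload shaped like the API response used above
sample_content = {
    "items": [
        {"uri": "/fitnessActivities/1", "duration": 1800.0, "total_distance": 5012.3456},
        {"uri": "/fitnessActivities/2", "duration": 1865.0, "total_distance": 1234.5678},
    ]
}

# One row per item, one column per key
sample_df = pd.json_normalize(sample_content["items"])

# Seconds to minutes, then round both numeric columns to 2 decimals
sample_df["duration"] = (sample_df["duration"] / 60).round(2)
sample_df["total_distance"] = sample_df["total_distance"].round(2)

print(sample_df)
```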

## Data export [optional]

To inspect the data in a spreadsheet application such as Excel, the cell below exports it to a CSV file.

In [None]:
# Export the new dataSet to csv
api_df.to_csv('dataSource.csv', encoding="utf-8")

<hr>

# II - Activities by period

#### First, we need to split the information by period. The idea is to select groups of rows by matching partial strings in the timestamp column (`start_time`).

## Timestamp split into columns

To make the data easier to handle by period, we split the timestamp column into separate columns.

In [None]:
# =================================================================================
# Dataframe timestamp split
# =================================================================================

# Copy the data
data_by_period = api_df.copy()
data_by_period["month"] = 0
data_by_period["month_index"] = 0
data_by_period["year"] = 0

# Fill the years
data_by_period.loc[data_by_period['start_time'].str.contains('2018'), 'year'] = '2018'
data_by_period.loc[data_by_period['start_time'].str.contains('2017'), 'year'] = '2017'
data_by_period.loc[data_by_period['start_time'].str.contains('2016'), 'year'] = '2016'
data_by_period.loc[data_by_period['start_time'].str.contains('2015'), 'year'] = '2015'
data_by_period.loc[data_by_period['start_time'].str.contains('2014'), 'year'] = '2014'

# Fill the months
data_by_period.loc[data_by_period['start_time'].str.contains('Jan'), 'month'] = 'Jan'
data_by_period.loc[data_by_period['start_time'].str.contains('Jan'), 'month_index'] = '1'

data_by_period.loc[data_by_period['start_time'].str.contains('Feb'), 'month'] = 'Feb'
data_by_period.loc[data_by_period['start_time'].str.contains('Feb'), 'month_index'] = '2'

data_by_period.loc[data_by_period['start_time'].str.contains('Mar'), 'month'] = 'Mar'
data_by_period.loc[data_by_period['start_time'].str.contains('Mar'), 'month_index'] = '3'

data_by_period.loc[data_by_period['start_time'].str.contains('Apr'), 'month'] = 'Apr'
data_by_period.loc[data_by_period['start_time'].str.contains('Apr'), 'month_index'] = '4'

data_by_period.loc[data_by_period['start_time'].str.contains('May'), 'month'] = 'May'
data_by_period.loc[data_by_period['start_time'].str.contains('May'), 'month_index'] = '5'

data_by_period.loc[data_by_period['start_time'].str.contains('Jun'), 'month'] = 'Jun'
data_by_period.loc[data_by_period['start_time'].str.contains('Jun'), 'month_index'] = '6'

data_by_period.loc[data_by_period['start_time'].str.contains('Jul'), 'month'] = 'Jul'
data_by_period.loc[data_by_period['start_time'].str.contains('Jul'), 'month_index'] = '7'

data_by_period.loc[data_by_period['start_time'].str.contains('Aug'), 'month'] = 'Aug'
data_by_period.loc[data_by_period['start_time'].str.contains('Aug'), 'month_index'] = '8'

data_by_period.loc[data_by_period['start_time'].str.contains('Sep'), 'month'] = 'Sep'
data_by_period.loc[data_by_period['start_time'].str.contains('Sep'), 'month_index'] = '9'

data_by_period.loc[data_by_period['start_time'].str.contains('Oct'), 'month'] = 'Oct'
data_by_period.loc[data_by_period['start_time'].str.contains('Oct'), 'month_index'] = '10'

data_by_period.loc[data_by_period['start_time'].str.contains('Nov'), 'month'] = 'Nov'
data_by_period.loc[data_by_period['start_time'].str.contains('Nov'), 'month_index'] = '11'

data_by_period.loc[data_by_period['start_time'].str.contains('Dec'), 'month'] = 'Dec'
data_by_period.loc[data_by_period['start_time'].str.contains('Dec'), 'month_index'] = '12'
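
The same three columns can also be derived by parsing the timestamp once instead of matching substrings. The sketch below assumes `start_time` looks like `Fri, 5 Oct 2018 07:00:00` (an assumption about the API format; adjust `TIME_FORMAT` if the real strings differ) and runs on made-up rows. Note that it yields integers for `year` and `month_index`, whereas the cell above stores strings, so a later filter such as `year == '2018'` would need to compare against `2018` instead.

```python
import pandas as pd

# Assumed layout of start_time; change this if the API returns another format
TIME_FORMAT = "%a, %d %b %Y %H:%M:%S"

# Made-up rows standing in for api_df
sample = pd.DataFrame(
    {"start_time": ["Fri, 5 Oct 2018 07:00:00", "Sun, 15 Jan 2017 18:30:00"]}
)

# Parse every timestamp once, then read the parts from the datetime accessor
parsed = pd.to_datetime(sample["start_time"], format=TIME_FORMAT)

sample["year"] = parsed.dt.year
sample["month_index"] = parsed.dt.month
sample["month"] = parsed.dt.strftime("%b")  # abbreviated month name

print(sample[["year", "month_index", "month"]])
```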

## Data export [optional]

To avoid repeating the timestamp split, we export the new dataset to a CSV file. You can skip this step because the resulting .csv file is already included in the project.

In [None]:
# Export the new dataSet to csv
data_by_period.to_csv('dataSourceByPeriod.csv', encoding="utf-8")

## Filter the period [optional]

In [63]:
# Filter param
year = '2018'

# Copy the data with the timestamp already split
data_filter = data_by_period.copy()

# Perform the filter
data_filter = data_filter[data_filter['year'] == year]

# Export the new dataSet to csv
data_filter.to_csv('dataSourceByPeriod' + year + '.csv', encoding="utf-8")

<hr>

# III - Geolocation section

#### In this section, we handle the geolocation information.

## Geolocation data import [warning: too heavy]

The cell below performs multiple requests to the Runkeeper Health Graph API to retrieve each activity in detail and extract location information such as latitude and longitude.

In [None]:
# =================================================================================
# Geolocation data
# =================================================================================

# Start with an empty data frame
geolocation_data = pd.DataFrame()

# =================================================================================
# API single activity request
# =================================================================================

# Base URI
base_URI = "http://api.runkeeper.com"

# Run through all activities
for idx, row in tqdm(api_df.iterrows()):
    
    # Final activity URI
    activity_url = base_URI + row['uri'] + "?access_token=" + ACCESS_TOKEN

    # Receive the results from API
    activity_content = requests.get(activity_url).json()

    # Convert the activity path from JSON to a Data Frame and append it;
    # activities without a 'path' entry are skipped
    if 'path' in activity_content:
        activity_df = json_normalize(activity_content['path'])
        geolocation_data = pd.concat([geolocation_data, activity_df])
    
geolocation_data
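
Growing a data frame with `pd.concat` inside the loop copies all previous rows on every iteration. A lighter pattern is to collect the path points in a plain list and build the frame once at the end. The sketch below shows the idea with an injected `fetch` function, so it runs without network access; `collect_path_points`, `fake_fetch`, and the sample responses are hypothetical.

```python
def collect_path_points(activity_uris, fetch):
    """Gather the 'path' points of every activity into one flat list.

    `fetch` is any callable that maps an activity URI to its decoded JSON.
    Activities without a 'path' entry are skipped.
    """
    points = []
    for uri in activity_uris:
        content = fetch(uri)
        for point in content.get("path", []):
            # Keep the source activity so points can be grouped later
            points.append({"uri": uri, **point})
    return points


def fake_fetch(uri):
    """Stand-in for the real API call, returning made-up responses."""
    responses = {
        "/fitnessActivities/1": {
            "path": [
                {"latitude": -5.83, "longitude": -35.20, "type": "start"},
                {"latitude": -5.84, "longitude": -35.21, "type": "gps"},
            ]
        },
        "/fitnessActivities/2": {},  # no GPS track recorded
    }
    return responses[uri]


points = collect_path_points(
    ["/fitnessActivities/1", "/fitnessActivities/2"], fake_fetch
)
print(len(points))
```

With the real data, `pd.DataFrame(points)` would turn the list into a data frame in a single step.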

## Geolocation data import by period [less heavy]

#### Note: Execute the dataSet filter first

The cell below performs multiple requests to the Runkeeper Health Graph API to retrieve each filtered activity in detail and extract location information such as latitude and longitude.

In [64]:
# =================================================================================
# Geolocation data
# =================================================================================

# Start with an empty data frame
geolocation_data = pd.DataFrame()

# =================================================================================
# API single activity request
# =================================================================================

# Base URI
base_URI = "http://api.runkeeper.com"

# Run through filtered activities
for idx, row in tqdm(data_filter.iterrows()):
    
    # Final activity URI
    activity_url = base_URI + row['uri'] + "?access_token=" + ACCESS_TOKEN

    # Receive the results from API
    activity_content = requests.get(activity_url).json()

    # Convert the activity path from JSON to a Data Frame and append it;
    # activities without a 'path' entry are skipped
    if 'path' in activity_content:
        activity_df = json_normalize(activity_content['path'])
        geolocation_data = pd.concat([geolocation_data, activity_df])
    
geolocation_data

291it [03:38,  1.33it/s]


| Unnamed: 0 | altitude | latitude | longitude | timestamp | type |
|---|---|---|---|---|---|
| 0 | 55.441733 | -5.832449 | -35.204543 | 0.000 | start |
| 1 | 55.352654 | -5.832374 | -35.204625 | 38.993 | gps |
| 2 | 55.251754 | -5.832304 | -35.204703 | 40.985 | gps |
| 3 | 55.132872 | -5.832217 | -35.204758 | 43.000 | gps |
| 4 | 54.992312 | -5.832138 | -35.204784 | 45.916 | gps |
| 5 | 54.844251 | -5.832057 | -35.204775 | 60.920 | gps |
| 6 | 54.601827 | -5.832023 | -35.204699 | 64.918 | gps |
| 7 | 54.322016 | -5.832027 | -35.204600 | 67.917 | gps |
| 8 | 54.017264 | -5.832073 | -35.204480 | 70.919 | gps |
| 9 | 53.704132 | -5.832101 | -35.204392 | 72.916 | gps |


## Data export [optional]

To avoid rerunning the cell above, we save the processed geolocation data here.

In [66]:
# Export the new dataSet to csv
geolocation_data.to_csv('geolocation2018.csv', encoding="utf-8")
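
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read with `pd.read_csv`. The sketch below, using in-memory buffers and made-up values instead of the real file, shows the difference and how `index=False` avoids it.

```python
import io

import pandas as pd

# Made-up rows standing in for geolocation_data
frame = pd.DataFrame({"latitude": [-5.83, -5.84], "longitude": [-35.20, -35.21]})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# Export without the index: the columns round-trip unchanged
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Whether to drop the index is a design choice: here the index restarts at 0 for each appended activity, so it may carry little information worth keeping.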