## Load Libraries

In [2]:
import pandas as pd
import numpy as np
from pandas import Timestamp
import os
from datetime import datetime, timedelta

The main purpose of collecting **Moves** data was to cross-check the results from the **Garmin** and to have a backup in case data collection from other sources failed. For example, we may use the calories burnt from the **Moves** app instead of those from **LifeSum**, which sourced its activities from **Google Fit** and whose data turned out to be really unreliable.

### Keep only the days of the trip inside the directory

In [3]:
# Read the directories with the data and save the file names in two lists
path_to_places = 'python_data/moves_angelos/moves_export/csv/daily/places/'
path_to_summary = 'python_data/moves_angelos/moves_export/csv/daily/summary/'

# Sort the names so the files are read in chronological order (os.listdir gives no ordering guarantee)
csv_files_places = sorted(single_csv for single_csv in os.listdir(path_to_places) if single_csv.endswith('.csv'))
csv_files_summary = sorted(single_csv for single_csv in os.listdir(path_to_summary) if single_csv.endswith('.csv'))

In [4]:
# Check if filenames are parsed correctly
print(csv_files_places[:5])
print(csv_files_summary[:5])

['places_20170705.csv', 'places_20170706.csv', 'places_20170707.csv', 'places_20170708.csv', 'places_20170709.csv']
['summary_20170705.csv', 'summary_20170706.csv', 'summary_20170707.csv', 'summary_20170708.csv', 'summary_20170709.csv']
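
The listing above returns every daily export in the directory. If the export also contains days outside the trip, the file names can be filtered by the date embedded in them. A minimal sketch, assuming the `<prefix>_YYYYMMDD.csv` naming seen in the output above; the helper name `files_in_range` and the end date used in the example are illustrative assumptions, not values from the original notebook.

```python
from datetime import datetime

def files_in_range(file_names, first_day, last_day):
    """Return the file names whose embedded date lies in [first_day, last_day]."""
    kept = []
    for name in sorted(file_names):
        # 'places_20170705.csv' -> '20170705'
        stamp = name.rsplit('.', 1)[0].split('_')[-1]
        try:
            day = datetime.strptime(stamp, '%Y%m%d')
        except ValueError:
            continue  # skip files that do not follow the naming pattern
        if first_day <= day <= last_day:
            kept.append(name)
    return kept

sample = ['places_20170704.csv', 'places_20170705.csv', 'places_20170706.csv', 'notes.csv']
# The end date below is a placeholder chosen only for this example
trip_files = files_in_range(sample, datetime(2017, 7, 5), datetime(2017, 7, 6))
print(trip_files)
```

Filtering on the parsed date rather than on string prefixes keeps the check valid across month boundaries.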


## Useful Functions

In [5]:
# Adds a number to index
def iterNo(d):
    return d + 1

In [6]:
# Transforms seconds to an hours:minutes:seconds string
def secToHours(d):
    # Split the total seconds into hours, minutes and seconds
    minutes, seconds = divmod(d, 60)
    hours, minutes = divmod(minutes, 60)

    # Zero-pad each part to at least two digits
    return '%02d:%02d:%02d' % (hours, minutes, seconds)

In [7]:
# Create function that converts a distance in km and a duration in seconds to an average speed in km/h
def avgSpeedConverter(f,d):
    # Define variables
    km = f
    seconds = d
    
    # Convert km to meters
    meters = km*1000
    
    # Calculate speed
    avg_speed = (meters/seconds) * 3.6
    return avg_speed
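
As a quick sanity check on the conversion above: metres per second times 3.6 is the same as kilometres divided by hours. A small sketch with a hypothetical helper `avg_speed_kmh`, which also guards against a zero duration (the guard is an addition for illustration, not part of the original function):

```python
def avg_speed_kmh(km, seconds):
    """Average speed in km/h from a distance in km and a duration in seconds."""
    if seconds == 0:
        return float('nan')  # no duration recorded, so the speed is undefined
    return float(km) / (seconds / 3600.0)

# 73.617 km in 17898 s (the 2017-07-05 cycling row of the summary data)
speed = avg_speed_kmh(73.617, 17898)
print(round(speed, 2))
```

The result should agree with `avgSpeedConverter` for the same inputs, since both expressions reduce to `km / seconds * 3600`.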

In [8]:
# Expand the two-digit year of the Date column to four digits (e.g. '05/07/17' -> '05/07/2017')
def yearFixer(s):
    main_part = s[:-2]
    year_before = s.split('/')[2]
    year_after = '20' + year_before
    return main_part+year_after

In [9]:
# Create a good format for the Date column
def dateConverter(s):
    # Set date formats
    time_format = "%d/%m/%Y"

    # Convert from str to datetime
    converted = datetime.strptime(s,time_format)
    
    return converted
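
The two helpers above can also be folded into one step, because `datetime.strptime` understands two-digit years through the `%y` directive. A minimal sketch, assuming the raw `Date` strings look like `05/07/17` (day/month/two-digit year), which is what `yearFixer` and `dateConverter` together imply; `parseShortDate` is a hypothetical name:

```python
from datetime import datetime

def parseShortDate(s):
    # %y reads a two-digit year; values 00-68 are interpreted as 2000-2068
    return datetime.strptime(s, "%d/%m/%y")

print(parseShortDate("05/07/17"))
```

This avoids hard-coding the century in string form, at the cost of relying on the `%y` pivot rule.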

## Create a single useful dataframe for each segment

### Places

Places are the checkpoint coordinates that the **Moves** app records whenever the phone stops moving, so each row of the dataframe below is a place where we stopped. We will not use this data in this project, but I will put it into a clean format so that it can be parsed and used for visualization in the future.

In [9]:
# Places df: read every daily file and stack them into one dataframe
df_places = pd.concat([pd.read_csv(path_to_places + file_name) for file_name in csv_files_places])

# Drop the Category and Link columns; the per-file row index becomes the 'index' column
df_places = df_places.drop(['Category', 'Link'], axis=1).reset_index()

# Change start and end time to datetime type
df_places['Start'] = pd.to_datetime(df_places['Start'])
df_places['End'] = pd.to_datetime(df_places['End'])
df_places['Date'] = df_places['Date'].apply(lambda x: yearFixer(x))
df_places['Date'] = df_places['Date'].apply(lambda x: dateConverter(x))

# Rename index column to iter_no like STRAVA
df_places.rename(columns={'index': 'iter_no'}, inplace=True)
df_places['iter_no'] = df_places['iter_no'].apply(lambda x: iterNo(x))

# Create day_no like STRAVA: number the distinct dates in chronological order
days = sorted(set(df_places['Date']))
day_lookup = {day: index + 1 for index, day in enumerate(days)}

# Map each row's date to its day number, independent of the row order
df_places['day_no'] = df_places['Date'].map(day_lookup)

In [10]:
# Check if columns are correct
df_places.head(10)

| | iter_no | Date | Name | Start | End | Duration | Latitude | Longitude | day_no |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-07-05 | Place in Tallinn | 2017-07-04 22:28:01 | 2017-07-05 07:33:40 | 32739 | 59.417177 | 24.799761 | 1 |
| 1 | 2 | 2017-07-05 | Place in Tallinn | 2017-07-05 08:07:19 | 2017-07-05 08:11:19 | 240 | 59.436725 | 24.744576 | 1 |
| 2 | 3 | 2017-07-05 | Place in Tallinn | 2017-07-05 08:26:17 | 2017-07-05 08:56:32 | 1815 | 59.443721 | 24.743251 | 1 |
| 3 | 4 | 2017-07-05 | Place in Tallinn | 2017-07-05 09:03:19 | 2017-07-05 09:10:57 | 458 | 59.43605 | 24.720483 | 1 |
| 4 | 5 | 2017-07-05 | Place in Tallinn | 2017-07-05 09:34:51 | 2017-07-05 09:54:57 | 1206 | 59.425996 | 24.6514 | 1 |
| 5 | 6 | 2017-07-05 | Place in Tallinn | 2017-07-05 10:25:29 | 2017-07-05 12:21:10 | 6941 | 59.442742 | 24.624479 | 1 |
| 6 | 7 | 2017-07-05 | Place in Vääna-Jõesuu | 2017-07-05 15:17:22 | 2017-07-05 16:12:35 | 3313 | 59.433688 | 24.367671 | 1 |
| 7 | 8 | 2017-07-05 | Place in Paldiski | 2017-07-05 17:54:05 | 2017-07-05 22:00:00 | 14755 | 59.345679 | 24.184467 | 1 |
| 8 | 1 | 2017-07-06 | Place in Paldiski | 2017-07-05 22:00:00 | 2017-07-06 11:24:00 | 48240 | 59.345679 | 24.184467 | 2 |
| 9 | 2 | 2017-07-06 | Place in Paldiski | 2017-07-06 11:41:44 | 2017-07-06 11:58:12 | 988 | 59.339574 | 24.105025 | 2 |


### Summary

The 'Summary' files hold the activities that the **Moves** app records between each pair of consecutive 'places'. Those activities are then summed into a *transport* duration, a *walking* duration and a *cycling* duration. For example, the table above shows eight iterations (stops) on the first day. From the end of the first place to the beginning of the second, I count about 34 minutes. That interval is labeled as either *walking*, *cycling* or *transport*, and at the end of the day it is summed with the rest of the activities that carry the same label.
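
The 34 minutes mentioned above come from subtracting the end of one place from the start of the next. A small sketch of that arithmetic, using the first three rows of the places table; the variable names are illustrative:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
# (Start, End) of the first three places on 2017-07-05, copied from the places table
places = [
    ("2017-07-04 22:28:01", "2017-07-05 07:33:40"),
    ("2017-07-05 08:07:19", "2017-07-05 08:11:19"),
    ("2017-07-05 08:26:17", "2017-07-05 08:56:32"),
]

gaps_in_minutes = []
# Pair each place with the one that follows it
for (_, prev_end), (next_start, _) in zip(places, places[1:]):
    gap = datetime.strptime(next_start, fmt) - datetime.strptime(prev_end, fmt)
    gaps_in_minutes.append(int(round(gap.total_seconds() / 60.0)))

print(gaps_in_minutes)
```

Each gap is one activity segment; how **Moves** labels it (walking, cycling or transport) is not visible in the places data and only shows up in the summary files.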

In [10]:
# Summary df: read every daily file and stack them into one dataframe
df_summary = pd.concat([pd.read_csv(path_to_summary + file_name) for file_name in csv_files_summary])

# Drop group column; the per-file row index becomes the 'index' column
df_summary = df_summary.drop('Group', axis=1).reset_index()

# Convert the Date column from string to datetime type
df_summary['Date'] = df_summary['Date'].apply(lambda x: yearFixer(x))
df_summary['Date'] = df_summary['Date'].apply(lambda x: dateConverter(x))

# Rename index column to iter_no like STRAVA
df_summary.rename(columns={'index': 'iter_no'}, inplace=True)
df_summary['iter_no'] = df_summary['iter_no'].apply(lambda x: iterNo(x))

df_summary.head(10)

| | iter_no | Date | Activity | Duration | Distance | Steps | Calories |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-07-05 | transport | 1680 | 248.965 | 0 | 0 |
| 1 | 2 | 2017-07-05 | walking | 367 | 0.255 | 506 | 16 |
| 2 | 3 | 2017-07-05 | cycling | 17898 | 73.617 | 0 | 2089 |
| 3 | 1 | 2017-07-06 | cycling | 21100 | 125.841 | 0 | 3471 |
| 4 | 1 | 2017-07-07 | walking | 62 | 0.04 | 80 | 3 |
| 5 | 2 | 2017-07-07 | cycling | 23986 | 138.479 | 0 | 3827 |
| 6 | 1 | 2017-07-08 | walking | 115 | 0.079 | 150 | 5 |
| 7 | 2 | 2017-07-08 | cycling | 22708 | 136.213 | 0 | 3755 |
| 8 | 3 | 2017-07-08 | transport | 212 | 0.739 | 0 | 0 |
| 9 | 1 | 2017-07-09 | walking | 201 | 0.153 | 307 | 10 |


We mostly care about *cycling* in the above dataframe, so we will select the *cycling* rows, aggregate them per day, and add some extra columns.

#### Cycling Distance per day

In [11]:
# Sum the cycling distance (km) and duration (seconds) per day
cycling_df = df_summary[df_summary['Activity']=='cycling'].groupby('Date').sum(
).reset_index().sort_values(by='Date', ascending=True)

# Drop unnecessary columns
cycling_df = cycling_df.filter(items=['Date', 'Distance','Duration'])

# Change column name
cycling_df.rename(columns={'Distance': 'ttl_cyc_km','Duration' : 'ttl_cyc_seconds'}, inplace=True)

# Create column with converted seconds to hours:minutes:seconds format (string)
cycling_df['ttl_cyc_duration'] = cycling_df['ttl_cyc_seconds'].apply(lambda x: secToHours(x))

# Create a new column named avg_speed 
cycling_df['avg_day_speed']= cycling_df[['ttl_cyc_km','ttl_cyc_seconds']].apply(lambda x: avgSpeedConverter(*x), axis=1) 

In [12]:
cycling_df.head(10)

| | Date | ttl_cyc_km | ttl_cyc_seconds | ttl_cyc_duration | avg_day_speed |
|---|---|---|---|---|---|
| 0 | 2017-07-05 | 73.617 | 17898 | 04:58:18 | 14.807308 |
| 1 | 2017-07-06 | 125.841 | 21100 | 05:51:40 | 21.470502 |
| 2 | 2017-07-07 | 138.479 | 23986 | 06:39:46 | 20.783974 |
| 3 | 2017-07-08 | 136.213 | 22708 | 06:18:28 | 21.594451 |
| 4 | 2017-07-09 | 154.117 | 26973 | 07:29:33 | 20.569503 |
| 5 | 2017-07-10 | 120.054 | 22762 | 06:19:22 | 18.987541 |
| 6 | 2017-07-11 | 69.906 | 13027 | 03:37:07 | 19.318462 |
| 7 | 2017-07-12 | 87.636 | 21665 | 06:01:05 | 14.562179 |
| 8 | 2017-07-13 | 109.449 | 25757 | 07:09:17 | 15.297449 |
| 9 | 2017-07-14 | 117.369 | 23943 | 06:39:03 | 17.647262 |


Just for the record, let's sum up the total cycling distance of the trip, captured by **Moves** app.

In [13]:
# Trip totals
print('Total cycling distance of the whole trip: \t%.2f km \nTotal time cycled: \t\t\t\t%s h|m|s' % (sum(
    cycling_df['ttl_cyc_km']),secToHours(sum(cycling_df['ttl_cyc_seconds']))))

Total cycling distance of the whole trip: 	3159.71 km 
Total time cycled: 				174:13:30 h|m|s


The total is about **470 km** longer than what **Garmin** says. That is because **Garmin** was mostly used only when we were sure that we would not stop within the next 10 minutes. So **Garmin** was missing data whenever we were just moving around a city to pick up groceries, when we split up to find water for cooking or for a wash at night, when I simply forgot to start a new activity on time, or when we hit a dead end and had to ride back some kilometers. So we can say that the **real kilometers** are **3159 km** but the **actual route kilometers** were **2690 km**.
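
The size of the gap follows directly from the two totals. A tiny worked calculation, using the **Moves** total printed above and the **Garmin** route figure quoted in the text; the variable names are illustrative:

```python
moves_km = 3159.71   # total cycling distance reported by Moves
garmin_km = 2690.0   # route distance attributed to Garmin in the text

extra_km = moves_km - garmin_km
extra_share = extra_km / moves_km * 100

print('Distance seen only by Moves: %.0f km (%.1f%% of the Moves total)' % (extra_km, extra_share))
```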

#### Walking Distance and steps per day


In [15]:
# Sum the walking distance (km) and steps per day
walking_df = df_summary[df_summary['Activity']=='walking'].groupby('Date').sum(
).reset_index().sort_values(by='Date', ascending=True)

# Drop unnecessary columns
walking_df = walking_df.filter(items=['Date', 'Distance', 'Steps'])

# Change column name
walking_df.rename(columns={'Distance': 'ttl_wal_distance', 'Steps': 'ttl_steps'}, inplace=True)

In [16]:
walking_df.head()

| | Date | ttl_wal_distance | ttl_steps |
|---|---|---|---|
| 0 | 2017-07-05 | 0.255 | 506 |
| 1 | 2017-07-07 | 0.04 | 80 |
| 2 | 2017-07-08 | 0.079 | 150 |
| 3 | 2017-07-09 | 0.153 | 307 |
| 4 | 2017-07-10 | 0.201 | 276 |


In [17]:
# Trip totals
print('Total walking distance covered during the trip: %.2f km \nTotal steps covered during the trip: \t\t%d steps' % (
    sum(walking_df['ttl_wal_distance']),sum(walking_df['ttl_steps'])))

Total walking distance covered during the trip: 12.40 km 
Total steps covered during the trip: 		19763 steps


Doing the same for the *walking* activity, we can see that I only walked about 12 km in 28 days, which is quite funny. The number would probably have been bigger if I had been carrying the phone while moving around to pitch the tent or walking through a supermarket. The cycling suit I was wearing had no pockets for the phone, so most of the time the phone stayed on the mount.

#### Calories burnt per day


In [25]:
# Sum the calories burnt per day over all activities
calories_df = df_summary.groupby('Date').sum(
).reset_index().sort_values(by='Date', ascending=True)

# Drop unnecessary columns
calories_df = calories_df.filter(items=['Date', 'Calories'])

# Change column name
calories_df.rename(columns={'Calories': 'ttl_cal_burnt'}, inplace=True)

# Add day_no column to keep as a key
calories_df['day_no']= calories_df.index + 1

In [26]:
calories_df.head()

| | Date | ttl_cal_burnt | day_no |
|---|---|---|---|
| 0 | 2017-07-05 | 2105 | 1 |
| 1 | 2017-07-06 | 3471 | 2 |
| 2 | 2017-07-07 | 3830 | 3 |
| 3 | 2017-07-08 | 3760 | 4 |
| 4 | 2017-07-09 | 4273 | 5 |


We can use this data instead of the **calories burnt** calculated by the custom function created in [parsing LifeSum Data](https://github.com/oikonang/bike_trip_project/blob/master/data_manging/Parsing%20LifeSum%20data.ipynb), so we will first save the above dataframe as a csv and then merge it with the calories data from that notebook.

In [27]:
# Save it to a csv
calories_df.to_csv('python_data/calories_from_moves.csv', index=False)
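
Merging the saved file with another daily table could look roughly like this. A sketch only: the LifeSum-side frame, its column `ttl_cal_lifesum` and its numbers are hypothetical placeholders, and `day_no` is assumed to be the shared key, as created above.

```python
import pandas as pd

# Stand-in for the Moves calories (first three rows of calories_df shown above)
moves_cal = pd.DataFrame({
    'day_no': [1, 2, 3],
    'ttl_cal_burnt': [2105, 3471, 3830],
})

# Hypothetical LifeSum-derived frame; the column name and values are made up
lifesum_cal = pd.DataFrame({
    'day_no': [1, 2, 3],
    'ttl_cal_lifesum': [1900, 3200, 3600],
})

# A left merge keeps every LifeSum day even if Moves has no row for it
merged = pd.merge(lifesum_cal, moves_cal, on='day_no', how='left')
print(merged)
```

Joining on `day_no` rather than on `Date` sidesteps differences in date formatting between the two sources, but it only works if both tables number the days from the same start date.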

In [28]:
# Trip total
print('Total calories burnt during the trip: %d cal' % sum(calories_df['ttl_cal_burnt']))

Total calories burnt during the trip: 88954 cal
