# Analyze Fitbit Data With Python and Pandas
First, we import the step and sleep data into pandas dataframes and take a look at the format of the step data.

In [1]:
import pandas as pd
steps = pd.read_csv("fitbit_steps.csv",index_col='Date')
sleep = pd.read_csv("fitbit_sleep.csv",index_col='Date')
# can improve how we read this in: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
steps.head()

Date,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
3/6/2016,3647,21066,10.4,110,525,185,63,93,2115
3/7/2016,2872,12925,6.23,15,707,130,38,39,1209
3/8/2016,2837,11698,5.64,11,844,185,5,29,1176
3/9/2016,3032,11637,8.87,13,743,220,32,17,1427
3/10/2016,2882,12453,6.0,16,745,180,5,36,1238
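
The comment in the cell above notes that the read step could be improved. One option, sketched here under the assumption that the `Date` column always uses the month/day/year layout shown in the output, is to let `read_csv` parse the index as dates. The in-memory `sample_csv` is a hypothetical stand-in for `fitbit_steps.csv`, trimmed to a few columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for fitbit_steps.csv, using rows shown in the output above.
sample_csv = io.StringIO(
    "Date,Calories Burned,Steps,Distance,Floors\n"
    "3/6/2016,3647,21066,10.4,110\n"
    "3/7/2016,2872,12925,6.23,15\n"
    "3/8/2016,2837,11698,5.64,11\n"
)

# parse_dates=True asks pandas to parse the index column as dates,
# so the index becomes a DatetimeIndex instead of plain strings.
sample = pd.read_csv(sample_csv, index_col="Date", parse_dates=True)

print(type(sample.index).__name__)
print(sample.loc[pd.Timestamp("2016-03-07"), "Steps"])
```

With a date index, later steps such as plotting or selecting a date range can work on real dates rather than on text labels.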


The `dataframe.describe()` method provides some useful summary statistics:

In [2]:
steps.describe()

,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
count,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0
mean,2889.726257,9569.351955,4.687486,22.49162,835.52514,181.636872,21.329609,27.72067,1224.039106
std,490.516316,4713.578219,2.316243,48.411899,247.462704,76.891628,21.502489,29.980334,608.757515
min,1843.0,0.0,0.0,0.0,306.0,0.0,0.0,0.0,0.0
25%,2691.5,7242.0,3.49,8.0,692.5,161.5,5.5,6.5,953.0
50%,2917.0,10139.0,4.94,15.0,767.0,198.0,17.0,21.0,1278.0
75%,3108.5,12179.5,5.905,24.0,897.5,228.5,29.0,40.0,1501.5
max,4716.0,34382.0,16.58,588.0,1440.0,340.0,128.0,186.0,3507.0


I know there are some days when I did not wear my Fitbit, so no activity was recorded. Likewise, there may be some days when I only wore it for a small portion of the day. Let's look at all records with fewer than 1000 steps:

In [3]:
steps[steps.Steps < 1000]

Date,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
5/12/2016,1858,44,0.02,0,1042,5,0,0,16
6/7/2016,1843,0,0.0,0,1440,0,0,0,0
6/8/2016,1845,0,0.0,0,1440,0,0,0,0
6/13/2016,1843,0,0.0,0,1440,0,0,0,0
6/14/2016,1960,451,0.22,0,1311,8,4,10,141
7/18/2016,1872,113,0.05,0,1357,9,0,0,39
7/23/2016,1899,347,0.17,0,1046,13,0,0,65
7/28/2016,1843,0,0.0,0,1440,0,0,0,0
7/29/2016,1843,0,0.0,0,1440,0,0,0,0
7/30/2016,1843,0,0.0,0,1440,0,0,0,0


As suspected, there are a number of records with little or no data. We will exclude these outliers. To verify that they are no longer in the dataframe, we print a few rows around item 167 (8/20/2016), which was removed.

In [4]:
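# keep only days with at least 1000 steps (the complement of the filter above)
steps = steps[steps.Steps >= 1000]
# .ix was removed in pandas 1.0; .iloc selects rows by integer position
steps.iloc[165:169]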

Date,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
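
Peeking at rows by position depends on knowing where the removed rows sat. A more direct check, sketched below on a small hypothetical sample (the values are taken from rows shown earlier), is to record the labels of the low-step days before filtering and confirm afterwards that none of them survive.

```python
import pandas as pd

# Hypothetical sample built from rows shown earlier in this notebook.
sample = pd.DataFrame(
    {"Steps": [21066, 44, 0, 12925, 451]},
    index=["3/6/2016", "5/12/2016", "6/7/2016", "3/7/2016", "6/14/2016"],
)
sample.index.name = "Date"

low_days = sample.index[sample["Steps"] < 1000]  # labels we expect to drop
filtered = sample[sample["Steps"] >= 1000]

# Every remaining row meets the threshold, and no dropped label is still present.
assert (filtered["Steps"] >= 1000).all()
assert not filtered.index.isin(low_days).any()
print(len(sample) - len(filtered), "rows removed")  # 3 rows removed
```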


Likewise, we might have some outliers on the high end. We use the `sort_values` method to sort the Steps column from highest to lowest, and we specify that we want to see the top 10 results in the `head` method.

In [None]:
steps.sort_values('Steps', ascending=False).head(10)
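
As an aside, pandas also provides `DataFrame.nlargest`, which returns the top rows by a column without sorting the whole frame first. The sketch below compares the two approaches on a small sample taken from the first rows shown earlier; it is an alternative, not what the cell above runs.

```python
import pandas as pd

# Sample built from the first five rows shown at the top of this notebook.
sample = pd.DataFrame(
    {"Steps": [21066, 12925, 11698, 11637, 12453], "Floors": [110, 15, 11, 13, 16]},
    index=["3/6/2016", "3/7/2016", "3/8/2016", "3/9/2016", "3/10/2016"],
)

top_sorted = sample.sort_values("Steps", ascending=False).head(3)
top_nlargest = sample.nlargest(3, "Steps")

# Both approaches return the same three days, highest step count first.
print(top_nlargest)
print(top_sorted.equals(top_nlargest))
```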

Taking a look here, the first entry looks like an outlier that may throw off any trend analysis. Let's remove this entry.

In [None]:
steps = steps[steps.Steps < 30000]

### Adding A Summary Column
We notice that we have minutes sedentary and then minutes active, broken down by level of activity. It might be useful to know *total* minutes active, so we create a new column that represents this value.

In [None]:
steps['Total Minutes Active'] = steps['Minutes Lightly Active'] + steps['Minutes Fairly Active'] + steps['Minutes Very Active']
steps.head()
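
A variation on the same idea, sketched on a hypothetical sample: summing a list of columns with `sum(axis=1)` keeps the column names in one place, and calling `.copy()` on the filtered frame first avoids the `SettingWithCopyWarning` that pandas can raise when a column is added to a filtered dataframe.

```python
import pandas as pd

# Hypothetical sample using values from rows shown earlier.
sample = pd.DataFrame(
    {
        "Steps": [21066, 44, 12925],
        "Minutes Lightly Active": [185, 5, 130],
        "Minutes Fairly Active": [63, 0, 38],
        "Minutes Very Active": [93, 0, 39],
    },
    index=["3/6/2016", "5/12/2016", "3/7/2016"],
)

# .copy() makes the filtered frame independent of the original.
filtered = sample[sample["Steps"] >= 1000].copy()

active_cols = ["Minutes Lightly Active", "Minutes Fairly Active", "Minutes Very Active"]
filtered["Total Minutes Active"] = filtered[active_cols].sum(axis=1)
print(filtered["Total Minutes Active"].tolist())  # [341, 207]
```

One difference to keep in mind: `sum(axis=1)` skips missing values by default, whereas adding columns with `+` yields `NaN` for any row with a missing value.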

Now that we have cleaned up our data and taken an initial look, let's take a quick peek at steps and floors visually.

Note that we include the `%matplotlib inline` command so that figures appear in the Jupyter notebook.

In [None]:
%matplotlib inline
import matplotlib
ts = steps['Steps']
ts.plot()
# improve plotting with this: http://earthpy.org/pandas-basics.html

### Secondary y-axis
If we want, we can include additional information, in this case the number of floors climbed. Since this is a much smaller number than steps, we will plot it on a secondary y-axis.

In [None]:
%matplotlib inline
steps.Steps.plot()
steps.Floors.plot(secondary_y=True)
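
With two y-axes it is easy to lose track of which line belongs to which scale. The sketch below is one way to label both axes, using matplotlib's `twinx` directly instead of the `secondary_y` argument; the sample data and the axis variable names are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Sample built from the first five rows shown at the top of this notebook.
sample = pd.DataFrame(
    {"Steps": [21066, 12925, 11698, 11637, 12453], "Floors": [110, 15, 11, 13, 16]},
    index=["3/6/2016", "3/7/2016", "3/8/2016", "3/9/2016", "3/10/2016"],
)

fig, ax_steps = plt.subplots()
ax_floors = ax_steps.twinx()  # second y-axis sharing the same x-axis

# Draw each series on its own axis, in a distinct color.
sample["Steps"].plot(ax=ax_steps, color="tab:blue")
sample["Floors"].plot(ax=ax_floors, color="tab:orange")

# Label each axis so the two scales can be told apart.
ax_steps.set_ylabel("Steps")
ax_floors.set_ylabel("Floors")
```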


### Scatter Matrix Plot
Pandas has a useful feature called Scatter Matrix Plot for visualizing linear correlations between variables in your dataframe. We select the columns (variables) to include. The `'kde'` diagonal gives us a density plot for each variable.

In [None]:
# pandas.tools.plotting was removed; scatter_matrix now lives in pandas.plotting
from pandas.plotting import scatter_matrix
scatter_matrix(steps[['Steps', 'Floors', 'Minutes Sedentary', 'Total Minutes Active']], diagonal='kde')
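
The scatter matrix shows correlations visually. To put numbers on them, `DataFrame.corr()` computes pairwise Pearson correlation coefficients by default. The sketch below uses only the five rows shown at the top of this notebook, so the coefficients it prints illustrate the call and say little about the full dataset.

```python
import pandas as pd

# Sample built from the first five rows shown at the top of this notebook.
sample = pd.DataFrame(
    {
        "Steps": [21066, 12925, 11698, 11637, 12453],
        "Floors": [110, 15, 11, 13, 16],
        "Minutes Sedentary": [525, 707, 844, 743, 745],
    }
)

# Pairwise Pearson correlation between every pair of columns.
corr = sample.corr()
print(corr.round(2))
```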