Import the pandas and matplotlib libraries into the Python environment.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Load the two data files into the environment for further data analysis.

In [2]:
person_1 = pd.read_csv('person1-step-data.csv')
person_2 = pd.read_csv('person2-step-data.csv')

Display the first person's data set to confirm that the file loaded successfully.

In [3]:
person_1

Unnamed: 0,Source,Date,Hour,Count
0,Person1 iPhone SE,2014-12-07,8,13
1,Person1 iPhone SE,2014-12-07,8,13
2,Person1 iPhone SE,2014-12-07,8,1
3,Person1 iPhone SE,2014-12-07,8,9
4,Person1 iPhone SE,2014-12-07,8,15
...,...,...,...,...
183782,Person1 iPhone SE,2021-09-22,17,1241
183783,Person1 iPhone SE,2021-09-22,17,1212
183784,Person1 iPhone SE,2021-09-22,18,808
183785,Person1 iPhone SE,2021-09-22,18,392


Display the second person's data set to confirm that the file loaded successfully.

In [4]:
person_2

Unnamed: 0,Source,Date,Hour,Count
0,Person2 Phone,2014-11-29,6,6
1,Person2 Phone,2014-11-29,6,4
2,Person2 Phone,2014-11-29,6,3
3,Person2 Phone,2014-11-29,6,9
4,Person2 Phone,2014-11-29,6,6
...,...,...,...,...
486254,Person2 Watch,2021-09-22,14,71
486255,Person2 Phone,2021-09-22,14,72
486256,Person2 Phone,2021-09-22,15,78
486257,Person2 Watch,2021-09-22,15,32


Check the data type of each column in both data sets to guide later data cleaning.

In [5]:
print("Person 1\n", person_1.dtypes)
print("Person 2\n", person_2.dtypes)

Person 1
 Source    object
Date      object
Hour       int64
Count      int64
dtype: object
Person 2
 Source    object
Date      object
Hour       int64
Count      int64
dtype: object


In both data sets, 'Source' and 'Date' are object columns, while 'Hour' and 'Count' are int64. Because 'Date' is stored as a string, we convert it to a datetime type to enable time-series analysis.

In [6]:
person_1['Date'] = pd.to_datetime(person_1['Date'])
person_2['Date'] = pd.to_datetime(person_2['Date'])
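
As a small illustration of the conversion step, the sketch below parses dates with an explicit format and flags values that cannot be parsed. The `sample` frame and its malformed last value are hypothetical and do not come from the step files; the `%Y-%m-%d` format is assumed from the dates displayed above.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately malformed.
sample = pd.DataFrame({"Date": ["2014-12-07", "2014-12-08", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")

# Rows whose date could not be parsed.
bad_rows = sample[parsed.isna()]
print(parsed.dtype)
print("Unparseable rows:", len(bad_rows))
```

Counting the NaT values before overwriting the column is one way to notice bad input early; whether the real files contain any such rows is not known from the notebook.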

Check that the 'Date' column was converted to a datetime type.

In [7]:
print("Person 1\n", person_1.dtypes)
print("Person 2\n", person_2.dtypes)

Person 1
 Source            object
Date      datetime64[ns]
Hour               int64
Count              int64
dtype: object
Person 2
 Source            object
Date      datetime64[ns]
Hour               int64
Count              int64
dtype: object


Count the rows in each data set to compare their sizes.

In [8]:
p1_rows = len(person_1.index)
p2_rows = len(person_2.index)
if p1_rows != p2_rows:
    print("P1:{} and P2:{}".format(p1_rows, p2_rows))
    print("Row Variance: {}".format(abs(p1_rows - p2_rows)))

P1:183787 and P2:486259
Row Variance: 302472


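The two data sets differ greatly in size: Person 2 has 302,472 more rows than Person 1 (486,259 versus 183,787), roughly 2.6 times as many. The "Row Variance" printed above is the absolute difference in row counts, not a statistical variance.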
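
To look at a size gap like this in more detail, one option is to compute the row ratio and count rows per source. The sketch below uses two hypothetical miniature frames, `p1` and `p2`; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature data sets; only the column names come from the notebook.
p1 = pd.DataFrame({"Source": ["Person1 iPhone SE"] * 3, "Count": [13, 1, 9]})
p2 = pd.DataFrame({
    "Source": ["Person2 Phone", "Person2 Watch", "Person2 Phone",
               "Person2 Watch", "Person2 Phone"],
    "Count": [6, 4, 3, 9, 6],
})

# Absolute difference and ratio of the row counts.
row_diff = abs(len(p1) - len(p2))
row_ratio = len(p2) / len(p1)
print("Row difference:", row_diff)
print("Row ratio (P2 / P1): {:.2f}".format(row_ratio))

# Rows per source show which device contributes the most records.
rows_per_source = p2["Source"].value_counts()
print(rows_per_source)
```

Applied to `person_1` and `person_2`, the same calls would show how the rows are distributed across sources, which may help explain part of the difference.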

List the distinct values of 'Source' for each person.

In [9]:
for x in person_1['Source'].unique():
    print(x)
print("")
for y in person_2['Source'].unique():
    print(y)

Person1 iPhone SE
Person1 Mi Fit
Person1 Misfit
Person1 Health Mate
Person1 Apple Watch

Person2 Phone
Person2 Watch


Person 2's data lists only two generic sources ('Phone' and 'Watch'), whereas Person 1's data names five specific sources.
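
A per-source summary can make this comparison more concrete. The sketch below is a minimal example with a hypothetical `sample` frame; the dates and counts are invented for illustration, and the output column names `first_date`, `last_date`, `rows` and `total_steps` are chosen here, not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample in the notebook's column layout (values are invented).
sample = pd.DataFrame({
    "Source": ["Person2 Phone", "Person2 Watch", "Person2 Phone", "Person2 Watch"],
    "Date": pd.to_datetime(["2014-11-29", "2016-05-01", "2021-09-22", "2021-09-22"]),
    "Count": [6, 71, 78, 32],
})

# Named aggregation: one output column per (input column, function) pair.
summary = sample.groupby("Source").agg(
    first_date=("Date", "min"),
    last_date=("Date", "max"),
    rows=("Count", "size"),
    total_steps=("Count", "sum"),
)
print(summary)
```

The first and last dates per source indicate when each device was in use, which the list of unique names alone does not show.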

Sum the step counts for each person by month and by year, and print the results.

In [10]:
# Sum the numeric columns per calendar month ('M' = month-end frequency).
MonthlyStep_1 = person_1.groupby(pd.Grouper(key='Date', freq='M')).sum(numeric_only=True)
# Select 'Count' by name rather than by column position.
print("Person 1 Step Count by Month\n", MonthlyStep_1[['Count']])
MonthlyStep_2 = person_2.groupby(pd.Grouper(key='Date', freq='M')).sum(numeric_only=True)
print("Person 2 Step Count by Month\n", MonthlyStep_2[['Count']])

Person 1 Step Count by Month
              Count
Date              
2014-12-31  111590
2015-01-31  131321
2015-02-28  106851
2015-03-31  101621
2015-04-30   91726
...            ...
2021-05-31  518898
2021-06-30  406440
2021-07-31  360565
2021-08-31  321087
2021-09-30  205743

[82 rows x 1 columns]
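
The monthly aggregation can also be written by grouping on a monthly period derived from the 'Date' column. This is an alternative sketch, not the notebook's method, and the `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample in the notebook's column layout.
sample = pd.DataFrame({
    "Source": ["Person1 iPhone SE"] * 4,
    "Date": pd.to_datetime(["2014-12-07", "2014-12-30", "2015-01-02", "2015-01-15"]),
    "Hour": [8, 9, 10, 11],
    "Count": [13, 15, 20, 5],
})

# Group by the calendar month of each date and sum only the step counts.
monthly = sample.groupby(sample["Date"].dt.to_period("M"))["Count"].sum()
print(monthly)
```

Selecting `"Count"` by name before summing keeps the result independent of column order and of how non-numeric columns such as 'Source' are handled.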


In [11]:
# Sum the numeric columns per calendar year ('Y' = year-end frequency).
YearlyStep_1 = person_1.groupby(pd.Grouper(key='Date', freq='Y')).sum(numeric_only=True)
print("Person 1 Step Count by Year\n", YearlyStep_1[['Count']])
YearlyStep_2 = person_2.groupby(pd.Grouper(key='Date', freq='Y')).sum(numeric_only=True)
print("Person 2 Step Count by Year\n", YearlyStep_2[['Count']])

Person 1 Step Count by Year
               Count
Date               
2014-12-31   111590
2015-12-31  1582971
2016-12-31  1742364
2017-12-31  1800474
2018-12-31  2339228
2019-12-31  2824389
2020-12-31  1239702
2021-12-31  3023933
Person 2 Step Count by Year
               Count
Date               
2014-12-31   199813
2015-12-31  3843685
2016-12-31  5335886
2017-12-31  5482973
2018-12-31  5869950
2019-12-31  6334428
2020-12-31  4184766
2021-12-31  3509835
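
To compare the yearly totals directly, the two results can be placed side by side in one table. The sketch below assumes two hypothetical frames, `p1` and `p2`, with invented values; the helper `yearly_steps` is introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

def yearly_steps(df):
    """Sum the Count column per calendar year of the Date column."""
    return df.groupby(df["Date"].dt.year)["Count"].sum()

# Hypothetical miniature data sets (values are invented).
p1 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-07-01", "2021-01-05"]),
    "Count": [100, 50, 70],
})
p2 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-01", "2021-04-01", "2021-08-01"]),
    "Count": [300, 200, 100],
})

# The dict keys become the column names; rows are aligned on the year.
comparison = pd.concat(
    {"Person 1": yearly_steps(p1), "Person 2": yearly_steps(p2)}, axis=1
)
comparison["Difference"] = comparison["Person 2"] - comparison["Person 1"]
print(comparison)
```

Aligning on the year makes gaps visible: a year present for only one person would appear as NaN in the other column.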
