# SMARTWATCH ANALYSIS

# Import Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

## Read Data

In [2]:
df = pd.read_csv('dailyActivity_merged.csv')

In [3]:
df

,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,8877689391,5/8/2016,10686,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847
936,8877689391,5/9/2016,20226,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,73,19,217,1131,3710
937,8877689391,5/10/2016,10733,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,18,11,224,1187,2832
938,8877689391,5/11/2016,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,88,12,213,1127,3832


**Data Dictionary:**

- Id                        : Unique ID of the user who recorded the activity
- ActivityDate              : Date of the activity
- TotalSteps                : Total number of steps taken during the day
- TotalDistance             : Total distance travelled during the day
- TrackerDistance           : Distance travelled by the wearer, based on the data collected by the watch's sensors
- LoggedActivitiesDistance  : Distance recorded for a specific physical activity that the user logged in their smartwatch or fitness tracker
- VeryActiveDistance        : Distance travelled during activities that require high levels of energy expenditure, such as running, cycling, or intense workouts
- ModeratelyActiveDistance  : Distance travelled during activities that require moderate levels of energy expenditure, such as brisk walking or light jogging
- LightActiveDistance       : Distance travelled during activities that require light levels of energy expenditure, such as leisurely walking or household chores
- SedentaryActiveDistance   : Distance travelled during periods of little or no physical activity, such as sitting or lying down
- VeryActiveMinutes         : Minutes spent on activities that require high levels of energy expenditure, such as running, cycling, or intense workouts
- FairlyActiveMinutes       : Minutes spent on activities that require moderate levels of energy expenditure, such as brisk walking or light jogging
- LightlyActiveMinutes      : Minutes spent on activities that require light levels of energy expenditure, such as leisurely walking or household chores
- SedentaryMinutes          : Minutes spent with little or no physical activity, such as sitting or lying down
- Calories                  : Calories burned during the day; the amount depends on factors such as the type and intensity of the activity and the individual's body weight and composition

## Check Missing Values & Duplicated Values

In [4]:
df.isna().sum()

Id                          0
ActivityDate                0
TotalSteps                  0
TotalDistance               0
TrackerDistance             0
LoggedActivitiesDistance    0
VeryActiveDistance          0
ModeratelyActiveDistance    0
LightActiveDistance         0
SedentaryActiveDistance     0
VeryActiveMinutes           0
FairlyActiveMinutes         0
LightlyActiveMinutes        0
SedentaryMinutes            0
Calories                    0
dtype: int64

In [5]:
df.duplicated().sum()

0

## Data Preprocessing

The dataset has no null or duplicated values. Let’s have a look at the information about the columns in the dataset:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

The column containing the date of the record is an object. We may need to use dates in our analysis, so let’s convert this column into a datetime column:

In [7]:
# Changing datatype of ActivityDate
df["ActivityDate"] = pd.to_datetime(df["ActivityDate"], 
                                      format="%m/%d/%Y")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Id                        940 non-null    int64         
 1   ActivityDate              940 non-null    datetime64[ns]
 2   TotalSteps                940 non-null    int64         
 3   TotalDistance             940 non-null    float64       
 4   TrackerDistance           940 non-null    float64       
 5   LoggedActivitiesDistance  940 non-null    float64       
 6   VeryActiveDistance        940 non-null    float64       
 7   ModeratelyActiveDistance  940 non-null    float64       
 8   LightActiveDistance       940 non-null    float64       
 9   SedentaryActiveDistance   940 non-null    float64       
 10  VeryActiveMinutes         940 non-null    int64         
 11  FairlyActiveMinutes       940 non-null    int64         
 12  LightlyActiveMinutes  
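As a small illustration of the conversion, the sketch below parses a few date strings in the same month/day/year layout. The names `raw` and `parsed` are made up for this example; only `pd.to_datetime` and the format string come from the cell above.

```python
import pandas as pd

# Sample strings in the same month/day/year layout as ActivityDate
raw = pd.Series(["4/12/2016", "4/13/2016", "5/8/2016"])

# An explicit format removes day/month ambiguity:
# "4/12/2016" is read as April 12, not December 4
parsed = pd.to_datetime(raw, format="%m/%d/%Y")

print(parsed.dtype)                    # datetime dtype instead of object
print(parsed.dt.month.tolist())        # month number of each record
print(parsed.dt.day_name().tolist())   # weekday name of each record
```

Once the column is a datetime, the `.dt` accessor exposes the month, weekday name, and similar fields without any string handling.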

The dataset records very active, fairly active, lightly active, and sedentary minutes in separate columns. Let’s combine these columns into a total-minutes column before moving forward:

In [8]:
df["TotalMinutes"] = df["VeryActiveMinutes"] + df["FairlyActiveMinutes"] + df["LightlyActiveMinutes"] + df["SedentaryMinutes"]
df["TotalMinutes"].sample(5)

207     875
77     1440
376     742
514    1090
163    1440
Name: TotalMinutes, dtype: int64
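A day has 24 × 60 = 1440 minutes, so `TotalMinutes` can double as a quick sanity check. The sketch below uses three rows copied from the table printed earlier; `MINUTES_PER_DAY`, `FullDay`, and `over_limit` are illustrative names, and reading a total under 1440 as a partially recorded day is an assumption rather than something the dataset states.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60  # 1440

# Rows 0, 2 and 3 of the table above
sample = pd.DataFrame({
    "VeryActiveMinutes": [25, 30, 29],
    "FairlyActiveMinutes": [13, 11, 34],
    "LightlyActiveMinutes": [328, 181, 209],
    "SedentaryMinutes": [728, 1218, 726],
})

minute_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
               "LightlyActiveMinutes", "SedentaryMinutes"]

# Row-wise sum, equivalent to adding the four columns by hand
sample["TotalMinutes"] = sample[minute_cols].sum(axis=1)

# Flag days whose minutes add up to a full 24 hours
sample["FullDay"] = sample["TotalMinutes"] == MINUTES_PER_DAY

# No day should ever exceed 1440 minutes
over_limit = sample[sample["TotalMinutes"] > MINUTES_PER_DAY]

print(sample[["TotalMinutes", "FullDay"]])
print("rows over 1440 minutes:", len(over_limit))
```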

Let's look at the descriptive statistics of the dataset:

In [9]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Id,940.0,4855407000.0,2424805000.0,1503960000.0,2320127000.0,4445115000.0,6962181000.0,8877689000.0
TotalSteps,940.0,7637.911,5087.151,0.0,3789.75,7405.5,10727.0,36019.0
TotalDistance,940.0,5.489702,3.924606,0.0,2.62,5.245,7.7125,28.03
TrackerDistance,940.0,5.475351,3.907276,0.0,2.62,5.245,7.71,28.03
LoggedActivitiesDistance,940.0,0.1081709,0.6198965,0.0,0.0,0.0,0.0,4.942142
VeryActiveDistance,940.0,1.502681,2.658941,0.0,0.0,0.21,2.0525,21.92
ModeratelyActiveDistance,940.0,0.5675426,0.8835803,0.0,0.0,0.24,0.8,6.48
LightActiveDistance,940.0,3.340819,2.040655,0.0,1.945,3.365,4.7825,10.71
SedentaryActiveDistance,940.0,0.001606383,0.007346176,0.0,0.0,0.0,0.0,0.11
VeryActiveMinutes,940.0,21.16489,32.8448,0.0,0.0,4.0,32.0,210.0


# Exploratory Data Analysis

## Analyze the Smartwatch Data

The dataset has a “Calories” column; it contains the data about the number of calories burned in a day. Let’s have a look at the relationship between calories burned and the total steps walked in a day:

In [10]:
figure = px.scatter(data_frame = df, x="Calories",
                    y="TotalSteps", size="VeryActiveMinutes", 
                    trendline="ols", 
                    title="Relationship between Calories & Total Steps")
figure.show()

The plot shows a positive, roughly linear relationship between the total number of steps taken and the calories burned in a day.
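To put a number on such a trend, the line and its strength can also be computed directly. The sketch below is a NumPy stand-in for the `trendline="ols"` option (which relies on statsmodels), not the code Plotly runs internally. It uses only the nine rows visible in the table above, so its values will differ from those for the full dataset.

```python
import numpy as np

# Calories and TotalSteps for the nine rows shown in the table above
calories = np.array([1985, 1797, 1776, 1745, 1863,
                     2847, 3710, 2832, 3832], dtype=float)
steps = np.array([13162, 10735, 10460, 9762, 12669,
                  10686, 20226, 10733, 21420], dtype=float)

# Least-squares line: steps ~ slope * calories + intercept
slope, intercept = np.polyfit(calories, steps, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(calories, steps)[0, 1]

# R^2 of the fitted line; for a one-variable fit it equals r squared
predicted = slope * calories + intercept
ss_res = np.sum((steps - predicted) ** 2)
ss_tot = np.sum((steps - steps.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.1f}, "
      f"r={r:.3f}, R^2={r_squared:.3f}")
```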

In [11]:
figure = px.scatter(data_frame = df, x="Calories",
                    y="TotalDistance", size="VeryActiveDistance", 
                    trendline="ols", 
                    title="Relationship between Calories & Total Distances")
figure.show()

In [12]:
figure = px.scatter(data_frame = df, x="TrackerDistance",
                    y="TotalDistance", 
                    trendline="ols", 
                    title="Relationship between Tracker Distances & Total Distances")
figure.show()

In [13]:
figure = px.scatter(data_frame = df, x="TrackerDistance",
                    y="TotalDistance", size="LoggedActivitiesDistance", 
                    trendline="ols", 
                    title="Relationship between Tracker Distances & Total Distances")
figure.show()

The plots show a positive linear relationship between the total distance and the calories burned in a day. They also show that tracker distance and total distance are almost identical.

In [14]:
# Restrict the correlation to numeric columns (ActivityDate is a datetime)
df.corr(numeric_only=True)

,Id,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalMinutes
Id,1.0,0.185721,0.241,0.238816,0.188015,0.308691,0.026665,0.019629,-0.015698,0.303608,0.051158,-0.098754,-0.043319,0.396671,-0.048274
TotalSteps,0.185721,1.0,0.985369,0.984822,0.181849,0.740115,0.507105,0.692208,0.070505,0.667079,0.498693,0.5696,-0.327484,0.591568,-0.017285
TotalDistance,0.241,0.985369,1.0,0.999505,0.188332,0.794582,0.470758,0.662002,0.082389,0.681297,0.462899,0.5163,-0.288094,0.644962,0.004523
TrackerDistance,0.238816,0.984822,0.999505,1.0,0.162585,0.794338,0.470277,0.661365,0.074591,0.680816,0.463154,0.514713,-0.289343,0.645313,0.002416
LoggedActivitiesDistance,0.188015,0.181849,0.188332,0.162585,1.0,0.150852,0.076527,0.138302,0.154996,0.234443,0.05386,0.102135,-0.046999,0.207595,0.021689
VeryActiveDistance,0.308691,0.740115,0.794582,0.794338,0.150852,1.0,0.192986,0.157669,0.046117,0.826681,0.21173,0.059845,-0.061754,0.491959,0.072625
ModeratelyActiveDistance,0.026665,0.507105,0.470758,0.470277,0.076527,0.192986,1.0,0.237847,0.005793,0.225464,0.946934,0.162092,-0.221436,0.21679,-0.085297
LightActiveDistance,0.019629,0.692208,0.662002,0.661365,0.138302,0.157669,0.237847,1.0,0.099503,0.154966,0.220129,0.885697,-0.413552,0.466917,-0.069207
SedentaryActiveDistance,-0.015698,0.070505,0.082389,0.074591,0.154996,0.046117,0.005793,0.099503,1.0,0.008258,-0.022361,0.124185,0.035475,0.043652,0.09051
VeryActiveMinutes,0.303608,0.667079,0.681297,0.680816,0.234443,0.826681,0.225464,0.154966,0.008258,1.0,0.31242,0.051926,-0.164671,0.615838,-0.018244


In [15]:
# create a correlation matrix
corr_matrix = df.corr(numeric_only=True)  # numeric columns only

# create a scatter matrix using Plotly Express
fig = px.scatter_matrix(corr_matrix)

# update the layout to adjust the size of the figure
fig.update_layout(width=800, height=800)

# show the figure
fig.show()
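A correlation matrix is often easier to read one column at a time. The sketch below ranks features by their correlation with `Calories`. It uses a few columns from the nine rows visible in the first table (distances rounded to two decimals), and `calories_corr` is an illustrative name, so the values will not match the full-dataset matrix above.

```python
import pandas as pd

# A few columns from the nine rows visible in the first table
sample = pd.DataFrame({
    "TotalSteps": [13162, 10735, 10460, 9762, 12669,
                   10686, 20226, 10733, 21420],
    "TotalDistance": [8.50, 6.97, 6.74, 6.28, 8.16,
                      8.11, 18.25, 8.15, 19.56],
    "VeryActiveMinutes": [25, 21, 30, 29, 36, 17, 73, 18, 88],
    "SedentaryMinutes": [728, 776, 1218, 726, 773,
                         1174, 1131, 1187, 1127],
    "Calories": [1985, 1797, 1776, 1745, 1863,
                 2847, 3710, 2832, 3832],
})

# Pairwise Pearson correlations (every column here is numeric)
corr_matrix = sample.corr()

# Keep the Calories column, drop its self-correlation of 1.0,
# and sort from strongest positive to strongest negative
calories_corr = (corr_matrix["Calories"]
                 .drop("Calories")
                 .sort_values(ascending=False))

print(calories_corr)
```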

As you can see, there is a linear relationship between the total number of steps and the number of calories burned in a day. Now let’s look at the average number of minutes spent at each activity level in a day:

In [16]:
label = ["Very Active Minutes", "Fairly Active Minutes", 
         "Lightly Active Minutes", "Inactive Minutes"]
counts = df[["VeryActiveMinutes", "FairlyActiveMinutes", 
               "LightlyActiveMinutes", "SedentaryMinutes"]].mean()
colors = ['gold','lightgreen', "pink", "blue"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Active Minutes')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=10,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

In [17]:
label = ["Very Active Minutes", "Fairly Active Minutes", 
         "Lightly Active Minutes", "Inactive Minutes"]
counts = df[["VeryActiveMinutes", "FairlyActiveMinutes", 
               "LightlyActiveMinutes", "SedentaryMinutes"]].mean()
colors = ['gold','lightgreen', "pink", "blue"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Active Minutes')
fig.update_traces(hoverinfo='label+percent', textinfo='label+percent', textfont_size=10,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

Observations:
1. 81.3% of the minutes in a day are inactive
2. 15.8% of the minutes in a day are lightly active
3. 1.74% (21 minutes) of the minutes in a day are very active
4. 1.11% (13 minutes) of the minutes in a day are fairly active
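The percentages in such a pie chart are each category's mean divided by the sum of the four means. The sketch below shows that arithmetic on the first five rows of the table above. `share` is an illustrative name, and the numbers differ from the full-dataset percentages listed in the observations.

```python
import pandas as pd

minute_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
               "LightlyActiveMinutes", "SedentaryMinutes"]

# Rows 0-4 of the table above
sample = pd.DataFrame({
    "VeryActiveMinutes": [25, 21, 30, 29, 36],
    "FairlyActiveMinutes": [13, 19, 11, 34, 10],
    "LightlyActiveMinutes": [328, 217, 181, 209, 221],
    "SedentaryMinutes": [728, 776, 1218, 726, 773],
})

# Average minutes per day in each category (the pie chart's values)
means = sample[minute_cols].mean()

# Each category's share of the average day, in percent
share = means / means.sum() * 100

print(share.round(2))
```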

In [18]:
label = ["Very Active Distances", "Fairly Active Distances", 
         "Lightly Active Distances", "Inactive Distances"]
counts = df[["VeryActiveDistance", "ModeratelyActiveDistance", 
               "LightActiveDistance", "SedentaryActiveDistance"]].mean()
colors = ['gold','lightgreen', "pink", "blue"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Active Distance')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=10,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

In [19]:
label = ["Very Active Distances", "Fairly Active Distances", 
         "Lightly Active Distances", "Inactive Distances"]
counts = df[["VeryActiveDistance", "ModeratelyActiveDistance", 
               "LightActiveDistance", "SedentaryActiveDistance"]].mean()
colors = ['gold','lightgreen', "pink", "blue"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Active Distance')
fig.update_traces(hoverinfo='label+percent', textinfo='label+percent', textfont_size=10,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()


Observations:
1. 61.7% of the distance in a day is lightly active
2. 0.0297% of the distance in a day is inactive
3. 27.8% of the distance in a day is very active
4. 10.5% of the distance in a day is fairly active

We converted the ActivityDate column to the datetime type above. Let’s use it to find the weekday of each record and add it to the dataset as a new column, “Day”:

In [20]:
df["Day"] = df["ActivityDate"].dt.day_name()
print(df["Day"].head())

0      Tuesday
1    Wednesday
2     Thursday
3       Friday
4     Saturday
Name: Day, dtype: object
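Weekday names are plain strings, so sorting or grouping them gives alphabetical order (Friday, Monday, ...). One common way to keep calendar order is an ordered categorical. The sketch below is one possible approach; `weekday_order` and `days` are illustrative names.

```python
import pandas as pd

# A few dates from the dataset, deliberately out of order
dates = pd.to_datetime(
    pd.Series(["4/16/2016", "4/12/2016", "4/14/2016", "4/13/2016"]),
    format="%m/%d/%Y",
)

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# An ordered categorical sorts by calendar position, not alphabetically
days = pd.Series(
    pd.Categorical(dates.dt.day_name(),
                   categories=weekday_order, ordered=True)
)

print(days.sort_values().tolist())
```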


Let’s have a look at the very active, fairly active, and lightly active minutes on each day of the week:

In [21]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x=df["Day"],
    y=df["VeryActiveMinutes"],
    name='Very Active',
    marker_color='red'
))
fig.add_trace(go.Bar(
    x=df["Day"],
    y=df["FairlyActiveMinutes"],
    name='Fairly Active',
    marker_color='black'
))
fig.add_trace(go.Bar(
    x=df["Day"],
    y=df["LightlyActiveMinutes"],
    name='Lightly Active',
    marker_color='blue'
))
fig.update_layout(barmode='group', xaxis_tickangle=-45, title='Daily Activity Levels')
fig.show()
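The chart above passes every record to each trace, so a weekday receives one bar per record rather than a single bar. If one bar per weekday is wanted, the data can be aggregated first. The sketch below averages the active minutes per weekday on the nine rows visible in the first table. Using the mean rather than the sum is an assumption about what the chart should show, and `per_day` is an illustrative name.

```python
import pandas as pd

# Dates and active minutes for the nine rows visible in the first table
sample = pd.DataFrame({
    "ActivityDate": pd.to_datetime(
        ["4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/2016",
         "5/8/2016", "5/9/2016", "5/10/2016", "5/11/2016"],
        format="%m/%d/%Y"),
    "VeryActiveMinutes": [25, 21, 30, 29, 36, 17, 73, 18, 88],
    "FairlyActiveMinutes": [13, 19, 11, 34, 10, 4, 19, 11, 12],
    "LightlyActiveMinutes": [328, 217, 181, 209, 221, 245, 217, 224, 213],
})
sample["Day"] = sample["ActivityDate"].dt.day_name()

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
active_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
               "LightlyActiveMinutes"]

# One row per weekday, in calendar order
per_day = sample.groupby("Day")[active_cols].mean().reindex(weekday_order)

print(per_day)
```

`per_day.index` and each column of `per_day` can then be passed as `x` and `y` to `go.Bar`, giving exactly seven bars per trace.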


Let’s have a look at the very active, fairly active, and lightly active distance on each day of the week:

In [22]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x=df["Day"],
    y=df["VeryActiveDistance"],
    name='Very Active',
    marker_color='red'
))
fig.add_trace(go.Bar(
    x=df["Day"],
    y=df["ModeratelyActiveDistance"],
    name='Fairly Active',
    marker_color='black'
))
fig.add_trace(go.Bar(
    x=df["Day"],
    y=df["LightActiveDistance"],
    name='Lightly Active',
    marker_color='blue'
))
fig.update_layout(barmode='group', xaxis_tickangle=-45, title='Daily Activity Levels by distance')
fig.show()


Let’s have a look at the number of inactive minutes on each day of the week:

In [23]:
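# Sum the inactive minutes per weekday so that each label has exactly one value
inactive = df.groupby("Day")["SedentaryMinutes"].sum()
label = inactive.index
counts = inactive.values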
colors = ['gold','lightgreen', "pink", "blue", "skyblue", "cyan", "orange"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Inactive Minutes Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='red', width=3)))
fig.show()

Insight:
- Thursday appears to be the most inactive day across the individuals in the dataset.

Now let’s have a look at the number of calories burned on each day of the week:

In [24]:
# Sum the calories per weekday so that each label has exactly one value
calories = df.groupby("Day")["Calories"].sum()
label = calories.index
counts = calories.values
colors = ['gold','lightgreen', "pink", "blue", "skyblue", "cyan", "orange"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Calories Burned Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

In [33]:
# Sum the distance per weekday so that each label has exactly one value
distances = df.groupby("Day")["TotalDistance"].sum()
label = distances.index
counts = distances.values
colors = ['gold','lightgreen', "pink", "blue", "skyblue", "cyan", "orange"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Distance Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

In [34]:
# Sum the steps per weekday so that each label has exactly one value
steps = df.groupby("Day")["TotalSteps"].sum()
label = steps.index
counts = steps.values
colors = ['gold','lightgreen', "pink", "blue", "skyblue", "cyan", "orange"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Steps Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

In [36]:
colors = {'Monday': 'red', 'Tuesday': 'green', 'Wednesday': 'blue', 'Thursday': 'orange', 
          'Friday': 'purple', 'Saturday': 'brown', 'Sunday': 'pink'}

# Sum the calories per weekday so that each label has exactly one value
calories = df.groupby("Day")["Calories"].sum()
label = calories.index
counts = calories.values

fig = go.Figure(data=[go.Pie(labels=label, values=counts, marker_colors=[colors[d] for d in label])])
fig.update_layout(title_text='Calories Burned Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(line=dict(color='black', width=3)))
fig.show()

# Sum the distance per weekday so that each label has exactly one value
distances = df.groupby("Day")["TotalDistance"].sum()
label = distances.index
counts = distances.values

fig = go.Figure(data=[go.Pie(labels=label, values=counts, marker_colors=[colors[d] for d in label])])
fig.update_layout(title_text='Total Distance Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(line=dict(color='black', width=3)))
fig.show()

# Sum the steps per weekday so that each label has exactly one value
steps = df.groupby("Day")["TotalSteps"].sum()
label = steps.index
counts = steps.values

fig = go.Figure(data=[go.Pie(labels=label, values=counts, marker_colors=[colors[d] for d in label])])
fig.update_layout(title_text='Total Steps Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(line=dict(color='black', width=3)))
fig.show()


Insight:
1. Tuesday appears to be one of the most active days: it has the highest number of calories burned of any day of the week. Why, then, does it have a lower total distance? In my opinion, this may be because the position recorded by the smartwatch is not very precise; confirming this would require additional data on smartwatch accuracy.
2. Sunday appears to be the most inactive day: it has the lowest calories burned and few total steps, but not the lowest total distance (which again points to the accuracy of the smartwatch's recordings).
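When weekdays are compared, it helps to check how many records fall on each day, because a weekday with more records will have a larger total even if the typical day is no more active. The sketch below puts the record count, the sum, and the mean of `Calories` side by side for the nine rows visible in the first table. `summary` is an illustrative name, and the ranking it prints applies to this small sample only, not to the full dataset.

```python
import pandas as pd

# Dates and calories for the nine rows visible in the first table
sample = pd.DataFrame({
    "ActivityDate": pd.to_datetime(
        ["4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/2016",
         "5/8/2016", "5/9/2016", "5/10/2016", "5/11/2016"],
        format="%m/%d/%Y"),
    "Calories": [1985, 1797, 1776, 1745, 1863, 2847, 3710, 2832, 3832],
})
sample["Day"] = sample["ActivityDate"].dt.day_name()

# Record count, total and average calories per weekday
summary = sample.groupby("Day")["Calories"].agg(["count", "sum", "mean"])

print(summary)
print("highest total:", summary["sum"].idxmax())
print("highest average:", summary["mean"].idxmax())
```

In this sample the weekday with the highest total is not the one with the highest average, which is why the two views are worth checking together before drawing conclusions about a particular day.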