
# Pedometer Data Analysis

## Objective

### Purpose of this module
#### What do we want to learn more about?
**We want to understand how exercise and attention to our own health affect our daily life**

**We want to understand how fitness influences how active we feel**

**We want to understand how often we can stay in a cheerful mood if we work out daily**

## Data Collection

### Fitness Trends Dataset
#### A dataset of fitness trends and how they change with exercise
The following dataset has been collected from **Kaggle**

[Dataset Source](https://www.kaggle.com/aroojanwarkhan/fitness-data-trends)

### Importing Essential Libraries

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Importing Dataset

In [0]:
from google.colab import files
uploaded = files.upload()

df = pd.read_csv('25.csv')

Saving 25.csv to 25 (2).csv
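
The upload message shows that Colab saved the re-uploaded file under a new name (`25 (2).csv`), so reading the hard-coded name `25.csv` can pick up an older copy. A minimal sketch of reading the uploaded bytes directly, assuming `files.upload()` returns a dictionary of filename to raw file bytes. The dictionary below is a stand-in built from the first five records of this dataset, and the CSV header layout is assumed from the column names, so the example runs outside Colab:

```python
import io

import pandas as pd

# Stand-in for the dict returned by files.upload() (assumed: filename -> bytes).
# The header layout is an assumption based on the dataset's column names.
uploaded = {
    "25 (2).csv": (
        b"date,step_count,mood,calories_burned,hours_of_sleep,bool_of_active,weight_kg\n"
        b"2017-10-06,5464,200,181,5,0,66\n"
        b"2017-10-07,6041,100,197,8,0,66\n"
        b"2017-10-08,25,100,0,5,0,66\n"
        b"2017-10-09,5461,100,174,4,0,66\n"
        b"2017-10-10,6915,200,223,5,500,66\n"
    )
}

# Read the bytes of the first uploaded file instead of relying on its on-disk name.
filename = next(iter(uploaded))
df_uploaded = pd.read_csv(io.BytesIO(uploaded[filename]))

print(filename, df_uploaded.shape)
```

Reading from the returned bytes avoids depending on whatever name Colab chose when saving the file.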


### Displaying First Few Rows From Dataset

In [0]:
df.head()

Unnamed: 0,date,step_count,mood,calories_burned,hours_of_sleep,bool_of_active,weight_kg
0,2017-10-06,5464,200,181,5,0,66
1,2017-10-07,6041,100,197,8,0,66
2,2017-10-08,25,100,0,5,0,66
3,2017-10-09,5461,100,174,4,0,66
4,2017-10-10,6915,200,223,5,500,66


### Dataset Data Dictionary



```
date - Date which tells when the data point is recorded
step_count - Count of steps recorded by the pedometer
mood - Mood measured in either "Happy", "Neutral" or "Sad" which are given numeric values of 300, 200 and 100 respectively
calories_burned - Number of calories burned per day recorded by Samsung Fit Gear
hours_of_sleep - Total sleep hours per day recorded by Samsung Fit Gear
bool_of_active - Feeling of activeness measured in either "Active" or "Inactive" which are given numeric values of 500 and 0 respectively
weight_kg - Weight in kilograms entered by the user
```



## Dataset Cleaning & Preprocessing

### Changing 'bool_of_active' column to boolean (bool) data type

In [0]:
df['bool_of_active'].value_counts()

0      54
500    42
Name: bool_of_active, dtype: int64

In [0]:
df['bool_of_active'] = df['bool_of_active'].astype(bool)
df['bool_of_active'].value_counts()

False    54
True     42
Name: bool_of_active, dtype: int64
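
`astype(bool)` works here because the column contains only 0 and 500, but it would also turn any other nonzero value into `True`. A minimal sketch of a stricter alternative, using an explicit mapping so that unexpected codes surface as missing values. The sample values, including the invalid code 250, are made up for illustration:

```python
import pandas as pd

# Illustrative values only; 250 is a deliberately invalid code.
raw = pd.Series([0, 500, 500, 0, 250], name="bool_of_active")

# astype(bool) turns every nonzero value into True, including the invalid 250.
loose = raw.astype(bool)

# An explicit mapping leaves codes outside {0, 500} as missing values (NaN).
strict = raw.map({0: False, 500: True})

print(loose.tolist())
print("unmapped values:", int(strict.isna().sum()))
```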

### Converting 'mood' column numeric values to categorical string type

In [0]:
df['mood'].value_counts()

300    40
100    29
200    27
Name: mood, dtype: int64

In [0]:
df.loc[df['mood'] == 300, 'mood'] = 'Happy'
df.loc[df['mood'] == 200, 'mood'] = 'Neutral'
df.loc[df['mood'] == 100, 'mood'] = 'Sad'
df['mood'].value_counts()

Happy      40
Sad        29
Neutral    27
Name: mood, dtype: int64
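
The three `df.loc` assignments write strings into an integer column, which recent pandas versions may warn about. A minimal sketch of an alternative that maps all codes in one step and stores the result as an ordered categorical, so that Sad < Neutral < Happy is preserved. The sample codes are illustrative, and the choice of an ordered categorical is an assumption rather than something the notebook does:

```python
import pandas as pd

# Illustrative mood codes (100 = Sad, 200 = Neutral, 300 = Happy).
mood_codes = pd.Series([200, 100, 100, 100, 200, 300], name="mood")

labels = {100: "Sad", 200: "Neutral", 300: "Happy"}

# Map the codes to labels and record the natural ordering of the categories.
mood = pd.Series(
    pd.Categorical(
        mood_codes.map(labels),
        categories=["Sad", "Neutral", "Happy"],
        ordered=True,
    ),
    name="mood",
)

# Counts are listed in category order rather than by frequency.
print(mood.value_counts(sort=False))

# Ordered categoricals support comparisons such as "better than Sad".
print((mood > "Sad").sum())
```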

## Exploratory Data Analysis

### Data Information

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 7 columns):
date               96 non-null object
step_count         96 non-null int64
mood               96 non-null object
calories_burned    96 non-null int64
hours_of_sleep     96 non-null int64
bool_of_active     96 non-null bool
weight_kg          96 non-null int64
dtypes: bool(1), int64(4), object(2)
memory usage: 4.7+ KB


*   The dataset contains a total of 96 rows
*   The dataset contains 7 columns
*   None of the columns contain null values
*   4 out of 7 columns have values of integer data type
*   mood column has object data type
*   bool_of_active column has bool data type
*   date column has object data type

In [0]:
type(df['mood'].iloc[0])

str

In [0]:
type(df['date'].iloc[0])

str

The object data type of the mood column holds Python string (str) values

The object data type of the date column also holds Python string (str) values, so the dates have not been parsed as datetimes
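
Since the dates are stored as plain strings, date arithmetic and weekday lookups are not available on the column as-is. A minimal sketch of converting such strings with `pd.to_datetime`, using the first five dates from the dataset. The `%Y-%m-%d` format is assumed from the values shown in the table:

```python
import pandas as pd

# First five dates from the dataset, stored as strings.
dates = pd.Series(
    ["2017-10-06", "2017-10-07", "2017-10-08", "2017-10-09", "2017-10-10"],
    name="date",
)

# Parse the strings into datetime64 values (format assumed to be YYYY-MM-DD).
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

print(parsed.dtype)
print(parsed.dt.day_name().tolist())
print((parsed.max() - parsed.min()).days, "days between first and last record")
```

Applied to the full dataframe, the same call would be `df['date'] = pd.to_datetime(df['date'])`.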

### Describing Numeric Data

In [0]:
df.describe()

Unnamed: 0,step_count,calories_burned,hours_of_sleep,weight_kg
count,96.0,96.0,96.0,96.0
mean,2935.9375,93.447917,5.21875,64.28125
std,2143.384573,71.601951,1.51625,0.627495
min,25.0,0.0,2.0,64.0
25%,741.0,21.75,4.0,64.0
50%,2987.5,96.0,5.0,64.0
75%,4546.25,149.25,6.0,64.0
max,7422.0,243.0,9.0,66.0


A glance at the summary above shows that
*    On average the user burns around 93 calories per day
*    On average the user walks around 2936 steps per day
*    On average the user sleeps around 5 hours per day
*    Spread of step_count is high (standard deviation of about 2143)
*    Spread of calories_burned is medium (standard deviation of about 72)
*    Spread of hours_of_sleep & weight_kg is low (standard deviations of about 1.5 and 0.6)
*    The difference between the min & max value of step_count is quite large (25 vs 7422)
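
Standard deviations are hard to compare across columns measured in different units. One way to put them on a common scale is the coefficient of variation (standard deviation divided by mean). A minimal sketch, using the mean and std values copied from the `df.describe()` output; the choice of this measure is an addition for illustration, not part of the original analysis:

```python
import pandas as pd

# Mean and standard deviation copied from the df.describe() output.
summary = pd.DataFrame(
    {
        "mean": [2935.9375, 93.447917, 5.21875, 64.28125],
        "std": [2143.384573, 71.601951, 1.51625, 0.627495],
    },
    index=["step_count", "calories_burned", "hours_of_sleep", "weight_kg"],
)

# Coefficient of variation: spread relative to the size of the mean.
summary["cv"] = summary["std"] / summary["mean"]

print(summary["cv"].round(3))
```

Relative to their means, step_count and calories_burned vary by a similar amount (both above 0.7), hours_of_sleep varies much less, and weight_kg is nearly constant.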

### Describing Non-Numeric Data

In [0]:
# np.object was deprecated and later removed from NumPy; the built-in object works instead
df.describe(include=[object])

Unnamed: 0,date,mood
count,96,96
unique,96,3
top,2017-12-25,Happy
freq,1,40


A glance at the summary above shows that
*    date column has 96 unique values, one per row
*    mood column has 3 unique values 'Happy', 'Neutral' & 'Sad'
*    'Happy' value in mood column has the highest frequency (40)

### Single Variable - Univariate Analysis

#### Mood Distribution

In [0]:
fig = px.histogram(df, x='mood', title={
        'text': 'Mood Distribution',
        'y':0.9,
        'x':0.5
        })
fig.show()

Looking at the above bar plot of the mood column we can observe that
*    mood column contains categorical type data
*    mood column has 3 unique categories
     *    Happy
     *    Neutral
     *    Sad
*    Happy has the highest frequency (40) of all categories
*    Sad (29) is slightly more frequent than Neutral (27)
*    Neutral has the lowest frequency
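
The raw counts can also be expressed as shares of all recorded days, which makes the categories easier to compare. A minimal sketch, using the counts reported by `value_counts()` earlier in the notebook:

```python
import pandas as pd

# Mood counts reported by df['mood'].value_counts().
counts = pd.Series({"Happy": 40, "Sad": 29, "Neutral": 27}, name="days")

# Share of all recorded days, in percent.
share = (counts / counts.sum() * 100).round(1)

print(share)
```

On the dataframe itself, `df['mood'].value_counts(normalize=True)` gives the same proportions as fractions.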

#### Step Count Cumulative Histogram

In [0]:
import plotly.graph_objects as go

x = np.array(df['step_count'])
fig = go.Figure(data=[go.Histogram(x=x, cumulative_enabled=True)])

fig.show()

Since this is a cumulative histogram, each bin counts all observations that fall in that bin or in any smaller bin

Since each row of the dataset is one day, the cumulative histogram shows that on about 48 of the 96 recorded days the step count was approximately 2500 or lower
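
The cumulative counts that the plot displays can also be computed directly, which makes it possible to check a reading such as the one above. A minimal sketch with NumPy, using hypothetical step counts and an assumed bin width of 2500 (plotly chooses its own bins automatically, so the numbers will differ from the figure):

```python
import numpy as np

# Hypothetical daily step counts, for illustration only (not the real data).
steps = np.array([25, 400, 900, 1800, 2400, 3100, 4500, 5461, 6041, 7422])

# Count the days in each 2500-step-wide bin, then accumulate the counts.
edges = np.arange(0, 10001, 2500)  # 0, 2500, 5000, 7500, 10000
counts, _ = np.histogram(steps, bins=edges)
cumulative = np.cumsum(counts)

for upper, total in zip(edges[1:], cumulative):
    print(f"below {upper} steps: {total} days")

# Direct count for a single threshold, without any binning.
print("days with <= 2500 steps:", int((steps <= 2500).sum()))
```

With the real data, `(df['step_count'] <= 2500).sum()` gives the exact number of days at or below 2500 steps.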