## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/Strava Full Merged Data.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***4. Data Visualization, Storytelling & Experimenting with Charts: Understand the Relationships Between Variables***

#### Chart - 1 (Distribution of Total Steps)

In [None]:
# Distribution of Total Steps
# Select top columns for meaningful insights
key_columns = [
    'TotalSteps', 'TotalDistance', 'VeryActiveDistance',
    'ModeratelyActiveDistance', 'LightActiveDistance',
    'SedentaryActiveDistance', 'VeryActiveMinutes',
    'FairlyActiveMinutes', 'LightlyActiveMinutes',
    'SedentaryMinutes'
]

# Drop rows with all NaNs in selected columns
df_cleaned = df[key_columns].dropna(how='all')

# Histogram of Total Steps
plt.figure()
sns.histplot(df_cleaned['TotalSteps'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Total Steps')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen because it shows the frequency distribution of a single continuous variable, in this case "TotalSteps".

##### 2. What is/are the insight(s) found from the chart?

The daily total steps are right-skewed, with most people's steps falling between 5,000 and 10,000, while a smaller group is exceptionally active.
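
The skew and the 5,000 to 10,000 band described above can also be checked numerically rather than read off the histogram. The following is a minimal sketch, assuming a hypothetical helper `summarize_steps` and a small made-up series in place of `df_cleaned['TotalSteps']`; the numbers it prints are illustrative only and are not results from the dataset.

```python
import pandas as pd


def summarize_steps(steps: pd.Series, low: int = 5000, high: int = 10000) -> dict:
    """Summarize a step-count series (hypothetical helper, not part of the notebook)."""
    steps = steps.dropna()
    return {
        # Positive skewness indicates a longer right tail
        "skewness": float(steps.skew()),
        # Fraction of observations inside [low, high], both ends inclusive
        "share_in_band": float(steps.between(low, high).mean()),
        "median": float(steps.median()),
        "mean": float(steps.mean()),
    }


# Made-up values standing in for df_cleaned['TotalSteps']
demo_steps = pd.Series([1000, 6000, 7000, 8000, 30000])
summary = summarize_steps(demo_steps)
print(summary)
```

In a right-skewed distribution the mean usually sits above the median, so comparing the two is a quick sanity check alongside the skewness value.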


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, it allows for targeted user strategies (e.g., motivating the majority) for positive growth; however, ignoring the significant number of users with low step counts could lead to churn and negative growth.

#### Chart - 2 (KDE plot of Active Minutes)

In [None]:
# KDE plot of Active Minutes
plt.figure()
for col in ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes']:
    sns.kdeplot(df_cleaned[col], label=col)
plt.legend()
plt.title('KDE Plot of Active Minutes')
plt.xlabel('Minutes')
plt.ylabel('Density')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

This Kernel Density Estimate (KDE) plot is used to effectively compare the probability distributions of multiple continuous variables (different activity levels) on the same scale.

##### 2. What is/are the insight(s) found from the chart?

The insight is that users spend significantly more time in "LightlyActiveMinutes," which is broadly distributed, while "VeryActiveMinutes" and "FairlyActiveMinutes" are typically very low or zero for most users.
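
The claim that the higher-intensity minutes are "very low or zero for most users" can be backed by counting zeros directly. The sketch below assumes a hypothetical helper `zero_share` and a tiny made-up frame that reuses the notebook's column names; it is not run against the real data.

```python
import pandas as pd


def zero_share(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Fraction of rows equal to zero in each column (hypothetical helper)."""
    return frame[columns].eq(0).mean()


# Made-up values standing in for df_cleaned
demo = pd.DataFrame({
    "VeryActiveMinutes": [0, 0, 0, 30],
    "FairlyActiveMinutes": [0, 0, 10, 20],
    "LightlyActiveMinutes": [150, 200, 250, 300],
})
cols = ["VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes"]

zeros = zero_share(demo, cols)
medians = demo[cols].median()
print(pd.DataFrame({"zero_share": zeros, "median": medians}))
```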

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, this insight can drive positive impact by creating business strategies around encouraging more achievable "LightlyActiveMinutes"; conversely, focusing only on high-intensity activities could cause negative growth by demotivating the majority of users who are not accustomed to it.

#### Chart - 3 (Timeline of Total steps and calories)

In [None]:
# timeline of Total steps and calories
# Group and aggregate daily sums
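metrics = ['TotalSteps', 'daily_calories']
# Parse dates first so the timeline is ordered chronologically rather than
# alphabetically (assumes 'ActivityDate' is in a format pandas can infer)
df_daily = (
    df.assign(ActivityDate=pd.to_datetime(df['ActivityDate']))
      .groupby('ActivityDate')[metrics].sum()
      .reset_index()
)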

# Plotting with Matplotlib
plt.figure(figsize=(12, 6))

for col in metrics:
    plt.plot(df_daily['ActivityDate'], df_daily[col], label=col)

plt.title('Timeline of Total Steps and Calories')
plt.xlabel('Activity Date')
plt.ylabel('Metric Value')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

This line chart was selected to effectively track and compare the trends of two related metrics, Total Steps and Calories, over the continuous variable of time.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals a positive correlation between Total Steps and calories burned, with both metrics showing a significant, simultaneous decline in activity after May 9th.
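
Because the chart plots daily sums, a drop can come from fewer records being logged on a day as well as from lower activity per record. The sketch below shows one way to separate the two and to put a number on the correlation. It assumes a hypothetical helper `daily_totals`, made-up rows, and dates in month/day/year text form; the actual date format in the file is not known here.

```python
import pandas as pd


def daily_totals(frame, date_col="ActivityDate",
                 metrics=("TotalSteps", "daily_calories")):
    """Sum metrics per day and count logged rows per day (hypothetical helper)."""
    out = frame.copy()
    out[date_col] = pd.to_datetime(out[date_col])  # assumes an inferable format
    grouped = out.groupby(date_col)
    totals = grouped[list(metrics)].sum()
    totals["records"] = grouped.size()  # number of rows contributing to each sum
    return totals.sort_index()


# Made-up rows standing in for df
demo = pd.DataFrame({
    "ActivityDate": ["5/10/2016", "4/12/2016", "4/12/2016", "5/10/2016", "5/11/2016"],
    "TotalSteps": [4000, 9000, 11000, 6000, 3000],
    "daily_calories": [1500, 2100, 2400, 1800, 1400],
})

totals = daily_totals(demo)
r = totals["TotalSteps"].corr(totals["daily_calories"])  # Pearson by default
print(totals)
print(f"Pearson r between daily sums: {r:.2f}")
```

If `records` falls on the same dates as the sums, dividing each sum by `records` gives a per-record average that is less sensitive to how many users logged data that day.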

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, this insight can create a positive impact by identifying user disengagement (the drop-off) to trigger retention campaigns; failing to address this drop would lead to negative growth as it is a clear indicator of potential user churn.

#### Chart - 4 (Violin Plot of LightlyActiveMinutes)

In [None]:
# Violin Plot of LightlyActiveMinutes
plt.figure()
sns.violinplot(x=df_cleaned['LightlyActiveMinutes'], color='orchid')
plt.title('Violin Plot of Lightly Active Minutes')
plt.xlabel('Minutes')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This violin plot was chosen to show both the probability density and key statistical summaries (like median and interquartile range) of the "Lightly Active Minutes" data.

##### 2. What is/are the insight(s) found from the chart?

The data shows a wide, right-skewed distribution with a median of roughly 200 minutes, indicating that while many users are active in this range, there's significant variability and a tail of highly active users.
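
The median and interquartile range that the violin plot draws can be computed explicitly, which avoids estimating them by eye. This is a minimal sketch with a hypothetical helper `quartile_summary` and made-up values in place of `df_cleaned['LightlyActiveMinutes']`.

```python
import pandas as pd


def quartile_summary(values: pd.Series) -> dict:
    """Return quartiles and IQR of a numeric series (hypothetical helper)."""
    values = values.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75]).tolist()
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Made-up values standing in for df_cleaned['LightlyActiveMinutes']
demo_minutes = pd.Series([100, 150, 200, 250, 300])
stats = quartile_summary(demo_minutes)
print(stats)
```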

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, this insight allows for creating engaging content for the typical user (around 200 mins) for positive impact; however, ignoring the smaller group with very low activity could lead to negative growth due to churn from this disengaged segment.

#### Chart - 5 (Total Time Spent in Different Activity Levels)

In [None]:
# Total Time Spent in Different Activity Levels
activity_mins = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes']
activity_sum = df_cleaned[activity_mins].sum().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=activity_sum.index, y=activity_sum.values, palette='pastel')
plt.title('Total Time Spent in Different Activity Levels')
plt.ylabel('Total Minutes')
plt.xlabel('')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen as it effectively compares the total time spent across distinct activity levels.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals a significantly higher amount of time spent in "SedentaryMinutes" compared to "LightlyActiveMinutes," "VeryActiveMinutes," and "FairlyActiveMinutes," with "VeryActiveMinutes" and "FairlyActiveMinutes" being negligible in comparison to the other two categories.
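
Expressing each activity level as a share of all tracked minutes makes "negligible in comparison" concrete. The sketch below assumes a hypothetical helper `time_share` and made-up minutes; the resulting shares are illustrative, not findings from the dataset.

```python
import pandas as pd


def time_share(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Share of total minutes per activity level, largest first (hypothetical helper)."""
    totals = frame[columns].sum()
    return (totals / totals.sum()).sort_values(ascending=False)


# Made-up values standing in for df_cleaned
demo = pd.DataFrame({
    "VeryActiveMinutes": [30, 20],
    "FairlyActiveMinutes": [25, 25],
    "LightlyActiveMinutes": [200, 200],
    "SedentaryMinutes": [750, 750],
})
activity_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes"]

shares = time_share(demo, activity_cols)
print((shares * 100).round(1))
```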

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the gained insights can help create a positive business impact by highlighting the need for products/services promoting increased activity (e.g., fitness trackers, active lifestyle programs), as the overwhelmingly high "SedentaryMinutes" indicate a large untapped market for solutions encouraging movement; there are no insights that inherently lead to negative growth, as the data simply reflects current activity levels, which can be seen as an opportunity for businesses addressing inactivity.

#### Chart - 6 (Total Steps vs Total Distance)

In [None]:
# Total Steps vs Total Distance
plt.figure(figsize=(8,6))
plt.scatter(df_cleaned['TotalSteps'], df_cleaned['TotalDistance'], alpha=0.6, color='teal')

plt.title('Total Steps vs Total Distance')
plt.xlabel('Total Steps')
plt.ylabel('Total Distance')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen to visualize the relationship between "Total Steps" and "Total Distance."

##### 2. What is/are the insight(s) found from the chart?

The chart shows a strong positive linear correlation between total steps and total distance, indicating that as the number of steps increases, the total distance covered also increases proportionally.
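
A strong linear relationship can be summarized by the Pearson coefficient and a least-squares slope, where the slope approximates distance covered per step. This is a sketch only: `linear_relationship` is a hypothetical helper, and the demo values are made up to be exactly linear, so they say nothing about the real stride length or distance units in the data.

```python
import numpy as np
import pandas as pd


def linear_relationship(x: pd.Series, y: pd.Series) -> dict:
    """Pearson r plus least-squares slope and intercept (hypothetical helper)."""
    pair = pd.concat([x, y], axis=1).dropna()  # keep rows where both are present
    xs = pair.iloc[:, 0].to_numpy(dtype=float)
    ys = pair.iloc[:, 1].to_numpy(dtype=float)
    slope, intercept = np.polyfit(xs, ys, deg=1)  # degree-1 fit is a straight line
    r = np.corrcoef(xs, ys)[0, 1]
    return {"r": float(r), "slope": float(slope), "intercept": float(intercept)}


# Made-up values standing in for df_cleaned['TotalSteps'] / ['TotalDistance']
demo_steps = pd.Series([2000, 4000, 6000, 8000, 10000], name="TotalSteps")
demo_distance = pd.Series([1.4, 2.8, 4.2, 5.6, 7.0], name="TotalDistance")

fit = linear_relationship(demo_steps, demo_distance)
print(fit)
```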

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the gained insights can lead to a positive business impact by helping fitness tracking companies refine algorithms for distance estimation, validate sensor accuracy, and develop more personalized health recommendations based on step count; there are no insights that directly lead to negative growth, as the chart simply illustrates a well-understood physiological relationship.

#### Chart - 7 (Activity-Based Average Distance Breakdown)

In [None]:
# Activity-Based Average Distance Breakdown

# Calculate mean distances
means = df_cleaned[['VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance']].mean()

# Plot pie chart
plt.figure(figsize=(8, 6))
colors = ['darkgreen', 'seagreen', 'lightgreen']
plt.pie(means.values, labels=means.index, autopct='%1.1f%%', startangle=140, colors=colors)

plt.title('Activity-Based Average Distance Breakdown')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart was selected to illustrate the proportional breakdown of average distance covered across different activity levels.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that "LightActiveDistance" accounts for the majority (61.8%) of the average distance, followed by "VeryActiveDistance" (27.6%), and then "ModeratelyActiveDistance" (10.7%).

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the gained insights can positively impact business by indicating a larger consumer base engaged in light activity, allowing for targeted marketing of products and services catering to this segment (e.g., comfortable walking shoes, beginner-friendly fitness apps). There are no insights that inherently lead to negative growth, as the data simply presents a distribution of activity levels that can be leveraged for strategic product development and marketing.

#### Chart - 8 (Total Steps vs Active Minutes)

In [None]:
# Total Steps vs Active Minutes

plt.figure(figsize=(10, 5))

# Create scatter plot with color based on FairlyActiveMinutes
sc = plt.scatter(
    df_cleaned['TotalSteps'],
    df_cleaned['VeryActiveMinutes'],
    c=df_cleaned['FairlyActiveMinutes'],
    cmap='viridis',
    alpha=0.7
)

plt.colorbar(sc, label='Fairly Active Minutes')  # Add color scale legend
plt.title('Total Steps vs Active Minutes')
plt.xlabel('Total Steps')
plt.ylabel('Very Active Minutes')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot with a color gradient was chosen to visualize the relationship between "Total Steps" and "Very Active Minutes," while also incorporating "Fairly Active Minutes" as a third variable through color intensity.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests a general positive correlation between total steps and very active minutes, but with significant spread; furthermore, there appears to be a varied relationship between total steps and fairly active minutes (represented by color), with higher fairly active minutes not consistently correlating with higher total steps or very active minutes, and some individuals having high steps but low active minutes.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the gained insights can positively impact business by informing the development of more sophisticated fitness goals and personalized coaching, which can enhance user engagement for fitness apps and wearable devices; there are no insights that inherently lead to negative growth, as the data reveals nuances in activity patterns that can be leveraged for better product design and user experience.

#### Chart - 9 (Pairplot of distances)

In [None]:
# Pairplot of distances
sns.pairplot(df_cleaned[['TotalDistance', 'VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance']])
plt.suptitle('Pairplot of Distances', y=1.02)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was chosen to visualize the relationships between multiple distance metrics and their individual distributions.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals strong positive correlations between "TotalDistance" and each of "VeryActiveDistance," "ModeratelyActiveDistance," and "LightActiveDistance," indicating that as overall distance increases, distances covered in various activity levels also tend to increase; additionally, the histograms on the diagonal show that most distances are concentrated at lower values, with a few outliers at higher distances, especially for "TotalDistance."

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the gained insights can positively impact business by allowing for more accurate segmentation of users based on their activity patterns, enabling the development of targeted fitness programs and products (e.g., beginner programs for light activity, advanced for very active), and optimizing marketing strategies to reach specific user groups; there are no insights that inherently lead to negative growth, as understanding the relationships between these metrics provides opportunities for tailored product development and user engagement.

#### Chart - 10 (Correlation Heatmap)

In [None]:
# Correlation Heatmap
plt.figure(figsize=(12, 8))
corr = df_cleaned.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.xticks(rotation = 45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen to effectively visualize the strength and direction of linear relationships between multiple activity-related variables.

##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals strong positive correlations among "TotalSteps," "TotalDistance," and "VeryActiveDistance," as well as between "VeryActiveMinutes" and "VeryActiveDistance"; conversely, "SedentaryMinutes" shows a strong negative correlation with "TotalSteps" and "TotalDistance," indicating that higher sedentary time is associated with lower overall activity.
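
Reading "strong" and "negative" off a colour scale is imprecise, so it can help to list the variable pairs ranked by absolute correlation. The sketch below assumes a hypothetical helper `ranked_correlations` and made-up values for three of the notebook's columns; the coefficients it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd


def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Unique column pairs sorted by absolute Pearson r (hypothetical helper)."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up values standing in for df_cleaned
demo = pd.DataFrame({
    "TotalSteps": [1000, 3000, 5000, 7000, 9000],
    "TotalDistance": [0.7, 2.1, 3.5, 4.9, 6.3],
    "SedentaryMinutes": [1100, 900, 1000, 700, 800],
})

ranked = ranked_correlations(demo)
print(ranked.round(2))
```

A common rule of thumb treats an absolute value above roughly 0.7 as strong, though the threshold is a convention rather than a fixed standard.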

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the gained insights can positively impact business by informing the development of integrated fitness programs that emphasize reducing sedentary time while increasing active minutes and steps, leading to more comprehensive health solutions and enhanced user engagement; there are no insights that inherently lead to negative growth, as understanding these relationships provides opportunities to design more effective products and interventions for promoting healthier lifestyles.

# **Conclusion**

Based on the analysis, we can conclude that most users typically walk between 5,000 and 12,500 steps daily, with a notable peak around 10,000 steps, suggesting it is a common target. "FairlyActiveMinutes" and "LightlyActiveMinutes" show varied distributions, indicating diverse levels of engagement in lower-intensity activity. Sedentary time dominates the users' days, overshadowing the more active categories "VeryActiveMinutes" and "FairlyActiveMinutes." There is a strong positive correlation between total steps and total distance: more steps correspond to greater distance covered. Overall, the data highlights varying activity levels among users and emphasizes the prevalence of sedentary behavior and light physical activity in daily routines.