
# Data Pre-Processing & Visualization for Machine Learning  
### Workshop Hands-on Notebook  
**Dataset:** Smartwatch Health Data  
**Instructor:** M Fahad Bashir  

---
## Objective
This notebook demonstrates how raw, unclean data is transformed into clean,
meaningful data ready for Machine Learning using **data preprocessing** and
**visualization techniques**.



## 1. Import Required Libraries
We use:
- **Pandas** for data handling  
- **NumPy** for numerical operations  
- **Matplotlib & Seaborn** for visualization  


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.show()



## 2. Load the Dataset
We load the raw smartwatch health dataset to understand its structure.


In [2]:

df = pd.read_csv("unclean_smartwatch_health_data.csv")
df.head()


Unnamed: 0,User ID,Heart Rate (BPM),Blood Oxygen Level (%),Step Count,Sleep Duration (hours),Activity Level,Stress Level
0,4174.0,58.939776,98.80965,5450.390578,7.167235622316564,Highly Active,1
1,,,98.532195,727.60161,6.538239375570314,Highly_Active,5
2,1860.0,247.803052,97.052954,2826.521994,ERROR,Highly Active,5
3,2294.0,40.0,96.894213,13797.338044,7.367789630207228,Actve,3
4,2130.0,61.950165,98.583797,15679.067648,,Highly_Active,6



## 3. Basic Data Inspection
Understanding data shape, columns, and types.


In [3]:

df.info()   # prints column names, non-null counts, and dtypes
df.shape    # (rows, columns)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   User ID                 9799 non-null   float64
 1   Heart Rate (BPM)        9600 non-null   float64
 2   Blood Oxygen Level (%)  9700 non-null   float64
 3   Step Count              9900 non-null   float64
 4   Sleep Duration (hours)  9850 non-null   object 
 5   Activity Level          9800 non-null   object 
 6   Stress Level            9800 non-null   object 
dtypes: float64(4), object(3)
memory usage: 547.0+ KB


(10000, 7)
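`Sleep Duration (hours)` and `Stress Level` are reported as `object` even though they look numeric, which usually means some entries are text. The sketch below shows one way to list such entries. It runs on a small made-up sample modeled on the rows shown earlier, and the helper name `non_numeric_values` is hypothetical.

```python
import pandas as pd

# Small illustrative sample (not the real dataset), modeled on the rows shown above.
sample = pd.DataFrame({
    "Sleep Duration (hours)": ["7.167235622316564", "6.538239375570314",
                               "ERROR", "7.367789630207228", None],
    "Stress Level": ["1", "5", "5", "3", "Very High"],
})

def non_numeric_values(series: pd.Series) -> list:
    """Return the distinct non-missing entries that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")   # unparseable text becomes NaN
    bad = series[parsed.isna() & series.notna()]      # ignore values that were already missing
    return sorted(bad.unique())

report = {col: non_numeric_values(sample[col]) for col in sample.columns}
print(report)
```

On the real data, the same helper could be applied to each `object` column of `df` to decide which columns need cleaning before imputation.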


## 4. Checking Missing Values
Missing values can distort statistics, and many ML algorithms cannot handle them at all, so we first count them per column.


In [4]:

df.isnull().sum()


User ID                   201
Heart Rate (BPM)          400
Blood Oxygen Level (%)    300
Step Count                100
Sleep Duration (hours)    150
Activity Level            200
Stress Level              200
dtype: int64
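Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up five-row sample; the name `missing_summary` is only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative sample with one missing heart-rate reading.
sample = pd.DataFrame({
    "Heart Rate (BPM)": [58.9, np.nan, 247.8, 40.0, 61.9],
    "Step Count": [5450.4, 727.6, 2826.5, 13797.3, 15679.1],
})

# isnull().mean() is the fraction of missing rows per column.
missing_summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
})
print(missing_summary)
```

Applied to the counts above, 400 missing heart-rate values out of 10000 rows is 4%.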


## 5. Handling Missing Values
We use:
- Mean for numerical columns  
- Mode for categorical columns  


In [5]:

# Assign the result back to the column. Calling fillna(..., inplace=True) on
# df[col] triggers a pandas warning and stops working in pandas 3.0.
for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].mean())

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

df.isnull().sum()


User ID                   0
Heart Rate (BPM)          0
Blood Oxygen Level (%)    0
Step Count                0
Sleep Duration (hours)    0
Activity Level            0
Stress Level              0
dtype: int64

In [6]:
df

Unnamed: 0,User ID,Heart Rate (BPM),Blood Oxygen Level (%),Step Count,Sleep Duration (hours),Activity Level,Stress Level
0,4174.000000,58.939776,98.809650,5450.390578,7.167235622316564,Highly Active,1
1,3007.480253,76.035462,98.532195,727.601610,6.538239375570314,Highly_Active,5
2,1860.000000,247.803052,97.052954,2826.521994,ERROR,Highly Active,5
3,2294.000000,40.000000,96.894213,13797.338044,7.367789630207228,Actve,3
4,2130.000000,61.950165,98.583797,15679.067648,ERROR,Highly_Active,6
...,...,...,...,...,...,...,...
9995,1524.000000,78.819386,98.931927,2948.491953,7.402748595032027,Active,7
9996,4879.000000,48.632659,95.773035,4725.623070,6.3821659358529015,Sedentary,2
9997,2624.000000,73.834442,97.945874,2571.492060,6.91654920303435,Sedentary,4
9998,4907.000000,76.035462,98.401058,3364.788855,5.691233932149209,Active,8
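Note that row 4 now shows `ERROR` as its sleep duration: the column is stored as text, so its mode is the placeholder string itself. One possible fix, sketched below on a made-up sample, is to convert the column to numbers first so that `ERROR` becomes a missing value, and only then impute. The sketch uses the median instead of the mean, which is a swapped-in choice that is less sensitive to extreme readings such as 247.8 BPM.

```python
import pandas as pd

# Illustrative sample (not the real dataset), modeled on the first rows of the raw data.
sample = pd.DataFrame({
    "Heart Rate (BPM)": [58.939776, None, 247.803052, 40.0, 61.950165],
    "Sleep Duration (hours)": ["7.167235622316564", "6.538239375570314",
                               "ERROR", "7.367789630207228", None],
})

cleaned = sample.copy()

# Turn text such as "ERROR" into NaN so it is treated as missing.
cleaned["Sleep Duration (hours)"] = pd.to_numeric(
    cleaned["Sleep Duration (hours)"], errors="coerce"
)

# Impute each numeric column with its median.
for col in ["Heart Rate (BPM)", "Sleep Duration (hours)"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

print(cleaned)
```

Identifier columns such as `User ID` are a separate case: a mean user ID (3007.480253 above) does not refer to any real user, so such columns are usually left out of imputation.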



## 6. Handling Duplicate Records
Duplicate rows give repeated observations extra weight and can bias the model. We count them first, then drop them.


In [7]:

df.duplicated().sum()


0

In [None]:

df.drop_duplicates(inplace=True)
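To see what would be removed before dropping anything, `duplicated(keep=False)` marks every copy of a repeated row. This is a minimal sketch on a made-up sample in which one row is repeated.

```python
import pandas as pd

# Illustrative sample: rows 0 and 2 are identical.
sample = pd.DataFrame({
    "User ID": [4174, 1860, 4174, 2294],
    "Heart Rate (BPM)": [58.9, 247.8, 58.9, 40.0],
})

# keep=False flags all copies, not just the later ones.
all_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(all_copies)
print(len(sample), "->", len(deduplicated))
```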



## 7. Handling Categorical Data
Machine Learning models require numerical input, so categorical columns must be converted to numbers.

**What `pd.get_dummies()` actually does**
For each unique category in a column, pandas creates a new column. One-hot encoding therefore explodes the dataset when one or more categorical columns have many unique values.
A smartwatch dataset may contain such columns, for example:
- User_ID
- Device_ID
- Timestamp
- Session_ID
- or any text-based unique identifier

📌 These columns can have thousands of unique values.

In [None]:
df.select_dtypes(include='object').nunique()


In [8]:
categorical_cols = ['Stress Level', 'Activity Level']  # example
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df_encoded

Unnamed: 0,User ID,Heart Rate (BPM),Blood Oxygen Level (%),Step Count,Sleep Duration (hours),Stress Level_10,Stress Level_2,Stress Level_3,Stress Level_4,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_Very High,Activity Level_Actve,Activity Level_Highly Active,Activity Level_Highly_Active,Activity Level_Seddentary,Activity Level_Sedentary
0,4174.000000,58.939776,98.809650,5450.390578,7.167235622316564,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,3007.480253,76.035462,98.532195,727.601610,6.538239375570314,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False
2,1860.000000,247.803052,97.052954,2826.521994,ERROR,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False
3,2294.000000,40.000000,96.894213,13797.338044,7.367789630207228,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False
4,2130.000000,61.950165,98.583797,15679.067648,ERROR,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1524.000000,78.819386,98.931927,2948.491953,7.402748595032027,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
9996,4879.000000,48.632659,95.773035,4725.623070,6.3821659358529015,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
9997,2624.000000,73.834442,97.945874,2571.492060,6.91654920303435,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True
9998,4907.000000,76.035462,98.401058,3364.788855,5.691233932149209,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
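The encoded columns above include `Activity Level_Actve`, `Activity Level_Highly_Active`, and `Activity Level_Seddentary`, so inconsistent spellings were encoded as separate categories. The sketch below normalizes the labels before encoding. The mapping `typo_fixes` is an assumption about which spellings are typos, not something the dataset documents.

```python
import pandas as pd

# The label spellings that appear in the outputs above.
activity = pd.Series(
    ["Highly Active", "Highly_Active", "Actve", "Active", "Seddentary", "Sedentary"],
    name="Activity Level",
)

# Assumed corrections for misspelled labels (hypothetical mapping).
typo_fixes = {"Actve": "Active", "Seddentary": "Sedentary"}

normalized = (
    activity.str.strip()
    .str.replace("_", " ", regex=False)   # "Highly_Active" -> "Highly Active"
    .replace(typo_fixes)                  # fix whole-value typos
)

encoded = pd.get_dummies(normalized, prefix="Activity Level", drop_first=True)
print(sorted(normalized.unique()))
print(list(encoded.columns))
```

After normalization, three categories remain, so `drop_first=True` leaves two indicator columns instead of five.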


**What should be encoded?**
> "One-hot encoding creates one column per category.
> If a column has too many unique values, encoding it creates thousands of columns, which is inefficient and harmful for ML models."

Only encode columns that:
- ✔ Have limited categories
- ✔ Carry meaningful information
- ✔ Are useful for prediction

Examples:
- Gender
- Activity type
- Sleep quality
- Smoking status

❌ Do NOT encode:
- IDs
- User names
- Timestamps
- Unique identifiers
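The rule above can be applied by checking how many unique values each candidate column has before encoding. The following is a minimal sketch on a made-up sample; the threshold `MAX_CATEGORIES` is a hypothetical value that should be chosen per dataset.

```python
import pandas as pd

# Illustrative sample: one text column with many distinct values, one with few.
sample = pd.DataFrame({
    "Sleep Duration (hours)": ["7.167235622316564", "6.538239375570314", "ERROR",
                               "7.367789630207228", "5.691233932149209", "6.91654920303435"],
    "Activity Level": ["Highly Active", "Active", "Highly Active",
                       "Active", "Sedentary", "Sedentary"],
})

MAX_CATEGORIES = 3  # hypothetical cut-off for "limited categories"

candidate_cols = ["Sleep Duration (hours)", "Activity Level"]
cardinality = sample[candidate_cols].nunique()

# Encode only the low-cardinality columns; leave the rest for separate cleaning.
to_encode = cardinality[cardinality <= MAX_CATEGORIES].index.tolist()
encoded = pd.get_dummies(sample, columns=to_encode, drop_first=True)

print(cardinality)
print(list(encoded.columns))
```

Here only `Activity Level` is encoded. `Sleep Duration (hours)` stays a single column, to be converted to numbers instead.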

In [11]:

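# Without `columns=`, every object column is encoded. 'Sleep Duration (hours)' is
# stored as text, so each distinct value becomes its own column (see the output below).
df_encoded = pd.get_dummies(df, drop_first=True)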
df_encoded.head()


Unnamed: 0,User ID,Heart Rate (BPM),Blood Oxygen Level (%),Step Count,Sleep Duration (hours)_0.5899874330568986,Sleep Duration (hours)_1.1086136901126222,Sleep Duration (hours)_1.2874439245359204,Sleep Duration (hours)_1.4025072035369401,Sleep Duration (hours)_1.7060873342734366,Sleep Duration (hours)_1.7254351015144884,...,Stress Level_10,Stress Level_2,Stress Level_3,Stress Level_4,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_Very High
0,4174.0,58.939776,98.80965,5450.390578,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,3007.480253,76.035462,98.532195,727.60161,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
2,1860.0,247.803052,97.052954,2826.521994,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
3,2294.0,40.0,96.894213,13797.338044,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,2130.0,61.950165,98.583797,15679.067648,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False



## 8. Feature Scaling
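Standardization rescales each feature to zero mean and unit variance, so features with large ranges (such as step count) do not dominate features with small ranges (such as blood oxygen level).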


In [None]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_encoded)
df_scaled = pd.DataFrame(scaled_data, columns=df_encoded.columns)
df_scaled.head()
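To make the transformation concrete, the sketch below computes the same z-score by hand with pandas, $z = (x - \mu) / \sigma$, using the population standard deviation (`ddof=0`) as `StandardScaler` does. It runs on a made-up sample and scales only the continuous columns, leaving the one-hot indicator column untouched, which is a design choice rather than what the cell above does.

```python
import pandas as pd

# Illustrative sample: two continuous features and one one-hot indicator.
sample = pd.DataFrame({
    "Heart Rate (BPM)": [58.939776, 76.035462, 247.803052, 40.0, 61.950165],
    "Step Count": [5450.390578, 727.60161, 2826.521994, 13797.338044, 15679.067648],
    "Activity Level_Sedentary": [False, False, False, False, True],
})

numeric_cols = ["Heart Rate (BPM)", "Step Count"]

scaled = sample.copy()
mean = sample[numeric_cols].mean()
std = sample[numeric_cols].std(ddof=0)   # population std, as StandardScaler uses
scaled[numeric_cols] = (sample[numeric_cols] - mean) / std

print(scaled.round(3))
```

Each scaled column now has mean 0 and standard deviation 1, while the indicator column keeps its original values.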



## 9. Data Visualization
Visualization helps reveal distributions, outliers, and relationships between features.


In [None]:

df.hist(figsize=(12,8))
plt.show()


In [None]:

plt.figure()
sns.boxplot(data=df.select_dtypes(include=np.number))
plt.xticks(rotation=90)
plt.show()
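Points drawn beyond the whiskers of a box plot can also be listed numerically. The sketch below applies the conventional 1.5 × IQR rule to a handful of heart-rate values taken from the rows shown earlier. The cut-off of 1.5 is the usual default, not a value tuned for this dataset.

```python
import pandas as pd

# Heart-rate values copied from rows displayed earlier in this notebook.
heart_rate = pd.Series(
    [58.939776, 76.035462, 247.803052, 40.0, 61.950165, 78.819386, 48.632659, 73.834442],
    name="Heart Rate (BPM)",
)

q1 = heart_rate.quantile(0.25)
q3 = heart_rate.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = heart_rate[(heart_rate < lower) | (heart_rate > upper)]

print(outliers)
```

In this sample only the 247.8 BPM reading is flagged; whether to drop, cap, or keep such values depends on the application.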


In [None]:

plt.figure()
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True)
plt.show()



## 10. Final Clean Dataset
After imputation, encoding, and scaling, the dataset is fully numeric and can be passed to a Machine Learning model.


In [None]:

df_scaled.head()



## Key Takeaways
- Real-world data is messy  
- Preprocessing is mandatory  
- Visualization guides decisions  
- Clean data improves ML performance  
