   - Data around us: Tabular, Text, Image, Video, Audio.
   - Focus areas of a data scientist: Data, Model, Training
   - AI Project / Research Framework

# Data Analysis & Decision-Making Process

1 Ask (Define the Question)

- Start from a research or business question, for example:
  - Why does the problem occur?
  - If we improve the company policy, will that solve the problem?

2 Prepare (Collect Data)

- Gather all the relevant data needed to answer the question.
- Data sources: databases (SQL), Excel, APIs, CSV files, etc.

3 Process (Clean and Organize Data)
  
- Process the collected data to make it ready for analysis.
- Handle missing values, remove duplicates, and correct errors.

4 Analyze (Explore and Visualize Data)

- Perform statistical and visual analysis.
- Visualization techniques:
  - Code-based: Seaborn, Matplotlib, Plotly
  - No-code tools: Excel, Tableau, Power BI

5 Model (Build Predictive Models)

- Build models to predict outcomes and gain insights.
- Techniques:
  - Machine Learning Algorithms
  - Deep Learning Algorithms
  - AI Algorithms

6 Share (Report and Present Findings)

- Communicate results in a clear and impactful way.
- Reporting Tools: PowerPoint, Canva

7 Act (Take Action)

- Make decisions and implement actions based on the analysis.

In [1]:
import pandas as pd
import os

root_dir =  "C:\\Users\\mrmdh\\OneDrive\\Desktop\\data_science\\pandas_2"
data_dir = os.path.join(root_dir, 'data')
dataset_path = os.path.join(data_dir, 'Class 20_health_monitor_data.csv')

In [8]:
# Read_data_from_csv_file


health_monitor_data = pd.read_csv(dataset_path)
health_monitor_data.head()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
0,60,'2020/12/01',110,130,409.1,Easy
1,60,'2020/12/02',117,145,479.0,Moderate
2,60,'2020/12/03',103,135,340.0,Moderate
3,45,'2020/12/04',109,175,282.4,Moderate
4,45,'2020/12/05',117,148,406.0,Heavy


In [9]:
# 1_Understand_the_structure_of_the_dataset

health_monitor_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  33 non-null     int64  
 1   Date      32 non-null     object 
 2   Pulse     33 non-null     int64  
 3   Maxpulse  33 non-null     int64  
 4   Calories  31 non-null     float64
 5   Type      33 non-null     object 
dtypes: float64(1), int64(3), object(2)
memory usage: 1.8+ KB


In [10]:
health_monitor_data['Type'].value_counts()

Type
Moderate    15
Heavy       14
Easy         4
Name: count, dtype: int64

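Data Cleaning: Handle missing values

- Approach 1: Remove the affected rows (reasonable when the dataset is large enough that losing them does not matter)
- Approach 2: Impute the missing values, for example with the column mean (preferable when the dataset is small and every row counts)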
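The two approaches can be compared side by side on a tiny frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the health monitor dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [400.0, np.nan, 300.0, np.nan],
})

# Approach 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Approach 2: keep all rows and fill the gaps with the column mean.
# Series.mean() skips NaN by default, so the mean uses only observed values.
mean_cal = toy["Calories"].mean()
imputed = toy.fillna({"Calories": mean_cal})

print("rows after dropna:", len(dropped))
print("rows after fillna:", len(imputed))
print(imputed)
```

Dropping halves this toy frame, while imputing keeps all four rows at the cost of replacing two measurements with an estimate. Neither call modifies `toy` in place; both return a new DataFrame.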


In [11]:

mask = health_monitor_data.isnull().any(axis=1)
print(mask)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18     True
19    False
20    False
21    False
22     True
23    False
24    False
25    False
26    False
27    False
28     True
29    False
30    False
31    False
32    False
dtype: bool


In [12]:
health_monitor_data[mask]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
18,45,'2020/12/18',90,112,,Moderate
22,45,,100,119,282.0,Moderate
28,60,'2020/12/28',103,132,,Heavy


In [14]:
average_cal = health_monitor_data['Calories'].mean()
print('Average calories: {}'.format(average_cal))

Average calories: 302.93870967741935


In [15]:
health_monitor_data['Calories'] = health_monitor_data['Calories'].fillna(
    value=average_cal
)

health_monitor_data.head(20)

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
0,60,'2020/12/01',110,130,409.1,Easy
1,60,'2020/12/02',117,145,479.0,Moderate
2,60,'2020/12/03',103,135,340.0,Moderate
3,45,'2020/12/04',109,175,282.4,Moderate
4,45,'2020/12/05',117,148,406.0,Heavy
5,60,'2020/12/06',102,127,300.0,Easy
6,60,'2020/12/07',110,136,374.0,Moderate
7,450,'2020/12/08',104,134,253.3,Moderate
8,30,'2020/12/09',109,133,195.1,Heavy
9,60,'2020/12/10',98,124,269.0,Moderate


In [16]:
mask = health_monitor_data.isnull().any(axis=1)
health_monitor_data[mask]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
22,45,,100,119,282.0,Moderate


Remove rows with null values (two options)
1. Filtering: health_monitor_data = health_monitor_data[~mask]
2. dropna(): removes every row that contains at least one null value
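A quick way to see that the two options give the same result is to run both on a small frame. The `toy` data below is hypothetical and only mirrors the shape of the problem (one row with a missing date).

```python
import pandas as pd

# Hypothetical toy data with one missing Date.
toy = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Pulse": [110, 100, 103],
})

# Option 1: build a boolean mask of rows with any null, then keep the rest.
mask = toy.isnull().any(axis=1)
by_filter = toy[~mask]

# Option 2: let pandas do the same thing in one call.
by_dropna = toy.dropna()

# dropna can also be limited to specific columns via `subset`.
by_subset = toy.dropna(subset=["Date"])

print(by_filter.equals(by_dropna))
print(by_dropna.index.tolist())
```

Both options keep the original index labels, so the surviving rows are still labelled 0 and 2. That is why later outputs in this notebook skip the index of the removed row.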


In [17]:
health_monitor_data = health_monitor_data.dropna()
health_monitor_data.isnull().sum()

Duration    0
Date        0
Pulse       0
Maxpulse    0
Calories    0
Type        0
dtype: int64

In [18]:
health_monitor_data.head(33)

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
0,60,'2020/12/01',110,130,409.1,Easy
1,60,'2020/12/02',117,145,479.0,Moderate
2,60,'2020/12/03',103,135,340.0,Moderate
3,45,'2020/12/04',109,175,282.4,Moderate
4,45,'2020/12/05',117,148,406.0,Heavy
5,60,'2020/12/06',102,127,300.0,Easy
6,60,'2020/12/07',110,136,374.0,Moderate
7,450,'2020/12/08',104,134,253.3,Moderate
8,30,'2020/12/09',109,133,195.1,Heavy
9,60,'2020/12/10',98,124,269.0,Moderate


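- Remove duplicates

Example with five rows, where the fourth row repeats the first:

R1
R2
R3
R1
R5

duplicated(): mask = [False, False, False, True, False]

Only the repeated row is flagged; its first occurrence is kept.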
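The five-row illustration can be reproduced directly. In this sketch the labels R1 to R5 are stored in a single hypothetical column named `row`; the `keep` argument of `duplicated()` controls which copy counts as the duplicate.

```python
import pandas as pd

# The five rows from the illustration; the fourth repeats the first.
rows = pd.DataFrame({"row": ["R1", "R2", "R3", "R1", "R5"]})

# Default keep="first": later copies are flagged, the first one is not.
mask = rows.duplicated()
print(mask.tolist())

# keep="last" flags the earlier copy instead; keep=False flags every copy.
mask_last = rows.duplicated(keep="last")
mask_all = rows.duplicated(keep=False)

# drop_duplicates() is equivalent to filtering with ~mask.
deduped = rows.drop_duplicates()
print(deduped["row"].tolist())
```

With the default settings, `rows[~mask]` and `rows.drop_duplicates()` return the same four rows.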


In [19]:
mask = health_monitor_data.duplicated()
print(mask)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32     True
dtype: bool


In [20]:
health_monitor_data[mask]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
32,60,'2020/12/12',100,120,250.7,Heavy


In [21]:
health_monitor_data = health_monitor_data.drop_duplicates()

In [22]:
print(f"No of duplicate rows: {health_monitor_data.duplicated().sum()}")

No of duplicate rows: 0


In [23]:
# Convert_data_types

health_monitor_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31 entries, 0 to 31
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  31 non-null     int64  
 1   Date      31 non-null     object 
 2   Pulse     31 non-null     int64  
 3   Maxpulse  31 non-null     int64  
 4   Calories  31 non-null     float64
 5   Type      31 non-null     object 
dtypes: float64(1), int64(3), object(2)
memory usage: 1.7+ KB


In [24]:
health_monitor_data['Date'] = health_monitor_data['Date'].astype('datetime64[ns]')
health_monitor_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31 entries, 0 to 31
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Duration  31 non-null     int64         
 1   Date      31 non-null     datetime64[ns]
 2   Pulse     31 non-null     int64         
 3   Maxpulse  31 non-null     int64         
 4   Calories  31 non-null     float64       
 5   Type      31 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 1.7+ KB


In [25]:
health_monitor_data.head()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Type
0,60,2020-12-01,110,130,409.1,Easy
1,60,2020-12-02,117,145,479.0,Moderate
2,60,2020-12-03,103,135,340.0,Moderate
3,45,2020-12-04,109,175,282.4,Moderate
4,45,2020-12-05,117,148,406.0,Heavy
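As an alternative to `astype('datetime64[ns]')`, `pd.to_datetime` lets you state the expected format and decide what happens to values that do not parse. The sketch below uses a small hypothetical Series. It assumes the raw strings carry literal single quotes, as the earlier output suggests, and the `"not a date"` entry is invented to show the `errors="coerce"` behaviour.

```python
import pandas as pd

# Hypothetical raw values; the quotes and the bad entry are assumptions.
raw = pd.Series(["'2020/12/01'", "'2020/12/02'", "not a date"])

# Strip the surrounding single quotes before parsing.
cleaned = raw.str.strip("'")

# An explicit format avoids ambiguous guesses; errors="coerce" turns
# unparsable values into NaT instead of raising an exception.
parsed = pd.to_datetime(cleaned, format="%Y/%m/%d", errors="coerce")

print(parsed)
print("unparsable values:", int(parsed.isna().sum()))
```

Coercing to `NaT` makes bad dates show up in the same `isnull()` checks used earlier for missing values, so they can be removed or fixed with the same tools.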


""" Remove outliers

Interquartile Range (IQR)

IQR = Q3 - Q1
L = Q1 - 1.5 * IQR   (lower bound)
R = Q3 + 1.5 * IQR   (upper bound)

Any value below L or above R is treated as an outlier.
"""