<a href="https://colab.research.google.com/github/luferIPCA/LESI-POO-2024-2025/blob/main/4_Spliting_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for MLA course

by [*lufer*](mailto:lufer@ipca.pt)

(vers 2.0)

---



# ML Modelling - Part I

**Contents**:

1.  Splitting Datasets


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Environment preparation


### Importing necessary Libraries

In [None]:
import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

Mounting Drive

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
# mount point must match the path used below to load the dataset
drive.mount('/content/drive', force_remount=True)

In [None]:
#check current pwd
#import os
#print(os.getcwd())

*Loading dataset*

In [None]:

import os
#print(os.getcwd())

filePath='/content/drive/MyDrive/Colab Notebooks/MIA - ML - 2024-2025/Datasets/'
ds = pd.read_csv(filePath+"heart-disease.csv")
pd.set_option("display.precision", 2)

In [None]:
ds.head(5)

## 1 - Splitting a Dataset

Data splitting involves dividing a dataset into training, validation, and testing subsets.
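
`train_test_split` returns only two subsets per call, so a three-way split is commonly obtained by calling it twice. The sketch below shows one possible way to do this on a small synthetic DataFrame; the data, the 60/20/20 proportions, and the names `toy`, `X_tmp`, and `X_val` are illustrative assumptions, not part of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 rows, 3 features, binary target
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy["target"] = rng.integers(0, 2, size=100)

X = toy.drop("target", axis=1)
y = toy["target"]

# Step 1: hold out 20% of the rows as the test set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40)

# Step 2: split the remaining 80% into train (75% of it) and validation (25% of it),
# which corresponds to 60% / 20% / 20% of the original rows
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=40)

print(len(X_train), len(X_val), len(X_test))
```

The validation subset would typically be used for model selection and tuning, leaving the test subset untouched until the final evaluation.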

In [None]:
ds.info()

### *Check Missing values*

In [None]:
#check missing values
ds.isnull().sum()

#or
#n1 = ds.isnull().any(axis=1)
#n1
#answer: zero null values

#or
#ds.columns[ds.isnull().any()]

#or
#fraction of missing values in each column that has any
#miss = ds.isnull().sum()/len(ds)
#miss = miss[miss > 0]
#miss = miss.sort_values()
#miss


### Visualizing the Dataset

Several examples

In [None]:
# total heart disease cases for men aged 20 to 40
young_men = ds.query("(age <= 40) & (age >= 20) & (sex == 1) & (target == 1)")['target'].value_counts()
young_men.plot(kind='bar', figsize=(5,5), legend=True, title="Total Heart Disease Cases for Men Aged 20-40")
#young_men

In [None]:
#see https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.02-Introducing-Scikit-Learn.ipynb#scrollTo=5IOss1GZkUrm
%matplotlib inline
# each condition needs its own parentheses: & binds more tightly than ==
dr = ds[(ds['age'] >= 20) & (ds['age'] < 40) & (ds['sex'] == 1)][['age','target']]
#check dr type
print(type(dr))
sns.pairplot(dr, hue='target', height=4);

In [None]:
# number of records per value of 'sex' (at most the 10 most frequent values)
top_10 = ds['sex'].value_counts()[:10]
#or
#top_10 = ds['sex'].value_counts().head(10)

top_10.plot(kind='bar',figsize=(5,5))
plt.title('Records by Sex (top 10)')
plt.xlabel('Sex')
plt.ylabel('Number of Records')

### Check correlation among all features

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(ds.corr(),linewidths=.01,annot=True,cmap="winter")
plt.show()

In [None]:
# one histogram per feature, showing its distribution
ds.hist(figsize=(12,12))
plt.show()


### Extracting the features matrix and target array

*Shuffle the original data*

In [None]:
# shuffle the DataFrame rows
dsc = ds.copy()
dsc = dsc.sample(frac=1)   # frac=1 returns 100% of the rows in random order (shuffle)
dsc.head(5)

#compare with initial dataset
#ds.head(5)

*Features Matrix*

1. Main features (matrix)
2. Target Feature (Label | Categorical Target) (column)
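
Because `train_test_split` pairs the rows of `X` and `y` by position, both should be extracted from the same (shuffled) DataFrame. The following is a minimal sketch on made-up data; the frame `toy` and its values are assumptions for illustration, with column names that only mimic the heart-disease dataset.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "age":    [29, 45, 61, 38, 52],
    "sex":    [1, 0, 1, 1, 0],
    "target": [0, 1, 1, 0, 1],
})

# Shuffle the rows (random_state is fixed only to make the example repeatable)
shuffled = toy.sample(frac=1, random_state=40)

# Features matrix and target array, both taken from the SAME shuffled frame
X = shuffled.drop("target", axis=1)
y = shuffled["target"]

print(X.shape, y.shape)
print(X.index.equals(y.index))   # True: row i of X belongs to element i of y
```

If `X` came from the unshuffled frame and `y` from the shuffled one, the shapes would still match, but features and labels would no longer describe the same patients.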


In [None]:
#Steps

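#1 - Divide the data into features (X) and labels (y)

#dataframe with all feature variables
#taken from the shuffled copy (dsc) so that its rows stay aligned with y_dsc
X_dsc = dsc.drop('target', axis=1)   # use all columns except the target one
#or
#X_dsc = dsc.loc[:, dsc.columns != 'target']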

#analyse
X_dsc.shape
#X_dsc

*Target array*

'y' labels

In [None]:
#who has target >0
#dsc_t=dsc.target[dsc.target> 0]
#dsc_t

In [None]:
#2 - Get dependent feature (target)
# Series with the target value (dependent feature)
y_dsc = dsc['target']         # we want to predict y using X
y_dsc.shape
#y_dsc

*Split*
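
`train_test_split` shuffles the rows before splitting, and its `random_state` parameter seeds that shuffle. The effect can be checked directly with a small sketch on a synthetic array (an assumption for illustration, not the course dataset): the same data is split several times and the resulting test sets are compared.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)   # synthetic data: the integers 0..19

# Same integer seed: identical split on every call
_, test_a = train_test_split(data, test_size=0.25, random_state=40)
_, test_b = train_test_split(data, test_size=0.25, random_state=40)
print(np.array_equal(test_a, test_b))   # True

# random_state=None (the default): the split generally changes between calls
_, test_c = train_test_split(data, test_size=0.25)
_, test_d = train_test_split(data, test_size=0.25)
print(test_c, test_d)
```

Fixing the seed is useful when results need to be reproduced or compared across runs.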

In [None]:
#library to split our data into train and test sets
#train, test = train_test_split(dataset, ...)

#Note: random_state seeds the shuffling applied before the split
# random_state=None - the function generates a different split in each execution,
#                     so we get different train and test sets across executions
# random_state=40 - (like a seed) the function generates the same split every time

#1 - Split the dataset: test=25%; training=75%
# test size=25%
X_train, X_test, y_train, y_test = train_test_split(X_dsc,y_dsc,test_size=0.25,random_state=40)
#or
# train size=80%
#X_train, X_test, y_train, y_test=train_test_split(X_dsc,y_dsc,train_size=0.8,random_state=40)


In [None]:
# Train dataset (75%)
X_train
# 75%
print(len(X_train)*100/len(dsc))

In [None]:
X_train.shape

In [None]:
# Test dataset (25%)
X_test.shape
# 25%
print(len(X_test)*100/len(dsc))

In [None]:
y_train

In [None]:
y_test

### Stratified train-test split

Split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
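
The effect is easiest to see on imbalanced data. The sketch below uses synthetic labels (90 samples of class 0 and 10 of class 1, an assumption chosen for illustration) and compares a plain random split with a stratified one.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 90 of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)   # dummy single-column feature matrix

# Plain random split: the minority class may be over- or under-represented
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.5, random_state=40)

# Stratified split: each subset keeps the 90/10 ratio of the full data
_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, test_size=0.5, random_state=40, stratify=y)

print("plain     :", Counter(y_tr_plain), Counter(y_te_plain))
print("stratified:", Counter(y_tr_strat), Counter(y_te_strat))
```

With stratification, both halves contain 45 samples of class 0 and 5 of class 1; without it, the counts depend on the random shuffle.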

Split without stratification

In [None]:
from collections import Counter
#split again
X_train, X_test, y_train, y_test=train_test_split(X_dsc,y_dsc,test_size=0.5,random_state=40)
print(Counter(y_dsc))
print(Counter(y_train))
print(Counter(y_test))
# Analysis:
# without stratification, the class proportions in train and test can differ from those of the full dataset

In [None]:
len(dsc)

Split with stratification

In [None]:
#split again with stratification
X_train, X_test, y_train, y_test=train_test_split(X_dsc,y_dsc,test_size=0.5,random_state=40, stratify=y_dsc)
print(Counter(y_dsc))
print(Counter(y_train))
print(Counter(y_test))
# Analysis:
# with stratification, train and test have almost the same class proportions as the full dataset

End!