# Splitting the Dataset

In this step, the dataset is divided into **training** and **test** sets. The training set is used to fit the machine learning models, while the test set is held out to evaluate their performance on data they have not seen.

We use an **80-20** split: 80% of the data goes to training and 20% to testing. Rows are assigned to each set at random so that both sets are representative of the full dataset, which makes the test score a fairer estimate of how well a model generalizes.


In [3]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# Load dataset
df = pd.read_csv('alzheimers_prediction_dataset.csv')

## Train-Test Split

In this step, we split the dataset into training and testing sets using the **train_test_split** function from **scikit-learn**:

1. **Feature Set (X)**: We drop the target column (**Alzheimer’s Diagnosis**) and store the remaining features in **X**.
2. **Target Variable (y)**: The target is the **Alzheimer’s Diagnosis** column, which we convert to binary values (`0` for 'No' and `1` for 'Yes').
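
One thing worth checking when encoding the target this way: `Series.map` with a dictionary turns any label that is not a key in the dictionary into `NaN`. The minimal sketch below uses a small, made-up series (the values, including the deliberately mis-cased `'no'`, are hypothetical and not taken from the dataset) to show how such labels could be spotted before splitting.

```python
import pandas as pd

# Hypothetical labels for illustration only; the last value is deliberately unexpected
labels = pd.Series(["No", "Yes", "Yes", "no"], name="Alzheimer’s Diagnosis")

# Same mapping as used for the real target column
encoded = labels.map({"No": 0, "Yes": 1})

# Labels missing from the mapping dictionary come back as NaN
unmapped = labels[encoded.isna()]
print(f"Unmapped labels: {unmapped.tolist()}")
```

If the list of unmapped labels is empty on the real column, the binary conversion covered every row.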

We perform an **80-20 train-test split**, stratified on the target variable so that the proportion of 'Yes' and 'No' diagnoses stays similar in the training and test sets. A fixed `random_state` makes the split reproducible.

- **Training Set (X_train, y_train)**: 80% of the data.
- **Test Set (X_test, y_test)**: 20% of the data.

The sizes of the resulting training and test sets are printed for verification.


In [6]:
from sklearn.model_selection import train_test_split

# Drop the target column and store it separately
X = df.drop(columns=['Alzheimer’s Diagnosis'])
y = df['Alzheimer’s Diagnosis'].map({'No': 0, 'Yes': 1})  # convert to binary

# Early train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")


Training set size: 59426 samples
Test set size: 14857 samples
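
To confirm that stratification did what we expect, the class proportions in each subset can be compared with those of the full target. The sketch below is illustrative only: it uses a synthetic 1,000-row frame with hypothetical column names (`feature`, `target`) rather than the real dataset, but the same comparison could be applied to `y`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 40% positive class
toy = pd.DataFrame({
    "feature": range(1000),
    "target": [1] * 400 + [0] * 600,
})

X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

# Same split settings as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class proportions side by side for the full data and both subsets
proportions = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions)
```

With `stratify` set, the three columns should be nearly identical; without it, the subsets can drift apart, especially for rare classes.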


## Checking the Shape of Train and Test Sets

After splitting the dataset, we check the shapes of the training and test sets for both the features and the target variable:

- **X_train**: Features of the training set.
- **X_test**: Features of the test set.
- **y_train**: Target variable (Alzheimer's Diagnosis) for the training set.
- **y_test**: Target variable (Alzheimer's Diagnosis) for the test set.

This check confirms that the row counts match the expected proportions of the **train-test split** and that the features and target in each set have the same number of rows.


In [7]:
print(f'Shape of X_Train {X_train.shape}')
print(f'Shape of X_Test {X_test.shape}')
print(f'Shape of Y_Train {y_train.shape}')
print(f'Shape of Y_Test {y_test.shape}')

Shape of X_Train (59426, 24)
Shape of X_Test (14857, 24)
Shape of Y_Train (59426,)
Shape of Y_Test (14857,)
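
As a quick arithmetic check, the printed sizes can be compared against the requested 80-20 proportions. The sketch below hard-codes the row counts from the output above. It relies on scikit-learn rounding the test count up when `test_size` is a fraction, which is why the test share lands slightly above 20%.

```python
import math

# Row counts taken from the printed shapes above
n_train, n_test = 59426, 14857
n_total = n_train + n_test

test_share = n_test / n_total
print(f"Total rows: {n_total}")
print(f"Test share: {test_share:.4%}")

# A fractional test_size is rounded up to a whole number of rows
expected_test = math.ceil(0.2 * n_total)
print(f"Expected test rows: {expected_test}")
print(f"Matches printed size: {expected_test == n_test}")
```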
