In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours Study': [2, 3, 4, 5, 6],
    'Marks Scored': [50, 60, 70, 80, 90]
})
df

|   | Hours Study | Marks Scored |
| - | ----------- | ------------ |
| 0 | 2           | 50           |
| 1 | 3           | 60           |
| 2 | 4           | 70           |
| 3 | 5           | 80           |
| 4 | 6           | 90           |


### Data Splitting (Training Data and Testing Data)

In [5]:
# Data Splitting (Training Data and Testing Data)
# Split the dataset into features (X) and the target variable (y), then into training and testing sets.
# train_test_split performs the split: test_size=0.2 reserves 20% of the rows for testing, and random_state=42 makes the split reproducible.
# X holds the feature column 'Hours Study' (it can hold multiple columns); y holds the target 'Marks Scored'.

X = df[['Hours Study']]
y = df['Marks Scored']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X_train and X_test are pandas DataFrames; y_train and y_test are pandas Series.
print("X_train:\n", X_train)
print('\n')
print("X_test:\n", X_test)
print('\n')
print("y_train:\n", y_train)
print('\n')
print("y_test:\n", y_test)

# Note: Call train_test_split first, then scale. Fit the scaler on the training data only, so that test-set statistics do not leak into training.

X_train:
    Hours Study
4            6
2            4
0            2
3            5


X_test:
    Hours Study
1            3


y_train:
 4    90
2    70
0    50
3    80
Name: Marks Scored, dtype: int64


y_test:
 1    60
Name: Marks Scored, dtype: int64
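
The note about splitting before scaling can be illustrated with a minimal sketch. `StandardScaler` is an assumption here (the notebook does not say which scaler is intended); any scikit-learn transformer with `fit`/`transform` would follow the same pattern. The scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # assumed choice of scaler

df = pd.DataFrame({
    'Hours Study': [2, 3, 4, 5, 6],
    'Marks Scored': [50, 60, 70, 80, 90]
})
X = df[['Hours Study']]
y = df['Marks Scored']

# Split first, so the test rows stay unseen while the scaler is fitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # apply the training statistics to the test rows

print("Mean learned from training data:", scaler.mean_[0])
print("Scaled test data:\n", X_test_scaled)
```

Calling `fit_transform` on the full `X` before splitting would let the test row influence the mean and standard deviation, which is the leakage the note warns about.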


| Part              | Meaning                                        |
| ----------------- | ---------------------------------------------- |
| `X`               | Features (input columns)                       |
| `y`               | Target (label/output)                          |
| `test_size=0.2`   | 20% of the data for testing, 80% for training  |
| `random_state=42` | Ensures the split is **repeatable** every time |
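
As a quick check of the repeatability claim, the sketch below splits the same data twice with the same seed and compares which rows land in the test set. The variable names `X_test_a`, `X_test_b`, and `same_split` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours Study': [2, 3, 4, 5, 6],
    'Marks Scored': [50, 60, 70, 80, 90]
})
X = df[['Hours Study']]
y = df['Marks Scored']

# Two independent calls with the same seed pick the same rows for the test set
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = X_test_a.index.equals(X_test_b.index)
print(same_split)  # True
```

Without `random_state`, each call shuffles with a fresh random seed, so the test rows can differ from run to run.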


| Parameter      | Description                                                                                 |
| -------------- | ------------------------------------------------------------------------------------------- |
| `test_size`    | Float = proportion of the data, int = number of samples in the test set (e.g., 0.2 or 0.3)   |
| `train_size`   | Optional; same convention for the training set, defaults to the complement of `test_size`   |
| `shuffle`      | Whether to shuffle the rows before splitting (default = True)                               |
| `random_state` | Seed for the shuffle, so the same split is produced every time                              |
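
The parameters in the table above can be tried on the same five-row dataset. This is a small sketch, with the `_ord` variable names chosen only for illustration: it passes `test_size` as an integer (a sample count rather than a proportion) and turns shuffling off, so the split follows the original row order.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours Study': [2, 3, 4, 5, 6],
    'Marks Scored': [50, 60, 70, 80, 90]
})
X = df[['Hours Study']]
y = df['Marks Scored']

# test_size=2 means "two samples", not "200%"; shuffle=False keeps row order,
# so the first three rows go to training and the last two to testing
X_train_ord, X_test_ord, y_train_ord, y_test_ord = train_test_split(
    X, y, test_size=2, shuffle=False
)

print(X_train_ord.index.tolist())  # [0, 1, 2]
print(X_test_ord.index.tolist())   # [3, 4]
```

An unshuffled split like this is mainly useful when row order carries meaning, for example time-ordered data.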


| Dataset Size   | Common Split Ratio       |
| -------------- | ------------------------ |
| Small Dataset  | **70% Train / 30% Test** |
| Medium Dataset | **80% Train / 20% Test** |
| Large Dataset  | **90% Train / 10% Test** |
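
To see what these ratios mean in row counts, the sketch below applies each one to a hypothetical 1,000-sample array (`data` is synthetic and stands in for any dataset) and records the resulting train and test sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(1000)  # hypothetical dataset of 1,000 samples

split_sizes = {}
for test_size in (0.3, 0.2, 0.1):
    train, test = train_test_split(data, test_size=test_size, random_state=42)
    split_sizes[test_size] = (len(train), len(test))
    print(f"test_size={test_size}: {len(train)} train / {len(test)} test")

# test_size=0.3: 700 train / 300 test
# test_size=0.2: 800 train / 200 test
# test_size=0.1: 900 train / 100 test
```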
