# Train-Test Split

A train-test split divides a dataset into two subsets: one used to fit the model and one held out to estimate how well the model generalizes to data it has not seen. It is a standard step in the machine learning pipeline for evaluating model performance.

---

#### **Purpose:**
1. **Training Set:** Used to train the model by learning patterns and relationships from the data.
2. **Testing Set:** Used to evaluate the trained model's performance on unseen data.

---

#### **Process:**
1. The dataset is divided into two parts:
   - Training set: Typically 70-80% of the data.
   - Testing set: Remaining 20-30% of the data.
2. The division is made at random so that both subsets are likely to be representative of the full dataset.
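
The two steps above can be sketched by hand with NumPy to show what a random split does. This is a minimal illustration on a made-up toy array, not the method used later in this notebook; the array contents, the seed, and the 80/20 ratio are assumptions chosen for the example.

```python
import numpy as np

# Toy dataset: 10 samples with 2 features each (values are arbitrary).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

rng = np.random.default_rng(seed=0)      # seed chosen only for illustration
shuffled = rng.permutation(len(X_toy))   # random ordering of the row indices

n_test = int(round(0.2 * len(X_toy)))    # 20% of the rows go to the test set
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

X_toy_train, X_toy_test = X_toy[train_idx], X_toy[test_idx]
y_toy_train, y_toy_test = y_toy[train_idx], y_toy[test_idx]

print(X_toy_train.shape, X_toy_test.shape)   # (8, 2) (2, 2)
```

Because the same index arrays select rows from both the features and the targets, each sample stays paired with its label.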

---

#### **Key Parameters:**
1. **`test_size`:** Proportion of the dataset to be used as the test set (e.g., `0.2` for 20%).
2. **`random_state`:** Ensures reproducibility by setting a seed for random splitting.
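
A short sketch of how these two parameters behave, using a small made-up array rather than the notebook's data. The array sizes and the seed value `42` are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # toy features (20 rows)
y_demo = np.arange(20)                  # toy targets

# Same seed -> identical split on every call.
_, _, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
_, _, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

print(np.array_equal(y_test_a, y_test_b))   # True
print(len(y_test_a))                        # 5 (25% of 20 rows)
```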

---

#### **Importance:**
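1. Exposes overfitting: a model that scores well on its training data but poorly on the held-out test data has memorized the training set rather than generalized. The split does not prevent overfitting by itself; it makes it visible.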
2. Provides an unbiased estimate of the model's performance on unseen data.
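
The gap between training and test scores can be illustrated with a model that is prone to memorizing its training data. The sketch below uses synthetic data and an unconstrained `DecisionTreeRegressor`; both are assumptions made for the example and are not part of this notebook's workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (chosen only for illustration).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(200, 1))
y_syn = np.sin(2 * np.pi * X_syn[:, 0]) + rng.normal(scale=0.3, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# An unconstrained tree can fit the training noise exactly.
tree = DecisionTreeRegressor(random_state=0).fit(Xs_train, ys_train)

train_r2 = tree.score(Xs_train, ys_train)   # score on data the model has seen
test_r2 = tree.score(Xs_test, ys_test)      # score on held-out data
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

Evaluating only on the training rows would report a near-perfect score here; the held-out score is the one that reflects performance on new data.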

---

#### **Best Practices:**
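1. Shuffle the data before splitting so that any ordering in the file does not bias the subsets. `train_test_split` shuffles by default (`shuffle=True`).
2. For classification with imbalanced classes, use stratified splitting (the `stratify` parameter) to preserve the class distribution in both sets.
3. Avoid leaking test data into the training process, for example by fitting scalers or other preprocessing only on the training set, so that the evaluation stays fair.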
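
The second and third practices can be sketched together. The example below uses made-up imbalanced labels and a `StandardScaler` as a stand-in for any preprocessing step; the data, the 90/10 class ratio, and the choice of scaler are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_cls = rng.normal(size=(100, 3))
y_cls = np.array([0] * 90 + [1] * 10)   # imbalanced labels: 10% positives

# stratify=y_cls keeps the 90/10 class ratio in both subsets.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.2, stratify=y_cls, random_state=0
)
print(yc_train.sum(), yc_test.sum())    # 8 2

# Fit preprocessing on the training rows only, then apply it to both subsets,
# so that no statistic computed from the test rows reaches the model.
scaler = StandardScaler().fit(Xc_train)
Xc_train_scaled = scaler.transform(Xc_train)
Xc_test_scaled = scaler.transform(Xc_test)
```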

---

By using a train-test split, the model's ability to generalize to new, unseen data can be effectively measured, which is essential for building robust machine learning systems.


In [13]:
import pandas as pd


In [14]:
boston_data = pd.read_csv('Boston.csv')
boston_data.head()

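|   | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | House_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |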
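
The `Unnamed: 0` column in the output above looks like a row number that was written into the CSV when it was saved, although the source does not confirm this. If that is the case, one option is to read it as the index so it does not end up among the features. The sketch below demonstrates the idea on a hypothetical two-row stand-in held in memory, not on `Boston.csv` itself.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV that was saved together with its index.
csv_text = ",crim,House_price\n1,0.00632,24.0\n2,0.02731,21.6\n"

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))     # ['Unnamed: 0', 'crim', 'House_price']

# index_col=0 treats the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))   # ['crim', 'House_price']
```

The rest of this notebook keeps the column, so the shapes reported below include it.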


### Process: Train-Test Split

The process of train-test splitting involves dividing the input data and output labels into separate training and testing subsets. This ensures that the model is trained on one set of data and evaluated on another to gauge its generalization capability.

---

#### **Steps:**

1. **Importing Required Library:**
   - Use the `train_test_split` function from `sklearn.model_selection`.

2. **Defining Input and Output Data:**
   - `input_data`: Features or predictors in the dataset.
   - `output_data`: Target variable or labels.

3. **Performing the Split:**
   - Specify the proportion of the test set using the `test_size` parameter (e.g., `0.25` for 25%).
   - Optionally, use `random_state` to ensure reproducibility.

---

#### **Generated Outputs:**
1. **`x_train`:** Training set for features.
2. **`x_test`:** Testing set for features.
3. **`y_train`:** Training set for target labels.
4. **`y_test`:** Testing set for target labels.

---

#### **Key Points:**
1. The `test_size` parameter defines the proportion of the dataset used as the test set (e.g., 0.25 indicates 25% of the data).
2. Rows are assigned to each subset at random; passing an integer `random_state` makes the split identical on every run.
3. The function returns four subsets:
   - `x_train` and `y_train` for training the model.
   - `x_test` and `y_test` for evaluating the model.

---

#### **Purpose:**
The split ensures that the testing data is unseen during the training phase, providing an unbiased estimate of the model's performance on new data.


In [15]:
# Every column except the last (House_price) is used as a feature; this includes 'Unnamed: 0'
input_data = boston_data.iloc[:, :-1]
input_data.head()

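|   | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 |
| 2 | 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
| 3 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
| 4 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 |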


In [16]:
output_data = boston_data['House_price']
output_data.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: House_price, dtype: float64

In [17]:
input_data.shape, output_data.shape

((506, 14), (506,))

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
# Hold out 25% of the rows; no random_state is set, so the selected rows change on each run
x_train, x_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.25)

In [20]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((379, 14), (127, 14), (379,), (127,))
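
The row counts above follow from the arithmetic below. To my understanding, scikit-learn rounds the test count up when `test_size` is a fraction and gives the remaining rows to the training set; treat that rounding rule as an assumption, although it matches the shapes reported here.

```python
import math

n_samples = 506     # rows in the dataset, from input_data.shape
test_size = 0.25    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # 126.5 rounds up to 127
n_train = n_samples - n_test                # the remaining 379 rows

print(n_train, n_test)   # 379 127
```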

In [22]:
x_train.head()

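|   | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218 | 219 | 0.11069 | 0.0 | 13.89 | 1 | 0.55 | 5.951 | 93.8 | 2.8893 | 5 | 276 | 16.4 | 396.9 | 17.92 |
| 292 | 293 | 0.03615 | 80.0 | 4.95 | 0 | 0.411 | 6.63 | 23.4 | 5.1167 | 4 | 245 | 19.2 | 396.9 | 4.7 |
| 144 | 145 | 2.77974 | 0.0 | 19.58 | 0 | 0.871 | 4.903 | 97.8 | 1.3459 | 5 | 403 | 14.7 | 396.9 | 29.29 |
| 304 | 305 | 0.05515 | 33.0 | 2.18 | 0 | 0.472 | 7.236 | 41.1 | 4.022 | 7 | 222 | 18.4 | 393.68 | 6.93 |
| 262 | 263 | 0.52014 | 20.0 | 3.97 | 0 | 0.647 | 8.398 | 91.5 | 2.2885 | 5 | 264 | 13.0 | 386.86 | 5.91 |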


In [23]:
x_test.head()

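|   | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 233 | 234 | 0.33147 | 0.0 | 6.2 | 0 | 0.507 | 8.247 | 70.4 | 3.6519 | 8 | 307 | 17.4 | 378.95 | 3.95 |
| 334 | 335 | 0.03738 | 0.0 | 5.19 | 0 | 0.515 | 6.31 | 38.5 | 6.4584 | 5 | 224 | 20.2 | 389.4 | 6.75 |
| 158 | 159 | 1.34284 | 0.0 | 19.58 | 0 | 0.605 | 6.066 | 100.0 | 1.7573 | 5 | 403 | 14.7 | 353.89 | 6.43 |
| 254 | 255 | 0.04819 | 80.0 | 3.64 | 0 | 0.392 | 6.108 | 32.0 | 9.2203 | 1 | 315 | 16.4 | 392.89 | 6.57 |
| 452 | 453 | 5.09017 | 0.0 | 18.1 | 0 | 0.713 | 6.297 | 91.8 | 2.3682 | 24 | 666 | 20.2 | 385.09 | 17.27 |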


In [25]:
y_train.head()

218    21.5
292    27.9
144    11.8
304    36.1
262    48.8
Name: House_price, dtype: float64

In [26]:
y_test.head()

233    48.3
334    20.7
158    24.3
254    21.9
452    16.1
Name: House_price, dtype: float64
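
The outputs above show that the features and targets in each subset share the same index labels (218, 292, ... for training; 233, 334, ... for testing). A few quick checks can confirm that a split is well formed. The sketch below runs them on a small synthetic DataFrame rather than on the Boston data; the frame, the column names, and the seed are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (values are arbitrary).
features = pd.DataFrame({"a": np.arange(12), "b": np.arange(12) * 2.0})
target = pd.Series(np.arange(12) * 0.5, name="target")

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)

# Feature and target rows stay aligned: both carry the same index labels.
print(f_train.index.equals(t_train.index))             # True
# No row appears in both subsets.
print(len(f_train.index.intersection(f_test.index)))   # 0
# Together, the subsets cover every original row.
print(len(f_train) + len(f_test) == len(features))     # True
```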