# Splitting the Data into Training and Test Sets

* **Training Data (Train data)**: The subset of the dataset used to train the model.
* **Test Data (Test data)**: The subset of the dataset used to evaluate the performance of the trained model.

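The data is split into these sets so that the model is evaluated on examples it never saw during training. Splitting does not prevent overfitting by itself, but it exposes it: a model that scores well on the training data and poorly on the test data has overfit, while a good test score suggests that the model generalizes to unseen data.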
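
To make the train/test gap concrete, here is a minimal sketch. The choice of the wine dataset, a `DecisionTreeClassifier`, and the seed `0` are assumptions made only for illustration; any dataset and model would do. The exact scores depend on the split, so no expected output is shown.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load features and labels (wine is used here only as a convenient example)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained decision tree can memorize the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on data the model has seen vs. data it has not seen
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The test accuracy is the more honest estimate of how the model would behave on new samples.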


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on! 

In [15]:

import numpy as np
from sklearn.model_selection import train_test_split

# In this lesson we will explore the train_test_split function
# Therefore we need nothing more than the function itself and NumPy

In [16]:
# Let's generate a new array 'a' which will contain all integers from 1 to 100
# The function np.arange works like the built-in 'range', with the difference that it returns an ndarray
a = np.arange(1,101)
a

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [17]:
# Similarly, let's create another ndarray 'b', which will contain integers from 501 to 600
# We have intentionally picked these numbers so we can easily compare the two
# The difference between any two corresponding elements of the two arrays is 500
b = np.arange(501,601)
b

array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591,
       592, 593, 594, 595, 596, 597, 598, 599, 600])

## Split the Data

You can use the `train_test_split()` function to split data into training and test sets.

Reference: [train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)


In [18]:
# Let's check out how this works
train_test_split(a)

[array([ 88,  11,  15,   2,  42,  30,  94,  43,  89,  52,  39,  74,  12,
         60,   8,  87,  67,  47,  16,  92,  78,  77,   4,  93,  61,  66,
         27,  25,  90,  98,  26,  13,  82,  62,  72,  48,  17,  46,  34,
         71,  33,  55,  24,  95,  45,   5,  40,  10,  32,  37,  19,  51,
         41,  91,  96,  97,  57,  49,  14,   6,  73,  21,  23,  50,  84,
         85,  79,  59,  31,  29,  99,  76,  63,  18, 100]),
 array([86,  9, 20, 64,  3, 38, 75, 70, 69, 65, 28,  1, 22, 36, 53, 81, 35,
         7, 56, 80, 68, 44, 58, 54, 83])]
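
The output above contains 75 and 25 elements because, with no size given, a quarter of the data goes to the test set. A short sketch that checks this and confirms no element is lost or duplicated (the names `a_train` and `a_test` are chosen here for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)

# With a single input array, the result is a list of two arrays: [train, test]
a_train, a_test = train_test_split(a)
print(len(a_train), len(a_test))

# Every element of 'a' lands in exactly one of the two subsets
recombined = np.sort(np.concatenate([a_train, a_test]))
print(np.array_equal(recombined, a))
```

Because no `random_state` is passed, the elements in each subset change from run to run, but the sizes do not.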

## Parameters of train_test_split

* **test_size**: The proportion (float between 0.0 and 1.0) or absolute number (int) of samples to be used for testing. The default is `None`: the test set is then the complement of `train_size`, or 0.25 if `train_size` is also `None`.
* **train_size**: The proportion (float) or absolute number (int) of samples to be used for training. The default is `None`, which means the complement of the test size.
* **random_state**: Controls the shuffling applied to the data before the split (an int, a `RandomState` instance, or `None`). Pass an int to get the same split on every run.
* **shuffle**: Whether to shuffle the data before splitting (default = True). If `shuffle=False`, `stratify` must be `None`.
* **stratify**: Ensures that the proportions of the specified data (usually the labels, `y`) are maintained in the split. For example, if the label set `y` consists of 25% 0's and 75% 1's in a binary classification task, setting `stratify=y` will keep the same proportions (25% 0's and 75% 1's) in both the training and test sets.
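
The parameters above can be sketched on the same 1 to 100 array. The `labels` array is hypothetical, built only to mirror the 25%/75% example, and the seed `365` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)

# random_state: the same seed reproduces the same split
first = train_test_split(a, test_size=0.2, random_state=365)
second = train_test_split(a, test_size=0.2, random_state=365)
same_split = all(np.array_equal(p, q) for p, q in zip(first, second))
print("reproducible:", same_split)

# shuffle=False: the first 80 elements go to train, the last 20 to test
ordered_train, ordered_test = train_test_split(a, test_size=0.2, shuffle=False)
print(ordered_train[0], ordered_train[-1], ordered_test[0], ordered_test[-1])

# stratify: hypothetical labels with 25% zeros and 75% ones
labels = np.array([0] * 25 + [1] * 75)
a_tr, a_te, l_tr, l_te = train_test_split(
    a, labels, test_size=0.2, random_state=365, stratify=labels
)
# The mean of a 0/1 array is the share of ones in it
print("share of ones:", l_tr.mean(), l_te.mean())
```

With stratification, both subsets keep the 75% share of ones; without it, the shares would vary with the seed.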


In [19]:
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=365)

In [20]:
# Let's check the shapes
# This shows how 'test_size' works: with test_size=0.2, 20 of the 100 elements go to the test set
a_train.shape, a_test.shape

((80,), (20,))

In [21]:
b_train.shape, b_test.shape

((80,), (20,))
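
Since `a` and `b` were built so that corresponding elements differ by 500, they also offer a quick way to check that `train_test_split` shuffles several arrays consistently. A minimal sketch, reusing the call from the cell above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
b = np.arange(501, 601)
a_train, a_test, b_train, b_test = train_test_split(
    a, b, test_size=0.2, random_state=365
)

# If the pairing is preserved, every difference is exactly 500
pairs_kept_train = bool(np.all(b_train - a_train == 500))
pairs_kept_test = bool(np.all(b_test - a_test == 500))
print(pairs_kept_train, pairs_kept_test)
```

This is the property that keeps each row of features matched with its label when you pass `X` and `y` together.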

In [22]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

In [23]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
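
The iris data is loaded above but not yet split. As a sketch, the same call applies to a feature matrix and a label vector; `test_size=0.2`, `random_state=365`, and `stratify=y` are choices made here for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Stratify on the labels so each of the three classes keeps its 1/3 share
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=365, stratify=y
)
print(X_train.shape, X_test.shape)

# Number of samples per class in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Because the classes are perfectly balanced (50 samples each), the stratified split puts 40 samples of each class in the training set and 10 in the test set.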

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Practice: Splitting the Wine Dataset

In this example, we will load the wine dataset and split it into training and testing sets.

```python
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the wine dataset
wine = load_wine()
X = wine.data  # Features
y = wine.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Display the shape of the resulting datasets
print("Training features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Test labels shape:", y_test.shape)
```

In [24]:
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data
y = wine.target

In [25]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Fl

In [26]:
print(X.shape, y.shape)

(178, 13) (178,)


In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)

(124, 13) (54, 13)


In [28]:
y_test.shape

(54,)