---
**Data Splitting**

Training a model is often not enough on its own. We also need to measure how well it performs, and for that we need data the model has not seen during training.

To solve this, we split our original dataset into two subsets:
- `Training dataset`: used to fit the model, i.e. to learn its weights.
- `Testing dataset`: held back and used to check whether the learned weights predict the results accurately enough on unseen data.
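
To make the idea concrete, here is a minimal sketch of a split done by hand in plain Python. The helper `manual_split` and its parameters `test_fraction` and `seed` are hypothetical names introduced only for illustration. It is not how scikit-learn implements splitting, just the same shuffle-then-cut idea.

```python
import random


def manual_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of `samples` and cut it into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    shuffled = list(samples)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, repeatable across runs
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(1, 21))  # 20 toy samples
train, test = manual_split(data)

print(len(train), len(test))  # 15 5
```

Every sample ends up in exactly one of the two lists, so the model is never tested on data it was trained on.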

For this purpose we use the `train_test_split` function from the `sklearn.model_selection` module. Its main parameters are:
- `X values`: values of the independent variables (features), one entry per sample
- `Y values`: values of the dependent/target variable, one entry per sample
- `test_size`: (default = 0.25) fraction of the data placed in the test dataset; an integer is treated as an absolute number of test samples
- `random_state`: (optional) seed that controls the shuffling; passing the same value reproduces the same split


In [9]:
from sklearn.model_selection import train_test_split

# Sample data (features and labels): 20 samples, so the default 25% test split holds out 5
X = [[i] for i in range(1, 21)]
y = [100 * i for i in range(1, 21)]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("Train:", X_train, y_train)
print("Test:", X_test, y_test)

Train: [[18], [7], [14], [5], [3], [6], [15], [10], [8], [17], [12], [4], [1], [16], [13]] [1800, 700, 1400, 500, 300, 600, 1500, 1000, 800, 1700, 1200, 400, 100, 1600, 1300]
Test: [[19], [2], [20], [9], [11]] [1900, 200, 2000, 900, 1100]
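
The effect of `test_size` and `random_state` can be checked directly. The sketch below reuses the same 20-sample toy data. The seed `42` and the variable names `split_a` and `split_b` are arbitrary choices made for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 samples
X = [[i] for i in range(1, 21)]
y = [100 * i for i in range(1, 21)]

# Same random_state -> exactly the same split on every call
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
print(split_a == split_b)  # True

# test_size as a fraction: 20% of 20 samples = 4 test samples
X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))  # 16 4

# test_size as an integer: an absolute number of test samples
X_train5, X_test5, y_train5, y_test5 = train_test_split(
    X, y, test_size=5, random_state=42
)
print(len(X_train5), len(X_test5))  # 15 5
```

Fixing `random_state` is mainly useful for making experiments repeatable. Leaving it unset gives a different shuffle on each run.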


In the next example:
- we first split the iris dataset into train and test datasets, then
- we train a logistic regression model on the train dataset, then
- we test the model's accuracy using the test dataset

_Note: Accuracy is the fraction of predictions that are correct, i.e. the number of correct predictions divided by the total number of predictions made._

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load sample data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train model
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# Test model
print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Accuracy: 0.9778
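
The accuracy figure can also be worked out by hand. The sketch below uses small made-up label lists, not the iris predictions, and a hypothetical helper named `accuracy` to show the calculation: count the matching predictions and divide by the total.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration: 4 of the 5 predictions are correct
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print(accuracy(y_true, y_pred))  # 0.8
```

By the same arithmetic, the iris dataset has 150 samples, so `test_size=0.3` holds out 45 of them, and the reported 0.9778 is consistent with 44 of those 45 test samples being classified correctly.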
