
Q: Why set a value for "random_state"?

A: It ensures that a "random" process produces the same results every time it runs, which makes your code reproducible (by you and others!)

See example 👇

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

In [2]:
cols = ['Fare', 'Embarked', 'Sex']
X = df[cols]
y = df['Survived']

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X

|   | Fare | Embarked | Sex |
|---|------|----------|-----|
| 0 | 7.25 | S | male |
| 1 | 71.2833 | C | female |
| 2 | 7.925 | S | female |
| 3 | 53.1 | S | female |
| 4 | 8.05 | S | male |
| 5 | 8.4583 | Q | male |


In [5]:
# any non-negative integer (0 included) can be used for the random_state value
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_train

|   | Fare | Embarked | Sex |
|---|------|----------|-----|
| 0 | 7.25 | S | male |
| 3 | 53.1 | S | female |
| 5 | 8.4583 | Q | male |


In [6]:
# using the SAME random_state value results in the SAME random split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_train

|   | Fare | Embarked | Sex |
|---|------|----------|-----|
| 0 | 7.25 | S | male |
| 3 | 53.1 | S | female |
| 5 | 8.4583 | Q | male |


In [7]:
# using a DIFFERENT random_state value results in a DIFFERENT random split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)
X_train

|   | Fare | Embarked | Sex |
|---|------|----------|-----|
| 2 | 7.925 | S | female |
| 5 | 8.4583 | Q | male |
| 0 | 7.25 | S | male |

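If you want to check this yourself without downloading the dataset, here is a minimal sketch. It rebuilds the six rows of `X` shown above by hand and compares the row labels that land in the training set. The helper name `train_rows` is made up for this illustration, and only `X` is split (no `y`) to keep it short.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The six rows of X shown above, rebuilt inline so no download is needed
X = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
})

def train_rows(seed):
    """Hypothetical helper: row labels that end up in the training set for a given seed."""
    X_train, X_test = train_test_split(X, test_size=0.5, random_state=seed)
    return list(X_train.index)

# Same seed twice: both calls select the same rows in the same order
same_seed_matches = train_rows(1) == train_rows(1)

# random_state=None is the default: the split is not pinned to a seed,
# so these row labels may change from one run to the next
unseeded_rows = train_rows(None)

print(same_seed_matches)
print(unseeded_rows)
```

The first `print` should show `True`. The second one is the reason to set `random_state`: without a fixed value, nothing guarantees you (or anyone else) will see the same three rows again.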

### Want more tips? [View all tips on GitHub](https://github.com/justmarkham/scikit-learn-tips) or [Sign up to receive 2 tips by email every week](https://scikit-learn.tips) 💌

© 2020 [Data School](https://www.dataschool.io). All rights reserved.