# How to Sample Data in Python

## Learning Objectives
To get an unbiased assessment of how a supervised machine learning model performs, we must evaluate it on data that it did not see during training. To do this, we split the data into a training subset and a test subset before building the model. A common way to create these non-overlapping subsets is to use one of several **sampling** approaches. By the end of this tutorial, you will have learned:

+ how to split data using simple random sampling
+ how to split data using stratified random sampling
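
As a minimal sketch of what "non-overlapping subsets" means in practice, the example below splits a small synthetic DataFrame and checks that no row lands in both subsets. The frame `toy` and its columns `feature` and `target` are hypothetical stand-ins, not part of the tutorial's dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({"feature": range(20), "target": [i * 2.0 for i in range(20)]})

# Simple random split; random_state is fixed only to make the sketch repeatable.
train, test = train_test_split(toy, test_size=0.25, random_state=0)

# The two subsets share no rows and together cover the original data.
overlap = set(train.index) & set(test.index)
print(len(train), len(test), len(overlap))
```

Because the model is scored only on rows in `test`, an empty `overlap` is what keeps the assessment unbiased.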

In [1]:
import pandas as pd
vehicles = pd.read_csv("vehicles.csv")
vehicles

,citympg,cylinders,displacement,drive,highwaympg,make,model,class,year,transmissiontype,transmissionspeeds,co2emissions
0,14.0,6,4.1,2-Wheel Drive,19.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,555.437500
1,14.0,8,5.0,2-Wheel Drive,20.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,555.437500
2,18.0,8,5.7,2-Wheel Drive,26.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,484.761905
3,21.0,6,4.3,Rear-Wheel Drive,31.0,Cadillac,Fleetwood/DeVille (FWD),Large Cars,1984,Automatic,4,424.166667
4,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4,555.437500
5,18.0,8,5.7,Rear-Wheel Drive,26.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4,484.761905
6,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4,555.437500
7,18.0,4,2.4,2-Wheel Drive,21.0,Nissan,Pickup 2WD,Pickup,1984,Automatic,3,467.736842
8,18.0,4,2.2,2-Wheel Drive,24.0,Dodge,Rampage Pickup 2WD,Pickup,1984,Automatic,3,423.190476
9,20.0,4,2.0,2-Wheel Drive,21.0,Dodge,Ram 50 Pickup 2WD,Pickup,1984,Automatic,3,444.350000


In [12]:
response = 'co2emissions'
y = vehicles[[response]]
y.head()

,co2emissions
0,555.4375
1,555.4375
2,484.761905
3,424.166667
4,555.4375


In [14]:
predictors = list(vehicles.columns)
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds',
 'co2emissions']

In [18]:
# Guard the removal so that re-running this cell does not raise a ValueError.
if response in predictors:
    predictors.remove(response)

In [20]:
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds']

In [21]:
x = vehicles[predictors]
x.head()

,citympg,cylinders,displacement,drive,highwaympg,make,model,class,year,transmissiontype,transmissionspeeds
0,14.0,6,4.1,2-Wheel Drive,19.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
1,14.0,8,5.0,2-Wheel Drive,20.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
2,18.0,8,5.7,2-Wheel Drive,26.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
3,21.0,6,4.3,Rear-Wheel Drive,31.0,Cadillac,Fleetwood/DeVille (FWD),Large Cars,1984,Automatic,4
4,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4
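
The predictor and response frames above are built by editing a list of column names. As an alternative sketch, `DataFrame.drop(columns=...)` produces the same separation in one step and can be re-run safely. The frame `demo` below is a hypothetical three-column miniature built from a few of the values shown above, not the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the vehicles data (three columns only).
demo = pd.DataFrame({
    "citympg": [14.0, 18.0, 21.0],
    "cylinders": [6, 8, 6],
    "co2emissions": [555.4375, 484.761905, 424.166667],
})

response = "co2emissions"

# Double brackets keep y as a one-column DataFrame rather than a Series.
y_demo = demo[[response]]

# drop(columns=...) returns a new frame and leaves demo unchanged,
# so running this line repeatedly does not raise an error.
x_demo = demo.drop(columns=[response])

print(list(x_demo.columns))
print(y_demo.shape)
```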


## How to split data using Simple Random Sampling

In [22]:
from sklearn.model_selection import train_test_split

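In [25]:
# With no test_size argument, train_test_split holds out 25% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y)

In [31]:
# Hold out 40% of the rows instead. The shapes shown below come from this split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4)
x_test.shape

(14792, 11)

In [32]:
x_test.shape

(14792, 11)

In [33]:
y_test.shape

(14792, 1)

In [34]:
x_train.shape

(22187, 11)

In [35]:
y_train.shape

(22187, 1)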
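
The subset sizes follow directly from `test_size`. The sketch below reproduces the arithmetic on a stand-in array with 36,979 entries, the total implied by the shapes above (22187 + 14792). The names `rows` and `sizes` are hypothetical, and the rounding rule is stated here as an assumption: the test count is taken to be the row count times `test_size`, rounded up.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 36979            # 22187 + 14792, the total implied by the shapes above
rows = np.arange(n_rows)  # stand-in for the real rows

sizes = {}
for test_size in (0.25, 0.4):
    train_rows, test_rows = train_test_split(rows, test_size=test_size, random_state=0)
    sizes[test_size] = (len(train_rows), len(test_rows))
    # Compare the observed test count with the rounded-up product.
    print(test_size, sizes[test_size], math.ceil(test_size * n_rows))
```

A `test_size` of 0.4 gives the 22187 and 14792 rows reported above, while 0.25 (the value used when `test_size` is omitted) gives a larger training set and a smaller test set.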

## How to split data using Stratified Random Sampling

In [36]:
# Baseline: a simple random split with no stratification, holding out 1% of the rows.
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size = 0.01,
                                                    random_state = 1234)
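
The `random_state` argument seeds the shuffle, which makes a split repeatable. A minimal sketch on a hypothetical array `rows` (not the tutorial's data) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # hypothetical stand-in for row labels

# Same seed, same split.
_, test_a = train_test_split(rows, test_size=0.1, random_state=1234)
_, test_b = train_test_split(rows, test_size=0.1, random_state=1234)
same_seed_matches = np.array_equal(test_a, test_b)

# A different seed will generally select different rows.
_, test_c = train_test_split(rows, test_size=0.1, random_state=4321)
different_seed_matches = np.array_equal(test_a, test_c)

print(same_seed_matches, different_seed_matches)
```

Fixing the seed is what allows the stratified and unstratified splits in this section to be compared on equal footing.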

In [38]:
# Proportion of each drive type in the full data.
x['drive'].value_counts(normalize = True)

Rear-Wheel Drive     0.356797
Front-Wheel Drive    0.353552
All-Wheel Drive      0.239893
4-Wheel Drive        0.036480
2-Wheel Drive        0.013278
Name: drive, dtype: float64

In [39]:
# Proportion of each drive type in the unstratified test set.
x_test['drive'].value_counts(normalize = True)

Front-Wheel Drive    0.364865
Rear-Wheel Drive     0.332432
All-Wheel Drive      0.248649
4-Wheel Drive        0.035135
2-Wheel Drive        0.018919
Name: drive, dtype: float64
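
To see how far the simple random sample drifts from the full data, the two printed distributions can be compared directly. The sketch below re-enters the proportions shown above by hand. The names `population`, `unstratified_test`, and `gap` are hypothetical.

```python
import pandas as pd

# Proportions copied from the value_counts output for the full data.
population = pd.Series({
    "Rear-Wheel Drive": 0.356797,
    "Front-Wheel Drive": 0.353552,
    "All-Wheel Drive": 0.239893,
    "4-Wheel Drive": 0.036480,
    "2-Wheel Drive": 0.013278,
})

# Proportions copied from the value_counts output for the unstratified test set.
unstratified_test = pd.Series({
    "Front-Wheel Drive": 0.364865,
    "Rear-Wheel Drive": 0.332432,
    "All-Wheel Drive": 0.248649,
    "4-Wheel Drive": 0.035135,
    "2-Wheel Drive": 0.018919,
})

# Series subtraction aligns on the index labels, so the differing row order is harmless.
gap = (unstratified_test - population).abs().sort_values(ascending=False)
print(gap)
print("largest gap:", gap.index[0])
```

The largest difference, roughly 2.4 percentage points, is for Rear-Wheel Drive, which also loses its place as the most common category in the sample.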

In [45]:
# Stratify on 'drive' so the test set keeps the drive-type proportions of the full data.
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size = 0.01,
                                                    random_state = 1234,
                                                    stratify = x['drive'])

In [46]:
# Proportion of each drive type in the stratified test set.
x_test['drive'].value_counts(normalize = True)

Rear-Wheel Drive     0.356757
Front-Wheel Drive    0.354054
All-Wheel Drive      0.240541
4-Wheel Drive        0.035135
2-Wheel Drive        0.013514
Name: drive, dtype: float64
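
The same effect can be reproduced on a small synthetic frame where the class balance is known in advance. In this sketch the frame `toy` and its columns `group` and `value` are hypothetical. With a 90/10 class mix and a 10% test set, stratification should place 90 "common" rows and 10 "rare" rows in the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced column: 90% "common", 10% "rare".
toy = pd.DataFrame({
    "group": ["common"] * 900 + ["rare"] * 100,
    "value": range(1000),
})

# stratify takes the labels whose proportions should be preserved in both subsets.
train, test = train_test_split(
    toy, test_size=0.1, random_state=1234, stratify=toy["group"]
)

print(toy["group"].value_counts(normalize=True))
print(test["group"].value_counts(normalize=True))
```

Stratification matters most for small test sets and rare categories, where a simple random sample can easily over- or under-represent a group.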