# How to Sample Data in Python

## Learning Objectives
To get an unbiased assessment of the performance of a supervised machine learning model, we need to evaluate it on data that it did not encounter during training. To accomplish this, we split our data into a training subset and a test subset before building the model. One common way to do this is to create non-overlapping subsets of the original data using one of several **sampling** approaches. By the end of the tutorial, you will have learned:

+ how to split data using simple random sampling
+ how to split data using stratified random sampling

In [1]:
import pandas as pd

In [2]:
vehicles = pd.read_csv('./data/vehicles.csv')
vehicles.head()

| | citympg | cylinders | displacement | drive | highwaympg | make | model | class | year | transmissiontype | transmissionspeeds | co2emissions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 6 | 4.1 | 2-Wheel Drive | 19.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 | 555.4375 |
| 1 | 14.0 | 8 | 5.0 | 2-Wheel Drive | 20.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 | 555.4375 |
| 2 | 18.0 | 8 | 5.7 | 2-Wheel Drive | 26.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 | 484.761905 |
| 3 | 21.0 | 6 | 4.3 | Rear-Wheel Drive | 31.0 | Cadillac | Fleetwood/DeVille (FWD) | Large Cars | 1984 | Automatic | 4 | 424.166667 |
| 4 | 14.0 | 8 | 4.1 | Rear-Wheel Drive | 19.0 | Cadillac | Brougham/DeVille (RWD) | Large Cars | 1984 | Automatic | 4 | 555.4375 |


### Split X and y

In [6]:
# Separate the label (y) from the predictors (X)
label = 'co2emissions'
y = vehicles[label]
y

0        555.437500
1        555.437500
2        484.761905
3        424.166667
4        555.437500
            ...    
36974    442.000000
36975    466.000000
36976    503.000000
36977    661.000000
36978    546.000000
Name: co2emissions, Length: 36979, dtype: float64

In [9]:
predictors = list(vehicles.columns)

# remove the label column (co2emissions) so that only the predictors remain
predictors.remove(label)
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds']

In [11]:
X = vehicles[predictors]
X.head()

| | citympg | cylinders | displacement | drive | highwaympg | make | model | class | year | transmissiontype | transmissionspeeds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 6 | 4.1 | 2-Wheel Drive | 19.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 |
| 1 | 14.0 | 8 | 5.0 | 2-Wheel Drive | 20.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 |
| 2 | 18.0 | 8 | 5.7 | 2-Wheel Drive | 26.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 |
| 3 | 21.0 | 6 | 4.3 | Rear-Wheel Drive | 31.0 | Cadillac | Fleetwood/DeVille (FWD) | Large Cars | 1984 | Automatic | 4 |
| 4 | 14.0 | 8 | 4.1 | Rear-Wheel Drive | 19.0 | Cadillac | Brougham/DeVille (RWD) | Large Cars | 1984 | Automatic | 4 |


## How to split data using Simple Random Sampling

In [12]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y)  # default split: 25% test, 75% train

In [18]:
X_train.shape, y_train.shape

((27734, 11), (27734,))

In [19]:
X_test.shape, y_test.shape

((9245, 11), (9245,))

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)  # hold out 40% of the rows for testing
X_test.shape

(14792, 11)
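
The test-set sizes above follow from the row count: with 36,979 rows, a test fraction of 0.25 (the default when neither `test_size` nor `train_size` is given) gives 9,245 rows once rounded up, and a fraction of 0.4 gives 14,792. The sketch below reproduces that arithmetic with a shuffle-then-slice split in plain Python. The helper `simple_random_split` is a hypothetical name used for illustration only and is not scikit-learn's implementation; rounding the test count up is an assumption, chosen because it matches the shapes shown above.

```python
import math
import random


def simple_random_split(n_rows, test_size=0.25, seed=None):
    """Return (train_idx, test_idx): non-overlapping row positions chosen at random."""
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test count up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # seeded shuffle, so the split is repeatable
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_random_split(36979, test_size=0.25, seed=42)
print(len(train_idx), len(test_idx))  # 27734 9245

train_idx, test_idx = simple_random_split(36979, test_size=0.4, seed=42)
print(len(train_idx), len(test_idx))  # 22187 14792
```

Passing the same `seed` returns the same split each time, which plays the same role as `random_state` in `train_test_split`.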

## How to split data using Stratified Random Sampling

In [21]:
# Baseline for comparison: a simple random split (no stratification) with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=42)

In [24]:
X['drive'].value_counts(normalize=True) * 100  # share of each drive type in the full dataset, in percent

Rear-Wheel Drive     35.679710
Front-Wheel Drive    35.355202
All-Wheel Drive      23.989291
4-Wheel Drive         3.648016
2-Wheel Drive         1.327781
Name: drive, dtype: float64

In [27]:
X_test['drive'].value_counts(normalize=True) * 100  # the shares differ slightly from those in the full dataset

Rear-Wheel Drive     35.776095
Front-Wheel Drive    35.127096
All-Wheel Drive      24.094105
4-Wheel Drive         3.866955
2-Wheel Drive         1.135749
Name: drive, dtype: float64
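
To put a number on that gap, one option is to compute the share of each category in both sets and subtract them. The sketch below does this in plain Python on made-up labels. `category_shares` is a hypothetical helper, and the toy lists only stand in for `X['drive']` and `X_test['drive']`; they are not taken from the vehicles data.

```python
from collections import Counter


def category_shares(labels):
    """Return {category: percentage of rows} for a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {category: 100 * count / total for category, count in counts.items()}


# Toy data (made up for illustration, not the vehicles dataset)
full = ["RWD"] * 50 + ["FWD"] * 40 + ["AWD"] * 10
test = ["RWD"] * 6 + ["FWD"] * 3 + ["AWD"] * 1

full_shares = category_shares(full)
test_shares = category_shares(test)

# Difference in percentage points: test share minus full-dataset share
for category in full_shares:
    gap = test_shares.get(category, 0.0) - full_shares[category]
    print(f"{category}: {gap:+.1f}")
```

A gap close to zero for every category means the test set mirrors the full dataset well on that column.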

### Stratify using a specific column
+ We want the test data to follow the same distribution as the original data, so we pass the column to stratify on through the `stratify` parameter.

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=42,
                                                    stratify=X['drive'])

In [30]:
X_test['drive'].value_counts(normalize=True) * 100

Rear-Wheel Drive     35.694970
Front-Wheel Drive    35.343429
All-Wheel Drive      23.985938
4-Wheel Drive         3.650622
2-Wheel Drive         1.325041
Name: drive, dtype: float64

Now we can see that the test set closely mimics the original distribution of the `drive` column.
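
The idea behind stratified sampling can be sketched in a few lines of plain Python: group the rows by category, then draw the test fraction from each group separately. This is only an illustration under stated assumptions. `stratified_split` is a hypothetical helper, the labels are made up, and rounding the count within each category is a simplification that need not match how scikit-learn allocates rows.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.1, seed=None):
    """Split row positions so each category keeps roughly its original share."""
    rng = random.Random(seed)

    # Group row positions by their category (the "strata")
    by_category = defaultdict(list)
    for position, label in enumerate(labels):
        by_category[label].append(position)

    train_idx, test_idx = [], []
    for positions in by_category.values():
        rng.shuffle(positions)
        n_test = round(len(positions) * test_size)  # assumption: round per category
        test_idx.extend(positions[:n_test])
        train_idx.extend(positions[n_test:])
    return train_idx, test_idx


# Toy labels (made up): 60% "RWD", 30% "FWD", 10% "AWD"
labels = ["RWD"] * 600 + ["FWD"] * 300 + ["AWD"] * 100
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=42)

print(Counter(labels[i] for i in test_idx))
```

With these toy labels the test set holds 60, 30, and 10 rows of the three categories, so it keeps the 60/30/10 split of the full list regardless of the seed.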