# Data Sampling in Python

### Learning Objectives
To get an unbiased assessment of the performance of a supervised machine learning model, we need to evaluate it on data that it did not encounter during training (unseen data). To accomplish this, we must split our dataset into two subsets, a training set and a test set, before the model-building stage. One common way to create these subsets is to draw non-overlapping subsets of the original dataset using one of several **sampling** methods. By the end of this tutorial you will learn:
* how to split data using __random sampling__
* how to split data using __stratified sampling__

In [2]:
import pandas as pd
df = pd.read_csv('../data/customer_loan_data.csv')
print(df.head())

           customer     amount grade    purpose    loanDate hasOtherLoans  \
0    Richard Joseph  131,958 $     C     Travel  2022-01-12           Yes   
1     Dorothy Ready  156,867 $     C  Education  2023-08-11           Yes   
2        Mark Kirby  141,932 $     B        Car  2022-10-06            No   
3       Ronald Wade  113,694 $     B     Travel  2022-07-19            No   
4  Natasha Salyards  129,879 $     A   Personal  2021-10-02            No   

  decision  
0       No  
1      Yes  
2      Yes  
3      Yes  
4      Yes  


In [3]:
response = 'decision'
y = df[[response]]
y.head()

  decision
0       No
1      Yes
2      Yes
3      Yes
4      Yes


In [None]:
# Build the list of predictor columns: every column except the response
predictors = list(df.columns)
predictors.remove(response)
X = df[predictors]
X.head()

In [5]:
# More concise alternative: drop the response column directly
X = df.drop(columns=[response])
X.head()

           customer     amount grade    purpose    loanDate hasOtherLoans
0    Richard Joseph  131,958 $     C     Travel  2022-01-12           Yes
1     Dorothy Ready  156,867 $     C  Education  2023-08-11           Yes
2        Mark Kirby  141,932 $     B        Car  2022-10-06            No
3       Ronald Wade  113,694 $     B     Travel  2022-07-19            No
4  Natasha Salyards  129,879 $     A   Personal  2021-10-02            No


## Random Sampling without Replacement

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# With no test_size given, train_test_split holds out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [13]:
X_train.shape

(750, 6)

In [14]:
y_train.shape


(750, 1)

In [15]:
X_test.shape

(250, 6)

In [16]:
y_test.shape

(250, 1)

In [17]:
# Hold out 40% of the rows for testing instead of the default 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test.shape, y_test.shape

((400, 6), (400, 1))
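
The shapes above follow from how `train_test_split` samples: rows are drawn without replacement, so no row lands in both subsets. The following is a minimal sketch that checks this on a small synthetic DataFrame. The name `toy_df` and its columns are hypothetical stand-ins, not the loan data. It also shows that fixing `random_state` makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the loan dataset (100 rows)
toy_df = pd.DataFrame({
    "feature": range(100),
    "label": ["Yes", "No"] * 50,
})
toy_X = toy_df[["feature"]]
toy_y = toy_df[["label"]]

# Default test_size (0.25) with a fixed seed, run twice
Xa_train, Xa_test, ya_train, ya_test = train_test_split(toy_X, toy_y, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(toy_X, toy_y, random_state=0)

# Sampling is without replacement: no row index appears in both subsets
overlap = set(Xa_train.index) & set(Xa_test.index)
same_split = Xa_test.index.equals(Xb_test.index)

print(len(Xa_train), len(Xa_test))  # 75 25
print(len(overlap))                 # 0
print(same_split)                   # True
```

Without `random_state`, each call would generally produce a different partition, which is why the cells above would not yield identical rows if rerun.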

## Stratified Random Sampling

In [19]:
# Hold out only 1% of the rows (10 of 1,000), with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1234)

In [20]:
X['purpose'].value_counts(normalize=True)

Personal     0.182
Home         0.182
Education    0.172
Travel       0.168
Business     0.156
Car          0.140
Name: purpose, dtype: float64

In [21]:
X_test['purpose'].value_counts(normalize=True)

Home        0.4
Business    0.2
Travel      0.2
Personal    0.1
Car         0.1
Name: purpose, dtype: float64
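
In the output above, `Education` is missing from the test set entirely, even though it makes up 17.2% of the full data. One way to make this kind of mismatch easier to spot is to put the two distributions side by side. The sketch below uses a hypothetical helper, `compare_proportions`, and made-up sample values. It is not part of pandas or scikit-learn.

```python
import pandas as pd

def compare_proportions(full: pd.Series, sample: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: category shares in the full data vs. a sample."""
    table = pd.DataFrame({
        "full": full.value_counts(normalize=True),
        "sample": sample.value_counts(normalize=True),
    }).fillna(0.0)  # categories absent from the sample get a share of 0
    table["difference"] = table["sample"] - table["full"]
    return table

# Made-up values for illustration only
full = pd.Series(["Home"] * 5 + ["Car"] * 3 + ["Travel"] * 2)
sample = pd.Series(["Home", "Home", "Car", "Home"])

result = compare_proportions(full, sample)
print(result)
```

With the notebook's variables, the same helper could be called as `compare_proportions(X['purpose'], X_test['purpose'])`.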

In [24]:
# Same split as before, but stratified so the test set mirrors the mix of loan purposes
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.01,
                                                    random_state=1234,
                                                    stratify=X['purpose'])

In [25]:
X_test['purpose'].value_counts(normalize=True)

Personal     0.2
Travel       0.2
Home         0.2
Education    0.2
Business     0.1
Car          0.1
Name: purpose, dtype: float64
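
With only 10 test rows, the stratified shares above can only approximate the full-data shares, since each category must receive a whole number of rows. The effect of `stratify` is easier to see when the numbers divide evenly. The following is a minimal sketch on hypothetical data (`toy`, with an 80/20 mix of two groups), not the loan dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of group "A", 20 rows of group "B"
toy = pd.DataFrame({
    "value": range(100),
    "group": ["A"] * 80 + ["B"] * 20,
})

# Stratify on 'group' so both subsets keep the 80/20 mix
train, test = train_test_split(
    toy, test_size=0.2, random_state=1234, stratify=toy["group"]
)

test_counts = test["group"].value_counts().to_dict()
train_counts = train["group"].value_counts().to_dict()
print(test_counts)   # {'A': 16, 'B': 4}
print(train_counts)  # {'A': 64, 'B': 16}
```

The same argument accepts any column with the desired class labels. Stratifying on the response (here, `y`) is a common choice when the classes to be predicted are imbalanced.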