# Function To Ensure Test Set Is Representative Using Random Forests

This function checks whether a test set is representative of the train data. It labels each row as train or test, then uses cross-validation to measure how well a random forest model can tell the two apart.

In [50]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def datarep(X_train, X_test):
    """
    Use cross-validation and a random forest model to assess whether
    the test set is representative of the train data.

    Returns the ROC AUC score of each of the five folds.
    """
    #Label each row with its origin (1 = train, 0 = test)
    #assign() returns a copy, so the caller's dataframes are not modified
    train = X_train.assign(Train_or_Test=1)
    test = X_test.assign(Train_or_Test=0)

    #Concatenate both sets
    concat_df = pd.concat([train, test], axis=0, ignore_index=True)

    #Separate the new target variable from the features
    y = concat_df.pop('Train_or_Test')

    #Instantiate the classifier
    random_forest = RandomForestClassifier(n_estimators=500, random_state=42)

    #Implement cross-validation and return the result
    #cross_val_score fits a fresh clone of the model on each fold,
    #so no separate fit() call is needed
    #If each ROC AUC score is near 0.5, the model cannot tell train rows
    #from test rows, which means the test set is representative
    cfrep = cross_val_score(random_forest, concat_df, y, cv=5, scoring='roc_auc')
    return cfrep

## Example:

Let's start by loading the dataset.

In [51]:
from sklearn.datasets import load_digits

#Load dataset
X, y = load_digits(return_X_y=True)

print("X Shape :", X.shape)

X Shape : (1797, 64)


Now let's split the dataset into train and test sets.

In [52]:
from sklearn.model_selection import train_test_split
import pandas as pd

#Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

#Convert numpy arrays into dataframes
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

print("X_train shape :", X_train.shape)
print("X_test shape :", X_test.shape)

X_train shape : (1437, 64)
X_test shape : (360, 64)


We can now pass X_train and X_test to the function to check whether the test set is representative.

In [53]:
datarep(X_train, X_test)

array([0.5382909 , 0.5289593 , 0.55299071, 0.56414537, 0.4571235 ])

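The function returns one ROC AUC score per cross-validation fold. A score near 0.5 means the random forest does no better than chance at telling train rows from test rows, so the test set is representative. Scores well above 0.5 would indicate that the two sets differ. Here the five scores range from about 0.46 to 0.56, so the split looks representative.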
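For contrast, it can help to see what the scores look like when the test set is not representative. The sketch below is an illustration rather than part of the original notebook. It builds a deliberately biased split in which the "test" rows are the 20% of digit images with the most ink, then runs the same train-versus-test check. The helper `looks_representative` and its tolerance of 0.1 are hypothetical choices made for this example, not an established threshold, and `n_estimators` is lowered to 100 only to keep the run short.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def looks_representative(scores, tol=0.1):
    """Hypothetical rule of thumb: every fold's AUC lies within `tol` of 0.5."""
    return bool(np.all(np.abs(np.asarray(scores) - 0.5) <= tol))


# Scores reported above for the random 80/20 split
reported_scores = [0.5382909, 0.5289593, 0.55299071, 0.56414537, 0.4571235]

# Deliberately biased split (an assumption made for illustration):
# the "test" rows are the 20% of images with the highest total pixel intensity
X, _ = load_digits(return_X_y=True)
X = pd.DataFrame(X)
ink = X.sum(axis=1)
is_test = ink >= ink.quantile(0.80)

# Same labelling as in datarep: 1 = train, 0 = test
y_origin = (~is_test).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
biased_scores = cross_val_score(clf, X, y_origin, cv=5, scoring="roc_auc")

print("Random split representative?", looks_representative(reported_scores))
print("Biased split scores        :", biased_scores)
print("Biased split representative?", looks_representative(biased_scores))
```

Because the biased test rows differ systematically from the train rows, the classifier should separate them easily and the scores should sit well above 0.5, whereas the scores from the random split all stay close to it.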
