# Stratified Random Sampling Using Python and Pandas

## How to stratify sample data to match population data in order to improve the performance of Machine Learning algorithms

![charles-deluvio-pjAH2Ax4uWk-unsplash.jpg](attachment:charles-deluvio-pjAH2Ax4uWk-unsplash.jpg)
Photo by <a href="https://unsplash.com/@charlesdeluvio?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Charles Deluvio</a> on <a href="https://unsplash.com/s/photos/computer-code?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

## Introduction

Sometimes the sample data that data scientists are given does not fit what we know about the wider population. For example, let's assume that the data science team was given survey data and noticed that the survey respondents were 60% male and 40% female.

In the real world the UK general population is closer to 49.4% male and 50.6% female (source: https://tinyurl.com/43hpe5e4) and certainly not 60% / 40%.

There could be many explanations for our 60% male sample data. One possibility is that the data collection method might have been flawed. Perhaps the marketing team accidentally hit more males with their marketing campaign causing an imbalance.

If we can establish that the sample data should better reflect the population then we can "stratify" the data. This will involve resampling the sample data so that the proportions match the population (see https://www.investopedia.com/ask/answers/041615/what-are-advantages-and-disadvantages-stratified-random-sampling.asp for more information).

To make matters more complex, it might be that there are multiple feature columns involved. The example in this article shows a combination of two factors as follows -

- Male undergraduates = 45% of the population
- Female undergraduates = 20% of the population
- Male graduate students = 20% of the population
- Female graduate students = 15% of the population

If our sample data has 70% male undergraduates it will not represent the population.

This can cause problems for Machine Learning algorithms further down the line. If we train our model on sample data that has the wrong proportions, the model is likely to be over-fitted to the training data. It is then likely to underperform when we run it against real-world or testing data that has the right proportions.

This example shows how to resample the sample data so that it reflects the population, which has the potential to improve the accuracy of your machine learning models.

## Getting Started

Let's start by importing the required libraries and reading in some data that was downloaded from https://www.kaggle.com/c/credit-default-prediction-ai-big-data/overview

In [1]:
import pandas as pd
import numpy as np

df_credit = pd.read_csv("data/train.csv")
df_credit

Unnamed: 0,Id,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,0,Own Home,482087.0,,0.0,11.0,26.3,685960.0,1.0,,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,7495,Rent,402192.0,< 1 year,0.0,3.0,8.5,107866.0,0.0,,0.0,other,Short Term,129360.0,73492.0,1900.0,697.0,0
7496,7496,Home Mortgage,1533984.0,1 year,0.0,10.0,26.5,686312.0,0.0,43.0,0.0,debt consolidation,Long Term,444048.0,456399.0,12783.0,7410.0,1
7497,7497,Rent,1878910.0,6 years,0.0,12.0,32.1,1778920.0,0.0,,0.0,buy a car,Short Term,99999999.0,477812.0,12479.0,748.0,0
7498,7498,Home Mortgage,,,0.0,21.0,26.5,1141250.0,0.0,,0.0,debt consolidation,Short Term,615274.0,476064.0,37118.0,,0


## Setting Up The Test Data
To make the example meaningful, I am going to simplify the "Home Ownership" feature so that it keeps only its two most common values, add a new feature called "Gender" with ~60% "Male" and ~40% "Female", and then take a quick look at the results ...

In [2]:
df_credit['Gender'] = np.random.choice(['Male', 'Female'], size=len(df_credit), p=[0.6, 0.4])

ownership_filter = df_credit['Home Ownership'].isin(['Home Mortgage', 'Rent'])
df_credit = df_credit.drop(df_credit[~ownership_filter].index)

(df_credit['Home Ownership'].value_counts() / len(df_credit)).sort_values(ascending=False), (df_credit['Gender'].value_counts() / len(df_credit)).sort_values(ascending=False)

(Home Mortgage    0.531647
 Rent             0.468353
 Name: Home Ownership, dtype: float64,
 Male      0.601813
 Female    0.398187
 Name: Gender, dtype: float64)
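
The `np.random.choice` call above is not seeded, so the proportions will differ slightly from run to run. As a minimal sketch of how the synthetic column could be made reproducible, the block below uses NumPy's `default_rng` with a fixed seed. The `df_demo` frame and the seed value are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_credit; only the row count matters here
df_demo = pd.DataFrame({"Id": range(1000)})

# A seeded generator makes the synthetic column reproducible between runs
rng = np.random.default_rng(seed=42)
df_demo["Gender"] = rng.choice(["Male", "Female"], size=len(df_demo), p=[0.6, 0.4])

# normalize=True returns proportions rather than raw counts
gender_proportions = df_demo["Gender"].value_counts(normalize=True)
print(gender_proportions)
```

The proportions will be close to 60% / 40% rather than exact, because each row is drawn independently.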

## Preparing to Stratify
In our example we want to resample the sample data to reflect the correct proportions of Gender and Home Ownership.

The first thing we need to do is to create a single feature that contains all of the data we want to stratify on as follows ...

In [3]:
df_credit['Stratify'] = df_credit['Gender'] + ", " + df_credit['Home Ownership']
(df_credit['Stratify'].value_counts() / len(df_credit)).sort_values(ascending=False)

Male, Home Mortgage      0.321737
Male, Rent               0.280076
Female, Home Mortgage    0.209911
Female, Rent             0.188277
Name: Stratify, dtype: float64
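
As a cross-check on the combined column, the joint proportions of two features can also be inspected with `pd.crosstab`. The sketch below uses a small hypothetical `df_demo` frame (not the credit data) and shows that the crosstab and the combined `Stratify` column report the same proportions.

```python
import pandas as pd

# Hypothetical toy data standing in for df_credit
df_demo = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "Home Ownership": ["Rent", "Home Mortgage", "Home Mortgage", "Rent",
                       "Home Mortgage", "Rent", "Rent", "Home Mortgage"],
})

# Joint proportions of the two features, normalised over the whole table
joint = pd.crosstab(df_demo["Gender"], df_demo["Home Ownership"], normalize=True)
print(joint)

# The same proportions, keyed the way the combined 'Stratify' column is
df_demo["Stratify"] = df_demo["Gender"] + ", " + df_demo["Home Ownership"]
combined = df_demo["Stratify"].value_counts(normalize=True)
print(combined)
```

The combined column is still needed for the resampling step, since the function in this article filters on a single column.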

So there we have it: a set of proportions in our sample data that we intend to use to train our model. However, when we check with our marketing team, they assure us that the population proportions are as follows ...

- Male, Home Mortgage = 45% of the population
- Male, Rent = 20% of the population
- Female, Home Mortgage = 20% of the population
- Female, Rent = 15% of the population

... and the two teams agree that they must resample the data to match these proportions in order to build an accurate model that will work well on real-world data in future.
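
To make the gap concrete, the sketch below puts the sample proportions printed earlier next to the target proportions and converts both into row counts. It assumes the 6,841 rows reported for this run later in the article; because the "Gender" column is generated randomly, your own figures will differ slightly. The variable names are illustrative only.

```python
import pandas as pd

n_rows = 6841  # rows in the filtered sample for the run shown in this article

# Sample proportions as printed above, and the target population proportions
sample = pd.Series({
    "Male, Home Mortgage": 0.321737,
    "Male, Rent": 0.280076,
    "Female, Home Mortgage": 0.209911,
    "Female, Rent": 0.188277,
})
target = pd.Series({
    "Male, Home Mortgage": 0.45,
    "Male, Rent": 0.20,
    "Female, Home Mortgage": 0.20,
    "Female, Rent": 0.15,
})

comparison = pd.DataFrame({"sample": sample, "target": target})
# A ratio above 1 means the group must be over-sampled, below 1 under-sampled
comparison["ratio"] = comparison["target"] / comparison["sample"]
comparison["current_rows"] = (comparison["sample"] * n_rows).round().astype(int)
# int truncation, so these counts can fall a row or two short of n_rows in total
comparison["target_rows"] = (comparison["target"] * n_rows).astype(int)
print(comparison)
```

Only "Male, Home Mortgage" has a ratio above 1, so it is the one group that needs more rows than the sample currently contains.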

## Stratifying the Data
Below is a function that uses ``DataFrame.sample`` to draw, with replacement, the required number of rows for each stratify value from the source data, so that the result matches the proportions specified in the parameters (to within rounding) ...

In [4]:
def stratify_data(df_data, stratify_column_name, stratify_values, stratify_proportions, random_state=None):
    """Stratifies data according to the values and proportions passed in

    Args:
        df_data (DataFrame): source data
        stratify_column_name (str): The name of the single column in the dataframe that holds the data values that will be used to stratify the data
        stratify_values (list of str): A list of all of the potential values for stratifying e.g. "Male, Graduate", "Male, Undergraduate", "Female, Graduate", "Female, Undergraduate"
        stratify_proportions (list of float): A list of numbers representing the desired proportions for stratifying e.g. 0.4, 0.3, 0.2, 0.1. The list values must add up to 1 and there must be one proportion for each value in stratify_values
        random_state (int, optional): sets the random_state. Defaults to None.

    Returns:
        DataFrame: a new dataframe based on df_data that has the new proportions representing the desired strategy for stratifying
    """
    sampled_frames = [] # Holds the sampled rows for each stratify value
    rows_sampled = 0 # Running total of rows sampled so far

    for i, stratify_value in enumerate(stratify_values): # Iterate over the stratify values (e.g. "Male, Undergraduate" etc.)
        if i == len(stratify_values) - 1:
            ratio_len = len(df_data) - rows_sampled # Final iteration: take the remaining rows so that the result has the same number of rows as the source data
        else:
            ratio_len = int(len(df_data) * stratify_proportions[i]) # Calculate the number of rows to match the desired proportion

        df_filtered = df_data[df_data[stratify_column_name] == stratify_value] # Filter the source data based on the currently selected stratify value
        df_temp = df_filtered.sample(replace=True, n=ratio_len, random_state=random_state) # Sample the filtered data, with replacement, using the calculated number of rows

        sampled_frames.append(df_temp) # Keep the sampled subset for this stratify value
        rows_sampled += ratio_len

    return pd.concat(sampled_frames) # Combine the sampled subsets and return the stratified, re-sampled data
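
The function relies on the caller passing consistent arguments: one proportion per value, proportions that add up to 1, and at least one source row for every value (an empty group is likely to make ``DataFrame.sample`` fail when rows are requested from it). The helper below is a hypothetical addition, not part of the original notebook, that sketches how those assumptions could be checked before stratifying.

```python
import math
import pandas as pd

def check_stratify_inputs(df_data, stratify_column_name, stratify_values, stratify_proportions):
    """Hypothetical helper: raise ValueError if the stratification inputs are inconsistent."""
    if len(stratify_values) != len(stratify_proportions):
        raise ValueError("stratify_values and stratify_proportions must have the same length")
    # isclose tolerates the tiny floating-point error in sums such as 0.45 + 0.2 + 0.2 + 0.15
    if not math.isclose(sum(stratify_proportions), 1.0, abs_tol=1e-9):
        raise ValueError("stratify_proportions must add up to 1")
    # Every requested value needs at least one source row to sample from
    present = set(df_data[stratify_column_name].unique())
    missing = [value for value in stratify_values if value not in present]
    if missing:
        raise ValueError(f"no rows found for: {missing}")

# Usage on a small hypothetical frame
df_demo = pd.DataFrame({"Stratify": ["A", "A", "B"]})
check_stratify_inputs(df_demo, "Stratify", ["A", "B"], [0.5, 0.5])  # passes silently
```

Calling the helper at the top of ``stratify_data`` would surface a bad argument as a clear error message rather than a failure part-way through the loop.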

## Testing
The code below specifies the values and proportions for stratifying the data as per the required proportions i.e. -

- Male, Home Mortgage = 45% of the population
- Male, Rent = 20% of the population
- Female, Home Mortgage = 20% of the population
- Female, Rent = 15% of the population

... and takes a look at the newly stratified dataset ...

In [5]:
stratify_values = ['Male, Home Mortgage', 'Male, Rent', 'Female, Home Mortgage', 'Female, Rent']
stratify_proportions = [0.45, 0.20, 0.20, 0.15]
df_stratified = stratify_data(df_credit, 'Stratify', stratify_values, stratify_proportions, random_state=42)
df_stratified

Unnamed: 0,Id,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default,Gender,Stratify
2923,2923,Home Mortgage,1041447.0,2 years,0.0,14.0,11.4,309144.0,0.0,42.0,0.0,debt consolidation,Short Term,99999999.0,141512.0,9981.0,743.0,0,Male,"Male, Home Mortgage"
4505,4505,Home Mortgage,1936841.0,< 1 year,0.0,5.0,23.5,187110.0,0.0,45.0,0.0,home improvements,Short Term,99999999.0,59299.0,7521.0,740.0,0,Male,"Male, Home Mortgage"
3910,3910,Home Mortgage,682898.0,,0.0,9.0,29.5,565664.0,0.0,10.0,0.0,debt consolidation,Short Term,175714.0,116755.0,2840.0,742.0,0,Male,"Male, Home Mortgage"
3806,3806,Home Mortgage,1317631.0,10+ years,0.0,8.0,14.4,468886.0,0.0,25.0,0.0,debt consolidation,Long Term,435908.0,320720.0,20204.0,728.0,0,Male,"Male, Home Mortgage"
5653,5653,Home Mortgage,1685889.0,4 years,0.0,13.0,16.9,403612.0,1.0,,1.0,debt consolidation,Short Term,347028.0,127224.0,8837.0,743.0,0,Male,"Male, Home Mortgage"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7053,7053,Rent,673531.0,5 years,0.0,8.0,13.7,439934.0,0.0,,0.0,other,Short Term,216634.0,40660.0,9188.0,751.0,0,Female,"Female, Rent"
5785,5785,Rent,1051175.0,7 years,0.0,12.0,10.2,530794.0,0.0,,0.0,debt consolidation,Short Term,132770.0,281922.0,30396.0,726.0,1,Female,"Female, Rent"
4755,4755,Rent,863246.0,10+ years,0.0,11.0,25.5,282392.0,0.0,14.0,0.0,debt consolidation,Short Term,304216.0,130967.0,16308.0,746.0,0,Female,"Female, Rent"
6005,6005,Rent,756428.0,6 years,0.0,11.0,13.9,275660.0,0.0,43.0,0.0,debt consolidation,Short Term,164230.0,134520.0,18595.0,7400.0,1,Female,"Female, Rent"


And just to be sure we have the right result, let's take a look at the shape of the stratified data and the overall proportions of our ``Stratify`` feature column ...

In [6]:
df_stratified.shape, df_credit.shape

((6841, 20), (6841, 20))

In [7]:
(df_stratified['Stratify'].value_counts() / len(df_stratified)).sort_values(ascending=False)

Male, Home Mortgage      0.449934
Female, Home Mortgage    0.199971
Male, Rent               0.199971
Female, Rent             0.150124
Name: Stratify, dtype: float64
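
The same checks can be made programmatically. The sketch below uses a small hypothetical frame (`df_demo`, with made-up strata "A" and "B") rather than the credit data. It compares the achieved proportions with the targets and also counts repeated rows, since sampling with replacement necessarily duplicates source rows whenever a group is asked for more rows than it contains.

```python
import pandas as pd

# Hypothetical toy data: 700 rows of stratum "A" and 300 rows of stratum "B"
df_demo = pd.DataFrame({"Stratify": ["A"] * 700 + ["B"] * 300})
target = pd.Series({"A": 0.5, "B": 0.5})

# Resample each stratum with replacement to hit the target row counts
parts = [
    df_demo[df_demo["Stratify"] == value].sample(
        n=int(len(df_demo) * share), replace=True, random_state=42
    )
    for value, share in target.items()
]
df_resampled = pd.concat(parts)

# Check 1: achieved proportions should sit within a small tolerance of the targets
achieved = df_resampled["Stratify"].value_counts(normalize=True)
max_gap = (achieved - target).abs().max()
print(f"largest gap from target: {max_gap:.4f}")

# Check 2: count rows whose original index has already appeared in the result
n_repeats = int(df_resampled.index.duplicated().sum())
print(f"rows that repeat an earlier row: {n_repeats} of {len(df_resampled)}")
```

In the article's run, "Male, Home Mortgage" grows from roughly 32% to 45% of the data, so it will contain repeated rows in the same way. That is worth keeping in mind if the stratified data is later split into training and validation sets.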

## Conclusion

We started by stating that flaws in data collection can sometimes cause sample data to have different proportions from the known proportions of the population, and that this can lead to over-fitted models that perform poorly when they encounter test or live data with the right proportions.

We went on to explore how stratifying the training data and resampling it to give it the same proportions can resolve the issue and improve the performance of the production algorithms.

We then chose a more complex example that stratified on two features, engineered those two features into a new column, and defined a function that performs the stratification calculations and returns a stratified dataset.

Finally, we examined the results to make sure the calculations were correct.

The full source code can be found on GitHub: https://github.com/grahamharrison68/Public-Github/blob/master/Resampling/Stratified%20Sampling.ipynb

## Thank you for reading!
If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/?

Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.

If you would like to get in touch to discuss any of these topics please look me up on LinkedIn — https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at GHarrison@lincolncollege.ac.uk.