# Stratified Random Sampling

A small project that illustrates stratified random sampling with pandas.

The population is divided into smaller groups, called strata, based on a shared characteristic. Subjects are then chosen at random from each stratum.
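The definition above can be sketched with the standard library alone, before bringing in pandas. This is a minimal illustration, not the method used in the cells below: the helper name `stratified_pick`, the `seed` parameter, and the shortened six-student list are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_pick(records, key, seed=None):
    """Pick one record at random from each stratum defined by `key`.

    Hypothetical helper written for this illustration only.
    """
    rng = random.Random(seed)  # a seed makes the draw repeatable

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)

    # Step 2: choose one subject at random from every stratum.
    return {label: rng.choice(members) for label, members in strata.items()}


# A shortened version of the student data, kept small for readability.
population = [
    {"Name": "Ibrahim", "Grade": "A"},
    {"Name": "Ganiyat", "Grade": "B"},
    {"Name": "Joel", "Grade": "C"},
    {"Name": "Elijah", "Grade": "A"},
    {"Name": "Yusuf", "Grade": "B"},
    {"Name": "Nurain", "Grade": "C"},
]

picked = stratified_pick(population, key="Grade", seed=0)
for grade, student in sorted(picked.items()):
    print(grade, student["Name"])
```

Every grade contributes exactly one student, whichever names happen to be drawn.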

In [1]:
import pandas as pd

students = {
    "Name": ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf", "Nurain",
             "Dayo", "David", "Olu", "Tobi"],
    "ID": ['001', '002', '003', '004', '005', '006', '007', '008', '009', '010'],
    "Grade": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A'],
    "Category": [1, 2, 2, 1, 3, 3, 1, 2, 3, 3]
}

df = pd.DataFrame(students)

print(df)

      Name   ID Grade  Category
0  Ibrahim  001     A         1
1  Ganiyat  002     B         2
2     Joel  003     C         2
3   Elijah  004     A         1
4    Yusuf  005     B         3
5   Nurain  006     C         3
6     Dayo  007     A         1
7    David  008     A         2
8      Olu  009     B         3
9     Tobi  010     A         3


## Proportionate Stratified Random Sampling

In [9]:
# Draw 60% of the rows at random from each grade (stratum).
df_sample = df.groupby("Grade", group_keys=False).apply(lambda x: x.sample(frac=0.6))
print(df_sample)

      Name   ID Grade  Category
6     Dayo  007     A         1
7    David  008     A         2
3   Elijah  004     A         1
1  Ganiyat  002     B         2
8      Olu  009     B         3
5   Nurain  006     C         3


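Students are grouped by grade. The strata do not overlap, because each student has exactly one grade. The same fraction (60%) is then sampled at random from every group, so the grade distribution of the resulting sample approximately matches that of the whole population. The match is only approximate because each group's sample size is rounded to a whole number of students.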
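As a minimal sketch of the same idea, pandas 1.1 and later expose `sample` directly on the grouped object, which avoids the `apply` and `lambda` step. The `random_state=42` seed is an assumption added only so that the draw is repeatable. The original cells are unseeded, so their rows differ from run to run.

```python
import pandas as pd

# Same names and grades as the student table above.
df = pd.DataFrame({
    "Name": ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf", "Nurain",
             "Dayo", "David", "Olu", "Tobi"],
    "Grade": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A'],
})

# Sample the same fraction from every grade.
# random_state is an assumption added for reproducibility.
sample = df.groupby("Grade").sample(frac=0.6, random_state=42)

# With 5, 3 and 2 students per grade, 60% rounds to 3, 2 and 1 rows.
sizes = sample["Grade"].value_counts().sort_index()
print(sizes)
```

The per-grade sizes are fixed by the fraction and the rounding, so only the identity of the chosen students depends on the seed.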

In [10]:
# Compare grade proportions: population first, then sample.
print(df["Grade"].value_counts(normalize=True))
print(df_sample["Grade"].value_counts(normalize=True).round(1))

A    0.5
B    0.3
C    0.2
Name: Grade, dtype: float64
A    0.5
B    0.3
C    0.2
Name: Grade, dtype: float64


## Disproportionate Stratified Random Sampling

In [11]:
# Draw a fixed number of rows (2) at random from each grade.
df_sample2 = df.groupby('Grade', group_keys=False).apply(lambda x: x.sample(n=2))
print(df_sample2)

      Name   ID Grade  Category
0  Ibrahim  001     A         1
6     Dayo  007     A         1
4    Yusuf  005     B         3
1  Ganiyat  002     B         2
5   Nurain  006     C         3
2     Joel  003     C         2


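Here each stratum is sampled at random without regard to its share of the population. A fixed number of rows (two in this case) is drawn from every grade, so the smaller strata are over-represented in the sample relative to the population.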
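The fixed-size draw can also be sketched with `DataFrameGroupBy.sample` (pandas 1.1 and later). As before, `random_state=42` is an assumption added for repeatability and is not part of the original cell.

```python
import pandas as pd

# Same names and grades as the student table above.
df = pd.DataFrame({
    "Name": ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf", "Nurain",
             "Dayo", "David", "Olu", "Tobi"],
    "Grade": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A'],
})

# Take exactly two students from every grade, regardless of group size.
# random_state is an assumption added for reproducibility.
sample = df.groupby("Grade").sample(n=2, random_state=42)

# Each grade now contributes the same number of rows.
counts = sample["Grade"].value_counts().sort_index()
print(counts)
```

Note that `n` cannot exceed the size of the smallest stratum unless `replace=True` is passed. Grade C has only two students here, so `n=2` is the largest value that works without replacement.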

In [12]:
# Compare grade proportions: population first, then sample.
print(df['Grade'].value_counts(normalize=True))
print(df_sample2['Grade'].value_counts(normalize=True).round(1))

A    0.5
B    0.3
C    0.2
Name: Grade, dtype: float64
A    0.3
B    0.3
C    0.3
Name: Grade, dtype: float64
