# Sampling Methods

<br>

---

<br>

<br>

### Dataset

* **Attrition Dataset -** HR data relating to an employee's length of service

In [1]:
import random
import pandas as pd
import numpy as np

file = 'attrition.feather'
df = pd.read_feather(file)
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,21,0.0,Travel_Rarely,391,Research_Development,15,College,Life_Sciences,High,Male,...,Excellent,Very_High,0,0,6,Better,0,0,0,0
1,19,1.0,Travel_Rarely,528,Sales,22,Below_College,Marketing,Very_High,Male,...,Excellent,Very_High,0,0,2,Good,0,0,0,0
2,18,1.0,Travel_Rarely,230,Research_Development,3,Bachelor,Life_Sciences,High,Male,...,Excellent,High,0,0,2,Better,0,0,0,0
3,18,0.0,Travel_Rarely,812,Sales,10,Bachelor,Medical,Very_High,Female,...,Excellent,Low,0,0,2,Better,0,0,0,0
4,18,1.0,Travel_Frequently,1306,Sales,5,Bachelor,Marketing,Medium,Male,...,Excellent,Very_High,0,0,3,Better,0,0,0,0


In [3]:
# examine proportions
df.Education.value_counts(normalize=True)

Education
Bachelor         0.389116
Master           0.270748
College          0.191837
Below_College    0.115646
Doctor           0.032653
Name: proportion, dtype: float64

---

<br>

<br>

### Proportional Stratification

* **Proportional Stratification -** Maintains each group's share of the population when sampling (e.g. if 27% of the population is from Mexico, then 27% of the sample should be from Mexico)
* **Simple Sample -** Ignores group membership, so groups may be under- or over-represented in the sample by chance

In [6]:
# sample 10% of each group
df_prop = df.groupby('Education')\
    .sample(frac=0.1, random_state=1001)

# show how proportions are maintained
df_prop.Education.value_counts(normalize=True)

Education
Bachelor         0.387755
Master           0.272109
College          0.190476
Below_College    0.115646
Doctor           0.034014
Name: proportion, dtype: float64
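
The contrast between the two approaches can be sketched on a small synthetic frame. The `group` column and its 500/300/200 split are made up for illustration and are not part of the attrition data. Sampling 10% within each group returns exactly 50, 30 and 20 rows, while the simple random sample only matches those counts on average.

```python
import pandas as pd

# synthetic population: hypothetical groups with a known 50/30/20 split
pop = pd.DataFrame({
    'group': ['A'] * 500 + ['B'] * 300 + ['C'] * 200,
    'value': range(1000),
})

# simple random sample: group shares are left to chance
simple = pop.sample(frac=0.1, random_state=1001)

# proportional stratified sample: 10% drawn within each group
strat = pop.groupby('group').sample(frac=0.1, random_state=1001)

# compare group counts in the two samples
simple_counts = simple['group'].value_counts().to_dict()
strat_counts = strat['group'].value_counts().to_dict()
print(simple_counts)
print(strat_counts)
```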

---

<br>

<br>

### Equal Counts

* **Equal Counts -** Ensures equal number of observations across each group (e.g. 15 samples from Bachelor, 15 samples from Doctor...)

In [7]:
# draw 15 samples from each education group
df_eq = df.groupby('Education')\
    .sample(n=15, random_state=1001)

# show equal proportions
df_eq.Education.value_counts(normalize=True)

Education
Below_College    0.2
College          0.2
Bachelor         0.2
Master           0.2
Doctor           0.2
Name: proportion, dtype: float64
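
One practical caveat: when sampling without replacement, `n` cannot exceed the size of the smallest group, otherwise pandas raises an error. The sketch below caps `n` at the smallest group size. The synthetic frame and the names `pop` and `n_per_group` are assumptions for illustration only.

```python
import pandas as pd

# synthetic population with uneven, hypothetical group sizes (40 / 25 / 8)
pop = pd.DataFrame({
    'group': ['A'] * 40 + ['B'] * 25 + ['C'] * 8,
    'value': range(73),
})

# cap the requested count at the smallest group so every group can supply it
group_sizes = pop.groupby('group').size()
n_per_group = min(15, int(group_sizes.min()))

# draw the same number of rows from each group
equal = pop.groupby('group').sample(n=n_per_group, random_state=1001)
equal_counts = equal['group'].value_counts().to_dict()
print(equal_counts)
```

An alternative, if exactly 15 rows per group are required, is to pass `replace=True`, at the cost of possibly repeating rows from small groups.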

---

<br>

<br>

### Weighted

* **Weighted -** Adjusts relative probability of sampling each row - affords chance to increase/decrease representation of certain groups

In [11]:
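# build weight field to over-index College
# note: df_weights is another name for df (not a copy), so the Weight column is also added to df
df_weights = df
condition = df_weights.Education == 'College'
df_weights['Weight'] = np.where(condition, 2, 1)

# apply weights (no random_state is set, so the result varies between runs)
df_weights_new = df_weights.sample(frac=0.1, weights='Weight')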
df_weights_new.Education.value_counts(normalize=True)

Education
Master           0.367347
College          0.319728
Bachelor         0.265306
Below_College    0.040816
Doctor           0.006803
Name: proportion, dtype: float64
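
As a rough check on what a weight of 2 should do, the expected share of each group can be estimated by rescaling the population proportions by their weights. This is a back-of-the-envelope sketch: it is exact for a single draw and only approximate for a 10% sample taken without replacement. The proportions are copied from the `value_counts` output earlier in this notebook.

```python
# population proportions copied from the earlier value_counts output
proportions = {
    'Bachelor': 0.389116,
    'Master': 0.270748,
    'College': 0.191837,
    'Below_College': 0.115646,
    'Doctor': 0.032653,
}

# same weighting scheme as the cell above: College counts double
weights = {name: (2 if name == 'College' else 1) for name in proportions}

# rescale so the weighted shares sum to 1
total = sum(proportions[g] * weights[g] for g in proportions)
expected = {g: proportions[g] * weights[g] / total for g in proportions}

for name, share in expected.items():
    print(f'{name:<15}{share:.3f}')
```

Under these assumptions College is expected to make up roughly 32% of the weighted sample, compared with about 19% of the population, which is in line with the College share in the output above. Without a fixed `random_state`, the shares of all groups will vary from run to run.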

---

<br>

<br>

### Cluster

* **Cluster Sampling -** Two-stage sampling process - selects random clusters from wider dataset (e.g. only Managers & Sales Execs from a long list of job roles), then samples the selected clusters

In [21]:
# create random groups
job_roles = df.JobRole.unique().tolist()
job_roles_samp = random.sample(job_roles, k=4)

# subset for random groups
# (take a copy so the assignment below does not raise SettingWithCopyWarning)
condition = df.JobRole.isin(job_roles_samp)
df_filtered = df[condition].copy()

# remove unused groups
df_filtered['JobRole'] = df_filtered.JobRole\
                        .cat.remove_unused_categories()

# sample for remaining groups
df_clust = df_filtered.groupby('JobRole')\
                .sample(n=10, random_state=1001)

# show
df_clust.head()


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Weight
279,33,0.0,Travel_Rarely,147,Human_Resources,2,Bachelor,Human_Resources,Medium,Male,...,Very_High,1,5,2,Better,5,4,1,4,1
934,34,1.0,Travel_Frequently,988,Human_Resources,23,Bachelor,Human_Resources,Medium,Female,...,High,3,11,2,Better,3,2,0,2,1
851,29,0.0,Travel_Rarely,332,Human_Resources,17,Bachelor,Other,Medium,Male,...,Low,0,10,3,Good,10,9,0,9,1
475,33,0.0,Travel_Rarely,1075,Human_Resources,3,College,Human_Resources,Very_High,Male,...,High,1,7,4,Best,4,3,0,3,2
528,26,0.0,Travel_Rarely,1355,Human_Resources,25,Below_College,Life_Sciences,High,Female,...,Very_High,1,8,3,Better,8,7,5,7,1
