# Day 13-14: Data Sampling Techniques

## Task:
- Explore different data sampling techniques

## Description:
- Learn about techniques like random sampling, stratified sampling, and under-sampling/oversampling for imbalanced datasets.

## What is Random Sampling?

- Random sampling selects a subset of data points uniformly at random, without considering the class distribution. It is simple, but it does not address class imbalance: nothing guarantees that minority classes are represented, and a small sample may contain few or no minority examples.
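To make that limitation concrete, here is a minimal sketch. It assumes a made-up label list with 95 majority and 5 minority examples; the labels, counts, and seed are illustrative and not taken from any real dataset. It draws a simple random sample and counts how many examples of each class landed in it.

```python
import random
from collections import Counter

# Hypothetical imbalanced labels: 95 "majority" and 5 "minority" examples.
labels = ["majority"] * 95 + ["minority"] * 5

rng = random.Random(0)  # seeded only so the run is repeatable
sample = rng.sample(labels, k=10)

# Class counts in the sample; nothing forces "minority" to appear at all.
sample_counts = Counter(sample)
print(sample_counts)
```

With these numbers, a sample of 10 contains no minority example roughly 58% of the time, which is why imbalanced data usually calls for the techniques covered later in these notes.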

In [12]:
import random

mylist = ['apple', 'banana', 'cherry']

print(random.sample(mylist, k=2))

['apple', 'cherry']


### Definition and Usage
- `random.sample()` returns a new list with a specified number of items chosen at random from a sequence, without replacement. The original sequence is left unchanged.

### Syntax

random.sample(sequence, k)

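sequence - The population to sample from. It must be a sequence such as a list, tuple, string, or range. A set is not a sequence; convert it first, for example with sorted() (Python 3.11 and later raise a TypeError for sets).

k - The number of items to return. It must not be larger than the length of the sequence, otherwise a ValueError is raised.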
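The behaviour of the two parameters can be checked with a short sketch. The seed value and the `fruits` set below are illustrative choices, not part of the original notebook.

```python
import random

random.seed(42)  # optional: makes the "random" picks repeatable

# Any sequence works, including a range (no need to build a list first).
picks = random.sample(range(100), k=5)
print(picks)

# Sampling is without replacement, so no position is picked twice.
unique_picks = set(picks)

# Asking for more items than the sequence holds raises ValueError.
error_raised = False
try:
    random.sample(['apple', 'banana', 'cherry'], k=4)
except ValueError as err:
    error_raised = True
    print("ValueError:", err)

# A set is not a sequence; convert it first (sorted() gives a stable order).
fruits = {'apple', 'banana', 'cherry'}
two_fruits = random.sample(sorted(fruits), k=2)
print(two_fruits)
```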

## What is Stratified Sampling?

- Stratified sampling involves dividing the dataset into homogeneous subgroups (or strata) based on the class labels and then sampling from each stratum to ensure proportional representation of classes in the sample. This method ensures that the distribution of classes in the sample resembles that of the original dataset.

### Steps involved in stratified sampling
- **Separate the population into strata:** Divide the population into strata based on a shared characteristic. Every member of the population must belong to exactly one stratum (the singular of strata).
- **Determine the sample size:** Decide how small or large the sample should be.
- **Randomly sample each stratum:** Draw a random sample from each stratum using one of two schemes:
  - **Disproportionate sampling:** every stratum contributes the same number of samples, irrespective of its size.
  - **Proportionate sampling:** every stratum contributes a number of samples proportional to its size.
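The three steps can be sketched in plain Python. The helper `stratified_sample` and its parameters (`n`, `frac`, `seed`) are hypothetical names introduced for illustration; the grade letters mirror the `Grade` column used in the pandas example in these notes.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, n=None, frac=None, seed=None):
    """Sample each stratum of `rows` (a list of dicts), grouped by `key`.

    Pass `n` for disproportionate sampling (same count per stratum) or
    `frac` for proportionate sampling (same fraction per stratum).
    """
    if (n is None) == (frac is None):
        raise ValueError("pass exactly one of n or frac")
    rng = random.Random(seed)

    # Step 1: separate the population into strata.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample = []
    for members in strata.values():
        # Step 2: decide the sample size for this stratum.
        size = n if n is not None else round(frac * len(members))
        size = min(size, len(members))
        # Step 3: draw a simple random sample inside the stratum.
        sample.extend(rng.sample(members, size))
    return sample

# Ten rows: 5 with grade A, 3 with grade B, 2 with grade C.
rows = [{"id": i, "grade": g} for i, g in enumerate("AACBBBCAAA")]

equal = stratified_sample(rows, "grade", n=2, seed=0)
proportional = stratified_sample(rows, "grade", frac=0.6, seed=0)

equal_counts = Counter(r["grade"] for r in equal)
proportional_counts = Counter(r["grade"] for r in proportional)
print(equal_counts)         # 2 rows from each grade
print(proportional_counts)  # 3 A, 2 B, 1 C
```

Rounding `frac * len(members)` is a design choice of this sketch; for very small strata it can round down to zero, so a real pipeline may want a minimum of one sample per stratum.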

In [13]:
import pandas as pd 

# Create a dictionary of students
students = {
    'Name': ['Lisa', 'Kate', 'Ben', 'Kim', 'Josh',
             'Alex', 'Evan', 'Greg', 'Sam', 'Ella'],
    'ID': ['001', '002', '003', '004', '005',
           '006', '007', '008', '009', '010'],
    'Grade': ['A', 'A', 'C', 'B', 'B',
              'B', 'C', 'A', 'A', 'A'],
    'Category': [2, 3, 1, 3, 2, 3, 3, 1, 2, 1]
}

# Create dataframe from students dictionary 
df = pd.DataFrame(students) 

# view the dataframe 
df 

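|   | Name | ID  | Grade | Category |
|---|------|-----|-------|----------|
| 0 | Lisa | 001 | A     | 2        |
| 1 | Kate | 002 | A     | 3        |
| 2 | Ben  | 003 | C     | 1        |
| 3 | Kim  | 004 | B     | 3        |
| 4 | Josh | 005 | B     | 2        |
| 5 | Alex | 006 | B     | 3        |
| 6 | Evan | 007 | C     | 3        |
| 7 | Greg | 008 | A     | 1        |
| 8 | Sam  | 009 | A     | 2        |
| 9 | Ella | 010 | A     | 1        |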


In [14]:
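# Disproportionate sampling: draw 2 rows from every Grade stratum.
# groupby(...).sample() (pandas 1.1+) avoids the DeprecationWarning that
# groupby(...).apply(lambda x: x.sample(2)) triggers in pandas 2.2+.
df.groupby('Grade').sample(n=2)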


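|   | Name | ID  | Grade | Category |
|---|------|-----|-------|----------|
| 0 | Lisa | 001 | A     | 2        |
| 8 | Sam  | 009 | A     | 2        |
| 4 | Josh | 005 | B     | 2        |
| 5 | Alex | 006 | B     | 3        |
| 6 | Evan | 007 | C     | 3        |
| 2 | Ben  | 003 | C     | 1        |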


In [15]:
# Proportionate sampling: draw 60% of the rows from every Grade stratum.
df.groupby('Grade').sample(frac=0.6)


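|   | Name | ID  | Grade | Category |
|---|------|-----|-------|----------|
| 7 | Greg | 008 | A     | 1        |
| 0 | Lisa | 001 | A     | 2        |
| 1 | Kate | 002 | A     | 3        |
| 4 | Josh | 005 | B     | 2        |
| 5 | Alex | 006 | B     | 3        |
| 6 | Evan | 007 | C     | 3        |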


## What are Under-sampling and Oversampling?

- **Under-sampling:** Under-sampling randomly removes samples from the majority class(es) to balance the class distribution, typically by keeping a random subset of each majority class that matches the size of the minority class. The cost is that the removed data is no longer available for training.
- **Oversampling:** Oversampling artificially increases the number of samples in the minority class(es) to balance the class distribution. This can be done by duplicating existing minority samples, by generating synthetic samples with techniques like SMOTE (Synthetic Minority Over-sampling Technique), or by other methods.
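Both ideas can be sketched with the standard library alone. The helper names (`group_by_class`, `random_undersample`, `random_oversample`) and the toy `labels` list are hypothetical, and the oversampler only duplicates existing samples; it does not generate synthetic points the way SMOTE does.

```python
import random
from collections import Counter, defaultdict

def group_by_class(labels):
    """Map each class label to the list of indices that carry it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return groups

def random_undersample(labels, seed=None):
    """Keep only as many indices per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_class(labels)
    target = min(len(idx) for idx in groups.values())
    kept = []
    for idx in groups.values():
        kept.extend(rng.sample(idx, target))  # without replacement
    return sorted(kept)

def random_oversample(labels, seed=None):
    """Duplicate indices of smaller classes until all match the largest."""
    rng = random.Random(seed)
    groups = group_by_class(labels)
    target = max(len(idx) for idx in groups.values())
    kept = []
    for idx in groups.values():
        extra = rng.choices(idx, k=target - len(idx))  # with replacement
        kept.extend(idx + extra)
    return sorted(kept)

# Toy imbalanced labels: 8 negatives and 2 positives.
labels = ["neg"] * 8 + ["pos"] * 2

under = random_undersample(labels, seed=0)
over = random_oversample(labels, seed=0)

under_counts = Counter(labels[i] for i in under)
over_counts = Counter(labels[i] for i in over)
print(under_counts)  # 2 samples of each class
print(over_counts)   # 8 samples of each class
```

Both functions return indices rather than copies of the data, so the same selection can be applied to a feature matrix and a label vector together. Resampling is normally applied to the training split only, so that evaluation data keeps the original class distribution.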