## Simple Random Sampling

### Each item has an equal chance of being selected: 1/N on a single draw, where N is the number of items.


In [4]:
from random import sample
import random
import numpy as np
from bisect import bisect
import pandas as pd

list1 = [1, 2, 3, 4, 5] 
  
print(sample(list1,3))


[1, 3, 2]


In [5]:
# if you want the same output, use random.seed(x) where x is an integer.
random.seed(10)

print(sample(list1,4))



[5, 1, 2, 3]
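The equal-chance property can be checked empirically. The sketch below is not part of the original notebook; the helper name `inclusion_frequencies` and its `trials` and `seed` parameters are illustrative assumptions. It draws many samples of size k without replacement and counts how often each item shows up. Under simple random sampling each item should appear in roughly k/N of the samples.

```python
import random
from collections import Counter

def inclusion_frequencies(items, k, trials=20000, seed=0):
    """Fraction of samples (size k, without replacement) that contain each item."""
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    counts = Counter()
    for _ in range(trials):
        counts.update(rng.sample(items, k))
    return {item: counts[item] / trials for item in items}

freqs = inclusion_frequencies([1, 2, 3, 4, 5], k=3)
print(freqs)  # every value should be close to k/N = 3/5
```

Using a local `random.Random(seed)` instead of `random.seed` keeps the check reproducible without changing the state that other cells rely on.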


## SQL Question on sampling/big data

### Let's say we have a table with an id and name field. The table holds over 100 million rows and we want to sample a random row in the table without throttling the database.

### Write a query to randomly sample a row from this table.

#### For small datasets, a simple SQL query using RAND() may be acceptable:

SELECT * FROM table ORDER BY RAND() LIMIT 1

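#### But with a table of over 100 million rows, ORDER BY RAND() has to generate a random value for every row and sort all of them, so the query would take a long time. A more efficient query picks a random id and seeks to it (this assumes id is an indexed numeric key; if the ids have gaps, a row that follows a gap is picked more often):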

SELECT *
  FROM table r1 JOIN
       (SELECT CEIL(RAND() *
                    (SELECT MAX(id)
                       FROM table)) AS id
        ) AS r2
       where r1.id>=r2.id 
       order by r1.id ASC
       limit 1;



http://jan.kneschke.de/projects/mysql/order-by-rand/
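To see how the random-id lookup behaves, here is a minimal sketch using Python's built-in `sqlite3` with an in-memory database. The table name `people`, the helper `random_row_by_id`, and the sample rows are assumptions made for illustration, and the random id is drawn in Python rather than with `RAND()`. The ids are deliberately non-contiguous so the effect of gaps is visible.

```python
import random
import sqlite3

def random_row_by_id(conn, rng=random):
    """Pick a random id in [1, MAX(id)] and return the first row at or above it."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM people").fetchone()
    target = rng.randint(1, max_id)  # plays the role of CEIL(RAND() * MAX(id))
    return conn.execute(
        "SELECT id, name FROM people WHERE id >= ? ORDER BY id ASC LIMIT 1",
        (target,),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
# ids 4..8 are deliberately missing to show the effect of gaps
conn.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(i, f"name_{i}") for i in (1, 2, 3, 9, 10)],
)

rng = random.Random(0)
counts = {}
for _ in range(10000):
    row_id, _name = random_row_by_id(conn, rng)
    counts[row_id] = counts.get(row_id, 0) + 1
print(counts)  # id 9 absorbs targets 4..9, so it is picked far more often than the others
```

The lookup touches only the primary-key index, which is why it stays cheap on a large table. The price is the bias toward rows that follow a gap, so it suits tables whose ids are dense.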

## Multimodal distribution (more than 1 mode in a distribution)


#### For example: we have the keys ["white", "green", "red"], a list of weights for them, and the desired sample size.



In [6]:
## using built in random function

random.choices(["white", "green", "red"], [12, 12, 4], k=10)


['green',
 'white',
 'green',
 'green',
 'green',
 'white',
 'green',
 'white',
 'white',
 'red']

In [7]:
## raw implementation

def weighted_choice(choices):
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    x = random.random() * total
    i = bisect(cum_weights, x)
    
    # i is the index of the first cumulative weight greater than x, where x is uniform in [0, total)
    
    # in the example below (total = 28): x < 12 has probability 12/28 -> "white",
    # 12 <= x < 24 has probability 12/28 -> "green", and 24 <= x < 28 has probability 4/28 -> "red"
    
    return values[i]

n=10
random_choice=[weighted_choice([("white", 12),  ("green", 12), ("red", 4)]) for i in range(0, n)]
print(random_choice)

['red', 'white', 'red', 'green', 'white', 'white', 'green', 'green', 'green', 'green']


### An interesting observation:

Comparing the SQL question above with the raw implementation, both follow the same pattern: take the maximum id (or the total weight), multiply it by a random number to get a random position, and then look that position up in sorted data (an index seek on id, a binary search on the cumulative weights) instead of scanning everything. This divide-and-conquer style lookup works well for big data.
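When many draws are needed, the cumulative weights can be computed once and reused, so each draw costs a single binary search. The following is a sketch along those lines rather than the notebook's own code; `make_weighted_sampler` and its `seed` parameter are hypothetical names.

```python
import random
from bisect import bisect
from collections import Counter
from itertools import accumulate

def make_weighted_sampler(values, weights, seed=None):
    """Precompute cumulative weights once; each draw is then one binary search."""
    cum_weights = list(accumulate(weights))  # [12, 24, 28] for weights [12, 12, 4]
    total = cum_weights[-1]
    rng = random.Random(seed)

    def draw():
        x = rng.random() * total  # uniform in [0, total)
        # bisect returns the index of the first cumulative weight greater than x
        return values[bisect(cum_weights, x)]

    return draw

draw = make_weighted_sampler(["white", "green", "red"], [12, 12, 4], seed=1)
counts = Counter(draw() for _ in range(28000))
print(counts)  # expected shares: white 12/28, green 12/28, red 4/28
```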

## Stratified Sampling

### Samples are picked from each stratum. Use stratified sampling when you know enough about the population to break it into groups, so that the sample is more representative.

#### Each stratum is homogeneous.

#### The allocation can be done in two ways:
#### 1. Uniform (same sample size picked from each stratum) 

 Use it when cost does not really matter and the strata have similar distributions
 


In [8]:

df = pd.DataFrame(dict(
        A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
        B=range(10)
    ))

# 'A' is the column to stratify by; 3 is the desired sample size per stratum (smaller strata are taken whole).
df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 3)))

Unnamed: 0,A,B
2,1,2
0,1,0
1,1,1
5,2,5
6,2,6
3,2,3
7,3,7
8,4,8
9,4,9
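The same uniform allocation can be written without pandas, which makes the grouping step explicit. This is a plain-Python sketch under the assumption that the data is a list of dicts; `stratified_uniform` and its parameters are illustrative names, not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_uniform(records, stratum_key, per_stratum, seed=None):
    """Sample up to per_stratum records from every stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[stratum_key]].append(rec)  # bucket the records by stratum
    picked = []
    for members in groups.values():
        # a stratum smaller than per_stratum is taken whole
        picked.extend(rng.sample(members, min(len(members), per_stratum)))
    return picked

# Same data as the DataFrame above: A is the stratum, B is the row value
records = [{"A": a, "B": b} for b, a in enumerate([1, 1, 1, 2, 2, 2, 2, 3, 4, 4])]
picked = stratified_uniform(records, "A", 3, seed=0)
per_stratum_counts = Counter(rec["A"] for rec in picked)
print(per_stratum_counts)  # 3 rows each from strata 1 and 2, 1 from stratum 3, 2 from stratum 4
```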


#### 2. Proportionally

    How is the sample size determined?

    Sample size of a stratum = (Size of the stratum in the population / Total population size) * Total sample size

In [9]:
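def stratified_prop(df, col, required_sample_size): # col is the column that defines the strata
    popn_size = len(df[col])
    # per-stratum size = required_sample_size * len(stratum) / popn_size, rounded to the nearest integer,
    # so the number of rows returned can differ slightly from required_sample_size
    return df.groupby(col, group_keys=False).apply(lambda x: x.sample(int(np.rint(required_sample_size*len(x)/popn_size)))).reset_index(drop=True)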
    

In [10]:
stratified_prop(df, 'A', 8)

Unnamed: 0,A,B
0,1,2
1,1,1
2,2,6
3,2,4
4,2,5
5,3,7
6,4,8
7,4,9
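The allocation arithmetic behind this output can be worked through by hand. The sketch below uses a hypothetical helper, `proportional_allocation`, to apply the proportional formula to column A. It uses Python's built-in `round`, which rounds halves to even as `np.rint` does. It also shows a case where rounding makes the total fall short of the requested sample size.

```python
from collections import Counter

def proportional_allocation(strata_labels, sample_size):
    """Per-stratum sample sizes: round(sample_size * stratum_size / population_size)."""
    sizes = Counter(strata_labels)
    population = len(strata_labels)
    return {s: round(sample_size * n / population) for s, n in sizes.items()}

# Column A of the example DataFrame: stratum sizes are 3, 4, 1 and 2
labels = [1, 1, 1, 2, 2, 2, 2, 3, 4, 4]
alloc = proportional_allocation(labels, 8)
print(alloc)                # {1: 2, 2: 3, 3: 1, 4: 2}
print(sum(alloc.values()))  # 8

# Rounding does not always preserve the total: three equal strata, 10 rows requested
uneven = proportional_allocation(["a"] * 5 + ["b"] * 5 + ["c"] * 5, 10)
print(uneven, sum(uneven.values()))  # {'a': 3, 'b': 3, 'c': 3} 9
```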


## Clustered Sampling

### Clusters are created and each cluster is basically treated as a unit to be picked randomly.
### Each cluster is heterogeneous within, and the clusters are homogeneous between one another. Generally each cluster is of the same size.

#### Has been used in geographical/area-specific surveys to reduce cost and resources.


In [11]:
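def cluster_sampling(df, col_cluster_by, number_clusters_tocreate, sample_size):
    # sample_size is the number of clusters to pick, not the number of rows
    df = df.copy()  # avoid adding the 'cluster' column to the caller's DataFrame
    # pd.cut splits the column's value range into equal-width bins labelled 0..number_clusters_tocreate-1
    df['cluster'] = pd.cut(df[col_cluster_by], bins=number_clusters_tocreate, labels=False)
    clusters_picked=sample(range(0, number_clusters_tocreate), sample_size)

    return df[df['cluster'].isin(clusters_picked)].reset_index(drop=True)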

In [12]:
cluster_sampling(df, 'A', 4, 2)

Unnamed: 0,A,B,cluster
0,2,3,1
1,2,4,1
2,2,5,1
3,2,6,1
4,4,8,3
5,4,9,3
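In area surveys the clusters usually exist already (cities, schools, districts) and do not have to be created by binning. A plain-Python sketch of that case follows. The `sample_clusters` helper, the `city` field, and the records are all made up for illustration.

```python
import random

def sample_clusters(records, cluster_key, n_clusters, seed=None):
    """Pick n_clusters whole clusters at random and return every record in them."""
    rng = random.Random(seed)
    cluster_ids = sorted({rec[cluster_key] for rec in records})
    picked = set(rng.sample(cluster_ids, n_clusters))
    return [rec for rec in records if rec[cluster_key] in picked]

# Hypothetical survey data: each record already carries its cluster label (a city)
records = [
    {"city": "A", "respondent": 1}, {"city": "A", "respondent": 2},
    {"city": "B", "respondent": 3}, {"city": "B", "respondent": 4},
    {"city": "C", "respondent": 5}, {"city": "C", "respondent": 6},
]
chosen = sample_clusters(records, "city", n_clusters=2, seed=42)
print(chosen)  # all respondents from two of the three cities
```

The randomness applies to the clusters, not to the rows: once a cluster is picked, every record in it is kept.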


## To Add:

- Bootstrap
- Right Censoring (at least)
- Systematic Random Sampling
