1. Selecting a sample requires less time than selecting every item in a population
2. Sample selection is a cost-efficient method
3. Analysis of the sample is less cumbersome and more practical than an analysis of the entire population

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

#### Probability Sampling:
In probability sampling, every element of the population has a **known, non-zero chance of being selected** (an equal chance in the case of simple random sampling). Probability sampling gives us the best chance to create a sample that is truly representative of the population.

#### Non-Probability Sampling:
In non-probability sampling, selection is not random, so **not every element has a known chance of being selected**. Consequently, there is a **significant risk** of ending up with a **non-representative sample** which does not produce generalizable results.
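
The difference can be illustrated with a minimal sketch. It assumes a hypothetical population of the integers 0 to 999 stored in sorted order, and uses "take the first 100 items" as a stand-in for a convenience (non-probability) sample. The seed and the sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical population: the integers 0..999, stored in sorted order
population = np.arange(1000)

# Probability sample: every element has the same chance of being picked
random_sample = rng.choice(population, size=100, replace=False)

# Non-probability (convenience) sample: just take whatever comes first
convenience_sample = population[:100]

print(population.mean())          # 499.5
print(random_sample.mean())       # varies with the seed, close to the population mean
print(convenience_sample.mean())  # 49.5, far from the population mean
```

The convenience sample only ever sees the smallest values, so its mean says little about the population as a whole.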

## A] Simple random sampling

### 1. With replacement
Sampling with replacement can be defined as random sampling that allows sampling units to occur more than once.

In [1]:
import numpy as np
np.random.seed(3)
# a: an int n means "sample from np.arange(n)", here the integers 0 to 11
# size: how many values to draw (12)
# replace=True: sample with replacement, so values can repeat
np.random.choice(a=12, size=12, replace=True)

array([10,  8,  9,  3,  8,  8,  0,  5,  3, 10, 11,  9])

In [2]:
# Import libraries
import numpy as np
import pandas as pd
# Load dataset
url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'
df = pd.read_csv(url)
# Selecting columns I am interested in
columns= ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','price']
df = df.loc[:, columns]
# Only want to use 15 rows of the dataset for illustrative purposes. 
df = df.head(15)
# Notice how we have 3 rows with the index label 8
df.sample(n = 15, replace = True, random_state=2)

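| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | price |
|---|---|---|---|---|---|---|
| 8 | 3 | 1.0 | 1780 | 7470 | 1.0 | 229500.0 |
| 13 | 3 | 1.75 | 1370 | 9680 | 1.0 | 400000.0 |
| 8 | 3 | 1.0 | 1780 | 7470 | 1.0 | 229500.0 |
| 6 | 3 | 2.25 | 1715 | 6819 | 2.0 | 257500.0 |
| 11 | 2 | 1.0 | 1160 | 6000 | 1.0 | 468000.0 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 180000.0 |
| 11 | 2 | 1.0 | 1160 | 6000 | 1.0 | 468000.0 |
| 8 | 3 | 1.0 | 1780 | 7470 | 1.0 | 229500.0 |
| 7 | 3 | 1.5 | 1060 | 9711 | 1.0 | 291850.0 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 180000.0 |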


![image.png](attachment:image.png)

#### How many duplicate samples/rows should you expect when sampling with replacement to create a bootstrapped dataset?

In [3]:
url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'
df = pd.read_csv(url)

columns= ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','price']
df = df.loc[:, columns]
# Generate a bootstrapped dataset: sample with replacement until it has
# the same number of rows as the original dataset.
# The share of original rows that appear will vary with random_state.
bootstrappedDataset = df.sample(frac = 1, replace = True, random_state = 2)

In [4]:
len(df)

21613

In [5]:
len(bootstrappedDataset.index.unique()) / len(df)

0.6317956785268126
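
A value near 0.632 is what a short calculation predicts. When drawing n rows with replacement from n rows, a given row is missed on every draw with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) for large n. So roughly 63.2% of the original rows are expected to appear at least once, and about 36.8% are left out. The sketch below checks this for n = 21613; the simulation uses integer row labels rather than the actual dataset, and its seed is arbitrary.

```python
import math
import numpy as np

n = 21613  # number of rows in the dataset above

# Probability that one particular row is never drawn in n draws with replacement
p_never_drawn = (1 - 1 / n) ** n

# Expected fraction of original rows that appear at least once
expected_unique_fraction = 1 - p_never_drawn

# Large-n limit: (1 - 1/n)**n approaches 1/e
limit = 1 - 1 / math.e

# Quick simulation on row labels 0..n-1 (arbitrary seed)
rng = np.random.default_rng(0)
draws = rng.integers(0, n, size=n)
simulated_unique_fraction = len(np.unique(draws)) / n

print(expected_unique_fraction, limit, simulated_unique_fraction)
```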

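### 2. Without replacement
Sampling without replacement is random sampling in which each sampling unit can be selected at most once.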

In [7]:
np.random.seed(3)
np.random.choice(a=12, size=12, replace=False)

array([ 5,  4,  1,  2, 11,  6,  7,  0,  3,  9,  8, 10])

In [8]:
np.random.seed(3)
# Asking for more values (20) than the population holds (12) fails without replacement
np.random.choice(a=12, size=20, replace=False)

ValueError: Cannot take a larger sample than population when 'replace=False'

#### Common use of sampling without replacement
Sampling without replacement is commonly used in model validation procedures such as train/test splitting and cross-validation.
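
As a rough illustration, a train/test split can be sketched as drawing the test rows without replacement and keeping the rest for training. The row count, test size, and seed below are hypothetical, and libraries such as scikit-learn provide ready-made helpers for this; the sketch only shows the sampling idea.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

n_rows = 10  # hypothetical dataset size
n_test = 3   # hypothetical number of rows held out for testing

all_idx = np.arange(n_rows)

# Draw test indices without replacement, so no row is picked twice
test_idx = rng.choice(all_idx, size=n_test, replace=False)

# Every row not chosen for testing goes to the training set
train_idx = np.setdiff1d(all_idx, test_idx)

print("test:", np.sort(test_idx))
print("train:", train_idx)
```

Because the draw is without replacement, the two index sets never overlap and together cover every row exactly once.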

## B] Stratified random sampling

Stratified sampling is mainly used when the population is heterogeneous, for example when comparing male and female subgroups.

Source: https://www.scribbr.com/methodology/stratified-sampling/

![image.png](attachment:image.png)

### When to use stratified sampling

### Step 1: Define your population and subgroups
Choose the characteristics to stratify by.
### Step 2: Separate the population into strata
### Step 3: Decide on the sample size for each stratum
Decide whether you want your sample to be proportionate or disproportionate.
##### Proportionate versus disproportionate sampling
In **proportionate sampling**, the sample size of each stratum is proportional to the subgroup's share of the population as a whole.

In **disproportionate sampling**, the sample size of each stratum is not proportional to its share of the population as a whole.
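
The two approaches can be sketched with pandas, assuming a hypothetical population of 80 male and 20 female rows. The column names, group sizes, sampling fraction, and seed are illustrative assumptions, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical population: 80 "male" rows and 20 "female" rows
population = pd.DataFrame({
    "gender": ["male"] * 80 + ["female"] * 20,
    "score": range(100),
})

# Proportionate: draw the same fraction (10%) from every stratum,
# so the sample keeps the 80/20 split of the population
proportionate = population.groupby("gender").sample(frac=0.1, random_state=0)

# Disproportionate: draw the same fixed count from every stratum,
# so the smaller stratum is over-represented relative to its share
disproportionate = population.groupby("gender").sample(n=5, random_state=0)

print(proportionate["gender"].value_counts())
print(disproportionate["gender"].value_counts())
```

The proportionate sample contains 8 male and 2 female rows, while the disproportionate sample contains 5 of each.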

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)