# <font color=green>Sampling</font>
***

## <font color=green>Population and sampling</font>
***

### Population
The set of all elements of interest in a study. A population can be made up of many kinds of elements, for example: people, ages, heights, cars, etc.

Regarding size, populations can be limited (finite populations) or unlimited (infinite populations).

### Finite populations

Populations whose elements can be counted. Examples include the employees of a company, the students in a school, etc.

### Infinite populations

Populations whose elements cannot be counted. Examples include the portions of sea water that can be extracted for analysis, the temperature measured at each point in a territory, etc.

<font color = red> When the elements of a population can be counted but are very numerous, the population is treated as infinite. </font>


### Sample
A representative subset of the population.

The numerical attributes of a population, such as its mean, variance and standard deviation, are known as **parameters**. The main goal of statistical inference is to estimate population parameters and test hypotheses about them using information from a sample.
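To make the distinction between a parameter and its estimate concrete, here is a minimal sketch. It uses a synthetic, hypothetical population of incomes generated with NumPy (not the notebook's dataset); the names `population`, `mu`, `sigma`, `x_bar` and `s` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 10,000 incomes (synthetic data, for illustration only)
rng = np.random.default_rng(seed=42)
population = pd.Series(rng.normal(loc=2000, scale=500, size=10_000), name="Income")

# Parameters: computed from every element of the population
mu = population.mean()
sigma = population.std(ddof=0)  # population standard deviation (divides by N)

# Estimates: computed from a sample of 500 elements
sample = population.sample(n=500, random_state=42)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (divides by n - 1)

print(f"Population mean: {mu:.2f} | sample mean: {x_bar:.2f}")
print(f"Population std:  {sigma:.2f} | sample std:  {s:.2f}")
```

The sample values should land close to the population values without matching them exactly; that gap is the sampling error that inference has to account for.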

## <font color = green> When to use a sample? </font>
***

### Infinite populations

The study would never come to an end. It is not possible to investigate all elements of the population.

### Destructive testing

Studies in which the evaluated elements are completely consumed or destroyed. Examples: product lifetime tests, car crash safety tests.

### Quick results

Research whose results must be published quickly. Examples: opinion polls, surveys on public health problems.

### High costs

When the population is finite but very large, the cost of a census can make the process unfeasible.

## <font color = green> Simple Random Sampling </font>
***

Simple random sampling is one of the main ways to extract a sample from a population. Its fundamental requirement is that every element of the population has the same chance of being selected for the sample.

In [2]:
import pandas as pd

data = pd.read_csv('data/data.csv', sep = ',')
len(data)

76840

In [4]:
data['Income'].mean()

2000.3831988547631

In [10]:
sample = data.sample(n = 1000, random_state=101)

In [11]:
len(sample)

1000

In [12]:
sample['Income'].mean()

1998.783

In [8]:
data['Sex'].value_counts(normalize = True)

0    0.692998
1    0.307002
Name: Sex, dtype: float64

In [13]:
sample['Sex'].value_counts(normalize = True)

0    0.706
1    0.294
Name: Sex, dtype: float64

## <font color = green> Stratified Sampling </font>
***

Stratified sampling is a refinement of simple random sampling. The population is first divided into subgroups (strata) of elements with similar characteristics, so that each subgroup is more homogeneous than the population as a whole. Simple random sampling is then applied within each subgroup separately.
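A minimal sketch of the idea, assuming a synthetic, hypothetical population with a `Sex` column used as the stratification variable and proportional allocation (the same fraction drawn from every stratum). It relies on pandas' `DataFrameGroupBy.sample`, available from pandas 1.1 onward.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 7,000 elements in stratum 0 and 3,000 in stratum 1
rng = np.random.default_rng(seed=101)
population = pd.DataFrame({
    "Sex": np.repeat([0, 1], [7000, 3000]),
    "Income": rng.normal(loc=2000, scale=500, size=10_000),
})

# Proportional allocation: simple random sample of 10% within each stratum
stratified_sample = population.groupby("Sex").sample(frac=0.10, random_state=101)

# The strata proportions in the sample match those of the population
print(population["Sex"].value_counts(normalize=True))
print(stratified_sample["Sex"].value_counts(normalize=True))
```

Because each stratum is sampled separately, the sample reproduces the population's 70/30 split exactly, which a single simple random sample only approximates.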

## <font color = green> Cluster Sampling </font>
***

Cluster sampling also aims to improve on simple random sampling. Subgroups are created here as well, but unlike the strata of stratified sampling they are not homogeneous: each subgroup (cluster) is internally heterogeneous. Simple or stratified random sampling is then applied to these subgroups.

A very common application of this technique is dividing the population into territorial groups, within which the investigated elements have quite varied characteristics.
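One way to sketch cluster sampling is as a two-stage procedure: first draw a simple random sample of clusters, then draw a simple random sample inside each selected cluster. The example below assumes a synthetic, hypothetical population with a `Region` column standing in for territorial groups; the number of clusters chosen (5) and the within-cluster fraction (20%) are arbitrary illustrative values.

```python
import numpy as np
import pandas as pd

# Hypothetical population spread over 20 territorial clusters (regions 0..19)
rng = np.random.default_rng(seed=7)
population = pd.DataFrame({
    "Region": rng.integers(low=0, high=20, size=10_000),
    "Income": rng.normal(loc=2000, scale=500, size=10_000),
})

# Stage 1: simple random sample of 5 clusters, without replacement
all_regions = population["Region"].unique()
chosen_regions = rng.choice(all_regions, size=5, replace=False)

# Stage 2: simple random sample of 20% of the elements in each chosen cluster
in_chosen = population[population["Region"].isin(chosen_regions)]
cluster_sample = in_chosen.groupby("Region").sample(frac=0.20, random_state=7)

print(sorted(chosen_regions))
print(cluster_sample["Region"].value_counts())
```

Only the selected regions need to be visited, which is what makes this design attractive when the population is geographically dispersed.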