# Theory Sprint 5. SAMPLING METHODS
SOURCES: https://catalogofbias.org/  <br>
https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c

## - TYPES ---------------------------------------------------------------------------------
### A) Probability sampling
The sample is chosen by random selection, so every member of the population has a known chance of being picked.
#### <b><span class="mark">1. Simple random sampling</span></b><br>
We pick the sample from the population entirely at random, with no selection criteria and no prior sorting, so every member has the same chance of being chosen.
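
A minimal sketch of simple random sampling using only Python's standard library. The population of numbered IDs and the helper name `simple_random_sample` are made up for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member is equally likely to be picked."""
    rng = random.Random(seed)  # the seed only makes the example repeatable
    return rng.sample(list(population), k)

population = list(range(1, 101))  # hypothetical population: IDs 1..100
sample = simple_random_sample(population, k=10, seed=42)
print(sample)
```

`random.sample` draws without replacement, so nobody can appear twice in the sample.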

#### <b><span class="mark">2. Systematic sampling</span></b><br>
Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.
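
A small sketch of systematic sampling. It assumes the list is in no meaningful order and uses a random starting point inside the first interval (a common convention, not something stated above). The helper name is hypothetical.

```python
import random

def systematic_sample(population, k, seed=None):
    """Take every `step`-th member after a random start in the first interval."""
    items = list(population)
    step = len(items) // k            # sampling interval
    if step < 1:
        raise ValueError("sample size is larger than the population")
    rng = random.Random(seed)
    start = rng.randrange(step)       # random start: 0 .. step - 1
    return items[start::step][:k]     # regular intervals, trimmed to k members

picks = systematic_sample(range(1, 101), k=10, seed=5)
print(picks)
```

If the list has a hidden periodic pattern that matches the interval, the sample can be biased, which is why the ordering assumption matters.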

#### 3. Stratified sampling
You divide the population into subgroups (called strata) based on the relevant characteristic (e.g. gender, age range, income bracket, job role). Then you use random or systematic sampling to select a sample from each subgroup.
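
A sketch of proportional stratified sampling, assuming we want the same fraction from every stratum. The employee records, the `role` field and the helper name are invented for the example.

```python
import random
from collections import defaultdict

def stratified_sample(units, key, fraction, seed=None):
    """Sample the same fraction from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[key(unit)].append(unit)          # split the population into strata
    sample = []
    for label in sorted(strata):                # fixed order keeps runs repeatable
        members = strata[label]
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, n))   # simple random sample inside it
    return sample

# Hypothetical population: 80 engineers and 20 managers
employees = [{"id": i, "role": "engineer" if i < 80 else "manager"} for i in range(100)]
picked = stratified_sample(employees, key=lambda e: e["role"], fraction=0.1, seed=0)
print(len(picked))
```

With a 10% fraction the sample keeps the 80/20 split of the population, which a plain random sample of 10 would not guarantee.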

#### 4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but instead of sampling individuals from each subgroup, you randomly select entire subgroups. Example: The company has offices in 10 cities across the country (all with roughly the same number of employees in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random sampling to select 3 offices – these are your clusters.
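
The office example above could be sketched like this. The city names, employee labels and the helper `cluster_sample` are hypothetical, and the sketch assumes we survey everyone in each chosen office (one-stage cluster sampling).

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # pick cluster names
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical company: 10 offices with 50 employees each
offices = {f"city_{c}": [f"city_{c}_emp_{i}" for i in range(50)] for c in range(10)}
chosen, sample = cluster_sample(offices, n_clusters=3, seed=1)
print(chosen, len(sample))
```

Note the difference from stratified sampling: here the randomness is in which groups we visit, not in who we pick inside each group.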

### B) Non-probability sampling
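The sample is not chosen at random but according to some non-random criterion, such as convenience or the researcher's judgement, so not every individual has a chance of being included.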
#### 1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible to the researcher.

#### 2. Voluntary response sampling
Similar to a convenience sample, but instead of the researcher choosing participants and directly contacting them, people volunteer themselves.

#### 3. Purposive (judgement) sampling
The researcher uses their own judgement to select the sample that is most useful for the purposes of the research. Example: you want to know more about the opinions and experiences of disabled students at your university, so you <b>purposefully select</b> a number of students with different support needs in order to gather a varied range of data on their experiences with student services.

#### 4. Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other participants.
Example: you are researching experiences of homelessness in your city. Since there is no list of all homeless people in the city, probability sampling isn’t possible. You meet one person who agrees to participate in the research, and she puts you in contact with other homeless people that she knows in the area.

### C) OTHER
#### 1. Undersampling and Oversampling
Resampling consists of removing samples from the majority class (<b>under-sampling</b>) and/or adding more examples from the minority class (<b>over-sampling</b>).
##### a) Undersampling and Oversampling using simple random sampling
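
A minimal sketch of both ideas with the standard library only. It assumes each row carries its class label and that we want perfectly balanced classes. The toy rows and helper names are made up.

```python
import random
from collections import Counter

def group_by_class(rows, label_of):
    """Collect the rows of each class in a dict: label -> list of rows."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups

def random_undersample(rows, label_of, seed=None):
    """Shrink every class to the size of the smallest one."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = min(len(members) for members in groups.values())
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, target))    # without replacement
    return kept

def random_oversample(rows, label_of, seed=None):
    """Grow every class to the size of the largest one by duplicating rows."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(members) for members in groups.values())
    grown = []
    for members in groups.values():
        extra = rng.choices(members, k=target - len(members))  # with replacement
        grown.extend(members + extra)
    return grown

# Hypothetical imbalanced dataset: 90 majority rows, 10 minority rows
rows = [(i, "majority") for i in range(90)] + [(i, "minority") for i in range(90, 100)]
label_of = lambda row: row[1]
under = random_undersample(rows, label_of, seed=0)
over = random_oversample(rows, label_of, seed=0)
print(Counter(map(label_of, under)))
print(Counter(map(label_of, over)))
```

Undersampling throws information away, while oversampling only repeats rows that already exist, so neither adds new knowledge about the minority class.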

##### b) Undersampling and Oversampling using IMBLEARN
imbalanced-learn (imported as `imblearn`) is a Python package for dealing with imbalanced datasets.
<br>Further info: https://github.com/scikit-learn-contrib/imbalanced-learn#id3

- <b>Undersampling using TOMEK LINKS</b><br>
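One of the methods it provides is called Tomek links. Tomek links are pairs of examples of opposite classes that are each other's nearest neighbours. Removing the majority-class example of each pair cleans up the border between the classes.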

- <b><span class="mark">Synthetic Minority Oversampling Technique (SMOTE)</span></b><br>
When a minority class is likely to go unnoticed in a statistical analysis, we may want to give it more weight so that it counts in the conclusions we draw. The SMOTE method creates synthetic ('fake') observations similar to the real minority ones, so the minority is no longer so small in the analysis.
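
To make both ideas concrete, here is a from-scratch sketch in plain Python. It is not imblearn's implementation: the real package exposes these techniques as `TomekLinks` and `SMOTE` with a `fit_resample` method. The toy points, the helper names and the choice of `k=2` neighbours are assumptions for illustration.

```python
import math
import random

def nearest(i, X):
    """Index of the point closest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j) of opposite classes that are each other's nearest neighbour."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if i < j and y[i] != y[j] and nearest(j, X) == i:
            links.append((i, j))
    return links

def smote_like(minority, n_new, k=2, seed=None):
    """Make n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()                       # position along the segment
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy 2-D data: class 0 is the majority, class 1 the minority
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.4), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.2, 3.1)]
y = [0, 0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
to_drop = {i for pair in links for i in pair if y[i] == 0}  # majority side of each link
minority = [x for x, label in zip(X, y) if label == 1]
new_points = smote_like(minority, n_new=4, seed=0)
print(links, to_drop, new_points)
```

In this toy data the only Tomek link is the pair of points at (1.0, 1.0) and (1.1, 1.0), which sit on opposite sides of the class border. Every synthetic point lies between two real minority points, so it stays inside the region the minority already occupies.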

#### <b><span class="mark">2. Reservoir sampling</span></b><br>
Used in data mining to obtain a sample of size n from a data stream of unknown length in a single pass, giving every item the same chance of ending up in the sample.
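
One classic way to do this is the algorithm usually called "Algorithm R". The sketch below assumes the stream can be read only once. The event names are hypothetical.

```python
import random

def reservoir_sample(stream, n, seed=None):
    """Keep a uniform random sample of n items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            reservoir.append(item)        # fill the reservoir with the first n items
        else:
            j = rng.randint(0, i)         # uniform over 0..i, both ends included
            if j < n:
                reservoir[j] = item       # replace with probability n / (i + 1)
    return reservoir

stream = (f"event_{i}" for i in range(10_000))  # generator: length not known upfront
picked = reservoir_sample(stream, n=5, seed=7)
print(picked)
```

Memory use stays at n items no matter how long the stream is, which is the whole point of the technique.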

## - METHODS ---------------------------------------------------------------------------------
#### Step 1: Was the study sampled from individuals?
If so, we are probably talking about <b>simple random sampling</b> or <b>systematic sampling</b>.

#### Step 2: Was the study sampled from groups? 
If yes, continue with Step 3.

#### Step 3: Does the study have data about the individuals or not?
If yes, we are probably looking at <b>stratified sampling</b>. Otherwise, we may only have information about the groups (such as schools or towns) that the samples were picked from. In that case we are talking about <b>cluster sampling</b>.

#### Step 4: Was the sample easy to get?
In that case the <b>convenience sampling</b> method may have been used.

## - SAMPLING ERROR ---------------------------------------------------------------------------------
It’s the difference between the statistic you measure in the sample and the parameter you would find if you took a census of the entire population. Since studying the whole population is often impossible because it's too big, some sampling error is always expected. As a rule of thumb, a margin of error below about 3% is usually considered acceptable. <br>To reduce sampling error, increase the size of the sample. That's why surveys normally use samples in the thousands. <br><br>A statistical study can also have an error that makes the sample differ too much from the population without it being a sampling error. In this case we talk about <b>non-sampling error</b>. <br>This is due to poor data collection methods (like faulty instruments or inaccurate data recording), selection bias, non-response bias (where individuals don’t want to or can’t respond to a survey), or other mistakes in collecting the data. Increasing the sample size will not reduce these errors. The key is to avoid making the errors in the first place with a well-planned design for the survey or experiment.
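
To see how sample size drives sampling error, here is a rough sketch using the standard margin-of-error formula for a proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst case p = 0.5. None of these assumptions come from the notes above.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1000, 2000):
    print(n, f"{100 * margin_of_error(n):.1f}%")
```

Under these assumptions a sample of about 1,000 people gives a margin of error close to 3%, and quadrupling the sample size only halves the error.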

<br><br>Let's dive a bit deeper into the most common biases that can lead to sampling and non-sampling errors:



## - BIAS CATALOG --------------------------------------------------------------------------------------
### A) SELECTION BIAS
Occurs when the individuals or groups selected for a study differ systematically from the population of interest from the outset, leading to a systematic error in the outcome. It is an error in selecting the sample: the sample does not represent the population well, so some groups may be overrepresented or underrepresented.<br>
#### 1. COLLIDER BIAS
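Mistaking coincidence for causality. If A causes X, and B also causes X, are A and B related? Not necessarily: X is a 'collider', and selecting or controlling on it can make A and B look associated even when they are not. We'd better dig a bit deeper into such a study if we want to avoid collider bias.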

#### 2. CONFOUNDING BIAS
Similar to collider bias, confounding bias is a distortion of the association between an exposure and an outcome that occurs because a third factor is independently associated with both the exposure and the outcome. This time X causes both A and B, so the wrong assumption is that A and B are causally related, when their association is really due to X.

#### 3. ASCERTAINMENT BIAS
Systematic differences in the identification of individuals included in a study or distortion in the collection of data in a study.

#### 4. ATTRITION BIAS
Unequal loss of participants from study groups in a trial.

### B) CONDUCT BIAS
#### 1. HAWTHORNE EFFECT
When individuals modify an aspect of their behaviour in response to their awareness of being observed. Preventive steps: Studies using hidden observation can help avoid the Hawthorne effect.

#### 2. RECALL BIAS
Systematic error due to differences in accuracy or completeness of recall to memory of past events or experiences.

#### 3. AVAILABILITY BIAS
A distortion that arises from the use of information which is most readily available, rather than that which is necessarily most representative.

### C) CONCEPTUALIZATION BIAS
#### 1. IMMORTAL TIME BIAS
A distortion that modifies an association between an exposure and an outcome, caused when a cohort study is designed so that follow-up includes a period of time where participants in the exposed group cannot experience the outcome and are essentially ‘immortal’.

### D) REPORTING BIAS
#### 1. INFORMATION BIAS
Bias that arises from systematic differences in the collection, recall, recording or handling of information used in a study. 


## - STATISTICAL STUDIES -----------------------------------------------------------------------------
### 1. Sample study
You draw a sample from a population to estimate one of its parameters.
### 2. Observational study
You check whether two variables in a dataset are correlated, without intervening. Don't mistake correlation for causality.
### 3. Experiment 
You are trying to show causality between two variables of a sample/population (namely, whether the <b>explanatory variable</b> triggers a specific <b>response variable</b>), and you do so by dividing the experiment sample into two groups: the <b>control group</b> and the <b>treatment group</b>. You apply the treatment to the treatment group and leave the control group under 'normal' conditions. Then you compare the results. <br>
<b>Experimental units</b> are the subjects being tested in the experiment (people, animals, bacteria, etc.)
<br><b>Matched pairs design</b> can mean running the experiment twice, switching the control group and the treatment group, in order to dilute any bias.
<br>In a <b>blind experiment</b> the experimental units don't know whether they are getting the medicine or the placebo. In a <b>double-blind experiment</b> neither the experimental units nor the person administering the pills knows whether a pill is placebo or medicine. Finally, in a <b>triple-blind experiment</b> not even the analysts of the experiment can tell placebo from medicine.
#### - Types of experiment design:
<b>1. Completely randomized: </b>units are assigned to the groups by simple random selection<br>
<b>2. Matched pairs: </b>grouping the sample into pairs of similar units first, then randomly sending one member of each pair to the control group and the other to the treatment group<br>
<b>3. Randomized block: </b>splitting the units into blocks, as in stratified sampling, and randomizing within each block
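
The three designs differ only in how units are assigned to the control and treatment groups. A rough sketch, with made-up unit IDs, an invented blocking rule and hypothetical function names:

```python
import random

def split_in_half(units, rng):
    """Shuffle the units and cut the list in two: (control, treatment)."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def completely_randomized(units, seed=None):
    """Assign all units to the two groups purely at random."""
    return split_in_half(units, random.Random(seed))

def matched_pairs(pairs, seed=None):
    """For each pair of similar units, a coin flip decides who gets the treatment."""
    rng = random.Random(seed)
    control, treatment = [], []
    for a, b in pairs:
        if rng.random() < 0.5:
            a, b = b, a
        control.append(a)
        treatment.append(b)
    return control, treatment

def randomized_block(units, block_of, seed=None):
    """Group units into blocks, then randomize separately inside each block."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    control, treatment = [], []
    for members in blocks.values():
        c, t = split_in_half(members, rng)
        control.extend(c)
        treatment.extend(t)
    return control, treatment

units = list(range(20))                                  # hypothetical unit IDs
cr_control, cr_treatment = completely_randomized(units, seed=3)
mp_control, mp_treatment = matched_pairs([(0, 1), (2, 3), (4, 5)], seed=3)
rb_control, rb_treatment = randomized_block(units, block_of=lambda u: u % 2, seed=3)
print(cr_control, mp_control, rb_control)
```

In the blocked version each block contributes equally to both groups, so the blocking variable cannot end up unevenly spread between control and treatment.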

## - SAMPLING PROCESS -----------------------------------------------------------------------------
### Step 1. Identify and define target population
Example (an election poll): only people aged 18 or over who are eligible to vote.
### Step 2. Select sampling frame
The voter list, that is, the list of everyone registered to vote.
### Step 3. Choose sampling methods
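Treat every vote as worth the same, so every voter should have the same chance of being selected. The probability of picking a supporter of a given party is then proportional to the number of votes that party has.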
### Step 4. Determine sample size
Out of the whole population, how many people will we take as the sample? (Not so many that collecting the data becomes cumbersome, but not so few that the conclusions can't be accurately applied to the population.)
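
One common way to pick the number is to start from the margin of error we can tolerate. The sketch below assumes a simple random sample, a 95% confidence level (z = 1.96), the worst case p = 0.5, and an optional finite population correction. The function name and the figures are illustrative only.

```python
import math

def required_sample_size(margin, p=0.5, z=1.96, population=None):
    """Smallest sample size whose margin of error is at most `margin`."""
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)   # finite population correction
    return math.ceil(n)

print(required_sample_size(0.03))                     # very large population
print(required_sample_size(0.03, population=10_000))  # small town of 10,000 voters
```

Under these assumptions a 3% margin needs a little over a thousand respondents, and fewer when the population itself is small.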
### Step 5. Collect the required Data
Using the most appropriate sampling technique.