### Data Sampling Techniques

![Probability sampling techniques](https://minerandodados.com.br/wp-content/uploads/2020/05/probability-sampling-1.png)

### Simple Random Sampling

A fixed number of elements is drawn at random from the population, so every element has the same chance of being selected.

In [None]:
import pandas as pd

Loading the dataset.

In [None]:
df = pd.read_csv("covid19.csv")

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50982 entries, 0 to 50981
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  50982 non-null  int64  
 2   age                 50982 non-null  object 
 3   sex                 50982 non-null  object 
 4   health_region       50982 non-null  object 
 5   province            50982 non-null  object 
 6   country             50982 non-null  object 
 7   date_report         50982 non-null  object 
 8   report_week         50982 non-null  object 
 9   has_travel_history  1150 non-null   object 
 10  locally_acquired    574 non-null    object 
 11  case_source         50982 non-null  object 
dtypes: float64(1), int64(1), object(10)
memory usage: 4.7+ MB


In [None]:
df.head()

Unnamed: 0,case_id,provincial_case_id,age,sex,health_region,province,country,date_report,report_week,has_travel_history,locally_acquired,case_source
0,,1,50-59,Male,Toronto,Ontario,Canada,2020-01-25,2020-01-19,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
1,,2,50-59,Female,Toronto,Ontario,Canada,2020-01-27,2020-01-26,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
2,,1,40-49,Male,Vancouver Coastal,BC,Canada,2020-01-28,2020-01-26,t,,https://news.gov.bc.ca/releases/2020HLTH0015-0...
3,,3,20-29,Female,Middlesex-London,Ontario,Canada,2020-01-31,2020-01-26,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
4,,2,50-59,Female,Vancouver Coastal,BC,Canada,2020-02-04,2020-02-02,f,Close Contact,https://news.gov.bc.ca/releases/2020HLTH0023-0...


Creating a sample of only 1000 records from the dataset.


In [None]:
df_sample = df.sample(n=1000)

In [None]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 40013 to 29170
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  1000 non-null   int64  
 2   age                 1000 non-null   object 
 3   sex                 1000 non-null   object 
 4   health_region       1000 non-null   object 
 5   province            1000 non-null   object 
 6   country             1000 non-null   object 
 7   date_report         1000 non-null   object 
 8   report_week         1000 non-null   object 
 9   has_travel_history  14 non-null     object 
 10  locally_acquired    8 non-null      object 
 11  case_source         1000 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 101.6+ KB
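
Because `sample` draws different rows on every run, the index range shown above changes between executions. A minimal sketch of one way to make the draw reproducible, using the `random_state` parameter of `DataFrame.sample`; the frame `df_demo` is a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 50 rows with a single column.
df_demo = pd.DataFrame({"provincial_case_id": range(1, 51)})

# Passing the same random_state yields the same rows on every call.
sample_a = df_demo.sample(n=10, random_state=42)
sample_b = df_demo.sample(n=10, random_state=42)

print(sample_a.index.equals(sample_b.index))  # True
print(len(sample_a))                          # 10
```

Omitting `random_state` restores the non-deterministic behavior used in the cells above.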


Specifying the sample size as a fraction of the dataset.

In [None]:
df_sample = df.sample(frac=0.10)

In [None]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5098 entries, 8526 to 22436
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  5098 non-null   int64  
 2   age                 5098 non-null   object 
 3   sex                 5098 non-null   object 
 4   health_region       5098 non-null   object 
 5   province            5098 non-null   object 
 6   country             5098 non-null   object 
 7   date_report         5098 non-null   object 
 8   report_week         5098 non-null   object 
 9   has_travel_history  104 non-null    object 
 10  locally_acquired    55 non-null     object 
 11  case_source         5098 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 517.8+ KB
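
The 5098 rows above follow from the fraction: pandas rounds `frac * len(df)` to the nearest integer, and 0.10 × 50982 = 5098.2. A small sketch that checks this arithmetic on a synthetic frame of the same length (`df_demo` is a hypothetical stand-in, not the real data):

```python
import pandas as pd

# Hypothetical frame with the same number of rows as the real dataset.
df_demo = pd.DataFrame({"row": range(50982)})

frac = 0.10
expected_size = round(frac * len(df_demo))  # 5098.2 rounds to 5098

df_frac = df_demo.sample(frac=frac, random_state=0)
print(expected_size, len(df_frac))  # 5098 5098
```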


### Stratified Random Sampling

Importing the `train_test_split` function to perform the sampling.

In [None]:
from sklearn.model_selection import train_test_split

Counting the records per province.

In [None]:
df['province'].value_counts()

Quebec           25757
Ontario          16337
Alberta           4850
BC                2053
Nova Scotia        915
Saskatchewan       366
Manitoba           272
NL                 258
New Brunswick      118
PEI                 27
Repatriated         13
Yukon               11
NWT                  5
Name: province, dtype: int64

Generating the stratified sample.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('province',axis=1),
                                                    df['province'],
                                                    stratify=df['province'],
                                                    test_size=0.20)

Checking the shape of the data.

In [None]:
y_test.shape

(10197,)

Checking the record counts per province.

In [None]:
y_test.value_counts()

Quebec           5152
Ontario          3267
Alberta           970
BC                411
Nova Scotia       183
Saskatchewan       73
Manitoba           54
NL                 52
New Brunswick      24
PEI                 5
Repatriated         3
Yukon               2
NWT                 1
Name: province, dtype: int64
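
Stratification keeps each province's share roughly the same in the sample as in the full data: Quebec is 25757/50982 ≈ 50.5% of the population and 5152/10197 ≈ 50.5% of `y_test`. As an illustration of the same idea without scikit-learn, here is a sketch that uses pandas' `groupby(...).sample` (available in pandas 1.1 and later, and a different technique from the `train_test_split` call above) on a synthetic frame. The names `df_demo`, `stratified`, and `comparison`, as well as the stratum sizes, are invented for the example:

```python
import pandas as pd

# Hypothetical population with three unevenly sized strata.
df_demo = pd.DataFrame(
    {"province": ["Quebec"] * 800 + ["Ontario"] * 150 + ["Alberta"] * 50}
)

# Draw 20% of the rows from each province separately.
stratified = df_demo.groupby("province").sample(frac=0.20, random_state=42)

# Compare each stratum's share in the population and in the sample.
comparison = pd.DataFrame(
    {
        "population": df_demo["province"].value_counts(normalize=True),
        "sample": stratified["province"].value_counts(normalize=True),
    }
)
print(comparison)
```

The two columns should agree, since every stratum is sampled at the same rate.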

### Systematic Sampling

Generating the random seed, which is used below as the sampling interval.

In [None]:
import numpy as np

In [None]:
semente = np.random.randint(1, 11, size=1) # Random integer from 1 to 10 (inclusive); 0 is excluded because it is not a valid step

In [None]:
semente

array([2])

Generating indices from the seed.

In [None]:
indices = np.arange(0,100,semente)

In [None]:
indices

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])

In [None]:
len(indices)

50

Generating the sample from the indices.

In [None]:
amostra = df.loc[indices,:] # Select all columns for the rows whose index labels are in 'indices'

Checking the sample data.

In [None]:
amostra

Unnamed: 0,case_id,provincial_case_id,age,sex,health_region,province,country,date_report,report_week,has_travel_history,locally_acquired,case_source
0,,1,50-59,Male,Toronto,Ontario,Canada,2020-01-25,2020-01-19,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
2,,1,40-49,Male,Vancouver Coastal,BC,Canada,2020-01-28,2020-01-26,t,,https://news.gov.bc.ca/releases/2020HLTH0015-0...
4,,2,50-59,Female,Vancouver Coastal,BC,Canada,2020-02-04,2020-02-02,f,Close Contact,https://news.gov.bc.ca/releases/2020HLTH0023-0...
6,,4,30-39,Female,Vancouver Coastal,BC,Canada,2020-02-06,2020-02-02,t,,https://news.gov.bc.ca/releases/2020HLTH0025-0...
8,,6,30-39,Female,Fraser,BC,Canada,2020-02-20,2020-02-16,t,,(1) https://news.gov.bc.ca/releases/2020HLTH00...
10,,5,60-69,Female,Toronto,Ontario,Canada,2020-02-26,2020-02-23,t,,(1) https://news.ontario.ca/mohltc/en/2020/02/...
12,,6,60-69,Male,Toronto,Ontario,Canada,2020-02-27,2020-02-23,f,Close Contact,(1) https://news.ontario.ca/mohltc/en/2020/02/...
14,,7,50-59,Male,Toronto,Ontario,Canada,2020-02-28,2020-02-23,t,,https://news.ontario.ca/mohltc/en/2020/02/onta...
16,,9,30-39,Female,York,Ontario,Canada,2020-02-29,2020-02-23,t,,https://news.ontario.ca/mohltc/en/2020/02/onta...
18,,11,60-69,Male,Durham,Ontario,Canada,2020-02-29,2020-02-23,f,Close Contact,https://news.ontario.ca/mohltc/en/2020/02/onta...
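
`df.loc` selects by index label, which matches row position here only because `df` still has its default `RangeIndex`. A short sketch of position-based selection with `iloc`, on a frame whose labels no longer match positions (`df_demo` is a hypothetical example):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels (10, 20, ...) differ from row positions.
df_demo = pd.DataFrame({"value": list("abcdef")}, index=[10, 20, 30, 40, 50, 60])

positions = np.arange(0, len(df_demo), 2)  # every 2nd row position: 0, 2, 4

# iloc is position-based, so it works regardless of the index labels.
by_position = df_demo.iloc[positions]
print(by_position["value"].tolist())  # ['a', 'c', 'e']
```

On this frame, `df_demo.loc[positions]` would raise a `KeyError`, because the labels 0, 2, and 4 do not exist in the index.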


Counting the records.

In [None]:
amostra.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 98
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  50 non-null     int64  
 2   age                 50 non-null     object 
 3   sex                 50 non-null     object 
 4   health_region       50 non-null     object 
 5   province            50 non-null     object 
 6   country             50 non-null     object 
 7   date_report         50 non-null     object 
 8   report_week         50 non-null     object 
 9   has_travel_history  50 non-null     object 
 10  locally_acquired    10 non-null     object 
 11  case_source         50 non-null     object 
dtypes: float64(1), int64(1), object(10)
memory usage: 5.1+ KB
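
In the usual textbook formulation of systematic sampling, the interval is fixed as k = N // n (population size over desired sample size) and only the starting position is random, whereas the cells above draw the interval itself at random and always start at row 0. A minimal sketch of that formulation, assuming NumPy's `default_rng`; the function name `systematic_sample`, its parameters, and the frame `df_demo` are hypothetical:

```python
import numpy as np
import pandas as pd


def systematic_sample(data, n, seed=None):
    """Return n rows taken every k-th position after a random start."""
    if not 0 < n <= len(data):
        raise ValueError("n must be between 1 and len(data)")
    k = len(data) // n                      # sampling interval
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, k))         # random start in [0, k)
    positions = start + k * np.arange(n)    # start, start + k, start + 2k, ...
    return data.iloc[positions]


# Hypothetical population of 100 rows; take a sample of 20 (so k = 5).
df_demo = pd.DataFrame({"row": range(100)})
amostra_demo = systematic_sample(df_demo, n=20, seed=0)

print(len(amostra_demo))                                  # 20
print(np.diff(amostra_demo.index.to_numpy()).tolist()[:3])  # [5, 5, 5]
```

Fixing the interval gives direct control over the sample size, which a randomly drawn step does not.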
