# <font color=green>Knowing the data</font>
***

## <font color=green>Project dataset</font>
***

### National Household Sample Survey - 2015

The <b>National Household Sample Survey (PNAD)</b> investigates annually, on a permanent basis, general characteristics of the population, education, work, income and housing. Other topics, such as migration, fertility, nuptiality, health and food security, are covered at variable frequency, according to the country's information needs. Over the 49 years the survey has been conducted, these statistics have become an important instrument for formulating, validating and evaluating policies aimed at socioeconomic development and the improvement of living conditions in Brazil.

### Data source

https://ww2.ibge.gov.br/home/estatistica/populacao/trabalhoerendimento/pnad2015/microdados.shtm

### Used variables

> ### Income
> ***

Monthly income from the main job for persons aged 10 and over.

> ### Age
> ***

Resident's age on the reference date in years.

> ### Height (own elaboration)
> ***

Resident's height in meters.

> ### UF
> ***

|Code|Description|
|---|---|
|11|Rondônia|
|12|Acre|
|13|Amazonas|
|14|Roraima|
|15|Pará|
|16|Amapá|
|17|Tocantins|
|21|Maranhão|
|22|Piauí|
|23|Ceará|
|24|Rio Grande do Norte|
|25|Paraíba|
|26|Pernambuco|
|27|Alagoas|
|28|Sergipe|
|29|Bahia|
|31|Minas Gerais|
|32|Espírito Santo|
|33|Rio de Janeiro|
|35|São Paulo|
|41|Paraná|
|42|Santa Catarina|
|43|Rio Grande do Sul|
|50|Mato Grosso do Sul|
|51|Mato Grosso|
|52|Goiás|
|53|Distrito Federal|

> ### Sex
> ***

|Code|Description|
|---|---|
|0|Male|
|1|Female|

> ### Years of study
> ***

|Code|Description|
|---|---|
|1|No schooling or less than 1 year|
|2|1 year|
|3|2 years|
|4|3 years|
|5|4 years|
|6|5 years|
|7|6 years|
|8|7 years|
|9|8 years|
|10|9 years|
|11|10 years|
|12|11 years|
|13|12 years|
|14|13 years|
|15|14 years|
|16|15 years or more|
|17|Not determined|
| |Not applicable|

> ### Color
> ***

|Code|Description|
|---|---|
|0|Indigenous|
|2|White|
|4|Black|
|6|Yellow|
|8|Brown (parda)|
|9|Not declared|

#### <font color = 'red'> Observation </font>
***
> The following treatments were performed on the original data:
> 1. Records where <b> Income </b> was invalid (999 999 999 999) were eliminated;
> 2. Records where <b> Income </b> was missing were eliminated;
> 3. Only the records of the <b> Reference Persons </b> of each household (responsible for the household) were considered.
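
The three treatments listed above can be sketched with pandas boolean masks. This is only an illustration on a toy table: the column names `Income` and `Household role`, the value `'reference'`, and the rows themselves are hypothetical stand-ins, not the layout of the original PNAD microdata.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw microdata (hypothetical columns and values).
raw = pd.DataFrame({
    'Income': [800, 999_999_999_999, np.nan, 1150, 3500],
    'Household role': ['reference', 'reference', 'reference', 'spouse', 'reference'],
})

INVALID_INCOME = 999_999_999_999  # sentinel for an invalid income

keep = (
    (raw['Income'] != INVALID_INCOME)          # 1. drop invalid income
    & raw['Income'].notna()                    # 2. drop missing income
    & (raw['Household role'] == 'reference')   # 3. keep reference persons only
)
clean = raw[keep].reset_index(drop=True)
print(clean)
```

Combining the conditions into one mask keeps each rule visible on its own line, which makes it easy to check how many records each rule removes.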

Importing pandas and reading the project dataset

https://pandas.pydata.org/

In [6]:
import pandas as pd

data = pd.read_csv('data/data.csv')
data

|Unnamed: 0|UF|Sex|Age|Color|Years of study|Income|Height|
|---|---|---|---|---|---|---|---|
|0|11|0|23|8|12|800|1.603808|
|1|11|1|23|2|12|1150|1.739790|
|2|11|1|35|8|15|880|1.760444|
|3|11|0|46|2|6|3500|1.783158|
|4|11|1|47|8|9|150|1.690631|
|...|...|...|...|...|...|...|...|
|76835|53|1|46|2|11|812|1.687030|
|76836|53|0|30|4|7|1500|1.792934|
|76837|53|0|32|8|12|1300|1.830587|
|76838|53|0|57|8|4|1500|1.726344|
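
Because the categorical columns hold numeric codes, it can help to translate them into readable labels before exploring the data. The sketch below rebuilds the first five rows shown in the output instead of reading `data/data.csv`, and the label dictionaries are English renderings of the code tables in the variable descriptions; the names `sample`, `sex_labels`, `color_labels` and `labeled` are illustrative.

```python
import pandas as pd

# First five rows of the dataset, copied from the displayed output.
sample = pd.DataFrame({
    'UF': [11, 11, 11, 11, 11],
    'Sex': [0, 1, 1, 0, 1],
    'Age': [23, 23, 35, 46, 47],
    'Color': [8, 2, 8, 2, 8],
    'Years of study': [12, 12, 15, 6, 9],
    'Income': [800, 1150, 880, 3500, 150],
    'Height': [1.603808, 1.739790, 1.760444, 1.783158, 1.690631],
})

# Code-to-label dictionaries taken from the variable description tables.
sex_labels = {0: 'Male', 1: 'Female'}
color_labels = {0: 'Indigenous', 2: 'White', 4: 'Black',
                6: 'Yellow', 8: 'Brown', 9: 'Not declared'}

# Series.map replaces each code with its label; unknown codes become NaN.
labeled = sample.assign(
    Sex=sample['Sex'].map(sex_labels),
    Color=sample['Color'].map(color_labels),
)
print(labeled[['Sex', 'Color', 'Income']])
```

Using `assign` leaves the coded `sample` table unchanged, so the numeric codes remain available for later steps.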


## <font color=green>Data types</font>
***

### Ordinal qualitative variables

► Variables whose categories can be ordered or ranked

In [9]:
sorted(data['Years of study'].unique())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
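
One way to make the ordering explicit in pandas is an ordered categorical dtype. The sketch below uses the `Years of study` codes from the first five rows of the dataset. Leaving code 17 (not determined) out of the categories is an assumption made here, since that code has no natural rank; any such value would become missing.

```python
import pandas as pd

# 'Years of study' codes from the first five rows of the dataset.
years = pd.Series([12, 12, 15, 6, 9], name='Years of study')

# Codes 1..16 are ranked; code 17 is deliberately excluded (assumption).
study_dtype = pd.CategoricalDtype(categories=list(range(1, 17)), ordered=True)
years_cat = years.astype(study_dtype)

# Ordered categoricals support min/max and order comparisons.
print(years_cat.min(), years_cat.max())
print((years_cat > 9).tolist())
```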


### Nominal qualitative variables

► Variables whose categories cannot be ordered or ranked

In [10]:
sorted(data['UF'].unique())

[11,
 12,
 13,
 14,
 15,
 16,
 17,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 31,
 32,
 33,
 35,
 41,
 42,
 43,
 50,
 51,
 52,
 53]

In [11]:
sorted(data['Sex'].unique())

[0, 1]

In [12]:
sorted(data['Color'].unique())

[0, 2, 4, 6, 8]
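
Although these codes are numbers, their numeric order carries no meaning. A minimal sketch of how pandas can reflect that, using the `Color` codes from the first five rows of the dataset: an unordered categorical allows counting and equality checks but rejects order comparisons.

```python
import pandas as pd

# 'Color' codes from the first five rows of the dataset.
color = pd.Series([8, 2, 8, 2, 8], name='Color').astype('category')

print(color.cat.ordered)      # unordered by default
print(color.value_counts())   # counting categories is meaningful

# Order comparisons are not defined for unordered categoricals.
try:
    color < 8
    comparison_allowed = True
except TypeError as exc:
    comparison_allowed = False
    print('comparison rejected:', exc)
```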

### Discrete quantitative variables

► Variables that represent a count, where the possible values form a finite or countable set.

In [19]:
print('from %s to %s years' % (data['Age'].min(), data['Age'].max()))

from 13 to 99 years


### Continuous quantitative variables

► Variables that represent a measurement and take values on a continuous scale (real numbers).

In [20]:
print('from %s to %s meters' % (data['Height'].min(), data['Height'].max()))

from 1.339244614 to 2.028496765 meters
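
The column dtype gives a quick, though not conclusive, hint about whether a quantitative variable is stored as discrete or continuous. The sketch below checks the `Age` and `Height` values from the first five rows of the dataset with helpers from `pandas.api.types`. Treat the result as a hint only: a float column can still hold whole numbers.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# 'Age' and 'Height' values from the first five rows of the dataset.
sample = pd.DataFrame({
    'Age': [23, 23, 35, 46, 47],
    'Height': [1.603808, 1.739790, 1.760444, 1.783158, 1.690631],
})

age_is_integer = is_integer_dtype(sample['Age'])      # stored as whole numbers
height_is_float = is_float_dtype(sample['Height'])    # stored as real numbers
print(age_is_integer, height_is_float)
```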


#### <font color = 'red'> Observation </font>
***
> The age variable can be classified in three different ways:

> 1. <b> DISCRETE QUANTITATIVE </b> - when it represents complete years (whole numbers);

> 2. <b> CONTINUOUS QUANTITATIVE </b> - when it represents the exact age, being represented by fractions of years; and

> 3. <b> ORDINAL QUALITATIVE </b> - when it represents age groups.
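
The step from a discrete quantitative age to an ordinal qualitative one can be sketched with `pd.cut`. The ages come from the first five rows of the dataset; the bin edges and group labels are arbitrary choices made for this illustration, not groups defined by the survey.

```python
import pandas as pd

# Age in complete years (discrete quantitative), first five rows of the dataset.
age = pd.Series([23, 23, 35, 46, 47], name='Age')

# Illustrative age groups; each bin is open on the left and closed on the right.
bins = [0, 17, 29, 44, 59, 120]
labels = ['0-17', '18-29', '30-44', '45-59', '60+']

# pd.cut returns an ordered categorical, i.e. an ordinal qualitative variable.
age_group = pd.cut(age, bins=bins, labels=labels)
print(age_group.tolist())
print(age_group.cat.ordered)
```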

### Variable classification (Portuguese)
<img src='https://caelum-online-public.s3.amazonaws.com/1177-estatistica-parte1/01/img001.png' width='70%'>