# Reading Data

- So far we have only read simple data that posed no obstacles.
- The real world is obviously not like that. Here we will learn how data can be messy and cause problems as early as the reading step.
- We will go through most of these problems and show how robust the read_csv() function is in handling the majority of cases.
- <font color='red'>THANK WES McKINNEY, THE CREATOR OF PANDAS</font>

## Step 1: DOCUMENTATION

- Because real-world data is messy, the read_csv() function evolved alongside these problems. It has more than 50 parameters, which cover almost every issue related to reading and ingesting data. This can be confusing at first, so we will show some classic examples of difficulties.
- Links to the <font color='green'>Documentation</font> of the function and the <font color='green'>User Guide</font> on reading data:
    - [Pandas Docs - API Reference - Input/output - Flat File - pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv)
    - [Pandas Docs - User Guide - IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5)
- We will go through some of the parameters in order of difficulty
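
As a quick complement to the documentation, the parameter list can also be inspected from Python itself. The snippet below is a minimal sketch that uses the standard `inspect` module; the variable name `parameter_names` is only illustrative, and the exact count depends on the installed pandas version.

```python
import inspect

import pandas as pd

# Collect the names of the parameters accepted by read_csv()
# in the pandas version installed on this machine
parameter_names = list(inspect.signature(pd.read_csv).parameters)

print(len(parameter_names))   # the exact number varies between pandas versions
print(parameter_names[:5])    # the first few parameter names
```

Any of these names can then be looked up in the API reference linked above.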

## Basic Parameters

- There are two basic parameters that you should know by heart:
    - <font color='green'>filepath_or_buffer:</font> the path, name, and extension of your file
    - <font color='green'>sep:</font> the separator used in your file, which defaults to a comma

### Reading comma-separated files (.csv - comma separated values)
- The sep parameter defaults to a comma, so we would not need to pass it. We pass it here only to make things more explicit

In [4]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_1.csv'

# read the data
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep=',')
df.head()
df.info()

Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,...,Feature_11,Feature_12,Feature_13,Feature_14,Feature_15,Feature_16,Feature_17,Feature_18,Feature_19,Label
0,-1.918589,-2.504736,0.129979,1.144094,1.714314,0.644021,0.010067,-2.243267,-2.289591,-3.157594,...,-1.56316,-0.003531,-0.819073,2.609332,-0.957484,1.698459,0.630487,-1.818666,6.79376,0
1,1.703727,1.23649,2.632384,-0.451056,-1.318009,0.158059,-0.68688,2.123858,1.45999,2.027665,...,1.490331,-0.261566,-3.277112,0.131795,0.387347,-1.026035,1.099396,1.973326,1.679518,0
2,0.513048,-4.473221,1.972778,4.994542,-0.839594,-1.320484,0.247288,1.767002,-0.294797,4.051059,...,-3.560979,0.696071,-1.72893,0.440227,-4.808164,-0.645908,7.066454,-0.066486,3.931558,1
3,-0.295201,-5.930419,0.12267,0.290137,-5.200017,5.789733,0.614547,-1.068637,0.718901,1.797997,...,2.845911,0.319121,-1.504172,0.721375,1.882201,1.109099,-2.300612,-2.967003,3.854491,1
4,0.587042,2.79554,2.277552,-1.600965,1.504304,1.756499,0.392907,1.31822,1.573603,-3.82187,...,0.422041,-2.3298,1.773869,4.595936,-2.52503,-0.024171,2.880676,-1.999992,6.793826,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
Feature_0     1000 non-null float64
Feature_1     1000 non-null float64
Feature_2     1000 non-null float64
Feature_3     1000 non-null float64
Feature_4     1000 non-null float64
Feature_5     1000 non-null float64
Feature_6     1000 non-null float64
Feature_7     1000 non-null float64
Feature_8     1000 non-null float64
Feature_9     1000 non-null float64
Feature_10    1000 non-null float64
Feature_11    1000 non-null float64
Feature_12    1000 non-null float64
Feature_13    1000 non-null float64
Feature_14    1000 non-null float64
Feature_15    1000 non-null float64
Feature_16    1000 non-null float64
Feature_17    1000 non-null float64
Feature_18    1000 non-null float64
Feature_19    1000 non-null float64
Label         1000 non-null int64
dtypes: float64(20), int64(1)
memory usage: 164.2 KB


- Everything looks correct at first glance...
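
The same call can be tried without any file on disk, because `filepath_or_buffer` also accepts file-like objects. The sketch below uses a small made-up CSV string (`csv_text` is hypothetical data, not the contents of `dados_1.csv`) and relies on the default comma separator.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory stand-in for a small comma-separated file
csv_text = "Feature_0,Feature_1,Label\n0.5,1.5,0\n-0.2,2.0,1\n"

# StringIO behaves like an open text file; sep is omitted because ',' is the default
df_demo = pd.read_csv(StringIO(csv_text))

print(df_demo.shape)
print(df_demo.dtypes)
```

This is handy for testing parsing options on a few lines before pointing the call at the real file.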

### Reading files with different separators
- Let's read a file that uses a different separator
- Notice that read_csv() read everything as if it were a single column, since there are no commas for it to split on
- Pass the correct separator and see what happens

In [21]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_2.csv'

# read the data with the wrong separator
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep=',')
df.head()

Unnamed: 0,Feature_0;Feature_1;Feature_2;Feature_3;Feature_4;Feature_5;Feature_6;Feature_7;Feature_8;Feature_9;Label
0,-0.1378638847304582;1.5322350606973423;2.19664...
1,-1.579387796944482;2.229178986863354;0.2274958...
2,-0.1778300406266738;0.9409898474703698;0.58590...
3,-0.3918653740601597;1.3207382812341921;2.01527...
4,0.7163588578562523;0.5333048955521471;1.483743...


- Setting sep=';'

In [5]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_2.csv'

# read the data with the correct separator
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep=';')
df.head()
df.info()

Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Label
0,-0.137864,1.532235,2.196646,0.729666,0.767479,-1.226179,-1.560715,0.681743,0.979455,0.962503,0
1,-1.579388,2.229179,0.227496,0.227534,0.635968,1.609521,1.577512,-0.474178,-3.391835,-0.771772,1
2,-0.17783,0.94099,0.585904,-1.391726,-0.73677,-1.406815,-0.483805,-0.322028,-0.703056,1.325432,0
3,-0.391865,1.320738,2.015275,-1.018709,1.290644,-1.889649,-0.404848,0.363116,-0.92343,0.31111,0
4,0.716359,0.533305,1.483744,-2.165546,-0.734811,-0.640154,-2.329548,-1.085748,1.428992,0.029796,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
Feature_0    1000 non-null float64
Feature_1    1000 non-null float64
Feature_2    1000 non-null float64
Feature_3    1000 non-null float64
Feature_4    1000 non-null float64
Feature_5    1000 non-null float64
Feature_6    1000 non-null float64
Feature_7    1000 non-null float64
Feature_8    1000 non-null float64
Feature_9    1000 non-null float64
Label        1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 86.1 KB
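
When the separator is not known in advance, the pandas documentation describes passing `sep=None` together with `engine='python'`, which lets the parser try to detect the separator using Python's `csv.Sniffer`. The sketch below uses a made-up semicolon-separated string (`semicolon_text` is hypothetical data). Detection is a heuristic, so the resulting columns should still be checked.

```python
from io import StringIO

import pandas as pd

# Hypothetical semicolon-separated content, similar in shape to the example above
semicolon_text = "Feature_0;Feature_1;Label\n0.5;1.5;0\n-0.2;2.0;1\n"

# With the wrong separator everything lands in a single column
wrong = pd.read_csv(StringIO(semicolon_text), sep=',')

# sep=None with the Python engine asks pandas to sniff the separator
sniffed = pd.read_csv(StringIO(semicolon_text), sep=None, engine='python')

print(wrong.shape)
print(sniffed.columns.tolist())
```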


**Another example**

In [27]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_3.csv'

# read the data with the wrong separator
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep=',')
df.head()

Unnamed: 0,Feature_0 Feature_1 Feature_2 Feature_3 Feature_4 Label
0,1.90137064905657 0.8582621914449771 1.58484298...
1,0.6000626525275318 0.2254339583897051 0.520462...
2,-1.8323599370202568 2.1997666991935403 -1.6890...
3,-1.8396144219997421 0.7415644591088034 -1.6884...
4,-0.9393044992490676 0.11093596651115936 -0.857...


Here the separator is a space

In [29]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_3.csv'

# read the data with the correct separator
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep=' ')
df.head()

Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Label
0,1.901371,0.858262,1.584843,1.130733,-1.522889,1
1,0.600063,0.225434,0.520463,-0.074065,-0.255904,1
2,-1.83236,2.199767,-1.689085,2.345099,-0.323531,1
3,-1.839614,0.741564,-1.688487,2.199702,-0.24415,1
4,-0.939304,0.110936,-0.857044,1.014978,-0.068246,0
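
Note that `sep=' '` treats every single space as a delimiter, which works for the file above because the values are separated by exactly one space. For files aligned with a varying number of spaces, a regular expression such as `sep=r'\s+'` matches any run of whitespace. The sketch below uses made-up data (`space_text` is hypothetical) to illustrate this.

```python
from io import StringIO

import pandas as pd

# Hypothetical content where columns are separated by one or more spaces
space_text = "Feature_0  Feature_1 Label\n1.90   0.85 1\n0.60 0.22   1\n"

# r'\s+' treats any run of whitespace as a single separator
df_ws = pd.read_csv(StringIO(space_text), sep=r'\s+')

print(df_ws.columns.tolist())
print(df_ws.shape)
```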


## Column, Index, and Column Name Parameters

- Parameters:
    - <font color='green'>header:</font> indicates which row holds the header; pass header=None when the file has no header
    - <font color='green'>names:</font> the column names to use when the file has no header
    - <font color='green'>index_col:</font> indicates which column should be used as the index
    - <font color='green'>usecols:</font> chooses which columns will be read, saving memory and later filtering

### Reading files without a header
- With the header parameter we can indicate whether the data has a header row. See the examples

In [10]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_4.csv'

# read the data with the wrong separator
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep=',')
df.head()
df.info()

Unnamed: 0,-0.8489047958996515\t-2.1929335980140996\t-1.0439204483042874\t-1.0109765681430902\t0.04967923444446559\t1.0325366448748625\t-0.12252434423634756\t2.4436145259057063\t-0.30100583659157576\t-0.8149833399034592\t3
0,-0.9193209369310009\t1.4564591923700374\t0.147...
1,-1.2953183650245126\t0.4118428048386901\t-1.15...
2,-0.186614454242095\t-0.3842904099038448\t0.441...
3,0.13362440371325746\t-0.700826749290669\t-0.60...
4,-0.3792891582693042\t-0.9044910322445008\t-0.4...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 1 columns):
-0.8489047958996515	-2.1929335980140996	-1.0439204483042874	-1.0109765681430902	0.04967923444446559	1.0325366448748625	-0.12252434423634756	2.4436145259057063	-0.30100583659157576	-0.8149833399034592	3    999 non-null object
dtypes: object(1)
memory usage: 7.9+ KB


- 1st) The correct separator is a TAB

In [11]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_4.csv'

# read the data with the correct separator, but without declaring that there is no header
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t')
df.head()
df.info()

Unnamed: 0,-0.8489047958996515,-2.1929335980140996,-1.0439204483042874,-1.0109765681430902,0.04967923444446559,1.0325366448748625,-0.12252434423634756,2.4436145259057063,-0.30100583659157576,-0.8149833399034592,3
0,-0.919321,1.456459,0.147485,-3.275885,-0.333815,1.000514,0.004069,-0.360054,-0.113266,1.690327,3
1,-1.295318,0.411843,-1.159148,-2.495045,-1.221641,0.369156,-1.533276,0.941197,0.081715,1.801694,1
2,-0.186614,-0.38429,0.441852,-0.139822,-0.581831,0.628671,-0.786553,0.703124,0.718486,1.241956,1
3,0.133624,-0.700827,-0.605545,-1.746508,-0.515908,0.001855,0.426752,1.175182,0.415367,-0.560101,1
4,-0.379289,-0.904491,-0.425612,0.117369,-1.369218,0.381671,0.379926,2.041094,-1.142475,0.19537,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 11 columns):
-0.8489047958996515     999 non-null float64
-2.1929335980140996     999 non-null float64
-1.0439204483042874     999 non-null float64
-1.0109765681430902     999 non-null float64
0.04967923444446559     999 non-null float64
1.0325366448748625      999 non-null float64
-0.12252434423634756    999 non-null float64
2.4436145259057063      999 non-null float64
-0.30100583659157576    999 non-null float64
-0.8149833399034592     999 non-null float64
3                       999 non-null int64
dtypes: float64(10), int64(1)
memory usage: 86.0 KB


- 2nd) Notice that the data has no header: the first data row was used as the column names, which is why only 999 rows were read. Use the header parameter to indicate this

In [12]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_4.csv'

# read the data with the correct separator and no header
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t', header=None)
df.head()
df.info()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,-0.848905,-2.192934,-1.04392,-1.010977,0.049679,1.032537,-0.122524,2.443615,-0.301006,-0.814983,3
1,-0.919321,1.456459,0.147485,-3.275885,-0.333815,1.000514,0.004069,-0.360054,-0.113266,1.690327,3
2,-1.295318,0.411843,-1.159148,-2.495045,-1.221641,0.369156,-1.533276,0.941197,0.081715,1.801694,1
3,-0.186614,-0.38429,0.441852,-0.139822,-0.581831,0.628671,-0.786553,0.703124,0.718486,1.241956,1
4,0.133624,-0.700827,-0.605545,-1.746508,-0.515908,0.001855,0.426752,1.175182,0.415367,-0.560101,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
0     1000 non-null float64
1     1000 non-null float64
2     1000 non-null float64
3     1000 non-null float64
4     1000 non-null float64
5     1000 non-null float64
6     1000 non-null float64
7     1000 non-null float64
8     1000 non-null float64
9     1000 non-null float64
10    1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 86.1 KB
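
The difference between 999 and 1000 entries in the outputs above comes from the first data row being consumed as the header. The sketch below reproduces the effect on a tiny made-up TAB-separated string (`tab_text` is hypothetical data).

```python
from io import StringIO

import pandas as pd

# Hypothetical TAB-separated content with three data rows and no header row
tab_text = "0.1\t0.2\t3\n0.3\t0.4\t1\n0.5\t0.6\t1\n"

# Default behaviour: the first row is taken as column names, so one data row is lost
with_default = pd.read_csv(StringIO(tab_text), sep='\t')

# header=None keeps every row as data and numbers the columns 0, 1, 2, ...
without_header = pd.read_csv(StringIO(tab_text), sep='\t', header=None)

print(len(with_default))                # 2
print(len(without_header))              # 3
print(without_header.columns.tolist())  # [0, 1, 2]
```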


- 3rd) The read_csv() function names the columns with numbers when we indicate that there is no header. With the names parameter we can provide the column names at read time

In [13]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_4.csv'

# list with the column names
colunas = ['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5', 'Feature_6', 'Feature_7', 'Feature_8',
           'Feature_9', 'Feature_10', 'Label']

# read the data with the correct separator, no header, and explicit column names
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t', header=None, names=colunas)
df.head()
df.info()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Feature_10,Label
0,-0.848905,-2.192934,-1.04392,-1.010977,0.049679,1.032537,-0.122524,2.443615,-0.301006,-0.814983,3
1,-0.919321,1.456459,0.147485,-3.275885,-0.333815,1.000514,0.004069,-0.360054,-0.113266,1.690327,3
2,-1.295318,0.411843,-1.159148,-2.495045,-1.221641,0.369156,-1.533276,0.941197,0.081715,1.801694,1
3,-0.186614,-0.38429,0.441852,-0.139822,-0.581831,0.628671,-0.786553,0.703124,0.718486,1.241956,1
4,0.133624,-0.700827,-0.605545,-1.746508,-0.515908,0.001855,0.426752,1.175182,0.415367,-0.560101,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
Feature_1     1000 non-null float64
Feature_2     1000 non-null float64
Feature_3     1000 non-null float64
Feature_4     1000 non-null float64
Feature_5     1000 non-null float64
Feature_6     1000 non-null float64
Feature_7     1000 non-null float64
Feature_8     1000 non-null float64
Feature_9     1000 non-null float64
Feature_10    1000 non-null float64
Label         1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 86.1 KB
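
A related case is a file that does have a header row, but with names we want to replace. The pandas documentation says to pass `header=0` explicitly together with `names` in that situation. Otherwise the old header row is kept as data. The sketch below uses made-up content (`text_with_header` and `new_names` are hypothetical).

```python
from io import StringIO

import pandas as pd

# Hypothetical file content that already has a header row we want to rename
text_with_header = "a,b,label\n0.1,0.2,3\n0.3,0.4,1\n"
new_names = ['Feature_1', 'Feature_2', 'Label']

# header=0 marks the first row as a header to be discarded and replaced by names
replaced = pd.read_csv(StringIO(text_with_header), header=0, names=new_names)

# header=None treats the old header row as data, so it shows up as the first record
kept_as_data = pd.read_csv(StringIO(text_with_header), header=None, names=new_names)

print(len(replaced))                    # 2
print(len(kept_as_data))                # 3
print(kept_as_data.iloc[0].tolist())    # ['a', 'b', 'label']
```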


### Reading files while specifying an Index
- With the index_col parameter we can indicate, at read time, which column is the Index of the file

In [14]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_5.csv'

# read the data with the correct separator
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t')
df.head()
df.info()

Unnamed: 0.1,Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Target
0,0,-0.021182,0.01213,-0.018755,-0.063668,-0.037392,-0.009378,0.031608,0.014331,-0.001612,-0.048429,-5.194355
1,1,7.4e-05,0.019679,0.037544,-0.047496,-0.000815,-0.030114,-0.029297,-0.017839,0.00939,0.019566,-3.376057
2,2,-0.045825,-0.033578,-0.007455,0.02325,-0.014264,-0.004916,0.027738,0.008681,0.0235,-0.020179,-0.459187
3,3,0.02606,-0.004648,0.009928,-0.026802,0.026844,0.022864,-0.009244,-0.013375,-0.036422,0.037626,4.352418
4,4,-0.036686,-0.026732,-0.005009,0.01153,0.003022,0.003656,-0.005076,0.005985,-0.007661,-0.024999,-0.668197


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
Unnamed: 0    1000 non-null int64
Feature_0     1000 non-null float64
Feature_1     1000 non-null float64
Feature_2     1000 non-null float64
Feature_3     1000 non-null float64
Feature_4     1000 non-null float64
Feature_5     1000 non-null float64
Feature_6     1000 non-null float64
Feature_7     1000 non-null float64
Feature_8     1000 non-null float64
Feature_9     1000 non-null float64
Target        1000 non-null float64
dtypes: float64(11), int64(1)
memory usage: 93.9 KB


- Notice that the Unnamed: 0 column is actually the row Index. We can either drop this column afterwards or tell read_csv() right away that it is the Index

In [15]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_5.csv'

# read the data with the correct separator and specifying the index
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t', index_col='Unnamed: 0')
df.head()
df.info()

Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Target
0,-0.021182,0.01213,-0.018755,-0.063668,-0.037392,-0.009378,0.031608,0.014331,-0.001612,-0.048429,-5.194355
1,7.4e-05,0.019679,0.037544,-0.047496,-0.000815,-0.030114,-0.029297,-0.017839,0.00939,0.019566,-3.376057
2,-0.045825,-0.033578,-0.007455,0.02325,-0.014264,-0.004916,0.027738,0.008681,0.0235,-0.020179,-0.459187
3,0.02606,-0.004648,0.009928,-0.026802,0.026844,0.022864,-0.009244,-0.013375,-0.036422,0.037626,4.352418
4,-0.036686,-0.026732,-0.005009,0.01153,0.003022,0.003656,-0.005076,0.005985,-0.007661,-0.024999,-0.668197


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 11 columns):
Feature_0    1000 non-null float64
Feature_1    1000 non-null float64
Feature_2    1000 non-null float64
Feature_3    1000 non-null float64
Feature_4    1000 non-null float64
Feature_5    1000 non-null float64
Feature_6    1000 non-null float64
Feature_7    1000 non-null float64
Feature_8    1000 non-null float64
Feature_9    1000 non-null float64
Target       1000 non-null float64
dtypes: float64(11)
memory usage: 93.8 KB
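
One common way an `Unnamed: 0` column appears is when a DataFrame is saved with `to_csv()` and its index is written as a first column without a name. Whether `dados_5.csv` was produced this way is an assumption. The sketch below reproduces the round trip in memory and shows that `index_col` also accepts a column position.

```python
from io import StringIO

import pandas as pd

# Hypothetical small DataFrame written to an in-memory TAB-separated "file"
original = pd.DataFrame({'Feature_0': [0.1, 0.2], 'Target': [1.0, 2.0]})
buffer = StringIO()
original.to_csv(buffer, sep='\t')  # the index is written as a first, unnamed column

# Reading it back naively turns the saved index into an 'Unnamed: 0' column
buffer.seek(0)
naive = pd.read_csv(buffer, sep='\t')

# index_col=0 uses the first column (by position) as the index instead
buffer.seek(0)
indexed = pd.read_csv(buffer, sep='\t', index_col=0)

print(naive.columns.tolist())
print(indexed.columns.tolist())
```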


### Reading files while choosing which columns to read
- We use the usecols parameter to indicate a subset of the columns to be read. This can save memory when we have larger files

In [16]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_5.csv'

# list of the columns to use
lista_cols = ['Feature_0', 'Feature_1', 'Target']

# read the data with the correct separator and specifying the columns
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t', usecols=lista_cols)
df.head()
df.info()

Unnamed: 0,Feature_0,Feature_1,Target
0,-0.021182,0.01213,-5.194355
1,7.4e-05,0.019679,-3.376057
2,-0.045825,-0.033578,-0.459187
3,0.02606,-0.004648,4.352418
4,-0.036686,-0.026732,-0.668197


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
Feature_0    1000 non-null float64
Feature_1    1000 non-null float64
Target       1000 non-null float64
dtypes: float64(3)
memory usage: 23.6 KB
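
Besides a list of names, `usecols` also accepts column positions or a function that receives each column name and returns True for the ones to keep. The sketch below uses made-up TAB-separated content (`wide_text` is hypothetical data) to show both forms.

```python
from io import StringIO

import pandas as pd

# Hypothetical TAB-separated content with an id column, three features and a target
wide_text = (
    "id\tFeature_0\tFeature_1\tFeature_2\tTarget\n"
    "0\t0.1\t0.2\t0.3\t1.5\n"
    "1\t0.4\t0.5\t0.6\t2.5\n"
)

# Select columns by position (0-based)
by_position = pd.read_csv(StringIO(wide_text), sep='\t', usecols=[1, 2, 4])

# Select columns with a rule applied to each column name
by_rule = pd.read_csv(StringIO(wide_text), sep='\t',
                      usecols=lambda name: name.startswith('Feature'))

print(by_position.columns.tolist())
print(by_rule.columns.tolist())
```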


## Parsing Parameters

- Parameters:
    - <font color='green'>dtype:</font> lets you choose the data types of the columns. With it we can reduce memory usage and manage to read larger files
    - <font color='green'>skiprows:</font> lets us choose rows to skip and exclude from the file
    - <font color='green'>skipfooter:</font> some software adds a footer note when exporting csv files. With this parameter we can skip rows at the end of the file
    - <font color='green'>nrows:</font> indicates how many rows we want to read from the file. It can be a good strategy for extremely large files: we first read part of the data and explore it, and then design a strategy to reduce memory usage
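
The skiprows and skipfooter parameters can be illustrated with a small made-up export (`report_text` is hypothetical data) that has two lines of preamble and one footer line. The sketch passes `engine='python'` explicitly because skipfooter is not supported by the default C engine.

```python
from io import StringIO

import pandas as pd

# Hypothetical export with two preamble lines and a one-line footer
report_text = (
    "Exported by some tool\n"
    "Generated on a given date\n"
    "Feature_0,Label\n"
    "0.1,0\n"
    "0.2,1\n"
    "0.3,0\n"
    "Total rows: 3\n"
)

# skiprows=2 drops the preamble, skipfooter=1 drops the last line
df_clean = pd.read_csv(StringIO(report_text), sep=',',
                       skiprows=2, skipfooter=1, engine='python')

print(df_clean.shape)
print(df_clean.columns.tolist())
```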

- The next file is much larger than the ones we have been using. Let's use the parameters of the read_csv() function to handle it better

In [52]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_6.csv'

# read the data
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t')
df.head()

# infos
df.info()

Unnamed: 0,-1.6268502956922566,-2.6758811909442226,3.029605205081599,-0.32383019868586205,0.1907900053337544,-2.559812063273754,0.4068668615060956,0.20208888839452155,2.0470273040743896,-2.568138109020448,0
0,0.124887,-1.025044,0.52199,-2.095166,0.221847,-1.693476,-0.426833,1.320412,1.138497,-0.102198,0
1,-0.541229,-2.368121,1.827388,-1.370912,1.68001,-2.45293,0.490978,1.023766,1.98862,-2.198035,0
2,-4.046177,4.052463,3.835139,-0.558254,-0.859638,1.935893,0.892909,-0.045699,-1.254822,2.212408,0
3,-2.79127,1.776555,2.142707,0.279566,1.242548,1.17805,-0.25854,1.273198,0.254972,1.165803,0
4,-1.384238,-1.490661,2.416589,-0.557352,-1.095562,-1.836963,-0.464932,-0.640057,1.144509,-1.176579,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 11 columns):
-1.6268502956922566     999999 non-null float64
-2.6758811909442226     999999 non-null float64
3.029605205081599       999999 non-null float64
-0.32383019868586205    999999 non-null float64
0.1907900053337544      999999 non-null float64
-2.559812063273754      999999 non-null float64
0.4068668615060956      999999 non-null float64
0.20208888839452155     999999 non-null float64
2.0470273040743896      999999 non-null float64
-2.568138109020448      999999 non-null float64
0                       999999 non-null int64
dtypes: float64(10), int64(1)
memory usage: 83.9 MB


- 1st) If the file took too long to load on your machine, we can use the nrows parameter and read only a few rows. That way you can take a look at the data and plan a strategy.
    - Notice that the memory usage of the full read is 83.9 MB

In [53]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the pandas library
import pandas as pd

# variable with the file path and name
ender_arquivo = 'data_examples/dados_6.csv'

# read the data with nrows
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t', nrows=10_000)
df.head()

# infos
df.info()

Unnamed: 0,-1.6268502956922566,-2.6758811909442226,3.029605205081599,-0.32383019868586205,0.1907900053337544,-2.559812063273754,0.4068668615060956,0.20208888839452155,2.0470273040743896,-2.568138109020448,0
0,0.124887,-1.025044,0.52199,-2.095166,0.221847,-1.693476,-0.426833,1.320412,1.138497,-0.102198,0
1,-0.541229,-2.368121,1.827388,-1.370912,1.68001,-2.45293,0.490978,1.023766,1.98862,-2.198035,0
2,-4.046177,4.052463,3.835139,-0.558254,-0.859638,1.935893,0.892909,-0.045699,-1.254822,2.212408,0
3,-2.79127,1.776555,2.142707,0.279566,1.242548,1.17805,-0.25854,1.273198,0.254972,1.165803,0
4,-1.384238,-1.490661,2.416589,-0.557352,-1.095562,-1.836963,-0.464932,-0.640057,1.144509,-1.176579,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
-1.6268502956922566     10000 non-null float64
-2.6758811909442226     10000 non-null float64
3.029605205081599       10000 non-null float64
-0.32383019868586205    10000 non-null float64
0.1907900053337544      10000 non-null float64
-2.559812063273754      10000 non-null float64
0.4068668615060956      10000 non-null float64
0.20208888839452155     10000 non-null float64
2.0470273040743896      10000 non-null float64
-2.568138109020448      10000 non-null float64
0                       10000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 859.5 KB


- 2nd) We read only 10,000 rows, which lowers the memory usage to 859.5 KB and lets us plan a strategy before reading all the data. Some things to note:
    - The file has no header, and the columns were read as float64 (the ten features) and int64 (the label), 8 bytes per value, which can increase memory usage
    - Let's indicate that there is no header and pass the column names
    - Using the dtype parameter, we can change the data types and reduce memory usage

In [78]:
# this snippet makes the Jupyter notebook
# show multiple outputs
# in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the libraries
import pandas as pd
import numpy as np

# variable with the file path and name
ender_arquivo = 'data_examples/dados_6.csv'

# header and data types
list_cols = ['Feature_0', 'Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5',
           'Feature_6', 'Feature_7', 'Feature_8', 'Feature_9', 'Label']

list_data_type = [np.float32, np.float32, np.float32, np.float32, np.float32, np.float32,
               np.float32, np.float32, np.float32, np.float32, np.int32]

# build a dictionary with the data types of the columns
dic_data_type = dict(zip(list_cols, list_data_type))

# read the data
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t', nrows=10_000, 
                 header=None, names=list_cols, dtype=dic_data_type)
df.head()

# infos
df.info()

Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Label
0,-1.62685,-2.675881,3.029605,-0.32383,0.19079,-2.559812,0.406867,0.202089,2.047027,-2.568138,0
1,0.124887,-1.025044,0.52199,-2.095166,0.221847,-1.693476,-0.426832,1.320412,1.138497,-0.102198,0
2,-0.541229,-2.368121,1.827388,-1.370912,1.68001,-2.45293,0.490978,1.023766,1.98862,-2.198035,0
3,-4.046177,4.052463,3.835139,-0.558254,-0.859638,1.935893,0.892909,-0.045699,-1.254822,2.212408,0
4,-2.79127,1.776555,2.142707,0.279566,1.242548,1.17805,-0.25854,1.273198,0.254972,1.165803,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
Feature_0    10000 non-null float32
Feature_1    10000 non-null float32
Feature_2    10000 non-null float32
Feature_3    10000 non-null float32
Feature_4    10000 non-null float32
Feature_5    10000 non-null float32
Feature_6    10000 non-null float32
Feature_7    10000 non-null float32
Feature_8    10000 non-null float32
Feature_9    10000 non-null float32
Label        10000 non-null int32
dtypes: float32(10), int32(1)
memory usage: 429.8 KB
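
The smaller types come with a trade-off: float32 keeps only about 7 significant decimal digits, so values are slightly rounded. This is visible above, where one value shown as -0.426833 in the float64 read appears as -0.426832 in the float32 read. The sketch below measures the rounding for one full-precision value that appears in the earlier outputs of this file. The variable names are illustrative.

```python
import numpy as np

# Full-precision value as it appears in the earlier float64 outputs
value_64 = -1.6268502956922566

# The same value stored as float32 is rounded to the nearest representable number
value_32 = np.float32(value_64)

# Absolute rounding error introduced by the narrower type
difference = abs(float(value_32) - value_64)

print(value_32)
print(difference)
```

For many analyses an error of this size is acceptable, but it is worth checking before downcasting columns that need high precision.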


- 3rd) The 10,000 rows used 859.5 KB of memory. By changing the types to float32 and int32, we cut that to 429.8 KB. The next step is to do the same with the entire dataset. This kind of strategy is useful when working with very large data.

In [3]:
# import the libraries
import pandas as pd
import numpy as np

# variable with the file path and name
ender_arquivo = 'data_examples/dados_6.csv'

# header and data types
list_cols = ['Feature_0', 'Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5',
           'Feature_6', 'Feature_7', 'Feature_8', 'Feature_9', 'Label']

list_data_type = [np.float32, np.float32, np.float32, np.float32, np.float32, np.float32,
               np.float32, np.float32, np.float32, np.float32, np.int32]

# build a dictionary with the data types of the columns
dic_data_type = dict(zip(list_cols, list_data_type))

# read the data
df = pd.read_csv(filepath_or_buffer=ender_arquivo, sep='\t',
                 header=None, names=list_cols, dtype=dic_data_type)

# infos
df.head()
df.tail()
df.info()

Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Label
0,-1.62685,-2.675881,3.029605,-0.32383,0.19079,-2.559812,0.406867,0.202089,2.047027,-2.568138,0
1,0.124887,-1.025044,0.52199,-2.095166,0.221847,-1.693476,-0.426832,1.320412,1.138497,-0.102198,0
2,-0.541229,-2.368121,1.827388,-1.370912,1.68001,-2.45293,0.490978,1.023766,1.98862,-2.198035,0
3,-4.046177,4.052463,3.835139,-0.558254,-0.859638,1.935893,0.892909,-0.045699,-1.254822,2.212408,0
4,-2.79127,1.776555,2.142707,0.279566,1.242548,1.17805,-0.25854,1.273198,0.254972,1.165803,0


Unnamed: 0,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Label
999995,1.841844,-1.081227,-0.601514,-1.041679,-1.380062,-0.929006,-0.597511,0.067266,0.017762,-2.544658,0
999996,-2.805988,1.737186,1.404542,1.806035,-0.436407,1.975983,-0.535935,0.175755,0.249403,1.371157,0
999997,-3.909082,1.589143,0.114744,4.023767,1.252311,3.577291,-1.188318,0.993558,2.153193,2.584481,0
999998,-2.581747,2.413373,2.464464,0.606348,0.679737,1.412132,0.311622,-1.438362,-1.169999,1.087373,0
999999,-0.054737,-1.125135,1.175885,-2.431503,-0.981067,-2.161906,-0.021619,-0.036317,0.935807,-0.270891,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 11 columns):
Feature_0    1000000 non-null float32
Feature_1    1000000 non-null float32
Feature_2    1000000 non-null float32
Feature_3    1000000 non-null float32
Feature_4    1000000 non-null float32
Feature_5    1000000 non-null float32
Feature_6    1000000 non-null float32
Feature_7    1000000 non-null float32
Feature_8    1000000 non-null float32
Feature_9    1000000 non-null float32
Label        1000000 non-null int32
dtypes: float32(10), int32(1)
memory usage: 42.0 MB
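
The memory figures reported by info() in this section can be checked with simple arithmetic: rows times bytes per row, where float64 and int64 take 8 bytes per value and float32 and int32 take 4. The sketch below is a rough estimate, not how pandas computes it internally. The 128-byte index overhead is an assumption chosen so that the results match the outputs shown, and `estimate_bytes` is a hypothetical helper.

```python
# Bytes used by a single value of each numeric type
BYTES = {'float64': 8, 'int64': 8, 'float32': 4, 'int32': 4}

# Assumption: small fixed overhead for the RangeIndex, inferred from the outputs above
INDEX_OVERHEAD = 128

def estimate_bytes(n_rows, dtypes):
    """Rough memory estimate: rows times bytes per row, plus the index overhead."""
    return n_rows * sum(BYTES[d] for d in dtypes) + INDEX_OVERHEAD

wide = ['float64'] * 10 + ['int64']     # default types inferred by read_csv()
narrow = ['float32'] * 10 + ['int32']   # types requested through dtype

full_wide_mb = estimate_bytes(999_999, wide) / 1024 ** 2
sample_wide_kb = estimate_bytes(10_000, wide) / 1024
sample_narrow_kb = estimate_bytes(10_000, narrow) / 1024
full_narrow_mb = estimate_bytes(1_000_000, narrow) / 1024 ** 2

print(round(full_wide_mb, 1))      # 83.9
print(round(sample_wide_kb, 1))    # 859.5
print(round(sample_narrow_kb, 1))  # 429.8
print(round(full_narrow_mb, 1))    # 42.0
```

Halving the bytes per value halves the memory, which is why the full dataset drops from 83.9 MB to 42.0 MB.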
