# DataFrame
A DataFrame is a tabular data structure made up of rows and columns, similar to an Excel spreadsheet, a dataset table, or R's data.frame object. We can also think of a DataFrame as a group of Series that share an index.
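
The "group of Series sharing an index" view can be sketched as follows. The labels and values are made up for illustration and do not come from the datasets used later in this notebook.

```python
import pandas as pd

# Two Series built on the same index labels (illustrative values)
index = ['a', 'b', 'c']
wins = pd.Series([11, 8, 10], index=index, name='wins')
losses = pd.Series([5, 8, 6], index=index, name='losses')

# Combining them column-wise yields a DataFrame that keeps the shared index
table = pd.concat([wins, losses], axis=1)

# Each column of the DataFrame is again a Series on that same index
print(type(table['wins']))
print(table.index.equals(wins.index))
```

Here `pd.concat` is just one convenient way to glue Series together; the Series names become the column names.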

## Creating DataFrames
One way to create a DataFrame from Python data structures is with a dictionary of lists (for other approaches, see the pandas DataFrame documentation).

In [3]:
# import pandas
import pandas as pd

# create from a dictionary of lists
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

football = pd.DataFrame(data)
football

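|   | year | team | wins | losses |
|---|------|------|------|--------|
| 0 | 2010 | Bears | 11 | 5 |
| 1 | 2011 | Bears | 8 | 8 |
| 2 | 2012 | Bears | 10 | 6 |
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |
| 5 | 2010 | Lions | 6 | 10 |
| 6 | 2011 | Lions | 10 | 6 |
| 7 | 2012 | Lions | 4 | 12 |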
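
As one of the other approaches, a DataFrame can also be built row by row from a list of dictionaries. The sketch below reuses a few of the football rows; the variable names `records` and `football_rows` are hypothetical.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names
records = [
    {'year': 2010, 'team': 'Bears', 'wins': 11, 'losses': 5},
    {'year': 2011, 'team': 'Bears', 'wins': 8, 'losses': 8},
    {'year': 2011, 'team': 'Packers', 'wins': 15, 'losses': 1},
]

# The optional `columns` argument fixes the column order explicitly
football_rows = pd.DataFrame(records, columns=['team', 'year', 'wins', 'losses'])
print(football_rows)
```

The dictionary-of-lists form is column-oriented, while this form is row-oriented; which one is more convenient usually depends on how the data arrives.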


In [11]:
# another way to create a DataFrame
import numpy as np

data = np.random.randint(low=0, high=10, size=(5, 6))
print(data)

pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e', 'f'])

# or

pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 6)),
             columns=['a', 'b', 'c', 'd', 'e', 'f'], index=['g', 'h', 'i', 'j', 'k'])

[[3 9 7 4 3 2]
 [8 3 3 6 4 1]
 [4 7 2 0 9 0]
 [8 4 8 9 9 0]
 [3 2 6 7 8 8]]


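|   | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| g | 9 | 2 | 5 | 9 | 6 | 5 |
| h | 0 | 2 | 5 | 1 | 1 | 1 |
| i | 8 | 2 | 7 | 3 | 9 | 6 |
| j | 3 | 3 | 5 | 0 | 4 | 7 |
| k | 4 | 6 | 4 | 7 | 1 | 2 |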
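
The printed array and the displayed table differ because the cell draws random numbers twice and only the last expression is rendered. When reproducible values are needed, a seeded generator can be used instead. This is a minimal sketch assuming NumPy's `default_rng` API; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the "random" table reproducible
rng = np.random.default_rng(seed=42)
values = rng.integers(low=0, high=10, size=(5, 6))  # integers in [0, 10)

df_random = pd.DataFrame(values,
                         columns=['a', 'b', 'c', 'd', 'e', 'f'],
                         index=['g', 'h', 'i', 'j', 'k'])
print(df_random)
```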


In [12]:
# open the help for the function (IPython syntax)
pd.DataFrame?

## Reading CSV Files
A much more common approach is to read a dataset stored on disk in a spreadsheet-like format, such as .csv or .xlsx. Fortunately, pandas makes this very easy and intuitive.

In [15]:
# file path
file_path = "dados/iris-dataset.csv"

# preview the file (first 10 lines)
!head dados/iris-dataset.csv

# the file has no header row, so the column names are supplied explicitly
df = pd.read_csv(file_path, header=None,
                 names=['sepal_length', 'sepal_width', 'petal_length', 'pental_width', 'species'])

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
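
To try the same call without the file on disk, `read_csv` can also read from a file-like object. The sketch below feeds it three lines in the same layout through `io.StringIO`; the variable names and the column labels are illustrative.

```python
import io
import pandas as pd

# A few lines in the same layout as the file above (no header row)
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

# header=None tells pandas the first line is data, so `names` supplies the labels
sample = pd.read_csv(io.StringIO(csv_text), header=None,
                     names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])

print(sample.dtypes)
```

For .xlsx files pandas offers `pd.read_excel`, which relies on an additional engine package (for example openpyxl) being installed.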


In [16]:
# type of the DataFrame
type(df)

pandas.core.frame.DataFrame

In [17]:
df.head(10)  # first 10 rows

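|   | sepal_length | sepal_width | petal_length | pental_width | species |
|---|--------------|-------------|--------------|--------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |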


In [18]:
df.tail(10)  # last 10 rows

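|     | sepal_length | sepal_width | petal_length | pental_width | species |
|-----|--------------|-------------|--------------|--------------|---------|
| 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |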


In [19]:
df  # prints everything

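|   | sepal_length | sepal_width | petal_length | pental_width | species |
|---|--------------|-------------|--------------|--------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |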


In [20]:
df.head()

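|   | sepal_length | sepal_width | petal_length | pental_width | species |
|---|--------------|-------------|--------------|--------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |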


## Basic Statistics
We can easily obtain basic statistics from a `pandas.DataFrame` using the `describe` method.

In [23]:
df.describe()

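|       | sepal_length | sepal_width | petal_length | pental_width |
|-------|--------------|-------------|--------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |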


By default, `describe` only shows statistics for numeric columns, as can be seen in the previous example. Let's force it to show statistics for the categorical column (species) as well.

In [24]:
df.describe(include = "all")

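|        | sepal_length | sepal_width | petal_length | pental_width | species |
|--------|--------------|-------------|--------------|--------------|---------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150 |
| unique | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | Iris-setosa |
| freq | NaN | NaN | NaN | NaN | 50 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 | NaN |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 | NaN |
| min | 4.3 | 2.0 | 1.0 | 0.1 | NaN |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | NaN |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | NaN |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | NaN |
| max | 7.9 | 4.4 | 6.9 | 2.5 | NaN |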


In [25]:
df.describe(include = "object")

|        | species |
|--------|---------|
| count | 150 |
| unique | 3 |
| top | Iris-setosa |
| freq | 50 |


In [26]:
df.describe(include = "number")

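|       | sepal_length | sepal_width | petal_length | pental_width |
|-------|--------------|-------------|--------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |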
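
The effect of the `include` argument can be checked on a small table. The sketch below uses a made-up four-row frame (the name `toy` and its values are illustrative) and reads individual cells out of the result, since `describe` itself returns a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({'length': [5.1, 4.9, 6.3, 5.8],
                    'species': ['setosa', 'setosa', 'virginica', 'versicolor']})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include='all')   # numeric and non-numeric together

print(numeric_only.columns.tolist())       # ['length']
print(everything.loc['top', 'species'])    # most frequent value: setosa
print(everything.loc['freq', 'species'])   # how often it occurs: 2
```

Statistics that do not apply to a column, such as the mean of `species`, come back as NaN.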


## Number of Instances in the DataFrame
Both pandas DataFrame and Series have a `shape` attribute, just like NumPy's ndarray, which tells us how many rows and columns a table has.

In [27]:
df.shape

(150, 5)

In [28]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'pental_width',
       'species'],
      dtype='object')

In [29]:
df.index

RangeIndex(start=0, stop=150, step=1)
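
A short sketch of how `shape` differs between a DataFrame and a Series, using a made-up three-row frame (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape      # DataFrame shape is (rows, columns)
series_shape = toy['x'].shape   # a Series is one-dimensional: (rows,)

print(n_rows, n_cols)   # 3 2
print(series_shape)     # (3,)
print(len(toy))         # 3, the number of rows
```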

## Accessing Columns
By default, bracket indexing in pandas operates on columns. A column whose name is a valid Python identifier can also be reached as an attribute. For example:

In [30]:
df.sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
5      5.4
6      4.6
7      5.0
8      4.4
9      4.9
10     5.4
11     4.8
12     4.8
13     4.3
14     5.8
15     5.7
16     5.4
17     5.1
18     5.7
19     5.1
20     5.4
21     5.1
22     4.6
23     5.1
24     4.8
25     5.0
26     5.0
27     5.2
28     5.2
29     4.7
      ... 
120    6.9
121    5.6
122    7.7
123    6.3
124    6.7
125    7.2
126    6.2
127    6.1
128    6.4
129    7.2
130    7.4
131    7.9
132    6.4
133    6.3
134    6.1
135    7.7
136    6.3
137    6.4
138    6.0
139    6.9
140    6.7
141    6.9
142    5.8
143    6.8
144    6.7
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [31]:
type(df.sepal_length)

pandas.core.series.Series

In [32]:
df['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
5      5.4
6      4.6
7      5.0
8      4.4
9      4.9
10     5.4
11     4.8
12     4.8
13     4.3
14     5.8
15     5.7
16     5.4
17     5.1
18     5.7
19     5.1
20     5.4
21     5.1
22     4.6
23     5.1
24     4.8
25     5.0
26     5.0
27     5.2
28     5.2
29     4.7
      ... 
120    6.9
121    5.6
122    7.7
123    6.3
124    6.7
125    7.2
126    6.2
127    6.1
128    6.4
129    7.2
130    7.4
131    7.9
132    6.4
133    6.3
134    6.1
135    7.7
136    6.3
137    6.4
138    6.0
139    6.9
140    6.7
141    6.9
142    5.8
143    6.8
144    6.7
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64
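
Attribute access and bracket access return the same Series here, but brackets are the more general form. The sketch below uses a made-up frame whose column names are chosen to show where attribute access falls short; all names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'sepal_length': [5.1, 4.9],
                    'sepal width': [3.5, 3.0],
                    'count': [1, 2]})

one_col = toy['sepal_length']               # single label -> Series
two_cols = toy[['sepal_length', 'count']]   # list of labels -> DataFrame

# Attribute access only works for names that are valid identifiers and
# that do not clash with existing DataFrame attributes or methods
spaced = toy['sepal width']   # a name with a space needs brackets
counts = toy['count']         # toy.count would be the DataFrame.count method

print(type(one_col).__name__, type(two_cols).__name__)
```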

In [33]:
np.mean(df.sepal_length)

5.843333333333335

In [34]:
np.log(df.sepal_length)

0      1.629241
1      1.589235
2      1.547563
3      1.526056
4      1.609438
5      1.686399
6      1.526056
7      1.609438
8      1.481605
9      1.589235
10     1.686399
11     1.568616
12     1.568616
13     1.458615
14     1.757858
15     1.740466
16     1.686399
17     1.629241
18     1.740466
19     1.629241
20     1.686399
21     1.629241
22     1.526056
23     1.629241
24     1.568616
25     1.609438
26     1.609438
27     1.648659
28     1.648659
29     1.547563
         ...   
120    1.931521
121    1.722767
122    2.041220
123    1.840550
124    1.902108
125    1.974081
126    1.824549
127    1.808289
128    1.856298
129    1.974081
130    2.001480
131    2.066863
132    1.856298
133    1.840550
134    1.808289
135    2.041220
136    1.840550
137    1.856298
138    1.791759
139    1.931521
140    1.902108
141    1.931521
142    1.757858
143    1.916923
144    1.902108
145    1.902108
146    1.840550
147    1.871802
148    1.824549
149    1.774952
Name: sepal_length, Length: 150, dtype: float64

In [35]:
np.sum(df.sepal_length)

876.5

In [36]:
df.sepal_length.sum()

876.5

In [37]:
df.sepal_length + df.sepal_width

0       8.6
1       7.9
2       7.9
3       7.7
4       8.6
5       9.3
6       8.0
7       8.4
8       7.3
9       8.0
10      9.1
11      8.2
12      7.8
13      7.3
14      9.8
15     10.1
16      9.3
17      8.6
18      9.5
19      8.9
20      8.8
21      8.8
22      8.2
23      8.4
24      8.2
25      8.0
26      8.4
27      8.7
28      8.6
29      7.9
       ... 
120    10.1
121     8.4
122    10.5
123     9.0
124    10.0
125    10.4
126     9.0
127     9.1
128     9.2
129    10.2
130    10.2
131    11.7
132     9.2
133     9.1
134     8.7
135    10.7
136     9.7
137     9.5
138     9.0
139    10.0
140     9.8
141    10.0
142     8.5
143    10.0
144    10.0
145     9.7
146     8.8
147     9.5
148     9.6
149     8.9
Length: 150, dtype: float64

In [38]:
df.sepal_length * df.sepal_width

0      17.85
1      14.70
2      15.04
3      14.26
4      18.00
5      21.06
6      15.64
7      17.00
8      12.76
9      15.19
10     19.98
11     16.32
12     14.40
13     12.90
14     23.20
15     25.08
16     21.06
17     17.85
18     21.66
19     19.38
20     18.36
21     18.87
22     16.56
23     16.83
24     16.32
25     15.00
26     17.00
27     18.20
28     17.68
29     15.04
       ...  
120    22.08
121    15.68
122    21.56
123    17.01
124    22.11
125    23.04
126    17.36
127    18.30
128    17.92
129    21.60
130    20.72
131    30.02
132    17.92
133    17.64
134    15.86
135    23.10
136    21.42
137    19.84
138    18.00
139    21.39
140    20.77
141    21.39
142    15.66
143    21.76
144    22.11
145    20.10
146    15.75
147    19.50
148    21.08
149    17.70
Length: 150, dtype: float64
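
Arithmetic between two Series pairs values by index label rather than by position. The two columns above share the same index, so the distinction is invisible there; the sketch below uses two made-up Series with partly different labels to make it visible.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['z', 'y', 'w'])

# Values are matched by label: 'y' -> 2 + 20, 'z' -> 3 + 10
# Labels present in only one operand ('w' and 'x') yield NaN
total = a + b
print(total)
```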

## Converting a pd.DataFrame to an np.ndarray

In [39]:
# to_numpy() returns the data as a NumPy array (requires pandas >= 0.24)
df.to_numpy()

array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
       [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
       [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
       [5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
       [4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
       [5.0, 3.4, 1.5, 0.2, 'Iris-setosa'],
       [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.1, 1.5, 0.1, 'Iris-setosa'],
       [5.4, 3.7, 1.5, 0.2, 'Iris-setosa'],
       [4.8, 3.4, 1.6, 0.2, 'Iris-setosa'],
       [4.8, 3.0, 1.4, 0.1, 'Iris-setosa'],
       [4.3, 3.0, 1.1, 0.1, 'Iris-setosa'],
       [5.8, 4.0, 1.2, 0.2, 'Iris-setosa'],
       [5.7, 4.4, 1.5, 0.4, 'Iris-setosa'],
       [5.4, 3.9, 1.3, 0.4, 'Iris-setosa'],
       [5.1, 3.5, 1.4, 0.3, 'Iris-setosa'],
       [5.7, 3.8, 1.7, 0.3, 'Iris-setosa'],
       [5.1, 3.8, 1.5, 0.3, 'Iris-setosa'],
       [5.4, 3.4, 1.7, 0.2, 'Iris-setosa'],
       [5.1, 3.7, 1.5, 0.4, 'Iris-setosa'],
       [4.6, 3.6, 1.0, 0.2, 'Iri

In [40]:
type(df.to_numpy())

numpy.ndarray

In [41]:
df['species'].to_numpy()

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versic

In [42]:
type(df['species'].to_numpy())

numpy.ndarray
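
The element type of the resulting array depends on which columns are converted. A minimal sketch, assuming pandas 0.24 or later for `to_numpy()` and using a made-up two-row frame:

```python
import pandas as pd

toy = pd.DataFrame({'length': [5.1, 4.9],
                    'width': [3.5, 3.0],
                    'species': ['setosa', 'setosa']})

mixed = toy.to_numpy()                          # numbers and strings -> dtype object
numeric = toy[['length', 'width']].to_numpy()   # floats only -> dtype float64

print(mixed.dtype)
print(numeric.dtype)
```

Selecting only the numeric columns before converting keeps the array in a numeric dtype, which is usually what numerical libraries expect.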