# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
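
As a minimal sketch of the "Series sharing one index" idea (the names `w`, `x`, and `frame` and their values are illustrative, not part of the original notebook), a DataFrame can be assembled from Series objects, and pulling a column back out yields a Series on the same index:

```python
import pandas as pd

# Two Series built on the same index labels (illustrative values)
w = pd.Series([2, 6, 2], index=['A', 'B', 'C'])
x = pd.Series([79, -29, 21], index=['A', 'B', 'C'])

# Combining them column-wise gives a DataFrame that shares that index
frame = pd.DataFrame({'W': w, 'X': x})

col = frame['W']
print(type(col).__name__)              # Series
print(col.index.equals(frame.index))   # True
```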

In [1]:
import pandas as pd
import numpy as np
from numpy.random import randint

In [2]:
# creating the column and index labels
columns = ['W', 'X', 'Y', 'Z'] # four columns
index = ['A', 'B', 'C', 'D', 'E'] # five rows

In [3]:
# randint returns random integers in [low, high) with the requested shape
np.random.seed(42)
data = randint(-100,100,(5,4))

In [4]:
data

array([[  2,  79,  -8, -86],
       [  6, -29,  88, -80],
       [  2,  21, -26, -13],
       [ 16,  -1,   3,  51],
       [ 30,  49, -48, -99]])

In [5]:
# creating the DataFrame with the specified index and columns
df = pd.DataFrame(data, index=index, columns=columns)

In [6]:
df

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99


# Selection and Indexing

Let's learn the various methods to grab data from a DataFrame.

# COLUMNS

## Grab a single column

In [7]:
# single column
df['W']

A     2
B     6
C     2
D    16
E    30
Name: W, dtype: int64

In [8]:
df.X

A    79
B   -29
C    21
D    -1
E    49
Name: X, dtype: int64
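
Attribute access such as `df.X` is convenient, but it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The sketch below uses a hypothetical frame `demo` (not from the notebook) to illustrate why bracket notation is the safer default:

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a DataFrame method,
# another contains a space
demo = pd.DataFrame({'X': [79, -29], 'count': [1, 2], 'my col': [3, 4]})

print(demo.X.equals(demo['X']))   # attribute and bracket access agree here
print(callable(demo.count))       # demo.count is the DataFrame.count method
print(demo['count'].tolist())     # brackets reach the actual column
print(demo['my col'].tolist())    # names with spaces also need brackets
```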

In [9]:
# more than one column: pass a list of column names
df[['W','Z']]

    W   Z
A   2 -86
B   6 -80
C   2 -13
D  16  51
E  30 -99


### Creating a new column:

In [10]:
# new column defined as the sum of two others
df['new'] = df['W'] + df['Y']

In [11]:
df

    W   X   Y   Z  new
A   2  79  -8 -86   -6
B   6 -29  88 -80   94
C   2  21 -26 -13  -24
D  16  -1   3  51   19
E  30  49 -48 -99  -18
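
Bracket assignment modifies the DataFrame in place. As a hedged alternative sketch, `DataFrame.assign` builds the same column on a copy and leaves the original untouched; `df_demo` and `with_new` are illustrative names, not part of the notebook:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [2, 6], 'Y': [-8, 88]}, index=['A', 'B'])

# assign() returns a new DataFrame and leaves df_demo untouched
with_new = df_demo.assign(new=df_demo['W'] + df_demo['Y'])

print(with_new['new'].tolist())   # [-6, 94]
print('new' in df_demo.columns)   # False
```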


## Removing Columns

In [12]:
# axis=1 because we are dropping a column
df.drop('new',axis=1)

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99


In [13]:
# drop does not remove the column permanently
df

    W   X   Y   Z  new
A   2  79  -8 -86   -6
B   6 -29  88 -80   94
C   2  21 -26 -13  -24
D  16  -1   3  51   19
E  30  49 -48 -99  -18


In [14]:
# passing inplace=True makes the change permanent
df.drop('new',axis=1, inplace=True)

In [15]:
df

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99
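
The `axis` argument can be easy to misremember. A small sketch, using an illustrative frame `df_demo` rather than the notebook's `df`, shows the `columns=` keyword of `drop`, which expresses the same operation without an axis number:

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[2, 79, -6], [6, -29, 94]],
    index=['A', 'B'],
    columns=['W', 'X', 'new'],
)

# columns= is equivalent to passing the label with axis=1
without_new = df_demo.drop(columns='new')

print(without_new.columns.tolist())   # ['W', 'X']
print('new' in df_demo.columns)       # True: drop returned a new object
```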


## Working with Rows

## Selecting one row by name

In [16]:
df.loc['A']

W     2
X    79
Y    -8
Z   -86
Name: A, dtype: int64

## Selecting multiple rows by name

In [17]:
df.loc[['A','C']]

   W   X   Y   Z
A  2  79  -8 -86
C  2  21 -26 -13


## Select single row by integer index location

In [18]:
df.iloc[0]

W     2
X    79
Y    -8
Z   -86
Name: A, dtype: int64

## Select multiple rows by integer index location

In [19]:
df.iloc[0:2]

   W   X   Y   Z
A  2  79  -8 -86
B  6 -29  88 -80
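
One detail worth keeping in mind when slicing rows: `iloc` follows Python's convention and excludes the stop position, while a label slice with `loc` includes the stop label. A short sketch on an illustrative frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [2, 6, 2], 'X': [79, -29, 21]},
    index=['A', 'B', 'C'],
)

by_position = df_demo.iloc[0:2]   # positions 0 and 1; the stop is excluded
by_label = df_demo.loc['A':'B']   # labels A through B; the stop is included

print(by_position.index.tolist())            # ['A', 'B']
print(by_label.index.tolist())               # ['A', 'B']
print(df_demo.loc['A':'C'].index.tolist())   # ['A', 'B', 'C']
```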


## Remove row by name

In [20]:
# axis=0 is the default
df.drop('C',axis=0)

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
D  16  -1   3  51
E  30  49 -48 -99


In [21]:
# drop does not remove the row permanently
df

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99


In [22]:
# we can reassign the result instead; this is generally preferred, since inplace may be deprecated in a future release
df = df.drop('E',axis=0)

In [23]:
df

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51


### Selecting subset of rows and columns at same time

In [24]:
# a specific value: pass the row label and the column label
df.loc['A', 'W']

2

In [25]:
# multiple rows and columns
df.loc[['A','C'],['W','Y']]

   W   Y
A  2  -8
C  2 -26


# Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [27]:
# recreating the same DataFrame
np.random.seed(42)
data = randint(-100,100,(5,4))
df = pd.DataFrame(data, index, columns)

In [28]:
# boolean mask
df>0

      W      X      Y      Z
A  True   True  False  False
B  True  False   True  False
C  True   True  False  False
D  True  False   True   True
E  True   True  False  False


In [29]:
# where the mask is False, pandas fills in NaN (columns that receive NaN become float)
df[df>0]

    W     X     Y     Z
A   2  79.0   NaN   NaN
B   6   NaN  88.0   NaN
C   2  21.0   NaN   NaN
D  16   NaN   3.0  51.0
E  30  49.0   NaN   NaN
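
Indexing a DataFrame with a boolean DataFrame keeps the original shape and masks the entries where the condition is False. The sketch below, on an illustrative frame `df_demo`, suggests that this matches an explicit `where` call and shows which columns received NaN:

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[2, 79, -8], [6, -29, 88]],
    index=['A', 'B'],
    columns=['W', 'X', 'Y'],
)

masked = df_demo[df_demo > 0]

# Same result as the explicit where() call
print(masked.equals(df_demo.where(df_demo > 0)))   # True

# Number of masked (NaN) entries per column: W has none, X and Y have one each
print(masked.isna().sum().tolist())                # [0, 1, 1]

# Columns that received NaN are stored as floats
print(masked['X'].dtype)                           # float64
```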


In [30]:
# a condition on ONE column returns a boolean Series
df['X'] > 0

A     True
B    False
C     True
D    False
E     True
Name: X, dtype: bool

In [31]:
# returns only the ROWS where the condition is True
df[df['X'] > 0]

    W   X   Y   Z
A   2  79  -8 -86
C   2  21 -26 -13
E  30  49 -48 -99


In [33]:
# we can grab a single column from the filtered DataFrame above: a Series
df[df['X']>0]['W']

A     2
C     2
E    30
Name: W, dtype: int64

In [34]:
# multiple columns: a DataFrame
df[df['X']>0][['Y','Z']]

    Y   Z
A  -8 -86
C -26 -13
E -48 -99


In [35]:
# we can use .loc/.iloc on the filtered DataFrame
df[df['X']>0].iloc[0]  # row 0

W     2
X    79
Y    -8
Z   -86
Name: A, dtype: int64
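
The examples above chain two indexing steps: filter the rows, then pick columns. A hedged alternative is to pass the boolean mask and the column list to `.loc` in one step, which is also the form the pandas documentation recommends when assigning values. The frame `df_demo` and the name `mask` are illustrative:

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[2, 79, -8, -86], [6, -29, 88, -80], [2, 21, -26, -13]],
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y', 'Z'],
)

mask = df_demo['X'] > 0

chained = df_demo[mask][['Y', 'Z']]      # two indexing steps
single = df_demo.loc[mask, ['Y', 'Z']]   # rows and columns in one step

print(chained.equals(single))            # True
print(single.index.tolist())             # ['A', 'C']
```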

To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [36]:
# only the rows that satisfy both conditions
df[(df['W']>0) & (df['Y']>1)]

    W   X   Y   Z
B   6 -29  88 -80
D  16  -1   3  51


In [37]:
# since the result is a new DataFrame, we can select the columns we want
df[(df['W']>0) & (df['Y']>1)][['W', 'Y']]

    W   Y
B   6  88
D  16   3
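
The notebook only demonstrates `&`. As a complementary sketch on an illustrative frame `df_demo`, the same pattern works with `|` (or) and `~` (not), whereas Python's `and` keyword raises an error because a Series has no single truth value:

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[2, 79, -8, -86], [6, -29, 88, -80], [16, -1, 3, 51]],
    index=['A', 'B', 'D'],
    columns=['W', 'X', 'Y', 'Z'],
)

either = df_demo[(df_demo['X'] > 0) | (df_demo['Z'] > 0)]   # X or Z positive
not_x = df_demo[~(df_demo['X'] > 0)]                        # X not positive

print(either.index.tolist())   # ['A', 'D']
print(not_x.index.tolist())    # ['B', 'D']

# The `and` keyword asks for a single True/False, which a Series cannot give
try:
    (df_demo['X'] > 0) and (df_demo['Z'] > 0)
    raised = False
except ValueError:
    raised = True

print(raised)                  # True
```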


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [38]:
# reset to a numeric index and turn the old index into a column
df.reset_index()

  index   W   X   Y   Z
0     A   2  79  -8 -86
1     B   6 -29  88 -80
2     C   2  21 -26 -13
3     D  16  -1   3  51
4     E  30  49 -48 -99


In [39]:
# the change is not saved; use inplace=True or df = df.reset_index() to keep it
df

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99


In [43]:
# creating a new index: first add the labels as a column
new_ind = ['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = new_ind

In [44]:
df

    W   X   Y   Z States
A   2  79  -8 -86     CA
B   6 -29  88 -80     NY
C   2  21 -26 -13     WY
D  16  -1   3  51     OR
E  30  49 -48 -99     CO


In [47]:
# instead of resetting, we set the new column as the index
df = df.set_index('States')   # note that 'States' becomes the name of the index, not a column like 'W', 'X', 'Y', 'Z'

In [48]:
# the reassignment above made the change permanent
df

         W   X   Y   Z
States
CA       2  79  -8 -86
NY       6 -29  88 -80
WY       2  21 -26 -13
OR      16  -1   3  51
CO      30  49 -48 -99


In [50]:
# States is the index, not a column
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')
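
To tie `set_index` and `reset_index` together, here is a minimal round-trip sketch on an illustrative frame `df_demo` (not the notebook's `df`). Note that `set_index` discards the previous index labels, so the round trip does not bring back 'A' and 'B':

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [2, 6], 'States': ['CA', 'NY']},
    index=['A', 'B'],
)

# Promote the States column to the index; the old labels A, B are dropped
indexed = df_demo.set_index('States')
print(indexed.index.name)          # States
print(indexed.columns.tolist())    # ['W']

# Move the index back into a regular column with a fresh 0..n-1 index
restored = indexed.reset_index()
print(restored.columns.tolist())   # ['States', 'W']
print(restored.index.tolist())     # [0, 1]
```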

## DataFrame Summaries
There are a couple of ways to obtain summary data on DataFrames.<br>
<tt><strong>df.describe()</strong></tt> provides summary statistics on all numerical columns.<br>
<tt><strong>df.info()</strong></tt> and <tt><strong>df.dtypes</strong></tt> display the data type of each column; <tt>df.info()</tt> also reports non-null counts and memory usage.

In [51]:
# summary statistics for the numeric columns
df.describe()

              W          X          Y          Z
count   5.00000   5.000000   5.000000   5.000000
mean   11.20000  23.800000   1.800000 -45.400000
std    11.96662  42.109381  51.915316  63.366395
min     2.00000 -29.000000 -48.000000 -99.000000
25%     2.00000  -1.000000 -26.000000 -86.000000
50%     6.00000  21.000000  -8.000000 -80.000000
75%    16.00000  49.000000   3.000000 -13.000000
max    30.00000  79.000000  88.000000  51.000000


In [52]:
# column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, CA to CO
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   W       5 non-null      int64
 1   X       5 non-null      int64
 2   Y       5 non-null      int64
 3   Z       5 non-null      int64
dtypes: int64(4)
memory usage: 200.0+ bytes


In [58]:
# data types of the columns
df.dtypes

W    int64
X    int64
Y    int64
Z    int64
dtype: object
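
Because `describe()` returns an ordinary DataFrame, the selection tools from the earlier sections apply to it as well. A brief sketch on an illustrative frame `df_demo` that mixes a numeric and a text column:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [2, 6, 2, 16, 30], 'States': ['CA', 'NY', 'WY', 'OR', 'CO']}
)

summary = df_demo.describe()        # numeric columns only by default
print(summary.columns.tolist())     # ['W']
print(summary.loc['mean', 'W'])     # 11.2

# dtypes is a Series indexed by column name
print(df_demo.dtypes['W'])          # int64
```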