# DataFrame Operations

Today we will look at some fundamental DataFrame operations:

1. **Renaming columns/rows**;
1. **Reordering rows**;
1. **Removing columns/rows**;
1. **Filtering data**;
1. **Creating conditional columns**.

Let's start by importing the NumPy and pandas libraries (following the import convention `np` and `pd`) and using the `pd.read_csv()` function to load a text file with data about cars.

Don't worry about how `pd.read_csv()` works for now: in the next class we will learn how to load different kinds of data files, including `.csv` files!

In [2]:
import pandas as pd
import numpy as np
tb_veic = pd.read_csv('data/dados_veiculos.csv')

We can use the `.info()` method to see what information the table contains:

In [3]:
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

As we can see above, the table has 15 columns: 9 numeric and 6 of type `object`. In addition, every row appears to have all variables filled in. We can use the `.head()` method to view the first rows of the table:

In [4]:
tb_veic.head()

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550


Which method can we use to get a quick description of the numeric columns of this table?

In [17]:
tb_veic[['Make','Model']].describe()

Unnamed: 0,Make,Model
count,35952,35952
unique,127,3608
top,Chevrolet,F150 Pickup 2WD
freq,3643,197
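
The cell above summarises two text columns. For the question about numeric columns, a minimal sketch on a made-up three-row table (`tb_exemplo` and its values are illustrative assumptions, not taken from the dataset) suggests that calling `.describe()` with no arguments is enough, since by default it keeps only the numeric columns:

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({
    'Make': ['A', 'B', 'C'],
    'Year': [1984, 1985, 1987],
    'City MPG': [18, 13, 16],
})

# With the default arguments, only numeric columns are summarised
resumo_num = tb_exemplo.describe()
print(resumo_num.columns.tolist())
print(resumo_num.loc['mean'])
```

The text column `Make` is left out of the result because the table also has numeric columns.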


## Manipulating Columns

DataFrames are tables, so we can perform operations on their **columns** or on their **rows**. Let's start with some basic operations on the columns of a DataFrame.

### The `.columns` attribute

As we saw in the previous class, every DataFrame has attributes that give us access to its indexes. The attribute for the column index is `.columns`. Let's use it to solve a basic problem: by default, the `.describe()` method only returns results for non-numeric columns when the DataFrame has no numeric columns at all.

However, we often want the summary the method provides for `object` columns (which, as a rule, hold strings). How can we do that?

In [6]:
tb_veic.columns

Index(['Make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
       'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
       'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
       'CO2 Emission Grams/Mile', 'Fuel Cost/Year'],
      dtype='object')

In [7]:
tb_veic.columns[0]

'Make'

In [10]:
print(tb_veic['Year'].dtype)

int64


In [21]:
tb_veic['Model'].dtype == object

True

In [22]:
var_str = []
for column in tb_veic.columns:
    if tb_veic[column].dtype == object:
        var_str.append(column)
print(var_str)

['Make', 'Model', 'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type']


In [23]:
tb_veic[var_str].describe()

Unnamed: 0,Make,Model,Transmission,Drivetrain,Vehicle Class,Fuel Type
count,35952,35952,35952,35952,35952,35952
unique,127,3608,45,8,34,13
top,Chevrolet,F150 Pickup 2WD,Automatic 4-spd,Front-Wheel Drive,Compact Cars,Regular
freq,3643,197,10585,13044,5185,23587


In [24]:
var_str = [column for column in tb_veic.columns if tb_veic[column].dtype == object]
print(var_str)

['Make', 'Model', 'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type']
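
As an alternative to looping over `.columns`, pandas also offers the `.select_dtypes()` method, which picks columns by type. The sketch below uses a hypothetical toy table (`tb_exemplo` is an assumption for illustration); excluding `np.number` keeps only the non-numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({
    'Make': ['A', 'B', 'A'],
    'Model': ['x', 'y', 'z'],
    'Year': [1984, 1985, 1987],
})

# Keep every column that is not numeric
tb_texto = tb_exemplo.select_dtypes(exclude=np.number)
var_str = tb_texto.columns.tolist()
print(var_str)

# Summary of the non-numeric columns only
print(tb_texto.describe())
```

This avoids comparing dtypes by hand, at the cost of hiding the loop that the cells above make explicit.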


### Renaming columns
There are two ways to rename columns:

- As we have seen, `.columns` is an **attribute** of a DataFrame that holds an iterable with the column names:
    - We can replace this attribute with another iterable of the same length;
    - Because we replace the whole attribute, we must specify the names of all columns (even those that do not change).

- `.rename()` is a DataFrame **method** that lets us rename columns selectively:
    - It takes a dictionary in the format `{'old_name': 'new_name'}` to rename the columns;
    - `.rename()` can rename both columns and rows, so we need to pass `axis = 1` to rename columns.

#### Using the `.columns` attribute
First, let's learn how to rename columns through the `.columns` attribute. We will also look at practical situations where this approach is the right one.

In [25]:
print(tb_veic.columns)

Index(['Make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
       'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
       'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
       'CO2 Emission Grams/Mile', 'Fuel Cost/Year'],
      dtype='object')


Using a list comprehension we can create a new list of column names:

In [26]:
[f'C_{i}' for i in range(len(tb_veic.columns))]

['C_0',
 'C_1',
 'C_2',
 'C_3',
 'C_4',
 'C_5',
 'C_6',
 'C_7',
 'C_8',
 'C_9',
 'C_10',
 'C_11',
 'C_12',
 'C_13',
 'C_14']

In [27]:
old_names = tb_veic.columns
print(old_names)
new_names = [f'C_{i}' for i in range(len(old_names))]
print(new_names)

Index(['Make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
       'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
       'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
       'CO2 Emission Grams/Mile', 'Fuel Cost/Year'],
      dtype='object')
['C_0', 'C_1', 'C_2', 'C_3', 'C_4', 'C_5', 'C_6', 'C_7', 'C_8', 'C_9', 'C_10', 'C_11', 'C_12', 'C_13', 'C_14']


Now we can replace the `.columns` attribute and use the `.info()` method to verify that the columns were renamed:

In [28]:
tb_veic.columns = new_names
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   C_0     35952 non-null  object 
 1   C_1     35952 non-null  object 
 2   C_2     35952 non-null  int64  
 3   C_3     35952 non-null  float64
 4   C_4     35952 non-null  float64
 5   C_5     35952 non-null  object 
 6   C_6     35952 non-null  object 
 7   C_7     35952 non-null  object 
 8   C_8     35952 non-null  object 
 9   C_9     35952 non-null  float64
 10  C_10    35952 non-null  int64  
 11  C_11    35952 non-null  int64  
 12  C_12    35952 non-null  int64  
 13  C_13    35952 non-null  float64
 14  C_14    35952 non-null  int64  
dtypes: float64(4), int64(5), object(6)
memory usage: 4.1+ MB
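
One caveat when assigning to `.columns`: the new iterable must have exactly one name per column, otherwise pandas raises a `ValueError`. A minimal sketch on a hypothetical one-row table (`tb_exemplo` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy table with three columns
tb_exemplo = pd.DataFrame({'Make': ['A'], 'Model': ['x'], 'Year': [1984]})

try:
    # Only two names for three columns: pandas refuses the assignment
    tb_exemplo.columns = ['C_0', 'C_1']
    erro = None
except ValueError as exc:
    erro = exc

print(type(erro).__name__)
print(tb_exemplo.columns.tolist())
```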


We can use the old names stored in the `old_names` variable to restore the table:

In [30]:
tb_veic.columns = old_names
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

The example above shows how we can use the `.columns` attribute to rename columns, but it is not very practical. Let's look at a real case where this technique is used to clean up the column names of a table:

In [32]:
import re

pattern = r'[^a-zA-Z0-9]'
print(re.findall(pattern, tb_veic.columns[9]))

[' ', '/']


The pattern above matches every character that is **NOT** an ASCII letter or digit. We can use the `re.sub()` function to *clean* the column names, replacing spaces and slashes with `_`:

In [33]:
tb_veic.columns[9]

'Fuel Barrels/Year'

In [34]:
re.sub(pattern, '_', tb_veic.columns[9].lower())

'fuel_barrels_year'
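
The same substitution can be wrapped in a small helper. The sketch below is one possible variant, not the notebook's own code: the function name `limpa_nome` is hypothetical, and the `+` added to the pattern is an assumption that collapses runs of consecutive non-alphanumeric characters into a single `_`.

```python
import re

# Assumed variant of the pattern: '+' matches runs of such characters
pattern_runs = r'[^a-zA-Z0-9]+'

def limpa_nome(nome):
    """Lowercase a column name and replace non-alphanumeric runs with '_'."""
    return re.sub(pattern_runs, '_', nome.lower()).strip('_')

print(limpa_nome('Fuel Barrels/Year'))          # fuel_barrels_year
print(limpa_nome('CO2 Emission (Grams/Mile)'))  # co2_emission_grams_mile
```

Stripping leading and trailing underscores is a design choice that keeps names such as `'(Grams/Mile)'` from ending in `_`.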

In [35]:
print([re.sub(pattern, '_', column.lower()) for column in tb_veic.columns])

['make', 'model', 'year', 'engine_displacement', 'cylinders', 'transmission', 'drivetrain', 'vehicle_class', 'fuel_type', 'fuel_barrels_year', 'city_mpg', 'highway_mpg', 'combined_mpg', 'co2_emission_grams_mile', 'fuel_cost_year']
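
A list comprehension is one option. The string accessor of the column index (`.columns.str`) offers a vectorised alternative. A sketch on a hypothetical two-column table (`tb_exemplo` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({'Fuel Barrels/Year': [19.4], 'City MPG': [18]})

pattern = r'[^a-zA-Z0-9]'
# .str applies the string operations to every column name at once
nomes_limpos = tb_exemplo.columns.str.lower().str.replace(pattern, '_', regex=True)
print(nomes_limpos.tolist())
```

The result is an `Index`, so it can be assigned straight back to `.columns`.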


In [36]:
tb_veic.columns = [re.sub(pattern, '_', column.lower()) for column in tb_veic.columns]
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   make                     35952 non-null  object 
 1   model                    35952 non-null  object 
 2   year                     35952 non-null  int64  
 3   engine_displacement      35952 non-null  float64
 4   cylinders                35952 non-null  float64
 5   transmission             35952 non-null  object 
 6   drivetrain               35952 non-null  object 
 7   vehicle_class            35952 non-null  object 
 8   fuel_type                35952 non-null  object 
 9   fuel_barrels_year        35952 non-null  float64
 10  city_mpg                 35952 non-null  int64  
 11  highway_mpg              35952 non-null  int64  
 12  combined_mpg             35952 non-null  int64  
 13  co2_emission_grams_mile  35952 non-null  float64
 14  fuel_cost_year        

In [37]:
tb_veic.columns = old_names
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

#### Using the `.rename()` method

A drawback of the approach above is that, to rename a single column, we have to build a list with all the original names, changing only the one to be renamed. In that case it is better to use the `.rename()` method: it lets us rename one (or more) columns from a dictionary `{'old_name' : 'new_name'}`:

In [38]:
dict_nomes = dict()
dict_nomes['Year'] = 'model_year'
print(dict_nomes)

{'Year': 'model_year'}


In [39]:
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

In [40]:
tb_veic = tb_veic.rename(dict_nomes, axis = 1)
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   model_year               35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

In [41]:
tb_veic.columns = old_names
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

By default, the `.rename()` method does not modify the original object: it returns a new DataFrame!

In [42]:
tb_veic.rename({'Year' : 'model_year'}, axis = 1)
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

If we want the method to replace the column names directly (without overwriting the variable that holds the DataFrame), we need to pass the argument `inplace = True`:

In [43]:
tb_veic.rename({'Year' : 'model_year'}, axis = 1, inplace = True)
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   model_year               35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

We can perform the same task as the practical example above with the `.rename()` method. To do so, let's use a dict comprehension!

In [44]:
[re.sub(pattern, '_', column).lower() for column in tb_veic.columns]

['make',
 'model',
 'model_year',
 'engine_displacement',
 'cylinders',
 'transmission',
 'drivetrain',
 'vehicle_class',
 'fuel_type',
 'fuel_barrels_year',
 'city_mpg',
 'highway_mpg',
 'combined_mpg',
 'co2_emission_grams_mile',
 'fuel_cost_year']

In [45]:
pattern = r'[^a-zA-Z0-9]'
dict_rename = {column : re.sub(pattern, '_', column).lower() for column in tb_veic.columns}
print(dict_rename)

{'Make': 'make', 'Model': 'model', 'model_year': 'model_year', 'Engine Displacement': 'engine_displacement', 'Cylinders': 'cylinders', 'Transmission': 'transmission', 'Drivetrain': 'drivetrain', 'Vehicle Class': 'vehicle_class', 'Fuel Type': 'fuel_type', 'Fuel Barrels/Year': 'fuel_barrels_year', 'City MPG': 'city_mpg', 'Highway MPG': 'highway_mpg', 'Combined MPG': 'combined_mpg', 'CO2 Emission Grams/Mile': 'co2_emission_grams_mile', 'Fuel Cost/Year': 'fuel_cost_year'}


In [46]:
tb_veic = tb_veic.rename(dict_rename, axis = 1) #tb_veic.rename(dict_rename, axis = 1, inplace = True)
print(tb_veic.columns)

Index(['make', 'model', 'model_year', 'engine_displacement', 'cylinders',
       'transmission', 'drivetrain', 'vehicle_class', 'fuel_type',
       'fuel_barrels_year', 'city_mpg', 'highway_mpg', 'combined_mpg',
       'co2_emission_grams_mile', 'fuel_cost_year'],
      dtype='object')
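
Besides a dictionary, `.rename()` also accepts a function, which is applied to each column name in turn. A minimal sketch, assuming a hypothetical two-column table (`tb_exemplo`) and the same pattern used above:

```python
import re
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({'Vehicle Class': ['Compact Cars'], 'Fuel Cost/Year': [1950]})

pattern = r'[^a-zA-Z0-9]'
# The function is called once per column name
tb_limpa = tb_exemplo.rename(columns=lambda nome: re.sub(pattern, '_', nome).lower())
print(tb_limpa.columns.tolist())
```

This skips the intermediate dictionary, which can be convenient when every name follows the same rule.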


In [47]:
dict_rename['PEDROPEDROPEDRO'] = 'seila'

In [48]:
tb_veic.rename(dict_rename, axis = 1)

Unnamed: 0,make,model,model_year,engine_displacement,cylinders,transmission,drivetrain,vehicle_class,fuel_type,fuel_barrels_year,city_mpg,highway_mpg,combined_mpg,co2_emission_grams_mile,fuel_cost_year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100
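
The two cells above add a key that matches no column (`'PEDROPEDROPEDRO'`) and `.rename()` still runs without complaint: by default, keys that are not found are silently ignored. If a typo in a column name should not go unnoticed, the `errors='raise'` argument makes the method raise a `KeyError` instead. A sketch on a hypothetical toy table (`tb_exemplo` and `dict_exemplo` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy table and mapping, used only for illustration
tb_exemplo = pd.DataFrame({'make': ['A'], 'model': ['x']})
dict_exemplo = {'make': 'manufacturer', 'coluna_inexistente': 'seila'}

# Default behaviour (errors='ignore'): the unknown key is skipped
print(tb_exemplo.rename(columns=dict_exemplo).columns.tolist())

# With errors='raise', the unknown key triggers a KeyError
try:
    tb_exemplo.rename(columns=dict_exemplo, errors='raise')
    levantou = False
except KeyError:
    levantou = True
print(levantou)
```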


With the `.rename()` method there are two ways to rename the columns of our data (the `columns=` keyword used below is equivalent to passing the dictionary with `axis = 1`):
1. Overwriting the variable that holds our `DataFrame`:

    `data = data.rename(columns={'Make':'Manufacturer', 'Year':'ANO'})`

2. Using the argument `inplace = True`:

    `data.rename(columns={'Make':'Manufacturer', 'Year':'ANO'}, inplace=True)`
    
The `inplace` parameter has been proposed for deprecation in pandas and its use is often considered bad practice, so the first form is generally preferable.
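
One practical pitfall of `inplace=True` is that the method then returns `None`, so mixing the two styles by accident (`data = data.rename(..., inplace=True)`) would replace the DataFrame with `None`. A minimal sketch (the toy table and the variable `retorno` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
data = pd.DataFrame({'Make': ['A'], 'Year': [1984]})

# inplace=True modifies `data` itself and returns None
retorno = data.rename(columns={'Year': 'ANO'}, inplace=True)
print(retorno)
print(data.columns.tolist())

# Reassignment keeps the result in a variable and allows method chaining
data = data.rename(columns={'Make': 'Manufacturer'})
print(data.columns.tolist())
```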

### Reordering columns in a DataFrame

Reordering columns in a DataFrame is simple: recall that we can pass a list of column names as an index to build a new DataFrame whose columns follow the order of the list elements:

In [49]:
print(tb_veic.columns)

Index(['make', 'model', 'model_year', 'engine_displacement', 'cylinders',
       'transmission', 'drivetrain', 'vehicle_class', 'fuel_type',
       'fuel_barrels_year', 'city_mpg', 'highway_mpg', 'combined_mpg',
       'co2_emission_grams_mile', 'fuel_cost_year'],
      dtype='object')


In [51]:
tb_veic.head()

Unnamed: 0,make,model,model_year,engine_displacement,cylinders,transmission,drivetrain,vehicle_class,fuel_type,fuel_barrels_year,city_mpg,highway_mpg,combined_mpg,co2_emission_grams_mile,fuel_cost_year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550


In [50]:
novas_colunas = ['model', 'make']
tb_veic[novas_colunas]

Unnamed: 0,model,make
0,DJ Po Vehicle 2WD,AM General
1,FJ8c Post Office,AM General
2,Post Office DJ5 2WD,AM General
3,Post Office DJ8 2WD,AM General
4,GNX,ASC Incorporated
...,...,...
35947,fortwo coupe,smart
35948,fortwo coupe,smart
35949,fortwo coupe,smart
35950,fortwo coupe,smart


Let's use the list method `.sort()` to create a list of columns in alphabetical order:

In [52]:
lista_colunas = list(tb_veic.columns)
lista_colunas.sort()
print(lista_colunas)

['city_mpg', 'co2_emission_grams_mile', 'combined_mpg', 'cylinders', 'drivetrain', 'engine_displacement', 'fuel_barrels_year', 'fuel_cost_year', 'fuel_type', 'highway_mpg', 'make', 'model', 'model_year', 'transmission', 'vehicle_class']


In [53]:
tb_veic = tb_veic[lista_colunas]
tb_veic.head()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
0,18,522.764706,17,4.0,2-Wheel Drive,2.5,19.388824,1950,Regular,17,AM General,DJ Po Vehicle 2WD,1984,Automatic 3-spd,Special Purpose Vehicle 2WD
1,13,683.615385,13,6.0,2-Wheel Drive,4.2,25.354615,2550,Regular,13,AM General,FJ8c Post Office,1984,Automatic 3-spd,Special Purpose Vehicle 2WD
2,16,555.4375,16,4.0,Rear-Wheel Drive,2.5,20.600625,2100,Regular,17,AM General,Post Office DJ5 2WD,1985,Automatic 3-spd,Special Purpose Vehicle 2WD
3,13,683.615385,13,6.0,Rear-Wheel Drive,4.2,25.354615,2550,Regular,13,AM General,Post Office DJ8 2WD,1985,Automatic 3-spd,Special Purpose Vehicle 2WD
4,14,555.4375,16,6.0,Rear-Wheel Drive,3.8,20.600625,2550,Premium,21,ASC Incorporated,GNX,1987,Automatic 4-spd,Midsize Cars
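
For the alphabetical case, pandas also provides `.sort_index(axis=1)`, which sorts the column labels directly. The sketch below shows it on a hypothetical toy table, together with a common variation: moving one column to the front while keeping the others in place (`tb_exemplo`, `primeira` and `demais` are illustrative names):

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({'model': ['x'], 'make': ['A'], 'city_mpg': [18]})

# Sort the column labels alphabetically
tb_ordenada = tb_exemplo.sort_index(axis=1)
print(tb_ordenada.columns.tolist())

# Move one column to the front, keeping the others in their current order
primeira = 'make'
demais = [c for c in tb_exemplo.columns if c != primeira]
tb_frente = tb_exemplo[[primeira] + demais]
print(tb_frente.columns.tolist())
```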


### Removing/Keeping columns

To remove one (or more) columns from a DataFrame we can proceed in two ways:

- Using indexing to select all the columns we want in the new DataFrame;
- Using the `.drop()` method to remove specific columns.

When we want to remove many columns, it may be worth specifying which columns to keep (first way). Otherwise, we can use the `.drop()` method:

In [54]:
colunas_manter = ['city_mpg', 'combined_mpg']
tb_veic_sub = tb_veic[colunas_manter]
tb_veic_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   city_mpg      35952 non-null  int64
 1   combined_mpg  35952 non-null  int64
dtypes: int64(2)
memory usage: 561.9 KB


In [55]:
[coluna for coluna in tb_veic.columns if pd.api.types.is_numeric_dtype(tb_veic[coluna])]

['city_mpg',
 'co2_emission_grams_mile',
 'combined_mpg',
 'cylinders',
 'engine_displacement',
 'fuel_barrels_year',
 'fuel_cost_year',
 'highway_mpg',
 'model_year']

In [56]:
colunas_num = [coluna for coluna in tb_veic.columns if pd.api.types.is_numeric_dtype(tb_veic[coluna])]
print(colunas_num)

['city_mpg', 'co2_emission_grams_mile', 'combined_mpg', 'cylinders', 'engine_displacement', 'fuel_barrels_year', 'fuel_cost_year', 'highway_mpg', 'model_year']


In [57]:
tb_veic_num = tb_veic[colunas_num]
tb_veic_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   city_mpg                 35952 non-null  int64  
 1   co2_emission_grams_mile  35952 non-null  float64
 2   combined_mpg             35952 non-null  int64  
 3   cylinders                35952 non-null  float64
 4   engine_displacement      35952 non-null  float64
 5   fuel_barrels_year        35952 non-null  float64
 6   fuel_cost_year           35952 non-null  int64  
 7   highway_mpg              35952 non-null  int64  
 8   model_year               35952 non-null  int64  
dtypes: float64(4), int64(5)
memory usage: 2.5 MB
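
Selecting the numeric columns can also be delegated to `.select_dtypes()`, which does not depend on how a given platform maps Python's `int` and `float` to NumPy types. A minimal sketch on a hypothetical toy table (`tb_exemplo` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({
    'make': ['A', 'B'],
    'city_mpg': [18, 13],
    'cylinders': [4.0, 6.0],
})

# 'number' covers integer and float columns of any width
tb_num = tb_exemplo.select_dtypes(include='number')
print(tb_num.columns.tolist())
```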


Let's go back to the exercise from the beginning of the class:

In [58]:
colunas_str = [coluna for coluna in tb_veic.columns if tb_veic[coluna].dtype == object]
tb_veic[colunas_str].describe()

Unnamed: 0,drivetrain,fuel_type,make,model,transmission,vehicle_class
count,35952,35952,35952,35952,35952,35952
unique,8,13,127,3608,45,34
top,Front-Wheel Drive,Regular,Chevrolet,F150 Pickup 2WD,Automatic 4-spd,Compact Cars
freq,13044,23587,3643,197,10585,5185


The techniques above let us select columns in a simple way based on specific conditions. However, we often want to remove just one or two columns from a table (for example, columns with incorrect information). For that, we can use the `.drop()` method:

In [59]:
tb_veic_sm = tb_veic.drop('make', axis = 1)
tb_veic_sm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   city_mpg                 35952 non-null  int64  
 1   co2_emission_grams_mile  35952 non-null  float64
 2   combined_mpg             35952 non-null  int64  
 3   cylinders                35952 non-null  float64
 4   drivetrain               35952 non-null  object 
 5   engine_displacement      35952 non-null  float64
 6   fuel_barrels_year        35952 non-null  float64
 7   fuel_cost_year           35952 non-null  int64  
 8   fuel_type                35952 non-null  object 
 9   highway_mpg              35952 non-null  int64  
 10  model                    35952 non-null  object 
 11  model_year               35952 non-null  int64  
 12  transmission             35952 non-null  object 
 13  vehicle_class            35952 non-null  object 
dtypes: float64(4), int64(5

This method also lets us remove more than one column (by passing an iterable) and accepts the `inplace` argument:

In [60]:
tb_veic_sm.drop(['transmission', 'model_year'], axis = 1, inplace = True)
tb_veic_sm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   city_mpg                 35952 non-null  int64  
 1   co2_emission_grams_mile  35952 non-null  float64
 2   combined_mpg             35952 non-null  int64  
 3   cylinders                35952 non-null  float64
 4   drivetrain               35952 non-null  object 
 5   engine_displacement      35952 non-null  float64
 6   fuel_barrels_year        35952 non-null  float64
 7   fuel_cost_year           35952 non-null  int64  
 8   fuel_type                35952 non-null  object 
 9   highway_mpg              35952 non-null  int64  
 10  model                    35952 non-null  object 
 11  vehicle_class            35952 non-null  object 
dtypes: float64(4), int64(4), object(4)
memory usage: 3.3+ MB
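
Two other arguments of `.drop()` may be useful here: `columns=`, which replaces the pair "labels plus `axis = 1`", and `errors='ignore'`, which skips labels that do not exist instead of raising a `KeyError`. A sketch on a hypothetical toy table (`tb_exemplo` and the label `'nao_existe'` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
tb_exemplo = pd.DataFrame({'make': ['A'], 'model': ['x'], 'model_year': [1984]})

# columns= is equivalent to passing the labels with axis = 1
tb_menor = tb_exemplo.drop(columns=['make', 'model_year'])
print(tb_menor.columns.tolist())

# errors='ignore' skips labels that are not present instead of raising KeyError
tb_segura = tb_exemplo.drop(columns=['make', 'nao_existe'], errors='ignore')
print(tb_segura.columns.tolist())
```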


## Manipulating Rows

In [61]:
tb_veic = tb_veic.sort_values('model_year')
tb_veic

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
0,18,522.764706,17,4.0,2-Wheel Drive,2.5,19.388824,1950,Regular,17,AM General,DJ Po Vehicle 2WD,1984,Automatic 3-spd,Special Purpose Vehicle 2WD
16337,15,522.764706,17,6.0,4-Wheel or All-Wheel Drive,2.8,19.388824,1950,Regular,21,GMC,T15 (S15) Pickup 4WD,1984,Manual 4-spd,Standard Pickup Trucks 4WD
16336,15,493.722222,18,6.0,4-Wheel or All-Wheel Drive,2.8,18.311667,1850,Regular,22,GMC,T15 (S15) Pickup 4WD,1984,Manual 5-spd,Standard Pickup Trucks 4WD
5636,15,522.764706,17,6.0,2-Wheel Drive,3.8,19.388824,1950,Regular,19,Chevrolet,El Camino Pickup 2WD,1984,Automatic 3-spd,Standard Pickup Trucks 2WD
5637,15,493.722222,18,6.0,2-Wheel Drive,3.8,18.311667,1850,Regular,22,Chevrolet,El Camino Pickup 2WD,1984,Automatic 4-spd,Standard Pickup Trucks 2WD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1968,21,357.000000,25,6.0,All-Wheel Drive,3.0,13.184400,1600,Premium,31,BMW,340i xDrive,2017,Automatic (S8),Compact Cars
3137,15,475.000000,19,8.0,All-Wheel Drive,4.0,17.347895,2150,Premium,25,Bentley,Continental GT,2017,Automatic (S8),Compact Cars
1965,19,395.000000,23,6.0,Rear-Wheel Drive,3.0,14.330870,1750,Premium,29,BMW,340i,2017,Manual 6-spd,Compact Cars
21091,20,391.000000,23,4.0,4-Wheel Drive,2.0,14.330870,1750,Premium,28,Land Rover,Range Rover Evoque Convertible,2017,Automatic (S9),Small Sport Utility Vehicle 4WD


In [62]:
# Sort rows by model year, newest first
tb_veic = tb_veic.sort_values('model_year', ascending = False)
tb_veic

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
20450,23,353.000000,25,4.0,Front-Wheel Drive,1.6,13.184400,1350,Regular,29,Kia,Forte 5,2017,Manual 6-spd,Large Cars
14409,18,426.000000,21,6.0,Front-Wheel Drive,3.6,15.695714,1600,Regular,25,GMC,Acadia FWD,2017,Automatic 6-spd,Standard Sport Utility Vehicle 2WD
18122,17,475.000000,19,6.0,All-Wheel Drive,3.3,17.347895,1750,Regular,22,Hyundai,Santa Fe Ultimate AWD,2017,Automatic (S6),Small Sport Utility Vehicle 4WD
24108,23,344.000000,26,4.0,4-Wheel Drive,2.0,12.677308,1550,Premium,31,Mercedes-Benz,GLA250 4matic,2017,Auto(AM7),Small Sport Utility Vehicle 4WD
851,24,330.000000,27,4.0,All-Wheel Drive,2.0,12.207778,1500,Premium,31,Audi,A4 quattro,2017,Auto(AM-S7),Compact Cars
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13777,16,493.722222,18,6.0,2-Wheel Drive,2.8,18.311667,1850,Regular,22,Ford,Ranger Pickup 2WD,1984,Manual 5-spd,Small Pickup Trucks 2WD
31003,22,370.291667,24,4.0,4-Wheel or All-Wheel Drive,1.8,13.733750,1400,Regular,27,Subaru,Brat 4WD,1984,Manual 4-spd,Special Purpose Vehicle 4WD
31002,20,423.190476,21,4.0,4-Wheel or All-Wheel Drive,1.8,15.695714,1600,Regular,23,Subaru,Brat 4WD,1984,Automatic 3-spd,Special Purpose Vehicle 4WD
26752,24,329.148148,27,4.0,2-Wheel Drive,2.0,12.207778,1250,Regular,31,Nissan,Pickup 2WD,1984,Manual 5-spd,Small Pickup Trucks 2WD


In [67]:
# Sort by model year, then by engine displacement, both in descending order
tb_veic = tb_veic.sort_values(['model_year', 'engine_displacement'], ascending = [False, False])
tb_veic.head()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
10889,12,623.0,14,10.0,Rear-Wheel Drive,8.4,23.543571,2900,Premium,19,Dodge,Viper,2017,Manual 6-spd,Two Seaters
3191,11,652.0,14,8.0,Rear-Wheel Drive,6.8,23.543571,2900,Premium,18,Bentley,Mulsanne,2017,Automatic (S8),Midsize Cars
30043,11,638.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom Coupe,2017,Automatic (S8),Compact Cars
30064,11,637.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom EWB,2017,Automatic (S8),Large Cars
30034,11,638.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom,2017,Automatic (S8),Large Cars


In [68]:
# A single `ascending` value applies to every sort key
tb_veic = tb_veic.sort_values(['model_year', 'engine_displacement', 'make'], ascending = False)
tb_veic

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
10889,12,623.000000,14,10.0,Rear-Wheel Drive,8.4,23.543571,2900,Premium,19,Dodge,Viper,2017,Manual 6-spd,Two Seaters
3191,11,652.000000,14,8.0,Rear-Wheel Drive,6.8,23.543571,2900,Premium,18,Bentley,Mulsanne,2017,Automatic (S8),Midsize Cars
30043,11,638.000000,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom Coupe,2017,Automatic (S8),Compact Cars
30064,11,637.000000,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom EWB,2017,Automatic (S8),Large Cars
30034,11,638.000000,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom,2017,Automatic (S8),Large Cars
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16888,31,328.387097,31,4.0,2-Wheel Drive,1.6,12.328548,1150,Diesel,32,Grumman Olson,Kubvan,1984,Manual 5-spd,Special Purpose Vehicle 2WD
31980,22,403.954545,22,4.0,4-Wheel or All-Wheel Drive,1.0,14.982273,1500,Regular,22,Suzuki,SJ 410 4WD,1984,Manual 4-spd,Special Purpose Vehicle 4WD
31982,22,403.954545,22,4.0,4-Wheel or All-Wheel Drive,1.0,14.982273,1500,Regular,22,Suzuki,SJ 410V 4WD,1984,Manual 4-spd,Special Purpose Vehicle 4WD
31984,22,403.954545,22,4.0,4-Wheel or All-Wheel Drive,1.0,14.982273,1500,Regular,22,Suzuki,SJ410K P/U 4WD,1984,Manual 4-spd,Small Pickup Trucks 4WD


In [69]:
print(tb_veic['model_year'].dtype)

int64


### Applying Filters

As we saw in previous classes, filtering is a fundamental concept in programming for data analysis. Let's apply what we have learned to a real dataset.

The Pandas library gives us two ways to filter data:

- Boolean masks;
- The `.query()` method.

We'll start with masks.

In [70]:
cyl_6 = tb_veic['cylinders'] == 6
print(cyl_6)

10889    False
3191     False
30043    False
30064    False
30034    False
         ...  
16888    False
31980    False
31982    False
31984    False
18408    False
Name: cylinders, Length: 35952, dtype: bool


In [71]:
sum(cyl_6)

12765

In [73]:
tb_veic_cyl6 = tb_veic[tb_veic['cylinders'] == 6]
tb_veic_cyl6.head()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
16088,18,448.0,20,6.0,Rear-Wheel Drive,4.3,16.4805,1650,Gasoline or E85,24,GMC,Sierra C15 2WD,2017,Automatic 6-spd,Standard Pickup Trucks 2WD
16168,17,474.0,19,6.0,4-Wheel Drive,4.3,17.347895,1750,Gasoline or E85,22,GMC,Sierra K15 4WD,2017,Automatic 6-spd,Standard Pickup Trucks 4WD
7207,17,473.0,19,6.0,4-Wheel Drive,4.3,17.347895,1750,Gasoline or E85,22,Chevrolet,Silverado K15 4WD,2017,Automatic 6-spd,Standard Pickup Trucks 4WD
7131,18,448.0,20,6.0,Rear-Wheel Drive,4.3,16.4805,1650,Gasoline or E85,24,Chevrolet,Silverado C15 2WD,2017,Automatic 6-spd,Standard Pickup Trucks 2WD
29312,19,428.0,21,6.0,All-Wheel Drive,3.8,15.695714,1950,Premium,24,Porsche,911 Turbo S,2017,Auto(AM-S7),Minicompact Cars


In [74]:
tb_veic_cyl6.describe()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,engine_displacement,fuel_barrels_year,fuel_cost_year,highway_mpg,model_year
count,12765.0,12765.0,12765.0,12765.0,12765.0,12765.0,12765.0,12765.0,12765.0
mean,16.328946,487.609906,18.606189,6.0,3.439342,18.086572,1943.19624,22.661261,2001.294242
std,2.130492,69.240664,2.603443,0.0,0.549034,2.579958,274.865226,3.747362,9.575349
min,9.0,206.0,10.0,6.0,1.8,0.109412,1200.0,10.0,1984.0
25%,15.0,444.35,17.0,6.0,3.0,16.4805,1750.0,20.0,1993.0
50%,16.0,467.736842,19.0,6.0,3.5,17.347895,1950.0,23.0,2002.0
75%,18.0,522.764706,20.0,6.0,3.8,19.388824,2150.0,25.0,2009.0
max,32.0,888.7,31.0,6.0,5.3,32.961,3400.0,38.0,2017.0


#### Combining conditions

We can use the operators `&` (analogous to `and`) and `|` (analogous to `or`) to combine conditions into more complex filters.
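
One detail worth spelling out: in Python, `&` and `|` bind more tightly than comparison operators such as `==`, so each comparison needs its own parentheses. The sketch below illustrates this on a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the vehicle data).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate operator precedence.
toy = pd.DataFrame({
    'make': ['Ford', 'Ford', 'Chevrolet'],
    'cylinders': [6, 8, 6],
})

# Correct: each comparison is wrapped in parentheses before combining with &.
mask_ok = (toy['make'] == 'Ford') & (toy['cylinders'] == 6)
print(mask_ok.tolist())

# Without parentheses, & is applied before ==, so Python first tries to
# evaluate 'Ford' & toy['cylinders'] and the whole expression fails.
try:
    toy['make'] == 'Ford' & toy['cylinders'] == 6
except Exception as exc:
    print('Raised:', type(exc).__name__)
```

Naming each mask in its own variable, as done later in this section, is another way to avoid precedence mistakes.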

Let's start with a simple problem: creating a DataFrame with `Ford` cars that have 6 cylinders.

In [75]:
print(tb_veic.columns)

Index(['city_mpg', 'co2_emission_grams_mile', 'combined_mpg', 'cylinders',
       'drivetrain', 'engine_displacement', 'fuel_barrels_year',
       'fuel_cost_year', 'fuel_type', 'highway_mpg', 'make', 'model',
       'model_year', 'transmission', 'vehicle_class'],
      dtype='object')


In [76]:
tb_veic['make'].value_counts()

Chevrolet                        3643
Ford                             2946
Dodge                            2360
GMC                              2347
Toyota                           1836
                                 ... 
Panoz Auto-Development              1
General Motors                      1
Panos                               1
Goldacre                            1
Import Foreign Auto Sales Inc       1
Name: make, Length: 127, dtype: int64

In [77]:
print(tb_veic['make'] == 'Ford')

10889    False
3191     False
30043    False
30064    False
30034    False
         ...  
16888    False
31980    False
31982    False
31984    False
18408    False
Name: make, Length: 35952, dtype: bool


In [78]:
sum(tb_veic['make'] == 'Ford')

2946

In [79]:
(tb_veic['make'] == 'Ford') & (tb_veic['cylinders'] == 6)

10889    False
3191     False
30043    False
30064    False
30034    False
         ...  
16888    False
31980    False
31982    False
31984    False
18408    False
Length: 35952, dtype: bool

In [80]:
sum((tb_veic['make'] == 'Ford') & (tb_veic['cylinders'] == 6))

1293

In [81]:
mask_ford_6 = (tb_veic['make'] == 'Ford') & (tb_veic['cylinders'] == 6)
tb_ford6 = tb_veic[mask_ford_6]
tb_ford6.describe()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,engine_displacement,fuel_barrels_year,fuel_cost_year,highway_mpg,model_year
count,1293.0,1293.0,1293.0,1293.0,1293.0,1293.0,1293.0,1293.0,1293.0
mean,15.401392,522.593013,17.33488,6.0,3.779892,19.377908,1967.208043,20.777262,1998.260634
std,1.776323,72.817969,2.360996,0.0,0.716619,2.705757,275.83201,3.595862,9.648753
min,11.0,383.0,12.0,6.0,2.5,14.33087,1450.0,13.0,1984.0
25%,14.0,467.736842,15.0,6.0,3.0,17.347895,1750.0,18.0,1990.0
50%,15.0,522.764706,17.0,6.0,3.8,19.388824,1950.0,21.0,1997.0
75%,17.0,592.466667,19.0,6.0,4.2,21.974,2200.0,23.0,2006.0
max,20.0,740.583333,23.0,6.0,4.9,27.4675,2800.0,31.0,2017.0


In [82]:
tb_ford6['make'].value_counts()

Ford    1293
Name: make, dtype: int64

What if we wanted to build a DataFrame with all 6-cylinder Ford cars and all 8-cylinder Chevrolet cars?

In [83]:
# Explicit parentheses around each `&` group make the intended grouping visible
tb_veic[((tb_veic['make'] == 'Ford') & (tb_veic['cylinders'] == 6)) | ((tb_veic['make'] == 'Chevrolet') & (tb_veic['cylinders'] == 8))]

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
4970,16,457.000000,19,8.0,Rear-Wheel Drive,6.2,17.347895,2150,Premium,25,Chevrolet,Camaro,2017,Manual 6-spd,Subcompact Cars
4974,17,442.000000,20,8.0,Rear-Wheel Drive,6.2,16.480500,2000,Premium,27,Chevrolet,Camaro,2017,Automatic (S8),Subcompact Cars
7203,15,531.000000,17,8.0,4-Wheel Drive,6.2,19.388824,1950,Regular,20,Chevrolet,Silverado K15 4WD,2017,Automatic 8-spd,Standard Pickup Trucks 4WD
7129,15,522.000000,17,8.0,Rear-Wheel Drive,6.2,19.388824,1950,Regular,21,Chevrolet,Silverado C15 2WD,2017,Automatic 8-spd,Standard Pickup Trucks 2WD
6955,14,520.000000,17,8.0,Rear-Wheel Drive,6.2,19.388824,2400,Premium,22,Chevrolet,SS,2017,Manual 6-spd,Large Cars
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13771,16,493.722222,18,6.0,2-Wheel Drive,2.8,18.311667,1850,Regular,21,Ford,Ranger Pickup 2WD,1984,Automatic 3-spd,Small Pickup Trucks 2WD
13773,17,467.736842,19,6.0,2-Wheel Drive,2.8,17.347895,1750,Regular,24,Ford,Ranger Pickup 2WD,1984,Manual 5-spd,Small Pickup Trucks 2WD
13775,15,522.764706,17,6.0,2-Wheel Drive,2.8,19.388824,1950,Regular,19,Ford,Ranger Pickup 2WD,1984,Automatic 3-spd,Small Pickup Trucks 2WD
13776,17,467.736842,19,6.0,2-Wheel Drive,2.8,17.347895,1750,Regular,22,Ford,Ranger Pickup 2WD,1984,Manual 4-spd,Small Pickup Trucks 2WD


In [84]:
mask_ford_6 = (tb_veic['make'] == 'Ford') & (tb_veic['cylinders'] == 6)
mask_chevrolet_8 = (tb_veic['make'] == 'Chevrolet') & (tb_veic['cylinders'] == 8)
mask_final = mask_ford_6 | mask_chevrolet_8

tb_ford_chevrolet = tb_veic[mask_final]
tb_ford_chevrolet

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class
4970,16,457.000000,19,8.0,Rear-Wheel Drive,6.2,17.347895,2150,Premium,25,Chevrolet,Camaro,2017,Manual 6-spd,Subcompact Cars
4974,17,442.000000,20,8.0,Rear-Wheel Drive,6.2,16.480500,2000,Premium,27,Chevrolet,Camaro,2017,Automatic (S8),Subcompact Cars
7203,15,531.000000,17,8.0,4-Wheel Drive,6.2,19.388824,1950,Regular,20,Chevrolet,Silverado K15 4WD,2017,Automatic 8-spd,Standard Pickup Trucks 4WD
7129,15,522.000000,17,8.0,Rear-Wheel Drive,6.2,19.388824,1950,Regular,21,Chevrolet,Silverado C15 2WD,2017,Automatic 8-spd,Standard Pickup Trucks 2WD
6955,14,520.000000,17,8.0,Rear-Wheel Drive,6.2,19.388824,2400,Premium,22,Chevrolet,SS,2017,Manual 6-spd,Large Cars
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13771,16,493.722222,18,6.0,2-Wheel Drive,2.8,18.311667,1850,Regular,21,Ford,Ranger Pickup 2WD,1984,Automatic 3-spd,Small Pickup Trucks 2WD
13773,17,467.736842,19,6.0,2-Wheel Drive,2.8,17.347895,1750,Regular,24,Ford,Ranger Pickup 2WD,1984,Manual 5-spd,Small Pickup Trucks 2WD
13775,15,522.764706,17,6.0,2-Wheel Drive,2.8,19.388824,1950,Regular,19,Ford,Ranger Pickup 2WD,1984,Automatic 3-spd,Small Pickup Trucks 2WD
13776,17,467.736842,19,6.0,2-Wheel Drive,2.8,17.347895,1750,Regular,22,Ford,Ranger Pickup 2WD,1984,Manual 4-spd,Small Pickup Trucks 2WD
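
The `.query()` method listed at the start of this section can express the same kind of filter as a string. The following is a minimal sketch on made-up data (`toy` is a hypothetical stand-in for `tb_veic`); it assumes the column names are valid Python identifiers, which holds for the renamed columns used here.

```python
import pandas as pd

# Hypothetical toy data standing in for tb_veic.
toy = pd.DataFrame({
    'make': ['Ford', 'Ford', 'Chevrolet', 'Chevrolet', 'Dodge'],
    'cylinders': [6.0, 8.0, 8.0, 6.0, 8.0],
})

# Boolean-mask version, as in the cells above.
mask_ford_6 = (toy['make'] == 'Ford') & (toy['cylinders'] == 6)
mask_chevrolet_8 = (toy['make'] == 'Chevrolet') & (toy['cylinders'] == 8)
by_mask = toy[mask_ford_6 | mask_chevrolet_8]

# The same filter written as a query string, using `and` / `or`.
by_query = toy.query(
    "(make == 'Ford' and cylinders == 6) or "
    "(make == 'Chevrolet' and cylinders == 8)"
)

print(by_query)
print(by_mask.equals(by_query))
```

Which form reads better is largely a matter of taste; masks are easier to reuse as variables, while query strings tend to be shorter for long conditions.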


### Creating conditional columns

Besides being extremely useful for creating subsets of data, masks are also used to create **conditional columns**: columns whose values are determined by conditions on the values of other columns.

#### Using `.loc`
Let's start with a simple example: instead of filtering our data, we'll create a binary column indicating whether a car is a Ford. The first way to do this is with the `.loc` attribute.

In [85]:
mask_ford = tb_veic['make'] == 'Ford'
print(mask_ford)

10889    False
3191     False
30043    False
30064    False
30034    False
         ...  
16888    False
31980    False
31982    False
31984    False
18408    False
Name: make, Length: 35952, dtype: bool


Let's recall how indexing with `.loc` works:

``` python
data.loc[row_name, column_name]
```

We can pass our mask as the row index, in place of `row_name`, and we can create our column by passing a `column_name` that does not yet exist in our DataFrame!

tb_veic.loc[mask_ford, 'mask']

In [102]:
mask_ford

10889    False
3191     False
30043    False
30064    False
30034    False
         ...  
16888    False
31980    False
31982    False
31984    False
18408    False
Name: make, Length: 35952, dtype: bool

In [94]:
tb_veic.loc[mask_ford, 'e_ford'] = 1
tb_veic.loc[~mask_ford, 'e_ford'] = 0

In [95]:
tb_veic['e_ford'].describe()

count    35952.000000
mean         0.081943
std          0.274281
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: e_ford, dtype: float64
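
The two-step `.loc` assignment tends to produce a floating-point column, since the column holds missing values between the first and second assignment. One possible alternative, not used in the original notebook, is to cast the boolean mask directly to integers. A minimal sketch on made-up data (`toy` and `e_ford_int` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data standing in for tb_veic.
toy = pd.DataFrame({'make': ['Ford', 'Dodge', 'Ford']})
mask_ford = toy['make'] == 'Ford'

# Two-step .loc assignment, as in the cells above.
toy.loc[mask_ford, 'e_ford'] = 1
toy.loc[~mask_ford, 'e_ford'] = 0

# One-step alternative: True becomes 1 and False becomes 0.
toy['e_ford_int'] = mask_ford.astype(int)

print(toy)
print(toy.dtypes)
```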

In [96]:
tb_veic[~mask_ford].head()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class,e_ford
10889,12,623.0,14,10.0,Rear-Wheel Drive,8.4,23.543571,2900,Premium,19,Dodge,Viper,2017,Manual 6-spd,Two Seaters,0.0
3191,11,652.0,14,8.0,Rear-Wheel Drive,6.8,23.543571,2900,Premium,18,Bentley,Mulsanne,2017,Automatic (S8),Midsize Cars,0.0
30043,11,638.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom Coupe,2017,Automatic (S8),Compact Cars,0.0
30064,11,637.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom EWB,2017,Automatic (S8),Large Cars,0.0
30034,11,638.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom,2017,Automatic (S8),Large Cars,0.0


Let's use this same construction to create an efficiency classification for the cars, based on the `city_mpg` column:

- Cars that get **fewer than 15 miles per gallon** are category **C**;
- Cars that get **at least 15 but fewer than 20 miles per gallon** are category **B**;
- Cars that get **20 or more miles per gallon** are category **A**.

We'll store the result of this classification in the `eff_city` column.

In [97]:
mask_C = tb_veic['city_mpg']  < 15
mask_B = (tb_veic['city_mpg']  >= 15) & (tb_veic['city_mpg']  < 20)
mask_A = tb_veic['city_mpg']  >= 20

In [98]:
tb_veic.loc[mask_C, 'eff_city'] = 'C'
tb_veic.loc[mask_B, 'eff_city'] = 'B'
tb_veic.loc[mask_A, 'eff_city'] = 'A'

In [99]:
tb_veic['eff_city'].value_counts()

B    17601
A     9879
C     8472
Name: eff_city, dtype: int64

We could have created this column more concisely, without assigning each mask to a variable:

In [100]:
tb_veic.loc[tb_veic['city_mpg']  < 15, 'eff_city'] = 'C'
tb_veic.loc[(tb_veic['city_mpg']  >= 15) & (tb_veic['city_mpg']  < 20), 'eff_city'] = 'B'
tb_veic.loc[tb_veic['city_mpg']  >= 20, 'eff_city'] = 'A'

In [None]:
tb_veic.columns

The conditions above are **complete**, meaning every row of our table falls into one of the three categories. What happens when we initialize a **conditional column** with an incomplete mask?

Let's explore this behavior by creating the column `eff_high`, built from the `highway_mpg` column using these conditions:

- Cars that get **fewer than 20 miles per gallon** are category **C**;
- Cars that get **at least 20 but fewer than 30 miles per gallon** are category **B**.

(What is missing for the conditions to be complete?)

In [103]:
mask_C = tb_veic['highway_mpg']  < 20
mask_B = (tb_veic['highway_mpg']  >= 20) & (tb_veic['highway_mpg']  < 30)

In [104]:
tb_veic.loc[mask_C, 'eff_high'] = 'C'
tb_veic.loc[mask_B, 'eff_high'] = 'B'
tb_veic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35952 entries, 10889 to 18408
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   city_mpg                 35952 non-null  int64  
 1   co2_emission_grams_mile  35952 non-null  float64
 2   combined_mpg             35952 non-null  int64  
 3   cylinders                35952 non-null  float64
 4   drivetrain               35952 non-null  object 
 5   engine_displacement      35952 non-null  float64
 6   fuel_barrels_year        35952 non-null  float64
 7   fuel_cost_year           35952 non-null  int64  
 8   fuel_type                35952 non-null  object 
 9   highway_mpg              35952 non-null  int64  
 10  make                     35952 non-null  object 
 11  model                    35952 non-null  object 
 12  model_year               35952 non-null  int64  
 13  transmission             35952 non-null  object 
 14  vehicle_class     

In [106]:
tb_veic.loc[~mask_C & ~mask_B].head()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class,e_ford,eff_city,eff_high
7939,19,388.0,23,6.0,Front-Wheel Drive,3.6,14.33087,1450,Gasoline or E85,31,Chrysler,200,2017,Automatic 9-spd,Midsize Cars,0.0,B,
3785,20,376.0,24,6.0,Rear-Wheel Drive,3.6,13.73375,1400,Regular,30,Cadillac,ATS,2017,Automatic (S8),Compact Cars,0.0,A,
3878,20,376.0,24,6.0,Rear-Wheel Drive,3.6,13.73375,1400,Regular,30,Cadillac,CTS,2017,Automatic (S8),Midsize Cars,0.0,A,
3374,21,360.0,25,6.0,Front-Wheel Drive,3.6,13.1844,1350,Regular,31,Buick,LaCrosse,2017,Automatic (S8),Midsize Cars,0.0,A,
32598,21,363.0,24,6.0,Front-Wheel Drive,3.5,13.73375,1400,Regular,30,Toyota,Camry,2017,Automatic (S6),Midsize Cars,0.0,A,
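
The blank `eff_high` values in the rows above are missing values: rows matched by neither mask are never assigned. A minimal sketch that reproduces this on made-up data (`toy` is hypothetical) and counts the unassigned rows:

```python
import pandas as pd

# Hypothetical toy data: two rows fall outside both conditions.
toy = pd.DataFrame({'highway_mpg': [17, 25, 31, 30]})

mask_C = toy['highway_mpg'] < 20
mask_B = (toy['highway_mpg'] >= 20) & (toy['highway_mpg'] < 30)

toy.loc[mask_C, 'eff_high'] = 'C'
toy.loc[mask_B, 'eff_high'] = 'B'

# Rows matched by neither mask are left as missing values.
print(toy)
print(toy['eff_high'].isna().sum())
```

Counting missing values with `.isna().sum()` right after building a conditional column is a quick way to check whether the conditions were complete.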


#### Using `np.where()`

Besides indexing with `.loc`, we can use NumPy's `np.where()` function to create conditional columns. The syntax of this function is:

`np.where(mask, value where mask is True, value where mask is False)`

`np.where()` works much like an `if/else` (for anyone who has used Excel, it is essentially the spreadsheet IF function!). Let's start by creating a simple binary column: `cyl_6`. Its value will be `1` when the car has 6 cylinders and `0` in all other cases.

In [107]:
np.where(tb_veic['cylinders'] == 6, 1, 0)

array([0, 0, 0, ..., 0, 0, 0])

In [108]:
tb_veic['cyl_6'] = np.where(tb_veic['cylinders'] == 6, 1, 0)
tb_veic['cyl_6'].describe()

count    35952.000000
mean         0.355057
std          0.478537
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          1.000000
Name: cyl_6, dtype: float64

In [109]:
tb_veic.head()

Unnamed: 0,city_mpg,co2_emission_grams_mile,combined_mpg,cylinders,drivetrain,engine_displacement,fuel_barrels_year,fuel_cost_year,fuel_type,highway_mpg,make,model,model_year,transmission,vehicle_class,e_ford,eff_city,eff_high,cyl_6
10889,12,623.0,14,10.0,Rear-Wheel Drive,8.4,23.543571,2900,Premium,19,Dodge,Viper,2017,Manual 6-spd,Two Seaters,0.0,C,C,0
3191,11,652.0,14,8.0,Rear-Wheel Drive,6.8,23.543571,2900,Premium,18,Bentley,Mulsanne,2017,Automatic (S8),Midsize Cars,0.0,C,C,0
30043,11,638.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom Coupe,2017,Automatic (S8),Compact Cars,0.0,C,C,0
30064,11,637.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom EWB,2017,Automatic (S8),Large Cars,0.0,C,C,0
30034,11,638.0,14,12.0,Rear-Wheel Drive,6.7,23.543571,2900,Premium,19,Rolls-Royce,Phantom,2017,Automatic (S8),Large Cars,0.0,C,C,0


Now, let's build more complex conditions.

First, we'll combine two conditions in one mask to create the column `cyl_6_ford`: a binary flag for Ford cars with 6 cylinders.

In [110]:
tb_veic['cyl_6_ford'] = np.where((tb_veic['make'] == 'Ford') & (tb_veic['cylinders'] == 6),
                                 1,
                                 0)
print(tb_veic['cyl_6_ford'].describe())

count    35952.000000
mean         0.035965
std          0.186205
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: cyl_6_ford, dtype: float64


Now let's chain conditionals to recreate the highway efficiency rule, `eff_high`, from the `highway_mpg` column:

- Cars that get **fewer than 20 miles per gallon** are category **C**;
- Cars that get **at least 20 but fewer than 30 miles per gallon** are category **B**;
- Cars that get **30 or more miles per gallon** are category **A**.

In [111]:
# Nested np.where: the inner call only decides rows where highway_mpg >= 20
tb_veic['eff_high'] = np.where(tb_veic['highway_mpg'] < 20, 'C', 
                               np.where(tb_veic['highway_mpg'] < 30, 'B', 'A'))
tb_veic['eff_high'].value_counts()

B    21956
C     8566
A     5430
Name: eff_high, dtype: int64
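
Nested `np.where()` calls become harder to read as the number of categories grows. One possible alternative, not used in the original notebook, is NumPy's `np.select()`, which takes a list of conditions, a matching list of values, and a default. The sketch below applies it to made-up data (`toy`, `conditions`, and `labels` are hypothetical names); conditions are checked in order and the first match wins, mirroring the nested version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering the boundaries of each category.
toy = pd.DataFrame({'highway_mpg': [17, 20, 29, 30, 35]})

# Conditions are evaluated in order; the first one that is True decides the label.
conditions = [
    toy['highway_mpg'] < 20,
    toy['highway_mpg'] < 30,
]
labels = ['C', 'B']

# Rows matching no condition receive the default value.
toy['eff_high'] = np.select(conditions, labels, default='A')
print(toy)
```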
