- Lessons F079 to F080

# Querying a Dataset

- We can run **queries** against a DataFrame, much as we would in SQL;
- Rows can be filtered with comparison operators (`>`, `<`, `==`), and several conditions can be combined;
- Queries can also use grouping operations, for example;
- This gives the data scientist a lot of flexibility when exploring the data.


**What does it mean to “query” a dataset in pandas?**

Essentially, it means asking questions of the data: _selecting rows and columns, or applying filters, in order to answer something_.

For example, in SQL you would write something like:
``SELECT nome, idade FROM pessoas WHERE cidade = 'SP';``

In Python, with pandas, this becomes something like:
``df.loc[df["cidade"] == "SP", ["nome", "idade"]]``
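
To make the comparison concrete, here is a minimal sketch on a small made-up table. The rows are invented for illustration; only the table and column names come from the SQL example above. It assumes `.loc` with a boolean mask for the rows and a list of labels for the columns.

```python
import pandas as pd

# Hypothetical data, invented only to illustrate the query
pessoas = pd.DataFrame({
    "nome": ["Ana", "Bruno", "Carla"],
    "idade": [31, 25, 42],
    "cidade": ["SP", "RJ", "SP"],
})

# SELECT nome, idade FROM pessoas WHERE cidade = 'SP';
# The first argument filters the rows, the second picks the columns
resultado = pessoas.loc[pessoas["cidade"] == "SP", ["nome", "idade"]]
print(resultado)
```

Doing the row filter and the column selection in a single `.loc` call avoids chaining two indexing operations one after the other.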

___

In [1]:
import pandas as pd

In [2]:
# Load the dataset
arquivo_caminho = 'kc_house_data.csv'

dataset = pd.read_csv(arquivo_caminho, sep=',', header=0)

___

In [6]:
# Count how many times each unique value occurs

dataset['bedrooms'].value_counts()


bedrooms
3     9824
4     6882
2     2760
5     1601
6      272
1      199
7       38
0       13
8       13
9        6
10       3
11       1
33       1
Name: count, dtype: int64
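
As a small complement, the sketch below shows the same count on a made-up Series, together with the `normalize=True` option, which returns proportions instead of absolute counts. The values and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical bedroom values, invented for illustration
bedrooms = pd.Series([3, 4, 3, 2, 3, 4], name="bedrooms")

# Absolute frequency of each distinct value, most frequent first
counts = bedrooms.value_counts()

# Relative frequency: fractions that add up to 1
shares = bedrooms.value_counts(normalize=True)

print(counts)
print(shares)
```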

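## The loc indexer

- It is used to select rows (and, optionally, columns) of the dataset by label or by condition;
- It is written with square brackets rather than parentheses, and accepts a label, a list of labels, a slice, or a boolean mask, returning the matching subset.

For example, to query the houses with 3 bedrooms: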

In [7]:
dataset.loc[dataset['bedrooms']==3]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,...,7,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170.0,400,1951,1991,98125,47.7210,-122.319,1690,7639
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,...,8,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503
6,1321400060,20140627T000000,257500.0,3,2.25,1715,6819,2.0,0,0,...,7,1715.0,0,1995,0,98003,47.3097,-122.327,2238,6819
7,2008000270,20150115T000000,291850.0,3,1.50,1060,9711,1.0,0,0,...,7,1060.0,0,1963,0,98198,47.4095,-122.315,1650,9711
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21603,7852140040,20140825T000000,507250.0,3,2.50,2270,5536,2.0,0,0,...,8,2270.0,0,2003,0,98065,47.5389,-121.881,2270,5731
21604,9834201367,20150126T000000,429000.0,3,2.00,1490,1126,3.0,0,0,...,8,1490.0,0,2014,0,98144,47.5699,-122.288,1400,1230
21607,2997800021,20150219T000000,475000.0,3,2.50,1310,1294,2.0,0,0,...,8,1180.0,130,2008,0,98116,47.5773,-122.409,1330,1265
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,...,8,1530.0,0,2009,0,98103,47.6993,-122.346,1530,1509



Another example: _querying the houses with 3 bedrooms and more than 2 bathrooms_

In [4]:
dataset.loc[(dataset['bedrooms'] == 3) & (dataset['bathrooms'] > 2)].head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639
6,1321400060,20140627T000000,257500.0,3,2.25,1715,6819,2.0,0,0,...,7,1715.0,0,1995,0,98003,47.3097,-122.327,2238,6819
9,3793500160,20150312T000000,323000.0,3,2.5,1890,6560,2.0,0,0,...,7,1890.0,0,2003,0,98038,47.3684,-122.031,2390,7570
10,1736800520,20150403T000000,662500.0,3,2.5,3560,9796,1.0,0,0,...,8,,1700,1965,0,98007,47.6007,-122.145,2210,8925
21,2524049179,20140826T000000,2000000.0,3,2.75,3050,44867,1.0,0,4,...,9,2330.0,720,1968,0,98040,47.5316,-122.233,4110,20336
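
The combined filter can be sketched on a small made-up frame as shown below. The rows are hypothetical; only the column names match the dataset. Each condition is wrapped in parentheses and joined with `&` (and) or `|` (or). The `query()` form at the end is an alternative way of writing the same filter, not something the lessons use.

```python
import pandas as pd

# Hypothetical rows; only the column names match the dataset
houses = pd.DataFrame({
    "price": [221900.0, 538000.0, 323000.0, 604000.0],
    "bedrooms": [3, 3, 3, 4],
    "bathrooms": [1.0, 2.25, 2.5, 3.0],
})

# Boolean mask: True for rows with 3 bedrooms AND more than 2 bathrooms
mask = (houses["bedrooms"] == 3) & (houses["bathrooms"] > 2)

# Filter the rows and keep only two columns in the same .loc call
filtered = houses.loc[mask, ["price", "bathrooms"]]

# The same filter written as a query string
filtered_query = houses.query("bedrooms == 3 and bathrooms > 2")[["price", "bathrooms"]]

print(filtered)
```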


___

## The sort_values() method

- This method sorts the dataset by the given column, in ascending or descending order.

Note: only the returned DataFrame is sorted; the dataset itself is left unchanged.

In [10]:
dataset.sort_values(by='price', ascending=False) # ascending selects descending (False) or ascending (True) order

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7252,6762700020,20141013T000000,7700000.0,6,8.00,12050,27600,2.5,0,3,...,13,8570.0,3480,1910,1987,98102,47.6298,-122.323,3940,8800
3914,9808700762,20140611T000000,7062500.0,5,4.50,10040,37325,2.0,1,2,...,11,7680.0,2360,1940,2001,98004,47.6500,-122.214,3930,25449
9254,9208900037,20140919T000000,6885000.0,6,7.75,9890,31374,2.0,0,4,...,13,8860.0,1030,2001,0,98039,47.6305,-122.240,4540,42730
4411,2470100110,20140804T000000,5570000.0,5,5.75,9200,35069,2.0,0,0,...,13,6200.0,3000,2001,0,98039,47.6289,-122.233,3560,24345
1448,8907500070,20150413T000000,5350000.0,5,5.00,8000,23985,2.0,0,4,...,12,6720.0,1280,2009,0,98004,47.6232,-122.220,4600,21750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8274,3883800011,20141105T000000,82000.0,3,1.00,860,10426,1.0,0,0,...,6,860.0,0,1954,0,98146,47.4987,-122.341,1140,11250
16198,3028200080,20150324T000000,81000.0,2,1.00,730,9975,1.0,0,0,...,5,730.0,0,1943,0,98168,47.4808,-122.315,860,9000
465,8658300340,20140523T000000,80000.0,1,0.75,430,5050,1.0,0,0,...,4,430.0,0,1912,0,98014,47.6499,-121.909,1200,7500
15293,40000362,20140506T000000,78000.0,2,1.00,780,16344,1.0,0,0,...,5,780.0,0,1942,0,98168,47.4739,-122.280,1700,10387
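
The note about the original staying unchanged can be checked with a short sketch on made-up data. It also sorts by two columns at once, which is an addition to what the lesson shows; the frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
houses = pd.DataFrame({
    "price": [300000.0, 500000.0, 300000.0],
    "bedrooms": [2, 4, 3],
})

# Sort by price (descending), breaking ties by bedrooms (ascending)
ordered = houses.sort_values(by=["price", "bedrooms"], ascending=[False, True])

print(ordered)  # sorted copy
print(houses)   # original order is preserved
```

To keep the sorted result, assign it to a variable as above (or back to the same name).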


___

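## The count() method

- Counts the non-null values in each column of a query result. For a column without nulls this equals the number of rows; in the output below, `sqft_above` shows 6881 because one value is missing.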

In [11]:
dataset[dataset['bedrooms'] == 4].count()

id               6882
date             6882
price            6882
bedrooms         6882
bathrooms        6882
sqft_living      6882
sqft_lot         6882
floors           6882
waterfront       6882
view             6882
condition        6882
grade            6882
sqft_above       6881
sqft_basement    6882
yr_built         6882
yr_renovated     6882
zipcode          6882
lat              6882
long             6882
sqft_living15    6882
sqft_lot15       6882
dtype: int64
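
Because `count()` skips nulls, it can differ from the number of rows. The sketch below, on a made-up frame with one missing value, contrasts it with `len()`, which counts rows regardless of nulls. The data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing value in sqft_above
houses = pd.DataFrame({
    "bedrooms": [4, 4, 3, 4],
    "sqft_above": [1050.0, np.nan, 1680.0, 2170.0],
})

four_bedrooms = houses[houses["bedrooms"] == 4]

non_null = four_bedrooms.count()  # non-null values per column
n_rows = len(four_bedrooms)       # number of rows, nulls included

print(non_null)
print(n_rows)
```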

___

# Modifying the dataset

## 1. Modifying columns

In [47]:
# Create a new column
dataset['sizes'] = dataset['bedrooms'] * 20

In [48]:
# Rename columns
dataset.rename(columns={"sizes": "size"}, inplace=True)

In [15]:
dataset.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,size
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650,60
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639,60
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,770.0,0,1933,0,98028,47.7379,-122.233,2720,8062,40
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,1050.0,910,1965,0,98136,47.5208,-122.393,1360,5000,80
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503,60


In [16]:
# Create a new column from a function (or any other processing)

# Define the function
def categoriza(s):
    if s >= 80:
        return 'Big'
    elif s >= 60:
        return 'Medium'
    elif s >= 40:
        return 'Small'
    # Sizes below 40 match no branch, so the function returns None

In [49]:
dataset['cat_size'] = dataset['size'].apply(categoriza) # .apply calls the function on each value of the column

In [50]:
dataset['cat_size'].head()

0    Medium
1    Medium
2     Small
3       Big
4    Medium
Name: cat_size, dtype: object
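
The same thresholds can also be expressed without a Python function. The sketch below uses `pd.cut` on a made-up Series; this is an alternative technique, not the one used in the lessons. With `right=False` each bin includes its left edge, matching the `>=` comparisons in `categoriza`. Values below 40 fall outside every bin and become `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes, invented for illustration
sizes = pd.Series([60, 60, 40, 80, 20], name="size")

# Bins: [40, 60) -> Small, [60, 80) -> Medium, [80, inf) -> Big
categories = pd.cut(
    sizes,
    bins=[40, 60, 80, np.inf],
    labels=["Small", "Medium", "Big"],
    right=False,
)

print(categories)
```

The result is a categorical column, which keeps the labels in a defined order.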

___

## 2. Modifying rows

In [24]:
# Remove rows by index
dataset.drop(index=0).head() # The row is removed only in the returned copy (inplace defaults to False)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,size,cat_size
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,400,1951,1991,98125,47.721,-122.319,1690,7639,60,Medium
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,0,1933,0,98028,47.7379,-122.233,2720,8062,40,Small
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,910,1965,0,98136,47.5208,-122.393,1360,5000,80,Big
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,0,1987,0,98074,47.6168,-122.045,1800,7503,60,Medium
5,7237550310,20140512T000000,1225000.0,4,4.5,5420,101930,1.0,0,0,...,1530,2001,0,98053,47.6561,-122.005,4760,101930,80,Big


In [25]:
# Remove rows using a filter
dataset.drop(dataset[dataset.bedrooms>30].index, inplace=True)

In [26]:
# The earlier count showed one house with 33 bedrooms; that row should now be gone:
dataset['bedrooms'].value_counts()

bedrooms
3     9824
4     6882
2     2760
5     1601
6      272
1      199
7       38
0       13
8       13
9        6
10       3
11       1
Name: count, dtype: int64
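
An equivalent way to remove rows by condition is to keep only the rows that pass the opposite condition. The sketch below does that on a made-up frame. The data is hypothetical, and the `reset_index` step is an optional extra, not part of the lesson.

```python
import pandas as pd

# Hypothetical rows, including one implausible value
houses = pd.DataFrame({"bedrooms": [3, 33, 4, 2]})

# Keep only the rows with at most 30 bedrooms and reassign the result
houses = houses[houses["bedrooms"] <= 30]

# Optionally renumber the index 0..n-1 after the removal
houses = houses.reset_index(drop=True)

print(houses)
```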

___

## 3. Handling null values

In [33]:
# Remove the rows that contain null values:

dataset.dropna(inplace=True)

In [39]:
# Fill null values (intended for string columns; called on the whole DataFrame, it fills the nulls of every column):
dataset.fillna("Desconhecido", inplace=True)

In [40]:
# Fill null values (intended for int or float columns; the same caveat applies):
dataset.fillna(0, inplace=True)
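
Since `fillna` on a whole DataFrame applies the same value to every column, a safer pattern is to give each column its own fill value. The sketch below passes a dict to `fillna` on a made-up frame. The data is hypothetical; the column names are borrowed from the dataset only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one null in a string column and one in a numeric column
houses = pd.DataFrame({
    "cat_size": ["Big", None, "Small"],
    "sqft_above": [1180.0, np.nan, 770.0],
})

# The dict maps each column name to the value used to fill its nulls
filled = houses.fillna({"cat_size": "Desconhecido", "sqft_above": 0})

print(filled)
print(filled.dtypes)  # sqft_above stays numeric
```

Filling a numeric column with a string would turn it into an `object` column, which the per-column dict avoids.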

___

## 4. Changing data types

In [52]:
# Convert the column type
dataset['cat_size'] = dataset['cat_size'].astype(str)

In [55]:
# Convert a column to a date type (datetime)
dataset['date'] = pd.to_datetime(dataset['date'])
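
The dates in this dataset appear as strings such as `20141013T000000`. The sketch below converts two such strings with an explicit `format`, which is an optional addition to the call above, and then reads date parts through the `.dt` accessor. The Series is built by hand for illustration.

```python
import pandas as pd

# Two date strings in the same layout as the dataset's date column
raw_dates = pd.Series(["20141013T000000", "20141209T000000"])

# An explicit format states the expected layout instead of letting pandas infer it
dates = pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S")

# Once the column is datetime, its parts are available through .dt
years = dates.dt.year
months = dates.dt.month

print(dates)
```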

In [56]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             21613 non-null  int64         
 1   date           21613 non-null  datetime64[ns]
 2   price          21613 non-null  float64       
 3   bedrooms       21613 non-null  int64         
 4   bathrooms      21613 non-null  float64       
 5   sqft_living    21613 non-null  int64         
 6   sqft_lot       21613 non-null  int64         
 7   floors         21613 non-null  float64       
 8   waterfront     21613 non-null  int64         
 9   view           21613 non-null  int64         
 10  condition      21613 non-null  int64         
 11  grade          21613 non-null  int64         
 12  sqft_above     21613 non-null  object        
 13  sqft_basement  21613 non-null  int64         
 14  yr_built       21613 non-null  int64         
 15  yr_renovated   2161

___

## 5. Restructuring the dataset

In [58]:
# Sort
dataset.sort_values(by="price", ascending=False).head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,size,cat_size
7252,6762700020,2014-10-13,7700000.0,6,8.0,12050,27600,2.5,0,3,...,3480,1910,1987,98102,47.6298,-122.323,3940,8800,120,Big
3914,9808700762,2014-06-11,7062500.0,5,4.5,10040,37325,2.0,1,2,...,2360,1940,2001,98004,47.65,-122.214,3930,25449,100,Big
9254,9208900037,2014-09-19,6885000.0,6,7.75,9890,31374,2.0,0,4,...,1030,2001,0,98039,47.6305,-122.24,4540,42730,120,Big
4411,2470100110,2014-08-04,5570000.0,5,5.75,9200,35069,2.0,0,0,...,3000,2001,0,98039,47.6289,-122.233,3560,24345,100,Big
1448,8907500070,2015-04-13,5350000.0,5,5.0,8000,23985,2.0,0,4,...,1280,2009,0,98004,47.6232,-122.22,4600,21750,100,Big


___

# Iterating over the rows of a DataFrame

Sometimes we need to iterate over the rows of a DataFrame, but a warning up front: pandas is not designed to be traversed row by row as in a traditional for loop, because that tends to be slow on large datasets.

Even so, there are several ways to do it, depending on the situation:

## 1. Using .iterrows() (most common)

Iterates row by row, yielding the index and the row as a Series.

In [12]:
# idx = the row's index
# row = the current row, so the print can pick which columns to show
# Calling .head() first keeps iterrows from going through every row of the DataFrame

for idx, row in dataset.head().iterrows():
    print(idx, row['price'], row['bathrooms']) # Prints only the chosen columns of each row
    # print(idx, row) # This would print every column of each row

0 221900.0 1.0
1 538000.0 2.25
2 180000.0 1.0
3 604000.0 3.0
4 510000.0 2.0


## 2. Using .itertuples() (more efficient)

Returns each row as a `namedtuple` (lighter than a Series).

In [15]:
for row in dataset.head().itertuples():
    print(row.Index, row.price, row.bathrooms)

0 221900.0 1.0
1 538000.0 2.25
2 180000.0 1.0
3 604000.0 3.0
4 510000.0 2.0


## 3. Using .apply() (the “pandas” way)

Applies a function to each row (`axis=1`) or to each column (`axis=0`, the default).

In [17]:
dataset.head().apply(lambda row: print(row['price'], row['bathrooms']), axis=1)

221900.0 1.0
538000.0 2.25
180000.0 1.0
604000.0 3.0
510000.0 2.0


0    None
1    None
2    None
3    None
4    None
dtype: object
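
The Series of `None` above appears because `print` returns nothing. When the function returns a value, `apply` collects those values into a new Series. The sketch below computes a hypothetical price per square foot on a made-up frame, and compares it with the equivalent column arithmetic, which avoids calling a Python function per row.

```python
import pandas as pd

# Hypothetical rows; only the column names match the dataset
houses = pd.DataFrame({
    "price": [221900.0, 538000.0],
    "sqft_living": [1180, 2570],
})

# Row-wise: the lambda receives each row as a Series and returns one number
price_per_sqft = houses.apply(lambda row: row["price"] / row["sqft_living"], axis=1)

# Column-wise arithmetic gives the same values
vectorized = houses["price"] / houses["sqft_living"]

print(price_per_sqft)
```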

___

## Updating a DataFrame while iterating row by row

In [18]:
# Values of the "price" column before the update
dataset.price.head()

0    221900.0
1    538000.0
2    180000.0
3    604000.0
4    510000.0
Name: price, dtype: float64

In [19]:
# Iterating over the rows of a DataFrame and updating them:

# Doubles the value of the PRICE column;
# The write goes through .at[] because the row yielded by iterrows() may be a copy, so assigning to it is not guaranteed to change the DataFrame.

for indice, linha in dataset.iterrows():
    dataset.at[indice, 'price'] = linha['price'] * 2

In [20]:
dataset.price.head()

0     443800.0
1    1076000.0
2     360000.0
3    1208000.0
4    1020000.0
Name: price, dtype: float64
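
For an update like this one, where every row gets the same operation, the loop is not strictly needed. The sketch below doubles a made-up `price` column with a single column-wide expression. The frame is built by hand, reusing the first three prices shown above.

```python
import pandas as pd

# The first three prices from the output above, before the update
houses = pd.DataFrame({"price": [221900.0, 538000.0, 180000.0]})

# One expression updates the whole column, with no explicit loop
houses["price"] = houses["price"] * 2

print(houses["price"])
```

On large datasets this form is usually much faster than iterating with `iterrows()`.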