# Working with spreadsheets in pandas

We can work with spreadsheet-like data from three kinds of sources:
 * CSV files (.csv);
 * Excel files (.xlsx);
 * SQL databases.
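
As a rough sketch of how each source can be loaded, the snippet below reads a CSV from an in-memory string and round-trips the same data through an in-memory SQLite database. The table name `games` and the sample rows are made up for illustration, and the Excel call is left as a comment because it needs a real file and an Excel engine such as `openpyxl`.

```python
import io
import sqlite3

import pandas as pd

# CSV: read from an in-memory string instead of a file (sample data is hypothetical)
csv_text = "Name,Platform\nWii Sports,Wii\nTetris,GB\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL: write the frame to an in-memory SQLite table, then query it back
conn = sqlite3.connect(":memory:")
df_csv.to_sql("games", conn, index=False)  # "games" is a hypothetical table name
df_sql = pd.read_sql("SELECT * FROM games", conn)
conn.close()

# Excel: requires an existing file and an engine such as openpyxl
# df_xlsx = pd.read_excel("games.xlsx")  # hypothetical file name

print(df_sql)
```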


I recommend [this article](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92) on working with spreadsheets in Google Colab and [this site](https://www.kaggle.com/datasets) for downloading spreadsheets.

Working with data stored in spreadsheets is an essential part of data science.

Below is a spreadsheet (*clearly outdated*) of video game sales.

Let's load this spreadsheet into a data frame:

In [2]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/leoperassoli/python/master/pandas/vgsales.csv"
df1 = pd.read_csv(url)
df1

,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


Our spreadsheet is now a data frame.

A data frame can be processed in many ways, depending on what we want to get out of it.

In this case, suppose we want the game names to be our index.

The **.set_index("*column_name*")** method returns a new data frame that uses the values of a column as the index:

In [3]:
df1 = df1.set_index("Name")
df1

Name,Rank,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Wii Sports,1,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
Super Mario Bros.,2,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
Mario Kart Wii,3,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
Wii Sports Resort,4,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
Pokemon Red/Pokemon Blue,5,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...
Woody Woodpecker in Crazy Castle 5,16596,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
Men in Black II: Alien Escape,16597,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
SCORE International Baja 1000: The Official Game,16598,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
Know How 2,16599,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


The result is a data frame that is easier to read.
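
The small sketch below illustrates the same idea on a three-row frame built by hand (the variable names `games`, `indexed`, and `restored` are only for this example). It shows that `set_index` leaves the original frame untouched unless the result is assigned back, and that `reset_index` turns the index into a regular column again.

```python
import pandas as pd

# Tiny hand-built frame with three rows from the sales table
games = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"],
        "Platform": ["Wii", "NES", "Wii"],
    }
)

# set_index returns a new frame; "Name" moves from the columns to the index
indexed = games.set_index("Name")

# reset_index moves the index back into the columns, as the first column
restored = indexed.reset_index()

print(indexed)
print(restored.columns.tolist())
```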

Because the result is still a data frame, the **.loc["label"]** and **.iloc[position]** indexers work as usual:

In [4]:
print(df1.loc["Pokemon Red/Pokemon Blue"])
print("\n")
print(df1.iloc[4])

Rank                       5
Platform                  GB
Year                    1996
Genre           Role-Playing
Publisher           Nintendo
NA_Sales               11.27
EU_Sales                8.89
JP_Sales               10.22
Other_Sales                1
Global_Sales           31.37
Name: Pokemon Red/Pokemon Blue, dtype: object


Rank                       5
Platform                  GB
Year                    1996
Genre           Role-Playing
Publisher           Nintendo
NA_Sales               11.27
EU_Sales                8.89
JP_Sales               10.22
Other_Sales                1
Global_Sales           31.37
Name: Pokemon Red/Pokemon Blue, dtype: object
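
One detail worth keeping in mind when names are used as the index: labels do not have to be unique. The sketch below uses a made-up three-row frame to show that `.loc` returns a single row (a Series) for a unique label but a data frame when the label is repeated, while `.iloc` always selects by position.

```python
import pandas as pd

# Hypothetical sample in which the same game name appears on two platforms
games = pd.DataFrame(
    {
        "Name": ["Tetris", "Tetris", "Pong"],
        "Platform": ["GB", "NES", "2600"],
    }
).set_index("Name")

by_label = games.loc["Pong"]     # unique label: one row, returned as a Series
by_position = games.iloc[2]      # third row by position: the same row
duplicated = games.loc["Tetris"] # repeated label: a data frame with two rows

print(by_label)
print(duplicated)
```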


# Boolean masks

When we want to know which values in a collection satisfy some criterion, we use a boolean mask:

In [5]:
a = np.array([1,2,3,6,7,8,10,11,12])
b = a < 10
b  # boolean mask

array([ True,  True,  True,  True,  True,  True, False, False, False])

The variable **b** is an array that holds ***True*** for each number that meets the chosen criterion and ***False*** otherwise.
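
To make the purpose of the mask concrete, here is a minimal sketch that uses the same array: indexing with the mask keeps only the matching values, summing it counts them, and `~` inverts it. The names `selected`, `count`, and `rest` are introduced only for this example.

```python
import numpy as np

a = np.array([1, 2, 3, 6, 7, 8, 10, 11, 12])
b = a < 10  # boolean mask

selected = a[b]     # values where the mask is True
count = int(b.sum())  # True counts as 1, so the sum is the number of matches
rest = a[~b]        # ~ inverts the mask: values that are 10 or greater

print(selected, count, rest)
```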

A boolean mask can also be created for a data frame:

In [9]:
mask = df1["Year"] < 2010
mask  # boolean mask

Name
Wii Sports                                           True
Super Mario Bros.                                    True
Mario Kart Wii                                       True
Wii Sports Resort                                    True
Pokemon Red/Pokemon Blue                             True
                                                    ...  
Woody Woodpecker in Crazy Castle 5                   True
Men in Black II: Alien Escape                        True
SCORE International Baja 1000: The Official Game     True
Know How 2                                          False
Spirits & Spells                                     True
Name: Year, Length: 16598, dtype: bool

In our example, the boolean mask is ***True*** only for games released ***before 2010***.

For games released ***in 2010 or later***, the mask is ***False***.
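
A comparison like this also returns False for rows whose year is missing, which is easy to overlook. The sketch below uses a small hand-made series to show that behavior; the label "Unknown Game" and its missing year are hypothetical, while the other years come from the table above.

```python
import numpy as np
import pandas as pd

# Hand-made "Year" column; "Unknown Game" is a hypothetical row with no year
years = pd.Series(
    [2006.0, 2010.0, np.nan, 1985.0],
    index=["Wii Sports", "Know How 2", "Unknown Game", "Super Mario Bros."],
    name="Year",
)

mask = years < 2010          # NaN compared with a number yields False
n_before = int(mask.sum())   # number of games released before 2010
missing = years.isna()       # separate mask for rows with no year at all

print(mask)
print(n_before)
```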

Indexing the data frame with the mask returns every game that meets the criterion.

Let's create a mask for the top 3:

In [12]:
top_3 = df1["Rank"] < 4  # build the mask
df3 = df1[top_3]  # index the data frame with the mask
df3  # new data frame with the top 3

Name,Rank,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Wii Sports,1,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
Super Mario Bros.,2,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
Mario Kart Wii,3,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82


The new data frame contains only the rows marked ***True*** in the boolean mask.

Now suppose we want the top 3 games that were released for the ***Nintendo Wii*** console:

In [15]:
mask_2 = df3["Platform"] == "Wii"
df4 = df3[mask_2]
df4

Name,Rank,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Wii Sports,1,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
Mario Kart Wii,3,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82


We use a boolean mask whenever we want to select a specific part of our data set.
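
The two filtering steps above can also be written as a single selection by combining masks with `&` (element-wise "and"). This is only a sketch on a four-row frame built by hand; the fourth row is included so that the rank condition actually excludes something. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hand-built frame with the first four rows of the sales table
games = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "Platform": ["Wii", "NES", "Wii", "Wii"],
    },
    index=pd.Index(
        ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii", "Wii Sports Resort"],
        name="Name",
    ),
)

# Both conditions must hold: rank in the top 3 AND platform equal to "Wii"
combined = (games["Rank"] < 4) & (games["Platform"] == "Wii")
top_3_wii = games[combined]

print(top_3_wii)
```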