# <font color='red'>1) Getting Started With Pandas</font>


## <font color='blue'>Anatomy of a DataFrame</font>
A __DataFrame__ is composed of one or more __Series__. The names of the Series form the __column__ names, and the row labels form the __Index__.

In [1]:
import pandas as pd

# Smaller preview: read only the first 5 rows
meteoritos = pd.read_csv('/home/nicolas.fs/Estudos-PIBE/Repositório-GIT/pandas-workshop/data/Meteorite_Landings.csv', nrows=5)
# Full dataset
meteorites = pd.read_csv('/home/nicolas.fs/Estudos-PIBE/Repositório-GIT/pandas-workshop/data/Meteorite_Landings.csv')

meteoritos

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


This cell uses the __pandas__ module to build the meteorite table with __pd.read_csv__. With the data loaded, we can start exploring it through the variable `meteoritos`:

In [2]:
meteoritos.name

0      Aachen
1      Aarhus
2        Abee
3    Acapulco
4     Achiras
Name: name, dtype: object

In [3]:
meteoritos.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')

In [4]:
meteoritos.index

RangeIndex(start=0, stop=5, step=1)
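
To make the relationship between Series, columns, and the Index concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The values are copied from the first rows shown above; the variable names `names`, `masses`, and `frame` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two Series that share the same row labels (0, 1, 2)
names = pd.Series(['Aachen', 'Aarhus', 'Abee'], name='name')
masses = pd.Series([21, 720, 107000], name='mass (g)')

# Combining the Series column-wise produces a DataFrame
frame = pd.concat([names, masses], axis=1)

print(list(frame.columns))   # the Series names become the column names
print(frame.index)           # the shared row labels become the Index
print(type(frame['name']))   # selecting one column gives back a Series
```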

## <font color='blue'>Creating DataFrames</font>
We can create DataFrames from a variety of sources, including other __Python objects__. We will look at only a few examples here; the pandas documentation has a complete list.

### Using a single line

In the same way as before, using a reader function such as __pd.read_csv__.
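
As a hedged illustration of the "Python objects" case mentioned above, the sketch below builds the same small table from a dictionary and from a list of records. The data is a hand-typed excerpt and the names `from_dict` and `from_records` are made up for this example.

```python
import pandas as pd

# From a dictionary: keys become column names, values become the columns
from_dict = pd.DataFrame({
    'name': ['Aachen', 'Aarhus'],
    'mass (g)': [21, 720],
})

# From a list of records (one dictionary per row), a shape JSON APIs often return
from_records = pd.DataFrame([
    {'name': 'Aachen', 'mass (g)': 21},
    {'name': 'Aarhus', 'mass (g)': 720},
])

# Both routes should produce the same table
print(from_dict.equals(from_records))
```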

### Using data from an API

In [5]:
import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50_000}
)

if response.ok:
    payload = response.json()
else:
    print(f'Request was not successful and returned code: {response.status_code}.')
    payload = None

This code uses the __requests__ library to send a __GET__ request to a NASA API, passing a `$limit` parameter that caps the result at __50,000__ records. If the response is successful (status OK), the JSON data is __stored in payload__; otherwise, an error message is printed and payload is __set to None__.

We now create the DataFrame from the contents of payload; __df.head(n)__ controls the __number of rows__ displayed:

In [6]:
import pandas as pd

df = pd.DataFrame(payload)
df.head(2)

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}",,
1,Aarhus,2,Valid,H6,720,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}",,


## <font color='blue'>Inspecting the data</font>
Now that we have some data, we need to perform an initial inspection of it. This tells us what the data looks like, how many rows and columns there are, and how much data we have.

### Checking the number of rows and columns

In [7]:
meteorites.shape

(45716, 10)

### Checking the column names

In [8]:
meteorites.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')

### Checking the data type of each column

In [9]:
meteorites.dtypes

name            object
id               int64
nametype        object
recclass        object
mass (g)       float64
fall            object
year            object
reclat         float64
reclong        float64
GeoLocation     object
dtype: object

### Viewing the first and last rows

In [10]:
meteorites.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


In [11]:
meteorites.tail()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
45711,Zillah 002,31356,Valid,Eucrite,172.0,Found,01/01/1990 12:00:00 AM,29.037,17.0185,"(29.037, 17.0185)"
45712,Zinder,30409,Valid,"Pallasite, ungrouped",46.0,Found,01/01/1999 12:00:00 AM,13.78333,8.96667,"(13.78333, 8.96667)"
45713,Zlin,30410,Valid,H4,3.3,Found,01/01/1939 12:00:00 AM,49.25,17.66667,"(49.25, 17.66667)"
45714,Zubkovsky,31357,Valid,L6,2167.0,Found,01/01/2003 12:00:00 AM,49.78917,41.5046,"(49.78917, 41.5046)"
45715,Zulu Queen,30414,Valid,L3.7,200.0,Found,01/01/1976 12:00:00 AM,33.98333,-115.68333,"(33.98333, -115.68333)"


### Getting summary information

In [12]:
meteorites.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   id           45716 non-null  int64  
 2   nametype     45716 non-null  object 
 3   recclass     45716 non-null  object 
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object 
 6   year         45425 non-null  object 
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 3.5+ MB


## <font color='blue'>Extracting subsets</font>
A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns and rows of interest, and so on. This section will be important for many analysis tasks.

### Selecting columns

In [13]:
meteorites.name

0            Aachen
1            Aarhus
2              Abee
3          Acapulco
4           Achiras
            ...    
45711    Zillah 002
45712        Zinder
45713          Zlin
45714     Zubkovsky
45715    Zulu Queen
Name: name, Length: 45716, dtype: object

We can select multiple columns at once:

In [14]:
meteorites[['name','mass (g)']]

Unnamed: 0,name,mass (g)
0,Aachen,21.0
1,Aarhus,720.0
2,Abee,107000.0
3,Acapulco,1914.0
4,Achiras,780.0
...,...,...
45711,Zillah 002,172.0
45712,Zinder,46.0
45713,Zlin,3.3
45714,Zubkovsky,2167.0


### Selecting rows

In [15]:
meteorites[100:104]

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
100,Benton,5026,Valid,LL6,2840.0,Fell,01/01/1949 12:00:00 AM,45.95,-67.55,"(45.95, -67.55)"
101,Berduc,48975,Valid,L6,270.0,Fell,01/01/2008 12:00:00 AM,-31.91,-58.32833,"(-31.91, -58.32833)"
102,Béréba,5028,Valid,Eucrite-mmict,18000.0,Fell,01/01/1924 12:00:00 AM,11.65,-3.65,"(11.65, -3.65)"
103,Berlanguillas,5029,Valid,L6,1440.0,Fell,01/01/1811 12:00:00 AM,41.68333,-3.8,"(41.68333, -3.8)"


### Indexing
We use __iloc[]__ to select rows and columns by position:

In [16]:
meteorites.iloc[100:104,[0,3,4,6]]

Unnamed: 0,name,recclass,mass (g),year
100,Benton,LL6,2840.0,01/01/1949 12:00:00 AM
101,Berduc,L6,270.0,01/01/2008 12:00:00 AM
102,Béréba,Eucrite-mmict,18000.0,01/01/1924 12:00:00 AM
103,Berlanguillas,L6,1440.0,01/01/1811 12:00:00 AM


And we use __loc[]__ to select by label:

In [17]:
meteorites.loc[100:104, 'mass (g)':'year']

Unnamed: 0,mass (g),fall,year
100,2840.0,Fell,01/01/1949 12:00:00 AM
101,270.0,Fell,01/01/2008 12:00:00 AM
102,18000.0,Fell,01/01/1924 12:00:00 AM
103,1440.0,Fell,01/01/1811 12:00:00 AM
104,960.0,Fell,01/01/2004 12:00:00 AM
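
The two outputs above differ in their last row: `iloc[100:104]` stops before position 104, while `loc[100:104]` includes the label 104. The sketch below reproduces that difference on a small made-up frame (the column names `a`, `b`, and `c` are placeholders).

```python
import pandas as pd

df = pd.DataFrame(
    {'a': range(10), 'b': range(10, 20), 'c': range(20, 30)}
)

by_position = df.iloc[2:5, [0, 2]]   # rows at positions 2, 3, 4 (end excluded)
by_label = df.loc[2:5, 'a':'c']      # rows labeled 2 through 5 (end included)

print(len(by_position), len(by_label))
print(list(by_position.columns), list(by_label.columns))
```

Label slices in `loc[]` are inclusive on both ends for rows and for columns, which is why `'a':'c'` returns all three columns.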


### Filtering with Boolean masks

A Boolean mask is an array-like structure of Boolean values: it is a way to __specify__ which rows or columns we want to __select (True)__ and which we __do not (False)__.

Here is an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (note the two ways of selecting a column used here: bracket notation and attribute notation):

In [18]:
(meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')

0        False
1        False
2        False
3        False
4        False
         ...  
45711     True
45712    False
45713    False
45714     True
45715     True
Length: 45716, dtype: bool
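
On its own, the mask is only a Series of True/False values. To keep just the matching rows, we pass it to the indexing operator. A minimal sketch on made-up data (the frame `sample` and its values are assumptions for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'mass (g)': [21.0, 720.0, 30.0, 172.0],
    'fall': ['Fell', 'Fell', 'Found', 'Found'],
})

# Each comparison yields a Boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and ==.
mask = (sample['mass (g)'] > 50) & (sample.fall == 'Found')

# Indexing with the mask keeps only the rows where it is True
heavy_and_found = sample[mask]
print(heavy_and_found)
```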

An alternative is the `query()` method (column names that contain spaces or special characters must be wrapped in backticks):

In [19]:
meteorites.query("`mass (g)` > 1e6 and fall == 'Fell'")

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
29,Allende,2278,Valid,CV3,2000000.0,Fell,01/01/1969 12:00:00 AM,26.96667,-105.31667,"(26.96667, -105.31667)"
419,Jilin,12171,Valid,H5,4000000.0,Fell,01/01/1976 12:00:00 AM,44.05,126.16667,"(44.05, 126.16667)"
506,Kunya-Urgench,12379,Valid,H5,1100000.0,Fell,01/01/1998 12:00:00 AM,42.25,59.2,"(42.25, 59.2)"
707,Norton County,17922,Valid,Aubrite,1100000.0,Fell,01/01/1948 12:00:00 AM,39.68333,-99.86667,"(39.68333, -99.86667)"
920,Sikhote-Alin,23593,Valid,"Iron, IIAB",23000000.0,Fell,01/01/1947 12:00:00 AM,46.16,134.65333,"(46.16, 134.65333)"
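
As a sketch of how `query()` relates to a Boolean mask, the example below runs both on a small made-up frame. The `threshold` variable is hypothetical; prefixing it with `@` inside the query string makes pandas read the local Python variable.

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'mass (g)': [2_000_000.0, 500.0, 4_000_000.0],
    'fall': ['Fell', 'Fell', 'Found'],
})

threshold = 1e6  # hypothetical cutoff, in grams

# Backticks quote the column name that contains a space and parentheses;
# @threshold refers to the local Python variable
with_query = sample.query("`mass (g)` > @threshold and fall == 'Fell'")

# The same selection written as a Boolean mask
with_mask = sample[(sample['mass (g)'] > threshold) & (sample.fall == 'Fell')]

print(with_query.equals(with_mask))
```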


## <font color='blue'>Calculating summary statistics</font>
In the next section, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the meteorite data by calculating summary statistics.

### One category versus another

In [20]:
meteorites.fall.value_counts()

fall
Found    44609
Fell      1107
Name: count, dtype: int64
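
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A small sketch with made-up values:

```python
import pandas as pd

fall = pd.Series(['Found', 'Found', 'Found', 'Fell'], name='fall')

counts = fall.value_counts()                  # absolute counts per category
shares = fall.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```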

### What is the mass of the average meteorite?

In [21]:
meteorites['mass (g)'].mean()

13278.078548601512

### Looking at quantiles and the median

In [22]:
meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

0.01        0.44
0.05        1.10
0.50       32.60
0.95     4000.00
0.99    50600.00
Name: mass (g), dtype: float64

In [23]:
meteorites['mass (g)'].median()

32.6
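
The mean computed earlier (about 13,278 g) is far larger than the median (32.6 g), which suggests that a few very heavy meteorites pull the mean upward. The sketch below shows the same effect with made-up numbers; it is an illustration, not an analysis of the real data.

```python
import pandas as pd

# Made-up masses: four small values and one extreme outlier
masses = pd.Series([10.0, 20.0, 30.0, 40.0, 1_000_000.0])

print(masses.mean())     # pulled far upward by the single outlier
print(masses.median())   # the middle value, unaffected by the outlier
print(masses.quantile(0.5) == masses.median())  # the 0.5 quantile is the median
```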

### Which meteorites are the heaviest and the lightest
And how to __display them__

In [24]:
meteorites['mass (g)'].min()

0.0

In [25]:
# Standard filter pattern: show only the rows that satisfy the Boolean condition
meteorites[meteorites['mass (g)'] == 0]

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
12640,Gove,52859,Relict,Relict iron,0.0,Found,01/01/1979 12:00:00 AM,-12.26333,136.83833,"(-12.26333, 136.83833)"
25557,Miller Range 090478,55953,Valid,CO3,0.0,Found,01/01/2009 12:00:00 AM,0.0,0.0,"(0.0, 0.0)"
31061,Österplana 048,56147,Relict,Relict OC,0.0,Found,01/01/2004 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31062,Österplana 049,56148,Relict,Relict OC,0.0,Found,01/01/2012 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31063,Österplana 050,56149,Relict,Relict OC,0.0,Found,01/01/2003 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31064,Österplana 051,56150,Relict,Relict OC,0.0,Found,01/01/2006 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31065,Österplana 052,56151,Relict,Relict OC,0.0,Found,01/01/2006 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31066,Österplana 053,56152,Relict,Relict OC,0.0,Found,01/01/2002 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31067,Österplana 054,56153,Relict,Relict OC,0.0,Found,01/01/2005 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"
31068,Österplana 055,56154,Relict,Relict OC,0.0,Found,01/01/2008 12:00:00 AM,58.58333,13.43333,"(58.58333, 13.43333)"


In [26]:
meteorites['mass (g)'].max()

60000000.0

In [27]:
Maior = meteorites[meteorites['mass (g)'] == 60000000.0]
Maior

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
16392,Hoba,11890,Valid,"Iron, IVB",60000000.0,Found,01/01/1920 12:00:00 AM,-19.58333,17.91667,"(-19.58333, 17.91667)"


### Extracting information about a specific meteorite

In [28]:
meteorites.loc[meteorites['mass (g)'].idxmax()]

name                             Hoba
id                              11890
nametype                        Valid
recclass                    Iron, IVB
mass (g)                   60000000.0
fall                            Found
year           01/01/1920 12:00:00 AM
reclat                      -19.58333
reclong                      17.91667
GeoLocation     (-19.58333, 17.91667)
Name: 16392, dtype: object
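
`idxmax()` returns the index label of the maximum, so combining it with `loc[]` avoids typing the maximum value by hand. A sketch on made-up data follows; the non-default index labels are an assumption chosen to show that a label, not a position, comes back.

```python
import pandas as pd

sample = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 'mass (g)': [21.0, 60_000_000.0, 0.5]},
    index=[10, 20, 30],  # non-default labels, to show that idxmax returns a label
)

heaviest_label = sample['mass (g)'].idxmax()   # index label of the maximum
lightest_label = sample['mass (g)'].idxmin()   # index label of the minimum

heaviest = sample.loc[heaviest_label]          # the full row, as a Series
lightest = sample.loc[lightest_label]

print(heaviest['name'], lightest['name'])
```

Note that `idxmax()` and `idxmin()` return only the first matching label. When several rows tie, as with the ten meteorites of mass 0 above, a Boolean mask returns all of them.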

### How many different meteorite classes are represented in this dataset?

In [29]:
meteorites.recclass.nunique()

466

For example:

In [30]:
meteorites.recclass.unique()[:10]

array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
       'Diogenite-pm', 'Unknown'], dtype=object)

### Getting summary statistics on the data itself
We can get common summary statistics for all columns at once. By default, only numeric columns are included, but here we summarize everything together:

In [31]:
meteorites.describe(include='all')

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
count,45716,45716.0,45716,45716,45585.0,45716,45425,38401.0,38401.0,38401
unique,45716,,2,466,,2,266,,,17100
top,Aachen,,Valid,L6,,Found,01/01/2003 12:00:00 AM,,,"(0.0, 0.0)"
freq,1,,45641,8285,,44609,3323,,,6214
mean,,26889.735104,,,13278.08,,,-39.12258,61.074319,
std,,16860.68303,,,574988.9,,,46.378511,80.647298,
min,,1.0,,,0.0,,,-87.36667,-165.43333,
25%,,12688.75,,,7.2,,,-76.71424,0.0,
50%,,24261.5,,,32.6,,,-71.5,35.66667,
75%,,40656.75,,,202.6,,,0.0,157.16667,


NaN values signify __missing data__. For example, the `fall` column contains __strings__, so there is no value for the __mean__; likewise, `mass (g)` is numeric, so it has no entries for the categorical summary statistics (`unique`, `top`, `freq`).

# <font color='red'>2) Data Wrangling</font>

To prepare our data for analysis, we need to perform __data wrangling__. In this section, we will learn how to clean and reformat data (for example, renaming columns and fixing data type mismatches), restructure or reshape it, and enrich it (for example, discretizing columns, calculating aggregations, and combining data sources).

## <font color='blue'>Data cleaning</font>
In this section, we will see how to create, rename, and drop columns; convert types; and sort. We will work with the 2019 Yellow Taxi Trip Data provided by NYC Open Data.

In [32]:
import pandas as pd

taxis = pd.read_csv('../data/2019_Yellow_Taxi_Trip_Data.csv')
taxis.head()

Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2,2019-10-23T16:39:42.000,2019-10-23T17:14:10.000,1,7.93,1,N,138,170,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5
1,1,2019-10-23T16:32:08.000,2019-10-23T16:45:26.000,1,2.0,1,N,11,26,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0
2,2,2019-10-23T16:08:44.000,2019-10-23T16:21:11.000,1,1.36,1,N,163,162,1,9.5,1.0,0.5,2.0,0.0,0.3,15.8,2.5
3,2,2019-10-23T16:22:44.000,2019-10-23T16:43:26.000,1,1.0,1,N,170,163,1,13.0,1.0,0.5,4.32,0.0,0.3,21.62,2.5
4,2,2019-10-23T16:45:11.000,2019-10-23T16:58:49.000,1,1.96,1,N,163,236,1,10.5,1.0,0.5,0.5,0.0,0.3,15.3,2.5


### Dropping columns
As an example, we will drop the __store_and_fwd_flag__ column and the ID columns:

In [33]:
mask = taxis.columns.str.contains('id$|store_and_fwd_flag', regex=True)
columns_to_drop = taxis.columns[mask]
columns_to_drop

Index(['vendorid', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid',
       'dolocationid'],
      dtype='object')

In [34]:
taxis = taxis.drop(columns=columns_to_drop)
taxis.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2019-10-23T16:39:42.000,2019-10-23T17:14:10.000,1,7.93,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5
1,2019-10-23T16:32:08.000,2019-10-23T16:45:26.000,1,2.0,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0
2,2019-10-23T16:08:44.000,2019-10-23T16:21:11.000,1,1.36,1,9.5,1.0,0.5,2.0,0.0,0.3,15.8,2.5
3,2019-10-23T16:22:44.000,2019-10-23T16:43:26.000,1,1.0,1,13.0,1.0,0.5,4.32,0.0,0.3,21.62,2.5
4,2019-10-23T16:45:11.000,2019-10-23T16:58:49.000,1,1.96,1,10.5,1.0,0.5,0.5,0.0,0.3,15.3,2.5


We built a Boolean mask called __mask__ that is True only for the columns we want to discard, and used it to select those column names into `columns_to_drop`. We then passed __columns=columns_to_drop__ to the `drop()` method to remove them.
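
The regular expression deserves a closer look: `id$` matches names that end in "id", and `|` means "or". The sketch below applies the same pattern to hypothetical column names, including `paid_amount`, which contains "id" but does not end with it.

```python
import pandas as pd

# Hypothetical column names in the style of the taxi data
frame = pd.DataFrame(
    columns=['vendorid', 'paid_amount', 'store_and_fwd_flag', 'trip_distance']
)

# 'id$' matches names that END in "id"; '|' means "or"
mask = frame.columns.str.contains('id$|store_and_fwd_flag', regex=True)
remaining = frame.drop(columns=frame.columns[mask])

print(mask.tolist())
print(list(remaining.columns))
```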

### Renaming columns

In [35]:
taxis = taxis.rename(
    columns={
        'tpep_pickup_datetime': 'pickup', 
        'tpep_dropoff_datetime': 'dropoff'
    }
)
taxis.columns

Index(['pickup', 'dropoff', 'passenger_count', 'trip_distance', 'payment_type',
       'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'congestion_surcharge'],
      dtype='object')

### Converting types

In [36]:
taxis.dtypes

pickup                    object
dropoff                   object
passenger_count            int64
trip_distance            float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object

In this case, we want __pickup__ and __dropoff__ to be __datetimes__. We can fix this:

In [37]:
taxis[['pickup', 'dropoff']] = \
    taxis[['pickup', 'dropoff']].apply(pd.to_datetime)
taxis.dtypes

pickup                   datetime64[ns]
dropoff                  datetime64[ns]
passenger_count                   int64
trip_distance                   float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object
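
Once the columns are datetimes, they support arithmetic and the `.dt` accessor, which plain strings do not. A minimal sketch using two timestamp pairs copied from the output above (the frame name `trips` is made up for this example):

```python
import pandas as pd

trips = pd.DataFrame({
    'pickup': ['2019-10-23T16:39:42.000', '2019-10-23T16:32:08.000'],
    'dropoff': ['2019-10-23T17:14:10.000', '2019-10-23T16:45:26.000'],
})

# apply() calls pd.to_datetime once per selected column
trips[['pickup', 'dropoff']] = trips[['pickup', 'dropoff']].apply(pd.to_datetime)

# Datetime columns support arithmetic and the .dt accessor
duration = trips.dropoff - trips.pickup
print(duration.dt.total_seconds() / 60)   # trip length in minutes
print(trips.pickup.dt.hour)               # hour of day of each pickup
```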

### Creating new columns

In [38]:
taxis = taxis.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup, # 1
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip, # 2
    fees=lambda x: x.cost_before_tip - x.fare_amount, # 3
    avg_speed=lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    ) # 4
)

These __lambda functions__ are small anonymous functions that can take any number of arguments but can contain only a single expression (the return value).

Here, each one follows this pattern, where `x` is the DataFrame that `assign()` passes in:

`new_column = lambda x: x.column1 <operation> x.column2`

In [39]:
taxis.head(2)

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
0,2019-10-23 16:39:42,2019-10-23 17:14:10,1,7.93,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5,0 days 00:34:28,39.92,0.1999,10.42,13.804642
1,2019-10-23 16:32:08,2019-10-23 16:45:26,1,2.0,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0,0 days 00:13:18,12.3,0.0,1.8,9.022556
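
One reason to use lambdas inside `assign()` is that each one receives the DataFrame as it exists at that point, including columns created earlier in the same call. The sketch below shows this with two of the columns from above on made-up rows.

```python
import pandas as pd

trips = pd.DataFrame({'total_amount': [47.9, 12.3], 'tip_amount': [7.98, 0.0]})

trips = trips.assign(
    # x is the DataFrame as it exists when this lambda is evaluated
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    # cost_before_tip was created one line above, so it is already visible on x
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip,
)

print(trips)
```

Writing `tip_pct=trips.tip_amount / trips.cost_before_tip` without the lambda would fail here, because `cost_before_tip` does not exist on `trips` until `assign()` returns.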


### Sorting by values
We can use the `sort_values()` method to sort by one or more columns:

In [40]:
taxis.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
5997,2019-10-23 15:55:19,2019-10-23 16:08:25,6,1.58,2,10.0,1.0,0.5,0.0,0.0,0.3,14.3,2.5,0 days 00:13:06,14.3,0.0,4.3,7.236641
443,2019-10-23 15:56:59,2019-10-23 16:04:33,6,1.46,2,7.5,1.0,0.5,0.0,0.0,0.3,11.8,2.5,0 days 00:07:34,11.8,0.0,4.3,11.577093
8722,2019-10-23 15:57:33,2019-10-23 16:03:34,6,0.62,1,5.5,1.0,0.5,0.7,0.0,0.3,10.5,2.5,0 days 00:06:01,9.8,0.071429,4.3,6.182825
4198,2019-10-23 15:57:38,2019-10-23 16:05:07,6,1.18,1,7.0,1.0,0.5,1.0,0.0,0.3,12.3,2.5,0 days 00:07:29,11.3,0.088496,4.3,9.461024
8238,2019-10-23 15:58:31,2019-10-23 16:29:29,6,3.23,2,19.5,1.0,0.5,0.0,0.0,0.3,23.8,2.5,0 days 00:30:58,23.8,0.0,4.3,6.258342


To pick the rows with the largest or smallest values, we use `nlargest()` and `nsmallest()`. Here is an example looking at the 4 trips with the longest elapsed time:

In [41]:
taxis.nlargest(4, 'elapsed_time')

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
7576,2019-10-23 16:52:51,2019-10-24 16:51:44,1,3.75,1,17.5,1.0,0.5,0.0,0.0,0.3,21.8,2.5,0 days 23:58:53,21.8,0.0,4.3,0.156371
6902,2019-10-23 16:51:42,2019-10-24 16:50:22,1,11.19,2,39.5,1.0,0.5,0.0,0.0,0.3,41.3,0.0,0 days 23:58:40,41.3,0.0,1.8,0.466682
4975,2019-10-23 16:18:51,2019-10-24 16:17:30,1,0.7,2,7.0,1.0,0.5,0.0,0.0,0.3,11.3,2.5,0 days 23:58:39,11.3,0.0,4.3,0.029194
6550,2019-10-23 16:49:36,2019-10-24 16:47:40,1,2.54,1,11.0,1.0,0.5,2.3,0.0,0.3,17.6,2.5,0 days 23:58:04,15.3,0.150327,4.3,0.105976


In [42]:
taxis.nsmallest(4, 'total_amount')

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
7586,2019-10-23 16:52:06,2019-10-23 17:29:50,1,10.86,2,-52.0,-4.5,-0.5,0.0,-6.12,-0.3,-65.92,-2.5,0 days 00:37:44,-65.92,-0.0,-13.92,17.268551
822,2019-10-23 16:52:52,2019-10-23 16:52:54,3,0.02,3,-52.0,-4.5,-0.5,0.0,0.0,-0.3,-57.3,0.0,0 days 00:00:02,-57.3,-0.0,-5.3,36.0
8804,2019-10-23 16:50:16,2019-10-23 17:06:08,2,0.53,4,-10.5,-1.0,-0.5,0.0,0.0,-0.3,-14.8,-2.5,0 days 00:15:52,-14.8,-0.0,-4.3,2.004202
2103,2019-10-23 16:41:17,2019-10-23 16:56:35,1,0.85,3,-10.0,-1.0,-0.5,0.0,0.0,-0.3,-14.3,-2.5,0 days 00:15:18,-14.3,-0.0,-4.3,3.333333
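
As a sketch of what these methods do, `nlargest(n, column)` selects the same rows as sorting in descending order and taking the first `n`. The example uses made-up fares; the `trip` column is a hypothetical identifier.

```python
import pandas as pd

trips = pd.DataFrame({
    'trip': ['a', 'b', 'c', 'd'],
    'total_amount': [15.3, -65.92, 47.9, 12.3],
})

top_two = trips.nlargest(2, 'total_amount')
# Equivalent result via a full sort followed by head()
top_two_sorted = trips.sort_values('total_amount', ascending=False).head(2)

bottom_one = trips.nsmallest(1, 'total_amount')

print(top_two.equals(top_two_sorted))
print(bottom_one)
```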


## <font color='blue'>Working with the index</font>

So far, we haven't really worked with the index because it has just been the row numbers; however, we can change the values in the index to access additional features of the pandas library.

### Setting and sorting the index

Currently we have a RangeIndex, but we can switch to a DatetimeIndex by specifying a datetime column when calling `set_index()`:

In [43]:
taxis = taxis.set_index('pickup')
taxis.head(3)

pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2019-10-23 16:39:42,2019-10-23 17:14:10,1,7.93,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5,0 days 00:34:28,39.92,0.1999,10.42,13.804642
2019-10-23 16:32:08,2019-10-23 16:45:26,1,2.0,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0,0 days 00:13:18,12.3,0.0,1.8,9.022556
2019-10-23 16:08:44,2019-10-23 16:21:11,1,1.36,1,9.5,1.0,0.5,2.0,0.0,0.3,15.8,2.5,0 days 00:12:27,13.8,0.144928,4.3,6.554217


_Note:_ Once a column has been set as the index, it is no longer one of the columns; it stays in the index unless we call `reset_index()`.

Since we have a sample of the full dataset, let's sort the index to order the rows by __pickup time__:

In [44]:
taxis = taxis.sort_index()

Now we can select ranges of our data based on date and time, just as we did with row numbers:

In [45]:
taxis['2019-10-23 07:45':'2019-10-23 08']

pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2019-10-23 07:48:58,2019-10-23 07:52:09,1,0.67,2,4.5,1.0,0.5,0.0,0.0,0.3,8.8,2.5,0 days 00:03:11,8.8,0.0,4.3,12.628272
2019-10-23 08:02:09,2019-10-24 07:42:32,1,8.38,1,32.0,1.0,0.5,5.5,0.0,0.3,41.8,2.5,0 days 23:40:23,36.3,0.151515,4.3,0.353989
2019-10-23 08:18:47,2019-10-23 08:36:05,1,2.39,2,12.5,1.0,0.5,0.0,0.0,0.3,16.8,2.5,0 days 00:17:18,16.8,0.0,4.3,8.289017


When we are not specifying a range, we use `loc[]`:

In [46]:
taxis.loc['2019-10-23 08']

pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
2019-10-23 08:02:09,2019-10-24 07:42:32,1,8.38,1,32.0,1.0,0.5,5.5,0.0,0.3,41.8,2.5,0 days 23:40:23,36.3,0.151515,4.3,0.353989
2019-10-23 08:18:47,2019-10-23 08:36:05,1,2.39,2,12.5,1.0,0.5,0.0,0.0,0.3,16.8,2.5,0 days 00:17:18,16.8,0.0,4.3,8.289017
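
The selections above rely on partial string indexing: a string such as `'2019-10-23 08'` stands for the whole hour, and the end of a slice is inclusive. The sketch below reproduces both selections on five pickup times copied from this notebook's output (the fares are included only to give the frame a column).

```python
import pandas as pd

trips = pd.DataFrame(
    {'fare_amount': [50.0, 4.5, 32.0, 12.5, 6.0]},
    index=pd.to_datetime([
        '2019-10-23 07:05:34', '2019-10-23 07:48:58', '2019-10-23 08:02:09',
        '2019-10-23 08:18:47', '2019-10-23 09:27:16',
    ]),
).sort_index()

# A partial string such as '2019-10-23 08' stands for the whole 08:00-08:59 hour
hour_eight = trips.loc['2019-10-23 08']

# In a slice, the end label is inclusive and also covers its whole period
range_slice = trips.loc['2019-10-23 07:45':'2019-10-23 08']

print(len(hour_eight), len(range_slice))
```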


### Resetting the index

We will be working with time series after this section; however, sometimes we want to reset the index to row numbers and turn the index back into a column. We can do this with `reset_index()`:

In [47]:
taxis = taxis.reset_index()
taxis.head()

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
0,2019-10-23 07:05:34,2019-10-23 08:03:16,3,14.68,1,50.0,1.0,0.5,4.0,0.0,0.3,55.8,0.0,0 days 00:57:42,51.8,0.07722,1.8,15.265165
1,2019-10-23 07:48:58,2019-10-23 07:52:09,1,0.67,2,4.5,1.0,0.5,0.0,0.0,0.3,8.8,2.5,0 days 00:03:11,8.8,0.0,4.3,12.628272
2,2019-10-23 08:02:09,2019-10-24 07:42:32,1,8.38,1,32.0,1.0,0.5,5.5,0.0,0.3,41.8,2.5,0 days 23:40:23,36.3,0.151515,4.3,0.353989
3,2019-10-23 08:18:47,2019-10-23 08:36:05,1,2.39,2,12.5,1.0,0.5,0.0,0.0,0.3,16.8,2.5,0 days 00:17:18,16.8,0.0,4.3,8.289017
4,2019-10-23 09:27:16,2019-10-23 09:33:13,2,1.11,2,6.0,1.0,0.5,0.0,0.0,0.3,7.8,0.0,0 days 00:05:57,7.8,0.0,1.8,11.193277
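
A compact sketch of the full round trip on made-up rows shows where the `pickup` column lives at each step:

```python
import pandas as pd

trips = pd.DataFrame({
    'pickup': pd.to_datetime(['2019-10-23 08:18:47', '2019-10-23 07:48:58']),
    'fare_amount': [12.5, 4.5],
})

indexed = trips.set_index('pickup').sort_index()   # 'pickup' moves into the index
restored = indexed.reset_index()                   # 'pickup' is a column again

print(list(indexed.columns))
print(list(restored.columns))
print(restored.index)
```

After `reset_index()`, the rows keep the sorted order and receive fresh row numbers starting at 0.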



## <font color='blue'>Reshaping data</font>

The taxi dataset we have been working with is in a format conducive to analysis, but this is not always the case. Let's now look at the TSA traveler throughput data, which compares 2021 throughput with the same day in 2020 and 2019:

In [48]:
tsa = pd.read_csv('/home/nicolas.fs/Estudos-PIBE/Repositório-GIT/pandas-workshop/data/tsa_passenger_throughput.csv', parse_dates=['Date'])
tsa

Unnamed: 0,Date,2021 Traveler Throughput,2020 Traveler Throughput,2019 Traveler Throughput
0,2021-05-14,1716561.0,250467,2664549
1,2021-05-13,1743515.0,234928,2611324
2,2021-05-12,1424664.0,176667,2343675
3,2021-05-11,1315493.0,163205,2191387
4,2021-05-10,1657722.0,215645,2512315
...,...,...,...,...
360,2020-05-19,,190477,2312727
361,2020-05-18,,244176,2615691
362,2020-05-17,,253807,2620276
363,2020-05-16,,193340,2091116


Next, we rename the columns so they are easier to work with when reshaping:

In [49]:
tsa = tsa.rename(columns=lambda x: x.lower().split()[0])
tsa

Unnamed: 0,date,2021,2020,2019
0,2021-05-14,1716561.0,250467,2664549
1,2021-05-13,1743515.0,234928,2611324
2,2021-05-12,1424664.0,176667,2343675
3,2021-05-11,1315493.0,163205,2191387
4,2021-05-10,1657722.0,215645,2512315
...,...,...,...,...
360,2020-05-19,,190477,2312727
361,2020-05-18,,244176,2615691
362,2020-05-17,,253807,2620276
363,2020-05-16,,193340,2091116


### Melting

Melting helps us convert the data into long format. All of the traveler throughput values end up in a single column, with a separate row for each year:

In [50]:
tsa_melted = tsa.melt(
    id_vars='date', # column that uniquely identifies a row (can be multiple)
    var_name='year', # name for the new column created by melting
    value_name='travelers' # name for new column containing values from melted columns
)
tsa_melted

Unnamed: 0,date,year,travelers
0,2021-05-14,2021,1716561.0
1,2021-05-13,2021,1743515.0
2,2021-05-12,2021,1424664.0
3,2021-05-11,2021,1315493.0
4,2021-05-10,2021,1657722.0
...,...,...,...
1090,2020-05-19,2019,2312727.0
1091,2020-05-18,2019,2615691.0
1092,2020-05-17,2019,2620276.0
1093,2020-05-16,2019,2091116.0


_Note:_ We can use `.sample(n)` if we want to view `n` rows picked at random.

In effect, we now have more rows: each date has one row per year holding that year's traveler count, instead of a single row per date with a separate column for each year.
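
The change in shape is easiest to see on a tiny frame. The sketch below melts the first two rows of the table shown earlier; the names `wide` and `long` are made up for this example.

```python
import pandas as pd

wide = pd.DataFrame({
    'date': pd.to_datetime(['2021-05-14', '2021-05-13']),
    '2021': [1716561.0, 1743515.0],
    '2020': [250467, 234928],
    '2019': [2664549, 2611324],
})

long = wide.melt(id_vars='date', var_name='year', value_name='travelers')

# 2 rows x 3 melted columns -> 6 rows, one per (date, year) pair
print(wide.shape, long.shape)
print(long)
```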

To turn the melted data into a time series of traveler throughput, we need to replace the year in the __date column__ with the year in the __year column__. Otherwise, we would be labeling the prior years' numbers with the wrong year.

In [51]:
tsa_melted = tsa_melted.assign(
    date=lambda x: pd.to_datetime(x.year + x.date.dt.strftime('-%m-%d'))
)
tsa_melted

Unnamed: 0,date,year,travelers
0,2021-05-14,2021,1716561.0
1,2021-05-13,2021,1743515.0
2,2021-05-12,2021,1424664.0
3,2021-05-11,2021,1315493.0
4,2021-05-10,2021,1657722.0
...,...,...,...
1090,2019-05-19,2019,2312727.0
1091,2019-05-18,2019,2615691.0
1092,2019-05-17,2019,2620276.0
1093,2019-05-16,2019,2091116.0


Because the 2021 column has no values for the later dates, this leaves us with some __null values__:

In [52]:
tsa_melted.sort_values('date').tail(3)

Unnamed: 0,date,year,travelers
136,2021-12-29,2021,
135,2021-12-30,2021,
134,2021-12-31,2021,


They can be removed with the `dropna()` method:

In [53]:
tsa_melted = tsa_melted.dropna()
tsa_melted.sort_values('date').tail(3)

Unnamed: 0,date,year,travelers
2,2021-05-12,2021,1424664.0
1,2021-05-13,2021,1743515.0
0,2021-05-14,2021,1716561.0
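
By default, `dropna()` removes a row if any of its columns is missing. When only some columns matter, the `subset` parameter restricts the check. A sketch on made-up data, where the `note` column is a hypothetical addition used only to show the difference:

```python
import pandas as pd

readings = pd.DataFrame({
    'date': pd.to_datetime(['2021-05-13', '2021-05-14', '2021-12-31']),
    'travelers': [1743515.0, 1716561.0, None],
    'note': [None, 'checked', None],   # hypothetical extra column
})

all_complete = readings.dropna()                       # drops rows with ANY missing value
has_travelers = readings.dropna(subset=['travelers'])  # only looks at 'travelers'

print(len(all_complete), len(has_travelers))
```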
