# Objective

This notebook summarizes the basic features of **Pandas**.

# What is Pandas?

According to its own documentation, **Pandas** is an *open source* library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

It includes functions for analyzing, cleaning, exploring, and manipulating data.

# Installing the Library

In [None]:
# Run only if the library is not installed yet
%pip install pandas

# Importing the Library

In [3]:
import pandas as pd
import seaborn as sns  # Used only to load a sample dataset; install it first if it is not available

# Creating Data Frames

In Pandas, a table of data is called a Data Frame: a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

A Data Frame can be created from:
- lists
- dictionaries
- imported files
- another data frame
- ndarrays
- series
- scalar values
- ...
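
To illustrate some of the sources listed above, here is a minimal sketch that builds small data frames from a list of rows, an ndarray, a Series, and another data frame. The variable names (`from_list`, `from_array`, `from_series`, `from_frame`) and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# From a list of rows; column names are given explicitly
from_list = pd.DataFrame([["Andreza", 68], ["Cissa", 75]], columns=["nome", "peso"])

# From a two-dimensional NumPy array (ndarray)
from_array = pd.DataFrame(np.array([[68, 25], [75, 22]]), columns=["peso", "idade"])

# From a Series; the Series name becomes the column name
from_series = pd.Series([25, 22, 27], name="idade").to_frame()

# From another data frame; copy() returns an independent object
from_frame = from_list.copy()

print(from_list.shape, from_array.shape, from_series.shape)
```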

## Example: creating a data frame from a dictionary

In [22]:
dicionario = {
   "nome": ["Andreza", "Cissa", "Guilherme"],
   "peso": [68, 75, 87],
   "idade": [25, 22, 27]
}

data = pd.DataFrame(dicionario)
data.head()

Unnamed: 0,nome,peso,idade
0,Andreza,68,25
1,Cissa,75,22
2,Guilherme,87,27


## Reading from a File

In [4]:
df = sns.load_dataset('titanic')  # Loads a sample dataset provided by seaborn
# Use pd.read_csv to read from an external .csv file
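
The `pd.read_csv` call mentioned in the comment above can be sketched as follows. To keep the example self-contained, the CSV content is an in-memory string wrapped in `io.StringIO` (an assumption made for illustration); with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for an external .csv file
csv_text = "nome,peso,idade\nAndreza,68,25\nCissa,75,22\n"

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
```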

# Checking the data structure

In [5]:
type(df)  # Checks the type of the data structure

pandas.core.frame.DataFrame

In [14]:
df.shape  # Shows the number of rows and columns, in that order

(891, 15)

In [6]:
df.head()  # Shows the first 5 rows

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [11]:
df.tail()  # Shows the last 5 rows

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [7]:
df.info()  # Shows column names, non-null counts, and dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [18]:
df.nunique()  # Number of distinct values per column (missing values are not counted)

survived         2
pclass           3
sex              2
age             88
sibsp            7
parch            7
fare           248
embarked         3
class            3
who              3
adult_male       2
deck             7
embark_town      3
alive            2
alone            2
dtype: int64

# Selecting columns and changing the variable type

In [20]:
df['sex'].unique()  # Select one column and apply a method

array(['male', 'female'], dtype=object)

In [24]:
df['sex'] = df['sex'].astype('category')  # Converts the column to the category dtype

In [28]:
df['sex'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: sex
Non-Null Count  Dtype   
--------------  -----   
891 non-null    category
dtypes: category(1)
memory usage: 1.1 KB
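
As a small illustration of why the conversion to `category` can be useful, the sketch below compares the memory footprint of a text column before and after `astype('category')`. The Series `s` is synthetic data created for this example, not the Titanic column itself.

```python
import pandas as pd

# Synthetic column with many repeated text values
s = pd.Series(["male", "female", "female", "male"] * 250)

# Each distinct value is stored once; rows hold small integer codes
as_category = s.astype("category")

print(as_category.cat.categories.tolist())
print(s.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

The gain depends on how often values repeat; a column with mostly unique values benefits little from this dtype.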


In [37]:
df[['survived', 'age', 'alone', 'deck']].head()  # Selecting several columns by name

Unnamed: 0,survived,age,alone,deck
0,0,22.0,False,
1,1,38.0,False,C
2,1,26.0,True,
3,1,35.0,False,C
4,0,35.0,True,


In [48]:
df.drop(['age','alone','deck'],axis=1).head()  # Selecting all columns except the listed ones

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive
0,0,3,male,1,0,7.25,S,Third,man,True,Southampton,no
1,1,1,female,1,0,71.2833,C,First,woman,False,Cherbourg,yes
2,1,3,female,0,0,7.925,S,Third,woman,False,Southampton,yes
3,1,1,female,1,0,53.1,S,First,woman,False,Southampton,yes
4,0,3,male,0,0,8.05,S,Third,man,True,Southampton,no


# Subsetting the data

In [6]:
df.iloc[0]  # Selects the row at position 0 (returned as a Series)

survived                 0
pclass                   3
sex                   male
age                   22.0
sibsp                    1
parch                    0
fare                  7.25
embarked                 S
class                Third
who                    man
adult_male            True
deck                   NaN
embark_town    Southampton
alive                   no
alone                False
Name: 0, dtype: object

In [7]:
df.iloc[[0]]  # Selects the row at position 0 and returns a Data Frame

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False


In [55]:
df.iloc[0:5]  # Selects the rows at positions 0 to (5-1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [59]:
df.iloc[:,2].head()  # Selects all rows (:) of the column at position 2

0      male
1    female
2    female
3    female
4      male
Name: sex, dtype: category
Categories (2, object): ['female', 'male']

In [65]:
df.iloc[0:5,0:3]  # Selects the rows at positions 0 to (5-1) and the columns at positions 0 to (3-1)

Unnamed: 0,survived,pclass,sex
0,0,3,male
1,1,1,female
2,1,3,female
3,1,1,female
4,0,3,male


In [64]:
df.iloc[0:5,[0,2]]  # Selects the rows at positions 0 to (5-1) and the columns at the listed positions

Unnamed: 0,survived,sex
0,0,male
1,1,female
2,1,female
3,1,female
4,0,male


In [66]:
df.loc[df['survived'] == 1].head()  # Selects the rows that satisfy the condition survived == 1

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


 **Note:**

- iloc: selects by integer position
 
- loc: selects by label or by boolean mask
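
The difference between the two accessors is easiest to see when the index labels are not 0, 1, 2. The following sketch uses a small hypothetical data frame (`people`, with labels 10, 20, 30) created only for this illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {"nome": ["Andreza", "Cissa", "Guilherme"], "idade": [25, 22, 27]},
    index=[10, 20, 30],
)

by_position = people.iloc[0]  # first row, whatever its label is
by_label = people.loc[10]     # row whose index label is 10

# Slices differ too: iloc excludes the end position, loc includes the end label
first_two = people.iloc[0:2]
up_to_label_30 = people.loc[10:30]

print(len(first_two), len(up_to_label_30))
```

With the default `RangeIndex` used in the Titanic data, positions and labels coincide, which is why `iloc[0]` and `loc[0]` return the same row there.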

In [74]:
df.loc[(df['survived'] == 1) & (df['sex']=='female')].head()  # The condition can combine comparisons with logical operators

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,is_child
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False,0
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False,1
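
A minimal sketch of the logical operators available for such conditions: `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `==` and `<`. The data frame `sample` is made up for this example.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "survived": [0, 1, 1, 0],
        "sex": ["male", "female", "male", "female"],
        "age": [22.0, 38.0, 4.0, 30.0],
    }
)

# AND: both conditions must hold
both = sample.loc[(sample["survived"] == 1) & (sample["sex"] == "female")]

# OR: at least one condition must hold
either = sample.loc[(sample["survived"] == 1) | (sample["age"] < 18)]

# NOT: inverts the condition
negated = sample.loc[~(sample["sex"] == "male")]

print(both.index.tolist(), either.index.tolist(), negated.index.tolist())
```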


In [9]:
df[df['embark_town'].isin(['Cherbourg','Southampton'])].head()
# isin is another way to state conditions for subsetting the data
# Returns the rows where embark_town matches one of the values listed in isin

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
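
The sketch below, on a small made-up data frame (`towns`), shows that `isin` is equivalent to chaining `==` comparisons with `|`, and that the resulting mask can be inverted with `~` to keep every other row.

```python
import pandas as pd

towns = pd.DataFrame(
    {"embark_town": ["Southampton", "Cherbourg", "Queenstown", None]}
)

mask = towns["embark_town"].isin(["Cherbourg", "Southampton"])

# Same condition written with explicit comparisons
explicit = (towns["embark_town"] == "Cherbourg") | (towns["embark_town"] == "Southampton")

selected = towns[mask]
others = towns[~mask]  # rows with a missing value also end up here

print(len(selected), len(others))
```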


# Creating New Variables and changing their position in the Data Frame

In [11]:
df['is_child'] = [1 if x < 19 else 0 for x in df['age']]  # Example of creating a new variable with a list comprehension; missing ages get 0
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,is_child
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,0
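
As an alternative to the list comprehension, the same variable can be computed with vectorized operations. This is a sketch on a small made-up data frame (`ages`); the column names `is_child_listcomp` and `is_child_where` exist only to compare the three approaches.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"age": [22.0, 14.0, np.nan, 35.0, 2.0]})

# List comprehension, as in the cell above
ages["is_child_listcomp"] = [1 if x < 19 else 0 for x in ages["age"]]

# Vectorized comparison converted to integers
ages["is_child"] = (ages["age"] < 19).astype(int)

# np.where picks 1 where the condition holds and 0 elsewhere
ages["is_child_where"] = np.where(ages["age"] < 19, 1, 0)

print(ages)
```

In all three versions a missing age compares as false and therefore yields 0, so passengers with unknown age are counted as non-children.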


In [20]:
# Moving the new variable to the start of the Data Frame (position 0 in insert)
nova_variavel = df['is_child']
df = df.drop(columns=['is_child'])
df.insert(loc=0, column='is_child', value=nova_variavel)
df.head()


Unnamed: 0,is_child,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,0,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,0,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,0,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
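
A shorter variant of the same move, sketched on a made-up data frame (`frame`): `pop` removes the column and returns it, so it can be passed straight to `insert`. Reordering through an explicit column list is shown as a second option.

```python
import pandas as pd

frame = pd.DataFrame(
    {"survived": [0, 1], "age": [22.0, 38.0], "is_child": [0, 0]}
)

# pop removes the column in place and returns it as a Series
frame.insert(0, "is_child", frame.pop("is_child"))

# Alternative: select the columns in the desired order (returns a new data frame)
reordered = frame[["survived", "is_child", "age"]]

print(frame.columns.tolist())
print(reordered.columns.tolist())
```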


# Sorting the data

In [39]:
df.sort_values('age').head()  # Sorting by the age column (ascending by default)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
803,1,3,male,0.42,0,1,8.5167,C,Third,child,False,,Cherbourg,yes,False
755,1,2,male,0.67,1,1,14.5,S,Second,child,False,,Southampton,yes,False
644,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
469,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
78,1,2,male,0.83,0,2,29.0,S,Second,child,False,,Southampton,yes,False


In [21]:
df.sort_values('age',ascending=False).head()  # Sorting in descending order

Unnamed: 0,is_child,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
630,0,1,1,male,80.0,0,0,30.0,S,First,man,True,A,Southampton,yes,True
851,0,0,3,male,74.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
493,0,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
96,0,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
116,0,0,3,male,70.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
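
`sort_values` also accepts several columns and a position for missing values. The sketch below uses a small made-up data frame (`passengers`) to show both options.

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame(
    {"pclass": [3, 1, 3, 1], "age": [22.0, 38.0, np.nan, 35.0]}
)

# Sort by pclass ascending, then by age descending within each class
sorted_multi = passengers.sort_values(["pclass", "age"], ascending=[True, False])

# Missing values go last by default; na_position="first" puts them at the top
nan_first = passengers.sort_values("age", na_position="first")

print(sorted_multi.index.tolist())
print(nan_first.index.tolist())
```

Like the cells above, these calls return a new data frame and leave the original untouched.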


# References

- https://pandas.pydata.org/docs/index.html
- https://www.w3schools.com/python/pandas/default.asp