# X500 - Introduction to Data Science

Prof. Erneson Alves de Oliveira<br>
Programa de Pós-Graduação em Informática Aplicada<br>
Universidade de Fortaleza

# 2 Introduction to Pandas

![picture](https://drive.google.com/thumbnail?id=1UzNXA9dEa9ynMRWH5PqgUUAk-dWSKkWN&sz=w600)

https://pandas.pydata.org

## 2.1 What is Pandas?

Pandas is a Python module for manipulating tabular data (spreadsheet-like tables).

![picture](https://drive.google.com/thumbnail?id=1EqkkwMguFttpovsYVstiQKsxgmPz34Kb&sz=w400)

## 2.8 Iterating over DataFrames

In [4]:
import pandas as pd
d = {'apples': {'A': 3, 'B': 2, 'C': 0, 'D': 1},
     'oranges': {'A': 0, 'B': 3, 'C': 7, 'D': 2},
     'bananas': {'A': 1, 'B': 3, 'C': 5, 'D': 4},
     'avocados': {'A': 9, 'B': 0, 'C': 0, 'D': 1}}

df = pd.DataFrame(d)
print(df)

   apples  oranges  bananas  avocados
A       3        0        1         9
B       2        3        3         0
C       0        7        5         0
D       1        2        4         1


In [5]:
df.index

Index(['A', 'B', 'C', 'D'], dtype='object')

In [6]:
for indice in df.index:
    # print(indice)
    print(df.loc[indice, 'apples'])

3
2
0
1


In [7]:
for indice in df.index:
    print(df['apples'][indice])

3
2
0
1


In [8]:
for i in range(df.shape[0]):
    print(df.iloc[i, 0])

3
2
0
1


In [9]:
for indice, linha in df.iterrows():
    # print(indice)
    # print(linha)
    # break
    print(indice, linha['apples'], linha['bananas'])

A 3 1
B 2 3
C 0 5
D 1 4
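
Row-by-row loops like the ones above are fine for inspecting a small table, but they tend to be slow on large ones. As a minimal sketch, assuming a reduced version of the fruit table (rebuilt here so the block runs on its own), `itertuples` yields one lightweight named tuple per row, and a column expression avoids the loop altogether. The names `fruit`, `pairs`, and `total` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Reduced copy of the fruit table from this section (illustrative).
fruit = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'bananas': [1, 3, 5, 4]},
    index=['A', 'B', 'C', 'D'],
)

# itertuples() yields one named tuple per row; the row label is in .Index.
pairs = [(row.Index, row.apples, row.bananas) for row in fruit.itertuples()]
print(pairs)

# Vectorized alternative: operate on whole columns instead of looping.
total = fruit['apples'] + fruit['bananas']
print(total)
```

When the goal is a computation rather than printing, the vectorized form is usually the one to prefer.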


## 2.9 Grouping values

In [10]:
d = {'brands': ['Volkswagen', 'Chevrolet', 'Toyota', 'Chevrolet', 'Renault', 'Chevrolet', 'Toyota'],
     'models': ['Fusca', 'Onix', 'Corolla', 'Prisma', 'Duster', 'Onix', 'Etios'],
     'price': [1000, 30000, 60000, 32000, 35000, 40000, 35000]}

df = pd.DataFrame(d)
print(df)

       brands   models  price
0  Volkswagen    Fusca   1000
1   Chevrolet     Onix  30000
2      Toyota  Corolla  60000
3   Chevrolet   Prisma  32000
4     Renault   Duster  35000
5   Chevrolet     Onix  40000
6      Toyota    Etios  35000


In [11]:
df['brands'].value_counts()

brands
Chevrolet     3
Toyota        2
Volkswagen    1
Renault       1
Name: count, dtype: int64

In [12]:
def Minuscula(x):
    if isinstance(x, str):
        return x.lower()
    else:
        return x

df['brands'].apply(Minuscula)

0    volkswagen
1     chevrolet
2        toyota
3     chevrolet
4       renault
5     chevrolet
6        toyota
Name: brands, dtype: object
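
For a column that holds only strings, the same result can usually be obtained with the vectorized `.str` accessor instead of a custom function. A minimal sketch, assuming a string-only column (the `brands` Series below is a shortened, illustrative copy):

```python
import pandas as pd

# Shortened copy of the 'brands' column (illustrative).
brands = pd.Series(['Volkswagen', 'Chevrolet', 'Toyota'], name='brands')

# .str.lower() lowercases every string element of the Series.
lower = brands.str.lower()
print(lower)
```

One difference from `Minuscula`: in a column of mixed types, `.str.lower()` returns a missing value for non-string elements, whereas `Minuscula` passes them through unchanged.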

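In [13]:
# Select 'price' before summing so that only the numeric column is aggregated
print(df.groupby('brands')['price'].sum())

brands
Chevrolet     102000
Renault        35000
Toyota         95000
Volkswagen      1000
Name: price, dtype: int64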

In [14]:
s = df.groupby(['brands', 'models']).sum()
print(s)

# print(type(s))
print(list(s.index))
print(list(s.columns))
s['price'][('Chevrolet', 'Onix')]

                    price
brands     models        
Chevrolet  Onix     70000
           Prisma   32000
Renault    Duster   35000
Toyota     Corolla  60000
           Etios    35000
Volkswagen Fusca     1000
[('Chevrolet', 'Onix'), ('Chevrolet', 'Prisma'), ('Renault', 'Duster'), ('Toyota', 'Corolla'), ('Toyota', 'Etios'), ('Volkswagen', 'Fusca')]
['price']


70000
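
Grouping by two columns produces a result indexed by a `MultiIndex`. The sketch below, which assumes a shortened copy of the car table (named `cars` here for illustration), shows three common ways to work with such a result: a single `.loc` lookup, selection on the first level only, and flattening with `reset_index`.

```python
import pandas as pd

# Shortened copy of the car table (illustrative).
cars = pd.DataFrame({
    'brands': ['Chevrolet', 'Chevrolet', 'Chevrolet', 'Toyota'],
    'models': ['Onix', 'Prisma', 'Onix', 'Etios'],
    'price': [30000, 32000, 40000, 35000],
})
s = cars.groupby(['brands', 'models']).sum()

# A single .loc call with (row key tuple, column) avoids chained indexing.
onix_total = s.loc[('Chevrolet', 'Onix'), 'price']
print(onix_total)

# Selecting on the first level only returns all models of that brand.
chevrolet = s.loc['Chevrolet']
print(chevrolet)

# reset_index() turns the group keys back into ordinary columns.
flat = s.reset_index()
print(flat)
```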

In [15]:
# grouped = df.groupby('brands')
# print(type(grouped))
# print(dir(grouped))
# print(grouped.size())

# s = df.groupby('brands').sum()
# print(s)

# print(type(s))
print(s.sort_values(by = 'price', ascending = True))

                    price
brands     models        
Volkswagen Fusca     1000
Chevrolet  Prisma   32000
Renault    Duster   35000
Toyota     Etios    35000
           Corolla  60000
Chevrolet  Onix     70000
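
Besides `sum`, a grouped object can compute several statistics in one call with `agg`. The sketch below assumes the same car table as in this section, rebuilt as `cars` so the block runs on its own. The output column names `n_cars`, `mean_price`, and `max_price` are illustrative choices.

```python
import pandas as pd

# Copy of the car table from this section.
cars = pd.DataFrame({
    'brands': ['Volkswagen', 'Chevrolet', 'Toyota', 'Chevrolet',
               'Renault', 'Chevrolet', 'Toyota'],
    'models': ['Fusca', 'Onix', 'Corolla', 'Prisma', 'Duster', 'Onix', 'Etios'],
    'price': [1000, 30000, 60000, 32000, 35000, 40000, 35000],
})

# Named aggregation: each keyword becomes an output column,
# defined as (input column, aggregation function).
summary = cars.groupby('brands').agg(
    n_cars=('price', 'count'),
    mean_price=('price', 'mean'),
    max_price=('price', 'max'),
)
print(summary)
```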


## 2.10 Cleaning DataFrames

In [16]:
import numpy as np
d = {'apples': [2, np.nan, 1, 5, 1],
     'oranges': [3, 2, 1, 3, 1],
     'grapes': [3, 2, 1, 3, 1]}

df = pd.DataFrame(d)
print(df)

   apples  oranges  grapes
0     2.0        3       3
1     NaN        2       2
2     1.0        1       1
3     5.0        3       3
4     1.0        1       1


In [17]:
df.describe()

         apples  oranges  grapes
count  4.000000      5.0     5.0
mean   2.250000      2.0     2.0
std    1.892969      1.0     1.0
min    1.000000      1.0     1.0
25%    1.000000      1.0     1.0
50%    1.500000      2.0     2.0
75%    2.750000      3.0     3.0
max    5.000000      3.0     3.0


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   apples   4 non-null      float64
 1   oranges  5 non-null      int64  
 2   grapes   5 non-null      int64  
dtypes: float64(1), int64(2)
memory usage: 252.0 bytes


In [19]:
df['apples'].mean()

2.25

In [20]:
print(df.dropna(axis = 0, how = 'any')) # Drop every row that contains a missing value
print(df.dropna(axis = 1, how = 'any')) # Drop every column that contains a missing value

   apples  oranges  grapes
0     2.0        3       3
2     1.0        1       1
3     5.0        3       3
4     1.0        1       1
   oranges  grapes
0        3       3
1        2       2
2        1       1
3        3       3
4        1       1
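
Dropping is not the only option: a missing value can also be filled in. A minimal sketch, assuming that replacing the gap with the column mean is acceptable for the analysis at hand (this is a modeling choice, not a rule). The frame is rebuilt as `fruit` so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Two columns of the table from this section (illustrative copy).
fruit = pd.DataFrame({'apples': [2, np.nan, 1, 5, 1],
                      'oranges': [3, 2, 1, 3, 1]})

# isna().sum() counts the missing values in each column.
missing = fruit.isna().sum()
print(missing)

# fillna() returns a new DataFrame; the original is left unchanged.
filled = fruit.fillna({'apples': fruit['apples'].mean()})
print(filled)
```

Like `dropna`, `fillna` does not modify `fruit` unless the result is assigned back.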


In [21]:
df

   apples  oranges  grapes
0     2.0        3       3
1     NaN        2       2
2     1.0        1       1
3     5.0        3       3
4     1.0        1       1


In [22]:
print(df.drop_duplicates(keep = 'first')) # Removing duplicated rows (but keeping the first)

   apples  oranges  grapes
0     2.0        3       3
1     NaN        2       2
2     1.0        1       1
3     5.0        3       3


In [23]:
df.T

           0    1    2    3    4
apples   2.0  NaN  1.0  5.0  1.0
oranges  3.0  2.0  1.0  3.0  1.0
grapes   3.0  2.0  1.0  3.0  1.0


In [24]:
print(df.T.drop_duplicates(keep = 'last').T) # Removing duplicated columns (but keeping the last)

   apples  grapes
0     2.0     3.0
1     NaN     2.0
2     1.0     1.0
3     5.0     3.0
4     1.0     1.0
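
Transposing a frame with mixed integer and float columns converts everything to float, which is why `grapes` is printed as `3.0` above. As a sketch of one possible alternative, the duplicate check can be done on the transpose while the selection is done on the original frame, which keeps the original dtypes. The names `fruit`, `is_dup`, and `deduped` are illustrative.

```python
import numpy as np
import pandas as pd

# Copy of the table from this section.
fruit = pd.DataFrame({'apples': [2, np.nan, 1, 5, 1],
                      'oranges': [3, 2, 1, 3, 1],
                      'grapes': [3, 2, 1, 3, 1]})

# duplicated() on the transpose flags repeated columns; with keep='last',
# 'oranges' is flagged because 'grapes' holds the same values.
is_dup = fruit.T.duplicated(keep='last')

# Select the non-duplicated columns from the original frame,
# so the integer dtype of 'grapes' is preserved.
deduped = fruit.loc[:, ~is_dup.to_numpy()]
print(deduped)
print(deduped.dtypes)
```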


In [25]:
# print(df.apples > 1)
print(df[df.apples > 1]) # Keeping only rows where 'apples' is greater than 1 (rows with NaN are dropped too)
df2 = df[df.apples > 1]
print(df[df.apples > 1].reset_index(drop = True)) # ...and resetting the index

   apples  oranges  grapes
0     2.0        3       3
3     5.0        3       3
   apples  oranges  grapes
0     2.0        3       3
1     5.0        3       3


In [26]:
df2.reset_index(drop=True)

   apples  oranges  grapes
0     2.0        3       3
1     5.0        3       3


## 2.11 Merging DataFrames

![picture](https://drive.google.com/thumbnail?id=14du9_I7BTfOFm8HQ4ojy1TqPFIgeVxPZ&sz=w1200)

In [27]:
d = {'Customer_id': pd.Series([1, 2, 3, 4, 5, 7]),
     'Product': pd.Series(['Oven', 'Oven', 'Oven', 'Television', 'Television', 'Television'])}
df0 = pd.DataFrame(d)
df0

   Customer_id     Product
0            1        Oven
1            2        Oven
2            3        Oven
3            4  Television
4            5  Television
5            7  Television


In [28]:
d = {'Customer_id': pd.Series([2, 4, 6]),
     'State': pd.Series(['California', 'California', 'Texas'])}
df1 = pd.DataFrame(d)
df1

   Customer_id       State
0            2  California
1            4  California
2            6       Texas


In [29]:
print(pd.merge(df0, df1, on = 'Customer_id', how = 'inner')) # inner join

# print(pd.merge(df0, df1, on = 'Customer_id', how = 'outer')) # outer join
# print(pd.merge(df0, df1, on = 'Customer_id', how = 'left')) # left join
# print(pd.merge(df0, df1, on = 'Customer_id', how = 'right')) # right join

   Customer_id     Product       State
0            2        Oven  California
1            4  Television  California
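
To see which rows each join type keeps, `pd.merge` accepts `indicator=True`, which adds a `_merge` column recording where each row came from. A minimal sketch, using copies of the two tables from this section:

```python
import pandas as pd

# Copies of the two tables from this section.
df0 = pd.DataFrame({'Customer_id': [1, 2, 3, 4, 5, 7],
                    'Product': ['Oven', 'Oven', 'Oven',
                                'Television', 'Television', 'Television']})
df1 = pd.DataFrame({'Customer_id': [2, 4, 6],
                    'State': ['California', 'California', 'Texas']})

# how='outer' keeps every key from both tables; indicator=True adds a
# '_merge' column with 'left_only', 'right_only', or 'both'.
outer = pd.merge(df0, df1, on='Customer_id', how='outer', indicator=True)
print(outer)
```

Rows marked `both` are exactly the ones returned by the inner join. Rows marked `left_only` or `right_only` have missing values in the columns of the other table.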

