# X500 - Introduction to Data Science

Prof. Erneson Alves de Oliveira<br>
Graduate Program in Applied Informatics<br>
Universidade de Fortaleza

# 2 Introduction to Pandas

![picture](https://drive.google.com/thumbnail?id=1UzNXA9dEa9ynMRWH5PqgUUAk-dWSKkWN&sz=w600)

https://pandas.pydata.org

## 2.1 What is Pandas?

Pandas is a Python library for manipulating tabular data (spreadsheet-like tables).

![picture](https://drive.google.com/thumbnail?id=1EqkkwMguFttpovsYVstiQKsxgmPz34Kb&sz=w400)

## 2.2 Series

![picture](https://drive.google.com/thumbnail?id=1FmeCWF76KsBFbTM9CJQ5k1efxyD6m0Gn&sz=w300)

In [3]:
import pandas as pd
import numpy as np

apples = [3, 2, 0, 1]

s = pd.Series(apples,
              name = 'apples',
              dtype = np.int64)

print(s)
print(type(s))
print(dir(s))

0    3
1    2
2    0
3    1
Name: apples, dtype: int64
<class 'pandas.core.series.Series'>
['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__

In [7]:
print(s.head(2))
print(s.tail(2))

0    3
1    2
Name: apples, dtype: int64
2    0
3    1
Name: apples, dtype: int64


In [None]:
print(s.ndim)
print(s.shape)

1
(4,)


In [8]:
print(s.mean())
print(s.max())
print(s.min())
print(s[0])

1.5
3
0
3
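
`s[0]` works in the cell above because the Series uses the default integer index, so label and position coincide. The following is a minimal sketch, assuming a hypothetical string index (`'a'` to `'d'`, not part of the original example), of how label-based access (`.loc`) and position-based access (`.iloc`) differ:

```python
import pandas as pd

# Hypothetical labels, chosen only for illustration.
s_labeled = pd.Series([3, 2, 0, 1],
                      index=['a', 'b', 'c', 'd'],
                      name='apples')

first_by_label = s_labeled.loc['a']     # access by index label
first_by_position = s_labeled.iloc[0]   # access by integer position
last_two = s_labeled.iloc[-2:]          # slice by position

print(first_by_label)
print(first_by_position)
print(last_two.tolist())
```

Using `.loc` and `.iloc` explicitly makes it clear whether a label or a position is meant, which matters once the index is no longer `0, 1, 2, ...`.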


## 2.3 DataFrames (Spreadsheets)

![picture](https://drive.google.com/thumbnail?id=17blGuGZf1lLKyV3ZYEU9SR95jhd4rCqI&sz=w800)

In [None]:
d = {'apples': [3, 2, 0, 1],
     'oranges': [0, 3, 7, 2]}

df = pd.DataFrame(d)

print(df)
print(type(df))
# print(dir(df))

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2
<class 'pandas.core.frame.DataFrame'>


In [16]:
print(df.ndim)
print(df.shape)
print(df.mean())

2
(4, 2)
apples     1.5
oranges    3.0
dtype: float64


In [20]:
print(df.mean()['apples'])
print(df.max())
print(df.min())
print(df.T)
transpose = df.T
print(transpose.shape)

1.5
apples     3
oranges    7
dtype: int64
apples     0
oranges    0
dtype: int64
         0  1  2  3
apples   3  2  0  1
oranges  0  3  7  2
(2, 4)
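
The aggregations above run column by column, which is the default (`axis=0`). As a small sketch on the same data, the snippet below shows that selecting one column gives a Series and that `axis=1` aggregates across each row instead. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1],
                   'oranges': [0, 3, 7, 2]})

col = df['apples']            # a single column is a Series
col_means = df.mean()         # default axis=0: one value per column
row_means = df.mean(axis=1)   # axis=1: one value per row

print(type(col).__name__)
print(col_means['oranges'])
print(row_means.tolist())
```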


In [None]:
d = {'apples': [3, 2, 0, 1],
     'oranges': [0, 3, 7, 2]}

df = pd.DataFrame(d)
print(df)

print(list(df.index))
print(list(df.columns))

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2
[0, 1, 2, 3]
['apples', 'oranges']


In [22]:
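# Explicit loop, equivalent to the list comprehension below:
# minha_lista = []
# for x in df.index:
#     minha_lista.append(x)
# print(minha_lista)

# List comprehension that collects the index labels
minha_lista = [x for x in df.index]
print(minha_lista)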

[0, 1, 2, 3]
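
As an alternative to looping, index objects can be converted directly. A brief sketch, using the `tolist()` method that pandas index objects provide:

```python
import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1],
                   'oranges': [0, 3, 7, 2]})

index_list = df.index.tolist()      # row labels as a plain Python list
column_list = df.columns.tolist()   # column names as a plain Python list

print(index_list)
print(column_list)
```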


In [25]:
d = {'apples': {'A': 3,
                'B': 2,
                'C': 0,
                'D': 1},
     'oranges': {'A': 0,
                 'B': 3,
                 'C': 7,
                 'D': 2}}

df = pd.DataFrame(d)
print(df)

print(list(df.index))
print(list(df.columns))

   apples  oranges
A       3        0
B       2        3
C       0        7
D       1        2
['A', 'B', 'C', 'D']
['apples', 'oranges']
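
With a labeled index like this one, rows can be selected by label or by position. The following is a minimal sketch on the same dictionary; the variable names `row_b`, `cell`, and `first_two` are illustrative only.

```python
import pandas as pd

d = {'apples': {'A': 3, 'B': 2, 'C': 0, 'D': 1},
     'oranges': {'A': 0, 'B': 3, 'C': 7, 'D': 2}}
df = pd.DataFrame(d)

row_b = df.loc['B']             # row with label 'B', returned as a Series
cell = df.loc['C', 'oranges']   # single value by row label and column label
first_two = df.iloc[:2]         # first two rows by position

print(row_b.tolist())
print(cell)
print(list(first_two.index))
```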


## 2.4 Writing and reading files (.csv and .xlsx)

In [28]:
d = {'apples': [3, 2, 0, 1],
     'oranges': [0, 3, 7, 2],
     'bananas': [1, 3, 5, 4],
     'avocados': [9, 0, 0, 1]}

df0 = pd.DataFrame(d)
print(df0)

   apples  oranges  bananas  avocados
0       3        0        1         9
1       2        3        3         0
2       0        7        5         0
3       1        2        4         1


### 2.4.1 Comma-separated values (.csv)

In [29]:
df0.to_csv('arquivo_de_saida.csv',
           sep = ';',
           encoding = 'utf-8',
           header = True,
           index = False) # Writing a CSV

In [30]:
df1 = pd.read_csv('arquivo_de_saida.csv',
                  sep = ';',
                  usecols = ['apples', 'bananas','oranges'],
                  encoding = 'utf-8',
                  nrows = 5) # Reading a CSV
df1

   apples  oranges  bananas
0       3        0        1
1       2        3        3
2       0        7        5
3       1        2        4
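
The `index = False` argument used when writing matters on the way back in. The sketch below, which assumes an in-memory buffer (`io.StringIO`) in place of a file on disk and a two-column version of the data, shows what happens when the index is written and how `index_col` can restore it:

```python
import io
import pandas as pd

df0 = pd.DataFrame({'apples': [3, 2, 0, 1],
                    'oranges': [0, 3, 7, 2]})

# With no path argument, to_csv returns the CSV text as a string.
csv_with_index = df0.to_csv(sep=';', index=True)
csv_without_index = df0.to_csv(sep=';', index=False)

# Reading back: a written index shows up as an extra unnamed column...
back_with = pd.read_csv(io.StringIO(csv_with_index), sep=';')
# ...unless it is declared as the index when reading.
restored = pd.read_csv(io.StringIO(csv_with_index), sep=';', index_col=0)
back_without = pd.read_csv(io.StringIO(csv_without_index), sep=';')

print(list(back_with.columns))
print(list(restored.columns))
print(back_without.equals(df0))
```

The separator must match on both sides: a file written with `sep=';'` has to be read with `sep=';'`.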


### 2.4.2 Excel spreadsheets (.xlsx)

In [31]:
df0.to_excel('output.xlsx', index = False)

In [32]:
df2 = pd.read_excel('output.xlsx')
print(df2)

   apples  oranges  bananas  avocados
0       3        0        1         9
1       2        3        3         0
2       0        7        5         0
3       1        2        4         1

