# Stone V4 - Data Science

---

**Date:** 16/09/2021 \
**Prof.:** Gabriel R. Freitas \
**E-mail:** gab.ribeiro.freitas@gmail.com\
**LinkedIn:** www.linkedin.com/in/grfreitas/

# Lesson 4 - Pandas [1/3]

Pandas is a data analysis library that makes it easier to work with tabular data such as CSV and Excel files.

How a data table is structured:

**- Columns: pd.Series**

**- Tables: pd.DataFrame**
- Rows: each row is labeled by the index
- Cells: values
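
As a minimal sketch of this structure, the snippet below builds a tiny table and pulls out a column, a row, and a cell. The name `tabela` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration.
tabela = pd.DataFrame({"nome": ["Ana", "Bia"], "idade": [30, 25]})

coluna = tabela["idade"]        # a single column is a pd.Series
linha = tabela.iloc[0]          # a single row is also returned as a pd.Series
celula = tabela.loc[0, "nome"]  # a single cell is a plain value

print(type(tabela).__name__, type(coluna).__name__, celula)
```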

In [1]:
import pandas as pd

### Series

In [2]:
lista = [1, 2, 3, 4, 5, 6, 7]

In [3]:
series = pd.Series(lista)

In [4]:
series.max()

7

In [5]:
series.min()

1

In [6]:
series.mean()

4.0

In [7]:
series.std()

2.160246899469287
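
The value above is the sample standard deviation: `Series.std()` divides by `n - 1` by default (`ddof=1`), whereas NumPy's `np.std` divides by `n`. A short sketch of the difference, where the variable names `amostral` and `populacional` are only illustrative:

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 6, 7])

amostral = series.std()            # pandas default: ddof=1 (divides by n - 1)
populacional = series.std(ddof=0)  # divides by n, like NumPy's default

print(amostral, populacional, np.std(series.to_numpy()))
```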

In [8]:
series[0]

1

In [9]:
series[0:4]

0    1
1    2
2    3
3    4
dtype: int64

In [10]:
series

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

In [11]:
len(series)

7

In [12]:
series.head(3)

0    1
1    2
2    3
dtype: int64

In [13]:
series.tail(3)

4    5
5    6
6    7
dtype: int64

In [14]:
series[5]

6

In [15]:
series

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

In [16]:
series.index

RangeIndex(start=0, stop=7, step=1)

In [17]:
series.values

array([1, 2, 3, 4, 5, 6, 7])

In [18]:
labels = ["a", "b", "c", "d", "e"]
lista2 = [10, 20, 30, 50, 60] 

In [19]:
series2 = pd.Series(data=lista2, index=labels)

In [20]:
series

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

In [21]:
series2

a    10
b    20
c    30
d    50
e    60
dtype: int64

In [22]:
series2["a"]

10

In [23]:
# Positional access; series2[0] on a labeled index is deprecated in recent pandas
series2.iloc[0]

10

In [24]:
series2[0:4]

a    10
b    20
c    30
d    50
dtype: int64

In [25]:
series2["a":"d"]

a    10
b    20
c    30
d    50
dtype: int64
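
Both slices above return four elements because a positional slice excludes its end, while a label slice includes it. A small sketch that makes the two rules explicit with `.iloc` and `.loc` (the variable names are illustrative):

```python
import pandas as pd

series2 = pd.Series([10, 20, 30, 50, 60], index=["a", "b", "c", "d", "e"])

por_posicao = series2.iloc[0:3]    # positions 0, 1, 2 (end excluded)
por_rotulo = series2.loc["a":"c"]  # labels a, b, c (end included)
primeiro = series2.iloc[0]         # first element, regardless of its label

print(por_posicao.equals(por_rotulo), primeiro)
```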

In [26]:
# Series from dictionaries
dicionario = {"A": 10, "B": 100, "C": 1000}
series3 = pd.Series(dicionario)

In [27]:
series3["A"]

10

In [28]:
# Operations with Series!
dic1 = {"Brasil": 10, "Colômbia": 20, "Equador": 30}
dic2 = {"Brasil": 10, "Bolívia": 20, "Japão": 10, "Equador": 50}

In [29]:
series1 = pd.Series(dic1)
series2 = pd.Series(dic2)

In [30]:
import numpy as np

In [31]:
lista = [1, "a", [1, 2, 3]]

In [32]:
print(series1)

Brasil      10
Colômbia    20
Equador     30
dtype: int64


In [33]:
print(series2)

Brasil     10
Bolívia    20
Japão      10
Equador    50
dtype: int64


In [34]:
type(np.nan)

float

In [35]:
series3 = series1 + series2

In [36]:
series4 = series3.astype(str)

In [37]:
series4["Brasil"]

'20.0'

In [38]:
series4["Bolívia"]

'nan'
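
The `nan` for Bolívia comes from index alignment: adding two Series matches values by label, and any label present on only one side yields NaN (which also turns the result into floats). One possible way to treat a missing side as zero is `Series.add` with `fill_value`, sketched here with illustrative variable names:

```python
import pandas as pd

series1 = pd.Series({"Brasil": 10, "Colômbia": 20, "Equador": 30})
series2 = pd.Series({"Brasil": 10, "Bolívia": 20, "Japão": 10, "Equador": 50})

soma = series1 + series2                              # NaN where a label is missing on one side
soma_preenchida = series1.add(series2, fill_value=0)  # missing side treated as 0

print(soma_preenchida)
```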

In [39]:
# Unique values
series1 = pd.Series(['a','b','c','a','d','e','c','a','e','d'])
series1.unique()

array(['a', 'b', 'c', 'd', 'e'], dtype=object)

In [40]:
# Number of unique values
series1.nunique()

5

In [41]:
# Count of each unique value
series1.value_counts()

a    3
c    2
d    2
e    2
b    1
dtype: int64

In [42]:
# Normalized count of each unique value (the fractions sum to 1)
series1.value_counts(normalize=True)

a    0.3
c    0.2
d    0.2
e    0.2
b    0.1
dtype: float64

In [43]:
# Mask
series1[series1 == "a"]

0    a
3    a
7    a
dtype: object

Logical comparisons between Series use a different syntax from plain Python, and each comparison must be wrapped in parentheses:

**| -> or \
& -> and**

In [44]:
series1

0    a
1    b
2    c
3    a
4    d
5    e
6    c
7    a
8    e
9    d
dtype: object

In [45]:
series1[(series1 == "a") | (series1 == "b")]

0    a
1    b
3    a
7    a
dtype: object

In [46]:
series1[series1 <= "b"]

0    a
1    b
3    a
7    a
dtype: object

In [47]:
lista_de_filtro = ["a", "b", "c", "d", "f"]
series1[series1.isin(lista_de_filtro)]

0    a
1    b
2    c
3    a
4    d
6    c
7    a
9    d
dtype: object
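
The same building blocks combine with `&` and with `~` (negation). The sketch below reuses `series1`; the result names are illustrative.

```python
import pandas as pd

series1 = pd.Series(["a", "b", "c", "a", "d", "e", "c", "a", "e", "d"])

# Each comparison needs its own parentheses: & and | bind tighter than >= and <=.
entre_b_e_d = series1[(series1 >= "b") & (series1 <= "d")]

# ~ negates a mask, here keeping everything outside the filter list.
fora_da_lista = series1[~series1.isin(["a", "b", "c", "d", "f"])]

print(entre_b_e_d.tolist(), fora_da_lista.tolist())
```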

### DataFrames

In [48]:
pd.DataFrame

pandas.core.frame.DataFrame

In [49]:
lista = np.random.rand(5, 3)

In [50]:
lista.shape

(5, 3)

In [51]:
lista

array([[0.06096652, 0.87676447, 0.90182795],
       [0.24336036, 0.63639307, 0.57781443],
       [0.12094888, 0.1960146 , 0.07169075],
       [0.9062168 , 0.14922836, 0.84772785],
       [0.79795994, 0.78413434, 0.25670512]])

In [52]:
index = ["a", "b", "c", "d", "e"]
columns = ["x", "y", "z"]

df = pd.DataFrame(data=lista, index=index, columns=columns)

In [53]:
df["x"]

a    0.060967
b    0.243360
c    0.120949
d    0.906217
e    0.797960
Name: x, dtype: float64

In [54]:
df["y"]

a    0.876764
b    0.636393
c    0.196015
d    0.149228
e    0.784134
Name: y, dtype: float64

In [55]:
type(df["x"])

pandas.core.series.Series

#### Aside: appending to a dictionary

In [56]:
dicionario_vazio = {}

In [57]:
dicionario_vazio

{}

In [58]:
dicionario_vazio["chave"] = 10

In [59]:
dicionario_vazio

{'chave': 10}

---

In [60]:
# Create a new column
df["s1"] = df["x"] + df["y"] + df["z"]

In [61]:
df

|   | x | y | z | s1 |
|---|---|---|---|---|
| a | 0.060967 | 0.876764 | 0.901828 | 1.839559 |
| b | 0.24336 | 0.636393 | 0.577814 | 1.457568 |
| c | 0.120949 | 0.196015 | 0.071691 | 0.388654 |
| d | 0.906217 | 0.149228 | 0.847728 | 1.903173 |
| e | 0.79796 | 0.784134 | 0.256705 | 1.838799 |


In [62]:
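# Sum methods and axes
# axis=0 -> runs down the rows (one result per column)
# axis=1 -> runs across the columns (one result per row)
# Note: df already contains s1, so s2 adds x, y, z and s1 (twice the value of s1)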

df["s2"] = df.sum(axis=1)
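
To see both axes side by side, here is a small sketch on a made-up table (`df_exemplo` and its values are not from the lesson data):

```python
import pandas as pd

# Hypothetical table with easy-to-check numbers.
df_exemplo = pd.DataFrame({"x": [1, 2], "y": [10, 20]}, index=["a", "b"])

por_coluna = df_exemplo.sum(axis=0)  # collapses the rows: one total per column
por_linha = df_exemplo.sum(axis=1)   # collapses the columns: one total per row

print(por_coluna.to_dict(), por_linha.to_dict())
```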

In [63]:
# Creating new columns filled with a single constant value
df["DADOS BANCÁRIOS - XD001"] = 0
df["50DADOS BANCÁRIOS - XD001"] = 0

In [64]:
df["x"]

a    0.060967
b    0.243360
c    0.120949
d    0.906217
e    0.797960
Name: x, dtype: float64

In [65]:
df.x

a    0.060967
b    0.243360
c    0.120949
d    0.906217
e    0.797960
Name: x, dtype: float64

In [66]:
# Selecting multiple columns!
df[["x", "y"]]

|   | x | y |
|---|---|---|
| a | 0.060967 | 0.876764 |
| b | 0.24336 | 0.636393 |
| c | 0.120949 | 0.196015 |
| d | 0.906217 | 0.149228 |
| e | 0.79796 | 0.784134 |


In [67]:
df

|   | x | y | z | s1 | s2 | DADOS BANCÁRIOS - XD001 | 50DADOS BANCÁRIOS - XD001 |
|---|---|---|---|---|---|---|---|
| a | 0.060967 | 0.876764 | 0.901828 | 1.839559 | 3.679118 | 0 | 0 |
| b | 0.24336 | 0.636393 | 0.577814 | 1.457568 | 2.915136 | 0 | 0 |
| c | 0.120949 | 0.196015 | 0.071691 | 0.388654 | 0.777308 | 0 | 0 |
| d | 0.906217 | 0.149228 | 0.847728 | 1.903173 | 3.806346 | 0 | 0 |
| e | 0.79796 | 0.784134 | 0.256705 | 1.838799 | 3.677599 | 0 | 0 |


In [68]:
# df.loc[row label, column label]
df.loc["a", "x"]

0.060966517796814945

In [69]:
df.loc["b":"d", "x":"y"]

|   | x | y |
|---|---|---|
| b | 0.24336 | 0.636393 |
| c | 0.120949 | 0.196015 |
| d | 0.906217 | 0.149228 |


In [71]:
df.loc[:, ["y", "z"]]

|   | y | z |
|---|---|---|
| a | 0.876764 | 0.901828 |
| b | 0.636393 | 0.577814 |
| c | 0.196015 | 0.071691 |
| d | 0.149228 | 0.847728 |
| e | 0.784134 | 0.256705 |


In [72]:
df.iloc[2]

x                            0.120949
y                            0.196015
z                            0.071691
s1                           0.388654
s2                           0.777308
DADOS BANCÁRIOS - XD001      0.000000
50DADOS BANCÁRIOS - XD001    0.000000
Name: c, dtype: float64
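
`.loc` selects by label and `.iloc` by integer position, which is why `df.iloc[2]` returns the row labeled `c`. A minimal sketch on a made-up table (`df_exemplo` is hypothetical):

```python
import pandas as pd

# Hypothetical table, used only for illustration.
df_exemplo = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"]
)

por_rotulo = df_exemplo.loc["c"]   # row whose label is "c"
por_posicao = df_exemplo.iloc[2]   # third row, whatever its label
bloco = df_exemplo.iloc[0:2, 0:1]  # rows 0-1 and column 0 (ends excluded)

print(por_rotulo.equals(por_posicao), bloco.shape)
```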

#### Differences between access styles

In [73]:
# Most used when we only want to read the values, not create a new column!
df.x

a    0.060967
b    0.243360
c    0.120949
d    0.906217
e    0.797960
Name: x, dtype: float64

In [77]:
# Also used when we only want to read the values, not create a new column,
# BUT here the column name contains characters that make the first style (df.x) impossible
df['DADOS BANCÁRIOS - XD001']

a    0
b    0
c    0
d    0
e    0
Name: DADOS BANCÁRIOS - XD001, dtype: int64

In [78]:
# Most used when we are going to create a new column
df.loc[:, "x"]

a    0.060967
b    0.243360
c    0.120949
d    0.906217
e    0.797960
Name: x, dtype: float64
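
One reason attribute access is kept for reading only: assigning to a new attribute name does not create a column. The sketch below, on a made-up `df_exemplo`, silences the warning pandas emits for this mistake so the contrast is easy to see.

```python
import warnings

import pandas as pd

# Hypothetical table, used only for illustration.
df_exemplo = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about exactly this mistake
    df_exemplo.total = df_exemplo["x"] + df_exemplo["y"]  # sets a plain attribute

criou_por_atributo = "total" in df_exemplo.columns  # no column was created

df_exemplo["total"] = df_exemplo["x"] + df_exemplo["y"]  # brackets create the column
criou_por_colchetes = "total" in df_exemplo.columns

print(criou_por_atributo, criou_por_colchetes)
```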

In [79]:
df

|   | x | y | z | s1 | s2 | DADOS BANCÁRIOS - XD001 | 50DADOS BANCÁRIOS - XD001 |
|---|---|---|---|---|---|---|---|
| a | 0.060967 | 0.876764 | 0.901828 | 1.839559 | 3.679118 | 0 | 0 |
| b | 0.24336 | 0.636393 | 0.577814 | 1.457568 | 2.915136 | 0 | 0 |
| c | 0.120949 | 0.196015 | 0.071691 | 0.388654 | 0.777308 | 0 | 0 |
| d | 0.906217 | 0.149228 | 0.847728 | 1.903173 | 3.806346 | 0 | 0 |
| e | 0.79796 | 0.784134 | 0.256705 | 1.838799 | 3.677599 | 0 | 0 |


In [80]:
# Dropping columns -> axis=1
df.drop("50DADOS BANCÁRIOS - XD001", axis=1, inplace=True)

In [81]:
df

|   | x | y | z | s1 | s2 | DADOS BANCÁRIOS - XD001 |
|---|---|---|---|---|---|---|
| a | 0.060967 | 0.876764 | 0.901828 | 1.839559 | 3.679118 | 0 |
| b | 0.24336 | 0.636393 | 0.577814 | 1.457568 | 2.915136 | 0 |
| c | 0.120949 | 0.196015 | 0.071691 | 0.388654 | 0.777308 | 0 |
| d | 0.906217 | 0.149228 | 0.847728 | 1.903173 | 3.806346 | 0 |
| e | 0.79796 | 0.784134 | 0.256705 | 1.838799 | 3.677599 | 0 |


In [82]:
# Dropping rows -> axis=0
df.drop("a", axis=0, inplace=True)

In [83]:
df

|   | x | y | z | s1 | s2 | DADOS BANCÁRIOS - XD001 |
|---|---|---|---|---|---|---|
| b | 0.24336 | 0.636393 | 0.577814 | 1.457568 | 2.915136 | 0 |
| c | 0.120949 | 0.196015 | 0.071691 | 0.388654 | 0.777308 | 0 |
| d | 0.906217 | 0.149228 | 0.847728 | 1.903173 | 3.806346 | 0 |
| e | 0.79796 | 0.784134 | 0.256705 | 1.838799 | 3.677599 | 0 |
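
`drop` can also be used without `inplace=True`, in which case it returns a new DataFrame and leaves the original untouched. The `columns=` and `index=` keywords are an alternative to spelling out `axis`. A sketch on a made-up `df_exemplo`:

```python
import pandas as pd

# Hypothetical table, used only for illustration.
df_exemplo = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"]
)

sem_y = df_exemplo.drop(columns="y")       # same as drop("y", axis=1)
sem_a_b = df_exemplo.drop(index=["a", "b"])  # same as drop(["a", "b"], axis=0)

print(sem_y.columns.tolist(), sem_a_b.index.tolist(), df_exemplo.shape)
```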


In [85]:
arr = np.random.rand(30, 6)
columns = ["x1", "x2", "x3", "x4", "x5", "x6"]

df = pd.DataFrame(arr, columns=columns)

In [86]:
# Filters - Masks! Values that fail the condition become NaN
df2 = df[df >= 0.5].copy()

In [87]:
df2.head()

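|   | x1 | x2 | x3 | x4 | x5 | x6 |
|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 0.845779 | 0.606105 | NaN | NaN |
| 1 | NaN | NaN | NaN | 0.936082 | NaN | NaN |
| 2 | 0.940452 | NaN | 0.667884 | 0.980295 | NaN | 0.503808 |
| 3 | 0.514153 | NaN | 0.732632 | 0.835453 | NaN | 0.775345 |
| 4 | 0.536097 | NaN | NaN | NaN | 0.950144 | NaN |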


In [88]:
# Rows where x1 >= 0.9
df[df.x1 >= 0.9]

|   | x1 | x2 | x3 | x4 | x5 | x6 |
|---|---|---|---|---|---|---|
| 2 | 0.940452 | 0.112129 | 0.667884 | 0.980295 | 0.287061 | 0.503808 |
| 8 | 0.945263 | 0.136917 | 0.541652 | 0.958419 | 0.353294 | 0.554857 |
| 19 | 0.953903 | 0.231982 | 0.458926 | 0.770054 | 0.283307 | 0.025079 |
| 22 | 0.974911 | 0.440515 | 0.770747 | 0.385909 | 0.642719 | 0.487224 |


In [89]:
df.x5[df.x2 >= 0.7]

6     0.904628
9     0.208983
14    0.274108
17    0.758742
18    0.998958
20    0.794734
29    0.417952
Name: x5, dtype: float64
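
Row masks can be combined just like Series masks, and `.loc[mask, column]` selects one column for the matching rows in a single step. A sketch with made-up values (`df_exemplo` is hypothetical, not the random data above):

```python
import pandas as pd

# Hypothetical values chosen so the result is easy to verify by eye.
df_exemplo = pd.DataFrame({"x1": [0.95, 0.20, 0.91], "x2": [0.10, 0.80, 0.75]})

mascara = (df_exemplo.x1 >= 0.9) & (df_exemplo.x2 >= 0.7)
linhas = df_exemplo[mascara]             # whole rows where both conditions hold
valores = df_exemplo.loc[mascara, "x2"]  # one column, same rows

print(linhas.index.tolist(), valores.tolist())
```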

##### Aside: copies!

In [90]:
lista1 = [1, 2, 3]
lista2 = lista1.copy()

In [91]:
lista1

[1, 2, 3]

In [92]:
lista2

[1, 2, 3]

In [93]:
lista2.append(0)

In [94]:
print(lista2)

[1, 2, 3, 0]


In [95]:
print(lista1)

[1, 2, 3]
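
For contrast, plain assignment does not copy anything: both names point to the same list, so a change through one is visible through the other. A short sketch (the names `apelido` and `copia` are illustrative):

```python
lista1 = [1, 2, 3]
apelido = lista1       # no copy: both names refer to the same list
copia = lista1.copy()  # independent (shallow) copy

apelido.append(0)      # also changes lista1
copia.append(99)       # leaves lista1 untouched

print(lista1)  # [1, 2, 3, 0]
print(copia)   # [1, 2, 3, 99]
```

The same idea is behind calling `.copy()` on a filtered DataFrame, as in `df[df >= 0.5].copy()`: the result can then be modified without any doubt about whether the original is affected.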


---

In [96]:
df = pd.DataFrame()

In [97]:
df.loc[:, "nome"] = pd.Series(["Gabriel", "Aline", "Juliana", "Michel", "Leonardo", "Ana"])
df.loc[:, "idade"] = pd.Series([16, 26, 10, np.nan, 23, np.nan])
df.loc[:, "n"] = pd.Series([1, 2, 3, 4, 5, 6])

In [98]:
df[df.idade >= 18]

|   | nome | idade | n |
|---|---|---|---|
| 1 | Aline | 26.0 | 2 |
| 4 | Leonardo | 23.0 | 5 |


In [99]:
df.nome[df.idade >= 18]

1       Aline
4    Leonardo
Name: nome, dtype: object

---

In [100]:
df

|   | nome | idade | n |
|---|---|---|---|
| 0 | Gabriel | 16.0 | 1 |
| 1 | Aline | 26.0 | 2 |
| 2 | Juliana | 10.0 | 3 |
| 3 | Michel | NaN | 4 |
| 4 | Leonardo | 23.0 | 5 |
| 5 | Ana | NaN | 6 |


In [101]:
# numeric_only=True skips the text column nome (required from pandas 2.0 on)
df.mean(numeric_only=True)

idade    18.75
n         3.50
dtype: float64

In [102]:
df.idade.mean()

18.75

In [103]:
df.idade.median()

19.5
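
Both results ignore the two missing ages: pandas skips NaN by default, so the mean is taken over the four known values. A sketch of that behavior on the same ages (the result names are illustrative):

```python
import numpy as np
import pandas as pd

idade = pd.Series([16, 26, 10, np.nan, 23, np.nan])

media = idade.mean()                      # NaN ignored: (16 + 26 + 10 + 23) / 4
media_com_nan = idade.mean(skipna=False)  # NaN propagates into the result
preenchidos = idade.count()               # number of non-null entries

print(media, media_com_nan, preenchidos)
```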

### Nulls

In [104]:
df

|   | nome | idade | n |
|---|---|---|---|
| 0 | Gabriel | 16.0 | 1 |
| 1 | Aline | 26.0 | 2 |
| 2 | Juliana | 10.0 | 3 |
| 3 | Michel | NaN | 4 |
| 4 | Leonardo | 23.0 | 5 |
| 5 | Ana | NaN | 6 |


In [109]:
df.loc[df.nome == "Gabriel", "idade"] = 10

In [111]:
df

|   | nome | idade | n |
|---|---|---|---|
| 0 | Gabriel | 10.0 | 1 |
| 1 | Aline | 26.0 | 2 |
| 2 | Juliana | 10.0 | 3 |
| 3 | Michel | NaN | 4 |
| 4 | Leonardo | 23.0 | 5 |
| 5 | Ana | NaN | 6 |
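
The remaining NaN values can be counted, dropped, or filled. The sketch below shows one possible approach with `isna`, `dropna`, and `fillna`; the table `df_exemplo` is a reduced, made-up version of the one above, and filling with the mean is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical reduced table, used only for illustration.
df_exemplo = pd.DataFrame(
    {"nome": ["Gabriel", "Michel", "Leonardo"], "idade": [10, np.nan, 23]}
)

nulos_por_coluna = df_exemplo.isna().sum()       # number of nulls in each column
sem_nulos = df_exemplo.dropna(subset=["idade"])  # drop rows with no idade
preenchido = df_exemplo.fillna({"idade": df_exemplo["idade"].mean()})  # fill with the mean

print(nulos_por_coluna.to_dict(), len(sem_nulos), preenchido["idade"].tolist())
```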
