# Pandas

In [1]:
!pip install pandas 

# import sys
# sys.executable
# ! path/to/python -m pip install pandas

Defaulting to user installation because normal site-packages is not writeable


You should consider upgrading via the 'c:\program files\python39\python.exe -m pip install --upgrade pip' command.


In [1]:
import numpy as np
import pandas as pd

# What is Pandas?

Pandas can be thought of as an enhanced version of NumPy arrays: rows and columns can be identified with labels instead of only integer indices.
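As a rough illustration of labels versus integer positions, the sketch below stores the same values in a NumPy array and in a pandas Series. The film titles used as labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: elements are addressed by integer position only.
arr = np.array([10.0, 9.0, 8.5])
first_by_position = arr[0]

# The same values in a pandas Series, with a label attached to each element.
ratings = pd.Series(arr, index=["Black Swan", "Frozen", "Moana"])
first_by_label = ratings["Black Swan"]

print(first_by_position, first_by_label)
```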


* [Documentation](https://pandas.pydata.org/docs/reference/index.html#api)
* [GitHub](https://github.com/pandas-dev/pandas/blob/master/pandas/core/base.py)

There are **three** main pandas elements we **need** to understand.
1. Pandas Series
2. Pandas DataFrame
3. Index

----

# 1. The Pandas Series

**The primary building block of Pandas**. A pandas Series is a one-dimensional (**1-D**) indexed array.

In [2]:
# Your Code here!
pd.Series()

Series([], dtype: float64)

In [3]:
pd.Series(dtype = 'object')

Series([], dtype: object)

## 1.1 Creating a Pandas Series

In [4]:
# From a list
values = [1, 2, 3]

series1 = pd.Series(values)
series1

0    1
1    2
2    3
dtype: int64

In [38]:
values = [1, 2, 3]

series2 = pd.Series(values, index=['a', 'b', 'c'], name = 'valores_aleatórios_com_index_de_strings_do_tipo_objeto', dtype='object')
series2

a    1
b    2
c    3
Name: valores_aleatórios_com_index_de_strings_do_tipo_objeto, dtype: object

In [13]:
# From a dict

data = {
    'Black Swan': 'MELHOR FILME',
    'Frozen': 9,
    'Titanic': 10
}
# Only the labels listed in index are kept: 'Frozen' is dropped and 'Jojo Rabit' (not a key in data) becomes NaN
movie_series = pd.Series(data, index=['Black Swan', 'Titanic', 'Jojo Rabit'], name='movies', dtype='object')
movie_series

Black Swan    MELHOR FILME
Titanic                 10
Jojo Rabit             NaN
Name: movies, dtype: object

## 1.2 Some methods and attributes
* Check Type
* Check Size
* `.describe()`
* `.values`
* `.index`

In [14]:
# Your code here!
type(movie_series)

pandas.core.series.Series

In [18]:
movie_series.map(type)

Black Swan      <class 'str'>
Titanic         <class 'int'>
Jojo Rabit    <class 'float'>
Name: movies, dtype: object

So, the `type` of `movie_series` is `pandas.core.series.Series`, while the elements inside it have different types: `str`, `int`, and `float` (the missing value `NaN` is a `float`).

In [19]:
# Your code here!
len(movie_series)

3

In [20]:
movie_series.describe()

count                2
unique               2
top       MELHOR FILME
freq                 1
Name: movies, dtype: object

- `describe()` is not very informative for non-numeric values.
- `count`: number of non-null values in the Series
- `unique`: number of distinct values
- `top`: the most frequent value
- `freq`: how many times the most frequent value occurs
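A minimal sketch of how these summary fields relate to the data, assuming a small Series of hypothetical genre labels with one missing value:

```python
import pandas as pd

# Hypothetical genre labels, used only to illustrate the summary fields.
genres = pd.Series(["Drama", "Animation", "Animation", None])

summary = genres.describe()
counts = genres.value_counts()  # occurrences of each value, most frequent first

print(summary)
print(counts)
```

Here `top` and `freq` match the first entry of `value_counts()`, and `count` ignores the missing value.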

In [27]:
data = {
    'Black Swan': 11,
    'Frozen': 9,
    'Titanic': 10,
    'Moana': 8.5,
    'Jojo Rabit': 10
}
movie_series = pd.Series(data, name='movies', dtype='float')
print(movie_series)
movie_series.describe()

Black Swan    11.0
Frozen         9.0
Titanic       10.0
Moana          8.5
Jojo Rabit    10.0
Name: movies, dtype: float64


count     5.000000
mean      9.700000
std       0.974679
min       8.500000
25%       9.000000
50%      10.000000
75%      10.000000
max      11.000000
Name: movies, dtype: float64

In [29]:
movie_series.values

array([11. ,  9. , 10. ,  8.5, 10. ])

In [30]:
type(movie_series.values)

numpy.ndarray

In [31]:
movie_series.index

Index(['Black Swan', 'Frozen', 'Titanic', 'Moana', 'Jojo Rabit'], dtype='object')

In [32]:
type(movie_series.index)

pandas.core.indexes.base.Index

When you see `dtype: object`, it usually means the `Series` (or `Index`) holds Python strings or a mix of types.
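A short sketch of how the dtype depends on the contents, assuming two small illustrative Series: one with only integers and one mixing a string, an integer, and a float.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])
mixed = pd.Series(["best film", 9, 10.0])

print(numbers.dtype)    # an integer dtype
print(mixed.dtype)      # object
print(mixed.map(type))  # the Python type of each element
```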

In [None]:
# Your code here!

## 1.3 Accessing elements 

Elements can be accessed by integer position, like a NumPy array, or by index label.

In [37]:
# Your code here!
movie_series[3]

8.5

In [34]:
movie_series['Black Swan']

11.0

In [35]:
movie_series.loc['Black Swan']

11.0

In [36]:
movie_series.iloc[3]

8.5

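Prefer `.loc` (label-based) or `.iloc` (position-based) over plain `[]` indexing: they state explicitly whether a key is a label or a position. Recent pandas versions deprecate integer-position lookups such as `movie_series[3]` on a label-based index, so `movie_series.iloc[3]` is the safer form.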
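The distinction between `.loc` and `.iloc` is easiest to see when the labels are themselves integers. The sketch below assumes an illustrative Series whose labels (10, 20, 30) do not match the positions (0, 1, 2).

```python
import pandas as pd

# Integer labels that differ from the positions 0, 1, 2 (illustrative values).
s = pd.Series([8.5, 9.0, 10.0], index=[10, 20, 30])

by_label = s.loc[10]           # label 10 is the first element
by_position = s.iloc[0]        # position 0 is also the first element
last_by_position = s.iloc[-1]  # negative positions count from the end

# There is no label 0, so a label-based lookup fails.
try:
    s.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_label, by_position, last_by_position, label_zero_exists)
```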

# 2. The Pandas DataFrame


A pandas DataFrame can be thought of as
* A group of pandas Series that share the same index
* A generalization of **2-D** NumPy arrays (again, with flexible labels for both rows and columns)
* A tabular, spreadsheet-like data structure

In [39]:
pd.DataFrame()

## 2.1 Creating a Pandas DataFrame
* From a .csv file, an Excel file, a database query, JSON, an API...

Today we'll see how to create one from a 1-D array, a 2-D array, and a dictionary.
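For the 2-D array case, a minimal sketch could look like the following. The scores and the reviewer column names are hypothetical and only serve to show how `index` and `columns` label the array.

```python
import numpy as np
import pandas as pd

# Illustrative values: two films (rows) scored by three hypothetical reviewers (columns).
scores = np.array([[10.0, 9.5, 9.0],
                   [9.0, 8.5, 8.0]])

scores_df = pd.DataFrame(
    scores,
    index=["Black Swan", "Frozen"],
    columns=["reviewer_a", "reviewer_b", "reviewer_c"],
)

print(scores_df)
print(scores_df.shape)
```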

In [45]:
# From a Dict with Series
movie_list = ['Black Swan', 'Frozen', 'Moana']
rating_series = pd.Series(data = [10, 9, 8.5], index=movie_list, name='rating')
genre_series = pd.Series(data=['Drama', 'Animação', 'Animação'], index=movie_list, name='series')
print(rating_series)
print('----')
print(genre_series)

Black Swan    10.0
Frozen         9.0
Moana          8.5
Name: rating, dtype: float64
----
Black Swan       Drama
Frozen        Animação
Moana         Animação
Name: series, dtype: object


In [42]:
pd.DataFrame(data = {
    'rating':rating_series,
    'genre': genre_series
})

            rating     genre
Black Swan    10.0     Drama
Frozen         9.0  Animação
Moana          8.5  Animação


In [46]:
pd.concat([rating_series, genre_series], axis=1)

            rating    series
Black Swan    10.0     Drama
Frozen         9.0  Animação
Moana          8.5  Animação


In [None]:
# From a Dict with Dicts
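One possible way to fill in this case, reusing the ratings and genres from the cells above: in a dict of dicts, the outer keys become the columns and the inner keys become the row index. The variable names are illustrative.

```python
import pandas as pd

# Outer keys become columns; inner keys become the row index.
movie_dict = {
    "rating": {"Black Swan": 10, "Frozen": 9, "Moana": 8.5},
    "genre": {"Black Swan": "Drama", "Frozen": "Animação", "Moana": "Animação"},
}

df_from_dicts = pd.DataFrame(movie_dict)
print(df_from_dicts)
```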

In [47]:
# From a Dict with Lists
movie_list = ['Black Swan', 'Frozen', 'Moana']
rating_list = [10, 9, 8.5]
genre_list = ['Drama', 'Animação', 'Animação']

In [50]:
df = pd.DataFrame({'rating': rating_list, 'genre': genre_list}, index = movie_list)

In [51]:
df

            rating     genre
Black Swan    10.0     Drama
Frozen         9.0  Animação
Moana          8.5  Animação


## 2.2 Some methods and attributes
* `describe()`
* `info()`
* `transpose()`

In [53]:
# Your code here!
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Black Swan to Moana
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   rating  3 non-null      float64
 1   genre   3 non-null      object 
dtypes: float64(1), object(1)
memory usage: 72.0+ bytes


In [56]:
df.describe()

         rating
count       3.0
mean   9.166667
std    0.763763
min         8.5
25%        8.75
50%         9.0
75%         9.5
max        10.0


In [57]:
df.describe(include='all')

          rating     genre
count        3.0         3
unique       NaN         2
top          NaN  Animação
freq         NaN         2
mean    9.166667       NaN
std     0.763763       NaN
min          8.5       NaN
25%         8.75       NaN
50%          9.0       NaN
75%          9.5       NaN
max         10.0       NaN


In [59]:
df.describe(percentiles=[0.15, 0.30])

         rating
count       3.0
mean   9.166667
std    0.763763
min         8.5
15%        8.65
30%         8.8
50%         9.0
max        10.0
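The percentile rows reported by `describe()` can be cross-checked with `Series.quantile`. This is a small sketch that rebuilds the rating values from the cells above; the variable names are illustrative.

```python
import pandas as pd

ratings = pd.Series([10.0, 9.0, 8.5], name="rating")

# quantile() interpolates linearly between sorted values by default.
q = ratings.quantile([0.15, 0.30])
print(q)
```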


In [69]:
df.transpose()

       Black Swan    Frozen     Moana
rating       10.0       9.0       8.5
genre       Drama  Animação  Animação
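As a side note on transposing, `df.T` is a shorthand for `df.transpose()`. The sketch below rebuilds the same small DataFrame under the illustrative name `df_demo` and shows that, once numbers and text share a column, the transposed columns end up with `object` dtype.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"rating": [10.0, 9.0, 8.5], "genre": ["Drama", "Animação", "Animação"]},
    index=["Black Swan", "Frozen", "Moana"],
)

transposed = df_demo.T  # same result as df_demo.transpose()

print(transposed)
print(transposed.dtypes)
```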


In [67]:
df.columns

Index(['rating', 'genre'], dtype='object')

In [71]:
df.head(n=1)  # show the first n rows (default n=5)

            rating  genre
Black Swan    10.0  Drama


In [73]:
df.tail(n=1)  # show the last n rows (default n=5)

       rating     genre
Moana     8.5  Animação


## 2.3 Accessing elements 

`.loc` and `.iloc` are the recommended ways to access data in a DataFrame. You can specify both row and column, or only the row.

`dataframe.loc[row_name, col_name]`

In [75]:
# Your code here!
df.loc['Black Swan', 'rating']

10.0

In [77]:
df.loc[:, 'rating']

Black Swan    10.0
Frozen         9.0
Moana          8.5
Name: rating, dtype: float64

In [78]:
df.loc['Black Swan', :]

rating     10.0
genre     Drama
Name: Black Swan, dtype: object

In [79]:
df['rating']

Black Swan    10.0
Frozen         9.0
Moana          8.5
Name: rating, dtype: float64

In [80]:
df['genre']

Black Swan       Drama
Frozen        Animação
Moana         Animação
Name: genre, dtype: object

`dataframe.iloc[row_number, col_number]`

In [81]:
# Your code here!
df.iloc[0, 0]

10.0

In [82]:
df.iloc[0, :]

rating     10.0
genre     Drama
Name: Black Swan, dtype: object

In [83]:
df.iloc[:, 0]

Black Swan    10.0
Frozen         9.0
Moana          8.5
Name: rating, dtype: float64

What is the difference between selecting a column via `dataframe['column']` and via `dataframe.loc[:, 'column']`?

In [None]:
# Your code here!
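One way to explore this question, sketched on a rebuilt copy of the DataFrame (named `df_demo` here for illustration): for plain selection both forms return the same Series, while `.loc` can also restrict the rows in the same call, which is the usual way to assign to a subset.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"rating": [10.0, 9.0, 8.5], "genre": ["Drama", "Animação", "Animação"]},
    index=["Black Swan", "Frozen", "Moana"],
)

# For plain selection, both forms return the same Series.
same = df_demo["rating"].equals(df_demo.loc[:, "rating"])

# .loc selects rows and a column in a single call, so the assignment
# goes straight to the original DataFrame (no chained indexing).
df_demo.loc[df_demo["genre"] == "Drama", "rating"] = 9.5

print(same)
print(df_demo)
```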

Selecting a random sample

In [88]:
# Your code here!
df.sample(n=2, replace=True)

       rating     genre
Moana     8.5  Animação
Moana     8.5  Animação
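Because sampling is random, the rows drawn change from run to run. A minimal sketch of making it reproducible with the `random_state` parameter, using a rebuilt copy of the DataFrame under the illustrative name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"rating": [10.0, 9.0, 8.5], "genre": ["Drama", "Animação", "Animação"]},
    index=["Black Swan", "Frozen", "Moana"],
)

# random_state fixes the seed, so both calls draw the same rows.
sample_a = df_demo.sample(n=2, random_state=0)
sample_b = df_demo.sample(n=2, random_state=0)

# With replace=True rows can repeat, so n may exceed the number of rows.
oversample = df_demo.sample(n=5, replace=True, random_state=0)

print(sample_a)
print(len(oversample))
```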


## 2.4 Filtering

In [89]:
# boolean mask: one True/False value per row
df.rating == 10

Black Swan     True
Frozen        False
Moana         False
Name: rating, dtype: bool

In [90]:
df.genre == 'Drama'

Black Swan     True
Frozen        False
Moana         False
Name: genre, dtype: bool

In [91]:
# boolean indexing: keep only the rows where the mask is True
df[df.genre=='Drama']

Unnamed: 0,rating,genre
Black Swan,10.0,Drama


In [98]:
df.loc[(df.genre == 'Drama') | (df.rating > 9)]

            rating  genre
Black Swan    10.0  Drama
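Boolean masks combine with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses when written inline, because these operators bind more tightly than `==` or `>`. The sketch below names the masks first to keep the filters readable; `df_demo` and the mask names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"rating": [10.0, 9.0, 8.5], "genre": ["Drama", "Animação", "Animação"]},
    index=["Black Swan", "Frozen", "Moana"],
)

is_animation = df_demo["genre"] == "Animação"
high_rating = df_demo["rating"] >= 9

both = df_demo[is_animation & high_rating]    # AND: both conditions hold
either = df_demo[is_animation | high_rating]  # OR: at least one holds
not_animation = df_demo[~is_animation]        # NOT: invert the mask

print(both)
print(either)
print(not_animation)
```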


In [100]:
df.query("index=='Black Swan' or index=='Moana'")

            rating     genre
Black Swan    10.0     Drama
Moana          8.5  Animação
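`query` can also filter on column names and refer to Python variables with the `@` prefix. A small sketch, assuming a rebuilt copy of the DataFrame (`df_demo`) and a hypothetical `threshold` variable:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"rating": [10.0, 9.0, 8.5], "genre": ["Drama", "Animação", "Animação"]},
    index=["Black Swan", "Frozen", "Moana"],
)

threshold = 9  # hypothetical cut-off, referenced inside the query with @

high = df_demo.query("rating >= @threshold")
drama_or_top = df_demo.query("genre == 'Drama' or rating > 9")

print(high)
print(drama_or_top)
```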


In [102]:
# create a copy of the filtered DataFrame
df2 = df[(df.genre == 'Drama') | (df.rating > 9)].copy()
df2

            rating  genre
Black Swan    10.0  Drama
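The point of `.copy()` is that later changes to the filtered result stay separate from the original DataFrame. A minimal sketch under that assumption, with illustrative names `df_demo` and `filtered`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"rating": [10.0, 9.0, 8.5], "genre": ["Drama", "Animação", "Animação"]},
    index=["Black Swan", "Frozen", "Moana"],
)

# Filter, then take an explicit copy before modifying the result.
filtered = df_demo[df_demo["rating"] > 8.5].copy()
filtered["rating"] = filtered["rating"] - 1

print(filtered)
print(df_demo)  # the original ratings are unchanged
```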


In [108]:
type(ord('A'))

int