In [1]:
# Importing pandas
import pandas as pd

Pandas can read many file formats into a DataFrame and write a DataFrame back out to most of them.

Some of them are:

Text files and spreadsheets
- CSV → pd.read_csv() and df.to_csv()
- TSV (tab-separated) → also read_csv(), with sep="\t"
- TXT → same, as long as the file is structured with a consistent separator
- Excel and OpenDocument (.xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt)
    - pd.read_excel() and df.to_excel()
    - Supports workbooks with multiple sheets.
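
As a minimal sketch of the text-file case, the snippet below reads tab-separated text and writes it back as CSV. The in-memory string `tsv_text` is a made-up stand-in for a real file, so the example runs without anything on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for a real .tsv file.
tsv_text = "Nome\tIdade\nAna\t23\nBruno\t35\n"

# read_csv handles tab-separated text once sep is set to "\t".
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With no path argument, to_csv returns the CSV as a string.
# index=False leaves the row index out of the output.
csv_text = df_tsv.to_csv(index=False)
print(csv_text)
```

Passing a file path instead of the `StringIO` object works the same way for files on disk.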

🗄️ Databases

- SQL (MySQL, PostgreSQL, SQLite, etc.)
    - pd.read_sql() and df.to_sql() (requires a connection through SQLAlchemy or a database driver).
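
The round trip through a database can be sketched with an in-memory SQLite database from Python's standard library, which pandas accepts directly. The table name `people` and the sample rows are assumptions made for this illustration; other databases would typically go through a SQLAlchemy engine instead.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

df_people = pd.DataFrame({"Nome": ["Ana", "Bruno"], "Idade": [23, 35]})

# Write the DataFrame to a table named "people" (hypothetical name).
df_people.to_sql("people", conn, index=False)

# Read back only the rows that match a SQL filter.
df_over_30 = pd.read_sql("SELECT Nome, Idade FROM people WHERE Idade > 30", conn)
conn.close()

print(df_over_30)
```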

🌐 Structured text formats

- JSON → pd.read_json() and df.to_json()
- HTML (tables) → pd.read_html()
- XML → pd.read_xml()
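
A small sketch of the JSON case follows. It serializes a DataFrame as a list of records and parses it back. The sample data is invented, and the `orient="records"` layout is only one of several that pandas supports.

```python
import io
import pandas as pd

df_json_src = pd.DataFrame({"Nome": ["Ana", "Bruno"], "Idade": [23, 35]})

# orient="records" produces a JSON array with one object per row.
json_text = df_json_src.to_json(orient="records")
print(json_text)

# Wrap the string in StringIO so read_json treats it as file-like input.
df_json_back = pd.read_json(io.StringIO(json_text), orient="records")
print(df_json_back)
```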

📦 Other formats common in data science

- Parquet → pd.read_parquet() and df.to_parquet()
- Feather → pd.read_feather() and df.to_feather()
- ORC → pd.read_orc() and df.to_orc()
- Stata (.dta) → pd.read_stata() and df.to_stata()
- SPSS (.sav) → pd.read_spss()
- HDF5 (.h5) → pd.read_hdf() and df.to_hdf()
- Pickle (.pkl) → pd.read_pickle() and df.to_pickle()
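
Several of these formats need an extra package installed (for example, an engine such as pyarrow for Parquet). Pickle needs none, so it is used here for a quick sketch of a binary round trip. The file name `people.pkl` and the sample data are assumptions for illustration. Pickle files should only be loaded from sources you trust.

```python
import os
import tempfile
import pandas as pd

df_src = pd.DataFrame({"Nome": ["Ana", "Bruno"], "Idade": [23, 35]})

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "people.pkl")  # hypothetical file name
    df_src.to_pickle(pkl_path)
    df_loaded = pd.read_pickle(pkl_path)

# A binary format keeps values and column types intact across the round trip.
print(df_loaded.equals(df_src))
```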

**In short:** if the format is tabular or semi-structured, Pandas can probably handle it.

💡 In day-to-day work:
- CSV is the most common format (data exchange between systems).
- Excel is ubiquitous in companies (reports, financial spreadsheets, etc.).
- SQL is essential (the data almost always "lives" in a database).
- Parquet is becoming the standard in Big Data environments (because it is compact and fast to read).


### Example 1
Let's create a DataFrame from Python's own data structures:

In [3]:
# Example 1

# Create a DataFrame from a dict of lists (each key becomes a column name)
dados = {
    "Nome": ["Ana", "Bruno", "Carla"],
    "Idade": [23, 35, 29],
    "Cidade": ["SP", "RJ", "BH"]
}

df1 = pd.DataFrame(dados)

In [4]:
df1

,Nome,Idade,Cidade
0,Ana,23,SP
1,Bruno,35,RJ
2,Carla,29,BH
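
A dict of lists is not the only built-in structure that works. As a sketch of an alternative, the same table can be built from a list of dicts, one dict per row. The variable names `records` and `df_records` are introduced here for illustration.

```python
import pandas as pd

# One dict per row; the keys become the column names.
records = [
    {"Nome": "Ana", "Idade": 23, "Cidade": "SP"},
    {"Nome": "Bruno", "Idade": 35, "Cidade": "RJ"},
    {"Nome": "Carla", "Idade": 29, "Cidade": "BH"},
]

df_records = pd.DataFrame(records)

print(df_records.shape)   # (rows, columns)
print(df_records.dtypes)  # type inferred for each column
```

The dict-of-lists form is convenient when data arrives column by column. The list-of-dicts form fits data that arrives row by row, such as parsed JSON records.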


### Example 2

Let's read a dataset in **CSV** format.

In [15]:
# Reading a dataset in .csv format

arquivo_caminho = 'kc_house_data.csv'

dataset = pd.read_csv(arquivo_caminho, sep=',', header=0)

- The **sep** parameter sets the separator between values;
- The **header** parameter tells pandas which row holds the column names (or that there is none);
    - If the file has no header row (_header=None_), pandas labels the columns with integers starting at 0.
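
The effect of `header=None` can be sketched with a tiny headerless CSV held in memory. The string `raw` and the names passed to `names` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row.
raw = "Ana,23,SP\nBruno,35,RJ\n"

# header=None: the first line is treated as data and columns are labeled 0, 1, 2.
df_no_header = pd.read_csv(io.StringIO(raw), sep=",", header=None)
print(list(df_no_header.columns))

# names= supplies the column labels when the file does not carry them.
df_named = pd.read_csv(
    io.StringIO(raw), sep=",", header=None, names=["Nome", "Idade", "Cidade"]
)
print(df_named)
```

Without `header=None`, pandas would use the first data line ("Ana,23,SP") as column names and silently drop it from the data.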

In [10]:
# Let's look at the first five rows of the DataFrame:
dataset.head()

,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770.0,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050.0,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503


___

When importing a file into a DataFrame (CSV or another format), we can also choose which column becomes the index (the row labels):

In [11]:
dataset = pd.read_csv(arquivo_caminho, sep=',', index_col='date')  # 'date' is the name of the column to use as the index

In [12]:
dataset.head()

date,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
20141013T000000,7129300520,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650
20141209T000000,6414100192,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639
20150225T000000,5631500400,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770.0,0,1933,0,98028,47.7379,-122.233,2720,8062
20141209T000000,2487200875,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050.0,910,1965,0,98136,47.5208,-122.393,1360,5000
20150218T000000,1954400510,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503
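
Once a column is the index, rows can be looked up by its values with `.loc`. The sketch below uses three rows copied from the output above, held in an in-memory string. It also shows that index values do not have to be unique, since two of the rows share the same date.

```python
import io
import pandas as pd

# Three rows taken from the dataset preview, trimmed to three columns.
csv_text = (
    "id,date,price\n"
    "7129300520,20141013T000000,221900.0\n"
    "6414100192,20141209T000000,538000.0\n"
    "2487200875,20141209T000000,604000.0\n"
)

df_idx = pd.read_csv(io.StringIO(csv_text), index_col="date")

# Label-based lookup: returns every row whose index equals this value.
same_day = df_idx.loc["20141209T000000"]
print(same_day)

# reset_index turns the index back into an ordinary column.
df_flat = df_idx.reset_index()
print(list(df_flat.columns))
```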


___

We can also specify which columns to load into the DataFrame:

In [13]:
dataset = pd.read_csv(arquivo_caminho, sep=',', usecols=['id', 'date', 'price', 'bedrooms'])

In [14]:
dataset.head()

,id,date,price,bedrooms
0,7129300520,20141013T000000,221900.0,3
1,6414100192,20141209T000000,538000.0,3
2,5631500400,20150225T000000,180000.0,2
3,2487200875,20141209T000000,604000.0,4
4,1954400510,20150218T000000,510000.0,3
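
As a small sketch, `usecols` accepts either column names or integer positions. The in-memory CSV below reuses two rows from the dataset preview, trimmed to five columns for brevity.

```python
import io
import pandas as pd

csv_text = (
    "id,date,price,bedrooms,bathrooms\n"
    "7129300520,20141013T000000,221900.0,3,1.0\n"
    "6414100192,20141209T000000,538000.0,3,2.25\n"
)

# Select columns by name...
df_by_name = pd.read_csv(io.StringIO(csv_text), usecols=["id", "price"])

# ...or by zero-based position in the file (0 -> id, 2 -> price).
df_by_pos = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])

print(df_by_name)
print(df_by_name.equals(df_by_pos))
```

Skipping unneeded columns at read time can reduce memory use on wide files, because the discarded columns are never loaded into the DataFrame.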


___

### The describe method

describe() displays summary statistics for the dataset.

For each numeric column it reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.

In [16]:
dataset.describe()

,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21611.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,4580302000.0,540088.1,3.370842,2.114757,2079.899736,15106.97,1.494309,0.007542,0.234303,3.40943,7.656873,1788.396095,291.509045,1971.005136,84.402258,98077.939805,47.560053,-122.213896,1986.552492,12768.455652
std,2876566000.0,367127.2,0.930062,0.770163,918.440897,41420.51,0.539989,0.086517,0.766318,0.650743,1.175459,828.128162,442.575043,29.373411,401.67924,53.505026,0.138564,0.140828,685.391304,27304.179631
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,321950.0,3.0,1.75,1427.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.471,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.5718,-122.23,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10688.0,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0
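
Note that `date` does not appear in the output above: by default, `describe()` on a DataFrame with mixed types summarizes only the numeric columns. The sketch below illustrates this on a four-row frame built from values in the dataset preview. The name `df_small` is introduced here for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000",
             "20150225T000000", "20141209T000000"],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
})

# Default: only numeric columns are summarized.
stats = df_small.describe()
print(stats)

# include="all" adds non-numeric columns (with count, unique, top, freq).
stats_all = df_small.describe(include="all")
print(stats_all)

# Individual statistics can be read with .loc[row_label, column_name].
print(stats.loc["max", "price"])
```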


___

### Non-null value count per column

In [17]:
dataset.count()

id               21613
date             21613
price            21613
bedrooms         21613
bathrooms        21613
sqft_living      21613
sqft_lot         21613
floors           21613
waterfront       21613
view             21613
condition        21613
grade            21613
sqft_above       21611
sqft_basement    21613
yr_built         21613
yr_renovated     21613
zipcode          21613
lat              21613
long             21613
sqft_living15    21613
sqft_lot15       21613
dtype: int64
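
`count()` counts only non-null values, which is why `sqft_above` shows 21611 while every other column shows 21613: that column appears to have two missing entries. The following sketch reproduces the effect on a small made-up frame and shows how `isna().sum()` gives the complementary count.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value in sqft_above.
df_gap = pd.DataFrame({
    "sqft_living": [1180, 2570, 770],
    "sqft_above": [1180.0, np.nan, 770.0],
})

non_null = df_gap.count()       # non-null values per column
missing = df_gap.isna().sum()   # missing values per column
n_rows = len(df_gap)            # total number of rows

print(non_null)
print(missing)
print(n_rows)
```

For every column, the non-null count plus the missing count equals the number of rows.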

___

### View a random sample of the DataFrame:

In [18]:
dataset.sample(5)

,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
16042,8078400020,20150223T000000,485000.0,3,2.25,1570,8111,2.0,0,0,...,8,1570.0,0,1984,0,98074,47.6324,-122.028,1990,7875
20818,9279700013,20140710T000000,1250000.0,3,3.0,3460,5353,2.0,0,0,...,10,2850.0,610,2007,0,98116,47.5858,-122.393,2460,6325
9447,3275330120,20140814T000000,309900.0,3,2.5,2020,26670,2.0,0,0,...,7,2020.0,0,1987,0,98003,47.2597,-122.31,1680,10939
341,1115300070,20141106T000000,684000.0,4,3.5,3040,8414,2.0,0,0,...,9,2420.0,620,2010,0,98059,47.5222,-122.157,3470,8066
14281,524069011,20140911T000000,622500.0,3,2.5,2290,14374,2.0,0,0,...,8,2290.0,0,1983,2012,98075,47.5886,-122.074,2290,33450
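
Each call to `sample()` normally returns different rows. As a sketch of how to make the draw repeatable, the example below fixes `random_state` and also samples by fraction. The frame `df_nums` and the seed values are arbitrary choices for illustration.

```python
import pandas as pd

df_nums = pd.DataFrame({"valor": range(100)})

# The same random_state gives the same rows on every run.
sample_a = df_nums.sample(5, random_state=42)
sample_b = df_nums.sample(5, random_state=42)
print(sample_a.equals(sample_b))

# frac= samples a proportion of the rows instead of a fixed number.
ten_percent = df_nums.sample(frac=0.1, random_state=0)
print(len(ten_percent))
```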


___

In [19]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21611 non-null  float64
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  