# 04.02 - Reading Data

**Author:** Miguel Angel Vazquez Varela  
**Level:** Fundamentals  
**Estimated time:** 25 min

---

## What will we learn?

- Reading CSV files
- Reading Excel files
- Reading JSON
- Useful read parameters
- Saving data

In [1]:
import pandas as pd
import numpy as np

---

## 1. Create sample data

First, we create sample files to practice reading.

In [2]:
# Create trip data
trips_data = pd.DataFrame({
    "trip_id": range(1, 11),
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "duration_min": [12, 25, 8, 45, 15, 30, 18, 22, 35, 10],
    "distance_km": [2.5, 5.0, 1.8, 8.2, 3.1, 6.0, 3.5, 4.2, 7.0, 2.0],
    "station_start": ["Sol", "Atocha", "Sol", "Retiro", "Cibeles", 
                      "Sol", "Atocha", "Retiro", "Cibeles", "Sol"],
    "user_type": ["subscriber", "casual", "subscriber", "subscriber", "casual",
                  "subscriber", "subscriber", "casual", "subscriber", "casual"]
})

trips_data

Unnamed: 0,trip_id,date,duration_min,distance_km,station_start,user_type
0,1,2024-01-01,12,2.5,Sol,subscriber
1,2,2024-01-02,25,5.0,Atocha,casual
2,3,2024-01-03,8,1.8,Sol,subscriber
3,4,2024-01-04,45,8.2,Retiro,subscriber
4,5,2024-01-05,15,3.1,Cibeles,casual
5,6,2024-01-06,30,6.0,Sol,subscriber
6,7,2024-01-07,18,3.5,Atocha,subscriber
7,8,2024-01-08,22,4.2,Retiro,casual
8,9,2024-01-09,35,7.0,Cibeles,subscriber
9,10,2024-01-10,10,2.0,Sol,casual


In [3]:
# Save as CSV
trips_data.to_csv("trips_sample.csv", index=False)
print("CSV file created!")

CSV file created!
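
The `index=False` argument used above matters when the file is read back. The following is a minimal sketch, assuming a small illustrative DataFrame and in-memory text instead of files on disk, that compares the columns recovered with and without the index written out.

```python
import io

import pandas as pd

# Small illustrative DataFrame (not the notebook's trips_data)
df = pd.DataFrame({"trip_id": [1, 2, 3], "duration_min": [12, 25, 8]})

# With no path, to_csv() returns the CSV text as a string
with_index = df.to_csv()                # the index becomes an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

cols_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with_index)
print(cols_without_index)
```

When the index is written, `read_csv()` has no header name for it and labels the column `Unnamed: 0`, which is usually not wanted for a default `RangeIndex`.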


In [4]:
# Save as JSON
trips_data.to_json("trips_sample.json", orient="records", indent=2)
print("JSON file created!")

JSON file created!



---

## 2. Read CSV

### Basic read

In [5]:
df = pd.read_csv("trips_sample.csv")
df.head()

Unnamed: 0,trip_id,date,duration_min,distance_km,station_start,user_type
0,1,2024-01-01,12,2.5,Sol,subscriber
1,2,2024-01-02,25,5.0,Atocha,casual
2,3,2024-01-03,8,1.8,Sol,subscriber
3,4,2024-01-04,45,8.2,Retiro,subscriber
4,5,2024-01-05,15,3.1,Cibeles,casual


In [6]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   trip_id        10 non-null     int64  
 1   date           10 non-null     str    
 2   duration_min   10 non-null     int64  
 3   distance_km    10 non-null     float64
 4   station_start  10 non-null     str    
 5   user_type      10 non-null     str    
dtypes: float64(1), int64(2), str(3)
memory usage: 612.0 bytes


**Problem:** The `date` column was read as text (`str` in the output above; older pandas versions report `object`), not as a date.

### Parse dates while reading

In [7]:
df = pd.read_csv("trips_sample.csv", parse_dates=["date"])
df.dtypes

trip_id                   int64
date             datetime64[us]
duration_min              int64
distance_km             float64
station_start               str
user_type                   str
dtype: object

Now `date` has a `datetime64` dtype.
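
To illustrate why the dtype matters, here is a minimal sketch, assuming a tiny in-memory CSV rather than the notebook's file. It reads the same text with and without `parse_dates`, then uses the `.dt` accessor, which is only available on datetime columns. It also shows `pd.to_datetime()` as a way to convert after reading.

```python
import io

import pandas as pd

# Tiny illustrative CSV held in memory (an assumption for this sketch)
csv_text = "trip_id,date\n1,2024-01-01\n2,2024-01-06\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Date-specific accessors work on the parsed column
weekdays = as_dates["date"].dt.day_name().tolist()
print(weekdays)

# Converting after the read is also possible
converted = pd.to_datetime(as_text["date"])
print(converted.dtype)
```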

### Other useful parameters

In [8]:
# Read only certain columns
df_subset = pd.read_csv(
    "trips_sample.csv",
    usecols=["trip_id", "duration_min", "distance_km"]
)
df_subset.head()

Unnamed: 0,trip_id,duration_min,distance_km
0,1,12,2.5
1,2,25,5.0
2,3,8,1.8
3,4,45,8.2
4,5,15,3.1


In [9]:
# Read only the first N rows
df_small = pd.read_csv("trips_sample.csv", nrows=5)
print(f"Rows read: {len(df_small)}")

Rows read: 5


In [10]:
# Specify data types
df_typed = pd.read_csv(
    "trips_sample.csv",
    dtype={
        "user_type": "category",
        "station_start": "category"
    },
    parse_dates=["date"]
)
df_typed.dtypes

trip_id                   int64
date             datetime64[us]
duration_min              int64
distance_km             float64
station_start          category
user_type              category
dtype: object

**Tip:** Using `category` for columns with few unique values saves memory.
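
The saving can be measured with `memory_usage(deep=True)`. This is a minimal sketch, assuming a synthetic column of 10,000 station names with only four distinct values; the exact byte counts depend on the pandas version and string backend, so none are shown.

```python
import pandas as pd

# Synthetic column: 4 distinct values repeated to 10,000 rows (illustrative data)
stations = ["Sol", "Atocha", "Retiro", "Cibeles"]
as_text = pd.Series(stations * 2_500, name="station_start")
as_category = as_text.astype("category")

# deep=True counts the memory used by the string values themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"text: {text_bytes} bytes, category: {category_bytes} bytes")
```

The benefit shrinks as the number of distinct values grows. For a column where almost every value is unique, `category` can take more memory than plain text.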

---

## 3. CSV with different formats

### Different separator (semicolon)

In [11]:
# Create a CSV with ";" as the separator
trips_data.to_csv("trips_semicolon.csv", index=False, sep=";")

# Read it back
df_semi = pd.read_csv("trips_semicolon.csv", sep=";")
df_semi.head(3)

Unnamed: 0,trip_id,date,duration_min,distance_km,station_start,user_type
0,1,2024-01-01,12,2.5,Sol,subscriber
1,2,2024-01-02,25,5.0,Atocha,casual
2,3,2024-01-03,8,1.8,Sol,subscriber
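
If `sep` is left out when reading a semicolon-separated file, pandas raises no error; it simply puts each whole line into a single column. Here is a minimal sketch of that failure mode, assuming a small in-memory sample.

```python
import io

import pandas as pd

# Illustrative semicolon-separated text (an assumption for this sketch)
semicolon_text = "trip_id;duration_min;station_start\n1;12;Sol\n2;25;Atocha\n"

wrong = pd.read_csv(io.StringIO(semicolon_text))           # default sep=","
right = pd.read_csv(io.StringIO(semicolon_text), sep=";")  # matching separator

print(wrong.shape)
print(right.shape)
```

Checking `df.shape` or `df.columns` right after reading is a quick way to catch a wrong separator.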


### Decimal comma (European format)

In [12]:
# Create a CSV that uses a comma as the decimal mark
trips_data.to_csv("trips_european.csv", index=False, sep=";", decimal=",")

# Read it correctly
df_euro = pd.read_csv("trips_european.csv", sep=";", decimal=",")
df_euro[["distance_km"]].head()

Unnamed: 0,distance_km
0,2.5
1,5.0
2,1.8
3,8.2
4,3.1


---

## 4. Read JSON

In [13]:
df_json = pd.read_json("trips_sample.json")
df_json.head()

Unnamed: 0,trip_id,date,duration_min,distance_km,station_start,user_type
0,1,2024-01-01,12,2.5,Sol,subscriber
1,2,2024-01-02,25,5.0,Atocha,casual
2,3,2024-01-03,8,1.8,Sol,subscriber
3,4,2024-01-04,45,8.2,Retiro,subscriber
4,5,2024-01-05,15,3.1,Cibeles,casual


### Different JSON orientations

In [14]:
# orient='records' - list of dictionaries (most common)
print(trips_data.head(3).to_json(orient="records", indent=2))

[
  {
    "trip_id":1,
    "date":1704067200000,
    "duration_min":12,
    "distance_km":2.5,
    "station_start":"Sol",
    "user_type":"subscriber"
  },
  {
    "trip_id":2,
    "date":1704153600000,
    "duration_min":25,
    "distance_km":5.0,
    "station_start":"Atocha",
    "user_type":"casual"
  },
  {
    "trip_id":3,
    "date":1704240000000,
    "duration_min":8,
    "distance_km":1.8,
    "station_start":"Sol",
    "user_type":"subscriber"
  }
]



In [15]:
# orient='columns' - dictionary of columns
print(trips_data.head(3).to_json(orient="columns", indent=2))

{
  "trip_id":{
    "0":1,
    "1":2,
    "2":3
  },
  "date":{
    "0":1704067200000,
    "1":1704153600000,
    "2":1704240000000
  },
  "duration_min":{
    "0":12,
    "1":25,
    "2":8
  },
  "distance_km":{
    "0":2.5,
    "1":5.0,
    "2":1.8
  },
  "station_start":{
    "0":"Sol",
    "1":"Atocha",
    "2":"Sol"
  },
  "user_type":{
    "0":"subscriber",
    "1":"casual",
    "2":"subscriber"
  }
}
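
In the outputs above, dates are written as epoch milliseconds (for example `1704067200000`), which are hard to read by eye. The following minimal sketch, assuming a two-row illustrative DataFrame, uses the `date_format="iso"` parameter of `to_json()` to write ISO 8601 strings instead, then reads the result back.

```python
import io

import pandas as pd

# Two-row illustrative DataFrame (not the notebook's trips_data)
trips = pd.DataFrame({
    "trip_id": [1, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Write dates as ISO 8601 strings instead of epoch milliseconds
iso_json = trips.to_json(orient="records", date_format="iso")
print(iso_json)

# read_json() converts columns with date-like names (such as "date") by default
restored = pd.read_json(io.StringIO(iso_json), orient="records")
print(restored.dtypes)
```

Wrapping the JSON text in `io.StringIO` keeps the sketch file-free; with a real file, the path can be passed directly as in the cells above.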



---

## 5. Read Excel

Requires `openpyxl`: `pip install openpyxl`

**Note:** If openpyxl is not installed, these cells print an informational message and skip the Excel steps.

In [16]:
# Check whether openpyxl is available
try:
    import openpyxl
    EXCEL_AVAILABLE = True
    print(f"openpyxl version: {openpyxl.__version__}")
except ImportError:
    EXCEL_AVAILABLE = False
    print("openpyxl is not installed. Install it with: pip install openpyxl")
    print("The Excel cells will be skipped.")

openpyxl is not installed. Install it with: pip install openpyxl
The Excel cells will be skipped.


In [17]:
# Save and read Excel (if openpyxl is available)
if EXCEL_AVAILABLE:
    trips_data.to_excel("trips_sample.xlsx", index=False, sheet_name="trips")
    print("Excel file created!")

    df_excel = pd.read_excel("trips_sample.xlsx", sheet_name="trips")
    display(df_excel.head())
else:
    print("Skipping: openpyxl not available")

Skipping: openpyxl not available


### Excel with multiple sheets

In [18]:
if EXCEL_AVAILABLE:
    # Create the stations table
    stations = pd.DataFrame({
        "name": ["Sol", "Atocha", "Cibeles", "Retiro"],
        "capacity": [30, 25, 35, 20],
        "zone": ["centro", "centro", "centro", "parque"]
    })

    # Save multiple sheets to one workbook
    with pd.ExcelWriter("bike_data.xlsx") as writer:
        trips_data.to_excel(writer, sheet_name="trips", index=False)
        stations.to_excel(writer, sheet_name="stations", index=False)

    print("Excel file with 2 sheets created!")

    # Read a specific sheet
    stations_df = pd.read_excel("bike_data.xlsx", sheet_name="stations")
    display(stations_df)

    # Read all sheets (returns a dict of DataFrames keyed by sheet name)
    all_sheets = pd.read_excel("bike_data.xlsx", sheet_name=None)
    print(f"Available sheets: {list(all_sheets.keys())}")
else:
    print("Skipping: openpyxl not available")

Skipping: openpyxl not available


---

## 6. Read from a URL

In [19]:
# Example with a public dataset (iris)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

iris = pd.read_csv(url)
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


---

## 7. Save data

In [20]:
# Modify the data: add average speed in km/h
df = pd.read_csv("trips_sample.csv", parse_dates=["date"])
df["speed_kmh"] = df["distance_km"] / (df["duration_min"] / 60)

df.head()

Unnamed: 0,trip_id,date,duration_min,distance_km,station_start,user_type,speed_kmh
0,1,2024-01-01,12,2.5,Sol,subscriber,12.5
1,2,2024-01-02,25,5.0,Atocha,casual,12.0
2,3,2024-01-03,8,1.8,Sol,subscriber,13.5
3,4,2024-01-04,45,8.2,Retiro,subscriber,10.933333
4,5,2024-01-05,15,3.1,Cibeles,casual,12.4


In [21]:
# Save as CSV (without the index)
df.to_csv("trips_processed.csv", index=False)
print("CSV saved!")

CSV saved!


In [22]:
# Save with compression
df.to_csv("trips_compressed.csv.gz", index=False, compression="gzip")
print("Compressed file created!")

Compressed file created!
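
A compressed CSV can be read back without decompressing it by hand, because `read_csv()` infers the compression from the `.gz` extension by default. This is a minimal round-trip sketch, assuming a small illustrative DataFrame and a temporary directory so that no files are left behind.

```python
import os
import tempfile

import pandas as pd

# Small illustrative DataFrame (not the notebook's df)
df = pd.DataFrame({"trip_id": [1, 2, 3], "distance_km": [2.5, 5.0, 1.8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trips_compressed.csv.gz")
    df.to_csv(path, index=False)  # gzip inferred from the ".gz" extension
    restored = pd.read_csv(path)  # same inference when reading

print(restored)
```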


---

## 8. Clean up sample files

In [23]:
import os

files_to_remove = [
    "trips_sample.csv", "trips_sample.json", "trips_sample.xlsx",
    "trips_semicolon.csv", "trips_european.csv", "bike_data.xlsx",
    "trips_processed.csv", "trips_compressed.csv.gz"
]

for f in files_to_remove:
    if os.path.exists(f):
        os.remove(f)
        print(f"Removed: {f}")

Removed: trips_sample.csv
Removed: trips_sample.json
Removed: trips_semicolon.csv
Removed: trips_european.csv
Removed: trips_processed.csv
Removed: trips_compressed.csv.gz


---

## Summary

| Format | Read | Write |
|---------|------|----------|
| CSV | `pd.read_csv()` | `df.to_csv()` |
| Excel | `pd.read_excel()` | `df.to_excel()` |
| JSON | `pd.read_json()` | `df.to_json()` |

**Key `read_csv()` parameters:**
- `sep`: separator (`,`, `;`, `\t`)
- `decimal`: decimal character (`.`, `,`)
- `parse_dates`: columns to parse as dates
- `usecols`: columns to read
- `nrows`: number of rows to read
- `dtype`: data types per column
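
The parameters listed above can be combined in a single call. This is a minimal sketch, assuming an in-memory European-style sample with an extra `notes` column added only for illustration.

```python
import io

import pandas as pd

# Illustrative text: ";" separator, "," decimals, plus an unwanted "notes" column
raw_text = (
    "trip_id;date;distance_km;user_type;notes\n"
    "1;2024-01-01;2,5;subscriber;ok\n"
    "2;2024-01-02;5,0;casual;ok\n"
    "3;2024-01-03;1,8;subscriber;ok\n"
)

df = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",                          # field separator
    decimal=",",                      # decimal mark
    parse_dates=["date"],             # parse as datetime
    usecols=["trip_id", "date", "distance_km", "user_type"],  # drop "notes"
    nrows=2,                          # read only the first 2 data rows
    dtype={"user_type": "category"},  # compact type for repeated labels
)

print(df.dtypes)
```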

---

**Previous:** [04.01 - Series and DataFrames](04_01_series_dataframe.ipynb)  
**Next:** [04.03 - Selection and Filtering](04_03_selection_filtering.ipynb)