# Reading from Excel and extracting web tables

This activity explores importing data from different sources with pandas, specifically Excel files and HTML tables.

# Procedure

This activity uses the CSV file of the classic Titanic dataset, converted to Excel (available [here](https://www.kaggle.com/datasets/brendan45774/test-file?resource=download)), and the web tables on the Wikipedia page [Standard normal table](https://en.wikipedia.org/wiki/Standard_normal_table).

We start by importing the pandas library and Python's os module.

In [1]:
import os
import pandas as pd

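Next, we import the Titanic Excel file. Because the workbook has only one sheet, there is no need to specify which sheet holds the data: `pd.read_excel` reads the first sheet by default. The `dtype` mapping assigns a compact type to each column, and a converter turns `Survived` into a boolean.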

In [2]:
directorio_titanic = os.path.join(".", "Datos", "titanic.xlsx")
dtype = {
    "PassengerId": "UInt16",
    "Pclass": "UInt8",
    "Name": "string",
    "Sex": "category",
    "SibSp": "UInt8",
    "Parch": "UInt8",
    "Age": "Float32",
    "Ticket": "string",
    "Fare": "Float32",
    "Cabin": "category",
    "Embarked": "category"
}
titanic = pd.read_excel(
    directorio_titanic,
    dtype=dtype,
    # Survived is stored as a number; any value above 0 becomes True.
    converters={"Survived": lambda x: x > 0},
)
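
The type mapping can also be applied after loading. The following is a minimal sketch, not the notebook's method: it assumes the raw `Survived` column holds 0/1 integers, rebuilds by hand a few rows and columns shown in the `titanic.head()` output, and uses `astype` plus a comparison in place of the `dtype` and `converters` arguments. The names `sample`, `sample_dtype`, and `typed` are illustrative.

```python
import pandas as pd

# A few rows and columns from titanic.head(), typed in by hand.
# Assumption: Survived is stored as 0/1 before conversion.
sample = pd.DataFrame(
    {
        "PassengerId": [892, 893, 894],
        "Survived": [0, 1, 0],
        "Pclass": [3, 3, 2],
        "Sex": ["male", "female", "male"],
        "Age": [34.5, 47.0, 62.0],
    }
)

# Subset of the dtype mapping used in the read_excel call.
sample_dtype = {
    "PassengerId": "UInt16",
    "Pclass": "UInt8",
    "Sex": "category",
    "Age": "Float32",
}

# astype accepts a column-to-dtype mapping, like the dtype argument.
typed = sample.astype(sample_dtype)

# Same rule as the converter: any value above 0 becomes True.
typed["Survived"] = typed["Survived"] > 0

print(typed.dtypes)
```

Converting at read time, as the notebook does, avoids holding a wider intermediate copy of the data; converting afterwards can be convenient when the column types are only known after a first look at the file.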

Next, we explore the data in this DataFrame.

In [3]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | False | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | True | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | False | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | False | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | True | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  418 non-null    UInt16  
 1   Survived     418 non-null    bool    
 2   Pclass       418 non-null    UInt8   
 3   Name         418 non-null    string  
 4   Sex          418 non-null    category
 5   Age          332 non-null    Float32 
 6   SibSp        418 non-null    UInt8   
 7   Parch        418 non-null    UInt8   
 8   Ticket       418 non-null    string  
 9   Fare         417 non-null    Float32 
 10  Cabin        91 non-null     category
 11  Embarked     418 non-null    category
dtypes: Float32(2), UInt16(1), UInt8(3), bool(1), category(3), string(2)
memory usage: 18.9 KB
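
The non-null counts above imply missing values: `Age` has 332 non-null entries out of 418, so 86 ages are missing. A minimal sketch of counting missing values directly follows. It rebuilds three columns from the five rows shown by `titanic.head()`, where `Cabin` is empty in every row; the names `head_sample`, `missing`, and `missing_share` are illustrative.

```python
import pandas as pd

# Three columns from the five rows shown by titanic.head();
# Cabin is empty in all of them.
head_sample = pd.DataFrame(
    {
        "Age": [34.5, 47.0, 62.0, 27.0, 22.0],
        "Fare": [7.8292, 7.0, 9.6875, 8.6625, 12.2875],
        "Cabin": [None] * 5,
    }
).astype({"Age": "Float32", "Fare": "Float32", "Cabin": "category"})

# Number of missing values per column.
missing = head_sample.isna().sum()

# Fraction of missing values per column.
missing_share = head_sample.isna().mean()

print(missing)
print(missing_share)
```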


In [5]:
titanic.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 |
| mean | 1100.5 | 2.26555 | 30.272591 | 0.447368 | 0.392344 | 35.627186 |
| std | 120.810458 | 0.841838 | 14.18121 | 0.89676 | 0.981429 | 55.907581 |
| min | 892.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 1.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 1100.5 | 3.0 | 27.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1204.75 | 3.0 | 39.0 | 1.0 | 0.0 | 31.5 |
| max | 1309.0 | 3.0 | 76.0 | 8.0 | 9.0 | 512.329224 |


Now we import the web tables with pandas. `pd.read_html` returns a list of DataFrames, one per table found on the page. We only want the first table, so we take the first element of that list.

In [6]:
url = "https://en.wikipedia.org/wiki/Standard_normal_table"
p_acumulada = pd.read_html(url)[0]

Next, we inspect the DataFrame.

In [7]:
p_acumulada.head()

| | z | −0.00 | −0.01 | −0.02 | −0.03 | −0.04 | −0.05 | −0.06 | −0.07 | −0.08 | −0.09 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.9 | 5e-05 | 5e-05 | 4e-05 | 4e-05 | 4e-05 | 4e-05 | 4e-05 | 4e-05 | 3e-05 | 3e-05 |
| 1 | -3.8 | 7e-05 | 7e-05 | 7e-05 | 6e-05 | 6e-05 | 6e-05 | 6e-05 | 5e-05 | 5e-05 | 5e-05 |
| 2 | -3.7 | 0.00011 | 0.0001 | 0.0001 | 0.0001 | 9e-05 | 9e-05 | 8e-05 | 8e-05 | 8e-05 | 8e-05 |
| 3 | -3.6 | 0.00016 | 0.00015 | 0.00015 | 0.00014 | 0.00014 | 0.00013 | 0.00013 | 0.00012 | 0.00012 | 0.00011 |
| 4 | -3.5 | 0.00023 | 0.00022 | 0.00022 | 0.00021 | 0.0002 | 0.00019 | 0.00019 | 0.00018 | 0.00017 | 0.00017 |


In [8]:
p_acumulada.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   z       41 non-null     object
 1   −0.00   41 non-null     object
 2   −0.01   41 non-null     object
 3   −0.02   41 non-null     object
 4   −0.03   41 non-null     object
 5   −0.04   41 non-null     object
 6   −0.05   41 non-null     object
 7   −0.06   41 non-null     object
 8   −0.07   41 non-null     object
 9   −0.08   41 non-null     object
 10  −0.09   41 non-null     object
dtypes: object(11)
memory usage: 3.9+ KB
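
The summary above shows that every column has dtype `object` and that 3 of the 44 rows are empty in each column, so the table is not yet usable for arithmetic. The notebook does not show why the columns are not numeric, so the following is only a sketch under assumptions: it supposes the cells are strings that may contain the Unicode minus sign (U+2212, which also appears in the column headers) and that the empty rows are empty across all columns. The frame `raw` and the helper `to_float_table` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: string cells, headers with
# the Unicode minus sign (U+2212), and one fully empty row.
raw = pd.DataFrame(
    {
        "z": ["\u22123.9", "\u22123.8", None],
        "\u22120.00": ["0.00005", "0.00007", None],
        "\u22120.01": ["0.00005", "0.00007", None],
    }
)


def to_float_table(df):
    """Return a copy with ASCII-minus headers, float columns, no empty rows."""
    # Replace the Unicode minus in the column names.
    cleaned = df.rename(columns=lambda c: str(c).replace("\u2212", "-"))
    # Replace it in string cells; leave non-string cells untouched.
    cleaned = cleaned.apply(
        lambda col: col.map(
            lambda v: v.replace("\u2212", "-") if isinstance(v, str) else v
        )
    )
    # Convert to numbers; anything that cannot be parsed becomes NaN.
    cleaned = cleaned.apply(pd.to_numeric, errors="coerce")
    # Drop rows that are empty in every column and renumber the index.
    return cleaned.dropna(how="all").reset_index(drop=True)


clean = to_float_table(raw)
print(clean.dtypes)
```

Because `errors="coerce"` silently turns unparseable cells into `NaN`, it is worth comparing the non-null counts before and after the conversion.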


Finally, we export each DataFrame to its own Excel file.

In [9]:
directorio = os.path.join(".", "Datos")
titanic.to_excel(os.path.join(directorio, "nuevo_titanic.xlsx"))
p_acumulada.to_excel(os.path.join(directorio, "probabilidad_acumulada.xlsx"))
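
By default, `to_excel` also writes the DataFrame index as an extra first column, which comes back as `Unnamed: 0` when the file is read again; passing `index=False` leaves it out. The sketch below illustrates the effect with `to_csv` and `read_csv` on an in-memory buffer. That substitution is an assumption made only so the example runs without an Excel engine; `to_excel` accepts the same `index` parameter. The frame `table` and the helper `round_trip_columns` are hypothetical.

```python
import io

import pandas as pd

# Two values taken from the first rows of the table shown earlier.
table = pd.DataFrame({"z": [-3.9, -3.8], "p": [0.00005, 0.00007]})


def round_trip_columns(df, **to_csv_kwargs):
    """Write df to an in-memory CSV, read it back, and return its columns."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer).columns.tolist()


# Default: the index is written and reappears as an unnamed column.
with_index = round_trip_columns(table)

# index=False: only the data columns are written.
without_index = round_trip_columns(table, index=False)

print(with_index)
print(without_index)
```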