# Loading a data table in Pandas

Datasets commonly come as CSV files, in which values are separated by commas. These tables may or may not include a header row and an index column. By default, Pandas assumes that the first row contains the headers.

In [1]:
from pathlib import Path
import pandas as pd

ROOT_DIR = Path().resolve().parent
DATA_DIR = ROOT_DIR / "data/raw"
file_path = 'home_data.csv'

df = pd.read_csv(DATA_DIR / file_path)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
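
As a minimal sketch of the header and index options mentioned above, the following reads a small in-memory CSV with the default settings, with `header=None`, and with `index_col=0`. The data and the variable names are made up for illustration; they are not part of `home_data.csv`.

```python
import io
import pandas as pd

# Hypothetical three-line CSV used only for illustration.
raw = "id,price,bedrooms\n10,221900,3\n20,538000,2\n"

# Default: the first row is used as the header.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: the first row is treated as data and columns are numbered.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# index_col=0: the first column becomes the row index.
indexed = pd.read_csv(io.StringIO(raw), index_col=0)

print(with_header.shape, no_header.shape, indexed.shape)
```

With `header=None` the header line is counted as a record, so that frame has one more row than the others.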


Pandas automatically infers the data type of each column.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  int64  
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

If we disagree with the data type that Pandas inferred, it can be changed with the `.astype()` method. For example, the `id` column can be converted to string (reported as `object` by `df.info()`) and then back to `int64` as follows:

In [3]:
df = df.astype({'id':'str'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  object 
 1   date           21613 non-null  object 
 2   price          21613 non-null  int64  
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [4]:
df = df.astype({'id':'int64'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  int64  
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  
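
The same conversion can be tried on a small frame. This is only a sketch: the frame `toy` and its values are illustrative, and converting `zipcode` is an assumption about what one might want, not something the dataset requires.

```python
import pandas as pd

# Toy frame; the column names mirror the dataset but the values are illustrative.
toy = pd.DataFrame({"id": [7129300520, 6414100192], "zipcode": [98178, 98125]})

# Convert several columns at once by passing a {column: dtype} mapping.
as_text = toy.astype({"id": "str", "zipcode": "str"})

# String ids keep their digits but are now text, so string methods apply.
print(as_text["id"].str.len().tolist())

# Converting back restores the integer dtype.
restored = as_text.astype({"id": "int64", "zipcode": "int64"})
print(restored.dtypes)
```

Identifiers such as `id` or `zipcode` are often stored as text because arithmetic on them is meaningless, although that choice depends on the analysis.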

Dates are a special kind of data: they represent points in time. In this dataset the `date` variable is of this kind but was loaded as `object`, so it has to be converted to datetime with the `pd.to_datetime()` function.

In [5]:
df['date'] = pd.to_datetime(df['date'], dayfirst=False)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
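
When the layout of the date strings is known, it can be passed explicitly instead of relying on inference. The sketch below assumes the `YYYYMMDDTHHMMSS` layout seen in the raw `date` column; the series `raw_dates` is a hypothetical stand-in for it.

```python
import pandas as pd

# Date strings in the same layout as the raw `date` column.
raw_dates = pd.Series(["20141013T000000", "20141209T000000", "20150225T000000"])

# An explicit format avoids relying on automatic inference.
parsed = pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S")

# Datetime columns expose their components through the .dt accessor.
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```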


# Indexing and slicing

In Pandas every row has an index that identifies it. A column can be turned into the index, which is useful for accessing specific rows.

In [6]:
df.set_index('id', inplace=True)
df.head()

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


To extract rows, use the `loc[ ]` and `iloc[ ]` indexers. The first works with the index labels set by the programmer, known in this context as **explicit indices**; the second works with the **row position or positions**, known as **implicit indices**. A `loc` slice includes its end label, whereas an `iloc` slice excludes its stop position.

In [7]:
df.loc[:5631500400]

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062


In [8]:
df.iloc[0:2]

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
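
The difference between label-based and position-based access can be reproduced on a small frame. This is a sketch under the assumption of a unique `id` index; `toy` and its prices are illustrative.

```python
import pandas as pd

# Toy frame indexed by id; the values are illustrative.
toy = pd.DataFrame(
    {"price": [221900, 538000, 180000, 604000]},
    index=pd.Index([7129300520, 6414100192, 5631500400, 2487200875], name="id"),
)

# Label slice: runs from the first row up to AND including the given label.
by_label = toy.loc[:5631500400]

# Position slice: the stop position is excluded, as in Python lists.
by_position = toy.iloc[0:2]

# A single label returns one row as a Series.
one_row = toy.loc[6414100192]

print(len(by_label), len(by_position), one_row["price"])
```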


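When no explicit index has been set, the default index is 0, 1, 2, and so on, so both `.iloc` and `.loc` can be used with implicit indices. The slices still differ at the end point: `df.loc[:3]` returns rows 0 through 3, while `df.iloc[:3]` returns rows 0 through 2.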

In [9]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [10]:
df.loc[:3]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000


In [11]:
df.iloc[:3]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
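
Labels and positions only coincide while the default index is intact. The following sketch, on an illustrative frame `toy`, shows how they diverge once rows are filtered out.

```python
import pandas as pd

toy = pd.DataFrame({"bedrooms": [3, 3, 2, 4, 3]})

# With the default index the labels are 0..4, but the slices differ in length.
print(len(toy.loc[:3]), len(toy.iloc[:3]))

# After filtering, labels keep their original values while positions restart at 0.
large = toy[toy["bedrooms"] >= 3]
print(large.index.tolist())

# Position 2 now holds the row whose label is 3; label 2 no longer exists.
print(large.iloc[2].name, 2 in large.index)
```

If positions and labels should match again after filtering, `reset_index(drop=True)` renumbers the rows.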


Columns can be accessed by name.

In [12]:
df[['bedrooms']]

Unnamed: 0,bedrooms
0,3
1,3
2,2
3,4
4,3
...,...
21608,3
21609,4
21610,2
21611,3


To select several columns, pass their names in a list.

In [13]:
df[['id', 'date', 'price']]

Unnamed: 0,id,date,price
0,7129300520,2014-10-13,221900
1,6414100192,2014-12-09,538000
2,5631500400,2015-02-25,180000
3,2487200875,2014-12-09,604000
4,1954400510,2015-02-18,510000
...,...,...,...
21608,263000018,2014-05-21,360000
21609,6600060120,2015-02-23,400000
21610,1523300141,2014-06-23,402101
21611,291310100,2015-01-16,400000


Columns cannot be *sliced* directly with `df[...]`, because a slice inside the brackets is applied to the rows, but they can be sliced with the `loc` indexer.

In [14]:
df['id':'price']

TypeError: cannot do slice indexing on RangeIndex with these indexers [id] of type str

In [15]:
df.loc[:, 'id':'price'].head()

Unnamed: 0,id,date,price
0,7129300520,2014-10-13,221900
1,6414100192,2014-12-09,538000
2,5631500400,2015-02-25,180000
3,2487200875,2014-12-09,604000
4,1954400510,2015-02-18,510000
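
The three forms of column access return different kinds of objects, which the following sketch makes explicit. The frame `toy` is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2],
    "date": ["2014-10-13", "2014-12-09"],
    "price": [221900, 538000],
    "bedrooms": [3, 3],
})

as_series = toy["bedrooms"]            # single name -> Series
as_frame = toy[["bedrooms"]]           # list of names -> one-column DataFrame
col_range = toy.loc[:, "id":"price"]   # label slice over columns, end included

print(type(as_series).__name__, type(as_frame).__name__)
print(col_range.columns.tolist())
```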


To delete rows or columns, use the `.drop()` method. By default it returns a new DataFrame and leaves the original unchanged.

In [16]:
df.drop('sqft_living15', axis=1).head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_lot15
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,5650
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,7639
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,8062
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,5000
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,7503


In [17]:
df.drop(2, axis=0).head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
5,7237550310,2014-05-12,1225000,4,4.5,5420,101930,1.0,0,0,...,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930
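
Because `.drop()` returns a copy, the result has to be assigned to keep it. A minimal sketch on an illustrative frame, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "price": [10, 20, 30], "sqft_lot15": [5, 6, 7]})

# drop() returns a new DataFrame; `toy` itself is unchanged.
trimmed = toy.drop(columns=["sqft_lot15"])
print(toy.columns.tolist())
print(trimmed.columns.tolist())

# Rows are dropped by index label; reassign to keep the result.
remaining = toy.drop(index=[0, 2])
print(remaining.index.tolist())
```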


Rows can also be selected, or filtered, using comparison and logical operators.

In [18]:
df[df['bedrooms']>=4].head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
5,7237550310,2014-05-12,1225000,4,4.5,5420,101930,1.0,0,0,...,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930
14,1175000570,2015-03-12,530000,5,2.0,1810,4850,1.5,0,0,...,7,1810,0,1900,0,98107,47.67,-122.394,1360,4850
15,9297300055,2015-01-24,650000,4,3.0,2950,5000,2.0,0,3,...,9,1980,970,1979,0,98126,47.5714,-122.375,2140,4000
17,6865200140,2014-05-29,485000,4,1.0,1600,4300,1.5,0,0,...,7,1600,0,1916,0,98103,47.6648,-122.343,1610,4300


In [19]:
df[df['date'] > '2015'].head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
7,2008000270,2015-01-15,291850,3,1.5,1060,9711,1.0,0,0,...,7,1060,0,1963,0,98198,47.4095,-122.315,1650,9711
8,2414600126,2015-04-15,229500,3,1.0,1780,7470,1.0,0,0,...,7,1050,730,1960,0,98146,47.5123,-122.337,1780,8113
9,3793500160,2015-03-12,323000,3,2.5,1890,6560,2.0,0,0,...,7,1890,0,2003,0,98038,47.3684,-122.031,2390,7570
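
The cells above use a single comparison each. Conditions can also be combined with the element-wise logical operators, as in this sketch on an illustrative frame; the thresholds are arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({
    "bedrooms": [3, 4, 5, 2],
    "price": [221900, 604000, 530000, 180000],
    "date": pd.to_datetime(["2014-10-13", "2014-12-09", "2015-03-12", "2015-02-25"]),
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not).
big_and_cheap = toy[(toy["bedrooms"] >= 4) & (toy["price"] < 600000)]
small_or_2015 = toy[(toy["bedrooms"] < 3) | (toy["date"] >= "2015-01-01")]
not_2015 = toy[~(toy["date"] >= "2015-01-01")]

print(big_and_cheap.index.tolist())
print(small_or_2015.index.tolist())
print(not_2015.index.tolist())
```

The Python keywords `and`, `or` and `not` do not work element-wise on a Series, which is why the symbols are used instead.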


# Operations between DataFrame columns, and sorting by values

Pandas can create new columns from arithmetic operations between columns, or from operations that involve aggregation functions such as `mean()` and `std()`.

In [20]:
df['price_area_relation'] = df.price/df.sqft_living
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price_area_relation
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,1180,0,1955,0,98178,47.5112,-122.257,1340,5650,188.050847
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,2170,400,1951,1991,98125,47.721,-122.319,1690,7639,209.338521
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,770,0,1933,0,98028,47.7379,-122.233,2720,8062,233.766234
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,1050,910,1965,0,98136,47.5208,-122.393,1360,5000,308.163265
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,...,1680,0,1987,0,98074,47.6168,-122.045,1800,7503,303.571429


In [21]:
df['standard_lot_area'] = (df.sqft_lot - df.sqft_lot.mean())/df.sqft_lot.std()
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price_area_relation,standard_lot_area
0,7129300520,2014-10-13,221900,3,1.0,1180,5650,1.0,0,0,...,0,1955,0,98178,47.5112,-122.257,1340,5650,188.050847,-0.228316
1,6414100192,2014-12-09,538000,3,2.25,2570,7242,2.0,0,0,...,400,1951,1991,98125,47.721,-122.319,1690,7639,209.338521,-0.189881
2,5631500400,2015-02-25,180000,2,1.0,770,10000,1.0,0,0,...,0,1933,0,98028,47.7379,-122.233,2720,8062,233.766234,-0.123296
3,2487200875,2014-12-09,604000,4,3.0,1960,5000,1.0,0,0,...,910,1965,0,98136,47.5208,-122.393,1360,5000,308.163265,-0.244009
4,1954400510,2015-02-18,510000,3,2.0,1680,8080,1.0,0,0,...,0,1987,0,98074,47.6168,-122.045,1800,7503,303.571429,-0.169649
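
A quick way to check a standardized column is to confirm that its mean is close to 0 and its standard deviation close to 1. The sketch below applies the same formula to a few illustrative lot areas; note that `Series.std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

toy = pd.DataFrame({"sqft_lot": [5650, 7242, 10000, 5000, 8080]})

# Same formula as above: subtract the mean and divide by the standard deviation.
z = (toy["sqft_lot"] - toy["sqft_lot"].mean()) / toy["sqft_lot"].std()

# The standardized column should have mean ~0 and sample standard deviation ~1.
mean_is_zero = abs(z.mean()) < 1e-9
std_is_one = abs(z.std() - 1) < 1e-9
print(mean_is_zero, std_is_one)
```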


A DataFrame can be reordered by the values of its columns or by its index.

In [22]:
df.sort_values('date')

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price_area_relation,standard_lot_area
16768,5561000190,2014-05-02,437500,3,2.25,1970,35100,2.0,0,0,...,0,1977,0,98027,47.4635,-121.991,2340,35100,222.081218,0.482684
9596,472000620,2014-05-02,790000,3,2.50,2600,4750,1.0,0,0,...,900,1951,0,98117,47.6833,-122.400,2380,4750,303.846154,-0.250044
9587,1024069009,2014-05-02,675000,5,2.50,2820,67518,2.0,0,0,...,0,1979,0,98029,47.5794,-122.025,2820,48351,239.361702,1.265340
20602,7853361370,2014-05-02,555000,4,2.50,3310,6500,2.0,0,0,...,0,2012,0,98065,47.5150,-121.870,2380,5000,167.673716,-0.207795
11577,5056500260,2014-05-02,440000,4,2.25,2160,8119,1.0,0,0,...,1080,1966,0,98006,47.5443,-122.177,1850,9000,203.703704,-0.168708
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7898,1422700040,2015-05-14,183000,3,1.00,1170,7320,1.0,0,0,...,0,1962,0,98188,47.4685,-122.282,2040,7320,156.410256,-0.187998
928,8730000270,2015-05-14,359000,2,2.75,1370,1140,2.0,0,0,...,290,2009,0,98133,47.7052,-122.343,1370,1090,262.043796,-0.337199
5637,7923600250,2015-05-15,450000,5,2.00,1870,7344,1.5,0,0,...,0,1960,0,98007,47.5951,-122.144,1870,7650,240.641711,-0.187418
13053,5101400871,2015-05-24,445500,2,1.75,1390,6670,1.0,0,0,...,670,1941,0,98115,47.6914,-122.308,920,6380,320.503597,-0.203691


In [23]:
df.sort_values('date', ascending=False)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price_area_relation,standard_lot_area
16594,9106000005,2015-05-27,1310000,4,2.25,3750,5000,2.0,0,0,...,1310,1924,0,98115,47.6747,-122.303,2170,4590,349.333333,-0.244009
13053,5101400871,2015-05-24,445500,2,1.75,1390,6670,1.0,0,0,...,670,1941,0,98115,47.6914,-122.308,920,6380,320.503597,-0.203691
5637,7923600250,2015-05-15,450000,5,2.00,1870,7344,1.5,0,0,...,0,1960,0,98007,47.5951,-122.144,1870,7650,240.641711,-0.187418
928,8730000270,2015-05-14,359000,2,2.75,1370,1140,2.0,0,0,...,290,2009,0,98133,47.7052,-122.343,1370,1090,262.043796,-0.337199
6197,9178601660,2015-05-14,1695000,5,3.00,3320,5354,2.0,0,0,...,0,2004,0,98103,47.6542,-122.331,2330,4040,510.542169,-0.235462
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20559,3438501320,2014-05-02,295000,2,2.50,1630,1368,2.0,0,0,...,350,2009,0,98106,47.5489,-122.363,1590,2306,180.981595,-0.331695
7323,2202500290,2014-05-02,435000,4,1.00,1450,8800,1.0,0,0,...,0,1954,0,98006,47.5746,-122.135,1260,8942,300.000000,-0.152267
12366,587550340,2014-05-02,604000,3,2.50,3240,33151,2.0,0,2,...,0,1995,0,98023,47.3256,-122.378,4050,24967,186.419753,0.435630
9587,1024069009,2014-05-02,675000,5,2.50,2820,67518,2.0,0,0,...,0,1979,0,98029,47.5794,-122.025,2820,48351,239.361702,1.265340


In [24]:
df.set_index('id').sort_index()

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price_area_relation,standard_lot_area
1000102,2015-04-22,300000,6,3.00,2400,9373,2.0,0,0,3,...,0,1991,0,98002,47.3262,-122.214,2060,7316,125.000000,-0.138433
1000102,2014-09-16,280000,6,3.00,2400,9373,2.0,0,0,3,...,0,1991,0,98002,47.3262,-122.214,2060,7316,116.666667,-0.138433
1200019,2014-05-08,647500,4,1.75,2060,26036,1.0,0,0,4,...,900,1947,0,98166,47.4444,-122.351,2590,21891,314.320388,0.263856
1200021,2014-08-11,400000,3,1.00,1460,43000,1.0,0,0,3,...,0,1952,0,98166,47.4434,-122.347,2250,20023,273.972603,0.673411
2800031,2015-04-01,235000,3,1.00,1430,7599,1.5,0,0,4,...,420,1930,0,98168,47.4783,-122.265,1290,10320,164.335664,-0.181262
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9842300095,2014-07-25,365000,5,2.00,1600,4168,1.5,0,0,3,...,0,1927,0,98126,47.5297,-122.381,1190,4168,228.125000,-0.264095
9842300485,2015-03-11,380000,2,1.00,1040,7372,1.0,0,0,5,...,200,1939,0,98126,47.5285,-122.378,1930,5150,365.384615,-0.186742
9842300540,2014-06-24,339000,3,1.00,1100,4128,1.0,0,0,4,...,380,1942,0,98126,47.5296,-122.379,1510,4538,308.181818,-0.265061
9895000040,2014-07-03,399900,2,1.75,1410,1005,1.5,0,0,3,...,510,2011,0,98027,47.5446,-122.018,1440,1188,283.617021,-0.340459
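
Sorting can also use more than one column, each with its own direction. The following is a sketch on an illustrative frame in which one `id` appears twice, as happens in the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [30, 10, 20, 10],
    "date": pd.to_datetime(["2015-01-05", "2015-04-22", "2014-06-01", "2014-09-16"]),
    "price": [400000, 300000, 350000, 280000],
})

# Sort by id ascending, then by date descending within each id.
ordered = toy.sort_values(["id", "date"], ascending=[True, False])
print(ordered.index.tolist())

# ignore_index=True renumbers the rows 0..n-1 after sorting.
renumbered = toy.sort_values("price", ignore_index=True)
print(renumbered["price"].tolist())
```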


# Loading data from CSV files that are not in the standard format

The Pandas `read_csv()` function assumes that the values in each row are separated by commas and that the first row is a header, that is, that it contains the variable names. When this is not the case, the default options must be changed. Let's look at an example.

In [25]:
file_path = DATA_DIR / "auto-mpg.data"

df = pd.read_csv(file_path)
df.head()

Unnamed: 0,"18.0 8. 307.0 130.0 3504. 12.0 70. 1.\t""chevrolet chevelle malibu"""
0,15.0 8. 350.0 165.0 3693. 1...
1,18.0 8. 318.0 150.0 3436. 1...
2,16.0 8. 304.0 150.0 3433. 1...
3,17.0 8. 302.0 140.0 3449. 1...
4,15.0 8. 429.0 198.0 4341. 1...


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 405 entries, 0 to 404
Data columns (total 1 columns):
 #   Column                                                                                   Non-Null Count  Dtype 
---  ------                                                                                   --------------  ----- 
 0   18.0   8.   307.0      130.0      3504.      12.0   70.  1.	"chevrolet chevelle malibu"  405 non-null    object
dtypes: object(1)
memory usage: 3.3+ KB


The file did not load correctly: everything ended up in a single column, and the first record was taken as the header. To load it properly we must specify that the file has no header and that the separator is whitespace, using the parameters `header=None` and `sep=r"\s+"`.

In [27]:
df = pd.read_csv(
    file_path, 
    header=None, 
    sep=r"\s+"
    )

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
1,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
2,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
3,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
4,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino


Since we did not pass column names through the `names` parameter, Pandas assigns default names, which are integers. With `names`, the column names can be specified at load time.

The columns can also be renamed after loading the file, by assigning to the DataFrame's `columns` attribute.

In [28]:
df.columns=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
1,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
2,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
3,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
4,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
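
The `names` alternative mentioned above can be sketched as follows. The two records are copied from the layout of `auto-mpg.data` and held in memory, so the snippet does not depend on the file; `column_names` and `cars` are illustrative variable names.

```python
import io
import pandas as pd

# Two records in the same layout as auto-mpg.data:
# whitespace-separated values followed by a quoted car name.
raw = (
    '18.0   8.   307.0      130.0      3504.      12.0   70.  1.\t"chevrolet chevelle malibu"\n'
    '15.0   8.   350.0      165.0      3693.      11.5   70.  1.\t"buick skylark 320"\n'
)

column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model_year', 'origin', 'car_name']

# names= sets the column names while the file is being read.
cars = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+", names=column_names)

print(cars.columns.tolist())
print(cars["car_name"].tolist())
```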


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     406 non-null    float64
 2   displacement  406 non-null    float64
 3   horsepower    400 non-null    float64
 4   weight        406 non-null    float64
 5   acceleration  406 non-null    float64
 6   model_year    406 non-null    float64
 7   origin        406 non-null    float64
 8   car_name      406 non-null    object 
dtypes: float64(8), object(1)
memory usage: 28.7+ KB


# Detecting null and duplicate data

Null data are unknown values, that is, values of a variable for which no information is available in a given record.

Pandas automatically recognizes some character sequences as null, such as the empty string, "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null".

The `isnull()` and `notnull()` methods detect null and non-null values, respectively.

In [30]:
df.isnull().sum()

mpg             8
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64

In [31]:
df.notnull().sum()

mpg             398
cylinders       406
displacement    406
horsepower      400
weight          406
acceleration    406
model_year      406
origin          406
car_name        406
dtype: int64
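
Beyond counting, the boolean mask returned by `isnull()` can be used to compute the share of nulls per column or to pull out the affected rows. A minimal sketch on an illustrative frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "mpg": [18.0, None, 16.0, None],
    "horsepower": [130.0, 165.0, None, 150.0],
    "car_name": ["a", "b", "c", "d"],
})

# mean() of a boolean mask gives the fraction of nulls per column.
null_share = toy.isnull().mean()
print(null_share)

# any(axis=1) flags the rows that contain at least one null value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls.index.tolist())
```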

Sometimes null values are not recognized by Pandas and have to be declared explicitly with the `na_values` parameter.

In [32]:
file_path = DATA_DIR / "adult_modified.data"

df = pd.read_csv(
    file_path, 
    header=None,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
    )
    
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,?,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,?,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  object
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  object
 12  hours-per-week  32561 non-null  object
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(3), object(12)
memory usage: 3.7+ MB


In [34]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

Looking at the structure of the dataset, the "education-num" and "hours-per-week" columns are numeric, but Pandas does not identify them as such. The likely cause is that some cells in the original file contain the value "?", which prevents Pandas from treating those columns as numeric. This can be fixed in two ways:
1. Passing the `na_values` parameter to `read_csv` so that "?" is read as null.
2. Changing the data type of the columns after loading the dataset, forcing the conversion to numeric.


In [35]:
df['education-num'] = pd.to_numeric(df['education-num'], errors='coerce')
df['hours-per-week'] = pd.to_numeric(df['hours-per-week'], errors='coerce')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32560 non-null  float64
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  object 
 12  hours-per-week  32559 non-null  float64
 13  native-country  32561 non-null  object 
 14  income          32561 non-null  object 
dtypes: float64(2), int64(3), object(10)
memory usage: 3.7+ MB


In [36]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     1
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    2
native-country    0
income            0
dtype: int64
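
Forced conversion silently turns every unparseable value into a null, so it can be worth checking which raw values were lost. The sketch below does this on an illustrative series; `raw` and `lost` are hypothetical names.

```python
import pandas as pd

# Text column with one non-numeric marker, as in "education-num".
raw = pd.Series(["13", "9", "?", "7"])

# errors="coerce" replaces anything that cannot be parsed with NaN.
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers.isnull().sum())

# Raw values that became null only because of the coercion.
lost = raw[numbers.isnull() & raw.notnull()]
print(lost.tolist())
```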

In [37]:
df = pd.read_csv(
    file_path, 
    header=None,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'],
    na_values=[' ?', ' '],
    )
    
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0.0,,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0.0,13.0,United-States,<=50K
2,38,Private,215646,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0.0,40.0,United-States,<=50K
3,53,Private,234721,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0.0,40.0,United-States,<=50K
4,28,Private,338409,Bachelors,,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0.0,40.0,Cuba,<=50K


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       30725 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32560 non-null  float64
 5   marital-status  32561 non-null  object 
 6   occupation      30718 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32560 non-null  float64
 12  hours-per-week  32559 non-null  float64
 13  native-country  31978 non-null  object 
 14  income          32561 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory usage: 3.7+ MB


In [39]:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        1
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         1
hours-per-week       2
native-country     583
income               0
dtype: int64
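
The call above lists `' ?'` with a leading space because the values in this file follow each comma with a space. One possible alternative, sketched here on in-memory data with illustrative column names, is to let `read_csv` discard that space with `skipinitialspace=True`, so that a plain `"?"` matches.

```python
import io
import pandas as pd

# Three records in a comma-plus-space layout, with "?" marking unknown values.
raw = "39, State-gov, 40\n50, ?, 13\n38, Private, ?\n"
names = ["age", "workclass", "hours"]

# Without stripping, the cell content is " ?" and na_values="?" does not match.
plain = pd.read_csv(io.StringIO(raw), header=None, names=names, na_values="?")
print(plain.isnull().sum().sum())

# skipinitialspace=True drops the space after each delimiter before matching.
clean = pd.read_csv(io.StringIO(raw), header=None, names=names,
                    na_values="?", skipinitialspace=True)
print(clean.isnull().sum())
```

A side effect of this option is that text values such as `State-gov` are also stored without the leading space.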

Pandas also provides methods to identify and handle duplicate records:

In [40]:
print('Number of duplicate records: ', df.duplicated().sum())

Number of duplicate records:  24
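
Handling the duplicates usually means inspecting them and then removing the repeats. A minimal sketch on an illustrative frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 39, 39],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# duplicated() marks each repeat of an earlier row; the first occurrence is False.
mask = toy.duplicated()
print(mask.tolist())

# keep=False marks every member of a duplicated group, which helps inspection.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies.index.tolist())

# drop_duplicates() returns a new frame with the repeats removed.
unique_rows = toy.drop_duplicates()
print(unique_rows.index.tolist())
```

Whether identical rows are true duplicates or distinct records that happen to share values depends on the dataset, so removal is a judgment call.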


# Exercise

1. Load the file "Telco_customer_churn.csv", making sure that it is loaded correctly.

2. State how many records and how many columns the resulting DataFrame has.

3. State which variables are categorical. Is this consistent with the content of the variables?

4. State which variables contain null values.

5. State whether there are duplicate records.