# DataFrames

DataFrames are the main object you work with in pandas.

We can think of a DataFrame as a set of Series placed side by side that share the same index, so that together they form a table.
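
As a rough illustration of that idea, the sketch below builds a small table from two Series with the same index labels. The names `edad`, `sexo` and `tabla` are made up for this example.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
edad = pd.Series([30, 34, 37], index=["a", "b", "c"])
sexo = pd.Series(["M", "F", "M"], index=["a", "b", "c"])

# Placing them side by side under column names yields a DataFrame
tabla = pd.DataFrame({"Edad": edad, "Sexo": sexo})
print(tabla)
```

Each dictionary key becomes a column name, and the shared labels become the row index.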


In [1]:
import pandas as pd
import numpy as np

## There are several ways to build a DataFrame

We will look at a couple of them; the following article covers more ways to create them.

[Ways to create a DataFrame in pandas](https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/)

In [2]:
# using a list of lists

data = [["julio",30,'M'],["julia",34,'F'],["juli",37,'M'],["julian",36,'M']]
columns = ["Nombre","Edad","Sexo"]
# flat list of row labels (a nested list would be treated as MultiIndex levels)
indices = 'a b c d'.split()

In [3]:
pd.DataFrame(data)

Unnamed: 0,0,1,2
0,julio,30,M
1,julia,34,F
2,juli,37,M
3,julian,36,M


In [4]:
pd.DataFrame(data,columns=columns)

Unnamed: 0,Nombre,Edad,Sexo
0,julio,30,M
1,julia,34,F
2,juli,37,M
3,julian,36,M


In [5]:
pd.DataFrame(data,indices,columns)

Unnamed: 0,Nombre,Edad,Sexo
a,julio,30,M
b,julia,34,F
c,juli,37,M
d,julian,36,M


In [6]:
# using dictionaries
data = {'Nombre':'julio julia flor juan'.split(), 'Edad':[23,45,3,44]}

In [7]:
pd.DataFrame(data)

Unnamed: 0,Nombre,Edad
0,julio,23
1,julia,45
2,flor,3
3,juan,44


In [8]:
pd.DataFrame(data,indices)

Unnamed: 0,Nombre,Edad
a,julio,23
b,julia,45
c,flor,3
d,juan,44


In [None]:
# we use numpy to create a DataFrame with random numbers

In [9]:
random_array = np.random.rand(5,4)
random_array

array([[0.14302854, 0.40447569, 0.75093517, 0.7630948 ],
       [0.32712352, 0.77731844, 0.98129251, 0.25273625],
       [0.0825063 , 0.75007823, 0.35441656, 0.97396753],
       [0.84171934, 0.62690468, 0.55818658, 0.83393318],
       [0.18826054, 0.42797143, 0.21376253, 0.84530077]])

In [10]:
df = pd.DataFrame(random_array,index='A B C D E'.split(),columns='W X Y Z'.split())

In [11]:
df

Unnamed: 0,W,X,Y,Z
A,0.143029,0.404476,0.750935,0.763095
B,0.327124,0.777318,0.981293,0.252736
C,0.082506,0.750078,0.354417,0.973968
D,0.841719,0.626905,0.558187,0.833933
E,0.188261,0.427971,0.213763,0.845301
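
Because `np.random.rand` draws new numbers on every run, the values above will differ each time the notebook is executed. A possible way to make the table reproducible is a seeded generator. This is only a sketch: the seed `42` and the name `df_seeded` are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed always yields the same numbers
rng = np.random.default_rng(seed=42)  # 42 is an arbitrary value
values = rng.random((5, 4))           # uniform floats in [0, 1)

df_seeded = pd.DataFrame(values, index=list("ABCDE"), columns=list("WXYZ"))
print(df_seeded)
```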


## Indexing

Let's look at several ways to select data from a DataFrame.

In [12]:
# Select all the values of one column
df['W']

A    0.143029
B    0.327124
C    0.082506
D    0.841719
E    0.188261
Name: W, dtype: float64

In [14]:
# We can select all the values of more than one column by passing a list of the columns we want
df[['W','Z']]

Unnamed: 0,W,Z
A,0.143029,0.763095
B,0.327124,0.252736
C,0.082506,0.973968
D,0.841719,0.833933
E,0.188261,0.845301


In [15]:
df[['Z','W']]

Unnamed: 0,Z,W
A,0.763095,0.143029
B,0.252736,0.327124
C,0.973968,0.082506
D,0.833933,0.841719
E,0.845301,0.188261


In [16]:
# We can also select a column this way (NOT RECOMMENDED: it fails when the name collides with a DataFrame attribute or is not a valid identifier)
df.W

A    0.143029
B    0.327124
C    0.082506
D    0.841719
E    0.188261
Name: W, dtype: float64
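
A small sketch of why attribute access is risky: if a column is named like an existing DataFrame method, the attribute resolves to the method rather than the column. The DataFrame `df_attr` and its column names are hypothetical.

```python
import pandas as pd

df_attr = pd.DataFrame({"count": [1, 2, 3], "my col": [4, 5, 6]})

# Bracket access always returns the column
col = df_attr["count"]

# Attribute access finds the DataFrame.count method, not the column
attr = df_attr.count

print(type(col).__name__)  # Series
print(callable(attr))      # True: it is the bound method

# "my col" contains a space, so df_attr.my col is not even valid syntax;
# only df_attr["my col"] works
spaced = df_attr["my col"]
```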

**The columns of a DataFrame are Series**

In [13]:
type(df['W'])

pandas.core.series.Series

**Changing the values of a column**

In [17]:
# as in numpy, the scalar is broadcast to every row

df['W'] = 999

In [18]:
df

Unnamed: 0,W,X,Y,Z
A,999,0.404476,0.750935,0.763095
B,999,0.777318,0.981293,0.252736
C,999,0.750078,0.354417,0.973968
D,999,0.626905,0.558187,0.833933
E,999,0.427971,0.213763,0.845301


**Creating new columns**

In [19]:
df["nueva_columna"] = 5

In [20]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [21]:
# create a new column from an operation between columns
df['new'] = df['W'] + df["Z"]

In [22]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.404476,0.750935,0.763095,5,999.763095
B,999,0.777318,0.981293,0.252736,5,999.252736
C,999,0.750078,0.354417,0.973968,5,999.973968
D,999,0.626905,0.558187,0.833933,5,999.833933
E,999,0.427971,0.213763,0.845301,5,999.845301


In [23]:
# we can reassign the values of an existing column
df['new'] = df['W'] * df["Z"]

In [24]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.404476,0.750935,0.763095,5,762.331704
B,999,0.777318,0.981293,0.252736,5,252.483517
C,999,0.750078,0.354417,0.973968,5,972.993558
D,999,0.626905,0.558187,0.833933,5,833.099248
E,999,0.427971,0.213763,0.845301,5,844.455473


**Dropping columns**

In [26]:
df.drop("new",axis=1)

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [27]:
df
#df = df.drop("new",axis=1)

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.404476,0.750935,0.763095,5,762.331704
B,999,0.777318,0.981293,0.252736,5,252.483517
C,999,0.750078,0.354417,0.973968,5,972.993558
D,999,0.626905,0.558187,0.833933,5,833.099248
E,999,0.427971,0.213763,0.845301,5,844.455473


In [28]:
# drop does not operate in place unless we ask for it
df.drop("new",axis=1,inplace=True)

In [29]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


We can also drop rows the same way (by default `drop` looks for the label among the rows):

In [30]:
df.drop("E")

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5


In [31]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5
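
The `drop` calls above can also be written with the `columns=` and `index=` keywords, which some readers find clearer than `axis`. The sketch below uses a small hypothetical table, `df_demo`, rather than the `df` from this notebook, and shows rebinding the name as an alternative to `inplace=True`.

```python
import pandas as pd

# Small hypothetical table, not the df used in the notebook
df_demo = pd.DataFrame({"W": [1, 2, 3], "new": [4, 5, 6]}, index=list("ABC"))

no_col = df_demo.drop(columns="new")  # same as drop("new", axis=1)
no_row = df_demo.drop(index="C")      # same as drop("C") or drop("C", axis=0)

# Neither call modified df_demo: it still has 3 rows and 2 columns
unchanged = df_demo.shape == (3, 2)

# Rebinding the name is a common alternative to inplace=True
df_demo = df_demo.drop(columns="new")
```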


**Selecting rows of the DataFrame**

For this we can use two indexers, `.loc` and `.iloc`:

`.loc[]` selects rows (and columns) by label

`.iloc[]` selects rows (and columns) by integer position

In [32]:
# with loc
df.loc['A']

W                999.000000
X                  0.404476
Y                  0.750935
Z                  0.763095
nueva_columna      5.000000
Name: A, dtype: float64

In [33]:
# with iloc
df.iloc[0]

W                999.000000
X                  0.404476
Y                  0.750935
Z                  0.763095
nueva_columna      5.000000
Name: A, dtype: float64

In [34]:
# to select several specific rows, pass a list of labels, as with numpy fancy indexing
df.loc[['A',"D"]]

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
D,999,0.626905,0.558187,0.833933,5


In [35]:
df.iloc[[0,3]]

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
D,999,0.626905,0.558187,0.833933,5
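
The difference between label and position is easiest to see when the labels are themselves integers that do not match the row order. A minimal sketch, using a made-up DataFrame `df_int`:

```python
import pandas as pd

# Integer labels deliberately out of order (illustrative data)
df_int = pd.DataFrame({"valor": [10, 20, 30]}, index=[2, 0, 1])

by_label = df_int.loc[0, "valor"]  # row whose label is 0 (the second row)
by_position = df_int.iloc[0, 0]    # first row, first column

print(by_label, by_position)  # 20 10
```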


**As in NumPy, we can select rows and columns at the same time**

In [37]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [36]:
df.loc['A','Y']

0.7509351717162787

In [38]:
df.loc[['A','E'],['W','Z']]

Unnamed: 0,W,Z
A,999,0.763095
E,999,0.845301
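
Both indexers also accept slices, with one difference worth keeping in mind: a `.loc` slice includes its end label, while an `.iloc` slice stops before its end position, as in plain Python. The table `df_slice` below is an assumption made up for the example.

```python
import numpy as np
import pandas as pd

df_slice = pd.DataFrame(
    np.arange(20).reshape(5, 4), index=list("ABCDE"), columns=list("WXYZ")
)

label_slice = df_slice.loc["A":"C", "W":"Y"]  # includes row C and column Y
position_slice = df_slice.iloc[0:3, 0:3]      # stops before position 3

print(label_slice.equals(position_slice))  # True
```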


### Selection by condition (filtering)

As in NumPy, we can select with conditions by building a boolean mask.

The selection can be applied to the whole DataFrame or to a single column.


In [39]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [40]:
df>0.5

Unnamed: 0,W,X,Y,Z,nueva_columna
A,True,False,True,True,True
B,True,True,True,False,True
C,True,True,False,True,True
D,True,True,True,True,True
E,True,False,False,True,True


In [41]:
df[df>0.5]

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,,0.750935,0.763095,5
B,999,0.777318,0.981293,,5
C,999,0.750078,,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,,,0.845301,5


In [43]:
df['Z']>0.5

A     True
B    False
C     True
D     True
E     True
Name: Z, dtype: bool

In [44]:
df[df['Z']>0.5]

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [45]:
df[df['Z']>0.5]['Y']

A    0.750935
C    0.354417
D    0.558187
E    0.213763
Name: Y, dtype: float64

In [46]:
df[df['Z']>0.5][['Y',"Z"]]

Unnamed: 0,Y,Z
A,0.750935,0.763095
C,0.354417,0.973968
D,0.558187,0.833933
E,0.213763,0.845301
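
The two-step pattern `df[mask][cols]` can also be written as a single `.loc` call, which takes the row mask and the column list together. One reason to prefer it is that it also works for assignment. The sketch uses a small hypothetical table, `df_f`.

```python
import pandas as pd

df_f = pd.DataFrame(
    {"Y": [0.75, 0.98, 0.35], "Z": [0.76, 0.25, 0.97]}, index=list("ABC")
)

mask = df_f["Z"] > 0.5

chained = df_f[mask][["Y", "Z"]]          # filter rows, then pick columns
single_step = df_f.loc[mask, ["Y", "Z"]]  # rows and columns in one call

print(chained.equals(single_step))  # True

# Assigning through .loc modifies df_f itself
df_f.loc[mask, "Y"] = 0.0
```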


**What if we want to combine two or more conditions?**

In plain Python we would use the operators `and`, `or` and `not`; on pandas objects we use the element-wise operators instead, with each condition wrapped in parentheses:

`&` (and)

`|` (or)

`~` (not)


In [47]:
df[(df["W"]>0.5) & (df["Y"]>0.7)]

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5


In [48]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


**Other ways to filter**

In [50]:
# isin()
# returns True where the element is in the collection, False otherwise

lista = [1,2,3,5]
df['nueva_columna'].isin(lista)

A    True
B    True
C    True
D    True
E    True
Name: nueva_columna, dtype: bool

In [52]:
# between()
# returns True where the element lies between the two limits (both included by default), False otherwise

df["Z"].between(0,0.5)


A    False
B     True
C    False
D    False
E    False
Name: Z, dtype: bool
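
Both methods return boolean Series, so they combine with `~` and can be used as masks. The sketch below also shows the `inclusive` argument of `between`. The string form used here assumes pandas 1.3 or later, and the Series `z` is illustrative.

```python
import pandas as pd

z = pd.Series([0.0, 0.25, 0.5, 0.9], index=list("ABCD"))

closed = z.between(0, 0.5)                      # both limits included (default)
open_ = z.between(0, 0.5, inclusive="neither")  # limits excluded; assumes pandas >= 1.3
not_in = ~z.isin([0.25, 0.9])                   # negate isin with ~

print(closed.tolist())
print(open_.tolist())
print(not_in.tolist())
```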

## More about indexes

Let's see how to reset and change the index.

In [53]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [54]:
# Reset to the default index (0, 1, ..., n-1); the old labels become a column
df.reset_index()

Unnamed: 0,index,W,X,Y,Z,nueva_columna
0,A,999,0.404476,0.750935,0.763095,5
1,B,999,0.777318,0.981293,0.252736,5
2,C,999,0.750078,0.354417,0.973968,5
3,D,999,0.626905,0.558187,0.833933,5
4,E,999,0.427971,0.213763,0.845301,5


In [55]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.404476,0.750935,0.763095,5
B,999,0.777318,0.981293,0.252736,5
C,999,0.750078,0.354417,0.973968,5
D,999,0.626905,0.558187,0.833933,5
E,999,0.427971,0.213763,0.845301,5


In [56]:
# Create the labels for a new index
newind = '10 20 30 40 50'.split()

In [57]:
df['nuevo_indice'] = newind

In [58]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,nuevo_indice
A,999,0.404476,0.750935,0.763095,5,10
B,999,0.777318,0.981293,0.252736,5,20
C,999,0.750078,0.354417,0.973968,5,30
D,999,0.626905,0.558187,0.833933,5,40
E,999,0.427971,0.213763,0.845301,5,50


In [59]:
df.set_index("nuevo_indice")

nuevo_indice,W,X,Y,Z,nueva_columna
10,999,0.404476,0.750935,0.763095,5
20,999,0.777318,0.981293,0.252736,5
30,999,0.750078,0.354417,0.973968,5
40,999,0.626905,0.558187,0.833933,5
50,999,0.427971,0.213763,0.845301,5


In [60]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,nuevo_indice
A,999,0.404476,0.750935,0.763095,5,10
B,999,0.777318,0.981293,0.252736,5,20
C,999,0.750078,0.354417,0.973968,5,30
D,999,0.626905,0.558187,0.833933,5,40
E,999,0.427971,0.213763,0.845301,5,50


In [61]:
df.set_index("nuevo_indice",inplace = True)

In [62]:
df

nuevo_indice,W,X,Y,Z,nueva_columna
10,999,0.404476,0.750935,0.763095,5
20,999,0.777318,0.981293,0.252736,5
30,999,0.750078,0.354417,0.973968,5
40,999,0.626905,0.558187,0.833933,5
50,999,0.427971,0.213763,0.845301,5
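
One detail that is easy to miss: `'10 20 30 40 50'.split()` produces strings, so the new index labels are `"10"`, `"20"`, and so on, not integers. The sketch below shows the consequence for `.loc` and one possible way to convert the labels. It uses a reduced two-row table, `df_i`, as an assumption.

```python
import pandas as pd

df_i = pd.DataFrame({"W": [999, 999], "nuevo_indice": "10 20".split()})
df_i = df_i.set_index("nuevo_indice")

row = df_i.loc["10"]              # works: the label is the string "10"
has_int_label = 10 in df_i.index  # False: there is no integer label 10

# Convert the labels to integers if numeric lookups are wanted
df_i.index = df_i.index.astype(int)
has_int_label_after = 10 in df_i.index  # True
```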
