# DataFrames

DataFrames are the main object you work with in pandas.

We can think of a DataFrame as a set of Series placed side by side that share the same index, so that together they form a table.
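
As a minimal sketch of that idea, two Series that share an index can be combined into a DataFrame, and each column comes back out as a Series. The names `edad`, `altura` and `tabla` and their values are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series sharing the same index labels
edad = pd.Series([30, 34, 37], index=["a", "b", "c"])
altura = pd.Series([1.70, 1.65, 1.80], index=["a", "b", "c"])

# Each dict key becomes a column; rows are aligned on the shared index
tabla = pd.DataFrame({"Edad": edad, "Altura": altura})

print(tabla.shape)           # (3, 2)
print(list(tabla.index))     # ['a', 'b', 'c']
print(type(tabla["Edad"]))   # each column is a Series
```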


In [4]:
import pandas as pd
import numpy as np

## There are several ways to build a DataFrame

We will look at a few of them here; the following article covers more ways to create one.

[Different ways to create a pandas DataFrame](https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/)

In [9]:
# from a list of lists (one inner list per row)

data = [["julio",30,'M'],["julia",34,'F'],["juli",37,'M'],["julian",36,'M']]
columns = ["Nombre","Edad","Sexo"]
indices = 'a b c d'.split()  # flat list of row labels: ['a', 'b', 'c', 'd']

In [10]:
pd.DataFrame(data,columns=columns)

Unnamed: 0,Nombre,Edad,Sexo
0,julio,30,M
1,julia,34,F
2,juli,37,M
3,julian,36,M


In [11]:
pd.DataFrame(data,indices,columns=columns)

Unnamed: 0,Nombre,Edad,Sexo
a,julio,30,M
b,julia,34,F
c,juli,37,M
d,julian,36,M


In [12]:
# from a dictionary (keys become column names)

data = {'Nombre':['Tom', 'nick', 'krish', 'jack'], 'Edad':[20, 21, 19, 18]} 

pd.DataFrame(data)
  

Unnamed: 0,Nombre,Edad
0,Tom,20
1,nick,21
2,krish,19
3,jack,18


In [13]:
pd.DataFrame(data,indices)

Unnamed: 0,Nombre,Edad
a,Tom,20
b,nick,21
c,krish,19
d,jack,18
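
Two other constructors that come up often are a list of dicts (one dict per row) and `DataFrame.from_dict` with `orient="index"`. The sketch below uses made-up data, and the variable names are only illustrative.

```python
import pandas as pd

# A list of dicts: one dict per row, keys become column names
filas = [{"Nombre": "Tom", "Edad": 20}, {"Nombre": "nick", "Edad": 21}]
df_filas = pd.DataFrame(filas)

# A dict of dicts read row-wise: outer keys become the index
por_fila = {"a": {"Nombre": "Tom", "Edad": 20},
            "b": {"Nombre": "nick", "Edad": 21}}
df_por_fila = pd.DataFrame.from_dict(por_fila, orient="index")

print(df_filas)
print(df_por_fila)
```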


In [14]:
# use NumPy to build a DataFrame of random numbers
random_array = np.random.rand(5,4)
random_array

array([[0.5415878 , 0.55266684, 0.80625356, 0.14612342],
       [0.46192211, 0.52298459, 0.22794569, 0.72042511],
       [0.64551921, 0.85480629, 0.51046963, 0.32362971],
       [0.93368794, 0.90568575, 0.54867702, 0.22696869],
       [0.7028636 , 0.07733428, 0.81813327, 0.95023462]])

In [15]:
df = pd.DataFrame(random_array,index='A B C D E'.split(),columns='W X Y Z'.split())

In [16]:
df

Unnamed: 0,W,X,Y,Z
A,0.541588,0.552667,0.806254,0.146123
B,0.461922,0.522985,0.227946,0.720425
C,0.645519,0.854806,0.51047,0.32363
D,0.933688,0.905686,0.548677,0.226969
E,0.702864,0.077334,0.818133,0.950235


## Indexing

Let's look at several ways to select data from a DataFrame.

In [18]:
# Select every value in one column
df['W']

A    0.541588
B    0.461922
C    0.645519
D    0.933688
E    0.702864
Name: W, dtype: float64

In [188]:
# To select several columns, pass a list of the column names
df[['W','Z']]

Unnamed: 0,W,Z
A,0.541588,0.146123
B,0.461922,0.720425
C,0.645519,0.32363
D,0.933688,0.226969
E,0.702864,0.950235


In [19]:
# A column can also be accessed as an attribute (NOT RECOMMENDED: it fails when the name clashes with a DataFrame method or is not a valid identifier)
df.W

A    0.541588
B    0.461922
C    0.645519
D    0.933688
E    0.702864
Name: W, dtype: float64
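
To illustrate why bracket access is the safer habit, here is a small sketch with a hypothetical frame `demo`. One column shares its name with the `DataFrame.count` method and the other contains a space.

```python
import pandas as pd

# Hypothetical frame with two awkward column names
demo = pd.DataFrame({"count": [1, 2, 3], "total ventas": [10, 20, 30]})

# Bracket access always returns the column
print(type(demo["count"]))          # a Series
print(demo["total ventas"].sum())   # 60

# Attribute access resolves to the DataFrame.count method, not the column
print(callable(demo.count))         # True
```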

**The columns of a DataFrame are Series**

In [20]:
type(df['W'])

pandas.core.series.Series

**Changing the values of a column**

In [21]:
# as in NumPy, the scalar is broadcast to every row

df["W"] = 999

In [22]:
df

Unnamed: 0,W,X,Y,Z
A,999,0.552667,0.806254,0.146123
B,999,0.522985,0.227946,0.720425
C,999,0.854806,0.51047,0.32363
D,999,0.905686,0.548677,0.226969
E,999,0.077334,0.818133,0.950235


**Creating new columns**

In [50]:
df["nueva_columna"] = 5

In [51]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna
A,999,0.552667,0.806254,0.146123,5
B,999,0.522985,0.227946,0.720425,5
C,999,0.854806,0.51047,0.32363,5
D,999,0.905686,0.548677,0.226969,5
E,999,0.077334,0.818133,0.950235,5


In [29]:
# create a new column from an operation between columns
df['new'] = df['W'] + df['Y']

In [30]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,999.806254
B,999,0.522985,0.227946,0.720425,valor_nuevo,999.227946
C,999,0.854806,0.51047,0.32363,valor_nuevo,999.51047
D,999,0.905686,0.548677,0.226969,valor_nuevo,999.548677
E,999,0.077334,0.818133,0.950235,valor_nuevo,999.818133


In [31]:
# an existing column can be reassigned
df['new'] = df['W'] * df['Y']

In [32]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,805.447311
B,999,0.522985,0.227946,0.720425,valor_nuevo,227.717749
C,999,0.854806,0.51047,0.32363,valor_nuevo,509.959165
D,999,0.905686,0.548677,0.226969,valor_nuevo,548.128346
E,999,0.077334,0.818133,0.950235,valor_nuevo,817.315134
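
Assigning with `df['name'] = ...` changes the DataFrame in place. When a modified copy is preferred, `DataFrame.assign` is an alternative. This is a sketch on a small made-up frame called `base`.

```python
import pandas as pd

base = pd.DataFrame({"W": [1.0, 2.0], "Y": [10.0, 20.0]}, index=["A", "B"])

# assign returns a new DataFrame; base is left untouched
con_nueva = base.assign(new=base["W"] * base["Y"])

print(list(base.columns))          # ['W', 'Y']
print(list(con_nueva.columns))     # ['W', 'Y', 'new']
print(con_nueva["new"].tolist())   # [10.0, 40.0]
```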


**Dropping columns**

In [193]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [194]:
# drop does not work in place unless we ask for it, so df is unchanged
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [195]:
df.drop('new',axis=1,inplace=True)

In [196]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Rows can be dropped the same way:

In [33]:
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,805.447311
B,999,0.522985,0.227946,0.720425,valor_nuevo,227.717749
C,999,0.854806,0.51047,0.32363,valor_nuevo,509.959165
D,999,0.905686,0.548677,0.226969,valor_nuevo,548.128346


In [34]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,805.447311
B,999,0.522985,0.227946,0.720425,valor_nuevo,227.717749
C,999,0.854806,0.51047,0.32363,valor_nuevo,509.959165
D,999,0.905686,0.548677,0.226969,valor_nuevo,548.128346
E,999,0.077334,0.818133,0.950235,valor_nuevo,817.315134
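
`drop` also accepts the `columns=` and `index=` keywords, which make the intent explicit without `axis`. The sketch below uses a small made-up frame and keeps the result in a new variable instead of using `inplace=True`.

```python
import pandas as pd

tabla = pd.DataFrame({"W": [1, 2, 3], "X": [4, 5, 6], "new": [7, 8, 9]},
                     index=["A", "B", "C"])

sin_columna = tabla.drop(columns="new")   # same as drop('new', axis=1)
sin_fila = tabla.drop(index="C")          # same as drop('C', axis=0)

print(list(sin_columna.columns))   # ['W', 'X']
print(list(sin_fila.index))        # ['A', 'B']
print(tabla.shape)                 # (3, 3), the original is unchanged
```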


**Selecting rows from the DataFrame**

There are two indexers for this, `.loc` and `.iloc`.

`.loc[]` selects rows (and columns) by label.

`.iloc[]` selects rows (and columns) by integer position.

In [36]:
# with loc
df.loc['A']

W                        999
X                   0.552667
Y                   0.806254
Z                   0.146123
nueva_columna    valor_nuevo
new                  805.447
Name: A, dtype: object

In [37]:
# with iloc
df.iloc[2]

W                        999
X                   0.854806
Y                    0.51047
Z                    0.32363
nueva_columna    valor_nuevo
new                  509.959
Name: C, dtype: object

In [38]:
# to select more than one row, pass a list, as with NumPy fancy indexing

df.loc[["A","D"]]

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,805.447311
D,999,0.905686,0.548677,0.226969,valor_nuevo,548.128346


In [39]:
df.iloc[[0,3]]

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,805.447311
D,999,0.905686,0.548677,0.226969,valor_nuevo,548.128346


**As in NumPy, we can select rows and columns at the same time**

In [40]:
df.loc['B','Y']

0.2279456942479382

In [41]:
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,999,0.806254
B,999,0.227946
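
The difference between labels and positions is easiest to see when the index itself holds integers. In this sketch (made-up data), a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as in plain Python.

```python
import pandas as pd

tabla = pd.DataFrame({"W": [1, 2, 3, 4], "Y": [5, 6, 7, 8]},
                     index=[10, 20, 30, 40])

por_etiqueta = tabla.loc[10:30]   # label slice, end label included
por_posicion = tabla.iloc[0:2]    # position slice, end position excluded

print(list(por_etiqueta.index))   # [10, 20, 30]
print(list(por_posicion.index))   # [10, 20]
print(tabla.loc[20, "Y"])         # 6, row label 20
print(tabla.iloc[1, 1])           # 6, second row and second column
```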


### Selection by condition (filtering)

As in NumPy, we can select by condition by building a boolean mask.

The mask can be built over the whole DataFrame or over a single column.


In [48]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,valor_nuevo,805.447311
B,999,0.522985,0.227946,0.720425,valor_nuevo,227.717749
C,999,0.854806,0.51047,0.32363,valor_nuevo,509.959165
D,999,0.905686,0.548677,0.226969,valor_nuevo,548.128346
E,999,0.077334,0.818133,0.950235,valor_nuevo,817.315134


In [52]:
df>0.5

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,True,True,True,False,True,True
B,True,True,False,True,True,True
C,True,True,True,False,True,True
D,True,True,True,False,True,True
E,True,False,True,True,True,True


In [53]:
df[df>0.5]

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,,5,805.447311
B,999,0.522985,,0.720425,5,227.717749
C,999,0.854806,0.51047,,5,509.959165
D,999,0.905686,0.548677,,5,548.128346
E,999,,0.818133,0.950235,5,817.315134
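
Indexing with a DataFrame-wide mask keeps the shape and turns every cell that fails the condition into `NaN`. As a small sketch on made-up data, `DataFrame.where` applies the same mask but lets us choose the replacement value.

```python
import pandas as pd

nums = pd.DataFrame({"X": [0.2, 0.9], "Y": [0.7, 0.1]}, index=["A", "B"])

filtrado = nums[nums > 0.5]            # failing cells become NaN
relleno = nums.where(nums > 0.5, 0.0)  # failing cells become 0.0 instead

print(filtrado.isna().sum().sum())     # 2
print(relleno["X"].tolist())           # [0.0, 0.9]
print(relleno["Y"].tolist())           # [0.7, 0.0]
```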


In [54]:
df['Z']>0.5

A    False
B     True
C    False
D    False
E     True
Name: Z, dtype: bool

In [47]:
df[df['Z']>0.5]

Unnamed: 0,W,X,Y,Z,nueva_columna,new
B,999,0.522985,0.227946,0.720425,valor_nuevo,227.717749
E,999,0.077334,0.818133,0.950235,valor_nuevo,817.315134


In [55]:
df[df['Z']>0.5]['Y']

B    0.227946
E    0.818133
Name: Y, dtype: float64

In [56]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,0.806254,0.552667
B,0.227946,0.522985
C,0.51047,0.854806
D,0.548677,0.905686
E,0.818133,0.077334
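
The chained form `df[mask][columns]` works for reading. For assignment, `.loc[mask, columns]` is usually preferred because it selects rows and columns in one step and writes to the original frame. A minimal sketch with made-up data:

```python
import pandas as pd

tabla = pd.DataFrame({"Y": [0.8, 0.2, 0.5], "Z": [0.1, 0.7, 0.9]},
                     index=["A", "B", "C"])

mascara = tabla["Z"] > 0.5
seleccion = tabla.loc[mascara, "Y"]   # rows by mask, column by label
print(seleccion.tolist())             # [0.2, 0.5]

# Assigning through .loc modifies the original frame
tabla.loc[mascara, "Y"] = 0.0
print(tabla["Y"].tolist())            # [0.8, 0.0, 0.0]
```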


**What if we want to combine two or more conditions?**

In plain Python we would use the keywords `and`, `or` and `not`, but pandas needs element-wise operators instead, and each condition must be wrapped in parentheses:

& (and)

| (or)

~ (not)


In [60]:
df[(df['W']>0.5) & (df['Y'] > 0.7)]

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,5,805.447311
E,999,0.077334,0.818133,0.950235,5,817.315134
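
The sketch below, on a small made-up frame, shows the three operators side by side. It also shows that the keyword `and` raises a `ValueError`, because the truth value of a whole Series is ambiguous.

```python
import pandas as pd

tabla = pd.DataFrame({"W": [1, 5, 9, 1], "Y": [0.9, 0.1, 0.8, 0.2]},
                     index=["A", "B", "C", "D"])

# The keyword form asks for the truth value of a whole Series
try:
    tabla[(tabla["W"] > 2) and (tabla["Y"] > 0.5)]
    fallo = False
except ValueError:
    fallo = True

ambas = tabla[(tabla["W"] > 2) & (tabla["Y"] > 0.5)]       # both conditions
alguna = tabla[(tabla["W"] > 2) | (tabla["Y"] > 0.5)]      # at least one
ninguna = tabla[~((tabla["W"] > 2) | (tabla["Y"] > 0.5))]  # neither

print(fallo)                 # True
print(list(ambas.index))     # ['C']
print(list(alguna.index))    # ['A', 'B', 'C']
print(list(ninguna.index))   # ['D']
```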


**Other ways to filter**

In [77]:
# isin()
# returns True where the element is in the given collection, False otherwise

lista_numeros = [1,2,3]
df['Z'].isin(lista_numeros)

nuevo_indice
10    False
20    False
30    False
40    False
50    False
Name: Z, dtype: bool

In [74]:
# between()
# returns True where the element lies between the two limits (both included by default), False otherwise

df['Z'].between(0,1)

nuevo_indice
10    True
20    True
30    True
40    True
50    True
Name: Z, dtype: bool
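
Both methods return a boolean Series, so they can be used as masks like any other condition. This sketch uses a hypothetical `personas` frame with values that actually match, to make the effect visible.

```python
import pandas as pd

personas = pd.DataFrame({"Nombre": ["julio", "julia", "juli", "julian"],
                         "Edad": [30, 34, 37, 36]})

en_lista = personas["Nombre"].isin(["julia", "julian"])
en_rango = personas["Edad"].between(34, 36)   # both limits included by default

print(personas[en_lista]["Nombre"].tolist())  # ['julia', 'julian']
print(personas[en_rango]["Edad"].tolist())    # [34, 36]
```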

## More on indexes

Let's see how to reset and change the index.

In [61]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,5,805.447311
B,999,0.522985,0.227946,0.720425,5,227.717749
C,999,0.854806,0.51047,0.32363,5,509.959165
D,999,0.905686,0.548677,0.226969,5,548.128346
E,999,0.077334,0.818133,0.950235,5,817.315134


In [62]:
# Reset to the default index (0, 1, ..., n-1); the old index becomes a column
df.reset_index()

Unnamed: 0,index,W,X,Y,Z,nueva_columna,new
0,A,999,0.552667,0.806254,0.146123,5,805.447311
1,B,999,0.522985,0.227946,0.720425,5,227.717749
2,C,999,0.854806,0.51047,0.32363,5,509.959165
3,D,999,0.905686,0.548677,0.226969,5,548.128346
4,E,999,0.077334,0.818133,0.950235,5,817.315134


In [63]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new
A,999,0.552667,0.806254,0.146123,5,805.447311
B,999,0.522985,0.227946,0.720425,5,227.717749
C,999,0.854806,0.51047,0.32363,5,509.959165
D,999,0.905686,0.548677,0.226969,5,548.128346
E,999,0.077334,0.818133,0.950235,5,817.315134


In [64]:
newind = '10 20 30 40 50'.split()

In [65]:
df['nuevo_indice'] = newind

In [66]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new,nuevo_indice
A,999,0.552667,0.806254,0.146123,5,805.447311,10
B,999,0.522985,0.227946,0.720425,5,227.717749,20
C,999,0.854806,0.51047,0.32363,5,509.959165,30
D,999,0.905686,0.548677,0.226969,5,548.128346,40
E,999,0.077334,0.818133,0.950235,5,817.315134,50


In [67]:
df.set_index('nuevo_indice')

nuevo_indice,W,X,Y,Z,nueva_columna,new
10,999,0.552667,0.806254,0.146123,5,805.447311
20,999,0.522985,0.227946,0.720425,5,227.717749
30,999,0.854806,0.51047,0.32363,5,509.959165
40,999,0.905686,0.548677,0.226969,5,548.128346
50,999,0.077334,0.818133,0.950235,5,817.315134


In [68]:
df

Unnamed: 0,W,X,Y,Z,nueva_columna,new,nuevo_indice
A,999,0.552667,0.806254,0.146123,5,805.447311,10
B,999,0.522985,0.227946,0.720425,5,227.717749,20
C,999,0.854806,0.51047,0.32363,5,509.959165,30
D,999,0.905686,0.548677,0.226969,5,548.128346,40
E,999,0.077334,0.818133,0.950235,5,817.315134,50


In [69]:
df.set_index('nuevo_indice',inplace=True)

In [70]:
df

nuevo_indice,W,X,Y,Z,nueva_columna,new
10,999,0.552667,0.806254,0.146123,5,805.447311
20,999,0.522985,0.227946,0.720425,5,227.717749
30,999,0.854806,0.51047,0.32363,5,509.959165
40,999,0.905686,0.548677,0.226969,5,548.128346
50,999,0.077334,0.818133,0.950235,5,817.315134
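
As a compact recap, here is a sketch on a made-up frame of how `set_index` and `reset_index` move data between the columns and the index. It also shows `drop=True`, which discards the old index instead of keeping it as a column.

```python
import pandas as pd

tabla = pd.DataFrame({"W": [1, 2], "clave": ["x", "y"]}, index=["A", "B"])

con_clave = tabla.set_index("clave")        # 'clave' moves from columns to index
de_vuelta = con_clave.reset_index()         # index goes back to being a column
sin_indice = tabla.reset_index(drop=True)   # old index discarded, not kept

print(list(con_clave.index))     # ['x', 'y']
print(list(de_vuelta.columns))   # ['clave', 'W']
print(list(sin_indice.index))    # [0, 1]
print(list(tabla.index))         # ['A', 'B'], the original is unchanged
```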
