<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>


# Pandas - `DataFrame`s


The `DataFrame` is probably the most important data structure in pandas. It is a tabular structure tightly integrated with `Series`.




![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In [2]:
import numpy as np
import pandas as pd

We continue with the analysis of the G7 countries.

DataFrames are usually created by loading data from a database or a CSV file.
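
As a minimal sketch of the CSV route, the snippet below parses CSV text held in memory. The `Pais` column name, the `csv_text` variable and the three sample rows are assumptions made for illustration; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would come from a file or URL.
csv_text = """Pais,Poblacion,PBI
Canada,35.467,1785387
Francia,63.951,2833687
Alemania,80.94,3874437
"""

# read_csv accepts any file-like object; index_col selects the row labels.
g7_sample = pd.read_csv(io.StringIO(csv_text), index_col='Pais')
print(g7_sample)
```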

In [22]:
df = pd.DataFrame({
    'Poblacion': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'PBI': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Superficie': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continente': [
        'America',
        'Europa',
        'Europa',
        'Europa',
        'Asia',
        'Europe',
        'America',
    ]
}, columns=['Poblacion', 'PBI', 'Superficie', 'HDI', 'Continente'])

In [4]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
0,35.467,1785387,9984670,0.913,America
1,63.951,2833687,640679,0.888,Europa
2,80.94,3874437,357114,0.916,Europa
3,60.665,2167744,301336,0.873,Europa
4,127.061,4602367,377930,0.891,Asia
5,64.511,2950039,242495,0.907,Europe
6,318.523,17348075,9525067,0.915,America


DataFrames also have indexes. As the table above shows, pandas automatically assigned an auto-incrementing numeric index to each row of our DataFrame. Since we know that each row represents a country, we simply reassign the index:

In [25]:
df.index = [
    'Canada',
    'Francia',
    'Alemania',
    'Italia',
    'Japon',
    'Reino Unido',
    'Estados Unidos',
]
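
Assigning to `df.index` is one option. As a hedged sketch, the labels can also be supplied when the DataFrame is built, or taken from an existing column with `set_index`. The `small`, `with_col` and `Pais` names below are illustrative and not part of the notebook.

```python
import pandas as pd

# Small stand-in for the G7 table (first three countries only).
small = pd.DataFrame(
    {'Poblacion': [35.467, 63.951, 80.94]},
    index=['Canada', 'Francia', 'Alemania'],  # labels supplied up front
)

# Alternatively, promote an existing column to the index.
with_col = pd.DataFrame({
    'Pais': ['Canada', 'Francia', 'Alemania'],  # hypothetical label column
    'Poblacion': [35.467, 63.951, 80.94],
})
indexed = with_col.set_index('Pais')  # returns a new DataFrame

print(indexed)
```

`set_index` leaves `with_col` unchanged, so the result has to be kept in a variable.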

In [6]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [7]:
df.columns

Index(['Poblacion', 'PBI', 'Superficie', 'HDI', 'Continente'], dtype='object')

In [8]:
df.index

Index(['Canada', 'Francia', 'Alemania', 'Italia', 'Japon', 'Reino Unido',
       'Estados Unidos'],
      dtype='object')

In [9]:
# information about the structure of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to Estados Unidos
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Poblacion   7 non-null      float64
 1   PBI         7 non-null      int64  
 2   Superficie  7 non-null      int64  
 3   HDI         7 non-null      float64
 4   Continente  7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 636.0+ bytes


In [10]:
df.shape

(7, 5)

In [11]:
# total number of elements (rows x columns)
df.size

35

In [12]:
df.describe()

Unnamed: 0,Poblacion,PBI,Superficie,HDI
count,7.0,7.0,7.0,7.0
mean,107.302571,5080248.0,3061327.0,0.900429
std,97.24997,5494020.0,4576187.0,0.016592
min,35.467,1785387.0,242495.0,0.873
25%,62.308,2500716.0,329225.0,0.8895
50%,64.511,2950039.0,377930.0,0.907
75%,104.0005,4238402.0,5082873.0,0.914
max,318.523,17348080.0,9984670.0,0.916


In [13]:
df.dtypes

Unnamed: 0,0
Poblacion,float64
PBI,int64
Superficie,int64
HDI,float64
Continente,object


In [14]:
df.dtypes.value_counts()

Unnamed: 0,count
float64,2
int64,2
object,1
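
The dtypes matter because methods such as `describe()` summarize only the numeric columns by default. The sketch below, on a hypothetical two-row `sample` frame, uses `select_dtypes` to split the columns by kind.

```python
import pandas as pd

# Hypothetical two-row sample with the same kinds of columns as df.
sample = pd.DataFrame({
    'Poblacion': [35.467, 63.951],
    'PBI': [1785387, 2833687],
    'Continente': ['America', 'Europa'],
})

numeric = sample.select_dtypes(include='number')  # float and integer columns
other = sample.select_dtypes(exclude='number')    # everything else

print(list(numeric.columns))
print(list(other.columns))
```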


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Indexing, Selection and Slicing

We can select an individual column by indexing with its name. The column is returned as a `Series`.


In [26]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [17]:
df['Poblacion']

The name of the Series is the name of the column.

In [18]:
df['Poblacion'].to_frame()

Unnamed: 0,Poblacion
Canada,35.467
Francia,63.951
Alemania,80.94
Italia,60.665
Japon,127.061
Reino Unido,64.511
Estados Unidos,318.523


It is also possible to select multiple columns.

In [23]:
df[['Poblacion', 'PBI']]

Unnamed: 0,Poblacion,PBI
0,35.467,1785387
1,63.951,2833687
2,80.94,3874437
3,60.665,2167744
4,127.061,4602367
5,64.511,2950039
6,318.523,17348075


In this case, the result is another DataFrame.
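
The difference between the two forms can be checked directly. This is a small sketch on a hypothetical two-row `sample` frame: a single label returns a `Series`, while a list of labels returns a `DataFrame`, even when the list has one element.

```python
import pandas as pd

# Hypothetical two-row sample using the notebook's column names.
sample = pd.DataFrame(
    {'Poblacion': [35.467, 63.951], 'PBI': [1785387, 2833687]},
    index=['Canada', 'Francia'],
)

as_series = sample['Poblacion']   # single label -> Series
as_frame = sample[['Poblacion']]  # list of labels -> DataFrame

print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, as_frame.shape)
```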

Row selection works best with `loc` and `iloc`.

In [27]:
df.loc['Italia']

Unnamed: 0,Italia
Poblacion,60.665
PBI,2167744
Superficie,301336
HDI,0.873
Continente,Europa


In [28]:
df.loc['Francia':'Italia']

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa


You can also choose which column or columns to return.

In [29]:
df.loc['Francia': 'Italia', 'Poblacion']

Unnamed: 0,Poblacion
Francia,63.951
Alemania,80.94
Italia,60.665


In [30]:
df.loc['Francia':'Italia', ['Poblacion', "PBI"]]

Unnamed: 0,Poblacion,PBI
Francia,63.951,2833687
Alemania,80.94,3874437
Italia,60.665,2167744


Remember: `iloc` works with integer positions, not labels.

In [31]:
df.iloc[0]

Unnamed: 0,Canada
Poblacion,35.467
PBI,1785387
Superficie,9984670
HDI,0.913
Continente,America


> **RECOMMENDATION: Always use `loc` and `iloc`, especially on DataFrames with numeric indexes.**
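
To see why this matters, here is a short sketch with integer labels that deliberately do not match the row positions. The `letters` frame and its values are made up for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0..3.
letters = pd.DataFrame({'letter': ['a', 'b', 'c', 'd']}, index=[10, 20, 30, 40])

by_label = letters.loc[10:30]     # label-based: both endpoints are included
by_position = letters.iloc[0:2]   # position-based: the end is excluded

print(list(by_label['letter']))     # ['a', 'b', 'c']
print(list(by_position['letter']))  # ['a', 'b']
```

Being explicit about `loc` or `iloc` removes any doubt about whether a number is meant as a label or as a position.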

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Conditional Selection (boolean arrays)

It works the same way as with Series.

In [32]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [33]:
df['Poblacion'] > 70

Unnamed: 0,Poblacion
Canada,False
Francia,False
Alemania,True
Italia,False
Japon,True
Reino Unido,False
Estados Unidos,True


In [34]:
df.loc[df['Poblacion'] > 70]

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Alemania,80.94,3874437,357114,0.916,Europa
Japon,127.061,4602367,377930,0.891,Asia
Estados Unidos,318.523,17348075,9525067,0.915,America
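
Several conditions can be combined into one mask. The sketch below uses a hypothetical four-row subset of the table. Note that pandas uses `&`, `|` and `~` rather than `and`, `or` and `not`, and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical four-row subset of the G7 table.
sample = pd.DataFrame(
    {'Poblacion': [35.467, 80.94, 127.061, 318.523],
     'HDI': [0.913, 0.916, 0.891, 0.915]},
    index=['Canada', 'Alemania', 'Japon', 'Estados Unidos'],
)

# Rows that satisfy both conditions.
big_and_high = sample.loc[(sample['Poblacion'] > 70) & (sample['HDI'] > 0.9)]

# Rows that satisfy at least one condition.
small_or_low = sample.loc[(sample['Poblacion'] < 70) | (sample['HDI'] < 0.9)]

print(list(big_and_high.index))
print(list(small_or_low.index))
```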


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Dropping Rows and Columns


In [35]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [36]:
# rows
new_df = df.drop('Canada')

In [37]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [38]:
new_df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [39]:
df.drop(['Canada', 'Japon'])

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [40]:
df.drop(columns=['Poblacion', 'PBI'])

Unnamed: 0,Superficie,HDI,Continente
Canada,9984670,0.913,America
Francia,640679,0.888,Europa
Alemania,357114,0.916,Europa
Italia,301336,0.873,Europa
Japon,377930,0.891,Asia
Reino Unido,242495,0.907,Europe
Estados Unidos,9525067,0.915,America


In [41]:
# drop rows
df.drop(['Italia', 'Canada'], axis=0)

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


> `axis=0`: act on rows<br>
> `axis=1`: act on columns

In [42]:
df.drop(['Poblacion', 'PBI'], axis=1)

Unnamed: 0,Superficie,HDI,Continente
Canada,9984670,0.913,America
Francia,640679,0.888,Europa
Alemania,357114,0.916,Europa
Italia,301336,0.873,Europa
Japon,377930,0.891,Asia
Reino Unido,242495,0.907,Europe
Estados Unidos,9525067,0.915,America


In [43]:
df.drop(['Poblacion', 'HDI'], axis='columns')

Unnamed: 0,PBI,Superficie,Continente
Canada,1785387,9984670,America
Francia,2833687,640679,Europa
Alemania,3874437,357114,Europa
Italia,2167744,301336,Europa
Japon,4602367,377930,Asia
Reino Unido,2950039,242495,Europe
Estados Unidos,17348075,9525067,America


In [44]:
df.drop(['Canada', 'Alemania'], axis='rows')

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Francia,63.951,2833687,640679,0.888,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


The methods above return a new DataFrame. To modify the DataFrame itself, pass the `inplace=True` argument.
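
The two behaviours can be compared side by side. This is a minimal sketch on a hypothetical two-row `sample` frame; `trimmed` and `result` are illustrative names.

```python
import pandas as pd

# Hypothetical two-row sample using the notebook's column names.
sample = pd.DataFrame(
    {'Poblacion': [35.467, 63.951], 'PBI': [1785387, 2833687]},
    index=['Canada', 'Francia'],
)

trimmed = sample.drop(columns='PBI')          # new object; sample is untouched
result = sample.drop('Canada', inplace=True)  # mutates sample and returns None

print(list(trimmed.columns))
print(result)
print(list(sample.index))
```

Because the in-place call returns `None`, writing `sample = sample.drop(..., inplace=True)` would discard the data.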

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Modifying DataFrames

You can add columns or replace the values of existing columns.

### Adding a new column

In [45]:
idiomas = pd.Series(
    ['Frances', 'Aleman', 'Italiano'],
    index=['Francia', 'Alemania', 'Italia'],
    name='Idiomas'
)

In [46]:
idiomas

Unnamed: 0,Idiomas
Francia,Frances
Alemania,Aleman
Italia,Italiano


In [48]:
df['Idiomas']= idiomas

In [49]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente,Idiomas
Canada,35.467,1785387,9984670,0.913,America,
Francia,63.951,2833687,640679,0.888,Europa,Frances
Alemania,80.94,3874437,357114,0.916,Europa,Aleman
Italia,60.665,2167744,301336,0.873,Europa,Italiano
Japon,127.061,4602367,377930,0.891,Asia,
Reino Unido,64.511,2950039,242495,0.907,Europe,
Estados Unidos,318.523,17348075,9525067,0.915,America,
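
The empty cells above are missing values: assigning a Series as a column aligns it on the index, so rows without a matching label get `NaN`. The sketch below shows the same effect on a hypothetical three-row frame, with the Series deliberately in a different order.

```python
import pandas as pd

# Hypothetical three-row sample.
sample = pd.DataFrame(
    {'Poblacion': [35.467, 63.951, 80.94]},
    index=['Canada', 'Francia', 'Alemania'],
)

# The Series is in a different order and has no entry for Canada.
idiomas = pd.Series(['Aleman', 'Frances'], index=['Alemania', 'Francia'])

sample['Idiomas'] = idiomas  # matched by label, not by position
print(sample)
```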


### Replacing the values of a column

In [50]:
df['Idiomas'] = 'Ingles'

In [51]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente,Idiomas
Canada,35.467,1785387,9984670,0.913,America,Ingles
Francia,63.951,2833687,640679,0.888,Europa,Ingles
Alemania,80.94,3874437,357114,0.916,Europa,Ingles
Italia,60.665,2167744,301336,0.873,Europa,Ingles
Japon,127.061,4602367,377930,0.891,Asia,Ingles
Reino Unido,64.511,2950039,242495,0.907,Europe,Ingles
Estados Unidos,318.523,17348075,9525067,0.915,America,Ingles
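
Assigning a scalar overwrites every row of the column. To change only some rows, one option is to assign through `loc` with a row label or a boolean mask. The sketch below uses a hypothetical three-row frame, and the replacement strings are illustrative.

```python
import pandas as pd

# Hypothetical three-row sample.
sample = pd.DataFrame(
    {'Continente': ['America', 'Europa', 'Europa']},
    index=['Canada', 'Francia', 'Alemania'],
)
sample['Idiomas'] = 'Ingles'  # the scalar is broadcast to every row

# Overwrite a single cell, then every row that matches a condition.
sample.loc['Francia', 'Idiomas'] = 'Frances'
sample.loc[sample['Continente'] == 'America', 'Idiomas'] = 'Ingles y Frances'

print(sample)
```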


In [52]:
df.rename(index=str.upper)

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente,Idiomas
CANADA,35.467,1785387,9984670,0.913,America,Ingles
FRANCIA,63.951,2833687,640679,0.888,Europa,Ingles
ALEMANIA,80.94,3874437,357114,0.916,Europa,Ingles
ITALIA,60.665,2167744,301336,0.873,Europa,Ingles
JAPON,127.061,4602367,377930,0.891,Asia,Ingles
REINO UNIDO,64.511,2950039,242495,0.907,Europe,Ingles
ESTADOS UNIDOS,318.523,17348075,9525067,0.915,America,Ingles
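
`rename` also accepts dictionaries that map old labels to new ones, for columns as well as for the index, and it returns a new DataFrame. In this sketch the English names `Population`, `GDP` and `CAN` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical one-row sample.
sample = pd.DataFrame(
    {'Poblacion': [35.467], 'PBI': [1785387]},
    index=['Canada'],
)

renamed = sample.rename(
    columns={'Poblacion': 'Population', 'PBI': 'GDP'},  # old name -> new name
    index={'Canada': 'CAN'},
)

print(renamed)
print(list(sample.columns))  # the original keeps its labels
```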


### Dropping columns

In [53]:
df.drop(columns='Idiomas', inplace=True)

In [54]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


### Adding values

In [55]:
# DataFrame.append was removed in pandas 2.0, so this call now raises an error
df.append(pd.Series({
    'Poblacion': 3,
    'PBI': 5
}, name='China'))

AttributeError: 'DataFrame' object has no attribute 'append'

In [56]:
# pd.concat would stack a bare Series as a new column, so convert it
# to a one-row DataFrame first; columns missing from the Series are NaN
pd.concat([df, pd.Series({
    'Poblacion': 3,
    'PBI': 5
}, name='China').to_frame().T])


### Creating new columns from other columns

We can compute GDP per capita by dividing the GDP column (`PBI`) by the population column: `PBI / Poblacion`.



In [57]:
# display the DataFrame again
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente
Canada,35.467,1785387,9984670,0.913,America
Francia,63.951,2833687,640679,0.888,Europa
Alemania,80.94,3874437,357114,0.916,Europa
Italia,60.665,2167744,301336,0.873,Europa
Japon,127.061,4602367,377930,0.891,Asia
Reino Unido,64.511,2950039,242495,0.907,Europe
Estados Unidos,318.523,17348075,9525067,0.915,America


In [58]:
df[['Poblacion', 'PBI']]

Unnamed: 0,Poblacion,PBI
Canada,35.467,1785387
Francia,63.951,2833687
Alemania,80.94,3874437
Italia,60.665,2167744
Japon,127.061,4602367
Reino Unido,64.511,2950039
Estados Unidos,318.523,17348075


In [59]:
df['PBI'] / df['Poblacion']

Unnamed: 0,0
Canada,50339.385908
Francia,44310.284437
Alemania,47868.013343
Italia,35733.025633
Japon,36221.712406
Reino Unido,45729.239975
Estados Unidos,54464.12033


The result of the division is another Series, which we add as a column to the original DataFrame.

In [60]:
df['PBI Per Capita'] = df['PBI'] / df['Poblacion']

In [61]:
df

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente,PBI Per Capita
Canada,35.467,1785387,9984670,0.913,America,50339.385908
Francia,63.951,2833687,640679,0.888,Europa,44310.284437
Alemania,80.94,3874437,357114,0.916,Europa,47868.013343
Italia,60.665,2167744,301336,0.873,Europa,35733.025633
Japon,127.061,4602367,377930,0.891,Asia,36221.712406
Reino Unido,64.511,2950039,242495,0.907,Europe,45729.239975
Estados Unidos,318.523,17348075,9525067,0.915,America,54464.12033
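
When the original DataFrame should stay untouched, `assign` is one alternative: it returns a copy with the extra column. The sketch below uses a hypothetical two-row frame, and the column name `pbi_per_capita` is an assumption chosen to be a valid Python keyword argument.

```python
import pandas as pd

# Hypothetical two-row sample using the notebook's column names.
sample = pd.DataFrame(
    {'Poblacion': [35.467, 63.951], 'PBI': [1785387, 2833687]},
    index=['Canada', 'Francia'],
)

# assign returns a copy with the new column; sample keeps its two columns.
with_ratio = sample.assign(pbi_per_capita=lambda d: d['PBI'] / d['Poblacion'])

print(with_ratio)
```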


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Statistics

In [62]:
df.head()

Unnamed: 0,Poblacion,PBI,Superficie,HDI,Continente,PBI Per Capita
Canada,35.467,1785387,9984670,0.913,America,50339.385908
Francia,63.951,2833687,640679,0.888,Europa,44310.284437
Alemania,80.94,3874437,357114,0.916,Europa,47868.013343
Italia,60.665,2167744,301336,0.873,Europa,35733.025633
Japon,127.061,4602367,377930,0.891,Asia,36221.712406


In [None]:
df.describe()

In [None]:
poblacion = df['Poblacion']

In [None]:
poblacion.min(), poblacion.max()

In [None]:
poblacion.sum()

In [None]:
poblacion.mean()

In [None]:
poblacion.std()

In [None]:
poblacion.median()

In [None]:
poblacion.describe()
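
The individual calls above can also be gathered in a single step. As a sketch, the snippet rebuilds the population Series from the values used earlier in this notebook, then uses `agg` for several statistics at once and `quantile` for the quartiles. The `summary` and `quartiles` names are illustrative.

```python
import pandas as pd

# Population values copied from the G7 table above.
poblacion = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'Francia', 'Alemania', 'Italia', 'Japon',
           'Reino Unido', 'Estados Unidos'],
    name='Poblacion',
)

# Several aggregations at once; the result is a Series keyed by statistic.
summary = poblacion.agg(['min', 'max', 'sum', 'mean', 'median'])

# Quartiles, matching the 25%, 50% and 75% rows reported by describe().
quartiles = poblacion.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```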

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)