Table transformations with pandas
=======

* *60 min* | Last modified: July 4, 2019.

**Bibliography**.

> [pandas 0.18.1 documentation](http://pandas.pydata.org/pandas-docs/stable/index.html)  
[10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) 

**Data preparation**

Many of the examples from earlier sections can be applied directly to the columns of a DataFrame.
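As a minimal sketch of what "applied directly to the columns" can look like, the snippet below uses a small hypothetical table (`toy`, with made-up column names, not the iris data loaded next) and shows a NumPy function, column arithmetic, and `Series.apply` on individual columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table used only for illustration
toy = pd.DataFrame({"width": [1.0, 2.0, 3.0], "length": [4.0, 5.0, 6.0]})

# NumPy functions operate element-wise on a whole column
toy["log_length"] = np.log(toy["length"])

# Arithmetic between columns is vectorized row by row
toy["area"] = toy["width"] * toy["length"]

# Series.apply runs an arbitrary Python function on each value
toy["size"] = toy["area"].apply(lambda a: "big" if a > 9 else "small")

print(toy)
```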

In [1]:
## import the libraries
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import matplotlib as mpl
alt.renderers.enable('notebook');
%load_ext rpy2.ipython
%matplotlib inline
pd.set_option('display.notebook_repr_html', False)

In [2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/iris.csv",
    sep = ',',         # field separator
    thousands = None,  # thousands separator for numbers
    decimal = '.')     # decimal separator for numbers

df.head()

   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

## Melt & Pivot

In [3]:
##
## add a key that identifies each case (row)
##
df['id'] = list(range(len(df)))
df.head()

   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species  id
0           5.1          3.5           1.4          0.2  setosa   0
1           4.9          3.0           1.4          0.2  setosa   1
2           4.7          3.2           1.3          0.2  setosa   2
3           4.6          3.1           1.5          0.2  setosa   3
4           5.0          3.6           1.4          0.2  setosa   4

In [4]:
df_melt = pd.melt(
    df,                        # DataFrame
    id_vars = 'id',            # columns that are not stacked
    var_name = 'Variables',    # name of the column that holds the stacked column names
    value_name = 'Values')     # name of the column that holds the values
df_melt.head()

   id     Variables Values
0   0  Sepal_Length    5.1
1   1  Sepal_Length    4.9
2   2  Sepal_Length    4.7
3   3  Sepal_Length    4.6
4   4  Sepal_Length      5

In [5]:
df_melt.tail()

      id Variables     Values
745  145   Species  virginica
746  146   Species  virginica
747  147   Species  virginica
748  148   Species  virginica
749  149   Species  virginica
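The outputs above print `5` rather than `5.0` because `Species` was melted together with the numeric columns, so `Values` holds mixed types. One possible way to keep the measurements numeric, sketched here on a small hypothetical table (`toy`) rather than the full iris data, is to list `Species` in `id_vars` as well.

```python
import pandas as pd

# Hypothetical two-row table that mimics the layout of the iris data
toy = pd.DataFrame({
    "id": [0, 1],
    "Sepal_Length": [5.1, 4.9],
    "Petal_Length": [1.4, 1.3],
    "Species": ["setosa", "virginica"],
})

# Melting everything except 'id' mixes floats and strings in one column,
# so 'Values' cannot keep a numeric dtype
mixed = pd.melt(toy, id_vars="id", var_name="Variables", value_name="Values")

# Keeping 'Species' as an identifier leaves only numeric measurements to stack
numeric = pd.melt(
    toy,
    id_vars=["id", "Species"],
    var_name="Variables",
    value_name="Values",
)

print(mixed["Values"].dtype)
print(numeric["Values"].dtype)
```

With this layout each row of `numeric` still carries its species label, which tends to be convenient for grouping or plotting.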

In [6]:
df_melt.pivot(
    index = 'id',
    columns='Variables',
    values='Values'
).head(10)

Variables Petal_Length Petal_Width Sepal_Length Sepal_Width Species
id                                                                 
0                  1.4         0.2          5.1         3.5  setosa
1                  1.4         0.2          4.9           3  setosa
2                  1.3         0.2          4.7         3.2  setosa
3                  1.5         0.2          4.6         3.1  setosa
4                  1.4         0.2            5         3.6  setosa
5                  1.7         0.4          5.4         3.9  setosa
6                  1.4         0.3          4.6         3.4  setosa
7                  1.5         0.2            5         3.4  setosa
8                  1.4         0.2          4.4         2.9  setosa
9                  1.5         0.1          4.9         3.1  setosa
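The pivoted table recovers the wide layout, but its columns come back in alphabetical order and, since `Values` held mixed types, the measurements are not numeric yet. The following is a minimal sketch, on a hypothetical long table (`long_df`), of one way to restore a numeric column and flatten the result; `pd.to_numeric` is only one of several options.

```python
import pandas as pd

# Hypothetical long table with the same column names as df_melt
long_df = pd.DataFrame({
    "id": [0, 1, 0, 1],
    "Variables": ["Sepal_Length", "Sepal_Length", "Species", "Species"],
    "Values": [5.1, 4.9, "setosa", "setosa"],
})

wide = long_df.pivot(index="id", columns="Variables", values="Values")

# Convert the measurement column back to a numeric dtype
wide["Sepal_Length"] = pd.to_numeric(wide["Sepal_Length"])

# Turn 'id' back into a regular column and drop the 'Variables' axis label
wide = wide.reset_index()
wide.columns.name = None

print(wide.dtypes)
```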

**Note.** Similar operations can be performed in R using the `reshape2` or `tidyr` packages.

Using `reshape2`.

In [7]:
%%R -i df
library(dplyr)
library(reshape2)

df_melt <- melt(df, 
                id.vars = 'id',
                measure.vars = c('Sepal_Length', 'Sepal_Width', 'Petal_Length', 
                                 'Petal_Width', 'Species'))
df_melt %>% head(10)


   id     variable value
1   0 Sepal_Length   5.1
2   1 Sepal_Length   4.9
3   2 Sepal_Length   4.7
4   3 Sepal_Length   4.6
5   4 Sepal_Length     5
6   5 Sepal_Length   5.4
7   6 Sepal_Length   4.6
8   7 Sepal_Length     5
9   8 Sepal_Length   4.4
10  9 Sepal_Length   4.9


In [8]:
%%R
df_melt %>% dcast(id ~ variable, value.var = "value") %>% head(10)

   id Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1   0          5.1         3.5          1.4         0.2  setosa
2   1          4.9           3          1.4         0.2  setosa
3   2          4.7         3.2          1.3         0.2  setosa
4   3          4.6         3.1          1.5         0.2  setosa
5   4            5         3.6          1.4         0.2  setosa
6   5          5.4         3.9          1.7         0.4  setosa
7   6          4.6         3.4          1.4         0.3  setosa
8   7            5         3.4          1.5         0.2  setosa
9   8          4.4         2.9          1.4         0.2  setosa
10  9          4.9         3.1          1.5         0.1  setosa


Using `tidyr`.

In [9]:
%%R
library(tidyr)

df_melt <- df %>% gather(key, value, -id) 
df_melt %>% head(10)


   id          key value
1   0 Sepal_Length   5.1
2   1 Sepal_Length   4.9
3   2 Sepal_Length   4.7
4   3 Sepal_Length   4.6
5   4 Sepal_Length     5
6   5 Sepal_Length   5.4
7   6 Sepal_Length   4.6
8   7 Sepal_Length     5
9   8 Sepal_Length   4.4
10  9 Sepal_Length   4.9


In [10]:
%%R
df_melt %>% tail(10)

     id     key     value
741 140 Species virginica
742 141 Species virginica
743 142 Species virginica
744 143 Species virginica
745 144 Species virginica
746 145 Species virginica
747 146 Species virginica
748 147 Species virginica
749 148 Species virginica
750 149 Species virginica


In [11]:
%%R
df_melt %>% spread(key, value)  %>% head(10)

   id Petal_Length Petal_Width Sepal_Length Sepal_Width Species
1   0          1.4         0.2          5.1         3.5  setosa
2   1          1.4         0.2          4.9           3  setosa
3   2          1.3         0.2          4.7         3.2  setosa
4   3          1.5         0.2          4.6         3.1  setosa
5   4          1.4         0.2            5         3.6  setosa
6   5          1.7         0.4          5.4         3.9  setosa
7   6          1.4         0.3          4.6         3.4  setosa
8   7          1.5         0.2            5         3.4  setosa
9   8          1.4         0.2          4.4         2.9  setosa
10  9          1.5         0.1          4.9         3.1  setosa


## Stack & Unstack

In [12]:
df.stack().head(24)

0  Sepal_Length       5.1
   Sepal_Width        3.5
   Petal_Length       1.4
   Petal_Width        0.2
   Species         setosa
   id                   0
1  Sepal_Length       4.9
   Sepal_Width          3
   Petal_Length       1.4
   Petal_Width        0.2
   Species         setosa
   id                   1
2  Sepal_Length       4.7
   Sepal_Width        3.2
   Petal_Length       1.3
   Petal_Width        0.2
   Species         setosa
   id                   2
3  Sepal_Length       4.6
   Sepal_Width        3.1
   Petal_Length       1.5
   Petal_Width        0.2
   Species         setosa
   id                   3
dtype: object
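The stacked result is a Series whose index has two levels: the original row label and the original column name. A small sketch with a hypothetical two-row table (`toy`) shows how individual values or whole rows can be read back from that index.

```python
import pandas as pd

# Hypothetical table with explicit row labels
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r0", "r1"])

stacked = toy.stack()

# The result is a Series indexed by (row label, column name)
print(type(stacked).__name__)
print(stacked.index.nlevels)

# A single value is addressed with a (row, column) tuple
print(stacked.loc[("r1", "b")])

# The outer level alone selects all the values of one original row
print(stacked.loc["r0"])
```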

In [13]:
df.stack().unstack().head(4)

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species id
0          5.1         3.5          1.4         0.2  setosa  0
1          4.9           3          1.4         0.2  setosa  1
2          4.7         3.2          1.3         0.2  setosa  2
3          4.6         3.1          1.5         0.2  setosa  3
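By default `unstack` moves the innermost index level back to the columns, which is why the round trip above restores the original layout. As a sketch on a hypothetical table (`toy`), choosing the outer level instead yields the transposed table.

```python
import pandas as pd

# Hypothetical table with explicit row labels
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r0", "r1"])
stacked = toy.stack()

# Default: the inner level (old column names) becomes the columns again
same = stacked.unstack()

# level=0: the outer level (old row labels) becomes the columns instead
transposed = stacked.unstack(level=0)

print(same)
print(transposed)
```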

## Pivot tables

In [14]:
df = pd.DataFrame({
    'key1'    : ['a', 'a', 'b', 'b', 'c', 'c'],
    'key2'    : ['A', 'B', 'A', 'B', 'A', 'B'],
    'values1' : [ 1,   2,   3,   4,   5,   6 ],
    'values2' : [ 7,   8,   9,  10,  11,  12]})
df

  key1 key2  values1  values2
0    a    A        1        7
1    a    B        2        8
2    b    A        3        9
3    b    B        4       10
4    c    A        5       11
5    c    B        6       12

In [15]:
pd.pivot_table(
    df, 
    index = ['key1', 'key2'],
    values = ['values1', 'values2'])

           values1  values2
key1 key2                  
a    A           1        7
     B           2        8
b    A           3        9
     B           4       10
c    A           5       11
     B           6       12
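In the table above every `(key1, key2)` pair appears exactly once, so the aggregation is not visible. `pd.pivot_table` averages repeated keys by default; the sketch below uses a hypothetical table (`toy`) with a duplicated pair to show the default mean next to an explicit `aggfunc='sum'`.

```python
import pandas as pd

# Hypothetical table in which the pair ('a', 'A') appears twice
toy = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "key2": ["A", "A", "B"],
    "values1": [1, 3, 5],
})

# Default aggregation: mean of the rows that share the same keys
means = pd.pivot_table(toy, index=["key1", "key2"], values="values1")

# Explicit aggregation: sum instead of mean
sums = pd.pivot_table(toy, index=["key1", "key2"], values="values1", aggfunc="sum")

print(means)
print(sums)
```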

In [16]:
pd.pivot_table(
    df, 
    index = ['key2', 'key1'],
    values = ['values1', 'values2'])

           values1  values2
key2 key1                  
A    a           1        7
     b           3        9
     c           5       11
B    a           2        8
     b           4       10
     c           6       12

## Panels of DataFrames

In [17]:
## create the DataFrames
df1 = pd.DataFrame({
    'colA': [1, 2],
    'colB': [3, 4]})

df2 = pd.DataFrame({
    'colB': [5, 6],
    'colC': [7, 8]})

df3 = pd.DataFrame({
    'colC': [9, 0],
    'colD': [1, 2]})

In [18]:
## create the panel as a dictionary
pdPanel = { 'df1': df1,
            'df2': df2,
            'df3': df3}
print(pdPanel)

{'df1':    colA  colB
0     1     3
1     2     4, 'df2':    colB  colC
0     5     7
1     6     8, 'df3':    colC  colD
0     9     1
1     0     2}
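A dictionary of DataFrames can also be combined into a single table when that is more convenient. The sketch below is one possible approach, not part of the original example: `pd.concat` accepts a dictionary and uses its keys as the outer level of the row index.

```python
import pandas as pd

df1 = pd.DataFrame({"colA": [1, 2], "colB": [3, 4]})
df2 = pd.DataFrame({"colB": [5, 6], "colC": [7, 8]})

panel = {"df1": df1, "df2": df2}

# The dictionary keys become the outer index level of the combined table;
# columns missing from one of the frames are filled with NaN
combined = pd.concat(panel)

print(combined)

# Selecting on the outer level returns the rows that came from one frame
print(combined.loc["df2"])
```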
