In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
float_data = pd.Series([1.8, -3.5, np.nan, 0])

In [2]:
float_data

0    1.8
1   -3.5
2    NaN
3    0.0
dtype: float64

# Class 74

### 2.5 Permutation and Random Sampling

You can permute (randomly reorder) a Series or the rows of a DataFrame using the `numpy.random.permutation` function. Calling `permutation` with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [14]:
df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))
df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [15]:
sampler = np.random.permutation(5)
sampler

array([1, 2, 3, 0, 4], dtype=int32)

This array can then be used in `iloc`-based indexing or with the equivalent `take()` method:

In [16]:
df.take(sampler)
# Reorders the rows of df according to the integer positions in sampler.

Unnamed: 0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
0,0,1,2,3,4,5,6
4,28,29,30,31,32,33,34


In [17]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
0,0,1,2,3,4,5,6
4,28,29,30,31,32,33,34
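
The equivalence of `take` and `iloc` shown above can be checked directly. The following is a minimal sketch, assuming a seeded `np.random.default_rng` generator (the seed value 0 is arbitrary and not part of the original cells, which use the legacy `np.random.permutation`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

# Assumption: a seeded Generator so the shuffle is repeatable
rng = np.random.default_rng(0)
sampler = rng.permutation(len(df))  # integer positions 0..4 in random order

by_take = df.take(sampler)
by_iloc = df.iloc[sampler]

# Both approaches select the same rows in the same order
print(by_take.equals(by_iloc))

# A permutation only reorders rows, so sorting the index undoes it
print(by_take.sort_index())
```

Because a permutation contains every position exactly once, no row is lost or duplicated.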


By calling `take()` with `axis="columns"`, we can also select a permutation of the columns:

In [18]:
column_sampler = np.random.permutation(7)
column_sampler

array([5, 6, 2, 0, 4, 1, 3], dtype=int32)

In [19]:
df.take(column_sampler, axis="columns")

Unnamed: 0,5,6,2,0,4,1,3
0,5,6,2,0,4,1,3
1,12,13,9,7,11,8,10
2,19,20,16,14,18,15,17
3,26,27,23,21,25,22,24
4,33,34,30,28,32,29,31


To select a random subset without replacement (the same row cannot appear twice), use the `sample()` method on a Series or DataFrame:

In [20]:
df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [21]:
df.sample(n=3)

Unnamed: 0,0,1,2,3,4,5,6
2,14,15,16,17,18,19,20
1,7,8,9,10,11,12,13
3,21,22,23,24,25,26,27
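
As a hedged illustration of `sample()` without replacement, the sketch below fixes the `random_state` parameter so the draw is repeatable and also uses `frac` to request a fraction of the rows. The seed value 42 is an arbitrary choice made here, not something taken from the original cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

# Assumption: random_state=42 is an arbitrary seed chosen for illustration
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# The same seed yields the same rows in the same order
print(first.equals(second))

# Without replacement, no row label appears more than once
print(first.index.is_unique)

# frac selects a fraction of the rows instead of a fixed count
subset = df.sample(frac=0.4, random_state=42)
print(len(subset))
```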


To generate a sample with replacement (allowing repeated choices), pass `replace=True` to `sample()`:

In [22]:
choices = pd.Series([5, 7, -1, 6, 4])
choices

0    5
1    7
2   -1
3    6
4    4
dtype: int64

In [23]:
choices.sample(n=10, replace=True)

1    7
4    4
3    6
2   -1
0    5
4    4
4    4
0    5
4    4
1    7
dtype: int64
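
The difference between the two sampling modes becomes visible when the requested sample is larger than the data. The sketch below is an illustration under the assumption of an arbitrary seed (`random_state=0`); it shows that such a request fails without replacement and succeeds with it:

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# Requesting more items than the Series holds fails without replacement
raised = False
try:
    choices.sample(n=10)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With replacement the same label may be drawn repeatedly
# (random_state=0 is an arbitrary seed, assumed here for repeatability)
draws = choices.sample(n=10, replace=True, random_state=0)
print(draws.index.is_unique)  # 10 draws from 5 labels must repeat some
```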

### Computing Indicator/Dummy Variables

Another type of transformation frequently used in statistical modeling and machine learning applications is converting a categorical variable into a dummy or indicator matrix, in other words, encoding categories as numbers. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing only 1s and 0s. pandas has a `pandas.get_dummies()` function for doing this, though you could also devise one yourself. Let's look at an example DataFrame:

In [24]:
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [30]:
pd.get_dummies(df["key"], dtype=float)

Unnamed: 0,a,b,c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0


Here `dtype=float` was passed to change the output type from boolean (the default in more recent versions of pandas) to floating point.

In [28]:
df_1 = pd.get_dummies(df["key"])
df_1

Unnamed: 0,a,b,c
0,False,True,False
1,False,True,False
2,True,False,False
3,False,False,True
4,True,False,False
5,False,True,False
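
To make concrete what `pandas.get_dummies` computes, here is a minimal hand-written sketch. The helper name `simple_dummies` is hypothetical and the approach (comparing every value against every category through NumPy broadcasting) is only one possible way to do it, not how pandas implements the function:

```python
import numpy as np
import pandas as pd

def simple_dummies(values: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: one boolean column per distinct value."""
    raw = values.to_numpy(dtype=object)
    categories = np.array(sorted(set(raw)), dtype=object)
    # Shape (n, 1) compared with shape (1, k) gives an (n, k) boolean matrix
    matrix = raw[:, None] == categories[None, :]
    return pd.DataFrame(matrix, index=values.index, columns=list(categories))

key = pd.Series(["b", "b", "a", "c", "a", "b"])
manual = simple_dummies(key)
print(manual)

# Every row belongs to exactly one category
print(manual.sum(axis=1).tolist())
```

This sketch assumes there are no missing values; `pandas.get_dummies` handles those cases and should be preferred in practice.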


In some cases, you may want to add a prefix to the columns of the indicator DataFrame, which can then be merged with the other data. `pandas.get_dummies` has a `prefix` argument for doing this:

In [224]:
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)
dummies

Unnamed: 0,key_a,key_b,key_c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0


In [225]:
df_with_dummy = df[["data1"]].join(dummies)  # .join is covered in detail later
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0.0,1.0,0.0
1,1,0.0,1.0,0.0
2,2,1.0,0.0,0.0
3,3,0.0,0.0,1.0
4,4,1.0,0.0,0.0
5,5,0.0,1.0,0.0
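
As an alternative sketch of the same result, `pandas.get_dummies` also accepts a whole DataFrame together with a `columns` argument, which encodes only the listed columns and leaves the others in place. This is offered as a possible shortcut, not as the approach used in the cells above:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

# columns= encodes only the listed columns and keeps the rest;
# the column name is used as the prefix by default
encoded = pd.get_dummies(df, columns=["key"], dtype=float)
print(encoded)
```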


Another example:

In [35]:
df_2 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6),
                   "Colores": ["Verde", "Rojo", "Verde", "Amarillo", "Rojo", "Verde"]})
df_2

Unnamed: 0,key,data1,Colores
0,b,0,Verde
1,b,1,Rojo
2,a,2,Verde
3,c,3,Amarillo
4,a,4,Rojo
5,b,5,Verde


In [38]:
dummies_1 = pd.get_dummies(df_2[["key", "Colores"]], prefix=["key", "Colores"], dtype=float)
dummies_1

Unnamed: 0,key_a,key_b,key_c,Colores_Amarillo,Colores_Rojo,Colores_Verde
0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0,0.0,1.0


If a row in a DataFrame belongs to multiple categories, we have to use a different approach to create the dummy variables. Let's look at the `MovieLens` dataset:

In [41]:
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table('movies.dat', sep="::",
                       header=None, names=mnames, engine="python")

movies[:10]  # Similar to .head(10)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


pandas implements a special Series method, `str.get_dummies` (methods that begin with `str.` are discussed in more detail later in String Manipulation), that handles this scenario of multiple group membership encoded as a delimited string:

In [42]:
dummies_2 = movies["genres"].str.get_dummies("|")
dummies_2

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3879,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3880,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3881,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [43]:
dummies_2.iloc[:10, :6]

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime
0,0,0,1,1,1,0
1,0,1,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,0,1,0
4,0,0,0,0,1,0
5,1,0,0,0,0,1
6,0,0,0,0,1,0
7,0,1,0,1,0,0
8,1,0,0,0,0,0
9,1,1,0,0,0,0
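
Since `movies.dat` is an external file, the behavior of `str.get_dummies` can also be tried on a small inline Series. The three rows below are an assumed stand-in for the `genres` column, chosen only for illustration:

```python
import pandas as pd

# Assumption: a three-row stand-in for the "genres" column of movies.dat
genres = pd.Series(["Animation|Children's|Comedy",
                    "Comedy|Romance",
                    "Action"])

indicators = genres.str.get_dummies("|")
print(indicators)

# Each row sum equals the number of genres listed for that film
print(indicators.sum(axis=1).tolist())
```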


Then, as before, you can combine this with `movies`, adding a `"Genre_"` prefix to the column names of the dummies DataFrame (`dummies_2`) with the `add_prefix` method:

In [45]:
movies_windic = movies.join(dummies_2.add_prefix("Genre_"))
movies_windic

Unnamed: 0,movie_id,title,genres,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Crime,Genre_Documentary,...,Genre_Fantasy,Genre_Film-Noir,Genre_Horror,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Thriller,Genre_War,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,3948,Meet the Parents (2000),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3879,3949,Requiem for a Dream (2000),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3880,3950,Tigerland (2000),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3881,3951,Two Family House (2000),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0, dtype: object

Note: For much larger data, this method of constructing indicator variables with multiple membership is not especially fast. It would be better to write a lower-level function that writes directly to a NumPy array and then wrap the result in a DataFrame.
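
A lower-level function of the kind described in the note could look like the following sketch. The helper name `multi_dummies` is hypothetical, the input is an assumed three-row sample, and missing values are not handled. Whether it is actually faster than `str.get_dummies` depends on the data and should be measured:

```python
import numpy as np
import pandas as pd

def multi_dummies(values: pd.Series, sep: str = "|") -> pd.DataFrame:
    """Hypothetical helper: fill a preallocated NumPy array of 0/1 flags."""
    split = [s.split(sep) for s in values]
    categories = sorted({c for parts in split for c in parts})
    position = {c: j for j, c in enumerate(categories)}

    # Preallocate the full indicator matrix once
    out = np.zeros((len(values), len(categories)), dtype=np.int64)
    for i, parts in enumerate(split):
        # Set all flags for row i in one indexed assignment
        out[i, [position[c] for c in parts]] = 1
    return pd.DataFrame(out, index=values.index, columns=categories)

# Assumption: a small stand-in for the "genres" column
genres = pd.Series(["Animation|Children's|Comedy",
                    "Comedy|Romance",
                    "Action"])
result = multi_dummies(genres)
print(result)
```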

For statistical applications, it is also common to combine `pandas.get_dummies` with a discretization function such as `pandas.cut`:

In [47]:
np.random.seed(12345)  # make the example repeatable

values = np.random.uniform(size=10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [48]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [49]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,False,False,False,False,True
1,False,True,False,False,False
2,True,False,False,False,False
3,False,True,False,False,False
4,False,False,True,False,False
5,False,False,True,False,False
6,False,False,False,False,True
7,False,False,False,True,False
8,False,False,False,True,False
9,False,False,False,True,False


In [50]:
pd.get_dummies(pd.cut(values, bins), dtype=float)

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,0.0,1.0
7,0.0,0.0,0.0,1.0,0.0
8,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,1.0,0.0
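
The interval column names produced above can be replaced with readable ones by passing `labels` to `pandas.cut`. The sketch below uses hypothetical label names and a short hand-picked array instead of the random values, so the result is easy to follow:

```python
import numpy as np
import pandas as pd

values = np.array([0.93, 0.32, 0.18, 0.20, 0.57])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
# Hypothetical labels, one per interval
labels = ["very_low", "low", "medium", "high", "very_high"]

binned = pd.cut(values, bins, labels=labels)
dummies = pd.get_dummies(binned, dtype=float)
print(dummies)
```

Because the result of `pandas.cut` is categorical, a bin that receives no values (here `high`) still gets a column, filled with zeros. Intervals are closed on the right by default, so 0.20 falls into the first bin.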


### Exercise 3: Given the following dataset:

In [None]:
'''
data = {
    'Tienda': ['A', 'B', 'C', 'D', 'E'],
    'Producto': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'Ventas': [150, 200, 300, 250, 100]
}
'''

1- Create a DataFrame, then set a seed so the example is reproducible:

In [None]:
# np.random.seed(42)

2- Create two more DataFrames: one that is a permutation of the columns and another that is a permutation of the rows.

3- From each of the two previous DataFrames, extract a random sample of 2 rows.

### Exercise 4: Starting from the following dictionary:

In [None]:
'''
data = {
    'Nombre': ['Ana', 'Luis', 'María', 'Pedro', 'Laura', 'Carlos', 'Marta', 'Jorge'],
    'Edad': [23, 45, 31, 34, 28, 40, 36, 50],
    'Departamento': ['Ventas', 'IT', 'IT', 'Ventas', 'Marketing', 'Ventas', 'Marketing', 'IT'],
    'Salario': [50000, 60000, 55000, 52000, 58000, 51000, 60000, 62000]
}
'''

In [None]:
2- Convert the `Departamento` column into dummy variables

3- Define the limits (the intervals) and labels for the age groups

4- Create a new column named `Grupo_Edad` using `pd.cut()`

5- Print the final DataFrame