# Basic methods and operations on a DataFrame

In the previous lesson we saw different ways to modify the index of a *DataFrame*.

In this lesson we will cover the main methods of a *DataFrame* and the main operations we can apply to this data type, all of which will be useful in the projects we develop.

In particular, we will look at:

1. Methods that give us general information about a *DataFrame*
2. Methods for mathematical and summarization operations on a *DataFrame*
3. The `apply` method, which lets us apply an arbitrary function to the *DataFrame*

And, as in previous lessons, at the end we will put some of these ideas to work in a practical example.

## 1. Methods for getting general information about the *DataFrame*

These methods are very useful in the first exploration phase of any *DataFrame*, since they give us an overview of the dataset's characteristics.

Let's start by importing the required libraries and reading the *peliculas.csv* dataset:

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('peliculas.csv')
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


The first method is `info()`, which we already discussed in a previous lesson. It prints general information about the number of non-null values and the data type of each column:

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      4897 non-null   object 
 1   director_name              4814 non-null   object 
 2   num_critic_for_reviews     4867 non-null   float64
 3   duration                   4901 non-null   float64
 4   director_facebook_likes    4814 non-null   float64
 5   actor_3_facebook_likes     4893 non-null   float64
 6   actor_2_name               4903 non-null   object 
 7   actor_1_facebook_likes     4909 non-null   float64
 8   gross                      4054 non-null   float64
 9   genres                     4916 non-null   object 
 10  actor_1_name               4909 non-null   object 
 11  movie_title                4916 non-null   object 
 12  num_voted_users            4916 non-null   int64  
 13  cast_total_facebook_likes  4916 non-null   int64

The `head()`, `tail()` and `sample()` methods print the first rows of the *DataFrame*, the last rows, or a random selection of rows, respectively:

In [3]:
# Head: with no arguments it prints the first 5 rows of the DataFrame
df.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [4]:
# Head: with an argument "n" it prints the first "n" rows
df.head(10)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
7,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
8,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000


In [6]:
# Tail: works like "head" but prints the last 5 rows (with no argument)
# or the last "n" rows (when the argument is an integer "n")
df.tail(10)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
4906,Color,Shane Carruth,143.0,77.0,291.0,8.0,David Sullivan,291.0,424760.0,Drama|Sci-Fi|Thriller,...,371.0,English,USA,PG-13,7000.0,2004.0,45.0,7.0,1.85,19000
4907,Color,Neill Dela Llana,35.0,80.0,0.0,0.0,Edgar Tancangco,0.0,70071.0,Thriller,...,35.0,English,Philippines,Not Rated,7000.0,2005.0,0.0,6.3,,74
4908,Color,Robert Rodriguez,56.0,81.0,0.0,6.0,Peter Marquardt,121.0,2040920.0,Action|Crime|Drama|Romance|Thriller,...,130.0,Spanish,USA,R,7000.0,1992.0,20.0,6.9,1.37,0
4909,Color,Anthony Vallone,,84.0,2.0,2.0,John Considine,45.0,,Crime|Drama,...,1.0,English,USA,PG-13,3250.0,2005.0,44.0,7.8,,4
4910,Color,Edward Burns,14.0,95.0,0.0,133.0,Caitlin FitzGerald,296.0,4584.0,Comedy|Drama,...,14.0,English,USA,Not Rated,9000.0,2011.0,205.0,6.4,,413
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.0,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660
4915,Color,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,...,84.0,English,USA,PG,1100.0,2004.0,23.0,6.6,1.85,456


In [7]:
# Sample: with an integer argument "n" it randomly selects "n" rows of the
# DataFrame and prints them (with no argument it returns a single row)
df.sample(10)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
2867,Color,Je-kyu Kang,86.0,148.0,16.0,489.0,Bin Won,717.0,1110186.0,Action|Drama|War,...,224.0,Korean,South Korea,R,12800000.0,2004.0,517.0,8.1,2.35,0
4566,Color,Robert Greenwald,21.0,75.0,21.0,0.0,Katy Helvenston-Wettengal,25.0,,Documentary|War,...,24.0,English,USA,,750000.0,2006.0,0.0,7.7,1.78,274
1119,Color,Yimou Zhang,189.0,114.0,611.0,10.0,Ye Liu,879.0,6565495.0,Drama|Romance,...,229.0,Mandarin,China,R,45000000.0,2006.0,52.0,7.0,2.35,0
426,Color,Doug Liman,238.0,88.0,218.0,521.0,Hayden Christensen,17000.0,80170146.0,Action|Adventure|Sci-Fi|Thriller,...,488.0,English,USA,PG-13,85000000.0,2008.0,4000.0,6.1,2.35,0
3856,Color,Mora Stephens,35.0,103.0,5.0,842.0,Alexandra Breckenridge,1000.0,,Drama|Thriller,...,20.0,English,USA,R,4500000.0,2015.0,1000.0,5.7,2.35,987
1915,Color,Ronny Yu,114.0,89.0,31.0,71.0,Park Bench,285.0,32368960.0,Comedy|Fantasy|Horror|Romance,...,292.0,English,Canada,R,25000000.0,1998.0,81.0,5.3,1.85,0
4889,Color,Joseph Mazzella,,90.0,0.0,9.0,Mikaal Bates,313.0,,Crime|Drama|Thriller,...,2.0,English,USA,,25000.0,2015.0,25.0,4.8,,33
2351,Color,Michael Dowse,136.0,97.0,31.0,562.0,Michael Biehn,2000.0,6923891.0,Comedy|Drama|Romance,...,83.0,English,USA,R,23000000.0,2011.0,2000.0,6.3,2.35,0
475,Color,Simon Wells,124.0,96.0,25.0,102.0,Alan Young,891.0,56684819.0,Action|Adventure|Sci-Fi,...,615.0,English,USA,PG-13,80000000.0,2002.0,639.0,5.9,2.35,3000
2745,Color,David Zucker,56.0,90.0,119.0,636.0,Tyler Labine,869.0,15549702.0,Comedy|Romance,...,123.0,English,USA,PG-13,14000000.0,2003.0,779.0,4.6,1.85,411
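
Because `sample()` draws rows at random, every run shows different rows. The sketch below, built on a small hypothetical DataFrame (`toy` and its values are made up for illustration, not taken from *peliculas.csv*), shows how the `random_state` parameter makes the selection reproducible and how `frac` draws a fraction of the rows instead of a fixed count:

```python
import pandas as pd

# Hypothetical toy data standing in for the movies dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "score": [7.9, 7.1, 6.8, 8.5, 7.1],
})

# Same seed, same rows: both samples contain exactly the same selection
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)
print(first.equals(second))  # True

# frac draws a fraction of the rows (40% of 5 rows here)
fraction = toy.sample(frac=0.4, random_state=0)
print(len(fraction))  # 2
```

Fixing the seed is mainly useful when the sampled rows need to be the same across runs, for example in a shared notebook.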


The `describe()` method prints summary statistics for the columns that contain numeric variables:

In [8]:
df.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,49.0,93.0,7.0,132.0,607.0,5019656.0,8361.75,1394.75,0.0,64.0,6000000.0,1999.0,277.0,5.8,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
75%,191.0,118.0,189.75,633.0,11000.0,61108410.0,93772.75,13616.75,2.0,320.5,43000000.0,2011.0,912.0,7.2,2.35,2000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0
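
By default `describe()` skips text columns. As a minimal sketch on a hypothetical toy DataFrame (the values are made up), the `exclude` parameter can be used to summarize the non-numeric columns instead; for those columns pandas reports the count, the number of unique values, the most frequent value and its frequency:

```python
import pandas as pd

# Hypothetical toy data: one text column (with a missing value) and one numeric column
toy = pd.DataFrame({
    "color": ["Color", "Color", "Black and White", None],
    "duration": [178.0, 169.0, 148.0, 164.0],
})

# Excluding numeric dtypes leaves only the text column in the summary
text_summary = toy.describe(exclude=["number"])
print(text_summary)
```

The resulting rows are `count`, `unique`, `top` and `freq`; missing values are not counted.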


The `value_counts()` method is also useful: applied to a column, it prints how many times each distinct value appears. It works for both numeric and categorical variables, although it is most useful for categorical ones:

In [10]:
# value_counts for the numeric column "duration"
#df['duration'].value_counts()

# value_counts for the categorical column "color"
df['color'].value_counts()

Color              4693
Black and White     204
Name: color, dtype: int64
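
The counts above leave out missing values and are absolute numbers. The following sketch, using a hypothetical Series (not the real "color" column), illustrates two optional parameters of `value_counts()`: `normalize=True` returns proportions and `dropna=False` also counts missing entries:

```python
import pandas as pd

# Hypothetical column with one missing value
color = pd.Series(
    ["Color", "Color", "Color", "Black and White", None], name="color"
)

# Proportions instead of counts (missing values are excluded by default)
shares = color.value_counts(normalize=True)
print(shares["Color"])  # 0.75

# Include the missing value as its own category
with_missing = color.value_counts(dropna=False)
print(with_missing.sum())  # 5
```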

## 2. Mathematical and summarization operations

Pandas provides several methods that use mathematical operations to summarize a column or a row. These methods **are intended for columns with numeric data**.

The most commonly used are:

- `min()` and `max()`: compute the minimum/maximum of each row or column
- `sum()`: computes the sum of the elements of each row or column
- `mean()`, `std()`, `median()`: compute the mean/standard deviation/median of each row or column

By default these operations produce one result per column (`axis = 0`); if we set `axis = 1` the calculation is done per row instead:

In [11]:
# Compute the minimum of each column
df.min()

  df.min()


num_critic_for_reviews                                                     1.0
duration                                                                   7.0
director_facebook_likes                                                    0.0
actor_3_facebook_likes                                                     0.0
actor_1_facebook_likes                                                     0.0
gross                                                                    162.0
genres                                                                  Action
movie_title                                                            #Horror
num_voted_users                                                              5
cast_total_facebook_likes                                                    0
facenumber_in_poster                                                       0.0
movie_imdb_link              http://www.imdb.com/title/tt0006864/?ref_=fn_t...
num_user_for_reviews                                

In the previous case Pandas prints a warning telling us that only columns with numeric values should be selected.

Let's repeat the calculation, this time selecting only the numeric columns (using the `select_dtypes` method seen in a previous lesson):

In [12]:
df.select_dtypes(include="number").min() # also try max(), sum(), mean(), std(), median()

num_critic_for_reviews          1.00
duration                        7.00
director_facebook_likes         0.00
actor_3_facebook_likes          0.00
actor_1_facebook_likes          0.00
gross                         162.00
num_voted_users                 5.00
cast_total_facebook_likes       0.00
facenumber_in_poster            0.00
num_user_for_reviews            1.00
budget                        218.00
title_year                   1916.00
actor_2_facebook_likes          0.00
imdb_score                      1.60
aspect_ratio                    1.18
movie_facebook_likes            0.00
dtype: float64
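
To make the effect of the `axis` argument concrete, here is a minimal sketch on a hypothetical three-row DataFrame (the column names and values are made up). It also uses the `numeric_only=True` parameter, which is an alternative to calling `select_dtypes` first:

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "title": ["x", "y", "z"],
    "a": [1, 2, 3],
    "b": [10, 20, 30],
})

# axis=0 (the default): one result per column
per_column = toy.sum(numeric_only=True)
print(per_column.to_dict())  # {'a': 6, 'b': 60}

# axis=1: one result per row
per_row = toy.sum(axis=1, numeric_only=True)
print(per_row.tolist())  # [11, 22, 33]
```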

## 3. *Apply*: applying a specific function by rows or columns

We often want to apply a given function to every element of the *DataFrame* or to certain columns, and sometimes that function is not built into Pandas.

In these cases **iterating over the rows or columns is NOT recommended**; instead we can use the `apply` method, which usually needs a single input argument: the function we want to apply to the dataset.

By default `apply` uses `axis = 0`, which means the function is applied to each column. To apply the function to each row we must use `axis = 1`.
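
As a small illustration of the two values of `axis` in `apply`, the sketch below uses a hypothetical two-row DataFrame (the numbers are made up). With `axis=0` the function receives each column as a Series; with `axis=1` it receives each row:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"budget": [100.0, 200.0], "gross": [150.0, 120.0]})

# axis=0 (default): the function is called once per column
col_range = toy.apply(lambda col: col.max() - col.min())
print(col_range.to_dict())  # {'budget': 100.0, 'gross': 30.0}

# axis=1: the function is called once per row
profit = toy.apply(lambda row: row["gross"] - row["budget"], axis=1)
print(profit.tolist())  # [50.0, -80.0]
```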

A first way to use `apply` is, for example, with a function predefined in NumPy:

In [13]:
# Example 1: compute the square root of all the numeric data,
# using NumPy's "sqrt" function
df.apply(np.sqrt)

TypeError: loop of ufunc does not support argument 0 of type str which has no callable sqrt method

In the previous case `apply` raises an error, because `np.sqrt` only works on numeric values.

Let's select only the numeric columns and then use `apply` + `np.sqrt` to perform the desired operation:

In [14]:
# Select the numeric columns and apply the "np.sqrt" function
df.select_dtypes(include="number").apply(np.sqrt)

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,26.888659,13.341664,0.000000,29.240383,31.622777,27577.270478,941.384087,69.526973,0.000000,55.263008,15394.804318,44.821870,30.594117,2.810694,1.334166,181.659021
1,17.378147,13.000000,23.727621,31.622777,200.000000,17589.887777,686.454660,219.886334,0.000000,35.185224,17320.508076,44.799554,70.710678,2.664583,1.532971,0.000000
2,24.535688,12.165525,0.000000,12.688578,104.880885,14144.757863,525.231378,108.166538,1.000000,31.527766,15652.475842,44.888751,19.824228,2.607681,1.532971,291.547595
3,28.513155,12.806248,148.323970,151.657509,164.316767,21169.096391,1069.736884,326.739958,0.000000,51.971146,15811.388301,44.855323,151.657509,2.915476,1.532971,404.969135
4,,,11.445523,,11.445523,,2.828427,11.958261,0.000000,,,,3.464102,2.664583,,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,1.000000,9.327379,1.414214,17.832555,25.238859,,25.079872,47.780749,1.414214,2.449490,,44.866469,21.679483,2.774887,,9.165151
4912,6.557439,6.557439,,17.860571,29.000000,,271.733325,41.868843,1.000000,18.947295,,,24.351591,2.738613,4.000000,178.885438
4913,3.605551,8.717798,0.000000,0.000000,0.000000,,6.164414,0.000000,0.000000,1.732051,37.416574,44.866469,0.000000,2.509980,,4.000000
4914,3.741657,10.000000,0.000000,22.113344,30.757113,102.190998,35.425979,48.846699,2.236068,3.000000,,44.855323,26.814175,2.509980,1.532971,25.690465


We can also apply a function to just a specific group of columns:

In [15]:
# Apply np.sqrt only to the "duration" and "gross" columns (which are numeric)
df[['duration', 'gross']].apply(np.sqrt)

Unnamed: 0,duration,gross
0,13.341664,27577.270478
1,13.000000,17589.887777
2,12.165525,14144.757863
3,12.806248,21169.096391
4,,
...,...,...
4911,9.327379,
4912,6.557439,
4913,8.717798,
4914,10.000000,102.190998


A second way to use `apply` consists of:

1. Defining the function to apply as a regular Python function (with the `def` keyword)
2. Applying the function created in step (1) with `apply`

Let's look at an example: the "genres" column contains the genres of each movie. They are stored as a *string*, with "|" as the separator between one genre and the next:

In [16]:
df["genres"]

0       Action|Adventure|Fantasy|Sci-Fi
1              Action|Adventure|Fantasy
2             Action|Adventure|Thriller
3                       Action|Thriller
4                           Documentary
                     ...               
4911                       Comedy|Drama
4912       Crime|Drama|Mystery|Thriller
4913              Drama|Horror|Thriller
4914               Comedy|Drama|Romance
4915                        Documentary
Name: genres, Length: 4916, dtype: object

The idea is to create and apply a function that replaces the "|" separator with ", " (comma and space). Let's see how to do it:

In [17]:
# Step 1: define the function
def change_str(string):
    # str.replace returns the string unchanged when "|" is absent,
    # so single-genre rows such as "Documentary" are kept as they are
    return string.replace("|", ", ")

# Step 2: apply it to the column of interest
df['genres'].apply(change_str)

0       Action, Adventure, Fantasy, Sci-Fi
1               Action, Adventure, Fantasy
2              Action, Adventure, Thriller
3                         Action, Thriller
4                              Documentary
                       ...                
4911                         Comedy, Drama
4912       Crime, Drama, Mystery, Thriller
4913               Drama, Horror, Thriller
4914                Comedy, Drama, Romance
4915                           Documentary
Name: genres, Length: 4916, dtype: object

There is also a third way to use `apply`.

When the function can be written compactly, we can use Python **lambda functions** instead of `def`.

Let's repeat the previous exercise using a lambda function:

In [18]:
# Replace the "|" separator with ", "
df['genres'].apply(lambda string: string.replace('|',', '))

0       Action, Adventure, Fantasy, Sci-Fi
1               Action, Adventure, Fantasy
2              Action, Adventure, Thriller
3                         Action, Thriller
4                              Documentary
                       ...                
4911                         Comedy, Drama
4912       Crime, Drama, Mystery, Thriller
4913               Drama, Horror, Thriller
4914                Comedy, Drama, Romance
4915                           Documentary
Name: genres, Length: 4916, dtype: object
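
For simple text operations like this one, pandas also offers vectorized string methods through the `.str` accessor, which can replace `apply` altogether. The sketch below is an alternative to the lesson's approach, not the lesson's own method, and uses a hypothetical Series with a missing value. Note `regex=False`, so that "|" is treated as a literal character rather than as the regular-expression "or" operator:

```python
import pandas as pd

# Hypothetical column, including a missing value
genres = pd.Series(["Action|Adventure", "Documentary", None], name="genres")

# Vectorized replacement; missing values stay missing instead of raising an error
cleaned = genres.str.replace("|", ", ", regex=False)
print(cleaned.tolist())
```

Unlike the lambda version, which would raise an `AttributeError` on a missing value, the `.str` accessor simply propagates it.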

## 4. Practical example

In this example we will combine ideas from previous lessons with ideas from this one to process some columns of the *DataFrame*.

These are the processing steps to implement:

1. Reorder the columns of the *DataFrame* so that it is easier to explore
2. Reduce the excess of "...facebook_likes" columns by creating a new column ("total_fb_likes") that holds the sum of the "facebook_likes" values of those columns. Important: the sum must be computed **by rows**, and if a row contains any *NaN* the sum must return *NaN*.
3. Drop the unwanted "...facebook_likes" columns

Let's see how to implement these processing steps:

In [19]:
# 1. Reorder the columns of the DataFrame so that it is easier to explore
# We already did this in a previous lesson

# Group the columns into categorical and numeric, ordered by importance
# Categorical columns, in order
col_cat = ['movie_title', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name',
           'genres', 'plot_keywords', 'language', 'country', 'movie_imdb_link', 'color']

# Numeric columns, in order
# (note: 'content_rating' holds text labels such as "PG-13", not numbers)
col_num = ['budget', 'gross', 'imdb_score', 'duration', 'title_year', 'director_facebook_likes',
           'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes',
           'movie_facebook_likes', 'cast_total_facebook_likes', 'num_critic_for_reviews',
           'num_voted_users', 'num_user_for_reviews', 'facenumber_in_poster',
           'content_rating', 'aspect_ratio']

# Combine the ordered categorical and numeric columns
col_order = col_cat + col_num

# Apply this column order to the DataFrame
df = df[col_order]
df

Unnamed: 0,movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,genres,plot_keywords,language,country,movie_imdb_link,...,actor_2_facebook_likes,actor_3_facebook_likes,movie_facebook_likes,cast_total_facebook_likes,num_critic_for_reviews,num_voted_users,num_user_for_reviews,facenumber_in_poster,content_rating,aspect_ratio
0,Avatar,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Action|Adventure|Fantasy|Sci-Fi,avatar|future|marine|native|paraplegic,English,USA,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,...,936.0,855.0,33000,4834,723.0,886204,3054.0,0.0,PG-13,1.78
1,Pirates of the Caribbean: At World's End,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Action|Adventure|Fantasy,goddess|marriage ceremony|marriage proposal|pi...,English,USA,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,...,5000.0,1000.0,0,48350,302.0,471220,1238.0,0.0,PG-13,2.35
2,Spectre,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Action|Adventure|Thriller,bomb|espionage|sequel|spy|terrorist,English,UK,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,...,393.0,161.0,85000,11700,602.0,275868,994.0,1.0,PG-13,2.35
3,The Dark Knight Rises,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Action|Thriller,deception|imprisonment|lawlessness|police offi...,English,USA,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,...,23000.0,23000.0,164000,106759,813.0,1144337,2701.0,0.0,PG-13,2.35
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,Doug Walker,Rob Walker,,Documentary,,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,...,12.0,,0,143,,8,,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Signed Sealed Delivered,Scott Smith,Eric Mabius,Daphne Zuniga,Crystal Lowe,Comedy|Drama,fraud|postal worker|prison|theft|trial,English,Canada,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,...,470.0,318.0,84,2283,1.0,629,6.0,2.0,,
4912,The Following,,Natalie Zea,Valorie Curry,Sam Underwood,Crime|Drama|Mystery|Thriller,cult|fbi|hideout|prison escape|serial killer,English,USA,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,...,593.0,319.0,32000,1753,43.0,73839,359.0,1.0,TV-14,16.00
4913,A Plague So Pleasant,Benjamin Roberds,Eva Boehnke,Maxwell Moody,David Chandler,Drama|Horror|Thriller,,English,USA,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,...,0.0,0.0,16,0,13.0,38,3.0,0.0,,
4914,Shanghai Calling,Daniel Hsia,Alan Ruck,Daniel Henney,Eliza Coupe,Comedy|Drama|Romance,,English,USA,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,...,719.0,489.0,660,2386,14.0,1255,9.0,5.0,PG-13,2.35


In [20]:
# 2. Create the "total_fb_likes" column

# 2.1. Find the columns whose label contains the string "facebook_likes"
col_fb = df.filter(like = "facebook_likes").columns
col_fb

Index(['director_facebook_likes', 'actor_1_facebook_likes',
       'actor_2_facebook_likes', 'actor_3_facebook_likes',
       'movie_facebook_likes', 'cast_total_facebook_likes'],
      dtype='object')
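
As a brief aside on `filter`, the sketch below uses a hypothetical one-row DataFrame (made-up values, with a subset of the real column names) to contrast `like`, which matches a plain substring of the label, with `regex`, which matches a regular expression and allows a narrower selection:

```python
import pandas as pd

# Hypothetical one-row DataFrame
toy = pd.DataFrame({
    "movie_title": ["Avatar"],
    "actor_1_facebook_likes": [1000.0],
    "actor_2_facebook_likes": [936.0],
    "movie_facebook_likes": [33000],
})

# like: keep every column whose label contains the substring
by_substring = toy.filter(like="facebook_likes").columns.tolist()
print(by_substring)

# regex: keep only the actor columns
by_pattern = toy.filter(regex=r"^actor_\d+_").columns.tolist()
print(by_pattern)
```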

In [21]:
# 2.2. Sum the columns above ROW BY ROW, propagating "NaN" (skipna=False),
# and store the result in the new column "total_fb_likes".
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(total_fb_likes=df[col_fb].sum(axis=1, skipna=False))
df

Unnamed: 0,movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,genres,plot_keywords,language,country,movie_imdb_link,...,actor_3_facebook_likes,movie_facebook_likes,cast_total_facebook_likes,num_critic_for_reviews,num_voted_users,num_user_for_reviews,facenumber_in_poster,content_rating,aspect_ratio,total_fb_likes
0,Avatar,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Action|Adventure|Fantasy|Sci-Fi,avatar|future|marine|native|paraplegic,English,USA,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,...,855.0,33000,4834,723.0,886204,3054.0,0.0,PG-13,1.78,40625.0
1,Pirates of the Caribbean: At World's End,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Action|Adventure|Fantasy,goddess|marriage ceremony|marriage proposal|pi...,English,USA,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,...,1000.0,0,48350,302.0,471220,1238.0,0.0,PG-13,2.35,94913.0
2,Spectre,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Action|Adventure|Thriller,bomb|espionage|sequel|spy|terrorist,English,UK,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,...,161.0,85000,11700,602.0,275868,994.0,1.0,PG-13,2.35,108254.0
3,The Dark Knight Rises,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Action|Thriller,deception|imprisonment|lawlessness|police offi...,English,USA,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,...,23000.0,164000,106759,813.0,1144337,2701.0,0.0,PG-13,2.35,365759.0
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,Doug Walker,Rob Walker,,Documentary,,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,...,,0,143,,8,,0.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Signed Sealed Delivered,Scott Smith,Eric Mabius,Daphne Zuniga,Crystal Lowe,Comedy|Drama,fraud|postal worker|prison|theft|trial,English,Canada,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,...,318.0,84,2283,1.0,629,6.0,2.0,,,3794.0
4912,The Following,,Natalie Zea,Valorie Curry,Sam Underwood,Crime|Drama|Mystery|Thriller,cult|fbi|hideout|prison escape|serial killer,English,USA,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,...,319.0,32000,1753,43.0,73839,359.0,1.0,TV-14,16.00,
4913,A Plague So Pleasant,Benjamin Roberds,Eva Boehnke,Maxwell Moody,David Chandler,Drama|Horror|Thriller,,English,USA,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,...,0.0,16,0,13.0,38,3.0,0.0,,,16.0
4914,Shanghai Calling,Daniel Hsia,Alan Ruck,Daniel Henney,Eliza Coupe,Comedy|Drama|Romance,,English,USA,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,...,489.0,660,2386,14.0,1255,9.0,5.0,PG-13,2.35,5200.0
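
The role of `skipna=False` is easiest to see on a tiny case. In this sketch, built on a hypothetical two-row DataFrame (values chosen for illustration), the default behaviour ignores the missing value, while `skipna=False` propagates it to the row total, which is the behaviour required in step 2:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has a missing value
toy = pd.DataFrame({
    "director_facebook_likes": [0.0, np.nan],
    "movie_facebook_likes": [33000, 32000],
})

# Default (skipna=True): NaN is ignored, so the second row still gets a total
default_sum = toy.sum(axis=1)
print(default_sum.tolist())  # [33000.0, 32000.0]

# skipna=False: any NaN in the row makes the total NaN
strict_sum = toy.sum(axis=1, skipna=False)
print(strict_sum.isna().tolist())  # [False, True]
```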


In [22]:
# 3. Drop the "...facebook_likes" columns
df = df.drop(columns=col_fb)
df

Unnamed: 0,movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,genres,plot_keywords,language,country,movie_imdb_link,...,imdb_score,duration,title_year,num_critic_for_reviews,num_voted_users,num_user_for_reviews,facenumber_in_poster,content_rating,aspect_ratio,total_fb_likes
0,Avatar,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Action|Adventure|Fantasy|Sci-Fi,avatar|future|marine|native|paraplegic,English,USA,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,...,7.9,178.0,2009.0,723.0,886204,3054.0,0.0,PG-13,1.78,40625.0
1,Pirates of the Caribbean: At World's End,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Action|Adventure|Fantasy,goddess|marriage ceremony|marriage proposal|pi...,English,USA,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,...,7.1,169.0,2007.0,302.0,471220,1238.0,0.0,PG-13,2.35,94913.0
2,Spectre,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Action|Adventure|Thriller,bomb|espionage|sequel|spy|terrorist,English,UK,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,...,6.8,148.0,2015.0,602.0,275868,994.0,1.0,PG-13,2.35,108254.0
3,The Dark Knight Rises,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Action|Thriller,deception|imprisonment|lawlessness|police offi...,English,USA,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,...,8.5,164.0,2012.0,813.0,1144337,2701.0,0.0,PG-13,2.35,365759.0
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,Doug Walker,Rob Walker,,Documentary,,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,...,7.1,,,,8,,0.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Signed Sealed Delivered,Scott Smith,Eric Mabius,Daphne Zuniga,Crystal Lowe,Comedy|Drama,fraud|postal worker|prison|theft|trial,English,Canada,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,...,7.7,87.0,2013.0,1.0,629,6.0,2.0,,,3794.0
4912,The Following,,Natalie Zea,Valorie Curry,Sam Underwood,Crime|Drama|Mystery|Thriller,cult|fbi|hideout|prison escape|serial killer,English,USA,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,...,7.5,43.0,,43.0,73839,359.0,1.0,TV-14,16.00,
4913,A Plague So Pleasant,Benjamin Roberds,Eva Boehnke,Maxwell Moody,David Chandler,Drama|Horror|Thriller,,English,USA,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,...,6.3,76.0,2013.0,13.0,38,3.0,0.0,,,16.0
4914,Shanghai Calling,Daniel Hsia,Alan Ruck,Daniel Henney,Eliza Coupe,Comedy|Drama|Romance,,English,USA,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,...,6.3,100.0,2012.0,14.0,1255,9.0,5.0,PG-13,2.35,5200.0
