# Selecting and manipulating columns and rows

In the previous lesson we looked at the main data types that a *DataFrame* can hold.

In this lesson we will look at different ways of selecting and manipulating columns and rows, all of them widely used during the data cleaning and preparation stage of Data Science and Machine Learning projects.

In particular we will cover:
- Different ways of selecting one or more columns of a *DataFrame*
- Different ways of selecting one or more rows of a *DataFrame*
- How to select portions of a *DataFrame* (that is, a range of rows and columns)
- How to modify the columns and rows of a *DataFrame* (rename, delete and add)

We will close with a practical example that combines several of these ideas.

## 1. Selecting columns

Here we will see how to select one or several columns in a simple way, as well as more advanced forms of selection using the *loc* indexer, the *select_dtypes* and *filter* methods, and conditionals.

Let's start by reading the same dataset we have been using in the last few lessons (*peliculas.csv*):

In [1]:
# Start by importing the library
import pandas as pd

# Use "read_csv" to read the "peliculas.csv" dataset
df = pd.read_csv('peliculas.csv')

# And display the result
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


Let's start with basic selection, which uses this syntax: `df['column_name']` or `df[[list_of_columns]]`:

In [2]:
# Selecting a single column
fb_likes = df['movie_facebook_likes']
fb_likes

0        33000
1            0
2        85000
3       164000
4            0
         ...  
4911        84
4912     32000
4913        16
4914       660
4915       456
Name: movie_facebook_likes, Length: 4916, dtype: int64

In [3]:
# In this case the selection returns a Series
type(fb_likes)

pandas.core.series.Series

In [4]:
# Selecting multiple columns
df_cols = df[['director_name', 'duration', 'gross']]
df_cols

Unnamed: 0,director_name,duration,gross
0,James Cameron,178.0,760505847.0
1,Gore Verbinski,169.0,309404152.0
2,Sam Mendes,148.0,200074175.0
3,Christopher Nolan,164.0,448130642.0
4,Doug Walker,,
...,...,...,...
4911,Scott Smith,87.0,
4912,,43.0,
4913,Benjamin Roberds,76.0,
4914,Daniel Hsia,100.0,10443.0
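
The difference between single and double brackets is easy to miss, so here is a minimal sketch of it. The `toy` DataFrame is a hypothetical stand-in for *peliculas.csv* (only the column names are taken from the lesson): a single name returns a Series, while a list containing one name returns a one-column *DataFrame*.

```python
import pandas as pd

# Hypothetical toy data standing in for "peliculas.csv"
toy = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "duration": [178.0, 148.0],
})

as_series = toy["duration"]    # single brackets with one name -> Series
as_frame = toy[["duration"]]   # a list with one name -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```

Which form is preferable depends on what the next step expects: many methods accept both, but some only make sense on a two-dimensional object.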


We can also use the [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) indexer:

In [5]:
# Select the "duration" and "gross" columns using "loc"
df_dur_gross = df.loc[:, ['duration', 'gross']]
df_dur_gross

Unnamed: 0,duration,gross
0,178.0,760505847.0
1,169.0,309404152.0
2,148.0,200074175.0
3,164.0,448130642.0
4,,
...,...,...
4911,87.0,
4912,43.0,
4913,76.0,
4914,100.0,10443.0
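
Besides a list of names, `loc` also accepts a slice of column labels. The sketch below uses a hypothetical `toy` DataFrame (column names borrowed from the lesson, values made up for illustration) to show that, unlike ordinary Python slices, a label slice includes both endpoints.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the lesson
toy = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "duration": [178.0, 148.0],
    "gross": [760505847.0, 200074175.0],
    "imdb_score": [7.9, 6.8],
})

# Label slices in "loc" include BOTH the start and the end label
subset = toy.loc[:, "duration":"imdb_score"]
print(list(subset.columns))  # ['duration', 'gross', 'imdb_score']
```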


The [`select_dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) method selects columns based on their data type:

In [6]:
# Display the data types present in the DataFrame
df.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

In [7]:
# Select only the "int64" columns (the dtype is named explicitly because "int"
# can map to a different integer width on some platforms)
df_int64 = df.select_dtypes(include="int64")
df_int64

Unnamed: 0,num_voted_users,cast_total_facebook_likes,movie_facebook_likes
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000
3,1144337,106759,164000
4,8,143,0
...,...,...,...
4911,629,2283,84
4912,73839,1753,32000
4913,38,0,16
4914,1255,2386,660


In [8]:
# Select all columns with numeric data
df_num = df.select_dtypes(include="number")
df_num.dtypes.value_counts()

float64    13
int64       3
dtype: int64

In [9]:
# Select only the columns that are NOT numeric
df_no_num = df.select_dtypes(exclude="number")
df_no_num.dtypes.value_counts()

object    12
dtype: int64

In [10]:
# Select columns whose name contains a given string, using "filter"
df_fb = df.filter(like="facebook")
df_fb

Unnamed: 0,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,cast_total_facebook_likes,actor_2_facebook_likes,movie_facebook_likes
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0
...,...,...,...,...,...,...
4911,2.0,318.0,637.0,2283,470.0,84
4912,,319.0,841.0,1753,593.0,32000
4913,0.0,0.0,0.0,0,0.0,16
4914,0.0,489.0,946.0,2386,719.0,660
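
When a plain substring is too broad, `filter` also takes a `regex` argument. The sketch below, on a hypothetical `toy` DataFrame with made-up values, keeps only the per-actor like counts and leaves out `movie_facebook_likes`, which `like="facebook"` would also match.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the lesson
toy = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, 40000.0],
    "actor_2_facebook_likes": [936.0, 5000.0],
    "movie_facebook_likes": [33000, 0],
    "duration": [178.0, 169.0],
})

# like: substring match on the column name
all_fb = toy.filter(like="facebook")
print(all_fb.shape[1])  # 3

# regex: the pattern is searched in each column name
actors = toy.filter(regex=r"^actor_\d+_facebook_likes$")
print(list(actors.columns))  # ['actor_1_facebook_likes', 'actor_2_facebook_likes']
```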


## 2. Selecting rows

Here we have the `iloc` indexer (selection by **i**nteger position) and the `loc` indexer (used above, selection by label).

Let's see how to use `iloc`:

In [11]:
# Select a row by its position ("iloc")
row_3 = df.iloc[3]
row_3

color                                                                    Color
director_name                                                Christopher Nolan
num_critic_for_reviews                                                   813.0
duration                                                                 164.0
director_facebook_likes                                                22000.0
actor_3_facebook_likes                                                 23000.0
actor_2_name                                                    Christian Bale
actor_1_facebook_likes                                                 27000.0
gross                                                              448130642.0
genres                                                         Action|Thriller
actor_1_name                                                         Tom Hardy
movie_title                                              The Dark Knight Rises
num_voted_users                                     

In [12]:
# When a single row is selected, Pandas returns a Series
type(row_3)

pandas.core.series.Series

In [13]:
# Select a group of CONSECUTIVE rows (within a range of positions)
rows_3_6 = df.iloc[3:7] # Selects rows 3 through 6 (the end position, 7, is excluded)
rows_3_6

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0


In [14]:
# When multiple rows are selected, Pandas returns another DataFrame
type(rows_3_6)

pandas.core.frame.DataFrame

In [15]:
# Select a group of NON-CONSECUTIVE rows
rows_no_consec = df.iloc[[3, 27, 8]]
rows_no_consec

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
27,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,407197282.0,Action|Adventure|Sci-Fi,...,1022.0,English,USA,PG-13,250000000.0,2016.0,19000.0,8.2,2.35,72000
8,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
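
In the movies dataset the row labels happen to coincide with the row positions, which hides the difference between `iloc` and `loc`. A small sketch with a hypothetical `toy` DataFrame whose labels are deliberately different from its positions makes the distinction visible:

```python
import pandas as pd

# Hypothetical toy data with labels that differ from the positions
toy = pd.DataFrame(
    {"imdb_score": [7.9, 7.1, 6.8, 8.5]},
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[1]   # second row, whatever its label is
by_label = toy.loc[20]      # row whose index label is 20

print(by_position["imdb_score"], by_label["imdb_score"])  # 7.1 7.1

# Slices also differ: iloc excludes the end position, loc includes the end label
print(len(toy.iloc[1:3]))   # 2 (positions 1 and 2)
print(len(toy.loc[20:40]))  # 3 (labels 20, 30 and 40)
```

This matters after filtering or dropping rows, because the remaining labels no longer match the positions.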


We can also use conditionals (as in NumPy) to select the rows that satisfy the given conditions.

In this case we use `loc`:

In [16]:
# Select only the rows where the "director_name" column is "Christopher Nolan"
df_CN = df.loc[df['director_name']=='Christopher Nolan']
df_CN

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
66,Color,Christopher Nolan,645.0,152.0,22000.0,11000.0,Heath Ledger,23000.0,533316061.0,Action|Crime|Drama|Thriller,...,4667.0,English,USA,PG-13,185000000.0,2008.0,13000.0,9.0,2.35,37000
96,Color,Christopher Nolan,712.0,169.0,22000.0,6000.0,Anne Hathaway,11000.0,187991439.0,Adventure|Drama|Sci-Fi,...,2725.0,English,USA,PG-13,165000000.0,2014.0,11000.0,8.6,2.35,349000
97,Color,Christopher Nolan,642.0,148.0,22000.0,23000.0,Tom Hardy,29000.0,292568851.0,Action|Adventure|Sci-Fi|Thriller,...,2803.0,English,USA,PG-13,160000000.0,2010.0,27000.0,8.8,2.35,175000
120,Color,Christopher Nolan,478.0,128.0,22000.0,11000.0,Liam Neeson,23000.0,205343774.0,Action|Adventure,...,2685.0,English,USA,PG-13,150000000.0,2005.0,14000.0,8.3,2.35,15000
1057,Color,Christopher Nolan,185.0,118.0,22000.0,319.0,Maura Tierney,14000.0,67263182.0,Drama|Mystery|Thriller,...,651.0,English,USA,R,46000000.0,2002.0,509.0,7.2,2.35,0
1222,Color,Christopher Nolan,341.0,130.0,22000.0,19000.0,Hugh Jackman,23000.0,53082743.0,Drama|Mystery|Sci-Fi|Thriller,...,1100.0,English,USA,PG-13,40000000.0,2006.0,20000.0,8.5,2.35,49000
3646,Black and White,Christopher Nolan,274.0,113.0,22000.0,379.0,Thomas Lennon,716.0,25530884.0,Mystery|Thriller,...,2067.0,English,USA,R,9000000.0,2000.0,651.0,8.5,2.35,40000


We can also combine multiple conditions inside `loc`. An example with OR:

In [17]:
# Select only the rows where the "director_name" column is "Christopher Nolan" or
# "Anthony Russo"
df_CN_AR = df.loc[(df['director_name']=='Christopher Nolan') | (df['director_name']=='Anthony Russo')]
df_CN_AR

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
27,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,407197282.0,Action|Adventure|Sci-Fi,...,1022.0,English,USA,PG-13,250000000.0,2016.0,19000.0,8.2,2.35,72000
66,Color,Christopher Nolan,645.0,152.0,22000.0,11000.0,Heath Ledger,23000.0,533316061.0,Action|Crime|Drama|Thriller,...,4667.0,English,USA,PG-13,185000000.0,2008.0,13000.0,9.0,2.35,37000
86,Color,Anthony Russo,576.0,136.0,94.0,2000.0,Chris Evans,19000.0,259746958.0,Action|Adventure|Sci-Fi,...,742.0,English,USA,PG-13,170000000.0,2014.0,11000.0,7.8,2.35,55000
96,Color,Christopher Nolan,712.0,169.0,22000.0,6000.0,Anne Hathaway,11000.0,187991439.0,Adventure|Drama|Sci-Fi,...,2725.0,English,USA,PG-13,165000000.0,2014.0,11000.0,8.6,2.35,349000
97,Color,Christopher Nolan,642.0,148.0,22000.0,23000.0,Tom Hardy,29000.0,292568851.0,Action|Adventure|Sci-Fi|Thriller,...,2803.0,English,USA,PG-13,160000000.0,2010.0,27000.0,8.8,2.35,175000
120,Color,Christopher Nolan,478.0,128.0,22000.0,11000.0,Liam Neeson,23000.0,205343774.0,Action|Adventure,...,2685.0,English,USA,PG-13,150000000.0,2005.0,14000.0,8.3,2.35,15000
888,Color,Anthony Russo,136.0,110.0,94.0,240.0,Billy Gardell,277.0,75604320.0,Comedy|Romance,...,195.0,English,USA,PG-13,54000000.0,2006.0,245.0,5.6,1.85,0
1057,Color,Christopher Nolan,185.0,118.0,22000.0,319.0,Maura Tierney,14000.0,67263182.0,Drama|Mystery|Thriller,...,651.0,English,USA,R,46000000.0,2002.0,509.0,7.2,2.35,0
1222,Color,Christopher Nolan,341.0,130.0,22000.0,19000.0,Hugh Jackman,23000.0,53082743.0,Drama|Mystery|Sci-Fi|Thriller,...,1100.0,English,USA,PG-13,40000000.0,2006.0,20000.0,8.5,2.35,49000
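
When the OR condition compares the same column against several values, `Series.isin` is a shorter way of writing it. The following is a sketch on a hypothetical `toy` DataFrame (the scores are illustrative only), checking that both forms select the same rows:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "director_name": ["Christopher Nolan", "Anthony Russo", "Sam Mendes"],
    "imdb_score": [8.5, 8.2, 6.8],
})

directors = ["Christopher Nolan", "Anthony Russo"]

# Chained OR, as in the cell above
with_or = toy.loc[
    (toy["director_name"] == directors[0]) | (toy["director_name"] == directors[1])
]

# Same selection with "isin"
with_isin = toy.loc[toy["director_name"].isin(directors)]

print(with_isin.equals(with_or))  # True
```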


And an example with AND, with conditionals applied to different columns:

In [18]:
# Select only the rows where the "director_name" column is "Christopher Nolan" and
# the "imdb_score" column is greater than 8.5
df_CN_imdb = df.loc[(df['director_name']=='Christopher Nolan') & (df['imdb_score']>8.5)]
df_CN_imdb

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
66,Color,Christopher Nolan,645.0,152.0,22000.0,11000.0,Heath Ledger,23000.0,533316061.0,Action|Crime|Drama|Thriller,...,4667.0,English,USA,PG-13,185000000.0,2008.0,13000.0,9.0,2.35,37000
96,Color,Christopher Nolan,712.0,169.0,22000.0,6000.0,Anne Hathaway,11000.0,187991439.0,Adventure|Drama|Sci-Fi,...,2725.0,English,USA,PG-13,165000000.0,2014.0,11000.0,8.6,2.35,349000
97,Color,Christopher Nolan,642.0,148.0,22000.0,23000.0,Tom Hardy,29000.0,292568851.0,Action|Adventure|Sci-Fi|Thriller,...,2803.0,English,USA,PG-13,160000000.0,2010.0,27000.0,8.8,2.35,175000
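
Each condition is a boolean Series, so it can be stored in a variable and combined with `&`, `|` and `~` (NOT). The sketch below uses a hypothetical `toy` DataFrame with made-up rows; naming the masks is only a readability choice, not something the lesson prescribes.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "director_name": ["Christopher Nolan", "Christopher Nolan", "Sam Mendes"],
    "imdb_score": [9.0, 7.2, 6.8],
})

is_nolan = toy["director_name"] == "Christopher Nolan"
high_score = toy["imdb_score"] > 8.5

nolan_high = toy.loc[is_nolan & high_score]    # both conditions hold
nolan_rest = toy.loc[is_nolan & ~high_score]   # Nolan, score NOT above 8.5

print(nolan_high.index.tolist())  # [0]
print(nolan_rest.index.tolist())  # [1]
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`.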


## 3. Selecting rows and columns

We can combine the techniques seen above to select **portions** (rows, columns) of a *DataFrame*.

The syntax depends on whether we use `iloc` or `loc`:

- With `iloc` the syntax is: `df.iloc[row_positions, column_positions]`
- With `loc` the syntax is: `df.loc[row_labels, column_labels]`

We can also keep using conditionals.

Let's first look at selecting rows and columns with `iloc`:

In [19]:
# Example 1: select rows 3 to 5 and columns 2 to 5 (iloc)
df_ejm_1 = df.iloc[3:6, 2:6]
df_ejm_1

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes
3,813.0,164.0,22000.0,23000.0
4,,,131.0,
5,462.0,132.0,475.0,530.0


And we can produce the same result using `loc`:

In [20]:
# Example 2: select the same portion as before, but with "loc"
row_names = [3, 4, 5]
col_names = ['num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes']
df_ejm_2 = df.loc[row_names, col_names]
df_ejm_2

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes
3,813.0,164.0,22000.0,23000.0
4,,,131.0,
5,462.0,132.0,475.0,530.0


The syntax is more compact with `iloc`.

But what if we don't know the positions of the columns but do know their names?

In that case we can:

- Use the `get_loc` method of the column index (`df.columns`) to obtain the positions of the columns of interest
- And then use `iloc` to select the desired portion

Let's look at each of these steps:

In [21]:
# Example 3: select rows 3 to 5 and from column "duration" through "language"

# Step 1: find the positions of the "duration" and "language" columns
start_col = df.columns.get_loc("duration")
end_col = df.columns.get_loc("language")

print(f'Position of column "duration": {start_col}')
print(f'Position of column "language": {end_col}')

Position of column "duration": 3
Position of column "language": 19


In [22]:
# Step 2: select the desired portion using "iloc"
# (end_col + 1 because the end of a positional slice is excluded)
df_ejm_3 = df.iloc[3:6, start_col:end_col + 1]
df_ejm_3

Unnamed: 0,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language
3,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English
4,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,
5,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,1873,Polly Walker,1.0,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English
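
Two variations on the examples in this section, sketched on a hypothetical `toy` DataFrame with made-up values. First, `loc` accepts label slices, so a range of named columns can be selected without looking up positions; the row part matches the `iloc` version here only because the toy index is the default 0, 1, 2, ... Second, the row selector can be a condition, which is the combination with conditionals mentioned earlier in this section.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the lesson
toy = pd.DataFrame({
    "director_name": ["A", "B", "C", "D", "E", "F"],
    "duration": [178.0, 169.0, 148.0, 164.0, 90.0, 132.0],
    "gross": [7.6e8, 3.1e8, 2.0e8, 4.5e8, 1.0e6, 7.3e7],
    "language": ["English"] * 6,
    "imdb_score": [7.9, 7.1, 6.8, 8.5, 7.1, 6.6],
})

# Label slices: rows 3..5 and columns "duration".."language", endpoints included
by_labels = toy.loc[3:5, "duration":"language"]

# Same portion via get_loc + iloc, as in the lesson
start = toy.columns.get_loc("duration")
end = toy.columns.get_loc("language")
by_position = toy.iloc[3:6, start:end + 1]

print(by_labels.equals(by_position))  # True

# A boolean condition for the rows combined with a list of columns
long_films = toy.loc[toy["duration"] > 160, ["director_name", "duration"]]
print(long_films.index.tolist())  # [0, 1, 3]
```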


## 4. Modifying columns and rows

Besides extracting rows, columns and portions of a *DataFrame*, Pandas lets us modify its columns and rows: we can rename them, delete them, or add a new row or column.

Let's look at these operations:

In [23]:
# Rename columns: change the name of the "duration" column to "duration (min)"
df.rename(columns = {"duration": "duration (min)"})

Unnamed: 0,color,director_name,num_critic_for_reviews,duration (min),director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [24]:
# However, the change does not persist: "rename" returned a new DataFrame and left df untouched
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [25]:
# Make the change on df itself using inplace=True
df.rename(columns = {"duration": "duration (min)"}, inplace = True)

In [26]:
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration (min),director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [27]:
# Another option is to overwrite the DataFrame with the result

# Let's change "duration (min)" back to "duration"
df = df.rename(columns = {'duration (min)': 'duration'})
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660
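
`rename` is not limited to one column at a time. As a sketch on a hypothetical `toy` DataFrame, the mapping can list several columns, or a function can be passed and applied to every column name. The new name `box_office` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the lesson
toy = pd.DataFrame({"duration": [178.0], "gross": [760505847.0], "title_year": [2009.0]})

# Several columns at once; names missing from the dict stay as they are
renamed = toy.rename(columns={"duration": "duration (min)", "gross": "box_office"})
print(list(renamed.columns))  # ['duration (min)', 'box_office', 'title_year']

# A function applied to every column name
upper = toy.rename(columns=str.upper)
print(list(upper.columns))    # ['DURATION', 'GROSS', 'TITLE_YEAR']

# The original is untouched because "rename" returned new DataFrames
print(list(toy.columns))      # ['duration', 'gross', 'title_year']
```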


In [28]:
# Delete columns with "drop"

# Delete the "duration" column
df_no_duration = df.drop(labels=['duration'], axis=1)
df_no_duration

Unnamed: 0,color,director_name,num_critic_for_reviews,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,Eric Mabius,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,Natalie Zea,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,Eva Boehnke,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,Alan Ruck,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [29]:
# We can get the same result using the "columns" parameter instead of "labels" + "axis"
df_no_duration = df.drop(columns="duration")
df_no_duration

Unnamed: 0,color,director_name,num_critic_for_reviews,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,Eric Mabius,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,Natalie Zea,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,Eva Boehnke,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,Alan Ruck,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [30]:
# Or, for example, delete only rows 6 to 8
etiquetas = [6, 7, 8]
df_norows_678 = df.drop(labels=etiquetas, axis=0)
df_norows_678.head(15)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
10,Color,Zack Snyder,673.0,183.0,0.0,2000.0,Lauren Cohan,15000.0,330249062.0,Action|Adventure|Sci-Fi,...,3018.0,English,USA,PG-13,250000000.0,2016.0,4000.0,6.9,2.35,197000
11,Color,Bryan Singer,434.0,169.0,0.0,903.0,Marlon Brando,18000.0,200069408.0,Action|Adventure|Sci-Fi,...,2367.0,English,USA,PG-13,209000000.0,2006.0,10000.0,6.1,2.35,0
12,Color,Marc Forster,403.0,106.0,395.0,393.0,Mathieu Amalric,451.0,168368427.0,Action|Adventure,...,1243.0,English,UK,PG-13,200000000.0,2008.0,412.0,6.7,2.35,0


In [31]:
# And we can get exactly the same result using the "index" parameter instead of
# "labels" + "axis"
df_norows_678 = df.drop(index=[6,7,8])
df_norows_678.head(15)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
10,Color,Zack Snyder,673.0,183.0,0.0,2000.0,Lauren Cohan,15000.0,330249062.0,Action|Adventure|Sci-Fi,...,3018.0,English,USA,PG-13,250000000.0,2016.0,4000.0,6.9,2.35,197000
11,Color,Bryan Singer,434.0,169.0,0.0,903.0,Marlon Brando,18000.0,200069408.0,Action|Adventure|Sci-Fi,...,2367.0,English,USA,PG-13,209000000.0,2006.0,10000.0,6.1,2.35,0
12,Color,Marc Forster,403.0,106.0,395.0,393.0,Mathieu Amalric,451.0,168368427.0,Action|Adventure,...,1243.0,English,UK,PG-13,200000000.0,2008.0,412.0,6.7,2.35,0
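The equivalence between the two forms of `drop` can be checked on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are invented for illustration and are not part of *peliculas.csv*.

```python
import pandas as pd

# Hypothetical toy data; only the index labels matter for this illustration
toy = pd.DataFrame({"title": list("abcdefghij")})

by_labels = toy.drop(labels=[6, 7, 8], axis=0)   # "labels" + "axis" form
by_index = toy.drop(index=[6, 7, 8])             # "index" form

# Both calls return the same new DataFrame...
same_result = by_labels.equals(by_index)
# ...and neither of them modifies the original
original_rows = len(toy)
# The surviving rows keep their original labels, so there is a gap after 5
remaining = by_index.index.tolist()

print(same_result, original_rows, remaining)
```

If consecutive labels are needed afterwards, `reset_index(drop=True)` renumbers the rows from 0.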


We can also add columns and rows to the *DataFrame*.

For example, to add a column we use the syntax `df['nueva_columna'] = valores`:

In [32]:
# Example: add the "net_profit" column, obtained by subtracting the movie's budget
# ("budget") from its gross revenue ("gross")
df['net_profit'] = df['gross'] - df['budget']
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,net_profit
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,523505847.0
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,9404152.0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,-44925825.0
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,198130642.0
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,12.0,7.1,,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,English,Canada,,,2013.0,470.0,7.7,,84,
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,English,USA,TV-14,,,593.0,7.5,16.00,32000,
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,English,USA,,1400.0,2013.0,0.0,6.3,,16,
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660,
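The output above shows empty `net_profit` cells wherever "gross" or "budget" is missing, because arithmetic with NaN yields NaN. The sketch below reproduces that on invented numbers; the `toy` frame and the `roi` column are hypothetical. It also shows `assign`, which returns a new *DataFrame* instead of modifying the existing one.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so that each row exercises a different case
toy = pd.DataFrame({
    "gross": [500.0, np.nan, 120.0],
    "budget": [200.0, 100.0, np.nan],
})

# Same subtraction as in the cell above: a NaN in either operand gives NaN
toy["net_profit"] = toy["gross"] - toy["budget"]

# assign() leaves "toy" untouched and returns a copy with the extra column
with_ratio = toy.assign(roi=toy["net_profit"] / toy["budget"])

print(toy)
print(with_ratio)
```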


We can also add rows to the *DataFrame*. A common way to do this is with `loc`, assigning the new values to an index label that does not exist yet:

In [33]:
# Define the values to add (one per column, in the same order and with compatible types)
valores = ['Color', 'Martin Scorsese', 47.0, 123, 83, 1260, 'Joe Pesci', 12800, 235000000,
           'Drama/Thriller', 'Robert De Niro', 'The Irish Man', 273, 13900, 'Al Pacino',
           5, 'mob, mafia', 'https://www.imdb.com/title/tt1302006/', 546,
           'English', 'USA', 'PG-16', 135000000, 2019, 62400, 7.8, 2.35, 13402, 100000000]

# Use "loc" to add the row under a new index label
df.loc[len(df.index)] = valores
df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,net_profit
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,523505847.0
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,9404152.0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,-44925825.0
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,198130642.0
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,12.0,7.1,,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,English,USA,TV-14,,,593.0,7.5,16.00,32000,
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,English,USA,,1400.0,2013.0,0.0,6.3,,16,
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660,
4915,Color,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,...,English,USA,PG,1100.0,2004.0,23.0,6.6,1.85,456,84122.0
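One caveat with `df.loc[len(df.index)]`: it only appends when that label is not already in use. That holds for `df`, whose labels run from 0 to n-1, but not for a frame with gaps in its index, such as `df_norows_678`. The sketch below uses an invented `toy` frame to show the overwrite and two possible alternatives; it is an illustration under those assumptions, not the only way to append rows.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b", "c", "d"]})   # labels 0, 1, 2, 3
trimmed = toy.drop(index=[1])                          # labels 0, 2, 3

# len(trimmed.index) is 3 and label 3 already exists, so this overwrites row "d"
overwritten = trimmed.copy()
overwritten.loc[len(overwritten.index)] = ["new"]

# A label past the current maximum does append a row
appended = trimmed.copy()
appended.loc[appended.index.max() + 1] = ["new"]

# pd.concat with ignore_index=True appends and renumbers the rows from 0
new_row = pd.DataFrame({"title": ["new"]})
concatenated = pd.concat([trimmed, new_row], ignore_index=True)

print(overwritten, appended, concatenated, sep="\n\n")
```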


## 5. Practical example

Using what we have seen so far, let's try the following:

1. Drop the columns that do not contain relevant information: "color", "facenumber_in_poster", "movie_imdb_link", "country", "aspect_ratio"
2. In the column names that contain the word "facebook", replace it with "fb" (to shorten those names)
3. Drop the rows with missing data (NaN)

In [34]:
# 1. Drop the columns "color", "facenumber_in_poster", "movie_imdb_link", "country", "aspect_ratio"
df_clean = df.drop(columns=["color", "facenumber_in_poster", "movie_imdb_link", "country", "aspect_ratio"])

print(f'Original dataset size: {df.shape}')
print(f'"Clean" dataset size: {df_clean.shape}')

Original dataset size: (4917, 29)
"Clean" dataset size: (4917, 24)
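By default, `drop` raises a `KeyError` if any of the requested labels is missing, which can happen when the same cleaning cell is run twice on the same variable. A minimal sketch of that behavior and of the `errors="ignore"` option, on an invented one-row frame:

```python
import pandas as pd

# Hypothetical frame that lacks the "aspect_ratio" column
toy = pd.DataFrame({"color": ["Color"], "country": ["USA"], "duration": [178.0]})
to_drop = ["color", "country", "aspect_ratio"]

# Default behavior (errors="raise"): a missing label aborts the whole call
try:
    toy.drop(columns=to_drop)
    raised = False
except KeyError:
    raised = True

# errors="ignore" drops the labels that exist and silently skips the rest
kept = toy.drop(columns=to_drop, errors="ignore").columns.tolist()

print(raised, kept)
```

The trade-off is that `errors="ignore"` also hides typos in column names, so the stricter default is often the safer choice.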


In [35]:
# 2. Build the new column names
# (str.replace returns names without "facebook" unchanged, so no "if" is needed)
cols_orig = df_clean.columns
cols_clean = [col.replace('facebook', 'fb') for col in cols_orig]

print('Original column -> Renamed column:')
for col_org, col_mod in zip(cols_orig, cols_clean):
    print(f'{col_org} -> {col_mod}')

Original column -> Renamed column:
director_name -> director_name
num_critic_for_reviews -> num_critic_for_reviews
duration -> duration
director_facebook_likes -> director_fb_likes
actor_3_facebook_likes -> actor_3_fb_likes
actor_2_name -> actor_2_name
actor_1_facebook_likes -> actor_1_fb_likes
gross -> gross
genres -> genres
actor_1_name -> actor_1_name
movie_title -> movie_title
num_voted_users -> num_voted_users
cast_total_facebook_likes -> cast_total_fb_likes
actor_3_name -> actor_3_name
plot_keywords -> plot_keywords
num_user_for_reviews -> num_user_for_reviews
language -> language
content_rating -> content_rating
budget -> budget
title_year -> title_year
actor_2_facebook_likes -> actor_2_fb_likes
imdb_score -> imdb_score
movie_facebook_likes -> movie_fb_likes
net_profit -> net_profit


In [36]:
# Assign the new names (with "facebook" replaced by "fb") to the DataFrame
df_clean.columns = cols_clean
df_clean

Unnamed: 0,director_name,num_critic_for_reviews,duration,director_fb_likes,actor_3_fb_likes,actor_2_name,actor_1_fb_likes,gross,genres,actor_1_name,...,plot_keywords,num_user_for_reviews,language,content_rating,budget,title_year,actor_2_fb_likes,imdb_score,movie_fb_likes,net_profit
0,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,avatar|future|marine|native|paraplegic,3054.0,English,PG-13,237000000.0,2009.0,936.0,7.9,33000,523505847.0
1,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,...,goddess|marriage ceremony|marriage proposal|pi...,1238.0,English,PG-13,300000000.0,2007.0,5000.0,7.1,0,9404152.0
2,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,...,bomb|espionage|sequel|spy|terrorist,994.0,English,PG-13,245000000.0,2015.0,393.0,6.8,85000,-44925825.0
3,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,...,deception|imprisonment|lawlessness|police offi...,2701.0,English,PG-13,250000000.0,2012.0,23000.0,8.5,164000,198130642.0
4,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,...,,,,,,,12.0,7.1,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4912,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,Natalie Zea,...,cult|fbi|hideout|prison escape|serial killer,359.0,English,TV-14,,,593.0,7.5,32000,
4913,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,Eva Boehnke,...,,3.0,English,,1400.0,2013.0,0.0,6.3,16,
4914,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,Alan Ruck,...,,9.0,English,PG-13,,2012.0,719.0,6.3,660,
4915,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,John August,...,actress name in title|crush|date|four word tit...,84.0,English,PG,1100.0,2004.0,23.0,6.6,456,84122.0
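The same renaming can also be written without building the list by hand. The sketch below shows two possible alternatives on an empty frame with invented column names: the `.str` accessor of the column index, and `rename` with a function.

```python
import pandas as pd

# Hypothetical frame: only the column names matter here
toy = pd.DataFrame(columns=["director_name", "director_facebook_likes", "movie_facebook_likes"])

# Option 1: string replacement applied to every column name at once
via_str = toy.columns.str.replace("facebook", "fb", regex=False).tolist()

# Option 2: rename() calls the function on each name and returns a new DataFrame
renamed = toy.rename(columns=lambda name: name.replace("facebook", "fb"))

print(via_str)
print(renamed.columns.tolist())
```

Unlike assigning to `df_clean.columns`, `rename` does not modify the original frame unless its result is assigned back.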


In [37]:
# 3. Drop the rows with missing data
df_clean = df_clean.dropna()
df_clean.shape

(3710, 24)

In [38]:
df_clean

Unnamed: 0,director_name,num_critic_for_reviews,duration,director_fb_likes,actor_3_fb_likes,actor_2_name,actor_1_fb_likes,gross,genres,actor_1_name,...,plot_keywords,num_user_for_reviews,language,content_rating,budget,title_year,actor_2_fb_likes,imdb_score,movie_fb_likes,net_profit
0,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,avatar|future|marine|native|paraplegic,3054.0,English,PG-13,237000000.0,2009.0,936.0,7.9,33000,523505847.0
1,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,...,goddess|marriage ceremony|marriage proposal|pi...,1238.0,English,PG-13,300000000.0,2007.0,5000.0,7.1,0,9404152.0
2,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,...,bomb|espionage|sequel|spy|terrorist,994.0,English,PG-13,245000000.0,2015.0,393.0,6.8,85000,-44925825.0
3,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,...,deception|imprisonment|lawlessness|police offi...,2701.0,English,PG-13,250000000.0,2012.0,23000.0,8.5,164000,198130642.0
5,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,...,alien|american civil war|male nipple|mars|prin...,738.0,English,PG-13,263700000.0,2012.0,632.0,6.6,24000,-190641321.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4907,Neill Dela Llana,35.0,80.0,0.0,0.0,Edgar Tancangco,0.0,70071.0,Thriller,Ian Gamazon,...,jihad|mindanao|philippines|security guard|squa...,35.0,English,Not Rated,7000.0,2005.0,0.0,6.3,74,63071.0
4908,Robert Rodriguez,56.0,81.0,0.0,6.0,Peter Marquardt,121.0,2040920.0,Action|Crime|Drama|Romance|Thriller,Carlos Gallardo,...,assassin|death|guitar|gun|mariachi,130.0,Spanish,R,7000.0,1992.0,20.0,6.9,0,2033920.0
4910,Edward Burns,14.0,95.0,0.0,133.0,Caitlin FitzGerald,296.0,4584.0,Comedy|Drama,Kerry Bishé,...,written and directed by cast member,14.0,English,Not Rated,9000.0,2011.0,205.0,6.4,413,-4416.0
4915,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,John August,...,actress name in title|crush|date|four word tit...,84.0,English,PG,1100.0,2004.0,23.0,6.6,456,84122.0
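Called without arguments, `dropna` removes every row that has at least one NaN in any column, which is why the row count falls from 4917 to 3710. Before discarding that much data it may help to count the missing values per column and to consider the `subset` and `how` parameters. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical values with NaN in different positions
toy = pd.DataFrame({
    "gross": [10.0, np.nan, 30.0, np.nan],
    "budget": [5.0, 8.0, np.nan, np.nan],
    "imdb_score": [7.9, 7.1, 6.8, 6.3],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum().to_dict()

rows_default = len(toy.dropna())                   # a NaN in any column removes the row
rows_subset = len(toy.dropna(subset=["gross"]))    # only NaN in "gross" removes the row
rows_all = len(toy.dropna(how="all"))              # removed only if every value is NaN

print(missing_per_column, rows_default, rows_subset, rows_all)
```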
