DataFrame objects were designed to make it easy to query and interact with many columns, each of which is represented as a Series object. We covered how Series objects work in the previous mission; in this mission, we'll learn how DataFrames build on Series objects to provide a powerful data analysis toolkit.

Series objects maintain data alignment between the index labels and the data values. Since DataFrame objects are, at their core, a collection of columns where each column is a Series, they also maintain alignment along both the columns and the rows. Pandas DataFrames use a shared row index across columns, which is an integer index by default. When you read in a CSV, Pandas enforces this shared row index so that every column ends up with the same number of elements: with the default settings, a row with too many fields typically raises a parsing error, while a row with too few fields is padded with missing values (NaN).
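
As a minimal sketch of this alignment, the snippet below builds a DataFrame from two Series whose labels appear in a different order. The Series names and values are made up for illustration and are not read from the Fandango file.

```python
import pandas as pd

# Two Series with the same labels in a different order (toy data)
critics = pd.Series([74, 85], index=["Avengers", "Cinderella"])
users = pd.Series([80, 86], index=["Cinderella", "Avengers"])

# Building a DataFrame aligns both columns on the shared row index,
# so each value stays attached to its own label
scores = pd.DataFrame({"critics": critics, "users": users})
print(scores)
```

Even though `users` lists Cinderella first, the value 86 still ends up in the Avengers row, because values are matched by label rather than by position.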

Whenever you call a method that returns or prints a DataFrame, the left-most column contains the values for the index. You can use the index attribute to access the values in the index directly as well. 
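
Here is a small sketch of the index attribute on a DataFrame that uses the default integer index. The `films` DataFrame is a hypothetical two-row stand-in for the full dataset.

```python
import pandas as pd

# A tiny stand-in DataFrame (two films and scores taken from the dataset)
films = pd.DataFrame({
    "FILM": ["Cinderella (2015)", "Ant-Man (2015)"],
    "RottenTomatoes": [85, 80],
})

# The index attribute exposes the row labels directly
row_labels = films.index
print(row_labels)
print(list(row_labels))
```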

In [1]:
import pandas as pd
fandango = pd.read_csv("dati\\fandango_score_comparison.csv")
fandango.head(2)

,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5


In [3]:
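# select the first and last rows by integer position
first_last = fandango.iloc[[0, fandango.shape[0] - 1]]

In [34]:
# with the default integer index, the labels match the positions, so loc returns the same rows
fandango.loc[[0, fandango.shape[0] - 1]]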

,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
145,"Kumiko, The Treasure Hunter (2015)",87,63,68,6.4,6.7,3.5,3.5,4.35,3.15,...,3.35,4.5,3.0,3.5,3.0,3.5,19,5289,41,0.0


In [4]:
first_last

,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
145,"Kumiko, The Treasure Hunter (2015)",87,63,68,6.4,6.7,3.5,3.5,4.35,3.15,...,3.35,4.5,3.0,3.5,3.0,3.5,19,5289,41,0.0


The DataFrame object has a set_index() method that takes the name of the column you'd like Pandas to use as the index for the DataFrame. By default, Pandas returns a new DataFrame that is indexed by the values in the specified column and drops that column from the DataFrame. The set_index() method has a few parameters to tweak this behavior:

•inplace: if set to True, sets the index on the current DataFrame instead of returning a new one

•drop: if set to False, keeps the column you specified for the index in the DataFrame
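
The following sketch contrasts the default behavior with the two parameters above. It uses a hypothetical two-row `films` DataFrame rather than the full dataset, and the variable names are ours.

```python
import pandas as pd

films = pd.DataFrame({
    "FILM": ["Cinderella (2015)", "Ant-Man (2015)"],
    "RottenTomatoes": [85, 80],
})

# Default: returns a new DataFrame and drops FILM from the columns
dropped = films.set_index("FILM")

# drop=False keeps FILM as a regular column in addition to the index
kept = films.set_index("FILM", drop=False)

# inplace=True modifies the DataFrame itself and returns None
in_place = films.copy()
result = in_place.set_index("FILM", inplace=True)

print(list(dropped.columns))
print(list(kept.columns))
print(result)
```

Note that `films` itself keeps its integer index in all three cases, since only the copy is modified in place.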


In [3]:
fandango_films = fandango.set_index("FILM", drop = False)

In [9]:
fandango_films.head(5)

FILM,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
Avengers: Age of Ultron (2015),Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
Cinderella (2015),Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
Ant-Man (2015),Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5
Do You Believe? (2015),Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
Hot Tub Time Machine 2 (2015),Hot Tub Time Machine 2 (2015),14,28,29,3.4,5.1,3.5,3.0,0.7,1.4,...,2.55,0.5,1.5,1.5,1.5,2.5,88,19560,1021,0.5


In [14]:
fandango_films.loc["Cinderella (2015)"]

FILM                          Cinderella (2015)
RottenTomatoes                               85
RottenTomatoes_User                          80
Metacritic                                   67
Metacritic_User                             7.5
IMDB                                        7.1
Fandango_Stars                                5
Fandango_Ratingvalue                        4.5
RT_norm                                    4.25
RT_user_norm                                  4
Metacritic_norm                            3.35
Metacritic_user_nom                        3.75
IMDB_norm                                  3.55
RT_norm_round                               4.5
RT_user_norm_round                            4
Metacritic_norm_round                       3.5
Metacritic_user_norm_round                    4
IMDB_norm_round                             3.5
Metacritic_user_vote_count                  249
IMDB_user_vote_count                      65709
Fandango_votes                          

In [15]:
fandango_films.index

Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
       'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
       'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
       'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
       ...
       'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
       'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
       'Mr. Holmes (2015)', ''71 (2015)', 'Two Days, One Night (2014)',
       'Gett: The Trial of Viviane Amsalem (2015)',
       'Kumiko, The Treasure Hunter (2015)'],
      dtype='object', name='FILM', length=146)

When you select multiple rows, a DataFrame is returned, but when you select an individual row, a Series object is returned instead. As with Series objects, Pandas keeps track of each row's integer position even if you specify a custom index, so you can still select rows by row number using iloc[].
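
A short sketch of these return types, using a hypothetical three-row `films` DataFrame indexed by film name:

```python
import pandas as pd

films = pd.DataFrame({
    "FILM": ["Cinderella (2015)", "Ant-Man (2015)", "Do You Believe? (2015)"],
    "RottenTomatoes": [85, 80, 18],
}).set_index("FILM", drop=False)

# One label returns a Series whose index is the column names
one_row = films.loc["Cinderella (2015)"]

# A list of labels returns a DataFrame
two_rows = films.loc[["Cinderella (2015)", "Ant-Man (2015)"]]

# Positional selection still works with a custom index
by_position = films.iloc[0]

print(type(one_row).__name__, type(two_rows).__name__)
print(by_position["FILM"])
```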

In [33]:
location = ["The Lazarus Effect (2015)", "Gett: The Trial of Viviane Amsalem (2015)","Mr. Holmes (2015)"]
best_movies_ever = fandango_films.loc[location]

In [38]:
fandango_films.loc[:,"RottenTomatoes"]

FILM
Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
The Water Diviner (2015)                           63
Irrational Man (2015)                              42
Top Five (2014)                                    86
Shaun the Sheep Movie (2015)                       99
Love & Mercy (2015)                                89
Far From The Madding Crowd (2015)                  84
Black Sea (2015)                                   82
Leviathan (2014)                                   99
Unbroken (2014)                                    51
The Imitation Game (2014)                          90
Taken 3 (2015)                                      9
Ted 2 (2015)                                       46
Southpaw (2015)                                    59
Night at the Museum: Se

The apply() method in Pandas allows us to specify Python logic that we want evaluated over Series objects in a DataFrame. Recall that rows and columns are both represented as Series objects in a DataFrame. Here are some examples of what we can accomplish using the apply() method:

•calculate the standard deviations for each numeric column

•lower-case all film names in the FILM column

The apply() method requires you to pass in a function that accepts a Series object, typically a vectorized operation. By default, the method runs over the DataFrame's columns, but you can use the axis parameter to change this (we'll use this later). If the function returns a single value for each Series (e.g. the NumPy std() function), a Series object will be returned containing the computed value for each column. If it instead returns a value for each element (e.g. multiplying or dividing by 2), a DataFrame will be returned with the transformation made over all the values.
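
Before working with the full dataset, here is a minimal sketch of the two return types on toy data, plus the lower-casing example from the list above. The `ratings` and `names` DataFrames are hypothetical stand-ins, and using the `.str.lower()` accessor inside apply() is just one way to do the lower-casing.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "IMDB": [7.8, 7.1, 7.8],
    "Fandango_Stars": [5.0, 5.0, 5.0],
})

# A function that reduces each column to one value yields a Series
per_column = ratings.apply(lambda col: np.std(col))

# A function that returns one value per element yields a DataFrame
halved = ratings.apply(lambda col: col / 2)

# Lower-case all film names; double brackets keep FILM as a DataFrame
names = pd.DataFrame({"FILM": ["Cinderella (2015)", "Ant-Man (2015)"]})
lowered = names.apply(lambda col: col.str.lower())

print(type(per_column).__name__, type(halved).__name__)
print(lowered)
```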

In [4]:
import numpy as np

# returns the data types as a Series
types = fandango_films.dtypes
# filter the data types to just floats; the index attribute holds the column names
float_columns = types[types.values == 'float64'].index
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]

# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))

print(deviations)

Metacritic_User               1.505529
IMDB                          0.955447
Fandango_Stars                0.538532
Fandango_Ratingvalue          0.501106
RT_norm                       1.503265
RT_user_norm                  0.997787
Metacritic_norm               0.972522
Metacritic_user_nom           0.752765
IMDB_norm                     0.477723
RT_norm_round                 1.509404
RT_user_norm_round            1.003559
Metacritic_norm_round         0.987561
Metacritic_user_norm_round    0.785412
IMDB_norm_round               0.501043
Fandango_Difference           0.152141
dtype: float64


Since the NumPy std() function returns a single computed value when applied over a Series, the apply() method in the previous code cell returned a single value for each column. If you instead use a function or operation that returns a value for each element in a Series (rather than a single computed value for the whole Series), you can transform all of the values in each column and get back a DataFrame with the new values.

In [40]:
float_df.apply(lambda x: x/2)

FILM,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,Metacritic_norm,Metacritic_user_nom,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Fandango_Difference
Avengers: Age of Ultron (2015),3.55,3.90,2.50,2.25,1.850,2.150,1.650,1.775,1.950,1.75,2.25,1.75,1.75,2.00,0.25
Cinderella (2015),3.75,3.55,2.50,2.25,2.125,2.000,1.675,1.875,1.775,2.25,2.00,1.75,2.00,1.75,0.25
Ant-Man (2015),4.05,3.90,2.50,2.25,2.000,2.250,1.600,2.025,1.950,2.00,2.25,1.50,2.00,2.00,0.25
Do You Believe? (2015),2.35,2.70,2.50,2.25,0.450,2.100,0.550,1.175,1.350,0.50,2.00,0.50,1.25,1.25,0.25
Hot Tub Time Machine 2 (2015),1.70,2.55,1.75,1.50,0.350,0.700,0.725,0.850,1.275,0.25,0.75,0.75,0.75,1.25,0.25
The Water Diviner (2015),3.40,3.60,2.25,2.00,1.575,1.550,1.250,1.700,1.800,1.50,1.50,1.25,1.75,1.75,0.25
Irrational Man (2015),3.80,3.45,2.00,1.75,1.050,1.325,1.325,1.900,1.725,1.00,1.25,1.25,2.00,1.75,0.25
Top Five (2014),3.40,3.25,2.00,1.75,2.150,1.600,2.025,1.700,1.625,2.25,1.50,2.00,1.75,1.75,0.25
Shaun the Sheep Movie (2015),4.40,3.70,2.25,2.00,2.475,2.050,2.025,2.200,1.850,2.50,2.00,2.00,2.25,1.75,0.25
Love & Mercy (2015),4.25,3.90,2.25,2.00,2.225,2.175,2.000,2.125,1.950,2.25,2.25,2.00,2.25,2.00,0.25


To apply a function over the rows (each row will be treated as a Series object) in a DataFrame, we need to set the axis parameter to 1 after we specify the function we want to apply. 
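
To make the difference between the two axes concrete, the sketch below applies the same function both ways to a two-row DataFrame. The `user_scores` name is ours, and the values are the first two films' user scores from the dataset.

```python
import numpy as np
import pandas as pd

user_scores = pd.DataFrame(
    {"RT_user_norm": [4.3, 4.0], "Metacritic_user_nom": [3.55, 3.75]},
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)"],
)

# Default axis: the function receives each column, so we get one value per column
column_means = user_scores.apply(lambda col: np.mean(col))

# axis=1: the function receives each row, so we get one value per film
row_means = user_scores.apply(lambda row: np.mean(row), axis=1)

print(column_means)
print(row_means)
```

The result of the column-wise call is indexed by column name, while the result of the row-wise call is indexed by film.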

In [5]:
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.mean(x), axis=1)

FILM
Avengers: Age of Ultron (2015)                    3.925
Cinderella (2015)                                 3.875
Ant-Man (2015)                                    4.275
Do You Believe? (2015)                            3.275
Hot Tub Time Machine 2 (2015)                     1.550
The Water Diviner (2015)                          3.250
Irrational Man (2015)                             3.225
Top Five (2014)                                   3.300
Shaun the Sheep Movie (2015)                      4.250
Love & Mercy (2015)                               4.300
Far From The Madding Crowd (2015)                 3.800
Black Sea (2015)                                  3.150
Leviathan (2014)                                  3.775
Unbroken (2014)                                   3.375
The Imitation Game (2014)                         4.350
Taken 3 (2015)                                    2.300
Ted 2 (2015)                                      3.075
Southpaw (2015)                            

In [4]:
fandango.sort_values(by = 'RottenTomatoes', inplace = False)

,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
133,Paul Blart: Mall Cop 2 (2015),5,36,13,2.4,4.3,3.5,3.5,0.25,1.80,...,2.15,0.5,2.0,0.5,1.0,2.0,211,15004,3054,0.0
105,Hitman: Agent 47 (2015),7,49,28,3.3,5.9,4.0,3.9,0.35,2.45,...,2.95,0.5,2.5,1.5,1.5,3.0,67,4260,917,0.1
53,Hot Pursuit (2015),8,37,31,3.7,4.9,4.0,3.7,0.40,1.85,...,2.45,0.5,2.0,1.5,2.0,2.5,78,17061,2618,0.3
48,Fantastic Four (2015),9,20,27,2.5,4.0,3.0,2.7,0.45,1.00,...,2.00,0.5,1.0,1.5,1.5,2.0,421,39838,6288,0.3
15,Taken 3 (2015),9,46,26,4.6,6.1,4.5,4.1,0.45,2.30,...,3.05,0.5,2.5,1.5,2.5,3.0,240,104235,6757,0.4
33,The Boy Next Door (2015),10,35,30,5.5,4.6,4.0,3.6,0.50,1.75,...,2.30,0.5,2.0,1.5,3.0,2.5,75,19658,2800,0.4
35,The Loft (2015),11,40,24,2.4,6.3,4.0,3.6,0.55,2.00,...,3.15,0.5,2.0,1.0,1.0,3.0,80,21319,811,0.4
60,Unfinished Business (2015),11,27,32,3.8,5.4,3.5,3.2,0.55,1.35,...,2.70,0.5,1.5,1.5,2.0,2.5,39,14346,821,0.3
59,Mortdecai (2015),12,30,27,3.2,5.5,3.5,3.2,0.60,1.50,...,2.75,0.5,1.5,1.5,1.5,3.0,144,31878,1196,0.3
58,Seventh Son (2015),12,35,30,3.9,5.5,3.5,3.2,0.60,1.75,...,2.75,0.5,2.0,1.5,2.0,3.0,126,41177,1213,0.3
