In this mission and the next, we're going to dive into some of Pandas' internals to better understand how Pandas does things under the hood.

The three key data structures in Pandas are:

•Series (collection of values)

•DataFrame (collection of Series objects)

•Panel (collection of DataFrame objects; deprecated and removed as of pandas 0.25)

and we'll be focusing on the Series object in this mission.

Series objects use NumPy arrays for fast computation, but build on them by adding valuable features for analyzing data. For example, while NumPy arrays use an integer index, Series objects can use other index types, like a string index. Series objects also allow for mixed data types and use NaN (the floating-point "not a number" value from NumPy) to represent missing values.
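
As a minimal sketch of these differences, the snippet below builds a NumPy array and a Series from the same values. The film labels and scores are made up for illustration and are not taken from the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration.
arr = np.array([74.0, 85.0, np.nan])
scores = pd.Series(arr, index=["film_a", "film_b", "film_c"])

print(arr[1])                # NumPy array: select by integer position
print(scores["film_b"])      # Series: select by string label
print(scores.isna().sum())   # count of missing (NaN) values
```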

A Series object can hold many data types, including:

•float - for representing float values

•int - for representing integer values

•bool - for representing Boolean values

•datetime64[ns] - for representing date & time, without time-zone

•datetime64[ns, tz] - for representing date & time, with time-zone

•timedelta64[ns] - for representing differences in dates & times (seconds, minutes, etc.)

•category - for representing categorical values

•object - for representing string values (and any other arbitrary Python objects)
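
The list above can be explored with a small sketch that builds one Series per kind of data and prints its dtype. All values are hypothetical, and the exact dtype strings printed (for example the time resolution of the datetime types) may vary with the installed pandas version.

```python
import pandas as pd

dates = pd.to_datetime(["2015-01-01", "2015-06-30"])

# One small, made-up Series per data type from the list above.
examples = {
    "float": pd.Series([1.5, 2.0]),
    "int": pd.Series([1, 2]),
    "bool": pd.Series([True, False]),
    "datetime": pd.Series(dates),
    "datetime_tz": pd.Series(dates.tz_localize("UTC")),
    "timedelta": pd.Series(pd.to_timedelta(["1 days", "2 hours"])),
    "category": pd.Series(["a", "b", "a"], dtype="category"),
    "object": pd.Series(["a", 1, None], dtype="object"),
}

for name, s in examples.items():
    print(name, s.dtype)
```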

In [2]:
import pandas as pd
fandango = pd.read_csv("dati\\fandango_score_comparison.csv")
fandango.head(2)

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5


DataFrames use Series objects to represent the columns in the data. When you select a single column from a DataFrame, Pandas will return the Series object representing that column. Each individual Series object in a DataFrame is indexed using the integer data type by default. Each value in the Series has a unique integer index, or position. The integer index is 0-indexed, like most Python data structures, and ranges from 0 to n-1, where n is the number of rows. With an integer index, you can select an individual value in the Series if you know its position, or select multiple values by passing in a list of index values (similar to a NumPy array).

For both NumPy arrays and Series objects, you can use the integer index with bracket notation to slice and select values. Where Series objects diverge from NumPy arrays, however, is in the ability to specify a custom index for the values.
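
A short sketch of integer-index selection on both structures, assuming a Series with the default 0 to n-1 index. The values are illustrative only.

```python
import numpy as np
import pandas as pd

arr = np.array([74, 85, 80, 18, 14])
s = pd.Series(arr)            # default integer index 0..4

single_from_array = arr[3]    # one value from the array
single_from_series = s[3]     # one value from the Series
several = s[[0, 2]]           # a list of index values returns a Series
sliced = s[1:3]               # a slice excludes the end position

print(single_from_array, single_from_series)
print(several)
print(sliced)
```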

In [4]:
fandango.columns

Index(['FILM', 'RottenTomatoes', 'RottenTomatoes_User', 'Metacritic',
       'Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Metacritic_user_vote_count', 'IMDB_user_vote_count',
       'Fandango_votes', 'Fandango_Difference'],
      dtype='object')

In [6]:
series_film = fandango["FILM"]
series_film.head()

0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object

In [4]:
series_rt = fandango["RottenTomatoes"]
series_rt.head()

0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64

We want to look up the score of a film by its name. To do that, we create a new Series (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series) whose index is the film name and whose values are the scores:

In [7]:
from pandas import Series
series_custom = Series(data = series_rt.values, index = series_film.values)
series_custom

Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
The Water Diviner (2015)                           63
Irrational Man (2015)                              42
Top Five (2014)                                    86
Shaun the Sheep Movie (2015)                       99
Love & Mercy (2015)                                89
Far From The Madding Crowd (2015)                  84
Black Sea (2015)                                   82
Leviathan (2014)                                   99
Unbroken (2014)                                    51
The Imitation Game (2014)                          90
Taken 3 (2015)                                      9
Ted 2 (2015)                                       46
Southpaw (2015)                                    59
Night at the Museum: Secret 

In [19]:
series_custom["Inside Out (2015)"]

98

In [21]:
series_custom[['Minions (2015)', 'Leviathan (2014)']]

Minions (2015)      54
Leviathan (2014)    99
dtype: int64

In [27]:
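# select by position explicitly; an integer in plain brackets on a string index relies on a deprecated fallback
series_custom.iloc[5]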

63

In [30]:
# .loc selects by index label; series_film still has the default integer index, so plain brackets give the same result
series_film[5]
series_film.loc[5]

'The Water Diviner (2015)'
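
The difference between label-based and position-based selection is easier to see on a Series whose labels and positions disagree. The sketch below uses made-up values; .loc and .iloc are the explicit pandas accessors for labels and positions respectively.

```python
import pandas as pd

# Hypothetical Series with a string index.
s = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])
by_label = s.loc["Film B"]     # label lookup
by_position = s.iloc[1]        # position lookup, same element here

# Hypothetical Series whose integer labels are not in positional order.
t = pd.Series([10, 20, 30], index=[2, 0, 1])
label_zero = t.loc[0]          # the value stored under label 0
position_zero = t.iloc[0]      # the first value in the Series

print(by_label, by_position, label_zero, position_zero)
```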

Reindexing is the Pandas way of modifying the alignment between labels (index) and the data (values). The reindex() method allows you to specify an alternate ordering of the labels (index) for a Series object. This method takes in a list of strings corresponding to the order of labels you'd like for that Series object. 

We can use the reindex() method to sort series_custom in alphabetical order by film. To accomplish this, we need to:

•return a list representation of the current index using tolist()

•sort the index using sorted()

•use reindex() to set the new ordered index

In [31]:
original_index = series_custom.index.tolist()

In [32]:
original_index

['Avengers: Age of Ultron (2015)',
 'Cinderella (2015)',
 'Ant-Man (2015)',
 'Do You Believe? (2015)',
 'Hot Tub Time Machine 2 (2015)',
 'The Water Diviner (2015)',
 'Irrational Man (2015)',
 'Top Five (2014)',
 'Shaun the Sheep Movie (2015)',
 'Love & Mercy (2015)',
 'Far From The Madding Crowd (2015)',
 'Black Sea (2015)',
 'Leviathan (2014)',
 'Unbroken (2014)',
 'The Imitation Game (2014)',
 'Taken 3 (2015)',
 'Ted 2 (2015)',
 'Southpaw (2015)',
 'Night at the Museum: Secret of the Tomb (2014)',
 'Pixels (2015)',
 'McFarland, USA (2015)',
 'Insidious: Chapter 3 (2015)',
 'The Man From U.N.C.L.E. (2015)',
 'Run All Night (2015)',
 'Trainwreck (2015)',
 'Selma (2014)',
 'Ex Machina (2015)',
 'Still Alice (2015)',
 'Wild Tales (2014)',
 'The End of the Tour (2015)',
 'Red Army (2015)',
 'When Marnie Was There (2015)',
 'The Hunting Ground (2015)',
 'The Boy Next Door (2015)',
 'Aloha (2015)',
 'The Loft (2015)',
 '5 Flights Up (2015)',
 'Welcome to Me (2015)',
 'Saint Laurent (2015)'

In [38]:
original_index = sorted(original_index)
sorted_series_custom = series_custom.reindex(original_index)
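
The same steps can be sketched on a tiny Series with hypothetical values. The last line also shows what reindex() does with a label that is not in the original index: the value for that label is filled with NaN.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
s = pd.Series([74, 85, 80], index=["Cinderella", "Ant-Man", "Avengers"])

# tolist() -> sorted() -> reindex(), as in the steps above.
alphabetical = s.reindex(sorted(s.index.tolist()))

# A label missing from the original index gets NaN as its value.
with_missing = s.reindex(["Ant-Man", "Not In Index"])

print(alphabetical)
print(with_missing)
```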

To make sorting easier, Pandas comes with a sort_index() method, which returns a Series sorted by the index, and a sort_values() method, which returns a Series sorted by the values. Since the values representing the Rotten Tomatoes scores are integers, sorting by values will sort in numerically ascending order (low to high) in our case.

In both cases, the link between each element's index (film name) and value (score) is preserved. This is known as data alignment and is a key tenet of Pandas that is incredibly important when analyzing data. Unless we specifically change a value or an index, Pandas allows us to assume the linking will be preserved.


In [44]:
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
print(sc2.head(10))
print(sc3.head(10))

'71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
Aloha (2015)                  19
American Sniper (2015)        72
American Ultra (2015)         46
Amy (2015)                    97
Annie (2014)                  27
dtype: int64
Paul Blart: Mall Cop 2 (2015)     5
Hitman: Agent 47 (2015)           7
Hot Pursuit (2015)                8
Fantastic Four (2015)             9
Taken 3 (2015)                    9
The Boy Next Door (2015)         10
The Loft (2015)                  11
Unfinished Business (2015)       11
Mortdecai (2015)                 12
Seventh Son (2015)               12
dtype: int64
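
A small sketch, with made-up labels and scores, showing that each label keeps its value whichever way the Series is sorted. The ascending=False argument of sort_values() reverses the order.

```python
import pandas as pd

s = pd.Series([74, 9, 99], index=["b", "c", "a"])  # hypothetical data

by_index = s.sort_index()                     # labels a, b, c
by_value = s.sort_values()                    # values 9, 74, 99
descending = s.sort_values(ascending=False)   # values 99, 74, 9

# The label "c" points at the same value in every version.
print(s["c"], by_index["c"], by_value["c"], descending["c"])
```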


Since Pandas builds on top of NumPy, it takes advantage of NumPy's vectorization capabilities, which run highly optimized, low-level loops over the values. You can use any of the standard Python arithmetic operators (+, -, *, and /) to transform every value in a Series object.

In [45]:
series_normalized = (series_custom/100)*5
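
To illustrate what the vectorized expression does, the sketch below compares it with an explicit Python loop on a small hypothetical Series. Both give the same values; the vectorized form is shorter and avoids looping in Python.

```python
import pandas as pd

s = pd.Series([74, 85, 18], index=["Film A", "Film B", "Film C"])  # made-up scores

# Vectorized: the operators apply to every value at once.
normalized = (s / 100) * 5

# Equivalent explicit loop, shown only for comparison.
looped = pd.Series([(v / 100) * 5 for v in s], index=s.index)

print(normalized)
print(looped)
```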

Pandas utilizes vectorized operations everywhere, including when filtering values within a single Series object or comparing 2 different Series objects.

To help make it easy to separate complex comparison and filtering logic into modular pieces, Pandas returns Boolean Series objects as the intermediate representation of the logic. We can specify filtering criteria in different variables and chain them together using the & operator, which represents and, as well as the | operator, representing or. Finally, we can utilize a Series object's bracket notation to pass in an expression representing a Boolean Series object to get back the filtered dataset.

In [59]:
series_custom[series_custom > 50]

Avengers: Age of Ultron (2015)                                             74
Cinderella (2015)                                                          85
Ant-Man (2015)                                                             80
The Water Diviner (2015)                                                   63
Top Five (2014)                                                            86
Shaun the Sheep Movie (2015)                                               99
Love & Mercy (2015)                                                        89
Far From The Madding Crowd (2015)                                          84
Black Sea (2015)                                                           82
Leviathan (2014)                                                           99
Unbroken (2014)                                                            51
The Imitation Game (2014)                                                  90
Southpaw (2015)                                                 
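
Combining criteria with & and | can be sketched as follows, using a small Series of hypothetical scores. Each comparison produces a Boolean Series, and inline comparisons need parentheses because & and | bind more tightly than < and >.

```python
import pandas as pd

s = pd.Series([74, 18, 99, 51], index=["A", "B", "C", "D"])  # made-up scores

# Each criterion is a Boolean Series stored in its own variable.
above_50 = s > 50
below_75 = s < 75

both = s[above_50 & below_75]       # greater than 50 AND less than 75
either = s[(s < 20) | (s > 90)]     # less than 20 OR greater than 90

print(both)
print(either)
```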

One of the core tenets of Pandas is data alignment. Series objects align along indices and DataFrame objects align along both indices and columns. This means that for Series objects, the link between the index labels and the actual values is implicitly preserved across operations and transformations unless we explicitly break the link. This core tenet allows us to use Pandas effectively when working with data and is a big advantage over just using NumPy objects. For Series objects in particular, this means we can use the standard Python arithmetic operators (+, -, *, and /) to add, subtract, multiply, and divide the values at each index label for two different Series objects.

In [8]:
# critic scores, user scores, and the mean of the two for each film
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2

In [9]:
rt_mean

FILM
Avengers: Age of Ultron (2015)                    80.0
Cinderella (2015)                                 82.5
Ant-Man (2015)                                    85.0
Do You Believe? (2015)                            51.0
Hot Tub Time Machine 2 (2015)                     21.0
The Water Diviner (2015)                          62.5
Irrational Man (2015)                             47.5
Top Five (2014)                                   75.0
Shaun the Sheep Movie (2015)                      90.5
Love & Mercy (2015)                               88.0
Far From The Madding Crowd (2015)                 80.5
Black Sea (2015)                                  71.0
Leviathan (2014)                                  89.0
Unbroken (2014)                                   60.5
The Imitation Game (2014)                         91.0
Taken 3 (2015)                                    27.5
Ted 2 (2015)                                      52.0
Southpaw (2015)               
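
Alignment matters most when the two Series do not share the same order or the same labels. In the sketch below (hypothetical labels and scores), values are matched by label rather than by position, and a label present in only one Series produces NaN in the result.

```python
import pandas as pd

critics = pd.Series([74, 85, 80], index=["A", "B", "C"])   # made-up scores
users = pd.Series([80, 90, 86], index=["C", "A", "D"])     # different order and labels

# Addition pairs values by label; "B" and "D" have no partner.
mean = (critics + users) / 2

print(mean)
```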