# Pandas internals: Series
https://www.dataquest.io/mission/146/pandas-internals-series

The three key data structures in Pandas are:
- Series (collection of values)
- DataFrame (collection of Series objects)
- Panel (collection of DataFrame objects; deprecated in pandas 0.20 and removed in 0.25)

This exercise explores pandas Series objects in more detail, using the Fandango dataset from https://github.com/fivethirtyeight/data/tree/master/fandango

> Series objects use NumPy arrays for fast computation, but build on them by adding valuable features for analyzing data. For example, while NumPy arrays utilize an integer index, Series objects can utilize other index types, like a string index. Series objects also allow for mixed data types and utilize the NaN Python value for handling missing values. A Series object can hold many data types
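
The features listed in the quote can be tried on a toy Series. This is a minimal sketch with made-up film names and scores (not taken from the Fandango file); it shows a string index, a missing value stored as NaN, and a Series holding mixed types.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
scores = pd.Series([74, 85, np.nan], index=["Film A", "Film B", "Film C"])

print(scores["Film B"])     # label-based lookup through the string index
print(scores.isna().sum())  # number of missing values in the Series
print(scores.dtype)         # float64, because NaN is a floating-point value

# A Series with mixed value types falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```

Note that the integer scores are stored as floats here: NaN is a float, so a numeric Series containing it cannot keep the default int64 dtype.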

In [29]:
import pandas as pd
fandango = pd.read_csv("data/fandango_score_comparison.csv") 
fandango.head(2)

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5


> DataFrames use Series objects to represent the columns in the data. When you select a single column from a DataFrame, Pandas will return the Series object representing that column. Each individual Series object in a DataFrame is indexed using the integer data type by default. Each value in the Series has a unique integer index, or position. The integer index is 0-indexed, like most Python data structures, and ranges from 0 to n-1, where n is the number of rows. With an integer index, you can select an individual value in the Series if you know its position as well as select multiple values by passing in a list of index values (similar to a NumPy array).

> For both NumPy arrays and Series objects, you can utilize integer index by using bracket notation to slice and select values. Where Series objects diverge from NumPy arrays, however, is the ability to specify a custom index for the values.
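
As a small illustration of the default integer index, the sketch below builds a Series from the first five Rotten Tomatoes scores shown later in these notes and selects one value, several values, and a slice. The variable names are hypothetical.

```python
import pandas as pd

# With no index argument, pandas assigns the default 0..n-1 integer index.
s = pd.Series([74, 85, 80, 18, 14])
print(s.index)

single = s[3]           # one value, looked up by its integer label
several = s[[0, 2, 4]]  # several values, by passing a list of labels
window = s.iloc[1:3]    # positional slice; the end position is excluded

print(single)
print(several.tolist())
print(window.tolist())
```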

> To explore this idea further, let's use two Series objects representing the film names and Rotten Tomatoes scores.

In [13]:
series_film = fandango['FILM'] 
print(series_film.head(5))
series_rt = fandango['RottenTomatoes']
print(series_rt.head(5))

0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64


In [30]:
film_names = series_film.values
rt_scores = series_rt.values
series_custom = pd.Series(index=film_names, data=rt_scores)
series_custom[['Minions (2015)', 'Leviathan (2014)']]

Minions (2015)      54
Leviathan (2014)    99
dtype: int64

> Even though we specified that the Series object uses a custom, string index, the object still maintains an internal integer index that we can use for selection. In this way, Series objects act both like a dictionary and a list since we can access values using our custom index (like the keys in a dictionary) or the integer index (like the index in a list).
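
The dictionary-like and list-like access described in the quote can be made explicit with `.loc` (labels) and `.iloc` (positions). This is a sketch on toy data with hypothetical film names, not the notebook's `series_custom`.

```python
import pandas as pd

s = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])

by_label = s.loc["Film B"]   # dictionary-style lookup by custom label
by_position = s.iloc[1]      # list-style lookup by integer position

label_slice = s.loc["Film A":"Film B"]  # label slices include both endpoints
position_slice = s.iloc[0:2]            # position slices exclude the end

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Using `.loc` and `.iloc` avoids ambiguity about whether a key is a label or a position. Recent pandas releases warn when a plain integer key such as `s[1]` is used on a Series with a non-integer index, so the explicit accessors are the safer choice.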

In [31]:
#example of using slice addressing on series objects
series_custom[5:10]

The Water Diviner (2015)        63
Irrational Man (2015)           42
Top Five (2014)                 86
Shaun the Sheep Movie (2015)    99
Love & Mercy (2015)             89
dtype: int64

We can use the reindex() method to sort series_custom in alphabetical order by film. To accomplish this, we need to:

- return a list representation of the current index using tolist()
- sort the index using sorted()
- use reindex() to set the new ordered index

In [67]:
original_index = series_custom.index.tolist()
t = sorted(original_index)
sorted_by_index = series_custom.reindex(t)
sorted_by_index.head()

'71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
dtype: int64
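
The same three steps can be tried without the CSV file. The sketch below uses three titles that appear in the notes with shortened names; it also shows one behavior worth knowing, namely that reindex() fills in NaN for any label that is not in the original index.

```python
import pandas as pd

s = pd.Series([54, 99, 40], index=["Minions", "Leviathan", "A Little Chaos"])

# tolist() -> sorted() -> reindex(), as in the steps above.
ordered = s.reindex(sorted(s.index.tolist()))
print(ordered)

# A label missing from the original index gets NaN (and the dtype becomes float64).
with_missing = s.reindex(["Minions", "Unknown Film"])
print(with_missing)
```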

Pandas comes with a sort_index() method, which returns a Series sorted by the index, and a sort_values() method, which returns a Series sorted by the values.

In both cases, the link between each element's index (film name) and value (score) is preserved. This label-to-value link is the basis of what Pandas calls data alignment, a key tenet of the library that is incredibly important when analyzing data. Unless we specifically change a value or an index, Pandas allows us to assume the linking will be preserved.
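
A quick way to convince yourself that sorting keeps each label attached to its value is to compare lookups before and after sorting. This is a sketch on toy data with shortened film names.

```python
import pandas as pd

s = pd.Series([54, 99, 40], index=["Minions", "Leviathan", "A Little Chaos"])

by_label = s.sort_index()    # ordered alphabetically by film name
by_value = s.sort_values()   # ordered from lowest to highest score

# Every film still maps to the same score in both sorted copies.
link_kept = all(
    by_label[name] == s[name] and by_value[name] == s[name] for name in s.index
)
print(by_value)
print(link_kept)
```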

In [42]:
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
print(sc2[:5])
print(sc3[-5:])

'71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
dtype: int64
Song of the Sea (2014)                        99
Phoenix (2015)                                99
Selma (2014)                                  99
Seymour: An Introduction (2015)              100
Gett: The Trial of Viviane Amsalem (2015)    100
dtype: int64


To modify a series, instead of looping through each value or index in the series object, you can use any of the standard Python arithmetic operators `(+, -, *, and /)` to transform every value in a Series object. 

For example, if we wanted to transform the Rotten Tomatoes scores from a 0 to 100 point scale to a 0 to 10 scale, we can use the Python / division operator to divide the Series by 10: `series_custom/10`
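
To see that the operator form does the same work as an explicit loop, the sketch below computes the 0 to 10 scale both ways on toy data and compares the results. The film names are hypothetical.

```python
import pandas as pd

s = pd.Series([74, 85, 18], index=["Film A", "Film B", "Film C"])

# Vectorized: the division is applied to every value at once.
scaled = s / 10

# Equivalent loop, written out only for comparison.
looped = pd.Series([value / 10 for value in s], index=s.index)

print(scaled)
print(scaled.equals(looped))
```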


In [43]:
# to normalize scores from a 100 point to a 5 point scale, divide every value by 20
series_normalized = series_custom / 20
series_normalized.head()

Avengers: Age of Ultron (2015)    3.70
Cinderella (2015)                 4.25
Ant-Man (2015)                    4.00
Do You Believe? (2015)            0.90
Hot Tub Time Machine 2 (2015)     0.70
dtype: float64

In [46]:
series_custom[series_custom > 50].head()

Avengers: Age of Ultron (2015)    74
Cinderella (2015)                 85
Ant-Man (2015)                    80
The Water Diviner (2015)          63
Top Five (2014)                   86
dtype: int64

In [47]:
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
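
The cell above produces no output, so here is a small sketch of the same pattern on toy data. A comparison on a Series returns a Boolean Series, and `&` combines two such masks element by element. The labels are hypothetical.

```python
import pandas as pd

s = pd.Series([74, 85, 63, 42], index=["A", "B", "C", "D"])

# Parentheses are required when the comparisons are written inline,
# because & binds more tightly than > and < in Python.
mask = (s > 50) & (s < 75)
selected = s[mask]

print(mask.tolist())
print(selected)
```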

In [55]:
rt_critics = pd.Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = pd.Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])

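# the long way: wrap the result in a new Series (redundant, since the arithmetic already returns a Series)
rt_mean = pd.Series((rt_critics+rt_users)/2, index=rt_users.index)
# the short way: arithmetic on two Series matches values by index label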
rt_mean = (rt_critics + rt_users) / 2
rt_mean.head()

FILM
Avengers: Age of Ultron (2015)    80.0
Cinderella (2015)                 82.5
Ant-Man (2015)                    85.0
Do You Believe? (2015)            51.0
Hot Tub Time Machine 2 (2015)     21.0
dtype: float64
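
The short form works because arithmetic between two Series matches values by label, not by position. The sketch below uses toy Series whose labels are in a different order and only partly overlap, to show what that means in practice. All names and values are invented for illustration.

```python
import pandas as pd

critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([80, 86, 50], index=["Film B", "Film A", "Film C"])

# Values are paired by label: Film A with Film A, Film B with Film B.
# Film C has no critic score, so its mean is NaN.
mean = (critics + users) / 2
print(mean)
```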

In [72]:
r = pd.Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
r.head()

FILM
Avengers: Age of Ultron (2015)    74
Cinderella (2015)                 85
Ant-Man (2015)                    80
Do You Believe? (2015)            18
Hot Tub Time Machine 2 (2015)     14
dtype: int64

# Pandas internals: DataFrames
https://www.dataquest.io/mission/147/pandas-internals-data-frames

In [3]:
import pandas as pd
fandango = pd.read_csv('data/fandango_score_comparison.csv')

print(fandango.head(2))
print('-------')
print(fandango.index)

                             FILM  RottenTomatoes  RottenTomatoes_User  \
0  Avengers: Age of Ultron (2015)              74                   86   
1               Cinderella (2015)              85                   80   

   Metacritic  Metacritic_User  IMDB  Fandango_Stars  Fandango_Ratingvalue  \
0          66              7.1   7.8             5.0                   4.5   
1          67              7.5   7.1             5.0                   4.5   

   RT_norm  RT_user_norm         ...           IMDB_norm  RT_norm_round  \
0     3.70           4.3         ...                3.90            3.5   
1     4.25           4.0         ...                3.55            4.5   

   RT_user_norm_round  Metacritic_norm_round  Metacritic_user_norm_round  \
0                 4.5                    3.5                         3.5   
1                 4.0                    3.5                         4.0   

   IMDB_norm_round  Metacritic_user_vote_count  IMDB_user_vote_count  \
0              

In [22]:
# a df containing just the first and last row of fandango
print(fandango.shape) # prints (number of rows, number of columns)

# the .shape attribute gives the row count, so the last row sits at position shape[0] - 1
last_row = fandango.shape[0] - 1 # this is the index of the last row

first_last = fandango.iloc[[0, last_row]]
first_last

(146, 22)


Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
145,"Kumiko, The Treasure Hunter (2015)",87,63,68,6.4,6.7,3.5,3.5,4.35,3.15,...,3.35,4.5,3.0,3.5,3.0,3.5,19,5289,41,0.0
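
As a side note, `iloc` also accepts negative positions, so the last row can be selected without computing its position from `.shape`. This is a sketch on a toy DataFrame with hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({"score": [74, 85, 87]}, index=["Film A", "Film B", "Film C"])

via_shape = df.iloc[[0, df.shape[0] - 1]]  # position computed from the row count
via_negative = df.iloc[[0, -1]]            # -1 means the last position

print(via_negative)
print(via_shape.equals(via_negative))
```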


Use the Pandas DataFrame method set_index() to make the FILM column the custom index, passing drop=False so that FILM also remains as a regular column. set_index() returns a new DataFrame, so assign the result to fandango_films and leave the original fandango unchanged.
Display the index of fandango_films by printing its index attribute.
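
The effect of `drop=False` is easiest to see next to the default behavior. This is a minimal sketch on a two-row toy DataFrame, with invented film names.

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["Film A", "Film B"], "RottenTomatoes": [74, 85]})

kept = df.set_index("FILM", drop=False)  # FILM is both the index and a column
dropped = df.set_index("FILM")           # default drop=True removes the column

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(df.index.tolist())  # the original DataFrame keeps its integer index
```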

In [27]:
fandango_films = fandango.set_index('FILM', drop=False)
print(fandango_films.index)

Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
       'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
       'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
       'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
       ...
       'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
       'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
       'Mr. Holmes (2015)', "'71 (2015)", 'Two Days, One Night (2014)',
       'Gett: The Trial of Viviane Amsalem (2015)',
       'Kumiko, The Treasure Hunter (2015)'],
      dtype='object', name='FILM', length=146)


In [32]:
m = ['The Lazarus Effect (2015)', 'Gett: The Trial of Viviane Amsalem (2015)', 'Mr. Holmes (2015)']
best_movies_ever = fandango_films.loc[m]
best_movies_ever

FILM,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
The Lazarus Effect (2015),The Lazarus Effect (2015),14,23,31,4.9,5.2,3.0,3.0,0.7,1.15,...,2.6,0.5,1.0,1.5,2.5,2.5,62,17691,1651,0.0
Gett: The Trial of Viviane Amsalem (2015),Gett: The Trial of Viviane Amsalem (2015),100,81,90,7.3,7.8,3.5,3.5,5.0,4.05,...,3.9,5.0,4.0,4.5,3.5,4.0,19,1955,59,0.0
Mr. Holmes (2015),Mr. Holmes (2015),87,78,67,7.9,7.4,4.0,4.0,4.35,3.9,...,3.7,4.5,4.0,3.5,4.0,3.5,33,7367,1348,0.0
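
With a custom index in place, `.loc` accepts a single label, a list of labels, or a label slice. The sketch below shows all three on a toy DataFrame with hypothetical film names.

```python
import pandas as pd

df = pd.DataFrame(
    {"RottenTomatoes": [74, 85, 80]},
    index=["Film A", "Film B", "Film C"],
)

picked = df.loc[["Film C", "Film A"]]  # list of labels: rows come back in the order asked
one_row = df.loc["Film B"]             # single label: returns that row as a Series
span = df.loc["Film A":"Film B"]       # label slice: both endpoints are included

print(picked)
print(type(one_row).__name__)
print(span.index.tolist())
```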


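The apply() method: by default, apply() calls the given function once per column, passing each column in as a Series.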

In [36]:
import numpy as np

# returns the data types as a Series
types = fandango_films.dtypes
# filter data types to just floats; the index attribute returns just the column names
float_columns = types[types.values == 'float64'].index
print(float_columns)
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
print(float_df.head(2))

# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))

print(deviations)

Index(['Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Fandango_Difference'],
      dtype='object')
                                Metacritic_User  IMDB  Fandango_Stars  \
FILM                                                                    
Avengers: Age of Ultron (2015)              7.1   7.8             5.0   
Cinderella (2015)                           7.5   7.1             5.0   

                                Fandango_Ratingvalue  RT_norm  RT_user_norm  \
FILM                                                                          
Avengers: Age of Ultron (2015)                   4.5     3.70           4.3   
Cinderella (2015)                                4.5     4.25           4.0   

                                Metacritic_no
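
Two details of the cell above are worth a small experiment. First, `select_dtypes()` is a built-in alternative to filtering `dtypes` by hand. Second, `np.std` computes the population standard deviation (ddof=0), while the DataFrame's own `std()` method defaults to the sample version (ddof=1), so the two give different numbers. This is a sketch on toy data, with invented column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D"],  # object column
    "votes": [10, 20, 30, 40],                         # int64 column
    "score": [1.0, 2.0, 3.0, 4.0],                     # float64 column
})

# Keep only the float64 columns.
float_df = df.select_dtypes(include="float64")

pop_std = float_df.apply(lambda col: np.std(col))  # population std, ddof=0
sample_std = float_df.std()                        # sample std, ddof=1

print(float_df.columns.tolist())
print(pop_std)
print(sample_std)
```

If the sample standard deviation is what you need, `float_df.std()` is enough; passing `ddof=0` to `std()` should reproduce the `np.std` result.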

In [40]:
double_df = float_df.apply(lambda x: x*2)
print(double_df.head(1))

halved_df = float_df.apply(lambda x: x/2)
print(halved_df.head(1))

                                Metacritic_User  IMDB  Fandango_Stars  \
FILM                                                                    
Avengers: Age of Ultron (2015)             14.2  15.6            10.0   

                                Fandango_Ratingvalue  RT_norm  RT_user_norm  \
FILM                                                                          
Avengers: Age of Ultron (2015)                   9.0      7.4           8.6   

                                Metacritic_norm  Metacritic_user_nom  \
FILM                                                                   
Avengers: Age of Ultron (2015)              6.6                  7.1   

                                IMDB_norm  RT_norm_round  RT_user_norm_round  \
FILM                                                                           
Avengers: Age of Ultron (2015)        7.8            7.0                 9.0   

                                Metacritic_norm_round  \
FILM                       
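
For simple element-wise arithmetic like the doubling and halving above, `apply()` is not strictly needed: the operators work on a whole DataFrame directly. The sketch below checks that both forms agree on toy data with hypothetical column names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.5, 2.0], "b": [3.0, 4.5]}, index=["Film A", "Film B"])

doubled_apply = df.apply(lambda col: col * 2)  # function applied column by column
doubled_direct = df * 2                        # same arithmetic, applied directly

print(doubled_direct)
print(doubled_apply.equals(doubled_direct))
```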

Using apply() over rows: passing axis=1 makes apply() call the function once per row instead of once per column.

In [41]:
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_deviations = rt_mt_user.apply(lambda x: np.std(x), axis=1)
print(rt_mt_deviations[0:5])

FILM
Avengers: Age of Ultron (2015)    0.375
Cinderella (2015)                 0.125
Ant-Man (2015)                    0.225
Do You Believe? (2015)            0.925
Hot Tub Time Machine 2 (2015)     0.150
dtype: float64


In [42]:
rt_mt_means = rt_mt_user.apply(lambda x: np.mean(x), axis=1)
print(rt_mt_means[0:5])

FILM
Avengers: Age of Ultron (2015)    3.925
Cinderella (2015)                 3.875
Ant-Man (2015)                    4.275
Do You Believe? (2015)            3.275
Hot Tub Time Machine 2 (2015)     1.550
dtype: float64
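
The row-wise means can be reproduced without the CSV file. The sketch below rebuilds the first two rows of `rt_mt_user` from the values shown in these notes and compares `apply(..., axis=1)` with the built-in `mean(axis=1)`, which gives the same result with less code.

```python
import numpy as np
import pandas as pd

# First two rows of rt_mt_user, copied from the outputs above.
rt_mt_user = pd.DataFrame(
    {"RT_user_norm": [4.3, 4.0], "Metacritic_user_nom": [3.55, 3.75]},
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)"],
)

# axis=1 passes each row (as a Series) to the function.
row_means = rt_mt_user.apply(lambda row: np.mean(row), axis=1)

# Built-in equivalent.
builtin_means = rt_mt_user.mean(axis=1)

print(row_means)
print(np.allclose(row_means, builtin_means))
```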
