# Exploring Movie Genres and Ratings Over Time

### Questions to Investigate:
* How has the popularity of genres evolved over time?
* More specifically, how has the percentage distribution of genres among movies changed throughout the years?
* Does genre popularity correlate to particular decades or identifiable cultural eras?

This initial approach looks only at the percent *frequency* of each genre among the movies of each year.
<br>**\*\*\*n.b. some results may appear skewed due to movies being tagged with multiple genres, as well as limited movie data in certain years\*\*\***

Additionally, we can explore how movies through the years in each genre have been rated:

* Does the *quality* of movies of each genre (determined by user ratings) also display any trends throughout the years?<br> e.g. Were there any stretches of years where comedies were more poorly rated? Were Sci-Fi movies more highly acclaimed during any decade? What was happening in the world at that time that might have had an influence?

In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot as plot, init_notebook_mode
init_notebook_mode(connected=True)

In [2]:
# Import CSV source data files

movies_file = r'C:\Users\Mark Pothier\Documents\Code\data_science_tutorials\edX - UC San Diego\Python for Data Science\Week 4\Week-4-Pandas\Week-4-Pandas\movielens\movies.csv'
movies = pd.read_csv(movies_file)

In [3]:
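# Extract the year from the 'title' column and add it as a new column

movies_years = movies.copy(deep=True)
# Raw string avoids invalid-escape warnings; expand=False returns a Series
movies_years['Year'] = movies_years['title'].str.extract(r'.*\((.*)\).*', expand=False)
# Keep only rows whose captured text is exactly four characters long
movies_years = movies_years[movies_years['Year'].str.contains('^....$', na=False)]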
movies_years['Year'] = movies_years['Year'].astype(np.int64)
movies_years.head()

,movieId,title,genres,Year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
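As a rough illustration of what the extraction pattern captures, the sketch below runs it on three made-up titles (the `titles` series and the variable names are hypothetical, not rows verified against `movies.csv`). Because the leading `.*` is greedy, the pattern keeps the *last* parenthesised group, which is why a title containing an alternate name in parentheses still yields the year. The stricter four-digit pattern is an alternative shown for comparison, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical titles in the same "Title (Year)" style as movies.csv
titles = pd.Series([
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
    'Untitled Project',
])

# Same greedy pattern as the notebook: captures the last parenthesised group
greedy = titles.str.extract(r'.*\((.*)\).*', expand=False)

# Stricter alternative: exactly four digits in parentheses at the end of the title
strict = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

print(greedy.tolist())
print(strict.tolist())
```

Titles without any parentheses produce a missing value in both cases, which the `na=False` filter in the cell above then drops.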


In [4]:
# Split apart the strings in genres, in order to extract a list of unique genres across the entire dataset

movie_genres = movies[['movieId','genres']].copy(deep=True)
movie_genres = movie_genres['genres'].str.split('|', expand=True)

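# [:-2] drops the last two unique values, which are assumed not to be genre names
# (e.g. a missing-value filler from the split or a "no genres" placeholder)
unique_genres = list(pd.unique(movie_genres.values.ravel('K'))[:-2])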
unique_genres

['Adventure',
 'Comedy',
 'Action',
 'Drama',
 'Crime',
 'Children',
 'Mystery',
 'Documentary',
 'Animation',
 'Thriller',
 'Horror',
 'Fantasy',
 'Western',
 'Film-Noir',
 'Romance',
 'War',
 'Sci-Fi',
 'Musical',
 'IMAX']

In [5]:
# For each genre, create a new column in the DataFrame
# Each cell is evaluated as 'True' or 'False' depending on the genre's inclusion in the original 'genres' string

movies_expanded = movies_years.copy(deep=True)

for value in unique_genres:
    movies_expanded[value] = movies_expanded['genres'].str.contains(value)
movies_expanded.head()

,movieId,title,genres,Year,Adventure,Comedy,Action,Drama,Crime,Children,...,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,True,True,False,False,False,True,...,False,False,True,False,False,False,False,False,False,False
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,True,False,False,False,False,True,...,False,False,True,False,False,False,False,False,False,False
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,False,True,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,False,True,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
4,5,Father of the Bride Part II (1995),Comedy,1995,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [6]:
# Group the DataFrame by 'Year', summing the True/False values of each genre as the rows are collapsed

movies_yearly_genre_counts = movies_expanded.groupby(['Year'])[unique_genres].sum().astype(np.int64)
movies_yearly_genre_counts['TOTAL'] = movies_yearly_genre_counts.sum(axis=1)
movies_yearly_genre_counts.tail()

Year,Adventure,Comedy,Action,Drama,Crime,Children,Mystery,Documentary,Animation,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX,TOTAL
2011,78,287,132,471,74,46,47,182,49,181,86,52,6,5,106,28,72,17,24,1943
2012,67,303,127,439,92,31,42,187,52,167,110,43,5,2,113,23,62,25,32,1922
2013,77,260,122,437,82,33,45,193,49,178,90,47,5,1,112,16,63,19,32,1861
2014,60,221,111,315,81,33,32,102,33,123,67,35,7,0,80,17,55,14,15,1401
2015,16,37,21,46,10,8,6,15,6,23,8,5,0,0,8,1,12,0,0,222


### To reiterate an important point from the introduction:

Many movies were tagged with **multiple** genres. **The values being plotted correspond with the number of genre tags**, and *not* the number of movies. In other words, # genre tags > # movies.

*This may not be an ideal way to process the data (it would be better to assign a single, primary genre to each movie, but that information is not readily distinguishable), so we will have to accept that movies with more genre tags create multiple genre hits.*
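To make the tags-versus-movies distinction concrete, here is a minimal sketch on a three-row toy frame (the data and the names `toy`, `indicators`, `n_movies`, `n_tags` are invented for illustration). It uses `Series.str.get_dummies`, a pandas shortcut that builds one indicator column per genre, as an alternative to the loop in the notebook.

```python
import pandas as pd

# Toy data: three movies carrying three, two and one genre tags respectively
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Children|Fantasy', 'Comedy|Romance', 'Comedy'],
})

# One 0/1 indicator column per genre found in the '|'-separated strings
indicators = toy['genres'].str.get_dummies(sep='|')

n_movies = len(toy)
n_tags = int(indicators.to_numpy().sum())

print(n_movies, n_tags)  # 3 movies but 6 genre tags
```

The same gap shows up in the real data: the 2015 row above totals 222 genre tags.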


In [7]:
# To help clarify this last point, let's also get the data for number of movies per year, 
# to be plotted on top of the stacked bars as a line graph

movies_per_year = movies_years.copy(deep=True)
movies_per_year = movies_per_year.groupby(['Year'], as_index=False)['movieId'].count()
movies_per_year.tail()

,Year,movieId
113,2011,1016
114,2012,1022
115,2013,1011
116,2014,740
117,2015,120


In [8]:
# First, plot a stacked bar chart depicting the count of each genre, per year

data1 = [
    go.Bar(
        x=movies_yearly_genre_counts.index.values,
        y=movies_yearly_genre_counts[genre],
        name=genre
    ) for genre in unique_genres
]

data2 = [
    go.Scatter(
        x=movies_per_year['Year'],
        y=movies_per_year['movieId'],
        name='Movies per Year',
        mode='lines',
        line=dict(
            color = 'yellow',
            width = 2
        )
    )
]

data = data1 + data2

layout = go.Layout(
    barmode='stack',
    title='Movie Genre Hits per Year',
    yaxis={'title': 'Movie Genre Hits per Year', 'range': [0,2500]},
    xaxis={'title': 'Year', 'range': [1912,2015]},
    bargap=0.0,
    autosize=False,
    width=960,
    height=600,
    hovermode='closest'
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename='genre-hits-per-year')

This works, but the growth in the number of movies made over time makes it difficult to see trends in yearly genre composition.

Let's normalize the counts so that each genre is shown as a percentage of that year's total genre hits:

In [9]:
# Normalize values as a percentage out of 100

movies_yearly_genre_counts_normalized = movies_yearly_genre_counts.div(movies_yearly_genre_counts['TOTAL'], axis=0)
movies_yearly_genre_counts_normalized = movies_yearly_genre_counts_normalized.fillna(0)
movies_yearly_genre_counts_normalized = movies_yearly_genre_counts_normalized.drop(['TOTAL'], axis=1)
movies_yearly_genre_counts_normalized = movies_yearly_genre_counts_normalized * 100
movies_yearly_genre_counts_normalized.tail()

Year,Adventure,Comedy,Action,Drama,Crime,Children,Mystery,Documentary,Animation,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX
2011,4.014411,14.770973,6.793618,24.240865,3.808543,2.367473,2.41894,9.366958,2.521873,9.315492,4.426145,2.676274,0.308801,0.257334,5.455481,1.441071,3.70561,0.874936,1.235203
2012,3.485952,15.764828,6.6077,22.840791,4.786681,1.612903,2.185224,9.729448,2.705515,8.688866,5.723205,2.237253,0.260146,0.104058,5.879292,1.19667,3.225806,1.300728,1.664932
2013,4.13756,13.970983,6.555615,23.481999,4.406233,1.77324,2.418055,10.370768,2.632993,9.56475,4.83611,2.525524,0.268673,0.053735,6.01827,0.859753,3.385277,1.020956,1.719506
2014,4.282655,15.774447,7.922912,22.48394,5.781585,2.35546,2.284083,7.280514,2.35546,8.779443,4.782298,2.498216,0.499643,0.0,5.710207,1.213419,3.925767,0.999286,1.070664
2015,7.207207,16.666667,9.459459,20.720721,4.504505,3.603604,2.702703,6.756757,2.702703,10.36036,3.603604,2.252252,0.0,0.0,3.603604,0.45045,5.405405,0.0,0.0
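The row-wise normalization can be checked by hand: for 2015, Adventure has 16 of 222 genre hits, and 16 / 222 × 100 ≈ 7.207, which matches the value in the table. The sketch below repeats the same `div(..., axis=0)` idea on a small invented frame (the numbers and the names `counts` and `shares` are illustrative only) and confirms that every year sums to 100.

```python
import pandas as pd

# Invented genre-hit counts for two years
counts = pd.DataFrame(
    {'Comedy': [2, 30], 'Drama': [1, 50], 'Horror': [1, 20]},
    index=pd.Index([1950, 2000], name='Year'),
)

# Divide each row by its own total (axis=0 aligns on the Year index), then scale to percent
shares = counts.div(counts.sum(axis=1), axis=0) * 100

print(shares)
print(shares.sum(axis=1))  # each year should total 100
```

Note that a year with only four genre hits weighs as much in the normalized plot as a year with hundreds, which is one reason the sparse early years look erratic.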


In [73]:
# Now, plot a stacked bar chart depicting the PERCENTAGE of each genre of all counts per year

data = [
    go.Bar(
        x=movies_yearly_genre_counts_normalized.index.values,
        y=movies_yearly_genre_counts_normalized[genre],
        name=genre
    ) for genre in unique_genres
]

layout = go.Layout(
    barmode='stack',
    title='Movie Genre Hits Composition per Year',
    yaxis={'title': 'Percentage of All Genre Hits', 'range': [0,100]},
    xaxis={'title': 'Year', 'range': [1912,2015]},
    bargap=0.0,
    autosize=False,
    width=960,
    height=600,
    hovermode='closest'
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename='genre-hits-composition-per-year')

In [11]:
# Reformat values for creation of stacked area chart

movies_yearly_genre_counts_cumulative = movies_yearly_genre_counts_normalized.copy(deep=True)
movies_yearly_genre_counts_cumulative = movies_yearly_genre_counts_cumulative.cumsum(axis=1)
movies_yearly_genre_counts_cumulative.head()

Year,Adventure,Comedy,Action,Drama,Crime,Children,Mystery,Documentary,Animation,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX
1891,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1893,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1894,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,100.0,100.0
1895,0.0,50.0,50.0,50.0,50.0,50.0,50.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,100.0,100.0,100.0,100.0,100.0
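The cumulative sum is what turns separate percentage series into stacked bands: each column becomes the upper edge of its genre's band, and a `fill='tonexty'` trace shades down to the previous trace. In the 1894 row above, for example, the running total reaches 50 at Documentary and 100 at Musical. The sketch below reproduces the idea on an invented frame (`shares`, `upper_edges` and `recovered` are hypothetical names) and shows that the band thicknesses can be recovered by differencing.

```python
import pandas as pd

# Invented per-year genre shares that already sum to 100
shares = pd.DataFrame(
    {'Comedy': [50.0, 30.0], 'Drama': [25.0, 50.0], 'Horror': [25.0, 20.0]},
    index=pd.Index([1950, 2000], name='Year'),
)

# Running total across the genre columns: the upper edge of each stacked band
upper_edges = shares.cumsum(axis=1)

# Differencing adjacent columns gives back each band's thickness
recovered = upper_edges.diff(axis=1)
recovered.iloc[:, 0] = upper_edges.iloc[:, 0]  # first band starts at zero

print(upper_edges)
```

This is also why the plotting cell passes the normalized values as hover `text`: the `y` values themselves are cumulative, not the per-genre percentages.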


In [12]:
# Replot as a stacked area chart

data = [
    go.Scatter(
        x=movies_yearly_genre_counts_cumulative.index.values,
        y=movies_yearly_genre_counts_cumulative[genre],
        text=round(movies_yearly_genre_counts_normalized[genre],1),
        hoverinfo='x+text+name',
        name=genre,
        mode='lines',
        line=dict(
            width=0.5,
            shape='spline',
            smoothing=1.3
        ),
        fill='tonexty',
    ) for genre in unique_genres
]

layout = go.Layout(
    title='Movie Genre Hits Composition per Year',
    yaxis={'title': 'Percentage of All Genre Hits', 'range': [0,100], 'ticksuffix':'%', 'hoverformat': '%'},
    xaxis={'title': 'Year', 'range': [1912,2015]},
    autosize=False,
    width=960,
    height=600,
    hovermode='closest'
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename='genre-hits-composition-per-year-area-stack')

# How do ratings compare with these genre frequencies?

Let's bring in data from the ratings CSV file and try to determine how genres over the years are rated.

In [13]:
# Import CSV source data files

ratings_file = r'C:\Users\Mark Pothier\Documents\Code\data_science_tutorials\edX - UC San Diego\Python for Data Science\Week 4\Week-4-Pandas\Week-4-Pandas\movielens\ratings.csv'
ratings = pd.read_csv(ratings_file)

In [14]:
ratings_simple = ratings[['movieId','rating']].copy(deep=True)
ratings_simple.head()

,movieId,rating
0,2,3.5
1,29,3.5
2,32,3.5
3,47,3.5
4,50,3.5


In [15]:
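# Attach each movie's year and genre flags to every rating (the inner join drops ratings for movies without a usable year)
ratings_with_genres = ratings_simple.merge(movies_expanded, on='movieId', how='inner')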
ratings_with_genres.tail()

,movieId,rating,title,genres,Year,Adventure,Comedy,Action,Drama,Crime,...,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX
19999850,121017,3.5,The Gentleman from Epsom (1962),Comedy|Crime,1962,False,True,False,False,True,...,False,False,False,False,False,False,False,False,False,False
19999851,121019,4.5,The Great Spy Chase (1964),Action|Comedy|Thriller,1964,False,True,True,False,False,...,True,False,False,False,False,False,False,False,False,False
19999852,121021,4.5,Taxi for Tobruk (1961),Drama|War,1961,False,False,False,True,False,...,False,False,False,False,False,False,True,False,False,False
19999853,110167,4.5,"Judge and the Assassin, The (Juge et l'assassi...",Crime|Drama,1976,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,False
19999854,110510,4.5,Série noire (1979),Film-Noir,1979,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False


In [16]:
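# For each genre column, replace the flag with the row's rating (True * rating = rating, False * rating = 0)
for column in ratings_with_genres:
    if column in unique_genres:
        ratings_with_genres[column] = ratings_with_genres[column] * ratings_with_genres['rating']

# Turn the zeros into NaN so that count() and mean() skip them
ratings_with_genres = ratings_with_genres.replace(0, np.nan)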
ratings_with_genres.tail()

,movieId,rating,title,genres,Year,Adventure,Comedy,Action,Drama,Crime,...,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX
19999850,121017,3.5,The Gentleman from Epsom (1962),Comedy|Crime,1962,,3.5,,,3.5,...,,,,,,,,,,
19999851,121019,4.5,The Great Spy Chase (1964),Action|Comedy|Thriller,1964,,4.5,4.5,,,...,4.5,,,,,,,,,
19999852,121021,4.5,Taxi for Tobruk (1961),Drama|War,1961,,,,4.5,,...,,,,,,,4.5,,,
19999853,110167,4.5,"Judge and the Assassin, The (Juge et l'assassi...",Crime|Drama,1976,,,,4.5,4.5,...,,,,,,,,,,
19999854,110510,4.5,Série noire (1979),Film-Noir,1979,,,,,,...,,,,,4.5,,,,,
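The point of the multiply-then-replace step is that pandas aggregations such as `count()` and `mean()` skip NaN, so each genre column ends up holding only the ratings of movies tagged with that genre. A minimal sketch of the same idea follows, using `Series.where` on invented data (the frame `toy` and the names `masked`, `counts`, `means` are hypothetical). Unlike replacing zeros, `where` would not blank out a genuine rating of 0, although that difference only matters if such ratings exist in the data.

```python
import numpy as np
import pandas as pd

# Invented ratings with boolean genre flags
toy = pd.DataFrame({
    'rating': [3.5, 4.5, 2.0],
    'Comedy': [True, True, False],
    'Drama': [False, True, True],
})
genres = ['Comedy', 'Drama']

# Keep the rating where the genre flag is True, NaN elsewhere
masked = toy[genres].apply(lambda flag: toy['rating'].where(flag))

counts = masked.count()  # number of ratings per genre (NaN skipped)
means = masked.mean()    # average rating per genre (NaN skipped)

print(masked)
print(counts)
print(means)
```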


In [17]:
# Count the ratings each genre received per release year (NaN cells are skipped)
total_yearly_genre_rating = ratings_with_genres.groupby(['Year'], as_index=True)[unique_genres].count()
total_yearly_genre_rating['TOTAL'] = total_yearly_genre_rating.sum(axis=1)
total_yearly_genre_rating.head()

Year,Adventure,Comedy,Action,Drama,Crime,Children,Mystery,Documentary,Animation,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX,TOTAL
1891,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1893,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1894,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,7,0,14
1895,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3
1896,0,0,0,0,0,0,0,16,0,0,0,0,0,0,7,0,0,0,0,23


In [18]:
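# Average rating per genre per release year; mean() ignores NaN, so only movies tagged with the genre contribute
avg_yearly_genre_rating = ratings_with_genres.groupby(['Year'], as_index=True)[unique_genres].mean()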
avg_yearly_genre_rating.head()

Year,Adventure,Comedy,Action,Drama,Crime,Children,Mystery,Documentary,Animation,Thriller,Horror,Fantasy,Western,Film-Noir,Romance,War,Sci-Fi,Musical,IMAX
1891,,,,,,,,,,,,,,,,,,,
1893,,,,,,,,,,,,,,,,,,,
1894,,,,,,,,2.714286,,,,,,,,,,3.428571,
1895,,2.25,,,,,,4.0,,,,,,,,,,,
1896,,,,,,,,3.4375,,,,,,,2.928571,,,,


In [19]:
# Create a scatter plot, where:
# each trace represents a genre,
# y-axis is the average rating (0-5),
# marker size reflects the genre's share of that year's rating count

data = [
    go.Scatter(
        x=avg_yearly_genre_rating.index.values,
        y=avg_yearly_genre_rating[genre],
        name=genre,
        hoverinfo='x+y+name',
        mode='markers',
        line=dict(
            width = 1
        ),
        marker = dict(
            size = total_yearly_genre_rating[genre]/total_yearly_genre_rating['TOTAL']*150 + 5,
            opacity = 1
        ),
    ) for genre in unique_genres
]

layout = go.Layout(
    barmode='stack',
    title='Genre Ratings (and Counts) by Year',
    yaxis={'title': 'Average Rating', 'range': [1.8,5]},
    xaxis={'title': 'Year', 'range': [1912,2015]},
    autosize=False,
    width=960,
    height=600,
    hovermode='closest'
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename='genre-ratings-by-year')

That's too densely packed to be readily legible, so let's replot with the default trace visibility set to "legendonly". We can then turn on individual or multiple traces as we please.

In [20]:
data = [
    go.Scatter(
        x=avg_yearly_genre_rating.index.values,
        y=avg_yearly_genre_rating[genre],
        name=genre,
        hoverinfo='x+y+name',
        mode='markers',
        line=dict(
            width = 1
        ),
        marker = dict(
            size = total_yearly_genre_rating[genre]/total_yearly_genre_rating['TOTAL']*150 + 5,
            opacity = 1
        ),
        visible='legendonly'
    ) for genre in unique_genres
]

layout = go.Layout(
    barmode='stack',
    title='Genre Ratings (and Counts) by Year',
    yaxis={'title': 'Average Rating', 'range': [1.8,5]},
    xaxis={'title': 'Year', 'range': [1912,2015]},
    autosize=False,
    width=960,
    height=600,
    hovermode='closest'
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename='genre-ratings-by-year')
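The marker-size expression used in both plots scales each genre's share of a year's rating count by 150 and adds 5, so marker size reflects a proportion rather than a raw count. As a worked check, the sketch below applies the same arithmetic to the Documentary and TOTAL counts for 1891 to 1896 copied from the `total_yearly_genre_rating` output above (the variable names `counts` and `sizes` are just for illustration).

```python
import pandas as pd

# Documentary and TOTAL rating counts copied from the table above
counts = pd.DataFrame(
    {'Documentary': [0, 0, 7, 1, 16], 'TOTAL': [0, 0, 14, 3, 23]},
    index=pd.Index([1891, 1893, 1894, 1895, 1896], name='Year'),
)

# Same expression as the plotting cells: share of the year's ratings * 150, plus 5
sizes = counts['Documentary'] / counts['TOTAL'] * 150 + 5

print(sizes)
```

Years with no ratings at all give 0 / 0, which pandas evaluates to NaN. Those years also have a NaN average rating, so they have no marker to size in the first place.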

__________________________________________________________________________________________________________________________
__________________________________________________________________________________________________________________________


Finally, for the heck of it, let's look at the overall average rating of each genre:

In [31]:
# Overall mean of each genre column across all ratings, sorted from highest to lowest
avg_genre_rating = ratings_with_genres.copy(deep=True)
avg_genre_rating = avg_genre_rating[unique_genres]
avg_genre_rating = avg_genre_rating.mean(axis=0).sort_values(ascending=False)
avg_genre_rating

Film-Noir      3.965381
War            3.809531
Documentary    3.739719
Crime          3.674528
Drama          3.674296
Mystery        3.663509
IMAX           3.655946
Animation      3.617494
Western        3.570498
Musical        3.558090
Romance        3.541803
Thriller       3.507112
Fantasy        3.505946
Adventure      3.501893
Action         3.443864
Sci-Fi         3.436765
Comedy         3.425990
Children       3.408114
Horror         3.277224
dtype: float64
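One thing to keep in mind when reading these averages: a plain column mean counts every *rating* once, so heavily rated movies dominate their genre's figure. The sketch below, on invented data with hypothetical names (`toy`, `per_rating_mean`, `per_movie_mean`), contrasts that with averaging within each movie first. The second approach is an alternative weighting shown for comparison, not what the notebook computes.

```python
import pandas as pd

# Invented ratings: movie 1 has four ratings, movie 2 has one
toy = pd.DataFrame({
    'movieId': [1, 1, 1, 1, 2],
    'rating': [4.0, 4.0, 4.0, 4.0, 2.0],
})

# Every rating counts once (what a plain column mean does)
per_rating_mean = toy['rating'].mean()

# Every movie counts once: average within each movie, then across movies
per_movie_mean = toy.groupby('movieId')['rating'].mean().mean()

print(per_rating_mean, per_movie_mean)  # 3.6 vs 3.0
```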

In [70]:
data = [
    go.Bar(
        x=avg_genre_rating.index.values,
        y=avg_genre_rating
    )
]

layout = go.Layout(
    title='Average Rating by Genre',
    yaxis={'title': 'Average Rating', 'range': [3,4]},
    xaxis={'title': 'Genres', 'tickangle': -90},
    autosize=False,
    width=960,
    height=600,
    hovermode='closest'
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename='avg-genre-ratings')