# Pandas - Merging, grouping, aggregation and complex relationships

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display, IFrame

np.random.seed(42)

In [None]:
# Read the data
imdb_titles = pd.read_parquet("../data/imdb_movie_titles.parquet")
imdb_ratings = pd.read_parquet("../data/imdb_movie_ratings.parquet")
tmdb = pd.read_parquet("../data/tmdb.parquet")

## Merging data

We would like to attach ratings to the movie titles dataset. 
Let's merge (join) the two IMDB datasets. That's quite simple because they share the `tconst` column.

In [None]:
imdb_ratings.sample(10)

Luckily, the datasets have a common column (`tconst`) and can be merged easily using the [`pandas.merge`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) function. We also specify the very useful `validate` parameter. In our case, `validate="1:1"` checks that the merge keys are unique in both the left and the right dataset; if they are not, the merge raises an error instead of silently duplicating rows.

In [None]:
imdb_rated = pd.merge(imdb_titles, imdb_ratings, on='tconst', how='inner', validate="1:1")

In [None]:
imdb_rated.head(10)
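
To see what `validate` protects against, here is a minimal sketch on made-up data. The `titles` and `ratings` frames are hypothetical stand-ins, not the workshop datasets; the duplicated `tt2` key is introduced on purpose.

```python
import pandas as pd

# Toy stand-ins for the titles and ratings tables (made-up values)
titles = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]})
ratings = pd.DataFrame({"tconst": ["tt1", "tt2", "tt2"], "averageRating": [7.1, 5.4, 5.6]})

try:
    pd.merge(titles, ratings, on="tconst", how="inner", validate="1:1")
    merge_failed = False
except pd.errors.MergeError as error:
    # "tt2" appears twice in `ratings`, so the right-hand keys are not unique
    merge_failed = True
    print(f"Merge rejected: {error}")

# Once the duplicate key is removed, the same call succeeds
merged = pd.merge(
    titles,
    ratings.drop_duplicates("tconst"),
    on="tconst",
    how="inner",
    validate="1:1",
)
print(merged)
```

Without `validate`, the first merge would quietly return two rows for `tt2`.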

In this case, we could also use the `tconst` column as the index and either use `pd.merge(imdb_titles, imdb_ratings, left_index=True, right_index=True)` or the simplified `join` method. Note that `join` performs a left join by default, whereas `merge` defaults to an inner join:

In [None]:
# we set the index to the tconst column here for demonstration
imdb_titles.set_index("tconst").join(imdb_ratings.set_index("tconst")).head(10)
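
The difference in defaults can be checked on a small example. This is a sketch with made-up frames (hypothetical `titles` and `ratings`), in which one title has no rating:

```python
import pandas as pd

# Toy frames indexed by tconst; "tt3" has no rating (made-up values)
titles = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
).set_index("tconst")
ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt2"], "averageRating": [7.1, 5.4]}
).set_index("tconst")

# merge defaults to how="inner": the unrated title is dropped
inner = pd.merge(titles, ratings, left_index=True, right_index=True)

# join defaults to how="left": the unrated title is kept with a missing rating
left = titles.join(ratings)

# Asking merge for a left join gives the same rows as join
left_via_merge = pd.merge(titles, ratings, left_index=True, right_index=True, how="left")

print(inner)
print(left)
```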

Similarly, we can merge the third dataset, TMDB. It also contains the `tconst` identifier, but in a column named `imdb_id`. We will also merge only certain columns from TMDB.

In [None]:
movies_rated = pd.merge(
    imdb_rated,
    tmdb[["imdb_id", "budget", "popularity", "revenue", "vote_average", "vote_count"]],
    left_on="tconst",
    right_on="imdb_id",
    validate="1:1",
)

In [None]:
movies_rated.sample(5)

You may have noticed that we have two duplicate columns: `tconst` and `imdb_id`. Drop the `imdb_id` column as an exercise.

In [None]:
# exercise

movies_rated = movies_rated.drop(columns=["imdb_id"])

### Consumer Price Index

We will use one more dataset for this workshop, not strictly related to movies: The US Consumer Price Index, available from https://data.bls.gov/timeseries/CUUR0000SA0.

In [None]:
cpi = pd.read_excel("../data/cpi.xlsx", skiprows=10, header=1).convert_dtypes().rename(columns={"Year": "year"})
cpi

Exercise: Calculate and insert `Annual` value for 2022 based on the mean of available months (do not consider the number of days per month).

In [None]:
month_cols = cpi.columns[1:13]
cpi = cpi.assign(
    Annual=cpi["Annual"].where(cpi["Annual"].notna(), cpi[month_cols].mean(axis="columns"))
)
cpi

You probably realised that merging this dataset will be a bit different, as the CPI is not tied to individual movies. However, each movie has a release year, so we can associate the CPI with a movie through that year. Of course, this introduces some (maybe non-trivial) inconsistency when we later recalculate dollar values based on the CPI, because the values (e.g. budgets) may relate to other year(s) than the release year.

Let's do the merge anyway and add a CPI value for each movie, based on the average CPI of its release year:

In [None]:
movies_rated_cpi = movies_rated.merge(
    cpi[["year", "Annual"]].rename(columns={"Annual": "CPI"}),
    on="year",
)

In [None]:
movies_rated_cpi[["primaryTitle", "year", "CPI"]].sample(10)

*Exercise:* Add the right `validate` parameter to the merge with CPI.

In [None]:
# exercise

movies_rated_cpi = movies_rated.merge(
    cpi[["year", "Annual"]].rename(columns={"Annual": "CPI"}),
    on="year",
    validate="many_to_one",
)
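
Once a CPI value is attached to each movie, a dollar amount can be rescaled to a reference year by multiplying with the ratio of the two CPI values. A minimal sketch, assuming made-up budgets and CPI numbers (the `movies` and `cpi_toy` frames and the `budget_2020_usd` column are hypothetical):

```python
import pandas as pd

# Made-up budgets and CPI values, for illustration only
movies = pd.DataFrame(
    {
        "primaryTitle": ["Old", "New"],
        "year": [1980, 2020],
        "budget": [10_000_000, 10_000_000],
    }
)
cpi_toy = pd.DataFrame({"year": [1980, 2020], "CPI": [80.0, 260.0]})

# Many movies can share one year, but each year has exactly one CPI row
with_cpi = movies.merge(cpi_toy, on="year", validate="many_to_one")

# Express every budget in dollars of the reference year (2020 here)
reference_cpi = cpi_toy.loc[cpi_toy["year"] == 2020, "CPI"].iloc[0]
with_cpi["budget_2020_usd"] = with_cpi["budget"] * reference_cpi / with_cpi["CPI"]

print(with_cpi)
```

The caveat from above still applies: a budget may have been spent in years other than the release year.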

## Visual analysis

Pandas provides convenience methods for plotting using [Matplotlib](https://matplotlib.org). We will not show them here (you can follow the official [10 min tutorial on plotting](https://pandas.pydata.org/docs/user_guide/10min.html#plotting)); instead, we make use of [Plotly](https://plotly.com/python/), a "graphing library [that] makes interactive, publication-quality graphs". In particular, we will use the [Plotly Express](https://plotly.github.io/plotly.py-docs/plotly.express.html#px) high-level interface.

There are more libraries you can explore when you have more time, such as https://bokeh.pydata.org/en/latest/, https://altair-viz.github.io/, http://holoviews.org. You can find an overview at https://pyviz.org/high-level/index.html.

In [None]:
import plotly.express as px

Let's try to visually "answer" whether the number of votes is related to the rating, using a simple scatter plot:

In [None]:
px.scatter(movies_rated, x="numVotes", y="averageRating")

This is already quite useful. With a minimalistic function call, we have an *interactive* plot right *in the notebook*. Let's play with it for a while:
* Zoom in / out.
* Hover over data points.

We can tune the plot a bit more. For example, the number of votes would be better on a logarithmic scale. We can also add some opacity to get a sense of the density of the data points.

In [None]:
px.scatter(movies_rated, x="numVotes", y="averageRating", log_x=True, opacity=0.1)

To get quick insights into relationships between multiple variables, we can use `scatter_matrix`. 
* It is better to select only certain columns using the `dimensions` parameter.
* `hover_name` lets you modify the hover title.

In [None]:
px.scatter_matrix(
    movies_rated, 
    dimensions=["numVotes", "averageRating", "budget", "popularity", "revenue"],
    hover_name="primaryTitle",
)

### Working with categorical data

The plots above are for numerical data (real or integer numbers in our case). What about working with categories? Some categories are already available in the dataset, and we can also create some artificially.

First, let's create a `decade` category, which corresponds to the decade of the release year.

In [None]:
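def interval_to_decade_name(interval):
    return str(interval.left)+"s"

# right=False makes the bins left-closed, e.g. [1990, 2000), so that 1990 is labelled "1990s"
decades = pd.cut(
    movies_rated_cpi["year"],
    bins=range(1890, 2021, 10),
    right=False,
).apply(interval_to_decade_name)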

decades.tail(5)
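
When naming a decade after the left edge of its interval, it matters which side of the interval is closed. `pd.cut` produces right-closed intervals by default; the sketch below, on made-up years, shows how the `right` parameter changes the bin that a year ending in 0 falls into.

```python
import pandas as pd

years = pd.Series([1989, 1990, 1999, 2000])
bins = range(1980, 2011, 10)  # edges 1980, 1990, 2000, 2010

# Default: right-closed intervals such as (1980, 1990]
right_closed = pd.cut(years, bins=bins)

# right=False: left-closed intervals such as [1990, 2000)
left_closed = pd.cut(years, bins=bins, right=False)

comparison = pd.DataFrame(
    {
        "year": years,
        "right_closed": right_closed.astype(str),
        "left_closed": left_closed.astype(str),
    }
)
print(comparison)
```

With the default, 1990 lands in the interval whose left edge is 1980; with `right=False`, its left edge is 1990.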

We can now create [box plots](https://en.wikipedia.org/wiki/Box_plot) of average rating per decade. 

In [None]:
px.box(
    movies_rated_cpi,
    x=decades,
    y="averageRating",
)

To work with genres, we need to do some extra manipulation. We first need to split the `genres` strings, as they are comma-separated values. We then use `explode` to create one row per genre. This means that a movie with several genres will occupy several rows.

In [None]:
decades_and_genres = (
    movies_rated_cpi.assign(
        decade = pd.cut(
            movies_rated_cpi["year"],
            bins=range(1890, 2021, 10),
            right=False,  # left-closed bins, so 1990 falls into the "1990s"
        ).apply(interval_to_decade_name),
        genres = movies_rated_cpi["genres"].str.split(",")
    )
    .rename({"genres": "genre"}, axis="columns")
    .explode("genre")
)

decades_and_genres

Let's see what happened. One (side) effect is that index values are now duplicated: every row created from the same movie keeps that movie's original index. We will use this to select the movies with multiple genres.

In [None]:
decades_and_genres.loc[decades_and_genres.index.duplicated(keep=False), ["primaryTitle", "genre"]].head(6)
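
The split, explode and duplicated-index steps are easier to follow on a tiny example. A minimal sketch with a made-up two-movie frame (the `toy` data is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"primaryTitle": ["A", "B"], "genres": ["Comedy,Drama", "Horror"]})

exploded = (
    # "Comedy,Drama" -> ["Comedy", "Drama"]
    toy.assign(genres=toy["genres"].str.split(","))
    .rename(columns={"genres": "genre"})
    # one row per list element; the original index value is repeated
    .explode("genre")
)
print(exploded)

# keep=False marks every occurrence of a repeated index value,
# so this selects all rows of movies that have more than one genre
multi_genre = exploded.loc[exploded.index.duplicated(keep=False)]
print(multi_genre)
```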

In [None]:
px.histogram(
    decades_and_genres.astype({"decade": str}),
    x="averageRating",
    facet_col="decade",
    facet_col_wrap=4,
    width=1000, 
    height=1000,
    histnorm="probability density",
    nbins=30,
)

In [None]:
decades_vs_genres_rating = pd.crosstab(
    index=decades_and_genres["decade"],
    columns=decades_and_genres["genre"],
    values=decades_and_genres["averageRating"],
    aggfunc="mean"
)

decades_vs_genres_rating[["Documentary", "Horror"]]

In [None]:
px.violin(
    decades_and_genres.loc[decades_and_genres["genre"].isin(["Documentary", "Horror"])].sort_values("decade"),
    x="decade",
    y="averageRating",
    color="genre",
    violinmode="overlay",
)

## Grouping & aggregation

A common pattern in data analysis is grouping (or binning) data based on some property and computing some aggregate statistics.

*Example:* Group the participants of this workshop by nationality and get the cardinality (the size) of each group.

In [None]:
grouped_by_genre = decades_and_genres.groupby('genre')

What did we get? 

In [None]:
grouped_by_genre

What's this `DataFrameGroupBy` object? [Its use case is](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html):
* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.
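
The three steps can be spelled out by hand and compared with the one-line `groupby` expression. A minimal sketch on made-up ratings (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "genre": ["Drama", "Horror", "Drama", "Horror"],
        "averageRating": [8.0, 5.0, 6.0, 7.0],
    }
)

# Split: iterating over a GroupBy yields (key, sub-frame) pairs
# Apply: compute the mean rating of each sub-frame
# Combine: collect the per-group results into one Series
manual = pd.Series(
    {genre: group["averageRating"].mean() for genre, group in toy.groupby("genre")},
    name="averageRating",
)

# The same computation expressed directly
direct = toy.groupby("genre")["averageRating"].mean()

print(manual)
print(direct)
```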


Let's try a simple aggregate: the mean rating per genre group:

In [None]:
grouped_by_genre.averageRating.mean().sort_values()

Which genres are rated worst, and which are rated best? What does that mean?

`GroupBy` objects have an `agg` method, which is quite versatile in describing the aggregations we need. As an example:

In [None]:
grouped_by_genre.agg({"averageRating": ["mean", "median", "std"], "year": ["min", "max", "median"]})
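
The dictionary form above produces two-level column names. As an alternative, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, which gives flat, self-describing columns. A minimal sketch on made-up data (the `toy` frame and the output column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "genre": ["Drama", "Drama", "Horror"],
        "averageRating": [8.0, 6.0, 5.0],
        "year": [1995, 2005, 1980],
    }
)

summary = toy.groupby("genre").agg(
    rating_median=("averageRating", "median"),
    first_year=("year", "min"),
    last_year=("year", "max"),
)
print(summary)
```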

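What if we were to group by decade? The `decades_and_genres` frame already has a `decade` column, so we can pass it straight to `groupby`. Keep in mind that this frame holds one row per movie and genre, so movies with several genres are counted more than once.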

In [None]:
group_by_decade = decades_and_genres.groupby("decade")

**Exercise:** Use `group_by_decade.agg` similarly to above to get the mean of average rating and the total number of votes in each decade.

In [None]:
# exercise


Are the 70s, 80s and 90s the dark ages of the film industry?

**Exercise:** Find the most profitable film for each studio. Use `groupby` and either `apply` with `nlargest`, or `sort_values` and `first`.

In [None]:
%exercise

# result = movies.groupby ...

# display 
result.nlargest(10, columns="lifetime_gross")[["title", "startYear", "lifetime_gross"]]

In [None]:
%validate

assert result.loc["Sony", "lifetime_gross"].values == 373585825

In [None]:
# TODO qcut star rating

## Pivoting

> pivot (third-person singular simple present pivots, present participle pivoting, simple past and past participle pivoted)
 **To turn on an exact spot.**
 
> A pivot table is a table of statistics that summarizes the data of a more extensive table ...
> Although pivot table is a generic term, Microsoft Corporation trademarked PivotTable in the United States in 1994.

Our pivoting task: Get a table with the number of titles per year (as rows) and genre (as columns).

One approach is to use `groupby`, `count` aggregation and `unstack`.

In [None]:
grouped_by_year_and_type = decades_and_genres.groupby(['year', 'genre'])

In [None]:
pivoted = (
    grouped_by_year_and_type["numVotes"]
    .count()
    .unstack()
)
pivoted.tail()

There's a shortcut though; see if you can use it.

**Exercise:** Create the `pivoted` table using `pivot_table`:

In [None]:
# exercise

pivot_table = decades_and_genres.pivot_table(values="numVotes", index="year", columns="genre", aggfunc="count")

# display - do not edit
pivot_table.tail()
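
To check that the two approaches agree, both can be run on a small made-up frame (the `toy` data is hypothetical). Year and genre combinations that never occur show up as missing values in both results.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2001],
        "genre": ["Drama", "Drama", "Horror", "Drama"],
        "numVotes": [10, 20, 30, 40],
    }
)

# Long route: group by both keys, count, then move "genre" into the columns
via_groupby = toy.groupby(["year", "genre"])["numVotes"].count().unstack()

# Shortcut: pivot_table does the grouping and reshaping in one call
via_pivot = toy.pivot_table(values="numVotes", index="year", columns="genre", aggfunc="count")

print(via_groupby)
print(via_pivot)
```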

We can now use this to plot a kind of two-dimensional histogram of years and genres, with colour encoding the number of titles.

In [None]:
px.imshow(pivoted)

In [None]:
px.density_heatmap(decades_and_genres, x="genre", y="year", histfunc="count")

In [None]:
# exercise: normalize the data per genre so the 

## Final mini-project - creative, unbounded, free-style

Here are some ideas of what you can do with the data.

* Create 5-star rating based on quantiles using `quantile` and `cut` or `qcut`.
* Group by studio / decade / rating
* Compare simple (arithmetic) mean `averageRating` in each group with `averageRating` average weighted by `numVotes` ($ \frac{\sum \rm{averageRating} \times \rm{numVotes}} {\sum \rm{numVotes}} $). Use `apply` and the `wavg` function from https://pbpython.com/weighted-average.html. This function is quite time and memory consuming and thus not ideal for large data sets. You can try to implement weighted average using standard `mean`. Check the performance with the `%timeit` magic.
* Use the 5-star rating for `hue` in an interesting seaborn plot (see https://seaborn.pydata.org/tutorial/relational.html)
* Use `sns.catplot` to visualize the distribution of incomes in each 5-star rating group. 
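
For the weighted-average idea, one possible vectorized route avoids `apply` entirely: sum the products per group and divide by the summed weights. This is a sketch on made-up data (the `toy` frame is hypothetical), not the `wavg` function from the linked article.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "genre": ["Drama", "Drama", "Horror", "Horror"],
        "averageRating": [8.0, 6.0, 4.0, 9.0],
        "numVotes": [100, 300, 900, 100],
    }
)

# Simple (arithmetic) mean per group
simple_mean = toy.groupby("genre")["averageRating"].mean()

# Weighted mean per group: sum(rating * votes) / sum(votes)
weighted_sum = (toy["averageRating"] * toy["numVotes"]).groupby(toy["genre"]).sum()
weighted_mean = weighted_sum / toy.groupby("genre")["numVotes"].sum()

print(pd.DataFrame({"simple": simple_mean, "weighted": weighted_mean}))
```

In this toy data the heavily voted Horror title pulls the weighted mean well below the simple mean.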

A couple more ideas can be found in https://github.com/brandon-rhodes/pycon-pandas-tutorial

After you have solved all of those, come up with your own quests - we may still be around and help you :-D
