# Module 6.2 Activity: GroupBy and Join

Now that we have had more exposure to the types of operations we can do with DataFrames, let's dive right in! This notebook primarily focuses on DataFrames and how to interact with them using `groupby` and `merge`.

We're going to be working with data on TV shows today! We'll be looking at TV shows that are on various streaming services and checking on their ratings. Specifically, we'll be looking at two datasets that we got from [here](https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney): one with TV show ratings, and one with the streaming services they're on!

In [None]:
import pandas as pd

tv_show_ratings = pd.read_csv('tv_show_ratings.csv')
tv_show_streaming = pd.read_csv('tv_show_streaming.csv')

Let's warm up with a review of the material we covered last class. First, let's take a look at our datasets:

In [None]:
tv_show_ratings.head()

In [None]:
tv_show_streaming.head()

### Question 1:
We can start by trying to find the average rating of TV shows in the 16+ age group. Let's first filter out our 16+ age group from our ratings table:

In [None]:
tv_show_ratings_16 = ... 

We can now take the average of our IMDb ratings and Rotten Tomatoes ratings for our filtered DataFrame!

In [None]:
average_imdb_rating = ...
average_rotten_tomatoes_rating = ...

print('Average IMDb Rating of 16+ Shows:', average_imdb_rating)
print('Average Rotten Tomatoes Rating of 16+ Shows:', average_rotten_tomatoes_rating)

We could repeat this process for every age group manually, but that would get tedious...

However, we can use GroupBy to make this process easier! As you may recall from the slides, a GroupBy expression consists of two parts: the **groupby call**, which takes in one or more columns to form groups from, and an **aggregation function**, which calculates some value for each group.

For example, if we wanted to get the average rating for each age group, we could do the following:

In [None]:
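# numeric_only=True skips non-numeric columns, which recent pandas versions cannot average
tv_show_ratings.groupby('Age').mean(numeric_only=True)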

This one line of code calculated the average rating for every age group at once! Note that the output also includes the average of the `Year` column for each age group, which we don't actually need. You might notice that one group has a much higher rating than the others!
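If you only want the average of one column, you can select that column after the `groupby` call and before the aggregation. Here is a minimal sketch on a small made-up table; the name `toy_ratings` and its column names are assumptions for illustration and may not match the real dataset.

```python
import pandas as pd

# Toy data with hypothetical column names; the real dataset's columns may differ.
toy_ratings = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Year': [2010, 2012, 2015, 2018],
    'Age': ['16+', '16+', '7+', '7+'],
    'IMDb': [8.0, 7.0, 6.5, 7.5],
})

# Selecting 'IMDb' after grouping means only that column is averaged,
# so the unneeded 'Year' average never appears in the result.
mean_by_age = toy_ratings.groupby('Age')['IMDb'].mean()
print(mean_by_age)
```

The result is a Series indexed by age group, with one average per group.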

### Question 2:
Now, try using `groupby` to get the average rating for each year:

In [None]:
tv_show_ratings.groupby(...)._____() # Fill in the dots and the underscore!

We can use `groupby` to do things other than calculating the mean! For a full list of aggregation functions, you can go here: https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
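As a rough illustration of swapping in a different aggregation function, the sketch below uses `min` and `median` on a made-up table (the name `toy` and its columns are hypothetical). The `agg` method accepts a list of function names if you want several aggregations side by side.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'Age': ['16+', '16+', '7+', '7+', '7+'],
    'IMDb': [8.0, 7.0, 6.5, 7.5, 9.0],
})

# A single aggregation: the lowest rating in each age group.
min_by_age = toy.groupby('Age')['IMDb'].min()

# Several aggregations at once: one output column per function name.
summary = toy.groupby('Age')['IMDb'].agg(['min', 'median'])
print(summary)
```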


### Question 3:
Try finding the **max rating for each age group** and the **number of TV shows from each year** using `groupby`!

In [None]:
... # Your code here; find the max rating for each age group!

In [None]:
... # Your code here; find the number of TV shows from each year!

We have another dataset that might be interesting to us, and that's our streaming services data, contained in `tv_show_streaming`! For each TV show and streaming service, it contains a `1` if the TV show is on the streaming service and a `0` if it isn't.

We might be able to find some interesting information about the years of TV shows and the average ratings per streaming service if we were able to combine the two datasets...

And there is a way for us to do that, using the `merge` and `join` functions! Joining combines two datasets by pairing up the rows whose values in a chosen column are equal.

First, let's go through a quick example about how we might join two datasets of candy.

In [None]:
candy = pd.DataFrame({
    'Candy': ["Sour Patch Kids", "Skittles", "Snickers", "Candy Corn", "Starburst", "M&M’s"], 
    'Quantity': [14, 18, 22, 32, 6, 43],
    'Price': ["Expensive", "Cheap", "Cheap", "Expensive", "Expensive", "Cheap"]
})

candy_rankings = pd.DataFrame({
    'Candy': ["Sour Patch Kids", "Skittles", "Snickers", "Candy Corn", "Starburst", "M&M’s"],
    'Ranking': [3, 4, 2, 1, 6, 5]
})

In [None]:
candy.head()

In [None]:
candy_rankings.head()

It might be helpful for us to have the ranking column in the same table as the quantity and the price! To do that, we can use the `merge` function in Pandas to join two tables on a column. We set the `on` keyword to the name of the column we are merging on. If you want to merge on a column that has a different name in each table, you can use the `left_on` and `right_on` arguments instead!

In [None]:
candy.merge(candy_rankings, on='Candy')
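To see `left_on` and `right_on` in action, here is a small sketch where the ranking table calls its candy column `Name` instead of `Candy`. The tables `candy_toy` and `rankings_toy` and the column `Name` are made up for this illustration.

```python
import pandas as pd

# Two made-up tables whose key columns have different names.
candy_toy = pd.DataFrame({
    'Candy': ['Skittles', 'Snickers'],
    'Quantity': [18, 22],
})
rankings_toy = pd.DataFrame({
    'Name': ['Snickers', 'Skittles'],
    'Ranking': [2, 4],
})

# left_on names the key column in candy_toy, right_on the one in rankings_toy.
merged = candy_toy.merge(rankings_toy, left_on='Candy', right_on='Name')
print(merged)
```

Notice that both key columns (`Candy` and `Name`) are kept in the result, so you may want to drop one of them afterwards.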

We can also combine the tables with the `join` method, although `join` requires the matching column in the other table (the one passed in) to be that table's index.

In [None]:
candy.join(candy_rankings.set_index('Candy'), on='Candy')
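One difference worth knowing: by default, `merge` keeps only the rows whose key appears in both tables (an inner join), while `join` keeps every row of the left table (a left join). The sketch below uses made-up tables (`candy_toy`, `rankings_toy`) where one candy has no ranking, to show how the two defaults behave.

```python
import pandas as pd

# Made-up tables: 'Starburst' has no entry in the rankings table.
candy_toy = pd.DataFrame({
    'Candy': ['Skittles', 'Snickers', 'Starburst'],
    'Quantity': [18, 22, 6],
})
rankings_toy = pd.DataFrame({
    'Candy': ['Skittles', 'Snickers'],
    'Ranking': [4, 2],
})

# merge defaults to how='inner': the unmatched 'Starburst' row is dropped.
inner = candy_toy.merge(rankings_toy, on='Candy')

# join defaults to how='left': 'Starburst' stays, with a missing Ranking.
left = candy_toy.join(rankings_toy.set_index('Candy'), on='Candy')

print(len(inner), len(left))
```

Both methods accept a `how` argument if you want a different behavior, for example `candy_toy.merge(rankings_toy, on='Candy', how='left')`.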

### Question 4:
Let's get back to our rating data! Write code that uses the `merge` method to combine our `tv_show_ratings` and `tv_show_streaming` tables:

In [None]:
tv_shows = ________.merge(..., on=...)

In [None]:
tv_shows.head()

### Question 5:
Now that we have merged data, we can do more analysis! Find the average rating of TV shows that are on Netflix and find the average rating of TV shows that are not on Netflix:

In [None]:
... # Your code here: find the average rating of TV shows that are on Netflix! (Hint: use groupby)

### Question 6:
Let's take a look at Disney+. First, filter your data by only including TV shows that are present on Disney+:

In [None]:
disney = ...

Now, find the number of TV shows in each age group category on Disney+! What do you notice about these numbers?

In [None]:
... # Your code here: again, it might be useful to use groupby here! You've done something similar above.

### Exploratory Analysis
We've written some interesting code here! However, there's a lot more to explore. Using the tools you've learned so far -- `groupby`, `merge`, filtering, etc. -- try to find some more patterns in the data in the remaining cells! Share your insights with your peers and with the class.