# Basic Data Processing with Pandas - Part 2

In this tutorial, we will continue to explore the basics of data processing with Pandas. We will keep working on the IMDB dataset.

This time, we will answer more complex questions and learn more advanced techniques! We will also learn basic data visualization with Pandas.

Some of the questions we will answer today:

- Who are the highest rated actors?
- Which director-actor combination generates the highest revenue?
- Do actors whose names start with a J tend to have higher ratings?
- Do movies with the word "love" in the title tend to have higher revenue?

## Getting Started

First, we need to import the Pandas library and load the dataset.

In [None]:
import pandas as pd

movies_df = pd.read_csv("datasets/IMDB-Movie-Data.csv", index_col=0)

Let's start by revisiting the last two bonus tasks from the previous tutorial:

- Find how many unique directors there are in the dataset.
- Find out the number of movies released by year in the dataset.

Both of them involved counting values in a column. What Pandas functions are useful for this?

## Counting

There are different levels of counting we can do with Pandas. 

We can count the number of rows in a DataFrame, the number of unique values in a column, the number of times a specific value appears in a column, and even count values within a group.

### Counting Rows

The simplest way to count the number of rows in a DataFrame is to use the built-in `len()` function. It works on any Python object that defines a length (lists, strings, dictionaries, ...), not just DataFrames.

We can also use the `shape` attribute of a DataFrame, which returns a tuple with the number of rows and columns in the DataFrame.

`shape` will also be useful in later tutorials, when we learn how to split and combine DataFrames. It is also used a lot with NumPy (tutorial 7).

In [None]:
print(len(movies_df))
print(movies_df.shape)

### Counting Unique Values

To count the number of unique values in a column, we can use the `nunique()` method.

We can also use the `unique()` method to get an array of all the unique values in a column.

For instance, let's get the number of unique directors:

In [None]:
print(movies_df.Director.nunique())

# And if we want to get the list of unique directors:
print(movies_df.Director.unique()[:5])

### Counting Specific Values

What if we want to check the number of movies directed by a specific director? We can use the `value_counts()` method.

`value_counts()` counts how many times each value appears in a column. It returns a Series, with the values as the index and the counts as the values.

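We can also get relative frequencies with the `normalize` parameter. With `normalize=True` the counts are returned as proportions between 0 and 1, so we multiply by 100 to get percentages.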

In [None]:
# Getting the number of movies directed by each director:
print(movies_df.Director.value_counts()[:5])
# Normalized value counts.
print((movies_df.Director.value_counts(normalize=True)*100)[:5])
# The percentage must add up to 100%. Let's check that:
print((movies_df.Director.value_counts(normalize=True)*100).sum())

What year in our dataset had the most movies released? And the least?

Check the number of movies per year, both in absolute numbers and as a percentage.

In [None]:
# number of movies per year in absolute numbers
print(movies_df.Year.value_counts())

# number of movies per year as a percentage
print(movies_df.Year.value_counts(normalize=True)*100)

How many directors have directed at least 3 movies with a Metascore of 70 or higher?

In [None]:
# Greater or equal to 70
ge_70 = movies_df[movies_df.Metascore >=70].Director.value_counts()
ge_70[ge_70>=3].shape[0]

Which directors have released movies in the highest number of different years? Find the top 5.

In [None]:
display(movies_df.groupby(by="Director")["Year"].nunique().sort_values(ascending=False).head(5))

Let's try something now... Let's get the count of unique actors.

In [None]:
movies_df.Actors.value_counts()

What happened there?

The `Actors` column actually contains a list of actors in a string, separated by commas. We need to split it!

We will go back to this issue later.

### Counting Values Within Groups

We can also count values within groups. The basic function for this is `count()`.

`count()` counts the number of non-null values in each group. It returns a Series, with the groups as the index and the counts as the values.

We can use the previous methods and apply them to groups (with the `groupby()` method).

For instance, let's count the number of movies per genre.

In [None]:
# Using count():
movies_df.groupby("Genre").Title.count()
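
To make the "non-null" part concrete, here is a minimal sketch on a small made-up DataFrame (the values are hypothetical and not taken from the IMDB dataset). It compares `count()` with `size()`, which counts all rows in a group, and with `nunique()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the IMDB dataset): one Metascore is missing.
toy_df = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Comedy", "Comedy", "Comedy"],
    "Title": ["A", "B", "C", "D", "E"],
    "Metascore": [70.0, np.nan, 55.0, 60.0, 60.0],
})

grouped = toy_df.groupby("Genre")

# Non-null values per group: the missing Drama score is not counted.
non_null_counts = grouped["Metascore"].count()
# All rows per group, missing values included.
group_sizes = grouped.size()
# Distinct non-null values per group.
unique_counts = grouped["Metascore"].nunique()

print(non_null_counts)
print(group_sizes)
print(unique_counts)
```

If a column has no missing values, `count()` and `size()` give the same numbers, which is why counting `Title` per genre works as a row count.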

Now combine the different count methods to answer the following questions:

- What is the unique number of directors per genre?
- How many movies per genre has each director directed?
- Who is the director with the highest number of movies in different genres?
- And who is the director with the highest number of movies within the **SAME** genre?

In [None]:
# What is the unique number of directors per genre?
display(movies_df.groupby(by="Genre")["Director"].nunique())

In [None]:
# How many movies per genre has each director directed?
movies_df.groupby(by=["Director", "Genre"])["Title"].nunique()

In [None]:
# Who is the director with the highest number of movies in different genres?
movies_df.groupby(by=["Director"])["Genre"].nunique().sort_values(ascending=False).head(1)

In [None]:
# And who is the director with the highest number of movies within the **SAME** genre?
movies_df.groupby(by=["Director", "Genre"])["Title"].nunique().sort_values(ascending=False).head(1)

## Processing Text

As we saw earlier, the `Actors` column contains a list of actors in a string, separated by commas.

If we want to process this column like we did with `Director`, we need to split the string into a list of actors.

### Splitting Strings

We can do this with the `str.split()` method.

`str` (short for string) is a special accessor available on Pandas Series (and Index objects) that hold text. It contains a lot of useful methods for processing text.

Check out the documentation for more information: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling

In [None]:
# The str.split() method takes a separator as a parameter. By default, it splits on whitespace. In our case, we want to split on commas.
movies_df.Actors.str.split(", ")

So `str.split()` turned the string into a list of strings. But how do we add each actor to the DataFrame?

The easiest option is to add each actor as a new row. (do you see any problems with this approach?)

We can do this with the `explode()` method.

`explode()` takes a column with lists and turns each element of the list into a new row.

In [None]:
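# Here we are changing the Actors column to a list of actors. We will learn more about creating and modifying columns in the next tutorial.
# Run this cell only once: calling str.split() again on the resulting lists would turn every value into NaN.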

movies_df["Actors"] = movies_df.Actors.str.split(", ")
# We create a new df with the "exploded" column
movies_df_actors = movies_df.explode("Actors").copy()

Let's have a look at the result:

In [None]:
# movies_df_actors is already exploded, so we can simply display its first rows.
movies_df_actors.head(8)

Now we have each actor as a separate row. `explode()` copies the other columns for each new row. Now the length of the DataFrame is the sum of the lengths of the lists in the `Actors` column.
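
The split-then-explode steps can be sketched on a tiny made-up DataFrame, which makes the row arithmetic easy to verify by hand. The titles and actor names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the Title and Actors columns.
toy_df = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Actors": ["Ann Lee, Bob Ray, Cy Fox", "Ann Lee, Dee Kim"],
})

# Split each string into a list, then give every list element its own row.
actor_lists = toy_df["Actors"].str.split(", ")
toy_df["Actors"] = actor_lists
exploded = toy_df.explode("Actors")

print(exploded)
# The exploded row count equals the total number of list elements.
print(len(exploded), actor_lists.str.len().sum())
# The original index labels are repeated for each new row.
print(exploded.index.tolist())
```

Note that the index labels are repeated after `explode()`. If unique labels are needed, `reset_index(drop=True)` is one option.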

Let's check a few things:

- Number of unique actors
- Number of movies per actor
- Number of movies per actor per year

In [None]:
# Your code here:

# Number of unique actors
movies_df_actors.Actors.nunique()

In [None]:
# Number of movies per actor
display(movies_df_actors.groupby("Actors")["Title"].nunique())

In [None]:
# Number of movies per actor per year
movies_df_actors.groupby(["Actors", "Year"])["Title"].nunique()
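
One thing to keep in mind when working with the exploded DataFrame: movie-level values such as revenue are repeated once per actor, so statistics computed over all rows are weighted by cast size. The sketch below illustrates this on hypothetical toy data, using `drop_duplicates()` as one possible way to get back to one row per movie.

```python
import pandas as pd

# Hypothetical toy data: two movies with different cast sizes.
toy_df = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Revenue": [100.0, 10.0],
    "Actors": [["Ann Lee", "Bob Ray", "Cy Fox"], ["Dee Kim"]],
})
exploded = toy_df.explode("Actors")

# Movie A now appears three times, so it dominates the average.
mean_per_row = exploded["Revenue"].mean()
# Keeping one row per title restores the movie-level average.
mean_per_movie = exploded.drop_duplicates(subset="Title")["Revenue"].mean()

print(mean_per_row)    # average over actor rows
print(mean_per_movie)  # average over movies
```

For per-actor questions the exploded DataFrame is the right tool. For per-movie questions it is usually safer to go back to the original DataFrame.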

### Other `str` Methods

There are many other useful methods in the `str` attribute. Let's have a look at some of the most common ones.

#### `str.contains()`

We can use `str.contains()` to check if a string contains a specific substring.

For instance, let's check how many movies have the word "love" in the title.

In [None]:
# The parameter case=False makes the search case insensitive (i.e. it will find "love" and "Love")
movies_df_actors[movies_df_actors.Title.str.contains("love", case=False)].Title.nunique()

#### `str.startswith()` and `str.endswith()`

We can use these methods to check if a string starts or ends with a specific substring.

For instance, let's check how many movies have the word "the" at the beginning of the title.

In [None]:
# str.startswith() doesn't take a case parameter, so we convert the titles to lowercase first to make the search case insensitive.
movies_df_actors[movies_df_actors.Title.str.lower().str.startswith("the")].Title.nunique()
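
As a small illustration of how case and word boundaries affect `str.startswith()`, here is a sketch on a few made-up titles (hypothetical values, not from the dataset).

```python
import pandas as pd

# Hypothetical toy titles with mixed capitalisation.
titles = pd.Series(["The Host", "the lobster", "Them", "Other People"])

# Case-sensitive: only titles beginning with exactly "The".
starts_the = titles.str.startswith("The")
# Case-insensitive: lowercase first, then compare.
starts_the_any_case = titles.str.lower().str.startswith("the")
# A trailing space restricts the match to the word "the".
starts_the_word = titles.str.lower().str.startswith("the ")

print(starts_the.tolist())
print(starts_the_any_case.tolist())
print(starts_the_word.tolist())
```

"Them" matches the first two checks because `startswith()` compares characters, not words.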

#### `str.lower()` and `str.upper()`

We can use these methods to convert a string to lowercase or uppercase.

Let's make all the director names uppercase and check which directors are called STEVEN.

In [None]:
movies_df_actors[movies_df_actors.Director.str.upper().str.contains("STEVEN")].Director.unique()

Spoiler: we will learn how to create and modify columns in the next tutorial, but here is a preview that will help you in some tasks:

```python
movies_df_actors["is_steven"] = movies_df_actors["Director"].str.upper().str.contains("STEVEN")
```

#### `str.replace()`

We can use this method to replace a substring with another substring.

Let's ruin the title of some movies by replacing "the" with "a random".

In [None]:
movies_df_actors["ruined_title"] = movies_df_actors.Title.str.replace("the", "A Random", case=False)
movies_df_actors.query("Title != ruined_title").ruined_title.sample(5)

What if we only want to replace "the" when it is a whole word? We can use regular expressions for that.

Regular expressions are a powerful tool for processing text. They are a bit complicated, so we will not go into detail for now. But you can check out the documentation for more information: https://docs.python.org/3/library/re.html

Just as an example, this is how we would ensure that we only replace "the" as a whole word:


In [None]:
# The r prefix makes this a raw string, so Python leaves the backslashes untouched for the regular expression engine. \b matches a word boundary (the beginning or end of a word).
movies_df_actors["ruined_title"] = movies_df_actors.Title.str.replace(r"\bthe\b", "A Random", case=False, regex=True)
movies_df_actors.query("Title != ruined_title").ruined_title.sample(5)
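
To see the difference between the two replacements, here is a minimal sketch on two made-up titles. `regex=True` is passed explicitly in both calls so that the behavior does not depend on the default of the installed Pandas version.

```python
import pandas as pd

# Hypothetical toy titles in which "the" also appears inside other words.
titles = pd.Series(["The Other Mother", "Bathe the Cat"])

# Plain pattern: every occurrence of "the" is replaced, even inside words.
plain = titles.str.replace("the", "X", case=False, regex=True)
# \b on both sides: only "the" as a whole word is replaced.
whole_word = titles.str.replace(r"\bthe\b", "X", case=False, regex=True)

print(plain.tolist())
print(whole_word.tolist())
```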

There are many more methods. Remember to check https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling for more information.


## Aggregating

Now that we have the actors as separate rows, we can aggregate their data to get some interesting insights.

We have already learned the `groupby()` method, which is the basic function for aggregating data. Let's review it:

Let's find out which directors worked with the highest number of different actors.

In [None]:
movies_df_actors.groupby("Director").Actors.nunique().sort_values(ascending=False).head(5)

`groupby()` can also group by multiple columns. Let's find out which directors worked with the highest number of different actors in each year.

In [None]:
movies_df_actors.groupby(["Director", "Year"]).Actors.nunique().sort_values(ascending=False).head(5)

### `sum()`

`sum()` sums the values in each group. Because `True` counts as 1 and `False` as 0, it can also be used to count the `True` values in a boolean column.

The other basic statistical functions are `mean()`, `median()`, `min()`, `max()`, and `std()`. They all work the same way.

Let's check the mean revenue of each actor.

In [None]:
movies_df_actors.groupby("Actors")["Revenue (Millions)"].mean().sort_values(ascending=False)[:10]
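
Summing a boolean column within groups is a handy way to count how many rows meet a condition. The sketch below uses hypothetical toy data and a made-up threshold of 8.0.

```python
import pandas as pd

# Hypothetical toy data (not the IMDB dataset).
toy_df = pd.DataFrame({
    "Director": ["D1", "D1", "D2", "D2", "D2"],
    "Rating": [8.1, 6.5, 7.9, 8.4, 8.8],
})

# Boolean Series: True where the rating is at least 8.0.
is_high_rated = toy_df["Rating"] >= 8.0

# sum() counts the True values per director...
high_rated_per_director = is_high_rated.groupby(toy_df["Director"]).sum()
# ...and mean() gives the share of True values per director.
share_high_rated = is_high_rated.groupby(toy_df["Director"]).mean()

print(high_rated_per_director)
print(share_high_rated)
```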

### `agg()`

`agg()` allows us to apply several aggregation functions at once, usually after `groupby()`. When given more than one function it returns a DataFrame, with the groups as the index and the functions as the columns. It is like a more flexible version of calling `sum()` or `mean()` directly.

To apply different functions to different columns, we pass a dictionary to it. The keys of the dictionary are the names of the columns, and the values are the functions (or lists of functions) to apply.

You can also pass custom functions to `agg()`.

Let's check the min, max, mean and median revenue and rating of each actor.

In [None]:
agg_functions = ["mean", "median", "min", "max"]
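# Group by actor first; without groupby(), agg() would summarise the whole DataFrame instead of each actor.
movies_df_actors.groupby("Actors").agg({"Revenue (Millions)": agg_functions, "Rating": agg_functions})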
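
As a sketch of passing a custom function to `agg()` on grouped data, the example below mixes built-in function names with a small helper. The helper `value_range` and the toy data are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data (not the IMDB dataset).
toy_df = pd.DataFrame({
    "Actors": ["Ann Lee", "Ann Lee", "Bob Ray", "Bob Ray"],
    "Rating": [6.0, 8.0, 7.0, 7.5],
    "Revenue": [10.0, 50.0, 20.0, 30.0],
})


def value_range(series):
    """Custom aggregation: difference between the largest and smallest value."""
    return series.max() - series.min()


# Dictionary keys are column names, values are lists of functions to apply.
summary = toy_df.groupby("Actors").agg({
    "Rating": ["mean", value_range],
    "Revenue": ["min", "max"],
})

print(summary)
```

The result has two-level column labels, so a single value is addressed with a tuple such as `("Rating", "value_range")`.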

## Data Visualization

Pandas has a built-in visualization system. It is based on Matplotlib, which we will learn about in tutorial 7.

We will learn more about the different types of plots and how to customize them later, but for now, let's just have a quick look at some basic plots.

### Scatter Plots

We can use `plot(kind="scatter")` (or the equivalent `plot.scatter()` method) to create a scatter plot.

What is the relationship between the revenue and the rating of the movies? Are highly rated movies more profitable?

In [None]:
movies_df_actors.plot(kind="scatter", x="Rating", y="Revenue (Millions)", title="Revenue by Rating")

There are some interesting outliers, but the general trend is unclear.

Let's confirm that using correlation. We will learn about more advanced statistical methods in tutorial 7.

In [None]:
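# numeric_only=True skips text columns such as Title, which cannot be correlated.
# We use movies_df (one row per movie) so that movies with larger casts are not counted more often.
movies_df.corrwith(movies_df["Rating"], numeric_only=True)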
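
To get a feeling for what the correlation values mean, here is a minimal sketch on hypothetical toy data where the answer can be checked by hand. It assumes a Pandas version in which `corrwith()` accepts `numeric_only` (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data (not the IMDB dataset).
toy_df = pd.DataFrame({
    "Rating": [5.0, 6.0, 7.0, 8.0],
    "Metascore": [50.0, 60.0, 70.0, 80.0],  # perfectly linear in Rating
    "Revenue": [40.0, 10.0, 30.0, 20.0],    # no clear relationship
    "Title": ["A", "B", "C", "D"],          # text column, not correlatable
})

# Pearson correlation between two columns.
rating_vs_metascore = toy_df["Rating"].corr(toy_df["Metascore"])
rating_vs_revenue = toy_df["Rating"].corr(toy_df["Revenue"])

# Correlation of every numeric column with Rating in one call.
all_vs_rating = toy_df.corrwith(toy_df["Rating"], numeric_only=True)

print(rating_vs_metascore)
print(rating_vs_revenue)
print(all_vs_rating)
```

A value close to 1 or -1 indicates a strong linear relationship, and a value close to 0 indicates a weak one.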

### Line Plots

How does the average Metascore change over time? We can use a line plot to visualize this.

In [None]:
# The figsize parameter is used to set the size of the plot.
movies_df_actors.groupby("Year").Metascore.mean().plot(kind="line", title="Metascore by Year", figsize=(10, 5))

### Bar Plots

Let's use a bar plot to visualize the number of movies per year.

In [None]:
# The result is a Series, so its index (Year) is used for the x axis automatically.
movies_df_actors.groupby("Year").Title.nunique().plot(kind="bar", title="Number of Movies per Year", figsize=(10, 5))

### Histograms

We can use a histogram to visualize the distribution of a column.

Let's check the distribution of Ratings.

In [None]:
movies_df_actors.Rating.hist(bins=10, figsize=(10, 5))

# Putting It All Together - Tasks

Let's go back to our original questions and practice what we have learned.

- Who are the highest rated actors?
- Which director-actor combination generates the highest revenue?
- Do actors whose names start with a J tend to have higher ratings?
- Do movies with the word "love" in the title tend to have higher revenue?

In [None]:
# Who are the highest rated actors?
display(movies_df_actors[["Actors", "Rating"]].sort_values(by="Rating", ascending=False).head())
display(movies_df_actors.groupby("Actors")["Rating"].mean().sort_values(ascending=False).head())

In [None]:
# Which director-actor combination generates the highest revenue?
movies_df_actors.groupby(by=["Director", "Actors"])["Revenue (Millions)"].sum().sort_values(ascending=False).head()


In [None]:
# Do actors whose names start with a J tend to have higher ratings?
print(movies_df_actors[movies_df_actors["Actors"].str.startswith("J")].Rating.mean())
print(movies_df_actors[~movies_df_actors["Actors"].str.startswith("J")].Rating.mean())

In [None]:
# Do movies with the word "love" in the title tend to have higher revenue?
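# case=False also matches "Love" with a capital L, which is how it usually appears in titles.
print(movies_df[movies_df["Title"].str.contains("love", case=False)]["Revenue (Millions)"].mean())
print(movies_df[~movies_df["Title"].str.contains("love", case=False)]["Revenue (Millions)"].mean())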
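
An alternative way to compare two groups defined by a condition is to group by the condition itself. The sketch below uses hypothetical toy data and made-up group labels. It also shows that `str.contains()` matches substrings, so "Gloves" counts as containing "love".

```python
import pandas as pd

# Hypothetical toy data (not the IMDB dataset).
toy_df = pd.DataFrame({
    "Title": ["Love Story", "Crazy, Stupid, Love", "Robot War", "Gloves Off"],
    "Revenue": [30.0, 50.0, 100.0, 20.0],
})

# Boolean Series, then readable labels for the two groups.
has_love = toy_df["Title"].str.contains("love", case=False)
group_label = has_love.map({True: "with love", False: "without love"})

# One groupby gives the statistic for both groups at once.
mean_by_group = toy_df.groupby(group_label)["Revenue"].mean()
count_by_group = toy_df.groupby(group_label)["Revenue"].count()

print(mean_by_group)
print(count_by_group)
```

Checking the group sizes alongside the means is useful, because a mean over very few movies is not a reliable basis for a conclusion.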


Bonus tasks:

- Plot a line graph showing the Rating over time comparing two groups: i) Movies by the top 5 directors with the highest revenue ii) Movies by the top 5 directors with the highest number of movies
- Compare the scatter plots of Rating vs Revenue and Metascore vs Revenue. Which one is more useful?
- Create an aggregated DataFrame with the following columns: i) Director ii) Number of movies iii) Average Rating iv) Average Revenue v) Average Metascore. 
- Create an aggregated DataFrame with the following columns: i) Title ii) Number of words in the title iii)  First symbol of the title iv) Average number of words in the title of movies starting with this symbol  v) Actors vi) Number of actors
- Create a line plot showing the average number of actors per movie over time.
- Create a scatter plot showing the relationship between the number of words in the title and rating.

Plot a line graph showing the Rating over time comparing two groups: i) Movies by the top 5 directors with the highest revenue ii) Movies by the top 5 directors with the highest number of movies

In [None]:
# i) Movies by the top 5 directors with the highest revenue
# "Highest revenue" is read here as the directors of the five highest-grossing individual movies.
directors_highest_revenue = movies_df.sort_values(by="Revenue (Millions)", ascending=False)["Director"].drop_duplicates().head(5).values
movies_directors_highest_revenue = movies_df[movies_df["Director"].isin(directors_highest_revenue)]
mean_rating_by_year_i = movies_directors_highest_revenue.groupby("Year")["Rating"].mean()
mean_rating_by_year_i.plot()

In [None]:
#ii) Movies by the top 5 directors with the highest number of movies
directors_highest_n_movies = movies_df.groupby("Director")["Title"].nunique().sort_values(ascending=False).head(5).index.values
movies_directors_n_movies = movies_df[movies_df["Director"].isin(directors_highest_n_movies)]
mean_rating_by_year_ii = movies_directors_n_movies.groupby("Year")["Rating"].mean()
mean_rating_by_year_ii.plot()
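
To compare the two groups directly, one option is to combine both yearly Series into a single DataFrame with one column per group. The sketch below uses hypothetical values and made-up column labels in place of the real results.

```python
import pandas as pd

# Hypothetical yearly mean ratings for the two groups of directors.
mean_rating_top_revenue = pd.Series({2014: 7.9, 2015: 7.4, 2016: 7.6})
mean_rating_most_movies = pd.Series({2014: 7.1, 2016: 7.3})

# Building a DataFrame from a dict of Series aligns them on the index (Year).
comparison = pd.DataFrame({
    "Top 5 by revenue": mean_rating_top_revenue,
    "Top 5 by number of movies": mean_rating_most_movies,
})
comparison.index.name = "Year"

# Years missing from one group show up as NaN in that column.
print(comparison)
```

Calling `comparison.plot(kind="line")` then draws one line per column on the same axes, with a legend built from the column names.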

Compare the scatter plots of Rating vs Revenue and Metascore vs Revenue. Which one is more useful?

In [None]:
movies_df.plot(kind="scatter", x="Rating", y="Revenue (Millions)", title="Revenue by Rating")
movies_df.plot(kind="scatter", x="Metascore", y="Revenue (Millions)", title="Revenue by Metascore")

Create an aggregated DataFrame with the following columns: i) Director ii) Number of movies iii) Average Rating iv) Average Revenue v) Average Metascore.

In [None]:
movies_df.groupby("Director").agg({"Title": "count", "Rating": "mean", "Revenue (Millions)": "mean", "Metascore": "mean"}).reset_index().rename(columns={
  "Title": "Number of movies", "Rating": "Average Rating", "Revenue (Millions)":"Average Revenue (Millions)", "Metascore": "Average Metascore"
})
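
The same kind of table can also be built with named aggregation, which sets the output column names inside `agg()` and makes the separate `rename()` step unnecessary. This is a sketch on hypothetical toy data, and the output names (`n_movies`, `avg_rating`, `avg_revenue`) are made up for the example.

```python
import pandas as pd

# Hypothetical toy data (not the IMDB dataset).
toy_df = pd.DataFrame({
    "Director": ["D1", "D1", "D2"],
    "Title": ["A", "B", "C"],
    "Rating": [6.0, 8.0, 7.0],
    "Revenue (Millions)": [10.0, 30.0, 50.0],
})

# Each keyword is an output column: new_name=(source column, function).
summary = (
    toy_df.groupby("Director")
    .agg(
        n_movies=("Title", "count"),
        avg_rating=("Rating", "mean"),
        avg_revenue=("Revenue (Millions)", "mean"),
    )
    .reset_index()
)

print(summary)
```

Named aggregation requires output names that are valid Python identifiers, so names with spaces still need a `rename()` afterwards.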

Create an aggregated DataFrame with the following columns: i) Title ii) Number of words in the title iii)  First symbol of the title iv) Average number of words in the title of movies starting with this symbol  v) Actors vi) Number of actors

In [None]:
# i, v
df = movies_df[["Title", "Actors"]].copy()
df = df.rename(columns={"Title": "i", "Actors": "v"})

# ii) Number of words in the title
df["ii"] = movies_df.Title.str.split(" ").str.len()

# iii) First symbol of the title
df["iii"] = movies_df.Title.str[0]

# iv) Average number of words in the title of movies starting with this symbol
avg_title_len_with_1st_symbol = df.groupby("iii")['ii'].mean()
df["iv"] = df["iii"].map(avg_title_len_with_1st_symbol)

# vi) Number of actors (Actors already holds lists after the earlier split, so str.len() counts the list elements)
df["vi"] = movies_df.Actors.str.len()

df[["i", "ii", "iii", "iv", "v", "vi"]]
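
Step iv can also be written in one line with `groupby().transform()`, which computes a group statistic and returns it aligned with the original rows. The sketch below uses hypothetical titles and descriptive column names instead of the roman numerals.

```python
import pandas as pd

# Hypothetical toy titles (not the IMDB dataset).
toy_df = pd.DataFrame({"Title": ["Up", "Under the Skin", "Lion", "Logan"]})

# Number of words and first symbol of each title.
toy_df["n_words"] = toy_df["Title"].str.split().str.len()
toy_df["first_symbol"] = toy_df["Title"].str[0]

# transform("mean") returns one value per row: the mean of that row's group.
toy_df["avg_words_for_symbol"] = (
    toy_df.groupby("first_symbol")["n_words"].transform("mean")
)

print(toy_df)
```

Unlike a plain `mean()` on the groups, `transform()` keeps the original number of rows, so the result can be assigned straight to a new column.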

Create a line plot showing the average number of actors per movie over time.

In [None]:
df = movies_df[["Year", "Actors"]].copy()
df["n_actors"] = df.Actors.str.len()
df.groupby("Year")["n_actors"].mean().plot()

Create a scatter plot showing the relationship between the number of words in the title and rating.

In [None]:
df = movies_df[["Title", "Rating"]].copy()
df["n_words_title"] = movies_df.Title.str.split(" ").str.len()
df.plot(kind="scatter", x="n_words_title", y="Rating")