### Plot Description Based Recommender

In the previous notebooks, we built an IMDB Top 250 clone (a type of simple recommender) and a knowledge-based recommender that suggested movies based on timeline, genre, and duration. However, these systems were extremely primitive. The simple recommender did not take into consideration an individual user's pereferences. The knowledge-based recommender did take account of the user's preference for genres, timelines and duration, but the model and its recommendations still remained very generic.

Imagine that Alice likes the movies *The Dark Knigth, Iron Man,* and *Man of Steel*. It is pretty evident that Alice has a taste for superhero movies. However, our models from the previous chapter would not be able to capture this detail. The best it could do is suggest *action* movies (by making Alice input *action* as the preferred genre), which is a superset of superhero movies.

It is also possible that two movies have the same genre, timeline, and duration characteristics, but differ hugely in their audience. Consider *The Hangover* and *Forgetting Sarah Marshall*, for example. Both these movies were released in the first decade of the 21st century, both lasted around two hours, and both were comedies. However, the kind of audience that enjoyed these movies was very different.

An obvious fix to this problem is to ask the user for more metadata as input. For instance, if we introduced a *sub-genre* input, the user would be able to input values such as *superhero*, *black comedy*, and *romantic comedy*, and obtain more appropriate results, but this solution suffers heavily from the perspective of usability.

The first problem is that we do not possess data on *sub-genres*. Secondly, even if we did, our users are extremely unlikely to possess knowledge of their favorite movies' metadata. Finally, even if they did, they would certainly not have the patience to input it into a long form. Instead, what they would be willing to do is tell you the movies they like/dislike and expect recommendations that match their tastes.

As we discussed in the first chapter, this is exactly what sites like Netflix do. When you sign up on Netflix for the first time, it doesn't have any information about your tastes for it to build a profile, leverage the power of its community, and give you recommendations with (a concept we'll explore in later chapters). Instead, what it does is ask you for a few movies you like and show you results that are most similar to those movies.

In this notebook, we are going to build two types of content-based recommender:

* **Plot description-based recommender:** This model compares the descriptions and taglines of different movies, and provides recommendations that have the most similar plot descriptions.

* **Metadata-based recommender:** This model takes a host of features, such as genres, keywords, cast, and crew, into consideration and provides recommendations that are most similar with respect to the aforementioned features.


### Exporting the clean DataFrame
In the previous chapter, we performed a series of data wrangling and cleaning processes on our metadata in order to convert it into a form that was more usable. To avoid having to perform these steps again, let's save this cleaned DataFrame into a CSV file. As always, doing this with pandas happens to be extremely easy.

In the knowledge recommender notebook from Chapter 4, enter the following code in the last cell:

In [None]:
# # convert the cleaned (non-exploded) dataframe df into a CSV file and save it in the data folder
# # Set parameter index to False as the index of the DataFrame has no inherent meaning
# df.to_csv('../data.') 

The data folder now contain a new file, `metadata_clean.csv`.

Let's create a new folder, Chapter 4, and open a new Jupyter Notebook within this folder. Let's now import our new file into this Notebook:

In [102]:
import pandas as pd
import numpy as np

# Import data from the clean file
# df = pd.read_csv('./data/metadata_clean.csv')
df = pd.read_csv('/kaggle/input/metadata-of-movies/metadata_clean.csv')


# Print the head of the cleaned DataFrame
df.head(10)

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995
5,Heat,"['Action', 'Crime', 'Drama', 'Thriller']",170.0,7.7,1886.0,1995
6,Sabrina,"['Comedy', 'Romance']",127.0,6.2,141.0,1995
7,Tom and Huck,"['Action', 'Adventure', 'Drama', 'Family']",97.0,5.4,45.0,1995
8,Sudden Death,"['Action', 'Adventure', 'Thriller']",106.0,5.5,174.0,1995
9,GoldenEye,"['Adventure', 'Action', 'Thriller']",130.0,6.6,1194.0,1995


The cell should output a DataFrame that is already clean and in the desired form.

### Document vectors
Essentially, the models we are building compute the pairwise similarity between bodies of text. But how do we numerically quantify the similarity between two bodies of text?

To put it another way, consider three movies: A, B, and C. How can we mathematically prove that the plot of A is more similar to the plot of B than to that of C (or vice versa)?

The first step toward answering these questions is to represent the bodies of text (henceforth referred to as documents) as mathematical quantities. This is done as a representing these documents as vectors. In other words, every documents is depicted as a series of *n* numbers, where each number represents a dimension and *n* is the size of the vocabulary of all the documents put together.

But what are the values of these vectors? The answer to that question depends on the *vectorizer* we are using to convert our documents into vectors. The two most popular vectorizers are CountVectorizer and TF-IDFVectorizer. There are many other types. This [medium](https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af) post talks about them in-depth.

#### CountVectorizer
CountVectorizer is the simplest type of vectorizer and is best explained with the help of an example. Imagine that we have three documents, A, B, and C, which are as follows:
* **A:** The sun is a star.
* **B:** My love is like a red, red rose.
* **C:** Mary had a little lamb

We now have to convert these documents into their vector forms using CountVectorizer. The first step is to compute the size of the vocabulary. The vocebulary is the number of unique words present across all documents. Therefore, the vocabulary for this set of three documents is as follows: the, sun, is, a, star, my, love, like, red, rose, mary, had, little, lamb. Consequently, the size of the vocabulary is 14.

It is common practice to not include extremely common words such as a, the, is, had, my, and so on (also known as stop words) in the vocabulary. Therefore, eliminating the stop words, our vocabulary, *V*, is as follows:

V: like, little, lamb, love, mary, red, rose, sun, star

The size of our vocabulary is now nine. Therefore, our documents will be represented as nine-dimensional vectors, and each dimension here will represent the number of times a particular word occurs in a document. In other words, the first dimension will represent the number of times like occurs, the second will represent the number of times little occurs, and so on.

Therefore, using the CountVectorizer approach, A, B, and C will now be represented as follows:

* **A:** (0, 0, 0, 0, 0, 0, 1, 1)
* **B:** (1, 0, 0, 1, 2, 1, 0, 0)
* **C:** (0, 1, 1, 0, 1, 0, 0, 0)

#### TF-IDFVectorizer
Not all words in a document carry equal weight. We already observed this when we eliminated the stop words from our vocabulary altogether. But the words that were in the vocabulary were all given equal weighting.

But should this always be the case?

For example, considers a corpus of documents on dogs. Now, it is obvious that all these documents will frequently contain the word dog. Therefore, the appearance of the word dog isn't as important as another word that only appears in a few documents.

**TF-IDFVectorizer (Term Frequency-Inverse Document Frequency)** takes the aforementioned point into consideration and assigns weights to each word according to the following formula. For every word *i* in document *j*, the following applies:


In this formula, the following is true:
* w, is the weight of word *i* in document *j*
* df, is the number of documents that contain the term *i*
* N is the total number of documents

We won't go too much into the formula and the associated calculations. Just keep in mind that the weight of a word in a document is greater if it occurs more frequently in that document and is present in fewer documents. The weight *w* takes values between 0 and 1:

We will be using TF-IDFVectorizer because some words (pictured in the preceding word
cloud) occur much more frequently in plot descriptions than others. It is therefore a good
idea to assign weights to each word in a document according to the TF-IDF formula. Another reason to use TF-IDF is that it speeds up the calculation of the cosine similarity
score between a pair of documents. We will discuss this point in greater detail when we
implement this in code.

#### The cosine similarity score
We will discuss similarity scores in detail in `Chapter 5`, *Getting Started with Data Mining Techniques*. Presently, we will make use of the *cosine similarity* metric to build our models. The cosine score is extremely robust and easy to calculate (especially when used in conjunction with TF-IDFVectorizer).

The cosine similarity score between two documents, *x* and *y*, is as follows:


The cosine score can take any value between -1 and 1. The higher the cosine score, the more similar the documents are to each other. We now have a good theoretical base to proceed to build the content-based recommenders using Python.

### Plot description-based recommender
Our plot description-based recommender will take in a movie title as an argument and output a list of movies that are most similar based on their plots. These are the steps we are going to perform in building this model:

1. Obtain the data required to build the model
2. Create TF-IDF vectors for the plot description (or overview) of every movie
3. Compute the pairwise cosine similarity score of every movie
4. Write the recommender function that takes in a movie title as an argument and outputs movies most similar to it based on the plot

#### Preparing the data
In its present form, the DataFrame, although clean, does not contain the features that are required to build the plot description-based recommender. Fortunately, these requisite features are available in the original metadata file.

All we have to do is import them and add them to our DataFrame:

In [110]:
# Import the original file
# orig_df = pd.read_csv('./data/movies_metadata.csv', low_memory=False)
orig_df = pd.read_csv('/kaggle/input/metadata-of-movies/movies_metadata.csv', low_memory=False)

# Add the useful features into the cleaned dataframe
df['overview'], df['id'] = orig_df['overview'], orig_df['id']

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


The DataFrame should now contain two new features: `overview` and `id`. We will use `overview` in building this model and id for building the next.

The `overview` feature consists of strings and, ideally, we should clean them up by removing all punctuation and converting all the words to lowercase. However, as we will see shortly, all this will be done for us automatically by `scikit-learn`, the library we're going to use

#### Creating the TF-IDF matrix (Term Frequency-Inverser Document Frequency)
The next step is to create a DataFrame where each row represents the TF-IDF vector of the `overview` feature of the corresponding movie in our main DataFrame. To do this, we will use the `scikit-learn` library, which gives us access to a TfidfVectorizer object to perform this process effortlessly:

In [4]:
# !pip install scikit-learn

In [59]:
# Import TfIdfVectorizer from the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a  TF-IDF Vectorizer object. Remove all english stopwords
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

# Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
tfidf_matrix = tfidf.fit_transform(df['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

If you notice, the vectorizer created a 75,827-dimensional vector for the overview of every movie.

#### Computing the cosine similarity score
The next step is to calculate the pairwise cosine similarity score of every movie. In other words, we are going to create a 45,466 x 45,466 matrix. And the matrix is symmetric in nature and every element in the diagonals is 1, since it is the similarity score of the movie with itself.

`scikit-learn` has functionality for computing the aforementioned similarity matrix. Calculating the cosine similarity is, however, a computationally expensive process. Fortunately, since the movie plots are rep as TF-IDF vectors, their magnitude is always 1. 

Hence, there is no need for the calculation of the denominator in the cosine similarity formula as it will always be 1. The work is now reduced to computing the much simpler and computationally cheaper dot product (a functionality that is also provided by `scikit-learn`).

In [6]:
# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

#### Building the recommender function
The final step is to create the recommender function. However, before doing that, create a reverse mapping of movie titles and their respective indices. In other words, create a pandas series with the index as the movie title and the value as the corresponding index in the main DataFrame:

In [77]:
# Construct a reverse mapping of indices and movie titles, and drop duplicate titles, if any
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

Perform the following steps in building the recommender function:

1. Declare the title of the movie as an argument.
2. Obtain the index of the movie from the `indices` reverse mapping.
3. Get the list of cosine similarity scores for that particular moveie with all movies using `cosine_sim`. Convert this into a list of tuples where the first element is the position and the second is the similarity score.
4. Sort this list of tuples on the basis of the cosine similarity scores.
5. Get the top 10 elements of this list. Ignore the first element as it refers to the similarity score with itself (the movie most similar to a particular movie is obviously the movie itself).
6. Return the titles corresponding to the indices of the top 10 elements, excluding the first:


In [109]:
# Function that takes in movie title as input and gives recommendations
def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]
    
    # Get the pairwise similarity scores of all movies with that movie
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

Finally built my very own first content-based recommender. Now it is time to see our recommender in action. Try to ask it for recommendations of movies similar to `The Lion King:`

In [62]:
#Get recommendations for The Lion King
content_recommender('The Lion King')

34682    How the Lion Cub and the Turtle Sang a Song
9353                                The Lion King 1½
9115                  The Lion King 2: Simba's Pride
42829                                           Prey
25654                                 Fearless Fagan
17041                                   African Cats
27933              Massaï, les guerriers de la pluie
6094                                       Born Free
37409                                     Sour Grape
3203                                The Waiting Game
Name: title, dtype: object

Notice that the recommender has suggested all of the `The Lion King's` sequels in its top-10 list. We also notice that most of the movies in the list have to do with lions.

It goes without saying that a person who loves `The Lion King` is very likely to have a thing for Disney movies. They may also prefer to watch animated movies. Unfortunately, our plot description recommentder isn't able to capture all this information.


Therefore, in the next section, we will build a recommender that uses more advanced metadata, such as genres, cast, crew, and keywords (or sub-genres). This recommender will be able to do a much better job of identifying an individual's taste for a particular director, actor, sub-genre, and so on.

### Metadata-based recommender

Largely follow the same steps as the plot description-based recommender to build our metadata-based model. The main difference, of course, is in the type of data we use to build the model.

#### Preparing the data
To build this model, we will be using the following metadata:
* The genre of the movie.
* THe director of the movie. This person is part of the crew.
* The movie's three major stars. They are part of the cast.
* Sub-genres or keywords.

With the exception of genres, our DataFrames (both original and cleaned) do not contain the data that we require. Therefore, for this exercise, we will need to download two additional file: `credits.csv`, which contains information on the cast and crew of the movies, and `keywords.csv`, which contains information on the sub-genres.   

#### The keywords and credits datasets
Load the new data into the existing Jupyter Notebook:

In [107]:
# Load the keywords and credits files
cred_df = pd.read_csv('/kaggle/input/the-movies-dataset/credits.csv')
key_df = pd.read_csv('/kaggle/input/the-movies-dataset/keywords.csv')

# Print the head of the credit dataframe
cred_df.head()

# Print the head of the keywords dataframe
key_df.head()

print(key_df, cred_df)

           id                                           keywords
0         862  [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...
1        8844  [{'id': 10090, 'name': 'board game'}, {'id': 1...
2       15602  [{'id': 1495, 'name': 'fishing'}, {'id': 12392...
3       31357  [{'id': 818, 'name': 'based on novel'}, {'id':...
4       11862  [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...
...       ...                                                ...
46414  439050             [{'id': 10703, 'name': 'tragic love'}]
46415  111109  [{'id': 2679, 'name': 'artist'}, {'id': 14531,...
46416   67758                                                 []
46417  227506                                                 []
46418  461257                                                 []

[46419 rows x 2 columns]                                                     cast  \
0      [{'cast_id': 14, 'character': 'Woody (voice)',...   
1      [{'cast_id': 1, 'character': 'Alan Parrish', '...   
2      [{'cast

Notice that the cast, crew, and the keywords are in the familiar `list of dictionaries` form. Just like `genres`, we reduce them to a string or a list of strings.

Before doing this, join the DataFrames so that all features are in a single DataFrame. Joining pandas DataFrames is identical to joining tables in SQL. The key we are going to use to join the DataFrames is the `id` feature. However, in order to use this, we need to explicitly convert is listed as an ID. This is clearly bad data. Therefore turn it into an integer.

In [113]:
# Convert the IDs of df into int
df['id'] = df['id'].astype('int')
df

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862
...,...,...,...,...,...,...,...,...
45461,Subdue,"['Drama', 'Family']",90.0,4.0,1.0,0,Rising and falling between a man and woman.,439050
45462,Century of Birthing,['Drama'],360.0,9.0,3.0,2011,An artist struggles to finish his work while a...,111109
45463,Betrayal,"['Action', 'Drama', 'Thriller']",90.0,3.8,6.0,2003,"When one of her hits goes wrong, a professiona...",67758
45464,Satan Triumphant,[],87.0,0.0,0.0,1917,"In a small town live two brothers, one a minis...",227506


Running the preceding code results in a `ValueError`. Take a closer inspection and notice that '1997-08-20' is listed as an ID. This is clearly bad data. Therefore, find all the rows with bad IDs and remove them in order fo the code execution to be successful:

In [120]:
# Function to convert all non-integer IDs to NaN
def clean_ids(x):
    try:
        return int(x)
    except:
        return np.nan

# Clean the ids of df
df['id'] = df ['id'].apply(clean_ids)

# Filter all rows that have a null ID
df = df[df['id'].notnull()]

df

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
...,...,...,...,...,...,...,...,...,...,...,...
46623,Subdue,"['Drama', 'Family']",90.0,4.0,1.0,0,Rising and falling between a man and woman.,439050,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...","[{'id': 10703, 'name': 'tragic love'}]"
46624,Century of Birthing,['Drama'],360.0,9.0,3.0,2011,An artist struggles to finish his work while a...,111109,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...","[{'id': 2679, 'name': 'artist'}, {'id': 14531,..."
46625,Betrayal,"['Action', 'Drama', 'Thriller']",90.0,3.8,6.0,2003,"When one of her hits goes wrong, a professiona...",67758,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",[]
46626,Satan Triumphant,[],87.0,0.0,0.0,1917,"In a small town live two brothers, one a minis...",227506,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",[]


It is now a good position to convert the IDs of all three DataFrames into integers and merge them into a single DataFrame:

In [121]:
# Ensure the IDs of df, key_df, and cred_df are of type int
if df['id'].dtype != int:
    df = df.copy()
    df['id'] = df['id'].astype(int)

if key_df['id'].dtype != int:
    key_df = key_df.copy()
    key_df['id'] = key_df['id'].astype(int)

if cred_df['id'].dtype != int:
    cred_df = cred_df.copy()
    cred_df['id'] = cred_df['id'].astype(int)

# Merge keywords and credits into your main metadata dataframe only if necessary
if not {'cast', 'crew'}.issubset(df.columns):
    df = df.merge(cred_df, on='id', suffixes=('', '_cred'))
    df.drop(columns=['cast_cred', 'crew_cred'], inplace=True, errors='ignore')

if 'keywords' not in df.columns:
    df = df.merge(key_df, on='id', suffixes=('', '_key'))
    df.drop(columns=['keywords_key'], inplace=True, errors='ignore')

# Drop duplicate columns if they exist
df = df.loc[:, ~df.columns.duplicated()]

# Display the head of the merged df
df.head()


Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


### Wrangling keywords, cast and crew
Now that the desired features in a single DataFrame, convert them into a form that is usable. More specifically, these are the transformations looking to be performed:

* Convert `keywords` into a list of strings where each strings is a keyword (similar to genres). Include only the top three keywords. Therefore, this list can have a maximum of three elements.
* Convert `crew` into `director`. In other words, extract only the director of the movie and ignore all other crew members.
* Convert `cast` into a list of strings where each string is a star. Like `keywords`, only include the top three stars in the cast.

The first step is to convert these stringified objects into native Python Objects:

In [122]:
# Convert the stringified objects into the native python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)

Next, extract the director from the `crew` list. To do this, first examine the structure of the dictionary

In [123]:
# Print the first cast member of the first movie in df
df.iloc[0]['crew'][0]

{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}

See that this directory consists of `job` and `name` keys. Since the only interest is the director, loop through all the crew members in a particular list and extract the `name` when the `job` is `Director`. Let's write a function that does this:



In [124]:
# Extract the director's name. If director is not listed, return NaN
def get_director(x):
    for crew_member in x:
        if crew_member['job'] == 'Director':
            return crew_member['name']
        return np.nan

In [128]:
# Define the new director feature
df['director'] = df['crew'].apply(get_director)

# Print the directors of the first five movies
df['director'].head()

0      John Lasseter
1                NaN
2      Howard Deutch
3    Forest Whitaker
4                NaN
Name: director, dtype: object

In [138]:
# Define the function to extract top 3 elements or entire list if less than 3
def generate_list(x):
    if isinstance(x, list):
        names = [ele['name'] for ele in x if isinstance(ele, dict) and 'name' in ele]
        # Check if more than 3 elements exist. If yes, return only first three.
        # If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names
    # Return empty list in case of missing/malformed data
    return []

In [144]:
#Apply the generate_list function to cast and keywords
df['cast'] = df['cast'].apply(generate_list)
df['keywords'] = df['keywords'].apply(generate_list)

#Only consider a maximum of 3 genres
df['genres'] = df['genres'].apply(lambda x: x[:3])

In [143]:
# Print the new features of the first 5 movies along with title
df[['title', 'cast', 'director', 'keywords', 'genres']].head(5)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,[],John Lasseter,[],"[Animation, Comedy, Family]"
1,Jumanji,[],,[],"[Adventure, Fantasy, Family]"
2,Grumpier Old Men,[],Howard Deutch,[],"[Romance, Comedy]"
3,Waiting to Exhale,[],Forest Whitaker,[],"[Comedy, Drama, Romance]"
4,Father of the Bride Part II,[],,[],[Comedy]
