## Predicting Book-to-Film Adaptations
This project predicts the likelihood of books becoming films and finding success on the big screen. 

We set out to answer the question, “Which characteristics of books are strongly correlated with high-grossing, positively rated book-to-movie adaptations?” The book characteristics that we explored, and hoped would be associated with film success, were book ratings, page counts, book genres, book audiences, and publisher type.

In [1]:
# Import dependencies
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

In [2]:
# Reference csv file paths
goodreads_path = "Raw_Data/books.csv"
keywords_path = "Raw_Data/keywords.csv"
movies_path = "Raw_Data/movies_metadata.csv"
ratings_path = "Raw_Data/ratings.csv"
link_path = "Raw_Data/links.csv"

# Import csv files as DataFrames
# Book data
goodreads_df = pd.read_csv(goodreads_path, encoding="utf-8")
# Movie data
keywords_df = pd.read_csv(keywords_path, encoding="utf-8")
movies_df = pd.read_csv(movies_path, encoding="utf-8")
ratings_df = pd.read_csv(ratings_path, encoding="utf-8")
# Id linking data
link_df = pd.read_csv(link_path, encoding="utf-8")


In [22]:
# Filter the stringified keyword lists for 'based on novel' to find book-to-movie adaptations
# .copy() makes adaptations an independent DataFrame, so the assignment below is unambiguous
adaptations = keywords_df[keywords_df['keywords'].str.contains('based on novel', regex=False)].copy()

# Convert int to str to merge dfs
adaptations['id'] = adaptations['id'].astype(str)
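
The filter above matches the text `based on novel` anywhere in the stringified keyword list, so a longer keyword that merely contains that phrase would also match. A stricter variant could parse each cell and compare keyword names exactly. The sketch below is one way to do that, assuming the `keywords` cells are Python-literal strings like those in the preview; `has_keyword` is a hypothetical helper and the sample rows are made up for illustration.

```python
import ast

import pandas as pd


def has_keyword(cell, name):
    """Return True if the stringified keyword list in `cell` contains `name` exactly."""
    try:
        keywords = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return False  # malformed or missing cell
    if not isinstance(keywords, list):
        return False
    return any(isinstance(k, dict) and k.get('name') == name for k in keywords)


# Illustrative sample only, not rows from keywords.csv
sample_keywords = pd.DataFrame({
    'id': [1, 2, 3],
    'keywords': [
        "[{'id': 818, 'name': 'based on novel'}]",
        "[{'id': 1, 'name': 'loosely based on novella'}]",
        "[]",
    ],
})

# Exact-name match: the second row contains the phrase as a substring but is excluded
mask = sample_keywords['keywords'].apply(lambda cell: has_keyword(cell, 'based on novel'))
sample_adaptations = sample_keywords[mask].copy()
sample_adaptations['id'] = sample_adaptations['id'].astype(str)
```

Whether the stricter match changes the result on the real data depends on which keywords are present, so it is worth comparing the two row counts before switching.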


In [4]:
# Join dfs on 'id' field
adaptations_merge = pd.merge(adaptations, movies_df, how='inner', on='id')

# Preview the first rows
adaptations_merge.head(2)

Unnamed: 0,id,keywords,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...",False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,Waiting to Exhale,...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
1,4584,"[{'id': 420, 'name': 'bowling'}, {'id': 818, '...",False,,16500000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,tt0114388,en,Sense and Sensibility,...,1995-12-13,135000000.0,136.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Lose your heart and come to your senses.,Sense and Sensibility,False,7.2,364.0


In [5]:
# Select subset of necessary columns for analysis
adaptations_merge = adaptations_merge[['id', 'keywords', 'adult', 'budget', 'genres', 'imdb_id',
                                       'original_language', 'original_title', 'release_date', 
                                      'revenue', 'runtime', 'spoken_languages', 'status', 'title']]

# Select movies that have already been released for analysis 
adaptations_merge = adaptations_merge[adaptations_merge['status'] == 'Released']

# Drop rows with NaN values
adaptations_merge = adaptations_merge.dropna(how = 'any')

# Visualize
adaptations_merge.head(2)

Unnamed: 0,id,keywords,adult,budget,genres,imdb_id,original_language,original_title,release_date,revenue,runtime,spoken_languages,status,title
0,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...",False,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",tt0114885,en,Waiting to Exhale,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Waiting to Exhale
1,4584,"[{'id': 420, 'name': 'bowling'}, {'id': 818, '...",False,16500000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",tt0114388,en,Sense and Sensibility,1995-12-13,135000000.0,136.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Sense and Sensibility


There are 829 released films based on novels. 
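
Rows can pass `dropna` while still carrying a budget or revenue of 0, as The Neon Bible does in the merged preview later in this notebook. If those zeros stand for unreported figures, they could be treated as missing before any revenue analysis. The sketch below illustrates the idea on made-up rows; reading 0 as "not reported" is an assumption, not something the data documents.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; column names follow the subset selected above
sample_films = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'budget': ['16000000', '0', '50000000'],
    'revenue': [81452156.0, 0.0, 11534477.0],
})

money_cols = ['budget', 'revenue']
for col in money_cols:
    # Coerce to numbers; values that cannot be parsed become NaN
    sample_films[col] = pd.to_numeric(sample_films[col], errors='coerce')

# Assumption: a value of 0 means "not reported" rather than a true zero
sample_films[money_cols] = sample_films[money_cols].replace(0, np.nan)

# Keep only films with both figures present
films_with_financials = sample_films.dropna(subset=money_cols)
```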

In [36]:
# Drop null rows, convert float to int to remove the decimal, then int to str to merge dfs on the same datatype
link_df = link_df.dropna()
link_df['tmdbId'] = link_df['tmdbId'].astype(np.int64).astype(str)

# Rename tmdbId to id for merging
link_df = link_df.rename(columns={'tmdbId': 'id'})

# Merge adaptations and link data on id field
movies_merge = pd.merge(adaptations_merge, link_df, how='inner', on='id')
movies_merge.head(3)

In [24]:
movies_merge.shape

(833, 17)
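
The merged frame has 833 rows, slightly more than the 829 released adaptations counted earlier. An inner merge can only add rows when a key value is repeated on one side, so one possible explanation is duplicated `id` values. The sketch below shows two ways to check for that on made-up frames; it is illustrative and does not confirm what happened in this dataset.

```python
import pandas as pd

# Illustrative frames only
left = pd.DataFrame({'id': ['1', '2', '2'], 'title': ['A', 'B', 'B']})
right = pd.DataFrame({'id': ['1', '2'], 'movieId': [10, 20]})

# Rows whose key appears more than once on the left side
dup_left = left[left.duplicated('id', keep=False)]

# A plain inner merge silently keeps every matching pair
merged = pd.merge(left, right, how='inner', on='id')

# validate= makes pandas raise if the keys are not unique on both sides
try:
    pd.merge(left, right, how='inner', on='id', validate='one_to_one')
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
```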

In [26]:
# Select subset of necessary columns for analysis
ratings_df = ratings_df[['movieId', 'rating']]

# View data types
ratings_df.info()

# View shape
ratings_df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26024289 entries, 0 to 26024288
Data columns (total 2 columns):
 #   Column   Dtype  
---  ------   -----  
 0   movieId  int64  
 1   rating   float64
dtypes: float64(1), int64(1)
memory usage: 397.1 MB


(26024289, 2)

There are ~26M ratings provided for 45,115 unique movies.

In [35]:
# Spot-check a single movie (movieId 356) in the merged data
test = movies_merge[movies_merge['movieId'] == 356]
test

Unnamed: 0,id,keywords,adult,budget,genres,imdb_id,original_language,original_title,release_date,revenue,runtime,spoken_languages,status,title,index,movieId,imdbId
13,13,"[{'id': 422, 'name': 'vietnam veteran'}, {'id'...",False,55000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",tt0109830,en,Forrest Gump,1994-07-06,677945399.0,142.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Forrest Gump,352,356,109830


In [14]:
# Count ratings per movie; the length of the result is the number of unique movies rated
num_movies = ratings_df.value_counts('movieId')
num_movies

movieId
356       91921
318       91082
296       87901
593       84078
2571      77960
          ...  
151575        1
151581        1
113014        1
151589        1
176275        1
Length: 45115, dtype: int64

In [33]:
# Groupby id to calculate mean of user ratings
ratings_group = ratings_df.groupby('movieId')
ratings_mean = ratings_group['rating'].mean().round(2).reset_index()
ratings_mean.head()

Unnamed: 0,movieId,rating
0,1,3.89
1,2,3.24
2,3,3.18
3,4,2.88
4,5,3.08
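
The rating counts above show that some movies have a single rating, so their mean may not be very informative. One option is to compute the count alongside the mean and filter on it. The sketch below uses made-up ratings; `min_ratings` is a hypothetical threshold and its value is arbitrary.

```python
import pandas as pd

# Illustrative sample only
sample_ratings = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 3, 3],
    'rating': [4.0, 3.5, 4.5, 5.0, 2.0, 3.0],
})

# Mean rating and number of ratings per movie in one pass
rating_stats = (
    sample_ratings.groupby('movieId')['rating']
    .agg(rating='mean', rating_count='count')
    .round({'rating': 2})
    .reset_index()
)

# Hypothetical threshold: keep movies with at least this many ratings
min_ratings = 2
reliable = rating_stats[rating_stats['rating_count'] >= min_ratings]
```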


In [37]:
# Merge ratings and movie adaptations df
final_movies_df = pd.merge(movies_merge, ratings_mean, how = 'left', on = 'movieId')
final_movies_df.head()

Unnamed: 0,id,keywords,adult,budget,genres,imdb_id,original_language,original_title,release_date,revenue,runtime,spoken_languages,status,title,index,movieId,imdbId,rating
0,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...",False,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",tt0114885,en,Waiting to Exhale,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Waiting to Exhale,3,4,114885,2.88
1,4584,"[{'id': 420, 'name': 'bowling'}, {'id': 818, '...",False,16500000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",tt0114388,en,Sense and Sensibility,1995-12-13,135000000.0,136.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Sense and Sensibility,16,17,114388,3.95
2,8012,"[{'id': 395, 'name': 'gambling'}, {'id': 416, ...",False,30250000,"[{'id': 35, 'name': 'Comedy'}, {'id': 53, 'nam...",tt0113161,en,Get Shorty,1995-10-20,115101622.0,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Get Shorty,20,21,113161,3.57
3,11859,"[{'id': 258, 'name': 'bomb'}, {'id': 416, 'nam...",False,50000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",tt0113010,en,Fair Game,1995-11-03,11534477.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Fair Game,70,71,113010,2.35
4,39428,"[{'id': 531, 'name': 'southern usa'}, {'id': 8...",False,0,"[{'id': 18, 'name': 'Drama'}]",tt0113952,en,The Neon Bible,1995-08-23,0.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Neon Bible,136,138,113952,3.28


In [None]:
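# Select subset of necessary columns for analysis
# Only 'title' is listed so far because the merge below joins on it; the remaining book columns are still to be chosen
goodreads_df = goodreads_df[['title']]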

In [None]:
# Merge Goodreads data with final_movies_df
final_df = pd.merge(final_movies_df, goodreads_df, how = 'left', on = 'title')
final_df.head()
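
Merging on `title` only matches rows whose titles are identical character for character, so differences in case, spacing, or punctuation would leave books unmatched. A normalized key is one way to soften that. The sketch below is illustrative: `normalize_title`, `title_key`, and the `pages` column are hypothetical names, and the rows are made up.

```python
import pandas as pd


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so near-identical titles match."""
    cleaned = ''.join(ch for ch in str(title).lower() if ch.isalnum() or ch.isspace())
    return ' '.join(cleaned.split())


# Illustrative frames only
sample_films = pd.DataFrame({'title': ['Sense and Sensibility', 'Get Shorty']})
sample_books = pd.DataFrame({
    'title': ['sense and sensibility ', 'Get Shorty!'],
    'pages': [100, 200],
})

sample_films['title_key'] = sample_films['title'].map(normalize_title)
sample_books['title_key'] = sample_books['title'].map(normalize_title)

# indicator=True adds a _merge column showing which film rows found a book
matched = pd.merge(
    sample_films,
    sample_books.drop(columns='title'),
    how='left',
    on='title_key',
    indicator=True,
)
unmatched = matched[matched['_merge'] == 'left_only']
```

Different books can share a title, so a title-only key may still pair a film with the wrong book; inspecting `unmatched` and any duplicated keys is a reasonable follow-up.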

In [None]:
# Export csv of final_df
final_df.to_csv('book_to_film_adaptations.csv')
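
With the combined table exported, the question posed at the top could be explored by correlating a book characteristic with a film outcome. The sketch below computes a Pearson correlation on made-up numbers. `book_rating` is a hypothetical column name, since the Goodreads columns are not yet selected in this notebook, while `rating` mirrors the mean user rating column built earlier.

```python
import pandas as pd

# Illustrative values only, not results from the dataset
sample_final = pd.DataFrame({
    'book_rating': [3.5, 4.0, 4.2, 3.8, 4.5],  # hypothetical book-side column
    'rating': [2.9, 3.4, 3.9, 3.1, 4.0],       # mean film rating
})

# Pearson correlation between the two columns (Series.corr defaults to Pearson)
r = sample_final['book_rating'].corr(sample_final['rating'])
print(f"Pearson r = {r:.2f}")
```

A correlation on the real data would describe association only, and the small number of adaptations limits how much weight any single coefficient can carry.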