# Analysis Demonstration

### (things to talk about: how we ran analysis and gathered data, problems we ran into and how to solve them, how our functions could be used, things that surprised us, interesting conclusions, etc)

## Import Functions

In [7]:
# Packages
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import math
import seaborn as sns
from collections import Counter

# Our Functions
from scrape_data import scrape_imdb, scrape_rotten_tomatoes, get_html_text
from clean_data import clean_imdb, clean_rotten_tomatoes
from analysis_functions import list_averages



## Use Our Functions to Scrape and Clean Data
Our functions scrape data from IMDB and Rotten Tomatoes movie lists. We use two lists: the Rotten Tomatoes list at https://editorial.rottentomatoes.com/guide/disney-100-essential-movies/ and the IMDB list at https://www.imdb.com/list/ls089035876/?sort=release_date,desc&st_dt=&mode=detail&page=1 which spans five pages of Disney movies. After scraping, we save the raw data to a CSV file, which we then pass to our cleaning functions. Both the raw data and the final cleaned datasets are included in the data folder.

In [2]:
# Scrape from Rotten Tomatoes

webpage = "https://editorial.rottentomatoes.com/guide/disney-100-essential-movies/"
rotten_tomatoes = scrape_rotten_tomatoes(webpage)
rotten_tomatoes.to_csv('data/rotten_tomatoes_raw.csv', index = False)





In [10]:
# Scrape from IMDB

# The list spans five pages whose URLs differ only in the page number
base_url = "https://www.imdb.com/list/ls089035876/?sort=release_date,desc&st_dt=&mode=detail&page="
webpages = [base_url + str(page) for page in range(1, 6)]

# Scrape each page and stack the results into one dataframe
imdb = pd.concat([scrape_imdb(webpage) for webpage in webpages])

# Save next to the Rotten Tomatoes raw data so the cleaning cell can read it
imdb.to_csv('data/imdb_raw.csv', index = False)





UnboundLocalError: local variable 'text' referenced before assignment
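
An `UnboundLocalError` like the one above generally means a function read a local variable on a code path where it was never assigned, for example when a lookup on the page finds nothing. The sketch below is a generic illustration of that pattern and one way to guard against it. The `extract_text` helpers are hypothetical and are not the code in `scrape_data`.

```python
def extract_text_unsafe(tag):
    """Hypothetical helper showing the failure pattern."""
    if tag is not None:
        text = tag.strip()
    # When tag is None, `text` was never assigned, so this line raises
    # UnboundLocalError.
    return text


def extract_text(tag):
    """Same helper with a default assigned before the branch."""
    text = None  # default used when the element is missing
    if tag is not None:
        text = tag.strip()
    return text


missing = extract_text(None)     # None instead of an exception
present = extract_text(" 7.5 ")  # surrounding whitespace removed
```

Assigning a default before the branch lets the scraper record a missing value and move on, which the cleaning step can then handle.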

In [None]:
# Clean Rotten Tomatoes Dataframe

rotten_tomatoes = pd.read_csv('data/rotten_tomatoes_raw.csv')

rotten_tomatoes = clean_rotten_tomatoes(rotten_tomatoes)

rotten_tomatoes.to_csv('data/rotten_tomatoes.csv', index = False)

In [29]:
# Clean IMDB Dataframe

imdb = pd.read_csv('data/imdb_raw.csv')

# Drop 19 since it has no reviews or data aside from name
imdb = imdb.drop(19, axis='index')

# Drop 170 since it is a DVD containing episodes from different shows and not a movie
imdb = imdb.drop(170, axis='index')

imdb = clean_imdb(imdb)

imdb.to_csv('data/imdb.csv', index = False)

We can then merge the two datasets to run comparisons.

In [None]:
merged = rotten_tomatoes.merge(imdb, how= 'inner', on = ['title', 'year'])
merged.to_csv('data/merged.csv', index = False)
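
An inner merge silently drops any movie whose title or year differs between the two sources, so it can be worth checking how many rows fail to match. The sketch below uses two small made-up frames (the titles and scores are illustrative, not our data) and the `indicator=True` option of `DataFrame.merge` to label where each row came from.

```python
import pandas as pd

# Toy stand-ins for the cleaned frames; all values are illustrative only.
rt_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "year": [1990, 2000, 2010],
    "score": [90, 75, 60],
})
imdb_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie D"],
    "year": [1990, 2001, 2015],
    "score": [7.5, 6.8, 8.1],
})

# An outer merge keeps every row; the `_merge` column records whether a
# row was found in both frames, only the left one, or only the right one.
check = rt_demo.merge(imdb_demo, how="outer", on=["title", "year"], indicator=True)
match_counts = check["_merge"].value_counts()

# Rows labelled "both" are the ones an inner merge would keep.
matched = check[check["_merge"] == "both"]
```

Because both frames have a `score` column, pandas renames them `score_x` (left) and `score_y` (right), which is why the merged IMDB score appears as `score_y` in the cells that follow.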

Below we use our list_averages function to compute the average score for each director in the director column, restricted to the 20 directors who appear most often.

In [None]:
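# Count how often each director appears and keep the 20 most frequent
item_count = imdb['director'].apply(lambda x: Counter(x))
top_directors = pd.DataFrame(sum(item_count, Counter()).most_common(20))

# Average IMDB score per director, limited to those top 20
director_scores = list_averages(imdb, imdb['director'], imdb['score'])
top_scores = director_scores[director_scores[0].isin(list(top_directors[0]))]

# Column 0 holds the director name in both frames, so join on it
top_directors.merge(top_scores, how = 'outer', on = [0])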
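
For readers without access to `analysis_functions`, the idea of averaging over a list-valued column can be sketched with `DataFrame.explode`. This is not the actual `list_averages` implementation. The function name `average_by_list_item` and the sample data are hypothetical, and the sketch assumes each cell in the column already holds a Python list.

```python
import pandas as pd


def average_by_list_item(df, list_col, value_col):
    """Mean of value_col for each distinct item in the list-valued list_col."""
    # explode() gives each list element its own row, repeating value_col.
    exploded = df[[list_col, value_col]].explode(list_col)
    return exploded.groupby(list_col)[value_col].mean()


# Made-up example: the second movie has two directors.
demo = pd.DataFrame({
    "director": [["Ann Lee"], ["Ann Lee", "Bo Park"], ["Bo Park"]],
    "score": [8.0, 6.0, 7.0],
})
director_means = average_by_list_item(demo, "director", "score")
```

A movie with several directors counts once toward each of their averages, which is the natural reading of a per-director score.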

We use the same function to compute the average rating for each genre, and then compare the genre ratings between IMDB and Rotten Tomatoes.

In [None]:
imdb['genre'] = imdb['genre'].apply(lambda x: x.replace(', ', ',')).str.split(',')
genre_ratings = list_averages(imdb, imdb['genre'], imdb['score'])
genre_ratings.sort_values(by = 1)
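
The split step above turns a comma-separated genre string into a list so that each genre can be averaged separately. As a minimal sketch on made-up strings, the variant below splits on the comma alone and strips each piece, which also copes with inconsistent spacing. The sample values are assumptions, not rows from our data.

```python
import pandas as pd

# Illustrative genre strings with inconsistent spacing after the commas.
genres = pd.Series(["Animation, Adventure", "Comedy", "Family,Fantasy"])

# Split on the comma, then strip stray whitespace from every piece.
genre_lists = genres.str.split(",").apply(
    lambda parts: [part.strip() for part in parts]
)
```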

In [None]:
merged['genre'] = merged['genre'].apply(lambda x: x.replace(', ', ',')).str.split(',')
genre_ratings2 = list_averages(merged, merged['genre'], merged['score_y'])
genre_ratings2.sort_values(by = 1)

In [None]:
genre_ratings3 = list_averages(merged, merged['genre'], merged['comparison_score'])
genre_ratings3[1] = genre_ratings3[1].astype(float)
genre_ratings3.sort_values(by = 1)

Continuing our comparisons, we can look at the overall average score for IMDB and for Rotten Tomatoes.

In [None]:
# Pass the mean as a separate argument; concatenating a str and a float raises TypeError
print("IMDB Score:", merged.score_y.mean())
print("Rotten Tomatoes Score:", merged.comparison_score.mean())

Here is the score by decade for IMDB.

In [None]:
merged.groupby(merged['decade']).score_y.mean()

And here is the same for Rotten Tomatoes.

In [None]:
merged.groupby(merged['decade']).comparison_score.mean()
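
The `decade` column used above comes from our cleaning step. As a hedged sketch of how such a column can be derived and grouped, the example below floors each year to its decade with integer division. The frame and its values are made up, and the real cleaning code may compute the column differently.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({
    "year": [1994, 1999, 2003, 2008],
    "score": [8.5, 7.5, 6.0, 7.0],
})

# Integer-divide by 10 and multiply back to floor each year to its decade,
# e.g. 1994 becomes 1990.
demo["decade"] = demo["year"] // 10 * 10

# Mean score per decade, indexed by the decade's first year.
decade_means = demo.groupby("decade")["score"].mean()
```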

We can also compare the two sets of scores graphically.

In [None]:
plt.scatter(merged.score_y, merged.comparison_score) 
plt.xlabel('IMDB')
plt.ylabel('Rotten Tomatoes')
plt.title('IMDB vs. Rotten Tomatoes Scores')
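
A single number can complement the scatter plot. The sketch below computes the Pearson correlation between two score columns with `Series.corr`. The column names `imdb_score` and `rt_score` and all values are hypothetical. On our data the equivalent call would be `merged.score_y.corr(merged.comparison_score)`.

```python
import pandas as pd

# Made-up scores on the two sites' native scales.
demo = pd.DataFrame({
    "imdb_score": [6.0, 7.0, 8.0, 9.0],
    "rt_score": [50, 80, 70, 95],
})

# Pearson correlation is unaffected by the differing scales (0-10 vs 0-100).
r = demo["imdb_score"].corr(demo["rt_score"])
```

A value near 1 would indicate that the two sites rank movies similarly, while a value near 0 would indicate little linear relationship.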

Another point of interest is the average IMDB score for each movie rating.

In [None]:
imdb.groupby(imdb['rating']).score.mean()

We can also look at trends over the years in the following graphs.

In [None]:
plt.plot(imdb.groupby(imdb['year']).gross.mean(),  'r') 
plt.xlabel('Year')
plt.ylabel('Gross')
plt.title('Average Gross By Year')

In [None]:
plt.plot(imdb.groupby(imdb['year']).runtime.mean(),  'r') 
plt.xlabel('Year')
plt.ylabel('Runtime')
plt.title('Average Runtime By Year')

In [None]:
plt.plot(imdb.groupby(imdb['year']).score.mean(),  'r') 
plt.xlabel('Year')
plt.ylabel('Mean Score')
plt.title('Average IMDB Score By Year')

And finally the complete pairplot.

In [None]:
sns.pairplot(imdb, hue = 'rating')