# Purpose

My data scrape appears to be missing a lot of data. I'm investigating why, so that I can figure out what to change to improve the scrape.

From a small check, I think I need to do the following:

* Send an initial search to imdb.com that requires every movie to have a rating
* Scrape more years: 2010 - 2019
* Update the movie page scrape to look for more ways the movie rating is reported
* Update the movie page scrape to look for more ways the budget and gross sales numbers are reported
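
The last item could be handled by a small normalizing helper. Below is a minimal sketch, assuming the movie page reports budget and gross figures as free text such as "$100,000,000 (estimated)" or "$1.2 million". The helper name `parse_money` and the sample formats are hypothetical and are not taken from the actual scraper.

```python
import re

# Hypothetical helper: not part of the original mymovie module.
_MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def parse_money(text):
    """Return a whole-number amount parsed from free text, or None if no number is found."""
    if not text:
        return None
    # First number in the text, optionally followed by a magnitude word
    match = re.search(
        r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?",
        text,
        re.IGNORECASE,
    )
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    word = match.group(2)
    multiplier = _MULTIPLIERS[word.lower()] if word else 1
    return int(round(number * multiplier))


# Assumed sample strings, for illustration only
for sample in ["$100,000,000 (estimated)", "Budget: $1.2 million", "N/A"]:
    print(repr(sample), "->", parse_money(sample))
```

Returning `None` instead of 0 when nothing can be parsed would keep "not reported" distinguishable from a genuine zero in the later analysis.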

In [1]:
from bs4 import BeautifulSoup
from random import randint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import re
import time
import pickle

In [2]:
from mymovie import Movie

In [3]:
%matplotlib inline

Read in the pickle file containing the 2000 movie page scrapes.

In [4]:
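list_of_populated_movies = []

# The file holds one pickled Movie object after another, so keep loading until EOF.
# The with block closes the file even if loading fails part way through.
with open("Movie_populated_objects.pkl", "rb") as expanded_pickle_file:
    while True:
        try:
            mymovie = pickle.load(expanded_pickle_file)
            list_of_populated_movies.append(mymovie)
        except EOFError:
            print("Done loading movies")
            break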

Done loading movies


In [5]:
print(len(list_of_populated_movies))

2000


In [6]:
column_names_in_df = list(list_of_populated_movies[0].__dict__.keys())

In [7]:
# Build one list of values per Movie attribute, in the same order as column_names_in_df
list_of_movie_data_lists = []
for movie_key in column_names_in_df:
    list_of_movie_data_values = []
    for movie_obj in list_of_populated_movies:
        list_of_movie_data_values.append(movie_obj.__dict__[movie_key])

    list_of_movie_data_lists.append(list_of_movie_data_values)

# Map each column name to its list of values
movie_dict = dict(zip(column_names_in_df, list_of_movie_data_lists))

In [8]:
movie_df = pd.DataFrame(movie_dict)
movie_df.head(10)

|  | title | directlink_url | domesticTotalGross | rating | director | releaseDate | genre | runtime | cast1 | cast2 | cast3 | budget | star_rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | The Wolf of Wall Street | http://www.imdb.com/title/tt0993846/ | 116900694 | R | Martin Scorsese | 25 December 2013 | Biography | 180 | Leonardo DiCaprio | Jonah Hill | Margot Robbie | 100000000 | 8.2 |
| 1 | Prisoners | http://www.imdb.com/title/tt1392214/ | 61002302 | R | Denis Villeneuve | 20 September 2013 | Crime | 153 | Hugh Jackman | Jake Gyllenhaal | Viola Davis | 46000000 | 8.1 |
| 2 | The Croods | http://www.imdb.com/title/tt0481499/ | 187168425 | PG | Kirk DeMicco | 22 March 2013 | Animation | 98 | Nicolas Cage | Ryan Reynolds | Emma Stone | 135000000 | 7.2 |
| 3 | Snowpiercer | http://www.imdb.com/title/tt1706620/ | 4563650 | R | Bong Joon Ho | 11 July 2014 | Action | 126 | Chris Evans | Jamie Bell | Tilda Swinton | 39200000 | 7.1 |
| 4 | Frozen | http://www.imdb.com/title/tt2294629/ | 400738009 | PG | Chris Buck | 27 November 2013 | Animation | 102 | Kristen Bell | Idina Menzel | Jonathan Groff | 150000000 | 7.4 |
| 5 | Man of Steel | http://www.imdb.com/title/tt0770828/ | 291045518 | PG-13 | Zack Snyder | 14 June 2013 | Action | 143 | Henry Cavill | Amy Adams | Michael Shannon | 225000000 | 7.0 |
| 6 | The Great Gatsby | http://www.imdb.com/title/tt1343092/ | 144840419 | PG-13 | Baz Luhrmann | 10 May 2013 | Drama | 143 | Leonardo DiCaprio | Carey Mulligan | Joel Edgerton | 105000000 | 7.2 |
| 7 | Her | http://www.imdb.com/title/tt1798709/ | 25568251 | R | Spike Jonze | 10 January 2014 | Drama | 126 | Joaquin Phoenix | Amy Adams | Scarlett Johansson | 23000000 | 8.0 |
| 8 | The Hunger Games: Catching Fire | http://www.imdb.com/title/tt1951264/ | 424668047 | PG-13 | Francis Lawrence | 22 November 2013 | Action | 146 | Jennifer Lawrence | Josh Hutcherson | Liam Hemsworth | 130000000 | 7.5 |
| 9 | The Conjuring | http://www.imdb.com/title/tt1457767/ | 137400141 | R | James Wan | 19 July 2013 | Horror | 112 | Patrick Wilson | Vera Farmiga | Ron Livingston | 20000000 | 7.5 |


General info about this data frame


In [9]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               2000 non-null   object 
 1   directlink_url      2000 non-null   object 
 2   domesticTotalGross  2000 non-null   int64  
 3   rating              2000 non-null   object 
 4   director            2000 non-null   object 
 5   releaseDate         2000 non-null   object 
 6   genre               2000 non-null   object 
 7   runtime             2000 non-null   int64  
 8   cast1               2000 non-null   object 
 9   cast2               2000 non-null   object 
 10  cast3               2000 non-null   object 
 11  budget              2000 non-null   int64  
 12  star_rating         2000 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 203.2+ KB


In [11]:
movie_df.describe()

|  | domesticTotalGross | runtime | budget | star_rating |
| --- | --- | --- | --- | --- |
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 5364675.0 | 96.4345 | 5042918.0 | 5.72735 |
| std | 29293390.0 | 31.19425 | 21343720.0 | 1.344225 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 87.0 | 0.0 | 5.0 |
| 50% | 0.0 | 96.0 | 0.0 | 5.9 |
| 75% | 2577.0 | 110.0 | 110500.0 | 6.7 |
| max | 424668000.0 | 252.0 | 225000000.0 | 9.4 |


Value counts for a few columns

In [12]:
movie_df["domesticTotalGross"].value_counts()

0           1479
373375         1
53895          1
152200         1
4076           1
            ... 
173472         1
2146999        1
30664106       1
32015787       1
37884          1
Name: domesticTotalGross, Length: 522, dtype: int64

So 1479 of the 2000 movies (about 74%) report 0 for domestic gross. Which movies are reporting 0?
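
The same question can be asked of every numeric column at once. Below is a minimal sketch, assuming a 0 in these columns means "not scraped" rather than a true zero. The helper name `zero_share` and the toy frame are illustrative stand-ins and are not part of the notebook's data.

```python
import pandas as pd


def zero_share(df, columns):
    """Fraction of rows equal to 0 in each listed column."""
    return (df[columns] == 0).mean()


# Toy stand-in for movie_df; the values are made up for illustration
toy_df = pd.DataFrame({
    "domesticTotalGross": [116900694, 0, 0, 4563650],
    "budget": [100000000, 0, 27220000, 0],
})

shares = zero_share(toy_df, ["domesticTotalGross", "budget"])
print(shares)
```

On the real frame the call would be `zero_share(movie_df, ["domesticTotalGross", "budget", "runtime", "star_rating"])`. The `describe()` output above already shows a median budget of 0, so at least half of the rows have no budget recorded.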

In [14]:
# Show every requested row and column in full
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
movie_df[ movie_df["domesticTotalGross"] == 0 ].head(100)


|  | title | directlink_url | domesticTotalGross | rating | director | releaseDate | genre | runtime | cast1 | cast2 | cast3 | budget | star_rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 38 | The Frozen Ground | http://www.imdb.com/title/tt2005374/ | 0 | R | Scott Walker | 1 February 2013 | Crime | 105 | Nicolas Cage | Vanessa Hudgens | John Cusack | 27220000 | 6.4 |
| 83 | U Want Me 2 Kill Him? | http://www.imdb.com/title/tt0485061/ | 0 |  | Andrew Douglas | 23 April 2015 | Drama | 92 | Stephanie Leonidas | Toby Regbo | Joanne Froggatt | 0 | 6.3 |
| 114 | Nurse 3D | http://www.imdb.com/title/tt1913166/ | 0 | R | Douglas Aarniokoski | 24 October 2013 | Horror | 84 | Paz de la Huerta | Katrina Bowden | Judd Nelson | 10000000 | 4.5 |
| 123 | Curse of Chucky | http://www.imdb.com/title/tt2230358/ | 0 | R | Don Mancini | 24 September 2013 | Horror | 97 | Chantal Quesnelle | Fiona Dourif | Jordan Gavaris | 5000000 | 5.6 |
| 130 | Miracle in Cell No. 7 | http://www.imdb.com/title/tt2659414/ | 0 |  | Hwan-kyung Lee | 8 March 2013 | Comedy | 127 | Seung-ryong Ryu | So Won Kal | Dal-su Oh | 0 | 8.2 |
| 140 | Behind The Candelabra | http://www.imdb.com/title/tt1291580/ | 0 |  | Steven Soderbergh | 26 May 2013 | Biography | 118 | Michael Douglas | Matt Damon | Scott Bakula | 23000000 | 7.0 |
| 144 | Odd Thomas | http://www.imdb.com/title/tt1767354/ | 0 |  | Stephen Sommers | 28 February 2014 | Comedy | 100 | Anton Yelchin | Ashley Sommers | Leonor Varela | 27000000 | 6.8 |
| 147 | Mystery Road | http://www.imdb.com/title/tt2236054/ | 0 |  | Ivan Sen | 15 August 2013 | Crime | 121 | Aaron Pedersen | Hugo Weaving | Ryan Kwanten | 0 | 6.6 |
| 155 | The Lifeguard | http://www.imdb.com/title/tt2265534/ | 0 | R | Liz W. Garcia | 30 July 2013 | Drama | 98 | Kristen Bell | Mamie Gummer | Martin Starr | 0 | 5.6 |
| 165 | The Garden of Words | http://www.imdb.com/title/tt2591814/ | 0 |  | Makoto Shinkai | 31 May 2013 | Animation | 46 | Miyu Irino | Kana Hanazawa | Fumi Hirano | 0 | 7.5 |


In [16]:
movie_df["rating"].value_counts()

         1458
R        338 
PG-13    141 
PG       61  
NC-17    1   
PG-      1   
Name: rating, dtype: int64

In [18]:
movie_df[ movie_df["rating"] == "" ].head(100)

|  | title | directlink_url | domesticTotalGross | rating | director | releaseDate | genre | runtime | cast1 | cast2 | cast3 | budget | star_rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21 | Nymphomaniac: Vol. I | http://www.imdb.com/title/tt1937390/ | 785896 |  | Lars von Trier | 6 March 2014 | Drama | 117 | Charlotte Gainsbourg | Stellan Skarsgård | Stacy Martin | 0 | 6.9 |
| 54 | Coherence | http://www.imdb.com/title/tt2866360/ | 102617 |  | James Ward Byrkit | 6 August 2014 | Drama | 89 | Emily Baldoni | Maury Sterling | Nicholas Brendon | 50000 | 7.2 |
| 62 | Monsters University | http://www.imdb.com/title/tt1453405/ | 268492764 |  | Dan Scanlon | 21 June 2013 | Animation | 104 | Billy Crystal | John Goodman | Steve Buscemi | 200000000 | 7.3 |
| 81 | Nymphomaniac: Vol. II | http://www.imdb.com/title/tt2382009/ | 327167 |  | Lars von Trier | 20 March 2014 | Drama | 124 | Charlotte Gainsbourg | Stellan Skarsgård | Willem Dafoe | 0 | 6.7 |
| 83 | U Want Me 2 Kill Him? | http://www.imdb.com/title/tt0485061/ | 0 |  | Andrew Douglas | 23 April 2015 | Drama | 92 | Stephanie Leonidas | Toby Regbo | Joanne Froggatt | 0 | 6.3 |
| 106 | The Great Beauty | http://www.imdb.com/title/tt2358891/ | 2852400 |  | Paolo Sorrentino | 14 March 2014 | Drama | 141 | Toni Servillo | Carlo Verdone | Sabrina Ferilli | 0 | 7.8 |
| 126 | Young & Beautiful | http://www.imdb.com/title/tt2752200/ | 61067 |  | François Ozon | 25 April 2014 | Drama | 95 | Marine Vacth | Géraldine Pailhas | Frédéric Pierrot | 0 | 6.7 |
| 130 | Miracle in Cell No. 7 | http://www.imdb.com/title/tt2659414/ | 0 |  | Hwan-kyung Lee | 8 March 2013 | Comedy | 127 | Seung-ryong Ryu | So Won Kal | Dal-su Oh | 0 | 8.2 |
| 140 | Behind The Candelabra | http://www.imdb.com/title/tt1291580/ | 0 |  | Steven Soderbergh | 26 May 2013 | Biography | 118 | Michael Douglas | Matt Damon | Scott Bakula | 23000000 | 7.0 |
| 144 | Odd Thomas | http://www.imdb.com/title/tt1767354/ | 0 |  | Stephen Sommers | 28 February 2014 | Comedy | 100 | Anton Yelchin | Ashley Sommers | Leonor Varela | 27000000 | 6.8 |
