* Topic:        Challenge Set 3
* Subject:      Pandas (Exploring and visualizing our scraped movie data in pandas)
* Date:         TODO
* Name:         Michael Green ( malexgreen@gmail.com )

What data do I need to read in if I'm going to scrape IMDB.com?

Here's what is in `GITROOT/challenges/challenges_data/2013_movies.csv`

```csv
Title,Budget,DomesticTotalGross,Director,Rating,Runtime,ReleaseDate

```

So if I'm going to scrape IMDB, I'll need the same data:

* `title`: Name of the movie
* `budget`: How much money was spent to make the movie
* `domesticTotalGross`: How much money did the movie make in gross ticket sales in the US
* `rating`: MPAA Rating
* `director`: Name of the director
* `releaseDate`: Calendar date of the release. Time of release is assumed to be 12AM EDT
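
As a quick check of that column list, here is a minimal sketch of loading a file with the same header into pandas and renaming the columns to the lower-camel-case names above. The header line is copied from `2013_movies.csv`; the single data row is a made-up placeholder (the real rows are not shown here), and the in-memory `io.StringIO` stands in for the real file path.

```python
import io
import pandas as pd

# Header copied from 2013_movies.csv; the data row is a made-up placeholder
sample_csv = io.StringIO(
    "Title,Budget,DomesticTotalGross,Director,Rating,Runtime,ReleaseDate\n"
    "Example Movie,1000000,2500000,Jane Doe,PG-13,101,2013-11-22 00:00:00\n"
)

# parse_dates turns the ReleaseDate strings into datetime64 values
reference_df = pd.read_csv(sample_csv, parse_dates=["ReleaseDate"])

# Lower-case the first letter so the names match the list above
reference_df.columns = [name[0].lower() + name[1:] for name in reference_df.columns]

print(reference_df.dtypes)
```

Note that the CSV also carries a `Runtime` column, so the scraper should probably collect the runtime as well.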

Note: This code will be reused for [Project 2](https://github.com/michael-a-green/onl20_ds4/blob/master/curriculum/project-02/project-02-introduction/project_02.md). Some of the features extracted in this notebook are used for this challenge (Challenge Set 3) and other features are used in the project.

Below are the features that are extracted in this notebook that are used for the challenge:

* `title`: Name of the movie
* `budget`: How much money was spent to make the movie
* `domesticTotalGross`: How much money did the movie make in gross ticket sales in the US
* `rating`: MPAA Rating
* `director`: Name of the director
* `releaseDate`: Calendar date of the release. Time of release is assumed to be 12AM EDT

Below are the features that will be used in the linear regression that will be performed in Project 2:

* `budget`: How much money was spent to make the movie
* `rating`: MPAA Rating, one-hot encoded
* `encodedDirector`: Name of the director, but encoded in a specific way that will be explained below
* `releaseDate`: Calendar date of the release. Time of release is assumed to be 12AM EDT
* `genre`: one-hot encoded. See below for more details
* `runtime`: measured in minutes
* `encodedCast1`: one-hot encoded value of the name of a cast member if that cast member is in a specific list of cast members. Details of the encoding are explained below
* `encodedCast2`: one-hot encoded value of the name of another cast member if that cast member is in a specific list of cast members. Details of the encoding are explained below
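
For reference, plain one-hot encoding of a categorical column such as `rating` can be sketched with `pd.get_dummies`. The ratings below are made up for illustration, and the special encodings for `encodedDirector`, `encodedCast1`, and `encodedCast2` are not reproduced here since their details come later.

```python
import pandas as pd

# Made-up ratings purely for illustration
toy_df = pd.DataFrame({"rating": ["PG", "R", "PG-13", "R"]})

# One 0/1 column per distinct rating value
rating_dummies = pd.get_dummies(toy_df["rating"], prefix="rating", dtype=int)

# Keep the original column next to the encoded ones for comparison
encoded_df = pd.concat([toy_df, rating_dummies], axis=1)
print(encoded_df)
```

Each row has exactly one `1` across the `rating_*` columns, which is what a linear regression expects from a one-hot encoded feature.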




Notes on imdb.com web page structure:

I searched for the movie _Star Wars: Episode IV - A New Hope_.

This is the URL it gave to me:
`https://www.imdb.com/title/tt0076759/?ref_=nv_sr_srsg_0`

This is where I found the content I need within the page for a movie entry:

```html
<div id="pagecontent" class="pagecontent">
    <div id="main_bottom" class="main">
        
        <div class="article" id="titleStoryLine">
            
        </div>
    
    
    <div class="article" id="titleDetails">

        <div class="txt-block">
            <h4 class="inline">Release Date:</h4>
            " 25 May 1977 (USA) "
            <!-- more stuff see web page for details -->
        </div>
        <div class="txt-block">
            <h4 class="inline">Also Known As:</h4>
            " Star Wars: Episode IV - A New Hope "
             <!-- more stuff see web page for details -->
        </div>
        <h3 class="subheading">Box Office</h3>
        <div class="txt-block">
            <h4 class="inline">Budget:</h4>
            "$11,000,000 "
            <!-- more stuff see web page for details -->
        </div>
        <div class="txt-block">
            <h4 class="inline">Gross USA:</h4>
            " $460,998,507 "
        </div>
        
    </div>

</div>


```
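
Given that structure, a minimal sketch of pulling the labelled values out of the `txt-block` divs could look like the following. The HTML string is a trimmed copy of the snippet above, `parse_title_details` and `dollars_to_int` are hypothetical helper names (not part of the `mymovie` module), and the sketch assumes the value is the text node right after each `<h4>` label, which may not hold for every block on the real page.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the structure shown above
sample_html = """
<div class="article" id="titleDetails">
    <div class="txt-block">
        <h4 class="inline">Release Date:</h4> 25 May 1977 (USA)
    </div>
    <div class="txt-block">
        <h4 class="inline">Budget:</h4>$11,000,000
    </div>
    <div class="txt-block">
        <h4 class="inline">Gross USA:</h4> $460,998,507
    </div>
</div>
"""


def parse_title_details(html):
    """Return {label: text} for each txt-block that has an <h4> label."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for block in soup.find_all("div", class_="txt-block"):
        label_tag = block.find("h4", class_="inline")
        if label_tag is None:
            continue
        label = label_tag.get_text(strip=True).rstrip(":")
        # Assumption: the value is the node that directly follows the label
        value = label_tag.next_sibling
        details[label] = str(value).strip() if value else ""
    return details


def dollars_to_int(text):
    """Convert '$11,000,000' to 11000000; return 0 if there are no digits."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else 0


details = parse_title_details(sample_html)
budget = dollars_to_int(details["Budget"])
gross = dollars_to_int(details["Gross USA"])
print(details["Release Date"], budget, gross)
```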

In [1]:
#Just going to try beautiful soup to see if I can grab all of the txt-blocks
from bs4 import BeautifulSoup
from random import randint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import re
import time
import pickle

In [2]:
#movie module
from mymovie import Movie

In [3]:
%matplotlib inline

In [4]:
def my_wait(start,stop):
    """
    waits a number of seconds randomly selected between start and stop
    """
    if start <= 0:
        start = 10
    
    if stop <= 0:
        stop = 30
    
    if stop <= start:
        stop = start + 10
    
    wait_time = randint(start,stop)
    
    time.sleep(wait_time)
    
    return
    

In [5]:
def my_print(print_string,debug=0,LOG_FILE=None):
    """
    LOG_FILE = Must be a file handle
    """
    
    if (debug):
        if LOG_FILE is None:
            print(print_string)
        else:
            print(print_string, file=LOG_FILE)


In [6]:
star_wars = Movie("Star Wars Episode IV","www.intel.com")
my_print(star_wars,1)

title = Star Wars Episode IV domesticTotalGross = 0 rating =  director =  releaseDate =  runtime = 0 cast1 =  cast2 =  cast3 =   genre =  budget = 0 star_rating = 0 directlink_url = www.intel.com


In [7]:
#search for movies released in 2013

#Set DEBUG to some_value >= 1 if you want to see debug messages
DEBUG = 1
log_file_name = "notebook.log"
LOG_FILE = open(log_file_name,"w")


#possibly better search
#search_url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2019-12-31&certificates=US%3AG,US%3APG,US%3APG-13,US%3AR,US%3ANC-17"
#search_url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2013-01-01,2013-12-31"

search_url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2010-01-01,2019-12-31&certificates=US%3AG,US%3APG,US%3APG-13,US%3AR,US%3ANC-17&countries=us"

IMDB_ROOT_URL = "http://www.imdb.com"

list_of_movie_objects = []

#flag that tells me if this is the first search I 
#sent to IMDB or not (1 means it's the first search, 0 means it's a subsequent search)
first_search = 1

NUMBER_OF_MOVIES = 4000
NUMBER_OF_MOVIES_PER_SEARCH = 50

#won't exactly get the number of movies I want but won't get
#less than the number of movies I want
if (NUMBER_OF_MOVIES % NUMBER_OF_MOVIES_PER_SEARCH) == 0:
    number_of_searches = NUMBER_OF_MOVIES // NUMBER_OF_MOVIES_PER_SEARCH
else:
    number_of_searches = (NUMBER_OF_MOVIES // NUMBER_OF_MOVIES_PER_SEARCH) + 1


#FOR DEBUG ONLY
#hard coding number_of_searches
#number_of_searches = 3


    

In [8]:
#search IMDB for movies
#find direct link to movies
#create a movie object with the title and the link to the movie
#write it to a pickle file
#find link to next search

for i in range(number_of_searches):
    
    my_print("\nwaiting before I scrape...", DEBUG, LOG_FILE)
    my_wait(10,35)
    my_print("done waiting", DEBUG, LOG_FILE)
    
    #DANGER: INITIATE SEARCH REQUEST TO IMDB
    web_response = requests.get(search_url)
    
    if (web_response.status_code == 200):
        my_print("web request was good", DEBUG, LOG_FILE)
    else:
        #always log the failure, even when DEBUG is off
        my_print("error: web request failed", 1, LOG_FILE)
        break
    
    #contains IMDB movie search results
    web_response_text = web_response.text
    my_print("got web response text", DEBUG, LOG_FILE)
    
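    #convert web response to a soup object
    #naming the parser avoids the bs4 warning and keeps results the same across machines
    web_response_soup = BeautifulSoup(web_response_text, "html.parser")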
    
    my_print("got web response soup", DEBUG, LOG_FILE)
    
    #the <div> that has the data I want is this one
    listOfListerItemContentDIV = web_response_soup.find_all("div",class_="lister-item-content")
    
    my_print("got divs for movies", DEBUG, LOG_FILE)
    
    if first_search == 1:
        PICKLE_FILE = open("Movie_objects.pkl","wb")
        my_print("opening pickle file for writing", DEBUG, LOG_FILE)
    else:
        PICKLE_FILE = open("Movie_objects.pkl","ab")
        my_print("opening pickle file for appending", DEBUG, LOG_FILE)


    
    #########################################################
    #
    # search through the <div class="lister-item-content"> tags
    # looking for the information I want
    #
    ###########################################################
    for listerItemContentDIV in listOfListerItemContentDIV:
        
        #these accesses  rely on the knowledge of the structure
        #of the web page. hopefully it doesn't change a lot
        link_to_movie = listerItemContentDIV.h3.a.get("href")
        title_of_the_movie = listerItemContentDIV.h3.a.text
        link_to_movie = IMDB_ROOT_URL + link_to_movie
        
#        if first_search == 1:
#            PICKLE_FILE = open("Movie_objects.p","wb")
#            #first_search = 0
#        else:
#            PICKLE_FILE = open("Movie_objects.p","ab")
        
        #CREATING MOVIE OBJECT
        themovie = Movie(title_of_the_movie,link_to_movie)
        
        my_print(themovie, DEBUG, LOG_FILE)
        
        if themovie != None:
            #TODO: Check status of pickle.dump() call for error
            pickle.dump(themovie,PICKLE_FILE)
        else:
            my_print("error: themovie is None and should not be!", 1, LOG_FILE)
            break
        
        my_print("done writing pickle file", DEBUG, LOG_FILE)
        
    #we found the first 50 movie links
    #close the pickle
    PICKLE_FILE.close()
    #wait a little to give time to flush the file buffer
    my_wait(1,3)
    
    #grab link to the next 50 movies
    #The links are in the <div class="desc">
    listOfDescDIV = web_response_soup.find_all("div",class_="desc")
    
    #Big assumption. There will be two <div class="desc">
    #tags on the search web page, and that the "Next" link and the
    #"Previous" link will be in that <div> tag. So you can
    #always pick the first one
    DescDIV_a_tag = listOfDescDIV[0].find_all("a")
    
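    #reset on every page so a missing "Next" link is detected
    #instead of silently reusing the previous page's link
    link_to_next_page = None
    
    for a_tag in DescDIV_a_tag:
        temp_text = a_tag.text
        if re.match(r"Next",temp_text):
            link_to_next_page = IMDB_ROOT_URL + a_tag.get("href")
    
    if link_to_next_page is None:
        my_print("error: no Next link found, stopping the search loop", 1, LOG_FILE)
        break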
        
#    print("listOfDescDIV[0] -->\n{}\n".format(listOfDescDIV[0].__dict__))
    
#    if first_search == 1:
#        link_to_next_page = IMDB_ROOT_URL + listOfDescDIV[0].a.get("href")
#    else:
#        link_to_next_page = ""
#        listOfDescDIV_soup = BeautifulSoup(listOfDescDIV.text)
#        list_of_a_ref = listOfDescDIV_soup.find_all("a")
#        
#        for aref in list_of_a_ref:
#            if re.match(r"Next",aref.text):
#                link_to_next_page = IMDB_ROOT_URL + aref.get("href")
       
    
    #going to try this
    #get text for DIV
    #pass it to soup to get another soup object
    
    my_print("link to next search page is {}".format(link_to_next_page), DEBUG, LOG_FILE)

    
    #doing this for debug purposes --> making debugging easier
    
    #is this right????
    #search_url = link_to_next_page.get("href")
    search_url = link_to_next_page
    first_search = 0
    
    

In [None]:
#Check Code Uncomment to check pickle file
#also reference for sear
#Just going to grab all of the Movie objects out of the pickle file and see if they all got in
PICKLE_FILE = open("Movie_objects.pkl","rb")

#saving it here as a backup just in case I get locked out
#at least I'll have the data
EXPANDED_PICKLE_FILE = open("Movie_populated_objects.pkl","wb")
list_of_populated_movies = []
while 1:
    try:
        mymovie = pickle.load(PICKLE_FILE)
        mymovie.populate_movie(DEBUG=1,LOG_FILE=LOG_FILE,start_time=5,stop_time=17)
        list_of_populated_movies.append(mymovie)
        pickle.dump(mymovie,EXPANDED_PICKLE_FILE)
        #uncomment for debugging only

        my_print(mymovie, DEBUG, LOG_FILE)
        
    except EOFError:
        break

PICKLE_FILE.close()
EXPANDED_PICKLE_FILE.close()
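
The load-until-`EOFError` pattern used above can be wrapped in a small generator. This is only a sketch: `iter_pickled` is a hypothetical helper name, the in-memory buffer stands in for `Movie_objects.pkl`, and plain dicts stand in for `Movie` objects.

```python
import io
import pickle


def iter_pickled(file_handle):
    """Yield objects from a binary file that holds back-to-back pickles."""
    while True:
        try:
            yield pickle.load(file_handle)
        except EOFError:
            # pickle.load raises EOFError once the stream is exhausted
            return


# In-memory stand-in for Movie_objects.pkl, holding dicts instead of Movie objects
buffer = io.BytesIO()
for record in ({"title": "A"}, {"title": "B"}, {"title": "C"}):
    pickle.dump(record, buffer)
buffer.seek(0)

loaded = list(iter_pickled(buffer))
print(len(loaded))
```

With real files, opening them in a `with open(...) as handle:` block would also guarantee they are closed if `populate_movie` raises partway through.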

How many movies did I scrape?

In [None]:
print(len(list_of_populated_movies))


In [None]:
#creating a dictionary of the entries of each movie object
Movie.__dict__.keys()

In [None]:
column_names_in_df = list(list_of_populated_movies[0].__dict__.keys())
my_print(column_names_in_df,DEBUG)

In [None]:
column_values_in_df = list(list_of_populated_movies[0].__dict__.values())
my_print(column_values_in_df,DEBUG)

In [None]:

#TODO: Write a get method that takes a data member name as an argument so I can do a 2D comprehension (or at least a for loop) next time or find out how to do that without a get method
list_of_movie_data_lists = []

for movie_key in column_names_in_df:
    
    list_of_movie_data_values = []
    
    for movie_obj in list_of_populated_movies:
        list_of_movie_data_values.append( movie_obj.__dict__[movie_key] )
    
    list_of_movie_data_lists.append(list_of_movie_data_values)

#create dictionary that will be used to create pandas data frame

movie_dict = dict(zip(column_names_in_df,list_of_movie_data_lists))



In [None]:
movie_dict.keys()

In [None]:
movie_df = pd.DataFrame(movie_dict)
movie_df.head(10)
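
Regarding the TODO above: one way to build the frame without a get method is to pass a list of per-object attribute dictionaries straight to `pd.DataFrame`. A minimal sketch, where `ToyMovie` is a hypothetical stand-in for `mymovie.Movie` with only three attributes:

```python
import pandas as pd


class ToyMovie:
    """Hypothetical stand-in for mymovie.Movie with only three attributes."""

    def __init__(self, title, runtime, rating):
        self.title = title
        self.runtime = runtime
        self.rating = rating


toy_movies = [ToyMovie("A", 95, "PG"), ToyMovie("B", 121, "R")]

# vars(obj) returns obj.__dict__, so each movie becomes one row
toy_df = pd.DataFrame([vars(movie) for movie in toy_movies])
print(toy_df)
```

This assumes every object stores its data as plain instance attributes, which is what the `__dict__` accesses above already rely on.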

# Challenge 1
Plot domestic gross over time


I interpret this question to mean to plot the domestic gross for movies in the data frame in the order in which movies were released


Create a datetime column in the data frame:

In [None]:
movie_df["releasedDateTime"] = pd.to_datetime( movie_df["releaseDate"] )
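
If some scraped release dates are empty or malformed, `pd.to_datetime` raises by default. A hedged sketch of a more tolerant conversion follows; the date strings and the `"%d %B %Y"` format are assumptions based on the "25 May 1977" style seen on the movie page, not on the actual contents of `releaseDate`.

```python
import pandas as pd

# Made-up values; the empty string mimics a movie with no scraped release date
raw_dates = pd.Series(["25 May 1977", "22 November 2013", ""])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%d %B %Y", errors="coerce")

print(parsed.isna().sum(), "dates could not be parsed")
```

Rows with `NaT` can then be inspected or dropped before sorting and plotting.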

In [None]:
movie_df.head()

In [None]:
movie_df.sort_values(["releasedDateTime"],inplace=True)

In [None]:
movie_df.head()

In [None]:
movie_df.tail()

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation="vertical")
plt.plot(movie_df["releasedDateTime"],movie_df["domesticTotalGross"])
plt.title("Plot of Domestic Gross For Movies First Released in 2013")
plt.xlabel("Date")
plt.ylabel("Gross in Dollars")
plt.show()

# Challenge 2
Plot runtime vs. domestic total gross.

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation="vertical")
plt.scatter(movie_df["runtime"],movie_df["domesticTotalGross"],alpha=0.15,color="green")
plt.title("Plot of Domestic Gross vs Run Time in 2013")
plt.xlabel("Run Time in Minutes")
plt.ylabel("Gross in Dollars")
plt.show()

# Challenge 3
Group your data by Rating and find the average runtime and domestic total gross at each level of Rating.


In [None]:
movie_avgruntime_per_rating_df = movie_df.groupby(["rating"])["runtime"].mean().reset_index()

In [None]:
movie_avgruntime_per_rating_df.head()

In [None]:
movie_avgruntime_per_rating_df.rename(columns={"runtime":"avg_runtime"},inplace=True)

In [None]:
movie_avgruntime_per_rating_df

In [None]:
movie_totdomgross_per_rating_df = movie_df.groupby(["rating"])["domesticTotalGross"].sum().reset_index()

In [None]:
movie_totdomgross_per_rating_df

In [None]:
movie_totdomgross_per_rating_df.sort_values(["domesticTotalGross"], inplace=True)

In [None]:
movie_totdomgross_per_rating_df
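
The cells above compute the mean runtime and the summed gross per rating. If the challenge is read as asking for the average of both columns, a single `groupby(...).agg(...)` can produce them together. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up values purely for illustration
toy_df = pd.DataFrame({
    "rating": ["PG", "PG", "R", "R"],
    "runtime": [90, 110, 100, 140],
    "domesticTotalGross": [10.0, 30.0, 5.0, 15.0],
})

# Named aggregation: output column = (input column, function)
per_rating = (
    toy_df.groupby("rating")
    .agg(
        avg_runtime=("runtime", "mean"),
        avg_domestic_gross=("domesticTotalGross", "mean"),
    )
    .reset_index()
)
print(per_rating)
```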

# Challenge 4
Make one figure with (N=the number of MPAA ratings there are) subplots, and in each plot the release date vs the domestic total gross.



In [None]:
movie_df["rating"].value_counts()

What are the movies with no MPAA Rating?

In [None]:
pd.set_option('display.max_rows', 20)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.width', None)
#pd.set_option('display.max_colwidth', -1)
movie_df[ movie_df["rating"]=="" ].head(100)

In [None]:
plt.figure(figsize=(20,20))
plt.suptitle('domestic tot gross vs release date for movies with the same MPAA rating',fontsize = 16)


# One entry per subplot: (rating value, subplot title, scatter styling)
rating_panels = [
    ("", "MPAA Rating Unknown", {"alpha": 0.2}),
    ("R", "MPAA Rated R", {"alpha": 0.2, "color": "green"}),
    ("PG-13", "MPAA Rated PG-13", {"alpha": 0.2, "color": "purple"}),
    ("PG", "MPAA Rated PG", {"alpha": 0.2, "color": "red"}),
    ("NC-17", "MPAA Rated NC-17", {"color": "red", "marker": "+"}),
    ("PG-", "MPAA Rated PG- (most likely rated PG-13)", {"color": "black"}),
]

for plot_number, (rating, panel_title, style) in enumerate(rating_panels, start=1):
    rated_df = movie_df[movie_df["rating"] == rating]
    plt.subplot(3, 2, plot_number)  # (number of rows, number of columns, number of plot)
    plt.scatter(rated_df["releasedDateTime"], rated_df["domesticTotalGross"], **style)
    plt.title(panel_title)
    plt.xlabel("Release Date")  # the x-axis holds release dates, not run times
    plt.ylabel("Gross in Dollars")


# Challenge 5
What director in your dataset has the highest gross per movie?

Find the gross per director


In [None]:
movie_totgross_per_director_df = movie_df.groupby(["director"])["domesticTotalGross"].sum().reset_index()

In [None]:
movie_totgross_per_director_df.sort_values(["domesticTotalGross"],inplace=True,ascending=False)

In [None]:
movie_totgross_per_director_df.head()

[Francis Lawrence](https://en.wikipedia.org/wiki/Francis_Lawrence) is pretty prolific so the data here seems right.
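
Note that the ranking above sorts by the summed gross per director, which favors directors with many movies. The challenge asks for gross per movie, so the mean is probably the better statistic. A minimal sketch on made-up numbers showing how the two rankings can differ:

```python
import pandas as pd

# Made-up values: director A has more movies, director B has one big hit
toy_df = pd.DataFrame({
    "director": ["A", "A", "A", "B"],
    "domesticTotalGross": [100.0, 50.0, 30.0, 150.0],
})

per_director = (
    toy_df.groupby("director")["domesticTotalGross"]
    .agg(total_gross="sum", movie_count="count", gross_per_movie="mean")
    .sort_values("gross_per_movie", ascending=False)
    .reset_index()
)
print(per_director)
```

Here A leads on total gross (180 vs 150) while B leads on gross per movie (150 vs 60). Keeping `movie_count` in the table makes it easy to spot directors whose average rests on a single film.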

# Challenge 6
Bin your dataset into months and make a bar graph of the mean domestic total gross by month. Error bars will represent the standard error of the mean.

Title of graph should include: Mean Domestic Total Gross by Month in 2013

Topic for consideration: what is the correct formula for the standard error of the mean? Examine the error bars and see if they are “reasonable.”

Create a new column that holds the name of the month of the release date for  the movie in the same row as the release date

In [None]:
import seaborn as sns

In [None]:
movie_df["releasedMonth"] = movie_df["releasedDateTime"].dt.month_name()
movie_df["releasedMonthNumber"] = pd.DatetimeIndex(movie_df["releasedDateTime"]).month

In [None]:
movie_avggross_per_month = movie_df.groupby(["releasedMonthNumber","releasedMonth"])["domesticTotalGross"].mean().reset_index()

In [None]:
movie_avggross_per_month.rename(columns={"domesticTotalGross":"domesticMeanGross"},inplace=True)

In [None]:
movie_avggross_per_month.sort_values(["releasedMonthNumber"],inplace=True)

In [None]:
movie_avggross_per_month

In [None]:



months_of_the_year = ["January","February","March","April","May","June","July","August","September","October","November","December"]
list_of_domestic_gross = []

for month_name in months_of_the_year:
    list_of_domestic_gross.append( movie_avggross_per_month[movie_avggross_per_month["releasedMonth"]==month_name]["domesticMeanGross"] )

In [None]:
#len(list_of_domestic_gross)

In [None]:
plt.figure(figsize=(10,6))
plt.plot(months_of_the_year,list_of_domestic_gross)
plt.xticks(rotation=60)
plt.xlabel("Month of the Year")
plt.ylabel("AVG Domestic Gross")
plt.title("AVG Domestic Gross for all movies released in month on X-axis in 2013")
plt.show();

In [None]:
plt.figure(figsize=(20,20))
#a box plot needs the per-movie values; the table of monthly means has only one value per month
sns.boxplot(y=movie_df["domesticTotalGross"],x=movie_df["releasedMonthNumber"]);
#movie_avggross_per_month.boxplot(figsize=(20,10),by="releasedMonthNumber")
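
The challenge asks for a bar graph with standard-error bars, which the plots above do not show yet. A minimal sketch on made-up numbers follows, using the usual formula SEM = s / sqrt(n), where s is the sample standard deviation (pandas uses `ddof=1` by default) and n is the number of movies in the month. The column names mirror the ones in `movie_df`, but the values are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values purely for illustration
toy_df = pd.DataFrame({
    "releasedMonthNumber": [1, 1, 1, 2, 2],
    "domesticTotalGross": [10.0, 20.0, 30.0, 40.0, 60.0],
})

monthly = (
    toy_df.groupby("releasedMonthNumber")["domesticTotalGross"]
    .agg(mean_gross="mean", std_gross="std", n_movies="count")
    .reset_index()
)

# SEM = sample standard deviation (ddof=1) / sqrt(number of movies)
monthly["sem_gross"] = monthly["std_gross"] / monthly["n_movies"] ** 0.5

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(monthly["releasedMonthNumber"], monthly["mean_gross"],
       yerr=monthly["sem_gross"], capsize=4)
ax.set_xlabel("Month of the Year")
ax.set_ylabel("Mean Domestic Total Gross")
ax.set_title("Mean Domestic Total Gross by Month in 2013")
```

A month with a single movie has an undefined standard deviation, so its error bar comes out as NaN and is simply not drawn. pandas also offers a `sem()` aggregation that should give the same numbers.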

Next, I'm going to try to pull up website data for movies by name.

From inspecting the search result HTML, all links to movies are found under this type of HTML element:

```html
<div class="lister-item-content">
    <h3 class="lister-item-header">
    <!-- Stuff -->
    <a href="LINK_TO_MOVIE">NAME_OF_MOVIE</a>
    <!-- Stuff -->
    </h3>
</div>
      

```

So I should grab all the `<div>` tags with that class and then get the link within them
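
A minimal sketch of that step on a hand-written copy of the structure above. The `href` values and titles are placeholders, not real IMDB entries, and `IMDB_ROOT_URL` mirrors the constant used in the scraping loop.

```python
from bs4 import BeautifulSoup

IMDB_ROOT_URL = "http://www.imdb.com"

# Hand-written sample; the hrefs and titles are placeholders
sample_html = """
<div class="lister-item-content">
    <h3 class="lister-item-header">
        <span>1.</span>
        <a href="/title/PLACEHOLDER_1/">First Movie</a>
    </h3>
</div>
<div class="lister-item-content">
    <h3 class="lister-item-header">
        <a href="/title/PLACEHOLDER_2/">Second Movie</a>
    </h3>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

movie_links = []
for item in soup.find_all("div", class_="lister-item-content"):
    anchor = item.select_one("h3.lister-item-header a[href]")
    if anchor is None:
        continue  # skip entries whose layout differs from the expected one
    movie_links.append((anchor.get_text(strip=True), IMDB_ROOT_URL + anchor["href"]))

print(movie_links)
```

Checking for a missing anchor avoids the `AttributeError` that a chained access like `div.h3.a` would raise if the page layout changes.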


IMDB only shows the first 50 movies when you request search results.

So I will need the link to the next 50 movies.

That is located in the web page returned on the search request here:

```html

<div class="desc">
    <a href="LINK_TO_NEXT_PAGE_IN_SEARCH_RESULTS">Next<!--text--></a>
</div>

```
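
A minimal sketch of finding that link, returning `None` on the last page so the caller knows when to stop. `find_next_page_url` is a hypothetical helper name, and the `href` values with `start=` are made up for the example rather than taken from a real IMDB response.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

IMDB_ROOT_URL = "http://www.imdb.com"


def find_next_page_url(html, root_url=IMDB_ROOT_URL):
    """Return the absolute URL of the 'Next' link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    for desc_div in soup.find_all("div", class_="desc"):
        for a_tag in desc_div.find_all("a", href=True):
            if a_tag.get_text(strip=True).startswith("Next"):
                return urljoin(root_url, a_tag["href"])
    return None


# Made-up pages: one with a Next link, one with only a Previous link
page_with_next = '<div class="desc"><a href="/search/title/?start=51">Next</a></div>'
last_page = '<div class="desc"><a href="/search/title/?start=1">Previous</a></div>'

print(find_next_page_url(page_with_next))
print(find_next_page_url(last_page))
```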