# <span style='color:White'><span style='background :Blue' > ALL TIME HIGHEST-GROSSING MOVIES WORLDWIDE ANALYSIS  </span></span> 
---

## <span style='color:White'><span style='background :Red' > Data Collection  </span></span>

### <span style='color:White'><span style='background :Black' > Function: </span></span>  integer_verifier()
To retrieve the data, we first need to define the length of the movie list, which is limited to 1000 items. This auxiliary function verifies that the user enters a valid input (an integer between 1 and 1000 inclusive).

In [None]:
def integer_verifier(number, start, end):
    """
    Verify that input is an integer in specified range (of integers).
    
    Parameters
    ----------
    number: integer - Value to verify
    start:  integer - Lower limit of range
    end:    integer - Upper limit of range
    
    Result
    ----------
    Boolean: True if integer in range, otherwise False
    
    """
    #Verify that range is ok; return False (not '') so the result is always Boolean.
    if not isinstance(start, int) or not isinstance(end, int):
        print("Range must be between two integers.")
        return False
    elif start > end:
        print("Check that start is not a higher number than end.")
        return False
    #Verify that number is an integer in range.
    result = isinstance(number, int) and start <= number <= end
    return result
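
The cases below sketch how such a check is expected to behave at the boundaries. `in_range` is a hypothetical, compact stand-in for integer_verifier() written only for this illustration. Rejecting `bool` values is an added assumption, since Python treats `True` as an instance of `int`.

```python
def in_range(number, start, end):
    """Compact stand-in for integer_verifier(): True only for ints in [start, end]."""
    # bool is a subclass of int in Python, so exclude it explicitly
    # (assumption: True/False should not count as valid list lengths).
    if isinstance(number, bool) or not isinstance(number, int):
        return False
    return start <= number <= end

# Boundary values are accepted; out-of-range or non-integer values are rejected.
cases = [(1, True), (1000, True), (0, False), (1001, False),
         (2.5, False), ("10", False), (True, False)]
for value, expected in cases:
    assert in_range(value, 1, 1000) is expected
```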

### <span style='color:White'><span style='background :Black' > Function: </span></span>  movie_table()
As a next step we need to retrieve the list of highest grossing movies from Box Office Mojo. As previously mentioned, the source can give us a list of up to 1000 items. Using the integer_verifier() function we make sure that the input is correct. After that, depending on the input, we are able to retrieve the corresponding list from different links (as each page of the source contains 200 items).

**Note:** For this, we need to import read_html and concat from pandas and define the links. The latter are already included in the secret.py file.

In [None]:
from pandas import concat, read_html

In [None]:
from secret import *

In [None]:
def movie_table(movie_length):
    """
    Return table of highest grossing movies from Box Office Mojo.
    Length of table can go from 1 to 1000 items.
    
    Parameters
    ----------
    movie_length: integer - Length of table
    
    Result
    ----------
    DataFrame: Table of size movie_length obtained from Box Office Mojo
    
    """
    #This function supports movie_list() so input should be valid already.
    #Still, in case it is used separately, invalid input returns None.
    if not integer_verifier(movie_length, 1, 1000):
        print('movie_length must be an integer between 1 and 1000 inclusive.')
        return None
    #Each page of the source holds 200 items, so download only the pages needed.
    links = [link1, link2, link3, link4, link5]
    pages_needed = (movie_length + 199) // 200
    tables = [read_html(link, index_col=0, attrs={'class': 'a-bordered'})[0]
              for link in links[:pages_needed]]
    #DataFrame.append was removed in pandas 2.0; concat keeps the original index.
    table1 = concat(tables)
    #Limit table accordingly.
    list_of_movies = table1[0:movie_length]
    #There is one particular correction that must be made for a specific movie:
    list_of_movies = list_of_movies.replace(['Fantastic Four: Rise of the Silver Surfer'],'Fantastic_4')
    return list_of_movies
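
Since each page of the source holds 200 items, the number of pages to download for a given list length comes down to a ceiling division. The short sketch below works through a few lengths. `PAGE_SIZE` and `pages_needed` are illustrative names and are not part of the notebook's own code.

```python
PAGE_SIZE = 200  # items per source page, as stated in the text

def pages_needed(movie_length, page_size=PAGE_SIZE):
    """Number of pages to download for a list of `movie_length` items."""
    # Ceiling division without floats: 200 -> 1 page, 201 -> 2 pages.
    return -(-movie_length // page_size)

for length in (1, 200, 201, 400, 401, 1000):
    print(length, pages_needed(length))
# prints: 1 1, 200 1, 201 2, 400 2, 401 3, 1000 5 (one pair per line)
```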

### <span style='color:White'><span style='background :Black' > Function: </span></span>  movie_list()
Finally, to collect all the data we write one last function that incorporates the previous one. It uses the list of movies from movie_table() to go movie by movie, collecting data from the OMDb API and storing it in a DataFrame. Note that we only select the information we consider relevant, and that it still needs to be processed.

**Note:** For this, we need to import requests, json and re. Additionally, we use the API key already imported from the secret.py file.

In [None]:
import requests, json, re

In [None]:
def movie_list(movie_length):
    """
    Return list of highest grossing movies (worldwide).
    Length of list in the range [1;1000] dependent on user.
    Includes data retrieved from OMDb API.
    
    Parameters
    ----------
    movie_length: integer - Length of table
    
    Result
    ----------
    DataFrame: Table with movie data from Box Office Mojo and OMDb API
    
    """
    #Use integer_verifier() to check that the user has introduced adequate input.
    if not integer_verifier(movie_length, 1, 1000):
        print("Please indicate an integer between 1 and 1000.")
        return None
    #If input is correct, then movie_list is generated.
    list_of_movies = movie_table(movie_length)
    list_of_movies = list_of_movies.drop("Year", axis = 1)
    #Additional fields are created to store additional data from API.
    fields = ['Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'imdbRating','Poster']
    for field in fields:
        list_of_movies[field] = None
    #Retrieve data from API and store it.
    for i in range(len(list_of_movies)):
        #Get data from OMDb API; passing params lets requests URL-encode the title.
        params = {'apikey': apiKey, 't': list_of_movies.iloc[i, 0], 'plot': 'full'}
        api_data = json.loads(requests.get('https://www.omdbapi.com/', params=params).text)
        for field in fields:
            #Copy data into dataframe by position (None if the response lacks the field).
            list_of_movies.iloc[i, list_of_movies.columns.get_loc(field)] = api_data.get(field)
    #There is one particular correction that must be made for a specific movie:
    list_of_movies = list_of_movies.replace(['Fantastic_4'],'Fantastic Four: Rise of the Silver Surfer')
    return list_of_movies
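
Two details of the API step are easy to get wrong: titles containing spaces, colons or ampersands must be URL-encoded, and a lookup that fails returns a payload without the movie fields. The offline sketch below illustrates both. `build_query` and `extract_fields` are hypothetical helpers, the sample payloads are made up for illustration, and the `Response` flag is assumed to follow the OMDb convention of the strings "True" and "False".

```python
from urllib.parse import urlencode

OMDB_URL = "https://www.omdbapi.com/"
FIELDS = ["Rated", "Runtime", "Genre", "imdbRating"]  # subset of the fields used above

def build_query(title, api_key):
    """Build a title-lookup URL with the title safely URL-encoded."""
    params = {"apikey": api_key, "t": title, "plot": "full"}
    return f"{OMDB_URL}?{urlencode(params)}"

def extract_fields(api_data, fields=FIELDS):
    """Pick the requested fields, or None for each when the lookup failed."""
    found = api_data.get("Response") == "True"
    return {field: api_data.get(field) if found else None for field in fields}

# Made-up payloads, shaped like a successful and a failed lookup.
ok = {"Title": "Example", "Rated": "PG-13", "Runtime": "120 min",
      "Genre": "Action", "imdbRating": "7.0", "Response": "True"}
missing = {"Response": "False", "Error": "Movie not found!"}

print(build_query("Fast & Furious 6", "KEY"))
print(extract_fields(ok))
print(extract_fields(missing))
```

Filling missing fields with `None` keeps the loop running when a single title cannot be matched, at the cost of having to check for gaps afterwards.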

For the following part we are going to work with the full data, which is why we store it in a new variable: **top_movies**.

In [None]:
top_movies = movie_list(1000)
top_movies

If the user needs to save the data to an .xlsx file, just run the following lines. This can save processing time later if they want to work with already retrieved data.

**Note:** pandas is required to read the generated file back.

In [None]:
import pandas as pd

In [None]:
top_movies.to_excel('Top_Grossing_Films.xlsx',header=True)
#pd.read_excel('Top_Grossing_Films.xlsx', index_col=0)
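
One way to take advantage of the saved file is a small "load if present, otherwise fetch and save" helper, sketched below. `load_or_fetch` is a hypothetical function, not part of the notebook. The reader and writer are passed in as callables so the sketch itself needs only the standard library.

```python
import os

def load_or_fetch(path, fetch, read, write):
    """Return data cached at `path` if present; otherwise fetch it and cache it."""
    if os.path.exists(path):
        return read(path)   # fast path: reuse the saved file
    data = fetch()          # slow path: e.g. the full download
    write(data, path)
    return data

# Hypothetical usage with this notebook's objects (not run here):
# top_movies = load_or_fetch(
#     'Top_Grossing_Films.xlsx',
#     fetch=lambda: movie_list(1000),
#     read=lambda p: pd.read_excel(p, index_col=0),
#     write=lambda df, p: df.to_excel(p, header=True),
# )
```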