# Phase II: Data Curation, Exploratory Analysis and Plotting (5\%)

### Team Members:
- Logan Lary
- Mark Tran
- Sabrina Valerjev

## Part 1: 
(1\%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.

## Project Motivation

## Summary of the Data Processing Pipeline

For each year from 2010 through 2012, the notebook scrapes that year's worldwide box office table from Box Office Mojo, looks up every listed title in the OMDb API, and saves the responses to `MovieData<year>.json`. The two sources are then inner-joined on `Title`, tagged with the box office year, and concatenated into a single DataFrame. Cleaning drops rows that have no domestic gross, strips `$` and `,` from the gross columns, and adds `Worldwide_millions`, `Domestic_millions`, and `Foreign_millions` columns.

## Part 2: 
(2\%) Obtains, cleans, and merges all data sources involved in the project.

In [3]:
# adding relevant imports
import json
import pathlib
from dataclasses import dataclass

import pandas as pd
import requests
from requests_html import HTML

In [4]:
# Source 1: Box Office Mojo
@dataclass
class ScrapeBoxOffice:
    base_endpoint:str = "https://www.boxofficemojo.com/year/world/"
    year:int = None
    save_raw:bool = False
    save:bool = False
    output_dir: str = "."
    table_selector: str = '.imdb-scroll-table'
    table_data = []
    table_header_names = []
    df = pd.DataFrame()
    
    @property
    def name(self):
        return self.year if isinstance(self.year, int) else 'world'
    
    def get_endpoint(self):
        endpoint = self.base_endpoint
        if isinstance(self.year, int):
            endpoint = f"{endpoint}{self.year}/"
        return endpoint
    
    def get_output_dir(self):
        return pathlib.Path(self.output_dir)
    
    def extract_html_str(self, endpoint=None):
        '''Fetches the page and returns (html_text, status); html_text is None unless the status is 200.'''
        url = endpoint if endpoint is not None else self.get_endpoint()
        # A timeout keeps the scraper from hanging on a stalled connection
        r = requests.get(url, timeout=30)
        html_text = None
        status = r.status_code
        if status == 200:
            html_text = r.text
            if self.save_raw:
                raw_output_dir = self.get_output_dir() / 'html'
                raw_output_dir.mkdir(exist_ok=True, parents=True)
                output_fname = raw_output_dir / f"{self.name}.html"
                # Write as UTF-8 so non-ASCII titles do not depend on the platform default encoding
                with open(output_fname, 'w', encoding='utf-8') as f:
                    f.write(html_text)
        return html_text, status
    
    def parse_html(self, html_str=''):
        '''Parses the first table matching table_selector and returns (rows, header_names).'''
        r_html = HTML(html=html_str)
        r_table = r_html.find(self.table_selector)
        if len(r_table) == 0:
            # Return an empty pair so callers can always unpack two values
            return [], []
        parsed_table = r_table[0]
        rows = parsed_table.find("tr")
        # The first row holds the column names
        header_names = [x.text for x in rows[0].find('th')]
        # Each remaining row becomes a list of cell texts
        table_data = [[col.text for col in row.find("td")] for row in rows[1:]]
        self.table_data = table_data
        self.table_header_names = header_names
        return self.table_data, self.table_header_names
    
    def to_df(self, data=None, columns=None):
        # None defaults avoid sharing one mutable list between calls
        return pd.DataFrame(data, columns=columns)
    
    def run(self, save=False):
        save = self.save if save is False else save
        endpoint = self.get_endpoint()
        html_str, status = self.extract_html_str(endpoint=endpoint)
        # range(200, 300) covers every 2xx status code
        if status not in range(200, 300):
            raise Exception(f"Extraction failed, endpoint status {status} at {endpoint}")
        data, headers = self.parse_html(html_str if html_str is not None else '')
        df = self.to_df(data=data, columns=headers)
        self.df = df
        if save:
            # Create the output directory first so to_csv does not fail when save_raw is off
            output_dir = self.get_output_dir()
            output_dir.mkdir(exist_ok=True, parents=True)
            filepath = output_dir / f'{self.name}.csv'
            df.to_csv(filepath, index=False)
        return self.df
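
The row-extraction step in `parse_html` can be checked offline, without a network request. The sketch below is only an illustration: it swaps in the standard library's `html.parser` instead of `requests_html`, and `TableRows` and `SAMPLE_HTML` are hypothetical names with made-up markup shaped like a box office table, not the real page source.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical snippet for illustration only, not real Box Office Mojo markup
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Release Group</th><th>Worldwide</th></tr>
  <tr><td>1</td><td>Toy Story 3</td><td>$1,066,969,703</td></tr>
  <tr><td>2</td><td>Alice in Wonderland</td><td>$1,025,467,110</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
# The first row is the header, the rest are data rows
header, *body = parser.rows
print(header)
print(body)
```

As in `parse_html`, the first row supplies the column names and each later row becomes a list of cell strings, which is the shape `pd.DataFrame(data, columns=...)` expects.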

In [17]:
# Source 2: OMDb
API_KEY = "f3eb77a3"
URL = "http://www.omdbapi.com/?t="

def get_movie_data(url, movie):
    ''' Takes in the name of a movie and returns associated data on the movie.'''
    movie_link = process_movie_name(movie)
    complete_url = url + movie_link + "&apikey=" + API_KEY
    response = requests.get(complete_url) 
    return response.json()

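from urllib.parse import quote_plus

def process_movie_name(movie):
    ''' Takes in the name of a movie and URL-encodes it so that it can be used in the API call.'''
    # quote_plus joins words with '+' and escapes characters such as '&' and ':'
    # that would otherwise corrupt the query string
    return quote_plus(' '.join(movie.split()))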

def get_year_movie_data(movie_titles, url, year):
    '''Fetches OMDb data for every title in movie_titles and saves the responses to MovieData<year>.json.'''
    data_list = [get_movie_data(url, movie) for movie in movie_titles]
    with open(f"MovieData{year}.json", 'w') as json_file:
        json.dump(data_list, json_file, indent=4)

# Scrape each year's box office table, fetch OMDb data for its titles, and merge the two on Title
dataframe = pd.DataFrame()
for year in range(2010, 2013):
    scraper = ScrapeBoxOffice(year=year, save=True, save_raw=True, output_dir='data')
    df_box = scraper.run()
    movies_year = df_box["Release Group"].tolist()
    get_year_movie_data(movies_year, URL, str(year))
    file_path_movie = "MovieData" + str(year) + ".json"
    df_movie_data = pd.read_json(file_path_movie)
    box_df_bet = df_box.rename(columns={"Release Group": 'Title'})
    # The inner join keeps only titles that OMDb returned under exactly the same name
    master_df = pd.merge(df_movie_data, box_df_bet, on="Title", how="inner")
    # Overwrite OMDb's Year with the box office year being scraped
    master_df["Year"] = year
    dataframe = pd.concat([dataframe, master_df])
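
Titles that contain characters such as `:` or `&` need to be escaped before they are placed in the query string. A minimal sketch of one way to build the request URL with the standard library follows. `build_omdb_url` and `OMDB_ENDPOINT` are hypothetical names, `"YOUR_KEY"` is a placeholder, and passing OMDb's `y` (year of release) parameter is an assumption about what the lookup should do rather than something the notebook currently does.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

OMDB_ENDPOINT = "http://www.omdbapi.com/"


def build_omdb_url(title, api_key, year=None):
    """Hypothetical helper: returns an OMDb title-lookup URL with an encoded query string."""
    params = {"t": title, "apikey": api_key}
    if year is not None:
        # Assumption: narrowing by release year reduces matches to same-named films
        params["y"] = str(year)
    # urlencode escapes each value, so ':' and '&' inside a title stay part of the title
    return f"{OMDB_ENDPOINT}?{urlencode(params)}"


url = build_omdb_url("Harry Potter and the Deathly Hallows: Part 1", "YOUR_KEY", year=2010)
print(url)

# Round-trip check: decoding the query string recovers the original title
decoded = parse_qs(urlsplit(url).query)
print(decoded["t"][0])
```

Because the merge joins on `Title` alone, a lookup that also passes the year may reduce the number of box office rows paired with a different film of the same name.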

In [18]:
dataframe.tail()

| | Title | Year | Rated | Released | Runtime | Genre | Director | Writer | Actors | Plot | ... | Website | Response | Error | totalSeasons | Rank | Worldwide | Domestic | % | Foreign | %.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 186 | Fall in Love Like a Star | 2012 | PG-13 | 04 Dec 2015 | 98 min | Romance | Tony Chan | Tony Chan, Yiliang Xu, Liying Lin | Mi Yang, Yifeng Li, Shu Chen | A superstar musician falls in love with his ma... | ... | | True | | | 194 | \$22,640,355 | - | - | \$22,640,355 | 100% |
| 187 | Upside Down | 2012 | PG-13 | 01 May 2013 | 109 min | Drama, Romance, Sci-Fi | Juan Solanas | Juan Solanas, Santiago Amigorena, Pierre Magny | Jim Sturgess, Kirsten Dunst, Timothy Spall | Adam and Eden fell in love as teens despite th... | ... | | True | | | 195 | \$22,187,813 | \$105,095 | 0.5% | \$22,082,718 | 99.5% |
| 188 | Unbowed | 2012 | Not Rated | 18 Jan 2012 | 100 min | Crime, Drama | Jeong Ji-yeong | Hyeon-geun Han, Jeong Ji-yeong | Ahn Sung-ki, Park Won-sang, Na Young-hee | Kim Kyung-ho is fired by his university after ... | ... | | True | | | 196 | \$22,132,903 | - | - | \$22,132,903 | 100% |
| 189 | Confession of Murder | 2012 | Not Rated | 08 Nov 2012 | 119 min | Action, Crime, Mystery | Jung Byung-gil | Jung Byung-gil, Won-Chan Hong, Dong-kyu Kim | Jeong Jae-yeong, Park Shi-hoo, Jung Hae-kyun | Lee Du-Seok publishes an autobiography describ... | ... | | True | | | 197 | \$21,701,525 | - | - | \$21,701,525 | 100% |
| 190 | Beasts of the Southern Wild | 2012 | PG-13 | 05 Jul 2012 | 93 min | Adventure, Drama, Fantasy | Benh Zeitlin | Lucy Alibar, Benh Zeitlin | Quvenzhané Wallis, Dwight Henry, Levy Easterly | Faced with both her hot-tempered father's fadi... | ... | | True | | | 200 | \$21,107,746 | \$12,795,746 | 60.6% | \$8,312,000 | 39.4% |
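
In the output above, "Fall in Love Like a Star" carries the box office year 2012 but an OMDb release date in 2015, which suggests the title-only join paired it with a different film. The sketch below shows one possible way to flag such rows. It uses plain Python on values copied from the output, and the one-year tolerance (`MAX_GAP_YEARS`) and the helper name `release_year_gap` are assumptions chosen for illustration.

```python
from datetime import datetime

# Rows copied from the output above: (title, box office year, OMDb "Released" value)
rows = [
    ("Fall in Love Like a Star", 2012, "04 Dec 2015"),
    ("Upside Down", 2012, "01 May 2013"),
    ("Unbowed", 2012, "18 Jan 2012"),
    ("Confession of Murder", 2012, "08 Nov 2012"),
    ("Beasts of the Southern Wild", 2012, "05 Jul 2012"),
]


def release_year_gap(box_office_year, released):
    """Returns how many years the OMDb release date is from the box office year."""
    released_year = datetime.strptime(released, "%d %b %Y").year
    return abs(released_year - box_office_year)


# Assumption: a gap of more than one year suggests the title matched a different film
MAX_GAP_YEARS = 1
suspect = [title for title, year, released in rows
           if release_year_gap(year, released) > MAX_GAP_YEARS]
print(suspect)
```

A one-year gap, as for "Upside Down", is tolerated here because a film can open in different countries in different calendar years.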


In [32]:
# cleaning the data
def clean_box_office(df):
    '''Drops rows without a Domestic gross, converts the gross columns to numbers, and adds *_millions columns.'''
    # Filter first, then copy so the assignments below do not trigger SettingWithCopyWarning
    df = df[df["Domestic"] != "-"]
    df = df.dropna(subset=["Worldwide", "Domestic", "Foreign"]).copy()
    for col in ["Worldwide", "Domestic", "Foreign"]:
        # Strip "$" and "," so "$1,234" becomes 1234; a leftover "-" placeholder becomes NaN
        df[col] = pd.to_numeric(
            df[col].astype(str)
            .str.replace("$", "", regex=False)
            .str.replace(",", "", regex=False),
            errors="coerce",
        )
    for col in ["Worldwide", "Domestic", "Foreign"]:
        # Values in millions are easier to read in tables and on plot axes
        df[f"{col}_millions"] = df[col] / 1000000
    return df

#dataframe = dataframe.drop(["Type", "Poster", "DVD", "totalSeasons", "Error", "Response"], axis=1)
#dataframe = dataframe.drop(["Website", "Rank"], axis=1)
dataframe = dataframe.drop(["Production"], axis=1)
cleaned_df = clean_box_office(dataframe)

In [33]:
cleaned_df.head()

| | Title | Year | Rated | Released | Runtime | Genre | Director | Writer | Actors | Plot | ... | imdbID | BoxOffice | Worldwide | Domestic | % | Foreign | %.1 | Worldwide_millions | Domestic_millions | Foreign_millions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Toy Story 3 | 2010 | G | 18 Jun 2010 | 103 min | Animation, Adventure, Comedy | Lee Unkrich | John Lasseter, Andrew Stanton, Lee Unkrich | Tom Hanks, Tim Allen, Joan Cusack | The toys are mistakenly delivered to a day-car... | ... | tt0435761 | \$415,004,880 | 1066969703 | 415004880 | 38.9% | 651964823 | 61.1% | 1066.969703 | 415.00488 | 651.964823 |
| 1 | Alice in Wonderland | 2010 | PG | 05 Mar 2010 | 108 min | Adventure, Family, Fantasy | Tim Burton | Linda Woolverton, Lewis Carroll | Mia Wasikowska, Johnny Depp, Helena Bonham Carter | Nineteen-year-old Alice returns to the magical... | ... | tt1014759 | \$334,191,110 | 1025467110 | 334191110 | 32.6% | 691276000 | 67.4% | 1025.46711 | 334.19111 | 691.276 |
| 2 | Harry Potter and the Deathly Hallows: Part 1 | 2010 | PG-13 | 19 Nov 2010 | 146 min | Adventure, Family, Fantasy | David Yates | Steve Kloves, J.K. Rowling | Daniel Radcliffe, Emma Watson, Rupert Grint | Harry Potter is tasked with the dangerous and ... | ... | tt0926084 | \$296,374,621 | 960283305 | 295983305 | 30.8% | 664300000 | 69.2% | 960.283305 | 295.983305 | 664.3 |
| 3 | Inception | 2010 | PG-13 | 16 Jul 2010 | 148 min | Action, Adventure, Sci-Fi | Christopher Nolan | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio... | A thief who steals corporate secrets through t... | ... | tt1375666 | \$292,587,330 | 828258695 | 292576195 | 35.3% | 535682500 | 64.7% | 828.258695 | 292.576195 | 535.6825 |
| 4 | Shrek Forever After | 2010 | PG | 21 May 2010 | 93 min | Animation, Adventure, Comedy | Mike Mitchell | Josh Klausner, Darren Lemke, William Steig | Mike Myers, Cameron Diaz, Eddie Murphy | Rumpelstiltskin tricks a mid-life crisis burde... | ... | tt0892791 | \$238,736,787 | 752600867 | 238736787 | 31.7% | 513864080 | 68.3% | 752.600867 | 238.736787 | 513.86408 |
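
One quick sanity check on the currency cleaning is that Domestic plus Foreign should equal Worldwide for every row that has both parts. The sketch below is written in plain Python rather than pandas to stay self-contained. `parse_money` is a hypothetical helper, and the rows are raw values copied from the merged `dataframe.tail()` output.

```python
def parse_money(value):
    """Converts a string such as "$1,234" to the int 1234; returns None for "-"."""
    text = str(value).replace("$", "").replace(",", "").strip()
    return int(text) if text.isdigit() else None


# (title, Worldwide, Domestic, Foreign) as they appear before cleaning
raw_rows = [
    ("Upside Down", "$22,187,813", "$105,095", "$22,082,718"),
    ("Unbowed", "$22,132,903", "-", "$22,132,903"),
    ("Beasts of the Southern Wild", "$21,107,746", "$12,795,746", "$8,312,000"),
]

checked = {}
for title, worldwide, domestic, foreign in raw_rows:
    w, d, f = parse_money(worldwide), parse_money(domestic), parse_money(foreign)
    if d is None or f is None:
        # Mirrors clean_box_office, which drops rows without a Domestic gross
        continue
    checked[title] = (d + f == w)
print(checked)
```

Rows that fail the check would point to a parsing problem or to inconsistent source figures, and either is worth inspecting before plotting.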


## Part 3:
(2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded based on how much information they can effectively communicate to readers. Please make sure your visualizations are sufficiently distinct from each other.