# Phase III: First ML Proof of Concept (5%)

### Team Names:
-
-
-

## Part 1
(3%) The implementation (using NumPy) of your first ML model as a function call to the cleaned data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import sklearn.model_selection

Possible ML Ideas:
- How financially successful will a movie be?
    - Candidate features: genre (limited to the first genre in the list), MPAA rating, review score (ratings), country
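
As a rough sketch of how the candidate features above could be turned into model inputs, the cell below keeps the first genre and one-hot encodes the categorical columns. The toy rows, the `First Genre` column name, and the choice of `pd.get_dummies` are assumptions for illustration, not the final pipeline.

```python
import pandas as pd

# Toy rows standing in for the real data. The column names mirror the movie
# CSV, but the values are made up for illustration.
toy = pd.DataFrame({
    "Genre": ["Animation, Adventure, Comedy", "Action, Adventure, Sci-Fi", "Horror"],
    "Rated": ["G", "PG-13", "R"],
    "Rating": [7.8, 8.4, 6.1],
})

# Keep only the first genre listed for each movie ("First Genre" is a
# hypothetical column name).
toy["First Genre"] = toy["Genre"].str.split(",").str[0].str.strip()

# One-hot encode the categorical features so a linear model can use them;
# the numeric Rating column passes through unchanged.
features = pd.get_dummies(
    toy[["First Genre", "Rated", "Rating"]],
    columns=["First Genre", "Rated"],
    dtype=int,
)
print(features.columns.tolist())
```

Each category becomes its own 0/1 column, so the width of the feature matrix grows with the number of distinct genres and ratings.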


In [141]:
dataframe = pd.read_csv("final_merged_movie_data.csv")
print(dataframe.columns)

Index(['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director',
       'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster',
       'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type',
       'DVD', 'BoxOffice', 'Production', 'Website', 'Response', 'Error',
       'totalSeasons', 'Rank', 'Worldwide', 'Domestic', '%', 'Foreign', '%.1',
       'Rating', 'Popularity', 'Keywords', 'Ratings Amount'],
      dtype='object')


In [142]:
dataframe.head()

Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,Plot,...,Rank,Worldwide,Domestic,%,Foreign,%.1,Rating,Popularity,Keywords,Ratings Amount
0,Toy Story 3,2010,G,18 Jun 2010,103 min,"Animation, Adventure, Comedy",Lee Unkrich,"John Lasseter, Andrew Stanton, Lee Unkrich","Tom Hanks, Tim Allen, Joan Cusack",The toys are mistakenly delivered to a day-car...,...,1,"$1,066,969,703","$415,004,880",38.9%,"$651,964,823",61.1%,7.799,30.31,"escape, hostage, college, villain, sequel, bud...",14838.0
1,Alice in Wonderland,2010,PG,05 Mar 2010,108 min,"Adventure, Family, Fantasy",Tim Burton,"Linda Woolverton, Lewis Carroll","Mia Wasikowska, Johnny Depp, Helena Bonham Carter",Nineteen-year-old Alice returns to the magical...,...,2,"$1,025,467,110","$334,191,110",32.6%,"$691,276,000",67.4%,6.638,26.627,"based on novel or book, queen, psychotic, fant...",14108.0
2,Harry Potter and the Deathly Hallows: Part 1,2010,PG-13,19 Nov 2010,146 min,"Adventure, Family, Fantasy",David Yates,"Steve Kloves, J.K. Rowling","Daniel Radcliffe, Emma Watson, Rupert Grint",Harry Potter is tasked with the dangerous and ...,...,3,"$960,283,305","$295,983,305",30.8%,"$664,300,000",69.2%,7.7,37.477,"witch, friendship, london, england, corruption...",19315.0
3,Inception,2010,PG-13,16 Jul 2010,148 min,"Action, Adventure, Sci-Fi",Christopher Nolan,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...",A thief who steals corporate secrets through t...,...,4,"$828,258,695","$292,576,195",35.3%,"$535,682,500",64.7%,8.369,46.201,"rescue, mission, dreams, airplane, paris, fran...",37094.0
4,Shrek Forever After,2010,PG,21 May 2010,93 min,"Animation, Adventure, Comedy",Mike Mitchell,"Josh Klausner, Darren Lemke, William Steig","Mike Myers, Cameron Diaz, Eddie Murphy",Rumpelstiltskin tricks a mid-life crisis burde...,...,5,"$752,600,867","$238,736,787",31.7%,"$513,864,080",68.3%,6.38,37.911,"witch, sequel, ogre",7440.0


In [143]:
def clean_box_office(df):
    # Work on a copy so the column assignments below do not write to a slice of the input
    df = df[df["Domestic"] != "-"].copy()
    df = df.dropna(subset=["Worldwide", "Domestic", "Foreign"])
    df["Worldwide"] = (
        df["Worldwide"]
        .astype(str)  
        .str.replace("$", "", regex=False)  
        .str.replace(",", "", regex=False)  
        .astype(int)
    )
    # Clean Domestic column
    df["Domestic"] = (
        df["Domestic"]
        .astype(str)  
        .str.replace("$", "", regex=False) 
        .str.replace(",", "", regex=False)  
    
    )
    # Clean Foreign column
    df["Foreign"] = (
        df["Foreign"]
        .astype(str)  
        .str.replace("$", "", regex=False)  
        .str.replace(",", "", regex=False)  
    )
    # Add columns in millions of dollars so the revenue values are easier to read and compare
    df["Worldwide_millions"] = pd.to_numeric(df["Worldwide"]) / 1000000
    df["Domestic_millions"] = pd.to_numeric(df["Domestic"]) / 1000000
    df["Foreign_millions"] = pd.to_numeric(df["Foreign"], errors="coerce") / 1000000

    df = df.dropna(subset = ["Runtime"])
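    # Raw string avoids an invalid-escape warning; expand=False returns a Series instead of a DataFrame
    df["Runtime"] = df["Runtime"].str.extract(r'(\d+)', expand=False).astype(int)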
    
    df["Release Month"] = pd.to_datetime(df["Released"], format='%d %b %Y').dt.month
    
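    # Keep only the first genre in each comma-separated list (the original .tolist() call left the column unchanged)
    df["Genre"] = df["Genre"].str.split(",").str[0].str.strip()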
    
   
    return df

In [144]:
df_clean = clean_box_office(dataframe)

In [145]:
# Run this cell only once: a second run raises a KeyError because the columns are already dropped
df_clean = df_clean.drop(columns=[
    "Type", "Poster", "DVD", "totalSeasons", "Error",
    "Response", "Website", "Rank", "Production", "Ratings",
    "imdbRating", "imdbVotes", "%", "%.1", "Popularity", "Ratings Amount",
    "Worldwide", "Domestic", "Foreign",
])

In [146]:
df_clean.head()
print(df_clean.columns)

Index(['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director',
       'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards',
       'Metascore', 'imdbID', 'BoxOffice', 'Rating', 'Keywords',
       'Worldwide_millions', 'Domestic_millions', 'Foreign_millions',
       'Release Month'],
      dtype='object')


In [2]:
from sklearn.linear_model import LinearRegression

def line_of_best_fit(X, y):
    """
    Returns the intercept and slope(s) of a line of best fit.
    
    Args:
        X (array): can be either 1-d or 2-d
        y (array): a 1-d array including all corresponding response values to X
        
    Returns:
        vector (list): coefficients for the line of best fit; the first term is the intercept, the remaining terms are the slopes (one per column of X)
    """
    # add_bias_column is not defined in this section; it is assumed to prepend a column of ones to X
    new_x = add_bias_column(X)
    # The bias column already models the intercept, so sklearn must not fit a second one
    model = LinearRegression(fit_intercept=False)
    model.fit(new_x, y)

    # coef_[0] is the intercept and coef_[1:] are the slopes
    vector = model.coef_.tolist()
    return vector
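
Since Part 1 asks for a NumPy implementation and `add_bias_column` does not appear in this section, here is a minimal sketch of both pieces. The body of `add_bias_column` is an assumption (a column of ones placed first, which matches the intercept-first ordering in the docstring), and `line_of_best_fit_numpy` is a hypothetical name for a least-squares fit via `np.linalg.lstsq` in place of `LinearRegression`.

```python
import numpy as np

def add_bias_column(X):
    """Assumed helper: prepend a column of ones to X (1-d or 2-d)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # treat a 1-d input as a single feature column
    return np.column_stack([np.ones(X.shape[0]), X])

def line_of_best_fit_numpy(X, y):
    """Least-squares coefficients [intercept, slope_1, ...] using NumPy only."""
    Xb = add_bias_column(X)
    # lstsq minimizes ||Xb @ coef - y||^2 and is more stable than inverting Xb.T @ Xb
    coef, *_ = np.linalg.lstsq(Xb, np.asarray(y, dtype=float), rcond=None)
    return coef

# Sanity check on points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(line_of_best_fit_numpy(x, y))  # approximately [1. 2.]
```

On well-conditioned data this should agree with the scikit-learn version up to floating-point error.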

In [3]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

def linreg_predict(Xnew, ynew, m):
    """
    Predicts responses for new data from fitted coefficients and measures the fit.

    Args:
        Xnew (array): either a 1-d or 2-d array
        ynew (array): 1-d array 
        m (array): 1-d array contains coefficients from the line_of_best_fit function

    Returns:
        dct (dictionary): contains four key-value pairs
            - ypreds: predicted values from applying m to Xnew
            - resids: the residuals, the differences between ynew and ypreds
            - mse: mean squared error
            - r2: coefficient of determination
    """
    x = add_bias_column(Xnew)
    ypreds = x.dot(m)
    resids = ynew - ypreds
    mse = mean_squared_error(ynew, ypreds)
    r2 = r2_score(ynew, ypreds)
    
    dct =  {
        'ypreds': ypreds,
        'resids': resids,
        'mse': mse,
        'r2': r2
    }
    return dct
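
For reference, the two fit measures returned above can also be computed directly with NumPy. This is a sketch under the usual definitions (MSE as the mean of squared residuals, R² as one minus the ratio of residual to total sum of squares); `regression_metrics` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Residuals, mean squared error, and R^2 computed with NumPy only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resids = y_true - y_pred
    mse = np.mean(resids ** 2)
    ss_res = np.sum(resids ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"resids": resids, "mse": mse, "r2": r2}

# Small hand-checkable example: only the last prediction is off, by 1
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])
out = regression_metrics(y_true, y_pred)
print(out["mse"], out["r2"])  # 0.25 and 0.8, up to floating-point rounding
```

Note that R² is undefined when every value in `y_true` is identical, because the total sum of squares is then zero.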

## Part 2
(2%) A discussion of the preliminary results:
   - This may include checking of assumptions, generated plots/tables, measures of fit, or other attributes of the analysis
   - It does not have to be fully correct, but as a proof of concept must demonstrate that the group is close to completing the analysis

Ethical discussion:

- There could be representational bias because the collected data may include only the most popular movies from major studios, while indie movies would be left out even if they were successful. This could skew the data towards more dominant groups and limit the visibility of more diverse films and filmmakers.
- Additionally, there could be allocative bias in our data, since most of the movies we have scraped seem to be in the action genre. This could in turn bias our ML model towards action movies and lead it to underestimate the success potential of other movie genres, such as horror.
- The last ethical concern that comes to mind is potential bias in our data cleaning process. For example, some movies have more than one genre, but we only use the first genre in the list. This can underrepresent some genres within our ML model.
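
One way to quantify the genre concerns above is to compare how often each genre appears as the first entry with how often it appears anywhere in the list. The sketch below uses made-up toy rows, not our scraped data, and the variable names are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; the real check would use df_clean["Genre"]
# before it is reduced to a single genre.
toy = pd.DataFrame({
    "Genre": ["Action, Adventure", "Action, Sci-Fi", "Horror", "Comedy, Horror", "Action"],
})

# Share of movies whose FIRST listed genre is each genre
first_genre = toy["Genre"].str.split(",").str[0].str.strip()
first_share = first_genre.value_counts(normalize=True)

# Share of movies that list each genre ANYWHERE
all_genres = toy["Genre"].str.split(",").explode().str.strip()
any_share = all_genres.value_counts() / len(toy)

# A genre with a much higher "anywhere" share than "first" share is being
# underrepresented by the first-genre rule.
comparison = pd.DataFrame({"first": first_share, "anywhere": any_share}).fillna(0)
print(comparison)
```

In this toy data, horror is listed by 40% of the movies but is the first genre for only 20%, which is the kind of gap the third concern describes.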