## Length of the report {-}
The length of the report must be no more than 15 pages, when printed as PDF. However, there is no requirement on the minimum number of pages.

You may put additional material in an Appendix and refer to it in the main report to support your arguments. However, your appendix is unlikely to be checked while grading, unless the grader deems it necessary. The appendix, references, and information about GitHub and individual contribution will not be included in the page count, and there is no limit on the length of the appendix.

**Delete this section from the report, when using this template.** 

## Code should be put separately in the code template {-}
Your report should be written in a research-paper-like style. If something can only be explained by showing the code, you may include it; otherwise, do not put code in the report. We will check your code in the code template.

**Delete this section from the report, when using this template.** 

## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

We are interested in understanding which factors contribute to the gross revenue of a movie, especially a high-grossing movie. By looking at a number of different variables, including budget, genre, rating, and continent, we hope to better understand the relationships between these factors and the gross revenue of a movie.

This is an inference problem since we are trying to understand the relationships between different predictors and the gross revenue rather than actually predicting the gross revenue of a movie.

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

Our data, "Movies Industries", was published by user Daniel Grijalva on Kaggle (https://www.kaggle.com/datasets/danielgrijalvas/movies).

The dataset provides information on 6820 movies (220 movies per year, 1986-2016). The data was scraped from IMDb. While the dataset contained 16 variables, we only used the following to develop our model:

- `budget`: the budget of a movie. Some movies don't have this, so it appears as 0.
- `country`: country of origin.
- `genre`: main genre of the movie.
- `gross`: revenue of the movie.
- `rating`: rating of the movie (R, PG, etc.).
- `runtime`: duration of the movie.
- `score`: IMDb user rating.
- `votes`: number of user votes.
- `year`: year of release.

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

We have three main stakeholders:
1. **Filmmakers, such as directors and producers, and investors in the film industry:**
By understanding key relationships between revenue and variables such as actors, year, etc., filmmakers can make strategic decisions about their movies to maximize revenue. Investors in the film industry and studio executives will be interested in our analysis for the same reasons. If investors understand the key relationships that we analyze, such as an actor who is consistently in top-grossing movies, they can better assess whether a movie is worth investing in.
2. **Everyday moviegoers (like us!):**
If a movie has a high gross revenue, it’s normally safe to assume the movie is good and worthwhile. Understanding the relationships found in our model can help everyday moviegoers enhance their experience by watching high-grossing movies.
3. **People looking to have, or approached about having, their work adapted into a film:**
Similar to filmmakers, someone whose work is going to be turned into a movie can use the relationships that we find in our model to determine whether the venture is worthwhile or whether they need to make some adjustments. For example, the true crime genre has boomed in the last few years; someone with a true crime story, whether a victim who wants to tell their story or the author of a fictional one, can use this relationship to their advantage and make their movie more popular and more profitable now than it would have been in the past. Understanding these relationships can show them what will make their movie successful.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

```python
# Data handling
import pandas as pd
import numpy as np

# Modeling
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Utilities
import itertools
import time
```

Before the data could be used for model development, new predictors were created from existing predictors. Using the `country` of each movie, we created a `continent` variable to generalize the region of origin (since country is very specific). Similarly, we used the `year` of each movie to create a `decade` variable, since year is also very specific and may not be helpful in deciphering trends.
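
The two derived variables could be built along the following lines. This is a minimal sketch rather than the project's actual code: the toy rows and the partial `country_to_continent` mapping are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real dataset)
movies = pd.DataFrame({
    "country": ["USA", "UK", "Japan", "France", "Australia"],
    "year": [1986, 1994, 2001, 2010, 2016],
})

# Hypothetical, partial mapping; a real one must cover every country in the data
country_to_continent = {
    "USA": "North America",
    "UK": "Europe",
    "France": "Europe",
    "Japan": "Asia",
    "Australia": "Oceania",
}

# Countries missing from the mapping become NaN, so they are easy to spot
movies["continent"] = movies["country"].map(country_to_continent)

# Floor each year to the start of its decade, e.g. 1994 -> 1990
movies["decade"] = (movies["year"] // 10) * 10
```

Because `decade` is stored as a number here, it would need to be wrapped as `C(decade)` in a statsmodels formula to be treated as a categorical predictor.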

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere)? If yes, then how did you build upon those solutions, and what did you do differently? Is your model better than those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a linear model. Since we were mainly interested in inference, we focused on optimizing $R^2$, since it indicates how well a model fits the data. We also compared the RSE and RMSE of our model to detect any underfitting or overfitting. We chose RMSE because we wanted to penalize larger errors for the sake of our stakeholders.
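
For reference, the three metrics can be computed as in the sketch below. It assumes the standard textbook definitions; the function names and the tiny example arrays are made up for illustration, and the text does not specify which data each metric was computed on.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values: 1 - RSS/TSS."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

def rse(y, y_hat, n_predictors):
    """Residual standard error: sqrt(RSS / (n - p - 1))."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)
    return np.sqrt(rss / (len(y) - n_predictors - 1))

def rmse(y, y_hat):
    """Root mean squared error; squaring penalizes large errors more heavily."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Made-up observed and predicted values
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 6.0])

print(r_squared(y, y_hat), rse(y, y_hat, n_predictors=1), rmse(y, y_hat))
```

Note that a single large miss (6 instead of 4) dominates all three metrics, which is the behavior the paragraph above asks for.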

## Developing the model

Explain the steps taken to develop and improve the base model: informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection (stepwise regression, lasso, ridge regression).

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about the decisions you made, why you made them, and any evidence/experimentation to back them up.**

Based on the base model (the model with all variables and no transformation/interaction terms), we identified variables that seemed to be significant based on their p-values. `budget`, `votes`, and `score` all seemed significant. Some of the categories of `genre`, `rating`, and `decade` also seemed significant. `runtime` and `continent` did not seem significant. Therefore, we decided that the main variables to use in our analysis would be `budget`, `votes`, `score`, `genre`, `rating`, and `decade`.

Through the pairplot (below), we saw that `gross` had a quadratic relationship with the numeric predictors. Therefore, we knew that we would likely have to transform `gross` in our model. We also saw that `gross` was highly correlated with `budget`, `votes`, and `score`, and that `votes` and `score` were strongly correlated with each other.

```python
# Pairplot of variables
```

We also wanted to further examine the relationship between the categorical variables (`rating`, `genre`, and `decade`) and the response. A boxplot showed that some genres, such as family, animation, and action, tended to have more high-grossing movies than other genres. A line plot also showed that `decade` seemed to be correlated with `gross`, in that more recent movies tend to gross more.

```python
# Distribution of gross by genre
```
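
One way to draw the distribution of `gross` by genre is sketched below. The nine toy rows are invented for illustration, and ordering the boxes by median is a presentation choice assumed here, not something stated in the analysis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows for illustration only
toy = pd.DataFrame({
    "genre": ["Action"] * 3 + ["Comedy"] * 3 + ["Drama"] * 3,
    "gross": [300e6, 150e6, 90e6, 60e6, 40e6, 20e6, 30e6, 10e6, 5e6],
})

# Order genres by median gross so the highest-grossing genres appear first
order = (
    toy.groupby("genre")["gross"]
    .median()
    .sort_values(ascending=False)
    .index.tolist()
)

fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(data=toy, x="genre", y="gross", order=order, ax=ax)
ax.set_ylabel("Gross revenue (USD)")
```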

```python
# Visualization of the trend between decade and gross
```
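
The trend between `decade` and `gross` could be visualized as in this sketch. The toy values are invented, and plotting the mean gross per decade is an assumption; the original plot may have aggregated differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "decade": [1980, 1980, 1990, 1990, 2000, 2000, 2010, 2010],
    "gross": [10e6, 30e6, 20e6, 60e6, 40e6, 80e6, 50e6, 150e6],
})

# Average gross revenue within each decade
mean_gross = toy.groupby("decade")["gross"].mean()

fig, ax = plt.subplots()
ax.plot(mean_gross.index, mean_gross.values, marker="o")
ax.set_xlabel("Decade")
ax.set_ylabel("Mean gross revenue (USD)")
```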

We also wanted to make sure that our model satisfied the linearity and constant variance assumptions, and to check for potential problems such as multicollinearity and influential points. Through the residual plot of our base model (below), we saw that the model satisfied the linearity assumption, since the residuals were distributed roughly evenly on both sides of the line Residuals = 0 across all fitted values. However, the error terms showed non-constant variance (heteroscedasticity): the variance of the residuals appeared to increase with the fitted values. Therefore, we knew that we would have to transform the response `gross` in our model.

```python
# Residual plot for the base model
```
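
The sketch below shows how a residual plot can reveal non-constant variance, and how transforming the response can reduce it. It rests on several assumptions: the data are synthetic with multiplicative noise, a log transform is used as one common choice (the text above does not say which transformation was applied), and `spread_ratio` is a hypothetical helper added only to put a number on the visual impression.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with multiplicative noise: the spread of gross grows with its level
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({"budget": rng.uniform(1, 100, n)})
toy["gross"] = 2 * toy["budget"] * rng.lognormal(0, 0.5, n)
toy["log_budget"] = np.log(toy["budget"])
toy["log_gross"] = np.log(toy["gross"])

raw_fit = smf.ols("gross ~ budget", data=toy).fit()
log_fit = smf.ols("log_gross ~ log_budget", data=toy).fit()

def spread_ratio(fit):
    """Std of residuals for the upper half of fitted values over the lower half."""
    high = fit.fittedvalues > fit.fittedvalues.median()
    return fit.resid[high].std() / fit.resid[~high].std()

# Residuals vs fitted values, before and after transforming the response
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, fit, title in zip(axes, [raw_fit, log_fit], ["gross", "log(gross)"]):
    ax.scatter(fit.fittedvalues, fit.resid, s=8)
    ax.axhline(0, color="red")
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title(f"Response: {title}")

print(spread_ratio(raw_fit), spread_ratio(log_fit))
```

A ratio well above 1 corresponds to the funnel shape described above; a ratio near 1 suggests roughly constant variance.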

We ran a VIF test on the variables to see whether any multicollinearity was occurring in the model. Since the VIFs for all the numeric variables were less than 2, we concluded that multicollinearity would not affect our model and inference (see the VIF test in the appendix). We also checked for points that were both outliers and high-leverage points (influential points). We found 7 such influential points in the dataset and removed them.

We also wanted to confirm what variables would be important in our analysis using forward variable selection. In order to do so, we created categorical variable 
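
Forward variable selection of the kind mentioned above can be sketched as follows. This is a hypothetical implementation and not the project's code: the `forward_select` function, the use of adjusted $R^2$ as the selection criterion, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def forward_select(data, response, candidates):
    """Greedily add the predictor that most improves adjusted R^2; stop when none helps."""
    selected = []
    remaining = list(candidates)
    best_score = -np.inf
    while remaining:
        # Fit one model per remaining candidate, each added to the current selection
        scores = []
        for cand in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [cand])
            scores.append((smf.ols(formula, data=data).fit().rsquared_adj, cand))
        score, cand = max(scores)
        if score <= best_score:
            break  # no candidate improves the criterion
        selected.append(cand)
        remaining.remove(cand)
        best_score = score
    return selected, best_score

# Synthetic data: gross depends on budget and votes, but not on runtime
rng = np.random.default_rng(3)
n = 400
toy = pd.DataFrame({
    "budget": rng.normal(size=n),
    "votes": rng.normal(size=n),
    "runtime": rng.normal(size=n),
})
toy["gross"] = 3 * toy["budget"] + 1.5 * toy["votes"] + rng.normal(size=n)

selected, adj_r2 = forward_select(toy, "gross", ["budget", "votes", "runtime"])
print(selected, round(adj_r2, 3))
```

Categorical predictors can be passed as terms such as `C(genre)`, in which case all dummy levels of a variable enter or leave the model together.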

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of the population, and / or under certain conditions?

If it is prediction, will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model? Using your model, how long before the outcome occurs will the stakeholder be able to predict it? For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, how many days before that day will your model be able to make the prediction? This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results?

When will your model become too obsolete to be useful?

## Other sections *(optional)*

You are welcome to introduce additional sections or subsections, if required, to address any specific aspects of your project in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

## GitHub and individual contribution {-}

Put the **GitHub link** for the project repository.

Add details of each team member's contribution in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Elton John</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Xena Valenzuela</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Sankaranarayanan Balasubramanian</td>
    <td>Outlier and influential points treatment</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Chun-Li</td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>
</body>
</html>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub?
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].

[1] Authors. The frobnicatable foo filter, 2014. Face and Gesture submission ID 324. Supplied as additional material fg324.pdf.


## Appendix {-}

You may put additional material here as an Appendix and refer to it in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.