In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm

## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

### Data quality check and cleaning
*By Jackson Baker*

In [4]:
movies = pd.read_csv('movies.csv')

In [7]:
# Changing the Oscar 'Yes' / 'No' string columns to binary numeric (1 / 0)
for category in ['Picture', 'Director', 'Actor', 'Actress']:
    for outcome in ['won', 'nominated']:
        column = f'Oscar_Best_{category}_{outcome}'
        movies.loc[movies[column] == 'Yes', column] = 1
        movies.loc[movies[column] == 'No', column] = 0

In [8]:
# Converting columns that were stored as non-numeric to numeric dtypes
# (the month is stored in a new column named 'month')

movies['metascore'] = pd.to_numeric(movies['metascore'])
movies['Oscar_Best_Picture_won'] = pd.to_numeric(movies['Oscar_Best_Picture_won'])
movies['Oscar_Best_Director_nominated'] = pd.to_numeric(movies['Oscar_Best_Director_nominated'])
movies['Oscar_Best_Actor_nominated'] = pd.to_numeric(movies['Oscar_Best_Actor_nominated'])
movies['Oscar_Best_Actress_nominated'] = pd.to_numeric(movies['Oscar_Best_Actress_nominated'])
movies['Golden_Globes_nominated'] = pd.to_numeric(movies['Golden_Globes_nominated'])
movies['BAFTA_nominated'] = pd.to_numeric(movies['BAFTA_nominated'])
movies['month'] = pd.to_numeric(movies['release_date.month'])
movies['votes'] = pd.to_numeric(movies['votes'])
movies['gross'] = pd.to_numeric(movies['gross'])
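
By default, `pd.to_numeric` raises a `ValueError` when a column contains a string it cannot parse. The following is a minimal sketch, on made-up data rather than `movies.csv`, of how `errors='coerce'` could be used to turn such entries into `NaN` and count them. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the real values in movies.csv may look different
toy = pd.DataFrame({'metascore': ['81', '67', 'unknown', None],
                    'votes': ['1200', '35', '48', '7']})

numeric_columns = ['metascore', 'votes']
missing_before = toy[numeric_columns].isna().sum()

# errors='coerce' converts unparseable entries to NaN instead of raising
toy[numeric_columns] = toy[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Number of entries per column that could not be parsed as numbers
newly_missing = toy[numeric_columns].isna().sum() - missing_before
print(newly_missing)
```

Comparing the missing-value counts before and after the conversion shows how many entries were silently coerced, which may be worth checking before any rows are dropped.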

In [10]:
# Getting the columns that were most relevant and important to our analysis
movies_new = movies[['year', 'movie', 'certificate', 'genre', 'duration', 'rate', 'metascore', 'votes', 'gross', 
                   'release_date', 'user_reviews', 'critic_reviews', 'popularity', 'awards_wins', 'awards_nominations',
                   'Oscar_nominated', 'month']]

# Removing rows whose certificate is 'Not Rated' or 'TV-MA', as these did not make sense for our analysis
# (the mask is built from movies_new itself, so it always matches the rows being filtered)
new_movies_new = movies_new.loc[~movies_new['certificate'].isin(['Not Rated', 'TV-MA'])]

### Data cleaning
*By Xena Valenzuela*

From the data quality check we realized that:

1. Some of the columns that should have contained only numeric values, specifically <>, <>, and <> have special characters such as \*, #, %. We'll remove these characters, and convert the datatype of these columns to numeric.

2. Some of the columns have more than 60% missing values, and it is very difficult to impute their values, as the values seem to be missing at random with negligible association with other predictors. We'll remove such columns from the data.

3. The column `number_of_bedrooms` has some unreasonably high values such as 15. As our data consist of single-family homes in Evanston, we suspect that any value greater than 5 may be incorrect. We'll replace all values that are greater than 5 with an estimate obtained using the $K$-nearest neighbor approach.

4. The column `house_price` has some unreasonably high values. We'll tag all values greater than 1 billion dollars as "potentially incorrect observation", to see if they distort our prediction / inference later on.

The code below implements the above cleaning.
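
As a rough illustration of steps 1, 2, and 4 (and the first half of step 3), here is a minimal sketch on made-up data. It is not the project's cleaning code: the columns `lot_size`, `mostly_missing`, and `potentially_incorrect` are hypothetical names, and the $K$-nearest neighbor imputation itself is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only number_of_bedrooms and house_price are names
# taken from the text, the other columns are placeholders
toy = pd.DataFrame({
    'lot_size': ['5000*', '#7200', '6100%', '4800'],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'number_of_bedrooms': [3, 15, 4, 2],
    'house_price': [350_000, 2_000_000_000, 410_000, 295_000],
})

# Step 1: remove the special characters *, #, % and convert to numeric
toy['lot_size'] = pd.to_numeric(
    toy['lot_size'].str.replace(r'[*#%]', '', regex=True))

# Step 2: drop columns in which more than 60% of the values are missing
missing_share = toy.isna().mean()
toy = toy.drop(columns=missing_share[missing_share > 0.6].index)

# Step 3 (first half only): mark bedroom counts above 5 as missing so that
# an imputer can fill them in later; the imputation is not shown here
toy['number_of_bedrooms'] = toy['number_of_bedrooms'].where(
    toy['number_of_bedrooms'] <= 5)

# Step 4: flag, rather than delete, prices above 1 billion dollars
toy['potentially_incorrect'] = toy['house_price'] > 1_000_000_000

print(toy)
```

Flagging the suspicious prices in a separate column, instead of removing the rows, keeps the option of fitting the model with and without them later.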

### Data preparation
*By Sankaranarayanan Balasubramanian and Chun-Li*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we need to predict house price, we derived some new predictors *(from existing predictors)* that intuitively seem to be helpful for predicting house price.

2. We have shuffled the dataset to prepare it for K-fold cross validation.

3. We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.
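
The three preparation steps above can be sketched as follows. This is a minimal example on randomly generated data, not the project's code: the predictor names, the derived ratio, the random seeds, and the choice of `K = 5` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the predictor names are placeholders
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'living_area': rng.uniform(800, 3000, size=10),
    'lot_area': rng.uniform(2000, 9000, size=10),
    'house_price': rng.uniform(2e5, 9e5, size=10),
})

# Step 1: derive a new predictor from existing ones (an illustrative choice)
toy['living_to_lot_ratio'] = toy['living_area'] / toy['lot_area']

# Step 2: shuffle the rows with a fixed seed so the folds are reproducible
shuffled = toy.sample(frac=1, random_state=1).reset_index(drop=True)

# Step 3: standardize the predictors to mean 0 and sample standard
# deviation 1; the response is left on its original scale
predictors = shuffled.columns.drop('house_price')
standardized = shuffled.copy()
standardized[predictors] = (
    (shuffled[predictors] - shuffled[predictors].mean())
    / shuffled[predictors].std()
)

# Assign each shuffled row to one of K folds for cross validation
K = 5
fold = np.arange(len(standardized)) % K
```

Standardizing matters for Lasso / Ridge regression because the penalty acts on the size of the coefficients, which depends on the scale of each predictor.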

## Exploratory data analysis

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

In [5]:
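# Pairplot of the numeric columns in movies_new (by Jackson)
# The data preparation cells above must be run first so that movies_new is defined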
sns.pairplot(movies_new)

NameError: name 'movies_new' is not defined

## Developing the model

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

### Code fitting the final model

Put the code(s) that fit the final model(s) in separate cell(s), i.e., the code with the `.ols()` or `.logit()` functions.

## Conclusions and Recommendations to stakeholder(s)

You may or may not have code to put in this section. Delete this section if it is irrelevant.