# Frame the problem
Normally, people can estimate the value of their house by contacting a housing agency or by looking at house prices in their area. In this notebook we use historical data to predict housing prices in California. The historical data comes from a 1997 dataset of California housing prices.


# Get the data
The dataset is provided as part of an exercise in the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" (2019). It is downloaded from GitHub as a .tgz archive that contains the data as a .csv file.

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')
                   
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     3000 non-null   int64  
 1   belongs_to_collection  604 non-null    object 
 2   budget                 3000 non-null   int64  
 3   genres                 2993 non-null   object 
 4   homepage               946 non-null    object 
 5   imdb_id                3000 non-null   object 
 6   original_language      3000 non-null   object 
 7   original_title         3000 non-null   object 
 8   overview               2992 non-null   object 
 9   popularity             3000 non-null   float64
 10  poster_path            2999 non-null   object 
 11  production_companies   2844 non-null   object 
 12  production_countries   2945 non-null   object 
 13  release_date           3000 non-null   object 
 14  runtime                2998 non-null   float64
 15  spok

# Explore the data to gain insight
Let's start by getting an overview of the dataset.
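
A first overview usually covers summary statistics, missing values, and how the numeric columns relate to the target. The sketch below walks through these steps on a tiny made-up frame. The column names follow the `train.info()` output above; the values and the name `toy` are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "budget": [1_000_000, 0, 5_000_000, 20_000_000],
    "popularity": [3.2, 1.1, 8.5, 15.0],
    "runtime": [95.0, np.nan, 120.0, 140.0],
    "revenue": [2_500_000, 100_000, 12_000_000, 60_000_000],
})
numeric_cols = ["budget", "popularity", "runtime", "revenue"]

# Summary statistics for the numeric columns.
summary = toy[numeric_cols].describe()

# Number of missing values per column.
missing = toy.isna().sum()

# Zeros in `budget` may be placeholders for unknown values rather than real budgets.
zero_budgets = int((toy["budget"] == 0).sum())

# Correlation of each numeric column with the target.
corr_with_revenue = toy[numeric_cols].corr()["revenue"].sort_values(ascending=False)

print(summary)
print(missing)
print("zero budgets:", zero_budgets)
print(corr_with_revenue)
```

On the real data the same calls can be run on `train` directly, restricting `corr()` to the numeric columns as done here.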

# Prepare the data


In [2]:
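# Keep references to the original frames.
x_test_og = test
x_train_og = train

# Separate the features from the target column ("revenue").
x_train = x_train_og.drop("revenue", axis=1)
x_train_labels = x_train_og["revenue"].copy()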

In [1]:
import sys

import numpy as np  # needed for np.nan below; without it this cell raises a NameError
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline

# Install the local package in editable mode and make it importable.
sys.path.append('../')
!{sys.executable} -m pip install -e ../localpackages
from localpackages.custom_transformers.custom_transformers import SelectColumnsTransformer

columns = ["id", "budget", "popularity", "runtime"]

data_cols_pipeline = make_pipeline(
    SelectColumnsTransformer(columns),
    SimpleImputer(missing_values=np.nan, strategy="median"),  # fill NaN with the column median
    SimpleImputer(missing_values=0, strategy="median")  # then treat zeros as missing too
)

#preprocessing_features = DataFrameFeatureUnion([data_cols_pipeline, nullvalues_pipline])

Obtaining file:///C:/Users/kenne/DAT158ML/ML_2/boxofficeapp/localpackages
Installing collected packages: localpackages
  Attempting uninstall: localpackages
    Found existing installation: localpackages 0.1
    Uninstalling localpackages-0.1:
      Successfully uninstalled localpackages-0.1
  Running setup.py develop for localpackages
Successfully installed localpackages

In [None]:
#x_train=preprocessing_features.fit_transform(x_train)
x_train_prepared=data_cols_pipeline.fit_transform(x_train)
x_train.head()

In [None]:
print(x_train_prepared)
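
The pipeline relies on `SelectColumnsTransformer` from the local package, whose source is not shown in this notebook. The sketch below is a guess at what such a transformer might look like, not the actual implementation: the class name `SelectColumnsSketch` and the toy data are hypothetical. It also shows the effect of chaining the two imputers.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class SelectColumnsSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for SelectColumnsTransformer: keeps the named columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; column selection is stateless.
        return self

    def transform(self, X):
        # Expects a pandas DataFrame and returns only the selected columns.
        return X[self.columns]


# Made-up rows: one zero budget and one missing runtime.
toy = pd.DataFrame({
    "budget": [100, 0, 300],
    "runtime": [90.0, np.nan, 110.0],
    "title": ["a", "b", "c"],
})

sketch_pipeline = make_pipeline(
    SelectColumnsSketch(["budget", "runtime"]),
    SimpleImputer(missing_values=np.nan, strategy="median"),  # fill NaN first
    SimpleImputer(missing_values=0, strategy="median"),       # then treat 0 as missing
)

prepared = sketch_pipeline.fit_transform(toy)
print(prepared)
```

The NaN imputer runs first so that the second step receives fully numeric, finite input. Note that treating 0 as missing applies to every selected column, which is worth keeping in mind for columns where 0 is a legitimate value.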

# Explore different models and short-list the best ones
We start with a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(x_train_prepared, x_train_labels)

In [None]:
from sklearn.metrics import mean_squared_error
x_train_predictions = lr_model.predict(x_train_prepared)
lr_mse = mean_squared_error(x_train_labels, x_train_predictions)
lr_rmse = np.sqrt(lr_mse)
lr_rmse

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()
rf_model.fit(x_train_prepared, x_train_labels)

predictions = rf_model.predict(x_train_prepared)
forest_mse = mean_squared_error(x_train_labels, predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
from sklearn.model_selection import cross_val_score

lin_scores = cross_val_score(lr_model, x_train_prepared, x_train_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
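
Only the linear model is cross-validated above, while the random forest is scored on the same rows it was trained on. To short-list models on equal terms, each candidate can be cross-validated the same way. The following is a minimal sketch on synthetic data from `make_regression`; the names `X_demo`, `y_demo`, `candidates`, and `cv_rmse` are illustrative and stand in for `x_train_prepared` and `x_train_labels`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (x_train_prepared, x_train_labels).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    # The scorer returns negated MSE values, so flip the sign before the square root.
    cv_rmse[name] = np.sqrt(-scores)
    print(name, "mean RMSE:", cv_rmse[name].mean(), "std:", cv_rmse[name].std())
```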

# Fine tuning

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [1, 2, 3]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [1, 2, 3]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(x_train_prepared, x_train_labels)

In [None]:
grid_search.best_params_
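
Besides the best parameters, it can help to see how every combination scored. A fitted `GridSearchCV` exposes this through `cv_results_`. The sketch below uses synthetic data and a smaller grid than the one above, so `demo_grid`, `demo_search`, and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=10.0, random_state=0)

demo_grid = {"n_estimators": [3, 10], "max_features": [1, 2]}
demo_search = GridSearchCV(RandomForestRegressor(random_state=0), demo_grid,
                           cv=3, scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

# Convert the negated mean MSE of each combination into an RMSE.
results = demo_search.cv_results_
rmse_by_params = [
    (float(np.sqrt(-mean_score)), params)
    for mean_score, params in zip(results["mean_test_score"], results["params"])
]

# List the combinations from best (lowest RMSE) to worst.
for rmse, params in sorted(rmse_by_params, key=lambda pair: pair[0]):
    print(round(rmse, 2), params)
```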

In [None]:
final_model = grid_search.best_estimator_

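# NOTE: despite the names, X_test and y_test are built from the training data
# (x_train_og), so the RMSE below is a training error, not a held-out test error.
X_test = x_train_og.drop("revenue", axis=1)
y_test = x_train_og["revenue"].copy()
X_test_prepared = data_cols_pipeline.transform(X_test)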

final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
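
Because `X_test` in the cell above is derived from `x_train_og`, the reported RMSE is measured on rows the model has already seen. One common alternative is to set aside part of the labeled data before any fitting and evaluate on that part only. The sketch below shows the idea on synthetic data; all names in it (`X_all`, `X_holdout`, `holdout_model`, and so on) are illustrative, and the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled data.
X_all, y_all = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=1)

# Hold out 20% of the rows; they are never used for fitting.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1
)

holdout_model = RandomForestRegressor(n_estimators=30, random_state=1)
holdout_model.fit(X_fit, y_fit)

# Compare the error on seen rows with the error on unseen rows.
train_rmse = np.sqrt(mean_squared_error(y_fit, holdout_model.predict(X_fit)))
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, holdout_model.predict(X_holdout)))
print("train RMSE:", train_rmse)
print("hold-out RMSE:", holdout_rmse)
```

With a real pipeline, the preprocessing steps would be fitted on the `X_fit` rows only and then applied to the hold-out rows with `transform`.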

In [None]:
from joblib import dump
dump(final_model, '../models/test_boxoffice_model.joblib', compress=6)
dump(data_cols_pipeline, '../models/test_transform_predict.joblib')
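
The saved files are only useful if the application can load them again and reproduce the same predictions. The sketch below is a small round trip under stated assumptions: it uses a temporary directory, a toy imputer pipeline, and a linear model instead of the real artifacts, and every name in it is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data and toy artifacts standing in for the real pipeline and model.
X_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_demo = np.array([11.0, 22.0, 33.0, 44.0])

demo_transform = make_pipeline(SimpleImputer(strategy="median"))
demo_model = LinearRegression().fit(demo_transform.fit_transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    transform_path = Path(tmp_dir) / "demo_transform.joblib"

    # Save both artifacts, then load them back as the application would.
    dump(demo_model, model_path, compress=6)
    dump(demo_transform, transform_path)
    loaded_model = load(model_path)
    loaded_transform = load(transform_path)

# New rows must pass through the same fitted preprocessing before prediction.
new_rows = np.array([[2.5, np.nan]])
reloaded_prediction = loaded_model.predict(loaded_transform.transform(new_rows))
original_prediction = demo_model.predict(demo_transform.transform(new_rows))
print(reloaded_prediction)
```

Saving the fitted preprocessing pipeline next to the model, as the cell above does, keeps the two in sync at prediction time.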

# Present your solution

# Launch, monitor and maintain your system
When launching our model, we need to replace the input data with live production data from the housing market.
While transferring to production data, we should also have unit tests in place to make sure everything behaves as expected.

We need to monitor the model's performance and trigger alerts when it drops. We also need to monitor the input data: if it is not as expected, the model's predictions will suffer. As the data evolves, the model can become stale, so we need to watch for signs that it needs updating.
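
Input monitoring can start with simple structural checks on each incoming batch. The sketch below is one possible shape for such a check, not a prescribed design: the function `check_input`, the threshold `MAX_MISSING_FRACTION`, and the rule that values must be non-negative are all assumptions. The expected columns are the ones selected by the preprocessing pipeline.

```python
import pandas as pd

# Columns the preprocessing pipeline selects; the threshold is a hypothetical choice.
EXPECTED_COLUMNS = ["id", "budget", "popularity", "runtime"]
MAX_MISSING_FRACTION = 0.2


def check_input(batch: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in an input batch."""
    problems = []

    missing_cols = [c for c in EXPECTED_COLUMNS if c not in batch.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")

    for col in EXPECTED_COLUMNS:
        if col not in batch.columns:
            continue
        if not pd.api.types.is_numeric_dtype(batch[col]):
            problems.append(f"{col} is not numeric")
            continue
        missing_fraction = batch[col].isna().mean()
        if missing_fraction > MAX_MISSING_FRACTION:
            problems.append(f"{col} has {missing_fraction:.0%} missing values")
        if (batch[col] < 0).any():
            problems.append(f"{col} has negative values")

    return problems


good_batch = pd.DataFrame({
    "id": [1, 2], "budget": [100, 200], "popularity": [1.5, 2.5], "runtime": [90.0, 100.0],
})
bad_batch = pd.DataFrame({
    "id": [1, 2], "budget": [-5, 200], "popularity": [1.5, 2.5],
})

print(check_input(good_batch))
print(check_input(bad_batch))
```

A non-empty result could be logged or used to raise an alert before the batch reaches the model.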

Even though we monitor the input data and the model's performance, we should still retrain the model regularly on fresh input data. This can run as an automatically scheduled job at a fixed interval.