# Frame the problem
The revenue a film will earn at the box office is hard to judge in advance. In this notebook we use historical data to predict it: the training set holds 3,000 films with features such as budget, popularity, and runtime, along with each film's revenue.


# Get the data
The notebook follows the project checklist from the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" (2019). The data is read from three .csv files in the `data` directory: `train.csv`, `test.csv`, and `sample_submission.csv`.

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')
                   
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     3000 non-null   int64  
 1   belongs_to_collection  604 non-null    object 
 2   budget                 3000 non-null   int64  
 3   genres                 2993 non-null   object 
 4   homepage               946 non-null    object 
 5   imdb_id                3000 non-null   object 
 6   original_language      3000 non-null   object 
 7   original_title         3000 non-null   object 
 8   overview               2992 non-null   object 
 9   popularity             3000 non-null   float64
 10  poster_path            2999 non-null   object 
 11  production_companies   2844 non-null   object 
 12  production_countries   2945 non-null   object 
 13  release_date           3000 non-null   object 
 14  runtime                2998 non-null   float64
 15  spok

# Explore the data to gain insight
Let's start by getting an overview of the dataset.
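
A first overview can come from `describe()` together with a count of missing and zero values per column. The sketch below is illustrative only: it uses the five rows printed by `x_train.head()` further down in this notebook rather than the full `train.csv`, and the name `sample` is made up for the example.

```python
import pandas as pd

# Five rows copied from the x_train.head() output in this notebook
sample = pd.DataFrame({
    "budget": [14000000, 40000000, 3300000, 1200000, 0],
    "popularity": [6.575393, 8.248895, 64.29999, 3.174936, 1.14807],
    "runtime": [93.0, 113.0, 105.0, 122.0, 118.0],
})

# Summary statistics (count, mean, quartiles, ...) for the numeric columns
print(sample.describe())

# Missing values and zeros per column; a budget of 0 most likely means "unknown"
missing_per_column = sample.isna().sum()
zeros_per_column = (sample == 0).sum()
print(missing_per_column)
print(zeros_per_column)
```

On the real data the same calls would be applied to `train` directly.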

# Prepare the data


In [2]:
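# Keep references to the original frames (these are not copies)
x_test_og = test
x_train_og = train

# Separate the features from the target column "revenue"
x_train = x_train_og.drop("revenue", axis=1)
x_train_labels = x_train_og["revenue"].copy()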

In [3]:
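import sys

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Install the local helper package in editable mode so the custom transformer can be imported
sys.path.append('../')
!{sys.executable} -m pip install -e ../localpackages
from localpackages.custom_transformers.custom_transformers import SelectColumnsTransformer

# Numeric columns used as features. Note that "id" is a row identifier, not a property of the film.
columns = ["id", "budget", "popularity", "runtime"]

data_cols_pipeline = make_pipeline(
    SelectColumnsTransformer(columns),
    # Fill NaN values with the column median
    SimpleImputer(missing_values=np.nan, strategy="median"),
    # Treat 0 as "unknown" (e.g. a budget of 0) and fill it with the column median as well
    SimpleImputer(missing_values=0, strategy="median")
)

#preprocessing_features = DataFrameFeatureUnion([data_cols_pipeline, nullvalues_pipline])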

Obtaining file:///C:/Users/kenne/DAT158ML/ML_2/boxofficeapp/localpackages
Installing collected packages: localpackages
  Attempting uninstall: localpackages
    Found existing installation: localpackages 0.1
    Uninstalling localpackages-0.1:
      Successfully uninstalled localpackages-0.1
  Running setup.py develop for localpackages
Successfully installed localpackages


In [4]:
#x_train=preprocessing_features.fit_transform(x_train)
x_train_prepared=data_cols_pipeline.fit_transform(x_train)
x_train.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,production_countries,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de..."
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de..."
2,3,,3300000,"[{'id': 18, 'name': 'Drama'}]",http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",10/10/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The road to greatness can take you to the edge.,Whiplash,"[{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...","[{'cast_id': 5, 'character': 'Andrew Neimann',...","[{'credit_id': '54d5356ec3a3683ba0000039', 'de..."
3,4,,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,...,"[{'iso_3166_1': 'IN', 'name': 'India'}]",3/9/12,122.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Kahaani,"[{'id': 10092, 'name': 'mystery'}, {'id': 1054...","[{'cast_id': 1, 'character': 'Vidya Bagchi', '...","[{'credit_id': '52fe48779251416c9108d6eb', 'de..."
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,...,"[{'iso_3166_1': 'KR', 'name': 'South Korea'}]",2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de..."


In [5]:
print(x_train_prepared)

[[1.0000000e+00 1.4000000e+07 6.5753930e+00 9.3000000e+01]
 [2.0000000e+00 4.0000000e+07 8.2488950e+00 1.1300000e+02]
 [3.0000000e+00 3.3000000e+06 6.4299990e+01 1.0500000e+02]
 ...
 [2.9980000e+03 6.5000000e+07 1.4482345e+01 1.2000000e+02]
 [2.9990000e+03 4.2000000e+07 1.5725542e+01 9.0000000e+01]
 [3.0000000e+03 3.5000000e+07 1.0512109e+01 1.0600000e+02]]


# Explore different models and short-list the best ones
We start with a linear regression model.

In [6]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(x_train_prepared, x_train_labels)

LinearRegression()

In [7]:
from sklearn.metrics import mean_squared_error
x_train_predictions = lr_model.predict(x_train_prepared)
lr_mse = mean_squared_error(x_train_labels, x_train_predictions)
lr_rmse = np.sqrt(lr_mse)
lr_rmse

85593019.11023888

In [8]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()
rf_model.fit(x_train_prepared, x_train_labels)

predictions = rf_model.predict(x_train_prepared)
forest_mse = mean_squared_error(x_train_labels, predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

32317472.289727237

In [9]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [10]:
from sklearn.model_selection import cross_val_score

lin_scores = cross_val_score(lr_model, x_train_prepared, x_train_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [87023153.76917268 91836290.57052778 94762944.79910764 83947501.87149791
 79069342.69100092 99981530.13325717 69268789.7863695  83996277.12427057
 88549997.20720717 80317960.33494434]
Mean: 85875378.82873558
Standard deviation: 8259596.441139532
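
The random forest's training-set RMSE (about 32 million) is far below that of linear regression (about 86 million), but a forest can fit its own training rows very closely, so the two models are only comparable under cross-validation. The following is a minimal sketch of running the same cross-validation for both models. It uses synthetic data from `make_regression` as a stand-in for `x_train_prepared` and `x_train_labels`; the helper `cv_rmse` and the `*_demo` names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train_prepared / x_train_labels (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)


def cv_rmse(model, X, y, cv=5):
    """Return the per-fold RMSE from k-fold cross-validation."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-scores)


lin_cv = cv_rmse(LinearRegression(), X_demo, y_demo)
forest_cv = cv_rmse(RandomForestRegressor(n_estimators=30, random_state=0),
                    X_demo, y_demo)

print("Linear regression mean RMSE:", lin_cv.mean())
print("Random forest mean RMSE:", forest_cv.mean())
```

Fixing `random_state` makes the forest's scores reproducible between runs.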


# Fine tuning

In [11]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [1, 2, 3]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [1, 2, 3]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(x_train_prepared, x_train_labels)

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [1, 2, 3],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [1, 2, 3],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')

In [12]:
grid_search.best_params_

{'max_features': 2, 'n_estimators': 10}

In [13]:
final_model = grid_search.best_estimator_

# NOTE: despite the names, X_test and y_test are taken from the training frame,
# so this RMSE is measured on data the model has already seen and is optimistic.
X_test = x_train_og.drop("revenue", axis=1)
y_test = x_train_og["revenue"].copy()
X_test_prepared = data_cols_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

37255723.74151581
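
The RMSE above is computed on the same rows the model was trained on. For an estimate of how the model behaves on unseen films, one option is to set aside part of the labelled data before any fitting. The sketch below shows the idea on synthetic data; the variable names and the 80/20 split are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and revenue labels (assumption)
X_all, y_all = make_regression(n_samples=300, n_features=4,
                               noise=10.0, random_state=0)

# Hold out 20% of the labelled rows; they are never used for fitting
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)

# Same hyperparameters as the grid search's best_params_ in this notebook
model = RandomForestRegressor(n_estimators=10, max_features=2, random_state=42)
model.fit(X_fit, y_fit)

fit_rmse = np.sqrt(mean_squared_error(y_fit, model.predict(X_fit)))
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, model.predict(X_holdout)))
print("RMSE on rows used for fitting:", fit_rmse)
print("RMSE on held-out rows:", holdout_rmse)
```

In the notebook itself, the split would have to happen before `data_cols_pipeline.fit_transform` so that the imputation medians are also learned without the held-out rows.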

In [14]:
from joblib import dump
dump(final_model, '../models/test_boxoffice_model.joblib', compress=6)
dump(data_cols_pipeline, '../models/test_transform_predict.joblib')

['../models/test_transform_predict.joblib']
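
Both saved artifacts are needed at prediction time: the pipeline to prepare new rows and the model to score them. As a rough sketch of the round trip, the example below saves and reloads a small stand-in pipeline and model in a temporary directory. The file names, the toy data, and the `*_demo` names are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with one missing value (assumption, not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 100.0, size=(50, 3))
X_demo[0, 1] = np.nan
y_demo = X_demo[:, 0] * 2.0

prep = make_pipeline(SimpleImputer(strategy="median"))
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(prep.fit_transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.joblib")
    prep_path = os.path.join(tmp, "demo_transform.joblib")
    dump(model, model_path, compress=6)
    dump(prep, prep_path)

    # In a serving process only these two load calls would be needed
    loaded_model = load(model_path)
    loaded_prep = load(prep_path)

# New rows go through the loaded pipeline first, then the loaded model
new_rows = np.array([[10.0, np.nan, 30.0]])
restored_prediction = loaded_model.predict(loaded_prep.transform(new_rows))
original_prediction = model.predict(prep.transform(new_rows))
print(restored_prediction)
```

Loading a joblib file generally requires the same library versions, and any custom classes such as `SelectColumnsTransformer` must be importable in the loading process.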

# Present your solution

# Launch, monitor and maintain your system
When launching our model, we need to replace our input data with live production data.
When moving to production data, we should also have unit tests in place to make sure everything behaves as expected.
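
A unit test for the preprocessing step could look roughly like the sketch below. It rebuilds only the two imputation steps of `data_cols_pipeline`, because the custom `SelectColumnsTransformer` is not available outside the project; the function names and the toy input are assumptions for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


def build_imputer_pipeline():
    """Same two imputation steps as data_cols_pipeline, without the column selector."""
    return make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy="median"),
        SimpleImputer(missing_values=0, strategy="median"),
    )


def test_pipeline_fills_missing_and_zero_values():
    # Toy input with one NaN and two zeros that should all be imputed
    X = np.array([
        [10.0, 90.0],
        [np.nan, 100.0],
        [0.0, 110.0],
        [30.0, 0.0],
    ])
    out = build_imputer_pipeline().fit_transform(X)
    assert out.shape == X.shape          # no rows or columns are dropped
    assert not np.isnan(out).any()       # NaN values are filled
    assert (out != 0).all()              # zeros are treated as missing and filled


# Run directly here; with pytest the function would be collected automatically
test_pipeline_fills_missing_and_zero_values()
```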

We need to monitor the performance of our model and trigger an alert if it drops. We also need to monitor the input data: if it isn't what the model expects, the predictions will suffer. Furthermore, the model can become stale as the input data evolves, so we need to watch for signs that it should be updated.
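
Input monitoring can start with simple checks on each incoming batch: are the expected columns present, how many values are missing, and do the values fall inside the range seen during training? The following is a minimal sketch under those assumptions. The function `check_input_batch`, the 20% threshold, and the example batches are hypothetical; the reference rows are the five shown by `x_train.head()` in this notebook.

```python
import pandas as pd

# Feature columns the model expects; "id" is left out because new ids are always unseen
FEATURE_COLUMNS = ["budget", "popularity", "runtime"]


def check_input_batch(batch, reference, max_missing_share=0.2):
    """Return a list of human-readable problems found in an incoming batch."""
    problems = []
    for col in FEATURE_COLUMNS:
        if col not in batch.columns:
            problems.append(f"missing column: {col}")
            continue
        missing_share = batch[col].isna().mean()
        if missing_share > max_missing_share:
            problems.append(f"{col}: {missing_share:.0%} missing values")
        # Flag values outside the range seen in the reference (training) data
        low, high = reference[col].min(), reference[col].max()
        out_of_range = int(((batch[col] < low) | (batch[col] > high)).sum())
        if out_of_range > 0:
            problems.append(f"{col}: {out_of_range} value(s) outside [{low}, {high}]")
    return problems


reference = pd.DataFrame({
    "budget": [14000000, 40000000, 3300000, 1200000, 0],
    "popularity": [6.575393, 8.248895, 64.29999, 3.174936, 1.14807],
    "runtime": [93.0, 113.0, 105.0, 122.0, 118.0],
})
good_batch = pd.DataFrame({"budget": [5000000], "popularity": [10.0], "runtime": [100.0]})
bad_batch = pd.DataFrame({"budget": [5000000], "popularity": [10.0]})

print(check_input_batch(good_batch, reference))  # []
print(check_input_batch(bad_batch, reference))   # ['missing column: runtime']
```

A non-empty result could then raise the event or trigger described above.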

Even though we monitor the input data and the model's performance, we should still retrain our model on fresh data on a regular basis. This will be an automatically scheduled job that runs at a fixed interval.